BACKGROUND OF THE INVENTION
[0001] This invention relates to the reception and processing of data packets at a data
processing system having a plurality of processor cores.
[0002] In conventional networked personal computers and servers having more than one CPU
core, the processing of received data packets is usually performed on just one of
the CPU cores. When incoming packets are received at the network card of the computer
or server (generically, a data processor), they are delivered into host memory via
a delivery channel. A delivery channel has an associated notification mechanism, which
serves to inform the host software of the arrival of packets. Typically the notification
mechanism includes an interrupt, which is raised when received packets are available
for processing. In response to the interrupt an interrupt service routine is invoked
on one of the CPU cores, and causes that CPU core to perform the work associated with
processing the incoming packets. This work can be significant, and at high data rates
this CPU core typically becomes a bottleneck in the system.
[0003] The above problem is especially severe with high data rate network architectures
such as 10 and 100 Gb Ethernet. Current single processor designs struggle to cope
at peak data throughput, and any sources of inefficiency in the handling of
incoming data packets must be minimised if the promise of such high data transfer
speeds is to be realised.
[0004] Figure 1 shows a typical structure of a data processing system having a monolithic
operating system architecture. Kernel 103 manages hardware such as a network interface
device (NIC) 101 by means of driver 107 and controls the resources of the system.
The kernel receives hardware interrupts 115 at interrupt handler 113 and, in response
to being notified that there is available incoming packet data, performs receive processing
of the data. The processed traffic data is delivered to the appropriate socket 111
and application 109, which executes in user space 105.
[0005] With the aim of mitigating some of the problems described above, Microsoft has developed
the Receive Side Scaling (RSS) architecture which improves performance by spreading
the processing load across multiple CPU cores. Each received packet is delivered to
the host via one of a number of delivery channels, each associated with a notification
channel. Each notification channel is associated, typically via an interrupt, with
a different CPU core, so that the packets delivered to different delivery channels
are processed on different CPU cores. It is arranged that all packets of a given data
flow are delivered to a single delivery channel, and so are processed at the same
CPU. This is necessary to ensure that packets of a given data flow are processed in
the order that they arrive.
[0006] RSS aims to provide a solution to the above problems for the monolithic Microsoft
Windows operating system, and is also used on other monolithic operating systems.
However, there are other multi-CPU system architectures, such as virtualised systems
supporting multiple operating systems or systems supporting untrusted packet processing
entities, in which RSS is not helpful because received packets are forwarded to other
software domains that may not run on the same CPU core.
[0007] The architecture of a typical virtualised system is illustrated in figure 2A. Virtualised
operating system instances 205 are generally untrusted and control of the hardware
and resource allocation falls to a hypervisor or trusted domain 203. The hypervisor
manages the hardware and the virtualised operating system instances. Each virtualised
OS instance can access the network via a software-emulated network interface 207,
which is typically implemented as a communication channel between the virtualised
OS and the hypervisor. Packets received by the real network interface controller (NIC)
201 are delivered to the hypervisor, which routes them to the appropriate virtualised
domain via the software-emulated network interface. A problem with this approach is
that it incurs significant additional processing overheads when compared with the
native OS receive path, and the forwarding of packets through the hypervisor can become
a bottleneck in the system.
[0008] Some smart NICs are able to support multiple protected interfaces for sending and
receiving packets, known as virtualised network interface controllers (VNICs). Each
virtualised OS domain may be given direct access to a VNIC via a memory mapping onto
the NIC hardware or via a shared memory segment. The virtualised OS uses a VNIC to
receive packets directly from the NIC, bypassing the hypervisor and associated forwarding
overheads. Each VNIC includes a delivery channel for delivering packets and a means
of notifying the virtualised OS. Such smart NICs typically have a filter table or
forwarding table that maps received packets to the appropriate VNIC and virtualised
OS. Received packets that do not map to any VNIC may be delivered via a default delivery
channel to the host domain.
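By way of illustration only, the filter table lookup described above can be sketched as follows. The choice of destination address and port as the filter key, and the names FilterTable, add_filter, lookup and DEFAULT_CHANNEL, are assumptions made for this sketch; a smart NIC would typically perform the match in hardware.

```python
from typing import Dict, Tuple

# Hypothetical filter key: (destination IP address, destination port).
FilterKey = Tuple[str, int]

# Assumed name for the default delivery channel of the host domain.
DEFAULT_CHANNEL = "host-default"


class FilterTable:
    """Toy model of a NIC filter table that maps received packets to VNICs."""

    def __init__(self) -> None:
        self._filters: Dict[FilterKey, str] = {}

    def add_filter(self, dst_ip: str, dst_port: int, vnic: str) -> None:
        self._filters[(dst_ip, dst_port)] = vnic

    def lookup(self, dst_ip: str, dst_port: int) -> str:
        # Packets that match no filter fall through to the default channel.
        return self._filters.get((dst_ip, dst_port), DEFAULT_CHANNEL)


table = FilterTable()
table.add_filter("10.0.0.5", 80, "vnic-guest-a")
table.add_filter("10.0.0.6", 443, "vnic-guest-b")
```

In this sketch a packet for 10.0.0.5 port 80 maps to "vnic-guest-a", while a packet for an address with no installed filter is delivered to the default channel.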
[0009] In some configurations the notification means in a VNIC includes an interrupt, which
is able to invoke the virtualised OS directly when packets arrive. Alternatively the
virtualised OS may be invoked via a virtual interrupt. In the latter case, instead
of raising an interrupt, a virtual interrupt notification is sent to the hypervisor
via a default notification channel. The hypervisor receives this virtual interrupt
notification and in response invokes the virtualised OS via a virtual interrupt.
[0010] The accelerated virtualised network I/O described above improves
performance considerably. However, processing of all packets received by a guest domain
is performed on just one CPU core, which may therefore become a bottleneck in the
system.
[0011] Conventional methods for distributing packet processing over multiple processors,
such as RSS, suffer from two main problems:
(i) Locks
[0012] State information relating to a particular data flow may be accessed by code executing
on multiple processors and must therefore be protected from concurrent access. Typically
this is achieved through the use of state locks. When locks are not contended they
incur a relatively small (but still significant) overhead. However, when locks are
contended, the loss in efficiency is very high. This can occur when a receive path
executes on more than one processor core and each core is required to access the same
state information of that receive path. In particular, while a kernel thread running
on a processor is blocked waiting for a lock, that processor will probably not be
able to perform any useful work. Processors in conventional multi-processor networked
systems can spend a significant time waiting for locks.
(ii) Cache effects
[0013] As the network stack executes on a processor, any state in host memory that it touches
(reads or writes) will be copied into the cache(s) close to that processor core. When
state is written, it is purged from the caches of any other processor cores. Thus,
in the case when a network stack executes concurrently on multiple cores, if more
than one processor writes cache lines in the state of the stack the cache lines will
bounce between the cores. This is highly inefficient since each cache write operation
to the network stack state by a particular processor causes the other processors handling
that stack to purge, and later reload, those cache lines.
[0014] Where locks are used to protect shared state, the memory that implements those locks
is itself shared state, and is also subject to cache-line bouncing.
[0015] Lock-free techniques for managing concurrent access to shared state may not suffer
from the blocking behaviour of locks, but do suffer from cache-bouncing.
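The cache effect described above can be illustrated with a deliberately simplified model. The sketch below is an assumption-laden toy, not a model of any real cache coherency protocol: it tracks only which core last wrote a cache line and counts each change of writer as one "bounce". The names count_bounces, spread and pinned are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def count_bounces(writes: Iterable[Tuple[int, str]]) -> int:
    """Count ownership transfers of cache lines in a toy model.

    Each write is a (core, line) pair. A write by a core other than the
    line's current owner counts as one bounce: the line is purged from the
    previous owner's cache and must be loaded by the writing core.
    """
    owner: Dict[str, int] = {}
    bounces = 0
    for core, line in writes:
        previous = owner.get(line)
        if previous is not None and previous != core:
            bounces += 1
        owner[line] = core
    return bounces


PACKETS = 8  # packets of one data flow, each updating the same flow state

# Packets of the flow spread alternately over two cores.
spread = [(i % 2, "flow-state") for i in range(PACKETS)]
# Packets of the flow always processed on core 0.
pinned = [(0, "flow-state") for _ in range(PACKETS)]

spread_bounces = count_bounces(spread)
pinned_bounces = count_bounces(pinned)
```

With eight packets, the alternating case yields seven bounces and the pinned case yields none, which is the behaviour that motivates keeping each data flow on a single processor.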
[0016] There is therefore a need for an improved method of distributing the processing load
associated with handling network packets in data processing systems having multiple
CPU cores.
SUMMARY OF THE INVENTION
[0018] According to a first aspect of the present invention there is provided a method according
to claim 1.
[0019] Suitably, said selected delivery channel has one or more associated buffers into
which incoming packets are written, wherein said associated buffers are either part
of said network interface device (301; 401) or part of a system memory of said data
processing system that is accessible to said network interface device. Preferably
each delivery channel is arranged such that receive processing of all packet data
accepted into a delivery channel is performed at the same processing core. Suitably
the network interface performs at least some protocol processing of the packet data.
[0020] Preferably the mapping step includes delivering a subset of the received packet data
into the said delivery channel. The subset may comprise payload data. The payload
data may be TCP payload data.
[0021] The processing core may be a processing unit of the network interface device.
[0022] Preferably the selecting step is performed at the network interface device.
[0023] Suitably the data flow corresponds to a network socket of the particular software
domain. Suitably the one or more characteristics of the packet data comprise one or
more fields of the packet header.
[0024] Preferably the selecting step comprises: matching a first subset of the one or more
characteristics of the packet data to a set of stored characteristics so as to identify
the particular software domain; and choosing, in dependence on a second subset of
the one or more characteristics of the packet data, the delivery channel within the
particular software domain.
Suitably the choosing step is performed by means of a hash function. Suitably the hash
function is a Toeplitz function and the choosing step further includes using an indirection
table.
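The two-step selection can be sketched as follows, purely as an illustration. It assumes that the first subset of characteristics is the destination IP address, that the second subset is the TCP/IPv4 4-tuple, and that each software domain has its own indirection table; the filter entries, the hash key and the names toeplitz_hash, flow_bytes and select_channel are hypothetical. The key shown is arbitrary and is not the key of any particular system.

```python
import ipaddress
import struct
from typing import Dict, List, Tuple


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """32-bit Toeplitz hash: for every set input bit (most significant bit
    first), XOR in the 32 key bits that start at that bit offset."""
    if len(key) < len(data) + 4:
        raise ValueError("key must be at least 4 bytes longer than the input")
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for byte_index, byte in enumerate(data):
        for bit in range(8):
            if byte & (0x80 >> bit):
                offset = byte_index * 8 + bit
                result ^= (key_int >> (key_bits - 32 - offset)) & 0xFFFFFFFF
    return result


def flow_bytes(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> bytes:
    """Concatenate the 4-tuple in network byte order as hash input."""
    return (ipaddress.IPv4Address(src_ip).packed
            + ipaddress.IPv4Address(dst_ip).packed
            + struct.pack("!HH", src_port, dst_port))


HASH_KEY = bytes(range(1, 41))  # arbitrary 40-byte key, for illustration only

# Step 1 (assumed form): destination address identifies the software domain.
DOMAIN_FILTERS: Dict[str, str] = {"10.0.0.5": "vos-a", "10.0.0.6": "vos-b"}
# Step 2: per-domain indirection tables whose entries are delivery channels.
INDIRECTION: Dict[str, List[int]] = {
    "vos-a": [0, 1, 2, 3, 0, 1, 2, 3],
    "vos-b": [0, 1, 0, 1],
}
DEFAULT: Tuple[str, int] = ("host-default", 0)  # assumed default channel


def select_channel(src_ip: str, dst_ip: str,
                   src_port: int, dst_port: int) -> Tuple[str, int]:
    domain = DOMAIN_FILTERS.get(dst_ip)
    if domain is None:
        return DEFAULT
    table = INDIRECTION[domain]
    h = toeplitz_hash(HASH_KEY, flow_bytes(src_ip, dst_ip, src_port, dst_port))
    # For power-of-two table sizes this equals masking the low-order bits.
    return domain, table[h % len(table)]
```

Because the hash depends only on fields that are constant for a data flow, every packet of that flow selects the same entry of the indirection table and hence the same delivery channel. Rewriting entries of the indirection table rebalances flows across channels without changing the hash.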
[0025] Preferably the mapping step comprises: writing the data packet to a delivery channel
of the data processing system; and delivering a notification event into a notification
channel associated with the selected delivery channel. Suitably the notification channel
is configured to, on receiving the notification event, cause an interrupt to be delivered
to the processing core associated with the selected delivery channel. Alternatively,
the notification channel is configured to, on receiving the notification event, cause
a wakeup notification event to be delivered to an interrupting notification channel
that is arranged to cause an interrupt to be delivered to a processing core. Preferably
the said processing core is the processing core associated with the selected delivery
channel.
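The two notification arrangements described above can be modelled in a few lines. The following is a minimal sketch under stated assumptions: the class NotificationChannel, its attributes, and the representation of an interrupt as a (core, channel name) record appended to a list are all hypothetical and chosen only to make the control flow visible.

```python
from typing import List, Optional, Tuple


class NotificationChannel:
    """Toy notification channel that either interrupts its core directly or
    forwards a wakeup event to an interrupting notification channel."""

    def __init__(self, name: str, core: int, interrupting: bool = True,
                 wakeup_target: Optional["NotificationChannel"] = None) -> None:
        self.name = name
        self.core = core
        self.interrupting = interrupting
        self.wakeup_target = wakeup_target
        self.events: List[Tuple[str, str]] = []

    def post(self, event: Tuple[str, str],
             interrupts: List[Tuple[int, str]]) -> None:
        """Queue an event and record any interrupt that results from it."""
        self.events.append(event)
        if self.interrupting:
            interrupts.append((self.core, self.name))
        elif self.wakeup_target is not None:
            self.wakeup_target.post(("wakeup", self.name), interrupts)


interrupts: List[Tuple[int, str]] = []
direct = NotificationChannel("nc0", core=0)
default = NotificationChannel("nc-default", core=1)
indirect = NotificationChannel("nc2", core=1, interrupting=False,
                               wakeup_target=default)

direct.post(("rx", "pkt-1"), interrupts)    # interrupt raised on core 0
indirect.post(("rx", "pkt-2"), interrupts)  # wakeup forwarded, then core 1
```

In the second case the receive event stays queued in the non-interrupting channel, and only a wakeup event reaches the interrupting channel, whose core is here chosen to be the core associated with the originating channel.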
[0026] Preferably the network interface device performs stateless packet processing.
[0027] The particular software domain may be a virtualised operating system instance. The
particular software domain may be a user-level process. The particular software domain
may have a lower privilege level than a kernel or hypervisor supported by the data
processing system. The particular software domain may be an operating system kernel.
[0028] According to a second aspect of the present invention there is provided a data processing
system arranged to perform the method according to the first aspect of the present
invention.
[0029] According to a third aspect of the present invention there is provided a method for
transmitting data packets onto a network by means of a data processing system having
a plurality of processing cores and supporting a network interface device and a set
of at least two software domains, each of said software domains comprising one of:
a virtualised operating system, part of a monolithic operating system, an application
and a network stack, wherein each of said software domains has a privilege level below
that of a kernel of the data processing system or a hypervisor of the data processing
system, the method characterised by each software domain: (a) carrying a plurality
of data flows; (b) supporting at least two transmit channels between the network interface
device and the respective software domain; and (c) being operable to process notification
events associated with the transmission of data at a processing core of the data processing
system, the method comprising: at a particular one of the software domains, selecting
in dependence on the data flow to which a set of data for transmission belongs, one
of at least two transmit channels, said transmit channel being associated with a particular
one of the processing cores of the system, wherein said at least two transmit channels
are associated with different processing cores of said at least two processing cores;
and processing notification events associated with the transmission of data through
the said transmit channel on the processing core associated with said transmit channel.
[0030] Preferably the transmit channel has an associated notification channel to which notification
events associated with the said transmit channel are delivered. Preferably the notification
channel is associated with the particular processing core associated with the selected
transmit channel. Suitably the notification channel is configured to, on receiving
the notification event, cause an interrupt to be delivered to the particular processing
core associated with the selected transmit channel. Alternatively, the notification
channel is configured to, on receiving the notification event, cause a wakeup notification
event to be delivered to an interrupting notification channel that is arranged to
cause an interrupt to be delivered to a processing core. Preferably the said processing
core is the processing core associated with the selected transmit channel.
[0031] The software domain may be a virtualised operating system instance. The software
domain may be a user-level process. The software domain may have a lower privilege
level than a kernel or hypervisor supported by the data processing system. The software
domain may be an operating system kernel.
[0032] According to a fourth aspect of the present invention there is provided a data processing
system arranged to perform the method according to the third aspect of the present
invention.
[0033] According to a fifth aspect of the present invention there is provided a data processing
system arranged to perform the method according to the first aspect of the present
invention and the method according to the third aspect of the present invention, wherein
the system is further arranged such that the receive processing of received packet
data of a first data flow and the processing of notification events associated with
the transmission of data of a second data flow are performed at the same processing
core if the first and second data flows are the same data flow.
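One way of arranging that receive processing and transmit notification processing of the same data flow fall on the same processing core is to derive both channel selections from a direction-independent flow key. The sketch below assumes that channel i is associated with core i, and uses CRC-32 merely as a convenient stand-in hash; the names flow_core, select_delivery_channel and select_transmit_channel are hypothetical.

```python
import zlib
from typing import Tuple

Endpoint = Tuple[str, int]  # (IP address, port)

NUM_CORES = 4  # assumed number of cores available to the software domain


def flow_core(endpoint_a: Endpoint, endpoint_b: Endpoint) -> int:
    """Map a flow to a core index independently of packet direction."""
    # Sorting the two endpoints gives the same key for both directions.
    low, high = sorted([endpoint_a, endpoint_b])
    key = f"{low[0]}:{low[1]}-{high[0]}:{high[1]}".encode()
    return zlib.crc32(key) % NUM_CORES


def select_delivery_channel(src: Endpoint, dst: Endpoint) -> int:
    """Receive side: channel index equals the index of its associated core."""
    return flow_core(src, dst)


def select_transmit_channel(src: Endpoint, dst: Endpoint) -> int:
    """Transmit side: the same mapping, so the same core handles the events."""
    return flow_core(src, dst)


local = ("10.0.0.5", 80)
remote = ("192.0.2.1", 49152)

rx_channel = select_delivery_channel(remote, local)  # packet from the network
tx_channel = select_transmit_channel(local, remote)  # packet to the network
```

Here rx_channel and tx_channel are equal for the flow between the two endpoints, so state shared by the receive and transmit paths of that flow is touched by one core only.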
[0034] According to a sixth aspect of the present invention there is provided a method for
managing interaction between a data processing system and a network interface device,
the data processing system having a plurality of processing cores and a set of at
least two software domains, each of said software domains comprising one of: a virtualised
operating system, part of a monolithic operating system, an application and a network
stack, wherein each of said software domains has a privilege level below that of a
kernel of the data processing system or a hypervisor of the data processing system,
the method characterised by each software domain: (a) carrying a plurality of data
flows; (b) supporting a set of at least two notification channels; and (c) being operable
to process notification events at at least two processing cores of the data processing
system, the method comprising: at the network interface device, in response to processing
data of a data flow of one of the software domains, selecting in dependence on one
or more characteristics of the data flow one of a set of notification channels of
the software domain, each notification channel being associated with a particular
one of the at least two processing cores of the data processing system, wherein said
at least two notification channels are associated with different processing cores
of said at least two processing cores; delivering a notification event to the selected
notification channel; and responsive to receiving the notification event at the selected
notification channel, causing an interrupt to be delivered to the processing core
associated with the selected notification channel such that processing of the
notification event is performed at that processing
core by a processing entity of the software domain.
[0035] The notification event may indicate that one or more data packets have been received
at the network interface device. The notification event may indicate that one or more
data packets have been transmitted by the network interface device.
[0036] Suitably at least some protocol processing of the data packets is performed at the
network interface device. Preferably at least some protocol processing of the data
packets is performed by the processing entity of the software domain.
DESCRIPTION OF THE DRAWINGS
[0037] The present invention will now be described by way of example with reference to the
accompanying drawings, in which:
Figure 1 shows the conventional structure of a data processing system having a monolithic
operating system architecture.
Figure 2A shows the architecture of a conventional virtualised data processing system.
Figure 2B is a schematic diagram of a data processing system in accordance with the
present invention.
Figure 3 is a schematic diagram of a virtualised data processing system in accordance
with a first embodiment of the present invention.
Figure 4 is a schematic diagram of a data processing system having a user-level networking
architecture in accordance with a second embodiment of the present invention.
Figure 5 is a flow diagram illustrating a two-step mapping of incoming data packets
in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION OF THE DRAWINGS
[0038] The following description is presented to enable any person skilled in the art to
make and use the invention, and is provided in the context of a particular application.
Various modifications to the disclosed embodiments will be readily apparent to those
skilled in the art.
[0039] Thus, the present invention is not intended to be limited to the embodiments shown,
but is to be accorded the widest scope consistent with the principles and features
disclosed herein.
[0040] The present invention has general application to multi-processor data processing
systems, particularly those supporting a plurality of software domains that are each
capable of distributing packet processing across more than one processor. The term
"processor" as used herein may refer to a CPU having one or more processing cores,
a single core of a multi-core processor, some or all cores of a multi-core processor,
a logical processor, a dedicated packet processing unit, or any section of a processing
unit or integrated circuit. A processor can be any entity configured to execute an
instruction thread in the context of the state information for that thread. The term
"processor" is used to indicate both physical processors of a system and the logical
processors available to a particular software domain.
[0041] An example of a data processing system to which the present invention relates is
shown in figure 2B. Data processing system 1 comprises a plurality of processors 9
and two or more software domains 7. Preferably the data processing system is a multi-CPU
system. Each software domain may be a virtualised operating system, part of a monolithic
operating system, an application, or a network stack. At least some of the software
domains can have a privilege level below that of the system kernel or hypervisor (which
is itself a software domain) - these software domain(s) having a lower privilege level
are referred to as "untrusted" software domains. The data processing system is
arranged to access a network 3 by means of one or more network interface devices 5.
A network interface device may or may not be considered to be part of data processing
system 1. At least some of the two or more software domains are operable to receive
and/or transmit data of the network 3 by means of network interface device(s) 5.
[0042] At least some of the software domains are configured to operate in accordance with
the present invention. There may in fact be further software domains of the data processing
system which do not operate in accordance with the present invention.
[0043] The term "data flow" is to be understood in the context of whichever system embodies
the present invention. Each data flow of a data processing system is generally identified
by reference to an end point of the data processing system - for example, a socket,
application or kernel entity. Packets belonging to two or more different data flows
could be multiplexed together or aggregated in some way for transmission across network
3.
[0044] Each of the software domains 7 supports one or more delivery channels. A delivery
channel is tied to a particular processor of the system and accepts incoming data
packets belonging to any one of a set of data flows of the software domain such that
all packets of a particular data flow of the software domain are processed at the
same processor. There may be one or more data flows in the set accepted by a given
delivery channel. Preferably, for each software domain, there are as many delivery
channels as there are processors of the system configured to perform packet processing
(the so-called "RSS processors"). There may be a different number of RSS processors
available to each software domain. One or more of the RSS processors of a system may
be supported at one or more network interface devices.
[0045] In the case of TCP offload architectures, such as Microsoft Chimney, the packets
themselves are processed on the NIC. However, there are other aspects of receive processing
that are still performed at a processor of the system - for example, completing the
I/O operation associated with the received data, and waking processes or threads that
are waiting for the data. Note that stateless processing, such as performing packet
checksums, is preferably performed at the NIC.
[0046] A trusted software domain (such as a kernel or hypervisor) of the system preferably
allocates resources to a delivery channel. A trusted software domain, or the software
domain supporting a particular delivery channel, could determine which processor of
the system will perform the processing of data packets accepted into that delivery
channel. A data processing system has at least one trusted software domain.
[0047] When a data packet is received at the network interface device, the data packet is
mapped into a particular delivery channel of a particular software domain in dependence
on the data flow to which the data packet belongs. Each data packet carries an identifier
which indicates the data flow to which it belongs - this may be a device address,
a port number, or any other characteristic of the data packet which indicates the
data flow to which it belongs. Typically, the identifier will be located in a header
of the packet: for example, TCP packets specify a data flow by means of IP addresses
and port numbers carried in the packet headers.
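For the TCP example, extraction of the flow identifier from the packet headers might look like the following sketch. It assumes an IPv4 packet that starts at the IP header (no link-layer header) and carries TCP; the function name flow_identifier and the sample addresses are chosen for illustration only.

```python
import socket
import struct
from typing import Tuple


def flow_identifier(packet: bytes) -> Tuple[str, int, str, int]:
    """Return (src_ip, src_port, dst_ip, dst_port) of an IPv4/TCP packet."""
    version_ihl = packet[0]
    if version_ihl >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    header_length = (version_ihl & 0x0F) * 4  # IHL is in 32-bit words
    if packet[9] != 6:  # protocol number 6 is TCP
        raise ValueError("not a TCP packet")
    src_ip = socket.inet_ntoa(packet[12:16])
    dst_ip = socket.inet_ntoa(packet[16:20])
    # The TCP header begins with the source and destination ports.
    src_port, dst_port = struct.unpack(
        "!HH", packet[header_length:header_length + 4])
    return src_ip, src_port, dst_ip, dst_port


# Build a minimal sample packet: 20-byte IPv4 header plus 20-byte TCP header.
ip_header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 40, 0, 0, 64, 6, 0,
    socket.inet_aton("192.0.2.1"), socket.inet_aton("198.51.100.7"))
tcp_header = struct.pack("!HH", 49152, 80) + bytes(16)
sample = ip_header + tcp_header

identifier = flow_identifier(sample)
```

The resulting tuple is the same for every packet of the connection, which is what allows it to serve as the key by which packets are mapped to a delivery channel.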
[0048] The mapping of packets to delivery channels is preferably performed at the network
interface device. The mapping may be performed in dependence on header information
or other characteristics of the received data packets. The mapping of packets from
the NIC to their respective delivery channels may be performed in any suitable manner,
including: writing each received packet to an area of system memory accessible to
its respective software domain; or passing each received packet to a kernel or trusted
entity of the data processing system. A preferred embodiment of the mapping is set
out in figure 5.
[0049] By ensuring that all packets of a given data flow are accepted into a single delivery
channel, all data packets of that data flow will be processed at the same processor.
This eliminates cache bouncing and state lock problems. Furthermore, use of a delivery
channel as described ensures that packets of a given data flow will be processed in
the order in which they arrive at the network interface device.
Virtualised systems
[0050] In a first embodiment of the present invention, illustrated in figure 3, a data processing
system supports two or more virtualised operating system (VOS) instances 303. Each
VOS is a (typically untrusted) software domain of the data processing system that
is operable to receive and/or transmit data by means of the network interface 301.
[0051] It is advantageous if each VOS (or applications supported by each VOS) can receive
and/or transmit data over the network by means of the network interface, without that
data being passed to hypervisor 315 (a trusted software domain). In other words, it
is preferable that hypervisor 315 is not required to consume processor time (e.g.
due to moving data around the system or performing packet processing) when a VOS is
receiving or transmitting data by means of the NIC.
[0052] Preferably, each virtualised operating system instance supports a virtualised network
interface (VNIC) or device driver that provides a software abstraction 317 of the
NIC to the VOS and allows the VOS to directly access the send/receive interface of
the NIC. Use of a VNIC/driver in a VOS improves transmit/receive performance because
it eliminates the time- and resource-expensive communications between the untrusted
virtualised OSes and the trusted software domain of the hypervisor 315.
[0053] The term hypervisor is used to refer to the trusted software domain of a virtualised
data processing system that controls access to the resources of the system. The hypervisor
may or may not be one and the same as any virtual machine monitor. The term hypervisor
encompasses a "host domain" as used in Xen and MS Viridian, and any trusted software
domain that has direct access to hardware.
[0054] The virtualised operating system instances (software domains) illustrated in figure
3 each support one or more applications 307 which are operable to transmit or receive
data over network 3 by means of NIC 5. Each application supports one or more sockets
309 that define the end points of the data flows of the VOS.
[0055] In accordance with the present invention, data packets received at the network interface
device are mapped to a delivery channel of a VOS in dependence on the data flow to
which the data packet belongs. The data flow to which a packet belongs is indicated
by the packet itself - typically in the packet header. Alternatively the corresponding
data flow could be inferred from meta-data associated with the packet.
[0056] Since each delivery channel accepts packets belonging to one or more data flows,
identifying the data flow to which a packet belongs also identifies the appropriate
delivery channel. Each delivery channel is supported by a single VOS. It is sometimes
appropriate for received data packets to be delivered into more than one delivery
channel - for example, multicast packets. The delivery channels may be spread over
more than one VOS.
[0057] It is advantageous if all the virtualised operating system instances operate in accordance
with the present invention. However, some virtualised OS instances may not support
multiple delivery channels, or may allocate received data packets for processing in
accordance with conventional methods. Each virtualised OS instance may be a different
operating system - for example, one instance might be a Unix-based OS and another
might be Microsoft Windows.
[0058] Preferably the hypervisor 315 also operates as a software domain in accordance with
the present invention. For example, the principles described herein may be extended
to out-of-band data handled by the hypervisor and/or to data flows handled by the
hypervisor. For example, all out-of-band data could be mapped to a particular delivery
channel of the hypervisor so as to cause all out-of-band data to be processed at a
particular CPU of the system.
[0059] Preferably, one or more of the virtualised OS instances 303 include a VNIC or driver
that is configured to receive hardware interrupts (such as MSI-X interrupts) 305 from
the NIC. This allows a VOS to receive notification that data packets have arrived
for one or more of its delivery channels. Alternatively, one or more of the virtualised
OS instances could receive virtualised interrupts: in this case, the hypervisor receives
a notification from the NIC and forwards it on to the appropriate VOS. In a preferred
embodiment, the interrupts are triggered by notification channels 311 having interrupts
enabled, as is known in the art.
Packet-processing entities
[0060] In a second embodiment of the present invention, illustrated in figure 4, two or
more user-level processes 407 supported in user-level environment 405 have access
to network interface 401. Each user-level process forms a software domain of the data
processing system and handles a plurality of data flows 405. Each user-level process
supports at least one packet-processing entity configured to perform protocol processing
of data packets received at the network interface or for transmission by the network
interface. A user-level process therefore supports at least a partial network stack.
Privileged mode environment 403 is typically a kernel or hypervisor of the data processing
system.
[0061] It is advantageous if the user-level processes can receive and/or transmit data over
the network by means of the NIC without data being passed to or handled by trusted
software domain 403. In other words, it is preferable that the user-level packet-processing
entities are not required to make system calls to the trusted software domain in order
to effect transmit and receive of data packets over the network. An example of a user-level
packet processing architecture is the Open Onload architecture by Solarflare Communications.
[0062] Each of the user-level processes supports two or more delivery channels by means
of which received data may be passed from the network interface device. Each user-level
process may also support two or more transmit channels by means of which data for
transmission may be passed to the network interface device. The transmit and delivery
channels allow applications 409 to transmit and receive data by means of the network
interface device.
[0063] In accordance with the present invention, data packets received at the network interface
device are mapped to a delivery channel of a user-level process in dependence on the
data flow to which the packet data belongs. The data flow to which a packet belongs
is indicated by the packet itself - typically in the packet header. Alternatively,
the corresponding data flow could be inferred from one or more characteristics of
the packet, such as the length of the payload data.
[0064] The data processing system illustrated in figure 4 supports a plurality of data flows
and each data flow is directed to a particular one of the user-level processes. Since
each delivery channel accepts packets belonging to one or more data flows, identifying
the data flow to which a packet belongs also identifies the appropriate delivery channel.
[0065] Trusted software domain 403 could itself be a software domain in accordance with
the present invention. For example, the principles described herein may be extended
to out-of-band data handled by the kernel and/or to data flows handled by the kernel.
For example, all out-of-band data could be mapped to a particular delivery channel
of the kernel so as to cause all out-of-band data to be processed at a particular
processor of the system.
[0066] The first and second embodiments described above are not mutually exclusive: one
or more of the virtualised OS instances could support two or more user-level processes
as described herein.
Delivery channels
[0067] Received packet data is delivered to a software domain by means of one or more delivery
channels. In preferred embodiments of the present invention, each delivery channel
has one or more associated buffers into which the received packets are written. The
associated buffers may be on the NIC, or in system memory accessible to the NIC as
shown in figures 3 and 4. It is further advantageous that the buffers associated with
a delivery channel be directly accessible to the software domain. This allows received
data to be delivered to the appropriate software domain without mediation by a processor.
[0068] It will be apparent that there are various mechanisms for delivering received data
into a delivery channel. A delivery channel could comprise a descriptor ring with
descriptors identifying the location of buffers in system memory, as is known in the
art. A data processing system of the present invention could be configured such that
a NIC delivers different sized packets into different descriptor rings, or a packet's
header may be split from its payload data with each being delivered into separate
buffers. As mentioned above, another option is for received data to be stored at the
NIC until it is retrieved by the appropriate agent of each delivery channel.
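By way of illustration only, the descriptor ring arrangement mentioned above can be sketched as follows. This is a minimal model under stated assumptions, not the implementation of any particular NIC: the class name DescriptorRing, the methods nic_write and consume, and the dictionary standing in for system memory are all hypothetical.

```python
class DescriptorRing:
    """Hypothetical fixed-size ring of descriptors naming buffers in memory."""

    def __init__(self, buffer_addresses):
        self.descriptors = list(buffer_addresses)
        self.size = len(self.descriptors)
        self.head = 0      # next descriptor the NIC will fill
        self.tail = 0      # next descriptor the software domain will consume
        self.filled = 0    # number of descriptors holding unconsumed packets
        self.memory = {}   # address -> packet bytes (stand-in for system memory)

    def nic_write(self, packet):
        """Model the NIC writing a packet into the next free buffer."""
        if self.filled == self.size:
            return None    # ring full: no free descriptor
        addr = self.descriptors[self.head]
        self.memory[addr] = packet
        self.head = (self.head + 1) % self.size
        self.filled += 1
        return addr

    def consume(self):
        """Model the software domain taking the oldest delivered packet."""
        if self.filled == 0:
            return None
        addr = self.descriptors[self.tail]
        self.tail = (self.tail + 1) % self.size
        self.filled -= 1
        return self.memory.pop(addr)
```

In this sketch the software domain reads packets directly from the buffers named by the descriptors, which corresponds to the buffers being directly accessible to the software domain.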
Notification Channels
[0069] Each delivery channel has an associated notification channel that serves to notify
the software domain of the arrival of received data. In preferred embodiments of the
present invention, notification channels are able to notify software domains of other
types of events. In some embodiments a delivery channel may incorporate a notification
channel as part of a unified mechanism. Figures 3 and 4 illustrate notification channels
311, 413 into which the NIC posts notification events so as to indicate that incoming
data packet(s) have been received. Events posted into a notification channel include
references to the data stored in the buffers that are associated with the delivery
channel. The notification channels may include non-interrupting notification channels
and interrupting notification channels (notification channels with interrupts enabled),
as described below. Alternatively, all notification channels may be interrupting notification
channels.
[0070] The notification channels in particular may be supported at the NIC, or may be maintained
by a virtualised NIC or device driver of a software domain.
[0071] Preferably there are as many notification channels as there are delivery channels
in a data processing system. Preferably, there are as many interrupting notification
channels as there are delivery channels in a data processing system. Each interrupting
notification channel is associated with a particular one of the processors of the
system, so that notification events delivered to the notification channel are handled
on that processor. Preferably, in each software domain that is operable according
to the present invention, there is one delivery channel and one notification channel
for each processor in the software domain configured to handle received data. Alternatively,
there may be a greater or lesser number of notification channels than processors.
[0072] The events posted to each notification channel are dequeued by one or more event
handlers of the system. It is advantageous if each notification channel has an associated
event handler that executes at the processor of the notification channel. This allows
all operations relating to the handling and processing of data packets of a delivery
channel to be performed at a single processor of the system. The event handler may
form part of the notification channel. Alternatively, each software domain (such as
a virtualised operating system instance) could support an event handler, or less preferably,
there could be one or more event handlers supported at a kernel or hypervisor (as
appropriate) of the system. An event handler could be part of a packet-processing
thread supported at one of the RSS processors of the system.
[0073] An event handling routine iteratively dequeues events from a notification channel
until the channel is empty or processing cycles are allocated to another process or
virtualised operating system. An event may indicate that one or more packets have
been received, in which case the event handling routine causes those received packets
to be processed.
[0074] An event handler may be invoked in response to an interrupt being raised. The interrupt
could be a hardware or virtualised interrupt, depending on the configuration of the
system. After the event handler determines that the notification channel is empty
it may re-enable the interrupt so that subsequent events will raise another interrupt.
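The event handling behaviour of the two preceding paragraphs may be sketched, purely as an example, in the following form. The names NotificationChannel, handle_events and the budget parameter are assumptions introduced for illustration; the budget stands in for processing cycles being allocated elsewhere, and the event layout (a dictionary with a "packets" entry) is likewise hypothetical.

```python
from collections import deque


class NotificationChannel:
    """Hypothetical model of an interrupting notification channel."""

    def __init__(self):
        self.events = deque()           # events posted on behalf of the NIC
        self.interrupt_enabled = True   # the next event raises an interrupt
        self.interrupts_raised = 0

    def post(self, event):
        self.events.append(event)
        if self.interrupt_enabled:
            # Raise one interrupt, then stay quiet until re-enabled.
            self.interrupt_enabled = False
            self.interrupts_raised += 1


def handle_events(channel, process_packet, budget=None):
    """Dequeue events until the channel is empty or the budget is used up."""
    handled = 0
    while channel.events and (budget is None or handled < budget):
        event = channel.events.popleft()
        for packet in event["packets"]:
            process_packet(packet)
        handled += 1
    if not channel.events:
        # Channel drained: re-enable so later events raise another interrupt.
        channel.interrupt_enabled = True
    return handled
```

Because the interrupt is re-enabled only once the channel is found to be empty, a burst of events posted while the handler is running causes a single interrupt rather than one per event.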
[0075] A data processing system may be configured so that a non-interrupting notification
channel generates wake-up notifications to another notification channel indicating
that the first channel has received new events. Preferably the second channel is an
interrupting notification channel. In response to handling a wake-up event the event
handler of the first notification channel may be invoked to dequeue events of that
first channel. In this way the event handler of a notification channel in a software
domain that is not able to handle hardware interrupts may be invoked in response to events
being delivered to that channel. The event handler of the first notification channel
may be invoked by way of a virtualised interrupt.
[0076] It can be advantageous to group notification channels such that only one notification
channel of the group is an interrupting notification channel. The interrupting notification
channel preferably holds wakeup events indicating which of the non-interrupting channels
of its group have received new events. This layer of indirection can reduce the number
of interrupts delivered to the processors of the system. Preferably all the notification
channels in such a group relate to a single processor of the system so as to ensure
that interrupts raised by means of the interrupting notification channel are handled
by the processor which is to perform processing of all of the data packets indicated
by events in the notification channels of the group.
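One possible way of grouping notification channels behind a single interrupting channel is sketched below. It is an illustration only: the names Channel, wakeup_target and service_group are hypothetical, and the wakeup_pending flag, which coalesces repeated wake-ups for the same channel, is an assumption that the description does not require.

```python
from collections import deque


class Channel:
    """Hypothetical notification channel; one channel per group interrupts."""

    def __init__(self, name, wakeup_target=None):
        self.name = name
        self.events = deque()
        self.wakeup_target = wakeup_target   # interrupting channel of the group
        self.wakeup_pending = False

    def post(self, event):
        self.events.append(event)
        # A non-interrupting channel tells its interrupting channel, once,
        # that it holds new events.
        if self.wakeup_target is not None and not self.wakeup_pending:
            self.wakeup_pending = True
            self.wakeup_target.events.append(("wakeup", self))


def service_group(interrupting, handle):
    """Run on the group's processor in response to a single interrupt."""
    while interrupting.events:
        _kind, member = interrupting.events.popleft()
        member.wakeup_pending = False
        while member.events:
            handle(member.name, member.events.popleft())
```

In this sketch three data events posted across two non-interrupting channels give rise to only two wake-up events, all of which are handled in one pass on the processor of the group.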
[0077] In systems in which it is not possible to deliver hardware interrupts to their software
entities (such as some virtualised systems and user-level packet-processing entities),
all the notification channels could be non-interrupting channels. The notification
channels could be virtualised notification channels that are configured to deliver
virtualised interrupts into the appropriate software domains.
[0078] When a data packet is received at the NIC, a notification event is posted into the
notification channel associated with the delivery channel that corresponds to the
data flow to which the packet belongs. The data packet is written into a buffer, either
in system memory or at the NIC. The appropriate delivery channel is selected by the
network interface device on the basis of one or more characteristics of the data packet.
This selection is preferably determined on the basis of one or more identifiers in
the packet header that indicate which data flow the packet belongs to. Preferably
the delivery channel is selected in dependence on one or more address fields of the
packet header, such as the destination address and port number of an IP packet.
[0079] By identifying the data flow to which a packet belongs, the corresponding software
domain, delivery channel, and notification channel (and hence RSS processor) for that
packet are determined.
Packet Mapping
[0080] In preferred embodiments of the present invention, the NIC performs a two-step mapping
of a data packet 501 onto a particular delivery channel in dependence on the packet's
header information 505. This process is illustrated in figure 5. The NIC reads the
header information of an incoming data packet and has two mapping functionalities:
- i. a filter 507 linking one or more packet identifiers (such as predetermined fields
of the packet header) to the software domain of the system to which the packet should
be delivered;
- ii. a hash function 509, which, when performed on predetermined bits of the packet
header, provides an indication of the RSS processor, and therefore the delivery channel,
which is to handle packets of the data flow to which the received packet belongs.
[0081] The filter therefore indicates which software domain of the system a received data
packet is to be directed to. A bit or flag may be stored in the filter to indicate
whether receive side scaling (RSS) is enabled for a particular data flow, or for a
software domain. Alternatively a bit or flag may be associated with the software domain
to indicate whether RSS is enabled for that domain. If RSS is enabled, the packet
is delivered to the particular delivery channel of the indicated software domain that
is identified by the hash function 515. If RSS is not enabled, the packet is simply
delivered to the default delivery channel of that software domain 513.
[0082] If a match is not found at the filter, the NIC delivers the packet to a default software
domain of the data processing system 511 - usually the hypervisor or kernel. Alternatively
the NIC may deliver the packet to each of the software domains.
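The control flow of paragraphs [0080] to [0082] may be sketched as follows. This is a simplified example and not a definitive implementation: the names SoftwareDomain, Nic and select_channel are hypothetical, the filter is assumed to be keyed on destination address and port, the hash is supplied by the caller, and a plain modulo is used in place of any table lookup.

```python
from dataclasses import dataclass, field


@dataclass
class SoftwareDomain:
    name: str
    channels: list               # delivery channels; index 0 is the default
    rss_enabled: bool = False    # flag of the kind described in [0081]


@dataclass
class Nic:
    default_domain: SoftwareDomain
    filters: dict = field(default_factory=dict)  # (dst_ip, dst_port) -> domain

    def select_channel(self, packet, hash_fn):
        # Step 1: the filter maps packet identifiers to a software domain.
        domain = self.filters.get((packet["dst_ip"], packet["dst_port"]))
        if domain is None:
            domain = self.default_domain   # no match: default software domain
        # Step 2: the hash selects a channel only when RSS is enabled.
        if domain.rss_enabled:
            index = hash_fn(packet) % len(domain.channels)
        else:
            index = 0                      # default delivery channel
        return domain.name, domain.channels[index]
```

Since the hash function is deterministic, repeated calls for packets of the same data flow return the same delivery channel, and therefore the same processor.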
[0083] A system of the present invention may be configured to apply RSS algorithms only
to certain kinds of packets. For example, RSS may be applied only to TCP/IP v4 and
v6 packets. Only the first step of mapping a packet into a software domain need be
performed for packets to which RSS does not apply.
[0084] The hash function is performed on predetermined bits of an incoming packet header
and the result of the function is used to identify a delivery channel of the software
domain indicated by the filter. In accordance with the present invention, the selection
of a particular delivery channel determines a notification channel into which indications
for received packets are posted. Because each notification channel is associated with
a particular processor of the system, the output of the hash function therefore determines
at which processor 517 of the system processing of the received packet is performed.
Since the hash function calculates the same output for each packet that belongs to
the same data flow, each packet of that data flow will be delivered to the same processor.
[0085] In a particularly preferred embodiment, the hash function takes as its input various
fields from the packet headers and a key associated with the software domain, and
the result is a large number, hash. Low order bits taken from that number are used to
select an entry from a table associated with the software domain,
sw_domain.rss_table[hash & mask]. The value of the entry in this table identifies one
of a set of delivery channels. This selection algorithm may be roughly expressed in
pseudo-code as:
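sw_domain = sw_domain_lookup(packet)                     # filter: packet -> software domain
rss_hash = calculate_hash(packet, sw_domain.rss_key)     # keyed hash over header fields
delivery_channel = sw_domain.rss_table[rss_hash & mask]  # low order bits index the table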
[0086] Of course, this is just one of many possible implementations of the present invention;
others will be apparent to those skilled in the art.
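As a hedged illustration of the pseudo-code of paragraph [0085], the table lookup may be written in runnable form as below. The keyed BLAKE2b digest is merely a convenient stand-in for the hash function and is not suggested by the description; the class name SwDomain, the table size of 128 entries and the round-robin population of the table are likewise assumptions.

```python
import hashlib


def calculate_hash(flow_fields, rss_key):
    """Stand-in keyed hash (an assumption): BLAKE2b truncated to 32 bits."""
    data = "|".join(str(f) for f in flow_fields).encode()
    digest = hashlib.blake2b(data, key=rss_key, digest_size=4).digest()
    return int.from_bytes(digest, "big")


class SwDomain:
    """Hypothetical software domain holding its own RSS key and table."""

    def __init__(self, rss_key, channels, table_size=128):
        if table_size & (table_size - 1):
            raise ValueError("table size must be a power of two")
        self.rss_key = rss_key
        self.mask = table_size - 1
        # Spread the delivery channels round-robin over the table entries.
        self.rss_table = [channels[i % len(channels)] for i in range(table_size)]

    def delivery_channel(self, flow_fields):
        rss_hash = calculate_hash(flow_fields, self.rss_key)
        return self.rss_table[rss_hash & self.mask]
```

Holding the table per software domain means that the spread of data flows over delivery channels can be changed by rewriting table entries, without changing the hash function or the key.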
[0087] The functions of either the filter or the hash function may alternatively be performed
by any one of a filter, forwarding table, a hash function (e.g. a Toeplitz function),
or any other suitable mapping or technique known in the art, and the terms "filter"
and "hash function" as used herein shall apply accordingly.
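For example, a Toeplitz function of the kind mentioned above may be sketched as follows. The sketch follows the commonly described form of the function, in which 32-bit windows of a key are combined by exclusive-or according to the set bits of the input; the function name toeplitz_hash and the requirement that the key be at least 31 bits longer than the input are choices made for this illustration.

```python
def toeplitz_hash(key: bytes, data: bytes) -> int:
    """32-bit Toeplitz hash of data under key.

    XORs together the 32-bit windows of the key that line up with the
    set bits of the input, taking the most significant bit first.
    """
    key_bits = len(key) * 8
    if key_bits < len(data) * 8 + 31:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i in range(len(data) * 8):
        byte_index, bit_index = divmod(i, 8)
        if data[byte_index] & (0x80 >> bit_index):
            # Window of 32 key bits starting at bit position i of the key.
            shift = key_bits - 32 - i
            result ^= (key_int >> shift) & 0xFFFFFFFF
    return result
```

The function is linear over exclusive-or, so the hash of the combination of two inputs equals the combination of their hashes, and identical header fields always yield the same output for a given key.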
[0088] Each software domain that is configured in accordance with the teachings of the present
invention supports two or more delivery channels arranged such that data packets received
into one of the delivery channels are processed at a particular RSS processor of the
system. The kernel or hypervisor (i.e. a trusted software domain) of the system may
allocate a set or group of RSS processors for performing packet processing to all
or some of the untrusted software domains of the system. The software domains of the
system need not all utilise the same number of RSS processors.
[0089] In light of the above, it is clear that all the software domains of a system need
not use the same number of delivery channels or notification channels. Furthermore,
each delivery channel may be arbitrarily associated with a processor of the system
in the sense that it may not be important which processor is associated with which
delivery channel, provided that the mapping of data packets to delivery channels is
performed consistently such that packets of a particular data flow are always handled
at the same processor. Some operating systems dictate a particular mapping of received
packets to RSS processors.
[0090] An interrupting notification channel may be configured to trigger interrupts only
when it is primed or not blocked. This can be advantageous to prevent an interrupt
being delivered to a processor in certain situations - for example, when the processor
is processing high priority threads or when a device or thread has taken exclusive
control of the processor.
[0091] It can be advantageous for hardware interrupts to be delivered to a trusted software
domain of the data processing system. For instance, some received data packets may
be directed to untrusted software domains of the system that cannot receive hardware
interrupts, or the trusted domain may be configured to handle all hardware interrupts
for reasons of system integrity or security. In such configurations, the trusted domain
is preferably arranged to deliver virtualised interrupts into the indicated software
domains so as to trigger execution of the appropriate event handling processes.
[0092] In a preferred embodiment, in conjunction with writing a notification to a notification
channel, the NIC writes incoming data packets to that notification channel's corresponding
delivery channel. The channels are preferably supported at an area of memory allocated
to the software domain to which the packets belong. Each notification event identifies
the delivery channel (for example, as an address in memory) to which one or more data
packets have been written. A notification event may include the result of the filter
and/or hash function. This can be useful to indicate to an event handler of a software
domain which delivery channel the data packets belong to.
[0093] In the embodiment described here, it is the NIC which performs the mapping of received
packets to their appropriate delivery channels. However, embodiments are envisaged
in which one or both of the mapping steps could be performed at a software domain of the
data processing system. For example, a NIC could deliver all received packets to a
trusted domain of the data processing system and the mapping operations could be performed
at that trusted domain.
Transmit processing
[0094] A data processing system of the present invention preferably extends the principles
described herein to transmit data processing. A software domain supports one or more
transmit channels, each of which is associated with a particular processor of the
system such that the transmit processing of data belonging to a particular data flow
is performed at the same processor at which all previous transmit processing of that
data flow was performed.
[0095] On the transmit path it is particularly efficient if there is one transmit channel
per processor (the processor to which the corresponding transmit channel is tied).
Notifications for transmit completions are therefore delivered to the same processor
which performs the transmit processing. Preferably, data for transmission that belongs
to a particular data flow is delivered into the transmit channel associated with the
RSS processor of that data flow. In other words, a data flow can be considered to
include a transmit data flow and an associated receive data flow, and the system is
configured such that both receive and transmit processing of packets belonging to
that data flow is performed at the same processor.
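The pairing of transmit and receive channels per processor may be sketched, by way of example only, as follows. The class name FlowSteering, the channel naming and the use of a caller-supplied hash over a single flow identifier are assumptions; in particular the sketch assumes that the same flow identifier is available on both the receive path and the transmit path.

```python
class FlowSteering:
    """Hypothetical per-domain steering tying each data flow to one processor."""

    def __init__(self, num_processors, hash_fn):
        self.hash_fn = hash_fn
        # One receive (delivery) channel and one transmit channel per processor.
        self.rx_channels = [f"rx{cpu}" for cpu in range(num_processors)]
        self.tx_channels = [f"tx{cpu}" for cpu in range(num_processors)]

    def processor_for(self, flow):
        return self.hash_fn(flow) % len(self.rx_channels)

    def rx_channel(self, flow):
        return self.rx_channels[self.processor_for(flow)]

    def tx_channel(self, flow):
        # Same index as the receive side, so transmit processing and
        # transmit completions land on the flow's RSS processor.
        return self.tx_channels[self.processor_for(flow)]
```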
[0096] The principles of the present invention may be applied at any kind of data processor
capable of processing incoming data packets, including personal computers, laptops,
servers, bridges, switches, and routers. The data packets received at the NIC may
be any kind of data packets. For example, the packets may be TCP/IP or UDP packets.
The data processing system and NIC may support any set of protocols - for example,
the data link layer protocol in particular could be IEEE 802.11, Ethernet or ATM.
[0097] A network interface device as described herein could be an expansion card, a computer
peripheral or a chip of any kind integrated into the data processing system - for
example, the chip could be installed on the motherboard. The network interface device
preferably supports a processor for performing, for example, packet mapping and physical
layer receive/transmit processing.
1. A method for receiving packet data by means of a data processing system having a plurality
of processing cores and supporting a network interface device (301; 401) and a set
of at least two software domains (303; 409), each of said software domains (303; 409)
comprising one of: a virtualised operating system (303), part of a monolithic operating
system, an application (409) and a network stack, wherein each of said software domains
(303; 409) has a privilege level below that of a kernel (403) of the data processing
system or a hypervisor (315; 403) of the data processing system, the method
characterised by each software domain (303; 409): (a) carrying a plurality of data flows; (b) supporting
at least two delivery channels between the network interface device (301; 401) and
the respective software domain (303; 409); and (c) being operable to perform receive
processing of received packet data at at least two processing cores of the data processing
system, the method comprising:
receiving at the network interface device (301; 401) packet data that is part of a
particular data flow;
selecting in dependence on one or more characteristics of the packet data a delivery
channel of a particular one of the software domains (303; 409), said delivery channel
being associated with a particular one of the at least two processing cores of the
data processing system, wherein said at least two delivery channels are associated
with different processing cores of said at least two processing cores; and mapping
the incoming packet data into the said selected delivery channel such that receive
processing of the packet data is performed in the software domain (303; 409) by the
same processing core that performed receive processing for preceding packets of that
data flow, wherein the selecting step comprises:
matching a first subset of the one or more characteristics of the packet data to a
set of stored characteristics so as to identify the particular software domain;
choosing, in dependence on a second subset of the one or more characteristics of the
packet data, the delivery channel within the particular software domain,
wherein the matching step is performed by means of a filter and the choosing step
is performed by means of a hash function,
wherein the filter comprises a flag associated with one of the software domains, said
flag indicating whether receive side scaling is enabled for said one of the software
domains.
2. A method as claimed in claim 1, wherein said selected delivery channel has one or
more associated buffers into which incoming packets are written, wherein said associated
buffers are either part of said network interface device (301; 401) or part of a system
memory of said data processing system that is accessible to said network interface
device (301; 401).
3. A method as claimed in claim 1 or 2, wherein each delivery channel is arranged such
that receive processing of all packet data accepted into a delivery channel is performed
at the same processing core.
4. A method as claimed in any preceding claim, wherein the network interface device (301;
401) performs at least some protocol processing of the packet data.
5. A method as claimed in claim 4, wherein the mapping step includes delivering a subset
of the received packet data into the said delivery channel.
6. A method as claimed in any preceding claim, wherein the selecting step is performed
at the network interface device (301; 401).
7. A method as claimed in any preceding claim, wherein the data flow corresponds to a
network socket of the particular software domain (303; 409).
8. A method as claimed in any preceding claim, wherein the one or more characteristics
of the packet data comprise one or more fields of the packet header.
9. A method as claimed in any preceding claim, wherein the mapping step comprises:
writing the data packet to a delivery channel of the data processing system; and
delivering a notification event into a notification channel (311; 413) associated
with the selected delivery channel.
10. A method as claimed in claim 9, wherein the notification channel (311; 413) is configured
to, on receiving the notification event, either cause an interrupt to be delivered
to the processing core associated with the selected delivery channel or cause a wakeup
notification event to be delivered to an interrupting notification channel that is
arranged to cause an interrupt to be delivered to a processing core, said processing
core is the processing core associated with the selected delivery channel.