This blog post explains how computers running the Linux kernel receive packets, as well as how to monitor and tune each component of the networking stack as packets flow from the network toward userland programs.
It is impossible to tune or monitor the Linux networking stack without reading the source code of the kernel and having a deep understanding of what exactly is happening.
This blog post will hopefully serve as a reference to anyone looking to do this.
Special thanks

Special thanks to the folks at Private Internet Access, who hired us to research this information in conjunction with other network research and who have graciously allowed us to build upon the research and publish this information.
The information presented here builds upon the work done for Private Internet Access, which was originally published as a 5-part series starting here.
General advice on monitoring and tuning the Linux networking stack

The networking stack is complex, and there is no one-size-fits-all solution. If the performance and health of your networking is critical to you or your business, you will have no choice but to invest a considerable amount of time, effort, and money into understanding how the various parts of the system interact.
Ideally, you should measure packet drops at each layer of the network stack; that way, you can narrow down which component needs to be tuned.
This is where, I think, many operators go off track: the assumption is made that a set of sysctl settings or /proc values can simply be reused wholesale. In some cases, perhaps, but it turns out that the entire system is so nuanced and intertwined that if you desire to have meaningful monitoring or tuning, you must strive to understand how the system functions at a deep level. Otherwise, you can simply use the default settings, which should be good enough until further optimization (and the required investment to deduce those settings) is necessary.
Many of the example settings provided in this blog post are used solely for illustrative purposes and are not a recommendation for or against a certain configuration or default setting. Before adjusting any setting, you should develop a frame of reference around what you need to be monitoring to notice a meaningful change.
Adjusting networking settings while connected to the machine over a network is dangerous; you could very easily lock yourself out or completely take out your networking. Do not adjust these settings on production machines; instead make adjustments on new machines and rotate them into production, if possible.
Overview

For reference, you may want to have a copy of the device data sheet handy. This post will examine the Intel I350 Ethernet controller, controlled by the igb device driver. You can find that data sheet (warning: large PDF) here for your reference.
The high level path a packet takes from arrival to socket receive buffer is as follows:
1. Driver is loaded and initialized.
2. Packet arrives at the NIC from the network.
3. Packet is copied (via DMA) to a ring buffer in kernel memory.
4. Hardware interrupt is generated to let the system know a packet is in memory.
5. Driver calls into NAPI to start a poll loop if one was not running already.
6. ksoftirqd processes run on each CPU on the system. They are registered at boot time. The ksoftirqd processes pull packets off the ring buffer by calling the NAPI poll function that the device driver registered during initialization.
7. Memory regions in the ring buffer that have had network data written to them are unmapped.
8. Data that was DMA'd into memory is passed up the networking layer as an 'skb' for more processing.
9. Incoming network data frames are distributed among multiple CPUs if packet steering is enabled or if the NIC has multiple receive queues.
10. Network data frames are handed to the protocol layers from the queues.
11. Protocol layers process data.
12. Data is added to receive buffers attached to sockets by protocol layers.

This entire flow will be examined in detail in the following sections.
The protocol layers examined below are the IP and UDP protocol layers. Much of the information presented will serve as a reference for other protocol layers, as well.
Detailed Look

This blog post will examine Linux kernel version 3.13.0, with links to code on GitHub and code snippets throughout.
Understanding exactly how packets are received in the Linux kernel is very involved. We'll need to closely examine and understand how a network driver works, so that the later parts of the network stack are clearer.
This blog post will look at the igb network driver. This driver is used for a relatively common server NIC, the Intel Ethernet Controller I350. So, let’s start by understanding how the igb network driver works.
Network Device Driver Initialization

A driver registers an initialization function, which is called by the kernel when the driver is loaded. This function is registered by using the module_init macro.
The igb initialization function (igb_init_module) and its registration with module_init can be found in drivers/net/ethernet/intel/igb/igb_main.c.
Both are fairly straightforward:
```
/**
 *  igb_init_module - Driver Registration Routine
 *
 *  igb_init_module is the first routine called when the driver is
 *  loaded. All it does is register with the PCI subsystem.
 **/
static int __init igb_init_module(void)
{
	int ret;

	pr_info("%s - version %s\n", igb_driver_string, igb_driver_version);
	pr_info("%s\n", igb_copyright);

	/* ... */

	ret = pci_register_driver(&igb_driver);
	return ret;
}

module_init(igb_init_module);
```

The bulk of the work to initialize the device happens with the call to pci_register_driver, as we'll see next.
PCI initialization

The Intel I350 network card is a PCI Express device.

PCI devices identify themselves with a series of registers in the PCI Configuration Space.
When a device driver is compiled, a macro named MODULE_DEVICE_TABLE (from include/module.h) is used to export a table of PCI device IDs identifying devices that the device driver can control. The table is also registered as part of a structure, as we'll see shortly.
The kernel uses this table to determine which device driver to load to control the device.
That’s how the OS can figure out which devices are connected to the system and which driver should be used to talk to the device.
This table and the PCI device IDs for the igb driver can be found in drivers/net/ethernet/intel/igb/igb_main.c and drivers/net/ethernet/intel/igb/e1000_hw.h, respectively:
```
static DEFINE_PCI_DEVICE_TABLE(igb_pci_tbl) = {
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_BACKPLANE_1GBPS) },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_SGMII) },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_BACKPLANE_2_5GBPS) },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I211_COPPER), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_COPPER), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_FIBER), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SERDES), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SGMII), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_COPPER_FLASHLESS), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SERDES_FLASHLESS), board_82575 },
	/* ... */
};
MODULE_DEVICE_TABLE(pci, igb_pci_tbl);
```

As seen in the previous section, pci_register_driver is called in the driver's initialization function.
This function registers a structure of pointers. Most of the pointers are function pointers, but the PCI device ID table is also registered. The kernel uses the functions registered by the driver to bring the PCI device up.
From drivers/net/ethernet/intel/igb/igb_main.c:
```
static struct pci_driver igb_driver = {
	.name     = igb_driver_name,
	.id_table = igb_pci_tbl,
	.probe    = igb_probe,
	.remove   = igb_remove,

	/* ... */
};
```

PCI probe

Once a device has been identified by its PCI IDs, the kernel can then select the proper driver to use to control the device. Each PCI driver registers a probe function with the PCI system in the kernel. The kernel calls this function for devices which have not yet been claimed by a device driver. Once a device is claimed, other drivers will not be asked about the device. Most drivers have a lot of code that runs to get the device ready for use. The exact things done vary from driver to driver.
Some typical operations to perform include:
- Enabling the PCI device.
- Requesting memory ranges and IO ports.
- Setting the DMA mask.
- The ethtool (described more below) functions the driver supports are registered.
- Any watchdog tasks needed (for example, e1000e has a watchdog task to check if the hardware is hung).
- Other device-specific stuff like workarounds or dealing with hardware-specific quirks or similar.
- The creation, initialization, and registration of a struct net_device_ops structure. This structure contains function pointers to the various functions needed for opening the device, sending data to the network, setting the MAC address, and more.
- The creation, initialization, and registration of a high level struct net_device which represents a network device.

Let's take a quick look at some of these operations in the igb driver in the function igb_probe.
A peek into PCI initialization

The following code from the igb_probe function does some basic PCI configuration. From drivers/net/ethernet/intel/igb/igb_main.c:
```
err = pci_enable_device_mem(pdev);

/* ... */

err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));

/* ... */

err = pci_request_selected_regions(pdev, pci_select_bars(pdev,
				   IORESOURCE_MEM), igb_driver_name);

pci_enable_pcie_error_reporting(pdev);

pci_set_master(pdev);
pci_save_state(pdev);
```

First, the device is initialized with pci_enable_device_mem. This will wake up the device if it is suspended, enable memory resources, and more.
Next, the DMA mask will be set. This device can read and write to 64-bit memory addresses, so dma_set_mask_and_coherent is called with DMA_BIT_MASK(64).
Memory regions will be reserved with a call to pci_request_selected_regions, PCI Express Advanced Error Reporting is enabled (if the PCI AER driver is loaded), DMA is enabled with a call to pci_set_master, and the PCI configuration space is saved with a call to pci_save_state.
Phew.
More Linux PCI driver information

Going into a full explanation of how PCI devices work is beyond the scope of this post, but this excellent talk, this wiki, and this text file from the Linux kernel are all great resources.
Network device initialization

The igb_probe function does some important network device initialization. In addition to the PCI-specific work, it will do more general networking and network device work:
- The struct net_device_ops is registered.
- ethtool operations are registered.
- The default MAC address is obtained from the NIC.
- net_device feature flags are set.
- And lots more.

Let's take a look at each of these, as they will be interesting later.
struct net_device_ops

The struct net_device_ops contains function pointers to lots of important operations that the network subsystem needs to control the device. We'll be mentioning this structure many times throughout the rest of this post.
This net_device_ops structure is attached to a struct net_device in igb_probe. From drivers/net/ethernet/intel/igb/igb_main.c:
```
static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
	/* ... */

	netdev->netdev_ops = &igb_netdev_ops;
```

And the functions that this net_device_ops structure holds pointers to are set in the same file. From drivers/net/ethernet/intel/igb/igb_main.c:
```
static const struct net_device_ops igb_netdev_ops = {
	.ndo_open		= igb_open,
	.ndo_stop		= igb_close,
	.ndo_start_xmit		= igb_xmit_frame,
	.ndo_get_stats64	= igb_get_stats64,
	.ndo_set_rx_mode	= igb_set_rx_mode,
	.ndo_set_mac_address	= igb_set_mac,
	.ndo_change_mtu		= igb_change_mtu,
	.ndo_do_ioctl		= igb_ioctl,

	/* ... */
```

As you can see, there are several interesting fields in this struct, like ndo_open, ndo_stop, ndo_start_xmit, and ndo_get_stats64, which hold the addresses of functions implemented by the igb driver.
We’ll be looking at some of these in more detail later.
ethtool registration

ethtool is a command line program you can use to get and set various driver and hardware options. You can install it on Ubuntu by running apt-get install ethtool.
A common use of ethtool is to gather detailed statistics from network devices. Other ethtool settings of interest will be described later.
The ethtool program talks to device drivers by using the ioctl system call. The device drivers register a series of functions that run for the ethtool operations and the kernel provides the glue.
When an ioctl call is made from ethtool, the kernel finds the ethtool structure registered by the appropriate driver and executes the functions registered. The driver's ethtool function implementation can do anything from changing a simple software flag in the driver to adjusting how the actual NIC hardware works by writing register values to the device.
The igb driver registers its ethtool operations in igb_probe by calling igb_set_ethtool_ops:
```
static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
	/* ... */

	igb_set_ethtool_ops(netdev);
```

All of the igb driver's ethtool code can be found in the file drivers/net/ethernet/intel/igb/igb_ethtool.c, along with the igb_set_ethtool_ops function.
From drivers/net/ethernet/intel/igb/igb_ethtool.c:
```
void igb_set_ethtool_ops(struct net_device *netdev)
{
	SET_ETHTOOL_OPS(netdev, &igb_ethtool_ops);
}
```

Above that, you can find the igb_ethtool_ops structure with the ethtool functions the igb driver supports set to the appropriate fields.
From drivers/net/ethernet/intel/igb/igb_ethtool.c:
```
static const struct ethtool_ops igb_ethtool_ops = {
	.get_settings		= igb_get_settings,
	.set_settings		= igb_set_settings,
	.get_drvinfo		= igb_get_drvinfo,
	.get_regs_len		= igb_get_regs_len,
	.get_regs		= igb_get_regs,

	/* ... */
```

It is up to the individual drivers to determine which ethtool functions are relevant and which should be implemented. Not all drivers implement all ethtool functions, unfortunately.
One interesting ethtool function is get_ethtool_stats, which (if implemented) produces detailed statistics counters that are tracked either in software in the driver or via the device itself.
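As a hedged illustration (this is not the igb implementation; the structure, register, and counter names here are hypothetical), a driver that wants to expose counters through ethtool -S typically implements get_sset_count, get_strings, and get_ethtool_stats along these lines:

```
/* Hypothetical sketch; not taken from the igb driver. */
#include <linux/ethtool.h>
#include <linux/netdevice.h>

static const char example_stat_names[][ETH_GSTRING_LEN] = {
	"rx_packets",
	"rx_missed_by_hw",	/* hypothetical counter read from a device register */
	"rx_alloc_failed",	/* hypothetical counter tracked purely in software */
};

static int example_get_sset_count(struct net_device *netdev, int sset)
{
	if (sset == ETH_SS_STATS)
		return ARRAY_SIZE(example_stat_names);
	return -EOPNOTSUPP;
}

static void example_get_strings(struct net_device *netdev, u32 sset, u8 *data)
{
	if (sset == ETH_SS_STATS)
		memcpy(data, example_stat_names, sizeof(example_stat_names));
}

static void example_get_ethtool_stats(struct net_device *netdev,
				      struct ethtool_stats *stats, u64 *data)
{
	/* example_priv and its fields are hypothetical driver-private state. */
	struct example_priv *priv = netdev_priv(netdev);

	data[0] = priv->rx_packets;	/* tracked in software by the driver */
	data[1] = example_read_reg(priv, EXAMPLE_RX_MISSED_REG);	/* hypothetical register read */
	data[2] = priv->rx_alloc_failed;
}
```

The important point is that the driver alone decides what each string means and whether the value comes from software state or a hardware register, which is why the labels can be misleading.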
The monitoring section below will show how to use ethtool to access these detailed statistics.
IRQs

When a data frame is written to RAM via DMA, how does the NIC tell the rest of the system that data is ready to be processed?
Traditionally, a NIC would generate an interrupt request (IRQ) indicating data had arrived. There are three common types of IRQs: MSI-X, MSI, and legacy IRQs. These will be touched upon shortly. A device generating an IRQ when data has been written to RAM via DMA is simple enough, but if large numbers of data frames arrive this can lead to a large number of IRQs being generated. The more IRQs that are generated, the less CPU time is available for higher level tasks like user processes.
The New API (NAPI) was created as a mechanism for reducing the number of IRQs generated by network devices on packet arrival. While NAPI reduces the number of IRQs, it cannot eliminate them completely.
We’ll see why that is, exactly, in later sections.
NAPI

NAPI differs from the legacy method of harvesting data in several important ways. NAPI allows a device driver to register a poll function that the NAPI subsystem will call to harvest data frames.
The intended use of NAPI in network device drivers is as follows:
1. NAPI is enabled by the driver, but it is in the off position initially.
2. A packet arrives and is DMA'd to memory by the NIC.
3. An IRQ is generated by the NIC, which triggers the IRQ handler in the driver.
4. The driver wakes up the NAPI subsystem using a softirq (more on these later). This will begin harvesting packets by calling the driver's registered poll function in a separate thread of execution.
5. The driver should disable further IRQs from the NIC. This is done to allow the NAPI subsystem to process packets without interruption from the device.
6. Once there is no more work to do, the NAPI subsystem is disabled and IRQs from the device are re-enabled.
7. The process starts back at step 2.

This method of gathering data frames has reduced overhead compared to the legacy method, because many data frames can be consumed at a time without having to deal with processing each of them one IRQ at a time. A simplified sketch of this pattern is shown below.
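To make the flow above concrete, here is a simplified, hypothetical sketch of the pattern (not the igb code): the hardware IRQ handler masks device IRQs and schedules NAPI, and the registered poll function harvests up to budget frames before re-enabling IRQs. The struct example_ring and the example_* helpers are assumptions used only for illustration.

```
/* Hypothetical sketch of the NAPI pattern described above; not the igb code. */
static irqreturn_t example_irq_handler(int irq, void *data)
{
	struct example_ring *ring = data;

	/* Mask further IRQs from the device (a device-specific register write),
	 * then hand packet harvesting over to NAPI, which runs in softirq context.
	 */
	example_disable_device_irqs(ring);	/* hypothetical helper */
	napi_schedule(&ring->napi);

	return IRQ_HANDLED;
}

static int example_poll(struct napi_struct *napi, int budget)
{
	struct example_ring *ring = container_of(napi, struct example_ring, napi);
	int work_done;

	/* Harvest up to `budget` frames from the RX ring. */
	work_done = example_clean_rx_ring(ring, budget);	/* hypothetical helper */

	if (work_done < budget) {
		/* All pending work is finished: leave polled mode and
		 * re-enable IRQs from the device.
		 */
		napi_complete(napi);
		example_enable_device_irqs(ring);	/* hypothetical helper */
	}

	return work_done;
}
```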
The device driver implements a poll function and registers it with NAPI by calling netif_napi_add. When registering a NAPI poll function with netif_napi_add, the driver will also specify the weight. Most drivers hardcode a value of 64. This value and its meaning will be described in more detail below.
Typically, drivers register their NAPI poll functions during driver initialization.
NAPI initialization in the igb driver

The igb driver does this via a long call chain:
1. igb_probe calls igb_sw_init.
2. igb_sw_init calls igb_init_interrupt_scheme.
3. igb_init_interrupt_scheme calls igb_alloc_q_vectors.
4. igb_alloc_q_vectors calls igb_alloc_q_vector.
5. igb_alloc_q_vector calls netif_napi_add.

This call trace results in a few high level things happening:
- If MSI-X is supported, it will be enabled with a call to pci_enable_msix.
- Various settings are computed and initialized; most notably the number of transmit and receive queues that the device and driver will use for sending and receiving packets.
- igb_alloc_q_vector is called once for every transmit and receive queue that will be created.
- Each call to igb_alloc_q_vector calls netif_napi_add to register a poll function for that queue and an instance of struct napi_struct that will be passed to poll when called to harvest packets.

Let's take a look at igb_alloc_q_vector to see how the poll callback and its private data are registered.
From drivers/net/ethernet/intel/igb/igb_main.c:
```
static int igb_alloc_q_vector(struct igb_adapter *adapter,
			      int v_count, int v_idx,
			      int txr_count, int txr_idx,
			      int rxr_count, int rxr_idx)
{
	/* ... */

	/* allocate q_vector and rings */
	q_vector = kzalloc(size, GFP_KERNEL);
	if (!q_vector)
		return -ENOMEM;

	/* initialize NAPI */
	netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64);

	/* ... */
```

The above code is allocating memory for a receive queue and registering the function igb_poll with the NAPI subsystem. It provides a reference to the struct napi_struct associated with this newly created RX queue (&q_vector->napi above). This will be passed to igb_poll when it is called by the NAPI subsystem when it comes time to harvest packets from this RX queue.
This will be important later when we examine the flow of data from drivers up the network stack.
Bringing a network device up

Recall the net_device_ops structure we saw earlier, which registered a set of functions for bringing the network device up, transmitting packets, setting the MAC address, etc.
When a network device is brought up (for example, with ifconfig eth0 up), the function attached to the ndo_open field of the net_device_ops structure is called.
The ndo_open function will typically do things like:
- Allocate RX and TX queue memory
- Enable NAPI
- Register an interrupt handler
- Enable hardware interrupts
- And more.

In the case of the igb driver, the function attached to the ndo_open field of the net_device_ops structure is called igb_open.
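For orientation, here is a hedged sketch of what a typical ndo_open implementation does. This is not igb_open; struct example_priv and the example_* helpers are assumptions for illustration.

```
/* Hypothetical sketch of a typical ndo_open implementation; not igb_open. */
static int example_open(struct net_device *netdev)
{
	struct example_priv *priv = netdev_priv(netdev);
	int err;

	/* Allocate descriptor rings and DMA buffers for RX and TX. */
	err = example_setup_all_rx_resources(priv);	/* hypothetical helper */
	if (err)
		return err;

	err = example_setup_all_tx_resources(priv);	/* hypothetical helper */
	if (err)
		goto err_free_rx;

	/* Enable NAPI, register the IRQ handler, and unmask device IRQs. */
	napi_enable(&priv->napi);

	err = request_irq(priv->irq, example_irq_handler, 0, netdev->name, priv);
	if (err)
		goto err_disable_napi;

	example_enable_device_irqs(priv);	/* hypothetical helper */
	netif_start_queue(netdev);

	return 0;

err_disable_napi:
	napi_disable(&priv->napi);
	example_free_all_tx_resources(priv);
err_free_rx:
	example_free_all_rx_resources(priv);
	return err;
}
```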
Preparing to receive data from the network

Most NICs you'll find today will use DMA to write data directly into RAM, where the OS can retrieve the data for processing. The data structure most NICs use for this purpose resembles a queue built on a circular buffer (or a ring buffer).
In order to do this, the device driver must work with the OS to reserve a region of memory that the NIC hardware can use. Once this region is reserved, the hardware is informed of its location and incoming data will be written to RAM where it will later be picked up and processed by the networking subsystem.
This seems simple enough, but what if the packet rate were high enough that a single CPU could not properly process all incoming packets? The data structure is built on a fixed-length region of memory, so incoming packets would be dropped.
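Conceptually, a receive ring is a fixed-size array of descriptors handed to the NIC, plus indexes tracked by the driver. The sketch below is illustrative only; real descriptor layouts are device specific and documented in the device data sheet.

```
/* Illustrative sketch only; real NIC descriptor formats are device specific. */
struct example_rx_desc {
	u64 buffer_addr;	/* DMA address of a packet buffer given to the NIC */
	u16 length;		/* filled in by the NIC when a frame is written */
	u16 status;		/* e.g. a "descriptor done" bit set by the NIC */
};

struct example_rx_ring {
	struct example_rx_desc *desc;	/* fixed-size array in DMA-able memory */
	void **buffers;			/* CPU-side pointers to the packet data */
	u16 count;			/* number of descriptors in the ring */
	u16 next_to_use;		/* next slot the driver gives to the NIC */
	u16 next_to_clean;		/* next slot the driver will harvest */
};
```

When every descriptor is in use and the driver has not yet harvested and refilled any of them, newly arriving frames have nowhere to go and are dropped by the hardware.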
This is where something known as Receive Side Scaling (RSS), or multiqueue, can help.
Some devices have the ability to write incoming packets to several different regions of RAM simultaneously; each region is a separate queue. This allows the OS to use multiple CPUs to process incoming data in parallel, starting at the hardware level. This feature is not supported by all NICs.
The Intel I350 NIC does support multiple queues. We can see evidence of this in the igb driver. One of the first things the igb driver does when it is brought up is call a function named igb_setup_all_rx_resources . This function calls another function, igb_setup_rx_resources , once for each RX queue to arrange for DMA-able memory where the device will write incoming data.
If you are curious how exactly this works, please see the Linux kernel’s DMA API HOWTO .
It turns out the number and size of the RX queues can be tuned by using ethtool . Tuning these values can have a noticeable impact on the number of frames which are processed vs the number of frames which are dropped.
The NIC uses a hash function on the packet header fields (like source, destination, port, etc) to determine which RX queue the data should be directed to.
Some NICs let you adjust the weight of the RX queues, so you can send more traffic to specific queues.
Fewer NICs let you adjust the hash function itself. If you can adjust the hash function, you can send certain flows to specific RX queues for processing, or even drop the packets at the hardware level, if desired.
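Conceptually, queue selection looks something like the sketch below. This is a simplification: real NICs typically compute a Toeplitz hash in hardware and index a configurable indirection table (which ethtool -x / -X, shown later, reads and writes); the hash used here is only for illustration.

```
/* Simplified illustration of RSS queue selection; not a real NIC implementation. */
struct example_flow_key {
	u32 src_ip;
	u32 dst_ip;
	u16 src_port;
	u16 dst_port;
};

static u32 example_select_rx_queue(const struct example_flow_key *key,
				   const u8 *indirection_table, u32 table_size)
{
	/* Real hardware usually computes a Toeplitz hash over these fields;
	 * any deterministic hash illustrates the idea.
	 */
	u32 hash = key->src_ip ^ key->dst_ip ^
		   (((u32)key->src_port << 16) | key->dst_port);

	/* The hash indexes an indirection table whose entries name RX queues. */
	return indirection_table[hash % table_size];
}
```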
We’ll take a look at how to tune these settings shortly.
Enable NAPI

When a network device is brought up, a driver will usually enable NAPI.
We saw earlier how drivers register poll functions with NAPI, but NAPI is not usually enabled until the device is brought up.
Enabling NAPI is relatively straightforward. A call to napi_enable will flip a bit in the struct napi_struct to indicate that it is now enabled. As mentioned above, while NAPI will be enabled, it starts in the off position.
In the case of the igb driver, NAPI is enabled for each q_vector that was initialized when the driver was loaded or when the queue count or size are changed with ethtool .
From drivers/net/ethernet/intel/igb/igb_main.c:
```
for (i = 0; i < adapter->num_q_vectors; i++)
	napi_enable(&(adapter->q_vector[i]->napi));
```
Register an interrupt handler

After enabling NAPI, the next step is to register an interrupt handler. There are different methods a device can use to signal an interrupt: MSI-X, MSI, and legacy interrupts. As such, the code differs from device to device depending on what the supported interrupt methods are for a particular piece of hardware.
The driver must determine which method is supported by the device and register the appropriate handler function that will execute when the interrupt is received.
Some drivers, like the igb driver, will try to register an interrupt handler with each method, falling back to the next untested method on failure.
MSI-X interrupts are the preferred method, especially for NICs that support multiple RX queues. This is because each RX queue can have its own hardware interrupt assigned, which can then be handled by a specific CPU (with irqbalance or by modifying /proc/irq/IRQ_NUMBER/smp_affinity ). As we’ll see shortly, the CPU that handles the interrupt will be the CPU that processes the packet. In this way, arriving packets can be processed by separate CPUs from the hardware interrupt level up through the networking stack.
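As an aside, here is a hedged userspace sketch of pinning an IRQ to CPU 2 by writing a hexadecimal CPU bitmask to its smp_affinity file. The IRQ number used below is a made-up example; find the real one for your RX queue in /proc/interrupts, and note that irqbalance may overwrite this value unless it is configured to leave that IRQ alone.

```
/* Userspace sketch: pin a (hypothetical) IRQ to CPU 2 via smp_affinity.
 * Must be run as root.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const int irq = 43;	/* hypothetical IRQ number for one RX queue */
	const char *mask = "4";	/* hex CPU bitmask: bit 2 set => CPU 2 */
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);

	f = fopen(path, "w");
	if (!f) {
		perror("fopen");
		return EXIT_FAILURE;
	}

	fprintf(f, "%s\n", mask);
	fclose(f);

	return EXIT_SUCCESS;
}
```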
If MSI-X is unavailable, MSI still presents advantages over legacy interrupts and will be used by the driver if the device supports it. Read this useful wiki page for more information about MSI and MSI-X.
In the igb driver, the functions igb_msix_ring, igb_intr_msi, and igb_intr are the interrupt handler methods for the MSI-X, MSI, and legacy interrupt modes, respectively.
You can find the code in the driver which attempts each interrupt method in drivers/net/ethernet/intel/igb/igb_main.c:
```
static int igb_request_irq(struct igb_adapter *adapter)
{
	struct net_device *netdev = adapter->netdev;
	struct pci_dev *pdev = adapter->pdev;
	int err = 0;

	if (adapter->msix_entries) {
		err = igb_request_msix(adapter);
		if (!err)
			goto request_done;
		/* fall back to MSI */

		/* ... */
	}

	/* ... */

	if (adapter->flags & IGB_FLAG_HAS_MSI) {
		err = request_irq(pdev->irq, igb_intr_msi, 0,
				  netdev->name, adapter);
		if (!err)
			goto request_done;

		/* fall back to legacy interrupts */

		/* ... */
	}

	err = request_irq(pdev->irq, igb_intr, IRQF_SHARED,
			  netdev->name, adapter);

	if (err)
		dev_err(&pdev->dev, "Error %d getting interrupt\n", err);

request_done:
	return err;
}
```

As you can see in the abbreviated code above, the driver first attempts to set an MSI-X interrupt handler with igb_request_msix, falling back to MSI on failure. Next, request_irq is used to register igb_intr_msi, the MSI interrupt handler. If this fails, the driver falls back to legacy interrupts; request_irq is used again to register the legacy interrupt handler igb_intr.
And this is how the igb driver registers a function that will be executed when the NIC raises an interrupt signaling that data has arrived and is ready for processing.
Enable Interrupts

At this point, almost everything is set up. The only thing left is to enable interrupts from the NIC and wait for data to arrive. Enabling interrupts is hardware specific, but the igb driver does this in __igb_open by calling a helper function named igb_irq_enable.
Interrupts are enabled for this device by writing to registers:
```
static void igb_irq_enable(struct igb_adapter *adapter)
{
	/* ... */

	wr32(E1000_IMS, IMS_ENABLE_MASK | E1000_IMS_DRSTA);
	wr32(E1000_IAM, IMS_ENABLE_MASK | E1000_IMS_DRSTA);

	/* ... */
}
```

The network device is now up

Drivers may do a few more things, like start timers, work queues, or other hardware-specific setup. Once that is completed, the network device is up and ready for use.
Let’s take a look at monitoring and tuning settings for network device drivers.
Monitoring network devices

There are several different ways to monitor your network devices, offering different levels of granularity and complexity. Let's start with the most granular and move to the least granular.
Using ethtool -S

You can install ethtool on an Ubuntu system by running: sudo apt-get install ethtool.
Once it is installed, you can access the statistics by passing the -S flag along with the name of the network device you want statistics about.
Monitor detailed NIC device statistics (e.g., packet drops) with `ethtool -S`.
```
$ sudo ethtool -S eth0
NIC statistics:
     rx_packets: 597028087
     tx_packets: 5924278060
     rx_bytes: 112643393747
     tx_bytes: 990080156714
     rx_broadcast: 96
     tx_broadcast: 116
     rx_multicast: 20294528
     ....
```
Monitoring this data can be difficult. It is easy to obtain, but there is no standardization of the field values. Different drivers, or even different versions of the same driver, might produce different field names that have the same meaning.
You should look for values with “drop”, “buffer”, “miss”, etc in the label. Next, you will have to read your driver source. You’ll be able to determine which values are accounted for totally in software (e.g., incremented when there is no memory) and which values come directly from hardware via a register read. In the case of a register value, you should consult the data sheet for your hardware to determine what the meaning of the counter really is; many of the labels given via ethtool can be misleading.
Using sysfs

sysfs also provides a lot of statistics values, but they are slightly higher level than the direct NIC-level stats described above.
You can find the number of dropped incoming network data frames for, e.g., eth0 by using cat on a file.
Monitor higher level NIC statistics with sysfs.
```
$ cat /sys/class/net/eth0/statistics/rx_dropped
2
```
The counter values will be split into files like collisions, rx_dropped, rx_errors, rx_missed_errors, etc.
Unfortunately, it is up to the drivers to decide what the meaning of each field is, and thus, when to increment them and where the values come from. You may notice that some drivers count a certain type of error condition as a drop, but other drivers may count the same condition as a miss.
If these values are critical to you, you will need to read your driver source to understand exactly what your driver thinks each of these values means.
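If you do decide a particular counter matters, a small monitoring loop can watch it for changes. The sketch below polls rx_dropped for eth0 once a second and prints the delta; the interface name and interval are assumptions chosen for illustration.

```
/* Minimal monitoring sketch: watch one sysfs counter and report increases. */
#include <stdio.h>
#include <unistd.h>

static long long read_counter(const char *path)
{
	long long value;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fscanf(f, "%lld", &value) != 1)
		value = -1;
	fclose(f);

	return value;
}

int main(void)
{
	const char *path = "/sys/class/net/eth0/statistics/rx_dropped";
	long long prev = read_counter(path);

	while (prev >= 0) {
		long long cur;

		sleep(1);
		cur = read_counter(path);
		if (cur < 0)
			break;
		if (cur != prev)
			printf("rx_dropped increased by %lld (now %lld)\n",
			       cur - prev, cur);
		prev = cur;
	}

	return 0;
}
```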
Using /proc/net/dev

An even higher level file is /proc/net/dev, which provides high-level summary-esque information for each network adapter on the system.
Monitor high level NIC statistics by reading /proc/net/dev .
```
$ cat /proc/net/dev
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 110346752214 597737500    0    2    0     0          0  20963860 990024805984 6066582604    0    0    0     0       0          0
    lo: 428349463836 1579868535    0    0    0     0          0         0 428349463836 1579868535    0    0    0     0       0          0
```
This file shows a subset of the values you’ll find in the sysfs files mentioned above, but it may serve as a useful general reference.
The caveat mentioned above applies here, as well: if these values are important to you, you will still need to read your driver source to understand exactly when, where, and why they are incremented to ensure your understanding of an error, drop, or fifo is the same as your driver's.
Tuning network devices

Check the number of RX queues being used

If your NIC and the device driver loaded on your system support RSS / multiqueue, you can usually adjust the number of RX queues (also called RX channels) by using ethtool.
Check the number of NIC receive queues with ethtool
```
$ sudo ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:		0
TX:		0
Other:		0
Combined:	8
Current hardware settings:
RX:		0
TX:		0
Other:		0
Combined:	4
```
This output is displaying the pre-set maximums (enforced by the driver and the hardware) and the current settings.
Note: not all device drivers will have support for this operation.
Error seen if your NIC doesn't support this operation.
```
$ sudo ethtool -l eth0
Channel parameters for eth0:
Cannot get device channel parameters: Operation not supported
```
This means that your driver has not implemented the ethtool get_channels operation. This could be because the NIC doesn’t support adjusting the number of queues, doesn’t support RSS / multiqueue, or your driver has not been updated to handle this feature.
Adjusting the number of RX queues

Once you've found the current and maximum queue count, you can adjust the values by using sudo ethtool -L.
Note: some devices and their drivers only support combined queues that are paired for transmit and receive, as in the example in the above section.
Set combined NIC transmit and receive queues to 8 with ethtool -L
$ sudo ethtool -L eth0 combined 8
If your device and driver support individual settings for RX and TX and you’d like to change only the RX queue count to 8, you would run:
Set the number of NIC receive queues to 8 with ethtool -L .
$ sudo ethtool -L eth0 rx 8
Note: making these changes will, for most drivers, take the interface down and then bring it back up; connections to this interface will be interrupted. This may not matter much for a one-time change, though.
Adjusting the size of the RX queues

Some NICs and their drivers also support adjusting the size of the RX queue. Exactly how this works is hardware specific, but luckily ethtool provides a generic way for users to adjust the size. Increasing the size of the RX queue can help prevent network data drops at the NIC during periods where large numbers of data frames are received. Data may still be dropped in software, though, and other tuning is required to reduce or eliminate drops completely.
Check current NIC queue sizes with ethtool -g
```
$ sudo ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:		4096
RX Mini:	0
RX Jumbo:	0
TX:		4096
Current hardware settings:
RX:		512
RX Mini:	0
RX Jumbo:	0
TX:		512
```
The above output indicates that the hardware supports up to 4096 receive and transmit descriptors, but it is currently only using 512.
Increase size of each RX queue to 4096 with ethtool -G
$ sudo ethtool -G eth0 rx 4096
Note: making these changes will, for most drivers, take the interface down and then bring it back up; connections to this interface will be interrupted. This may not matter much for a one-time change, though.
Adjusting the processing weight of RX queues

Some NICs support the ability to adjust the distribution of network data among the RX queues by setting a weight.
You can configure this if:
- Your NIC supports flow indirection.
- Your driver implements the ethtool functions get_rxfh_indir_size and get_rxfh_indir.
- You are running a new enough version of ethtool that has support for the command line options -x and -X to show and set the indirection table, respectively.

Check the RX flow indirection table with ethtool -x
```
$ sudo ethtool -x eth0
RX flow hash indirection table for eth0 with 2 RX ring(s):
    0:      0     1     0     1     0     1     0     1
    8:      0     1     0     1     0     1     0     1
   16:      0     1     0     1     0     1     0     1
   24:      0     1     0     1     0     1     0     1
```
This output shows packet hash values on the left, with receive queues 0 and 1 listed across each row. So, a packet which hashes to 2 will be delivered to receive queue 0, while a packet which hashes to 3 will be delivered to receive queue 1.
Example: spread processing evenly between first 2 RX queues
$ sudo ethtool -X eth0 equal 2
If you want to set custom weights to alter the number of packets which hit certain receive queues (and thus CPUs), you can specify those on the command line, as well:
Set custom RX queue weights with ethtool -X
$ sudo ethtool -X eth0 weight 6 2
The above command specifies a weight of 6 for rx queue 0 and 2 for rx queue 1, pushing much more data to be processed on queue 0.
Some NICs will also let you adjust the fields which will be used in the hash algorithm, as we'll see now.
Adjusting the rx hash fields for network flows

You can use ethtool to adjust the fields that will be used when computing a hash for use with RSS.
Check which fields are used for UDP RX flow hash with ethtool -n .
```
$ sudo ethtool -n eth0 rx-flow-hash udp4
UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA
```
For eth0, the fields that are used for computing a hash on UDP flows are the IPv4 source and destination addresses. Let's include the source and destination ports:
Set UDP RX flow hash fields with ethtool -N .
$ sudo ethtool -N eth0 rx-flow-hash udp4 sdfn
The sdfn string is a bit cryptic; check the ethtool man page for an explanation of each letter.
Adjusting the fields to take a hash on is useful, but ntuple filtering is even more useful for finer grained control over which flows will be handled by which RX queue.
ntuple filtering for steering network flows

Some NICs support a feature known as "ntuple filtering." This feature allows the user to specify (via ethtool) a set of parameters to use to filter incoming network data in hardware and queue it to a particular RX queue. For example, the user can specify that TCP packets destined to a particular port should be sent to RX queue 1.
On Intel NICs this feature is commonly known as Intel Ethernet Flow Director . Other NIC vendors may have other marketing names for this feature.
As we’ll see later, ntuple filtering is a crucial component of another feature called Accelerated Receive Flow Steering (aRFS), which makes using ntuple much easier if your NIC supports it. aRFS will be covered later.
This feature can be useful if the operational requirements of the system involve maximizing data locality with the hope of increasing CPU cache hit rates when processing network data. For example, consider the following configuration for a webserver running on port 80:
- A webserver running on port 80 is pinned to run on CPU 2.
- IRQs for an RX queue are assigned to be processed by CPU 2.
- TCP traffic destined to port 80 is 'filtered' with ntuple to CPU 2.
- All incoming traffic to port 80 is then processed by CPU 2 starting at data arrival to the userland program.
- Careful monitoring of the system including cache hit rates and networking stack latency will be needed to determine effectiveness.

As mentioned, ntuple filtering can be configured with ethtool, but first, you'll need to ensure that this feature is enabled on your device.
Check if ntuple filters are enabled with ethtool -k
```
$ sudo ethtool -k eth0
Offload parameters for eth0:
...
ntuple-filters: off
receive-hashing: on
```
As you can see, ntuple-filters are set to off on this device.
Enable ntuple filters with ethtool -K
$ sudo ethtool -K eth0 ntuple on
Once you’ve enabled ntuple filters, or verified that it is enabled, you can check the existing ntuple rules by using ethtool :
Check existing ntuple filters with ethtool -u
$ sudo ethtool -u eth0 40 RX rings available Total 0 rules
As you can see, this device has no ntuple filter rules. You can add a rule by specifying it on the command line to ethtool . Let’s add a rule to direct all TCP traffic with a destination port of 80 to RX queue 2:
Add ntuple filter to send TCP flows with destination port 80 to RX queue 2
$ sudo ethtool -U eth0 flow-type tcp4 dst-port 80 action 2
You can also use ntuple filtering to drop packets for particular flows at the hardware level. This can be useful for mitigating heavy incoming traffic from specific IP addresses. For more information about configuring ntuple filter rules, see the ethtool man page.
You can usually get statistics about the success (or failure) of your ntuple rules by checking values output from ethtool -S [device name]. For example, on Intel NICs, the statistics fdir_match and fdir_miss count the number of matches and misses for your ntuple filtering rules. Consult your device driver source and device data sheet for tracking down statistics counters (if available).

SoftIRQs

Before examining the network stack, we'll need to take a short detour to examine something in the Linux kernel called SoftIRQs.
What is a softirq?

The softirq system in the Linux kernel is a mechanism for executing code outside of the context of an interrupt handler implemented in a driver. This system is important because hardware interrupts may be disabled during all or part of the execution of an interrupt handler. The longer interrupts are disabled, the greater the chance that events may be missed. So, it is important to defer any long-running actions outside of the interrupt handler so that it can complete as quickly as possible and re-enable interrupts from the device.
There are other mechanisms that can be used for deferring work in the kernel, but for the purposes of the networking stack, we’ll be looking at softirqs.
The softirq system can be imagined as a series of kernel threads (one per CPU) that run handler functions which have been registered for different softirq events. If you’ve ever looked at top and seen ksoftirqd/0 in the list of kernel threads, you were looking at the softirq kernel thread running on CPU 0.
Kernel subsystems (like networking) can register a softirq handler by executing the open_softirq function. We’ll see later how the networking system registers its softirq handlers. For now, let’s learn a bit more about how softirqs work.
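As a preview of that pattern, here is a heavily simplified sketch of registering and raising a softirq. The handler and init function here are hypothetical; as we'll see later, the networking subsystem registers its own handlers (for NET_RX_SOFTIRQ and NET_TX_SOFTIRQ) during initialization.

```
/* Hypothetical sketch of the softirq pattern; not the networking stack's code. */
static void example_softirq_handler(struct softirq_action *h)
{
	/* Deferred work runs here, outside the hardware IRQ handler. */
}

static int __init example_subsystem_init(void)
{
	/* Associate a handler with one of the fixed softirq slots at boot.
	 * (The networking stack uses NET_RX_SOFTIRQ and NET_TX_SOFTIRQ.)
	 */
	open_softirq(NET_RX_SOFTIRQ, example_softirq_handler);
	return 0;
}

/* Later, typically from a hardware IRQ handler, pending work is flagged
 * so that ksoftirqd (or the softirq exit path) will run the handler.
 */
static void example_mark_work_pending(void)
{
	raise_softirq(NET_RX_SOFTIRQ);
}
```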
ksoftirqd

Since softirqs are so important for deferring the work of device drivers, you might imagine that the ksoftirqd process is spawned pretty early in the life cycle of the kernel, and you'd be correct.
Looking at the code found in kernel/softirq.c reveals how the ksoftirqd system is initialized:
```
static struct smp_hotplug_thread softirq_threads = {
	.store			= &ksoftirqd,
	.thread_should_run	= ksoftirqd_should_run,
	.thread_fn		= run_ksoftirqd,
	.thread_comm		= "ksoftirqd/%u",
};

static __init int spawn_ksoftirqd(void)
{
	register_cpu_notifier(&cpu_nfb);

	BUG_ON(smpboot_register_percpu_thread(&softirq_threads));

	return 0;
}
early_initcall(spawn_ksoftirqd);
```

As you can see from the struct smp_hotplug_thread definition above, two function pointers are being registered: ksoftirqd_should_run and run_ksoftirqd.
Both of these functions are called from kernel/smpboot.c as part of something which resembles an event loop.
The code in kernel/smpboot.c first calls ksoftirqd_should_run, which determines whether there are any pending softirqs; if there are, run_ksoftirqd is executed. run_ksoftirqd does some minor bookkeeping before it calls __do_softirq.
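In rough outline (this is a simplification for illustration, not the literal kernel/smpboot.c code), each per-CPU thread created from a struct smp_hotplug_thread behaves like this:

```
/* Rough simplification of the per-CPU smpboot thread loop; not literal kernel code. */
static void example_smpboot_thread_loop(struct smp_hotplug_thread *ht,
					unsigned int cpu)
{
	while (!kthread_should_stop()) {
		set_current_state(TASK_INTERRUPTIBLE);

		if (!ht->thread_should_run(cpu)) {
			/* For ksoftirqd: ksoftirqd_should_run() reports no
			 * pending softirq, so sleep until woken.
			 */
			schedule();
			continue;
		}

		__set_current_state(TASK_RUNNING);

		/* For ksoftirqd: run_ksoftirqd(), which does a bit of
		 * bookkeeping and then calls __do_softirq().
		 */
		ht->thread_fn(cpu);
	}
}
```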
__do_softirq

The __do_softirq function does a few interesting things:
- determines which softirq is pending
- softirq time is