NUMA, Multi-Queue NICs and CPU Affinity

References:
http://www.intel.com/content/dam/doc/application-note/82575-82576-82598-82599-ethernet-controllers-interrupts-appl-note.pdf
https://en.wikipedia.org/wiki/Non-uniform_memory_access
https://en.wikipedia.org/wiki/Multi-channel_memory_architecture
https://blog.cloudflare.com/how-to-receive-a-million-packets/
https://www.kernel.org/doc/Documentation/networking/scaling.txt
https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/
https://blogs.technet.microsoft.com/pracheta/2014/01/22/numa-understand-it-its-usefulness-with-windows-server-2012/

Contents:
Queue/Interrupts Affinity
NUMA
RSS/RPS/RFS

Queue/Interrupts Affinity:

NB: The cores on one socket of a dual-socket system can be assigned non-consecutive numbers. For example, on a dual-socket quad-core system, Socket 0 may have CPUs 0, 2, 4, 6 and Socket 1 CPUs 1, 3, 5, 7 on x86_64, while on x86_32 Socket 0 may have CPUs 0, 1, 2, 3 and Socket 1 CPUs 4, 5, 6, 7.
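
Because the numbering differs between systems, it is safer to read the layout from the machine than to assume it. The following is a minimal sketch, assuming the standard Linux sysfs file topology/physical_package_id is present for each CPU. The function name cpus_by_socket and its optional sysfs-root argument are made up for illustration and are not part of any tool.

```shell
# cpus_by_socket [SYSFS_ROOT]
# Print one "socket <id>: <cpu> <cpu> ..." line per physical package.
# SYSFS_ROOT defaults to /sys; it is a parameter only so the function can be
# pointed at a copy of the tree.
cpus_by_socket() (
    sysfs="${1:-/sys}"
    for d in "$sysfs"/devices/system/cpu/cpu[0-9]*; do
        f="$d/topology/physical_package_id"
        [ -r "$f" ] || continue
        # Emit "<socket> <cpu>" pairs, e.g. "0 2"
        echo "$(cat "$f") ${d##*/cpu}"
    done | sort -n -k1,1 -k2,2 | awk '
        NR == 1 || $1 != last { if (NR > 1) printf "\n"; printf "socket %s:", $1; last = $1 }
        { printf " %s", $2 }
        END { if (NR > 0) printf "\n" }'
)
```

Running "cpus_by_socket" with no arguments reads the live system. "lscpu" reports the same information where it is installed.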

By default, Linux automatically assigns interrupts to processor cores. Two methods currently exist for automatically assigning the interrupts: an in-kernel IRQ balancer and the irqbalance daemon in user space. For IP forwarding, a transmit/receive queue pair should use the same processor core, which reduces cache synchronization between different cores. This can be done by assigning the transmit and receive interrupts to specific cores, after stopping the automatic balancing:

$ sudo service irqbalance stop
$ sudo service iptables stop


# Number of logical cores in the system

$ grep processor /proc/cpuinfo
processor       : 0
processor       : 1

# or run "nproc"

# To check how many queues a NIC has run:
sudo ethtool -l eth1

# See interrupts for this machine
$ cat /proc/interrupts
     CPU0  CPU1
29      0     0  PCI-MSI-edge  eth1-tx-0
30      0     0  PCI-MSI-edge  eth1-tx-1
31      0     0  PCI-MSI-edge  eth1-tx-2
32      0     0  PCI-MSI-edge  eth1-tx-3
33      5     0  PCI-MSI-edge  eth1-rx-0
34      5     0  PCI-MSI-edge  eth1-rx-1
35      5     0  PCI-MSI-edge  eth1-rx-2
36      5     0  PCI-MSI-edge  eth1-rx-3
37      2     0  PCI-MSI-edge  eth1
39      0     0  PCI-MSI-edge  eth2-tx-0
40      0     0  PCI-MSI-edge  eth2-tx-1
41      0     0  PCI-MSI-edge  eth2-tx-2
42      0     0  PCI-MSI-edge  eth2-tx-3
43      6     0  PCI-MSI-edge  eth2-rx-0
44      6     0  PCI-MSI-edge  eth2-rx-1
45      6     0  PCI-MSI-edge  eth2-rx-2
46      6     0  PCI-MSI-edge  eth2-rx-3
47      2     0  PCI-MSI-edge  eth2

The interrupts can be pinned to specific processor cores by writing a CPU bitmask to each interrupt's smp_affinity file. Each processor corresponds to one bit (2^processor_number) and the mask is written in hexadecimal, so processor 0 is 1, processor 1 is 2, processor 2 is 4 and processor 3 is 8. To assign eth1-tx-0 (IRQ 29) to processor 1, echo the mask to /proc/irq/29/smp_affinity as root.

echo 2 > /proc/irq/29/smp_affinity
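
The mask arithmetic is easy to get wrong once the CPU number exceeds 3, because the value has to be hexadecimal (processor 4 is 10, not 16). A minimal sketch of the 2^processor_number conversion follows. The helper names cpu_to_mask and cpus_to_mask are made up for illustration, and the code relies on 64-bit shell arithmetic, so it only covers CPU numbers below 63.

```shell
# cpu_to_mask CPU
# Print the hexadecimal mask that selects only CPU, i.e. 2^CPU.
cpu_to_mask() {
    printf '%x\n' $(( 1 << $1 ))
}

# cpus_to_mask CPU...
# OR the bits of several CPUs together into one hexadecimal mask.
cpus_to_mask() (
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
)
```

For example, "cpu_to_mask 1" prints 2, the value written to /proc/irq/29/smp_affinity above, and "cpus_to_mask 0 1" prints 3, which would allow the interrupt on either processor.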

The IRQs for a specific device can also be listed with "ls /sys/class/net/eth3/device/msi_irqs/".

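The script below is run on a system with just 2 TX and 2 RX queues per NIC, and 2 CPU cores. It fans out the interrupts for a specific NIC across all available cores. The argument is matched against the names in /proc/interrupts, so passing "eth3-" rather than "eth3" restricts the match to the per-queue IRQs (eth3-rx-0, eth3-tx-0 and so on):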

$ sudo ./irq_balance.sh eth3-
CPUs: 2
Existing irq affinity for eth3-:
eth3-rx-0 01
eth3-rx-1 01
eth3-tx-0 01
eth3-tx-1 01
Assign SMP affinity: eth3- queue 0, irq 50, cpu 1, mask 0x2
Assign SMP affinity: eth3- queue 1, irq 51, cpu 0, mask 0x1
Assign SMP affinity: eth3- queue 2, irq 52, cpu 1, mask 0x2
Assign SMP affinity: eth3- queue 3, irq 53, cpu 0, mask 0x1

$ sudo cat /proc/irq/5[0-3]/smp_affinity
02
01
02
01


$ cat irq_balance.sh
#!/bin/bash

# A remake of https://gist.github.com/pavel-odintsov/9b065f96900da40c5301

if [ -z "$1" ]; then
    echo
    echo "usage: $0 [network-interface]"
    echo
    echo "e.g. $0 eth0"
    echo
    echo "adjusts irq balance for NIC queues"
    exit 1
fi
dev="$1"


# Nothing to balance on a single-CPU system
ncpus=$(nproc)
test "$ncpus" -gt 1 || exit 1
echo "CPUs: $ncpus"


# IRQ numbers of all /proc/interrupts lines whose text matches $dev
irqs=$(awk -v dev="$dev" '$0 ~ dev {sub(/:$/, "", $1); print $1}' /proc/interrupts)


echo "Existing irq affinity for $dev:"
for irq in $irqs
do
    # Print the interrupt name (last column) followed by its current mask
    awk -v irq="$irq:" '$1 == irq {printf "%s ", $NF}' /proc/interrupts
    cat "/proc/irq/$irq/smp_affinity"
done


# Assign the IRQs round-robin, starting from the highest-numbered CPU
n=0
for irq in $irqs
do
    f="/proc/irq/$irq/smp_affinity"
    test -w "$f" || continue
    cpu=$(( ncpus - (n % ncpus) - 1 ))
    # One bit per CPU, written in hexadecimal
    mask=$(printf %x $(( 1 << cpu )))
    echo "Assign SMP affinity: $dev queue $n, irq $irq, cpu $cpu, mask 0x$mask"
    echo "$mask" > "$f"
    n=$(( n + 1 ))
done

# With a TX host sending two data streams to an RX host (one with dst IP 10.0.0.1 and the other with dst IP 10.0.0.2),
# the two streams hash into two separate RX queues on the NIC.
# The RX packet and byte counters can be seen to increase for both queues on the 2 RX queue NIC with...

$ watch -n 1 'sudo ethtool -S eth3 | grep -E " rx_"'


# After setting the queue affinity for the two queues to two separate CPU cores...
$ sudo ./irq_balance.sh eth3-

# both cores can be seen to be processing frames/packets at the same time and same rate by running the script,
# https://github.com/majek/dump/blob/master/how-to-receive-a-packet/softnet.sh

$ watch -n 1 ./softnet.sh
cpu      total    dropped   squeezed  collision        rps flow_limit
  0 3876863303          0       6694          0          0          0
  1 3383929713          0        278          0          0          0


# Or just simply by running...
watch -n 1 "cat /proc/net/softnet_stat"
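
The raw file is harder to read than the softnet.sh output because every field is hexadecimal. A minimal sketch that decodes the first three columns (packets processed, dropped and time-squeezed) into decimal is given below. The function name softnet_decode is made up for illustration, and the row number is assumed to equal the CPU number, which may not hold if some CPUs are offline.

```shell
# softnet_decode [FILE]
# Print "cpu total dropped squeezed" in decimal, one row per line of FILE.
# FILE defaults to /proc/net/softnet_stat.
softnet_decode() (
    cpu=0
    while read -r total dropped squeezed _rest; do
        # Each field is a hexadecimal counter without a 0x prefix
        printf '%d %d %d %d\n' "$cpu" "0x$total" "0x$dropped" "0x$squeezed"
        cpu=$(( cpu + 1 ))
    done < "${1:-/proc/net/softnet_stat}"
)
```

It can be watched in the same way, provided the function is defined in the shell that watch starts.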

# If enabled, interrupt coalescing can be disabled with...
sudo ethtool -C eth3 rx-usecs 0
sudo ethtool -C eth3 tx-usecs 0
# ...and the resulting coalescing settings checked with
sudo ethtool -c eth3

 

NUMA (Non-Uniform Memory Access):

On multi-core systems without NUMA, only a limited number of cores can access a given memory device at a time (be it on-chip, like CPU cache, or off-chip, like RAM). Multiple cores trying to access different sections of RAM simultaneously can stall if more requests are "in flight" than there are RAM channels. RAM access was originally serial, but modern RAM uses multi-channel memory access to allow parallel access. Processors like the Intel i7 Extreme series and various Xeons support quad-channel memory (roughly meaning four cores can access memory at the same time). However, modern systems may have hundreds of cores, and CPUs are generally much faster than RAM reads/writes.

NUMA attempts to address this problem by providing separate memory for each processor, avoiding the performance hit when several processors attempt to address the same memory. Of course, not all data ends up confined to a single task, which means that more than one processor may require the same data. To handle these cases, NUMA systems include additional hardware or software to move data between memory banks. This operation slows the processors attached to those banks, so the overall speed increase due to NUMA depends heavily on the nature of the running tasks.

"NUMA increases processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.
 
In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.
 
On NUMA hardware, some regions of memory are on physically different buses from other regions. Because NUMA uses local and foreign memory, it will take longer to access some regions of memory than others. Local memory and foreign memory are typically used in reference to a currently running thread. Local memory is the memory that is on the same node as the CPU currently running the thread. Any memory that does not belong to the node on which the thread is currently running is foreign. Foreign memory is also known as remote memory. The ratio of the cost to access foreign memory over that for local memory is called the NUMA ratio. If the NUMA ratio is 1, it is symmetric multiprocessing (SMP). The greater the ratio, the more it costs to access the memory of other nodes.
 
The main benefit of NUMA is scalability. The NUMA architecture was designed to surpass the scalability limits of the SMP architecture. With SMP, all memory access is posted to the same shared memory bus. This works fine for a relatively small number of CPUs, but not when you have dozens, even hundreds, of CPUs competing for access to the shared memory bus. NUMA alleviates these bottlenecks by limiting the number of CPUs on any one memory bus and connecting the various nodes by means of a high speed interconnection.
 
The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary."
 

 

Notes from CloudFlare on NUMA performance when using a single RX queue on a NIC and feeding the data to a single-threaded receiver app:

We can pin the single-threaded receiver to one of four interesting CPUs in our setup. The four options are:

  1. Run receiver on another CPU, but on the same NUMA node as the RX queue. The performance as we saw above is around 360kpps.

  2. With receiver on exactly same CPU as the RX queue we can get up to ~430kpps. But it creates high variability. The performance drops down to zero if the NIC is overwhelmed with packets.

  3. When the receiver runs on the HT counterpart of the CPU handling RX queue, the performance is half the usual number at around 200kpps.

  4. With receiver on a CPU on a different NUMA node than the RX queue we get ~330k pps. The numbers aren't too consistent though.

While a 10% penalty for running on a different NUMA node may not sound too bad, the problem only gets worse with scale. On some tests I was able to squeeze out only 250kpps per core. On all the cross-NUMA tests the variability was bad. The performance penalty across NUMA nodes is even more visible at higher throughput. In one of the tests I got a 4x penalty when running the receiver on a bad NUMA node.

Consider an example where a NIC has multiple RX and TX queues and multiple traffic flows arrive at the NIC, which can be hashed into the different RX queues. For the sake of example, there are 2 NUMA nodes with 8 physical cores each, which with Hyper-Threading gives 32 logical cores, and 4 RX and 4 TX queues on the single NIC.

Ideally the 4 RX queues are mapped to 4 physical cores on the first NUMA node. For the multi-threaded receiving application, the other four physical cores on that node could be used for up to 4 receive processing threads (or some combination thereof, e.g. 3 plus a "master" thread). Mapping the RX queues and the receive application across a mixture of physical and logical (Hyper-Threading) cores on the first NUMA node could mean that, at high pps rates, a core assigned to an RX queue starves the receive application thread on its logical counterpart core. The reverse is also possible: a receive application thread that rarely blocks on I/O might starve the core assigned to the RX queue, resulting in some packet loss.

TX queues can be mapped to specific CPU cores just like RX queues (XPS – Transmit Packet Steering). If the application needs to send data too (perhaps it mirrors back the data it receives), it would ideally run its transmit threads on separate cores from the RX queues and receive application threads. If the transmit application threads run on a different NUMA node, though, there will likely be a slight penalty. So ideally, for this example application that mirrors back received data, the RX queues, RX threads, TX threads and TX queues are all pinned to the same NUMA node across separate physical cores.

Check for NUMA features/compatibility:

numactl --hardware
numactl --show

dmesg | grep NUMA
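
The commands above describe the nodes but not which node a given NIC is attached to, which is what matters when choosing cores for its queues. The following is a minimal sketch, assuming the standard sysfs attributes /sys/class/net/<iface>/device/numa_node and /sys/devices/system/node/node<N>/cpulist. The function name nic_local_cpus and its optional sysfs-root argument are made up for illustration.

```shell
# nic_local_cpus IFACE [SYSFS_ROOT]
# Print the NUMA node a NIC is attached to and the CPUs local to that node.
# Fails (non-zero status) if the interface has no backing device in sysfs.
nic_local_cpus() (
    iface="$1"
    sysfs="${2:-/sys}"
    node=$(cat "$sysfs/class/net/$iface/device/numa_node" 2>/dev/null) || exit 1
    if [ "$node" -lt 0 ]; then
        # -1 means the kernel has no NUMA information for this device
        echo "no NUMA affinity reported for $iface"
    else
        echo "node $node cpus $(cat "$sysfs/devices/system/node/node$node/cpulist")"
    fi
)
```

For example, "nic_local_cpus eth3" might print a line such as "node 0 cpus 0-7,16-23" on the example system described above; the exact CPU list depends on the machine.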

 

Receive Side Scaling / Receive Packet Steering / Receive Flow Steering

RSS – Receive Side Scaling
Each receive queue has a separate IRQ associated with it. The NIC triggers this to notify a CPU when new packets arrive on the given queue. The signalling path for PCIe devices uses message signalled interrupts (MSI-X), which can route each interrupt to a particular CPU. The active mapping of queues to IRQs can be determined from /proc/interrupts. By default, an IRQ may be handled on any CPU. Because a non-negligible part of packet processing takes place in receive interrupt handling, it is advantageous to spread receive interrupts between CPUs.

RSS should be enabled when latency is a concern or whenever receive interrupt processing forms a bottleneck. Spreading load between CPUs decreases queue length. For low latency networking, the optimal setting is to allocate as many queues as there are CPUs in the system (or the NIC maximum, if lower). The most efficient high-rate configuration is likely the one with the smallest number of receive queues where no receive queue overflows due to a saturated CPU, because in default mode with interrupt coalescing enabled, the aggregate number of interrupts (and thus work) grows with each additional queue.

Per-cpu load can be observed using the mpstat utility, but note that on processors with hyperthreading (HT), each hyperthread is represented as a separate CPU. For interrupt handling, HT has shown no benefit in initial tests, so limit the number of queues to the number of CPU cores in the system.

RPS – Receive Packet Steering
Receive Packet Steering (RPS) is logically a software implementation of RSS. Being in software, it is necessarily called later in the datapath. Whereas RSS selects the queue and hence CPU that will run the hardware interrupt handler, RPS selects the CPU to perform protocol processing above the interrupt handler. This is accomplished by placing the packet on the desired CPU's backlog queue and waking up the CPU for processing. RPS has some advantages over RSS: 1) it can be used with any NIC, 2) software filters can easily be added to hash over new protocols, 3) it does not increase hardware device interrupt rate (although it does introduce inter-processor interrupts (IPIs)).
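
RPS is configured per receive queue by writing a hexadecimal CPU bitmap to /sys/class/net/<dev>/queues/rx-<n>/rps_cpus, as described in the kernel scaling document listed in the references; a mask of 0 leaves it disabled. The sketch below only prints the commands rather than running them. The function name rps_commands is made up for illustration, and applying the same mask to every queue is just one possible policy.

```shell
# rps_commands IFACE HEXMASK NQUEUES
# Print (do not run) the commands that would set the same RPS CPU mask on
# RX queues 0..NQUEUES-1 of IFACE.
rps_commands() (
    q=0
    while [ "$q" -lt "$3" ]; do
        echo "echo $2 > /sys/class/net/$1/queues/rx-$q/rps_cpus"
        q=$(( q + 1 ))
    done
)
```

For example, "rps_commands eth3 3 2" prints the two commands that steer both RX queues of eth3 to CPUs 0 and 1. The output can be reviewed and then piped to "sudo sh" to apply it.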

RFS – Receive Flow Steering
While RPS steers packets solely based on hash, and thus generally provides good load distribution, it does not take into account application locality. This is accomplished by Receive Flow Steering (RFS). The goal of RFS is to increase datacache hitrate by steering kernel processing of packets to the CPU where the application thread consuming the packet is running. RFS relies on the same RPS mechanisms to enqueue packets onto the backlog of another CPU and to wake up that CPU.
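
RFS is off by default. According to the kernel scaling document listed in the references, it is enabled by setting the global flow table size in /proc/sys/net/core/rps_sock_flow_entries and the per-queue size in /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt, with the per-queue value typically being the global value divided by the number of RX queues. The sketch below prints those commands rather than running them. The function name rfs_commands is made up for illustration, and the value 32768 in the example is the one that document suggests for a moderately loaded server, not a measured optimum.

```shell
# rfs_commands IFACE FLOW_ENTRIES NQUEUES
# Print (do not run) the commands that would enable RFS on IFACE, splitting
# FLOW_ENTRIES evenly across RX queues 0..NQUEUES-1.
rfs_commands() (
    per_queue=$(( $2 / $3 ))
    echo "echo $2 > /proc/sys/net/core/rps_sock_flow_entries"
    q=0
    while [ "$q" -lt "$3" ]; do
        echo "echo $per_queue > /sys/class/net/$1/queues/rx-$q/rps_flow_cnt"
        q=$(( q + 1 ))
    done
)
```

For example, "rfs_commands eth3 32768 4" prints one command for the global table and four commands that give each RX queue 8192 entries.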

# Receive Side Scaling
# Using ethtool the NIC can be adjusted to deliver packets into specific queues.
# The queues can then be processed by separate CPU cores.


# This file contains the number of frames/packets processed per CPU core in the first column
sudo cat /proc/net/softnet_stat

# It can be monitored with this script
# https://github.com/majek/dump/blob/master/how-to-receive-a-packet/softnet.sh
watch -n 1 ./softnet.sh



# The indirection table shows how the results of a flow hash in the NIC will steer traffic into specific RX queues

# Show the indirection table for a NIC (it maps hash results to NIC queues)
sudo ethtool -x eth3

# Only allow traffic to the first 6 RX queues on an 11 RX queue card (the sum of the weights must be non-zero)
sudo ethtool -X eth2 weight 1 1 1 1 1 1 0 0 0 0 0

# Or, spread processing evenly between first 2 RX queues
sudo ethtool -X eth0 equal 2
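
The weight list passed to "ethtool -X ... weight" has one entry per RX queue, which is tedious to type by hand on cards with many queues. A minimal sketch that generates such a list is given below. The function name rss_weights is made up for illustration, and it only produces equal weights of 1 for the leading queues.

```shell
# rss_weights TOTAL ACTIVE
# Print TOTAL space-separated weights: 1 for the first ACTIVE queues and
# 0 for the remaining ones.
rss_weights() (
    q=0
    out=""
    while [ "$q" -lt "$1" ]; do
        if [ "$q" -lt "$2" ]; then w=1; else w=0; fi
        out="$out${out:+ }$w"
        q=$(( q + 1 ))
    done
    echo "$out"
)
```

For example, "rss_weights 11 6" reproduces the list used in the weight command above, so the same setting could be written as: sudo ethtool -X eth2 weight $(rss_weights 11 6)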


# Another option is to use n-tuple rules. These exist in addition to the indirection table.

# Check if they are on with
sudo ethtool -k eth3

# Enable them with
sudo ethtool -K eth3 ntuple on

# Get Rx ntuple filters and actions
sudo ethtool -u eth3

# Set a specific rule for a specific destination MAC address, the action number is the RX queue number to place this flow into
sudo ethtool -U eth3 flow-type ether dst xx:yy:zz:aa:bb:cc [m xx:yy:zz:aa:bb:cc] action 0

With RSS configured to map specific queues to specific CPU cores, RFS can be configured to map specific flows into specific queues in the hardware using either n-tuple rules with ethtool or NIC-specific features like Flow Director or Accelerated RFS.