Date created: Tuesday, May 7, 2013 11:46:29 AM. Last modified: Friday, January 19, 2024 4:34:56 PM
Linux Network Tuning
PHY/NIC/Ethtool
# Check if flow control is enabled
sudo ethtool -a eth3
# Disable flow control
sudo ethtool -A eth3 autoneg off rx off tx off
# Check the RX and TX ring sizes
sudo ethtool -g eth3
# Set the RX ring size to 4096 descriptors
sudo ethtool -G eth3 rx 4096
# Check the hardware offload settings per NIC
sudo ethtool -k eth3
# Disable TX and RX checksum offload
sudo ethtool -K eth3 rx off
sudo ethtool -K eth3 tx off
# Show the RX or TX queue stats per NIC
watch -n 1 'sudo ethtool -S eth3 | grep -E " rx_"'
# On a VM the TX kick counter might be high, and this may be fine: the virtio vring is shared between VM and host. The VM side places requests on the vring and kicks the virtqueue, while the host side places responses on the vring and interrupts the VM side.
$ sudo ethtool -S eth0
NIC statistics:
rx_queue_0_packets: 4282561
rx_queue_0_bytes: 2987856497
rx_queue_0_drops: 0
rx_queue_0_xdp_packets: 0
rx_queue_0_xdp_tx: 0
rx_queue_0_xdp_redirects: 0
rx_queue_0_xdp_drops: 0
rx_queue_0_kicks: 84
tx_queue_0_packets: 13479797
tx_queue_0_bytes: 1816396339
tx_queue_0_xdp_tx: 0
tx_queue_0_xdp_tx_drops: 0
tx_queue_0_kicks: 13426617
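To judge whether a kick counter is unusually high, it can help to relate it to the packet counter of the same queue. The following is a minimal sketch, assuming the virtio-style counter names shown above (`<dir>_queue_<n>_packets` and `<dir>_queue_<n>_kicks`); the function name `packets_per_kick` is made up for this example.

```shell
# Print "<queue> <packets per kick>" for every queue found in
# `ethtool -S` output read from stdin.
# Assumes counter names of the form rx_queue_0_packets / rx_queue_0_kicks.
packets_per_kick() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }          # strip leading indentation
        $1 ~ /_queue_[0-9]+_packets$/ { q = $1; sub(/_packets$/, "", q); pkts[q] = $2 }
        $1 ~ /_queue_[0-9]+_kicks$/   { q = $1; sub(/_kicks$/, "", q);   kicks[q] = $2 }
        END {
            for (q in pkts)
                if (kicks[q] > 0)
                    printf "%s %.2f\n", q, pkts[q] / kicks[q]
        }
    ' | sort
}
```

Usage: sudo ethtool -S eth0 | packets_per_kick. A TX ratio close to 1 means almost every packet sent causes a kick, which matches the counters above.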
$ watch -n 1 "column -t /proc/net/dev | grep -E 'Inter|face|eth0|eth1'"
Every 1.0s: column -t /proc/net/dev | grep -E 'Inter|face|eth0|eth1'
Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eth0: 2990014135 4288231 0 0 0 0 0 0 1818541402 13494586 0 0 0 0 0 0
eth1: 4106015341 16686322 0 0 0 0 0 0 1420117908 7433373 0 0 0 0 0 0
$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 11:11:11:22:22:22 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
2993449026 4297303 0 0 0 0
TX: bytes packets errors dropped carrier collsns
1822537655 13525417 0 0 0 0
altname enp0s18
altname ens18
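When only the error and drop counters matter, the /proc/net/dev columns shown above can be reduced to a few fields. This is a rough sketch, assuming the standard two header lines and the 16-counter layout of /proc/net/dev; `iface_drops` is a name invented for this example.

```shell
# Print "<iface> <rx_errs> <rx_drop> <tx_errs> <tx_drop>" per interface.
# Reads /proc/net/dev-formatted text on stdin and skips the two header lines.
iface_drops() {
    awk 'NR > 2 {
        sub(/:/, " ")                 # turn "eth0:123" into "eth0 123"
        print $1, $4, $5, $12, $13    # name, RX errs/drop, TX errs/drop
    }'
}
```

Usage: iface_drops < /proc/net/dev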
IP/TCP/UDP
$ cat /proc/sys/net/core/wmem_default
212992
$ cat /proc/sys/net/core/wmem_max
212992
$ cat /proc/sys/net/core/rmem_default
212992
$ cat /proc/sys/net/core/rmem_max
212992
watch -n 1 "sudo ss -nupmeO | tee -a ~/ss.log"
Every 1.0s: sudo ss -unpmeO
Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
0 0 10.10.101.150:43041 8.8.4.4:53 users:(("snmpget",pid=3941568,fd=3)) uid:2113 ino:23550662 sk:1197 cgroup:/system.slice/observium-poller.service <-> skmem:(r0,rb212992,t0,tb212992,f0,w0,o0,bl0,d0)
0 0 10.10.101.150:59706 8.8.4.4:53 users:(("snmpget",pid=3941557,fd=3)) uid:2113 ino:23550641 sk:1194 cgroup:/system.slice/observium-poller.service <-> skmem:(r0,rb212992,t0,tb212992,f0,w0,o0,bl0,d0)
0 0 10.10.101.150:33668 8.8.4.4:53 users:(("snmpget",pid=3941576,fd=3)) uid:2113 ino:23550681 sk:1198 cgroup:/system.slice/observium-poller.service <-> skmem:(r0,rb212992,t0,tb212992,f0,w0,o0,bl0,d0)
0 0 10.10.101.150:55150 8.8.4.4:53 users:(("snmpget",pid=3941583,fd=3)) uid:2113 ino:23552059 sk:1199 cgroup:/system.slice/observium-poller.service <-> skmem:(r0,rb212992,t0,tb212992,f0,w0,o0,bl0,d0)
0 0 10.10.101.150:39359 8.8.4.4:53 users:(("snmpget",pid=3941589,fd=3)) uid:2113 ino:23552066 sk:119a cgroup:/system.slice/observium-poller.service <-> skmem:(r0,rb212992,t0,tb212992,f0,w0,o0,bl0,d0)
0 0 10.10.101.150:56464 8.8.4.4:53 users:(("snmpbulkwalk",pid=3941588,fd=3)) uid:2113 ino:23553139 sk:119b cgroup:/system.slice/observium-poller.service <-> skmem:(r0,rb212992,t0,tb212992,f0,w0,o0,bl0,d0)
0 0 10.10.101.150:58603 8.8.4.4:53 users:(("snmpget",pid=3941591,fd=3)) uid:2113 ino:23550706 sk:119c cgroup:/system.slice/observium-poller.service <-> skmem:(r0,rb212992,t0,tb212992,f0,w0,o0,bl0,d0)
0 0 [2001:2001:0:6::1b]:59740 [2001:2001:2001::1111]:53 users:(("snmpget",pid=3941596,fd=3)) uid:2113 ino:23551224 sk:119d cgroup:/system.slice/observium-poller.service <-> skmem:(r0,rb212992,t0,tb212992,f0,w0,o0,bl0,d0)
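In the skmem:(...) field, rb is the receive buffer size in bytes and d counts packets dropped for that socket, so these two values are usually the first to check when tuning rmem. A small sketch follows, assuming the column layout shown above (local address in the third column, as printed by ss -u with -O); `skmem_summary` is a name invented for this example.

```shell
# Print "<local addr:port> rb=<receive buffer> d=<drops>" for each socket
# line of `ss -m` output read from stdin. Lines without skmem are skipped.
# Assumes the local address is the third whitespace-separated column.
skmem_summary() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^skmem:\(/) {
                s = $i
                gsub(/^skmem:\(|\)$/, "", s)     # keep only "r0,rb212992,..."
                n = split(s, kv, ",")
                rb = ""; d = ""
                for (j = 1; j <= n; j++) {
                    if (kv[j] ~ /^rb[0-9]+$/) rb = substr(kv[j], 3)
                    if (kv[j] ~ /^d[0-9]+$/)  d  = substr(kv[j], 2)
                }
                print $3, "rb=" rb, "d=" d
            }
        }
    }'
}
```

Usage: sudo ss -nupmeO | skmem_summary. A non-zero d value suggests the socket's receive buffer is overflowing.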
Sysctl / Tunable Kernel Parameters for IP Performance:
This is a great reference: https://sysctl-explorer.net/net/
sysctl -w net.core.rmem_default=524287

/proc/sys/net/core/rmem_default - default socket receive buffer size in bytes (default=124928), suggested change to 524287
/proc/sys/net/core/wmem_default - default socket send buffer size in bytes (default=124928), suggested change to 524287
/proc/sys/net/core/rmem_max - maximum socket receive buffer size in bytes (default=131071), suggested change to 524287
/proc/sys/net/core/wmem_max - maximum socket send buffer size in bytes (default=131071), suggested change to 524287
/proc/sys/net/core/optmem_max - maximum option memory buffers (default=20480), suggested change to 524287
/proc/sys/net/core/netdev_max_backlog - number of unprocessed input packets before kernel starts dropping them (default=1000), suggested change to 300000
/proc/sys/net/ipv4/tcp_rmem - memory reserved for TCP rcv buffers (min default max) (defaults 4096 87380 4194304), suggested change to 10000000 10000000 10000000
/proc/sys/net/ipv4/tcp_wmem - memory reserved for TCP snd buffers (min default max) (defaults 4096 16384 4194304), suggested change to 10000000 10000000 10000000
/proc/sys/net/ipv4/tcp_mem - TCP memory thresholds (low pressure high), counted in pages rather than bytes (defaults 193152 257536 386304), suggested change to 10000000 10000000 10000000 (with 4 KiB pages that is about 38 GiB)
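Values set with sysctl -w or by writing to /proc are lost on reboot. One way to keep them is a fragment under /etc/sysctl.d, sketched below for the net.core values suggested above. The function name and the file name 90-net-tuning.conf are assumptions made for this example, not something the notes prescribe.

```shell
# Write the suggested net.core settings to a sysctl fragment.
# $1: output file; the default path is a hypothetical name, any *.conf
#     file under /etc/sysctl.d would be picked up the same way.
write_sysctl_fragment() (
    out=${1:-/etc/sysctl.d/90-net-tuning.conf}
    cat > "$out" <<'EOF'
net.core.rmem_default = 524287
net.core.wmem_default = 524287
net.core.rmem_max = 524287
net.core.wmem_max = 524287
net.core.netdev_max_backlog = 300000
EOF
)
```

Run write_sysctl_fragment as root, then load the file with sysctl -p /etc/sysctl.d/90-net-tuning.conf, or reload every fragment with sysctl --system.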
ip_forward - (Boolean; default: disabled; since Linux 1.2) Enable IP forwarding with a boolean flag. IP forwarding can also be set on a per-interface basis.
echo 1 > /proc/sys/net/ipv4/ip_forward
ip_local_port_range - (Two integers, low and high bound, default 1024 to 4999 or 32768 61000; since Linux 2.2) - The ephemeral port range. Allocation starts with the first number and ends with the second number. Note that these should not conflict with the ports used by masquerading (although the case is handled). Also arbitrary choices may cause problems with some firewall packet filters that make assumptions about the local ports in use. First number should be at least greater than 1024, or better, greater than 4096, to avoid clashes with well known ports and to minimize firewall problems.
echo "10000 65000" > /proc/sys/net/ipv4/ip_local_port_range
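The size of the ephemeral range bounds how many outgoing connections a single local address can hold open to one destination address and port. A trivial sketch, treating both bounds as inclusive (an assumption) and using a function name invented for this example:

```shell
# Number of ports in an inclusive "low high" ephemeral port range.
# $1: low bound, $2: high bound
port_range_size() {
    echo $(( $2 - $1 + 1 ))
}
```

Usage: port_range_size $(cat /proc/sys/net/ipv4/ip_local_port_range). The range 10000 to 65000 set above gives 55001 ports, against 28233 for 32768 to 61000.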
ip_no_pmtu_disc - (Boolean; default: disabled; since Linux 2.2) If enabled, don't do Path MTU Discovery for TCP sockets by default. Path MTU discovery may fail if misconfigured firewalls (that drop all ICMP packets) or misconfigured interfaces (e.g., a point-to-point link where both ends don't agree on the MTU) are on the path. It is better to fix the broken routers on the path than to turn off Path MTU Discovery globally, because not doing it incurs a high cost to the network.
echo 0 > /proc/sys/net/ipv4/ip_no_pmtu_disc
ip_nonlocal_bind - (Boolean, default disabled) If set, allows processes to bind(2) to nonlocal IP addresses, which can be quite useful, but may break some applications.
echo 1 > /proc/sys/net/ipv4/ip_nonlocal_bind
tcp_fin_timeout (integer; default: 60; since Linux 2.2) This specifies how many seconds to wait for a final FIN packet before the socket is forcibly closed. This is strictly a violation of the TCP specification, but required to prevent denial-of-service attacks. In Linux 2.2, the default value was 180.
echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
tcp_keepalive_time - (Integer, seconds, default 7200; since Linux 2.2) The interval between the last data packet sent (simple ACKs are not considered data) and the first keepalive probe; after the connection is marked to need keepalive, this counter is not used any further.
echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time
tcp_keepalive_intvl - (Integer, seconds, default 75; since Linux 2.4) The interval between subsequent keepalive probes, regardless of what the connection has exchanged in the meantime.
echo 60 > /proc/sys/net/ipv4/tcp_keepalive_intvl
tcp_keepalive_probes - (Integer, number of probes, default 9; since Linux 2.2) The number of unacknowledged probes to send before considering the connection dead and notifying the application layer.
echo 15 > /proc/sys/net/ipv4/tcp_keepalive_probes
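The three keepalive settings combine into the time an idle connection to a dead peer survives: the idle time before the first probe, plus one interval for each unanswered probe. A minimal sketch of that arithmetic (the function name is invented for this example, and the result is approximate):

```shell
# Approximate seconds until an idle connection to a dead peer is dropped.
# $1: tcp_keepalive_time, $2: tcp_keepalive_intvl, $3: tcp_keepalive_probes
keepalive_timeout() {
    echo $(( $1 + $2 * $3 ))
}
```

With the kernel defaults, keepalive_timeout 7200 75 9 gives 7875 seconds (just over two hours). With the values set above, keepalive_timeout 600 60 15 gives 1500 seconds (25 minutes).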
tcp_retries2 - (integer; default: 15; since Linux 2.2) The maximum number of times a TCP packet is retransmitted in established state before giving up. The default value is 15, which corresponds to a duration of approximately 13 to 30 minutes, depending on the retransmission timeout. The RFC 1122 specified minimum limit of 100 seconds is typically deemed too short.
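The lower end of that duration can be estimated by summing the exponentially backed-off retransmission timeouts. The sketch below assumes the retransmission timeout starts at 200 ms and doubles up to a cap of 120 s (the usual Linux minimum and maximum RTO). Real connections start from an RTT-derived RTO, so actual durations are longer. The function name is invented for this example.

```shell
# Lower-bound time in milliseconds before the kernel gives up on a
# connection for a given retry count.
# $1: retry count (e.g. tcp_retries2)
# $2: initial RTO in ms (assumed default 200)
# $3: maximum RTO in ms (assumed default 120000)
retries2_timeout_ms() (
    retries=$1
    rto=${2:-200}
    rto_max=${3:-120000}
    total=0
    i=0
    # One wait after the original transmission plus one per retransmission.
    while [ "$i" -le "$retries" ]; do
        total=$(( total + rto ))
        rto=$(( rto * 2 ))
        if [ "$rto" -gt "$rto_max" ]; then
            rto=$rto_max
        fi
        i=$(( i + 1 ))
    done
    echo "$total"
)
```

retries2_timeout_ms 15 prints 924600, i.e. about 924.6 seconds or roughly 15 minutes.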
tcp_tw_recycle - (Boolean; default: disabled; Linux 2.4 to 4.11) - Enable fast recycling of TIME_WAIT sockets. Enabling this option is not recommended since this causes problems when working with NAT (Network Address Translation). The option was removed in Linux 4.12.
tcp_tw_reuse - (Boolean; default: disabled; since Linux 2.4.19/2.6) Allow reuse of TIME_WAIT sockets for new connections when it is safe from a protocol viewpoint. It should not be changed without advice/request of technical experts.
References:
man 7 ip
man 7 tcp
http://www.faqs.org/docs/securing/chap6sec70.html
http://man7.org/linux/man-pages/man7/ip.7.html
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
http://lartc.org/howto/lartc.kernel.obscure.html
ICMP Limiting:
icmp_destunreach_rate - (Integer, 1/100ths of a second; Linux 2.2 to 2.4.9) Maximum rate to send ICMP Destination Unreachable packets. This limits the rate at which packets are sent to any individual route or destination. The limit does not affect sending of ICMP_FRAG_NEEDED packets needed for path MTU discovery.
icmp_echoreply_rate - (Integer, 1/100ths of a second; Linux 2.2 to 2.4.9) Maximum rate for sending ICMP_ECHOREPLY packets in response to ICMP_ECHOREQUEST packets.
icmp_paramprob_rate - (Integer, 1/100ths of a second; Linux 2.2 to 2.4.9) Maximum rate for sending ICMP_PARAMETERPROB packets. These packets are sent when a packet arrives with an invalid IP header.
icmp_timeexceed_rate - (Integer, 1/100ths of a second; Linux 2.2 to 2.4.9) Maximum rate for sending ICMP_TIME_EXCEEDED packets. These packets are sent to prevent loops when a packet has crossed too many hops.
icmp_ratelimit - (integer; default: 1000; since Linux 2.4.10) Limit the maximum rates for sending ICMP packets whose type matches icmp_ratemask (see below) to specific targets. 0 to disable any limiting, otherwise the minimum space between responses in milliseconds.
icmp_ratemask - (integer; default: see below; since Linux 2.4.10) Mask made of ICMP types for which rates are being limited.
Significant bits: IHGFEDCBA9876543210
Default mask:     0000001100000011000 (0x1818)
Bit definitions (see the kernel source file include/linux/icmp.h):
0 Echo Reply
3 Destination Unreachable *
4 Source Quench *
5 Redirect
8 Echo Request
B Time Exceeded *
C Parameter Problem *
D Timestamp Request
E Timestamp Reply
F Info Request
G Info Reply
H Address Mask Request
I Address Mask Reply
The bits marked with an asterisk are rate limited by default (see the default mask above).
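Decoding a mask by hand is error-prone, so the bit table above can be turned into a small lookup. This is a sketch only: `decode_ratemask` is a name invented for this example, and it prints decimal bit numbers (11 and 12) where the table uses B and C.

```shell
# List the ICMP types selected by an icmp_ratemask value.
# $1: mask, decimal or 0x-prefixed hex
decode_ratemask() (
    mask=$(( $1 ))
    bit=0
    # Position in this list is the bit number; empty entries are unused bits.
    for name in "Echo Reply" "" "" "Destination Unreachable" "Source Quench" \
                "Redirect" "" "" "Echo Request" "" "" "Time Exceeded" \
                "Parameter Problem" "Timestamp Request" "Timestamp Reply" \
                "Info Request" "Info Reply" "Address Mask Request" \
                "Address Mask Reply"; do
        if [ -n "$name" ] && [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            echo "$bit $name"
        fi
        bit=$(( bit + 1 ))
    done
)
```

decode_ratemask 0x1818 lists bits 3, 4, 11 and 12 (Destination Unreachable, Source Quench, Time Exceeded, Parameter Problem). To decode the live value, use decode_ratemask "$(cat /proc/sys/net/ipv4/icmp_ratemask)".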
References:
man 7 icmp