Date created: Wednesday, September 18, 2024 5:54:46 PM. Last modified: Monday, September 23, 2024 6:17:17 PM
7280R3 Jericho2 ASIC Drops
References
https://www.arista.com/en/support/toi/eos-4-25-2f/14703-r-series-drop-voq-monitoring
Packets are dropped for one of three reasons:
- Adverse drops: packets are dropped because something about the device isn't working properly
- Congestion: the device is working but does not have enough capacity
- Packet processor: the device is working and has capacity, but a higher-level issue exists, such as no route or a bad packet checksum
Drops by the ASIC can be seen with "show hardware counter drop":
lab#show hardware counter drop
Summary:
Total Adverse (A) Drops: 0
Total Congestion (C) Drops: 0
Total Packet Processor (P) Drops: 2663
Type  Chip  CounterName         : Count : First Occurrence    : Last Occurrence
--------------------------------------------------------------------------------------------------------------
P     Fap0  dropVoqInNullRoute  : 2097  : 2024-09-18 14:36:37 : 2024-09-18 15:53:43
lab#show hardware counter drop rates
Type  Chip  CounterName         Count  1-Min  10-Min  1-Hour  1-Day  1-Week
-------------------------------------------------------------------------------------------------------------------
P     Fap0  dropVoqInNullRoute  2186   55     622     1188    2186   2186
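When many counters are listed, it can help to reduce the output to the non-zero rows. The following is a minimal sketch, assuming rows have the "Type Chip CounterName : Count : ..." layout shown above; the function name nonzero_drop_counters is hypothetical, and running the CLI from bash via FastCli is an assumption about the local EOS environment.

```bash
# Print "<counter> <count>" for every non-zero row of
# "show hardware counter drop" read on stdin.
# Assumed row layout (from the example above):
#   P  Fap0  dropVoqInNullRoute : 2097 : <first> : <last>
nonzero_drop_counters() {
    awk '$1 ~ /^[ACP]$/ && $2 ~ /^Fap[0-9]+$/ && $4 == ":" && $5 + 0 > 0 {
        print $3, $5
    }'
}

# Possible usage from the EOS bash shell (assumes FastCli is available):
#   FastCli -p 15 -c "show hardware counter drop" | nonzero_drop_counters
```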
When packets are dropped by the ASIC, they are wrapped in a custom header, which carries the original packet and the drop reason, and punted to the CPU.
The CPU punted packets can then be captured with tcpdump to see which packets are being dropped and why.
Note that tcpdump cannot capture every dropped packet if the drop rate (pps) is higher than the CPU can handle.
This command shows the number of drops, for any reason, sent to the control plane:
lab#bash fab dump | grep rxdrop_voq
rxdrop_voq 4219101
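To see whether drops are still being punted, the counter can be sampled twice and compared. This is a minimal sketch, assuming rxdrop_voq is a monotonically increasing counter and that "fab dump" prints it as "rxdrop_voq <value>" as shown above; both function names are hypothetical.

```bash
# Extract the rxdrop_voq value from "fab dump" output read on stdin.
rxdrop_voq_value() {
    awk '$1 == "rxdrop_voq" { print $2; exit }'
}

# Average punted drops per second between two samples taken
# INTERVAL seconds apart (integer division).
rxdrop_voq_pps() {
    local before=$1 after=$2 interval=$3
    echo $(( (after - before) / interval ))
}

# Possible usage from the EOS bash shell:
#   a=$(fab dump | rxdrop_voq_value); sleep 10; b=$(fab dump | rxdrop_voq_value)
#   rxdrop_voq_pps "$a" "$b" 10
```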
Note that only a sample of dropped packets is sent to the host supervisor; not every dropped
packet is expected to be visible. This avoids throttling the PCIe bus for CPU-bound traffic.
For reference, the tail-drop thresholds used for drop VOQs under this feature are
128KB and 1000 buffers. Drop VOQs are also rate limited to 80KBps.
This limitation does not affect the counts reported by "show hardware counter drop".
Also note that packets dropped for the following reasons aren't punted to the CPU for tcpdump'ing:
- dropVoqInMcastEmptyMcid
- dropVoqInAcl
- dropVoqInMcastNoCpu
- dropVoqInLagDiscarding
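Before hunting for a packet with tcpdump, it is worth checking whether the counter that is incrementing is one of the reasons listed above. A minimal sketch, assuming the list in this note is complete for the EOS version in use (it may differ on other versions); the function name is_punted_reason is hypothetical.

```bash
# Return 0 if packets for this drop counter are expected to be punted to the
# CPU, 1 if the counter is on the "not punted" list from this note.
is_punted_reason() {
    case "$1" in
        dropVoqInMcastEmptyMcid|dropVoqInAcl|dropVoqInMcastNoCpu|dropVoqInLagDiscarding)
            return 1 ;;
        *)
            return 0 ;;
    esac
}

# Example:
#   is_punted_reason dropVoqInNullRoute && echo "tcpdump should see these"
```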
The CPU-punted packets related to ASIC drops have the EtherType set to 0x1044.
There is no direct way to know which interface to run tcpdump on, so tcpdump has to be tried on every interface. One way to speed this up is to loop over all interfaces:
lab#bash for intf in $(ip l | grep -oE 'et[0-9]+(_[0-9]+)?' | sort -u); do echo "intf: $intf"; timeout 3 tcpdump -c 1 -i "$intf" ether proto 0x1044; done
Another way to find the right interface is to tcpdump on all interfaces looking for proto 0x002a, see which interface the packets come from, then listen for 0x1044 on that specific interface:
lab#bash tcpdump -e -i any -c 5 ether proto 0x002a
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
18:33:43.749384 et14 In ifindex 43 3c:fd:fe:11:22:33 (oui Unknown) ethertype ETH_P_ARISTA (0x002a), length 140:
    0x0000: 0001 0800 4500 0054 3987 4000 4001 5b8b ....E..T9.@.@.[.
    0x0010: 0ac8 c802 0ac8 c804 0800 0b4a 7d58 2893 ...........J}X(.
    0x0020: 071d eb66 0000 0000 8a73 0b00 0000 0000 ...f.....s......
    0x0030: 1011 1213 1415 1617 1819 1a1b 1c1d 1e1f ................
    0x0040: 2021 2223 2425 2627 2829 2a2b 2c2d 2e2f .!"#$%&'()*+,-./
    0x0050: 3031 3233 3435 3637 8100 0000 0000 0000 01234567........
    0x0060: 0000 0000 0000 000f ef10 0100 0000 0000 ................
    0x0070: 4a00 0001 0000 0101 J.......
^C
r2-lab2-de#bash tcpdump -i et14 ether proto 0x1044
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on et14, link-type EN10MB (Ethernet), snapshot length 262144 bytes
18:35:16.933369 3c:fd:fe:a9:33:b0 (oui Unknown) > e8:ae:c5:29:39:79 (oui Unknown), ethertype Arista Drop VOQ Monitoring (0x1044), length 134: vlan 1, p 0, ethertype IPv4, 10.200.200.2 > 10.200.200.4: ICMP echo request, id 32088, seq 10478, length 64
    Drop VOQ trailer:
    VOQ: inNullRoute (VOQID 74)
    out_fap_port: 0
    outlif: 0
    lif_outlif: 0
    eei_outlif: 0
    fwd_sys_vsi: 4079
    inlif_orientation: 16
    traffic_class: 1
    ftmh_dp: 0
    dscp: 0
    cpucode: 0
    fwd_code: 1
    fwd_hdr_offset: 0
    eei_type: 0
    dscp_rewrite: 1
    dsp_ext_present: 1
The drop reason above is a null route (inNullRoute), which matches the dropVoqInNullRoute counter in the "show hardware counter drop" output, meaning the packet causing the drop counter to increase has been found.
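When several drop reasons are active at once, the decoded trailers can be summarised per reason. This is a minimal sketch, assuming tcpdump prints the reason as "VOQ: <name> (VOQID <id>)" as in the capture above; the function name count_voq_reasons is hypothetical.

```bash
# Read decoded tcpdump output on stdin and print "<count> <reason> <voqid>"
# for each drop reason seen, most frequent first.
count_voq_reasons() {
    grep -oE 'VOQ: [A-Za-z0-9]+ \(VOQID [0-9]+\)' |
        sort | uniq -c | sort -rn |
        awk '{ gsub(/[()]/, "", $5); print $1, $3, $5 }'
}

# Possible usage from the EOS bash shell:
#   tcpdump -i et14 -c 100 ether proto 0x1044 2>/dev/null | count_voq_reasons
```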
Save the following Wireshark dissector (a Lua plugin) in "~/.local/lib/wireshark/plugins/": https://github.com/mpergament/voqmonitor
Local mirror: arista-voq-dissector.lua
Then one can decode the custom Arista header directly in Wireshark:
ssh -q labrouter "bash tcpdump -s 0 -U -n -w - -i et14 'ether proto 0x1044' 2>/dev/null" | wireshark -k -i -
One can filter for a specific drop reason by having the tcpdump filter check the drop code value in the Arista trailer. Packets which have been dropped and punted are truncated: the maximum packet size is 176 bytes, including the 32-byte trailer. The example below filters for the null route drop code 74 (0x4a). If the packet was not truncated (because it was smaller than 144 bytes), the trailer starts earlier in the frame and offset 168 will be incorrect:
bash tcpdump -i et4_1 -c 1 -s 0 -nlASXev ether proto 0x1044 and ether[168] == 0x4a
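The fixed offset 168 only holds for truncated 176-byte frames. Below is a minimal sketch for other frame lengths, assuming the 32-byte trailer always sits at the end of the frame and the drop code stays at the same position inside it (byte 24 of the trailer, derived from 168 - 144). This layout is inferred from the example above, not taken from Arista documentation, and the function name voq_filter is hypothetical.

```bash
# Build a tcpdump filter matching one drop code (VOQ ID) for a given
# captured frame length.
# Assumption: the drop code is the byte 8 bytes before the end of the frame
# (32-byte trailer, code at trailer offset 24, as implied by ether[168] in a
# 176-byte frame).
voq_filter() {
    local frame_len=$1 voq_id=$2
    local offset=$(( frame_len - 32 + 24 ))
    printf 'ether proto 0x1044 and ether[%d] == 0x%02x\n' "$offset" "$voq_id"
}

# Example: the 134-byte frame captured earlier, null route drop code 74
#   tcpdump -i et14 -c 1 -nev "$(voq_filter 134 74)"
```

Because the offset depends on the frame length, a single filter built this way only matches packets of that one size.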
The meaning of some of the drop counters is documented at the following URL; however, most of them are undocumented (TAC confirmed this): https://www.arista.com/en/support/toi/eos-4-15-3f/13754-drop-counters