ASR9000 QoS and LAG QoS

References:
https://supportforums.cisco.com/document/135651/asr9000xr-feature-order-operation
http://www.cisco.com/c/en/us/td/docs/routers/asr9000/hardware/overview/guide/asr9kOVRGbk/asr9kOVRGfuncdescription.html
BRKSPG-2904 - ASR-9000/IOS-XR hardware Architecture, QOS, EVC, IOS-XR Configuration and Troubleshooting
https://supportforums.cisco.com/document/12135016/asr9000xr-understanding-and-troubleshooting-fabric-issues-a9k
http://www.cisco.com/c/en/us/td/docs/routers/asr9000/software/asr9k_r6-0/general/release/notes/reln-601a9k.html#concept_D31CCC2886D343A499A6D458B9980E97
https://supportforums.cisco.com/document/105496/asr9000xr-understanding-route-scale#Using_the_ASR9000_as_a_Route_Reflector

Contents:
QoS Order-of-Operations
Ingress QoS Order-of-Operations
Egress QoS Order-of-Operations
Hardware Queues
Policers
Shapers
Port Level Shaping
VoQs and Implicit Trust
Default QoS Behaviour
LAG/Bundle QoS
Hardware Specifications
Debugging Commands

QoS Order-of-Operations:

 

Ingress QoS Order-of-Operations:

I/F Classification
Ingress packets are checked against the TCAM to see which (sub)interface they belong to. Once known, the router can derive the uIDB (micro IDB, Interface Descriptor Block) and knows which features need to be applied to the ingressing packet.

ACL Classification
If there is an ACL applied to the interface, a set of keys is built and sent over to the TCAM to find out whether this packet matches a permit or deny entry in the configured ACL. Only the result is received; no action is taken yet. In the case of SPAN this is also where the "capture" ACE keyword is matched.

QOS Classification
If there is a QoS policy applied the TCAM is queried with the set of keys to find out if a particular class-map is being matched. The return result is effectively an identifier to the class of the policy-map, so we know what functionality to apply on this packet when QoS actions are executed.

ACL and QoS Classification Notes:
Enabling either ACLs or QoS will result in a TCAM lookup for a match. An ACL lookup takes X time and a QoS lookup takes Y time, but enabling both ACL and QoS will not give an X+Y PPS degradation because the TCAM lookups for both are done in parallel. BGP FlowSpec, OpenFlow and CLI-based PBR use a PBR lookup, which logically happens between ACL and QoS.

Forwarding lookup
The ingress forwarding lookup does not traverse the whole forwarding tree yet; it only tries to find the egress interface and thus the egress line card. When bundles are used and members are spread over different line cards, the router needs to compute the hash to identify the egress line card for the egress member port.

Note: If uRPF is enabled a full ingress FIB lookup (the same as the egress lookup) is performed. This is more intensive and therefore uRPF can have a relatively large impact on the PPS rate.

IFIB Lookup
The iFIB lookup determines which internal processor the ingressing packet should be punted to. For example, ARP and NetFlow are handled by the LC CPU, but BGP and OSPF are handled by the RSP. In the case of NetFlow, if packets are dropped by an ACL or a QoS policer the flow record is still sampled/exported so this information is captured, but a flag is set to indicate the packet was dropped.

ACL Action
If the packet is subject to an ACL deny the packet is dropped now.

QOS Action
Any policer action is done during this stage as well as marking. QoS shaping and buffering is done by the traffic manager which is a separate stage in the NPU.

Note on ACL and QoS Action
The LC NPUs use TOPs (task optimized processors) in a common design of four linear TOPs that form a packet pipeline (the parse TOP feeds into the search TOP, which feeds into the resolve TOP, which finally feeds into the modify TOP). ACL and QoS actions are applied here even though the ACL and QoS classification steps described earlier happened at the start of the pipeline.

PERSONAL SPECULATION: The order of operations is ACL & QoS classification, forwarding lookup, iFIB lookup and finally ACL and QoS action, meaning that packets dropped by an ACL deny statement, for example, could seemingly have been dropped earlier in the process. The likely reason is that all of these steps are covered by the packet's pass through the TOPs pipeline and only the modify TOP (the final one) drops packets, which is why ACL and QoS actions happen after the iFIB and egress LC/NPU lookups.

More info on TOPs here: /index.php?page=forwarding-hardware#tops

L2 rewrite
During the L2 rewrite in the ingress stage the LC applies the fabric header (super frame). In the case of an MPLS labelled packet which requires a SWAP operation, the SWAP is performed in the ingress pipeline before the packet is forwarded to the switch fabric. The egress MAC address (in the example of routing from a layer 3 ingress port to a layer 3 egress port) is looked up on the egress line card; this improves the scaling limits of the line cards by not having to know all possible egress adjacency rewrite information.

QOS Action
Any other QoS actions are made now. Marking is performed first, then queuing, shaping and WRED are executed. This means that any mutations or re-markings are acted upon by WRED. Note that packets that were previously policed or dropped by an ACL are no longer seen in this stage. It is important to understand that dropped packets are removed from the pipeline, so when there are counter discrepancies, the packets may have been processed/dropped by an earlier feature.

iFIB action
Either the packet is forwarded over the fabric or handed over to the LC CPU here. If the packet is destined for the RSP CPU, it is forwarded over the fabric to the RSP destination (the RSP is a line card from a fabric point of view, the RSP requests fabric access to inject packets in the same fashion an LC would).

General
The ingress line card also decrements the TTL on the packet; if the TTL is exceeded the packet is punted to the LC CPU for an ICMP TTL exceeded message. The number of packets that can be punted is subject to LPTS policing.

 

Egress QoS Order-of-Operations:

Forwarding lookup
The egress line card performs a full FIB lookup down to the leaf to get the rewrite string. This full FIB lookup provides everything needed to forward the packet, such as the egress interface, encapsulation and adjacency information.

L2 rewrite
With the information received from the forwarding lookup the router can rewrite the packet headers applying Ethernet headers, VLANs, MPLS labels etc.

Security ACL classification
The FIB lookup has already determined the egress interface and which features are applied to it. As with ingress, keys can be built to query the TCAM for an ACL result based on the ACL applied to the egress interface.

QOS Classification
As with the ingress process, the TCAM is queried to identify the QoS class-map and matching criteria for this packet on an egress QoS policy.

ACL action
Any ACL action is executed now.

QOS action
Any QoS action is executed now. Any policing is performed, then marking and then queuing, shaping and WRED. This means packets that are re-marked will be subject to WRED based on their re-marked values. Note that packets dropped by a policer are not seen in the remainder of this stage.
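
A minimal sketch of this ordering on egress (hypothetical class and policy names; the thresholds are illustrative and supported units depend on platform/release): the class below re-marks traffic to AF41 and also runs DSCP-based WRED, so the WRED decision is made against the re-marked AF41 value rather than the original marking.

policy-map EGRESS-EXAMPLE
 class VIDEO
  set dscp af41
  bandwidth remaining percent 30
  random-detect dscp af41 10 ms 100 ms
 !
 class class-default
 !
!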

General
MTU verification is only done on the egress line card, to determine whether fragmentation is needed. The egress line card punts the packet to its LC CPU if fragmentation is required. Fragmentation is done in software and no features are applied to these packets on the egress line card. The number of packets that can be punted for fragmentation is NPU bound and limited by LPTS.

 

Hardware Queues

More info can be found here. When packets enter the Typhoon NPU the ICU (Internal Classification Unit) can pre-parse the packets and classify them into one of 64 ICFDQs (Input Classification Frame Descriptor Queues). Each ICFDQ has 4 CoS queues. Network control traffic is CoS >= 6, high priority traffic is CoS = 5, low priority traffic is CoS < 5, the fourth queue is unused.

Inside the ICFDQs, EFD (Early Fast Discard) is a mechanism which looks at the queue before the packets enter the [TOPs] pipeline of the NPU. The EFD was software based with the ICU in Typhoon cards and has become hardware based with the ICU in Tomahawk cards.

The EFD can perform some minimalistic checking on precedence, EXP or CoS values to determine if a low or high priority packet is being processed. If the pipeline is clogged it can drop some lower priority packets to save the higher priority ones before they enter the pipeline by using Strict-priority Round-Robin.

When packets leave the TOPs pipeline there is an egress Traffic Manager ASIC on Typhoon NPUs (two egress TMs per Tomahawk NPU) which feeds the Fabric Interconnect ASIC. The FIA has 4 VoQs per VQI: two are strict priority queues, one is for best effort traffic and the last is unused. When packets arrive from the switch fabric at the egress FIA, the FIA has the same group of four class queues for each NPU it serves and again only three of the queues in each group are used.

 

Policers:

When defining policers at high(er) rates make sure the committed burst and excess burst are set correctly.

The formula to follow for 1.5 seconds of burst (the Cisco recommendation) is:

Bc = (CIR in bps / 8) * 1.5 seconds (result in bytes)
and
Be = 2 x Bc

1.5 seconds is quite high for some hardware buffer sizes. 250ms of burst (for comparison) would be:

Bc = (CIR in bps / 8) * 0.25 seconds
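
As a worked example (hypothetical policy name; assuming a 100Mbps CIR and the 1.5 second recommendation above): Bc = 100,000,000 / 8 * 1.5 = 18,750,000 bytes and Be = 2 x Bc = 37,500,000 bytes. A minimal sketch of the resulting policer configuration:

policy-map POLICE-100M
 class class-default
  police rate 100 mbps burst 18750000 bytes peak-burst 37500000 bytes
  !
 !
!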


The default burst values are not optimal. For example, with a policer rate of 1 pps, if no packets are received for 1 second and then 2 packets are received in the next second, an "exceed" will be observed.

When defining a policer it is important to note that layer 2 headers are included on ingress and egress; this means that traffic without dot1q or QinQ headers can achieve a higher IP throughput rate for the same policer value compared to traffic with VLAN headers. For policing, shaping, and the bandwidth command in both the ingress and egress traffic directions, the following fields are included in the accounting: MAC DA, MAC SA, EtherType, VLANs, L2 payload, CRC.

The ASR9K policer implementation supports a granularity of 64Kbps on Trident line cards. When the specified rate is not a multiple of 64Kbps the rate is rounded down to the next lower 64Kbps multiple. Typhoon cards support a granularity of 8Kbps.

When using two priority queues, the 2nd priority queue supports shaping or policing, whereas the first priority queue supports only policing.
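
A minimal sketch following the note above (hypothetical class and policy names, illustrative rates): the priority level 1 class carries only a policer, while the priority level 2 class uses a shaper instead.

policy-map EGRESS-PQ-EXAMPLE
 class VOICE
  priority level 1
  police rate 100 mbps
  !
 !
 class TELEPRESENCE
  priority level 2
  shape average 200 mbps
 !
 class class-default
 !
!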

 

Shapers:

When defining a shaper in a policy-map the shaper takes the Layer 2 headers into consideration. With a shaper rate of 1Mbps, for example, traffic without dot1q or QinQ tags can technically carry more IP traffic than traffic with QinQ tags. The same applies when defining a bandwidth statement in a class, and when defining a policer: Layer 2 headers are included.

In the ingress direction for both policers and shapers the packet size counters use the incoming packet size including the Layer 2 headers. In order to account for the Layer 2 header in an ingress shaper the ASR9000s have to use a Traffic Manager overhead accounting feature that will only add overhead with 4 byte granularity, which can cause inaccuracy. In the egress direction for both policers and shapers the outgoing packet size also includes the Layer 2 headers.

For policing, shaping, and the bandwidth command for ingress and egress traffic directions, the following fields are included in the accounting: MAC DA, MAC SA, EtherType, VLANs, L2 payload, CRC.
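
As a rough worked example of this accounting (1500-byte IP packets through a 10Mbps shaper; preamble and IFG are not counted):

  Untagged frame counted:      1500 + 18 bytes (MAC DA + SA + EtherType + CRC) = 1518 bytes
  Single dot1q frame counted:  1500 + 22 bytes = 1522 bytes
  Achievable IP rate, untagged:      10Mbps x 1500/1518 ≈ 9.88Mbps
  Achievable IP rate, single-tagged: 10Mbps x 1500/1522 ≈ 9.86Mbps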

 

Port Level Shaping:

A shaping action requires a queue on which the shaping is applied. This queue must be created by a child level policy. Typically the shaper is applied at the parent or grandparent level, to allow for differentiation between traffic classes within the shaper. If there is a need to apply a flat port-level shaper, a child policy should be configured with 100% bandwidth explicitly allocated to class-default.
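
A minimal sketch of such a flat port-level shaper (hypothetical policy names and rate):

policy-map PORT-SHAPER-CHILD
 class class-default
  bandwidth remaining percent 100
 !
!
policy-map PORT-SHAPER
 class class-default
  shape average 500 mbps
  service-policy PORT-SHAPER-CHILD
 !
!
interface TenGigE0/0/0/0
 service-policy output PORT-SHAPER
!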

 

VoQs and Implicit Trust:

More details here: /index.php?page=asr9000-series-hardware-overview#voq


Every 10G entity in the system is represented in the ingress Fabric Interfacing ASIC (FIA) by a Virtual Output Queue. This means in a fully loaded chassis with 8 line cards, each with 24x 10G ports there are 192 VoQs represented at each FIA of each linecard.

The VoQs have 4 different priority levels: Priority 1, Priority 2, default priority and multicast. The priority level used is assigned in the packet's fabric headers (internal superframe headers) and is set via QoS policy-maps (MQC; modular QoS configuration) on the ingress interface.

When one defines a policy-map and applies it to a (sub)interface, and certain traffic in that policy-map is marked as priority level 1 or 2, the fabric headers will reflect that, so this traffic is put into the higher priority queues of the forwarding ASICs as it traverses the FIA and fabric components.

If one does not apply any QoS configuration, all traffic is considered to be "default" in the fabric queues. In order to leverage the ASR9000's ASIC priority levels, one needs to configure (ingress) QoS on the ports to apply the desired priority level.

A packet classified into a P1 class on ingress is mapped to PQ1 system queue. A packet classified into a P2 class on ingress is mapped to PQ2 system queue. A packet classified into a non-PQ1/2 class on ingress will get mapped to the default/best effort queue along the system QoS path.

Note: The marking is implicit once one assigns a packet into a given queue on ingress; this sets the fabric header priority bits on the packet. No specific "set" action is required to set the internal fabric header priority value; the priority level is taken from the MQC class configuration.
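
A minimal sketch of an ingress policy that sets the internal fabric priority this way (hypothetical class names, match values and policer rates):

class-map match-any NET-CTRL
 match precedence 6 7
!
class-map match-any VOICE
 match dscp ef
!
policy-map INGRESS-FABRIC-PRIO
 class NET-CTRL
  priority level 1
  police rate percent 5
  !
 !
 class VOICE
  priority level 2
  police rate percent 20
  !
 !
 class class-default
 !
!
interface TenGigE0/0/0/1
 service-policy input INGRESS-FABRIC-PRIO
!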

If one does not configure any service policies for QoS, the ASR9000 will set an internal CoS value based on the IP precedence, 802.1p priority field or the MPLS EXP bits. Depending on the routing or switching scenario, this internal CoS value could potentially be used to perform marking on newly imposed headers on egress.

For bridged packets on ingress the outermost CoS value would be treated as trusted.
For routed packets on ingress the DSCP/Precedence/outermost EXP value would be treated as trusted based on packet type.
Default QoS will be gleaned from ingress interface before QoS marking is applied on the ingress policy-map.
By default ASR 9000s should never modify DSCP/IP precedence of a packet without a policy-map configured.
Default QoS information should be used for imposition fields only.

 

Default QoS Behaviour:

In the case of tagged layer 2 traffic arriving at a layer 2 interface (a bridge, VPLS or VPWS), the outermost VLAN tag's CoS value is mapped to the internal CoS value. The internal CoS value is pushed on at egress to all newly imposed MPLS labels and/or all newly imposed VLAN tags. If the egress is layer 3, the original IPP/DSCP values are still present.

In the case of untagged layer 2 traffic arriving at a layer 2 interface like a bridge, VPLS or VPWS; the internal CoS is set to 0 by default. The internal CoS value (which is 0) is pushed on at egress to all newly imposed MPLS labels and/or all newly imposed VLAN tags. If the egress is layer 3, the original IPP/DSCP values are still present.

In the case of untagged traffic arriving at a layer 3 routed interface, the incoming IPP value is mapped to the internal CoS value. The internal CoS value is pushed on at egress to all newly imposed MPLS labels and/or all newly imposed VLAN tags. If the egress is layer 3, the original IPP/DSCP values are still present.

In the case of a tagged layer 3 sub-interface, the 802.1p CoS value is ignored and the incoming IPP value becomes the internal CoS value. The internal CoS value is pushed on at egress to all newly imposed MPLS labels and/or all newly imposed VLAN tags. If the egress is layer 3, the original IPP/DSCP values are still present.

In the case of an incoming MPLS interface, whether a VLAN tag is present or not, the topmost EXP value is mapped to the internal CoS value. The internal CoS value is pushed on at egress to all newly imposed MPLS labels and/or all newly imposed VLAN tags.

In the case of an MPLS explicit null label, the explicit null EXP is treated the same as a topmost non-null label and mapped to the internal CoS.

 

LAG/Bundle QoS:

For ASR9000s running IOS-XR versions lower than 6.0.1, when configuring any QoS on a bundle interface the policy is applied to each of the member ports of the bundle. The caveat is that for policers, shapers and the bandwidth command the configured rate is not an aggregate total; each LC NP allows its member interface to run up to the configured rate.

The QoS configuration guide for bundled links states the following:


All Quality of Service (QoS) features, currently supported on physical interfaces and subinterfaces, are also supported on all Link Bundle interfaces and subinterfaces. QoS is configured on Link Bundles in the same way that it is configured on individual interfaces. However, the following points should be noted:

  • When a QoS policy is applied on a bundle (ingress or egress directions), the policy is applied at each member interface. Any queues and policers in the policy map (ingress or egress directions) will be replicated on each bundle member.
  • If a QoS policy is not applied to a bundle interface or bundle VLAN, both the ingress and egress traffic will use the per link members port default queue.
  • Link bundle members may appear across multiple Network Processing Units and linecards. The shape rate specified in the bundle policy map is not an aggregate for all bundle members. The shape rate applied to the bundle will depend on the load balancing of the links. For example, if a policy map with a shape rate of 10 Mbps is applied to a bundle with two member links, and if the traffic is always load-balanced to the same member link, then an overall rate of 10 Mbps will apply to the bundle. However, if the traffic is load-balanced evenly between the two links, the overall shape rate for the bundle will be 20 Mbps.

 

The IOS-XR 6.0.1 release notes include the following statement which indicates that aggregate bundled QoS is supported:

Aggregated Bundle QoS feature allows the shape, bandwidth, police rates and burst values to be distributed between the active members of a bundle, where a QoS policy-map is applied.


The "aggregate-bundle-mode" keword command has been added in IOS-XR 6.0.1 to the existing "hw-module all qos-mode" command. These are the options prior to 6.0.1:

RP/0/RSP0/CPU0:ASR9000(config)#hw-module all qos-mode ?
  ingress-queue-enable       enable ingress queueuing support for MOD-80 4*10G MPA
  per-priority-buffer-limit  set different buffer limit for different priority
  pwhe-aggregate-shaper      Configure pseudo wire headend interface qos parameters
  wred-buffer-mode           Configure L4 WRED accounting mode to buffer mode

6.0.1:
RP/0/RSP0/CPU0:ASR9000(config)# hw-module all qos-mode bundle-qos-aggregate-mode

The new command enables a feature which works as follows:

  • Whenever a policy is applied on a bundle member, a ratio is first calculated from the total bundle bandwidth and that member's bandwidth: ratio = bundle bandwidth / member bandwidth (see the worked example after this list).
  • For example, if the bundle bandwidth is 20 Gbps with two 10 Gbps members, the reduction is 2/1 (0.5 * policy rate) for both members.
  • With a bundle bandwidth of 50G (10G + 40G members) the ratios become 5/1 and 5/4 respectively, so unbalanced member link speeds are supported.
  • The feature automatically recalculates the member-port QoS rate whenever a change occurs in the bundle, i.e. a bundle member goes active/down or is added/removed.
  • If traffic is load-balanced well among the bundle members, then in aggregate the bundle-ether traffic is shaped to the policy rate, matching the QoS policy configuration (instead of being members * rate).
  • If the traffic is not load-balanced well, a single member link may hit its policer or shaper rate before the others do, and before the traffic flow has reached its total allowed limit.
  • The command takes effect chassis wide; it can't be enabled/disabled per bundle.
  • When the aggregated bundle mode changes, QoS policies on bundle (sub)interfaces are modified automatically.
  • Reloading the line card is not required.
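
A worked example of the ratio logic (hypothetical bundle and rates), assuming a 10Mbps shaper applied to a Bundle-Ether with two active 10G members:

  Bundle bandwidth = 2 x 10G = 20G, member bandwidth = 10G, ratio = 20/10 = 2
  Per-member programmed rate = 10Mbps / 2 = 5Mbps
  Aggregate with even load-balancing ≈ 2 x 5Mbps = 10Mbps (matches the policy rate)
  Without aggregate mode (pre-6.0.1 behaviour): each member is programmed with the full 10Mbps, allowing up to ~20Mbps in aggregate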

 

Hardware Specifications:

The per-port buffer details are as follows:

Trident L and B cards support 17M PPS (~15Gbps) and 50ms per-port buffers.
Trident E cards support 17M PPS (~15Gbps) and 150ms per-port buffers.
Typhoon SE cards support 44M PPS (~60Gbps) and 339ms per-port buffers when there are 2 ports-per-NPU.
Typhoon TR cards support 44M PPS (~60Gbps) and 170ms per-port buffers when there are 2 ports-per-NPU.
Typhoon SE cards support 44M PPS (~60Gbps) and 226ms per-port buffers when there are 3 ports-per-NPU.
Typhoon TR cards support 44M PPS (~60Gbps) and 113ms per-port buffers when there are 3 ports-per-NPU.
Tomahawk TR cards support 6Gbps and 100ms per-port buffers.
Tomahawk SE cards support 12Gbps and 200ms per-port buffers.

Trident NPUs support 32k egress queues and 32k ingress queues on the 10G line cards.
Trident NPUs support 64k egress queues and 32k ingress queues on the 40x1G line cards.
Typhoon NPUs support 192k egress queues and 64k ingress queues.
Tomahawk TR cards support 8 queues per-port.
Tomahawk NPUs (SE cards) support 1M queues, or 4M queues on the 800G line card.

Trident NPUs support up to 64k policers on the E series line cards.
Typhoon NPUs support up to 256k policers on the SE series line cards.
Tomahawk NPUs support up to 512k policers, or 32k on the TR line cards.

ASR9001:

RP/0/RSP0/CPU0:abr1#show qos capability location 0/0/CPU0
Capability Information:
======================

Max Policy maps supported on this LC: 16384
Max policy hierarchy: 3
Max policy name length: 64
Max classes per child-policy: 1024
Max classes per policy: 1024
Max classes per grand-parent policy: 1
Max police actions per class: 2
Max marking actions per class: 2
Max matches per class : 8
Max class-map name length: 64
Max members in a bundle: 64
Max instance name length: 32

 

Debugging Commands:

General Checks:
"show drops all location all"
"show drops"
"show asic-errors all location 0/0/CPU0"
"show asic-errors all location 0/RSP0/CPU0"

Look for active PFM alarms on LC as well as RSP:
"show pfm location all"

Check FPGA Software Is Updated:
"show controllers np summary all"
"admin show hw-module fpd location all"

Check Hardware Diagnostics:
"admin show diagnostic result location all"

Clearing Counters:
"clear counters all"
"clear controller np counters all"
"clear controller fabric fia location"
"clear controller fabric crossbar-counters location"

Checking Interface Stats:
"show interface Gi0/0/0/0"

Check LC NP Stats:
"sh controller np ports all loc 0/0/cpu0"
"show controllers pm vqi location all"
"show controllers pm interface Te0/0/0/0"
"show controllers pm location 0/0/CPU0 | i "name|switch"
"show controllers np fabric-counters all all"
Some interesting counters for this command are:
xaui_a_t_transmited_packets_cnt -- Num pkt sent by NPU to bridge
xaui_a_r_received_packets_cnt -- Num pkt sent by bridge to NPU

When using "show controllers np fabric-counters all all location 0/0/CPU0", all zero counts can be an indication there are Tx/Rx problems between the NP and FIA.

"show controllers np counters all"
Some interesting counters for this command are:
800 PARSE_ENET_RECEIVE_CNT -- Num of packets received from external interface
970 MODIFY_FABRIC_TRANSMIT_CNT -- Num of packets sent to fabric
801 PARSE_FABRIC_RECEIVE_CNT -- Num of packets received from fabric
971 MODIFY_ENET_TRANSMIT_CNT -- Num of packets sent to external interface

When using "show controller np counters all loc 0/0/CPU0" the output "No non-zero data counters found" for an NP can be an indication it has locked up.

Check LC FIA & Bridge Stats:
"show controllers fabric fia link-status location 0/0/CPU0"
"show controllers fabric fia stats location 0/0/CPU0"
"show controllers fabric fia q-depth location 0/0/CPU0"
"show controllers fabric fia drops ingress location 0/0/CPU0"
"show controllers fabric fia drops egress location 0/0/CPU0"
"show controllers fabric fia errors ingress location 0/0/CPU0"
"show controllers fabric fia errors egress location 0/0/CPU0"
"show controllers fabric fia bridge *" Trident LC Only

Check RSP Arbiter and Xbar Stats:
"show controllers fabric arbiter serdes location ..."
"show controllers fabric crossbar link-status instance 0 location 0/RSP0/CPU0"
"show controllers fabric crossbar link-status instance 1 location 0/RSP0/CPU0"
"show controllers fabric crossbar statistics instance 0 location 0/RSP0/CPU0"
"show controllers fabric ltrace crossbar last 100 location all"

Check how the interface policy is applied in hardware:
"show qos interface g0/0/0/0 output"