Date created: Monday, July 30, 2018 11:10:38 AM. Last modified: Friday, June 28, 2019 8:55:14 AM

IGP Load-Balancing Techniques

References:
BRKSPG-2904 - ASR9000 Selected Topics and Troubleshooting (2016)
https://supportforums.cisco.com/document/111291/asr9000xr-loadbalancing-architecture-and-characteristics#polar
https://supportforums.cisco.com/document/111291/asr9000xr-loadbalancing-architecture-and-characteristics#L2
https://supportforums.cisco.com/t5/service-providers-documents/asr9000-xr-understanding-unequal-cost-multipath-ucmp-dmz-link/ta-p/3138853
http://wiki.kemot-net.com/igp-load-sharing-cef-polarization
https://networkingbodges.blogspot.com/2012/12/all-sorts-of-things-about-lacp-and-lags.html
RFC6391 - Flow-Aware Transport of Pseudowires over an MPLS Packet Switched Network
https://packetpushers.net/fat-or-entropy-label/

This text compares the load-balancing features available at various levels of the networking protocol stack, in terms of how they help scale up capacity between MPLS P and PE nodes within a service provider network by load-balancing on the traffic being transported. NB: the terms hashing and load-balancing are sometimes used interchangeably.

Contents:
Layer 2 - LAG/LACP Bundles
Layer 3 - (Un-)ECMP Routing
MPLS L2 VPNs – Flow Aware Transport
MPLS L2 VPNs - Pseudowire Control Word
MPLS L2 & L3 VPNs – Entropy Labels
MPLS L3 VPNs - Label Allocation Policy
Comparison / Evaluation

Layer 2 - LAG/LACP Bundles
L2 bundles efficiently support an increase in bandwidth between switches or other layer 2 devices (terminating L2 traffic at L3 just to use ECMP and then dropping back down to L2 would be highly inefficient). Between routers and L3 devices in the IGP, however, there is some debate as to whether L2 LACP bundles or L3 ECMP should be used to scale up capacity.

Some advantages to using L2 bundles include:

  • L2 bundles support minimum requirements - A minimum number of links or minimum amount of bandwidth can be configured. This means that if one or more links fail the entire bundle can be brought down automatically to ensure it doesn't become congested. This forces traffic to re-route via an alternate path.

  • Hashing is usually "correct" – L2 bundles between switches may hash on layer 2 headers, which is usually error free, meaning that they don't look deeper into the payload and mistakenly deduce that the payload is IP when it isn't, for example. Switches can hash on IP headers if the Ethertype is IPv4/6; if it is MPLS, switches often revert to using the layer 2 headers for entropy information. A L2 bundle to or from a layer 3 device which is a PE (e.g. a LAG from a PE node to a P node, or from a PE to a CPE), rather than between two P nodes, will have visibility into any L2/L3 VPN traffic and can thus hash at the correct header offsets for IP or MAC addresses, providing consistent hashing. A L2 bundle between two P nodes will often hash on the layer 2 header information or not look beyond the MPLS stack (although many vendors do now support looking into the VPN payload); the problem here is that there can be a lack of entropy, as the source and destination MACs are very consistent between two adjacent P nodes.

  • IGP adjacencies formed over the logical LAG interface see the LAG speed decrease when a member interface goes down, which means the IGP cost is automatically adjusted. This means that traffic can reroute via an alternate path with a better cost. When using ECMP and a link fails, one fewer next-hop exists and less bandwidth is available, but the IGP cost of the remaining links hasn't changed. This can lead to congestion as the same traffic load is now balanced over fewer ECMP paths (see the sketch below).
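
Below is a minimal Python sketch (illustrative only; the reference bandwidth and link speeds are hypothetical, and the cost formula is the classic bandwidth-derived one rather than any specific vendor's default) of why a LAG reacts to a member failure in the IGP while a set of ECMP links does not:

    # Illustrative sketch only: how a bandwidth-derived IGP cost reacts to a
    # member-link failure on a LAG versus on a set of ECMP links.
    REFERENCE_BW = 100_000  # Mbps (hypothetical 100G reference bandwidth)

    def igp_cost(link_bw_mbps):
        # Classic bandwidth-derived cost, minimum 1.
        return max(1, REFERENCE_BW // link_bw_mbps)

    # LAG of 4x10G: the adjacency runs over the bundle, so losing a member
    # changes the bundle bandwidth and therefore the advertised cost.
    print(igp_cost(4 * 10_000), "->", igp_cost(3 * 10_000))   # 2 -> 3

    # ECMP over 4x10G point-to-point links: losing one link just removes a
    # next-hop; each surviving link still advertises the same per-link cost,
    # so nothing steers traffic away even though capacity dropped by 25%.
    print(igp_cost(10_000), "->", igp_cost(10_000))           # 10 -> 10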

Some disadvantages of using L2 bundles are:

  • Links of different speeds can't be used – Unlike L3 links, where UCMP can be used to combine links of different speeds/costs.

  • Links of different types can't be used – Un/ECMP can load-balance across multiple link types, whereas LACP runs over Ethernet only. It is also important that the MTU is the same on all links.

  • L2 bundles require LACP – Even though LACP might not be the most complex of protocols, it is another protocol that needs supporting and another potential interop issue in a mixed vendor network.

  • L2 hashing is a node-by-node distributed process - Unless all nodes are the same make/model running the same firmware, with the same physical link setup, it can be quite complex to ensure that link polarisation doesn't occur along the path. When polarisation occurs some links may be heavily utilised whilst others are barely used at all. Additional entropy may be required on a node-by-node basis, such as including a device ID (typically the loopback0 IP address) in the hashing calculation (see the sketch after this list).

  • LACP can be slow to detect failure – L3 ECMP paths can use BFD to detect link/node failures more rapidly. Although LACP was improved with fast timers and then again with support for micro-BFD sessions (a BFD session per LAG member link), a cleaner solution might be L3 point-to-point links, because only BFD is required rather than uBFD and LACP.

  • Hashing is asymmetric – Traffic from node N1 towards node N2 might be hashed onto link L1 but return traffic from N2 towards N1 might be hashed onto link L2. The volume of traffic for a flow may be asymmetric or symmetric in each direction, which may or may not result in each direction being hashed onto different links.

  • LAG hashing is typically per flow - The throughput of any single flow cannot exceed the bandwidth of the member link it is hashed onto, unless per-packet load-balancing is used (which is rarely suitable).

  • There can be issues with OAM traffic - With MPLS traffic passing over a LAG, OAM packets need to be sent over each member link to test that every member is working. Hashing on the bottom-of-stack label, or on part or all of the stack, can prevent this; hashing on the top-of-stack label solves this problem (there is an example of this for the Cisco ASR9000 in the references).

  • No customer/application link can be faster than any core link - If a customer has a 2Gbps CIR over a 10Gbps access circuit and the core has 10Gbps bundles made from 10x1Gbps links, a single flow from the customer/application will congest the core member link it is hashed onto. This will also limit the customer's service and impact other customers/applications hashed onto the same bundle member link.

  • Interoperability - LAG hashing isn't standardised. No two vendors implement hashing in the same way, which means that at the network edge (e.g. a public peering exchange) egress traffic is distributed differently by the local device than ingress traffic from the external network device.
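
Below is a minimal Python sketch (illustrative only; the hash function, topology and flows are made up and do not represent any vendor's implementation) of the polarisation problem described in the list above, and of mixing a per-node value such as the loopback0 address into the hash to restore an even distribution:

    # Illustrative sketch only: two nodes in series, each with two equal
    # links, hashing the same flows.  Without a per-node seed, node B only
    # ever receives the flows that node A hashed onto its link 0, so B's
    # identical hash maps every one of them onto B's link 0 as well.
    import zlib

    flows = [("10.0.%d.1" % i, "192.0.2.%d" % (i % 50), 6, 1024 + i, 443)
             for i in range(1000)]

    def pick_link(flow, n_links, seed=0):
        # CRC32 over the 5-tuple plus an optional per-node seed (e.g. loopback0).
        key = repr(flow).encode() + seed.to_bytes(4, "big")
        return zlib.crc32(key) % n_links

    # Flows that node A hashed onto its link 0 and forwarded towards node B.
    to_b = [f for f in flows if pick_link(f, 2) == 0]

    # Polarised: B re-hashes with the identical function, so every flow it
    # sees lands on its link 0 and its link 1 carries nothing.
    print([sum(1 for f in to_b if pick_link(f, 2) == l) for l in range(2)])

    # De-polarised: B mixes its own router ID into the hash and gets a
    # roughly even split of the same flows.
    print([sum(1 for f in to_b if pick_link(f, 2, seed=0x0A0A0A02) == l)
           for l in range(2)])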

 

Layer 3 - (Un-)ECMP Routing
Equal Cost Multi-Path routing is commonly used between two adjacent nodes to load-balance traffic over multiple identically "costed" MPLS/L3 links. Even if the links are costed differently, traffic can still be unequally distributed across them (UCMP), which can be helpful if equal links between adjacent nodes aren't available.

Some advantages to ECMP/Unequal-CMP routing are:

  • Easy to configure – ECMP can be implemented by essentially adding more point-to-point L3/MPLS links between two existing nodes.

  • Minimal added complexity – Troubleshooting is no different to existing point-to-point L3/MPLS links, except that when a specific flow has an issue one must first check which link it is being hashed onto at each node in the path, now that multiple choices exist.

  • No additional protocols required – L2 bundles require LACP which, even if it is not the most complex of protocols, is another protocol that needs supporting and another potential interop issue in a mixed vendor network.

  • Different link types can be mixed - Although not recommended, different link types and speeds can be mixed, as long as they all have enough bandwidth and capacity to support a failure. It is important that the MTU is the same on all links though.

There are several issues with Un/ECMP link bundles though:

  • All platforms/vendors implement it uniquely – Every vendor tends to use slightly different fields within the IPv4/6 header or MPLS stack for entropy. This is usually configurable, so in a mixed vendor network the same behaviour (or close to it) can be implemented on each node. This requires additional administration overhead to manage though.

  • All platforms/vendors tend to use a different hashing calculation - Even though one may configure all nodes to hash on the bottom MPLS label or certain IP header fields, for example, the way each device calculates the index into its hash table may be different. For example, node N1 might hash flow F1 onto link N1-L1 towards node N2 and flow F2 onto link N1-L2 towards node N2, but node N2 might hash them both onto link N2-L1 towards node N3 and congest that link.

  • Some devices are limited in how deep into an MPLS label stack they can read (the Cisco ASR9000, for example). Segment Routing can greatly increase the label stack depth, which has exposed bugs in various vendors' implementations (e.g. CSCvg34997) and led to hashing not working correctly (in some cases this is easily fixed with a software update; in other cases it may be a hardware limitation).

  • Hashing can be based on the "wrong" values – If a fixed value like the source or destination IP/MAC isn't used, packets can become subject to re-ordering inside the service provider core; e.g. if the checksum field inside the IPv4 header is used, the value will change for almost every packet within the same flow. This usually occurs due to a failure to correctly parse the frame/packet headers.

  • Un/ECMP hashing is a node-by-node distributed process - Unless all nodes are the same make/model running the same firmware, with the same physical link setup, it can be quite complex to ensure that link polarisation doesn't occur along the path. When polarisation occurs some links may be heavily utilised whilst others are barely used at all. Additional entropy may be required on a node-by-node basis, such as including a device ID (typically the loopback0 IP address) in the hashing calculation.

  • ECMP paths can become congested during a failure - Unlike a LAG/LACP bundle, which can be configured with a minimum number of links to operate, if one or more ECMP L3 links fail the remaining bandwidth between two adjacent nodes might not be enough to carry the existing traffic load. Traffic will be rebalanced off of the failed link(s) onto the remaining link(s), congestion can occur, and there is no automatic mitigation method available to correct this (unless, for example, RSVP signalled LSPs run over the top and the IGP signals the increase in bandwidth utilisation back to the head-end, forcing a re-route onto a backup LSP, but that is something the two ECMP adjacent nodes are unaware of!). IS-IS on Cisco's IOS-XR has the link-group feature, which provides the same effect as a LAG minimum-links setting by defining a minimum number of ECMP links. This isn't supported in OSPF though, so it's protocol specific; it doesn't seem to be supported by many vendors either, so it's also vendor specific. See the sketch after this list for the rebalancing behaviour.

  • Hashing is asymmetric – Traffic from node N1 towards node N2 might be hashed onto link L1 but return traffic from N2 towards N1 might be hashed onto link L2. The volume of traffic for a flow may be asymmetric or symmetric in each direction, which may or may not result in each direction being hashed onto different links.

  • Un/ECMP is typically per flow - The throughput of any single flow cannot exceed the bandwidth of the member link it is hashed onto, unless per-packet load-balancing is used (which is rarely suitable).

  • It can be argued that the increased state from per-link IGP (and possibly per-link BFD) sessions doesn't scale well; however, modern routers can handle hundreds of IGP and BFD sessions so this is rarely an issue.

  • Interoperability - L3 hashing isn't standardised. Hashing at L3 is slightly more standardised than hashing at L2 (e.g. a common 5-tuple used for hashing is Src IP, Dst IP, Protocol No., Src Port and Dst Port); however, there is no guarantee that any two vendors implement L3 hashing in the same way. This means that at the network edge (e.g. a public peering exchange) egress traffic is distributed differently by the local device than ingress traffic from the external network device.
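
The sketch below (illustrative only; the hash, flows and link names are all made up) shows per-flow selection over ECMP next hops and the rebalancing referred to in the congestion bullet above: when a member link fails, the same flows are simply spread over fewer links, with no change to the IGP cost to trigger a re-route.

    # Illustrative sketch only: per-flow ECMP selection by hashing a 5-tuple
    # and taking it modulo the number of usable next hops.
    import hashlib

    def next_hop(five_tuple, next_hops):
        digest = hashlib.sha256(repr(five_tuple).encode()).digest()
        return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]

    flows = [("172.16.%d.10" % i, "198.51.100.%d" % (i % 100), 6, 50000 + i, 443)
             for i in range(300)]

    # Three equal-cost links between N1 and N2, each carrying roughly 100 flows.
    links = ["N1-L1", "N1-L2", "N1-L3"]
    before = {l: sum(1 for f in flows if next_hop(f, links) == l) for l in links}

    # N1-L3 fails: the same 300 flows are re-hashed over the two survivors,
    # so each remaining link now carries roughly 50% more traffic.
    links = ["N1-L1", "N1-L2"]
    after = {l: sum(1 for f in flows if next_hop(f, links) == l) for l in links}

    print(before)
    print(after)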

 

MPLS L2 VPNs – Flow Aware Transport
Flow Aware Transport (RFC6391) enables a PE device to push an extra MPLS label (called the flow label) onto the bottom of the stack for a L2 VPN, which P nodes use as an entropy value to hash on. The PE router has visibility into the payload of the MPLS VPN, so the flow label is based upon the flow details of the L2 VPN payload. This means the PE will consistently push the same FAT label for the same flow, preventing packet reordering in the core when ECMP occurs.
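
Below is a minimal Python sketch (illustrative only; the label values and MAC addresses are hypothetical and this is not any vendor's real algorithm) of an ingress PE deriving a flow label from the customer frame and placing it at the bottom of the label stack, beneath the pseudowire label:

    # Illustrative sketch only: the same flow always yields the same flow
    # label, so the core cannot re-order it, while P nodes simply hash on
    # the whole label stack without knowing the bottom label is for entropy.
    import zlib

    def flow_label(entropy_fields):
        h = zlib.crc32(repr(entropy_fields).encode())
        return 16 + (h % (2**20 - 16))   # keep clear of reserved labels 0-15

    def push_stack(transport_label, pw_label, entropy_fields):
        # Top of stack first: [transport, PW/service, flow label (bottom of stack)].
        return [transport_label, pw_label, flow_label(entropy_fields)]

    print(push_stack(24001, 299776, ("00:11:22:33:44:55", "66:77:88:99:aa:bb")))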

Some advantages to FAT are:

  • Easy to deploy – For LDP signalled pseudowires and VPLS, flow labels typically require only one or two extra commands on a router to enable them. They are supported by many vendors including Cisco and Juniper.

  • Only PEs require FAT support – P nodes do not need to support FAT labels to make use of them; only the PEs do, which push the FAT label onto the label stack by having visibility into the MPLS VPN payload. P nodes see the FAT label as just another label in the stack (not any kind of special label) and include all the labels in the stack in their hashing calculation, ignorant of the fact that one label has been added purely to increase entropy.

  • FAT can be used asymmetrically - Support for FAT labels is signalled between the ingress and egress PE in an interface parameter sub-TLV of LDP. Each PE individually signals its ability to transmit and receive with a flow label. Asymmetric operation is supported by the RFC (although this would likely never be used – asymmetry in production networks is usually a support nightmare).

There are disadvantages to using flow labels, which include:

  • Flow labels apply to MPLS L2 VPNs only (pseudowires and VPLS) – another technique is still required for L3 VPNs.

  • There is no way to signal to P nodes that a flow label is in use – a P node capable of looking beyond the label stack may still try to parse the MPLS VPN payload (discussed further in the Entropy Labels section below).

  • The flow label adds one more label to the stack, which may matter on lower-end PEs with limited push/pop label depths.

 

MPLS L2 VPNs - Pseudowire Control Word
The PW CW is recommended by an IETF draft (https://tools.ietf.org/html/draft-ietf-pals-ethernet-cw) to prevent LSR/P nodes performing deep packet inspection and mistaking an MPLS VPN payload which has 0x4 or 0x6 as its first nibble for an IPv4 or IPv6 payload when neither is being transported. This is due to devices assuming that 0x4 or 0x6 in the first nibble of the MPLS VPN payload is the IP header version number field.
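
A minimal sketch (illustrative only, not a real LSR implementation) of the first-nibble guess described above, and of how the control word's 0x0 first nibble defeats it:

    # Illustrative sketch only: the crude first-nibble guess some LSRs make
    # when they try to hash on an MPLS payload they cannot actually identify.
    def guess_payload(first_byte):
        nibble = first_byte >> 4
        if nibble == 0x4:
            return "IPv4"   # might actually be a MAC address starting 0x4x
        if nibble == 0x6:
            return "IPv6"   # might actually be a MAC address starting 0x6x
        return "unknown - fall back to the label stack"

    # An Ethernet pseudowire payload whose destination MAC begins with 0x45
    # looks like an IPv4 header to this guess, so the LSR hashes on garbage
    # offsets and can re-order the flow.
    print(guess_payload(0x45))

    # With the control word in place, the first nibble after the label stack
    # is 0x0, so the guess falls through and the label stack is used instead.
    print(guess_payload(0x00))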

There are some advantages to using the PW CW, such as:

  • The PW CW is easily deployed – Only the ingress and egress PE need to support it, not any P node. It is also usually easy to enable, just one or two commands typically.

  • LDP and BGP signalling can be used – Both BGP and LDP have support for signalling the PW CW. This is widely implemented amongst vendors.

  • The PW CW detects re-ordering - The PW CW has a sequencing field so, depending on the vendor implementation, it has some capability to detect out-of-order packets (see the sketch after this list).

  • PWE3 and VPLS are supported - The PW CW can be used for point-to-point (pseudowires) and multi-point-to-multi-point (VPLS) services.
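
The sketch below follows the generic PW MPLS control word layout (first nibble zero, flags, fragmentation bits, length and a 16-bit sequence number, per RFC 4385); the sequence check is a deliberately simplified illustration, not a full sequencing implementation:

    # Illustrative sketch only: building control words and using the 16-bit
    # sequence number to detect that packets arrived out of order.
    import struct

    def build_cw(seq, flags=0, frg=0, length=0):
        word = (0 << 28) | (flags << 24) | (frg << 22) | (length << 16) | (seq & 0xFFFF)
        return struct.pack("!I", word)

    def out_of_order(cws):
        # Simplified: real sequencing also handles wrap-around and treats a
        # sequence number of 0 as "no sequencing".
        seqs = [struct.unpack("!I", cw)[0] & 0xFFFF for cw in cws]
        return any(b < a for a, b in zip(seqs, seqs[1:]))

    received = [build_cw(s) for s in (1, 2, 4, 3)]   # packet 4 arrived before 3
    print(out_of_order(received))                     # True - re-ordering detected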

There are some disadvantages to using the PW CW, which include:

  • The PW CW is for L2 VPNs only – As the name suggests, L3 VPNs aren't supported. This means that the PW CW would have to be used for L2 VPNs in addition to another technique for L3 VPNs.

  • The PW CW doesn't help to scale load-balancing – The PW CW is meant to prevent packet reordering by P nodes / LSRs that read the wrong byte offsets of packet headers as input to their hash functions. Without it, a sequence number or checksum field, for example, may be read, which changes per-packet within the same flow. When there is a need to improve load-balancing for pseudowires another technique must be used, such as FAT or Entropy Labels.

  • The PW CW is flawed - The PW CW sets the first nibble after the MPLS label stack to 0x0 so that any P node or LSR looking beyond the MPLS label stack for load-balancing information will see the 0x0 nibble and ignore the MPLS VPN payload. This doesn't always work as expected though. Some P nodes may see 0x0 in the first nibble and assume that because it's not 0x4 or 0x6 it's not IP and must be Ethernet. Mistaking the PW CW for the start of a MAC address and then trying to hash based on where the layer 2 headers should be will give unpredictable results, because the byte offsets will be wrong.

 

MPLS L2 & L3 VPNs – Entropy Labels
Entropy labels are a technique that involves inserting two extra labels into the MPLS label stack, after the transport label(s) but before the service label(s): an Entropy Label Indicator (ELI) followed immediately by an Entropy Label (EL). The ELI tells the node reading the label stack that the next label is an entropy value to be used for load-balancing.
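
Below is a minimal Python sketch (illustrative only; the transport and service label values are hypothetical) of the resulting label stack, with the ELI being the reserved label value 7 and the EL computed by the ingress PE from the payload it can parse (here simply a hash of a 5-tuple):

    # Illustrative sketch only: building a label stack with ELI + EL inserted
    # between the transport label(s) and the service label, as described above.
    import zlib

    ELI = 7  # Entropy Label Indicator: a reserved MPLS label value

    def entropy_label(five_tuple):
        # Keep the EL value out of the reserved label range 0-15.
        return 16 + (zlib.crc32(repr(five_tuple).encode()) % (2**20 - 16))

    def push_stack(transport_labels, service_label, five_tuple):
        return list(transport_labels) + [ELI, entropy_label(five_tuple), service_label]

    print(push_stack([24001], 299792, ("10.0.0.1", "10.0.0.2", 6, 51000, 443)))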

Advantages to using entropy labels include:

  • Entropy labels can be signalled using BGP, LDP and RSVP – Entropy labels are signalled in the label mapping message sent from the egress PE (the downstream node) to the ingress PE. They can also be signalled in RSVP-TE Path messages, but that requires bidirectional support; otherwise a Resv can indicate that an egress PE supports entropy label processing.

  • Widely supported – Entropy labels are supported by most major vendors and are typically easy to implement with just a few lines of config.

  • Entropy labels support both L2 and L3 VPNs – From their inception entropy labels have been agnostic of their payload. The ingress PE decides whether to include an ELI and EL in the label stack for an LSP that has been signalled with the ELC (Entropy Label Capability) flag, meaning it's up to the ingress PE to understand the VPN payload and be able to perform the header inspection required to generate an EL value.

  • Entropy labels require support from both PE and P nodes – P nodes must see the ELI label in the stack and recognise its special meaning in order to then hash only on the label stack (with the EL value in any given stack adding additional entropy based on the flow/MPLS VPN payload data). The fact that both PEs and Ps must support entropy labels can be seen as a disadvantage, but it is also an advantage in that there is no equivalent way to signal to P nodes when FAT labels are used. Also, in the case that P nodes do not support entropy labels, they may look deeper into the stack to the MPLS VPN payload and hash on its contents (if they are capable of doing so), providing some backwards compatibility during an entropy label deployment phase.

  • PEs implicitly protect against packet re-ordering - PEs guarantee that traffic within the same flow will always take the same path across the network. When PEs and P nodes support entropy labels, no P node needs to look into the MPLS VPN payload and the ingress PE will always push the same EL onto the label stack for packets within the same flow. FAT doesn't provide this guarantee, as only the PEs know a flow-based entropy label is being used, meaning the P nodes don't know that they don't need to look into the MPLS VPN payload (if they are capable of doing that).

There are disadvantages to using Entropy Labels, which include:

  • Label stack depth limitations – Lower end PE devices will have a limit to the number of labels they can push and pop. If using an LDP over RSVP signalled transport LSP, the ingress PE needs to push 2 labels onto the stack for the transport LSP, 2 labels for the ELI and EL, and finally a service label. Not many low-end PEs can push 5 labels. The egress PE needs to be able to pop at least 3 labels (ELI, EL and service label), which again is not possible on all low-end PEs. For some devices a "simple" software update may increase the push/pop/swap label depth limitations; for others it might require a hardware upgrade or even replacement.

  • Entropy labels are ignorant of the core - The ingress LER pushes an ELI and EL unaware of which link in a P-to-P/P-to-PE LAG or ECMP bundle that label stack will be hashed onto; it could be an already congested link onto which the LER/PE continues to place additional traffic.

 

MPLS L3 VPNs - Label Allocation Policy
When deploying 6PE over an IPv4-only core, the IPv6 explicit-null label (value 2) is used. This is effectively a per-vrf/per-table label allocation policy, which means that a P router cannot load-balance on a per-flow basis in the way it could with per-prefix labelling. If a transport label is present, two or more PEs advertising the same IPv6 route may have separate transport labels at points in the network where the paths between the ingress and egress PEs diverge; however, the transport label may be the same at other points. This means that multiple flows between PE1 and PE2 will, at certain points in the network, have the same transport and service label stack. Also, if Penultimate Hop Popping is used, the final P node before the egress PE only has the service/VPN label to hash on, which is fixed for all IPv6 flows heading to the egress 6PE node.

The same is true when using per-vrf/per-table/per-next-hop/per-CE label allocation for IPv4 or IPv6: all prefixes advertised by an egress PE will have the same label, so the bottom label doesn't provide effective load-balancing entropy.
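
The difference is easy to see in a small sketch (illustrative only, with hypothetical label values): after PHP the last P node before the egress PE has only the VPN/service label to hash on, and with per-vrf/per-table allocation every prefix shares one label:

    # Illustrative sketch only: the set of distinct bottom-of-stack labels the
    # last P node can hash on after PHP, per label allocation policy.
    prefixes = ["203.0.113.0/25", "203.0.113.128/25", "198.51.100.0/24"]

    per_prefix_labels = {p: 30000 + i for i, p in enumerate(prefixes)}
    per_vrf_label = {p: 30000 for p in prefixes}

    def labels_seen_after_php(allocation):
        return set(allocation.values())

    print(labels_seen_after_php(per_prefix_labels))  # {30000, 30001, 30002}
    print(labels_seen_after_php(per_vrf_label))      # {30000} - no entropy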

 

Comparison / Evaluation
In order to deploy a solution that is unified across MPLS P and PE nodes it is important to keep the LSRs/P nodes agnostic of the VPN/service payload:

  • Both layer 2 and layer 3 hashing are only possible at the ingress and egress PE. It has to be assumed that an LSR/P node can't hash based on the contents of an MPLS VPN, because it may not be a protocol supported by the P node. For example, it can't be guaranteed that every P node understands 3GPP (to hash the payload from mobile cell sites) or IPv6. This means that any solution required to assist with load-balancing in an MPLS core must be MPLS based, to ensure all nodes are able to hash on headers they can parse. Layer 2 and layer 3 based hashing are still likely to be part of that solution, assuming the VPN ingress/egress nodes (LERs) are able to understand the protocol headers being transported inside the VPN, in order to supply the P/LSR nodes with the required additional entropy information encoded as MPLS label values.

  • However, it is possible, in the case of an AToM tunnel for example, that even the PE/LER can't understand the payload headers. And even when a P node understands the protocols inside the MPLS VPN it is still possible to incorrectly parse them, e.g. CSCvf97265 - a pseudowire with an Ethernet+IPv4 payload and a destination MAC which starts with 0x6 in the MSB, being parsed by a router which fully understands Ethernet, IPv4 and IPv6, is mistaken for IPv6 and the headers are incorrectly parsed, causing the LC NPU to crash.

A solution that supports MPLS L2 VPNs and L3 VPNs is required:

  • LDP can signal FAT labels for L2 VPNs only, not L3 VPNs (although this is changing, it is some time away). It also can't allocate any kind of FAT or entropy style label for the IGP prefixes it allocates labels for. In addition, when using IP FRR with LDP, the Explicit Null label can be used. This means that during link failures, even though a fast re-route mechanism is present, traffic inside the GRT / inet.0 (such as in an Internet not-in-a-VRF design) that would otherwise have been hashed on IP headers or a single transport label is now hashed on a single explicit null label. This could lead to congestion during FRR events.

  • The PW CW is flawed in that it doesn't prevent packet re-ordering within a single flow in 100% of scenarios, and in the scenarios that it does support it only detects packet re-ordering. Using Entropy Labels prevents packet re-ordering in 100% of the scenarios it is used for (as far as hashing alone is concerned; e.g. having one fibre path exceptionally longer than another transcends all of these techniques). The PW CW also only applies to L2 VPNs, not L3 VPNs.

  • Entropy Labels are agnostic of their payload and support L2 P2P/P2MP/MP2P/MP2MP VPNs, L3 unicast VPNs and L3 multicast VPNs, which covers the most widely deployed services. Entropy labels also prevent packet re-ordering in 100% of scenarios. Entropy labels do not detect packet re-ordering though (unlike the PW CW), in the unlikely event that it does happen (e.g. during a reconvergence event). Entropy labels can be pushed onto the stack by an LER/PE router based on the VPN/service payload, which means that a high level of entropy is added to the label stack to be hashed on.

A staged approach to deployment with backwards compatibility must be supported to prevent mass truck roll:

  • Entropy labels are already widely implemented by vendors and operators at the time of writing. Entropy labels can be enabled between a pair or set of PEs, one service at a time.

  • Entropy labels also provide backwards compatibility and a staged deployment with existing LSR/P nodes; any existing device that can hash on MPLS labels can hash on the MPLS label stack, without even knowing that one of the labels is an ELI and another an EL, and gain entropy in its hash calculation.

Excluding a mesh of RSVP-TE tunnels, Segment Routing, or anything offered by more complex implementations such as PCE-P driven forwarding or LSP tunnelling (e.g. LDP over RSVP), Entropy Labels stand out as a clear leader among the protocols and techniques evaluated above. The following extract from RFC6790, The Use of Entropy Labels in MPLS Forwarding, succinctly explains why the MPLS label stack is the ideal place to store and process entropy within the network:

The entire label stack of the MPLS packet can then be used by transit
LSRs to perform load balancing, as the entropy label introduces the
right level of "entropy" into the label stack.

There are five key reasons why this is beneficial:
1. At the ingress LSR, MPLS encapsulation hasn't yet occurred, so deep inspection is not necessary.
2. The ingress LSR has more context and information about incoming packets than transit LSRs.
3. Ingress LSRs usually operate at lower bandwidths than transit LSRs, allowing them to do more work per packet.
4. Transit LSRs do not need to perform deep packet inspection and can load balance effectively using only a packet's MPLS label stack.
5. Transit LSRs, not having the full context that an ingress LSR does, have the hard choice between potentially misinterpreting fields in a packet as valid keys for load balancing (causing packet-ordering problems) or adopting a conservative approach (giving rise to sub-optimal load balancing). Entropy labels relieve them of making this choice.

In all cases it should be noted that hashing at the ingress node is required in some form, to gain the additional entropy needed for multiple paths to be efficiently utilised across a network. The hashing functionality itself is not without issues though, e.g. CSCvb96765 – incoming IPv4 packets are incorrectly hashed onto the same egress MPLS-TE tunnel, while incoming MPLS packets are correctly hashed across all MPLS-TE tunnels.

MPLS-TE and Segment Routing haven't been discussed. To use these techniques at any sort of scale a central controller is required.  Without a central controller several issues exist with the techniques described above (even entropy labels aren't a panacea):

  • Even with FAT or entropy labels, the ingress PE is unaware of the impact the added entropy is having when a change in network state occurs; for example, when a LAG member link fails, which link will the flow/entropy label now map to at a given load-balancing point within the PSN? LAG member links aren't visible inside the IGP.

  • An LSR/P node cannot signal to an ingress PE the protocol headers it is able to parse or the maximum label stack depth it can parse.

  • The bandwidth of each ECMP/LAG member link, and whether the links are all the same bandwidth and MTU or not, is unknown to an ingress PE.

  • An LER/PE node has no way of knowing if a new flow over an existing MPLS VPN will be an elephant flow or a mouse flow.