Date created: Sunday, December 24, 2017 9:59:34 PM. Last modified: Sunday, December 3, 2023 4:50:07 PM

Pseudowire ECMP Misordering (RFC 8469)

References:
draft-ietf-pals-ethernet-cw-07 "Use of Ethernet Control Word RECOMMENDED"
RFC4928 - "Avoiding Equal Cost Multipath Treatment in MPLS Networks"
RFC4385 - "Pseudowire Emulation Edge-to-Edge (PWE3) Control Word for Use over an MPLS PSN"
Forwarding issues related to MACs starting with a 4 or a 6
NANOG57 - Understanding MPLS Hashing
http://ytti.fi/pseudohell.png

Contents:
RFC Overview
Problem Definition
Correct Parsing Examples
Incorrect Parsing Examples
An Alternate Approach

RFC Overview
The following statement can be found in the introduction section of draft-ietf-pals-ethernet-cw (since published as RFC 8469). The draft suggests that using the Pseudowire MPLS Control Word (PWMCW) should be considered best practice to prevent packet reordering by MPLS LSRs/P nodes that unintentionally parse MPLS VPN headers incorrectly:

1. Introduction
...
In the absence of a PW
CW an Ethernet pseudowire packet can be misidentified as an IP packet
by a label switching router (LSR) selecting the equal-cost-multi-path
(ECMP) path based on the five-tuple. This in turn may lead to the
selection of the wrong ECMP path for the packet, leading in turn to
the misordering of packets. Further discussion of this topic is
published in [RFC4928].

The text above mentions RFC4928, which recommends that application developers ensure that 0x0 or 0x1 is used as the first nibble of their application data if it is going to be carried directly over an MPLS pseudowire. If the data starts with 0x4 or 0x6 it risks being mistaken for IPv4 or IPv6 and treated differently, in a way that could lead to packet reordering and degrade the application's connectivity and performance across the network.

RFC4928 refers to RFC4385, the RFC that first defined the Pseudowire MPLS Control Word in order to prevent such behaviour. It specifies that 0x0 should be the first nibble of the PWMCW.

 

Problem Definition
If the payload inside the MPLS label stack is IPv4 then the IP header is usually 20 bytes long. The first byte is split into two 4-bit nibbles, the IP version nibble and the header length (IHL) nibble, which would be 0x4 and 0x5 respectively in this example (the IHL counts 32-bit words, so 0x5 means 20 bytes). The IP checksum field is located at the 11th and 12th bytes of the IPv4 header. If a checksum calculated over the 20-byte header matches the value in the checksum field, then the MPLS payload is extremely likely to be an IPv4 packet.

If the payload inside the MPLS label stack is Ethernet (it is a pseudowire/L2 VPN) then the Ethernet headers DST MAC + SRC MAC + ETYPE are 14 bytes long (without a VLAN tag). If the first nibble of the DST MAC is 0x4 or 0x6 then an LSR that only checks for the IP version field will mistake that nibble for the IP version nibble at the start of an IP header, and assume this is a layer 3 VPN carrying IPv4 or IPv6 traffic. In this case the LSR will hash on values further into what it believes is the IP header that are in fact not the SRC IP + DST IP, and optionally on values that are not the transport protocol number or SRC PORT + DST PORT of the transport protocol.
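
The naive first-nibble check described above can be sketched in Python as follows. The function name and return strings are illustrative assumptions, not any vendor's implementation:

```python
def guess_payload_type(payload: bytes) -> str:
    """Guess what follows the MPLS label stack from the first nibble alone."""
    first_nibble = payload[0] >> 4
    if first_nibble == 0x4:
        return "ipv4"
    if first_nibble == 0x6:
        return "ipv6"
    return "not-ip"


# Start of a genuine IPv4 header.
ipv4_packet = bytes.fromhex("45 00 01 77 dd 32 40 00")
# Start of an Ethernet frame whose destination MAC begins with 0x4.
ethernet_frame = bytes.fromhex("40 92 2e e6 76 fe 11 11 11 11 11 11 08 00")

print(guess_payload_type(ipv4_packet))     # ipv4
print(guess_payload_type(ethernet_frame))  # ipv4, although it is an Ethernet frame
```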

What the LSR believes to be the SRC IP field within the MPLS payload will be the 13th to 16th bytes in the payload. What the LSR believes to be the DST IP field within the MPLS payload will be the 17th to 20th bytes in the payload, and so on. These could be other values within the data payload that change per packet, instead of remaining constant. For example, the IP Checksum field of almost every packet within the same flow will be different. This means that if an LSR/P node mistakenly hashed on the IP checksum field for load-balancing decision making, packets within the same flow would be sent across different paths within the network, leading to packet reordering.
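
As a rough illustration of the effect, the Python sketch below builds three frames of the same flow that differ only in the IPv4 ID field, then reads bytes 13 to 20 the way a misled LSR would. The helper names, the CRC32 stand-in hash and the four-path ECMP group are assumptions made for the example:

```python
import zlib

ETH_HEADER = bytes.fromhex("40 92 2e e6 76 fe 11 11 11 11 11 11 08 00")


def frame_with_ip_id(ip_id: int) -> bytes:
    """Ethernet + IPv4 + TCP ports for one flow, varying only the IP ID.

    The IP checksum is left as zero: the naive LSR modelled here never checks it.
    """
    ip = bytearray.fromhex(
        "45 00 01 77 00 00 40 00 40 06 00 00 01 00 00 01 01 00 00 02"
    )
    ip[4:6] = ip_id.to_bytes(2, "big")
    return ETH_HEADER + bytes(ip) + bytes.fromhex("bd e0 00 50")


def misparsed_key(payload: bytes) -> bytes:
    """Bytes 13-20 (1-indexed): what the LSR believes are SRC_IP and DST_IP."""
    return payload[12:20]


def ecmp_path(key: bytes, n_paths: int = 4) -> int:
    """Stand-in hash: CRC32 of the key modulo the number of paths."""
    return zlib.crc32(key) % n_paths


for ip_id in (0xDD32, 0xDD33, 0xDD34):
    key = misparsed_key(frame_with_ip_id(ip_id))
    print(key.hex(), "-> path", ecmp_path(key))
```

The key changes with every packet because it includes the IP ID, so packets of a single flow can be spread across several paths.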

More sophisticated LSRs can check the IP version nibble, the IP length nibble and the IP checksum bytes. This greatly reduces the likelihood of accidentally hashing an Ethernet frame (inside a pseudowire / L2 VPN) as if it were an IP packet (L3 VPN), but it is still not a fool-proof method. It is unlikely that exactly the right values would be present at the required byte offsets within the VPN payload headers to pass these checks; however, if such values were present, the problem would become extremely difficult to troubleshoot.
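
A minimal Python sketch of the stricter check follows. It assumes the LSR validates the version nibble, the IHL nibble and the header checksum in that order; the function names are illustrative only:

```python
def ipv4_header_checksum_ok(header: bytes) -> bool:
    """True if the ones' complement sum of the header's 16-bit words is 0xffff."""
    total = 0
    for i in range(0, len(header), 2):
        total += int.from_bytes(header[i:i + 2], "big")
    while total > 0xFFFF:
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF


def looks_like_ipv4(payload: bytes) -> bool:
    """Check the version nibble, the IHL nibble and the header checksum."""
    if len(payload) < 20:
        return False
    version, ihl = payload[0] >> 4, payload[0] & 0x0F
    if version != 4 or ihl < 5:
        return False
    header_len = ihl * 4
    if len(payload) < header_len:
        return False
    return ipv4_header_checksum_ok(payload[:header_len])


real_ipv4 = bytes.fromhex(
    "45 00 01 77 dd 32 40 00 40 06 5a 4c 01 00 00 01 01 00 00 02"
)
# Ethernet frames whose destination MAC starts with 0x40 and 0x45.
frame_40 = bytes.fromhex(
    "40 92 2e e6 76 fe 11 11 11 11 11 11 08 00 45 00 01 77 dd 32"
)
frame_45 = bytes.fromhex(
    "45 92 2e e6 76 fe 11 11 11 11 11 11 08 00 45 00 01 77 dd 32"
)

print(looks_like_ipv4(real_ipv4))  # True
print(looks_like_ipv4(frame_40))   # False: IHL nibble is 0
print(looks_like_ipv4(frame_45))   # False: checksum does not match
```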

 

Correct Parsing Examples
Below are several examples of packet headers inside the MPLS VPN (what sits directly on top of the MPLS label stack) for L2 and L3 VPNs to show which fields/byte offsets should be hashed on.

Below is an annotated example of an MPLS VPN payload for an IPv4 VPN:

|IPv4                                                     | |TCP                                       |
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 ..
45 00 01 77 dd 32 40 00 40 06 5a 4c 01 00 00 01 01 00 00 02 bd e0 00 50 d9 e8 cc 77 97 b3 82 7f 80 18
^  ^  ^     ^     ^     ^  ^  ^     ^           ^           ^     ^     ^           ^           ^
|  |  Len   ID    |     TTL|  |     SRC_IP      DST_IP      |     |     Seq_Num     ACK#        |  
|  |  (375)       |    (64)|  |     (1.0.0.1)   (1.0.0.2)   |     |                             Len+Flags
|  DSCP (0)       Flags    |  Checksum                      |     DST_Port (80)
|                 |(DF)    |                                |
IPVer+IHL         |        IP_Prot (TCP)                    SRC_Port (48608)
                  +Frag_Off

The first nibble is 0x4 so if an LSR assumes the payload starts with an IPv4 header it would be correct. We can theorise that an LSR/P node would perform a 5-tuple hash to get enough entropy from the IP VPN payload headers, based on IP protocol number, source IP address, destination IP address, source port and destination port. Below, the correct byte offsets are passed to a pseudo hashing function:

key_l3(10, 13, 17, 21, 23);

function key_l3(IP_Prot*, SRC_IP*, DST_IP*, SRC_Port*, DST_Port*) {
   // IP_Prot*  == 0x06
   // SRC_IP*   == 0x01000001
   // DST_IP*   == 0x01000002
   // SRC_Port* == 0xbde0
   // DST_Port* == 0x0050
}
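
A runnable Python version of the pseudo function above, assuming 1-indexed byte offsets as in the diagram and using CRC32 over four paths as a stand-in for whatever hash a real LSR uses:

```python
import zlib

# IPv4 VPN payload from the example above: IPv4 header followed by the TCP ports.
PAYLOAD = bytes.fromhex(
    "45 00 01 77 dd 32 40 00 40 06 5a 4c 01 00 00 01 01 00 00 02 "
    "bd e0 00 50"
)


def field(payload: bytes, offset: int, length: int) -> bytes:
    """Read `length` bytes starting at a 1-indexed byte offset."""
    return payload[offset - 1:offset - 1 + length]


def key_l3(payload, ip_prot, src_ip, dst_ip, src_port, dst_port) -> bytes:
    """Concatenate the 5-tuple fields found at the given 1-indexed offsets."""
    return (
        field(payload, ip_prot, 1)
        + field(payload, src_ip, 4)
        + field(payload, dst_ip, 4)
        + field(payload, src_port, 2)
        + field(payload, dst_port, 2)
    )


key = key_l3(PAYLOAD, 10, 13, 17, 21, 23)
print(key.hex())            # 060100000101000002bde00050
print(zlib.crc32(key) % 4)  # index of the chosen ECMP path
```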

 

Below is an annotated example for a layer 2 MPLS VPN which contains IPv4 over Ethernet:

|Ethernet                               | |IPv4                                                     | |TCP                    |
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42..
22 22 22 22 22 22 11 11 11 11 11 11 08 00 45 00 01 77 dd 32 40 00 40 06 5a 4c 01 00 00 01 01 00 00 02 bd e0 00 50 d9 e8 cc 77
^                 ^                 ^     ^  ^  ^     ^     ^     ^  ^  ^     ^           ^           ^     ^     ^          
DST_MAC           SRC_MAC           EType |  |  Len   ID    |     TTL|  |     SRC_IP      DST_IP      |     |     Seq_Num    
                                    (IPv4)|  |  (375)       |    (64)|  |     (1.0.0.1)   (1.0.0.2)   |     |                
                                          |  DSCP (0)       Flags    |  Checksum                      |     DST_Port (80)
                                          |                 |(DF)    |                                |
                                          IPVer+IHL         |        IP_Prot (TCP)                    SRC_Port (48608)
                                                            +Frag_Off

The first nibble is not 0x4 or 0x6 so the P node assumes the payload starts with an Ethernet header which in this case is correct. This means that to hash on the same 5-tuple as an IP VPN (which provides more entropy than hashing on Ethernet header fields) the LSR can read the byte offset it would expect to be the EtherType field, find that the value is 0x0800 (IPv4) and adjust the byte offsets accordingly to pass the 5-tuple field pointers to the pseudo hashing function. Additionally the IP version, length and checksum fields should be checked too:

key_l3(24, 27, 31, 35, 37);

function key_l3(IP_Prot*, SRC_IP*, DST_IP*, SRC_Port*, DST_Port*) {
   // IP_Prot*  == 0x06
   // SRC_IP*   == 0x01000001
   // DST_IP*   == 0x01000002
   // SRC_Port* == 0xbde0
   // DST_Port* == 0x0050
}
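
The offset adjustment can be sketched in Python as below. It assumes an untagged Ethernet frame and an IPv4 header without options (IHL of 5); the names are illustrative:

```python
ETHERTYPE_IPV4 = 0x0800
ETHERNET_HEADER_LEN = 14
# 1-indexed offsets within a bare IPv4 packet:
# IP_Prot, SRC_IP, DST_IP, SRC_Port, DST_Port
BASE_OFFSETS = (10, 13, 17, 21, 23)


def offsets_for_ethernet(payload: bytes):
    """Return the shifted 5-tuple offsets, or None if the EtherType is not IPv4."""
    ethertype = int.from_bytes(payload[12:14], "big")
    if ethertype != ETHERTYPE_IPV4:
        return None
    return tuple(off + ETHERNET_HEADER_LEN for off in BASE_OFFSETS)


FRAME = bytes.fromhex(
    "22 22 22 22 22 22 11 11 11 11 11 11 08 00 "
    "45 00 01 77 dd 32 40 00 40 06 5a 4c 01 00 00 01 01 00 00 02 "
    "bd e0 00 50"
)

offsets = offsets_for_ethernet(FRAME)
print(offsets)                      # (24, 27, 31, 35, 37)
print(hex(FRAME[offsets[0] - 1]))   # 0x6, the IP protocol number for TCP
```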

 

Below is an annotated example of a Layer 2 VPN payload with the PWMCW present (without sequencing enabled) which contains IPv4 over Ethernet:

|PWMCW    | |Ethernet                               | |IPv4                                                     | |TCP        |
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42..
00 00 00 00 22 22 22 22 22 22 11 11 11 11 11 11 08 00 45 00 01 77 dd 32 40 00 40 06 5a 4c 01 00 00 01 01 00 00 02 bd e0 00 50
^  ^  ^     ^                 ^                 ^     ^  ^  ^     ^     ^     ^  ^  ^     ^           ^           ^     ^    
|  |  Seq#  DST_MAC           SRC_MAC           EType |  |  Len   ID    |     TTL|  |     SRC_IP      DST_IP      |     |    
|  |                                            (IPv4)|  |  (375)       |    (64)|  |     (1.0.0.1)   (1.0.0.2)   |     |     
|  Frg+Len                                            |  DSCP (0)       Flags    |  Checksum                      |DST_Port(80)
|                                                     |                 |(DF)    |                                |
0's+Flags                                             IPVer+IHL         |        IP_Prot (TCP)                 SRC_Port (48608)
                                                                        +Frag_off

The first nibble is 0x0 so the LSR assumes the payload starts with a PWMCW and that an Ethernet header follows the PWMCW (which is correct). As per the previous example, it can look at the EtherType value to see that the Ethernet payload is IPv4, and correctly find the byte offsets for the 5-tuple hash inputs:

key_l3(28, 31, 35, 39, 41);

function key_l3(IP_Prot*, SRC_IP*, DST_IP*, SRC_Port*, DST_Port*) {
   // IP_Prot*  == 0x06
   // SRC_IP*   == 0x01000001
   // DST_IP*   == 0x01000002
   // SRC_Port* == 0xbde0
   // DST_Port* == 0x0050
}
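
The Python sketch below decodes the control word and shifts the offsets past both the PWMCW and the Ethernet header. The field layout (4 zero bits, 4 flag bits, 2 fragmentation bits, 6 length bits, 16 sequence bits) follows the diagram above and RFC4385; the function and constant names are assumptions for illustration:

```python
PWMCW_LEN = 4
ETHERNET_HEADER_LEN = 14
BASE_OFFSETS = (10, 13, 17, 21, 23)  # 1-indexed offsets in a bare IPv4 packet


def parse_pwmcw(payload: bytes):
    """Decode the 4-byte control word; None if the first nibble is not 0x0."""
    word = int.from_bytes(payload[:4], "big")
    if word >> 28 != 0x0:
        return None
    return {
        "flags": (word >> 24) & 0xF,
        "frg": (word >> 22) & 0x3,
        "length": (word >> 16) & 0x3F,
        "sequence": word & 0xFFFF,  # 0 means sequencing is not in use
    }


PAYLOAD = bytes.fromhex(
    "00 00 00 00 "
    "22 22 22 22 22 22 11 11 11 11 11 11 08 00 "
    "45 00 01 77 dd 32 40 00 40 06 5a 4c 01 00 00 01 01 00 00 02 "
    "bd e0 00 50"
)

print(parse_pwmcw(PAYLOAD))
offsets = tuple(off + PWMCW_LEN + ETHERNET_HEADER_LEN for off in BASE_OFFSETS)
print(offsets)                       # (28, 31, 35, 39, 41)
print(hex(PAYLOAD[offsets[0] - 1]))  # 0x6, the IP protocol number for TCP
```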

 

Incorrect Parsing Examples
The following examples show how an LSR/P node can read the wrong values within an MPLS VPN payload.

Below is an annotated example of a layer 2 VPN payload with the PWMCW (without sequencing) which contains IPv4 over Ethernet:

|PWMCW    | |Ethernet                               | |IPv4                                                     | |TCP        |
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42..
00 00 00 00 22 22 22 22 22 22 11 11 11 11 11 11 08 00 45 00 01 77 dd 32 40 00 40 06 5a 4c 01 00 00 01 01 00 00 02 bd e0 00 50
^  ^  ^     ^                 ^                 ^     ^  ^  ^     ^     ^     ^  ^  ^     ^           ^           ^     ^
|  |  Seq#  DST_MAC           SRC_MAC           EType |  |  Len   ID    |     TTL|  |     SRC_IP      DST_IP      |     |
|  |                                            (IPv4)|  |  (375)       |    (64)|  |     (1.0.0.1)   (1.0.0.2)   |     |
|  Frg+Len                                            |  DSCP (0)       Flags    |  Checksum                      |DST_Port(80)
|                                                     |                 |(DF)    |                                |
0's+Flags                                             IPVer+IHL         |        IP_Prot (TCP)                 SRC_Port(48608)

The first nibble is not 0x4 or 0x6 so the P node assumes the payload starts with an Ethernet header (the P node has incorrectly assumed a MAC address that starts with 0x0, such as a Xerox MAC address). If the P node doesn't check its assumption by looking for 0x0800 at the expected location of the EtherType value, which actually reads 0x1111 (bytes 13 and 14), it tries to read the 5-tuple values from the wrong locations within the VPN payload:

key_l3(24, 27, 31, 35, 37);

function key_l3(IP_Prot*, SRC_IP*, DST_IP*, SRC_Port*, DST_Port*) {
   // IP_Prot*  == 0x32       (2nd byte of ID)
   // SRC_IP*   == 0x40065a4c (TTL+IP_Prot+Checksum)
   // DST_IP*   == 0x01000001 (SRC_IP)
   // SRC_Port* == 0x0100     (bytes 1-2 of DST_IP)
   // DST_Port* == 0x0002     (bytes 3-4 of DST_IP)
}

 

The following annotated example is also of a layer 2 VPN payload but without the PWMCW. It contains IPv4 over Ethernet:

|Ethernet                               | |IPv4                                                     | |TCP                    |
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42..
00 92 2e e6 76 fe 11 11 11 11 11 11 08 00 45 00 01 77 dd 32 40 00 40 06 5a 4c 01 00 00 01 01 00 00 02 bd e0 00 50 d9 e8 cc 77
^                 ^                 ^     ^  ^  ^     ^     ^     ^  ^  ^     ^           ^           ^     ^     ^          
DST_MAC           SRC_MAC           EType |  |  Len   ID    |     TTL|  |     SRC_IP      DST_IP      |     |     Seq_Num
                                    (IPv4)|  |  (375)       |    (64)|  |     (1.0.0.1)   (1.0.0.2)   |     |
                                          |  DSCP (0)       Flags    |  Checksum                      |     DST_Port (80)
                                          |                 |(DF)    |                                |
                                          IPVer+IHL         |        IP_Prot (TCP)                    SRC_Port (48608)
                                                            +Frag_Off

The first nibble is 0x0 so the LSR incorrectly assumes that the payload starts with a PWMCW followed by the Ethernet headers (the P node has incorrectly assumed the PWMCW is enabled instead of recognising that 0x0 is the first nibble of the destination MAC address). Once again, the wrong (but different from before!) byte offsets are used to read the 5-tuple values:

key_l3(28, 31, 35, 39, 41);

function key_l3(IP_Prot*, SRC_IP*, DST_IP*, SRC_Port*, DST_Port*) {
   // IP_Prot*  == 0x00       (2nd byte of SRC_IP)
   // SRC_IP*   == 0x01000002 (DST_IP)
   // DST_IP*   == 0xbde00050 (SRC_Port+DST_Port)
   // SRC_Port* == 0xd9e8     (Seq_Num[0]+Seq_Num[1])
   // DST_Port* == 0xcc77     (Seq_Num[2]+Seq_Num[3])
}

 

The following is an annotated example of a layer 2 VPN payload without the PWMCW which contains IPv4 over Ethernet:

|Ethernet                               | |IPv4                                                     | |TCP                    |
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42..
40 92 2e e6 76 fe 11 11 11 11 11 11 08 00 45 00 01 77 dd 32 40 00 40 06 5a 4c 01 00 00 01 01 00 00 02 bd e0 00 50 d9 e8 cc 77
^                 ^                 ^     ^  ^  ^     ^     ^     ^  ^  ^     ^           ^           ^     ^     ^
DST_MAC           SRC_MAC           EType |  |  Len   ID    |     TTL|  |     SRC_IP      DST_IP      |     |     Seq_Num
                                    (IPv4)|  |  (375)       |    (64)|  |     (1.0.0.1)   (1.0.0.2)   |     |
                                          |  DSCP (0)       Flags    |  Checksum                      |     DST_Port (80)
                                          |                 |(DF)    |                                |
                                          IPVer+IHL         |        IP_Prot (TCP)                    SRC_Port (48608)
                                                            +Frag_Off

The first nibble is 0x4 so the P node assumes that the payload starts with an IPv4 header (the P node has incorrectly assumed that a MAC starting with 0100b is the IP version number nibble, but it's actually the first nibble of the destination MAC address). Once again, the wrong byte offsets are used to read the values for the 5-tuple hash calculation:

key_l3(10, 13, 17, 21, 23);

function key_l3(IP_Prot*, SRC_IP*, DST_IP*, SRC_Port*, DST_Port*) {
   // IP_Prot*  == 0x11       (byte 4 of SRC_MAC)
   // SRC_IP*   == 0x08004500 (EType+IPVer+IHL+DSCP)
   // DST_IP*   == 0x0177dd32 (Len+ID)
   // SRC_Port* == 0x4000     (Flags+Frag_Off)
   // DST_Port* == 0x4006     (TTL+IP_Prot)
}
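
The three misparses in this section can be reproduced with a short Python sketch. It simply reads the five fields at the (wrong) 1-indexed offsets each LSR would use; the helper name and case labels are illustrative:

```python
def read_5tuple(payload: bytes, offsets):
    """Read IP_Prot(1), SRC_IP(4), DST_IP(4), SRC_Port(2), DST_Port(2) as hex."""
    lengths = (1, 4, 4, 2, 2)
    return tuple(payload[o - 1:o - 1 + n].hex() for o, n in zip(offsets, lengths))


SRC_MAC_ETYPE = "11 11 11 11 11 11 08 00 "
IP_TCP = (
    "45 00 01 77 dd 32 40 00 40 06 5a 4c 01 00 00 01 01 00 00 02 "
    "bd e0 00 50 d9 e8 cc 77"
)

cases = {
    "PWMCW present, parsed as bare Ethernet": (
        bytes.fromhex("00 00 00 00 22 22 22 22 22 22 " + SRC_MAC_ETYPE + IP_TCP),
        (24, 27, 31, 35, 37),
    ),
    "MAC starts 0x0, parsed as PWMCW + Ethernet": (
        bytes.fromhex("00 92 2e e6 76 fe " + SRC_MAC_ETYPE + IP_TCP),
        (28, 31, 35, 39, 41),
    ),
    "MAC starts 0x4, parsed as IPv4": (
        bytes.fromhex("40 92 2e e6 76 fe " + SRC_MAC_ETYPE + IP_TCP),
        (10, 13, 17, 21, 23),
    ),
}

results = {name: read_5tuple(payload, offs) for name, (payload, offs) in cases.items()}
for name, values in results.items():
    print(name, values)
```

None of the three results contains the real 5-tuple (0x06, 1.0.0.1, 1.0.0.2, 48608, 80), and each misparse picks up different bytes.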

 

An Alternate Approach
Draft-ietf-pals-ethernet-cw recommends a new best practice, which is to always enable the PWMCW whenever possible (when both LERs support the CW), to prevent packet reordering within an MPLS L2 VPN. It also recommends that ECMP should not be used for pseudowire traffic on paths between LERs that don't need it. However, it goes on:

5.  Equal Cost Multi-path (ECMP)

   Where the volume of traffic on an Ethernet PW is such that ECMP is
   required then one of two methods may be used:

   o Flow-Aware Transport (FAT) of Pseudowires over an MPLS Packet
   Switched Network specified in [RFC6391], or

   o LSP entropy labels specified in [RFC6790]

The draft itself states that entropy labels or FAT should be used when ECMP is needed. Also, when no other option is available (e.g. neither FAT nor EL is supported but the CW is supported), it doesn't recommend that sequencing be enabled. Sequencing doesn't prevent packet reordering but does at least detect it; without sequencing, reordering is very difficult to detect.

An alternative best practice recommendation to the draft could be as follows:

  • Use entropy labels or FAT if supported, with CW disabled, and disable payload inspection (force load-balancing based on the MPLS label stack). This prevents packet reordering, supports ECMP and supports multi-segment pseudowires.

  • If neither entropy labels nor FAT are supported, enable the PW CW and sequencing, and disable payload inspection (force load-balancing on the MPLS label stack). This should prevent packet reordering by not supporting ECMP, but if it does occur it offers a method of detection.

  • If sequencing isn't supported or the control word isn't supported, disable payload inspection (force load-balancing on the MPLS label stack). This should prevent reordering by not supporting ECMP.

RFC4385 describes the PWMCW header and how sequencing works:

   If a PW is sensitive to packet misordering and is being carried over
   an MPLS PSN that uses the contents of the MPLS payload to select the
   ECMP path, it MUST employ a mechanism that prevents packet
   misordering.  A suitable mechanism is the PWMCW described in Section
   3 for data, and the PWACH described in Section 5 for channel-
   associated traffic.
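
The quoted text does not spell out the sequence number check itself. The Python sketch below follows the receive-side algorithm from section 4 of RFC4385 as understood here (16-bit sequence numbers, 0 meaning sequencing is unused, wrap from 65535 to 1); treat the details as an assumption and check them against the RFC. The function names are illustrative:

```python
def in_order(seq: int, expected: int) -> bool:
    """Receive-side check on a 16-bit PWMCW sequence number."""
    if seq == 0:                 # sequencing not in use for this packet
        return True
    if seq >= expected and seq - expected < 32768:
        return True
    if seq < expected and expected - seq >= 32768:
        return True
    return False


def next_expected(seq: int) -> int:
    """Advance the expected number, skipping 0 when the counter wraps."""
    nxt = (seq + 1) % 65536
    return nxt if nxt != 0 else 1


def count_misordered(sequence_numbers) -> int:
    """Count packets that fail the check, starting with an expected value of 1."""
    expected = 1
    misordered = 0
    for seq in sequence_numbers:
        if in_order(seq, expected):
            if seq != 0:
                expected = next_expected(seq)
        else:
            misordered += 1
    return misordered


print(count_misordered([1, 2, 3, 4]))  # 0
print(count_misordered([1, 3, 2, 4]))  # 1: packet 2 arrives after packet 3
```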

It's important to note that both entropy labels and FAT provide consistent hashing within an ECMP/LAG hashing decision and are not a panacea. If Segment Routing is being used, a label depth issue can arise in which the entropy labels are too deep in the label stack to hash on.

On Junos one can use:

set forwarding-options enhanced-hash-key family mpls no-ether-pseudowire
set forwarding-options enhanced-hash-key family mpls no-payload
set protocols l2circuit neighbor x.x.x.x interface ge-1/1/1 control-word

https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/enhanced-hash-key-edit-forwarding-options.html
https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/vpls-bgp-control-word-configuring.html

On IOS-XR one can use:

l2vpn
 pw-class 123
  encapsulation mpls
   load-balancing
    pw-label | flow-label

https://community.cisco.com/t5/service-providers-documents/asr9000-xr-load-balancing-architecture-and-characteristics/ta-p/3124809

This will hash based on the bottom of stack label (either the VC label or FAT label).
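
To illustrate what hashing on the bottom of stack label means, the Python sketch below walks an MPLS label stack (20-bit label, 3-bit TC, 1-bit S, 8-bit TTL per entry) and hashes only the last label. The label values, the CRC32 stand-in hash and the four-path group are hypothetical:

```python
import zlib


def stack_entry(label: int, bottom: bool, tc: int = 0, ttl: int = 64) -> bytes:
    """Encode one 4-byte MPLS label stack entry."""
    return ((label << 12) | (tc << 9) | (int(bottom) << 8) | ttl).to_bytes(4, "big")


def parse_label_stack(packet: bytes):
    """Return the labels in order, stopping at the entry with the S bit set."""
    labels = []
    for i in range(0, len(packet) - 3, 4):
        entry = int.from_bytes(packet[i:i + 4], "big")
        labels.append(entry >> 12)   # top 20 bits: label value
        if (entry >> 8) & 0x1:       # S bit: bottom of stack
            return labels
    raise ValueError("no bottom-of-stack entry found")


# Hypothetical stack: transport label, VC label, then a flow label at the bottom.
stack = (
    stack_entry(16001, bottom=False)
    + stack_entry(262145, bottom=False)
    + stack_entry(524300, bottom=True)
)
payload_start = bytes.fromhex("40 92 2e e6")  # never inspected

labels = parse_label_stack(stack + payload_start)
bottom_label = labels[-1]
print(labels)                                          # [16001, 262145, 524300]
print(zlib.crc32(bottom_label.to_bytes(4, "big")) % 4)  # chosen ECMP path
```

Because the payload bytes are never read, a destination MAC starting with 0x4 or 0x6 cannot influence the path selection.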

FAT and EL/ELI setup are out of scope for this document.

Saku Ytti has provided some testing results for Cisco and Juniper:

