Tree/Trie IP and Forwarding Lookups


IP Lookup Data Structures
Uni-Bit and Multi-Bit Tries
Lookup Memory Types
Tries and Network Vendors
Cisco Specific Examples

IP Lookup Data Structures
An increase in forwarding data requires an increase in forwarding memory, which means the power, cooling and cost overheads for that memory increase. IP forwarding search/lookup speed may also suffer depending on the search algorithm used (such as longest prefix match). Additionally, update time for the forwarding table may increase as the forwarding data volume increases. As update time increases, the likelihood of black-holing traffic during a reconvergence event may increase, both on the local router and on other routers, as the number of control-plane sessions also scales.

For a 10Gbps link sending minimum sized Ethernet frames of 84 bytes (8 byte preamble and start frame delimiter, 12 bytes of source + destination MAC, 2 byte EtherType, 46 byte minimum payload, 4 byte frame check sequence, 12 byte inter-frame gap), 10,000,000,000 bps / (84 * 8 bits) == 14.88~ Mpps. This equates to one forwarding lookup every 0.000000067 seconds (67 nanoseconds).
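The packet rate and lookup budget above can be reproduced with a few lines of arithmetic (the constants are straight from the frame breakdown in the text):

```python
# Packets-per-second and per-packet lookup budget for a 10 Gb/s link
# carrying minimum-sized Ethernet frames (84 bytes on the wire).

LINK_BPS = 10_000_000_000                 # 10 Gb/s
FRAME_BYTES = 8 + 12 + 2 + 46 + 4 + 12    # preamble+SFD, MACs, EtherType, min payload, FCS, IFG

pps = LINK_BPS / (FRAME_BYTES * 8)        # frames per second at line rate
budget_ns = 1e9 / pps                     # nanoseconds available per lookup

print(f"{pps / 1e6:.2f} Mpps")            # 14.88 Mpps
print(f"{budget_ns:.1f} ns per lookup")   # 67.2 ns
```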

N.B: Lots of the terminology here is not standardised!

Excerpt from here:
"The structure of the cache system means that, all else being equal, a data structure will run faster if it has good locality, which means that pieces of the data structure accessed around the same time tend to be in about the same place. For example, given the choice between searching through an unsorted array and an unsorted linked list, you are better off searching the array: each position you look at in the array comes right after the previous position, so is likely to be stored in the same block. In contrast, the elements of the linked list will be wherever malloc managed to scrounge up space: in the worst case, one element per virtual memory block, each of which needs to be paged in from disk if you are running low on space. The disadvantage of the array over the linked list is that calling realloc may be expensive; one way to avoid this is to build a linked list of large nodes each of which holds an array...This is actually slightly slower than just building a stack as an array...but much faster than the simple linked-list approach...A similar issue arises with binary search trees. Because we are only jumping through O(log n) nodes, the added cost of bad locality is not as noticeable as with linked lists, but we'd like to reduce it, especially if we expect our tree nodes to be stored on a slow device like a disk".

A hash table has an average search/insert/delete time of O(1) and a worst case of O(n). Hash tables provide an unordered key-value map, which is why the worst case is so much higher than the average case: there is no native ordered-search capability. A balanced binary tree has an average and worst case search/insert/delete time of O(log(n)), meaning it is completely consistent. Trees and tries provide an ordered key-value map, so efficient searching is native. This makes trie data structures ideal for IP forwarding lookups. Despite a hash table having an average O(1) search time, which seems like it could be better for routers in certain circumstances, the hash function used to generate the search key could be so complex that any advantage is lost. In addition, a hash table can't search for ranges, unlike a trie, which can return results including the searched-for prefix and/or more specific subnets within it. However, when a TCAM is used the "don't care" bits can be used to provide this capability to a hash table structure in physical hardware, but not easily in software.

A binary trie (the terms binary tree, B-tree and B-trie are often used loosely here, per the note above) evaluates bits of a search key at each level in the trie. Balanced trees save on the average height from root to leaf. A uni-bit trie is a trie with a radix of 2, meaning each branch node evaluates just one bit of the key with only two possible outcomes. A Patricia trie is a binary trie with a radix of 2 that uses the Patricia algorithm to compress the trie height. A multi-bit trie is a trie which examines n bits of the search key at each node level in the trie. The size of the multi-bit comparison is often called the "stride" width.


Uni-Bit and Multi-Bit Tries
For a unibit trie the worst case scenario for an IPv4 lookup is 32 random reads, and for IPv4 insertion the worst case would be 32 random writes (to add 32 new nodes to the trie). The upper bound lookup complexity is O(n) where n is the length of the prefix being searched for; a /8 requires fewer steps through the trie nodes than a /32. The upper bound on memory is roughly O(n*w) where n is the number of prefixes and w is their length (this is only rough because some prefixes in the trie will share branch nodes).
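The uni-bit behaviour described above (one bit examined per node, best match remembered along the walk) can be sketched in a few lines. This is an illustrative toy, not any vendor's implementation; the class and hop names are invented:

```python
# Minimal uni-bit (radix-2) trie for IPv4 longest-prefix match.
# One bit of the key is examined per level, so a lookup walks at most 32 nodes.
import ipaddress

class Node:
    __slots__ = ("children", "next_hop")
    def __init__(self):
        self.children = [None, None]  # branch on a single bit: 0 or 1
        self.next_hop = None          # set if a prefix terminates at this node

class UnibitTrie:
    def __init__(self):
        self.root = Node()

    def insert(self, prefix, next_hop):
        net = ipaddress.ip_network(prefix)
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):          # one write per prefix bit, worst case 32
            bit = (bits >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = Node()
            node = node.children[bit]
        node.next_hop = next_hop

    def lookup(self, address):
        """Walk bit by bit, remembering the best (longest) match seen so far."""
        bits = int(ipaddress.ip_address(address))
        node, best = self.root, None
        for i in range(32):                     # one read per bit, worst case 32
            if node.next_hop is not None:
                best = node.next_hop
            node = node.children[(bits >> (31 - i)) & 1]
            if node is None:
                return best
        return node.next_hop or best

t = UnibitTrie()
t.insert("10.0.0.0/8", "hop-A")
t.insert("10.1.0.0/16", "hop-B")
print(t.lookup("10.1.2.3"))   # hop-B (the more specific /16 wins)
print(t.lookup("10.9.9.9"))   # hop-A
```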

Compressed unibit or Patricia tries compress the height (the distance from the root node to a leaf node); "[for any] internal node of the unibit trie that does not contain a next-hop and has only one child [it] can be removed in order to shorten the path from the root node. By removing these nodes, we need a mechanism to record which nodes are missing" so some extra information must be stored at each node in order to reduce trie height. "Each IP lookup starts at the root node of the trie. Based on the value of skipped bits of the destination address of the packet [the "skipped bits" value in the current node], the lookup algorithm determines whether the left or the right node is to be visited. The next hop of the longer matching prefix found along the path is maintained as Best Matching Prefix (BMP) while the trie is traversed.".

"Compression can reduce the height of a sparse unibit trie. However, when the unibit trie is full (every node has two children) then there is no compression possible and the two versions will look identical. Thus, a lookup operation requires 32 random reads in the worst case for IPv4, O(n) [where n is the length of the prefix to lookup]. Considering a unibit trie with N leaves, there can be N-1 internal nodes between the leaves and the root including the root...The total amount of memory required will be at most 2n − 1 [where n is the number of prefixes]. Since path compression is possible and some internal nodes can be removed, the space complexity becomes O(n) [number of prefixes], independent of the length of a prefix. Thus, path compressed tries reduce space requirements, but not the search complexity."

With multi-bit tries the "stride" length of the key is examined at each level in the trie. A uni-bit trie would in the worst case require 32 memory accesses for an IPv4 prefix lookup because it has a stride length of 1 bit. If the memory access time is 10ns this would equate to 320ns, and only 67ns is available at 10Gbps, so throughput would drop to circa 2.1Gbps. With a multi-bit trie, given a prefix of length n and a stride size of k, the upper bound lookup complexity is O(n/k). If k is 8, then at worst an IPv4 address lookup would require 4 memory accesses (32/8 == 4). This is also the trie height calculation for the multi-bit trie: key size / stride size (32 / 8 == 4 for example). For a uni-bit trie used for IPv4 forwarding the trie height would be 32 / 1 == 32.
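The throughput claim above can be sanity-checked numerically, using the 10ns memory access time and 67.2ns per-packet budget from the text:

```python
# Memory accesses, lookup time and achievable throughput for an IPv4 lookup
# at various stride sizes, assuming 10 ns per memory access and a 67.2 ns
# per-packet budget at 10 Gb/s (minimum-sized frames).

MEM_NS = 10
BUDGET_NS = 67.2

for stride in (1, 4, 8):
    accesses = 32 // stride               # trie height = key length / stride
    lookup_ns = accesses * MEM_NS
    # If the lookup takes longer than the budget, throughput degrades pro rata.
    gbps = 10 if lookup_ns <= BUDGET_NS else 10 * (BUDGET_NS / lookup_ns)
    print(f"stride {stride}: {accesses} accesses, {lookup_ns} ns, ~{gbps:.1f} Gb/s")
```

For a stride of 1 this reproduces the circa 2.1Gbps figure (320ns per lookup); a stride of 8 fits inside the 67ns budget.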

"An update [of a multi-bit trie] requires a search through [prefix length]/k lookup iterations plus access to each child node (2^k). The update complexity is O([prefix length]/k + 2^k). In the worst case, each prefix would need an entire path of length ([prefix length]/k) and each node would have 2^k entries. The space complexity would then be O((2^k ∗ [number of prefixes] ∗ [prefix length])/k)...As we increase the stride size the number of levels will decrease and the number of memory accesses will also decrease. However, the memory space consumed will be larger. Therefore, when choosing stride size a tradeoff happens between memory accesses and space requirements. The designer needs to determine the proper number of levels based on the desired lookup speed. For example, if the maximum lookup time should be 30ns and the designer is presented with memory chips at a cost of 10ns for access time, then the number of levels needs to be at most 3. Obviously, choosing the optimal stride size of each level in order to minimize the memory space varies and is highly dependent on the prefixes database." - this is where variable stride size multi-bit tries come in.

Compression algorithms exist for multi-bit tries, namely an LC trie (Level Compressed).


Lookup Memory Types
DRAM typically provides read access times in the order of 50ns (high end DDR3 can go even faster). RLDRAM access times are in the order of 20ns. SRAM is now pushing circa 3-4ns access times (at the time of writing, 2017-10). 10Gbps with 84 byte Ethernet frames (14.88 Mpps) requires 67ns lookup times. 40Gbps native (some 40Gbps implementations are 4x10Gbps lanes) using 84 byte Ethernet frames (59.52 Mpps) would require 17ns lookup times. With a stride width of 8 bits, four memory accesses are required per lookup: 17ns / 4 == 4.25ns budget per memory access. This requirement falls into a category that only SRAM can fulfil. TCAMs can provide lookup times in the region of 1-5ns, meaning they can scale to very high throughputs for single-search requirements like an ACL "hit", or for forwarding lookups using longest prefix matching.
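The budget arithmetic above can be laid out as a small comparison, dividing the per-packet budget by the four memory accesses an 8-bit-stride IPv4 lookup needs and checking it against the access times quoted in the text (the figures are approximate, as stated above):

```python
# Per-memory-access budget at line rate for an 8-bit-stride IPv4 lookup
# (4 accesses per packet), compared against typical memory access times.

ACCESSES = 32 // 8                                 # 8-8-8-8 stride => 4 accesses
budgets_ns = {"10Gbps": 67.2, "40Gbps": 16.8}      # per-packet budget at 14.88 / 59.52 Mpps
memories_ns = {"DRAM": 50, "RLDRAM": 20, "SRAM": 4}  # approximate access times

for rate, budget in budgets_ns.items():
    per_access = budget / ACCESSES
    fits = [name for name, t in memories_ns.items() if t <= per_access]
    print(f"{rate}: {per_access:.2f} ns per access -> {fits or 'nothing fits'}")
```

At 40Gbps only SRAM comes in under the roughly 4.2ns per-access budget, matching the conclusion in the text.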

New technologies like FlashTrie can be used for up to 400Gbps IPv4 native or 600Gbps IPv6 native lookup and forwarding rates:
"We use a hash-based membership query to limit off-chip memory accesses per lookup and to balance memory utilization among the memory modules. By compacting the data structure size, the lookup depth of each level can be increased. We also develop a new data structure called Prefix-Compressed Trie that reduces the size of a bitmap by more than 80%. Our simulation and implementation results show that FlashTrie can achieve 80-Gb/s worst-case throughput while simultaneously supporting 2 M prefixes for IPv4 and 318 k prefixes for IPv6 with one lookup engine and two Double-Data-Rate (DDR3) SDRAM chips. When implementing five lookup engines on a state-of-the-art field programmable gate array (FPGA) chip and using 10 DDR3 memory chips, we expect FlashTrie to achieve 1-Gpps (packets per second) throughput, equivalent to 400 Gb/s for IPv4 and 600 Gb/s for IPv6. FlashTrie also supports incremental real-time updates."

Below some example multi-bit trie memory requirements are shown, using a stride of 8 bits, meaning that each node entry is 1 byte in size. The first example shows that over 4 gibibytes of memory would be required to store all possible IPv4 addresses as /32 routing entries (arguably a full /32 table wouldn't normally be seen, however a typical service provider network carries many copies of the RFC 1918 ranges in layer 3 VPNs for its customers; also, this is just a worst possible example case!).

A geometric progression can be used to calculate the storage requirement for a full tree: k*(1-k^h)/(1-k) where k is the number of possible values based on the stride size and h is the height of the tree. With a stride size of 8, k = 2^8 = 256 (which is why each tree node entry is 1 byte in size). Storing 32-bit IPv4 addresses means the number of tree "levels" (the height from root to leaf) h = 32 / 8 = 4.

256*(1-256^4)/(1-256) = 4,311,810,304 bytes.
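The geometric progression evaluates as follows (a direct transcription of the formula above, with k = 256 and h = 4):

```python
# Full-tree storage via the geometric progression k*(1-k^h)/(1-k),
# where k is the values per stride (2^8 = 256) and h the tree height (32/8 = 4).

def full_tree_bytes(k, h):
    # Equivalent to summing k + k**2 + ... + k**h node entries, one byte each.
    return k * (1 - k**h) // (1 - k)

print(full_tree_bytes(256, 4))   # 4311810304 bytes
```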

A more realistic scenario is to carry 1 or 2 million IPv4 routes in a router's FIB. Below are two more multi-bit trie examples, still using a stride of 8 bits. The second example shows the best possible (theoretical but unrealistic) scenario with 1 million IPv4 routes stored in the FIB, which in this case are all /32's falling within the range 0.[0-15].[0-255].[0-255]. They fit in just 1.00 mebibytes of memory. The third example shows the worst possible (theoretical but unrealistic) scenario in which the same number, 1 million IPv4 /32 prefixes, fit into 2.06 mebibytes of memory, double that of the second example.

Full tree, 4.29 billion IPv4 entries:

      0...255          256 combos: 256 * 1 byte
     /       \
    0...255   0...255  256 * 256 combos: 65536 * 1 byte = 64 KiB
   /       \
  0...255   0...255    65536 * 256 combos: 16,777,216 * 1 byte = 16.00 MiB
 /      \
0...255  0...255       16,777,216 * 256 combos: 4,294,967,296 * 1 byte = 4096.00 MiB

(256 * 1 byte) + (65,536 * 1 byte) + (16,777,216 * 1 byte) + (4,294,967,296 * 1 byte) =

4,311,810,304 bytes == 4112.06 MiBs (4.01 GiB).

1M IPv4 entries, best case:

        0              1 combo: 1 byte
    0...15             16 combos: 16 * 1 byte
   /      \
  0...255  0...255     16 * 256 combos: 4096 * 1 byte
 /       \
0...255   0...255      4096 * 256 combos: 1,048,576 * 1 byte

(1 * 1 byte) + (16 * 1 byte) + (4096 * 1 byte) + (1,048,576 * 1 byte) =

1,052,689 bytes == 1.00 MiB

1M IPv4 entries, worst case:

      0...255          256 combos: 256 * 1 byte
     /       \
    0...255   0...255  256 * 256 combos: 65536 * 1 byte = 64 KiB
   /       \
  0...15    0...15     65536 * 16 combos: 1,048,576 * 1 byte = 1 MiB
 /      \
0        0             1,048,576 combos: 1,048,576 * 1 byte = 1 MiB

(256 * 1 byte) + (65,536 * 1 byte) + (1,048,576 * 1 byte) + (1,048,576 * 1 byte) =

2,162,944 bytes == 2.06 MiBs.
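The three worked examples above can be sanity-checked by summing the bytes at each level (one byte per node entry, stride 8):

```python
# Per-level byte counts for the three multi-bit trie examples above.

MiB = 1024 * 1024

full_tree = 256 + 256**2 + 256**3 + 256**4          # every /32 populated
best_1m   = 1 + 16 + 16 * 256 + 16 * 256 * 256      # 1M /32s packed under 0.[0-15].x.x
worst_1m  = 256 + 256**2 + 256**2 * 16 + 1_048_576  # 1M /32s maximally spread

print(full_tree)                  # 4311810304
print(best_1m, round(best_1m / MiB, 2))    # 1052689  1.0
print(worst_1m, round(worst_1m / MiB, 2))  # 2162944  2.06
```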


Tries and Network Vendors
Juniper devices such as the MX series routers store forwarding information in what Juniper call the Jtree, which is their own radix trie implementation; from the referenced Juniper patents:

If the current next hop indicates a tree based search is to be performed, the key engine 905 invokes the trie search engine 906 d to perform a lookup operation that includes a longest match lookup traversing a radix trie data structure (referred to herein as a "jtree"). The search is based on the specified number of bits at a particular starting point in the key buffer. The process for performing the longest best match lookup is described in greater detail in copending application "Separation of Data and Control in a Switching Device". The result of the longest match lookup of the key bits is a next hop. More specifically, a route in a jtree consists of a one word 'next hop', at a double-word aligned memory location, followed by zero or more words of prefix information. One or more jtrees 914 are stored in memory 920. A next hop specifying a jtree search includes identifying information for the particular jtree to be searched. The storage of a jtree in memory 920 is described in greater detail in "Separation of Data and Control in a Switching Device".

Juniper describe in the additional linked patents that SRAM can be used for low latency access to the Jtree, and that multiple packet processing engines can be used in parallel to scale up router throughput.

Cisco use what they call an Mtrie (defined in a patent linked in the references section), which seems to be a multi-bit trie with a fixed stride width of 8 bits providing an "8-8-8-8 stride". Cisco also have a patent referenced for what they call the Mtrie Plus; from the patent references at the top of this page:

In an embodiment, different aspects of a packet header and data included in the packet are singled out for attention, rather than just the four byte IP destination address. Different information is included in nodes of the trie that enables matching and branching on different header fields. In an embodiment, the ACL of a configuration file in a router or switch is compiled into a trie data structure located in the memory of the router or switch. In an embodiment, a trie data structure is used to map a multicast packet header by a sequence of nodes that match on destination address or source address.

One drawback to TRIE data structure processing is that it is limited to processing the destination IP address. It would be desirable to expand on this so as to single out other aspects of the packet header beyond the four bytes specifying the destination IP address.

A second drawback to the use of a TRIE for lookups is that access control list (ACL) processing remains separate from routing. In some embodiments, ACL processing is driven by software. Since this process is relatively slow when compared to high-speed routers, ACL processing slows the overall rate at which data packets are forwarded. Depending upon the length of the ACL criteria, router speeds can drop 70% or more.

A third drawback to use of a TRIE is that the nodes and leaves of the TRIE generally do not provide adequate information to direct multicast routing.

Cisco also has a linked patent for scaling the Mtrie performance by replicating the same Mtrie or nodes of a trie across multiple memory banks which are accessed in parallel, from the patent referenced at the top of this page:

The invention provides a method and system for rapid access to one or more M-tries for responding to header information. The M-tries are stored in a plurality of memory banks, which are accessed in parallel to provide relatively greater access throughput. Parts of the M-tries that are (or are expected to be) frequently referenced are stored in multiple banks of the memory, to provide concurrent simultaneous access for those parts of the M-tries for parallel lookup of multiple routes. Regions of the multiple banks of the memory can be dynamically reallocated to provide improved access throughput to those multiple banks. The invention can be applied to routing decisions in response to destination addresses, to combinations of destination and source addresses (either for unicast or multicast routing), to access control decisions, to quality of service decisions, to accounting, and to other administrative processing in response to header information.


Cisco Specific Examples
Some devices use a Gtrie, such as the Cisco ASR1000 series, Cisco 12000 series, CRS and ASR9000 series routers. The output from the command "show platform hardware qfp active feature cef-mpls prefix ip" on an ASR1000 indicates a Gtrie with a height of 6. However, the "show cef" commands below say otherwise, stating mtrie and rtree:

#show platform hardware qfp active feature cef-mpls prefix ip
=== Gtrie Node ===

Gtrie Node Type: Tree Node
HW Content: : 094c690d 00000004 00000000 90010000
Gtrie Tree Node Type:: Search Trie Node
=== Gtrie Search Node ===
  TN type 0, TN scan use 0, TN stride 6
  TN inode exists 1, TN skip 0
  TN zero perf real len: 0
  TN par bl offset: 0
  TN par bl len: 0
TBM Tree Array
  TA NNodes 3, TA INode Exists 1, TN TNRefs 0x00000000058140f0
TBM Tree Node Bitmap
Search Node Bitmap: 00 00 00 00 90 01 00 00
=== Gtrie Node ===

Gtrie Node Type: Leaf Node

#show ip cef tree
Table Default tree information:
 MTRIE/RTREE storing IPv4 addresses
 512 entries (512/0 fwd/non-fwd)
 Forwarding tree:
  Forwarding lookup routine: IPv4 mtrie 8-8-8-8 optimized
  14362 inserts, 13850 deletes
  8-8-8-8 stride pattern

#show ipv6 cef tree
Table Default tree information:
 RTREE storing IPv6 addresses
 13 entries (13/0 fwd/non-fwd)
 Forwarding & Non-forwarding tree:
  65 inserts, 52 deletes
  13 leaves (520 bytes), 21 nodes (1176 bytes)
  1696 total bytes

The Cisco 7200, ASR920 and ME3600X/ME3800X platforms all state the same 8-8-8-8 IPv4 mtrie and IPv6 rtree in the output of the same cef commands as the ASR1000. A Cisco 7600 states 8-8-4-4-4-4, perhaps a sign that it is optimised for storing lots of smaller prefixes?

#show ip cef tree
Table Default tree information:
 MTRIE storing IPv4 addresses
 654092 entries (654092/0 fwd/non-fwd)
 Forwarding & Non-forwarding tree:
  Forwarding lookup routine: IPv4 mtrie generic
  68027442 inserts, 67373350 deletes
  8-8-4-4-4-4 stride pattern
  short mask protection enabled for <= 4 bits without process suspension
  654092 leaves (18314576 bytes), 133145 nodes (10858000 bytes)
  29174240 total bytes

#show ipv6 cef tree
Table Default tree information:
 RTREE storing IPv6 addresses
 40882 entries (40881/1 fwd/non-fwd)
 Forwarding & Non-forwarding tree:
  9648666 inserts, 9607784 deletes
  40882 leaves (817640 bytes), 78087 nodes (2811132 bytes)
  3628772 total bytes

The above Cisco devices show the following hidden commands in the output of "show run all", however they don't seem to be reconfigurable:

#show run all | i cef
cef table vrf tree IPv4 type MTRIE short-mask-protection 4 stride-pattern 8-8-8-8 hardware-api-notify off
cef table vrf tree IPv6 type RTREE