diff options
author | Neale Ranns <nranns@cisco.com> | 2020-11-09 10:09:42 +0000 |
---|---|---|
committer | Florin Coras <florin.coras@gmail.com> | 2021-01-14 19:55:55 +0000 |
commit | dfd3954c0427422e2739b858d1e18503a5c59970 (patch) | |
tree | 13225967f028f7d386b8da863656b5e576c0b463 /docs/gettingstarted | |
parent | 1b5ca985dc51bea730ce5ee799641c75f73a0f26 (diff) |
docs: Update FIB documentation
Type: docs
Signed-off-by: Neale Ranns <nranns@cisco.com>
Change-Id: I3dfde4520a48c945ca9707accabbe1735c1a8799
Diffstat (limited to 'docs/gettingstarted')
20 files changed, 1390 insertions, 181 deletions
diff --git a/docs/gettingstarted/developers/fib20/arpentries.rst b/docs/gettingstarted/developers/fib20/arpentries.rst deleted file mode 100644 index 3606e620645..00000000000 --- a/docs/gettingstarted/developers/fib20/arpentries.rst +++ /dev/null @@ -1,32 +0,0 @@ -.. _arpentries: - -ARP Entries -^^^^^^^^^^^ - -.. figure:: /_images/fib20fig1.png - -Figure 1: ARP data model - -Figure 1 shows the data model for an ARP entry. An ARP entry contains the mapping -between a peer, identified by an IPv4 address, and its MAC address on a given -interface. The VRF the interface is bound to, is not part of the data. VRFs are -an ingress function not egress. The ARP entry describes how to send traffic to a -peer, which is an egress function. - -The *arp_entry_t* represents the control-plane addition of the ARP entry. The -*ip_adjacency_t* contains the data derived from the *arp_entry_t* that is need to -forward packets to the peer. The additional data in the adjacency are the *rewrite* -and the *link_type*. The *link_type* is a description of the protocol of the packets -that will be forwarded with this adjacency; this can be IPv4 or MPLS. The *link_type* -maps directly to the ether-type in an Ethernet header, or the protocol filed in a -GRE header. The rewrite is a byte string representation of the header that will be -prepended to the packet when it is sent to that peer. For Ethernet interfaces this -would be the src,dst MAC and the ether-type. For LISP tunnels, the IP src,dst pair -and the LISP header. - -The *arp_entry_t* will install a *link_type=IPv4* when the entry is created and a -link_type=MPLS when the interface is MPLS enabled. Interfaces must be explicitly -MPLS enabled for security reasons. - -So that adjacencies can be shared between route, adjacencies are stored in a single -data-base, the key for which is {interface, next-hop, link-type}. diff --git a/docs/gettingstarted/developers/fib20/attachedexport.rst b/docs/gettingstarted/developers/fib20/attachedexport.rst index c2802773d9c..3bf933de679 100644 --- a/docs/gettingstarted/developers/fib20/attachedexport.rst +++ b/docs/gettingstarted/developers/fib20/attachedexport.rst @@ -3,8 +3,8 @@ Attached Export ^^^^^^^^^^^^^^^^ -Extranets make prefixes in VRF A also reachable from VRF B. VRF A is the export VRF, -B the import. Consider this route in the export VRF; +Extranets make prefixes in table A also reachable from table B. Table A is the export table, +B the import. Consider this route in the export table; .. code-block:: console @@ -26,7 +26,7 @@ attached prefix is exported, so are the adj-fibs and local prefixes that it covers, and only the adj-fibs and locals, not any covered more specific (sourced e.g. by API). The imported FIB entries are sourced as *attached-export* this is a low priority source, so if those prefixes already exist in the import -VRF, sourced by the API, then they will continue to forward with that information. +table, sourced by the API, then they will continue to forward with that information. .. figure:: /_images/fib20fig6.png diff --git a/docs/gettingstarted/developers/fib20/barnacles.rst b/docs/gettingstarted/developers/fib20/barnacles.rst new file mode 100644 index 00000000000..b5b89a34b36 --- /dev/null +++ b/docs/gettingstarted/developers/fib20/barnacles.rst @@ -0,0 +1,78 @@ +.. _barnacles: + +Barnacles +--------- + +Features that are stuck on the side of the FIB. Those that directly use +the services that the FIB provides. + +In the section on FIB fundamentals it was mentioned that there is a +separation between what to match and how to forward. In an IP FIB what +to match is the packet's destination address against a table of IP +prefixes, and how to forward is described by a list of paths (the +**fib_path_list_t**). + +ACL Based Forwarding +^^^^^^^^^^^^^^^^^^^^ + +ACL Based Forwarding (ABF) is also know as policy based routing +(PBR). In ABF what to match is described by an ACL. + +ABF uses two VPP services; ACL as a service, as provided by the ACL +plugin and FIB path-lists. It just glues them together. + +An ABF policy is the combination of an ACL with the forwarding +description of a FIB path-list. An ABF attachment is the association +of [an ordered set of] ABF policies to an interface. The attachment is +consulted on the ingress path of the IP DP (as an input +feature). If the ACL matches then the associated forwarding is +followed, if not, the packet continues along the DP. Simples. + +Layer 3 Cross Connect +^^^^^^^^^^^^^^^^^^^^^ + +An L3 cross-connect (L3XC) matches all packets +that ingress the interface and then forwards using the supplied FIB +path-list. Naturally it runs as an input feature in the IP +path. Super simples. + +IP Punt +^^^^^^^ + +Matches all IP packets that VPP has punted. Why they are punted is not +relevant. All IP punted packets are sent by VPP to the punt feature +arc. This feature 'matches' all packets that it receives and forwards +using the FIB path-list. + + +Unicast Reverse Path Forwarding +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Unicast Reverse Path Forwarding (uRPF) is the process of ensuring that +a packet has a conforming source address. It comes in two +flavours: + +- loose: The source address must be reachable, i.e. FIB must have a + route that will forward to the source address. The default route + counts as long as it does not drop. +- strict: The source address is reachable via the interface on which + the packet arrived, i.e. the FIB's route for the source address must + include the input interface as an output interface. + +The uRPF feature can run on either the input or output IP feature +arc. In both cases it serves as an anti-spoofing check, though the +semantics are slightly different. On the input arc it enforces that +peers on that link are only using source addresses that they should - +a network admin should employ at the access edge. On the output +arc it enforces that a packet is sourced from a prefix that belongs to +the network, i.e. that is has originated from within an SP's +network, a network admin could use at its peering points. + +To perform a uRPF check, the DP performs an IP FIB lookup on the +source address, this always results in a load-balance (LB) object. If +the LB has only 1 bucket and that bucket stacks on a drop DPO, then +both a loose and strict check will fail, otherwise a loose check +will pass. Each LB object has an associated uRPF list object. This +object holds the list of interfaces through which the prefix is +reachable. To pass the strict check, the input/output interface must +be in this list. diff --git a/docs/gettingstarted/developers/fib20/controlplane.rst b/docs/gettingstarted/developers/fib20/controlplane.rst index 172e36ca40c..8b58d42f5b3 100644 --- a/docs/gettingstarted/developers/fib20/controlplane.rst +++ b/docs/gettingstarted/developers/fib20/controlplane.rst @@ -12,10 +12,11 @@ information. .. toctree:: - arpentries + neighbors routes attachedexport graphwalks + marknsweep .. rubric:: Footnotes: diff --git a/docs/gettingstarted/developers/fib20/dataplane.rst b/docs/gettingstarted/developers/fib20/dataplane.rst index a9e2bf27d69..34886e18a44 100644 --- a/docs/gettingstarted/developers/fib20/dataplane.rst +++ b/docs/gettingstarted/developers/fib20/dataplane.rst @@ -71,24 +71,25 @@ The diagrams above show that for any given route the full data-plane graph is known before any packet arrives. If that graph is composed of n objects, then the packet will visit n nodes and thus incur a forwarding cost of approximately n times the graph node cost. This could be reduced if the graph were *collapsed* -into a single DPO and associated node. However, collapsing a graph removes the -indirection objects that provide fast convergence (see section Fast Convergence). To -collapse is then a trade-off between faster forwarding and fast convergence; VPP -favours the latter. - -This DPO model effectively exists today but is informally defined. Presently the -only object that is in the data-plane is the ip_adjacency_t, however, features -(like ILA, OAM hop-by-hop, SR, MAP, etc) sub-type the adjacency. The member -lookup_next_index is equivalent to defining a new sub-type. Adding to the -existing union, or casting sub-type specific data into the opaque member, or -even over the rewrite string (e.g. the new port range checker), is equivalent -defining a new C-struct type. Fortunately, at this time, all these sub-types are -smaller in memory than the ip_adjacency_t. It is now possible to dynamically -register new adjacency sub-types with ip_register_adjacency() and provide a -custom format function. - -In my opinion a strongly defined object model will be easier for contributors to -understand, and more robust to implement. +into fewer DPOs and nodes. There are two ways we might consider doing +this: + +- write custom DPOs/nodes for combinded functions, e.g. pop MPLS label + and lookup in v4 table. This has the disadvantage that the number of + such nodes would be, well, combinatorial, and resolving a path via + a combined DPO would be more difficult as it would involve a + forward walk of the graph to determine what the combination + is. However, VPP power users might consider this option for a + limited set of their use cases where performance is truely king. +- collapse multiple levels of load-balancing into one. For example, + if there were two levels of load-balancing each with two choices, + this could equally be represented by one level with 4 choices. + +In either case a disadvantage to collapsing the graph is that it +removes the indirection objects that provide fast convergence (see +section Fast Convergence). To collapse is then a trade-off between +faster forwarding and fast convergence; VPP favours the latter. + .. rubric:: Footnotes: diff --git a/docs/gettingstarted/developers/fib20/debugging.rst b/docs/gettingstarted/developers/fib20/debugging.rst new file mode 100644 index 00000000000..e0a85fd736e --- /dev/null +++ b/docs/gettingstarted/developers/fib20/debugging.rst @@ -0,0 +1,106 @@ +.. _debugging: + +Debugging +--------- + +the anatomy of a route: + +.. code-block:: console + + BGvpp# sh ip fib 1.1.1.3/32 + ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:4, default-route:1, ] + 1.1.1.0/24 fib:0 index:9 locks:2 + CLI refs:1 src-flags:added,contributing,active, + path-list:[24] locks:4 flags:shared, uPRF-list:11 len:1 itfs:[1, ] + path:[26] pl-index:24 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved, + 10.0.0.1 loop0 + [@0]: arp-ipv4: via 10.0.0.1 loop0 + + forwarding: unicast-ip4-chain + [@0]: dpo-load-balance: [proto:ip4 index:11 buckets:1 uRPF:11 to:[0:0]] + [0] [@3]: arp-ipv4: via 10.0.0.1 loop0 + +let's go line by line. + +.. code-block:: console + + ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:4, default-route:1, ] + +Each field in turn: + +- ipv4-VRF:0: the name of the table (as given by the user, or + automatically generated by VPP). +- fib-index:0; in the VPP pool of FIB objects, this is index 0 +- flow hash:[src dst sport dport proto ]: When calculating the flow + hash to use for load-balanacing, these are the fields in the packet + that are used. There is an API to change this per-table. +- epoch:0; Used during mark-n-sweep. +- flags:none; use the force, to find the per-table flags. +- locks: per-source reference counting, a table can only be deleted + when all sources no longer reference it. + +next line: + +.. code-block:: console + + 1.1.1.0/24 fib:0 index:9 locks:2 + +this shows the route that matched the show request. note that it is not +an exact match, it's an LPM. The route is in FIB index 0, its index +(in the VPP pool of fib_entry_t objects) is nine and there are two +references to the entry. +You'll get the same output if you type "sh fib entry 9" + +next line: + +.. code-block:: console + + CLI refs:1 src-flags:added,contributing,active, + +the 'CLI' has sourced this route (it was added via CLI). This source +has been added (well duh) it is 'active', meaning it is the best +source, and it is contributing a forwarding object. There are some +scenarios where sources other than the active source contribute, +namely interpose sources. + +next line: + +.. code-block:: console + + path-list:[24] locks:4 flags:shared, uPRF-list:11 len:1 itfs:[1, ] + +This is path-list inex 24 (see "sh fib path-list 24" this will also +show the children), it is 'shared', +meaning that if other prefixes were to use the same set of paths, +then they would also use this path-list object. It has uRPF list 11 of +length 1 containing interface index 1 (which is loop0, see "sh int"). + +next line: + +.. code-block:: console + + path:[26] pl-index:24 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved, + 10.0.0.1 loop0 + [@0]: arp-ipv4: via 10.0.0.1 loop0 + +This is path 26 (see "sh fib path 26"). It's a member of +path-list 24. It's ip4 has a weight of 1 and a preference of 0. It's +of type 'attached-nexthop' and currently resolved - woohoo. +It is a path 'via 10.0.0.1 loop0'. It is contributing an incomplete adjacency. + +next line: + +.. code-block:: console + + forwarding: unicast-ip4-chain + [@0]: dpo-load-balance: [proto:ip4 index:11 buckets:1 uRPF:11 to:[0:0]] + [0] [@3]: arp-ipv4: via 10.0.0.1 loop0 + +This section describes how packets of type 'unicast-ip4' will be +forwarded. It is the result of processing the path information from +above. +Here we see load-balance object 11, which has 1 bucket/choice. It is +also linked to uRPF instance 11 (which it got from path-list 24). +In bucket 0 there is the incomplete adjacnecy that was contributed by +path 26. + diff --git a/docs/gettingstarted/developers/fib20/fastconvergence.rst b/docs/gettingstarted/developers/fib20/fastconvergence.rst index 4356d3f1ed7..b07e08cea6d 100644 --- a/docs/gettingstarted/developers/fib20/fastconvergence.rst +++ b/docs/gettingstarted/developers/fib20/fastconvergence.rst @@ -3,6 +3,574 @@ Fast Convergence ------------------------------------ -.. note:: +This is an excellent description of the topic: + +'FIB <https://tools.ietf.org/html/draft-ietf-rtgwg-bgp-pic-12>'_ + +but if you're interested in my take keep reading... + +First some definitions: + +- Convergence; When a FIB is forwarding all packets correctly based + on the network topology (i.e. doing what the routing control plane + has instructed it to do), then it is said to be 'converged'. + Not being in a converged state is [hopefully] a transient state, + when either the topology change (e.g. a link failure) has not been + observed or processed by the routing control plane, or that the FIB + is still processing routing updates. Convergence is the act of + getting to the converged state. +- Fast: In the shortest time possible. There are no absolute limits + placed on how short this must be, although there is one number often + mentioned. Apparently the human ear can detect loss/delay/jitter in + VOIP of 50ms, therefore network failures should last no longer than + this, and some technologies (notably link-free alternate fast + reroute) are designed to converge in this time. However, it is + generally accepted that it is not possible to converge a FIB with + tens of millions of routes in this time scale, the industry + 'standard' is sub-second. + +Converging the FIB quickly is thus a matter of: + +- discovering something is down +- updating as few objects as possible +- to determine which objects to update as efficiently as possible +- to update each object as quickly as possible + +we'll discuss each in turn. +All output came from VPP version 21.01rc0. In what follows I use IPv4 +prefixes, addresses and IPv4 host length masks, however, exactly the +same applies to IPv6. + + +Failure Detection +^^^^^^^^^^^^^^^^^ + +The two common forms (we'll see others later on) of failure detection +are: + +- link down +- BFD + +The FIB needs to hook into these notifications to trigger +convergence. + +Whenever an interface goes down, VPP issues a callback to all +registerd clients. The adjacency code is such a client. The adjacency +is a leaf node in the FIB control-plane graph (containing fib_path_t, +fib_entry_t etc). A back-walk from the adjacnecy will trigger a +re-resolution of the paths. + +FIB is a client of BFD in order to receive BFD notifications. BFD +comes in two flavours; single and multi hop. Single hop is to protect +a specific peer on an interface, such peers are modelled by an +adjacency. Multi hop is to protect a peer on an unspecified interface +(i.e. a remote peer), this peer is represented by a host-prefix +**fib_entry_t**. In both case FIB will add a delegate to the +**ip_adjacency_t** or **fib_entry_t** that represents the association +to the BFD session. If the BFD session signals up/down then a backwalk +can be triggered from the object to trigger re-resolution and hence +convergence. + + +Few Updates +^^^^^^^^^^^ + +In order to talk about what 'a few' is we have to leave the realm of +the FIB as an abstract graph based object DB and move into the +concrete representation of forwarding in a large network. Large +networks are built in layers, it's how you scale them. We'll take +here a hypothetical service provider (SP) network, but the concepts +apply equally to data center leaf-spines. This is a rudimentary +description, but it should serve our purpose. + +An SP manages a BGP autonomous system (AS). The SP's goal is both to +attract traffic into its network to serve its customers, but also to +serve transit traffic passing through it, we'll consider the latter here. +The SP's network is all devices in that AS, these +devices are split into those at the edge (provider edge (PE) routers) +which peer with routers in other SP networks, +and those in the core (termed provider (P) routers). Both the PE and P +routers run the IGP (usually OSPF or ISIS). Only the reachability of the devices +in the AS are advertised in the IGP - thus the scale (i.e. the number +of routes) in the IGP is 'small' - only the number of +devices that the SP has (typically not more than a few 10k). +PE routers run BGP; they have external BGP sessions to devices in +other ASs and internal BGP sessions to devices in the same AS. BGP is +used to advertise the routes to *all* networks on the internet - at +the time of writing this number is approaching 900k IPv4 route, hopefully by +the time you are reading this the number of IPv6 routes has caught up ... +If we include the additional routes the SP carries to offering VPN service to its +customers the number of BGP routes can grow to the tens of millions. + +BGP scale thus exceeds IGP scale by two orders of magnitude... pause for +a moment and let that sink in... + +A comparison of BGP and an IGP is way way beyond the scope of this +documentation (and frankly beyond me) so we'll note only the +difference in the form of the routes they present to FIB. A routing +protocol will produce routes that specify the prefixes that are +reachable through its peers. A good IGP +is link state based, it forms peerings to other devices over these +links, hence its routes specify links/interfaces. In +FIB nomenclature this means an IGP produces routes that are +attached-nexthop, e.g.: + +.. code-block:: console + + ip route add 1.1.1.1/32 via 10.0.0.1 GigEthernet0/0/0 + +BGP on the other hand forms peerings only to neighbours, it does not +know, nor care, what interface is used to reach the peer. In FIB +nomenclature therefore BGP produces recursive routes, e.g.: + +.. code-block:: console + + ip route 8.0.0.0/16 via 1.1.1.1 + +where 1.1.1.1 is the BGP peer. It's no accident in this example that +1.1.1.1/32 happens to be the route the IGP advertised... BGP installs +routes for prefixes reachable via other BGP peers, and the IGP install +the routes to those BGP peers. + +This has been a very long winded way of describing why the scale of +recursive routes is therefore 2 orders of magnitude greater than +non-recursive/attached-nexthop routes. + +If we step back for a moment and recall why we've crawled down this +rabbit hole, we're trying to determine what 'a few' updates means, +does it include all those recursive routes, probably not ... let's +keep crawling. + +We started this chapter with an abstract description of convergence, +let's now make that more real. In the event of a network failure an SP +is interested in moving to an alternate forwarding path as quickly as +possible. If there is no alternate path, and a converged FIB will drop +the packet, then who cares how fast it converges. In other words the +interesting convergence scenarios are the scenarios where the network has +alternate paths. + +PIC Core +^^^^^^^^ + +First let's consider alternate paths in the IGP, e.g.; + +.. code-block:: console + + ip route add 1.1.1.1/32 via 10.0.0.2 GigEthernet0/0/0 + ip route add 1.1.1.1/32 via 10.0.1.2 GigEthernet0/0/1 + +this gives us in the FIB: + +.. code-block:: console + + DBGvpp# sh ip fib 1.1.1.1/32 + ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, default-route:1, ] + 1.1.1.1/32 fib:0 index:15 locks:2 + API refs:1 src-flags:added,contributing,active, + path-list:[23] locks:2 flags:shared, uPRF-list:22 len:2 itfs:[1, 2, ] + path:[27] pl-index:23 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved, + 10.0.0.2 GigEthernet0/0/0 + [@0]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001111111111dead000000000800 + path:[28] pl-index:23 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved, + 10.0.1.2 GigEthernet0/0/1 + [@0]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800 + + forwarding: unicast-ip4-chain + [@0]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:22 to:[0:0]] + [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001111111111dead000000000800 + [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800 + +There is ECMP across the two paths. Note that the instance/index of the +load-balance present in the forwarding graph is 17. + +Let's add a BGP route via this peer; + +.. code-block:: console + + ip route add 8.0.0.0/16 via 1.1.1.1 + +in the FIB we see: + + +.. code-block:: console + + DBGvpp# sh ip fib 8.0.0.0/16 + ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:1, default-route:1, ] + 8.0.0.0/16 fib:0 index:18 locks:2 + API refs:1 src-flags:added,contributing,active, + path-list:[24] locks:2 flags:shared, uPRF-list:21 len:2 itfs:[1, 2, ] + path:[29] pl-index:24 ip4 weight=1 pref=0 recursive: oper-flags:resolved, + via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17] + + forwarding: unicast-ip4-chain + [@0]: dpo-load-balance: [proto:ip4 index:20 buckets:1 uRPF:21 to:[0:0]] + [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:22 to:[0:0]] + [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001111111111dead000000000800 + [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800 + +the load-balance object used by this route is index 20, but note that +the next load-balance in the chain is index 17, i.e. it is exactly +the same instance that appears in the forwarding chain for the IGP +route. So in the forwarding plane the packet first encounters +load-balance object 20 (which it will use in ip4-lookup) and then +number 17 (in ip4-load-balance). + +What's the significance? Let's shut down one of those IGP paths: + +.. code-block:: console + + DBGvpp# set in state GigEthernet0/0/0 down + +the resulting update to the IGP route is: + +.. code-block:: console + + DBGvpp# sh ip fib 1.1.1.1/32 + ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:1, default-route:1, ] + 1.1.1.1/32 fib:0 index:15 locks:4 + API refs:1 src-flags:added,contributing,active, + path-list:[23] locks:2 flags:shared, uPRF-list:25 len:2 itfs:[1, 2, ] + path:[27] pl-index:23 ip4 weight=1 pref=0 attached-nexthop: + 10.0.0.2 GigEthernet0/0/0 + [@0]: arp-ipv4: via 10.0.0.2 GigEthernet0/0/0 + path:[28] pl-index:23 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved, + 10.0.1.2 GigEthernet0/0/1 + [@0]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800 + + recursive-resolution refs:1 src-flags:added, cover:-1 + + forwarding: unicast-ip4-chain + [@0]: dpo-load-balance: [proto:ip4 index:17 buckets:1 uRPF:25 to:[0:0]] + [0] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800 + + +notice that the path via 10.0.0.2 is no longer flagged as resolved, +and the forwarding chain does not contain this path as a +choice. However, the key thing to note is the load-balance +instance is still index 17, i.e. it has been modified not +exchanged. In the FIB vernacular we say it has been 'in-place +modified', a somewhat linguistically redundant expression, but one that serves +to emphasise that it was changed whilst still be part of the graph, it +was never at any point removed from the graph and re-added, and it was +modified without worker barrier lock held. + +Still don't see the significance? In order to converge around the +failure of the IGP link it was not necessary to update load-balance +object number 20! It was not necessary to update the recursive +route. i.e. convergence is achieved without updating any recursive +routes, it is only necessary to update the affected IGP routes, this is +the definition of 'a few'. We call this 'prefix independent +convergence' (PIC) which should really be called 'recursive prefix +independent convergence' but it isn't... + +How was the trick done? As with all problems in computer science, it +was solved by a layer of misdirection, I mean indirection. The +indirection is the load-balance that belongs to the IGP route. By +keeping this object in the forwarding graph and updating it in place, +we get PIC. The alternative design would be to collapse the two layers of +load-balancing into one, which would improve forwarding performance +but would come at the cost of prefix dependent convergence. No doubt +there are situations where the VPP deployment would favour forwarding +performance over convergence, you know the drill, contributions welcome. + +This failure scenario is known as PIC core, since it's one of the IGP's +core links that has failed. + +iBGP PIC Edge +^^^^^^^^^^^^^ + +Next, let's consider alternate paths in BGP, e.g: + +.. code-block:: console + + ip route add 8.0.0.0/16 via 1.1.1.1 + ip route add 8.0.0.0/16 via 1.1.1.2 + +the 8.0.0.0/16 prefix is reachable via two BGP next-hops (two PEs). + +Our FIB now also contains: + +.. code-block:: console + + DBGvpp# sh ip fib 8.0.0.0/16 + ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:2, default-route:1, ] + 8.0.0.0/16 fib:0 index:18 locks:2 + API refs:1 src-flags:added,contributing,active, + path-list:[15] locks:2 flags:shared, uPRF-list:11 len:2 itfs:[1, 2, ] + path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, + via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17] + path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, + via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-load-balance:12] + + forwarding: unicast-ip4-chain + [@0]: dpo-load-balance: [proto:ip4 index:20 buckets:2 uRPF:11 to:[0:0]] + [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:1 uRPF:25 to:[0:0]] + [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800 + [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800 + [1] [@12]: dpo-load-balance: [proto:ip4 index:12 buckets:1 uRPF:13 to:[0:0]] + [0] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800 + +The first load-balance (LB) in the forwarding graph is index 20 (the astute +reader will note this is the same index as in the previous +section, I am adding paths to the same route, the load-balance is +in-place modified again). Each choice in LB 20 is another LB +contributed by the IGP route through which the route's paths recurse. + +So what's the equivalent in BGP to a link down in the IGP? An IGP link +down means it loses its peering out of that link, so the equivalent in +BGP is the loss of the peering and thus the loss of reachability to +the peer. This is signaled by the IGP withdrawing the route to the +peer. But "Wait wait wait", i hear you say ... "just because the IGP +withdraws 1.1.1.1/32 doesn't mean I can't reach 1.1.1.1, perhaps there +is a less specific route that gives reachability to 1.1.1.1". Indeed +there may be. So a little more on BGP network design. I know it's like +a bad detective novel where the author drip feeds you the plot... When +describing iBGP peerings one 'always' describes the peer using one of +its GigEthernet0/0/back addresses. Why? A GigEthernet0/0/back interface +never goes down (unless you admin down it yourself), some muppet can't +accidentally cut through the GigEthernet0/0/back cable whilst digging up the +street. And what subnet mask length does a prefix have on a GigEthernet0/0/back +interface? it's 'always' a /32. Why? because there's no cable to connect +any other devices. This choice justifies there 'always' being a /32 +route for the BGP peer. But what prevents there not being a less +specific - nothing. +Now clearly if the BGP peer crashes then the /32 for its GigEthernet0/0/back is +going to be removed from the IGP, but what will withdraw the less +specific - nothing. + +So in order to make use of this trick of relying on the withdrawal of +the /32 for the peer to signal that the peer is down and thus the +signal to converge the FIB, we need to force FIB to recurse only via +the /32 and not via a less specific. This is called a 'recursion +constraint'. In this case the constraint is 'recurse via host' +i.e. for ipv4 use a /32. +So we need to update our route additions from before: + +.. code-block:: console + + ip route add 8.0.0.0/16 via 1.1.1.1 resolve-via-host + ip route add 8.0.0.0/16 via 1.1.1.2 resolve-via-host + +checking the FIB output is left as an exercise to the reader. I hope +you're doing these configs as you read. There's little change in the +output, you'll see some extra flags on the paths. + +Now let's add the less specific, just for fun: + + +.. code-block:: console + + ip route add 1.1.1.0/28 via 10.0.0.2 GigEthernet0/0/0 + +nothing changes in resolution of 8.0.0.0/16. + +Now withdraw the route to 1.1.1.2/32: + +.. code-block:: console + + ip route del 1.1.1.2/32 via 10.0.0.2 GigEthernet0/0/0 + +In the FIB we see: + +.. code-block:: console + + DBGvpp# sh ip fib 8.0.0.0/32 + ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:2, default-route:1, ] + 8.0.0.0/16 fib:0 index:18 locks:2 + API refs:1 src-flags:added,contributing,active, + path-list:[15] locks:2 flags:shared, uPRF-list:13 len:2 itfs:[1, 2, ] + path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host, + via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17] + path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: cfg-flags:resolve-host, + via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-drop:0] + + forwarding: unicast-ip4-chain + [@0]: dpo-load-balance: [proto:ip4 index:20 buckets:1 uRPF:13 to:[0:0]] + [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:27 to:[0:0]] + [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800 + [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800 + +the path via 1.1.1.2 is unresolved, because the recursion constraints +are preventing the the path resolving via 1.1.1.0/28. the LB index 20 +has been updated to remove the unresolved path. + +Job done? Not quite! Why not? + +Let's re-examine the goals of this chapter. We wanted to update 'a +few' objects, which we have defined as not all the millions of +recursive routes. Did we do that here? We sure did, when we +modified LB index 20. So WTF?? Where's the indirection object that can +be modified so that the LBs for the recursive routes are not +modified - it's not there.... WTF? + +OK so the great detective has assembled all the suspects in the +drawing room and only now does he drop the bomb; the FIB knows the +scale, we talked above about what the scale **can** be, worst case +scenario, but that's not necessarily what it is in this hypothetical +(your) deployment. It knows how many recursive routes there are that +depend on a /32, it can thus make its own determination of the +definition of 'a few'. In other words, if there are only 'a few' +recursive prefixes that depend on a /32 then it will update them +synchronously (and we'll discuss what synchronously means a bit more later). + +So what does FIB consider to be 'a few'. Let's add more routes and +find out. + +.. code-block:: console + + DBGvpp# ip route add 8.1.0.0/16 via 1.1.1.2 resolve-via-host via 1.1.1.1 resolve-via-host + ... + DBGvpp# ip route add 8.63.0.0/16 via 1.1.1.2 resolve-via-host via 1.1.1.1 resolve-via-host + +and we see: + +.. code-block:: console + + DBGvpp# sh ip fib 8.8.0.0 + ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:4, default-route:1, ] + 8.8.0.0/16 fib:0 index:77 locks:2 + API refs:1 src-flags:added,contributing,active, + path-list:[15] locks:128 flags:shared,popular, uPRF-list:28 len:2 itfs:[1, 2, ] + path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host, + via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17] + path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host, + via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-load-balance:12] + + forwarding: unicast-ip4-chain + [@0]: dpo-load-balance: [proto:ip4 index:79 buckets:2 uRPF:28 flags:[uses-map] to:[0:0]] + load-balance-map: index:0 buckets:2 + index: 0 1 + map: 0 1 + [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:27 to:[0:0]] + [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800 + [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800 + [1] [@12]: dpo-load-balance: [proto:ip4 index:12 buckets:1 uRPF:18 to:[0:0]] + [0] [@3]: arp-ipv4: via 10.0.1.2 GigEthernet0/0/0 + + +Two elements to note here; the path-list has the 'popular' flag and +there is a load-balance map in the forwarding path. + +'popular' in this case means that the path-list has passed the limit +of 'a few' in the number of children it has. + +here are the children: + +.. code-block:: console + + DBGvpp# sh fib path-list 15 + path-list:[15] locks:128 flags:shared,popular, uPRF-list:28 len:2 itfs:[1, 2, ] + path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host, + via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17] + path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host, + via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-load-balance:12] + children:{entry:18}{entry:21}{entry:22}{entry:23}{entry:25}{entry:26}{entry:27}{entry:28}{entry:29}{entry:30}{entry:31}{entry:32}{entry:33}{entry:34}{entry:35}{entry:36}{entry:37}{entry:38}{entry:39}{entry:40}{entry:41}{entry:42}{entry:43}{entry:44}{entry:45}{entry:46}{entry:47}{entry:48}{entry:49}{entry:50}{entry:51}{entry:52}{entry:53}{entry:54}{entry:55}{entry:56}{entry:57}{entry:58}{entry:59}{entry:60}{entry:61}{entry:62}{entry:63}{entry:64}{entry:65}{entry:66}{entry:67}{entry:68}{entry:69}{entry:70}{entry:71}{entry:72}{entry:73}{entry:74}{entry:75}{entry:76}{entry:77}{entry:78}{entry:79}{entry:80}{entry:81}{entry:82}{entry:83}{entry:84} + +64 children makes it popular. The number is fixed (there is no API to +change it). Its choice is an attempt to balance the performance cost +of the indirection performance degradation versus the convergence +gain. + +Popular path-lists contribute the load-balance map, this is the +missing indirection object. Its indirection happens when choosing the +bucket in the LB. The packet's flow-hash is taken 'mod number of +buckets' to give the 'candidate bucket' then the map will take this +'index' and convert it into the 'map'. You can see in the example above +that no change occurs, i.e. if the flow-hash mod n chooses bucket 1 +then it gets bucket 1. + +Why is this useful? The path-list is shared (you can convince +yourself of this if you look at each of the 8.x.0.0/16 routes we +added) and all of these routes use the same load-balance map, therefore, to +converge all the recursive routs, we need only change the map and +we're good; we again get PIC. + +OK who's still awake... if you're thinking there's more to this story, +you're right. Keep reading. + +This failure scenario is called iBGP PIC edge. It's 'edge' because it +refers to the loss of an edge device, and iBGP because the device was +a iBGP peer (we learn iBGP peers in the IGP). There is a similar eBGP +PIC edge scenario, but this is left for an exercise to the reader (hint +there are other recursion constraints - see the RFC). + +Which Objects +^^^^^^^^^^^^^ + +The next topic on our list of how to converge quickly was to +effectively find the objects that need to be updated when a converge +event happens. If you haven't realised by now that the FIB is an +object graph, then can I politely suggest you go back and start from +the beginning ... + +Finding the objects affected by a change is simply a matter of walking +from the parent (the object affected) to its children. These +dependencies are kept really for this reason. + +So is fast convergence just a matter of walking the graph? Yes and +no. The question to ask yourself is this, "in the case of iBGP PIC edge, +when the /32 is withdrawn, what is the list of objects that need to be +updated and particularly what is the order they should be updated in +order to obtain the best convergence time?" Think breadth v. depth first. + +... ponder for a while ... + +For iBGP PIC edge we said it's the path-list that provides the +indirection through the load-balance map. Hence once all path-lists +are updated we are converged, thereafter, at our leisure, we can +update the child recursive prefixes. Is the breadth or depth first? + +It's breadth first. + +Breadth first walks are achieved by spawning an async walk of the +branch of the graph that we don't want to traverse. Withdrawing the /32 +triggers a synchronous walk of the children of the /32 route, we want +a synchronous walk because we want to converge ASAP. This synchronous +walk will encounter path-lists in the /32 route's child dependent list. +These path-lists (and thier LB maps) will be updated. If a path-list is +popular, then it will spawn a async walk of the path-list's child +dependent routes, if not it will walk those routes. So the walk +effectively proceeds breadth first across the path-lists, then returns +to the start to do the affected routes. + +Now the story is complete. The murderer is revealed. + +Let's withdraw one of the IGP routes. + +.. code-block:: console + + DBGvpp# ip route del 1.1.1.2/32 via 10.0.1.2 GigEthernet0/0/1 + + DBGvpp# sh ip fib 8.8.0.0 + ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:4, default-route:1, ] + 8.8.0.0/16 fib:0 index:77 locks:2 + API refs:1 src-flags:added,contributing,active, + path-list:[15] locks:128 flags:shared,popular, uPRF-list:18 len:2 itfs:[1, 2, ] + path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host, + via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17] + path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: cfg-flags:resolve-host, + via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-drop:0] + + forwarding: unicast-ip4-chain + [@0]: dpo-load-balance: [proto:ip4 index:79 buckets:1 uRPF:18 to:[0:0]] + [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:27 to:[0:0]] + [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800 + [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800 + +the LB Map has gone, since the prefix now only has one path. You'll +need to be a CLI ninja if you want to catch the output showing the LB +map in its transient state of: + +.. code-block:: console + + load-balance-map: index:0 buckets:2 + index: 0 1 + map: 0 0 + +but it happens. Trust me. I've got tests and everything. + +On the final topic of how to converge quickly; 'make each update fast' +there are no tricks. + + - To be written diff --git a/docs/gettingstarted/developers/fib20/graphs.rst b/docs/gettingstarted/developers/fib20/graphs.rst index ec4c6760062..aec0e4b0135 100644 --- a/docs/gettingstarted/developers/fib20/graphs.rst +++ b/docs/gettingstarted/developers/fib20/graphs.rst @@ -26,9 +26,9 @@ The many to one relationship between child and parent means that the lifetime of parent object must extend to the lifetime of its children. If the control plane removes a parent object before its children, then the parent must remain, in an **incomplete** state, until the children are themselves removed. Likewise if a child -is created before its parent, the parent is completed in an *incomplete* state. These +is created before its parent, the parent is created in an *incomplete* state. These incomplete objects are needed to maintain the graph dependencies. Without them when -the parent is added finding the affected children would be search through many +the parent is added finding the affected children would require a search through many databases for those children. To extend the lifetime of parents all children thereof hold a **lock** on the parent. This is a simple reference count. Children then follow the add-or-lock/unlock semantics for finding a parent, as opposed to a malloc/free. diff --git a/docs/gettingstarted/developers/fib20/graphwalks.rst b/docs/gettingstarted/developers/fib20/graphwalks.rst index 36fdb3f2723..e740660a2ed 100644 --- a/docs/gettingstarted/developers/fib20/graphwalks.rst +++ b/docs/gettingstarted/developers/fib20/graphwalks.rst @@ -8,13 +8,13 @@ susceptible to memory re-allocation, therefore the use of a bare "C" pointer to to a child or parent is not possible. Instead there is the concept of a *fib_node_ptr_t* which is a tuple of type,index. The type indicates what type of object it is (and hence which pool to use) and the index is the index in that pool. This allows -for the safe retrieval of any object type. +for the safe retrieval of any object type. When a child resolves via a parent it does so knowing the type of that parent. The child to parent relationship is thus fully known to the child, and hence a forward walk of the graph (from child to parent) is trivial. However, a parent does not choose its children, it does not even choose the type. All object types that form part of the -FIB control plane graph all inherit from a single base class14; *fib_node_t*. A *fib_node_t* +FIB control plane graph all inherit from a single base class; *fib_node_t*. A *fib_node_t* identifies the object's index and its associated virtual function table provides the parent a mechanism to visit that object during the walk. The reason for a back-walk is to inform all children that the state of the parent has changed in some way, and @@ -32,7 +32,7 @@ contains a *fib_node_ptr_t*. The VPP pool memory model applies to the list eleme they are also identified by an index. When a child is added to a list it is returned the index of the element. Using this index the element can be removed in constant time. The list supports 'push-front' and 'push-back' semantics for ordering. To walk the children -of a parent is then to iterate of this list. +of a parent is then to iterate this list. A back-walk of the graph is a depth first search where all children in all levels of the hierarchy are visited. Such walks can therefore encounter all object instances in the @@ -59,11 +59,14 @@ same parent instance before the fib-walk process can run. FIB is a 'final state' If a parent changes n times, it is not necessary for the children to also update n times, instead it is only necessary that this child updates to the latest, or final, state. Consequently when multiple walks on a parent (and hence potential updates to a -child) are queued, these walks can be merged into a single walk. +child) are queued, these walks can be merged into a single walk. This +is the main reason the walks are designed this way, to eliminate (as +much as possible) redundant work and thus converge the system as fast +as possible. Choosing between a synchronous and an asynchronous walk is therefore a trade-off between time it takes to propagate a change in the parent to all of its children, versus the -time it takes to act on a single route update. For example, if a route update where to +time it takes to act on a single route update. For example, if a route update were to affect millions of child recursive routes, then the rate at which such updates could be processed would be dependent on the number of child recursive route which would not be good. At the time of writing FIB2.0 uses synchronous walk in all locations except when diff --git a/docs/gettingstarted/developers/fib20/hacking.rst b/docs/gettingstarted/developers/fib20/hacking.rst new file mode 100644 index 00000000000..f64d3deb860 --- /dev/null +++ b/docs/gettingstarted/developers/fib20/hacking.rst @@ -0,0 +1,68 @@ +.. _hacking: + +Get Hacking +----------- + +The code's directory structure is trivial, FIB, mFIB, adj have their +own directories. + +for the most part, for all the FIB object types mentioned in this +documentation there is a corresponding .h and .c file. As with any VPP +component/sub-system a 'public' header file is any file that can be +included by another sub-system and/or plugin. These must be specified +in the build-system, so go look there. Public header files are always +a good entry point to start reading. + +FIB +^^^ + +There is no direct [VPP's binary] API access to FIB, but FIB does +expose types that can be used on the API by FIB and by other +subsystems (e.g. :ref:`barnacles`). These types are specified in +fib.api and the encoding and decoding thereof in fib_api.[ch]. + +Most operations on a FIB entry happen as a result of an operation on a +FIB table; an entry does not exist in isolation. The APIs in +fib_table.h are well doxygen documented you should be able to figure +out what they do. Use this as a starting point to explore how entries +are created and deleted and how the source priority scheme works. + +FIB sources are defined in fib_source.h. Each source behaviour has its +own file fib_entry_src_*.c These define the virtual functions that +determine how the source behaves when actions on the FIB occur. For +example, what the entry must do when its covering prefix's forwarding +is updated. + +When creating new paths/path-lists the main action required is to +resolve them; see fib_path*_resolve, and once resolved to have them +contribute a DPO for forwarding or for the uRPF list; see +fib_*_contribute_forwarding and fib_*_contribute_urpf respectively. + +The data-structures that used for entry lookup are protocol +specific, they are implemented in separate files; ip4_fib.[ch], +ip6_fib.[ch] and mpls_fib.[ch]. + +FIB extranet support is implemented in fib_attached_export.[ch]. +FIB tracking is implemented in fib_entry_track.[ch]. +FIB [back]walk is implemented in fib_walk.[ch]. + +Adjacency +^^^^^^^^^ + +Not much to say here, each adjacency type has it own file; use the +force, read the source. + + +Testing +^^^^^^^ + +the majority of FIB coverage comes from the C Unit tests in +fib_test.c. I strongly encourage you to add code here. It's a much +easier development cycle to fire up GDB, run VPP and iterate with +'test fib', than it is work in the python UT. You still need to write +python UT, don't get me wrong, it's just easier to do the FIB dev +using C UT. + + + +Enjoy! diff --git a/docs/gettingstarted/developers/fib20/index.rst b/docs/gettingstarted/developers/fib20/index.rst index 1b014524cb8..6a3115beb2c 100644 --- a/docs/gettingstarted/developers/fib20/index.rst +++ b/docs/gettingstarted/developers/fib20/index.rst @@ -10,5 +10,9 @@ FIB 2.0 Hierarchical, Protocol, Independent tunnels mplsfib multicast + debugging fastconvergence scale + barnacles + hacking + missing diff --git a/docs/gettingstarted/developers/fib20/marknsweep.rst b/docs/gettingstarted/developers/fib20/marknsweep.rst new file mode 100644 index 00000000000..e9e38a33f3a --- /dev/null +++ b/docs/gettingstarted/developers/fib20/marknsweep.rst @@ -0,0 +1,68 @@ +.. _marknsweep: + +Mark and Sweep +-------------- + +The mark and sweep procedures, in FIB and in other subsystems, are +built for the purpose of recovering from a control plane crash. + +In routing if the control plane (CP) crashes, when it restarts, the network +topology may have changed. This means that some of the routes that +were programmed in the FIB may no longer be needed, and perhaps some +new ones are. If the CP were simply to insert all the new routes it +learned after it restarts, then FIB could be left with old routes that +never get removed, this would be bigly bad. + +At a high level the requirement is to delete routes from the old set +that are not present in the new set; 'delete the diff' as it might +be colloquially known. + +How should the control plane determine the old set? It could +conceivably read back the FIB from VPP. But this presents two +problems, firstly, it could be a large set of routes, numbering in the +millions, this is not an efficient mechanism and not one one wants to +perform at a point when the router is trying to converge +ASAP. Secondly it represents a 'source of truth' inversion. The +routing plane is the source of truth, not forwarding. Routing should +not receive its 'input' from the layers below. Thirdly, on a practical +note, the reading of VPP data structures to glean this sort of +accurate information, would only happen in this scenario, i.e. it's +not well tested and therefore not particularly reliable (see point 2). + +Enter 'mark and sweep' or m-n-s (not to be confused with the retail +giant) as it's affectionately known. + +The Mark and Sweep algorithm proceeds in three steps: + +- Step 1; the CP declares to VPP that it wants to begin the process + (i.e. it has just restarted). At this point VPP will iterate through + all the objects that the CP owns and 'mark' then as being + stale. This process effectively declares a new 'epoch', a barrier in + time that separates the old objects from the new. +- Step 2; The CP downloads all of its new objects. If one of these new + CP objects matches (has the same key as) an existing object, then + the CP add is considered an update, and the object's stale state is + removed. +- Step 3: The CP declares it has 'converged'; it has no more updates + to give (at this time). VPP will then again iterate through all the + CP's objects and remove those that do not belong to the new epoch, + i.e. those that are still marked stale. + +After step 3, the CP and VPP databases are in sync. + +The cost of the process was to download all the new routes again. This +is a highly-tuned and well-tested scenario. + +In VPP we use the synonym 'replace' to describe the mark-n-sweep +action in the API. We use this term because it refers to the goals of +the algorithm at a high level - the CP wants to replace the old DB +with a new one - but it does not specify the algorithm by which that +is achieved. One could equally perform this task by constructing a +brand new DB in VPP, and then swapping them when the CP +converges. Other subsystems may employ that approach, but FIB does +not. Updates are typically faster than adds, since the update is +likely a no-op, whereas a separate add would require the memory +allocator, which is the long pole in FIB additions. Additionally, it requires +twice the memory for a moment in time, which could be prohibitive when +the FIB is large. + diff --git a/docs/gettingstarted/developers/fib20/missing.rst b/docs/gettingstarted/developers/fib20/missing.rst new file mode 100644 index 00000000000..c5aa51f6106 --- /dev/null +++ b/docs/gettingstarted/developers/fib20/missing.rst @@ -0,0 +1,110 @@ +.. _missing: + +Missing Functionality +--------------------- + +A list of functionality that the FIB does not currently provide. + + +PIC Edge Backup Paths +^^^^^^^^^^^^^^^^^^^^^ + +FIB supports the concept of path 'preference'. Only paths that have +the best preference contribute to forwarding. Only once all the paths with +the best preference go down do the paths with the next best preference +contribute. + +In BGP PIC edge, BGP would install the primary paths and the backup +paths. With expectation that backups are only used once all primaries +fail; this is the same behaviour that FIB's preference sets provide. + +However, in order to get prefix independent convergence, one must be +able to only modify the path-list's load-balance map (LBM) to choose the +paths to use. Hence the paths must already be in the map, and +conversely must be in the fib_entry's load-balance (LB). In other +words, to use backup paths with PIC, the fib_entry's LB must include +the backup paths, and the path-lists LBM must map from the backups to +the primaries. + +This is change that is reasonably easy w.r.t. to knowing what to +change, but hard to get right and hard to test. + + +Loop Free Alternate Paths +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Contrary to the BGP approach for path backups, an IGP could install a +loop free alternate (LFA) path to achieve fast re-route (FRR). + +Because of the way the LFA paths are calculated by the IGP an LFA backup +path is always paired with a primary. VPP FIB does not support this +primary-backup pair relationship. + +In intent of LFA FRR is/was to get below the magic 50ms mark. To do +this the expectation is/was that one would need in the forwarding +graph an object that represents a path's state. This object would be +checked for each packet being sent. If the path is up, the graph (an +adjacency since it's the IGP) for the primary path is taken, if it's +down the graph for the backup is taken. When a path goes down only +this indirection object needs to be updated to affect all +routes. Naturally, the indirection would incur a performance cost, but +we know that there are many performance-convergence trade-offs in a +FIB design. + +Should VPP's FIB support this feature? It all depends on the +50ms. LFA FRR comes from the era when routers ran on lower performance +CPUs and interface down was an interrupt. VPP typically has plenty of +gas but runs as a user space process. So, can it update all routes in +under 50ms on a meaty CPU and can the OS deliver the interface down +within the time requirements? I don't have the answers to either +question. + + +Extranets for Multicast +^^^^^^^^^^^^^^^^^^^^^^^ + +When a unicast prefix is present in two different tables, then it +refers to a different set of devices. When the prefix is imported it +refers to the same set of devices. If the set of paths to reach the +prefix is different in the import and export table, it doesn't matter, +since they both refer to the same devices, so either set can be +used. Therefore, FIB's usual source preference rules can apply. The +'import' source is lower priority. + +When a multicast prefix is present in two different tables, then it's +two different flows referring to two different set of receivers. When +the prefix is imported, then it refers to the same flow and two +different sets of receivers. In other words, the receiver set in the +import table needs to be the super set of receivers. + +There are two ways one might consider doing this; merging the +path-lists or replicating the packet first into each table. + + +Collapsing +^^^^^^^^^^ + +Read :ref:`fastconvergence` + +Collapsing the DPO graph for recursive routes doesn't have to be an +all or nothing. Easy cases: + + +- A recursive prefix with only one path and a path-list that is not + popular, could stack directly on the LB of the via entry. +- A recursive prefix with only multiple paths and a path-list that is not + popular, could construct a new load balance using the choices + present in each bucket of its via entries. The choices in the new LB + though would need to reflect the relative weighting. + + +The condition of an non-popular path-list means that the LB doesn't +have an LB map and hence it needs to be updated for convergence to +occur. + +The more difficult cases come when the recursive prefix has labels +which need to be stack on the via entries' choices. + +You might also envision a global configuration that always collapses all +chains, which could be used in deployments where convergence is not a +priority. diff --git a/docs/gettingstarted/developers/fib20/mplsfib.rst b/docs/gettingstarted/developers/fib20/mplsfib.rst index 6b14732214b..f66724ea81b 100644 --- a/docs/gettingstarted/developers/fib20/mplsfib.rst +++ b/docs/gettingstarted/developers/fib20/mplsfib.rst @@ -12,6 +12,31 @@ table. Whereas for IPv4 this is an mtrie and for IPv6 a hash table, for MPLS it is a flat array indexed by a 21 bit key (label & EOS bit). This implementation is chosen to favour packet forwarding speed. +It can be the case in MPLS forwarding that packets received with the +EOS bit set in the MPLS label need to be forwarded differently from +those without. The most common example of this is if the path set +contains a path that does not have an output label. In this case the +non-EOS packets cannot take this path, because to do so would expose +the neighbouring router to a label that it did not allocate. + +The desgin choice to make with an MPLS FIB table is therefore: +- 20 bit key: label only. When the EOS and non-EOS actions differ the result is a 'EOS-choice' object. +- 21 bit key: label and EOS-bit. The result is then the specific action based on EOS-bit. + +20 bit key + - Advantages:lower memory overhead, since there are few DB entries. + - Disadvantages: slower DP performance in the case the path-lists + differ, as more objects are encounterd in the switch path + +21 bit key + - Advantages: faster DP performance + Disadvantages: increased memory footprint. + +Switching between schemes based on observed/measured action similarity +is not considered on the grounds of complexity and flip-flopping. + +VPP mantra - favour performance over memory. We choose a 21 bit key. + Basics ^^^^^^ @@ -48,7 +73,7 @@ Entries in the MPLS FIB can be displayed with: There is a tight coupling between IP and MPLS forwarding. MPLS forwarding equivalence classes (FECs) are often an IP prefix – that is to say that traffic matching a given IP prefix is routed into a MPLS -label switch path (LSP). It is thus necessary to be able to associated +label switch path (LSP). It is thus necessary to be able to associate a given prefix/route with an [out-going] MPLS label that will be imposed when the packet is forwarded. This is configured as: @@ -78,7 +103,7 @@ packets to that local-label forwarded equivalently to the prefix do; In the API this action is called a ‘bind’. The router receiving the MPLS encapsulated packets needs to be programmed with actions associated which each label value – this is -the role of the MPLS FIB. The MPLS FIB Is a table, whose key is the +the role of the MPLS FIB. The MPLS FIB is a table, whose key is the MPLS label value and end-of-stack (EOS) bit, which stores the action to perform on packets with matching encapsulation. Currently supported actions are: @@ -107,7 +132,7 @@ and IP lookup only to an eos packet. MPLS VPN ^^^^^^^^ -To configure an MPLS VPN for a PE the follow example can be used. +To configure an MPLS VPN for a PE the following example can be used. Step 1; Configure routes to the iBGP peers - note these route MUST have out-going labels; diff --git a/docs/gettingstarted/developers/fib20/multicast.rst b/docs/gettingstarted/developers/fib20/multicast.rst index 15bcddfec46..7cffba81666 100644 --- a/docs/gettingstarted/developers/fib20/multicast.rst +++ b/docs/gettingstarted/developers/fib20/multicast.rst @@ -3,22 +3,40 @@ IP Multicast FIB ---------------- -Basics -^^^^^^ +The two principal differences between multicast and unicast forwarding +are: + +* there is no load-balancing among paths, there is only replication + across paths. +* multicast forwarding has an explicit reverse path forwarding (RPF) + check. It will only forward a packet if it arrives from a peer for + which it has been explicitly configured to accept. + +The other factor that influences the design of the mFIB is that the +match criteria (the prefix) is different. For multicast it is +necessary to be able to match on source and destination/group +addresses (termed an (S,G)) and only on a destination prefix (a (\*, +G/m)). This prefix is much bigger than a unicast prefix, and since +unicast scale is almost always greater than multicast scale, it is not +a good idea to have a single definition of a prefix. Therefore, +there is a fib_prefix_t (and hence a fib_entry_t) and an +mfib_prefix_t (and hence a mfib_entry_t). + +The fib_path_t and fib_path_list_t are reused. A path can represent +either a peer from which to accept packets or a peer to which to send +packets. A path-extension is added to the fib_path_t/mfib_entry_t to +describe the role the path plays. Logically the path-list is split +into two sets; an accepting set and a forwarding set. The forwarding set +contributes a replicate DPO for forwarding and the accepting set +contributes a list of interfaces (an mfib_itf_t) for the RPF check. An IP multicast FIB (mFIB) is a data-structure that holds entries that -represent a (S,G) or a (\*,G) multicast group. There is one IPv4 and +represent a (S,G) or a (\*,G/m) multicast group. There is one IPv4 and one IPv6 mFIB per IP table, i.e. each time the user calls 'ip[6] table add X' an mFIB is created. - -A path describes either where a packet is sent to or where a packet is -received from. mFIB entries maintain two sets of 'paths'; the -forwarding set and the accepting set. Each path in the forwarding set -will output a replica of a received packet. A received packet is only -accepted for forwarding if it ingresses on a path that matches in the -accepting set - this is the RPF check. - +Usage +^^^^^ To add an entry to the default mFIB for the group (1.1.1.1, 239.1.1.1) that will replicate packets to GigEthernet0/0/0 and GigEthernet0/0/1, do: diff --git a/docs/gettingstarted/developers/fib20/neighbors.rst b/docs/gettingstarted/developers/fib20/neighbors.rst new file mode 100644 index 00000000000..f460955239c --- /dev/null +++ b/docs/gettingstarted/developers/fib20/neighbors.rst @@ -0,0 +1,88 @@ +.. _neighbors: + +Neighbours +^^^^^^^^^^^ + +.. figure:: /_images/ip-neighbor.png + +Figure 1: Neighbour data model + +Figure 1 shows the data model for IP neighbours. An IP neighbour contains the mapping +between a peer, identified by an IPv4 or IPv6 address, and its MAC address on a given +interface. An IP-table (VRF) is not part of the neighbour's +data/identity. This is because the virtualisation of a router into +different tables (VRFs) is performed at the interface level, i.e. an +IP-table is bound to a particular interface. A neighbour, which is +attached to an interface, is thus implicitly in that table, and +only in that table. It is also worth noting that IP neighbours +contribute forwarding for the egress direction, whereas an IP-table +is an ingress only function. + +The *ip_neighbor_t* represents the control-plane addition of the +neighbour. The *ip_adjacency_t* contains the data derived from the *ip_neighbor_t* that is needed to +forward packets to the peer. The additional data in the adjacency are the *rewrite* +and the *link_type*. The *link_type* is a description of the protocol of the packets +that will be forwarded with this adjacency; e.g. IPv4, IPv6 or MPLS. The *link_type* +maps directly to the ether-type in an Ethernet header, or the protocol filed in a +GRE header. The rewrite is a byte string representation of the header that will be +prepended to the packet when it is sent to that peer. For Ethernet interfaces this +is be the src,dst MAC and the ether-type. For LISP tunnels, the IP src,dst pair +and the LISP header. + +The *ip_neighbor_t* for an IPv4 peer (learned e.g. over ARP) will +install a *link_type=IPv4* when the entry is created and a +link_type=MPLS on demand (i.e. when a route with output labels resolves via the peer). + +Adjacency +--------- + +There are three sub-types of adjacencies. Purists would argue that some +of these sub-types are not really adjacencies but are instead other +forms of DPOs, and it would be hard to argue against that, but +historically (not just in VPP, but in the FIB implementations from +which VPP draws on for some of its concepts), these have been modelled +as adjacency types, the one thing they have in common is that they +have an associated interface and are terminal. The [sub] sub-types are: + +* A Neighbour Adjacency (key={interface, next-hop, link-type}). A + representation of a peer on a link (as described above). A neighbour adjacency itself has + two sub-types; terminal and mid-chain. When one speak of 'an + adjacency' one is usually referring to a terminal neighbour + sub-type. A mid-chain adjacency represents a neighbor on a virtual + interface which relies on the FIB to perform further forwarding. This + adjacency is thus not terminal for the FIB object graph but instead + appears in the 'middle' (the term chain is a synonym for graph in + some contexts). + A neighbour adjacency can be in one of two states; complete and + incomplete. A complete adjacency knows the rewrite string that + should be used to reach the peer, an incomplete adjacency does + not. If the adjacency was added as a result of the addition of an + *ip_neighbor_t* then the adjacency will be complete (because the + *ip_neighbor_t* knows the peer's MAC address). An incomplete + adjacency is created on demand by the FIB when a route's path + requires to resolve through such an adjacency. It is thus created in + order to resolve the missing dependency, it will become complete + once the *ip_neighbor_t* is discovered. + In the forwarding path a complete adjacency will prepend the rewrite + string and transmit on the egress interface, an incomplete adjacency + will construct a ARP/ND request to resolve the peer's IP address. + +* A Glean Adjacency (key={interface}). This is a representation of the need to discover + a peer on the given interface. It is used when it is known that the + packet is destined to an undiscoverd peer on that interface. The + difference between the glean adjacency and an + incomplete neighbour adjacency is that in the forwarding path the + glean adjacency will construct an ARP/ND request for the peer as + determined from the packet's destination address. The glean + adjacency is used to resolve connected prefixes on multi-access + interfaces. + +* A Multicast Adjacency (key={interface}). This represents the need to send an IP + multicast packet out of the adjacency's associated interface. Since + IP multicast constructs the destination MAC address from the IP + packet's destination/group address, the rewrite is always known and + hence the adjacency is always complete. + + +All adjacency types can be shared between routes, hence each type is +stored in a DB whose key is appropriate for the type. diff --git a/docs/gettingstarted/developers/fib20/routes.rst b/docs/gettingstarted/developers/fib20/routes.rst index 1ee09ced448..313a86c3af4 100644 --- a/docs/gettingstarted/developers/fib20/routes.rst +++ b/docs/gettingstarted/developers/fib20/routes.rst @@ -3,37 +3,109 @@ Routes ^^^^^^ -The control plane will install a route in a table for a prefix via a list of paths. -The prime function of the FIB is to *resolve* that route. To resolve a route is to -construct an object graph that fully describes all elements of the route. In Figure 3 -the route is resolved as the graph is complete from *fib_entry_t* to *ip_adjacency_t*. +Basics +------ -In some routing models a VRF will consist of a set of tables for IPv4 and IPv6, and -unicast and multicast. In VPP there is no such grouping. Each table is distinct from -each other. A table is identified by its numerical ID. The ID range is separate for -each address family. +The anatomy of a route is crucial to understand: -A table is comprised of two route data-bases; forwarding and non-forwarding. The +.. code-block:: console + + 1.1.1.0/24 via 10.0.0.1 eth0 + +A route is composed of two parts; **what** to match against and **how** to forward +the matched packets. In the above example we want to match packets +whose destination IP address is in the 1.1.1.0/24 subnet and then we +want to forward those packet to 10.0.0.1 on interface eth0. We +therefore want to match the **prefix** 1.1.1.0/24 and forward on the +**path** to 10.0.0.1, eth0. + +Matching on a prefix is the particular task of the IP FIB, matching on +other packet attributes is done by other subsystems, e.g. matching on +MPLS labels in the MPLS-FIB, or matching on a tuple in ACL based +forwarding (ABF), 'matching' on all packets that arrive on an L3 +interface (l3XC). Although these subsystems match on different +properties, they share the infrastructure on **how** to forward +matched packets, that is they share the **paths**. The FIB paths (or +really the path-list) thus provide services to clients, this service +is to **contribute** forwarding, this, in terms that will be made +clear in later sections, is to provide the DPO to use. + +The prime function of the FIB is to *resolve* the paths for a +route. To resolve a route is to construct an object graph that fully +describes how to forward matching packets. This means that the graph +must terminate with an object (the leaf node) that describes how +to send a packet on an interface [#f1]_, i.e what encap to add to the +packet and what interface to send it to; this is the purpose of the IP +adjacency object. In Figure 3 the route is resolved as the graph is +complete from *fib_entry_t* to *ip_adjacency_t*. + + +Thread Model +^^^^^^^^^^^^ + +The FIB is not thread safe. All actions on the FIB are expected to +occur exclusively in the main thread. However, the data-structures +that FIB updates to add routes are thread safe, +w.r.t. addition/deletion and read, therefore routes can be added +without holding the worker thread barrier lock. + + +Tables +------ + +An IP FIB is a set of prefixes against which to match; it is +sub-address family (SAFI) specific (i.e. there is one for ipv4 and ipv6, unicast +and multicast). An IP Table is address family (AFI) specific (i.e. the +'table' includes the unicast and multicast FIB). + +Each FIB is identified by the SAFI and instance number (the [pool] +index), each table is identified by the AFI and ID. The table's ID is +assigned by the user when the table is constructed. Table ID 0 is +reserved for the global/default table. + +In most routing models a VRF is composed of an IPv4 and IPv6 table, +however, VPP has no construct to model this association, it deals only +with tables and FIBs. + +A unicast FIB is comprised of two route data-bases; forwarding and non-forwarding. The forwarding data-base contains routes against which a packet will perform a longest prefix match (LPM) in the data-plane. The non-forwarding DB contains all the routes -with which VPP has been programmed some of these routes may be unresolved for reasons -that prevent their insertion into the forwarding DB -(see section: Adjacency source FIB entries). +with which VPP has been programmed. Some of these routes may be +unresolved, preventing their insertion into the forwarding DB. +(see section: Adjacency source FIB entries). + +Model +----- The route data is decomposed into three parts; entry, path-list and paths; -* The *fib_entry_t*, which contains the routes prefix, is representation of that prefix's entry in the FIB table. -* The *fib_path_t* is a description of where to send the packets destined to the route's prefix. There are several types of path. +* The *fib_entry_t*, which contains the route's prefix, is the representation of that prefix's entry in the FIB table. +* The *fib_path_t* is a description of where to send the packets destined to the route's prefix. There are several types of path, including: * Attached next-hop: the path is described with an interface and a next-hop. The next-hop is in the same sub-net as the router's own address on that interface, hence the peer is considered to be *attached* - * Attached: the path is described only by an interface. All address covered by the prefix are on the same L2 segment to which that router's interface is attached. This means it is possible to ARP for any address covered by the prefix which is usually not the case (hence the proxy ARP debacle in IOS). An attached path is only appropriate for a point-to-point (P2P) interface where ARP is not required, i.e. a GRE tunnel. + * Attached: the path is described only by an interface. An + attached path means that all addresses covered by the route's + prefix are on the same L2 segment to which that router's + interface is attached. This means it is possible to ARP for any + address covered by the route's prefix. If this is not the case + then another device in that L2 segment needs to run proxy + ARP. An attached path is really only appropriate for a point-to-point + (P2P) interface where ARP is not required, i.e. a GRE tunnel. On + a p2p interface, attached and attached-nexthop paths will + resolve via a special 'auto-adjacency'. This is an adjacency + whose next-hop is the all zeros address and describes the only + peer on the link. * Recursive: The path is described only via the next-hop and table-id. - * De-aggregate: The path is described only via the special all zeros address and a table-id. This implies a subsequent lookup in the table should be performed. + * De-aggregate: The path is described only via the special all + zeros address and a table-id. This implies a subsequent lookup + in the table should be performed. -* The *fib_path_list_t* represents the list of paths from which to choose one when forwarding. The path-list is a shared object, i.e. it is the parent to multiple fib_entry_t children. In order to share any object type it is necessary for a child to search for an existing object matching its requirements. For this there must be a data-base. The key to the path-list data-base is a combined description of all of the paths it contains [#f2]_. Searching the path-list database is required with each route addition, so it is populated only with path-lists for which sharing will bring convergence benefits (see Section: :ref:`fastconvergence`). + * There are other path types, please consult the code. + +* The *fib_path_list_t* represents the list of paths from which to choose when forwarding. A path-list is a shared object, i.e. it is the parent to multiple fib_entry_t children. In order to share any object type it is necessary for a child to search for an existing object matching its requirements. For this there must be a database. The key to the path-list database is a combined description of all of the paths it contains [#f2]_. Searching the path-list database is required with each route addition, so it is populated only with path-lists for which sharing will bring convergence benefits (see Section: :ref:`fastconvergence`). .. figure:: /_images/fib20fig2.png @@ -41,7 +113,7 @@ Figure 2: Route data model class diagram Figure 2 shows an example of a route with two attached-next-hop paths. Each of these paths will *resolve* by finding the adjacency that matches the paths attributes, which -are the same as the key for the adjacency data-base [#f3]_. The *forwarding information (FI)* +are the same as the key for the adjacency database [#f3]_. The *forwarding information (FI)* is the set of adjacencies that are available for load-balancing the traffic in the data-plane. A path *contributes* an adjacency to the route's forwarding information, the path-list contributes the full forwarding information for IP packets. @@ -60,7 +132,7 @@ convergence (see section :ref:`fastconvergence`). FIB sources """"""""""" There are various entities in the system that can add routes to the FIB tables. -Each of these entities is termed a *source* When the same prefix is added by different +Each of these entities is termed a *source*. When the same prefix is added by different sources the FIB must arbitrate between them to determine which source will contribute the forwarding information. Since each source determines the forwarding information using different best path and loop prevention algorithms, it is not correct for the @@ -70,17 +142,17 @@ priority assignment [#f4]_. The FIB must maintain the information each source ha so it can be restored should that source become the best source. VPP has two *control-plane* sources; the API and the CLI the API has the higher priority. Each *source* data is represented by a *fib_entry_src_t* object of which a -*fib_entry_t* maintains a sorted vector.n A prefix is *connected* when it is -applied to a routers interface. +*fib_entry_t* maintains a sorted vector. The following configuration: .. code-block:: console - $ set interface address 192.168.1.1/24 GigabitEthernet0/8/0 + $ set interface ip address GigabitEthernet0/8/0 192.168.1.1/24 results in the addition of two FIB entries; 192.168.1.0/24 which is connected and -attached, and 192.168.1.1/32 which is connected and local (a.k.a receive or for-us). +attached, and 192.168.1.1/32 which is connected and local (a.k.a. +receive or for-us). A prefix is *connected* when it is applied to a router's interface. Both prefixes are *interface* sourced. The interface source has a high priority, so the accidental or nefarious addition of identical prefixes does not prevent the router from correctly forwarding. Packets matching a connected prefix will @@ -95,9 +167,10 @@ route, which resolves via an attached path; $ ip route add table X 10.10.10.0/24 via gre0 -as mentioned before, these are only appropriate for point-to-point links. An -attached-host prefix is covered by either an attached prefix (note that connected -prefixes are also attached). If table X is not the table to which gre0 is bound, +as mentioned before, these are only appropriate for point-to-point +links. + +If table X is not the table to which gre0 is bound, then this is the case of an attached export (see the section :ref:`attachedexport`). Adjacency source FIB entries @@ -110,7 +183,7 @@ route is of the form: $ ip route add table X 10.0.0.1/32 via 10.0.0.1 GigabitEthernet0/8/0 -It is a host prefix with a path whose next-hop address is the same. This route +This is a host prefix with a path whose next-hop address is the same host. This route highlights the distinction between the route's prefix - a description of the traffic to match - and the path - a description of where to send the matched traffic. Table X is the same table to which the interface is bound. FIB entries that are @@ -133,22 +206,11 @@ where a route maintains a dependency relationship with the route that is its les specific cover. When this cover changes (i.e. there is a new covering route) or the forwarding information of the cover is updated, then the covered route is notified. Adj-fibs that fail this cover check are not installed in the fib_table_t's forwarding -table, there are only present in the non-forwarding table. +table, they are only present in the non-forwarding table. Overlapping sub-nets are not supported, so no adj-fib has multiple paths. The control plane is expected to remove a prefix configured for an interface before the interface -changes RF. - -So while the following configuration is accepted: - -.. code-block:: console - - $ set interface address 192.168.1.1/32 GigabitEthernet0/8/0 - $ ip arp 192.168.1.2 GigabitEthernet0/8/0 dead.dead.dead - $ set interface ip table GigabitEthernet0/8/0 2 - -it does not result in the desired behaviour, where the adj-fib and connected adjacencies are -moved to table 2. +changes VRF. Recursive Routes """""""""""""""" @@ -219,17 +281,65 @@ when the loop breaks, the affected children and be updated. Output labels """"""""""""" -A route may have associated out MPLS labels [#f11]_. These are labels that are expected +A route may have associated output MPLS labels [#f11]_. These are labels that are expected to be imposed on a packet as it is forwarded. It is important to note that an MPLS -label is per-route and per-path, therefore, even though routes share paths the do not +label is per-route and per-path, therefore, even though routes share paths they do not necessarily have the same label for that path [#f12]_. A label is therefore uniquely associated to a *fib_entry_t* and associated with one of the *fib_path_t* to which it forwards. -MPLS labels are modelled via the generic concept of a *path-extension* A *fib_entry_t* -therefore has a vector of zero to many *fib_path_ext_t objects* to represent the labels +MPLS labels are modelled via the generic concept of a *path-extension*. A *fib_entry_t* +therefore has a vector of zero to many *fib_path_ext_t* objects to represent the labels with which it is configured. + +Delegates +^^^^^^^^^ + +A common software development pattern, a delegate is a means to +extend the functionality of one object through composition of +another, these other objects are called delegates. Both +**fib_entry_t** and **ip_adjacency_t** support extension via delegates. + +The FIB uses delegates to add functionality when those functions are +required by only a few objects instances rather than all of them, to +save on memory. For example, building/contributing a load-balance +object used to forward non-EOS MPLS traffic is only required for a +fib_entry_t that corresponds to a BGP peer and that peer is +advertising labeled route - there are only a few of +these. See **fib_entry_delegate.h** for a full list of delegate types. + + +Tracking +^^^^^^^^ + +A prime service FIB provides for other sub-system is the ability to +'track' the forwarding for a given next-hop. For example, a tunnel +will want to know how to forward to its destination address. It can +therefore request of the FIB to track this host-prefix and inform it +when the forwarding for that prefix changes. + +FIB tracking sources a host-prefix entry in the FIB using the 'recusive +resolution (RR)' source, it exactly the same way that a recursive path +does. If the entry did not previsouly exist, then the RR source will +inherit (and track) forwarding from its covering prefix, therefore all +packets that match this entry are forwarded in the same way as if the +entry did not exist. The tunnel that is tracking this FIB entry will +become a child dependent. The benefit to creating the entry, is that +it now exists in the FIB node graph, so all actions that happen on its +parents, are propagated to the host-prefix entry and consequently to +the tunnel. + +FIB provides a wrapper to the sourcing of the host-prefix using a +delegate attached to the entry, and the entry is RR sourced only once. +. The benefit of this aproach is that each time a new client tracks +the entry it doesn't RR source it. When an entry is sourced all its +children are updated. Thus, new clients tracking an entry is +O(n^2). With the tracker as indirection, the entry is sourced only once. + + .. rubric:: Footnotes: +.. [#f1] Or terminate in an object that transitions the packet out of + the FIB domain, e.g. a drop. .. [#f2] Optimisations .. [#f3] Note it is valid for either interface to be bound to a different table than table 1 .. [#f4] The engaged reader can see the full priority list in vnet/vnet/fib/fib_entry.h diff --git a/docs/gettingstarted/developers/fib20/scale.rst b/docs/gettingstarted/developers/fib20/scale.rst index 8cee1f1835c..4100f9aa7cf 100644 --- a/docs/gettingstarted/developers/fib20/scale.rst +++ b/docs/gettingstarted/developers/fib20/scale.rst @@ -4,20 +4,17 @@ Scale ----- The only limiting factor on FIB scale is the amount of memory -allocated to each heap the FIB uses, and there are 4: +allocated to each heap the FIB uses, and there are 2: -* The IP4 heap -* The IP6 heap * The main heap * The stats heap -IP4 Heap --------- +Main Heap +^^^^^^^^^ -The IPv4 heap is used to allocate the memory needed for the -data-structures within which the IPv4 prefixes are stored. Each -table, created by the user, i.e. with; +The main heap is used to allocate all memory needed for the FIB +data-structures. Each table, created by the user, i.e. with; .. code-block:: console @@ -36,18 +33,12 @@ To see the amount of memory consumed by the IPv4 tables use: .. code-block:: console vpp# sh ip fib mem - ipv4-VRF:0 mtrie:333056 hash:3523 - ipv4-VRF:1 mtrie:333056 hash:3523 - totals: mtrie:666112 hash:7046 all:673158 - - Mtrie Mheap Usage: total: 32.06M, used: 662.44K, free: 31.42M, trimmable: 31.09M - free chunks 3 free fastbin blks 0 - max total allocated 32.06M - no traced allocations + ipv4-VRF:0 mtrie:335744 hash:4663 + ipv4-VRF:1 mtrie:333056 hash:3499 + totals: mtrie:668800 hash:8162 all:676962 this output shows two 'empty' (i.e. no added routes) tables. Each -mtrie uses about 150k of memory, so each table about 300k. the total -heap usage statistics for the IP4 heap are shown at the end. +mtrie uses about 150k of memory, so each table about 300k. Below the output having added 1M, 2M and 4M routes respectively: @@ -58,44 +49,23 @@ Below the output having added 1M, 2M and 4M routes respectively: ipv4-VRF:0 mtrie:335744 hash:4695 totals: mtrie:335744 hash:4695 all:340439 - Mtrie Mheap Usage: total: 1.00G, used: 335.20K, free: 1023.74M, trimmable: 1023.72M - free chunks 3 free fastbin blks 0 - max total allocated 1.00G - no traced allocations - .. code-block:: console vpp# sh ip fib mem ipv4-VRF:0 mtrie:5414720 hash:41177579 totals: mtrie:5414720 hash:41177579 all:46592299 - Mtrie Mheap Usage: total: 1.00G, used: 46.87M, free: 977.19M, trimmable: 955.93M - free chunks 61 free fastbin blks 0 - max total allocated 1.00G - no traced allocations - .. code-block:: console vpp# sh ip fib mem ipv4-VRF:0 mtrie:22452608 hash:168544508 totals: mtrie:22452608 hash:168544508 all:190997116 - Mtrie Mheap Usage: total: 1.00G, used: 198.37M, free: 825.69M, trimmable: 748.24M - free chunks 219 free fastbin blks 0 - max total allocated 1.00G - no traced allocations - -VPP was started with a 1G IP4 heap. -IP6 Heap --------- - -The IPv6 heap is used to allocate the memory needed for the -data-structure within which the IPv6 prefixes are stored. IPv6 also -has the concept of forwarding and non-forwarding entries, however for -IPv6 all the forwarding entries are stored in a single hash table -(same goes for the non-forwarding). The key to the hash table includes -the IPv6 table-id. +IPv6 also has the concept of forwarding and non-forwarding entries, +however for IPv6 all the forwarding entries are stored in a single +hash table (same goes for the non-forwarding). The key to the hash +table includes the IPv6 table-id. To see the amount of memory consumed by the IPv4 tables use: @@ -191,14 +161,10 @@ and 1M: arena: base 7fedba514000, next 3882740 used 59254592 b (56 Mbytes) of 1073741824 b (1024 Mbytes) -as can be seen from the output the IPv6 heap in this case was scaled +as can be seen from the output the IPv6 hash-table in this case was scaled to 1GB and 1million prefixes has used 56MB of it. - -Main Heap ---------- - -The main heap is used to allocate objects that represent the FIB +The main heap is also used to allocate objects that represent the FIB entries in the control and data plane (see :ref:`controlplane` and :ref:`dataplane`) such as *fib_entry_t* and *load_balance_t*. These come from the main heap because they are not protocol specific @@ -263,7 +229,7 @@ requires will increase. Stats Heap ----------- +^^^^^^^^^^ VPP collects statistics for each route. For each route VPP collects byte and packet counters for packets sent to the prefix (i.e. the @@ -279,5 +245,3 @@ Below shows the size of the stats segment with 1M, 2M and 4M routes. total: 1023.99M, used: 234.14M, free: 789.85M, trimmable: 668.15M total: 1023.99M, used: 456.83M, free: 567.17M, trimmable: 388.91M -VPP was started with a 1G stats heap. - diff --git a/docs/gettingstarted/developers/fib20/thedatamodel.rst b/docs/gettingstarted/developers/fib20/thedatamodel.rst index c5b31500221..cd3de179814 100644 --- a/docs/gettingstarted/developers/fib20/thedatamodel.rst +++ b/docs/gettingstarted/developers/fib20/thedatamodel.rst @@ -6,7 +6,7 @@ The Data Model The FIB data model comprises two parts; the control-plane (CP) and the data-plane (DP). The CP data model represents the data that is programmed into VPP by the upper layers. The DP model represents how VPP derives actions to be performed on -packets are they are switched. +packets as they are switched. .. toctree:: diff --git a/docs/gettingstarted/developers/fib20/tunnels.rst b/docs/gettingstarted/developers/fib20/tunnels.rst index c96b1337d4e..f53d6223589 100644 --- a/docs/gettingstarted/developers/fib20/tunnels.rst +++ b/docs/gettingstarted/developers/fib20/tunnels.rst @@ -1,12 +1,12 @@ .. _tunnels: Tunnels ---------- +------- Tunnels share a similar property to recursive routes in that after applying the tunnel encapsulation, a new packet must be forwarded, i.e. forwarding is recursive. However, as with recursive routes the tunnel's destination is known -beforehand, so the recursive switch can be avoided if the packet can follow the +beforehand, so the second lookup can be avoided if the packet can follow the already constructed data-plane graph for the tunnel's destination. This process of joining to DP graphs together is termed *stacking*. @@ -18,7 +18,7 @@ Figure 11 shows the control plane object graph for a route via a tunnel. The two sub-graphs for the route via the tunnel and the route for the tunnel's destination are shown to the right and left respectively. The red line shows the relationship form by stacking the two sub-graphs. The adjacency on the tunnel -interface is termed a 'mid-chain' this it is now present in the middle of the +interface is termed a 'mid-chain' since it is now present in the middle of the graph/chain rather than its usual terminal location. The mid-chain adjacency is contributed by the gre_tunnel_t , which also becomes @@ -27,7 +27,36 @@ back-walk when the forwarding information for the tunnel's destination changes. This will trigger it to restack the mid-chain adjacency on the new *load_balance_t* contributed by the parent *fib_entry_t*. -If the back-walk indicates that there is no route to the tunnel, or that the -route does not meet resolution constraints, then the tunnel can be marked as -down, and fast convergence can be triggered in the same way as for physical -interfaces (see section ...). +If the back-walk indicates that there is no route to the tunnel's +destination, or that the resolving route does not meet resolution +constraints, then the tunnel can be marked as down, and fast +convergence can be triggered in the same way as for physical interfaces (see section ...). + + +Multi-Point Tunnels +^^^^^^^^^^^^^^^^^^^ + +Multi-point tunnels are an example of a non-broadcast multi-access +interface. In simple terms this means there are many peers on the link +but it is not possible to broadcast a single message to all of them at +once, and hence the usual peer discovery mechanism (as employed, +e.g. by ARP) is not available. Although an *ip_neighbor_t* is a +representation of an IP peer on a link, it is not valid in this +context as it maps the peer's identity to its MAC address. For a +tunnel peer it is required to map the peer's overlay address (the +attached address, the one in the same subnet as the device) with the +peer's underlay address (probably on the other side of the +internet). In the P2P case where there is only one peer on the link, +the peer's underlay address is the same as the tunnel's destination +address. +The data structure that represents the mapping of the peer's overlay +with underlay address is an entry in the Tunnel Endpoint Information +Base (TEIB); the *tieb_entry_t*. TEIB entries are created by the +control plane (e.g. NHRP (RFC2332)). + +Each mid-chain adjacency on a multi-point tunnel is stacked on the +*fib_entry_t* object that resolves the peer's underlay address. The +glean adjacency on the tunnel resolves via a drop, since broadcasts +are not possible. A multicast adjacency on a multi-point tunnel is +currently a work in progress. + |