path: root/src/vnet/devices
Age | Commit message | Author | Files | Lines
2020-04-10 | virtio: fix gso and csum offload errors handling | Mohsin Kazmi | 1 | -45/+82
GSO and CSUM offloaded packets are transmitted even itf doesn't support GSO/CSUM. This patch fixes it by logging the respective errors and dropping the packets. Type: fix Change-Id: I5ab19d15ce6aa9fda515313c313a5a56c0b96837 Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2020-04-08 | virtio: fix the tcp/udp checksum offloads | Mohsin Kazmi | 2 | -28/+0
Some vhost-backend calculates the wrong checksum in case of tcp/udp offload when driver resets tcp/udp checksum field to '0'. Type: fix Change-Id: I1d2a9b95b3d5cc1decac38027104a04df2af4680 Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2020-03-31 | vlib: move pci api types from vnet/pci to vlib/pci | Jakub Grajciar | 2 | -2/+2
Type: fix Signed-off-by: Jakub Grajciar <jgrajcia@cisco.com> Change-Id: I1a60809a8bbbbb8ac8b65ab990d51aae1229647f Signed-off-by: Jakub Grajciar <jgrajcia@cisco.com>
2020-03-30 | gso: fix the header parser to read only | Mohsin Kazmi | 2 | -4/+44
Previously, header parser sets the tcp/udp checksum to 0. It should be read only function for vlib_buffer_t. Type: fix Change-Id: I9c3398372f22998da3df188f0b7db13748303068 Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2020-03-23 | virtio: vhost gso checksum error when both indirect and mrg_rxbuf are off | Steven Luong | 1 | -20/+19
Turn on gso, turn off both indirect and mrg_rxbuf caused traffic received and sent with checksum error. The problem is we are not mapping the hdr correctly in the shared memory address. Type: fix Signed-off-by: Steven Luong <sluong@cisco.com> Change-Id: I7ef3bc2755544167b0e624365988111b17399e89
2020-03-23 | tap: fix the numa/queue for buffers | Mohsin Kazmi | 1 | -1/+1
Type: fix Change-Id: Ib320171708bebde6d1dae0b2c665f9bcfc9102db Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2020-03-23 | virtio: improve error handling | Mohsin Kazmi | 1 | -9/+30
Type: improvement Change-Id: I134465760272ceb29f85486cba838d8687696bbf Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2020-03-21 | virtio: fix link up/down flag | Mohsin Kazmi | 1 | -3/+9
Type: fix "set int state <interface> down" puts the virtio device link down. It will not put the link in "UP" state, when "set int state <interface up>" will be used again to change the interface admin up. This patch fixes it. To test: create tap set int state tap0 up set int state tap0 down sh hardware sh int set int state tap0 up sh int sh hardware Change-Id: I3c0e31539f8a2a1e40220e7fb57eedecf408f067 Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2020-03-21 | virtio: fix the out of order descriptors in tx | Mohsin Kazmi | 4 | -8/+102
Type: fix Some vhost-backends give used descriptors back in out-of-order. This patch fixes the native virtio to handle out-of-order descriptors. Change-Id: I57323303349f6a385e412ee22772ab979ae8edbf Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2020-03-13 | devices: netlink create the object if missing | Dave Barach | 1 | -4/+4
Type: fix Fixes: b49bc1a Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com> Signed-off-by: Dave Barach <dave@barachs.net> Change-Id: I3dd81a2484c8b4925fd07556576c29d1cde337e1
2020-03-05 | tap: add support for persistance | Mohsin Kazmi | 7 | -43/+117
Type: feature Change-Id: I775f53531972447ebae0d69b9e2dfeee84d115e5 Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2020-03-02 | virtio: fix the coverity warning | Mohsin Kazmi | 1 | -4/+1
Type: fix Change-Id: Ia75edb74eb7c746dd4c66bdbff75efb949575ce4 Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2020-02-26 | api: improve api string safety | Jakub Grajciar | 1 | -1/+1
- Remove vl_api_from_api_string to prevent use of not nul-terminated strings. - Rename vl_api_from_api_to_vec -> vl_api_from_api_to_new_vec to imply a new vector is created. NOT nul terminated. - Add vl_api_from_api_to_new_c_string. Returns nul terminated string in a new vector. - Add vl_api_c_string_to_api_string. Convert nul terminated string to vl_api_string_t - Add vl_api_vec_to_api_string. Convert NON nul terminated vector to vl_api_string_t Type: fix Signed-off-by: Jakub Grajciar <jgrajcia@cisco.com> Change-Id: Iadd59b612c0d960a34ad0dd07a9d17f56435c6ea Signed-off-by: Jakub Grajciar <jgrajcia@cisco.com>
2020-02-18 | devices: netlink: add more error logging | Mohsin Kazmi | 1 | -10/+50
Type: improvement Change-Id: I4d8ca04840845e1ba631e4260e155df2486155e6 Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2020-02-15 | tap: fix the default parameter for num_rx_queues | Mohsin Kazmi | 1 | -9/+3
Type: fix Change-Id: I1a20fea56f1ba1fada7c7ce96ea333bf097b1273 Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2020-02-04 | virtio: update FEATURE.yaml to include description for vhost-user | Steven Luong | 1 | -3/+7
Add features supported by vhost-user Type: docs Signed-off-by: Steven Luong <sluong@cisco.com> Change-Id: Iba4c5244c40324b603e2803ade8ecc0816326de8
2020-02-03 | virtio: vhost gso is broken in some topology | Steven Luong | 1 | -2/+8
Recent modification added a call to vnet_gso_header_offset_parser in the beginning of vhost_user_handle_tx_offload. The former routine may set tcp or udp->checksum to 0. While it is appropriate to set it to 0 for the GSO packet, it is broken and causes checksum error if the aformentiooned routine is called by a non-GSO packet. The fix is to not call vhost_user_handle_tx_offload if the buffer does not indicate checksum offload is needed. Type: fix Signed-off-by: Steven Luong <sluong@cisco.com> Change-Id: I6e699d7a40b7887ff149cd8f77e8f0fa9374ef19
2020-01-30 | misc: deprecate netmap and ixge drivers | Damjan Marion | 10 | -2148/+0
Both are out of sync for long time... Type: refactor Change-Id: I7de3170d35330fc172501d87655dfef91998b8fe Signed-off-by: Damjan Marion <damarion@cisco.com>
2020-01-30 | tap: fix host mtu configuration setting | Mohsin Kazmi | 1 | -12/+13
host mtu can't be set if tap interface is in namespace. This patch fixes this issue. Type: fix Change-Id: I63811c4b56c708fe708061a8afbaec41994f08ca Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2020-01-30 | tap: fix the host mac address | Mohsin Kazmi | 3 | -20/+11
Tap configuration code sets the host mac address two time. This patch fixes it. Type: fix Change-Id: I7bebb9b7f25352a8a9a98bae6a0636757c0cea9c Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2020-01-27 | devices: vhost: fix data offset on input | Benoît Ganne | 1 | -11/+1
Regardless of whether the virtio_net_hdr is sent as a separate descriptors or in the same descriptor as the data, we always want to skip the header length - maybe moving to the next descriptor along the way. Type: fix Change-Id: Iaa70aeb310e589639b20f8c7029aaa8d3ce5d307 Signed-off-by: Benoît Ganne <bganne@cisco.com>
2020-01-10 | docs: Edit FEATURE.yaml files so they can be published | John DeNisco | 1 | -1/+1
Type: docs Signed-off-by: John DeNisco <jdenisco@cisco.com> Change-Id: I7280e5c5ad10a66c0787a5282291a2ef000bff5f
2020-01-08 | virtio: fix ip4 checksum offload | Mohsin Kazmi | 1 | -2/+14
Type: fix Change-Id: I08747ac308e5c1768a3a6aa5f83a016dc0274a1c Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2020-01-08 | tap: split gso and checksum offload functionality | Mohsin Kazmi | 5 | -19/+108
Type: refactor Change-Id: I0d4b79ef384c11c841576d264bfd8ccb21783e10 Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2020-01-08 | virtio: split gso and checksum offload functionality | Mohsin Kazmi | 9 | -44/+312
Type: refactor Change-Id: I897e36bd5db593b417c2bac9f739bc51cf45bc08 Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2020-01-03 | devices: virtio API cleanup | Jakub Grajciar | 2 | -16/+21
Use consistent API types. Type: fix Signed-off-by: Jakub Grajciar <jgrajcia@cisco.com> Change-Id: I38a409af770c88c1eb2c68b24abef2a5a91e1b9a
2020-01-02 | virtio: fix checksum offload support | Benoît Ganne | 1 | -10/+21
Checksum offload and GSO are independent. We must support checksum offload if it has been negotiated, independently of GSO. Ticket: VPPSUPP-47 Type: fix Change-Id: I8cb6dd58b61714ebb2726eb4aab0d74d49fdab99 Signed-off-by: Benoît Ganne <bganne@cisco.com>
2019-12-17 | virtio: fix the tx queue thread binding | Mohsin Kazmi | 1 | -3/+17
Type: fix Change-Id: Ibbe7e20aebc9153ceba07e048dc0eaa45193f4ea Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2019-12-11 | devices: vhost API cleanup | Jakub Grajciar | 5 | -26/+171
Use consistent API types. Type: fix Change-Id: I2dec594cb834a45004edc9ca58ad7c7b4bd7ff06 Signed-off-by: Jakub Grajciar <jgrajcia@cisco.com>
2019-12-11 | devices: tap API cleanup | Jakub Grajciar | 5 | -93/+128
Use consistent API types. Type: fix Change-Id: I11cc7f6347b7a60e5fd41e54f0c7994e2d81199f Signed-off-by: Jakub Grajciar <jgrajcia@cisco.com>
2019-12-10 | api: multiple connections per process | Dave Barach | 6 | -6/+6
Type: feature Signed-off-by: Dave Barach <dave@barachs.net> Change-Id: I2272521d6e69edcd385ef684af6dd4eea5eaa953
2019-12-06 | gso: fix the tap/virtio driver for header offset | Mohsin Kazmi | 1 | -1/+25
Type: fix Change-Id: Ied34466907fa8ad44f997c600dbf481be4d22027 Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2019-12-05 | gso: add protocol header parser | Mohsin Kazmi | 1 | -33/+10
Type: feature Change-Id: I7c6be2b96d19f82be237f6159944f3164ea512d0 Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2019-12-04 | gso: remove the interface count | Mohsin Kazmi | 4 | -17/+1
Type: refactor Change-Id: I51405b9d09fb6fb03d08569369fdd4e11c647908 Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
2019-11-25 | tap: fix coverity warning 205875 | Andrew Yourtchenko | 1 | -1/+6
check the return result from fcntl, and if error, behave the same way the expansion of _IOCTL macro does. Type: fix Change-Id: I6d537d1bdedae64470612aef64b46e07387fe84b Signed-off-by: Andrew Yourtchenko <ayourtch@gmail.com>
2019-11-20 | tap: multiqueue support | Damjan Marion | 5 | -161/+261
Type: feature Change-Id: I7dcc8c6911d02729b3bda1b3a21a211c82c3b949 Signed-off-by: Damjan Marion <damarion@cisco.com>
2019-11-20 | virtio: fix use-after-free | Benoît Ganne | 1 | -1/+1
Type: fix Change-Id: Ic67d9da65d937f56ecf994a5504c6351624b32ff Signed-off-by: Benoît Ganne <bganne@cisco.com>
2019-11-14 | virtio: refactor virtio-pci logging | Damjan Marion | 7 | -96/+143
Type: refactor Change-Id: I34306c1206b2bf5f521be6c6b78074ccf9259a08 Signed-off-by: Damjan Marion <damarion@cisco.com>
2019-11-13 | virtio: feature arc have higher priority than redirect | Damjan Marion | 1 | -3/+4
Type: fix Fixes: 8389fb9 Change-Id: Ie159eb444b28b36a7af86049b80fba4e49be93cb Signed-off-by: Damjan Marion <damarion@cisco.com>
2019-11-12 | tap: Move client registration check to top | Paul Vinciguerra | 2 | -9/+14
Type: fix Change-Id: I33dc4cf7b6c69f74c7bf4971ce59442678b878ef Signed-off-by: Paul Vinciguerra <pvinci@vinciconsulting.com>
2019-11-12 | virtio: remove unused code | Damjan Marion | 1 | -4/+0
Type: refactor Change-Id: I25f1cc3969c6a6ec1384079dc437537acd2ec152 Signed-off-by: Damjan Marion <damarion@cisco.com>
2019-11-08 | tap: add check for vhost-net backend | Damjan Marion | 1 | -0/+9
Type: feature Change-Id: I402f4c88dee70fbb0b3b61dc4e0a4034d24d8b56 Signed-off-by: Damjan Marion <damarion@cisco.com>
2019-11-08 | tap: fix cli parser | Damjan Marion | 1 | -4/+5
Type: fix Change-Id: I38ee9efd23774cce7790565825527cca9ba6f200 Signed-off-by: Damjan Marion <damarion@cisco.com>
2019-11-06 | build: add yaml file linting to make checkstyle | Paul Vinciguerra | 5 | -7/+24
Type: feature fts and trex rely on yaml config files. Verify that they are valid, so comitters can catch errors early. Change-Id: Ide0bb276659119c59bdbbc8b8155e37562a648b8 Signed-off-by: Paul Vinciguerra <pvinci@vinciconsulting.com>
2019-10-30 | docs: devices-- add FEATURES.yaml | Paul Vinciguerra | 15 | -17/+65
Type: docs Change-Id: I039ba9ad5385452b202366fba0b367506a21ea4f Signed-off-by: Paul Vinciguerra <pvinci@vinciconsulting.com>
2019-10-23 | devices: vhoost cpu->copy array overflow on tcp jumbo frame (65535 bytes) | Steven Luong | 2 | -2/+8
We reserve 40 slots in cpu->copy array prior to copy out to avoid overflowing the array. However, 40 is not enough for the jumbo frame because desceiptor buffer len is likely at 1536. Change the reserve to 200 and add ASSERT to avoid encountering the same problem in the future. Type: fix Signed-off-by: Steven Luong <sluong@cisco.com> Change-Id: Ibf0c03c4b4f33e781d5be8679ccd6c3a4b4a646d
2019-10-07 | devices: vhost not reading packets from vring | Steven Luong | 2 | -0/+25
In a rare event, after the vhost protocol message exchange has finished and the interface had been brought up successfully, the driver MAY still change its mind about the memory regions by sending new memory maps via SET_MEM_TABLE. Upon processing SET_MEM_TABLE, VPP invalidates the old memory regions and the descriptor tables. But it does not re-compute the new descriptor tables based on the new memory maps. Since VPP does not have the descriptor tables, it does not read the packets from the vring. In the normal working case, after SET_MEM_TABLE, the driver follows up with SET_VRING_ADDRESS which VPP computes the descriptor tables. The fix is to stash away the descriptor table addresses from SET_VRING_ADDRESS. Re-compute the new descriptor tables when processing SET_MEM_TABLE if descriptor table addresses are known. Type: fix Ticket: VPP-1784 Signed-off-by: Steven Luong <sluong@cisco.com> Change-Id: I3361f14c3a0372b8d07943eb6aa4b3a3f10708f9 (cherry picked from commit 61b8ba69f7a9540ed00576504528ce439f0286f5)
2019-09-25 | devices: pipe API cleanup | Jakub Grajciar | 1 | -7/+9
Use consistent API types. Type: fix Signed-off-by: Jakub Grajciar <jgrajcia@cisco.com> Change-Id: Ifd62207048d125bec18b3a728590ac540dcafe5e
2019-09-25 | fib: fix some typos in fib/mtrie | Lijian.Zhang | 1 | -1/+1
Type: fix Change-Id: I1af0e4a9bc23a3b6b6d3a74df093801ab6cae1f8 Signed-off-by: Lijian Zhang <Lijian.Zhang@arm.com>
2019-09-24 | vlib: add flag to explicitelly mark nodes which can init per-node packet trace | Damjan Marion | 4 | -0/+4
Type: feature Change-Id: I913f08383ee1c24d610c3d2aac07cef402570e2c Signed-off-by: Damjan Marion <damarion@cisco.com>
 * networks and thus recursive prefixes. There are several
 * flavours of PIC covering different locations of protection and failure
 * scenarios. An outline is given below; see the literature for more details:
 *
 *     Y/16 - CE1 -- PE1---\
 *                    | \    P1---\
 *                    |  \          PE3 -- CE3 - X/16
 *                    |   - P2---/
 *     Y/16 - CE2 -- PE2---/
 *
 * CE = customer edge, PE = provider edge. External-BGP runs between customer
 * and provider, internal-BGP runs between provider and provider.
 *
 * 1) iBGP PIC-core: consider traffic from CE1 to X/16 via CE3. On PE1 there
 *    are routes:
 *       X/16 (and hundreds of thousands of others like it)
 *         via PE3
 *    and
 *       PE3/32 (its loopback address)
 *         via 10.0.0.1 Link0 (this is P1)
 *         via 10.1.1.1 Link1 (this is P2)
 *    The failure is the loss of link0 or link1.
 *    As in all PIC scenarios, in order to provide prefix independent
 *    convergence, the route for X/16 (and all other routes via PE3) must not
 *    need to be updated in the FIB. The FIB therefore needs to update a single
 *    object that is shared by all routes - once this shared object is updated,
 *    then all routes using it will be instantly updated to use the new
 *    forwarding information. In this case the shared object is the resolving
 *    route via PE3. Once the route via PE3 is updated via IGP (OSPF)
 *    convergence, then all recursive routes that resolve through it are also
 *    updated. VPP FIB implements this scenario via a recursive-adjacency.
 *    X/16 and its sibling routes share a recursive-adjacency that links
 *    to/points at/stacks on the normal adjacency contributed by the route for
 *    PE3. Once this shared recursive adj is re-linked then all routes are
 *    switched to using the new forwarding information. This is shown below:
 *
 *    pre-failure:
 *       X/16 --> R-ADJ-1 --> ADJ-1-PE3 (multi-path via P1 and P2)
 *
 *    post-failure:
 *       X/16 --> R-ADJ-1 --> ADJ-2-PE3 (single path via P1)
 *
 *    Note that R-ADJ-1 (the recursive adj) remains in the forwarding graph,
 *    therefore X/16 (and all its siblings) is not updated.
 *    X/16 and its siblings share the recursive adj since they share the same
 *    path-list. It is the path-list object that contributes the recursive-adj
 *    (see next section for more details).
 *
 *
 * 2) iBGP PIC-edge: traffic from CE3 to Y/16. On PE3 there are routes:
 *       Y/16 (and hundreds of thousands of others like it)
 *         via PE1
 *         via PE2
 *    and
 *       PE1/32 (PE1's loopback address)
 *         via 10.0.2.2 Link0 (this is P1)
 *       PE2/32 (PE2's loopback address)
 *         via 10.0.3.3 Link1 (this is P2)
 *
 *    The failure is the loss of reachability to PE2. This could be either the
 *    loss of the link P2-PE2 or the loss of the node PE2. This is detected
 *    either by the withdrawal of PE2's loopback route or by some form of
 *    failure detection (e.g. BFD).
 *    VPP FIB again provides PIC via the use of the shared recursive-adj. Y/16
 *    and its siblings will again share a path-list for the list {PE1,PE2}; this
 *    path-list will contribute a multi-path-recursive-adj, i.e. a
 *    multi-path-adj with each choice therein being another adj:
 *
 *       Y/16 --> RM-ADJ --> ADJ1 (for PE1)
 *                       --> ADJ2 (for PE2)
 *
 *    When the route for PE2 is withdrawn, the multi-path-recursive-adjacency
 *    is updated to be:
 *
 *       Y/16 --> RM-ADJ --> ADJ1 (for PE1)
 *                       --> ADJ1 (for PE1)
 *
 *    That is, both choices in the ECMP set are the same and thus all traffic is
 *    forwarded to PE1. Eventually the control plane will download a route
 *    update for Y/16 to be via PE1 only. At that time the situation will be:
 *
 *       Y/16 --> R-ADJ --> ADJ1 (for PE1)
 *
 *    In the scenario above we assumed that PE1 and PE2 are ECMP for Y/16. eBGP
 *    PIC core is also specified for the case where one PE is primary and the
 *    other backup - VPP FIB does not support that case at this time.
 *
 * 3) eBGP PIC Edge: traffic from CE3 to Y/16.
 *    On PE1 there are routes:
 *       Y/16 (and hundreds of thousands of others like it)
 *         via CE1 (primary)
 *         via PE2 (backup)
 *    and
 *       CE1 (this is an adj-fib)
 *         via 11.0.0.1 Link0 (this is CE1) << this is an adj-fib
 *       PE2 (PE2's loopback address)
 *         via 10.0.5.5 Link1 (this is link PE1-PE2)
 *    The failure is the loss of link0 to CE1. The failure can be detected by
 *    FIB either as a link down event or by the control plane withdrawing the
 *    connected prefix on link0 (say 10.0.5.4/30). The latter works because the
 *    resolving entry is an adj-fib, so removing the connected prefix will
 *    withdraw the adj-fib, and hence the recursive path becomes unresolved.
 *    The former is faster, particularly in the case of Inter-AS option A where
 *    there are many VLAN sub-interfaces on the PE-CE link, one for each VRF,
 *    and so the control plane must remove the connected prefix for each
 *    sub-interface to trigger PIC in each VRF. Note though that total PIC
 *    cutover time will depend on VRF scale with either trigger.
 *    Primary and backup paths in this eBGP PIC-edge scenario are calculated by
 *    BGP. Each peer is configured to always advertise its best external path
 *    to its iBGP peers. Backup paths therefore send traffic from the PE back
 *    into the core to an alternate PE. A PE may have multiple external paths,
 *    i.e. multiple directly connected CEs, and it may also have multiple
 *    backup PEs. However, there is no correlation between the two, so unlike
 *    LFA-FRR, the redundancy model is N-M: N primary paths are backed up by M
 *    backup paths, and only when all primary paths fail is the cutover
 *    performed onto the M backup paths. Note that PE2 must be suitably
 *    configured to forward traffic on its external path that was received from
 *    PE1. VPP FIB does not support external-internal-BGP (eiBGP)
 *    load-balancing.
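 *
 * The N-M rule described above can be sketched in C as follows. This is an
 * illustration only and is not VPP code: the sketch_* names, the is_up flag
 * and the flow-hash modulo are all assumptions made for the example. The
 * backup set is consulted only once every primary path has failed.
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical types for illustration only; these are not VPP FIB APIs.
typedef struct {
    uint32_t adj_index; // adjacency this path forwards through
    int      is_up;     // non-zero while the path is usable
} sketch_path_t;

typedef struct {
    const sketch_path_t *primary; // N primary paths
    size_t               n_primary;
    const sketch_path_t *backup;  // M backup paths
    size_t               n_backup;
} sketch_path_set_t;

#define SKETCH_ADJ_DROP UINT32_MAX

// Pick the (flow_hash % n_up)-th usable path of one set.
// Returns SKETCH_ADJ_DROP when no path in the set is usable.
static uint32_t
sketch_pick_from(const sketch_path_t *paths, size_t n, uint32_t flow_hash)
{
    size_t n_up = 0;
    for (size_t i = 0; i < n; i++)
        n_up += paths[i].is_up ? 1 : 0;
    if (n_up == 0)
        return SKETCH_ADJ_DROP;

    size_t want = flow_hash % n_up;
    for (size_t i = 0; i < n; i++) {
        if (!paths[i].is_up)
            continue;
        if (want == 0)
            return paths[i].adj_index;
        want--;
    }
    return SKETCH_ADJ_DROP; // not reached: want < n_up
}

// N-M rule: use the backup set only when every primary path has failed.
static uint32_t
sketch_select_adj(const sketch_path_set_t *ps, uint32_t flow_hash)
{
    assert(ps != NULL);
    uint32_t adj = sketch_pick_from(ps->primary, ps->n_primary, flow_hash);
    if (adj != SKETCH_ADJ_DROP)
        return adj;
    return sketch_pick_from(ps->backup, ps->n_backup, flow_hash);
}
```
 * With two primary paths up, flows hash across both of them; a backup
 * adjacency is returned only after every primary path is marked down.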
 *
 * As with LFA-FRR, the use of primary and backup paths is not currently
 * supported. However, the use of a recursive-multi-path-adj, and a suitably
 * constrained hashing algorithm to choose from the primary or backup path sets,
 * would again provide the necessary shared object and hence the prefix scale
 * independent cutover.
 *
 * Astute readers will recognise that both of the eBGP PIC scenarios refer only
 * to a BGP free core.
 *
 * Fast convergence implementation options come in two flavours:
 * 1) Insert switches into the data-path. The switch represents the protected
 *    resource. If the switch is 'on' the primary path is taken, otherwise
 *    the backup path is taken. Testing the switch in the data-path comes with
 *    an associated performance cost. A given packet may encounter more than
 *    one protected resource as it is forwarded. This approach minimises
 *    cutover times as packets will be forwarded on the backup path as soon
 *    as the protected resource is detected to be down and the single switch
 *    is tripped. However, it comes at a performance cost, which increases
 *    with each shared resource a packet encounters in the data-path.
 *    This approach is thus best suited to LFA-FRR where the protected routes
 *    are non-recursive (i.e. encounter few shared resources) and the
 *    expectation on cutover times is more stringent (< 50 msecs).
 * 2) Update shared objects. Identify objects in the data-path that are
 *    required to be present whether or not fast convergence is required (e.g.
 *    adjacencies) and that can be shared by multiple routes. Create a
 *    dependency between these objects and the protected resource. When the
 *    protected resource fails, each of the shared objects is updated in a way
 *    that all users of it see a consistent change. This approach incurs no
 *    performance penalty as the data-path structure is unchanged; however, the
 *    cutover times are longer as more work is required when the resource fails.
 *    This scheme is thus more appropriate to recursive prefixes (where the
 *    packet will encounter multiple protected resources) and to
 *    fast-convergence technologies where the cutover times are less stringent
 *    (e.g. PIC).
 *
 * Implementation:
 * ---------------
 *
 * Due to the requirements outlined above, not all routes known to FIB
 * (e.g. adj-fibs) are installed in forwarding. However, should circumstances
 * change, those routes will need to be added. This adds the requirement that
 * a FIB maintains two tables per-VRF, per-AF (where a 'table' is indexed by
 * prefix): the forwarding and non-forwarding tables.
 *
 * For DP speed in VPP we want the lookup in the forwarding table to directly
 * result in the ADJ. So there are two tables: one contains all the routes (a
 * lookup therein yields a fib_entry_t), the other contains only the forwarding
 * routes (a lookup therein yields an ip_adjacency_t). The latter is used by the
 * DP.
 * This trades memory for forwarding performance, a good trade-off in VPP's
 * expected operating environments.
 *
 * Note these tables are keyed only by the prefix (and since there are two
 * per-VRF, implicitly by the VRF too). The key for an adjacency is the
 * tuple: {next-hop address (and its AF), interface, link/ether-type}.
 * Consider this curious, but allowed, config:
 *
 *   set int ip addr 10.0.0.1/24 Gig0
 *   set ip arp Gig0 10.0.0.2 dead.dead.dead
 *   # a host in that sub-net is routed via a better next hop (say it avoids a
 *   # big L2 domain)
 *   ip route add 10.0.0.2 Gig1 192.168.1.1
 *   # this recursive should go via Gig1
 *   ip route add 1.1.1.1/32 via 10.0.0.2
 *   # this non-recursive should go via Gig0
 *   ip route add 2.2.2.2/32 via Gig0 10.0.0.2
 *
 * For the last route, the lookup for the path (via {Gig0, 10.0.0.2}) in the
 * prefix table would not yield the correct result. To fix this we need a
 * separate table for the adjacencies.
 *
 * - FIB data structures;
 *
 * fib_entry_t:
 *   - a representation of a route.
 *   - has a prefix.
 *   - it maintains an array of path-lists that have been contributed by the
 *     different sources
 *   - installs an adjacency in the forwarding table contributed by the best
 *     source's path-list.
 *
 * fib_path_list_t:
 *   - a list of paths
 *   - path-lists may be shared between FIB entries. The path-lists are thus
 *     kept in a DB. The key is the combined description of the paths. We share
 *     path-lists when it will aid convergence to do so. Adding path-lists to
 *     this DB that are never shared, or that are shared only by prefixes not
 *     subject to PIC, will increase the size of the DB unnecessarily and
 *     may lead to increased search times due to hash collisions.
 *   - the path-list contributes the appropriate adj for the entry in the
 *     forwarding table. The adj can be 'normal', multi-path or recursive,
 *     depending on the number of paths and their types.
 *   - since path-lists are shared there is only one instance of the multi-path
 *     adj that they [may] create. As such multi-path adjacencies do not need a
 *     separate DB.
 * The path-list with recursive paths and the recursive adjacency that it
 * contributes form the backbone of the fast convergence architecture (as
 * described previously).
 *
 * fib_path_t:
 *   - a description of how to forward the traffic (e.g. via {Gig1, K}).
 *   - the path describes the intent on how to forward. This differs from how
 *     the path resolves, i.e. it might not be resolved at all (since the
 *     interface is deleted or down).
 *   - paths have different types, most notably recursive or non-recursive.
 *   - a fib_path_t will contribute the appropriate adjacency object. It is from
 *     these contributions that the DP graph/chain for the route is built.
 *   - if the path is recursive and a recursion loop is detected, then the path
 *     will contribute the special DROP adjacency. This way, whilst the control
 *     plane graph is looped, the data-plane graph is not.
 *
 * We build a graph of these objects:
 *
 *   fib_entry_t -> fib_path_list_t -> fib_path_t -> ...
 *
 * for recursive paths:
 *
 *   fib_path_t -> fib_entry_t -> ....
 *
 * for non-recursive paths:
 *
 *   fib_path_t -> ip_adjacency_t -> interface
 *
 * These objects, which constitute the 'control plane' part of the FIB, are used
 * to represent the resolution of a route. As a whole this is referred to as the
 * control plane graph. There is a separate DP graph to represent the forwarding
 * of a packet. In the DP graph each object represents an action that is applied
 * to a packet as it traverses the graph. For example, a lookup of an IP address
 * in the forwarding table could result in the following graph:
 *
 *   recursive-adj --> multi-path-adj --> interface_A
 *                                    --> interface_B
 *
 * A packet traversing this FIB DP graph would thus also traverse a VPP node
 * graph of:
 *
 *   ipX_recursive --> ipX_rewrite --> interface_A_tx --> etc
 *
 * The taxonomy of objects in a FIB graph is as follows. Consider:
 *
 *   A -->
 *   B --> D
 *   C -->
 *
 * where A, B and C are (for example) routes that resolve through D.
 *   parent:   D is the parent of A, B, and C.
 *   children: A, B, and C are children of D.
 *   sibling:  A, B and C are siblings of one another.
 *
 * All shared objects in the FIB are reference counted. Users of these objects
 * are thus expected to use the add_lock/unlock semantics (as one would
 * normally use malloc/free).
 *
 * WALKS
 *
 * It is necessary to walk/traverse the graph forwards (entry to interface) to
 * perform a collapse or build a recursive adj and backwards (interface
 * to entry) to perform updates, i.e. when interface state changes or when
 * recursive route resolution updates occur.
 * A forward walk simply follows an object's parent pointer to access its
 * parent object. For objects with multiple parents (e.g. a path-list), each
 * parent is walked in turn.
 * To support back-walks direct dependencies are maintained between objects,
 * i.e. in the relationship {A, B, C} --> D, object D will maintain a list
 * of 'pointers' to its children {A, B, C}.
 * Bare C-language pointers are not allowed, so a pointer is described in terms
 * of an object type (e.g. entry, path-list, etc.) and index - this allows the
 * object to be retrieved from the appropriate pool. A list is maintained to
 * achieve fast convergence at scale.
 * When there are millions of recursive prefixes, it is very inefficient to
 * blindly walk the tables looking for entries that were affected by a given
 * topology change. The lowest hanging fruit when optimising is to remove
 * actions that are not required, so all back-walks only traverse objects that
 * are directly affected by the change.
 *
 * PIC Core and fast-reroute rely on FIB reacting quickly to an interface
 * state change to update the multi-path-adjacencies that use this interface.
 * An example graph is shown below:
 *
 *    E_a -->
 *    E_b --> PL_2 --> P_a --> Interface_A
 *    ...          --> P_c -\
 *    E_k -->                \
 *                            Interface_K
 *                           /
 *    E_l -->               /
 *    E_m --> PL_1 --> P_d -/
 *    ...          --> P_f --> Interface_F
 *    E_z -->
 *
 * E  = fib_entry_t
 * PL = fib_path_list_t
 * P  = fib_path_t
 * The subscripts are arbitrary and serve only to distinguish object instances.
 * This CP graph results in the following DP graph:
 *
 *    M-ADJ-2 --> Interface_A
 *            \
 *             -> Interface_K
 *            /
 *    M-ADJ-1 --> Interface_F
 *
 * M-ADJ = multi-path-adjacency.
 *
 * When interface K goes down a back-walk is started over its dependants in the
 * control plane graph. This back-walk will reach PL_1 and PL_2 and result in
 * the calculation of new adjacencies that have interface K removed. The walk
 * will continue to the entry objects and thus the forwarding table is updated
 * for each prefix with the new adjacency. The DP graph then becomes:
 *
 *    ADJ-3 --> Interface_A
 *
 *    ADJ-4 --> Interface_F
 *
 * The eBGP PIC scenarios described above relied on the update of a path-list's
 * recursive-adjacency to provide the shared point of cutover. This is shown
 * below:
 *
 *    E_a -->
 *    E_b --> PL_2 --> P_a --> E_44 --> PL_a --> P_b --> Interface_A
 *    ...          --> P_c -\
 *    E_k -->                \
 *                            \
 *                             E_1 --> PL_k -> P_k --> Interface_K
 *                            /
 *    E_l -->                /
 *    E_m --> PL_1 --> P_d -/
 *    ...          --> P_f --> E_55 --> PL_e --> P_e --> Interface_E
 *    E_z -->
 *
 * The failure scenario is the removal of entry E_1 and thus the paths P_c and
 * P_d become unresolved. To achieve PIC the two shared recursive path-lists,
 * PL_1 and PL_2, must be updated to remove E_1 from the recursive-multi-path-
 * adjacencies that they contribute, before any entry E_a to E_z is updated.
 * This means that as the update propagates backwards (right to left) in the
 * graph it must do so breadth first, not depth first. Note this approach leads
 * to convergence times that are dependent on the number of path-lists and so
 * the number of combinations of egress PEs - this is desirable as this
 * scale is considerably lower than the number of prefixes.
 *
 * Consider another section of the graph that is similar to the one shown
 * above, where there is another prefix E_2 in a similar position to E_1
 * and so also has many dependent children. It is reasonable to expect that a
 * particular network failure may simultaneously render E_1 and E_2 unreachable.
 * This means that the update to withdraw E_2 is downloaded immediately after
 * the update to withdraw E_1. It is a requirement on the FIB to not spend large
 * amounts of time in a back-walk whilst processing the update for E_1, i.e. the
 * back-walk must not reach as far as E_a and its siblings. Therefore, after the
 * back-walk has traversed one generation (breadth first) to update all the
 * path-lists it should be suspended/back-grounded and further updates allowed
 * to be handled. Once the update queue is empty, the suspended walks can be
 * resumed. Note that in the case that multiple updates affect the same entry
 * (say E_1) then this will trigger multiple similar walks; these are merged,
 * so each child is updated only once.
 * In the presence of more layers of recursion PIC is still a desirable
 * feature.
 * Consider an extension to the diagram above, where more recursive
 * routes (E_100 -> E_199) are added as children of E_a:
 *
 *    E_100 -->
 *    E_101 --> PL_3 --> P_j-\
 *    ...                     \
 *    E_199 -->                E_a -->
 *                             E_b --> PL_2 --> P_a --> E_44 --> ...etc..
 *                             ...          --> P_c -\
 *                             E_k                    \
 *                                                     E_1 --> ...etc..
 *                                                    /
 *                             E_l -->               /
 *                             E_m --> PL_1 --> P_d -/
 *                             ...          --> P_e --> E_55 --> ...etc..
 *                             E_z -->
 *
 * To achieve PIC for the routes E_100->E_199, PL_3 needs to be updated before
 * E_b -> E_z; a breadth first traversal at each level would not achieve this.
 * Instead the walk must proceed intelligently. Children on PL_2 are sorted so
 * those Entry objects that themselves have children appear first in the list,
 * those without later. When an entry object is walked that has children, a
 * walk of its children is pushed to the front of the background queue. The
 * background queue is a priority queue. As the breadth first traversal proceeds
 * across the dependent entry objects E_a to E_k, when the first entry that does
 * not have children is reached (E_b), the walk is suspended and placed at the
 * back of the queue. Following this prioritisation method shared path-list
 * updates are performed before all non-resolving entry objects.
 * The CPU/core/thread that handles the updates is the same thread that handles
 * the back-walks. Handling updates has a higher priority than making walk
 * progress, so a walk is required to be interruptible/suspendable when new
 * updates are available.
 * !!! TODO - this section describes how walks should be not how they are !!!
 *
 * In the diagram above E_100 is an IP route; however, VPP has no restrictions
 * on the type of object that can be a dependent of a FIB entry. Children of
 * a FIB entry can be (and are) GRE & VXLAN tunnel endpoints, L2VPN LSPs etc.
 * By including all object types into the graph and extending the back-walk, we
 * can thus deliver fast convergence to technologies that overlay on an IP
 * network.
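 *
 * The typed-index 'pointer' and the breadth-first back-walk described in this
 * section can be sketched in C as follows. This is a hedged illustration, not
 * the VPP implementation (the TODO above notes that the text describes
 * intent): the node_* names, the fixed-size pools and the last_walk stamp used
 * to visit each child only once are all assumptions made for the example.
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical sketch; names and sizes are illustrative, not the VPP FIB API.
typedef enum { NODE_ENTRY, NODE_PATH_LIST, NODE_PATH, NODE_N_TYPES } node_type_t;

// A 'pointer' is an object type plus an index into that type's pool.
typedef struct { node_type_t type; uint32_t index; } node_ptr_t;

#define POOL_SIZE    8
#define MAX_CHILDREN 8

typedef struct {
    node_ptr_t children[MAX_CHILDREN]; // direct dependents only
    size_t     n_children;
    unsigned   n_updates;              // times a back-walk updated this node
    unsigned   last_walk;              // id of the last walk that reached it
} node_t;

// One pool per object type, zero-initialised at program start.
static node_t pools[NODE_N_TYPES][POOL_SIZE];

static node_t *node_get(node_ptr_t p)
{
    assert(p.type < NODE_N_TYPES && p.index < POOL_SIZE);
    return &pools[p.type][p.index];
}

static void node_child_add(node_ptr_t parent, node_ptr_t child)
{
    node_t *n = node_get(parent);
    assert(n->n_children < MAX_CHILDREN);
    n->children[n->n_children++] = child;
}

// Breadth-first back-walk from a changed object: every object of one
// generation is updated before any object of the next generation.
// Returns the number of dependents updated.
static size_t back_walk(node_ptr_t start)
{
    static unsigned walk_id;                    // distinguishes successive walks
    node_ptr_t queue[NODE_N_TYPES * POOL_SIZE]; // each node is queued at most once
    size_t head = 0, tail = 0, updated = 0;

    walk_id++;
    node_get(start)->last_walk = walk_id;
    queue[tail++] = start;
    while (head < tail) {
        node_t *n = node_get(queue[head++]);
        for (size_t i = 0; i < n->n_children; i++) {
            node_t *c = node_get(n->children[i]);
            if (c->last_walk == walk_id)
                continue;                       // already reached in this walk
            c->last_walk = walk_id;
            c->n_updates++;
            updated++;
            queue[tail++] = n->children[i];
        }
    }
    return updated;
}
```
 * Starting a walk at an entry such as E_1 updates its dependent paths first,
 * then the path-lists that contain those paths, and only then the entries
 * that use those path-lists; objects that do not depend on E_1 are never
 * touched.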
 *
 * If having read all the above carefully you are still thinking; 'i don't need
 * all this %&$* i have a route only I know about and I just need to jam it in',
 * then fib_table_entry_special_add() is your only friend.
 */

#ifndef __FIB_H__
#define __FIB_H__

#include <vnet/fib/fib_table.h>
#include <vnet/fib/fib_entry.h>

#endif
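
/*
 * The shared-object idea behind PIC, together with the lock/unlock reference
 * counting mentioned in the comment above, can be sketched as follows. This is
 * a minimal illustration under stated assumptions and is not the VPP FIB API:
 * every sk_* name is invented here, a path-list is keyed by a single next-hop,
 * and the pool is a small fixed array. Entries hold a pool index rather than a
 * C pointer, so one write to the shared path-list re-points every route that
 * uses it.
 */
```c
#include <assert.h>
#include <stdint.h>

// Hypothetical sketch, not the VPP FIB API: all sk_* names are invented here.
typedef struct {
    uint32_t next_hop;  // key: a path-list described by a single next-hop
    uint32_t adj_index; // forwarding object contributed to every user
    uint32_t locks;     // number of entries sharing this path-list
} sk_path_list_t;

typedef struct {
    uint32_t prefix;    // opaque route key, e.g. X/16
    uint32_t pl_index;  // pool index of the shared path-list
} sk_entry_t;

#define SK_N_PATH_LISTS 4u
#define SK_INVALID      UINT32_MAX

static sk_path_list_t sk_path_lists[SK_N_PATH_LISTS];

// Find the path-list for this next-hop, or create it in a free slot using
// adj_index. Either way the caller gains one lock on the returned index.
static uint32_t sk_path_list_lock(uint32_t next_hop, uint32_t adj_index)
{
    uint32_t free_slot = SK_INVALID;
    for (uint32_t i = 0; i < SK_N_PATH_LISTS; i++) {
        sk_path_list_t *pl = &sk_path_lists[i];
        if (pl->locks > 0 && pl->next_hop == next_hop) {
            pl->locks++;
            return i;
        }
        if (pl->locks == 0 && free_slot == SK_INVALID)
            free_slot = i;
    }
    assert(free_slot != SK_INVALID); // the sketch has no pool growth
    sk_path_lists[free_slot] = (sk_path_list_t){ next_hop, adj_index, 1 };
    return free_slot;
}

// Drop one lock; a slot with zero locks may be reused by a later lock call.
static void sk_path_list_unlock(uint32_t pl_index)
{
    assert(pl_index < SK_N_PATH_LISTS && sk_path_lists[pl_index].locks > 0);
    sk_path_lists[pl_index].locks--;
}

// The single write that re-points every route sharing this path-list.
static void sk_path_list_update(uint32_t pl_index, uint32_t new_adj_index)
{
    assert(pl_index < SK_N_PATH_LISTS);
    sk_path_lists[pl_index].adj_index = new_adj_index;
}

// Forwarding lookup for one entry goes through the shared object.
static uint32_t sk_entry_forward(const sk_entry_t *e)
{
    return sk_path_lists[e->pl_index].adj_index;
}
```

/*
 * If three entries lock the path-list for the same next-hop they receive the
 * same index, so a single sk_path_list_update() call changes the result of
 * sk_entry_forward() for all three without touching the entries themselves.
 */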