1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
|
.. _fastconvergence:
Fast Convergence
------------------------------------
This is an excellent description of the topic:
'FIB <https://tools.ietf.org/html/draft-ietf-rtgwg-bgp-pic-12>'_
but if you're interested in my take keep reading...
First some definitions:
- Convergence; When a FIB is forwarding all packets correctly based
on the network topology (i.e. doing what the routing control plane
has instructed it to do), then it is said to be 'converged'.
Not being in a converged state is [hopefully] a transient state,
when either the topology change (e.g. a link failure) has not been
observed or processed by the routing control plane, or that the FIB
is still processing routing updates. Convergence is the act of
getting to the converged state.
- Fast: In the shortest time possible. There are no absolute limits
placed on how short this must be, although there is one number often
mentioned. Apparently the human ear can detect loss/delay/jitter in
VOIP of 50ms, therefore network failures should last no longer than
this, and some technologies (notably link-free alternate fast
reroute) are designed to converge in this time. However, it is
generally accepted that it is not possible to converge a FIB with
tens of millions of routes in this time scale, the industry
'standard' is sub-second.
Converging the FIB quickly is thus a matter of:
- discovering something is down
- updating as few objects as possible
- to determine which objects to update as efficiently as possible
- to update each object as quickly as possible
we'll discuss each in turn.
All output came from VPP version 21.01rc0. In what follows I use IPv4
prefixes, addresses and IPv4 host length masks, however, exactly the
same applies to IPv6.
Failure Detection
^^^^^^^^^^^^^^^^^
The two common forms (we'll see others later on) of failure detection
are:
- link down
- BFD
The FIB needs to hook into these notifications to trigger
convergence.
Whenever an interface goes down, VPP issues a callback to all
registerd clients. The adjacency code is such a client. The adjacency
is a leaf node in the FIB control-plane graph (containing fib_path_t,
fib_entry_t etc). A back-walk from the adjacnecy will trigger a
re-resolution of the paths.
FIB is a client of BFD in order to receive BFD notifications. BFD
comes in two flavours; single and multi hop. Single hop is to protect
a specific peer on an interface, such peers are modelled by an
adjacency. Multi hop is to protect a peer on an unspecified interface
(i.e. a remote peer), this peer is represented by a host-prefix
**fib_entry_t**. In both case FIB will add a delegate to the
**ip_adjacency_t** or **fib_entry_t** that represents the association
to the BFD session. If the BFD session signals up/down then a backwalk
can be triggered from the object to trigger re-resolution and hence
convergence.
Few Updates
^^^^^^^^^^^
In order to talk about what 'a few' is we have to leave the realm of
the FIB as an abstract graph based object DB and move into the
concrete representation of forwarding in a large network. Large
networks are built in layers, it's how you scale them. We'll take
here a hypothetical service provider (SP) network, but the concepts
apply equally to data center leaf-spines. This is a rudimentary
description, but it should serve our purpose.
An SP manages a BGP autonomous system (AS). The SP's goal is both to
attract traffic into its network to serve its customers, but also to
serve transit traffic passing through it, we'll consider the latter here.
The SP's network is all devices in that AS, these
devices are split into those at the edge (provider edge (PE) routers)
which peer with routers in other SP networks,
and those in the core (termed provider (P) routers). Both the PE and P
routers run the IGP (usually OSPF or ISIS). Only the reachability of the devices
in the AS are advertised in the IGP - thus the scale (i.e. the number
of routes) in the IGP is 'small' - only the number of
devices that the SP has (typically not more than a few 10k).
PE routers run BGP; they have external BGP sessions to devices in
other ASs and internal BGP sessions to devices in the same AS. BGP is
used to advertise the routes to *all* networks on the internet - at
the time of writing this number is approaching 900k IPv4 route, hopefully by
the time you are reading this the number of IPv6 routes has caught up ...
If we include the additional routes the SP carries to offering VPN service to its
customers the number of BGP routes can grow to the tens of millions.
BGP scale thus exceeds IGP scale by two orders of magnitude... pause for
a moment and let that sink in...
A comparison of BGP and an IGP is way way beyond the scope of this
documentation (and frankly beyond me) so we'll note only the
difference in the form of the routes they present to FIB. A routing
protocol will produce routes that specify the prefixes that are
reachable through its peers. A good IGP
is link state based, it forms peerings to other devices over these
links, hence its routes specify links/interfaces. In
FIB nomenclature this means an IGP produces routes that are
attached-nexthop, e.g.:
.. code-block:: console
ip route add 1.1.1.1/32 via 10.0.0.1 GigEthernet0/0/0
BGP on the other hand forms peerings only to neighbours, it does not
know, nor care, what interface is used to reach the peer. In FIB
nomenclature therefore BGP produces recursive routes, e.g.:
.. code-block:: console
ip route 8.0.0.0/16 via 1.1.1.1
where 1.1.1.1 is the BGP peer. It's no accident in this example that
1.1.1.1/32 happens to be the route the IGP advertised... BGP installs
routes for prefixes reachable via other BGP peers, and the IGP install
the routes to those BGP peers.
This has been a very long winded way of describing why the scale of
recursive routes is therefore 2 orders of magnitude greater than
non-recursive/attached-nexthop routes.
If we step back for a moment and recall why we've crawled down this
rabbit hole, we're trying to determine what 'a few' updates means,
does it include all those recursive routes, probably not ... let's
keep crawling.
We started this chapter with an abstract description of convergence,
let's now make that more real. In the event of a network failure an SP
is interested in moving to an alternate forwarding path as quickly as
possible. If there is no alternate path, and a converged FIB will drop
the packet, then who cares how fast it converges. In other words the
interesting convergence scenarios are the scenarios where the network has
alternate paths.
PIC Core
^^^^^^^^
First let's consider alternate paths in the IGP, e.g.;
.. code-block:: console
ip route add 1.1.1.1/32 via 10.0.0.2 GigEthernet0/0/0
ip route add 1.1.1.1/32 via 10.0.1.2 GigEthernet0/0/1
this gives us in the FIB:
.. code-block:: console
DBGvpp# sh ip fib 1.1.1.1/32
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, default-route:1, ]
1.1.1.1/32 fib:0 index:15 locks:2
API refs:1 src-flags:added,contributing,active,
path-list:[23] locks:2 flags:shared, uPRF-list:22 len:2 itfs:[1, 2, ]
path:[27] pl-index:23 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
10.0.0.2 GigEthernet0/0/0
[@0]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001111111111dead000000000800
path:[28] pl-index:23 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
10.0.1.2 GigEthernet0/0/1
[@0]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
forwarding: unicast-ip4-chain
[@0]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:22 to:[0:0]]
[0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001111111111dead000000000800
[1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
There is ECMP across the two paths. Note that the instance/index of the
load-balance present in the forwarding graph is 17.
Let's add a BGP route via this peer;
.. code-block:: console
ip route add 8.0.0.0/16 via 1.1.1.1
in the FIB we see:
.. code-block:: console
DBGvpp# sh ip fib 8.0.0.0/16
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:1, default-route:1, ]
8.0.0.0/16 fib:0 index:18 locks:2
API refs:1 src-flags:added,contributing,active,
path-list:[24] locks:2 flags:shared, uPRF-list:21 len:2 itfs:[1, 2, ]
path:[29] pl-index:24 ip4 weight=1 pref=0 recursive: oper-flags:resolved,
via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
forwarding: unicast-ip4-chain
[@0]: dpo-load-balance: [proto:ip4 index:20 buckets:1 uRPF:21 to:[0:0]]
[0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:22 to:[0:0]]
[0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001111111111dead000000000800
[1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
the load-balance object used by this route is index 20, but note that
the next load-balance in the chain is index 17, i.e. it is exactly
the same instance that appears in the forwarding chain for the IGP
route. So in the forwarding plane the packet first encounters
load-balance object 20 (which it will use in ip4-lookup) and then
number 17 (in ip4-load-balance).
What's the significance? Let's shut down one of those IGP paths:
.. code-block:: console
DBGvpp# set in state GigEthernet0/0/0 down
the resulting update to the IGP route is:
.. code-block:: console
DBGvpp# sh ip fib 1.1.1.1/32
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:1, default-route:1, ]
1.1.1.1/32 fib:0 index:15 locks:4
API refs:1 src-flags:added,contributing,active,
path-list:[23] locks:2 flags:shared, uPRF-list:25 len:2 itfs:[1, 2, ]
path:[27] pl-index:23 ip4 weight=1 pref=0 attached-nexthop:
10.0.0.2 GigEthernet0/0/0
[@0]: arp-ipv4: via 10.0.0.2 GigEthernet0/0/0
path:[28] pl-index:23 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
10.0.1.2 GigEthernet0/0/1
[@0]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
recursive-resolution refs:1 src-flags:added, cover:-1
forwarding: unicast-ip4-chain
[@0]: dpo-load-balance: [proto:ip4 index:17 buckets:1 uRPF:25 to:[0:0]]
[0] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
notice that the path via 10.0.0.2 is no longer flagged as resolved,
and the forwarding chain does not contain this path as a
choice. However, the key thing to note is the load-balance
instance is still index 17, i.e. it has been modified not
exchanged. In the FIB vernacular we say it has been 'in-place
modified', a somewhat linguistically redundant expression, but one that serves
to emphasise that it was changed whilst still be part of the graph, it
was never at any point removed from the graph and re-added, and it was
modified without worker barrier lock held.
Still don't see the significance? In order to converge around the
failure of the IGP link it was not necessary to update load-balance
object number 20! It was not necessary to update the recursive
route. i.e. convergence is achieved without updating any recursive
routes, it is only necessary to update the affected IGP routes, this is
the definition of 'a few'. We call this 'prefix independent
convergence' (PIC) which should really be called 'recursive prefix
independent convergence' but it isn't...
How was the trick done? As with all problems in computer science, it
was solved by a layer of misdirection, I mean indirection. The
indirection is the load-balance that belongs to the IGP route. By
keeping this object in the forwarding graph and updating it in place,
we get PIC. The alternative design would be to collapse the two layers of
load-balancing into one, which would improve forwarding performance
but would come at the cost of prefix dependent convergence. No doubt
there are situations where the VPP deployment would favour forwarding
performance over convergence, you know the drill, contributions welcome.
This failure scenario is known as PIC core, since it's one of the IGP's
core links that has failed.
iBGP PIC Edge
^^^^^^^^^^^^^
Next, let's consider alternate paths in BGP, e.g:
.. code-block:: console
ip route add 8.0.0.0/16 via 1.1.1.1
ip route add 8.0.0.0/16 via 1.1.1.2
the 8.0.0.0/16 prefix is reachable via two BGP next-hops (two PEs).
Our FIB now also contains:
.. code-block:: console
DBGvpp# sh ip fib 8.0.0.0/16
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:2, default-route:1, ]
8.0.0.0/16 fib:0 index:18 locks:2
API refs:1 src-flags:added,contributing,active,
path-list:[15] locks:2 flags:shared, uPRF-list:11 len:2 itfs:[1, 2, ]
path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved,
via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved,
via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-load-balance:12]
forwarding: unicast-ip4-chain
[@0]: dpo-load-balance: [proto:ip4 index:20 buckets:2 uRPF:11 to:[0:0]]
[0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:1 uRPF:25 to:[0:0]]
[0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800
[1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
[1] [@12]: dpo-load-balance: [proto:ip4 index:12 buckets:1 uRPF:13 to:[0:0]]
[0] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
The first load-balance (LB) in the forwarding graph is index 20 (the astute
reader will note this is the same index as in the previous
section, I am adding paths to the same route, the load-balance is
in-place modified again). Each choice in LB 20 is another LB
contributed by the IGP route through which the route's paths recurse.
So what's the equivalent in BGP to a link down in the IGP? An IGP link
down means it loses its peering out of that link, so the equivalent in
BGP is the loss of the peering and thus the loss of reachability to
the peer. This is signaled by the IGP withdrawing the route to the
peer. But "Wait wait wait", i hear you say ... "just because the IGP
withdraws 1.1.1.1/32 doesn't mean I can't reach 1.1.1.1, perhaps there
is a less specific route that gives reachability to 1.1.1.1". Indeed
there may be. So a little more on BGP network design. I know it's like
a bad detective novel where the author drip feeds you the plot... When
describing iBGP peerings one 'always' describes the peer using one of
its GigEthernet0/0/back addresses. Why? A GigEthernet0/0/back interface
never goes down (unless you admin down it yourself), some muppet can't
accidentally cut through the GigEthernet0/0/back cable whilst digging up the
street. And what subnet mask length does a prefix have on a GigEthernet0/0/back
interface? it's 'always' a /32. Why? because there's no cable to connect
any other devices. This choice justifies there 'always' being a /32
route for the BGP peer. But what prevents there not being a less
specific - nothing.
Now clearly if the BGP peer crashes then the /32 for its GigEthernet0/0/back is
going to be removed from the IGP, but what will withdraw the less
specific - nothing.
So in order to make use of this trick of relying on the withdrawal of
the /32 for the peer to signal that the peer is down and thus the
signal to converge the FIB, we need to force FIB to recurse only via
the /32 and not via a less specific. This is called a 'recursion
constraint'. In this case the constraint is 'recurse via host'
i.e. for ipv4 use a /32.
So we need to update our route additions from before:
.. code-block:: console
ip route add 8.0.0.0/16 via 1.1.1.1 resolve-via-host
ip route add 8.0.0.0/16 via 1.1.1.2 resolve-via-host
checking the FIB output is left as an exercise to the reader. I hope
you're doing these configs as you read. There's little change in the
output, you'll see some extra flags on the paths.
Now let's add the less specific, just for fun:
.. code-block:: console
ip route add 1.1.1.0/28 via 10.0.0.2 GigEthernet0/0/0
nothing changes in resolution of 8.0.0.0/16.
Now withdraw the route to 1.1.1.2/32:
.. code-block:: console
ip route del 1.1.1.2/32 via 10.0.0.2 GigEthernet0/0/0
In the FIB we see:
.. code-block:: console
DBGvpp# sh ip fib 8.0.0.0/32
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:2, default-route:1, ]
8.0.0.0/16 fib:0 index:18 locks:2
API refs:1 src-flags:added,contributing,active,
path-list:[15] locks:2 flags:shared, uPRF-list:13 len:2 itfs:[1, 2, ]
path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: cfg-flags:resolve-host,
via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-drop:0]
forwarding: unicast-ip4-chain
[@0]: dpo-load-balance: [proto:ip4 index:20 buckets:1 uRPF:13 to:[0:0]]
[0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:27 to:[0:0]]
[0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800
[1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
the path via 1.1.1.2 is unresolved, because the recursion constraints
are preventing the the path resolving via 1.1.1.0/28. the LB index 20
has been updated to remove the unresolved path.
Job done? Not quite! Why not?
Let's re-examine the goals of this chapter. We wanted to update 'a
few' objects, which we have defined as not all the millions of
recursive routes. Did we do that here? We sure did, when we
modified LB index 20. So WTF?? Where's the indirection object that can
be modified so that the LBs for the recursive routes are not
modified - it's not there.... WTF?
OK so the great detective has assembled all the suspects in the
drawing room and only now does he drop the bomb; the FIB knows the
scale, we talked above about what the scale **can** be, worst case
scenario, but that's not necessarily what it is in this hypothetical
(your) deployment. It knows how many recursive routes there are that
depend on a /32, it can thus make its own determination of the
definition of 'a few'. In other words, if there are only 'a few'
recursive prefixes that depend on a /32 then it will update them
synchronously (and we'll discuss what synchronously means a bit more later).
So what does FIB consider to be 'a few'. Let's add more routes and
find out.
.. code-block:: console
DBGvpp# ip route add 8.1.0.0/16 via 1.1.1.2 resolve-via-host via 1.1.1.1 resolve-via-host
...
DBGvpp# ip route add 8.63.0.0/16 via 1.1.1.2 resolve-via-host via 1.1.1.1 resolve-via-host
and we see:
.. code-block:: console
DBGvpp# sh ip fib 8.8.0.0
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:4, default-route:1, ]
8.8.0.0/16 fib:0 index:77 locks:2
API refs:1 src-flags:added,contributing,active,
path-list:[15] locks:128 flags:shared,popular, uPRF-list:28 len:2 itfs:[1, 2, ]
path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-load-balance:12]
forwarding: unicast-ip4-chain
[@0]: dpo-load-balance: [proto:ip4 index:79 buckets:2 uRPF:28 flags:[uses-map] to:[0:0]]
load-balance-map: index:0 buckets:2
index: 0 1
map: 0 1
[0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:27 to:[0:0]]
[0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800
[1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
[1] [@12]: dpo-load-balance: [proto:ip4 index:12 buckets:1 uRPF:18 to:[0:0]]
[0] [@3]: arp-ipv4: via 10.0.1.2 GigEthernet0/0/0
Two elements to note here; the path-list has the 'popular' flag and
there is a load-balance map in the forwarding path.
'popular' in this case means that the path-list has passed the limit
of 'a few' in the number of children it has.
here are the children:
.. code-block:: console
DBGvpp# sh fib path-list 15
path-list:[15] locks:128 flags:shared,popular, uPRF-list:28 len:2 itfs:[1, 2, ]
path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-load-balance:12]
children:{entry:18}{entry:21}{entry:22}{entry:23}{entry:25}{entry:26}{entry:27}{entry:28}{entry:29}{entry:30}{entry:31}{entry:32}{entry:33}{entry:34}{entry:35}{entry:36}{entry:37}{entry:38}{entry:39}{entry:40}{entry:41}{entry:42}{entry:43}{entry:44}{entry:45}{entry:46}{entry:47}{entry:48}{entry:49}{entry:50}{entry:51}{entry:52}{entry:53}{entry:54}{entry:55}{entry:56}{entry:57}{entry:58}{entry:59}{entry:60}{entry:61}{entry:62}{entry:63}{entry:64}{entry:65}{entry:66}{entry:67}{entry:68}{entry:69}{entry:70}{entry:71}{entry:72}{entry:73}{entry:74}{entry:75}{entry:76}{entry:77}{entry:78}{entry:79}{entry:80}{entry:81}{entry:82}{entry:83}{entry:84}
64 children makes it popular. The number is fixed (there is no API to
change it). Its choice is an attempt to balance the performance cost
of the indirection performance degradation versus the convergence
gain.
Popular path-lists contribute the load-balance map, this is the
missing indirection object. Its indirection happens when choosing the
bucket in the LB. The packet's flow-hash is taken 'mod number of
buckets' to give the 'candidate bucket' then the map will take this
'index' and convert it into the 'map'. You can see in the example above
that no change occurs, i.e. if the flow-hash mod n chooses bucket 1
then it gets bucket 1.
Why is this useful? The path-list is shared (you can convince
yourself of this if you look at each of the 8.x.0.0/16 routes we
added) and all of these routes use the same load-balance map, therefore, to
converge all the recursive routs, we need only change the map and
we're good; we again get PIC.
OK who's still awake... if you're thinking there's more to this story,
you're right. Keep reading.
This failure scenario is called iBGP PIC edge. It's 'edge' because it
refers to the loss of an edge device, and iBGP because the device was
a iBGP peer (we learn iBGP peers in the IGP). There is a similar eBGP
PIC edge scenario, but this is left for an exercise to the reader (hint
there are other recursion constraints - see the RFC).
Which Objects
^^^^^^^^^^^^^
The next topic on our list of how to converge quickly was to
effectively find the objects that need to be updated when a converge
event happens. If you haven't realised by now that the FIB is an
object graph, then can I politely suggest you go back and start from
the beginning ...
Finding the objects affected by a change is simply a matter of walking
from the parent (the object affected) to its children. These
dependencies are kept really for this reason.
So is fast convergence just a matter of walking the graph? Yes and
no. The question to ask yourself is this, "in the case of iBGP PIC edge,
when the /32 is withdrawn, what is the list of objects that need to be
updated and particularly what is the order they should be updated in
order to obtain the best convergence time?" Think breadth v. depth first.
... ponder for a while ...
For iBGP PIC edge we said it's the path-list that provides the
indirection through the load-balance map. Hence once all path-lists
are updated we are converged, thereafter, at our leisure, we can
update the child recursive prefixes. Is the breadth or depth first?
It's breadth first.
Breadth first walks are achieved by spawning an async walk of the
branch of the graph that we don't want to traverse. Withdrawing the /32
triggers a synchronous walk of the children of the /32 route, we want
a synchronous walk because we want to converge ASAP. This synchronous
walk will encounter path-lists in the /32 route's child dependent list.
These path-lists (and thier LB maps) will be updated. If a path-list is
popular, then it will spawn a async walk of the path-list's child
dependent routes, if not it will walk those routes. So the walk
effectively proceeds breadth first across the path-lists, then returns
to the start to do the affected routes.
Now the story is complete. The murderer is revealed.
Let's withdraw one of the IGP routes.
.. code-block:: console
DBGvpp# ip route del 1.1.1.2/32 via 10.0.1.2 GigEthernet0/0/1
DBGvpp# sh ip fib 8.8.0.0
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:4, default-route:1, ]
8.8.0.0/16 fib:0 index:77 locks:2
API refs:1 src-flags:added,contributing,active,
path-list:[15] locks:128 flags:shared,popular, uPRF-list:18 len:2 itfs:[1, 2, ]
path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: cfg-flags:resolve-host,
via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-drop:0]
forwarding: unicast-ip4-chain
[@0]: dpo-load-balance: [proto:ip4 index:79 buckets:1 uRPF:18 to:[0:0]]
[0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:27 to:[0:0]]
[0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800
[1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
the LB Map has gone, since the prefix now only has one path. You'll
need to be a CLI ninja if you want to catch the output showing the LB
map in its transient state of:
.. code-block:: console
load-balance-map: index:0 buckets:2
index: 0 1
map: 0 0
but it happens. Trust me. I've got tests and everything.
On the final topic of how to converge quickly; 'make each update fast'
there are no tricks.
|