aboutsummaryrefslogtreecommitdiffstats
path: root/docs/developer/corearchitecture/infrastructure.rst
blob: b4e1065f81e013860ef709a09d4ed4abd4181a77 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
VPPINFRA (Infrastructure)
=========================

The files associated with the VPP Infrastructure layer are located in
the ``./src/vppinfra`` folder.

VPPinfra is a collection of basic c-library services, quite sufficient
to build standalone programs to run directly on bare metal. It also
provides high-performance dynamic arrays, hashes, bitmaps,
high-precision real-time clock support, fine-grained event-logging, and
data structure serialization.

One fair comment / fair warning about vppinfra: you can't always tell a
macro from an inline function from an ordinary function simply by name.
Macros are used to avoid function calls in the typical case, and to
cause (intentional) side-effects.

Vppinfra has been around for almost 20 years and tends not to change
frequently. The VPP Infrastructure layer contains the following
functions:

Vectors
-------

Vppinfra vectors are ubiquitous dynamically resized arrays with by user
defined "headers". Many vpppinfra data structures (e.g. hash, heap,
pool) are vectors with various different headers.

The memory layout looks like this:

::

                      User header (optional, uword aligned)
                      Alignment padding (if needed)
                      Vector length in elements
    User's pointer -> Vector element 0
                      Vector element 1
                      ...
                      Vector element N-1

As shown above, the vector APIs deal with pointers to the 0th element of
a vector. Null pointers are valid vectors of length zero.

To avoid thrashing the memory allocator, one often resets the length of
a vector to zero while retaining the memory allocation. Set the vector
length field to zero via the vec_reset_length(v) macro. [Use the macro!
It’s smart about NULL pointers.]

Typically, the user header is not present. User headers allow for other
data structures to be built atop vppinfra vectors. Users may specify the
alignment for first data element of a vector via the [vec]()*_aligned
macros.

Vector elements can be any C type e.g. (int, double, struct bar). This
is also true for data types built atop vectors (e.g. heap, pool, etc.).
Many macros have \_a variants supporting alignment of vector elements
and \_h variants supporting non-zero-length vector headers. The \_ha
variants support both. Additionally cacheline alignment within a vector
element structure can be specified using the
``[CLIB_CACHE_LINE_ALIGN_MARK]()`` macro.

Inconsistent usage of header and/or alignment related macro variants
will cause delayed, confusing failures.

Standard programming error: memorize a pointer to the ith element of a
vector, and then expand the vector. Vectors expand by 3/2, so such code
may appear to work for a period of time. Correct code almost always
memorizes vector **indices** which are invariant across reallocations.

In typical application images, one supplies a set of global functions
designed to be called from gdb. Here are a few examples:

-  vl(v) - prints vec_len(v)
-  pe(p) - prints pool_elts(p)
-  pifi(p, index) - prints pool_is_free_index(p, index)
-  debug_hex_bytes (p, nbytes) - hex memory dump nbytes starting at p

Use the “show gdb” debug CLI command to print the current set.

Bitmaps
-------

Vppinfra bitmaps are dynamic, built using the vppinfra vector APIs.
Quite handy for a variety jobs.

Pools
-----

Vppinfra pools combine vectors and bitmaps to rapidly allocate and free
fixed-size data structures with independent lifetimes. Pools are perfect
for allocating per-session structures.

Hashes
------

Vppinfra provides several hash flavors. Data plane problems involving
packet classification / session lookup often use
./src/vppinfra/bihash_template.[ch] bounded-index extensible hashes.
These templates are instantiated multiple times, to efficiently service
different fixed-key sizes.

Bihashes are thread-safe. Read-locking is not required. A simple
spin-lock ensures that only one thread writes an entry at a time.

The original vppinfra hash implementation in ./src/vppinfra/hash.[ch]
are simple to use, and are often used in control-plane code which needs
exact-string-matching.

In either case, one almost always looks up a key in a hash table to
obtain an index in a related vector or pool. The APIs are simple enough,
but one must take care when using the unmanaged arbitrary-sized key
variant. Hash_set_mem (hash_table, key_pointer, value) memorizes
key_pointer. It is usually a bad mistake to pass the address of a vector
element as the second argument to hash_set_mem. It is perfectly fine to
memorize constant string addresses in the text segment.

Timekeeping
-----------

Vppinfra includes high-precision, low-cost timing services. The datatype
clib_time_t and associated functions reside in ./src/vppinfra/time.[ch].
Call clib_time_init (clib_time_t \*cp) to initialize the clib_time_t
object.

Clib_time_init(…) can use a variety of different ways to establish the
hardware clock frequency. At the end of the day, vppinfra timekeeping
takes the attitude that the operating system’s clock is the closest
thing to a gold standard it has handy.

When properly configured, NTP maintains kernel clock synchronization
with a highly accurate off-premises reference clock. Notwithstanding
network propagation delays, a synchronized NTP client will keep the
kernel clock accurate to within 50ms or so.

Why should one care? Simply put, oscillators used to generate CPU ticks
aren’t super accurate. They work pretty well, but a 0.1% error wouldn’t
be out of the question. That’s a minute and a half’s worth of error in 1
day. The error changes constantly, due to temperature variation, and a
host of other physical factors.

It’s far too expensive to use system calls for timing, so we’re left
with the problem of continuously adjusting our view of the CPU tick
register’s clocks_per_second parameter.

The clock rate adjustment algorithm measures the number of cpu ticks and
the “gold standard” reference time across an interval of approximately
16 seconds. We calculate clocks_per_second for the interval: use rdtsc
(on x86_64) and a system call to get the latest cpu tick count and the
kernel’s latest nanosecond timestamp. We subtract the previous interval
end values, and use exponential smoothing to merge the new clock rate
sample into the clocks_per_second parameter.

As of this writing, we maintain the clock rate by way of the following
first-order differential equation:

.. code:: c

      clocks_per_second(t) = clocks_per_second(t-1) * K + sample_cps(t)*(1-K)
      where K = e**(-1.0/3.75);

This yields a per observation “half-life” of 1 minute. Empirically, the
clock rate converges within 5 minutes, and appears to maintain
near-perfect agreement with the kernel clock in the face of ongoing NTP
time adjustments.

See ./src/vppinfra/time.c:clib_time_verify_frequency(…) to look at the
rate adjustment algorithm. The code rejects frequency samples
corresponding to the sort of adjustment which might occur if someone
changes the gold standard kernel clock by several seconds.

Monotonic timebase support
~~~~~~~~~~~~~~~~~~~~~~~~~~

Particularly during system initialization, the “gold standard” system
reference clock can change by a large amount, in an instant. It’s not a
best practice to yank the reference clock - in either direction - by
hours or days. In fact, some poorly-constructed use-cases do so.

To deal with this reality, clib_time_now(…) returns the number of
seconds since vpp started, *guaranteed to be monotonically increasing,
no matter what happens to the system reference clock*.

This is first-order important, to avoid breaking every active timer in
the system. The vpp host stack alone may account for tens of millions of
active timers. It’s utterly impractical to track down and fix timers, so
we must deal with the issue at the timebase level.

Here’s how it works. Prior to adjusting the clock rate, we collect the
kernel reference clock and the cpu clock:

.. code:: c

     /* Ask the kernel and the CPU what time it is... */
     now_reference = unix_time_now ();
     now_clock = clib_cpu_time_now ();

Compute changes for both clocks since the last rate adjustment, roughly
15 seconds ago:

.. code:: c

     /* Compute change in the reference clock */
     delta_reference = now_reference - c->last_verify_reference_time;

     /* And change in the CPU clock */
     delta_clock_in_seconds = (f64) (now_clock - c->last_verify_cpu_time) *
       c->seconds_per_clock;

Delta_reference is key. Almost 100% of the time, delta_reference and
delta_clock_in_seconds are identical modulo one system-call time.
However, NTP or a privileged user can yank the system reference time -
in either direction - by an hour, a day, or a decade.

As described above, clib_time_now(…) must return monotonically
increasing answers to the question “how long has it been since vpp
started, in seconds.” To do that, the clock rate adjustment algorithm
begins by recomputing the initial reference time:

.. code:: c

     c->init_reference_time += (delta_reference - delta_clock_in_seconds);

It’s easy to convince yourself that if the reference clock changes by
15.000000 seconds and the cpu clock tick time changes by 15.000000
seconds, the initial reference time won’t change.

If, on the other hand, delta_reference is -86400.0 and delta clock is
15.0 - reference time jumped backwards by exactly one day in a 15-second
rate update interval - we add -86415.0 to the initial reference time.

Given the corrected initial reference time, we recompute the total
number of cpu ticks which have occurred since the corrected initial
reference time, at the current clock tick rate:

.. code:: c

     c->total_cpu_time = (now_reference - c->init_reference_time)
       * c->clocks_per_second;

Timebase precision
~~~~~~~~~~~~~~~~~~

Cognoscenti may notice that vlib/clib_time_now(…) return a 64-bit
floating-point value; the number of seconds since vpp started.

Please see `this Wikipedia
article <https://en.wikipedia.org/wiki/Double-precision_floating-point_format>`__
for more information. C double-precision floating point numbers (called
f64 in the vpp code base) have a 53-bit effective mantissa, and can
accurately represent 15 decimal digits’ worth of precision.

There are 315,360,000.000001 seconds in ten years plus one microsecond.
That string has exactly 15 decimal digits. The vpp time base retains 1us
precision for roughly 30 years.

vlib/clib_time_now do *not* provide precision in excess of 1e-6 seconds.
If necessary, please use clib_cpu_time_now(…) for direct access to the
CPU clock-cycle counter. Note that the number of CPU clock cycles per
second varies significantly across CPU architectures.

Timer Wheels
------------

Vppinfra includes configurable timer wheel support. See the source code
in …/src/vppinfra/tw_timer_template.[ch], as well as a considerable
number of template instances defined in …/src/vppinfra/tw_timer\_.[ch].

Instantiation of tw_timer_template.h generates named structures to
implement specific timer wheel geometries. Choices include: number of
timer wheels (currently, 1 or 2), number of slots per ring (a power of
two), and the number of timers per “object handle”.

Internally, user object/timer handles are 32-bit integers, so if one
selects 16 timers/object (4 bits), the resulting timer wheel handle is
limited to 2**28 objects.

Here are the specific settings required to generate a single 2048 slot
wheel which supports 2 timers per object:

.. code:: c

       #define TW_TIMER_WHEELS 1
       #define TW_SLOTS_PER_RING 2048
       #define TW_RING_SHIFT 11
       #define TW_RING_MASK (TW_SLOTS_PER_RING -1)
       #define TW_TIMERS_PER_OBJECT 2
       #define LOG2_TW_TIMERS_PER_OBJECT 1
       #define TW_SUFFIX _2t_1w_2048sl
       #define TW_FAST_WHEEL_BITMAP 0
       #define TW_TIMER_ALLOW_DUPLICATE_STOP 0

See tw_timer_2t_1w_2048sl.h for a complete example.

tw_timer_template.h is not intended to be #included directly. Client
codes can include multiple timer geometry header files, although extreme
caution would required to use the TW and TWT macros in such a case.

API usage examples
~~~~~~~~~~~~~~~~~~

The unit test code in …/src/vppinfra/test_tw_timer.c provides a concrete
API usage example. It uses a synthetic clock to rapidly exercise the
underlying tw_timer_expire_timers(…) template.

There are not many API routines to call.

Initialize a two-timer, single 2048-slot wheel w/ a 1-second timer granularity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: c

       tw_timer_wheel_init_2t_1w_2048sl (&tm->single_wheel,
                                        expired_timer_single_callback,
                         1.0 / * timer interval * / );

Start a timer
^^^^^^^^^^^^^

.. code:: c

       handle = tw_timer_start_2t_1w_2048sl (&tm->single_wheel, elt_index,
                                             [0 | 1] / * timer id * / ,
                                             expiration_time_in_u32_ticks);

Stop a timer
^^^^^^^^^^^^

.. code:: c

       tw_timer_stop_2t_1w_2048sl (&tm->single_wheel, handle);

An expired timer callback
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: c

       static void
       expired_timer_single_callback (u32 * expired_timers)
       {
           int i;
           u32 pool_index, timer_id;
           tw_timer_test_elt_t *e;
           tw_timer_test_main_t *tm = &tw_timer_test_main;

           for (i = 0; i < vec_len (expired_timers);
               {
               pool_index = expired_timers[i] & 0x7FFFFFFF;
               timer_id = expired_timers[i] >> 31;

               ASSERT (timer_id == 1);

               e = pool_elt_at_index (tm->test_elts, pool_index);

               if (e->expected_to_expire != tm->single_wheel.current_tick)
                 {
                   fformat (stdout, "[%d] expired at %d not %d\n",
                            e - tm->test_elts, tm->single_wheel.current_tick,
                            e->expected_to_expire);
                 }
            pool_put (tm->test_elts, e);
            }
        }

We use wheel timers extensively in the vpp host stack. Each TCP session
needs 5 timers, so supporting 10 million flows requires up to 50 million
concurrent timers.

Timers rarely expire, so it’s of utmost important that stopping and
restarting a timer costs as few clock cycles as possible.

Stopping a timer costs a doubly-linked list dequeue. Starting a timer
involves modular arithmetic to determine the correct timer wheel and
slot, and a list head enqueue.

Expired timer processing generally involves bulk link-list retirement
with user callback presentation. Some additional complexity at wheel
wrap time, to relocate timers from slower-turning timer wheels into
faster-turning wheels.

Format
------

Vppinfra format is roughly equivalent to printf.

Format has a few properties worth mentioning. Format’s first argument is
a (u8 \*) vector to which it appends the result of the current format
operation. Chaining calls is very easy:

.. code:: c

       u8 * result;

       result = format (0, "junk = %d, ", junk);
       result = format (result, "more junk = %d\n", more_junk);

As previously noted, NULL pointers are perfectly proper 0-length
vectors. Format returns a (u8 \*) vector, **not** a C-string. If you
wish to print a (u8 \*) vector, use the “%v” format string. If you need
a (u8 \*) vector which is also a proper C-string, either of these
schemes may be used:

.. code:: c

       vec_add1 (result, 0)
       or
       result = format (result, "<whatever>%c", 0);

Remember to vec_free() the result if appropriate. Be careful not to pass
format an uninitialized (u8 \*).

Format implements a particularly handy user-format scheme via the “%U”
format specification. For example:

.. code:: c

       u8 * format_junk (u8 * s, va_list *va)
       {
         junk = va_arg (va, u32);
         s = format (s, "%s", junk);
         return s;
       }

       result = format (0, "junk = %U, format_junk, "This is some junk");

format_junk() can invoke other user-format functions if desired. The
programmer shoulders responsibility for argument type-checking. It is
typical for user format functions to blow up spectacularly if the
va_arg(va, type) macros don’t match the caller’s idea of reality.

Unformat
--------

Vppinfra unformat is vaguely related to scanf, but considerably more
general.

A typical use case involves initializing an unformat_input_t from either
a C-string or a (u8 \*) vector, then parsing via unformat() as follows:

.. code:: c

       unformat_input_t input;
       u8 *s = "<some-C-string>";

       unformat_init_string (&input, (char *) s, strlen((char *) s));
       /* or */
       unformat_init_vector (&input, <u8-vector>);

Then loop parsing individual elements:

.. code:: c

       while (unformat_check_input (&input) != UNFORMAT_END_OF_INPUT)
       {
         if (unformat (&input, "value1 %d", &value1))
           ;/* unformat sets value1 */
         else if (unformat (&input, "value2 %d", &value2)
           ;/* unformat sets value2 */
         else
           return clib_error_return (0, "unknown input '%U'",
                                     format_unformat_error, input);
       }

As with format, unformat implements a user-unformat function capability
via a “%U” user unformat function scheme. Generally, one can trivially
transform “format (s,”foo %d”, foo) -> “unformat (input,”foo %d”,
&foo)“.

Unformat implements a couple of handy non-scanf-like format specifiers:

.. code:: c

       unformat (input, "enable %=", &enable, 1 /* defaults to 1 */);
       unformat (input, "bitzero %|", &mask, (1<<0));
       unformat (input, "bitone %|", &mask, (1<<1));
       <etc>

The phrase “enable %=” means “set the supplied variable to the default
value” if unformat parses the “enable” keyword all by itself. If
unformat parses “enable 123” set the supplied variable to 123.

We could clean up a number of hand-rolled “verbose” + “verbose %d”
argument parsing codes using “%=”.

The phrase “bitzero %\|” means “set the specified bit in the supplied
bitmask” if unformat parses “bitzero”. Although it looks like it could
be fairly handy, it’s very lightly used in the code base.

``%_`` toggles whether or not to skip input white space.

For transition from skip to no-skip in middle of format string, skip
input white space. For example, the following:

.. code:: c

   fmt = "%_%d.%d%_->%_%d.%d%_"
   unformat (input, fmt, &one, &two, &three, &four);

matches input “1.2 -> 3.4”. Without this, the space after -> does not
get skipped.


How to parse a single input line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Debug CLI command functions MUST NOT accidentally consume input
belonging to other debug CLI commands. Otherwise, it's impossible to
script a set of debug CLI commands which "work fine" when issued one
at a time.

This bit of code is NOT correct:

.. code:: c

     /* Eats script input NOT beloging to it, and chokes! */
     while (unformat_check_input (input) != UNFORMAT_END_OF_INPUT)
       {
         if (unformat (input, ...))
       ;
         else if (unformat (input, ...))
       ;
         else
           return clib_error_return (0, "parse error: '%U'",
                                format_unformat_error, input);
       }
       }

When executed as part of a script, such a function will return “parse
error: ‘’” every time, unless it happens to be the last command in the
script.

Instead, use “unformat_line_input” to consume the rest of a line’s worth
of input - everything past the path specified in the VLIB_CLI_COMMAND
declaration.

For example, unformat_line_input with “my_command” set up as shown below
and user input “my path is clear” will produce an unformat_input_t that
contains “is clear”.

.. code:: c

       VLIB_CLI_COMMAND (...) = {
           .path = "my path",
       };

Here’s a bit of code which shows the required mechanics, in full:

.. code:: c

       static clib_error_t *
       my_command_fn (vlib_main_t * vm,
                      unformat_input_t * input,
                      vlib_cli_command_t * cmd)
       {
         unformat_input_t _line_input, *line_input = &_line_input;
         u32 this, that;
         clib_error_t *error = 0;

         if (!unformat_user (input, unformat_line_input, line_input))
           return 0;

         /*
          * Here, UNFORMAT_END_OF_INPUT is at the end of the line we consumed,
          * not at the end of the script...
          */
         while (unformat_check_input (line_input) != UNFORMAT_END_OF_INPUT)
           {
              if (unformat (line_input, "this %u", &this))
                ;
              else if (unformat (line_input, "that %u", &that))
                ;
              else
                {
                  error = clib_error_return (0, "parse error: '%U'",
                                    format_unformat_error, line_input);
                  goto done;
                }
             }

       <do something based on "this" and "that", etc>

       done:
         unformat_free (line_input);
         return error;
       }
      VLIB_CLI_COMMAND (my_command, static) = {
        .path = "my path",
        .function = my_command_fn",
      };

Vppinfra errors and warnings
----------------------------

Many functions within the vpp dataplane have return-values of type
clib_error_t \*. Clib_error_t’s are arbitrary strings with a bit of
metadata [fatal, warning] and are easy to announce. Returning a NULL
clib_error_t \* indicates “A-OK, no error.”

Clib_warning(format-args) is a handy way to add debugging output; clib
warnings prepend function:line info to unambiguously locate the message
source. Clib_unix_warning() adds perror()-style Linux system-call
information. In production images, clib_warnings result in syslog
entries.

Serialization
-------------

Vppinfra serialization support allows the programmer to easily serialize
and unserialize complex data structures.

The underlying primitive serialize/unserialize functions use network
byte-order, so there are no structural issues serializing on a
little-endian host and unserializing on a big-endian host.