summaryrefslogtreecommitdiffstats
path: root/docs/developer/corearchitecture/infrastructure.rst
diff options
context:
space:
mode:
Diffstat (limited to 'docs/developer/corearchitecture/infrastructure.rst')
-rw-r--r--docs/developer/corearchitecture/infrastructure.rst612
1 files changed, 612 insertions, 0 deletions
diff --git a/docs/developer/corearchitecture/infrastructure.rst b/docs/developer/corearchitecture/infrastructure.rst
new file mode 100644
index 00000000000..b4e1065f81e
--- /dev/null
+++ b/docs/developer/corearchitecture/infrastructure.rst
@@ -0,0 +1,612 @@
+VPPINFRA (Infrastructure)
+=========================
+
+The files associated with the VPP Infrastructure layer are located in
+the ``./src/vppinfra`` folder.
+
+VPPinfra is a collection of basic c-library services, quite sufficient
+to build standalone programs to run directly on bare metal. It also
+provides high-performance dynamic arrays, hashes, bitmaps,
+high-precision real-time clock support, fine-grained event-logging, and
+data structure serialization.
+
+One fair comment / fair warning about vppinfra: you can't always tell a
+macro from an inline function from an ordinary function simply by name.
+Macros are used to avoid function calls in the typical case, and to
+cause (intentional) side-effects.
+
+Vppinfra has been around for almost 20 years and tends not to change
+frequently. The VPP Infrastructure layer contains the following
+functions:
+
+Vectors
+-------
+
+Vppinfra vectors are ubiquitous dynamically resized arrays with by user
+defined "headers". Many vpppinfra data structures (e.g. hash, heap,
+pool) are vectors with various different headers.
+
+The memory layout looks like this:
+
+::
+
+ User header (optional, uword aligned)
+ Alignment padding (if needed)
+ Vector length in elements
+ User's pointer -> Vector element 0
+ Vector element 1
+ ...
+ Vector element N-1
+
+As shown above, the vector APIs deal with pointers to the 0th element of
+a vector. Null pointers are valid vectors of length zero.
+
+To avoid thrashing the memory allocator, one often resets the length of
+a vector to zero while retaining the memory allocation. Set the vector
+length field to zero via the vec_reset_length(v) macro. [Use the macro!
+It’s smart about NULL pointers.]
+
+Typically, the user header is not present. User headers allow for other
+data structures to be built atop vppinfra vectors. Users may specify the
+alignment for first data element of a vector via the [vec]()*_aligned
+macros.
+
+Vector elements can be any C type e.g. (int, double, struct bar). This
+is also true for data types built atop vectors (e.g. heap, pool, etc.).
+Many macros have \_a variants supporting alignment of vector elements
+and \_h variants supporting non-zero-length vector headers. The \_ha
+variants support both. Additionally cacheline alignment within a vector
+element structure can be specified using the
+``[CLIB_CACHE_LINE_ALIGN_MARK]()`` macro.
+
+Inconsistent usage of header and/or alignment related macro variants
+will cause delayed, confusing failures.
+
+Standard programming error: memorize a pointer to the ith element of a
+vector, and then expand the vector. Vectors expand by 3/2, so such code
+may appear to work for a period of time. Correct code almost always
+memorizes vector **indices** which are invariant across reallocations.
+
+In typical application images, one supplies a set of global functions
+designed to be called from gdb. Here are a few examples:
+
+- vl(v) - prints vec_len(v)
+- pe(p) - prints pool_elts(p)
+- pifi(p, index) - prints pool_is_free_index(p, index)
+- debug_hex_bytes (p, nbytes) - hex memory dump nbytes starting at p
+
+Use the “show gdb” debug CLI command to print the current set.
+
+Bitmaps
+-------
+
+Vppinfra bitmaps are dynamic, built using the vppinfra vector APIs.
+Quite handy for a variety jobs.
+
+Pools
+-----
+
+Vppinfra pools combine vectors and bitmaps to rapidly allocate and free
+fixed-size data structures with independent lifetimes. Pools are perfect
+for allocating per-session structures.
+
+Hashes
+------
+
+Vppinfra provides several hash flavors. Data plane problems involving
+packet classification / session lookup often use
+./src/vppinfra/bihash_template.[ch] bounded-index extensible hashes.
+These templates are instantiated multiple times, to efficiently service
+different fixed-key sizes.
+
+Bihashes are thread-safe. Read-locking is not required. A simple
+spin-lock ensures that only one thread writes an entry at a time.
+
+The original vppinfra hash implementation in ./src/vppinfra/hash.[ch]
+are simple to use, and are often used in control-plane code which needs
+exact-string-matching.
+
+In either case, one almost always looks up a key in a hash table to
+obtain an index in a related vector or pool. The APIs are simple enough,
+but one must take care when using the unmanaged arbitrary-sized key
+variant. Hash_set_mem (hash_table, key_pointer, value) memorizes
+key_pointer. It is usually a bad mistake to pass the address of a vector
+element as the second argument to hash_set_mem. It is perfectly fine to
+memorize constant string addresses in the text segment.
+
+Timekeeping
+-----------
+
+Vppinfra includes high-precision, low-cost timing services. The datatype
+clib_time_t and associated functions reside in ./src/vppinfra/time.[ch].
+Call clib_time_init (clib_time_t \*cp) to initialize the clib_time_t
+object.
+
+Clib_time_init(…) can use a variety of different ways to establish the
+hardware clock frequency. At the end of the day, vppinfra timekeeping
+takes the attitude that the operating system’s clock is the closest
+thing to a gold standard it has handy.
+
+When properly configured, NTP maintains kernel clock synchronization
+with a highly accurate off-premises reference clock. Notwithstanding
+network propagation delays, a synchronized NTP client will keep the
+kernel clock accurate to within 50ms or so.
+
+Why should one care? Simply put, oscillators used to generate CPU ticks
+aren’t super accurate. They work pretty well, but a 0.1% error wouldn’t
+be out of the question. That’s a minute and a half’s worth of error in 1
+day. The error changes constantly, due to temperature variation, and a
+host of other physical factors.
+
+It’s far too expensive to use system calls for timing, so we’re left
+with the problem of continuously adjusting our view of the CPU tick
+register’s clocks_per_second parameter.
+
+The clock rate adjustment algorithm measures the number of cpu ticks and
+the “gold standard” reference time across an interval of approximately
+16 seconds. We calculate clocks_per_second for the interval: use rdtsc
+(on x86_64) and a system call to get the latest cpu tick count and the
+kernel’s latest nanosecond timestamp. We subtract the previous interval
+end values, and use exponential smoothing to merge the new clock rate
+sample into the clocks_per_second parameter.
+
+As of this writing, we maintain the clock rate by way of the following
+first-order differential equation:
+
+.. code:: c
+
+ clocks_per_second(t) = clocks_per_second(t-1) * K + sample_cps(t)*(1-K)
+ where K = e**(-1.0/3.75);
+
+This yields a per observation “half-life” of 1 minute. Empirically, the
+clock rate converges within 5 minutes, and appears to maintain
+near-perfect agreement with the kernel clock in the face of ongoing NTP
+time adjustments.
+
+See ./src/vppinfra/time.c:clib_time_verify_frequency(…) to look at the
+rate adjustment algorithm. The code rejects frequency samples
+corresponding to the sort of adjustment which might occur if someone
+changes the gold standard kernel clock by several seconds.
+
+Monotonic timebase support
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Particularly during system initialization, the “gold standard” system
+reference clock can change by a large amount, in an instant. It’s not a
+best practice to yank the reference clock - in either direction - by
+hours or days. In fact, some poorly-constructed use-cases do so.
+
+To deal with this reality, clib_time_now(…) returns the number of
+seconds since vpp started, *guaranteed to be monotonically increasing,
+no matter what happens to the system reference clock*.
+
+This is first-order important, to avoid breaking every active timer in
+the system. The vpp host stack alone may account for tens of millions of
+active timers. It’s utterly impractical to track down and fix timers, so
+we must deal with the issue at the timebase level.
+
+Here’s how it works. Prior to adjusting the clock rate, we collect the
+kernel reference clock and the cpu clock:
+
+.. code:: c
+
+ /* Ask the kernel and the CPU what time it is... */
+ now_reference = unix_time_now ();
+ now_clock = clib_cpu_time_now ();
+
+Compute changes for both clocks since the last rate adjustment, roughly
+15 seconds ago:
+
+.. code:: c
+
+ /* Compute change in the reference clock */
+ delta_reference = now_reference - c->last_verify_reference_time;
+
+ /* And change in the CPU clock */
+ delta_clock_in_seconds = (f64) (now_clock - c->last_verify_cpu_time) *
+ c->seconds_per_clock;
+
+Delta_reference is key. Almost 100% of the time, delta_reference and
+delta_clock_in_seconds are identical modulo one system-call time.
+However, NTP or a privileged user can yank the system reference time -
+in either direction - by an hour, a day, or a decade.
+
+As described above, clib_time_now(…) must return monotonically
+increasing answers to the question “how long has it been since vpp
+started, in seconds.” To do that, the clock rate adjustment algorithm
+begins by recomputing the initial reference time:
+
+.. code:: c
+
+ c->init_reference_time += (delta_reference - delta_clock_in_seconds);
+
+It’s easy to convince yourself that if the reference clock changes by
+15.000000 seconds and the cpu clock tick time changes by 15.000000
+seconds, the initial reference time won’t change.
+
+If, on the other hand, delta_reference is -86400.0 and delta clock is
+15.0 - reference time jumped backwards by exactly one day in a 15-second
+rate update interval - we add -86415.0 to the initial reference time.
+
+Given the corrected initial reference time, we recompute the total
+number of cpu ticks which have occurred since the corrected initial
+reference time, at the current clock tick rate:
+
+.. code:: c
+
+ c->total_cpu_time = (now_reference - c->init_reference_time)
+ * c->clocks_per_second;
+
+Timebase precision
+~~~~~~~~~~~~~~~~~~
+
+Cognoscenti may notice that vlib/clib_time_now(…) return a 64-bit
+floating-point value; the number of seconds since vpp started.
+
+Please see `this Wikipedia
+article <https://en.wikipedia.org/wiki/Double-precision_floating-point_format>`__
+for more information. C double-precision floating point numbers (called
+f64 in the vpp code base) have a 53-bit effective mantissa, and can
+accurately represent 15 decimal digits’ worth of precision.
+
+There are 315,360,000.000001 seconds in ten years plus one microsecond.
+That string has exactly 15 decimal digits. The vpp time base retains 1us
+precision for roughly 30 years.
+
+vlib/clib_time_now do *not* provide precision in excess of 1e-6 seconds.
+If necessary, please use clib_cpu_time_now(…) for direct access to the
+CPU clock-cycle counter. Note that the number of CPU clock cycles per
+second varies significantly across CPU architectures.
+
+Timer Wheels
+------------
+
+Vppinfra includes configurable timer wheel support. See the source code
+in …/src/vppinfra/tw_timer_template.[ch], as well as a considerable
+number of template instances defined in …/src/vppinfra/tw_timer\_.[ch].
+
+Instantiation of tw_timer_template.h generates named structures to
+implement specific timer wheel geometries. Choices include: number of
+timer wheels (currently, 1 or 2), number of slots per ring (a power of
+two), and the number of timers per “object handle”.
+
+Internally, user object/timer handles are 32-bit integers, so if one
+selects 16 timers/object (4 bits), the resulting timer wheel handle is
+limited to 2**28 objects.
+
+Here are the specific settings required to generate a single 2048 slot
+wheel which supports 2 timers per object:
+
+.. code:: c
+
+ #define TW_TIMER_WHEELS 1
+ #define TW_SLOTS_PER_RING 2048
+ #define TW_RING_SHIFT 11
+ #define TW_RING_MASK (TW_SLOTS_PER_RING -1)
+ #define TW_TIMERS_PER_OBJECT 2
+ #define LOG2_TW_TIMERS_PER_OBJECT 1
+ #define TW_SUFFIX _2t_1w_2048sl
+ #define TW_FAST_WHEEL_BITMAP 0
+ #define TW_TIMER_ALLOW_DUPLICATE_STOP 0
+
+See tw_timer_2t_1w_2048sl.h for a complete example.
+
+tw_timer_template.h is not intended to be #included directly. Client
+codes can include multiple timer geometry header files, although extreme
+caution would required to use the TW and TWT macros in such a case.
+
+API usage examples
+~~~~~~~~~~~~~~~~~~
+
+The unit test code in …/src/vppinfra/test_tw_timer.c provides a concrete
+API usage example. It uses a synthetic clock to rapidly exercise the
+underlying tw_timer_expire_timers(…) template.
+
+There are not many API routines to call.
+
+Initialize a two-timer, single 2048-slot wheel w/ a 1-second timer granularity
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: c
+
+ tw_timer_wheel_init_2t_1w_2048sl (&tm->single_wheel,
+ expired_timer_single_callback,
+ 1.0 / * timer interval * / );
+
+Start a timer
+^^^^^^^^^^^^^
+
+.. code:: c
+
+ handle = tw_timer_start_2t_1w_2048sl (&tm->single_wheel, elt_index,
+ [0 | 1] / * timer id * / ,
+ expiration_time_in_u32_ticks);
+
+Stop a timer
+^^^^^^^^^^^^
+
+.. code:: c
+
+ tw_timer_stop_2t_1w_2048sl (&tm->single_wheel, handle);
+
+An expired timer callback
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: c
+
+ static void
+ expired_timer_single_callback (u32 * expired_timers)
+ {
+ int i;
+ u32 pool_index, timer_id;
+ tw_timer_test_elt_t *e;
+ tw_timer_test_main_t *tm = &tw_timer_test_main;
+
+ for (i = 0; i < vec_len (expired_timers);
+ {
+ pool_index = expired_timers[i] & 0x7FFFFFFF;
+ timer_id = expired_timers[i] >> 31;
+
+ ASSERT (timer_id == 1);
+
+ e = pool_elt_at_index (tm->test_elts, pool_index);
+
+ if (e->expected_to_expire != tm->single_wheel.current_tick)
+ {
+ fformat (stdout, "[%d] expired at %d not %d\n",
+ e - tm->test_elts, tm->single_wheel.current_tick,
+ e->expected_to_expire);
+ }
+ pool_put (tm->test_elts, e);
+ }
+ }
+
+We use wheel timers extensively in the vpp host stack. Each TCP session
+needs 5 timers, so supporting 10 million flows requires up to 50 million
+concurrent timers.
+
+Timers rarely expire, so it’s of utmost important that stopping and
+restarting a timer costs as few clock cycles as possible.
+
+Stopping a timer costs a doubly-linked list dequeue. Starting a timer
+involves modular arithmetic to determine the correct timer wheel and
+slot, and a list head enqueue.
+
+Expired timer processing generally involves bulk link-list retirement
+with user callback presentation. Some additional complexity at wheel
+wrap time, to relocate timers from slower-turning timer wheels into
+faster-turning wheels.
+
+Format
+------
+
+Vppinfra format is roughly equivalent to printf.
+
+Format has a few properties worth mentioning. Format’s first argument is
+a (u8 \*) vector to which it appends the result of the current format
+operation. Chaining calls is very easy:
+
+.. code:: c
+
+ u8 * result;
+
+ result = format (0, "junk = %d, ", junk);
+ result = format (result, "more junk = %d\n", more_junk);
+
+As previously noted, NULL pointers are perfectly proper 0-length
+vectors. Format returns a (u8 \*) vector, **not** a C-string. If you
+wish to print a (u8 \*) vector, use the “%v” format string. If you need
+a (u8 \*) vector which is also a proper C-string, either of these
+schemes may be used:
+
+.. code:: c
+
+ vec_add1 (result, 0)
+ or
+ result = format (result, "<whatever>%c", 0);
+
+Remember to vec_free() the result if appropriate. Be careful not to pass
+format an uninitialized (u8 \*).
+
+Format implements a particularly handy user-format scheme via the “%U”
+format specification. For example:
+
+.. code:: c
+
+ u8 * format_junk (u8 * s, va_list *va)
+ {
+ junk = va_arg (va, u32);
+ s = format (s, "%s", junk);
+ return s;
+ }
+
+ result = format (0, "junk = %U, format_junk, "This is some junk");
+
+format_junk() can invoke other user-format functions if desired. The
+programmer shoulders responsibility for argument type-checking. It is
+typical for user format functions to blow up spectacularly if the
+va_arg(va, type) macros don’t match the caller’s idea of reality.
+
+Unformat
+--------
+
+Vppinfra unformat is vaguely related to scanf, but considerably more
+general.
+
+A typical use case involves initializing an unformat_input_t from either
+a C-string or a (u8 \*) vector, then parsing via unformat() as follows:
+
+.. code:: c
+
+ unformat_input_t input;
+ u8 *s = "<some-C-string>";
+
+ unformat_init_string (&input, (char *) s, strlen((char *) s));
+ /* or */
+ unformat_init_vector (&input, <u8-vector>);
+
+Then loop parsing individual elements:
+
+.. code:: c
+
+ while (unformat_check_input (&input) != UNFORMAT_END_OF_INPUT)
+ {
+ if (unformat (&input, "value1 %d", &value1))
+ ;/* unformat sets value1 */
+ else if (unformat (&input, "value2 %d", &value2)
+ ;/* unformat sets value2 */
+ else
+ return clib_error_return (0, "unknown input '%U'",
+ format_unformat_error, input);
+ }
+
+As with format, unformat implements a user-unformat function capability
+via a “%U” user unformat function scheme. Generally, one can trivially
+transform “format (s,”foo %d”, foo) -> “unformat (input,”foo %d”,
+&foo)“.
+
+Unformat implements a couple of handy non-scanf-like format specifiers:
+
+.. code:: c
+
+ unformat (input, "enable %=", &enable, 1 /* defaults to 1 */);
+ unformat (input, "bitzero %|", &mask, (1<<0));
+ unformat (input, "bitone %|", &mask, (1<<1));
+ <etc>
+
+The phrase “enable %=” means “set the supplied variable to the default
+value” if unformat parses the “enable” keyword all by itself. If
+unformat parses “enable 123” set the supplied variable to 123.
+
+We could clean up a number of hand-rolled “verbose” + “verbose %d”
+argument parsing codes using “%=”.
+
+The phrase “bitzero %\|” means “set the specified bit in the supplied
+bitmask” if unformat parses “bitzero”. Although it looks like it could
+be fairly handy, it’s very lightly used in the code base.
+
+``%_`` toggles whether or not to skip input white space.
+
+For transition from skip to no-skip in middle of format string, skip
+input white space. For example, the following:
+
+.. code:: c
+
+ fmt = "%_%d.%d%_->%_%d.%d%_"
+ unformat (input, fmt, &one, &two, &three, &four);
+
+matches input “1.2 -> 3.4”. Without this, the space after -> does not
+get skipped.
+
+
+How to parse a single input line
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Debug CLI command functions MUST NOT accidentally consume input
+belonging to other debug CLI commands. Otherwise, it's impossible to
+script a set of debug CLI commands which "work fine" when issued one
+at a time.
+
+This bit of code is NOT correct:
+
+.. code:: c
+
+ /* Eats script input NOT beloging to it, and chokes! */
+ while (unformat_check_input (input) != UNFORMAT_END_OF_INPUT)
+ {
+ if (unformat (input, ...))
+ ;
+ else if (unformat (input, ...))
+ ;
+ else
+ return clib_error_return (0, "parse error: '%U'",
+ format_unformat_error, input);
+ }
+ }
+
+When executed as part of a script, such a function will return “parse
+error: ‘’” every time, unless it happens to be the last command in the
+script.
+
+Instead, use “unformat_line_input” to consume the rest of a line’s worth
+of input - everything past the path specified in the VLIB_CLI_COMMAND
+declaration.
+
+For example, unformat_line_input with “my_command” set up as shown below
+and user input “my path is clear” will produce an unformat_input_t that
+contains “is clear”.
+
+.. code:: c
+
+ VLIB_CLI_COMMAND (...) = {
+ .path = "my path",
+ };
+
+Here’s a bit of code which shows the required mechanics, in full:
+
+.. code:: c
+
+ static clib_error_t *
+ my_command_fn (vlib_main_t * vm,
+ unformat_input_t * input,
+ vlib_cli_command_t * cmd)
+ {
+ unformat_input_t _line_input, *line_input = &_line_input;
+ u32 this, that;
+ clib_error_t *error = 0;
+
+ if (!unformat_user (input, unformat_line_input, line_input))
+ return 0;
+
+ /*
+ * Here, UNFORMAT_END_OF_INPUT is at the end of the line we consumed,
+ * not at the end of the script...
+ */
+ while (unformat_check_input (line_input) != UNFORMAT_END_OF_INPUT)
+ {
+ if (unformat (line_input, "this %u", &this))
+ ;
+ else if (unformat (line_input, "that %u", &that))
+ ;
+ else
+ {
+ error = clib_error_return (0, "parse error: '%U'",
+ format_unformat_error, line_input);
+ goto done;
+ }
+ }
+
+ <do something based on "this" and "that", etc>
+
+ done:
+ unformat_free (line_input);
+ return error;
+ }
+ VLIB_CLI_COMMAND (my_command, static) = {
+ .path = "my path",
+ .function = my_command_fn",
+ };
+
+Vppinfra errors and warnings
+----------------------------
+
+Many functions within the vpp dataplane have return-values of type
+clib_error_t \*. Clib_error_t’s are arbitrary strings with a bit of
+metadata [fatal, warning] and are easy to announce. Returning a NULL
+clib_error_t \* indicates “A-OK, no error.”
+
+Clib_warning(format-args) is a handy way to add debugging output; clib
+warnings prepend function:line info to unambiguously locate the message
+source. Clib_unix_warning() adds perror()-style Linux system-call
+information. In production images, clib_warnings result in syslog
+entries.
+
+Serialization
+-------------
+
+Vppinfra serialization support allows the programmer to easily serialize
+and unserialize complex data structures.
+
+The underlying primitive serialize/unserialize functions use network
+byte-order, so there are no structural issues serializing on a
+little-endian host and unserializing on a big-endian host.