author | Nathan Skrzypczak <nathan.skrzypczak@gmail.com> | 2021-08-19 11:38:06 +0200
---|---|---
committer | Dave Wallace <dwallacelf@gmail.com> | 2021-10-13 23:22:32 +0000
commit | 9ad39c026c8a3c945a7003c4aa4f5cb1d4c80160 (patch) |
tree | 3cca19635417e28ae381d67ae31c75df2925032d /docs/developer/corearchitecture |
parent | f47122e07e1ecd0151902a3cabe46c60a99bee8e (diff) |
docs: better docs, mv doxygen to sphinx
This patch refactors the VPP sphinx docs
in order to make it easier to consume
for external readers as well as VPP developers.
It also makes sphinx the single source
of documentation, which simplifies maintenance
and operation.
Most important updates are:
- reformat the existing documentation as rst
- split RELEASE.md and move it into separate rst files
- remove section 'events'
- remove section 'archive'
- remove section 'related projects'
- remove section 'feature by release'
- remove section 'Various links'
- make (Configuration reference, CLI docs,
developer docs) top level items in the list
- move 'Use Cases' as part of 'About VPP'
- move 'Troubleshooting' as part of 'Getting Started'
- move test framework docs into 'Developer Documentation'
- add a 'Contributing' section for gerrit,
docs and other contributor-related info
- deprecate doxygen and test-docs targets
- redirect the "make doxygen" target to "make docs"
Type: refactor
Change-Id: I552a5645d5b7964d547f99b1336e2ac24e7c209f
Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com>
Signed-off-by: Andrew Yourtchenko <ayourtch@gmail.com>
Diffstat (limited to 'docs/developer/corearchitecture')
17 files changed, 4112 insertions, 0 deletions
diff --git a/docs/developer/corearchitecture/bihash.rst b/docs/developer/corearchitecture/bihash.rst new file mode 100644 index 00000000000..9b62baaf9cf --- /dev/null +++ b/docs/developer/corearchitecture/bihash.rst @@ -0,0 +1,313 @@ +Bounded-index Extensible Hashing (bihash) +========================================= + +Vpp uses bounded-index extensible hashing to solve a variety of +exact-match (key, value) lookup problems. Benefits of the current +implementation: + +- Very high record count scaling, tested to 100,000,000 records. +- Lookup performance degrades gracefully as the number of records + increases +- No reader locking required +- Template implementation, it’s easy to support arbitrary (key,value) + types + +Bounded-index extensible hashing has been widely used in databases for +decades. + +Bihash uses a two-level data structure: + +:: + + +-----------------+ + | bucket-0 | + | log2_size | + | backing store | + +-----------------+ + | bucket-1 | + | log2_size | +--------------------------------+ + | backing store | --------> | KVP_PER_PAGE * key-value-pairs | + +-----------------+ | page 0 | + ... +--------------------------------+ + +-----------------+ | KVP_PER_PAGE * key-value-pairs | + | bucket-2**N-1 | | page 1 | + | log2_size | +--------------------------------+ + | backing store | --- + +-----------------+ +--------------------------------+ + | KVP_PER_PAGE * key-value-pairs | + | page 2**(log2(size)) - 1 | + +--------------------------------+ + +Discussion of the algorithm +--------------------------- + +This structure has a couple of major advantages. In practice, each +bucket entry fits into a 64-bit integer. Coincidentally, vpp’s target +CPU architectures support 64-bit atomic operations. 
When modifying the +contents of a specific bucket, we do the following: + +- Make a working copy of the bucket’s backing storage +- Atomically swap a pointer to the working copy into the bucket array +- Change the original backing store data +- Atomically swap back to the original + +So, no reader locking is required to search a bihash table. + +At lookup time, the implementation computes a key hash code. We use the +least-significant N bits of the hash to select the bucket. + +With the bucket in hand, we learn log2 (nBackingPages) for the selected +bucket. At this point, we use the next log2_size bits from the hash code +to select the specific backing page in which the (key,value) page will +be found. + +Net result: we search **one** backing page, not 2**log2_size pages. This +is a key property of the algorithm. + +When sufficient collisions occur to fill the backing pages for a given +bucket, we double the bucket size, rehash, and deal the bucket contents +into a double-sized set of backing pages. In the future, we may +represent the size as a linear combination of two powers-of-two, to +increase space efficiency. + +To solve the “jackpot case” where a set of records collide under hashing +in a bad way, the implementation will fall back to linear search across +2**log2_size backing pages on a per-bucket basis. + +To maintain *space* efficiency, we should configure the bucket array so +that backing pages are effectively utilized. Lookup performance tends to +change *very little* if the bucket array is too small or too large. + +Bihash depends on selecting an effective hash function. If one were to +use a truly broken hash function such as “return 1ULL.” bihash would +still work, but it would be equivalent to poorly-programmed linear +search. + +We often use cpu intrinsic functions - think crc32 - to rapidly compute +a hash code which has decent statistics. 
+ +Bihash Cookbook +--------------- + +Using current (key,value) template instance types +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It’s quite easy to use one of the template instance types. As of this +writing, …/src/vppinfra provides pre-built templates for 8, 16, 20, 24, +40, and 48 byte keys, u8 \* vector keys, and 8 byte values. + +See …/src/vppinfra/{bihash\_\_8}.h + +To define the data types, #include a specific template instance, most +often in a subsystem header file: + +.. code:: c + + #include <vppinfra/bihash_8_8.h> + +If you’re building a standalone application, you’ll need to define the +various functions by #including the method implementation file in a C +source file. + +The core vpp engine currently uses most if not all of the known bihash +types, so you probably won’t need to #include the method implementation +file. + +.. code:: c + + #include <vppinfra/bihash_template.c> + +Add an instance of the selected bihash data structure to e.g. a “main_t” +structure: + +.. code:: c + + typedef struct + { + ... + BVT (clib_bihash) hash_table; + or + clib_bihash_8_8_t hash_table; + ... + } my_main_t; + +The BV macro concatenate its argument with the value of the preprocessor +symbol BIHASH_TYPE. The BVT macro concatenates its argument with the +value of BIHASH_TYPE and the fixed-string “_t”. So in the above example, +BVT (clib_bihash) generates “clib_bihash_8_8_t”. + +If you’re sure you won’t decide to change the template / type name +later, it’s perfectly OK to code “clib_bihash_8_8_t” and so forth. + +In fact, if you #include multiple template instances in a single source +file, you **must** use fully-enumerated type names. The macros stand no +chance of working. + +Initializing a bihash table +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Call the init function as shown. As a rough guide, pick a number of +buckets which is approximately +number_of_expected_records/BIHASH_KVP_PER_PAGE from the relevant +template instance header-file. See previous discussion. 
+ +The amount of memory selected should easily contain all of the records, +with a generous allowance for hash collisions. Bihash memory is +allocated separately from the main heap, and won’t cost anything except +kernel PTE’s until touched, so it’s OK to be reasonably generous. + +For example: + +.. code:: c + + my_main_t *mm = &my_main; + clib_bihash_8_8_t *h; + + h = &mm->hash_table; + + clib_bihash_init_8_8 (h, "test", (u32) number_of_buckets, + (uword) memory_size); + +Add or delete a key/value pair +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Use BV(clib_bihash_add_del), or the explicit type variant: + +.. code:: c + + clib_bihash_kv_8_8_t kv; + clib_bihash_8_8_t * h; + my_main_t *mm = &my_main; + clib_bihash_8_8_t *h; + + h = &mm->hash_table; + kv.key = key_to_add_or_delete; + kv.value = value_to_add_or_delete; + + clib_bihash_add_del_8_8 (h, &kv, is_add /* 1=add, 0=delete */); + +In the delete case, kv.value is irrelevant. To change the value +associated with an existing (key,value) pair, simply re-add the [new] +pair. + +Simple search +~~~~~~~~~~~~~ + +The simplest possible (key, value) search goes like so: + +.. code:: c + + clib_bihash_kv_8_8_t search_kv, return_kv; + clib_bihash_8_8_t * h; + my_main_t *mm = &my_main; + clib_bihash_8_8_t *h; + + h = &mm->hash_table; + search_kv.key = key_to_add_or_delete; + + if (clib_bihash_search_8_8 (h, &search_kv, &return_kv) < 0) + key_not_found(); + else + key_found(); + +Note that it’s perfectly fine to collect the lookup result + +.. code:: c + + if (clib_bihash_search_8_8 (h, &search_kv, &search_kv)) + key_not_found(); + etc. + +Bihash vector processing +~~~~~~~~~~~~~~~~~~~~~~~~ + +When processing a vector of packets which need a certain lookup +performed, it’s worth the trouble to compute the key hash, and prefetch +the correct bucket ahead of time. 
+ +Here’s a sketch of one way to write the required code: + +Dual-loop: \* 6 packets ahead, prefetch 2x vlib_buffer_t’s and 2x packet +data required to form the record keys \* 4 packets ahead, form 2x record +keys and call BV(clib_bihash_hash) or the explicit hash function to +calculate the record hashes. Call 2x BV(clib_bihash_prefetch_bucket) to +prefetch the buckets \* 2 packets ahead, call 2x +BV(clib_bihash_prefetch_data) to prefetch 2x (key,value) data pages. \* +In the processing section, call 2x +BV(clib_bihash_search_inline_with_hash) to perform the search + +Programmer’s choice whether to stash the hash code somewhere in +vnet_buffer(b) metadata, or to use local variables. + +Single-loop: \* Use simple search as shown above. + +Walking a bihash table +~~~~~~~~~~~~~~~~~~~~~~ + +A fairly common scenario to build “show” commands involves walking a +bihash table. It’s simple enough: + +.. code:: c + + my_main_t *mm = &my_main; + clib_bihash_8_8_t *h; + void callback_fn (clib_bihash_kv_8_8_t *, void *); + + h = &mm->hash_table; + + BV(clib_bihash_foreach_key_value_pair) (h, callback_fn, (void *) arg); + +To nobody’s great surprise: clib_bihash_foreach_key_value_pair iterates +across the entire table, calling callback_fn with active entries. + +Bihash table iteration safety +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The iterator template “clib_bihash_foreach_key_value_pair” must be used +with a certain amount of care. For one thing, the iterator template does +*not* take the bihash hash table writer lock. If your use-case requires +it, lock the table. + +For another, the iterator template is not safe under all conditions: + +- It’s **OK to delete** bihash table entries during a table-walk. The + iterator checks whether the current bucket has been freed after each + *callback_fn(…)* invocation. + +- It is **not OK to add** entries during a table-walk. 
+ +The add-during-walk case involves a jackpot: while processing a +key-value-pair in a particular bucket, add a certain number of entries. +By luck, assume that one or more of the added entries causes the +**current bucket** to split-and-rehash. + +Since we rehash KVP’s to different pages based on what amounts to a +different hash function, either of these things can go wrong: + +- We may revisit previously-visited entries. Depending on how one coded + the use-case, we could end up in a recursive-add situation. + +- We may skip entries that have not been visited + +One could build an add-safe iterator, at a significant cost in +performance: copy the entire bucket, and walk the copy. + +It’s hard to imagine a worthwhile add-during walk use-case in the first +place; let alone one which couldn’t be implemented by walking the table +without modifying it, then adding a set of records. + +Creating a new template instance +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Creating a new template is easy. Use one of the existing templates as a +model, and make the obvious changes. The hash and key_compare methods +are performance-critical in multiple senses. + +If the key compare method is slow, every lookup will be slow. If the +hash function is slow, same story. If the hash function has poor +statistical properties, space efficiency will suffer. In the limit, a +bad enough hash function will cause large portions of the table to +revert to linear search. + +Use of the best available vector unit is well worth the trouble in the +hash and key_compare functions. diff --git a/docs/developer/corearchitecture/buffer_metadata.rst b/docs/developer/corearchitecture/buffer_metadata.rst new file mode 100644 index 00000000000..545c31f3041 --- /dev/null +++ b/docs/developer/corearchitecture/buffer_metadata.rst @@ -0,0 +1,237 @@ +Buffer Metadata +=============== + +Each vlib_buffer_t (packet buffer) carries buffer metadata which +describes the current packet-processing state. 
The underlying techniques +have been used for decades, across multiple packet processing +environments. + +We will examine vpp buffer metadata in some detail, but folks who need +to manipulate and/or extend the scheme should expect to do a certain +level of code inspection. + +Vlib (Vector library) primary buffer metadata +--------------------------------------------- + +The first 64 octets of each vlib_buffer_t carries the primary buffer +metadata. See …/src/vlib/buffer.h for full details. + +Important fields: + +- i16 current_data: the signed offset in data[], pre_data[] that we are + currently processing. If negative current header points into the + pre-data (rewrite space) area. +- u16 current_length: nBytes between current_data and the end of this + buffer. +- u32 flags: Buffer flag bits. Heavily used, not many bits left + + - src/vlib/buffer.h flag bits + + - VLIB_BUFFER_IS_TRACED: buffer is traced + - VLIB_BUFFER_NEXT_PRESENT: buffer has multiple chunks + - VLIB_BUFFER_TOTAL_LENGTH_VALID: + total_length_not_including_first_buffer is valid (see below) + + - src/vnet/buffer.h flag bits + + - VNET_BUFFER_F_L4_CHECKSUM_COMPUTED: tcp/udp checksum has been + computed + - VNET_BUFFER_F_L4_CHECKSUM_CORRECT: tcp/udp checksum is correct + - VNET_BUFFER_F_VLAN_2_DEEP: two vlan tags present + - VNET_BUFFER_F_VLAN_1_DEEP: one vlan tag present + - VNET_BUFFER_F_SPAN_CLONE: packet has already been cloned (span + feature) + - VNET_BUFFER_F_LOOP_COUNTER_VALID: packet look-up loop count + valid + - VNET_BUFFER_F_LOCALLY_ORIGINATED: packet built by vpp + - VNET_BUFFER_F_IS_IP4: packet is ipv4, for checksum offload + - VNET_BUFFER_F_IS_IP6: packet is ipv6, for checksum offload + - VNET_BUFFER_F_OFFLOAD_IP_CKSUM: hardware ip checksum offload + requested + - VNET_BUFFER_F_OFFLOAD_TCP_CKSUM: hardware tcp checksum offload + requested + - VNET_BUFFER_F_OFFLOAD_UDP_CKSUM: hardware udp checksum offload + requested + - VNET_BUFFER_F_IS_NATED: natted packet, skip input checks + - 
VNET_BUFFER_F_L2_HDR_OFFSET_VALID: L2 header offset valid + - VNET_BUFFER_F_L3_HDR_OFFSET_VALID: L3 header offset valid + - VNET_BUFFER_F_L4_HDR_OFFSET_VALID: L4 header offset valid + - VNET_BUFFER_F_FLOW_REPORT: packet is an ipfix packet + - VNET_BUFFER_F_IS_DVR: packet to be reinjected into the l2 + output path + - VNET_BUFFER_F_QOS_DATA_VALID: QoS data valid in + vnet_buffer_opaque2 + - VNET_BUFFER_F_GSO: generic segmentation offload requested + - VNET_BUFFER_F_AVAIL1: available bit + - VNET_BUFFER_F_AVAIL2: available bit + - VNET_BUFFER_F_AVAIL3: available bit + - VNET_BUFFER_F_AVAIL4: available bit + - VNET_BUFFER_F_AVAIL5: available bit + - VNET_BUFFER_F_AVAIL6: available bit + - VNET_BUFFER_F_AVAIL7: available bit + +- u32 flow_id: generic flow identifier +- u8 ref_count: buffer reference / clone count (e.g. for span + replication) +- u8 buffer_pool_index: buffer pool index which owns this buffer +- vlib_error_t (u16) error: error code for buffers enqueued to error + handler +- u32 next_buffer: buffer index of next buffer in chain. Only valid if + VLIB_BUFFER_NEXT_PRESENT is set +- union + + - u32 current_config_index: current index on feature arc + - u32 punt_reason: reason code once packet punted. Mutually + exclusive with current_config_index + +- u32 opaque[10]: primary vnet-layer opaque data (see below) +- END of first cache line / data initialized by the buffer allocator +- u32 trace_index: buffer’s index in the packet trace subsystem +- u32 total_length_not_including_first_buffer: see + VLIB_BUFFER_TOTAL_LENGTH_VALID above +- u32 opaque2[14]: secondary vnet-layer opaque data (see below) +- u8 pre_data[VLIB_BUFFER_PRE_DATA_SIZE]: rewrite space, often used to + prepend tunnel encapsulations +- u8 data[0]: buffer data received from the wire. Ordinarily, hardware + devices use b->data[0] as the DMA target but there are exceptions. Do + not write code which blindly assumes that packet data starts in + b->data[0]. Use vlib_buffer_get_current(…). 
+ +Vnet (network stack) primary buffer metadata +-------------------------------------------- + +Vnet primary buffer metadata occupies space reserved in the vlib opaque +field shown above, and has the type name vnet_buffer_opaque_t. +Ordinarily accessed using the vnet_buffer(b) macro. See +../src/vnet/buffer.h for full details. + +Important fields: + +- u32 sw_if_index[2]: RX and TX interface handles. At the ip lookup + stage, vnet_buffer(b)->sw_if_index[VLIB_TX] is interpreted as a FIB + index. +- i16 l2_hdr_offset: offset from b->data[0] of the packet L2 header. + Valid only if b->flags & VNET_BUFFER_F_L2_HDR_OFFSET_VALID is set +- i16 l3_hdr_offset: offset from b->data[0] of the packet L3 header. + Valid only if b->flags & VNET_BUFFER_F_L3_HDR_OFFSET_VALID is set +- i16 l4_hdr_offset: offset from b->data[0] of the packet L4 header. + Valid only if b->flags & VNET_BUFFER_F_L4_HDR_OFFSET_VALID is set +- u8 feature_arc_index: feature arc that the packet is currently + traversing +- union + + - ip + + - u32 adj_index[2]: adjacency from dest IP lookup in [VLIB_TX], + adjacency from source ip lookup in [VLIB_RX], set to ~0 until + source lookup done + - union + + - generic fields + - ICMP fields + - reassembly fields + + - mpls fields + - l2 bridging fields, only valid in the L2 path + - l2tpv3 fields + - l2 classify fields + - vnet policer fields + - MAP fields + - MAP-T fields + - ip fragmentation fields + - COP (whitelist/blacklist filter) fields + - LISP fields + - TCP fields + + - connection index + - sequence numbers + - header and data offsets + - data length + - flags + + - SCTP fields + - NAT fields + - u32 unused[6] + +Vnet (network stack) secondary buffer metadata +---------------------------------------------- + +Vnet primary buffer metadata occupies space reserved in the vlib opaque2 +field shown above, and has the type name vnet_buffer_opaque2_t. +Ordinarily accessed using the vnet_buffer2(b) macro. See +../src/vnet/buffer.h for full details. 
+ +Important fields: + +- qos fields + + - u8 bits + - u8 source + +- u8 loop_counter: used to detect and report internal forwarding loops +- group-based policy fields + + - u8 flags + - u16 sclass: the packet’s source class + +- u16 gso_size: L4 payload size, persists all the way to + interface-output in case GSO is not enabled +- u16 gso_l4_hdr_sz: size of the L4 protocol header +- union + + - packet trajectory tracer (largely deprecated) + + - u16 \*trajectory_trace; only #if VLIB_BUFFER_TRACE_TRAJECTORY > + 0 + + - packet generator + + - u64 pg_replay_timestamp: timestamp for replayed pcap trace + packets + + - u32 unused[8] + +Buffer Metadata Extensions +-------------------------- + +Plugin developers may wish to extend either the primary or secondary +vnet buffer opaque unions. Please perform a manual live variable +analysis, otherwise nodes which use shared buffer metadata space may +break things. + +It’s not OK to add plugin or proprietary metadata to the core vpp engine +header files named above. Instead, proceed as follows. The example +concerns the vnet primary buffer opaque union vlib_buffer_opaque_t. It’s +a very simple variation to use the vnet secondary buffer opaque union +vlib_buffer_opaque2_t. 
+ +In a plugin header file: + :: + + /* Add arbitrary buffer metadata */ + #include <vnet/buffer.h> + + typedef struct + { + u32 my_stuff[6]; + } my_buffer_opaque_t; + + STATIC_ASSERT (sizeof (my_buffer_opaque_t) <= + STRUCT_SIZE_OF (vnet_buffer_opaque_t, unused), + "Custom meta-data too large for vnet_buffer_opaque_t"); + + #define my_buffer_opaque(b) \ + ((my_buffer_opaque_t *)((u8 *)((b)->opaque) + STRUCT_OFFSET_OF (vnet_buffer_opaque_t, unused))) + +To set data in the custom buffer opaque type given a vlib_buffer_t \*b: + :: + + my_buffer_opaque (b)->my_stuff[2] = 123; + +To read data from the custom buffer opaque type: + :: + + stuff0 = my_buffer_opaque (b)->my_stuff[2]; diff --git a/docs/developer/corearchitecture/buildsystem/buildrootmakefile.rst b/docs/developer/corearchitecture/buildsystem/buildrootmakefile.rst new file mode 100644 index 00000000000..1eb4e6b5301 --- /dev/null +++ b/docs/developer/corearchitecture/buildsystem/buildrootmakefile.rst @@ -0,0 +1,353 @@ +Introduction to build-root/Makefile +=================================== + +The vpp build system consists of a top-level Makefile, a data-driven +build-root/Makefile, and a set of makefile fragments. The various parts +come together as the result of a set of well-thought-out conventions. + +This section describes build-root/Makefile in some detail. + +Repository Groups and Source Paths +---------------------------------- + +Current vpp workspaces comprise a single repository group. The file +.../build-root/build-config.mk defines a key variable called +SOURCE\_PATH. The SOURCE\_PATH variable names the set of repository +groups. At the moment, there is only one repository group. + +Single pass build system, dependencies and components +----------------------------------------------------- + +The vpp build system caters to components built with GNU autoconf / +automake. Adding such components is a simple process. Dealing with +components which use BSD-style raw Makefiles is more difficult.
+Dealing with toolchain components such as gcc, glibc, and binutils can +be considerably more complicated. + +The vpp build system is a **single-pass** build system. A partial order +must exist for any set of components: the set of (a before b) tuples +must resolve to an ordered list. If you create a circular dependency of +the form: (a,b) (b,c) (c,a), gmake will try to build the target list, +but there’s a 0.0% chance that the results will be pleasant. Cut-n-paste +mistakes in .../build-data/packages/\*.mk can produce confusing failures. + +In a single-pass build system, it’s best to separate libraries and +applications which instantiate them. For example, if vpp depends on +libfoo.a, and myapp depends on both vpp and libfoo.a, it's best to place +libfoo.a and myapp in separate components. The build system will build +libfoo.a, vpp, and then (as a separate component) myapp. If you try to +build libfoo.a and myapp from the same component, it won’t work. + +If you absolutely, positively insist on having myapp and libfoo.a in the +same source tree, you can create a pseudo-component in a separate .mk +file in the .../build-data/packages/ directory. Define the code +phoneycomponent\_source = realcomponent, and provide manual +configure/build/install targets. + +Separate components for myapp, libfoo.a, and vpp are the best and easiest +solution. However, the “mumble\_source = realsource” degree of freedom +exists to solve intractable circular dependencies, such as: to build +gcc-bootstrap, followed by glibc, followed by “real” gcc/g++ [which +depends on glibc too]. + +.../build-root +-------------- + +The .../build-root directory contains the repository group specification +build-config.mk, the main Makefile, and the system-wide set of +autoconf/automake variable overrides in config.site. We'll describe +these files in some detail. To be clear about expectations: the main +Makefile and config.site file are subtle and complex.
It's unlikely that +you'll need or want to modify them. Poorly planned changes in either +place typically cause bugs that are difficult to solve. + +.../build-root/build-config.mk +------------------------------ + +As described above, the build-config.mk file is straightforward: it sets +the make variable SOURCE\_PATH to a list of repository group absolute +paths. + +If you choose to move a workspace, make sure +to modify the paths defined by the SOURCE\_PATH variable. Those paths +need to match changes you make in the workspace paths. For example, if +you place the vpp directory in the workspace of a user named jsmith, you +might change the SOURCE\_PATH to: + +SOURCE\_PATH = /home/jsmith/workspace/vpp + +The "out of the box" setting should work 99.5% of the time: + +:: + + SOURCE_PATH = $(CURDIR)/.. + +.../vpp/build-root/Makefile +--------------------------- + +The main Makefile is complex in a number of dimensions. If you think you +need to modify it, it's a good idea to do some research, or ask for +advice before you change it. + +The main Makefile was organized and designed to provide the following +characteristics: excellent performance, accurate dependency processing, +cache enablement, timestamp optimizations, git integration, +extensibility, builds with cross-compilation tool chains, and builds +with embedded Linux distributions. + +If you really need to do so, you can build double-cross tools with it, +with a minimum amount of fuss. For example, you could compile gdb on +x86\_64, to run on PowerPC, to debug the Xtensa instruction set. + +The PLATFORM variable +--------------------- + +The PLATFORM make/environment variable controls a number of important +characteristics, primarily: + +- CPU architecture +- The list of images to build. + +With respect to .../build-root/Makefile, the list of images to build is +specified by the target.
For example: + +:: + + make PLATFORM=vpp TAG=vpp_debug install-deb + +builds vpp debug Debian packages. + +The main Makefile interprets $PLATFORM by attempting to "-include" the +file /build-data/platforms.mk: + +:: + + $(foreach d,$(FULL_SOURCE_PATH), \ + $(eval -include $(d)/platforms.mk)) + +By convention, we don't define **platforms** in the +...//build-data/platforms.mk file. + +In the vpp case, we search for platform definition makefile fragments in +.../vpp/build-data/platforms.mk, as follows: + +:: + + $(foreach d,$(SOURCE_PATH_BUILD_DATA_DIRS), \ + $(eval -include $(d)/platforms/*.mk)) + +With vpp, which uses the "vpp" platform as discussed above, we end up +"-include"-ing .../vpp/build-data/platforms/vpp.mk. + +The platform-specific .mk fragment +---------------------------------- + +Here are the contents of .../build-data/platforms/vpp.mk: + +:: + + MACHINE=$(shell uname -m) + + vpp_arch = native + ifeq ($(TARGET_PLATFORM),thunderx) + vpp_dpdk_target = arm64-thunderx-linuxapp-gcc + endif + vpp_native_tools = vppapigen + + vpp_uses_dpdk = yes + + # Uncomment to enable building unit tests + # vpp_enable_tests = yes + + vpp_root_packages = vpp + + # DPDK configuration parameters + # vpp_uses_dpdk_mlx4_pmd = yes + # vpp_uses_dpdk_mlx5_pmd = yes + # vpp_uses_external_dpdk = yes + # vpp_dpdk_inc_dir = /usr/include/dpdk + # vpp_dpdk_lib_dir = /usr/lib + # vpp_dpdk_shared_lib = yes + + # Use '--without-libnuma' for non-numa aware architecture + # Use '--enable-dlmalloc' to use dlmalloc instead of mheap + vpp_configure_args_vpp = --enable-dlmalloc + sample-plugin_configure_args_vpp = --enable-dlmalloc + + # load balancer plugin is not portable on 32 bit platform + ifeq ($(MACHINE),i686) + vpp_configure_args_vpp += --disable-lb-plugin + endif + + vpp_debug_TAG_CFLAGS = -g -O0 -DCLIB_DEBUG \ + -fstack-protector-all -fPIC -Werror + vpp_debug_TAG_CXXFLAGS = -g -O0 -DCLIB_DEBUG \ + -fstack-protector-all -fPIC -Werror + vpp_debug_TAG_LDFLAGS = -g -O0 -DCLIB_DEBUG \ + 
-fstack-protector-all -fPIC -Werror + + vpp_TAG_CFLAGS = -g -O2 -D_FORTIFY_SOURCE=2 -fstack-protector -fPIC -Werror + vpp_TAG_CXXFLAGS = -g -O2 -D_FORTIFY_SOURCE=2 -fstack-protector -fPIC -Werror + vpp_TAG_LDFLAGS = -g -O2 -D_FORTIFY_SOURCE=2 -fstack-protector -fPIC -Werror -pie -Wl,-z,now + + vpp_clang_TAG_CFLAGS = -g -O2 -D_FORTIFY_SOURCE=2 -fstack-protector -fPIC -Werror + vpp_clang_TAG_LDFLAGS = -g -O2 -D_FORTIFY_SOURCE=2 -fstack-protector -fPIC -Werror + + vpp_gcov_TAG_CFLAGS = -g -O0 -DCLIB_DEBUG -fPIC -Werror -fprofile-arcs -ftest-coverage + vpp_gcov_TAG_LDFLAGS = -g -O0 -DCLIB_DEBUG -fPIC -Werror -coverage + + vpp_coverity_TAG_CFLAGS = -g -O2 -fPIC -Werror -D__COVERITY__ + vpp_coverity_TAG_LDFLAGS = -g -O2 -fPIC -Werror -D__COVERITY__ + +Note the following variable settings: + +- The variable \_arch sets the CPU architecture used to build the + per-platform cross-compilation toolchain. With the exception of the + "native" architecture - used in our example - the vpp build system + produces cross-compiled binaries. + +- The variable \_native\_tools lists the required set of self-compiled + build tools. + +- The variable \_root\_packages lists the set of images to build when + specifying the target: make PLATFORM= TAG= [install-deb \| + install-rpm]. + +The TAG variable +---------------- + +The TAG variable indirectly sets CFLAGS and LDFLAGS, as well as the +build and install directory names in the .../vpp/build-root directory. +See definitions above. 
+ +Important targets build-root/Makefile +------------------------------------- + +The main Makefile and the various makefile fragments implement the +following user-visible targets: + ++------------------+----------------------+--------------------------------------------------------------------------------------+ +| Target | ENV Variable Settings| Notes | +| | | | ++==================+======================+======================================================================================+ +| foo | bar | mumble | ++------------------+----------------------+--------------------------------------------------------------------------------------+ +| bootstrap-tools | none | Builds the set of native tools needed by the vpp build system to | +| | | build images. Example: vppapigen. In a full cross compilation case might include | +| | | include "make", "git", "find", and "tar | ++------------------+----------------------+--------------------------------------------------------------------------------------+ +| install-tools | PLATFORM | Builds the tool chain for the indicated <platform>. Not used in vpp builds | ++------------------+----------------------+--------------------------------------------------------------------------------------+ +| distclean | none | Roto-rooters everything in sight: toolchains, images, and so forth. | ++------------------+----------------------+--------------------------------------------------------------------------------------+ +| install-deb | PLATFORM and TAG | Build Debian packages comprising components listed in <platform>_root_packages, | +| | | using compile / link options defined by TAG. | ++------------------+----------------------+--------------------------------------------------------------------------------------+ +| install-rpm | PLATFORM and TAG | Build RPMs comprising components listed in <platform>_root_packages, | +| | | using compile / link options defined by TAG. 
| ++------------------+----------------------+--------------------------------------------------------------------------------------+ + +Additional build-root/Makefile environment variable settings +------------------------------------------------------------ + +These variable settings may be of use: + ++----------------------+------------------------------------------------------------------------------------------------------------+ +| ENV Variable | Notes | ++======================+======================+=====================================================================================+ +| BUILD_DEBUG=vx | Directs Makefile et al. to make a good-faith effort to show what's going on in excruciating detail. | +| | Use it as follows: "make ... BUILD_DEBUG=vx". Fairly effective in Makefile debug situations. | ++----------------------+------------------------------------------------------------------------------------------------------------+ +| V=1 | print detailed cc / ld command lines. Useful for discovering if -DFOO=11 is in the command line or not | ++----------------------+------------------------------------------------------------------------------------------------------------+ +| CC=mygcc | Override the configured C-compiler | ++----------------------+------------------------------------------------------------------------------------------------------------+ + +.../build-root/config.site +-------------------------- + +The contents of .../build-root/config.site override individual autoconf / +automake default variable settings. Here are a few sample settings related to +building a full toolchain: + +:: + + # glibc needs these setting for cross compiling + libc_cv_forced_unwind=yes + libc_cv_c_cleanup=yes + libc_cv_ssp=no + +Determining the set of variables which need to be overridden, and the +override values is a matter of trial and error. It should be +unnecessary to modify this file for use with fd.io vpp. 
+
+.../build-data/platforms.mk
+---------------------------
+
+Each repo group includes the platforms.mk file, which is included by
+the main Makefile. The vpp/build-data/platforms.mk file is not terribly
+complex. As of this writing, the .../build-data/platforms.mk file
+accomplishes two tasks.
+
+First, it includes vpp/build-data/platforms/\*.mk:
+
+::
+
+    # Pick up per-platform makefile fragments
+    $(foreach d,$(SOURCE_PATH_BUILD_DATA_DIRS), \
+      $(eval -include $(d)/platforms/*.mk))
+
+This collects the set of platform definition makefile fragments, as
+discussed above.
+
+Second, platforms.mk implements the user-visible "install-deb" target.
+
+.../build-data/packages/\*.mk
+-----------------------------
+
+Each component needs a makefile fragment in order for the build system
+to recognize it. The per-component makefile fragments vary
+considerably in complexity. For a component built with GNU autoconf /
+automake which does not depend on other components, the make fragment
+can be empty. See .../build-data/packages/vpp.mk for an uncomplicated
+but fully realistic example.
+
+Here are some of the important variable settings in per-component makefile fragments:
+
++----------------------+------------------------------------------------------------------------------------------------------------+
+| Variable             | Notes                                                                                                      |
++======================+============================================================================================================+
+| xxx_configure_depend | Lists the set of component build dependencies for the xxx component. In plain English: don't try to       |
+|                      | configure this component until you've successfully built the indicated targets. Almost always,            |
+|                      | xxx_configure_depend will list a set of "yyy-install" targets. Note the pattern:                          |
+|                      | "variable names contain underscores, make target names contain hyphens"                                   |
++----------------------+------------------------------------------------------------------------------------------------------------+
+| xxx_configure_args   | (optional) Lists any additional arguments to pass to the xxx component "configure" script.                |
+|                      | The main Makefile %-configure rule adds the required settings for --libdir, --prefix, and                 |
+|                      | --host (when cross-compiling).                                                                            |
++----------------------+------------------------------------------------------------------------------------------------------------+
+| xxx_CPPFLAGS         | Adds -I stanzas to CPPFLAGS for components upon which xxx depends.                                        |
+|                      | Almost invariably "xxx_CPPFLAGS = $(call installed_includes_fn, dep1 dep2 dep3)", where dep1, dep2, and   |
+|                      | dep3 are listed in xxx_configure_depend. It is bad practice to set "-g -O3" here. Those settings          |
+|                      | belong in a TAG.                                                                                          |
++----------------------+------------------------------------------------------------------------------------------------------------+
+| xxx_LDFLAGS          | Adds -Wl,-rpath -Wl,depN stanzas to LDFLAGS for components upon which xxx depends.                        |
+|                      | Almost invariably "xxx_LDFLAGS = $(call installed_lib_fn, dep1 dep2 dep3)", where dep1, dep2, and         |
+|                      | dep3 are listed in xxx_configure_depend. It is bad manners to set "-liberty-or-death" here.               |
+|                      | Those settings belong in Makefile.am.                                                                     |
++----------------------+------------------------------------------------------------------------------------------------------------+
+
+When dealing with "irritating" components built with raw Makefiles
+which only work when building in the source tree, we use a specific
+strategy in the xxx.mk file.
+
+The strategy is simple for those components: We copy the source tree
+into .../vpp/build-root/build-xxx. This works, but completely defeats
+dependency processing.
This strategy is acceptable only for 3rd party +software which won't need extensive (or preferably any) modifications. + +Take a look at .../vpp/build-data/packages/dpdk.mk. When invoked, the +dpdk_configure variable copies source code into $(PACKAGE_BUILD_DIR), +and performs the BSD equivalent of "autoreconf -i -f" to configure the +build area. The rest of the file is similar: a bunch of hand-rolled +glue code which manages to make the dpdk act like a good vpp build +citizen even though it is not. diff --git a/docs/developer/corearchitecture/buildsystem/cmakeandninja.rst b/docs/developer/corearchitecture/buildsystem/cmakeandninja.rst new file mode 100644 index 00000000000..580d261bdac --- /dev/null +++ b/docs/developer/corearchitecture/buildsystem/cmakeandninja.rst @@ -0,0 +1,186 @@ +Introduction to cmake and ninja +=============================== + +Cmake plus ninja is approximately equal to GNU autotools plus GNU +make, respectively. Both cmake and GNU autotools support self and +cross-compilation, checking for required components and versions. + +- For a decent-sized project - such as vpp - build performance is drastically better with (cmake, ninja). + +- The cmake input language looks like an actual language, rather than a shell scripting scheme on steroids. + +- Ninja doesn't pretend to support manually-generated input files. Think of it as a fast, dumb robot which eats mildly legible byte-code. + +See the `cmake website <http://cmake.org>`_, and the `ninja website +<https://ninja-build.org>`_ for additional information. + +vpp cmake configuration files +----------------------------- + +The top of the vpp project cmake hierarchy lives in .../src/CMakeLists.txt. +This file defines the vpp project, and (recursively) includes two kinds +of files: rule/function definitions, and target lists. + +- Rule/function definitions live in .../src/cmake/{\*.cmake}. 
Although the contents of these files are simple enough
+to read, it shouldn't be necessary to modify them very often.
+
+- Build target lists come from CMakeLists.txt files found in subdirectories, which are named in the SUBDIRS list in .../src/CMakeLists.txt
+
+::
+
+   ##############################################################################
+   # subdirs - order matters
+   ##############################################################################
+   if("${CMAKE_SYSTEM_NAME}" STREQUAL "Linux")
+     find_package(OpenSSL REQUIRED)
+     set(SUBDIRS
+       vppinfra svm vlib vlibmemory vlibapi vnet vpp vat vcl plugins
+       vpp-api tools/vppapigen tools/g2 tools/perftool)
+   elseif("${CMAKE_SYSTEM_NAME}" STREQUAL "Darwin")
+     set(SUBDIRS vppinfra)
+   else()
+     message(FATAL_ERROR "Unsupported system: ${CMAKE_SYSTEM_NAME}")
+   endif()
+
+   foreach(DIR ${SUBDIRS})
+     add_subdirectory(${DIR})
+   endforeach()
+
+- The vpp cmake configuration hierarchy discovers the list of plugins to be built by searching for subdirectories in .../src/plugins which contain CMakeLists.txt files
+
+::
+
+   ##############################################################################
+   # find and add all plugin subdirs
+   ##############################################################################
+   FILE(GLOB files RELATIVE
+     ${CMAKE_CURRENT_SOURCE_DIR}
+     ${CMAKE_CURRENT_SOURCE_DIR}/*/CMakeLists.txt
+   )
+   foreach (f ${files})
+     get_filename_component(dir ${f} DIRECTORY)
+     add_subdirectory(${dir})
+   endforeach()
+
+How to write a plugin CMakeLists.txt file
+-----------------------------------------
+
+It's really quite simple.
Follow the pattern: + +:: + + add_vpp_plugin(mactime + SOURCES + mactime.c + node.c + + API_FILES + mactime.api + + INSTALL_HEADERS + mactime_all_api_h.h + mactime_msg_enum.h + + API_TEST_SOURCES + mactime_test.c + ) + +Adding a target elsewhere in the source tree +-------------------------------------------- + +Within reason, adding a subdirectory to the SUBDIRS list in +.../src/CMakeLists.txt is perfectly OK. The indicated directory will +need a CMakeLists.txt file. + +.. _building-g2: + +Here's how we build the g2 event data visualization tool: + +:: + + option(VPP_BUILD_G2 "Build g2 tool." OFF) + if(VPP_BUILD_G2) + find_package(GTK2 COMPONENTS gtk) + if(GTK2_FOUND) + include_directories(${GTK2_INCLUDE_DIRS}) + add_vpp_executable(g2 + SOURCES + clib.c + cpel.c + events.c + main.c + menu1.c + pointsel.c + props.c + g2version.c + view1.c + + LINK_LIBRARIES vppinfra Threads::Threads m ${GTK2_LIBRARIES} + NO_INSTALL + ) + endif() + endif() + +The g2 component is optional, and is not built by default. There are +a couple of ways to tell cmake to include it in build.ninja [or in Makefile.] + +When invoking cmake manually [rarely done and not very easy], specify +-DVPP_BUILD_G2=ON: + +:: + + $ cmake ... -DVPP_BUILD_G2=ON + +Take a good look at .../build-data/packages/vpp.mk to see where and +how the top-level Makefile and .../build-root/Makefile set all of the +cmake arguments. One strategy to enable an optional component is fairly +obvious. Add -DVPP_BUILD_G2=ON to vpp_cmake_args. + +That would work, of course, but it's not a particularly elegant solution. + +Tinkering with build options: ccmake +------------------------------------ + +The easy way to set VPP_BUILD_G2 - or frankly **any** cmake +parameter - is to install the "cmake-curses-gui" package and use +it. 
+
+- Do a straightforward vpp build using the top level Makefile, “make build” or “make build-release”
+- Adjourn to .../build-root/build-vpp-native/vpp or .../build-root/build-vpp_debug-native/vpp
+- Invoke “ccmake .” to reconfigure the project as desired
+
+Here's approximately what you'll see:
+
+::
+
+    CCACHE_FOUND                     /usr/bin/ccache
+    CMAKE_BUILD_TYPE
+    CMAKE_INSTALL_PREFIX             /scratch/vpp-gate/build-root/install-vpp-nati
+    DPDK_INCLUDE_DIR                 /scratch/vpp-gate/build-root/install-vpp-nati
+    DPDK_LIB                         /scratch/vpp-gate/build-root/install-vpp-nati
+    MBEDTLS_INCLUDE_DIR              /usr/include
+    MBEDTLS_LIB1                     /usr/lib/x86_64-linux-gnu/libmbedtls.so
+    MBEDTLS_LIB2                     /usr/lib/x86_64-linux-gnu/libmbedx509.so
+    MBEDTLS_LIB3                     /usr/lib/x86_64-linux-gnu/libmbedcrypto.so
+    MUSDK_INCLUDE_DIR                MUSDK_INCLUDE_DIR-NOTFOUND
+    MUSDK_LIB                        MUSDK_LIB-NOTFOUND
+    PRE_DATA_SIZE                    128
+    VPP_API_TEST_BUILTIN             ON
+    VPP_BUILD_G2                     OFF
+    VPP_BUILD_PERFTOOL               OFF
+    VPP_BUILD_VCL_TESTS              ON
+    VPP_BUILD_VPPINFRA_TESTS         OFF
+
+    CCACHE_FOUND: Path to a program.
+    Press [enter] to edit option    Press [d] to delete an entry    CMake Version 3.10.2
+    Press [c] to configure
+    Press [h] for help              Press [q] to quit without generating
+    Press [t] to toggle advanced mode (Currently Off)
+
+Use the cursor to point at the VPP_BUILD_G2 line. Press the return key
+to change OFF to ON. Press "c" to regenerate build.ninja, etc.
+
+At that point "make build" or "make build-release" will build g2. And
+so on.
+
+Note that toggling advanced mode ["t"] gives access to substantially
+all of the cmake options, discovered directories and paths.
diff --git a/docs/developer/corearchitecture/buildsystem/index.rst b/docs/developer/corearchitecture/buildsystem/index.rst
new file mode 100644
index 00000000000..908e91e1fc1
--- /dev/null
+++ b/docs/developer/corearchitecture/buildsystem/index.rst
@@ -0,0 +1,14 @@
+.. _buildsystem:
+
+Build System
+============
+
+This guide describes the vpp build system in detail.
As of this writing,
+the build system uses a mix of make / Makefiles, cmake, and ninja to
+achieve excellent build performance.
+
+.. toctree::
+
+   mainmakefile
+   cmakeandninja
+   buildrootmakefile
diff --git a/docs/developer/corearchitecture/buildsystem/mainmakefile.rst b/docs/developer/corearchitecture/buildsystem/mainmakefile.rst
new file mode 100644
index 00000000000..96b97496350
--- /dev/null
+++ b/docs/developer/corearchitecture/buildsystem/mainmakefile.rst
@@ -0,0 +1,2 @@
+Introduction to the top-level Makefile
+======================================
diff --git a/docs/developer/corearchitecture/featurearcs.rst b/docs/developer/corearchitecture/featurearcs.rst
new file mode 100644
index 00000000000..89c50e38dce
--- /dev/null
+++ b/docs/developer/corearchitecture/featurearcs.rst
@@ -0,0 +1,225 @@
+Feature Arcs
+============
+
+A significant number of vpp features are configurable on a per-interface
+or per-system basis. Rather than ask feature coders to manually
+construct the required graph arcs, we built a general mechanism to
+manage these mechanics.
+
+Specifically, feature arcs comprise ordered sets of graph nodes. Each
+feature node in an arc is independently controlled. Feature arc nodes
+are generally unaware of each other. Handing a packet to “the next
+feature node” is quite inexpensive.
+
+The feature arc implementation solves the problem of creating graph arcs
+used for steering.
+
+At the beginning of a feature arc, a bit of setup work is needed, but
+only if at least one feature is enabled on the arc.
+
+On a per-arc basis, individual feature definitions create a set of
+ordering dependencies. Feature infrastructure performs a topological
+sort of the ordering dependencies, to determine the actual feature
+order. Missing dependencies **will** lead to runtime disorder. See
+https://gerrit.fd.io/r/#/c/12753 for an example.
+
+If no partial order exists, vpp will refuse to run.
Circular dependency +loops of the form “a then b, b then c, c then a” are impossible to +satisfy. + +Adding a feature to an existing feature arc +------------------------------------------- + +To nobody’s great surprise, we set up feature arcs using the typical +“macro -> constructor function -> list of declarations” pattern: + +.. code:: c + + VNET_FEATURE_INIT (mactime, static) = + { + .arc_name = "device-input", + .node_name = "mactime", + .runs_before = VNET_FEATURES ("ethernet-input"), + }; + +This creates a “mactime” feature on the “device-input” arc. + +Once per frame, dig up the vnet_feature_config_main_t corresponding to +the “device-input” feature arc: + +.. code:: c + + vnet_main_t *vnm = vnet_get_main (); + vnet_interface_main_t *im = &vnm->interface_main; + u8 arc = im->output_feature_arc_index; + vnet_feature_config_main_t *fcm; + + fcm = vnet_feature_get_config_main (arc); + +Note that in this case, we’ve stored the required arc index - assigned +by the feature infrastructure - in the vnet_interface_main_t. Where to +put the arc index is a programmer’s decision when creating a feature +arc. + +Per packet, set next0 to steer packets to the next node they should +visit: + +.. code:: c + + vnet_get_config_data (&fcm->config_main, + &b0->current_config_index /* value-result */, + &next0, 0 /* # bytes of config data */); + +Configuration data is per-feature arc, and is often unused. Note that +it’s normal to reset next0 to divert packets elsewhere; often, to drop +them for cause: + +.. code:: c + + next0 = MACTIME_NEXT_DROP; + b0->error = node->errors[DROP_CAUSE]; + +Creating a feature arc +---------------------- + +Once again, we create feature arcs using constructor macros: + +.. 
code:: c + + VNET_FEATURE_ARC_INIT (ip4_unicast, static) = + { + .arc_name = "ip4-unicast", + .start_nodes = VNET_FEATURES ("ip4-input", "ip4-input-no-checksum"), + .arc_index_ptr = &ip4_main.lookup_main.ucast_feature_arc_index, + }; + +In this case, we configure two arc start nodes to handle the +“hardware-verified ip checksum or not” cases. During initialization, the +feature infrastructure stores the arc index as shown. + +In the head-of-arc node, do the following to send packets along the +feature arc: + +.. code:: c + + ip_lookup_main_t *lm = &im->lookup_main; + arc = lm->ucast_feature_arc_index; + +Once per packet, initialize packet metadata to walk the feature arc: + +.. code:: c + + vnet_feature_arc_start (arc, sw_if_index0, &next, b0); + +Enabling / Disabling features +----------------------------- + +Simply call vnet_feature_enable_disable to enable or disable a specific +feature: + +.. code:: c + + vnet_feature_enable_disable ("device-input", /* arc name */ + "mactime", /* feature name */ + sw_if_index, /* Interface sw_if_index */ + enable_disable, /* 1 => enable */ + 0 /* (void *) feature_configuration */, + 0 /* feature_configuration_nbytes */); + +The feature_configuration opaque is seldom used. + +If you wish to make a feature a *de facto* system-level concept, pass +sw_if_index=0 at all times. Sw_if_index 0 is always valid, and +corresponds to the “local” interface. + +Related “show” commands +----------------------- + +To display the entire set of features, use “show features [verbose]”. 
+
+The verbose form displays arc indices, and feature indices within the
+arcs.
+
+::
+
+    $ vppctl show features verbose
+    Available feature paths
+    <snip>
+    [14] ip4-unicast:
+      [ 0]: nat64-out2in-handoff
+      [ 1]: nat64-out2in
+      [ 2]: nat44-ed-hairpin-dst
+      [ 3]: nat44-hairpin-dst
+      [ 4]: ip4-dhcp-client-detect
+      [ 5]: nat44-out2in-fast
+      [ 6]: nat44-in2out-fast
+      [ 7]: nat44-handoff-classify
+      [ 8]: nat44-out2in-worker-handoff
+      [ 9]: nat44-in2out-worker-handoff
+      [10]: nat44-ed-classify
+      [11]: nat44-ed-out2in
+      [12]: nat44-ed-in2out
+      [13]: nat44-det-classify
+      [14]: nat44-det-out2in
+      [15]: nat44-det-in2out
+      [16]: nat44-classify
+      [17]: nat44-out2in
+      [18]: nat44-in2out
+      [19]: ip4-qos-record
+      [20]: ip4-vxlan-gpe-bypass
+      [21]: ip4-reassembly-feature
+      [22]: ip4-not-enabled
+      [23]: ip4-source-and-port-range-check-rx
+      [24]: ip4-flow-classify
+      [25]: ip4-inacl
+      [26]: ip4-source-check-via-rx
+      [27]: ip4-source-check-via-any
+      [28]: ip4-policer-classify
+      [29]: ipsec-input-ip4
+      [30]: vpath-input-ip4
+      [31]: ip4-vxlan-bypass
+      [32]: ip4-lookup
+    <snip>
+
+Here, we learn that the ip4-unicast feature arc has index 14, and that
+e.g. ip4-inacl is the 25th feature in the generated partial order.
+
+To display the features currently active on a specific interface, use
+“show interface features”:
+
+::
+
+    $ vppctl show interface GigabitEthernet3/0/0 features
+    Feature paths configured on GigabitEthernet3/0/0...
+    <snip>
+    ip4-unicast:
+      nat44-out2in
+    <snip>
+
+Table of Feature Arcs
+---------------------
+
+Simply search for name-strings to track down the arc definition,
+location of the arc index, etc.
+ +:: + + | Arc Name | + |------------------| + | device-input | + | ethernet-output | + | interface-output | + | ip4-drop | + | ip4-local | + | ip4-multicast | + | ip4-output | + | ip4-punt | + | ip4-unicast | + | ip6-drop | + | ip6-local | + | ip6-multicast | + | ip6-output | + | ip6-punt | + | ip6-unicast | + | mpls-input | + | mpls-output | + | nsh-output | diff --git a/docs/developer/corearchitecture/index.rst b/docs/developer/corearchitecture/index.rst new file mode 100644 index 00000000000..ecd5a3cdb08 --- /dev/null +++ b/docs/developer/corearchitecture/index.rst @@ -0,0 +1,21 @@ +.. _corearchitecture: + +================= +Core Architecture +================= + +.. toctree:: + :maxdepth: 1 + + softwarearchitecture + infrastructure + vlib + vnet + featurearcs + buffer_metadata + multiarch/index + bihash + buildsystem/index + mem + multi_thread + diff --git a/docs/developer/corearchitecture/infrastructure.rst b/docs/developer/corearchitecture/infrastructure.rst new file mode 100644 index 00000000000..b4e1065f81e --- /dev/null +++ b/docs/developer/corearchitecture/infrastructure.rst @@ -0,0 +1,612 @@ +VPPINFRA (Infrastructure) +========================= + +The files associated with the VPP Infrastructure layer are located in +the ``./src/vppinfra`` folder. + +VPPinfra is a collection of basic c-library services, quite sufficient +to build standalone programs to run directly on bare metal. It also +provides high-performance dynamic arrays, hashes, bitmaps, +high-precision real-time clock support, fine-grained event-logging, and +data structure serialization. + +One fair comment / fair warning about vppinfra: you can't always tell a +macro from an inline function from an ordinary function simply by name. +Macros are used to avoid function calls in the typical case, and to +cause (intentional) side-effects. + +Vppinfra has been around for almost 20 years and tends not to change +frequently. 
The VPP Infrastructure layer contains the following
+functions:
+
+Vectors
+-------
+
+Vppinfra vectors are ubiquitous dynamically resized arrays with
+user-defined "headers". Many vppinfra data structures (e.g. hash, heap,
+pool) are vectors with various different headers.
+
+The memory layout looks like this:
+
+::
+
+                      User header (optional, uword aligned)
+                      Alignment padding (if needed)
+                      Vector length in elements
+    User's pointer -> Vector element 0
+                      Vector element 1
+                      ...
+                      Vector element N-1
+
+As shown above, the vector APIs deal with pointers to the 0th element of
+a vector. Null pointers are valid vectors of length zero.
+
+To avoid thrashing the memory allocator, one often resets the length of
+a vector to zero while retaining the memory allocation. Set the vector
+length field to zero via the vec_reset_length(v) macro. [Use the macro!
+It’s smart about NULL pointers.]
+
+Typically, the user header is not present. User headers allow for other
+data structures to be built atop vppinfra vectors. Users may specify the
+alignment for the first data element of a vector via the
+``vec_*_aligned`` macros.
+
+Vector elements can be any C type e.g. (int, double, struct bar). This
+is also true for data types built atop vectors (e.g. heap, pool, etc.).
+Many macros have \_a variants supporting alignment of vector elements
+and \_h variants supporting non-zero-length vector headers. The \_ha
+variants support both. Additionally, cacheline alignment within a vector
+element structure can be specified using the
+``CLIB_CACHE_LINE_ALIGN_MARK()`` macro.
+
+Inconsistent usage of header and/or alignment related macro variants
+will cause delayed, confusing failures.
+
+Standard programming error: memorize a pointer to the ith element of a
+vector, and then expand the vector. Vectors expand by 3/2, so such code
+may appear to work for a period of time. Correct code almost always
+memorizes vector **indices** which are invariant across reallocations.
+
+In typical application images, one supplies a set of global functions
+designed to be called from gdb. Here are a few examples:
+
+- vl(v) - prints vec_len(v)
+- pe(p) - prints pool_elts(p)
+- pifi(p, index) - prints pool_is_free_index(p, index)
+- debug_hex_bytes (p, nbytes) - hex memory dump nbytes starting at p
+
+Use the “show gdb” debug CLI command to print the current set.
+
+Bitmaps
+-------
+
+Vppinfra bitmaps are dynamic, built using the vppinfra vector APIs.
+Quite handy for a variety of jobs.
+
+Pools
+-----
+
+Vppinfra pools combine vectors and bitmaps to rapidly allocate and free
+fixed-size data structures with independent lifetimes. Pools are perfect
+for allocating per-session structures.
+
+Hashes
+------
+
+Vppinfra provides several hash flavors. Data plane problems involving
+packet classification / session lookup often use
+./src/vppinfra/bihash_template.[ch] bounded-index extensible hashes.
+These templates are instantiated multiple times, to efficiently service
+different fixed-key sizes.
+
+Bihashes are thread-safe. Read-locking is not required. A simple
+spin-lock ensures that only one thread writes an entry at a time.
+
+The original vppinfra hash implementation in ./src/vppinfra/hash.[ch]
+is simple to use, and is often used in control-plane code which needs
+exact-string-matching.
+
+In either case, one almost always looks up a key in a hash table to
+obtain an index in a related vector or pool. The APIs are simple enough,
+but one must take care when using the unmanaged arbitrary-sized key
+variant. Hash_set_mem (hash_table, key_pointer, value) memorizes
+key_pointer. It is usually a bad mistake to pass the address of a vector
+element as the second argument to hash_set_mem. It is perfectly fine to
+memorize constant string addresses in the text segment.
+
+Timekeeping
+-----------
+
+Vppinfra includes high-precision, low-cost timing services. The datatype
+clib_time_t and associated functions reside in ./src/vppinfra/time.[ch].
+Call clib_time_init (clib_time_t \*cp) to initialize the clib_time_t +object. + +Clib_time_init(…) can use a variety of different ways to establish the +hardware clock frequency. At the end of the day, vppinfra timekeeping +takes the attitude that the operating system’s clock is the closest +thing to a gold standard it has handy. + +When properly configured, NTP maintains kernel clock synchronization +with a highly accurate off-premises reference clock. Notwithstanding +network propagation delays, a synchronized NTP client will keep the +kernel clock accurate to within 50ms or so. + +Why should one care? Simply put, oscillators used to generate CPU ticks +aren’t super accurate. They work pretty well, but a 0.1% error wouldn’t +be out of the question. That’s a minute and a half’s worth of error in 1 +day. The error changes constantly, due to temperature variation, and a +host of other physical factors. + +It’s far too expensive to use system calls for timing, so we’re left +with the problem of continuously adjusting our view of the CPU tick +register’s clocks_per_second parameter. + +The clock rate adjustment algorithm measures the number of cpu ticks and +the “gold standard” reference time across an interval of approximately +16 seconds. We calculate clocks_per_second for the interval: use rdtsc +(on x86_64) and a system call to get the latest cpu tick count and the +kernel’s latest nanosecond timestamp. We subtract the previous interval +end values, and use exponential smoothing to merge the new clock rate +sample into the clocks_per_second parameter. + +As of this writing, we maintain the clock rate by way of the following +first-order differential equation: + +.. code:: c + + clocks_per_second(t) = clocks_per_second(t-1) * K + sample_cps(t)*(1-K) + where K = e**(-1.0/3.75); + +This yields a per observation “half-life” of 1 minute. 
Empirically, the +clock rate converges within 5 minutes, and appears to maintain +near-perfect agreement with the kernel clock in the face of ongoing NTP +time adjustments. + +See ./src/vppinfra/time.c:clib_time_verify_frequency(…) to look at the +rate adjustment algorithm. The code rejects frequency samples +corresponding to the sort of adjustment which might occur if someone +changes the gold standard kernel clock by several seconds. + +Monotonic timebase support +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Particularly during system initialization, the “gold standard” system +reference clock can change by a large amount, in an instant. It’s not a +best practice to yank the reference clock - in either direction - by +hours or days. In fact, some poorly-constructed use-cases do so. + +To deal with this reality, clib_time_now(…) returns the number of +seconds since vpp started, *guaranteed to be monotonically increasing, +no matter what happens to the system reference clock*. + +This is first-order important, to avoid breaking every active timer in +the system. The vpp host stack alone may account for tens of millions of +active timers. It’s utterly impractical to track down and fix timers, so +we must deal with the issue at the timebase level. + +Here’s how it works. Prior to adjusting the clock rate, we collect the +kernel reference clock and the cpu clock: + +.. code:: c + + /* Ask the kernel and the CPU what time it is... */ + now_reference = unix_time_now (); + now_clock = clib_cpu_time_now (); + +Compute changes for both clocks since the last rate adjustment, roughly +15 seconds ago: + +.. code:: c + + /* Compute change in the reference clock */ + delta_reference = now_reference - c->last_verify_reference_time; + + /* And change in the CPU clock */ + delta_clock_in_seconds = (f64) (now_clock - c->last_verify_cpu_time) * + c->seconds_per_clock; + +Delta_reference is key. 
Almost 100% of the time, delta_reference and +delta_clock_in_seconds are identical modulo one system-call time. +However, NTP or a privileged user can yank the system reference time - +in either direction - by an hour, a day, or a decade. + +As described above, clib_time_now(…) must return monotonically +increasing answers to the question “how long has it been since vpp +started, in seconds.” To do that, the clock rate adjustment algorithm +begins by recomputing the initial reference time: + +.. code:: c + + c->init_reference_time += (delta_reference - delta_clock_in_seconds); + +It’s easy to convince yourself that if the reference clock changes by +15.000000 seconds and the cpu clock tick time changes by 15.000000 +seconds, the initial reference time won’t change. + +If, on the other hand, delta_reference is -86400.0 and delta clock is +15.0 - reference time jumped backwards by exactly one day in a 15-second +rate update interval - we add -86415.0 to the initial reference time. + +Given the corrected initial reference time, we recompute the total +number of cpu ticks which have occurred since the corrected initial +reference time, at the current clock tick rate: + +.. code:: c + + c->total_cpu_time = (now_reference - c->init_reference_time) + * c->clocks_per_second; + +Timebase precision +~~~~~~~~~~~~~~~~~~ + +Cognoscenti may notice that vlib/clib_time_now(…) return a 64-bit +floating-point value; the number of seconds since vpp started. + +Please see `this Wikipedia +article <https://en.wikipedia.org/wiki/Double-precision_floating-point_format>`__ +for more information. C double-precision floating point numbers (called +f64 in the vpp code base) have a 53-bit effective mantissa, and can +accurately represent 15 decimal digits’ worth of precision. + +There are 315,360,000.000001 seconds in ten years plus one microsecond. +That string has exactly 15 decimal digits. The vpp time base retains 1us +precision for roughly 30 years. 
+
+vlib/clib_time_now do *not* provide precision in excess of 1e-6 seconds.
+If necessary, please use clib_cpu_time_now(…) for direct access to the
+CPU clock-cycle counter. Note that the number of CPU clock cycles per
+second varies significantly across CPU architectures.
+
+Timer Wheels
+------------
+
+Vppinfra includes configurable timer wheel support. See the source code
+in …/src/vppinfra/tw_timer_template.[ch], as well as a considerable
+number of template instances defined in …/src/vppinfra/tw_timer\_.[ch].
+
+Instantiation of tw_timer_template.h generates named structures to
+implement specific timer wheel geometries. Choices include: number of
+timer wheels (currently, 1 or 2), number of slots per ring (a power of
+two), and the number of timers per “object handle”.
+
+Internally, user object/timer handles are 32-bit integers, so if one
+selects 16 timers/object (4 bits), the resulting timer wheel handle is
+limited to 2**28 objects.
+
+Here are the specific settings required to generate a single 2048 slot
+wheel which supports 2 timers per object:
+
+.. code:: c
+
+   #define TW_TIMER_WHEELS 1
+   #define TW_SLOTS_PER_RING 2048
+   #define TW_RING_SHIFT 11
+   #define TW_RING_MASK (TW_SLOTS_PER_RING -1)
+   #define TW_TIMERS_PER_OBJECT 2
+   #define LOG2_TW_TIMERS_PER_OBJECT 1
+   #define TW_SUFFIX _2t_1w_2048sl
+   #define TW_FAST_WHEEL_BITMAP 0
+   #define TW_TIMER_ALLOW_DUPLICATE_STOP 0
+
+See tw_timer_2t_1w_2048sl.h for a complete example.
+
+tw_timer_template.h is not intended to be #included directly. Client
+code can include multiple timer geometry header files, although extreme
+caution would be required to use the TW and TWT macros in such a case.
+
+API usage examples
+~~~~~~~~~~~~~~~~~~
+
+The unit test code in …/src/vppinfra/test_tw_timer.c provides a concrete
+API usage example. It uses a synthetic clock to rapidly exercise the
+underlying tw_timer_expire_timers(…) template.
+
+There are not many API routines to call.
+
+Initialize a two-timer, single 2048-slot wheel w/ a 1-second timer granularity
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: c
+
+   tw_timer_wheel_init_2t_1w_2048sl (&tm->single_wheel,
+                                     expired_timer_single_callback,
+                                     1.0 /* timer interval */ );
+
+Start a timer
+^^^^^^^^^^^^^
+
+.. code:: c
+
+   handle = tw_timer_start_2t_1w_2048sl (&tm->single_wheel, elt_index,
+                                         [0 | 1] /* timer id */ ,
+                                         expiration_time_in_u32_ticks);
+
+Stop a timer
+^^^^^^^^^^^^
+
+.. code:: c
+
+   tw_timer_stop_2t_1w_2048sl (&tm->single_wheel, handle);
+
+An expired timer callback
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: c
+
+   static void
+   expired_timer_single_callback (u32 * expired_timers)
+   {
+     int i;
+     u32 pool_index, timer_id;
+     tw_timer_test_elt_t *e;
+     tw_timer_test_main_t *tm = &tw_timer_test_main;
+
+     for (i = 0; i < vec_len (expired_timers); i++)
+       {
+         pool_index = expired_timers[i] & 0x7FFFFFFF;
+         timer_id = expired_timers[i] >> 31;
+
+         ASSERT (timer_id == 1);
+
+         e = pool_elt_at_index (tm->test_elts, pool_index);
+
+         if (e->expected_to_expire != tm->single_wheel.current_tick)
+           {
+             fformat (stdout, "[%d] expired at %d not %d\n",
+                      e - tm->test_elts, tm->single_wheel.current_tick,
+                      e->expected_to_expire);
+           }
+         pool_put (tm->test_elts, e);
+       }
+   }
+
+We use wheel timers extensively in the vpp host stack. Each TCP session
+needs 5 timers, so supporting 10 million flows requires up to 50 million
+concurrent timers.
+
+Timers rarely expire, so it's of utmost importance that stopping and
+restarting a timer costs as few clock cycles as possible.
+
+Stopping a timer costs a doubly-linked list dequeue. Starting a timer
+involves modular arithmetic to determine the correct timer wheel and
+slot, and a list head enqueue.
+
+Expired timer processing generally involves bulk link-list retirement
+with user callback presentation.
There is some additional complexity at wheel
+wrap time, to relocate timers from slower-turning timer wheels into
+faster-turning wheels.
+
+Format
+------
+
+Vppinfra format is roughly equivalent to printf.
+
+Format has a few properties worth mentioning. Format’s first argument is
+a (u8 \*) vector to which it appends the result of the current format
+operation. Chaining calls is very easy:
+
+.. code:: c
+
+   u8 * result;
+
+   result = format (0, "junk = %d, ", junk);
+   result = format (result, "more junk = %d\n", more_junk);
+
+As previously noted, NULL pointers are perfectly proper 0-length
+vectors. Format returns a (u8 \*) vector, **not** a C-string. If you
+wish to print a (u8 \*) vector, use the “%v” format string. If you need
+a (u8 \*) vector which is also a proper C-string, either of these
+schemes may be used:
+
+.. code:: c
+
+   vec_add1 (result, 0)
+   or
+   result = format (result, "<whatever>%c", 0);
+
+Remember to vec_free() the result if appropriate. Be careful not to pass
+format an uninitialized (u8 \*).
+
+Format implements a particularly handy user-format scheme via the “%U”
+format specification. For example:
+
+.. code:: c
+
+   u8 * format_junk (u8 * s, va_list *va)
+   {
+     char *junk = va_arg (*va, char *);
+     s = format (s, "%s", junk);
+     return s;
+   }
+
+   result = format (0, "junk = %U", format_junk, "This is some junk");
+
+format_junk() can invoke other user-format functions if desired. The
+programmer shoulders responsibility for argument type-checking. It is
+typical for user format functions to blow up spectacularly if the
+va_arg(va, type) macros don’t match the caller’s idea of reality.
+
+Unformat
+--------
+
+Vppinfra unformat is vaguely related to scanf, but considerably more
+general.
+
+A typical use case involves initializing an unformat_input_t from either
+a C-string or a (u8 \*) vector, then parsing via unformat() as follows:
+
+.. 
code:: c
+
+   unformat_input_t input;
+   u8 *s = "<some-C-string>";
+
+   unformat_init_string (&input, (char *) s, strlen((char *) s));
+   /* or */
+   unformat_init_vector (&input, <u8-vector>);
+
+Then loop parsing individual elements:
+
+.. code:: c
+
+   while (unformat_check_input (&input) != UNFORMAT_END_OF_INPUT)
+     {
+       if (unformat (&input, "value1 %d", &value1))
+         ;/* unformat sets value1 */
+       else if (unformat (&input, "value2 %d", &value2))
+         ;/* unformat sets value2 */
+       else
+         return clib_error_return (0, "unknown input '%U'",
+                                   format_unformat_error, &input);
+     }
+
+As with format, unformat implements a user-unformat function capability
+via a “%U” user unformat function scheme. Generally, one can trivially
+transform ``format (s, "foo %d", foo)`` into
+``unformat (input, "foo %d", &foo)``.
+
+Unformat implements a couple of handy non-scanf-like format specifiers:
+
+.. code:: c
+
+   unformat (input, "enable %=", &enable, 1 /* defaults to 1 */);
+   unformat (input, "bitzero %|", &mask, (1<<0));
+   unformat (input, "bitone %|", &mask, (1<<1));
+   <etc>
+
+The phrase “enable %=” means “set the supplied variable to the default
+value” if unformat parses the “enable” keyword all by itself. If
+unformat parses “enable 123”, it sets the supplied variable to 123.
+
+We could clean up a number of hand-rolled “verbose” + “verbose %d”
+argument parsing code paths using “%=”.
+
+The phrase “bitzero %\|” means “set the specified bit in the supplied
+bitmask” if unformat parses “bitzero”. Although it looks like it could
+be fairly handy, it’s very lightly used in the code base.
+
+``%_`` toggles whether or not to skip input white space.
+
+When transitioning from skip to no-skip in the middle of a format
+string, input white space is skipped. For example, the following:
+
+.. code:: c
+
+   fmt = "%_%d.%d%_->%_%d.%d%_"
+   unformat (input, fmt, &one, &two, &three, &four);
+
+matches input “1.2 -> 3.4”. Without this, the space after “->” would not
+get skipped.
+
+
+How to parse a single input line
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Debug CLI command functions MUST NOT accidentally consume input
+belonging to other debug CLI commands. Otherwise, it's impossible to
+script a set of debug CLI commands which "work fine" when issued one
+at a time.
+
+This bit of code is NOT correct:
+
+.. code:: c
+
+    /* Eats script input NOT belonging to it, and chokes! */
+    while (unformat_check_input (input) != UNFORMAT_END_OF_INPUT)
+      {
+        if (unformat (input, ...))
+          ;
+        else if (unformat (input, ...))
+          ;
+        else
+          return clib_error_return (0, "parse error: '%U'",
+                                    format_unformat_error, input);
+      }
+
+When executed as part of a script, such a function will return “parse
+error: ‘’” every time, unless it happens to be the last command in the
+script.
+
+Instead, use “unformat_line_input” to consume the rest of a line’s worth
+of input - everything past the path specified in the VLIB_CLI_COMMAND
+declaration.
+
+For example, with “my_command” set up as shown below
+and user input “my path is clear”, unformat_line_input will produce an
+unformat_input_t that contains “is clear”.
+
+.. code:: c
+
+   VLIB_CLI_COMMAND (...) = {
+     .path = "my path",
+   };
+
+Here’s a bit of code which shows the required mechanics, in full:
+
+.. code:: c
+
+   static clib_error_t *
+   my_command_fn (vlib_main_t * vm,
+                  unformat_input_t * input,
+                  vlib_cli_command_t * cmd)
+   {
+     unformat_input_t _line_input, *line_input = &_line_input;
+     u32 this, that;
+     clib_error_t *error = 0;
+
+     if (!unformat_user (input, unformat_line_input, line_input))
+       return 0;
+
+     /*
+      * Here, UNFORMAT_END_OF_INPUT is at the end of the line we consumed,
+      * not at the end of the script... 
+
+      */
+     while (unformat_check_input (line_input) != UNFORMAT_END_OF_INPUT)
+       {
+         if (unformat (line_input, "this %u", &this))
+           ;
+         else if (unformat (line_input, "that %u", &that))
+           ;
+         else
+           {
+             error = clib_error_return (0, "parse error: '%U'",
+                                        format_unformat_error, line_input);
+             goto done;
+           }
+       }
+
+     <do something based on "this" and "that", etc>
+
+    done:
+     unformat_free (line_input);
+     return error;
+   }
+   VLIB_CLI_COMMAND (my_command, static) = {
+     .path = "my path",
+     .function = my_command_fn,
+   };
+
+Vppinfra errors and warnings
+----------------------------
+
+Many functions within the vpp dataplane have return-values of type
+clib_error_t \*. Clib_error_t’s are arbitrary strings with a bit of
+metadata [fatal, warning] and are easy to announce. Returning a NULL
+clib_error_t \* indicates “A-OK, no error.”
+
+Clib_warning(format-args) is a handy way to add debugging output; clib
+warnings prepend function:line info to unambiguously locate the message
+source. Clib_unix_warning() adds perror()-style Linux system-call
+information. In production images, clib_warnings result in syslog
+entries.
+
+Serialization
+-------------
+
+Vppinfra serialization support allows the programmer to easily serialize
+and unserialize complex data structures.
+
+The underlying primitive serialize/unserialize functions use network
+byte-order, so there are no structural issues serializing on a
+little-endian host and unserializing on a big-endian host.
diff --git a/docs/developer/corearchitecture/mem.rst b/docs/developer/corearchitecture/mem.rst
new file mode 120000
index 00000000000..0fc53eab68c
--- /dev/null
+++ b/docs/developer/corearchitecture/mem.rst
@@ -0,0 +1 @@
+../../../src/vpp/mem/mem.rst
\ No newline at end of file
diff --git a/docs/developer/corearchitecture/multi_thread.rst b/docs/developer/corearchitecture/multi_thread.rst
new file mode 100644
index 00000000000..195a9b791fd
--- /dev/null
+++ b/docs/developer/corearchitecture/multi_thread.rst
@@ -0,0 +1,169 @@
+.. _vpp_multi_thread:
+
+Multi-threading in VPP
+======================
+
+Modes
+-----
+
+VPP can work in 2 different modes:
+
+- single-thread
+- multi-thread with worker threads
+
+Single-thread
+~~~~~~~~~~~~~
+
+In single-thread mode there is one main thread which handles both
+packet processing and other management functions (Command-Line Interface
+(CLI), API, stats). This is the default setup. There is no special
+startup config needed.
+
+Multi-thread with Worker Threads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In this mode, the main thread handles management functions (debug CLI,
+API, stats collection) and one or more worker threads handle packet
+processing from input to output of the packet.
+
+Each worker thread polls input queues on a subset of interfaces.
+
+With RSS (Receive Side Scaling) enabled, multiple threads can service one
+physical interface (the RSS function on the NIC distributes traffic between
+different queues, which are serviced by different worker threads).
+
+Thread placement
+----------------
+
+Thread placement is defined in the startup config under the cpu { … }
+section.
+
+The VPP platform can place threads automatically or manually. 
Automatic
+placement works in the following way:
+
+- if “skip-cores X” is defined, the first X cores will not be used
+- if “main-core X” is defined, the VPP main thread will be placed on core
+  X, otherwise the 1st available one will be used
+- if “workers N” is defined, vpp will allocate the first N available cores
+  and run worker threads on them
+- if “corelist-workers A,B1-Bn,C1-Cn” is defined, vpp will automatically
+  assign those CPU cores to worker threads
+
+Users can see the active placement of cores by using the VPP debug CLI
+command show threads:
+
+.. code-block:: console
+
+   vpd# show threads
+   ID     Name                Type        LWP     lcore  Core   Socket State
+   0      vpe_main                        59723   2      2      0      wait
+   1      vpe_wk_0            workers     59755   4      4      0      running
+   2      vpe_wk_1            workers     59756   5      5      0      running
+   3      vpe_wk_2            workers     59757   6      0      1      running
+   4      vpe_wk_3            workers     59758   7      1      1      running
+   5      stats                           59775
+   vpd#
+
+The sample output above shows the main thread running on core 2 (2nd
+core on CPU socket 0), and worker threads running on cores 4-7.
+
+Sample Configurations
+---------------------
+
+By default, at start-up VPP uses
+configuration values from: ``/etc/vpp/startup.conf``
+
+The following sections describe some of the additional changes that can be made to this file.
+This file is initially populated from the files located in the following directory: ``/vpp/vpp/conf/``
+
+Manual Placement
+~~~~~~~~~~~~~~~~
+
+Manual placement places the main thread on core 1, and workers on cores
+4,5,20,21.
+
+.. code-block:: console
+
+   cpu {
+     main-core 1
+     corelist-workers 4-5,20-21
+   }
+
+Auto placement
+--------------
+
+Auto placement is likely to place the main thread on core 1 and workers
+on cores 2,3,4.
+
+.. code-block:: console
+
+   cpu {
+     skip-cores 1
+     workers 3
+   }
+
+Buffer Memory Allocation
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The VPP platform is NUMA aware. It can allocate memory for buffers on
+different CPU sockets (NUMA nodes). 
The amount of memory allocated can
+be defined in the startup config for each CPU socket by using the
+socket-mem A[[,B],C] statement inside the dpdk { … } section.
+
+For example:
+
+.. code-block:: console
+
+   dpdk {
+     socket-mem 1024,1024
+   }
+
+The above configuration allocates 1GB of memory on NUMA#0 and 1GB on
+NUMA#1. Each worker thread uses buffers which are local to itself.
+
+Buffer memory is allocated from hugepages. VPP prefers 1G pages if they
+are available. If not, 2MB pages will be used.
+
+VPP takes care of mounting/unmounting the hugepages file-system
+automatically, so there is no need to do that manually.
+
+**NOTE**: If you are running the latest VPP release, there is no need to
+specify socket-mem manually. VPP will discover all NUMA nodes and it
+will allocate 512M on each by default. socket-mem is only needed if a
+bigger number of mbufs is required (the default is 16384 per socket and
+can be changed with the num-mbufs startup config command).
+
+Interface Placement in Multi-thread Setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On startup, the VPP platform assigns interfaces (or interface, queue
+pairs if RSS is used) to different worker threads in round robin
+fashion.
+
+The following example shows debug CLI commands to show and change
+interface placement:
+
+.. code-block:: console
+
+   vpd# sh dpdk interface placement
+   Thread 1 (vpp_wk_0 at lcore 5):
+     TenGigabitEthernet2/0/0 queue 0
+     TenGigabitEthernet2/0/1 queue 0
+   Thread 2 (vpp_wk_1 at lcore 6):
+     TenGigabitEthernet2/0/0 queue 1
+     TenGigabitEthernet2/0/1 queue 1
+
+The following shows an example of moving TenGigabitEthernet2/0/1 queue 1
+processing to the 1st worker thread:
+
+.. 
code-block:: console
+
+   vpd# set interface placement TenGigabitEthernet2/0/1 queue 1 thread 1
+
+   vpd# sh dpdk interface placement
+   Thread 1 (vpp_wk_0 at lcore 5):
+     TenGigabitEthernet2/0/0 queue 0
+     TenGigabitEthernet2/0/1 queue 0
+     TenGigabitEthernet2/0/1 queue 1
+   Thread 2 (vpp_wk_1 at lcore 6):
+     TenGigabitEthernet2/0/0 queue 1
diff --git a/docs/developer/corearchitecture/multiarch/arbfns.rst b/docs/developer/corearchitecture/multiarch/arbfns.rst
new file mode 100644
index 00000000000..d469bd8a140
--- /dev/null
+++ b/docs/developer/corearchitecture/multiarch/arbfns.rst
@@ -0,0 +1,87 @@
+Multi-Architecture Arbitrary Function Cookbook
+==============================================
+
+Optimizing arbitrary functions for multiple architectures is simple
+enough, and very similar to the process used to produce
+multi-architecture graph node dispatch functions.
+
+As with multi-architecture graph nodes, we compile source files
+multiple times, generating multiple implementations of the original
+function, and a public selector function.
+
+Details
+-------
+
+Decorate function definitions with CLIB_MARCH_FN macros. For example:
+
+Change the original function prototype...
+
+::
+
+   u32 vlib_frame_alloc_to_node (vlib_main_t * vm, u32 to_node_index,
+                                 u32 frame_flags)
+
+...by recasting the function name and return type as the first two
+arguments to the CLIB_MARCH_FN macro:
+
+::
+
+   CLIB_MARCH_FN (vlib_frame_alloc_to_node, u32, vlib_main_t * vm,
+                  u32 to_node_index, u32 frame_flags)
+
+In the actual vpp image, several versions of vlib_frame_alloc_to_node
+will appear: vlib_frame_alloc_to_node_avx2,
+vlib_frame_alloc_to_node_avx512, and so forth. 
+
+
+For each multi-architecture function, use the CLIB_MARCH_FN_SELECT
+macro to help generate the one-and-only multi-architecture selector
+function:
+
+::
+
+   #ifndef CLIB_MARCH_VARIANT
+   u32
+   vlib_frame_alloc_to_node (vlib_main_t * vm, u32 to_node_index,
+                             u32 frame_flags)
+   {
+     return CLIB_MARCH_FN_SELECT (vlib_frame_alloc_to_node)
+       (vm, to_node_index, frame_flags);
+   }
+   #endif /* CLIB_MARCH_VARIANT */
+
+Once bound, the multi-architecture selector function is about as
+expensive as an indirect function call, which is to say: not very
+expensive.
+
+Modify CMakeLists.txt
+---------------------
+
+If the component in question already lists "MULTIARCH_SOURCES", simply
+add the indicated .c file to the list. Otherwise, add as shown
+below. Note that the added file "new_multiarch_node.c" should appear in
+*both* SOURCES and MULTIARCH_SOURCES:
+
+::
+
+   add_vpp_plugin(myplugin
+     SOURCES
+     multiarch_code.c
+     ...
+
+     MULTIARCH_SOURCES
+     multiarch_code.c
+     ...
+   )
+
+A Word to the Wise
+------------------
+
+A file which liberally mixes functions worth compiling for multiple
+architectures and functions which are not will end up full of
+#ifndef CLIB_MARCH_VARIANT conditionals. This won't do a thing to make
+the code look any better.
+
+Depending on requirements, it may make sense to move functions to
+(new) files to reduce complexity and/or improve legibility of the
+resulting code.
diff --git a/docs/developer/corearchitecture/multiarch/index.rst b/docs/developer/corearchitecture/multiarch/index.rst
new file mode 100644
index 00000000000..824a8e68438
--- /dev/null
+++ b/docs/developer/corearchitecture/multiarch/index.rst
@@ -0,0 +1,12 @@
+.. _multiarch:
+
+Multi-architecture support
+==========================
+
+This reference guide describes how to use the vpp multi-architecture support scheme.
+
+.. 
toctree:: + :maxdepth: 1 + + nodefns + arbfns diff --git a/docs/developer/corearchitecture/multiarch/nodefns.rst b/docs/developer/corearchitecture/multiarch/nodefns.rst new file mode 100644 index 00000000000..9647e64f08c --- /dev/null +++ b/docs/developer/corearchitecture/multiarch/nodefns.rst @@ -0,0 +1,138 @@ +Multi-Architecture Graph Node Cookbook +====================================== + +In the context of graph node dispatch functions, it's easy enough to +use the vpp multi-architecture support setup. The point of the scheme +is simple: for performance-critical nodes, generate multiple CPU +hardware-dependent versions of the node dispatch functions, and pick +the best one at runtime. + +The vpp scheme is simple enough to use, but details matter. + +100,000 foot view +----------------- + +We compile entire graph node dispatch function implementation files +multiple times. These compilations give rise to multiple versions of +the graph node dispatch functions. Per-node constructor-functions +interrogate CPU hardware, select the node dispatch function variant to +use, and set the vlib_node_registration_t ".function" member to the +address of the selected variant. + +Details +------- + +Declare the node dispatch function as shown, using the VLIB\_NODE\_FN macro. The +name of the node function **MUST** match the name of the graph node. + +:: + + VLIB_NODE_FN (ip4_sdp_node) (vlib_main_t * vm, vlib_node_runtime_t * node, + vlib_frame_t * frame) + { + if (PREDICT_FALSE (node->flags & VLIB_NODE_FLAG_TRACE)) + return ip46_sdp_inline (vm, node, frame, 1 /* is_ip4 */ , + 1 /* is_trace */ ); + else + return ip46_sdp_inline (vm, node, frame, 1 /* is_ip4 */ , + 0 /* is_trace */ ); + } + +We need to generate *precisely one copy* of the +vlib_node_registration_t, error strings, and packet trace decode function. 
+ +Simply bracket these items with "#ifndef CLIB_MARCH_VARIANT...#endif": + +:: + + #ifndef CLIB_MARCH_VARIANT + static u8 * + format_sdp_trace (u8 * s, va_list * args) + { + <snip> + } + #endif + + ... + + #ifndef CLIB_MARCH_VARIANT + static char *sdp_error_strings[] = { + #define _(sym,string) string, + foreach_sdp_error + #undef _ + }; + #endif + + ... + + #ifndef CLIB_MARCH_VARIANT + VLIB_REGISTER_NODE (ip4_sdp_node) = + { + // DO NOT set the .function structure member. + // The multiarch selection __attribute__((constructor)) function + // takes care of it at runtime + .name = "ip4-sdp", + .vector_size = sizeof (u32), + .format_trace = format_sdp_trace, + .type = VLIB_NODE_TYPE_INTERNAL, + + .n_errors = ARRAY_LEN(sdp_error_strings), + .error_strings = sdp_error_strings, + + .n_next_nodes = SDP_N_NEXT, + + /* edit / add dispositions here */ + .next_nodes = + { + [SDP_NEXT_DROP] = "ip4-drop", + }, + }; + #endif + +To belabor the point: *do not* set the ".function" member! That's the job of the multi-arch +selection \_\_attribute\_\_((constructor)) function + +Always inline node dispatch functions +------------------------------------- + +It's typical for a graph dispatch function to contain one or more +calls to an inline function. See above. If your node dispatch function +is structured that way, make *ABSOLUTELY CERTAIN* to use the +"always_inline" macro: + +:: + + always_inline uword + ip46_sdp_inline (vlib_main_t * vm, vlib_node_runtime_t * node, + vlib_frame_t * frame, + int is_ip4, int is_trace) + { ... } + +Otherwise, the compiler is highly likely NOT to build multiple +versions of the guts of your dispatch function. + +It's fairly easy to spot this mistake in "perf top." If you see, for +example, a bunch of functions with names of the form +"xxx_node_fn_avx2" in the profile, *BUT* your brand-new node function +shows up with a name of the form "xxx_inline.isra.1", it's quite likely +that the inline was declared "static inline" instead of "always_inline". 
+
+Modify CMakeLists.txt
+---------------------
+
+If the component in question already lists "MULTIARCH_SOURCES", simply
+add the indicated .c file to the list. Otherwise, add as shown
+below. Note that the added file "new_multiarch_node.c" should appear in
+*both* SOURCES and MULTIARCH_SOURCES:
+
+::
+
+   add_vpp_plugin(myplugin
+     SOURCES
+     new_multiarch_node.c
+     ...
+
+     MULTIARCH_SOURCES
+     new_multiarch_node.c
+     ...
+   )
diff --git a/docs/developer/corearchitecture/softwarearchitecture.rst b/docs/developer/corearchitecture/softwarearchitecture.rst
new file mode 100644
index 00000000000..7f8a0e04645
--- /dev/null
+++ b/docs/developer/corearchitecture/softwarearchitecture.rst
@@ -0,0 +1,47 @@
+Software Architecture
+=====================
+
+The fd.io vpp implementation is a third-generation vector packet
+processing implementation specifically related to US Patent 7,961,636,
+as well as earlier work. Note that the Apache-2 license specifically
+grants non-exclusive patent licenses; we mention this patent as a point
+of historical interest.
+
+For performance, the vpp dataplane consists of a directed graph of
+forwarding nodes which process multiple packets per invocation. This
+schema enables a variety of micro-processor optimizations: pipelining
+and prefetching to cover dependent read latency, inherent I-cache phase
+behavior, vector instructions. Aside from hardware input and hardware
+output nodes, the entire forwarding graph is portable code.
+
+Depending on the scenario at hand, we often spin up multiple worker
+threads which process ingress-hashed packets from multiple queues using
+identical forwarding graph replicas.
+
+VPP Layers - Implementation Taxonomy
+------------------------------------
+
+.. figure:: /_images/VPP_Layering.png
+   :alt: image
+
+   image
+
+- VPP Infra - the VPP infrastructure layer, which contains the core
+  library source code. 
This layer performs memory functions, works with
+  vectors and rings, performs key lookups in hash tables, and works
+  with timers for dispatching graph nodes.
+- VLIB - the vector processing library. The vlib layer also handles
+  various application management functions: buffer, memory and graph
+  node management, maintaining and exporting counters, thread
+  management, packet tracing. Vlib implements the debug CLI (command
+  line interface).
+- VNET - works with VPP's networking interface (layers 2, 3, and 4),
+  performs session and traffic management, and works with devices and
+  the data control plane.
+- Plugins - Contains an increasingly rich set of data-plane plugins, as
+  noted in the above diagram.
+- VPP - the container application linked against all of the above.
+
+It’s important to understand each of these layers in a certain amount of
+detail. Much of the implementation is best dealt with at the API level
+and otherwise left alone.
diff --git a/docs/developer/corearchitecture/vlib.rst b/docs/developer/corearchitecture/vlib.rst
new file mode 100644
index 00000000000..f542d33ebb8
--- /dev/null
+++ b/docs/developer/corearchitecture/vlib.rst
@@ -0,0 +1,888 @@
+VLIB (Vector Processing Library)
+================================
+
+The files associated with vlib are located in the ./src/{vlib, vlibapi,
+vlibmemory} folders. These libraries provide vector processing support
+including graph-node scheduling, reliable multicast support,
+ultra-lightweight cooperative multi-tasking threads, a CLI, plug-in .DLL
+support, physical memory and Linux epoll support. Parts of this library
+embody US Patent 7,961,636.
+
+Init function discovery
+-----------------------
+
+vlib applications register for various [initialization] events by
+placing structures and \__attribute__((constructor)) functions into the
+image. 
At appropriate times, the vlib framework walks
+constructor-generated singly-linked structure lists, performs a
+topological sort based on specified constraints, and calls the indicated
+functions. Vlib applications create graph nodes, add CLI functions,
+start cooperative multi-tasking threads, etc. etc. using this mechanism.
+
+vlib applications invariably include a number of VLIB_INIT_FUNCTION
+(my_init_function) macros.
+
+Each init / configure / etc. function has the return type clib_error_t
+\*. Make sure that the function returns 0 if all is well, otherwise the
+framework will announce an error and exit.
+
+vlib applications must link against vppinfra, and often link against
+other libraries such as VNET. In the latter case, it may be necessary to
+explicitly reference symbol(s) otherwise large portions of the library
+may be AWOL at runtime.
+
+Init function construction and constraint specification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It’s easy to add an init function:
+
+.. code:: c
+
+   static clib_error_t *my_init_function (vlib_main_t *vm)
+   {
+     /* ... initialize things ... */
+
+     return 0; // or return clib_error_return (0, "BROKEN!");
+   }
+   VLIB_INIT_FUNCTION(my_init_function);
+
+As given, my_init_function will be executed “at some point,” but with no
+ordering guarantees.
+
+Specifying ordering constraints is easy:
+
+.. code:: c
+
+   VLIB_INIT_FUNCTION(my_init_function) =
+   {
+     .runs_before = VLIB_INITS("we_run_before_function_1",
+                               "we_run_before_function_2"),
+     .runs_after = VLIB_INITS("we_run_after_function_1",
+                              "we_run_after_function_2"),
+   };
+
+It’s also easy to specify bulk ordering constraints of the form “a then
+b then c then d”:
+
+.. code:: c
+
+   VLIB_INIT_FUNCTION(my_init_function) =
+   {
+     .init_order = VLIB_INITS("a", "b", "c", "d"),
+   };
+
+It’s OK to specify all three sorts of ordering constraints for a single
+init function, although it’s hard to imagine why it would be necessary. 
+ +Node Graph Initialization +------------------------- + +vlib packet-processing applications invariably define a set of graph +nodes to process packets. + +One constructs a vlib_node_registration_t, most often via the +VLIB_REGISTER_NODE macro. At runtime, the framework processes the set of +such registrations into a directed graph. It is easy enough to add nodes +to the graph at runtime. The framework does not support removing nodes. + +vlib provides several types of vector-processing graph nodes, primarily +to control framework dispatch behaviors. The type member of the +vlib_node_registration_t functions as follows: + +- VLIB_NODE_TYPE_PRE_INPUT - run before all other node types +- VLIB_NODE_TYPE_INPUT - run as often as possible, after pre_input + nodes +- VLIB_NODE_TYPE_INTERNAL - only when explicitly made runnable by + adding pending frames for processing +- VLIB_NODE_TYPE_PROCESS - only when explicitly made runnable. + “Process” nodes are actually cooperative multi-tasking threads. They + **must** explicitly suspend after a reasonably short period of time. + +For a precise understanding of the graph node dispatcher, please read +./src/vlib/main.c:vlib_main_loop. + +Graph node dispatcher +--------------------- + +Vlib_main_loop() dispatches graph nodes. The basic vector processing +algorithm is diabolically simple, but may not be obvious from even a +long stare at the code. Here’s how it works: some input node, or set of +input nodes, produce a vector of work to process. The graph node +dispatcher pushes the work vector through the directed graph, +subdividing it as needed, until the original work vector has been +completely processed. At that point, the process recurs. + +This scheme yields a stable equilibrium in frame size, by construction. +Here’s why: as the frame size increases, the per-frame-element +processing time decreases. There are several related forces at work; the +simplest to describe is the effect of vector processing on the CPU L1 +I-cache. 
The first frame element [packet] processed by a given node +warms up the node dispatch function in the L1 I-cache. All subsequent +frame elements profit. As we increase the number of frame elements, the +cost per element goes down. + +Under light load, it is a crazy waste of CPU cycles to run the graph +node dispatcher flat-out. So, the graph node dispatcher arranges to wait +for work by sitting in a timed epoll wait if the prevailing frame size +is low. The scheme has a certain amount of hysteresis to avoid +constantly toggling back and forth between interrupt and polling mode. +Although the graph dispatcher supports interrupt and polling modes, our +current default device drivers do not. + +The graph node scheduler uses a hierarchical timer wheel to reschedule +process nodes upon timer expiration. + +Graph dispatcher internals +-------------------------- + +This section may be safely skipped. It’s not necessary to understand +graph dispatcher internals to create graph nodes. + +Vector Data Structure +--------------------- + +In vpp / vlib, we represent vectors as instances of the vlib_frame_t +type: + +.. code:: c + + typedef struct vlib_frame_t + { + /* Frame flags. */ + u16 flags; + + /* Number of scalar bytes in arguments. */ + u8 scalar_size; + + /* Number of bytes per vector argument. */ + u8 vector_size; + + /* Number of vector elements currently in frame. */ + u16 n_vectors; + + /* Scalar and vector arguments to next node. */ + u8 arguments[0]; + } vlib_frame_t; + +Note that one *could* construct all kinds of vectors - including vectors +with some associated scalar data - using this structure. In the vpp +application, vectors typically use a 4-byte vector element size, and +zero bytes’ worth of associated per-frame scalar data. + +Frames are always allocated on CLIB_CACHE_LINE_BYTES boundaries. 
Frames +have u32 indices which make use of the alignment property, so the +maximum feasible main heap offset of a frame is CLIB_CACHE_LINE_BYTES \* +0xFFFFFFFF: 64*4 = 256 Gbytes. + +Scheduling Vectors +------------------ + +As you can see, vectors are not directly associated with graph nodes. We +represent that association in a couple of ways. The simplest is the +vlib_pending_frame_t: + +.. code:: c + + /* A frame pending dispatch by main loop. */ + typedef struct + { + /* Node and runtime for this frame. */ + u32 node_runtime_index; + + /* Frame index (in the heap). */ + u32 frame_index; + + /* Start of next frames for this node. */ + u32 next_frame_index; + + /* Special value for next_frame_index when there is no next frame. */ + #define VLIB_PENDING_FRAME_NO_NEXT_FRAME ((u32) ~0) + } vlib_pending_frame_t; + +Here is the code in …/src/vlib/main.c:vlib_main_or_worker_loop() which +processes frames: + +.. code:: c + + /* + * Input nodes may have added work to the pending vector. + * Process pending vector until there is nothing left. + * All pending vectors will be processed from input -> output. + */ + for (i = 0; i < _vec_len (nm->pending_frames); i++) + cpu_time_now = dispatch_pending_node (vm, i, cpu_time_now); + /* Reset pending vector for next iteration. */ + +The pending frame node_runtime_index associates the frame with the node +which will process it. + +Complications +------------- + +Fasten your seatbelt. Here’s where the story - and the data structures - +become quite complicated… + +At 100,000 feet: vpp uses a directed graph, not a directed *acyclic* +graph. It’s really quite normal for a packet to visit ip[46]-lookup +multiple times. The worst-case: a graph node which enqueues packets to +itself. + +To deal with this issue, the graph dispatcher must force allocation of a +new frame if the current graph node’s dispatch function happens to +enqueue a packet back to itself. 
+ +There are no guarantees that a pending frame will be processed +immediately, which means that more packets may be added to the +underlying vlib_frame_t after it has been attached to a +vlib_pending_frame_t. Care must be taken to allocate new frames and +pending frames if a (pending_frame, frame) pair fills. + +Next frames, next frame ownership +--------------------------------- + +The vlib_next_frame_t is the last key graph dispatcher data structure: + +.. code:: c + + typedef struct + { + /* Frame index. */ + u32 frame_index; + + /* Node runtime for this next. */ + u32 node_runtime_index; + + /* Next frame flags. */ + u32 flags; + + /* Reflects node frame-used flag for this next. */ + #define VLIB_FRAME_NO_FREE_AFTER_DISPATCH \ + VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH + + /* This next frame owns enqueue to node + corresponding to node_runtime_index. */ + #define VLIB_FRAME_OWNER (1 << 15) + + /* Set when frame has been allocated for this next. */ + #define VLIB_FRAME_IS_ALLOCATED VLIB_NODE_FLAG_IS_OUTPUT + + /* Set when frame has been added to pending vector. */ + #define VLIB_FRAME_PENDING VLIB_NODE_FLAG_IS_DROP + + /* Set when frame is to be freed after dispatch. */ + #define VLIB_FRAME_FREE_AFTER_DISPATCH VLIB_NODE_FLAG_IS_PUNT + + /* Set when frame has traced packets. */ + #define VLIB_FRAME_TRACE VLIB_NODE_FLAG_TRACE + + /* Number of vectors enqueue to this next since last overflow. */ + u32 vectors_since_last_overflow; + } vlib_next_frame_t; + +Graph node dispatch functions call vlib_get_next_frame (…) to set “(u32 +\*)to_next” to the right place in the vlib_frame_t corresponding to the +ith arc (aka next0) from the current node to the indicated next node. + +After some scuffling around - two levels of macros - processing reaches +vlib_get_next_frame_internal (…). Get-next-frame-internal digs up the +vlib_next_frame_t corresponding to the desired graph arc. + +The next frame data structure amounts to a graph-arc-centric frame +cache. 
Once a node finishes adding elements to a frame, it will acquire a +vlib_pending_frame_t and end up on the graph dispatcher’s run-queue. But +there’s no guarantee that more vector elements won’t be added to the +underlying frame from the same (source_node, next_index) arc or from a +different (source_node, next_index) arc. + +Maintaining consistency of the arc-to-frame cache is necessary. The +first step in maintaining consistency is to make sure that only one +graph node at a time thinks it “owns” the target vlib_frame_t. + +Back to the graph node dispatch function. In the usual case, a certain +number of packets will be added to the vlib_frame_t acquired by calling +vlib_get_next_frame (…). + +Before a dispatch function returns, it’s required to call +vlib_put_next_frame (…) for all of the graph arcs it actually used. This +action adds a vlib_pending_frame_t to the graph dispatcher’s pending +frame vector. + +Vlib_put_next_frame makes a note in the pending frame of the frame +index, and also of the vlib_next_frame_t index. + +dispatch_pending_node actions +----------------------------- + +The main graph dispatch loop calls dispatch_pending_node as shown above. + +Dispatch_pending_node recovers the pending frame, and the graph node +runtime / dispatch function. Further, it recovers the next_frame +currently associated with the vlib_frame_t, and detaches the +vlib_frame_t from the next_frame. + +In …/src/vlib/main.c:dispatch_pending_node(…), note this stanza: + +.. code:: c + + /* Force allocation of new frame while current frame is being + dispatched. */ + restore_frame_index = ~0; + if (nf->frame_index == p->frame_index) + { + nf->frame_index = ~0; + nf->flags &= ~VLIB_FRAME_IS_ALLOCATED; + if (!(n->flags & VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH)) + restore_frame_index = p->frame_index; + } + +dispatch_pending_node is worth a hard stare due to the several +second-order optimizations it implements. 
Almost as an afterthought, it +calls dispatch_node which actually calls the graph node dispatch +function. + +Process / thread model +---------------------- + +vlib provides an ultra-lightweight cooperative multi-tasking thread +model. The graph node scheduler invokes these processes in much the same +way as traditional vector-processing run-to-completion graph nodes; +plus-or-minus a setjmp/longjmp pair required to switch stacks. Simply +set the vlib_node_registration_t type field to VLIB_NODE_TYPE_PROCESS. +Yes, process is a misnomer. These are cooperative multi-tasking threads. + +As of this writing, the default stack size is 1<<15 = 32kb. Initialize +the node registration’s process_log2_n_stack_bytes member as needed. The +graph node dispatcher makes some effort to detect stack overrun, e.g. by +mapping a no-access page below each thread stack. + +Process node dispatch functions are expected to be “while(1) { }” loops +which suspend when not otherwise occupied, and which must not run for +unreasonably long periods of time. + +“Unreasonably long” is an application-dependent concept. Over the years, +we have constructed frame-size sensitive control-plane nodes which will +use a much higher fraction of the available CPU bandwidth when the frame +size is low. The classic example: modifying forwarding tables. So long +as the table-builder leaves the forwarding tables in a valid state, one +can suspend the table builder to avoid dropping packets as a result of +control-plane activity. + +Process nodes can suspend for fixed amounts of time, or until another +entity signals an event, or both. See the next section for a description +of the vlib process event mechanism. + +When running in vlib process context, one must pay strict attention to +loop invariant issues. If one walks a data structure and calls a +function which may suspend, one had best know by construction that it +cannot change. 
Often, it’s best to simply make a snapshot copy of a data +structure, walk the copy at leisure, then free the copy. + +Process events +-------------- + +The vlib process event mechanism API is extremely lightweight and easy +to use. Here is a typical example: + +.. code:: c + + vlib_main_t *vm = &vlib_global_main; + uword event_type, * event_data = 0; + + while (1) + { + vlib_process_wait_for_event_or_clock (vm, 5.0 /* seconds */); + + event_type = vlib_process_get_events (vm, &event_data); + + switch (event_type) { + case EVENT1: + handle_event1s (event_data); + break; + + case EVENT2: + handle_event2s (event_data); + break; + + case ~0: /* 5-second idle/periodic */ + handle_idle (); + break; + + default: /* bug! */ + ASSERT (0); + } + + vec_reset_length(event_data); + } + +In this example, the VLIB process node waits for an event to occur, or +for 5 seconds to elapse. The code demuxes on the event type, calling the +appropriate handler function. Each call to vlib_process_get_events +returns a vector of per-event-type data passed to successive +vlib_process_signal_event calls; it is a serious error to process only +event_data[0]. + +Resetting the event_data vector-length to 0 [instead of calling +vec_free] means that the event scheme doesn’t burn cycles continuously +allocating and freeing the event data vector. This is a common vppinfra +/ vlib coding pattern, well worth using when appropriate. + +Signaling an event is easy, for example: + +.. code:: c + + vlib_process_signal_event (vm, process_node_index, EVENT1, + (uword)arbitrary_event1_data); /* and so forth */ + +One can either know the process node index by construction - dig it out +of the appropriate vlib_node_registration_t - or by finding the +vlib_node_t with vlib_get_node_by_name(…). + +Buffers +------- + +vlib buffering solves the usual set of packet-processing problems, +albeit at high performance. 
Key in terms of performance: one ordinarily +allocates / frees N buffers at a time rather than one at a time. Except +when operating directly on a specific buffer, one deals with buffers by +index, not by pointer. + +Packet-processing frames are u32[] arrays, not vlib_buffer_t[] arrays. + +Packets comprise one or more vlib buffers, chained together as required. +Multiple particle sizes are supported; hardware input nodes simply ask +for the required size(s). Coalescing support is available. For obvious +reasons one is discouraged from writing one’s own wild and wacky buffer +chain traversal code. + +vlib buffer headers are allocated immediately prior to the buffer data +area. In typical packet processing this saves a dependent read wait: +given a buffer’s address, one can prefetch the buffer header [metadata] +at the same time as the first cache line of buffer data. + +Buffer header metadata (vlib_buffer_t) includes the usual rewrite +expansion space, a current_data offset, RX and TX interface indices, +packet trace information, and opaque areas. + +The opaque data is intended to control packet processing in arbitrary +subgraph-dependent ways. The programmer shoulders responsibility for +data lifetime analysis, type-checking, etc. + +Buffers have reference-counts in support of e.g. multicast replication. + +Shared-memory message API +------------------------- + +Local control-plane and application processes interact with the vpp +dataplane via asynchronous message-passing in shared memory over +unidirectional queues. The same application APIs are available via +sockets. + +Capturing API traces and replaying them in a simulation environment +requires a disciplined approach to the problem. This seems like a +make-work task, but it is not. When something goes wrong in the +control-plane after 300,000 or 3,000,000 operations, high-speed replay +of the events leading up to the accident is a huge win. 
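One recurring idiom in this transport is a preallocated ring of fixed-size message slots, each guarded by a per-slot busy bit which the consumer clears to free the message. Here is a standalone toy model of that discipline (hypothetical names; this is not the actual vpp allocator code):

```c
#include <stddef.h>

/* Standalone model of a preallocated message ring with per-slot busy
 * bits.  Because messages are consumed in order, freeing a message is
 * just clearing its busy bit - no locking required. */
#define RING_SIZE 8

typedef struct
{
  volatile int busy;
  char data[64];
} msg_slot_t;

static msg_slot_t ring[RING_SIZE];
static unsigned head;           /* next slot to hand out */

static msg_slot_t *
msg_alloc (void)
{
  msg_slot_t *s = &ring[head % RING_SIZE];
  if (s->busy)
    return NULL;                /* ring full: caller must wait or fall back */
  s->busy = 1;
  head++;
  return s;
}

static void
msg_free (msg_slot_t *s)
{
  s->busy = 0;                  /* consumer clears the busy bit */
}
```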
+ +The shared-memory message API message allocator vl_api_msg_alloc uses a +particularly cute trick. Since messages are processed in order, we try +to allocate message buffering from a set of fixed-size, preallocated +rings. Each ring item has a “busy” bit. Freeing one of the preallocated +message buffers merely requires the message consumer to clear the busy +bit. No locking required. + +Debug CLI +--------- + +Adding debug CLI commands to VLIB applications is very simple. + +Here is a complete example: + +.. code:: c + + static clib_error_t * + show_ip_tuple_match (vlib_main_t * vm, + unformat_input_t * input, + vlib_cli_command_t * cmd) + { + vlib_cli_output (vm, "%U\n", format_ip_tuple_match_tables, &routing_main); + return 0; + } + + static VLIB_CLI_COMMAND (show_ip_tuple_command) = + { + .path = "show ip tuple match", + .short_help = "Show ip 5-tuple match-and-broadcast tables", + .function = show_ip_tuple_match, + }; + +This example implements the “show ip tuple match” debug cli command. In +ordinary usage, the vlib cli is available via the “vppctl” application, +which sends traffic to a named pipe. One can configure debug CLI telnet +access on a configurable port. + +The cli implementation has an output redirection facility which makes it +simple to deliver cli output via shared-memory API messaging. + +Particularly for debug or “show tech support” type commands, it would be +wasteful to write vlib application code to pack binary data, write more +code elsewhere to unpack the data and finally print the answer. If a +certain cli command has the potential to hurt packet processing +performance by running for too long, do the work incrementally in a +process node. The client can wait. + +Macro expansion +~~~~~~~~~~~~~~~ + +The vpp debug CLI engine includes a recursive macro expander. 
This is +quite useful for factoring out address and/or interface name specifics: + +:: + + define ip1 192.168.1.1/24 + define ip2 192.168.2.1/24 + define iface1 GigabitEthernet3/0/0 + define iface2 loop1 + + set int ip address $iface1 $ip1 + set int ip address $iface2 $(ip2) + + undefine ip1 + undefine ip2 + undefine iface1 + undefine iface2 + +Each socket (or telnet) debug CLI session has its own macro tables. All +debug CLI sessions which use CLI_INBAND binary API messages share a +single table. + +The macro expander recognizes circular definitions: + +:: + + define foo \$(bar) + define bar \$(mumble) + define mumble \$(foo) + +At 8 levels of recursion, the macro expander throws up its hands and +replies “CIRCULAR.” + +Macro-related debug CLI commands +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In addition to the “define” and “undefine” debug CLI commands, use “show +macro [noevaluate]” to dump the macro table. The “echo” debug CLI +command will evaluate and print its argument: + +:: + + vpp# define foo This\ Is\ Foo + vpp# echo $foo + This Is Foo + +Handing off buffers between threads +----------------------------------- + +Vlib includes an easy-to-use mechanism for handing off buffers between +worker threads. A typical use-case: software ingress flow hashing. At a +high level, one creates a per-worker-thread queue which sends packets to +a specific graph node in the indicated worker thread. With the queue in +hand, enqueue packets to the worker thread of your choice. + +Initialize a handoff queue +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Simple enough, call vlib_frame_queue_main_init: + +.. code:: c + + main_ptr->frame_queue_index + = vlib_frame_queue_main_init (dest_node.index, frame_queue_size); + +Frame_queue_size means what it says: the number of frames which may be +queued. Since frames contain 1…256 packets, frame_queue_size should be a +reasonably small number (32…64). If the frame queue producer(s) are +faster than the frame queue consumer(s), congestion will occur. 
Suggest +letting the enqueue operator deal with queue congestion, as shown in the +enqueue example below. + +Under the floorboards, vlib_frame_queue_main_init creates an input queue +for each worker thread. + +Please do NOT create frame queues until it’s clear that they will be +used. Although the main dispatch loop is reasonably smart about how +often it polls the (entire set of) frame queues, polling unused frame +queues is a waste of clock cycles. + +Hand off packets +~~~~~~~~~~~~~~~~ + +The actual handoff mechanics are simple, and integrate nicely with a +typical graph-node dispatch function: + +.. code:: c + + always_inline uword + do_handoff_inline (vlib_main_t * vm, + vlib_node_runtime_t * node, vlib_frame_t * frame, + int is_ip4, int is_trace) + { + u32 n_left_from, *from; + vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b; + u16 thread_indices [VLIB_FRAME_SIZE]; + u16 nexts[VLIB_FRAME_SIZE], *next; + u32 n_enq; + htest_main_t *hmp = &htest_main; + int i; + + from = vlib_frame_vector_args (frame); + n_left_from = frame->n_vectors; + + vlib_get_buffers (vm, from, bufs, n_left_from); + next = nexts; + b = bufs; + + /* + * Typical frame traversal loop, details vary with + * use case. Make sure to set thread_indices[i] with + * the desired destination thread index. You may + * or may not bother to set next[i]. 
+ */ + + for (i = 0; i < frame->n_vectors; i++) + { + <snip> + /* Pick a thread to handle this packet */ + thread_indices[i] = f (packet_data_or_whatever); + <snip> + + b += 1; + next += 1; + n_left_from -= 1; + } + + /* Enqueue buffers to threads */ + n_enq = + vlib_buffer_enqueue_to_thread (vm, node, hmp->frame_queue_index, + from, thread_indices, frame->n_vectors, + 1 /* drop on congestion */); + /* Typical counters */ + if (n_enq < frame->n_vectors) + vlib_node_increment_counter (vm, node->node_index, + XXX_ERROR_CONGESTION_DROP, + frame->n_vectors - n_enq); + vlib_node_increment_counter (vm, node->node_index, + XXX_ERROR_HANDED_OFF, n_enq); + return frame->n_vectors; + } + +Notes about calling vlib_buffer_enqueue_to_thread(…): + +- If you pass “drop on congestion” non-zero, all packets in the inbound + frame will be consumed one way or the other. This is the recommended + setting. + +- In the drop-on-congestion case, please don’t try to “help” in the + enqueue node by freeing dropped packets, or by pushing them to + “error-drop.” Either of those actions would be a severe error. + +- It’s perfectly OK to enqueue packets to the current thread. + +Handoff Demo Plugin +------------------- + +Check out the sample (plugin) example in …/src/examples/handoffdemo. If +you want to build the handoff demo plugin: + +:: + + $ cd .../src/plugins + $ ln -s ../examples/handoffdemo + +This plugin provides a simple example of how to hand off packets between +threads. We used it to debug packet-tracer handoff tracing support. + +Packet generator input script +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + packet-generator new { + name x + limit 5 + size 128-128 + interface local0 + node handoffdemo-1 + data { + incrementing 30 + } + } + +Start vpp with 2 worker threads +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The demo plugin hands packets from worker 1 to worker 2. 
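A minimal startup.conf stanza requesting two worker threads might look like the following; exact core pinning and the rest of the configuration vary by system, so treat this as a sketch:

```
cpu {
  workers 2
}
```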
+ +Enable tracing, and start the packet generator +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + trace add pg-input 100 + packet-generator enable + +Sample Run +~~~~~~~~~~ + +:: + + DBGvpp# ex /tmp/pg_input_script + DBGvpp# pa en + DBGvpp# sh err + Count Node Reason + 5 handoffdemo-1 packets handed off processed + 5 handoffdemo-2 completed packets + DBGvpp# show run + Thread 1 vpp_wk_0 (lcore 0) + Time 133.9, average vectors/node 5.00, last 128 main loops 0.00 per node 0.00 + vector rates in 3.7331e-2, out 0.0000e0, drop 0.0000e0, punt 0.0000e0 + Name State Calls Vectors Suspends Clocks Vectors/Call + handoffdemo-1 active 1 5 0 4.76e3 5.00 + pg-input disabled 2 5 0 5.58e4 2.50 + unix-epoll-input polling 22760 0 0 2.14e7 0.00 + --------------- + Thread 2 vpp_wk_1 (lcore 2) + Time 133.9, average vectors/node 5.00, last 128 main loops 0.00 per node 0.00 + vector rates in 0.0000e0, out 0.0000e0, drop 3.7331e-2, punt 0.0000e0 + Name State Calls Vectors Suspends Clocks Vectors/Call + drop active 1 5 0 1.35e4 5.00 + error-drop active 1 5 0 2.52e4 5.00 + handoffdemo-2 active 1 5 0 2.56e4 5.00 + unix-epoll-input polling 22406 0 0 2.18e7 0.00 + +Enable the packet tracer and run it again… + +:: + + DBGvpp# trace add pg-input 100 + DBGvpp# pa en + DBGvpp# sh trace + sh trace + ------------------- Start of thread 0 vpp_main ------------------- + No packets in trace buffer + ------------------- Start of thread 1 vpp_wk_0 ------------------- + Packet 1 + + 00:06:50:520688: pg-input + stream x, 128 bytes, 0 sw_if_index + current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000000 + 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000 + 00000020: 0000000000000000000000000000000000000000000000000000000000000000 + 00000040: 0000000000000000000000000000000000000000000000000000000000000000 + 00000060: 0000000000000000000000000000000000000000000000000000000000000000 + 00:06:50:520762: handoffdemo-1 + HANDOFFDEMO: current thread 1 + + 
Packet 2 + + 00:06:50:520688: pg-input + stream x, 128 bytes, 0 sw_if_index + current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000001 + 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000 + 00000020: 0000000000000000000000000000000000000000000000000000000000000000 + 00000040: 0000000000000000000000000000000000000000000000000000000000000000 + 00000060: 0000000000000000000000000000000000000000000000000000000000000000 + 00:06:50:520762: handoffdemo-1 + HANDOFFDEMO: current thread 1 + + Packet 3 + + 00:06:50:520688: pg-input + stream x, 128 bytes, 0 sw_if_index + current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000002 + 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000 + 00000020: 0000000000000000000000000000000000000000000000000000000000000000 + 00000040: 0000000000000000000000000000000000000000000000000000000000000000 + 00000060: 0000000000000000000000000000000000000000000000000000000000000000 + 00:06:50:520762: handoffdemo-1 + HANDOFFDEMO: current thread 1 + + Packet 4 + + 00:06:50:520688: pg-input + stream x, 128 bytes, 0 sw_if_index + current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000003 + 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000 + 00000020: 0000000000000000000000000000000000000000000000000000000000000000 + 00000040: 0000000000000000000000000000000000000000000000000000000000000000 + 00000060: 0000000000000000000000000000000000000000000000000000000000000000 + 00:06:50:520762: handoffdemo-1 + HANDOFFDEMO: current thread 1 + + Packet 5 + + 00:06:50:520688: pg-input + stream x, 128 bytes, 0 sw_if_index + current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000004 + 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000 + 00000020: 0000000000000000000000000000000000000000000000000000000000000000 + 00000040: 0000000000000000000000000000000000000000000000000000000000000000 + 
00000060: 0000000000000000000000000000000000000000000000000000000000000000 + 00:06:50:520762: handoffdemo-1 + HANDOFFDEMO: current thread 1 + + ------------------- Start of thread 2 vpp_wk_1 ------------------- + Packet 1 + + 00:06:50:520796: handoff_trace + HANDED-OFF: from thread 1 trace index 0 + 00:06:50:520796: handoffdemo-2 + HANDOFFDEMO: current thread 2 + 00:06:50:520867: error-drop + rx:local0 + 00:06:50:520914: drop + handoffdemo-2: completed packets + + Packet 2 + + 00:06:50:520796: handoff_trace + HANDED-OFF: from thread 1 trace index 1 + 00:06:50:520796: handoffdemo-2 + HANDOFFDEMO: current thread 2 + 00:06:50:520867: error-drop + rx:local0 + 00:06:50:520914: drop + handoffdemo-2: completed packets + + Packet 3 + + 00:06:50:520796: handoff_trace + HANDED-OFF: from thread 1 trace index 2 + 00:06:50:520796: handoffdemo-2 + HANDOFFDEMO: current thread 2 + 00:06:50:520867: error-drop + rx:local0 + 00:06:50:520914: drop + handoffdemo-2: completed packets + + Packet 4 + + 00:06:50:520796: handoff_trace + HANDED-OFF: from thread 1 trace index 3 + 00:06:50:520796: handoffdemo-2 + HANDOFFDEMO: current thread 2 + 00:06:50:520867: error-drop + rx:local0 + 00:06:50:520914: drop + handoffdemo-2: completed packets + + Packet 5 + + 00:06:50:520796: handoff_trace + HANDED-OFF: from thread 1 trace index 4 + 00:06:50:520796: handoffdemo-2 + HANDOFFDEMO: current thread 2 + 00:06:50:520867: error-drop + rx:local0 + 00:06:50:520914: drop + handoffdemo-2: completed packets + DBGvpp# diff --git a/docs/developer/corearchitecture/vnet.rst b/docs/developer/corearchitecture/vnet.rst new file mode 100644 index 00000000000..812e2fb4f8a --- /dev/null +++ b/docs/developer/corearchitecture/vnet.rst @@ -0,0 +1,807 @@ +VNET (VPP Network Stack) +======================== + +The files associated with the VPP network stack layer are located in the +*./src/vnet* folder. The Network Stack Layer is basically an +instantiation of the code in the other layers. 
This layer has a vnet +library that provides vectorized layer-2 and 3 networking graph nodes, a +packet generator, and a packet tracer. + +In terms of building a packet processing application, vnet provides a +platform-independent subgraph to which one connects a couple of +device-driver nodes. + +Typical RX connections include “ethernet-input” [full software +classification, feeds ipv4-input, ipv6-input, arp-input etc.] and +“ipv4-input-no-checksum” [if hardware can classify, perform ipv4 header +checksum]. + +Effective graph dispatch function coding +---------------------------------------- + +Over the years, multiple coding styles have emerged: a +single/dual/quad loop coding model (with variations) and a +fully-pipelined coding model. + +Single/dual loops +----------------- + +The single/dual/quad loop model variations conveniently solve problems +where the number of items to process is not known in advance: typical +hardware RX-ring processing. This coding style is also very effective +when a given node will not need to cover a complex set of dependent +reads. + +Here is a quad/single loop which can leverage up-to-avx512 SIMD vector +units to convert buffer indices to buffer pointers: + +.. code:: c + + static uword + simulated_ethernet_interface_tx (vlib_main_t * vm, + vlib_node_runtime_t * + node, vlib_frame_t * frame) + { + u32 n_left_from, *from; + u32 next_index = 0; + u32 n_bytes; + u32 thread_index = vm->thread_index; + vnet_main_t *vnm = vnet_get_main (); + vnet_interface_main_t *im = &vnm->interface_main; + vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b; + u16 nexts[VLIB_FRAME_SIZE], *next; + + n_left_from = frame->n_vectors; + from = vlib_frame_vector_args (frame); + + /* + * Convert up to VLIB_FRAME_SIZE indices in "from" to + * buffer pointers in bufs[] + */ + vlib_get_buffers (vm, from, bufs, n_left_from); + b = bufs; + next = nexts; + + /* + * While we have at least 4 vector elements (pkts) to process.. 
+ */ + while (n_left_from >= 4) + { + /* Prefetch next quad-loop iteration. */ + if (PREDICT_TRUE (n_left_from >= 8)) + { + vlib_prefetch_buffer_header (b[4], STORE); + vlib_prefetch_buffer_header (b[5], STORE); + vlib_prefetch_buffer_header (b[6], STORE); + vlib_prefetch_buffer_header (b[7], STORE); + } + + /* + * $$$ Process 4x packets right here... + * set next[0..3] to send the packets where they need to go + */ + + do_something_to (b[0]); + do_something_to (b[1]); + do_something_to (b[2]); + do_something_to (b[3]); + + /* Process the next 0..4 packets */ + b += 4; + next += 4; + n_left_from -= 4; + } + /* + * Clean up 0...3 remaining packets at the end of the incoming frame + */ + while (n_left_from > 0) + { + /* + * $$$ Process one packet right here... + * set next[0] to send the packet where it needs to go + */ + do_something_to (b[0]); + + /* Process the next packet */ + b += 1; + next += 1; + n_left_from -= 1; + } + + /* + * Send the packets along their respective next-node graph arcs + * Considerable locality of reference is expected, most if not all + * packets in the inbound vector will traverse the same next-node + * arc + */ + vlib_buffer_enqueue_to_next (vm, node, from, nexts, frame->n_vectors); + + return frame->n_vectors; + } + +Given a packet processing task to implement, it pays to scout around +looking for similar tasks, and think about using the same coding +pattern. It is not uncommon to recode a given graph node dispatch +function several times during performance optimization. + +Creating Packets from Scratch +----------------------------- + +At times, it’s necessary to create packets from scratch and send them. +Tasks like sending keepalives or actively opening connections come to +mind. It’s not difficult, but accurate buffer metadata setup is required. + +Allocating Buffers +~~~~~~~~~~~~~~~~~~ + +Use vlib_buffer_alloc, which allocates a set of buffer indices. For +low-performance applications, it’s OK to allocate one buffer at a time. 
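The batch-allocation idiom used in the high-performance case can be modeled standalone. The sketch below uses hypothetical names (a fake index counter stands in for the real allocator); it illustrates only the handout-from-the-end-of-the-vector pattern, not the actual tcp code:

```c
/* Standalone model of batch allocation: refill a cached vector of
 * buffer indices in bulk, then hand indices out one at a time from
 * the end of the vector, refilling only when it runs dry. */
#define BATCH 32

static unsigned cache_vec[BATCH];
static int cache_len;
static unsigned next_fake_index;   /* stands in for the real allocator */

static int
refill_cache (void)
{
  for (int i = 0; i < BATCH; i++)
    cache_vec[i] = next_fake_index++;
  cache_len = BATCH;
  return BATCH;
}

/* Returns 0 on success, -1 if no buffer could be obtained. */
static int
get_free_buffer_index (unsigned *bi)
{
  if (cache_len == 0 && refill_cache () == 0)
    return -1;
  *bi = cache_vec[--cache_len];    /* hand out from the end */
  return 0;
}
```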
+Note that vlib_buffer_alloc(…) does NOT initialize buffer metadata. See +below. + +In high-performance cases, allocate a vector of buffer indices, and hand +them out from the end of the vector; decrement \_vec_len(..) as buffer +indices are allocated. See tcp_alloc_tx_buffers(…) and +tcp_get_free_buffer_index(…) for an example. + +Buffer Initialization Example +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following example shows the **main points**, but is not to be +blindly cut-’n-pasted. + +.. code:: c + + u32 bi0; + vlib_buffer_t *b0; + ip4_header_t *ip; + udp_header_t *udp; + + /* Allocate a buffer */ + if (vlib_buffer_alloc (vm, &bi0, 1) != 1) + return -1; + + b0 = vlib_get_buffer (vm, bi0); + + /* At this point b0->current_data = 0, b0->current_length = 0 */ + + /* + * Copy data into the buffer. This example ASSUMES that data will fit + * in a single buffer, and is e.g. an ip4 packet. + */ + if (have_packet_rewrite) + { + clib_memcpy (b0->data, data, vec_len (data)); + b0->current_length = vec_len (data); + } + else + { + /* OR, build a udp-ip packet (for example) */ + ip = vlib_buffer_get_current (b0); + udp = (udp_header_t *) (ip + 1); + data_dst = (u8 *) (udp + 1); + + ip->ip_version_and_header_length = 0x45; + ip->ttl = 254; + ip->protocol = IP_PROTOCOL_UDP; + ip->length = clib_host_to_net_u16 (sizeof (*ip) + sizeof (*udp) + + vec_len(udp_data)); + ip->src_address.as_u32 = src_address->as_u32; + ip->dst_address.as_u32 = dst_address->as_u32; + udp->src_port = clib_host_to_net_u16 (src_port); + udp->dst_port = clib_host_to_net_u16 (dst_port); + udp->length = clib_host_to_net_u16 (vec_len (udp_data)); + clib_memcpy (data_dst, udp_data, vec_len(udp_data)); + + if (compute_udp_checksum) + { + /* RFC 7011 section 10.3.2. 
*/ + udp->checksum = ip4_tcp_udp_compute_checksum (vm, b0, ip); + if (udp->checksum == 0) + udp->checksum = 0xffff; + } + b0->current_length = sizeof (*ip) + sizeof (*udp) + + vec_len (udp_data); + + } + b0->flags |= VLIB_BUFFER_TOTAL_LENGTH_VALID; + + /* sw_if_index 0 is the "local" interface, which always exists */ + vnet_buffer (b0)->sw_if_index[VLIB_RX] = 0; + + /* Use the default FIB index for tx lookup. Set non-zero to use another fib */ + vnet_buffer (b0)->sw_if_index[VLIB_TX] = 0; + +If your use-case calls for large packet transmission, use +vlib_buffer_chain_append_data_with_alloc(…) to create the requisite +buffer chain. + +Enqueueing packets for lookup and transmission +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The simplest way to send a set of packets is to use +vlib_get_frame_to_node(…) to allocate fresh frame(s) to ip4_lookup_node +or ip6_lookup_node, add the constructed buffer indices, and dispatch the +frame using vlib_put_frame_to_node(…). + +.. code:: c + + vlib_frame_t *f; + f = vlib_get_frame_to_node (vm, ip4_lookup_node.index); + f->n_vectors = vec_len(buffer_indices_to_send); + to_next = vlib_frame_vector_args (f); + + for (i = 0; i < vec_len (buffer_indices_to_send); i++) + to_next[i] = buffer_indices_to_send[i]; + + vlib_put_frame_to_node (vm, ip4_lookup_node.index, f); + +It is inefficient to allocate and schedule single packet frames. That’s +fine if you need to send one packet per second, but it should +**not** occur in a for-loop! + +Packet tracer +------------- + +Vlib includes a frame element [packet] trace facility, with a simple +debug CLI interface. The cli is straightforward: “trace add +input-node-name count” to start capturing packet traces. + +To trace 100 packets on a typical x86_64 system running the dpdk plugin: +“trace add dpdk-input 100”. 
When using the packet generator: “trace add +pg-input 100” + +To display the packet trace: “show trace” + +Each graph node has the opportunity to capture its own trace data. It is +almost always a good idea to do so. The trace capture APIs are simple. + +The packet capture APIs snapshot binary data, to minimize processing at +capture time. Each participating graph node initialization provides a +vppinfra format-style user function to pretty-print data when required +by the VLIB “show trace” command. + +Set the VLIB node registration “.format_trace” member to the name of the +per-graph node format function. + +Here’s a simple example: + +.. code:: c + + u8 * my_node_format_trace (u8 * s, va_list * args) + { + vlib_main_t * vm = va_arg (*args, vlib_main_t *); + vlib_node_t * node = va_arg (*args, vlib_node_t *); + my_node_trace_t * t = va_arg (*args, my_node_trace_t *); + + s = format (s, "My trace data was: %d", t-><whatever>); + + return s; + } + +The trace framework hands the per-node format function the data it +captured as the packet whizzed by. The format function pretty-prints the +data as desired. + +Graph Dispatcher Pcap Tracing +----------------------------- + +The vpp graph dispatcher knows how to capture vectors of packets in pcap +format as they’re dispatched. The pcap captures are as follows: + +:: + + VPP graph dispatch trace record description: + + 0 1 2 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Major Version | Minor Version | NStrings | ProtoHint | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Buffer index (big endian) | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + VPP graph node name ... ... | NULL octet | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Buffer Metadata ... ... | NULL octet | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Buffer Opaque ... 
... | NULL octet | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Buffer Opaque 2 ... ... | NULL octet | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | VPP ASCII packet trace (if NStrings > 4) | NULL octet | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | Packet data (up to 16K) | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + +Graph dispatch records comprise a version stamp, an indication of how +many NULL-terminated strings will follow the record header and precede +packet data, and a protocol hint. + +The buffer index is an opaque 32-bit cookie which allows consumers of +these data to easily filter/track single packets as they traverse the +forwarding graph. + +Multiple records per packet are normal, and to be expected. Packets will +appear multiple times as they traverse the vpp forwarding graph. In this +way, vpp graph dispatch traces are significantly different from regular +network packet captures from an end-station. This property complicates +stateful packet analysis. + +Restricting stateful analysis to records from a single vpp graph node +such as “ethernet-input” seems likely to improve the situation. + +As of this writing: major version = 1, minor version = 0. NStrings +SHOULD be 4 or 5. Consumers SHOULD be wary of values less than 4 or greater +than 5. They MAY attempt to display the claimed number of strings, or +they MAY treat the condition as an error. + +Here is the current set of protocol hints: + +.. code:: c + + typedef enum + { + VLIB_NODE_PROTO_HINT_NONE = 0, + VLIB_NODE_PROTO_HINT_ETHERNET, + VLIB_NODE_PROTO_HINT_IP4, + VLIB_NODE_PROTO_HINT_IP6, + VLIB_NODE_PROTO_HINT_TCP, + VLIB_NODE_PROTO_HINT_UDP, + VLIB_NODE_N_PROTO_HINTS, + } vlib_node_proto_hint_t; + +Example: VLIB_NODE_PROTO_HINT_IP6 means that the first octet of packet +data SHOULD be 0x60, and should begin an ipv6 packet header. 
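A consumer-side plausibility check on the hint is cheap. Here is a standalone sketch; the enum values mirror the vpp definitions above, but the helper name is hypothetical and only the IP hints get a real check:

```c
/* Standalone sketch: validate a protocol hint against the first
 * octet of packet data before attempting stateful parsing. */
typedef enum
{
  VLIB_NODE_PROTO_HINT_NONE = 0,
  VLIB_NODE_PROTO_HINT_ETHERNET,
  VLIB_NODE_PROTO_HINT_IP4,
  VLIB_NODE_PROTO_HINT_IP6,
  VLIB_NODE_PROTO_HINT_TCP,
  VLIB_NODE_PROTO_HINT_UDP,
} vlib_node_proto_hint_t;

static int
hint_plausible (vlib_node_proto_hint_t hint, const unsigned char *data)
{
  switch (hint)
    {
    case VLIB_NODE_PROTO_HINT_IP4:
      return (data[0] & 0xF0) == 0x40;  /* ip version nibble = 4 */
    case VLIB_NODE_PROTO_HINT_IP6:
      return (data[0] & 0xF0) == 0x60;  /* ip version nibble = 6 */
    default:
      return 1;                         /* no cheap check; accept */
    }
}
```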
+ +Downstream consumers of these data SHOULD pay attention to the protocol +hint. They MUST tolerate inaccurate hints, which MAY occur from time to +time. + +Dispatch Pcap Trace Debug CLI +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To start a dispatch trace capture of up to 10,000 trace records: + +:: + + pcap dispatch trace on max 10000 file dispatch.pcap + +To start a dispatch trace which will also include standard vpp packet +tracing for packets which originate in dpdk-input: + +:: + + pcap dispatch trace on max 10000 file dispatch.pcap buffer-trace dpdk-input 1000 + +To save the pcap trace, e.g. in /tmp/dispatch.pcap: + +:: + + pcap dispatch trace off + +Wireshark dissection of dispatch pcap traces +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It almost goes without saying that we built a companion wireshark +dissector to display these traces. As of this writing, we have +upstreamed the wireshark dissector. + +Since it will be a while before wireshark/master/latest makes it into +all of the popular Linux distros, please see the “How to build a vpp +dispatch trace aware Wireshark” page for build info. + +Here is a sample packet dissection, with some fields omitted for +clarity. The point is that the wireshark dissector accurately displays +**all** of the vpp buffer metadata, and the name of the graph node in +question. 
+ +:: + + Frame 1: 2216 bytes on wire (17728 bits), 2216 bytes captured (17728 bits) + Encapsulation type: USER 13 (58) + [Protocols in frame: vpp:vpp-metadata:vpp-opaque:vpp-opaque2:eth:ethertype:ip:tcp:data] + VPP Dispatch Trace + BufferIndex: 0x00036663 + NodeName: ethernet-input + VPP Buffer Metadata + Metadata: flags: + Metadata: current_data: 0, current_length: 102 + Metadata: current_config_index: 0, flow_id: 0, next_buffer: 0 + Metadata: error: 0, n_add_refs: 0, buffer_pool_index: 0 + Metadata: trace_index: 0, recycle_count: 0, len_not_first_buf: 0 + Metadata: free_list_index: 0 + Metadata: + VPP Buffer Opaque + Opaque: raw: 00000007 ffffffff 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 + Opaque: sw_if_index[VLIB_RX]: 7, sw_if_index[VLIB_TX]: -1 + Opaque: L2 offset 0, L3 offset 0, L4 offset 0, feature arc index 0 + Opaque: ip.adj_index[VLIB_RX]: 0, ip.adj_index[VLIB_TX]: 0 + Opaque: ip.flow_hash: 0x0, ip.save_protocol: 0x0, ip.fib_index: 0 + Opaque: ip.save_rewrite_length: 0, ip.rpf_id: 0 + Opaque: ip.icmp.type: 0 ip.icmp.code: 0, ip.icmp.data: 0x0 + Opaque: ip.reass.next_index: 0, ip.reass.estimated_mtu: 0 + Opaque: ip.reass.fragment_first: 0 ip.reass.fragment_last: 0 + Opaque: ip.reass.range_first: 0 ip.reass.range_last: 0 + Opaque: ip.reass.next_range_bi: 0x0, ip.reass.ip6_frag_hdr_offset: 0 + Opaque: mpls.ttl: 0, mpls.exp: 0, mpls.first: 0, mpls.save_rewrite_length: 0, mpls.bier.n_bytes: 0 + Opaque: l2.feature_bitmap: 00000000, l2.bd_index: 0, l2.l2_len: 0, l2.shg: 0, l2.l2fib_sn: 0, l2.bd_age: 0 + Opaque: l2.feature_bitmap_input: none configured, L2.feature_bitmap_output: none configured + Opaque: l2t.next_index: 0, l2t.session_index: 0 + Opaque: l2_classify.table_index: 0, l2_classify.opaque_index: 0, l2_classify.hash: 0x0 + Opaque: policer.index: 0 + Opaque: ipsec.flags: 0x0, ipsec.sad_index: 0 + Opaque: map.mtu: 0 + Opaque: map_t.v6.saddr: 0x0, map_t.v6.daddr: 0x0, map_t.v6.frag_offset: 0, map_t.v6.l4_offset: 0 + Opaque: 
map_t.v6.l4_protocol: 0, map_t.checksum_offset: 0, map_t.mtu: 0 + Opaque: ip_frag.mtu: 0, ip_frag.next_index: 0, ip_frag.flags: 0x0 + Opaque: cop.current_config_index: 0 + Opaque: lisp.overlay_afi: 0 + Opaque: tcp.connection_index: 0, tcp.seq_number: 0, tcp.seq_end: 0, tcp.ack_number: 0, tcp.hdr_offset: 0, tcp.data_offset: 0 + Opaque: tcp.data_len: 0, tcp.flags: 0x0 + Opaque: sctp.connection_index: 0, sctp.sid: 0, sctp.ssn: 0, sctp.tsn: 0, sctp.hdr_offset: 0 + Opaque: sctp.data_offset: 0, sctp.data_len: 0, sctp.subconn_idx: 0, sctp.flags: 0x0 + Opaque: snat.flags: 0x0 + Opaque: + VPP Buffer Opaque2 + Opaque2: raw: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 + Opaque2: qos.bits: 0, qos.source: 0 + Opaque2: loop_counter: 0 + Opaque2: gbp.flags: 0, gbp.src_epg: 0 + Opaque2: pg_replay_timestamp: 0 + Opaque2: + Ethernet II, Src: 06:d6:01:41:3b:92 (06:d6:01:41:3b:92), Dst: IntelCor_3d:f6 Transmission Control Protocol, Src Port: 22432, Dst Port: 54084, Seq: 1, Ack: 1, Len: 36 + Source Port: 22432 + Destination Port: 54084 + TCP payload (36 bytes) + Data (36 bytes) + + 0000 cf aa 8b f5 53 14 d4 c7 29 75 3e 56 63 93 9d 11 ....S...)u>Vc... + 0010 e5 f2 92 27 86 56 4c 21 ce c5 23 46 d7 eb ec 0d ...'.VL!..#F.... + 0020 a8 98 36 5a ..6Z + Data: cfaa8bf55314d4c729753e5663939d11e5f2922786564c21… + [Length: 36] + +It’s a matter of a couple of mouse-clicks in Wireshark to filter the +trace to a specific buffer index. With that specific kind of filtration, +one can watch a packet walk through the forwarding graph; noting any/all +metadata changes, header checksum changes, and so forth. + +This should be of significant value when developing new vpp graph nodes. +If new code mispositions b->current_data, it will be completely obvious +from looking at the dispatch trace in wireshark. 
+
+pcap rx, tx, and drop tracing
+-----------------------------
+
+vpp also supports rx, tx, and drop packet capture in pcap format,
+through the “pcap trace” debug CLI command.
+
+This command is used to start or stop a packet capture, or show the
+status of packet capture. Each of “pcap trace rx”, “pcap trace tx”, and
+“pcap trace drop” is implemented. Supply one or more of “rx”, “tx”, and
+“drop” to enable multiple simultaneous capture types.
+
+These commands have the following optional parameters:
+
+- rx - trace received packets.
+
+- tx - trace transmitted packets.
+
+- drop - trace dropped packets.
+
+- max *nnnn*\ - maximum number of packets to capture, which bounds the
+  file size. Once that many packets have been received, the trace
+  buffer is flushed to the indicated file. Defaults to 1000. Can only
+  be updated if packet capture is off.
+
+- max-bytes-per-pkt *nnnn*\ - maximum number of bytes to trace on a
+  per-packet basis. Must be >32 and less than 9000. Default value:
+  512.
+
+- filter - Use the pcap rx / tx / drop trace filter, which must be
+  configured. Use classify filter pcap… to configure the filter. The
+  filter will only be executed if the per-interface or any-interface
+  tests fail.
+
+- intfc *interface* \| *any*\ - Used to specify a given interface, or
+  use ‘any’ to run packet capture on all interfaces. ‘any’ is the
+  default if not provided. Settings from a previous packet capture are
+  preserved, so ‘any’ can be used to reset the interface setting.
+
+- file *filename*\ - Used to specify the output filename. The file
+  will be placed in the ‘/tmp’ directory. If *filename* already exists,
+  the file will be overwritten. If no filename is provided, ‘/tmp/rx.pcap
+  or tx.pcap’ will be used, depending on capture direction. Can only be
+  updated when pcap capture is off.
+
+- status - Displays the current status and configured attributes
+  associated with a packet capture.
If packet capture is in progress,
+  ‘status’ will also return the number of packets currently in the
+  buffer. Any additional attributes entered on the command line with a
+  ‘status’ request will be ignored.
+
+- filter - Capture packets which match the current packet trace filter
+  set. See the next section. Configure the capture filter first.
+
+packet trace capture filtering
+------------------------------
+
+The “classify filter pcap \| <interface> \| trace” debug CLI command
+constructs an arbitrary set of packet classifier tables for use with
+“pcap rx \| tx \| drop trace,” and with the vpp packet tracer on a
+per-interface or system-wide basis.
+
+Packets which match a rule in the classifier table chain will be traced.
+The tables are automatically ordered so that matches in the most
+specific table are tried first.
+
+It’s reasonably likely that folks will configure a single table with one
+or two matches. As a result, we configure 8 hash buckets and 128K of
+match rule space by default. One can override the defaults by specifying
+“buckets <nnn>” and “memory-size <nnn>” as desired.
+
+To build up complex filter chains, repeatedly issue the classify filter
+debug CLI command. Each command must specify the desired mask and match
+values. If a classifier table with a suitable mask already exists, the
+CLI command adds a match rule to the existing table. If not, the CLI
+command adds a new table with the indicated mask, plus the match rule.
+
+Configure a simple pcap classify filter
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+   classify filter pcap mask l3 ip4 src match l3 ip4 src 192.168.1.11
+   pcap trace rx max 100 filter
+
+Configure a simple per-interface capture filter
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+   classify filter GigabitEthernet3/0/0 mask l3 ip4 src match l3 ip4 src 192.168.1.11
+   pcap trace rx max 100 intfc GigabitEthernet3/0/0
+
+Note that per-interface capture filters are *always* applied.
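Conceptually, each classifier table applies a byte mask to the packet
and compares the result against the match value of each rule. A minimal
sketch of that mask/match step follows; it is illustrative only, since
vpp's real classifier works on fixed-size vectors with SIMD compares:

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if (pkt AND mask) equals match over n bytes, else 0.
   Bytes where the mask is zero are "don't care", which is exactly
   what "mask l3 ip4 src" expresses for everything but the IPv4
   source address. */
static int
classify_match (const uint8_t * pkt, const uint8_t * mask,
		const uint8_t * match, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    if ((pkt[i] & mask[i]) != match[i])
      return 0;
  return 1;
}
```

With this model, “mask l3 ip4 src match l3 ip4 src 192.168.1.11” is a
mask covering the 4 source-address bytes of the IPv4 header and a match
value of 192.168.1.11 in those positions.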
+
+Clear per-interface capture filters
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+   classify filter GigabitEthernet3/0/0 del
+
+Configure another fairly simple pcap classify filter
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+   classify filter pcap mask l3 ip4 src dst match l3 ip4 src 192.168.1.10 dst 192.168.2.10
+   pcap trace tx max 100 filter
+
+Configure a vpp packet tracer filter
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+   classify filter trace mask l3 ip4 src dst match l3 ip4 src 192.168.1.10 dst 192.168.2.10
+   trace add dpdk-input 100 filter
+
+Clear all current classifier filters
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+   classify filter [pcap | <interface> | trace] del
+
+To inspect the classifier tables
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+   show classify table [verbose]
+
+The verbose form displays all of the match rules, with hit-counters.
+
+Terse description of the “mask <xxx>” syntax:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+   l2 src dst proto tag1 tag2 ignore-tag1 ignore-tag2 cos1 cos2 dot1q dot1ad
+   l3 ip4 <ip4-mask> ip6 <ip6-mask>
+   <ip4-mask> version hdr_length src[/width] dst[/width]
+              tos length fragment_id ttl protocol checksum
+   <ip6-mask> version traffic-class flow-label src dst proto
+              payload_length hop_limit protocol
+   l4 tcp <tcp-mask> udp <udp-mask> src_port dst_port
+   <tcp-mask> src dst # ports
+   <udp-mask> src_port dst_port
+
+To construct **matches**, add the values to match after the indicated
+keywords in the mask syntax. For example: “… mask l3 ip4 src” ->
+“… match l3 ip4 src 192.168.1.11”
+
+VPP Packet Generator
+--------------------
+
+We use the VPP packet generator to inject packets into the forwarding
+graph. The packet generator can replay pcap traces, and generate packets
+out of whole cloth at respectably high performance.
+
+The VPP pg enables quite a variety of use-cases, ranging from functional
+testing of new data-plane nodes to regression testing to performance
+tuning.
+
+PG setup scripts
+----------------
+
+PG setup scripts describe traffic in detail, and leverage vpp debug CLI
+mechanisms. It’s reasonably unusual to construct a pg setup script which
+doesn’t include a certain amount of interface and FIB configuration.
+
+For example:
+
+::
+
+   loop create
+   set int ip address loop0 192.168.1.1/24
+   set int state loop0 up
+
+   packet-generator new {
+       name pg0
+       limit 100
+       rate 1e6
+       size 300-300
+       interface loop0
+       node ethernet-input
+       data { IP4: 1.2.3 -> 4.5.6
+              UDP: 192.168.1.10 - 192.168.1.254 -> 192.168.2.10
+              UDP: 1234 -> 2345
+              incrementing 286
+       }
+   }
+
+A packet generator stream definition includes two major sections:
+
+- Stream Parameter Setup
+- Packet Data
+
+Stream Parameter Setup
+~~~~~~~~~~~~~~~~~~~~~~
+
+Given the example above, let’s look at how to set up stream parameters:
+
+- **name pg0** - Name of the stream, in this case “pg0”
+
+- **limit 100** - Number of packets to send when the stream is
+  enabled. “limit 0” means send packets continuously.
+
+- **maxframe <nnn>** - Maximum frame size. Handy for injecting multiple
+  frames no larger than <nnn>. Useful for checking dual / quad loop
+  codes
+
+- **rate 1e6** - Packet injection rate, in this case 1 MPPS. When not
+  specified, the packet generator injects packets as fast as possible
+
+- **size 300-300** - Packet size range, in this case send 300-byte
+  packets
+
+- **interface loop0** - Packets appear as if they were received on the
+  specified interface. This datum is used in multiple ways: to select
+  graph arc feature configuration, to select IP FIBs. Configure
+  features e.g. on loop0 to exercise those features.
+
+- **tx-interface <name>** - Packets will be transmitted on the
+  indicated interface. Typically required only when injecting packets
+  into post-IP-rewrite graph nodes.
+
+- **pcap <filename>** - Replay packets from the indicated pcap capture
+  file.
“make test” makes extensive use of this feature: generate
+  packets using scapy, save them in a .pcap file, then inject them into
+  the vpp graph via a vpp pg “pcap <filename>” stream definition
+
+- **worker <nn>** - Generate packets for the stream using the indicated
+  vpp worker thread. The vpp pg generates and injects O(10 MPPS /
+  core). Use multiple stream definitions and worker threads to generate
+  and inject enough traffic to easily fill a 40 gbit pipe with small
+  packets.
+
+Data definition
+~~~~~~~~~~~~~~~
+
+Packet generator data definitions make use of a layered implementation
+strategy. Networking layers are specified in order, and the notation can
+seem a bit counter-intuitive. In the example above, the data definition
+stanza constructs a set of L2-L4 header layers, and uses an
+incrementing fill pattern to round out the requested 300-byte packets.
+
+- **IP4: 1.2.3 -> 4.5.6** - Construct an L2 (MAC) header with the ip4
+  ethertype (0x800), src MAC address of 00:01:00:02:00:03 and dst MAC
+  address of 00:04:00:05:00:06. Mac addresses may be specified in
+  either *xxxx.xxxx.xxxx* format or *xx:xx:xx:xx:xx:xx* format.
+
+- **UDP: 192.168.1.10 - 192.168.1.254 -> 192.168.2.10** - Construct an
+  incrementing set of L3 (IPv4) headers for successive packets with
+  source addresses ranging from .10 to .254. All packets in the stream
+  have a constant destination address of 192.168.2.10. Set the protocol
+  field to 17, UDP.
+
+- **UDP: 1234 -> 2345** - Set the UDP source and destination ports to
+  1234 and 2345, respectively
+
+- **incrementing 286** - Insert up to 286 incrementing data bytes.
+
+Obvious variations involve “s/IP4/IP6/” in the above, along with
+changing from IPv4 to IPv6 address notation.
+
+The vpp pg can set any / all IPv4 header fields, including tos, packet
+length, mf / df / fragment id and offset, ttl, protocol, checksum, and
+src/dst addresses. Take a look at ../src/vnet/ip/ip[46]_pg.c for
+details.
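The “1.2.3” MAC shorthand above expands each dot-separated number into a
16-bit big-endian field of the 6-byte address, so “1.2.3” becomes
00:01:00:02:00:03. A minimal sketch of that expansion; the function name
is invented for illustration, and the real parsing lives in the pg
unformat code:

```c
#include <stdint.h>

/* Expand the pg "a.b.c" MAC shorthand: each dot-separated number
   becomes one 16-bit big-endian field of the 6-byte MAC address. */
static void
pg_mac_from_shorthand (uint16_t a, uint16_t b, uint16_t c, uint8_t mac[6])
{
  mac[0] = a >> 8;
  mac[1] = a & 0xff;
  mac[2] = b >> 8;
  mac[3] = b & 0xff;
  mac[4] = c >> 8;
  mac[5] = c & 0xff;
}
```

For example, the stream definition’s “4.5.6” destination expands to
00:04:00:05:00:06 under this rule.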
+ +If all else fails, specify the entire packet data in hex: + +- **hex 0xabcd…** - copy hex data verbatim into the packet + +When replaying pcap files (“**pcap <filename>**”), do not specify a data +stanza. + +Diagnosing “packet-generator new” parse failures +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If you want to inject packets into a brand-new graph node, remember to +tell the packet generator debug CLI how to parse the packet data stanza. + +If the node expects L2 Ethernet MAC headers, specify “.unformat_buffer = +unformat_ethernet_header”: + +.. code:: c + + VLIB_REGISTER_NODE (ethernet_input_node) = + { + <snip> + .unformat_buffer = unformat_ethernet_header, + <snip> + }; + +Beyond that, it may be necessary to set breakpoints in +…/src/vnet/pg/cli.c. Debug image suggested. + +When debugging new nodes, it may be far simpler to directly inject +ethernet frames - and add a corresponding vlib_buffer_advance in the new +node - than to modify the packet generator. + +Debug CLI +--------- + +The descriptions above describe the “packet-generator new” debug CLI in +detail. + +Additional debug CLI commands include: + +:: + + vpp# packet-generator enable [<stream-name>] + +which enables the named stream, or all streams. + +:: + + vpp# packet-generator disable [<stream-name>] + +disables the named stream, or all streams. + +:: + + vpp# packet-generator delete <stream-name> + +Deletes the named stream. + +:: + + vpp# packet-generator configure <stream-name> [limit <nnn>] + [rate <f64-pps>] [size <nn>-<nn>] + +Changes stream parameters without having to recreate the entire stream +definition. Note that re-issuing a “packet-generator new” command will +correctly recreate the named stream. |