Diffstat (limited to 'docs/developer/corearchitecture')
17 files changed, 4112 insertions, 0 deletions
diff --git a/docs/developer/corearchitecture/bihash.rst b/docs/developer/corearchitecture/bihash.rst new file mode 100644 index 00000000000..9b62baaf9cf --- /dev/null +++ b/docs/developer/corearchitecture/bihash.rst @@ -0,0 +1,313 @@ +Bounded-index Extensible Hashing (bihash) +========================================= + +Vpp uses bounded-index extensible hashing to solve a variety of +exact-match (key, value) lookup problems. Benefits of the current +implementation: + +- Very high record count scaling, tested to 100,000,000 records. +- Lookup performance degrades gracefully as the number of records + increases +- No reader locking required +- Template implementation, it’s easy to support arbitrary (key,value) + types + +Bounded-index extensible hashing has been widely used in databases for +decades. + +Bihash uses a two-level data structure: + +:: + + +-----------------+ + | bucket-0 | + | log2_size | + | backing store | + +-----------------+ + | bucket-1 | + | log2_size | +--------------------------------+ + | backing store | --------> | KVP_PER_PAGE * key-value-pairs | + +-----------------+ | page 0 | + ... +--------------------------------+ + +-----------------+ | KVP_PER_PAGE * key-value-pairs | + | bucket-2**N-1 | | page 1 | + | log2_size | +--------------------------------+ + | backing store | --- + +-----------------+ +--------------------------------+ + | KVP_PER_PAGE * key-value-pairs | + | page 2**(log2(size)) - 1 | + +--------------------------------+ + +Discussion of the algorithm +--------------------------- + +This structure has a couple of major advantages. In practice, each +bucket entry fits into a 64-bit integer. Coincidentally, vpp’s target +CPU architectures support 64-bit atomic operations. When modifying the +contents of a specific bucket, we do the following: + +- Make a working copy of the bucket’s backing storage +- Atomically swap a pointer to the working copy into the bucket array +- Change the original backing store data +- Atomically swap back to the original + +So, no reader locking is required to search a bihash table. + +At lookup time, the implementation computes a key hash code. We use the +least-significant N bits of the hash to select the bucket. + +With the bucket in hand, we learn log2 (nBackingPages) for the selected +bucket. At this point, we use the next log2_size bits from the hash code +to select the specific backing page in which the (key,value) page will +be found. + +Net result: we search **one** backing page, not 2**log2_size pages. This +is a key property of the algorithm. + +When sufficient collisions occur to fill the backing pages for a given +bucket, we double the bucket size, rehash, and deal the bucket contents +into a double-sized set of backing pages. In the future, we may +represent the size as a linear combination of two powers-of-two, to +increase space efficiency. + +To solve the “jackpot case” where a set of records collide under hashing +in a bad way, the implementation will fall back to linear search across +2**log2_size backing pages on a per-bucket basis. + +To maintain *space* efficiency, we should configure the bucket array so +that backing pages are effectively utilized. Lookup performance tends to +change *very little* if the bucket array is too small or too large. + +Bihash depends on selecting an effective hash function. If one were to +use a truly broken hash function such as “return 1ULL.” bihash would +still work, but it would be equivalent to poorly-programmed linear +search. 
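To make the bucket and page selection described above concrete, here is a
minimal sketch of the index arithmetic - not the actual vppinfra code; the
bucket count and the names used are purely illustrative:

.. code:: c

   #include <stdint.h>

   /* Hypothetical geometry, for illustration only */
   #define LOG2_NBUCKETS 20u                  /* 2**20 buckets */
   #define NBUCKETS (1u << LOG2_NBUCKETS)

   static inline void
   bihash_pick_indices (uint64_t hash, uint32_t log2_pages,
                        uint32_t * bucket_index, uint32_t * page_index)
   {
     /* the least-significant bits of the hash select the bucket */
     *bucket_index = hash & (NBUCKETS - 1);
     /* the next log2_pages bits select the single backing page to search */
     *page_index = (hash >> LOG2_NBUCKETS) & ((1u << log2_pages) - 1);
   }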
+ +We often use cpu intrinsic functions - think crc32 - to rapidly compute +a hash code which has decent statistics. + +Bihash Cookbook +--------------- + +Using current (key,value) template instance types +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It’s quite easy to use one of the template instance types. As of this +writing, …/src/vppinfra provides pre-built templates for 8, 16, 20, 24, +40, and 48 byte keys, u8 \* vector keys, and 8 byte values. + +See …/src/vppinfra/{bihash\_\_8}.h + +To define the data types, #include a specific template instance, most +often in a subsystem header file: + +.. code:: c + + #include <vppinfra/bihash_8_8.h> + +If you’re building a standalone application, you’ll need to define the +various functions by #including the method implementation file in a C +source file. + +The core vpp engine currently uses most if not all of the known bihash +types, so you probably won’t need to #include the method implementation +file. + +.. code:: c + + #include <vppinfra/bihash_template.c> + +Add an instance of the selected bihash data structure to e.g. a “main_t” +structure: + +.. code:: c + + typedef struct + { + ... + BVT (clib_bihash) hash_table; + or + clib_bihash_8_8_t hash_table; + ... + } my_main_t; + +The BV macro concatenate its argument with the value of the preprocessor +symbol BIHASH_TYPE. The BVT macro concatenates its argument with the +value of BIHASH_TYPE and the fixed-string “_t”. So in the above example, +BVT (clib_bihash) generates “clib_bihash_8_8_t”. + +If you’re sure you won’t decide to change the template / type name +later, it’s perfectly OK to code “clib_bihash_8_8_t” and so forth. + +In fact, if you #include multiple template instances in a single source +file, you **must** use fully-enumerated type names. The macros stand no +chance of working. + +Initializing a bihash table +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Call the init function as shown. As a rough guide, pick a number of +buckets which is approximately +number_of_expected_records/BIHASH_KVP_PER_PAGE from the relevant +template instance header-file. See previous discussion. + +The amount of memory selected should easily contain all of the records, +with a generous allowance for hash collisions. Bihash memory is +allocated separately from the main heap, and won’t cost anything except +kernel PTE’s until touched, so it’s OK to be reasonably generous. + +For example: + +.. code:: c + + my_main_t *mm = &my_main; + clib_bihash_8_8_t *h; + + h = &mm->hash_table; + + clib_bihash_init_8_8 (h, "test", (u32) number_of_buckets, + (uword) memory_size); + +Add or delete a key/value pair +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Use BV(clib_bihash_add_del), or the explicit type variant: + +.. code:: c + + clib_bihash_kv_8_8_t kv; + clib_bihash_8_8_t * h; + my_main_t *mm = &my_main; + clib_bihash_8_8_t *h; + + h = &mm->hash_table; + kv.key = key_to_add_or_delete; + kv.value = value_to_add_or_delete; + + clib_bihash_add_del_8_8 (h, &kv, is_add /* 1=add, 0=delete */); + +In the delete case, kv.value is irrelevant. To change the value +associated with an existing (key,value) pair, simply re-add the [new] +pair. + +Simple search +~~~~~~~~~~~~~ + +The simplest possible (key, value) search goes like so: + +.. 
code:: c + + clib_bihash_kv_8_8_t search_kv, return_kv; + clib_bihash_8_8_t * h; + my_main_t *mm = &my_main; + clib_bihash_8_8_t *h; + + h = &mm->hash_table; + search_kv.key = key_to_add_or_delete; + + if (clib_bihash_search_8_8 (h, &search_kv, &return_kv) < 0) + key_not_found(); + else + key_found(); + +Note that it’s perfectly fine to collect the lookup result + +.. code:: c + + if (clib_bihash_search_8_8 (h, &search_kv, &search_kv)) + key_not_found(); + etc. + +Bihash vector processing +~~~~~~~~~~~~~~~~~~~~~~~~ + +When processing a vector of packets which need a certain lookup +performed, it’s worth the trouble to compute the key hash, and prefetch +the correct bucket ahead of time. + +Here’s a sketch of one way to write the required code: + +Dual-loop: \* 6 packets ahead, prefetch 2x vlib_buffer_t’s and 2x packet +data required to form the record keys \* 4 packets ahead, form 2x record +keys and call BV(clib_bihash_hash) or the explicit hash function to +calculate the record hashes. Call 2x BV(clib_bihash_prefetch_bucket) to +prefetch the buckets \* 2 packets ahead, call 2x +BV(clib_bihash_prefetch_data) to prefetch 2x (key,value) data pages. \* +In the processing section, call 2x +BV(clib_bihash_search_inline_with_hash) to perform the search + +Programmer’s choice whether to stash the hash code somewhere in +vnet_buffer(b) metadata, or to use local variables. + +Single-loop: \* Use simple search as shown above. + +Walking a bihash table +~~~~~~~~~~~~~~~~~~~~~~ + +A fairly common scenario to build “show” commands involves walking a +bihash table. It’s simple enough: + +.. code:: c + + my_main_t *mm = &my_main; + clib_bihash_8_8_t *h; + void callback_fn (clib_bihash_kv_8_8_t *, void *); + + h = &mm->hash_table; + + BV(clib_bihash_foreach_key_value_pair) (h, callback_fn, (void *) arg); + +To nobody’s great surprise: clib_bihash_foreach_key_value_pair iterates +across the entire table, calling callback_fn with active entries. + +Bihash table iteration safety +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The iterator template “clib_bihash_foreach_key_value_pair” must be used +with a certain amount of care. For one thing, the iterator template does +*not* take the bihash hash table writer lock. If your use-case requires +it, lock the table. + +For another, the iterator template is not safe under all conditions: + +- It’s **OK to delete** bihash table entries during a table-walk. The + iterator checks whether the current bucket has been freed after each + *callback_fn(…)* invocation. + +- It is **not OK to add** entries during a table-walk. + +The add-during-walk case involves a jackpot: while processing a +key-value-pair in a particular bucket, add a certain number of entries. +By luck, assume that one or more of the added entries causes the +**current bucket** to split-and-rehash. + +Since we rehash KVP’s to different pages based on what amounts to a +different hash function, either of these things can go wrong: + +- We may revisit previously-visited entries. Depending on how one coded + the use-case, we could end up in a recursive-add situation. + +- We may skip entries that have not been visited + +One could build an add-safe iterator, at a significant cost in +performance: copy the entire bucket, and walk the copy. + +It’s hard to imagine a worthwhile add-during walk use-case in the first +place; let alone one which couldn’t be implemented by walking the table +without modifying it, then adding a set of records. 
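For reference, a walk callback matching the callback_fn declaration shown
earlier might look like the sketch below. Treating the opaque argument as a
(u8 \*\*) format buffer is purely illustrative:

.. code:: c

   static void
   callback_fn (clib_bihash_kv_8_8_t * kv, void *arg)
   {
     /* e.g. accumulate "show" command output in a format buffer */
     u8 **s = (u8 **) arg;

     *s = format (*s, "key 0x%llx value 0x%llx\n", kv->key, kv->value);
   }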
+ +Creating a new template instance +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Creating a new template is easy. Use one of the existing templates as a +model, and make the obvious changes. The hash and key_compare methods +are performance-critical in multiple senses. + +If the key compare method is slow, every lookup will be slow. If the +hash function is slow, same story. If the hash function has poor +statistical properties, space efficiency will suffer. In the limit, a +bad enough hash function will cause large portions of the table to +revert to linear search. + +Use of the best available vector unit is well worth the trouble in the +hash and key_compare functions. diff --git a/docs/developer/corearchitecture/buffer_metadata.rst b/docs/developer/corearchitecture/buffer_metadata.rst new file mode 100644 index 00000000000..545c31f3041 --- /dev/null +++ b/docs/developer/corearchitecture/buffer_metadata.rst @@ -0,0 +1,237 @@ +Buffer Metadata +=============== + +Each vlib_buffer_t (packet buffer) carries buffer metadata which +describes the current packet-processing state. The underlying techniques +have been used for decades, across multiple packet processing +environments. + +We will examine vpp buffer metadata in some detail, but folks who need +to manipulate and/or extend the scheme should expect to do a certain +level of code inspection. + +Vlib (Vector library) primary buffer metadata +--------------------------------------------- + +The first 64 octets of each vlib_buffer_t carries the primary buffer +metadata. See …/src/vlib/buffer.h for full details. + +Important fields: + +- i16 current_data: the signed offset in data[], pre_data[] that we are + currently processing. If negative current header points into the + pre-data (rewrite space) area. +- u16 current_length: nBytes between current_data and the end of this + buffer. +- u32 flags: Buffer flag bits. 
Heavily used, not many bits left + + - src/vlib/buffer.h flag bits + + - VLIB_BUFFER_IS_TRACED: buffer is traced + - VLIB_BUFFER_NEXT_PRESENT: buffer has multiple chunks + - VLIB_BUFFER_TOTAL_LENGTH_VALID: + total_length_not_including_first_buffer is valid (see below) + + - src/vnet/buffer.h flag bits + + - VNET_BUFFER_F_L4_CHECKSUM_COMPUTED: tcp/udp checksum has been + computed + - VNET_BUFFER_F_L4_CHECKSUM_CORRECT: tcp/udp checksum is correct + - VNET_BUFFER_F_VLAN_2_DEEP: two vlan tags present + - VNET_BUFFER_F_VLAN_1_DEEP: one vlan tag present + - VNET_BUFFER_F_SPAN_CLONE: packet has already been cloned (span + feature) + - VNET_BUFFER_F_LOOP_COUNTER_VALID: packet look-up loop count + valid + - VNET_BUFFER_F_LOCALLY_ORIGINATED: packet built by vpp + - VNET_BUFFER_F_IS_IP4: packet is ipv4, for checksum offload + - VNET_BUFFER_F_IS_IP6: packet is ipv6, for checksum offload + - VNET_BUFFER_F_OFFLOAD_IP_CKSUM: hardware ip checksum offload + requested + - VNET_BUFFER_F_OFFLOAD_TCP_CKSUM: hardware tcp checksum offload + requested + - VNET_BUFFER_F_OFFLOAD_UDP_CKSUM: hardware udp checksum offload + requested + - VNET_BUFFER_F_IS_NATED: natted packet, skip input checks + - VNET_BUFFER_F_L2_HDR_OFFSET_VALID: L2 header offset valid + - VNET_BUFFER_F_L3_HDR_OFFSET_VALID: L3 header offset valid + - VNET_BUFFER_F_L4_HDR_OFFSET_VALID: L4 header offset valid + - VNET_BUFFER_F_FLOW_REPORT: packet is an ipfix packet + - VNET_BUFFER_F_IS_DVR: packet to be reinjected into the l2 + output path + - VNET_BUFFER_F_QOS_DATA_VALID: QoS data valid in + vnet_buffer_opaque2 + - VNET_BUFFER_F_GSO: generic segmentation offload requested + - VNET_BUFFER_F_AVAIL1: available bit + - VNET_BUFFER_F_AVAIL2: available bit + - VNET_BUFFER_F_AVAIL3: available bit + - VNET_BUFFER_F_AVAIL4: available bit + - VNET_BUFFER_F_AVAIL5: available bit + - VNET_BUFFER_F_AVAIL6: available bit + - VNET_BUFFER_F_AVAIL7: available bit + +- u32 flow_id: generic flow identifier +- u8 ref_count: buffer reference / clone count (e.g. for span + replication) +- u8 buffer_pool_index: buffer pool index which owns this buffer +- vlib_error_t (u16) error: error code for buffers enqueued to error + handler +- u32 next_buffer: buffer index of next buffer in chain. Only valid if + VLIB_BUFFER_NEXT_PRESENT is set +- union + + - u32 current_config_index: current index on feature arc + - u32 punt_reason: reason code once packet punted. Mutually + exclusive with current_config_index + +- u32 opaque[10]: primary vnet-layer opaque data (see below) +- END of first cache line / data initialized by the buffer allocator +- u32 trace_index: buffer’s index in the packet trace subsystem +- u32 total_length_not_including_first_buffer: see + VLIB_BUFFER_TOTAL_LENGTH_VALID above +- u32 opaque2[14]: secondary vnet-layer opaque data (see below) +- u8 pre_data[VLIB_BUFFER_PRE_DATA_SIZE]: rewrite space, often used to + prepend tunnel encapsulations +- u8 data[0]: buffer data received from the wire. Ordinarily, hardware + devices use b->data[0] as the DMA target but there are exceptions. Do + not write code which blindly assumes that packet data starts in + b->data[0]. Use vlib_buffer_get_current(…). + +Vnet (network stack) primary buffer metadata +-------------------------------------------- + +Vnet primary buffer metadata occupies space reserved in the vlib opaque +field shown above, and has the type name vnet_buffer_opaque_t. +Ordinarily accessed using the vnet_buffer(b) macro. See +../src/vnet/buffer.h for full details. 
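For example, a graph node might read the RX interface handle and steer the
packet out a particular TX interface like this - a minimal sketch, in which
b0 and tx_sw_if_index are assumed local variables, and the fields involved
are described in the list below:

.. code:: c

   u32 sw_if_index0;

   /* interface on which the packet was received */
   sw_if_index0 = vnet_buffer (b0)->sw_if_index[VLIB_RX];

   /* interface on which the packet should be transmitted */
   vnet_buffer (b0)->sw_if_index[VLIB_TX] = tx_sw_if_index;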
+ +Important fields: + +- u32 sw_if_index[2]: RX and TX interface handles. At the ip lookup + stage, vnet_buffer(b)->sw_if_index[VLIB_TX] is interpreted as a FIB + index. +- i16 l2_hdr_offset: offset from b->data[0] of the packet L2 header. + Valid only if b->flags & VNET_BUFFER_F_L2_HDR_OFFSET_VALID is set +- i16 l3_hdr_offset: offset from b->data[0] of the packet L3 header. + Valid only if b->flags & VNET_BUFFER_F_L3_HDR_OFFSET_VALID is set +- i16 l4_hdr_offset: offset from b->data[0] of the packet L4 header. + Valid only if b->flags & VNET_BUFFER_F_L4_HDR_OFFSET_VALID is set +- u8 feature_arc_index: feature arc that the packet is currently + traversing +- union + + - ip + + - u32 adj_index[2]: adjacency from dest IP lookup in [VLIB_TX], + adjacency from source ip lookup in [VLIB_RX], set to ~0 until + source lookup done + - union + + - generic fields + - ICMP fields + - reassembly fields + + - mpls fields + - l2 bridging fields, only valid in the L2 path + - l2tpv3 fields + - l2 classify fields + - vnet policer fields + - MAP fields + - MAP-T fields + - ip fragmentation fields + - COP (whitelist/blacklist filter) fields + - LISP fields + - TCP fields + + - connection index + - sequence numbers + - header and data offsets + - data length + - flags + + - SCTP fields + - NAT fields + - u32 unused[6] + +Vnet (network stack) secondary buffer metadata +---------------------------------------------- + +Vnet primary buffer metadata occupies space reserved in the vlib opaque2 +field shown above, and has the type name vnet_buffer_opaque2_t. +Ordinarily accessed using the vnet_buffer2(b) macro. See +../src/vnet/buffer.h for full details. + +Important fields: + +- qos fields + + - u8 bits + - u8 source + +- u8 loop_counter: used to detect and report internal forwarding loops +- group-based policy fields + + - u8 flags + - u16 sclass: the packet’s source class + +- u16 gso_size: L4 payload size, persists all the way to + interface-output in case GSO is not enabled +- u16 gso_l4_hdr_sz: size of the L4 protocol header +- union + + - packet trajectory tracer (largely deprecated) + + - u16 \*trajectory_trace; only #if VLIB_BUFFER_TRACE_TRAJECTORY > + 0 + + - packet generator + + - u64 pg_replay_timestamp: timestamp for replayed pcap trace + packets + + - u32 unused[8] + +Buffer Metadata Extensions +-------------------------- + +Plugin developers may wish to extend either the primary or secondary +vnet buffer opaque unions. Please perform a manual live variable +analysis, otherwise nodes which use shared buffer metadata space may +break things. + +It’s not OK to add plugin or proprietary metadata to the core vpp engine +header files named above. Instead, proceed as follows. The example +concerns the vnet primary buffer opaque union vlib_buffer_opaque_t. It’s +a very simple variation to use the vnet secondary buffer opaque union +vlib_buffer_opaque2_t. 
+ +In a plugin header file: + +:: + + /* Add arbitrary buffer metadata */ + #include <vnet/buffer.h> + + typedef struct + { + u32 my_stuff[6]; + } my_buffer_opaque_t; + + STATIC_ASSERT (sizeof (my_buffer_opaque_t) <= + STRUCT_SIZE_OF (vnet_buffer_opaque_t, unused), + "Custom meta-data too large for vnet_buffer_opaque_t"); + + #define my_buffer_opaque(b) \ + ((my_buffer_opaque_t *)((u8 *)((b)->opaque) + STRUCT_OFFSET_OF (vnet_buffer_opaque_t, unused))) + +To set data in the custom buffer opaque type given a vlib_buffer_t \*b: + +:: + + my_buffer_opaque (b)->my_stuff[2] = 123; + +To read data from the custom buffer opaque type: + +:: + + stuff0 = my_buffer_opaque (b)->my_stuff[2]; diff --git a/docs/developer/corearchitecture/buildsystem/buildrootmakefile.rst b/docs/developer/corearchitecture/buildsystem/buildrootmakefile.rst new file mode 100644 index 00000000000..1eb4e6b5301 --- /dev/null +++ b/docs/developer/corearchitecture/buildsystem/buildrootmakefile.rst @@ -0,0 +1,353 @@ +Introduction to build-root/Makefile +=================================== + +The vpp build system consists of a top-level Makefile, a data-driven +build-root/Makefile, and a set of makefile fragments. The various parts +come together as the result of a set of well-thought-out conventions. + +This section describes build-root/Makefile in some detail. + +Repository Groups and Source Paths +---------------------------------- + +Current vpp workspaces comprise a single repository group. The file +.../build-root/build-config.mk defines a key variable called +SOURCE\_PATH. The SOURCE\_PATH variable names the set of repository +groups. At the moment, there is only one repository group. + +Single pass build system, dependencies and components +----------------------------------------------------- + +The vpp build system caters to components built with GNU autoconf / +automake. Adding such components is a simple process. Dealing with +components which use BSD-style raw Makefiles is a more difficult. +Dealing with toolchain components such as gcc, glibc, and binutils can +be considerably more complicated. + +The vpp build system is a **single-pass** build system. A partial order +must exist for any set of components: the set of (a before b) tuples +must resolve to an ordered list. If you create a circular dependency of +the form; (a,b) (b,c) (c,a), gmake will try to build the target list, +but there’s a 0.0% chance that the results will be pleasant. Cut-n-paste +mistakes in .../build-data/packages/.mk can produce confusing failures. + +In a single-pass build system, it’s best to separate libraries and +applications which instantiate them. For example, if vpp depends on +libfoo.a, and myapp depends on both vpp and libfoo.a, it's best to place +libfoo.a and myapp in separate components. The build system will build +libfoo.a, vpp, and then (as a separate component) myapp. If you try to +build libfoo.a and myapp from the same component, it won’t work. + +If you absolutely, positively insist on having myapp and libfoo.a in the +same source tree, you can create a pseudo-component in a separate .mk +file in the .../build-data/packages/ directory. Define the code +phoneycomponent\_source = realcomponent, and provide manual +configure/build/install targets. + +Separate components for myapp, libfoo.a, and vpp is the best and easiest +solution. 
However, the “mumble\_source = realsource” degree of freedom +exists to solve intractable circular dependencies, such as: to build +gcc-bootstrap, followed by glibc, followed by “real” gcc/g++ [which +depends on glibc too]. + +.../build-root +-------------- + +The .../build-root directory contains the repository group specification +build-config.mk, the main Makefile, and the system-wide set of +autoconf/automake variable overrides in config.site. We'll describe +these files in some detail. To be clear about expectations: the main +Makefile and config.site file are subtle and complex. It's unlikely that +you'll need or want to modify them. Poorly planned changes in either +place typically cause bugs that are difficult to solve. + +.../build-root/build-config.mk +------------------------------ + +As described above, the build-config.mk file is straightforward: it sets +the make variable SOURCE\_PATH to a list of repository group absolute +paths. + +The SOURCE\_PATH variable If you choose to move a workspace, make sure +to modify the paths defined by the SOURCE\_PATH variable. Those paths +need to match changes you make in the workspace paths. For example, if +you place the vpp directory in the workspace of a user named jsmith, you +might change the SOURCE\_PATH to: + +SOURCE\_PATH = /home/jsmithuser/workspace/vpp + +The "out of the box" setting should work 99.5% of the time: + +:: + + SOURCE_PATH = $(CURDIR)/.. + +.../vpp/build-root/Makefile +--------------------------- + +The main Makefile is complex in a number of dimensions. If you think you +need to modify it, it's a good idea to do some research, or ask for +advice before you change it. + +The main Makefile was organized and designed to provide the following +characteristics: excellent performance, accurate dependency processing, +cache enablement, timestamp optimizations, git integration, +extensibility, builds with cross-compilation tool chains, and builds +with embedded Linux distributions. + +If you really need to do so, you can build double-cross tools with it, +with a minimum amount of fuss. For example, you could: compile gdb on +x86\_64, to run on PowerPC, to debug the Xtensa instruction set. + +The PLATFORM variable +--------------------- + +The PLATFORM make/environment variable controls a number of important +characteristics, primarily: + +- CPU architecture +- The list of images to build. + +With respect to .../build-root/Makefile, the list of images to build is +specified by the target. For example: + +:: + + make PLATFORM=vpp TAG=vpp_debug install-deb + +builds vpp debug Debian packages. + +The main Makefile interprets $PLATFORM by attempting to "-include" the +file /build-data/platforms.mk: + +:: + + $(foreach d,$(FULL_SOURCE_PATH), \ + $(eval -include $(d)/platforms.mk)) + +By convention, we don't define **platforms** in the +...//build-data/platforms.mk file. + +In the vpp case, we search for platform definition makefile fragments in +.../vpp/build-data/platforms.mk, as follows: + +:: + + $(foreach d,$(SOURCE_PATH_BUILD_DATA_DIRS), \ + $(eval -include $(d)/platforms/*.mk)) + +With vpp, which uses the "vpp" platform as discussed above, we end up +"-include"-ing .../vpp/build-data/platforms/vpp.mk. 
+ +The platform-specific .mk fragment +---------------------------------- + +Here are the contents of .../build-data/platforms/vpp.mk: + +:: + + MACHINE=$(shell uname -m) + + vpp_arch = native + ifeq ($(TARGET_PLATFORM),thunderx) + vpp_dpdk_target = arm64-thunderx-linuxapp-gcc + endif + vpp_native_tools = vppapigen + + vpp_uses_dpdk = yes + + # Uncomment to enable building unit tests + # vpp_enable_tests = yes + + vpp_root_packages = vpp + + # DPDK configuration parameters + # vpp_uses_dpdk_mlx4_pmd = yes + # vpp_uses_dpdk_mlx5_pmd = yes + # vpp_uses_external_dpdk = yes + # vpp_dpdk_inc_dir = /usr/include/dpdk + # vpp_dpdk_lib_dir = /usr/lib + # vpp_dpdk_shared_lib = yes + + # Use '--without-libnuma' for non-numa aware architecture + # Use '--enable-dlmalloc' to use dlmalloc instead of mheap + vpp_configure_args_vpp = --enable-dlmalloc + sample-plugin_configure_args_vpp = --enable-dlmalloc + + # load balancer plugin is not portable on 32 bit platform + ifeq ($(MACHINE),i686) + vpp_configure_args_vpp += --disable-lb-plugin + endif + + vpp_debug_TAG_CFLAGS = -g -O0 -DCLIB_DEBUG \ + -fstack-protector-all -fPIC -Werror + vpp_debug_TAG_CXXFLAGS = -g -O0 -DCLIB_DEBUG \ + -fstack-protector-all -fPIC -Werror + vpp_debug_TAG_LDFLAGS = -g -O0 -DCLIB_DEBUG \ + -fstack-protector-all -fPIC -Werror + + vpp_TAG_CFLAGS = -g -O2 -D_FORTIFY_SOURCE=2 -fstack-protector -fPIC -Werror + vpp_TAG_CXXFLAGS = -g -O2 -D_FORTIFY_SOURCE=2 -fstack-protector -fPIC -Werror + vpp_TAG_LDFLAGS = -g -O2 -D_FORTIFY_SOURCE=2 -fstack-protector -fPIC -Werror -pie -Wl,-z,now + + vpp_clang_TAG_CFLAGS = -g -O2 -D_FORTIFY_SOURCE=2 -fstack-protector -fPIC -Werror + vpp_clang_TAG_LDFLAGS = -g -O2 -D_FORTIFY_SOURCE=2 -fstack-protector -fPIC -Werror + + vpp_gcov_TAG_CFLAGS = -g -O0 -DCLIB_DEBUG -fPIC -Werror -fprofile-arcs -ftest-coverage + vpp_gcov_TAG_LDFLAGS = -g -O0 -DCLIB_DEBUG -fPIC -Werror -coverage + + vpp_coverity_TAG_CFLAGS = -g -O2 -fPIC -Werror -D__COVERITY__ + vpp_coverity_TAG_LDFLAGS = -g -O2 -fPIC -Werror -D__COVERITY__ + +Note the following variable settings: + +- The variable \_arch sets the CPU architecture used to build the + per-platform cross-compilation toolchain. With the exception of the + "native" architecture - used in our example - the vpp build system + produces cross-compiled binaries. + +- The variable \_native\_tools lists the required set of self-compiled + build tools. + +- The variable \_root\_packages lists the set of images to build when + specifying the target: make PLATFORM= TAG= [install-deb \| + install-rpm]. + +The TAG variable +---------------- + +The TAG variable indirectly sets CFLAGS and LDFLAGS, as well as the +build and install directory names in the .../vpp/build-root directory. +See definitions above. + +Important targets build-root/Makefile +------------------------------------- + +The main Makefile and the various makefile fragments implement the +following user-visible targets: + ++------------------+----------------------+--------------------------------------------------------------------------------------+ +| Target | ENV Variable Settings| Notes | +| | | | ++==================+======================+======================================================================================+ +| foo | bar | mumble | ++------------------+----------------------+--------------------------------------------------------------------------------------+ +| bootstrap-tools | none | Builds the set of native tools needed by the vpp build system to | +| | | build images. Example: vppapigen. 
In a full cross compilation case might include | +| | | include "make", "git", "find", and "tar | ++------------------+----------------------+--------------------------------------------------------------------------------------+ +| install-tools | PLATFORM | Builds the tool chain for the indicated <platform>. Not used in vpp builds | ++------------------+----------------------+--------------------------------------------------------------------------------------+ +| distclean | none | Roto-rooters everything in sight: toolchains, images, and so forth. | ++------------------+----------------------+--------------------------------------------------------------------------------------+ +| install-deb | PLATFORM and TAG | Build Debian packages comprising components listed in <platform>_root_packages, | +| | | using compile / link options defined by TAG. | ++------------------+----------------------+--------------------------------------------------------------------------------------+ +| install-rpm | PLATFORM and TAG | Build RPMs comprising components listed in <platform>_root_packages, | +| | | using compile / link options defined by TAG. | ++------------------+----------------------+--------------------------------------------------------------------------------------+ + +Additional build-root/Makefile environment variable settings +------------------------------------------------------------ + +These variable settings may be of use: + ++----------------------+------------------------------------------------------------------------------------------------------------+ +| ENV Variable | Notes | ++======================+======================+=====================================================================================+ +| BUILD_DEBUG=vx | Directs Makefile et al. to make a good-faith effort to show what's going on in excruciating detail. | +| | Use it as follows: "make ... BUILD_DEBUG=vx". Fairly effective in Makefile debug situations. | ++----------------------+------------------------------------------------------------------------------------------------------------+ +| V=1 | print detailed cc / ld command lines. Useful for discovering if -DFOO=11 is in the command line or not | ++----------------------+------------------------------------------------------------------------------------------------------------+ +| CC=mygcc | Override the configured C-compiler | ++----------------------+------------------------------------------------------------------------------------------------------------+ + +.../build-root/config.site +-------------------------- + +The contents of .../build-root/config.site override individual autoconf / +automake default variable settings. Here are a few sample settings related to +building a full toolchain: + +:: + + # glibc needs these setting for cross compiling + libc_cv_forced_unwind=yes + libc_cv_c_cleanup=yes + libc_cv_ssp=no + +Determining the set of variables which need to be overridden, and the +override values is a matter of trial and error. It should be +unnecessary to modify this file for use with fd.io vpp. + +.../build-data/platforms.mk +--------------------------- + +Each repo group includes the platforms.mk file, which is included by +the main Makefile. The vpp/build-data/platforms.mk file is not terribly +complex. As of this writing, .../build-data/platforms.mk file accomplishes two +tasks. 
+ +First, it includes vpp/build-data/platforms/\*.mk: + +:: + + # Pick up per-platform makefile fragments + $(foreach d,$(SOURCE_PATH_BUILD_DATA_DIRS), \ + $(eval -include $(d)/platforms/*.mk)) + +This collects the set of platform definition makefile fragments, as discussed above. + +Second, platforms.mk implements the user-visible "install-deb" target. + +.../build-data/packages/\*.mk +----------------------------- + +Each component needs a makefile fragment in order for the build system +to recognize it. The per-component makefile fragments vary +considerably in complexity. For a component built with GNU autoconf / +automake which does not depend on other components, the make fragment +can be empty. See .../build-data/packages/vpp.mk for an uncomplicated +but fully realistic example. + +Here are some of the important variable settings in per-component makefile fragments: + ++----------------------+------------------------------------------------------------------------------------------------------------+ +| Variable | Notes | ++======================+======================+=====================================================================================+ +| xxx_configure_depend | Lists the set of component build dependencies for the xxx component. In plain English: don't try to | +| | configure this component until you've successfully built the indicated targets. Almost always, | +| | xxx_configure_depend will list a set of "yyy-install" targets. Note the pattern: | +| | "variable names contain underscores, make target names contain hyphens" | ++----------------------+------------------------------------------------------------------------------------------------------------+ +| xxx_configure_args | (optional) Lists any additional arguments to pass to the xxx component "configure" script. | +| | The main Makefile %-configure rule adds the required settings for --libdir, --prefix, and | +| | --host (when cross-compiling) | ++----------------------+------------------------------------------------------------------------------------------------------------+ +| xxx_CPPFLAGS | Adds -I stanzas to CPPFLAGS for components upon which xxx depends. | +| | Almost invariably "xxx_CPPFLAGS = $(call installed_includes_fn, dep1 dep2 dep3)", where dep1, dep2, and | +| | dep3 are listed in xxx_configure_depend. It is bad practice to set "-g -O3" here. Those settings | +| | belong in a TAG. | ++----------------------+------------------------------------------------------------------------------------------------------------+ +| xxx_LDFLAGS | Adds -Wl,-rpath -Wl,depN stanzas to LDFLAGS for components upon which xxx depends. | +| | Almost invariably "xxx_LDFLAGS = $(call installed_lib_fn, dep1 dep2 dep3)", where dep1, dep2, and | +| | dep3 are listed in xxx_configure_depend. It is bad manners to set "-liberty-or-death" here. | +| | Those settings belong in Makefile.am. | ++----------------------+------------------------------------------------------------------------------------------------------------+ + +When dealing with "irritating" components built with raw Makefiles +which only work when building in the source tree, we use a specific +strategy in the xxx.mk file. + +The strategy is simple for those components: We copy the source tree +into .../vpp/build-root/build-xxx. This works, but completely defeats +dependency processing. This strategy is acceptable only for 3rd party +software which won't need extensive (or preferably any) modifications. + +Take a look at .../vpp/build-data/packages/dpdk.mk. 
When invoked, the +dpdk_configure variable copies source code into $(PACKAGE_BUILD_DIR), +and performs the BSD equivalent of "autoreconf -i -f" to configure the +build area. The rest of the file is similar: a bunch of hand-rolled +glue code which manages to make the dpdk act like a good vpp build +citizen even though it is not. diff --git a/docs/developer/corearchitecture/buildsystem/cmakeandninja.rst b/docs/developer/corearchitecture/buildsystem/cmakeandninja.rst new file mode 100644 index 00000000000..580d261bdac --- /dev/null +++ b/docs/developer/corearchitecture/buildsystem/cmakeandninja.rst @@ -0,0 +1,186 @@ +Introduction to cmake and ninja +=============================== + +Cmake plus ninja is approximately equal to GNU autotools plus GNU +make, respectively. Both cmake and GNU autotools support self and +cross-compilation, checking for required components and versions. + +- For a decent-sized project - such as vpp - build performance is drastically better with (cmake, ninja). + +- The cmake input language looks like an actual language, rather than a shell scripting scheme on steroids. + +- Ninja doesn't pretend to support manually-generated input files. Think of it as a fast, dumb robot which eats mildly legible byte-code. + +See the `cmake website <http://cmake.org>`_, and the `ninja website +<https://ninja-build.org>`_ for additional information. + +vpp cmake configuration files +----------------------------- + +The top of the vpp project cmake hierarchy lives in .../src/CMakeLists.txt. +This file defines the vpp project, and (recursively) includes two kinds +of files: rule/function definitions, and target lists. + +- Rule/function definitions live in .../src/cmake/{\*.cmake}. Although the contents of these files is simple enough to read, it shouldn't be necessary to modify them very often + +- Build target lists come from CMakeLists.txt files found in subdirectories, which are named in the SUBDIRS list in .../src/CMakeLists.txt + +:: + + ############################################################################## + # subdirs - order matters + ############################################################################## + if("${CMAKE_SYSTEM_NAME}" STREQUAL "Linux") + find_package(OpenSSL REQUIRED) + set(SUBDIRS + vppinfra svm vlib vlibmemory vlibapi vnet vpp vat vcl plugins + vpp-api tools/vppapigen tools/g2 tools/perftool) + elseif("${CMAKE_SYSTEM_NAME}" STREQUAL "Darwin") + set(SUBDIRS vppinfra) + else() + message(FATAL_ERROR "Unsupported system: ${CMAKE_SYSTEM_NAME}") + endif() + + foreach(DIR ${SUBDIRS}) + add_subdirectory(${DIR}) + endforeach() + +- The vpp cmake configuration hierarchy discovers the list of plugins to be built by searching for subdirectories in .../src/plugins which contain CMakeLists.txt files + + +:: + + ############################################################################## + # find and add all plugin subdirs + ############################################################################## + FILE(GLOB files RELATIVE + ${CMAKE_CURRENT_SOURCE_DIR} + ${CMAKE_CURRENT_SOURCE_DIR}/*/CMakeLists.txt + ) + foreach (f ${files}) + get_filename_component(dir ${f} DIRECTORY) + add_subdirectory(${dir}) + endforeach() + +How to write a plugin CMakeLists.txt file +----------------------------------------- + +It's really quite simple. 
Follow the pattern: + +:: + + add_vpp_plugin(mactime + SOURCES + mactime.c + node.c + + API_FILES + mactime.api + + INSTALL_HEADERS + mactime_all_api_h.h + mactime_msg_enum.h + + API_TEST_SOURCES + mactime_test.c + ) + +Adding a target elsewhere in the source tree +-------------------------------------------- + +Within reason, adding a subdirectory to the SUBDIRS list in +.../src/CMakeLists.txt is perfectly OK. The indicated directory will +need a CMakeLists.txt file. + +.. _building-g2: + +Here's how we build the g2 event data visualization tool: + +:: + + option(VPP_BUILD_G2 "Build g2 tool." OFF) + if(VPP_BUILD_G2) + find_package(GTK2 COMPONENTS gtk) + if(GTK2_FOUND) + include_directories(${GTK2_INCLUDE_DIRS}) + add_vpp_executable(g2 + SOURCES + clib.c + cpel.c + events.c + main.c + menu1.c + pointsel.c + props.c + g2version.c + view1.c + + LINK_LIBRARIES vppinfra Threads::Threads m ${GTK2_LIBRARIES} + NO_INSTALL + ) + endif() + endif() + +The g2 component is optional, and is not built by default. There are +a couple of ways to tell cmake to include it in build.ninja [or in Makefile.] + +When invoking cmake manually [rarely done and not very easy], specify +-DVPP_BUILD_G2=ON: + +:: + + $ cmake ... -DVPP_BUILD_G2=ON + +Take a good look at .../build-data/packages/vpp.mk to see where and +how the top-level Makefile and .../build-root/Makefile set all of the +cmake arguments. One strategy to enable an optional component is fairly +obvious. Add -DVPP_BUILD_G2=ON to vpp_cmake_args. + +That would work, of course, but it's not a particularly elegant solution. + +Tinkering with build options: ccmake +------------------------------------ + +The easy way to set VPP_BUILD_G2 - or frankly **any** cmake +parameter - is to install the "cmake-curses-gui" package and use +it. + +- Do a straightforward vpp build using the top level Makefile, "make build" or "make build-release" +- Ajourn to .../build-root/build-vpp-native/vpp or .../build-root/build-vpp_debug-native/vpp +- Invoke "ccmake ." to reconfigure the project as desired + +Here's approximately what you'll see: + +:: + + CCACHE_FOUND /usr/bin/ccache + CMAKE_BUILD_TYPE + CMAKE_INSTALL_PREFIX /scratch/vpp-gate/build-root/install-vpp-nati + DPDK_INCLUDE_DIR /scratch/vpp-gate/build-root/install-vpp-nati + DPDK_LIB /scratch/vpp-gate/build-root/install-vpp-nati + MBEDTLS_INCLUDE_DIR /usr/include + MBEDTLS_LIB1 /usr/lib/x86_64-linux-gnu/libmbedtls.so + MBEDTLS_LIB2 /usr/lib/x86_64-linux-gnu/libmbedx509.so + MBEDTLS_LIB3 /usr/lib/x86_64-linux-gnu/libmbedcrypto.so + MUSDK_INCLUDE_DIR MUSDK_INCLUDE_DIR-NOTFOUND + MUSDK_LIB MUSDK_LIB-NOTFOUND + PRE_DATA_SIZE 128 + VPP_API_TEST_BUILTIN ON + VPP_BUILD_G2 OFF + VPP_BUILD_PERFTOOL OFF + VPP_BUILD_VCL_TESTS ON + VPP_BUILD_VPPINFRA_TESTS OFF + + CCACHE_FOUND: Path to a program. + Press [enter] to edit option Press [d] to delete an entry CMake Version 3.10.2 + Press [c] to configure + Press [h] for help Press [q] to quit without generating + Press [t] to toggle advanced mode (Currently Off) + +Use the cursor to point at the VPP_BUILD_G2 line. Press the return key +to change OFF to ON. Press "c" to regenerate build.ninja, etc. + +At that point "make build" or "make build-release" will build g2. And so on. + +Note that toggling advanced mode ["t"] gives access to substantially +all of the cmake option, discovered directories and paths. 
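If you prefer a non-interactive approach, the same reconfiguration can be
done from the build directory with cmake and ninja directly - a sketch; the
exact build directory name depends on the platform and TAG in use:

::

   $ cd build-root/build-vpp_debug-native/vpp
   $ cmake -DVPP_BUILD_G2=ON .
   $ ninja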
diff --git a/docs/developer/corearchitecture/buildsystem/index.rst b/docs/developer/corearchitecture/buildsystem/index.rst new file mode 100644 index 00000000000..908e91e1fc1 --- /dev/null +++ b/docs/developer/corearchitecture/buildsystem/index.rst @@ -0,0 +1,14 @@ +.. _buildsystem: + +Build System +============ + +This guide describes the vpp build system in detail. As of this writing, +the build systems uses a mix of make / Makefiles, cmake, and ninja to +achieve excellent build performance. + +.. toctree:: + + mainmakefile + cmakeandninja + buildrootmakefile diff --git a/docs/developer/corearchitecture/buildsystem/mainmakefile.rst b/docs/developer/corearchitecture/buildsystem/mainmakefile.rst new file mode 100644 index 00000000000..96b97496350 --- /dev/null +++ b/docs/developer/corearchitecture/buildsystem/mainmakefile.rst @@ -0,0 +1,2 @@ +Introduction to the top-level Makefile +====================================== diff --git a/docs/developer/corearchitecture/featurearcs.rst b/docs/developer/corearchitecture/featurearcs.rst new file mode 100644 index 00000000000..89c50e38dce --- /dev/null +++ b/docs/developer/corearchitecture/featurearcs.rst @@ -0,0 +1,225 @@ +Feature Arcs +============ + +A significant number of vpp features are configurable on a per-interface +or per-system basis. Rather than ask feature coders to manually +construct the required graph arcs, we built a general mechanism to +manage these mechanics. + +Specifically, feature arcs comprise ordered sets of graph nodes. Each +feature node in an arc is independently controlled. Feature arc nodes +are generally unaware of each other. Handing a packet to “the next +feature node” is quite inexpensive. + +The feature arc implementation solves the problem of creating graph arcs +used for steering. + +At the beginning of a feature arc, a bit of setup work is needed, but +only if at least one feature is enabled on the arc. + +On a per-arc basis, individual feature definitions create a set of +ordering dependencies. Feature infrastructure performs a topological +sort of the ordering dependencies, to determine the actual feature +order. Missing dependencies **will** lead to runtime disorder. See +https://gerrit.fd.io/r/#/c/12753 for an example. + +If no partial order exists, vpp will refuse to run. Circular dependency +loops of the form “a then b, b then c, c then a” are impossible to +satisfy. + +Adding a feature to an existing feature arc +------------------------------------------- + +To nobody’s great surprise, we set up feature arcs using the typical +“macro -> constructor function -> list of declarations” pattern: + +.. code:: c + + VNET_FEATURE_INIT (mactime, static) = + { + .arc_name = "device-input", + .node_name = "mactime", + .runs_before = VNET_FEATURES ("ethernet-input"), + }; + +This creates a “mactime” feature on the “device-input” arc. + +Once per frame, dig up the vnet_feature_config_main_t corresponding to +the “device-input” feature arc: + +.. code:: c + + vnet_main_t *vnm = vnet_get_main (); + vnet_interface_main_t *im = &vnm->interface_main; + u8 arc = im->output_feature_arc_index; + vnet_feature_config_main_t *fcm; + + fcm = vnet_feature_get_config_main (arc); + +Note that in this case, we’ve stored the required arc index - assigned +by the feature infrastructure - in the vnet_interface_main_t. Where to +put the arc index is a programmer’s decision when creating a feature +arc. + +Per packet, set next0 to steer packets to the next node they should +visit: + +.. 
code:: c + + vnet_get_config_data (&fcm->config_main, + &b0->current_config_index /* value-result */, + &next0, 0 /* # bytes of config data */); + +Configuration data is per-feature arc, and is often unused. Note that +it’s normal to reset next0 to divert packets elsewhere; often, to drop +them for cause: + +.. code:: c + + next0 = MACTIME_NEXT_DROP; + b0->error = node->errors[DROP_CAUSE]; + +Creating a feature arc +---------------------- + +Once again, we create feature arcs using constructor macros: + +.. code:: c + + VNET_FEATURE_ARC_INIT (ip4_unicast, static) = + { + .arc_name = "ip4-unicast", + .start_nodes = VNET_FEATURES ("ip4-input", "ip4-input-no-checksum"), + .arc_index_ptr = &ip4_main.lookup_main.ucast_feature_arc_index, + }; + +In this case, we configure two arc start nodes to handle the +“hardware-verified ip checksum or not” cases. During initialization, the +feature infrastructure stores the arc index as shown. + +In the head-of-arc node, do the following to send packets along the +feature arc: + +.. code:: c + + ip_lookup_main_t *lm = &im->lookup_main; + arc = lm->ucast_feature_arc_index; + +Once per packet, initialize packet metadata to walk the feature arc: + +.. code:: c + + vnet_feature_arc_start (arc, sw_if_index0, &next, b0); + +Enabling / Disabling features +----------------------------- + +Simply call vnet_feature_enable_disable to enable or disable a specific +feature: + +.. code:: c + + vnet_feature_enable_disable ("device-input", /* arc name */ + "mactime", /* feature name */ + sw_if_index, /* Interface sw_if_index */ + enable_disable, /* 1 => enable */ + 0 /* (void *) feature_configuration */, + 0 /* feature_configuration_nbytes */); + +The feature_configuration opaque is seldom used. + +If you wish to make a feature a *de facto* system-level concept, pass +sw_if_index=0 at all times. Sw_if_index 0 is always valid, and +corresponds to the “local” interface. + +Related “show” commands +----------------------- + +To display the entire set of features, use “show features [verbose]”. +The verbose form displays arc indices, and feature indicies within the +arcs + +:: + + $ vppctl show features verbose + Available feature paths + <snip> + [14] ip4-unicast: + [ 0]: nat64-out2in-handoff + [ 1]: nat64-out2in + [ 2]: nat44-ed-hairpin-dst + [ 3]: nat44-hairpin-dst + [ 4]: ip4-dhcp-client-detect + [ 5]: nat44-out2in-fast + [ 6]: nat44-in2out-fast + [ 7]: nat44-handoff-classify + [ 8]: nat44-out2in-worker-handoff + [ 9]: nat44-in2out-worker-handoff + [10]: nat44-ed-classify + [11]: nat44-ed-out2in + [12]: nat44-ed-in2out + [13]: nat44-det-classify + [14]: nat44-det-out2in + [15]: nat44-det-in2out + [16]: nat44-classify + [17]: nat44-out2in + [18]: nat44-in2out + [19]: ip4-qos-record + [20]: ip4-vxlan-gpe-bypass + [21]: ip4-reassembly-feature + [22]: ip4-not-enabled + [23]: ip4-source-and-port-range-check-rx + [24]: ip4-flow-classify + [25]: ip4-inacl + [26]: ip4-source-check-via-rx + [27]: ip4-source-check-via-any + [28]: ip4-policer-classify + [29]: ipsec-input-ip4 + [30]: vpath-input-ip4 + [31]: ip4-vxlan-bypass + [32]: ip4-lookup + <snip> + +Here, we learn that the ip4-unicast feature arc has index 14, and that +e.g. ip4-inacl is the 25th feature in the generated partial order. + +To display the features currently active on a specific interface, use +“show interface features”: + +:: + + $ vppctl show interface GigabitEthernet3/0/0 features + Feature paths configured on GigabitEthernet3/0/0... 
+ <snip> + ip4-unicast: + nat44-out2in + <snip> + +Table of Feature Arcs +--------------------- + +Simply search for name-strings to track down the arc definition, +location of the arc index, etc. + +:: + + | Arc Name | + |------------------| + | device-input | + | ethernet-output | + | interface-output | + | ip4-drop | + | ip4-local | + | ip4-multicast | + | ip4-output | + | ip4-punt | + | ip4-unicast | + | ip6-drop | + | ip6-local | + | ip6-multicast | + | ip6-output | + | ip6-punt | + | ip6-unicast | + | mpls-input | + | mpls-output | + | nsh-output | diff --git a/docs/developer/corearchitecture/index.rst b/docs/developer/corearchitecture/index.rst new file mode 100644 index 00000000000..ecd5a3cdb08 --- /dev/null +++ b/docs/developer/corearchitecture/index.rst @@ -0,0 +1,21 @@ +.. _corearchitecture: + +================= +Core Architecture +================= + +.. toctree:: + :maxdepth: 1 + + softwarearchitecture + infrastructure + vlib + vnet + featurearcs + buffer_metadata + multiarch/index + bihash + buildsystem/index + mem + multi_thread + diff --git a/docs/developer/corearchitecture/infrastructure.rst b/docs/developer/corearchitecture/infrastructure.rst new file mode 100644 index 00000000000..b4e1065f81e --- /dev/null +++ b/docs/developer/corearchitecture/infrastructure.rst @@ -0,0 +1,612 @@ +VPPINFRA (Infrastructure) +========================= + +The files associated with the VPP Infrastructure layer are located in +the ``./src/vppinfra`` folder. + +VPPinfra is a collection of basic c-library services, quite sufficient +to build standalone programs to run directly on bare metal. It also +provides high-performance dynamic arrays, hashes, bitmaps, +high-precision real-time clock support, fine-grained event-logging, and +data structure serialization. + +One fair comment / fair warning about vppinfra: you can't always tell a +macro from an inline function from an ordinary function simply by name. +Macros are used to avoid function calls in the typical case, and to +cause (intentional) side-effects. + +Vppinfra has been around for almost 20 years and tends not to change +frequently. The VPP Infrastructure layer contains the following +functions: + +Vectors +------- + +Vppinfra vectors are ubiquitous dynamically resized arrays with by user +defined "headers". Many vpppinfra data structures (e.g. hash, heap, +pool) are vectors with various different headers. + +The memory layout looks like this: + +:: + + User header (optional, uword aligned) + Alignment padding (if needed) + Vector length in elements + User's pointer -> Vector element 0 + Vector element 1 + ... + Vector element N-1 + +As shown above, the vector APIs deal with pointers to the 0th element of +a vector. Null pointers are valid vectors of length zero. + +To avoid thrashing the memory allocator, one often resets the length of +a vector to zero while retaining the memory allocation. Set the vector +length field to zero via the vec_reset_length(v) macro. [Use the macro! +It’s smart about NULL pointers.] + +Typically, the user header is not present. User headers allow for other +data structures to be built atop vppinfra vectors. Users may specify the +alignment for first data element of a vector via the [vec]()*_aligned +macros. + +Vector elements can be any C type e.g. (int, double, struct bar). This +is also true for data types built atop vectors (e.g. heap, pool, etc.). +Many macros have \_a variants supporting alignment of vector elements +and \_h variants supporting non-zero-length vector headers. 
The \_ha +variants support both. Additionally cacheline alignment within a vector +element structure can be specified using the +``[CLIB_CACHE_LINE_ALIGN_MARK]()`` macro. + +Inconsistent usage of header and/or alignment related macro variants +will cause delayed, confusing failures. + +Standard programming error: memorize a pointer to the ith element of a +vector, and then expand the vector. Vectors expand by 3/2, so such code +may appear to work for a period of time. Correct code almost always +memorizes vector **indices** which are invariant across reallocations. + +In typical application images, one supplies a set of global functions +designed to be called from gdb. Here are a few examples: + +- vl(v) - prints vec_len(v) +- pe(p) - prints pool_elts(p) +- pifi(p, index) - prints pool_is_free_index(p, index) +- debug_hex_bytes (p, nbytes) - hex memory dump nbytes starting at p + +Use the “show gdb” debug CLI command to print the current set. + +Bitmaps +------- + +Vppinfra bitmaps are dynamic, built using the vppinfra vector APIs. +Quite handy for a variety jobs. + +Pools +----- + +Vppinfra pools combine vectors and bitmaps to rapidly allocate and free +fixed-size data structures with independent lifetimes. Pools are perfect +for allocating per-session structures. + +Hashes +------ + +Vppinfra provides several hash flavors. Data plane problems involving +packet classification / session lookup often use +./src/vppinfra/bihash_template.[ch] bounded-index extensible hashes. +These templates are instantiated multiple times, to efficiently service +different fixed-key sizes. + +Bihashes are thread-safe. Read-locking is not required. A simple +spin-lock ensures that only one thread writes an entry at a time. + +The original vppinfra hash implementation in ./src/vppinfra/hash.[ch] +are simple to use, and are often used in control-plane code which needs +exact-string-matching. + +In either case, one almost always looks up a key in a hash table to +obtain an index in a related vector or pool. The APIs are simple enough, +but one must take care when using the unmanaged arbitrary-sized key +variant. Hash_set_mem (hash_table, key_pointer, value) memorizes +key_pointer. It is usually a bad mistake to pass the address of a vector +element as the second argument to hash_set_mem. It is perfectly fine to +memorize constant string addresses in the text segment. + +Timekeeping +----------- + +Vppinfra includes high-precision, low-cost timing services. The datatype +clib_time_t and associated functions reside in ./src/vppinfra/time.[ch]. +Call clib_time_init (clib_time_t \*cp) to initialize the clib_time_t +object. + +Clib_time_init(…) can use a variety of different ways to establish the +hardware clock frequency. At the end of the day, vppinfra timekeeping +takes the attitude that the operating system’s clock is the closest +thing to a gold standard it has handy. + +When properly configured, NTP maintains kernel clock synchronization +with a highly accurate off-premises reference clock. Notwithstanding +network propagation delays, a synchronized NTP client will keep the +kernel clock accurate to within 50ms or so. + +Why should one care? Simply put, oscillators used to generate CPU ticks +aren’t super accurate. They work pretty well, but a 0.1% error wouldn’t +be out of the question. That’s a minute and a half’s worth of error in 1 +day. The error changes constantly, due to temperature variation, and a +host of other physical factors. 
+ +It’s far too expensive to use system calls for timing, so we’re left +with the problem of continuously adjusting our view of the CPU tick +register’s clocks_per_second parameter. + +The clock rate adjustment algorithm measures the number of cpu ticks and +the “gold standard” reference time across an interval of approximately +16 seconds. We calculate clocks_per_second for the interval: use rdtsc +(on x86_64) and a system call to get the latest cpu tick count and the +kernel’s latest nanosecond timestamp. We subtract the previous interval +end values, and use exponential smoothing to merge the new clock rate +sample into the clocks_per_second parameter. + +As of this writing, we maintain the clock rate by way of the following +first-order differential equation: + +.. code:: c + + clocks_per_second(t) = clocks_per_second(t-1) * K + sample_cps(t)*(1-K) + where K = e**(-1.0/3.75); + +This yields a per observation “half-life” of 1 minute. Empirically, the +clock rate converges within 5 minutes, and appears to maintain +near-perfect agreement with the kernel clock in the face of ongoing NTP +time adjustments. + +See ./src/vppinfra/time.c:clib_time_verify_frequency(…) to look at the +rate adjustment algorithm. The code rejects frequency samples +corresponding to the sort of adjustment which might occur if someone +changes the gold standard kernel clock by several seconds. + +Monotonic timebase support +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Particularly during system initialization, the “gold standard” system +reference clock can change by a large amount, in an instant. It’s not a +best practice to yank the reference clock - in either direction - by +hours or days. In fact, some poorly-constructed use-cases do so. + +To deal with this reality, clib_time_now(…) returns the number of +seconds since vpp started, *guaranteed to be monotonically increasing, +no matter what happens to the system reference clock*. + +This is first-order important, to avoid breaking every active timer in +the system. The vpp host stack alone may account for tens of millions of +active timers. It’s utterly impractical to track down and fix timers, so +we must deal with the issue at the timebase level. + +Here’s how it works. Prior to adjusting the clock rate, we collect the +kernel reference clock and the cpu clock: + +.. code:: c + + /* Ask the kernel and the CPU what time it is... */ + now_reference = unix_time_now (); + now_clock = clib_cpu_time_now (); + +Compute changes for both clocks since the last rate adjustment, roughly +15 seconds ago: + +.. code:: c + + /* Compute change in the reference clock */ + delta_reference = now_reference - c->last_verify_reference_time; + + /* And change in the CPU clock */ + delta_clock_in_seconds = (f64) (now_clock - c->last_verify_cpu_time) * + c->seconds_per_clock; + +Delta_reference is key. Almost 100% of the time, delta_reference and +delta_clock_in_seconds are identical modulo one system-call time. +However, NTP or a privileged user can yank the system reference time - +in either direction - by an hour, a day, or a decade. + +As described above, clib_time_now(…) must return monotonically +increasing answers to the question “how long has it been since vpp +started, in seconds.” To do that, the clock rate adjustment algorithm +begins by recomputing the initial reference time: + +.. 
code:: c + + c->init_reference_time += (delta_reference - delta_clock_in_seconds); + +It’s easy to convince yourself that if the reference clock changes by +15.000000 seconds and the cpu clock tick time changes by 15.000000 +seconds, the initial reference time won’t change. + +If, on the other hand, delta_reference is -86400.0 and delta clock is +15.0 - reference time jumped backwards by exactly one day in a 15-second +rate update interval - we add -86415.0 to the initial reference time. + +Given the corrected initial reference time, we recompute the total +number of cpu ticks which have occurred since the corrected initial +reference time, at the current clock tick rate: + +.. code:: c + + c->total_cpu_time = (now_reference - c->init_reference_time) + * c->clocks_per_second; + +Timebase precision +~~~~~~~~~~~~~~~~~~ + +Cognoscenti may notice that vlib/clib_time_now(…) return a 64-bit +floating-point value; the number of seconds since vpp started. + +Please see `this Wikipedia +article <https://en.wikipedia.org/wiki/Double-precision_floating-point_format>`__ +for more information. C double-precision floating point numbers (called +f64 in the vpp code base) have a 53-bit effective mantissa, and can +accurately represent 15 decimal digits’ worth of precision. + +There are 315,360,000.000001 seconds in ten years plus one microsecond. +That string has exactly 15 decimal digits. The vpp time base retains 1us +precision for roughly 30 years. + +vlib/clib_time_now do *not* provide precision in excess of 1e-6 seconds. +If necessary, please use clib_cpu_time_now(…) for direct access to the +CPU clock-cycle counter. Note that the number of CPU clock cycles per +second varies significantly across CPU architectures. + +Timer Wheels +------------ + +Vppinfra includes configurable timer wheel support. See the source code +in …/src/vppinfra/tw_timer_template.[ch], as well as a considerable +number of template instances defined in …/src/vppinfra/tw_timer\_.[ch]. + +Instantiation of tw_timer_template.h generates named structures to +implement specific timer wheel geometries. Choices include: number of +timer wheels (currently, 1 or 2), number of slots per ring (a power of +two), and the number of timers per “object handle”. + +Internally, user object/timer handles are 32-bit integers, so if one +selects 16 timers/object (4 bits), the resulting timer wheel handle is +limited to 2**28 objects. + +Here are the specific settings required to generate a single 2048 slot +wheel which supports 2 timers per object: + +.. code:: c + + #define TW_TIMER_WHEELS 1 + #define TW_SLOTS_PER_RING 2048 + #define TW_RING_SHIFT 11 + #define TW_RING_MASK (TW_SLOTS_PER_RING -1) + #define TW_TIMERS_PER_OBJECT 2 + #define LOG2_TW_TIMERS_PER_OBJECT 1 + #define TW_SUFFIX _2t_1w_2048sl + #define TW_FAST_WHEEL_BITMAP 0 + #define TW_TIMER_ALLOW_DUPLICATE_STOP 0 + +See tw_timer_2t_1w_2048sl.h for a complete example. + +tw_timer_template.h is not intended to be #included directly. Client +codes can include multiple timer geometry header files, although extreme +caution would required to use the TW and TWT macros in such a case. + +API usage examples +~~~~~~~~~~~~~~~~~~ + +The unit test code in …/src/vppinfra/test_tw_timer.c provides a concrete +API usage example. It uses a synthetic clock to rapidly exercise the +underlying tw_timer_expire_timers(…) template. + +There are not many API routines to call. 
Initialize a two-timer, single 2048-slot wheel w/ a 1-second timer granularity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: c

   tw_timer_wheel_init_2t_1w_2048sl (&tm->single_wheel,
                                     expired_timer_single_callback,
                                     1.0 /* timer interval */ );

Start a timer
^^^^^^^^^^^^^

.. code:: c

   handle = tw_timer_start_2t_1w_2048sl (&tm->single_wheel, elt_index,
                                         [0 | 1] /* timer id */ ,
                                         expiration_time_in_u32_ticks);

Stop a timer
^^^^^^^^^^^^

.. code:: c

   tw_timer_stop_2t_1w_2048sl (&tm->single_wheel, handle);

An expired timer callback
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: c

   static void
   expired_timer_single_callback (u32 * expired_timers)
   {
     int i;
     u32 pool_index, timer_id;
     tw_timer_test_elt_t *e;
     tw_timer_test_main_t *tm = &tw_timer_test_main;

     for (i = 0; i < vec_len (expired_timers); i++)
       {
         pool_index = expired_timers[i] & 0x7FFFFFFF;
         timer_id = expired_timers[i] >> 31;

         ASSERT (timer_id == 1);

         e = pool_elt_at_index (tm->test_elts, pool_index);

         if (e->expected_to_expire != tm->single_wheel.current_tick)
           {
             fformat (stdout, "[%d] expired at %d not %d\n",
                      e - tm->test_elts, tm->single_wheel.current_tick,
                      e->expected_to_expire);
           }
         pool_put (tm->test_elts, e);
       }
   }

We use wheel timers extensively in the vpp host stack. Each TCP session
needs 5 timers, so supporting 10 million flows requires up to 50 million
concurrent timers.

Timers rarely expire, so it's of the utmost importance that stopping and
restarting a timer costs as few clock cycles as possible.

Stopping a timer costs a doubly-linked list dequeue. Starting a timer
involves modular arithmetic to determine the correct timer wheel and
slot, and a list head enqueue.

Expired timer processing generally involves bulk linked-list retirement
with user callback presentation. There is some additional complexity at
wheel wrap time, to relocate timers from slower-turning timer wheels into
faster-turning wheels.

Format
------

Vppinfra format is roughly equivalent to printf.

Format has a few properties worth mentioning. Format's first argument is
a (u8 \*) vector to which it appends the result of the current format
operation. Chaining calls is very easy:

.. code:: c

   u8 * result;

   result = format (0, "junk = %d, ", junk);
   result = format (result, "more junk = %d\n", more_junk);

As previously noted, NULL pointers are perfectly proper 0-length
vectors. Format returns a (u8 \*) vector, **not** a C-string. If you
wish to print a (u8 \*) vector, use the "%v" format string. If you need
a (u8 \*) vector which is also a proper C-string, either of these
schemes may be used:

.. code:: c

   vec_add1 (result, 0)
   or
   result = format (result, "<whatever>%c", 0);

Remember to vec_free() the result if appropriate. Be careful not to pass
format an uninitialized (u8 \*).

Format implements a particularly handy user-format scheme via the "%U"
format specification. For example:

.. code:: c

   u8 * format_junk (u8 * s, va_list * va)
   {
     char *junk = va_arg (*va, char *);
     s = format (s, "%s", junk);
     return s;
   }

   result = format (0, "junk = %U", format_junk, "This is some junk");

format_junk() can invoke other user-format functions if desired. The
programmer shoulders responsibility for argument type-checking. It is
typical for user format functions to blow up spectacularly if the
va_arg(va, type) macros don't match the caller's idea of reality.
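As an illustration of one user-format function invoking another, here is a
short sketch; every name in it is invented for the example and is not part
of vppinfra:

.. code:: c

   /* Hypothetical formatter for a (value, unit) pair */
   static u8 *
   format_my_units (u8 * s, va_list * va)
   {
     u32 value = va_arg (*va, u32);
     char *unit = va_arg (*va, char *);
     return format (s, "%u %s", value, unit);
   }

   /* Hypothetical formatter which invokes the one above via %U */
   static u8 *
   format_my_rate (u8 * s, va_list * va)
   {
     u32 pps = va_arg (*va, u32);
     return format (s, "rate = %U", format_my_units, pps, "pkts/sec");
   }

   /* usage */
   u8 *s = format (0, "%U", format_my_rate, 1000000);

As always with %U, the caller and the user-format function must agree on
the number and types of the variable arguments.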
Unformat
--------

Vppinfra unformat is vaguely related to scanf, but considerably more
general.

A typical use case involves initializing an unformat_input_t from either
a C-string or a (u8 \*) vector, then parsing via unformat() as follows:

.. code:: c

   unformat_input_t input;
   u8 *s = "<some-C-string>";

   unformat_init_string (&input, (char *) s, strlen ((char *) s));
   /* or */
   unformat_init_vector (&input, <u8-vector>);

Then loop parsing individual elements:

.. code:: c

   while (unformat_check_input (&input) != UNFORMAT_END_OF_INPUT)
     {
       if (unformat (&input, "value1 %d", &value1))
         ; /* unformat sets value1 */
       else if (unformat (&input, "value2 %d", &value2))
         ; /* unformat sets value2 */
       else
         return clib_error_return (0, "unknown input '%U'",
                                   format_unformat_error, input);
     }

As with format, unformat implements a user-unformat function capability
via a "%U" user unformat function scheme. Generally, one can trivially
transform ``format (s, "foo %d", foo)`` into
``unformat (input, "foo %d", &foo)``.

Unformat implements a couple of handy non-scanf-like format specifiers:

.. code:: c

   unformat (input, "enable %=", &enable, 1 /* defaults to 1 */);
   unformat (input, "bitzero %|", &mask, (1<<0));
   unformat (input, "bitone %|", &mask, (1<<1));
   <etc>

The phrase "enable %=" means "set the supplied variable to the default
value" if unformat parses the "enable" keyword all by itself. If
unformat parses "enable 123", it sets the supplied variable to 123.

We could clean up a number of hand-rolled "verbose" + "verbose %d"
argument parsing schemes using "%=".

The phrase "bitzero %|" means "set the specified bit in the supplied
bitmask" if unformat parses "bitzero". Although it looks like it could
be fairly handy, it's very lightly used in the code base.

``%_`` toggles whether or not to skip input white space.

For a transition from skip to no-skip in the middle of a format string,
skip input white space. For example, the following:

.. code:: c

   fmt = "%_%d.%d%_->%_%d.%d%_";
   unformat (input, fmt, &one, &two, &three, &four);

matches the input "1.2 -> 3.4". Without this, the space after "->" does
not get skipped.


How to parse a single input line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Debug CLI command functions MUST NOT accidentally consume input
belonging to other debug CLI commands. Otherwise, it's impossible to
script a set of debug CLI commands which "work fine" when issued one
at a time.

This bit of code is NOT correct:

.. code:: c

   /* Eats script input NOT belonging to it, and chokes! */
   while (unformat_check_input (input) != UNFORMAT_END_OF_INPUT)
     {
       if (unformat (input, ...))
         ;
       else if (unformat (input, ...))
         ;
       else
         return clib_error_return (0, "parse error: '%U'",
                                   format_unformat_error, input);
     }

When executed as part of a script, such a function will return "parse
error: ''" every time, unless it happens to be the last command in the
script.

Instead, use "unformat_line_input" to consume the rest of a line's worth
of input - everything past the path specified in the VLIB_CLI_COMMAND
declaration.

For example, unformat_line_input with "my_command" set up as shown below
and user input "my path is clear" will produce an unformat_input_t that
contains "is clear".

.. code:: c

   VLIB_CLI_COMMAND (...) = {
     .path = "my path",
   };

Here's a bit of code which shows the required mechanics, in full:
.. code:: c

   static clib_error_t *
   my_command_fn (vlib_main_t * vm,
                  unformat_input_t * input,
                  vlib_cli_command_t * cmd)
   {
     unformat_input_t _line_input, *line_input = &_line_input;
     u32 this, that;
     clib_error_t *error = 0;

     if (!unformat_user (input, unformat_line_input, line_input))
       return 0;

     /*
      * Here, UNFORMAT_END_OF_INPUT is at the end of the line we consumed,
      * not at the end of the script...
      */
     while (unformat_check_input (line_input) != UNFORMAT_END_OF_INPUT)
       {
         if (unformat (line_input, "this %u", &this))
           ;
         else if (unformat (line_input, "that %u", &that))
           ;
         else
           {
             error = clib_error_return (0, "parse error: '%U'",
                                        format_unformat_error, line_input);
             goto done;
           }
       }

     <do something based on "this" and "that", etc>

   done:
     unformat_free (line_input);
     return error;
   }
   VLIB_CLI_COMMAND (my_command, static) = {
     .path = "my path",
     .function = my_command_fn,
   };

Vppinfra errors and warnings
----------------------------

Many functions within the vpp dataplane have return-values of type
clib_error_t \*. Clib_error_t's are arbitrary strings with a bit of
metadata [fatal, warning] and are easy to announce. Returning a NULL
clib_error_t \* indicates "A-OK, no error."

Clib_warning(format-args) is a handy way to add debugging output; clib
warnings prepend function:line info to unambiguously locate the message
source. Clib_unix_warning() adds perror()-style Linux system-call
information. In production images, clib_warnings result in syslog
entries.

Serialization
-------------

Vppinfra serialization support allows the programmer to easily serialize
and unserialize complex data structures.

The underlying primitive serialize/unserialize functions use network
byte-order, so there are no structural issues serializing on a
little-endian host and unserializing on a big-endian host.
diff --git a/docs/developer/corearchitecture/mem.rst b/docs/developer/corearchitecture/mem.rst
new file mode 120000
index 00000000000..0fc53eab68c
--- /dev/null
+++ b/docs/developer/corearchitecture/mem.rst
@@ -0,0 +1 @@
+../../../src/vpp/mem/mem.rst
\ No newline at end of file diff --git a/docs/developer/corearchitecture/multi_thread.rst b/docs/developer/corearchitecture/multi_thread.rst new file mode 100644 index 00000000000..195a9b791fd --- /dev/null +++ b/docs/developer/corearchitecture/multi_thread.rst @@ -0,0 +1,169 @@ +.. _vpp_multi_thread: + +Multi-threading in VPP +====================== + +Modes +----- + +VPP can work in 2 different modes: + +- single-thread +- multi-thread with worker threads + +Single-thread +~~~~~~~~~~~~~ + +In a single-thread mode there is one main thread which handles both +packet processing and other management functions (Command-Line Interface +(CLI), API, stats). This is the default setup. There is no special +startup config needed. + +Multi-thread with Worker Threads +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In this mode, the main threads handles management functions(debug CLI, +API, stats collection) and one or more worker threads handle packet +processing from input to output of the packet. + +Each worker thread polls input queues on subset of interfaces. + +With RSS (Receive Side Scaling) enabled multiple threads can service one +physical interface (RSS function on NIC distributes traffic between +different queues which are serviced by different worker threads). + +Thread placement +---------------- + +Thread placement is defined in the startup config under the cpu { … } +section. + +The VPP platform can place threads automatically or manually. Automatic +placement works in the following way: + +- if “skip-cores X” is defined first X cores will not be used +- if “main-core X” is defined, VPP main thread will be placed on core + X, otherwise 1st available one will be used +- if “workers N” is defined vpp will allocate first N available cores + and it will run threads on them +- if “corelist-workers A,B1-Bn,C1-Cn” is defined vpp will automatically + assign those CPU cores to worker threads + +User can see active placement of cores by using the VPP debug CLI +command show threads: + +.. code-block:: console + + vpd# show threads + ID Name Type LWP lcore Core Socket State + 0 vpe_main 59723 2 2 0 wait + 1 vpe_wk_0 workers 59755 4 4 0 running + 2 vpe_wk_1 workers 59756 5 5 0 running + 3 vpe_wk_2 workers 59757 6 0 1 running + 4 vpe_wk_3 workers 59758 7 1 1 running + 5 stats 59775 + vpd# + +The sample output above shows the main thread running on core 2 (2nd +core on the CPU socket 0), worker threads running on cores 4-7. + +Sample Configurations +--------------------- + +By default, at start-up VPP uses +configuration values from: ``/etc/vpp/startup.conf`` + +The following sections describe some of the additional changes that can be made to this file. +This file is initially populated from the files located in the following directory ``/vpp/vpp/conf/`` + +Manual Placement +~~~~~~~~~~~~~~~~ + +Manual placement places the main thread on core 1, workers on cores +4,5,20,21. + +.. code-block:: console + + cpu { + main-core 1 + corelist-workers 4-5,20-21 + } + +Auto placement +-------------- + +Auto placement is likely to place the main thread on core 1 and workers +on cores 2,3,4. + +.. code-block:: console + + cpu { + skip-cores 1 + workers 3 + } + +Buffer Memory Allocation +~~~~~~~~~~~~~~~~~~~~~~~~ + +The VPP platform is NUMA aware. It can allocate memory for buffers on +different CPU sockets (NUMA nodes). The amount of memory allocated can +be defined in the startup config for each CPU socket by using the +socket-mem A[[,B],C] statement inside the dpdk { … } section. + +For example: + +.. 
code-block:: console + + dpdk { + socket-mem 1024,1024 + } + +The above configuration allocates 1GB of memory on NUMA#0 and 1GB on +NUMA#1. Each worker thread uses buffers which are local to itself. + +Buffer memory is allocated from hugepages. VPP prefers 1G pages if they +are available. If not 2MB pages will be used. + +VPP takes care of mounting/unmounting hugepages file-system +automatically so there is no need to do that manually. + +’‘’NOTE’’’: If you are running latest VPP release, there is no need for +specifying socket-mem manually. VPP will discover all NUMA nodes and it +will allocate 512M on each by default. socket-mem is only needed if +bigger number of mbufs is required (default is 16384 per socket and can +be changed with num-mbufs startup config command). + +Interface Placement in Multi-thread Setup +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +On startup, the VPP platform assigns interfaces (or interface, queue +pairs if RSS is used) to different worker threads in round robin +fashion. + +The following example shows debug CLI commands to show and change +interface placement: + +.. code-block:: console + + vpd# sh dpdk interface placement + Thread 1 (vpp_wk_0 at lcore 5): + TenGigabitEthernet2/0/0 queue 0 + TenGigabitEthernet2/0/1 queue 0 + Thread 2 (vpp_wk_1 at lcore 6): + TenGigabitEthernet2/0/0 queue 1 + TenGigabitEthernet2/0/1 queue 1 + +The following shows an example of moving TenGigabitEthernet2/0/1 queue 1 +processing to 1st worker thread: + +.. code-block:: console + + vpd# set interface placement TenGigabitEthernet2/0/1 queue 1 thread 1 + + vpp# sh dpdk interface placement + Thread 1 (vpp_wk_0 at lcore 5): + TenGigabitEthernet2/0/0 queue 0 + TenGigabitEthernet2/0/1 queue 0 + TenGigabitEthernet2/0/1 queue 1 + Thread 2 (vpp_wk_1 at lcore 6): + TenGigabitEthernet2/0/0 queue 1 diff --git a/docs/developer/corearchitecture/multiarch/arbfns.rst b/docs/developer/corearchitecture/multiarch/arbfns.rst new file mode 100644 index 00000000000..d469bd8a140 --- /dev/null +++ b/docs/developer/corearchitecture/multiarch/arbfns.rst @@ -0,0 +1,87 @@ +Multi-Architecture Arbitrary Function Cookbook +============================================== + +Optimizing arbitrary functions for multiple architectures is simple +enough, and very similar to process used to produce multi-architecture +graph node dispatch functions. + +As with multi-architecture graph nodes, we compile source files +multiple times, generating multiple implementations of the original +function, and a public selector function. + +Details +------- + +Decorate function definitions with CLIB_MARCH_FN macros. For example: + +Change the original function prototype... + +:: + + u32 vlib_frame_alloc_to_node (vlib_main_t * vm, u32 to_node_index, + u32 frame_flags) + +...by recasting the function name and return type as the first two +arguments to the CLIB_MARCH_FN macro: + +:: + + CLIB_MARCH_FN (vlib_frame_alloc_to_node, u32, vlib_main_t * vm, + u32 to_node_index, u32 frame_flags) + +In the actual vpp image, several versions of vlib_frame_alloc_to_node +will appear: vlib_frame_alloc_to_node_avx2, +vlib_frame_alloc_to_node_avx512, and so forth. 
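For illustration, a complete decorated definition might look like the
following sketch; my_sum_u32 is an invented helper, not a vpp function:

::

    /* Compiled once per enabled architecture variant */
    CLIB_MARCH_FN (my_sum_u32, u32, u32 * data, u32 n_elts)
    {
      u32 i, sum = 0;

      /* Each variant is free to vectorize this loop differently */
      for (i = 0; i < n_elts; i++)
        sum += data[i];
      return sum;
    }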
+ + +For each multi-architecture function, use the CLIB_MARCH_FN_SELECT +macro to help generate the one-and-only multi-architecture selector +function: + +:: + + #ifndef CLIB_MARCH_VARIANT + u32 + vlib_frame_alloc_to_node (vlib_main_t * vm, u32 to_node_index, + u32 frame_flags) + { + return CLIB_MARCH_FN_SELECT (vlib_frame_alloc_to_node) + (vm, to_node_index, frame_flags); + } + #endif /* CLIB_MARCH_VARIANT */ + +Once bound, the multi-architecture selector function is about as +expensive as an indirect function call; which is to say: not very +expensive. + +Modify CMakeLists.txt +--------------------- + +If the component in question already lists "MULTIARCH_SOURCES", simply +add the indicated .c file to the list. Otherwise, add as shown +below. Note that the added file "new_multiarch_node.c" should appear in +*both* SOURCES and MULTIARCH_SOURCES: + +:: + + add_vpp_plugin(myplugin + SOURCES + multiarch_code.c + ... + + MULTIARCH_SOURCES + multiarch_code.c + ... + ) + +A Word to the Wise +------------------ + +A file which liberally mixes functions worth compiling for multiple +architectures and functions which are not will end up full of +#ifndef CLIB_MARCH_VARIANT conditionals. This won't do a thing to make +the code look any better. + +Depending on requirements, it may make sense to move functions to +(new) files to reduce complexity and/or improve legibility of the +resulting code. diff --git a/docs/developer/corearchitecture/multiarch/index.rst b/docs/developer/corearchitecture/multiarch/index.rst new file mode 100644 index 00000000000..824a8e68438 --- /dev/null +++ b/docs/developer/corearchitecture/multiarch/index.rst @@ -0,0 +1,12 @@ +.. _multiarch: + +Multi-architecture support +========================== + +This reference guide describes how to use the vpp multi-architecture support scheme + +.. toctree:: + :maxdepth: 1 + + nodefns + arbfns diff --git a/docs/developer/corearchitecture/multiarch/nodefns.rst b/docs/developer/corearchitecture/multiarch/nodefns.rst new file mode 100644 index 00000000000..9647e64f08c --- /dev/null +++ b/docs/developer/corearchitecture/multiarch/nodefns.rst @@ -0,0 +1,138 @@ +Multi-Architecture Graph Node Cookbook +====================================== + +In the context of graph node dispatch functions, it's easy enough to +use the vpp multi-architecture support setup. The point of the scheme +is simple: for performance-critical nodes, generate multiple CPU +hardware-dependent versions of the node dispatch functions, and pick +the best one at runtime. + +The vpp scheme is simple enough to use, but details matter. + +100,000 foot view +----------------- + +We compile entire graph node dispatch function implementation files +multiple times. These compilations give rise to multiple versions of +the graph node dispatch functions. Per-node constructor-functions +interrogate CPU hardware, select the node dispatch function variant to +use, and set the vlib_node_registration_t ".function" member to the +address of the selected variant. + +Details +------- + +Declare the node dispatch function as shown, using the VLIB\_NODE\_FN macro. The +name of the node function **MUST** match the name of the graph node. 
+ +:: + + VLIB_NODE_FN (ip4_sdp_node) (vlib_main_t * vm, vlib_node_runtime_t * node, + vlib_frame_t * frame) + { + if (PREDICT_FALSE (node->flags & VLIB_NODE_FLAG_TRACE)) + return ip46_sdp_inline (vm, node, frame, 1 /* is_ip4 */ , + 1 /* is_trace */ ); + else + return ip46_sdp_inline (vm, node, frame, 1 /* is_ip4 */ , + 0 /* is_trace */ ); + } + +We need to generate *precisely one copy* of the +vlib_node_registration_t, error strings, and packet trace decode function. + +Simply bracket these items with "#ifndef CLIB_MARCH_VARIANT...#endif": + +:: + + #ifndef CLIB_MARCH_VARIANT + static u8 * + format_sdp_trace (u8 * s, va_list * args) + { + <snip> + } + #endif + + ... + + #ifndef CLIB_MARCH_VARIANT + static char *sdp_error_strings[] = { + #define _(sym,string) string, + foreach_sdp_error + #undef _ + }; + #endif + + ... + + #ifndef CLIB_MARCH_VARIANT + VLIB_REGISTER_NODE (ip4_sdp_node) = + { + // DO NOT set the .function structure member. + // The multiarch selection __attribute__((constructor)) function + // takes care of it at runtime + .name = "ip4-sdp", + .vector_size = sizeof (u32), + .format_trace = format_sdp_trace, + .type = VLIB_NODE_TYPE_INTERNAL, + + .n_errors = ARRAY_LEN(sdp_error_strings), + .error_strings = sdp_error_strings, + + .n_next_nodes = SDP_N_NEXT, + + /* edit / add dispositions here */ + .next_nodes = + { + [SDP_NEXT_DROP] = "ip4-drop", + }, + }; + #endif + +To belabor the point: *do not* set the ".function" member! That's the job of the multi-arch +selection \_\_attribute\_\_((constructor)) function + +Always inline node dispatch functions +------------------------------------- + +It's typical for a graph dispatch function to contain one or more +calls to an inline function. See above. If your node dispatch function +is structured that way, make *ABSOLUTELY CERTAIN* to use the +"always_inline" macro: + +:: + + always_inline uword + ip46_sdp_inline (vlib_main_t * vm, vlib_node_runtime_t * node, + vlib_frame_t * frame, + int is_ip4, int is_trace) + { ... } + +Otherwise, the compiler is highly likely NOT to build multiple +versions of the guts of your dispatch function. + +It's fairly easy to spot this mistake in "perf top." If you see, for +example, a bunch of functions with names of the form +"xxx_node_fn_avx2" in the profile, *BUT* your brand-new node function +shows up with a name of the form "xxx_inline.isra.1", it's quite likely +that the inline was declared "static inline" instead of "always_inline". + +Modify CMakeLists.txt +--------------------- + +If the component in question already lists "MULTIARCH_SOURCES", simply +add the indicated .c file to the list. Otherwise, add as shown +below. Note that the added file "new_multiarch_node.c" should appear in +*both* SOURCES and MULTIARCH_SOURCES: + +:: + + add_vpp_plugin(myplugin + SOURCES + new_multiarch_node.c + ... + + MULTIARCH_SOURCES + new_ multiarch_node.c + ... + ) diff --git a/docs/developer/corearchitecture/softwarearchitecture.rst b/docs/developer/corearchitecture/softwarearchitecture.rst new file mode 100644 index 00000000000..7f8a0e04645 --- /dev/null +++ b/docs/developer/corearchitecture/softwarearchitecture.rst @@ -0,0 +1,47 @@ +Software Architecture +===================== + +The fd.io vpp implementation is a third-generation vector packet +processing implementation specifically related to US Patent 7,961,636, +as well as earlier work. Note that the Apache-2 license specifically +grants non-exclusive patent licenses; we mention this patent as a point +of historical interest. 
+ +For performance, the vpp dataplane consists of a directed graph of +forwarding nodes which process multiple packets per invocation. This +schema enables a variety of micro-processor optimizations: pipelining +and prefetching to cover dependent read latency, inherent I-cache phase +behavior, vector instructions. Aside from hardware input and hardware +output nodes, the entire forwarding graph is portable code. + +Depending on the scenario at hand, we often spin up multiple worker +threads which process ingress-hashes packets from multiple queues using +identical forwarding graph replicas. + +VPP Layers - Implementation Taxonomy +------------------------------------ + +.. figure:: /_images/VPP_Layering.png + :alt: image + + image + +- VPP Infra - the VPP infrastructure layer, which contains the core + library source code. This layer performs memory functions, works with + vectors and rings, performs key lookups in hash tables, and works + with timers for dispatching graph nodes. +- VLIB - the vector processing library. The vlib layer also handles + various application management functions: buffer, memory and graph + node management, maintaining and exporting counters, thread + management, packet tracing. Vlib implements the debug CLI (command + line interface). +- VNET - works with VPP's networking interface (layers 2, 3, and 4) + performs session and traffic management, and works with devices and + the data control plane. +- Plugins - Contains an increasingly rich set of data-plane plugins, as + noted in the above diagram. +- VPP - the container application linked against all of the above. + +It’s important to understand each of these layers in a certain amount of +detail. Much of the implementation is best dealt with at the API level +and otherwise left alone. diff --git a/docs/developer/corearchitecture/vlib.rst b/docs/developer/corearchitecture/vlib.rst new file mode 100644 index 00000000000..f542d33ebb8 --- /dev/null +++ b/docs/developer/corearchitecture/vlib.rst @@ -0,0 +1,888 @@ +VLIB (Vector Processing Library) +================================ + +The files associated with vlib are located in the ./src/{vlib, vlibapi, +vlibmemory} folders. These libraries provide vector processing support +including graph-node scheduling, reliable multicast support, +ultra-lightweight cooperative multi-tasking threads, a CLI, plug in .DLL +support, physical memory and Linux epoll support. Parts of this library +embody US Patent 7,961,636. + +Init function discovery +----------------------- + +vlib applications register for various [initialization] events by +placing structures and \__attribute__((constructor)) functions into the +image. At appropriate times, the vlib framework walks +constructor-generated singly-linked structure lists, performs a +topological sort based on specified constraints, and calls the indicated +functions. Vlib applications create graph nodes, add CLI functions, +start cooperative multi-tasking threads, etc. etc. using this mechanism. + +vlib applications invariably include a number of VLIB_INIT_FUNCTION +(my_init_function) macros. + +Each init / configure / etc. function has the return type clib_error_t +\*. Make sure that the function returns 0 if all is well, otherwise the +framework will announce an error and exit. + +vlib applications must link against vppinfra, and often link against +other libraries such as VNET. In the latter case, it may be necessary to +explicitly reference symbol(s) otherwise large portions of the library +may be AWOL at runtime. 
+ +Init function construction and constraint specification +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It’s easy to add an init function: + +.. code:: c + + static clib_error_t *my_init_function (vlib_main_t *vm) + { + /* ... initialize things ... */ + + return 0; // or return clib_error_return (0, "BROKEN!"); + } + VLIB_INIT_FUNCTION(my_init_function); + +As given, my_init_function will be executed “at some point,” but with no +ordering guarantees. + +Specifying ordering constraints is easy: + +.. code:: c + + VLIB_INIT_FUNCTION(my_init_function) = + { + .runs_before = VLIB_INITS("we_run_before_function_1", + "we_run_before_function_2"), + .runs_after = VLIB_INITS("we_run_after_function_1", + "we_run_after_function_2), + }; + +It’s also easy to specify bulk ordering constraints of the form “a then +b then c then d”: + +.. code:: c + + VLIB_INIT_FUNCTION(my_init_function) = + { + .init_order = VLIB_INITS("a", "b", "c", "d"), + }; + +It’s OK to specify all three sorts of ordering constraints for a single +init function, although it’s hard to imagine why it would be necessary. + +Node Graph Initialization +------------------------- + +vlib packet-processing applications invariably define a set of graph +nodes to process packets. + +One constructs a vlib_node_registration_t, most often via the +VLIB_REGISTER_NODE macro. At runtime, the framework processes the set of +such registrations into a directed graph. It is easy enough to add nodes +to the graph at runtime. The framework does not support removing nodes. + +vlib provides several types of vector-processing graph nodes, primarily +to control framework dispatch behaviors. The type member of the +vlib_node_registration_t functions as follows: + +- VLIB_NODE_TYPE_PRE_INPUT - run before all other node types +- VLIB_NODE_TYPE_INPUT - run as often as possible, after pre_input + nodes +- VLIB_NODE_TYPE_INTERNAL - only when explicitly made runnable by + adding pending frames for processing +- VLIB_NODE_TYPE_PROCESS - only when explicitly made runnable. + “Process” nodes are actually cooperative multi-tasking threads. They + **must** explicitly suspend after a reasonably short period of time. + +For a precise understanding of the graph node dispatcher, please read +./src/vlib/main.c:vlib_main_loop. + +Graph node dispatcher +--------------------- + +Vlib_main_loop() dispatches graph nodes. The basic vector processing +algorithm is diabolically simple, but may not be obvious from even a +long stare at the code. Here’s how it works: some input node, or set of +input nodes, produce a vector of work to process. The graph node +dispatcher pushes the work vector through the directed graph, +subdividing it as needed, until the original work vector has been +completely processed. At that point, the process recurs. + +This scheme yields a stable equilibrium in frame size, by construction. +Here’s why: as the frame size increases, the per-frame-element +processing time decreases. There are several related forces at work; the +simplest to describe is the effect of vector processing on the CPU L1 +I-cache. The first frame element [packet] processed by a given node +warms up the node dispatch function in the L1 I-cache. All subsequent +frame elements profit. As we increase the number of frame elements, the +cost per element goes down. + +Under light load, it is a crazy waste of CPU cycles to run the graph +node dispatcher flat-out. 
So, the graph node dispatcher arranges to wait +for work by sitting in a timed epoll wait if the prevailing frame size +is low. The scheme has a certain amount of hysteresis to avoid +constantly toggling back and forth between interrupt and polling mode. +Although the graph dispatcher supports interrupt and polling modes, our +current default device drivers do not. + +The graph node scheduler uses a hierarchical timer wheel to reschedule +process nodes upon timer expiration. + +Graph dispatcher internals +-------------------------- + +This section may be safely skipped. It’s not necessary to understand +graph dispatcher internals to create graph nodes. + +Vector Data Structure +--------------------- + +In vpp / vlib, we represent vectors as instances of the vlib_frame_t +type: + +.. code:: c + + typedef struct vlib_frame_t + { + /* Frame flags. */ + u16 flags; + + /* Number of scalar bytes in arguments. */ + u8 scalar_size; + + /* Number of bytes per vector argument. */ + u8 vector_size; + + /* Number of vector elements currently in frame. */ + u16 n_vectors; + + /* Scalar and vector arguments to next node. */ + u8 arguments[0]; + } vlib_frame_t; + +Note that one *could* construct all kinds of vectors - including vectors +with some associated scalar data - using this structure. In the vpp +application, vectors typically use a 4-byte vector element size, and +zero bytes’ worth of associated per-frame scalar data. + +Frames are always allocated on CLIB_CACHE_LINE_BYTES boundaries. Frames +have u32 indices which make use of the alignment property, so the +maximum feasible main heap offset of a frame is CLIB_CACHE_LINE_BYTES \* +0xFFFFFFFF: 64*4 = 256 Gbytes. + +Scheduling Vectors +------------------ + +As you can see, vectors are not directly associated with graph nodes. We +represent that association in a couple of ways. The simplest is the +vlib_pending_frame_t: + +.. code:: c + + /* A frame pending dispatch by main loop. */ + typedef struct + { + /* Node and runtime for this frame. */ + u32 node_runtime_index; + + /* Frame index (in the heap). */ + u32 frame_index; + + /* Start of next frames for this node. */ + u32 next_frame_index; + + /* Special value for next_frame_index when there is no next frame. */ + #define VLIB_PENDING_FRAME_NO_NEXT_FRAME ((u32) ~0) + } vlib_pending_frame_t; + +Here is the code in …/src/vlib/main.c:vlib_main_or_worker_loop() which +processes frames: + +.. code:: c + + /* + * Input nodes may have added work to the pending vector. + * Process pending vector until there is nothing left. + * All pending vectors will be processed from input -> output. + */ + for (i = 0; i < _vec_len (nm->pending_frames); i++) + cpu_time_now = dispatch_pending_node (vm, i, cpu_time_now); + /* Reset pending vector for next iteration. */ + +The pending frame node_runtime_index associates the frame with the node +which will process it. + +Complications +------------- + +Fasten your seatbelt. Here’s where the story - and the data structures - +become quite complicated… + +At 100,000 feet: vpp uses a directed graph, not a directed *acyclic* +graph. It’s really quite normal for a packet to visit ip[46]-lookup +multiple times. The worst-case: a graph node which enqueues packets to +itself. + +To deal with this issue, the graph dispatcher must force allocation of a +new frame if the current graph node’s dispatch function happens to +enqueue a packet back to itself. 
+ +There are no guarantees that a pending frame will be processed +immediately, which means that more packets may be added to the +underlying vlib_frame_t after it has been attached to a +vlib_pending_frame_t. Care must be taken to allocate new frames and +pending frames if a (pending_frame, frame) pair fills. + +Next frames, next frame ownership +--------------------------------- + +The vlib_next_frame_t is the last key graph dispatcher data structure: + +.. code:: c + + typedef struct + { + /* Frame index. */ + u32 frame_index; + + /* Node runtime for this next. */ + u32 node_runtime_index; + + /* Next frame flags. */ + u32 flags; + + /* Reflects node frame-used flag for this next. */ + #define VLIB_FRAME_NO_FREE_AFTER_DISPATCH \ + VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH + + /* This next frame owns enqueue to node + corresponding to node_runtime_index. */ + #define VLIB_FRAME_OWNER (1 << 15) + + /* Set when frame has been allocated for this next. */ + #define VLIB_FRAME_IS_ALLOCATED VLIB_NODE_FLAG_IS_OUTPUT + + /* Set when frame has been added to pending vector. */ + #define VLIB_FRAME_PENDING VLIB_NODE_FLAG_IS_DROP + + /* Set when frame is to be freed after dispatch. */ + #define VLIB_FRAME_FREE_AFTER_DISPATCH VLIB_NODE_FLAG_IS_PUNT + + /* Set when frame has traced packets. */ + #define VLIB_FRAME_TRACE VLIB_NODE_FLAG_TRACE + + /* Number of vectors enqueue to this next since last overflow. */ + u32 vectors_since_last_overflow; + } vlib_next_frame_t; + +Graph node dispatch functions call vlib_get_next_frame (…) to set “(u32 +\*)to_next” to the right place in the vlib_frame_t corresponding to the +ith arc (aka next0) from the current node to the indicated next node. + +After some scuffling around - two levels of macros - processing reaches +vlib_get_next_frame_internal (…). Get-next-frame-internal digs up the +vlib_next_frame_t corresponding to the desired graph arc. + +The next frame data structure amounts to a graph-arc-centric frame +cache. Once a node finishes adding element to a frame, it will acquire a +vlib_pending_frame_t and end up on the graph dispatcher’s run-queue. But +there’s no guarantee that more vector elements won’t be added to the +underlying frame from the same (source_node, next_index) arc or from a +different (source_node, next_index) arc. + +Maintaining consistency of the arc-to-frame cache is necessary. The +first step in maintaining consistency is to make sure that only one +graph node at a time thinks it “owns” the target vlib_frame_t. + +Back to the graph node dispatch function. In the usual case, a certain +number of packets will be added to the vlib_frame_t acquired by calling +vlib_get_next_frame (…). + +Before a dispatch function returns, it’s required to call +vlib_put_next_frame (…) for all of the graph arcs it actually used. This +action adds a vlib_pending_frame_t to the graph dispatcher’s pending +frame vector. + +Vlib_put_next_frame makes a note in the pending frame of the frame +index, and also of the vlib_next_frame_t index. + +dispatch_pending_node actions +----------------------------- + +The main graph dispatch loop calls dispatch pending node as shown above. + +Dispatch_pending_node recovers the pending frame, and the graph node +runtime / dispatch function. Further, it recovers the next_frame +currently associated with the vlib_frame_t, and detaches the +vlib_frame_t from the next_frame. + +In …/src/vlib/main.c:dispatch_pending_node(…), note this stanza: + +.. 
code:: c + + /* Force allocation of new frame while current frame is being + dispatched. */ + restore_frame_index = ~0; + if (nf->frame_index == p->frame_index) + { + nf->frame_index = ~0; + nf->flags &= ~VLIB_FRAME_IS_ALLOCATED; + if (!(n->flags & VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH)) + restore_frame_index = p->frame_index; + } + +dispatch_pending_node is worth a hard stare due to the several +second-order optimizations it implements. Almost as an afterthought, it +calls dispatch_node which actually calls the graph node dispatch +function. + +Process / thread model +---------------------- + +vlib provides an ultra-lightweight cooperative multi-tasking thread +model. The graph node scheduler invokes these processes in much the same +way as traditional vector-processing run-to-completion graph nodes; +plus-or-minus a setjmp/longjmp pair required to switch stacks. Simply +set the vlib_node_registration_t type field to vlib_NODE_TYPE_PROCESS. +Yes, process is a misnomer. These are cooperative multi-tasking threads. + +As of this writing, the default stack size is 2<<15 = 32kb. Initialize +the node registration’s process_log2_n_stack_bytes member as needed. The +graph node dispatcher makes some effort to detect stack overrun, e.g. by +mapping a no-access page below each thread stack. + +Process node dispatch functions are expected to be “while(1) { }” loops +which suspend when not otherwise occupied, and which must not run for +unreasonably long periods of time. + +“Unreasonably long” is an application-dependent concept. Over the years, +we have constructed frame-size sensitive control-plane nodes which will +use a much higher fraction of the available CPU bandwidth when the frame +size is low. The classic example: modifying forwarding tables. So long +as the table-builder leaves the forwarding tables in a valid state, one +can suspend the table builder to avoid dropping packets as a result of +control-plane activity. + +Process nodes can suspend for fixed amounts of time, or until another +entity signals an event, or both. See the next section for a description +of the vlib process event mechanism. + +When running in vlib process context, one must pay strict attention to +loop invariant issues. If one walks a data structure and calls a +function which may suspend, one had best know by construction that it +cannot change. Often, it’s best to simply make a snapshot copy of a data +structure, walk the copy at leisure, then free the copy. + +Process events +-------------- + +The vlib process event mechanism API is extremely lightweight and easy +to use. Here is a typical example: + +.. code:: c + + vlib_main_t *vm = &vlib_global_main; + uword event_type, * event_data = 0; + + while (1) + { + vlib_process_wait_for_event_or_clock (vm, 5.0 /* seconds */); + + event_type = vlib_process_get_events (vm, &event_data); + + switch (event_type) { + case EVENT1: + handle_event1s (event_data); + break; + + case EVENT2: + handle_event2s (event_data); + break; + + case ~0: /* 5-second idle/periodic */ + handle_idle (); + break; + + default: /* bug! */ + ASSERT (0); + } + + vec_reset_length(event_data); + } + +In this example, the VLIB process node waits for an event to occur, or +for 5 seconds to elapse. The code demuxes on the event type, calling the +appropriate handler function. Each call to vlib_process_get_events +returns a vector of per-event-type data passed to successive +vlib_process_signal_event calls; it is a serious error to process only +event_data[0]. 
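A handler therefore typically walks the entire vector, along these lines
(handle_event1s matches the example above; process_one_event1 is
hypothetical):

.. code:: c

   static void
   handle_event1s (uword * event_data)
   {
     int i;

     /* One element per vlib_process_signal_event call since the last wakeup */
     for (i = 0; i < vec_len (event_data); i++)
       process_one_event1 (event_data[i]);
   }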
+ +Resetting the event_data vector-length to 0 [instead of calling +vec_free] means that the event scheme doesn’t burn cycles continuously +allocating and freeing the event data vector. This is a common vppinfra +/ vlib coding pattern, well worth using when appropriate. + +Signaling an event is easy, for example: + +.. code:: c + + vlib_process_signal_event (vm, process_node_index, EVENT1, + (uword)arbitrary_event1_data); /* and so forth */ + +One can either know the process node index by construction - dig it out +of the appropriate vlib_node_registration_t - or by finding the +vlib_node_t with vlib_get_node_by_name(…). + +Buffers +------- + +vlib buffering solves the usual set of packet-processing problems, +albeit at high performance. Key in terms of performance: one ordinarily +allocates / frees N buffers at a time rather than one at a time. Except +when operating directly on a specific buffer, one deals with buffers by +index, not by pointer. + +Packet-processing frames are u32[] arrays, not vlib_buffer_t[] arrays. + +Packets comprise one or more vlib buffers, chained together as required. +Multiple particle sizes are supported; hardware input nodes simply ask +for the required size(s). Coalescing support is available. For obvious +reasons one is discouraged from writing one’s own wild and wacky buffer +chain traversal code. + +vlib buffer headers are allocated immediately prior to the buffer data +area. In typical packet processing this saves a dependent read wait: +given a buffer’s address, one can prefetch the buffer header [metadata] +at the same time as the first cache line of buffer data. + +Buffer header metadata (vlib_buffer_t) includes the usual rewrite +expansion space, a current_data offset, RX and TX interface indices, +packet trace information, and a opaque areas. + +The opaque data is intended to control packet processing in arbitrary +subgraph-dependent ways. The programmer shoulders responsibility for +data lifetime analysis, type-checking, etc. + +Buffers have reference-counts in support of e.g. multicast replication. + +Shared-memory message API +------------------------- + +Local control-plane and application processes interact with the vpp +dataplane via asynchronous message-passing in shared memory over +unidirectional queues. The same application APIs are available via +sockets. + +Capturing API traces and replaying them in a simulation environment +requires a disciplined approach to the problem. This seems like a +make-work task, but it is not. When something goes wrong in the +control-plane after 300,000 or 3,000,000 operations, high-speed replay +of the events leading up to the accident is a huge win. + +The shared-memory message API message allocator vl_api_msg_alloc uses a +particularly cute trick. Since messages are processed in order, we try +to allocate message buffering from a set of fixed-size, preallocated +rings. Each ring item has a “busy” bit. Freeing one of the preallocated +message buffers merely requires the message consumer to clear the busy +bit. No locking required. + +Debug CLI +--------- + +Adding debug CLI commands to VLIB applications is very simple. + +Here is a complete example: + +.. 
code:: c + + static clib_error_t * + show_ip_tuple_match (vlib_main_t * vm, + unformat_input_t * input, + vlib_cli_command_t * cmd) + { + vlib_cli_output (vm, "%U\n", format_ip_tuple_match_tables, &routing_main); + return 0; + } + + static VLIB_CLI_COMMAND (show_ip_tuple_command) = + { + .path = "show ip tuple match", + .short_help = "Show ip 5-tuple match-and-broadcast tables", + .function = show_ip_tuple_match, + }; + +This example implements the “show ip tuple match” debug cli command. In +ordinary usage, the vlib cli is available via the “vppctl” application, +which sends traffic to a named pipe. One can configure debug CLI telnet +access on a configurable port. + +The cli implementation has an output redirection facility which makes it +simple to deliver cli output via shared-memory API messaging, + +Particularly for debug or “show tech support” type commands, it would be +wasteful to write vlib application code to pack binary data, write more +code elsewhere to unpack the data and finally print the answer. If a +certain cli command has the potential to hurt packet processing +performance by running for too long, do the work incrementally in a +process node. The client can wait. + +Macro expansion +~~~~~~~~~~~~~~~ + +The vpp debug CLI engine includes a recursive macro expander. This is +quite useful for factoring out address and/or interface name specifics: + +:: + + define ip1 192.168.1.1/24 + define ip2 192.168.2.1/24 + define iface1 GigabitEthernet3/0/0 + define iface2 loop1 + + set int ip address $iface1 $ip1 + set int ip address $iface2 $(ip2) + + undefine ip1 + undefine ip2 + undefine iface1 + undefine iface2 + +Each socket (or telnet) debug CLI session has its own macro tables. All +debug CLI sessions which use CLI_INBAND binary API messages share a +single table. + +The macro expander recognizes circular definitions: + +:: + + define foo \$(bar) + define bar \$(mumble) + define mumble \$(foo) + +At 8 levels of recursion, the macro expander throws up its hands and +replies “CIRCULAR.” + +Macro-related debug CLI commands +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In addition to the “define” and “undefine” debug CLI commands, use “show +macro [noevaluate]” to dump the macro table. The “echo” debug CLI +command will evaluate and print its argument: + +:: + + vpp# define foo This\ Is\ Foo + vpp# echo $foo + This Is Foo + +Handing off buffers between threads +----------------------------------- + +Vlib includes an easy-to-use mechanism for handing off buffers between +worker threads. A typical use-case: software ingress flow hashing. At a +high level, one creates a per-worker-thread queue which sends packets to +a specific graph node in the indicated worker thread. With the queue in +hand, enqueue packets to the worker thread of your choice. + +Initialize a handoff queue +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Simple enough, call vlib_frame_queue_main_init: + +.. code:: c + + main_ptr->frame_queue_index + = vlib_frame_queue_main_init (dest_node.index, frame_queue_size); + +Frame_queue_size means what it says: the number of frames which may be +queued. Since frames contain 1…256 packets, frame_queue_size should be a +reasonably small number (32…64). If the frame queue producer(s) are +faster than the frame queue consumer(s), congestion will occur. Suggest +letting the enqueue operator deal with queue congestion, as shown in the +enqueue example below. + +Under the floorboards, vlib_frame_queue_main_init creates an input queue +for each worker thread. 
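As a small sketch, the call is often guarded so that the queue is created
only on first use, which also honors the advice in the next paragraph; hmp,
its frame_queue_index member (initialized to ~0), and handoff_dest_node are
hypothetical names:

.. code:: c

   /* Lazily create the handoff queue the first time it is needed */
   if (hmp->frame_queue_index == ~0)
     hmp->frame_queue_index =
       vlib_frame_queue_main_init (handoff_dest_node.index,
                                   32 /* frame_queue_size */);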
+ +Please do NOT create frame queues until it’s clear that they will be +used. Although the main dispatch loop is reasonably smart about how +often it polls the (entire set of) frame queues, polling unused frame +queues is a waste of clock cycles. + +Hand off packets +~~~~~~~~~~~~~~~~ + +The actual handoff mechanics are simple, and integrate nicely with a +typical graph-node dispatch function: + +.. code:: c + + always_inline uword + do_handoff_inline (vlib_main_t * vm, + vlib_node_runtime_t * node, vlib_frame_t * frame, + int is_ip4, int is_trace) + { + u32 n_left_from, *from; + vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b; + u16 thread_indices [VLIB_FRAME_SIZE]; + u16 nexts[VLIB_FRAME_SIZE], *next; + u32 n_enq; + htest_main_t *hmp = &htest_main; + int i; + + from = vlib_frame_vector_args (frame); + n_left_from = frame->n_vectors; + + vlib_get_buffers (vm, from, bufs, n_left_from); + next = nexts; + b = bufs; + + /* + * Typical frame traversal loop, details vary with + * use case. Make sure to set thread_indices[i] with + * the desired destination thread index. You may + * or may not bother to set next[i]. + */ + + for (i = 0; i < frame->n_vectors; i++) + { + <snip> + /* Pick a thread to handle this packet */ + thread_indices[i] = f (packet_data_or_whatever); + <snip> + + b += 1; + next += 1; + n_left_from -= 1; + } + + /* Enqueue buffers to threads */ + n_enq = + vlib_buffer_enqueue_to_thread (vm, node, hmp->frame_queue_index, + from, thread_indices, frame->n_vectors, + 1 /* drop on congestion */); + /* Typical counters, + if (n_enq < frame->n_vectors) + vlib_node_increment_counter (vm, node->node_index, + XXX_ERROR_CONGESTION_DROP, + frame->n_vectors - n_enq); + vlib_node_increment_counter (vm, node->node_index, + XXX_ERROR_HANDED_OFF, n_enq); + return frame->n_vectors; + } + +Notes about calling vlib_buffer_enqueue_to_thread(…): + +- If you pass “drop on congestion” non-zero, all packets in the inbound + frame will be consumed one way or the other. This is the recommended + setting. + +- In the drop-on-congestion case, please don’t try to “help” in the + enqueue node by freeing dropped packets, or by pushing them to + “error-drop.” Either of those actions would be a severe error. + +- It’s perfectly OK to enqueue packets to the current thread. + +Handoff Demo Plugin +------------------- + +Check out the sample (plugin) example in …/src/examples/handoffdemo. If +you want to build the handoff demo plugin: + +:: + + $ cd .../src/plugins + $ ln -s ../examples/handoffdemo + +This plugin provides a simple example of how to hand off packets between +threads. We used it to debug packet-tracer handoff tracing support. + +Packet generator input script +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + packet-generator new { + name x + limit 5 + size 128-128 + interface local0 + node handoffdemo-1 + data { + incrementing 30 + } + } + +Start vpp with 2 worker threads +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The demo plugin hands packets from worker 1 to worker 2. 
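One way to do that is with a cpu stanza in startup.conf, as described in
the multi-threading documentation; the core numbers below are only an
example:

::

   cpu {
     main-core 0
     corelist-workers 1-2
   }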
+ +Enable tracing, and start the packet generator +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + trace add pg-input 100 + packet-generator enable + +Sample Run +~~~~~~~~~~ + +:: + + DBGvpp# ex /tmp/pg_input_script + DBGvpp# pa en + DBGvpp# sh err + Count Node Reason + 5 handoffdemo-1 packets handed off processed + 5 handoffdemo-2 completed packets + DBGvpp# show run + Thread 1 vpp_wk_0 (lcore 0) + Time 133.9, average vectors/node 5.00, last 128 main loops 0.00 per node 0.00 + vector rates in 3.7331e-2, out 0.0000e0, drop 0.0000e0, punt 0.0000e0 + Name State Calls Vectors Suspends Clocks Vectors/Call + handoffdemo-1 active 1 5 0 4.76e3 5.00 + pg-input disabled 2 5 0 5.58e4 2.50 + unix-epoll-input polling 22760 0 0 2.14e7 0.00 + --------------- + Thread 2 vpp_wk_1 (lcore 2) + Time 133.9, average vectors/node 5.00, last 128 main loops 0.00 per node 0.00 + vector rates in 0.0000e0, out 0.0000e0, drop 3.7331e-2, punt 0.0000e0 + Name State Calls Vectors Suspends Clocks Vectors/Call + drop active 1 5 0 1.35e4 5.00 + error-drop active 1 5 0 2.52e4 5.00 + handoffdemo-2 active 1 5 0 2.56e4 5.00 + unix-epoll-input polling 22406 0 0 2.18e7 0.00 + +Enable the packet tracer and run it again… + +:: + + DBGvpp# trace add pg-input 100 + DBGvpp# pa en + DBGvpp# sh trace + sh trace + ------------------- Start of thread 0 vpp_main ------------------- + No packets in trace buffer + ------------------- Start of thread 1 vpp_wk_0 ------------------- + Packet 1 + + 00:06:50:520688: pg-input + stream x, 128 bytes, 0 sw_if_index + current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000000 + 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000 + 00000020: 0000000000000000000000000000000000000000000000000000000000000000 + 00000040: 0000000000000000000000000000000000000000000000000000000000000000 + 00000060: 0000000000000000000000000000000000000000000000000000000000000000 + 00:06:50:520762: handoffdemo-1 + HANDOFFDEMO: current thread 1 + + Packet 2 + + 00:06:50:520688: pg-input + stream x, 128 bytes, 0 sw_if_index + current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000001 + 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000 + 00000020: 0000000000000000000000000000000000000000000000000000000000000000 + 00000040: 0000000000000000000000000000000000000000000000000000000000000000 + 00000060: 0000000000000000000000000000000000000000000000000000000000000000 + 00:06:50:520762: handoffdemo-1 + HANDOFFDEMO: current thread 1 + + Packet 3 + + 00:06:50:520688: pg-input + stream x, 128 bytes, 0 sw_if_index + current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000002 + 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000 + 00000020: 0000000000000000000000000000000000000000000000000000000000000000 + 00000040: 0000000000000000000000000000000000000000000000000000000000000000 + 00000060: 0000000000000000000000000000000000000000000000000000000000000000 + 00:06:50:520762: handoffdemo-1 + HANDOFFDEMO: current thread 1 + + Packet 4 + + 00:06:50:520688: pg-input + stream x, 128 bytes, 0 sw_if_index + current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000003 + 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000 + 00000020: 0000000000000000000000000000000000000000000000000000000000000000 + 00000040: 0000000000000000000000000000000000000000000000000000000000000000 + 00000060: 0000000000000000000000000000000000000000000000000000000000000000 + 00:06:50:520762: 
handoffdemo-1 + HANDOFFDEMO: current thread 1 + + Packet 5 + + 00:06:50:520688: pg-input + stream x, 128 bytes, 0 sw_if_index + current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000004 + 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000 + 00000020: 0000000000000000000000000000000000000000000000000000000000000000 + 00000040: 0000000000000000000000000000000000000000000000000000000000000000 + 00000060: 0000000000000000000000000000000000000000000000000000000000000000 + 00:06:50:520762: handoffdemo-1 + HANDOFFDEMO: current thread 1 + + ------------------- Start of thread 2 vpp_wk_1 ------------------- + Packet 1 + + 00:06:50:520796: handoff_trace + HANDED-OFF: from thread 1 trace index 0 + 00:06:50:520796: handoffdemo-2 + HANDOFFDEMO: current thread 2 + 00:06:50:520867: error-drop + rx:local0 + 00:06:50:520914: drop + handoffdemo-2: completed packets + + Packet 2 + + 00:06:50:520796: handoff_trace + HANDED-OFF: from thread 1 trace index 1 + 00:06:50:520796: handoffdemo-2 + HANDOFFDEMO: current thread 2 + 00:06:50:520867: error-drop + rx:local0 + 00:06:50:520914: drop + handoffdemo-2: completed packets + + Packet 3 + + 00:06:50:520796: handoff_trace + HANDED-OFF: from thread 1 trace index 2 + 00:06:50:520796: handoffdemo-2 + HANDOFFDEMO: current thread 2 + 00:06:50:520867: error-drop + rx:local0 + 00:06:50:520914: drop + handoffdemo-2: completed packets + + Packet 4 + + 00:06:50:520796: handoff_trace + HANDED-OFF: from thread 1 trace index 3 + 00:06:50:520796: handoffdemo-2 + HANDOFFDEMO: current thread 2 + 00:06:50:520867: error-drop + rx:local0 + 00:06:50:520914: drop + handoffdemo-2: completed packets + + Packet 5 + + 00:06:50:520796: handoff_trace + HANDED-OFF: from thread 1 trace index 4 + 00:06:50:520796: handoffdemo-2 + HANDOFFDEMO: current thread 2 + 00:06:50:520867: error-drop + rx:local0 + 00:06:50:520914: drop + handoffdemo-2: completed packets + DBGvpp# diff --git a/docs/developer/corearchitecture/vnet.rst b/docs/developer/corearchitecture/vnet.rst new file mode 100644 index 00000000000..812e2fb4f8a --- /dev/null +++ b/docs/developer/corearchitecture/vnet.rst @@ -0,0 +1,807 @@ +VNET (VPP Network Stack) +======================== + +The files associated with the VPP network stack layer are located in the +*./src/vnet* folder. The Network Stack Layer is basically an +instantiation of the code in the other layers. This layer has a vnet +library that provides vectorized layer-2 and 3 networking graph nodes, a +packet generator, and a packet tracer. + +In terms of building a packet processing application, vnet provides a +platform-independent subgraph to which one connects a couple of +device-driver nodes. + +Typical RX connections include “ethernet-input” [full software +classification, feeds ipv4-input, ipv6-input, arp-input etc.] and +“ipv4-input-no-checksum” [if hardware can classify, perform ipv4 header +checksum]. + +Effective graph dispatch function coding +---------------------------------------- + +Over the 15 years, multiple coding styles have emerged: a +single/dual/quad loop coding model (with variations) and a +fully-pipelined coding model. + +Single/dual loops +----------------- + +The single/dual/quad loop model variations conveniently solve problems +where the number of items to process is not known in advance: typical +hardware RX-ring processing. This coding style is also very effective +when a given node will not need to cover a complex set of dependent +reads. 
+ +Here is an quad/single loop which can leverage up-to-avx512 SIMD vector +units to convert buffer indices to buffer pointers: + +.. code:: c + + static uword + simulated_ethernet_interface_tx (vlib_main_t * vm, + vlib_node_runtime_t * + node, vlib_frame_t * frame) + { + u32 n_left_from, *from; + u32 next_index = 0; + u32 n_bytes; + u32 thread_index = vm->thread_index; + vnet_main_t *vnm = vnet_get_main (); + vnet_interface_main_t *im = &vnm->interface_main; + vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b; + u16 nexts[VLIB_FRAME_SIZE], *next; + + n_left_from = frame->n_vectors; + from = vlib_frame_vector_args (frame); + + /* + * Convert up to VLIB_FRAME_SIZE indices in "from" to + * buffer pointers in bufs[] + */ + vlib_get_buffers (vm, from, bufs, n_left_from); + b = bufs; + next = nexts; + + /* + * While we have at least 4 vector elements (pkts) to process.. + */ + while (n_left_from >= 4) + { + /* Prefetch next quad-loop iteration. */ + if (PREDICT_TRUE (n_left_from >= 8)) + { + vlib_prefetch_buffer_header (b[4], STORE); + vlib_prefetch_buffer_header (b[5], STORE); + vlib_prefetch_buffer_header (b[6], STORE); + vlib_prefetch_buffer_header (b[7], STORE); + } + + /* + * $$$ Process 4x packets right here... + * set next[0..3] to send the packets where they need to go + */ + + do_something_to (b[0]); + do_something_to (b[1]); + do_something_to (b[2]); + do_something_to (b[3]); + + /* Process the next 0..4 packets */ + b += 4; + next += 4; + n_left_from -= 4; + } + /* + * Clean up 0...3 remaining packets at the end of the incoming frame + */ + while (n_left_from > 0) + { + /* + * $$$ Process one packet right here... + * set next[0..3] to send the packets where they need to go + */ + do_something_to (b[0]); + + /* Process the next packet */ + b += 1; + next += 1; + n_left_from -= 1; + } + + /* + * Send the packets along their respective next-node graph arcs + * Considerable locality of reference is expected, most if not all + * packets in the inbound vector will traverse the same next-node + * arc + */ + vlib_buffer_enqueue_to_next (vm, node, from, nexts, frame->n_vectors); + + return frame->n_vectors; + } + +Given a packet processing task to implement, it pays to scout around +looking for similar tasks, and think about using the same coding +pattern. It is not uncommon to recode a given graph node dispatch +function several times during performance optimization. + +Creating Packets from Scratch +----------------------------- + +At times, it’s necessary to create packets from scratch and send them. +Tasks like sending keepalives or actively opening connections come to +mind. Its not difficult, but accurate buffer metadata setup is required. + +Allocating Buffers +~~~~~~~~~~~~~~~~~~ + +Use vlib_buffer_alloc, which allocates a set of buffer indices. For +low-performance applications, it’s OK to allocate one buffer at a time. +Note that vlib_buffer_alloc(…) does NOT initialize buffer metadata. See +below. + +In high-performance cases, allocate a vector of buffer indices, and hand +them out from the end of the vector; decrement \_vec_len(..) as buffer +indices are allocated. See tcp_alloc_tx_buffers(…) and +tcp_get_free_buffer_index(…) for an example. + +Buffer Initialization Example +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following example shows the **main points**, but is not to be +blindly cut-’n-pasted. + +.. 
code:: c + + u32 bi0; + vlib_buffer_t *b0; + ip4_header_t *ip; + udp_header_t *udp; + + /* Allocate a buffer */ + if (vlib_buffer_alloc (vm, &bi0, 1) != 1) + return -1; + + b0 = vlib_get_buffer (vm, bi0); + + /* At this point b0->current_data = 0, b0->current_length = 0 */ + + /* + * Copy data into the buffer. This example ASSUMES that data will fit + * in a single buffer, and is e.g. an ip4 packet. + */ + if (have_packet_rewrite) + { + clib_memcpy (b0->data, data, vec_len (data)); + b0->current_length = vec_len (data); + } + else + { + /* OR, build a udp-ip packet (for example) */ + ip = vlib_buffer_get_current (b0); + udp = (udp_header_t *) (ip + 1); + data_dst = (u8 *) (udp + 1); + + ip->ip_version_and_header_length = 0x45; + ip->ttl = 254; + ip->protocol = IP_PROTOCOL_UDP; + ip->length = clib_host_to_net_u16 (sizeof (*ip) + sizeof (*udp) + + vec_len(udp_data)); + ip->src_address.as_u32 = src_address->as_u32; + ip->dst_address.as_u32 = dst_address->as_u32; + udp->src_port = clib_host_to_net_u16 (src_port); + udp->dst_port = clib_host_to_net_u16 (dst_port); + udp->length = clib_host_to_net_u16 (vec_len (udp_data)); + clib_memcpy (data_dst, udp_data, vec_len(udp_data)); + + if (compute_udp_checksum) + { + /* RFC 7011 section 10.3.2. */ + udp->checksum = ip4_tcp_udp_compute_checksum (vm, b0, ip); + if (udp->checksum == 0) + udp->checksum = 0xffff; + } + b0->current_length = vec_len (sizeof (*ip) + sizeof (*udp) + + vec_len (udp_data)); + + } + b0->flags |= VLIB_BUFFER_TOTAL_LENGTH_VALID; + + /* sw_if_index 0 is the "local" interface, which always exists */ + vnet_buffer (b0)->sw_if_index[VLIB_RX] = 0; + + /* Use the default FIB index for tx lookup. Set non-zero to use another fib */ + vnet_buffer (b0)->sw_if_index[VLIB_TX] = 0; + +If your use-case calls for large packet transmission, use +vlib_buffer_chain_append_data_with_alloc(…) to create the requisite +buffer chain. + +Enqueueing packets for lookup and transmission +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The simplest way to send a set of packets is to use +vlib_get_frame_to_node(…) to allocate fresh frame(s) to ip4_lookup_node +or ip6_lookup_node, add the constructed buffer indices, and dispatch the +frame using vlib_put_frame_to_node(…). + +.. code:: c + + vlib_frame_t *f; + f = vlib_get_frame_to_node (vm, ip4_lookup_node.index); + f->n_vectors = vec_len(buffer_indices_to_send); + to_next = vlib_frame_vector_args (f); + + for (i = 0; i < vec_len (buffer_indices_to_send); i++) + to_next[i] = buffer_indices_to_send[i]; + + vlib_put_frame_to_node (vm, ip4_lookup_node_index, f); + +It is inefficient to allocate and schedule single packet frames. That’s +typical in case you need to send one packet per second, but should +**not** occur in a for-loop! + +Packet tracer +------------- + +Vlib includes a frame element [packet] trace facility, with a simple +debug CLI interface. The cli is straightforward: “trace add +input-node-name count” to start capturing packet traces. + +To trace 100 packets on a typical x86_64 system running the dpdk plugin: +“trace add dpdk-input 100”. When using the packet generator: “trace add +pg-input 100” + +To display the packet trace: “show trace” + +Each graph node has the opportunity to capture its own trace data. It is +almost always a good idea to do so. The trace capture APIs are simple. + +The packet capture APIs snapshoot binary data, to minimize processing at +capture time. 
Each participating graph node initialization provides a
vppinfra format-style user function to pretty-print data when required
by the VLIB “show trace” command.

Set the VLIB node registration “.format_trace” member to the name of the
per-graph node format function.

Here’s a simple example:

.. code:: c

   u8 * my_node_format_trace (u8 * s, va_list * args)
   {
     vlib_main_t * vm = va_arg (*args, vlib_main_t *);
     vlib_node_t * node = va_arg (*args, vlib_node_t *);
     my_node_trace_t * t = va_arg (*args, my_node_trace_t *);

     s = format (s, "My trace data was: %d", t-><whatever>);

     return s;
   }

The trace framework hands the per-node format function the data it
captured as the packet whizzed by. The format function pretty-prints the
data as desired.

Graph Dispatcher Pcap Tracing
-----------------------------

The vpp graph dispatcher knows how to capture vectors of packets in pcap
format as they’re dispatched. The pcap captures are as follows:

::

   VPP graph dispatch trace record description:

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | Major Version | Minor Version | NStrings      | ProtoHint     |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | Buffer index (big endian)                                     |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | VPP graph node name ...     ...               | NULL octet    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | Buffer Metadata ... ...                       | NULL octet    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | Buffer Opaque ... ...                         | NULL octet    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | Buffer Opaque 2 ... ...                       | NULL octet    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | VPP ASCII packet trace (if NStrings > 4)      | NULL octet    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      | Packet data (up to 16K)                                       |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Graph dispatch records comprise a version stamp, an indication of how
many NULL-terminated strings will follow the record header and precede
packet data, and a protocol hint.

The buffer index is an opaque 32-bit cookie which allows consumers of
these data to easily filter/track single packets as they traverse the
forwarding graph.

Multiple records per packet are normal, and to be expected. Packets will
appear multiple times as they traverse the vpp forwarding graph. In this
way, vpp graph dispatch traces are significantly different from regular
network packet captures from an end-station. This property complicates
stateful packet analysis.

Restricting stateful analysis to records from a single vpp graph node
such as “ethernet-input” seems likely to improve the situation.

As of this writing: major version = 1, minor version = 0. NStrings
SHOULD be 4 or 5. Consumers SHOULD be wary of values less than 4 or
greater than 5. They MAY attempt to display the claimed number of
strings, or they MAY treat the condition as an error.

Here is the current set of protocol hints:

..
code:: c + + typedef enum + { + VLIB_NODE_PROTO_HINT_NONE = 0, + VLIB_NODE_PROTO_HINT_ETHERNET, + VLIB_NODE_PROTO_HINT_IP4, + VLIB_NODE_PROTO_HINT_IP6, + VLIB_NODE_PROTO_HINT_TCP, + VLIB_NODE_PROTO_HINT_UDP, + VLIB_NODE_N_PROTO_HINTS, + } vlib_node_proto_hint_t; + +Example: VLIB_NODE_PROTO_HINT_IP6 means that the first octet of packet +data SHOULD be 0x60, and should begin an ipv6 packet header. + +Downstream consumers of these data SHOULD pay attention to the protocol +hint. They MUST tolerate inaccurate hints, which MAY occur from time to +time. + +Dispatch Pcap Trace Debug CLI +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To start a dispatch trace capture of up to 10,000 trace records: + +:: + + pcap dispatch trace on max 10000 file dispatch.pcap + +To start a dispatch trace which will also include standard vpp packet +tracing for packets which originate in dpdk-input: + +:: + + pcap dispatch trace on max 10000 file dispatch.pcap buffer-trace dpdk-input 1000 + +To save the pcap trace, e.g. in /tmp/dispatch.pcap: + +:: + + pcap dispatch trace off + +Wireshark dissection of dispatch pcap traces +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It almost goes without saying that we built a companion wireshark +dissector to display these traces. As of this writing, we have +upstreamed the wireshark dissector. + +Since it will be a while before wireshark/master/latest makes it into +all of the popular Linux distros, please see the “How to build a vpp +dispatch trace aware Wireshark” page for build info. + +Here is a sample packet dissection, with some fields omitted for +clarity. The point is that the wireshark dissector accurately displays +**all** of the vpp buffer metadata, and the name of the graph node in +question. + +:: + + Frame 1: 2216 bytes on wire (17728 bits), 2216 bytes captured (17728 bits) + Encapsulation type: USER 13 (58) + [Protocols in frame: vpp:vpp-metadata:vpp-opaque:vpp-opaque2:eth:ethertype:ip:tcp:data] + VPP Dispatch Trace + BufferIndex: 0x00036663 + NodeName: ethernet-input + VPP Buffer Metadata + Metadata: flags: + Metadata: current_data: 0, current_length: 102 + Metadata: current_config_index: 0, flow_id: 0, next_buffer: 0 + Metadata: error: 0, n_add_refs: 0, buffer_pool_index: 0 + Metadata: trace_index: 0, recycle_count: 0, len_not_first_buf: 0 + Metadata: free_list_index: 0 + Metadata: + VPP Buffer Opaque + Opaque: raw: 00000007 ffffffff 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 + Opaque: sw_if_index[VLIB_RX]: 7, sw_if_index[VLIB_TX]: -1 + Opaque: L2 offset 0, L3 offset 0, L4 offset 0, feature arc index 0 + Opaque: ip.adj_index[VLIB_RX]: 0, ip.adj_index[VLIB_TX]: 0 + Opaque: ip.flow_hash: 0x0, ip.save_protocol: 0x0, ip.fib_index: 0 + Opaque: ip.save_rewrite_length: 0, ip.rpf_id: 0 + Opaque: ip.icmp.type: 0 ip.icmp.code: 0, ip.icmp.data: 0x0 + Opaque: ip.reass.next_index: 0, ip.reass.estimated_mtu: 0 + Opaque: ip.reass.fragment_first: 0 ip.reass.fragment_last: 0 + Opaque: ip.reass.range_first: 0 ip.reass.range_last: 0 + Opaque: ip.reass.next_range_bi: 0x0, ip.reass.ip6_frag_hdr_offset: 0 + Opaque: mpls.ttl: 0, mpls.exp: 0, mpls.first: 0, mpls.save_rewrite_length: 0, mpls.bier.n_bytes: 0 + Opaque: l2.feature_bitmap: 00000000, l2.bd_index: 0, l2.l2_len: 0, l2.shg: 0, l2.l2fib_sn: 0, l2.bd_age: 0 + Opaque: l2.feature_bitmap_input: none configured, L2.feature_bitmap_output: none configured + Opaque: l2t.next_index: 0, l2t.session_index: 0 + Opaque: l2_classify.table_index: 0, l2_classify.opaque_index: 0, l2_classify.hash: 0x0 + Opaque: policer.index: 
   0
   Opaque: ipsec.flags: 0x0, ipsec.sad_index: 0
   Opaque: map.mtu: 0
   Opaque: map_t.v6.saddr: 0x0, map_t.v6.daddr: 0x0, map_t.v6.frag_offset: 0, map_t.v6.l4_offset: 0
   Opaque: map_t.v6.l4_protocol: 0, map_t.checksum_offset: 0, map_t.mtu: 0
   Opaque: ip_frag.mtu: 0, ip_frag.next_index: 0, ip_frag.flags: 0x0
   Opaque: cop.current_config_index: 0
   Opaque: lisp.overlay_afi: 0
   Opaque: tcp.connection_index: 0, tcp.seq_number: 0, tcp.seq_end: 0, tcp.ack_number: 0, tcp.hdr_offset: 0, tcp.data_offset: 0
   Opaque: tcp.data_len: 0, tcp.flags: 0x0
   Opaque: sctp.connection_index: 0, sctp.sid: 0, sctp.ssn: 0, sctp.tsn: 0, sctp.hdr_offset: 0
   Opaque: sctp.data_offset: 0, sctp.data_len: 0, sctp.subconn_idx: 0, sctp.flags: 0x0
   Opaque: snat.flags: 0x0
   Opaque:
   VPP Buffer Opaque2
   Opaque2: raw: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
   Opaque2: qos.bits: 0, qos.source: 0
   Opaque2: loop_counter: 0
   Opaque2: gbp.flags: 0, gbp.src_epg: 0
   Opaque2: pg_replay_timestamp: 0
   Opaque2:
   Ethernet II, Src: 06:d6:01:41:3b:92 (06:d6:01:41:3b:92), Dst: IntelCor_3d:f6
   Transmission Control Protocol, Src Port: 22432, Dst Port: 54084, Seq: 1, Ack: 1, Len: 36
       Source Port: 22432
       Destination Port: 54084
       TCP payload (36 bytes)
   Data (36 bytes)

       0000  cf aa 8b f5 53 14 d4 c7 29 75 3e 56 63 93 9d 11   ....S...)u>Vc...
       0010  e5 f2 92 27 86 56 4c 21 ce c5 23 46 d7 eb ec 0d   ...'.VL!..#F....
       0020  a8 98 36 5a                                       ..6Z
       Data: cfaa8bf55314d4c729753e5663939d11e5f2922786564c21…
       [Length: 36]

It’s a matter of a couple of mouse-clicks in Wireshark to filter the
trace to a specific buffer index. With that specific kind of filtration,
one can watch a packet walk through the forwarding graph; noting any/all
metadata changes, header checksum changes, and so forth.

This should be of significant value when developing new vpp graph nodes.
If new code mispositions b->current_data, it will be completely obvious
from looking at the dispatch trace in wireshark.

pcap rx, tx, and drop tracing
-----------------------------

vpp also supports rx, tx, and drop packet capture in pcap format,
through the “pcap trace” debug CLI command.

This command is used to start or stop a packet capture, or show the
status of packet capture. Each of “pcap trace rx”, “pcap trace tx”, and
“pcap trace drop” is implemented. Supply one or more of “rx”, “tx”, and
“drop” to enable multiple simultaneous capture types.

These commands have the following optional parameters:

- rx - trace received packets.

- tx - trace transmitted packets.

- drop - trace dropped packets.

- max *nnnn*\ - file size, number of packet captures. Once *nnnn*
  packets have been received, the trace buffer is flushed to the
  indicated file. Defaults to 1000. Can only be updated if packet
  capture is off.

- max-bytes-per-pkt *nnnn*\ - maximum number of bytes to trace on a
  per-packet basis. Must be >32 and less than 9000. Default value:
  512.

- filter - Use the pcap rx / tx / drop trace filter, which must be
  configured. Use classify filter pcap… to configure the filter. The
  filter will only be executed if the per-interface or any-interface
  tests fail.

- intfc *interface* \| *any*\ - Used to specify a given interface, or
  use ‘any’ to run packet capture on all interfaces. ‘any’ is the
  default if not provided. Settings from a previous packet capture are
  preserved, so ‘any’ can be used to reset the interface setting.
- file *filename*\ - Used to specify the output filename. The file
  will be placed in the ‘/tmp’ directory. If *filename* already exists,
  the file will be overwritten. If no filename is provided, ‘/tmp/rx.pcap
  or tx.pcap’ will be used, depending on capture direction. Can only be
  updated when pcap capture is off.

- status - Displays the current status and configured attributes
  associated with a packet capture. If packet capture is in progress,
  ‘status’ will also return the number of packets currently in the
  buffer. Any additional attributes entered on the command line with a
  ‘status’ request will be ignored.

- filter - Capture packets which match the current packet trace filter
  set. See the next section. Configure the capture filter first.

packet trace capture filtering
------------------------------

The “classify filter pcap \| <interface> \| trace” debug CLI command
constructs an arbitrary set of packet classifier tables for use with
“pcap rx \| tx \| drop trace,” and with the vpp packet tracer on a
per-interface or system-wide basis.

Packets which match a rule in the classifier table chain will be traced.
The tables are automatically ordered so that matches in the most
specific table are tried first.

It’s reasonably likely that folks will configure a single table with one
or two matches. As a result, we configure 8 hash buckets and 128K of
match rule space by default. One can override the defaults by specifying
“buckets <nnn>” and “memory-size <nnn>” as desired.

To build up complex filter chains, repeatedly issue the classify filter
debug CLI command. Each command must specify the desired mask and match
values. If a classifier table with a suitable mask already exists, the
CLI command adds a match rule to the existing table. If not, the CLI
command adds a new table and the indicated mask rule.

Configure a simple pcap classify filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

   classify filter pcap mask l3 ip4 src match l3 ip4 src 192.168.1.11
   pcap trace rx max 100 filter

Configure a simple per-interface capture filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

   classify filter GigabitEthernet3/0/0 mask l3 ip4 src match l3 ip4 src 192.168.1.11
   pcap trace rx max 100 intfc GigabitEthernet3/0/0

Note that per-interface capture filters are *always* applied.

Clear per-interface capture filters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

   classify filter GigabitEthernet3/0/0 del

Configure another fairly simple pcap classify filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

   classify filter pcap mask l3 ip4 src dst match l3 ip4 src 192.168.1.10 dst 192.168.2.10
   pcap trace tx max 100 filter

Configure a vpp packet tracer filter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

   classify filter trace mask l3 ip4 src dst match l3 ip4 src 192.168.1.10 dst 192.168.2.10
   trace add dpdk-input 100 filter

Clear all current classifier filters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

   classify filter [pcap | <interface> | trace] del

To inspect the classifier tables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

   show classify table [verbose]

The verbose form displays all of the match rules, with hit-counters.
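To make the chaining behavior concrete, here is a sketch of a two-table
pcap filter chain, built by issuing the classify filter command twice
with different masks. The addresses are arbitrary examples; the src+dst
table is more specific, so it is matched ahead of the src-only table:

::

   classify filter pcap mask l3 ip4 src match l3 ip4 src 192.168.1.11
   classify filter pcap mask l3 ip4 src dst match l3 ip4 src 192.168.1.10 dst 192.168.2.10
   pcap trace rx max 100 filter
   show classify table verbose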
+ +Terse description of the “mask ” syntax: +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + l2 src dst proto tag1 tag2 ignore-tag1 ignore-tag2 cos1 cos2 dot1q dot1ad + l3 ip4 <ip4-mask> ip6 <ip6-mask> + <ip4-mask> version hdr_length src[/width] dst[/width] + tos length fragment_id ttl protocol checksum + <ip6-mask> version traffic-class flow-label src dst proto + payload_length hop_limit protocol + l4 tcp <tcp-mask> udp <udp_mask> src_port dst_port + <tcp-mask> src dst # ports + <udp-mask> src_port dst_port + +To construct **matches**, add the values to match after the indicated +keywords in the mask syntax. For example: “… mask l3 ip4 src” -> “… +match l3 ip4 src 192.168.1.11” + +VPP Packet Generator +-------------------- + +We use the VPP packet generator to inject packets into the forwarding +graph. The packet generator can replay pcap traces, and generate packets +out of whole cloth at respectably high performance. + +The VPP pg enables quite a variety of use-cases, ranging from functional +testing of new data-plane nodes to regression testing to performance +tuning. + +PG setup scripts +---------------- + +PG setup scripts describe traffic in detail, and leverage vpp debug CLI +mechanisms. It’s reasonably unusual to construct a pg setup script which +doesn’t include a certain amount of interface and FIB configuration. + +For example: + +:: + + loop create + set int ip address loop0 192.168.1.1/24 + set int state loop0 up + + packet-generator new { + name pg0 + limit 100 + rate 1e6 + size 300-300 + interface loop0 + node ethernet-input + data { IP4: 1.2.3 -> 4.5.6 + UDP: 192.168.1.10 - 192.168.1.254 -> 192.168.2.10 + UDP: 1234 -> 2345 + incrementing 286 + } + } + +A packet generator stream definition includes two major sections: - +Stream Parameter Setup - Packet Data + +Stream Parameter Setup +~~~~~~~~~~~~~~~~~~~~~~ + +Given the example above, let’s look at how to set up stream parameters: + +- **name pg0** - Name of the stream, in this case “pg0” + +- **limit 1000** - Number of packets to send when the stream is + enabled. “limit 0” means send packets continuously. + +- **maxframe <nnn>** - Maximum frame size. Handy for injecting multiple + frames no larger than <nnn>. Useful for checking dual / quad loop + codes + +- **rate 1e6** - Packet injection rate, in this case 1 MPPS. When not + specified, the packet generator injects packets as fast as possible + +- **size 300-300** - Packet size range, in this case send 300-byte + packets + +- **interface loop0** - Packets appear as if they were received on the + specified interface. This datum is used in multiple ways: to select + graph arc feature configuration, to select IP FIBs. Configure + features e.g. on loop0 to exercise those features. + +- **tx-interface <name>** - Packets will be transmitted on the + indicated interface. Typically required only when injecting packets + into post-IP-rewrite graph nodes. + +- **pcap <filename>** - Replay packets from the indicated pcap capture + file. “make test” makes extensive use of this feature: generate + packets using scapy, save them in a .pcap file, then inject them into + the vpp graph via a vpp pg “pcap <filename>” stream definition + +- **worker <nn>** - Generate packets for the stream using the indicated + vpp worker thread. The vpp pg generates and injects O(10 MPPS / + core). Use multiple stream definitions and worker threads to generate + and inject enough traffic to easily fill a 40 gbit pipe with small + packets. 
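To tie several of these stream parameters together, here is a sketch of
a pcap-replay stream pinned to a worker thread. The stream name, pcap
file path, rate, and interface are arbitrary placeholders; as noted in
the data definition section below, pcap-replay streams omit the data
stanza:

::

   packet-generator new {
     name pcap-replay-0
     limit 0
     rate 1e4
     worker 1
     interface loop0
     node ethernet-input
     pcap /tmp/replay.pcap
   }
   packet-generator enable pcap-replay-0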
+ +Data definition +~~~~~~~~~~~~~~~ + +Packet generator data definitions make use of a layered implementation +strategy. Networking layers are specified in order, and the notation can +seem a bit counter-intuitive. In the example above, the data definition +stanza constructs a set of L2-L4 headers layers, and uses an +incrementing fill pattern to round out the requested 300-byte packets. + +- **IP4: 1.2.3 -> 4.5.6** - Construct an L2 (MAC) header with the ip4 + ethertype (0x800), src MAC address of 00:01:00:02:00:03 and dst MAC + address of 00:04:00:05:00:06. Mac addresses may be specified in + either *xxxx.xxxx.xxxx* format or *xx:xx:xx:xx:xx:xx* format. + +- **UDP: 192.168.1.10 - 192.168.1.254 -> 192.168.2.10** - Construct an + incrementing set of L3 (IPv4) headers for successive packets with + source addresses ranging from .10 to .254. All packets in the stream + have a constant dest address of 192.168.2.10. Set the protocol field + to 17, UDP. + +- **UDP: 1234 -> 2345** - Set the UDP source and destination ports to + 1234 and 2345, respectively + +- **incrementing 256** - Insert up to 256 incrementing data bytes. + +Obvious variations involve “s/IP4/IP6/” in the above, along with +changing from IPv4 to IPv6 address notation. + +The vpp pg can set any / all IPv4 header fields, including tos, packet +length, mf / df / fragment id and offset, ttl, protocol, checksum, and +src/dst addresses. Take a look at ../src/vnet/ip/ip[46]_pg.c for +details. + +If all else fails, specify the entire packet data in hex: + +- **hex 0xabcd…** - copy hex data verbatim into the packet + +When replaying pcap files (“**pcap <filename>**”), do not specify a data +stanza. + +Diagnosing “packet-generator new” parse failures +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If you want to inject packets into a brand-new graph node, remember to +tell the packet generator debug CLI how to parse the packet data stanza. + +If the node expects L2 Ethernet MAC headers, specify “.unformat_buffer = +unformat_ethernet_header”: + +.. code:: c + + VLIB_REGISTER_NODE (ethernet_input_node) = + { + <snip> + .unformat_buffer = unformat_ethernet_header, + <snip> + }; + +Beyond that, it may be necessary to set breakpoints in +…/src/vnet/pg/cli.c. Debug image suggested. + +When debugging new nodes, it may be far simpler to directly inject +ethernet frames - and add a corresponding vlib_buffer_advance in the new +node - than to modify the packet generator. + +Debug CLI +--------- + +The descriptions above describe the “packet-generator new” debug CLI in +detail. + +Additional debug CLI commands include: + +:: + + vpp# packet-generator enable [<stream-name>] + +which enables the named stream, or all streams. + +:: + + vpp# packet-generator disable [<stream-name>] + +disables the named stream, or all streams. + +:: + + vpp# packet-generator delete <stream-name> + +Deletes the named stream. + +:: + + vpp# packet-generator configure <stream-name> [limit <nnn>] + [rate <f64-pps>] [size <nn>-<nn>] + +Changes stream parameters without having to recreate the entire stream +definition. Note that re-issuing a “packet-generator new” command will +correctly recreate the named stream. |
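Putting the debug CLI pieces together, a typical interactive sequence
might look like the following sketch, reusing the stream name pg0 from
the earlier example; the limit, rate, and size values are arbitrary:

::

   vpp# packet-generator enable pg0
   vpp# packet-generator disable pg0
   vpp# packet-generator configure pg0 limit 1000 rate 1e4 size 64-64
   vpp# packet-generator enable pg0
   vpp# packet-generator delete pg0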