Diffstat (limited to 'docs/developer/corearchitecture')
-rw-r--r-- docs/developer/corearchitecture/bihash.rst 313
-rw-r--r-- docs/developer/corearchitecture/buffer_metadata.rst 237
-rw-r--r-- docs/developer/corearchitecture/buildsystem/buildrootmakefile.rst 353
-rw-r--r-- docs/developer/corearchitecture/buildsystem/cmakeandninja.rst 186
-rw-r--r-- docs/developer/corearchitecture/buildsystem/index.rst 14
-rw-r--r-- docs/developer/corearchitecture/buildsystem/mainmakefile.rst 2
-rw-r--r-- docs/developer/corearchitecture/featurearcs.rst 225
-rw-r--r-- docs/developer/corearchitecture/index.rst 21
-rw-r--r-- docs/developer/corearchitecture/infrastructure.rst 612
l--------- docs/developer/corearchitecture/mem.rst 1
-rw-r--r-- docs/developer/corearchitecture/multi_thread.rst 169
-rw-r--r-- docs/developer/corearchitecture/multiarch/arbfns.rst 87
-rw-r--r-- docs/developer/corearchitecture/multiarch/index.rst 12
-rw-r--r-- docs/developer/corearchitecture/multiarch/nodefns.rst 138
-rw-r--r-- docs/developer/corearchitecture/softwarearchitecture.rst 47
-rw-r--r-- docs/developer/corearchitecture/vlib.rst 888
-rw-r--r-- docs/developer/corearchitecture/vnet.rst 807
17 files changed, 4112 insertions, 0 deletions
diff --git a/docs/developer/corearchitecture/bihash.rst b/docs/developer/corearchitecture/bihash.rst
new file mode 100644
index 00000000000..9b62baaf9cf
--- /dev/null
+++ b/docs/developer/corearchitecture/bihash.rst
@@ -0,0 +1,313 @@
+Bounded-index Extensible Hashing (bihash)
+=========================================
+
+Vpp uses bounded-index extensible hashing to solve a variety of
+exact-match (key, value) lookup problems. Benefits of the current
+implementation:
+
+- Very high record count scaling, tested to 100,000,000 records.
+- Lookup performance degrades gracefully as the number of records
+ increases
+- No reader locking required
+- Template implementation: it’s easy to support arbitrary (key,value)
+ types
+
+Bounded-index extensible hashing has been widely used in databases for
+decades.
+
+Bihash uses a two-level data structure:
+
+::
+
+ +-----------------+
+ | bucket-0 |
+ | log2_size |
+ | backing store |
+ +-----------------+
+ | bucket-1 |
+ | log2_size | +--------------------------------+
+ | backing store | --------> | KVP_PER_PAGE * key-value-pairs |
+ +-----------------+ | page 0 |
+ ... +--------------------------------+
+ +-----------------+ | KVP_PER_PAGE * key-value-pairs |
+ | bucket-2**N-1 | | page 1 |
+ | log2_size | +--------------------------------+
+ | backing store | ---
+ +-----------------+ +--------------------------------+
+ | KVP_PER_PAGE * key-value-pairs |
+ | page 2**(log2(size)) - 1 |
+ +--------------------------------+
+
+Discussion of the algorithm
+---------------------------
+
+This structure has a couple of major advantages. In practice, each
+bucket entry fits into a 64-bit integer. Coincidentally, vpp’s target
+CPU architectures support 64-bit atomic operations. When modifying the
+contents of a specific bucket, we do the following:
+
+- Make a working copy of the bucket’s backing storage
+- Atomically swap a pointer to the working copy into the bucket array
+- Change the original backing store data
+- Atomically swap back to the original
+
+So, no reader locking is required to search a bihash table.
+
+At lookup time, the implementation computes a key hash code. We use the
+least-significant N bits of the hash to select the bucket.
+
+With the bucket in hand, we learn log2 (nBackingPages) for the selected
+bucket. At this point, we use the next log2_size bits from the hash code
+to select the specific backing page in which the (key,value) pair will
+be found.
+
+Net result: we search **one** backing page, not 2**log2_size pages. This
+is a key property of the algorithm.
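+
+As a rough sketch of the selection step described above, the bucket and
+page indices can be derived from a 64-bit hash as follows. The names
+select_slot and lookup_slot_t are hypothetical, chosen for illustration;
+this is not the vppinfra implementation:
+
+.. code:: c
+
+ #include <stdint.h>
+
+ /* Hypothetical result type, for illustration only */
+ typedef struct
+ {
+   uint64_t bucket_index; /* index into the bucket array */
+   uint64_t page_index;   /* backing page within the bucket */
+ } lookup_slot_t;
+
+ /* The low log2_nbuckets bits of the hash pick the bucket; the next
+    log2_pages bits pick one backing page. Assumes
+    log2_nbuckets + log2_pages < 64. */
+ static inline lookup_slot_t
+ select_slot (uint64_t hash, uint32_t log2_nbuckets, uint32_t log2_pages)
+ {
+   lookup_slot_t s;
+   s.bucket_index = hash & ((1ULL << log2_nbuckets) - 1);
+   s.page_index = (hash >> log2_nbuckets) & ((1ULL << log2_pages) - 1);
+   return s;
+ }
+
+Only the single page named by page_index then needs to be searched.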
+
+When sufficient collisions occur to fill the backing pages for a given
+bucket, we double the bucket size, rehash, and deal the bucket contents
+into a double-sized set of backing pages. In the future, we may
+represent the size as a linear combination of two powers-of-two, to
+increase space efficiency.
+
+To solve the “jackpot case” where a set of records collide under hashing
+in a bad way, the implementation will fall back to linear search across
+2**log2_size backing pages on a per-bucket basis.
+
+To maintain *space* efficiency, we should configure the bucket array so
+that backing pages are effectively utilized. Lookup performance tends to
+change *very little* if the bucket array is too small or too large.
+
+Bihash depends on selecting an effective hash function. If one were to
+use a truly broken hash function such as “return 1ULL”, bihash would
+still work, but it would be equivalent to poorly-programmed linear
+search.
+
+We often use cpu intrinsic functions - think crc32 - to rapidly compute
+a hash code which has decent statistics.
+
+Bihash Cookbook
+---------------
+
+Using current (key,value) template instance types
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It’s quite easy to use one of the template instance types. As of this
+writing, …/src/vppinfra provides pre-built templates for 8, 16, 20, 24,
+40, and 48 byte keys, u8 \* vector keys, and 8 byte values.
+
+See …/src/vppinfra/{bihash\_\_8}.h
+
+To define the data types, #include a specific template instance, most
+often in a subsystem header file:
+
+.. code:: c
+
+ #include <vppinfra/bihash_8_8.h>
+
+If you’re building a standalone application, you’ll need to define the
+various functions by #including the method implementation file in a C
+source file.
+
+The core vpp engine currently uses most if not all of the known bihash
+types, so you probably won’t need to #include the method implementation
+file.
+
+.. code:: c
+
+ #include <vppinfra/bihash_template.c>
+
+Add an instance of the selected bihash data structure to e.g. a “main_t”
+structure:
+
+.. code:: c
+
+ typedef struct
+ {
+ ...
+ BVT (clib_bihash) hash_table;
+ /* or, equivalently, with the explicit type name: */
+ /* clib_bihash_8_8_t hash_table; */
+ ...
+ } my_main_t;
+
+The BV macro concatenates its argument with the value of the preprocessor
+symbol BIHASH_TYPE. The BVT macro concatenates its argument with the
+value of BIHASH_TYPE and the fixed-string “_t”. So in the above example,
+BVT (clib_bihash) generates “clib_bihash_8_8_t”.
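+
+The token pasting can be sketched as follows. This is an illustration
+only: the helper macros BV_CAT, BVT_CAT, and STR are hypothetical names,
+and BIHASH_TYPE is assumed to be defined as _8_8, as the example above
+implies. The real definitions live in the vppinfra bihash template
+headers and may differ in detail:
+
+.. code:: c
+
+ #include <string.h>
+
+ #define BIHASH_TYPE _8_8 /* assumed value for the 8_8 instance */
+
+ /* Two levels, so that BIHASH_TYPE is expanded before pasting */
+ #define BV_CAT_(a, b) a##b
+ #define BV_CAT(a, b) BV_CAT_ (a, b)
+ #define BV(a) BV_CAT (a, BIHASH_TYPE)
+
+ #define BVT_CAT_(a, b) a##b##_t
+ #define BVT_CAT(a, b) BVT_CAT_ (a, b)
+ #define BVT(a) BVT_CAT (a, BIHASH_TYPE)
+
+ /* Stringify the expanded result so it can be inspected */
+ #define STR_(x) #x
+ #define STR(x) STR_ (x)
+
+ /* "clib_bihash_search_8_8" */
+ static const char *bv_name = STR (BV (clib_bihash_search));
+ /* "clib_bihash_8_8_t" */
+ static const char *bvt_name = STR (BVT (clib_bihash));
+
+The extra level of macro indirection matters: pasting directly with ##
+would produce the literal token BIHASH_TYPE rather than its value.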
+
+If you’re sure you won’t decide to change the template / type name
+later, it’s perfectly OK to code “clib_bihash_8_8_t” and so forth.
+
+In fact, if you #include multiple template instances in a single source
+file, you **must** use fully-enumerated type names. The macros stand no
+chance of working.
+
+Initializing a bihash table
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Call the init function as shown. As a rough guide, pick a number of
+buckets which is approximately
+number_of_expected_records/BIHASH_KVP_PER_PAGE from the relevant
+template instance header-file. See previous discussion.
+
+The amount of memory selected should easily contain all of the records,
+with a generous allowance for hash collisions. Bihash memory is
+allocated separately from the main heap, and won’t cost anything except
+kernel PTE’s until touched, so it’s OK to be reasonably generous.
+
+For example:
+
+.. code:: c
+
+ my_main_t *mm = &my_main;
+ clib_bihash_8_8_t *h;
+
+ h = &mm->hash_table;
+
+ clib_bihash_init_8_8 (h, "test", (u32) number_of_buckets,
+ (uword) memory_size);
+
+Add or delete a key/value pair
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Use BV(clib_bihash_add_del), or the explicit type variant:
+
+.. code:: c
+
+ clib_bihash_kv_8_8_t kv;
+ my_main_t *mm = &my_main;
+ clib_bihash_8_8_t *h;
+
+ h = &mm->hash_table;
+ kv.key = key_to_add_or_delete;
+ kv.value = value_to_add_or_delete;
+
+ clib_bihash_add_del_8_8 (h, &kv, is_add /* 1=add, 0=delete */);
+
+In the delete case, kv.value is irrelevant. To change the value
+associated with an existing (key,value) pair, simply re-add the [new]
+pair.
+
+Simple search
+~~~~~~~~~~~~~
+
+The simplest possible (key, value) search goes like so:
+
+.. code:: c
+
+ clib_bihash_kv_8_8_t search_kv, return_kv;
+ my_main_t *mm = &my_main;
+ clib_bihash_8_8_t *h;
+
+ h = &mm->hash_table;
+ search_kv.key = key_to_add_or_delete;
+
+ if (clib_bihash_search_8_8 (h, &search_kv, &return_kv) < 0)
+ key_not_found();
+ else
+ key_found();
+
+Note that it’s perfectly fine to collect the lookup result in the same
+key-value pair that holds the search key:
+
+.. code:: c
+
+ if (clib_bihash_search_8_8 (h, &search_kv, &search_kv) < 0)
+ key_not_found();
+ /* etc. */
+
+Bihash vector processing
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+When processing a vector of packets which need a certain lookup
+performed, it’s worth the trouble to compute the key hash, and prefetch
+the correct bucket ahead of time.
+
+Here’s a sketch of one way to write the required code:
+
+Dual-loop:
+
+- 6 packets ahead, prefetch 2x vlib_buffer_t’s and 2x packet data
+ required to form the record keys
+- 4 packets ahead, form 2x record keys and call BV(clib_bihash_hash)
+ or the explicit hash function to calculate the record hashes. Call
+ 2x BV(clib_bihash_prefetch_bucket) to prefetch the buckets
+- 2 packets ahead, call 2x BV(clib_bihash_prefetch_data) to prefetch
+ 2x (key,value) data pages.
+- In the processing section, call 2x
+ BV(clib_bihash_search_inline_with_hash) to perform the search
+
+Programmer’s choice whether to stash the hash code somewhere in
+vnet_buffer(b) metadata, or to use local variables.
+
+Single-loop:
+
+- Use simple search as shown above.
+
+Walking a bihash table
+~~~~~~~~~~~~~~~~~~~~~~
+
+A fairly common scenario to build “show” commands involves walking a
+bihash table. It’s simple enough:
+
+.. code:: c
+
+ my_main_t *mm = &my_main;
+ clib_bihash_8_8_t *h;
+ void callback_fn (clib_bihash_kv_8_8_t *, void *);
+
+ h = &mm->hash_table;
+
+ BV(clib_bihash_foreach_key_value_pair) (h, callback_fn, (void *) arg);
+
+To nobody’s great surprise: clib_bihash_foreach_key_value_pair iterates
+across the entire table, calling callback_fn with active entries.
+
+Bihash table iteration safety
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The iterator template “clib_bihash_foreach_key_value_pair” must be used
+with a certain amount of care. For one thing, the iterator template does
+*not* take the bihash hash table writer lock. If your use-case requires
+it, lock the table.
+
+For another, the iterator template is not safe under all conditions:
+
+- It’s **OK to delete** bihash table entries during a table-walk. The
+ iterator checks whether the current bucket has been freed after each
+ *callback_fn(…)* invocation.
+
+- It is **not OK to add** entries during a table-walk.
+
+The add-during-walk case involves a jackpot: while processing a
+key-value-pair in a particular bucket, add a certain number of entries.
+By luck, assume that one or more of the added entries causes the
+**current bucket** to split-and-rehash.
+
+Since we rehash KVP’s to different pages based on what amounts to a
+different hash function, either of these things can go wrong:
+
+- We may revisit previously-visited entries. Depending on how one coded
+ the use-case, we could end up in a recursive-add situation.
+
+- We may skip entries that have not been visited
+
+One could build an add-safe iterator, at a significant cost in
+performance: copy the entire bucket, and walk the copy.
+
+It’s hard to imagine a worthwhile add-during-walk use-case in the first
+place; let alone one which couldn’t be implemented by walking the table
+without modifying it, then adding a set of records.
+
+Creating a new template instance
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Creating a new template is easy. Use one of the existing templates as a
+model, and make the obvious changes. The hash and key_compare methods
+are performance-critical in multiple senses.
+
+If the key compare method is slow, every lookup will be slow. If the
+hash function is slow, same story. If the hash function has poor
+statistical properties, space efficiency will suffer. In the limit, a
+bad enough hash function will cause large portions of the table to
+revert to linear search.
+
+Use of the best available vector unit is well worth the trouble in the
+hash and key_compare functions.
diff --git a/docs/developer/corearchitecture/buffer_metadata.rst b/docs/developer/corearchitecture/buffer_metadata.rst
new file mode 100644
index 00000000000..545c31f3041
--- /dev/null
+++ b/docs/developer/corearchitecture/buffer_metadata.rst
@@ -0,0 +1,237 @@
+Buffer Metadata
+===============
+
+Each vlib_buffer_t (packet buffer) carries buffer metadata which
+describes the current packet-processing state. The underlying techniques
+have been used for decades, across multiple packet processing
+environments.
+
+We will examine vpp buffer metadata in some detail, but folks who need
+to manipulate and/or extend the scheme should expect to do a certain
+level of code inspection.
+
+Vlib (Vector library) primary buffer metadata
+---------------------------------------------
+
+The first 64 octets of each vlib_buffer_t carry the primary buffer
+metadata. See …/src/vlib/buffer.h for full details.
+
+Important fields:
+
+- i16 current_data: the signed offset in data[] or pre_data[] that we
+ are currently processing. If negative, the current header points into
+ the pre-data (rewrite space) area.
+- u16 current_length: nBytes between current_data and the end of this
+ buffer.
+- u32 flags: Buffer flag bits. Heavily used, not many bits left
+
+ - src/vlib/buffer.h flag bits
+
+ - VLIB_BUFFER_IS_TRACED: buffer is traced
+ - VLIB_BUFFER_NEXT_PRESENT: buffer has multiple chunks
+ - VLIB_BUFFER_TOTAL_LENGTH_VALID:
+ total_length_not_including_first_buffer is valid (see below)
+
+ - src/vnet/buffer.h flag bits
+
+ - VNET_BUFFER_F_L4_CHECKSUM_COMPUTED: tcp/udp checksum has been
+ computed
+ - VNET_BUFFER_F_L4_CHECKSUM_CORRECT: tcp/udp checksum is correct
+ - VNET_BUFFER_F_VLAN_2_DEEP: two vlan tags present
+ - VNET_BUFFER_F_VLAN_1_DEEP: one vlan tag present
+ - VNET_BUFFER_F_SPAN_CLONE: packet has already been cloned (span
+ feature)
+ - VNET_BUFFER_F_LOOP_COUNTER_VALID: packet look-up loop count
+ valid
+ - VNET_BUFFER_F_LOCALLY_ORIGINATED: packet built by vpp
+ - VNET_BUFFER_F_IS_IP4: packet is ipv4, for checksum offload
+ - VNET_BUFFER_F_IS_IP6: packet is ipv6, for checksum offload
+ - VNET_BUFFER_F_OFFLOAD_IP_CKSUM: hardware ip checksum offload
+ requested
+ - VNET_BUFFER_F_OFFLOAD_TCP_CKSUM: hardware tcp checksum offload
+ requested
+ - VNET_BUFFER_F_OFFLOAD_UDP_CKSUM: hardware udp checksum offload
+ requested
+ - VNET_BUFFER_F_IS_NATED: natted packet, skip input checks
+ - VNET_BUFFER_F_L2_HDR_OFFSET_VALID: L2 header offset valid
+ - VNET_BUFFER_F_L3_HDR_OFFSET_VALID: L3 header offset valid
+ - VNET_BUFFER_F_L4_HDR_OFFSET_VALID: L4 header offset valid
+ - VNET_BUFFER_F_FLOW_REPORT: packet is an ipfix packet
+ - VNET_BUFFER_F_IS_DVR: packet to be reinjected into the l2
+ output path
+ - VNET_BUFFER_F_QOS_DATA_VALID: QoS data valid in
+ vnet_buffer_opaque2
+ - VNET_BUFFER_F_GSO: generic segmentation offload requested
+ - VNET_BUFFER_F_AVAIL1: available bit
+ - VNET_BUFFER_F_AVAIL2: available bit
+ - VNET_BUFFER_F_AVAIL3: available bit
+ - VNET_BUFFER_F_AVAIL4: available bit
+ - VNET_BUFFER_F_AVAIL5: available bit
+ - VNET_BUFFER_F_AVAIL6: available bit
+ - VNET_BUFFER_F_AVAIL7: available bit
+
+- u32 flow_id: generic flow identifier
+- u8 ref_count: buffer reference / clone count (e.g. for span
+ replication)
+- u8 buffer_pool_index: buffer pool index which owns this buffer
+- vlib_error_t (u16) error: error code for buffers enqueued to error
+ handler
+- u32 next_buffer: buffer index of next buffer in chain. Only valid if
+ VLIB_BUFFER_NEXT_PRESENT is set
+- union
+
+ - u32 current_config_index: current index on feature arc
+ - u32 punt_reason: reason code once packet punted. Mutually
+ exclusive with current_config_index
+
+- u32 opaque[10]: primary vnet-layer opaque data (see below)
+- END of first cache line / data initialized by the buffer allocator
+- u32 trace_index: buffer’s index in the packet trace subsystem
+- u32 total_length_not_including_first_buffer: see
+ VLIB_BUFFER_TOTAL_LENGTH_VALID above
+- u32 opaque2[14]: secondary vnet-layer opaque data (see below)
+- u8 pre_data[VLIB_BUFFER_PRE_DATA_SIZE]: rewrite space, often used to
+ prepend tunnel encapsulations
+- u8 data[0]: buffer data received from the wire. Ordinarily, hardware
+ devices use b->data[0] as the DMA target but there are exceptions. Do
+ not write code which blindly assumes that packet data starts in
+ b->data[0]. Use vlib_buffer_get_current(…).
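+
+To illustrate how current_data and current_length interact, here is a
+minimal sketch using a heavily simplified stand-in for vlib_buffer_t.
+The names mock_buffer_t, mock_buffer_get_current, mock_buffer_advance,
+and the array sizes are assumptions made for this example; consult
+…/src/vlib/buffer.h for the real definitions:
+
+.. code:: c
+
+ #include <stddef.h>
+ #include <stdint.h>
+
+ #define MOCK_PRE_DATA_SIZE 128 /* assumed size, for illustration */
+
+ /* Simplified stand-in for vlib_buffer_t */
+ typedef struct
+ {
+   int16_t current_data;    /* signed offset relative to data[] */
+   uint16_t current_length; /* bytes from current_data to buffer end */
+   uint8_t pre_data[MOCK_PRE_DATA_SIZE]; /* rewrite space */
+   uint8_t data[256];       /* packet data */
+ } mock_buffer_t;
+
+ /* Address of the byte currently being processed. A negative
+    current_data lands in pre_data[], which directly precedes data[]. */
+ static inline uint8_t *
+ mock_buffer_get_current (mock_buffer_t *b)
+ {
+   return (uint8_t *) b + offsetof (mock_buffer_t, data) + b->current_data;
+ }
+
+ /* Move the current position by l bytes. A negative l moves backwards,
+    for example to prepend an encapsulation in the rewrite space. */
+ static inline void
+ mock_buffer_advance (mock_buffer_t *b, int16_t l)
+ {
+   b->current_data += l;
+   b->current_length -= l;
+ }
+
+Advancing by -14 on a buffer whose current_data is 0 leaves the current
+pointer 14 bytes before data[0], inside the rewrite space, and grows
+current_length by 14.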
+
+Vnet (network stack) primary buffer metadata
+--------------------------------------------
+
+Vnet primary buffer metadata occupies space reserved in the vlib opaque
+field shown above, and has the type name vnet_buffer_opaque_t.
+Ordinarily accessed using the vnet_buffer(b) macro. See
+../src/vnet/buffer.h for full details.
+
+Important fields:
+
+- u32 sw_if_index[2]: RX and TX interface handles. At the ip lookup
+ stage, vnet_buffer(b)->sw_if_index[VLIB_TX] is interpreted as a FIB
+ index.
+- i16 l2_hdr_offset: offset from b->data[0] of the packet L2 header.
+ Valid only if b->flags & VNET_BUFFER_F_L2_HDR_OFFSET_VALID is set
+- i16 l3_hdr_offset: offset from b->data[0] of the packet L3 header.
+ Valid only if b->flags & VNET_BUFFER_F_L3_HDR_OFFSET_VALID is set
+- i16 l4_hdr_offset: offset from b->data[0] of the packet L4 header.
+ Valid only if b->flags & VNET_BUFFER_F_L4_HDR_OFFSET_VALID is set
+- u8 feature_arc_index: feature arc that the packet is currently
+ traversing
+- union
+
+ - ip
+
+ - u32 adj_index[2]: adjacency from dest IP lookup in [VLIB_TX],
+ adjacency from source ip lookup in [VLIB_RX], set to ~0 until
+ source lookup done
+ - union
+
+ - generic fields
+ - ICMP fields
+ - reassembly fields
+
+ - mpls fields
+ - l2 bridging fields, only valid in the L2 path
+ - l2tpv3 fields
+ - l2 classify fields
+ - vnet policer fields
+ - MAP fields
+ - MAP-T fields
+ - ip fragmentation fields
+ - COP (whitelist/blacklist filter) fields
+ - LISP fields
+ - TCP fields
+
+ - connection index
+ - sequence numbers
+ - header and data offsets
+ - data length
+ - flags
+
+ - SCTP fields
+ - NAT fields
+ - u32 unused[6]
+
+Vnet (network stack) secondary buffer metadata
+----------------------------------------------
+
+Vnet secondary buffer metadata occupies space reserved in the vlib opaque2
+field shown above, and has the type name vnet_buffer_opaque2_t.
+Ordinarily accessed using the vnet_buffer2(b) macro. See
+../src/vnet/buffer.h for full details.
+
+Important fields:
+
+- qos fields
+
+ - u8 bits
+ - u8 source
+
+- u8 loop_counter: used to detect and report internal forwarding loops
+- group-based policy fields
+
+ - u8 flags
+ - u16 sclass: the packet’s source class
+
+- u16 gso_size: L4 payload size, persists all the way to
+ interface-output in case GSO is not enabled
+- u16 gso_l4_hdr_sz: size of the L4 protocol header
+- union
+
+ - packet trajectory tracer (largely deprecated)
+
+ - u16 \*trajectory_trace; only #if VLIB_BUFFER_TRACE_TRAJECTORY >
+ 0
+
+ - packet generator
+
+ - u64 pg_replay_timestamp: timestamp for replayed pcap trace
+ packets
+
+ - u32 unused[8]
+
+Buffer Metadata Extensions
+--------------------------
+
+Plugin developers may wish to extend either the primary or secondary
+vnet buffer opaque unions. Please perform a manual live variable
+analysis, otherwise nodes which use shared buffer metadata space may
+break things.
+
+It’s not OK to add plugin or proprietary metadata to the core vpp engine
+header files named above. Instead, proceed as follows. The example
+concerns the vnet primary buffer opaque union vnet_buffer_opaque_t. It’s
+a very simple variation to use the vnet secondary buffer opaque union
+vnet_buffer_opaque2_t.
+
+In a plugin header file:
+
+::
+
+ /* Add arbitrary buffer metadata */
+ #include <vnet/buffer.h>
+
+ typedef struct
+ {
+ u32 my_stuff[6];
+ } my_buffer_opaque_t;
+
+ STATIC_ASSERT (sizeof (my_buffer_opaque_t) <=
+ STRUCT_SIZE_OF (vnet_buffer_opaque_t, unused),
+ "Custom meta-data too large for vnet_buffer_opaque_t");
+
+ #define my_buffer_opaque(b) \
+ ((my_buffer_opaque_t *)((u8 *)((b)->opaque) + STRUCT_OFFSET_OF (vnet_buffer_opaque_t, unused)))
+
+To set data in the custom buffer opaque type given a vlib_buffer_t \*b:
+
+::
+
+ my_buffer_opaque (b)->my_stuff[2] = 123;
+
+To read data from the custom buffer opaque type:
+
+::
+
+ stuff0 = my_buffer_opaque (b)->my_stuff[2];
diff --git a/docs/developer/corearchitecture/buildsystem/buildrootmakefile.rst b/docs/developer/corearchitecture/buildsystem/buildrootmakefile.rst
new file mode 100644
index 00000000000..1eb4e6b5301
--- /dev/null
+++ b/docs/developer/corearchitecture/buildsystem/buildrootmakefile.rst
@@ -0,0 +1,353 @@
+Introduction to build-root/Makefile
+===================================
+
+The vpp build system consists of a top-level Makefile, a data-driven
+build-root/Makefile, and a set of makefile fragments. The various parts
+come together as the result of a set of well-thought-out conventions.
+
+This section describes build-root/Makefile in some detail.
+
+Repository Groups and Source Paths
+----------------------------------
+
+Current vpp workspaces comprise a single repository group. The file
+.../build-root/build-config.mk defines a key variable called
+SOURCE\_PATH. The SOURCE\_PATH variable names the set of repository
+groups. At the moment, there is only one repository group.
+
+Single pass build system, dependencies and components
+-----------------------------------------------------
+
+The vpp build system caters to components built with GNU autoconf /
+automake. Adding such components is a simple process. Dealing with
+components which use BSD-style raw Makefiles is more difficult.
+Dealing with toolchain components such as gcc, glibc, and binutils can
+be considerably more complicated.
+
+The vpp build system is a **single-pass** build system. A partial order
+must exist for any set of components: the set of (a before b) tuples
+must resolve to an ordered list. If you create a circular dependency of
+the form (a,b) (b,c) (c,a), gmake will try to build the target list,
+but there’s a 0.0% chance that the results will be pleasant. Cut-n-paste
+mistakes in .../build-data/packages/.mk can produce confusing failures.
+
+In a single-pass build system, it’s best to separate libraries and
+applications which instantiate them. For example, if vpp depends on
+libfoo.a, and myapp depends on both vpp and libfoo.a, it's best to place
+libfoo.a and myapp in separate components. The build system will build
+libfoo.a, vpp, and then (as a separate component) myapp. If you try to
+build libfoo.a and myapp from the same component, it won’t work.
+
+If you absolutely, positively insist on having myapp and libfoo.a in the
+same source tree, you can create a pseudo-component in a separate .mk
+file in the .../build-data/packages/ directory. Define
+phoneycomponent\_source = realcomponent, and provide manual
+configure/build/install targets.
+
+Using separate components for myapp, libfoo.a, and vpp is the best and
+easiest solution. However, the “mumble\_source = realsource” degree of
+freedom
+exists to solve intractable circular dependencies, such as: to build
+gcc-bootstrap, followed by glibc, followed by “real” gcc/g++ [which
+depends on glibc too].
+
+.../build-root
+--------------
+
+The .../build-root directory contains the repository group specification
+build-config.mk, the main Makefile, and the system-wide set of
+autoconf/automake variable overrides in config.site. We'll describe
+these files in some detail. To be clear about expectations: the main
+Makefile and config.site file are subtle and complex. It's unlikely that
+you'll need or want to modify them. Poorly planned changes in either
+place typically cause bugs that are difficult to solve.
+
+.../build-root/build-config.mk
+------------------------------
+
+As described above, the build-config.mk file is straightforward: it sets
+the make variable SOURCE\_PATH to a list of repository group absolute
+paths.
+
+If you choose to move a workspace, make sure
+to modify the paths defined by the SOURCE\_PATH variable. Those paths
+need to match changes you make in the workspace paths. For example, if
+you place the vpp directory in the workspace of a user named jsmith, you
+might change the SOURCE\_PATH to:
+
+SOURCE\_PATH = /home/jsmithuser/workspace/vpp
+
+The "out of the box" setting should work 99.5% of the time:
+
+::
+
+ SOURCE_PATH = $(CURDIR)/..
+
+.../vpp/build-root/Makefile
+---------------------------
+
+The main Makefile is complex in a number of dimensions. If you think you
+need to modify it, it's a good idea to do some research, or ask for
+advice before you change it.
+
+The main Makefile was organized and designed to provide the following
+characteristics: excellent performance, accurate dependency processing,
+cache enablement, timestamp optimizations, git integration,
+extensibility, builds with cross-compilation tool chains, and builds
+with embedded Linux distributions.
+
+If you really need to do so, you can build double-cross tools with it,
+with a minimum amount of fuss. For example, you could: compile gdb on
+x86\_64, to run on PowerPC, to debug the Xtensa instruction set.
+
+The PLATFORM variable
+---------------------
+
+The PLATFORM make/environment variable controls a number of important
+characteristics, primarily:
+
+- CPU architecture
+- The list of images to build.
+
+With respect to .../build-root/Makefile, the list of images to build is
+specified by the target. For example:
+
+::
+
+ make PLATFORM=vpp TAG=vpp_debug install-deb
+
+builds vpp debug Debian packages.
+
+The main Makefile interprets $PLATFORM by attempting to "-include" the
+file /build-data/platforms.mk:
+
+::
+
+ $(foreach d,$(FULL_SOURCE_PATH), \
+ $(eval -include $(d)/platforms.mk))
+
+By convention, we don't define **platforms** in the
+...//build-data/platforms.mk file.
+
+In the vpp case, we search for platform definition makefile fragments in
+.../vpp/build-data/platforms.mk, as follows:
+
+::
+
+ $(foreach d,$(SOURCE_PATH_BUILD_DATA_DIRS), \
+ $(eval -include $(d)/platforms/*.mk))
+
+With vpp, which uses the "vpp" platform as discussed above, we end up
+"-include"-ing .../vpp/build-data/platforms/vpp.mk.
+
+The platform-specific .mk fragment
+----------------------------------
+
+Here are the contents of .../build-data/platforms/vpp.mk:
+
+::
+
+ MACHINE=$(shell uname -m)
+
+ vpp_arch = native
+ ifeq ($(TARGET_PLATFORM),thunderx)
+ vpp_dpdk_target = arm64-thunderx-linuxapp-gcc
+ endif
+ vpp_native_tools = vppapigen
+
+ vpp_uses_dpdk = yes
+
+ # Uncomment to enable building unit tests
+ # vpp_enable_tests = yes
+
+ vpp_root_packages = vpp
+
+ # DPDK configuration parameters
+ # vpp_uses_dpdk_mlx4_pmd = yes
+ # vpp_uses_dpdk_mlx5_pmd = yes
+ # vpp_uses_external_dpdk = yes
+ # vpp_dpdk_inc_dir = /usr/include/dpdk
+ # vpp_dpdk_lib_dir = /usr/lib
+ # vpp_dpdk_shared_lib = yes
+
+ # Use '--without-libnuma' for non-numa aware architecture
+ # Use '--enable-dlmalloc' to use dlmalloc instead of mheap
+ vpp_configure_args_vpp = --enable-dlmalloc
+ sample-plugin_configure_args_vpp = --enable-dlmalloc
+
+ # load balancer plugin is not portable on 32 bit platform
+ ifeq ($(MACHINE),i686)
+ vpp_configure_args_vpp += --disable-lb-plugin
+ endif
+
+ vpp_debug_TAG_CFLAGS = -g -O0 -DCLIB_DEBUG \
+ -fstack-protector-all -fPIC -Werror
+ vpp_debug_TAG_CXXFLAGS = -g -O0 -DCLIB_DEBUG \
+ -fstack-protector-all -fPIC -Werror
+ vpp_debug_TAG_LDFLAGS = -g -O0 -DCLIB_DEBUG \
+ -fstack-protector-all -fPIC -Werror
+
+ vpp_TAG_CFLAGS = -g -O2 -D_FORTIFY_SOURCE=2 -fstack-protector -fPIC -Werror
+ vpp_TAG_CXXFLAGS = -g -O2 -D_FORTIFY_SOURCE=2 -fstack-protector -fPIC -Werror
+ vpp_TAG_LDFLAGS = -g -O2 -D_FORTIFY_SOURCE=2 -fstack-protector -fPIC -Werror -pie -Wl,-z,now
+
+ vpp_clang_TAG_CFLAGS = -g -O2 -D_FORTIFY_SOURCE=2 -fstack-protector -fPIC -Werror
+ vpp_clang_TAG_LDFLAGS = -g -O2 -D_FORTIFY_SOURCE=2 -fstack-protector -fPIC -Werror
+
+ vpp_gcov_TAG_CFLAGS = -g -O0 -DCLIB_DEBUG -fPIC -Werror -fprofile-arcs -ftest-coverage
+ vpp_gcov_TAG_LDFLAGS = -g -O0 -DCLIB_DEBUG -fPIC -Werror -coverage
+
+ vpp_coverity_TAG_CFLAGS = -g -O2 -fPIC -Werror -D__COVERITY__
+ vpp_coverity_TAG_LDFLAGS = -g -O2 -fPIC -Werror -D__COVERITY__
+
+Note the following variable settings:
+
+- The variable \_arch sets the CPU architecture used to build the
+ per-platform cross-compilation toolchain. With the exception of the
+ "native" architecture - used in our example - the vpp build system
+ produces cross-compiled binaries.
+
+- The variable \_native\_tools lists the required set of self-compiled
+ build tools.
+
+- The variable \_root\_packages lists the set of images to build when
+ specifying the target: make PLATFORM= TAG= [install-deb \|
+ install-rpm].
+
+The TAG variable
+----------------
+
+The TAG variable indirectly sets CFLAGS and LDFLAGS, as well as the
+build and install directory names in the .../vpp/build-root directory.
+See definitions above.
+
+Important build-root/Makefile targets
+-------------------------------------
+
+The main Makefile and the various makefile fragments implement the
+following user-visible targets:
+
++------------------+----------------------+--------------------------------------------------------------------------------------+
+| Target           | ENV Variable Settings| Notes                                                                                |
++==================+======================+======================================================================================+
+| foo              | bar                  | mumble                                                                               |
++------------------+----------------------+--------------------------------------------------------------------------------------+
+| bootstrap-tools  | none                 | Builds the set of native tools needed by the vpp build system to                     |
+|                  |                      | build images. Example: vppapigen. In a full cross compilation case, it might         |
+|                  |                      | include "make", "git", "find", and "tar".                                            |
++------------------+----------------------+--------------------------------------------------------------------------------------+
+| install-tools    | PLATFORM             | Builds the tool chain for the indicated <platform>. Not used in vpp builds           |
++------------------+----------------------+--------------------------------------------------------------------------------------+
+| distclean        | none                 | Roto-rooters everything in sight: toolchains, images, and so forth.                  |
++------------------+----------------------+--------------------------------------------------------------------------------------+
+| install-deb      | PLATFORM and TAG     | Build Debian packages comprising components listed in <platform>_root_packages,      |
+|                  |                      | using compile / link options defined by TAG.                                         |
++------------------+----------------------+--------------------------------------------------------------------------------------+
+| install-rpm      | PLATFORM and TAG     | Build RPMs comprising components listed in <platform>_root_packages,                 |
+|                  |                      | using compile / link options defined by TAG.                                         |
++------------------+----------------------+--------------------------------------------------------------------------------------+
+
+Additional build-root/Makefile environment variable settings
+------------------------------------------------------------
+
+These variable settings may be of use:
+
++----------------------+------------------------------------------------------------------------------------------------------------+
+| ENV Variable         | Notes                                                                                                      |
++======================+============================================================================================================+
+| BUILD_DEBUG=vx       | Directs Makefile et al. to make a good-faith effort to show what's going on in excruciating detail.        |
+|                      | Use it as follows: "make ... BUILD_DEBUG=vx". Fairly effective in Makefile debug situations.               |
++----------------------+------------------------------------------------------------------------------------------------------------+
+| V=1                  | Print detailed cc / ld command lines. Useful for discovering if -DFOO=11 is in the command line or not     |
++----------------------+------------------------------------------------------------------------------------------------------------+
+| CC=mygcc             | Override the configured C-compiler                                                                         |
++----------------------+------------------------------------------------------------------------------------------------------------+
+
+.../build-root/config.site
+--------------------------
+
+The contents of .../build-root/config.site override individual autoconf /
+automake default variable settings. Here are a few sample settings related to
+building a full toolchain:
+
+::
+
+ # glibc needs these settings for cross compiling
+ libc_cv_forced_unwind=yes
+ libc_cv_c_cleanup=yes
+ libc_cv_ssp=no
+
+Determining which variables need to be overridden, and the override
+values, is a matter of trial and error. It should be unnecessary to
+modify this file for use with fd.io vpp.
+
+.../build-data/platforms.mk
+---------------------------
+
+Each repo group includes the platforms.mk file, which is included by
+the main Makefile. The vpp/build-data/platforms.mk file is not terribly
+complex. As of this writing, the .../build-data/platforms.mk file
+accomplishes two tasks.
+
+First, it includes vpp/build-data/platforms/\*.mk:
+
+::
+
+ # Pick up per-platform makefile fragments
+ $(foreach d,$(SOURCE_PATH_BUILD_DATA_DIRS), \
+ $(eval -include $(d)/platforms/*.mk))
+
+This collects the set of platform definition makefile fragments, as discussed above.
+
+Second, platforms.mk implements the user-visible "install-deb" target.
+
+.../build-data/packages/\*.mk
+-----------------------------
+
+Each component needs a makefile fragment in order for the build system
+to recognize it. The per-component makefile fragments vary
+considerably in complexity. For a component built with GNU autoconf /
+automake which does not depend on other components, the make fragment
+can be empty. See .../build-data/packages/vpp.mk for an uncomplicated
+but fully realistic example.
+
+Here are some of the important variable settings in per-component makefile fragments:
+
++----------------------+------------------------------------------------------------------------------------------------------------+
+| Variable             | Notes                                                                                                      |
++======================+============================================================================================================+
+| xxx_configure_depend | Lists the set of component build dependencies for the xxx component. In plain English: don't try to        |
+|                      | configure this component until you've successfully built the indicated targets. Almost always,             |
+|                      | xxx_configure_depend will list a set of "yyy-install" targets. Note the pattern:                           |
+|                      | "variable names contain underscores, make target names contain hyphens"                                    |
++----------------------+------------------------------------------------------------------------------------------------------------+
+| xxx_configure_args   | (optional) Lists any additional arguments to pass to the xxx component "configure" script.                 |
+|                      | The main Makefile %-configure rule adds the required settings for --libdir, --prefix, and                  |
+|                      | --host (when cross-compiling)                                                                              |
++----------------------+------------------------------------------------------------------------------------------------------------+
+| xxx_CPPFLAGS         | Adds -I stanzas to CPPFLAGS for components upon which xxx depends.                                         |
+|                      | Almost invariably "xxx_CPPFLAGS = $(call installed_includes_fn, dep1 dep2 dep3)", where dep1, dep2, and    |
+|                      | dep3 are listed in xxx_configure_depend. It is bad practice to set "-g -O3" here. Those settings           |
+|                      | belong in a TAG.                                                                                           |
++----------------------+------------------------------------------------------------------------------------------------------------+
+| xxx_LDFLAGS          | Adds -Wl,-rpath -Wl,depN stanzas to LDFLAGS for components upon which xxx depends.                         |
+|                      | Almost invariably "xxx_LDFLAGS = $(call installed_lib_fn, dep1 dep2 dep3)", where dep1, dep2, and          |
+|                      | dep3 are listed in xxx_configure_depend. It is bad manners to set "-liberty-or-death" here.                |
+|                      | Those settings belong in Makefile.am.                                                                      |
++----------------------+------------------------------------------------------------------------------------------------------------+
+
+When dealing with "irritating" components built with raw Makefiles
+which only work when building in the source tree, we use a specific
+strategy in the xxx.mk file.
+
+The strategy is simple for those components: We copy the source tree
+into .../vpp/build-root/build-xxx. This works, but completely defeats
+dependency processing. This strategy is acceptable only for 3rd party
+software which won't need extensive (or preferably any) modifications.
+
+Take a look at .../vpp/build-data/packages/dpdk.mk. When invoked, the
+dpdk_configure variable copies source code into $(PACKAGE_BUILD_DIR),
+and performs the BSD equivalent of "autoreconf -i -f" to configure the
+build area. The rest of the file is similar: a bunch of hand-rolled
+glue code which manages to make the dpdk act like a good vpp build
+citizen even though it is not.
diff --git a/docs/developer/corearchitecture/buildsystem/cmakeandninja.rst b/docs/developer/corearchitecture/buildsystem/cmakeandninja.rst
new file mode 100644
index 00000000000..580d261bdac
--- /dev/null
+++ b/docs/developer/corearchitecture/buildsystem/cmakeandninja.rst
@@ -0,0 +1,186 @@
+Introduction to cmake and ninja
+===============================
+
+Cmake plus ninja is approximately equal to GNU autotools plus GNU
+make, respectively. Both cmake and GNU autotools support native and
+cross-compilation, and check for required components and versions.
+
+- For a decent-sized project - such as vpp - build performance is drastically better with (cmake, ninja).
+
+- The cmake input language looks like an actual language, rather than a shell scripting scheme on steroids.
+
+- Ninja doesn't pretend to support manually-generated input files. Think of it as a fast, dumb robot which eats mildly legible byte-code.
+
+See the `cmake website <http://cmake.org>`_, and the `ninja website
+<https://ninja-build.org>`_ for additional information.
+
+vpp cmake configuration files
+-----------------------------
+
+The top of the vpp project cmake hierarchy lives in .../src/CMakeLists.txt.
+This file defines the vpp project, and (recursively) includes two kinds
+of files: rule/function definitions, and target lists.
+
+- Rule/function definitions live in .../src/cmake/{\*.cmake}. Although the contents of these files are simple enough to read, it shouldn't be necessary to modify them very often.
+
+- Build target lists come from CMakeLists.txt files found in subdirectories, which are named in the SUBDIRS list in .../src/CMakeLists.txt
+
+::
+
+ ##############################################################################
+ # subdirs - order matters
+ ##############################################################################
+ if("${CMAKE_SYSTEM_NAME}" STREQUAL "Linux")
+   find_package(OpenSSL REQUIRED)
+   set(SUBDIRS
+     vppinfra svm vlib vlibmemory vlibapi vnet vpp vat vcl plugins
+     vpp-api tools/vppapigen tools/g2 tools/perftool)
+ elseif("${CMAKE_SYSTEM_NAME}" STREQUAL "Darwin")
+   set(SUBDIRS vppinfra)
+ else()
+   message(FATAL_ERROR "Unsupported system: ${CMAKE_SYSTEM_NAME}")
+ endif()
+
+ foreach(DIR ${SUBDIRS})
+   add_subdirectory(${DIR})
+ endforeach()
+
+- The vpp cmake configuration hierarchy discovers the list of plugins to be built by searching for subdirectories in .../src/plugins which contain CMakeLists.txt files
+
+
+::
+
+ ##############################################################################
+ # find and add all plugin subdirs
+ ##############################################################################
+ FILE(GLOB files RELATIVE
+   ${CMAKE_CURRENT_SOURCE_DIR}
+   ${CMAKE_CURRENT_SOURCE_DIR}/*/CMakeLists.txt
+ )
+ foreach (f ${files})
+   get_filename_component(dir ${f} DIRECTORY)
+   add_subdirectory(${dir})
+ endforeach()
+
+How to write a plugin CMakeLists.txt file
+-----------------------------------------
+
+It's really quite simple. Follow the pattern:
+
+::
+
+ add_vpp_plugin(mactime
+   SOURCES
+   mactime.c
+   node.c
+
+   API_FILES
+   mactime.api
+
+   INSTALL_HEADERS
+   mactime_all_api_h.h
+   mactime_msg_enum.h
+
+   API_TEST_SOURCES
+   mactime_test.c
+ )
+
+Adding a target elsewhere in the source tree
+--------------------------------------------
+
+Within reason, adding a subdirectory to the SUBDIRS list in
+.../src/CMakeLists.txt is perfectly OK. The indicated directory will
+need a CMakeLists.txt file.
+
+.. _building-g2:
+
+Here's how we build the g2 event data visualization tool:
+
+::
+
+ option(VPP_BUILD_G2 "Build g2 tool." OFF)
+ if(VPP_BUILD_G2)
+   find_package(GTK2 COMPONENTS gtk)
+   if(GTK2_FOUND)
+     include_directories(${GTK2_INCLUDE_DIRS})
+     add_vpp_executable(g2
+       SOURCES
+       clib.c
+       cpel.c
+       events.c
+       main.c
+       menu1.c
+       pointsel.c
+       props.c
+       g2version.c
+       view1.c
+
+       LINK_LIBRARIES vppinfra Threads::Threads m ${GTK2_LIBRARIES}
+       NO_INSTALL
+     )
+   endif()
+ endif()
+
+The g2 component is optional, and is not built by default. There are
+a couple of ways to tell cmake to include it in build.ninja [or in Makefile.]
+
+When invoking cmake manually [rarely done and not very easy], specify
+-DVPP_BUILD_G2=ON:
+
+::
+
+ $ cmake ... -DVPP_BUILD_G2=ON
+
+Take a good look at .../build-data/packages/vpp.mk to see where and
+how the top-level Makefile and .../build-root/Makefile set all of the
+cmake arguments. One strategy to enable an optional component is fairly
+obvious. Add -DVPP_BUILD_G2=ON to vpp_cmake_args.
+
+That would work, of course, but it's not a particularly elegant solution.
+
+Tinkering with build options: ccmake
+------------------------------------
+
+The easy way to set VPP_BUILD_G2 - or frankly **any** cmake
+parameter - is to install the "cmake-curses-gui" package and use
+it.
+
+- Do a straightforward vpp build using the top level Makefile, "make build" or "make build-release"
+- Adjourn to .../build-root/build-vpp-native/vpp or .../build-root/build-vpp_debug-native/vpp
+- Invoke "ccmake ." to reconfigure the project as desired
+
+Here's approximately what you'll see:
+
+::
+
+ CCACHE_FOUND /usr/bin/ccache
+ CMAKE_BUILD_TYPE
+ CMAKE_INSTALL_PREFIX /scratch/vpp-gate/build-root/install-vpp-nati
+ DPDK_INCLUDE_DIR /scratch/vpp-gate/build-root/install-vpp-nati
+ DPDK_LIB /scratch/vpp-gate/build-root/install-vpp-nati
+ MBEDTLS_INCLUDE_DIR /usr/include
+ MBEDTLS_LIB1 /usr/lib/x86_64-linux-gnu/libmbedtls.so
+ MBEDTLS_LIB2 /usr/lib/x86_64-linux-gnu/libmbedx509.so
+ MBEDTLS_LIB3 /usr/lib/x86_64-linux-gnu/libmbedcrypto.so
+ MUSDK_INCLUDE_DIR MUSDK_INCLUDE_DIR-NOTFOUND
+ MUSDK_LIB MUSDK_LIB-NOTFOUND
+ PRE_DATA_SIZE 128
+ VPP_API_TEST_BUILTIN ON
+ VPP_BUILD_G2 OFF
+ VPP_BUILD_PERFTOOL OFF
+ VPP_BUILD_VCL_TESTS ON
+ VPP_BUILD_VPPINFRA_TESTS OFF
+
+ CCACHE_FOUND: Path to a program.
+ Press [enter] to edit option Press [d] to delete an entry CMake Version 3.10.2
+ Press [c] to configure
+ Press [h] for help Press [q] to quit without generating
+ Press [t] to toggle advanced mode (Currently Off)
+
+Use the cursor to point at the VPP_BUILD_G2 line. Press the return key
+to change OFF to ON. Press "c" to regenerate build.ninja, etc.
+
+At that point "make build" or "make build-release" will build g2. And so on.
+
+Note that toggling advanced mode ["t"] gives access to substantially
+all of the cmake options, discovered directories and paths.
diff --git a/docs/developer/corearchitecture/buildsystem/index.rst b/docs/developer/corearchitecture/buildsystem/index.rst
new file mode 100644
index 00000000000..908e91e1fc1
--- /dev/null
+++ b/docs/developer/corearchitecture/buildsystem/index.rst
@@ -0,0 +1,14 @@
+.. _buildsystem:
+
+Build System
+============
+
+This guide describes the vpp build system in detail. As of this writing,
+the build system uses a mix of make / Makefiles, cmake, and ninja to
+achieve excellent build performance.
+
+.. toctree::
+
+ mainmakefile
+ cmakeandninja
+ buildrootmakefile
diff --git a/docs/developer/corearchitecture/buildsystem/mainmakefile.rst b/docs/developer/corearchitecture/buildsystem/mainmakefile.rst
new file mode 100644
index 00000000000..96b97496350
--- /dev/null
+++ b/docs/developer/corearchitecture/buildsystem/mainmakefile.rst
@@ -0,0 +1,2 @@
+Introduction to the top-level Makefile
+======================================
diff --git a/docs/developer/corearchitecture/featurearcs.rst b/docs/developer/corearchitecture/featurearcs.rst
new file mode 100644
index 00000000000..89c50e38dce
--- /dev/null
+++ b/docs/developer/corearchitecture/featurearcs.rst
@@ -0,0 +1,225 @@
+Feature Arcs
+============
+
+A significant number of vpp features are configurable on a per-interface
+or per-system basis. Rather than ask feature coders to manually
+construct the required graph arcs, we built a general mechanism to
+manage these mechanics.
+
+Specifically, feature arcs comprise ordered sets of graph nodes. Each
+feature node in an arc is independently controlled. Feature arc nodes
+are generally unaware of each other. Handing a packet to “the next
+feature node” is quite inexpensive.
+
+The feature arc implementation solves the problem of creating graph arcs
+used for steering.
+
+At the beginning of a feature arc, a bit of setup work is needed, but
+only if at least one feature is enabled on the arc.
+
+On a per-arc basis, individual feature definitions create a set of
+ordering dependencies. Feature infrastructure performs a topological
+sort of the ordering dependencies, to determine the actual feature
+order. Missing dependencies **will** lead to runtime disorder. See
+https://gerrit.fd.io/r/#/c/12753 for an example.
+
+If no partial order exists, vpp will refuse to run. Circular dependency
+loops of the form “a then b, b then c, c then a” are impossible to
+satisfy.
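+
+To see why a loop is fatal, consider a minimal sketch of a topological
+sort over "runs before" constraints. This is an illustration only: the
+function name, the MAX_FEATURES limit, and the matrix representation are
+assumptions made for the example, and are not taken from the vpp feature
+infrastructure code.
+
+.. code:: c
+
+   #define MAX_FEATURES 8
+
+   /*
+    * before[a][b] != 0 means "feature a runs before feature b".
+    * Writes a valid order into order[] and returns 0, or returns -1
+    * when the constraints contain a loop.
+    */
+   static int
+   order_features (int n, int before[MAX_FEATURES][MAX_FEATURES], int *order)
+   {
+     int indegree[MAX_FEATURES] = { 0 };
+     int placed[MAX_FEATURES] = { 0 };
+     int a, b, count = 0;
+
+     /* Count how many features must precede each feature */
+     for (a = 0; a < n; a++)
+       for (b = 0; b < n; b++)
+         if (before[a][b])
+           indegree[b]++;
+
+     while (count < n)
+       {
+         /* Pick any unplaced feature with no unplaced predecessor */
+         for (b = 0; b < n; b++)
+           if (!placed[b] && indegree[b] == 0)
+             break;
+         if (b == n)
+           return -1;          /* every remaining feature waits on another */
+
+         placed[b] = 1;
+         order[count++] = b;
+         for (a = 0; a < n; a++)
+           if (before[b][a])
+             indegree[a]--;
+       }
+     return 0;
+   }
+
+With the constraints "c then a, a then b" the sketch produces the order
+c, a, b. Adding "b then c" leaves no feature without a pending
+predecessor, so the function returns -1.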
+
+Adding a feature to an existing feature arc
+-------------------------------------------
+
+To nobody’s great surprise, we set up feature arcs using the typical
+“macro -> constructor function -> list of declarations” pattern:
+
+.. code:: c
+
+ VNET_FEATURE_INIT (mactime, static) =
+ {
+ .arc_name = "device-input",
+ .node_name = "mactime",
+ .runs_before = VNET_FEATURES ("ethernet-input"),
+ };
+
+This creates a “mactime” feature on the “device-input” arc.
+
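+Once per frame, dig up the vnet_feature_config_main_t corresponding to
+the feature arc of interest. The snippet below uses the arc index held
+in output_feature_arc_index, which belongs to the “interface-output”
+feature arc: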
+
+.. code:: c
+
+ vnet_main_t *vnm = vnet_get_main ();
+ vnet_interface_main_t *im = &vnm->interface_main;
+ u8 arc = im->output_feature_arc_index;
+ vnet_feature_config_main_t *fcm;
+
+ fcm = vnet_feature_get_config_main (arc);
+
+Note that in this case, we’ve stored the required arc index - assigned
+by the feature infrastructure - in the vnet_interface_main_t. Where to
+put the arc index is a programmer’s decision when creating a feature
+arc.
+
+Per packet, set next0 to steer packets to the next node they should
+visit:
+
+.. code:: c
+
+ vnet_get_config_data (&fcm->config_main,
+ &b0->current_config_index /* value-result */,
+ &next0, 0 /* # bytes of config data */);
+
+Configuration data is per-feature arc, and is often unused. Note that
+it’s normal to reset next0 to divert packets elsewhere; often, to drop
+them for cause:
+
+.. code:: c
+
+ next0 = MACTIME_NEXT_DROP;
+ b0->error = node->errors[DROP_CAUSE];
+
+Creating a feature arc
+----------------------
+
+Once again, we create feature arcs using constructor macros:
+
+.. code:: c
+
+ VNET_FEATURE_ARC_INIT (ip4_unicast, static) =
+ {
+ .arc_name = "ip4-unicast",
+ .start_nodes = VNET_FEATURES ("ip4-input", "ip4-input-no-checksum"),
+ .arc_index_ptr = &ip4_main.lookup_main.ucast_feature_arc_index,
+ };
+
+In this case, we configure two arc start nodes to handle the
+“hardware-verified ip checksum or not” cases. During initialization, the
+feature infrastructure stores the arc index as shown.
+
+In the head-of-arc node, do the following to send packets along the
+feature arc:
+
+.. code:: c
+
+ ip_lookup_main_t *lm = &im->lookup_main;
+ arc = lm->ucast_feature_arc_index;
+
+Once per packet, initialize packet metadata to walk the feature arc:
+
+.. code:: c
+
+ vnet_feature_arc_start (arc, sw_if_index0, &next, b0);
+
+Enabling / Disabling features
+-----------------------------
+
+Simply call vnet_feature_enable_disable to enable or disable a specific
+feature:
+
+.. code:: c
+
+ vnet_feature_enable_disable ("device-input", /* arc name */
+ "mactime", /* feature name */
+ sw_if_index, /* Interface sw_if_index */
+ enable_disable, /* 1 => enable */
+ 0 /* (void *) feature_configuration */,
+ 0 /* feature_configuration_nbytes */);
+
+The feature_configuration opaque is seldom used.
+
+If you wish to make a feature a *de facto* system-level concept, pass
+sw_if_index=0 at all times. Sw_if_index 0 is always valid, and
+corresponds to the “local” interface.
+
+Related “show” commands
+-----------------------
+
+To display the entire set of features, use “show features [verbose]”.
+The verbose form displays arc indices, and feature indices within the
+arcs.
+
+::
+
+ $ vppctl show features verbose
+ Available feature paths
+ <snip>
+ [14] ip4-unicast:
+ [ 0]: nat64-out2in-handoff
+ [ 1]: nat64-out2in
+ [ 2]: nat44-ed-hairpin-dst
+ [ 3]: nat44-hairpin-dst
+ [ 4]: ip4-dhcp-client-detect
+ [ 5]: nat44-out2in-fast
+ [ 6]: nat44-in2out-fast
+ [ 7]: nat44-handoff-classify
+ [ 8]: nat44-out2in-worker-handoff
+ [ 9]: nat44-in2out-worker-handoff
+ [10]: nat44-ed-classify
+ [11]: nat44-ed-out2in
+ [12]: nat44-ed-in2out
+ [13]: nat44-det-classify
+ [14]: nat44-det-out2in
+ [15]: nat44-det-in2out
+ [16]: nat44-classify
+ [17]: nat44-out2in
+ [18]: nat44-in2out
+ [19]: ip4-qos-record
+ [20]: ip4-vxlan-gpe-bypass
+ [21]: ip4-reassembly-feature
+ [22]: ip4-not-enabled
+ [23]: ip4-source-and-port-range-check-rx
+ [24]: ip4-flow-classify
+ [25]: ip4-inacl
+ [26]: ip4-source-check-via-rx
+ [27]: ip4-source-check-via-any
+ [28]: ip4-policer-classify
+ [29]: ipsec-input-ip4
+ [30]: vpath-input-ip4
+ [31]: ip4-vxlan-bypass
+ [32]: ip4-lookup
+ <snip>
+
+Here, we learn that the ip4-unicast feature arc has index 14, and that
+e.g. ip4-inacl has index 25 in the generated partial order.
+
+To display the features currently active on a specific interface, use
+“show interface features”:
+
+::
+
+ $ vppctl show interface GigabitEthernet3/0/0 features
+ Feature paths configured on GigabitEthernet3/0/0...
+ <snip>
+ ip4-unicast:
+ nat44-out2in
+ <snip>
+
+Table of Feature Arcs
+---------------------
+
+Simply search for name-strings to track down the arc definition,
+location of the arc index, etc.
+
+::
+
+ | Arc Name |
+ |------------------|
+ | device-input |
+ | ethernet-output |
+ | interface-output |
+ | ip4-drop |
+ | ip4-local |
+ | ip4-multicast |
+ | ip4-output |
+ | ip4-punt |
+ | ip4-unicast |
+ | ip6-drop |
+ | ip6-local |
+ | ip6-multicast |
+ | ip6-output |
+ | ip6-punt |
+ | ip6-unicast |
+ | mpls-input |
+ | mpls-output |
+ | nsh-output |
diff --git a/docs/developer/corearchitecture/index.rst b/docs/developer/corearchitecture/index.rst
new file mode 100644
index 00000000000..ecd5a3cdb08
--- /dev/null
+++ b/docs/developer/corearchitecture/index.rst
@@ -0,0 +1,21 @@
+.. _corearchitecture:
+
+=================
+Core Architecture
+=================
+
+.. toctree::
+ :maxdepth: 1
+
+ softwarearchitecture
+ infrastructure
+ vlib
+ vnet
+ featurearcs
+ buffer_metadata
+ multiarch/index
+ bihash
+ buildsystem/index
+ mem
+ multi_thread
+
diff --git a/docs/developer/corearchitecture/infrastructure.rst b/docs/developer/corearchitecture/infrastructure.rst
new file mode 100644
index 00000000000..b4e1065f81e
--- /dev/null
+++ b/docs/developer/corearchitecture/infrastructure.rst
@@ -0,0 +1,612 @@
+VPPINFRA (Infrastructure)
+=========================
+
+The files associated with the VPP Infrastructure layer are located in
+the ``./src/vppinfra`` folder.
+
+VPPinfra is a collection of basic c-library services, quite sufficient
+to build standalone programs to run directly on bare metal. It also
+provides high-performance dynamic arrays, hashes, bitmaps,
+high-precision real-time clock support, fine-grained event-logging, and
+data structure serialization.
+
+One fair comment / fair warning about vppinfra: you can't always tell a
+macro from an inline function from an ordinary function simply by name.
+Macros are used to avoid function calls in the typical case, and to
+cause (intentional) side-effects.
+
+Vppinfra has been around for almost 20 years and tends not to change
+frequently. The VPP Infrastructure layer contains the following
+functions:
+
+Vectors
+-------
+
+Vppinfra vectors are ubiquitous dynamically resized arrays with optional
+user-defined "headers". Many vppinfra data structures (e.g. hash, heap,
+pool) are vectors with various different headers.
+
+The memory layout looks like this:
+
+::
+
+ User header (optional, uword aligned)
+ Alignment padding (if needed)
+ Vector length in elements
+ User's pointer -> Vector element 0
+ Vector element 1
+ ...
+ Vector element N-1
+
+As shown above, the vector APIs deal with pointers to the 0th element of
+a vector. Null pointers are valid vectors of length zero.
+
+To avoid thrashing the memory allocator, one often resets the length of
+a vector to zero while retaining the memory allocation. Set the vector
+length field to zero via the vec_reset_length(v) macro. [Use the macro!
+It’s smart about NULL pointers.]
+
+Typically, the user header is not present. User headers allow for other
+data structures to be built atop vppinfra vectors. Users may specify the
+alignment for the first data element of a vector via the
+``vec_*_aligned`` macros.
+
+Vector elements can be any C type (e.g. int, double, struct bar). This
+is also true for data types built atop vectors (e.g. heap, pool, etc.).
+Many macros have \_a variants supporting alignment of vector elements
+and \_h variants supporting non-zero-length vector headers. The \_ha
+variants support both. Additionally, cacheline alignment within a vector
+element structure can be specified using the
+``CLIB_CACHE_LINE_ALIGN_MARK`` macro.
+
+Inconsistent usage of header and/or alignment related macro variants
+will cause delayed, confusing failures.
+
+Standard programming error: memorize a pointer to the ith element of a
+vector, and then expand the vector. Vectors expand by 3/2, so such code
+may appear to work for a period of time. Correct code almost always
+memorizes vector **indices** which are invariant across reallocations.
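+
+The hazard can be reproduced without vppinfra. The following is a
+minimal sketch in plain C: int_vec_t and int_vec_add1 are hypothetical
+stand-ins built on realloc, not the vppinfra vector layout or API. It
+remembers an element by index, grows the array many times, and then
+re-derives the element address.
+
+.. code:: c
+
+   #include <stdlib.h>
+
+   /* Hypothetical growable array of ints; not the vppinfra layout */
+   typedef struct
+   {
+     int *elts;
+     size_t len;
+     size_t cap;
+   } int_vec_t;
+
+   /* Append one element, growing capacity by roughly 3/2 when full */
+   static int
+   int_vec_add1 (int_vec_t * v, int value)
+   {
+     if (v->len == v->cap)
+       {
+         size_t new_cap = v->cap ? v->cap + (v->cap + 1) / 2 : 4;
+         int *p = realloc (v->elts, new_cap * sizeof (int));
+         if (p == 0)
+           return -1;
+         v->elts = p;          /* the base address may have moved */
+         v->cap = new_cap;
+       }
+     v->elts[v->len++] = value;
+     return 0;
+   }
+
+   /* Remember an element by index, grow the array, then read it back */
+   static int
+   remember_by_index (void)
+   {
+     int_vec_t v = { 0, 0, 0 };
+     size_t remembered;
+     int i, result;
+
+     int_vec_add1 (&v, 42);
+     remembered = v.len - 1;   /* an index survives reallocation */
+
+     for (i = 0; i < 1000; i++)
+       int_vec_add1 (&v, i);   /* may move v.elts several times */
+
+     result = v.elts[remembered];      /* address re-derived after growth */
+     free (v.elts);
+     return result;
+   }
+
+Had the code saved ``&v.elts[0]`` before the loop instead of the index,
+the saved pointer could refer to freed memory once realloc moves the
+array.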
+
+In typical application images, one supplies a set of global functions
+designed to be called from gdb. Here are a few examples:
+
+- vl(v) - prints vec_len(v)
+- pe(p) - prints pool_elts(p)
+- pifi(p, index) - prints pool_is_free_index(p, index)
+- debug_hex_bytes (p, nbytes) - hex memory dump nbytes starting at p
+
+Use the “show gdb” debug CLI command to print the current set.
+
+Bitmaps
+-------
+
+Vppinfra bitmaps are dynamic, built using the vppinfra vector APIs.
+Quite handy for a variety of jobs.
+
+Pools
+-----
+
+Vppinfra pools combine vectors and bitmaps to rapidly allocate and free
+fixed-size data structures with independent lifetimes. Pools are perfect
+for allocating per-session structures.
+
+Hashes
+------
+
+Vppinfra provides several hash flavors. Data plane problems involving
+packet classification / session lookup often use
+./src/vppinfra/bihash_template.[ch] bounded-index extensible hashes.
+These templates are instantiated multiple times, to efficiently service
+different fixed-key sizes.
+
+Bihashes are thread-safe. Read-locking is not required. A simple
+spin-lock ensures that only one thread writes an entry at a time.
+
+The original vppinfra hash implementation in ./src/vppinfra/hash.[ch]
+is simple to use, and is often used in control-plane code which needs
+exact-string-matching.
+
+In either case, one almost always looks up a key in a hash table to
+obtain an index in a related vector or pool. The APIs are simple enough,
+but one must take care when using the unmanaged arbitrary-sized key
+variant. Hash_set_mem (hash_table, key_pointer, value) memorizes
+key_pointer. It is usually a bad mistake to pass the address of a vector
+element as the second argument to hash_set_mem. It is perfectly fine to
+memorize constant string addresses in the text segment.
+
+Timekeeping
+-----------
+
+Vppinfra includes high-precision, low-cost timing services. The datatype
+clib_time_t and associated functions reside in ./src/vppinfra/time.[ch].
+Call clib_time_init (clib_time_t \*cp) to initialize the clib_time_t
+object.
+
+Clib_time_init(…) can use a variety of different ways to establish the
+hardware clock frequency. At the end of the day, vppinfra timekeeping
+takes the attitude that the operating system’s clock is the closest
+thing to a gold standard it has handy.
+
+When properly configured, NTP maintains kernel clock synchronization
+with a highly accurate off-premises reference clock. Notwithstanding
+network propagation delays, a synchronized NTP client will keep the
+kernel clock accurate to within 50ms or so.
+
+Why should one care? Simply put, oscillators used to generate CPU ticks
+aren’t super accurate. They work pretty well, but a 0.1% error wouldn’t
+be out of the question. That’s a minute and a half’s worth of error in 1
+day. The error changes constantly, due to temperature variation, and a
+host of other physical factors.
+
+It’s far too expensive to use system calls for timing, so we’re left
+with the problem of continuously adjusting our view of the CPU tick
+register’s clocks_per_second parameter.
+
+The clock rate adjustment algorithm measures the number of cpu ticks and
+the “gold standard” reference time across an interval of approximately
+16 seconds. We calculate clocks_per_second for the interval: use rdtsc
+(on x86_64) and a system call to get the latest cpu tick count and the
+kernel’s latest nanosecond timestamp. We subtract the previous interval
+end values, and use exponential smoothing to merge the new clock rate
+sample into the clocks_per_second parameter.
+
+As of this writing, we maintain the clock rate by way of the following
+first-order difference equation:
+
+.. code:: c
+
+ clocks_per_second(t) = clocks_per_second(t-1) * K + sample_cps(t)*(1-K)
+ where K = e**(-1.0/3.75);
+
+This yields a per observation “half-life” of 1 minute. Empirically, the
+clock rate converges within 5 minutes, and appears to maintain
+near-perfect agreement with the kernel clock in the face of ongoing NTP
+time adjustments.
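+
+The smoothing arithmetic can be checked with a small sketch. The names
+below are hypothetical and the constant is K = e**(-1.0/3.75) rounded to
+six digits, precomputed so that the example needs no math library; it is
+not the code in time.c.
+
+.. code:: c
+
+   /* K = e**(-1.0/3.75), rounded; precomputed for the sketch */
+   #define SMOOTHING_K 0.765928
+
+   /* Merge one clock-rate sample into the running estimate */
+   static double
+   smooth_clocks_per_second (double previous_cps, double sample_cps)
+   {
+     return previous_cps * SMOOTHING_K + sample_cps * (1.0 - SMOOTHING_K);
+   }
+
+   /* Count the samples needed for an initial error to shrink to
+      the given fraction of its starting value */
+   static int
+   samples_until_error_below (double fraction)
+   {
+     double remaining = 1.0;   /* error relative to its starting value */
+     int n = 0;
+
+     while (remaining > fraction)
+       {
+         remaining *= SMOOTHING_K;     /* each sample keeps K of the old error */
+         n++;
+       }
+     return n;
+   }
+
+Assuming one sample every 16 seconds, the error falls to 1/e of its
+starting value after 3.75 samples, or about 60 seconds. The sketch
+reports 18 samples (roughly 4.8 minutes) for the error to drop below 1%,
+which is consistent with the observed convergence within 5 minutes.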
+
+See ./src/vppinfra/time.c:clib_time_verify_frequency(…) to look at the
+rate adjustment algorithm. The code rejects frequency samples
+corresponding to the sort of adjustment which might occur if someone
+changes the gold standard kernel clock by several seconds.
+
+Monotonic timebase support
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Particularly during system initialization, the “gold standard” system
+reference clock can change by a large amount, in an instant. It’s not a
+best practice to yank the reference clock - in either direction - by
+hours or days. Nevertheless, some poorly-constructed use-cases do so.
+
+To deal with this reality, clib_time_now(…) returns the number of
+seconds since vpp started, *guaranteed to be monotonically increasing,
+no matter what happens to the system reference clock*.
+
+This is first-order important, to avoid breaking every active timer in
+the system. The vpp host stack alone may account for tens of millions of
+active timers. It’s utterly impractical to track down and fix timers, so
+we must deal with the issue at the timebase level.
+
+Here’s how it works. Prior to adjusting the clock rate, we collect the
+kernel reference clock and the cpu clock:
+
+.. code:: c
+
+ /* Ask the kernel and the CPU what time it is... */
+ now_reference = unix_time_now ();
+ now_clock = clib_cpu_time_now ();
+
+Compute changes for both clocks since the last rate adjustment, roughly
+15 seconds ago:
+
+.. code:: c
+
+ /* Compute change in the reference clock */
+ delta_reference = now_reference - c->last_verify_reference_time;
+
+ /* And change in the CPU clock */
+ delta_clock_in_seconds = (f64) (now_clock - c->last_verify_cpu_time) *
+ c->seconds_per_clock;
+
+Delta_reference is key. Almost 100% of the time, delta_reference and
+delta_clock_in_seconds are identical modulo one system-call time.
+However, NTP or a privileged user can yank the system reference time -
+in either direction - by an hour, a day, or a decade.
+
+As described above, clib_time_now(…) must return monotonically
+increasing answers to the question “how long has it been since vpp
+started, in seconds.” To do that, the clock rate adjustment algorithm
+begins by recomputing the initial reference time:
+
+.. code:: c
+
+ c->init_reference_time += (delta_reference - delta_clock_in_seconds);
+
+It’s easy to convince yourself that if the reference clock changes by
+15.000000 seconds and the cpu clock tick time changes by 15.000000
+seconds, the initial reference time won’t change.
+
+If, on the other hand, delta_reference is -86400.0 and delta clock is
+15.0 - reference time jumped backwards by exactly one day in a 15-second
+rate update interval - we add -86415.0 to the initial reference time.
+
+Given the corrected initial reference time, we recompute the total
+number of cpu ticks which have occurred since the corrected initial
+reference time, at the current clock tick rate:
+
+.. code:: c
+
+ c->total_cpu_time = (now_reference - c->init_reference_time)
+ * c->clocks_per_second;
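+
+The one-day backwards jump described above can be worked through with a
+small sketch. The structure below is a hypothetical, trimmed-down
+stand-in holding only two of the fields used in the snippets above, and
+the helper names are invented for the example.
+
+.. code:: c
+
+   /* Hypothetical stand-in for the clib_time_t fields used above */
+   typedef struct
+   {
+     double init_reference_time;
+     double last_verify_reference_time;
+   } sketch_time_t;
+
+   /* Apply the init_reference_time correction for one interval */
+   static void
+   sketch_adjust (sketch_time_t * c, double now_reference,
+                  double delta_clock_in_seconds)
+   {
+     double delta_reference = now_reference - c->last_verify_reference_time;
+
+     c->init_reference_time += (delta_reference - delta_clock_in_seconds);
+     c->last_verify_reference_time = now_reference;
+   }
+
+   /* Seconds since start, seen through the corrected initial time */
+   static double
+   sketch_elapsed (const sketch_time_t * c, double now_reference)
+   {
+     return now_reference - c->init_reference_time;
+   }
+
+Suppose 100 seconds had elapsed at the previous adjustment, and the
+reference clock then jumps back by 86400 seconds during a 15-second
+interval. The correction adds -86415.0 to the initial reference time,
+and the elapsed time computed afterwards is 115 seconds, as if the jump
+had never happened.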
+
+Timebase precision
+~~~~~~~~~~~~~~~~~~
+
+Cognoscenti may notice that vlib/clib_time_now(…) return a 64-bit
+floating-point value; the number of seconds since vpp started.
+
+Please see `this Wikipedia
+article <https://en.wikipedia.org/wiki/Double-precision_floating-point_format>`__
+for more information. C double-precision floating point numbers (called
+f64 in the vpp code base) have a 53-bit effective mantissa, and can
+accurately represent 15 decimal digits’ worth of precision.
+
+There are 315,360,000.000001 seconds in ten years plus one microsecond.
+That string has exactly 15 decimal digits. The vpp time base retains 1us
+precision for roughly 30 years.
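+
+A quick way to convince yourself is to ask whether adding one
+microsecond to a timestamp yields a distinguishable f64 value. The
+helper below is a hypothetical sketch using plain C double, which is
+what f64 maps to.
+
+.. code:: c
+
+   /* Does adding one microsecond to t produce a different f64 value? */
+   static int
+   resolves_one_microsecond (double t)
+   {
+     /* volatile forces the sum to be rounded to a 64-bit double */
+     volatile double later = t + 1e-6;
+
+     return later > t;
+   }
+
+At t = 315360000.0 (ten years in seconds) the helper returns 1. At
+t = 1e12, far beyond any realistic uptime, adjacent f64 values are more
+than 1e-6 apart and it returns 0.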
+
+vlib/clib_time_now do *not* provide precision in excess of 1e-6 seconds.
+If necessary, please use clib_cpu_time_now(…) for direct access to the
+CPU clock-cycle counter. Note that the number of CPU clock cycles per
+second varies significantly across CPU architectures.
+
+Timer Wheels
+------------
+
+Vppinfra includes configurable timer wheel support. See the source code
+in …/src/vppinfra/tw_timer_template.[ch], as well as a considerable
+number of template instances defined in …/src/vppinfra/tw_timer\_.[ch].
+
+Instantiation of tw_timer_template.h generates named structures to
+implement specific timer wheel geometries. Choices include: number of
+timer wheels (currently, 1 or 2), number of slots per ring (a power of
+two), and the number of timers per “object handle”.
+
+Internally, user object/timer handles are 32-bit integers, so if one
+selects 16 timers/object (4 bits), the resulting timer wheel handle is
+limited to 2**28 objects.
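+
+The bit budget can be sketched as follows. The macro and function names
+are hypothetical, and the layout (timer id in the top bits, object index
+in the remaining bits) is an assumption chosen to match the masks used
+in the expired-timer callback example in this section; consult
+tw_timer_template.h for the real encoding.
+
+.. code:: c
+
+   #include <stdint.h>
+
+   /* Assumed layout: timer id in the top bits, object index below */
+   #define LOG2_TIMERS_PER_OBJECT 4    /* 16 timers per object */
+   #define OBJECT_BITS (32 - LOG2_TIMERS_PER_OBJECT)
+   #define OBJECT_MASK ((1u << OBJECT_BITS) - 1)
+
+   /* Pack an object index and a timer id into one 32-bit handle */
+   static uint32_t
+   make_handle (uint32_t object_index, uint32_t timer_id)
+   {
+     return (timer_id << OBJECT_BITS) | (object_index & OBJECT_MASK);
+   }
+
+   /* Recover the object index from a handle */
+   static uint32_t
+   handle_object_index (uint32_t handle)
+   {
+     return handle & OBJECT_MASK;
+   }
+
+   /* Recover the timer id from a handle */
+   static uint32_t
+   handle_timer_id (uint32_t handle)
+   {
+     return handle >> OBJECT_BITS;
+   }
+
+With 4 bits reserved for the timer id, OBJECT_MASK is 0x0FFFFFFF, which
+leaves room for 2**28 distinct object indices.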
+
+Here are the specific settings required to generate a single 2048 slot
+wheel which supports 2 timers per object:
+
+.. code:: c
+
+ #define TW_TIMER_WHEELS 1
+ #define TW_SLOTS_PER_RING 2048
+ #define TW_RING_SHIFT 11
+ #define TW_RING_MASK (TW_SLOTS_PER_RING -1)
+ #define TW_TIMERS_PER_OBJECT 2
+ #define LOG2_TW_TIMERS_PER_OBJECT 1
+ #define TW_SUFFIX _2t_1w_2048sl
+ #define TW_FAST_WHEEL_BITMAP 0
+ #define TW_TIMER_ALLOW_DUPLICATE_STOP 0
+
+See tw_timer_2t_1w_2048sl.h for a complete example.
+
+tw_timer_template.h is not intended to be #included directly. Client
+code can include multiple timer geometry header files, although extreme
+caution would be required to use the TW and TWT macros in such a case.
+
+API usage examples
+~~~~~~~~~~~~~~~~~~
+
+The unit test code in …/src/vppinfra/test_tw_timer.c provides a concrete
+API usage example. It uses a synthetic clock to rapidly exercise the
+underlying tw_timer_expire_timers(…) template.
+
+There are not many API routines to call.
+
+Initialize a two-timer, single 2048-slot wheel w/ a 1-second timer granularity
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: c
+
+ tw_timer_wheel_init_2t_1w_2048sl (&tm->single_wheel,
+ expired_timer_single_callback,
+ 1.0 /* timer interval */);
+
+Start a timer
+^^^^^^^^^^^^^
+
+.. code:: c
+
+ handle = tw_timer_start_2t_1w_2048sl (&tm->single_wheel, elt_index,
+ timer_id /* 0 or 1 */,
+ expiration_time_in_u32_ticks);
+
+Stop a timer
+^^^^^^^^^^^^
+
+.. code:: c
+
+ tw_timer_stop_2t_1w_2048sl (&tm->single_wheel, handle);
+
+An expired timer callback
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: c
+
+ static void
+ expired_timer_single_callback (u32 * expired_timers)
+ {
+ int i;
+ u32 pool_index, timer_id;
+ tw_timer_test_elt_t *e;
+ tw_timer_test_main_t *tm = &tw_timer_test_main;
+
+ for (i = 0; i < vec_len (expired_timers); i++)
+ {
+ pool_index = expired_timers[i] & 0x7FFFFFFF;
+ timer_id = expired_timers[i] >> 31;
+
+ ASSERT (timer_id == 1);
+
+ e = pool_elt_at_index (tm->test_elts, pool_index);
+
+ if (e->expected_to_expire != tm->single_wheel.current_tick)
+ {
+ fformat (stdout, "[%d] expired at %d not %d\n",
+ e - tm->test_elts, tm->single_wheel.current_tick,
+ e->expected_to_expire);
+ }
+ pool_put (tm->test_elts, e);
+ }
+ }
+
+We use wheel timers extensively in the vpp host stack. Each TCP session
+needs 5 timers, so supporting 10 million flows requires up to 50 million
+concurrent timers.
+
+Timers rarely expire, so it’s of utmost importance that stopping and
+restarting a timer costs as few clock cycles as possible.
+
+Stopping a timer costs a doubly-linked list dequeue. Starting a timer
+involves modular arithmetic to determine the correct timer wheel and
+slot, and a list head enqueue.
+
+Expired timer processing generally involves bulk linked-list retirement
+with user callback presentation. There is some additional complexity at
+wheel-wrap time, when timers are relocated from slower-turning timer
+wheels into faster-turning wheels.
+
+Format
+------
+
+Vppinfra format is roughly equivalent to printf.
+
+Format has a few properties worth mentioning. Format’s first argument is
+a (u8 \*) vector to which it appends the result of the current format
+operation. Chaining calls is very easy:
+
+.. code:: c
+
+ u8 * result;
+
+ result = format (0, "junk = %d, ", junk);
+ result = format (result, "more junk = %d\n", more_junk);
+
+As previously noted, NULL pointers are perfectly proper 0-length
+vectors. Format returns a (u8 \*) vector, **not** a C-string. If you
+wish to print a (u8 \*) vector, use the “%v” format string. If you need
+a (u8 \*) vector which is also a proper C-string, either of these
+schemes may be used:
+
+.. code:: c
+
+ vec_add1 (result, 0);
+ /* or */
+ result = format (result, "<whatever>%c", 0);
+
+Remember to vec_free() the result if appropriate. Be careful not to pass
+format an uninitialized (u8 \*).
+
+Format implements a particularly handy user-format scheme via the “%U”
+format specification. For example:
+
+.. code:: c
+
+ u8 * format_junk (u8 * s, va_list *va)
+ {
+   /* The caller passes a C-string, so fetch it as a (char *) */
+   char *junk = va_arg (*va, char *);
+   s = format (s, "%s", junk);
+   return s;
+ }
+
+ result = format (0, "junk = %U", format_junk, "This is some junk");
+
+format_junk() can invoke other user-format functions if desired. The
+programmer shoulders responsibility for argument type-checking. It is
+typical for user format functions to blow up spectacularly if the
+va_arg(va, type) macros don’t match the caller’s idea of reality.
+
+Unformat
+--------
+
+Vppinfra unformat is vaguely related to scanf, but considerably more
+general.
+
+A typical use case involves initializing an unformat_input_t from either
+a C-string or a (u8 \*) vector, then parsing via unformat() as follows:
+
+.. code:: c
+
+ unformat_input_t input;
+ u8 *s = "<some-C-string>";
+
+ unformat_init_string (&input, (char *) s, strlen((char *) s));
+ /* or */
+ unformat_init_vector (&input, <u8-vector>);
+
+Then loop parsing individual elements:
+
+.. code:: c
+
+ while (unformat_check_input (&input) != UNFORMAT_END_OF_INPUT)
+ {
+ if (unformat (&input, "value1 %d", &value1))
+ ;/* unformat sets value1 */
+ else if (unformat (&input, "value2 %d", &value2))
+ ;/* unformat sets value2 */
+ else
+ return clib_error_return (0, "unknown input '%U'",
+ format_unformat_error, &input);
+ }
+
+As with format, unformat implements a user-unformat function capability
+via a “%U” user unformat function scheme. Generally, one can trivially
+transform ``format (s, "foo %d", foo)`` into
+``unformat (input, "foo %d", &foo)``.
+
+Unformat implements a couple of handy non-scanf-like format specifiers:
+
+.. code:: c
+
+ unformat (input, "enable %=", &enable, 1 /* defaults to 1 */);
+ unformat (input, "bitzero %|", &mask, (1<<0));
+ unformat (input, "bitone %|", &mask, (1<<1));
+ <etc>
+
+The phrase “enable %=” means “set the supplied variable to the default
+value” if unformat parses the “enable” keyword all by itself. If
+unformat parses “enable 123”, it sets the supplied variable to 123.
+
+We could clean up a number of hand-rolled “verbose” + “verbose %d”
+argument parsing codes using “%=”.
+
+The phrase “bitzero %\|” means “set the specified bit in the supplied
+bitmask” if unformat parses “bitzero”. Although it looks like it could
+be fairly handy, it’s very lightly used in the code base.
+
+``%_`` toggles whether or not to skip input white space.
+
+On a transition from skip to no-skip in the middle of a format string,
+unformat skips input white space. For example, the following:
+
+.. code:: c
+
+ fmt = "%_%d.%d%_->%_%d.%d%_";
+ unformat (input, fmt, &one, &two, &three, &four);
+
+matches input “1.2 -> 3.4”. Without this, the space after -> does not
+get skipped.
+
+
+How to parse a single input line
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Debug CLI command functions MUST NOT accidentally consume input
+belonging to other debug CLI commands. Otherwise, it's impossible to
+script a set of debug CLI commands which "work fine" when issued one
+at a time.
+
+This bit of code is NOT correct:
+
+.. code:: c
+
+ /* Eats script input NOT belonging to it, and chokes! */
+ while (unformat_check_input (input) != UNFORMAT_END_OF_INPUT)
+ {
+ if (unformat (input, ...))
+ ;
+ else if (unformat (input, ...))
+ ;
+ else
+ return clib_error_return (0, "parse error: '%U'",
+ format_unformat_error, input);
+ }
+
+When executed as part of a script, such a function will return “parse
+error: ‘’” every time, unless it happens to be the last command in the
+script.
+
+Instead, use “unformat_line_input” to consume the rest of a line’s worth
+of input - everything past the path specified in the VLIB_CLI_COMMAND
+declaration.
+
+For example, unformat_line_input with “my_command” set up as shown below
+and user input “my path is clear” will produce an unformat_input_t that
+contains “is clear”.
+
+.. code:: c
+
+ VLIB_CLI_COMMAND (...) = {
+ .path = "my path",
+ };
+
+Here’s a bit of code which shows the required mechanics, in full:
+
+.. code:: c
+
+ static clib_error_t *
+ my_command_fn (vlib_main_t * vm,
+ unformat_input_t * input,
+ vlib_cli_command_t * cmd)
+ {
+ unformat_input_t _line_input, *line_input = &_line_input;
+ u32 this, that;
+ clib_error_t *error = 0;
+
+ if (!unformat_user (input, unformat_line_input, line_input))
+ return 0;
+
+ /*
+ * Here, UNFORMAT_END_OF_INPUT is at the end of the line we consumed,
+ * not at the end of the script...
+ */
+ while (unformat_check_input (line_input) != UNFORMAT_END_OF_INPUT)
+ {
+ if (unformat (line_input, "this %u", &this))
+ ;
+ else if (unformat (line_input, "that %u", &that))
+ ;
+ else
+ {
+ error = clib_error_return (0, "parse error: '%U'",
+ format_unformat_error, line_input);
+ goto done;
+ }
+ }
+
+ <do something based on "this" and "that", etc>
+
+ done:
+ unformat_free (line_input);
+ return error;
+ }
+ VLIB_CLI_COMMAND (my_command, static) = {
+ .path = "my path",
+ .function = my_command_fn,
+ };
+
+Vppinfra errors and warnings
+----------------------------
+
+Many functions within the vpp dataplane have return-values of type
+clib_error_t \*. Clib_error_t’s are arbitrary strings with a bit of
+metadata [fatal, warning] and are easy to announce. Returning a NULL
+clib_error_t \* indicates “A-OK, no error.”
+
+Clib_warning(format-args) is a handy way to add debugging output; clib
+warnings prepend function:line info to unambiguously locate the message
+source. Clib_unix_warning() adds perror()-style Linux system-call
+information. In production images, clib_warnings result in syslog
+entries.
+
+Serialization
+-------------
+
+Vppinfra serialization support allows the programmer to easily serialize
+and unserialize complex data structures.
+
+The underlying primitive serialize/unserialize functions use network
+byte-order, so there are no structural issues serializing on a
+little-endian host and unserializing on a big-endian host.
diff --git a/docs/developer/corearchitecture/mem.rst b/docs/developer/corearchitecture/mem.rst
new file mode 120000
index 00000000000..0fc53eab68c
--- /dev/null
+++ b/docs/developer/corearchitecture/mem.rst
@@ -0,0 +1 @@
+../../../src/vpp/mem/mem.rst \ No newline at end of file
diff --git a/docs/developer/corearchitecture/multi_thread.rst b/docs/developer/corearchitecture/multi_thread.rst
new file mode 100644
index 00000000000..195a9b791fd
--- /dev/null
+++ b/docs/developer/corearchitecture/multi_thread.rst
@@ -0,0 +1,169 @@
+.. _vpp_multi_thread:
+
+Multi-threading in VPP
+======================
+
+Modes
+-----
+
+VPP can work in 2 different modes:
+
+- single-thread
+- multi-thread with worker threads
+
+Single-thread
+~~~~~~~~~~~~~
+
+In a single-thread mode there is one main thread which handles both
+packet processing and other management functions (Command-Line Interface
+(CLI), API, stats). This is the default setup. There is no special
+startup config needed.
+
+Multi-thread with Worker Threads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In this mode, the main thread handles management functions (debug CLI,
+API, stats collection) and one or more worker threads handle packet
+processing from input to output of the packet.
+
+Each worker thread polls input queues on a subset of interfaces.
+
+With RSS (Receive Side Scaling) enabled, multiple threads can service one
+physical interface (RSS function on NIC distributes traffic between
+different queues which are serviced by different worker threads).
+
+Thread placement
+----------------
+
+Thread placement is defined in the startup config under the cpu { … }
+section.
+
+The VPP platform can place threads automatically or manually. Automatic
+placement works in the following way:
+
+- if “skip-cores X” is defined, the first X cores will not be used
+- if “main-core X” is defined, the VPP main thread will be placed on core
+ X, otherwise the first available one will be used
+- if “workers N” is defined, vpp will allocate the first N available cores
+ and run worker threads on them
+- if “corelist-workers A,B1-Bn,C1-Cn” is defined, vpp will automatically
+ assign those CPU cores to worker threads
+
+Users can see the active thread placement by using the VPP debug CLI
+command ``show threads``:
+
+.. code-block:: console
+
+ vpd# show threads
+ ID Name Type LWP lcore Core Socket State
+ 0 vpe_main 59723 2 2 0 wait
+ 1 vpe_wk_0 workers 59755 4 4 0 running
+ 2 vpe_wk_1 workers 59756 5 5 0 running
+ 3 vpe_wk_2 workers 59757 6 0 1 running
+ 4 vpe_wk_3 workers 59758 7 1 1 running
+ 5 stats 59775
+ vpd#
+
+The sample output above shows the main thread running on lcore 2 (core 2
+on CPU socket 0) and the worker threads running on lcores 4-7.
+
+Sample Configurations
+---------------------
+
+By default, at start-up VPP uses
+configuration values from: ``/etc/vpp/startup.conf``
+
+The following sections describe some of the additional changes that can be made to this file.
+This file is initially populated from the files located in the following directory ``/vpp/vpp/conf/``
+
+Manual Placement
+~~~~~~~~~~~~~~~~
+
+Manual placement places the main thread on core 1, workers on cores
+4,5,20,21.
+
+.. code-block:: console
+
+ cpu {
+ main-core 1
+ corelist-workers 4-5,20-21
+ }
+
+Auto placement
+~~~~~~~~~~~~~~
+
+Auto placement is likely to place the main thread on core 1 and workers
+on cores 2,3,4.
+
+.. code-block:: console
+
+ cpu {
+ skip-cores 1
+ workers 3
+ }
+
+Buffer Memory Allocation
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The VPP platform is NUMA aware. It can allocate memory for buffers on
+different CPU sockets (NUMA nodes). The amount of memory allocated can
+be defined in the startup config for each CPU socket by using the
+socket-mem A[[,B],C] statement inside the dpdk { … } section.
+
+For example:
+
+.. code-block:: console
+
+ dpdk {
+ socket-mem 1024,1024
+ }
+
+The above configuration allocates 1GB of memory on NUMA#0 and 1GB on
+NUMA#1. Each worker thread uses buffers which are local to itself.
+
+Buffer memory is allocated from hugepages. VPP prefers 1G pages if they
+are available. If not, 2MB pages will be used.
+
+VPP takes care of mounting/unmounting hugepages file-system
+automatically so there is no need to do that manually.
+
+**NOTE**: If you are running the latest VPP release, there is no need to
+specify socket-mem manually. VPP will discover all NUMA nodes and
+allocate 512M on each by default. socket-mem is only needed if a
+larger number of mbufs is required (the default is 16384 per socket and can
+be changed with the num-mbufs startup config command).
+
+Interface Placement in Multi-thread Setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On startup, the VPP platform assigns interfaces (or interface, queue
+pairs if RSS is used) to different worker threads in round robin
+fashion.
+
+The following example shows debug CLI commands to show and change
+interface placement:
+
+.. code-block:: console
+
+ vpd# sh dpdk interface placement
+ Thread 1 (vpp_wk_0 at lcore 5):
+ TenGigabitEthernet2/0/0 queue 0
+ TenGigabitEthernet2/0/1 queue 0
+ Thread 2 (vpp_wk_1 at lcore 6):
+ TenGigabitEthernet2/0/0 queue 1
+ TenGigabitEthernet2/0/1 queue 1
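+
+The placement shown above is what a plain round-robin over (interface,
+queue) pairs would produce with two workers. The following standalone
+sketch illustrates the idea only. The helper name and the iteration
+order are assumptions, and it is not the VPP implementation:
+
+.. code:: c
+
+   #include <stdio.h>
+
+   #define N_INTERFACES 2
+   #define N_QUEUES 2
+   #define N_WORKERS 2
+
+   /* Hypothetical helper: worker for the n-th (interface, queue) pair. */
+   static unsigned
+   round_robin_worker (unsigned pair_number, unsigned n_workers)
+   {
+     return pair_number % n_workers;
+   }
+
+   int
+   main (void)
+   {
+     unsigned i, q, pair = 0;
+
+     /* Walk the pairs in (interface, queue) order, dealing them out. */
+     for (i = 0; i < N_INTERFACES; i++)
+       for (q = 0; q < N_QUEUES; q++, pair++)
+         printf ("interface %u queue %u -> worker %u\n", i, q,
+                 round_robin_worker (pair, N_WORKERS));
+     return 0;
+   }
+
+With these assumed sizes, queue 0 of both interfaces lands on worker 0
+and queue 1 of both interfaces lands on worker 1.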
+
+The following shows an example of moving TenGigabitEthernet2/0/1 queue 1
+processing to the first worker thread:
+
+.. code-block:: console
+
+ vpd# set interface placement TenGigabitEthernet2/0/1 queue 1 thread 1
+
+ vpp# sh dpdk interface placement
+ Thread 1 (vpp_wk_0 at lcore 5):
+ TenGigabitEthernet2/0/0 queue 0
+ TenGigabitEthernet2/0/1 queue 0
+ TenGigabitEthernet2/0/1 queue 1
+ Thread 2 (vpp_wk_1 at lcore 6):
+ TenGigabitEthernet2/0/0 queue 1
diff --git a/docs/developer/corearchitecture/multiarch/arbfns.rst b/docs/developer/corearchitecture/multiarch/arbfns.rst
new file mode 100644
index 00000000000..d469bd8a140
--- /dev/null
+++ b/docs/developer/corearchitecture/multiarch/arbfns.rst
@@ -0,0 +1,87 @@
+Multi-Architecture Arbitrary Function Cookbook
+==============================================
+
+Optimizing arbitrary functions for multiple architectures is simple
+enough, and very similar to the process used to produce multi-architecture
+graph node dispatch functions.
+
+As with multi-architecture graph nodes, we compile source files
+multiple times, generating multiple implementations of the original
+function, and a public selector function.
+
+Details
+-------
+
+Decorate function definitions with CLIB_MARCH_FN macros. For example:
+
+Change the original function prototype...
+
+::
+
+ u32 vlib_frame_alloc_to_node (vlib_main_t * vm, u32 to_node_index,
+ u32 frame_flags)
+
+...by recasting the function name and return type as the first two
+arguments to the CLIB_MARCH_FN macro:
+
+::
+
+ CLIB_MARCH_FN (vlib_frame_alloc_to_node, u32, vlib_main_t * vm,
+ u32 to_node_index, u32 frame_flags)
+
+In the actual vpp image, several versions of vlib_frame_alloc_to_node
+will appear: vlib_frame_alloc_to_node_avx2,
+vlib_frame_alloc_to_node_avx512, and so forth.
+
+
+For each multi-architecture function, use the CLIB_MARCH_FN_SELECT
+macro to help generate the one-and-only multi-architecture selector
+function:
+
+::
+
+ #ifndef CLIB_MARCH_VARIANT
+ u32
+ vlib_frame_alloc_to_node (vlib_main_t * vm, u32 to_node_index,
+ u32 frame_flags)
+ {
+ return CLIB_MARCH_FN_SELECT (vlib_frame_alloc_to_node)
+ (vm, to_node_index, frame_flags);
+ }
+ #endif /* CLIB_MARCH_VARIANT */
+
+Once bound, the multi-architecture selector function is about as
+expensive as an indirect function call; which is to say: not very
+expensive.
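+
+To make the "indirect function call" cost concrete, here is a standalone
+sketch of the general technique: a function pointer bound once by a
+constructor and then called through a public wrapper. It is not the
+CLIB_MARCH_FN_SELECT implementation. All names are hypothetical, the
+"avx2" variant is a stand-in with the same body, and the CPU probe relies
+on the GCC/clang builtins ``__builtin_cpu_init`` and
+``__builtin_cpu_supports`` on x86:
+
+.. code:: c
+
+   #include <stdio.h>
+
+   typedef unsigned int u32;
+
+   /* Baseline implementation, usable on any CPU. */
+   static u32
+   sum_u32_generic (const u32 *v, u32 n)
+   {
+     u32 i, sum = 0;
+     for (i = 0; i < n; i++)
+       sum += v[i];
+     return sum;
+   }
+
+   /* Stand-in for a copy compiled with wider vector instructions. */
+   static u32
+   sum_u32_avx2 (const u32 *v, u32 n)
+   {
+     return sum_u32_generic (v, n);
+   }
+
+   /* Bound once, before main() runs. */
+   static u32 (*sum_u32_selected) (const u32 *, u32) = sum_u32_generic;
+
+   static void __attribute__ ((constructor))
+   sum_u32_select (void)
+   {
+   #if defined(__x86_64__) || defined(__i386__)
+     __builtin_cpu_init ();
+     if (__builtin_cpu_supports ("avx2"))
+       sum_u32_selected = sum_u32_avx2;
+   #endif
+   }
+
+   /* Public selector: one indirect call per invocation. */
+   u32
+   sum_u32 (const u32 *v, u32 n)
+   {
+     return sum_u32_selected (v, n);
+   }
+
+   int
+   main (void)
+   {
+     u32 v[] = { 1, 2, 3, 4 };
+     printf ("%u\n", sum_u32 (v, 4));
+     return 0;
+   }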
+
+Modify CMakeLists.txt
+---------------------
+
+If the component in question already lists "MULTIARCH_SOURCES", simply
+add the indicated .c file to the list. Otherwise, add as shown
+below. Note that the added file "multiarch_code.c" should appear in
+*both* SOURCES and MULTIARCH_SOURCES:
+
+::
+
+ add_vpp_plugin(myplugin
+ SOURCES
+ multiarch_code.c
+ ...
+
+ MULTIARCH_SOURCES
+ multiarch_code.c
+ ...
+ )
+
+A Word to the Wise
+------------------
+
+A file which liberally mixes functions worth compiling for multiple
+architectures and functions which are not will end up full of
+#ifndef CLIB_MARCH_VARIANT conditionals. This won't do a thing to make
+the code look any better.
+
+Depending on requirements, it may make sense to move functions to
+(new) files to reduce complexity and/or improve legibility of the
+resulting code.
diff --git a/docs/developer/corearchitecture/multiarch/index.rst b/docs/developer/corearchitecture/multiarch/index.rst
new file mode 100644
index 00000000000..824a8e68438
--- /dev/null
+++ b/docs/developer/corearchitecture/multiarch/index.rst
@@ -0,0 +1,12 @@
+.. _multiarch:
+
+Multi-architecture support
+==========================
+
+This reference guide describes how to use the vpp multi-architecture support scheme.
+
+.. toctree::
+ :maxdepth: 1
+
+ nodefns
+ arbfns
diff --git a/docs/developer/corearchitecture/multiarch/nodefns.rst b/docs/developer/corearchitecture/multiarch/nodefns.rst
new file mode 100644
index 00000000000..9647e64f08c
--- /dev/null
+++ b/docs/developer/corearchitecture/multiarch/nodefns.rst
@@ -0,0 +1,138 @@
+Multi-Architecture Graph Node Cookbook
+======================================
+
+In the context of graph node dispatch functions, it's easy enough to
+use the vpp multi-architecture support setup. The point of the scheme
+is simple: for performance-critical nodes, generate multiple CPU
+hardware-dependent versions of the node dispatch functions, and pick
+the best one at runtime.
+
+The vpp scheme is simple enough to use, but details matter.
+
+100,000 foot view
+-----------------
+
+We compile entire graph node dispatch function implementation files
+multiple times. These compilations give rise to multiple versions of
+the graph node dispatch functions. Per-node constructor-functions
+interrogate CPU hardware, select the node dispatch function variant to
+use, and set the vlib_node_registration_t ".function" member to the
+address of the selected variant.
+
+Details
+-------
+
+Declare the node dispatch function as shown, using the VLIB\_NODE\_FN macro. The
+name of the node function **MUST** match the name of the graph node.
+
+::
+
+ VLIB_NODE_FN (ip4_sdp_node) (vlib_main_t * vm, vlib_node_runtime_t * node,
+ vlib_frame_t * frame)
+ {
+ if (PREDICT_FALSE (node->flags & VLIB_NODE_FLAG_TRACE))
+ return ip46_sdp_inline (vm, node, frame, 1 /* is_ip4 */ ,
+ 1 /* is_trace */ );
+ else
+ return ip46_sdp_inline (vm, node, frame, 1 /* is_ip4 */ ,
+ 0 /* is_trace */ );
+ }
+
+We need to generate *precisely one copy* of the
+vlib_node_registration_t, error strings, and packet trace decode function.
+
+Simply bracket these items with "#ifndef CLIB_MARCH_VARIANT...#endif":
+
+::
+
+ #ifndef CLIB_MARCH_VARIANT
+ static u8 *
+ format_sdp_trace (u8 * s, va_list * args)
+ {
+ <snip>
+ }
+ #endif
+
+ ...
+
+ #ifndef CLIB_MARCH_VARIANT
+ static char *sdp_error_strings[] = {
+ #define _(sym,string) string,
+ foreach_sdp_error
+ #undef _
+ };
+ #endif
+
+ ...
+
+ #ifndef CLIB_MARCH_VARIANT
+ VLIB_REGISTER_NODE (ip4_sdp_node) =
+ {
+ // DO NOT set the .function structure member.
+ // The multiarch selection __attribute__((constructor)) function
+ // takes care of it at runtime
+ .name = "ip4-sdp",
+ .vector_size = sizeof (u32),
+ .format_trace = format_sdp_trace,
+ .type = VLIB_NODE_TYPE_INTERNAL,
+
+ .n_errors = ARRAY_LEN(sdp_error_strings),
+ .error_strings = sdp_error_strings,
+
+ .n_next_nodes = SDP_N_NEXT,
+
+ /* edit / add dispositions here */
+ .next_nodes =
+ {
+ [SDP_NEXT_DROP] = "ip4-drop",
+ },
+ };
+ #endif
+
+To belabor the point: *do not* set the ".function" member! That's the job of the multi-arch
+selection \_\_attribute\_\_((constructor)) function.
+
+Always inline node dispatch functions
+-------------------------------------
+
+It's typical for a graph dispatch function to contain one or more
+calls to an inline function. See above. If your node dispatch function
+is structured that way, make *ABSOLUTELY CERTAIN* to use the
+"always_inline" macro:
+
+::
+
+ always_inline uword
+ ip46_sdp_inline (vlib_main_t * vm, vlib_node_runtime_t * node,
+ vlib_frame_t * frame,
+ int is_ip4, int is_trace)
+ { ... }
+
+Otherwise, the compiler is highly likely NOT to build multiple
+versions of the guts of your dispatch function.
+
+It's fairly easy to spot this mistake in "perf top." If you see, for
+example, a bunch of functions with names of the form
+"xxx_node_fn_avx2" in the profile, *BUT* your brand-new node function
+shows up with a name of the form "xxx_inline.isra.1", it's quite likely
+that the inline was declared "static inline" instead of "always_inline".
+
+Modify CMakeLists.txt
+---------------------
+
+If the component in question already lists "MULTIARCH_SOURCES", simply
+add the indicated .c file to the list. Otherwise, add as shown
+below. Note that the added file "new_multiarch_node.c" should appear in
+*both* SOURCES and MULTIARCH_SOURCES:
+
+::
+
+ add_vpp_plugin(myplugin
+ SOURCES
+ new_multiarch_node.c
+ ...
+
+ MULTIARCH_SOURCES
+ new_multiarch_node.c
+ ...
+ )
diff --git a/docs/developer/corearchitecture/softwarearchitecture.rst b/docs/developer/corearchitecture/softwarearchitecture.rst
new file mode 100644
index 00000000000..7f8a0e04645
--- /dev/null
+++ b/docs/developer/corearchitecture/softwarearchitecture.rst
@@ -0,0 +1,47 @@
+Software Architecture
+=====================
+
+The fd.io vpp implementation is a third-generation vector packet
+processing implementation specifically related to US Patent 7,961,636,
+as well as earlier work. Note that the Apache-2 license specifically
+grants non-exclusive patent licenses; we mention this patent as a point
+of historical interest.
+
+For performance, the vpp dataplane consists of a directed graph of
+forwarding nodes which process multiple packets per invocation. This
+schema enables a variety of micro-processor optimizations: pipelining
+and prefetching to cover dependent read latency, inherent I-cache phase
+behavior, vector instructions. Aside from hardware input and hardware
+output nodes, the entire forwarding graph is portable code.
+
+Depending on the scenario at hand, we often spin up multiple worker
+threads which process ingress-hashed packets from multiple queues using
+identical forwarding graph replicas.
+
+VPP Layers - Implementation Taxonomy
+------------------------------------
+
+.. figure:: /_images/VPP_Layering.png
+ :alt: image
+
+ image
+
+- VPP Infra - the VPP infrastructure layer, which contains the core
+ library source code. This layer performs memory functions, works with
+ vectors and rings, performs key lookups in hash tables, and works
+ with timers for dispatching graph nodes.
+- VLIB - the vector processing library. The vlib layer also handles
+ various application management functions: buffer, memory and graph
+ node management, maintaining and exporting counters, thread
+ management, packet tracing. Vlib implements the debug CLI (command
+ line interface).
+- VNET - works with VPP's networking interface (layers 2, 3, and 4),
+ performs session and traffic management, and works with devices and
+ the data control plane.
+- Plugins - Contains an increasingly rich set of data-plane plugins, as
+ noted in the above diagram.
+- VPP - the container application linked against all of the above.
+
+It’s important to understand each of these layers in a certain amount of
+detail. Much of the implementation is best dealt with at the API level
+and otherwise left alone.
diff --git a/docs/developer/corearchitecture/vlib.rst b/docs/developer/corearchitecture/vlib.rst
new file mode 100644
index 00000000000..f542d33ebb8
--- /dev/null
+++ b/docs/developer/corearchitecture/vlib.rst
@@ -0,0 +1,888 @@
+VLIB (Vector Processing Library)
+================================
+
+The files associated with vlib are located in the ./src/{vlib, vlibapi,
+vlibmemory} folders. These libraries provide vector processing support
+including graph-node scheduling, reliable multicast support,
+ultra-lightweight cooperative multi-tasking threads, a CLI, plug in .DLL
+support, physical memory and Linux epoll support. Parts of this library
+embody US Patent 7,961,636.
+
+Init function discovery
+-----------------------
+
+vlib applications register for various [initialization] events by
+placing structures and \__attribute__((constructor)) functions into the
+image. At appropriate times, the vlib framework walks
+constructor-generated singly-linked structure lists, performs a
+topological sort based on specified constraints, and calls the indicated
+functions. Vlib applications create graph nodes, add CLI functions,
+start cooperative multi-tasking threads, etc. etc. using this mechanism.
+
+vlib applications invariably include a number of VLIB_INIT_FUNCTION
+(my_init_function) macros.
+
+Each init / configure / etc. function has the return type clib_error_t
+\*. Make sure that the function returns 0 if all is well, otherwise the
+framework will announce an error and exit.
+
+vlib applications must link against vppinfra, and often link against
+other libraries such as VNET. In the latter case, it may be necessary to
+explicitly reference symbol(s); otherwise, large portions of the library
+may be AWOL at runtime.
+
+Init function construction and constraint specification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It’s easy to add an init function:
+
+.. code:: c
+
+ static clib_error_t *my_init_function (vlib_main_t *vm)
+ {
+ /* ... initialize things ... */
+
+ return 0; // or return clib_error_return (0, "BROKEN!");
+ }
+ VLIB_INIT_FUNCTION(my_init_function);
+
+As given, my_init_function will be executed “at some point,” but with no
+ordering guarantees.
+
+Specifying ordering constraints is easy:
+
+.. code:: c
+
+ VLIB_INIT_FUNCTION(my_init_function) =
+ {
+ .runs_before = VLIB_INITS("we_run_before_function_1",
+ "we_run_before_function_2"),
+ .runs_after = VLIB_INITS("we_run_after_function_1",
+ "we_run_after_function_2"),
+ };
+
+It’s also easy to specify bulk ordering constraints of the form “a then
+b then c then d”:
+
+.. code:: c
+
+ VLIB_INIT_FUNCTION(my_init_function) =
+ {
+ .init_order = VLIB_INITS("a", "b", "c", "d"),
+ };
+
+It’s OK to specify all three sorts of ordering constraints for a single
+init function, although it’s hard to imagine why it would be necessary.
+
+Node Graph Initialization
+-------------------------
+
+vlib packet-processing applications invariably define a set of graph
+nodes to process packets.
+
+One constructs a vlib_node_registration_t, most often via the
+VLIB_REGISTER_NODE macro. At runtime, the framework processes the set of
+such registrations into a directed graph. It is easy enough to add nodes
+to the graph at runtime. The framework does not support removing nodes.
+
+vlib provides several types of vector-processing graph nodes, primarily
+to control framework dispatch behaviors. The type member of the
+vlib_node_registration_t functions as follows:
+
+- VLIB_NODE_TYPE_PRE_INPUT - run before all other node types
+- VLIB_NODE_TYPE_INPUT - run as often as possible, after pre_input
+ nodes
+- VLIB_NODE_TYPE_INTERNAL - only when explicitly made runnable by
+ adding pending frames for processing
+- VLIB_NODE_TYPE_PROCESS - only when explicitly made runnable.
+ “Process” nodes are actually cooperative multi-tasking threads. They
+ **must** explicitly suspend after a reasonably short period of time.
+
+For a precise understanding of the graph node dispatcher, please read
+./src/vlib/main.c:vlib_main_loop.
+
+Graph node dispatcher
+---------------------
+
+Vlib_main_loop() dispatches graph nodes. The basic vector processing
+algorithm is diabolically simple, but may not be obvious from even a
+long stare at the code. Here’s how it works: some input node, or set of
+input nodes, produce a vector of work to process. The graph node
+dispatcher pushes the work vector through the directed graph,
+subdividing it as needed, until the original work vector has been
+completely processed. At that point, the process recurs.
+
+This scheme yields a stable equilibrium in frame size, by construction.
+Here’s why: as the frame size increases, the per-frame-element
+processing time decreases. There are several related forces at work; the
+simplest to describe is the effect of vector processing on the CPU L1
+I-cache. The first frame element [packet] processed by a given node
+warms up the node dispatch function in the L1 I-cache. All subsequent
+frame elements profit. As we increase the number of frame elements, the
+cost per element goes down.
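+
+One simplified way to picture this, offered as an illustrative model
+rather than a measurement, is to assume that dispatching a node costs a
+fixed overhead (I-cache warm-up and dispatch bookkeeping) plus a constant
+cost per element. The symbols below are assumptions introduced for the
+sketch. For a frame of :math:`n` elements, the per-element cost is then:
+
+.. math::
+
+   t_{\text{elt}}(n) = \frac{C_{\text{fixed}}}{n} + c
+
+Under this model the per-element cost falls toward :math:`c` as
+:math:`n` grows, so a larger frame is processed more efficiently per
+packet.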
+
+Under light load, it is a crazy waste of CPU cycles to run the graph
+node dispatcher flat-out. So, the graph node dispatcher arranges to wait
+for work by sitting in a timed epoll wait if the prevailing frame size
+is low. The scheme has a certain amount of hysteresis to avoid
+constantly toggling back and forth between interrupt and polling mode.
+Although the graph dispatcher supports interrupt and polling modes, our
+current default device drivers do not.
+
+The graph node scheduler uses a hierarchical timer wheel to reschedule
+process nodes upon timer expiration.
+
+Graph dispatcher internals
+--------------------------
+
+This section may be safely skipped. It’s not necessary to understand
+graph dispatcher internals to create graph nodes.
+
+Vector Data Structure
+---------------------
+
+In vpp / vlib, we represent vectors as instances of the vlib_frame_t
+type:
+
+.. code:: c
+
+ typedef struct vlib_frame_t
+ {
+ /* Frame flags. */
+ u16 flags;
+
+ /* Number of scalar bytes in arguments. */
+ u8 scalar_size;
+
+ /* Number of bytes per vector argument. */
+ u8 vector_size;
+
+ /* Number of vector elements currently in frame. */
+ u16 n_vectors;
+
+ /* Scalar and vector arguments to next node. */
+ u8 arguments[0];
+ } vlib_frame_t;
+
+Note that one *could* construct all kinds of vectors - including vectors
+with some associated scalar data - using this structure. In the vpp
+application, vectors typically use a 4-byte vector element size, and
+zero bytes’ worth of associated per-frame scalar data.
+
+Frames are always allocated on CLIB_CACHE_LINE_BYTES boundaries. Frames
+have u32 indices which make use of the alignment property, so the
+maximum feasible main heap offset of a frame is CLIB_CACHE_LINE_BYTES \*
+0xFFFFFFFF. With 64-byte cache lines, that is 64 \* 4G = 256 Gbytes.
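+
+The arithmetic can be checked with a standalone sketch. The helper names
+are hypothetical and the 64-byte cache line is an assumption; the real
+value of CLIB_CACHE_LINE_BYTES depends on the build target:
+
+.. code:: c
+
+   #include <stdint.h>
+   #include <stdio.h>
+
+   #define CACHE_LINE_BYTES 64 /* assumed CLIB_CACHE_LINE_BYTES */
+
+   /* A cache-line-aligned offset is recoverable from a 32-bit index. */
+   static uint64_t
+   frame_offset_from_index (uint32_t frame_index)
+   {
+     return (uint64_t) frame_index * CACHE_LINE_BYTES;
+   }
+
+   static uint32_t
+   frame_index_from_offset (uint64_t offset)
+   {
+     return (uint32_t) (offset / CACHE_LINE_BYTES);
+   }
+
+   int
+   main (void)
+   {
+     uint64_t max_offset = frame_offset_from_index (0xFFFFFFFFu);
+
+     printf ("max frame offset: %llu bytes\n",
+             (unsigned long long) max_offset);
+     printf ("index of offset 128: %u\n", frame_index_from_offset (128));
+     return 0;
+   }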
+
+Scheduling Vectors
+------------------
+
+As you can see, vectors are not directly associated with graph nodes. We
+represent that association in a couple of ways. The simplest is the
+vlib_pending_frame_t:
+
+.. code:: c
+
+ /* A frame pending dispatch by main loop. */
+ typedef struct
+ {
+ /* Node and runtime for this frame. */
+ u32 node_runtime_index;
+
+ /* Frame index (in the heap). */
+ u32 frame_index;
+
+ /* Start of next frames for this node. */
+ u32 next_frame_index;
+
+ /* Special value for next_frame_index when there is no next frame. */
+ #define VLIB_PENDING_FRAME_NO_NEXT_FRAME ((u32) ~0)
+ } vlib_pending_frame_t;
+
+Here is the code in …/src/vlib/main.c:vlib_main_or_worker_loop() which
+processes frames:
+
+.. code:: c
+
+ /*
+ * Input nodes may have added work to the pending vector.
+ * Process pending vector until there is nothing left.
+ * All pending vectors will be processed from input -> output.
+ */
+ for (i = 0; i < _vec_len (nm->pending_frames); i++)
+ cpu_time_now = dispatch_pending_node (vm, i, cpu_time_now);
+ /* Reset pending vector for next iteration. */
+
+The pending frame node_runtime_index associates the frame with the node
+which will process it.
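+
+One detail of the loop above is easy to miss: the pending vector length
+is re-read on every iteration, so frames added while dispatching are
+picked up in the same pass. The following standalone model illustrates
+that property only. All ``sketch_`` names are hypothetical, the fixed
+array stands in for a real vector, and the toy dispatch function simply
+forwards each frame to the next node.
+
+.. code:: c
+
+   #include <stddef.h>
+   #include <stdint.h>
+
+   #define SKETCH_MAX_PENDING 16
+   #define SKETCH_LAST_NODE 3
+
+   typedef struct
+   {
+     uint32_t node_runtime_index;
+     uint32_t frame_index;
+   } sketch_pending_frame_t;
+
+   typedef struct
+   {
+     sketch_pending_frame_t pending[SKETCH_MAX_PENDING];
+     size_t n_pending;
+     uint32_t n_dispatched;
+   } sketch_node_main_t;
+
+   /* Append a (node, frame) pair; returns 0 on success, -1 if full. */
+   static int
+   sketch_add_pending (sketch_node_main_t *nm, uint32_t node, uint32_t frame)
+   {
+     if (nm->n_pending >= SKETCH_MAX_PENDING)
+       return -1;
+     nm->pending[nm->n_pending].node_runtime_index = node;
+     nm->pending[nm->n_pending].frame_index = frame;
+     nm->n_pending++;
+     return 0;
+   }
+
+   /* Toy dispatch: node k hands its frame to node k + 1. */
+   static void
+   sketch_dispatch_pending (sketch_node_main_t *nm, size_t i)
+   {
+     sketch_pending_frame_t p = nm->pending[i];
+
+     nm->n_dispatched++;
+     if (p.node_runtime_index < SKETCH_LAST_NODE)
+       sketch_add_pending (nm, p.node_runtime_index + 1, p.frame_index);
+   }
+
+   /* Drain the pending vector, including entries added along the way. */
+   static uint32_t
+   sketch_run_pending (sketch_node_main_t *nm)
+   {
+     for (size_t i = 0; i < nm->n_pending; i++)
+       sketch_dispatch_pending (nm, i);
+     nm->n_pending = 0; /* reset for the next main loop iteration */
+     return nm->n_dispatched;
+   }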
+
+Complications
+-------------
+
+Fasten your seatbelt. Here’s where the story - and the data structures -
+become quite complicated…
+
+At 100,000 feet: vpp uses a directed graph, not a directed *acyclic*
+graph. It’s really quite normal for a packet to visit ip[46]-lookup
+multiple times. The worst-case: a graph node which enqueues packets to
+itself.
+
+To deal with this issue, the graph dispatcher must force allocation of a
+new frame if the current graph node’s dispatch function happens to
+enqueue a packet back to itself.
+
+There are no guarantees that a pending frame will be processed
+immediately, which means that more packets may be added to the
+underlying vlib_frame_t after it has been attached to a
+vlib_pending_frame_t. Care must be taken to allocate new frames and
+pending frames if a (pending_frame, frame) pair fills.
+
+Next frames, next frame ownership
+---------------------------------
+
+The vlib_next_frame_t is the last key graph dispatcher data structure:
+
+.. code:: c
+
+ typedef struct
+ {
+ /* Frame index. */
+ u32 frame_index;
+
+ /* Node runtime for this next. */
+ u32 node_runtime_index;
+
+ /* Next frame flags. */
+ u32 flags;
+
+ /* Reflects node frame-used flag for this next. */
+ #define VLIB_FRAME_NO_FREE_AFTER_DISPATCH \
+ VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH
+
+ /* This next frame owns enqueue to node
+ corresponding to node_runtime_index. */
+ #define VLIB_FRAME_OWNER (1 << 15)
+
+ /* Set when frame has been allocated for this next. */
+ #define VLIB_FRAME_IS_ALLOCATED VLIB_NODE_FLAG_IS_OUTPUT
+
+ /* Set when frame has been added to pending vector. */
+ #define VLIB_FRAME_PENDING VLIB_NODE_FLAG_IS_DROP
+
+ /* Set when frame is to be freed after dispatch. */
+ #define VLIB_FRAME_FREE_AFTER_DISPATCH VLIB_NODE_FLAG_IS_PUNT
+
+ /* Set when frame has traced packets. */
+ #define VLIB_FRAME_TRACE VLIB_NODE_FLAG_TRACE
+
+ /* Number of vectors enqueue to this next since last overflow. */
+ u32 vectors_since_last_overflow;
+ } vlib_next_frame_t;
+
+Graph node dispatch functions call vlib_get_next_frame (…) to set “(u32
+\*)to_next” to the right place in the vlib_frame_t corresponding to the
+ith arc (aka next0) from the current node to the indicated next node.
+
+After some scuffling around - two levels of macros - processing reaches
+vlib_get_next_frame_internal (…). Get-next-frame-internal digs up the
+vlib_next_frame_t corresponding to the desired graph arc.
+
+The next frame data structure amounts to a graph-arc-centric frame
+cache. Once a node finishes adding elements to a frame, it will acquire a
+vlib_pending_frame_t and end up on the graph dispatcher’s run-queue. But
+there’s no guarantee that more vector elements won’t be added to the
+underlying frame from the same (source_node, next_index) arc or from a
+different (source_node, next_index) arc.
+
+Maintaining consistency of the arc-to-frame cache is necessary. The
+first step in maintaining consistency is to make sure that only one
+graph node at a time thinks it “owns” the target vlib_frame_t.
+
+Back to the graph node dispatch function. In the usual case, a certain
+number of packets will be added to the vlib_frame_t acquired by calling
+vlib_get_next_frame (…).
+
+Before a dispatch function returns, it’s required to call
+vlib_put_next_frame (…) for all of the graph arcs it actually used. This
+action adds a vlib_pending_frame_t to the graph dispatcher’s pending
+frame vector.
+
+Vlib_put_next_frame makes a note in the pending frame of the frame
+index, and also of the vlib_next_frame_t index.
+
+dispatch_pending_node actions
+-----------------------------
+
+The main graph dispatch loop calls dispatch_pending_node as shown above.
+
+Dispatch_pending_node recovers the pending frame, and the graph node
+runtime / dispatch function. Further, it recovers the next_frame
+currently associated with the vlib_frame_t, and detaches the
+vlib_frame_t from the next_frame.
+
+In …/src/vlib/main.c:dispatch_pending_node(…), note this stanza:
+
+.. code:: c
+
+ /* Force allocation of new frame while current frame is being
+ dispatched. */
+ restore_frame_index = ~0;
+ if (nf->frame_index == p->frame_index)
+ {
+ nf->frame_index = ~0;
+ nf->flags &= ~VLIB_FRAME_IS_ALLOCATED;
+ if (!(n->flags & VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH))
+ restore_frame_index = p->frame_index;
+ }
+
+dispatch_pending_node is worth a hard stare due to the several
+second-order optimizations it implements. Almost as an afterthought, it
+calls dispatch_node which actually calls the graph node dispatch
+function.
+
+Process / thread model
+----------------------
+
+vlib provides an ultra-lightweight cooperative multi-tasking thread
+model. The graph node scheduler invokes these processes in much the same
+way as traditional vector-processing run-to-completion graph nodes;
+plus-or-minus a setjmp/longjmp pair required to switch stacks. Simply
+set the vlib_node_registration_t type field to VLIB_NODE_TYPE_PROCESS.
+Yes, process is a misnomer. These are cooperative multi-tasking threads.
+
+As of this writing, the default stack size is 1<<15 = 32kb. Initialize
+the node registration’s process_log2_n_stack_bytes member as needed. The
+graph node dispatcher makes some effort to detect stack overrun, e.g. by
+mapping a no-access page below each thread stack.
+
+Process node dispatch functions are expected to be “while(1) { }” loops
+which suspend when not otherwise occupied, and which must not run for
+unreasonably long periods of time.
+
+“Unreasonably long” is an application-dependent concept. Over the years,
+we have constructed frame-size sensitive control-plane nodes which will
+use a much higher fraction of the available CPU bandwidth when the frame
+size is low. The classic example: modifying forwarding tables. So long
+as the table-builder leaves the forwarding tables in a valid state, one
+can suspend the table builder to avoid dropping packets as a result of
+control-plane activity.
+
+Process nodes can suspend for fixed amounts of time, or until another
+entity signals an event, or both. See the next section for a description
+of the vlib process event mechanism.
+
+When running in vlib process context, one must pay strict attention to
+loop invariant issues. If one walks a data structure and calls a
+function which may suspend, one had best know by construction that it
+cannot change. Often, it’s best to simply make a snapshot copy of a data
+structure, walk the copy at leisure, then free the copy.
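+
+The snapshot approach can be sketched in plain C as follows. This is an
+illustration only: vlib code would normally use vppinfra vectors rather
+than malloc, and every ``sketch_`` name below is hypothetical. The
+visitor stands in for any function which might suspend.
+
+.. code:: c
+
+   #include <stddef.h>
+   #include <stdint.h>
+   #include <stdlib.h>
+   #include <string.h>
+
+   /* Hypothetical visitor type; the real one might suspend. */
+   typedef void (sketch_visit_fn_t) (uint32_t item, void *arg);
+
+   /* Walk a private copy so changes to the table cannot affect the walk. */
+   static int
+   sketch_walk_snapshot (const uint32_t *table, size_t n_items,
+                         sketch_visit_fn_t *visit, void *arg)
+   {
+     uint32_t *copy;
+     size_t i;
+
+     if (n_items == 0)
+       return 0;
+
+     copy = malloc (n_items * sizeof (*copy));
+     if (copy == NULL)
+       return -1;
+     memcpy (copy, table, n_items * sizeof (*copy));
+
+     for (i = 0; i < n_items; i++)
+       visit (copy[i], arg);
+
+     free (copy);
+     return 0;
+   }
+
+   /* Example visitor: adds each item to a running total. */
+   static void
+   sketch_sum_visit (uint32_t item, void *arg)
+   {
+     *(uint64_t *) arg += item;
+   }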
+
+Process events
+--------------
+
+The vlib process event mechanism API is extremely lightweight and easy
+to use. Here is a typical example:
+
+.. code:: c
+
+ vlib_main_t *vm = &vlib_global_main;
+ uword event_type, * event_data = 0;
+
+ while (1)
+ {
+ vlib_process_wait_for_event_or_clock (vm, 5.0 /* seconds */);
+
+ event_type = vlib_process_get_events (vm, &event_data);
+
+ switch (event_type) {
+ case EVENT1:
+ handle_event1s (event_data);
+ break;
+
+ case EVENT2:
+ handle_event2s (event_data);
+ break;
+
+ case ~0: /* 5-second idle/periodic */
+ handle_idle ();
+ break;
+
+ default: /* bug! */
+ ASSERT (0);
+ }
+
+ vec_reset_length(event_data);
+ }
+
+In this example, the VLIB process node waits for an event to occur, or
+for 5 seconds to elapse. The code demuxes on the event type, calling the
+appropriate handler function. Each call to vlib_process_get_events
+returns a vector of per-event-type data passed to successive
+vlib_process_signal_event calls; it is a serious error to process only
+event_data[0].
+
+Resetting the event_data vector-length to 0 [instead of calling
+vec_free] means that the event scheme doesn’t burn cycles continuously
+allocating and freeing the event data vector. This is a common vppinfra
+/ vlib coding pattern, well worth using when appropriate.
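+
+For readers unfamiliar with the pattern, here is a minimal standalone
+model of why resetting the length is cheaper than freeing. It does not
+use vppinfra; the ``sketch_vec_`` type and functions are hypothetical
+stand-ins for the real vector macros.
+
+.. code:: c
+
+   #include <stddef.h>
+   #include <stdint.h>
+   #include <stdlib.h>
+
+   typedef struct
+   {
+     uint64_t *data;
+     size_t len; /* elements in use */
+     size_t cap; /* elements allocated */
+   } sketch_vec_t;
+
+   /* Append one element, growing the allocation only when it is full. */
+   static int
+   sketch_vec_add (sketch_vec_t *v, uint64_t x)
+   {
+     if (v->len == v->cap)
+       {
+         size_t new_cap = v->cap ? 2 * v->cap : 4;
+         uint64_t *p = realloc (v->data, new_cap * sizeof (*p));
+         if (p == NULL)
+           return -1;
+         v->data = p;
+         v->cap = new_cap;
+       }
+     v->data[v->len++] = x;
+     return 0;
+   }
+
+   /* Forget the contents but keep the memory for the next round. */
+   static void
+   sketch_vec_reset_length (sketch_vec_t *v)
+   {
+     v->len = 0;
+   }
+
+   /* Release the memory once the vector is no longer needed. */
+   static void
+   sketch_vec_free (sketch_vec_t *v)
+   {
+     free (v->data);
+     v->data = NULL;
+     v->len = v->cap = 0;
+   }
+
+After a reset, later additions reuse the existing allocation until the
+vector needs to grow beyond its previous high-water mark.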
+
+Signaling an event is easy, for example:
+
+.. code:: c
+
+ vlib_process_signal_event (vm, process_node_index, EVENT1,
+ (uword)arbitrary_event1_data); /* and so forth */
+
+One can either know the process node index by construction - dig it out
+of the appropriate vlib_node_registration_t - or by finding the
+vlib_node_t with vlib_get_node_by_name(…).
+
+Buffers
+-------
+
+vlib buffering solves the usual set of packet-processing problems,
+albeit at high performance. Key in terms of performance: one ordinarily
+allocates / frees N buffers at a time rather than one at a time. Except
+when operating directly on a specific buffer, one deals with buffers by
+index, not by pointer.
+
+Packet-processing frames are u32[] arrays, not vlib_buffer_t[] arrays.
+
+Packets comprise one or more vlib buffers, chained together as required.
+Multiple particle sizes are supported; hardware input nodes simply ask
+for the required size(s). Coalescing support is available. For obvious
+reasons one is discouraged from writing one’s own wild and wacky buffer
+chain traversal code.
+
+vlib buffer headers are allocated immediately prior to the buffer data
+area. In typical packet processing this saves a dependent read wait:
+given a buffer’s address, one can prefetch the buffer header [metadata]
+at the same time as the first cache line of buffer data.
+
+Buffer header metadata (vlib_buffer_t) includes the usual rewrite
+expansion space, a current_data offset, RX and TX interface indices,
+packet trace information, and opaque areas.
+
+The opaque data is intended to control packet processing in arbitrary
+subgraph-dependent ways. The programmer shoulders responsibility for
+data lifetime analysis, type-checking, etc.
+
+Buffers have reference-counts in support of e.g. multicast replication.
+
+Shared-memory message API
+-------------------------
+
+Local control-plane and application processes interact with the vpp
+dataplane via asynchronous message-passing in shared memory over
+unidirectional queues. The same application APIs are available via
+sockets.
+
+Capturing API traces and replaying them in a simulation environment
+requires a disciplined approach to the problem. This seems like a
+make-work task, but it is not. When something goes wrong in the
+control-plane after 300,000 or 3,000,000 operations, high-speed replay
+of the events leading up to the accident is a huge win.
+
+The shared-memory message API message allocator vl_msg_api_alloc uses a
+particularly cute trick. Since messages are processed in order, we try
+to allocate message buffering from a set of fixed-size, preallocated
+rings. Each ring item has a “busy” bit. Freeing one of the preallocated
+message buffers merely requires the message consumer to clear the busy
+bit. No locking required.
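+
+To make the busy-bit trick concrete, here is a single-threaded sketch of
+one such ring. It is a simplified model, not the vpp allocator: the
+``sketch_`` names, ring size and item size are assumptions. A
+shared-memory implementation would also need appropriate memory
+barriers and a fallback when the next ring slot is still busy; neither
+is shown.
+
+.. code:: c
+
+   #include <stddef.h>
+   #include <stdint.h>
+
+   #define SKETCH_RING_ITEMS 4
+   #define SKETCH_MSG_BYTES 64
+
+   typedef struct
+   {
+     volatile uint8_t busy; /* set by the allocator, cleared by the consumer */
+     uint8_t data[SKETCH_MSG_BYTES];
+   } sketch_ring_item_t;
+
+   typedef struct
+   {
+     sketch_ring_item_t items[SKETCH_RING_ITEMS];
+     uint32_t next; /* touched only by the allocating side */
+   } sketch_ring_t;
+
+   /* Hand out the next slot in ring order, or NULL if it is still busy. */
+   static void *
+   sketch_ring_alloc (sketch_ring_t *r)
+   {
+     sketch_ring_item_t *it = &r->items[r->next];
+
+     if (it->busy)
+       return NULL;
+     it->busy = 1;
+     r->next = (r->next + 1) % SKETCH_RING_ITEMS;
+     return it->data;
+   }
+
+   /* Consumer side: freeing is just clearing the busy bit. */
+   static void
+   sketch_ring_free (void *msg)
+   {
+     sketch_ring_item_t *it = (sketch_ring_item_t *)
+       ((uint8_t *) msg - offsetof (sketch_ring_item_t, data));
+
+     it->busy = 0;
+   }
+
+Because messages are consumed in order, the slot the allocator wants
+next is normally the one freed longest ago, which is why a simple
+in-order scan is enough.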
+
+Debug CLI
+---------
+
+Adding debug CLI commands to VLIB applications is very simple.
+
+Here is a complete example:
+
+.. code:: c
+
+ static clib_error_t *
+ show_ip_tuple_match (vlib_main_t * vm,
+ unformat_input_t * input,
+ vlib_cli_command_t * cmd)
+ {
+ vlib_cli_output (vm, "%U\n", format_ip_tuple_match_tables, &routing_main);
+ return 0;
+ }
+
+ static VLIB_CLI_COMMAND (show_ip_tuple_command) =
+ {
+ .path = "show ip tuple match",
+ .short_help = "Show ip 5-tuple match-and-broadcast tables",
+ .function = show_ip_tuple_match,
+ };
+
+This example implements the “show ip tuple match” debug cli command. In
+ordinary usage, the vlib cli is available via the “vppctl” application,
+which sends traffic to a named pipe. One can configure debug CLI telnet
+access on a configurable port.
+
+The cli implementation has an output redirection facility which makes it
+simple to deliver cli output via shared-memory API messaging.
+
+Particularly for debug or “show tech support” type commands, it would be
+wasteful to write vlib application code to pack binary data, write more
+code elsewhere to unpack the data and finally print the answer. If a
+certain cli command has the potential to hurt packet processing
+performance by running for too long, do the work incrementally in a
+process node. The client can wait.
+
+Macro expansion
+~~~~~~~~~~~~~~~
+
+The vpp debug CLI engine includes a recursive macro expander. This is
+quite useful for factoring out address and/or interface name specifics:
+
+::
+
+ define ip1 192.168.1.1/24
+ define ip2 192.168.2.1/24
+ define iface1 GigabitEthernet3/0/0
+ define iface2 loop1
+
+ set int ip address $iface1 $ip1
+ set int ip address $iface2 $(ip2)
+
+ undefine ip1
+ undefine ip2
+ undefine iface1
+ undefine iface2
+
+Each socket (or telnet) debug CLI session has its own macro tables. All
+debug CLI sessions which use CLI_INBAND binary API messages share a
+single table.
+
+The macro expander recognizes circular definitions:
+
+::
+
+ define foo \$(bar)
+ define bar \$(mumble)
+ define mumble \$(foo)
+
+At 8 levels of recursion, the macro expander throws up its hands and
+replies “CIRCULAR.”
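+
+A depth-limited expander of this kind can be sketched as follows. This
+is not the vpp macro code: it handles only values which are either a
+literal or exactly one ``$(name)`` reference, it assumes the table
+stores values with the backslash escape already removed, and the
+``sketch_`` names are hypothetical.
+
+.. code:: c
+
+   #include <stddef.h>
+   #include <string.h>
+
+   #define SKETCH_MAX_DEPTH 8
+
+   typedef struct
+   {
+     const char *name;
+     const char *value;
+   } sketch_macro_t;
+
+   /* Find the value bound to name[0..name_len), or NULL if undefined. */
+   static const char *
+   sketch_macro_lookup (const sketch_macro_t *tbl, size_t n,
+                        const char *name, size_t name_len)
+   {
+     for (size_t i = 0; i < n; i++)
+       if (strlen (tbl[i].name) == name_len
+           && memcmp (tbl[i].name, name, name_len) == 0)
+         return tbl[i].value;
+     return NULL;
+   }
+
+   /* Expand a value that is either a literal or exactly "$(name)". */
+   static const char *
+   sketch_macro_eval (const sketch_macro_t *tbl, size_t n,
+                      const char *value, int depth)
+   {
+     size_t len = strlen (value);
+     const char *next;
+
+     if (depth >= SKETCH_MAX_DEPTH)
+       return "CIRCULAR";
+
+     /* Anything not of the form $(name) is treated as a literal. */
+     if (len < 4 || value[0] != '$' || value[1] != '('
+         || value[len - 1] != ')')
+       return value;
+
+     next = sketch_macro_lookup (tbl, n, value + 2, len - 3);
+     if (next == NULL)
+       return value; /* undefined: left unexpanded in this sketch */
+
+     return sketch_macro_eval (tbl, n, next, depth + 1);
+   }
+
+Counting depth rather than tracking visited names keeps the expander
+simple, at the cost of also rejecting legitimate chains deeper than the
+limit.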
+
+Macro-related debug CLI commands
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In addition to the “define” and “undefine” debug CLI commands, use “show
+macro [noevaluate]” to dump the macro table. The “echo” debug CLI
+command will evaluate and print its argument:
+
+::
+
+ vpp# define foo This\ Is\ Foo
+ vpp# echo $foo
+ This Is Foo
+
+Handing off buffers between threads
+-----------------------------------
+
+Vlib includes an easy-to-use mechanism for handing off buffers between
+worker threads. A typical use-case: software ingress flow hashing. At a
+high level, one creates a per-worker-thread queue which sends packets to
+a specific graph node in the indicated worker thread. With the queue in
+hand, enqueue packets to the worker thread of your choice.
+
+Initialize a handoff queue
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Simple enough, call vlib_frame_queue_main_init:
+
+.. code:: c
+
+ main_ptr->frame_queue_index
+ = vlib_frame_queue_main_init (dest_node.index, frame_queue_size);
+
+Frame_queue_size means what it says: the number of frames which may be
+queued. Since frames contain 1…256 packets, frame_queue_size should be a
+reasonably small number (32…64). If the frame queue producer(s) are
+faster than the frame queue consumer(s), congestion will occur. We
+suggest letting the enqueue operator deal with queue congestion, as shown
+in the enqueue example below.
+
+Under the floorboards, vlib_frame_queue_main_init creates an input queue
+for each worker thread.
+
+Please do NOT create frame queues until it’s clear that they will be
+used. Although the main dispatch loop is reasonably smart about how
+often it polls the (entire set of) frame queues, polling unused frame
+queues is a waste of clock cycles.
+
+Hand off packets
+~~~~~~~~~~~~~~~~
+
+The actual handoff mechanics are simple, and integrate nicely with a
+typical graph-node dispatch function:
+
+.. code:: c
+
+ always_inline uword
+ do_handoff_inline (vlib_main_t * vm,
+ vlib_node_runtime_t * node, vlib_frame_t * frame,
+ int is_ip4, int is_trace)
+ {
+ u32 n_left_from, *from;
+ vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b;
+ u16 thread_indices [VLIB_FRAME_SIZE];
+ u16 nexts[VLIB_FRAME_SIZE], *next;
+ u32 n_enq;
+ htest_main_t *hmp = &htest_main;
+ int i;
+
+ from = vlib_frame_vector_args (frame);
+ n_left_from = frame->n_vectors;
+
+ vlib_get_buffers (vm, from, bufs, n_left_from);
+ next = nexts;
+ b = bufs;
+
+ /*
+ * Typical frame traversal loop, details vary with
+ * use case. Make sure to set thread_indices[i] with
+ * the desired destination thread index. You may
+ * or may not bother to set next[i].
+ */
+
+ for (i = 0; i < frame->n_vectors; i++)
+ {
+ <snip>
+ /* Pick a thread to handle this packet */
+ thread_indices[i] = f (packet_data_or_whatever);
+ <snip>
+
+ b += 1;
+ next += 1;
+ n_left_from -= 1;
+ }
+
+ /* Enqueue buffers to threads */
+ n_enq =
+ vlib_buffer_enqueue_to_thread (vm, node, hmp->frame_queue_index,
+ from, thread_indices, frame->n_vectors,
+ 1 /* drop on congestion */);
+ /* Typical counters */
+ if (n_enq < frame->n_vectors)
+ vlib_node_increment_counter (vm, node->node_index,
+ XXX_ERROR_CONGESTION_DROP,
+ frame->n_vectors - n_enq);
+ vlib_node_increment_counter (vm, node->node_index,
+ XXX_ERROR_HANDED_OFF, n_enq);
+ return frame->n_vectors;
+ }
+
+Notes about calling vlib_buffer_enqueue_to_thread(…):
+
+- If you pass “drop on congestion” non-zero, all packets in the inbound
+ frame will be consumed one way or the other. This is the recommended
+ setting.
+
+- In the drop-on-congestion case, please don’t try to “help” in the
+ enqueue node by freeing dropped packets, or by pushing them to
+ “error-drop.” Either of those actions would be a severe error.
+
+- It’s perfectly OK to enqueue packets to the current thread.
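+
+The f (…) placeholder in the example stands for whatever policy picks
+the destination thread. For the software ingress flow hashing use-case,
+one plausible policy is to hash the flow and take the result modulo the
+number of workers. The sketch below is an assumption-laden
+illustration, not code from vpp: the ``sketch_`` functions and the hash
+mixing constant are made up for the example.
+
+.. code:: c
+
+   #include <stdint.h>
+
+   /* Toy flow hash over addresses and ports; any decent hash would do. */
+   static inline uint32_t
+   sketch_flow_hash (uint32_t src, uint32_t dst, uint16_t sport,
+                     uint16_t dport)
+   {
+     uint32_t h = src ^ dst ^ (((uint32_t) sport << 16) | dport);
+
+     h ^= h >> 16;
+     h *= 0x45d9f3bu;
+     h ^= h >> 16;
+     return h;
+   }
+
+   /* Map a hash onto threads first_worker .. first_worker + n_workers - 1. */
+   static inline uint16_t
+   sketch_pick_thread (uint32_t flow_hash, uint16_t first_worker_index,
+                       uint16_t n_workers)
+   {
+     if (n_workers == 0)
+       return 0; /* no workers: stay on thread 0 */
+     return (uint16_t) (first_worker_index + flow_hash % n_workers);
+   }
+
+Packets of the same flow hash to the same value, so they land on the
+same worker and stay in order relative to each other.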
+
+Handoff Demo Plugin
+-------------------
+
+Check out the sample (plugin) example in …/src/examples/handoffdemo. If
+you want to build the handoff demo plugin:
+
+::
+
+ $ cd .../src/plugins
+ $ ln -s ../examples/handoffdemo
+
+This plugin provides a simple example of how to hand off packets between
+threads. We used it to debug packet-tracer handoff tracing support.
+
+Packet generator input script
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ packet-generator new {
+ name x
+ limit 5
+ size 128-128
+ interface local0
+ node handoffdemo-1
+ data {
+ incrementing 30
+ }
+ }
+
+Start vpp with 2 worker threads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The demo plugin hands packets from worker 1 to worker 2.
+
+Enable tracing, and start the packet generator
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ trace add pg-input 100
+ packet-generator enable
+
+Sample Run
+~~~~~~~~~~
+
+::
+
+ DBGvpp# ex /tmp/pg_input_script
+ DBGvpp# pa en
+ DBGvpp# sh err
+ Count Node Reason
+ 5 handoffdemo-1 packets handed off processed
+ 5 handoffdemo-2 completed packets
+ DBGvpp# show run
+ Thread 1 vpp_wk_0 (lcore 0)
+ Time 133.9, average vectors/node 5.00, last 128 main loops 0.00 per node 0.00
+ vector rates in 3.7331e-2, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
+ Name State Calls Vectors Suspends Clocks Vectors/Call
+ handoffdemo-1 active 1 5 0 4.76e3 5.00
+ pg-input disabled 2 5 0 5.58e4 2.50
+ unix-epoll-input polling 22760 0 0 2.14e7 0.00
+ ---------------
+ Thread 2 vpp_wk_1 (lcore 2)
+ Time 133.9, average vectors/node 5.00, last 128 main loops 0.00 per node 0.00
+ vector rates in 0.0000e0, out 0.0000e0, drop 3.7331e-2, punt 0.0000e0
+ Name State Calls Vectors Suspends Clocks Vectors/Call
+ drop active 1 5 0 1.35e4 5.00
+ error-drop active 1 5 0 2.52e4 5.00
+ handoffdemo-2 active 1 5 0 2.56e4 5.00
+ unix-epoll-input polling 22406 0 0 2.18e7 0.00
+
+Enable the packet tracer and run it again…
+
+::
+
+ DBGvpp# trace add pg-input 100
+ DBGvpp# pa en
+ DBGvpp# sh trace
+ sh trace
+ ------------------- Start of thread 0 vpp_main -------------------
+ No packets in trace buffer
+ ------------------- Start of thread 1 vpp_wk_0 -------------------
+ Packet 1
+
+ 00:06:50:520688: pg-input
+ stream x, 128 bytes, 0 sw_if_index
+ current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000000
+ 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000
+ 00000020: 0000000000000000000000000000000000000000000000000000000000000000
+ 00000040: 0000000000000000000000000000000000000000000000000000000000000000
+ 00000060: 0000000000000000000000000000000000000000000000000000000000000000
+ 00:06:50:520762: handoffdemo-1
+ HANDOFFDEMO: current thread 1
+
+ Packet 2
+
+ 00:06:50:520688: pg-input
+ stream x, 128 bytes, 0 sw_if_index
+ current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000001
+ 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000
+ 00000020: 0000000000000000000000000000000000000000000000000000000000000000
+ 00000040: 0000000000000000000000000000000000000000000000000000000000000000
+ 00000060: 0000000000000000000000000000000000000000000000000000000000000000
+ 00:06:50:520762: handoffdemo-1
+ HANDOFFDEMO: current thread 1
+
+ Packet 3
+
+ 00:06:50:520688: pg-input
+ stream x, 128 bytes, 0 sw_if_index
+ current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000002
+ 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000
+ 00000020: 0000000000000000000000000000000000000000000000000000000000000000
+ 00000040: 0000000000000000000000000000000000000000000000000000000000000000
+ 00000060: 0000000000000000000000000000000000000000000000000000000000000000
+ 00:06:50:520762: handoffdemo-1
+ HANDOFFDEMO: current thread 1
+
+ Packet 4
+
+ 00:06:50:520688: pg-input
+ stream x, 128 bytes, 0 sw_if_index
+ current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000003
+ 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000
+ 00000020: 0000000000000000000000000000000000000000000000000000000000000000
+ 00000040: 0000000000000000000000000000000000000000000000000000000000000000
+ 00000060: 0000000000000000000000000000000000000000000000000000000000000000
+ 00:06:50:520762: handoffdemo-1
+ HANDOFFDEMO: current thread 1
+
+ Packet 5
+
+ 00:06:50:520688: pg-input
+ stream x, 128 bytes, 0 sw_if_index
+ current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000004
+ 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000
+ 00000020: 0000000000000000000000000000000000000000000000000000000000000000
+ 00000040: 0000000000000000000000000000000000000000000000000000000000000000
+ 00000060: 0000000000000000000000000000000000000000000000000000000000000000
+ 00:06:50:520762: handoffdemo-1
+ HANDOFFDEMO: current thread 1
+
+ ------------------- Start of thread 2 vpp_wk_1 -------------------
+ Packet 1
+
+ 00:06:50:520796: handoff_trace
+ HANDED-OFF: from thread 1 trace index 0
+ 00:06:50:520796: handoffdemo-2
+ HANDOFFDEMO: current thread 2
+ 00:06:50:520867: error-drop
+ rx:local0
+ 00:06:50:520914: drop
+ handoffdemo-2: completed packets
+
+ Packet 2
+
+ 00:06:50:520796: handoff_trace
+ HANDED-OFF: from thread 1 trace index 1
+ 00:06:50:520796: handoffdemo-2
+ HANDOFFDEMO: current thread 2
+ 00:06:50:520867: error-drop
+ rx:local0
+ 00:06:50:520914: drop
+ handoffdemo-2: completed packets
+
+ Packet 3
+
+ 00:06:50:520796: handoff_trace
+ HANDED-OFF: from thread 1 trace index 2
+ 00:06:50:520796: handoffdemo-2
+ HANDOFFDEMO: current thread 2
+ 00:06:50:520867: error-drop
+ rx:local0
+ 00:06:50:520914: drop
+ handoffdemo-2: completed packets
+
+ Packet 4
+
+ 00:06:50:520796: handoff_trace
+ HANDED-OFF: from thread 1 trace index 3
+ 00:06:50:520796: handoffdemo-2
+ HANDOFFDEMO: current thread 2
+ 00:06:50:520867: error-drop
+ rx:local0
+ 00:06:50:520914: drop
+ handoffdemo-2: completed packets
+
+ Packet 5
+
+ 00:06:50:520796: handoff_trace
+ HANDED-OFF: from thread 1 trace index 4
+ 00:06:50:520796: handoffdemo-2
+ HANDOFFDEMO: current thread 2
+ 00:06:50:520867: error-drop
+ rx:local0
+ 00:06:50:520914: drop
+ handoffdemo-2: completed packets
+ DBGvpp#
diff --git a/docs/developer/corearchitecture/vnet.rst b/docs/developer/corearchitecture/vnet.rst
new file mode 100644
index 00000000000..812e2fb4f8a
--- /dev/null
+++ b/docs/developer/corearchitecture/vnet.rst
@@ -0,0 +1,807 @@
+VNET (VPP Network Stack)
+========================
+
+The files associated with the VPP network stack layer are located in the
+*./src/vnet* folder. The Network Stack Layer is basically an
+instantiation of the code in the other layers. This layer has a vnet
+library that provides vectorized layer-2 and 3 networking graph nodes, a
+packet generator, and a packet tracer.
+
+In terms of building a packet processing application, vnet provides a
+platform-independent subgraph to which one connects a couple of
+device-driver nodes.
+
+Typical RX connections include “ethernet-input” [full software
+classification, feeds ipv4-input, ipv6-input, arp-input etc.] and
+“ipv4-input-no-checksum” [if hardware can classify, perform ipv4 header
+checksum].
+
+Effective graph dispatch function coding
+----------------------------------------
+
+Over the past 15 years, multiple coding styles have emerged: a
+single/dual/quad loop coding model (with variations) and a
+fully-pipelined coding model.
+
+Single/dual loops
+-----------------
+
+The single/dual/quad loop model variations conveniently solve problems
+where the number of items to process is not known in advance: typical
+hardware RX-ring processing. This coding style is also very effective
+when a given node will not need to cover a complex set of dependent
+reads.
+
+Here is a quad/single loop which can leverage up-to-avx512 SIMD vector
+units to convert buffer indices to buffer pointers:
+
+.. code:: c
+
+ static uword
+ simulated_ethernet_interface_tx (vlib_main_t * vm,
+ vlib_node_runtime_t *
+ node, vlib_frame_t * frame)
+ {
+ u32 n_left_from, *from;
+ u32 next_index = 0;
+ u32 n_bytes;
+ u32 thread_index = vm->thread_index;
+ vnet_main_t *vnm = vnet_get_main ();
+ vnet_interface_main_t *im = &vnm->interface_main;
+ vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b;
+ u16 nexts[VLIB_FRAME_SIZE], *next;
+
+ n_left_from = frame->n_vectors;
+ from = vlib_frame_vector_args (frame);
+
+ /*
+ * Convert up to VLIB_FRAME_SIZE indices in "from" to
+ * buffer pointers in bufs[]
+ */
+ vlib_get_buffers (vm, from, bufs, n_left_from);
+ b = bufs;
+ next = nexts;
+
+ /*
+ * While we have at least 4 vector elements (pkts) to process..
+ */
+ while (n_left_from >= 4)
+ {
+ /* Prefetch next quad-loop iteration. */
+ if (PREDICT_TRUE (n_left_from >= 8))
+ {
+ vlib_prefetch_buffer_header (b[4], STORE);
+ vlib_prefetch_buffer_header (b[5], STORE);
+ vlib_prefetch_buffer_header (b[6], STORE);
+ vlib_prefetch_buffer_header (b[7], STORE);
+ }
+
+ /*
+ * $$$ Process 4x packets right here...
+ * set next[0..3] to send the packets where they need to go
+ */
+
+ do_something_to (b[0]);
+ do_something_to (b[1]);
+ do_something_to (b[2]);
+ do_something_to (b[3]);
+
+ /* Process the next 0..4 packets */
+ b += 4;
+ next += 4;
+ n_left_from -= 4;
+ }
+ /*
+ * Clean up 0...3 remaining packets at the end of the incoming frame
+ */
+ while (n_left_from > 0)
+ {
+ /*
+ * $$$ Process one packet right here...
+ * set next[0] to send the packet where it needs to go
+ */
+ do_something_to (b[0]);
+
+ /* Process the next packet */
+ b += 1;
+ next += 1;
+ n_left_from -= 1;
+ }
+
+ /*
+ * Send the packets along their respective next-node graph arcs
+ * Considerable locality of reference is expected, most if not all
+ * packets in the inbound vector will traverse the same next-node
+ * arc
+ */
+ vlib_buffer_enqueue_to_next (vm, node, from, nexts, frame->n_vectors);
+
+ return frame->n_vectors;
+ }
+
+Given a packet processing task to implement, it pays to scout around
+looking for similar tasks, and think about using the same coding
+pattern. It is not uncommon to recode a given graph node dispatch
+function several times during performance optimization.
+
+Creating Packets from Scratch
+-----------------------------
+
+At times, it’s necessary to create packets from scratch and send them.
+Tasks like sending keepalives or actively opening connections come to
+mind. It’s not difficult, but accurate buffer metadata setup is required.
+
+Allocating Buffers
+~~~~~~~~~~~~~~~~~~
+
+Use vlib_buffer_alloc, which allocates a set of buffer indices. For
+low-performance applications, it’s OK to allocate one buffer at a time.
+Note that vlib_buffer_alloc(…) does NOT initialize buffer metadata. See
+below.
+
+In high-performance cases, allocate a vector of buffer indices, and hand
+them out from the end of the vector; decrement \_vec_len(..) as buffer
+indices are allocated. See tcp_alloc_tx_buffers(…) and
+tcp_get_free_buffer_index(…) for an example.
+
+Buffer Initialization Example
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following example shows the **main points**, but is not to be
+blindly cut-’n-pasted.
+
+.. code:: c
+
+   u32 bi0;
+   vlib_buffer_t *b0;
+   ip4_header_t *ip;
+   udp_header_t *udp;
+   u8 *data_dst;
+
+   /* Allocate a buffer */
+   if (vlib_buffer_alloc (vm, &bi0, 1) != 1)
+     return -1;
+
+   b0 = vlib_get_buffer (vm, bi0);
+
+   /* At this point b0->current_data = 0, b0->current_length = 0 */
+
+   /*
+    * Copy data into the buffer. This example ASSUMES that data will fit
+    * in a single buffer, and is e.g. an ip4 packet.
+    */
+   if (have_packet_rewrite)
+     {
+       clib_memcpy (b0->data, data, vec_len (data));
+       b0->current_length = vec_len (data);
+     }
+   else
+     {
+       /* OR, build a udp-ip packet (for example) */
+       ip = vlib_buffer_get_current (b0);
+       udp = (udp_header_t *) (ip + 1);
+       data_dst = (u8 *) (udp + 1);
+
+       ip->ip_version_and_header_length = 0x45;
+       ip->ttl = 254;
+       ip->protocol = IP_PROTOCOL_UDP;
+       ip->length = clib_host_to_net_u16 (sizeof (*ip) + sizeof (*udp) +
+                                          vec_len (udp_data));
+       ip->src_address.as_u32 = src_address->as_u32;
+       ip->dst_address.as_u32 = dst_address->as_u32;
+       udp->src_port = clib_host_to_net_u16 (src_port);
+       udp->dst_port = clib_host_to_net_u16 (dst_port);
+       /* The UDP length field covers the UDP header plus the payload */
+       udp->length = clib_host_to_net_u16 (sizeof (*udp) +
+                                           vec_len (udp_data));
+       clib_memcpy (data_dst, udp_data, vec_len (udp_data));
+
+       /* Set the buffer length before any checksum work on the buffer */
+       b0->current_length = sizeof (*ip) + sizeof (*udp) +
+         vec_len (udp_data);
+
+       if (compute_udp_checksum)
+         {
+           /* RFC 7011 section 10.3.2. */
+           udp->checksum = ip4_tcp_udp_compute_checksum (vm, b0, ip);
+           if (udp->checksum == 0)
+             udp->checksum = 0xffff;
+         }
+     }
+   b0->flags |= VLIB_BUFFER_TOTAL_LENGTH_VALID;
+
+   /* sw_if_index 0 is the "local" interface, which always exists */
+   vnet_buffer (b0)->sw_if_index[VLIB_RX] = 0;
+
+   /* Use the default FIB index for tx lookup. Set non-zero to use another fib */
+   vnet_buffer (b0)->sw_if_index[VLIB_TX] = 0;
+
+If your use-case calls for large packet transmission, use
+vlib_buffer_chain_append_data_with_alloc(…) to create the requisite
+buffer chain.
+
+Enqueueing packets for lookup and transmission
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The simplest way to send a set of packets is to use
+vlib_get_frame_to_node(…) to allocate fresh frame(s) to ip4_lookup_node
+or ip6_lookup_node, add the constructed buffer indices, and dispatch the
+frame using vlib_put_frame_to_node(…).
+
+.. code:: c
+
+   vlib_frame_t *f;
+   u32 *to_next;
+   int i;
+
+   f = vlib_get_frame_to_node (vm, ip4_lookup_node.index);
+   f->n_vectors = vec_len (buffer_indices_to_send);
+   to_next = vlib_frame_vector_args (f);
+
+   for (i = 0; i < vec_len (buffer_indices_to_send); i++)
+     to_next[i] = buffer_indices_to_send[i];
+
+   vlib_put_frame_to_node (vm, ip4_lookup_node.index, f);
+
+It is inefficient to allocate and schedule single packet frames. That’s
+acceptable if you need to send one packet per second, but should **not**
+occur in a for-loop!
+
+Packet tracer
+-------------
+
+Vlib includes a frame element [packet] trace facility, with a simple
+debug CLI interface. The cli is straightforward: “trace add
+input-node-name count” to start capturing packet traces.
+
+To trace 100 packets on a typical x86_64 system running the dpdk plugin:
+“trace add dpdk-input 100”. When using the packet generator: “trace add
+pg-input 100”
+
+To display the packet trace: “show trace”
+
+Each graph node has the opportunity to capture its own trace data. It is
+almost always a good idea to do so. The trace capture APIs are simple.
+
+The packet capture APIs snapshot binary data, to minimize processing at
+capture time. Each participating graph node provides a vppinfra
+format-style user function to pretty-print the data when required by the
+VLIB “show trace” command.
+
+Set the VLIB node registration “.format_trace” member to the name of the
+per-graph node format function.
+
+Here’s a simple example:
+
+.. code:: c
+
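+ u8 * my_node_format_trace (u8 * s, va_list * args)
+ {
+ /* vm and node are passed by the trace framework; unused here */
+ CLIB_UNUSED (vlib_main_t * vm) = va_arg (*args, vlib_main_t *);
+ CLIB_UNUSED (vlib_node_t * node) = va_arg (*args, vlib_node_t *);
+ /* The binary data captured for this packet by this node */
+ my_node_trace_t * t = va_arg (*args, my_node_trace_t *);
+
+ s = format (s, "My trace data was: %d", t-><whatever>);
+
+ return s;
+ }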
+
+The trace framework hands the per-node format function the data it
+captured as the packet whizzed by. The format function pretty-prints the
+data as desired.
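+
+The split between cheap binary capture and deferred formatting can be
+modelled outside of vpp as follows. This is a standalone sketch with
+hypothetical names (my_node_trace_t fields, my_node_capture,
+my_node_format) that uses snprintf in place of the vppinfra format
+function. In vpp itself the capture step is normally done with
+vlib_add_trace(…) in the node's dispatch function.
+
```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-node trace record: raw fields only, no strings. */
typedef struct
{
  uint32_t sw_if_index;
  uint32_t next_index;
} my_node_trace_t;

/* Capture path: store a few words, do no formatting. */
void
my_node_capture (my_node_trace_t *t, uint32_t sw_if_index,
                 uint32_t next_index)
{
  assert (t != NULL);
  t->sw_if_index = sw_if_index;
  t->next_index = next_index;
}

/* Display path: runs only when someone asks to see the trace.
   Returns the number of characters snprintf wanted to write. */
int
my_node_format (char *s, size_t len, const my_node_trace_t *t)
{
  return snprintf (s, len, "MY_NODE: sw_if_index %u, next index %u",
                   (unsigned) t->sw_if_index, (unsigned) t->next_index);
}
```
+
+The capture function touches only two 32-bit words per packet; the
+string work happens later, and only for packets somebody displays.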
+
+Graph Dispatcher Pcap Tracing
+-----------------------------
+
+The vpp graph dispatcher knows how to capture vectors of packets in pcap
+format as they’re dispatched. The pcap captures are as follows:
+
+::
+
+ VPP graph dispatch trace record description:
+
+ 0 1 2 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Major Version | Minor Version | NStrings | ProtoHint |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Buffer index (big endian) |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ + VPP graph node name ... ... | NULL octet |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Buffer Metadata ... ... | NULL octet |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Buffer Opaque ... ... | NULL octet |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Buffer Opaque 2 ... ... | NULL octet |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | VPP ASCII packet trace (if NStrings > 4) | NULL octet |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ | Packet data (up to 16K) |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+Graph dispatch records comprise a version stamp, an indication of how
+many NULL-terminated strings will follow the record header and precede
+packet data, and a protocol hint.
+
+The buffer index is an opaque 32-bit cookie which allows consumers of
+these data to easily filter/track single packets as they traverse the
+forwarding graph.
+
+Multiple records per packet are normal, and to be expected. Packets will
+appear multiple times as they traverse the vpp forwarding graph. In this
+way, vpp graph dispatch traces are significantly different from regular
+network packet captures from an end-station. This property complicates
+stateful packet analysis.
+
+Restricting stateful analysis to records from a single vpp graph node
+such as “ethernet-input” seems likely to improve the situation.
+
+As of this writing: major version = 1, minor version = 0. NStrings
+SHOULD be 4 or 5. Consumers SHOULD be wary of values less than 4 or
+greater than 5. They MAY attempt to display the claimed number of
+strings, or they MAY treat the condition as an error.
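+
+A consumer could walk the record layout described above roughly as
+follows. This is a standalone sketch, not the Wireshark dissector:
+dispatch_record_t and parse_dispatch_record are hypothetical names, and
+the record is assumed to arrive as one contiguous byte array.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
  uint8_t major, minor, n_strings, proto_hint;
  uint32_t buffer_index;
  size_t packet_offset;         /* offset of packet data in the record */
} dispatch_record_t;

/* Returns 0 on success, -1 if the record is truncated. */
int
parse_dispatch_record (const uint8_t *p, size_t len, dispatch_record_t *r)
{
  size_t off = 8;               /* fixed header: 4 octets + buffer index */
  unsigned i;

  assert (p != NULL && r != NULL);
  if (len < 8)
    return -1;

  r->major = p[0];
  r->minor = p[1];
  r->n_strings = p[2];
  r->proto_hint = p[3];
  /* The buffer index is stored big endian. */
  r->buffer_index = ((uint32_t) p[4] << 24) | ((uint32_t) p[5] << 16)
    | ((uint32_t) p[6] << 8) | (uint32_t) p[7];

  /* Skip the claimed number of NULL-terminated strings. */
  for (i = 0; i < r->n_strings; i++)
    {
      while (off < len && p[off] != 0)
        off++;
      if (off == len)
        return -1;              /* ran out of data before the terminator */
      off++;                    /* step over the NULL octet */
    }

  r->packet_offset = off;
  return 0;
}
```
+
+This sketch takes the first of the two permitted options: it tries to
+skip whatever number of strings the header claims, and reports an error
+only when the record ends early.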
+
+Here is the current set of protocol hints:
+
+.. code:: c
+
+ typedef enum
+ {
+ VLIB_NODE_PROTO_HINT_NONE = 0,
+ VLIB_NODE_PROTO_HINT_ETHERNET,
+ VLIB_NODE_PROTO_HINT_IP4,
+ VLIB_NODE_PROTO_HINT_IP6,
+ VLIB_NODE_PROTO_HINT_TCP,
+ VLIB_NODE_PROTO_HINT_UDP,
+ VLIB_NODE_N_PROTO_HINTS,
+ } vlib_node_proto_hint_t;
+
+Example: VLIB_NODE_PROTO_HINT_IP6 means that the first octet of packet
+data SHOULD be 0x60, and should begin an ipv6 packet header.
+
+Downstream consumers of these data SHOULD pay attention to the protocol
+hint. They MUST tolerate inaccurate hints, which MAY occur from time to
+time.
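+
+One way a consumer might sanity-check a hint before trusting it is
+sketched below. The enum is copied from the listing above;
+proto_hint_plausible is a hypothetical helper. It compares only the IP
+version nibble, since the low four bits of the first IPv6 octet carry
+traffic class bits and need not be zero.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum
{
  VLIB_NODE_PROTO_HINT_NONE = 0,
  VLIB_NODE_PROTO_HINT_ETHERNET,
  VLIB_NODE_PROTO_HINT_IP4,
  VLIB_NODE_PROTO_HINT_IP6,
  VLIB_NODE_PROTO_HINT_TCP,
  VLIB_NODE_PROTO_HINT_UDP,
  VLIB_NODE_N_PROTO_HINTS,
} vlib_node_proto_hint_t;

/* Returns 1 if the first octet is plausible for the hint, 0 if not.
   Hints with no cheap first-octet check are treated as plausible. */
int
proto_hint_plausible (vlib_node_proto_hint_t hint, const uint8_t *data,
                      size_t len)
{
  assert (data != NULL);
  if (len == 0)
    return 0;

  switch (hint)
    {
    case VLIB_NODE_PROTO_HINT_IP4:
      return (data[0] >> 4) == 4;
    case VLIB_NODE_PROTO_HINT_IP6:
      return (data[0] >> 4) == 6;
    default:
      return 1;
    }
}
```
+
+A consumer that gets 0 back can fall back to displaying raw packet
+data, which keeps it tolerant of inaccurate hints.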
+
+Dispatch Pcap Trace Debug CLI
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To start a dispatch trace capture of up to 10,000 trace records:
+
+::
+
+ pcap dispatch trace on max 10000 file dispatch.pcap
+
+To start a dispatch trace which will also include standard vpp packet
+tracing for packets which originate in dpdk-input:
+
+::
+
+ pcap dispatch trace on max 10000 file dispatch.pcap buffer-trace dpdk-input 1000
+
+To save the pcap trace, e.g. in /tmp/dispatch.pcap:
+
+::
+
+ pcap dispatch trace off
+
+Wireshark dissection of dispatch pcap traces
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It almost goes without saying that we built a companion wireshark
+dissector to display these traces. As of this writing, we have
+upstreamed the wireshark dissector.
+
+Since it will be a while before wireshark/master/latest makes it into
+all of the popular Linux distros, please see the “How to build a vpp
+dispatch trace aware Wireshark” page for build info.
+
+Here is a sample packet dissection, with some fields omitted for
+clarity. The point is that the wireshark dissector accurately displays
+**all** of the vpp buffer metadata, and the name of the graph node in
+question.
+
+::
+
+ Frame 1: 2216 bytes on wire (17728 bits), 2216 bytes captured (17728 bits)
+ Encapsulation type: USER 13 (58)
+ [Protocols in frame: vpp:vpp-metadata:vpp-opaque:vpp-opaque2:eth:ethertype:ip:tcp:data]
+ VPP Dispatch Trace
+ BufferIndex: 0x00036663
+ NodeName: ethernet-input
+ VPP Buffer Metadata
+ Metadata: flags:
+ Metadata: current_data: 0, current_length: 102
+ Metadata: current_config_index: 0, flow_id: 0, next_buffer: 0
+ Metadata: error: 0, n_add_refs: 0, buffer_pool_index: 0
+ Metadata: trace_index: 0, recycle_count: 0, len_not_first_buf: 0
+ Metadata: free_list_index: 0
+ Metadata:
+ VPP Buffer Opaque
+ Opaque: raw: 00000007 ffffffff 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
+ Opaque: sw_if_index[VLIB_RX]: 7, sw_if_index[VLIB_TX]: -1
+ Opaque: L2 offset 0, L3 offset 0, L4 offset 0, feature arc index 0
+ Opaque: ip.adj_index[VLIB_RX]: 0, ip.adj_index[VLIB_TX]: 0
+ Opaque: ip.flow_hash: 0x0, ip.save_protocol: 0x0, ip.fib_index: 0
+ Opaque: ip.save_rewrite_length: 0, ip.rpf_id: 0
+ Opaque: ip.icmp.type: 0 ip.icmp.code: 0, ip.icmp.data: 0x0
+ Opaque: ip.reass.next_index: 0, ip.reass.estimated_mtu: 0
+ Opaque: ip.reass.fragment_first: 0 ip.reass.fragment_last: 0
+ Opaque: ip.reass.range_first: 0 ip.reass.range_last: 0
+ Opaque: ip.reass.next_range_bi: 0x0, ip.reass.ip6_frag_hdr_offset: 0
+ Opaque: mpls.ttl: 0, mpls.exp: 0, mpls.first: 0, mpls.save_rewrite_length: 0, mpls.bier.n_bytes: 0
+ Opaque: l2.feature_bitmap: 00000000, l2.bd_index: 0, l2.l2_len: 0, l2.shg: 0, l2.l2fib_sn: 0, l2.bd_age: 0
+ Opaque: l2.feature_bitmap_input: none configured, L2.feature_bitmap_output: none configured
+ Opaque: l2t.next_index: 0, l2t.session_index: 0
+ Opaque: l2_classify.table_index: 0, l2_classify.opaque_index: 0, l2_classify.hash: 0x0
+ Opaque: policer.index: 0
+ Opaque: ipsec.flags: 0x0, ipsec.sad_index: 0
+ Opaque: map.mtu: 0
+ Opaque: map_t.v6.saddr: 0x0, map_t.v6.daddr: 0x0, map_t.v6.frag_offset: 0, map_t.v6.l4_offset: 0
+ Opaque: map_t.v6.l4_protocol: 0, map_t.checksum_offset: 0, map_t.mtu: 0
+ Opaque: ip_frag.mtu: 0, ip_frag.next_index: 0, ip_frag.flags: 0x0
+ Opaque: cop.current_config_index: 0
+ Opaque: lisp.overlay_afi: 0
+ Opaque: tcp.connection_index: 0, tcp.seq_number: 0, tcp.seq_end: 0, tcp.ack_number: 0, tcp.hdr_offset: 0, tcp.data_offset: 0
+ Opaque: tcp.data_len: 0, tcp.flags: 0x0
+ Opaque: sctp.connection_index: 0, sctp.sid: 0, sctp.ssn: 0, sctp.tsn: 0, sctp.hdr_offset: 0
+ Opaque: sctp.data_offset: 0, sctp.data_len: 0, sctp.subconn_idx: 0, sctp.flags: 0x0
+ Opaque: snat.flags: 0x0
+ Opaque:
+ VPP Buffer Opaque2
+ Opaque2: raw: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
+ Opaque2: qos.bits: 0, qos.source: 0
+ Opaque2: loop_counter: 0
+ Opaque2: gbp.flags: 0, gbp.src_epg: 0
+ Opaque2: pg_replay_timestamp: 0
+ Opaque2:
+ Ethernet II, Src: 06:d6:01:41:3b:92 (06:d6:01:41:3b:92), Dst: IntelCor_3d:f6
+ Transmission Control Protocol, Src Port: 22432, Dst Port: 54084, Seq: 1, Ack: 1, Len: 36
+ Source Port: 22432
+ Destination Port: 54084
+ TCP payload (36 bytes)
+ Data (36 bytes)
+
+ 0000 cf aa 8b f5 53 14 d4 c7 29 75 3e 56 63 93 9d 11 ....S...)u>Vc...
+ 0010 e5 f2 92 27 86 56 4c 21 ce c5 23 46 d7 eb ec 0d ...'.VL!..#F....
+ 0020 a8 98 36 5a ..6Z
+ Data: cfaa8bf55314d4c729753e5663939d11e5f2922786564c21…
+ [Length: 36]
+
+It’s a matter of a couple of mouse-clicks in Wireshark to filter the
+trace to a specific buffer index. With that kind of filtering, one can
+watch a packet walk through the forwarding graph, noting any/all
+metadata changes, header checksum changes, and so forth.
+
+This should be of significant value when developing new vpp graph nodes.
+If new code mispositions b->current_data, it will be completely obvious
+from looking at the dispatch trace in wireshark.
+
+pcap rx, tx, and drop tracing
+-----------------------------
+
+vpp also supports rx, tx, and drop packet capture in pcap format,
+through the “pcap trace” debug CLI command.
+
+This command is used to start or stop a packet capture, or show the
+status of packet capture. Each of “pcap trace rx”, “pcap trace tx”, and
+“pcap trace drop” is implemented. Supply one or more of “rx”, “tx”, and
+“drop” to enable multiple simultaneous capture types.
+
+These commands have the following optional parameters:
+
+- rx - trace received packets.
+
+- tx - trace transmitted packets.
+
+- drop - trace dropped packets.
+
+- max *nnnn*\ - file size, number of packet captures. Once *nnnn*
+ packets have been received, the trace buffer is flushed to the
+ indicated file. Defaults to 1000. Can only be updated if packet
+ capture is off.
+
+- max-bytes-per-pkt *nnnn*\ - maximum number of bytes to trace on a
+ per-packet basis. Must be >32 and less than 9000. Default value: 512.
+
+- filter - Use the pcap rx / tx / drop trace filter, which must be
+ configured. Use classify filter pcap… to configure the filter. The
+ filter will only be executed if the per-interface or any-interface
+ tests fail.
+
+- intfc *interface* \| *any*\ - Used to specify a given interface, or
+ use ‘any’ to run packet capture on all interfaces. ‘any’ is the
+ default if not provided. Settings from a previous packet capture are
+ preserved, so ‘any’ can be used to reset the interface setting.
+
+- file *filename*\ - Used to specify the output filename. The file
+ will be placed in the ‘/tmp’ directory. If *filename* already exists,
+ the file will be overwritten. If no filename is provided,
+ ‘/tmp/rx.pcap’ or ‘/tmp/tx.pcap’ will be used, depending on capture
+ direction. Can only be updated when pcap capture is off.
+
+- status - Displays the current status and configured attributes
+ associated with a packet capture. If packet capture is in progress,
+ ‘status’ also will return the number of packets currently in the
+ buffer. Any additional attributes entered on command line with a
+ ‘status’ request will be ignored.
+
+- filter - Capture packets which match the current packet trace filter
+ set. See next section. Configure the capture filter first.
+
+packet trace capture filtering
+------------------------------
+
+The “classify filter pcap \| <interface> \| trace” debug CLI command
+constructs an arbitrary set of packet classifier tables for use with
+“pcap trace rx \| tx \| drop,” and with the vpp packet tracer on a
+per-interface or system-wide basis.
+
+Packets which match a rule in the classifier table chain will be traced.
+The tables are automatically ordered so that matches in the most
+specific table are tried first.
+
+It’s reasonably likely that folks will configure a single table with one
+or two matches. As a result, we configure 8 hash buckets and 128K of
+match rule space by default. One can override the defaults by specifying
+the “buckets” and “memory-size” parameters as desired.
+
+To build up complex filter chains, repeatedly issue the classify filter
+debug CLI command. Each command must specify the desired mask and match
+values. If a classifier table with a suitable mask already exists, the
+CLI command adds a match rule to the existing table. If not, the CLI
+command adds a new table and the indicated mask rule.
+
+Configure a simple pcap classify filter
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ classify filter pcap mask l3 ip4 src match l3 ip4 src 192.168.1.11
+ pcap trace rx max 100 filter
+
+Configure a simple per-interface capture filter
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ classify filter GigabitEthernet3/0/0 mask l3 ip4 src match l3 ip4 src 192.168.1.11
+ pcap trace rx max 100 intfc GigabitEthernet3/0/0
+
+Note that per-interface capture filters are *always* applied.
+
+Clear per-interface capture filters
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ classify filter GigabitEthernet3/0/0 del
+
+Configure another fairly simple pcap classify filter
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ classify filter pcap mask l3 ip4 src dst match l3 ip4 src 192.168.1.10 dst 192.168.2.10
+ pcap trace tx max 100 filter
+
+Configure a vpp packet tracer filter
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ classify filter trace mask l3 ip4 src dst match l3 ip4 src 192.168.1.10 dst 192.168.2.10
+ trace add dpdk-input 100 filter
+
+Clear all current classifier filters
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ classify filter [pcap | <interface> | trace] del
+
+To inspect the classifier tables
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ show classify table [verbose]
+
+The verbose form displays all of the match rules, with hit-counters.
+
+Terse description of the “mask” syntax:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ l2 src dst proto tag1 tag2 ignore-tag1 ignore-tag2 cos1 cos2 dot1q dot1ad
+ l3 ip4 <ip4-mask> ip6 <ip6-mask>
+ <ip4-mask> version hdr_length src[/width] dst[/width]
+ tos length fragment_id ttl protocol checksum
+ <ip6-mask> version traffic-class flow-label src dst proto
+ payload_length hop_limit protocol
+ l4 tcp <tcp-mask> udp <udp-mask> src_port dst_port
+ <tcp-mask> src dst # ports
+ <udp-mask> src_port dst_port
+
+To construct **matches**, add the values to match after the indicated
+keywords in the mask syntax. For example: “… mask l3 ip4 src” -> “…
+match l3 ip4 src 192.168.1.11”
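+
+The relationship between a mask and its match values can be modelled
+with a few lines of standalone C. This is only a sketch of the
+mask-then-compare idea over a 20-byte IPv4 header; the real classifier
+uses hash tables, and ip4_filter_t, ip4_filter_set_src and
+ip4_filter_hit are hypothetical names.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IP4_HDR_LEN 20          /* IPv4 header without options */
#define IP4_SRC_OFFSET 12       /* source address offset in the header */

typedef struct
{
  uint8_t mask[IP4_HDR_LEN];    /* which bits take part in the match */
  uint8_t match[IP4_HDR_LEN];   /* required values of those bits */
} ip4_filter_t;

/* Model of "mask l3 ip4 src" plus "match l3 ip4 src a.b.c.d". */
void
ip4_filter_set_src (ip4_filter_t *f, const uint8_t src[4])
{
  assert (f != NULL && src != NULL);
  memset (f, 0, sizeof (*f));
  memset (f->mask + IP4_SRC_OFFSET, 0xff, 4);
  memcpy (f->match + IP4_SRC_OFFSET, src, 4);
}

/* A header hits when its masked bytes equal the match bytes. */
int
ip4_filter_hit (const ip4_filter_t *f, const uint8_t *hdr)
{
  size_t i;

  for (i = 0; i < IP4_HDR_LEN; i++)
    if ((hdr[i] & f->mask[i]) != f->match[i])
      return 0;
  return 1;
}
```
+
+Fields outside the mask, such as the TTL, have no effect on the result,
+which is why the mask keywords and the match keywords must line up.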
+
+VPP Packet Generator
+--------------------
+
+We use the VPP packet generator to inject packets into the forwarding
+graph. The packet generator can replay pcap traces, and generate packets
+out of whole cloth at respectably high performance.
+
+The VPP pg enables quite a variety of use-cases, ranging from functional
+testing of new data-plane nodes to regression testing to performance
+tuning.
+
+PG setup scripts
+----------------
+
+PG setup scripts describe traffic in detail, and leverage vpp debug CLI
+mechanisms. It’s reasonably unusual to construct a pg setup script which
+doesn’t include a certain amount of interface and FIB configuration.
+
+For example:
+
+::
+
+ loop create
+ set int ip address loop0 192.168.1.1/24
+ set int state loop0 up
+
+ packet-generator new {
+ name pg0
+ limit 100
+ rate 1e6
+ size 300-300
+ interface loop0
+ node ethernet-input
+ data { IP4: 1.2.3 -> 4.5.6
+ UDP: 192.168.1.10 - 192.168.1.254 -> 192.168.2.10
+ UDP: 1234 -> 2345
+ incrementing 286
+ }
+ }
+
+A packet generator stream definition includes two major sections:
+
+- Stream Parameter Setup
+- Packet Data
+
+Stream Parameter Setup
+~~~~~~~~~~~~~~~~~~~~~~
+
+Given the example above, let’s look at how to set up stream parameters:
+
+- **name pg0** - Name of the stream, in this case “pg0”
+
+- **limit 100** - Number of packets to send when the stream is
+ enabled. “limit 0” means send packets continuously.
+
+- **maxframe <nnn>** - Maximum frame size. Handy for injecting multiple
+ frames no larger than <nnn>. Useful for checking dual / quad loop
+ code.
+
+- **rate 1e6** - Packet injection rate, in this case 1 MPPS. When not
+ specified, the packet generator injects packets as fast as possible
+
+- **size 300-300** - Packet size range, in this case send 300-byte
+ packets
+
+- **interface loop0** - Packets appear as if they were received on the
+ specified interface. This datum is used in multiple ways: to select
+ graph arc feature configuration, to select IP FIBs. Configure
+ features e.g. on loop0 to exercise those features.
+
+- **tx-interface <name>** - Packets will be transmitted on the
+ indicated interface. Typically required only when injecting packets
+ into post-IP-rewrite graph nodes.
+
+- **pcap <filename>** - Replay packets from the indicated pcap capture
+ file. “make test” makes extensive use of this feature: generate
+ packets using scapy, save them in a .pcap file, then inject them into
+ the vpp graph via a vpp pg “pcap <filename>” stream definition
+
+- **worker <nn>** - Generate packets for the stream using the indicated
+ vpp worker thread. The vpp pg generates and injects O(10 MPPS /
+ core). Use multiple stream definitions and worker threads to generate
+ and inject enough traffic to easily fill a 40 gbit pipe with small
+ packets.
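+
+As a back-of-the-envelope check on the worker count, the arithmetic can
+be sketched as below. The numbers are assumptions rather than
+measurements: 64-byte frames, 20 bytes of per-frame wire overhead
+(preamble plus inter-frame gap), and 10 MPPS per worker. Both function
+names are hypothetical.
+
```c
#include <assert.h>
#include <stdint.h>

/* Assumed per-frame wire overhead: 8 bytes of preamble/SFD plus a
   12-byte inter-frame gap. */
#define WIRE_OVERHEAD_BYTES 20

/* Packets per second needed to fill a link with fixed-size frames. */
uint64_t
line_rate_pps (uint64_t link_bps, uint64_t frame_bytes)
{
  assert (frame_bytes > 0);
  return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8);
}

/* Workers needed to reach target_pps, rounding up. */
uint64_t
workers_needed (uint64_t target_pps, uint64_t pps_per_worker)
{
  assert (pps_per_worker > 0);
  return (target_pps + pps_per_worker - 1) / pps_per_worker;
}
```
+
+With these assumptions a 40 Gbit link needs roughly 59.5 MPPS, which
+works out to six stream definitions, each bound to its own worker.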
+
+Data definition
+~~~~~~~~~~~~~~~
+
+Packet generator data definitions make use of a layered implementation
+strategy. Networking layers are specified in order, and the notation can
+seem a bit counter-intuitive. In the example above, the data definition
+stanza constructs a set of L2-L4 header layers, and uses an
+incrementing fill pattern to round out the requested 300-byte packets.
+
+- **IP4: 1.2.3 -> 4.5.6** - Construct an L2 (MAC) header with the ip4
+ ethertype (0x800), src MAC address of 00:01:00:02:00:03 and dst MAC
+ address of 00:04:00:05:00:06. Mac addresses may be specified in
+ either *xxxx.xxxx.xxxx* format or *xx:xx:xx:xx:xx:xx* format.
+
+- **UDP: 192.168.1.10 - 192.168.1.254 -> 192.168.2.10** - Construct an
+ incrementing set of L3 (IPv4) headers for successive packets with
+ source addresses ranging from .10 to .254. All packets in the stream
+ have a constant dest address of 192.168.2.10. Set the protocol field
+ to 17, UDP.
+
+- **UDP: 1234 -> 2345** - Set the UDP source and destination ports to
+ 1234 and 2345, respectively
+
+- **incrementing 286** - Insert up to 286 incrementing data bytes.
+
+Obvious variations involve “s/IP4/IP6/” in the above, along with
+changing from IPv4 to IPv6 address notation.
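+
+The MAC shorthand in the first line of the data stanza can be modelled
+as shown below. This is a standalone sketch of the *xxxx.xxxx.xxxx*
+notation, not the parser used by the packet generator;
+parse_dotted_mac is a hypothetical name.
+
```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Parse "xxxx.xxxx.xxxx" (three 16-bit hex groups) into 6 bytes.
   Returns 0 on success, -1 on malformed input. */
int
parse_dotted_mac (const char *s, uint8_t mac[6])
{
  unsigned g[3];
  char trailing;
  int i;

  assert (s != NULL && mac != NULL);
  /* Exactly three groups must convert; a fourth item means junk. */
  if (sscanf (s, "%x.%x.%x%c", &g[0], &g[1], &g[2], &trailing) != 3)
    return -1;

  for (i = 0; i < 3; i++)
    {
      if (g[i] > 0xffff)
        return -1;              /* each group holds at most 16 bits */
      mac[2 * i] = (uint8_t) (g[i] >> 8);
      mac[2 * i + 1] = (uint8_t) (g[i] & 0xff);
    }
  return 0;
}
```
+
+Given “1.2.3” this yields 00:01:00:02:00:03, the source MAC address
+described for the example stream.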
+
+The vpp pg can set any / all IPv4 header fields, including tos, packet
+length, mf / df / fragment id and offset, ttl, protocol, checksum, and
+src/dst addresses. Take a look at ../src/vnet/ip/ip[46]_pg.c for
+details.
+
+If all else fails, specify the entire packet data in hex:
+
+- **hex 0xabcd…** - copy hex data verbatim into the packet
+
+When replaying pcap files (“**pcap <filename>**”), do not specify a data
+stanza.
+
+Diagnosing “packet-generator new” parse failures
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you want to inject packets into a brand-new graph node, remember to
+tell the packet generator debug CLI how to parse the packet data stanza.
+
+If the node expects L2 Ethernet MAC headers, specify “.unformat_buffer =
+unformat_ethernet_header”:
+
+.. code:: c
+
+ VLIB_REGISTER_NODE (ethernet_input_node) =
+ {
+ <snip>
+ .unformat_buffer = unformat_ethernet_header,
+ <snip>
+ };
+
+Beyond that, it may be necessary to set breakpoints in
+…/src/vnet/pg/cli.c. Debug image suggested.
+
+When debugging new nodes, it may be far simpler to directly inject
+ethernet frames - and add a corresponding vlib_buffer_advance in the new
+node - than to modify the packet generator.
+
+Debug CLI
+---------
+
+The sections above describe the “packet-generator new” debug CLI in
+detail.
+
+Additional debug CLI commands include:
+
+::
+
+ vpp# packet-generator enable [<stream-name>]
+
+Enables the named stream, or all streams.
+
+::
+
+ vpp# packet-generator disable [<stream-name>]
+
+Disables the named stream, or all streams.
+
+::
+
+ vpp# packet-generator delete <stream-name>
+
+Deletes the named stream.
+
+::
+
+ vpp# packet-generator configure <stream-name> [limit <nnn>]
+ [rate <f64-pps>] [size <nn>-<nn>]
+
+Changes stream parameters without having to recreate the entire stream
+definition. Note that re-issuing a “packet-generator new” command will
+correctly recreate the named stream.