From 06dcd45ff81e06bc8cf40ed487c0b2652d346a5a Mon Sep 17 00:00:00 2001 From: John DeNisco Date: Thu, 26 Jul 2018 12:45:10 -0400 Subject: Initial commit of Sphinx docs Change-Id: I9fca8fb98502dffc2555f9de7f507b6f006e0e77 Signed-off-by: John DeNisco --- docs/gettingstarted/developers/bihash.md | 273 ++++++++++++ docs/gettingstarted/developers/building.rst | 151 +++++++ docs/gettingstarted/developers/featurearcs.md | 224 ++++++++++ docs/gettingstarted/developers/index.rst | 18 + docs/gettingstarted/developers/infrastructure.md | 330 ++++++++++++++ docs/gettingstarted/developers/plugins.md | 11 + .../developers/softwarearchitecture.md | 44 ++ docs/gettingstarted/developers/vlib.md | 496 +++++++++++++++++++++ docs/gettingstarted/developers/vnet.md | 171 +++++++ 9 files changed, 1718 insertions(+) create mode 100644 docs/gettingstarted/developers/bihash.md create mode 100644 docs/gettingstarted/developers/building.rst create mode 100644 docs/gettingstarted/developers/featurearcs.md create mode 100644 docs/gettingstarted/developers/index.rst create mode 100644 docs/gettingstarted/developers/infrastructure.md create mode 100644 docs/gettingstarted/developers/plugins.md create mode 100644 docs/gettingstarted/developers/softwarearchitecture.md create mode 100644 docs/gettingstarted/developers/vlib.md create mode 100644 docs/gettingstarted/developers/vnet.md (limited to 'docs/gettingstarted/developers') diff --git a/docs/gettingstarted/developers/bihash.md b/docs/gettingstarted/developers/bihash.md new file mode 100644 index 00000000000..3f53e7bbc3e --- /dev/null +++ b/docs/gettingstarted/developers/bihash.md @@ -0,0 +1,273 @@ +Bounded-index Extensible Hashing (bihash) +========================================= + +Vpp uses bounded-index extensible hashing to solve a variety of +exact-match (key, value) lookup problems. Benefits of the current +implementation: + +* Very high record count scaling, tested to 100,000,000 records. 
+* Lookup performance degrades gracefully as the number of records increases
+* No reader locking required
+* Template implementation, it's easy to support arbitrary (key,value) types
+
+Bounded-index extensible hashing has been widely used in databases for
+decades.
+
+Bihash uses a two-level data structure:
+
+```
+    +-----------------+
+    | bucket-0        |
+    |  log2_size      |
+    |  backing store  |
+    +-----------------+
+    | bucket-1        |
+    |  log2_size      |           +--------------------------------+
+    |  backing store  | --------> | KVP_PER_PAGE * key-value-pairs |
+    +-----------------+           | page 0                         |
+         ...                      +--------------------------------+
+    +-----------------+           | KVP_PER_PAGE * key-value-pairs |
+    | bucket-2**N-1   |           | page 1                         |
+    |  log2_size      |           +--------------------------------+
+    |  backing store  |                          ---
+    +-----------------+           +--------------------------------+
+                                  | KVP_PER_PAGE * key-value-pairs |
+                                  | page 2**(log2(size)) - 1       |
+                                  +--------------------------------+
+```
+
+Discussion of the algorithm
+---------------------------
+
+This structure has a couple of major advantages. In practice, each
+bucket entry fits into a 64-bit integer. Coincidentally, vpp's target
+CPU architectures support 64-bit atomic operations. When modifying the
+contents of a specific bucket, we do the following:
+
+* Make a working copy of the bucket's backing storage
+* Atomically swap a pointer to the working copy into the bucket array
+* Change the original backing store data
+* Atomically swap back to the original
+
+So, no reader locking is required to search a bihash table.
+
+At lookup time, the implementation computes a key hash code. We use
+the least-significant N bits of the hash to select the bucket.
+
+With the bucket in hand, we learn log2 (nBackingPages) for the
+selected bucket. At this point, we use the next log2_size bits from
+the hash code to select the specific backing page in which the
+(key,value) pair will be found.
+
+Net result: we search **one** backing page, not 2**log2_size
+pages.
+This is a key property of the algorithm.
+
+When sufficient collisions occur to fill the backing pages for a given
+bucket, we double the bucket size, rehash, and deal the bucket
+contents into a double-sized set of backing pages. In the future, we
+may represent the size as a linear combination of two powers-of-two,
+to increase space efficiency.
+
+To solve the "jackpot case" where a set of records collide under
+hashing in a bad way, the implementation will fall back to linear
+search across 2**log2_size backing pages on a per-bucket basis.
+
+To maintain *space* efficiency, we should configure the bucket array
+so that backing pages are effectively utilized. Lookup performance
+tends to change *very little* if the bucket array is too small or too
+large.
+
+Bihash depends on selecting an effective hash function. If one were to
+use a truly broken hash function such as "return 1ULL;", bihash would
+still work, but it would be equivalent to poorly-programmed linear
+search.
+
+We often use cpu intrinsic functions - think crc32 - to rapidly
+compute a hash code which has decent statistics.
+
+Bihash Cookbook
+---------------
+
+### Using current (key,value) template instance types
+
+It's quite easy to use one of the template instance types. As of this
+writing, .../src/vppinfra provides pre-built templates for 8, 16, 20,
+24, 40, and 48 byte keys, u8 * vector keys, and 8 byte values.
+
+See .../src/vppinfra/{bihash__8}.h
+
+To define the data types, #include a specific template instance, most
+often in a subsystem header file:
+
+```c
+    #include
+```
+
+If you're building a standalone application, you'll need to define the
+various functions by #including the method implementation file in a C
+source file.
+
+The core vpp engine currently uses most if not all of the known bihash
+types, so you probably won't need to #include the method
+implementation file.
+
+
+```c
+    #include
+```
+
+Add an instance of the selected bihash data structure to e.g.
+a "main_t" structure:
+
+```c
+    typedef struct
+    {
+      ...
+      BVT (clib_bihash) hash_table;
+      /* or, spelled out: clib_bihash_8_8_t hash_table; */
+      ...
+    } my_main_t;
+```
+
+The BV macro concatenates its argument with the value of the
+preprocessor symbol BIHASH_TYPE. The BVT macro concatenates its
+argument with the value of BIHASH_TYPE and the fixed-string "_t". So
+in the above example, BVT (clib_bihash) generates "clib_bihash_8_8_t".
+
+If you're sure you won't decide to change the template / type name
+later, it's perfectly OK to code "clib_bihash_8_8_t" and so forth.
+
+In fact, if you #include multiple template instances in a single
+source file, you **must** use fully-enumerated type names. The macros
+stand no chance of working.
+
+### Initializing a bihash table
+
+Call the init function as shown. As a rough guide, pick a number of
+buckets which is approximately
+number_of_expected_records/BIHASH_KVP_PER_PAGE from the relevant
+template instance header-file. See previous discussion.
+
+The amount of memory selected should easily contain all of the
+records, with a generous allowance for hash collisions. Bihash memory
+is allocated separately from the main heap, and won't cost anything
+except kernel PTEs until touched, so it's OK to be reasonably
+generous.
+
+For example:
+
+```c
+    my_main_t *mm = &my_main;
+    clib_bihash_8_8_t *h;
+
+    h = &mm->hash_table;
+
+    clib_bihash_init_8_8 (h, "test", (u32) number_of_buckets,
+                          (uword) memory_size);
+```
+
+### Add or delete a key/value pair
+
+Use BV(clib_bihash_add_del), or the explicit type variant:
+
+```c
+    clib_bihash_kv_8_8_t kv;
+    my_main_t *mm = &my_main;
+    clib_bihash_8_8_t *h;
+
+    h = &mm->hash_table;
+    kv.key = key_to_add_or_delete;
+    kv.value = value_to_add_or_delete;
+
+    clib_bihash_add_del_8_8 (h, &kv, is_add /* 1=add, 0=delete */);
+```
+
+In the delete case, kv.value is irrelevant. To change the value associated
+with an existing (key,value) pair, simply re-add the [new] pair.
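
The `BV(clib_bihash_add_del)` spelling above relies on preprocessor token pasting. The sketch below re-creates the idea in a few lines so that it compiles without vppinfra. The helper names (`BV_PASTE`, `BV_EXPAND`, `BVT_PASTE`, `BVT_EXPAND`, `STR`, and the `bv_demo_*` functions) are invented for this illustration, and the real definitions in the template headers may be written differently. It assumes `BIHASH_TYPE` is defined as `_8_8`, consistent with the `clib_bihash_8_8_t` example earlier.

```c
#include <assert.h>
#include <string.h>

/* Assumption: stand-in for the symbol a template instance header defines. */
#define BIHASH_TYPE _8_8

/* Two-level expansion: BIHASH_TYPE is replaced by its value before pasting. */
#define BV_PASTE(a, b) a##b
#define BV_EXPAND(a, b) BV_PASTE (a, b)
#define BV(a) BV_EXPAND (a, BIHASH_TYPE)

/* Same idea, with the fixed suffix "_t" appended for type names. */
#define BVT_PASTE(a, b) a##b##_t
#define BVT_EXPAND(a, b) BVT_PASTE (a, b)
#define BVT(a) BVT_EXPAND (a, BIHASH_TYPE)

/* Turn the expanded identifier into a string so it can be inspected. */
#define STR_INNER(x) #x
#define STR(x) STR_INNER (x)

/* Name produced by BV (clib_bihash_add_del). */
const char *
bv_demo_function_name (void)
{
  return STR (BV (clib_bihash_add_del));
}

/* Name produced by BVT (clib_bihash). */
const char *
bv_demo_type_name (void)
{
  return STR (BVT (clib_bihash));
}

/* Aborts (in a debug build) if the expansions differ from the text. */
void
bv_demo_selftest (void)
{
  assert (strcmp (bv_demo_function_name (), "clib_bihash_add_del_8_8") == 0);
  assert (strcmp (bv_demo_type_name (), "clib_bihash_8_8_t") == 0);
}
```

The extra level of indirection matters: pasting directly inside `BV` would glue the literal token `BIHASH_TYPE` onto the name instead of its value. It also suggests why the macros cannot serve two template instances in one source file, since only one value of `BIHASH_TYPE` can be in effect at any point.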
+
+### Simple search
+
+The simplest possible (key, value) search goes like so:
+
+```c
+    clib_bihash_kv_8_8_t search_kv, return_kv;
+    my_main_t *mm = &my_main;
+    clib_bihash_8_8_t *h;
+
+    h = &mm->hash_table;
+    search_kv.key = key_to_find;
+
+    if (clib_bihash_search_8_8 (h, &search_kv, &return_kv) < 0)
+      key_not_found ();
+    else
+      key_found ();
+```
+
+Note that it's perfectly fine to collect the lookup result in the
+search (key,value) structure itself:
+
+```c
+    if (clib_bihash_search_8_8 (h, &search_kv, &search_kv))
+      key_not_found ();
+    /* etc. */
+```
+
+### Bihash vector processing
+
+When processing a vector of packets which need a certain lookup
+performed, it's worth the trouble to compute the key hash, and
+prefetch the correct bucket ahead of time.
+
+Here's a sketch of one way to write the required code:
+
+Dual-loop:
+* 6 packets ahead, prefetch 2x vlib_buffer_t's and 2x packet data
+  required to form the record keys
+* 4 packets ahead, form 2x record keys and call BV(clib_bihash_hash)
+  or the explicit hash function to calculate the record hashes.
+  Call 2x BV(clib_bihash_prefetch_bucket) to prefetch the buckets
+* 2 packets ahead, call 2x BV(clib_bihash_prefetch_data) to prefetch
+  2x (key,value) data pages.
+* In the processing section, call 2x BV(clib_bihash_search_inline_with_hash)
+  to perform the search
+
+Programmer's choice whether to stash the hash code somewhere in
+vnet_buffer(b) metadata, or to use local variables.
+
+Single-loop:
+* Use simple search as shown above.
+
+### Walking a bihash table
+
+A fairly common scenario to build "show" commands involves walking a
+bihash table. It's simple enough:
+
+```c
+    my_main_t *mm = &my_main;
+    clib_bihash_8_8_t *h;
+    void callback_fn (clib_bihash_kv_8_8_t *, void *);
+
+    h = &mm->hash_table;
+
+    BV(clib_bihash_foreach_key_value_pair) (h, callback_fn, (void *) arg);
+```
+
+To nobody's great surprise: clib_bihash_foreach_key_value_pair
+iterates across the entire table, calling callback_fn with active
+entries.
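
As an illustration of the walk described above, here is a minimal sketch of a callback that accumulates statistics through the opaque argument. It is written against stand-in types so that it compiles without vpp headers: `toy_kv_t`, `toy_walk_stats_t`, `toy_count_cb`, `toy_foreach_key_value_pair` and the `TOY_FREE_KEY` marker are all hypothetical, and the walker is a plain array loop, not the bihash implementation. Only the callback shape (a key/value pointer plus a `void *` argument) is taken from the prototype shown in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for clib_bihash_kv_8_8_t: one 8-byte key, one 8-byte value. */
typedef struct
{
  uint64_t key;
  uint64_t value;
} toy_kv_t;

/* Same shape as the callback prototype in the text. */
typedef void (*toy_walk_cb_t) (toy_kv_t * kv, void *arg);

/* Hypothetical accumulator handed to the callback via the opaque arg. */
typedef struct
{
  uint64_t n_entries;
  uint64_t value_sum;
} toy_walk_stats_t;

/* Assumption of this sketch: an all-ones key marks a free slot. */
#define TOY_FREE_KEY (~0ULL)

/* Callback: invoked once per active entry. */
void
toy_count_cb (toy_kv_t * kv, void *arg)
{
  toy_walk_stats_t *stats = arg;

  stats->n_entries++;
  stats->value_sum += kv->value;
}

/* Stand-in walker: visit every active slot of a flat array. */
void
toy_foreach_key_value_pair (toy_kv_t * kvs, size_t n_slots,
			    toy_walk_cb_t cb, void *arg)
{
  assert (cb != NULL);

  for (size_t i = 0; i < n_slots; i++)
    if (kvs[i].key != TOY_FREE_KEY)
      cb (&kvs[i], arg);
}
```

A "show" command handler would typically zero the accumulator, run the walk, and then print the totals. Keeping the state in a caller-owned structure rather than in globals lets several walks run independently.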
+ +### Creating a new template instance + +Creating a new template is easy. Use one of the existing templates as +a model, and make the obvious changes. The hash and key_compare +methods are performance-critical in multiple senses. + +If the key compare method is slow, every lookup will be slow. If the +hash function is slow, same story. If the hash function has poor +statistical properties, space efficiency will suffer. In the limit, a +bad enough hash function will cause large portions of the table to +revert to linear search. + +Use of the best available vector unit is well worth the trouble in the +hash and key_compare functions. diff --git a/docs/gettingstarted/developers/building.rst b/docs/gettingstarted/developers/building.rst new file mode 100644 index 00000000000..18fa943a6fb --- /dev/null +++ b/docs/gettingstarted/developers/building.rst @@ -0,0 +1,151 @@ +.. _building: + +.. toctree:: + +Building VPP +============ + +To get started developing with VPP you need to get the sources and build the packages. + +.. _setupproxies: + +Set up Proxies +-------------- + +Depending on the environment, proxies may need to be set. +You may run these commands: + +.. code-block:: console + + $ export http_proxy=http://.com: + $ export https_proxy=https://.com: + + +Get the VPP Sources +------------------- + +To get the VPP sources and get ready to build execute the following: + +.. code-block:: console + + $ git clone https://gerrit.fd.io/r/vpp + $ cd vpp + +Build VPP Dependencies +---------------------- + +Before building, make sure there are no FD.io VPP or DPDK packages installed by entering the following +commands: + +.. code-block:: console + + $ dpkg -l | grep vpp + $ dpkg -l | grep DPDK + +There should be no output, or packages showing after each of the above commands. + +Run this to install the dependencies for FD.io VPP. +If it hangs during downloading at any point, you may need to set up :ref:`proxies for this to work `. + +.. 
code-block:: console + + $ make install-dep + Hit:1 http://us.archive.ubuntu.com/ubuntu xenial InRelease + Get:2 http://us.archive.ubuntu.com/ubuntu xenial-updates InRelease [109 kB] + Get:3 http://security.ubuntu.com/ubuntu xenial-security InRelease [107 kB] + Get:4 http://us.archive.ubuntu.com/ubuntu xenial-backports InRelease [107 kB] + Get:5 http://us.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages [803 kB] + Get:6 http://us.archive.ubuntu.com/ubuntu xenial-updates/main i386 Packages [732 kB] + ... + ... + Update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/jmap to provide /usr/bin/jmap (jmap) in auto mode + Setting up default-jdk-headless (2:1.8-56ubuntu2) ... + Processing triggers for libc-bin (2.23-0ubuntu3) ... + Processing triggers for systemd (229-4ubuntu6) ... + Processing triggers for ureadahead (0.100.0-19) ... + Processing triggers for ca-certificates (20160104ubuntu1) ... + Updating certificates in /etc/ssl/certs... + 0 added, 0 removed; done. + Running hooks in /etc/ca-certificates/update.d... + + done. + done. + +Build VPP (Debug Mode) +---------------------- + +This build version contains debug symbols which is useful to modify VPP. The command below will build debug version of VPP. +This build will come with /build-root/vpp_debug-native. + +.. code-block:: console + + $ make build + make[1]: Entering directory '/home/vagrant/vpp-master/build-root' + @@@@ Arch for platform 'vpp' is native @@@@ + @@@@ Finding source for dpdk @@@@ + @@@@ Makefile fragment found in /home/vagrant/vpp-master/build-data/packages/dpdk.mk @@@@ + @@@@ Source found in /home/vagrant/vpp-master/dpdk @@@@ + @@@@ Arch for platform 'vpp' is native @@@@ + @@@@ Finding source for vpp @@@@ + @@@@ Makefile fragment found in /home/vagrant/vpp-master/build-data/packages/vpp.mk @@@@ + @@@@ Source found in /home/vagrant/vpp-master/src @@@@ + ... + ... 
+    make[5]: Leaving directory '/home/vagrant/vpp-master/build-root/build-vpp_debug-native/vpp/vpp-api/java'
+    make[4]: Leaving directory '/home/vagrant/vpp-master/build-root/build-vpp_debug-native/vpp/vpp-api/java'
+    make[3]: Leaving directory '/home/vagrant/vpp-master/build-root/build-vpp_debug-native/vpp'
+    make[2]: Leaving directory '/home/vagrant/vpp-master/build-root/build-vpp_debug-native/vpp'
+    @@@@ Installing vpp: nothing to do @@@@
+    make[1]: Leaving directory '/home/vagrant/vpp-master/build-root'
+
+Build VPP (Release Version)
+---------------------------
+
+Use the command below to build the release version of FD.io VPP.
+This build is optimized and will not create debug symbols.
+The build output is placed in /build-root/build-vpp-native.
+
+.. code-block:: console
+
+    $ make release
+
+
+Building Necessary Packages
+---------------------------
+
+To build the packages, run one of the following commands, depending on the system:
+
+Building Debian Packages
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: console
+
+    $ make pkg-deb
+
+
+Building RPM Packages
+^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: console
+
+    $ make pkg-rpm
+
+The packages will be found in the build-root directory.
+
+.. code-block:: console
+
+    $ ls *.deb
+
+    If the packages built correctly, this should be the output:
+
+    vpp_18.07-rc0~456-gb361076_amd64.deb             vpp-dbg_18.07-rc0~456-gb361076_amd64.deb
+    vpp-api-java_18.07-rc0~456-gb361076_amd64.deb    vpp-dev_18.07-rc0~456-gb361076_amd64.deb
+    vpp-api-lua_18.07-rc0~456-gb361076_amd64.deb     vpp-lib_18.07-rc0~456-gb361076_amd64.deb
+    vpp-api-python_18.07-rc0~456-gb361076_amd64.deb  vpp-plugins_18.07-rc0~456-gb361076_amd64.deb
+
+Built packages end up in the build-root directory. Finally, the command below installs all built packages.
+
+.. 
code-block:: console + + $ sudo bash + # dpkg -i *.deb diff --git a/docs/gettingstarted/developers/featurearcs.md b/docs/gettingstarted/developers/featurearcs.md new file mode 100644 index 00000000000..f1e3ec47d05 --- /dev/null +++ b/docs/gettingstarted/developers/featurearcs.md @@ -0,0 +1,224 @@ +Feature Arcs +============ + +A significant number of vpp features are configurable on a per-interface +or per-system basis. Rather than ask feature coders to manually +construct the required graph arcs, we built a general mechanism to +manage these mechanics. + +Specifically, feature arcs comprise ordered sets of graph nodes. Each +feature node in an arc is independently controlled. Feature arc nodes +are generally unaware of each other. Handing a packet to "the next +feature node" is quite inexpensive. + +The feature arc implementation solves the problem of creating graph arcs +used for steering. + +At the beginning of a feature arc, a bit of setup work is needed, but +only if at least one feature is enabled on the arc. + +On a per-arc basis, individual feature definitions create a set of +ordering dependencies. Feature infrastructure performs a topological +sort of the ordering dependencies, to determine the actual feature +order. Missing dependencies **will** lead to runtime disorder. See + for an example. + +If no partial order exists, vpp will refuse to run. Circular dependency +loops of the form "a then b, b then c, c then a" are impossible to +satisfy. + +Adding a feature to an existing feature arc +------------------------------------------- + +To nobody's great surprise, we set up feature arcs using the typical +"macro -> constructor function -> list of declarations" pattern: + +```c + VNET_FEATURE_INIT (mactime, static) = + { + .arc_name = "device-input", + .node_name = "mactime", + .runs_before = VNET_FEATURES ("ethernet-input"), + }; +``` + +This creates a "mactime" feature on the "device-input" arc. 
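
The `.runs_before` constraint in the declaration above is one input to the topological sort mentioned earlier. The following is a minimal sketch of that idea using Kahn's algorithm, not vpp's actual implementation: features are identified by small integers, and the names `feature_order_t`, `feature_order_init`, `feature_runs_before`, `feature_order_sort` and `MAX_FEATURES` are hypothetical. It shows both outcomes described in the text: a usable order when the constraints are consistent, and a failure when they form a loop such as "a then b, b then c, c then a".

```c
#include <assert.h>
#include <string.h>

#define MAX_FEATURES 16

/* before[a][b] != 0 means "feature a must run before feature b". */
typedef struct
{
  int n;
  unsigned char before[MAX_FEATURES][MAX_FEATURES];
} feature_order_t;

/* Start with n features and no ordering constraints. */
void
feature_order_init (feature_order_t * fo, int n)
{
  assert (n >= 0 && n <= MAX_FEATURES);
  memset (fo, 0, sizeof (*fo));
  fo->n = n;
}

/* Record the constraint "a runs before b". */
void
feature_runs_before (feature_order_t * fo, int a, int b)
{
  assert (a >= 0 && a < fo->n && b >= 0 && b < fo->n);
  fo->before[a][b] = 1;
}

/* Kahn's algorithm. Writes one valid order into out[] and returns 0,
   or returns -1 if the constraints contain a cycle. */
int
feature_order_sort (const feature_order_t * fo, int *out)
{
  int indegree[MAX_FEATURES] = { 0 };
  int done[MAX_FEATURES] = { 0 };
  int n_out = 0;

  /* Count, for each feature, how many features must precede it. */
  for (int a = 0; a < fo->n; a++)
    for (int b = 0; b < fo->n; b++)
      if (fo->before[a][b])
	indegree[b]++;

  while (n_out < fo->n)
    {
      int picked = -1;

      /* Pick the lowest-numbered feature with no unmet predecessors. */
      for (int i = 0; i < fo->n; i++)
	if (!done[i] && indegree[i] == 0)
	  {
	    picked = i;
	    break;
	  }

      if (picked < 0)
	return -1;		/* every remaining feature waits on another */

      done[picked] = 1;
      out[n_out++] = picked;

      /* The picked feature no longer blocks its successors. */
      for (int b = 0; b < fo->n; b++)
	if (fo->before[picked][b])
	  indegree[b]--;
    }
  return 0;
}
```

Features with no constraint between them may come out in any relative order that the tie-break happens to produce, which is one way to read the warning that missing dependencies lead to runtime disorder.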
+ +Once per frame, dig up the vnet\_feature\_config\_main\_t corresponding +to the "device-input" feature arc: + +```c + vnet_main_t *vnm = vnet_get_main (); + vnet_interface_main_t *im = &vnm->interface_main; + u8 arc = im->output_feature_arc_index; + vnet_feature_config_main_t *fcm; + + fcm = vnet_feature_get_config_main (arc); +``` + +Note that in this case, we've stored the required arc index - assigned +by the feature infrastructure - in the vnet\_interface\_main\_t. Where +to put the arc index is a programmer's decision when creating a feature +arc. + +Per packet, set next0 to steer packets to the next node they should +visit: + +```c + vnet_get_config_data (&fcm->config_main, + &b0->current_config_index /* value-result */, + &next0, 0 /* # bytes of config data */); +``` + +Configuration data is per-feature arc, and is often unused. Note that +it's normal to reset next0 to divert packets elsewhere; often, to drop +them for cause: + +```c + next0 = MACTIME_NEXT_DROP; + b0->error = node->errors[DROP_CAUSE]; +``` + +Creating a feature arc +---------------------- + +Once again, we create feature arcs using constructor macros: + +```c + VNET_FEATURE_ARC_INIT (ip4_unicast, static) = + { + .arc_name = "ip4-unicast", + .start_nodes = VNET_FEATURES ("ip4-input", "ip4-input-no-checksum"), + .arc_index_ptr = &ip4_main.lookup_main.ucast_feature_arc_index, + }; +``` + +In this case, we configure two arc start nodes to handle the +"hardware-verified ip checksum or not" cases. During initialization, +the feature infrastructure stores the arc index as shown. 
+ +In the head-of-arc node, do the following to send packets along the +feature arc: + +```c + ip_lookup_main_t *lm = &im->lookup_main; + arc = lm->ucast_feature_arc_index; +``` + +Once per packet, initialize packet metadata to walk the feature arc: + +```c +vnet_feature_arc_start (arc, sw_if_index0, &next, b0); +``` + +Enabling / Disabling features +----------------------------- + +Simply call vnet_feature_enable_disable to enable or disable a specific +feature: + +```c + vnet_feature_enable_disable ("device-input", /* arc name */ + "mactime", /* feature name */ + sw_if_index, /* Interface sw_if_index */ + enable_disable, /* 1 => enable */ + 0 /* (void *) feature_configuration */, + 0 /* feature_configuration_nbytes */); +``` + +The feature_configuration opaque is seldom used. + +If you wish to make a feature a _de facto_ system-level concept, pass +sw_if_index=0 at all times. Sw_if_index 0 is always valid, and +corresponds to the "local" interface. + +Related "show" commands +----------------------- + +To display the entire set of features, use "show features [verbose]". 
+The verbose form displays arc indices, and feature indices within the arcs:
+
+```
+$ vppctl show features verbose
+Available feature paths
+
+[14] ip4-unicast:
+  [ 0]: nat64-out2in-handoff
+  [ 1]: nat64-out2in
+  [ 2]: nat44-ed-hairpin-dst
+  [ 3]: nat44-hairpin-dst
+  [ 4]: ip4-dhcp-client-detect
+  [ 5]: nat44-out2in-fast
+  [ 6]: nat44-in2out-fast
+  [ 7]: nat44-handoff-classify
+  [ 8]: nat44-out2in-worker-handoff
+  [ 9]: nat44-in2out-worker-handoff
+  [10]: nat44-ed-classify
+  [11]: nat44-ed-out2in
+  [12]: nat44-ed-in2out
+  [13]: nat44-det-classify
+  [14]: nat44-det-out2in
+  [15]: nat44-det-in2out
+  [16]: nat44-classify
+  [17]: nat44-out2in
+  [18]: nat44-in2out
+  [19]: ip4-qos-record
+  [20]: ip4-vxlan-gpe-bypass
+  [21]: ip4-reassembly-feature
+  [22]: ip4-not-enabled
+  [23]: ip4-source-and-port-range-check-rx
+  [24]: ip4-flow-classify
+  [25]: ip4-inacl
+  [26]: ip4-source-check-via-rx
+  [27]: ip4-source-check-via-any
+  [28]: ip4-policer-classify
+  [29]: ipsec-input-ip4
+  [30]: vpath-input-ip4
+  [31]: ip4-vxlan-bypass
+  [32]: ip4-lookup
+
+```
+
+Here, we learn that the ip4-unicast feature arc has index 14, and that
+e.g. ip4-inacl has index 25 in the generated partial order.
+
+To display the features currently active on a specific interface,
+use "show interface features":
+
+```
+$ vppctl show interface GigabitEthernet3/0/0 features
+Feature paths configured on GigabitEthernet3/0/0...
+
+ip4-unicast:
+  nat44-out2in
+
+```
+
+Table of Feature Arcs
+---------------------
+
+Simply search for name-strings to track down the arc definition, location of
+the arc index, etc.
+ +``` + | Arc Name | + |------------------| + | device-input | + | ethernet-output | + | interface-output | + | ip4-drop | + | ip4-local | + | ip4-multicast | + | ip4-output | + | ip4-punt | + | ip4-unicast | + | ip6-drop | + | ip6-local | + | ip6-multicast | + | ip6-output | + | ip6-punt | + | ip6-unicast | + | mpls-input | + | mpls-output | + | nsh-output | +``` diff --git a/docs/gettingstarted/developers/index.rst b/docs/gettingstarted/developers/index.rst new file mode 100644 index 00000000000..cccb18d731a --- /dev/null +++ b/docs/gettingstarted/developers/index.rst @@ -0,0 +1,18 @@ +.. _gstarteddevel: + +########## +Developers +########## + +.. toctree:: + :maxdepth: 2 + + building + softwarearchitecture + infrastructure + vlib + plugins + vnet + featurearcs + bihash + diff --git a/docs/gettingstarted/developers/infrastructure.md b/docs/gettingstarted/developers/infrastructure.md new file mode 100644 index 00000000000..688c42133ed --- /dev/null +++ b/docs/gettingstarted/developers/infrastructure.md @@ -0,0 +1,330 @@ +VPPINFRA (Infrastructure) +========================= + +The files associated with the VPP Infrastructure layer are located in +the ./src/vppinfra folder. + +VPPinfra is a collection of basic c-library services, quite +sufficient to build standalone programs to run directly on bare metal. +It also provides high-performance dynamic arrays, hashes, bitmaps, +high-precision real-time clock support, fine-grained event-logging, and +data structure serialization. + +One fair comment / fair warning about vppinfra: you can\'t always tell a +macro from an inline function from an ordinary function simply by name. +Macros are used to avoid function calls in the typical case, and to +cause (intentional) side-effects. + +Vppinfra has been around for almost 20 years and tends not to change +frequently. 
+The VPP Infrastructure layer contains the following
+functions:
+
+Vectors
+-------
+
+Vppinfra vectors are ubiquitous dynamically resized arrays with optional
+user-defined \"headers\". Many vppinfra data structures (e.g. hash, heap,
+pool) are vectors with various different headers.
+
+The memory layout looks like this:
+
+```
+                   User header (optional, uword aligned)
+                   Alignment padding (if needed)
+                   Vector length in elements
+ User's pointer -> Vector element 0
+                   Vector element 1
+                   ...
+                   Vector element N-1
+```
+
+As shown above, the vector APIs deal with pointers to the 0th element of
+a vector. Null pointers are valid vectors of length zero.
+
+To avoid thrashing the memory allocator, one often resets the length of
+a vector to zero while retaining the memory allocation. Set the vector
+length field to zero via the vec\_reset\_length(v) macro. \[Use the
+macro! It's smart about NULL pointers.\]
+
+Typically, the user header is not present. User headers allow for other
+data structures to be built atop vppinfra vectors. Users may specify the
+alignment for data elements via the vec\_\*\_aligned macros.
+
+Vector elements can be any C type (e.g. int, double, struct bar). This
+is also true for data types built atop vectors (e.g. heap, pool, etc.).
+Many macros have \_a variants supporting alignment of vector data and
+\_h variants supporting non-zero-length vector headers. The \_ha
+variants support both.
+
+Inconsistent usage of header and/or alignment related macro variants
+will cause delayed, confusing failures.
+
+Standard programming error: memorize a pointer to the ith element of a
+vector, and then expand the vector. Vectors expand by 3/2, so such code
+may appear to work for a period of time. Correct code almost always
+memorizes vector **indices** which are invariant across reallocations.
+
+In typical application images, one supplies a set of global functions
+designed to be called from gdb.
+Here are a few examples:
+
+- vl(v) - prints vec\_len(v)
+- pe(p) - prints pool\_elts(p)
+- pifi(p, index) - prints pool\_is\_free\_index(p, index)
+- debug\_hex\_bytes (p, nbytes) - hex memory dump nbytes starting at p
+
+Use the "show gdb" debug CLI command to print the current set.
+
+Bitmaps
+-------
+
+Vppinfra bitmaps are dynamic, built using the vppinfra vector APIs.
+Quite handy for a variety of jobs.
+
+Pools
+-----
+
+Vppinfra pools combine vectors and bitmaps to rapidly allocate and free
+fixed-size data structures with independent lifetimes. Pools are perfect
+for allocating per-session structures.
+
+Hashes
+------
+
+Vppinfra provides several hash flavors. Data plane problems involving
+packet classification / session lookup often use
+./src/vppinfra/bihash\_template.\[ch\] bounded-index extensible
+hashes. These templates are instantiated multiple times, to efficiently
+service different fixed-key sizes.
+
+Bihashes are thread-safe. Read-locking is not required. A simple
+spin-lock ensures that only one thread writes an entry at a time.
+
+The original vppinfra hash implementation in
+./src/vppinfra/hash.\[ch\] is simple to use, and is often used in
+control-plane code which needs exact-string-matching.
+
+In either case, one almost always looks up a key in a hash table to
+obtain an index in a related vector or pool. The APIs are simple enough,
+but one must take care when using the unmanaged arbitrary-sized key
+variant. Hash\_set\_mem (hash\_table, key\_pointer, value) memorizes
+key\_pointer. It is usually a bad mistake to pass the address of a
+vector element as the second argument to hash\_set\_mem. It is perfectly
+fine to memorize constant string addresses in the text segment.
+
+Format
+------
+
+Vppinfra format is roughly equivalent to printf.
+
+Format has a few properties worth mentioning. Format's first argument is
+a (u8 \*) vector to which it appends the result of the current format
+operation.
+Chaining calls is very easy:
+
+```c
+    u8 * result;
+
+    result = format (0, "junk = %d, ", junk);
+    result = format (result, "more junk = %d\n", more_junk);
+```
+
+As previously noted, NULL pointers are perfectly proper 0-length
+vectors. Format returns a (u8 \*) vector, **not** a C-string. If you
+wish to print a (u8 \*) vector, use the "%v" format string. If you need
+a (u8 \*) vector which is also a proper C-string, either of these
+schemes may be used:
+
+```c
+    vec_add1 (result, 0);
+    /* or */
+    result = format (result, "%c", 0);
+```
+
+Remember to vec\_free() the result if appropriate. Be careful not to
+pass format an uninitialized (u8 \*).
+
+Format implements a particularly handy user-format scheme via the "%U"
+format specification. For example:
+
+```c
+    u8 * format_junk (u8 * s, va_list *va)
+    {
+      /* The argument type must match what the caller passed. */
+      char *junk = va_arg (*va, char *);
+      s = format (s, "%s", junk);
+      return s;
+    }
+
+    result = format (0, "junk = %U", format_junk, "This is some junk");
+```
+
+format\_junk() can invoke other user-format functions if desired. The
+programmer shoulders responsibility for argument type-checking. It is
+typical for user format functions to blow up if the va\_arg(\*va,
+type) macros don't match the caller's idea of reality.
+
+Unformat
+--------
+
+Vppinfra unformat is vaguely related to scanf, but considerably more
+general.
+
+A typical use case involves initializing an unformat\_input\_t from
+either a C-string or a (u8 \*) vector, then parsing via unformat() as
+follows:
+
+```c
+    unformat_input_t input;
+
+    unformat_init_string (&input, "");
+    /* or */
+    unformat_init_vector (&input, );
+```
+
+Then loop parsing individual elements:
+
+```c
+    while (unformat_check_input (&input) != UNFORMAT_END_OF_INPUT)
+      {
+        if (unformat (&input, "value1 %d", &value1))
+          ; /* unformat sets value1 */
+        else if (unformat (&input, "value2 %d", &value2))
+          ; /* unformat sets value2 */
+        else
+          return clib_error_return (0, "unknown input '%U'",
+                                    format_unformat_error, &input);
+      }
+```
+
+As with format, unformat implements a user-unformat function capability
+via a "%U" user unformat function scheme.
+
+Vppinfra errors and warnings
+----------------------------
+
+Many functions within the vpp dataplane have return-values of type
+clib\_error\_t \*. Clib\_error\_t's are arbitrary strings with a bit of
+metadata \[fatal, warning\] and are easy to announce. Returning a NULL
+clib\_error\_t \* indicates "A-OK, no error."
+
+Clib\_warning(format-args) is a handy way to add debugging
+output; clib warnings prepend function:line info to unambiguously locate
+the message source. Clib\_unix\_warning() adds perror()-style Linux
+system-call information. In production images, clib\_warnings result in
+syslog entries.
+
+Serialization
+-------------
+
+Vppinfra serialization support allows the programmer to easily serialize
+and unserialize complex data structures.
+
+The underlying primitive serialize/unserialize functions use network
+byte-order, so there are no structural issues serializing on a
+little-endian host and unserializing on a big-endian host.
+
+Event-logger, graphical event log viewer
+----------------------------------------
+
+The vppinfra event logger provides very lightweight (sub-100ns)
+precisely time-stamped event-logging services.
See +./src/vppinfra/{elog.c, elog.h} + +Serialization support makes it easy to save and ultimately to combine a +set of event logs. In a distributed system running NTP over a local LAN, +we find that event logs collected from multiple system elements can be +combined with a temporal uncertainty no worse than 50us. + +A typical event definition and logging call looks like this: + +```c + ELOG_TYPE_DECLARE (e) = + { + .format = "tx-msg: stream %d local seq %d attempt %d", + .format_args = "i4i4i4", + }; + struct { u32 stream_id, local_sequence, retry_count; } * ed; + ed = ELOG_DATA (m->elog_main, e); + ed->stream_id = stream_id; + ed->local_sequence = local_sequence; + ed->retry_count = retry_count; +``` + +The ELOG\_DATA macro returns a pointer to 20 bytes worth of arbitrary +event data, to be formatted (offline, not at runtime) as described by +format\_args. Aside from obvious integer formats, the CLIB event logger +provides a couple of interesting additions. The "t4" format +pretty-prints enumerated values: + +```c + ELOG_TYPE_DECLARE (e) = + { + .format = "get_or_create: %s", + .format_args = "t4", + .n_enum_strings = 2, + .enum_strings = { "old", "new", }, + }; +``` + +The "t" format specifier indicates that the corresponding datum is an +index in the event's set of enumerated strings, as shown in the previous +event type definition. + +The “T” format specifier indicates that the corresponding datum is an +index in the event log’s string heap. This allows the programmer to emit +arbitrary formatted strings. One often combines this facility with a +hash table to keep the event-log string heap from growing arbitrarily +large. + +Noting the 20-octet limit per-log-entry data field, the event log +formatter supports arbitrary combinations of these data types. 
As in:
+the ".format\_args" field may contain one or more instances of the following:
+
+- i1 - 8-bit unsigned integer
+- i2 - 16-bit unsigned integer
+- i4 - 32-bit unsigned integer
+- i8 - 64-bit unsigned integer
+- f4 - float
+- f8 - double
+- s - NULL-terminated string - be careful
+- sN - N-byte character array
+- t1,2,4 - per-event enumeration ID
+- T4 - Event-log string table offset
+
+The vpp engine event log is thread-safe, and is shared by all threads.
+Take care not to serialize the computation. Although the event-logger is
+about as fast as practicable, it's not appropriate for per-packet use in
+hard-core data plane code. It's most appropriate for capturing rare
+events - link up-down events, specific control-plane events and so
+forth.
+
+The vpp engine has several debug CLI commands for manipulating its event
+log:
+
+```
+    vpp# event-logger clear
+    vpp# event-logger save # for security, writes into /tmp/.
+                           # must not contain '.' or '/' characters
+    vpp# show event-logger [all] [] # display the event log
+                                    # by default, the last 250 entries
+```
+
+The event log defaults to 128K entries. The command-line argument "...
+vlib { elog-events nnn } ..." configures the size of the event log.
+
+As described above, the vpp engine event log is thread-safe and shared.
+To avoid confusing non-appearance of events logged by worker threads,
+make sure to code vlib\_global\_main.elog\_main - instead of
+vm->elog\_main. The latter form is correct in the main thread, but
+will almost certainly produce bad results in worker threads.
+
+G2 graphical event viewer
+-------------------------
+
+The g2 graphical event viewer can display serialized vppinfra event logs
+directly, or via the c2cpel tool.
+
+
+ +Todo: please convert wiki page and figures + +
+
diff --git a/docs/gettingstarted/developers/plugins.md b/docs/gettingstarted/developers/plugins.md
new file mode 100644
index 00000000000..ba3a2446306
--- /dev/null
+++ b/docs/gettingstarted/developers/plugins.md
@@ -0,0 +1,11 @@
+
+Plugins
+=======
+
+vlib implements a straightforward plug-in DLL mechanism. VLIB client
+applications specify a directory to search for plug-in .DLLs, and a name
+filter to apply (if desired). VLIB needs to load plug-ins very early.
+
+Once loaded, the plug-in DLL mechanism uses dlsym to find and verify a
+vlib\_plugin\_registration data structure in the newly-loaded plug-in.
+
diff --git a/docs/gettingstarted/developers/softwarearchitecture.md b/docs/gettingstarted/developers/softwarearchitecture.md
new file mode 100644
index 00000000000..a663134cd46
--- /dev/null
+++ b/docs/gettingstarted/developers/softwarearchitecture.md
@@ -0,0 +1,44 @@
+Software Architecture
+=====================
+
+The fd.io vpp implementation is a third-generation vector packet
+processing implementation specifically related to US Patent 7,961,636,
+as well as earlier work. Note that the Apache-2 license specifically
+grants non-exclusive patent licenses; we mention this patent as a point
+of historical interest.
+
+For performance, the vpp dataplane consists of a directed graph of
+forwarding nodes which process multiple packets per invocation. This
+schema enables a variety of micro-processor optimizations: pipelining
+and prefetching to cover dependent read latency, inherent I-cache phase
+behavior, vector instructions. Aside from hardware input and hardware
+output nodes, the entire forwarding graph is portable code.
+
+Depending on the scenario at hand, we often spin up multiple worker
+threads which process ingress-hashed packets from multiple queues using
+identical forwarding graph replicas.
+
+VPP Layers - Implementation Taxonomy
+------------------------------------
+
+![image](/_images/VPP_Layering.png)
+
+- VPP Infra - the VPP infrastructure layer, which contains the core
+  library source code. This layer performs memory functions, works
+  with vectors and rings, performs key lookups in hash tables, and
+  works with timers for dispatching graph nodes.
+- VLIB - the vector processing library. The vlib layer also handles
+  various application management functions: buffer, memory and graph
+  node management, maintaining and exporting counters, thread
+  management, packet tracing. Vlib implements the debug CLI (command
+  line interface).
+- VNET - works with VPP's networking interface (layers 2, 3, and 4),
+  performs session and traffic management, and works with devices and
+  the data control plane.
+- Plugins - Contains an increasingly rich set of data-plane plugins,
+  as noted in the above diagram.
+- VPP - the container application linked against all of the above.
+
+It's important to understand each of these layers in a certain amount of
+detail. Much of the implementation is best dealt with at the API level
+and otherwise left alone.
diff --git a/docs/gettingstarted/developers/vlib.md b/docs/gettingstarted/developers/vlib.md
new file mode 100644
index 00000000000..9ef37fd2657
--- /dev/null
+++ b/docs/gettingstarted/developers/vlib.md
@@ -0,0 +1,496 @@
+
+VLIB (Vector Processing Library)
+================================
+
+The files associated with vlib are located in the ./src/{vlib,
+vlibapi, vlibmemory} folders. These libraries provide vector
+processing support including graph-node scheduling, reliable multicast
+support, ultra-lightweight cooperative multi-tasking threads, a CLI,
+plug-in .DLL support, physical memory and Linux epoll support. Parts of
+this library embody US Patent 7,961,636.
+
+Init function discovery
+-----------------------
+
+vlib applications register for various \[initialization\] events by
+placing structures and \_\_attribute\_\_((constructor)) functions into
+the image. At appropriate times, the vlib framework walks
+constructor-generated singly-linked structure lists, calling the
+indicated functions. vlib applications create graph nodes, add CLI
+functions, start cooperative multi-tasking threads, etc. etc. using this
+mechanism.
+
+vlib applications invariably include a number of VLIB\_INIT\_FUNCTION
+(my\_init\_function) macros.
+
+Each init / configure / etc. function has the return type clib\_error\_t
+\*. Make sure that the function returns 0 if all is well, otherwise the
+framework will announce an error and exit.
+
+vlib applications must link against vppinfra, and often link against
+other libraries such as VNET. In the latter case, it may be necessary to
+explicitly reference symbol(s) otherwise large portions of the library
+may be AWOL at runtime.
+
+Node Graph Initialization
+-------------------------
+
+vlib packet-processing applications invariably define a set of graph
+nodes to process packets.
+
+One constructs a vlib\_node\_registration\_t, most often via the
+VLIB\_REGISTER\_NODE macro. At runtime, the framework processes the set
+of such registrations into a directed graph. It is easy enough to add
+nodes to the graph at runtime. The framework does not support removing
+nodes.
+
+vlib provides several types of vector-processing graph nodes, primarily
+to control framework dispatch behaviors. The type member of the
+vlib\_node\_registration\_t functions as follows:
+
+- VLIB\_NODE\_TYPE\_PRE\_INPUT - run before all other node types
+- VLIB\_NODE\_TYPE\_INPUT - run as often as possible, after pre\_input
+    nodes
+- VLIB\_NODE\_TYPE\_INTERNAL - only when explicitly made runnable by
+    adding pending frames for processing
+- VLIB\_NODE\_TYPE\_PROCESS - only when explicitly made runnable.
+    "Process" nodes are actually cooperative multi-tasking threads. They
+    **must** explicitly suspend after a reasonably short period of time.
+
+For a precise understanding of the graph node dispatcher, please read
+./src/vlib/main.c:vlib\_main\_loop.
+
+Graph node dispatcher
+---------------------
+
+Vlib\_main\_loop() dispatches graph nodes. The basic vector processing
+algorithm is diabolically simple, but may not be obvious from even a
+long stare at the code. Here's how it works: some input node, or set of
+input nodes, produce a vector of work to process. The graph node
+dispatcher pushes the work vector through the directed graph,
+subdividing it as needed, until the original work vector has been
+completely processed. At that point, the process recurs.
+
+This scheme yields a stable equilibrium in frame size, by construction.
+Here's why: as the frame size increases, the per-frame-element
+processing time decreases. There are several related forces at work; the
+simplest to describe is the effect of vector processing on the CPU L1
+I-cache. The first frame element \[packet\] processed by a given node
+warms up the node dispatch function in the L1 I-cache. All subsequent
+frame elements profit. As we increase the number of frame elements, the
+cost per element goes down.
+
+Under light load, it is a crazy waste of CPU cycles to run the graph
+node dispatcher flat-out. So, the graph node dispatcher arranges to wait
+for work by sitting in a timed epoll wait if the prevailing frame size
+is low. The scheme has a certain amount of hysteresis to avoid
+constantly toggling back and forth between interrupt and polling mode.
+Although the graph dispatcher supports interrupt and polling modes, our
+current default device drivers do not.
+
+The graph node scheduler uses a hierarchical timer wheel to reschedule
+process nodes upon timer expiration.
+
+Graph dispatcher internals
+--------------------------
+
+This section may be safely skipped.
It's not necessary to understand
+graph dispatcher internals to create graph nodes.
+
+Vector Data Structure
+---------------------
+
+In vpp / vlib, we represent vectors as instances of the vlib_frame_t type:
+
+```c
+    typedef struct vlib_frame_t
+    {
+      /* Frame flags. */
+      u16 flags;
+
+      /* Number of scalar bytes in arguments. */
+      u8 scalar_size;
+
+      /* Number of bytes per vector argument. */
+      u8 vector_size;
+
+      /* Number of vector elements currently in frame. */
+      u16 n_vectors;
+
+      /* Scalar and vector arguments to next node. */
+      u8 arguments[0];
+    } vlib_frame_t;
+```
+
+Note that one _could_ construct all kinds of vectors - including
+vectors with some associated scalar data - using this structure. In
+the vpp application, vectors typically use a 4-byte vector element
+size, and zero bytes' worth of associated per-frame scalar data.
+
+Frames are always allocated on CLIB_CACHE_LINE_BYTES boundaries.
+Frames have u32 indices which make use of the alignment property, so
+the maximum feasible main heap offset of a frame is
+CLIB_CACHE_LINE_BYTES * 0xFFFFFFFF: with 64-byte cache lines,
+64 * 4G = 256 Gbytes.
+
+Scheduling Vectors
+------------------
+
+As you can see, vectors are not directly associated with graph
+nodes. We represent that association in a couple of ways. The
+simplest is the vlib\_pending\_frame\_t:
+
+```c
+    /* A frame pending dispatch by main loop. */
+    typedef struct
+    {
+      /* Node and runtime for this frame. */
+      u32 node_runtime_index;
+
+      /* Frame index (in the heap). */
+      u32 frame_index;
+
+      /* Start of next frames for this node. */
+      u32 next_frame_index;
+
+      /* Special value for next_frame_index when there is no next frame. */
+    #define VLIB_PENDING_FRAME_NO_NEXT_FRAME ((u32) ~0)
+    } vlib_pending_frame_t;
+```
+
+Here is the code in .../src/vlib/main.c:vlib_main_or_worker_loop()
+which processes frames:
+
+```c
+    /*
+     * Input nodes may have added work to the pending vector.
+     * Process pending vector until there is nothing left.
+     * All pending vectors will be processed from input -> output.
+     */
+    for (i = 0; i < _vec_len (nm->pending_frames); i++)
+      cpu_time_now = dispatch_pending_node (vm, i, cpu_time_now);
+    /* Reset pending vector for next iteration. */
+```
+
+The pending frame node_runtime_index associates the frame with the
+node which will process it.
+
+Complications
+-------------
+
+Fasten your seatbelt. Here's where the story - and the data structures
+\- become quite complicated...
+
+At 100,000 feet: vpp uses a directed graph, not a directed _acyclic_
+graph. It's really quite normal for a packet to visit ip\[46\]-lookup
+multiple times. The worst-case: a graph node which enqueues packets to
+itself.
+
+To deal with this issue, the graph dispatcher must force allocation of
+a new frame if the current graph node's dispatch function happens to
+enqueue a packet back to itself.
+
+There are no guarantees that a pending frame will be processed
+immediately, which means that more packets may be added to the
+underlying vlib_frame_t after it has been attached to a
+vlib_pending_frame_t. Care must be taken to allocate new
+frames and pending frames if a (pending\_frame, frame) pair fills.
+
+Next frames, next frame ownership
+---------------------------------
+
+The vlib\_next\_frame\_t is the last key graph dispatcher data structure:
+
+```c
+    typedef struct
+    {
+      /* Frame index. */
+      u32 frame_index;
+
+      /* Node runtime for this next. */
+      u32 node_runtime_index;
+
+      /* Next frame flags. */
+      u32 flags;
+
+      /* Reflects node frame-used flag for this next. */
+    #define VLIB_FRAME_NO_FREE_AFTER_DISPATCH \
+      VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH
+
+      /* This next frame owns enqueue to node
+         corresponding to node_runtime_index. */
+    #define VLIB_FRAME_OWNER (1 << 15)
+
+      /* Set when frame has been allocated for this next. */
+    #define VLIB_FRAME_IS_ALLOCATED VLIB_NODE_FLAG_IS_OUTPUT
+
+      /* Set when frame has been added to pending vector.
*/
+    #define VLIB_FRAME_PENDING VLIB_NODE_FLAG_IS_DROP
+
+      /* Set when frame is to be freed after dispatch. */
+    #define VLIB_FRAME_FREE_AFTER_DISPATCH VLIB_NODE_FLAG_IS_PUNT
+
+      /* Set when frame has traced packets. */
+    #define VLIB_FRAME_TRACE VLIB_NODE_FLAG_TRACE
+
+      /* Number of vectors enqueue to this next since last overflow. */
+      u32 vectors_since_last_overflow;
+    } vlib_next_frame_t;
+```
+
+Graph node dispatch functions call vlib\_get\_next\_frame (...) to
+set "(u32 \*)to_next" to the right place in the vlib_frame_t
+corresponding to the ith arc (aka next0) from the current node to the
+indicated next node.
+
+After some scuffling around - two levels of macros - processing
+reaches vlib\_get\_next\_frame_internal (...). Get-next-frame-internal
+digs up the vlib\_next\_frame\_t corresponding to the desired graph
+arc.
+
+The next frame data structure amounts to a graph-arc-centric frame
+cache. Once a node finishes adding elements to a frame, it will acquire
+a vlib_pending_frame_t and end up on the graph dispatcher's
+run-queue. But there's no guarantee that more vector elements won't be
+added to the underlying frame from the same (source\_node,
+next\_index) arc or from a different (source\_node, next\_index) arc.
+
+Maintaining consistency of the arc-to-frame cache is necessary. The
+first step in maintaining consistency is to make sure that only one
+graph node at a time thinks it "owns" the target vlib\_frame\_t.
+
+Back to the graph node dispatch function. In the usual case, a certain
+number of packets will be added to the vlib\_frame\_t acquired by
+calling vlib\_get\_next\_frame (...).
+
+Before a dispatch function returns, it's required to call
+vlib\_put\_next\_frame (...) for all of the graph arcs it actually
+used. This action adds a vlib\_pending\_frame\_t to the graph
+dispatcher's pending frame vector.
+
+Vlib\_put\_next\_frame makes a note in the pending frame of the frame
+index, and also of the vlib\_next\_frame\_t index.
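
The get/put handshake described above can be approximated in a few lines of standalone C. This is only a sketch under simplifying assumptions, not vlib code: every `toy_` name is hypothetical, the frame capacity is shrunk to 4, frames are never freed, and frame ownership across arcs is ignored. It illustrates two behaviours from the text: elements may still be appended to a frame that is already on the pending vector, and a full frame forces a fresh (pending frame, frame) pair.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_FRAME_SIZE 4   /* assumed tiny capacity, for illustration only */
#define TOY_MAX_FRAMES 8
#define TOY_NO_FRAME (~0u)

typedef struct
{
  uint16_t n_vectors;             /* elements currently in the frame */
  uint32_t elts[TOY_FRAME_SIZE];  /* buffer indices */
} toy_frame_t;

typedef struct
{
  uint32_t frame_index;  /* frame cached on this arc, or TOY_NO_FRAME */
  int pending;           /* set once the frame is on the pending vector */
} toy_next_frame_t;

typedef struct
{
  uint32_t frame_index;
  uint32_t next_frame_index;
} toy_pending_frame_t;

typedef struct
{
  toy_frame_t frames[TOY_MAX_FRAMES];
  uint32_t n_frames;
  toy_pending_frame_t pending[TOY_MAX_FRAMES];
  uint32_t n_pending;
} toy_main_t;

/* Return the frame cached on an arc; allocate a new one if the arc
   has no frame yet or the cached frame is full. */
toy_frame_t *
toy_get_next_frame (toy_main_t *tm, toy_next_frame_t *nfs, uint32_t next_index)
{
  toy_next_frame_t *nf = &nfs[next_index];

  if (nf->frame_index == TOY_NO_FRAME
      || tm->frames[nf->frame_index].n_vectors == TOY_FRAME_SIZE)
    {
      assert (tm->n_frames < TOY_MAX_FRAMES);
      nf->frame_index = tm->n_frames++;
      tm->frames[nf->frame_index].n_vectors = 0;
      nf->pending = 0;
    }
  return &tm->frames[nf->frame_index];
}

/* Queue the arc's frame for dispatch, at most once per frame. The pending
   entry records both the frame index and the next-frame index. */
void
toy_put_next_frame (toy_main_t *tm, toy_next_frame_t *nfs, uint32_t next_index)
{
  toy_next_frame_t *nf = &nfs[next_index];

  if (nf->frame_index == TOY_NO_FRAME || nf->pending
      || tm->frames[nf->frame_index].n_vectors == 0)
    return;
  assert (tm->n_pending < TOY_MAX_FRAMES);
  tm->pending[tm->n_pending].frame_index = nf->frame_index;
  tm->pending[tm->n_pending].next_frame_index = next_index;
  tm->n_pending++;
  nf->pending = 1;
}

/* Enqueue one buffer index on an arc: get, append, put. */
void
toy_enqueue (toy_main_t *tm, toy_next_frame_t *nfs, uint32_t next_index,
             uint32_t bi)
{
  toy_frame_t *f = toy_get_next_frame (tm, nfs, next_index);

  f->elts[f->n_vectors++] = bi;
  toy_put_next_frame (tm, nfs, next_index);
}

/* Enqueue five buffer indices on a single arc. */
void
toy_demo (toy_main_t *tm)
{
  toy_next_frame_t nfs[1] = { { TOY_NO_FRAME, 0 } };
  uint32_t bi;

  memset (tm, 0, sizeof (*tm));
  for (bi = 10; bi < 15; bi++)
    toy_enqueue (tm, nfs, 0, bi);
}
```

After `toy_demo`, the pending vector holds two entries: the first frame kept accepting elements after it was queued until it filled with four, and the fifth element landed in a newly allocated frame.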
+
+dispatch\_pending\_node actions
+-------------------------------
+
+The main graph dispatch loop calls dispatch\_pending\_node as shown
+above.
+
+Dispatch\_pending\_node recovers the pending frame, and the graph node
+runtime / dispatch function. Further, it recovers the next\_frame
+currently associated with the vlib\_frame\_t, and detaches the
+vlib\_frame\_t from the next\_frame.
+
+In .../src/vlib/main.c:dispatch\_pending\_node(...), note this stanza:
+
+```c
+    /* Force allocation of new frame while current frame is being
+       dispatched. */
+    restore_frame_index = ~0;
+    if (nf->frame_index == p->frame_index)
+      {
+        nf->frame_index = ~0;
+        nf->flags &= ~VLIB_FRAME_IS_ALLOCATED;
+        if (!(n->flags & VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH))
+          restore_frame_index = p->frame_index;
+      }
+```
+
+dispatch\_pending\_node is worth a hard stare due to the several
+second-order optimizations it implements. Almost as an afterthought,
+it calls dispatch_node which actually calls the graph node dispatch
+function.
+
+Process / thread model
+----------------------
+
+vlib provides an ultra-lightweight cooperative multi-tasking thread
+model. The graph node scheduler invokes these processes in much the same
+way as traditional vector-processing run-to-completion graph nodes;
+plus-or-minus a setjmp/longjmp pair required to switch stacks. Simply
+set the vlib\_node\_registration\_t type field to
+VLIB\_NODE\_TYPE\_PROCESS. Yes, process is a misnomer. These are
+cooperative multi-tasking threads.
+
+As of this writing, the default stack size is 1<<15 = 32kb.
+Initialize the node registration's process\_log2\_n\_stack\_bytes member
+as needed. The graph node dispatcher makes some effort to detect stack
+overrun, e.g. by mapping a no-access page below each thread stack.
+
+Process node dispatch functions are expected to be "while(1) { }" loops
+which suspend when not otherwise occupied, and which must not run for
+unreasonably long periods of time.
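
The "do a bounded amount of work, then suspend" discipline can be illustrated without any stack switching. The sketch below is an assumption-laden stand-in, not how vlib implements processes: instead of the setjmp/longjmp pair it models a process as a resumable function that remembers where it stopped. The names `toy_process_t`, `toy_process_run_slice`, `toy_run_process` and the slice size of 64 are all hypothetical.

```c
#include <stdint.h>

#define TOY_ENTRIES_PER_SLICE 64  /* assumed work budget per invocation */

typedef struct
{
  uint32_t next_entry;  /* where to resume on the next invocation */
  uint32_t n_entries;   /* total amount of work */
  uint32_t n_slices;    /* number of times the process has run */
} toy_process_t;

/* Do at most one slice of work, then return to the caller ("suspend").
   Returns 1 while work remains, 0 once the table is complete. */
int
toy_process_run_slice (toy_process_t *p, uint32_t *table)
{
  uint32_t end = p->next_entry + TOY_ENTRIES_PER_SLICE;

  if (end > p->n_entries)
    end = p->n_entries;
  for (; p->next_entry < end; p->next_entry++)
    table[p->next_entry] = 2 * p->next_entry;
  p->n_slices++;
  return p->next_entry < p->n_entries;
}

/* Toy scheduler loop: between slices, a real dispatcher would go back
   to processing frames. Returns the number of such interleaving points. */
uint32_t
toy_run_process (toy_process_t *p, uint32_t *table)
{
  uint32_t dispatch_rounds = 0;

  while (toy_process_run_slice (p, table))
    dispatch_rounds++;  /* packet processing would happen here */
  return dispatch_rounds;
}
```

With 200 entries and a budget of 64 per slice, the process runs four times and yields to the dispatcher three times in between, so no single invocation monopolizes the CPU.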
+
+"Unreasonably long" is an application-dependent concept. Over the years,
+we have constructed frame-size sensitive control-plane nodes which will
+use a much higher fraction of the available CPU bandwidth when the frame
+size is low. The classic example: modifying forwarding tables. So long
+as the table-builder leaves the forwarding tables in a valid state, one
+can suspend the table builder to avoid dropping packets as a result of
+control-plane activity.
+
+Process nodes can suspend for fixed amounts of time, or until another
+entity signals an event, or both. See the next section for a description
+of the vlib process event mechanism.
+
+When running in vlib process context, one must pay strict attention to
+loop invariant issues. If one walks a data structure and calls a
+function which may suspend, one had best know by construction that it
+cannot change. Often, it's best to simply make a snapshot copy of a data
+structure, walk the copy at leisure, then free the copy.
+
+Process events
+--------------
+
+The vlib process event mechanism API is extremely lightweight and easy
+to use. Here is a typical example:
+
+```c
+    vlib_main_t *vm = &vlib_global_main;
+    uword event_type, * event_data = 0;
+
+    while (1)
+      {
+        vlib_process_wait_for_event_or_clock (vm, 5.0 /* seconds */);
+
+        event_type = vlib_process_get_events (vm, &event_data);
+
+        switch (event_type) {
+        case EVENT1:
+          handle_event1s (event_data);
+          break;
+
+        case EVENT2:
+          handle_event2s (event_data);
+          break;
+
+        case ~0: /* 5-second idle/periodic */
+          handle_idle ();
+          break;
+
+        default: /* bug! */
+          ASSERT (0);
+        }
+
+        vec_reset_length(event_data);
+      }
+```
+
+In this example, the VLIB process node waits for an event to occur, or
+for 5 seconds to elapse. The code demuxes on the event type, calling
+the appropriate handler function.
Each call to
+vlib\_process\_get\_events returns a vector of per-event-type data
+passed to successive vlib\_process\_signal\_event calls; it is a
+serious error to process only event\_data\[0\].
+
+Resetting the event\_data vector-length to 0 \[instead of calling
+vec\_free\] means that the event scheme doesn't burn cycles continuously
+allocating and freeing the event data vector. This is a common vppinfra
+/ vlib coding pattern, well worth using when appropriate.
+
+Signaling an event is easy, for example:
+
+```c
+    vlib_process_signal_event (vm, process_node_index, EVENT1,
+        (uword)arbitrary_event1_data); /* and so forth */
+```
+
+One can either know the process node index by construction - dig it out
+of the appropriate vlib\_node\_registration\_t - or by finding the
+vlib\_node\_t with vlib\_get\_node\_by\_name(...).
+
+Buffers
+-------
+
+vlib buffering solves the usual set of packet-processing problems,
+albeit at high performance. Key in terms of performance: one ordinarily
+allocates / frees N buffers at a time rather than one at a time. Except
+when operating directly on a specific buffer, one deals with buffers by
+index, not by pointer.
+
+Packet-processing frames are u32\[\] arrays, not
+vlib\_buffer\_t\[\] arrays.
+
+Packets comprise one or more vlib buffers, chained together as required.
+Multiple particle sizes are supported; hardware input nodes simply ask
+for the required size(s). Coalescing support is available. For obvious
+reasons one is discouraged from writing one's own wild and wacky buffer
+chain traversal code.
+
+vlib buffer headers are allocated immediately prior to the buffer data
+area. In typical packet processing this saves a dependent read wait:
+given a buffer's address, one can prefetch the buffer header
+\[metadata\] at the same time as the first cache line of buffer data.
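
To make "buffers by index, header immediately before data" concrete, here is a minimal standalone sketch. It is not the vlib buffer implementation: `toy_buffer_t`, `toy_pool` and the helper names are hypothetical, and the sketch assumes 64-byte cache lines, a one-cache-line header, and buffer indices expressed in cache-line units from the start of the buffer memory.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_LOG2_CACHE_LINE_BYTES 6  /* assumes 64-byte cache lines */
#define TOY_CACHE_LINE_BYTES (1 << TOY_LOG2_CACHE_LINE_BYTES)
#define TOY_DATA_BYTES 128

typedef struct
{
  /* Header (metadata): padded to exactly one cache line. */
  int16_t current_data;     /* offset of the packet start within data[] */
  uint16_t current_length;  /* bytes of packet data in this buffer */
  uint32_t sw_if_index_rx;  /* receive interface index */
  uint8_t pad[TOY_CACHE_LINE_BYTES - 8];

  /* Packet data starts on the cache line right after the header. */
  uint8_t data[TOY_DATA_BYTES];
} toy_buffer_t;

/* Cache-line aligned buffer memory; each buffer spans 3 cache lines. */
toy_buffer_t toy_pool[4] __attribute__ ((aligned (TOY_CACHE_LINE_BYTES)));

/* Index to pointer: a shift and an add, no table lookup. */
toy_buffer_t *
toy_get_buffer (uint8_t *mem_start, uint32_t bi)
{
  return (toy_buffer_t *) (mem_start
                           + ((uintptr_t) bi << TOY_LOG2_CACHE_LINE_BYTES));
}

/* Pointer back to index. */
uint32_t
toy_get_buffer_index (uint8_t *mem_start, toy_buffer_t *b)
{
  return (uint32_t) (((uint8_t *) b - mem_start)
                     >> TOY_LOG2_CACHE_LINE_BYTES);
}

/* Convert a frame's worth of indices to pointers in one pass. */
void
toy_get_buffers (uint8_t *mem_start, const uint32_t *bi, toy_buffer_t **b,
                 int count)
{
  int i;

  for (i = 0; i < count; i++)
    b[i] = toy_get_buffer (mem_start, bi[i]);
}
```

Because the header and the first data bytes sit in adjacent cache lines, a node holding the buffer address can issue both prefetches without first reading anything from memory. Storing 32-bit indices rather than pointers also halves the size of a frame on 64-bit machines.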
+
+Buffer header metadata (vlib\_buffer\_t) includes the usual rewrite
+expansion space, a current\_data offset, RX and TX interface indices,
+packet trace information, and opaque areas.
+
+The opaque data is intended to control packet processing in arbitrary
+subgraph-dependent ways. The programmer shoulders responsibility for
+data lifetime analysis, type-checking, etc.
+
+Buffers have reference-counts in support of e.g. multicast replication.
+
+Shared-memory message API
+-------------------------
+
+Local control-plane and application processes interact with the vpp
+dataplane via asynchronous message-passing in shared memory over
+unidirectional queues. The same application APIs are available via
+sockets.
+
+Capturing API traces and replaying them in a simulation environment
+requires a disciplined approach to the problem. This seems like a
+make-work task, but it is not. When something goes wrong in the
+control-plane after 300,000 or 3,000,000 operations, high-speed replay
+of the events leading up to the accident is a huge win.
+
+The shared-memory message API message allocator vl\_api\_msg\_alloc uses
+a particularly cute trick. Since messages are processed in order, we try
+to allocate message buffering from a set of fixed-size, preallocated
+rings. Each ring item has a "busy" bit. Freeing one of the preallocated
+message buffers merely requires the message consumer to clear the busy
+bit. No locking required.
+
+Debug CLI
+---------
+
+Adding debug CLI commands to VLIB applications is very simple.
+
+Here is a complete example:
+
+```c
+    static clib_error_t *
+    show_ip_tuple_match (vlib_main_t * vm,
+                         unformat_input_t * input,
+                         vlib_cli_command_t * cmd)
+    {
+        vlib_cli_output (vm, "%U\n", format_ip_tuple_match_tables, &routing_main);
+        return 0;
+    }
+
+    /* *INDENT-OFF* */
+    static VLIB_CLI_COMMAND (show_ip_tuple_command) =
+    {
+        .path = "show ip tuple match",
+        .short_help = "Show ip 5-tuple match-and-broadcast tables",
+        .function = show_ip_tuple_match,
+    };
+    /* *INDENT-ON* */
+```
+
+This example implements the "show ip tuple match" debug cli
+command. In ordinary usage, the vlib cli is available via the "vppctl"
+application, which sends traffic to a named pipe. One can configure
+debug CLI telnet access on a configurable port.
+
+The cli implementation has an output redirection facility which makes it
+simple to deliver cli output via shared-memory API messaging.
+
+Particularly for debug or "show tech support" type commands, it would be
+wasteful to write vlib application code to pack binary data, write more
+code elsewhere to unpack the data and finally print the answer. If a
+certain cli command has the potential to hurt packet processing
+performance by running for too long, do the work incrementally in a
+process node. The client can wait.
diff --git a/docs/gettingstarted/developers/vnet.md b/docs/gettingstarted/developers/vnet.md
new file mode 100644
index 00000000000..191a2a16969
--- /dev/null
+++ b/docs/gettingstarted/developers/vnet.md
@@ -0,0 +1,171 @@
+
+VNET (VPP Network Stack)
+========================
+
+The files associated with the VPP network stack layer are located in the
+./src/vnet folder. The Network Stack Layer is basically an
+instantiation of the code in the other layers. This layer has a vnet
+library that provides vectorized layer-2 and 3 networking graph nodes, a
+packet generator, and a packet tracer.
+
+In terms of building a packet processing application, vnet provides a
+platform-independent subgraph to which one connects a couple of
+device-driver nodes.
+
+Typical RX connections include "ethernet-input" \[full software
+classification, feeds ipv4-input, ipv6-input, arp-input etc.\] and
+"ipv4-input-no-checksum" \[if hardware can classify, perform ipv4 header
+checksum\].
+
+![image](/_images/VNET_Features.png)
+
+List of features and layer areas that VNET works with:
+
+Effective graph dispatch function coding
+----------------------------------------
+
+Over the last 15 years, multiple coding styles have emerged: a
+single/dual/quad loop coding model (with variations) and a
+fully-pipelined coding model.
+
+Single/dual loops
+-----------------
+
+The single/dual/quad loop model variations conveniently solve problems
+where the number of items to process is not known in advance: typical
+hardware RX-ring processing. This coding style is also very effective
+when a given node will not need to cover a complex set of dependent
+reads.
+
+Here is a quad/single loop which can leverage up-to-avx512 SIMD vector
+units to convert buffer indices to buffer pointers:
+
+```c
+    static uword
+    simulated_ethernet_interface_tx (vlib_main_t * vm,
+                                     vlib_node_runtime_t *
+                                     node, vlib_frame_t * frame)
+    {
+      u32 n_left_from, *from;
+      u32 next_index = 0;
+      u32 n_bytes;
+      u32 thread_index = vm->thread_index;
+      vnet_main_t *vnm = vnet_get_main ();
+      vnet_interface_main_t *im = &vnm->interface_main;
+      vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b;
+      u16 nexts[VLIB_FRAME_SIZE], *next;
+
+      n_left_from = frame->n_vectors;
+      from = vlib_frame_args (frame);
+
+      /*
+       * Convert up to VLIB_FRAME_SIZE indices in "from" to
+       * buffer pointers in bufs[]
+       */
+      vlib_get_buffers (vm, from, bufs, n_left_from);
+      b = bufs;
+      next = nexts;
+
+      /*
+       * While we have at least 4 vector elements (pkts) to process..
+       */
+      while (n_left_from >= 4)
+        {
+          /* Prefetch next quad-loop iteration.
*/
+          if (PREDICT_TRUE (n_left_from >= 8))
+            {
+              vlib_prefetch_buffer_header (b[4], STORE);
+              vlib_prefetch_buffer_header (b[5], STORE);
+              vlib_prefetch_buffer_header (b[6], STORE);
+              vlib_prefetch_buffer_header (b[7], STORE);
+            }
+
+          /*
+           * $$$ Process 4x packets right here...
+           * set next[0..3] to send the packets where they need to go
+           */
+
+          do_something_to (b[0]);
+          do_something_to (b[1]);
+          do_something_to (b[2]);
+          do_something_to (b[3]);
+
+          /* Advance to the next 4 packets */
+          b += 4;
+          next += 4;
+          n_left_from -= 4;
+        }
+      /*
+       * Clean up 0...3 remaining packets at the end of the incoming frame
+       */
+      while (n_left_from > 0)
+        {
+          /*
+           * $$$ Process one packet right here...
+           * set next[0] to send the packet where it needs to go
+           */
+          do_something_to (b[0]);
+
+          /* Advance to the next packet */
+          b += 1;
+          next += 1;
+          n_left_from -= 1;
+        }
+
+      /*
+       * Send the packets along their respective next-node graph arcs
+       * Considerable locality of reference is expected, most if not all
+       * packets in the inbound vector will traverse the same next-node
+       * arc
+       */
+      vlib_buffer_enqueue_to_next (vm, node, from, nexts, frame->n_vectors);
+
+      return frame->n_vectors;
+    }
+```
+
+Given a packet processing task to implement, it pays to scout around
+looking for similar tasks, and think about using the same coding
+pattern. It is not uncommon to recode a given graph node dispatch function
+several times during performance optimization.
+
+Packet tracer
+-------------
+
+Vlib includes a frame element \[packet\] trace facility, with a simple
+vlib cli interface. The cli is straightforward: "trace add
+input-node-name count".
+
+To trace 100 packets on a typical x86\_64 system running the dpdk
+plugin: "trace add dpdk-input 100". When using the packet generator:
+"trace add pg-input 100"
+
+Each graph node has the opportunity to capture its own trace data. It is
+almost always a good idea to do so. The trace capture APIs are simple.
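
To give a feel for what capturing per-node trace data involves, here is a minimal standalone sketch. It does not use the vlib trace API; `toy_trace_t`, `toy_trace_main_t`, `toy_trace_capture` and `toy_format_trace` are hypothetical names, and the record fields are chosen only for illustration. The sketch assumes a "count" budget like the one given to "trace add", stores a small binary record per traced packet, and leaves all string formatting to a separate function that runs later.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_TRACES 16

/* Binary record captured as the packet goes by: no strings, no formatting. */
typedef struct
{
  uint32_t sw_if_index;
  uint16_t next_index;
  uint16_t length;
} toy_trace_t;

typedef struct
{
  toy_trace_t records[TOY_MAX_TRACES];
  uint32_t n_records;
  uint32_t n_to_trace;  /* remaining budget, as in "trace add <node> <count>" */
} toy_trace_main_t;

/* Capture path: a bounds check and a few stores. Returns 1 if captured. */
int
toy_trace_capture (toy_trace_main_t *tm, uint32_t sw_if_index,
                   uint16_t next_index, uint16_t length)
{
  toy_trace_t *t;

  if (tm->n_to_trace == 0 || tm->n_records == TOY_MAX_TRACES)
    return 0;
  t = &tm->records[tm->n_records++];
  t->sw_if_index = sw_if_index;
  t->next_index = next_index;
  t->length = length;
  tm->n_to_trace--;
  return 1;
}

/* Display path: pretty-print one record, only when someone asks for it. */
int
toy_format_trace (char *s, size_t len, const toy_trace_t *t)
{
  return snprintf (s, len, "sw_if_index %u next %u length %u",
                   (unsigned) t->sw_if_index, (unsigned) t->next_index,
                   (unsigned) t->length);
}
```

Keeping the capture path down to a handful of stores is the point of the split: the cost of building human-readable text is paid only for records that are actually displayed.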
+
+The packet capture APIs snapshot binary data, to minimize processing at
+capture time. Each participating graph node initialization provides a
+vppinfra format-style user function to pretty-print data when required
+by the VLIB "show trace" command.
+
+Set the VLIB node registration ".format\_trace" member to the name of
+the per-graph node format function.
+
+Here's a simple example:
+
+```c
+    u8 * my_node_format_trace (u8 * s, va_list * args)
+    {
+        vlib_main_t * vm = va_arg (*args, vlib_main_t *);
+        vlib_node_t * node = va_arg (*args, vlib_node_t *);
+        my_node_trace_t * t = va_arg (*args, my_node_trace_t *);
+
+        s = format (s, "My trace data was: %d", t->);
+
+        return s;
+    }
+```
+
+The trace framework hands the per-node format function the data it
+captured as the packet whizzed by. The format function pretty-prints the
+data as desired.
-- cgit 1.2.3-korg