From 06dcd45ff81e06bc8cf40ed487c0b2652d346a5a Mon Sep 17 00:00:00 2001
From: John DeNisco <jdenisco@cisco.com>
Date: Thu, 26 Jul 2018 12:45:10 -0400
Subject: Initial commit of Sphinx docs

Change-Id: I9fca8fb98502dffc2555f9de7f507b6f006e0e77
Signed-off-by: John DeNisco <jdenisco@cisco.com>
---
 docs/gettingstarted/developers/bihash.md           | 273 ++++++++++++
 docs/gettingstarted/developers/building.rst        | 151 +++++++
 docs/gettingstarted/developers/featurearcs.md      | 224 ++++++++++
 docs/gettingstarted/developers/index.rst           |  18 +
 docs/gettingstarted/developers/infrastructure.md   | 330 ++++++++++++++
 docs/gettingstarted/developers/plugins.md          |  11 +
 .../developers/softwarearchitecture.md             |  44 ++
 docs/gettingstarted/developers/vlib.md             | 496 +++++++++++++++++++++
 docs/gettingstarted/developers/vnet.md             | 171 +++++++
 9 files changed, 1718 insertions(+)
 create mode 100644 docs/gettingstarted/developers/bihash.md
 create mode 100644 docs/gettingstarted/developers/building.rst
 create mode 100644 docs/gettingstarted/developers/featurearcs.md
 create mode 100644 docs/gettingstarted/developers/index.rst
 create mode 100644 docs/gettingstarted/developers/infrastructure.md
 create mode 100644 docs/gettingstarted/developers/plugins.md
 create mode 100644 docs/gettingstarted/developers/softwarearchitecture.md
 create mode 100644 docs/gettingstarted/developers/vlib.md
 create mode 100644 docs/gettingstarted/developers/vnet.md

(limited to 'docs/gettingstarted/developers')

diff --git a/docs/gettingstarted/developers/bihash.md b/docs/gettingstarted/developers/bihash.md
new file mode 100644
index 00000000000..3f53e7bbc3e
--- /dev/null
+++ b/docs/gettingstarted/developers/bihash.md
@@ -0,0 +1,273 @@
+Bounded-index Extensible Hashing (bihash)
+=========================================
+
+Vpp uses bounded-index extensible hashing to solve a variety of
+exact-match (key, value) lookup problems. Benefits of the current
+implementation:
+
+* Very high record count scaling, tested to 100,000,000 records.
+* Lookup performance degrades gracefully as the number of records increases
+* No reader locking required
+* Template implementation: it's easy to support arbitrary (key,value) types
+
+Bounded-index extensible hashing has been widely used in databases for
+decades. 
+
+Bihash uses a two-level data structure:
+
+```
+    +-----------------+                                              
+    | bucket-0        |                                             
+    |  log2_size      |                                             
+    |  backing store  |                                             
+    +-----------------+                                             
+    | bucket-1        |                                             
+    |  log2_size      |           +--------------------------------+
+    |  backing store  | --------> | KVP_PER_PAGE * key-value-pairs |
+    +-----------------+           | page 0                         |
+         ...                      +--------------------------------+
+    +-----------------+           | KVP_PER_PAGE * key-value-pairs |
+    | bucket-2**N-1   |           | page 1                         |
+    |  log2_size      |           +--------------------------------+
+    |  backing store  |                       ---                   
+    +-----------------+           +--------------------------------+
+                                  | KVP_PER_PAGE * key-value-pairs |
+                                  | page 2**log2_size - 1          |
+                                  +--------------------------------+
+```                                  
+
+Discussion of the algorithm
+---------------------------
+
+This structure has a couple of major advantages. In practice, each
+bucket entry fits into a 64-bit integer. Coincidentally, vpp's target
+CPU architectures support 64-bit atomic operations. When modifying the
+contents of a specific bucket, we do the following:
+
+* Make a working copy of the bucket's backing storage
+* Atomically swap a pointer to the working copy into the bucket array
+* Change the original backing store data
+* Atomically swap back to the original
+
+So, no reader locking is required to search a bihash table.
+
+At lookup time, the implementation computes a key hash code. We use
+the least-significant N bits of the hash to select the bucket.
+
+With the bucket in hand, we learn log2 (nBackingPages) for the
+selected bucket. At this point, we use the next log2_size bits from
+the hash code to select the specific backing page in which the
+(key,value) pair will be found.
+
+Net result: we search **one** backing page, not 2**log2_size
+pages. This is a key property of the algorithm.
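+
+The two-step selection above can be sketched in a few lines of
+standalone C. This is a simplified illustration, not the vppinfra
+implementation: the names sketch_bucket_index and sketch_page_index
+are hypothetical, and the real code stores the bucket fields
+differently.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+/* Select a bucket using the least-significant log2_nbuckets hash bits. */
+static inline uint64_t
+sketch_bucket_index (uint64_t hash, uint32_t log2_nbuckets)
+{
+  assert (log2_nbuckets < 64);
+  return hash & ((1ULL << log2_nbuckets) - 1);
+}
+
+/* Select a backing page using the next log2_pages hash bits. */
+static inline uint64_t
+sketch_page_index (uint64_t hash, uint32_t log2_nbuckets,
+                   uint32_t log2_pages)
+{
+  assert (log2_nbuckets + log2_pages < 64);
+  return (hash >> log2_nbuckets) & ((1ULL << log2_pages) - 1);
+}
+```
+
+For example, with 2**4 buckets and a bucket whose log2_size is 2, the
+hash 0x6b selects bucket 0xb and backing page 2.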
+
+When sufficient collisions occur to fill the backing pages for a given
+bucket, we double the bucket size, rehash, and deal the bucket
+contents into a double-sized set of backing pages. In the future, we
+may represent the size as a linear combination of two powers-of-two,
+to increase space efficiency.
+
+To solve the "jackpot case" where a set of records collide under
+hashing in a bad way, the implementation will fall back to linear
+search across 2**log2_size backing pages on a per-bucket basis.
+
+To maintain *space* efficiency, we should configure the bucket array
+so that backing pages are effectively utilized. Lookup performance
+tends to change *very little* if the bucket array is too small or too
+large.
+
+Bihash depends on selecting an effective hash function. If one were to
+use a truly broken hash function such as "return 1ULL;", bihash would
+still work, but it would be equivalent to poorly-programmed linear
+search.
+
+We often use cpu intrinsic functions - think crc32 - to rapidly
+compute a hash code which has decent statistics.
+
+Bihash Cookbook
+---------------
+
+### Using current (key,value) template instance types
+
+It's quite easy to use one of the template instance types. As of this
+writing, .../src/vppinfra provides pre-built templates for 8, 16, 20,
+24, 40, and 48 byte keys, u8 * vector keys, and 8 byte values.
+
+See .../src/vppinfra/{bihash_<key-size>_8}.h
+
+To define the data types, #include a specific template instance, most
+often in a subsystem header file:
+
+```c
+     #include <vppinfra/bihash_8_8.h>
+```
+
+If you're building a standalone application, you'll need to define the
+various functions by #including the method implementation file in a C
+source file:
+
+```c
+     #include <vppinfra/bihash_template.c>
+```
+
+The core vpp engine currently uses most if not all of the known bihash
+types, so you probably won't need to #include the method
+implementation file.
+
+Add an instance of the selected bihash data structure to e.g. a
+"main_t" structure:
+
+```c
+    typedef struct
+    {
+      /* ... */
+      BVT (clib_bihash) hash_table;
+      /* or, equivalently: clib_bihash_8_8_t hash_table; */
+      /* ... */
+    } my_main_t;
+```
+
+The BV macro concatenates its argument with the value of the
+preprocessor symbol BIHASH_TYPE. The BVT macro concatenates its
+argument with the value of BIHASH_TYPE and the fixed-string "_t". So
+in the above example, BVT (clib_bihash) generates "clib_bihash_8_8_t".
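+
+The token pasting can be reproduced with a short standalone sketch.
+The SKETCH_ macro names below are hypothetical stand-ins, not the
+definitions from the template header; the two-level expansion is the
+standard C idiom for pasting the *value* of a macro rather than its
+name.
+
+```c
+#include <assert.h>
+#include <string.h>
+
+#define SKETCH_BIHASH_TYPE _8_8
+
+/* Two levels of indirection so SKETCH_BIHASH_TYPE is expanded before ## */
+#define SKETCH_PASTE_(a, b) a##b
+#define SKETCH_PASTE(a, b) SKETCH_PASTE_ (a, b)
+#define SKETCH_BV(a) SKETCH_PASTE (a, SKETCH_BIHASH_TYPE)
+#define SKETCH_PASTE_T_(a, b) a##b##_t
+#define SKETCH_PASTE_T(a, b) SKETCH_PASTE_T_ (a, b)
+#define SKETCH_BVT(a) SKETCH_PASTE_T (a, SKETCH_BIHASH_TYPE)
+
+/* Stringify the expansion so it can be inspected at run time */
+#define SKETCH_STR_(x) #x
+#define SKETCH_STR(x) SKETCH_STR_ (x)
+
+static const char *sketch_type_name = SKETCH_STR (SKETCH_BVT (clib_bihash));
+static const char *sketch_fn_name = SKETCH_STR (SKETCH_BV (clib_bihash_init));
+
+/* Check that the expansions match the names described in the text. */
+static void
+sketch_check_names (void)
+{
+  assert (strcmp (sketch_type_name, "clib_bihash_8_8_t") == 0);
+  assert (strcmp (sketch_fn_name, "clib_bihash_init_8_8") == 0);
+}
+```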
+
+If you're sure you won't decide to change the template / type name
+later, it's perfectly OK to code "clib_bihash_8_8_t" and so forth.
+
+In fact, if you #include multiple template instances in a single
+source file, you **must** use fully-enumerated type names. The macros
+stand no chance of working.
+
+### Initializing a bihash table
+
+Call the init function as shown. As a rough guide, pick a number of
+buckets which is approximately
+number_of_expected_records/BIHASH_KVP_PER_PAGE, where
+BIHASH_KVP_PER_PAGE comes from the relevant template instance
+header file. See the previous discussion of space efficiency.
+
+The amount of memory selected should easily contain all of the
+records, with a generous allowance for hash collisions. Bihash memory
+is allocated separately from the main heap, and won't cost anything
+except kernel PTEs until touched, so it's OK to be reasonably
+generous.
+
+For example:
+
+```c
+    my_main_t *mm = &my_main;
+    clib_bihash_8_8_t *h;
+        
+    h = &mm->hash_table;
+
+    clib_bihash_init_8_8 (h, "test", (u32) number_of_buckets, 
+                           (uword) memory_size);
+```
+
+### Add or delete a key/value pair
+
+Use BV(clib_bihash_add_del), or the explicit type variant:
+
+```c
+   clib_bihash_kv_8_8_t kv;
+   my_main_t *mm = &my_main;
+   clib_bihash_8_8_t *h;
+        
+   h = &mm->hash_table;
+   kv.key = key_to_add_or_delete;
+   kv.value = value_to_add_or_delete;
+
+   clib_bihash_add_del_8_8 (h, &kv, is_add /* 1=add, 0=delete */);
+```
+
+In the delete case, kv.value is irrelevant. To change the value associated
+with an existing (key,value) pair, simply re-add the [new] pair.
+
+### Simple search
+
+The simplest possible (key, value) search goes like so:
+
+```c
+   clib_bihash_kv_8_8_t search_kv, return_kv;
+   my_main_t *mm = &my_main;
+   clib_bihash_8_8_t *h;
+
+   h = &mm->hash_table;
+   search_kv.key = key_to_search_for;
+
+   if (clib_bihash_search_8_8 (h, &search_kv, &return_kv) < 0)
+     key_not_found ();
+   else
+     key_found ();
+```
+
+Note that it's perfectly fine to collect the lookup result in the
+same structure that holds the search key:
+
+```c
+   if (clib_bihash_search_8_8 (h, &search_kv, &search_kv) < 0)
+     key_not_found ();
+   /* etc. */
+```
+
+### Bihash vector processing
+
+When processing a vector of packets which need a certain lookup
+performed, it's worth the trouble to compute the key hash, and
+prefetch the correct bucket ahead of time.
+
+Here's a sketch of one way to write the required code:
+
+Dual-loop:
+* 6 packets ahead, prefetch 2x vlib_buffer_t's and 2x packet data
+  required to form the record keys
+* 4 packets ahead, form 2x record keys and call BV(clib_bihash_hash)
+  or the explicit hash function to calculate the record hashes.
+  Call 2x BV(clib_bihash_prefetch_bucket) to prefetch the buckets
+* 2 packets ahead, call 2x BV(clib_bihash_prefetch_data) to prefetch 
+  2x (key,value) data pages.
+* In the processing section, call 2x BV(clib_bihash_search_inline_with_hash)
+  to perform the search
+
+Programmer's choice whether to stash the hash code somewhere in
+vnet_buffer(b) metadata, or to use local variables.
+
+Single-loop:
+* Use simple search as shown above.
+
+### Walking a bihash table
+
+A fairly common scenario to build "show" commands involves walking a
+bihash table. It's simple enough:
+
+```c
+   my_main_t *mm = &my_main;
+   clib_bihash_8_8_t *h;
+   void callback_fn (clib_bihash_kv_8_8_t *, void *);
+
+   h = &mm->hash_table;
+
+   BV(clib_bihash_foreach_key_value_pair) (h, callback_fn, (void *) arg);
+```
+To nobody's great surprise: clib_bihash_foreach_key_value_pair
+iterates across the entire table, calling callback_fn with active
+entries.
+
+### Creating a new template instance
+
+Creating a new template is easy. Use one of the existing templates as
+a model, and make the obvious changes. The hash and key_compare
+methods are performance-critical in multiple senses.
+
+If the key compare method is slow, every lookup will be slow. If the
+hash function is slow, same story. If the hash function has poor
+statistical properties, space efficiency will suffer. In the limit, a
+bad enough hash function will cause large portions of the table to
+revert to linear search.
+
+Use of the best available vector unit is well worth the trouble in the
+hash and key_compare functions.
diff --git a/docs/gettingstarted/developers/building.rst b/docs/gettingstarted/developers/building.rst
new file mode 100644
index 00000000000..18fa943a6fb
--- /dev/null
+++ b/docs/gettingstarted/developers/building.rst
@@ -0,0 +1,151 @@
+.. _building:
+
+.. toctree::
+
+Building VPP
+============
+
+To get started developing with VPP you need to get the sources and build the packages.
+
+.. _setupproxies:
+
+Set up Proxies
+--------------
+
+Depending on the environment, proxies may need to be set. 
+You may run these commands:
+
+.. code-block:: console
+
+    $ export http_proxy=http://<proxy-server-name>.com:<port-number>
+    $ export https_proxy=https://<proxy-server-name>.com:<port-number>
+
+
+Get the VPP Sources
+-------------------
+
+To get the VPP sources and get ready to build, execute the following:
+
+.. code-block:: console
+
+    $ git clone https://gerrit.fd.io/r/vpp
+    $ cd vpp
+
+Build VPP Dependencies
+----------------------
+
+Before building, make sure there are no FD.io VPP or DPDK packages installed by entering the following
+commands:
+
+.. code-block:: console
+
+    $ dpkg -l | grep vpp 
+    $ dpkg -l | grep -i dpdk
+
+There should be no output, or no packages shown, after each of the above commands.
+
+Run this to install the dependencies for FD.io VPP. 
+If it hangs during downloading at any point, you may need to set up :ref:`proxies for this to work <setupproxies>`.
+
+.. code-block:: console
+
+    $ make install-dep
+    Hit:1 http://us.archive.ubuntu.com/ubuntu xenial InRelease
+    Get:2 http://us.archive.ubuntu.com/ubuntu xenial-updates InRelease [109 kB]
+    Get:3 http://security.ubuntu.com/ubuntu xenial-security InRelease [107 kB]
+    Get:4 http://us.archive.ubuntu.com/ubuntu xenial-backports InRelease [107 kB]
+    Get:5 http://us.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages [803 kB]
+    Get:6 http://us.archive.ubuntu.com/ubuntu xenial-updates/main i386 Packages [732 kB]
+    ...
+    ...
+    Update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/jmap to provide /usr/bin/jmap (jmap) in auto mode
+    Setting up default-jdk-headless (2:1.8-56ubuntu2) ...
+    Processing triggers for libc-bin (2.23-0ubuntu3) ...
+    Processing triggers for systemd (229-4ubuntu6) ...
+    Processing triggers for ureadahead (0.100.0-19) ...
+    Processing triggers for ca-certificates (20160104ubuntu1) ...
+    Updating certificates in /etc/ssl/certs...
+    0 added, 0 removed; done.
+    Running hooks in /etc/ca-certificates/update.d...
+
+    done.
+    done.
+
+Build VPP (Debug Mode)
+----------------------
+
+This build version contains debug symbols, which are useful when modifying VPP. The command below builds a debug version of VPP.
+The binaries for this build can be found in /build-root/vpp_debug-native.
+
+.. code-block:: console
+
+    $ make build
+    make[1]: Entering directory '/home/vagrant/vpp-master/build-root'
+    @@@@ Arch for platform 'vpp' is native @@@@
+    @@@@ Finding source for dpdk @@@@
+    @@@@ Makefile fragment found in /home/vagrant/vpp-master/build-data/packages/dpdk.mk @@@@
+    @@@@ Source found in /home/vagrant/vpp-master/dpdk @@@@
+    @@@@ Arch for platform 'vpp' is native @@@@
+    @@@@ Finding source for vpp @@@@
+    @@@@ Makefile fragment found in /home/vagrant/vpp-master/build-data/packages/vpp.mk @@@@
+    @@@@ Source found in /home/vagrant/vpp-master/src @@@@
+    ...
+    ...
+    make[5]: Leaving directory '/home/vagrant/vpp-master/build-root/build-vpp_debug-native/vpp/vpp-api/java'
+    make[4]: Leaving directory '/home/vagrant/vpp-master/build-root/build-vpp_debug-native/vpp/vpp-api/java'
+    make[3]: Leaving directory '/home/vagrant/vpp-master/build-root/build-vpp_debug-native/vpp'
+    make[2]: Leaving directory '/home/vagrant/vpp-master/build-root/build-vpp_debug-native/vpp'
+    @@@@ Installing vpp: nothing to do @@@@
+    make[1]: Leaving directory '/home/vagrant/vpp-master/build-root'
+
+Build VPP (Release Version)
+---------------------------
+
+The command below builds the release version of FD.io VPP.
+This build is optimized and will not create debug symbols.
+The binaries for this build can be found in /build-root/build-vpp-native.
+
+.. code-block:: console
+
+    $ make build-release
+
+
+Building Necessary Packages
+---------------------------
+
+To build the packages, use one of the following commands, depending on the system:
+
+Building Debian Packages
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: console
+
+    $ make pkg-deb 
+
+
+Building RPM Packages
+^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: console
+
+    $ make pkg-rpm
+
+The packages will be found in the build-root directory.
+
+.. code-block:: console
+
+    $ ls *.deb
+
+If the packages built correctly, the output should look like this:
+
+.. code-block:: console
+
+    vpp_18.07-rc0~456-gb361076_amd64.deb             vpp-dbg_18.07-rc0~456-gb361076_amd64.deb
+    vpp-api-java_18.07-rc0~456-gb361076_amd64.deb    vpp-dev_18.07-rc0~456-gb361076_amd64.deb
+    vpp-api-lua_18.07-rc0~456-gb361076_amd64.deb     vpp-lib_18.07-rc0~456-gb361076_amd64.deb
+    vpp-api-python_18.07-rc0~456-gb361076_amd64.deb  vpp-plugins_18.07-rc0~456-gb361076_amd64.deb
+
+The built packages end up in the build-root directory. Finally, the commands below install all built packages.
+
+.. code-block:: console
+
+   $ sudo bash
+   # dpkg -i *.deb
diff --git a/docs/gettingstarted/developers/featurearcs.md b/docs/gettingstarted/developers/featurearcs.md
new file mode 100644
index 00000000000..f1e3ec47d05
--- /dev/null
+++ b/docs/gettingstarted/developers/featurearcs.md
@@ -0,0 +1,224 @@
+Feature Arcs
+============
+
+A significant number of vpp features are configurable on a per-interface
+or per-system basis. Rather than ask feature coders to manually
+construct the required graph arcs, we built a general mechanism to
+manage these mechanics.
+
+Specifically, feature arcs comprise ordered sets of graph nodes. Each
+feature node in an arc is independently controlled. Feature arc nodes
+are generally unaware of each other. Handing a packet to "the next
+feature node" is quite inexpensive.
+
+The feature arc implementation solves the problem of creating graph arcs
+used for steering.
+
+At the beginning of a feature arc, a bit of setup work is needed, but
+only if at least one feature is enabled on the arc.
+
+On a per-arc basis, individual feature definitions create a set of
+ordering dependencies. Feature infrastructure performs a topological
+sort of the ordering dependencies, to determine the actual feature
+order. Missing dependencies **will** lead to runtime disorder. See
+<https://gerrit.fd.io/r/#/c/12753> for an example.
+
+If no partial order exists, vpp will refuse to run. Circular dependency
+loops of the form "a then b, b then c, c then a" are impossible to
+satisfy.
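+
+The ordering step can be illustrated with a standalone sketch. The
+code below uses Kahn's algorithm on a small adjacency matrix; it is an
+illustration under stated assumptions, not the vpp feature
+infrastructure, and every name in it is hypothetical.
+
+```c
+#include <assert.h>
+
+#define SKETCH_MAX_FEATURES 16
+
+/*
+ * before[i][j] != 0 means "feature i must run before feature j".
+ * Writes a valid order into order[] and returns 0, or returns -1
+ * if the constraints contain a cycle (no partial order exists).
+ */
+static int
+sketch_feature_order (int n, unsigned char before[][SKETCH_MAX_FEATURES],
+                      int *order)
+{
+  int indegree[SKETCH_MAX_FEATURES] = { 0 };
+  int done[SKETCH_MAX_FEATURES] = { 0 };
+  int i, j, n_placed = 0;
+
+  assert (n >= 0 && n <= SKETCH_MAX_FEATURES);
+  for (i = 0; i < n; i++)
+    for (j = 0; j < n; j++)
+      if (before[i][j])
+        indegree[j]++;
+
+  while (n_placed < n)
+    {
+      /* Find an unplaced feature with no unplaced predecessors */
+      for (i = 0; i < n; i++)
+        if (!done[i] && indegree[i] == 0)
+          break;
+      if (i == n)
+        return -1;              /* a dependency cycle remains */
+      done[i] = 1;
+      order[n_placed++] = i;
+      for (j = 0; j < n; j++)
+        if (before[i][j])
+          indegree[j]--;
+    }
+  return 0;
+}
+```
+
+With the constraints "a then b, b then c" the sketch returns the order
+a, b, c; adding "c then a" makes it return -1, which corresponds to
+the case where vpp refuses to run.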
+
+Adding a feature to an existing feature arc
+-------------------------------------------
+
+To nobody's great surprise, we set up feature arcs using the typical
+"macro -> constructor function -> list of declarations" pattern:
+
+```c
+    VNET_FEATURE_INIT (mactime, static) =
+    {
+      .arc_name = "device-input",
+      .node_name = "mactime",
+      .runs_before = VNET_FEATURES ("ethernet-input"),
+    };  
+```
+
+This creates a "mactime" feature on the "device-input" arc.
+
+Once per frame, dig up the vnet\_feature\_config\_main\_t corresponding
+to the "device-input" feature arc:
+
+```c
+    vnet_main_t *vnm = vnet_get_main ();
+    vnet_interface_main_t *im = &vnm->interface_main;
+    u8 arc = im->output_feature_arc_index;
+    vnet_feature_config_main_t *fcm;
+
+    fcm = vnet_feature_get_config_main (arc);
+```
+
+Note that in this case, we've stored the required arc index - assigned
+by the feature infrastructure - in the vnet\_interface\_main\_t. Where
+to put the arc index is a programmer's decision when creating a feature
+arc.
+
+Per packet, set next0 to steer packets to the next node they should
+visit:
+
+```c
+    vnet_get_config_data (&fcm->config_main,
+                          &b0->current_config_index /* value-result */, 
+                          &next0, 0 /* # bytes of config data */);
+```
+
+Configuration data is per-feature arc, and is often unused. Note that
+it's normal to reset next0 to divert packets elsewhere; often, to drop
+them for cause:
+
+```c
+    next0 = MACTIME_NEXT_DROP;
+    b0->error = node->errors[DROP_CAUSE];
+```
+
+Creating a feature arc
+----------------------
+
+Once again, we create feature arcs using constructor macros:
+
+```c
+    VNET_FEATURE_ARC_INIT (ip4_unicast, static) =
+    {
+      .arc_name = "ip4-unicast",
+      .start_nodes = VNET_FEATURES ("ip4-input", "ip4-input-no-checksum"),
+      .arc_index_ptr = &ip4_main.lookup_main.ucast_feature_arc_index,
+    };  
+```
+
+In this case, we configure two arc start nodes to handle the
+"hardware-verified ip checksum or not" cases. During initialization,
+the feature infrastructure stores the arc index as shown.
+
+In the head-of-arc node, do the following to send packets along the
+feature arc:
+
+```c
+    ip_lookup_main_t *lm = &im->lookup_main;
+    arc = lm->ucast_feature_arc_index;
+```
+
+Once per packet, initialize packet metadata to walk the feature arc:
+
+```c
+vnet_feature_arc_start (arc, sw_if_index0, &next, b0);
+```
+
+Enabling / Disabling features
+-----------------------------
+
+Simply call vnet_feature_enable_disable to enable or disable a specific
+feature:
+
+```c
+    vnet_feature_enable_disable ("device-input", /* arc name */
+                                 "mactime",      /* feature name */
+                                 sw_if_index,    /* Interface sw_if_index */
+                                 enable_disable, /* 1 => enable */
+                                 0 /* (void *) feature_configuration */, 
+                                 0 /* feature_configuration_nbytes */);
+```
+
+The feature_configuration opaque is seldom used. 
+
+If you wish to make a feature a _de facto_ system-level concept, pass
+sw_if_index=0 at all times. Sw_if_index 0 is always valid, and
+corresponds to the "local" interface.
+
+Related "show" commands
+-----------------------
+
+To display the entire set of features, use "show features [verbose]". The
+verbose form displays arc indices, and feature indices within the arcs:
+
+```
+$ vppctl show features verbose
+Available feature paths
+<snip>
+[14] ip4-unicast:
+  [ 0]: nat64-out2in-handoff
+  [ 1]: nat64-out2in
+  [ 2]: nat44-ed-hairpin-dst
+  [ 3]: nat44-hairpin-dst
+  [ 4]: ip4-dhcp-client-detect
+  [ 5]: nat44-out2in-fast
+  [ 6]: nat44-in2out-fast
+  [ 7]: nat44-handoff-classify
+  [ 8]: nat44-out2in-worker-handoff
+  [ 9]: nat44-in2out-worker-handoff
+  [10]: nat44-ed-classify
+  [11]: nat44-ed-out2in
+  [12]: nat44-ed-in2out
+  [13]: nat44-det-classify
+  [14]: nat44-det-out2in
+  [15]: nat44-det-in2out
+  [16]: nat44-classify
+  [17]: nat44-out2in
+  [18]: nat44-in2out
+  [19]: ip4-qos-record
+  [20]: ip4-vxlan-gpe-bypass
+  [21]: ip4-reassembly-feature
+  [22]: ip4-not-enabled
+  [23]: ip4-source-and-port-range-check-rx
+  [24]: ip4-flow-classify
+  [25]: ip4-inacl
+  [26]: ip4-source-check-via-rx
+  [27]: ip4-source-check-via-any
+  [28]: ip4-policer-classify
+  [29]: ipsec-input-ip4
+  [30]: vpath-input-ip4
+  [31]: ip4-vxlan-bypass
+  [32]: ip4-lookup
+<snip>
+```
+
+Here, we learn that the ip4-unicast feature arc has index 14, and that
+e.g. ip4-inacl has index 25 in the generated partial order.
+
+To display the features currently active on a specific interface,
+use "show interface <name> features":
+
+```
+$ vppctl show interface GigabitEthernet3/0/0 features
+Feature paths configured on GigabitEthernet3/0/0...
+<snip>
+ip4-unicast:
+  nat44-out2in
+<snip>
+```
+
+Table of Feature Arcs
+---------------------
+
+Simply search for name-strings to track down the arc definition, location of
+the arc index, etc.
+
+```
+            |    Arc Name      |
+            |------------------|
+            | device-input     |
+            | ethernet-output  |
+            | interface-output |
+            | ip4-drop         |
+            | ip4-local        |
+            | ip4-multicast    |
+            | ip4-output       |
+            | ip4-punt         |
+            | ip4-unicast      |
+            | ip6-drop         |
+            | ip6-local        |
+            | ip6-multicast    |
+            | ip6-output       |
+            | ip6-punt         |
+            | ip6-unicast      |
+            | mpls-input       |
+            | mpls-output      |
+            | nsh-output       |
+```
diff --git a/docs/gettingstarted/developers/index.rst b/docs/gettingstarted/developers/index.rst
new file mode 100644
index 00000000000..cccb18d731a
--- /dev/null
+++ b/docs/gettingstarted/developers/index.rst
@@ -0,0 +1,18 @@
+.. _gstarteddevel:
+
+##########
+Developers
+##########
+
+.. toctree::
+   :maxdepth: 2
+
+   building
+   softwarearchitecture
+   infrastructure
+   vlib
+   plugins
+   vnet
+   featurearcs
+   bihash
+
diff --git a/docs/gettingstarted/developers/infrastructure.md b/docs/gettingstarted/developers/infrastructure.md
new file mode 100644
index 00000000000..688c42133ed
--- /dev/null
+++ b/docs/gettingstarted/developers/infrastructure.md
@@ -0,0 +1,330 @@
+VPPINFRA (Infrastructure)
+=========================
+
+The files associated with the VPP Infrastructure layer are located in
+the ./src/vppinfra folder.
+
+VPPinfra is a collection of basic c-library services, quite
+sufficient to build standalone programs to run directly on bare metal.
+It also provides high-performance dynamic arrays, hashes, bitmaps,
+high-precision real-time clock support, fine-grained event-logging, and
+data structure serialization.
+
+One fair comment / fair warning about vppinfra: you can\'t always tell a
+macro from an inline function from an ordinary function simply by name.
+Macros are used to avoid function calls in the typical case, and to
+cause (intentional) side-effects.
+
+Vppinfra has been around for almost 20 years and tends not to change
+frequently. The VPP Infrastructure layer contains the following
+functions:
+
+Vectors
+-------
+
+Vppinfra vectors are ubiquitous dynamically resized arrays with
+user-defined \"headers\". Many vppinfra data structures (e.g. hash, heap,
+pool) are vectors with various different headers.
+
+The memory layout looks like this:
+
+```
+                   User header (optional, uword aligned)
+                   Alignment padding (if needed)
+                   Vector length in elements
+ User's pointer -> Vector element 0
+                   Vector element 1
+                   ...
+                   Vector element N-1
+```
+
+As shown above, the vector APIs deal with pointers to the 0th element of
+a vector. Null pointers are valid vectors of length zero.
+
+To avoid thrashing the memory allocator, one often resets the length of
+a vector to zero while retaining the memory allocation. Set the vector
+length field to zero via the vec\_reset\_length(v) macro. \[Use the
+macro! It's smart about NULL pointers.\]
+
+Typically, the user header is not present. User headers allow for other
+data structures to be built atop vppinfra vectors. Users may specify the
+alignment for data elements via the vec\_\*\_aligned macros.
+
+Vector elements can be any C type (e.g. int, double, struct bar). This
+is also true for data types built atop vectors (e.g. heap, pool, etc.).
+Many macros have \_a variants supporting alignment of vector data and
+\_h variants supporting non-zero-length vector headers. The \_ha
+variants support both.
+
+Inconsistent usage of header and/or alignment related macro variants
+will cause delayed, confusing failures.
+
+Standard programming error: memorize a pointer to the ith element of a
+vector, and then expand the vector. Vectors expand by 3/2, so such code
+may appear to work for a period of time. Correct code almost always
+memorizes vector **indices** which are invariant across reallocations.
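+
+The hazard is not specific to vppinfra. As a minimal sketch, the
+plain-C growable array below (the hypothetical sketch\_vec\_t, built on
+realloc rather than the vppinfra vector macros) shows why a saved index
+survives a reallocation while a saved element pointer may not:
+
+```c
+#include <assert.h>
+#include <stdlib.h>
+
+typedef struct
+{
+  int *data;
+  size_t len, cap;
+} sketch_vec_t;
+
+/* Append one element, growing capacity by roughly 3/2 when full. */
+static void
+sketch_vec_add1 (sketch_vec_t * v, int value)
+{
+  if (v->len == v->cap)
+    {
+      size_t new_cap = v->cap ? v->cap + (v->cap + 1) / 2 : 4;
+      int *p = realloc (v->data, new_cap * sizeof (int));
+      assert (p != 0);
+      v->data = p;              /* the old v->data may now be dangling */
+      v->cap = new_cap;
+    }
+  v->data[v->len++] = value;
+}
+
+/* Remember an index, grow the vector, then re-derive the element. */
+static int
+sketch_index_survives_growth (void)
+{
+  sketch_vec_t v = { 0, 0, 0 };
+  size_t saved_index;
+  int i, result;
+
+  sketch_vec_add1 (&v, 42);
+  saved_index = 0;              /* safe to keep; &v.data[0] would not be */
+  for (i = 0; i < 1000; i++)
+    sketch_vec_add1 (&v, i);
+  result = v.data[saved_index];
+  free (v.data);
+  return result;
+}
+```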
+
+In typical application images, one supplies a set of global functions
+designed to be called from gdb. Here are a few examples:
+
+-   vl(v) - prints vec\_len(v)
+-   pe(p) - prints pool\_elts(p)
+-   pifi(p, index) - prints pool\_is\_free\_index(p, index)
+-   debug\_hex\_bytes (p, nbytes) - hex memory dump nbytes starting at p
+
+Use the "show gdb" debug CLI command to print the current set.
+
+Bitmaps
+-------
+
+Vppinfra bitmaps are dynamic, built using the vppinfra vector APIs.
+Quite handy for a variety of jobs.
+
+Pools
+-----
+
+Vppinfra pools combine vectors and bitmaps to rapidly allocate and free
+fixed-size data structures with independent lifetimes. Pools are perfect
+for allocating per-session structures.
+
+Hashes
+------
+
+Vppinfra provides several hash flavors. Data plane problems involving
+packet classification / session lookup often use
+./src/vppinfra/bihash\_template.\[ch\] bounded-index extensible
+hashes. These templates are instantiated multiple times, to efficiently
+service different fixed-key sizes.
+
+Bihashes are thread-safe. Read-locking is not required. A simple
+spin-lock ensures that only one thread writes an entry at a time.
+
+The original vppinfra hash implementation in
+./src/vppinfra/hash.\[ch\] is simple to use, and is often used in
+control-plane code which needs exact-string-matching.
+
+In either case, one almost always looks up a key in a hash table to
+obtain an index in a related vector or pool. The APIs are simple enough,
+but one must take care when using the unmanaged arbitrary-sized key
+variant. Hash\_set\_mem (hash\_table, key\_pointer, value) memorizes
+key\_pointer. It is usually a bad mistake to pass the address of a
+vector element as the second argument to hash\_set\_mem. It is perfectly
+fine to memorize constant string addresses in the text segment.
+
+Format
+------
+
+Vppinfra format is roughly equivalent to printf.
+
+Format has a few properties worth mentioning. Format's first argument is
+a (u8 \*) vector to which it appends the result of the current format
+operation. Chaining calls is very easy:
+
+```c
+    u8 * result;
+
+    result = format (0, "junk = %d, ", junk);
+    result = format (result, "more junk = %d\n", more_junk);
+```
+
+As previously noted, NULL pointers are perfectly proper 0-length
+vectors. Format returns a (u8 \*) vector, **not** a C-string. If you
+wish to print a (u8 \*) vector, use the "%v" format string. If you need
+a (u8 \*) vector which is also a proper C-string, either of these
+schemes may be used:
+
+```c
+    vec_add1 (result, 0);
+    /* or */
+    result = format (result, "<whatever>%c", 0);
+```
+
+Remember to vec\_free() the result if appropriate. Be careful not to
+pass format an uninitialized (u8 \*).
+
+Format implements a particularly handy user-format scheme via the "%U"
+format specification. For example:
+
+```c
+    u8 * format_junk (u8 * s, va_list *va)
+    {
+      /* The argument type must match what the caller passes after %U */
+      char *junk = va_arg (*va, char *);
+      s = format (s, "%s", junk);
+      return s;
+    }
+
+    result = format (0, "junk = %U", format_junk, "This is some junk");
+```
+
+format\_junk() can invoke other user-format functions if desired. The
+programmer shoulders responsibility for argument type-checking. It is
+typical for user format functions to blow up if the va\_arg(va,
+type) macros don't match the caller's idea of reality.
+
+Unformat
+--------
+
+Vppinfra unformat is vaguely related to scanf, but considerably more
+general.
+
+A typical use case involves initializing an unformat\_input\_t from
+either a C-string or a (u8 \*) vector, then parsing via unformat() as
+follows:
+
+```c
+    unformat_input_t input;
+
+    unformat_init_string (&input, "<some-C-string>");
+    /* or */
+    unformat_init_vector (&input, <u8-vector>);
+```
+
+Then loop parsing individual elements:
+
+```c
+    while (unformat_check_input (&input) != UNFORMAT_END_OF_INPUT) 
+    {
+      if (unformat (&input, "value1 %d", &value1))
+        ;/* unformat sets value1 */
+      else if (unformat (&input, "value2 %d", &value2))
+        ;/* unformat sets value2 */
+      else
+        return clib_error_return (0, "unknown input '%U'",
+                                  format_unformat_error, &input);
+    }
+```
+
+As with format, unformat implements a user-unformat function capability
+via the "%U" format specification.
+
+Vppinfra errors and warnings
+----------------------------
+
+Many functions within the vpp dataplane have return-values of type
+clib\_error\_t \*. Clib\_error\_t's are arbitrary strings with a bit of
+metadata \[fatal, warning\] and are easy to announce. Returning a NULL
+clib\_error\_t \* indicates "A-OK, no error."
+
+Clib\_warning(format-args) is a handy way to add debugging
+output; clib warnings prepend function:line info to unambiguously locate
+the message source. Clib\_unix\_warning() adds perror()-style Linux
+system-call information. In production images, clib\_warnings result in
+syslog entries.
+
+Serialization
+-------------
+
+Vppinfra serialization support allows the programmer to easily serialize
+and unserialize complex data structures.
+
+The underlying primitive serialize/unserialize functions use network
+byte-order, so there are no structural issues serializing on a
+little-endian host and unserializing on a big-endian host.
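+
+The byte-order point can be made concrete with a small sketch. This is
+not the vppinfra serialize API; put\_u32\_be and get\_u32\_be are
+hypothetical helpers that only illustrate why a fixed on-the-wire byte
+order makes the result independent of host endianness.
+
+```c
+    #include <stdint.h>
+
+    /* Store a 32-bit value most-significant byte first. */
+    static void
+    put_u32_be (uint8_t *p, uint32_t x)
+    {
+      p[0] = (uint8_t) (x >> 24);
+      p[1] = (uint8_t) (x >> 16);
+      p[2] = (uint8_t) (x >> 8);
+      p[3] = (uint8_t) x;
+    }
+
+    /* Rebuild the value from its big-endian byte sequence. */
+    static uint32_t
+    get_u32_be (const uint8_t *p)
+    {
+      return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16)
+        | ((uint32_t) p[2] << 8) | (uint32_t) p[3];
+    }
+```
+
+Because both helpers use shifts rather than memory casts, the same
+bytes are produced and consumed on little-endian and big-endian hosts.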
+
+Event-logger, graphical event log viewer
+----------------------------------------
+
+The vppinfra event logger provides very lightweight (sub-100ns)
+precisely time-stamped event-logging services. See
+./src/vppinfra/{elog.c, elog.h}
+
+Serialization support makes it easy to save and ultimately to combine a
+set of event logs. In a distributed system running NTP over a local LAN,
+we find that event logs collected from multiple system elements can be
+combined with a temporal uncertainty no worse than 50us.
+
+A typical event definition and logging call looks like this:
+
+```c
+    ELOG_TYPE_DECLARE (e) = 
+    {
+      .format = "tx-msg: stream %d local seq %d attempt %d",
+      .format_args = "i4i4i4",
+    };
+    struct { u32 stream_id, local_sequence, retry_count; } * ed;
+    ed = ELOG_DATA (m->elog_main, e);
+    ed->stream_id = stream_id;
+    ed->local_sequence = local_sequence;
+    ed->retry_count = retry_count;
+```
+
+The ELOG\_DATA macro returns a pointer to 20 bytes worth of arbitrary
+event data, to be formatted (offline, not at runtime) as described by
+format\_args. Aside from obvious integer formats, the CLIB event logger
+provides a couple of interesting additions. The "t4" format
+pretty-prints enumerated values:
+
+```c
+    ELOG_TYPE_DECLARE (e) = 
+    {
+      .format = "get_or_create: %s",
+      .format_args = "t4",
+      .n_enum_strings = 2,
+      .enum_strings = { "old", "new", },
+    };
+```
+
+The "t" format specifier indicates that the corresponding datum is an
+index in the event's set of enumerated strings, as shown in the previous
+event type definition.
+
+The “T” format specifier indicates that the corresponding datum is an
+index in the event log’s string heap. This allows the programmer to emit
+arbitrary formatted strings. One often combines this facility with a
+hash table to keep the event-log string heap from growing arbitrarily
+large.
+
+Subject to the 20-octet limit on the per-log-entry data field, the
+event log formatter supports arbitrary combinations of these data
+types. That is, the ".format\_args" field may contain one or more
+instances of the following:
+
+-   i1 - 8-bit unsigned integer
+-   i2 - 16-bit unsigned integer
+-   i4 - 32-bit unsigned integer
+-   i8 - 64-bit unsigned integer
+-   f4 - float
+-   f8 - double
+-   s - NULL-terminated string - be careful
+-   sN - N-byte character array
+-   t1,2,4 - per-event enumeration ID
+-   T4 - Event-log string table offset
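+
+When composing a format\_args string it can help to check the data size
+against the 20-octet limit. The helper below is a hypothetical sketch,
+not part of vppinfra: it assumes every specifier is a type letter
+followed by a decimal byte count, and it treats a bare "s" as
+unmeasurable because its length is only known at runtime.
+
+```c
+    #include <ctype.h>
+
+    /*
+     * Return the number of data bytes described by a format_args
+     * string, or -1 for a bare "s" or an unrecognized specifier.
+     */
+    static int
+    elog_args_size (const char *args)
+    {
+      int total = 0;
+
+      while (*args)
+        {
+          char type = *args++;
+          int n = 0;
+
+          /* Collect the decimal size following the type letter. */
+          while (isdigit ((unsigned char) *args))
+            n = n * 10 + (*args++ - '0');
+
+          if (type != 'i' && type != 'f' && type != 's'
+              && type != 't' && type != 'T')
+            return -1;
+          if (n == 0)
+            return -1;		/* e.g. bare "s": length unknown */
+          total += n;
+        }
+      return total;
+    }
+```
+
+With this sketch, "i4i4i4" measures 12 bytes and "i8i8i4" exactly 20,
+while "i8i8i8" measures 24 and would not fit.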
+
+The vpp engine event log is thread-safe, and is shared by all threads.
+Take care not to serialize the computation. Although the event-logger is
+about as fast as practicable, it's not appropriate for per-packet use in
+hard-core data plane code. It's most appropriate for capturing rare
+events - link up-down events, specific control-plane events and so
+forth.
+
+The vpp engine has several debug CLI commands for manipulating its event
+log:
+
+```
+    vpp# event-logger clear
+    vpp# event-logger save <filename> # for security, writes into /tmp/<filename>.
+                                      # <filename> must not contain '.' or '/' characters
+    vpp# show event-logger [all] [<nnn>] # display the event log
+                                       # by default, the last 250 entries
+```
+
+The event log defaults to 128K entries. The command-line argument "...
+vlib { elog-events nnn } ..." configures the size of the event log.
+
+As described above, the vpp engine event log is thread-safe and shared.
+To avoid confusing non-appearance of events logged by worker threads,
+make sure to code vlib\_global\_main.elog\_main - instead of
+vm->elog\_main. The latter form is correct in the main thread, but
+will almost certainly produce bad results in worker threads.
+
+G2 graphical event viewer
+-------------------------
+
+The g2 graphical event viewer can display serialized vppinfra event logs
+directly, or via the c2cpel tool.
+
+<div class="admonition note">
+
+Todo: please convert wiki page and figures
+
+</div>
+
diff --git a/docs/gettingstarted/developers/plugins.md b/docs/gettingstarted/developers/plugins.md
new file mode 100644
index 00000000000..ba3a2446306
--- /dev/null
+++ b/docs/gettingstarted/developers/plugins.md
@@ -0,0 +1,11 @@
+
+Plugins
+=======
+
+vlib implements a straightforward plug-in DLL mechanism. VLIB client
+applications specify a directory to search for plug-in .DLLs, and a name
+filter to apply (if desired). VLIB needs to load plug-ins very early.
+
+Once loaded, the plug-in DLL mechanism uses dlsym to find and verify a
+vlib\_plugin\_registration data structure in the newly-loaded plug-in.
+
diff --git a/docs/gettingstarted/developers/softwarearchitecture.md b/docs/gettingstarted/developers/softwarearchitecture.md
new file mode 100644
index 00000000000..a663134cd46
--- /dev/null
+++ b/docs/gettingstarted/developers/softwarearchitecture.md
@@ -0,0 +1,44 @@
+Software Architecture
+=====================
+
+The fd.io vpp implementation is a third-generation vector packet
+processing implementation specifically related to US Patent 7,961,636,
+as well as earlier work. Note that the Apache-2 license specifically
+grants non-exclusive patent licenses; we mention this patent as a point
+of historical interest.
+
+For performance, the vpp dataplane consists of a directed graph of
+forwarding nodes which process multiple packets per invocation. This
+schema enables a variety of micro-processor optimizations: pipelining
+and prefetching to cover dependent read latency, inherent I-cache phase
+behavior, vector instructions. Aside from hardware input and hardware
+output nodes, the entire forwarding graph is portable code.
+
+Depending on the scenario at hand, we often spin up multiple worker
+threads which process ingress-hashed packets from multiple queues using
+identical forwarding graph replicas.
+
+VPP Layers - Implementation Taxonomy
+------------------------------------
+
+![image](/_images/VPP_Layering.png)
+
+-   VPP Infra - the VPP infrastructure layer, which contains the core
+    library source code. This layer performs memory functions, works
+    with vectors and rings, performs key lookups in hash tables, and
+    works with timers for dispatching graph nodes.
+-   VLIB - the vector processing library. The vlib layer also handles
+    various application management functions: buffer, memory and graph
+    node management, maintaining and exporting counters, thread
+    management, packet tracing. Vlib implements the debug CLI (command
+    line interface).
+-   VNET - works with VPP\'s networking interface (layers 2, 3, and 4),
+    performs session and traffic management, and works with devices and
+    the data control plane.
+-   Plugins - Contains an increasingly rich set of data-plane plugins,
+    as noted in the above diagram.
+-   VPP - the container application linked against all of the above.
+
+It's important to understand each of these layers in a certain amount of
+detail. Much of the implementation is best dealt with at the API level
+and otherwise left alone.
diff --git a/docs/gettingstarted/developers/vlib.md b/docs/gettingstarted/developers/vlib.md
new file mode 100644
index 00000000000..9ef37fd2657
--- /dev/null
+++ b/docs/gettingstarted/developers/vlib.md
@@ -0,0 +1,496 @@
+
+VLIB (Vector Processing Library)
+================================
+
+The files associated with vlib are located in the ./src/{vlib,
+vlibapi, vlibmemory} folders. These libraries provide vector
+processing support including graph-node scheduling, reliable multicast
+support, ultra-lightweight cooperative multi-tasking threads, a CLI,
+plug in .DLL support, physical memory and Linux epoll support. Parts of
+this library embody US Patent 7,961,636.
+
+Init function discovery
+-----------------------
+
+vlib applications register for various \[initialization\] events by
+placing structures and \_\_attribute\_\_((constructor)) functions into
+the image. At appropriate times, the vlib framework walks
+constructor-generated singly-linked structure lists, calling the
+indicated functions. vlib applications create graph nodes, add CLI
+functions, start cooperative multi-tasking threads, etc. etc. using this
+mechanism.
+
+vlib applications invariably include a number of VLIB\_INIT\_FUNCTION
+(my\_init\_function) macros.
+
+Each init / configure / etc. function has the return type clib\_error\_t
+\*. Make sure that the function returns 0 if all is well, otherwise the
+framework will announce an error and exit.
+
+vlib applications must link against vppinfra, and often link against
+other libraries such as VNET. In the latter case, it may be necessary to
+explicitly reference symbol(s); otherwise large portions of the library
+may be AWOL at runtime.
+
+Node Graph Initialization
+-------------------------
+
+vlib packet-processing applications invariably define a set of graph
+nodes to process packets.
+
+One constructs a vlib\_node\_registration\_t, most often via the
+VLIB\_REGISTER\_NODE macro. At runtime, the framework processes the set
+of such registrations into a directed graph. It is easy enough to add
+nodes to the graph at runtime. The framework does not support removing
+nodes.
+
+vlib provides several types of vector-processing graph nodes, primarily
+to control framework dispatch behaviors. The type member of the
+vlib\_node\_registration\_t functions as follows:
+
+-   VLIB\_NODE\_TYPE\_PRE\_INPUT - run before all other node types
+-   VLIB\_NODE\_TYPE\_INPUT - run as often as possible, after pre\_input
+    nodes
+-   VLIB\_NODE\_TYPE\_INTERNAL - only when explicitly made runnable by
+    adding pending frames for processing
+-   VLIB\_NODE\_TYPE\_PROCESS - only when explicitly made runnable.
+    "Process" nodes are actually cooperative multi-tasking threads. They
+    **must** explicitly suspend after a reasonably short period of time.
+
+For a precise understanding of the graph node dispatcher, please read
+./src/vlib/main.c:vlib\_main\_loop.
+
+Graph node dispatcher
+---------------------
+
+Vlib\_main\_loop() dispatches graph nodes. The basic vector processing
+algorithm is diabolically simple, but may not be obvious from even a
+long stare at the code. Here's how it works: some input node, or set of
+input nodes, produce a vector of work to process. The graph node
+dispatcher pushes the work vector through the directed graph,
+subdividing it as needed, until the original work vector has been
+completely processed. At that point, the process recurs.
+
+This scheme yields a stable equilibrium in frame size, by construction.
+Here's why: as the frame size increases, the per-frame-element
+processing time decreases. There are several related forces at work; the
+simplest to describe is the effect of vector processing on the CPU L1
+I-cache. The first frame element \[packet\] processed by a given node
+warms up the node dispatch function in the L1 I-cache. All subsequent
+frame elements profit. As we increase the number of frame elements, the
+cost per element goes down.
+
+Under light load, it is a crazy waste of CPU cycles to run the graph
+node dispatcher flat-out. So, the graph node dispatcher arranges to wait
+for work by sitting in a timed epoll wait if the prevailing frame size
+is low. The scheme has a certain amount of hysteresis to avoid
+constantly toggling back and forth between interrupt and polling mode.
+Although the graph dispatcher supports interrupt and polling modes, our
+current default device drivers do not.
+
+The graph node scheduler uses a hierarchical timer wheel to reschedule
+process nodes upon timer expiration.
+
+Graph dispatcher internals
+--------------------------
+
+This section may be safely skipped. It's not necessary to understand
+graph dispatcher internals to create graph nodes. 
+
+Vector Data Structure
+---------------------
+
+In vpp / vlib, we represent vectors as instances of the vlib_frame_t type:
+
+```c
+    typedef struct vlib_frame_t
+    {
+      /* Frame flags. */
+      u16 flags;
+
+      /* Number of scalar bytes in arguments. */
+      u8 scalar_size;
+
+      /* Number of bytes per vector argument. */
+      u8 vector_size;
+
+      /* Number of vector elements currently in frame. */
+      u16 n_vectors;
+
+      /* Scalar and vector arguments to next node. */
+      u8 arguments[0];
+    } vlib_frame_t;
+```
+
+Note that one _could_ construct all kinds of vectors - including
+vectors with some associated scalar data - using this structure. In
+the vpp application, vectors typically use a 4-byte vector element
+size, and zero bytes' worth of associated per-frame scalar data.
+
+Frames are always allocated on CLIB_CACHE_LINE_BYTES boundaries.
+Frames have u32 indices which make use of the alignment property, so
+the maximum feasible main heap offset of a frame is
+CLIB_CACHE_LINE_BYTES * 0xFFFFFFFF. With 64-byte cache lines, that is
+64 * 4G = 256 Gbytes.
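+
+The arithmetic can be sketched as follows. These are hypothetical
+helpers, not the vlib functions, and CACHE\_LINE\_BYTES is an assumed
+stand-in for CLIB_CACHE_LINE_BYTES on a 64-byte-cache-line machine.
+
+```c
+    #include <stdint.h>
+
+    #define CACHE_LINE_BYTES 64	/* assumed cache line size */
+
+    /* A cache-line-aligned heap offset fits in a u32 index. */
+    static uint32_t
+    frame_offset_to_index (uint64_t heap_offset)
+    {
+      return (uint32_t) (heap_offset / CACHE_LINE_BYTES);
+    }
+
+    /* Widen before multiplying so the result is not truncated. */
+    static uint64_t
+    frame_index_to_offset (uint32_t frame_index)
+    {
+      return (uint64_t) frame_index * CACHE_LINE_BYTES;
+    }
+```
+
+The largest index, 0xFFFFFFFF, maps to an offset one cache line short
+of 2^38 bytes, which is where the 256 Gbyte figure comes from.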
+
+Scheduling Vectors
+------------------
+
+As you can see, vectors are not directly associated with graph
+nodes. We represent that association in a couple of ways.  The
+simplest is the vlib\_pending\_frame\_t:
+
+```c
+    /* A frame pending dispatch by main loop. */
+    typedef struct
+    {
+      /* Node and runtime for this frame. */
+      u32 node_runtime_index;
+
+      /* Frame index (in the heap). */
+      u32 frame_index;
+
+      /* Start of next frames for this node. */
+      u32 next_frame_index;
+
+      /* Special value for next_frame_index when there is no next frame. */
+    #define VLIB_PENDING_FRAME_NO_NEXT_FRAME ((u32) ~0)
+    } vlib_pending_frame_t;
+```
+
+Here is the code in .../src/vlib/main.c:vlib_main_or_worker_loop()
+which processes frames:
+
+```c
+      /* 
+       * Input nodes may have added work to the pending vector.
+       * Process pending vector until there is nothing left.
+       * All pending vectors will be processed from input -> output. 
+       */
+      for (i = 0; i < _vec_len (nm->pending_frames); i++)
+	cpu_time_now = dispatch_pending_node (vm, i, cpu_time_now);
+      /* Reset pending vector for next iteration. */
+```
+
+The pending frame node_runtime_index associates the frame with the
+node which will process it.
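+
+The shape of that loop is easier to see in a toy model. The sketch
+below is not vlib code: every toy\_ name is hypothetical, there are no
+frames or next-frames, and three fixed "nodes" simply pass their work
+downstream. It shows only that the loop bound is re-read on each
+iteration, so work appended during dispatch is handled in the same
+pass.
+
+```c
+    #include <stdint.h>
+
+    #define MAX_PENDING 16
+
+    typedef struct
+    {
+      uint32_t node_index;	/* toy node that handles this entry */
+      uint32_t n_vectors;	/* number of elements it carries */
+    } toy_pending_t;
+
+    typedef struct
+    {
+      toy_pending_t pending[MAX_PENDING];
+      uint32_t n_pending;
+      uint32_t handled[3];	/* elements seen by each toy node */
+    } toy_main_t;
+
+    static void
+    toy_add_pending (toy_main_t *tm, uint32_t node_index, uint32_t n)
+    {
+      if (tm->n_pending < MAX_PENDING)
+        {
+          tm->pending[tm->n_pending].node_index = node_index;
+          tm->pending[tm->n_pending].n_vectors = n;
+          tm->n_pending++;
+        }
+    }
+
+    /* Node 0 feeds node 1, node 1 feeds node 2, node 2 is a sink. */
+    static void
+    toy_dispatch (toy_main_t *tm, uint32_t i)
+    {
+      /* Copy first: a real vector could move when it grows. */
+      toy_pending_t p = tm->pending[i];
+
+      tm->handled[p.node_index] += p.n_vectors;
+      if (p.node_index < 2)
+        toy_add_pending (tm, p.node_index + 1, p.n_vectors);
+    }
+
+    static void
+    toy_main_loop_once (toy_main_t *tm)
+    {
+      uint32_t i;
+
+      /* n_pending is re-read every time around the loop. */
+      for (i = 0; i < tm->n_pending; i++)
+        toy_dispatch (tm, i);
+
+      /* Reset pending vector for next iteration. */
+      tm->n_pending = 0;
+    }
+```
+
+Seeding the model with one entry for node 0 and running
+toy\_main\_loop\_once drains all three stages before the pending
+vector is reset.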
+
+Complications
+-------------
+
+Fasten your seatbelt. Here's where the story - and the data structures
+\- become quite complicated...
+
+At 100,000 feet: vpp uses a directed graph, not a directed _acyclic_
+graph. It's really quite normal for a packet to visit ip\[46\]-lookup
+multiple times. The worst-case: a graph node which enqueues packets to
+itself.
+
+To deal with this issue, the graph dispatcher must force allocation of
+a new frame if the current graph node's dispatch function happens to
+enqueue a packet back to itself.
+
+There are no guarantees that a pending frame will be processed
+immediately, which means that more packets may be added to the
+underlying vlib_frame_t after it has been attached to a
+vlib_pending_frame_t. Care must be taken to allocate new
+frames and pending frames if a (pending\_frame, frame) pair fills.
+
+Next frames, next frame ownership
+---------------------------------
+
+The vlib\_next\_frame\_t is the last key graph dispatcher data structure:
+
+```c
+    typedef struct
+    {
+      /* Frame index. */
+      u32 frame_index;
+
+      /* Node runtime for this next. */
+      u32 node_runtime_index;
+
+      /* Next frame flags. */
+      u32 flags;
+
+      /* Reflects node frame-used flag for this next. */
+    #define VLIB_FRAME_NO_FREE_AFTER_DISPATCH \
+      VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH
+
+      /* This next frame owns enqueue to node
+         corresponding to node_runtime_index. */
+    #define VLIB_FRAME_OWNER (1 << 15)
+
+      /* Set when frame has been allocated for this next. */
+    #define VLIB_FRAME_IS_ALLOCATED	VLIB_NODE_FLAG_IS_OUTPUT
+
+      /* Set when frame has been added to pending vector. */
+    #define VLIB_FRAME_PENDING VLIB_NODE_FLAG_IS_DROP
+
+      /* Set when frame is to be freed after dispatch. */
+    #define VLIB_FRAME_FREE_AFTER_DISPATCH VLIB_NODE_FLAG_IS_PUNT
+
+      /* Set when frame has traced packets. */
+    #define VLIB_FRAME_TRACE VLIB_NODE_FLAG_TRACE
+
+      /* Number of vectors enqueue to this next since last overflow. */
+      u32 vectors_since_last_overflow;
+    } vlib_next_frame_t;
+```
+
+Graph node dispatch functions call vlib\_get\_next\_frame (...)  to
+set "(u32 \*)to_next" to the right place in the vlib_frame_t
+corresponding to the ith arc (aka next0) from the current node to the
+indicated next node.
+
+After some scuffling around - two levels of macros - processing
+reaches vlib\_get\_next\_frame_internal (...). Get-next-frame-internal
+digs up the vlib\_next\_frame\_t corresponding to the desired graph
+arc. 
+
+The next frame data structure amounts to a graph-arc-centric frame
+cache. Once a node finishes adding elements to a frame, it will acquire
+a vlib_pending_frame_t and end up on the graph dispatcher's
+run-queue. But there's no guarantee that more vector elements won't be
+added to the underlying frame from the same (source\_node,
+next\_index) arc or from a different (source\_node, next\_index) arc. 
+
+Maintaining consistency of the arc-to-frame cache is necessary. The
+first step in maintaining consistency is to make sure that only one
+graph node at a time thinks it "owns" the target vlib\_frame\_t.
+
+Back to the graph node dispatch function. In the usual case, a certain
+number of packets will be added to the vlib\_frame\_t acquired by
+calling vlib\_get\_next\_frame (...). 
+
+Before a dispatch function returns, it's required to call
+vlib\_put\_next\_frame (...) for all of the graph arcs it actually
+used.  This action adds a vlib\_pending\_frame\_t to the graph
+dispatcher's pending frame vector.
+
+Vlib\_put\_next\_frame makes a note in the pending frame of the frame
+index, and also of the vlib\_next\_frame\_t index.
+
+dispatch\_pending\_node actions
+-------------------------------
+
+The main graph dispatch loop calls dispatch\_pending\_node as shown
+above.
+
+Dispatch\_pending\_node recovers the pending frame, and the graph node
+runtime / dispatch function. Further, it recovers the next\_frame
+currently associated with the vlib\_frame\_t, and detaches the
+vlib\_frame\_t from the next\_frame.  
+
+In .../src/vlib/main.c:dispatch\_pending\_node(...), note this stanza:
+
+```c
+  /* Force allocation of new frame while current frame is being
+     dispatched. */
+  restore_frame_index = ~0;
+  if (nf->frame_index == p->frame_index)
+    {
+      nf->frame_index = ~0;
+      nf->flags &= ~VLIB_FRAME_IS_ALLOCATED;
+      if (!(n->flags & VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH))
+	restore_frame_index = p->frame_index;
+    }
+```
+
+dispatch\_pending\_node is worth a hard stare due to the several
+second-order optimizations it implements. Almost as an afterthought,
+it calls dispatch_node which actually calls the graph node dispatch
+function.
+
+Process / thread model
+----------------------
+
+vlib provides an ultra-lightweight cooperative multi-tasking thread
+model. The graph node scheduler invokes these processes in much the same
+way as traditional vector-processing run-to-completion graph nodes;
+plus-or-minus a setjmp/longjmp pair required to switch stacks. Simply
+set the vlib\_node\_registration\_t type field to
+VLIB\_NODE\_TYPE\_PROCESS. Yes, process is a misnomer. These are
+cooperative multi-tasking threads.
+
+As of this writing, the default stack size is 1<<15 = 32kb.
+Initialize the node registration's process\_log2\_n\_stack\_bytes member
+as needed. The graph node dispatcher makes some effort to detect stack
+overrun, e.g. by mapping a no-access page below each thread stack.
+
+Process node dispatch functions are expected to be "while(1) { }" loops
+which suspend when not otherwise occupied, and which must not run for
+unreasonably long periods of time.
+
+"Unreasonably long" is an application-dependent concept. Over the years,
+we have constructed frame-size sensitive control-plane nodes which will
+use a much higher fraction of the available CPU bandwidth when the frame
+size is low. The classic example: modifying forwarding tables. So long
+as the table-builder leaves the forwarding tables in a valid state, one
+can suspend the table builder to avoid dropping packets as a result of
+control-plane activity.
+
+Process nodes can suspend for fixed amounts of time, or until another
+entity signals an event, or both. See the next section for a description
+of the vlib process event mechanism.
+
+When running in vlib process context, one must pay strict attention to
+loop invariant issues. If one walks a data structure and calls a
+function which may suspend, one had best know by construction that it
+cannot change. Often, it's best to simply make a snapshot copy of a data
+structure, walk the copy at leisure, then free the copy.
+
+Process events
+--------------
+
+The vlib process event mechanism API is extremely lightweight and easy
+to use. Here is a typical example:
+
+```c
+    vlib_main_t *vm = &vlib_global_main;
+    uword event_type, * event_data = 0;
+
+    while (1) 
+    {
+       vlib_process_wait_for_event_or_clock (vm, 5.0 /* seconds */);
+
+       event_type = vlib_process_get_events (vm, &event_data);
+
+       switch (event_type) {
+       case EVENT1:
+           handle_event1s (event_data);
+           break;
+
+       case EVENT2:
+           handle_event2s (event_data);
+           break; 
+
+       case ~0: /* 5-second idle/periodic */
+           handle_idle ();
+           break;
+
+       default: /* bug! */
+           ASSERT (0);
+       }
+
+       vec_reset_length(event_data);
+    }
+```
+
+In this example, the VLIB process node waits for an event to occur, or
+for 5 seconds to elapse. The code demuxes on the event type, calling
+the appropriate handler function. Each call to
+vlib\_process\_get\_events returns a vector of per-event-type data
+passed to successive vlib\_process\_signal\_event calls; it is a
+serious error to process only event\_data\[0\].
+
+Resetting the event\_data vector-length to 0 \[instead of calling
+vec\_free\] means that the event scheme doesn't burn cycles continuously
+allocating and freeing the event data vector. This is a common vppinfra
+/ vlib coding pattern, well worth using when appropriate.
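+
+The saving is easy to demonstrate outside vppinfra. The sketch below
+uses a hypothetical toy\_vec\_t, not the real vec macros, and counts
+trips to the allocator; resetting the length keeps the storage, so a
+long-running loop allocates once.
+
+```c
+    #include <stdlib.h>
+
+    typedef struct
+    {
+      unsigned long *data;
+      size_t len;		/* elements in use */
+      size_t cap;		/* elements allocated */
+      unsigned n_allocs;	/* trips to the allocator */
+    } toy_vec_t;
+
+    /* Append one element, growing the storage when it is full. */
+    static int
+    toy_vec_add1 (toy_vec_t *v, unsigned long x)
+    {
+      if (v->len == v->cap)
+        {
+          size_t new_cap = v->cap ? 2 * v->cap : 4;
+          unsigned long *p = realloc (v->data, new_cap * sizeof (p[0]));
+          if (!p)
+            return -1;
+          v->data = p;
+          v->cap = new_cap;
+          v->n_allocs++;
+        }
+      v->data[v->len++] = x;
+      return 0;
+    }
+
+    /* Keep the storage, forget the contents. */
+    static void
+    toy_vec_reset_length (toy_vec_t *v)
+    {
+      v->len = 0;
+    }
+
+    /* Release the storage entirely. */
+    static void
+    toy_vec_free (toy_vec_t *v)
+    {
+      free (v->data);
+      v->data = 0;
+      v->len = v->cap = 0;
+    }
+```
+
+Filling and resetting this vector a hundred times costs a single
+allocation; calling toy\_vec\_free on each pass instead would cost one
+per pass.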
+
+Signaling an event is easy, for example:
+
+```c
+    vlib_process_signal_event (vm, process_node_index, EVENT1,
+        (uword)arbitrary_event1_data); /* and so forth */
+```
+
+One can either know the process node index by construction - dig it out
+of the appropriate vlib\_node\_registration\_t - or by finding the
+vlib\_node\_t with vlib\_get\_node\_by\_name(...).
+
+Buffers
+-------
+
+vlib buffering solves the usual set of packet-processing problems,
+albeit at high performance. Key in terms of performance: one ordinarily
+allocates / frees N buffers at a time rather than one at a time. Except
+when operating directly on a specific buffer, one deals with buffers by
+index, not by pointer.
+
+Packet-processing frames are u32\[\] arrays, not
+vlib\_buffer\_t\[\] arrays.
+
+Packets comprise one or more vlib buffers, chained together as required.
+Multiple particle sizes are supported; hardware input nodes simply ask
+for the required size(s). Coalescing support is available. For obvious
+reasons one is discouraged from writing one's own wild and wacky buffer
+chain traversal code.
+
+vlib buffer headers are allocated immediately prior to the buffer data
+area. In typical packet processing this saves a dependent read wait:
+given a buffer's address, one can prefetch the buffer header
+\[metadata\] at the same time as the first cache line of buffer data.
+
+Buffer header metadata (vlib\_buffer\_t) includes the usual rewrite
+expansion space, a current\_data offset, RX and TX interface indices,
+packet trace information, and opaque areas.
+
+The opaque data is intended to control packet processing in arbitrary
+subgraph-dependent ways. The programmer shoulders responsibility for
+data lifetime analysis, type-checking, etc.
+
+Buffers have reference-counts in support of e.g. multicast replication.
+
+Shared-memory message API
+-------------------------
+
+Local control-plane and application processes interact with the vpp
+dataplane via asynchronous message-passing in shared memory over
+unidirectional queues. The same application APIs are available via
+sockets.
+
+Capturing API traces and replaying them in a simulation environment
+requires a disciplined approach to the problem. This seems like a
+make-work task, but it is not. When something goes wrong in the
+control-plane after 300,000 or 3,000,000 operations, high-speed replay
+of the events leading up to the accident is a huge win.
+
+The shared-memory message API message allocator vl\_api\_msg\_alloc uses
+a particularly cute trick. Since messages are processed in order, we try
+to allocate message buffering from a set of fixed-size, preallocated
+rings. Each ring item has a "busy" bit. Freeing one of the preallocated
+message buffers merely requires the message consumer to clear the busy
+bit. No locking required.
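+
+A toy version of the busy-bit idea follows. It is a sketch under
+assumptions, not the vl\_api\_msg\_alloc implementation: the toy\_
+names and sizes are hypothetical, a single allocating thread is
+assumed, and a real shared-memory version would also need appropriate
+memory barriers. What happens when the next slot is still busy is not
+described above; returning NULL is this sketch's choice.
+
+```c
+    #include <stddef.h>
+
+    #define RING_ITEMS 4
+    #define ITEM_BYTES 64
+
+    typedef struct
+    {
+      volatile int busy;	/* set by allocator, cleared by consumer */
+      unsigned char data[ITEM_BYTES];
+    } toy_ring_item_t;
+
+    typedef struct
+    {
+      toy_ring_item_t items[RING_ITEMS];
+      unsigned next;		/* next slot the allocator will try */
+    } toy_ring_t;
+
+    /* Return a message buffer, or NULL if the next slot is busy. */
+    static void *
+    toy_msg_alloc (toy_ring_t *r)
+    {
+      toy_ring_item_t *it = &r->items[r->next];
+
+      if (it->busy)
+        return NULL;
+      it->busy = 1;
+      r->next = (r->next + 1) % RING_ITEMS;
+      return it->data;
+    }
+
+    /* Consumer side: find the item header and clear the busy bit. */
+    static void
+    toy_msg_free (void *msg)
+    {
+      unsigned char *base =
+        (unsigned char *) msg - offsetof (toy_ring_item_t, data);
+
+      ((toy_ring_item_t *) base)->busy = 0;
+    }
+```
+
+Because messages are consumed in order, the slot the allocator tries
+next is normally the one freed longest ago, so the consumer's single
+store to the busy flag is all the "free" operation needs.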
+
+Debug CLI
+---------
+
+Adding debug CLI commands to VLIB applications is very simple.
+
+Here is a complete example:
+
+```c
+    static clib_error_t *
+    show_ip_tuple_match (vlib_main_t * vm,
+                         unformat_input_t * input,
+                         vlib_cli_command_t * cmd)
+    {
+        vlib_cli_output (vm, "%U\n", format_ip_tuple_match_tables, &routing_main);
+        return 0;
+    }
+
+    /* *INDENT-OFF* */
+    static VLIB_CLI_COMMAND (show_ip_tuple_command) = 
+    {
+        .path = "show ip tuple match",
+        .short_help = "Show ip 5-tuple match-and-broadcast tables",
+        .function = show_ip_tuple_match,
+    };
+    /* *INDENT-ON* */
+```
+
+This example implements the "show ip tuple match" debug cli
+command. In ordinary usage, the vlib cli is available via the "vppctl"
+application, which sends traffic to a named pipe. One can configure
+debug CLI telnet access on a configurable port.
+
+The cli implementation has an output redirection facility which makes it
+simple to deliver cli output via shared-memory API messaging.
+
+Particularly for debug or "show tech support" type commands, it would be
+wasteful to write vlib application code to pack binary data, write more
+code elsewhere to unpack the data and finally print the answer. If a
+certain cli command has the potential to hurt packet processing
+performance by running for too long, do the work incrementally in a
+process node. The client can wait.
diff --git a/docs/gettingstarted/developers/vnet.md b/docs/gettingstarted/developers/vnet.md
new file mode 100644
index 00000000000..191a2a16969
--- /dev/null
+++ b/docs/gettingstarted/developers/vnet.md
@@ -0,0 +1,171 @@
+
+VNET (VPP Network Stack)
+========================
+
+The files associated with the VPP network stack layer are located in the
+./src/vnet folder. The Network Stack Layer is basically an
+instantiation of the code in the other layers. This layer has a vnet
+library that provides vectorized layer-2 and 3 networking graph nodes, a
+packet generator, and a packet tracer.
+
+In terms of building a packet processing application, vnet provides a
+platform-independent subgraph to which one connects a couple of
+device-driver nodes.
+
+Typical RX connections include "ethernet-input" \[full software
+classification, feeds ipv4-input, ipv6-input, arp-input etc.\] and
+"ipv4-input-no-checksum" \[if hardware can classify, perform ipv4 header
+checksum\].
+
+![image](/_images/VNET_Features.png)
+
+List of features and layer areas that VNET works with:
+
+Effective graph dispatch function coding
+----------------------------------------
+
+Over the last 15 years, multiple coding styles have emerged: a
+single/dual/quad loop coding model (with variations) and a
+fully-pipelined coding model.
+
+Single/dual loops
+-----------------
+
+The single/dual/quad loop model variations conveniently solve problems
+where the number of items to process is not known in advance: typical
+hardware RX-ring processing. This coding style is also very effective
+when a given node will not need to cover a complex set of dependent
+reads.
+
+Here is a quad/single loop which can leverage up-to-avx512 SIMD vector
+units to convert buffer indices to buffer pointers:
+
+```c
+   static uword
+   simulated_ethernet_interface_tx (vlib_main_t * vm,
+   				 vlib_node_runtime_t *
+   				 node, vlib_frame_t * frame)
+   {
+     u32 n_left_from, *from;
+     u32 next_index = 0;
+     u32 n_bytes;
+     u32 thread_index = vm->thread_index;
+     vnet_main_t *vnm = vnet_get_main ();
+     vnet_interface_main_t *im = &vnm->interface_main;
+     vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b;
+     u16 nexts[VLIB_FRAME_SIZE], *next;
+
+     n_left_from = frame->n_vectors;
+     from = vlib_frame_args (frame);
+
+     /* 
+      * Convert up to VLIB_FRAME_SIZE indices in "from" to 
+      * buffer pointers in bufs[]
+      */
+     vlib_get_buffers (vm, from, bufs, n_left_from);
+     b = bufs;
+     next = nexts;
+
+     /* 
+      * While we have at least 4 vector elements (pkts) to process.. 
+      */
+     while (n_left_from >= 4)
+       {
+         /* Prefetch next quad-loop iteration. */
+         if (PREDICT_TRUE (n_left_from >= 8))
+   	   {
+   	     vlib_prefetch_buffer_header (b[4], STORE);
+   	     vlib_prefetch_buffer_header (b[5], STORE);
+   	     vlib_prefetch_buffer_header (b[6], STORE);
+   	     vlib_prefetch_buffer_header (b[7], STORE);
+           }
+
+         /* 
+          * $$$ Process 4x packets right here...
+          * set next[0..3] to send the packets where they need to go
+          */
+
+          do_something_to (b[0]);
+          do_something_to (b[1]);
+          do_something_to (b[2]);
+          do_something_to (b[3]);
+
+         /* Process the next 0..4 packets */
+   	 b += 4;
+   	 next += 4;
+   	 n_left_from -= 4;
+   	}
+     /* 
+      * Clean up 0...3 remaining packets at the end of the incoming frame
+      */
+     while (n_left_from > 0)
+       {
+         /* 
+          * $$$ Process one packet right here...
+          * set next[0] to send the packet where it needs to go
+          */
+          do_something_to (b[0]);
+
+         /* Process the next packet */
+         b += 1;
+         next += 1;
+         n_left_from -= 1;
+       }
+
+     /*
+      * Send the packets along their respective next-node graph arcs
+      * Considerable locality of reference is expected, most if not all
+      * packets in the inbound vector will traverse the same next-node
+      * arc
+      */
+     vlib_buffer_enqueue_to_next (vm, node, from, nexts, frame->n_vectors);
+
+     return frame->n_vectors;
+   }  
+```
+
+Given a packet processing task to implement, it pays to scout around
+looking for similar tasks, and think about using the same coding
+pattern. It is not uncommon to recode a given graph node dispatch function
+several times during performance optimization.
+
+Packet tracer
+-------------
+
+Vlib includes a frame element \[packet\] trace facility, with a simple
+vlib cli interface. The cli is straightforward: "trace add
+input-node-name count".
+
+To trace 100 packets on a typical x86\_64 system running the dpdk
+plugin: "trace add dpdk-input 100". When using the packet generator:
+"trace add pg-input 100"
+
+Each graph node has the opportunity to capture its own trace data. It is
+almost always a good idea to do so. The trace capture APIs are simple.
+
+The packet capture APIs snapshot binary data, to minimize processing at
+capture time. Each participating graph node initialization provides a
+vppinfra format-style user function to pretty-print data when required
+by the VLIB "show trace" command.
+
+Set the VLIB node registration ".format\_trace" member to the name of
+the per-graph node format function.
+
+Here's a simple example:
+
+```c
+    u8 * my_node_format_trace (u8 * s, va_list * args)
+    {
+        vlib_main_t * vm = va_arg (*args, vlib_main_t *);
+        vlib_node_t * node = va_arg (*args, vlib_node_t *);
+        my_node_trace_t * t = va_arg (*args, my_node_trace_t *);
+
+        s = format (s, "My trace data was: %d", t-><whatever>);
+
+        return s;
+    } 
+```
+
+The trace framework hands the per-node format function the data it
+captured as the packet whizzed by. The format function pretty-prints the
+data as desired.
-- 
cgit