1 files changed, 313 insertions, 0 deletions
diff --git a/docs/developer/corearchitecture/bihash.rst b/docs/developer/corearchitecture/bihash.rst
new file mode 100644
index 00000000000..9b62baaf9cf
--- /dev/null
+++ b/docs/developer/corearchitecture/bihash.rst
@@ -0,0 +1,313 @@
+Bounded-index Extensible Hashing (bihash)
+=========================================
+
+Vpp uses bounded-index extensible hashing to solve a variety of
+exact-match (key, value) lookup problems. Benefits of the current
+implementation:
+
+-  Very high record count scaling, tested to 100,000,000 records.
+-  Lookup performance degrades gracefully as the number of records
+   increases
+-  No reader locking required
+-  Template implementation, it’s easy to support arbitrary (key,value)
+   types
+
+Bounded-index extensible hashing has been widely used in databases for
+decades.
+
+Bihash uses a two-level data structure:
+
+::
+
+       +-----------------+
+       | bucket-0        |
+       |  log2_size      |
+       |  backing store  |
+       +-----------------+
+       | bucket-1        |
+       |  log2_size      |           +--------------------------------+
+       |  backing store  | --------> | KVP_PER_PAGE * key-value-pairs |
+       +-----------------+           | page 0                         |
+            ...                      +--------------------------------+
+       +-----------------+           | KVP_PER_PAGE * key-value-pairs |
+       | bucket-2**N-1   |           | page 1                         |
+       |  log2_size      |           +--------------------------------+
+       |  backing store  |                       ---
+       +-----------------+           +--------------------------------+
+                                     | KVP_PER_PAGE * key-value-pairs |
+                                     | page 2**(log2(size)) - 1       |
+                                     +--------------------------------+
+
+Discussion of the algorithm
+---------------------------
+
+This structure has a couple of major advantages. In practice, each
+bucket entry fits into a 64-bit integer. Coincidentally, vpp’s target
+CPU architectures support 64-bit atomic operations. When modifying the
+contents of a specific bucket, we do the following:
+
+-  Make a working copy of the bucket’s backing storage
+-  Atomically swap a pointer to the working copy into the bucket array
+-  Change the original backing store data
+-  Atomically swap back to the original
+
+So, no reader locking is required to search a bihash table.
+
+At lookup time, the implementation computes a key hash code. We use the
+least-significant N bits of the hash to select the bucket.
+
+With the bucket in hand, we learn log2 (nBackingPages) for the selected
+bucket. At this point, we use the next log2_size bits from the hash code
+to select the specific backing page in which the (key,value) page will
+be found.
+
+Net result: we search **one** backing page, not 2**log2_size pages. This
+is a key property of the algorithm.
+
+When sufficient collisions occur to fill the backing pages for a given
+bucket, we double the bucket size, rehash, and deal the bucket contents
+into a double-sized set of backing pages. In the future, we may
+represent the size as a linear combination of two powers-of-two, to
+increase space efficiency.
+
+To solve the “jackpot case” where a set of records collide under hashing
+in a bad way, the implementation will fall back to linear search across
+2**log2_size backing pages on a per-bucket basis.
+
+To maintain *space* efficiency, we should configure the bucket array so
+that backing pages are effectively utilized. Lookup performance tends to
+change *very little* if the bucket array is too small or too large.
+
+Bihash depends on selecting an effective hash function. If one were to
+use a truly broken hash function such as “return 1ULL.” bihash would
+still work, but it would be equivalent to poorly-programmed linear
+search.
+
+We often use cpu intrinsic functions - think crc32 - to rapidly compute
+a hash code which has decent statistics.
+
+Bihash Cookbook
+---------------
+
+Using current (key,value) template instance types
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It’s quite easy to use one of the template instance types. As of this
+writing, …/src/vppinfra provides pre-built templates for 8, 16, 20, 24,
+40, and 48 byte keys, u8 \* vector keys, and 8 byte values.
+
+See …/src/vppinfra/{bihash\_\_8}.h
+
+To define the data types, #include a specific template instance, most
+often in a subsystem header file:
+
+.. code:: c
+
+        #include <vppinfra/bihash_8_8.h>
+
+If you’re building a standalone application, you’ll need to define the
+various functions by #including the method implementation file in a C
+source file.
+
+The core vpp engine currently uses most if not all of the known bihash
+types, so you probably won’t need to #include the method implementation
+file.
+
+.. code:: c
+
+        #include <vppinfra/bihash_template.c>
+
+Add an instance of the selected bihash data structure to e.g. a “main_t”
+structure:
+
+.. code:: c
+
+       typedef struct
+       {
+         ...
+         BVT (clib_bihash) hash_table;
+         or
+         clib_bihash_8_8_t hash_table;
+         ...
+       } my_main_t;
+
+The BV macro concatenate its argument with the value of the preprocessor
+symbol BIHASH_TYPE. The BVT macro concatenates its argument with the
+value of BIHASH_TYPE and the fixed-string “_t”. So in the above example,
+BVT (clib_bihash) generates “clib_bihash_8_8_t”.
+
+If you’re sure you won’t decide to change the template / type name
+later, it’s perfectly OK to code “clib_bihash_8_8_t” and so forth.
+
+In fact, if you #include multiple template instances in a single source
+file, you **must** use fully-enumerated type names. The macros stand no
+chance of working.
+
+Initializing a bihash table
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Call the init function as shown. As a rough guide, pick a number of
+buckets which is approximately
+number_of_expected_records/BIHASH_KVP_PER_PAGE from the relevant
+template instance header-file. See previous discussion.
+
+The amount of memory selected should easily contain all of the records,
+with a generous allowance for hash collisions. Bihash memory is
+allocated separately from the main heap, and won’t cost anything except
+kernel PTE’s until touched, so it’s OK to be reasonably generous.
+
+For example:
+
+.. code:: c
+
+       my_main_t *mm = &my_main;
+       clib_bihash_8_8_t *h;
+
+       h = &mm->hash_table;
+
+       clib_bihash_init_8_8 (h, "test", (u32) number_of_buckets,
+                              (uword) memory_size);
+
+Add or delete a key/value pair
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Use BV(clib_bihash_add_del), or the explicit type variant:
+
+.. code:: c
+
+      clib_bihash_kv_8_8_t kv;
+      clib_bihash_8_8_t * h;
+      my_main_t *mm = &my_main;
+      clib_bihash_8_8_t *h;
+
+      h = &mm->hash_table;
+      kv.key = key_to_add_or_delete;
+      kv.value = value_to_add_or_delete;
+
+      clib_bihash_add_del_8_8 (h, &kv, is_add /* 1=add, 0=delete */);
+
+In the delete case, kv.value is irrelevant. To change the value
+associated with an existing (key,value) pair, simply re-add the [new]
+pair.
+
+Simple search
+~~~~~~~~~~~~~
+
+The simplest possible (key, value) search goes like so:
+
+.. code:: c
+
+      clib_bihash_kv_8_8_t search_kv, return_kv;
+      clib_bihash_8_8_t * h;
+      my_main_t *mm = &my_main;
+      clib_bihash_8_8_t *h;
+
+      h = &mm->hash_table;
+      search_kv.key = key_to_add_or_delete;
+
+      if (clib_bihash_search_8_8 (h, &search_kv, &return_kv) < 0)
+        key_not_found();
+      else
+        key_found();
+
+Note that it’s perfectly fine to collect the lookup result
+
+.. code:: c
+
+      if (clib_bihash_search_8_8 (h, &search_kv, &search_kv))
+        key_not_found();
+      etc.
+
+Bihash vector processing
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+When processing a vector of packets which need a certain lookup
+performed, it’s worth the trouble to compute the key hash, and prefetch
+the correct bucket ahead of time.
+
+Here’s a sketch of one way to write the required code:
+
+Dual-loop: \* 6 packets ahead, prefetch 2x vlib_buffer_t’s and 2x packet
+data required to form the record keys \* 4 packets ahead, form 2x record
+keys and call BV(clib_bihash_hash) or the explicit hash function to
+calculate the record hashes. Call 2x BV(clib_bihash_prefetch_bucket) to
+prefetch the buckets \* 2 packets ahead, call 2x
+BV(clib_bihash_prefetch_data) to prefetch 2x (key,value) data pages. \*
+In the processing section, call 2x
+BV(clib_bihash_search_inline_with_hash) to perform the search
+
+Programmer’s choice whether to stash the hash code somewhere in
+vnet_buffer(b) metadata, or to use local variables.
+
+Single-loop: \* Use simple search as shown above.
+
+Walking a bihash table
+~~~~~~~~~~~~~~~~~~~~~~
+
+A fairly common scenario to build “show” commands involves walking a
+bihash table. It’s simple enough:
+
+.. code:: c
+
+      my_main_t *mm = &my_main;
+      clib_bihash_8_8_t *h;
+      void callback_fn (clib_bihash_kv_8_8_t *, void *);
+
+      h = &mm->hash_table;
+
+      BV(clib_bihash_foreach_key_value_pair) (h, callback_fn, (void *) arg);
+
+To nobody’s great surprise: clib_bihash_foreach_key_value_pair iterates
+across the entire table, calling callback_fn with active entries.
+
+Bihash table iteration safety
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The iterator template “clib_bihash_foreach_key_value_pair” must be used
+with a certain amount of care. For one thing, the iterator template does
+*not* take the bihash hash table writer lock. If your use-case requires
+it, lock the table.
+
+For another, the iterator template is not safe under all conditions:
+
+-  It’s **OK to delete** bihash table entries during a table-walk. The
+   iterator checks whether the current bucket has been freed after each
+   *callback_fn(…)* invocation.
+
+-  It is **not OK to add** entries during a table-walk.
+
+The add-during-walk case involves a jackpot: while processing a
+key-value-pair in a particular bucket, add a certain number of entries.
+By luck, assume that one or more of the added entries causes the
+**current bucket** to split-and-rehash.
+
+Since we rehash KVP’s to different pages based on what amounts to a
+different hash function, either of these things can go wrong:
+
+-  We may revisit previously-visited entries. Depending on how one coded
+   the use-case, we could end up in a recursive-add situation.
+
+-  We may skip entries that have not been visited
+
+One could build an add-safe iterator, at a significant cost in
+performance: copy the entire bucket, and walk the copy.
+
+It’s hard to imagine a worthwhile add-during walk use-case in the first
+place; let alone one which couldn’t be implemented by walking the table
+without modifying it, then adding a set of records.
+
+Creating a new template instance
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Creating a new template is easy. Use one of the existing templates as a
+model, and make the obvious changes. The hash and key_compare methods
+are performance-critical in multiple senses.
+
+If the key compare method is slow, every lookup will be slow. If the
+hash function is slow, same story. If the hash function has poor
+statistical properties, space efficiency will suffer. In the limit, a
+bad enough hash function will cause large portions of the table to
+revert to linear search.
+
+Use of the best available vector unit is well worth the trouble in the
+hash and key_compare functions.