aboutsummaryrefslogtreecommitdiffstats
path: root/docs/gettingstarted/developers/bihash.md
blob: b0f9886988778c9c406c908588fb273aa9b9a284 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
Bounded-index Extensible Hashing (bihash)
=========================================

Vpp uses bounded-index extensible hashing to solve a variety of
exact-match (key, value) lookup problems. Benefits of the current
implementation:

* Very high record count scaling, tested to 100,000,000 records.
* Lookup performance degrades gracefully as the number of records increases
* No reader locking required
* Template implementation, it's easy to support arbitrary (key,value) types

Bounded-index extensible hashing has been widely used in databases for
decades.

Bihash uses a two-level data structure:

```
    +-----------------+
    | bucket-0        |
    |  log2_size      |
    |  backing store  |
    +-----------------+
    | bucket-1        |
    |  log2_size      |           +--------------------------------+
    |  backing store  | --------> | KVP_PER_PAGE * key-value-pairs |
    +-----------------+           | page 0                         |
         ...                      +--------------------------------+
    +-----------------+           | KVP_PER_PAGE * key-value-pairs |
    | bucket-2**N-1   |           | page 1                         |
    |  log2_size      |           +--------------------------------+
    |  backing store  |                       ---
    +-----------------+           +--------------------------------+
                                  | KVP_PER_PAGE * key-value-pairs |
                                  | page 2**(log2(size)) - 1       |
                                  +--------------------------------+
```

Discussion of the algorithm
---------------------------

This structure has a couple of major advantages. In practice, each
bucket entry fits into a 64-bit integer. Coincidentally, vpp's target
CPU architectures support 64-bit atomic operations. When modifying the
contents of a specific bucket, we do the following:

* Make a working copy of the bucket's backing storage
* Atomically swap a pointer to the working copy into the bucket array
* Change the original backing store data
* Atomically swap back to the original

So, no reader locking is required to search a bihash table.

At lookup time, the implementation computes a key hash code. We use
the least-significant N bits of the hash to select the bucket.

With the bucket in hand, we learn log2 (nBackingPages) for the
selected bucket. At this point, we use the next log2_size bits from
the hash code to select the specific backing page in which the
(key,value) page will be found.

Net result: we search **one** backing page, not 2**log2_size
pages. This is a key property of the algorithm.

When sufficient collisions occur to fill the backing pages for a given
bucket, we double the bucket size, rehash, and deal the bucket
contents into a double-sized set of backing pages. In the future, we
may represent the size as a linear combination of two powers-of-two,
to increase space efficiency.

To solve the "jackpot case" where a set of records collide under
hashing in a bad way, the implementation will fall back to linear
search across 2**log2_size backing pages on a per-bucket basis.

To maintain *space* efficiency, we should configure the bucket array
so that backing pages are effectively utilized. Lookup performance
tends to change *very little* if the bucket array is too small or too
large.

Bihash depends on selecting an effective hash function. If one were to
use a truly broken hash function such as "return 1ULL." bihash would
still work, but it would be equivalent to poorly-programmed linear
search.

We often use cpu intrinsic functions - think crc32 - to rapidly
compute a hash code which has decent statistics.

Bihash Cookbook
---------------

### Using current (key,value) template instance types

It's quite easy to use one of the template instance types. As of this
writing, .../src/vppinfra provides pre-built templates for 8, 16, 20,
24, 40, and 48 byte keys, u8 * vector keys, and 8 byte values.

See .../src/vppinfra/{bihash_<key-size>_8}.h

To define the data types, #include a specific template instance, most
often in a subsystem header file:

```c
     #include <vppinfra/bihash_8_8.h>
```

If you're building a standalone application, you'll need to define the
various functions by #including the method implementation file in a C
source file.

The core vpp engine currently uses most if not all of the known bihash
types, so you probably won't need to #include the method
implementation file.


```c
     #include <vppinfra/bihash_template.c>
```

Add an instance of the selected bihash data structure to e.g. a
"main_t" structure:

```c
    typedef struct
    {
      ...
      BVT (clib_bihash) hash_table;
      or
      clib_bihash_8_8_t hash_table;
      ...
    } my_main_t;
```

The BV macro concatenate its argument with the value of the
preprocessor symbol BIHASH_TYPE. The BVT macro concatenates its
argument with the value of BIHASH_TYPE and the fixed-string "_t". So
in the above example, BVT (clib_bihash) generates "clib_bihash_8_8_t".

If you're sure you won't decide to change the template / type name
later, it's perfectly OK to code "clib_bihash_8_8_t" and so forth.

In fact, if you #include multiple template instances in a single
source file, you **must** use fully-enumerated type names. The macros
stand no chance of working.

### Initializing a bihash table

Call the init function as shown. As a rough guide, pick a number of
buckets which is approximately
number_of_expected_records/BIHASH_KVP_PER_PAGE from the relevant
template instance header-file.  See previous discussion.

The amount of memory selected should easily contain all of the
records, with a generous allowance for hash collisions. Bihash memory
is allocated separately from the main heap, and won't cost anything
except kernel PTE's until touched, so it's OK to be reasonably
generous.

For example:

```c
    my_main_t *mm = &my_main;
    clib_bihash_8_8_t *h;

    h = &mm->hash_table;

    clib_bihash_init_8_8 (h, "test", (u32) number_of_buckets,
                           (uword) memory_size);
```

### Add or delete a key/value pair

Use BV(clib_bihash_add_del), or the explicit type variant:

```c
   clib_bihash_kv_8_8_t kv;
   clib_bihash_8_8_t * h;
   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;

   h = &mm->hash_table;
   kv.key = key_to_add_or_delete;
   kv.value = value_to_add_or_delete;

   clib_bihash_add_del_8_8 (h, &kv, is_add /* 1=add, 0=delete */);
```

In the delete case, kv.value is irrelevant. To change the value associated
with an existing (key,value) pair, simply re-add the [new] pair.

### Simple search

The simplest possible (key, value) search goes like so:

```c
   clib_bihash_kv_8_8_t search_kv, return_kv;
   clib_bihash_8_8_t * h;
   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;

   h = &mm->hash_table;
   search_kv.key = key_to_add_or_delete;

   if (clib_bihash_search_8_8 (h, &search_kv, &return_kv) < 0)
     key_not_found();
   else
     key_found();
```

Note that it's perfectly fine to collect the lookup result

```c
   if (clib_bihash_search_8_8 (h, &search_kv, &search_kv))
     key_not_found();
   etc.
```

### Bihash vector processing

When processing a vector of packets which need a certain lookup
performed, it's worth the trouble to compute the key hash, and
prefetch the correct bucket ahead of time.

Here's a sketch of one way to write the required code:

Dual-loop:
* 6 packets ahead, prefetch 2x vlib_buffer_t's and 2x packet data
  required to form the record keys
* 4 packets ahead, form 2x record keys and call BV(clib_bihash_hash)
  or the explicit hash function to calculate the record hashes.
  Call 2x BV(clib_bihash_prefetch_bucket) to prefetch the buckets
* 2 packets ahead, call 2x BV(clib_bihash_prefetch_data) to prefetch
  2x (key,value) data pages.
* In the processing section, call 2x BV(clib_bihash_search_inline_with_hash)
  to perform the search

Programmer's choice whether to stash the hash code somewhere in
vnet_buffer(b) metadata, or to use local variables.

Single-loop:
* Use simple search as shown above.

### Walking a bihash table

A fairly common scenario to build "show" commands involves walking a
bihash table. It's simple enough:

```c
   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;
   void callback_fn (clib_bihash_kv_8_8_t *, void *);

   h = &mm->hash_table;

   BV(clib_bihash_foreach_key_value_pair) (h, callback_fn, (void *) arg);
```
To nobody's great surprise: clib_bihash_foreach_key_value_pair
iterates across the entire table, calling callback_fn with active
entries.

#### Bihash table iteration safety

The iterator template "clib_bihash_foreach_key_value_pair" must be
used with a certain amount of care. For one thing, the iterator
template does _not_ take the bihash hash table writer lock. If your
use-case requires it, lock the table.

For another, the iterator template is not safe under all conditions:

* It's __OK to delete__ bihash table entries during a table-walk. The
iterator checks whether the current bucket has been freed after each
_callback_fn(...)_ invocation.

* It is __not OK to add__ entries during a table-walk.

The add-during-walk case involves a jackpot: while processing a
key-value-pair in a particular bucket, add a certain number of
entries. By luck, assume that one or more of the added entries causes
the __current bucket__ to split-and-rehash.

Since we rehash KVP's to different pages based on what amounts to a
different hash function, either of these things can go wrong:

* We may revisit previously-visited entries. Depending on how one
coded the use-case, we could end up in a recursive-add situation.

* We may skip entries that have not been visited

One could build an add-safe iterator, at a significant cost in
performance: copy the entire bucket, and walk the copy.

It's hard to imagine a worthwhile add-during walk use-case in the
first place; let alone one which couldn't be implemented by walking
the table without modifying it, then adding a set of records.

### Creating a new template instance

Creating a new template is easy. Use one of the existing templates as
a model, and make the obvious changes. The hash and key_compare
methods are performance-critical in multiple senses.

If the key compare method is slow, every lookup will be slow. If the
hash function is slow, same story. If the hash function has poor
statistical properties, space efficiency will suffer. In the limit, a
bad enough hash function will cause large portions of the table to
revert to linear search.

Use of the best available vector unit is well worth the trouble in the
hash and key_compare functions.