.. SPDX-License-Identifier: BSD-3-Clause
Copyright(c) 2010-2015 Intel Corporation.
Copyright(c) 2018 Arm Limited.
.. _Hash_Library:
Hash Library
============
The DPDK provides a Hash Library for creating hash tables, used for fast lookups.
The hash table is a data structure optimized for searching through a set of entries that are each identified by a unique key.
For increased performance, the DPDK Hash requires that all keys have the same number of bytes, which is set at hash table creation time.
Hash API Overview
-----------------
The main configuration parameters for the hash are:
* Total number of hash entries
* Size of the key in bytes
* An extra flag used to describe additional settings, for example the multithreading mode of operation (as will be described later)
The hash also allows the configuration of some low-level implementation related parameters such as:
* Hash function to translate the key into a bucket index
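As an illustration, these parameters map onto the fields of ``struct rte_hash_parameters`` passed to ``rte_hash_create()``. The sketch below is indicative only; the field names and available flags should be verified against the ``rte_hash.h`` header of the DPDK version in use:

```c
#include <rte_hash.h>
#include <rte_jhash.h>

/* Configuration sketch only: verify field names and flags against the
 * rte_hash.h shipped with your DPDK version. */
static struct rte_hash *
create_example_hash(void)
{
    struct rte_hash_parameters params = {
        .name = "example_hash",
        .entries = 1024,            /* total number of hash entries */
        .key_len = 16,              /* key size in bytes, fixed at creation */
        .hash_func = rte_jhash,     /* configurable hash function */
        .hash_func_init_val = 0,
        .socket_id = 0,             /* NUMA socket for the table's memory */
        .extra_flag = RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY,
    };
    return rte_hash_create(&params);
}
```

The ``extra_flag`` field carries the multithreading-mode flags described later in this chapter.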
The main methods exported by the hash are:
* Add entry with key: The key is provided as input. If a new entry is successfully added to the hash for the specified key,
or there is already an entry in the hash for the specified key, then the position of the entry is returned.
If the operation was not successful, for example due to lack of free entries in the hash, then a negative value is returned;
* Delete entry with key: The key is provided as input. If an entry with the specified key is found in the hash,
then the entry is removed from the hash and the position where the entry was found in the hash is returned.
If no entry with the specified key exists in the hash, then a negative value is returned.
* Lookup for entry with key: The key is provided as input. If an entry with the specified key is found in the hash (lookup hit),
then the position of the entry is returned, otherwise (lookup miss) a negative value is returned.
Apart from the methods explained above, the API provides the user with a few more options:
* Add / lookup / delete with key and precomputed hash: Both the key and its precomputed hash are provided as input. This allows
the user to perform these operations faster, as hash is already computed.
* Add / lookup with key and data: A key-value pair is provided as input. This allows the user to store
not only the key, but also data, which may be either an 8-byte integer or a pointer to external data (if the data size is more than 8 bytes).
* Combination of the two options above: User can provide key, precomputed hash and data.
* Ability to not free the position of the entry in the hash upon calling delete. This is useful for multi-threaded scenarios where
readers continue to use the position even after the entry is deleted.
Also, the API contains a method to allow the user to look up entries in bursts, achieving higher performance
than looking up individual entries, as the function prefetches next entries at the time it is operating
with the first ones, which significantly reduces the impact of the necessary memory accesses.
The actual data associated with each key can be either managed by the user using a separate table that
mirrors the hash in terms of number of entries and position of each entry,
as shown in the Flow Classification use case described in the following sections,
or stored in the hash table itself.
The example hash tables in the L2/L3 Forwarding sample applications define which port to forward a packet to, based on a packet flow identified by the five-tuple lookup.
However, this table could also be used for more sophisticated features and provide many other functions and actions that could be performed on the packets and flows.
Multi-process support
---------------------
The hash library can be used in a multi-process environment.
The only function that is limited to single-process mode is rte_hash_set_cmp_func(), which sets up
a custom key compare function; because that function is assigned through a function pointer, it is not supported in
multi-process mode.
Multi-thread support
---------------------
The hash library supports multithreading, and the user specifies the needed mode of operation at the creation time of the hash table
by appropriately setting the flag. In all modes of operation, lookups are thread-safe, meaning that lookups can be called from multiple
threads concurrently.
For concurrent writes, and for concurrent reads and writes, the following flag values define the corresponding modes of operation:
* If the multi-writer flag (RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD) is set, multiple threads are allowed to write to the table.
Key add, delete, and table reset are protected from other writer threads. With only this flag set, readers are not protected from ongoing writes.
* If the read/write concurrency (RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY) is set, multithread read/write operation is safe
(i.e., application does not need to stop the readers from accessing the hash table until writers finish their updates. Readers and writers can operate on the table concurrently).
The library uses a reader-writer lock to provide the concurrency.
* In addition to these two flag values, if the transactional memory flag (RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT) is also set,
the reader-writer lock will use hardware transactional memory to guarantee thread safety as long as it is supported by the hardware (for example the Intel® TSX support).
If the platform supports Intel® TSX, it is advised to set the transactional memory flag, as this will speed up concurrent table operations.
Otherwise concurrent operations will be slower because of the overhead associated with the software locking mechanisms.
* If lock free read/write concurrency (RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF) is set, read/write concurrency is provided without using reader-writer lock.
For platforms that do not support transactional memory (for example, current Arm-based platforms), it is advised to set this flag to achieve greater scalability in performance.
* If the 'do not free on delete' flag (RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL) is set, the position of the entry in the hash is not freed upon calling delete. This flag is enabled
by default when lock free read/write concurrency is set. The application should free the position after all the readers have stopped referencing the position.
Where required, the application can make use of RCU mechanisms to determine when the readers have stopped referencing the position.
Implementation Details
----------------------
The hash table has two main tables:
* The first table is an array of entries, which is further divided into buckets,
with the same number of consecutive array entries in each bucket. Each entry contains the computed primary
and secondary hashes of a given key (explained below), and an index into the second table.
* The second table is an array of all the keys stored in the hash table, together with the data associated with each key.
The hash library uses the cuckoo hash method to resolve collisions.
For any input key, there are two possible buckets (primary and secondary/alternative location)
where that key can be stored in the hash; therefore, only the entries within those buckets need to be examined
when the key is looked up.
The lookup speed is achieved by reducing the number of entries to be scanned from the total
number of hash entries down to the number of entries in the two hash buckets,
as opposed to the basic method of linearly scanning all the entries in the array.
The hash uses a hash function (configurable) to translate the input key into a 4-byte key signature.
The bucket index is the key signature modulo the number of hash buckets.
Once the buckets are identified, the scope of the hash add,
delete and lookup operations is reduced to the entries in those buckets (it is very likely that entries are in the primary bucket).
To speed up the search logic within the bucket, each hash entry stores the 4-byte key signature together with the full key.
For large key sizes, comparing the input key against a key from the bucket can take significantly more time than
comparing the 4-byte signature of the input key against the signature of a key from the bucket.
Therefore, the signature comparison is done first, and the full key comparison is done only when the signatures match.
The full key comparison is still necessary, as two input keys from the same bucket can still potentially have the same 4-byte hash signature,
although this event is relatively rare for hash functions providing good uniform distributions for the set of input keys.
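The signature-first comparison can be sketched as follows; the slot layout and names here are invented for illustration (in the real library, the entry holds an index into the key table rather than the key itself):

```c
#include <stdint.h>
#include <string.h>

#define BUCKET_ENTRIES 4
#define KEY_LEN 16

/* Illustrative slot: a 4-byte signature next to the full key (stored
 * inline here for brevity). Not DPDK's internal layout. */
struct slot {
    uint32_t sig;               /* 4-byte key signature */
    uint8_t key[KEY_LEN];       /* full key */
    int in_use;
};

/* Return the slot index on a hit, -1 on a miss. */
static int search_bucket(const struct slot bucket[BUCKET_ENTRIES],
                         uint32_t sig, const uint8_t key[KEY_LEN])
{
    for (int i = 0; i < BUCKET_ENTRIES; i++) {
        if (!bucket[i].in_use || bucket[i].sig != sig)
            continue;           /* cheap 4-byte compare filters first */
        if (memcmp(bucket[i].key, key, KEY_LEN) == 0)
            return i;           /* full key compare confirms the match */
    }
    return -1;
}
```

The full `memcmp()` runs only for slots whose signature already matched, which is the cost saving the paragraph above describes.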
Example of lookup:
First, the primary bucket is identified; the entry is most likely to be stored there.
If the signature is stored there, its key is compared against the one provided, and the position
where it was stored and/or the data associated with that key is returned if there is a match.
If the signature is not in the primary bucket, the secondary bucket is looked up, where the same procedure
is carried out. If there is no match there either, the key is considered not to be in the table.
Example of addition:
As in lookup, the primary and secondary buckets are identified. If there is an empty slot in
the primary bucket, the primary and secondary signatures are stored in that slot, the key and data (if any) are added to
the second table, and the index of the position in the second table is stored in the slot of the first table.
If there is no space in the primary bucket, one of the entries in that bucket is pushed to its alternative location,
and the key to be added is inserted in its position.
To know where the alternative bucket of the evicted entry is, its secondary signature is looked up, and the alternative bucket index
is calculated by taking the modulo, as described above. If there is room in the alternative bucket, the evicted entry
is stored in it. If not, the same process is repeated (one of the entries gets pushed out) until a non-full bucket is found.
Notice that despite all the entry movement in the first table, the second table is not touched, since touching it would
greatly impact performance.
In the very unlikely event that the table enters a loop where the same entries are evicted indefinitely,
the key is considered not to be storable.
With random keys, this method allows the user to get around 90% of the table utilization, without
having to drop any stored entry (LRU) or allocate more memory (extended buckets).
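The eviction loop described above can be sketched with a toy cuckoo table over integer keys. All names, sizes, and hash constants here are invented for illustration; this is not DPDK's implementation, which among other things keeps the keys in a separate second table:

```c
#include <stdint.h>

#define NUM_BUCKETS 8
#define BUCKET_ENTRIES 2
#define MAX_DISPLACEMENTS 100   /* give up if the eviction chain loops */

struct entry { uint32_t key; int in_use; };
static struct entry table_[NUM_BUCKETS][BUCKET_ENTRIES];

/* Two toy hash functions giving each key two candidate buckets. */
static uint32_t hash1(uint32_t k) { return (k * 2654435761u) % NUM_BUCKETS; }
static uint32_t hash2(uint32_t k) { return ((k ^ 0x5bd1e995u) * 2246822519u) % NUM_BUCKETS; }

/* Return the key's other candidate bucket. */
static uint32_t alt_bucket(uint32_t key, uint32_t cur)
{
    uint32_t b1 = hash1(key);
    return cur == b1 ? hash2(key) : b1;
}

/* Store the key in the first empty slot of the bucket, if any. */
static int try_slot(uint32_t bucket, uint32_t key)
{
    for (int i = 0; i < BUCKET_ENTRIES; i++) {
        if (!table_[bucket][i].in_use) {
            table_[bucket][i].key = key;
            table_[bucket][i].in_use = 1;
            return 1;
        }
    }
    return 0;
}

/* Insert with cuckoo eviction: if both candidate buckets are full, push a
 * resident entry to its alternative bucket and take its slot; repeat until
 * a hole is found or the displacement limit is hit. */
static int cuckoo_insert(uint32_t key)
{
    uint32_t bucket = hash1(key);
    if (try_slot(bucket, key) || try_slot(alt_bucket(key, bucket), key))
        return 0;
    for (int n = 0; n < MAX_DISPLACEMENTS; n++) {
        uint32_t victim = table_[bucket][0].key;   /* evict a resident */
        table_[bucket][0].key = key;
        key = victim;
        bucket = alt_bucket(key, bucket);          /* victim's other bucket */
        if (try_slot(bucket, key))
            return 0;
    }
    return -1;   /* eviction chain too long: key considered not storable */
}

/* Lookup only needs to scan the key's two candidate buckets. */
static int cuckoo_lookup(uint32_t key)
{
    uint32_t b[2] = { hash1(key), hash2(key) };
    for (int j = 0; j < 2; j++)
        for (int i = 0; i < BUCKET_ENTRIES; i++)
            if (table_[b[j]][i].in_use && table_[b[j]][i].key == key)
                return 1;
    return 0;
}
```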
Example of deletion:
Similar to lookup, the key is searched for in its primary and secondary buckets. If the key is found, the bucket
entry is marked as an empty slot. If the hash table was configured with 'no free on delete' or 'lock free read/write concurrency',
the position of the key is not freed. It is the responsibility of the user to free the position while making sure that
readers are not referencing the position anymore.
Entry distribution in hash table
--------------------------------
As mentioned above, the cuckoo hash implementation pushes elements out of their bucket
whenever a new entry needs to be added whose primary location coincides with their current bucket,
moving them to their alternative location.
Therefore, as the user adds more entries to the hash table, the distribution of the hash values
in the buckets changes: most entries remain in their primary location and a few sit in
their secondary location, and the latter share increases as the table gets busier.
This information is quite useful, as performance may be lower when more entries
are evicted to their secondary location.
See the tables below showing example entry distribution as table utilization increases.
.. _table_hash_lib_1:
.. table:: Entry distribution measured with an example table with 1024 random entries using jhash algorithm

   +--------------+-----------------------+-------------------------+
   | % Table used | % In Primary location | % In Secondary location |
   +==============+=======================+=========================+
   | 25           | 100                   | 0                       |
   +--------------+-----------------------+-------------------------+
   | 50           | 96.1                  | 3.9                     |
   +--------------+-----------------------+-------------------------+
   | 75           | 88.2                  | 11.8                    |
   +--------------+-----------------------+-------------------------+
   | 80           | 86.3                  | 13.7                    |
   +--------------+-----------------------+-------------------------+
   | 85           | 83.1                  | 16.9                    |
   +--------------+-----------------------+-------------------------+
   | 90           | 77.3                  | 22.7                    |
   +--------------+-----------------------+-------------------------+
   | 95.8         | 64.5                  | 35.5                    |
   +--------------+-----------------------+-------------------------+
|
.. _table_hash_lib_2:
.. table:: Entry distribution measured with an example table with 1 million random entries using jhash algorithm

   +--------------+-----------------------+-------------------------+
   | % Table used | % In Primary location | % In Secondary location |
   +==============+=======================+=========================+
   | 50           | 96                    | 4                       |
   +--------------+-----------------------+-------------------------+
   | 75           | 86.9                  | 13.1                    |
   +--------------+-----------------------+-------------------------+
   | 80           | 83.9                  | 16.1                    |
   +--------------+-----------------------+-------------------------+
   | 85           | 80.1                  | 19.9                    |
   +--------------+-----------------------+-------------------------+
   | 90           | 74.8                  | 25.2                    |
   +--------------+-----------------------+-------------------------+
   | 94.5         | 67.4                  | 32.6                    |
   +--------------+-----------------------+-------------------------+
.. note::

   The last values in the tables above are the average maximum table
   utilization with random keys, using the Jenkins hash function.
Use Case: Flow Classification
-----------------------------
Flow classification is used to map each input packet to the connection/flow it belongs to.
This operation is necessary because each input packet is usually processed in the context of its connection,
so the same set of operations is applied to all the packets from the same flow.
Applications using flow classification typically have a flow table to manage, with each separate flow having an entry associated with it in this table.
The size of the flow table entry is application specific, with typical values of 4, 16, 32 or 64 bytes.
Each application using flow classification typically has a mechanism defined to uniquely identify a flow based on
a number of fields read from the input packet that make up the flow key.
One example is to use the DiffServ 5-tuple made up of the following fields of the IP and transport layer packet headers:
Source IP Address, Destination IP Address, Protocol, Source Port, Destination Port.
The DPDK hash provides a generic method to implement an application specific flow classification mechanism.
Given a flow table implemented as an array, the application should create a hash object with the same number of entries as the flow table and
with the hash key size set to the number of bytes in the selected flow key.
The flow table operations on the application side are described below:
* Add flow: Add the flow key to hash.
If the returned position is valid, use it to access the flow entry in the flow table for adding a new flow or
updating the information associated with an existing flow.
Otherwise, the flow addition failed, for example due to lack of free entries for storing new flows.
* Delete flow: Delete the flow key from the hash. If the returned position is valid,
use it to access the flow entry in the flow table to invalidate the information associated with the flow.
* Free flow: Free the flow key position. If the 'no free on delete' or 'lock-free read/write concurrency' flags are set,
wait until the readers are no longer referencing the position returned during flow add/delete, and then free the position.
RCU mechanisms can be used to determine when the readers are no longer referencing the position.
* Lookup flow: Lookup for the flow key in the hash.
If the returned position is valid (flow lookup hit), use the returned position to access the flow entry in the flow table.
Otherwise (flow lookup miss) there is no flow registered for the current packet.
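The pattern above can be sketched in self-contained C, with a toy linear-scan stand-in for the hash object and invented names throughout; the key point is that the position returned by add/lookup indexes directly into the mirrored flow table:

```c
#include <stdint.h>
#include <string.h>

#define MAX_FLOWS 64

/* DiffServ 5-tuple flow key; packed so memcmp sees no padding bytes.
 * (DPDK code would typically use __rte_packed for the same purpose.) */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t proto;
} __attribute__((packed));

struct flow_entry { uint64_t pkt_count; };

static struct flow_key keys[MAX_FLOWS];
static int key_used[MAX_FLOWS];
static struct flow_entry flow_table[MAX_FLOWS];  /* mirrors hash positions */

/* Toy stand-in for hash lookup: return the key's position, or -1 on miss. */
static int flow_lookup(const struct flow_key *k)
{
    for (int i = 0; i < MAX_FLOWS; i++)
        if (key_used[i] && memcmp(&keys[i], k, sizeof(*k)) == 0)
            return i;
    return -1;
}

/* Toy stand-in for hash add: return the position, or -1 if full. */
static int flow_add(const struct flow_key *k)
{
    int pos = flow_lookup(k);
    if (pos >= 0)
        return pos;              /* already present: same position */
    for (int i = 0; i < MAX_FLOWS; i++)
        if (!key_used[i]) {
            keys[i] = *k;
            key_used[i] = 1;
            return i;
        }
    return -1;
}

/* Per-packet path: look the flow up; add it on a miss; update its entry. */
static void handle_packet(const struct flow_key *k)
{
    int pos = flow_lookup(k);
    if (pos < 0)
        pos = flow_add(k);
    if (pos >= 0)
        flow_table[pos].pkt_count++;
}
```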
References
----------
* Donald E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching (2nd Edition), 1998, Addison-Wesley Professional