src/plugins/acl/acl_multicore_doc.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349

Multicore support for ACL plugin    {#acl_multicore}
================================

This captures some considerations and design decisions that I have made,
both for my own memory later on ("what the hell was I thinking?!?"),
and for anyone interested to criticize/improve/hack on this code.

One of the factors taken into account while making these decisions,
was the relative emphasis on the multi-thread vs. single-thread
use cases: the latter is the vastly more prevalent. But,
one can not optimize the single-thread performance without
having a functioning code for multi-thread.

stateless ACLs
==============

The stateless trivially parallelizes, and the only potential for the
race between the different threads is during the reconfiguration,
at the time of replacing the old ACL being checked, with
the new ACL.

In case an acl_add_replace is being used to replace the rules
within the existing entry, a reallocation of `am->acls[X].rules`
vector will happen and potentially a change in count.

acl_match_5tuple() has the following code:

```{.c}
  a = am->acls + acl_index;
  for (i = 0; i < a->count; i++)
    {
      r = a->rules + i;
     . . .
```

Ideally we should be immune from a->rules changing,
but the problem arises if the count changes in flight,
and the new ruleset is smaller - then we will attempt
to "match" against the free memory.

This can(?) be solved by replacing the for() with while(),
so the comparison happens at each iteration.

full_acl_match_5tuple(), which iterates over the list
of ACLs, is a bit less immune, since it takes the pointer
to the vector to iterate and keeps a local copy of
that pointer.

This race can be solved by checking the
current pointer to the vector with the source pointer,
and seeing if there is an (unlikely) change, and if
there is, return the "deny" action, or, better,
restart the check.

Since the check reloads the ACL list on a per-packet basis,
there is only a window of opportunity of one packet to
"match" packet against an incorrect rule set.
The workers also do not change anything, only read.
Therefore, it looks like building special structures
to ensure that it does not happen at all might be not
worth it.

At least not until we have a unit-test able to
reliably catch this condition and test that
the measures applied are effective. Adding the code
which is not possible to exercise is worse than
not adding any code at all.

So, I opt for "do-nothing" here for the moment.

reflexive ACLs: single-thread
=============================

Before we talk multi-thread, is worth revisiting the
design of the reflexive ACLs in the plugin, and
the history of their evolution.

The very first version of the ACL plugin, shipped in
1701, mostly did the job using the existing components
and gluing them together. Because it needed to work
in bridged forwarding path only, using L2 classifier
as an insertion point appeared natural, also L2 classifier,
being a table with sessions, seemed like a good place
to hold the sessions.

So, the original design had two conceptual nodes:
one, pointed by the next_miss from the L2 classifier table,
was checking the actual ACL, and inserting session into
the L2 classifier table, and the other one, pointed
to by the next_match within the specific session rule,
was checking the existing session. The timing out
of the existing connections was done in the datapath,
by periodically calling the aging function.

This decision to use the existing components,
with its attractiveness, did bring a few limitations as well:

* L2 classifier is a simple mask-and-value match, with
a fixed mask across the table. So, sanely supporting IPv6
packets with extension headers in that framework was impossible.

* There is no way to get a backpressure from L2 classifier
depending on memory usage. When it runs out of memory,
it simply crashes the box. When it runs out of memory ?
We don't really know. Depends on how it allocates it.

* Since we need to match the *reflected* traffic,
we had to create *two* full session entries
in two different directions, which is quite wasteful memory-wise.

* (showstopper): the L2 classifier runs only in
the bridged data path, so supporting routed data path
would require creating something else entirely different,
which would mean much more headaches support-wise going forward.

Because of that, I have moved to a different model of
creating a session-5-tuple from the packet data - once,
and then doing all the matching just on that 5-tuple.

This has allowed to add support for skipping IPv6 extension headers.

Also, this new version started to store the sessions in a dedicated
bihash-per-interface, with the session key data being
aligned for the ingress packets, and being mirrored for the
egress packets. This allows of significant savings in memory,
because now we need to keep only one copy of the session table per
interface instead of two, and also to only have ONE node for all the lookups,
(L2/L3 path, in/out, IPv4/IPv6) - significantly reducing the code complexity.

Unfortunately, bihash still has the "lack of backpressure" problem,
in a sense that if you try to insert too many entries and run out
of memory in the heap you supplied, you get a crash.

To somewhat workaround against that, there is a "maximum tested number of sessions"
value, which tracks the currently inserted sessions in the bihash,
and if this number is being approached, a more aggressive cleanup
can happen. If this number is reached, two behaviors are possible:

* attempt to do the stateless ACL matching and permit the packet
  if it succeeds

* deny the packet

Currently I have opted for a second one, since it allows for
a better defined behavior, and if you have to permit
the traffic in both directions, why using stateful anyway ?

In order to be able to do the cleanup, we need to discriminate between
the session types, with each session type having its own idle timeout.
In order to do that, we keep three lists, defined in enum acl_timeout_e:
ACL_TIMEOUT_UDP_IDLE, ACL_TIMEOUT_TCP_IDLE, ACL_TIMEOUT_TCP_TRANSIENT.

The first one is hopefully obvious - it is just all UDP connections.
They have an idle timeout of 600 seconds.

The second and third is a bit more subtle. TCP is a complicated protocol,
and we need to tread the fine line between doing too little and doing
too much, and triggering the potential compatibility issues because of
being a "middlebox".

I decided to split the TCP connections into two classes:
established, and everything else. "Established", means we have seen
the SYN and ACK from both sides (with PUSH obviously masked out).
This is the "active" state of any TCP connection and we would like
to ensure we do not screw it up. So, the connections in this state
have the default idle timer of 24 hours.

All the rest of the connections have the idle timeout of 2 minutes,
(inspired by an old value of MSL) and based on the observation
that the states this class represent are usually very short lived.

Once we have these three baskets of connections, it is trivial to
imagine a simple cleanup mechanism to deal with this: take a
TCP transient connection that has been hanging around.

It is debatable whether we want to do discrimination between the
different TCP transient connections. Assuming we do FIFO (and
the lists allow us to do just that), it means a given connection
on the head of the list has been hanging around for longest.
Thus, if we are short on resources, we might just go ahead and
reuse it within the datapath.

This is where we are slowly approaching the question
"Why in the world have not you used timer wheel or such ?"

The answer is simple: within the above constraints, it does
not buy me much.

Also, timer wheel creates a leaky abstraction with a difficult
to manage corner case. Which corner case ?

We have a set of objects (sessions) with an event that may
or may not happen (idle timeout timer firing), and a
necessity to reset the idle timeout when there is
activity on the session.

In the worst case, where we had a 10000 of one-packet
UDP sessions just created 10 minutes ago, we would need
to deal with a spike of 10000 expired timers.

Of course, if we have the active traffic on all
of these 10000 connections, then we will not have
to deal with that ? Right, but we will still have to deal
with canceling and requeueing the timers.

In the best possible case, requeueing a timer is
going to be something along the lines of a linked-list
removal and reinsertion.

However, keep in mind we already need to classify the
connections for reuse, so therefore we already have
the linked lists!

And if we just check these linked lists periodically in
a FIFO fashion, we can get away with a very simple per-packet operation:
writing back the timestamp of "now" into the connection structure.

Then rather than requeueing the list on a per-packet or per-frame
basis, we can defer this action until the time this session
appears on the head of the FIFO list, and the cleaning
routine makes the decision about whether to discard
the session (because the interval since last activity is bigger
than the idle timeout), or to requeue the session back to
the end of the list (because the last activity was less
than idle timeout ago).

So, rather than using the timers, we can simply reuse our classification
FIFOs, with the following heuristic: do not look at the session that was
enqueued at time X until X+session_timeout. If we enqueue the sessions
in the order of their initial activity, then we can simply use enqueue
timestamp of the head session as a decision criterion for when we need
to get back at looking at it for the timeout purposes.

Since the number of FIFOs is small, we get a slightly worse check
performance than with timers, but still O(1).

We seemingly do quite a few "useless" operations of requeueing the items
back to the tail of the list - but, these are the operations we do not
have to do in the active data path, so overall it is a win.

(Diversion: I believe this problem is congruent to poll vs. epoll or
events vs. threads, some reading on this subject:
http://web.archive.org/web/20120225022154/http://sheddingbikes.com/posts/1280829388.html)

We can also can run a TCP-like scheme for adaptively changing
the wait period in the routine that deals with the connection timeouts:
we can attempt to check the connections a couple of times per second
(same as we would advance the timer wheel), and then if we have requeued
close to a max-per-quantum number of connections, we can half the waiting
interval, and if we did not requeue any, we can slowly increment the waiting
interval - which at a steady state should stabilize similar to what the TCP rate
does.

reflexive ACLs: multi-thread
=============================

The single-threaded implementation in 1704 used a separate "cleaner" process
to deal with the timing out of the connections.
It is all good and great when you know that there is only a single core
to run everything on, but the existence of the lists proves to be
a massive difficulty when it comes to operating from multiple threads.

Initial study shows that with a few assumptions (e.g. that the cleaner running in main thread
and the worker have a demarcation point in time where either one or the other one touches
the session in the list) it might be possible to make it work, but the resulting
trickiness of doing it neatly with all the corner cases is quite large.

So, for the multi-threaded scenario, we need to move the connection
aging back to the same CPU as its creation.

Luckily we can do this with the help of the interrupts.

So, the design is as follows: the aging thread (acl_fa_session_cleaner_process)
periodically fires the interrupts to the workers interrupt nodes (acl_fa_worker_session_cleaner_process_node.index),
using vlib_node_set_interrupt_pending(), and
the interrupt node acl_fa_worker_conn_cleaner_process() calls acl_fa_check_idle_sessions()
which does the actual job of advancing the lists. And within the actual datapath the only thing we will be
doing is putting the items onto FIFO, and updating the last active time on the existing connection.

The one "delicate" part is that the worker for one leg of the connection might be different from
the worker of another leg of the connection - but, even if the "owner" tries to free the connection,
nothing terrible can happen - worst case the element of the pool (which is nominally free for a short period)
will get the timestamp updated - same thing about the TCP flags seen.

A slightly trickier issue arises when the packet initially seen by one worker (thus owned by that worker),
and the return packet processed by another worker, and as a result changes the
the class of the connection (e.g. becomes TCP_ESTABLISHED from TCP_TRANSIENT or vice versa).
If the class changes from one with the shorter idle time to the one with the longer idle time,
then unless we are in the starvation mode where the transient connections are recycled,
we can simply do nothing and let the normal requeue mechanism kick in. If the class changes from the longer idle
timer to the shorter idle timer, then we risk keeping the connection around for longer than needed, which
will affect the resource usage.

One solution to that is to have NxN ring buffers (where N is the number of workers), such that the non-owner
can signal to the owner the connection# that needs to be requeued out of order.

A simpler solution though, is to ensure that each FIFO's period is equal to that of a shortest timer.
This way the resource starvation problem is taken care of, at an expense of some additional work.

This all looks sufficiently nice and simple until a skeleton falls out of the closet:
sometimes we want to clean the connections en masse before they expire.

There few potential scenarios:
1) removal of an ACL from the interface
2) removal of an interface
3) manual action of an operator (in the future).

In order to tackle this, we need to modify the logic which decides whether to requeue the
connection on the end of the list, or to delete it due to idle timeout:

We define a point in time, and have each worker thread fast-forward through its FIFO,
in the process looking for sessions that satisfy the criteria, and either keeping them or requeueing them.

To keep the ease of appearance to the outside world, we still process this as an event
within the connection cleaner thread, but this event handler does as follows:
1) it creates the bitmap of the sw_if_index values requested to be cleared
2) for each worker, it waits to ensure there is no cleanup operation in progress (and if there is one,
it waits), and then makes a copy of the bitmap, sets the per-worker flag of a cleanup operation, and sends an interrupt.
3) wait until all cleanup operations have completed.

Within the worker interrupt node, we check if the "cleanup in progress" is set,
and if it is, we check the "fast forward time" value. If unset, we initialize it to value now, and compare the
requested bitmap of sw_if_index values (pending_clear_sw_if_index_bitmap) with the bitmap of sw_if_index that this worker deals with.

(we set the bit in the bitmap every time we enqueue the packet onto a FIFO - serviced_sw_if_index_bitmap in acl_fa_conn_list_add_session).

If the result of this AND operation is zero - then we can clear the flag of cleanup in progress and return.
Else we kick off the quantum of cleanup, and make sure we get another interrupt ASAP if that cleanup operation returns non-zero,
meaning there is more work to do.
When that operation returns zero, everything has been processed, we can clear the "cleanup-in-progress" flag, and
zeroize the bitmap of sw_if_index-es requested to be cleaned.

The interrupt node signals its wish to receive an interrupt ASAP by setting interrupt_is_needed
flag within the per-worker structure. The main thread, while waiting for the
cleanup operation to complete, checks if there is a request for interrupt,
and if there is - it sends one.

This approach gives us a way to mass-clean the connections which is reusing the code of the regular idle
connection cleanup.

One potential inefficiency is the bitmap values set by the session insertion
in the data path - there is nothing to clear them.

So, if one rearranges the interface placement with the workers, then the cleanups will cause some unnecessary work.
For now, we consider it an acceptable limitation. It can be resolved by having another per-worker bitmap, which, when set,
would trigger the cleanup of the bits in the serviced_sw_if_index_bitmap).

=== the end ===