Multicore support for ACL plugin
================================

This captures some considerations and design decisions that I have made,
both for my own memory later on ("what the hell was I thinking?!?"), and
for anyone interested to criticize/improve/hack on this code.

One of the factors taken into account while making these decisions was
the relative emphasis on the multi-thread vs. single-thread use cases:
the latter is vastly more prevalent. But one can not optimize the
single-thread performance without having functioning code for
multi-thread.

stateless ACLs
--------------

The stateless ACLs trivially parallelize, and the only potential for a
race between the different threads is during reconfiguration, at the
time of replacing the old ACL being checked with the new ACL.

If acl_add_replace is used to replace the rules within an existing
entry, the ``am->acls[X].rules`` vector will be reallocated, and its
count will potentially change.

acl_match_5tuple() has the following code:

.. code:: c

   a = am->acls + acl_index;
   for (i = 0; i < a->count; i++)
     {
       r = a->rules + i;
       . . .

Ideally we should be immune from a->rules changing, but the problem
arises if the count changes in flight and the new ruleset is smaller -
then we will attempt to "match" against the freed memory.

This can(?) be solved by replacing the for() with a while(), so the
comparison against the count happens at each iteration.

full_acl_match_5tuple(), which iterates over the list of ACLs, is a bit
less immune, since it takes the pointer to the vector to iterate over
and keeps a local copy of that pointer.

This race can be solved by comparing the current pointer to the vector
with the saved source pointer, and, in case of an (unlikely) change,
returning the "deny" action, or, better, restarting the check.
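For illustration, here is a minimal sketch of that "compare and
restart" idea. The types and the function name are made-up stand-ins
rather than the plugin's actual structures, and, as explained below,
the plugin currently does not implement this:

.. code:: c

   typedef struct
   {
     int is_permit;
     /* ... the rest of the rule ... */
   } rule_t;

   typedef struct
   {
     rule_t *rules; /* stand-in for the am->acls[X].rules vector */
     int count;
   } acl_t;

   static int
   match_5tuple_restartable (acl_t * acls, int acl_index,
                             const void *pkt_5tuple)
   {
     rule_t *rules_snapshot;
     int i;

     (void) pkt_5tuple;
   restart:
     /* local copy of the vector pointer, as full_acl_match_5tuple()
        keeps one */
     rules_snapshot = acls[acl_index].rules;
     for (i = 0; i < acls[acl_index].count; i++)
       {
         /* an (unlikely) reconfiguration replaced the vector under
            us: throw away the partial walk and start over */
         if (acls[acl_index].rules != rules_snapshot)
           goto restart;
         /* ... match pkt_5tuple against rules_snapshot[i] here ... */
       }
     return 0; /* no match: fall through to "deny" */
   }

Note that re-reading the count on every loop iteration also covers the
for()/while() point made earlier.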
Since the check reloads the ACL list on a per-packet basis, there is
only a window of opportunity of one packet to "match" the packet
against an incorrect rule set. The workers also do not change anything,
only read. Therefore, it looks like building special structures to
ensure that this does not happen at all might not be worth it.

At least not until we have a unit test able to reliably catch this
condition and verify that the measures applied are effective. Adding
code which is impossible to exercise is worse than not adding any code
at all.

So, I opt for "do-nothing" here for the moment.

reflexive ACLs: single-thread
-----------------------------

Before we talk multi-thread, it is worth revisiting the design of the
reflexive ACLs in the plugin, and the history of their evolution.

The very first version of the ACL plugin, shipped in 1701, mostly did
the job using the existing components and gluing them together. Because
it needed to work in the bridged forwarding path only, using the L2
classifier as an insertion point appeared natural; also the L2
classifier, being a table with sessions, seemed like a good place to
hold the sessions.

So, the original design had two conceptual nodes: one, pointed to by
the next_miss from the L2 classifier table, was checking the actual ACL
and inserting a session into the L2 classifier table, and the other
one, pointed to by the next_match within the specific session rule, was
checking the existing session. The timing out of the existing
connections was done in the datapath, by periodically calling the aging
function.

For all its attractiveness, this decision to use the existing
components brought a few limitations as well:

- The L2 classifier is a simple mask-and-value match, with a fixed mask
  across the table. So, sanely supporting IPv6 packets with extension
  headers in that framework was impossible.

- There is no way to get backpressure from the L2 classifier depending
  on memory usage. When it runs out of memory, it simply crashes the
  box. And when does it run out of memory? We don't really know; it
  depends on how it allocates it.

- Since we need to match the *reflected* traffic, we had to create
  *two* full session entries in two different directions, which is
  quite wasteful memory-wise.

- (showstopper): the L2 classifier runs only in the bridged data path,
  so supporting the routed data path would require creating something
  else entirely, which would mean many more headaches support-wise
  going forward.

Because of that, I have moved to a different model: creating a
session-5-tuple from the packet data - once - and then doing all the
matching just on that 5-tuple.

This has allowed adding support for skipping IPv6 extension headers.

Also, this new version started to store the sessions in a dedicated
bihash-per-interface, with the session key data being aligned for the
ingress packets, and being mirrored for the egress packets. This allows
for significant savings in memory, because now we need to keep only one
copy of the session table per interface instead of two, and also allows
having only ONE node for all the lookups (L2/L3 path, in/out,
IPv4/IPv6) - significantly reducing the code complexity.

Unfortunately, bihash still has the "lack of backpressure" problem, in
the sense that if you try to insert too many entries and run out of
memory in the heap you supplied, you get a crash.

To somewhat work around that, there is a "maximum tested number of
sessions" value, which tracks the currently inserted sessions in the
bihash; if this number is being approached, a more aggressive cleanup
can happen. If this number is reached, two behaviors are possible:

- attempt to do the stateless ACL matching and permit the packet if it
  succeeds

- deny the packet

Currently I have opted for the second one, since it allows for better
defined behavior, and if you have to permit the traffic in both
directions, why use stateful anyway?

In order to be able to do the cleanup, we need to discriminate between
the session types, with each session type having its own idle timeout.
In order to do that, we keep three lists, defined in enum acl_timeout_e:
ACL_TIMEOUT_UDP_IDLE, ACL_TIMEOUT_TCP_IDLE, ACL_TIMEOUT_TCP_TRANSIENT.

The first one is hopefully obvious - it is just all UDP connections.
They have an idle timeout of 600 seconds.
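To summarize the three classes in one place, here is a sketch of the
corresponding timeout table. The three list names are the ones from
enum acl_timeout_e above, the values are the ones discussed in this
section (the two TCP values are explained in the next few paragraphs),
and ACL_N_TIMEOUTS plus the table layout are illustrative assumptions
rather than the plugin's actual definitions:

.. code:: c

   enum acl_timeout_e
   {
     ACL_TIMEOUT_UDP_IDLE = 0,  /* any UDP session         */
     ACL_TIMEOUT_TCP_IDLE,      /* established TCP session */
     ACL_TIMEOUT_TCP_TRANSIENT, /* all other TCP states    */
     ACL_N_TIMEOUTS             /* hypothetical sentinel   */
   };

   static const unsigned int idle_timeout_sec[ACL_N_TIMEOUTS] = {
     [ACL_TIMEOUT_UDP_IDLE] = 600,       /* 10 minutes */
     [ACL_TIMEOUT_TCP_IDLE] = 24 * 3600, /* 24 hours   */
     [ACL_TIMEOUT_TCP_TRANSIENT] = 120,  /* 2 minutes  */
   };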
The second and third are a bit more subtle. TCP is a complicated
protocol, and we need to tread the fine line between doing too little
and doing too much, triggering potential compatibility issues by being
a "middlebox".

I decided to split the TCP connections into two classes: established,
and everything else. "Established" means we have seen the SYN and ACK
from both sides (with PUSH obviously masked out). This is the "active"
state of any TCP connection, and we would like to ensure we do not
screw it up. So, the connections in this state have the default idle
timer of 24 hours.

All the rest of the connections have an idle timeout of 2 minutes
(inspired by an old value of MSL), based on the observation that the
states this class represents are usually very short-lived.

Once we have these three baskets of connections, it is trivial to
imagine a simple cleanup mechanism to deal with this: take a TCP
transient connection that has been hanging around.

It is debatable whether we want to discriminate between the different
TCP transient connections. Assuming we do FIFO (and the lists allow us
to do just that), a given connection at the head of the list has been
hanging around for the longest. Thus, if we are short on resources, we
might just go ahead and reuse it within the datapath.

This is where we are slowly approaching the question "Why in the world
have you not used a timer wheel or some such?"

The answer is simple: within the above constraints, it does not buy me
much.

Also, a timer wheel creates a leaky abstraction with a
difficult-to-manage corner case. Which corner case?

We have a set of objects (sessions) with an event that may or may not
happen (the idle timeout timer firing), and a necessity to reset the
idle timeout when there is activity on the session.

In the worst case, where we had 10000 one-packet UDP sessions created
just 10 minutes ago, we would need to deal with a spike of 10000
expired timers.

Of course, if we have active traffic on all of these 10000 connections,
then we will not have to deal with that? Right, but we will still have
to deal with canceling and requeueing the timers.

In the best possible case, requeueing a timer is going to be something
along the lines of a linked-list removal and reinsertion.

However, keep in mind we already need to classify the connections for
reuse, so we already have the linked lists!

And if we just check these linked lists periodically in a FIFO fashion,
we can get away with a very simple per-packet operation: writing back
the timestamp of "now" into the connection structure.

Then, rather than requeueing the list entry on a per-packet or
per-frame basis, we can defer this action until the time this session
appears at the head of the FIFO list, and have the cleaning routine
make the decision about whether to discard the session (because the
interval since the last activity is bigger than the idle timeout), or
to requeue the session back to the end of the list (because the last
activity was less than the idle timeout ago).

So, rather than using timers, we can simply reuse our classification
FIFOs, with the following heuristic: do not look at a session that was
enqueued at time X until X+session_timeout. If we enqueue the sessions
in the order of their initial activity, then we can simply use the
enqueue timestamp of the head session as the decision criterion for
when we need to look at it again for timeout purposes.

Since the number of FIFOs is small, we get slightly worse check
performance than with timers, but it is still O(1).
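A minimal sketch of this "lazy timer" scheme, assuming a doubly linked
FIFO per connection class and a per-session last-active timestamp; the
names and the exact structure layout here are illustrative, not the
plugin's actual code:

.. code:: c

   #include <stddef.h>

   typedef struct session
   {
     struct session *next;
     struct session *prev;
     double last_active; /* written back by the datapath on each packet */
     double enqueued_at; /* set when the session is put on this FIFO    */
   } session_t;

   typedef struct
   {
     session_t *head;
     session_t *tail;
     double idle_timeout; /* per-class value, e.g. 600.0 for UDP */
   } session_fifo_t;

   static void
   fifo_enqueue (session_fifo_t * f, session_t * s, double now)
   {
     s->enqueued_at = now;
     s->next = NULL;
     s->prev = f->tail;
     if (f->tail)
       f->tail->next = s;
     else
       f->head = s;
     f->tail = s;
   }

   /* Look only at the head of the FIFO, and only once its enqueue
      timestamp is older than the idle timeout. */
   static void
   check_idle_sessions (session_fifo_t * f, double now,
                        void (*delete_session) (session_t *))
   {
     while (f->head && now > f->head->enqueued_at + f->idle_timeout)
       {
         session_t *s = f->head;
         /* unlink the head */
         f->head = s->next;
         if (f->head)
           f->head->prev = NULL;
         else
           f->tail = NULL;

         if (now - s->last_active > f->idle_timeout)
           delete_session (s);       /* really idle: expire it */
         else
           fifo_enqueue (f, s, now); /* seen activity: requeue to tail */
       }
   }

The per-packet cost stays exactly as described: the datapath only ever
writes last_active; all the list manipulation happens in the cleanup
routine.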
We seemingly do quite a few "useless" operations of requeueing the
items back to the tail of the list - but these are operations we do not
have to do in the active data path, so overall it is a win.

(Diversion: I believe this problem is congruent to poll vs. epoll or
events vs. threads; some reading on this subject:
http://web.archive.org/web/20120225022154/http://sheddingbikes.com/posts/1280829388.html)

We can also run a TCP-like scheme for adaptively changing the wait
period in the routine that deals with the connection timeouts: we can
attempt to check the connections a couple of times per second (the same
rate at which we would advance a timer wheel), and then, if we have
requeued close to a max-per-quantum number of connections, we can halve
the waiting interval, and if we did not requeue any, we can slowly
increase the waiting interval - which at a steady state should
stabilize, similar to what the TCP rate does.

reflexive ACLs: multi-thread
----------------------------

The single-threaded implementation in 1704 used a separate "cleaner"
process to deal with the timing out of the connections. That is all
well and good when you know that there is only a single core to run
everything on, but the existence of the lists proves to be a massive
difficulty when it comes to operating from multiple threads.

An initial study showed that with a few assumptions (e.g. that the
cleaner running in the main thread and the worker have a demarcation
point in time where either one or the other touches the session in the
list) it might be possible to make it work, but the resulting
trickiness of doing it neatly, with all the corner cases, is quite
large.

So, for the multi-threaded scenario, we need to move the connection
aging back to the same CPU as its creation.

Luckily we can do this with the help of interrupts.

So, the design is as follows: the aging thread
(acl_fa_session_cleaner_process) periodically fires interrupts to the
workers' interrupt nodes
(acl_fa_worker_session_cleaner_process_node.index), using
vlib_node_set_interrupt_pending(), and the interrupt node
acl_fa_worker_conn_cleaner_process() calls
acl_fa_check_idle_sessions(), which does the actual job of advancing
the lists. Within the actual datapath the only things we do are putting
the items onto the FIFO and updating the last active time on the
existing connection.
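The signaling part of this could look roughly as follows, assuming the
VPP source tree of that era, where vlib_mains is a vector of per-thread
vlib_main_t pointers. send_cleaner_interrupts is a made-up helper name;
the node index passed in would be
acl_fa_worker_session_cleaner_process_node.index:

.. code:: c

   #include <vlib/vlib.h>

   static void
   send_cleaner_interrupts (u32 worker_cleaner_node_index)
   {
     int i;
     /* thread 0 is the main thread; fire the interrupt node on each
        worker thread */
     for (i = 1; i < vec_len (vlib_mains); i++)
       vlib_node_set_interrupt_pending (vlib_mains[i],
                                        worker_cleaner_node_index);
   }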
The one "delicate" part is that the worker for one leg of the
connection might be different from the worker for another leg of the
connection - but even if the "owner" tries to free the connection,
nothing terrible can happen - in the worst case the element of the pool
(which is nominally free for a short period) will get its timestamp
updated; the same goes for the TCP flags seen.

A slightly trickier issue arises when the packet is initially seen by
one worker (and thus owned by that worker), and the return packet is
processed by another worker, and as a result changes the class of the
connection (e.g. it becomes TCP_ESTABLISHED from TCP_TRANSIENT or vice
versa). If the class changes from the one with the shorter idle time to
the one with the longer idle time, then, unless we are in the
starvation mode where the transient connections are recycled, we can
simply do nothing and let the normal requeue mechanism kick in. If the
class changes from the longer idle timer to the shorter idle timer,
then we risk keeping the connection around for longer than needed,
which will affect resource usage.

One solution to that is to have NxN ring buffers (where N is the number
of workers), such that the non-owner can signal to the owner the
connection# that needs to be requeued out of order.

A simpler solution, though, is to ensure that each FIFO's period is
equal to that of the shortest timer. This way the resource starvation
problem is taken care of, at the expense of some additional work.

This all looks sufficiently nice and simple until a skeleton falls out
of the closet: sometimes we want to clean the connections en masse
before they expire.

There are a few potential scenarios: 1) removal of an ACL from the
interface, 2) removal of an interface, 3) manual action of an operator
(in the future).

In order to tackle this, we need to modify the logic which decides
whether to requeue the connection at the end of the list or to delete
it due to the idle timeout:

We define a point in time, and have each worker thread fast-forward
through its FIFO, in the process looking for sessions that satisfy the
criteria, and either deleting them or requeueing them.

To keep the ease of appearance to the outside world, we still process
this as an event within the connection cleaner thread, but this event
handler does the following:

1) it creates the bitmap of the sw_if_index values requested to be
   cleared;

2) for each worker, it waits until there is no cleanup operation in
   progress, then makes a copy of the bitmap, sets the per-worker flag
   of a cleanup operation, and sends an interrupt;

3) it waits until all cleanup operations have completed.

Within the worker interrupt node, we check if the "cleanup in progress"
flag is set, and if it is, we check the "fast forward time" value. If
it is unset, we initialize it to the value of now, and compare the
requested bitmap of sw_if_index values
(pending_clear_sw_if_index_bitmap) with the bitmap of the sw_if_index
values that this worker deals with.

(We set the bit in that bitmap every time we enqueue a packet onto a
FIFO - serviced_sw_if_index_bitmap in acl_fa_conn_list_add_session.)

If the result of this AND operation is zero, then we can clear the
"cleanup in progress" flag and return. Else we kick off the quantum of
cleanup, and make sure we get another interrupt ASAP if that cleanup
operation returns non-zero, meaning there is more work to do. When that
operation returns zero, everything has been processed; we can clear the
"cleanup in progress" flag and zeroize the bitmap of the sw_if_index
values requested to be cleaned.

The interrupt node signals its wish to receive an interrupt ASAP by
setting the interrupt_is_needed flag within the per-worker structure.
The main thread, while waiting for the cleanup operations to complete,
checks if there is a request for an interrupt, and if there is, it
sends one.

This approach gives us a way to mass-clean the connections which reuses
the code of the regular idle connection cleanup.

One potential inefficiency is the bitmap values set by the session
insertion in the data path - there is nothing to clear them.

So, if one rearranges the interface placement among the workers, then
the cleanups will cause some unnecessary work. For now, we consider
this an acceptable limitation. It can be resolved by having another
per-worker bitmap which, when set, would trigger the cleanup of the
bits in the serviced_sw_if_index_bitmap.
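To recap the per-worker logic described above, here is a plain-C sketch
of the decision made in the worker interrupt node. Single machine words
stand in for the real sw_if_index bitmaps, the "fast forward time"
handling is omitted, and all the names here are illustrative rather
than the plugin's actual ones:

.. code:: c

   #include <stdint.h>

   typedef struct
   {
     int clear_in_progress;         /* set by the main thread (step 2)  */
     uint64_t pending_clear_bitmap; /* copy of the interfaces to clear  */
     uint64_t serviced_bitmap;      /* interfaces this worker enqueued  */
     int interrupt_is_needed;       /* ask main thread for another kick */
   } worker_ctx_t;

   static void
   worker_cleaner_interrupt (worker_ctx_t * w,
                             /* returns non-zero if more work remains */
                             int (*do_cleanup_quantum) (worker_ctx_t *))
   {
     if (!w->clear_in_progress)
       return;                      /* no mass cleanup requested */

     if ((w->pending_clear_bitmap & w->serviced_bitmap) == 0)
       {
         /* none of the requested interfaces are handled here: done */
         w->clear_in_progress = 0;
         return;
       }

     if (do_cleanup_quantum (w))
       w->interrupt_is_needed = 1;  /* more work: request another kick */
     else
       {
         /* everything processed: clear the flag and the request bitmap */
         w->clear_in_progress = 0;
         w->pending_clear_bitmap = 0;
       }
   }

=== the end ===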