From 64ebb5ff1338140d94c7f9ee72138fe84d89de2e Mon Sep 17 00:00:00 2001 From: Chris Luke Date: Wed, 27 Sep 2017 15:09:48 -0400 Subject: General documentation updates - We now have several developer-focused docs, so create an index page for them. - Rework several docs to fit into the index structure. - Experiment with code highlighting; tweak the CSS slightly to make it slightly nicer to look at. Change-Id: I4185a18f84fa0764745ca7a3148276064a3155c6 Signed-off-by: Chris Luke --- src/plugins/acl/acl-plugin.md | 347 ------------------------------------------ 1 file changed, 347 deletions(-) delete mode 100644 src/plugins/acl/acl-plugin.md (limited to 'src/plugins/acl/acl-plugin.md') diff --git a/src/plugins/acl/acl-plugin.md b/src/plugins/acl/acl-plugin.md deleted file mode 100644 index 1b44bca959c..00000000000 --- a/src/plugins/acl/acl-plugin.md +++ /dev/null @@ -1,347 +0,0 @@ -Multicore support for ACL plugin -================================ - -This captures some considerations and design decisions that I have made, -both for my own memory later on ("what the hell was I thinking?!?"), -and for anyone interested to criticize/improve/hack on this code. - -One of the factors taken into account while making these decisions, -was the relative emphasis on the multi-thread vs. single-thread -use cases: the latter is the vastly more prevalent. But, -one can not optimize the single-thread performance without -having a functioning code for multi-thread. - -stateless ACLs -============== - -The stateless trivially parallelizes, and the only potential for the -race between the different threads is during the reconfiguration, -at the time of replacing the old ACL being checked, with -the new ACL. - -In case an acl_add_replace is being used to replace the rules -within the existing entry, a reallocation of am->acls[X].rules -vector will happen and potentially a change in count. - -acl_match_5tuple() has the following code: - - a = am->acls + acl_index; - for (i = 0; i < a->count; i++) - { - r = a->rules + i; - . . . - -Ideally we should be immune from a->rules changing, -but the problem arises if the count changes in flight, -and the new ruleset is smaller - then we will attempt -to "match" against the free memory. - -This can(?) be solved by replacing the for() with while(), -so the comparison happens at each iteration. - -full_acl_match_5tuple(), which iterates over the list -of ACLs, is a bit less immune, since it takes the pointer -to the vector to iterate and keeps a local copy of -that pointer. - -This race can be solved by checking the -current pointer to the vector with the source pointer, -and seeing if there is an (unlikely) change, and if -there is, return the "deny" action, or, better, -restart the check. - -Since the check reloads the ACL list on a per-packet basis, -there is only a window of opportunity of one packet to -"match" packet against an incorrect rule set. -The workers also do not change anything, only read. -Therefore, it looks like building special structures -to ensure that it does not happen at all might be not -worth it. - -At least not until we have a unit-test able to -reliably catch this condition and test that -the measures applied are effective. Adding the code -which is not possible to exercise is worse than -not adding any code at all. - -So, I opt for "do-nothing" here for the moment. - -reflexive ACLs: single-thread -============================= - -Before we talk multi-thread, is worth revisiting the -design of the reflexive ACLs in the plugin, and -the history of their evolution. - -The very first version of the ACL plugin, shipped in -1701, mostly did the job using the existing components -and gluing them together. Because it needed to work -in bridged forwarding path only, using L2 classifier -as an insertion point appeared natural, also L2 classifier, -being a table with sessions, seemed like a good place -to hold the sessions. - -So, the original design had two conceptual nodes: -one, pointed by the next_miss from the L2 classifier table, -was checking the actual ACL, and inserting session into -the L2 classifier table, and the other one, pointed -to by the next_match within the specific session rule, -was checking the existing session. The timing out -of the existing connections was done in the datapath, -by periodically calling the aging function. - -This decision to use the existing components, -with its attrativeness, did bring a few limitations as well: - -* L2 classifier is a simple mask-and-value match, with -a fixed mask across the table. So, sanely supporting IPv6 -packets with extension headers in that framework was impossible. - -* There is no way to get a backpressure from L2 classifier -depending on memory usage. When it runs out of memory, -it simply crashes the box. When it runs out of memory ? -We don't really know. Depends on how it allocates it. - -* Since we need to match the *reflected* traffic, -we had to create *two* full session entries -in two different directions, which is quite wasteful memory-wise. - -* (showstopper): the L2 classifier runs only in -the bridged data path, so supporting routed data path -would require creating something else entirely different, -which would mean much more headaches support-wise going forward. - -Because of that, I have moved to a different model of -creating a session-5-tuple from the packet data - once, -and then doing all the matching just on that 5-tuple. - -This has allowed to add support for skipping IPv6 extension headers. - -Also, this new version started to store the sessions in a dedicated -bihash-per-interface, with the session key data being -aligned for the ingress packets, and being mirrored for the -egress packets. This allows of significant savings in memory, -because now we need to keep only one copy of the session table per -interface instead of two, and also to only have ONE node for all the lookups, -(L2/L3 path, in/out, IPv4/IPv6) - significantly reducing the code complexity. - -Unfortunately, bihash still has the "lack of backpressure" problem, -in a sense that if you try to insert too many entries and run out -of memory in the heap you supplied, you get a crash. - -To somewhat workaround against that, there is a "maximum tested number of sessions" -value, which tracks the currently inserted sessions in the bihash, -and if this number is being approached, a more aggressive cleanup -can happen. If this number is reached, two behaviors are possible: - -* attempt to do the stateless ACL matching and permit the packet - if it succeeds - -* deny the packet - -Currently I have opted for a second one, since it allows for -a better defined behavior, and if you have to permit -the traffic in both directions, why using stateful anyway ? - -In order to be able to do the cleanup, we need to discriminate between -the session types, with each session type having its own idle timeout. -In order to do that, we keep three lists, defined in enum acl_timeout_e: -ACL_TIMEOUT_UDP_IDLE, ACL_TIMEOUT_TCP_IDLE, ACL_TIMEOUT_TCP_TRANSIENT. - -The first one is hopefully obvious - it is just all UDP connections. -They have an idle timeout of 600 seconds. - -The second and third is a bit more subtle. TCP is a complicated protocol, -and we need to tread the fine line between doing too little and doing -too much, and triggering the potential compatibility issues because of -being a "middlebox". - -I decided to split the TCP connections into two classes: -established, and everything else. "Established", means we have seen -the SYN and ACK from both sides (with PUSH obviously masked out). -This is the "active" state of any TCP connection and we would like -to ensure we do not screw it up. So, the connections in this state -have the default idle timer of 24 hours. - -All the rest of the connections have the idle timeout of 2 minutes, -(inspired by an old value of MSL) and based on the observation -that the states this class represent are usually very short lived. - -Once we have these three baskets of connections, it is trivial to -imagine a simple cleanup mechanism to deal with this: take a -TCP transient connection that has been hanging around. - -It is debatable whether we want to do discrimination between the -different TCP transient connections. Assuming we do FIFO (and -the lists allow us to do just that), it means a given connection -on the head of the list has been hanging around for longest. -Thus, if we are short on resources, we might just go ahead and -reuse it within the datapath. - -This is where we are slowly approaching the question -"Why in the world have not you used timer wheel or such ?" - -The answer is simple: within the above constraints, it does -not buy me much. - -Also, timer wheel creates a leaky abstraction with a difficult -to manage corner case. Which corner case ? - -We have a set of objects (sessions) with an event that may -or may not happen (idle timeout timer firing), and a -necessity to reset the idle timeout when there is -activity on the session. - -In the worst case, where we had a 10000 of one-packet -UDP sessions just created 10 minutes ago, we would need -to deal with a spike of 10000 expired timers. - -Of course, if we have the active traffic on all -of these 10000 connections, then we will not have -to deal with that ? Right, but we will still have to deal -with canceling and requeueing the timers. - -In the best possible case, requeueing a timer is -going to be something along the lines of a linked-list -removal and reinsertion. - -However, keep in mind we already need to classify the -connections for reuse, so therefore we already have -the linked lists! - -And if we just check these linked lists periodically in -a FIFO fashion, we can get away with a very simple per-packet operation: -writing back the timestamp of "now" into the connection structure. - -Then rather than requeueing the list on a per-packet or per-frame -basis, we can defer this action until the time this session -appears on the head of the FIFO list, and the cleaning -routine makes the decision about whether to discard -the session (because the interval since last activity is bigger -than the idle timeout), or to requeue the session back to -the end of the list (because the last activity was less -than idle timeout ago). - -So, rather than using the timers, we can simply reuse our classification -FIFOs, with the following heuristic: do not look at the session that was -enqueued at time X until X+session_timeout. If we enqueue the sessions -in the order of their initial activity, then we can simply use enqueue -timestamp of the head session as a decision criterion for when we need -to get back at looking at it for the timeout purposes. - -Since the number of FIFOs is small, we get a slightly worse check -performance than with timers, but still O(1). - -We seemingly do quite a few "useless" operations of requeueing the items -back to the tail of the list - but, these are the operations we do not -have to do in the active data path, so overall it is a win. - -(Diversion: I believe this problem is congruent to poll vs. epoll or -events vs. threads, some reading on this subject: -http://web.archive.org/web/20120225022154/http://sheddingbikes.com/posts/1280829388.html) - -We can also can run a TCP-like scheme for adaptively changing -the wait period in the routine that deals with the connection timeouts: -we can attempt to check the connections a couple of times per second -(same as we would advance the timer wheel), and then if we have requeued -close to a max-per-quantum number of connections, we can half the waiting -interval, and if we did not requeue any, we can slowly increment the waiting -interval - which at a steady state should stabilize similar to what the TCP rate -does. - -reflexive ACLs: multi-thread -============================= - -The single-threaded implementation in 1704 used a separate "cleaner" process -to deal with the timing out of the connections. -It is all good and great when you know that there is only a single core -to run everything on, but the existence of the lists proves to be -a massive difficulty when it comes to operating from multiple threads. - -Initial study shows that with a few assumptions (e.g. that the cleaner running in main thread -and the worker have a demarcation point in time where either one or the other one touches -the session in the list) it might be possible to make it work, but the resulting -trickiness of doing it neatly with all the corner cases is quite large. - -So, for the multi-threaded scenario, we need to move the connection -aging back to the same CPU as its creation. - -Luckily we can do this with the help of the interrupts. - -So, the design is as follows: the aging thread (acl_fa_session_cleaner_process) -periodically fires the interrupts to the workers interrupt nodes (acl_fa_worker_session_cleaner_process_node.index), -using vlib_node_set_interrupt_pending(), and -the interrupt node acl_fa_worker_conn_cleaner_process() calls acl_fa_check_idle_sessions() -which does the actual job of advancing the lists. And within the actual datapath the only thing we will be -doing is putting the items onto FIFO, and updating the last active time on the existing connection. - -The one "delicate" part is that the worker for one leg of the connection might be different from -the worker of another leg of the connection - but, even if the "owner" tries to free the connection, -nothing terrible can happen - worst case the element of the pool (which is nominally free for a short period) -will get the timestamp updated - same thing about the TCP flags seen. - -A slightly trickier issue arises when the packet initially seen by one worker (thus owned by that worker), -and the return packet processed by another worker, and as a result changes the -the class of the connection (e.g. becomes TCP_ESTABLISHED from TCP_TRANSIENT or vice versa). -If the class changes from one with the shorter idle time to the one with the longer idle time, -then unless we are in the starvation mode where the transient connections are recycled, -we can simply do nothing and let the normal requeue mechanism kick in. If the class changes from the longer idle -timer to the shorter idle timer, then we risk keeping the connection around for longer than needed, which -will affect the resource usage. - -One solution to that is to have NxN ring buffers (where N is the number of workers), such that the non-owner -can signal to the owner the connection# that needs to be requeued out of order. - -A simpler solution though, is to ensure that each FIFO's period is equal to that of a shortest timer. -This way the resource starvation problem is taken care of, at an expense of some additional work. - -This all looks sufficiently nice and simple until a skeleton falls out of the closet: -sometimes we want to clean the connections en masse before they expire. - -There few potential scenarios: -1) removal of an ACL from the interface -2) removal of an interface -3) manual action of an operator (in the future). - -In order to tackle this, we need to modify the logic which decides whether to requeue the -connection on the end of the list, or to delete it due to idle timeout: - -We define a point in time, and have each worker thread fast-forward through its FIFO, -in the process looking for sessions that satisfy the criteria, and either keeping them or requeueing them. - -To keep the ease of appearance to the outside world, we still process this as an event -within the connection cleaner thread, but this event handler does as follows: -1) it creates the bitmap of the sw_if_index values requested to be cleared -2) for each worker, it waits to ensure there is no cleanup operation in progress (and if there is one, -it waits), and then makes a copy of the bitmap, sets the per-worker flag of a cleanup operation, and sends an interrupt. -3) wait until all cleanup operations have completed. - -Within the worker interrupt node, we check if the "cleanup in progress" is set, -and if it is, we check the "fast forward time" value. If unset, we initialize it to value now, and compare the -requested bitmap of sw_if_index values (pending_clear_sw_if_index_bitmap) with the bitmap of sw_if_index that this worker deals with. - -(we set the bit in the bitmap every time we enqueue the packet onto a FIFO - serviced_sw_if_index_bitmap in acl_fa_conn_list_add_session). - -If the result of this AND operation is zero - then we can clear the flag of cleanup in progress and return. -Else we kick off the quantum of cleanup, and make sure we get another interrupt ASAP if that cleanup operation returns non-zero, -meaning there is more work to do. -When that operation returns zero, everything has been processed, we can clear the "cleanup-in-progress" flag, and -zeroize the bitmap of sw_if_index-es requested to be cleaned. - -The interrupt node signals its wish to receive an interrupt ASAP by setting interrupt_is_needed -flag within the per-worker structure. The main thread, while waiting for the -cleanup operation to complete, checks if there is a request for interrupt, -and if there is - it sends one. - -This approach gives us a way to mass-clean the connections which is reusing the code of the regular idle -connection cleanup. - -One potential inefficiency is the bitmap values set by the session insertion -in the data path - there is nothing to clear them. - -So, if one rearranges the interface placement with the workers, then the cleanups will cause some unnecessary work. -For now, we consider it an acceptable limitation. It can be resolved by having another per-worker bitmap, which, when set, -would trigger the cleanup of the bits in the serviced_sw_if_index_bitmap). - -=== the end === - -- cgit 1.2.3-korg