Multicore support for ACL plugin {#acl_multicore}
================================

This captures some considerations and design decisions that I have made,
both for my own memory later on ("what the hell was I thinking?!?"),
and for anyone interested to criticize/improve/hack on this code.

One of the factors taken into account while making these decisions
was the relative emphasis on the multi-thread vs. single-thread
use cases: the latter is vastly more prevalent. But
one cannot optimize the single-thread performance without
having functioning code for the multi-thread case.

stateless ACLs
==============

The stateless ACLs trivially parallelize, and the only potential for a
race between the different threads is during reconfiguration,
at the time of replacing the old ACL being checked with
the new ACL.

In case an acl_add_replace is used to replace the rules
within an existing entry, a reallocation of the `am->acls[X].rules`
vector will happen, and potentially a change in count.

acl_match_5tuple() has the following code:

```{.c}
  a = am->acls + acl_index;
  for (i = 0; i < a->count; i++)
  {
    r = a->rules + i;
    . . .
```

Ideally we should be immune from a->rules changing,
but the problem arises if the count changes in flight,
and the new ruleset is smaller - then we will attempt
to "match" against freed memory.

This can(?) be solved by replacing the for() with a while(),
so the comparison against the current count happens at each iteration.
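
A minimal sketch of that variant, reusing the variable names from the
excerpt above (this is an illustration of the idea, not the actual code):

```{.c}
  /* Sketch: re-check the current rule count on every iteration, so a
     ruleset that shrinks in flight stops the walk instead of running
     past the end of the (possibly reallocated) vector. */
  a = am->acls + acl_index;
  i = 0;
  while (i < a->count)
  {
    r = a->rules + i;
    /* . . . match the 5-tuple against rule r . . . */
    i++;
  }
```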

full_acl_match_5tuple(), which iterates over the list
of ACLs, is a bit less immune, since it takes the pointer
to the vector to iterate over and keeps a local copy of
that pointer.

This race can be solved by comparing the
current pointer to the vector with the source pointer,
and seeing if there is an (unlikely) change; if
there is, return the "deny" action or, better,
restart the check.
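
A small sketch of that recheck; the surrounding names (notably the
hypothetical `p_acls` source pointer) are invented for illustration:

```{.c}
  /* Sketch: detect an in-flight replacement of the ACL vector.
     p_acls is a hypothetical pointer to the source vector pointer. */
restart:
  acl_vector = *p_acls;   /* local copy, as in full_acl_match_5tuple() */
  for (i = 0; i < vec_len (acl_vector); i++)
  {
    /* . . . check ACL acl_vector[i] . . . */
    if (acl_vector != *p_acls)
      goto restart;       /* the vector changed under us; or return "deny" */
  }
```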

Since the check reloads the ACL list on a per-packet basis,
there is only a window of opportunity of one packet to
"match" against an incorrect rule set.
The workers also do not change anything, they only read.
Therefore, building special structures
to ensure that it does not happen at all might not be
worth it.

At least not until we have a unit test able to
reliably catch this condition and verify that
the measures applied are effective. Adding code
which is impossible to exercise is worse than
not adding any code at all.

So, I opt for "do-nothing" here for the moment.

reflexive ACLs: single-thread
=============================

Before we talk multi-thread, it is worth revisiting the
design of the reflexive ACLs in the plugin, and
the history of their evolution.

The very first version of the ACL plugin, shipped in
1701, mostly did the job by using the existing components
and gluing them together. Because it needed to work
in the bridged forwarding path only, using the L2 classifier
as an insertion point appeared natural; also, the L2 classifier,
being a table with sessions, seemed like a good place
to hold the sessions.

So, the original design had two conceptual nodes:
one, pointed to by the next_miss from the L2 classifier table,
checked the actual ACL and inserted a session into
the L2 classifier table; the other one, pointed
to by the next_match within the specific session rule,
checked the existing session. The timing out
of the existing connections was done in the datapath,
by periodically calling the aging function.

This decision to use the existing components,
attractive as it was, did bring a few limitations as well:

* L2 classifier is a simple mask-and-value match, with
a fixed mask across the table. So, sanely supporting IPv6
packets with extension headers in that framework was impossible.

* There is no way to get backpressure from the L2 classifier
depending on memory usage. When it runs out of memory,
it simply crashes the box. When does it run out of memory?
We don't really know; it depends on how it allocates it.

* Since we need to match the *reflected* traffic,
we had to create *two* full session entries
in two different directions, which is quite wasteful memory-wise.

* (showstopper): the L2 classifier runs only in
the bridged data path, so supporting the routed data path
would require creating something else entirely different,
which would mean many more headaches support-wise going forward.

Because of that, I moved to a different model: creating
a session 5-tuple from the packet data - once,
and then doing all the matching just on that 5-tuple.

This made it possible to add support for skipping IPv6 extension headers.

Also, this new version started to store the sessions in a dedicated
bihash-per-interface, with the session key data being
aligned for the ingress packets, and being mirrored for the
egress packets. This allows for significant savings in memory,
because now we need to keep only one copy of the session table per
interface instead of two, and also to have only ONE node for all the lookups
(L2/L3 path, in/out, IPv4/IPv6) - significantly reducing the code complexity.

Unfortunately, bihash still has the "lack of backpressure" problem,
in the sense that if you try to insert too many entries and run out
of memory in the heap you supplied, you get a crash.

To somewhat work around that, there is a "maximum tested number of sessions"
value: the plugin tracks the number of sessions currently inserted in the bihash,
and as this number is approached, a more aggressive cleanup
can happen. If this number is reached, two behaviors are possible:

* attempt to do the stateless ACL matching and permit the packet
  if it succeeds

* deny the packet

Currently I have opted for the second one, since it allows for
better-defined behavior; and if you have to permit
the traffic in both directions, why use stateful anyway?
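
A rough sketch of that admission policy at session-creation time; the
names below (`session_count`, `max_sessions`, `request_aggressive_cleanup`)
are invented for this illustration and are not the plugin's actual fields:

```{.c}
/* Illustration of the "deny when full, clean up harder when close" policy. */
void request_aggressive_cleanup (void);

static int
may_create_session (unsigned long session_count, unsigned long max_sessions)
{
  if (session_count >= max_sessions)
    return 0;                        /* limit reached: deny the packet */
  if (session_count > max_sessions - max_sessions / 10)
    request_aggressive_cleanup ();   /* approaching the limit: clean up harder */
  return 1;                          /* ok to create a new session */
}
```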

In order to be able to do the cleanup, we need to discriminate between
the session types, with each session type having its own idle timeout.
In order to do that, we keep three lists, defined in enum acl_timeout_e:
ACL_TIMEOUT_UDP_IDLE, ACL_TIMEOUT_TCP_IDLE, ACL_TIMEOUT_TCP_TRANSIENT.
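
For orientation, the enum looks roughly like this (the exact declaration
in the plugin sources may differ slightly):

```{.c}
typedef enum
{
  ACL_TIMEOUT_UDP_IDLE = 0,      /* all UDP connections */
  ACL_TIMEOUT_TCP_IDLE,          /* established TCP connections */
  ACL_TIMEOUT_TCP_TRANSIENT,     /* all other TCP connections */
} acl_timeout_e;
```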

The first one is hopefully obvious - it is just all UDP connections.
They have an idle timeout of 600 seconds.

The second and third are a bit more subtle. TCP is a complicated protocol,
and we need to tread the fine line between doing too little and doing
too much, and triggering potential compatibility issues because of
being a "middlebox".

I decided to split the TCP connections into two classes:
established, and everything else. "Established" means we have seen
the SYN and ACK from both sides (with PUSH obviously masked out).
This is the "active" state of any TCP connection and we would like
to ensure we do not screw it up. So, the connections in this state
have the default idle timer of 24 hours.

All the rest of the connections have the idle timeout of 2 minutes
(inspired by an old value of MSL), based on the observation
that the states this class represents are usually very short-lived.

Once we have these three baskets of connections, it is trivial to
imagine a simple cleanup mechanism to deal with this: reuse a
TCP transient connection that has been hanging around for a while.

It is debatable whether we want to discriminate between the
different TCP transient connections. Assuming we do FIFO (and
the lists allow us to do just that), it means the connection
at the head of the list has been hanging around the longest.
Thus, if we are short on resources, we might just go ahead and
reuse it within the datapath.

This is where we are slowly approaching the question
"Why in the world have you not used a timer wheel or some such?"

The answer is simple: within the above constraints, it does
not buy me much.

Also, a timer wheel creates a leaky abstraction with a difficult
to manage corner case. Which corner case?

We have a set of objects (sessions) with an event that may
or may not happen (the idle timeout timer firing), and a
necessity to reset the idle timeout when there is
activity on the session.

In the worst case, where we had 10000 one-packet
UDP sessions all created 10 minutes ago, we would need
to deal with a spike of 10000 expired timers.

Of course, if we have active traffic on all
of these 10000 connections, then we will not have
to deal with that, right? True, but we will still have to deal
with canceling and requeueing the timers.

In the best possible case, requeueing a timer is
going to be something along the lines of a linked-list
removal and reinsertion.

However, keep in mind we already need to classify the
connections for reuse, so we already have
the linked lists!

And if we just check these linked lists periodically in
a FIFO fashion, we can get away with a very simple per-packet operation:
writing back the timestamp of "now" into the connection structure.

Then, rather than requeueing within the list on a per-packet or per-frame
basis, we can defer this action until the time this session
appears at the head of the FIFO list, and let the cleaning
routine make the decision about whether to discard
the session (because the interval since the last activity is bigger
than the idle timeout), or to requeue the session back to
the end of the list (because the last activity was less
than an idle timeout ago).
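
A small sketch of that head-of-FIFO decision; all the types and helpers
below are invented for this illustration:

```{.c}
/* Illustration only: expire or requeue the session at the head of a FIFO. */
typedef struct session_s { double last_active_time; } session_t;

session_t *fifo_head (void *fifo);
void delete_session (session_t * sess);
void requeue_at_tail (void *fifo, session_t * sess);

static void
check_fifo_head (void *fifo, double now, double idle_timeout)
{
  session_t *sess = fifo_head (fifo);
  if (!sess)
    return;
  if (now - sess->last_active_time > idle_timeout)
    delete_session (sess);        /* last activity was too long ago: expire */
  else
    requeue_at_tail (fifo, sess); /* recently active: look at it again later */
}
```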

So, rather than using the timers, we can simply reuse our classification
FIFOs, with the following heuristic: do not look at the session that was
enqueued at time X until X+session_timeout. If we enqueue the sessions
in the order of their initial activity, then we can simply use the enqueue
timestamp of the head session as the decision criterion for when we need
to get back to looking at it for the timeout purposes.

Since the number of FIFOs is small, we get a slightly worse check
performance than with timers, but still O(1).

We seemingly do quite a few "useless" operations of requeueing the items
back to the tail of the list - but these are operations we do not
have to do in the active data path, so overall it is a win.

(Diversion: I believe this problem is congruent to poll vs. epoll or
events vs. threads; some reading on this subject:
http://web.archive.org/web/20120225022154/http://sheddingbikes.com/posts/1280829388.html)

We can also run a TCP-like scheme for adaptively changing
the wait period in the routine that deals with the connection timeouts:
we can attempt to check the connections a couple of times per second
(same as we would advance the timer wheel), and then if we have requeued
close to a max-per-quantum number of connections, we can halve the waiting
interval, and if we did not requeue any, we can slowly increase the waiting
interval - which at a steady state should stabilize, similar to what the TCP rate
does.
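
A sketch of that adaptive adjustment; the constants and names are chosen
for illustration only:

```{.c}
/* Illustration: AIMD-style adjustment of the cleaner's waiting interval. */
static double
adjust_wait_interval (double wait_interval, int n_requeued, int max_per_quantum)
{
  if (n_requeued >= max_per_quantum - max_per_quantum / 10)
    wait_interval = wait_interval / 2.0;    /* nearly a full quantum: check more often */
  else if (n_requeued == 0)
    wait_interval = wait_interval + 0.001;  /* nothing requeued: slowly back off */

  /* keep the interval within a sane range (a couple of checks per
     second at the most frequent end) */
  if (wait_interval < 0.001)
    wait_interval = 0.001;
  if (wait_interval > 0.5)
    wait_interval = 0.5;
  return wait_interval;
}
```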

reflexive ACLs: multi-thread
=============================

The single-threaded implementation in 1704 used a separate "cleaner" process
to deal with the timing out of the connections.
That is all good and great when you know there is only a single core
to run everything on, but the existence of the lists proves to be
a massive difficulty when it comes to operating from multiple threads.

Initial study shows that with a few assumptions (e.g. that the cleaner running in the main thread
and the worker have a demarcation point in time where either one or the other touches
the session in the list) it might be possible to make it work, but the resulting
trickiness of doing it neatly, with all the corner cases, is quite large.

So, for the multi-threaded scenario, we need to move the connection
aging back to the same CPU as its creation.

Luckily, we can do this with the help of interrupts.

So, the design is as follows: the aging process (acl_fa_session_cleaner_process)
periodically fires interrupts to the workers' interrupt nodes (acl_fa_worker_session_cleaner_process_node.index),
using vlib_node_set_interrupt_pending(), and
the interrupt node acl_fa_worker_conn_cleaner_process() calls acl_fa_check_idle_sessions(),
which does the actual job of advancing the lists. Within the actual datapath the only things we will be
doing are putting the items onto the FIFO and updating the last-active time on the existing connection.
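
Schematically, the "poke every worker" step might look something like the
sketch below; the exact way of obtaining the per-worker vlib_main_t (here
via the vlib_mains vector) can differ between VPP versions:

```{.c}
  /* Sketch: ask every worker's cleaner interrupt node to run. */
  int i;
  for (i = 0; i < vec_len (vlib_mains); i++)
    {
      if (vlib_mains[i] == 0)
        continue;
      vlib_node_set_interrupt_pending (vlib_mains[i],
                                       acl_fa_worker_session_cleaner_process_node.index);
    }
```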

The one "delicate" part is that the worker for one leg of the connection might be different from
the worker for the other leg of the connection - but even if the "owner" tries to free the connection,
nothing terrible can happen - worst case, the element of the pool (which is nominally free for a short period)
will get the timestamp updated - same thing about the TCP flags seen.

A slightly trickier issue arises when the packet is initially seen by one worker (and thus owned by that worker),
and the return packet is processed by another worker, and as a result changes
the class of the connection (e.g. it becomes TCP_ESTABLISHED instead of TCP_TRANSIENT or vice versa).
If the class changes from the one with the shorter idle time to the one with the longer idle time,
then unless we are in the starvation mode where the transient connections are recycled,
we can simply do nothing and let the normal requeue mechanism kick in. If the class changes from the longer idle
timer to the shorter idle timer, then we risk keeping the connection around for longer than needed, which
will affect the resource usage.

One solution to that is to have NxN ring buffers (where N is the number of workers), such that the non-owner
can signal to the owner the connection# that needs to be requeued out of order.

A simpler solution, though, is to ensure that each FIFO's period is equal to that of the shortest timer.
This way the resource starvation problem is taken care of, at the expense of some additional work.

This all looks sufficiently nice and simple until a skeleton falls out of the closet:
sometimes we want to clean the connections en masse before they expire.

There are a few potential scenarios:
1) removal of an ACL from the interface
2) removal of an interface
3) manual action of an operator (in the future).

In order to tackle this, we need to modify the logic which decides whether to requeue the
connection at the end of the list, or to delete it due to idle timeout:

We define a point in time, and have each worker thread fast-forward through its FIFO,
in the process looking for sessions that satisfy the criteria, and either keeping or requeueing them.

To keep things simple in appearance to the outside world, we still process this as an event
within the connection cleaner process, but this event handler does the following (sketched after the list):
1) it creates the bitmap of the sw_if_index values requested to be cleared;
2) for each worker, it waits until there is no cleanup operation in progress,
then makes a copy of the bitmap, sets the per-worker flag of a cleanup operation, and sends an interrupt;
3) it waits until all cleanup operations have completed.
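
A rough sketch of those three steps; the per-worker structure, its fields
and the helper calls are simplified or invented for this illustration:

```{.c}
  /* Illustration of the event handler described above (not the actual code). */
  uword *clear_bitmap = 0;   /* step 1: bitmap of sw_if_index values to clear */
  /* . . . set the requested bits in clear_bitmap . . . */

  for (i = 0; i < n_workers; i++)         /* step 2: hand the work to each worker */
    {
      worker_data_t *pw = &per_worker_data[i];
      while (pw->cleanup_in_progress)     /* wait out any cleanup already running */
        vlib_process_suspend (vm, 0.0001);
      pw->pending_clear_sw_if_index_bitmap = clib_bitmap_dup (clear_bitmap);
      pw->cleanup_in_progress = 1;
      send_cleaner_interrupt_to_worker (i);
    }

  for (i = 0; i < n_workers; i++)         /* step 3: wait for everyone to finish */
    while (per_worker_data[i].cleanup_in_progress)
      vlib_process_suspend (vm, 0.0001);
```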

Within the worker interrupt node, we check whether "cleanup in progress" is set,
and if it is, we check the "fast forward time" value. If it is unset, we initialize it to the value of now, and compare the
requested bitmap of sw_if_index values (pending_clear_sw_if_index_bitmap) with the bitmap of the sw_if_index values that this worker deals with.

(We set the bit in that bitmap every time we enqueue a packet onto a FIFO - serviced_sw_if_index_bitmap in acl_fa_conn_list_add_session.)

If the result of this AND operation is zero, then we can clear the "cleanup in progress" flag and return.
Otherwise we kick off a quantum of cleanup, and make sure we get another interrupt ASAP if that cleanup operation returns non-zero,
meaning there is more work to do.
When that operation returns zero, everything has been processed: we can clear the "cleanup-in-progress" flag and
zero out the bitmap of sw_if_indexes requested to be cleaned.
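
In pseudo-C, the per-worker interrupt handler for this part could look
roughly as follows; the helpers (`bitmaps_intersect`, `do_cleanup_quantum`,
`bitmap_zero`) and the exact field names are invented for illustration:

```{.c}
  /* Illustration of the per-worker cleanup decision (not the actual code). */
  if (pw->cleanup_in_progress)
    {
      if (pw->fast_forward_time == 0.0)
        pw->fast_forward_time = now;          /* initialize once per cleanup */

      if (!bitmaps_intersect (pw->serviced_sw_if_index_bitmap,
                              pw->pending_clear_sw_if_index_bitmap))
        pw->cleanup_in_progress = 0;          /* none of our interfaces affected */
      else if (do_cleanup_quantum (pw) != 0)
        pw->interrupt_is_needed = 1;          /* more work left: ask for another interrupt */
      else
        {
          /* everything processed: reset the state for the next request */
          pw->cleanup_in_progress = 0;
          bitmap_zero (pw->pending_clear_sw_if_index_bitmap);
        }
    }
```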

The interrupt node signals its wish to receive an interrupt ASAP by setting the interrupt_is_needed
flag within the per-worker structure. The main thread, while waiting for the
cleanup operation to complete, checks whether there is a request for an interrupt,
and if there is, it sends one.

This approach gives us a way to mass-clean the connections which reuses the code of the regular idle
connection cleanup.

One potential inefficiency is the bitmap values set by the session insertion
in the data path - there is nothing to clear them.

So, if one rearranges the placement of interfaces across the workers, then the cleanups will cause some unnecessary work.
For now, we consider it an acceptable limitation. It can be resolved by having another per-worker bitmap, which, when set,
would trigger the cleanup of the bits in the serviced_sw_if_index_bitmap.

=== the end ===
-