diff options
author | Andrew Yourtchenko <ayourtch@gmail.com> | 2017-05-17 21:27:03 +0200 |
---|---|---|
committer | Damjan Marion <dmarion.lists@gmail.com> | 2017-06-07 13:37:46 +0000 |
commit | 5dbfbb7110a52595915acd5ec034f82ce517846a (patch) | |
tree | 1214c5cb0904a5ca3eadba59f587a12af531364a | |
parent | a62699954a9f57c8407ba10d08636c79166f56ed (diff) |
acl-plugin: make the ACL plugin multicore-capable
Add the logic to be able to use stateful ACLs in a multithreaded setup.
Change-Id: I3b0cfa6ca4ea8f46f61648611c3e97b00c3376b6
Signed-off-by: Andrew Yourtchenko <ayourtch@gmail.com>
-rw-r--r-- | src/plugins/acl/acl-plugin.md | 347 | ||||
-rw-r--r-- | src/plugins/acl/acl.c | 39 | ||||
-rw-r--r-- | src/plugins/acl/acl.h | 11 | ||||
-rw-r--r-- | src/plugins/acl/fa_node.c | 646 | ||||
-rw-r--r-- | src/plugins/acl/fa_node.h | 76 |
5 files changed, 912 insertions, 207 deletions
diff --git a/src/plugins/acl/acl-plugin.md b/src/plugins/acl/acl-plugin.md new file mode 100644 index 00000000000..1b44bca959c --- /dev/null +++ b/src/plugins/acl/acl-plugin.md @@ -0,0 +1,347 @@ +Multicore support for ACL plugin +================================ + +This captures some considerations and design decisions that I have made, +both for my own memory later on ("what the hell was I thinking?!?"), +and for anyone interested to criticize/improve/hack on this code. + +One of the factors taken into account while making these decisions, +was the relative emphasis on the multi-thread vs. single-thread +use cases: the latter is the vastly more prevalent. But, +one can not optimize the single-thread performance without +having a functioning code for multi-thread. + +stateless ACLs +============== + +The stateless trivially parallelizes, and the only potential for the +race between the different threads is during the reconfiguration, +at the time of replacing the old ACL being checked, with +the new ACL. + +In case an acl_add_replace is being used to replace the rules +within the existing entry, a reallocation of am->acls[X].rules +vector will happen and potentially a change in count. + +acl_match_5tuple() has the following code: + + a = am->acls + acl_index; + for (i = 0; i < a->count; i++) + { + r = a->rules + i; + . . . + +Ideally we should be immune from a->rules changing, +but the problem arises if the count changes in flight, +and the new ruleset is smaller - then we will attempt +to "match" against the free memory. + +This can(?) be solved by replacing the for() with while(), +so the comparison happens at each iteration. + +full_acl_match_5tuple(), which iterates over the list +of ACLs, is a bit less immune, since it takes the pointer +to the vector to iterate and keeps a local copy of +that pointer. + +This race can be solved by checking the +current pointer to the vector with the source pointer, +and seeing if there is an (unlikely) change, and if +there is, return the "deny" action, or, better, +restart the check. + +Since the check reloads the ACL list on a per-packet basis, +there is only a window of opportunity of one packet to +"match" packet against an incorrect rule set. +The workers also do not change anything, only read. +Therefore, it looks like building special structures +to ensure that it does not happen at all might be not +worth it. + +At least not until we have a unit-test able to +reliably catch this condition and test that +the measures applied are effective. Adding the code +which is not possible to exercise is worse than +not adding any code at all. + +So, I opt for "do-nothing" here for the moment. + +reflexive ACLs: single-thread +============================= + +Before we talk multi-thread, is worth revisiting the +design of the reflexive ACLs in the plugin, and +the history of their evolution. + +The very first version of the ACL plugin, shipped in +1701, mostly did the job using the existing components +and gluing them together. Because it needed to work +in bridged forwarding path only, using L2 classifier +as an insertion point appeared natural, also L2 classifier, +being a table with sessions, seemed like a good place +to hold the sessions. + +So, the original design had two conceptual nodes: +one, pointed by the next_miss from the L2 classifier table, +was checking the actual ACL, and inserting session into +the L2 classifier table, and the other one, pointed +to by the next_match within the specific session rule, +was checking the existing session. The timing out +of the existing connections was done in the datapath, +by periodically calling the aging function. + +This decision to use the existing components, +with its attrativeness, did bring a few limitations as well: + +* L2 classifier is a simple mask-and-value match, with +a fixed mask across the table. So, sanely supporting IPv6 +packets with extension headers in that framework was impossible. + +* There is no way to get a backpressure from L2 classifier +depending on memory usage. When it runs out of memory, +it simply crashes the box. When it runs out of memory ? +We don't really know. Depends on how it allocates it. + +* Since we need to match the *reflected* traffic, +we had to create *two* full session entries +in two different directions, which is quite wasteful memory-wise. + +* (showstopper): the L2 classifier runs only in +the bridged data path, so supporting routed data path +would require creating something else entirely different, +which would mean much more headaches support-wise going forward. + +Because of that, I have moved to a different model of +creating a session-5-tuple from the packet data - once, +and then doing all the matching just on that 5-tuple. + +This has allowed to add support for skipping IPv6 extension headers. + +Also, this new version started to store the sessions in a dedicated +bihash-per-interface, with the session key data being +aligned for the ingress packets, and being mirrored for the +egress packets. This allows of significant savings in memory, +because now we need to keep only one copy of the session table per +interface instead of two, and also to only have ONE node for all the lookups, +(L2/L3 path, in/out, IPv4/IPv6) - significantly reducing the code complexity. + +Unfortunately, bihash still has the "lack of backpressure" problem, +in a sense that if you try to insert too many entries and run out +of memory in the heap you supplied, you get a crash. + +To somewhat workaround against that, there is a "maximum tested number of sessions" +value, which tracks the currently inserted sessions in the bihash, +and if this number is being approached, a more aggressive cleanup +can happen. If this number is reached, two behaviors are possible: + +* attempt to do the stateless ACL matching and permit the packet + if it succeeds + +* deny the packet + +Currently I have opted for a second one, since it allows for +a better defined behavior, and if you have to permit +the traffic in both directions, why using stateful anyway ? + +In order to be able to do the cleanup, we need to discriminate between +the session types, with each session type having its own idle timeout. +In order to do that, we keep three lists, defined in enum acl_timeout_e: +ACL_TIMEOUT_UDP_IDLE, ACL_TIMEOUT_TCP_IDLE, ACL_TIMEOUT_TCP_TRANSIENT. + +The first one is hopefully obvious - it is just all UDP connections. +They have an idle timeout of 600 seconds. + +The second and third is a bit more subtle. TCP is a complicated protocol, +and we need to tread the fine line between doing too little and doing +too much, and triggering the potential compatibility issues because of +being a "middlebox". + +I decided to split the TCP connections into two classes: +established, and everything else. "Established", means we have seen +the SYN and ACK from both sides (with PUSH obviously masked out). +This is the "active" state of any TCP connection and we would like +to ensure we do not screw it up. So, the connections in this state +have the default idle timer of 24 hours. + +All the rest of the connections have the idle timeout of 2 minutes, +(inspired by an old value of MSL) and based on the observation +that the states this class represent are usually very short lived. + +Once we have these three baskets of connections, it is trivial to +imagine a simple cleanup mechanism to deal with this: take a +TCP transient connection that has been hanging around. + +It is debatable whether we want to do discrimination between the +different TCP transient connections. Assuming we do FIFO (and +the lists allow us to do just that), it means a given connection +on the head of the list has been hanging around for longest. +Thus, if we are short on resources, we might just go ahead and +reuse it within the datapath. + +This is where we are slowly approaching the question +"Why in the world have not you used timer wheel or such ?" + +The answer is simple: within the above constraints, it does +not buy me much. + +Also, timer wheel creates a leaky abstraction with a difficult +to manage corner case. Which corner case ? + +We have a set of objects (sessions) with an event that may +or may not happen (idle timeout timer firing), and a +necessity to reset the idle timeout when there is +activity on the session. + +In the worst case, where we had a 10000 of one-packet +UDP sessions just created 10 minutes ago, we would need +to deal with a spike of 10000 expired timers. + +Of course, if we have the active traffic on all +of these 10000 connections, then we will not have +to deal with that ? Right, but we will still have to deal +with canceling and requeueing the timers. + +In the best possible case, requeueing a timer is +going to be something along the lines of a linked-list +removal and reinsertion. + +However, keep in mind we already need to classify the +connections for reuse, so therefore we already have +the linked lists! + +And if we just check these linked lists periodically in +a FIFO fashion, we can get away with a very simple per-packet operation: +writing back the timestamp of "now" into the connection structure. + +Then rather than requeueing the list on a per-packet or per-frame +basis, we can defer this action until the time this session +appears on the head of the FIFO list, and the cleaning +routine makes the decision about whether to discard +the session (because the interval since last activity is bigger +than the idle timeout), or to requeue the session back to +the end of the list (because the last activity was less +than idle timeout ago). + +So, rather than using the timers, we can simply reuse our classification +FIFOs, with the following heuristic: do not look at the session that was +enqueued at time X until X+session_timeout. If we enqueue the sessions +in the order of their initial activity, then we can simply use enqueue +timestamp of the head session as a decision criterion for when we need +to get back at looking at it for the timeout purposes. + +Since the number of FIFOs is small, we get a slightly worse check +performance than with timers, but still O(1). + +We seemingly do quite a few "useless" operations of requeueing the items +back to the tail of the list - but, these are the operations we do not +have to do in the active data path, so overall it is a win. + +(Diversion: I believe this problem is congruent to poll vs. epoll or +events vs. threads, some reading on this subject: +http://web.archive.org/web/20120225022154/http://sheddingbikes.com/posts/1280829388.html) + +We can also can run a TCP-like scheme for adaptively changing +the wait period in the routine that deals with the connection timeouts: +we can attempt to check the connections a couple of times per second +(same as we would advance the timer wheel), and then if we have requeued +close to a max-per-quantum number of connections, we can half the waiting +interval, and if we did not requeue any, we can slowly increment the waiting +interval - which at a steady state should stabilize similar to what the TCP rate +does. + +reflexive ACLs: multi-thread +============================= + +The single-threaded implementation in 1704 used a separate "cleaner" process +to deal with the timing out of the connections. +It is all good and great when you know that there is only a single core +to run everything on, but the existence of the lists proves to be +a massive difficulty when it comes to operating from multiple threads. + +Initial study shows that with a few assumptions (e.g. that the cleaner running in main thread +and the worker have a demarcation point in time where either one or the other one touches +the session in the list) it might be possible to make it work, but the resulting +trickiness of doing it neatly with all the corner cases is quite large. + +So, for the multi-threaded scenario, we need to move the connection +aging back to the same CPU as its creation. + +Luckily we can do this with the help of the interrupts. + +So, the design is as follows: the aging thread (acl_fa_session_cleaner_process) +periodically fires the interrupts to the workers interrupt nodes (acl_fa_worker_session_cleaner_process_node.index), +using vlib_node_set_interrupt_pending(), and +the interrupt node acl_fa_worker_conn_cleaner_process() calls acl_fa_check_idle_sessions() +which does the actual job of advancing the lists. And within the actual datapath the only thing we will be +doing is putting the items onto FIFO, and updating the last active time on the existing connection. + +The one "delicate" part is that the worker for one leg of the connection might be different from +the worker of another leg of the connection - but, even if the "owner" tries to free the connection, +nothing terrible can happen - worst case the element of the pool (which is nominally free for a short period) +will get the timestamp updated - same thing about the TCP flags seen. + +A slightly trickier issue arises when the packet initially seen by one worker (thus owned by that worker), +and the return packet processed by another worker, and as a result changes the +the class of the connection (e.g. becomes TCP_ESTABLISHED from TCP_TRANSIENT or vice versa). +If the class changes from one with the shorter idle time to the one with the longer idle time, +then unless we are in the starvation mode where the transient connections are recycled, +we can simply do nothing and let the normal requeue mechanism kick in. If the class changes from the longer idle +timer to the shorter idle timer, then we risk keeping the connection around for longer than needed, which +will affect the resource usage. + +One solution to that is to have NxN ring buffers (where N is the number of workers), such that the non-owner +can signal to the owner the connection# that needs to be requeued out of order. + +A simpler solution though, is to ensure that each FIFO's period is equal to that of a shortest timer. +This way the resource starvation problem is taken care of, at an expense of some additional work. + +This all looks sufficiently nice and simple until a skeleton falls out of the closet: +sometimes we want to clean the connections en masse before they expire. + +There few potential scenarios: +1) removal of an ACL from the interface +2) removal of an interface +3) manual action of an operator (in the future). + +In order to tackle this, we need to modify the logic which decides whether to requeue the +connection on the end of the list, or to delete it due to idle timeout: + +We define a point in time, and have each worker thread fast-forward through its FIFO, +in the process looking for sessions that satisfy the criteria, and either keeping them or requeueing them. + +To keep the ease of appearance to the outside world, we still process this as an event +within the connection cleaner thread, but this event handler does as follows: +1) it creates the bitmap of the sw_if_index values requested to be cleared +2) for each worker, it waits to ensure there is no cleanup operation in progress (and if there is one, +it waits), and then makes a copy of the bitmap, sets the per-worker flag of a cleanup operation, and sends an interrupt. +3) wait until all cleanup operations have completed. + +Within the worker interrupt node, we check if the "cleanup in progress" is set, +and if it is, we check the "fast forward time" value. If unset, we initialize it to value now, and compare the +requested bitmap of sw_if_index values (pending_clear_sw_if_index_bitmap) with the bitmap of sw_if_index that this worker deals with. + +(we set the bit in the bitmap every time we enqueue the packet onto a FIFO - serviced_sw_if_index_bitmap in acl_fa_conn_list_add_session). + +If the result of this AND operation is zero - then we can clear the flag of cleanup in progress and return. +Else we kick off the quantum of cleanup, and make sure we get another interrupt ASAP if that cleanup operation returns non-zero, +meaning there is more work to do. +When that operation returns zero, everything has been processed, we can clear the "cleanup-in-progress" flag, and +zeroize the bitmap of sw_if_index-es requested to be cleaned. + +The interrupt node signals its wish to receive an interrupt ASAP by setting interrupt_is_needed +flag within the per-worker structure. The main thread, while waiting for the +cleanup operation to complete, checks if there is a request for interrupt, +and if there is - it sends one. + +This approach gives us a way to mass-clean the connections which is reusing the code of the regular idle +connection cleanup. + +One potential inefficiency is the bitmap values set by the session insertion +in the data path - there is nothing to clear them. + +So, if one rearranges the interface placement with the workers, then the cleanups will cause some unnecessary work. +For now, we consider it an acceptable limitation. It can be resolved by having another per-worker bitmap, which, when set, +would trigger the cleanup of the bits in the serviced_sw_if_index_bitmap). + +=== the end === + diff --git a/src/plugins/acl/acl.c b/src/plugins/acl/acl.c index 6b79411e771..84ed7af6ed3 100644 --- a/src/plugins/acl/acl.c +++ b/src/plugins/acl/acl.c @@ -1785,6 +1785,7 @@ done: return error; } + static clib_error_t * acl_show_aclplugin_fn (vlib_main_t * vm, unformat_input_t * input, @@ -1799,6 +1800,7 @@ acl_show_aclplugin_fn (vlib_main_t * vm, if (unformat (input, "sessions")) { u8 * out0 = 0; + u16 wk; pool_foreach (swif, im->sw_interfaces, ({ u32 sw_if_index = swif->sw_if_index; @@ -1806,6 +1808,24 @@ acl_show_aclplugin_fn (vlib_main_t * vm, u64 n_dels = sw_if_index < vec_len(am->fa_session_dels_by_sw_if_index) ? am->fa_session_dels_by_sw_if_index[sw_if_index] : 0; out0 = format(out0, "sw_if_index %d: add %lu - del %lu = %lu\n", sw_if_index, n_adds, n_dels, n_adds - n_dels); })); + out0 = format(out0, "\n\nPer-worker data:\n"); + for (wk = 0; wk < vec_len (am->per_worker_data); wk++) { + acl_fa_per_worker_data_t *pw = &am->per_worker_data[wk]; + out0 = format(out0, "Worker #%d:\n", wk); + out0 = format(out0, " Next expiry time: %lu\n", pw->next_expiry_time); + out0 = format(out0, " Requeue until time: %lu\n", pw->requeue_until_time); + out0 = format(out0, " Current time wait interval: %lu\n", pw->current_time_wait_interval); + out0 = format(out0, " Count of deleted sessions: %lu\n", pw->cnt_deleted_sessions); + out0 = format(out0, " Delete already deleted: %lu\n", pw->cnt_already_deleted_sessions); + out0 = format(out0, " Session timers restarted: %lu\n", pw->cnt_session_timer_restarted); + out0 = format(out0, " Swipe until this time: %lu\n", pw->swipe_end_time); + out0 = format(out0, " sw_if_index serviced bitmap: %U\n", format_bitmap_hex, pw->serviced_sw_if_index_bitmap); + out0 = format(out0, " pending clear intfc bitmap : %U\n", format_bitmap_hex, pw->pending_clear_sw_if_index_bitmap); + out0 = format(out0, " clear in progress: %u\n", pw->clear_in_process); + out0 = format(out0, " interrupt is pending: %d\n", pw->interrupt_is_pending); + out0 = format(out0, " interrupt is needed: %d\n", pw->interrupt_is_needed); + out0 = format(out0, " interrupt is unwanted: %d\n", pw->interrupt_is_unwanted); + } out0 = format(out0, "\n\nConn cleaner thread counters:\n"); #define _(cnt, desc) out0 = format(out0, " %20lu: %s\n", am->cnt, desc); foreach_fa_cleaner_counter; @@ -1867,14 +1887,24 @@ acl_init (vlib_main_t * vm) am->fa_conn_table_hash_num_buckets = ACL_FA_CONN_TABLE_DEFAULT_HASH_NUM_BUCKETS; am->fa_conn_table_hash_memory_size = ACL_FA_CONN_TABLE_DEFAULT_HASH_MEMORY_SIZE; am->fa_conn_table_max_entries = ACL_FA_CONN_TABLE_DEFAULT_MAX_ENTRIES; - + vlib_thread_main_t *tm = vlib_get_thread_main (); + // vec_validate(am->per_worker_data, os_get_nthreads()-1); + vec_validate(am->per_worker_data, tm->n_vlib_mains-1); + clib_warning("ACL_FA_INIT: per-worker len: %d", vec_len(am->per_worker_data)); { + u16 wk; u8 tt; - for(tt = 0; tt < ACL_N_TIMEOUTS; tt++) { - am->fa_conn_list_head[tt] = ~0; - am->fa_conn_list_tail[tt] = ~0; + for (wk = 0; wk < vec_len (am->per_worker_data); wk++) { + acl_fa_per_worker_data_t *pw = &am->per_worker_data[wk]; + vec_validate(pw->fa_conn_list_head, ACL_N_TIMEOUTS-1); + vec_validate(pw->fa_conn_list_tail, ACL_N_TIMEOUTS-1); + for(tt = 0; tt < ACL_N_TIMEOUTS; tt++) { + pw->fa_conn_list_head[tt] = ~0; + pw->fa_conn_list_tail[tt] = ~0; + } } } + clib_warning("ACL_FA_INIT-DONE: per-worker len: %d", vec_len(am->per_worker_data)); am->fa_min_deleted_sessions_per_interval = ACL_FA_DEFAULT_MIN_DELETED_SESSIONS_PER_INTERVAL; am->fa_max_deleted_sessions_per_interval = ACL_FA_DEFAULT_MAX_DELETED_SESSIONS_PER_INTERVAL; @@ -1883,7 +1913,6 @@ acl_init (vlib_main_t * vm) am->fa_cleaner_cnt_delete_by_sw_index = 0; am->fa_cleaner_cnt_delete_by_sw_index_ok = 0; am->fa_cleaner_cnt_unknown_event = 0; - am->fa_cleaner_cnt_deleted_sessions = 0; am->fa_cleaner_cnt_timer_restarted = 0; am->fa_cleaner_cnt_wait_with_timeout = 0; diff --git a/src/plugins/acl/acl.h b/src/plugins/acl/acl.h index eb074a7b6a2..e35e0ea753a 100644 --- a/src/plugins/acl/acl.h +++ b/src/plugins/acl/acl.h @@ -138,9 +138,7 @@ typedef struct { /* bitmap, when set the hash is initialized */ uword *fa_sessions_on_sw_if_index; clib_bihash_40_8_t *fa_sessions_by_sw_if_index; - /* pool for FA session data. See fa_node.h */ - fa_session_t *fa_sessions_pool; - /* The process node which is responsible to deleting the sessions */ + /* The process node which orcherstrates the cleanup */ u32 fa_cleaner_node_index; /* FA session timeouts, in seconds */ u32 session_timeout_sec[ACL_N_TIMEOUTS]; @@ -192,8 +190,9 @@ typedef struct { f64 fa_cleaner_wait_time_increment; u64 fa_current_cleaner_timer_wait_interval; - u32 fa_conn_list_head[ACL_N_TIMEOUTS]; - u32 fa_conn_list_tail[ACL_N_TIMEOUTS]; + + /* per-worker data related t conn management */ + acl_fa_per_worker_data_t *per_worker_data; /* Configured session timeout */ u64 session_timeout[ACL_N_TIMEOUTS]; @@ -205,12 +204,10 @@ typedef struct { _(fa_cleaner_cnt_delete_by_sw_index, "delete_by_sw_index events") \ _(fa_cleaner_cnt_delete_by_sw_index_ok, "delete_by_sw_index handled ok") \ _(fa_cleaner_cnt_unknown_event, "unknown events received") \ - _(fa_cleaner_cnt_deleted_sessions, "sessions deleted") \ _(fa_cleaner_cnt_timer_restarted, "session idle timers restarted") \ _(fa_cleaner_cnt_wait_with_timeout, "event wait with timeout called") \ _(fa_cleaner_cnt_wait_without_timeout, "event wait w/o timeout called") \ _(fa_cleaner_cnt_event_cycles, "total event cycles") \ - _(fa_cleaner_cnt_already_deleted, "try to delete already deleted conn") \ /* end of counters */ #define _(id, desc) u32 id; foreach_fa_cleaner_counter diff --git a/src/plugins/acl/fa_node.c b/src/plugins/acl/fa_node.c index c71429e76c6..c6059aae76a 100644 --- a/src/plugins/acl/fa_node.c +++ b/src/plugins/acl/fa_node.c @@ -543,6 +543,39 @@ fa_session_get_timeout_type (acl_main_t * am, fa_session_t * sess) static u64 +fa_session_get_shortest_timeout(acl_main_t * am) +{ + int timeout_type; + u64 timeout = ~0LL; + for(timeout_type = 0; timeout_type < ACL_N_TIMEOUTS; timeout_type++) { + if (timeout > am->session_timeout_sec[timeout_type]) { + timeout = am->session_timeout_sec[timeout_type]; + } + } + return timeout; +} + +/* + * Get the timeout of the session in a list since its enqueue time. + */ + +static u64 +fa_session_get_list_timeout (acl_main_t * am, fa_session_t * sess) +{ + u64 timeout = am->vlib_main->clib_time.clocks_per_second; + /* + * we have the shortest possible timeout type in all the lists + * (see README-multicore for the rationale) + */ + timeout *= fa_session_get_shortest_timeout(am); + return timeout; +} + +/* + * Get the idle timeout of a session. + */ + +static u64 fa_session_get_timeout (acl_main_t * am, fa_session_t * sess) { u64 timeout = am->vlib_main->clib_time.clocks_per_second; @@ -554,6 +587,7 @@ fa_session_get_timeout (acl_main_t * am, fa_session_t * sess) static void acl_fa_ifc_init_sessions (acl_main_t * am, int sw_if_index0) { + /// FIXME-MULTICORE: lock around this function #ifdef FA_NODE_VERBOSE_DEBUG clib_warning ("Initializing bihash for sw_if_index %d num buckets %lu memory size %llu", @@ -569,68 +603,97 @@ acl_fa_ifc_init_sessions (acl_main_t * am, int sw_if_index0) clib_bitmap_set (am->fa_sessions_on_sw_if_index, sw_if_index0, 1); } +static inline fa_session_t *get_session_ptr(acl_main_t *am, u16 thread_index, u32 session_index) +{ + acl_fa_per_worker_data_t *pw = &am->per_worker_data[thread_index]; + fa_session_t *sess = pw->fa_sessions_pool + session_index; + return sess; +} + static void -acl_fa_conn_list_add_session (acl_main_t * am, u32 sess_id, u64 now) +acl_fa_conn_list_add_session (acl_main_t * am, fa_full_session_id_t sess_id, u64 now) { - fa_session_t *sess = am->fa_sessions_pool + sess_id; + fa_session_t *sess = get_session_ptr(am, sess_id.thread_index, sess_id.session_index); u8 list_id = fa_session_get_timeout_type(am, sess); + uword thread_index = os_get_thread_index (); + acl_fa_per_worker_data_t *pw = &am->per_worker_data[thread_index]; + /* the retrieved session thread index must be necessarily the same as the one in the key */ + ASSERT (sess->thread_index == sess_id.thread_index); + /* the retrieved session thread index must be the same as current thread */ + ASSERT (sess->thread_index == thread_index); sess->link_enqueue_time = now; sess->link_list_id = list_id; sess->link_next_idx = ~0; - sess->link_prev_idx = am->fa_conn_list_tail[list_id]; - if (~0 != am->fa_conn_list_tail[list_id]) { - fa_session_t *prev_sess = am->fa_sessions_pool + am->fa_conn_list_tail[list_id]; - prev_sess->link_next_idx = sess_id; + sess->link_prev_idx = pw->fa_conn_list_tail[list_id]; + if (~0 != pw->fa_conn_list_tail[list_id]) { + fa_session_t *prev_sess = get_session_ptr(am, thread_index, pw->fa_conn_list_tail[list_id]); + prev_sess->link_next_idx = sess_id.session_index; + /* We should never try to link with a session on another thread */ + ASSERT(prev_sess->thread_index == sess->thread_index); } - am->fa_conn_list_tail[list_id] = sess_id; + pw->fa_conn_list_tail[list_id] = sess_id.session_index; + pw->serviced_sw_if_index_bitmap = clib_bitmap_set(pw->serviced_sw_if_index_bitmap, sess->sw_if_index, 1); - if (~0 == am->fa_conn_list_head[list_id]) { - am->fa_conn_list_head[list_id] = sess_id; - /* If it is a first conn in any list, kick off the cleaner */ + if (~0 == pw->fa_conn_list_head[list_id]) { + pw->fa_conn_list_head[list_id] = sess_id.session_index; + /* If it is a first conn in any list, kick the cleaner */ vlib_process_signal_event (am->vlib_main, am->fa_cleaner_node_index, ACL_FA_CLEANER_RESCHEDULE, 0); - } } -static void -acl_fa_conn_list_delete_session (acl_main_t *am, u32 sess_id) +static int +acl_fa_conn_list_delete_session (acl_main_t *am, fa_full_session_id_t sess_id) { - fa_session_t *sess = am->fa_sessions_pool + sess_id; + uword thread_index = os_get_thread_index (); + acl_fa_per_worker_data_t *pw = &am->per_worker_data[thread_index]; + if (thread_index != sess_id.thread_index) { + /* If another thread attempts to delete the session, fail it. */ +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning("thread id in key %d != curr thread index, not deleting"); +#endif + return 0; + } + fa_session_t *sess = get_session_ptr(am, sess_id.thread_index, sess_id.session_index); + /* we should never try to delete the session with another thread index */ + ASSERT(sess->thread_index == thread_index); if (~0 != sess->link_prev_idx) { - fa_session_t *prev_sess = am->fa_sessions_pool + sess->link_prev_idx; + fa_session_t *prev_sess = get_session_ptr(am, thread_index, sess->link_prev_idx); + /* the previous session must be in the same list as this one */ + ASSERT(prev_sess->link_list_id == sess->link_list_id); prev_sess->link_next_idx = sess->link_next_idx; - if (prev_sess->link_list_id != sess->link_list_id) - clib_warning("(prev_sess->link_list_id != sess->link_list_id)"); } if (~0 != sess->link_next_idx) { - fa_session_t *next_sess = am->fa_sessions_pool + sess->link_next_idx; + fa_session_t *next_sess = get_session_ptr(am, thread_index, sess->link_next_idx); + /* The next session must be in the same list as the one we are deleting */ + ASSERT(next_sess->link_list_id == sess->link_list_id); next_sess->link_prev_idx = sess->link_prev_idx; - if (next_sess->link_list_id != sess->link_list_id) - clib_warning("(next_sess->link_list_id != sess->link_list_id)"); } - if (am->fa_conn_list_head[sess->link_list_id] == sess_id) { - am->fa_conn_list_head[sess->link_list_id] = sess->link_next_idx; + if (pw->fa_conn_list_head[sess->link_list_id] == sess_id.session_index) { + pw->fa_conn_list_head[sess->link_list_id] = sess->link_next_idx; } - if (am->fa_conn_list_tail[sess->link_list_id] == sess_id) { - am->fa_conn_list_tail[sess->link_list_id] = sess->link_prev_idx; + if (pw->fa_conn_list_tail[sess->link_list_id] == sess_id.session_index) { + pw->fa_conn_list_tail[sess->link_list_id] = sess->link_prev_idx; } + return 1; } - -int -acl_fa_session_is_dead (acl_main_t * am, u32 sw_if_index, u64 now, - u32 sess_id) -{ - return 0; -} - -static void -acl_fa_restart_timer_for_session (acl_main_t * am, u64 now, u32 sess_id) +static int +acl_fa_restart_timer_for_session (acl_main_t * am, u64 now, fa_full_session_id_t sess_id) { - // fa_session_t *sess = am->fa_sessions_pool + sess_id; - acl_fa_conn_list_delete_session(am, sess_id); - acl_fa_conn_list_add_session(am, sess_id, now); + if (acl_fa_conn_list_delete_session(am, sess_id)) { + acl_fa_conn_list_add_session(am, sess_id, now); + return 1; + } else { + /* + * Our thread does not own this connection, so we can not delete + * The session. To avoid the complicated signaling, we simply + * pick the list waiting time to be the shortest of the timeouts. + * This way we do not have to do anything special, and let + * the regular requeue check take care of everything. + */ + return 0; + } } @@ -648,13 +711,16 @@ acl_fa_track_session (acl_main_t * am, int is_input, u32 sw_if_index, u64 now, static void -acl_fa_delete_session (acl_main_t * am, u32 sw_if_index, u32 sess_id) +acl_fa_delete_session (acl_main_t * am, u32 sw_if_index, fa_full_session_id_t sess_id) { - fa_session_t *sess = (fa_session_t *) am->fa_sessions_pool + sess_id; + fa_session_t *sess = get_session_ptr(am, sess_id.thread_index, sess_id.session_index); + ASSERT(sess->thread_index == os_get_thread_index ()); BV (clib_bihash_add_del) (&am->fa_sessions_by_sw_if_index[sw_if_index], &sess->info.kv, 0); - pool_put_index (am->fa_sessions_pool, sess_id); - /* Deleting from timer wheel not needed, as the cleaner deals with the timers. */ + acl_fa_per_worker_data_t *pw = &am->per_worker_data[sess_id.thread_index]; + pool_put_index (pw->fa_sessions_pool, sess_id.session_index); + /* Deleting from timer structures not needed, + as the caller must have dealt with the timers. */ vec_validate (am->fa_session_dels_by_sw_if_index, sw_if_index); am->fa_session_dels_by_sw_if_index[sw_if_index]++; } @@ -671,13 +737,114 @@ acl_fa_can_add_session (acl_main_t * am, int is_input, u32 sw_if_index) return (curr_sess < am->fa_conn_table_max_entries); } +static u64 +acl_fa_get_list_head_expiry_time(acl_main_t *am, acl_fa_per_worker_data_t *pw, u64 now, u16 thread_index, int timeout_type) +{ + if (~0 == pw->fa_conn_list_head[timeout_type]) { + return ~0LL; // infinity. + } else { + fa_session_t *sess = get_session_ptr(am, thread_index, pw->fa_conn_list_head[timeout_type]); + u64 timeout_time = + sess->link_enqueue_time + fa_session_get_list_timeout (am, sess); + return timeout_time; + } +} + +static int +acl_fa_conn_time_to_check (acl_main_t *am, acl_fa_per_worker_data_t *pw, u64 now, u16 thread_index, u32 session_index) +{ + fa_session_t *sess = get_session_ptr(am, thread_index, session_index); + u64 timeout_time = + sess->link_enqueue_time + fa_session_get_list_timeout (am, sess); + return (timeout_time < now) || (sess->link_enqueue_time <= pw->swipe_end_time); +} + +/* + * see if there are sessions ready to be checked, + * do the maintenance (requeue or delete), and + * return the total number of sessions reclaimed. + */ +static int +acl_fa_check_idle_sessions(acl_main_t *am, u16 thread_index, u64 now) +{ + acl_fa_per_worker_data_t *pw = &am->per_worker_data[thread_index]; + fa_full_session_id_t fsid; + fsid.thread_index = thread_index; + int total_expired = 0; + + { + u8 tt = 0; + for(tt = 0; tt < ACL_N_TIMEOUTS; tt++) { + while((vec_len(pw->expired) < am->fa_max_deleted_sessions_per_interval) + && (~0 != pw->fa_conn_list_head[tt]) + && (acl_fa_conn_time_to_check(am, pw, now, thread_index, + pw->fa_conn_list_head[tt]))) { + fsid.session_index = pw->fa_conn_list_head[tt]; + vec_add1(pw->expired, fsid.session_index); + acl_fa_conn_list_delete_session(am, fsid); + } + } + } + + u32 *psid = NULL; + vec_foreach (psid, pw->expired) + { + fsid.session_index = *psid; + if (!pool_is_free_index (pw->fa_sessions_pool, fsid.session_index)) + { + fa_session_t *sess = get_session_ptr(am, thread_index, fsid.session_index); + u32 sw_if_index = sess->sw_if_index; + u64 sess_timeout_time = + sess->last_active_time + fa_session_get_timeout (am, sess); + if ((now < sess_timeout_time) && (0 == clib_bitmap_get(pw->pending_clear_sw_if_index_bitmap, sw_if_index))) + { +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning ("ACL_FA_NODE_CLEAN: Restarting timer for session %d", + (int) session_index); +#endif + /* There was activity on the session, so the idle timeout + has not passed. Enqueue for another time period. */ + + acl_fa_conn_list_add_session(am, fsid, now); + pw->cnt_session_timer_restarted++; + } + else + { +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning ("ACL_FA_NODE_CLEAN: Deleting session %d", + (int) session_index); +#endif + acl_fa_delete_session (am, sw_if_index, fsid); + pw->cnt_deleted_sessions++; + } + } + else + { + pw->cnt_already_deleted_sessions++; + } + } + total_expired = vec_len(pw->expired); + /* zero out the vector which we have acted on */ + if (pw->expired) + _vec_len (pw->expired) = 0; + /* if we were advancing and reached the end + * (no more sessions to recycle), reset the fast-forward timestamp */ + + if (pw->swipe_end_time && 0 == total_expired) + pw->swipe_end_time = 0; + return (total_expired); +} + always_inline void -acl_fa_try_recycle_session (acl_main_t * am, int is_input, u32 sw_if_index) +acl_fa_try_recycle_session (acl_main_t * am, int is_input, u16 thread_index, u32 sw_if_index) { /* try to recycle a TCP transient session */ + acl_fa_per_worker_data_t *pw = &am->per_worker_data[thread_index]; u8 timeout_type = ACL_TIMEOUT_TCP_TRANSIENT; - u32 sess_id = am->fa_conn_list_head[timeout_type]; - if (~0 != sess_id) { + fa_full_session_id_t sess_id; + sess_id.session_index = pw->fa_conn_list_head[timeout_type]; + if (~0 != sess_id.session_index) { + sess_id.thread_index = thread_index; acl_fa_conn_list_delete_session(am, sess_id); acl_fa_delete_session(am, sw_if_index, sess_id); } @@ -689,25 +856,28 @@ acl_fa_add_session (acl_main_t * am, int is_input, u32 sw_if_index, u64 now, { clib_bihash_kv_40_8_t *pkv = &p5tuple->kv; clib_bihash_kv_40_8_t kv; - u32 sess_id; - fa_session_t *sess; + fa_full_session_id_t f_sess_id; + uword thread_index = os_get_thread_index(); + acl_fa_per_worker_data_t *pw = &am->per_worker_data[thread_index]; - pool_get (am->fa_sessions_pool, sess); - sess_id = sess - am->fa_sessions_pool; + f_sess_id.thread_index = thread_index; + fa_session_t *sess; + pool_get_aligned (pw->fa_sessions_pool, sess, CLIB_CACHE_LINE_BYTES); + f_sess_id.session_index = sess - pw->fa_sessions_pool; kv.key[0] = pkv->key[0]; kv.key[1] = pkv->key[1]; kv.key[2] = pkv->key[2]; kv.key[3] = pkv->key[3]; kv.key[4] = pkv->key[4]; - kv.value = sess_id; + kv.value = f_sess_id.as_u64; memcpy (sess, pkv, sizeof (pkv->key)); sess->last_active_time = now; sess->sw_if_index = sw_if_index; sess->tcp_flags_seen.as_u16 = 0; - sess->reserved1 = 0; + sess->thread_index = thread_index; sess->link_list_id = ~0; sess->link_prev_idx = ~0; sess->link_next_idx = ~0; @@ -721,7 +891,7 @@ acl_fa_add_session (acl_main_t * am, int is_input, u32 sw_if_index, u64 now, BV (clib_bihash_add_del) (&am->fa_sessions_by_sw_if_index[sw_if_index], &kv, 1); - acl_fa_conn_list_add_session(am, sess_id, now); + acl_fa_conn_list_add_session(am, f_sess_id, now); vec_validate (am->fa_session_adds_by_sw_if_index, sw_if_index); am->fa_session_adds_by_sw_if_index[sw_if_index]++; @@ -757,6 +927,7 @@ acl_fa_node_fn (vlib_main_t * vm, clib_bihash_kv_40_8_t value_sess; vlib_node_runtime_t *error_node; u64 now = clib_cpu_time_now (); + uword thread_index = os_get_thread_index (); from = vlib_frame_vector_args (frame); n_left_from = frame->n_vectors; @@ -827,16 +998,19 @@ acl_fa_node_fn (vlib_main_t * vm, { trace_bitmap |= 0x80000000; error0 = ACL_FA_ERROR_ACL_EXIST_SESSION; - // FIXME assert(value_sess.value == (0xffffffff & value_sess.value)); - u32 sess_id = value_sess.value; - fa_session_t *sess = am->fa_sessions_pool + sess_id; + fa_full_session_id_t f_sess_id; + + f_sess_id.as_u64 = value_sess.value; + ASSERT(f_sess_id.thread_index < vec_len(vlib_mains)); + + fa_session_t *sess = get_session_ptr(am, f_sess_id.thread_index, f_sess_id.session_index); int old_timeout_type = fa_session_get_timeout_type (am, sess); action = acl_fa_track_session (am, is_input, sw_if_index0, now, sess, &fa_5tuple); /* expose the session id to the tracer */ - match_rule_index = sess_id; + match_rule_index = f_sess_id.session_index; int new_timeout_type = fa_session_get_timeout_type (am, sess); acl_check_needed = 0; @@ -844,7 +1018,7 @@ acl_fa_node_fn (vlib_main_t * vm, /* Tracking might have changed the session timeout type, e.g. from transient to established */ if (PREDICT_FALSE (old_timeout_type != new_timeout_type)) { - acl_fa_restart_timer_for_session (am, now, sess_id); + acl_fa_restart_timer_for_session (am, now, f_sess_id); pkts_restart_session_timer++; trace_bitmap |= 0x00010000 + ((0xff & old_timeout_type) << 8) + @@ -865,7 +1039,7 @@ acl_fa_node_fn (vlib_main_t * vm, if (2 == action) { if (!acl_fa_can_add_session (am, is_input, sw_if_index0)) - acl_fa_try_recycle_session (am, is_input, sw_if_index0); + acl_fa_try_recycle_session (am, is_input, thread_index, sw_if_index0); if (acl_fa_can_add_session (am, is_input, sw_if_index0)) { @@ -1024,16 +1198,9 @@ acl_out_ip4_fa_node_fn (vlib_main_t * vm, } /* - * This process performs all the connection clean up - both for idle connections, - * as well as receiving the signals to clean up the connections in case of sw_if_index deletion, - * or (maybe in the future) the connection deletion due to policy reasons. - * - * The previous iteration (l2sess) attempted to clean up the connections in small increments, - * in-band, but the problem it tried to preemptively address (process starvation) is yet to be seen. - * - * The approach with a single thread deleting the connections is simpler, thus we use it until - * there is a real starvation problem to solve. - * + * This process ensures the connection cleanup happens every so often + * even in absence of traffic, as well as provides general orchestration + * for requests like connection deletion on a given sw_if_index. */ @@ -1056,57 +1223,131 @@ static char *acl_fa_cleaner_error_strings[] = { #undef _ }; -static int -acl_fa_clean_sessions_by_sw_if_index (acl_main_t *am, u32 sw_if_index, u32 *count) -{ - - int undeleted = 0; - fa_session_t *sess; - uword *dv = NULL; - uword *ii; - - pool_foreach(sess, am->fa_sessions_pool, ({ - if ( (~0 == sw_if_index) || (sw_if_index == sess->sw_if_index) ) - vec_add1(dv, sess-am->fa_sessions_pool); - })); - vec_foreach(ii, dv) - { - sess = pool_elt_at_index(am->fa_sessions_pool, *ii); - acl_fa_delete_session(am, sess->sw_if_index, *ii); - (*count)++; - } - - pool_foreach(sess, am->fa_sessions_pool, ({ - if ( (~0 == sw_if_index) || (sw_if_index == sess->sw_if_index) ) - undeleted++; - })); - if (undeleted == 0) - { - if (~0 == sw_if_index) - { - /* FIXME: clean-up tables ? */ - } - else - { - /* FIXME: clean-up tables ? */ - } - } - return (undeleted == 0); -} /* *INDENT-ON* */ static vlib_node_registration_t acl_fa_session_cleaner_process_node; +static vlib_node_registration_t acl_fa_worker_session_cleaner_process_node; -static int -acl_fa_conn_time_to_check (acl_main_t *am, u64 now, u32 session_index) +/* + * Per-worker thread interrupt-driven cleaner thread + * to clean idle connections if there are no packets + */ +static uword +acl_fa_worker_conn_cleaner_process(vlib_main_t * vm, + vlib_node_runtime_t * rt, vlib_frame_t * f) { - fa_session_t *sess = am->fa_sessions_pool + session_index; - u64 timeout_time = - sess->link_enqueue_time + fa_session_get_timeout (am, sess); - return (timeout_time < now); + acl_main_t *am = &acl_main; + u64 now = clib_cpu_time_now (); + u16 thread_index = os_get_thread_index (); + acl_fa_per_worker_data_t *pw = &am->per_worker_data[thread_index]; + int num_expired; +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning("\nacl_fa_worker_conn_cleaner: thread index %d now %lu\n\n", thread_index, now); +#endif + /* allow another interrupt to be queued */ + pw->interrupt_is_pending = 0; + if (pw->clear_in_process) { + if (0 == pw->swipe_end_time) { + /* + * Someone has just set the flag to start clearing. + * we do this by combing through the connections up to a "time T" + * which is now, and requeueing everything except the expired + * connections and those matching the interface(s) being cleared. + */ + + /* + * first filter the sw_if_index bitmap that they want from us, by + * a bitmap of sw_if_index for which we actually have connections. + */ + if ((pw->pending_clear_sw_if_index_bitmap == 0) + || (pw->serviced_sw_if_index_bitmap == 0)) { +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning("WORKER-CLEAR: someone tried to call clear, but one of the bitmaps are empty"); +#endif + clib_bitmap_zero(pw->pending_clear_sw_if_index_bitmap); + } else { +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning("WORKER-CLEAR: (before and) swiping sw-if-index bitmap: %U, my serviced bitmap %U", + format_bitmap_hex, pw->pending_clear_sw_if_index_bitmap, + format_bitmap_hex, pw->serviced_sw_if_index_bitmap); +#endif + pw->pending_clear_sw_if_index_bitmap = clib_bitmap_and(pw->pending_clear_sw_if_index_bitmap, + pw->serviced_sw_if_index_bitmap); + } + + if (clib_bitmap_is_zero(pw->pending_clear_sw_if_index_bitmap)) { + /* if the cross-section is a zero vector, no need to do anything. */ + clib_warning("WORKER: clearing done - nothing to do"); + pw->clear_in_process = 0; + } else { +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning("WORKER-CLEAR: swiping sw-if-index bitmap: %U, my serviced bitmap %U", + format_bitmap_hex, pw->pending_clear_sw_if_index_bitmap, + format_bitmap_hex, pw->serviced_sw_if_index_bitmap); +#endif + /* swipe through the connection lists until enqueue timestamps become above "now" */ + pw->swipe_end_time = now; + } + } + } + num_expired = acl_fa_check_idle_sessions(am, thread_index, now); + // clib_warning("WORKER-CLEAR: checked %d sessions (clear_in_progress: %d)", num_expired, pw->clear_in_process); + if (pw->clear_in_process) { + if (0 == num_expired) { + /* we were clearing but we could not process any more connections. time to stop. */ + clib_bitmap_zero(pw->pending_clear_sw_if_index_bitmap); + pw->clear_in_process = 0; +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning("WORKER: clearing done, all done"); +#endif + } else { +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning("WORKER-CLEAR: more work to do, raising interrupt"); +#endif + /* should continue clearing.. So could they please sent an interrupt again? */ + pw->interrupt_is_needed = 1; + } + } else { + if (num_expired >= am->fa_max_deleted_sessions_per_interval) { + /* there was too much work, we should get an interrupt ASAP */ + pw->interrupt_is_needed = 1; + } else if (num_expired <= am->fa_min_deleted_sessions_per_interval) { + /* signal that they should trigger us less */ + pw->interrupt_is_unwanted = 1; + } else { + /* the current rate of interrupts is ok */ + pw->interrupt_is_needed = 0; + pw->interrupt_is_unwanted = 0; + } + } + return 0; } +static void +send_one_worker_interrupt (vlib_main_t * vm, acl_main_t *am, int thread_index) +{ + acl_fa_per_worker_data_t *pw = &am->per_worker_data[thread_index]; + if (!pw->interrupt_is_pending) { + vlib_node_set_interrupt_pending (vlib_mains[thread_index], + acl_fa_worker_session_cleaner_process_node.index); + pw->interrupt_is_pending = 1; + /* if the interrupt was requested, mark that done. */ + pw->interrupt_is_needed = 0; + } +} +static void +send_interrupts_to_workers (vlib_main_t * vm, acl_main_t *am) +{ + int i; + /* Can't use vec_len(am->per_worker_data) since the threads might not have come up yet; */ + int n_threads = vec_len(vlib_mains); + for (i = n_threads > 1 ? 1 : 0; i < n_threads; i++) { + send_one_worker_interrupt(vm, am, i); + } +} + +/* centralized process to drive per-worker cleaners */ static uword acl_fa_session_cleaner_process (vlib_main_t * vm, vlib_node_runtime_t * rt, vlib_frame_t * f) @@ -1115,28 +1356,48 @@ acl_fa_session_cleaner_process (vlib_main_t * vm, vlib_node_runtime_t * rt, u64 now = clib_cpu_time_now (); f64 cpu_cps = vm->clib_time.clocks_per_second; u64 next_expire; - /* We should call timer wheel at least twice a second */ + /* We should check if there are connections to clean up - at least twice a second */ u64 max_timer_wait_interval = cpu_cps / 2; - am->fa_current_cleaner_timer_wait_interval = max_timer_wait_interval; - - u32 *expired = NULL; uword event_type, *event_data = 0; + acl_fa_per_worker_data_t *pw0; + am->fa_current_cleaner_timer_wait_interval = max_timer_wait_interval; am->fa_cleaner_node_index = acl_fa_session_cleaner_process_node.index; while (1) { - u32 count_deleted_sessions = 0; - u32 count_already_deleted = 0; now = clib_cpu_time_now (); next_expire = now + am->fa_current_cleaner_timer_wait_interval; int has_pending_conns = 0; + u16 ti; u8 tt; - for(tt = 0; tt < ACL_N_TIMEOUTS; tt++) - { - if (~0 != am->fa_conn_list_head[tt]) + + /* + * walk over all per-thread list heads of different timeouts, + * and see if there are any connections pending. + * If there aren't - we do not need to wake up until the + * worker code signals that it has added a connection. + * + * Also, while we are at it, calculate the earliest we need to wake up. + */ + for(ti = 0; ti < vec_len(vlib_mains); ti++) { + if (ti >= vec_len(am->per_worker_data)) { + continue; + } + acl_fa_per_worker_data_t *pw = &am->per_worker_data[ti]; + for(tt = 0; tt < vec_len(pw->fa_conn_list_head); tt++) { + u64 head_expiry = acl_fa_get_list_head_expiry_time(am, pw, now, ti, tt); + if ((head_expiry < next_expire) && !pw->interrupt_is_pending) { +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning("Head expiry: %lu, now: %lu, next_expire: %lu (worker: %d, tt: %d)", head_expiry, now, next_expire, ti, tt); +#endif + next_expire = head_expiry; + } + if (~0 != pw->fa_conn_list_head[tt]) { has_pending_conns = 1; + } } + } /* If no pending connections then no point in timing out */ if (!has_pending_conns) @@ -1155,9 +1416,6 @@ acl_fa_session_cleaner_process (vlib_main_t * vm, vlib_node_runtime_t * rt, } else { - /* Timing wheel code is happier if it is called regularly */ - if (timeout > 0.5) - timeout = 0.5; am->fa_cleaner_cnt_wait_with_timeout++; (void) vlib_process_wait_for_event_or_clock (vm, timeout); event_type = vlib_process_get_events (vm, &event_data); @@ -1175,7 +1433,11 @@ acl_fa_session_cleaner_process (vlib_main_t * vm, vlib_node_runtime_t * rt, break; case ACL_FA_CLEANER_DELETE_BY_SW_IF_INDEX: { + uword *clear_sw_if_index_bitmap = 0; uword *sw_if_index0; +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning("ACL_FA_CLEANER_DELETE_BY_SW_IF_INDEX received"); +#endif vec_foreach (sw_if_index0, event_data) { am->fa_cleaner_cnt_delete_by_sw_index++; @@ -1184,13 +1446,54 @@ acl_fa_session_cleaner_process (vlib_main_t * vm, vlib_node_runtime_t * rt, ("ACL_FA_NODE_CLEAN: ACL_FA_CLEANER_DELETE_BY_SW_IF_INDEX: %d", *sw_if_index0); #endif - u32 count = 0; - int result = - acl_fa_clean_sessions_by_sw_if_index (am, *sw_if_index0, - &count); - count_deleted_sessions += count; - am->fa_cleaner_cnt_delete_by_sw_index_ok += result; + clear_sw_if_index_bitmap = clib_bitmap_set(clear_sw_if_index_bitmap, *sw_if_index0, 1); } +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning("ACL_FA_CLEANER_DELETE_BY_SW_IF_INDEX bitmap: %U", format_bitmap_hex, clear_sw_if_index_bitmap); +#endif + vec_foreach(pw0, am->per_worker_data) { + CLIB_MEMORY_BARRIER (); + while (pw0->clear_in_process) { + CLIB_MEMORY_BARRIER (); +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning("ACL_FA_NODE_CLEAN: waiting previous cleaning cycle to finish on %d...", pw0 - am->per_worker_data); +#endif + vlib_process_suspend(vm, 0.0001); + if (pw0->interrupt_is_needed) { + send_one_worker_interrupt(vm, am, (pw0 - am->per_worker_data)); + } + } + if (pw0->clear_in_process) { + clib_warning("ERROR-BUG! Could not initiate cleaning on worker because another cleanup in progress"); + } else { + pw0->pending_clear_sw_if_index_bitmap = clib_bitmap_dup(clear_sw_if_index_bitmap); + pw0->clear_in_process = 1; + } + } + /* send some interrupts so they can start working */ + send_interrupts_to_workers(vm, am); + + /* now wait till they all complete */ +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning("CLEANER mains len: %d per-worker len: %d", vec_len(vlib_mains), vec_len(am->per_worker_data)); +#endif + vec_foreach(pw0, am->per_worker_data) { + CLIB_MEMORY_BARRIER (); + while (pw0->clear_in_process) { + CLIB_MEMORY_BARRIER (); +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning("ACL_FA_NODE_CLEAN: waiting for my cleaning cycle to finish on %d...", pw0 - am->per_worker_data); +#endif + vlib_process_suspend(vm, 0.0001); + if (pw0->interrupt_is_needed) { + send_one_worker_interrupt(vm, am, (pw0 - am->per_worker_data)); + } + } + } +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning("ACL_FA_NODE_CLEAN: cleaning done"); +#endif + clib_bitmap_free(clear_sw_if_index_bitmap); } break; default: @@ -1206,74 +1509,34 @@ acl_fa_session_cleaner_process (vlib_main_t * vm, vlib_node_runtime_t * rt, break; } - { - u8 tt = 0; - for(tt = 0; tt < ACL_N_TIMEOUTS; tt++) { - while((vec_len(expired) < 2*am->fa_max_deleted_sessions_per_interval) - && (~0 != am->fa_conn_list_head[tt]) - && (acl_fa_conn_time_to_check(am, now, - am->fa_conn_list_head[tt]))) { - u32 sess_id = am->fa_conn_list_head[tt]; - vec_add1(expired, sess_id); - acl_fa_conn_list_delete_session(am, sess_id); - } - } - } + send_interrupts_to_workers(vm, am); - u32 *psid = NULL; - vec_foreach (psid, expired) - { - u32 session_index = *psid; - if (!pool_is_free_index (am->fa_sessions_pool, session_index)) - { - fa_session_t *sess = am->fa_sessions_pool + session_index; - u32 sw_if_index = sess->sw_if_index; - u64 sess_timeout_time = - sess->last_active_time + fa_session_get_timeout (am, sess); - if (now < sess_timeout_time) - { - /* clib_warning ("ACL_FA_NODE_CLEAN: Restarting timer for session %d", - (int) session_index); */ - - /* There was activity on the session, so the idle timeout - has not passed. Enqueue for another time period. */ - - acl_fa_conn_list_add_session(am, session_index, now); - - /* FIXME: When/if moving to timer wheel, - pretend we did this in the past, - at last_active moment, so the timer is accurate */ - am->fa_cleaner_cnt_timer_restarted++; - } - else - { - /* clib_warning ("ACL_FA_NODE_CLEAN: Deleting session %d", - (int) session_index); */ - acl_fa_delete_session (am, sw_if_index, session_index); - count_deleted_sessions++; - } - } - else - { - count_already_deleted++; - } - } - if (expired) - _vec_len (expired) = 0; if (event_data) _vec_len (event_data) = 0; - if (count_deleted_sessions > am->fa_max_deleted_sessions_per_interval) { - /* if there was too many sessions to delete, do less waiting around next time */ + + int interrupts_needed = 0; + int interrupts_unwanted = 0; + + vec_foreach(pw0, am->per_worker_data) { + if (pw0->interrupt_is_needed) { + interrupts_needed++; + /* the per-worker value is reset when sending the interrupt */ + } + if (pw0->interrupt_is_unwanted) { + interrupts_unwanted++; + pw0->interrupt_is_unwanted = 0; + } + } + if (interrupts_needed) { + /* they need more interrupts, do less waiting around next time */ am->fa_current_cleaner_timer_wait_interval /= 2; - } else if (count_deleted_sessions < am->fa_min_deleted_sessions_per_interval) { - /* Too few deleted sessions, slowly increase the amount of sleep up to a limit */ + } else if (interrupts_unwanted) { + /* slowly increase the amount of sleep up to a limit */ if (am->fa_current_cleaner_timer_wait_interval < max_timer_wait_interval) am->fa_current_cleaner_timer_wait_interval += cpu_cps * am->fa_cleaner_wait_time_increment; } am->fa_cleaner_cnt_event_cycles++; - am->fa_cleaner_cnt_deleted_sessions += count_deleted_sessions; - am->fa_cleaner_cnt_already_deleted += count_already_deleted; } /* NOT REACHED */ return 0; @@ -1307,6 +1570,9 @@ acl_fa_enable_disable (u32 sw_if_index, int is_input, int enable_disable) if ((!enable_disable) && (!acl_fa_ifc_has_in_acl (am, sw_if_index)) && (!acl_fa_ifc_has_out_acl (am, sw_if_index))) { +#ifdef FA_NODE_VERBOSE_DEBUG + clib_warning("ENABLE-DISABLE: clean the connections on interface %d", sw_if_index); +#endif vlib_process_signal_event (am->vlib_main, am->fa_cleaner_node_index, ACL_FA_CLEANER_DELETE_BY_SW_IF_INDEX, sw_if_index); @@ -1317,6 +1583,12 @@ acl_fa_enable_disable (u32 sw_if_index, int is_input, int enable_disable) /* *INDENT-OFF* */ +VLIB_REGISTER_NODE (acl_fa_worker_session_cleaner_process_node, static) = { + .function = acl_fa_worker_conn_cleaner_process, + .name = "acl-plugin-fa-worker-cleaner-process", + .type = VLIB_NODE_TYPE_INPUT, + .state = VLIB_NODE_STATE_INTERRUPT, +}; VLIB_REGISTER_NODE (acl_fa_session_cleaner_process_node, static) = { .function = acl_fa_session_cleaner_process, diff --git a/src/plugins/acl/fa_node.h b/src/plugins/acl/fa_node.h index 861836226da..a94e7db9eea 100644 --- a/src/plugins/acl/fa_node.h +++ b/src/plugins/acl/fa_node.h @@ -59,15 +59,29 @@ typedef struct { u8 as_u8[2]; u16 as_u16; } tcp_flags_seen; ; /* +2 bytes = 62 */ - u8 link_list_id; /* +1 bytes = 63 */ - u8 reserved1; /* +1 bytes = 64 */ - u32 link_prev_idx; - u32 link_next_idx; - u64 link_enqueue_time; - u64 reserved2[6]; + u16 thread_index; /* +2 bytes = 64 */ + u64 link_enqueue_time; /* 8 byte = 8 */ + u32 link_prev_idx; /* +4 bytes = 12 */ + u32 link_next_idx; /* +4 bytes = 16 */ + u8 link_list_id; /* +1 bytes = 17 */ + u8 reserved1[7]; /* +7 bytes = 24 */ + u64 reserved2[5]; /* +5*8 bytes = 64 */ } fa_session_t; +/* This structure is used to fill in the u64 value + in the per-sw-if-index hash table */ +typedef struct { + union { + u64 as_u64; + struct { + u32 session_index; + u16 thread_index; + u16 reserved0; + }; + }; +} fa_full_session_id_t; + /* * A few compile-time constraints on the size and the layout of the union, to ensure * it makes sense both for bihash and for us. @@ -79,10 +93,56 @@ CT_ASSERT_EQUAL(fa_l4_key_t_is_8, sizeof(fa_session_l4_key_t), sizeof(u64)); CT_ASSERT_EQUAL(fa_packet_info_t_is_8, sizeof(fa_packet_info_t), sizeof(u64)); CT_ASSERT_EQUAL(fa_l3_kv_size_is_48, sizeof(fa_5tuple_t), sizeof(clib_bihash_kv_40_8_t)); -/* Let's try to fit within the cacheline */ -CT_ASSERT_EQUAL(fa_session_t_size_is_64, sizeof(fa_session_t), 128); +/* Let's try to fit within two cachelines */ +CT_ASSERT_EQUAL(fa_session_t_size_is_128, sizeof(fa_session_t), 128); + +/* Session ID MUST be the same as u64 */ +CT_ASSERT_EQUAL(fa_full_session_id_size_is_64, sizeof(fa_full_session_id_t), sizeof(u64)); #undef CT_ASSERT_EQUAL +typedef struct { + /* The pool of sessions managed by this worker */ + fa_session_t *fa_sessions_pool; + /* per-worker ACL_N_TIMEOUTS of conn lists */ + u32 *fa_conn_list_head; + u32 *fa_conn_list_tail; + /* Vector of expired connections retrieved from lists */ + u32 *expired; + /* the earliest next expiry time */ + u64 next_expiry_time; + /* if not zero, look at all the elements until their enqueue timestamp is after below one */ + u64 requeue_until_time; + /* Current time between the checks */ + u64 current_time_wait_interval; + /* Counter of how many sessions we did delete */ + u64 cnt_deleted_sessions; + /* Counter of already deleted sessions being deleted - should not increment unless a bug */ + u64 cnt_already_deleted_sessions; + /* Number of times we requeued a session to a head of the list */ + u64 cnt_session_timer_restarted; + /* swipe up to this enqueue time, rather than following the timeouts */ + u64 swipe_end_time; + /* bitmap of sw_if_index serviced by this worker */ + uword *serviced_sw_if_index_bitmap; + /* bitmap of sw_if_indices to clear. set by main thread, cleared by worker */ + uword *pending_clear_sw_if_index_bitmap; + /* atomic, indicates that the swipe-deletion of connections is in progress */ + u32 clear_in_process; + /* Interrupt is pending from main thread */ + int interrupt_is_pending; + /* + * Interrupt node on the worker thread sets this if it knows there is + * more work to do, but it has to finish to avoid hogging the + * core for too long. + */ + int interrupt_is_needed; + /* + * Set to indicate that the interrupt node wants to get less interrupts + * because there is not enough work for the current rate. + */ + int interrupt_is_unwanted; +} acl_fa_per_worker_data_t; + typedef enum { ACL_FA_ERROR_DROP, |