From 5dbfbb7110a52595915acd5ec034f82ce517846a Mon Sep 17 00:00:00 2001
From: Andrew Yourtchenko <ayourtch@gmail.com>
Date: Wed, 17 May 2017 21:27:03 +0200
Subject: acl-plugin: make the ACL plugin multicore-capable

Add the logic to be able to use stateful ACLs in a multithreaded setup.

Change-Id: I3b0cfa6ca4ea8f46f61648611c3e97b00c3376b6
Signed-off-by: Andrew Yourtchenko <ayourtch@gmail.com>
---
 src/plugins/acl/acl-plugin.md | 347 +++++++++++++++++++++++
 src/plugins/acl/acl.c         |  39 ++-
 src/plugins/acl/acl.h         |  11 +-
 src/plugins/acl/fa_node.c     | 646 ++++++++++++++++++++++++++++++------------
 src/plugins/acl/fa_node.h     |  76 ++++-
 5 files changed, 912 insertions(+), 207 deletions(-)
 create mode 100644 src/plugins/acl/acl-plugin.md

diff --git a/src/plugins/acl/acl-plugin.md b/src/plugins/acl/acl-plugin.md
new file mode 100644
index 00000000..1b44bca9
--- /dev/null
+++ b/src/plugins/acl/acl-plugin.md
@@ -0,0 +1,347 @@
+Multicore support for ACL plugin
+================================
+
+This captures some considerations and design decisions that I have made,
+both for my own memory later on ("what the hell was I thinking?!?"),
+and for anyone interested to criticize/improve/hack on this code.
+
+One of the factors taken into account while making these decisions,
+was the relative emphasis on the multi-thread vs. single-thread
+use cases: the latter is the vastly more prevalent. But,
+one can not optimize the single-thread performance without
+having a functioning code for multi-thread.
+
+stateless ACLs
+==============
+
+The stateless trivially parallelizes, and the only potential for the
+race between the different threads is during the reconfiguration,
+at the time of replacing the old ACL being checked, with
+the new ACL.
+
+In case an acl_add_replace is being used to replace the rules
+within the existing entry, a reallocation of am->acls[X].rules
+vector will happen and potentially a change in count.
+
+acl_match_5tuple() has the following code:
+
+  a = am->acls + acl_index;
+  for (i = 0; i < a->count; i++)
+    {
+      r = a->rules + i;
+     . . .
+
+Ideally we should be immune from a->rules changing,
+but the problem arises if the count changes in flight,
+and the new ruleset is smaller - then we will attempt
+to "match" against the free memory.
+
+This can(?) be solved by replacing the for() with while(),
+so the comparison happens at each iteration.
+
+full_acl_match_5tuple(), which iterates over the list
+of ACLs, is a bit less immune, since it takes the pointer
+to the vector to iterate and keeps a local copy of
+that pointer.
+
+This race can be solved by checking the
+current pointer to the vector with the source pointer,
+and seeing if there is an (unlikely) change, and if
+there is, return the "deny" action, or, better,
+restart the check.
+
+Since the check reloads the ACL list on a per-packet basis,
+there is only a window of opportunity of one packet to
+"match" packet against an incorrect rule set.
+The workers also do not change anything, only read.
+Therefore, it looks like building special structures
+to ensure that it does not happen at all might be not
+worth it.
+
+At least not until we have a unit-test able to
+reliably catch this condition and test that
+the measures applied are effective. Adding the code
+which is not possible to exercise is worse than
+not adding any code at all.
+
+So, I opt for "do-nothing" here for the moment.
+
+reflexive ACLs: single-thread
+=============================
+
+Before we talk multi-thread, is worth revisiting the
+design of the reflexive ACLs in the plugin, and
+the history of their evolution.
+
+The very first version of the ACL plugin, shipped in
+1701, mostly did the job using the existing components
+and gluing them together. Because it needed to work
+in bridged forwarding path only, using L2 classifier
+as an insertion point appeared natural, also L2 classifier,
+being a table with sessions, seemed like a good place
+to hold the sessions.
+
+So, the original design had two conceptual nodes:
+one, pointed by the next_miss from the L2 classifier table,
+was checking the actual ACL, and inserting session into
+the L2 classifier table, and the other one, pointed
+to by the next_match within the specific session rule,
+was checking the existing session. The timing out
+of the existing connections was done in the datapath,
+by periodically calling the aging function.
+
+This decision to use the existing components,
+with its attrativeness, did bring a few limitations as well:
+
+* L2 classifier is a simple mask-and-value match, with
+a fixed mask across the table. So, sanely supporting IPv6
+packets with extension headers in that framework was impossible.
+
+* There is no way to get a backpressure from L2 classifier
+depending on memory usage. When it runs out of memory,
+it simply crashes the box. When it runs out of memory ?
+We don't really know. Depends on how it allocates it.
+
+* Since we need to match the *reflected* traffic,
+we had to create *two* full session entries
+in two different directions, which is quite wasteful memory-wise.
+
+* (showstopper): the L2 classifier runs only in
+the bridged data path, so supporting routed data path
+would require creating something else entirely different,
+which would mean much more headaches support-wise going forward.
+
+Because of that, I have moved to a different model of
+creating a session-5-tuple from the packet data - once,
+and then doing all the matching just on that 5-tuple.
+
+This has allowed to add support for skipping IPv6 extension headers.
+
+Also, this new version started to store the sessions in a dedicated
+bihash-per-interface, with the session key data being
+aligned for the ingress packets, and being mirrored for the
+egress packets. This allows of significant savings in memory,
+because now we need to keep only one copy of the session table per
+interface instead of two, and also to only have ONE node for all the lookups,
+(L2/L3 path, in/out, IPv4/IPv6) - significantly reducing the code complexity.
+
+Unfortunately, bihash still has the "lack of backpressure" problem,
+in a sense that if you try to insert too many entries and run out
+of memory in the heap you supplied, you get a crash.
+
+To somewhat workaround against that, there is a "maximum tested number of sessions"
+value, which tracks the currently inserted sessions in the bihash,
+and if this number is being approached, a more aggressive cleanup
+can happen. If this number is reached, two behaviors are possible:
+
+* attempt to do the stateless ACL matching and permit the packet
+  if it succeeds
+
+* deny the packet
+
+Currently I have opted for a second one, since it allows for
+a better defined behavior, and if you have to permit
+the traffic in both directions, why using stateful anyway ?
+
+In order to be able to do the cleanup, we need to discriminate between
+the session types, with each session type having its own idle timeout.
+In order to do that, we keep three lists, defined in enum acl_timeout_e:
+ACL_TIMEOUT_UDP_IDLE, ACL_TIMEOUT_TCP_IDLE, ACL_TIMEOUT_TCP_TRANSIENT.
+
+The first one is hopefully obvious - it is just all UDP connections.
+They have an idle timeout of 600 seconds.
+
+The second and third is a bit more subtle. TCP is a complicated protocol,
+and we need to tread the fine line between doing too little and doing
+too much, and triggering the potential compatibility issues because of
+being a "middlebox".
+
+I decided to split the TCP connections into two classes:
+established, and everything else. "Established", means we have seen
+the SYN and ACK from both sides (with PUSH obviously masked out).
+This is the "active" state of any TCP connection and we would like
+to ensure we do not screw it up. So, the connections in this state
+have the default idle timer of 24 hours.
+
+All the rest of the connections have the idle timeout of 2 minutes,
+(inspired by an old value of MSL) and based on the observation
+that the states this class represent are usually very short lived.
+
+Once we have these three baskets of connections, it is trivial to
+imagine a simple cleanup mechanism to deal with this: take a
+TCP transient connection that has been hanging around.
+
+It is debatable whether we want to do discrimination between the
+different TCP transient connections. Assuming we do FIFO (and
+the lists allow us to do just that), it means a given connection
+on the head of the list has been hanging around for longest.
+Thus, if we are short on resources, we might just go ahead and
+reuse it within the datapath.
+
+This is where we are slowly approaching the question
+"Why in the world have not you used timer wheel or such ?"
+
+The answer is simple: within the above constraints, it does
+not buy me much.
+
+Also, timer wheel creates a leaky abstraction with a difficult
+to manage corner case. Which corner case ?
+
+We have a set of objects (sessions) with an event that may
+or may not happen (idle timeout timer firing), and a
+necessity to reset the idle timeout when there is
+activity on the session.
+
+In the worst case, where we had a 10000 of one-packet
+UDP sessions just created 10 minutes ago, we would need
+to deal with a spike of 10000 expired timers.
+
+Of course, if we have the active traffic on all
+of these 10000 connections, then we will not have
+to deal with that ? Right, but we will still have to deal
+with canceling and requeueing the timers.
+
+In the best possible case, requeueing a timer is
+going to be something along the lines of a linked-list
+removal and reinsertion.
+
+However, keep in mind we already need to classify the
+connections for reuse, so therefore we already have
+the linked lists!
+
+And if we just check these linked lists periodically in
+a FIFO fashion, we can get away with a very simple per-packet operation:
+writing back the timestamp of "now" into the connection structure.
+
+Then rather than requeueing the list on a per-packet or per-frame
+basis, we can defer this action until the time this session
+appears on the head of the FIFO list, and the cleaning
+routine makes the decision about whether to discard
+the session (because the interval since last activity is bigger
+than the idle timeout), or to requeue the session back to
+the end of the list (because the last activity was less
+than idle timeout ago).
+
+So, rather than using the timers, we can simply reuse our classification
+FIFOs, with the following heuristic: do not look at the session that was
+enqueued at time X until X+session_timeout. If we enqueue the sessions
+in the order of their initial activity, then we can simply use enqueue
+timestamp of the head session as a decision criterion for when we need
+to get back at looking at it for the timeout purposes.
+
+Since the number of FIFOs is small, we get a slightly worse check
+performance than with timers, but still O(1).
+
+We seemingly do quite a few "useless" operations of requeueing the items
+back to the tail of the list - but, these are the operations we do not
+have to do in the active data path, so overall it is a win.
+
+(Diversion: I believe this problem is congruent to poll vs. epoll or
+events vs. threads, some reading on this subject:
+http://web.archive.org/web/20120225022154/http://sheddingbikes.com/posts/1280829388.html)
+
+We can also can run a TCP-like scheme for adaptively changing
+the wait period in the routine that deals with the connection timeouts:
+we can attempt to check the connections a couple of times per second
+(same as we would advance the timer wheel), and then if we have requeued
+close to a max-per-quantum number of connections, we can half the waiting
+interval, and if we did not requeue any, we can slowly increment the waiting
+interval - which at a steady state should stabilize similar to what the TCP rate
+does.
+
+reflexive ACLs: multi-thread
+=============================
+
+The single-threaded implementation in 1704 used a separate "cleaner" process
+to deal with the timing out of the connections.
+It is all good and great when you know that there is only a single core
+to run everything on, but the existence of the lists proves to be
+a massive difficulty when it comes to operating from multiple threads.
+
+Initial study shows that with a few assumptions (e.g. that the cleaner running in main thread
+and the worker have a demarcation point in time where either one or the other one touches
+the session in the list) it might be possible to make it work, but the resulting
+trickiness of doing it neatly with all the corner cases is quite large.
+
+So, for the multi-threaded scenario, we need to move the connection
+aging back to the same CPU as its creation.
+
+Luckily we can do this with the help of the interrupts.
+
+So, the design is as follows: the aging thread (acl_fa_session_cleaner_process)
+periodically fires the interrupts to the workers interrupt nodes (acl_fa_worker_session_cleaner_process_node.index),
+using vlib_node_set_interrupt_pending(), and
+the interrupt node acl_fa_worker_conn_cleaner_process() calls acl_fa_check_idle_sessions()
+which does the actual job of advancing the lists. And within the actual datapath the only thing we will be
+doing is putting the items onto FIFO, and updating the last active time on the existing connection.
+
+The one "delicate" part is that the worker for one leg of the connection might be different from
+the worker of another leg of the connection - but, even if the "owner" tries to free the connection,
+nothing terrible can happen - worst case the element of the pool (which is nominally free for a short period)
+will get the timestamp updated - same thing about the TCP flags seen.
+
+A slightly trickier issue arises when the packet initially seen by one worker (thus owned by that worker),
+and the return packet processed by another worker, and as a result changes the
+the class of the connection (e.g. becomes TCP_ESTABLISHED from TCP_TRANSIENT or vice versa).
+If the class changes from one with the shorter idle time to the one with the longer idle time,
+then unless we are in the starvation mode where the transient connections are recycled,
+we can simply do nothing and let the normal requeue mechanism kick in. If the class changes from the longer idle
+timer to the shorter idle timer, then we risk keeping the connection around for longer than needed, which
+will affect the resource usage.
+
+One solution to that is to have NxN ring buffers (where N is the number of workers), such that the non-owner
+can signal to the owner the connection# that needs to be requeued out of order.
+
+A simpler solution though, is to ensure that each FIFO's period is equal to that of a shortest timer.
+This way the resource starvation problem is taken care of, at an expense of some additional work.
+
+This all looks sufficiently nice and simple until a skeleton falls out of the closet:
+sometimes we want to clean the connections en masse before they expire.
+
+There few potential scenarios:
+1) removal of an ACL from the interface
+2) removal of an interface
+3) manual action of an operator (in the future).
+
+In order to tackle this, we need to modify the logic which decides whether to requeue the
+connection on the end of the list, or to delete it due to idle timeout:
+
+We define a point in time, and have each worker thread fast-forward through its FIFO,
+in the process looking for sessions that satisfy the criteria, and either keeping them or requeueing them.
+
+To keep the ease of appearance to the outside world, we still process this as an event
+within the connection cleaner thread, but this event handler does as follows:
+1) it creates the bitmap of the sw_if_index values requested to be cleared
+2) for each worker, it waits to ensure there is no cleanup operation in progress (and if there is one,
+it waits), and then makes a copy of the bitmap, sets the per-worker flag of a cleanup operation, and sends an interrupt.
+3) wait until all cleanup operations have completed.
+
+Within the worker interrupt node, we check if the "cleanup in progress" is set,
+and if it is, we check the "fast forward time" value. If unset, we initialize it to value now, and compare the
+requested bitmap of sw_if_index values (pending_clear_sw_if_index_bitmap) with the bitmap of sw_if_index that this worker deals with.
+
+(we set the bit in the bitmap every time we enqueue the packet onto a FIFO - serviced_sw_if_index_bitmap in acl_fa_conn_list_add_session).
+
+If the result of this AND operation is zero - then we can clear the flag of cleanup in progress and return.
+Else we kick off the quantum of cleanup, and make sure we get another interrupt ASAP if that cleanup operation returns non-zero,
+meaning there is more work to do.
+When that operation returns zero, everything has been processed, we can clear the "cleanup-in-progress" flag, and
+zeroize the bitmap of sw_if_index-es requested to be cleaned.
+
+The interrupt node signals its wish to receive an interrupt ASAP by setting interrupt_is_needed
+flag within the per-worker structure. The main thread, while waiting for the
+cleanup operation to complete, checks if there is a request for interrupt,
+and if there is - it sends one.
+
+This approach gives us a way to mass-clean the connections which is reusing the code of the regular idle
+connection cleanup.
+
+One potential inefficiency is the bitmap values set by the session insertion
+in the data path - there is nothing to clear them.
+
+So, if one rearranges the interface placement with the workers, then the cleanups will cause some unnecessary work.
+For now, we consider it an acceptable limitation. It can be resolved by having another per-worker bitmap, which, when set,
+would trigger the cleanup of the bits in the serviced_sw_if_index_bitmap).
+
+=== the end ===
+
diff --git a/src/plugins/acl/acl.c b/src/plugins/acl/acl.c
index 6b79411e..84ed7af6 100644
--- a/src/plugins/acl/acl.c
+++ b/src/plugins/acl/acl.c
@@ -1785,6 +1785,7 @@ done:
   return error;
 }
 
+
 static clib_error_t *
 acl_show_aclplugin_fn (vlib_main_t * vm,
                               unformat_input_t * input,
@@ -1799,6 +1800,7 @@ acl_show_aclplugin_fn (vlib_main_t * vm,
   if (unformat (input, "sessions"))
     {
       u8 * out0 = 0;
+      u16 wk;
       pool_foreach (swif, im->sw_interfaces,
       ({
         u32 sw_if_index =  swif->sw_if_index;
@@ -1806,6 +1808,24 @@ acl_show_aclplugin_fn (vlib_main_t * vm,
         u64 n_dels = sw_if_index < vec_len(am->fa_session_dels_by_sw_if_index) ? am->fa_session_dels_by_sw_if_index[sw_if_index] : 0;
         out0 = format(out0, "sw_if_index %d: add %lu - del %lu = %lu\n", sw_if_index, n_adds, n_dels, n_adds - n_dels);
       }));
+      out0 = format(out0, "\n\nPer-worker data:\n");
+      for (wk = 0; wk < vec_len (am->per_worker_data); wk++) {
+        acl_fa_per_worker_data_t *pw = &am->per_worker_data[wk];
+	out0 = format(out0, "Worker #%d:\n", wk);
+	out0 = format(out0, "  Next expiry time: %lu\n", pw->next_expiry_time);
+	out0 = format(out0, "  Requeue until time: %lu\n", pw->requeue_until_time);
+	out0 = format(out0, "  Current time wait interval: %lu\n", pw->current_time_wait_interval);
+	out0 = format(out0, "  Count of deleted sessions: %lu\n", pw->cnt_deleted_sessions);
+	out0 = format(out0, "  Delete already deleted: %lu\n", pw->cnt_already_deleted_sessions);
+	out0 = format(out0, "  Session timers restarted: %lu\n", pw->cnt_session_timer_restarted);
+	out0 = format(out0, "  Swipe until this time: %lu\n", pw->swipe_end_time);
+	out0 = format(out0, "  sw_if_index serviced bitmap: %U\n", format_bitmap_hex, pw->serviced_sw_if_index_bitmap);
+	out0 = format(out0, "  pending clear intfc bitmap : %U\n", format_bitmap_hex, pw->pending_clear_sw_if_index_bitmap);
+	out0 = format(out0, "  clear in progress: %u\n", pw->clear_in_process);
+	out0 = format(out0, "  interrupt is pending: %d\n", pw->interrupt_is_pending);
+	out0 = format(out0, "  interrupt is needed: %d\n", pw->interrupt_is_needed);
+	out0 = format(out0, "  interrupt is unwanted: %d\n", pw->interrupt_is_unwanted);
+      }
       out0 = format(out0, "\n\nConn cleaner thread counters:\n");
 #define _(cnt, desc) out0 = format(out0, "             %20lu: %s\n", am->cnt, desc);
       foreach_fa_cleaner_counter;
@@ -1867,14 +1887,24 @@ acl_init (vlib_main_t * vm)
   am->fa_conn_table_hash_num_buckets = ACL_FA_CONN_TABLE_DEFAULT_HASH_NUM_BUCKETS;
   am->fa_conn_table_hash_memory_size = ACL_FA_CONN_TABLE_DEFAULT_HASH_MEMORY_SIZE;
   am->fa_conn_table_max_entries = ACL_FA_CONN_TABLE_DEFAULT_MAX_ENTRIES;
-
+  vlib_thread_main_t *tm = vlib_get_thread_main ();
+  // vec_validate(am->per_worker_data, os_get_nthreads()-1);
+  vec_validate(am->per_worker_data, tm->n_vlib_mains-1);
+  clib_warning("ACL_FA_INIT: per-worker len: %d", vec_len(am->per_worker_data));
   {
+    u16 wk;
     u8 tt;
-    for(tt = 0; tt < ACL_N_TIMEOUTS; tt++) {
-       am->fa_conn_list_head[tt] = ~0;
-       am->fa_conn_list_tail[tt] = ~0;
+    for (wk = 0; wk < vec_len (am->per_worker_data); wk++) {
+      acl_fa_per_worker_data_t *pw = &am->per_worker_data[wk];
+      vec_validate(pw->fa_conn_list_head, ACL_N_TIMEOUTS-1);
+      vec_validate(pw->fa_conn_list_tail, ACL_N_TIMEOUTS-1);
+      for(tt = 0; tt < ACL_N_TIMEOUTS; tt++) {
+        pw->fa_conn_list_head[tt] = ~0;
+        pw->fa_conn_list_tail[tt] = ~0;
+      }
     }
   }
+  clib_warning("ACL_FA_INIT-DONE: per-worker len: %d", vec_len(am->per_worker_data));
 
   am->fa_min_deleted_sessions_per_interval = ACL_FA_DEFAULT_MIN_DELETED_SESSIONS_PER_INTERVAL;
   am->fa_max_deleted_sessions_per_interval = ACL_FA_DEFAULT_MAX_DELETED_SESSIONS_PER_INTERVAL;
@@ -1883,7 +1913,6 @@ acl_init (vlib_main_t * vm)
   am->fa_cleaner_cnt_delete_by_sw_index = 0;
   am->fa_cleaner_cnt_delete_by_sw_index_ok = 0;
   am->fa_cleaner_cnt_unknown_event = 0;
-  am->fa_cleaner_cnt_deleted_sessions = 0;
   am->fa_cleaner_cnt_timer_restarted = 0;
   am->fa_cleaner_cnt_wait_with_timeout = 0;
 
diff --git a/src/plugins/acl/acl.h b/src/plugins/acl/acl.h
index eb074a7b..e35e0ea7 100644
--- a/src/plugins/acl/acl.h
+++ b/src/plugins/acl/acl.h
@@ -138,9 +138,7 @@ typedef struct {
   /* bitmap, when set the hash is initialized */
   uword *fa_sessions_on_sw_if_index;
   clib_bihash_40_8_t *fa_sessions_by_sw_if_index;
-  /* pool for FA session data. See fa_node.h */
-  fa_session_t *fa_sessions_pool;
-  /* The process node which is responsible to deleting the sessions */
+  /* The process node which orcherstrates the cleanup */
   u32 fa_cleaner_node_index;
   /* FA session timeouts, in seconds */
   u32 session_timeout_sec[ACL_N_TIMEOUTS];
@@ -192,8 +190,9 @@ typedef struct {
   f64 fa_cleaner_wait_time_increment;
 
   u64 fa_current_cleaner_timer_wait_interval;
-  u32 fa_conn_list_head[ACL_N_TIMEOUTS];
-  u32 fa_conn_list_tail[ACL_N_TIMEOUTS];
+
+  /* per-worker data related t conn management */
+  acl_fa_per_worker_data_t *per_worker_data;
 
   /* Configured session timeout */
   u64 session_timeout[ACL_N_TIMEOUTS];
@@ -205,12 +204,10 @@ typedef struct {
   _(fa_cleaner_cnt_delete_by_sw_index, "delete_by_sw_index events")        \
   _(fa_cleaner_cnt_delete_by_sw_index_ok, "delete_by_sw_index handled ok") \
   _(fa_cleaner_cnt_unknown_event, "unknown events received")               \
-  _(fa_cleaner_cnt_deleted_sessions, "sessions deleted")                   \
   _(fa_cleaner_cnt_timer_restarted, "session idle timers restarted")       \
   _(fa_cleaner_cnt_wait_with_timeout, "event wait with timeout called")    \
   _(fa_cleaner_cnt_wait_without_timeout, "event wait w/o timeout called")  \
   _(fa_cleaner_cnt_event_cycles, "total event cycles")                     \
-  _(fa_cleaner_cnt_already_deleted, "try to delete already deleted conn")  \
 /* end of counters */
 #define _(id, desc) u32 id;
   foreach_fa_cleaner_counter
diff --git a/src/plugins/acl/fa_node.c b/src/plugins/acl/fa_node.c
index c71429e7..c6059aae 100644
--- a/src/plugins/acl/fa_node.c
+++ b/src/plugins/acl/fa_node.c
@@ -542,6 +542,39 @@ fa_session_get_timeout_type (acl_main_t * am, fa_session_t * sess)
 }
 
 
+static u64
+fa_session_get_shortest_timeout(acl_main_t * am)
+{
+  int timeout_type;
+  u64 timeout = ~0LL;
+  for(timeout_type = 0; timeout_type < ACL_N_TIMEOUTS; timeout_type++) {
+    if (timeout > am->session_timeout_sec[timeout_type]) {
+      timeout = am->session_timeout_sec[timeout_type];
+    }
+  }
+  return timeout;
+}
+
+/*
+ * Get the timeout of the session in a list since its enqueue time.
+ */
+
+static u64
+fa_session_get_list_timeout (acl_main_t * am, fa_session_t * sess)
+{
+  u64 timeout = am->vlib_main->clib_time.clocks_per_second;
+  /*
+   * we have the shortest possible timeout type in all the lists
+   * (see README-multicore for the rationale)
+   */
+  timeout *= fa_session_get_shortest_timeout(am);
+  return timeout;
+}
+
+/*
+ * Get the idle timeout of a session.
+ */
+
 static u64
 fa_session_get_timeout (acl_main_t * am, fa_session_t * sess)
 {
@@ -554,6 +587,7 @@ fa_session_get_timeout (acl_main_t * am, fa_session_t * sess)
 static void
 acl_fa_ifc_init_sessions (acl_main_t * am, int sw_if_index0)
 {
+  /// FIXME-MULTICORE: lock around this function
 #ifdef FA_NODE_VERBOSE_DEBUG
   clib_warning
     ("Initializing bihash for sw_if_index %d num buckets %lu memory size %llu",
@@ -569,68 +603,97 @@ acl_fa_ifc_init_sessions (acl_main_t * am, int sw_if_index0)
     clib_bitmap_set (am->fa_sessions_on_sw_if_index, sw_if_index0, 1);
 }
 
+static inline fa_session_t *get_session_ptr(acl_main_t *am, u16 thread_index, u32 session_index)
+{
+  acl_fa_per_worker_data_t *pw = &am->per_worker_data[thread_index];
+  fa_session_t *sess = pw->fa_sessions_pool + session_index;
+  return sess;
+}
+
 static void
-acl_fa_conn_list_add_session (acl_main_t * am, u32 sess_id, u64 now)
+acl_fa_conn_list_add_session (acl_main_t * am, fa_full_session_id_t sess_id, u64 now)
 {
-  fa_session_t *sess = am->fa_sessions_pool + sess_id;
+  fa_session_t *sess = get_session_ptr(am, sess_id.thread_index, sess_id.session_index);
   u8 list_id = fa_session_get_timeout_type(am, sess);
+  uword thread_index = os_get_thread_index ();
+  acl_fa_per_worker_data_t *pw = &am->per_worker_data[thread_index];
+  /* the retrieved session thread index must be necessarily the same as the one in the key */
+  ASSERT (sess->thread_index == sess_id.thread_index);
+  /* the retrieved session thread index must be the same as current thread */
+  ASSERT (sess->thread_index == thread_index);
   sess->link_enqueue_time = now;
   sess->link_list_id = list_id;
   sess->link_next_idx = ~0;
-  sess->link_prev_idx = am->fa_conn_list_tail[list_id];
-  if (~0 != am->fa_conn_list_tail[list_id]) {
-    fa_session_t *prev_sess = am->fa_sessions_pool + am->fa_conn_list_tail[list_id];
-    prev_sess->link_next_idx = sess_id;
+  sess->link_prev_idx = pw->fa_conn_list_tail[list_id];
+  if (~0 != pw->fa_conn_list_tail[list_id]) {
+    fa_session_t *prev_sess = get_session_ptr(am, thread_index, pw->fa_conn_list_tail[list_id]);
+    prev_sess->link_next_idx = sess_id.session_index;
+    /* We should never try to link with a session on another thread */
+    ASSERT(prev_sess->thread_index == sess->thread_index);
   }
-  am->fa_conn_list_tail[list_id] = sess_id;
+  pw->fa_conn_list_tail[list_id] = sess_id.session_index;
+  pw->serviced_sw_if_index_bitmap = clib_bitmap_set(pw->serviced_sw_if_index_bitmap, sess->sw_if_index, 1);
 
-  if (~0 == am->fa_conn_list_head[list_id]) {
-    am->fa_conn_list_head[list_id] = sess_id;
-    /* If it is a first conn in any list, kick off the cleaner */
+  if (~0 == pw->fa_conn_list_head[list_id]) {
+    pw->fa_conn_list_head[list_id] = sess_id.session_index;
+    /* If it is a first conn in any list, kick the cleaner */
     vlib_process_signal_event (am->vlib_main, am->fa_cleaner_node_index,
                                  ACL_FA_CLEANER_RESCHEDULE, 0);
-
   }
 }
 
-static void
-acl_fa_conn_list_delete_session (acl_main_t *am, u32 sess_id)
+static int
+acl_fa_conn_list_delete_session (acl_main_t *am, fa_full_session_id_t sess_id)
 {
-  fa_session_t *sess = am->fa_sessions_pool + sess_id;
+  uword thread_index = os_get_thread_index ();
+  acl_fa_per_worker_data_t *pw = &am->per_worker_data[thread_index];
+  if (thread_index != sess_id.thread_index) {
+    /* If another thread attempts to delete the session, fail it. */
+#ifdef FA_NODE_VERBOSE_DEBUG
+    clib_warning("thread id in key %d != curr thread index, not deleting");
+#endif
+    return 0;
+  }
+  fa_session_t *sess = get_session_ptr(am, sess_id.thread_index, sess_id.session_index);
+  /* we should never try to delete the session with another thread index */
+  ASSERT(sess->thread_index == thread_index);
   if (~0 != sess->link_prev_idx) {
-    fa_session_t *prev_sess = am->fa_sessions_pool + sess->link_prev_idx;
+    fa_session_t *prev_sess = get_session_ptr(am, thread_index, sess->link_prev_idx);
+    /* the previous session must be in the same list as this one */
+    ASSERT(prev_sess->link_list_id == sess->link_list_id);
     prev_sess->link_next_idx = sess->link_next_idx;
-    if (prev_sess->link_list_id != sess->link_list_id)
-      clib_warning("(prev_sess->link_list_id != sess->link_list_id)");
   }
   if (~0 != sess->link_next_idx) {
-    fa_session_t *next_sess = am->fa_sessions_pool + sess->link_next_idx;
+    fa_session_t *next_sess = get_session_ptr(am, thread_index, sess->link_next_idx);
+    /* The next session must be in the same list as the one we are deleting */
+    ASSERT(next_sess->link_list_id == sess->link_list_id);
     next_sess->link_prev_idx = sess->link_prev_idx;
-    if (next_sess->link_list_id != sess->link_list_id)
-      clib_warning("(next_sess->link_list_id != sess->link_list_id)");
   }
-  if (am->fa_conn_list_head[sess->link_list_id] == sess_id) {
-    am->fa_conn_list_head[sess->link_list_id] = sess->link_next_idx;
+  if (pw->fa_conn_list_head[sess->link_list_id] == sess_id.session_index) {
+    pw->fa_conn_list_head[sess->link_list_id] = sess->link_next_idx;
   }
-  if (am->fa_conn_list_tail[sess->link_list_id] == sess_id) {
-    am->fa_conn_list_tail[sess->link_list_id] = sess->link_prev_idx;
+  if (pw->fa_conn_list_tail[sess->link_list_id] == sess_id.session_index) {
+    pw->fa_conn_list_tail[sess->link_list_id] = sess->link_prev_idx;
   }
+  return 1;
 }
 
-
-int
-acl_fa_session_is_dead (acl_main_t * am, u32 sw_if_index, u64 now,
-			u32 sess_id)
-{
-  return 0;
-}
-
-static void
-acl_fa_restart_timer_for_session (acl_main_t * am, u64 now, u32 sess_id)
+static int
+acl_fa_restart_timer_for_session (acl_main_t * am, u64 now, fa_full_session_id_t sess_id)
 {
-  // fa_session_t *sess = am->fa_sessions_pool + sess_id;
-  acl_fa_conn_list_delete_session(am, sess_id);
-  acl_fa_conn_list_add_session(am, sess_id, now);
+  if (acl_fa_conn_list_delete_session(am, sess_id)) {
+    acl_fa_conn_list_add_session(am, sess_id, now);
+    return 1;
+  } else {
+    /*
+     * Our thread does not own this connection, so we can not delete
+     * The session. To avoid the complicated signaling, we simply
+     * pick the list waiting time to be the shortest of the timeouts.
+     * This way we do not have to do anything special, and let
+     * the regular requeue check take care of everything.
+     */
+    return 0;
+  }
 }
 
 
@@ -648,13 +711,16 @@ acl_fa_track_session (acl_main_t * am, int is_input, u32 sw_if_index, u64 now,
 
 
 static void
-acl_fa_delete_session (acl_main_t * am, u32 sw_if_index, u32 sess_id)
+acl_fa_delete_session (acl_main_t * am, u32 sw_if_index, fa_full_session_id_t sess_id)
 {
-  fa_session_t *sess = (fa_session_t *) am->fa_sessions_pool + sess_id;
+  fa_session_t *sess = get_session_ptr(am, sess_id.thread_index, sess_id.session_index);
+  ASSERT(sess->thread_index == os_get_thread_index ());
   BV (clib_bihash_add_del) (&am->fa_sessions_by_sw_if_index[sw_if_index],
 			    &sess->info.kv, 0);
-  pool_put_index (am->fa_sessions_pool, sess_id);
-  /* Deleting from timer wheel not needed, as the cleaner deals with the timers. */
+  acl_fa_per_worker_data_t *pw = &am->per_worker_data[sess_id.thread_index];
+  pool_put_index (pw->fa_sessions_pool, sess_id.session_index);
+  /* Deleting from timer structures not needed,
+     as the caller must have dealt with the timers. */
   vec_validate (am->fa_session_dels_by_sw_if_index, sw_if_index);
   am->fa_session_dels_by_sw_if_index[sw_if_index]++;
 }
@@ -671,13 +737,114 @@ acl_fa_can_add_session (acl_main_t * am, int is_input, u32 sw_if_index)
   return (curr_sess < am->fa_conn_table_max_entries);
 }
 
+static u64
+acl_fa_get_list_head_expiry_time(acl_main_t *am, acl_fa_per_worker_data_t *pw, u64 now, u16 thread_index, int timeout_type)
+{
+  if (~0 == pw->fa_conn_list_head[timeout_type]) {
+    return ~0LL; // infinity.
+  } else {
+    fa_session_t *sess = get_session_ptr(am, thread_index, pw->fa_conn_list_head[timeout_type]);
+    u64 timeout_time =
+              sess->link_enqueue_time + fa_session_get_list_timeout (am, sess);
+    return timeout_time;
+  }
+}
+
+static int
+acl_fa_conn_time_to_check (acl_main_t *am, acl_fa_per_worker_data_t *pw, u64 now, u16 thread_index, u32 session_index)
+{
+  fa_session_t *sess = get_session_ptr(am, thread_index, session_index);
+  u64 timeout_time =
+              sess->link_enqueue_time + fa_session_get_list_timeout (am, sess);
+  return (timeout_time < now) || (sess->link_enqueue_time <= pw->swipe_end_time);
+}
+
+/*
+ * see if there are sessions ready to be checked,
+ * do the maintenance (requeue or delete), and
+ * return the total number of sessions reclaimed.
+ */
+static int
+acl_fa_check_idle_sessions(acl_main_t *am, u16 thread_index, u64 now)
+{
+  acl_fa_per_worker_data_t *pw = &am->per_worker_data[thread_index];
+  fa_full_session_id_t fsid;
+  fsid.thread_index = thread_index;
+  int total_expired = 0;
+
+  {
+    u8 tt = 0;
+    for(tt = 0; tt < ACL_N_TIMEOUTS; tt++) {
+      while((vec_len(pw->expired) < am->fa_max_deleted_sessions_per_interval)
+	    && (~0 != pw->fa_conn_list_head[tt])
+	    && (acl_fa_conn_time_to_check(am, pw, now, thread_index,
+					  pw->fa_conn_list_head[tt]))) {
+	fsid.session_index = pw->fa_conn_list_head[tt];
+	vec_add1(pw->expired, fsid.session_index);
+	acl_fa_conn_list_delete_session(am, fsid);
+      }
+    }
+  }
+
+  u32 *psid = NULL;
+  vec_foreach (psid, pw->expired)
+  {
+    fsid.session_index = *psid;
+    if (!pool_is_free_index (pw->fa_sessions_pool, fsid.session_index))
+      {
+	fa_session_t *sess = get_session_ptr(am, thread_index, fsid.session_index);
+	u32 sw_if_index = sess->sw_if_index;
+	u64 sess_timeout_time =
+	  sess->last_active_time + fa_session_get_timeout (am, sess);
+	if ((now < sess_timeout_time) && (0 == clib_bitmap_get(pw->pending_clear_sw_if_index_bitmap, sw_if_index)))
+	  {
+#ifdef FA_NODE_VERBOSE_DEBUG
+	    clib_warning ("ACL_FA_NODE_CLEAN: Restarting timer for session %d",
+	       (int) session_index);
+#endif
+	    /* There was activity on the session, so the idle timeout
+	       has not passed. Enqueue for another time period. */
+
+	    acl_fa_conn_list_add_session(am, fsid, now);
+	    pw->cnt_session_timer_restarted++;
+	  }
+	else
+	  {
+#ifdef FA_NODE_VERBOSE_DEBUG
+	    clib_warning ("ACL_FA_NODE_CLEAN: Deleting session %d",
+	       (int) session_index);
+#endif
+	    acl_fa_delete_session (am, sw_if_index, fsid);
+	    pw->cnt_deleted_sessions++;
+	  }
+      }
+    else
+      {
+	pw->cnt_already_deleted_sessions++;
+      }
+  }
+  total_expired = vec_len(pw->expired);
+  /* zero out the vector which we have acted on */
+  if (pw->expired)
+    _vec_len (pw->expired) = 0;
+  /* if we were advancing and reached the end
+   * (no more sessions to recycle), reset the fast-forward timestamp */
+
+  if (pw->swipe_end_time && 0 == total_expired)
+    pw->swipe_end_time = 0;
+  return (total_expired);
+}
+
 always_inline void
-acl_fa_try_recycle_session (acl_main_t * am, int is_input, u32 sw_if_index)
+acl_fa_try_recycle_session (acl_main_t * am, int is_input, u16 thread_index, u32 sw_if_index)
 {
   /* try to recycle a TCP transient session */
+  acl_fa_per_worker_data_t *pw = &am->per_worker_data[thread_index];
   u8 timeout_type = ACL_TIMEOUT_TCP_TRANSIENT;
-  u32 sess_id = am->fa_conn_list_head[timeout_type];
-  if (~0 != sess_id) {
+  fa_full_session_id_t sess_id;
+  sess_id.session_index = pw->fa_conn_list_head[timeout_type];
+  if (~0 != sess_id.session_index) {
+    sess_id.thread_index = thread_index;
     acl_fa_conn_list_delete_session(am, sess_id);
     acl_fa_delete_session(am, sw_if_index, sess_id);
   }
@@ -689,25 +856,28 @@ acl_fa_add_session (acl_main_t * am, int is_input, u32 sw_if_index, u64 now,
 {
   clib_bihash_kv_40_8_t *pkv = &p5tuple->kv;
   clib_bihash_kv_40_8_t kv;
-  u32 sess_id;
-  fa_session_t *sess;
+  fa_full_session_id_t f_sess_id;
+  uword thread_index = os_get_thread_index();
+  acl_fa_per_worker_data_t *pw = &am->per_worker_data[thread_index];
 
-  pool_get (am->fa_sessions_pool, sess);
-  sess_id = sess - am->fa_sessions_pool;
+  f_sess_id.thread_index = thread_index;
+  fa_session_t *sess;
 
+  pool_get_aligned (pw->fa_sessions_pool, sess, CLIB_CACHE_LINE_BYTES);
+  f_sess_id.session_index = sess - pw->fa_sessions_pool;
 
   kv.key[0] = pkv->key[0];
   kv.key[1] = pkv->key[1];
   kv.key[2] = pkv->key[2];
   kv.key[3] = pkv->key[3];
   kv.key[4] = pkv->key[4];
-  kv.value = sess_id;
+  kv.value = f_sess_id.as_u64;
 
   memcpy (sess, pkv, sizeof (pkv->key));
   sess->last_active_time = now;
   sess->sw_if_index = sw_if_index;
   sess->tcp_flags_seen.as_u16 = 0;
-  sess->reserved1 = 0;
+  sess->thread_index = thread_index;
   sess->link_list_id = ~0;
   sess->link_prev_idx = ~0;
   sess->link_next_idx = ~0;
@@ -721,7 +891,7 @@ acl_fa_add_session (acl_main_t * am, int is_input, u32 sw_if_index, u64 now,
 
   BV (clib_bihash_add_del) (&am->fa_sessions_by_sw_if_index[sw_if_index],
 			    &kv, 1);
-  acl_fa_conn_list_add_session(am, sess_id, now);
+  acl_fa_conn_list_add_session(am, f_sess_id, now);
 
   vec_validate (am->fa_session_adds_by_sw_if_index, sw_if_index);
   am->fa_session_adds_by_sw_if_index[sw_if_index]++;
@@ -757,6 +927,7 @@ acl_fa_node_fn (vlib_main_t * vm,
   clib_bihash_kv_40_8_t value_sess;
   vlib_node_runtime_t *error_node;
   u64 now = clib_cpu_time_now ();
+  uword thread_index = os_get_thread_index ();
 
   from = vlib_frame_vector_args (frame);
   n_left_from = frame->n_vectors;
@@ -827,16 +998,19 @@ acl_fa_node_fn (vlib_main_t * vm,
 		{
 		  trace_bitmap |= 0x80000000;
 		  error0 = ACL_FA_ERROR_ACL_EXIST_SESSION;
-		  // FIXME assert(value_sess.value == (0xffffffff & value_sess.value));
-		  u32 sess_id = value_sess.value;
-		  fa_session_t *sess = am->fa_sessions_pool + sess_id;
+		  fa_full_session_id_t f_sess_id;
+
+                  f_sess_id.as_u64 = value_sess.value;
+                  ASSERT(f_sess_id.thread_index < vec_len(vlib_mains));
+
+		  fa_session_t *sess = get_session_ptr(am, f_sess_id.thread_index, f_sess_id.session_index);
 		  int old_timeout_type =
 		    fa_session_get_timeout_type (am, sess);
 		  action =
 		    acl_fa_track_session (am, is_input, sw_if_index0, now,
 					  sess, &fa_5tuple);
 		  /* expose the session id to the tracer */
-		  match_rule_index = sess_id;
+		  match_rule_index = f_sess_id.session_index;
 		  int new_timeout_type =
 		    fa_session_get_timeout_type (am, sess);
 		  acl_check_needed = 0;
@@ -844,7 +1018,7 @@ acl_fa_node_fn (vlib_main_t * vm,
 		  /* Tracking might have changed the session timeout type, e.g. from transient to established */
 		  if (PREDICT_FALSE (old_timeout_type != new_timeout_type))
 		    {
-		      acl_fa_restart_timer_for_session (am, now, sess_id);
+		      acl_fa_restart_timer_for_session (am, now, f_sess_id);
 		      pkts_restart_session_timer++;
 		      trace_bitmap |=
 			0x00010000 + ((0xff & old_timeout_type) << 8) +
@@ -865,7 +1039,7 @@ acl_fa_node_fn (vlib_main_t * vm,
 	      if (2 == action)
 		{
 		  if (!acl_fa_can_add_session (am, is_input, sw_if_index0))
-                    acl_fa_try_recycle_session (am, is_input, sw_if_index0);
+                    acl_fa_try_recycle_session (am, is_input, thread_index, sw_if_index0);
 
 		  if (acl_fa_can_add_session (am, is_input, sw_if_index0))
 		    {
@@ -1024,16 +1198,9 @@ acl_out_ip4_fa_node_fn (vlib_main_t * vm,
 }
 
 /*
- * This process performs all the connection clean up - both for idle connections,
- * as well as receiving the signals to clean up the connections in case of sw_if_index deletion,
- * or (maybe in the future) the connection deletion due to policy reasons.
- *
- * The previous iteration (l2sess) attempted to clean up the connections in small increments,
- * in-band, but the problem it tried to preemptively address (process starvation) is yet to be seen.
- *
- * The approach with a single thread deleting the connections is simpler, thus we use it until
- * there is a real starvation problem to solve.
- *
+ * This process ensures the connection cleanup happens every so often
+ * even in absence of traffic, as well as provides general orchestration
+ * for requests like connection deletion on a given sw_if_index.
  */
 
 
@@ -1056,57 +1223,131 @@ static char *acl_fa_cleaner_error_strings[] = {
 #undef _
 };
 
-static int
-acl_fa_clean_sessions_by_sw_if_index (acl_main_t *am, u32 sw_if_index, u32 *count)
-{
-
-  int undeleted = 0;
-  fa_session_t *sess;
-  uword *dv = NULL;
-  uword *ii;
-
-  pool_foreach(sess, am->fa_sessions_pool, ({
-    if ( (~0 == sw_if_index) || (sw_if_index == sess->sw_if_index) )
-      vec_add1(dv, sess-am->fa_sessions_pool);
-  }));
-  vec_foreach(ii, dv)
-  {
-    sess =  pool_elt_at_index(am->fa_sessions_pool, *ii);
-    acl_fa_delete_session(am, sess->sw_if_index, *ii);
-    (*count)++;
-  }
-
-  pool_foreach(sess, am->fa_sessions_pool, ({
-    if ( (~0 == sw_if_index) || (sw_if_index == sess->sw_if_index) )
-      undeleted++;
-  }));
-  if (undeleted == 0)
-    {
-      if (~0 == sw_if_index)
-        {
-          /* FIXME: clean-up tables ? */
-        }
-      else
-        {
-          /* FIXME: clean-up tables ? */
-        }
-    }
-  return (undeleted == 0);
-}
 /* *INDENT-ON* */
 
 static vlib_node_registration_t acl_fa_session_cleaner_process_node;
+static vlib_node_registration_t acl_fa_worker_session_cleaner_process_node;
 
-static int
-acl_fa_conn_time_to_check (acl_main_t *am, u64 now, u32 session_index)
+/*
+ * Per-worker thread interrupt-driven cleaner thread
+ * to clean idle connections if there are no packets
+ */
+static uword
+acl_fa_worker_conn_cleaner_process(vlib_main_t * vm,
+              vlib_node_runtime_t * rt, vlib_frame_t * f)
 {
-  fa_session_t *sess = am->fa_sessions_pool + session_index;
-  u64 timeout_time =
-              sess->link_enqueue_time + fa_session_get_timeout (am, sess);
-  return (timeout_time < now);
+   acl_main_t *am = &acl_main;
+   u64 now = clib_cpu_time_now ();
+   u16 thread_index = os_get_thread_index ();
+   acl_fa_per_worker_data_t *pw = &am->per_worker_data[thread_index];
+   int num_expired;
+#ifdef FA_NODE_VERBOSE_DEBUG
+   clib_warning("\nacl_fa_worker_conn_cleaner: thread index %d now %lu\n\n", thread_index, now);
+#endif
+   /* allow another interrupt to be queued */
+   pw->interrupt_is_pending = 0;
+   if (pw->clear_in_process) {
+     if (0 == pw->swipe_end_time) {
+       /*
+        * Someone has just set the flag to start clearing.
+        * we do this by combing through the connections up to a "time T"
+        * which is now, and requeueing everything except the expired
+        * connections and those matching the interface(s) being cleared.
+        */
+
+       /*
+        * first filter the sw_if_index bitmap that they want from us, by
+        * a bitmap of sw_if_index for which we actually have connections.
+        */
+       if ((pw->pending_clear_sw_if_index_bitmap == 0)
+           || (pw->serviced_sw_if_index_bitmap == 0)) {
+#ifdef FA_NODE_VERBOSE_DEBUG
+         clib_warning("WORKER-CLEAR: someone tried to call clear, but one of the bitmaps are empty");
+#endif
+	 clib_bitmap_zero(pw->pending_clear_sw_if_index_bitmap);
+       } else {
+#ifdef FA_NODE_VERBOSE_DEBUG
+         clib_warning("WORKER-CLEAR: (before and) swiping sw-if-index bitmap: %U, my serviced bitmap %U",
+                      format_bitmap_hex, pw->pending_clear_sw_if_index_bitmap,
+                      format_bitmap_hex, pw->serviced_sw_if_index_bitmap);
+#endif
+         pw->pending_clear_sw_if_index_bitmap = clib_bitmap_and(pw->pending_clear_sw_if_index_bitmap,
+							      pw->serviced_sw_if_index_bitmap);
+       }
+
+       if (clib_bitmap_is_zero(pw->pending_clear_sw_if_index_bitmap)) {
+         /* if the cross-section is a zero vector, no need to do anything. */
+         clib_warning("WORKER: clearing done - nothing to do");
+         pw->clear_in_process = 0;
+       } else {
+#ifdef FA_NODE_VERBOSE_DEBUG
+         clib_warning("WORKER-CLEAR: swiping sw-if-index bitmap: %U, my serviced bitmap %U",
+                      format_bitmap_hex, pw->pending_clear_sw_if_index_bitmap,
+                      format_bitmap_hex, pw->serviced_sw_if_index_bitmap);
+#endif
+         /* swipe through the connection lists until enqueue timestamps become above "now" */
+         pw->swipe_end_time = now;
+       }
+     }
+   }
+   num_expired = acl_fa_check_idle_sessions(am, thread_index, now);
+   // clib_warning("WORKER-CLEAR: checked %d sessions (clear_in_progress: %d)", num_expired, pw->clear_in_process);
+   if (pw->clear_in_process) {
+     if (0 == num_expired) {
+       /* we were clearing but we could not process any more connections. time to stop. */
+       clib_bitmap_zero(pw->pending_clear_sw_if_index_bitmap);
+       pw->clear_in_process = 0;
+#ifdef FA_NODE_VERBOSE_DEBUG
+       clib_warning("WORKER: clearing done, all done");
+#endif
+     } else {
+#ifdef FA_NODE_VERBOSE_DEBUG
+       clib_warning("WORKER-CLEAR: more work to do, raising interrupt");
+#endif
+       /* should continue clearing.. So could they please sent an interrupt again? */
+       pw->interrupt_is_needed = 1;
+     }
+   } else {
+     if (num_expired >= am->fa_max_deleted_sessions_per_interval) {
+       /* there was too much work, we should get an interrupt ASAP */
+       pw->interrupt_is_needed = 1;
+     } else if (num_expired <= am->fa_min_deleted_sessions_per_interval) {
+       /* signal that they should trigger us less */
+       pw->interrupt_is_unwanted = 1;
+     } else {
+       /* the current rate of interrupts is ok */
+       pw->interrupt_is_needed = 0;
+       pw->interrupt_is_unwanted = 0;
+     }
+   }
+   return 0;
 }
 
+static void
+send_one_worker_interrupt (vlib_main_t * vm, acl_main_t *am, int thread_index)
+{
+  acl_fa_per_worker_data_t *pw = &am->per_worker_data[thread_index];
+  if (!pw->interrupt_is_pending) {
+    vlib_node_set_interrupt_pending (vlib_mains[thread_index],
+                  acl_fa_worker_session_cleaner_process_node.index);
+    pw->interrupt_is_pending = 1;
+    /* if the interrupt was requested, mark that done. */
+    pw->interrupt_is_needed = 0;
+  }
+}
 
+static void
+send_interrupts_to_workers (vlib_main_t * vm, acl_main_t *am)
+{
+  int i;
+  /* Can't use vec_len(am->per_worker_data) since the threads might not have come up yet; */
+  int n_threads = vec_len(vlib_mains);
+  for (i = n_threads > 1 ? 1 : 0; i < n_threads; i++) {
+    send_one_worker_interrupt(vm, am, i);
+  }
+}
+
+/* centralized process to drive per-worker cleaners */
 static uword
 acl_fa_session_cleaner_process (vlib_main_t * vm, vlib_node_runtime_t * rt,
 				vlib_frame_t * f)
@@ -1115,28 +1356,48 @@ acl_fa_session_cleaner_process (vlib_main_t * vm, vlib_node_runtime_t * rt,
   u64 now = clib_cpu_time_now ();
   f64 cpu_cps = vm->clib_time.clocks_per_second;
   u64 next_expire;
-  /* We should call timer wheel at least twice a second */
+  /* We should check if there are connections to clean up - at least twice a second */
   u64 max_timer_wait_interval = cpu_cps / 2;
-  am->fa_current_cleaner_timer_wait_interval = max_timer_wait_interval;
-
-  u32 *expired = NULL;
   uword event_type, *event_data = 0;
+  acl_fa_per_worker_data_t *pw0;
 
+  am->fa_current_cleaner_timer_wait_interval = max_timer_wait_interval;
   am->fa_cleaner_node_index = acl_fa_session_cleaner_process_node.index;
 
   while (1)
     {
-      u32 count_deleted_sessions = 0;
-      u32 count_already_deleted = 0;
       now = clib_cpu_time_now ();
       next_expire = now + am->fa_current_cleaner_timer_wait_interval;
       int has_pending_conns = 0;
+      u16 ti;
       u8 tt;
-      for(tt = 0; tt < ACL_N_TIMEOUTS; tt++)
-        {
-          if (~0 != am->fa_conn_list_head[tt])
+
+      /*
+       * walk over all per-thread list heads of different timeouts,
+       * and see if there are any connections pending.
+       * If there aren't - we do not need to wake up until the
+       * worker code signals that it has added a connection.
+       *
+       * Also, while we are at it, calculate the earliest we need to wake up.
+       */
+      for(ti = 0; ti < vec_len(vlib_mains); ti++) {
+        if (ti >= vec_len(am->per_worker_data)) {
+          continue;
+        }
+        acl_fa_per_worker_data_t *pw = &am->per_worker_data[ti];
+        for(tt = 0; tt < vec_len(pw->fa_conn_list_head); tt++) {
+          u64 head_expiry = acl_fa_get_list_head_expiry_time(am, pw, now, ti, tt);
+          if ((head_expiry < next_expire) && !pw->interrupt_is_pending) {
+#ifdef FA_NODE_VERBOSE_DEBUG
+            clib_warning("Head expiry: %lu, now: %lu, next_expire: %lu (worker: %d, tt: %d)", head_expiry, now, next_expire, ti, tt);
+#endif
+            next_expire = head_expiry;
+	  }
+          if (~0 != pw->fa_conn_list_head[tt]) {
             has_pending_conns = 1;
+          }
         }
+      }
 
       /* If no pending connections then no point in timing out */
       if (!has_pending_conns)
@@ -1155,9 +1416,6 @@ acl_fa_session_cleaner_process (vlib_main_t * vm, vlib_node_runtime_t * rt,
 	    }
 	  else
 	    {
-	      /* Timing wheel code is happier if it is called regularly */
-	      if (timeout > 0.5)
-		timeout = 0.5;
               am->fa_cleaner_cnt_wait_with_timeout++;
 	      (void) vlib_process_wait_for_event_or_clock (vm, timeout);
 	      event_type = vlib_process_get_events (vm, &event_data);
@@ -1175,7 +1433,11 @@ acl_fa_session_cleaner_process (vlib_main_t * vm, vlib_node_runtime_t * rt,
 	  break;
 	case ACL_FA_CLEANER_DELETE_BY_SW_IF_INDEX:
 	  {
+            uword *clear_sw_if_index_bitmap = 0;
 	    uword *sw_if_index0;
+#ifdef FA_NODE_VERBOSE_DEBUG
+	    clib_warning("ACL_FA_CLEANER_DELETE_BY_SW_IF_INDEX received");
+#endif
 	    vec_foreach (sw_if_index0, event_data)
 	    {
               am->fa_cleaner_cnt_delete_by_sw_index++;
@@ -1184,13 +1446,54 @@ acl_fa_session_cleaner_process (vlib_main_t * vm, vlib_node_runtime_t * rt,
 		("ACL_FA_NODE_CLEAN: ACL_FA_CLEANER_DELETE_BY_SW_IF_INDEX: %d",
 		 *sw_if_index0);
 #endif
-	      u32 count = 0;
-	      int result =
-		acl_fa_clean_sessions_by_sw_if_index (am, *sw_if_index0,
-						      &count);
-	      count_deleted_sessions += count;
-              am->fa_cleaner_cnt_delete_by_sw_index_ok += result;
+              clear_sw_if_index_bitmap = clib_bitmap_set(clear_sw_if_index_bitmap, *sw_if_index0, 1);
 	    }
+#ifdef FA_NODE_VERBOSE_DEBUG
+	    clib_warning("ACL_FA_CLEANER_DELETE_BY_SW_IF_INDEX bitmap: %U", format_bitmap_hex, clear_sw_if_index_bitmap);
+#endif
+	    vec_foreach(pw0, am->per_worker_data) {
+              CLIB_MEMORY_BARRIER ();
+	      while (pw0->clear_in_process) {
+                CLIB_MEMORY_BARRIER ();
+#ifdef FA_NODE_VERBOSE_DEBUG
+                clib_warning("ACL_FA_NODE_CLEAN: waiting previous cleaning cycle to finish on %d...", pw0 - am->per_worker_data);
+#endif
+                vlib_process_suspend(vm, 0.0001);
+                if (pw0->interrupt_is_needed) {
+                  send_one_worker_interrupt(vm, am, (pw0 - am->per_worker_data));
+                }
+              }
+              if (pw0->clear_in_process) {
+                clib_warning("ERROR-BUG! Could not initiate cleaning on worker because another cleanup in progress");
+	      } else {
+                pw0->pending_clear_sw_if_index_bitmap = clib_bitmap_dup(clear_sw_if_index_bitmap);
+                pw0->clear_in_process = 1;
+              }
+            }
+            /* send some interrupts so they can start working */
+            send_interrupts_to_workers(vm, am);
+
+            /* now wait till they all complete */
+#ifdef FA_NODE_VERBOSE_DEBUG
+	    clib_warning("CLEANER mains len: %d per-worker len: %d", vec_len(vlib_mains), vec_len(am->per_worker_data));
+#endif
+	    vec_foreach(pw0, am->per_worker_data) {
+              CLIB_MEMORY_BARRIER ();
+	      while (pw0->clear_in_process) {
+                CLIB_MEMORY_BARRIER ();
+#ifdef FA_NODE_VERBOSE_DEBUG
+                clib_warning("ACL_FA_NODE_CLEAN: waiting for my cleaning cycle to finish on %d...", pw0 - am->per_worker_data);
+#endif
+                vlib_process_suspend(vm, 0.0001);
+                if (pw0->interrupt_is_needed) {
+                  send_one_worker_interrupt(vm, am, (pw0 - am->per_worker_data));
+                }
+              }
+            }
+#ifdef FA_NODE_VERBOSE_DEBUG
+            clib_warning("ACL_FA_NODE_CLEAN: cleaning done");
+#endif
+            clib_bitmap_free(clear_sw_if_index_bitmap);
 	  }
 	  break;
 	default:
@@ -1206,74 +1509,34 @@ acl_fa_session_cleaner_process (vlib_main_t * vm, vlib_node_runtime_t * rt,
 	  break;
 	}
 
-      {
-        u8 tt = 0;
-        for(tt = 0; tt < ACL_N_TIMEOUTS; tt++) {
-          while((vec_len(expired) < 2*am->fa_max_deleted_sessions_per_interval)
-                && (~0 != am->fa_conn_list_head[tt])
-                && (acl_fa_conn_time_to_check(am, now,
-                                              am->fa_conn_list_head[tt]))) {
-            u32 sess_id = am->fa_conn_list_head[tt];
-            vec_add1(expired, sess_id);
-            acl_fa_conn_list_delete_session(am, sess_id);
-          }
-        }
-      }
+      send_interrupts_to_workers(vm, am);
 
-      u32 *psid = NULL;
-      vec_foreach (psid, expired)
-      {
-	u32 session_index = *psid;
-	if (!pool_is_free_index (am->fa_sessions_pool, session_index))
-	  {
-	    fa_session_t *sess = am->fa_sessions_pool + session_index;
-	    u32 sw_if_index = sess->sw_if_index;
-	    u64 sess_timeout_time =
-	      sess->last_active_time + fa_session_get_timeout (am, sess);
-	    if (now < sess_timeout_time)
-	      {
-		/* clib_warning ("ACL_FA_NODE_CLEAN: Restarting timer for session %d",
-		   (int) session_index); */
-
-                /* There was activity on the session, so the idle timeout
-                   has not passed. Enqueue for another time period. */
-
-                acl_fa_conn_list_add_session(am, session_index, now);
-
-		/* FIXME: When/if moving to timer wheel,
-                   pretend we did this in the past,
-                   at last_active moment, so the timer is accurate */
-                am->fa_cleaner_cnt_timer_restarted++;
-	      }
-	    else
-	      {
-		/* clib_warning ("ACL_FA_NODE_CLEAN: Deleting session %d",
-		   (int) session_index); */
-		acl_fa_delete_session (am, sw_if_index, session_index);
-                count_deleted_sessions++;
-	      }
-	  }
-	else
-	  {
-	    count_already_deleted++;
-	  }
-      }
-      if (expired)
-	_vec_len (expired) = 0;
       if (event_data)
 	_vec_len (event_data) = 0;
 
-      if (count_deleted_sessions > am->fa_max_deleted_sessions_per_interval) {
-        /* if there was too many sessions to delete, do less waiting around next time */
+
+      int interrupts_needed = 0;
+      int interrupts_unwanted = 0;
+
+      vec_foreach(pw0, am->per_worker_data) {
+        if (pw0->interrupt_is_needed) {
+          interrupts_needed++;
+          /* the per-worker value is reset when sending the interrupt */
+        }
+        if (pw0->interrupt_is_unwanted) {
+          interrupts_unwanted++;
+          pw0->interrupt_is_unwanted = 0;
+        }
+      }
+      if (interrupts_needed) {
+        /* they need more interrupts, do less waiting around next time */
         am->fa_current_cleaner_timer_wait_interval /= 2;
-      } else if (count_deleted_sessions < am->fa_min_deleted_sessions_per_interval) {
-        /* Too few deleted sessions, slowly increase the amount of sleep up to a limit */
+      } else if (interrupts_unwanted) {
+        /* slowly increase the amount of sleep up to a limit */
         if (am->fa_current_cleaner_timer_wait_interval < max_timer_wait_interval)
           am->fa_current_cleaner_timer_wait_interval += cpu_cps * am->fa_cleaner_wait_time_increment;
       }
       am->fa_cleaner_cnt_event_cycles++;
-      am->fa_cleaner_cnt_deleted_sessions += count_deleted_sessions;
-      am->fa_cleaner_cnt_already_deleted += count_already_deleted;
     }
   /* NOT REACHED */
   return 0;
@@ -1307,6 +1570,9 @@ acl_fa_enable_disable (u32 sw_if_index, int is_input, int enable_disable)
   if ((!enable_disable) && (!acl_fa_ifc_has_in_acl (am, sw_if_index))
       && (!acl_fa_ifc_has_out_acl (am, sw_if_index)))
     {
+#ifdef FA_NODE_VERBOSE_DEBUG
+      clib_warning("ENABLE-DISABLE: clean the connections on interface %d", sw_if_index);
+#endif
       vlib_process_signal_event (am->vlib_main, am->fa_cleaner_node_index,
 				 ACL_FA_CLEANER_DELETE_BY_SW_IF_INDEX,
 				 sw_if_index);
@@ -1317,6 +1583,12 @@ acl_fa_enable_disable (u32 sw_if_index, int is_input, int enable_disable)
 
 /* *INDENT-OFF* */
 
+VLIB_REGISTER_NODE (acl_fa_worker_session_cleaner_process_node, static) = {
+  .function = acl_fa_worker_conn_cleaner_process,
+  .name = "acl-plugin-fa-worker-cleaner-process",
+  .type = VLIB_NODE_TYPE_INPUT,
+  .state = VLIB_NODE_STATE_INTERRUPT,
+};
 
 VLIB_REGISTER_NODE (acl_fa_session_cleaner_process_node, static) = {
   .function = acl_fa_session_cleaner_process,
diff --git a/src/plugins/acl/fa_node.h b/src/plugins/acl/fa_node.h
index 86183622..a94e7db9 100644
--- a/src/plugins/acl/fa_node.h
+++ b/src/plugins/acl/fa_node.h
@@ -59,15 +59,29 @@ typedef struct {
     u8 as_u8[2];
     u16 as_u16;
   } tcp_flags_seen; ;     /* +2 bytes = 62 */
-  u8 link_list_id;           /* +1 bytes = 63 */
-  u8 reserved1;           /* +1 bytes = 64 */
-  u32 link_prev_idx;
-  u32 link_next_idx;
-  u64 link_enqueue_time;
-  u64 reserved2[6];
+  u16 thread_index;          /* +2 bytes = 64 */
+  u64 link_enqueue_time;  /* 8 byte = 8 */
+  u32 link_prev_idx;      /* +4 bytes = 12 */
+  u32 link_next_idx;      /* +4 bytes = 16 */
+  u8 link_list_id;        /* +1 bytes = 17 */
+  u8 reserved1[7];        /* +7 bytes = 24 */
+  u64 reserved2[5];       /* +5*8 bytes = 64 */
 } fa_session_t;
 
 
+/* This structure is used to fill in the u64 value
+   in the per-sw-if-index hash table */
+typedef struct {
+  union {
+    u64 as_u64;
+    struct {
+      u32 session_index;
+      u16 thread_index;
+      u16 reserved0;
+    };
+  };
+} fa_full_session_id_t;
+
 /*
  * A few compile-time constraints on the size and the layout of the union, to ensure
  * it makes sense both for bihash and for us.
@@ -79,10 +93,56 @@ CT_ASSERT_EQUAL(fa_l4_key_t_is_8, sizeof(fa_session_l4_key_t), sizeof(u64));
 CT_ASSERT_EQUAL(fa_packet_info_t_is_8, sizeof(fa_packet_info_t), sizeof(u64));
 CT_ASSERT_EQUAL(fa_l3_kv_size_is_48, sizeof(fa_5tuple_t), sizeof(clib_bihash_kv_40_8_t));
 
-/* Let's try to fit within the cacheline */
-CT_ASSERT_EQUAL(fa_session_t_size_is_64, sizeof(fa_session_t), 128);
+/* Let's try to fit within two cachelines */
+CT_ASSERT_EQUAL(fa_session_t_size_is_128, sizeof(fa_session_t), 128);
+
+/* Session ID MUST be the same as u64 */
+CT_ASSERT_EQUAL(fa_full_session_id_size_is_64, sizeof(fa_full_session_id_t), sizeof(u64));
 #undef CT_ASSERT_EQUAL
 
+typedef struct {
+  /* The pool of sessions managed by this worker */
+  fa_session_t *fa_sessions_pool;
+  /* per-worker ACL_N_TIMEOUTS of conn lists */
+  u32 *fa_conn_list_head;
+  u32 *fa_conn_list_tail;
+  /* Vector of expired connections retrieved from lists */
+  u32 *expired;
+  /* the earliest next expiry time */
+  u64 next_expiry_time;
+  /* if not zero, look at all the elements until their enqueue timestamp is after below one */
+  u64 requeue_until_time;
+  /* Current time between the checks */
+  u64 current_time_wait_interval;
+  /* Counter of how many sessions we did delete */
+  u64 cnt_deleted_sessions;
+  /* Counter of already deleted sessions being deleted - should not increment unless a bug */
+  u64 cnt_already_deleted_sessions;
+  /* Number of times we requeued a session to a head of the list */
+  u64 cnt_session_timer_restarted;
+  /* swipe up to this enqueue time, rather than following the timeouts */
+  u64 swipe_end_time;
+  /* bitmap of sw_if_index serviced by this worker */
+  uword *serviced_sw_if_index_bitmap;
+  /* bitmap of sw_if_indices to clear. set by main thread, cleared by worker */
+  uword *pending_clear_sw_if_index_bitmap;
+  /* atomic, indicates that the swipe-deletion of connections is in progress */
+  u32 clear_in_process;
+  /* Interrupt is pending from main thread */
+  int interrupt_is_pending;
+  /*
+   * Interrupt node on the worker thread sets this if it knows there is
+   * more work to do, but it has to finish to avoid hogging the
+   * core for too long.
+   */
+  int interrupt_is_needed;
+  /*
+   * Set to indicate that the interrupt node wants to get less interrupts
+   * because there is not enough work for the current rate.
+   */
+  int interrupt_is_unwanted;
+} acl_fa_per_worker_data_t;
+
 
 typedef enum {
   ACL_FA_ERROR_DROP,
-- 
cgit 1.2.3-korg