authorNathan Skrzypczak <nathan.skrzypczak@gmail.com>2021-10-08 14:05:35 +0200
committerDave Wallace <dwallacelf@gmail.com>2021-10-13 23:22:20 +0000
commitf47122e07e1ecd0151902a3cabe46c60a99bee8e (patch)
tree0c28c0eca2cb17050d6f31fd8f0ca8f78299bf0d /src/plugins/acl
parent1e4281223ab4d655b54496ae13fbdb68f867e351 (diff)
docs: convert plugins doc md->rst
Type: improvement
Change-Id: I7e821cce1feae229e1be4baeed249b9cca658135
Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com>
Diffstat (limited to 'src/plugins/acl')
-rw-r--r--  src/plugins/acl/acl_hash_lookup_doc.md  | 240
-rw-r--r--  src/plugins/acl/acl_hash_lookup_doc.rst | 243
-rw-r--r--  src/plugins/acl/acl_lookup_context.md   | 125
-rw-r--r--  src/plugins/acl/acl_lookup_context.rst  | 138
-rw-r--r--  src/plugins/acl/acl_multicore_doc.md    | 349
-rw-r--r--  src/plugins/acl/acl_multicore_doc.rst   | 354
6 files changed, 735 insertions(+), 714 deletions(-)
diff --git a/src/plugins/acl/acl_hash_lookup_doc.md b/src/plugins/acl/acl_hash_lookup_doc.md
deleted file mode 100644
index 6b08e1bc953..00000000000
--- a/src/plugins/acl/acl_hash_lookup_doc.md
+++ /dev/null
@@ -1,240 +0,0 @@
-ACL plugin constant-time lookup design {#acl_hash_lookup}
-======================================
-
-The initial implementation of ACL plugin performs a trivial for() cycle,
-going through the assigned ACLs on a per-packet basis. This is not very
-efficient, even if for very short ACLs due to its simplicity it can beat
-more advanced methods.
-
-However, to cover the case of longer ACLs with acceptable performance,
-we need to have a better way of matching. This write-up proposes
-a mechanism to make a lookup from O(M) where M is number of entries
-to O(N) where N is number of different mask combinations.
-
-Preparation of ACL(s)
----------------------
-
-The ACL plugin will maintain a global list of "mask types", i.e. the specific
-configurations of "do not care" bits within the ACEs.
-Upon the creation of a new ACL, a pass will be made through all the
-ACEs, to assign and possibly allocate the "mask type number".
-
-Each ACL has a structure *hash_acl_info_t* representing the "hash-based"
-parts of information related to that ACL, primarily the array of
-*hash_ace_info_t* structures - each of the members of that array
-corresponding to one of the rules (ACEs) in the original ACL,
-for this they have a pair of *(acl_index, ace_index)* to keep track,
-predominantly for debugging.
-
-Why do we need a whole separate structure, and are not adding new fields
-to the existing rule structure? First, encapsulation, to minimize
-the pollution of the main ACL code with the hash-based lookup artifacts.
-Second, one rule may correspond to more than one "hash-based" ACE.
-In fact, most of the rules do correspond to two of those. Why ?
-
-Consider that the current ACL lookup logic is that if a packet
-is not the initial fragment, and there is an L4 entry acting on the packet,
-the comparison will be made only on the L4 protocol field value rather
-than on the protocol and port values. This behavior is governed by
-*l4_match_nonfirst_fragment* flag in the *acl_main*, and is needed to
-maintain the compatibility with the existing software switch implementation.
-
-While for the sequential check in *single_acl_match_5tuple()*
-it is very easy to implement by just breaking out at the right moment,
-in case of hash-based matching this cost us two checks:
-one on full 5-tuple and the flag *pkt.is_nonfirst_fragment* being zero,
-the second on 3-tuple and the flag *pkt.is_nonfirst_fragment* being one,
-with the second check triggered by the *acl_main.l4_match_nonfirst_fragment*
-setting being the default 1. This dictates the necessity of having a "match"
-field in a given *hash_ace_info_t* element, which would reflect the value
-we are supposed to match after applying the mask.
-
-There can be other circumstances when it might be beneficial to expand
-the given rule in the original ACL into multiple - for example, as an
-optimization within the port range handling for small port ranges
-(this is not done as of the time of writing).
-
-Assigning ACLs to an interface
-------------------------------
-
-Once the ACL list is assigned to an interface, or, rather, a new ACL
-is added to the list of the existing ACLs applied to the interface,
-we need to update the bihash accelerating the lookup.
-
-All the entries for the lookups are stored within a single *48_8* bihash,
-which captures the 5-tuple from the packet as well as the miscellaneous
-per-packet information flags, e.g. *l4_valid*, *is_non_first_fragment*,
-and so on. To facilitate the use of the single bihash by all the interfaces,
-the *is_ip6*, *is_input*, *sw_if_index* are part of the key,
-as well as *mask_type_index* - the latter being necessary because
-there can be entries with the same value but different masks, e.g.:
-`permit ::/0, permit::/128`.
-
-At the moment of an ACL being applied to an interface, we need to
-walk the list of *hash_ace_info_t* entries corresponding to that ACL,
-and update the bihash with the keys corresponding to the match
-values in these entries.
-
-The value of the hash match contains the index into a per-*sw_if_index* vector
-of *applied_ace_hash_entry_t* elements, as well as a couple of flags:
-*shadowed* (optimization: if this flag on a matched entry is zero, means
-we can stop the lookup early and declare a match - see below),
-and *need_portrange_check* - meaning that what matched was a superset
-of the actual match, and we need to perform an extra check.
-
-Also, upon insertion, we must keep in mind there can be
-multiple *applied_ace_hash_entry_t* for the same key and must keep
-a list of those. This is necessary to incrementally apply/unapply
-the ACLs as part of the ACL vector: say, two ACLs have
-"permit 2001:db8::1/128 any" - we should be able to retain the entry
-for the second ACL even if we have deleted the first one.
-Also, in case there are two entries with the same key but
-different port ranges, say 0..42 and 142..65535 - we need
-to be able to sequentially match on those if we decide not
-to expand them into individual port-specific entries.
-
-Per-packet lookup
------------------
-
-The simple single-packet lookup is defined in
-*multi_acl_match_get_applied_ace_index*, which returns the index
-of the applied hash ACE if there was a match, or ~0 if there wasn't.
-
-The future optimized per-packet lookup may be batched in three phases:
-
-1. Prepare the keys in the per-worker vector by doing logical AND of
- original 5-tuple record with the elements of the mask vector.
-2. Lookup the keys in the bihash in a batch manner, collecting the
- result with lowest u64 (acl index within vector, ACE index) from
- the hash lookup value, and performing the list walk if necessary
- (for portranges).
-3. Take the action from the ACL record as defined by (ACL#, ACE#) from the
- resulting lookup winner, or, if no match found, then perform default deny.
-
-Shadowed/independent/redundant ACEs
-------------------------------------
-
-During the phase of combining multiple ACLs into one rulebase, when they
-are applied to interface, we also can perform several optimizations.
-
-If a given ACE is a strict subset of another ACE located up in the linear
-search order, we can ignore this ACE completely - because by definition
-it will never match. We will call such an ACE *redundant*. Here is an example:
-
-```
-permit 2001:db8:1::/48 2001:db8:2::/48 (B)
-deny 2001:d8b:1:1::/64 2001:db8:2:1::/64 (A)
-```
-
-A bit more formally, we can define this relationship of an ACE A to ACE B as:
-
-```
-redundant(aceA, aceB) := (contains(protoB, protoA) && contains(srcB, srcA)
- && contains(dstB, dstA) && is_after(A, B))
-```
-
-Here as "contains" we define an operation operating on the sets defined by
-the protocol, (srcIP, srcPortDefinition) and (dstIP, dstPortDefinition)
-respectively, and returning true if all the elements represented by
-the second argument are represented by the first argument. The "is_after"
-is true if A is located below B in the ruleset.
-
-If a given ACE does not intersect at all with any other ACE
-in front of it, we can mark it as such.
-
-Then during the sequence of the lookups the successful hit on this ACE means
-we do not need to look up other mask combinations - thus potentially
-significantly speeding up the match process. Here is an example,
-assuming we have the following ACL:
-
-```
-permit 2001:db8:1::/48 2001:db8:2::/48 (B)
-deny 2001:db8:3::/48 2001:db8:2:1::/64 (A)
-```
-
-In this case if we match the second entry, we do not need to check whether
-we have matched the first one - the source addresses are completely
-different. We call such an ACE *independent* from another.
-
-We can define this as
-
-```
-independent(aceA, aceB) := (!intersect(protoA, protoB) ||
- !intersect(srcA, srcB) ||
- !intersect(dstA, dstB))
-```
-
-where intersect is defined as operation returning true if there are
-elements belonging to the sets of both arguments.
-
-If the entry A is neither redundant nor independent from B, and is below
-B in the ruleset, we call such an entry *shadowed* by B, here is an example:
-
-```
-deny tcp 2001:db8:1::/48 2001:db8:2::/48 (B)
-permit 2001:d8b:1:1::/64 2001:db8:2:1::/64 (A)
-```
-
-This means the earlier rule "carves out" a subset of A, thus leaving
-a "shadow". (Evidently, the action needs to be different for the shadow
-to have an effect, but for for the terminology sake we do not care).
-
-The more formal definition:
-
-```
-shadowed(aceA, aceB) := !redundant(aceA, aceB) &&
- !independent(aceA, aceB) &&
- is_after(aceA, aceB)
-```
-
-Using this terminology, any ruleset can be represented as
-a DAG (Directed Acyclic Graph), with the bottom being the implicit
-"deny any", pointing to the set of rules shadowing it or the ones
-it is redundant for.
-
-These rules may in turn be shadowing each other. There is no cycles in
-this graph because of the natural order of the rules - the rule located
-closer to the end of the ruleset can never shadow or make redundant a rule
-higher up.
-
-The optimization that enables can allow for is to skip matching certain
-masks on a per-lookup basis - if a given rule has matched,
-the only adjustments that can happen is the match with one of
-the shadowing rules.
-
-Also, another avenue for the optimization can be starting the lookup process
-with the mask type that maximizes the chances of the independent ACE match,
-thus resulting in an ACE lookup being a single hash table hit.
-
-
-Plumbing
---------
-
-All the new routines are located in a separate file,
-so we can cleanly experiment with a different approach if this
-does not fit all of the use cases.
-
-The constant-time lookup within the data path has the API with
-the same signature as:
-
-```
-u8
-multi_acl_match_5tuple (u32 sw_if_index, fa_5tuple_t * pkt_5tuple, int is_l2,
- int is_ip6, int is_input, u32 * acl_match_p,
- u32 * rule_match_p, u32 * trace_bitmap)
-```
-
-There should be a new upper-level function with the same signature, which
-will make a decision whether to use a linear lookup, or to use the
-constant-time lookup implemented by this work, or to add some other
-optimizations (e.g. by keeping the cache of the last N lookups).
-
-The calls to the routine doing preparatory work should happen
-in `acl_add_list()` after creating the linear-lookup structures,
-and the routine doing the preparatory work populating the hashtable
-should be called from `acl_interface_add_del_inout_acl()` or its callees.
-
-The initial implementation will be geared towards looking up a single
-match at a time, with the subsequent optimizations possible to make
-the lookup for more than one packet.
-
diff --git a/src/plugins/acl/acl_hash_lookup_doc.rst b/src/plugins/acl/acl_hash_lookup_doc.rst
new file mode 100644
index 00000000000..72842af423d
--- /dev/null
+++ b/src/plugins/acl/acl_hash_lookup_doc.rst
@@ -0,0 +1,243 @@
+ACL plugin constant-time lookup
+===============================
+
+The initial implementation of the ACL plugin performs a trivial for()
+cycle, going through the assigned ACLs on a per-packet basis. This is
+not very efficient, even though for very short ACLs its simplicity can
+beat more advanced methods.
+
+However, to cover the case of longer ACLs with acceptable performance,
+we need a better way of matching. This write-up proposes a mechanism to
+reduce the lookup cost from O(M), where M is the number of entries, to
+O(N), where N is the number of different mask combinations.
+
+Preparation of ACL(s)
+---------------------
+
+The ACL plugin will maintain a global list of “mask types”, i.e. the
+specific configurations of “do not care” bits within the ACEs. Upon the
+creation of a new ACL, a pass will be made through all the ACEs, to
+assign and possibly allocate the “mask type number”.
+
+Each ACL has a structure *hash_acl_info_t* representing the “hash-based”
+parts of information related to that ACL, primarily the array of
+*hash_ace_info_t* structures. Each member of that array corresponds to
+one of the rules (ACEs) in the original ACL and carries a pair of
+*(acl_index, ace_index)* to keep track of its origin, predominantly for
+debugging.
+
+Why do we need a whole separate structure instead of adding new fields
+to the existing rule structure? First, encapsulation, to minimize the
+pollution of the main ACL code with the hash-based lookup artifacts.
+Second, one rule may correspond to more than one “hash-based” ACE. In
+fact, most of the rules correspond to two of those. Why?
+
+Consider that the current ACL lookup logic is that if a packet is not
+the initial fragment, and there is an L4 entry acting on the packet, the
+comparison will be made only on the L4 protocol field value rather than
+on the protocol and port values. This behavior is governed by the
+*l4_match_nonfirst_fragment* flag in *acl_main*, and is needed to
+maintain compatibility with the existing software switch implementation.
+
+While for the sequential check in *single_acl_match_5tuple()* this is
+very easy to implement by just breaking out at the right moment, in the
+case of hash-based matching it costs us two checks: one on the full
+5-tuple with the flag *pkt.is_nonfirst_fragment* being zero, and a
+second on the 3-tuple with the flag *pkt.is_nonfirst_fragment* being
+one, the second check being triggered by the
+*acl_main.l4_match_nonfirst_fragment* setting having its default value
+of 1. This dictates the necessity of having a “match” field in a given
+*hash_ace_info_t* element, which reflects the value we are supposed to
+match after applying the mask.
+
+There can be other circumstances when it might be beneficial to expand
+a given rule in the original ACL into multiple entries - for example, as
+an optimization within the port range handling for small port ranges
+(this is not done as of the time of writing).
+
+Assigning ACLs to an interface
+------------------------------
+
+Once the ACL list is assigned to an interface, or, rather, a new ACL is
+added to the list of the existing ACLs applied to the interface, we need
+to update the bihash accelerating the lookup.
+
+All the entries for the lookups are stored within a single *48_8*
+bihash, which captures the 5-tuple from the packet as well as the
+miscellaneous per-packet information flags, e.g. *l4_valid*,
+*is_non_first_fragment*, and so on. To facilitate the use of the single
+bihash by all the interfaces, the *is_ip6*, *is_input* and *sw_if_index*
+fields are part of the key, as well as *mask_type_index* - the latter
+being necessary because there can be entries with the same value but
+different masks, e.g.: ``permit ::/0, permit ::/128``.
+
+At the moment of an ACL being applied to an interface, we need to walk
+the list of *hash_ace_info_t* entries corresponding to that ACL, and
+update the bihash with the keys corresponding to the match values in
+these entries.
+
+The value of the hash match contains the index into a per-*sw_if_index*
+vector of *applied_ace_hash_entry_t* elements, as well as a couple of
+flags: *shadowed* (an optimization: if this flag on a matched entry is
+zero, it means we can stop the lookup early and declare a match - see
+below), and *need_portrange_check* - meaning that what matched was a
+superset of the actual match, and we need to perform an extra check.
+
+Also, upon insertion, we must keep in mind there can be multiple
+*applied_ace_hash_entry_t* for the same key and must keep a list of
+those. This is necessary to incrementally apply/unapply the ACLs as part
+of the ACL vector: say, two ACLs have “permit 2001:db8::1/128 any” - we
+should be able to retain the entry for the second ACL even if we have
+deleted the first one. Also, in case there are two entries with the same
+key but different port ranges, say 0..42 and 142..65535 - we need to be
+able to sequentially match on those if we decide not to expand them into
+individual port-specific entries.
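+
+Conceptually, the 8-byte bihash value and the chain of applied entries
+can be pictured as follows (the field names and widths here are
+illustrative only, not the plugin’s exact definitions):
+
+.. code:: c
+
+   #include <stdint.h>
+
+   /* Conceptual sketch; the plugin's applied_ace_hash_entry_t and its
+    * bihash value differ in detail. */
+   typedef struct
+   {
+     uint32_t acl_index;  /* which applied ACL this entry came from */
+     uint32_t ace_index;  /* which rule (ACE) within that ACL */
+     uint32_t next_index; /* next applied entry with the same key, ~0 = end */
+   } applied_entry_t;
+
+   /* The hash value: an index into the per-sw_if_index vector of applied
+    * entries, plus the two flags described above. */
+   typedef struct
+   {
+     uint64_t applied_entry_index : 32; /* head of the per-key chain */
+     uint64_t shadowed : 1;             /* 0 => safe to stop the lookup early */
+     uint64_t need_portrange_check : 1; /* matched a superset; recheck ports */
+     uint64_t unused : 30;
+   } hash_value_t;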
+
+Per-packet lookup
+-----------------
+
+The simple single-packet lookup is defined in
+*multi_acl_match_get_applied_ace_index*, which returns the index of the
+applied hash ACE if there was a match, or ~0 if there wasn’t.
+
+The future optimized per-packet lookup may be batched in three phases
+(see the sketch after this list):
+
+1. Prepare the keys in the per-worker vector by doing logical AND of
+ original 5-tuple record with the elements of the mask vector.
+2. Lookup the keys in the bihash in a batch manner, collecting the
+ result with lowest u64 (acl index within vector, ACE index) from the
+ hash lookup value, and performing the list walk if necessary (for
+ portranges).
+3. Take the action from the ACL record as defined by (ACL#, ACE#) from
+ the resulting lookup winner, or, if no match found, then perform
+ default deny.
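+
+As a rough, self-contained illustration of phases 1 and 2 (simplified
+stand-in structures and a toy lookup table instead of the real *48_8*
+bihash; this is not the plugin’s actual code):
+
+.. code:: c
+
+   #include <stdint.h>
+   #include <string.h>
+
+   /* Simplified key: the real one is the 48-byte bihash key carrying the
+    * 5-tuple, the per-packet flags, sw_if_index, is_ip6/is_input and the
+    * mask_type_index. */
+   typedef struct
+   {
+     uint64_t tuple[5];        /* masked 5-tuple plus packet flags */
+     uint32_t sw_if_index;     /* plus is_ip6/is_input in the real key */
+     uint32_t mask_type_index; /* same value, different mask => different key */
+   } hash_key_t;
+
+   typedef struct { hash_key_t key; uint64_t value; } toy_entry_t;
+
+   /* Toy stand-in for the bihash; the value packs (ACL position, ACE
+    * index) so that a numerically lower value means an earlier match. */
+   static toy_entry_t toy_table[64];
+   static int toy_table_len;
+
+   static uint64_t
+   toy_search (const hash_key_t *key)
+   {
+     for (int i = 0; i < toy_table_len; i++)
+       if (!memcmp (&toy_table[i].key, key, sizeof (*key)))
+         return toy_table[i].value;
+     return ~0ULL; /* miss */
+   }
+
+   /* Phases 1 and 2: AND the packet 5-tuple with each applied mask, look
+    * the masked keys up, and keep the lowest value - the first-match
+    * winner across the applied ACL vector. A result of ~0 means "no
+    * match", and phase 3 then applies the default deny. */
+   static uint64_t
+   lookup_applied_ace (const hash_key_t *pkt, const hash_key_t *masks,
+                       const uint32_t *mask_type_indices, int n_masks)
+   {
+     uint64_t best = ~0ULL;
+     for (int m = 0; m < n_masks; m++)
+       {
+         hash_key_t key = *pkt;
+         for (int i = 0; i < 5; i++)
+           key.tuple[i] &= masks[m].tuple[i];
+         key.mask_type_index = mask_type_indices[m];
+         uint64_t v = toy_search (&key);
+         if (v < best)
+           best = v;
+       }
+     return best;
+   }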
+
+Shadowed/independent/redundant ACEs
+-----------------------------------
+
+During the phase of combining multiple ACLs into one rulebase, when they
+are applied to an interface, we can also perform several optimizations.
+
+If a given ACE is a strict subset of another ACE located earlier in the
+linear search order, we can ignore this ACE completely - because by
+definition it will never match. We will call such an ACE *redundant*.
+Here is an example:
+
+::
+
+ permit 2001:db8:1::/48 2001:db8:2::/48 (B)
+   deny 2001:db8:1:1::/64 2001:db8:2:1::/64 (A)
+
+A bit more formally, we can define this relationship of an ACE A to ACE
+B as:
+
+::
+
+ redundant(aceA, aceB) := (contains(protoB, protoA) && contains(srcB, srcA)
+ && contains(dstB, dstA) && is_after(A, B))
+
+Here, “contains” denotes an operation on the sets defined by the
+protocol, (srcIP, srcPortDefinition) and (dstIP, dstPortDefinition)
+respectively, returning true if all the elements represented by the
+second argument are also represented by the first argument. “is_after”
+is true if A is located below B in the ruleset.
+
+If a given ACE does not intersect at all with any other ACE in front of
+it, we can mark it as such.
+
+Then, during the sequence of lookups, a successful hit on this ACE means
+we do not need to look up other mask combinations - thus potentially
+speeding up the match process significantly. Here is an example,
+assuming we have the following ACL:
+
+::
+
+ permit 2001:db8:1::/48 2001:db8:2::/48 (B)
+ deny 2001:db8:3::/48 2001:db8:2:1::/64 (A)
+
+In this case if we match the second entry, we do not need to check
+whether we have matched the first one - the source addresses are
+completely different. We call such an ACE *independent* from another.
+
+We can define this as
+
+::
+
+ independent(aceA, aceB) := (!intersect(protoA, protoB) ||
+ !intersect(srcA, srcB) ||
+ !intersect(dstA, dstB))
+
+where intersect is defined as an operation returning true if there are
+elements belonging to the sets of both arguments.
+
+If the entry A is neither redundant nor independent from B, and is below
+B in the ruleset, we call such an entry *shadowed* by B. Here is an
+example:
+
+::
+
+ deny tcp 2001:db8:1::/48 2001:db8:2::/48 (B)
+   permit 2001:db8:1:1::/64 2001:db8:2:1::/64 (A)
+
+This means the earlier rule “carves out” a subset of A, thus leaving a
+“shadow”. (Evidently, the action needs to be different for the shadow to
+have an effect, but for the sake of terminology we do not care.)
+
+The more formal definition:
+
+::
+
+ shadowed(aceA, aceB) := !redundant(aceA, aceB) &&
+ !independent(aceA, aceB) &&
+ is_after(aceA, aceB)
+
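+For illustration, the three relations can be expressed over simplified
+per-dimension ranges as follows (a conceptual sketch, not the plugin’s
+implementation; real ACEs use prefixes and port ranges, but the set
+algebra is the same):
+
+.. code:: c
+
+   #include <stdbool.h>
+   #include <stdint.h>
+
+   /* Each match dimension (protocol, source, destination) is modeled as
+    * a closed integer range for simplicity. */
+   typedef struct { uint64_t lo, hi; } range_t;
+
+   typedef struct
+   {
+     range_t proto, src, dst;
+     uint32_t position; /* index within the ruleset, 0 = first */
+   } ace_t;
+
+   static bool
+   contains (range_t outer, range_t inner) /* inner is a subset of outer */
+   {
+     return outer.lo <= inner.lo && inner.hi <= outer.hi;
+   }
+
+   static bool
+   intersect (range_t a, range_t b) /* the two sets share elements */
+   {
+     return a.lo <= b.hi && b.lo <= a.hi;
+   }
+
+   static bool
+   is_after (const ace_t *a, const ace_t *b) /* A is below B in the ruleset */
+   {
+     return a->position > b->position;
+   }
+
+   static bool
+   redundant (const ace_t *a, const ace_t *b)
+   {
+     return contains (b->proto, a->proto) && contains (b->src, a->src)
+            && contains (b->dst, a->dst) && is_after (a, b);
+   }
+
+   static bool
+   independent (const ace_t *a, const ace_t *b)
+   {
+     return !intersect (a->proto, b->proto) || !intersect (a->src, b->src)
+            || !intersect (a->dst, b->dst);
+   }
+
+   static bool
+   shadowed (const ace_t *a, const ace_t *b)
+   {
+     return !redundant (a, b) && !independent (a, b) && is_after (a, b);
+   }
+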
+Using this terminology, any ruleset can be represented as a DAG
+(Directed Acyclic Graph), with the bottom being the implicit “deny any”,
+pointing to the set of rules shadowing it or the ones it is redundant
+for.
+
+These rules may in turn be shadowing each other. There are no cycles in
+this graph because of the natural order of the rules - a rule located
+closer to the end of the ruleset can never shadow or make redundant a
+rule higher up.
+
+The optimization this enables is to skip matching certain masks on a
+per-lookup basis - if a given rule has matched, the only adjustment that
+can happen is a match with one of the shadowing rules.
+
+Another avenue for optimization is to start the lookup process with the
+mask type that maximizes the chances of an independent ACE match, thus
+making the ACE lookup a single hash table hit.
+
+Plumbing
+--------
+
+All the new routines are located in a separate file, so we can cleanly
+experiment with a different approach if this does not fit all of the use
+cases.
+
+The constant-time lookup within the data path has an API with the same
+signature as the existing one:
+
+::
+
+ u8
+ multi_acl_match_5tuple (u32 sw_if_index, fa_5tuple_t * pkt_5tuple, int is_l2,
+ int is_ip6, int is_input, u32 * acl_match_p,
+ u32 * rule_match_p, u32 * trace_bitmap)
+
+There should be a new upper-level function with the same signature,
+which will decide whether to use the linear lookup, the constant-time
+lookup implemented by this work, or some other optimization (e.g.
+keeping a cache of the last N lookups).
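+
+A sketch of such a dispatcher follows; ``linear_acl_match_5tuple`` and
+the enable flag are hypothetical placeholder names, and only
+``multi_acl_match_5tuple`` comes from the signature quoted above (its
+u8/u32 types are spelled as stdint types here):
+
+.. code:: c
+
+   #include <stdint.h>
+
+   typedef struct fa_5tuple_t_ fa_5tuple_t; /* opaque in this sketch */
+
+   /* The constant-time path quoted above, and a placeholder name for the
+    * existing linear path. */
+   uint8_t multi_acl_match_5tuple (uint32_t sw_if_index, fa_5tuple_t *pkt,
+                                   int is_l2, int is_ip6, int is_input,
+                                   uint32_t *acl_match_p,
+                                   uint32_t *rule_match_p,
+                                   uint32_t *trace_bitmap);
+   uint8_t linear_acl_match_5tuple (uint32_t sw_if_index, fa_5tuple_t *pkt,
+                                    int is_l2, int is_ip6, int is_input,
+                                    uint32_t *acl_match_p,
+                                    uint32_t *rule_match_p,
+                                    uint32_t *trace_bitmap);
+
+   int use_hash_acl_matching = 1; /* hypothetical control-plane switch */
+
+   uint8_t
+   acl_match_5tuple_dispatch (uint32_t sw_if_index, fa_5tuple_t *pkt,
+                              int is_l2, int is_ip6, int is_input,
+                              uint32_t *acl_match_p, uint32_t *rule_match_p,
+                              uint32_t *trace_bitmap)
+   {
+     if (use_hash_acl_matching)
+       return multi_acl_match_5tuple (sw_if_index, pkt, is_l2, is_ip6,
+                                      is_input, acl_match_p, rule_match_p,
+                                      trace_bitmap);
+     return linear_acl_match_5tuple (sw_if_index, pkt, is_l2, is_ip6,
+                                     is_input, acl_match_p, rule_match_p,
+                                     trace_bitmap);
+   }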
+
+The routine doing the preparatory work should be called in
+``acl_add_list()`` after creating the linear-lookup structures, and the
+routine populating the hashtable should be called from
+``acl_interface_add_del_inout_acl()`` or its callees.
+
+The initial implementation will be geared towards looking up a single
+match at a time, with subsequent optimizations possibly extending the
+lookup to more than one packet.
diff --git a/src/plugins/acl/acl_lookup_context.md b/src/plugins/acl/acl_lookup_context.md
deleted file mode 100644
index e95f82043f9..00000000000
--- a/src/plugins/acl/acl_lookup_context.md
+++ /dev/null
@@ -1,125 +0,0 @@
-Lookup contexts aka "ACL as a service" {#acl_lookup_context}
-======================================
-
-The initial implementation of the ACL plugin had tightly tied the policy (L3-L4) ACLs
-to ingress/egress processing on an interface.
-
-However, some uses outside of pure traffic control have appeared, for example,
-ACL-based forwarding, etc. Also, improved algorithms of the ACL lookup
-could benefit of the more abstract representation, not coupled to the interfaces.
-
-This describes a way to accommodate these use cases by generalizing the ACL
-lookups into "ACL lookup contexts", not tied to specific interfaces, usable
-by other portions of the code by utilizing the exports.h header file,
-which provides the necessary interface.
-
-
-Why "lookup contexts" and not "match me an ACL#" ?
-================================================
-
-The first reason is the logical grouping of multiple ACLs.
-
-The interface matching code currently allows for matching multiple ACLs
-in a 'first-match' fashion. Some other use cases also fall into a similar
-pattern: they attempt to match a sequence of ACLs, and the first matched ACL
-determines what the outcome is, e.g. where to forward traffic. Thus,
-a match never happens on an ACL in isolation, but always on a group of
-ACLs.
-
-The second reason is potential optimizations in matching.
-
-A naive match on series of ACLs each represented as a vector of ACEs
-does not care about the API level - it could be "match one ACL", or
-"match the set of ACLs" - there will be just a simple loop iterating over
-the ACLs to match, returning the first match. Be it in the ACL code or
-in the user code.
-
-However, for more involved lookup methods, providing a more high-level
-interface of matching over the entire group of ACLs allows for future
-improvements in the algorithms, delivered at once to all the users
-of the API.
-
-What is a "lookup context" ?
-============================
-
-An ACL lookup context is an entity that groups the set of ACL#s
-together for the purposes of a first-match lookup, and may store
-additional internal information needed to optimize the lookups
-for that particular vector of ACLs.
-
-Using ACL contexts in your code
-===============================
-
-In order to use the ACL lookup contexts, you need to include
-plugins/acl/exports.h into your code. This header includes
-all the necessary dependencies required.
-
-As you probably will invoke this code from another plugin,
-the non-inline function calls are implemented via function pointers,
-which you need to initialize by calling acl_plugin_exports_init(&acl_plugin), which,
-if everything succeeds, returns 0 and fills in the acl_plugin structure
-with pointers to the exported methods - else it will return clib_error_t with
-more information about what went wrong.
-
-When you have initialized the symbols, you also need to register yourself
-as a user of the ACL lookups - this allows to track the ACL lookup context
-ownership, as well as make the debug show outputs more user friendly.
-
-To do that, call acl_plugin.register_user_module(caller_module_string, val1_label, val2_label) -
-and record the returned value. This will bethe first parameter that you pass to create a new
-lookup context. The passed strings must be static, and are used as descriptions for the ACL
-contexts themselves, as well as labels for up to two user-supplied u32 labels, used to
-differentiate the lookup contexts for the debugging purposes.
-
-Creating a new context is done by calling acl_plugin.get_lookup_context_index(user_id, val1, val2).
-The first argument is your "user" ID obtained in a registration call earlier, the other two
-arguments are u32s with semantics that you designate. They are used purely for debugging purposes
-in the "show acl lookup context" command.
-
-To set the vector of ACL numbers to be looked up within the context, use the function
-acl_plugin.set_acl_vec_for_context(lc_index, acl_list). The first parameter specifies the context
-that you have created, the second parameter is a vector of u32s, each u32 being the index of the ACL
-which we should be looking up within this context. The command is idempotent, i.e.
-it unapplies the previously applied list of ACLs, and then sets the new list of ACLs.
-
-Subsequent ACL updates for the already applied ACLs will cause the re-application
-on an as-needed basis. Note, that the ACL application is potentially a relatively costly operation,
-so it is only expected that these changes will be done in the control plane, NOT in the datapath.
-
-The matching within the context is done using two functions - acl_plugin.fill_5tuple() and
-acl_plugin.match_5tuple() and their corresponding inline versions, named acl_plugin_fill_5tuple_inline()
-and acl_plugin_match_5tuple_inline(). The inline and non-inline versions have the equivalent functionality,
-in that the non-inline version calls the inline version. These two variants are provided
-for debugging/maintenance reasons.
-
-When you no longer need a particular context, you can return the allocated resources by calling
-acl_plugin.put_lookup_context_index() to mark it as free. The lookup structured associated with
-the vector of ACLs set for the lookup are cleaned up automatically. However, the ACLs themselves
-are not deleted and are available for subsequent reuse by other lookup contexts if needed.
-
-There is one delicate detail that you might want to be aware of.
-When the non-inline functions reference the inline functions,
-they are compiled as part of ACL plugin; whereas when you refer to the inline
-functions from your code, they are compiled as part of your code.
-This makes referring to a single acl_main structure a little trickier.
-
-It is done by having a static p_acl_main within the .h file,
-which points to acl_main of the ACL plugin, and is initialized by a static constructor
-function.
-
-This way the multiple includes and inlines will "just work" as one would expect.
-
-
-Debug CLIs
-==========
-
-To see the state of the ACL lookup contexts, you can issue "show acl-plugin lookup user" to see
-all of the users which registered for the usage of the ACL plugin lookup contexts,
-and "show acl-plugin lookup context" to show the actual contexts created. You will notice
-that the latter command uses the values supplied during the module registration in order to
-make the output more friendly.
-
-The "show acl-plugin acl" and "show acl-plugin interface" commands have also acquired the
-notion of lookup context, but there it is used from the client perspective, since
-with this change the interface ACL lookup itself is a user of ACL lookup contexts.
-
diff --git a/src/plugins/acl/acl_lookup_context.rst b/src/plugins/acl/acl_lookup_context.rst
new file mode 100644
index 00000000000..278e87381f3
--- /dev/null
+++ b/src/plugins/acl/acl_lookup_context.rst
@@ -0,0 +1,138 @@
+ACL Lookup contexts
+===================
+
+The initial implementation of the ACL plugin had tightly tied the policy
+(L3-L4) ACLs to ingress/egress processing on an interface.
+
+However, some uses outside of pure traffic control have appeared, for
+example, ACL-based forwarding. Also, improved algorithms of the ACL
+lookup could benefit from a more abstract representation, not coupled to
+the interfaces.
+
+This describes a way to accommodate these use cases by generalizing the
+ACL lookups into “ACL lookup contexts”, not tied to specific interfaces,
+usable by other portions of the code by utilizing the exports.h header
+file, which provides the necessary interface.
+
+Why “lookup contexts” and not “match me an ACL” ?
+-------------------------------------------------
+
+The first reason is the logical grouping of multiple ACLs.
+
+The interface matching code currently allows for matching multiple ACLs
+in a ‘first-match’ fashion. Some other use cases also fall into a
+similar pattern: they attempt to match a sequence of ACLs, and the first
+matched ACL determines what the outcome is, e.g. where to forward
+traffic. Thus, a match never happens on an ACL in isolation, but always
+on a group of ACLs.
+
+The second reason is potential optimizations in matching.
+
+A naive match on a series of ACLs, each represented as a vector of ACEs,
+does not care about the API level - whether it is “match one ACL” or
+“match the set of ACLs”, there will be just a simple loop iterating over
+the ACLs to match, returning the first match, be it in the ACL code or
+in the user code.
+
+However, for more involved lookup methods, providing a higher-level
+interface of matching over the entire group of ACLs allows for future
+improvements in the algorithms, delivered at once to all the users of
+the API.
+
+What is a “lookup context” ?
+----------------------------
+
+An ACL lookup context is an entity that groups the set of ACL#s together
+for the purposes of a first-match lookup, and may store additional
+internal information needed to optimize the lookups for that particular
+vector of ACLs.
+
+Using ACL contexts in your code
+-------------------------------
+
+In order to use the ACL lookup contexts, you need to include
+plugins/acl/exports.h into your code. This header pulls in all the
+necessary dependencies.
+
+As you will probably invoke this code from another plugin, the
+non-inline function calls are implemented via function pointers, which
+you need to initialize by calling acl_plugin_exports_init(&acl_plugin).
+If everything succeeds, it returns 0 and fills in the acl_plugin
+structure with pointers to the exported methods; otherwise it returns a
+clib_error_t with more information about what went wrong.
+
+When you have initialized the symbols, you also need to register
+yourself as a user of the ACL lookups - this allows tracking the ACL
+lookup context ownership, as well as making the debug show outputs more
+user friendly.
+
+To do that, call acl_plugin.register_user_module(caller_module_string,
+val1_label, val2_label) and record the returned value. This will be the
+first parameter that you pass to create a new lookup context. The passed
+strings must be static, and are used as descriptions for the ACL
+contexts themselves, as well as labels for up to two user-supplied u32
+values, used to differentiate the lookup contexts for debugging
+purposes.
+
+Creating a new context is done by calling
+acl_plugin.get_lookup_context_index(user_id, val1, val2). The first
+argument is your “user” ID obtained in a registration call earlier, the
+other two arguments are u32s with semantics that you designate. They are
+used purely for debugging purposes in the “show acl lookup context”
+command.
+
+To set the vector of ACL numbers to be looked up within the context, use
+the function acl_plugin.set_acl_vec_for_context(lc_index, acl_list). The
+first parameter specifies the context that you have created, the second
+parameter is a vector of u32s, each u32 being the index of the ACL which
+we should be looking up within this context. The command is idempotent,
+i.e. it unapplies the previously applied list of ACLs, and then sets the
+new list of ACLs.
+
+Subsequent ACL updates for the already applied ACLs will cause
+re-application on an as-needed basis. Note that the ACL application is
+potentially a relatively costly operation, so these changes are expected
+to be done only in the control plane, NOT in the datapath.
+
+The matching within the context is done using two functions,
+acl_plugin.fill_5tuple() and acl_plugin.match_5tuple(), and their
+corresponding inline versions, named acl_plugin_fill_5tuple_inline() and
+acl_plugin_match_5tuple_inline(). The inline and non-inline versions
+have equivalent functionality, in that the non-inline version calls the
+inline version. The two variants are provided for debugging/maintenance
+reasons.
+
+When you no longer need a particular context, you can return the
+allocated resources by calling acl_plugin.put_lookup_context_index() to
+mark it as free. The lookup structures associated with the vector of
+ACLs set for the lookup are cleaned up automatically. However, the ACLs
+themselves are not deleted and are available for subsequent reuse by
+other lookup contexts if needed.
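+
+Putting these steps together, a minimal usage sketch from a hypothetical
+client plugin might look like the following (the authoritative types and
+signatures are in plugins/acl/exports.h - the methods-structure type is
+assumed to be ``acl_plugin_methods_t`` here - and error handling is
+mostly omitted):
+
+.. code:: c
+
+   #include <plugins/acl/exports.h>
+
+   static acl_plugin_methods_t acl_plugin; /* filled by the init call */
+   static u32 my_user_id;
+   static u32 my_lc_index;
+
+   static clib_error_t *
+   my_feature_enable (u32 val1, u32 val2, u32 *acl_indices /* vector */)
+   {
+     clib_error_t *rv = acl_plugin_exports_init (&acl_plugin);
+     if (rv)
+       return rv; /* ACL plugin not available */
+
+     /* Register once; the label strings must be static. */
+     my_user_id = acl_plugin.register_user_module ("my-plugin",
+                                                   "val1-label",
+                                                   "val2-label");
+
+     /* One context per attachment point; val1/val2 only show up in the
+      * "show acl-plugin lookup context" debug output. */
+     my_lc_index = acl_plugin.get_lookup_context_index (my_user_id,
+                                                        val1, val2);
+
+     /* Idempotently (re)apply the vector of ACL indices to the context;
+      * matching is then done with acl_plugin.fill_5tuple() and
+      * acl_plugin.match_5tuple() or their inline variants. */
+     acl_plugin.set_acl_vec_for_context (my_lc_index, acl_indices);
+     return 0;
+   }
+
+   static void
+   my_feature_disable (void)
+   {
+     /* Returns the context's resources; the ACLs themselves remain. */
+     acl_plugin.put_lookup_context_index (my_lc_index);
+   }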
+
+There is one delicate detail that you might want to be aware of. When
+the non-inline functions reference the inline functions, they are
+compiled as part of the ACL plugin; whereas when you refer to the inline
+functions from your code, they are compiled as part of your code. This
+makes referring to a single acl_main structure a little trickier.
+
+It is done by having a static p_acl_main within the .h file, which
+points to acl_main of the ACL plugin, and is initialized by a static
+constructor function.
+
+This way the multiple includes and inlines will “just work” as one would
+expect.
+
+Debug CLIs
+----------
+
+To see the state of the ACL lookup contexts, you can issue “show
+acl-plugin lookup user” to see all of the users which registered for the
+usage of the ACL plugin lookup contexts, and “show acl-plugin lookup
+context” to show the actual contexts created. You will notice that the
+latter command uses the values supplied during the module registration
+in order to make the output more friendly.
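+
+For example (the actual output depends on the registered users and the
+contexts created):
+
+::
+
+   vpp# show acl-plugin lookup user
+   vpp# show acl-plugin lookup context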
+
+The “show acl-plugin acl” and “show acl-plugin interface” commands have
+also acquired the notion of lookup context, but there it is used from
+the client perspective, since with this change the interface ACL lookup
+itself is a user of ACL lookup contexts.
diff --git a/src/plugins/acl/acl_multicore_doc.md b/src/plugins/acl/acl_multicore_doc.md
deleted file mode 100644
index deec5e9d566..00000000000
--- a/src/plugins/acl/acl_multicore_doc.md
+++ /dev/null
@@ -1,349 +0,0 @@
-Multicore support for ACL plugin {#acl_multicore}
-================================
-
-This captures some considerations and design decisions that I have made,
-both for my own memory later on ("what the hell was I thinking?!?"),
-and for anyone interested to criticize/improve/hack on this code.
-
-One of the factors taken into account while making these decisions,
-was the relative emphasis on the multi-thread vs. single-thread
-use cases: the latter is the vastly more prevalent. But,
-one can not optimize the single-thread performance without
-having a functioning code for multi-thread.
-
-stateless ACLs
-==============
-
-The stateless trivially parallelizes, and the only potential for the
-race between the different threads is during the reconfiguration,
-at the time of replacing the old ACL being checked, with
-the new ACL.
-
-In case an acl_add_replace is being used to replace the rules
-within the existing entry, a reallocation of `am->acls[X].rules`
-vector will happen and potentially a change in count.
-
-acl_match_5tuple() has the following code:
-
-```{.c}
- a = am->acls + acl_index;
- for (i = 0; i < a->count; i++)
- {
- r = a->rules + i;
- . . .
-```
-
-Ideally we should be immune from a->rules changing,
-but the problem arises if the count changes in flight,
-and the new ruleset is smaller - then we will attempt
-to "match" against the free memory.
-
-This can(?) be solved by replacing the for() with while(),
-so the comparison happens at each iteration.
-
-full_acl_match_5tuple(), which iterates over the list
-of ACLs, is a bit less immune, since it takes the pointer
-to the vector to iterate and keeps a local copy of
-that pointer.
-
-This race can be solved by checking the
-current pointer to the vector with the source pointer,
-and seeing if there is an (unlikely) change, and if
-there is, return the "deny" action, or, better,
-restart the check.
-
-Since the check reloads the ACL list on a per-packet basis,
-there is only a window of opportunity of one packet to
-"match" packet against an incorrect rule set.
-The workers also do not change anything, only read.
-Therefore, it looks like building special structures
-to ensure that it does not happen at all might be not
-worth it.
-
-At least not until we have a unit-test able to
-reliably catch this condition and test that
-the measures applied are effective. Adding the code
-which is not possible to exercise is worse than
-not adding any code at all.
-
-So, I opt for "do-nothing" here for the moment.
-
-reflexive ACLs: single-thread
-=============================
-
-Before we talk multi-thread, is worth revisiting the
-design of the reflexive ACLs in the plugin, and
-the history of their evolution.
-
-The very first version of the ACL plugin, shipped in
-1701, mostly did the job using the existing components
-and gluing them together. Because it needed to work
-in bridged forwarding path only, using L2 classifier
-as an insertion point appeared natural, also L2 classifier,
-being a table with sessions, seemed like a good place
-to hold the sessions.
-
-So, the original design had two conceptual nodes:
-one, pointed by the next_miss from the L2 classifier table,
-was checking the actual ACL, and inserting session into
-the L2 classifier table, and the other one, pointed
-to by the next_match within the specific session rule,
-was checking the existing session. The timing out
-of the existing connections was done in the datapath,
-by periodically calling the aging function.
-
-This decision to use the existing components,
-with its attractiveness, did bring a few limitations as well:
-
-* L2 classifier is a simple mask-and-value match, with
-a fixed mask across the table. So, sanely supporting IPv6
-packets with extension headers in that framework was impossible.
-
-* There is no way to get a backpressure from L2 classifier
-depending on memory usage. When it runs out of memory,
-it simply crashes the box. When it runs out of memory ?
-We don't really know. Depends on how it allocates it.
-
-* Since we need to match the *reflected* traffic,
-we had to create *two* full session entries
-in two different directions, which is quite wasteful memory-wise.
-
-* (showstopper): the L2 classifier runs only in
-the bridged data path, so supporting routed data path
-would require creating something else entirely different,
-which would mean much more headaches support-wise going forward.
-
-Because of that, I have moved to a different model of
-creating a session-5-tuple from the packet data - once,
-and then doing all the matching just on that 5-tuple.
-
-This has allowed to add support for skipping IPv6 extension headers.
-
-Also, this new version started to store the sessions in a dedicated
-bihash-per-interface, with the session key data being
-aligned for the ingress packets, and being mirrored for the
-egress packets. This allows of significant savings in memory,
-because now we need to keep only one copy of the session table per
-interface instead of two, and also to only have ONE node for all the lookups,
-(L2/L3 path, in/out, IPv4/IPv6) - significantly reducing the code complexity.
-
-Unfortunately, bihash still has the "lack of backpressure" problem,
-in a sense that if you try to insert too many entries and run out
-of memory in the heap you supplied, you get a crash.
-
-To somewhat workaround against that, there is a "maximum tested number of sessions"
-value, which tracks the currently inserted sessions in the bihash,
-and if this number is being approached, a more aggressive cleanup
-can happen. If this number is reached, two behaviors are possible:
-
-* attempt to do the stateless ACL matching and permit the packet
- if it succeeds
-
-* deny the packet
-
-Currently I have opted for a second one, since it allows for
-a better defined behavior, and if you have to permit
-the traffic in both directions, why using stateful anyway ?
-
-In order to be able to do the cleanup, we need to discriminate between
-the session types, with each session type having its own idle timeout.
-In order to do that, we keep three lists, defined in enum acl_timeout_e:
-ACL_TIMEOUT_UDP_IDLE, ACL_TIMEOUT_TCP_IDLE, ACL_TIMEOUT_TCP_TRANSIENT.
-
-The first one is hopefully obvious - it is just all UDP connections.
-They have an idle timeout of 600 seconds.
-
-The second and third is a bit more subtle. TCP is a complicated protocol,
-and we need to tread the fine line between doing too little and doing
-too much, and triggering the potential compatibility issues because of
-being a "middlebox".
-
-I decided to split the TCP connections into two classes:
-established, and everything else. "Established", means we have seen
-the SYN and ACK from both sides (with PUSH obviously masked out).
-This is the "active" state of any TCP connection and we would like
-to ensure we do not screw it up. So, the connections in this state
-have the default idle timer of 24 hours.
-
-All the rest of the connections have the idle timeout of 2 minutes,
-(inspired by an old value of MSL) and based on the observation
-that the states this class represent are usually very short lived.
-
-Once we have these three baskets of connections, it is trivial to
-imagine a simple cleanup mechanism to deal with this: take a
-TCP transient connection that has been hanging around.
-
-It is debatable whether we want to do discrimination between the
-different TCP transient connections. Assuming we do FIFO (and
-the lists allow us to do just that), it means a given connection
-on the head of the list has been hanging around for longest.
-Thus, if we are short on resources, we might just go ahead and
-reuse it within the datapath.
-
-This is where we are slowly approaching the question
-"Why in the world have not you used timer wheel or such ?"
-
-The answer is simple: within the above constraints, it does
-not buy me much.
-
-Also, timer wheel creates a leaky abstraction with a difficult
-to manage corner case. Which corner case ?
-
-We have a set of objects (sessions) with an event that may
-or may not happen (idle timeout timer firing), and a
-necessity to reset the idle timeout when there is
-activity on the session.
-
-In the worst case, where we had a 10000 of one-packet
-UDP sessions just created 10 minutes ago, we would need
-to deal with a spike of 10000 expired timers.
-
-Of course, if we have the active traffic on all
-of these 10000 connections, then we will not have
-to deal with that ? Right, but we will still have to deal
-with canceling and requeueing the timers.
-
-In the best possible case, requeueing a timer is
-going to be something along the lines of a linked-list
-removal and reinsertion.
-
-However, keep in mind we already need to classify the
-connections for reuse, so therefore we already have
-the linked lists!
-
-And if we just check these linked lists periodically in
-a FIFO fashion, we can get away with a very simple per-packet operation:
-writing back the timestamp of "now" into the connection structure.
-
-Then rather than requeueing the list on a per-packet or per-frame
-basis, we can defer this action until the time this session
-appears on the head of the FIFO list, and the cleaning
-routine makes the decision about whether to discard
-the session (because the interval since last activity is bigger
-than the idle timeout), or to requeue the session back to
-the end of the list (because the last activity was less
-than idle timeout ago).
-
-So, rather than using the timers, we can simply reuse our classification
-FIFOs, with the following heuristic: do not look at the session that was
-enqueued at time X until X+session_timeout. If we enqueue the sessions
-in the order of their initial activity, then we can simply use enqueue
-timestamp of the head session as a decision criterion for when we need
-to get back at looking at it for the timeout purposes.
-
-Since the number of FIFOs is small, we get a slightly worse check
-performance than with timers, but still O(1).
-
-We seemingly do quite a few "useless" operations of requeueing the items
-back to the tail of the list - but, these are the operations we do not
-have to do in the active data path, so overall it is a win.
-
-(Diversion: I believe this problem is congruent to poll vs. epoll or
-events vs. threads, some reading on this subject:
-http://web.archive.org/web/20120225022154/http://sheddingbikes.com/posts/1280829388.html)
-
-We can also can run a TCP-like scheme for adaptively changing
-the wait period in the routine that deals with the connection timeouts:
-we can attempt to check the connections a couple of times per second
-(same as we would advance the timer wheel), and then if we have requeued
-close to a max-per-quantum number of connections, we can half the waiting
-interval, and if we did not requeue any, we can slowly increment the waiting
-interval - which at a steady state should stabilize similar to what the TCP rate
-does.
-
-reflexive ACLs: multi-thread
-=============================
-
-The single-threaded implementation in 1704 used a separate "cleaner" process
-to deal with the timing out of the connections.
-It is all good and great when you know that there is only a single core
-to run everything on, but the existence of the lists proves to be
-a massive difficulty when it comes to operating from multiple threads.
-
-Initial study shows that with a few assumptions (e.g. that the cleaner running in main thread
-and the worker have a demarcation point in time where either one or the other one touches
-the session in the list) it might be possible to make it work, but the resulting
-trickiness of doing it neatly with all the corner cases is quite large.
-
-So, for the multi-threaded scenario, we need to move the connection
-aging back to the same CPU as its creation.
-
-Luckily we can do this with the help of the interrupts.
-
-So, the design is as follows: the aging thread (acl_fa_session_cleaner_process)
-periodically fires the interrupts to the workers interrupt nodes (acl_fa_worker_session_cleaner_process_node.index),
-using vlib_node_set_interrupt_pending(), and
-the interrupt node acl_fa_worker_conn_cleaner_process() calls acl_fa_check_idle_sessions()
-which does the actual job of advancing the lists. And within the actual datapath the only thing we will be
-doing is putting the items onto FIFO, and updating the last active time on the existing connection.
-
-The one "delicate" part is that the worker for one leg of the connection might be different from
-the worker of another leg of the connection - but, even if the "owner" tries to free the connection,
-nothing terrible can happen - worst case the element of the pool (which is nominally free for a short period)
-will get the timestamp updated - same thing about the TCP flags seen.
-
-A slightly trickier issue arises when the packet initially seen by one worker (thus owned by that worker),
-and the return packet processed by another worker, and as a result changes the
-the class of the connection (e.g. becomes TCP_ESTABLISHED from TCP_TRANSIENT or vice versa).
-If the class changes from one with the shorter idle time to the one with the longer idle time,
-then unless we are in the starvation mode where the transient connections are recycled,
-we can simply do nothing and let the normal requeue mechanism kick in. If the class changes from the longer idle
-timer to the shorter idle timer, then we risk keeping the connection around for longer than needed, which
-will affect the resource usage.
-
-One solution to that is to have NxN ring buffers (where N is the number of workers), such that the non-owner
-can signal to the owner the connection# that needs to be requeued out of order.
-
-A simpler solution though, is to ensure that each FIFO's period is equal to that of a shortest timer.
-This way the resource starvation problem is taken care of, at an expense of some additional work.
-
-This all looks sufficiently nice and simple until a skeleton falls out of the closet:
-sometimes we want to clean the connections en masse before they expire.
-
-There few potential scenarios:
-1) removal of an ACL from the interface
-2) removal of an interface
-3) manual action of an operator (in the future).
-
-In order to tackle this, we need to modify the logic which decides whether to requeue the
-connection on the end of the list, or to delete it due to idle timeout:
-
-We define a point in time, and have each worker thread fast-forward through its FIFO,
-in the process looking for sessions that satisfy the criteria, and either keeping them or requeueing them.
-
-To keep the ease of appearance to the outside world, we still process this as an event
-within the connection cleaner thread, but this event handler does as follows:
-1) it creates the bitmap of the sw_if_index values requested to be cleared
-2) for each worker, it waits to ensure there is no cleanup operation in progress (and if there is one,
-it waits), and then makes a copy of the bitmap, sets the per-worker flag of a cleanup operation, and sends an interrupt.
-3) wait until all cleanup operations have completed.
-
-Within the worker interrupt node, we check if the "cleanup in progress" is set,
-and if it is, we check the "fast forward time" value. If unset, we initialize it to value now, and compare the
-requested bitmap of sw_if_index values (pending_clear_sw_if_index_bitmap) with the bitmap of sw_if_index that this worker deals with.
-
-(we set the bit in the bitmap every time we enqueue the packet onto a FIFO - serviced_sw_if_index_bitmap in acl_fa_conn_list_add_session).
-
-If the result of this AND operation is zero - then we can clear the flag of cleanup in progress and return.
-Else we kick off the quantum of cleanup, and make sure we get another interrupt ASAP if that cleanup operation returns non-zero,
-meaning there is more work to do.
-When that operation returns zero, everything has been processed, we can clear the "cleanup-in-progress" flag, and
-zeroize the bitmap of sw_if_index-es requested to be cleaned.
-
-The interrupt node signals its wish to receive an interrupt ASAP by setting interrupt_is_needed
-flag within the per-worker structure. The main thread, while waiting for the
-cleanup operation to complete, checks if there is a request for interrupt,
-and if there is - it sends one.
-
-This approach gives us a way to mass-clean the connections which is reusing the code of the regular idle
-connection cleanup.
-
-One potential inefficiency is the bitmap values set by the session insertion
-in the data path - there is nothing to clear them.
-
-So, if one rearranges the interface placement with the workers, then the cleanups will cause some unnecessary work.
-For now, we consider it an acceptable limitation. It can be resolved by having another per-worker bitmap, which, when set,
-would trigger the cleanup of the bits in the serviced_sw_if_index_bitmap).
-
-=== the end ===
-
diff --git a/src/plugins/acl/acl_multicore_doc.rst b/src/plugins/acl/acl_multicore_doc.rst
new file mode 100644
index 00000000000..142b6b216d2
--- /dev/null
+++ b/src/plugins/acl/acl_multicore_doc.rst
@@ -0,0 +1,354 @@
+Multicore support for ACL plugin
+================================
+
+This captures some considerations and design decisions that I have made,
+both for my own memory later on (“what the hell was I thinking?!?”), and
+for anyone interested to criticize/improve/hack on this code.
+
+One of the factors taken into account while making these decisions was
+the relative emphasis on the multi-thread vs. single-thread use cases:
+the latter is vastly more prevalent. But one can not optimize the
+single-thread performance without having functioning code for
+multi-thread.
+
+stateless ACLs
+--------------
+
+The stateless case trivially parallelizes, and the only potential for a
+race between the different threads is during the reconfiguration, at the
+time of replacing the old ACL being checked with the new ACL.
+
+In case an acl_add_replace is being used to replace the rules within an
+existing entry, a reallocation of the ``am->acls[X].rules`` vector will
+happen, and potentially a change in count.
+
+acl_match_5tuple() has the following code:
+
+.. code:: c
+
+ a = am->acls + acl_index;
+ for (i = 0; i < a->count; i++)
+ {
+ r = a->rules + i;
+ . . .
+
+Ideally we should be immune from a->rules changing, but the problem
+arises if the count changes in flight and the new ruleset is smaller -
+then we will attempt to “match” against freed memory.
+
+This can(?) be solved by replacing the for() with while(), so the
+comparison happens at each iteration.
+
+full_acl_match_5tuple(), which iterates over the list of ACLs, is a bit
+less immune, since it takes the pointer to the vector to iterate and
+keeps a local copy of that pointer.
+
+This race can be solved by checking the current pointer to the vector
+with the source pointer, and seeing if there is an (unlikely) change,
+and if there is, return the “deny” action, or, better, restart the
+check.
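+
+A minimal sketch of that defensive pattern (illustrative stand-in types,
+not the plugin’s actual code):
+
+.. code:: c
+
+   #include <stdint.h>
+
+   typedef struct { int dummy; } acl_rule_t; /* stand-in for the real rule */
+
+   typedef struct
+   {
+     acl_rule_t *rules; /* vector, reallocated by acl_add_replace() */
+     uint32_t count;
+   } acl_list_t;
+
+   /* Re-read the rules pointer and the count on every iteration, and
+    * restart (or deny) if the control plane swapped the vector under us.
+    * At worst, one packet gets checked against a stale ruleset. */
+   static int
+   match_acl_defensively (volatile acl_list_t *a,
+                          int (*match_rule) (const acl_rule_t *r))
+   {
+   restart:;
+     acl_rule_t *rules = a->rules;
+     uint32_t i = 0;
+     while (i < a->count)       /* count is re-loaded at each iteration */
+       {
+         if (a->rules != rules) /* the vector was reallocated in flight */
+           goto restart;        /* or: return 0 == "deny" */
+         if (match_rule (&rules[i]))
+           return 1;            /* matched */
+         i++;
+       }
+     return 0;                  /* no match */
+   }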
+
+Since the check reloads the ACL list on a per-packet basis, there is
+only a one-packet window of opportunity to “match” a packet against an
+incorrect rule set. The workers also do not change anything, only read.
+Therefore, it looks like building special structures to ensure that this
+never happens at all might not be worth it.
+
+At least not until we have a unit test able to reliably catch this
+condition and verify that the measures applied are effective. Adding
+code that is impossible to exercise is worse than not adding any code at
+all.
+
+So, I opt for “do-nothing” here for the moment.
+
+reflexive ACLs: single-thread
+-----------------------------
+
+Before we talk multi-thread, it is worth revisiting the design of the
+reflexive ACLs in the plugin, and the history of their evolution.
+
+The very first version of the ACL plugin, shipped in 1701, mostly did
+the job using the existing components and gluing them together. Because
+it needed to work in the bridged forwarding path only, using the L2
+classifier as an insertion point appeared natural; also, the L2
+classifier, being a table with sessions, seemed like a good place to
+hold the sessions.
+
+So, the original design had two conceptual nodes: one, pointed to by the
+next_miss from the L2 classifier table, was checking the actual ACL and
+inserting a session into the L2 classifier table; the other one, pointed
+to by the next_match within the specific session rule, was checking the
+existing session. The timing out of the existing connections was done in
+the datapath, by periodically calling the aging function.
+
+This decision to use the existing components, attractive as it was,
+did bring a few limitations as well:
+
+- L2 classifier is a simple mask-and-value match, with a fixed mask
+ across the table. So, sanely supporting IPv6 packets with extension
+ headers in that framework was impossible.
+
+- There is no way to get backpressure from the L2 classifier depending
+  on memory usage. When it runs out of memory, it simply crashes the
+  box. When does it run out of memory? We don’t really know; it depends
+  on how it allocates it.
+
+- Since we need to match the *reflected* traffic, we had to create
+ *two* full session entries in two different directions, which is
+ quite wasteful memory-wise.
+
+- (showstopper): the L2 classifier runs only in the bridged data path,
+  so supporting the routed data path would require creating something
+  entirely different, which would mean many more headaches support-wise
+  going forward.
+
+Because of that, I have moved to a different model: creating a
+session 5-tuple from the packet data - once - and then doing all the
+matching just on that 5-tuple.
+
+This has allowed adding support for skipping IPv6 extension headers.
+
+Also, this new version started to store the sessions in a dedicated
+bihash-per-interface, with the session key data being aligned for the
+ingress packets, and being mirrored for the egress packets. This allows
+for significant savings in memory, because now we need to keep only one
+copy of the session table per interface instead of two; it also allows
+us to have ONE node for all the lookups (L2/L3 path, in/out, IPv4/IPv6),
+significantly reducing the code complexity.
+
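+To illustrate the “mirrored key” idea (the structure and field names
+below are simplified for this sketch and do not match the plugin’s
+actual 5-tuple layout): on egress, the address and port pairs are
+swapped, so that both directions of a flow resolve to the same bihash
+entry.
+
+.. code:: c
+
+   /* Sketch only: one session key shared by both directions. */
+   typedef struct
+   {
+     u32 src_addr, dst_addr;     /* simplified to IPv4 for brevity */
+     u16 src_port, dst_port;
+     u8 proto;
+   } session_key_t;
+
+   static void
+   make_session_key (session_key_t *key, const session_key_t *pkt_tuple,
+                     int is_egress)
+   {
+     *key = *pkt_tuple;
+     if (is_egress)
+       {
+         /* Mirror, so the reflected packet maps onto the session
+          * created by the original ingress packet. */
+         key->src_addr = pkt_tuple->dst_addr;
+         key->dst_addr = pkt_tuple->src_addr;
+         key->src_port = pkt_tuple->dst_port;
+         key->dst_port = pkt_tuple->src_port;
+       }
+   }
+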
+Unfortunately, bihash still has the “lack of backpressure” problem, in a
+sense that if you try to insert too many entries and run out of memory
+in the heap you supplied, you get a crash.
+
+To somewhat work around that, there is a “maximum tested number of
+sessions” value, which tracks the number of sessions currently inserted
+in the bihash, and if this number is being approached, a more aggressive
+cleanup can happen. If this number is reached, two behaviors are
+possible:
+
+- attempt to do the stateless ACL matching and permit the packet if it
+ succeeds
+
+- deny the packet
+
+Currently I have opted for the second one, since it allows for a
+better-defined behavior, and if you have to permit the traffic in both
+directions, why use stateful anyway?
+
+In order to be able to do the cleanup, we need to discriminate between
+the session types, with each session type having its own idle timeout.
+In order to do that, we keep three lists, defined in enum acl_timeout_e:
+ACL_TIMEOUT_UDP_IDLE, ACL_TIMEOUT_TCP_IDLE, ACL_TIMEOUT_TCP_TRANSIENT.
+
+The first one is hopefully obvious - it is just all UDP connections.
+They have an idle timeout of 600 seconds.
+
+The second and third are a bit more subtle. TCP is a complicated
+protocol, and we need to tread the fine line between doing too little
+and doing too much, and triggering the potential compatibility issues
+because of being a “middlebox”.
+
+I decided to split the TCP connections into two classes: established,
+and everything else. “Established” means we have seen the SYN and ACK
+from both sides (with PUSH obviously masked out). This is the “active”
+state of any TCP connection and we would like to ensure we do not screw
+it up. So, the connections in this state have the default idle timer of
+24 hours.
+
+All the rest of the connections have an idle timeout of 2 minutes
+(inspired by an old value of MSL), based on the observation that the
+states this class represents are usually very short-lived.
+
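+Putting the three classes and their default idle timeouts side by side
+(a sketch of the idea only - the exact enum values and defaults live in
+the plugin source and the timeouts are configurable):
+
+.. code:: c
+
+   /* The three session timeout classes described above. */
+   enum acl_timeout_e
+   {
+     ACL_TIMEOUT_UDP_IDLE,       /* any UDP session         */
+     ACL_TIMEOUT_TCP_IDLE,       /* established TCP session */
+     ACL_TIMEOUT_TCP_TRANSIENT,  /* all other TCP states    */
+   };
+
+   /* Illustrative default idle timeouts, in seconds. */
+   static const u32 session_timeout_sec[] = {
+     [ACL_TIMEOUT_UDP_IDLE] = 600,          /* 10 minutes */
+     [ACL_TIMEOUT_TCP_IDLE] = 86400,        /* 24 hours   */
+     [ACL_TIMEOUT_TCP_TRANSIENT] = 120,     /* 2 minutes  */
+   };
+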
+Once we have these three baskets of connections, it is trivial to
+imagine a simple cleanup mechanism to deal with this: take the TCP
+transient connection that has been hanging around the longest, and
+reclaim it first.
+
+It is debatable whether we want to discriminate between the different
+TCP transient connections. Assuming we do FIFO (and the lists allow us
+to do just that), the connection at the head of the list has been
+hanging around the longest. Thus, if we are short on resources, we
+might just go ahead and reuse it within the datapath.
+
+This is where we are slowly approaching the question “Why in the world
+have you not used a timer wheel or some such?”
+
+The answer is simple: within the above constraints, it does not buy me
+much.
+
+Also, a timer wheel creates a leaky abstraction with a
+difficult-to-manage corner case. Which corner case?
+
+We have a set of objects (sessions) with an event that may or may not
+happen (idle timeout timer firing), and a necessity to reset the idle
+timeout when there is activity on the session.
+
+In the worst case, where we had 10000 one-packet UDP sessions just
+created 10 minutes ago, we would need to deal with a spike of 10000
+expired timers.
+
+Of course, if we have active traffic on all of these 10000
+connections, then we will not have to deal with that? Right, but we
+will still have to deal with canceling and requeueing the timers.
+
+In the best possible case, requeueing a timer is going to be something
+along the lines of a linked-list removal and reinsertion.
+
+However, keep in mind we already need to classify the connections for
+reuse, so we already have the linked lists!
+
+And if we just check these linked lists periodically in a FIFO fashion,
+we can get away with a very simple per-packet operation: writing back
+the timestamp of “now” into the connection structure.
+
+Then, rather than requeueing the session on a per-packet or per-frame
+basis, we can defer this action until the session appears at the head
+of the FIFO list, and the cleaning routine makes the decision about
+whether to discard the session (because the interval since the last
+activity is bigger than the idle timeout), or to requeue the session
+back to the end of the list (because the last activity was less than
+the idle timeout ago).
+
+So, rather than using timers, we can simply reuse our classification
+FIFOs, with the following heuristic: do not look at the session that was
+enqueued at time X until X+session_timeout. If we enqueue the sessions
+in the order of their initial activity, then we can simply use the
+enqueue timestamp of the head session as a decision criterion for when
+we need to look at it again for timeout purposes.
+
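+A sketch of that head-of-FIFO check (``conn_t``, ``fifo_t`` and the
+helper functions are hypothetical, for illustration only):
+
+.. code:: c
+
+   /* Sketch only: process one timeout-class FIFO. */
+   static void
+   check_fifo (fifo_t *fifo, f64 now, f64 idle_timeout)
+   {
+     conn_t *c;
+     while ((c = fifo_head (fifo)) != NULL)
+       {
+         /* The head was enqueued earliest: if even it is not due for a
+          * look yet, nothing else in this FIFO is. */
+         if (now < c->enqueue_time + idle_timeout)
+           break;
+         fifo_dequeue (fifo);
+         if (now - c->last_active > idle_timeout)
+           delete_session (c);          /* really idle: drop it */
+         else
+           {
+             c->enqueue_time = now;     /* seen recently: requeue */
+             fifo_enqueue_tail (fifo, c);
+           }
+       }
+   }
+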
+Since the number of FIFOs is small, we get a slightly worse check
+performance than with timers, but still O(1).
+
+We seemingly do quite a few “useless” operations of requeueing the items
+back to the tail of the list - but, these are the operations we do not
+have to do in the active data path, so overall it is a win.
+
+(Diversion: I believe this problem is congruent to poll vs. epoll or
+events vs. threads, some reading on this subject:
+http://web.archive.org/web/20120225022154/http://sheddingbikes.com/posts/1280829388.html)
+
+We can also run a TCP-like scheme for adaptively changing the wait
+period in the routine that deals with the connection timeouts: we can
+attempt to check the connections a couple of times per second (the same
+as we would advance a timer wheel), and then, if we have requeued close
+to a max-per-quantum number of connections, we can halve the waiting
+interval, and if we did not requeue any, we can slowly increase the
+waiting interval - which at steady state should stabilize, similar to
+what the TCP rate does.
+
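+A sketch of that adaptive scheme (the variable names and the clamping
+bounds are made up for illustration):
+
+.. code:: c
+
+   /* Sketch only: adapt the sleep interval between cleanup passes. */
+   static f64
+   next_wait_interval (f64 interval, u32 requeued, u32 max_per_quantum)
+   {
+     if (requeued >= max_per_quantum)
+       interval = interval / 2.0;   /* work is piling up: check sooner */
+     else if (requeued == 0)
+       interval = interval * 1.1;   /* nothing to do: slowly back off  */
+     if (interval < 0.001)          /* clamp to a sane range,          */
+       interval = 0.001;            /* e.g. 1 ms ... 1 s               */
+     if (interval > 1.0)
+       interval = 1.0;
+     return interval;
+   }
+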
+reflexive ACLs: multi-thread
+----------------------------
+
+The single-threaded implementation in 1704 used a separate “cleaner”
+process to deal with the timing out of the connections. That is all
+well and good when you know there is only a single core to run
+everything on, but the existence of the lists proves to be a massive
+difficulty when it comes to operating from multiple threads.
+
+An initial study shows that with a few assumptions (e.g. that the
+cleaner running in the main thread and the worker have a demarcation
+point in time where only one or the other touches the session in the
+list) it might be possible to make it work, but the resulting trickiness
+of doing it neatly with all the corner cases is quite large.
+
+So, for the multi-threaded scenario, we need to move the connection
+aging back to the same CPU that created the connection.
+
+Luckily we can do this with the help of interrupts.
+
+So, the design is as follows: the aging thread
+(acl_fa_session_cleaner_process) periodically fires interrupts to
+the workers’ interrupt nodes
+(acl_fa_worker_session_cleaner_process_node.index), using
+vlib_node_set_interrupt_pending(), and the interrupt node
+acl_fa_worker_conn_cleaner_process() calls acl_fa_check_idle_sessions(),
+which does the actual job of advancing the lists. Within the actual
+datapath the only thing we will be doing is putting the items onto the
+FIFO, and updating the last-active time on the existing connection.
+
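+A minimal sketch of how the main-thread cleaner kicks the workers
+(simplified: timing, error handling and the surrounding process logic
+are omitted, and the way the per-thread ``vlib_main_t`` is obtained may
+differ between VPP versions):
+
+.. code:: c
+
+   /* Sketch only: fire the cleanup interrupt on every worker thread. */
+   static void
+   send_cleanup_interrupts (void)
+   {
+     int i;
+     /* Index 0 is the main thread; 1..N are the workers. */
+     for (i = 1; i < vec_len (vlib_mains); i++)
+       if (vlib_mains[i])
+         vlib_node_set_interrupt_pending (
+           vlib_mains[i], acl_fa_worker_session_cleaner_process_node.index);
+   }
+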
+The one “delicate” part is that the worker for one leg of the connection
+might be different from the worker for the other leg of the connection -
+but even if the “owner” tries to free the connection, nothing terrible
+can happen - in the worst case the element of the pool (which is
+nominally free for a short period) will get its timestamp updated; the
+same goes for the TCP flags seen.
+
+A slightly trickier issue arises when the packet is initially seen by
+one worker (and thus the connection is owned by that worker), while the
+return packet is processed by another worker and, as a result, changes
+the class of the connection (e.g. it becomes TCP_ESTABLISHED from
+TCP_TRANSIENT or vice versa). If the class changes from one with a
+shorter idle time to one with a longer idle time, then unless we are in
+the starvation mode where the transient connections are recycled, we can
+simply do nothing and let the normal requeue mechanism kick in. If the
+class changes from the longer idle timer to the shorter idle timer, then
+we risk keeping the connection around for longer than needed, which will
+affect the resource usage.
+
+One solution to that is to have NxN ring buffers (where N is the number
+of workers), such that the non-owner can signal to the owner the
+connection# that needs to be requeued out of order.
+
+A simpler solution, though, is to ensure that each FIFO’s period is
+equal to that of the shortest timer. This way the resource starvation
+problem is taken care of, at the expense of some additional work.
+
+This all looks sufficiently nice and simple until a skeleton falls out
+of the closet: sometimes we want to clean the connections en masse
+before they expire.
+
+There are a few potential scenarios: 1) removal of an ACL from the
+interface, 2) removal of an interface, 3) manual action of an operator
+(in the future).
+
+In order to tackle this, we need to modify the logic which decides
+whether to requeue the connection on the end of the list, or to delete
+it due to idle timeout:
+
+We define a point in time, and have each worker thread fast-forward
+through its FIFO, in the process looking for sessions that satisfy the
+criteria, and either keeping them or requeueing them.
+
+To keep appearances simple to the outside world, we still process this
+as an event within the connection cleaner thread, but the event handler
+does the following: 1) it creates the bitmap of the sw_if_index values
+requested to be cleared; 2) for each worker, it waits until there is no
+cleanup operation in progress, then makes a copy of the bitmap, sets the
+per-worker flag of a cleanup operation, and sends an interrupt; 3) it
+waits until all cleanup operations have completed.
+
+Within the worker interrupt node, we check if the “cleanup in progress”
+flag is set, and if it is, we check the “fast forward time” value. If it
+is unset, we initialize it to the current time, and compare the
+requested bitmap of sw_if_index values
+(pending_clear_sw_if_index_bitmap) with the bitmap of sw_if_index values
+that this worker deals with.
+
+(We set the bit in that bitmap every time we enqueue a packet onto a
+FIFO - serviced_sw_if_index_bitmap in acl_fa_conn_list_add_session.)
+
+If the result of this AND operation is zero, then we can clear the
+“cleanup in progress” flag and return. Otherwise we kick off a quantum
+of cleanup, and make sure we get another interrupt ASAP if that cleanup
+operation returns non-zero, meaning there is more work to do. When that
+operation returns zero, everything has been processed: we can clear the
+“cleanup in progress” flag and zeroize the bitmap of sw_if_index values
+requested to be cleaned.
+
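+A condensed sketch of that worker-side decision logic (``pw`` stands
+for a hypothetical per-worker structure; the field and helper names are
+illustrative, chosen to follow the description above rather than the
+plugin’s actual code):
+
+.. code:: c
+
+   /* Sketch only: one run of the worker cleanup interrupt node. */
+   if (pw->cleanup_in_progress)
+     {
+       if (pw->fast_forward_time == 0.0)
+         pw->fast_forward_time = now;  /* set the fast-forward point once */
+       if (!bitmaps_intersect (pw->pending_clear_sw_if_index_bitmap,
+                               pw->serviced_sw_if_index_bitmap))
+         pw->cleanup_in_progress = 0;  /* nothing of ours is affected */
+       else if (do_cleanup_quantum (pw) == 0)
+         {
+           /* All matching sessions have been processed. */
+           pw->cleanup_in_progress = 0;
+           clear_bitmap (pw->pending_clear_sw_if_index_bitmap);
+         }
+       else
+         pw->interrupt_is_needed = 1;  /* more work: ask for another kick */
+     }
+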
+The interrupt node signals its wish to receive an interrupt ASAP by
+setting the interrupt_is_needed flag within the per-worker structure.
+The main thread, while waiting for the cleanup operation to complete,
+checks if there is a request for an interrupt, and if there is, it sends
+one.
+
+This approach gives us a way to mass-clean the connections which reuses
+the code of the regular idle connection cleanup.
+
+One potential inefficiency is that the bitmap bits set by the session
+insertion in the data path are never cleared.
+
+So, if one rearranges the placement of interfaces across the workers,
+then the cleanups will cause some unnecessary work. For now, we consider
+it an acceptable limitation. It can be resolved by having another
+per-worker bitmap, which, when set, would trigger the cleanup of the
+bits in the serviced_sw_if_index_bitmap.
+
+=== the end ===