Diffstat (limited to 'src/plugins/acl')
 -rw-r--r--  src/plugins/acl/acl_hash_lookup_doc.md  | 240
 -rw-r--r--  src/plugins/acl/acl_hash_lookup_doc.rst | 243
 -rw-r--r--  src/plugins/acl/acl_lookup_context.md   | 125
 -rw-r--r--  src/plugins/acl/acl_lookup_context.rst  | 138
 -rw-r--r--  src/plugins/acl/acl_multicore_doc.md    | 349
 -rw-r--r--  src/plugins/acl/acl_multicore_doc.rst   | 354
 6 files changed, 735 insertions, 714 deletions
diff --git a/src/plugins/acl/acl_hash_lookup_doc.md b/src/plugins/acl/acl_hash_lookup_doc.md deleted file mode 100644 index 6b08e1bc953..00000000000 --- a/src/plugins/acl/acl_hash_lookup_doc.md +++ /dev/null @@ -1,240 +0,0 @@ -ACL plugin constant-time lookup design {#acl_hash_lookup} -====================================== - -The initial implementation of ACL plugin performs a trivial for() cycle, -going through the assigned ACLs on a per-packet basis. This is not very -efficient, even if for very short ACLs due to its simplicity it can beat -more advanced methods. - -However, to cover the case of longer ACLs with acceptable performance, -we need to have a better way of matching. This write-up proposes -a mechanism to make a lookup from O(M) where M is number of entries -to O(N) where N is number of different mask combinations. - -Preparation of ACL(s) ---------------------- - -The ACL plugin will maintain a global list of "mask types", i.e. the specific -configurations of "do not care" bits within the ACEs. -Upon the creation of a new ACL, a pass will be made through all the -ACEs, to assign and possibly allocate the "mask type number". - -Each ACL has a structure *hash_acl_info_t* representing the "hash-based" -parts of information related to that ACL, primarily the array of -*hash_ace_info_t* structures - each of the members of that array -corresponding to one of the rules (ACEs) in the original ACL, -for this they have a pair of *(acl_index, ace_index)* to keep track, -predominantly for debugging. - -Why do we need a whole separate structure, and are not adding new fields -to the existing rule structure? First, encapsulation, to minimize -the pollution of the main ACL code with the hash-based lookup artifacts. -Second, one rule may correspond to more than one "hash-based" ACE. -In fact, most of the rules do correspond to two of those. Why ? - -Consider that the current ACL lookup logic is that if a packet -is not the initial fragment, and there is an L4 entry acting on the packet, -the comparison will be made only on the L4 protocol field value rather -than on the protocol and port values. This behavior is governed by -*l4_match_nonfirst_fragment* flag in the *acl_main*, and is needed to -maintain the compatibility with the existing software switch implementation. - -While for the sequential check in *single_acl_match_5tuple()* -it is very easy to implement by just breaking out at the right moment, -in case of hash-based matching this cost us two checks: -one on full 5-tuple and the flag *pkt.is_nonfirst_fragment* being zero, -the second on 3-tuple and the flag *pkt.is_nonfirst_fragment* being one, -with the second check triggered by the *acl_main.l4_match_nonfirst_fragment* -setting being the default 1. This dictates the necessity of having a "match" -field in a given *hash_ace_info_t* element, which would reflect the value -we are supposed to match after applying the mask. - -There can be other circumstances when it might be beneficial to expand -the given rule in the original ACL into multiple - for example, as an -optimization within the port range handling for small port ranges -(this is not done as of the time of writing). - -Assigning ACLs to an interface ------------------------------- - -Once the ACL list is assigned to an interface, or, rather, a new ACL -is added to the list of the existing ACLs applied to the interface, -we need to update the bihash accelerating the lookup. 
- -All the entries for the lookups are stored within a single *48_8* bihash, -which captures the 5-tuple from the packet as well as the miscellaneous -per-packet information flags, e.g. *l4_valid*, *is_non_first_fragment*, -and so on. To facilitate the use of the single bihash by all the interfaces, -the *is_ip6*, *is_input*, *sw_if_index* are part of the key, -as well as *mask_type_index* - the latter being necessary because -there can be entries with the same value but different masks, e.g.: -`permit ::/0, permit::/128`. - -At the moment of an ACL being applied to an interface, we need to -walk the list of *hash_ace_info_t* entries corresponding to that ACL, -and update the bihash with the keys corresponding to the match -values in these entries. - -The value of the hash match contains the index into a per-*sw_if_index* vector -of *applied_ace_hash_entry_t* elements, as well as a couple of flags: -*shadowed* (optimization: if this flag on a matched entry is zero, means -we can stop the lookup early and declare a match - see below), -and *need_portrange_check* - meaning that what matched was a superset -of the actual match, and we need to perform an extra check. - -Also, upon insertion, we must keep in mind there can be -multiple *applied_ace_hash_entry_t* for the same key and must keep -a list of those. This is necessary to incrementally apply/unapply -the ACLs as part of the ACL vector: say, two ACLs have -"permit 2001:db8::1/128 any" - we should be able to retain the entry -for the second ACL even if we have deleted the first one. -Also, in case there are two entries with the same key but -different port ranges, say 0..42 and 142..65535 - we need -to be able to sequentially match on those if we decide not -to expand them into individual port-specific entries. - -Per-packet lookup ------------------ - -The simple single-packet lookup is defined in -*multi_acl_match_get_applied_ace_index*, which returns the index -of the applied hash ACE if there was a match, or ~0 if there wasn't. - -The future optimized per-packet lookup may be batched in three phases: - -1. Prepare the keys in the per-worker vector by doing logical AND of - original 5-tuple record with the elements of the mask vector. -2. Lookup the keys in the bihash in a batch manner, collecting the - result with lowest u64 (acl index within vector, ACE index) from - the hash lookup value, and performing the list walk if necessary - (for portranges). -3. Take the action from the ACL record as defined by (ACL#, ACE#) from the - resulting lookup winner, or, if no match found, then perform default deny. - -Shadowed/independent/redundant ACEs ------------------------------------- - -During the phase of combining multiple ACLs into one rulebase, when they -are applied to interface, we also can perform several optimizations. - -If a given ACE is a strict subset of another ACE located up in the linear -search order, we can ignore this ACE completely - because by definition -it will never match. We will call such an ACE *redundant*. 
Here is an example: - -``` -permit 2001:db8:1::/48 2001:db8:2::/48 (B) -deny 2001:d8b:1:1::/64 2001:db8:2:1::/64 (A) -``` - -A bit more formally, we can define this relationship of an ACE A to ACE B as: - -``` -redundant(aceA, aceB) := (contains(protoB, protoA) && contains(srcB, srcA) - && contains(dstB, dstA) && is_after(A, B)) -``` - -Here as "contains" we define an operation operating on the sets defined by -the protocol, (srcIP, srcPortDefinition) and (dstIP, dstPortDefinition) -respectively, and returning true if all the elements represented by -the second argument are represented by the first argument. The "is_after" -is true if A is located below B in the ruleset. - -If a given ACE does not intersect at all with any other ACE -in front of it, we can mark it as such. - -Then during the sequence of the lookups the successful hit on this ACE means -we do not need to look up other mask combinations - thus potentially -significantly speeding up the match process. Here is an example, -assuming we have the following ACL: - -``` -permit 2001:db8:1::/48 2001:db8:2::/48 (B) -deny 2001:db8:3::/48 2001:db8:2:1::/64 (A) -``` - -In this case if we match the second entry, we do not need to check whether -we have matched the first one - the source addresses are completely -different. We call such an ACE *independent* from another. - -We can define this as - -``` -independent(aceA, aceB) := (!intersect(protoA, protoB) || - !intersect(srcA, srcB) || - !intersect(dstA, dstB)) -``` - -where intersect is defined as operation returning true if there are -elements belonging to the sets of both arguments. - -If the entry A is neither redundant nor independent from B, and is below -B in the ruleset, we call such an entry *shadowed* by B, here is an example: - -``` -deny tcp 2001:db8:1::/48 2001:db8:2::/48 (B) -permit 2001:d8b:1:1::/64 2001:db8:2:1::/64 (A) -``` - -This means the earlier rule "carves out" a subset of A, thus leaving -a "shadow". (Evidently, the action needs to be different for the shadow -to have an effect, but for for the terminology sake we do not care). - -The more formal definition: - -``` -shadowed(aceA, aceB) := !redundant(aceA, aceB) && - !independent(aceA, aceB) && - is_after(aceA, aceB) -``` - -Using this terminology, any ruleset can be represented as -a DAG (Directed Acyclic Graph), with the bottom being the implicit -"deny any", pointing to the set of rules shadowing it or the ones -it is redundant for. - -These rules may in turn be shadowing each other. There is no cycles in -this graph because of the natural order of the rules - the rule located -closer to the end of the ruleset can never shadow or make redundant a rule -higher up. - -The optimization that enables can allow for is to skip matching certain -masks on a per-lookup basis - if a given rule has matched, -the only adjustments that can happen is the match with one of -the shadowing rules. - -Also, another avenue for the optimization can be starting the lookup process -with the mask type that maximizes the chances of the independent ACE match, -thus resulting in an ACE lookup being a single hash table hit. - - -Plumbing --------- - -All the new routines are located in a separate file, -so we can cleanly experiment with a different approach if this -does not fit all of the use cases. 
- -The constant-time lookup within the data path has the API with -the same signature as: - -``` -u8 -multi_acl_match_5tuple (u32 sw_if_index, fa_5tuple_t * pkt_5tuple, int is_l2, - int is_ip6, int is_input, u32 * acl_match_p, - u32 * rule_match_p, u32 * trace_bitmap) -``` - -There should be a new upper-level function with the same signature, which -will make a decision whether to use a linear lookup, or to use the -constant-time lookup implemented by this work, or to add some other -optimizations (e.g. by keeping the cache of the last N lookups). - -The calls to the routine doing preparatory work should happen -in `acl_add_list()` after creating the linear-lookup structures, -and the routine doing the preparatory work populating the hashtable -should be called from `acl_interface_add_del_inout_acl()` or its callees. - -The initial implementation will be geared towards looking up a single -match at a time, with the subsequent optimizations possible to make -the lookup for more than one packet. - diff --git a/src/plugins/acl/acl_hash_lookup_doc.rst b/src/plugins/acl/acl_hash_lookup_doc.rst new file mode 100644 index 00000000000..72842af423d --- /dev/null +++ b/src/plugins/acl/acl_hash_lookup_doc.rst @@ -0,0 +1,243 @@ +ACL plugin constant-time lookup +=============================== + +The initial implementation of ACL plugin performs a trivial for() cycle, +going through the assigned ACLs on a per-packet basis. This is not very +efficient, even if for very short ACLs due to its simplicity it can beat +more advanced methods. + +However, to cover the case of longer ACLs with acceptable performance, +we need to have a better way of matching. This write-up proposes a +mechanism to make a lookup from O(M) where M is number of entries to +O(N) where N is number of different mask combinations. + +Preparation of ACL(s) +--------------------- + +The ACL plugin will maintain a global list of “mask types”, i.e. the +specific configurations of “do not care” bits within the ACEs. Upon the +creation of a new ACL, a pass will be made through all the ACEs, to +assign and possibly allocate the “mask type number”. + +Each ACL has a structure *hash_acl_info_t* representing the “hash-based” +parts of information related to that ACL, primarily the array of +*hash_ace_info_t* structures - each of the members of that array +corresponding to one of the rules (ACEs) in the original ACL, for this +they have a pair of *(acl_index, ace_index)* to keep track, +predominantly for debugging. + +Why do we need a whole separate structure, and are not adding new fields +to the existing rule structure? First, encapsulation, to minimize the +pollution of the main ACL code with the hash-based lookup artifacts. +Second, one rule may correspond to more than one “hash-based” ACE. In +fact, most of the rules do correspond to two of those. Why ? + +Consider that the current ACL lookup logic is that if a packet is not +the initial fragment, and there is an L4 entry acting on the packet, the +comparison will be made only on the L4 protocol field value rather than +on the protocol and port values. This behavior is governed by +*l4_match_nonfirst_fragment* flag in the *acl_main*, and is needed to +maintain the compatibility with the existing software switch +implementation. 
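As a concrete illustration of the rule expansion this implies (detailed in the next paragraph), a single L4 ACE could end up as two hash entries. The sketch below uses hypothetical field names, not the plugin's actual structures:

::

   /* Sketch with hypothetical names: one ACE expands into two hash entries -
      one for unfragmented packets and first fragments (full 5-tuple, ports
      compared), and one for non-first fragments (protocol only, no ports). */
   typedef struct {
     int match_ports;              /* compare L4 ports as part of the match? */
     int pkt_is_nonfirst_fragment; /* value expected in the packet flags     */
   } hash_ace_sketch_t;

   static const hash_ace_sketch_t expansion_of_one_ace[2] = {
     { .match_ports = 1, .pkt_is_nonfirst_fragment = 0 }, /* full 5-tuple entry */
     { .match_ports = 0, .pkt_is_nonfirst_fragment = 1 }, /* 3-tuple entry      */
   };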
+ +While for the sequential check in *single_acl_match_5tuple()* it is very +easy to implement by just breaking out at the right moment, in case of +hash-based matching this cost us two checks: one on full 5-tuple and the +flag *pkt.is_nonfirst_fragment* being zero, the second on 3-tuple and +the flag *pkt.is_nonfirst_fragment* being one, with the second check +triggered by the *acl_main.l4_match_nonfirst_fragment* setting being the +default 1. This dictates the necessity of having a “match” field in a +given *hash_ace_info_t* element, which would reflect the value we are +supposed to match after applying the mask. + +There can be other circumstances when it might be beneficial to expand +the given rule in the original ACL into multiple - for example, as an +optimization within the port range handling for small port ranges (this +is not done as of the time of writing). + +Assigning ACLs to an interface +------------------------------ + +Once the ACL list is assigned to an interface, or, rather, a new ACL is +added to the list of the existing ACLs applied to the interface, we need +to update the bihash accelerating the lookup. + +All the entries for the lookups are stored within a single *48_8* +bihash, which captures the 5-tuple from the packet as well as the +miscellaneous per-packet information flags, e.g. *l4_valid*, +*is_non_first_fragment*, and so on. To facilitate the use of the single +bihash by all the interfaces, the *is_ip6*, *is_input*, *sw_if_index* +are part of the key, as well as *mask_type_index* - the latter being +necessary because there can be entries with the same value but different +masks, e.g.: ``permit ::/0, permit::/128``. + +At the moment of an ACL being applied to an interface, we need to walk +the list of *hash_ace_info_t* entries corresponding to that ACL, and +update the bihash with the keys corresponding to the match values in +these entries. + +The value of the hash match contains the index into a per-*sw_if_index* +vector of *applied_ace_hash_entry_t* elements, as well as a couple of +flags: *shadowed* (optimization: if this flag on a matched entry is +zero, means we can stop the lookup early and declare a match - see +below), and *need_portrange_check* - meaning that what matched was a +superset of the actual match, and we need to perform an extra check. + +Also, upon insertion, we must keep in mind there can be multiple +*applied_ace_hash_entry_t* for the same key and must keep a list of +those. This is necessary to incrementally apply/unapply the ACLs as part +of the ACL vector: say, two ACLs have “permit 2001:db8::1/128 any” - we +should be able to retain the entry for the second ACL even if we have +deleted the first one. Also, in case there are two entries with the same +key but different port ranges, say 0..42 and 142..65535 - we need to be +able to sequentially match on those if we decide not to expand them into +individual port-specific entries. + +Per-packet lookup +----------------- + +The simple single-packet lookup is defined in +*multi_acl_match_get_applied_ace_index*, which returns the index of the +applied hash ACE if there was a match, or ~0 if there wasn’t. + +The future optimized per-packet lookup may be batched in three phases: + +1. Prepare the keys in the per-worker vector by doing logical AND of + original 5-tuple record with the elements of the mask vector. +2. 
Lookup the keys in the bihash in a batch manner, collecting the + result with lowest u64 (acl index within vector, ACE index) from the + hash lookup value, and performing the list walk if necessary (for + portranges). +3. Take the action from the ACL record as defined by (ACL#, ACE#) from + the resulting lookup winner, or, if no match found, then perform + default deny. + +Shadowed/independent/redundant ACEs +----------------------------------- + +During the phase of combining multiple ACLs into one rulebase, when they +are applied to interface, we also can perform several optimizations. + +If a given ACE is a strict subset of another ACE located up in the +linear search order, we can ignore this ACE completely - because by +definition it will never match. We will call such an ACE *redundant*. +Here is an example: + +:: + + permit 2001:db8:1::/48 2001:db8:2::/48 (B) + deny 2001:d8b:1:1::/64 2001:db8:2:1::/64 (A) + +A bit more formally, we can define this relationship of an ACE A to ACE +B as: + +:: + + redundant(aceA, aceB) := (contains(protoB, protoA) && contains(srcB, srcA) + && contains(dstB, dstA) && is_after(A, B)) + +Here as “contains” we define an operation operating on the sets defined +by the protocol, (srcIP, srcPortDefinition) and (dstIP, +dstPortDefinition) respectively, and returning true if all the elements +represented by the second argument are represented by the first +argument. The “is_after” is true if A is located below B in the ruleset. + +If a given ACE does not intersect at all with any other ACE in front of +it, we can mark it as such. + +Then during the sequence of the lookups the successful hit on this ACE +means we do not need to look up other mask combinations - thus +potentially significantly speeding up the match process. Here is an +example, assuming we have the following ACL: + +:: + + permit 2001:db8:1::/48 2001:db8:2::/48 (B) + deny 2001:db8:3::/48 2001:db8:2:1::/64 (A) + +In this case if we match the second entry, we do not need to check +whether we have matched the first one - the source addresses are +completely different. We call such an ACE *independent* from another. + +We can define this as + +:: + + independent(aceA, aceB) := (!intersect(protoA, protoB) || + !intersect(srcA, srcB) || + !intersect(dstA, dstB)) + +where intersect is defined as operation returning true if there are +elements belonging to the sets of both arguments. + +If the entry A is neither redundant nor independent from B, and is below +B in the ruleset, we call such an entry *shadowed* by B, here is an +example: + +:: + + deny tcp 2001:db8:1::/48 2001:db8:2::/48 (B) + permit 2001:d8b:1:1::/64 2001:db8:2:1::/64 (A) + +This means the earlier rule “carves out” a subset of A, thus leaving a +“shadow”. (Evidently, the action needs to be different for the shadow to +have an effect, but for for the terminology sake we do not care). + +The more formal definition: + +:: + + shadowed(aceA, aceB) := !redundant(aceA, aceB) && + !independent(aceA, aceB) && + is_after(aceA, aceB) + +Using this terminology, any ruleset can be represented as a DAG +(Directed Acyclic Graph), with the bottom being the implicit “deny any”, +pointing to the set of rules shadowing it or the ones it is redundant +for. + +These rules may in turn be shadowing each other. There is no cycles in +this graph because of the natural order of the rules - the rule located +closer to the end of the ruleset can never shadow or make redundant a +rule higher up. 
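These three relations can be checked with straightforward predicates. Below is a self-contained sketch over deliberately simplified ACEs (protocol plus IPv6 source/destination prefixes, ports omitted); all type and function names are hypothetical, not the plugin's own:

::

   #include <stdbool.h>
   #include <stdint.h>

   /* Simplified ACE: protocol (0 = any) plus source/destination prefixes. */
   typedef struct {
     uint8_t  proto;               /* 0 means "any protocol" */
     uint8_t  src[16], dst[16];
     uint8_t  src_len, dst_len;    /* prefix lengths in bits */
     uint32_t position;            /* index within the ruleset */
   } ace_sketch_t;

   /* true if prefix (b, blen) contains prefix (a, alen) */
   static bool
   prefix_contains (const uint8_t *b, int blen, const uint8_t *a, int alen)
   {
     if (blen > alen)
       return false;
     for (int i = 0; i < blen; i++)
       if (((a[i / 8] ^ b[i / 8]) >> (7 - i % 8)) & 1)
         return false;
     return true;
   }

   /* two prefixes intersect iff one of them contains the other */
   static bool
   prefix_intersect (const uint8_t *a, int al, const uint8_t *b, int bl)
   {
     return prefix_contains (a, al, b, bl) || prefix_contains (b, bl, a, al);
   }

   static bool proto_contains (uint8_t b, uint8_t a) { return b == 0 || b == a; }
   static bool proto_intersect (uint8_t a, uint8_t b) { return a == 0 || b == 0 || a == b; }
   static bool is_after (const ace_sketch_t *a, const ace_sketch_t *b) { return a->position > b->position; }

   static bool
   redundant (const ace_sketch_t *a, const ace_sketch_t *b)
   {
     return proto_contains (b->proto, a->proto)
            && prefix_contains (b->src, b->src_len, a->src, a->src_len)
            && prefix_contains (b->dst, b->dst_len, a->dst, a->dst_len)
            && is_after (a, b);
   }

   static bool
   independent (const ace_sketch_t *a, const ace_sketch_t *b)
   {
     return !proto_intersect (a->proto, b->proto)
            || !prefix_intersect (a->src, a->src_len, b->src, b->src_len)
            || !prefix_intersect (a->dst, a->dst_len, b->dst, b->dst_len);
   }

   static bool
   shadowed (const ace_sketch_t *a, const ace_sketch_t *b)
   {
     return !redundant (a, b) && !independent (a, b) && is_after (a, b);
   }

This mirrors the pseudocode definitions above; port ranges would simply add a fourth dimension to each predicate.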
The optimization this enables is to skip matching certain masks on a per-lookup basis: if a given rule has matched, the only adjustment that can still happen is a match with one of the rules shadowing it.

Also, another avenue for optimization can be starting the lookup process with the mask type that maximizes the chances of an independent ACE match, thus resulting in an ACE lookup being a single hash table hit.

Plumbing
--------

All the new routines are located in a separate file, so we can cleanly experiment with a different approach if this does not fit all of the use cases.

The constant-time lookup within the data path has the API with the same signature as:

::

   u8
   multi_acl_match_5tuple (u32 sw_if_index, fa_5tuple_t * pkt_5tuple, int is_l2,
                           int is_ip6, int is_input, u32 * acl_match_p,
                           u32 * rule_match_p, u32 * trace_bitmap)

There should be a new upper-level function with the same signature, which will make a decision whether to use a linear lookup, or to use the constant-time lookup implemented by this work, or to add some other optimizations (e.g. by keeping a cache of the last N lookups).

The calls to the routine doing preparatory work should happen in ``acl_add_list()`` after creating the linear-lookup structures, and the routine doing the preparatory work populating the hashtable should be called from ``acl_interface_add_del_inout_acl()`` or its callees.

The initial implementation will be geared towards looking up a single match at a time, with subsequent optimizations possible to make the lookup handle more than one packet.

diff --git a/src/plugins/acl/acl_lookup_context.md b/src/plugins/acl/acl_lookup_context.md
deleted file mode 100644
index e95f82043f9..00000000000
--- a/src/plugins/acl/acl_lookup_context.md
+++ /dev/null
@@ -1,125 +0,0 @@
Lookup contexts aka "ACL as a service" {#acl_lookup_context}
======================================

The initial implementation of the ACL plugin had tightly tied the policy (L3-L4) ACLs to ingress/egress processing on an interface.

However, some uses outside of pure traffic control have appeared, for example, ACL-based forwarding, etc. Also, improved algorithms of the ACL lookup could benefit from a more abstract representation, not coupled to the interfaces.

This describes a way to accommodate these use cases by generalizing the ACL lookups into "ACL lookup contexts", not tied to specific interfaces, usable by other portions of the code by utilizing the exports.h header file, which provides the necessary interface.

Why "lookup contexts" and not "match me an ACL#"?
=================================================

The first reason is the logical grouping of multiple ACLs.

The interface matching code currently allows for matching multiple ACLs in a 'first-match' fashion. Some other use cases also fall into a similar pattern: they attempt to match a sequence of ACLs, and the first matched ACL determines what the outcome is, e.g. where to forward traffic. Thus, a match never happens on an ACL in isolation, but always on a group of ACLs.

The second reason is potential optimizations in matching.

A naive match on a series of ACLs, each represented as a vector of ACEs, does not care about the API level - it could be "match one ACL" or "match the set of ACLs" - there will be just a simple loop iterating over the ACLs to match, returning the first match. Be it in the ACL code or in the user code.
- -However, for more involved lookup methods, providing a more high-level -interface of matching over the entire group of ACLs allows for future -improvements in the algorithms, delivered at once to all the users -of the API. - -What is a "lookup context" ? -============================ - -An ACL lookup context is an entity that groups the set of ACL#s -together for the purposes of a first-match lookup, and may store -additional internal information needed to optimize the lookups -for that particular vector of ACLs. - -Using ACL contexts in your code -=============================== - -In order to use the ACL lookup contexts, you need to include -plugins/acl/exports.h into your code. This header includes -all the necessary dependencies required. - -As you probably will invoke this code from another plugin, -the non-inline function calls are implemented via function pointers, -which you need to initialize by calling acl_plugin_exports_init(&acl_plugin), which, -if everything succeeds, returns 0 and fills in the acl_plugin structure -with pointers to the exported methods - else it will return clib_error_t with -more information about what went wrong. - -When you have initialized the symbols, you also need to register yourself -as a user of the ACL lookups - this allows to track the ACL lookup context -ownership, as well as make the debug show outputs more user friendly. - -To do that, call acl_plugin.register_user_module(caller_module_string, val1_label, val2_label) - -and record the returned value. This will bethe first parameter that you pass to create a new -lookup context. The passed strings must be static, and are used as descriptions for the ACL -contexts themselves, as well as labels for up to two user-supplied u32 labels, used to -differentiate the lookup contexts for the debugging purposes. - -Creating a new context is done by calling acl_plugin.get_lookup_context_index(user_id, val1, val2). -The first argument is your "user" ID obtained in a registration call earlier, the other two -arguments are u32s with semantics that you designate. They are used purely for debugging purposes -in the "show acl lookup context" command. - -To set the vector of ACL numbers to be looked up within the context, use the function -acl_plugin.set_acl_vec_for_context(lc_index, acl_list). The first parameter specifies the context -that you have created, the second parameter is a vector of u32s, each u32 being the index of the ACL -which we should be looking up within this context. The command is idempotent, i.e. -it unapplies the previously applied list of ACLs, and then sets the new list of ACLs. - -Subsequent ACL updates for the already applied ACLs will cause the re-application -on an as-needed basis. Note, that the ACL application is potentially a relatively costly operation, -so it is only expected that these changes will be done in the control plane, NOT in the datapath. - -The matching within the context is done using two functions - acl_plugin.fill_5tuple() and -acl_plugin.match_5tuple() and their corresponding inline versions, named acl_plugin_fill_5tuple_inline() -and acl_plugin_match_5tuple_inline(). The inline and non-inline versions have the equivalent functionality, -in that the non-inline version calls the inline version. These two variants are provided -for debugging/maintenance reasons. - -When you no longer need a particular context, you can return the allocated resources by calling -acl_plugin.put_lookup_context_index() to mark it as free. 
The lookup structures associated with the vector of ACLs set for the lookup are cleaned up automatically. However, the ACLs themselves are not deleted and are available for subsequent reuse by other lookup contexts if needed.

There is one delicate detail that you might want to be aware of. When the non-inline functions reference the inline functions, they are compiled as part of the ACL plugin; whereas when you refer to the inline functions from your code, they are compiled as part of your code. This makes referring to a single acl_main structure a little trickier.

It is done by having a static p_acl_main within the .h file, which points to the acl_main of the ACL plugin, and is initialized by a static constructor function.

This way the multiple includes and inlines will "just work" as one would expect.

Debug CLIs
==========

To see the state of the ACL lookup contexts, you can issue "show acl-plugin lookup user" to see all of the users which registered for the usage of the ACL plugin lookup contexts, and "show acl-plugin lookup context" to show the actual contexts created. You will notice that the latter command uses the values supplied during the module registration in order to make the output more friendly.

The "show acl-plugin acl" and "show acl-plugin interface" commands have also acquired the notion of lookup context, but there it is used from the client perspective, since with this change the interface ACL lookup itself is a user of ACL lookup contexts.

diff --git a/src/plugins/acl/acl_lookup_context.rst b/src/plugins/acl/acl_lookup_context.rst
new file mode 100644
index 00000000000..278e87381f3
--- /dev/null
+++ b/src/plugins/acl/acl_lookup_context.rst
@@ -0,0 +1,138 @@
ACL Lookup contexts
===================

The initial implementation of the ACL plugin had tightly tied the policy (L3-L4) ACLs to ingress/egress processing on an interface.

However, some uses outside of pure traffic control have appeared, for example, ACL-based forwarding, etc. Also, improved algorithms of the ACL lookup could benefit from a more abstract representation, not coupled to the interfaces.

This describes a way to accommodate these use cases by generalizing the ACL lookups into “ACL lookup contexts”, not tied to specific interfaces, usable by other portions of the code by utilizing the exports.h header file, which provides the necessary interface.

Why “lookup contexts” and not “match me an ACL”?
------------------------------------------------

The first reason is the logical grouping of multiple ACLs.

The interface matching code currently allows for matching multiple ACLs in a ‘first-match’ fashion. Some other use cases also fall into a similar pattern: they attempt to match a sequence of ACLs, and the first matched ACL determines what the outcome is, e.g. where to forward traffic. Thus, a match never happens on an ACL in isolation, but always on a group of ACLs.

The second reason is potential optimizations in matching.

A naive match on a series of ACLs, each represented as a vector of ACEs, does not care about the API level - it could be “match one ACL” or “match the set of ACLs” - there will be just a simple loop iterating over the ACLs to match, returning the first match. Be it in the ACL code or in the user code.
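For illustration, that “simple loop” could be pictured as below; the names (match_one_acl(), acl_group_match()) are placeholders for whatever per-ACL matcher the caller has, not actual plugin functions:

.. code:: c

   #include <stdint.h>

   /* Placeholder for the per-ACL matcher; fills *rule_index on a hit. */
   int match_one_acl (uint32_t acl_index, const void *pkt_5tuple,
                      uint32_t *rule_index);

   /* Naive first-match over a group of ACLs (the future "lookup context"). */
   static int
   acl_group_match (const uint32_t *acl_indices, uint32_t n_acls,
                    const void *pkt_5tuple,
                    uint32_t *acl_match, uint32_t *rule_match)
   {
     for (uint32_t i = 0; i < n_acls; i++)
       if (match_one_acl (acl_indices[i], pkt_5tuple, rule_match))
         {
           *acl_match = acl_indices[i]; /* first matching ACL wins */
           return 1;
         }
     return 0; /* no ACL in the group matched */
   }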
+ +However, for more involved lookup methods, providing a more high-level +interface of matching over the entire group of ACLs allows for future +improvements in the algorithms, delivered at once to all the users of +the API. + +What is a “lookup context” ? +---------------------------- + +An ACL lookup context is an entity that groups the set of ACL#s together +for the purposes of a first-match lookup, and may store additional +internal information needed to optimize the lookups for that particular +vector of ACLs. + +Using ACL contexts in your code +------------------------------- + +In order to use the ACL lookup contexts, you need to include +plugins/acl/exports.h into your code. This header includes all the +necessary dependencies required. + +As you probably will invoke this code from another plugin, the +non-inline function calls are implemented via function pointers, which +you need to initialize by calling acl_plugin_exports_init(&acl_plugin), +which, if everything succeeds, returns 0 and fills in the acl_plugin +structure with pointers to the exported methods - else it will return +clib_error_t with more information about what went wrong. + +When you have initialized the symbols, you also need to register +yourself as a user of the ACL lookups - this allows to track the ACL +lookup context ownership, as well as make the debug show outputs more +user friendly. + +To do that, call acl_plugin.register_user_module(caller_module_string, +val1_label, val2_label) - and record the returned value. This will be the +first parameter that you pass to create a new lookup context. The passed +strings must be static, and are used as descriptions for the ACL +contexts themselves, as well as labels for up to two user-supplied u32 +labels, used to differentiate the lookup contexts for the debugging +purposes. + +Creating a new context is done by calling +acl_plugin.get_lookup_context_index(user_id, val1, val2). The first +argument is your “user” ID obtained in a registration call earlier, the +other two arguments are u32s with semantics that you designate. They are +used purely for debugging purposes in the “show acl lookup context” +command. + +To set the vector of ACL numbers to be looked up within the context, use +the function acl_plugin.set_acl_vec_for_context(lc_index, acl_list). The +first parameter specifies the context that you have created, the second +parameter is a vector of u32s, each u32 being the index of the ACL which +we should be looking up within this context. The command is idempotent, +i.e. it unapplies the previously applied list of ACLs, and then sets the +new list of ACLs. + +Subsequent ACL updates for the already applied ACLs will cause the +re-application on an as-needed basis. Note, that the ACL application is +potentially a relatively costly operation, so it is only expected that +these changes will be done in the control plane, NOT in the datapath. + +The matching within the context is done using two functions - +acl_plugin.fill_5tuple() and acl_plugin.match_5tuple() and their +corresponding inline versions, named acl_plugin_fill_5tuple_inline() and +acl_plugin_match_5tuple_inline(). The inline and non-inline versions +have the equivalent functionality, in that the non-inline version calls +the inline version. These two variants are provided for +debugging/maintenance reasons. + +When you no longer need a particular context, you can return the +allocated resources by calling acl_plugin.put_lookup_context_index() to +mark it as free. 
The lookup structured associated with the vector of +ACLs set for the lookup are cleaned up automatically. However, the ACLs +themselves are not deleted and are available for subsequent reuse by +other lookup contexts if needed. + +There is one delicate detail that you might want to be aware of. When +the non-inline functions reference the inline functions, they are +compiled as part of ACL plugin; whereas when you refer to the inline +functions from your code, they are compiled as part of your code. This +makes referring to a single acl_main structure a little trickier. + +It is done by having a static p_acl_main within the .h file, which +points to acl_main of the ACL plugin, and is initialized by a static +constructor function. + +This way the multiple includes and inlines will “just work” as one would +expect. + +Debug CLIs +---------- + +To see the state of the ACL lookup contexts, you can issue “show +acl-plugin lookup user” to see all of the users which registered for the +usage of the ACL plugin lookup contexts, and “show acl-plugin lookup +context” to show the actual contexts created. You will notice that the +latter command uses the values supplied during the module registration +in order to make the output more friendly. + +The “show acl-plugin acl” and “show acl-plugin interface” commands have +also acquired the notion of lookup context, but there it is used from +the client perspective, since with this change the interface ACL lookup +itself is a user of ACL lookup contexts. diff --git a/src/plugins/acl/acl_multicore_doc.md b/src/plugins/acl/acl_multicore_doc.md deleted file mode 100644 index deec5e9d566..00000000000 --- a/src/plugins/acl/acl_multicore_doc.md +++ /dev/null @@ -1,349 +0,0 @@ -Multicore support for ACL plugin {#acl_multicore} -================================ - -This captures some considerations and design decisions that I have made, -both for my own memory later on ("what the hell was I thinking?!?"), -and for anyone interested to criticize/improve/hack on this code. - -One of the factors taken into account while making these decisions, -was the relative emphasis on the multi-thread vs. single-thread -use cases: the latter is the vastly more prevalent. But, -one can not optimize the single-thread performance without -having a functioning code for multi-thread. - -stateless ACLs -============== - -The stateless trivially parallelizes, and the only potential for the -race between the different threads is during the reconfiguration, -at the time of replacing the old ACL being checked, with -the new ACL. - -In case an acl_add_replace is being used to replace the rules -within the existing entry, a reallocation of `am->acls[X].rules` -vector will happen and potentially a change in count. - -acl_match_5tuple() has the following code: - -```{.c} - a = am->acls + acl_index; - for (i = 0; i < a->count; i++) - { - r = a->rules + i; - . . . -``` - -Ideally we should be immune from a->rules changing, -but the problem arises if the count changes in flight, -and the new ruleset is smaller - then we will attempt -to "match" against the free memory. - -This can(?) be solved by replacing the for() with while(), -so the comparison happens at each iteration. - -full_acl_match_5tuple(), which iterates over the list -of ACLs, is a bit less immune, since it takes the pointer -to the vector to iterate and keeps a local copy of -that pointer. 
- -This race can be solved by checking the -current pointer to the vector with the source pointer, -and seeing if there is an (unlikely) change, and if -there is, return the "deny" action, or, better, -restart the check. - -Since the check reloads the ACL list on a per-packet basis, -there is only a window of opportunity of one packet to -"match" packet against an incorrect rule set. -The workers also do not change anything, only read. -Therefore, it looks like building special structures -to ensure that it does not happen at all might be not -worth it. - -At least not until we have a unit-test able to -reliably catch this condition and test that -the measures applied are effective. Adding the code -which is not possible to exercise is worse than -not adding any code at all. - -So, I opt for "do-nothing" here for the moment. - -reflexive ACLs: single-thread -============================= - -Before we talk multi-thread, is worth revisiting the -design of the reflexive ACLs in the plugin, and -the history of their evolution. - -The very first version of the ACL plugin, shipped in -1701, mostly did the job using the existing components -and gluing them together. Because it needed to work -in bridged forwarding path only, using L2 classifier -as an insertion point appeared natural, also L2 classifier, -being a table with sessions, seemed like a good place -to hold the sessions. - -So, the original design had two conceptual nodes: -one, pointed by the next_miss from the L2 classifier table, -was checking the actual ACL, and inserting session into -the L2 classifier table, and the other one, pointed -to by the next_match within the specific session rule, -was checking the existing session. The timing out -of the existing connections was done in the datapath, -by periodically calling the aging function. - -This decision to use the existing components, -with its attractiveness, did bring a few limitations as well: - -* L2 classifier is a simple mask-and-value match, with -a fixed mask across the table. So, sanely supporting IPv6 -packets with extension headers in that framework was impossible. - -* There is no way to get a backpressure from L2 classifier -depending on memory usage. When it runs out of memory, -it simply crashes the box. When it runs out of memory ? -We don't really know. Depends on how it allocates it. - -* Since we need to match the *reflected* traffic, -we had to create *two* full session entries -in two different directions, which is quite wasteful memory-wise. - -* (showstopper): the L2 classifier runs only in -the bridged data path, so supporting routed data path -would require creating something else entirely different, -which would mean much more headaches support-wise going forward. - -Because of that, I have moved to a different model of -creating a session-5-tuple from the packet data - once, -and then doing all the matching just on that 5-tuple. - -This has allowed to add support for skipping IPv6 extension headers. - -Also, this new version started to store the sessions in a dedicated -bihash-per-interface, with the session key data being -aligned for the ingress packets, and being mirrored for the -egress packets. This allows of significant savings in memory, -because now we need to keep only one copy of the session table per -interface instead of two, and also to only have ONE node for all the lookups, -(L2/L3 path, in/out, IPv4/IPv6) - significantly reducing the code complexity. 
- -Unfortunately, bihash still has the "lack of backpressure" problem, -in a sense that if you try to insert too many entries and run out -of memory in the heap you supplied, you get a crash. - -To somewhat workaround against that, there is a "maximum tested number of sessions" -value, which tracks the currently inserted sessions in the bihash, -and if this number is being approached, a more aggressive cleanup -can happen. If this number is reached, two behaviors are possible: - -* attempt to do the stateless ACL matching and permit the packet - if it succeeds - -* deny the packet - -Currently I have opted for a second one, since it allows for -a better defined behavior, and if you have to permit -the traffic in both directions, why using stateful anyway ? - -In order to be able to do the cleanup, we need to discriminate between -the session types, with each session type having its own idle timeout. -In order to do that, we keep three lists, defined in enum acl_timeout_e: -ACL_TIMEOUT_UDP_IDLE, ACL_TIMEOUT_TCP_IDLE, ACL_TIMEOUT_TCP_TRANSIENT. - -The first one is hopefully obvious - it is just all UDP connections. -They have an idle timeout of 600 seconds. - -The second and third is a bit more subtle. TCP is a complicated protocol, -and we need to tread the fine line between doing too little and doing -too much, and triggering the potential compatibility issues because of -being a "middlebox". - -I decided to split the TCP connections into two classes: -established, and everything else. "Established", means we have seen -the SYN and ACK from both sides (with PUSH obviously masked out). -This is the "active" state of any TCP connection and we would like -to ensure we do not screw it up. So, the connections in this state -have the default idle timer of 24 hours. - -All the rest of the connections have the idle timeout of 2 minutes, -(inspired by an old value of MSL) and based on the observation -that the states this class represent are usually very short lived. - -Once we have these three baskets of connections, it is trivial to -imagine a simple cleanup mechanism to deal with this: take a -TCP transient connection that has been hanging around. - -It is debatable whether we want to do discrimination between the -different TCP transient connections. Assuming we do FIFO (and -the lists allow us to do just that), it means a given connection -on the head of the list has been hanging around for longest. -Thus, if we are short on resources, we might just go ahead and -reuse it within the datapath. - -This is where we are slowly approaching the question -"Why in the world have not you used timer wheel or such ?" - -The answer is simple: within the above constraints, it does -not buy me much. - -Also, timer wheel creates a leaky abstraction with a difficult -to manage corner case. Which corner case ? - -We have a set of objects (sessions) with an event that may -or may not happen (idle timeout timer firing), and a -necessity to reset the idle timeout when there is -activity on the session. - -In the worst case, where we had a 10000 of one-packet -UDP sessions just created 10 minutes ago, we would need -to deal with a spike of 10000 expired timers. - -Of course, if we have the active traffic on all -of these 10000 connections, then we will not have -to deal with that ? Right, but we will still have to deal -with canceling and requeueing the timers. - -In the best possible case, requeueing a timer is -going to be something along the lines of a linked-list -removal and reinsertion. 
- -However, keep in mind we already need to classify the -connections for reuse, so therefore we already have -the linked lists! - -And if we just check these linked lists periodically in -a FIFO fashion, we can get away with a very simple per-packet operation: -writing back the timestamp of "now" into the connection structure. - -Then rather than requeueing the list on a per-packet or per-frame -basis, we can defer this action until the time this session -appears on the head of the FIFO list, and the cleaning -routine makes the decision about whether to discard -the session (because the interval since last activity is bigger -than the idle timeout), or to requeue the session back to -the end of the list (because the last activity was less -than idle timeout ago). - -So, rather than using the timers, we can simply reuse our classification -FIFOs, with the following heuristic: do not look at the session that was -enqueued at time X until X+session_timeout. If we enqueue the sessions -in the order of their initial activity, then we can simply use enqueue -timestamp of the head session as a decision criterion for when we need -to get back at looking at it for the timeout purposes. - -Since the number of FIFOs is small, we get a slightly worse check -performance than with timers, but still O(1). - -We seemingly do quite a few "useless" operations of requeueing the items -back to the tail of the list - but, these are the operations we do not -have to do in the active data path, so overall it is a win. - -(Diversion: I believe this problem is congruent to poll vs. epoll or -events vs. threads, some reading on this subject: -http://web.archive.org/web/20120225022154/http://sheddingbikes.com/posts/1280829388.html) - -We can also can run a TCP-like scheme for adaptively changing -the wait period in the routine that deals with the connection timeouts: -we can attempt to check the connections a couple of times per second -(same as we would advance the timer wheel), and then if we have requeued -close to a max-per-quantum number of connections, we can half the waiting -interval, and if we did not requeue any, we can slowly increment the waiting -interval - which at a steady state should stabilize similar to what the TCP rate -does. - -reflexive ACLs: multi-thread -============================= - -The single-threaded implementation in 1704 used a separate "cleaner" process -to deal with the timing out of the connections. -It is all good and great when you know that there is only a single core -to run everything on, but the existence of the lists proves to be -a massive difficulty when it comes to operating from multiple threads. - -Initial study shows that with a few assumptions (e.g. that the cleaner running in main thread -and the worker have a demarcation point in time where either one or the other one touches -the session in the list) it might be possible to make it work, but the resulting -trickiness of doing it neatly with all the corner cases is quite large. - -So, for the multi-threaded scenario, we need to move the connection -aging back to the same CPU as its creation. - -Luckily we can do this with the help of the interrupts. 
- -So, the design is as follows: the aging thread (acl_fa_session_cleaner_process) -periodically fires the interrupts to the workers interrupt nodes (acl_fa_worker_session_cleaner_process_node.index), -using vlib_node_set_interrupt_pending(), and -the interrupt node acl_fa_worker_conn_cleaner_process() calls acl_fa_check_idle_sessions() -which does the actual job of advancing the lists. And within the actual datapath the only thing we will be -doing is putting the items onto FIFO, and updating the last active time on the existing connection. - -The one "delicate" part is that the worker for one leg of the connection might be different from -the worker of another leg of the connection - but, even if the "owner" tries to free the connection, -nothing terrible can happen - worst case the element of the pool (which is nominally free for a short period) -will get the timestamp updated - same thing about the TCP flags seen. - -A slightly trickier issue arises when the packet initially seen by one worker (thus owned by that worker), -and the return packet processed by another worker, and as a result changes the -the class of the connection (e.g. becomes TCP_ESTABLISHED from TCP_TRANSIENT or vice versa). -If the class changes from one with the shorter idle time to the one with the longer idle time, -then unless we are in the starvation mode where the transient connections are recycled, -we can simply do nothing and let the normal requeue mechanism kick in. If the class changes from the longer idle -timer to the shorter idle timer, then we risk keeping the connection around for longer than needed, which -will affect the resource usage. - -One solution to that is to have NxN ring buffers (where N is the number of workers), such that the non-owner -can signal to the owner the connection# that needs to be requeued out of order. - -A simpler solution though, is to ensure that each FIFO's period is equal to that of a shortest timer. -This way the resource starvation problem is taken care of, at an expense of some additional work. - -This all looks sufficiently nice and simple until a skeleton falls out of the closet: -sometimes we want to clean the connections en masse before they expire. - -There few potential scenarios: -1) removal of an ACL from the interface -2) removal of an interface -3) manual action of an operator (in the future). - -In order to tackle this, we need to modify the logic which decides whether to requeue the -connection on the end of the list, or to delete it due to idle timeout: - -We define a point in time, and have each worker thread fast-forward through its FIFO, -in the process looking for sessions that satisfy the criteria, and either keeping them or requeueing them. - -To keep the ease of appearance to the outside world, we still process this as an event -within the connection cleaner thread, but this event handler does as follows: -1) it creates the bitmap of the sw_if_index values requested to be cleared -2) for each worker, it waits to ensure there is no cleanup operation in progress (and if there is one, -it waits), and then makes a copy of the bitmap, sets the per-worker flag of a cleanup operation, and sends an interrupt. -3) wait until all cleanup operations have completed. - -Within the worker interrupt node, we check if the "cleanup in progress" is set, -and if it is, we check the "fast forward time" value. 
If unset, we initialize it to value now, and compare the -requested bitmap of sw_if_index values (pending_clear_sw_if_index_bitmap) with the bitmap of sw_if_index that this worker deals with. - -(we set the bit in the bitmap every time we enqueue the packet onto a FIFO - serviced_sw_if_index_bitmap in acl_fa_conn_list_add_session). - -If the result of this AND operation is zero - then we can clear the flag of cleanup in progress and return. -Else we kick off the quantum of cleanup, and make sure we get another interrupt ASAP if that cleanup operation returns non-zero, -meaning there is more work to do. -When that operation returns zero, everything has been processed, we can clear the "cleanup-in-progress" flag, and -zeroize the bitmap of sw_if_index-es requested to be cleaned. - -The interrupt node signals its wish to receive an interrupt ASAP by setting interrupt_is_needed -flag within the per-worker structure. The main thread, while waiting for the -cleanup operation to complete, checks if there is a request for interrupt, -and if there is - it sends one. - -This approach gives us a way to mass-clean the connections which is reusing the code of the regular idle -connection cleanup. - -One potential inefficiency is the bitmap values set by the session insertion -in the data path - there is nothing to clear them. - -So, if one rearranges the interface placement with the workers, then the cleanups will cause some unnecessary work. -For now, we consider it an acceptable limitation. It can be resolved by having another per-worker bitmap, which, when set, -would trigger the cleanup of the bits in the serviced_sw_if_index_bitmap). - -=== the end === - diff --git a/src/plugins/acl/acl_multicore_doc.rst b/src/plugins/acl/acl_multicore_doc.rst new file mode 100644 index 00000000000..142b6b216d2 --- /dev/null +++ b/src/plugins/acl/acl_multicore_doc.rst @@ -0,0 +1,354 @@ +Multicore support for ACL plugin +================================ + +This captures some considerations and design decisions that I have made, +both for my own memory later on (“what the hell was I thinking?!?”), and +for anyone interested to criticize/improve/hack on this code. + +One of the factors taken into account while making these decisions, was +the relative emphasis on the multi-thread vs. single-thread use cases: +the latter is the vastly more prevalent. But, one can not optimize the +single-thread performance without having a functioning code for +multi-thread. + +stateless ACLs +-------------- + +The stateless trivially parallelizes, and the only potential for the +race between the different threads is during the reconfiguration, at the +time of replacing the old ACL being checked, with the new ACL. + +In case an acl_add_replace is being used to replace the rules within the +existing entry, a reallocation of ``am->acls[X].rules`` vector will +happen and potentially a change in count. + +acl_match_5tuple() has the following code: + +.. code:: c + + a = am->acls + acl_index; + for (i = 0; i < a->count; i++) + { + r = a->rules + i; + . . . + +Ideally we should be immune from a->rules changing, but the problem +arises if the count changes in flight, and the new ruleset is smaller - +then we will attempt to “match” against the free memory. + +This can(?) be solved by replacing the for() with while(), so the +comparison happens at each iteration. + +full_acl_match_5tuple(), which iterates over the list of ACLs, is a bit +less immune, since it takes the pointer to the vector to iterate and +keeps a local copy of that pointer. 
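For the acl_match_5tuple() snippet above, the suggested for()-to-while() change boils down to re-reading the rule count on every pass, which narrows the window rather than eliminating it. A minimal sketch with stand-in types, not the plugin's actual code:

.. code:: c

   #include <stdint.h>

   typedef struct { uint32_t dummy; } rule_sketch_t;                   /* placeholder rule */
   typedef struct { uint32_t count; rule_sketch_t *rules; } acl_sketch_t;

   /* placeholder for the real 5-tuple comparison */
   static int rule_matches (const rule_sketch_t *r, const void *pkt)
   { (void) r; (void) pkt; return 0; }

   static int
   match_rules (const acl_sketch_t *a, const void *pkt_5tuple)
   {
     uint32_t i = 0;
     while (i < a->count)            /* bound is re-checked on every iteration */
       {
         if (rule_matches (&a->rules[i], pkt_5tuple))
           return (int) i;           /* index of the first matching rule */
         i++;
       }
     return -1;                      /* no rule matched */
   }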
+ +This race can be solved by checking the current pointer to the vector +with the source pointer, and seeing if there is an (unlikely) change, +and if there is, return the “deny” action, or, better, restart the +check. + +Since the check reloads the ACL list on a per-packet basis, there is +only a window of opportunity of one packet to “match” packet against an +incorrect rule set. The workers also do not change anything, only read. +Therefore, it looks like building special structures to ensure that it +does not happen at all might be not worth it. + +At least not until we have a unit-test able to reliably catch this +condition and test that the measures applied are effective. Adding the +code which is not possible to exercise is worse than not adding any code +at all. + +So, I opt for “do-nothing” here for the moment. + +reflexive ACLs: single-thread +----------------------------- + +Before we talk multi-thread, is worth revisiting the design of the +reflexive ACLs in the plugin, and the history of their evolution. + +The very first version of the ACL plugin, shipped in 1701, mostly did +the job using the existing components and gluing them together. Because +it needed to work in bridged forwarding path only, using L2 classifier +as an insertion point appeared natural, also L2 classifier, being a +table with sessions, seemed like a good place to hold the sessions. + +So, the original design had two conceptual nodes: one, pointed by the +next_miss from the L2 classifier table, was checking the actual ACL, and +inserting session into the L2 classifier table, and the other one, +pointed to by the next_match within the specific session rule, was +checking the existing session. The timing out of the existing +connections was done in the datapath, by periodically calling the aging +function. + +This decision to use the existing components, with its attractiveness, +did bring a few limitations as well: + +- L2 classifier is a simple mask-and-value match, with a fixed mask + across the table. So, sanely supporting IPv6 packets with extension + headers in that framework was impossible. + +- There is no way to get a backpressure from L2 classifier depending on + memory usage. When it runs out of memory, it simply crashes the box. + When it runs out of memory ? We don’t really know. Depends on how it + allocates it. + +- Since we need to match the *reflected* traffic, we had to create + *two* full session entries in two different directions, which is + quite wasteful memory-wise. + +- (showstopper): the L2 classifier runs only in the bridged data path, + so supporting routed data path would require creating something else + entirely different, which would mean much more headaches support-wise + going forward. + +Because of that, I have moved to a different model of creating a +session-5-tuple from the packet data - once, and then doing all the +matching just on that 5-tuple. + +This has allowed to add support for skipping IPv6 extension headers. + +Also, this new version started to store the sessions in a dedicated +bihash-per-interface, with the session key data being aligned for the +ingress packets, and being mirrored for the egress packets. This allows +of significant savings in memory, because now we need to keep only one +copy of the session table per interface instead of two, and also to only +have ONE node for all the lookups, (L2/L3 path, in/out, IPv4/IPv6) - +significantly reducing the code complexity. 
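One way to picture the “mirrored” session key construction is the sketch below; the field names are hypothetical, and the real key also carries the protocol, flags and so on:

.. code:: c

   #include <stdint.h>
   #include <string.h>

   typedef struct {
     uint8_t  src[16], dst[16];
     uint16_t sport, dport;
   } tuple_sketch_t;

   /* Both directions of a flow map onto the single entry created at ingress,
      because the egress lookup mirrors the addresses and ports first. */
   static tuple_sketch_t
   make_session_key (const tuple_sketch_t *pkt, int is_egress)
   {
     tuple_sketch_t key = *pkt;
     if (is_egress)
       {
         memcpy (key.src, pkt->dst, sizeof (key.src));
         memcpy (key.dst, pkt->src, sizeof (key.dst));
         key.sport = pkt->dport;
         key.dport = pkt->sport;
       }
     return key;
   }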
Unfortunately, bihash still has the “lack of backpressure” problem, in
the sense that if you try to insert too many entries and run out of
memory in the heap you supplied, you get a crash.

To work around that somewhat, there is a “maximum tested number of
sessions” value: the number of sessions currently inserted in the bihash
is tracked, and if it approaches this value, a more aggressive cleanup
can happen. If the limit is reached, two behaviors are possible:

- attempt to do the stateless ACL matching and permit the packet if it
  succeeds

- deny the packet

Currently I have opted for the second one, since it allows for a
better-defined behavior - and if you have to permit the traffic in both
directions, why use stateful ACLs anyway?

To be able to do the cleanup, we need to discriminate between the session
types, with each session type having its own idle timeout. To do that, we
keep three lists, defined in enum acl_timeout_e: ACL_TIMEOUT_UDP_IDLE,
ACL_TIMEOUT_TCP_IDLE, ACL_TIMEOUT_TCP_TRANSIENT.

The first one is hopefully obvious - it is just all UDP connections. They
have an idle timeout of 600 seconds.

The second and third are a bit more subtle. TCP is a complicated
protocol, and we need to tread the fine line between doing too little and
doing too much, without triggering potential compatibility issues from
being a “middlebox”.

I decided to split the TCP connections into two classes: established, and
everything else. “Established” means we have seen the SYN and ACK from
both sides (with PUSH obviously masked out). This is the “active” state
of any TCP connection and we would like to ensure we do not screw it up.
So, the connections in this state have a default idle timer of 24 hours.

All the rest of the connections have an idle timeout of 2 minutes
(inspired by an old value of MSL), based on the observation that the
states this class represents are usually very short-lived.

Once we have these three baskets of connections, it is trivial to imagine
a simple cleanup mechanism to deal with this: take the TCP transient
connection that has been hanging around the longest.

It is debatable whether we want to discriminate between the different TCP
transient connections. Assuming we do FIFO (and the lists allow us to do
just that), the connection at the head of the list has been hanging
around the longest. Thus, if we are short on resources, we might just go
ahead and reuse it within the datapath.

This is where we are slowly approaching the question “Why in the world
have you not used a timer wheel or some such?”

The answer is simple: within the above constraints, it does not buy me
much.

Also, a timer wheel creates a leaky abstraction with a
difficult-to-manage corner case. Which corner case?

We have a set of objects (sessions) with an event that may or may not
happen (the idle timeout timer firing), and a necessity to reset the idle
timeout when there is activity on the session.

In the worst case, where we had 10000 one-packet UDP sessions all created
10 minutes ago, we would need to deal with a spike of 10000 expired
timers.

Of course, if we have active traffic on all of these 10000 connections,
then we will not have to deal with that, right? Right - but we will still
have to deal with canceling and requeueing the timers.

In the best possible case, requeueing a timer is going to be something
along the lines of a linked-list removal and reinsertion.
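For reference, the three connection classes and their idle timeouts
described above can be summarized as follows. This is only a sketch: the
enum constant names are the ones given in the text, but the sentinel
ACL_N_TIMEOUTS and the timeout table are illustrative, not the plugin’s
actual definitions.

.. code:: c

   /* One FIFO list of sessions per class, each with its own idle timeout
      (values as stated above: 600 s for UDP, 24 h for established TCP,
      2 minutes for everything else). */
   enum acl_timeout_e
   {
     ACL_TIMEOUT_UDP_IDLE = 0,
     ACL_TIMEOUT_TCP_IDLE,         /* "established": SYN and ACK seen both ways */
     ACL_TIMEOUT_TCP_TRANSIENT,
     ACL_N_TIMEOUTS                /* sentinel, added here for the sketch */
   };

   static const unsigned int idle_timeout_sec[ACL_N_TIMEOUTS] = {
     [ACL_TIMEOUT_UDP_IDLE] = 600,
     [ACL_TIMEOUT_TCP_IDLE] = 24 * 3600,
     [ACL_TIMEOUT_TCP_TRANSIENT] = 120,
   };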
However, keep in mind that we already need to classify the connections
for reuse, so we already have the linked lists!

And if we just check these linked lists periodically in a FIFO fashion,
we can get away with a very simple per-packet operation: writing back the
timestamp of “now” into the connection structure.

Then, rather than requeueing on a per-packet or per-frame basis, we can
defer this action until the session appears at the head of the FIFO list,
and the cleaning routine makes the decision about whether to discard the
session (because the interval since the last activity is bigger than the
idle timeout), or to requeue the session back to the end of the list
(because the last activity was less than the idle timeout ago).

So, rather than using timers, we can simply reuse our classification
FIFOs, with the following heuristic: do not look at a session that was
enqueued at time X until X+session_timeout. If we enqueue the sessions in
the order of their initial activity, then we can simply use the enqueue
timestamp of the head session as a decision criterion for when we need to
get back to looking at it for timeout purposes.

Since the number of FIFOs is small, we get a slightly worse check
performance than with timers, but still O(1).

We seemingly do quite a few “useless” operations of requeueing the items
back to the tail of the list - but these are operations we do not have to
do in the active data path, so overall it is a win.

(Diversion: I believe this problem is congruent to poll vs. epoll or
events vs. threads; some reading on this subject:
http://web.archive.org/web/20120225022154/http://sheddingbikes.com/posts/1280829388.html)

We can also run a TCP-like scheme for adaptively changing the wait period
in the routine that deals with the connection timeouts: we can attempt to
check the connections a couple of times per second (the same rate at
which we would advance a timer wheel), and then, if we have requeued
close to the max-per-quantum number of connections, halve the waiting
interval, and if we did not requeue any, slowly increase the waiting
interval - which at steady state should stabilize, similar to what the
TCP rate does.

reflexive ACLs: multi-thread
----------------------------

The single-threaded implementation in 1704 used a separate “cleaner”
process to deal with the timing out of the connections. That is all good
and great when you know that there is only a single core to run
everything on, but the existence of the lists proves to be a massive
difficulty when it comes to operating from multiple threads.

An initial study shows that with a few assumptions (e.g. that the cleaner
running in the main thread and the worker have a demarcation point in
time where either one or the other touches the session in the list) it
might be possible to make it work, but the resulting trickiness of doing
it neatly, with all the corner cases, is quite large.

So, for the multi-threaded scenario, we need to move the connection aging
back to the same CPU as its creation.

Luckily, we can do this with the help of interrupts.
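A rough sketch of what “with the help of interrupts” means in practice is
below. It assumes the vlib API of that era (the per-thread vlib_mains[]
vector and vlib_node_set_interrupt_pending(), which is the call named in
the next paragraph); poke_worker_cleaners and worker_cleaner_node_index
are illustrative names, not the plugin’s own.

.. code:: c

   #include <vlib/vlib.h>

   /* Hypothetical helper run from the main-thread cleaner process: ask
      every worker to run its connection-cleaner interrupt node, so each
      worker ages the sessions it owns on its own core. */
   static void
   poke_worker_cleaners (u32 worker_cleaner_node_index)
   {
     int i;
     /* vlib_mains[0] is the main thread; the workers start at index 1 */
     for (i = 1; i < vec_len (vlib_mains); i++)
       if (vlib_mains[i])
         vlib_node_set_interrupt_pending (vlib_mains[i],
                                          worker_cleaner_node_index);
   }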
So, the design is as follows: the aging thread
(acl_fa_session_cleaner_process) periodically fires interrupts to the
workers’ interrupt nodes
(acl_fa_worker_session_cleaner_process_node.index), using
vlib_node_set_interrupt_pending(), and the interrupt node
acl_fa_worker_conn_cleaner_process() calls acl_fa_check_idle_sessions(),
which does the actual job of advancing the lists. Within the actual
datapath, the only things we do are putting the items onto the FIFO and
updating the last-active time on the existing connection.

The one “delicate” part is that the worker for one leg of the connection
might be different from the worker of the other leg of the connection -
but even if the “owner” tries to free the connection, nothing terrible
can happen - worst case, the element of the pool (which is nominally free
for a short period) will get its timestamp updated; same thing for the
TCP flags seen.

A slightly trickier issue arises when the packet is initially seen by one
worker (which thus owns the connection), and the return packet is
processed by another worker and, as a result, changes the class of the
connection (e.g. it becomes TCP_ESTABLISHED from TCP_TRANSIENT or vice
versa). If the class changes from one with a shorter idle time to one
with a longer idle time, then, unless we are in the starvation mode where
the transient connections are recycled, we can simply do nothing and let
the normal requeue mechanism kick in. If the class changes from the
longer idle timer to the shorter one, then we risk keeping the connection
around for longer than needed, which will affect the resource usage.

One solution to that is to have NxN ring buffers (where N is the number
of workers), such that the non-owner can signal to the owner the
connection# that needs to be requeued out of order.

A simpler solution, though, is to ensure that each FIFO’s period is equal
to that of the shortest timer. This way the resource starvation problem
is taken care of, at the expense of some additional work.

This all looks sufficiently nice and simple until a skeleton falls out of
the closet: sometimes we want to clean the connections en masse before
they expire.

There are a few potential scenarios:

1) removal of an ACL from the interface

2) removal of an interface

3) manual action of an operator (in the future).

In order to tackle this, we need to modify the logic which decides
whether to requeue the connection at the end of the list, or to delete it
due to idle timeout:

We define a point in time, and have each worker thread fast-forward
through its FIFO, in the process looking for sessions that satisfy the
criteria, and either keeping them or requeueing them.

To keep things looking simple to the outside world, we still process this
as an event within the connection cleaner thread, but this event handler
does the following:

1) it creates the bitmap of the sw_if_index values requested to be
   cleared;

2) for each worker, it ensures there is no cleanup operation already in
   progress (waiting if there is one), then makes a copy of the bitmap,
   sets the per-worker flag of a cleanup operation, and sends an
   interrupt;

3) it waits until all cleanup operations have completed.

Within the worker interrupt node, we check whether the “cleanup in
progress” flag is set, and if it is, we check the “fast forward time”
value.
If it is unset, we initialize it to the current time, and AND the
requested bitmap of sw_if_index values (pending_clear_sw_if_index_bitmap)
with the bitmap of the sw_if_index values that this worker deals with.

(We set the bit in that bitmap - serviced_sw_if_index_bitmap, in
acl_fa_conn_list_add_session - every time we enqueue a packet onto a
FIFO.)

If the result of this AND operation is zero, then we can clear the
“cleanup in progress” flag and return. Otherwise, we kick off a quantum
of cleanup, and make sure we get another interrupt ASAP if that cleanup
operation returns non-zero, meaning there is more work to do. When that
operation returns zero, everything has been processed: we can clear the
“cleanup-in-progress” flag and zeroize the bitmap of sw_if_index values
requested to be cleaned.

The interrupt node signals its wish to receive an interrupt ASAP by
setting the interrupt_is_needed flag within the per-worker structure. The
main thread, while waiting for the cleanup operation to complete, checks
whether there is a request for an interrupt, and if there is, it sends
one.

This approach gives us a way to mass-clean the connections that reuses
the code of the regular idle connection cleanup.

One potential inefficiency is that the bitmap bits set by the session
insertion in the data path are never cleared - there is nothing to clear
them.

So, if one rearranges the placement of interfaces across the workers,
then the cleanups will cause some unnecessary work. For now, we consider
it an acceptable limitation. It can be resolved by having another
per-worker bitmap which, when set, would trigger the cleanup of the bits
in the serviced_sw_if_index_bitmap.

=== the end ===