path: root/src/vppinfra/atomics.h
2021-10-12  ipsec: Performance improvement of ipsec4_output_node using flow cache  (Govindarajan Mohandoss, 1 file changed, +2/-0)
Adding flow cache support to improve outbound IPv4/IPsec SPD lookup performance.

Details about the flow cache:

Mechanism:
1. The first packet of a flow undergoes a linear search in the SPD table. Once a policy match is found, a new entry is added to the flow cache. From the 2nd packet onwards, the policy lookup happens in the flow cache.
2. The flow cache is implemented using bihash without collision handling, which avoids the logic to age out or recycle old flows. Whenever a collision occurs, the old entry is overwritten by the new one. The worst case is when all 256 packets in a batch collide and fall back to linear search; the average and best case are O(1).
3. The size of the flow cache is fixed and is chosen based on the number of flows to be supported. The default is 1 million flows; making this configurable is a possible next step.
4. Whenever an SPD rule is added or deleted by the control plane, the flow cache entries are completely deleted (reset) in the control plane. The assumption is that SPD rule add/del is not a frequent control-plane operation. The reset is done by putting the data plane into a fallback mode that bypasses the flow cache and does linear search until the SPD rule add/delete operation completes; once the rule is successfully added or deleted, the data plane is again allowed to use the flow cache. The flow cache is reset only after flushing the in-flight packets from all worker cores using vlib_worker_wait_one_loop().

Details about bihash usage:
1. A new bihash template (16_8) is added to support the IPv4 5-tuple. BIHASH_KVP_PER_PAGE and BIHASH_KVP_AT_BUCKET_LEVEL are set to 1 in the new template, meaning only one KVP is supported per bucket.
2. Collision handling is avoided by calling BV (clib_bihash_add_or_overwrite_stale). Through the stale callback function pointer, the KVP entry is overwritten on collision.
3. Flow cache reset is done using BV (clib_bihash_foreach_key_value_pair). Through the callback function pointer, the KVP value is reset to ~0ULL.

MRR performance numbers with 1 core, 1 ESP tunnel, null-encrypt, 64B packets, for different SPD policy matching indices (Baseline/Optimized):

  SPD policy index    1          10         100        1000
  Units (Base/Opt)    MPPS/MPPS  MPPS/MPPS  MPPS/MPPS  KPPS/MPPS
  ARM Neoverse N1     5.2/4.84   4.55/4.84  2.11/4.84  329.5/4.84
  ARM TX2             2.81/2.6   2.51/2.6   1.27/2.6   176.62/2.6
  INTEL SKX           4.93/4.48  4.29/4.46  2.05/4.48  336.79/4.47

Next steps:
1. Enable/disable of the flow cache can be made configurable through startup conf at the IPsec level.
2. Bihash configuration, such as the number of buckets and the memory size, can likewise be exposed through startup conf.
3. Dual/quad loop unrolling can be applied around bihash to further improve performance.
4. The same flow cache logic can be applied to IPv6 and to the IPsec inbound direction. A deeper and wider flow cache using bihash_40_8 could replace the existing bihash_16_8, making it common to IPv4 and IPv6 in both the outbound and inbound directions.

The following changes were made based on review comments:
1. ON/OFF flow cache through startup conf. Default: OFF.
2. Flow cache stale-entry detection using an epoch counter.
3. Avoid host-order endianness conversion during flow cache lookup.
4. Move the IPsec startup conf to a common file.
5. Added an SPD flow cache unit test case.
6. Replaced bihash with vectors to implement the flow cache (a conceptual sketch of the resulting scheme follows this entry).
7. The ipsec_add_del_policy API is not mp-safe; cleaned up the in-flight packets check in the control plane.
Type: improvement
Signed-off-by: mgovind <govindarajan.Mohandoss@arm.com>
Signed-off-by: Zachary Leaf <zachary.leaf@arm.com>
Tested-by: Jieqiang Wang <jieqiang.wang@arm.com>
Change-Id: I62b4d6625fbc6caf292427a5d2046aa5672b2006
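To make the mechanism above easier to follow, here is a minimal, self-contained C sketch of a collision-overwrite flow cache with epoch-based stale detection, in the spirit of the vector-based final revision. All names here (flow_key_t, flow_hash, flow_cache_lookup, the 1M-entry array) are hypothetical illustrations, not the actual VPP symbols:

  #include <stdint.h>

  #define FLOW_CACHE_SIZE (1 << 20) /* default: 1M flows; static here only for brevity */

  typedef struct
  {
    uint32_t src, dst;
    uint16_t sport, dport;
    uint8_t proto;
  } flow_key_t;

  typedef struct
  {
    flow_key_t key;
    uint32_t policy_index;
    uint32_t epoch; /* entry is valid only if this matches the current SPD epoch */
  } flow_cache_entry_t;

  static flow_cache_entry_t flow_cache[FLOW_CACHE_SIZE];

  static inline uint32_t
  flow_hash (const flow_key_t *k)
  {
    /* toy mix; a real implementation would use a proper 5-tuple hash */
    return k->src ^ k->dst ^ (((uint32_t) k->sport << 16) | k->dport) ^ k->proto;
  }

  static inline int
  keys_equal (const flow_key_t *a, const flow_key_t *b)
  {
    /* field-wise compare avoids reading struct padding */
    return a->src == b->src && a->dst == b->dst && a->sport == b->sport
           && a->dport == b->dport && a->proto == b->proto;
  }

  /* Returns the matching policy index. spd_epoch is bumped by the control
     plane on every SPD add/del, invalidating all cached entries at once
     with no per-entry work. */
  static uint32_t
  flow_cache_lookup (const flow_key_t *k, uint32_t spd_epoch,
                     uint32_t (*spd_linear_search) (const flow_key_t *))
  {
    flow_cache_entry_t *e = &flow_cache[flow_hash (k) & (FLOW_CACHE_SIZE - 1)];

    if (e->epoch == spd_epoch && keys_equal (&e->key, k))
      return e->policy_index; /* hit: O(1) from the 2nd packet of a flow */

    /* Miss, collision or stale: linear SPD search, then overwrite the slot
       unconditionally; no chaining means no aging or recycling logic. */
    e->key = *k;
    e->policy_index = spd_linear_search (k);
    e->epoch = spd_epoch;
    return e->policy_index;
  }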
2021-02-08  virtio: add atomic call for kicking  (Mohsin Kazmi, 1 file changed, +3/-0)
Type: fix
Signed-off-by: Mohsin Kazmi <sykazmi@cisco.com>
Change-Id: I41faa2ca249ff75e564a732af896e6b5d76bf665
2020-12-23  svm: remove fifo segment heap  (Florin Coras, 1 file changed, +3/-0)
Type: improvement
Signed-off-by: Florin Coras <fcoras@cisco.com>
Change-Id: I518e096fe13847759806ff62009e73fd8f7451b7
2020-11-19  svm: move chunk locks to linked list  (Florin Coras, 1 file changed, +3/-0)
We only need to protect the linked lists.

Type: improvement
Signed-off-by: Florin Coras <fcoras@cisco.com>
Change-Id: Ie1542073f3993acfc66d99096b08bf9ecd10a49b
2019-08-01  vppinfra: refactor clib_spinlock_t to use compare and swap  (jaszha03, 1 file changed, +3/-0)
Tested the performance of a CAS implementation (using __atomic_compare_exchange) against a TAS implementation (using __atomic_exchange) with test_spinlock.c and found a performance improvement. The generated assembly for the CAS and TAS implementations shows that TAS always executes a load-store pair on every spin, while CAS places a branch between the load and the store, so that while the lock is held each spin performs only a load and the store is attempted only once the lock is observed to be free (both acquire paths are sketched below).

Benchmarking: the results below are the cycle counts from test_spinlock.c, configured so that for 10000 iterations, 12 threads on separate cores are spawned, each of which increments a global counter 10000 times in each iteration. For the A72, 8 threads are spawned in each test.

  x86 Xeon TAS:      7.333e8, 7.605e8, 7.535e8, 7.485e8, 7.321e8
  x86 Xeon CAS:      5.842e8, 5.433e8, 5.389e8, 5.983e8, 5.552e8
  AArch64 ThX2* TAS: 9.852e7, 10.209e7, 9.190e7, 9.600e7, 9.224e7
  AArch64 ThX2* CAS: 7.640e7, 7.486e7, 7.425e7, 7.269e7, 7.534e7
  A72 TAS:           7.289e6, 6.963e6, 7.208e6, 6.976e6, 7.200e6
  A72 CAS:           1.695e6, 1.608e6, 1.600e6, 1.634e6, 1.746e6

  *ThunderX2 used the additional gcc options "-march=armv8.1-a+crc+crypto+lse"

Type: refactor
Change-Id: Ic5cd97991804f6b012707fad1a5d1a6edb96cd3d
Signed-off-by: Jason Zhang <jason.zhang2@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Lijian Zhang <Lijian.Zhang@arm.com>
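For reference, a minimal sketch of the two acquire paths being compared, written directly against the GCC __atomic builtins; the real clib_spinlock_lock differs in details such as the lock type and spin hints:

  #include <stdbool.h>

  typedef volatile int lock_t; /* illustrative; not the actual clib_spinlock_t */

  /* TAS: unconditional exchange, so every spin iteration is a load + store. */
  static void
  lock_tas (lock_t *l)
  {
    while (__atomic_exchange_n (l, 1, __ATOMIC_ACQUIRE))
      ;
  }

  /* CAS: the compare sits between the load and the store, so while the lock
     is held each spin iteration is a load only; the store happens only once
     the lock is observed free. */
  static void
  lock_cas (lock_t *l)
  {
    int expected = 0;
    while (!__atomic_compare_exchange_n (l, &expected, 1, false /* strong */,
                                         __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
      expected = 0; /* a failed CAS wrote the observed value into 'expected' */
  }

  static void
  unlock (lock_t *l)
  {
    __atomic_store_n (l, 0, __ATOMIC_RELEASE);
  }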
2019-06-05  Switch atomic release API from __sync to __atomic builtin.  (Sirshak Das, 1 file changed, +1/-1)
__sync_lock_release is switched to __atomic_store for code consistency, although both generate the same instructions with current compilers.

Change-Id: I37d320509e43a4c2b8a49af6346dc4a43ca2f535
Signed-off-by: Sirshak Das <sirshak.das@arm.com>
Reviewed-by: Lijian Zhang <Lijian.Zhang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
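A minimal sketch of the substitution (the lock word and wrapper name are illustrative):

  /* Both forms store 0 with release semantics and compile to the same
     instructions with current compilers. */
  static inline void
  spinlock_release (volatile int *lock)
  {
    /* before: __sync_lock_release (lock); */
    __atomic_store_n (lock, 0, __ATOMIC_RELEASE);
  }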
2019-06-05  Switch atomic test and set API from __sync to __atomic builtin  (Sirshak Das, 1 file changed, +1/-1)
__sync_test_and_set uses a full memory barrier on AArch64, whereas __atomic_exchange with ACQUIRE ordering uses a load-acquire.

Change-Id: Ifdf2481db3b9dde6c5842d75671402862adb6d81
Signed-off-by: Sirshak Das <sirshak.das@arm.com>
Reviewed-by: Lijian Zhang <Lijian.Zhang@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
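A minimal sketch of the two forms (the wrapper names are illustrative; in both cases the return value is the previous lock word):

  /* legacy form: full barrier on AArch64, per the message above */
  static inline int
  tas_sync (volatile int *lock)
  {
    return __sync_lock_test_and_set (lock, 1);
  }

  /* __atomic form: load-acquire only, which suffices for lock acquisition */
  static inline int
  tas_atomic (volatile int *lock)
  {
    return __atomic_exchange_n (lock, 1, __ATOMIC_ACQUIRE);
  }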
2019-04-16  svm_fifo rework to avoid contention on cursize  (Sirshak Das, 1 file changed, +3/-0)
Problems addressed:
- Contention on cursize by producer and consumer.
- Reduce the number of modulo operations.

Changes:
- Synchronization between producer and consumer changed from cursize to head and tail indexes. Implication: this reduces the usable size of the fifo by 1.
- Weaker memory-ordering C++11 atomics are used to access head and tail, according to the producer or consumer role.
- Head and tail indexes are unsigned 32-bit integers; additions and subtractions on them are implicitly modulo 2^32.
- Weaker memory-ordering variants of max_enq, max_deq, is_empty and is_full are added and used appropriately in all places (see the sketch after this entry).

Performance improvement (iperf3 via hoststack), iperf3 server: Marvell ThunderX2 (AArch64), iperf3 client: Skylake (x86): ~6% (256 rxd/txd) to ~11% (2048 rxd/txd).

Change-Id: I1d484e000e437430fdd5a819657d1c6b62443018
Signed-off-by: Sirshak Das <sirshak.das@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
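A minimal sketch of the head/tail scheme under the assumptions stated in the message; the type and function names are illustrative, not the actual svm_fifo ones:

  #include <stdint.h>

  /* Head and tail are free-running u32 counters, so their arithmetic is
     implicitly modulo 2^32; one unit of capacity is given up to
     disambiguate full from empty, per the message above. */
  typedef struct
  {
    uint32_t head; /* written only by the consumer */
    uint32_t tail; /* written only by the producer */
    uint32_t size;
  } fifo_t;

  /* Producer side: its own tail can be loaded relaxed; the peer-written
     head needs an acquire load. */
  static inline uint32_t
  fifo_max_enq (fifo_t *f)
  {
    uint32_t head = __atomic_load_n (&f->head, __ATOMIC_ACQUIRE);
    uint32_t tail = __atomic_load_n (&f->tail, __ATOMIC_RELAXED);
    return f->size - 1 - (tail - head); /* subtraction wraps correctly */
  }

  /* Consumer side: the roles of the two loads are symmetric. */
  static inline uint32_t
  fifo_max_deq (fifo_t *f)
  {
    uint32_t tail = __atomic_load_n (&f->tail, __ATOMIC_ACQUIRE);
    uint32_t head = __atomic_load_n (&f->head, __ATOMIC_RELAXED);
    return tail - head;
  }

  static inline int
  fifo_is_empty (fifo_t *f)
  {
    return fifo_max_deq (f) == 0;
  }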
2019-03-22  svm/atomics: add clib_atomic_swap_rel_n  (Florin Coras, 1 file changed, +1/-0)
Change-Id: Iea2c173000570043beafef58ca923463ce76d872
Signed-off-by: Florin Coras <fcoras@cisco.com>
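The new macro is presumably a thin wrapper over the __atomic builtin, consistent with the clib_atomic_*_n naming; a sketch, not verified against the tree:

  /* exchange *a for b, returning the old value; release ordering makes
     all prior writes visible before the swap */
  #define clib_atomic_swap_rel_n(a, b) __atomic_exchange_n ((a), (b), __ATOMIC_RELEASE)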
2018-11-28  Use acquire/release ordering when accessing svm_fifo shared variable cursize  (Sirshak Das, 1 file changed, +4/-0)
Improves TCP iperf3 performance by ~3% on AArch64.

Change-Id: I1e51bd8403ba45ec6af4c2f96b95e884c1ae0d67
Signed-off-by: Sirshak Das <sirshak.das@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
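A minimal sketch of the acquire/release pairing on a shared size counter, with illustrative names. The producer's release operation ensures the payload written before the size bump is visible to any consumer whose acquire load observes the new size:

  #include <stdint.h>

  static inline void
  producer_commit (uint32_t *cursize, uint32_t nbytes)
  {
    /* ... payload copied into the fifo before this point ... */
    __atomic_add_fetch (cursize, nbytes, __ATOMIC_RELEASE);
  }

  static inline uint32_t
  consumer_size (uint32_t *cursize)
  {
    /* reads after this load cannot be hoisted above it */
    return __atomic_load_n (cursize, __ATOMIC_ACQUIRE);
  }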
2018-11-05  Enable atomic swap and store macro with acquire and release ordering  (Sirshak Das, 1 file changed, +3/-2)
Add atomic swap and store macros with acquire and release ordering, respectively. The variable in question is interrupt_pending, which input nodes use as a guard variable for processing the device queue. The atomic swap uses acquire ordering because reads or writes that follow it in program order must not be reordered before the swap. The atomic store uses release ordering because, after the store, the node is added to the pending list.

Change-Id: I1be49e91a15c58d0bf21ff5ba1bd37d5d7d12f7a
Original-patch-by: Damjan Marion <damarion@cisco.com>
Signed-off-by: Sirshak Das <sirshak.das@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
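A minimal sketch of the guard-variable pattern described above; the struct and function names are illustrative, not the actual VPP ones:

  #include <stdint.h>

  typedef struct
  {
    uint32_t interrupt_pending;
  } rxq_t;

  /* Interrupt path: release store, so the write is visible before the
     node is subsequently added to the pending list. */
  static inline void
  set_interrupt (rxq_t *q)
  {
    __atomic_store_n (&q->interrupt_pending, 1, __ATOMIC_RELEASE);
    /* ... node added to pending list here ... */
  }

  /* Input node: consume-and-clear in one step; acquire ordering keeps the
     device-queue processing that follows from being reordered before it. */
  static inline uint32_t
  take_interrupt (rxq_t *q)
  {
    return __atomic_exchange_n (&q->interrupt_pending, 0, __ATOMIC_ACQUIRE);
  }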
2018-10-19  vppinfra: add atomic macros for __sync builtins  (Sirshak Das, 1 file changed, +45/-0)
This is the first part of the addition of atomic macros, covering only the macros for the __sync builtins.
- Based on an earlier patch by Damjan (https://gerrit.fd.io/r/#/c/10729/)

Additionally:
- The clib_atomic_release macro is added and used in the absence of any memory barrier.
- clib_atomic_bool_cmp_and_swap is added.

Change-Id: Ie4e48c1e184a652018d1d0d87c4be80ddd180a3b
Original-patch-by: Damjan Marion <damarion@cisco.com>
Signed-off-by: Sirshak Das <sirshak.das@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
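A hypothetical excerpt of what this macro layer looks like; the exact list and definitions live in atomics.h, and the spelling of clib_atomic_release below is an assumption rather than a quotation of the patch:

  /* thin wrappers over the legacy __sync builtins */
  #define clib_atomic_fetch_add(a, b) __sync_fetch_and_add ((a), (b))
  #define clib_atomic_fetch_sub(a, b) __sync_fetch_and_sub ((a), (b))
  #define clib_atomic_bool_cmp_and_swap(addr, old, new_) \
    __sync_bool_compare_and_swap ((addr), (old), (new_))

  /* added for use where no memory barrier existed before (assumed form) */
  #define clib_atomic_release(a) __atomic_store_n ((a), 0, __ATOMIC_RELEASE)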