From 374954b9d648f503f6783325a1266457953a998d Mon Sep 17 00:00:00 2001 From: Tibor Frank Date: Wed, 3 May 2023 13:53:27 +0000 Subject: C-Docs: New structure Change-Id: I73d107f94b28b138f3350a9e1eedb0555583a9ca Signed-off-by: Tibor Frank --- docs/content/methodology/overview/_index.md | 6 + .../overview/dut_state_considerations.md | 148 +++++++++++++++++++++ .../methodology/overview/multi_core_speedup.md | 51 +++++++ .../methodology/overview/per_thread_resources.md | 101 ++++++++++++++ docs/content/methodology/overview/terminology.md | 97 ++++++++++++++ .../methodology/overview/vpp_forwarding_modes.md | 104 +++++++++++++++ 6 files changed, 507 insertions(+) create mode 100644 docs/content/methodology/overview/_index.md create mode 100644 docs/content/methodology/overview/dut_state_considerations.md create mode 100644 docs/content/methodology/overview/multi_core_speedup.md create mode 100644 docs/content/methodology/overview/per_thread_resources.md create mode 100644 docs/content/methodology/overview/terminology.md create mode 100644 docs/content/methodology/overview/vpp_forwarding_modes.md diff --git a/docs/content/methodology/overview/_index.md b/docs/content/methodology/overview/_index.md new file mode 100644 index 0000000000..10f362013f --- /dev/null +++ b/docs/content/methodology/overview/_index.md @@ -0,0 +1,6 @@ +--- +bookCollapseSection: true +bookFlatSection: false +title: "Overview" +weight: 1 +--- diff --git a/docs/content/methodology/overview/dut_state_considerations.md b/docs/content/methodology/overview/dut_state_considerations.md new file mode 100644 index 0000000000..eca10a22cd --- /dev/null +++ b/docs/content/methodology/overview/dut_state_considerations.md @@ -0,0 +1,148 @@ +--- +title: "DUT State Considerations" +weight: 5 +--- + +# DUT State Considerations + +This page discusses considerations for Device Under Test (DUT) state.
DUTs such as VPP require configuration to be provided before the application +starts (via config files) or just after it starts (via API or CLI access). + +During operation DUTs gather various telemetry data, depending on configuration. +This internal state handling is part of normal operation, +so any performance impact is included in the test results. +Accessing telemetry data places additional load on the DUT, +so we do not do that in the main trial measurements that affect results, +but we include separate trials specifically for gathering runtime telemetry. + +But there is one kind of state that needs specific handling. +This kind of DUT state is dynamically created based on incoming traffic, +it affects how the DUT handles the traffic, and (unlike telemetry counters) +it has an uneven impact on CPU load. +A typical example is NAT, where detecting new sessions takes more CPU than +forwarding packets on existing (open or recently closed) sessions. +We call DUT configurations with this kind of state "stateful", +and configurations without them "stateless". +(Even though stateless configurations contain the state described in the previous +paragraphs, and some configuration items may have "stateful" in their name, +such as stateful ACLs.) + +# Stateful DUT configurations + +Typically, the level of CPU impact of traffic depends on DUT state. +The first packets causing DUT state to change have higher impact, +subsequent packets matching that state have lower impact. + +From a performance point of view, this is similar to the traffic phases +for stateful protocols, see the +[NGFW draft](https://tools.ietf.org/html/draft-ietf-bmwg-ngfw-performance-05#section-4.3.4). +In CSIT we borrow the terminology (even if it does not fit perfectly, +see the discussion below). Ramp-up traffic causes the state change, +sustain traffic does not change the state. + +As the performance is different, each test has to choose which traffic +it wants to test, and manipulate the DUT state to achieve the intended impact.
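The uneven CPU cost of stateful processing can be illustrated with a toy model. This is purely illustrative (not CSIT code); the cost constants and the `trial_cost` helper are hypothetical:

```python
# Illustrative toy model (not CSIT code): per-packet CPU cost for a
# stateful DUT such as NAT, where the first packet of a session
# (session creation) costs more than subsequent packets (session lookup).

# Hypothetical cost units; real values depend on the DUT and hardware.
CREATE_COST = 5.0   # first packet of a session
LOOKUP_COST = 1.0   # subsequent packets of an existing session

def trial_cost(packets, sessions=None):
    """Total CPU cost of a trial.

    packets: sequence of session identifiers, one per packet.
    sessions: set of already-established sessions (the DUT state).
    """
    if sessions is None:
        sessions = set()
    cost = 0.0
    for sid in packets:
        if sid in sessions:
            cost += LOOKUP_COST
        else:
            sessions.add(sid)
            cost += CREATE_COST
    return cost, sessions

# Ramp-up traffic: every packet creates a new session (high cost).
ramp_cost, state = trial_cost(range(100))
# Sustain traffic: same sessions again, state already exists (low cost).
sustain_cost, _ = trial_cost(range(100), state)
print(ramp_cost, sustain_cost)  # 500.0 100.0
```

The five-fold cost difference in this toy example is exactly why a test must decide whether it measures the ramp-up phase or the sustain phase.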
+ +## Ramp-up trial + +Tests aiming at sustain performance need to make sure the DUT state is created. +We achieve this via a ramp-up trial, whose specific purpose +is to create the state. + +Subsequent trials need no specific handling, as long as the state +remains the same. But some state can time out, so additional ramp-up +trials are inserted whenever the code detects the state can time out. +Note that a trial with zero loss refreshes the state, +so only the time since the last non-zero loss trial is tracked. + +For the state to be set completely, it is important that both DUT and TG +do not lose any packets. We achieve this by setting the profile multiplier +(TPS from now on) to a low enough value. + +It is also important that each state-affecting packet is sent. +For size-limited traffic profiles this is guaranteed by the size limit. +For continuous traffic, we set a long enough duration (based on TPS). + +At the end of the ramp-up trial, we check the DUT state to confirm +it has been created as expected. +The test fails if the state is not (completely) created. + +## State Reset + +Tests aiming at ramp-up performance do not use a ramp-up trial, +and they need to reset the DUT state before each trial measurement. +The way of resetting the state depends on the test; +usually an API call is used to partially de-configure +the part that holds the state, and then re-configure it back. + +In CSIT we control the DUT state behavior via a test variable "resetter". +If it is not set, DUT state is not reset. +If it is set, each search algorithm (including MRR) will invoke it +before all trial measurements (both main and telemetry ones). +Any configuration keyword enabling a feature with DUT state +will check whether a test variable for ramp-up rate is present. +If it is present, resetter is not set. +If it is not present, the keyword sets the appropriate resetter value. +This logic makes sure either ramp-up or state reset is used.
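The resetter selection logic can be sketched as follows. This is a simplified illustration; the function and variable names are hypothetical, the real logic lives in CSIT Robot Framework keywords:

```python
# Simplified sketch of the resetter logic described above.
# Names are hypothetical; this is not the actual CSIT keyword code.

def configure_stateful_feature(test_vars):
    """Called by a configuration keyword enabling a feature with DUT state."""
    if "ramp_up_rate" in test_vars:
        # Ramp-up trials will create and refresh the state,
        # so no reset between trials is needed.
        test_vars.pop("resetter", None)
    else:
        # No ramp-up configured: reset state before every trial measurement.
        test_vars["resetter"] = lambda: None  # placeholder for the API calls

def before_trial(test_vars):
    """Called by every search algorithm before each trial measurement."""
    resetter = test_vars.get("resetter")
    if resetter is not None:
        resetter()

vars_with_rampup = {"ramp_up_rate": 1000.0}
configure_stateful_feature(vars_with_rampup)
print("resetter" in vars_with_rampup)  # False: ramp-up refreshes the state

vars_without_rampup = {}
configure_stateful_feature(vars_without_rampup)
print("resetter" in vars_without_rampup)  # True: state reset before each trial
```

Either branch guarantees the trials see a consistent DUT state, which is the invariant the search algorithms rely on.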
+ +Notes: If both ramp-up and state reset were used, the DUT behavior +would be identical to just reset, while the test would take longer to execute. +If neither were used, the DUT would show different performance in subsequent trials, +violating the assumptions of search algorithms. + +## DUT versus protocol ramp-up + +There are at least three different causes for bandwidth possibly increasing +within a single measurement trial. + +The first is the DUT switching from the state modification phase to the constant phase; +it is the primary focus of this document. +Using ramp-up traffic before main trials eliminates this cause +for tests wishing to measure the performance of the next phase. +Using size-limited profiles eliminates the next phase +for tests wishing to measure the performance of this phase. + +The second is protocols such as TCP ramping up their throughput to utilize +the bandwidth available. This is the original meaning of "ramp up" +in the NGFW draft (see above). +In existing tests we are not using this meaning of TCP ramp-up. +Instead we use only small transactions, and a large enough initial window +so TCP acts as ramped-up already. + +The third is TCP increasing offered load due to retransmissions triggered by +packet loss. In CSIT we again try to avoid this behavior +by using small enough data to transfer, so overlap of multiple transactions +(the primary cause of packet loss) is unlikely. +But in MRR tests, packet loss and non-constant offered load are still expected. + +# Stateless DUT configurations + +These are simple configurations, which do not set any resetter value +(even if ramp-up duration is not configured). +The majority of existing tests are of this type, using continuous traffic profiles. + +In order to identify limits of TRex performance, +we have added suites with stateless DUT configuration (VPP ip4base) +subjected to size-limited ASTF traffic.
The discovered rates serve as a basis of comparison +for evaluating the results for stateful DUT configurations (VPP NAT44ed) +subjected to the same traffic profiles. + +# DUT versus TG state + +Traffic Generator profiles can be stateful (ASTF) or stateless (STL). +DUT configuration can be stateful or stateless (with respect to packet traffic). + +In CSIT we currently use all four possible combinations: + +- Regular stateless VPP tests use stateless traffic profiles. + +- Stateless VPP configuration with stateful profile is used as a base for + comparison. + +- Some stateful DUT configurations (NAT44DET, NAT44ED unidirectional) + are tested using stateless traffic profiles and continuous traffic. + +- The rest of stateful DUT configurations (NAT44ED bidirectional) + are tested using stateful traffic profiles and size-limited traffic. diff --git a/docs/content/methodology/overview/multi_core_speedup.md b/docs/content/methodology/overview/multi_core_speedup.md new file mode 100644 index 0000000000..f438e8e996 --- /dev/null +++ b/docs/content/methodology/overview/multi_core_speedup.md @@ -0,0 +1,51 @@ +--- +title: "Multi-Core Speedup" +weight: 3 +--- + +# Multi-Core Speedup + +All performance tests are executed in single physical core and in +multiple physical core scenarios. + +## Intel Hyper-Threading (HT) + +Intel Xeon processors used in FD.io CSIT can operate either in HT +Disabled mode (single logical core per each physical core) or in HT +Enabled mode (two logical cores per each physical core). The HT setting is +applied in BIOS and requires a server SUT reload for it to take effect, +making it impractical for continuous changes of the HT mode of operation. + +Performance tests are executed with server SUTs' Intel Xeon processors +configured with Intel Hyper-Threading Enabled for all Xeon +Cascadelake and Xeon Icelake testbeds. + +## Multi-core Tests + +Multi-core tests are executed in the following VPP worker thread and physical +core configurations: + +1.
Intel Xeon Icelake and Cascadelake testbeds (2n-icx, 3n-icx, 2n-clx) + with Intel HT enabled (2 logical CPU cores per each physical core): + + 1. 2t1c - 2 VPP worker threads on 1 physical core. + 2. 4t2c - 4 VPP worker threads on 2 physical cores. + 3. 8t4c - 8 VPP worker threads on 4 physical cores. + +VPP worker threads are the data plane threads running on isolated +logical cores. With Intel HT enabled, VPP workers are placed as sibling +threads on each used physical core. VPP control threads (main, stats) +are running on a separate non-isolated core together with other Linux +processes. + +In all CSIT tests care is taken to ensure that each VPP worker handles +the same amount of received packet load and does the same amount of +packet processing work. This is achieved by evenly distributing per +interface type (e.g. physical, virtual) receive queues over VPP workers +using the default VPP round-robin mapping and by loading these queues with +the same amount of packet flows. + +If the number of VPP workers is higher than the number of physical or virtual +interfaces, multiple receive queues are configured on each interface. +NIC Receive Side Scaling (RSS) for physical interfaces and multi-queue +for virtual interfaces are used for this purpose. diff --git a/docs/content/methodology/overview/per_thread_resources.md b/docs/content/methodology/overview/per_thread_resources.md new file mode 100644 index 0000000000..c23efb50bd --- /dev/null +++ b/docs/content/methodology/overview/per_thread_resources.md @@ -0,0 +1,101 @@ +--- +title: "Per Thread Resources" +weight: 2 +--- + +# Per Thread Resources + +The CSIT test framework manages the mapping of the following resources per thread: + +1. Cores, physical cores (pcores) allocated as pairs of sibling logical cores + (lcores) if the server is in HyperThreading/SMT mode, or as single lcores + if the server is not in HyperThreading/SMT mode. Note that if the server's + processors are running in HyperThreading/SMT mode, sibling lcores are + always used. +2.
Receive Queues (RxQ), packet receive queues allocated on each + physical and logical interface tested. +3. Transmit Queues (TxQ), packet transmit queues allocated on each + physical and logical interface tested. + +The approach to mapping per-thread resources depends on the application/DUT +tested (VPP or DPDK apps) and the associated thread types, as follows: + +1. Data-plane workers, used for data-plane packet processing, when no + feature workers are present. + + - Cores: data-plane workers are typically tested in 1, 2 and 4 pcore + configurations, running on a single lcore per pcore or on sibling + lcores per pcore. The result is a set of {T}t{C}c thread-core + configurations, where {T} stands for a total number of threads + (lcores), and {C} for a total number of pcores. Tested + configurations are encoded in CSIT test case names, + e.g. "1c", "2c", "4c", and test tags "2T1C" (or "1T1C"), "4T2C" + (or "2T2C"), "8T4C" (or "4T4C"). + - Interface Receive Queues (RxQ): as of the CSIT-2106 release, the number of + RxQs used on each physical or virtual interface is equal to the + number of data-plane workers. In other words each worker has a + dedicated RxQ on each interface tested. This ensures the packet + processing load is equal for each worker, subject to RSS flow + load balancing efficacy. Note: before CSIT-2106 the total number of + RxQs across all interfaces of a specific type was equal to the + number of data-plane workers. + - Interface Transmit Queues (TxQ): the number of TxQs used on each + physical or virtual interface is equal to the number of data-plane + workers. In other words each worker has a dedicated TxQ on each + interface tested. + - Applies to VPP and DPDK Testpmd and L3Fwd. + +2. Data-plane and feature workers (e.g. IPsec async crypto workers), the + latter dedicated to specific feature processing. + + - Cores: data-plane and feature workers are tested in 2, 3 and 4 + pcore configurations, running on a single lcore per pcore or on + sibling lcores per pcore.
This results in two sets of + thread-core combinations separated by "-", {T}t{C}c-{T}t{C}c, with + the leading set denoting total number of threads (lcores) and + pcores used for data-plane workers, and the trailing set denoting + total number of lcores and pcores used for feature workers. + Accordingly, tested configurations are encoded in CSIT test case + names, e.g. "1c-1c", "1c-2c", "1c-3c", and test tags "2T1C_2T1C" + (or "1T1C_1T1C"), "2T1C_4T2C" (or "1T1C_2T2C"), "2T1C_6T3C" + (or "1T1C_3T3C"). + - RxQ and TxQ: no RxQs and no TxQs are used by feature workers. + - Applies to VPP only. + +3. Management/main worker, control plane and management. + + - Cores: single lcore. + - RxQ: not used (VPP default behaviour). + - TxQ: single TxQ per interface, allocated but not used (VPP default + behaviour). + - Applies to VPP only. + +## VPP Thread Configuration + +Mapping of cores and RxQs to VPP data-plane worker threads is done in +the VPP startup.conf during test suite setup: + +1. `corelist-workers `: List of logical cores to run VPP + data-plane workers and feature workers. The actual lcore + allocation depends on the HyperThreading/SMT server configuration and + the per-test core configuration. + + - For tests without feature workers, by default, all CPU cores + configured in startup.conf are used for data-plane workers. + - For tests with feature workers, CSIT code distributes lcores across + data-plane and feature workers. + +2. `num-rx-queues `: Number of Rx queues used per interface. + +Mapping of TxQs to VPP data-plane worker threads uses the default VPP +setting of one TxQ per interface per data-plane worker. + +## DPDK Thread Configuration + +Mapping of cores and RxQs to DPDK Testpmd/L3Fwd data-plane worker +threads is done in the startup CLI: + +1. `-l ` - List of logical cores to run DPDK + application. +2. `nb-cores=` - Number of forwarding cores. +3. `rxq=` - Number of Rx queues used per interface.
diff --git a/docs/content/methodology/overview/terminology.md b/docs/content/methodology/overview/terminology.md new file mode 100644 index 0000000000..c9115e9291 --- /dev/null +++ b/docs/content/methodology/overview/terminology.md @@ -0,0 +1,97 @@ +--- +title: "Terminology" +weight: 1 +--- + +# Terminology + +- **Frame size**: size of an Ethernet Layer-2 frame on the wire, including + any VLAN tags (dot1q, dot1ad) and Ethernet FCS, but excluding Ethernet + preamble and inter-frame gap. Measured in Bytes. + +- **Packet size**: same as frame size, both terms used interchangeably. + +- **Inner L2 size**: for tunneled L2 frames only, size of an encapsulated + Ethernet Layer-2 frame, preceded with tunnel header, and followed by + tunnel trailer. Measured in Bytes. + +- **Inner IP size**: for tunneled IP packets only, size of an encapsulated + IPv4 or IPv6 packet, preceded with tunnel header, and followed by + tunnel trailer. Measured in Bytes. + +- **Device Under Test (DUT)**: In software networking, "device" denotes a + specific piece of software tasked with packet processing. Such a device + is surrounded by other software components (such as the operating system + kernel). It is not possible to run devices without also running the + other components, and hardware resources are shared between both. For + purposes of testing, the whole set of hardware and software components + is called "System Under Test" (SUT). As the SUT is the part of the whole + test setup whose performance can be measured with RFC2544 methods, this + document uses SUT instead of the RFC2544 DUT. The device under test + (DUT) can be re-introduced when analyzing test results using whitebox + techniques, but this document sticks to blackbox testing. + +- **System Under Test (SUT)**: System under test (SUT) is a part of the + whole test setup whose performance is to be benchmarked. The complete + methodology contains other parts, whose performance is either already + established, or not affecting the benchmarking result.
+ +- **Bi-directional throughput tests**: involve packets/frames flowing in + both east-west and west-east directions over every tested interface of + SUT/DUT. Packet flow metrics are measured per direction, and can be + reported as aggregate for both directions (i.e. throughput) and/or + separately for each measured direction (i.e. latency). In most cases + bi-directional tests use the same (symmetric) load in both directions. + +- **Uni-directional throughput tests**: involve packets/frames flowing in + only one direction, i.e. either east-west or west-east direction, over + every tested interface of SUT/DUT. Packet flow metrics are measured + and are reported for the measured direction. + +- **Packet Loss Ratio (PLR)**: ratio of packets lost relative to packets + transmitted over the test trial duration, calculated using the formula: + PLR = ( pkts_transmitted - pkts_received ) / pkts_transmitted. + For bi-directional throughput tests aggregate PLR is calculated based + on the aggregate number of packets transmitted and received. + +- **Packet Throughput Rate**: maximum packet offered load DUT/SUT forwards + within the specified Packet Loss Ratio (PLR). In many cases the rate + depends on the frame size processed by DUT/SUT. Hence packet + throughput rate MUST be quoted with specific frame size as received by + DUT/SUT during the measurement. For bi-directional tests, packet + throughput rate should be reported as aggregate for both directions. + Measured in packets-per-second (pps) or frames-per-second (fps), + equivalent metrics. + +- **Bandwidth Throughput Rate**: a secondary metric calculated from packet + throughput rate using the formula: bw_rate = pkt_rate * (frame_size + + L1_overhead) * 8, where L1_overhead for Ethernet includes preamble (8 + Bytes) and inter-frame gap (12 Bytes). For bi-directional tests, + bandwidth throughput rate should be reported as aggregate for both + directions. Expressed in bits-per-second (bps).
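The PLR and bandwidth rate formulas above can be checked with a short worked example (values chosen for illustration; the helper names are hypothetical):

```python
# Worked examples for the PLR and bandwidth throughput rate formulas above.

def packet_loss_ratio(pkts_transmitted, pkts_received):
    # PLR = (pkts_transmitted - pkts_received) / pkts_transmitted
    return (pkts_transmitted - pkts_received) / pkts_transmitted

def bandwidth_rate(pkt_rate, frame_size, l1_overhead=20):
    # L1 overhead for Ethernet: preamble (8 B) + inter-frame gap (12 B).
    return pkt_rate * (frame_size + l1_overhead) * 8

# 1000 packets lost out of 1 million transmitted:
print(packet_loss_ratio(1_000_000, 999_000))  # 0.001

# 64B frames at ~14.88 Mpps occupy a full 10 Gbps Ethernet link:
print(bandwidth_rate(14_880_952, 64))  # 9999999744 (bits per second)
```

The second example shows why bandwidth rate must include the 20-byte L1 overhead: without it, 14.88 Mpps of 64B frames would appear to use only about 7.6 Gbps of the wire.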
+ +- **Non Drop Rate (NDR)**: maximum packet/bandwidth throughput rate sustained + by DUT/SUT at PLR equal to zero (zero packet loss) specific to tested + frame size(s). MUST be quoted with specific packet size as received by + DUT/SUT during the measurement. Packet NDR measured in + packets-per-second (or fps), bandwidth NDR expressed in + bits-per-second (bps). + +- **Partial Drop Rate (PDR)**: maximum packet/bandwidth throughput rate + sustained by DUT/SUT at PLR greater than zero (non-zero packet loss) + specific to tested frame size(s). MUST be quoted with specific packet + size as received by DUT/SUT during the measurement. Packet PDR + measured in packets-per-second (or fps), bandwidth PDR expressed in + bits-per-second (bps). + +- **Maximum Receive Rate (MRR)**: packet/bandwidth rate regardless of PLR + sustained by DUT/SUT under specified Maximum Transmit Rate (MTR) + packet load offered by traffic generator. MUST be quoted with both + specific packet size and MTR as received by DUT/SUT during the + measurement. Packet MRR measured in packets-per-second (or fps), + bandwidth MRR expressed in bits-per-second (bps). + +- **Trial**: a single measurement step. + +- **Trial duration**: amount of time over which packets are transmitted and + received in a single measurement step. diff --git a/docs/content/methodology/overview/vpp_forwarding_modes.md b/docs/content/methodology/overview/vpp_forwarding_modes.md new file mode 100644 index 0000000000..b3c3bba984 --- /dev/null +++ b/docs/content/methodology/overview/vpp_forwarding_modes.md @@ -0,0 +1,104 @@ +--- +title: "VPP Forwarding Modes" +weight: 4 +--- + +# VPP Forwarding Modes + +VPP is tested in a number of L2, IPv4 and IPv6 packet lookup and forwarding +modes. Within each mode baseline and scale tests are executed, the latter with +a varying number of FIB entries.
+ +## L2 Ethernet Switching + +VPP is tested in three L2 forwarding modes: + +- *l2patch*: L2 patch, the fastest point-to-point L2 path that loops + packets between two interfaces without any Ethernet frame checks or + lookups. +- *l2xc*: L2 cross-connect, point-to-point L2 path with all Ethernet + frame checks, but no MAC learning and no MAC lookup. +- *l2bd*: L2 bridge-domain, multipoint-to-multipoint L2 path with all + Ethernet frame checks, with MAC learning (unless static MACs are used) + and MAC lookup. + +l2bd tests are executed in baseline and scale configurations: + +- *l2bdbase*: Two MAC FIB entries are learned by VPP to enable packet + switching between two interfaces in two directions. VPP L2 switching + is tested with 254 IPv4 unique flows per direction, varying IPv4 + source address per flow in order to invoke RSS based packet + distribution across VPP workers. The same source and destination MAC + address is used for all flows per direction. IPv4 source address is + incremented for every packet. + +- *l2bdscale*: A high number of MAC FIB entries are learned by VPP to + enable packet switching between two interfaces in two directions. + Tested MAC FIB sizes include: i) 10k with 5k unique flows per + direction, ii) 100k with 2 x 50k flows and iii) 1M with 2 x 500k + flows. Unique flows are created by using distinct source and + destination MAC addresses that are changed for every packet using + incremental ordering, making VPP learn (or refresh) distinct src MAC + entries and look up distinct dst MAC entries for every packet. For + details, see + [Packet Flow Ordering]({{< ref "packet_flow_ordering#Packet Flow Ordering" >}}). + +Ethernet wire encapsulations tested include: untagged, dot1q, dot1ad. + +## IPv4 Routing + +IPv4 routing tests are executed in baseline and scale configurations: + +- *ip4base*: Two /32 IPv4 FIB entries are configured in VPP to enable + packet routing between two interfaces in two directions. 
VPP routing + is tested with 253 IPv4 unique flows per direction, varying IPv4 + source address per flow in order to invoke RSS based packet + distribution across VPP workers. IPv4 source address is incremented + for every packet. + +- *ip4scale*: A high number of /32 IPv4 FIB entries are configured in + VPP. Tested IPv4 FIB sizes include: i) 20k with 10k unique flows per + direction, ii) 200k with 2 * 100k flows and iii) 2M with 2 * 1M + flows. Unique flows are created by using distinct IPv4 destination + addresses that are changed for every packet, using incremental or + random ordering. For details, see + [Packet Flow Ordering]({{< ref "packet_flow_ordering#Packet Flow Ordering" >}}). + +## IPv6 Routing + +Similarly to IPv4, IPv6 routing tests are executed in baseline and scale +configurations: + +- *ip6base*: Two /128 IPv6 FIB entries are configured in VPP to enable + packet routing between two interfaces in two directions. VPP routing + is tested with 253 IPv6 unique flows per direction, varying IPv6 + source address per flow in order to invoke RSS based packet + distribution across VPP workers. IPv6 source address is incremented + for every packet. + +- *ip6scale*: A high number of /128 IPv6 FIB entries are configured in + VPP. Tested IPv6 FIB sizes include: i) 20k with 10k unique flows per + direction, ii) 200k with 2 * 100k flows and iii) 2M with 2 * 1M + flows. Unique flows are created by using distinct IPv6 destination + addresses that are changed for every packet, using incremental or + random ordering. For details, see + [Packet Flow Ordering]({{< ref "packet_flow_ordering#Packet Flow Ordering" >}}). + +## SRv6 Routing + +SRv6 routing tests are executed in a number of baseline configurations; +in each case an SR policy and a steering policy are configured for one +direction and one (or two) SR behaviours (functions) in the other +directions: + +- *srv6enc1sid*: One SID (no SRH present), one SR function - End.
+- *srv6enc2sids*: Two SIDs (SRH present), two SR functions - End and + End.DX6. +- *srv6enc2sids-nodecaps*: Two SIDs (SRH present) without decapsulation, + one SR function - End. +- *srv6proxy-dyn*: Dynamic SRv6 proxy, one SR function - End.AD. +- *srv6proxy-masq*: Masquerading SRv6 proxy, one SR function - End.AM. +- *srv6proxy-stat*: Static SRv6 proxy, one SR function - End.AS. + +In all listed cases a low number of IPv6 flows (253 per direction) is +routed by VPP.
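The scale tests above construct unique flows by changing destination addresses per packet in incremental order. A minimal generator sketch (purely illustrative; the base address, flow counts and helper name are hypothetical, not CSIT code):

```python
# Illustrative sketch: generating incrementally ordered IPv4 destination
# addresses for a scale test with n_flows unique flows, as described in
# the ip4scale/ip6scale sections above.
import ipaddress

def incremental_dests(base, n_flows, n_packets):
    """Yield one destination address per packet, cycling over n_flows
    addresses in incremental order starting at base."""
    start = int(ipaddress.IPv4Address(base))
    for i in range(n_packets):
        yield str(ipaddress.IPv4Address(start + (i % n_flows)))

# 4 unique flows over 6 packets: addresses wrap after the 4th packet.
dests = list(incremental_dests("10.0.0.0", n_flows=4, n_packets=6))
print(dests)
# ['10.0.0.0', '10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.0', '10.0.0.1']
```

Because every packet carries a different destination, each packet exercises a distinct FIB lookup (or, for l2bdscale, a distinct MAC learn/lookup), which is the point of the scale configurations.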