From 5c739ab22c2b62dd38a4d89d62188e1baa7a0f45 Mon Sep 17 00:00:00 2001 From: Tibor Frank Date: Tue, 5 Feb 2019 10:20:41 +0100 Subject: CSIT-1420: Split methodology section to more files Change-Id: I861e578434abdf72244d684fca8cfd66e1db9c28 Signed-off-by: Tibor Frank (cherry picked from commit 124101d22151239b0411a73ae4d2bf8d70970937) --- docs/report/introduction/methodology.rst | 934 +-------------------- .../introduction/methodology_bmrr_throughput.rst | 54 ++ .../methodology_data_plane_throughput.rst | 33 + .../methodology_http_tcp_with_wrk_tool.rst | 40 + .../methodology_ipsec_on_intel_qat.rst | 14 + .../methodology_k8s_container_memif.rst | 23 + .../methodology_kvm_vms_vhost_user.rst | 23 + .../methodology_lxc_drc_container_memif.rst | 21 + .../introduction/methodology_mlrsearch_tests.rst | 292 +++++++ .../methodology_multi_core_speedup.rst | 66 ++ .../introduction/methodology_packet_latency.rst | 23 + .../methodology_trex_traffic_generator.rst | 53 ++ .../methodology_tunnel_encapsulations.rst | 38 + .../methodology_vpp_device_functional.rst | 13 + .../introduction/methodology_vpp_features.rst | 80 ++ .../methodology_vpp_forwarding_modes.rst | 90 ++ .../methodology_vpp_startup_settings.rst | 43 + 17 files changed, 924 insertions(+), 916 deletions(-) create mode 100644 docs/report/introduction/methodology_bmrr_throughput.rst create mode 100644 docs/report/introduction/methodology_data_plane_throughput.rst create mode 100644 docs/report/introduction/methodology_http_tcp_with_wrk_tool.rst create mode 100644 docs/report/introduction/methodology_ipsec_on_intel_qat.rst create mode 100644 docs/report/introduction/methodology_k8s_container_memif.rst create mode 100644 docs/report/introduction/methodology_kvm_vms_vhost_user.rst create mode 100644 docs/report/introduction/methodology_lxc_drc_container_memif.rst create mode 100644 docs/report/introduction/methodology_mlrsearch_tests.rst create mode 100644 docs/report/introduction/methodology_multi_core_speedup.rst create mode 100644 docs/report/introduction/methodology_packet_latency.rst create mode 100644 docs/report/introduction/methodology_trex_traffic_generator.rst create mode 100644 docs/report/introduction/methodology_tunnel_encapsulations.rst create mode 100644 docs/report/introduction/methodology_vpp_device_functional.rst create mode 100644 docs/report/introduction/methodology_vpp_features.rst create mode 100644 docs/report/introduction/methodology_vpp_forwarding_modes.rst create mode 100644 docs/report/introduction/methodology_vpp_startup_settings.rst (limited to 'docs/report/introduction') diff --git a/docs/report/introduction/methodology.rst b/docs/report/introduction/methodology.rst index 16d3edacdb..1f9fcfe7fb 100644 --- a/docs/report/introduction/methodology.rst +++ b/docs/report/introduction/methodology.rst @@ -4,919 +4,21 @@ Test Methodology ================ -VPP Forwarding Modes --------------------- - -VPP is tested in a number of L2 and IP packet lookup and forwarding -modes. Within each mode baseline and scale tests are executed, the -latter with varying number of lookup entries. - -L2 Ethernet Switching -~~~~~~~~~~~~~~~~~~~~~ - -VPP is tested in three L2 forwarding modes: - -- *l2patch*: L2 patch, the fastest point-to-point L2 path that loops - packets between two interfaces without any Ethernet frame checks or - lookups. -- *l2xc*: L2 cross-connect, point-to-point L2 path with all Ethernet - frame checks, but no MAC learning and no MAC lookup. -- *l2bd*: L2 bridge-domain, multipoint-to-multipoint L2 path with all - Ethernet frame checks, with MAC learning (unless static MACs are used) - and MAC lookup. - -l2bd tests are executed in baseline and scale configurations: - -- *l2bdbase*: low number of L2 flows (253 per direction) is switched by - VPP. They drive the content of MAC FIB size (506 total MAC entries). - Both source and destination MAC addresses are incremented on a packet - by packet basis. - -- *l2bdscale*: high number of L2 flows is switched by VPP. Tested MAC - FIB sizes include: i) 10k (5k unique flows per direction), ii) 100k - (2x 50k flows) and iii) 1M (2x 500k). Both source and destination MAC - addresses are incremented on a packet by packet basis, ensuring new - entries are learn refreshed and looked up at every packet, making it - the worst case scenario. - -Ethernet wire encapsulations tested include: untagged, dot1q, dot1ad. - -IPv4 Routing -~~~~~~~~~~~~ - -IPv4 routing tests are executed in baseline and scale configurations: - -- *ip4base*: low number of IPv4 flows (253 per direction) is routed by - VPP. They drive the content of IPv4 FIB size (506 total /32 prefixes). - Destination IPv4 addresses are incremented on a packet by packet - basis. - -- *ip4scale*: high number of IPv4 flows is routed by VPP. Tested IPv4 - FIB sizes of /32 prefixes include: i) 20k (10k unique flows per - direction), ii) 200k (2x 100k flows) and iii) 2M (2x 1M). Destination - IPv4 addresses are incremented on a packet by packet basis, ensuring - new FIB entries are looked up at every packet, making it the worst - case scenario. - -IPv6 Routing -~~~~~~~~~~~~ - -IPv6 routing tests are executed in baseline and scale configurations: - -- *ip6base*: low number of IPv6 flows (253 per direction) is routed by - VPP. They drive the content of IPv6 FIB size (506 total /128 prefixes). - Destination IPv6 addresses are incremented on a packet by packet - basis. - -- *ip6scale*: high number of IPv6 flows is routed by VPP. Tested IPv6 - FIB sizes of /128 prefixes include: i) 20k (10k unique flows per - direction), ii) 200k (2x 100k flows) and iii) 2M (2x 1M). Destination - IPv6 addresses are incremented on a packet by packet basis, ensuring - new FIB entries are looked up at every packet, making it the worst - case scenario. - -SRv6 Routing -~~~~~~~~~~~~ - -SRv6 routing tests are executed in a number of baseline configurations, -in each case SR policy and steering policy are configured for one -direction and one (or two) SR behaviours (functions) in the other -directions: - -- *srv6enc1sid*: One SID (no SRH present), one SR function - End. -- *srv6enc2sids*: Two SIDs (SRH present), two SR functions - End and - End.DX6. -- *srv6enc2sids-nodecaps*: Two SIDs (SRH present) without decapsulation, - one SR function - End. -- *srv6proxy-dyn*: Dynamic SRv6 proxy, one SR function - End.AD. -- *srv6proxy-masq*: Masquerading SRv6 proxy, one SR function - End.AM. -- *srv6proxy-stat*: Static SRv6 proxy, one SR function - End.AS. - -In all listed cases low number of IPv6 flows (253 per direction) is -routed by VPP. - -Tunnel Encapsulations ---------------------- - -Tunnel encapsulations testing is grouped based on the type of outer -header: IPv4 or IPv6. - -IPv4 Tunnels -~~~~~~~~~~~~ - -VPP is tested in the following IPv4 tunnel baseline configurations: - -- *ip4vxlan-l2bdbase*: VXLAN over IPv4 tunnels with L2 bridge-domain MAC - switching. -- *ip4vxlan-l2xcbase*: VXLAN over IPv4 tunnels with L2 cross-connect. -- *ip4lispip4-ip4base*: LISP over IPv4 tunnels with IPv4 routing. -- *ip4lispip6-ip6base*: LISP over IPv4 tunnels with IPv6 routing. - -In all cases listed above low number of MAC, IPv4, IPv6 flows (253 per -direction) is switched or routed by VPP. - -In addition selected IPv4 tunnels are tested at scale: - -- *dot1q--ip4vxlanscale-l2bd*: VXLAN over IPv4 tunnels with L2 bridge- - domain MAC switching, with scaled up dot1q VLANs (10, 100, 1k), - mapped to scaled up L2 bridge-domains (10, 100, 1k), that are in turn - mapped to (10, 100, 1k) VXLAN tunnels. 64.5k flows are transmitted per - direction. - -IPv6 Tunnels -~~~~~~~~~~~~ - -VPP is tested in the following IPv6 tunnel baseline configurations: - -- *ip6lispip4-ip4base*: LISP over IPv4 tunnels with IPv4 routing. -- *ip6lispip6-ip6base*: LISP over IPv4 tunnels with IPv6 routing. - -In all cases listed above low number of IPv4, IPv6 flows (253 per -direction) is routed by VPP. - -VPP Features ------------- - -VPP is tested in a number of data plane feature configurations across -different forwarding modes. Following sections list features tested. - -ACL Security-Groups -~~~~~~~~~~~~~~~~~~~ - -Both stateless and stateful access control lists (ACL), also known as -security-groups, are supported by VPP. - -Following ACL configurations are tested for MAC switching with L2 -bridge-domains: - -- *l2bdbasemaclrn-iacl{E}sl-{F}flows*: Input stateless ACL, with {E} - entries and {F} flows. -- *l2bdbasemaclrn-oacl{E}sl-{F}flows*: Output stateless ACL, with {E} - entries and {F} flows. -- *l2bdbasemaclrn-iacl{E}sf-{F}flows*: Input stateful ACL, with {E} - entries and {F} flows. -- *l2bdbasemaclrn-oacl{E}sf-{F}flows*: Output stateful ACL, with {E} - entries and {F} flows. - -Following ACL configurations are tested with IPv4 routing: - -- *ip4base-iacl{E}sl-{F}flows*: Input stateless ACL, with {E} entries - and {F} flows. -- *ip4base-oacl{E}sl-{F}flows*: Output stateless ACL, with {E} entries - and {F} flows. -- *ip4base-iacl{E}sf-{F}flows*: Input stateful ACL, with {E} entries and - {F} flows. -- *ip4base-oacl{E}sf-{F}flows*: Output stateful ACL, with {E} entries - and {F} flows. - -ACL tests are executed with the following combinations of ACL entries -and number of flows: - -- ACL entry definitions - - - flow non-matching deny entry: (src-ip4, dst-ip4, src-port, dst-port). - - flow matching permit ACL entry: (src-ip4, dst-ip4). - -- {E} - number of non-matching deny ACL entries, {E} = [1, 10, 50]. -- {F} - number of UDP flows with different tuple (src-ip4, dst-ip4, - src-port, dst-port), {F} = [100, 10k, 100k]. -- All {E}x{F} combinations are tested per ACL type, total of 9. - -ACL MAC-IP -~~~~~~~~~~ - -MAC-IP binding ACLs are tested for MAC switching with L2 bridge-domains: - -- *l2bdbasemaclrn-macip-iacl{E}sl-{F}flows*: Input stateless ACL, with - {E} entries and {F} flows. - -MAC-IP ACL tests are executed with the following combinations of ACL -entries and number of flows: - -- ACL entry definitions - - - flow non-matching deny entry: (dst-ip4, dst-mac, bit-mask) - - flow matching permit ACL entry: (dst-ip4, dst-mac, bit-mask) - -- {E} - number of non-matching deny ACL entries, {E} = [1, 10, 50] -- {F} - number of UDP flows with different tuple (dst-ip4, dst-mac), - {F} = [100, 10k, 100k] -- All {E}x{F} combinations are tested per ACL type, total of 9. - -NAT44 -~~~~~ - -NAT44 is tested in baseline and scale configurations with IPv4 routing: - -- *ip4base-nat44*: baseline test with single NAT entry (addr, port), - single UDP flow. -- *ip4base-udpsrcscale{U}-nat44*: baseline test with {U} NAT entries - (addr, {U}ports), {U}=15. -- *ip4scale{R}-udpsrcscale{U}-nat44*: scale tests with {R}*{U} NAT - entries ({R}addr, {U}ports), {R}=[100, 1k, 2k, 4k], {U}=15. - -Data Plane Throughput ---------------------- - -Network data plane packet and bandwidth throughput are measured in -accordance with :rfc:`2544`, using FD.io CSIT Multiple Loss Ratio search -(MLRsearch), an optimized throughput search algorithm, that measures -SUT/DUT packet throughput rates at different Packet Loss Ratio (PLR) -values. - -Following MLRsearch values are measured across a range of L2 frame sizes -and reported: - -- NON DROP RATE (NDR): packet and bandwidth throughput at PLR=0%. - - - **Aggregate packet rate**: NDR_LOWER - pps. - - **Aggregate bandwidth rate**: NDR_LOWER Gbps. - -- PARTIAL DROP RATE (PDR): packet and bandwidth throughput at PLR=0.5%. - - - **Aggregate packet rate**: PDR_LOWER - pps. - - **Aggregate bandwidth rate**: PDR_LOWER Gbps. - -NDR and PDR are measured for the following L2 frame sizes (untagged -Ethernet): - -- IPv4 payload: 64B, IMIX (28x64B, 16x570B, 4x1518B), 1518B, 9000B. -- IPv6 payload: 78B, IMIX (28x78B, 16x570B, 4x1518B), 1518B, 9000B. - -All rates are reported from external Traffic Generator perspective. - -.. _mlrsearch_algorithm: - -MLRsearch Tests ---------------- - -Multiple Loss Rate search (MLRsearch) tests use new search algorithm -implemented in FD.io CSIT project. MLRsearch discovers multiple packet -throughput rates in a single search, with each rate associated with a -distinct Packet Loss Ratio (PLR) criteria. MLRsearch is being -standardized in IETF with `draft-vpolak-mkonstan-mlrsearch-XX -`_. - -Two throughput measurements used in FD.io CSIT are Non-Drop Rate (NDR, -with zero packet loss, PLR=0) and Partial Drop Rate (PDR, with packet -loss rate not greater than the configured non-zero PLR). MLRsearch -discovers NDR and PDR in a single pass reducing required execution time -compared to separate binary searches for NDR and PDR. MLRsearch reduces -execution time even further by relying on shorter trial durations -of intermediate steps, with only the final measurements -conducted at the specified final trial duration. -This results in the shorter overall search -execution time when compared to a standard NDR/PDR binary search, -while guaranteeing the same or similar results. - -If needed, MLRsearch can be easily adopted to discover more throughput rates -with different pre-defined PLRs. - -.. Note:: All throughput rates are *always* bi-directional - aggregates of two equal (symmetric) uni-directional packet rates - received and reported by an external traffic generator. - -Overview -~~~~~~~~ - -The main properties of MLRsearch: - -- MLRsearch is a duration aware multi-phase multi-rate search algorithm. - - - Initial phase determines promising starting interval for the search. - - Intermediate phases progress towards defined final search criteria. - - Final phase executes measurements according to the final search - criteria. - -- *Initial phase*: - - - Uses link rate as a starting transmit rate and discovers the Maximum - Receive Rate (MRR) used as an input to the first intermediate phase. - -- *Intermediate phases*: - - - Start with initial trial duration (in the first phase) and converge - geometrically towards the final trial duration (in the final phase). - - Track two values for NDR and two for PDR. - - - The values are called (NDR or PDR) lower_bound and upper_bound. - - Each value comes from a specific trial measurement - (most recent for that transmit rate), - and as such the value is associated with that measurement's duration and loss. - - A bound can be invalid, for example if NDR lower_bound - has been measured with nonzero loss. - - Invalid bounds are not real boundaries for the searched value, - but are needed to track interval widths. - - Valid bounds are real boundaries for the searched value. - - Each non-initial phase ends with all bounds valid. - - - Start with a large (lower_bound, upper_bound) interval width and - geometrically converge towards the width goal (measurement resolution) - of the phase. Each phase halves the previous width goal. - - Use internal and external searches: - - - External search - measures at transmit rates outside the (lower_bound, - upper_bound) interval. Activated when a bound is invalid, - to search for a new valid bound by doubling the interval width. - It is a variant of `exponential search`_. - - Internal search - `binary search`_, measures at transmit rates within the - (lower_bound, upper_bound) valid interval, halving the interval width. - -- *Final phase* is executed with the final test trial duration, and the final - width goal that determines resolution of the overall search. - Intermediate phases together with the final phase are called non-initial phases. - -The main benefits of MLRsearch vs. binary search include: - -- In general MLRsearch is likely to execute more search trials overall, but - less trials at a set final duration. -- In well behaving cases it greatly reduces (>50%) the overall duration - compared to a single PDR (or NDR) binary search duration, - while finding multiple drop rates. -- In all cases MLRsearch yields the same or similar results to binary search. -- Note: both binary search and MLRsearch are susceptible to reporting - non-repeatable results across multiple runs for very bad behaving - cases. - -Caveats: - -- Worst case MLRsearch can take longer than a binary search e.g. in case of - drastic changes in behaviour for trials at varying durations. - -Search Implementation -~~~~~~~~~~~~~~~~~~~~~ - -Following is a brief description of the current MLRsearch -implementation in FD.io CSIT. - -Input Parameters -```````````````` - -#. *maximum_transmit_rate* - maximum packet transmit rate to be used by - external traffic generator, limited by either the actual Ethernet - link rate or traffic generator NIC model capabilities. Sample - defaults: 2 * 14.88 Mpps for 64B 10GE link rate, - 2 * 18.75 Mpps for 64B 40GE NIC maximum rate. -#. *minimum_transmit_rate* - minimum packet transmit rate to be used for - measurements. MLRsearch fails if lower transmit rate needs to be - used to meet search criteria. Default: 2 * 10 kpps (could be higher). -#. *final_trial_duration* - required trial duration for final rate - measurements. Default: 30 sec. -#. *initial_trial_duration* - trial duration for initial MLRsearch phase. - Default: 1 sec. -#. *final_relative_width* - required measurement resolution expressed as - (lower_bound, upper_bound) interval width relative to upper_bound. - Default: 0.5%. -#. *packet_loss_ratio* - maximum acceptable PLR search criteria for - PDR measurements. Default: 0.5%. -#. *number_of_intermediate_phases* - number of phases between the initial - phase and the final phase. Impacts the overall MLRsearch duration. - Less phases are required for well behaving cases, more phases - may be needed to reduce the overall search duration for worse behaving cases. - Default (2). (Value chosen based on limited experimentation to date. - More experimentation needed to arrive to clearer guidelines.) - -Initial Phase -````````````` - -1. First trial measures at maximum rate and discovers MRR. - - a. *in*: trial_duration = initial_trial_duration. - b. *in*: offered_transmit_rate = maximum_transmit_rate. - c. *do*: single trial. - d. *out*: measured loss ratio. - e. *out*: mrr = measured receive rate. - -2. Second trial measures at MRR and discovers MRR2. - - a. *in*: trial_duration = initial_trial_duration. - b. *in*: offered_transmit_rate = MRR. - c. *do*: single trial. - d. *out*: measured loss ratio. - e. *out*: mrr2 = measured receive rate. - -3. Third trial measures at MRR2. - - a. *in*: trial_duration = initial_trial_duration. - b. *in*: offered_transmit_rate = MRR2. - c. *do*: single trial. - d. *out*: measured loss ratio. - -Non-initial Phases -`````````````````` - -1. Main loop: - - a. *in*: trial_duration for the current phase. - Set to initial_trial_duration for the first intermediate phase; - to final_trial_duration for the final phase; - or to the element of interpolating geometric sequence - for other intermediate phases. - For example with two intermediate phases, trial_duration - of the second intermediate phase is the geometric average - of initial_strial_duration and final_trial_duration. - b. *in*: relative_width_goal for the current phase. - Set to final_relative_width for the final phase; - doubled for each preceding phase. - For example with two intermediate phases, - the first intermediate phase uses quadruple of final_relative_width - and the second intermediate phase uses double of final_relative_width. - c. *in*: ndr_interval, pdr_interval from the previous main loop iteration - or the previous phase. - If the previous phase is the initial phase, both intervals have - lower_bound = MRR2, uper_bound = MRR. - Note that the initial phase is likely to create intervals with invalid bounds. - d. *do*: According to the procedure described in point 2, - either exit the phase (by jumping to 1.g.), - or prepare new transmit rate to measure with. - e. *do*: Perform the trial measurement at the new transmit rate - and trial_duration, compute its loss ratio. - f. *do*: Update the bounds of both intervals, based on the new measurement. - The actual update rules are numerous, as NDR external search - can affect PDR interval and vice versa, but the result - agrees with rules of both internal and external search. - For example, any new measurement below an invalid lower_bound - becomes the new lower_bound, while the old measurement - (previously acting as the invalid lower_bound) - becomes a new and valid upper_bound. - Go to next iteration (1.c.), taking the updated intervals as new input. - g. *out*: current ndr_interval and pdr_interval. - In the final phase this is also considered - to be the result of the whole search. - For other phases, the next phase loop is started - with the current results as an input. - -2. New transmit rate (or exit) calculation (for 1.d.): - - - If there is an invalid bound then prepare for external search: - - - *If* the most recent measurement at NDR lower_bound transmit rate - had the loss higher than zero, then - the new transmit rate is NDR lower_bound - decreased by two NDR interval widths. - - Else, *if* the most recent measurement at PDR lower_bound - transmit rate had the loss higher than PLR, then - the new transmit rate is PDR lower_bound - decreased by two PDR interval widths. - - Else, *if* the most recent measurement at NDR upper_bound - transmit rate had no loss, then - the new transmit rate is NDR upper_bound - increased by two NDR interval widths. - - Else, *if* the most recent measurement at PDR upper_bound - transmit rate had the loss lower or equal to PLR, then - the new transmit rate is PDR upper_bound - increased by two PDR interval widths. - - If interval width is higher than the current phase goal: - - - Else, *if* NDR interval does not meet the current phase width goal, - prepare for internal search. The new transmit rate is - (NDR lower bound + NDR upper bound) / 2. - - Else, *if* PDR interval does not meet the current phase width goal, - prepare for internal search. The new transmit rate is - (PDR lower bound + PDR upper bound) / 2. - - Else, *if* some bound has still only been measured at a lower duration, - prepare to re-measure at the current duration (and the same transmit rate). - The order of priorities is: - - - NDR lower_bound, - - PDR lower_bound, - - NDR upper_bound, - - PDR upper_bound. - - *Else*, do not prepare any new rate, to exit the phase. - This ensures that at the end of each non-initial phase - all intervals are valid, narrow enough, and measured - at current phase trial duration. - -Implementation Deviations -~~~~~~~~~~~~~~~~~~~~~~~~~ - -This document so far has been describing a simplified version of MLRsearch algorithm. -The full algorithm as implemented contains additional logic, -which makes some of the details (but not general ideas) above incorrect. -Here is a short description of the additional logic as a list of principles, -explaining their main differences from (or additions to) the simplified description, -but without detailing their mutual interaction. - -1. *Logarithmic transmit rate.* - In order to better fit the relative width goal, - the interval doubling and halving is done differently. - For example, the middle of 2 and 8 is 4, not 5. -2. *Optimistic maximum rate.* - The increased rate is never higher than the maximum rate. - Upper bound at that rate is always considered valid. -3. *Pessimistic minimum rate.* - The decreased rate is never lower than the minimum rate. - If a lower bound at that rate is invalid, - a phase stops refining the interval further (until it gets re-measured). -4. *Conservative interval updates.* - Measurements above current upper bound never update a valid upper bound, - even if drop ratio is low. - Measurements below current lower bound always update any lower bound - if drop ratio is high. -5. *Ensure sufficient interval width.* - Narrow intervals make external search take more time to find a valid bound. - If the new transmit increased or decreased rate would result in width - less than the current goal, increase/decrease more. - This can happen if the measurement for the other interval - makes the current interval too narrow. - Similarly, take care the measurements in the initial phase - create wide enough interval. -6. *Timeout for bad cases.* - The worst case for MLRsearch is when each phase converges to intervals - way different than the results of the previous phase. - Rather than suffer total search time several times larger - than pure binary search, the implemented tests fail themselves - when the search takes too long (given by argument *timeout*). - -(B)MRR Throughput ------------------ - -Maximum Receive Rate (MRR) tests are complementary to MLRsearch tests, -as they provide a maximum "raw" throughput benchmark for development and -testing community. MRR tests measure the packet forwarding rate under -the maximum load offered by traffic generator over a set trial duration, -regardless of packet loss. Maximum load for specified Ethernet frame -size is set to the bi-directional link rate. - -In |csit-release| MRR test code has been updated with a configurable -burst MRR parameters: trial duration and number of trials in a single -burst. This enabled a new Burst MRR (BMRR) methodology for more precise -performance trending. - -Current parameters for BMRR tests: - -- Ethernet frame sizes: 64B (78B for IPv6), IMIX, 1518B, 9000B; all - quoted sizes include frame CRC, but exclude per frame transmission - overhead of 20B (preamble, inter frame gap). - -- Maximum load offered: 10GE and 40GE link (sub-)rates depending on NIC - tested, with the actual packet rate depending on frame size, - transmission overhead and traffic generator NIC forwarding capacity. - - - For 10GE NICs the maximum packet rate load is 2* 14.88 Mpps for 64B, - a 10GE bi-directional link rate. - - For 25GE NICs the maximum packet rate load is 2* 18.75 Mpps for 64B, - a 25GE bi-directional link sub-rate limited by TG 25GE NIC used, - XXV710. - - For 40GE NICs the maximum packet rate load is 2* 18.75 Mpps for 64B, - a 40GE bi-directional link sub-rate limited by TG 40GE NIC used, - XL710. Packet rate for other tested frame sizes is limited by PCIe - Gen3 x8 bandwidth limitation of ~50Gbps. - -- Trial duration: 1 sec. - -- Number of trials per burst: 10. - -Similarly to NDR/PDR throughput tests, MRR test should be reporting bi- -directional link rate (or NIC rate, if lower) if tested VPP -configuration can handle the packet rate higher than bi-directional link -rate, e.g. large packet tests and/or multi-core tests. - -MRR tests are currently used for FD.io CSIT continuous performance -trending and for comparison between releases. Daily trending job tests -subset of frame sizes, focusing on 64B (78B for IPv6) for all tests and -IMIX for selected tests (vhost, memif). - -MRR-like measurements are being used to establish starting conditions -for experimental Probabilistic Loss Ratio Search (PLRsearch) used for -soak testing, aimed at verifying continuous system performance over an -extended period of time, hours, days, weeks, months. PLRsearch code is -currently in experimental phase in FD.io CSIT project. - -Packet Latency --------------- - -TRex Traffic Generator (TG) is used for measuring latency of VPP DUTs. -Reported latency values are measured using following methodology: - -- Latency tests are performed at 100% of discovered NDR and PDR rates - for each throughput test and packet size (except IMIX). -- TG sends dedicated latency streams, one per direction, each at the - rate of 9 kpps at the prescribed packet size; these are sent in - addition to the main load streams. -- TG reports min/avg/max latency values per stream direction, hence two - sets of latency values are reported per test case; future release of - TRex is expected to report latency percentiles. -- Reported latency values are aggregate across two SUTs due to three - node topology used for all performance tests; for per SUT latency, - reported value should be divided by two. -- 1usec is the measurement accuracy advertised by TRex TG for the setup - used in FD.io labs used by CSIT project. -- TRex setup introduces an always-on error of about 2*2usec per latency - flow additonal Tx/Rx interface latency induced by TRex SW writing and - reading packet timestamps on CPU cores without HW acceleration on NICs - closer to the interface line. - -Multi-Core Speedup ------------------- - -All performance tests are executed with single processor core and with -multiple cores scenarios. - -Intel Hyper-Threading (HT) -~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Intel Xeon processors used in FD.io CSIT can operate either in HT -Disabled mode (single logical core per each physical core) or in HT -Enabled mode (two logical cores per each physical core). HT setting is -applied in BIOS and requires server SUT reload for it to take effect, -making it impractical for continuous changes of HT mode of operation. - -|csit-release| performance tests are executed with server SUTs' Intel -XEON processors configured with Intel Hyper-Threading Disabled for all -Xeon Haswell testbeds (3n-hsw) and with Intel Hyper-Threading Enabled -for all Xeon Skylake testbeds. - -More information about physical testbeds is provided in -:ref:`tested_physical_topologies`. - -Multi-core Tests -~~~~~~~~~~~~~~~~ - -|csit-release| multi-core tests are executed in the following VPP worker -thread and physical core configurations: - -#. Intel Xeon Haswell testbeds (3n-hsw) with Intel HT disabled - (1 logical CPU core per each physical core): - - #. 1t1c - 1 VPP worker thread on 1 physical core. - #. 2t2c - 2 VPP worker threads on 2 physical cores. - #. 4t4c - 4 VPP worker threads on 4 physical cores. - -#. Intel Xeon Skylake testbeds (2n-skx, 3n-skx) with Intel HT enabled - (2 logical CPU cores per each physical core): - - #. 2t1c - 2 VPP worker threads on 1 physical core. - #. 4t2c - 4 VPP worker threads on 2 physical cores. - #. 8t4c - 8 VPP worker threads on 4 physical cores. - -VPP worker threads are the data plane threads running on isolated -logical cores. With Intel HT enabled VPP workers are placed as sibling -threads on each used physical core. VPP control threads (main, stats) -are running on a separate non-isolated core together with other Linux -processes. - -In all CSIT tests care is taken to ensure that each VPP worker handles -the same amount of received packet load and does the same amount of -packet processing work. This is achieved by evenly distributing per -interface type (e.g. physical, virtual) receive queues over VPP workers -using default VPP round- robin mapping and by loading these queues with -the same amount of packet flows. - -If number of VPP workers is higher than number of physical or virtual -interfaces, multiple receive queues are configured on each interface. -NIC Receive Side Scaling (RSS) for physical interfaces and multi-queue -for virtual interfaces are used for this purpose. - -Section :ref:`throughput_speedup_multi_core` includes a set of graphs -illustrating packet throughout speedup when running VPP worker threads -on multiple cores. Note that in quite a few test cases running VPP -workers on 2 or 4 physical cores hits the I/O bandwidth or packets-per- -second limit of tested NIC. - -VPP Startup Settings --------------------- - -CSIT code manipulates a number of VPP settings in startup.conf for optimized -performance. List of common settings applied to all tests and test -dependent settings follows. - -See `VPP startup.conf `_ -for a complete set and description of listed settings. - -Common Settings -~~~~~~~~~~~~~~~ - -List of vpp startup.conf settings applied to all tests: - -#. heap-size - set separately for ip4, ip6, stats, main - depending on scale tested. -#. no-tx-checksum-offload - disables UDP / TCP TX checksum offload in DPDK. - Typically needed for use faster vector PMDs (together with - no-multi-seg). -#. socket-mem , - memory per numa. (Not required anymore - due to VPP code changes, should be removed in CSIT-18.10.) - -Per Test Settings -~~~~~~~~~~~~~~~~~ - -List of vpp startup.conf settings applied dynamically per test: - -#. corelist-workers - list of logical cores to run VPP - worker data plane threads. Depends on HyperThreading and core per - test configuration. -#. num-rx-queues - depends on a number of VPP threads and NIC - interfaces. -#. num-rx-desc/num-tx-desc - number of rx/tx descriptors for specific - NICs, incl. xl710, x710, xxv710. -#. num-mbufs - increases number of buffers allocated, needed - only in scenarios with large number of interfaces and worker threads. - Value is per CPU socket. Default is 16384. -#. no-multi-seg - disables multi-segment buffers in DPDK, improves - packet throughput, but disables Jumbo MTU support. Disabled for all - tests apart from the ones that require Jumbo 9000B frame support. -#. UIO driver - depends on topology file definition. -#. QAT VFs - depends on NRThreads, each thread = 1QAT VFs. - -KVM VMs vhost-user ------------------- - -FD.io CSIT performance lab is testing VPP vhost with KVM VMs using -following environment settings: - -- Tests with varying Qemu virtio queue (a.k.a. vring) sizes: [vr256] - default 256 descriptors, [vr1024] 1024 descriptors to optimize for - packet throughput. -- Tests with varying Linux :abbr:`CFS (Completely Fair Scheduler)` - settings: [cfs] default settings, [cfsrr1] CFS RoundRobin(1) policy - applied to all data plane threads handling test packet path including - all VPP worker threads and all Qemu testpmd poll-mode threads. -- Resulting test cases are all combinations with [vr256,vr1024] and - [cfs,cfsrr1] settings. -- Adjusted Linux kernel :abbr:`CFS (Completely Fair Scheduler)` - scheduler policy for data plane threads used in CSIT is documented in - `CSIT Performance Environment Tuning wiki - `_. -- The purpose is to verify performance impact (MRR and NDR/PDR - throughput) and same test measurements repeatability, by making VPP - and VM data plane threads less susceptible to other Linux OS system - tasks hijacking CPU cores running those data plane threads. - -LXC/DRC Container Memif ------------------------ - -|csit-release| includes tests taking advantage of VPP memif virtual -interface (shared memory interface) to interconnect VPP running in -Containers. VPP vswitch instance runs in bare-metal user-mode handling -NIC interfaces and connecting over memif (Slave side) to VPPs running in -:abbr:`Linux Container (LXC)` or in Docker Container (DRC) configured -with memif (Master side). LXCs and DRCs run in a priviliged mode with -VPP data plane worker threads pinned to dedicated physical CPU cores per -usual CSIT practice. All VPP instances run the same version of software. -This test topology is equivalent to existing tests with vhost-user and -VMs as described earlier in :ref:`tested_logical_topologies`. - -In addition to above vswitch tests, a single memif interface test is -executed. It runs in a simple topology of two VPP container instances -connected over memif interface in order to verify standalone memif -interface performance. - -More information about CSIT LXC and DRC setup and control is available -in :ref:`container_orchestration_in_csit`. - -K8s Container Memif -------------------- - -|csit-release| includes tests of VPP topologies running in K8s -orchestrated Pods/Containers and connected over memif virtual -interfaces. In order to provide simple topology coding flexibility and -extensibility container orchestration is done with `Kubernetes -`_ using `Docker -`_ images for all container applications -including VPP. `Ligato `_ is used for the -Pod/Container networking orchestration that is integrated with K8s, -including memif support. - -In these tests VPP vswitch runs in a K8s Pod with Docker Container (DRC) -handling NIC interfaces and connecting over memif to more instances of -VPP running in Pods/DRCs. All DRCs run in a priviliged mode with VPP -data plane worker threads pinned to dedicated physical CPU cores per -usual CSIT practice. All VPP instances run the same version of software. -This test topology is equivalent to existing tests with vhost-user and -VMs as described earlier in :ref:`tested_physical_topologies`. - -Further documentation is available in -:ref:`container_orchestration_in_csit`. - -VPP_Device Functional ---------------------- - -|csit-release| added new VPP_Device test environment for functional VPP -device tests integrated into LFN CI/CD infrastructure. VPP_Device tests -run on 1-Node testbeds (1n-skx, 1n-arm) and rely on Linux SRIOV Virtual -Function (VF), dot1q VLAN tagging and external loopback cables to -facilitate packet passing over exernal physical links. Initial focus is -on few baseline tests. Existing CSIT VIRL tests can be moved to -VPP_Device framework by changing L1 and L2 KW(s). RF test definition -code stays unchanged with the exception of requiring adjustments from -3-Node to 2-Node logical topologies. CSIT VIRL to VPP_Device migration -is expected in the next CSIT release. - -IPSec on Intel QAT ------------------- - -VPP IPSec performance tests are using DPDK cryptodev device driver in -combination with HW cryptodev devices - Intel QAT 8950 50G - present in -LF FD.io physical testbeds. DPDK cryptodev can be used for all IPSec -data plane functions supported by VPP. - -Currently |csit-release| implements following IPSec test cases: - -- AES-GCM, CBC-SHA1 ciphers, in combination with IPv4 routed-forwarding - with Intel xl710 NIC. -- CBC-SHA1 ciphers, in combination with LISP-GPE overlay tunneling for - IPv4-over-IPv4 with Intel xl710 NIC. - -TRex Traffic Generator ----------------------- - -Usage -~~~~~ - -`TRex traffic generator `_ is used for all -CSIT performance tests. TRex stateless mode is used to measure NDR and -PDR throughputs using binary search (NDR and PDR discovery tests) and -for quick checks of DUT performance against the reference NDRs (NDR -check tests) for specific configuration. - -TRex is installed and run on the TG compute node. The typical procedure -is: - -- If the TRex is not already installed on TG, it is installed in the - suite setup phase - see `TRex intallation`_. -- TRex configuration is set in its configuration file - :: - - /etc/trex_cfg.yaml - -- TRex is started in the background mode - :: - - $ sh -c 'cd /scripts/ && sudo nohup ./t-rex-64 -i -c 7 --iom 0 > /tmp/trex.log 2>&1 &' > /dev/null - -- There are traffic streams dynamically prepared for each test, based on traffic - profiles. The traffic is sent and the statistics obtained using - :command:`trex_stl_lib.api.STLClient`. - -Measuring Packet Loss -~~~~~~~~~~~~~~~~~~~~~ - -Following sequence is followed to measure packet loss: - -- Create an instance of STLClient. -- Connect to the client. -- Add all streams. -- Clear statistics. -- Send the traffic for defined time. -- Get the statistics. - -If there is a warm-up phase required, the traffic is sent also before -test and the statistics are ignored. - -Measuring Latency -~~~~~~~~~~~~~~~~~ - -If measurement of latency is requested, two more packet streams are -created (one for each direction) with TRex flow_stats parameter set to -STLFlowLatencyStats. In that case, returned statistics will also include -min/avg/max latency values. - -HTTP/TCP with WRK Tool ----------------------- - -`WRK HTTP benchmarking tool `_ is used for -experimental TCP/IP and HTTP tests of VPP TCP/IP stack and built-in -static HTTP server. WRK has been chosen as it is capable of generating -significant TCP/IP and HTTP loads by scaling number of threads across -multi-core processors. - -This in turn enables quite high scale benchmarking of the main TCP/IP -and HTTP service including HTTP TCP/IP Connections-Per-Second (CPS), -HTTP Requests-Per-Second and HTTP Bandwidth Throughput. - -The initial tests are designed as follows: - -- HTTP and TCP/IP Connections-Per-Second (CPS) - - - WRK configured to use 8 threads across 8 cores, 1 thread per core. - - Maximum of 50 concurrent connections across all WRK threads. - - Timeout for server responses set to 5 seconds. - - Test duration is 30 seconds. - - Expected HTTP test sequence: - - - Single HTTP GET Request sent per open connection. - - Connection close after valid HTTP reply. - - Resulting flow sequence - 8 packets: >Syn, Ack, >Req, - Fin, Ack. - -- HTTP Requests-Per-Second - - - WRK configured to use 8 threads across 8 cores, 1 thread per core. - - Maximum of 50 concurrent connections across all WRK threads. - - Timeout for server responses set to 5 seconds. - - Test duration is 30 seconds. - - Expected HTTP test sequence: - - - Multiple HTTP GET Requests sent in sequence per open connection. - - Connection close after set test duration time. - - Resulting flow sequence: >Syn, Ack, >Req[1], Req[n], Fin, Ack. - -.. _binary search: https://en.wikipedia.org/wiki/Binary_search -.. _exponential search: https://en.wikipedia.org/wiki/Exponential_search -.. _estimation of standard deviation: https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation -.. _simplified error propagation formula: https://en.wikipedia.org/wiki/Propagation_of_uncertainty#Simplification +.. toctree:: + + methodology_vpp_forwarding_modes + methodology_tunnel_encapsulations + methodology_vpp_features + methodology_data_plane_throughput + methodology_mlrsearch_tests + methodology_bmrr_throughput + methodology_packet_latency + methodology_multi_core_speedup + methodology_vpp_startup_settings + methodology_kvm_vms_vhost_user + methodology_lxc_drc_container_memif + methodology_k8s_container_memif + methodology_vpp_device_functional + methodology_ipsec_on_intel_qat + methodology_trex_traffic_generator + methodology_http_tcp_with_wrk_tool diff --git a/docs/report/introduction/methodology_bmrr_throughput.rst b/docs/report/introduction/methodology_bmrr_throughput.rst new file mode 100644 index 0000000000..ac3c54e907 --- /dev/null +++ b/docs/report/introduction/methodology_bmrr_throughput.rst @@ -0,0 +1,54 @@ +(B)MRR Throughput +----------------- + +Maximum Receive Rate (MRR) tests are complementary to MLRsearch tests, +as they provide a maximum "raw" throughput benchmark for development and +testing community. MRR tests measure the packet forwarding rate under +the maximum load offered by traffic generator over a set trial duration, +regardless of packet loss. Maximum load for specified Ethernet frame +size is set to the bi-directional link rate. + +In |csit-release| MRR test code has been updated with a configurable +burst MRR parameters: trial duration and number of trials in a single +burst. This enabled a new Burst MRR (BMRR) methodology for more precise +performance trending. + +Current parameters for BMRR tests: + +- Ethernet frame sizes: 64B (78B for IPv6), IMIX, 1518B, 9000B; all + quoted sizes include frame CRC, but exclude per frame transmission + overhead of 20B (preamble, inter frame gap). + +- Maximum load offered: 10GE and 40GE link (sub-)rates depending on NIC + tested, with the actual packet rate depending on frame size, + transmission overhead and traffic generator NIC forwarding capacity. + + - For 10GE NICs the maximum packet rate load is 2* 14.88 Mpps for 64B, + a 10GE bi-directional link rate. + - For 25GE NICs the maximum packet rate load is 2* 18.75 Mpps for 64B, + a 25GE bi-directional link sub-rate limited by TG 25GE NIC used, + XXV710. + - For 40GE NICs the maximum packet rate load is 2* 18.75 Mpps for 64B, + a 40GE bi-directional link sub-rate limited by TG 40GE NIC used, + XL710. Packet rate for other tested frame sizes is limited by PCIe + Gen3 x8 bandwidth limitation of ~50Gbps. + +- Trial duration: 1 sec. + +- Number of trials per burst: 10. + +Similarly to NDR/PDR throughput tests, MRR test should be reporting bi- +directional link rate (or NIC rate, if lower) if tested VPP +configuration can handle the packet rate higher than bi-directional link +rate, e.g. large packet tests and/or multi-core tests. + +MRR tests are currently used for FD.io CSIT continuous performance +trending and for comparison between releases. Daily trending job tests +subset of frame sizes, focusing on 64B (78B for IPv6) for all tests and +IMIX for selected tests (vhost, memif). + +MRR-like measurements are being used to establish starting conditions +for experimental Probabilistic Loss Ratio Search (PLRsearch) used for +soak testing, aimed at verifying continuous system performance over an +extended period of time, hours, days, weeks, months. PLRsearch code is +currently in experimental phase in FD.io CSIT project. diff --git a/docs/report/introduction/methodology_data_plane_throughput.rst b/docs/report/introduction/methodology_data_plane_throughput.rst new file mode 100644 index 0000000000..10d85e991c --- /dev/null +++ b/docs/report/introduction/methodology_data_plane_throughput.rst @@ -0,0 +1,33 @@ +Data Plane Throughput +--------------------- + +Network data plane packet and bandwidth throughput are measured in +accordance with :rfc:`2544`, using FD.io CSIT Multiple Loss Ratio search +(MLRsearch), an optimized throughput search algorithm, that measures +SUT/DUT packet throughput rates at different Packet Loss Ratio (PLR) +values. + +Following MLRsearch values are measured across a range of L2 frame sizes +and reported: + +- NON DROP RATE (NDR): packet and bandwidth throughput at PLR=0%. + + - **Aggregate packet rate**: NDR_LOWER + pps. + - **Aggregate bandwidth rate**: NDR_LOWER Gbps. + +- PARTIAL DROP RATE (PDR): packet and bandwidth throughput at PLR=0.5%. + + - **Aggregate packet rate**: PDR_LOWER + pps. + - **Aggregate bandwidth rate**: PDR_LOWER Gbps. + +NDR and PDR are measured for the following L2 frame sizes (untagged +Ethernet): + +- IPv4 payload: 64B, IMIX (28x64B, 16x570B, 4x1518B), 1518B, 9000B. +- IPv6 payload: 78B, IMIX (28x78B, 16x570B, 4x1518B), 1518B, 9000B. + +All rates are reported from external Traffic Generator perspective. diff --git a/docs/report/introduction/methodology_http_tcp_with_wrk_tool.rst b/docs/report/introduction/methodology_http_tcp_with_wrk_tool.rst new file mode 100644 index 0000000000..28f3fc6bbb --- /dev/null +++ b/docs/report/introduction/methodology_http_tcp_with_wrk_tool.rst @@ -0,0 +1,40 @@ +HTTP/TCP with WRK Tool +---------------------- + +`WRK HTTP benchmarking tool `_ is used for +experimental TCP/IP and HTTP tests of VPP TCP/IP stack and built-in +static HTTP server. WRK has been chosen as it is capable of generating +significant TCP/IP and HTTP loads by scaling number of threads across +multi-core processors. + +This in turn enables quite high scale benchmarking of the main TCP/IP +and HTTP service including HTTP TCP/IP Connections-Per-Second (CPS), +HTTP Requests-Per-Second and HTTP Bandwidth Throughput. + +The initial tests are designed as follows: + +- HTTP and TCP/IP Connections-Per-Second (CPS) + + - WRK configured to use 8 threads across 8 cores, 1 thread per core. + - Maximum of 50 concurrent connections across all WRK threads. + - Timeout for server responses set to 5 seconds. + - Test duration is 30 seconds. + - Expected HTTP test sequence: + + - Single HTTP GET Request sent per open connection. + - Connection close after valid HTTP reply. + - Resulting flow sequence - 8 packets: >Syn, Ack, >Req, + Fin, Ack. + +- HTTP Requests-Per-Second + + - WRK configured to use 8 threads across 8 cores, 1 thread per core. + - Maximum of 50 concurrent connections across all WRK threads. + - Timeout for server responses set to 5 seconds. + - Test duration is 30 seconds. + - Expected HTTP test sequence: + + - Multiple HTTP GET Requests sent in sequence per open connection. + - Connection close after set test duration time. + - Resulting flow sequence: >Syn, Ack, >Req[1], Req[n], Fin, Ack. diff --git a/docs/report/introduction/methodology_ipsec_on_intel_qat.rst b/docs/report/introduction/methodology_ipsec_on_intel_qat.rst new file mode 100644 index 0000000000..54fb1b0ef2 --- /dev/null +++ b/docs/report/introduction/methodology_ipsec_on_intel_qat.rst @@ -0,0 +1,14 @@ +IPSec on Intel QAT +------------------ + +VPP IPSec performance tests are using DPDK cryptodev device driver in +combination with HW cryptodev devices - Intel QAT 8950 50G - present in +LF FD.io physical testbeds. DPDK cryptodev can be used for all IPSec +data plane functions supported by VPP. + +Currently |csit-release| implements following IPSec test cases: + +- AES-GCM, CBC-SHA1 ciphers, in combination with IPv4 routed-forwarding + with Intel xl710 NIC. +- CBC-SHA1 ciphers, in combination with LISP-GPE overlay tunneling for + IPv4-over-IPv4 with Intel xl710 NIC. diff --git a/docs/report/introduction/methodology_k8s_container_memif.rst b/docs/report/introduction/methodology_k8s_container_memif.rst new file mode 100644 index 0000000000..2e5ce0e017 --- /dev/null +++ b/docs/report/introduction/methodology_k8s_container_memif.rst @@ -0,0 +1,23 @@ +K8s Container Memif +------------------- + +|csit-release| includes tests of VPP topologies running in K8s +orchestrated Pods/Containers and connected over memif virtual +interfaces. In order to provide simple topology coding flexibility and +extensibility container orchestration is done with `Kubernetes +`_ using `Docker +`_ images for all container applications +including VPP. `Ligato `_ is used for the +Pod/Container networking orchestration that is integrated with K8s, +including memif support. + +In these tests VPP vswitch runs in a K8s Pod with Docker Container (DRC) +handling NIC interfaces and connecting over memif to more instances of +VPP running in Pods/DRCs. All DRCs run in a priviliged mode with VPP +data plane worker threads pinned to dedicated physical CPU cores per +usual CSIT practice. All VPP instances run the same version of software. +This test topology is equivalent to existing tests with vhost-user and +VMs as described earlier in :ref:`tested_physical_topologies`. + +Further documentation is available in +:ref:`container_orchestration_in_csit`. diff --git a/docs/report/introduction/methodology_kvm_vms_vhost_user.rst b/docs/report/introduction/methodology_kvm_vms_vhost_user.rst new file mode 100644 index 0000000000..34e3bc0447 --- /dev/null +++ b/docs/report/introduction/methodology_kvm_vms_vhost_user.rst @@ -0,0 +1,23 @@ +KVM VMs vhost-user +------------------ + +FD.io CSIT performance lab is testing VPP vhost with KVM VMs using +following environment settings: + +- Tests with varying Qemu virtio queue (a.k.a. vring) sizes: [vr256] + default 256 descriptors, [vr1024] 1024 descriptors to optimize for + packet throughput. +- Tests with varying Linux :abbr:`CFS (Completely Fair Scheduler)` + settings: [cfs] default settings, [cfsrr1] CFS RoundRobin(1) policy + applied to all data plane threads handling test packet path including + all VPP worker threads and all Qemu testpmd poll-mode threads. +- Resulting test cases are all combinations with [vr256,vr1024] and + [cfs,cfsrr1] settings. +- Adjusted Linux kernel :abbr:`CFS (Completely Fair Scheduler)` + scheduler policy for data plane threads used in CSIT is documented in + `CSIT Performance Environment Tuning wiki + `_. +- The purpose is to verify performance impact (MRR and NDR/PDR + throughput) and same test measurements repeatability, by making VPP + and VM data plane threads less susceptible to other Linux OS system + tasks hijacking CPU cores running those data plane threads. diff --git a/docs/report/introduction/methodology_lxc_drc_container_memif.rst b/docs/report/introduction/methodology_lxc_drc_container_memif.rst new file mode 100644 index 0000000000..56f052550c --- /dev/null +++ b/docs/report/introduction/methodology_lxc_drc_container_memif.rst @@ -0,0 +1,21 @@ +LXC/DRC Container Memif +----------------------- + +|csit-release| includes tests taking advantage of VPP memif virtual +interface (shared memory interface) to interconnect VPP running in +Containers. VPP vswitch instance runs in bare-metal user-mode handling +NIC interfaces and connecting over memif (Slave side) to VPPs running in +:abbr:`Linux Container (LXC)` or in Docker Container (DRC) configured +with memif (Master side). LXCs and DRCs run in a priviliged mode with +VPP data plane worker threads pinned to dedicated physical CPU cores per +usual CSIT practice. All VPP instances run the same version of software. +This test topology is equivalent to existing tests with vhost-user and +VMs as described earlier in :ref:`tested_logical_topologies`. + +In addition to above vswitch tests, a single memif interface test is +executed. It runs in a simple topology of two VPP container instances +connected over memif interface in order to verify standalone memif +interface performance. + +More information about CSIT LXC and DRC setup and control is available +in :ref:`container_orchestration_in_csit`. diff --git a/docs/report/introduction/methodology_mlrsearch_tests.rst b/docs/report/introduction/methodology_mlrsearch_tests.rst new file mode 100644 index 0000000000..15b9ee0290 --- /dev/null +++ b/docs/report/introduction/methodology_mlrsearch_tests.rst @@ -0,0 +1,292 @@ + +.. _mlrsearch_algorithm: + +MLRsearch Tests +--------------- + +Multiple Loss Rate search (MLRsearch) tests use new search algorithm +implemented in FD.io CSIT project. MLRsearch discovers multiple packet +throughput rates in a single search, with each rate associated with a +distinct Packet Loss Ratio (PLR) criteria. MLRsearch is being +standardized in IETF with `draft-vpolak-mkonstan-mlrsearch-XX +`_. + +Two throughput measurements used in FD.io CSIT are Non-Drop Rate (NDR, +with zero packet loss, PLR=0) and Partial Drop Rate (PDR, with packet +loss rate not greater than the configured non-zero PLR). MLRsearch +discovers NDR and PDR in a single pass reducing required execution time +compared to separate binary searches for NDR and PDR. MLRsearch reduces +execution time even further by relying on shorter trial durations +of intermediate steps, with only the final measurements +conducted at the specified final trial duration. +This results in the shorter overall search +execution time when compared to a standard NDR/PDR binary search, +while guaranteeing the same or similar results. + +If needed, MLRsearch can be easily adopted to discover more throughput rates +with different pre-defined PLRs. + +.. Note:: All throughput rates are *always* bi-directional + aggregates of two equal (symmetric) uni-directional packet rates + received and reported by an external traffic generator. + +Overview +~~~~~~~~ + +The main properties of MLRsearch: + +- MLRsearch is a duration aware multi-phase multi-rate search algorithm. + + - Initial phase determines promising starting interval for the search. + - Intermediate phases progress towards defined final search criteria. + - Final phase executes measurements according to the final search + criteria. + +- *Initial phase*: + + - Uses link rate as a starting transmit rate and discovers the Maximum + Receive Rate (MRR) used as an input to the first intermediate phase. + +- *Intermediate phases*: + + - Start with initial trial duration (in the first phase) and converge + geometrically towards the final trial duration (in the final phase). + - Track two values for NDR and two for PDR. + + - The values are called (NDR or PDR) lower_bound and upper_bound. + - Each value comes from a specific trial measurement + (most recent for that transmit rate), + and as such the value is associated with that measurement's duration and + loss. + - A bound can be invalid, for example if NDR lower_bound + has been measured with nonzero loss. + - Invalid bounds are not real boundaries for the searched value, + but are needed to track interval widths. + - Valid bounds are real boundaries for the searched value. + - Each non-initial phase ends with all bounds valid. + + - Start with a large (lower_bound, upper_bound) interval width and + geometrically converge towards the width goal (measurement resolution) + of the phase. Each phase halves the previous width goal. + - Use internal and external searches: + + - External search - measures at transmit rates outside the (lower_bound, + upper_bound) interval. Activated when a bound is invalid, + to search for a new valid bound by doubling the interval width. + It is a variant of `exponential search`_. + - Internal search - `binary search`_, measures at transmit rates within the + (lower_bound, upper_bound) valid interval, halving the interval width. + +- *Final phase* is executed with the final test trial duration, and the final + width goal that determines resolution of the overall search. + Intermediate phases together with the final phase are called non-initial + phases. + +The main benefits of MLRsearch vs. binary search include: + +- In general MLRsearch is likely to execute more search trials overall, but + less trials at a set final duration. +- In well behaving cases it greatly reduces (>50%) the overall duration + compared to a single PDR (or NDR) binary search duration, + while finding multiple drop rates. +- In all cases MLRsearch yields the same or similar results to binary search. +- Note: both binary search and MLRsearch are susceptible to reporting + non-repeatable results across multiple runs for very bad behaving + cases. + +Caveats: + +- Worst case MLRsearch can take longer than a binary search e.g. in case of + drastic changes in behaviour for trials at varying durations. + +Search Implementation +~~~~~~~~~~~~~~~~~~~~~ + +Following is a brief description of the current MLRsearch +implementation in FD.io CSIT. + +Input Parameters +```````````````` + +#. *maximum_transmit_rate* - maximum packet transmit rate to be used by + external traffic generator, limited by either the actual Ethernet + link rate or traffic generator NIC model capabilities. Sample + defaults: 2 * 14.88 Mpps for 64B 10GE link rate, + 2 * 18.75 Mpps for 64B 40GE NIC maximum rate. +#. *minimum_transmit_rate* - minimum packet transmit rate to be used for + measurements. MLRsearch fails if lower transmit rate needs to be + used to meet search criteria. Default: 2 * 10 kpps (could be higher). +#. *final_trial_duration* - required trial duration for final rate + measurements. Default: 30 sec. +#. *initial_trial_duration* - trial duration for initial MLRsearch phase. + Default: 1 sec. +#. *final_relative_width* - required measurement resolution expressed as + (lower_bound, upper_bound) interval width relative to upper_bound. + Default: 0.5%. +#. *packet_loss_ratio* - maximum acceptable PLR search criteria for + PDR measurements. Default: 0.5%. +#. *number_of_intermediate_phases* - number of phases between the initial + phase and the final phase. Impacts the overall MLRsearch duration. + Less phases are required for well behaving cases, more phases + may be needed to reduce the overall search duration for worse behaving + cases. + Default (2). (Value chosen based on limited experimentation to date. + More experimentation needed to arrive to clearer guidelines.) + +Initial Phase +````````````` + +1. First trial measures at maximum rate and discovers MRR. + + a. *in*: trial_duration = initial_trial_duration. + b. *in*: offered_transmit_rate = maximum_transmit_rate. + c. *do*: single trial. + d. *out*: measured loss ratio. + e. *out*: mrr = measured receive rate. + +2. Second trial measures at MRR and discovers MRR2. + + a. *in*: trial_duration = initial_trial_duration. + b. *in*: offered_transmit_rate = MRR. + c. *do*: single trial. + d. *out*: measured loss ratio. + e. *out*: mrr2 = measured receive rate. + +3. Third trial measures at MRR2. + + a. *in*: trial_duration = initial_trial_duration. + b. *in*: offered_transmit_rate = MRR2. + c. *do*: single trial. + d. *out*: measured loss ratio. + +Non-initial Phases +`````````````````` + +1. Main loop: + + a. *in*: trial_duration for the current phase. + Set to initial_trial_duration for the first intermediate phase; + to final_trial_duration for the final phase; + or to the element of interpolating geometric sequence + for other intermediate phases. + For example with two intermediate phases, trial_duration + of the second intermediate phase is the geometric average + of initial_strial_duration and final_trial_duration. + b. *in*: relative_width_goal for the current phase. + Set to final_relative_width for the final phase; + doubled for each preceding phase. + For example with two intermediate phases, + the first intermediate phase uses quadruple of final_relative_width + and the second intermediate phase uses double of final_relative_width. + c. *in*: ndr_interval, pdr_interval from the previous main loop iteration + or the previous phase. + If the previous phase is the initial phase, both intervals have + lower_bound = MRR2, uper_bound = MRR. + Note that the initial phase is likely to create intervals with invalid + bounds. + d. *do*: According to the procedure described in point 2, + either exit the phase (by jumping to 1.g.), + or prepare new transmit rate to measure with. + e. *do*: Perform the trial measurement at the new transmit rate + and trial_duration, compute its loss ratio. + f. *do*: Update the bounds of both intervals, based on the new measurement. + The actual update rules are numerous, as NDR external search + can affect PDR interval and vice versa, but the result + agrees with rules of both internal and external search. + For example, any new measurement below an invalid lower_bound + becomes the new lower_bound, while the old measurement + (previously acting as the invalid lower_bound) + becomes a new and valid upper_bound. + Go to next iteration (1.c.), taking the updated intervals as new input. + g. *out*: current ndr_interval and pdr_interval. + In the final phase this is also considered + to be the result of the whole search. + For other phases, the next phase loop is started + with the current results as an input. + +2. New transmit rate (or exit) calculation (for 1.d.): + + - If there is an invalid bound then prepare for external search: + + - *If* the most recent measurement at NDR lower_bound transmit rate + had the loss higher than zero, then + the new transmit rate is NDR lower_bound + decreased by two NDR interval widths. + - Else, *if* the most recent measurement at PDR lower_bound + transmit rate had the loss higher than PLR, then + the new transmit rate is PDR lower_bound + decreased by two PDR interval widths. + - Else, *if* the most recent measurement at NDR upper_bound + transmit rate had no loss, then + the new transmit rate is NDR upper_bound + increased by two NDR interval widths. + - Else, *if* the most recent measurement at PDR upper_bound + transmit rate had the loss lower or equal to PLR, then + the new transmit rate is PDR upper_bound + increased by two PDR interval widths. + - If interval width is higher than the current phase goal: + + - Else, *if* NDR interval does not meet the current phase width goal, + prepare for internal search. The new transmit rate is + (NDR lower bound + NDR upper bound) / 2. + - Else, *if* PDR interval does not meet the current phase width goal, + prepare for internal search. The new transmit rate is + (PDR lower bound + PDR upper bound) / 2. + - Else, *if* some bound has still only been measured at a lower duration, + prepare to re-measure at the current duration (and the same transmit + rate). The order of priorities is: + + - NDR lower_bound, + - PDR lower_bound, + - NDR upper_bound, + - PDR upper_bound. + - *Else*, do not prepare any new rate, to exit the phase. + This ensures that at the end of each non-initial phase + all intervals are valid, narrow enough, and measured + at current phase trial duration. + +Implementation Deviations +~~~~~~~~~~~~~~~~~~~~~~~~~ + +This document so far has been describing a simplified version of MLRsearch +algorithm. The full algorithm as implemented contains additional logic, +which makes some of the details (but not general ideas) above incorrect. +Here is a short description of the additional logic as a list of principles, +explaining their main differences from (or additions to) the simplified +description,but without detailing their mutual interaction. + +1. *Logarithmic transmit rate.* + In order to better fit the relative width goal, + the interval doubling and halving is done differently. + For example, the middle of 2 and 8 is 4, not 5. +2. *Optimistic maximum rate.* + The increased rate is never higher than the maximum rate. + Upper bound at that rate is always considered valid. +3. *Pessimistic minimum rate.* + The decreased rate is never lower than the minimum rate. + If a lower bound at that rate is invalid, + a phase stops refining the interval further (until it gets re-measured). +4. *Conservative interval updates.* + Measurements above current upper bound never update a valid upper bound, + even if drop ratio is low. + Measurements below current lower bound always update any lower bound + if drop ratio is high. +5. *Ensure sufficient interval width.* + Narrow intervals make external search take more time to find a valid bound. + If the new transmit increased or decreased rate would result in width + less than the current goal, increase/decrease more. + This can happen if the measurement for the other interval + makes the current interval too narrow. + Similarly, take care the measurements in the initial phase + create wide enough interval. +6. *Timeout for bad cases.* + The worst case for MLRsearch is when each phase converges to intervals + way different than the results of the previous phase. + Rather than suffer total search time several times larger + than pure binary search, the implemented tests fail themselves + when the search takes too long (given by argument *timeout*). + +.. _binary search: https://en.wikipedia.org/wiki/Binary_search +.. _exponential search: https://en.wikipedia.org/wiki/Exponential_search +.. _estimation of standard deviation: https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation +.. _simplified error propagation formula: https://en.wikipedia.org/wiki/Propagation_of_uncertainty#Simplification diff --git a/docs/report/introduction/methodology_multi_core_speedup.rst b/docs/report/introduction/methodology_multi_core_speedup.rst new file mode 100644 index 0000000000..94840406a1 --- /dev/null +++ b/docs/report/introduction/methodology_multi_core_speedup.rst @@ -0,0 +1,66 @@ +Multi-Core Speedup +------------------ + +All performance tests are executed with single processor core and with +multiple cores scenarios. + +Intel Hyper-Threading (HT) +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Intel Xeon processors used in FD.io CSIT can operate either in HT +Disabled mode (single logical core per each physical core) or in HT +Enabled mode (two logical cores per each physical core). HT setting is +applied in BIOS and requires server SUT reload for it to take effect, +making it impractical for continuous changes of HT mode of operation. + +|csit-release| performance tests are executed with server SUTs' Intel +XEON processors configured with Intel Hyper-Threading Disabled for all +Xeon Haswell testbeds (3n-hsw) and with Intel Hyper-Threading Enabled +for all Xeon Skylake testbeds. + +More information about physical testbeds is provided in +:ref:`tested_physical_topologies`. + +Multi-core Tests +~~~~~~~~~~~~~~~~ + +|csit-release| multi-core tests are executed in the following VPP worker +thread and physical core configurations: + +#. Intel Xeon Haswell testbeds (3n-hsw) with Intel HT disabled + (1 logical CPU core per each physical core): + + #. 1t1c - 1 VPP worker thread on 1 physical core. + #. 2t2c - 2 VPP worker threads on 2 physical cores. + #. 4t4c - 4 VPP worker threads on 4 physical cores. + +#. Intel Xeon Skylake testbeds (2n-skx, 3n-skx) with Intel HT enabled + (2 logical CPU cores per each physical core): + + #. 2t1c - 2 VPP worker threads on 1 physical core. + #. 4t2c - 4 VPP worker threads on 2 physical cores. + #. 8t4c - 8 VPP worker threads on 4 physical cores. + +VPP worker threads are the data plane threads running on isolated +logical cores. With Intel HT enabled VPP workers are placed as sibling +threads on each used physical core. VPP control threads (main, stats) +are running on a separate non-isolated core together with other Linux +processes. + +In all CSIT tests care is taken to ensure that each VPP worker handles +the same amount of received packet load and does the same amount of +packet processing work. This is achieved by evenly distributing per +interface type (e.g. physical, virtual) receive queues over VPP workers +using default VPP round- robin mapping and by loading these queues with +the same amount of packet flows. + +If number of VPP workers is higher than number of physical or virtual +interfaces, multiple receive queues are configured on each interface. +NIC Receive Side Scaling (RSS) for physical interfaces and multi-queue +for virtual interfaces are used for this purpose. + +Section :ref:`throughput_speedup_multi_core` includes a set of graphs +illustrating packet throughout speedup when running VPP worker threads +on multiple cores. Note that in quite a few test cases running VPP +workers on 2 or 4 physical cores hits the I/O bandwidth or packets-per- +second limit of tested NIC. diff --git a/docs/report/introduction/methodology_packet_latency.rst b/docs/report/introduction/methodology_packet_latency.rst new file mode 100644 index 0000000000..550d12f688 --- /dev/null +++ b/docs/report/introduction/methodology_packet_latency.rst @@ -0,0 +1,23 @@ +Packet Latency +-------------- + +TRex Traffic Generator (TG) is used for measuring latency of VPP DUTs. +Reported latency values are measured using following methodology: + +- Latency tests are performed at 100% of discovered NDR and PDR rates + for each throughput test and packet size (except IMIX). +- TG sends dedicated latency streams, one per direction, each at the + rate of 9 kpps at the prescribed packet size; these are sent in + addition to the main load streams. +- TG reports min/avg/max latency values per stream direction, hence two + sets of latency values are reported per test case; future release of + TRex is expected to report latency percentiles. +- Reported latency values are aggregate across two SUTs due to three + node topology used for all performance tests; for per SUT latency, + reported value should be divided by two. +- 1usec is the measurement accuracy advertised by TRex TG for the setup + used in FD.io labs used by CSIT project. +- TRex setup introduces an always-on error of about 2*2usec per latency + flow additonal Tx/Rx interface latency induced by TRex SW writing and + reading packet timestamps on CPU cores without HW acceleration on NICs + closer to the interface line. diff --git a/docs/report/introduction/methodology_trex_traffic_generator.rst b/docs/report/introduction/methodology_trex_traffic_generator.rst new file mode 100644 index 0000000000..4d4de96fb0 --- /dev/null +++ b/docs/report/introduction/methodology_trex_traffic_generator.rst @@ -0,0 +1,53 @@ +TRex Traffic Generator +---------------------- + +Usage +~~~~~ + +`TRex traffic generator `_ is used for all +CSIT performance tests. TRex stateless mode is used to measure NDR and +PDR throughputs using binary search (NDR and PDR discovery tests) and +for quick checks of DUT performance against the reference NDRs (NDR +check tests) for specific configuration. + +TRex is installed and run on the TG compute node. The typical procedure +is: + +- If the TRex is not already installed on TG, it is installed in the + suite setup phase - see `TRex intallation`_. +- TRex configuration is set in its configuration file + :: + + /etc/trex_cfg.yaml + +- TRex is started in the background mode + :: + + $ sh -c 'cd /scripts/ && sudo nohup ./t-rex-64 -i -c 7 --iom 0 > /tmp/trex.log 2>&1 &' > /dev/null + +- There are traffic streams dynamically prepared for each test, based on traffic + profiles. The traffic is sent and the statistics obtained using + :command:`trex_stl_lib.api.STLClient`. + +Measuring Packet Loss +~~~~~~~~~~~~~~~~~~~~~ + +Following sequence is followed to measure packet loss: + +- Create an instance of STLClient. +- Connect to the client. +- Add all streams. +- Clear statistics. +- Send the traffic for defined time. +- Get the statistics. + +If there is a warm-up phase required, the traffic is sent also before +test and the statistics are ignored. + +Measuring Latency +~~~~~~~~~~~~~~~~~ + +If measurement of latency is requested, two more packet streams are +created (one for each direction) with TRex flow_stats parameter set to +STLFlowLatencyStats. In that case, returned statistics will also include +min/avg/max latency values. diff --git a/docs/report/introduction/methodology_tunnel_encapsulations.rst b/docs/report/introduction/methodology_tunnel_encapsulations.rst new file mode 100644 index 0000000000..6c47d1bd33 --- /dev/null +++ b/docs/report/introduction/methodology_tunnel_encapsulations.rst @@ -0,0 +1,38 @@ +Tunnel Encapsulations +--------------------- + +Tunnel encapsulations testing is grouped based on the type of outer +header: IPv4 or IPv6. + +IPv4 Tunnels +~~~~~~~~~~~~ + +VPP is tested in the following IPv4 tunnel baseline configurations: + +- *ip4vxlan-l2bdbase*: VXLAN over IPv4 tunnels with L2 bridge-domain MAC + switching. +- *ip4vxlan-l2xcbase*: VXLAN over IPv4 tunnels with L2 cross-connect. +- *ip4lispip4-ip4base*: LISP over IPv4 tunnels with IPv4 routing. +- *ip4lispip6-ip6base*: LISP over IPv4 tunnels with IPv6 routing. + +In all cases listed above low number of MAC, IPv4, IPv6 flows (253 per +direction) is switched or routed by VPP. + +In addition selected IPv4 tunnels are tested at scale: + +- *dot1q--ip4vxlanscale-l2bd*: VXLAN over IPv4 tunnels with L2 bridge- + domain MAC switching, with scaled up dot1q VLANs (10, 100, 1k), + mapped to scaled up L2 bridge-domains (10, 100, 1k), that are in turn + mapped to (10, 100, 1k) VXLAN tunnels. 64.5k flows are transmitted per + direction. + +IPv6 Tunnels +~~~~~~~~~~~~ + +VPP is tested in the following IPv6 tunnel baseline configurations: + +- *ip6lispip4-ip4base*: LISP over IPv4 tunnels with IPv4 routing. +- *ip6lispip6-ip6base*: LISP over IPv4 tunnels with IPv6 routing. + +In all cases listed above low number of IPv4, IPv6 flows (253 per +direction) is routed by VPP. diff --git a/docs/report/introduction/methodology_vpp_device_functional.rst b/docs/report/introduction/methodology_vpp_device_functional.rst new file mode 100644 index 0000000000..41a8040ef6 --- /dev/null +++ b/docs/report/introduction/methodology_vpp_device_functional.rst @@ -0,0 +1,13 @@ +VPP_Device Functional +--------------------- + +|csit-release| added new VPP_Device test environment for functional VPP +device tests integrated into LFN CI/CD infrastructure. VPP_Device tests +run on 1-Node testbeds (1n-skx, 1n-arm) and rely on Linux SRIOV Virtual +Function (VF), dot1q VLAN tagging and external loopback cables to +facilitate packet passing over exernal physical links. Initial focus is +on few baseline tests. Existing CSIT VIRL tests can be moved to +VPP_Device framework by changing L1 and L2 KW(s). RF test definition +code stays unchanged with the exception of requiring adjustments from +3-Node to 2-Node logical topologies. CSIT VIRL to VPP_Device migration +is expected in the next CSIT release. diff --git a/docs/report/introduction/methodology_vpp_features.rst b/docs/report/introduction/methodology_vpp_features.rst new file mode 100644 index 0000000000..a19caa4428 --- /dev/null +++ b/docs/report/introduction/methodology_vpp_features.rst @@ -0,0 +1,80 @@ +VPP Features +------------ + +VPP is tested in a number of data plane feature configurations across +different forwarding modes. Following sections list features tested. + +ACL Security-Groups +~~~~~~~~~~~~~~~~~~~ + +Both stateless and stateful access control lists (ACL), also known as +security-groups, are supported by VPP. + +Following ACL configurations are tested for MAC switching with L2 +bridge-domains: + +- *l2bdbasemaclrn-iacl{E}sl-{F}flows*: Input stateless ACL, with {E} + entries and {F} flows. +- *l2bdbasemaclrn-oacl{E}sl-{F}flows*: Output stateless ACL, with {E} + entries and {F} flows. +- *l2bdbasemaclrn-iacl{E}sf-{F}flows*: Input stateful ACL, with {E} + entries and {F} flows. +- *l2bdbasemaclrn-oacl{E}sf-{F}flows*: Output stateful ACL, with {E} + entries and {F} flows. + +Following ACL configurations are tested with IPv4 routing: + +- *ip4base-iacl{E}sl-{F}flows*: Input stateless ACL, with {E} entries + and {F} flows. +- *ip4base-oacl{E}sl-{F}flows*: Output stateless ACL, with {E} entries + and {F} flows. +- *ip4base-iacl{E}sf-{F}flows*: Input stateful ACL, with {E} entries and + {F} flows. +- *ip4base-oacl{E}sf-{F}flows*: Output stateful ACL, with {E} entries + and {F} flows. + +ACL tests are executed with the following combinations of ACL entries +and number of flows: + +- ACL entry definitions + + - flow non-matching deny entry: (src-ip4, dst-ip4, src-port, dst-port). + - flow matching permit ACL entry: (src-ip4, dst-ip4). + +- {E} - number of non-matching deny ACL entries, {E} = [1, 10, 50]. +- {F} - number of UDP flows with different tuple (src-ip4, dst-ip4, + src-port, dst-port), {F} = [100, 10k, 100k]. +- All {E}x{F} combinations are tested per ACL type, total of 9. + +ACL MAC-IP +~~~~~~~~~~ + +MAC-IP binding ACLs are tested for MAC switching with L2 bridge-domains: + +- *l2bdbasemaclrn-macip-iacl{E}sl-{F}flows*: Input stateless ACL, with + {E} entries and {F} flows. + +MAC-IP ACL tests are executed with the following combinations of ACL +entries and number of flows: + +- ACL entry definitions + + - flow non-matching deny entry: (dst-ip4, dst-mac, bit-mask) + - flow matching permit ACL entry: (dst-ip4, dst-mac, bit-mask) + +- {E} - number of non-matching deny ACL entries, {E} = [1, 10, 50] +- {F} - number of UDP flows with different tuple (dst-ip4, dst-mac), + {F} = [100, 10k, 100k] +- All {E}x{F} combinations are tested per ACL type, total of 9. + +NAT44 +~~~~~ + +NAT44 is tested in baseline and scale configurations with IPv4 routing: + +- *ip4base-nat44*: baseline test with single NAT entry (addr, port), + single UDP flow. +- *ip4base-udpsrcscale{U}-nat44*: baseline test with {U} NAT entries + (addr, {U}ports), {U}=15. +- *ip4scale{R}-udpsrcscale{U}-nat44*: scale tests with {R}*{U} NAT + entries ({R}addr, {U}ports), {R}=[100, 1k, 2k, 4k], {U}=15. diff --git a/docs/report/introduction/methodology_vpp_forwarding_modes.rst b/docs/report/introduction/methodology_vpp_forwarding_modes.rst new file mode 100644 index 0000000000..6cf206f2f6 --- /dev/null +++ b/docs/report/introduction/methodology_vpp_forwarding_modes.rst @@ -0,0 +1,90 @@ +VPP Forwarding Modes +-------------------- + +VPP is tested in a number of L2 and IP packet lookup and forwarding +modes. Within each mode baseline and scale tests are executed, the +latter with varying number of lookup entries. + +L2 Ethernet Switching +~~~~~~~~~~~~~~~~~~~~~ + +VPP is tested in three L2 forwarding modes: + +- *l2patch*: L2 patch, the fastest point-to-point L2 path that loops + packets between two interfaces without any Ethernet frame checks or + lookups. +- *l2xc*: L2 cross-connect, point-to-point L2 path with all Ethernet + frame checks, but no MAC learning and no MAC lookup. +- *l2bd*: L2 bridge-domain, multipoint-to-multipoint L2 path with all + Ethernet frame checks, with MAC learning (unless static MACs are used) + and MAC lookup. + +l2bd tests are executed in baseline and scale configurations: + +- *l2bdbase*: low number of L2 flows (253 per direction) is switched by + VPP. They drive the content of MAC FIB size (506 total MAC entries). + Both source and destination MAC addresses are incremented on a packet + by packet basis. + +- *l2bdscale*: high number of L2 flows is switched by VPP. Tested MAC + FIB sizes include: i) 10k (5k unique flows per direction), ii) 100k + (2x 50k flows) and iii) 1M (2x 500k). Both source and destination MAC + addresses are incremented on a packet by packet basis, ensuring new + entries are learn refreshed and looked up at every packet, making it + the worst case scenario. + +Ethernet wire encapsulations tested include: untagged, dot1q, dot1ad. + +IPv4 Routing +~~~~~~~~~~~~ + +IPv4 routing tests are executed in baseline and scale configurations: + +- *ip4base*: low number of IPv4 flows (253 per direction) is routed by + VPP. They drive the content of IPv4 FIB size (506 total /32 prefixes). + Destination IPv4 addresses are incremented on a packet by packet + basis. + +- *ip4scale*: high number of IPv4 flows is routed by VPP. Tested IPv4 + FIB sizes of /32 prefixes include: i) 20k (10k unique flows per + direction), ii) 200k (2x 100k flows) and iii) 2M (2x 1M). Destination + IPv4 addresses are incremented on a packet by packet basis, ensuring + new FIB entries are looked up at every packet, making it the worst + case scenario. + +IPv6 Routing +~~~~~~~~~~~~ + +IPv6 routing tests are executed in baseline and scale configurations: + +- *ip6base*: low number of IPv6 flows (253 per direction) is routed by + VPP. They drive the content of IPv6 FIB size (506 total /128 prefixes). + Destination IPv6 addresses are incremented on a packet by packet + basis. + +- *ip6scale*: high number of IPv6 flows is routed by VPP. Tested IPv6 + FIB sizes of /128 prefixes include: i) 20k (10k unique flows per + direction), ii) 200k (2x 100k flows) and iii) 2M (2x 1M). Destination + IPv6 addresses are incremented on a packet by packet basis, ensuring + new FIB entries are looked up at every packet, making it the worst + case scenario. + +SRv6 Routing +~~~~~~~~~~~~ + +SRv6 routing tests are executed in a number of baseline configurations, +in each case SR policy and steering policy are configured for one +direction and one (or two) SR behaviours (functions) in the other +directions: + +- *srv6enc1sid*: One SID (no SRH present), one SR function - End. +- *srv6enc2sids*: Two SIDs (SRH present), two SR functions - End and + End.DX6. +- *srv6enc2sids-nodecaps*: Two SIDs (SRH present) without decapsulation, + one SR function - End. +- *srv6proxy-dyn*: Dynamic SRv6 proxy, one SR function - End.AD. +- *srv6proxy-masq*: Masquerading SRv6 proxy, one SR function - End.AM. +- *srv6proxy-stat*: Static SRv6 proxy, one SR function - End.AS. + +In all listed cases low number of IPv6 flows (253 per direction) is +routed by VPP. diff --git a/docs/report/introduction/methodology_vpp_startup_settings.rst b/docs/report/introduction/methodology_vpp_startup_settings.rst new file mode 100644 index 0000000000..16185b4c05 --- /dev/null +++ b/docs/report/introduction/methodology_vpp_startup_settings.rst @@ -0,0 +1,43 @@ +VPP Startup Settings +-------------------- + +CSIT code manipulates a number of VPP settings in startup.conf for optimized +performance. List of common settings applied to all tests and test +dependent settings follows. + +See `VPP startup.conf`_ +for a complete set and description of listed settings. + +Common Settings +~~~~~~~~~~~~~~~ + +List of vpp startup.conf settings applied to all tests: + +#. heap-size - set separately for ip4, ip6, stats, main + depending on scale tested. +#. no-tx-checksum-offload - disables UDP / TCP TX checksum offload in DPDK. + Typically needed for use faster vector PMDs (together with + no-multi-seg). +#. socket-mem , - memory per numa. (Not required anymore + due to VPP code changes, should be removed in CSIT-18.10.) + +Per Test Settings +~~~~~~~~~~~~~~~~~ + +List of vpp startup.conf settings applied dynamically per test: + +#. corelist-workers - list of logical cores to run VPP + worker data plane threads. Depends on HyperThreading and core per + test configuration. +#. num-rx-queues - depends on a number of VPP threads and NIC + interfaces. +#. num-rx-desc/num-tx-desc - number of rx/tx descriptors for specific + NICs, incl. xl710, x710, xxv710. +#. num-mbufs - increases number of buffers allocated, needed + only in scenarios with large number of interfaces and worker threads. + Value is per CPU socket. Default is 16384. +#. no-multi-seg - disables multi-segment buffers in DPDK, improves + packet throughput, but disables Jumbo MTU support. Disabled for all + tests apart from the ones that require Jumbo 9000B frame support. +#. UIO driver - depends on topology file definition. +#. QAT VFs - depends on NRThreads, each thread = 1QAT VFs. -- cgit 1.2.3-korg