Diffstat (limited to 'docs/content/methodology/overview')
-rw-r--r--  docs/content/methodology/overview/_index.md                     15
-rw-r--r--  docs/content/methodology/overview/dut_state_considerations.md  148
-rw-r--r--  docs/content/methodology/overview/multi_core_speedup.md         51
-rw-r--r--  docs/content/methodology/overview/per_thread_resources.md      101
-rw-r--r--  docs/content/methodology/overview/terminology.md                97
-rw-r--r--  docs/content/methodology/overview/trex_traffic_generator.md    195
-rw-r--r--  docs/content/methodology/overview/vpp_forwarding_modes.md      104
7 files changed, 711 insertions, 0 deletions
diff --git a/docs/content/methodology/overview/_index.md b/docs/content/methodology/overview/_index.md
new file mode 100644
index 0000000000..d7efd15d02
--- /dev/null
+++ b/docs/content/methodology/overview/_index.md
@@ -0,0 +1,15 @@
+---
+bookCollapseSection: true
+bookFlatSection: false
+title: "Overview"
+weight: 1
+---
+
+# Methodology
+
+- [Terminology]({{< relref "/methodology/overview/terminology" >}})
+- [Per Thread Resources]({{< relref "/methodology/overview/per_thread_resources" >}})
+- [Multi-Core Speedup]({{< relref "/methodology/overview/multi_core_speedup" >}})
+- [VPP Forwarding Modes]({{< relref "/methodology/overview/vpp_forwarding_modes" >}})
+- [DUT State Considerations]({{< relref "/methodology/overview/dut_state_considerations" >}})
+- [TRex Traffic Generator]({{< relref "/methodology/overview/trex_traffic_generator" >}})
diff --git a/docs/content/methodology/overview/dut_state_considerations.md b/docs/content/methodology/overview/dut_state_considerations.md
new file mode 100644
index 0000000000..eca10a22cd
--- /dev/null
+++ b/docs/content/methodology/overview/dut_state_considerations.md
@@ -0,0 +1,148 @@
+---
+title: "DUT State Considerations"
+weight: 5
+---
+
+# DUT State Considerations
+
+This page discusses considerations for Device Under Test (DUT) state.
+DUTs such as VPP require configuration, provided either before the application
+starts (via config files) or just after it starts (via API or CLI access).
+
+During operation, DUTs gather various telemetry data, depending on configuration.
+This internal state handling is part of normal operation,
+so any performance impact is included in the test results.
+Accessing telemetry data places additional load on the DUT,
+so we do not do that during the main trial measurements that affect results;
+instead, we include separate trials specifically for gathering runtime telemetry.
+
+But there is one kind of state that needs specific handling.
+This kind of DUT state is dynamically created based on incoming traffic;
+it affects how the DUT handles the traffic, and (unlike telemetry counters)
+it has an uneven impact on CPU load.
+A typical example is NAT, where detecting a new session takes more CPU than
+forwarding packets on existing (open or recently closed) sessions.
+We call DUT configurations with this kind of state "stateful",
+and configurations without it "stateless".
+(Even though stateless configurations contain the state described in the previous
+paragraphs, and some configuration items may have "stateful" in their name,
+such as stateful ACLs.)
+
+# Stateful DUT configurations
+
+Typically, the level of CPU impact of traffic depends on DUT state.
+The first packets causing a DUT state change have a higher impact;
+subsequent packets matching that state have a lower impact.
+
+From performance point of view, this is similar to traffic phases
+for stateful protocols, see
+[NGFW draft](https://tools.ietf.org/html/draft-ietf-bmwg-ngfw-performance-05#section-4.3.4).
+In CSIT we borrow the terminology (even if it does not fit perfectly,
+see the discussion below). Ramp-up traffic causes the state change;
+sustain traffic does not change the state.
+
+As the performance is different, each test has to choose which traffic
+it wants to test, and manipulate the DUT state to achieve the intended impact.
+
+## Ramp-up trial
+
+Tests aiming at sustain performance need to make sure the DUT state is created.
+We achieve this via a ramp-up trial, whose specific purpose
+is to create the state.
+
+Subsequent trials need no specific handling, as long as the state
+remains the same. But some state can time out, so additional ramp-up
+trials are inserted whenever the code detects the state could have timed out.
+Note that a trial with zero loss refreshes the state,
+so only the time since the last non-zero-loss trial is tracked.
+
+For the state to be set completely, it is important that neither DUT nor TG
+loses any packets. We achieve this by setting the profile multiplier
+(acting as transactions per second, TPS from now on) to a low enough value.
+
+It is also important that each state-affecting packet is sent.
+For size-limited traffic profiles this is guaranteed by the size limit.
+For continuous traffic, we set a long enough duration (based on TPS).
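+
+As an illustrative sketch only (the session count and rate below are
+hypothetical, not the values used in CSIT), the ramp-up duration for
+continuous traffic can be derived from the amount of state to create
+and the chosen TPS:
+
+```python
+# Hypothetical example: derive a ramp-up duration long enough to create
+# all sessions at a conservatively low transaction rate (TPS).
+sessions = 65536         # number of sessions the DUT state should contain
+ramp_up_tps = 20000.0    # low enough so neither DUT nor TG loses packets
+safety_margin = 1.1      # extra time to absorb small rate fluctuations
+
+ramp_up_duration = sessions / ramp_up_tps * safety_margin
+print(f"ramp-up duration: {ramp_up_duration:.1f} s")  # 3.6 s
+```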
+
+At the end of the ramp-up trial, we check DUT state to confirm
+it has been created as expected.
+The test fails if the state is not (completely) created.
+
+## State Reset
+
+Tests aiming at ramp-up performance do not use a ramp-up trial,
+and they need to reset the DUT state before each trial measurement.
+The way of resetting the state depends on the test;
+usually an API call is used to partially de-configure
+the part that holds the state, and then re-configure it.
+
+In CSIT we control the DUT state behavior via a test variable "resetter".
+If it is not set, the DUT state is not reset.
+If it is set, each search algorithm (including MRR) invokes it
+before every trial measurement (both the main and the telemetry ones).
+Any configuration keyword enabling a feature with DUT state
+checks whether a test variable for ramp-up rate is present.
+If it is present, the resetter is not set.
+If it is not present, the keyword sets the appropriate resetter value.
+This logic makes sure either ramp-up or state reset is used.
+
+Notes: If both ramp-up and state reset were used, the DUT behavior
+would be identical to just reset, while the test would take longer to execute.
+If neither were used, the DUT would show different performance in subsequent
+trials, violating the assumptions of the search algorithms.
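+
+The decision logic above can be summarized by the following sketch
+(hypothetical function and variable names; the real logic lives in CSIT
+configuration keywords):
+
+```python
+def choose_state_handling(ramp_up_rate, feature_resetter):
+    """Return (resetter, use_ramp_up) for a stateful DUT configuration.
+
+    Hypothetical helper mirroring the rule: if a ramp-up rate test
+    variable is present, rely on ramp-up trials and leave the resetter
+    unset; otherwise register the feature-specific resetter so search
+    algorithms reset DUT state before every trial measurement.
+    """
+    if ramp_up_rate is not None:
+        return None, True           # ramp-up trials keep the state populated
+    return feature_resetter, False  # state is reset before each trial instead
+```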
+
+## DUT versus protocol ramp-up
+
+There are at least three different causes for bandwidth possibly increasing
+within a single measurement trial.
+
+The first is the DUT switching from the state modification phase to the
+constant phase; this is the primary focus of this document.
+Using ramp-up traffic before the main trials eliminates this cause
+for tests wishing to measure the performance of the next (constant) phase.
+Using size-limited profiles eliminates the next phase
+for tests wishing to measure the performance of this (state modification) phase.
+
+The second is a protocol such as TCP ramping up its throughput to utilize
+the available bandwidth. This is the original meaning of "ramp up"
+in the NGFW draft (see above).
+In existing tests we are not using this meaning of TCP ramp-up.
+Instead we use only small transactions, and a large enough initial window
+so that TCP acts as if already ramped up.
+
+The third is TCP increasing its offered load due to retransmissions triggered by
+packet loss. In CSIT we again try to avoid this behavior
+by transferring small enough amounts of data, so that overlap of multiple
+transactions (the primary cause of packet loss) is unlikely.
+But in MRR tests, packet loss and a non-constant offered load are still expected.
+
+# Stateless DUT configurations
+
+These are simple configurations, which do not set any resetter value
+(even if a ramp-up duration is not configured).
+The majority of existing tests are of this type, using continuous traffic profiles.
+
+In order to identify the limits of TRex performance,
+we have added suites with a stateless DUT configuration (VPP ip4base)
+subjected to size-limited ASTF traffic.
+The discovered rates serve as a basis of comparison
+for evaluating the results for stateful DUT configurations (VPP NAT44ED)
+subjected to the same traffic profiles.
+
+# DUT versus TG state
+
+Traffic Generator profiles can be stateful (ASTF) or stateless (STL).
+DUT configuration can be stateful or stateless (with respect to packet traffic).
+
+In CSIT we currently use all four possible configurations:
+
+- Regular stateless VPP tests use stateless traffic profiles.
+
+- A stateless VPP configuration with a stateful profile is used as a basis for
+  comparison.
+
+- Some stateful DUT configurations (NAT44DET, NAT44ED unidirectional)
+  are tested using stateless traffic profiles and continuous traffic.
+
+- The rest of the stateful DUT configurations (NAT44ED bidirectional)
+  are tested using stateful traffic profiles and size-limited traffic.
diff --git a/docs/content/methodology/overview/multi_core_speedup.md b/docs/content/methodology/overview/multi_core_speedup.md
new file mode 100644
index 0000000000..f438e8e996
--- /dev/null
+++ b/docs/content/methodology/overview/multi_core_speedup.md
@@ -0,0 +1,51 @@
+---
+title: "Multi-Core Speedup"
+weight: 3
+---
+
+# Multi-Core Speedup
+
+All performance tests are executed in single physical core and
+multiple physical core scenarios.
+
+## Intel Hyper-Threading (HT)
+
+Intel Xeon processors used in FD.io CSIT can operate either in HT
+Disabled mode (a single logical core per physical core) or in HT
+Enabled mode (two logical cores per physical core). The HT setting is
+applied in the BIOS and requires a server SUT reboot to take effect,
+making it impractical to change the HT mode of operation frequently.
+
+Performance tests are executed with the server SUTs' Intel Xeon processors
+configured with Intel Hyper-Threading enabled on all Xeon
+Cascadelake and Xeon Icelake testbeds.
+
+## Multi-core Tests
+
+Multi-core tests are executed in the following VPP worker thread and physical
+core configurations:
+
+1. Intel Xeon Icelake and Cascadelake testbeds (2n-icx, 3n-icx, 2n-clx)
+   with Intel HT enabled (2 logical CPU cores per physical core):
+
+ 1. 2t1c - 2 VPP worker threads on 1 physical core.
+ 2. 4t2c - 4 VPP worker threads on 2 physical cores.
+ 3. 8t4c - 8 VPP worker threads on 4 physical cores.
+
+VPP worker threads are the data plane threads running on isolated
+logical cores. With Intel HT enabled, VPP workers are placed as sibling
+threads on each used physical core. VPP control threads (main, stats)
+run on a separate non-isolated core together with other Linux
+processes.
+
+In all CSIT tests, care is taken to ensure that each VPP worker handles
+the same amount of received packet load and does the same amount of
+packet processing work. This is achieved by evenly distributing the receive
+queues of each interface type (e.g. physical, virtual) over VPP workers
+using the default VPP round-robin mapping, and by loading these queues with
+the same number of packet flows.
+
+If the number of VPP workers is higher than the number of physical or virtual
+interfaces, multiple receive queues are configured on each interface.
+NIC Receive Side Scaling (RSS) for physical interfaces and multi-queue
+for virtual interfaces are used for this purpose.
diff --git a/docs/content/methodology/overview/per_thread_resources.md b/docs/content/methodology/overview/per_thread_resources.md
new file mode 100644
index 0000000000..c23efb50bd
--- /dev/null
+++ b/docs/content/methodology/overview/per_thread_resources.md
@@ -0,0 +1,101 @@
+---
+title: "Per Thread Resources"
+weight: 2
+---
+
+# Per Thread Resources
+
+The CSIT test framework manages the mapping of the following resources per thread:
+
+1. Cores: physical cores (pcores) allocated as pairs of sibling logical cores
+   (lcores) if the server is in HyperThreading/SMT mode, or as single lcores
+   if the server is not in HyperThreading/SMT mode. Note that if the server's
+   processors are running in HyperThreading/SMT mode, sibling lcores are
+   always used.
+2. Receive Queues (RxQ): packet receive queues allocated on each
+   physical and logical interface tested.
+3. Transmit Queues (TxQ): packet transmit queues allocated on each
+   physical and logical interface tested.
+
+The approach to mapping per-thread resources depends on the application/DUT
+tested (VPP or DPDK apps) and the associated thread types, as follows:
+
+1. Data-plane workers, used for data-plane packet processing, when no
+   feature workers are present.
+
+   - Cores: data-plane workers are typically tested in 1, 2 and 4 pcore
+     configurations, running on a single lcore per pcore or on sibling
+     lcores per pcore. The result is a set of {T}t{C}c thread-core
+     configurations, where {T} stands for the total number of threads
+     (lcores), and {C} for the total number of pcores. Tested
+     configurations are encoded in CSIT test case names,
+     e.g. "1c", "2c", "4c", and test tags "2T1C" (or "1T1C"), "4T2C"
+     (or "2T2C"), "8T4C" (or "4T4C"); a naming sketch follows this list.
+   - Interface Receive Queues (RxQ): as of the CSIT-2106 release, the
+     number of RxQs used on each physical or virtual interface is equal
+     to the number of data-plane workers. In other words, each worker has
+     a dedicated RxQ on each interface tested. This ensures that the
+     packet processing load is equal for each worker, subject to RSS flow
+     load balancing efficacy. Note: before CSIT-2106 the total number of
+     RxQs across all interfaces of a specific type was equal to the
+     number of data-plane workers.
+ - Interface Transmit Queues (TxQ): number of TxQs used on each
+ physical or virtual interface is equal to the number of data-plane
+ workers. In other words each worker has a dedicated TxQ on each
+ interface tested.
+ - Applies to VPP and DPDK Testpmd and L3Fwd.
+
+2. Data-plane and feature workers (e.g. IPsec async crypto workers), the
+ latter dedicated to specific feature processing.
+
+ - Cores: data-plane and feature workers are tested in 2, 3 and 4
+     pcore configurations, running on a single lcore per pcore or on
+     sibling lcores per pcore. This results in two sets of
+ thread-core combinations separated by "-", {T}t{C}c-{T}t{C}c, with
+ the leading set denoting total number of threads (lcores) and
+ pcores used for data-plane workers, and the trailing set denoting
+ total number of lcores and pcores used for feature workers.
+ Accordingly, tested configurations are encoded in CSIT test case
+ names, e.g. "1c-1c", "1c-2c", "1c-3c", and test tags "2T1C_2T1C"
+ (or "1T1C_1T1C"), "2T1C_4T2C" (or "1T1C_2T2C"), "2T1C_6T3C"
+ (or "1T1C_3T3C").
+ - RxQ and TxQ: no RxQs and no TxQs are used by feature workers.
+ - Applies to VPP only.
+
+3. Management/main worker, control plane and management.
+
+ - Cores: single lcore.
+ - RxQ: not used (VPP default behaviour).
+ - TxQ: single TxQ per interface, allocated but not used (VPP default
+ behaviour).
+ - Applies to VPP only.
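+
+A minimal sketch of how the {T}t{C}c names follow from the pcore count and
+the SMT setting (hypothetical helper, for illustration only):
+
+```python
+def thread_core_tag(pcores: int, smt_enabled: bool) -> str:
+    """Return the {T}t{C}c name for a data-plane worker configuration."""
+    # With SMT/HyperThreading enabled, two sibling lcores run per pcore.
+    threads = pcores * (2 if smt_enabled else 1)
+    return f"{threads}t{pcores}c"
+
+print(thread_core_tag(1, True))   # 2t1c
+print(thread_core_tag(4, False))  # 4t4c
+```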
+
+## VPP Thread Configuration
+
+Mapping of cores and RxQs to VPP data-plane worker threads is done in
+the VPP startup.conf during test suite setup:
+
+1. `corelist-workers <list_of_cores>`: List of logical cores to run VPP
+   data-plane workers and feature workers. The actual lcore
+   allocation depends on the HyperThreading/SMT server configuration and
+   the per-test core configuration.
+
+ - For tests without feature workers, by default, all CPU cores
+ configured in startup.conf are used for data-plane workers.
+ - For tests with feature workers, CSIT code distributes lcores across
+ data-plane and feature workers.
+
+2. `num-rx-queues <value>`: Number of Rx queues used per interface.
+
+Mapping of TxQs to VPP data-plane worker threads uses the default VPP
+setting of one TxQ per interface per data-plane worker.
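+
+A hypothetical startup.conf fragment illustrating these settings for a 4t2c
+configuration with SMT enabled (lcore numbers, PCI addresses and the exact
+placement of the queue setting are illustrative, not the generated CSIT
+configuration):
+
+```
+cpu {
+  main-core 1
+  # 4 worker lcores: 2 pcores, each contributing a pair of sibling lcores.
+  corelist-workers 2,3,66,67
+}
+dpdk {
+  dev 0000:18:00.0 {
+    # One RxQ per data-plane worker on each tested interface.
+    num-rx-queues 4
+  }
+  dev 0000:18:00.1 {
+    num-rx-queues 4
+  }
+}
+```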
+
+## DPDK Thread Configuration
+
+Mapping of cores and RxQs to DPDK Testpmd/L3Fwd data-plane worker
+threads is done in the startup CLI:
+
+1. `-l <list_of_cores>` - List of logical cores to run DPDK
+ application.
+2. `nb-cores=<N>` - Number of forwarding cores.
+3. `rxq=<N>` - Number of Rx queues used per interface.
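+
+A hypothetical testpmd invocation combining these options (core numbers and
+PCI addresses are placeholders; the binary name differs between DPDK
+versions):
+
+```
+# One main lcore plus two forwarding lcores, two Rx queues per interface.
+dpdk-testpmd -l 1,2,3 -a 0000:18:00.0 -a 0000:18:00.1 -- \
+  --nb-cores=2 --rxq=2 --txq=2 --forward-mode=io
+```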
diff --git a/docs/content/methodology/overview/terminology.md b/docs/content/methodology/overview/terminology.md
new file mode 100644
index 0000000000..c9115e9291
--- /dev/null
+++ b/docs/content/methodology/overview/terminology.md
@@ -0,0 +1,97 @@
+---
+title: "Terminology"
+weight: 1
+---
+
+# Terminology
+
+- **Frame size**: size of an Ethernet Layer-2 frame on the wire, including
+ any VLAN tags (dot1q, dot1ad) and Ethernet FCS, but excluding Ethernet
+ preamble and inter-frame gap. Measured in Bytes.
+
+- **Packet size**: same as frame size, both terms used interchangeably.
+
+- **Inner L2 size**: for tunneled L2 frames only, size of an encapsulated
+ Ethernet Layer-2 frame, preceded with tunnel header, and followed by
+ tunnel trailer. Measured in Bytes.
+
+- **Inner IP size**: for tunneled IP packets only, size of an encapsulated
+ IPv4 or IPv6 packet, preceded with tunnel header, and followed by
+ tunnel trailer. Measured in Bytes.
+
+- **Device Under Test (DUT)**: In software networking, "device" denotes a
+  specific piece of software tasked with packet processing. Such a device
+  is surrounded by other software components (such as the operating system
+  kernel). It is not possible to run devices without also running the
+  other components, and hardware resources are shared between both. For
+  purposes of testing, the whole set of hardware and software components
+  is called "System Under Test" (SUT). As the SUT is the part of the whole
+  test setup whose performance can be measured with RFC2544 methods,
+  we use SUT where RFC2544 would use DUT. The device under test
+  (DUT) can be re-introduced when analyzing test results using whitebox
+ techniques, but this document sticks to blackbox testing.
+
+- **System Under Test (SUT)**: System under test (SUT) is a part of the
+ whole test setup whose performance is to be benchmarked. The complete
+ methodology contains other parts, whose performance is either already
+ established, or not affecting the benchmarking result.
+
+- **Bi-directional throughput tests**: involve packets/frames flowing in
+ both east-west and west-east directions over every tested interface of
+ SUT/DUT. Packet flow metrics are measured per direction, and can be
+ reported as aggregate for both directions (i.e. throughput) and/or
+ separately for each measured direction (i.e. latency). In most cases
+ bi-directional tests use the same (symmetric) load in both directions.
+
+- **Uni-directional throughput tests**: involve packets/frames flowing in
+ only one direction, i.e. either east-west or west-east direction, over
+ every tested interface of SUT/DUT. Packet flow metrics are measured
+ and are reported for measured direction.
+
+- **Packet Loss Ratio (PLR)**: ratio of packets lost (not received) relative
+  to packets transmitted over the test trial duration, calculated using the
+  formula: PLR = ( pkts_transmitted - pkts_received ) / pkts_transmitted.
+  For bi-directional throughput tests aggregate PLR is calculated based
+  on the aggregate number of packets transmitted and received
+  (a worked example follows this list).
+
+- **Packet Throughput Rate**: maximum packet offered load DUT/SUT forwards
+ within the specified Packet Loss Ratio (PLR). In many cases the rate
+ depends on the frame size processed by DUT/SUT. Hence packet
+ throughput rate MUST be quoted with specific frame size as received by
+ DUT/SUT during the measurement. For bi-directional tests, packet
+ throughput rate should be reported as aggregate for both directions.
+ Measured in packets-per-second (pps) or frames-per-second (fps),
+ equivalent metrics.
+
+- **Bandwidth Throughput Rate**: a secondary metric calculated from packet
+ throughput rate using formula: bw_rate = pkt_rate * (frame_size +
+ L1_overhead) * 8, where L1_overhead for Ethernet includes preamble (8
+ Bytes) and inter-frame gap (12 Bytes). For bi-directional tests,
+ bandwidth throughput rate should be reported as aggregate for both
+ directions. Expressed in bits-per-second (bps).
+
+- **Non Drop Rate (NDR)**: maximum packet/bandwidth throughput rate sustained
+  by DUT/SUT at PLR equal to zero (zero packet loss), specific to the tested
+ frame size(s). MUST be quoted with specific packet size as received by
+ DUT/SUT during the measurement. Packet NDR measured in
+ packets-per-second (or fps), bandwidth NDR expressed in
+ bits-per-second (bps).
+
+- **Partial Drop Rate (PDR)**: maximum packet/bandwidth throughput rate
+ sustained by DUT/SUT at PLR greater than zero (non-zero packet loss)
+ specific to tested frame size(s). MUST be quoted with specific packet
+ size as received by DUT/SUT during the measurement. Packet PDR
+ measured in packets-per-second (or fps), bandwidth PDR expressed in
+ bits-per-second (bps).
+
+- **Maximum Receive Rate (MRR)**: packet/bandwidth rate regardless of PLR
+ sustained by DUT/SUT under specified Maximum Transmit Rate (MTR)
+ packet load offered by traffic generator. MUST be quoted with both
+ specific packet size and MTR as received by DUT/SUT during the
+ measurement. Packet MRR measured in packets-per-second (or fps),
+ bandwidth MRR expressed in bits-per-second (bps).
+
+- **Trial**: a single measurement step.
+
+- **Trial duration**: amount of time over which packets are transmitted and
+ received in a single measurement step.
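+
+A worked example of the PLR and Bandwidth Throughput Rate formulas above,
+using illustrative numbers only (not measured results):
+
+```python
+# Packet Loss Ratio for an example trial.
+pkts_transmitted = 10_000_000
+pkts_received = 9_990_000
+plr = (pkts_transmitted - pkts_received) / pkts_transmitted
+print(plr)      # 0.001
+
+# Bandwidth throughput rate for 64B frames at 10GbE line rate.
+frame_size = 64         # Bytes, L2 frame including FCS
+l1_overhead = 8 + 12    # Ethernet preamble + inter-frame gap, Bytes
+pkt_rate = 14_880_952   # pps
+bw_rate = pkt_rate * (frame_size + l1_overhead) * 8
+print(bw_rate)  # 9999999744 bps, i.e. ~10 Gbps
+```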
diff --git a/docs/content/methodology/overview/trex_traffic_generator.md b/docs/content/methodology/overview/trex_traffic_generator.md
new file mode 100644
index 0000000000..8771bf9780
--- /dev/null
+++ b/docs/content/methodology/overview/trex_traffic_generator.md
@@ -0,0 +1,195 @@
+---
+title: "TRex Traffic Generator"
+weight: 6
+---
+
+# TRex Traffic Generator
+
+## Usage
+
+[TRex traffic generator](https://trex-tgn.cisco.com) is used for the majority of
+CSIT performance tests. TRex is used in multiple types of performance tests,
+see [Data Plane Throughput]({{< ref "../measurements/data_plane_throughput/data_plane_throughput/#Data Plane Throughput" >}})
+for more details.
+
+## Traffic modes
+
+TRex is primarily used in two (mutually incompatible) modes.
+
+### Stateless mode
+
+Sometimes abbreviated as STL.
+A mode with high performance, which is unable to react to incoming traffic.
+We use this mode whenever it is possible.
+A typical test where this mode is not applicable is NAT44ED,
+as the DUT does not assign deterministic outside address+port combinations,
+so we are unable to create traffic that does not lose packets
+in the out2in direction.
+
+Measurement results are based on simple L2 counters
+(opackets, ipackets) for each traffic direction.
+
+### Stateful mode
+
+A mode capable of reacting to incoming traffic.
+Contrary to the stateless mode, only UDP and TCP are supported
+(carried over IPv4 or IPv6 packets).
+Performance is limited, as TRex needs to do more CPU processing.
+TRex supports two subtypes of stateful traffic;
+CSIT uses ASTF (Advanced STateFul mode).
+
+This mode is suitable for NAT44ED tests, as clients send packets from inside,
+and servers react to them, so they see the outside address and port to respond to.
+Also, they do not send traffic before NAT44ED has created the corresponding
+translation entry.
+
+When possible, L2 counters (opackets, ipackets) are used.
+Some tests need L7 counters, which track protocol state (e.g. TCP),
+but those values are less reliable at high loads.
+
+## Traffic Continuity
+
+Generated traffic is either continuous, or limited (by number of transactions).
+Both modes support both continuities in principle.
+
+### Continuous traffic
+
+Traffic is started without any data size goal.
+Traffic is ended based on time duration, as hinted by the search algorithm.
+This is useful when DUT behavior does not depend on the traffic duration.
+This is the default for the stateless mode.
+
+### Limited traffic
+
+Traffic has a defined data size goal (given as a number of transactions);
+the duration is computed based on this goal.
+Traffic is ended when the size goal is reached,
+or when the computed duration is reached.
+This is useful when DUT behavior depends on traffic size,
+e.g. a target number of NAT translation entries, each to be hit exactly once
+per direction.
+This is used mainly for the stateful mode.
+
+## Traffic synchronicity
+
+Traffic can be generated synchronously (test waits for duration)
+or asynchronously (test operates during traffic and stops traffic explicitly).
+
+### Synchronous traffic
+
+The trial measurement is driven by a given (or precomputed) duration,
+with no activity from the test driver during the traffic.
+Used for most trials.
+
+### Asynchronous traffic
+
+Traffic is started, but then the test driver is free to perform
+other actions, before stopping the traffic explicitly.
+This is used mainly by reconf tests, but also by some trials
+used for runtime telemetry.
+
+## Traffic profiles
+
+TRex supports several ways to define the traffic.
+CSIT uses small Python modules based on Scapy as definitions.
+Details of traffic profiles depend on modes (STL or ASTF),
+but some are common for both modes.
+
+Search algorithms are intentionally unaware of the traffic mode used,
+so CSIT defines some terms to use instead of mode-specific TRex terms.
+
+### Transactions
+
+A TRex traffic profile defines a small number of behaviors,
+called transaction templates in CSIT. Traffic profiles also instruct
+TRex how to create a large number of transactions based on the templates.
+
+Continuous traffic loops over the generated transactions.
+Limited traffic usually executes each transaction once
+(typically as a constant number of loops over source addresses,
+each loop with different source ports).
+
+Currently, ASTF profiles define one transaction template each.
+The number of packets expected per transaction varies based on profile details,
+as does the criterion for when a transaction is considered successful.
+
+Stateless transactions are just one packet (sent from one TG port,
+successful if received on the other TG port).
+Thus unidirectional stateless profiles define one transaction template,
+while bidirectional stateless profiles define two transaction templates.
+
+### TPS multiplier
+
+TRex aims to open the transactions specified by the profile at a steady rate.
+While TRex allows the transaction template to define its intended "cps" value,
+CSIT does not specify it, so the default value of 1 is applied,
+meaning TRex will open one transaction per second (and per transaction template)
+by default. But the CSIT invocation uses the "multiplier" (mult) argument
+when starting the traffic, which multiplies the cps value,
+meaning it acts as a TPS (transactions per second) input.
+
+With a slight abuse of nomenclature, bidirectional stateless tests
+set the "packets per transaction" value to 2, just to keep the TPS semantics
+as a unidirectional input value.
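+
+A minimal sketch of starting ASTF traffic with a TPS multiplier (hypothetical
+server address, profile path and values; the exact module path and arguments
+depend on the installed TRex version, and the real CSIT driver wraps this in
+its own scripts):
+
+```python
+from trex.astf.api import ASTFClient
+
+client = ASTFClient(server="127.0.0.1")
+client.connect()
+client.reset()
+client.load_profile("profiles/udp_cps.py")  # placeholder ASTF profile
+
+# Template cps defaults to 1, so mult acts as transactions per second (TPS).
+client.start(mult=100_000, duration=30)
+client.wait_on_traffic()
+stats = client.get_stats()
+client.disconnect()
+```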
+
+### Duration stretching
+
+TRex can be IO-bound, CPU-bound, or have any other reason
+why it is not able to generate the traffic at the requested TPS.
+Some conditions are detected, leading to TRex failure,
+for example when the bandwidth does not fit into the line capacity.
+But many reasons are not detected.
+
+Unfortunately, TRex frequently reacts by not honoring the duration
+in synchronous mode, taking longer to send the traffic,
+leading to a lower than requested load offered to the DUT.
+This usually breaks assumptions used in search algorithms,
+so it has to be avoided.
+
+For stateless traffic, the behavior is quite deterministic,
+so the workaround is to apply a fictional TPS limit (max_rate)
+to search algorithms, usually depending only on the NIC used.
+
+For stateful traffic the behavior is not deterministic enough;
+for example, the limit for TCP traffic depends on DUT packet loss.
+In CSIT we decided to use logic similar to asynchronous traffic.
+The traffic driver sleeps for a time, then stops the traffic explicitly.
+The library that parses counters into measurement results
+then usually treats unsent packets/transactions as lost/failed.
+
+We have added an ip4base test for every NAT44ED test,
+so that users can compare results.
+If the results are very similar, it is probable that TRex was the bottleneck.
+
+### Startup delay
+
+By investigating TRex behavior, it was found that TRex does not start
+the traffic in ASTF mode immediately. There is a delay of zero traffic,
+after which the traffic rate ramps up to the defined TPS value.
+
+It is possible to poll for counters during the traffic
+(the first nonzero value means the traffic has started),
+but that was found to influence the NDR results.
+
+Thus the "sleep and stop" strategy is used, which needs a correction
+to the computed duration so that traffic is stopped after the intended
+duration of real traffic. Luckily, it turns out this correction
+does not depend on the traffic profile nor on the CPU used by TRex,
+so a fixed constant (0.112 seconds) works well.
+Unfortunately, the constant may depend on the TRex version,
+or on the execution environment (e.g. TRex in AWS).
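+
+A sketch of the resulting sleep-and-stop timing (the 0.112 s constant is the
+one mentioned above; the helper and client API shown are hypothetical
+simplifications of the CSIT traffic driver):
+
+```python
+import time
+
+ASTF_STARTUP_DELAY = 0.112  # seconds, empirically determined constant
+
+def run_astf_trial(client, mult, intended_duration):
+    """Sleep-and-stop driver keeping real traffic close to intended_duration."""
+    client.start(mult=mult, duration=-1)  # no TRex-side limit, stopped below
+    time.sleep(intended_duration + ASTF_STARTUP_DELAY)
+    client.stop()
+    return client.get_stats()
+```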
+
+The result computations need a precise enough duration of the real traffic;
+luckily, the server side of TRex has a precise enough counter for that.
+
+It is unknown whether stateless traffic profiles also exhibit a startup delay.
+Unfortunately, the stateless mode does not have a similarly precise duration
+counter, so some results (mostly MRR) are affected by the less precise
+duration measurement in the Python part of the CSIT code.
+
+## Measuring Latency
+
+If measurement of latency is requested, two more packet streams are
+created (one for each direction) with the TRex flow_stats parameter set to
+STLFlowLatencyStats. In that case, the returned statistics will also include
+min/avg/max latency values and encoded HDRHistogram data.
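+
+A minimal sketch of adding such latency streams via the TRex STL Python API
+(addresses, rate and pg_id are placeholders; the exact module path depends on
+the TRex version):
+
+```python
+from trex.stl.api import (STLClient, STLFlowLatencyStats, STLPktBuilder,
+                          STLStream, STLTXCont)
+from scapy.layers.inet import IP, UDP
+from scapy.layers.l2 import Ether
+
+pkt = STLPktBuilder(pkt=Ether() / IP(src="10.0.0.1", dst="20.0.0.1") / UDP())
+lat_stream = STLStream(
+    packet=pkt,
+    mode=STLTXCont(pps=9000),                 # fixed low-rate latency stream
+    flow_stats=STLFlowLatencyStats(pg_id=0),  # per-group latency tracking
+)
+
+client = STLClient(server="127.0.0.1")
+client.connect()
+client.reset(ports=[0, 1])
+client.add_streams(lat_stream, ports=[0])
+client.start(ports=[0], duration=10)
+client.wait_on_traffic(ports=[0])
+stats = client.get_stats()
+# Latency summary (min/avg/max and histogram) is reported per pg_id,
+# typically under stats["latency"][0].
+client.disconnect()
+```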
diff --git a/docs/content/methodology/overview/vpp_forwarding_modes.md b/docs/content/methodology/overview/vpp_forwarding_modes.md
new file mode 100644
index 0000000000..b3c3bba984
--- /dev/null
+++ b/docs/content/methodology/overview/vpp_forwarding_modes.md
@@ -0,0 +1,104 @@
+---
+title: "VPP Forwarding Modes"
+weight: 4
+---
+
+# VPP Forwarding Modes
+
+VPP is tested in a number of L2, IPv4 and IPv6 packet lookup and forwarding
+modes. Within each mode, baseline and scale tests are executed, the latter
+with a varying number of FIB entries.
+
+## L2 Ethernet Switching
+
+VPP is tested in three L2 forwarding modes:
+
+- *l2patch*: L2 patch, the fastest point-to-point L2 path that loops
+ packets between two interfaces without any Ethernet frame checks or
+ lookups.
+- *l2xc*: L2 cross-connect, point-to-point L2 path with all Ethernet
+ frame checks, but no MAC learning and no MAC lookup.
+- *l2bd*: L2 bridge-domain, multipoint-to-multipoint L2 path with all
+ Ethernet frame checks, with MAC learning (unless static MACs are used)
+ and MAC lookup.
+
+l2bd tests are executed in baseline and scale configurations:
+
+- *l2bdbase*: Two MAC FIB entries are learned by VPP to enable packet
+ switching between two interfaces in two directions. VPP L2 switching
+ is tested with 254 IPv4 unique flows per direction, varying IPv4
+ source address per flow in order to invoke RSS based packet
+ distribution across VPP workers. The same source and destination MAC
+ address is used for all flows per direction. IPv4 source address is
+ incremented for every packet.
+
+- *l2bdscale*: A high number of MAC FIB entries are learned by VPP to
+ enable packet switching between two interfaces in two directions.
+ Tested MAC FIB sizes include: i) 10k with 5k unique flows per
+ direction, ii) 100k with 2 x 50k flows and iii) 1M with 2 x 500k
+ flows. Unique flows are created by using distinct source and
+ destination MAC addresses that are changed for every packet using
+ incremental ordering, making VPP learn (or refresh) distinct src MAC
+ entries and look up distinct dst MAC entries for every packet. For
+ details, see
+ [Packet Flow Ordering]({{< ref "packet_flow_ordering#Packet Flow Ordering" >}}).
+
+Ethernet wire encapsulations tested include: untagged, dot1q, dot1ad.
+
+## IPv4 Routing
+
+IPv4 routing tests are executed in baseline and scale configurations:
+
+- *ip4base*: Two /32 IPv4 FIB entries are configured in VPP to enable
+  packet routing between two interfaces in two directions. VPP routing
+  is tested with 253 IPv4 unique flows per direction, varying the IPv4
+  source address per flow in order to invoke RSS based packet
+  distribution across VPP workers. The IPv4 source address is incremented
+  for every packet (a configuration sketch follows this list).
+
+- *ip4scale*: A high number of /32 IPv4 FIB entries are configured in
+ VPP. Tested IPv4 FIB sizes include: i) 20k with 10k unique flows per
+ direction, ii) 200k with 2 * 100k flows and iii) 2M with 2 * 1M
+ flows. Unique flows are created by using distinct IPv4 destination
+ addresses that are changed for every packet, using incremental or
+ random ordering. For details, see
+ [Packet Flow Ordering]({{< ref "packet_flow_ordering#Packet Flow Ordering" >}}).
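+
+For illustration, the two /32 routes of an ip4base-like setup could be
+configured with VPP CLI commands similar to the sketch below (interface
+names and addresses are placeholders, not the exact CSIT configuration):
+
+```
+set interface state TenGigabitEthernet18/0/0 up
+set interface state TenGigabitEthernet18/0/1 up
+set interface ip address TenGigabitEthernet18/0/0 10.10.10.2/24
+set interface ip address TenGigabitEthernet18/0/1 20.20.20.2/24
+ip route add 10.0.0.1/32 via 10.10.10.1 TenGigabitEthernet18/0/0
+ip route add 20.0.0.1/32 via 20.20.20.1 TenGigabitEthernet18/0/1
+```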
+
+## IPv6 Routing
+
+Similarly to IPv4, IPv6 routing tests are executed in baseline and scale
+configurations:
+
+- *ip6base*: Two /128 IPv6 FIB entries are configured in VPP to enable
+ packet routing between two interfaces in two directions. VPP routing
+ is tested with 253 IPv6 unique flows per direction, varying IPv6
+ source address per flow in order to invoke RSS based packet
+ distribution across VPP workers. IPv6 source address is incremented
+ for every packet.
+
+- *ip6scale*: A high number of /128 IPv6 FIB entries are configured in
+ VPP. Tested IPv6 FIB sizes include: i) 20k with 10k unique flows per
+ direction, ii) 200k with 2 * 100k flows and iii) 2M with 2 * 1M
+ flows. Unique flows are created by using distinct IPv6 destination
+ addresses that are changed for every packet, using incremental or
+ random ordering. For details, see
+ [Packet Flow Ordering]({{< ref "packet_flow_ordering#Packet Flow Ordering" >}}).
+
+## SRv6 Routing
+
+SRv6 routing tests are executed in a number of baseline configurations;
+in each case an SR policy and a steering policy are configured for one
+direction, and one (or two) SR behaviours (functions) for the other
+direction:
+
+- *srv6enc1sid*: One SID (no SRH present), one SR function - End.
+- *srv6enc2sids*: Two SIDs (SRH present), two SR functions - End and
+ End.DX6.
+- *srv6enc2sids-nodecaps*: Two SIDs (SRH present) without decapsulation,
+ one SR function - End.
+- *srv6proxy-dyn*: Dynamic SRv6 proxy, one SR function - End.AD.
+- *srv6proxy-masq*: Masquerading SRv6 proxy, one SR function - End.AM.
+- *srv6proxy-stat*: Static SRv6 proxy, one SR function - End.AS.
+
+In all listed cases a low number of IPv6 flows (253 per direction) is
+routed by VPP.