aboutsummaryrefslogtreecommitdiffstats
path: root/docs/report/vpp_performance_tests/mdr_search.rst
diff options
context:
space:
mode:
Diffstat (limited to 'docs/report/vpp_performance_tests/mdr_search.rst')
-rw-r--r--docs/report/vpp_performance_tests/mdr_search.rst408
1 files changed, 408 insertions, 0 deletions
diff --git a/docs/report/vpp_performance_tests/mdr_search.rst b/docs/report/vpp_performance_tests/mdr_search.rst
new file mode 100644
index 0000000000..374d9e9c97
--- /dev/null
+++ b/docs/report/vpp_performance_tests/mdr_search.rst
@@ -0,0 +1,408 @@
+Experimental: MDR Search
+========================
+
+Multiple Drop Rate (MDR) Search is a new search algorithm implemented in
+FD.io CSIT project. MDR discovers multiple packet throughput rates in a
+single search, with each rate associated with a distinct Packet Loss
+Ratio (PLR) criteria.
+
+Two throughput measurements used in FD.io CSIT are Non-Drop Rate (NDR,
+with zero packet loss, PLR=0) and Partial Drop Rate (PDR, with packet
+loss rate not greater than the configured non-zero PLR). MDR search
+discovers NDR and PDR in a single pass reducing required execution time
+compared to separate binary searches for NDR and PDR. MDR reduces
+execution time even further by relying on shorter trial durations
+of intermediate steps, with only the final measurements
+conducted at the specified final trial duration.
+This results in the shorter overall search
+execution time when compared to a standard NDR/PDR binary search,
+while guaranteeing the same or similar results.
+
+If needed MDR can be easily adopted to discover more throughput rates
+with different pre-defined PLRs.
+
+.. Note:: All throughput rates are *always* bi-directional
+ aggregates of two equal (symmetric) uni-directional packet rates
+ received and reported by an external traffic generator.
+
+Overview
+---------
+
+The main properties of MDR search:
+
+- MDR is a duration aware multi-phase multi-rate search algorithm.
+
+ - Initial phase determines promising starting interval for the search.
+ - Intermediate phases progress towards defined final search criteria.
+ - Final phase executes measurements according to the final search
+ criteria.
+
+- Initial phase:
+
+ - Uses link rate as a starting transmit rate and discovers the Maximum
+ Receive Rate (MRR) used as an input to the first intermediate phase.
+
+- Intermediate phases:
+
+ - Start with initial trial duration (in the first phase) and converge
+ geometrically towards the final trial duration (in the final phase).
+ - Track two values for NDR and two for PDR.
+
+ - The values are called (NDR or PDR) lower_bound and upper_bound.
+ - Each value comes from a specific trial measurement
+ (most recent for that transmit rate),
+ and as such the value is associated with that measurement's duration and loss.
+ - A bound can be invalid, for example if NDR lower_bound
+ has been measured with nonzero loss.
+ - Invalid bounds are not real boundaries for the searched value,
+ but are needed to track interval widths.
+ - Valid bounds are real boundaries for the searched value.
+ - Each non-initial phase ends with all bounds valid.
+
+ - Start with a large (lower_bound, upper_bound) interval width and
+ geometrically converge towards the width goal (measurement resolution)
+ of the phase. Each phase halves the previous width goal.
+ - Use internal and external searches:
+
+ - External search - measures at transmit rates outside the (lower_bound,
+ upper_bound) interval. Activated when a bound is invalid,
+ to search for a new valid bound by doubling the interval width.
+ It is a variant of `exponential search`_.
+ - Internal search - `binary search`_, measures at transmit rates within the
+ (lower_bound, upper_bound) valid interval, halving the interval width.
+
+- Final phase is executed with the final test trial duration, and the final
+ width goal that determines resolution of the overall search.
+ Intermediate phases together with the final phase are called non-initial phases.
+
+The main benefits of MDR search vs. binary search include:
+
+- In general MDR is likely to execute more search trials overall, but
+ less trials at a set final duration.
+- In well behaving cases it greatly reduces (>50%) the overall duration
+ compared to a single PDR (or NDR) binary search duration,
+ while finding multiple drop rates.
+- In all cases MDR yields the same or similar results to binary search.
+- Note: both binary search and MDR are susceptible to reporting
+ non-repeatable results across multiple runs for very bad behaving
+ cases.
+
+Caveats:
+
+- Worst case MDR can take longer than a binary search e.g. in case of
+ drastic changes in behaviour for trials at varying durations.
+
+MDR Search Implementation
+-------------------------
+
+Following is a brief description of the current MDR search
+implementation in FD.io CSIT.
+
+MDR Input Parameters
+````````````````````
+
+#. *maximum_transmit_rate* - maximum packet transmit rate to be used by
+ external traffic generator, limited by either the actual Ethernet
+ link rate or traffic generator NIC model capabilities. Sample
+ defaults: 2 * 14.88 Mpps for 64B 10GE link rate,
+ 2 * 18.75 Mpps for 64B 40GE NIC maximum rate.
+#. *minimum_transmit_rate* - minimum packet transmit rate to be used for
+ measurements. MDR search fails if lower transmit rate needs to be
+ used to meet search criteria. Default: 2 * 10 kpps (could be higher).
+#. *final_trial_duration* - required trial duration for final rate
+ measurements. Default: 30 sec.
+#. *initial_trial_duration* - trial duration for initial MDR phase.
+ Default: 1 sec.
+#. *final_relative_width* - required measurement resolution expressed as
+ (lower_bound, upper_bound) interval width relative to upper_bound.
+ Default: 0.5%.
+#. *packet_loss_ratio* - maximum acceptable PLR search criteria for
+ PDR measurements. Default: 0.5%.
+#. *number_of_intermediate_phases* - number of phases between the initial
+ phase and the final phase. Impacts the overall MDR search duration.
+ Less phases are required for well behaving cases, more phases
+ may be needed to reduce the overall search duration for worse behaving cases.
+ Default (2). (Value chosen based on limited experimentation to date.
+ More experimentation needed to arrive to clearer guidelines.)
+
+Initial phase
+`````````````
+
+1. First trial measures at maximum rate and discovers MRR.
+
+ a) in: trial_duration = initial_trial_duration.
+ b) in: offered_transmit_rate = maximum_transmit_rate.
+ c) do: single trial.
+ d) out: measured loss ratio.
+ e) out: mrr = measured receive rate.
+
+2. Second trial measures at MRR and discovers MRR2.
+
+ a) in: trial_duration = initial_trial_duration.
+ b) in: offered_transmit_rate = MRR.
+ c) do: single trial.
+ d) out: measured loss ratio.
+ e) out: mrr2 = measured receive rate.
+
+3. Third trial measures at MRR2.
+
+ a) in: trial_duration = initial_trial_duration.
+ b) in: offered_transmit_rate = MRR2.
+ c) do: single trial.
+ d) out: measured loss ratio.
+
+Non-initial phases
+``````````````````
+
+1. Main loop:
+
+ a) in: trial_duration for the current phase.
+ Set to initial_trial_duration for the first intermediate phase;
+ to final_trial_duration for the final phase;
+ or to the element of interpolating geometric sequence
+ for other intermediate phases.
+ For example with two intermediate phases, trial_duration
+ of the second intermediate phase is the geometric average
+ of initial_strial_duration and final_trial_duration.
+ b) in: relative_width_goal for the current phase.
+ Set to final_relative_width for the final phase;
+ doubled for each preceding phase.
+ For example with two intermediate phases,
+ the first intermediate phase uses quadruple of final_relative_width
+ and the second intermediate phase uses double of final_relative_width.
+ c) in: ndr_interval, pdr_interval from the previous main loop iteration
+ or the previous phase.
+ If the previous phase is the initial phase, both intervals have
+ lower_bound = MRR2, uper_bound = MRR.
+ Note that the initial phase is likely to create intervals with invalid bounds.
+ d) do: According to the procedure described in point 2,
+ either exit the phase by jumping to g),
+ or prepare new transmit rate to measure with.
+ e) do: Perform the trial measurement at the new transmit rate
+ and trial_duration, compute its loss ratio.
+ f) do: Update the bounds of both intervals, based on the new measurement.
+ The actual update rules are numerous, as NDR external search
+ can affect PDR interval and vice versa, but the result
+ agrees with rules of both internal and external search.
+ For example, any new measurement below an invalid lower_bound
+ becomes the new lower_bound, while the old measurement
+ (previously acting as the invalid lower_bound)
+ becomes a new and valid upper_bound.
+ Go to next iteration c), taking the updated intervals as new input.
+ g) out: current ndr_interval and pdr_interval.
+ In the final phase this is also considered
+ to be the result of the whole search.
+ For other phases, the next phase loop is started
+ with the current results as an input.
+
+2. New transmit rate (or exit) calculation (for 1.d):
+
+ a) If there is an invalid bound then prepare for external search:
+
+ 1) If the most recent measurement at NDR lower_bound transmit rate
+ had the loss higher than zero, then
+ the new transmit rate is NDR lower_bound
+ decreased by two NDR interval widths.
+ 2) Else, if the most recent measurement at PDR lower_bound
+ transmit rate had the loss higher than PLR, then
+ the new transmit rate is PDR lower_bound
+ decreased by two PDR interval widths.
+ 3) Else, if the most recent measurement at NDR upper_bound
+ transmit rate had no loss, then
+ the new transmit rate is NDR upper_bound
+ increased by two NDR interval widths.
+ 4) Else, if the most recent measurement at PDR upper_bound
+ transmit rate had the loss lower or equal to PLR, then
+ the new transmit rate is PDR upper_bound
+ increased by two PDR interval widths.
+
+ b) Else, if NDR (or PDR) interval does not meet the current phase width goal,
+ prepare for internal search. The new transmit rate is
+ (lower bound + upper bound) / 2.
+ It does not matter much which interval is investigated first.
+ The current implementation starts with NDR, unless PDR interval is wider
+ (but always preferring NDR is slightly better).
+
+ c) Else, if some bound has still only been measured at a lower duration,
+ prepare to re-measure at the current duration (and the same transmit rate).
+ The order of priorities is:
+
+ 1) NDR lower_bound,
+ 2) PDR lower_bound,
+ 3) NDR upper_bound,
+ 4) PDR upper_bound.
+
+ d) Else do not prepare any new rate, to exit the phase.
+ This ensures that at the end of each non-initial phase
+ all intervals are valid, narrow enough, and measured
+ at current phase trial duration.
+
+Implementation details
+----------------------
+
+The algorithm as implemented contains additional details
+omitted from the description above.
+Here is a short description of them, without detailing their mutual interaction.
+
+1) Logarithmic transmit rate.
+ In order to better fit the relative width goal,
+ the interval doubling and halving is done differently.
+ For example, middle of 2 and 8 is 4, not 5.
+2) Optimistic maximum rate.
+ The increased rate is never higher than the maximum rate.
+ Upper bound at that rate is always considered valid.
+3) Pessimistic minimum rate.
+ The decreased rate is never lower than the minimum rate.
+ If a lower bound at that rate is invalid,
+ a phase stops refining the interval further (until it gets re-measured).
+4) Conservative interval updates.
+ Measurements above current upper bound never update a valid upper bound,
+ even if drop ratio is low.
+ Measurements below current lower bound always update any lower bound
+ if drop ratio is high.
+5) Ensure sufficient interval width.
+ Narrow intervals make external search take more time to find a valid bound.
+ If the new transmit increased or decreased rate would result in width
+ less than the current goal, increase/decrease more.
+ This can happen if measurement for the other interval
+ makes the current interval too narrow.
+ Similarly, take care the measurements in the initial phase
+ create wide enough interval.
+6) Timeout for bad cases.
+ The worst case for MDR search is when each phase converges to intervals
+ way different than the results of the previous phase.
+ Rather than suffer total search time several times larger
+ than pure binary search, the implemented tests fail themselves
+ when the search takes too long (given by argument *timeout*).
+
+Test effectiveness comparison
+-----------------------------
+
+Introduction
+````````````
+
+CSIT release 1804 contains two test suites that use the new MDR search
+to enable comparison against existing CSIT NDR and PDR binary searches.
+The suites got chosen based on the level of consistency of their
+historical NDR/PDR results:
+
+#. 10Ge2P1X520-Ethip4-Ip4Base-Ndrpdr - yielding very consistent binary
+ search results.
+#. 10Ge2P1X520-Eth-L2Bdbasemaclrn-Eth-2Vhostvr1024-1Vm-Ndrpdr - yielding
+ somewhat inconsistent results.
+
+Here "inconsistent" means the values found differ between runs,
+even though the setup and the test are exactly the same.
+
+The search part of CSIT binary search tests requires a single 5-second warmup
+and each trial measurement is set to 10 seconds.
+
+New tests with MDR search do not have any warmup, as initial measurements
+are not critical to the final result.
+
+Fairness of the following comparison has been achieved
+by setting MDR final relative width to values causing the width to match
+the binary NDR/PDR result.
+Each search algorithm has been run with three different
+(final) trial durations: 10s, 30s and 60s.
+
+The table below compares overall test duration between the search tests.
+For simplicity only data for single thread 64B packet tests is listed,
+as it takes the longest in all cases.
+
+The table is based on result of 6 runs.
+
+Tables
+``````
+
+.. table:: Search part of test duration
+
+ ==================== ========== =========== =========== ========== =========== ===========
+ Duration+-avgdev [s] IP4 10s IP4 30s IP4 60s Vhost 10s Vhost 30s Vhost 60s
+ ==================== ========== =========== =========== ========== =========== ===========
+ MDR (both intervals) 50.8+-1.2 109.0+-10.0 202.8+-11.7 80.5+-9.0 201.9+-20.6 474.9+-58.2
+ NDR binary 98.9+-0.1 278.6+-0.1 548.8+-0.1 119.8+-0.1 339.3+-0.1 669.6+-0.2
+ PDR binary 98.9+-0.1 278.6+-0.1 548.8+-0.1 119.7+-0.1 339.3+-0.1 669.5+-0.1
+ NDR+PDR sum 197.8+-0.1 557.2+-0.2 1097.6+-0.1 239.5+-0.1 678.7+-0.1 1339.2+-0.1
+ ==================== ========== =========== =========== ========== =========== ===========
+
+.. note:: Here "avgdev" is the estimated difference between
+ the average duration computed from the limited sample
+ and a true average duration as its hypothetical limit for infinite samples.
+ To get the usual "standard deviation" of duration, "avgdev" has to be multiplied
+ by the square root of the number of samples.
+ For the subtle details see `estimation of standard deviation`_,
+ we used zero ACF and c4==1.
+
+.. table:: MDR duration as percentage of NDR duration
+
+ ==================================== ========= ========= ========= ========= ========= =========
+ Fraction+-stdev [%] IP4 10s IP4 30s IP4 60s Vhost 10s Vhost 30s Vhost 60s
+ ==================================== ========= ========= ========= ========= ========= =========
+ MDR duration divided by NDR duration 51.4+-1.2 39.1+-3.6 37.0+-2.1 67.2+-7.5 59.5+-6.1 70.9+-8.7
+ ==================================== ========= ========= ========= ========= ========= =========
+
+.. note:: Here "stdev" is standard deviation as computed by the
+ `simplified error propagation formula`_.
+
+Conclusions
+```````````
+
+In consistent tests, MDR is on average more than 50% faster
+than a single NDR binary search (even though MDR also detects PDR).
+
+One exception is 10 second final trial duration,
+where MDR is (only) almost 50% faster than NDR binary search.
+Most probably presence of 2 intermediate phases (instead of just 1) hurts there.
+
+In inconsistent tests MDR is still somewhat faster than NDR binary search,
+but it is not by 50%, and it is hard to quantify as MDR samples have wildly
+varying durations.
+
+Graphical examples
+------------------
+
+The following graphs were created from the data gathered from comparison runs,
+for the vhost tests.
+The vertical axis has always the same values,
+zoomed into the interesting part of the search space.
+The faint blue vertical lines separate the phases of MDR search.
+The bound lines are sloped just to help locate the previous value,
+in reality the bounds are updated instantly at the end of the measurement.
+
+The graphs do not directly show when a particular bound is invalid.
+However this can be gleaned indirectly by identifying
+that the measurement does not satisfy that bound's validity conditions
+(see point 2a).
+Also, the external search follows, and the measurement previously acting
+as and invalid upper or lower bound starts acting instead
+as a valid lower or upper bound, respectively.
+
+The following three graphs are for MDR with 10 second final trial duration,
+showing different behavior in this inconsistent test,
+and different amount of "work" done by each phase.
+Also the horizontal axis has the same scaling here.
+
+.. image:: MDR_10_1.svg
+.. image:: MDR_10_2.svg
+.. image:: MDR_10_3.svg
+
+The next graph is for MDR with 60 second final trial duration,
+to showcase the final phase takes the most of the overall search time.
+The scaling of the horizontal axis is different.
+
+.. image:: MDR_60.svg
+
+Finally, here are two graphs showing NDR and PDR binary searches.
+The trial duration is again 60 seconds,
+but scaling of horizontal axis is once again different.
+This shows the binary search spends most time measuring outside
+the interesting rate region.
+
+.. image:: NDR_60.svg
+.. image:: PDR_60.svg
+
+.. _binary search: https://en.wikipedia.org/wiki/Binary_search
+.. _exponential search: https://en.wikipedia.org/wiki/Exponential_search
+.. _estimation of standard deviation: https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation
+.. _simplified error propagation formula: https://en.wikipedia.org/wiki/Propagation_of_uncertainty#Simplification