From ee28e8ae476c6b0c098cd3895c586316feb4bdb9 Mon Sep 17 00:00:00 2001 From: Vratko Polak Date: Tue, 24 Oct 2023 11:48:29 +0200 Subject: feat(MLRsearch): version 05 of the IETF draft Editing not finished, but we ran out of time. Change-Id: I169c672eec731c5bce083af590be01222f99fb4c Signed-off-by: Vratko Polak --- docs/ietf/draft-ietf-bmwg-mlrsearch-04.md | 1479 ----------------------------- docs/ietf/draft-ietf-bmwg-mlrsearch-05.md | 1460 ++++++++++++++++++++++++++++ docs/ietf/process.txt | 2 +- 3 files changed, 1461 insertions(+), 1480 deletions(-) delete mode 100644 docs/ietf/draft-ietf-bmwg-mlrsearch-04.md create mode 100644 docs/ietf/draft-ietf-bmwg-mlrsearch-05.md diff --git a/docs/ietf/draft-ietf-bmwg-mlrsearch-04.md b/docs/ietf/draft-ietf-bmwg-mlrsearch-04.md deleted file mode 100644 index 4db8506131..0000000000 --- a/docs/ietf/draft-ietf-bmwg-mlrsearch-04.md +++ /dev/null @@ -1,1479 +0,0 @@ ---- -title: Multiple Loss Ratio Search -abbrev: MLRsearch -docname: draft-ietf-bmwg-mlrsearch-04 -date: 2023-07-10 - -ipr: trust200902 -area: ops -wg: Benchmarking Working Group -kw: Internet-Draft -cat: info - -coding: us-ascii -pi: # can use array (if all yes) or hash here - toc: yes - sortrefs: # defaults to yes - symrefs: yes - -author: - - - ins: M. Konstantynowicz - name: Maciek Konstantynowicz - org: Cisco Systems - email: mkonstan@cisco.com - - - ins: V. Polak - name: Vratko Polak - org: Cisco Systems - email: vrpolak@cisco.com - -normative: - RFC1242: - RFC2285: - RFC2544: - RFC9004: - -informative: - TST009: - target: https://www.etsi.org/deliver/etsi_gs/NFV-TST/001_099/009/03.04.01_60/gs_NFV-TST009v030401p.pdf - title: "TST 009" - FDio-CSIT-MLRsearch: - target: https://csit.fd.io/cdocs/methodology/measurements/data_plane_throughput/mlr_search/ - title: "FD.io CSIT Test Methodology - MLRsearch" - date: 2022-11 - PyPI-MLRsearch: - target: https://pypi.org/project/MLRsearch/0.4.0/ - title: "MLRsearch 0.4.0, Python Package Index" - date: 2021-04 - ---- abstract - -This document proposes improvements to [RFC2544] throughput search by -defining a new methodology called Multiple Loss Ratio search -(MLRsearch). The main objectives for MLRsearch are to minimize the -total test duration, search for multiple loss ratios and improve -results repeatibility and comparability. - -The main motivation behind MLRsearch is the new set of challenges and -requirements posed by testing Network Function Virtualization -(NFV) systems and other software based network data planes. - -MLRsearch offers several ways to address these challenges, giving user -configuration options to select their preferred way. - ---- middle - -{::comment} - As we use kramdown to convert from markdown, - we use this way of marking comments not to be visible in rendered draft. - https://stackoverflow.com/a/42323390 - If other engine is used, convert to this way: - https://stackoverflow.com/a/20885980 -{:/comment} - -# Purpose and Scope - -The purpose of this document is to describe Multiple Loss Ratio search -(MLRsearch), a throughput search methodology optimized for software -DUTs. - -Applying vanilla [RFC2544] throughput bisection to software DUTs -results in a number of problems: - -- Binary search takes too long as most of trials are done far from the - eventually found throughput. -- The required final trial duration (and pauses between trials) also - prolong the overall search duration. -- Software DUTs show noisy trial results (noisy neighbor problem), - leading to big spread of possible discovered throughput values. -- Throughput requires loss of exactly zero packets, but the industry - frequently allows for small but non-zero losses. -- The definition of throughput is not clear when trial results are - inconsistent. - -MLRsearch aims to address these problems by applying the following set -of enhancements: - -- Allow searching for multiple search goals, with differing goal loss ratios. - - Each trial result can affect any search goal in principle - (trial reuse). -- Multiple preceding targets for each search goal, earlier ones need - to spend less time on trials. - - Earlier targets also aim at lesser precision. - - Use Forwarding Rate (FR) at maximum offered load - [RFC2285] (section 3.6.2) to initialize the initial targets. -- Take care when dealing with inconsistent trial results. - - Loss ratios goals are handled in an order that minimizes the chance - of interference from later trials to earlier goals. -- Apply several load selection heuristics to save even more time - by trying hard to avoid unnecessarily narrow bounds. - -MLRsearch configuration options are flexible enough to -support both conservative settings (unconditionally compliant with [RFC2544], -but longer search duration and worse repeatability) and aggressive -settings (shorter search duration and better repeatability but not -compliant with [RFC2544]). - -No part of [RFC2544] is intended to be obsoleted by this document. - -# Terminology - -When a subsection is defining a term, the first paragraph -acts as a definition. Other paragraphs are treated as a description, -they provide additional details without being needed to define the term. - -Definitions should form a directed acyclic graph of dependencies. -If a section contains subsections, the section definition -may depend on the subsection definitions. -Otherwise, any definition may depend on preceding definitions. -In other words, if the section definition were to come after subsections, -there would be no forward dependencies for people reading just definitions -from start to finish. - -Descriptions provide motivations and explanations, -they frequently reference terms defined only later. -Motivations in section descriptions are the reason -why section text comes before subsection text. - -## General notions - -General notions are the terms defined in this section. - -It is useful to define the following notions -before delving into MLRsearch architecture, -as the notions appear in multiple places -with no place being special enough to host definition. - -### General and specific quantities - -General quantity is a quantity that may appear multiple times -in MLRsearch specification, perhaps each time in a different role. -The quantity when appearing in a single role is called -a specific quantity. - -It is useful to define the general quantity, -so definitions of specific quantities may refer to it. -We say a specific quantity is based on a general quantity, -if the specific quantity definition refers to and -relies on the general quantity definition. - -It is natural to name specific quantities by adding an adjective -(or a noun) to the name of the general quantity. -But existing RFCs typically explicitly define a term acting -in a specific role, so the RFC name directly refers to a specific -quantity, while the corresponding general quantity -is defined only implicitly. -Therefore this documents defines general quantities explicitly, -even if the same term already appears in an RFC. - -In practice, it is required to know which unit of measurement -is used to accompany a numeric value of each quantity. -The choice of a particular unit of measurement is not important -for MLRsearch specification though, so specific units -mentioned in this document are just examples or recommendations, -not requirements. - -When reporting, it is REQUIRED to state the units used. - -### Composite - -A composite is a set of named attributes. -Each attribute is either a specific quantity or a composite. - -MLRsearch specification frequently groups multiple specific quantities -into a composite. Description of such a composite brings an insight -to motivations why this or other terms are defined as they are. -Such insight will be harder to communicate -with the specific quantities alone. - -Also, it simplifies naming of specific quantities, as they usually can -share a noun or adjective referring to their common composite. -Most of relations between composites and their specific quantities -can be described using plain English. - -Perhaps the only exception involves referring to specific quantities -as attributes. For example if there is a composite called 'target', -and one of its specific quantities is 'target width' defined using -a general quantity 'width', we can say 'width is one of target attributes'. - -### SUT - -As defined in RFC 2285: -The collective set of network devices to which stimulus is offered -as a single entity and response measured. - -While RFC 2544 mostly refers to DUT as a single -(network interconnecting) device, section 19 makes it clear -multiple DUTs can be treated as a single system, -so most of RFC 2544 also applies to testing SUT. - -MLRsearch specification only refers to SUT (not DUT), -even if it consists of just a single device. - -### Trial - -A trial is the part of test described in RFC 2544 section 23. - -When traffic has been sent and SUT response has been observed, -we say the trial has been performed, or the trial has been measured. -Before that happens, multiple possibilities for upcoming trial -may be under consideration. - -### Load - -Intended, constant load for a trial, usually in frames per second. - -Load is the general quantity implied by Constant Load of RFC 1242, -Data Rate of RFC 2544 and Intended Load of RFC 2285. -All three specify this value applies to one (input or output) interface, -so we can talk about unidirectional load also -when bidirectional or multi-port traffic is applied. - -MLRsearch does not rely on this distinction, it works also if -the load values correspond to an aggregate rate -(sum over all SUT tested input or output interface unidirectional loads), -as long as all loads share the same semantics. - -Several RFCs define useful quantities based on Offered Load -(instead of Intended Load), but MLRsearch specification -works only with (intended) load. Those useful quantities -still serve as motivations for few specific quantities used in MLRsearch -specification. - -MLRsearch assumes most load values are positive. -For some (but not all) specific quantities based on load, -zero may also be a valid value. - -### Duration - -Intended duration of the traffic for a trial, usually in seconds. - -This general quantity does not include any preparation nor waiting -described in section 23 of RFC 2544. -Section 24 of RFC 2544 places additional restrictions on duration, -but those restriction apply only to some of the specific quantities based -on duration. - -Duration is always positive in MLRsearch. - -### Duration sum - -For a specific set of trials, this is the sum of their durations. - -Some of specific quantities based on duration sum are derived quantities, -without a specific set of trials to sum their durations. - -Duration sum is never negative in MLRsearch. - -### Width - -General quantity defined for an ordered pair (lower and higher) -of load values, which describes a distance between the two values. - -The motivation for the name comes from binary search. -The binary search tries to approximate an unknown value -by repeatedly bisecting an interval of possible values, -until the interval becomes narrow enough. -Width of the interval is a specific quantity -and the termination condition compares that -to another specific quantity acting as the threshold. -The threshold value does not have a specific interval associated, -but corresponds to a 'size' of the compared interval. -As size is a word already used in definition of frame size, -a more natural word describing interval is width. - -The MLRsearch specification does use (analogues of) upper bound -and lower bound, but does not actually need to talk about intervals. -Still, the intervals are implicitly there, so width is the natural name. - -Actually, there are two popular options for defining width. -Absolute width is based on load, the value is the higher load -minus the lower load. -Relative width is dimensionless, the value is the absolute width -divided by the higher load. As intended loads for trials are positive, -relative width is between 0.0 (including) and 1.0 (excluding). - -Relative width as a threshold value may be useful for users -who do not presume what is the typical performance of SUT, -but absolute width may be a more familiar concept. - -MLRsearch specification does not prescribe which width has to be used, -but widths MUST be either all absolute or all relative, -and it MUST be clear from report which option was used -(it is implied from the unit of measurement of any width value). - -### Loss ratio - -The loss ratio is a general quantity, dimensionless floating point value -assumed to be between 0.0 and 1.0, both including. -It is computed as the number of frames forwarded by SUT, divided by -the number of frames that should have been forwarded during the trial. - -If the number of frames that should have been forwarded is zero, -the loss ratio is considered to be zero -(but it is better to use high enough loads to prevent this). - -Loss ratio is basically the same quantity as Frame Loss Rate of RFC 1242, -just not expressed in percents. - -RFC1242 Frame Loss Rate: -Percentage of frames that should have been forwarded -by a network device under steady state (constant) -load that were not forwarded due to lack of -resources. - -(RFC2544 restricts Frame Loss Rate to a type of benchmark, -for loads 100% of 'maximum rate', 90% and so on.) - -### Exceed ratio - -This general quantity is a dimensionless floating point value, -defined using two duration sum quantities. -One duration sum is referred to as the good duration sum, -the other is referred to as the bad duration sum. -The exceed ratio value is computed as the bad duration sum value -divided by the sum of the two sums. If both sums are zero, -the exceed ratio is undefined. - -As there are no negative duration sums in MLRsearch, -exceed ratio values are between 0.0 and 1.0 (both including). - -## Architecture - -MLRsearch architecture consists of three main components: -the manager, the controller and the measurer. - -The search algorithm is implemented in the controller, -and it is the main focus of this document. - -Most implementation details of the manager and the measurer are -out of scope of this document, except when describing -how do those components interface with the controller. - -### Manager - -The manager is the component that initializes SUT, traffic generator -(called tester in RFC 2544), the measurer and the controller -with intended configurations. It then handles the execution -to the controller and receives its result. - -Managers can range from simple CLI utilities to complex -Continuous Integration systems. From the controller point of view -it is important that no additional configuration (nor warmup) -is needed for SUT and the measurer to perform trials. - -The interface between the manager and the controller -is defined in the controller section. - -One execution of the controller is called a search. -Some benchmarks may execute multiple searches on the same SUT -(for example when confirming the performance is stable over time), -but in this document only one invocation is concerned -(others may be understood as the part of SUT preparation). - -Creation of reports of appropriate format can also be understood -as the responsibility of the manager. This document places requirements -on which information has to be reported. - -### Measurer - -The measurer is the component which performs one trial -as described in RFC 2544 section 23, when requested by the controller. - -From the controller point of view, it is a function that accepts -trial input and returns trial output. - -This is the only way the controller can interact with SUT. -In practice, the measurer has to do subtle decisions -when converting the observed SUT behavior into a single -trial loss ratio value. For example how to deal with -out of order frames or duplicate frames. - -On software implementation level, the measurer is a callable, -injected by the manager into the controller instance. - -The act of performing one trial (act of turning trial input -to trial output) is called a measurement, or trial measurement. -This way we can talk about trials that were measured already -and trials that are merely planned (not measured yet). - -#### Trial input - -The load and duration to use in an upcoming trial. - -This is a composite. - -Other quantities needed by the measurer are assumed to be constant -and set up by the manager before search starts (see traffic profile), -so they do not count as trial input attributes. - -##### Trial load - -Trial load is the intended load for the trial. - -This is a specific quantity based on load, -directly corresponding to RFC 2285 intended load. - -##### Trial duration - -Trial duration is the intended duration for the trial. - -This is a specific quantity based on duration, so it specifies -only the traffic part of the trial, not the waiting parts. - -#### Traffic profile - -Any other configuration values needed by the measurer to perform a trial. - -The measurer needs both trial input and traffic profile to perform the trial. -As trial input contains the only values that vary during one the search, -traffic profile remains constant during the search. - -Traffic profile when understood as a composite is REQUIRED by RFC 2544 -to contain some specific quantities (for example frame size). -Several more specific quantities may be RECOMMENDED. - -Depending on SUT configuration (e.g. when testing specific protocols), -additional values need to be included in the traffic profile -and in the test report. (See other IETF documents.) - -#### Trial ouput - -A composite consisting of trial loss ratio -and trial forwarding rate. - -Those are the only two specific quantities (among other quantities -possibly measured in the trial, for example offered load) -that are important for MLRsearch. - -##### Trial loss ratio - -Trial loss ratio is a specific quantity based on loss ratio. -The value is related to a particular measured trial, -as measured by the measurer. - -##### Trial forwarding rate - -Trial forwarding rate is a derived quantity. -It is computed as one minus trial loss ratio, -that multiplied by trial load. - -Despite the name, the general quantity this specific quantity -corresponds to is load (not rate). -The name is inspired by RFC 2285, which defines Forwarding Rate -specific to one output interface. - -As the definition of loss ratio is not neccessarily per-interface -(one of details left for the measurer), using the definition above -(instead of RFC 2285) makes sure trial forwarding rate -is always between zero and the trial load (both including). - -#### Trial result - -Trial result is a composite consisting of trial input attributes -and trial output attributes. - -Those are all specific quantites related to a measured trial MLRsearch needs. - -While distinction between trial input and output is important -when defining the interface between the controller and the measurer, -it is easier to talk about trial result -when describing how measured trials influence the controller behavior. - -### Controller - -The component of MLRsearch architecture that calls the measurer -and returns conditional throughputs to the manager. - -This component implements the search algorithm, -the main content of this document. - -Contrary to Throughput as defined in RFC 1242, -the definition of conditional throughput is quite sensitive -to the controller input (as provided by the manager), -and its full definition needs several terms -which would otherwise be hidden as internals of the controller -implementation. - -The ability of conditional throughput to be less sensitive -to performance variance, and the ability of the controller -to find conditional throughputs for multiple search goals -within one search (and in short overall search time) -are strong enough motivations for the need of increased complexity. - -### Controller input - -A composite of max load, min load, and a set of search goals. - -The search goals (as elements of the set of search goals) -are usually not named and unordered. - -It is fine if all search goals of the set have the same value -of a particular attribute. In that case, the common value -may be treated as a global attribute (similarly to max and min load). - -The set of search goals MUST NOT be empty. -Two search goals within the set MUST differ in at least one attribute. -The manager MAY avoid both issues by presenting empty report -or de-duplicating the search goals, but it is RECOMMENDED -for the manager to raise an error to its caller, -as the two conditions suggest the test is improperly configured. - -#### Max load - -Max load is a specific quantity based on load. -No trial load is ever higher than this value. - -RFC 2544 section 20 defines maximum frame rate -based on theoretical maximum rate for the frame size on the media. -RFC 2285 section 3.5.3 specifies Maximum offered load (MOL) -which may be lower than maximum frame rate. -There may be other limitations preventing high loads, -for examples resources available to traffic generator. - -The manager is expected to provide a value that is not greater -than any known limitation. Alternatively, the measurer -is expected to work at max load, possibly reporting as lost -any frames that were not able to leave Traffic Generator. - -From the controller point of view, this is merely a global upper limit -for any trial load candidates. - -#### Min load - -Min load is a specific quantity based on load. -No trial load is ever lower than this value. - -The motivation of this quantity is to prevent trials -with too few frames sent to SUT. - -Also, practically if a SUT is able to reach only very small -forwarding rates (min load indirectly serves as a threshold for how small), -it may be considered faulty (or perhaps the test is misconfigured). - -#### Search goal - -A composite of 7 attributes (see subsections). - -If not otherwise specified, 'goal' always refers to a search goal -in this document. - -The controller input may contain multiple search goals. -The name Multiple Loss Ratio search was created back when -goal loss ratio was the only attribute allowed to vary between goals. - -Each goal will get its conditional throughput discovered -and reported at the end of the search. - -The definitions of the 7 attributes are not very informative by themselves. -Their motivation (and naming) becomes more clear -from the impact they have on conditional throughput. - -##### Goal loss ratio - -A specific quantity based on loss ratio. -A threshold value for trial loss ratios. -MUST be lower than one. - -Trial loss ratio values will be compared to this value, -a trial will be considered bad if its loss ratio is higher than this. - -For example, RFC 2544 throughput has goal loss ratio of zero, -a trial is bad once a sigle frame is lost. - -Loss ratio of one would classify each trial as good (regardless of loss), -which is not useful. - -##### Goal initial trial duration - -A specific quantity based on duration. -A threshold value for trial durations. -MUST be positive. - -MLRsearch is allowed to use trials as short as this when focusing -on this goal. -The conditional throughput may be influenced by shorter trials, -(measured when focusing on other search goals). - -{::comment} - FIXME: Should shorter trials be explicitly ignored? -{:/comment} - -##### Goal final trial duration - -A specific quantity based on duration. -A threshold value for trial durations. -MUST be no smaller than goal initial trial duration. - -MLRsearch is allowed to use trials as long as this when focusing -on this goal. If more data is needed, repeated trials -at the same load and duration are requested by the controller. - -##### Goal min duration sum - -A specific quantity based on duration sum. -A threshold value for a particular duration sum. - -MLRsearch requires at least this amount of (effective) trials -for a particular load to become part of MLRsearch outputs. - -It is possible (though maybe not prectical) for goal min duration sum -to be smaller than goal final trial duration. - -In practice, the sum of durations actually spent on trial measurement -can be smaller (when trial results are quite one-sided) or even larger -(in presence of shorter-than-final trial duration results at the same load). - -If the sum of all (good and bad) long trials is at least this, -and there are no short trials, then the load is guaranteed -to be classified as either an upper or a lower bound. - -In some cases, the classification is known sooner, -when the 'missing' trials cannot change the outcome. - -When short trials are present, the logic is more complicated. - -##### Goal exceed ratio - -A specific quantity based on exceed ratio. -A threshold value for particulat sets of trials. - -An attribute used for classifying loads into upper and lower bounds. - -If the duration sum of all (current duration) trials is at least -min duration sum, and more than this percentage of the duration sum -comes from bad trials, this load is an upper bound. - -If there are shorter duration trials, the logic is more complicated. - -##### Goal width - -A specific quantity based on width. -A threshold value for a particular width. -MUST be positive. - -This defines the exit condition for this search goal. - -Relevant bounds (of the final target) need to be this close -before conditional throughput can be reported. - -##### Preceding targets - -A non-negative integer affecting the behavior of the controller. - -How many additional non-final targets to add. -Each next preceding target has double width -and min duration sum geometrically closer to initial trial duration. - -The usage of preceding targets is an important source -of MLRsearch time savings (compared to simpler search algorithms). - -Having this value configurable lets the manager -tweak the overall search duration based on presumed knowledge -of SUT performance stability. - -### Controller internals - -Terms not directly corresponding to the controller's input nor output, -but needed indirectly as dependencies of the conditional throughput -definition. - -Following these definitions specifies virtually all of the controller -(MLRsearch algorithm) logic. - -#### Pre-initial trials - -Up to three special trials executed at the start of the search. -The first trial load is max load, -subsequent trial load are computed from preceding trial -forwarding rate. - -The main loop of the controller logic needs at least one trial result, -and time is saved if the trial results are close to future conditional -throughput values. - -The exact way to compute load for second and third trial -(and whether even measure second or third trial) -are not specified here, as the implementation details -have negligible effect on the reported conditional throughput. - -{::comment} - TODO: Still, recommend something like this: - Loads need to fit several different initial targets at once. - Duration is the largest among initial trial durations, - loads are computed from forwarding rate an smallest loss ratio goal. - Also, the initial target current width is set based on these. -{:/comment} - -#### Search target - -A composite of 5 specific quantites (see subsections). -Frequently called just target. - -Similar to (but distinct from) the search goal. - -Each search goal prescribes a final target, -probably with a chain of preceding targets. - -More details in the Derived targets section. - -##### Target loss ratio - -Same as loss ratio of the corresponding goal. - -##### Target exceed ratio - -Same as exceed ratio of the corresponding goal. - -##### Target width - -Similar to goal width attribute. -Doubled from goal width for each level of preceding target. - -##### Target min duration sum - -Similar to goal min duration sum attribute. -Geometrically interpolated between -initial target duration and goal min duration sum. - -##### Target trial duration - -When MLRsearch focuses on this target, it measures trials -with this duration. -The value is equal to the minimum of goal final trial duration -and target min duration sum. - -Also, this value is used to classify trial results -as short (if trial duration is shorter than this) or long. - -#### Derived targets - -After receiving the set of search goals, -MLRsearch internally derives a set of search targets. - -The derived targets can be seen as forming a chain, -from initial target to final target. -The chain is linked by a reference from a target to its preceding -(towarsds initial) target. - -The reference may be implemented as 6th attribute od target. - -##### Final target - -The final target is the target where the most of attribute values -are directly copied from the coresponding search goal. -Final target width is the same as goal width, -final target trial duration is the same as goal final trial duration, -and final target min duration sum is the same -as the goal min duration sum. - -The conditional throughput is found when focusing on the final target. -All non-final targets do not directly affect the conditional throughput, -they are there just as an optimization. - -##### Preceding target - -Each target may have a preceding target. -Goal attribute Preceding targets governs how many targets are created -in addition to the final target corresponding to the search goal. - -Any preceding target has double width, meaning one balanced bisection -is needed to reduce preceding target width to the next target width. - -Preceding target min duration sum is exponentially smaller, -aiming for prescribed initial target min duration sum. - -Preceding target trial duration is either its min duration sum, -or the corresponding goal's final trial duration, whichever is smaller. - -As the preceding min duration sum is shorter than the next duration sum, -MLRsearch is able to achieve the preceding target width -sooner (than with the next target min duration sum). - -This way an approximation of the conditional throughput is found, -with the next target needing not as much time to improve the approximation -(compared to not starting with the approximation). - -##### Initial target - -Initial target is a target without any other target preceding it. -Initial target min duration sum is equal to the corresponding goal's -initial trial duration. - -As a consequence, initial target trial duration is equal to its min duration sum. - -#### Trial classification - -Any trial result can be classified according to any target along two axes. - -The two classifications are independent. - -This classification is important for defining the conditional throughput. - -##### Short trial - -If the (measured) trial duration is shorter than -the target trial duration, the trial is called long. - -##### Long trial - -If the (measured) trial duration is not shorter than -the target trial duration, the trial is called long. - -##### Bad trial - -If the (measured) trial loss ratio is larger than the target loss ratio, -the trial is called bad. - -For example, if the target loss ratio is zero, -a trial is bad as soon as one frame was lost. - -##### Good trial - -If the (measured) trial loss ratio is not larger than the target loss ratio, -the trial is called good. - -For example, if the target loss ratio is zero, -a trial is good only when there were no frames lost. - -#### Load stat - -A composite of 8 quantities (see subsections) -The quantites depend on a target and a load, -and are computed from all trials measured at that load so far. - -The MLRsearch output is the conditional througput, -which is a specific quantity based on load. -As MLRsearch may measure multiple trials at the same load, -and those trials may not have the same duration, -we need a way to classify a set of trial results at the same load. - -As the logic is not as straightforward as in other parts -of MLRsearch algorithm, it is best defined using the following -derived quantities. - -Load stat is the composite for one load and one target. -Set of load stats for one load an all targets is commonly called load stats. - -##### Long good duration sum - -Sum of durations of all long good trials -(at this load, according to this target). - -##### Long bad duration sum - -Sum of durations of all long bad trials -(at this load, according to this target). - -##### Short good duration sum - -Sum of durations of all short good trials -(at this load, according to this target). - -##### Short bad duration sum - -Sum of durations of all short bad trials -(at this load, according to this target). - -##### Effective bad duration sum - -One divided by tagret exceed ratio, that plus one. -Short good duration sum divided by that. -Short bad duration sum minus that, or zero if that would be negative. -Long bad duration sum plus that is the effective bad duration sum. - -Effective bad duration sum is the long bad duration sum -plus some fraction of short bad duration sum. -The fraction is between zero and one (both possibly including). - -If there are no short good trials, effective bad duration sum -becomes the duration sum of all bad trials (long or short). - -If an exceed ratio computed from short good duration sum -and short bad duration sum is equal or smaller than the target exceed ratio, -effective bad duration sum is equal to just long bad duration sum. - -Basically, short good trials can only lessen the impact -of short bad trials, while short bad trials directly contribute -(unless lessened). - -A typical example of why a goal needs higher final trial duration -than initial trial duration is when SUT is expected to have large buffers, -so a trial may be too short to see frame losses due to -a buffer becoming full. So a short good trial does not give strong information. -On the other hand, short bad trial is a strong hint SUT would lose many frames -at that load and long duration. -But if there is a mix of short bad and short good trials, -MLRsearch should not cherry-pick only the short bad ones. - -The presented way of computing the effective bad duration sum -aims to be a fair treatment of short good trials. - -If the target exceed ratio is zero, the given definition contains -positive infinty as an intermediate value, but still simplifies -to a finite result (long bad duration sum plus short bad duration sum). - -##### Missing duration sum - -The target min duration sum minus effective bad duration sum -and minus long good duration sum, or zero if that would be negative. - -MLRsearch may need up to this duration sum of additional long trials -before classifing the load. - -##### Optimistic exceed ratio - -The specific quantity based on exceed ratio, where bad duration sum is -the effective bad duration sum, and good duration sum is -the long good duration sum plus the missing duration sum. - -This is the value MLRsearch would compare to target exceed ratio -assuming all of the missing duration sum ends up consisting of long good trials. - -If there was a bad long trial, optimistic exceed ratio -becomes larger than zero. -Additionally, if the target exceed ratio is zero, optimistic exceed ratio -becomes larger than zero even on one short bad trial. - -##### Pessimistic exceed ratio - -The specific quantity based on exceed ratio, where bad duration sum is -the effective bad duration sum plus the missing duration sum, -and good duration sum is the long good duration sum. - -This is the value MLRsearch would compare to target exceed ratio -assuming all of the missing duration sum ends up consisting of bad good trials. - -Note that if the missing duration sum is zero, -optimistic exceed ratio becomes equal to pessimistic exceed ratio. - -This is the role target min duration sum has, -it guarantees the two load exceed ratios eventually become the same. -Otherwise, pessimistic exceed ratio -is always bigger than the optimistic exceed ratio. - -Depending on trial results, the missing duration sum may not be large enough -to change optimistic (or pessimistic) exceed ratio -to move to the other side compared to target exceed ratio. -In that case, MLRsearch does not need to measure more trials -at this load when focusing on this target. - -#### Target bounds - -With respect to a target, some loads may be classified as upper or lower bound, -and some of the bounds are treated as relevant. - -The subsequent parts of MLRsearch rely only on relevant bounds, -without the need to classify other loads. - -##### Upper bound - -A load is classified as an upper bound for a target, -if and only if both optimistic exceed ratio -and pessimstic load exceed ratio are larger than the target exceed ratio. - -During the search, it is possible there is no upper bound, -for example because every measured load still has too high -missing duration sum. - -If the target exceed ratio is zero, and the load has at least one bad trial -(short or long), the load becomes an upper bound. - -##### Lower bound - -A load is classified as a lower bound for a target, -if and only if both optimistic exceed ratio -and pessimstic load exceed ratio are no larger than the target exceed ratio. - -During the search, it is possible there is no lower bound, -for example because every measured load still has too high -missing duration sum. - -If the target exceed ratio is zero, all trials at the load of -a lower bound must be good trials (short or long). - -Note that so far it is possible for a lower bound to be higher -than an upper bound. - -##### Relevant upper bound - -For a target, a load is the relevant upper bound, -if and only if it is an upper bound, and all other upper bounds -are larger (as loads). - -In some cases, the max load when classified as a lower bound -is also effectively treated as the relevant upper bound. -(In that case both relevant bounds are equal.) - -If that happens for a final target at the end of the search, -the controller output may contain max load as the relevant upper bound -(even if the goal exceed ratio was not exceeded), -signalling SUT performs well even at max load. - -If the target exceed ratio is zero, the relevant upper bound -is the smallest load where a bad trial (short or long) has been measured. - -##### Relevant lower bound - -For a target, a load is the relevant lower bound if two conditions hold. -Both optimistic exceed ratio and pessimstic load exceed ratio -are no larger than the target exceed ratio, -and there is no smaller load classified as an upper bound. - -This is a second place where MLRsearch is not symmetric -(the first place was effective bad duration sum). - -While it is not likely for a MLRsearch to find a smaller upper bound -and a larger load satisfying first condition for the lower bound, -it still may happen and MLRsearch has to deal with it. -The second condition makes sure the relevant lower bound -is smaller than the relevant upper bound. - -In some cases, the min load when classified as an upper bound -is also effectively treated as the relevant lower bound. -(In that case both relevant bounds are equal.) - -If that happens for a final target at the end of the search, -the controller output may contain min load as the relevant lower bound -even if the exceed ratio was 'overstepped', -signalizing the SUT does not even reach the minimal required performance. - -The manager has to make sure this is distingushed in report -from cases where min rate is a legitimate conditional throughput -(e.g. the exceed ratio was not overstepped at the min load). - -##### Relevant bounds - -The pair of the relevant lower bound and the relevant upper bound. - -Useful for determining the width of the relevant bounds. -Any of the bounds may be the effective one (max load or min load). - -A goal is achieved (at the end of the search) when the final target's -relevant bounds have width no larger than the goal width. - -#### Candidate selector - -A stateful object (a finite state machine) -focusing on a single target, used to determine next trial input. - -Initialized for a pair of targets: -the current target and its preceding target (if any). - -Private state (not shared with other selectors) consists of mode and flags. -Public state (shared with all selectors) is the actual relevant bounds -for both targets (current and precedinig). - -After accepting a trial result, each selector can nominate -one candidate (or no candidate) for the next trial measurement. - -##### Current target - -This is the target this selector tries to achieve. - -##### Preceding target - -The target (if any) preceding to the current target. - -While this selector does not focus on the preceding target, -the relevant bounds for the preceding target are used as hints -when the current bound does not have enough of its relevant bounds. - -##### Candidate - -The trial input (if any) this selecor nominates. - -The trial duration attribute is always the current target trial duration. -The trial load attribute depends on the selector state. - -Candidates have defined ordering, to simplify finding the winner. -If load differs, the candidate with lower load is preferred. -If load is the same but duration differs, the candidate -with larger duration is preferred. - -##### Selector mode - -During its lifetime, selector proceeds through the following modes. -In order, but some modes may be skipped or revisited. - -Each mode has its own strategy of determining the candidate load (if any). - -###### Waiting - -Not enough relevant bounds (even for the preceding target). -In this mode, the selector abstains from nominating a candidate. - -This selector leaves this mode when preceding target's selector is done. - -###### Halving - -Candidate is in the middle of the relevant bounds of the preceding target. - -If the relevant bounds are narrow enough already, this mode is skipped. -As the preceding target had double width, just one halving load -needs to be measured. - -Selector uses a flag to avoid re-entering this mode -once it finished measuring the halved load. - -###### Upgrading - -This mode activates when one relevant bound for the current target is present -and there is a matching relevant bound of the preceding target -within the current target width. -Candidate is the load of the matching bound from the preceding target. - -At most one bound load is measured, depending on halving outcome. -Private flags are used to avoid upgrading at later times -once selector finished measuring the upgraded load. - -###### Extending - -Refined already but the other relevant bound for the current target -is still missing. -Nominate new candidate according to external search. -Initial target selectors skip all previous modes. - -A private value is used to track the width to be used in next load extension -(increasing geometrically). -For initial target selectors, the starting width may be chosen -based on pre-initial trial results. - -If both relevant bounds are present at the current load, -but the lower bound is far away (compared to tracked width), -the candidate from this mode is preferred (as long as the load -is larger than the candidate load of bisecting mode). - -###### Bisecting - -Both relevant bounds for the current target are available, but they are too far -from each other. Candidate is in the middle. - -Contrary to halving, the candidate load does not need to be at the exact middle. -For example if the width of the current relevant bounds -is three times as large as the target width, -it is advantageous to split the interval in 1:2 ratio -(choosing the lower candidate load), as it can save one bisect. - -###### Done - -Both relevant bounds for the current target are available, -the width is no larger than the target width. -No candidate. - -If a selector reaches the done state, -it is still possible later trials invalidate its relevant lower bound -(by proving a lower load is in fact a new uper bound), -making the selector transition into extending or bisecting mode. - -##### Active selector - -Derived from a common goal, the earliest selector which nominates a candidate -is considered to be the active selector for this goal. -Candidates from other selectors of the same goal are ignored. - -It is quite possible selectors focusing on other goals -have already found a lower bound relevant to multiple targets in a chain. -In that case, we want the most-initial of the target selectors -(not already in done mode) to have the nomination. - -Otherwise (when in extending mode and missun relevant upper bound) -the closer-to-final selectors would nominate candidates -at lower load but at too high duration sum, -preventing some of the time savings. - -##### Winner - -If the candidate previously nominated by a selector was the one -that got measured, the candidate is called a winner. - -A selector observing its previous candidate was a winer -can use simplified logic when determining the mode, -as it knows no other selectors may have changed the relevant loads unexpectedly. - -### Controller output - -The output object the controller returns to the manager -is a mapping assigning each search goal its conditional output (if it exists). - -The controller MAY include more information (if manager accepts it), -for example load stat at relevant bounds. - -There MAY be several ways how to communicate the fact a conditional output -does not exist (e.g. min load is classified as an upper bound). -The manager MUST NOT present min load as a conditional output in that case. - -If max load is a lower bound, it leads to a valid conditional output value. - -#### Conditional throughput - -The conditional throughput is the average of trial forwarding rates -across long good trials measured at the (offered load classified as) -relevant lower bound (for the goal, at the end of the search). -The average is the weighted arithmetic mean, weighted by trial duration. - -If the goal exceed ratio is zero, the definition of the relevant bounds -simplifies significantly. -If additionally the goal loss ratio is zero, -and the goal min duration sum is equal to goal final trial duration, -conditional throughput becomes conditionally compliant with RFC 2544 throughput. -If the goal final trial duration is at least 60 seconds, -the conditional througput becomes unconditionally compliant -with RFC 2544 throughput. - -# Problems - -## Long Test Duration - -Emergence of software DUTs, with frequent software updates and a -number of different packet processing modes and configurations, drives -the requirement of continuous test execution and bringing down the test -execution time. - -In the context of characterising particular DUT's network performance, this -calls for improving the time efficiency of throughput search. -A vanilla bisection (at 60sec trial duration for unconditional [RFC2544] -compliance) is slow, because most trials spend time quite far from the -eventual throughput. - -[RFC2544] does not specify any stopping condition for throughput search, -so users can trade-off between search duration and achieved precision. -But, due to exponential behavior of bisection, small improvement -in search duration needs relatively big sacrifice in the throughput precision. - -## DUT within SUT - -[RFC2285] defines: -- *DUT* as - - The network forwarding device to which stimulus is offered and - response measured [RFC2285] (section 3.1.1). -- *SUT* as - - The collective set of network devices to which stimulus is offered - as a single entity and response measured [RFC2285] (section 3.1.2). - -[RFC2544] specifies a test setup with an external tester stimulating the -networking system, treating it either as a single DUT, or as a system -of devices, an SUT. - -In case of software networking, the SUT consists of a software program -processing packets (device of interest, the DUT), -running on a server hardware and using operating system functions as appropriate, -with server hardware resources shared across all programs -and the operating system. - -DUT is effectively "nested" within SUT. - -Due to a shared multi-tenant nature of SUT, DUT is subject to -interference (noise) coming from the operating system and any other -software running on the same server. Some sources of noise can be -eliminated (e.g. by pinning DUT program threads to specific CPU cores -and isolating those cores to avoid context switching). But some -noise remains after all such reasonable precautions are applied. This -noise does negatively affect DUT's network performance. We refer to it -as an *SUT noise*. - -DUT can also exhibit fluctuating performance itself, e.g. while performing -some "stop the world" internal stateful processing. In many cases this -may be an expected per-design behavior, as it would be observable even -in a hypothetical scenario where all sources of SUT noise are -eliminated. Such behavior affects trial results in a way similar to SUT -noise. We use *noise* as a shorthand covering both *DUT fluctuations* and -genuine SUT noise. - -A simple model of SUT performance consists of a baseline *noiseless performance*, -and an additional noise. The baseline is assumed to be constant (enough). -The noise varies in time, sometimes wildly. The noise can sometimes be negligible, -but frequently it lowers the observed SUT performance in a trial. - -In this model, SUT does not have a single performance value, it has a spectrum. -One end of the spectrum is the noiseless baseline, -the other end is a *noiseful performance*. In practice, trial results -close to the noiseful end of the spectrum happen only rarely. -The worse performance, the more rarely it is seen in a trial. - -Focusing on DUT, the benchmarking effort should aim -at eliminating only the SUT noise from SUT measurement. -But that is not really possible, as there are no realistic enough models -able to distinguish SUT noise from DUT fluctuations. - -However, assuming that a well-constructed SUT has the DUT as its -performance bottleneck, the "DUT noiseless performance" can be defined -as the noiseless end of SUT performance spectrum. (At least for -throughput. For other quantities such as latency there will be an -additive difference.) By this definition, DUT noiseless performance -also minimizes the impact of DUT fluctuations. - -In this document, we reduce the "DUT within SUT" problem to estimating -the noiseless end of SUT performance spectrum from a limited number of -trial results. - -Any improvements to throughput search algorithm, aimed for better -dealing with software networking SUT and DUT setup, should employ -strategies recognizing the presence of SUT noise, and allow discovery of -(proxies for) DUT noiseless performance -at different levels of sensitivity to SUT noise. - -## Repeatability and Comparability - -[RFC2544] does not suggest to repeat throughput search. And from just one -throughput value, it cannot be determined how repeatable that value is. -In practice, poor repeatability is also the main cause of poor -comparability, e.g. different benchmarking teams can test the same SUT -but get different throughput values. - -[RFC2544] throughput requirements (60s trial, no tolerance to single frame loss) -force the search to fluctuate close the noiseful end of SUT performance -spectrum. As that end is affected by rare trials of significantly low -performance, the resulting throughput repeatability is poor. - -The repeatability problem is the problem of defining a search procedure -which reports more stable results -(even if they can no longer be called "throughput" in [RFC2544] sense). -According to baseline (noiseless) and noiseful model, better repeatability -will be at the noiseless end of the spectrum. -Therefore, solutions to the "DUT within SUT" problem -will help also with the repeatability problem. - -Conversely, any alteration to [RFC2544] throughput search -that improves repeatability should be considered -as less dependent on the SUT noise. - -An alternative option is to simply run a search multiple times, and report some -statistics (e.g. average and standard deviation). This can be used -for "important" tests, but it makes the search duration problem even -more pronounced. - -## Throughput with Non-Zero Loss - -[RFC1242] (section 3.17) defines throughput as: - The maximum rate at which none of the offered frames - are dropped by the device. - -and then it says: - Since even the loss of one frame in a - data stream can cause significant delays while - waiting for the higher level protocols to time out, - it is useful to know the actual maximum data - rate that the device can support. - -Contrary to that, many benchmarking teams settle with non-zero -(small) loss ratio as the goal for a "throughput rate". - -Motivations are many: modern protocols tolerate frame loss better; -trials nowadays send way more frames within the same duration; -impact of rare noise bursts is smaller as the baseline performance -can compensate somewhat by keeping the loss ratio below the goal; -if SUT noise with "ideal DUT" is known, it can be set as the loss ratio goal. - -Regardless of validity of any and all similar motivations, -support for non-zero loss goals makes any search algorithm more user-friendly. -[RFC2544] throughput is not friendly in this regard. - -Searching for multiple goal loss ratios also helps to describe the SUT -performance better than a single goal result. Repeated wide gap between -zero and non-zero loss conditional throughputs indicates -the noise has a large impact on the overall SUT performance. - -It is easy to modify the vanilla bisection to find a lower bound -for intended load that satisfies a non-zero-loss goal, -but it is not that obvious how to search for multiple goals at once, -hence the support for multiple loss goals remains a problem. - -## Inconsistent Trial Results - -While performing throughput search by executing a sequence of -measurement trials, there is a risk of encountering inconsistencies -between trial results. - -The plain bisection never encounters inconsistent trials. -But [RFC2544] hints about possibility if inconsistent trial results in two places. -The first place is section 24 where full trial durations are required, presumably -because they can be inconsistent with results from shorter trial durations. -The second place is section 26.3 where two successive zero-loss trials -are recommended, presumably because after one zero-loss trial -there can be subsequent inconsistent non-zero-loss trial. - -Examples include: - -- a trial at the same load (same or different trial duration) results - in a different packet loss ratio. -- a trial at higher load (same or different trial duration) results - in a smaller packet loss ratio. - -Any robust throughput search algorithm needs to decide how to continue -the search in presence of such inconsistencies. -Definitions of throughput in [RFC1242] and [RFC2544] are not specific enough -to imply a unique way of handling such inconsistencies. - -Ideally, there will be a definition of a quantity which both generalizes -throughput for non-zero-loss (and other possible repeatibility enhancements), -while being precise enough to force a specific way to resolve trial -inconsistencies. -But until such definition is agreed upon, the correct way to handle -inconsistent trial results remains an open problem. - -# How the problems are addressed - -Configurable loss ratio in MLRsearch search goals are there -in direct support for non-zero-loss conditional throughput. -In practice the conditional throughput results' stability -increases with higher loss ratio goals. - -Multiple trials with noise tolerance enhancement, -as implemented in MLRsearch using non-zero goal exceed ratio value, -also indirectly increases the result stability. -That allows MLRsearch to achieve all the benefits -of Binary Search with Loss Verification, -as recommended in [RFC9004] (section 6.2) -and specified in [TST009] (section 12.3.3). - -The main factor improving the overall search time is the introduction -of preceding targets. Less impactful time savings -are achieved by pre-initial trials, halving mode -and smart splitting in bisecting mode. - -In several places, MLRsearch is "conservative" when handling -(potentially) inconsistent results. This includes the requirement -for the relevant lower bound to be smaller than any upper bound, -the unequal handling of good and bad short trials, -and preference to lower load when choosing the winner among candidates. - -While this does no guarantee good search stability -(goals focusing on higher loads may still invalidate existing bounds -simply by requiring larger min duration sums), -it lowers the change of SUT having an area of poorer performance -below the reported conditional througput loads. -In any case, the definition of conditional throughput -is precise enough to dictate "conservative" handling -of trial inconsistencies. - -# IANA Considerations - -No requests of IANA. - -# Security Considerations - -Benchmarking activities as described in this memo are limited to -technology characterization of a DUT/SUT using controlled stimuli in a -laboratory environment, with dedicated address space and the constraints -specified in the sections above. - -The benchmarking network topology will be an independent test setup and -MUST NOT be connected to devices that may forward the test traffic into -a production network or misroute traffic to the test management network. - -Further, benchmarking is performed on a "black-box" basis, relying -solely on measurements observable external to the DUT/SUT. - -Special capabilities SHOULD NOT exist in the DUT/SUT specifically for -benchmarking purposes. Any implications for network security arising -from the DUT/SUT SHOULD be identical in the lab and in production -networks. - -# Acknowledgements - -Many thanks to Alec Hothan of OPNFV NFVbench project for thorough -review and numerous useful comments and suggestions. - ---- back diff --git a/docs/ietf/draft-ietf-bmwg-mlrsearch-05.md b/docs/ietf/draft-ietf-bmwg-mlrsearch-05.md new file mode 100644 index 0000000000..937e632413 --- /dev/null +++ b/docs/ietf/draft-ietf-bmwg-mlrsearch-05.md @@ -0,0 +1,1460 @@ +--- +title: Multiple Loss Ratio Search +abbrev: MLRsearch +docname: draft-ietf-bmwg-mlrsearch-05 +date: 2023-10-23 + +ipr: trust200902 +area: ops +wg: Benchmarking Working Group +kw: Internet-Draft +cat: info + +coding: us-ascii +pi: # can use array (if all yes) or hash here + toc: yes + sortrefs: # defaults to yes + symrefs: yes + +author: + - + ins: M. Konstantynowicz + name: Maciek Konstantynowicz + org: Cisco Systems + email: mkonstan@cisco.com + - + ins: V. Polak + name: Vratko Polak + org: Cisco Systems + email: vrpolak@cisco.com + +normative: + RFC1242: + RFC2285: + RFC2544: + RFC9004: + +informative: + TST009: + target: https://www.etsi.org/deliver/etsi_gs/NFV-TST/001_099/009/03.04.01_60/gs_NFV-TST009v030401p.pdf + title: "TST 009" + FDio-CSIT-MLRsearch: + target: https://csit.fd.io/cdocs/methodology/measurements/data_plane_throughput/mlr_search/ + title: "FD.io CSIT Test Methodology - MLRsearch" + date: 2023-10 + PyPI-MLRsearch: + target: https://pypi.org/project/MLRsearch/0.4.1/ + title: "MLRsearch 0.4.1, Python Package Index" + date: 2021-07 + +--- abstract + +This document proposes extensions to [RFC2544] throughput search by +defining a new methodology called Multiple Loss Ratio search +(MLRsearch). The main objectives of MLRsearch are to minimize the +total search duration, to support searching for multiple loss ratios +and to improve results repeatability and comparability. + +The main motivation behind extending [RFC2544] is the new set of challenges +and requirements posed by evaluating and testing software based networking +systems, specifically their data planes. + +MLRsearch offers several ways to address these challenges, giving user +configuration options to select their preferred way. + +--- middle + +{::comment} + As we use kramdown to convert from markdown, + we use this way of marking comments not to be visible in rendered draft. + https://stackoverflow.com/a/42323390 + If other engine is used, convert to this way: + https://stackoverflow.com/a/20885980 +{:/comment} + +# Purpose and Scope + +The purpose of this document is to describe Multiple Loss Ratio search +(MLRsearch), a data plane throughput search methodology optimized for software +networking DUTs. + +Applying vanilla [RFC2544] throughput bisection to software DUTs +results in a number of problems: + +- Binary search takes too long as most of trials are done far from the + eventually found throughput. +- The required final trial duration and pauses between trials + prolong the overall search duration. +- Software DUTs show noisy trial results, + leading to a big spread of possible discovered throughput values. +- Throughput requires loss of exactly zero frames, but the industry + frequently allows for small but non-zero losses. +- The definition of throughput is not clear when trial results are inconsistent. + +MLRsearch library aims to address these problems by applying the following set +of enhancements: + +- Allow multiple shorter trials instead of one big trial per load. + - Optionally, tolerate a percentage of trial result with higher loss. +- Allow searching for multiple search goals, with differing loss ratios. + - Any trial result can affect each search goal in principle. +- Insert multiple coarse targets for each search goal, earlier ones need + to spend less time on trials. + - Earlier targets also aim for lesser precision. + - Use Forwarding Rate (FR) at maximum offered load + [RFC2285] (section 3.6.2) to initialize the initial targets. +- Take care when dealing with inconsistent trial results. + - Reported throughput is smaller than smallest load with high loss. + - Smaller load candidates are measured first. +- Apply several load selection heuristics to save even more time + by trying hard to avoid unnecessarily narrow bounds. + +Some of these enhancements are formalized as MLRsearch specification, +the remaining enhancements are treated as implementation details, +thus achieving high comparability without limiting future improvements. + +MLRsearch configuration options are flexible enough to +support both conservative settings and aggressive settings. +Where the conservative settings lead to results +unconditionally compliant with [RFC2544], +but longer search duration and worse repeatability. +Conversely, aggressive settings lead to shorter search duration +and better repeatability, but the results are not compliant with [RFC2544]. + +No part of [RFC2544] is intended to be obsoleted by this document. + +# Identified Problems + +This chapter describes the problems affecting usability +of various preformance testing methodologies, +mainly a binary search for [RFC2544] unconditionally compliant throughput. + +The last chapter will summarize how the problems are addressed, +the middle chapters provide explanations and definitions needed for that. + +## Long Search Duration + +Emergence of software DUTs, with frequent software updates and a +number of different frame processing modes and configurations, +has increased both the number of performance tests requred to verify DUT update +and the frequency of running those tests. +This makes the overall test execution time even more important than before. + +In the context of characterising particular DUT's network performance, +this calls for improving the time efficiency of throughput search. +A vanilla bisection (at 60sec trial duration for unconditional [RFC2544] +compliance) is slow, because most trials spend time quite far from the +eventual throughput. + +[RFC2544] does not specify any stopping condition for throughput search, +so users can trade-off between search duration and achieved precision. +But, due to logarithmic nature of bisection, even small improvement +in search duration needs relatively big sacrifice in the precision of the +discovered throughput. + +## DUT in SUT + +[RFC2285] defines: +- DUT as + - The network forwarding device to which stimulus is offered and + response measured [RFC2285] (section 3.1.1). +- SUT as + - The collective set of network devices to which stimulus is offered + as a single entity and response measured [RFC2285] (section 3.1.2). + +[RFC2544] specifies a test setup with an external tester stimulating the +networking system, treating it either as a single DUT, or as a system +of devices, an SUT. + +In case of software networking, the SUT consists nt only of the DUT +as a software program processing frames, but also of +a server hardware and operating system functions, +with server hardware resources shared across all programs +and the operating system running on the same server. + +The DUT is effectively nested within the rest of the SUT. + +Due to a shared multi-tenant nature of SUT, DUT is subject to +possible interference coming from the operating system and any other +software running on the same server. Some sources of such interference +can be to some degree eliminated, e.g. by pinning DUT program threads +to specific CPU cores and isolating those cores to avoid context switching. +But some level of adverse effects may remain even after +all such reasonable precautions are applied. +These effects affect DUT's network performance negatively. +As the effects are hard to predict in general, they have impact similar to +what other engineering disciplines define as a noise. +Thus, all such effects are called an SUT noise. + +DUT can also exhibit fluctuating performance itself, for reasons +not related to the rest of SUT, for example due to pauses in execution +as needed for internal stateful processing. In many cases this +may be an expected per-design behavior, as it would be observable even +in a hypothetical scenario where all sources of SUT noise are eliminated. +Such behavior affects trial results in a way similar to SUT noise. +As the two phenomenons are hard to destinguish, +this document uses the word noise as a shorthand covering both +this internal DUT performance fluctuations and genuine SUT noise. + +A simple model of SUT performance consists of an idealized noiseless performance, +and additional noise effects. The noiseless performance is assumed to be constant, +all observed performance variations are due to noise. +The impact of the noise can vary in time, sometimes wildly, +even within a single trial. +The noise can sometimes be negligible, but frequently +it lowers the observed SUT performance as observed in trial results. + +In this model, SUT does not have a single performance value, it has a spectrum. +One end of the spectrum is the idealized noiseless performance value, +the other end can be called a noiseful performance. In practice, trial results +close to the noiseful end of the spectrum happen only rarely. +The worse the performance value is, the more rarely it is seen in a trial. +Therefore, the extreme noiseful end of SUT spectrum is not really observable +among trial results. Also, the extreme noiseless end of SUT spectrum +is unlikely to be observable, this time because some small noise effects +are likely to occur multiple times during a trial. + +Unless specified otherwise, this document talks about potentially observable +ends of the SUT performance spectrum, not about the extreme ones. + +Focusing on DUT, the benchmarking effort should aim +at eliminating only the SUT noise from SUT measurements. +In practice that is not really possible, as based on authors experience +and available literature, there are no realistic enough models +able to distinguish SUT noise from DUT fluctuations. + +However, assuming that a well-constructed SUT has the DUT as its +performance bottleneck, the DUT ideal noiseless performance can be defined +as the noiseless end of SUT performance spectrum. At least for +throughput. For other performance quantities such as latency there may be an +additive difference. + +Note that by this definition, DUT noiseless performance +also minimizes the impact of DUT fluctuations, as much as realistically possible +for a given trial duration. + +In this document, we reduce the DUT in SUT problem to estimating +the noiseless end of SUT performance spectrum from a limited number of +trial results. + +Any improvements to throughput search algorithm, aimed for better +dealing with software networking SUT and DUT setup, should employ +strategies recognizing the presence of SUT noise, and allow discovery of +(proxies for) DUT noiseless performance +at different levels of sensitivity to SUT noise. + +## Repeatability and Comparability + +[RFC2544] does not suggest to repeat throughput search. And from just one +discovered throughput value, it cannot be determined how repeatable that value is. +In practice, poor repeatability is also the main cause of poor +comparability, that is different benchmarking teams can test the same SUT +but get throughput values differing more than expected from search precision. + +[RFC2544] throughput requirements (60 seconds trial and +no tolerance of a single frame loss) affect the throughput results +in the following way. +The SUT behavior close to the noiseful end of its performance spectrum +consists of rare occasions of significantly low performance, +but the long trial duration makes those occasions not so rare on the trial level. +Therefore, the binary search results tend to wander away from the noiseless end +of SUT performance spectrum, more frequently and more widely than shorter +trials would, thus resulting in poor throughput repeatability. + +The repeatability problem can be addressed by defining a search procedure +which reports more stable results, +even if they can no longer be called throughput in [RFC2544] sense. +According to the SUT performance spectrum model, better repeatability +will be at the noiseless end of the spectrum. +Therefore, solutions to the DUT in SUT problem +will help also with the repeatability problem. + +Conversely, any alteration to [RFC2544] throughput search +that improves repeatability should be considered +as less dependent on the SUT noise. + +An alternative option is to simply run a search multiple times, and report some +statistics (e.g. average and standard deviation). This can be used +for a subset of tests deemed more important, +but it makes the search duration problem even more pronounced. + +## Throughput with Non-Zero Loss + +[RFC1242] (section 3.17) defines throughput as: + The maximum rate at which none of the offered frames + are dropped by the device. + +Then, it says: + Since even the loss of one frame in a + data stream can cause significant delays while + waiting for the higher level protocols to time out, + it is useful to know the actual maximum data + rate that the device can support. + +Contrary to that, many benchmarking teams settle with small, non-zero +loss ratio as the goal for a their load search. + +Motivations are many: + +- Modern protocols tolerate frame loss better, + compared to the time when [RFC1242] and [RFC2544] were specified. + +- Trials nowadays send way more frames within the same duration, + increasing the chance small SUT performance fluctuatios + is enough to cause frame loss. + +- Small bursts of frame loss caused by noise have otherwise smaller impact + on the average frame loss ratio ovserved in the trial, + as during other parts of the same trial the SUT may work mroe closely + to its noiseless performance, thus perhaps lowering the trial loss ratio + below the goal loss ratio value. + +- If an approximation of the SUT noise impact on the trial loss ratio is known, + it can be set as the goal loss ratio. + +Regardless of validity of any and all similar motivations, +support for non-zero loss goals makes any search algorithm more user friendly. +[RFC2544] throughput is not user friendly in this regard. + +Assuming users are allowed to specify the goal loss ratio value, +the usefulness is enhanced even more if users can specify multiple +loss ratio values, especially when a single search can find all relevant bounds. + +Searching for multiple search goals also helps to describe the SUT performance +spectrum better than a single search goal result. +For example, repeated wide gap between zero and non-zero loss loads +indicates the noise has a large impact on the observed performance, +which is not evident from a single goal load search procedure result. + +It is easy to modify the vanilla bisection to find a lower bound +for intended load that satisfies a non-zero goal loss ratio. +But it is not that obvious how to search for multiple goals at once, +hence the support for multiple search goals remains a problem. + +## Inconsistent Trial Results + +While performing throughput search by executing a sequence of +measurement trials, there is a risk of encountering inconsistencies +between trial results. + +The plain bisection never encounters inconsistent trials. +But [RFC2544] hints about possibility of inconsistent trial results, +in two places in its text. +The first place is section 24, where full trial durations are required, +presumably because they can be inconsistent with results +from shorter trial durations. +The second place is section 26.3, where two successive zero-loss trials +are recommended, presumably because after one zero-loss trial +there can be subsequent inconsistent non-zero-loss trial. + +Examples include: + +- A trial at the same load (same or different trial duration) results + in a different trial loss ratio. +- A trial at higher load (same or different trial duration) results + in a smaller trial loss ratio. + +Any robust throughput search algorithm needs to decide how to continue +the search in presence of such inconsistencies. +Definitions of throughput in [RFC1242] and [RFC2544] are not specific enough +to imply a unique way of handling such inconsistencies. + +Ideally, there will be a definition of a new quantity which both generalizes +throughput for non-zero-loss (and other possible repeatibility enhancements), +while being precise enough to force a specific way to resolve trial result +inconsistencies. +But until such definition is agreed upon, the correct way to handle +inconsistent trial results remains an open problem. + +# MLRsearch Specification + +This chapter focuses on technical definitions needed for evaluating +whether a particular test procedure adheres to MLRsearch specification. + +For motivations, explanations, and other comments see other chapters. + +## MLRsearch Architecture + +MLRsearch architecture consists of three main components: +the manager, the controller and the measurer. +Presence of other components (mainly the SUT) is also implied. + +While the manager and the measurer can be seen a abstractions +present in any testing procedure, the behavior of the controller +is what distinguishes MLRsearch algorithms from other search procedures. + +### Measurer + +The measurer is the component which performs one trial +as described in [RFC2544] section 23, when requested by the controller. + +Specifically, one call to the measurer accepts a trial load value +and trial duration value, performs the trial, and returns +the measured trial loss ratio, and optionally a different duration value. + +It is responsibility of the measurer to uphold any requirements +and assumptions present in MLRsearch specification +(e.g. trial forwarding ratio not being larger than one). +Implementers have some freedom, for example in the way they deal with +duplicated frames, or what to return if tester sent zero frames towards SUT. +Implementations are RECOMMENDED to document their behavior +related to such freedoms in as detailed way as possible + +Implemenations MUST document any deviations from RFC documents, +for example if the wait time around traffic +is shorter than what [RFC2544] section 23 specifies. + +### Controller + +The controller is the component of MLRsearch architecture +that is called by the manager (just once), calls the measurer +(usually multiple times in succession), +and returns the result of the search to the manager. + +The only required argument in the call to the controller +is a list of search goals. For the structure of the search result, +see subsection Search Result. + +### Manager + +The manager is the component that initializes SUT, traffic generator +(tester in [RFC2544]), the measurer and the controller +with intended configurations. It then hands over the execution +to the controller and receives its result. + +Creation of reports of appropriate format can also be understood +as the responsibility of the manager. + +## Units + +The specification deals with physical quantities, so it is assumed +each numeric value is accompanied by an appropriate physical unit. + +The specification does not state which unit is appropriate, +but implementations MUST make it explicit which unit is used +for each value provided or received by the user. + +For example, load quantities (including the conditional throughput) +returned by the controller are defined to be based on single-interface +(unidirectional) loads. For bidirectional traffic, users are likely +to expect bidirectional throughput quantities, so the manager is responsible +for making its report clear. + +## SUT + +As defined in [RFC2285]: +The collective set of network devices to which stimulus is offered +as a single entity and response measured. + +## Trial + +A trial is the part of test described in [RFC2544] section 23. + +### Trial Load + +Trial load is the intended constant load for a trial. + +Load is the quantity implied by Constant Load of [RFC1242], +Data Rate of [RFC2544] and Intended Load of [RFC2285]. +All three specify this value applies to one (input or output) interface. + +### Trial Duration + +Trial duration is the intended duration of the traffic for a trial. + +In general, this general quantity does not include any preparation nor waiting +described in section 23 of [RFC2544]. + +However, the measurer MAY return a duration value different +from the intended duration. This may be useful for users +who want to control the overal search duration, not just the traffic part of it. +The manager MUST report how does the measurer computes the returned duration +values in that case. + +### Trial Forwarding Ratio + +Trial forwarding ratio is dimensionless floating point value, +assumed to be between 0.0 and 1.0, both including. +It is computed as the number of frames forwarded by SUT, divided by +the number of frames that should have been forwarded during the trial. + +Note that, contrary to load, frame counts used to compute +trial forwarding ratio are aggregates over all SUT ports. + +Questions around what is the correct number of frames +that should have been forwarded it outside of the scope of this document. +E.g. what should the measurer return when it detects +that the offered load differs significantly from the intended load. + +It is RECOMMENDED implementations return an irregular goal result +if they detect questionable (in comparability sense) trial results +affecting their goal result. + +### Trial Loss Ratio + +Trial loss ratio is equal to one minus the trial forwarding ratio. + +### Trial Forwarding Rate + +The trial forwarding rate is the trial load multiplied by +the trial forwarding ratio. + +Note that this is very similar, but not identical to Forwarding Rate +as defined in [RFC2285] section 3.6.1, as that definition +is specific to one output interface, while trial forwarding ratio +is based on frame counts aggregated over all SUT interfaces. + +## Traffic profile + +Any other specifics (besides trial load and trial duration) +the measurer needs to perform the trial are understood as a composite +called the traffic profile. +All its attributes are assumed to be constant during the search, +and the composite is configured on the measurer by the manager +before the search starts. + +Traffic profile is REQUIRED by [RFC2544] to contain some specific quantities, +for example frame size. +Several more specific quantities may be RECOMMENDED. + +Depending on SUT configuration, e.g. when testing specific protocols, +additional values need to be included in the traffic profile +and in the test report. See other IETF documents. + +## Search Goal + +A search goal is one item of the list required as an argument +when the manager calls the controller. + +Each search goal is composite consisting of several attributes, +some of them are required. +Implementations are free to add their own attributes. + +Subsections list all required attributes and one recommended attribute. + +The meaning of the attributes is formally given only by their effect +on the computation of the attributes of the goal result. +The subsections do contain a short informal description, +but see other chapters for more in-depth explanations. + +### Goal Final Trial Duration + +A threshold value for trial durations. +REQUIRED attribute, MUST be positive. + +Informally, the conditional throughput for this goal will be computed +only from trial results from trials as long as this. + +### Goal Duration Sum + +A threshold value for a particular sum of trial durations. +REQUIRED attribute, MUST be positive. + +This uses the duration values returned by the measurer, +in case it returns something else than the intended durations +for the traffic part of the search. + +Informally, even at looking only at trials done at this goal's +final trial duration, MLRsearch may spend up to this time measuring +the same load value. + +### Goal Loss Ratio + +A threshold value for trial loss ratios. +REQUIRED attribute, MUST be non-negative and smaller than one. + +Informally, if a load causes too many trials with trial results larger than this, +the conditional throughput for this goal will be smaller than that load. + +### Goal Exceed Ratio + +A threshold value for particular ratio of duration sums. +REQUIRED attribute, MUST be non-negative and smaller than one. + +This uses the duration values returned by the measurer, +in case it returns something else than the intended durations +for the traffic part of the search. + +Informally, this acts as the q-value for a quantile when selecting +the forwarding rate of which trial result becomes the conditional throughput. +For example, when the goal exceed ratio is 0.5 and MLRsearch +happened to use the whole goal duration sum when determining conditional +throughput, it means the conditional throughput is the median +of trial forwarding rates. +In practice, MLRsearch may stop measuring a load before the goal duration sum +is reached, and the conditional throughput in that case frequently +is the worst trial still not exceeding the goal loss ratio. +If the goal duration sum is no larger than the goal fina trial duration, +MLRsearch performs only one trial per load (unless other goals need more) +and the goal exceed ratio has no effect on the search result. + +### Goal Width + +A value used as a threshold for telling when two trial load values +are close enough. + +RECOMMENDED attribute, positive. Implementations without this attribute +MUST give the manager other ways to control the search exit condition. + +Absolute load difference and relative load difference are two popular choices, +but implementations may choose a different way to specify width. + +Informally, this acts as a stopping condition, controling the precision +of the search. The search stops if every goal has reached its precision. + +## Search Result + +Search result is a single composite object returned from the controller +to the manager. +It is a mapping from the search goals (see section Search Goal) into goal results +(see section Goal Result). +The mapping MUST map from all the search goals present in the controller input. + +Each search goal instance is mapped to a goal result instance. +Multiple search goal instances may map to the same goal result instance. + +## Goal Result + +Goal result is a composite object consisting of several attributes. +All attributes are related to the same search goal, the one search goal instance +the Search Result is mapping into this instance of the Goal Result. + +Some of the attributes are required, some are recommended, +implementations are free to add their own. + +The subsections define attributes for regular goal result. +Implementations are free to define their own irregular goal results, +but the manager MUST report them clearly as not regular according to this section. + +A typical irregular result is when all trials at maximum offered load +have zero loss, as the relevant upper bound does not exist in that case. + +### Relevant Upper Bound + +Relevant upper bound is the intended load value that is classified +at the end of the search as the relevant upper bound (see Appendix A) +for this goal. +This is a REQUIRED attribute. + +Informally, this is the smallest intended load that failed to uphold +all the requirements of this search goal, mainly the goal loss ratio +in combination with the goal exceed ratio. + +### Relevant Lower Bound + +Relevant lower bound is the intended load value that got classified +(after all trials) as the relevant lower bound (see Appendix A) for this goal. +This is a REQUIRED attribute. + +The distance between the relevant lower bound and the relevant upper bound +MUST NOT be larger than the goal width, for a regular goal result, +if the implementation offers width as a goal attribute. + +Informally, this is the smallest intended load that managed to uphold +all the requirements of this search goal, mainly the goal loss ratio +in combination with the goal exceed ratio, while not being larger +than the relevant upper bound. + +### Conditional Throughput + +The conditional throughput (see Appendix B) as evaluated +at the relevant lower bound. +This is a RECOMMENDED attribute. + +Informally, a typical forwarding rate expected to be seen +at the relevant lower bound. But frequently just a conservative estimate thereof, +as MLRsearch implementations tend to stop gathering more data +as soon as they confirm this estimate cannot get worse within +the goal duration sum. + +# MLRsearch Explanations + +This chapter focuses on intuitions and motivations +and skips over some important details. + +Familiarity with the MLRsearch specification is not required here, +so this chapter can act as an introduction. +For example, this chapter start talking about tightest lower bounds +before it is ready to talk about the relevant lower bound from the specification. + +## MLRsearch Versions + +The MLRsearch algorithm has been developed in a code-first approach, +a Python library has been created, debugged and used in production +before first descriptions (even informal) were published. +In fact, multiple versions of the library were used in production +over past few years, and later code was usually not compatible +with earlier descriptions. + +The code in (any version of) MLRsearch library fully determines +the search process (for given configuration parameters), +leaving no space for deviations. +MLRsearch as a name for a broad class of possible algorithms +leaves plenty of space for future improvements, at the cost +of poor comparability of results of different MLRsearch implementations. + +This document aspires to prescribe a MLRsearch specification +in a way that restricts the important parts related to comparability, +while leaving other parts vague enough so implementations can improve freely. + +## Exit Condition + +[RFC2544] prescribes that after performing one trial at a specific offered load, +the next offered load should be larger or smaller, based on frame loss. + +The usual implementation uses binary search. Here a lossy trial becomes +a new upper bound, a lossless trial becomes a new lower bound. +The span of values between the tightest lower bound and the tightest upper bound +forms an interval of possible results, +and after each trial the width of that interval halves. + +Usually the binary search implementation tracks only the two tightest bounds, +simply calling them bounds, but the old values still remain valid bounds, +just not as tight as the new ones. + +After some number of trials, the tightest lower bound becomes the throughput. +[RFC2544] does not specify when (if ever) should the search stop. + +MLRsearch library introduces a concept of goal width. The search stops +when the distance between the tightest upper bound and the tightest lower bound +is smaller than a user-configured value called goal width from now on. +In other words, interval width has to be smaller than goal width +at the end of the search. + +This goal width value therefore determines the precision of the result. +As MLRsearch specification requires a particular structure of the result, +the result itself does contain enough information to determine its precision, +thus it is not required to report the goal width value. + +This allows MLRsearch implementations to use exit conditions +different from goal width. +The MLRsearch specification only REQUIRES the search procedure +to always finish in a finite time, regardless of possible trial results. + +## Load Classification + +MLRsearch keeps the basic logic of binary search (tracking tightest bounds, +measuring at the middle), perhaps with minor technical clarifications. +The algorithm chooses an intended load (as opposed to offered load), +the interval between bounds does not need to be split exactly in two equal halves, +and the final reported structure specifies both bounds +(optionally also the conditional throughput at the lower bound, defined later). + +The biggest difference is that in order to classify a load +as an upper or lower bound, MLRsearch may need more than one trial +(depending on configuration options) to be performed at the same intended load. + +As a consequence, even if a load already does have few trial results, +it still may be classified as undecided, neither a lower bound nor an upper bound. + +Explanation of the classification logic is given in the next chatper, +as it relies heavily on other sections of this chapter. + +For repeatability and comparability reasons, it is important that +given a set of trial results, all implementations of MLRsearch +classify the load in an equivalent way. + +## Loss Ratios + +Next difference is in goals of the search. [RFC2544] has a single goal, +based on classifying full-length trials as either loss-less or lossy. + +As the name suggests, MLRsearch can seach for multiple goals, differing in their +loss ratios. Precise definition of goal loss ratio will be given later. +The [RFC2544] throughput goal then simply becomes a zero goal loss ratio. +Different goals also may have different goal width. + +A set of trial results for one specific intended load value +can classify the load as an upper bound for some goals, but a lower bound +for some other goals, and undecided for the rest of the goals. + +Therefore, the load classification depends not only on ttrial results, +but also on the goal. The overall search procedure becomes more complicated +(compared to binary search with a single goal), +but most of the complications do not affect the final result, +except for one phenomenon, loss inversion. + +## Loss Inversion + +In [RFC2544] throuhput search using bisection, any load with lossy trial +becomes a hard upper bound, meaning every subsequent trial has smaller +intended load. + +But in MLRsearch, a load that is classified as an upper bound for one goal +may still be a lower bound for another goal, and due to that other goal +MLRsearch will probably perform trials at even higher loads. +What to do when all such higher load trials happen to have zero loss? +Does it mean the earlier upper bound was not real? +Does it mean the later lossless trials are not considered a lower bound? +Surely we do not want to have an upper bound at a load smaller than a lower bound. + +MLRsearch is conservative in these situations. +The upper bound is considered real, and the lossless trials at higher loads +are considered to be a coincidence, at least when computing the final result. + +This is formalized using new notions, the relevant upper bound and +the relevant lower bound. +Load classification is still based just on the set of trial results +at a given intended load (trials at other loads are ignored), +making it possible to have a lower load classified as an upper bound +and a higher load classified as a lower bound (for the same goal). +The relevant upper bound (for a goal) is the smallest load classified +as an upper bound. But the relevant lower bound is not simply +the largest among lower bounds. It is the largest load among loads +that are lower bounds while also being smaller than the relevant upper bound. + +With these definitions, the relevant lower bound is always smaller +than the relevant upper bound (if both exist), and the two relevant bounds +are used analogously as the two tightest bounds in the binary search. +When they are less than goal width apart, the relevant bounds are used in output. + +One consequence is that every trial result can have an impact on the search result. +That means if your SUT (or your traffic generator) needs a warmup, +be sure to warm it up before starting the search. + +## Exceed Ratio + +The idea of performing multiple trials at the same load comes from +a model where some trial results (those with high loss) are affected +by infrequent effects, causing poor repeatability of [RFC2544] throughput results. +See the discussion about noiseful and noiseless ends of SUT performance spectrum. +Stable results are closer to the noiseless end of SUT preformance spectrum, +so MLRsearch may need to allow some frequency of high-loss trials +to ignore the reare but big effects near the noisefull end. + +MLRsearch is able to do such trial result filtering, but it needs +a configuration option to tell it how much frequent can the infrequent big loss be. +This option is called exceed ratio. It tells MLRsearch what ratio of trials +(more exactly what ratio of trial seconds) can have trial loss ratio +larger than goal loss ratio and still be classified as a lower bound. +Zero exceed ratio means all trials have to have trial loss ratio +equal to or smaller than the goal loss ratio. + +For explainability reasons, the RECOMMENDED value for exceed ratio is 0.5, +as it simplifies some later concepts by relating them to the concept of median. + +## Duration Sum + +When more than one trial is needed to classify a load, +MLRsearch also needs something that controlls the number of trials needed. +Therefore, each goal also has an attribute called duration sum. + +The meaning of a goal duration sum is that when a load has trials +(at full trial duration, details later) +whose trial durations when summed up give a value at least this, +the load is guaranteed to be classified as an upper bound or a lower bound +for the goal. + +As the duration sum has a big impact on the overall search duration, +and [RFC2544] prescibes wait intervals around trial traffic, +the MLRsearch algorithm may sum durations that are different +from the actual trial traffic durations. + +## Short Trials + +Section 24 of [RFC2544] already anticipates possible time savings +when short trials (shorter than full length trials) are used. + +MLRsearch requires each goal to specify its final trial duration. +Full-length trial is the short name for a trial whose intended trial duration +is equal to the goal final trial duration. + +Any MLRsearch implementation may include its own configuration options +which control when and how MLRsearch chooses to use shorter trial durations. + +For explainability reasons, when exceed ratio of 0.5 is used, +it is recommended for the goal duration sum to be an odd multiple +of the full trial durations, so conditional throughput becomes identical to +a median of a particular set of forwarding rates. + +Presence of shorter trial results complicates the load classification logic. +Full details are given later. In short, results from short trials +may cause a load to be classified as an upper bound. +This may cause loss inversion, and thus lower the relevant lower bound +(below what would classification say when considering full-length trials only). + +For explainability reasons, it is RECOMMENDED users use such configurations +that guarantee all trials have the same length. +Alas, such configurations are usually not compliant with [RFC2544] requirements, +or not time-saving enough. + +## Conditional Throughput + +As testing equipment takes intended load as input parameter +for a trial measurement, any load search algorithm needs to deal +with intended load values internally. + +But in presence of goals with non-zero loss ratio, the intended load +usually does not match the user intuition of what a throughput is. +The forwarding rate (as defined in [RFC2285] section 3.6.1) is better, +but it is not obvious how to generalize it +for loads with multiple trial results, +especially with non-zero goal exceed ratio. + +MLRsearch defines one such generalization, called the conditional throughput. +It is the forwarding rate from one of the trials performed at the load +in question. Specification of which trial exactly is quite technical. +More detailed explanations are given in the next chapter. + +Conditional throughput is partially related to load classification. +If a load is classified as a lower bound for a goal, +the conditional throughpt can be calculated, +and guaranteed to show effective loss ratio no larger than goal loss ratio. + +While the conditional throughput gives more intuitive-looking values +than the relevant lower bound, especially for non-zero goal loss ratio values, +the actual definition is more complicated than the definition of the relevant +lower bound. In future, other intuitive values may become popular, +but they are unlikely to supersede the definition of the relevant lower bound +as the most fitting value for comparability purposes, +therefore the relevant lower bound remains a required attribute +of the goal result structure. + +Note that comparing best and worst case, the same relevant lower bound value +may result in the conditional throughput differing up to the goal loss ratio. +Therefore it is rarely needed to set the goal width (if expressed +as relative difference of loads) below the goal loss ratio. +In other words, setting the goal width below the goal loss ratio +may cause the conditional throughput for a larger loss ratio to became smaller +than a conditional throughput for a goal with a smaller goal loss ratio, +which is counter-intuitive, considering they come from the same search. +Therefore it is RECOMMENDED to set the goal width to a value no smaller +than the goal loss ratio. + +## Search Time + +The main motivation for MLRsearch was to have an algorithm that spends less time +finding a throughput, either the [RFC2544] compliant one, +or some generalization thereof. The art of achieving short search times +is mainly in smart selection of intended loads (and intended durations) +for the next trial to perform. + +While there is an indirect impact of the load selection on the reported values, +in practice such impact tends to be small, +even for SUTs with quite broad performance spectrum. + +A typical example of two approaches to load selection leading to different +relevant lower bounds is when the interval is split in a very uneven way. +An implementation chosing loads very close to the current relevant lower bound +are quite likely to eventually stumble upon a trial result +with poor performance (due to SUT noise). +For an implementation chosing load very close to the current relevant upper bound +this is unlikely, as it examines more loads that can see a performance +close to the noiseless end of the SUT performance spectrum. +The reason why it is unlikely to have two MLRsearch implementation showing +this kind of preference in load selection is precisely +in the desire to have short searches. +Even splits are the best way to achive the desired precision, +so the more optimized a search algorithm is for the overall search duration, +the better the repeatability and comparability +of its results will be, assuming the user configuration remains the same. + +Therefore, this document remains quite vague on load selection +and other optimisation details, and configuration attributes related to them. +Assuming users prefer libreries that achieve short overall search time, +the definition of the relevant lower bound +should be strict enough to ensure result repeatability +and comparability between different implementations, +while not restricting future implementations much. + +Sadly, different implementations may exhibit their sweet spot of +best repeatability at given search duration at different goals attribute values, +especially with respect to optional goal attributes +such as initial trial duration. +Thus, this document does not comment much on which configurations +are good for comparability between different implementations. +For comparability between different SUTs using the same implementation, +refer to configurations recommended by that particular implementation. + +## [RFC2544] compliance + +The following search goal ensures unconditional compliance with +[RFC2544] throughput search procedure: + +- Goal loss ratio: zero. + +- Goal final trial duration: 60 seconds. + +- Goal duration sum: 60 seconds. + +- Goal exceed ratio: zero. + +Presence of other search goals does not affect compliance of this goal result. +The relevant lower bound and the conditional throughput are in this case +equal to each other, and the value is the [RFC2544] throughput. + +If the 60 second quantity is replaced by a smaller quantity in both attributes, +the conditional throughput is still conditionally compliant with +[RFC2544] throughput. + +# Selected Functional Details + +This chapter continues with explanations, +but this time more precise definitions are needed +for readers to follow the explanations. +The definitions here are wordy, implementers can look into the next chapter +for more concise definitions. + +The two areas of focus in this chapter are the load classification +and the conditional throughput, starting with the latter. + +## Performance Spectrum + +There are several equivalent ways to define the conditional throughput computation. +One of the ways relies on an object called the performance spectrum. +First, two heavy definitions are needed. + +Take an intended load value, and a finite set of trial results, all trials +measured at that load value. The performance spectrum is the function that maps +any non-negative real number into a sum of trial durations among all trials +in the set that have that number as their forwarding rate, +e.g. map to zero if no trial has that particular forwarding rate. + +A related function, defined if there is at least one trial in the set, +is the performance spectrum divided by sum of durations of all trials in the set. +That function is called the performance probability function, as it satisfies +all the requirements for probability mass function function +of a discrete probability distribution, +the one-dimensional random variable being the trial forwarding rate. + +These functions are related to the SUT performance spectrum, +as sampled by the trials in the set. + +As for any other probability function, we can talk about percentiles, +of the performance probability function, and bout other quantiles +such as the median. The conditional throughput will be +one such quantile value for a specifically chosen set of trials. + +Take a set of all full-length trials performed at the load in question. +The sum of durations of those trials may be less than goal duration sum, or not. +If it is less, add an imaginary trial result with zero forwarding rate +such that the new sum of durations is equal to the goal duration sum. +This is the set of trials to use. The q-value for the quantile +is the goal exceed ratio. If the quantile touches two trials, +the larger forwarding rate is used. + +First example. For zero exceed ratio when goal duration sum +has been reached. The conditional throughput is the smallest forwarding +rate among the trials. + +Second example. For zero exceed ratio when goal duration sum +has not been reached yet. Due to the missing duration sum, +the worst case may still happen, so the conditional througput +is zero. This is not reported to the user, as this load +cannot become the relevant lower bound yet. + +Third example. Exceed ratio 50%, goal duration sum two seconds, +one trial present with duration one second and zero loss. +An imaginary trial is added with duration one second and zero forwarding rate. +Median would touch both trials, so the conditional throughput +is the forwarding rate of the one non-imaginary trial. +As that had zero loss, the value is equal to the offered load. + +The classification does not need the whole performance spectrum, +only few duration sums. + +A trial is called bad (according to a goal) if its trial loss ratio +is larger than the goal loss ratio. Trial that is not bad is called good. + +## Single Trial Duration + +When goal attributes are chosen in such a way that every trial has the same +intended duration, the load classification is sipler. + +The following description looks technical, but it follows the motivation +of goal loss ratio, goal exceed ratio and goal duration sum. +If sum of durations of all trials (at given load) is less than the goal +duration sum, imagine best case scenario (all subsequent trials having zero loss) +and worst case scenario (all subsequent trials having 100% loss). +Here we assume there is as many subsequent trials as needed +to make the sum of all trials to become equal to the goal duration sum. +As the exceed ratio is defined just using sums of durations +(number of trials does not matter), it does not matter whether +the "subsequent trials" can consist of integer number of full-length trials. + +If even in the best case scenario the load exceed ratio would be larger +than the goal exceed ratio, the load is an upper bound. +If even in the worst case scenario the load exceed ratio would not be larger +than the goal exceed ratio, the load is a lower bound. + +Even more specifically. +Take all trials measured at a given load. +Sum of durations of all bad full-length trials is called the bad sum. +Sum of durations of all good full-length trials is called the good sum. +The result of adding bad sum plus the good sum is called the measured sum. +The larger of the measured sum and the goal duration sum is called the whole sum. +The whole sum minus the measured sum is called the missing sum. +Optimistic exceed ratio is the bad sum divided by the whole sum. +Pessimistic exceed ratio is the bad sum plus the missing sum, that divided by +the whole sum. +If optimistic exceed ratio is larger than the goal exceed ratio, +the load is classified as an upper bound. +If pessimistic exceed ratio is not larger than the goal exceed ratio, +the load is classified as a lower bound. +Else, the load is classified as undecided. + +The definition of pessimistic exceed ratio matches the logic in +the conditional throughput computation, so a load is a lower bound +if and only if the conditional throughput effective loss ratio +is not larger than the goal loss ratio. +If it is larger, the load is either an upper bound or undecided. + +## Short Trial Scenarios + +Trials with intended duration smaller than the goal final trial duration +are called short trials. +The motivation for load classification logic in presence of short trials +is based around a counter-factual case: What would the trial result be +if a short trial has been measured as a full-length trial instead? + +There are three main scenarios where human intuition guides +the intended behavior of load classification. + +Scenario one. The user had their reason for not configuring shorter goal +final trial duration. Perhaps SUT has buffers that may get full at longer +trial durations. Perhaps SUT shows periodic decreases of performance +the user does not want to treat as noise. In any case, many good short trials +may became bad full-length trial in the counter-factual case. +In extreme case, there are no bad short trials. +In this scenario, we want the load classification NOT to classify the load +as a lower bound, despite the abundance of good short trials. +Effectively, we want the good short trials to be ignored, so they +do not contribute to comparisons with the goal duration sum. + +Scenario two. When there is a frame loss in a short trial, +the counter-factual full-length trial is expected to lose at least as many +frames. And in practice, bad short trials are rarely turning into +good full-length trials. In extreme case, there are no good short trials. +In this scenario, we want the load classification +to classify the load as an upper bound just based on abundance +of short bad trials. Effectively we want the bad short trials +to contribute to comparisons with the goal duration sum, +so the load can be classified sooner. + +Scenario three. Some SUTs are quite indifferent to trial duration. +Performance probability function constructed from short trial results +is likely to be similar to performance probability function constructed +from full-length trial results (perhaps with smaller dispersion, +but overall without big impact on the median quantiles). +For moderate goal exceed ratio values, this may mean there are both +good short trials and bad short trials. +This scenario is there just to invalidate a simple heuristic +of always ignoring good short trials and never ignoring bad short trials. +That simple heuristic would be too biased. Yes, the short bad trials +are likely to turn into full-length bad trials in the counter-factual case, +but there is no information on what would the good short trials turn into. +The only way to decide is to do more trials at full length, +the same as in scenario one. + +## Short Trial Logic + +MLRsearch picks a particular logic for load classification +in presence of short trials, but it is still RECOMMENDED to use configurations +that imply no short trials, so the possible inefficiencies in that logic +do not affect the result, and the result has better explainability. + +With thas said, the logic differs from the single trial duration case +only in different definition of bad sum. +Good sum is still the sum across all good full-length trials. + +Few more notions are needed for definig the new bad sum. +Sum of durations of all bad full-length trials is called the bad long sum. +Sum of durations of all bad short trials is called the bad short sum. +Sum of durations of all good short trials is called the good short sum. +One minus the goal exceed ratio is called the inceed ratio. +The goal exceed ratio divided by the inceed ratio is called the exceed coefficient. +The good short sum multiplied by the exceed coefficient is called the balancing sum. +The bad short sum minus the balancing sum is called the excess sum. +If the excess sum is negative, the bad sum is equal to the bad long sum. +Else, the bad sum is equal to the bad long sum plus the excess sum. + +Here is how the new definition of the bad sum fares in the three scenarios, +where the load is close to what would relevant bounds be +if only full-length trials were used for the search. + +Scenario one. If duration is too short, we expect to see higher frequency +of good short trials. This could lead to negative excess sum, +which has no impact, hence the load classification is given just by +full-length trials. +Thus, MLRsearch using too short trials has no detrimental effect +on result comparability in this scenario. +But also using short trials does not help with overall search duration, +proably making it worse. + +Scenario two. Settings with small exceed ratio have small exceed coefficient, +so the impact of good short sum is small and the bad short sum +is almost wholly converted into excess sum, thus bad short trials +have almost as big impact as full-length bad trials. +The same conclusion applies for moderate exceed ratio values +when the good short sum is small. +Thus, short trials can cause a load to get classified as an upper bound earlier +bringing time savings (while not affecting comparability). + +Scenario three. Here excess sum is small in absolute value, as balancing sum +is expected to be be similar to the bad short sum. +Once again, full-length trials are needed for final load classification, +but usage of short trials probably means MLRsearch needed shorter search time +before selecting this load for measurement, bringing time savings +(while not affecting comparability). + +## Longer Trial Durations + +If there are trial results with intended duration larger +than the goal trial duration, the classification logic is intentionally undefined. + +The implementations MAY treat such longer trials as if they were full-length. +In any case, presence of such longer trials in either the relevant lower bound +or the relevant upper bound SHOULD be mentioned, as for sume SUTs +it is likely to affect comparability. + + +TODO: Here will be a chapter summarizing how MLRsearch library +adresses the problems from the Identified Problems chapter. + +{::comment} + + # Problems after MLRsearch + + Now when MLRsearch is clearly specified and explained, + it is possible to summarize how does MLRsearch specification help with problems. + + Here, "multiple trials" is a shorthand for having the goal final trial duration + significantly smaller than the goal duration sum. + This results in MLRsearch performing multiple trials at the same load, + which may not be the case with other configurations. + + ## Long Test Duration + + As shortening the overall search duration is the main motivation + of MLRsearch library development, the library implements + multiple improvements on this front, both big and small. + Most of implementation details are not part of the MLRsearch specification, + so that future implementations may keep shortening the search duration even more. + + TODO: The rest is about attributes. + + From the required goal attributes, the goal duration sum + remains the best way to get even shorter searches. + + Usage of multiple trials can also save time, + depending on wait times around trial traffic. + + The farther the goal exceed ratio is from 0.5 towards either side, + the less predictable the overal search duration becomes in practice. + + Width parameter does not change search duration much in practice + (compared to other, mainly optional goal attributes). + + ## DUT in SUT + + Practice shows big improvements when multiple trials + and moderate exceed ratios are used. Mainly when it comes to result + repeatability, as sometimes it is not easy to distinguish + SUT noise from DUT instability. + + Conditional throughput has intuitive meaning when described + using the performance spectrum, so this is an improvement, + especially when compared to search procedures which use non-zero + goal loss ratio but return only the intended load at a lower bound. + + Multiple trials can save time also when the noisy end of + the preformance spectrum needs to be examined, e.g. for [RFC9004]. + + Under some circumstances, testing the same DUT and SUT setup with different + DUT configurations can give some hints on what part of noise us SUT noise + and what part is DUT performance fluctuations. + In practice, both types of noise tend to be too complicated for that analysis. + MLRsearch does not offer additional tools in this regard, + apart of giving users ability to search for more goals, + hoping to get more insight at the cost of longer overall search time. + + ## Repeatability and Comparability + + Multiple trials improve repeatability, depending on exceed ratio. + + In practice, 1s goal final trial duration with exceed ratio 0.5 + is good enough for modern SUTs (but that usually requires + smaller wait times around the traffic part of the trial, + otherwise too much search time is wasted waiting). + + It is not clear whether exceed ratios higher than 0.5 are better + for repeatability. + The 0.5 value is still preferred due to explainability using median. + TODO: Stress single value is for comparability, which one is due explainability. + + It is possible that the conditional throughput values (with non-zero + goal loss ratio) are better for repeatability than the relevant + lower bound values, especially for implementations + which pick load from a small set of discrete values. + + Implementations focusing on shortening the overall search time + are automatically forced to avoid comparability issues + due to load selection, as they must prefer even splits wherever possible. + But this conclusion only holds when the same goals are used. + Larger adoption is needed before any further claims on comparability + between MLRsearch implementations can be made. + + ## Throughput with Non-Zero Loss + + Suported by the goal loss ratio attribute. + Improves repeatability as expected. + + ## Inconsistent Trial Results + + MLRsearch is conservative wherever possible, + this is built into the definition of conditional throughput, + and into the treatment of short trial results for load classification. + + This is consistent with [RFC2544] zero loss tolerance motivation. + + If the very best (noiseless) part of the SUT performance spectrum + is of interest, it should be enough to set small value for + the goal final trial duration, and perhaps also a large value + for the goal exceed ratio. + + Implementations may offer other (optional) configuration attributes + (and optional goal result attributes) + to become less conservative, but currently it is not clear + what impact would that have on repeatability. + +{:/comment} + +# IANA Considerations + +No requests of IANA. + +# Security Considerations + +Benchmarking activities as described in this memo are limited to +technology characterization of a DUT/SUT using controlled stimuli in a +laboratory environment, with dedicated address space and the constraints +specified in the sections above. + +The benchmarking network topology will be an independent test setup and +MUST NOT be connected to devices that may forward the test traffic into +a production network or misroute traffic to the test management network. + +Further, benchmarking is performed on a "black-box" basis, relying +solely on measurements observable external to the DUT/SUT. + +Special capabilities SHOULD NOT exist in the DUT/SUT specifically for +benchmarking purposes. Any implications for network security arising +from the DUT/SUT SHOULD be identical in the lab and in production +networks. + +# Acknowledgements + +Many thanks to Alec Hothan of OPNFV NFVbench project for thorough +review and numerous useful comments and suggestions. + +Special wholehearted gratitude and thanks to late Al Morton for his +thorough reviews filled with very specific feedback and constructive +guidelines. Thank you Al for the close collaboration over the years, +for your continuous unwavering encouragements full of empathy and +positive attitude. +Al, you are dearly missed. + +# Appendix A + +This is a specification of load classification. + +The block at the end of this appendix holds pseudocode +which computes two values, stored in variables named optimistic and pessimistic. +The pseudocode happens to be a valid Python code. + +If both values are computed to be true, the load in question +is classified as a lower bound according to the goal in question. +If both values are false, the load is classified as an upper bound. +Otherwise, the load is classifies as undecided. + +The pseudocole expects the following variables hold values as follows: + +- goal_duration_sum: The goal duration sum value. + +- goal_exceed_ratio: The goal exceed ratio value. + +- good_long_sum: Sum of durations across trials with trial duration + at least equal to the goal final trial duration and with trial loss ratio + not higher than the goal loss ratio. + +- bad_long_sum: Sum of durations across trials with trial duration + at least equal to the goal final trial duration and with trial loss ratio + higher than the goal loss ratio. + +- good_short_sum: Sum of durations across trials with trial duration + shorter than the goal final trial duration and with trial loss ratio + not higher than the goal loss ratio. + +- bad_short_sum: Sum of durations across trials with trial duration + shorter than the goal final trial duration and with trial loss ratio + higher than the goal loss ratio. + +Here the implicit set of available trial results consists of all trials +measured at given intended load at the end of search. + +The code works correctly also when there are no trial results at given load. + +``` +balancing_sum = good_short_sum * goal_exceed_ratio / (1.0 - goal_exceed_ratio) +effective_bad_sum = bad_long_sum + max(0.0, bad_short_sum - balancing_sum) +effective_whole_sum = max(good_long_sum + effective_bad_sum, goal_duration_sum) +quantile_duration_sum = effective_whole_sum * goal_exceed_ratio +optimistic = effective_bad_sum <= quantile_duration_sum +pessimistic = (effective_whole_sum - good_long_sum) <= quantile_duration_sum +``` + +# Appendix B + +This is a specification of conditional throughput. + +The block at the end of this appendix holds pseudocode +which computes a value stored as variable conditional_throughput. +The pseudocode happens to be a valid Python code. + +The pseudocole expects the following variables hold values as follows: + +- goal_duration_sum: The goal duration sum value. + +- goal_exceed_ratio: The goal exceed ratio value. + +- good_long_sum: Sum of durations across trials with trial duration + at least equal to the goal final trial duration and with trial loss ratio + not higher than the goal loss ratio. + +- bad_long_sum: Sum of durations across trials with trial duration + at least equal to the goal final trial duration and with trial loss ratio + higher than the goal loss ratio. + +- long_trials: An iterable of all trial results from trials with trial duration + at least equal to the goal final trial duration, + sorted by increasing trial loss ratio. + A trial result is a composite with the following two attributes available: + + - trial.loss_ratio: The trial loss ratio as measured for this trial. + + - trial.duration: The trial duration of this trial. + +Here the implicit set of available trial results consists of all trials +measured at given intended load at the end of search. + +The code works correctly only when there if there is at leas one +trial result measured at given load. + +``` +all_long_sum = max(goal_duration_sum, good_long_sum + bad_long_sum) +remaining = all_long_sum * (1.0 - goal_exceed_ratio) +quantile_loss_ratio = None +for trial in long_trials: + if quantile_loss_ratio is None or remaining > 0.0: + quantile_loss_ratio = trial.loss_ratio + remaining -= trial.duration + else: + break +else: + if remaining > 0.0: + quantile_loss_ratio = 1.0 + +conditional_throughput = intended_load * (1.0 - quantile_loss_ratio) +``` + +--- back diff --git a/docs/ietf/process.txt b/docs/ietf/process.txt index 78f7d7f34f..f83e6bbc99 100644 --- a/docs/ietf/process.txt +++ b/docs/ietf/process.txt @@ -20,7 +20,7 @@ $ sudo gem install kramdown-rfc2629 $ kdrfc --version Main: -$ kdrfc draft-ietf-bmwg-mlrsearch-04.md +$ kdrfc draft-ietf-bmwg-mlrsearch-05.md If that complains, do it manually at https://author-tools.ietf.org/ -- cgit 1.2.3-korg