Diffstat (limited to 'docs/cpta/methodology')
-rw-r--r-- | docs/cpta/methodology/index.rst                      |   6
-rw-r--r-- | docs/cpta/methodology/jenkins_jobs.rst               |  62
-rw-r--r-- | docs/cpta/methodology/overview.rst                   |  10
-rw-r--r-- | docs/cpta/methodology/performance_tests.rst          |  36
-rw-r--r-- | docs/cpta/methodology/perpatch_performance_tests.rst | 242
-rw-r--r-- | docs/cpta/methodology/testbed_hw_configuration.rst   |   5
-rw-r--r-- | docs/cpta/methodology/trend_analysis.rst             |  78
-rw-r--r-- | docs/cpta/methodology/trend_presentation.rst         |  41
8 files changed, 30 insertions, 450 deletions
diff --git a/docs/cpta/methodology/index.rst b/docs/cpta/methodology/index.rst
index cbcfcb50cb..9105ec46b4 100644
--- a/docs/cpta/methodology/index.rst
+++ b/docs/cpta/methodology/index.rst
@@ -1,14 +1,8 @@
-.. _trending_methodology:
-
 Trending Methodology
 ====================
 
 .. toctree::
 
     overview
-    performance_tests
     trend_analysis
     trend_presentation
-    jenkins_jobs
-    testbed_hw_configuration
-    perpatch_performance_tests
diff --git a/docs/cpta/methodology/jenkins_jobs.rst b/docs/cpta/methodology/jenkins_jobs.rst
deleted file mode 100644
index a58d616ff9..0000000000
--- a/docs/cpta/methodology/jenkins_jobs.rst
+++ /dev/null
@@ -1,62 +0,0 @@
-Jenkins Jobs
-------------
-
-Performance Trending (PT)
-`````````````````````````
-
-CSIT PT runs regular performance test jobs measuring and collecting MRR
-data per test case. PT is designed as follows:
-
-1. PT job triggers:
-
-   a) Periodic e.g. twice a day.
-   b) On-demand gerrit triggered.
-
-2. Measurements and data calculations per test case:
-
-   a) Max Received Rate (MRR) - for each trial measurement,
-      send packets at link rate for trial duration,
-      count total received packets, divide by trial duration.
-
-3. Archive MRR values per test case.
-4. Archive all counters collected at MRR.
-
-Performance Analysis (PA)
-`````````````````````````
-
-CSIT PA runs performance analysis
-including anomaly detection as described above.
-PA is defined as follows:
-
-1. PA job triggers:
-
-   a) By PT jobs at their completion.
-   b) On-demand gerrit triggered.
-
-2. Download and parse archived historical data and the new data:
-
-   a) Download RF output.xml files from latest PT job and compressed
-      archived data from nexus.
-   b) Parse out the data filtering test cases listed in PA specification
-      (part of CSIT PAL specification file).
-
-3. Re-calculate new groups and their averages.
-
-4. Evaluate new test data:
-
-   a) If the existing group is prolonged => Result = Pass,
-      Reason = Normal.
-   b) If a new group is detected with lower average =>
-      Result = Fail, Reason = Regression.
-   c) If a new group is detected with higher average =>
-      Result = Pass, Reason = Progression.
-
-5. Generate and publish results
-
-   a) Relay evaluation result to job result.
-   b) Generate a new set of trend summary dashboard, list of failed
-      tests and graphs.
-   c) Publish trend dashboard and graphs in html format on
-      `S3 Docs <https://s3-docs.fd.io/>`_.
-   d) Generate an alerting email. This email is sent by Jenkins to
-      `CSIT Report distribution list <csit-report@lists.fd.io>`_.
diff --git a/docs/cpta/methodology/overview.rst b/docs/cpta/methodology/overview.rst
index ecea051116..d2ffc04407 100644
--- a/docs/cpta/methodology/overview.rst
+++ b/docs/cpta/methodology/overview.rst
@@ -1,14 +1,6 @@
 Overview
---------
+^^^^^^^^
 
 This document describes a high-level design of a system for continuous
 performance measuring, trending and change detection for FD.io VPP SW
 data plane (and other performance tests run within CSIT sub-project).
-
-There is a Performance Trending (PT) CSIT module, and a separate
-Performance Analysis (PA) module ingesting results from PT and
-analysing, detecting and reporting any performance anomalies using
-historical data and statistical metrics. PA does also produce
-trending dashboard, list of failed tests and graphs with summary and
-drill-down views across all specified tests that can be reviewed and
-inspected regularly by FD.io developers and users community.
diff --git a/docs/cpta/methodology/performance_tests.rst b/docs/cpta/methodology/performance_tests.rst
deleted file mode 100644
index 82e64f870a..0000000000
--- a/docs/cpta/methodology/performance_tests.rst
+++ /dev/null
@@ -1,36 +0,0 @@
-Performance Tests
------------------
-
-Performance trending relies on Maximum Receive Rate (MRR) tests.
-MRR tests measure the packet forwarding rate, in multiple trials of set
-duration, under the maximum load offered by traffic generator
-regardless of packet loss. Maximum load for specified Ethernet frame
-size is set to the bi-directional link rate.
-
-Current parameters for performance trending MRR tests:
-
-- **Ethernet frame sizes**: 64B (78B for IPv6 tests) for all tests, IMIX for
-  selected tests (vhost, memif); all quoted sizes include frame CRC, but
-  exclude per frame transmission overhead of 20B (preamble, inter frame
-  gap).
-- **Maximum load offered**: 10GE and 40GE link (sub-)rates depending on NIC
-  tested, with the actual packet rate depending on frame size,
-  transmission overhead and traffic generator NIC forwarding capacity.
-
-  - For 10GE NICs the maximum packet rate load is 2* 14.88 Mpps for 64B,
-    a 10GE bi-directional link rate.
-  - For 40GE NICs the maximum packet rate load is 2* 18.75 Mpps for 64B,
-    a 40GE bi-directional link sub-rate limited by the packet forwarding
-    capacity of 2-port 40GE NIC model (XL710) used on T-Rex Traffic
-    Generator.
-
-- **Trial duration**: 1 sec.
-- **Number of trials per test**: 10.
-- **Test execution frequency**: twice a day, every 12 hrs (02:00,
-  14:00 UTC).
-
-Note: MRR tests should be reporting bi-directional link rate (or NIC
-rate, if lower) if tested VPP configuration can handle the packet rate
-higher than bi-directional link rate, e.g. large packet tests and/or
-multi-core tests. In other words MRR = min(VPP rate, bi-dir link rate,
-NIC rate).
diff --git a/docs/cpta/methodology/perpatch_performance_tests.rst b/docs/cpta/methodology/perpatch_performance_tests.rst
deleted file mode 100644
index a72926f2c6..0000000000
--- a/docs/cpta/methodology/perpatch_performance_tests.rst
+++ /dev/null
@@ -1,242 +0,0 @@
-Per-patch performance tests
----------------------------
-
-Updated for CSIT git commit id: 72b45cfe662107c8e1bb549df71ba51352a898ee.
-
-A methodology similar to trending analysis is used for comparing performance
-before a DUT code change is merged. This can act as a verify job to disallow
-changes which would decrease performance without a good reason.
-
-Existing jobs
-`````````````
-
-VPP is the only project currently using such jobs.
-They are not started automatically, must be triggered on demand.
-They allow full tag expressions, but some tags are enforced (such as MRR).
-
-There are jobs available for multiple types of testbeds,
-based on various processors.
-Their Gerrit triggers words are of the form "perftest-{node_arch}"
-where the node_arch combinations currently supported are:
-2n-clx, 2n-dnv, 2n-skx, 2n-tx2, 2n-zn2, 3n-dnv, 3n-skx, 3n-tsh.
-
-Test selection
---------------
-
-..
-    TODO: Majority of this section is also useful for CSIT verify jobs. Move it somewhere.
-
-Gerrit trigger line without any additional arguments selects
-a small set of test cases to run.
-If additional arguments are added to the Gerrit trigger, they are treated
-as Robot tag expressions to select tests to run.
-While very flexible, this method of test selection also allows the user
-to accidentally select too high number of tests, blocking the testbed for days.
-
-What follows is a list of explanations and recommendations
-to help users to select the minimal set of tests cases.
-
-Verify cycles
-_____________
-
-When Gerrit schedules multiple jobs to run for the same patch set,
-it waits until all runs are complete.
-While it is waiting, it is possible to trigger more jobs
-(adding runs to the set Gerrit is waiting for), but it is not possible
-to trigger more runs for the same job, until Gerrit is done waiting.
-After Gerrit is done waiting, it becames possible to trigger
-the same job again.
-
-Example. User triggers one set of tests on 2n-skx and immediately
-also triggers other set of tests on 3n-skx. Then the user notices
-2n-skx run end early because of a typo in tag expression.
-When the user tries to re-trigger 2n-skx (with fixed tag expression),
-that comment gets ignored by Jenkins.
-Only when 3n-skx job finishes, the user can trigger 2n-skx.
-
-One comment many jobs
-_____________________
-
-In the past, the CSIT code which parses for perftest trigger comments
-was buggy, which lead to bad behavior (as in selection all performance test,
-because "perftest" is also a robot tag) when user included multiple
-perftest trigger words in the same comment.
-
-The worst bugs were fixed since then, but it is still recommended
-to use just one trigger word per Gerrit comment, just to be safe.
-
-Multiple test cases in run
-__________________________
-
-While Robot supports OR operator, it does not support parentheses,
-so the OR operator is not very useful. It is recommended
-to use space instead of OR operator.
-
-Example template:
-perftest-2n-skx {tag_expression_1} {tag_expression_2}
-
-See below for more concrete examples.
-
-Suite tags
-__________
-
-Traditionally, CSIT maintains broad Robot tags that can be used to select tests,
-for details on existing tags, see
-`CSIT Tags <https://github.com/FDio/csit/blob/master/docs/tag_documentation.rst>`_.
-
-But it is not recommended to use them for test selection,
-as it is not that easy to determine how many test cases are selected.
-
-The recommended way is to look into CSIT repository first,
-and locate a specific suite the user is interested in,
-and use its suite tag. For example, "ethip4-ip4base" is a suite tag
-selecting just one suite in CSIT git repository,
-avoiding all scale, container, and other simialr variants.
-
-Note that CSIT uses "autogen" code generator,
-so the robot running in Jenkins has access to more suites
-than visible just by looking into CSIT git repository,
-so suite tag is not enough to select even the intended suite,
-and user still probably wants to narrow down
-to a single test case within a suite.
-
-Fully specified tag expressions
-_______________________________
-
-Here is one template to select a single test case:
-{test_type}AND{nic_model}AND{nic_driver}AND{cores}AND{frame_size}AND{suite_tag}
-where the variables are all lower case (so AND operator stands out).
-
-Currently only one test type is supported by the performance comparison jobs:
-"mrr".
-The nic_driver options depend on nic_model. For Intel cards "drv_avf" (AVF plugin)
-and "drv_vfio_pci" (DPDK plugin) are popular, for Mellanox "drv_rdma_core".
-Currently, the performance using "drv_af_xdp" is not reliable enough, so do not use it
-unless you are specifically testing for AF_XDP.
-
-The most popular nic_model is "nic_intel-xxv710", but that is not available
-on all testbed types.
-It is safe to use "1c" for cores (unless you are suspection multi-core performance
-is affected differently) and "64b" for frame size ("78b" for ip6
-and more for dot1q and other encapsulated traffic;
-"1518b" is popular for ipsec and other payload-bound tests).
-
-As there are more test cases than CSIT can periodically test,
-it is possible to encounter an old test case that currently fails.
-To avoid that, you can look at "job spec" files we use for periodic testing,
-for example `this one <https://github.com/FDio/csit/blob/master/docs/job_specs/report_iterative/2n-skx/vpp-mrr-00.md>`_.
-
-..
-    TODO: Explain why "periodic" job spec link lands at report_iterative.
-
-Shortening triggers
-___________________
-
-Advanced users may use the following tricks to avoid writing long trigger comments.
-
-Robot supports glob matching, which can be used to select multiple suite tags at once.
-
-Not specifying one of 6 parts of the recommended expression pattern
-will select all available options. For example not specifying nic_driver
-for nic_intel-xxv710 will select all 3 applicable drivers.
-You can use NOT operator to reject some options (e.g. NOTdrv_af_xdp),
-but beware, with NOT the order matters:
-tag1ANDtag2NOTtag3 is not the same as tag1NOTtag3ANDtag2,
-the latter is evaluated as tag1AND(NOT(tag3ANDtag2)).
-
-Beware when not specifying nic_model. As a precaution,
-CSIT code will insert the defailt NIC model for the tetsbed used.
-Example: Specifying drv_rdma_core without specifying nic_model
-will fail, as the default nic_model is nic_intel-xxv710
-which does not support RDMA core driver.
-
-Complete example
-________________
-
-A user wants to test a VPP change which may affect load balance whith bonding.
-Searching tag documentation for "bonding" finds LBOND tag and its variants.
-Searching CSIT git repository (directory tests/) finds 8 suite files,
-all suited only for 3-node testbeds.
-All suites are using vhost, but differ by the forwarding app inside VM
-(DPDK or VPP), by the forwarding mode of VPP acting as host level vswitch
-(MAC learning or cross connect), and by the number of DUT1-DUT2 links
-available (1 or 2).
-
-As not all NICs and testbeds offer enogh ports for 2 parallel DUT-DUT links,
-the user looks at `testbed specifications <https://github.com/FDio/csit/tree/master/topologies/available>`_
-and finds that only x710 NIC on 3n-skx testbed matches the requirements.
-Quick look into the suites confirm the smallest frame size is 64 bytes
-(despite DOT1Q robot tag, as the encapsulation does not happen on TG-DUT links).
-It is ok to use just 1 physical core, as 3n-skx has hyperthreading enabled,
-so VPP vswitch will use 2 worker threads.
-
-The user decides the vswitch forwarding mode is not important
-(so choses cross connect as that has less CPU overhead),
-but wants to test both NIC drivers (not AF_XDP), both apps in VM,
-and both 1 and 2 parallel links.
-
-After shortening, this is the trigger comment fianlly used:
-perftest-3n-skx mrrANDnic_intel-x710AND1cAND64bAND?lbvpplacp-dot1q-l2xcbase-eth-2vhostvr1024-1vm*NOTdrv_af_xdp
-
-Basic operation
-```````````````
-
-The job builds VPP .deb packages for both the patch under test
-(called "current") and its parent patch (called "parent").
-
-For each test (from a set defined by tag expression),
-both builds are subjected to several trial measurements (BMRR).
-Measured samples are grouped to "parent" sequence,
-followed by "current" sequence. The same Minimal Description Length
-algorithm as in trending is used to decide whether it is one big group,
-or two smaller gropus. If it is one group, a "normal" result
-is declared for the test. If it is two groups, and current average
-is less then parent average, the test is declared a regression.
-If it is two groups and current average is larger or equal,
-the test is declared a progression.
-
-The whole job fails (giving -1) if some trial measurement failed,
-or if any test was declared a regression.
-
-Temporary specifics
-```````````````````
-
-The Minimal Description Length analysis is performed by
-CSIT code equivalent to jumpavg-0.1.3 library available on PyPI.
-
-In hopes of strengthening of signal (code performance) compared to noise
-(all other factors influencing the measured values), several workarounds
-are applied.
-
-In contrast to trending, trial duration is set to 10 seconds,
-and only 5 samples are measured for each build.
-Both parameters are set in ci-management.
-
-This decreases sensitivity to regressions, but also decreases
-probability of false positives.
-
-Console output
-``````````````
-
-The following information as visible towards the end of Jenkins console output,
-repeated for each analyzed test.
-
-The original 5 values are visible in order they were measured.
-The 5 values after processing are also visible in output,
-this time sorted by value (so people can see minimum and maximum).
-
-The next output is difference of averages. It is the current average
-minus the parent average, expressed as percentage of the parent average.
-
-The next three outputs contain the jumpavg representation
-of the two groups and a combined group.
-Here, "bits" is the description length; for "current" sequence
-it includes effect from "parent" average value
-(jumpavg-0.1.3 penalizes sequences with too close averages).
-
-Next, a sentence describing which grouping description is shorter,
-and by how much bits.
-Finally, the test result classification is visible.
-
-The algorithm does not track test case names,
-so test cases are indexed (from 0).
diff --git a/docs/cpta/methodology/testbed_hw_configuration.rst b/docs/cpta/methodology/testbed_hw_configuration.rst
deleted file mode 100644
index 7f6556c968..0000000000
--- a/docs/cpta/methodology/testbed_hw_configuration.rst
+++ /dev/null
@@ -1,5 +0,0 @@
-Testbed HW configuration
-------------------------
-
-The testbed HW configuration is described on
-`CSIT/Testbeds: Xeon Hsw, VIRL FD.IO wiki page <https://wiki.fd.io/view/CSIT/Testbeds:_Xeon_Hsw,_VIRL.>`_.
diff --git a/docs/cpta/methodology/trend_analysis.rst b/docs/cpta/methodology/trend_analysis.rst
index 5a48136c9b..2bb54997b0 100644
--- a/docs/cpta/methodology/trend_analysis.rst
+++ b/docs/cpta/methodology/trend_analysis.rst
@@ -11,65 +11,12 @@ is called a trend for the group.
 All the analysis is based on finding the right partition into groups
 and comparing their trends.
 
-Trend Compliance
-~~~~~~~~~~~~~~~~
-
-.. _Trend_Compliance:
-
-Trend compliance metrics are targeted to provide an indication of trend
-changes, and hint at their reliability (see Common Patterns below).
-
-There is a difference between compliance metric names used in this document,
-and column names used in :ref:`Dashboard` tables and Alerting emails.
-In cases of low user confusion risk, column names are shortened,
-e.g. Trend instead of Last Trend.
-In cases of high user confusion risk, column names are prolonged,
-e.g. Long-Term Change instead of Trend Change.
-(This document refers to a generic "trend",
-so the compliance metric name is prolonged to Last Trend to avoid confusion.)
-
-The definition of Reference for Trend Change is perhaps surprising.
-It was chosen to allow both positive difference on progression
-(if within last week), but also negative difference on progression
-(if performance was even better somewhere between 3 months and 1 week ago).
-
-In the table below, "trend at time <t>", shorthand "trend[t]"
-means "the group average of the group the sample at time <t> belongs to".
-Here, time is usually given as "last" or last with an offset,
-e.g. "last - 1week".
-Also, "runs[t]" is a shorthand for "number of samples in the group
-the sample at time <t> belongs to".
-
-The definitions of compliance metrics:
-
-+-------------------+-------------------+---------------------------------+-------------+-----------------------------------------------+
-| Compliance Metric | Legend Short Name | Formula                         | Value       | Reference                                     |
-+===================+===================+=================================+=============+===============================================+
-| Last Trend        | Trend             | trend[last]                     |             |                                               |
-+-------------------+-------------------+---------------------------------+-------------+-----------------------------------------------+
-| Number of runs    | Runs              | runs[last]                      |             |                                               |
-+-------------------+-------------------+---------------------------------+-------------+-----------------------------------------------+
-| Trend Change      | Long-Term Change  | (Value - Reference) / Reference | trend[last] | max(trend[last - 3mths]..trend[last - 1week]) |
-+-------------------+-------------------+---------------------------------+-------------+-----------------------------------------------+
-
-Caveats
--------
-
-Obviously, if the result history is too short, the true Trend[t] value
-may not by available. We use the earliest Trend available instead.
-
-The current implementation does not track time of the samples,
-it counts runs instead.
-For "- 1week" we use "10 runs ago, 5 runs for topo-arch with 1 TB",
-for "- 3mths" we use "180 days or 180 runs ago, whatever comes first".
-
 Anomalies in graphs
 ~~~~~~~~~~~~~~~~~~~
 
-In graphs, the start of the following group is marked
-as a regression (red circle) or progression (green circle),
-if the new trend is lower (or higher respectively)
-then the previous group's.
+In graphs, the start of the following group is marked as a regression (red
+circle) or progression (green circle), if the new trend is lower (or higher
+respectively) then the previous group's.
 
 Implementation details
 ~~~~~~~~~~~~~~~~~~~~~~
@@ -77,18 +24,17 @@ Implementation details
 Partitioning into groups
 ------------------------
 
-While sometimes the samples within a group are far from being
-distributed normally, currently we do not have a better tractable model.
+While sometimes the samples within a group are far from being distributed
+normally, currently we do not have a better tractable model.
 
-Here, "sample" should be the result of single trial measurement,
-with group boundaries set only at test run granularity.
-But in order to avoid detecting causes unrelated to VPP performance,
-the current presentation takes average of all trials
-within the run as the sample.
-Effectively, this acts as a single trial with aggregate duration.
+Here, "sample" should be the result of single trial measurement, with group
+boundaries set only at test run granularity. But in order to avoid detecting
+causes unrelated to VPP performance, the current presentation takes average of
+all trials within the run as the sample. Effectively, this acts as a single
+trial with aggregate duration.
 
-Performance graphs show the run average as a dot
-(not all individual trial results).
+Performance graphs show the run average as a dot (not all individual trial
+results).
 
 The group boundaries are selected based on `Minimum Description Length`_.
 
diff --git a/docs/cpta/methodology/trend_presentation.rst b/docs/cpta/methodology/trend_presentation.rst
index e9918020c5..67d0d3c45a 100644
--- a/docs/cpta/methodology/trend_presentation.rst
+++ b/docs/cpta/methodology/trend_presentation.rst
@@ -1,35 +1,28 @@
 Trend Presentation
-------------------
-
-Performance Dashboard
-`````````````````````
-
-Dashboard tables list a summary of per test-case VPP MRR performance
-trend and trend compliance metrics and detected number of anomalies.
-
-Separate tables are generated for each testbed and each tested number of
-physical cores for VPP workers (1c, 2c, 4c). Test case names are linked to
-respective trending graphs for ease of navigation through the test data.
+^^^^^^^^^^^^^^^^^^
 
 Failed tests
-````````````
+~~~~~~~~~~~~
+
+The Failed tests tables list the tests which failed during the last test run.
+Separate tables are generated for each testbed.
 
-The Failed tests tables list the tests which failed over the specified seven-
-day period together with the number of fails over the period and last failure
-details - Time, VPP-Build-Id and CSIT-Job-Build-Id.
+Regressions and progressions
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Separate tables are generated for each testbed. Test case names are linked to
-respective trending graphs for ease of navigation through the test data.
+These tables list tests which encountered a regression or progression during the
+specified time period, which is currently set to the last 21 days.
 
 Trendline Graphs
-````````````````
+~~~~~~~~~~~~~~~~
 
-Trendline graphs show measured per run averages of MRR values,
-group average values, and detected anomalies.
+Trendline graphs show measured per run averages of MRR values, NDR or PDR
+values, group average values, and detected anomalies.
 The graphs are constructed as follows:
 
 - X-axis represents the date in the format MMDD.
-- Y-axis represents run-average MRR value in Mpps.
+- Y-axis represents run-average MRR value, NDR or PDR values in Mpps. For PDR
+  tests also a graph with average latency at 50% PDR [us] is generated.
 - Markers to indicate anomaly classification:
 
   - Regression - red circle.
@@ -37,6 +30,6 @@ The graphs are constructed as follows:
 
 - The line shows average MRR value of each group.
 
-In addition the graphs show dynamic labels while hovering over graph
-data points, presenting the CSIT build date, measured MRR value, VPP
-reference, trend job build ID and the LF testbed ID.
+In addition the graphs show dynamic labels while hovering over graph data
+points, presenting the CSIT build date, measured value, VPP reference, trend job
+build ID and the LF testbed ID.