From efdcf6470f6e15dcc918c70e5a61d10e10653f1e Mon Sep 17 00:00:00 2001
From: Tibor Frank
Date: Thu, 1 Mar 2018 14:52:47 +0100
Subject: CSIT-913: Continuous Trending, Analysis and Change Detection

- CSIT-915: LLD
- CSIT-917: Functions to evaluate the results according to the PASS / FAIL criteria
- CSIT-918: Sphinx configuration
- CSIT-948: Statistical functions
- CSIT-949: Data models for trending plots
- CSIT-950: Code trending plots
- CSIT-951: Static content
- CSIT-984: PAL Specification file
- CSIT-996: Download data from nexus

Change-Id: Icb9305945bb0f142135bb177cb8781ba0096280e
Signed-off-by: Tibor Frank
---
 docs/cpta/index.rst                    |   8 ++
 docs/cpta/introduction/index.rst       | 182 +++++++++++++++++++++++++++++++++
 docs/cpta/trending/container_memif.rst |  80 +++++++++++++++
 docs/cpta/trending/index.rst           |  10 ++
 docs/cpta/trending/ip4.rst             |  20 ++++
 docs/cpta/trending/ip6.rst             |  20 ++++
 docs/cpta/trending/l2.rst              |  20 ++++
 docs/cpta/trending/vm_vhost.rst        | 116 +++++++++++++++++++++
 8 files changed, 456 insertions(+)
 create mode 100644 docs/cpta/index.rst
 create mode 100644 docs/cpta/introduction/index.rst
 create mode 100644 docs/cpta/trending/container_memif.rst
 create mode 100644 docs/cpta/trending/index.rst
 create mode 100644 docs/cpta/trending/ip4.rst
 create mode 100644 docs/cpta/trending/ip6.rst
 create mode 100644 docs/cpta/trending/l2.rst
 create mode 100644 docs/cpta/trending/vm_vhost.rst

diff --git a/docs/cpta/index.rst b/docs/cpta/index.rst
new file mode 100644
index 0000000000..dcefef7f08
--- /dev/null
+++ b/docs/cpta/index.rst
@@ -0,0 +1,8 @@
+Continuous Performance Trending and Analysis
+============================================
+
+.. toctree::
+    :numbered:
+
+    introduction/index
+    trending/index
diff --git a/docs/cpta/introduction/index.rst b/docs/cpta/introduction/index.rst
new file mode 100644
index 0000000000..aad683b390
--- /dev/null
+++ b/docs/cpta/introduction/index.rst
@@ -0,0 +1,182 @@
+Introduction
+============
+
+Purpose
+-------
+
+With increasing number of features and code changes in the FD.io VPP data plane
+codebase, it is increasingly difficult to measure and detect VPP data plane
+performance changes. Similarly, once degradation is detected, it is getting
+harder to bisect the source code in search of the Bad code change or addition.
+The problem is further escalated by a large combination of compute platforms
+that VPP is running and used on, including Intel Xeon, Intel Atom, ARM Aarch64.
+
+Existing FD.io CSIT continuous performance trending test jobs help, but they
+rely on human factors for anomaly detection, and as such are error prone and
+unreliable, as the volume of data generated by these jobs is growing
+exponentially.
+
+Proposed solution is to eliminate human factor and fully automate performance
+trending, regression and progression detection, as well as bisecting.
+
+This document describes a high-level design of a system for continuous
+measuring, trending and performance change detection for FD.io VPP SW data
+plane. It builds upon the existing CSIT framework with extensions to its
+throughput testing methodology, CSIT data analytics engine
+(PAL – Presentation-and-Analytics-Layer) and associated Jenkins jobs
+definitions.
+
+Continuous Performance Trending and Analysis
+--------------------------------------------
+
+Proposed design replaces existing CSIT performance trending jobs and tests with
+new Performance Trending (PT) CSIT module and separate Performance Analysis (PA)
+module ingesting results from PT and analysing, detecting and reporting any
+performance anomalies using historical trending data and statistical metrics.
+PA does also produce trending graphs with summary and drill-down views across
+all specified tests that can be reviewed and inspected regularly by FD.io
+developers and users community.
+
+Trend Analysis
+``````````````
+
+All measured performance trend data is treated as time-series data that can be
+modelled using normal distribution. After trimming the outliers, the average and
+deviations from average are used for detecting performance change anomalies
+following the three-sigma rule of thumb (a.k.a. 68-95-99.7 rule).
+
+Analysis Metrics
+````````````````
+
+Following statistical metrics are proposed as performance trend indicators over
+the rolling window of last sets of historical measurement data:
+
+    #. Quartiles Q1, Q2, Q3 – three points dividing a ranked set of data set
+       into four equal parts, Q2 is the median of the data.
+    #. Inter Quartile Range IQR=Q3-Q1 – measure of variability, used here to
+       eliminate outliers.
+    #. Outliers – extreme values that are at least 1.5*IQR below Q1, or at
+       least 1.5*IQR above Q3.
+    #. Trimmed Moving Average (TMA) – average across the data set of the rolling
+       window of values without the outliers. Used here to calculate TMSD.
+    #. Trimmed Moving Standard Deviation (TMSD) – standard deviation over the
+       data set of the rolling window of values without the outliers,
+       requires calculating TMA. Used here for anomaly detection.
+    #. Moving Median (MM) - median across the data set of the rolling window of
+       values with all data points, including the outliers. Used here for
+       anomaly detection.
+
+Anomaly Detection
+`````````````````
+
+Based on the assumption that all performance measurements can be modelled using
+normal distribution, a three-sigma rule of thumb is proposed as the main
+criteria for anomaly detection.
+
+Three-sigma rule of thumb, aka 68–95–99.7 rule, is a shorthand used to capture
+the percentage of values that lie within a band around the average (mean) in a
+normal distribution within a width of two, four and six standard deviations.
+More accurately 68.27%, 95.45% and 99.73% of the result values should lie within
+one, two or three standard deviations of the mean, see figure below.
+
+To verify compliance of test result with value X against defined trend analysis
+metric and detect anomalies, three simple evaluation criteria are proposed:
+
+::
+
+    Test Result Evaluation      Reported Result     Reported Reason     Trending Graph Markers
+    ==========================================================================================
+    Normal                      Pass                Normal              Part of plot line
+    Regression                  Fail                Regression          Red circle
+    Progression                 Pass                Progression         Green circle
+
+Jenkins job cumulative results:
+
+    #. Pass - if all detection results are Pass or Warning.
+    #. Fail - if any detection result is Fail.
+
+Performance Trending (PT)
+`````````````````````````
+
+CSIT PT runs regular performance test jobs finding MRR, PDR and NDR per test
+cases. PT is designed as follows:
+
+    #. PT job triggers:
+
+        #. Periodic e.g. daily.
+        #. On-demand gerrit triggered.
+        #. Other periodic TBD.
+
+    #. Measurements and calculations per test case:
+
+        #. MRR Max Received Rate
+
+            #. Measured: Unlimited tolerance of packet loss.
+            #. Send packets at link rate, count total received packets, divide
+               by test trial period.
+
+        #. Optimized binary search bounds for PDR and NDR tests:
+
+            #. Calculated: High and low bounds for binary search based on MRR
+               and pre-defined Packet Loss Ratio (PLR).
+            #. HighBound=MRR, LowBound=to-be-determined.
+            #. PLR – acceptable loss ratio for PDR tests, currently set to 0.5%
+               for all performance tests.
+
+        #. PDR and NDR:
+
+            #. Run binary search within the calculated bounds, find PDR and NDR.
+            #. Measured: PDR Partial Drop Rate – limited non-zero tolerance of
+               packet loss.
+            #. Measured: NDR Non Drop Rate - zero packet loss.
+
+    #. Archive MRR, PDR and NDR per test case.
+    #. Archive counters collected at MRR, PDR and NDR.
+
+Performance Analysis (PA)
+`````````````````````````
+
+CSIT PA runs performance analysis, change detection and trending using specified
+trend analysis metrics over the rolling window of last sets of historical
+measurement data. PA is defined as follows:
+
+    #. PA job triggers:
+
+        #. By PT job at its completion.
+        #. On-demand gerrit triggered.
+        #. Other periodic TBD.
+
+    #. Download and parse archived historical data and the new data:
+
+        #. New data from latest PT job is evaluated against the rolling window
+           of sets of historical data.
+        #. Download RF output.xml files and compressed archived data.
+        #. Parse out the data filtering test cases listed in PA specification
+           (part of CSIT PAL specification file).
+
+    #. Calculate trend metrics for the rolling window of sets of historical data:
+
+        #. Calculate quartiles Q1, Q2, Q3.
+        #. Trim outliers using IQR.
+        #. Calculate TMA and TMSD.
+        #. Calculate normal trending range per test case based on TMA and TMSD.
+
+    #. Evaluate new test data against trend metrics:
+
+        #. If within the range of (TMA +/- 3*TMSD) => Result = Pass,
+           Reason = Normal.
+        #. If below the range => Result = Fail, Reason = Regression.
+        #. If above the range => Result = Pass, Reason = Progression.
+
+    #. Generate and publish results
+
+        #. Relay evaluation result to job result.
+        #. Generate a new set of trend analysis summary graphs and drill-down
+           graphs.
+
+            #. Summary graphs to include measured values with Normal,
+               Progression and Regression markers. MM shown in the background if
+               possible.
+            #. Drill-down graphs to include MM, TMA and TMSD.
+
+        #. Publish trend analysis graphs in html format.
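Reviewer note, not part of the patch: the PA steps described in the file above (quartiles, IQR trimming, TMA, TMSD, three-sigma evaluation, cumulative job result) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the PAL code delivered by this change. The function names ``trend_metrics``, ``evaluate`` and ``job_result`` are hypothetical, and the quartile method, the use of the sample standard deviation and the fallback for very small trimmed sets are choices the text does not specify.

```python
from statistics import mean, median, quantiles, stdev


def trend_metrics(window):
    """Compute the listed trend metrics for one rolling window (>= 2 values)."""
    # Quartiles Q1, Q2, Q3; the "inclusive" method is an assumption.
    q1, q2, q3 = quantiles(window, n=4, method="inclusive")
    iqr = q3 - q1
    # Outliers lie at least 1.5*IQR below Q1 or at least 1.5*IQR above Q3.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    trimmed = [x for x in window if low_fence < x < high_fence]
    # Assumption: fall back to the full window when trimming leaves fewer
    # than two samples (for example all values identical, so IQR == 0).
    if len(trimmed) < 2:
        trimmed = list(window)
    return {
        "q1": q1, "q2": q2, "q3": q3, "iqr": iqr,
        "mm": median(window),    # Moving Median, outliers included
        "tma": mean(trimmed),    # Trimmed Moving Average
        "tmsd": stdev(trimmed),  # sample standard deviation (assumption)
    }


def evaluate(value, tma, tmsd):
    """Classify a new result against the TMA +/- 3*TMSD range."""
    if value < tma - 3 * tmsd:
        return ("Fail", "Regression")
    if value > tma + 3 * tmsd:
        return ("Pass", "Progression")
    return ("Pass", "Normal")


def job_result(results):
    """Cumulative job result: Fail if any single detection result is Fail."""
    return "Fail" if any(res == "Fail" for res, _ in results) else "Pass"


# Made-up sample window; 2.0 is an outlier that the IQR rule trims.
window = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 2.0]
metrics = trend_metrics(window)
results = [evaluate(v, metrics["tma"], metrics["tmsd"]) for v in (10.1, 9.0, 11.0)]
print(results, job_result(results))
```

The sample values are invented for illustration only. Keeping the metric calculation separate from the evaluation mirrors the split between the "Calculate trend metrics" and "Evaluate new test data" steps, so the same metrics could also feed the drill-down graphs.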
diff --git a/docs/cpta/trending/container_memif.rst b/docs/cpta/trending/container_memif.rst
new file mode 100644
index 0000000000..5d145aa0f0
--- /dev/null
+++ b/docs/cpta/trending/container_memif.rst
@@ -0,0 +1,80 @@
+Container memif Connections
+===========================
+
+NIC 10ge2p1x520
+---------------
+
+.. raw:: html
+
+
+
+*Figure 1. Daily trend.*
+
+.. raw:: html
+
+
+
+*Figure 2. Weekly trend.*
+
+.. raw:: html
+
+
+
+*Figure 3. Monthly trend.*
+
+.. raw:: html
+
+
+
+*Figure 4. Daily trend.*
+
+.. raw:: html
+
+
+
+*Figure 5. Weekly trend.*
+
+.. raw:: html
+
+
+
+*Figure 6. Monthly trend.*
+
+NIC 40ge2p1xl710
+----------------
+
+.. raw:: html
+
+
+
+*Figure 1. Daily trend.*
+
+.. raw:: html
+
+
+
+*Figure 2. Weekly trend.*
+
+.. raw:: html
+
+
+
+*Figure 3. Monthly trend.*
+
+.. raw:: html
+
+
+
+*Figure 4. Daily trend.*
+
+.. raw:: html
+
+
+
+*Figure 5. Weekly trend.*
+
+.. raw:: html
+
+
+
+*Figure 6. Monthly trend.*
\ No newline at end of file
diff --git a/docs/cpta/trending/index.rst b/docs/cpta/trending/index.rst
new file mode 100644
index 0000000000..0dd9cf66a5
--- /dev/null
+++ b/docs/cpta/trending/index.rst
@@ -0,0 +1,10 @@
+VPP Performance Trend
+=====================
+
+.. toctree::
+
+    l2
+    ip4
+    ip6
+    container_memif
+    vm_vhost
diff --git a/docs/cpta/trending/ip4.rst b/docs/cpta/trending/ip4.rst
new file mode 100644
index 0000000000..a84f362b5d
--- /dev/null
+++ b/docs/cpta/trending/ip4.rst
@@ -0,0 +1,20 @@
+IPv4 Routed-Forwarding
+======================
+
+.. raw:: html
+
+
+
+*Figure 1. Daily trend.*
+
+.. raw:: html
+
+
+
+*Figure 2. Weekly trend.*
+
+.. raw:: html
+
+
+
+*Figure 3. Monthly trend.*
diff --git a/docs/cpta/trending/ip6.rst b/docs/cpta/trending/ip6.rst
new file mode 100644
index 0000000000..a2b5afdfd7
--- /dev/null
+++ b/docs/cpta/trending/ip6.rst
@@ -0,0 +1,20 @@
+IPv6 Routed-Forwarding
+======================
+
+.. raw:: html
+
+
+
+*Figure 1. Daily trend.*
+
+.. raw:: html
+
+
+
+*Figure 2. Weekly trend.*
+
+.. raw:: html
+
+
+
+*Figure 3. Monthly trend.*
diff --git a/docs/cpta/trending/l2.rst b/docs/cpta/trending/l2.rst
new file mode 100644
index 0000000000..8a51270ba5
--- /dev/null
+++ b/docs/cpta/trending/l2.rst
@@ -0,0 +1,20 @@
+L2 Ethernet Switching
+=====================
+
+.. raw:: html
+
+
+
+*Figure 1. Daily trend.*
+
+.. raw:: html
+
+
+
+*Figure 2. Weekly trend.*
+
+.. raw:: html
+
+
+
+*Figure 3. Monthly trend.*
diff --git a/docs/cpta/trending/vm_vhost.rst b/docs/cpta/trending/vm_vhost.rst
new file mode 100644
index 0000000000..6b464cc3cb
--- /dev/null
+++ b/docs/cpta/trending/vm_vhost.rst
@@ -0,0 +1,116 @@
+VM vhost Connections
+====================
+
+NIC 10ge2p1x520
+---------------
+
+.. raw:: html
+
+
+
+*Figure 1. Daily trend.*
+
+.. raw:: html
+
+
+
+*Figure 2. Weekly trend.*
+
+.. raw:: html
+
+
+
+*Figure 3. Monthly trend.*
+
+.. raw:: html
+
+
+
+*Figure 4. Daily trend.*
+
+.. raw:: html
+
+
+
+*Figure 5. Weekly trend.*
+
+.. raw:: html
+
+
+
+*Figure 6. Monthly trend.*
+
+.. raw:: html
+
+
+
+*Figure 7. Daily trend.*
+
+.. raw:: html
+
+
+
+*Figure 8. Weekly trend.*
+
+.. raw:: html
+
+
+
+*Figure 9. Monthly trend.*
+
+.. raw:: html
+
+
+
+*Figure 10. Daily trend.*
+
+.. raw:: html
+
+
+
+*Figure 11. Weekly trend.*
+
+.. raw:: html
+
+
+
+*Figure 12. Monthly trend.*
+
+NIC 40ge2p1xl710
+----------------
+
+.. raw:: html
+
+
+
+*Figure 1. Daily trend.*
+
+.. raw:: html
+
+
+
+*Figure 2. Weekly trend.*
+
+.. raw:: html
+
+
+
+*Figure 3. Monthly trend.*
+
+.. raw:: html
+
+
+
+*Figure 4. Daily trend.*
+
+.. raw:: html
+
+
+
+*Figure 5. Weekly trend.*
+
+.. raw:: html
+
+
+
+*Figure 6. Monthly trend.*