Diffstat (limited to 'docs/content/methodology/trending')
-rw-r--r--  docs/content/methodology/trending/_index.md        |  12
-rw-r--r--  docs/content/methodology/trending/analysis.md       | 224
-rw-r--r--  docs/content/methodology/trending/presentation.md   |  34
3 files changed, 270 insertions, 0 deletions
diff --git a/docs/content/methodology/trending/_index.md b/docs/content/methodology/trending/_index.md
new file mode 100644
index 0000000000..4289e7ff96
--- /dev/null
+++ b/docs/content/methodology/trending/_index.md
@@ -0,0 +1,12 @@
---
bookCollapseSection: true
bookFlatSection: false
title: "Trending"
weight: 4
---

# Trending

This document describes a high-level design of a system for continuous
performance measuring, trending and change detection for the FD.io VPP SW
data plane (and other performance tests run within the CSIT sub-project).

diff --git a/docs/content/methodology/trending/analysis.md b/docs/content/methodology/trending/analysis.md
new file mode 100644
index 0000000000..fe952259ab
--- /dev/null
+++ b/docs/content/methodology/trending/analysis.md
@@ -0,0 +1,224 @@
---
title: "Analysis"
weight: 1
---

# Trend Analysis

All measured performance trend data is treated as time-series data
that is modeled as a concatenation of groups;
within each group the samples come (independently) from
the same normal distribution (with some center and standard deviation).

The center of the normal distribution for the group (equal to the population
average) is called the trend for the group.
All the analysis is based on finding the right partition into groups
and comparing their trends.

## Anomalies in graphs

In graphs, the start of the following group is marked as a regression (red
circle) or progression (green circle) if the new trend is lower (or higher,
respectively) than the previous group's.

## Implementation details

### Partitioning into groups

While sometimes the samples within a group are far from being distributed
normally, currently we do not have a better tractable model.

Here, a "sample" should be the result of a single trial measurement, with
group boundaries set only at test run granularity. But in order to avoid
detecting causes unrelated to VPP performance, the current presentation takes
the average of all trials within the run as the sample. Effectively, this acts
as a single trial with aggregate duration.

Performance graphs show the run average as a dot (not all individual trial
results).

The group boundaries are selected based on `Minimum Description Length`[^1].

### Minimum Description Length

`Minimum Description Length`[^1] (MDL) is a particular formalization
of the `Occam's razor`[^2] principle.

The general formulation mandates evaluating a large set of models,
but for anomaly detection purposes it is useful to consider
a smaller set of models, so that scoring and comparing them is easier.

For each candidate model, the data should be compressed losslessly,
which includes model definitions, encoded model parameters,
and the raw data encoded based on probabilities computed by the model.
The model resulting in the shortest compressed message is "the" correct model.

For our model set (groups of normally distributed samples),
we need to encode the group length (which penalizes too many groups),
the group average (more on that later), the group stdev
and then all the samples.

Luckily, the "all the samples" part turns out to be quite easy to compute.
If sample values are considered as coordinates in (multi-dimensional)
Euclidean space, fixing the stdev means the point with allowed coordinates
lies on a sphere. Fixing the average intersects the sphere with a
(hyper-)plane, and the Gaussian probability density on the resulting sphere
is constant. So the only contribution is the "area" of the sphere, which
depends only on the number of samples and the stdev.
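To make that concrete, here is a minimal sketch (illustrative only, not the
exact encoding used by the CSIT implementation) of the data part of the
description length for one group, assuming the values are quantized to a
fixed measurement precision:

```python
import math

def group_data_bits(samples, precision=1.0):
    """Illustrative sketch only (not the exact CSIT encoding): bits needed
    to encode the samples of one group, given a normal model whose average
    and stdev are fitted to that group, with values quantized to `precision`.
    Note that the result depends only on the sample count and the stdev."""
    count = len(samples)
    average = sum(samples) / count
    variance = sum((value - average) ** 2 for value in samples) / count
    stdev = max(math.sqrt(variance), precision)  # guard against a zero stdev
    # Negative log-likelihood at the fitted parameters, converted to bits,
    # minus the bits saved by resolving each sample only down to `precision`.
    nats = 0.5 * count * math.log(2.0 * math.pi * math.e * stdev ** 2)
    return nats / math.log(2.0) - count * math.log2(precision)
```

The full per-group cost also includes the bits for the encoded group length,
average and stdev, whose encodings are discussed next.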
A somewhat ambiguous part is choosing which encoding
is used for the group size, average and stdev.
Different encodings cause different biases towards large or small values.
In our implementation we have chosen a probability density
corresponding to a uniform distribution (from zero to the maximal sample value)
for the stdev and for the average of the first group,
but for averages of subsequent groups we have chosen a distribution
which discourages delimiting groups with averages close together.

Our implementation assumes that the measurement precision is 1.0 pps.
Thus it is slightly wrong for trial durations other than 1.0 seconds.
Also, all the calculations assume 1.0 pps is totally negligible
compared to the stdev value.

The group selection algorithm currently has no parameters;
all the aforementioned encodings and the handling of precision are hard-coded.
In principle, every group selection is examined, and the one encodable
with the least amount of bits is selected.
As the bit amount for a selection is just the sum of bits for every group,
finding the best selection takes a number of comparisons
that grows quadratically with the size of the data,
the overall time complexity probably being cubic.
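Because the per-group bit costs simply add up, such a selection can be
sketched as a dynamic program over candidate boundaries. The following is a
sketch only (the production implementation may organize the search
differently), reusing a per-group cost function such as the
`group_data_bits()` sketch above plus the parameter costs:

```python
import math

def best_partition(samples, group_bits):
    """Sketch only: find the group boundaries minimizing the total
    description length. `group_bits(chunk)` is assumed to return the bit
    cost of one group (encoded parameters plus data). Because the costs add
    up per group, a dynamic program over end positions finds the best
    selection."""
    count = len(samples)
    best = [0.0] + [math.inf] * count   # best[i]: minimal bits for samples[:i]
    cut = [0] * (count + 1)             # cut[i]: where the last group starts
    for end in range(1, count + 1):
        for start in range(end):
            cost = best[start] + group_bits(samples[start:end])
            if cost < best[end]:
                best[end], cut[end] = cost, start
    boundaries, position = [], count    # walk back to recover the boundaries
    while position > 0:
        boundaries.append(position)
        position = cut[position]
    return sorted(boundaries)           # each entry is where a group ends
```

The double loop is the quadratically growing number of comparisons mentioned
above; with each `group_bits()` call touching every sample of its chunk, the
overall time ends up roughly cubic.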
The resulting group distribution looks good
if samples are distributed normally enough within a group.
But for obviously different distributions (for example
a `bimodal distribution`[^3]) the groups tend to focus on less relevant
factors (such as "outlier" density).

## Common Patterns

When an anomaly is detected, it frequently falls into a few known patterns,
each having its typical behavior over time.

We are going to describe these behaviors,
as they motivate our choice of trend compliance metrics.

### Sample time and analysis time

But first we need to distinguish two roles time plays in the analysis,
so it is clearer which role we are referring to.

Sample time is the more obvious one.
It is the time the sample is generated.
It is the start time or the end time of the Jenkins job run;
it does not really matter which (parallel runs are disabled,
and the length of the gap between samples does not affect the metrics).

Analysis time is the time the current analysis is computed.
Again, the exact time does not usually matter;
what matters is how many of the later samples
were already available to the computation.

For some patterns, it is usual for a previously reported
anomaly to "vanish", or for a previously unseen anomaly to "appear late",
as later samples change which partition into groups is more probable.

The dashboard and graphs always show the latest analysis time;
the compliance metrics use an earlier sample time
with the same latest analysis time.

Alerting e-mails use the latest analysis time at the time of sending,
so the values reported there are likely to differ
from the later analysis time results shown in the dashboard and graphs.

### Ordinary regression

The real performance changes from a previously stable value
into a new stable value.

For a change of medium to high magnitude, one run
is enough for anomaly detection to mark this regression.

Ordinary progressions are detected in the same way.

### Small regression

The real performance changes from a previously stable value
into a new stable value, but the difference is small.

For the anomaly detection algorithm, this change is harder to detect,
depending on the standard deviation of the previous group.

If the new performance value stays stable, the detection algorithm
is eventually able to detect this anomaly
once there are enough samples around the new value.

If the difference is too small, it may remain undetected
(as a new performance change happens, or the full history of samples
is still not enough for detection).

Small progressions show the same behavior.

### Reverted regression

This pattern can have two different causes.
We would like to distinguish them, but that is usually
not possible just by looking at the measured values (without telemetry).

In one case, the real DUT performance has changed,
but got restored immediately.
In the other case, no real performance change happened;
some temporary infrastructure issue
has caused a wrong low value to be measured.

For small measured changes, this pattern may remain undetected.
For medium and big measured changes, it is detected when the regression
happens on just the last sample.

For big changes, the revert is also immediately detected
as a subsequent progression. The trend is usually different
from the previously stable trend (as the two population averages
are not likely to be exactly equal), but the difference
between the two trends is relatively small.

For medium changes, the detection algorithm may need several new samples
to detect a progression (as it dislikes single-sample groups),
in the meantime reporting regressions (with the difference decreasing
with analysis time), until it stabilizes the same way as for big changes
(regression followed by progression, with a small difference
between the old stable trend and the last trend).

As it is very hard for faulty code or an infrastructure issue
to increase performance, the opposite (a temporary progression)
almost never happens.

### Summary

There is a trade-off between detecting small regressions
and not reporting the same old regressions for a long time.

For people reading the e-mails, a sudden regression with a big number of
samples in the last group means this regression was hard for the algorithm
to detect.

If there is a big regression with just one run in the last group,
we are not sure whether it is real or just a temporary issue.
It is useful to wait some time before starting an investigation.

With a decreasing (absolute value of the) difference, the number of expected
runs increases. If there are not enough runs, we still cannot distinguish
a real regression from a temporary regression just from the current metrics
(although humans frequently can tell by looking at the graph).

When there is a regression or progression with just a small difference,
it is probably an artifact of a temporary regression.
It is not worth examining, unless temporary regressions happen somewhat
frequently.

It is not easy for the metrics to locate the previous stable value,
especially if multiple anomalies happened in the last few weeks.
It is good to compare the last trend with the long-term trend maximum,
as it highlights the difference between "now" and "what could be".
It is good to exclude the last week from the trend maximum,
as including the last week would hide all real progressions.
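As one way to picture that comparison (a sketch only, with hypothetical input
names; the actual compliance metrics may be computed differently), the last
trend can be related to the maximum of the trends older than one week:

```python
from datetime import datetime, timedelta

def last_trend_vs_maximum(trend_by_time, now=None):
    """Sketch with hypothetical inputs: `trend_by_time` maps a sample time
    (datetime) to the trend value valid at that time. Returns the relative
    difference of the last trend against the long-term trend maximum,
    excluding the last week so a fresh progression does not hide itself."""
    if now is None:
        now = datetime.now()
    last_trend = trend_by_time[max(trend_by_time)]
    older = [trend for when, trend in trend_by_time.items()
             if when < now - timedelta(days=7)]
    reference = max(older) if older else last_trend
    return (last_trend - reference) / reference
```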
[^1]: [Minimum Description Length](https://en.wikipedia.org/wiki/Minimum_description_length)
[^2]: [Occam's Razor](https://en.wikipedia.org/wiki/Occam%27s_razor)
[^3]: [Bimodal Distribution](https://en.wikipedia.org/wiki/Bimodal_distribution)

diff --git a/docs/content/methodology/trending/presentation.md b/docs/content/methodology/trending/presentation.md
new file mode 100644
index 0000000000..84925b46c8
--- /dev/null
+++ b/docs/content/methodology/trending/presentation.md
@@ -0,0 +1,34 @@
---
title: "Presentation"
weight: 2
---

# Trend Presentation

## Failed tests

The Failed tests tables list the tests which failed during the last test run.
Separate tables are generated for each testbed.

## Regressions and progressions

These tables list tests which encountered a regression or progression during
the specified time period, which is currently set to the last 21 days.

## Trendline Graphs

Trendline graphs show the measured per-run averages of MRR, NDR or PDR
values, the group average values, and the detected anomalies.
The graphs are constructed as follows:

- X-axis represents the date in the format MMDD.
- Y-axis represents the run-average MRR value, or the NDR or PDR values,
  in Mpps. For PDR tests a graph with the average latency at 50% PDR [us]
  is also generated.
- Markers indicate the anomaly classification:
  - Regression - red circle.
  - Progression - green circle.
- The line shows the average MRR value of each group.

In addition, the graphs show dynamic labels while hovering over graph data
points, presenting the CSIT build date, measured value, VPP reference, trend
job build ID and the LF testbed ID.
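A minimal sketch of such a graph, using matplotlib and made-up sample data
(illustrative only; the production dashboards may use a different plotting
stack), could look like this:

```python
import matplotlib.pyplot as plt

# Hypothetical example data: per-run MRR averages [Mpps], their dates (MMDD),
# the group-average trend, and the run indices where anomalies were detected.
dates = ["0601", "0602", "0603", "0604", "0605", "0606"]
runs = [11.9, 12.1, 12.0, 10.4, 10.5, 10.6]
trend = [12.0, 12.0, 12.0, 10.5, 10.5, 10.5]
regressions = [3]    # index where a new, lower group starts
progressions = []    # index where a new, higher group starts

# Run averages as dots, group averages as the trend line.
plt.plot(dates, runs, "o", color="blue", label="run average [Mpps]")
plt.plot(dates, trend, "-", color="blue", label="group average (trend)")
# Anomaly markers: red circles for regressions, green for progressions.
plt.plot([dates[i] for i in regressions], [runs[i] for i in regressions],
         "o", color="red", markersize=12, fillstyle="none", label="regression")
plt.plot([dates[i] for i in progressions], [runs[i] for i in progressions],
         "o", color="green", markersize=12, fillstyle="none",
         label="progression")
plt.xlabel("date [MMDD]")
plt.ylabel("throughput [Mpps]")
plt.legend()
plt.show()
```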