5 files changed, 279 insertions, 0 deletions
diff --git a/docs/report/introduction/methodology.rst b/docs/report/introduction/methodology.rst
index a2b1b7b2cc..28187710fe 100644
--- a/docs/report/introduction/methodology.rst
+++ b/docs/report/introduction/methodology.rst
@@ -31,3 +31,4 @@ Test Methodology
     methodology_autogen
     methodology_aws/index
     methodology_rca/index
+    methodology_trending/index
+\ No newline at end of file
diff --git a/docs/report/introduction/methodology_trending/index.rst b/docs/report/introduction/methodology_trending/index.rst
new file mode 100644
index 0000000000..9105ec46b4
--- /dev/null
+++ b/docs/report/introduction/methodology_trending/index.rst
@@ -0,0 +1,8 @@
+Trending Methodology
+====================
+
+.. toctree::
+
+    overview
+    trend_analysis
+    trend_presentation
diff --git a/docs/report/introduction/methodology_trending/overview.rst b/docs/report/introduction/methodology_trending/overview.rst
new file mode 100644
index 0000000000..d2ffc04407
--- /dev/null
+++ b/docs/report/introduction/methodology_trending/overview.rst
@@ -0,0 +1,6 @@
+Overview
+^^^^^^^^
+
+This document describes a high-level design of a system for continuous
+performance measuring, trending and change detection for FD.io VPP SW
+data plane (and other performance tests run within CSIT sub-project).
diff --git a/docs/report/introduction/methodology_trending/trend_analysis.rst b/docs/report/introduction/methodology_trending/trend_analysis.rst
new file mode 100644
index 0000000000..2bb54997b0
--- /dev/null
+++ b/docs/report/introduction/methodology_trending/trend_analysis.rst
@@ -0,0 +1,229 @@
+Trend Analysis
+^^^^^^^^^^^^^^
+
+All measured performance trend data is treated as time-series data
+that is modeled as a concatenation of groups,
+within each group the samples come (independently) from
+the same normal distribution (with some center and standard deviation).
+
+Center of the normal distribution for the group (equal to population average)
+is called a trend for the group.
+All the analysis is based on finding the right partition into groups
+and comparing their trends.
+
+Anomalies in graphs
+~~~~~~~~~~~~~~~~~~~
+
+In graphs, the start of the following group is marked as a regression (red
+circle) or progression (green circle), if the new trend is lower (or higher
+respectively) then the previous group's.
+
+Implementation details
+~~~~~~~~~~~~~~~~~~~~~~
+
+Partitioning into groups
+------------------------
+
+While sometimes the samples within a group are far from being distributed
+normally, currently we do not have a better tractable model.
+
+Here, "sample" should be the result of single trial measurement, with group
+boundaries set only at test run granularity. But in order to avoid detecting
+causes unrelated to VPP performance, the current presentation takes average of
+all trials within the run as the sample. Effectively, this acts as a single
+trial with aggregate duration.
+
+Performance graphs show the run average as a dot (not all individual trial
+results).
+
+The group boundaries are selected based on `Minimum Description Length`_.
+
+Minimum Description Length
+--------------------------
+
+`Minimum Description Length`_ (MDL) is a particular formalization
+of `Occam's razor`_ principle.
+
+The general formulation mandates to evaluate a large set of models,
+but for anomaly detection purposes, it is useful to consider
+a smaller set of models, so that scoring and comparing them is easier.
+
+For each candidate model, the data should be compressed losslessly,
+which includes model definitions, encoded model parameters,
+and the raw data encoded based on probabilities computed by the model.
+The model resulting in shortest compressed message is the "the" correct model.
+
+For our model set (groups of normally distributed samples),
+we need to encode group length (which penalizes too many groups),
+group average (more on that later), group stdev and then all the samples.
+
+Luckily, the "all the samples" part turns out to be quite easy to compute.
+If sample values are considered as coordinates in (multi-dimensional)
+Euclidean space, fixing stdev means the point with allowed coordinates
+lays on a sphere. Fixing average intersects the sphere with a (hyper)-plane,
+and Gaussian probability density on the resulting sphere is constant.
+So the only contribution is the "area" of the sphere, which only depends
+on the number of samples and stdev.
+
+A somehow ambiguous part is in choosing which encoding
+is used for group size, average and stdev.
+Different encodings cause different biases to large or small values.
+In our implementation we have chosen probability density
+corresponding to uniform distribution (from zero to maximal sample value)
+for stdev and average of the first group,
+but for averages of subsequent groups we have chosen a distribution
+which discourages delimiting groups with averages close together.
+
+Our implementation assumes that measurement precision is 1.0 pps.
+Thus it is slightly wrong for trial durations other than 1.0 seconds.
+Also, all the calculations assume 1.0 pps is totally negligible,
+compared to stdev value.
+
+The group selection algorithm currently has no parameters,
+all the aforementioned encodings and handling of precision is hard-coded.
+In principle, every group selection is examined, and the one encodable
+with least amount of bits is selected.
+As the bit amount for a selection is just sum of bits for every group,
+finding the best selection takes number of comparisons
+quadratically increasing with the size of data,
+the overall time complexity being probably cubic.
+
+The resulting group distribution looks good
+if samples are distributed normally enough within a group.
+But for obviously different distributions (for example `bimodal distribution`_)
+the groups tend to focus on less relevant factors (such as "outlier" density).
+
+Common Patterns
+~~~~~~~~~~~~~~~
+
+When an anomaly is detected, it frequently falls into few known patterns,
+each having its typical behavior over time.
+
+We are going to describe the behaviors,
+as they motivate our choice of trend compliance metrics.
+
+Sample time and analysis time
+-----------------------------
+
+But first we need to distinguish two roles time plays in analysis,
+so it is more clear which role we are referring to.
+
+Sample time is the more obvious one.
+It is the time the sample is generated.
+It is the start time or the end time of the Jenkins job run,
+does not really matter which (parallel runs are disabled,
+and length of gap between samples does not affect metrics).
+
+Analysis time is the time the current analysis is computed.
+Again, the exact time does not usually matter,
+what matters is how many later (and how fewer earlier) samples
+were considered in the computation.
+
+For some patterns, it is usual for a previously reported
+anomaly to "vanish", or previously unseen anomaly to "appear late",
+as later samples change which partition into groups is more probable.
+
+Dashboard and graphs are always showing the latest analysis time,
+the compliance metrics are using earlier sample time
+with the same latest analysis time.
+
+Alerting e-mails use the latest analysis time at the time of sending,
+so the values reported there are likely to be different
+from the later analysis time results shown in dashboard and graphs.
+
+Ordinary regression
+-------------------
+
+The real performance changes from previously stable value
+into a new stable value.
+
+For medium to high magnitude of the change, one run
+is enough for anomaly detection to mark this regression.
+
+Ordinary progressions are detected in the same way.
+
+Small regression
+----------------
+
+The real performance changes from previously stable value
+into a new stable value, but the difference is small.
+
+For the anomaly detection algorithm, this change is harder to detect,
+depending on the standard deviation of the previous group.
+
+If the new performance value stays stable, eventually
+the detection algorithm is able to detect this anomaly
+when there are enough samples around the new value.
+
+If the difference is too small, it may remain undetected
+(as new performance change happens, or full history of samples
+is still not enough for the detection).
+
+Small progressions have the same behavior.
+
+Reverted regression
+-------------------
+
+This pattern can have two different causes.
+We would like to distinguish them, but that is usually
+not possible to do just by looking at the measured values (and not telemetry).
+
+In one cause, the real DUT performance has changed,
+but got restored immediately.
+In the other cause, no real performance change happened,
+just some temporary infrastructure issue
+has caused a wrong low value to be measured.
+
+For small measured changes, this pattern may remain undetected.
+For medium and big measured changes, this is detected when the regression
+happens on just the last sample.
+
+For big changes, the revert is also immediately detected
+as a subsequent progression. The trend is usually different
+from the previously stable trend (as the two population averages
+are not likely to be exactly equal), but the difference
+between the two trends is relatively small.
+
+For medium changes, the detection algorithm may need several new samples
+to detect a progression (as it dislikes single sample groups),
+in the meantime reporting regressions (difference decreasing
+with analysis time), until it stabilizes the same way as for big changes
+(regression followed by progression, small difference
+between the old stable trend and last trend).
+
+As it is very hard for a fault code or an infrastructure issue
+to increase performance, the opposite (temporary progression)
+almost never happens.
+
+Summary
+-------
+
+There is a trade-off between detecting small regressions
+and not reporting the same old regressions for a long time.
+
+For people reading e-mails, a sudden regression with a big number of samples
+in the last group means this regression was hard for the algorithm to detect.
+
+If there is a big regression with just one run in the last group,
+we are not sure if it is real, or just a temporary issue.
+It is useful to wait some time before starting an investigation.
+
+With decreasing (absolute value of) difference, the number of expected runs
+increases. If there is not enough runs, we still cannot distinguish
+real regression from temporary regression just from the current metrics
+(although humans frequently can tell by looking at the graph).
+
+When there is a regression or progression with just a small difference,
+it is probably an artifact of a temporary regression.
+Not worth examining, unless temporary regressions happen somewhat frequently.
+
+It is not easy for the metrics to locate the previous stable value,
+especially if multiple anomalies happened in the last few weeks.
+It is good to compare last trend with long term trend maximum,
+as it highlights the difference between "now" and "what could be".
+It is good to exclude last week from the trend maximum,
+as including the last week would hide all real progressions.
+
+.. _Minimum Description Length: https://en.wikipedia.org/wiki/Minimum_description_length
+.. _Occam's razor: https://en.wikipedia.org/wiki/Occam%27s_razor
+.. _bimodal distribution: https://en.wikipedia.org/wiki/Bimodal_distribution
diff --git a/docs/report/introduction/methodology_trending/trend_presentation.rst b/docs/report/introduction/methodology_trending/trend_presentation.rst
new file mode 100644
index 0000000000..67d0d3c45a
--- /dev/null
+++ b/docs/report/introduction/methodology_trending/trend_presentation.rst
@@ -0,0 +1,35 @@
+Trend Presentation
+^^^^^^^^^^^^^^^^^^
+
+Failed tests
+~~~~~~~~~~~~
+
+The Failed tests tables list the tests which failed during the last test run.
+Separate tables are generated for each testbed.
+
+Regressions and progressions
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These tables list tests which encountered a regression or progression during the
+specified time period, which is currently set to the last 21 days.
+
+Trendline Graphs
+~~~~~~~~~~~~~~~~
+
+Trendline graphs show measured per run averages of MRR values, NDR or PDR
+values, group average values, and detected anomalies.
+The graphs are constructed as follows:
+
+- X-axis represents the date in the format MMDD.
+- Y-axis represents run-average MRR value, NDR or PDR values in Mpps. For PDR
+  tests also a graph with average latency at 50% PDR [us] is generated.
+- Markers to indicate anomaly classification:
+
+  - Regression - red circle.
+  - Progression - green circle.
+
+- The line shows average MRR value of each group.
+
+In addition the graphs show dynamic labels while hovering over graph data
+points, presenting the CSIT build date, measured value, VPP reference, trend job
+build ID and the LF testbed ID.