Diffstat (limited to 'docs/content/methodology/per_patch_testing.md')
-rw-r--r--  docs/content/methodology/per_patch_testing.md  229
1 file changed, 229 insertions, 0 deletions
diff --git a/docs/content/methodology/per_patch_testing.md b/docs/content/methodology/per_patch_testing.md
new file mode 100644
index 0000000000..6ae40a13dc
--- /dev/null
+++ b/docs/content/methodology/per_patch_testing.md
@@ -0,0 +1,229 @@
+---
+title: "Per-patch Testing"
+weight: 5
+---
+
+# Per-patch Testing
+
+Updated for CSIT git commit id: d8ec3f8673346c0dc93e567159771f24c1bf74fc.
+
+A methodology similar to trending analysis is used for comparing performance
+before a DUT code change is merged. This can act as a verify job to disallow
+changes which would decrease performance without a good reason.
+
+## Existing jobs
+
+These jobs are not started automatically; they must be triggered on demand.
+They accept full tag expressions, so all types of perf tests are supported.
+
+There are jobs available for multiple types of testbeds,
+based on various processors.
+Their Gerrit trigger words are of the form "perftest-{node_arch}"
+where the node_arch combinations currently supported are:
+2n-icx, 2n-clx, 2n-spr, 2n-zn2, 3n-icx, 3n-tsh, 3n-alt, 2n-tx2, 3n-snr,
+3na-spr, 3nb-spr.
+
+## Test selection
+
+A Gerrit trigger line without any additional arguments selects
+a small set of test cases to run.
+If additional arguments are added to the Gerrit trigger, they are treated
+as Robot tag expressions that select the tests to run.
+While very flexible, this method of test selection also allows the user
+to accidentally select too many tests, blocking the testbed for days.
+
+What follows is a list of explanations and recommendations
+to help users select a minimal set of test cases.
+
+### Verify cycles
+
+When Gerrit schedules multiple jobs to run for the same patch set,
+it waits until all runs are complete.
+While it is waiting, it is possible to trigger more jobs
+(adding runs to the set Gerrit is waiting for), but it is not possible
+to trigger more runs for the same job, until Gerrit is done waiting.
+After Gerrit is done waiting, it becomes possible to trigger
+the same job again.
+
+Example: a user triggers one set of tests on 2n-icx and immediately
+also triggers another set of tests on 3n-icx. Then the user notices
+the 2n-icx run ended early because of a typo in the tag expression.
+When the user tries to re-trigger 2n-icx (with a fixed tag expression),
+that comment is ignored by Jenkins.
+Only after the 3n-icx job finishes can the user trigger 2n-icx again.
+
+### One comment many jobs
+
+In the past, the CSIT code which parses perftest trigger comments
+was buggy, which led to bad behavior (such as selecting all performance tests,
+because "perftest" is also a Robot tag) when a user included multiple
+perftest trigger words in the same comment.
+
+The worst bugs have since been fixed, but it is still recommended
+to use just one trigger word per Gerrit comment, just to be safe.
+
+### Multiple test cases in run
+
+While Robot supports the OR operator, it does not support parentheses,
+so the OR operator is not very useful.
+It is recommended to separate multiple tag expressions with spaces
+instead of using the OR operator.
+
+Example template:
+perftest-2n-icx {tag_expression_1} {tag_expression_2}
+
+See below for more concrete examples.
+
+### Suite tags
+
+Traditionally, CSIT maintains broad Robot tags that can be used to select tests.
+
+However, it is not recommended to use them for test selection,
+as it is not easy to determine how many test cases they select.
+
+The recommended way is to look into the CSIT repository first,
+locate the specific suite the user is interested in,
+and use its suite tag. For example, "ethip4-ip4base" is a suite tag
+selecting just one suite in the CSIT git repository,
+avoiding all scale, container, and other similar variants.
+
+Note that CSIT uses the "autogen" code generator,
+so the Robot runs in Jenkins have access to more suites
+than are visible by just looking into the CSIT git repository.
+Thus, a suite tag alone is not enough to select precisely the intended suite,
+and users are encouraged to narrow down to a single test case within a suite.
+
+### Fully specified tag expressions
+
+Here is one template to select a single test case:
+{test_type}AND{nic_model}AND{nic_driver}AND{cores}AND{frame_size}AND{suite_tag}
+where the variable values are all lower case (so the AND operator stands out).
+
+The fastest and most widely used type of performance test is "mrr".
+As an alternative, "ndrpdr" focuses on small losses (as opposed to max load),
+but takes longer to finish.
+The nic_driver options depend on nic_model. For Intel cards, "drv_avf"
+(AVF plugin) and "drv_vfio_pci" (DPDK plugin) are popular; for Mellanox,
+"drv_mlx5_core". Currently, the performance using "drv_af_xdp" is not reliable
+enough, so do not use it unless you are specifically testing AF_XDP.
+
+The most popular nic_model is "nic_intel-e810cq", but it is not available
+on all testbed types.
+It is safe to use "1c" for cores (unless you suspect multi-core
+performance is affected differently) and "64b" for frame size ("78b" for ip6
+and more for dot1q and other encapsulated traffic;
+"1518b" is popular for ipsec and other CPU-bound tests).
+
+As there are more test cases than CSIT can test periodically,
+it is possible to encounter an old test case that currently fails.
+To avoid that, you can look at the "job spec" files used for periodic testing,
+for example
+[this one](https://raw.githubusercontent.com/FDio/csit/master/resources/job_specs/report_iterative/2n-spr/vpp-mrr-00.md).
+
+### Shortening triggers
+
+Advanced users may use the following tricks to avoid writing long trigger
+comments.
+
+Robot supports glob matching, which can be used to select multiple suite tags at
+once.
+
+Not specifying one of the 6 parts of the recommended expression pattern
+will select all available options. For example, not specifying nic_driver
+for nic_intel-e810cq will select all 3 applicable drivers.
+You can use the NOT operator to reject some options (e.g. NOTdrv_af_xdp).
+Beware, with NOT the order matters:
+tag1ANDtag2NOTtag3 is not the same as tag1NOTtag3ANDtag2;
+the latter is evaluated as tag1AND(NOT(tag3ANDtag2)).
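+
+The following is a minimal Python sketch (not Robot Framework code) of the
+precedence difference described above, assuming a hypothetical test tagged
+with tag1 and tag3 but not tag2:
+
+```python
+# Tag membership of the hypothetical test.
+tags = {"tag1", "tag3"}
+tag1, tag2, tag3 = ("tag1" in tags, "tag2" in tags, "tag3" in tags)
+
+# tag1ANDtag2NOTtag3: the NOT applies only to the last tag.
+expr_a = tag1 and tag2 and not tag3
+# tag1NOTtag3ANDtag2: evaluated as tag1AND(NOT(tag3ANDtag2)).
+expr_b = tag1 and not (tag3 and tag2)
+
+print(expr_a, expr_b)  # False True -> the two expressions select different tests
+```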
+
+Beware when not specifying nic_model. As a precaution,
+CSIT code will insert the default NIC model for the testbed used.
+Example: specifying drv_rdma_core without specifying nic_model
+will fail, as the default nic_model is nic_intel-e810cq,
+which does not support the RDMA core driver.
+
+### Complete example
+
+A user wants to test a VPP change which may affect load balancing with bonding.
+Searching tag documentation for "bonding" finds the LBOND tag and its variants.
+Searching the CSIT git repository (directory tests/) finds 8 suite files,
+all suited only for 3-node testbeds.
+All suites use vhost, but differ by the forwarding app inside the VM
+(DPDK or VPP), by the forwarding mode of the VPP acting as the host-level vswitch
+(MAC learning or cross connect), and by the number of DUT1-DUT2 links
+available (1 or 2).
+
+As not all NICs and testbeds offer enough ports for 2 parallel DUT-DUT links,
+the user looks at
+[testbed specifications](https://github.com/FDio/csit/tree/master/topologies/available)
+and finds that only the e810xxv NIC on the 3n-icx testbed matches the requirements.
+A quick look into the suites confirms the smallest frame size is 64 bytes
+(despite the DOT1Q robot tag, as the encapsulation does not happen on TG-DUT links).
+It is ok to use just 1 physical core, as 3n-icx has hyperthreading enabled,
+so the VPP vswitch will use 2 worker threads.
+
+The user decides the vswitch forwarding mode is not important
+(so chooses cross connect as that has less CPU overhead),
+but wants to test both NIC drivers (not AF_XDP), both apps in the VM,
+and both 1 and 2 parallel links.
+
+After shortening, this is the trigger comment finally used:
+perftest-3n-icx mrrANDnic_intel-e810cqAND1cAND64bAND?lbvpplacp-dot1q-l2xcbase-eth-2vhostvr1024-1vm\*NOTdrv_af_xdp
+
+## Basic operation
+
+The job builds VPP .deb packages for both the patch under test
+(called "current") and its parent patch (called "parent").
+
+For each test (from the set defined by tag expressions),
+both builds are subjected to several trial measurements (in case of MRR).
+Measured samples are grouped into a "parent" sequence,
+followed by a "current" sequence. The same Minimal Description Length
+algorithm as in trending is used to decide whether it is one big group,
+or two smaller groups. If it is one group, a "normal" result
+is declared for the test. If it is two groups, and the current average
+is less than the parent average, the test is declared a regression.
+If it is two groups and the current average is larger or equal,
+the test is declared a progression.
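+
+The following is a minimal Python sketch of this decision rule only; it is not
+the actual jumpavg API, and description_bits() is a hypothetical stand-in for
+the Minimal Description Length computation:
+
+```python
+from statistics import mean
+
+def classify(parent_samples, current_samples, description_bits):
+    """Return "normal", "regression" or "progression" for one test."""
+    # Bits needed to describe all samples as one big group.
+    one_group = description_bits(parent_samples + current_samples)
+    # Bits needed to describe the samples as two smaller groups.
+    two_groups = description_bits(parent_samples) + description_bits(current_samples)
+    if one_group <= two_groups:
+        return "normal"
+    # Two groups describe the data better, so the averages differ significantly.
+    if mean(current_samples) < mean(parent_samples):
+        return "regression"
+    return "progression"
+```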
+
+The whole job fails (giving -1) if any test was declared a regression.
+If a test fails, fake result values are used,
+so it is possible to use the job to verify that the current code
+fixes a test failing in the parent
+(if a test is not fixed, it is treated as a regression).
+
+## Temporary specifics
+
+The Minimal Description Length analysis is performed by
+CSIT code equivalent to the jumpavg-0.4.1 library available on PyPI.
+
+In hopes of strengthening the signal (code performance) compared to the noise
+(all other factors influencing the measured values), several workarounds
+are applied.
+
+In contrast to trending, MRR trial duration is set to 10 seconds,
+and only 5 samples are measured for each build.
+Both parameters are set in ci-management.
+
+This decreases sensitivity to regressions, but also decreases
+the probability of false positives.
+
+## Console output
+
+The following information is visible towards the end of the Jenkins console
+output, repeated for each analyzed test.
+
+The original 5 values (or 1 for non-mrr tests) are visible in the order
+they were measured.
+The values after processing are also visible in the output,
+this time sorted by value (so people can see the minimum and maximum).
+
+The next output is the difference of averages. It is the current average
+minus the parent average, expressed as a percentage of the parent average.
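+
+A minimal sketch of this computation, using hypothetical numbers only:
+
+```python
+parent_avg = 10.0   # average of parent samples (e.g. in Mpps), made-up value
+current_avg = 9.5   # average of current samples, made-up value
+diff = (current_avg - parent_avg) / parent_avg * 100.0
+print(f"{diff:+.2f}%")  # -5.00%
+```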
+
+The next three outputs contain the jumpavg representation
+of the two groups and a combined group.
+Here, "bits" is the description length; for the "current" sequence
+it includes the effect of the "parent" average value
+(jumpavg-0.4.1 penalizes sequences with averages that are too close).
+
+Next, a sentence describes which grouping description is shorter,
+and by how many bits.
+Finally, the test result classification is visible.