author: Nathan Skrzypczak <nathan.skrzypczak@gmail.com> 2021-08-19 11:38:06 +0200
committer: Dave Wallace <dwallacelf@gmail.com> 2021-10-13 23:22:32 +0000
commit: 9ad39c026c8a3c945a7003c4aa4f5cb1d4c80160
tree: 3cca19635417e28ae381d67ae31c75df2925032d /docs/usecases/contiv/BUG_REPORTS.rst
parent: f47122e07e1ecd0151902a3cabe46c60a99bee8e
docs: better docs, mv doxygen to sphinx
This patch refactors the VPP sphinx docs in order to make it easier to consume for external readers as well as VPP developers. It also makes sphinx the single source of documentation, which simplifies maintenance and operation.

Most important updates are:
- reformat the existing documentation as rst
- split RELEASE.md and move it into separate rst files
- remove section 'events'
- remove section 'archive'
- remove section 'related projects'
- remove section 'feature by release'
- remove section 'Various links'
- make (Configuration reference, CLI docs, developer docs) top level items in the list
- move 'Use Cases' as part of 'About VPP'
- move 'Troubleshooting' as part of 'Getting Started'
- move test framework docs into 'Developer Documentation'
- add a 'Contributing' section for gerrit, docs and other contributer related infos
- deprecate doxygen and test-docs targets
- redirect the "make doxygen" target to "make docs"

Type: refactor
Change-Id: I552a5645d5b7964d547f99b1336e2ac24e7c209f
Signed-off-by: Nathan Skrzypczak <nathan.skrzypczak@gmail.com>
Signed-off-by: Andrew Yourtchenko <ayourtch@gmail.com>
Diffstat (limited to 'docs/usecases/contiv/BUG_REPORTS.rst')
-rw-r--r-- docs/usecases/contiv/BUG_REPORTS.rst 401
1 files changed, 401 insertions, 0 deletions
diff --git a/docs/usecases/contiv/BUG_REPORTS.rst b/docs/usecases/contiv/BUG_REPORTS.rst
new file mode 100644
index 00000000000..8e55d5b3c8d
--- /dev/null
+++ b/docs/usecases/contiv/BUG_REPORTS.rst
@@ -0,0 +1,401 @@
+Debugging and Reporting Bugs in Contiv-VPP
+==========================================
+
+Bug Report Structure
+--------------------
+
+- `Deployment description <#describe-deployment>`__: Briefly describe
+ the deployment in which the issue was spotted: the number of k8s
+ nodes and whether DHCP/STN/TAP is used.
+
+- `Logs <#collecting-the-logs>`__: Attach corresponding logs, at least
+ from the vswitch pods.
+
+- `VPP config <#inspect-vpp-config>`__: Attach the output of the VPP
+ show commands.
+
+- `Basic Collection Example <#basic-example>`__
+
+Describe Deployment
+~~~~~~~~~~~~~~~~~~~
+
+Since contiv-vpp can be used with different configurations, it is
+helpful to attach the config that was applied. Either attach
+``values.yaml`` passed to the helm chart, or attach the `corresponding
+part <https://github.com/contiv/vpp/blob/42b3bfbe8735508667b1e7f1928109a65dfd5261/k8s/contiv-vpp.yaml#L24-L38>`__
+from the deployment yaml file.
+
+.. code:: yaml
+
+   contiv.yaml: |-
+     TCPstackDisabled: true
+     UseTAPInterfaces: true
+     TAPInterfaceVersion: 2
+     NatExternalTraffic: true
+     MTUSize: 1500
+     IPAMConfig:
+       PodSubnetCIDR: 10.1.0.0/16
+       PodNetworkPrefixLen: 24
+       PodIfIPCIDR: 10.2.1.0/24
+       VPPHostSubnetCIDR: 172.30.0.0/16
+       VPPHostNetworkPrefixLen: 24
+       NodeInterconnectCIDR: 192.168.16.0/24
+       VxlanCIDR: 192.168.30.0/24
+       NodeInterconnectDHCP: False
+
+Information that might be helpful:
+
+- Whether node IPs are statically assigned or DHCP is used
+- Whether STN is enabled
+- Version of TAP interfaces used
+- Output of ``kubectl get pods -o wide --all-namespaces``
+
+Collecting the Logs
+~~~~~~~~~~~~~~~~~~~
+
+The most important step when debugging and **reporting an issue** in
+Contiv-VPP is **collecting the logs from the contiv-vpp vswitch
+containers**.
+
+a) Collecting Vswitch Logs Using kubectl
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In order to collect the logs from individual vswitches in the cluster,
+connect to the master node and then find the POD names of the individual
+vswitch containers:
+
+::
+
+ $ kubectl get pods --all-namespaces | grep vswitch
+ kube-system contiv-vswitch-lqxfp 2/2 Running 0 1h
+ kube-system contiv-vswitch-q6kwt 2/2 Running 0 1h
+
+Then run the following command, with *pod name* replaced by the actual
+POD name:
+
+::
+
+ $ kubectl logs <pod name> -n kube-system -c contiv-vswitch
+
+Redirect the output to a file to save the logs, for example:
+
+::
+
+ kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt
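+
+The two steps above can be combined into a loop. The following is a
+minimal sketch, not part of the Contiv-VPP tooling: the function names
+``vswitch_pod_names`` and ``collect_vswitch_logs`` are hypothetical, and
+it assumes that the vswitch pods run in the ``kube-system`` namespace,
+as in the listing above.
+

```shell
# Hypothetical helper: read `kubectl get pods --all-namespaces` output
# from stdin and print the name (column 2) of every vswitch pod.
vswitch_pod_names() {
    awk '$2 ~ /vswitch/ { print $2 }'
}

# Hypothetical helper: save the logs of every vswitch pod into
# logs-<pod name>.txt in the current directory.
# ASSUMPTION: the pods live in the kube-system namespace.
collect_vswitch_logs() {
    kubectl get pods --all-namespaces | vswitch_pod_names |
    while read -r pod; do
        kubectl logs "$pod" -n kube-system -c contiv-vswitch \
            > "logs-${pod}.txt"
    done
}

# Usage, on the master node:
#   collect_vswitch_logs
```

+
+Writing one file per pod keeps the logs of different nodes apart, which
+makes it easier to attach them to a bug report.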
+
+b) Collecting Vswitch Logs Using Docker
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If option a) does not work, then you can still collect the same logs
+using the plain docker command. For that, you need to connect to each
+individual node in the k8s cluster, and find the container ID of the
+vswitch container:
+
+::
+
+ $ docker ps | grep contivvpp/vswitch
+ b682b5837e52 contivvpp/vswitch "/usr/bin/supervisor…" 2 hours ago Up 2 hours k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0
+
+Now use the ID from the first column to dump the logs into the
+``logs-master.txt`` file:
+
+::
+
+ $ docker logs b682b5837e52 > logs-master.txt
+
+Reviewing the Vswitch Logs
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In order to debug an issue, it is good to start by grepping the logs for
+the ``level=error`` string, for example:
+
+::
+
+ $ grep level=error logs-master.txt
+
+VPP or contiv-agent may also crash because of a bug. To check whether a
+process crashed, grep for the string ``exit``, for example:
+
+::
+
+ $ grep exit logs-master.txt
+ 2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
+ 2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request
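+
+Both checks can be wrapped in small helpers for a first pass over a
+saved log file. This is a sketch only: the function names are
+hypothetical, and ``crashed_processes`` assumes that crash lines look
+like the sample output above (``exited:`` followed by ``not expected``).
+

```shell
# Hypothetical helper: print the number of lines in the given log file
# that contain the level=error string.
count_errors() {
    grep -c 'level=error' "$1"
}

# Hypothetical helper: print the lines that report a process which
# exited unexpectedly.
# ASSUMPTION: such lines contain "exited:" and "not expected", as in
# the sample output shown in this section.
crashed_processes() {
    grep 'exited:' "$1" | grep 'not expected'
}

# Usage:
#   count_errors logs-master.txt
#   crashed_processes logs-master.txt
```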
+
+Collecting the STN Daemon Logs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In STN (Steal The NIC) deployment scenarios, you often need to collect
+and review the logs from the STN daemon. This needs to be done on each
+node:
+
+::
+
+ $ docker logs contiv-stn > logs-stn-master.txt
+
+Collecting Logs in Case of Crash Loop
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If the vswitch is crashing in a loop (indicated by an increasing number
+in the ``RESTARTS`` column of the ``kubectl get pods --all-namespaces``
+output), ``kubectl logs`` or ``docker logs`` returns only the logs of
+the latest incarnation of the vswitch. Those logs might not show the
+root cause of the very first crash. To debug it, disable the k8s health
+check probes so that the vswitch is not restarted after the first
+crash. This can be done by commenting out the ``readinessProbe`` and
+``livenessProbe`` in the contiv-vpp deployment YAML:
+
+.. code:: diff
+
+ diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
+ index 3676047..ffa4473 100644
+ --- a/k8s/contiv-vpp.yaml
+ +++ b/k8s/contiv-vpp.yaml
+ @@ -224,18 +224,18 @@ spec:
+ ports:
+ # readiness + liveness probe
+ - containerPort: 9999
+ - readinessProbe:
+ - httpGet:
+ - path: /readiness
+ - port: 9999
+ - periodSeconds: 1
+ - initialDelaySeconds: 15
+ - livenessProbe:
+ - httpGet:
+ - path: /liveness
+ - port: 9999
+ - periodSeconds: 1
+ - initialDelaySeconds: 60
+ + # readinessProbe:
+ + # httpGet:
+ + # path: /readiness
+ + # port: 9999
+ + # periodSeconds: 1
+ + # initialDelaySeconds: 15
+ + # livenessProbe:
+ + # httpGet:
+ + # path: /liveness
+ + # port: 9999
+ + # periodSeconds: 1
+ + # initialDelaySeconds: 60
+ env:
+ - name: MICROSERVICE_LABEL
+ valueFrom:
+
+If VPP is the crashing process, please follow the
+`CORE_FILES <CORE_FILES.html>`__ guide and provide the coredump file.
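+
+To spot a crash loop quickly, the ``RESTARTS`` column can be filtered
+with ``awk``. The sketch below uses a hypothetical function name and
+assumes the column layout of ``kubectl get pods --all-namespaces``
+(``NAME`` in column 2, ``RESTARTS`` in column 5).
+

```shell
# Hypothetical helper: read `kubectl get pods --all-namespaces` output
# from stdin and print "<pod name> <restart count>" for every vswitch
# pod that has restarted at least once.
# ASSUMPTION: column 2 is NAME and column 5 is RESTARTS.
restarted_vswitch_pods() {
    awk '$2 ~ /vswitch/ && $5 + 0 > 0 { print $2, $5 }'
}

# Usage:
#   kubectl get pods --all-namespaces | restarted_vswitch_pods
```

+
+Running it twice a few minutes apart shows whether the count is still
+increasing.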
+
+Inspect VPP Config
+~~~~~~~~~~~~~~~~~~
+
+Inspect the following areas:
+
+- Configured interfaces (issues related to basic node/pod
+ connectivity):
+
+::
+
+ vpp# sh int addr
+ GigabitEthernet0/9/0 (up):
+ 192.168.16.1/24
+ local0 (dn):
+ loop0 (up):
+ l2 bridge bd_id 1 bvi shg 0
+ 192.168.30.1/24
+ tapcli-0 (up):
+ 172.30.1.1/24
+
+- IP forwarding table:
+
+::
+
+ vpp# sh ip fib
+ ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
+ 0.0.0.0/0
+ unicast-ip4-chain
+ [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
+ [0] [@0]: dpo-drop ip4
+ 0.0.0.0/32
+ unicast-ip4-chain
+ [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
+ [0] [@0]: dpo-drop ip4
+
+ ...
+ ...
+
+ 255.255.255.255/32
+ unicast-ip4-chain
+ [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
+ [0] [@0]: dpo-drop ip4
+
+- ARP Table:
+
+::
+
+ vpp# sh ip arp
+ Time IP4 Flags Ethernet Interface
+ 728.6616 192.168.16.2 D 08:00:27:9c:0e:9f GigabitEthernet0/8/0
+ 542.7045 192.168.30.2 S 1a:2b:3c:4d:5e:02 loop0
+ 1.4241 172.30.1.2 D 86:41:d5:92:fd:24 tapcli-0
+ 15.2485 10.1.1.2 SN 00:00:00:00:00:02 tapcli-1
+ 739.2339 10.1.1.3 SN 00:00:00:00:00:02 tapcli-2
+ 739.4119 10.1.1.4 SN 00:00:00:00:00:02 tapcli-3
+
+- NAT configuration (issues related to services):
+
+::
+
+ DBGvpp# sh nat44 addresses
+ NAT44 pool addresses:
+ 192.168.16.10
+ tenant VRF independent
+ 0 busy udp ports
+ 0 busy tcp ports
+ 0 busy icmp ports
+ NAT44 twice-nat pool addresses:
+
+::
+
+ vpp# sh nat44 static mappings
+ NAT44 static mappings:
+ tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0 out2in-only
+ tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0 out2in-only
+ tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0 out2in-only
+ tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0 out2in-only
+ tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0 out2in-only
+ tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0 out2in-only
+ udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only
+ tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only
+
+::
+
+ vpp# sh nat44 interfaces
+ NAT44 interfaces:
+ loop0 in out
+ GigabitEthernet0/9/0 out
+ tapcli-0 in out
+
+::
+
+ vpp# sh nat44 sessions
+ NAT44 sessions:
+ 192.168.20.2: 0 dynamic translations, 3 static translations
+ 10.1.1.3: 0 dynamic translations, 0 static translations
+ 10.1.1.4: 0 dynamic translations, 0 static translations
+ 10.1.1.2: 0 dynamic translations, 6 static translations
+ 10.1.2.18: 0 dynamic translations, 2 static translations
+
+- ACL config (issues related to policies):
+
+::
+
+ vpp# sh acl-plugin acl
+
+- “Steal the NIC (STN)” config (issues related to host connectivity
+ when STN is active):
+
+::
+
+ vpp# sh stn rules
+ - rule_index: 0
+ address: 10.1.10.47
+ iface: tapcli-0 (2)
+ next_node: tapcli-0-output (410)
+
+- Errors:
+
+::
+
+ vpp# sh errors
+
+- Vxlan tunnels:
+
+::
+
+ vpp# sh vxlan tunnels
+
+- Hardware interface information:
+
+::
+
+ vpp# sh hardware-interfaces
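+
+The show commands above can be gathered into one file per vswitch pod.
+The following is a sketch under stated assumptions, not the project's
+own tooling: the function names are hypothetical, and it assumes that
+``vppctl`` is available inside the ``contiv-vswitch`` container and
+that the pod runs in the ``kube-system`` namespace.
+

```shell
# Hypothetical helper: print the VPP show commands listed in this
# section, one per line.
vpp_show_commands() {
    cat <<'EOF'
sh int addr
sh ip fib
sh ip arp
sh nat44 addresses
sh nat44 static mappings
sh nat44 interfaces
sh nat44 sessions
sh acl-plugin acl
sh stn rules
sh errors
sh vxlan tunnels
sh hardware-interfaces
EOF
}

# Hypothetical helper: run every show command in the given vswitch pod
# and write the combined output to vpp-config-<pod name>.txt.
# ASSUMPTION: vppctl exists inside the contiv-vswitch container.
collect_vpp_config() {
    pod="$1"
    vpp_show_commands | while read -r cmd; do
        echo "### $cmd"
        # $cmd is intentionally unquoted so it splits into arguments;
        # stdin is redirected so kubectl cannot consume the command list.
        kubectl exec "$pod" -n kube-system -c contiv-vswitch -- \
            vppctl $cmd < /dev/null
    done > "vpp-config-${pod}.txt"
}

# Usage:
#   collect_vpp_config contiv-vswitch-lqxfp
```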
+
+Basic Example
+~~~~~~~~~~~~~
+
+`contiv-vpp-bug-report.sh <https://github.com/contiv/vpp/tree/master/scripts/contiv-vpp-bug-report.sh>`__
+is an example script that may be a useful starting point for gathering
+the above information using kubectl.
+
+Limitations:
+
+- The script does not include STN daemon logs, nor does it handle the
+ special case of a crash loop.
+
+Prerequisites:
+
+- The user specified in the script must have passwordless access to
+ all nodes in the cluster; on each node in the cluster the user must
+ have passwordless access to sudo.
+
+Setting up Prerequisites
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+To enable logging into a node without a password, copy your public key
+to that node:
+
+::
+
+ ssh-copy-id <user-id>@<node-name-or-ip-address>
+
+To enable running sudo without a password for a given user, enter:
+
+::
+
+ $ sudo visudo
+
+Append the following entry to allow a given user to run all commands
+without a password:
+
+::
+
+ <user-id> ALL=(ALL) NOPASSWD:ALL
+
+You can also add user ``<user-id>`` to group ``sudo`` and edit the
+``sudo`` entry as follows:
+
+::
+
+ # Allow members of group sudo to execute any command
+ %sudo ALL=(ALL:ALL) NOPASSWD:ALL
+
+Add user ``<user-id>`` to group ``<group-id>`` as follows:
+
+::
+
+ sudo adduser <user-id> <group-id>
+
+or as follows:
+
+::
+
+ sudo usermod -a -G <group-id> <user-id>
+
+Working with the Contiv-VPP Vagrant Test Bed
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The script can be used to collect data from the `Contiv-VPP test bed
+created with
+Vagrant <https://github.com/contiv/vpp/blob/master/vagrant/README.md>`__.
+To collect debug information from this Contiv-VPP test bed, perform the
+following steps:
+
+- In the directory where you created your vagrant test bed, run:
+
+::
+
+ vagrant ssh-config > vagrant-ssh.conf
+
+- To collect the debug information, run:
+
+::
+
+ ./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf