author     Scott Keeler <skeeler@cisco.com>    2018-10-01 14:50:57 -0400
committer  Dave Barach <openvpp@barachs.net>   2018-10-05 13:47:42 +0000
commit     25c4d396eae99e23c4ebe7155fde7700dd1130b9
tree       2cd3661e26e37cf3e04327559479bc6ce0c9a752 /docs/usecases/contiv/BUG_REPORTS.md
parent     2d24cd027275905f308f75bf45d0f9d163f2235b
docs: add contiv vpp
Change-Id: I92227fc4968fc6a478beb7f38707b91e9f0635ec
Signed-off-by: Scott Keeler <skeeler@cisco.com>
Diffstat (limited to 'docs/usecases/contiv/BUG_REPORTS.md')
-rw-r--r--  docs/usecases/contiv/BUG_REPORTS.md  333
1 file changed, 333 insertions, 0 deletions
diff --git a/docs/usecases/contiv/BUG_REPORTS.md b/docs/usecases/contiv/BUG_REPORTS.md
new file mode 100644
index 00000000000..23c9a7c393c
--- /dev/null
+++ b/docs/usecases/contiv/BUG_REPORTS.md
@@ -0,0 +1,333 @@

# Debugging and Reporting Bugs in Contiv-VPP

## Bug Report Structure

- [Deployment description](#describe-deployment):
Briefly describe the deployment where the issue was spotted: the number of k8s nodes and whether DHCP/STN/TAP is used.

- [Logs](#collecting-the-logs):
Attach the corresponding logs, at least from the vswitch pods.

- [VPP config](#inspect-vpp-config):
Attach the output of the show commands.

- [Basic Collection Example](#basic-example)

### Describe Deployment
Since Contiv-VPP can be used with different configurations, it is helpful
to attach the config that was applied. Either attach the `values.yaml` passed to the helm chart,
or attach the [corresponding part](https://github.com/contiv/vpp/blob/42b3bfbe8735508667b1e7f1928109a65dfd5261/k8s/contiv-vpp.yaml#L24-L38) of the deployment YAML file.

```
  contiv.yaml: |-
    TCPstackDisabled: true
    UseTAPInterfaces: true
    TAPInterfaceVersion: 2
    NatExternalTraffic: true
    MTUSize: 1500
    IPAMConfig:
      PodSubnetCIDR: 10.1.0.0/16
      PodNetworkPrefixLen: 24
      PodIfIPCIDR: 10.2.1.0/24
      VPPHostSubnetCIDR: 172.30.0.0/16
      VPPHostNetworkPrefixLen: 24
      NodeInterconnectCIDR: 192.168.16.0/24
      VxlanCIDR: 192.168.30.0/24
      NodeInterconnectDHCP: False
```

Information that might be helpful:
- Whether node IPs are statically assigned or obtained via DHCP
- Whether STN is enabled
- The version of TAP interfaces used
- The output of `kubectl get pods -o wide --all-namespaces`


### Collecting the Logs

The most essential thing to do when debugging and **reporting an issue**
in Contiv-VPP is **collecting the logs from the contiv-vpp vswitch containers**.

#### a) Collecting Vswitch Logs Using kubectl
In order to collect the logs from the individual vswitches in the cluster, connect to the master node
and find the pod names of the individual vswitch pods:

```
$ kubectl get pods --all-namespaces | grep vswitch
kube-system   contiv-vswitch-lqxfp   2/2   Running   0   1h
kube-system   contiv-vswitch-q6kwt   2/2   Running   0   1h
```

Then run the following command, with *pod name* replaced by the actual pod name:
```
$ kubectl logs <pod name> -n kube-system -c contiv-vswitch
```

Redirect the output to a file to save the logs, for example:

```
kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt
```
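To grab the logs from all vswitch pods in one pass, a small shell loop can help. This is only a sketch; it assumes the vswitch pod names contain the string `vswitch`, as in the listing above, and writes one log file per pod:

```
for pod in $(kubectl get pods -n kube-system | awk '/vswitch/ {print $1}'); do
  # one log file per vswitch pod, e.g. logs-contiv-vswitch-lqxfp.txt
  kubectl logs "$pod" -n kube-system -c contiv-vswitch > "logs-${pod}.txt"
done
```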
#### b) Collecting Vswitch Logs Using Docker
If option a) does not work, you can still collect the same logs using the plain docker
command. For that, connect to each individual node in the k8s cluster and find the container ID of the vswitch container:

```
$ docker ps | grep contivvpp/vswitch
b682b5837e52   contivvpp/vswitch   "/usr/bin/supervisor…"   2 hours ago   Up 2 hours   k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0
```

Now use the ID from the first column to dump the logs into the `logs-master.txt` file:
```
$ docker logs b682b5837e52 > logs-master.txt
```

#### Reviewing the Vswitch Logs

In order to debug an issue, it is good to start by grepping the logs for the `level=error` string, for example:
```
$ cat logs-master.txt | grep level=error
```

Some bugs can also cause VPP or the contiv-agent to crash.
To check whether a process has crashed, grep for the string `exit`, for example:
```
$ cat logs-master.txt | grep exit
2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request
```

#### Collecting the STN Daemon Logs
In STN (Steal The NIC) deployment scenarios, you often need to collect and review the logs
from the STN daemon. This needs to be done on each node:
```
$ docker logs contiv-stn > logs-stn-master.txt
```

#### Collecting Logs in Case of Crash Loop
If the vswitch is crashing in a loop (which can be seen from the increasing number in the `RESTARTS`
column of the `kubectl get pods --all-namespaces` output), `kubectl logs` or `docker logs` would
only give us the logs of the latest incarnation of the vswitch. Those may not capture the root cause
of the very first crash, so in order to debug it, we need to disable the k8s health check probes so
that the vswitch is not restarted after the very first crash. This can be done by commenting out the
`readinessProbe` and `livenessProbe` in the contiv-vpp deployment YAML:

```diff
diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
index 3676047..ffa4473 100644
--- a/k8s/contiv-vpp.yaml
+++ b/k8s/contiv-vpp.yaml
@@ -224,18 +224,18 @@ spec:
         ports:
           # readiness + liveness probe
           - containerPort: 9999
-        readinessProbe:
-          httpGet:
-            path: /readiness
-            port: 9999
-          periodSeconds: 1
-          initialDelaySeconds: 15
-        livenessProbe:
-          httpGet:
-            path: /liveness
-            port: 9999
-          periodSeconds: 1
-          initialDelaySeconds: 60
+        # readinessProbe:
+        #   httpGet:
+        #     path: /readiness
+        #     port: 9999
+        #   periodSeconds: 1
+        #   initialDelaySeconds: 15
+        # livenessProbe:
+        #   httpGet:
+        #     path: /liveness
+        #     port: 9999
+        #   periodSeconds: 1
+        #   initialDelaySeconds: 60
         env:
           - name: MICROSERVICE_LABEL
             valueFrom:
```

If VPP is the crashing process, please follow the [CORE_FILES](CORE_FILES.html) guide and provide the coredump file.
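As a quick first look, `kubectl logs` can also print the output of the previously terminated instance of a container via its `--previous` flag. Note that this only reaches back one restart, so it does not replace the probe-disabling approach above when the very first crash is what matters:

```
$ kubectl logs <pod name> --previous -n kube-system -c contiv-vswitch > logs-previous-master.txt
```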
### Inspect VPP Config
Inspect the following areas (a sketch for capturing all of them non-interactively follows this list):
- Configured interfaces (issues related to basic node/pod connectivity):
```
vpp# sh int addr
GigabitEthernet0/9/0 (up):
  192.168.16.1/24
local0 (dn):
loop0 (up):
  l2 bridge bd_id 1 bvi shg 0
  192.168.30.1/24
tapcli-0 (up):
  172.30.1.1/24
```

- IP forwarding table:
```
vpp# sh ip fib
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
0.0.0.0/0
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
    [0] [@0]: dpo-drop ip4
0.0.0.0/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
    [0] [@0]: dpo-drop ip4

...
...

255.255.255.255/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
    [0] [@0]: dpo-drop ip4
```
- ARP table:
```
vpp# sh ip arp
    Time      IP4              Flags  Ethernet           Interface
    728.6616  192.168.16.2     D      08:00:27:9c:0e:9f  GigabitEthernet0/8/0
    542.7045  192.168.30.2     S      1a:2b:3c:4d:5e:02  loop0
      1.4241  172.30.1.2       D      86:41:d5:92:fd:24  tapcli-0
     15.2485  10.1.1.2         SN     00:00:00:00:00:02  tapcli-1
    739.2339  10.1.1.3         SN     00:00:00:00:00:02  tapcli-2
    739.4119  10.1.1.4         SN     00:00:00:00:00:02  tapcli-3
```
- NAT configuration (issues related to services):
```
DBGvpp# sh nat44 addresses
NAT44 pool addresses:
192.168.16.10
  tenant VRF independent
  0 busy udp ports
  0 busy tcp ports
  0 busy icmp ports
NAT44 twice-nat pool addresses:
```
```
vpp# sh nat44 static mappings
NAT44 static mappings:
 tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0 out2in-only
 tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0 out2in-only
 tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0 out2in-only
 tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0 out2in-only
 tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0 out2in-only
 tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0 out2in-only
 udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only
 tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only
```
```
vpp# sh nat44 interfaces
NAT44 interfaces:
 loop0 in out
 GigabitEthernet0/9/0 out
 tapcli-0 in out
```
```
vpp# sh nat44 sessions
NAT44 sessions:
  192.168.20.2: 0 dynamic translations, 3 static translations
  10.1.1.3: 0 dynamic translations, 0 static translations
  10.1.1.4: 0 dynamic translations, 0 static translations
  10.1.1.2: 0 dynamic translations, 6 static translations
  10.1.2.18: 0 dynamic translations, 2 static translations
```
- ACL config (issues related to policies):
```
vpp# sh acl-plugin acl
```
- "Steal the NIC (STN)" config (issues related to host connectivity when STN is active):
```
vpp# sh stn rules
- rule_index: 0
  address: 10.1.10.47
  iface: tapcli-0 (2)
  next_node: tapcli-0-output (410)
```
- Errors:
```
vpp# sh errors
```
- Vxlan tunnels:
```
vpp# sh vxlan tunnels
```
- Hardware interface information:
```
vpp# sh hardware-interfaces
```
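If you prefer to capture all of these show commands in one step rather than typing them at the VPP prompt, a loop over `kubectl exec` can be used. This is only a sketch: it assumes that `vppctl` is present in the contiv-vswitch container and can reach VPP's CLI socket at its default location; if that is not the case in your deployment, run the same commands from wherever you normally access the VPP CLI:

```
POD=contiv-vswitch-lqxfp   # replace with the actual vswitch pod name
for cmd in "sh int addr" "sh ip fib" "sh ip arp" "sh nat44 addresses" \
           "sh nat44 static mappings" "sh nat44 interfaces" "sh nat44 sessions" \
           "sh acl-plugin acl" "sh stn rules" "sh errors" "sh vxlan tunnels" \
           "sh hardware-interfaces"; do
  # print a header per command, then the command output
  echo "===== $cmd ====="
  kubectl exec "$POD" -n kube-system -c contiv-vswitch -- vppctl "$cmd"
done > vpp-config-master.txt
```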
### Basic Example

[contiv-vpp-bug-report.sh][1] is an example of a script that may be a useful starting point for gathering the above information using kubectl.

Limitations:
- The script does not collect the STN daemon logs, nor does it handle the special
  case of a crash loop.

Prerequisites:
- The user specified in the script must have passwordless access to all nodes
  in the cluster; on each node in the cluster, the user must have passwordless
  access to sudo.

#### Setting up Prerequisites
To enable logging into a node without a password, copy your public key to the
node:
```
ssh-copy-id <user-id>@<node-name-or-ip-address>
```

To enable running sudo without a password for a given user, enter:
```
$ sudo visudo
```

Append the following entry to run ALL commands without a password for the given
user:
```
<userid> ALL=(ALL) NOPASSWD:ALL
```

You can also add user `<user-id>` to group `sudo` and edit the `sudo`
entry as follows:

```
# Allow members of group sudo to execute any command
%sudo ALL=(ALL:ALL) NOPASSWD:ALL
```

Add user `<user-id>` to group `<group-id>` as follows:
```
sudo adduser <user-id> <group-id>
```
or as follows:
```
usermod -a -G <group-id> <user-id>
```
#### Working with the Contiv-VPP Vagrant Test Bed
The script can be used to collect data from the [Contiv-VPP test bed created with Vagrant][2].
To collect debug information from this Contiv-VPP test bed, perform the following steps:
* In the directory where you created your vagrant test bed, run:
```
 vagrant ssh-config > vagrant-ssh.conf
```
* To collect the debug information, run:
```
 ./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf
```

[1]: https://github.com/contiv/vpp/tree/master/scripts/contiv-vpp-bug-report.sh
[2]: https://github.com/contiv/vpp/blob/master/vagrant/README.md
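Whether you use the Vagrant test bed or your own cluster, it can save time to verify the passwordless SSH and sudo prerequisites described above before starting a collection. A minimal check, with placeholder node names:

```
# Each iteration should print the node's hostname without prompting for a password;
# "sudo -n" fails immediately instead of prompting if passwordless sudo is missing.
for node in k8s-master k8s-worker1 k8s-worker2; do
  ssh <user-id>@$node 'sudo -n hostname'
done
```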