# Debugging and Reporting Bugs in Contiv-VPP

## Bug Report Structure

- [Deployment description](#describe-deployment):
  Briefly describe the deployment where the issue was spotted: the number of
  k8s nodes, and whether DHCP, STN and TAP are used.

- [Logs](#collecting-the-logs):
  Attach the corresponding logs, at least from the vswitch pods.

- [VPP config](#inspect-vpp-config):
  Attach the output of the VPP show commands.

- [Basic Collection Example](#basic-example)

### Describe Deployment
Since contiv-vpp can be used with different configurations, it is helpful
to attach the config that was applied. Either attach the `values.yaml` passed to the helm chart,
or attach the [corresponding part](https://github.com/contiv/vpp/blob/42b3bfbe8735508667b1e7f1928109a65dfd5261/k8s/contiv-vpp.yaml#L24-L38) of the deployment yaml file:

```
  contiv.yaml: |-
    TCPstackDisabled: true
    UseTAPInterfaces: true
    TAPInterfaceVersion: 2
    NatExternalTraffic: true
    MTUSize: 1500
    IPAMConfig:
      PodSubnetCIDR: 10.1.0.0/16
      PodNetworkPrefixLen: 24
      PodIfIPCIDR: 10.2.1.0/24
      VPPHostSubnetCIDR: 172.30.0.0/16
      VPPHostNetworkPrefixLen: 24
      NodeInterconnectCIDR: 192.168.16.0/24
      VxlanCIDR: 192.168.30.0/24
      NodeInterconnectDHCP: False
```

Other information that might be helpful:
- whether node IPs are statically assigned or DHCP is used,
- whether STN is enabled,
- the version of TAP interfaces used,
- the output of `kubectl get pods -o wide --all-namespaces`.


### Collecting the Logs

The most essential step when debugging and **reporting an issue**
in Contiv-VPP is **collecting the logs from the contiv-vpp vswitch containers**.

#### a) Collecting Vswitch Logs Using kubectl
To collect the logs from the individual vswitches in the cluster, connect to the master node
and find the pod names of the individual vswitch containers:

```
$ kubectl get pods --all-namespaces | grep vswitch
kube-system   contiv-vswitch-lqxfp   2/2   Running   0   1h
kube-system   contiv-vswitch-q6kwt   2/2   Running   0   1h
```

Then run the following command, with *pod name* replaced by the actual pod name:
```
$ kubectl logs <pod name> -n kube-system -c contiv-vswitch
```

Redirect the output to a file to save the logs, for example:

```
kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt
```
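
To grab the logs from every vswitch in the cluster at once, the two commands
above can be combined into a small loop. This is just a convenience sketch;
the output file naming is arbitrary:

```
# Collect the logs of every contiv-vswitch pod, one file per pod.
for pod in $(kubectl get pods -n kube-system -o name | grep contiv-vswitch); do
  name=${pod#pod/}   # strip the "pod/" prefix for use in the file name
  kubectl logs "$name" -n kube-system -c contiv-vswitch > "logs-${name}.txt"
done
```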
#### b) Collecting Vswitch Logs Using Docker
If option a) does not work, you can still collect the same logs using the plain docker
command. Connect to each individual node in the k8s cluster and find the container ID of the vswitch container:

```
$ docker ps | grep contivvpp/vswitch
b682b5837e52   contivvpp/vswitch   "/usr/bin/supervisor…"   2 hours ago   Up 2 hours   k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0
```

Now use the ID from the first column to dump the logs into the `logs-master.txt` file:
```
$ docker logs b682b5837e52 > logs-master.txt
```

#### Reviewing the Vswitch Logs

When debugging an issue, it is good to start by grepping the logs for the `level=error` string, for example:
```
$ cat logs-master.txt | grep level=error
```

Also, VPP or the contiv-agent may crash due to a bug. To check whether a process crashed, grep for the string `exit`, for example:
```
$ cat logs-master.txt | grep exit
2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request
```

#### Collecting the STN Daemon Logs
In STN (Steal The NIC) deployment scenarios, you often need to collect and review the logs
of the STN daemon. This needs to be done on each node:
```
$ docker logs contiv-stn > logs-stn-master.txt
```

#### Collecting Logs in Case of Crash Loop
If the vswitch is crashing in a loop (indicated by an increasing number in the `RESTARTS`
column of the `kubectl get pods --all-namespaces` output), `kubectl logs` or `docker logs` would
only give us the logs of the latest incarnation of the vswitch. That might not reveal the root cause
of the very first crash, so in order to debug it, we need to disable the k8s health check probes
so that the vswitch is not restarted after the very first crash. This can be done by commenting out
the `readinessProbe` and `livenessProbe` in the contiv-vpp deployment YAML:

```diff
diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
index 3676047..ffa4473 100644
--- a/k8s/contiv-vpp.yaml
+++ b/k8s/contiv-vpp.yaml
@@ -224,18 +224,18 @@ spec:
           ports:
             # readiness + liveness probe
             - containerPort: 9999
-          readinessProbe:
-            httpGet:
-              path: /readiness
-              port: 9999
-            periodSeconds: 1
-            initialDelaySeconds: 15
-          livenessProbe:
-            httpGet:
-              path: /liveness
-              port: 9999
-            periodSeconds: 1
-            initialDelaySeconds: 60
+          # readinessProbe:
+          #   httpGet:
+          #     path: /readiness
+          #     port: 9999
+          #   periodSeconds: 1
+          #   initialDelaySeconds: 15
+          # livenessProbe:
+          #   httpGet:
+          #     path: /liveness
+          #     port: 9999
+          #   periodSeconds: 1
+          #   initialDelaySeconds: 60
           env:
             - name: MICROSERVICE_LABEL
               valueFrom:
```
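
As a lighter-weight first step, `kubectl logs` can also return the output of the
immediately preceding incarnation of a crashed container via its `--previous`
flag. This reaches only one restart back, so it does not replace the
probe-disabling approach above when the very first crash is the interesting one:

```
# Logs of the previous (crashed) incarnation of the vswitch container.
$ kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch --previous > logs-crash.txt
```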

If VPP is the crashing process, please follow the [CORE_FILES](CORE_FILES.html) guide and provide the coredump file.


### Inspect VPP Config
Inspect the following areas (a scripted way to capture all of them at once follows this list):
- Configured interfaces (for issues related to basic node/pod connectivity):
```
vpp# sh int addr
GigabitEthernet0/9/0 (up):
  192.168.16.1/24
local0 (dn):
loop0 (up):
  l2 bridge bd_id 1 bvi shg 0
  192.168.30.1/24
tapcli-0 (up):
  172.30.1.1/24
```

- IP forwarding table:
```
vpp# sh ip fib
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
0.0.0.0/0
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
    [0] [@0]: dpo-drop ip4
0.0.0.0/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
    [0] [@0]: dpo-drop ip4
...
...
255.255.255.255/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
    [0] [@0]: dpo-drop ip4
```
- ARP table:
```
vpp# sh ip arp
    Time      IP4            Flags   Ethernet            Interface
    728.6616  192.168.16.2   D       08:00:27:9c:0e:9f   GigabitEthernet0/8/0
    542.7045  192.168.30.2   S       1a:2b:3c:4d:5e:02   loop0
      1.4241  172.30.1.2     D       86:41:d5:92:fd:24   tapcli-0
     15.2485  10.1.1.2       SN      00:00:00:00:00:02   tapcli-1
    739.2339  10.1.1.3       SN      00:00:00:00:00:02   tapcli-2
    739.4119  10.1.1.4       SN      00:00:00:00:00:02   tapcli-3
```
- NAT configuration (for issues related to services):
```
DBGvpp# sh nat44 addresses
NAT44 pool addresses:
192.168.16.10
  tenant VRF independent
  0 busy udp ports
  0 busy tcp ports
  0 busy icmp ports
NAT44 twice-nat pool addresses:
```
```
vpp# sh nat44 static mappings
NAT44 static mappings:
 tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0 out2in-only
 tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0 out2in-only
 tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0 out2in-only
 tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0 out2in-only
 tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0 out2in-only
 tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0 out2in-only
 udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only
 tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only
```
```
vpp# sh nat44 interfaces
NAT44 interfaces:
 loop0 in out
 GigabitEthernet0/9/0 out
 tapcli-0 in out
```
```
vpp# sh nat44 sessions
NAT44 sessions:
  192.168.20.2: 0 dynamic translations, 3 static translations
  10.1.1.3: 0 dynamic translations, 0 static translations
  10.1.1.4: 0 dynamic translations, 0 static translations
  10.1.1.2: 0 dynamic translations, 6 static translations
  10.1.2.18: 0 dynamic translations, 2 static translations
```
- ACL config (for issues related to policies):
```
vpp# sh acl-plugin acl
```
- "Steal the NIC (STN)" config (for issues related to host connectivity when STN is active):
```
vpp# sh stn rules
- rule_index: 0
  address: 10.1.10.47
  iface: tapcli-0 (2)
  next_node: tapcli-0-output (410)
```
- Errors:
```
vpp# sh errors
```
- Vxlan tunnels:
```
vpp# sh vxlan tunnels
```
- Hardware interface information:
```
vpp# sh hardware-interfaces
```
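
All of the show commands above can also be captured non-interactively, which is
handy when attaching the output to a bug report. A minimal sketch, assuming the
`vppctl` binary is available inside the vswitch container; the pod name is a
placeholder to adapt to your cluster:

```
#!/usr/bin/env bash
# Dump the VPP state listed above into a single file via kubectl exec.
POD="contiv-vswitch-lqxfp"        # placeholder: the vswitch pod on the node of interest
OUT="vpp-config-${POD}.txt"
: > "$OUT"                        # truncate the output file

for cmd in "sh int addr" "sh ip fib" "sh ip arp" \
           "sh nat44 addresses" "sh nat44 static mappings" \
           "sh nat44 interfaces" "sh nat44 sessions" \
           "sh acl-plugin acl" "sh stn rules" "sh errors" \
           "sh vxlan tunnels" "sh hardware-interfaces"; do
  echo "===== vpp# ${cmd} =====" >> "$OUT"
  kubectl exec -n kube-system "$POD" -c contiv-vswitch -- vppctl $cmd >> "$OUT"
done
```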

### Basic Example

[contiv-vpp-bug-report.sh][1] is an example of a script that may be a useful starting point for gathering the above information using kubectl.

Limitations:
- The script does not include the STN daemon logs, nor does it handle the special
  case of a crash loop.

Prerequisites:
- The user specified in the script must have passwordless access to all nodes
  in the cluster; on each node in the cluster, the user must have passwordless
  access to sudo.

#### Setting up Prerequisites
To enable logging into a node without a password, copy your public key to the
node:
```
ssh-copy-id <user-id>@<node-name-or-ip-address>
```

To enable running sudo without a password for a given user, enter:
```
$ sudo visudo
```

Append the following entry to run all commands without a password for a given
user:
```
<userid> ALL=(ALL) NOPASSWD:ALL
```

You can also add user `<user-id>` to group `sudo` and edit the `sudo`
entry as follows:

```
# Allow members of group sudo to execute any command
%sudo ALL=(ALL:ALL) NOPASSWD:ALL
```

Add user `<user-id>` to group `<group-id>` as follows:
```
sudo adduser <user-id> <group-id>
```
or as follows:
```
usermod -a -G <group-id> <user-id>
```
#### Working with the Contiv-VPP Vagrant Test Bed
The script can be used to collect data from the [Contiv-VPP test bed created with Vagrant][2].
To collect debug information from this Contiv-VPP test bed, perform the
following steps:
* In the directory where you created your vagrant test bed, run:
```
 vagrant ssh-config > vagrant-ssh.conf
```
* To collect the debug information, run:
```
 ./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf
```

[1]: https://github.com/contiv/vpp/tree/master/scripts/contiv-vpp-bug-report.sh
[2]: https://github.com/contiv/vpp/blob/master/vagrant/README.md
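
Before running the script against a cluster, it can save time to verify the
prerequisites described above. A minimal sketch; the node list and user are
placeholders to adapt to your environment:

```
#!/usr/bin/env bash
# Verify passwordless SSH and passwordless sudo on every node before
# running contiv-vpp-bug-report.sh. NODES and USER_ID are placeholders.
NODES="k8s-master k8s-worker1 k8s-worker2"
USER_ID="vagrant"

for node in $NODES; do
  # BatchMode=yes makes ssh fail instead of prompting for a password;
  # "sudo -n" fails if sudo would require a password.
  if ssh -o BatchMode=yes "${USER_ID}@${node}" 'sudo -n true' 2>/dev/null; then
    echo "OK:   ${node}"
  else
    echo "FAIL: ${node} (passwordless SSH or sudo not configured)"
  fi
done
```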