author     Scott Keeler <skeeler@cisco.com>    2018-10-01 14:50:57 -0400
committer  Dave Barach <openvpp@barachs.net>   2018-10-05 13:47:42 +0000
commit     25c4d396eae99e23c4ebe7155fde7700dd1130b9
tree       2cd3661e26e37cf3e04327559479bc6ce0c9a752 /docs/usecases/contiv/BUG_REPORTS.md
parent     2d24cd027275905f308f75bf45d0f9d163f2235b
docs: add contiv vpp
Change-Id: I92227fc4968fc6a478beb7f38707b91e9f0635ec
Signed-off-by: Scott Keeler <skeeler@cisco.com>
Diffstat (limited to 'docs/usecases/contiv/BUG_REPORTS.md')
-rw-r--r--  docs/usecases/contiv/BUG_REPORTS.md  333
1 file changed, 333 insertions, 0 deletions
diff --git a/docs/usecases/contiv/BUG_REPORTS.md b/docs/usecases/contiv/BUG_REPORTS.md
new file mode 100644
index 00000000000..23c9a7c393c
--- /dev/null
+++ b/docs/usecases/contiv/BUG_REPORTS.md
@@ -0,0 +1,333 @@

# Debugging and Reporting Bugs in Contiv-VPP

## Bug Report Structure

- [Deployment description](#describe-deployment):
Briefly describe the deployment where the issue was spotted: the number of k8s nodes and whether DHCP/STN/TAP is used.

- [Logs](#collecting-the-logs):
Attach the corresponding logs, at least from the vswitch pods.

- [VPP config](#inspect-vpp-config):
Attach the output of the show commands.

- [Basic Collection Example](#basic-example)

### Describe Deployment
Since Contiv-VPP can be used with different configurations, it is helpful
to attach the config that was applied. Either attach the `values.yaml` passed to the helm chart,
or attach the [corresponding part](https://github.com/contiv/vpp/blob/42b3bfbe8735508667b1e7f1928109a65dfd5261/k8s/contiv-vpp.yaml#L24-L38) of the deployment YAML file.

```
  contiv.yaml: |-
    TCPstackDisabled: true
    UseTAPInterfaces: true
    TAPInterfaceVersion: 2
    NatExternalTraffic: true
    MTUSize: 1500
    IPAMConfig:
      PodSubnetCIDR: 10.1.0.0/16
      PodNetworkPrefixLen: 24
      PodIfIPCIDR: 10.2.1.0/24
      VPPHostSubnetCIDR: 172.30.0.0/16
      VPPHostNetworkPrefixLen: 24
      NodeInterconnectCIDR: 192.168.16.0/24
      VxlanCIDR: 192.168.30.0/24
      NodeInterconnectDHCP: False
```

Information that might be helpful:
- Whether node IPs are statically assigned or obtained via DHCP
- Whether STN is enabled
- The version of TAP interfaces used
- The output of `kubectl get pods -o wide --all-namespaces`


### Collecting the Logs

The most essential thing to do when debugging and **reporting an issue**
in Contiv-VPP is **collecting the logs from the contiv-vpp vswitch containers**.

#### a) Collecting Vswitch Logs Using kubectl
In order to collect the logs from the individual vswitches in the cluster, connect to the master node
and find the pod names of the individual vswitch pods:

```
$ kubectl get pods --all-namespaces | grep vswitch
kube-system   contiv-vswitch-lqxfp   2/2   Running   0   1h
kube-system   contiv-vswitch-q6kwt   2/2   Running   0   1h
```

Then run the following command, with *pod name* replaced by the actual pod name:
```
$ kubectl logs <pod name> -n kube-system -c contiv-vswitch
```

Redirect the output to a file to save the logs, for example:

```
kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt
```
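To grab the logs from all vswitch pods in one pass, a small shell loop can help. This is only a sketch; it assumes the vswitch pod names contain the string `vswitch`, as in the listing above, and writes one log file per pod:

```
for pod in $(kubectl get pods -n kube-system | awk '/vswitch/ {print $1}'); do
  # one log file per vswitch pod, e.g. logs-contiv-vswitch-lqxfp.txt
  kubectl logs "$pod" -n kube-system -c contiv-vswitch > "logs-${pod}.txt"
done
```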
#### b) Collecting Vswitch Logs Using Docker
If option a) does not work, you can still collect the same logs using the plain docker
command. For that, connect to each individual node in the k8s cluster and find the container ID of the vswitch container:

```
$ docker ps | grep contivvpp/vswitch
b682b5837e52   contivvpp/vswitch   "/usr/bin/supervisor…"   2 hours ago   Up 2 hours   k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0
```

Now use the ID from the first column to dump the logs into the `logs-master.txt` file:
```
$ docker logs b682b5837e52 > logs-master.txt
```

#### Reviewing the Vswitch Logs

In order to debug an issue, it is good to start by grepping the logs for the `level=error` string, for example:
```
$ cat logs-master.txt | grep level=error
```

Some bugs can also cause VPP or the contiv-agent to crash.
To check whether a process has crashed, grep for the string `exit`, for example:
```
$ cat logs-master.txt | grep exit
2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request
```

#### Collecting the STN Daemon Logs
In STN (Steal The NIC) deployment scenarios, you often need to collect and review the logs
from the STN daemon. This needs to be done on each node:
```
$ docker logs contiv-stn > logs-stn-master.txt
```

#### Collecting Logs in Case of Crash Loop
If the vswitch is crashing in a loop (which can be seen from the increasing number in the `RESTARTS`
column of the `kubectl get pods --all-namespaces` output), `kubectl logs` or `docker logs` would
only give us the logs of the latest incarnation of the vswitch. Those may not capture the root cause
of the very first crash, so in order to debug it, we need to disable the k8s health check probes so
that the vswitch is not restarted after the very first crash. This can be done by commenting out the
`readinessProbe` and `livenessProbe` in the contiv-vpp deployment YAML:

```diff
diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
index 3676047..ffa4473 100644
--- a/k8s/contiv-vpp.yaml
+++ b/k8s/contiv-vpp.yaml
@@ -224,18 +224,18 @@ spec:
         ports:
           # readiness + liveness probe
           - containerPort: 9999
-        readinessProbe:
-          httpGet:
-            path: /readiness
-            port: 9999
-          periodSeconds: 1
-          initialDelaySeconds: 15
-        livenessProbe:
-          httpGet:
-            path: /liveness
-            port: 9999
-          periodSeconds: 1
-          initialDelaySeconds: 60
+        # readinessProbe:
+        #   httpGet:
+        #     path: /readiness
+        #     port: 9999
+        #   periodSeconds: 1
+        #   initialDelaySeconds: 15
+        # livenessProbe:
+        #   httpGet:
+        #     path: /liveness
+        #     port: 9999
+        #   periodSeconds: 1
+        #   initialDelaySeconds: 60
         env:
           - name: MICROSERVICE_LABEL
             valueFrom:
```

If VPP is the crashing process, please follow the [CORE_FILES](CORE_FILES.html) guide and provide the coredump file.
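As a quick first look, `kubectl logs` can also print the output of the previously terminated instance of a container via its `--previous` flag. Note that this only reaches back one restart, so it does not replace the probe-disabling approach above when the very first crash is what matters:

```
$ kubectl logs <pod name> --previous -n kube-system -c contiv-vswitch > logs-previous-master.txt
```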
### Inspect VPP Config
Inspect the following areas (a sketch for capturing all of them non-interactively follows this list):
- Configured interfaces (issues related to basic node/pod connectivity):
```
vpp# sh int addr
GigabitEthernet0/9/0 (up):
  192.168.16.1/24
local0 (dn):
loop0 (up):
  l2 bridge bd_id 1 bvi shg 0
  192.168.30.1/24
tapcli-0 (up):
  172.30.1.1/24
```

- IP forwarding table:
```
vpp# sh ip fib
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
0.0.0.0/0
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
    [0] [@0]: dpo-drop ip4
0.0.0.0/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
    [0] [@0]: dpo-drop ip4

...
...

255.255.255.255/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
    [0] [@0]: dpo-drop ip4
```
- ARP table:
```
vpp# sh ip arp
    Time      IP4              Flags  Ethernet           Interface
    728.6616  192.168.16.2     D      08:00:27:9c:0e:9f  GigabitEthernet0/8/0
    542.7045  192.168.30.2     S      1a:2b:3c:4d:5e:02  loop0
      1.4241  172.30.1.2       D      86:41:d5:92:fd:24  tapcli-0
     15.2485  10.1.1.2         SN     00:00:00:00:00:02  tapcli-1
    739.2339  10.1.1.3         SN     00:00:00:00:00:02  tapcli-2
    739.4119  10.1.1.4         SN     00:00:00:00:00:02  tapcli-3
```
- NAT configuration (issues related to services):
```
DBGvpp# sh nat44 addresses
NAT44 pool addresses:
192.168.16.10
  tenant VRF independent
  0 busy udp ports
  0 busy tcp ports
  0 busy icmp ports
NAT44 twice-nat pool addresses:
```
```
vpp# sh nat44 static mappings
NAT44 static mappings:
 tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0 out2in-only
 tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0 out2in-only
 tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0 out2in-only
 tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0 out2in-only
 tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0 out2in-only
 tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0 out2in-only
 udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only
 tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only
```
```
vpp# sh nat44 interfaces
NAT44 interfaces:
 loop0 in out
 GigabitEthernet0/9/0 out
 tapcli-0 in out
```
```
vpp# sh nat44 sessions
NAT44 sessions:
  192.168.20.2: 0 dynamic translations, 3 static translations
  10.1.1.3: 0 dynamic translations, 0 static translations
  10.1.1.4: 0 dynamic translations, 0 static translations
  10.1.1.2: 0 dynamic translations, 6 static translations
  10.1.2.18: 0 dynamic translations, 2 static translations
```
- ACL config (issues related to policies):
```
vpp# sh acl-plugin acl
```
- "Steal the NIC (STN)" config (issues related to host connectivity when STN is active):
```
vpp# sh stn rules
- rule_index: 0
  address: 10.1.10.47
  iface: tapcli-0 (2)
  next_node: tapcli-0-output (410)
```
- Errors:
```
vpp# sh errors
```
- Vxlan tunnels:
```
vpp# sh vxlan tunnels
```
- Hardware interface information:
```
vpp# sh hardware-interfaces
```
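If you prefer to capture all of these show commands in one step rather than typing them at the VPP prompt, a loop over `kubectl exec` can be used. This is only a sketch: it assumes that `vppctl` is present in the contiv-vswitch container and can reach VPP's CLI socket at its default location; if that is not the case in your deployment, run the same commands from wherever you normally access the VPP CLI:

```
POD=contiv-vswitch-lqxfp   # replace with the actual vswitch pod name
for cmd in "sh int addr" "sh ip fib" "sh ip arp" "sh nat44 addresses" \
           "sh nat44 static mappings" "sh nat44 interfaces" "sh nat44 sessions" \
           "sh acl-plugin acl" "sh stn rules" "sh errors" "sh vxlan tunnels" \
           "sh hardware-interfaces"; do
  # print a header per command, then the command output
  echo "===== $cmd ====="
  kubectl exec "$POD" -n kube-system -c contiv-vswitch -- vppctl "$cmd"
done > vpp-config-master.txt
```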
### Basic Example

[contiv-vpp-bug-report.sh][1] is an example of a script that may be a useful starting point for gathering the above information using kubectl.

Limitations:
- The script does not collect the STN daemon logs, nor does it handle the special
  case of a crash loop.

Prerequisites:
- The user specified in the script must have passwordless access to all nodes
  in the cluster; on each node in the cluster, the user must have passwordless
  access to sudo.

#### Setting up Prerequisites
To enable logging into a node without a password, copy your public key to the
node:
```
ssh-copy-id <user-id>@<node-name-or-ip-address>
```

To enable running sudo without a password for a given user, enter:
```
$ sudo visudo
```

Append the following entry to run ALL commands without a password for the given
user:
```
<userid> ALL=(ALL) NOPASSWD:ALL
```

You can also add user `<user-id>` to group `sudo` and edit the `sudo`
entry as follows:

```
# Allow members of group sudo to execute any command
%sudo ALL=(ALL:ALL) NOPASSWD:ALL
```

Add user `<user-id>` to group `<group-id>` as follows:
```
sudo adduser <user-id> <group-id>
```
or as follows:
```
usermod -a -G <group-id> <user-id>
```
#### Working with the Contiv-VPP Vagrant Test Bed
The script can be used to collect data from the [Contiv-VPP test bed created with Vagrant][2].
To collect debug information from this Contiv-VPP test bed, perform the following steps:
* In the directory where you created your vagrant test bed, run:
```
 vagrant ssh-config > vagrant-ssh.conf
```
* To collect the debug information, run:
```
 ./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf
```

[1]: https://github.com/contiv/vpp/tree/master/scripts/contiv-vpp-bug-report.sh
[2]: https://github.com/contiv/vpp/blob/master/vagrant/README.md
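Whether you use the Vagrant test bed or your own cluster, it can save time to verify the passwordless SSH and sudo prerequisites described above before starting a collection. A minimal check, with placeholder node names:

```
# Each iteration should print the node's hostname without prompting for a password;
# "sudo -n" fails immediately instead of prompting if passwordless sudo is missing.
for node in k8s-master k8s-worker1 k8s-worker2; do
  ssh <user-id>@$node 'sudo -n hostname'
done
```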