Diffstat (limited to 'docs/usecases/contiv/BUG_REPORTS.rst')
-rw-r--r-- | docs/usecases/contiv/BUG_REPORTS.rst | 401
1 file changed, 401 insertions, 0 deletions
diff --git a/docs/usecases/contiv/BUG_REPORTS.rst b/docs/usecases/contiv/BUG_REPORTS.rst
new file mode 100644
index 00000000000..8e55d5b3c8d
--- /dev/null
+++ b/docs/usecases/contiv/BUG_REPORTS.rst
@@ -0,0 +1,401 @@

Debugging and Reporting Bugs in Contiv-VPP
==========================================

Bug Report Structure
--------------------

- `Deployment description <#describe-deployment>`__: Briefly describes
  the deployment where the issue was spotted: the number of k8s nodes,
  and whether DHCP, STN, or TAP is used.

- `Logs <#collecting-the-logs>`__: Attach the corresponding logs, at
  least from the vswitch pods.

- `VPP config <#inspect-vpp-config>`__: Attach the output of the show
  commands.

- `Basic Collection Example <#basic-example>`__

Describe Deployment
~~~~~~~~~~~~~~~~~~~

Since contiv-vpp can be used with different configurations, it is
helpful to attach the config that was applied. Either attach the
``values.yaml`` passed to the helm chart, or attach the `corresponding
part <https://github.com/contiv/vpp/blob/42b3bfbe8735508667b1e7f1928109a65dfd5261/k8s/contiv-vpp.yaml#L24-L38>`__
of the deployment yaml file.

.. code:: yaml

   contiv.yaml: |-
     TCPstackDisabled: true
     UseTAPInterfaces: true
     TAPInterfaceVersion: 2
     NatExternalTraffic: true
     MTUSize: 1500
     IPAMConfig:
       PodSubnetCIDR: 10.1.0.0/16
       PodNetworkPrefixLen: 24
       PodIfIPCIDR: 10.2.1.0/24
       VPPHostSubnetCIDR: 172.30.0.0/16
       VPPHostNetworkPrefixLen: 24
       NodeInterconnectCIDR: 192.168.16.0/24
       VxlanCIDR: 192.168.30.0/24
       NodeInterconnectDHCP: False

Information that might be helpful:

- Whether node IPs are statically assigned, or DHCP is used
- Whether STN is enabled
- Version of the TAP interfaces used
- Output of ``kubectl get pods -o wide --all-namespaces``

Collecting the Logs
~~~~~~~~~~~~~~~~~~~

The most essential step when debugging and **reporting an issue** in
Contiv-VPP is **collecting the logs from the contiv-vpp vswitch
containers**.

a) Collecting Vswitch Logs Using kubectl
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to collect the logs from the individual vswitches in the
cluster, connect to the master node and find the POD names of the
vswitch containers:

::

   $ kubectl get pods --all-namespaces | grep vswitch
   kube-system   contiv-vswitch-lqxfp   2/2       Running   0          1h
   kube-system   contiv-vswitch-q6kwt   2/2       Running   0          1h

Then run the following command, with *pod name* replaced by the actual
POD name:

::

   $ kubectl logs <pod name> -n kube-system -c contiv-vswitch

Redirect the output to a file to save the logs, for example:

::

   kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt
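When a report needs logs from every node, a small loop can save one log
file per vswitch pod. This is a minimal sketch, assuming the pod names
keep the ``contiv-vswitch-`` prefix shown above:

::

   # Save the logs of every vswitch pod into a separate file,
   # e.g. logs-contiv-vswitch-lqxfp.txt.
   for pod in $(kubectl get pods -n kube-system -o name | grep contiv-vswitch); do
       kubectl logs "$pod" -n kube-system -c contiv-vswitch > "logs-${pod##*/}.txt"
   done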
b) Collecting Vswitch Logs Using Docker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If option a) does not work, you can still collect the same logs using
the plain docker command. For that, connect to each individual node in
the k8s cluster and find the container ID of the vswitch container:

::

   $ docker ps | grep contivvpp/vswitch
   b682b5837e52        contivvpp/vswitch      "/usr/bin/supervisor…"   2 hours ago         Up 2 hours          k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0

Now use the ID from the first column to dump the logs into the
``logs-master.txt`` file:

::

   $ docker logs b682b5837e52 > logs-master.txt

Reviewing the Vswitch Logs
^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to debug an issue, it is good to start by grepping the logs
for the ``level=error`` string, for example:

::

   $ cat logs-master.txt | grep level=error

Also, VPP or the contiv-agent may crash due to some bugs. To check
whether a process has crashed, grep for the string ``exit``, for
example:

::

   $ cat logs-master.txt | grep exit
   2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
   2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request

Collecting the STN Daemon Logs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In STN (Steal The NIC) deployment scenarios, it is often necessary to
collect and review the logs from the STN daemon. This needs to be done
on each node:

::

   $ docker logs contiv-stn > logs-stn-master.txt

Collecting Logs in Case of Crash Loop
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the vswitch is crashing in a loop (which can be spotted by an
increasing number in the ``RESTARTS`` column of the
``kubectl get pods --all-namespaces`` output), ``kubectl logs`` or
``docker logs`` would only give us the logs of the latest incarnation
of the vswitch. That might not contain the root cause of the very
first crash, so in order to debug that, we need to disable the k8s
health check probes so that the vswitch is not restarted after the
very first crash. This can be done by commenting out the
``readinessProbe`` and ``livenessProbe`` in the contiv-vpp deployment
YAML:

.. code:: diff

   diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
   index 3676047..ffa4473 100644
   --- a/k8s/contiv-vpp.yaml
   +++ b/k8s/contiv-vpp.yaml
   @@ -224,18 +224,18 @@ spec:
             ports:
               # readiness + liveness probe
               - containerPort: 9999
   -          readinessProbe:
   -            httpGet:
   -              path: /readiness
   -              port: 9999
   -            periodSeconds: 1
   -            initialDelaySeconds: 15
   -          livenessProbe:
   -            httpGet:
   -              path: /liveness
   -              port: 9999
   -            periodSeconds: 1
   -            initialDelaySeconds: 60
   +          # readinessProbe:
   +          #   httpGet:
   +          #     path: /readiness
   +          #     port: 9999
   +          #   periodSeconds: 1
   +          #   initialDelaySeconds: 15
   +          # livenessProbe:
   +          #   httpGet:
   +          #     path: /liveness
   +          #     port: 9999
   +          #   periodSeconds: 1
   +          #   initialDelaySeconds: 60
             env:
             - name: MICROSERVICE_LABEL
               valueFrom:

If VPP is the crashing process, please follow the
`CORE_FILES <CORE_FILES.html>`__ guide and provide the core dump file.
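As a quicker first check in a crash-loop situation, ``kubectl logs
--previous`` returns the output of the previously terminated container
instance, which may already contain the crash you are after. It only
reaches one restart back, so commenting out the probes as shown above
remains the reliable way to capture the very first crash. A minimal
sketch, with the pod name taken from the earlier examples:

::

   # Logs of the currently running (restarted) vswitch instance:
   kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-current.txt

   # Logs of the previously terminated instance:
   kubectl logs --previous contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-previous.txt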
Inspect VPP Config
~~~~~~~~~~~~~~~~~~

Inspect the following areas:

- Configured interfaces (issues related to basic node/pod
  connectivity):

::

   vpp# sh int addr
   GigabitEthernet0/9/0 (up):
     192.168.16.1/24
   local0 (dn):
   loop0 (up):
     l2 bridge bd_id 1 bvi shg 0
     192.168.30.1/24
   tapcli-0 (up):
     172.30.1.1/24

- IP forwarding table:

::

   vpp# sh ip fib
   ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
   0.0.0.0/0
     unicast-ip4-chain
     [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
       [0] [@0]: dpo-drop ip4
   0.0.0.0/32
     unicast-ip4-chain
     [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
       [0] [@0]: dpo-drop ip4

   ...
   ...

   255.255.255.255/32
     unicast-ip4-chain
     [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
       [0] [@0]: dpo-drop ip4

- ARP Table:

::

   vpp# sh ip arp
     Time      IP4        Flags   Ethernet           Interface
     728.6616  192.168.16.2   D   08:00:27:9c:0e:9f  GigabitEthernet0/8/0
     542.7045  192.168.30.2   S   1a:2b:3c:4d:5e:02  loop0
       1.4241  172.30.1.2     D   86:41:d5:92:fd:24  tapcli-0
      15.2485  10.1.1.2       SN  00:00:00:00:00:02  tapcli-1
     739.2339  10.1.1.3       SN  00:00:00:00:00:02  tapcli-2
     739.4119  10.1.1.4       SN  00:00:00:00:00:02  tapcli-3

- NAT configuration (issues related to services):

::

   DBGvpp# sh nat44 addresses
   NAT44 pool addresses:
   192.168.16.10
     tenant VRF independent
     0 busy udp ports
     0 busy tcp ports
     0 busy icmp ports
   NAT44 twice-nat pool addresses:

::

   vpp# sh nat44 static mappings
   NAT44 static mappings:
    tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0  out2in-only
    tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0  out2in-only
    tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0  out2in-only
    tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0  out2in-only
    tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0  out2in-only
    tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0  out2in-only
    udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0  out2in-only
    tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0  out2in-only

::

   vpp# sh nat44 interfaces
   NAT44 interfaces:
    loop0 in out
    GigabitEthernet0/9/0 out
    tapcli-0 in out

::

   vpp# sh nat44 sessions
   NAT44 sessions:
     192.168.20.2: 0 dynamic translations, 3 static translations
     10.1.1.3: 0 dynamic translations, 0 static translations
     10.1.1.4: 0 dynamic translations, 0 static translations
     10.1.1.2: 0 dynamic translations, 6 static translations
     10.1.2.18: 0 dynamic translations, 2 static translations

- ACL config (issues related to policies):

::

   vpp# sh acl-plugin acl

- “Steal the NIC (STN)” config (issues related to host connectivity
  when STN is active):

::

   vpp# sh stn rules
   - rule_index: 0
     address: 10.1.10.47
     iface: tapcli-0 (2)
     next_node: tapcli-0-output (410)

- Errors:

::

   vpp# sh errors

- Vxlan tunnels:

::

   vpp# sh vxlan tunnels

- Hardware interface information:

::

   vpp# sh hardware-interfaces
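The show commands above are entered at the VPP CLI prompt. To capture
the same output non-interactively for a bug report, they can also be
invoked through ``kubectl exec``, assuming the ``vppctl`` binary is
available inside the vswitch container; the pod name below is taken
from the earlier examples:

::

   # Dump a selection of VPP show commands from one vswitch pod into a single file.
   for cmd in "sh int addr" "sh ip fib" "sh ip arp" "sh nat44 static mappings" "sh errors"; do
       echo "===== vpp# ${cmd} =====" >> vpp-config-master.txt
       kubectl exec -n kube-system contiv-vswitch-lqxfp -c contiv-vswitch -- vppctl ${cmd} >> vpp-config-master.txt
   done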
Basic Example
~~~~~~~~~~~~~

`contiv-vpp-bug-report.sh <https://github.com/contiv/vpp/tree/master/scripts/contiv-vpp-bug-report.sh>`__
is an example of a script that may be a useful starting point for
gathering the above information using kubectl.

Limitations:

- The script does not include the STN daemon logs, nor does it handle
  the special case of a crash loop.

Prerequisites:

- The user specified in the script must have passwordless access to
  all nodes in the cluster; on each node in the cluster, the user must
  have passwordless access to sudo.

Setting up Prerequisites
^^^^^^^^^^^^^^^^^^^^^^^^

To enable logging into a node without a password, copy your public key
to the node:

::

   ssh-copy-id <user-id>@<node-name-or-ip-address>

To enable running sudo without a password for a given user, enter:

::

   $ sudo visudo

Append the following entry to run all commands without a password for
the given user:

::

   <userid> ALL=(ALL) NOPASSWD:ALL

You can also add user ``<user-id>`` to group ``sudo`` and edit the
``sudo`` entry as follows:

::

   # Allow members of group sudo to execute any command
   %sudo ALL=(ALL:ALL) NOPASSWD:ALL

Add user ``<user-id>`` to group ``<group-id>`` as follows:

::

   sudo adduser <user-id> <group-id>

or as follows:

::

   usermod -a -G <group-id> <user-id>

Working with the Contiv-VPP Vagrant Test Bed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The script can be used to collect data from the `Contiv-VPP test bed
created with
Vagrant <https://github.com/contiv/vpp/blob/master/vagrant/README.md>`__.
To collect debug information from this Contiv-VPP test bed, do the
following steps:

- In the directory where you created your vagrant test bed, do:

::

   vagrant ssh-config > vagrant-ssh.conf

- To collect the debug information, do:

::

   ./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf
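For a cluster that was not created with Vagrant, a similar invocation
might look like the following, assuming the passwordless SSH and sudo
prerequisites above are in place and no explicit SSH config file is
needed; the user and master name are placeholders, and the script's
usage/help should be consulted for the full set of options:

::

   ./contiv-vpp-bug-report.sh -u <user-id> -m <k8s-master-node-name-or-ip>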