aboutsummaryrefslogtreecommitdiffstats
path: root/docs/usecases/contiv/BUG_REPORTS.rst
diff options
context:
space:
mode:
Diffstat (limited to 'docs/usecases/contiv/BUG_REPORTS.rst')
-rw-r--r--docs/usecases/contiv/BUG_REPORTS.rst401
1 files changed, 401 insertions, 0 deletions
diff --git a/docs/usecases/contiv/BUG_REPORTS.rst b/docs/usecases/contiv/BUG_REPORTS.rst
new file mode 100644
index 00000000000..8e55d5b3c8d
--- /dev/null
+++ b/docs/usecases/contiv/BUG_REPORTS.rst
@@ -0,0 +1,401 @@
+Debugging and Reporting Bugs in Contiv-VPP
+==========================================
+
+Bug Report Structure
+--------------------
+
+- `Deployment description <#describe-deployment>`__: Briefly describes
+ the deployment, where an issue was spotted, number of k8s nodes, is
+ DHCP/STN/TAP used.
+
+- `Logs <#collecting-the-logs>`__: Attach corresponding logs, at least
+ from the vswitch pods.
+
+- `VPP config <#inspect-vpp-config>`__: Attach output of the show
+ commands.
+
+- `Basic Collection Example <#basic-example>`__
+
+Describe Deployment
+~~~~~~~~~~~~~~~~~~~
+
+Since contiv-vpp can be used with different configurations, it is
+helpful to attach the config that was applied. Either attach
+``values.yaml`` passed to the helm chart, or attach the `corresponding
+part <https://github.com/contiv/vpp/blob/42b3bfbe8735508667b1e7f1928109a65dfd5261/k8s/contiv-vpp.yaml#L24-L38>`__
+from the deployment yaml file.
+
+.. code:: yaml
+
+ contiv.yaml: |-
+ TCPstackDisabled: true
+ UseTAPInterfaces: true
+ TAPInterfaceVersion: 2
+ NatExternalTraffic: true
+ MTUSize: 1500
+ IPAMConfig:
+ PodSubnetCIDR: 10.1.0.0/16
+ PodNetworkPrefixLen: 24
+ PodIfIPCIDR: 10.2.1.0/24
+ VPPHostSubnetCIDR: 172.30.0.0/16
+ VPPHostNetworkPrefixLen: 24
+ NodeInterconnectCIDR: 192.168.16.0/24
+ VxlanCIDR: 192.168.30.0/24
+ NodeInterconnectDHCP: False
+
+Information that might be helpful: - Whether node IPs are statically
+assigned, or if DHCP is used - STN is enabled - Version of TAP
+interfaces used - Output of
+``kubectl get pods -o wide --all-namespaces``
+
+Collecting the Logs
+~~~~~~~~~~~~~~~~~~~
+
+The most essential thing that needs to be done when debugging and
+**reporting an issue** in Contiv-VPP is **collecting the logs from the
+contiv-vpp vswitch containers**.
+
+a) Collecting Vswitch Logs Using kubectl
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In order to collect the logs from individual vswitches in the cluster,
+connect to the master node and then find the POD names of the individual
+vswitch containers:
+
+::
+
+ $ kubectl get pods --all-namespaces | grep vswitch
+ kube-system contiv-vswitch-lqxfp 2/2 Running 0 1h
+ kube-system contiv-vswitch-q6kwt 2/2 Running 0 1h
+
+Then run the following command, with *pod name* replaced by the actual
+POD name:
+
+::
+
+ $ kubectl logs <pod name> -n kube-system -c contiv-vswitch
+
+Redirect the output to a file to save the logs, for example:
+
+::
+
+ kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt
+
+b) Collecting Vswitch Logs Using Docker
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If option a) does not work, then you can still collect the same logs
+using the plain docker command. For that, you need to connect to each
+individual node in the k8s cluster, and find the container ID of the
+vswitch container:
+
+::
+
+ $ docker ps | grep contivvpp/vswitch
+ b682b5837e52 contivvpp/vswitch "/usr/bin/supervisor…" 2 hours ago Up 2 hours k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0
+
+Now use the ID from the first column to dump the logs into the
+``logs-master.txt`` file:
+
+::
+
+ $ docker logs b682b5837e52 > logs-master.txt
+
+Reviewing the Vswitch Logs
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In order to debug an issue, it is good to start by grepping the logs for
+the ``level=error`` string, for example:
+
+::
+
+ $ cat logs-master.txt | grep level=error
+
+Also, VPP or contiv-agent may crash with some bugs. To check if some
+process crashed, grep for the string ``exit``, for example:
+
+::
+
+ $ cat logs-master.txt | grep exit
+ 2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
+ 2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request
+
+Collecting the STN Daemon Logs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In STN (Steal The NIC) deployment scenarios, often need to collect and
+review the logs from the STN daemon. This needs to be done on each node:
+
+::
+
+ $ docker logs contiv-stn > logs-stn-master.txt
+
+Collecting Logs in Case of Crash Loop
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If the vswitch is crashing in a loop (which can be determined by
+increasing the number in the ``RESTARTS`` column of the
+``kubectl get pods --all-namespaces`` output), the ``kubectl logs`` or
+``docker logs`` would give us the logs of the latest incarnation of the
+vswitch. That might not be the original root cause of the very first
+crash, so in order to debug that, we need to disable k8s health check
+probes to not restart the vswitch after the very first crash. This can
+be done by commenting-out the ``readinessProbe`` and ``livenessProbe``
+in the contiv-vpp deployment YAML:
+
+.. code:: diff
+
+ diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
+ index 3676047..ffa4473 100644
+ --- a/k8s/contiv-vpp.yaml
+ +++ b/k8s/contiv-vpp.yaml
+ @@ -224,18 +224,18 @@ spec:
+ ports:
+ # readiness + liveness probe
+ - containerPort: 9999
+ - readinessProbe:
+ - httpGet:
+ - path: /readiness
+ - port: 9999
+ - periodSeconds: 1
+ - initialDelaySeconds: 15
+ - livenessProbe:
+ - httpGet:
+ - path: /liveness
+ - port: 9999
+ - periodSeconds: 1
+ - initialDelaySeconds: 60
+ + # readinessProbe:
+ + # httpGet:
+ + # path: /readiness
+ + # port: 9999
+ + # periodSeconds: 1
+ + # initialDelaySeconds: 15
+ + # livenessProbe:
+ + # httpGet:
+ + # path: /liveness
+ + # port: 9999
+ + # periodSeconds: 1
+ + # initialDelaySeconds: 60
+ env:
+ - name: MICROSERVICE_LABEL
+ valueFrom:
+
+If VPP is the crashing process, please follow the
+[CORE_FILES](CORE_FILES.html) guide and provide the coredump file.
+
+Inspect VPP Config
+~~~~~~~~~~~~~~~~~~
+
+Inspect the following areas: - Configured interfaces (issues related
+basic node/pod connectivity issues):
+
+::
+
+ vpp# sh int addr
+ GigabitEthernet0/9/0 (up):
+ 192.168.16.1/24
+ local0 (dn):
+ loop0 (up):
+ l2 bridge bd_id 1 bvi shg 0
+ 192.168.30.1/24
+ tapcli-0 (up):
+ 172.30.1.1/24
+
+- IP forwarding table:
+
+::
+
+ vpp# sh ip fib
+ ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
+ 0.0.0.0/0
+ unicast-ip4-chain
+ [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
+ [0] [@0]: dpo-drop ip4
+ 0.0.0.0/32
+ unicast-ip4-chain
+ [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
+ [0] [@0]: dpo-drop ip4
+
+ ...
+ ...
+
+ 255.255.255.255/32
+ unicast-ip4-chain
+ [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
+ [0] [@0]: dpo-drop ip4
+
+- ARP Table:
+
+::
+
+ vpp# sh ip arp
+ Time IP4 Flags Ethernet Interface
+ 728.6616 192.168.16.2 D 08:00:27:9c:0e:9f GigabitEthernet0/8/0
+ 542.7045 192.168.30.2 S 1a:2b:3c:4d:5e:02 loop0
+ 1.4241 172.30.1.2 D 86:41:d5:92:fd:24 tapcli-0
+ 15.2485 10.1.1.2 SN 00:00:00:00:00:02 tapcli-1
+ 739.2339 10.1.1.3 SN 00:00:00:00:00:02 tapcli-2
+ 739.4119 10.1.1.4 SN 00:00:00:00:00:02 tapcli-3
+
+- NAT configuration (issues related to services):
+
+::
+
+ DBGvpp# sh nat44 addresses
+ NAT44 pool addresses:
+ 192.168.16.10
+ tenant VRF independent
+ 0 busy udp ports
+ 0 busy tcp ports
+ 0 busy icmp ports
+ NAT44 twice-nat pool addresses:
+
+::
+
+ vpp# sh nat44 static mappings
+ NAT44 static mappings:
+ tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0 out2in-only
+ tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0 out2in-only
+ tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0 out2in-only
+ tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0 out2in-only
+ tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0 out2in-only
+ tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0 out2in-only
+ udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only
+ tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only
+
+::
+
+ vpp# sh nat44 interfaces
+ NAT44 interfaces:
+ loop0 in out
+ GigabitEthernet0/9/0 out
+ tapcli-0 in out
+
+::
+
+ vpp# sh nat44 sessions
+ NAT44 sessions:
+ 192.168.20.2: 0 dynamic translations, 3 static translations
+ 10.1.1.3: 0 dynamic translations, 0 static translations
+ 10.1.1.4: 0 dynamic translations, 0 static translations
+ 10.1.1.2: 0 dynamic translations, 6 static translations
+ 10.1.2.18: 0 dynamic translations, 2 static translations
+
+- ACL config (issues related to policies):
+
+::
+
+ vpp# sh acl-plugin acl
+
+- “Steal the NIC (STN)” config (issues related to host connectivity
+ when STN is active):
+
+::
+
+ vpp# sh stn rules
+ - rule_index: 0
+ address: 10.1.10.47
+ iface: tapcli-0 (2)
+ next_node: tapcli-0-output (410)
+
+- Errors:
+
+::
+
+ vpp# sh errors
+
+- Vxlan tunnels:
+
+::
+
+ vpp# sh vxlan tunnels
+
+- Vxlan tunnels:
+
+::
+
+ vpp# sh vxlan tunnels
+
+- Hardware interface information:
+
+::
+
+ vpp# sh hardware-interfaces
+
+Basic Example
+~~~~~~~~~~~~~
+
+`contiv-vpp-bug-report.sh <https://github.com/contiv/vpp/tree/master/scripts/contiv-vpp-bug-report.sh>`__
+is an example of a script that may be a useful starting point to
+gathering the above information using kubectl.
+
+Limitations: - The script does not include STN daemon logs nor does it
+handle the special case of a crash loop
+
+Prerequisites: - The user specified in the script must have passwordless
+access to all nodes in the cluster; on each node in the cluster the user
+must have passwordless access to sudo.
+
+Setting up Prerequisites
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+To enable logging into a node without a password, copy your public key
+to the following node:
+
+::
+
+ ssh-copy-id <user-id>@<node-name-or-ip-address>
+
+To enable running sudo without a password for a given user, enter:
+
+::
+
+ $ sudo visudo
+
+Append the following entry to run ALL command without a password for a
+given user:
+
+::
+
+ <userid> ALL=(ALL) NOPASSWD:ALL
+
+You can also add user ``<user-id>`` to group ``sudo`` and edit the
+``sudo`` entry as follows:
+
+::
+
+ # Allow members of group sudo to execute any command
+ %sudo ALL=(ALL:ALL) NOPASSWD:ALL
+
+Add user ``<user-id>`` to group ``<group-id>`` as follows:
+
+::
+
+ sudo adduser <user-id> <group-id>
+
+or as follows:
+
+::
+
+ usermod -a -G <group-id> <user-id>
+
+Working with the Contiv-VPP Vagrant Test Bed
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The script can be used to collect data from the `Contiv-VPP test bed
+created with
+Vagrant <https://github.com/contiv/vpp/blob/master/vagrant/README.md>`__.
+To collect debug information from this Contiv-VPP test bed, do the
+following steps: \* In the directory where you created your vagrant test
+bed, do:
+
+::
+
+ vagrant ssh-config > vagrant-ssh.conf
+
+- To collect the debug information do:
+
+::
+
+ ./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf