Debugging and Reporting Bugs in Contiv-VPP
==========================================

Bug Report Structure
--------------------

-  `Deployment description <#describe-deployment>`__: Briefly describe
   the deployment where the issue was spotted: the number of k8s nodes,
   and whether DHCP, STN, or TAP is used.

-  `Logs <#collecting-the-logs>`__: Attach corresponding logs, at least
   from the vswitch pods.

-  `VPP config <#inspect-vpp-config>`__: Attach output of the show
   commands.

-  `Basic Collection Example <#basic-example>`__

Describe Deployment
~~~~~~~~~~~~~~~~~~~

Since contiv-vpp can be used with different configurations, it is
helpful to attach the config that was applied. Either attach
``values.yaml`` passed to the helm chart, or attach the `corresponding
part <https://github.com/contiv/vpp/blob/42b3bfbe8735508667b1e7f1928109a65dfd5261/k8s/contiv-vpp.yaml#L24-L38>`__
from the deployment yaml file.

.. code:: yaml

    contiv.yaml: |-
      TCPstackDisabled: true
      UseTAPInterfaces: true
      TAPInterfaceVersion: 2
      NatExternalTraffic: true
      MTUSize: 1500
      IPAMConfig:
        PodSubnetCIDR: 10.1.0.0/16
        PodNetworkPrefixLen: 24
        PodIfIPCIDR: 10.2.1.0/24
        VPPHostSubnetCIDR: 172.30.0.0/16
        VPPHostNetworkPrefixLen: 24
        NodeInterconnectCIDR: 192.168.16.0/24
        VxlanCIDR: 192.168.30.0/24
        NodeInterconnectDHCP: False

Information that might be helpful:

-  Whether node IPs are statically assigned, or if DHCP is used
-  Whether STN is enabled
-  Version of TAP interfaces used
-  Output of ``kubectl get pods -o wide --all-namespaces``

Collecting the Logs
~~~~~~~~~~~~~~~~~~~

The most essential thing that needs to be done when debugging and
**reporting an issue** in Contiv-VPP is **collecting the logs from the
contiv-vpp vswitch containers**.

a) Collecting Vswitch Logs Using kubectl
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to collect the logs from individual vswitches in the cluster,
connect to the master node and then find the POD names of the individual
vswitch containers:

::

    $ kubectl get pods --all-namespaces | grep vswitch
    kube-system   contiv-vswitch-lqxfp               2/2       Running   0          1h
    kube-system   contiv-vswitch-q6kwt               2/2       Running   0          1h

Then run the following command, with *pod name* replaced by the actual
POD name:

::

    $ kubectl logs <pod name> -n kube-system -c contiv-vswitch

Redirect the output to a file to save the logs, for example:

::

    kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt
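
To grab the logs from all vswitch pods in one pass, a loop such as the
following can be used. This is a minimal sketch that assumes ``kubectl``
access to the cluster; the output file names are illustrative:

.. code:: shell

    # Sketch: collect logs from every vswitch pod into per-pod files
    # (assumes kubectl access; file names are illustrative)
    for pod in $(kubectl get pods -n kube-system -o name | grep vswitch); do
        kubectl logs "$pod" -n kube-system -c contiv-vswitch > "logs-${pod##*/}.txt"
    done

``kubectl get pods -o name`` prints names such as
``pod/contiv-vswitch-lqxfp``; the ``${pod##*/}`` expansion strips the
``pod/`` prefix for the file name.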

b) Collecting Vswitch Logs Using Docker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If option a) does not work, then you can still collect the same logs
using the plain docker command. For that, you need to connect to each
individual node in the k8s cluster, and find the container ID of the
vswitch container:

::

    $ docker ps | grep contivvpp/vswitch
    b682b5837e52        contivvpp/vswitch                                        "/usr/bin/supervisor…"   2 hours ago         Up 2 hours                              k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0

Now use the ID from the first column to dump the logs into the
``logs-master.txt`` file:

::

    $ docker logs b682b5837e52 > logs-master.txt

Reviewing the Vswitch Logs
^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to debug an issue, it is good to start by grepping the logs for
the ``level=error`` string, for example:

::

    $ cat logs-master.txt | grep level=error

VPP or the contiv-agent may also crash. To check whether a process
crashed, grep for the string ``exit``, for example:

::

    $ cat logs-master.txt | grep exit
    2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
    2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request
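
Both checks can be run over every collected log file in one pass; a
small sketch, assuming the logs were saved as ``logs-*.txt`` as in the
examples above:

.. code:: shell

    # Sketch: scan all collected log files for errors and process exits
    # (assumes files named logs-*.txt, as produced in the steps above)
    for f in logs-*.txt; do
        echo "=== $f ==="
        grep -n 'level=error' "$f"
        grep -nE 'exited|SIGABRT|SIGSEGV' "$f"
    done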

Collecting the STN Daemon Logs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In STN (Steal The NIC) deployment scenarios, you often need to collect
and review the logs from the STN daemon. This needs to be done on each
node:

::

    $ docker logs contiv-stn > logs-stn-master.txt
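
For clusters with more than a couple of nodes, the per-node collection
can be scripted; a sketch assuming passwordless ssh to each node (the
node names are illustrative):

.. code:: shell

    # Sketch: collect STN daemon logs from every node over ssh
    # (assumes passwordless ssh; node names are illustrative)
    for node in k8s-master k8s-worker1 k8s-worker2; do
        ssh "$node" "docker logs contiv-stn" > "logs-stn-${node}.txt" 2>&1
    done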

Collecting Logs in Case of Crash Loop
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the vswitch is crashing in a loop (indicated by an increasing number
in the ``RESTARTS`` column of the ``kubectl get pods --all-namespaces``
output), then ``kubectl logs`` or ``docker logs`` only show the logs of
the latest incarnation of the vswitch. Those may not contain the root
cause of the very first crash, so to debug it, disable the k8s health
check probes so that the vswitch is not restarted after the first
crash. This can be done by commenting out the ``readinessProbe`` and
``livenessProbe`` sections in the contiv-vpp deployment YAML:

.. code:: diff

    diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
    index 3676047..ffa4473 100644
    --- a/k8s/contiv-vpp.yaml
    +++ b/k8s/contiv-vpp.yaml
    @@ -224,18 +224,18 @@ spec:
        	ports:
                  # readiness + liveness probe
                  - containerPort: 9999
    -          readinessProbe:
    -            httpGet:
    -              path: /readiness
    -              port: 9999
    -            periodSeconds: 1
    -            initialDelaySeconds: 15
    -          livenessProbe:
    -            httpGet:
    -              path: /liveness
    -              port: 9999
    -            periodSeconds: 1
    -            initialDelaySeconds: 60
    + #         readinessProbe:
    + #           httpGet:
    + #             path: /readiness
    + #             port: 9999
    + #           periodSeconds: 1
    + #           initialDelaySeconds: 15
    + #         livenessProbe:
    + #           httpGet:
    + #             path: /liveness
    + #             port: 9999
    + #           periodSeconds: 1
    + #           initialDelaySeconds: 60
        	env:
                  - name: MICROSERVICE_LABEL
                    valueFrom:

If VPP is the crashing process, please follow the
`CORE_FILES <CORE_FILES.html>`__ guide and provide the core dump file.

Inspect VPP Config
~~~~~~~~~~~~~~~~~~

Inspect the following areas:

-  Configured interfaces (issues related to basic node/pod
   connectivity):

::

    vpp# sh int addr
    GigabitEthernet0/9/0 (up):
      192.168.16.1/24
    local0 (dn):
    loop0 (up):
      l2 bridge bd_id 1 bvi shg 0
      192.168.30.1/24
    tapcli-0 (up):
      172.30.1.1/24

-  IP forwarding table:

::

    vpp# sh ip fib
    ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
    0.0.0.0/0
      unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
        [0] [@0]: dpo-drop ip4
    0.0.0.0/32
      unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
        [0] [@0]: dpo-drop ip4

    ...
    ...

    255.255.255.255/32
      unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
        [0] [@0]: dpo-drop ip4

-  ARP Table:

::

    vpp# sh ip arp
        Time           IP4       Flags      Ethernet              Interface
        728.6616  192.168.16.2     D    08:00:27:9c:0e:9f GigabitEthernet0/8/0
        542.7045  192.168.30.2     S    1a:2b:3c:4d:5e:02 loop0
          1.4241   172.30.1.2      D    86:41:d5:92:fd:24 tapcli-0
          15.2485    10.1.1.2      SN    00:00:00:00:00:02 tapcli-1
        739.2339    10.1.1.3      SN    00:00:00:00:00:02 tapcli-2
        739.4119    10.1.1.4      SN    00:00:00:00:00:02 tapcli-3

-  NAT configuration (issues related to services):

::

    DBGvpp# sh nat44 addresses
    NAT44 pool addresses:
    192.168.16.10
      tenant VRF independent
      0 busy udp ports
      0 busy tcp ports
      0 busy icmp ports
    NAT44 twice-nat pool addresses:

::

    vpp# sh nat44 static mappings
    NAT44 static mappings:
      tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0  out2in-only
      tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0  out2in-only
      tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0  out2in-only
      tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0  out2in-only
      tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0  out2in-only
      tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0  out2in-only
      udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0  out2in-only
      tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0  out2in-only

::

    vpp# sh nat44 interfaces
    NAT44 interfaces:
      loop0 in out
      GigabitEthernet0/9/0 out
      tapcli-0 in out

::

    vpp# sh nat44 sessions
    NAT44 sessions:
      192.168.20.2: 0 dynamic translations, 3 static translations
      10.1.1.3: 0 dynamic translations, 0 static translations
      10.1.1.4: 0 dynamic translations, 0 static translations
      10.1.1.2: 0 dynamic translations, 6 static translations
      10.1.2.18: 0 dynamic translations, 2 static translations

-  ACL config (issues related to policies):

::

    vpp# sh acl-plugin acl

-  “Steal the NIC (STN)” config (issues related to host connectivity
   when STN is active):

::

    vpp# sh stn rules
    - rule_index: 0
      address: 10.1.10.47
      iface: tapcli-0 (2)
      next_node: tapcli-0-output (410)

-  Errors:

::

    vpp# sh errors

-  Vxlan tunnels:

::

    vpp# sh vxlan tunnels


-  Hardware interface information:

::

    vpp# sh hardware-interfaces
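
All of the show commands above can be captured in one pass and attached
to the bug report. A minimal sketch, assuming ``kubectl`` access and
that ``vppctl`` is available inside the vswitch container (the pod name
is illustrative):

.. code:: shell

    # Sketch: dump the key VPP show commands into a single file
    # (pod name is illustrative; vppctl availability is an assumption)
    POD=contiv-vswitch-lqxfp
    for cmd in 'sh int addr' 'sh ip fib' 'sh ip arp' \
               'sh nat44 static mappings' 'sh nat44 interfaces' 'sh nat44 sessions' \
               'sh acl-plugin acl' 'sh stn rules' 'sh errors' \
               'sh vxlan tunnels' 'sh hardware-interfaces'; do
        echo "### $cmd ###"
        kubectl exec "$POD" -n kube-system -c contiv-vswitch -- vppctl "$cmd"
    done > vpp-config-master.txt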

Basic Example
~~~~~~~~~~~~~

`contiv-vpp-bug-report.sh <https://github.com/contiv/vpp/tree/master/scripts/contiv-vpp-bug-report.sh>`__
is an example of a script that may be a useful starting point for
gathering the above information using kubectl.

Limitations:

-  The script does not include the STN daemon logs, nor does it handle
   the special case of a crash loop.

Prerequisites:

-  The user specified in the script must have passwordless access to
   all nodes in the cluster; on each node, the user must have
   passwordless access to ``sudo``.

Setting up Prerequisites
^^^^^^^^^^^^^^^^^^^^^^^^

To enable logging into a node without a password, copy your public key
to the node:

::

    ssh-copy-id <user-id>@<node-name-or-ip-address>

To enable running sudo without a password for a given user, enter:

::

    $ sudo visudo

Append the following entry to allow the given user to run all commands
without a password:

::

    <userid> ALL=(ALL) NOPASSWD:ALL

You can also add user ``<user-id>`` to group ``sudo`` and edit the
``sudo`` entry as follows:

::

    # Allow members of group sudo to execute any command
    %sudo   ALL=(ALL:ALL) NOPASSWD:ALL

Add user ``<user-id>`` to group ``<group-id>`` as follows:

::

    sudo adduser <user-id> <group-id>

or as follows:

::

    usermod -a -G <group-id> <user-id>

Working with the Contiv-VPP Vagrant Test Bed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The script can be used to collect data from the `Contiv-VPP test bed
created with
Vagrant <https://github.com/contiv/vpp/blob/master/vagrant/README.md>`__.
To collect debug information from this Contiv-VPP test bed, do the
following steps:

-  In the directory where you created your vagrant test bed, do:

::

    vagrant ssh-config > vagrant-ssh.conf

-  To collect the debug information, do:

::

    ./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf