.. BSD LICENSE
Copyright(c) 2010-2015 Intel Corporation. All rights reserved.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution.
* Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Vhost Sample Application
========================
The vhost sample application demonstrates integration of the Data Plane Development Kit (DPDK)
with the Linux* KVM hypervisor by implementing the vhost-net offload API.
The sample application performs simple packet switching between virtual machines based on Media Access Control
(MAC) address or Virtual Local Area Network (VLAN) tag.
The splitting of Ethernet traffic from an external switch is performed in hardware by the Virtual Machine Device Queues
(VMDQ) and Data Center Bridging (DCB) features of the Intel® 82599 10 Gigabit Ethernet Controller.
Background
----------
Virtio networking (virtio-net) was developed as the Linux* KVM para-virtualized method for communicating network packets
between host and guest.
It was found that virtio-net performance was poor due to context switching and packet copying between host, guest, and QEMU.
The following figure shows the system architecture for virtio-based networking (virtio-net).
.. _figure_qemu_virtio_net:
.. figure:: img/qemu_virtio_net.*
System Architecture for Virtio-based Networking (virtio-net).
The Linux* Kernel vhost-net module was developed as an offload mechanism for virtio-net.
The vhost-net module enables KVM (QEMU) to offload the servicing of virtio-net devices to the vhost-net kernel module,
reducing the context switching and packet copies in the virtual dataplane.
This is achieved by QEMU sharing the following information with the vhost-net module through the vhost-net API:
* The layout of the guest memory space, to enable the vhost-net module to translate addresses.
* The locations of virtual queues in QEMU virtual address space,
to enable the vhost module to read/write directly to and from the virtqueues.
* An event file descriptor (eventfd) configured in KVM to send interrupts to the virtio-net device driver in the guest.
This enables the vhost-net module to notify (call) the guest.
* An eventfd configured in KVM to be triggered on writes to the virtio-net device's
Peripheral Component Interconnect (PCI) config space.
This enables the vhost-net module to receive notifications (kicks) from the guest.
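
The first item in the list above, translating guest addresses by means of the guest memory layout, can be illustrated with a minimal sketch.
The ``Region`` fields and the ``translate`` helper are assumptions made for illustration only; they are not part of the vhost-net API.

.. code-block:: python

    from collections import namedtuple

    # Hypothetical description of one guest memory region shared by QEMU.
    Region = namedtuple("Region", ["guest_phys_addr", "size", "host_virt_addr"])


    def translate(regions, guest_addr):
        """Translate a guest physical address to a host virtual address."""
        for r in regions:
            if r.guest_phys_addr <= guest_addr < r.guest_phys_addr + r.size:
                # Keep the offset inside the region, change only the base.
                return r.host_virt_addr + (guest_addr - r.guest_phys_addr)
        raise ValueError("address 0x%x is not in any shared region" % guest_addr)

An address outside every region raises an error in this sketch, because the back end has no mapping through which to reach it.
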
The following figure shows the system architecture for virtio-net networking with vhost-net offload.
.. _figure_virtio_linux_vhost:
.. figure:: img/virtio_linux_vhost.*
Virtio with Linux
Sample Code Overview
--------------------
The DPDK vhost-net sample code demonstrates KVM (QEMU) offloading the servicing of a Virtual Machine's (VM's)
virtio-net devices to a DPDK-based application in place of the kernel's vhost-net module.
The DPDK vhost-net sample code is based on the vhost library. The vhost library is developed so that a user space Ethernet switch can
easily integrate with vhost functionality.
The vhost library implements the following features:
* Management of virtio-net device creation/destruction events.
* Mapping of the VM's physical memory into the DPDK vhost-net's address space.
* Triggering/receiving notifications to/from VMs via eventfds.
* A virtio-net back-end implementation providing a subset of virtio-net features.
There are two vhost implementations in the vhost library: vhost cuse and vhost user. In vhost cuse, a character device driver is implemented to
receive and process vhost requests through ioctl messages. In vhost user, a socket server is created to receive vhost requests through
socket messages. Most of the messages share the same handler routine.
.. note::
**Any vhost cuse specific requirement in the following sections will be emphasized**.
The two implementations are enabled or disabled statically through the configuration file. Only one implementation can be enabled at a time; the two do not co-exist in the current implementation.
The vhost sample code application is a simple packet switching application with the following feature:
* Packet switching between virtio-net devices and the network interface card,
including using VMDQs to reduce the switching that needs to be performed in software.
The following figure shows the architecture of the Vhost sample application based on vhost-cuse.
.. _figure_vhost_net_arch:
.. figure:: img/vhost_net_arch.*
Vhost-net Architectural Overview
The following figure shows the flow of packets through the vhost-net sample application.
.. _figure_vhost_net_sample_app:
.. figure:: img/vhost_net_sample_app.*
Packet Flow Through the vhost-net Sample Application
Supported Distributions
-----------------------
The examples in this section have been validated with the following distributions:
* Fedora* 18
* Fedora* 19
* Fedora* 20
.. _vhost_app_prerequisites:
Prerequisites
-------------
This section lists prerequisite packages that must be installed.
Installing Packages on the Host(vhost cuse required)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The vhost cuse code uses the following packages: fuse, fuse-devel, and kernel-modules-extra.
The vhost user code does not rely on those modules, as eventfds are already installed into the vhost process through
the Unix domain socket.
#. Install Fuse Development Libraries and headers:
.. code-block:: console
yum -y install fuse fuse-devel
#. Install the Cuse Kernel Module:
.. code-block:: console
yum -y install kernel-modules-extra
QEMU simulator
~~~~~~~~~~~~~~
For vhost user, QEMU 2.2 is required.
Setting up the Execution Environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The vhost sample code requires that QEMU allocates a VM's memory on the hugetlbfs file system.
As the vhost sample code requires hugepages,
the best practice is to partition the system into separate hugepage mount points for the VMs and the vhost sample code.
.. note::
This is best-practice only and is not mandatory.
For systems that only support 2 MB page sizes,
both QEMU and vhost sample code can use the same hugetlbfs mount point without issue.
**QEMU**
VMs with gigabytes of memory can benefit from having QEMU allocate their memory from 1 GB huge pages.
1 GB huge pages must be allocated at boot time by passing kernel parameters through the grub boot loader.
#. Calculate the maximum memory usage of all VMs to be run on the system.
Then, round this value up to the nearest Gigabyte the execution environment will require.
#. Edit the /etc/default/grub file, and add the following to the GRUB_CMDLINE_LINUX entry:
.. code-block:: console
GRUB_CMDLINE_LINUX="... hugepagesz=1G hugepages=<Number of hugepages required> default_hugepagesz=1G"
#. Update the grub boot loader:
.. code-block:: console
grub2-mkconfig -o /boot/grub2/grub.cfg
#. Reboot the system.
#. The hugetlbfs mount point (/dev/hugepages) should now default to allocating gigabyte pages.
.. note::
Making the above modification will change the system default hugepage size to 1 GB for all applications.
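
The calculation in step 1 above can be sketched as follows.
The function name and its input (a list of per-VM memory sizes in bytes) are assumptions chosen for illustration.

.. code-block:: python

    GIB = 1 << 30  # bytes in one 1 GB hugepage


    def hugepages_needed(vm_memory_bytes):
        """Return the number of 1 GB hugepages that cover all VMs' memory."""
        total = sum(vm_memory_bytes)
        return (total + GIB - 1) // GIB  # round up to a whole gigabyte

For two VMs with 4096 MB and 1536 MB of memory, the sketch returns 6, which would be the value used for ``hugepages=`` on the kernel command line.
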
**Vhost Sample Code**
In this section, we create a second hugetlbfs mount point to allocate hugepages for the DPDK vhost sample code.
#. Allocate sufficient 2 MB pages for the DPDK vhost sample code:
.. code-block:: console
echo 256 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
#. Mount hugetlbfs at a separate mount point for 2 MB pages:
.. code-block:: console
mount -t hugetlbfs nodev /mnt/huge -o pagesize=2M
The above steps can be automated by doing the following:
#. Edit /etc/fstab to add an entry to automatically mount the second hugetlbfs mount point:
::
hugetlbfs <tab> /mnt/huge <tab> hugetlbfs defaults,pagesize=2M 0 0
#. Edit the /etc/default/grub file, and add the following to the GRUB_CMDLINE_LINUX entry:
::
GRUB_CMDLINE_LINUX="... hugepagesz=2M hugepages=256 ... default_hugepagesz=1G"
#. Update the grub bootloader:
.. code-block:: console
grub2-mkconfig -o /boot/grub2/grub.cfg
#. Reboot the system.
.. note::
Ensure that the default hugepage size after this setup is 1 GB.
Setting up the Guest Execution Environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For testing purposes, it is recommended that the DPDK testpmd sample application is used in the guest to forward packets;
the reasons for this are discussed in `Running the Virtual Machine (QEMU)`_.
The testpmd application forwards packets between pairs of Ethernet devices,
so it requires an even number of Ethernet devices (virtio or otherwise) to execute.
It is therefore recommended to create virtio-net devices in multiples of two for each Virtual Machine, either through libvirt or
at the command line as follows.
.. note::
Observe that in the example, "-device" and "-netdev" are repeated for two virtio-net devices.
For vhost cuse:
.. code-block:: console
qemu-system-x86_64 ... \
-netdev tap,id=hostnet1,vhost=on,vhostfd=<open fd> \
-device virtio-net-pci,netdev=hostnet1,id=net1 \
-netdev tap,id=hostnet2,vhost=on,vhostfd=<open fd> \
-device virtio-net-pci,netdev=hostnet2,id=net2
For vhost user:
.. code-block:: console
qemu-system-x86_64 ... \
-chardev socket,id=char1,path=<sock_path> \
-netdev type=vhost-user,id=hostnet1,chardev=char1 \
-device virtio-net-pci,netdev=hostnet1,id=net1 \
-chardev socket,id=char2,path=<sock_path> \
-netdev type=vhost-user,id=hostnet2,chardev=char2 \
-device virtio-net-pci,netdev=hostnet2,id=net2
sock_path is the path for the socket file created by vhost.
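
Because the ``-chardev``, ``-netdev`` and ``-device`` options repeat for every device, the vhost user arguments can be generated.
The following is a minimal sketch; ``vhost_user_args`` is a hypothetical helper, and it uses only the options shown in the example above.

.. code-block:: python

    def vhost_user_args(sock_path, num_devices=2):
        """Build QEMU arguments for num_devices vhost-user virtio-net devices."""
        if num_devices % 2:
            raise ValueError("testpmd needs an even number of Ethernet devices")
        args = []
        for i in range(1, num_devices + 1):
            # Each device gets its own chardev, netdev and device identifiers.
            args += [
                "-chardev", "socket,id=char%d,path=%s" % (i, sock_path),
                "-netdev", "type=vhost-user,id=hostnet%d,chardev=char%d" % (i, i),
                "-device", "virtio-net-pci,netdev=hostnet%d,id=net%d" % (i, i),
            ]
        return args

The returned list can be appended to the rest of the QEMU command line, which is not shown here.
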
Compiling the Sample Code
-------------------------
#. Compile vhost lib:
To enable vhost, turn on the vhost library by setting the following option to ``y`` in the configuration file config/common_linuxapp.
.. code-block:: console
CONFIG_RTE_LIBRTE_VHOST=y
vhost user is turned on by default in the configuration file config/common_linuxapp, as shown below.
To enable vhost cuse instead, disable vhost user by setting this option to ``n``.
.. code-block:: console
CONFIG_RTE_LIBRTE_VHOST_USER=y
After vhost is enabled and the implementation is selected, build the vhost library.
#. Go to the examples directory:
.. code-block:: console
export RTE_SDK=/path/to/rte_sdk
cd ${RTE_SDK}/examples/vhost
#. Set the target (a default target is used if not specified). For example:
.. code-block:: console
export RTE_TARGET=x86_64-native-linuxapp-gcc
See the DPDK Getting Started Guide for possible RTE_TARGET values.
#. Build the application:
.. code-block:: console
cd ${RTE_SDK}
make config T=${RTE_TARGET}
make install T=${RTE_TARGET}
cd ${RTE_SDK}/examples/vhost
make
#. Go to the eventfd_link directory (vhost cuse required):
.. code-block:: console
cd ${RTE_SDK}/lib/librte_vhost/eventfd_link
#. Build the eventfd_link kernel module (vhost cuse required):
.. code-block:: console
make
Running the Sample Code
-----------------------
#. Install the cuse kernel module (vhost cuse required):
.. code-block:: console
modprobe cuse
#. Go to the eventfd_link directory (vhost cuse required):
.. code-block:: console
export RTE_SDK=/path/to/rte_sdk
cd ${RTE_SDK}/lib/librte_vhost/eventfd_link
#. Install the eventfd_link module (vhost cuse required):
.. code-block:: console
insmod ./eventfd_link.ko
#. Go to the examples directory:
.. code-block:: console
export RTE_SDK=/path/to/rte_sdk
cd ${RTE_SDK}/examples/vhost/build/app
#. Run the vhost-switch sample code:
vhost cuse:
.. code-block:: console
./vhost-switch -c f -n 4 --socket-mem 1024 --huge-dir /mnt/huge \
-- -p 0x1 --dev-basename usvhost
vhost user: a socket file named usvhost will be created in the current directory. Use its path as the socket path in the guest's QEMU command line.
.. code-block:: console
./vhost-switch -c f -n 4 --socket-mem 1024 --huge-dir /mnt/huge \
-- -p 0x1 --dev-basename usvhost
.. note::
Please note the huge-dir parameter instructs the DPDK to allocate its memory from the 2 MB page hugetlbfs.
.. note::
The number used with the --socket-mem parameter may need to be more than 1024.
The number required depends on the number of mbufs allocated by vhost-switch.
.. _vhost_app_parameters:
Parameters
~~~~~~~~~~
**Basename.**
vhost cuse uses a Linux* character device to communicate with QEMU.
The basename is used to generate the character device's name.
/dev/<basename>
For compatibility with the QEMU wrapper script, a base name of "usvhost" should be used:
.. code-block:: console
./vhost-switch -c f -n 4 --socket-mem 1024 --huge-dir /mnt/huge \
-- -p 0x1 --dev-basename usvhost
**vm2vm.**
The vm2vm parameter sets the mode of packet switching between guests in the host.
A value of "0" disables vm2vm, which means that packets transmitted by a virtual machine always go to the Ethernet port.
A value of "1" selects software mode packet forwarding between guests. This mode requires packet copies in vhost,
so it is valid only in the one-copy implementation and invalid in the zero copy implementation.
A value of "2" selects hardware mode packet forwarding between guests. Packets go to the Ethernet port,
and the hardware L2 switch determines, based on the packet's destination MAC address and VLAN tag,
which guest the packet should be forwarded to or whether it needs to be sent externally.
.. code-block:: console
./vhost-switch -c f -n 4 --socket-mem 1024 --huge-dir /mnt/huge \
-- --vm2vm [0,1,2]
**Mergeable Buffers.**
The mergeable buffers parameter controls how virtio-net descriptors are used for virtio-net headers.
In a disabled state, one virtio-net header is used per packet buffer;
in an enabled state, one virtio-net header is used for multiple packets.
The default value is 0 (disabled), since virtio-net drivers in recent kernels show performance degradation when this feature is enabled.
.. code-block:: console
./vhost-switch -c f -n 4 --socket-mem 1024 --huge-dir /mnt/huge \
-- --mergeable [0,1]
**Stats.**
The stats parameter controls the printing of virtio-net device statistics.
The parameter specifies the interval, in seconds, at which statistics are printed; an interval of 0 seconds disables statistics.
.. code-block:: console
./vhost-switch -c f -n 4 --socket-mem 1024 --huge-dir /mnt/huge \
-- --stats [0,n]
**RX Retry.**
The rx-retry option enables/disables enqueue retries when the guest's RX queue is full.
This feature resolves packet loss that is observed at high data rates
by allowing the receive path to delay and retry.
This option is enabled by default.
.. code-block:: console
./vhost-switch -c f -n 4 --socket-mem 1024 --huge-dir /mnt/huge \
-- --rx-retry [0,1]
**RX Retry Number.**
The rx-retry-num option specifies the number of retries on an RX burst.
It takes effect only when rx retry is enabled.
The default value is 4.
.. code-block:: console
./vhost-switch -c f -n 4 --socket-mem 1024 --huge-dir /mnt/huge \
-- --rx-retry 1 --rx-retry-num 5
**RX Retry Delay Time.**
The rx-retry-delay option specifies the delay (in microseconds) between retries on an RX burst.
It takes effect only when rx retry is enabled.
The default value is 15.
.. code-block:: console
./vhost-switch -c f -n 4 --socket-mem 1024 --huge-dir /mnt/huge \
-- --rx-retry 1 --rx-retry-delay 20
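
The interaction of the three RX retry options can be illustrated with a minimal sketch.
This is not the vhost-switch implementation: ``try_enqueue`` is a hypothetical callable standing in for the enqueue operation,
and the defaults mirror the documented values of 4 retries and 15 microseconds.

.. code-block:: python

    import time


    def enqueue_with_retry(try_enqueue, packets, retries=4, delay_us=15):
        """Enqueue packets, waiting and retrying while the guest queue is full.

        try_enqueue is a hypothetical callable that accepts a list of packets
        and returns how many of them it placed on the guest's RX queue.
        """
        sent = try_enqueue(packets)
        attempt = 0
        while sent < len(packets) and attempt < retries:
            time.sleep(delay_us / 1e6)  # wait before the next attempt
            sent += try_enqueue(packets[sent:])
            attempt += 1
        return sent  # packets beyond this count would be dropped

With retries disabled, the equivalent behavior would be a single call to ``try_enqueue`` with any remaining packets dropped immediately.
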
**Zero copy.**
The zero copy option enables/disables the zero copy mode for RX/TX packets.
In zero copy mode, the packet buffer address from the guest is translated into a host physical address
and then set directly as the DMA address.
If zero copy mode is disabled, one-copy mode is used in the sample.
This option is disabled by default.
.. code-block:: console
./vhost-switch -c f -n 4 --socket-mem 1024 --huge-dir /mnt/huge \
-- --zero-copy [0,1]
**RX descriptor number.**
The RX descriptor number option specifies the number of Ethernet RX descriptors.
The Linux legacy virtio-net driver and the DPDK-based virtio-net PMD use the vring descriptors differently:
the former likely allocates half of them for virtio headers and the other half for frame buffers,
while the latter allocates all of them for frame buffers.
This leads to a different number of available frame buffers in the vring,
and therefore to a different number of Ethernet RX descriptors that can be used in zero copy mode.
The option is valid only when zero copy mode is enabled. The value is 32 by default.
.. code-block:: console
./vhost-switch -c f -n 4 --socket-mem 1024 --huge-dir /mnt/huge \
-- --zero-copy 1 --rx-desc-num [0, n]
**TX descriptor number.**
The TX descriptor number option specifies the number of Ethernet TX descriptors. It is valid only when zero copy mode is enabled.
The value is 64 by default.
.. code-block:: console
./vhost-switch -c f -n 4 --socket-mem 1024 --huge-dir /mnt/huge \
-- --zero-copy 1 --tx-desc-num [0, n]
**VLAN strip.**
The VLAN strip option enables/disables VLAN stripping on the host. If disabled, the guest receives packets with the VLAN tag.
It is enabled by default.
.. code-block:: console
./vhost-switch -c f -n 4 --socket-mem 1024 --huge-dir /mnt/huge \
-- --vlan-strip [0, 1]
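
Several of the parameters above depend on each other. The following minimal sketch checks those documented constraints before building the application options.
The function name is hypothetical, and rejecting an option that would merely have no effect is a design choice of this sketch, not behavior of vhost-switch.

.. code-block:: python

    def vhost_switch_options(vm2vm=None, zero_copy=0, rx_retry=1,
                             rx_retry_num=None, rx_desc_num=None):
        """Return application options after checking documented constraints."""
        if vm2vm not in (None, 0, 1, 2):
            raise ValueError("vm2vm must be 0, 1 or 2")
        if vm2vm == 1 and zero_copy:
            raise ValueError("software vm2vm switching needs one-copy mode")
        if rx_retry_num is not None and not rx_retry:
            raise ValueError("rx-retry-num has no effect with rx-retry disabled")
        if rx_desc_num is not None and not zero_copy:
            raise ValueError("rx-desc-num is valid only in zero copy mode")
        opts = ["--zero-copy", str(zero_copy), "--rx-retry", str(rx_retry)]
        if vm2vm is not None:
            opts += ["--vm2vm", str(vm2vm)]
        if rx_retry_num is not None:
            opts += ["--rx-retry-num", str(rx_retry_num)]
        if rx_desc_num is not None:
            opts += ["--rx-desc-num", str(rx_desc_num)]
        return opts

The returned list would follow the ``--`` separator on the vhost-switch command line, after the EAL arguments.
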
.. _vhost_app_running:
Running the Virtual Machine (QEMU)
----------------------------------
QEMU must be executed with specific parameters to:
* Ensure the guest is configured to use virtio-net network adapters.
.. code-block:: console
qemu-system-x86_64 ... -device virtio-net-pci,netdev=hostnet1, \
id=net1 ...
* Ensure the guest's virtio-net network adapter is configured with offloads disabled.
.. code-block:: console
qemu-system-x86_64 ... -device virtio-net-pci,netdev=hostnet1, \
id=net1,csum=off,gso=off,guest_tso4=off,guest_tso6=off,guest_ecn=off
* Redirect QEMU to communicate with the DPDK vhost-net sample code in place of the vhost-net kernel module (vhost cuse).
.. code-block:: console
qemu-system-x86_64 ... -netdev tap,id=hostnet1,vhost=on, \
vhostfd=<open fd> ...
* Enable the vhost-net sample code to map the VM's memory into its own process address space.
.. code-block:: console
qemu-system-x86_64 ... -mem-prealloc -mem-path /dev/hugepages ...
.. note::
The QEMU wrapper (qemu-wrap.py) is a Python script designed to automate the QEMU configuration described above.
It also facilitates integration with libvirt, although the script may also be used standalone without libvirt.
Redirecting QEMU to vhost-net Sample Code(vhost cuse)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To redirect QEMU to the vhost-net sample code implementation of the vhost-net API,
an open file descriptor must be passed to QEMU running as a child process.
.. code-block:: python
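#!/usr/bin/python
import os
import subprocess

# Open the vhost character device; QEMU must inherit this descriptor.
fd = os.open("/dev/usvhost-1", os.O_RDWR)
subprocess.call(
    "qemu-system-x86_64 ... -netdev tap,id=vhostnet0,vhost=on,vhostfd="
    + str(fd) + " ...", shell=True)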
.. note::
This process is automated in the `QEMU Wrapper Script`_.
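
On Python 3, file descriptors are not inherited by child processes by default, so the descriptor has to be passed explicitly.
The following is a minimal sketch under that assumption; ``qemu_command`` and ``launch`` are hypothetical helpers,
and the remaining QEMU arguments are intentionally left out.

.. code-block:: python

    import os
    import subprocess


    def qemu_command(fd):
        """Build an argument list that hands descriptor fd to QEMU."""
        # Other QEMU arguments would be added to this list.
        return ["qemu-system-x86_64",
                "-netdev", "tap,id=vhostnet0,vhost=on,vhostfd=%d" % fd]


    def launch(dev_path="/dev/usvhost-1"):
        """Open the vhost character device and run QEMU as a child process."""
        fd = os.open(dev_path, os.O_RDWR)
        try:
            # pass_fds keeps fd open, under the same number, in the child.
            return subprocess.call(qemu_command(fd), pass_fds=(fd,))
        finally:
            os.close(fd)

Passing an argument list rather than a shell string avoids quoting problems when paths contain spaces.
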
Mapping the Virtual Machine's Memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For the DPDK vhost-net sample code to be run correctly, QEMU must allocate the VM's memory on hugetlbfs.
This is done by specifying mem-prealloc and mem-path when executing QEMU.
The vhost-net sample code accesses the virtio-net device's virtual rings and packet buffers
by finding and mapping the VM's physical memory on hugetlbfs.
In this case, the path passed to the guest should be that of the 1 GB page hugetlbfs:
.. code-block:: console
qemu-system-x86_64 ... -mem-prealloc -mem-path /dev/hugepages ...
.. note::
This process is automated in the `QEMU Wrapper Script`_.
The following two sections apply only to vhost cuse.
For vhost user, make the corresponding changes to the qemu-wrapper script and guest XML file.
QEMU Wrapper Script
~~~~~~~~~~~~~~~~~~~
The QEMU wrapper script automatically detects and calls QEMU with the necessary parameters required
to integrate with the vhost sample code.
It performs the following actions:
* Automatically detects the location of the hugetlbfs and inserts this into the command line parameters.
* Automatically opens file descriptors for each virtio-net device and inserts them into the command line parameters.
* Disables offloads on each virtio-net device.
* Calls QEMU, passing both the command line parameters passed to the script itself and those it has auto-detected.
The QEMU wrapper script will automatically configure calls to QEMU:
.. code-block:: console
qemu-wrap.py -machine pc-i440fx-1.4,accel=kvm,usb=off \
-cpu SandyBridge -smp 4,sockets=4,cores=1,threads=1 \
-netdev tap,id=hostnet1,vhost=on \
-device virtio-net-pci,netdev=hostnet1,id=net1 \
-hda <disk img> -m 4096
which will become the following call to QEMU:
.. code-block:: console
qemu-system-x86_64 -machine pc-i440fx-1.4,accel=kvm,usb=off \
-cpu SandyBridge -smp 4,sockets=4,cores=1,threads=1 \
-netdev tap,id=hostnet1,vhost=on,vhostfd=<open fd> \
-device virtio-net-pci,netdev=hostnet1,id=net1, \
csum=off,gso=off,guest_tso4=off,guest_tso6=off,guest_ecn=off \
-hda <disk img> -m 4096 -mem-path /dev/hugepages -mem-prealloc
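
The transformation between the two command lines above can be sketched as follows.
This is an illustration of the listed actions, not the code of qemu-wrap.py: ``wrap_args`` and ``open_fd`` are hypothetical,
and the hugetlbfs location is passed in rather than detected.

.. code-block:: python

    OFFLOADS_OFF = "csum=off,gso=off,guest_tso4=off,guest_tso6=off,guest_ecn=off"


    def wrap_args(args, open_fd, hugetlbfs="/dev/hugepages"):
        """Rewrite a list of QEMU arguments for use with the vhost sample code.

        open_fd is a hypothetical callable returning the number of an open
        descriptor for the vhost character device.
        """
        out = []
        # Pair every argument with the one before it to spot option values.
        for prev, arg in zip([None] + args, args):
            if prev == "-netdev" and arg.startswith("tap,") and "vhost=on" in arg:
                arg += ",vhostfd=%d" % open_fd()
            elif prev == "-device" and arg.startswith("virtio-net-pci"):
                arg += "," + OFFLOADS_OFF
            out.append(arg)
        return out + ["-mem-path", hugetlbfs, "-mem-prealloc"]

Arguments that are neither a vhost-enabled tap netdev nor a virtio-net-pci device pass through unchanged.
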
Libvirt Integration
~~~~~~~~~~~~~~~~~~~
The QEMU wrapper script (qemu-wrap.py) "wraps" libvirt calls to QEMU,
such that QEMU is called with the correct parameters described above.
To call the QEMU wrapper automatically from libvirt, the following configuration changes must be made:
* Place the QEMU wrapper script in libvirt's binary search PATH ($PATH).
A good location is in the directory that contains the QEMU binary.
* Ensure that the script has the same owner/group and file permissions as the QEMU binary.
* Update the VM xml file using virsh edit <vm name>:
* Set the VM to use the launch script
* Set the emulator path contained in the <emulator></emulator> tags. For example,
replace <emulator>/usr/bin/qemu-kvm</emulator> with <emulator>/usr/bin/qemu-wrap.py</emulator>
* Set the VM's virtio-net devices to use vhost-net offload:
.. code-block:: xml
<interface type="network">
<model type="virtio"/>
<driver name="vhost"/>
</interface>
* Enable libvirt to access the DPDK Vhost sample code's character device file by adding it
to controllers cgroup for libvirtd using the following steps:
.. code-block:: xml
cgroup_controllers = [ ... "devices", ... ]
clear_emulator_capabilities = 0
user = "root"
group = "root"
cgroup_device_acl = [
"/dev/null", "/dev/full", "/dev/zero",
"/dev/random", "/dev/urandom",
"/dev/ptmx", "/dev/kvm", "/dev/kqemu",
"/dev/rtc", "/dev/hpet", "/dev/net/tun",
"/dev/<devbase-name>-<index>",
]
* Disable SELinux or set to permissive mode.
* Mount cgroup device controller:
.. code-block:: console
mkdir /dev/cgroup
mount -t cgroup none /dev/cgroup -o devices
* Restart the libvirtd system process
For example, on Fedora* "systemctl restart libvirtd.service"
* Edit the configuration parameters section of the script:
* Configure the "emul_path" variable to point to the QEMU emulator.
.. code-block:: xml
emul_path = "/usr/local/bin/qemu-system-x86_64"
* Configure the "us_vhost_path" variable to point to the DPDK vhost-net sample code's character device's name.
DPDK vhost-net sample code's character device will be in the format "/dev/<basename>".
.. code-block:: xml
us_vhost_path = "/dev/usvhost"
Common Issues
~~~~~~~~~~~~~
* QEMU failing to allocate memory on hugetlbfs, with an error like the following::
file_ram_alloc: can't mmap RAM pages: Cannot allocate memory
When running QEMU the above error indicates that it has failed to allocate memory for the Virtual Machine on
the hugetlbfs. This is typically due to insufficient hugepages being free to support the allocation request.
The number of free hugepages can be checked as follows:
.. code-block:: console
cat /sys/kernel/mm/hugepages/hugepages-<pagesize>/free_hugepages
The command above indicates how many hugepages are free to support QEMU's allocation request.
* User space VHOST when the guest has 2MB sized huge pages:
The guest may have 2MB or 1GB sized huge pages. The user space VHOST should work properly in both cases.
* User space VHOST will not work with QEMU without the ``-mem-prealloc`` option:
The current implementation works properly only when the guest memory is pre-allocated, so it is required to
use a QEMU version (e.g. 1.6) which supports ``-mem-prealloc``. The ``-mem-prealloc`` option must be
specified explicitly in the QEMU command line.
* User space VHOST will not work with a QEMU version without shared memory mapping:
As shared memory mapping is mandatory for user space VHOST to work properly with the guest, user space VHOST
needs access to the shared memory from the guest to receive and transmit packets. It is important to make sure
the QEMU version supports shared memory mapping.
* In an Ubuntu environment, QEMU fails to start a new guest normally with user space VHOST due to not being able
to allocate huge pages for the new guest:
The solution for this issue is to add ``-boot c`` into the QEMU command line to make sure the huge pages are
allocated properly and then the guest should start normally.
Use ``cat /proc/meminfo`` to check whether there are any changes in the values of ``HugePages_Total`` and ``HugePages_Free``
after the guest startup.
* Log message: ``eventfd_link: module verification failed: signature and/or required key missing - tainting kernel``:
This log message may be ignored. The message occurs due to the kernel module ``eventfd_link``, which is not a standard
Linux module but which is necessary for the user space VHOST current implementation (CUSE-based) to communicate with
the guest.
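
The hugepage checks mentioned in the issues above can be scripted. The following minimal sketch extracts the
``HugePages_*`` counters from ``/proc/meminfo``; both function names are hypothetical.

.. code-block:: python

    def hugepage_counters(meminfo_text):
        """Extract the HugePages_* counters from /proc/meminfo content."""
        counters = {}
        for line in meminfo_text.splitlines():
            name, _, value = line.partition(":")
            if name.startswith("HugePages_"):
                counters[name] = int(value.split()[0])
        return counters


    def read_hugepage_counters(path="/proc/meminfo"):
        """Read the counters from the running system."""
        with open(path) as f:
            return hugepage_counters(f.read())

Comparing the result before and after starting a guest shows whether ``HugePages_Free`` dropped by the amount of memory the guest was given.
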
.. _vhost_app_running_dpdk:
Running DPDK in the Virtual Machine
-----------------------------------
For the DPDK vhost-net sample code to switch packets into the VM,
the sample code must first learn the MAC address of the VM's virtio-net device.
The sample code detects the address from packets being transmitted from the VM, similar to a learning switch.
This behavior requires no special action or configuration with the Linux* virtio-net driver in the VM
as the Linux* Kernel will automatically transmit packets during device initialization.
However, DPDK-based applications must be modified to automatically transmit packets during initialization
to facilitate the DPDK vhost-net sample code's MAC learning.
The DPDK testpmd application can be configured to automatically transmit packets during initialization
and to act as an L2 forwarding switch.
Testpmd MAC Forwarding
~~~~~~~~~~~~~~~~~~~~~~
At high packet rates, a minor packet loss may be observed.
To resolve this issue, a "wait and retry" mode is implemented in the testpmd and vhost sample code.
In the "wait and retry" mode, if the virtqueue is found to be full, testpmd waits for a period of time before retrying to enqueue packets.
The "wait and retry" algorithm is implemented in DPDK testpmd as a forwarding method called "mac_retry".
The following sequence diagram describes the algorithm in detail.
.. _figure_tx_dpdk_testpmd:
.. figure:: img/tx_dpdk_testpmd.*
Packet Flow on TX in DPDK-testpmd
Running Testpmd
~~~~~~~~~~~~~~~
The testpmd application is automatically built when DPDK is installed.
Run the testpmd application as follows:
.. code-block:: console
cd ${RTE_SDK}/x86_64-native-linuxapp-gcc/app
./testpmd -c 0x3 -n 4 --socket-mem 512 \
-- --burst=64 --i --disable-hw-vlan-filter
The destination MAC address for packets transmitted on each port can be set at the command line:
.. code-block:: console
./testpmd -c 0x3 -n 4 --socket-mem 512 \
-- --burst=64 --i --disable-hw-vlan-filter \
--eth-peer=0,aa:bb:cc:dd:ee:ff --eth-peer=1,ff:ee:dd:cc:bb:aa
* Packets received on port 1 will be forwarded on port 0 to MAC address
aa:bb:cc:dd:ee:ff
* Packets received on port 0 will be forwarded on port 1 to MAC address
ff:ee:dd:cc:bb:aa
The testpmd application can then be configured to act as an L2 forwarding application:
.. code-block:: console
testpmd> set fwd mac_retry
The testpmd can then be configured to start processing packets,
transmitting packets first so the DPDK vhost sample code on the host can learn the MAC address:
.. code-block:: console
testpmd> start tx_first
.. note::
Please note "set fwd mac_retry" is used in place of "set fwd mac_fwd" to ensure the retry feature is activated.
Passing Traffic to the Virtual Machine Device
---------------------------------------------
For a virtio-net device to receive traffic,
the traffic's Layer 2 header must include both the virtio-net device's MAC address and VLAN tag.
The DPDK sample code behaves in a similar manner to a learning switch in that
it learns the MAC address of the virtio-net devices from the first transmitted packet.
On learning the MAC address,
the DPDK vhost sample code prints a message with the MAC address and VLAN tag of the virtio-net device.
For example:
.. code-block:: console
DATA: (0) MAC_ADDRESS cc:bb:bb:bb:bb:bb and VLAN_TAG 1000 registered
The above message indicates that device 0 has been registered with MAC address cc:bb:bb:bb:bb:bb and VLAN tag 1000.
Any packets received on the NIC with these values are placed on the device's receive queue.
When a virtio-net device transmits packets, the VLAN tag is added to the packet by the DPDK vhost sample code.
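
The learning behavior described in this section can be illustrated with a minimal sketch of a lookup table keyed on MAC address and VLAN tag.
The ``LearningTable`` class is hypothetical and is not the sample code's implementation, which relies on the NIC to split traffic in hardware.

.. code-block:: python

    class LearningTable(object):
        """Map (MAC address, VLAN tag) pairs to virtio-net device indexes."""

        def __init__(self):
            self._devices = {}

        def learn(self, device, mac, vlan):
            """Register a device from the first packet it transmits."""
            key = (mac.lower(), vlan)
            if key not in self._devices:
                self._devices[key] = device
                print("DATA: (%d) MAC_ADDRESS %s and VLAN_TAG %d registered"
                      % (device, key[0], vlan))

        def lookup(self, mac, vlan):
            """Return the device for a received packet, or None if unknown."""
            return self._devices.get((mac.lower(), vlan))

A packet whose MAC address matches but whose VLAN tag does not yields ``None`` in this sketch, reflecting the requirement that both values match.
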