aboutsummaryrefslogtreecommitdiffstats
path: root/src/plugins/lb/lb_plugin_doc.rst
blob: 603453e784802335b70175eddb986d34c684f9ae (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
Load Balancer plugin
====================

Version
-------

The load balancer plugin is currently in *beta* version. Both CLIs and
APIs are subject to *heavy* changes, which also means feedback is really
welcome regarding features, apis, etc…

Overview
--------

This plugin provides load balancing for VPP in a way that is largely
inspired from Google’s MagLev:
http://research.google.com/pubs/pub44824.html

The load balancer is configured with a set of Virtual IPs (VIP, which
can be prefixes), and for each VIP, with a set of Application Server
addresses (ASs).

There are four encap types to steer traffic to different ASs: 1).
IPv4+GRE ad IPv6+GRE encap types: Traffic received for a given VIP (or
VIP prefix) is tunneled using GRE towards the different ASs in a way
that (tries to) ensure that a given session will always be tunneled to
the same AS.

2). IPv4+L3DSR encap types: L3DSR is used to overcome Layer 2
limitations of Direct Server Return Load Balancing. It maps VIP to DSCP
bits, and reuse TOS bits to transfer DSCP bits to server, and then
server will get VIP from DSCP-to-VIP mapping.

Both VIPs or ASs can be IPv4 or IPv6, but for a given VIP, all ASs must
be using the same encap. type (i.e. IPv4+GRE or IPv6+GRE or IPv4+L3DSR).
Meaning that for a given VIP, all AS addresses must be of the same
family.

3). IPv4/IPv6 + NAT4/NAT6 encap types: This type provides kube-proxy
data plane on user space, which is used to replace linux kernel’s
kube-proxy based on iptables.

Currently, load balancer plugin supports three service types: a) Cluster
IP plus Port: support any protocols, including TCP, UDP. b) Node IP plus
Node Port: currently only support UDP. c) External Load Balancer.

For Cluster IP plus Port case: kube-proxy is configured with a set of
Virtual IPs (VIP, which can be prefixes), and for each VIP, with a set
of AS addresses (ASs).

For a specific session received for a given VIP (or VIP prefix), first
packet selects a AS according to internal load balancing algorithm, then
does DNAT operation and sent to chosen AS. At the same time, will create
a session entry to store AS chosen result. Following packets for that
session will look up session table first, which ensures that a given
session will always be routed to the same AS.

For returned packet from AS, it will do SNAT operation and sent out.

Please refer to below for details:
https://schd.ws/hosted_files/ossna2017/1e/VPP_K8S_GTPU_OSSNA.pdf

Performance
-----------

The load balancer has been tested up to 1 millions flows and still
forwards more than 3Mpps per core in such circumstances. Although 3Mpps
seems already good, it is likely that performance will be improved in
next versions.

Configuration
-------------

Global LB parameters
~~~~~~~~~~~~~~~~~~~~

The load balancer needs to be configured with some parameters:

::

   lb conf [ip4-src-address <addr>] [ip6-src-address <addr>]
           [buckets <n>] [timeout <s>]

ip4-src-address: the source address used to send encap. packets using
IPv4 for GRE4 mode. or Node IP4 address for NAT4 mode.

ip6-src-address: the source address used to send encap. packets using
IPv6 for GRE6 mode. or Node IP6 address for NAT6 mode.

buckets: the *per-thread* established-connections-table number of
buckets.

timeout: the number of seconds a connection will remain in the
established-connections-table while no packet for this flow is received.

Configure the VIPs
~~~~~~~~~~~~~~~~~~

::

   lb vip <prefix> [encap (gre6|gre4|l3dsr|nat4|nat6)] \
     [dscp <n>] [port <n> target_port <n> node_port <n>] [new_len <n>] [del]

new_len is the size of the new-connection-table. It should be 1 or 2
orders of magnitude bigger than the number of ASs for the VIP in order
to ensure a good load balancing. Encap l3dsr and dscp is used to map VIP
to dscp bit and rewrite DSCP bit in packets. So the selected server
could get VIP from DSCP bit in this packet and perform DSR. Encap
nat4/nat6 and port/target_port/node_port is used to do kube-proxy data
plane.

Examples:

::

   lb vip 2002::/16 encap gre6 new_len 1024
   lb vip 2003::/16 encap gre4 new_len 2048
   lb vip 80.0.0.0/8 encap gre6 new_len 16
   lb vip 90.0.0.0/8 encap gre4 new_len 1024
   lb vip 100.0.0.0/8 encap l3dsr dscp 2 new_len 32
   lb vip 90.1.2.1/32 encap nat4 port 3306 target_port 3307 node_port 30964 new_len 1024
   lb vip 2004::/16 encap nat6 port 6306 target_port 6307 node_port 30966 new_len 1024

Configure the ASs (for each VIP)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

   lb as <vip-prefix> [<address> [<address> [...]]] [del]

You can add (or delete) as many ASs at a time (for a single VIP). Note
that the AS address family must correspond to the VIP encap. IP family.

Examples:

::

   lb as 2002::/16 2001::2 2001::3 2001::4
   lb as 2003::/16 10.0.0.1 10.0.0.2
   lb as 80.0.0.0/8 2001::2
   lb as 90.0.0.0/8 10.0.0.1

Configure SNAT
~~~~~~~~~~~~~~

::

   lb set interface nat4 in <intfc> [del]

Set SNAT feature in a specific interface. (applicable in NAT4 mode only)

::

   lb set interface nat6 in <intfc> [del]

Set SNAT feature in a specific interface. (applicable in NAT6 mode only)

Monitoring
----------

The plugin provides quite a bunch of counters and information. These are
still subject to quite significant changes.

::

   show lb
   show lb vip
   show lb vip verbose

   show node counters

Design notes
------------

Multi-Threading
~~~~~~~~~~~~~~~

MagLev is a distributed system which pseudo-randomly generates a
new-connections-table based on AS names such that each server configured
with the same set of ASs ends up with the same table. Connection
stickiness is then ensured with an established-connections-table. Using
ECMP, it is assumed (but not relied on) that servers will mostly receive
traffic for different flows.

This implementation pushes the parallelism a little bit further by using
one established-connections table per thread. This is equivalent to
assuming that RSS will make a job similar to ECMP, and is pretty useful
as threads don’t need to get a lock in order to write in the table.

Hash Table
~~~~~~~~~~

A load balancer requires an efficient read and write hash table. The
hash table used by ip6-forward is very read-efficient, but not so much
for writing. In addition, it is not a big deal if writing into the hash
table fails (again, MagLev uses a flow table but does not heavily
relies on it).

The plugin therefore uses a very specific (and stupid) hash table. -
Fixed (and power of 2) number of buckets (configured at runtime) - Fixed
(and power of 2) elements per buckets (configured at compilation time)

Reference counting
~~~~~~~~~~~~~~~~~~

When an AS is removed, there is two possible ways to react. - Keep using
the AS for established connections - Change AS for established
connections (likely to cause error for TCP)

In the first case, although an AS is removed from the configuration, its
associated state needs to stay around as long as it is used by at least
one thread.

In order to avoid locks, a specific reference counter is used. The
design is quite similar to clib counters but: - It is possible to
decrease the value - Summing will not zero the per-thread counters -
Only the thread can reallocate its own counters vector (to avoid
concurrency issues)

This reference counter is lock free, but reading a count of 0 does not
mean the value can be freed unless it is ensured by *other* means that
no other thread is concurrently referencing the object. In the case of
this plugin, it is assumed that no concurrent event will take place
after a few seconds.