Diffstat (limited to 'docs/usecases/contiv/NETWORKING.rst')
-rw-r--r-- docs/usecases/contiv/NETWORKING.rst 196
1 files changed, 196 insertions, 0 deletions
diff --git a/docs/usecases/contiv/NETWORKING.rst b/docs/usecases/contiv/NETWORKING.rst
new file mode 100644
index 00000000000..b6799961c1d
--- /dev/null
+++ b/docs/usecases/contiv/NETWORKING.rst
@@ -0,0 +1,196 @@
+Contiv/VPP Network Operation
+============================
+
+This document describes the network operation of the Contiv/VPP k8s
+network plugin. It covers the operation and configuration options of
+Contiv IPAM and explains how VPP is programmed by the Contiv/VPP
+control plane.
+
+The following picture shows a 2-node k8s deployment of Contiv/VPP, with a
+VXLAN tunnel established between the nodes to forward inter-node POD
+traffic. The IPAM options are depicted on Node 1, whereas the VPP
+programming is depicted on Node 2.
+
+.. figure:: /_images/contiv-networking.png
+ :alt: contiv-networking.png
+
+ Contiv/VPP Architecture
+
+Contiv/VPP IPAM (IP Address Management)
+---------------------------------------
+
+IPAM in Contiv/VPP is based on the concept of **Node ID**. The Node ID
+is a number that uniquely identifies a node in the k8s cluster. The
+first node is assigned the ID of 1, the second node 2, etc. If a node
+leaves the cluster, its ID is released back to the pool and will be
+re-used by the next node.
+
+The Node ID is used to calculate per-node IP subnets for PODs and other
+internal subnets that need to be unique on each node. Apart from the
+Node ID, the input for IPAM calculations is a set of config knobs, which
+can be specified in the ``IPAMConfig`` section of the `Contiv/VPP
+deployment YAML <../../../k8s/contiv-vpp.yaml>`_:
+
+- **PodSubnetCIDR** (default ``10.1.0.0/16``): each pod gets an IP
+ address assigned from this range. The size of this range (default
+ ``/16``) dictates the upper limit of the POD count for the entire k8s
+ cluster (default 65536 PODs).
+
+- **PodNetworkPrefixLen** (default ``24``): prefix length of the POD
+ subnet dedicated to each node. This value dictates how the range
+ defined in ``PodSubnetCIDR`` is divided among the nodes. With the
+ default value (``24``), each node has a ``/24`` slice of the
+ ``PodSubnetCIDR``, and the Node ID selects the slice. In case of
+ ``PodSubnetCIDR = 10.1.0.0/16``, ``PodNetworkPrefixLen = 24`` and
+ ``NodeID = 5``, the resulting POD subnet for the node would be
+ ``10.1.5.0/24``.
+
+- **PodIfIPCIDR** (default ``10.2.1.0/24``): VPP-internal addresses
+ used to put the VPP interfaces facing the PODs into L3 mode. This IP
+ range is reused on each node, so it is never addressable from outside
+ the node itself. The only requirement is that this subnet must not
+ collide with any other IPAM subnet.
+
+- **VPPHostSubnetCIDR** (default ``172.30.0.0/16``): used for
+ addressing the interconnect of VPP with the Linux network stack,
+ within the same node. Since this subnet needs to be unique on each
+ node, the actual subnet used on a node is derived from this range
+ using the Node ID and ``VPPHostNetworkPrefixLen``.
+
+- **VPPHostNetworkPrefixLen** (default ``24``): used to calculate the
+ subnet for addressing the interconnect of VPP with the Linux network
+ stack, within the same node. With
+ ``VPPHostSubnetCIDR = 172.30.0.0/16``,
+ ``VPPHostNetworkPrefixLen = 24`` and ``NodeID = 5`` the resulting
+ subnet for the node would be ``172.30.5.0/24``.
+
+- **NodeInterconnectCIDR** (default ``192.168.16.0/24``): range for the
+ addresses assigned to the data plane interfaces managed by VPP.
+ Unless DHCP is used (``NodeInterconnectDHCP = True``), the Contiv/VPP
+ control plane automatically assigns an IP address from this range to
+ the DPDK-managed ethernet interface bound to VPP on each node. The
+ actual IP address will be calculated from the Node ID (e.g., with
+ ``NodeInterconnectCIDR = 192.168.16.0/24`` and ``NodeID = 5``, the
+ resulting IP address assigned to the ethernet interface on VPP will
+ be ``192.168.16.5``).
+
+- **NodeInterconnectDHCP** (default ``False``): when set to ``True``,
+ the IP addresses of the data plane interfaces managed by VPP are
+ assigned by DHCP instead of being assigned from
+ ``NodeInterconnectCIDR`` by the Contiv/VPP control plane. A DHCP
+ server must be running in the network to which the data plane
+ interface is connected. If ``NodeInterconnectDHCP = True``,
+ ``NodeInterconnectCIDR`` is ignored.
+
+- **VxlanCIDR** (default ``192.168.30.0/24``): in order to provide
+ inter-node POD to POD connectivity via any underlay network (not
+ necessarily an L2 network), Contiv/VPP sets up a VXLAN tunnel overlay
+ between each pair of nodes within the cluster. Each node needs a
+ unique IP address for its VXLAN BVI interface. This IP address is
+ automatically calculated from the Node ID (e.g., with
+ ``VxlanCIDR = 192.168.30.0/24`` and ``NodeID = 5``, the resulting IP
+ address assigned to the VXLAN BVI interface will be
+ ``192.168.30.5``).
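+
+The per-node calculations described above can be sketched in Python with
+the standard ``ipaddress`` module. This is only an illustration, not the
+Contiv/VPP implementation: the helper names ``node_subnet`` and
+``node_address`` are hypothetical, and the sketch assumes that the Node
+ID is used directly as the subnet index and as the host part of the
+address, as in the examples above.
+
+.. code-block:: python
+
+   import ipaddress
+
+
+   def node_subnet(cidr, prefix_len, node_id):
+       """Return the node_id-th subnet of length prefix_len within cidr."""
+       network = ipaddress.ip_network(cidr)
+       if prefix_len < network.prefixlen:
+           raise ValueError("prefix_len is shorter than the CIDR prefix")
+       subnet_count = 2 ** (prefix_len - network.prefixlen)
+       if not 0 <= node_id < subnet_count:
+           raise ValueError(f"node ID {node_id} does not fit into {cidr}")
+       subnet_size = 2 ** (network.max_prefixlen - prefix_len)
+       first = network.network_address + node_id * subnet_size
+       return ipaddress.ip_network(f"{first}/{prefix_len}")
+
+
+   def node_address(cidr, node_id):
+       """Return the address within cidr whose host part equals node_id."""
+       network = ipaddress.ip_network(cidr)
+       if not 0 < node_id < network.num_addresses - 1:
+           raise ValueError(f"node ID {node_id} does not fit into {cidr}")
+       return network.network_address + node_id
+
+
+   # Default settings and NodeID = 5, as in the examples above.
+   print(node_subnet("10.1.0.0/16", 24, 5))     # 10.1.5.0/24
+   print(node_subnet("172.30.0.0/16", 24, 5))   # 172.30.5.0/24
+   print(node_address("192.168.16.0/24", 5))    # 192.168.16.5
+   print(node_address("192.168.30.0/24", 5))    # 192.168.30.5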
+
+VPP Programming
+---------------
+
+This section describes how the Contiv/VPP control plane programs VPP,
+based on the events it receives from k8s. This section is not
+necessary for understanding basic Contiv/VPP operation, but is very
+useful for debugging purposes.
+
+Contiv/VPP currently uses a single VRF to forward the traffic between
+PODs on a node, PODs on different nodes, the host network stack, and
+the DPDK-managed data plane interface. The forwarding between each of
+them is purely L3-based, even for communication between two PODs within
+the same node.
+
+DPDK-Managed Data Interface
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In order to allow communication between PODs on different nodes and
+between PODs and the outside world, Contiv/VPP uses data plane
+interfaces bound to VPP using DPDK. Each node should have one “main” VPP
+interface, which is unbound from the host network stack and bound to
+VPP. The Contiv/VPP control plane automatically configures the interface
+either via DHCP, or with a statically assigned address (see the
+``NodeInterconnectCIDR`` and ``NodeInterconnectDHCP`` YAML settings).
+
+PODs on the Same Node
+~~~~~~~~~~~~~~~~~~~~~
+
+PODs are connected to VPP using virtio-based TAP interfaces created by
+VPP, with the POD-end of the interface placed into the POD container
+network namespace. Each POD is assigned an IP address from the
+``PodSubnetCIDR``. The allocated IP is configured with the prefix length
+``/32``. Additionally, a static route pointing towards VPP is
+configured in the POD network namespace. The prefix length ``/32`` means
+that all IP traffic is forwarded via the default route, that is, to VPP.
+To get rid of unnecessary broadcasts between the POD and VPP, a static
+ARP entry is configured for the gateway IP in the POD namespace, as well
+as for the POD IP on VPP. Both ends of the TAP interface have a static
+(non-default) MAC address applied.
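+
+As a rough illustration of the POD-side settings described above, the
+following sketch builds the equivalent ``iproute2`` commands. It does
+not show how Contiv/VPP actually applies the configuration: the function
+name, the interface name ``eth0``, the gateway IP and the MAC address
+are all hypothetical values chosen for the example.
+
+.. code-block:: python
+
+   def pod_side_commands(pod_ip, gateway_ip, gateway_mac, ifname="eth0"):
+       """Return iproute2 commands mirroring the POD-side configuration."""
+       return [
+           # POD address with a /32 prefix: no on-link subnet is implied.
+           f"ip addr add {pod_ip}/32 dev {ifname}",
+           # Make the gateway reachable directly over the TAP interface.
+           f"ip route add {gateway_ip}/32 dev {ifname} scope link",
+           # Send all other traffic towards VPP.
+           f"ip route add default via {gateway_ip} dev {ifname}",
+           # Static ARP entry, so the gateway is never resolved via ARP.
+           f"ip neigh add {gateway_ip} lladdr {gateway_mac} "
+           f"dev {ifname} nud permanent",
+       ]
+
+
+   # Hypothetical POD on the node with NodeID = 5.
+   for command in pod_side_commands("10.1.5.7", "10.1.5.1",
+                                    "02:fe:00:00:00:01"):
+       print(command)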
+
+PODs with hostNetwork=true
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+PODs with a ``hostNetwork=true`` attribute are not placed into a
+separate network namespace. Instead, they use the main host Linux
+network namespace; therefore, they are not directly connected to VPP.
+They rely on the interconnection between VPP and the host Linux network
+stack, which is described in the next section. Note that when these PODs
+access a service IP, their network communication is NATed in Linux (by
+iptables rules programmed by kube-proxy) rather than in VPP, which is
+the case for the PODs connected to VPP directly.
+
+Linux Host Network Stack
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+In order to interconnect the Linux host network stack with VPP (to allow
+access to the cluster resources from the host itself, as well as for the
+PODs with ``hostNetwork=true``), VPP creates a TAP interface between VPP
+and the main network namespace. The TAP interface is configured with IP
+addresses from the ``VPPHostSubnetCIDR`` range, with ``.1`` in the
+last octet on the VPP side, and ``.2`` on the host side. The name of
+the host interface is ``vpp1``. The host has the following static
+routes pointing to VPP:
+
+- A route to the whole ``PodSubnetCIDR`` to route traffic targeting
+ PODs towards VPP.
+
+- A route to ``ServiceCIDR`` (default ``10.96.0.0/12``), to route
+ service IP targeted traffic that has not been translated by
+ kube-proxy for some reason towards VPP.
+
+The host also has a static ARP entry configured for the IP of the
+VPP-end TAP interface, to get rid of unnecessary broadcasts between the
+main network namespace and VPP.
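+
+The addressing and routing of the host interconnect can be sketched as
+follows. The function name ``host_interconnect`` and the returned
+structure are hypothetical, and the sketch assumes that the node's slice
+of ``VPPHostSubnetCIDR`` (here ``172.30.5.0/24`` for ``NodeID = 5``) has
+already been determined.
+
+.. code-block:: python
+
+   import ipaddress
+
+
+   def host_interconnect(node_vpp_host_subnet,
+                         pod_subnet_cidr="10.1.0.0/16",
+                         service_cidr="10.96.0.0/12"):
+       """Return TAP addresses and host routes for one node."""
+       subnet = ipaddress.ip_network(node_vpp_host_subnet)
+       vpp_ip = subnet.network_address + 1    # .1 on the VPP side
+       host_ip = subnet.network_address + 2   # .2 on the host side (vpp1)
+       # Both host routes use the VPP end of the TAP as the next hop.
+       routes = [
+           (ipaddress.ip_network(pod_subnet_cidr), vpp_ip),
+           (ipaddress.ip_network(service_cidr), vpp_ip),
+       ]
+       return {"vpp_ip": vpp_ip, "host_ip": host_ip, "routes": routes}
+
+
+   config = host_interconnect("172.30.5.0/24")
+   print(config["vpp_ip"], config["host_ip"])   # 172.30.5.1 172.30.5.2
+   for destination, next_hop in config["routes"]:
+       print(f"{destination} via {next_hop} dev vpp1")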
+
+VXLANs to Other Nodes
+~~~~~~~~~~~~~~~~~~~~~
+
+In order to provide inter-node POD to POD connectivity via any underlay
+network (not necessarily an L2 network), Contiv/VPP sets up a VXLAN
+tunnel overlay between each pair of nodes within the cluster (full mesh).
+
+All VXLAN tunnels are terminated in one bridge domain on each VPP. The
+bridge domain has learning and flooding disabled; the l2fib of the
+bridge domain contains a static entry for each VXLAN tunnel. Each bridge
+domain has a BVI interface, which interconnects the bridge domain with
+the main VRF (L3 forwarding). This interface needs a unique IP address,
+which is assigned from the ``VxlanCIDR`` as described above.
+
+The main VRF contains several static routes that point to the BVI IP
+addresses of other nodes. For each remote node, there is a route to its
+PODSubnet and VppHostSubnet, as well as a route to the management IP
+address of the remote node. For each of these routes, the next hop IP
+is the BVI interface IP of the remote node, which is reached via the
+BVI interface of the local node.
+
+The VXLAN tunnels and the static routes pointing to them are
+added/deleted on each VPP, whenever a node is added/deleted in the k8s
+cluster.
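+
+The full mesh and the per-node static routes can be illustrated with a
+short Python sketch. The function names are hypothetical, the default
+CIDRs are the IPAM defaults from this document, and the Node ID is
+assumed to select both the per-node subnet and the BVI address. The
+route to the management IP of the remote node is omitted, because that
+address is not derived from the Node ID.
+
+.. code-block:: python
+
+   import ipaddress
+   from itertools import combinations
+
+
+   def vxlan_tunnels(node_ids):
+       """Return one (node_a, node_b) pair per tunnel of the full mesh."""
+       return list(combinations(sorted(node_ids), 2))
+
+
+   def routes_to_remote_nodes(local_id, node_ids,
+                              pod_cidr="10.1.0.0/16",
+                              vpp_host_cidr="172.30.0.0/16",
+                              vxlan_cidr="192.168.30.0/24",
+                              prefix_len=24):
+       """Return (destination, next_hop) pairs for the VRF of local_id."""
+       def node_slice(cidr, node_id):
+           # Pick the node_id-th subnet of the given range.
+           network = ipaddress.ip_network(cidr)
+           return list(network.subnets(new_prefix=prefix_len))[node_id]
+
+       bvi_base = ipaddress.ip_network(vxlan_cidr).network_address
+       routes = []
+       for remote_id in sorted(node_ids):
+           if remote_id == local_id:
+               continue
+           next_hop = bvi_base + remote_id   # BVI IP of the remote node
+           routes.append((node_slice(pod_cidr, remote_id), next_hop))
+           routes.append((node_slice(vpp_host_cidr, remote_id), next_hop))
+       return routes
+
+
+   print(vxlan_tunnels([1, 2, 3]))   # [(1, 2), (1, 3), (2, 3)]
+   for destination, next_hop in routes_to_remote_nodes(1, [1, 2, 3]):
+       print(f"{destination} via {next_hop}")
+
+With ``n`` nodes the mesh contains ``n * (n - 1) / 2`` tunnels, and each
+node holds routes for the ``n - 1`` remote nodes, which is why both sets
+have to be updated whenever a node joins or leaves the cluster.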
+
+More Info
+~~~~~~~~~
+
+Please refer to the `Packet Flow Dev
+Guide <../dev-guide/PACKET_FLOW.html>`_ for a more detailed description
+of the paths traversed by request and response packets inside a
+Contiv/VPP Kubernetes cluster under different situations.