Diffstat (limited to 'docs/usecases/contiv/NETWORKING.md')
-rw-r--r-- docs/usecases/contiv/NETWORKING.md 137
1 files changed, 137 insertions, 0 deletions
diff --git a/docs/usecases/contiv/NETWORKING.md b/docs/usecases/contiv/NETWORKING.md
new file mode 100644
index 00000000000..0b6d08127fb
--- /dev/null
+++ b/docs/usecases/contiv/NETWORKING.md
@@ -0,0 +1,137 @@
+# Contiv/VPP Network Operation
+
+This document describes the network operation of the Contiv/VPP k8s network plugin. It
+covers the operation and configuration options of the Contiv IPAM, as well as
+details on how VPP is programmed by the Contiv/VPP control plane.
+
+The following picture shows a 2-node k8s deployment of Contiv/VPP, with a VXLAN tunnel
+established between the nodes to forward inter-node POD traffic. The IPAM options
+are depicted on Node 1, whereas the VPP programming is depicted on Node 2.
+
+![Contiv/VPP Architecture](/_images/contiv-networking.png "contiv-networking.png")
+
+## Contiv/VPP IPAM (IP Address Management)
+
+IPAM in Contiv/VPP is based on the concept of **Node ID**. The Node ID is a number
+that uniquely identifies a node in the k8s cluster. The first node is assigned
+the ID of 1, the second node 2, etc. If a node leaves the cluster, its
+ID is released back to the pool and will be re-used by the next node.
+
+The Node ID is used to calculate per-node IP subnets for PODs
+and other internal subnets that need to be unique on each node. Apart from the Node ID,
+the input for IPAM calculations is a set of config knobs, which can be specified
+in the `IPAMConfig` section of the [Contiv/VPP deployment YAML](../../../k8s/contiv-vpp.yaml):
+
+- **PodSubnetCIDR** (default `10.1.0.0/16`): each POD gets an IP address assigned
+from this range. The size of this range (default `/16`) dictates the upper limit of
+the POD count for the entire k8s cluster (default 65536 PODs).
+
+- **PodNetworkPrefixLen** (default `24`): prefix length of the POD subnet dedicated to each node.
+This value dictates how large a slice of the allocatable range defined in `PodSubnetCIDR`
+each node gets. With the default value (`24`), each node
+has a `/24` slice of the `PodSubnetCIDR`. The Node ID selects which slice a node receives.
+With `PodSubnetCIDR = 10.1.0.0/16`, `PodNetworkPrefixLen = 24` and `NodeID = 5`,
+the resulting POD subnet for the node would be `10.1.5.0/24`.
+
+- **PodIfIPCIDR** (default `10.2.1.0/24`): VPP-internal addresses used to put the VPP interfaces
+facing the PODs into L3 mode. This IP range is reused
+on each node, so it is never addressable from outside the node itself.
+The only requirement is that this subnet should not collide with any other IPAM subnet.
+
+- **VPPHostSubnetCIDR** (default `172.30.0.0/16`): used for addressing
+the interconnect of VPP with the Linux network stack, within the same node.
+Since this subnet needs to be unique on each node, the Node ID is used to determine
+the actual subnet used on the node, in combination with `VPPHostNetworkPrefixLen`, `PodSubnetCIDR` and `PodNetworkPrefixLen`.
+
+- **VPPHostNetworkPrefixLen** (default `24`): used to calculate the subnet
+for addressing the interconnect of VPP with the Linux network stack, within the same node.
+With `VPPHostSubnetCIDR = 172.30.0.0/16`, `VPPHostNetworkPrefixLen = 24` and
+`NodeID = 5` the resulting subnet for the node would be `172.30.5.0/24`.
+
+- **NodeInterconnectCIDR** (default `192.168.16.0/24`): range for the addresses
+assigned to the data plane interfaces managed by VPP. Unless DHCP is used
+(`NodeInterconnectDHCP = True`), the Contiv/VPP control plane automatically assigns
+an IP address from this range to the DPDK-managed ethernet interface bound to VPP
+on each node. The actual IP address will be calculated from the Node ID (e.g., with
+`NodeInterconnectCIDR = 192.168.16.0/24` and `NodeID = 5`, the resulting IP
+address assigned to the ethernet interface on VPP will be `192.168.16.5`).
+
+- **NodeInterconnectDHCP** (default `False`): when set to `True`, the IP addresses
+for the data plane interfaces managed by VPP are assigned by DHCP instead of being
+allocated from `NodeInterconnectCIDR` by the Contiv/VPP control plane. A DHCP server must be
+running in the network to which the data plane interface is connected.
+If `NodeInterconnectDHCP = True`, `NodeInterconnectCIDR` is ignored.
+
+- **VxlanCIDR** (default `192.168.30.0/24`): in order to provide inter-node
+POD to POD connectivity via any underlay network (not necessarily an L2 network),
+Contiv/VPP sets up a VXLAN tunnel overlay between each pair of nodes within the cluster.
+Each node needs a unique IP address for its VXLAN BVI interface. This IP address
+is automatically calculated from the Node ID (e.g., with `VxlanCIDR = 192.168.30.0/24`
+and `NodeID = 5`, the resulting IP address assigned to the VXLAN BVI interface will be `192.168.30.5`).
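+
+The per-node calculations described in the list above can be sketched in a few lines of
+Python. This is only an illustration built on the standard `ipaddress` module, not the
+actual Contiv/VPP IPAM code: the helper names `node_subnet` and `node_address` are
+hypothetical, and the sketch assumes the Node ID is used directly as the subnet index
+or host offset, which matches the examples given in this document.
+
+```python
+import ipaddress
+from itertools import islice
+
+
+def node_subnet(cidr, prefix_len, node_id):
+    """Return the node_id-th subnet of length prefix_len carved out of cidr.
+
+    Assumption: the Node ID is used as a zero-based index into the subnets.
+    """
+    subnets = ipaddress.ip_network(cidr).subnets(new_prefix=prefix_len)
+    return next(islice(subnets, node_id, None))
+
+
+def node_address(cidr, node_id):
+    """Return the address at offset node_id from the start of cidr."""
+    return ipaddress.ip_network(cidr).network_address + node_id
+
+
+node_id = 5
+print(node_subnet("10.1.0.0/16", 24, node_id))     # per-node POD subnet
+print(node_subnet("172.30.0.0/16", 24, node_id))   # per-node VPP-host subnet
+print(node_address("192.168.16.0/24", node_id))    # data plane interface IP
+print(node_address("192.168.30.0/24", node_id))    # VXLAN BVI IP
+```
+
+With `NodeID = 5` and the default settings, the four values correspond to the
+`10.1.5.0/24`, `172.30.5.0/24`, `192.168.16.5` and `192.168.30.5` examples above.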
+
+## VPP Programming
+This section describes how the Contiv/VPP control plane programs VPP, based on the
+events it receives from k8s. This section is not necessary for understanding
+basic Contiv/VPP operation, but is very useful for debugging purposes.
+
+Contiv/VPP currently uses a single VRF to forward the traffic between PODs on a node,
+PODs on different nodes, the host network stack, and the DPDK-managed dataplane interface. The forwarding
+between each of them is purely L3-based, even for cases of communication
+between 2 PODs within the same node.
+
+#### DPDK-Managed Data Interface
+In order to allow inter-node communication between PODs on different
+nodes and between PODs and the outside world, Contiv/VPP uses data-plane interfaces
+bound to VPP using DPDK. Each node should have one "main" VPP interface,
+which is unbound from the host network stack and bound to VPP.
+The Contiv/VPP control plane automatically configures the interface either
+via DHCP, or with a statically assigned address (see `NodeInterconnectCIDR` and
+`NodeInterconnectDHCP` yaml settings).
+
+#### PODs on the Same Node
+PODs are connected to VPP using virtio-based TAP interfaces created by VPP,
+with the POD-end of the interface placed into the POD container network namespace.
+Each POD is assigned an IP address from the `PodSubnetCIDR`. The allocated IP
+is configured with the prefix length `/32`. Additionally, a static route pointing
+towards the VPP is configured in the POD network namespace.
+The prefix length `/32` means that all IP traffic will be forwarded via the
+default route, which points to VPP. To get rid of unnecessary broadcasts between POD and VPP,
+a static ARP entry is configured for the gateway IP in the POD namespace, as well
+as for POD IP on VPP. Both ends of the TAP interface have a static (non-default)
+MAC address applied.
+
+#### PODs with hostNetwork=true
+PODs with a `hostNetwork=true` attribute are not placed into a separate network namespace. Instead, they use the main host Linux network namespace, so they are not directly connected to VPP. They rely on the interconnection between VPP and the host Linux network stack,
+which is described in the next section. Note that when these PODs access a service IP, their network communication is NATed in Linux (by iptables rules programmed by kube-proxy)
+rather than in VPP, which is where NAT happens for the PODs connected to VPP directly.
+
+#### Linux Host Network Stack
+In order to interconnect the Linux host network stack with VPP (to allow access
+to the cluster resources from the host itself, as well as for the PODs with `hostNetwork=true`),
+VPP creates a TAP interface between VPP and the main network namespace. The TAP interface is configured with IP addresses from the `VPPHostSubnetCIDR` range, with `.1` in the last octet on the VPP side, and `.2` on the host side. The name of the host interface is `vpp1`. The host has the following static routes pointing to VPP:
+- A route to the whole `PodSubnetCIDR`, to route traffic targeting PODs towards VPP.
+- A route to `ServiceCIDR` (default `10.96.0.0/12`), to route service IP targeted traffic that has not been translated by kube-proxy for some reason towards VPP.
+
+The host also has a static ARP entry configured for the IP of the VPP-end TAP interface, to get rid of unnecessary broadcasts between the main network namespace and VPP.
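+
+As a worked example of the addressing described above, the following Python sketch derives
+the two TAP addresses and the host-side static routes for a node whose VPP-host subnet is
+`172.30.5.0/24` (Node ID 5 with default settings). The function name
+`host_interconnect` is hypothetical, and using the VPP-side TAP address as the next hop
+of the host routes is an assumption made for illustration.
+
+```python
+import ipaddress
+
+
+def host_interconnect(node_host_subnet, pod_subnet_cidr="10.1.0.0/16",
+                      service_cidr="10.96.0.0/12"):
+    """Return (vpp_side_ip, host_side_ip, host_routes) for one node."""
+    net = ipaddress.ip_network(node_host_subnet)
+    vpp_side = net.network_address + 1    # `.1` on the VPP side
+    host_side = net.network_address + 2   # `.2` on the host side (`vpp1`)
+    # Assumption: both host routes use the VPP-side TAP address as next hop.
+    routes = [(pod_subnet_cidr, str(vpp_side)), (service_cidr, str(vpp_side))]
+    return str(vpp_side), str(host_side), routes
+
+
+vpp_ip, host_ip, routes = host_interconnect("172.30.5.0/24")
+print(vpp_ip, host_ip)
+for destination, next_hop in routes:
+    print(f"{destination} via {next_hop}")
+```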
+
+#### VXLANs to Other Nodes
+In order to provide inter-node POD to POD connectivity via any underlay network
+(not necessarily an L2 network), Contiv/VPP sets up a VXLAN tunnel overlay between
+each pair of nodes within the cluster (full mesh).
+
+All VXLAN tunnels are terminated in one bridge domain on each VPP. The bridge domain
+has learning and flooding disabled, and its l2fib contains a static entry for each VXLAN tunnel. Each bridge domain has a BVI interface, which
+interconnects the bridge domain with the main VRF (L3 forwarding). This interface needs
+a unique IP address, which is assigned from the `VxlanCIDR` as described above.
+
+The main VRF contains several static routes that point to the BVI IP addresses of other nodes.
+For each remote node, there is a route to its PODSubnet and its VppHostSubnet, as well as a route
+to its management IP address. For each of these routes, the next hop IP is the
+BVI interface IP of the remote node, which is reached via the BVI interface of the local node.
+
+The VXLAN tunnels and the static routes pointing to them are added/deleted on each VPP,
+whenever a node is added/deleted in the k8s cluster.
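+
+The full mesh and the per-node static routes can be illustrated with a short Python
+sketch. It is not the Contiv/VPP implementation: the function names are hypothetical, the
+default IPAM values from this document are assumed, and the route to the management IP
+of each remote node is left out because that address is not derived from the Node ID.
+
+```python
+import ipaddress
+from itertools import combinations, islice
+
+POD_SUBNET_CIDR = "10.1.0.0/16"
+VPP_HOST_SUBNET_CIDR = "172.30.0.0/16"
+VXLAN_CIDR = "192.168.30.0/24"
+PREFIX_LEN = 24
+
+
+def node_subnet(cidr, prefix_len, node_id):
+    """Return the node_id-th subnet of length prefix_len within cidr."""
+    subnets = ipaddress.ip_network(cidr).subnets(new_prefix=prefix_len)
+    return next(islice(subnets, node_id, None))
+
+
+def vxlan_tunnels(node_ids):
+    """One tunnel per unordered pair of nodes (full mesh)."""
+    return list(combinations(sorted(node_ids), 2))
+
+
+def remote_routes(local_id, node_ids):
+    """Static routes in the main VRF of local_id: (destination, next hop)."""
+    routes = []
+    for remote_id in sorted(node_ids):
+        if remote_id == local_id:
+            continue
+        # Next hop is the BVI IP of the remote node.
+        bvi_ip = ipaddress.ip_network(VXLAN_CIDR).network_address + remote_id
+        for cidr in (POD_SUBNET_CIDR, VPP_HOST_SUBNET_CIDR):
+            subnet = node_subnet(cidr, PREFIX_LEN, remote_id)
+            routes.append((str(subnet), str(bvi_ip)))
+    return routes
+
+
+print(vxlan_tunnels([1, 2, 3]))
+print(remote_routes(1, [1, 2, 3]))
+```
+
+A cluster of `n` nodes needs `n * (n - 1) / 2` tunnels in total, and every node terminates
+`n - 1` of them in its bridge domain.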
+
+
+#### More Info
+Please refer to the [Packet Flow Dev Guide](../dev-guide/PACKET_FLOW.html) for a more
+detailed description of the paths traversed by request and response packets
+inside a Contiv/VPP Kubernetes cluster under different situations.
\ No newline at end of file