Contiv/VPP Network Operation
============================

This document describes the network operation of the Contiv/VPP k8s
network plugin. It elaborates on the operation and configuration options
of the Contiv IPAM, as well as the details of how VPP is programmed by
the Contiv/VPP control plane.

The following figure shows a 2-node k8s deployment of Contiv/VPP, with a
VXLAN tunnel established between the nodes to forward inter-node POD
traffic. The IPAM options are depicted on Node 1, whereas the VPP
programming is depicted on Node 2.

.. figure:: /_images/contiv-networking.png
   :alt: contiv-networking.png

   Contiv/VPP Architecture

Contiv/VPP IPAM (IP Address Management)
---------------------------------------

IPAM in Contiv/VPP is based on the concept of a **Node ID**. The Node ID
is a number that uniquely identifies a node in the k8s cluster. The
first node is assigned the ID of 1, the second node 2, etc. If a node
leaves the cluster, its ID is released back to the pool and will be
re-used by the next node.

The Node ID is used to calculate per-node IP subnets for PODs and other
internal subnets that need to be unique on each node. Apart from the
Node ID, the input for the IPAM calculations is a set of config knobs,
which can be specified in the ``IPAMConfig`` section of the `Contiv/VPP
deployment YAML <../../../k8s/contiv-vpp.yaml>`__:

- **PodSubnetCIDR** (default ``10.1.0.0/16``): each POD gets an IP
  address assigned from this range. The size of this range (default
  ``/16``) dictates the upper limit of the POD count for the entire k8s
  cluster (default 65536 PODs).

- **PodNetworkPrefixLen** (default ``24``): the prefix length of the
  per-node dedicated POD subnet. From the allocatable range defined in
  ``PodSubnetCIDR``, this value dictates the slice allocated to each
  node; with the default value (``24``), each node gets a ``/24`` slice
  of the ``PodSubnetCIDR``, and the Node ID selects which slice. With
  ``PodSubnetCIDR = 10.1.0.0/16``, ``PodNetworkPrefixLen = 24`` and
  ``NodeID = 5``, the resulting POD subnet for the node is
  ``10.1.5.0/24``.

- **PodIfIPCIDR** (default ``10.2.1.0/24``): VPP-internal addresses
  used to put the VPP interfaces facing towards the PODs into L3 mode.
  This IP range is reused on each node and is therefore never
  addressable from outside of the node itself. The only requirement is
  that this subnet does not collide with any other IPAM subnet.

- **VPPHostSubnetCIDR** (default ``172.30.0.0/16``): used for
  addressing the interconnect between VPP and the Linux network stack
  within the same node. Since this subnet needs to be unique on each
  node, the Node ID is used, in combination with
  ``VPPHostNetworkPrefixLen``, to determine the actual subnet used on
  the node.

- **VPPHostNetworkPrefixLen** (default ``24``): the prefix length of
  the per-node subnet used for addressing the interconnect between VPP
  and the Linux network stack within the same node. With
  ``VPPHostSubnetCIDR = 172.30.0.0/16``,
  ``VPPHostNetworkPrefixLen = 24`` and ``NodeID = 5``, the resulting
  subnet for the node is ``172.30.5.0/24``.

- **NodeInterconnectCIDR** (default ``192.168.16.0/24``): range for the
  addresses assigned to the data plane interfaces managed by VPP.
  Unless DHCP is used (``NodeInterconnectDHCP = True``), the Contiv/VPP
  control plane automatically assigns an IP address from this range to
  the DPDK-managed ethernet interface bound to VPP on each node. The
  actual IP address is calculated from the Node ID (e.g., with
  ``NodeInterconnectCIDR = 192.168.16.0/24`` and ``NodeID = 5``, the
  resulting IP address assigned to the ethernet interface on VPP is
  ``192.168.16.5``).

- **NodeInterconnectDHCP** (default ``False``): if set to ``True``, the
  IP addresses of the VPP-managed data plane interfaces are assigned by
  DHCP instead of being allocated from ``NodeInterconnectCIDR`` by the
  Contiv/VPP control plane. A DHCP server must be running in the
  network to which the data plane interface is connected. When
  ``NodeInterconnectDHCP = True``, ``NodeInterconnectCIDR`` is ignored.

- **VxlanCIDR** (default ``192.168.30.0/24``): in order to provide
  inter-node POD to POD connectivity via any underlay network (not
  necessarily an L2 network), Contiv/VPP sets up a VXLAN tunnel overlay
  between every pair of nodes within the cluster. Each node needs a
  unique IP address for its VXLAN BVI interface. This IP address is
  automatically calculated from the Node ID (e.g., with
  ``VxlanCIDR = 192.168.30.0/24`` and ``NodeID = 5``, the resulting IP
  address assigned to the VXLAN BVI interface is ``192.168.30.5``).
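To make the address arithmetic above concrete, the following is a
minimal sketch (not part of Contiv/VPP) that reproduces the worked
examples from the list above with Python's ``ipaddress`` module. It
assumes the per-node slice or address is indexed directly by the Node
ID, which matches the examples shown above; the exact indexing inside
the Contiv/VPP IPAM implementation may differ.

.. code-block:: python

    import ipaddress

    # IPAM knobs (defaults from the deployment YAML) and an example Node ID.
    POD_SUBNET_CIDR = "10.1.0.0/16"
    POD_NETWORK_PREFIX_LEN = 24
    VPP_HOST_SUBNET_CIDR = "172.30.0.0/16"
    VPP_HOST_NETWORK_PREFIX_LEN = 24
    NODE_INTERCONNECT_CIDR = "192.168.16.0/24"
    VXLAN_CIDR = "192.168.30.0/24"
    NODE_ID = 5

    def node_subnet(cidr, new_prefix, node_id):
        """Return the node_id-th /new_prefix slice of the given range."""
        return list(ipaddress.ip_network(cidr).subnets(new_prefix=new_prefix))[node_id]

    def node_address(cidr, node_id):
        """Return the node_id-th address of the given range."""
        return ipaddress.ip_network(cidr).network_address + node_id

    print(node_subnet(POD_SUBNET_CIDR, POD_NETWORK_PREFIX_LEN, NODE_ID))            # 10.1.5.0/24
    print(node_subnet(VPP_HOST_SUBNET_CIDR, VPP_HOST_NETWORK_PREFIX_LEN, NODE_ID))  # 172.30.5.0/24
    print(node_address(NODE_INTERCONNECT_CIDR, NODE_ID))                            # 192.168.16.5
    print(node_address(VXLAN_CIDR, NODE_ID))                                        # 192.168.30.5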
VPP Programming
---------------

This section describes how the Contiv/VPP control plane programs VPP,
based on the events it receives from k8s. It is not necessary for
understanding basic Contiv/VPP operation, but is very useful for
debugging purposes.

Contiv/VPP currently uses a single VRF to forward the traffic between
PODs on a node, PODs on different nodes, the host network stack, and
the DPDK-managed dataplane interface. The forwarding between each of
them is purely L3-based, even in the case of communication between two
PODs within the same node.

DPDK-Managed Data Interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to allow inter-node communication between PODs on different
nodes and between PODs and the outside world, Contiv/VPP uses
data-plane interfaces bound to VPP using DPDK. Each node should have
one "main" VPP interface, which is unbound from the host network stack
and bound to VPP. The Contiv/VPP control plane automatically configures
the interface either via DHCP, or with a statically assigned address
(see the ``NodeInterconnectCIDR`` and ``NodeInterconnectDHCP`` YAML
settings).

PODs on the Same Node
~~~~~~~~~~~~~~~~~~~~~

PODs are connected to VPP using virtio-based TAP interfaces created by
VPP, with the POD end of the interface placed into the POD container
network namespace. Each POD is assigned an IP address from the POD
subnet of its node (allocated from ``PodSubnetCIDR``). The allocated IP
is configured with the prefix length ``/32``. Additionally, a static
route pointing towards VPP is configured in the POD network namespace.
The ``/32`` prefix length means that all IP traffic leaving the POD is
forwarded via the default route, i.e. to VPP. To get rid of unnecessary
broadcasts between the POD and VPP, a static ARP entry is configured
for the gateway IP in the POD namespace, as well as for the POD IP on
VPP. Both ends of the TAP interface have a static (non-default) MAC
address applied.

PODs with hostNetwork=true
~~~~~~~~~~~~~~~~~~~~~~~~~~

PODs with the ``hostNetwork=true`` attribute are not placed into a
separate network namespace; they instead use the main host Linux
network namespace and are therefore not directly connected to VPP. They
rely on the interconnection between VPP and the host Linux network
stack, which is described in the next paragraph. Note that when these
PODs access a service IP, their network communication is NATed in Linux
(by iptables rules programmed by kube-proxy), as opposed to on VPP,
which is the case for the PODs connected to VPP directly.

Linux Host Network Stack
~~~~~~~~~~~~~~~~~~~~~~~~

In order to interconnect the Linux host network stack with VPP (to
allow access to the cluster resources from the host itself, as well as
for the PODs with ``hostNetwork=true``), VPP creates a TAP interface
between VPP and the main network namespace. The TAP interface is
configured with IP addresses from the ``VPPHostSubnetCIDR`` range, with
``.1`` in the last octet on the VPP side and ``.2`` on the host side.
The name of the host-side interface is ``vpp1``. The host has the
following static routes pointing to VPP configured:

- a route to the whole ``PodSubnetCIDR``, to route traffic targeting
  PODs towards VPP;
- a route to the ``ServiceCIDR`` (default ``10.96.0.0/12``), to route
  service-IP-targeted traffic that has not been translated by
  kube-proxy for some reason towards VPP.

The host also has a static ARP entry configured for the IP of the
VPP-end TAP interface, to get rid of unnecessary broadcasts between the
main network namespace and VPP.

VXLANs to Other Nodes
~~~~~~~~~~~~~~~~~~~~~

In order to provide inter-node POD to POD connectivity via any underlay
network (not necessarily an L2 network), Contiv/VPP sets up a VXLAN
tunnel overlay between every pair of nodes within the cluster (full
mesh).

All VXLAN tunnels are terminated in one bridge domain on each VPP. The
bridge domain has learning and flooding disabled; the L2 FIB of the
bridge domain contains a static entry for each VXLAN tunnel. Each
bridge domain has a BVI interface, which interconnects the bridge
domain with the main VRF (L3 forwarding). This interface needs a unique
IP address, which is assigned from the ``VxlanCIDR`` as described
above.

The main VRF contains several static routes that point to the BVI IP
addresses of the other nodes. For each remote node, these are a route
to its POD subnet, a route to its VPP host subnet, and a route to its
management IP address. For each of these routes, the next hop is the
BVI interface IP of the remote node, reached via the BVI interface of
the local node, as illustrated by the sketch below.
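The following is a minimal illustration (not the actual control-plane
code) of the routes described above, derived with the default IPAM
ranges and the same Node-ID-based indexing assumed earlier. The route to
the remote node's management IP is omitted, since that address is not
derived from the IPAM ranges shown here.

.. code-block:: python

    import ipaddress

    def node_subnet(cidr, new_prefix, node_id):
        """Return the node_id-th /new_prefix slice of the given range."""
        return list(ipaddress.ip_network(cidr).subnets(new_prefix=new_prefix))[node_id]

    def node_address(cidr, node_id):
        """Return the node_id-th address of the given range."""
        return ipaddress.ip_network(cidr).network_address + node_id

    def remote_node_routes(remote_node_ids):
        """Sketch of the static routes installed in the main VRF of the local
        node: one route to the POD subnet and one to the VPP host subnet of
        every remote node, with the remote node's VXLAN BVI IP as next hop."""
        routes = []
        for remote_id in remote_node_ids:
            next_hop = node_address("192.168.30.0/24", remote_id)                   # remote VXLAN BVI IP
            routes.append((node_subnet("10.1.0.0/16", 24, remote_id), next_hop))    # remote POD subnet
            routes.append((node_subnet("172.30.0.0/16", 24, remote_id), next_hop))  # remote VPP host subnet
        return routes

    # Routes programmed on node 1 for remote nodes 2 and 3:
    for dst, via in remote_node_routes([2, 3]):
        print(f"{dst} via {via}")
    # 10.1.2.0/24 via 192.168.30.2
    # 172.30.2.0/24 via 192.168.30.2
    # 10.1.3.0/24 via 192.168.30.3
    # 172.30.3.0/24 via 192.168.30.3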
The VXLAN tunnels and the static routes pointing to them are
added/deleted on each VPP whenever a node is added to or deleted from
the k8s cluster.

More Info
~~~~~~~~~

Please refer to the `Packet Flow Dev
Guide <../dev-guide/PACKET_FLOW.html>`__ for a more detailed
description of the paths traversed by request and response packets
inside a Contiv/VPP Kubernetes cluster under different situations.