# SR-MPLS: Segment Routing for MPLS    {#srmpls_doc}

This memo documents the VPP SR-MPLS implementation.
Anything that is not directly obvious should be explained here.
For feedback on content that needs explaining, please email pcamaril@cisco.com

## Segment Routing

Segment routing is a network technology focused on addressing the limitations of existing IP and Multiprotocol Label Switching (MPLS) networks in terms of simplicity, scale, and ease of operation. It is a foundation for application engineered routing as it prepares the networks for new business models where applications can control the network behavior.

Segment routing seeks the right balance between distributed intelligence and centralized optimization and programming. It was built for the software-defined networking (SDN) era.

Segment routing enhances packet forwarding behavior by enabling a network to transport unicast packets through a specific forwarding path, different from the normal path that a packet usually takes (IGP shortest path or BGP best path). This capability benefits many use cases, and one can build those specific paths based on application requirements.

Segment routing uses the source routing paradigm. A node, usually a router but also a switch, a trusted server, or a virtual forwarder running on a hypervisor, steers a packet through an ordered list of instructions, called segments. A segment can represent any instruction, topological or service-based. A segment can have a local semantic to a segment-routing node or global within a segment-routing network. Segment routing allows an operator to enforce a flow through any topological path and service chain while maintaining per-flow state only at the ingress node to the segment-routing network. Segment routing also supports equal-cost multipath (ECMP) by design.

Segment routing can operate with either an MPLS or an IPv6 data plane. All the currently available MPLS services, such as Layer 3 VPN (L3VPN), L2VPN (Virtual Private Wire Service [VPWS], Virtual Private LAN Services [VPLS], Ethernet VPN [E-VPN], and Provider Backbone Bridging Ethernet VPN [PBB-EVPN]), can run on top of a segment-routing transport network.

**The implementation of Segment Routing in VPP covers both the IPv6 data plane (SRv6) as well as the MPLS data plane (SR-MPLS). This page contains the SR-MPLS documentation.**

## Segment Routing terminology

* SegmentID (SID): an MPLS label.
* Segment List (SL) (SID List): the sequence of SIDs that the packet will traverse.
* SR Policy: a set of candidate paths (SID list + weight). An SR policy is uniquely identified by its Binding SID and associated with a weighted set of Segment Lists. If several SID lists are defined, traffic steered into the policy is load-balanced among them in proportion to their respective weights.
* BindingSID: a single SID associated one-to-one with an SR Policy. If a packet arrives with an MPLS label corresponding to a BindingSID, the SR policy is applied to that packet (the BindingSID is popped first).
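
The weighted load-balancing among SID lists can be illustrated with a small sketch. This is not VPP code: the `sid_list_t` type and the `sr_policy_pick_sid_list` function are hypothetical names introduced only for illustration, and the sketch assumes a per-flow hash is already available. It maps the hash onto the SID lists in proportion to their weights.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for illustration only; not a VPP structure. */
typedef struct
{
  const uint32_t *labels;   /* SIDs (MPLS labels), in traversal order */
  size_t n_labels;
  uint32_t weight;          /* relative share of the steered traffic */
} sid_list_t;

/* Map a flow hash onto one SID list in proportion to the weights.
 * Returns the index of the selected SID list. */
static size_t
sr_policy_pick_sid_list (const sid_list_t *lists, size_t n_lists,
                         uint32_t flow_hash)
{
  uint32_t total = 0;
  uint32_t slot;
  size_t i;

  assert (n_lists > 0);
  for (i = 0; i < n_lists; i++)
    total += lists[i].weight;
  assert (total > 0);

  /* Each SID list owns "weight" consecutive slots out of "total". */
  slot = flow_hash % total;
  for (i = 0; i < n_lists; i++)
    {
      if (slot < lists[i].weight)
        return i;
      slot -= lists[i].weight;
    }
  return n_lists - 1;       /* not reached when total > 0 */
}
```

With two SID lists of weight 5 and 3, five out of every eight hash values select the first list and the remaining three select the second.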

## SR-MPLS features in VPP

The SR-MPLS implementation focuses on SR policies and on steering traffic into them. Other SR-MPLS features, such as AdjSIDs, can be achieved using the regular VPP MPLS implementation.

The <a href="https://datatracker.ietf.org/doc/draft-filsfils-spring-segment-routing-policy/">Segment Routing Policy (*draft-filsfils-spring-segment-routing-policy*)</a> defines SR Policies.

## Creating an SR Policy

An SR Policy is defined by a Binding SID and a weighted set of Segment Lists.

A new SR policy is created with a first SID list using:

    sr mpls policy add bsid 40001 next 16001 next 16002 next 16003 (weight 5)

* The weight parameter is only used if more than one SID list is associated with the policy.

An SR policy is deleted with:

    sr mpls policy del bsid 40001

The existing SR policies are listed with:

    show sr mpls policies

### Adding/Removing SID Lists from an SR policy

An additional SID list is associated with an existing SR policy with:

    sr mpls policy mod bsid 40001 add sl next 16001 next 16002 next 16003 (weight 3)

Conversely, a SID list can be removed from an SR policy with:

    sr mpls policy mod bsid 40001 del sl index 1

Note that this CLI cannot be used to remove the last SID list of a policy; use the SR policy delete CLI instead.

The weight of a SID list can also be modified with:

    sr mpls policy mod bsid 40001 mod sl index 1 weight 4

### SR Policies: Spray policies

Spray policies are a specific type of SR policies where the packet is replicated on all the SID lists, rather than load-balanced among them.

SID list weights are ignored with this type of policy.

A Spray policy is instantiated by appending the keyword **spray** to a regular SR-MPLS policy command, as in:

    sr mpls policy add bsid 40002 next 16001 next 16002 next 16003 spray

Spray policies are used to remove multicast state from a network core domain, instead sending a linear unicast copy to every access node. The last SID in each list accesses the multicast tree within the access node.

## Steering packets into an SR Policy

Segment Routing supports three methods of steering traffic into an SR policy.

### Local steering

In this variant, incoming packets match a routing policy which directs them onto a local SR policy.

To achieve this behavior, the user needs to create an 'sr steering policy via sr policy bsid':

    sr mpls steer l3 2001::/64 via sr policy bsid 40001
    sr mpls steer l3 2001::/64 via sr policy bsid 40001 fib-table 3
    sr mpls steer l3 10.0.0.0/16 via sr policy bsid 40001
    sr mpls steer l3 10.0.0.0/16 via sr policy bsid 40001 vpn-label 500

### Remote steering

In this variant incoming packets have an active SID matching a local BSID at the head-end.

In order to achieve this behavior the packets should simply arrive with an active SID equal to the Binding SID of a locally instantiated SR policy.
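
The effect of remote steering on the label stack can be sketched as follows. This is an illustration under stated assumptions, not the VPP data path: the function name `sr_apply_binding_sid` and the flat array representation of the label stack (index 0 is the top of the stack) are hypothetical. The sketch pops the active label when it equals the Binding SID and pushes the SID list of the policy in its place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for illustration only; not part of VPP.
 * stack[0] is the top (active) label, *depth the number of labels,
 * capacity the size of the stack array.
 * Returns 0 if the policy was applied, -1 if the active label is not
 * the Binding SID or the resulting stack would not fit. */
static int
sr_apply_binding_sid (uint32_t *stack, size_t *depth, size_t capacity,
                      uint32_t bsid, const uint32_t *sids, size_t n_sids)
{
  size_t rest;

  assert (n_sids > 0);
  if (*depth == 0 || stack[0] != bsid)
    return -1;

  rest = *depth - 1;        /* labels left once the BSID is popped */
  if (rest + n_sids > capacity)
    return -1;

  /* Shift the remaining labels down, then write the SID list on top. */
  memmove (stack + n_sids, stack + 1, rest * sizeof (stack[0]));
  memcpy (stack, sids, n_sids * sizeof (stack[0]));
  *depth = rest + n_sids;
  return 0;
}
```

For a packet arriving with labels 40001 and 500 and a policy with Binding SID 40001 and SID list 16001, 16002, 16003, the resulting stack is 16001, 16002, 16003, 500.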

### Automated steering

In this variant incoming packets match a BGP/Service route which recurses on the BSID of a local policy.

To achieve this behavior, the user first needs to color the SR policies, using the CLI:

    sr mpls policy te bsid xxxxx endpoint x.x.x.x color 12341234

Note that an SR policy can have a single endpoint and a single color. The *endpoint* value is an IP46 address and the color is a u32.


Then, for any BGP/Service route, the user has to use the API to steer prefixes:

    sr mpls steer l3 2001::/64 via next-hop 2001::1 color 1234 co 2
    sr mpls steer l3 2001::/64 via next-hop 2001::1 color 1234 co 2 vpn-label 500

Note that *co* refers to the CO-bits (values [0|1|2|3]).

Note also that a given prefix might be steered over several colors (same next-hop and same co-bit value). To add new colors, execute the API several times (or with the del parameter to delete a color).
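
The matching step behind automated steering can be sketched as a lookup keyed on endpoint and color. This is a minimal sketch, not the VPP implementation: the `colored_policy_t` type and the `sr_find_colored_policy` function are hypothetical, addresses are assumed to be stored as 16 raw bytes, and handling of the CO-bits is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type for illustration only; not a VPP structure. */
typedef struct
{
  uint8_t endpoint[16];     /* policy endpoint address (raw bytes) */
  uint32_t color;           /* policy color */
  uint32_t bsid;            /* Binding SID of the policy */
} colored_policy_t;

/* Find the colored policy whose endpoint equals the route's next-hop
 * and whose color equals the requested color.
 * Returns NULL when no such policy exists. */
static const colored_policy_t *
sr_find_colored_policy (const colored_policy_t *policies, size_t n_policies,
                        const uint8_t next_hop[16], uint32_t color)
{
  size_t i;

  assert (next_hop != NULL);
  for (i = 0; i < n_policies; i++)
    {
      if (policies[i].color == color
          && memcmp (policies[i].endpoint, next_hop, 16) == 0)
        return &policies[i];
    }
  return NULL;
}
```

A prefix steered via a next-hop and a color would then recurse on the `bsid` of the policy returned by the lookup.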

This variant is meant to be used in conjunction with a control plane agent that uses the underlying binary API bindings of *sr_mpls_steering_policy_add*/*sr_mpls_steering_policy_del* for any BGP service route received.
/*
 * Copyright (c) 2020 Doc.ai and/or its affiliates.
 * Copyright (c) 2020 Cisco and/or its affiliates.
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at:
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include <vnet/adj/adj_midchain.h>
#include <vnet/fib/fib_table.h>
#include <wireguard/wireguard_peer.h>
#include <wireguard/wireguard_if.h>
#include <wireguard/wireguard_messages.h>
#include <wireguard/wireguard_key.h>
#include <wireguard/wireguard_send.h>
#include <wireguard/wireguard.h>

static fib_source_t wg_fib_source;
wg_peer_t *wg_peer_pool;

index_t *wg_peer_by_adj_index;

static void
wg_peer_endpoint_reset (wg_peer_endpoint_t * ep)
{
  ip46_address_reset (&ep->addr);
  ep->port = 0;
}

static void
wg_peer_endpoint_init (wg_peer_endpoint_t * ep,
		       const ip46_address_t * addr, u16 port)
{
  ip46_address_copy (&ep->addr, addr);
  ep->port = port;
}

static void
wg_peer_fib_flush (wg_peer_t * peer)
{
  wg_peer_allowed_ip_t *allowed_ip;

  vec_foreach (allowed_ip, peer->allowed_ips)
  {
    fib_table_entry_delete_index (allowed_ip->fib_entry_index, wg_fib_source);
    allowed_ip->fib_entry_index = FIB_NODE_INDEX_INVALID;
  }
}

static void
wg_peer_fib_populate (wg_peer_t * peer, u32 fib_index)
{
  wg_peer_allowed_ip_t *allowed_ip;

  vec_foreach (allowed_ip, peer->allowed_ips)
  {
    allowed_ip->fib_entry_index =
      fib_table_entry_path_add (fib_index,
				&allowed_ip->prefix,
				wg_fib_source,
				FIB_ENTRY_FLAG_NONE,
				fib_proto_to_dpo (allowed_ip->
						  prefix.fp_proto),
				&peer->dst.addr, peer->wg_sw_if_index, ~0, 1,
				NULL, FIB_ROUTE_PATH_FLAG_NONE);
  }
}

static void
wg_peer_clear (vlib_main_t * vm, wg_peer_t * peer)
{
  wg_timers_stop (peer);
  for (int i = 0; i < WG_N_TIMERS; i++)
    {
      peer->timers[i] = ~0;
    }

  peer->last_sent_handshake = vlib_time_now (vm) - (REKEY_TIMEOUT + 1);

  clib_memset (&peer->cookie_maker, 0, sizeof (peer->cookie_maker));

  wg_peer_endpoint_reset (&peer->src);
  wg_peer_endpoint_reset (&peer->dst);

  if (INDEX_INVALID != peer->adj_index)
    {
      adj_unlock (peer->adj_index);
      wg_peer_by_adj_index[peer->adj_index] = INDEX_INVALID;
    }
  wg_peer_fib_flush (peer);

  peer->input_thread_index = ~0;
  peer->output_thread_index = ~0;
  peer->adj_index = INDEX_INVALID;
  peer->timer_wheel = 0;
  peer->persistent_keepalive_interval = 0;
  peer->timer_handshake_attempts = 0;
  peer->last_sent_packet = 0;
  peer->last_received_packet = 0;
  peer->session_derived = 0;
  peer->rehandshake_started = 0;
  peer->new_handshake_interval_tick = 0;
  peer->rehandshake_interval_tick = 0;
  peer->timer_need_another_keepalive = false;
  peer->is_dead = true;
  vec_free (peer->allowed_ips);
}

static void
wg_peer_init (vlib_main_t * vm, wg_peer_t * peer)
{
  peer->adj_index = INDEX_INVALID;
  wg_peer_clear (vm, peer);
}

static u8 *
wg_peer_build_rewrite (const wg_peer_t * peer)
{
  // v4 only for now
  ip4_udp_header_t *hdr;
  u8 *rewrite = NULL;

  vec_validate (rewrite, sizeof (*hdr) - 1);
  hdr = (ip4_udp_header_t *) rewrite;

  hdr->ip4.ip_version_and_header_length = 0x45;
  hdr->ip4.ttl = 64;
  hdr->ip4.src_address = peer->src.addr.ip4;
  hdr->ip4.dst_address = peer->dst.addr.ip4;
  hdr->ip4.protocol = IP_PROTOCOL_UDP;
  hdr->ip4.checksum = ip4_header_checksum (&hdr->ip4);

  hdr->udp.src_port = clib_host_to_net_u16 (peer->src.port);
  hdr->udp.dst_port = clib_host_to_net_u16 (peer->dst.port);
  hdr->udp.checksum = 0;

  return (rewrite);
}

static void
wg_peer_adj_stack (wg_peer_t * peer)
{
  ip_adjacency_t *adj;
  u32 sw_if_index;
  wg_if_t *wgi;

  adj = adj_get (peer->adj_index);
  sw_if_index = adj->rewrite_header.sw_if_index;

  wgi = wg_if_get (wg_if_find_by_sw_if_index (sw_if_index));

  if (!wgi)
    return;

  if (!vnet_sw_interface_is_admin_up (vnet_get_main (), wgi->sw_if_index))
    {
      adj_midchain_delegate_unstack (peer->adj_index);
    }
  else
    {
      /* *INDENT-OFF* */
      fib_prefix_t dst = {
        .fp_len = 32,
        .fp_proto = FIB_PROTOCOL_IP4,
        .fp_addr = peer->dst.addr,
      };
      /* *INDENT-ON* */
      u32 fib_index;

      fib_index = fib_table_find (FIB_PROTOCOL_IP4, peer->table_id);

      adj_midchain_delegate_stack (peer->adj_index, fib_index, &dst);
    }
}

walk_rc_t
wg_peer_if_admin_state_change (wg_if_t * wgi, index_t peeri, void *data)
{
  wg_peer_adj_stack (wg_peer_get (peeri));

  return (WALK_CONTINUE);
}

walk_rc_t
wg_peer_if_table_change (wg_if_t * wgi, index_t peeri, void *data)
{
  wg_peer_table_bind_ctx_t *ctx = data;
  wg_peer_t *peer;

  peer = wg_peer_get (peeri);

  wg_peer_fib_flush (peer);
  wg_peer_fib_populate (peer, ctx->new_fib_index);

  return (WALK_CONTINUE);
}

static int
wg_peer_fill (vlib_main_t * vm, wg_peer_t * peer,
	      u32 table_id,
	      const ip46_address_t * dst,
	      u16 port,
	      u16 persistent_keepalive_interval,
	      const fib_prefix_t * allowed_ips, u32 wg_sw_if_index)
{
  wg_peer_endpoint_init (&peer->dst, dst, port);

  peer->table_id = table_id;
  peer->wg_sw_if_index = wg_sw_if_index;
  peer->timer_wheel = &wg_main.timer_wheel;
  peer->persistent_keepalive_interval = persistent_keepalive_interval;
  peer->last_sent_handshake = vlib_time_now (vm) - (REKEY_TIMEOUT + 1);
  peer->is_dead = false;

  const wg_if_t *wgi = wg_if_get (wg_if_find_by_sw_if_index (wg_sw_if_index));

  if (NULL == wgi)
    return (VNET_API_ERROR_INVALID_INTERFACE);

  ip_address_to_46 (&wgi->src_ip, &peer->src.addr);
  peer->src.port = wgi->port;

  /*
   * add an adjacency for the endpoint address in the overlay
   * on the wg interface
   */
  peer->rewrite = wg_peer_build_rewrite (peer);

  peer->adj_index = adj_nbr_add_or_lock (FIB_PROTOCOL_IP4,
					 VNET_LINK_IP4,
					 &peer->dst.addr, wgi->sw_if_index);

  vec_validate_init_empty (wg_peer_by_adj_index,
			   peer->adj_index, INDEX_INVALID);
  wg_peer_by_adj_index[peer->adj_index] = peer - wg_peer_pool;

  adj_nbr_midchain_update_rewrite (peer->adj_index,
				   NULL,
				   NULL,
				   ADJ_FLAG_MIDCHAIN_IP_STACK,
				   vec_dup (peer->rewrite));
  wg_peer_adj_stack (peer);

  /*
   * add a route in the overlay to each of the allowed-ips
   */
  u32 ii;

  vec_validate (peer->allowed_ips, vec_len (allowed_ips) - 1);

  vec_foreach_index (ii, allowed_ips)
  {
    peer->allowed_ips[ii].prefix = allowed_ips[ii];
  }

  wg_peer_fib_populate (peer,
			fib_table_get_index_for_sw_if_index
			(FIB_PROTOCOL_IP4, peer->wg_sw_if_index));

  return (0);
}

int
wg_peer_add (u32 tun_sw_if_index,
	     const u8 public_key[NOISE_PUBLIC_KEY_LEN],
	     u32 table_id,
	     const ip46_address_t * endpoint,
	     const fib_prefix_t * allowed_ips,
	     u16 port, u16 persistent_keepalive, u32 * peer_index)
{
  wg_if_t *wg_if;
  wg_peer_t *peer;
  int rv;

  vlib_main_t *vm = vlib_get_main ();

  if (tun_sw_if_index == ~0)
    return (VNET_API_ERROR_INVALID_SW_IF_INDEX);

  wg_if = wg_if_get (wg_if_find_by_sw_if_index (tun_sw_if_index));
  if (!wg_if)
    return (VNET_API_ERROR_INVALID_SW_IF_INDEX);

  /* *INDENT-OFF* */
  pool_foreach (peer, wg_peer_pool)
   {
    if (!memcmp (peer->remote.r_public, public_key, NOISE_PUBLIC_KEY_LEN))
    {
      return (VNET_API_ERROR_ENTRY_ALREADY_EXISTS);
    }
  }
  /* *INDENT-ON* */

  if (pool_elts (wg_peer_pool) > MAX_PEERS)
    return (VNET_API_ERROR_LIMIT_EXCEEDED);

  pool_get (wg_peer_pool, peer);

  wg_peer_init (vm, peer);

  rv = wg_peer_fill (vm, peer, table_id, endpoint, (u16) port,
		     persistent_keepalive, allowed_ips, tun_sw_if_index);

  if (rv)
    {
      wg_peer_clear (vm, peer);
      pool_put (wg_peer_pool, peer);
      return (rv);
    }

  noise_remote_init (&peer->remote, peer - wg_peer_pool, public_key,
		     wg_if->local_idx);
  cookie_maker_init (&peer->cookie_maker, public_key);

  if (peer->persistent_keepalive_interval != 0)
    {
      wg_send_keepalive (vm, peer);
    }

  *peer_index = peer - wg_peer_pool;
  wg_if_peer_add (wg_if, *peer_index);

  return (0);
}

int
wg_peer_remove (index_t peeri)
{
  wg_main_t *wmp = &wg_main;
  wg_peer_t *peer = NULL;
  wg_if_t *wgi;

  if (pool_is_free_index (wg_peer_pool, peeri))
    return VNET_API_ERROR_NO_SUCH_ENTRY;

  peer = pool_elt_at_index (wg_peer_pool, peeri);

  wgi = wg_if_get (wg_if_find_by_sw_if_index (peer->wg_sw_if_index));
  wg_if_peer_remove (wgi, peeri);

  vnet_feature_enable_disable ("ip4-output", "wg-output-tun",
			       peer->wg_sw_if_index, 0, 0, 0);

  noise_remote_clear (wmp->vlib_main, &peer->remote);
  wg_peer_clear (wmp->vlib_main, peer);
  pool_put (wg_peer_pool, peer);

  return (0);
}

index_t
wg_peer_walk (wg_peer_walk_cb_t fn, void *data)
{
  index_t peeri;

  /* *INDENT-OFF* */
  pool_foreach_index (peeri, wg_peer_pool)
  {
    if (WALK_STOP == fn(peeri, data))
      return peeri;
  }
  /* *INDENT-ON* */
  return INDEX_INVALID;
}

static u8 *
format_wg_peer_endpoint (u8 * s, va_list * args)
{
  wg_peer_endpoint_t *ep = va_arg (*args, wg_peer_endpoint_t *);

  s = format (s, "%U:%d",
	      format_ip46_address, &ep->addr, IP46_TYPE_ANY, ep->port);

  return (s);
}

u8 *
format_wg_peer (u8 * s, va_list * va)
{
  index_t peeri = va_arg (*va, index_t);
  wg_peer_allowed_ip_t *allowed_ip;
  u8 key[NOISE_KEY_LEN_BASE64];
  wg_peer_t *peer;

  peer = wg_peer_get (peeri);
  key_to_base64 (peer->remote.r_public, NOISE_PUBLIC_KEY_LEN, key);

  s = format (s, "[%d] endpoint:[%U->%U] %U keep-alive:%d adj:%d",
	      peeri,
	      format_wg_peer_endpoint, &peer->src,
	      format_wg_peer_endpoint, &peer->dst,
	      format_vnet_sw_if_index_name, vnet_get_main (),
	      peer->wg_sw_if_index,
	      peer->persistent_keepalive_interval, peer->adj_index);
  s = format (s, "\n  key:%=s %U",
	      key, format_hex_bytes, peer->remote.r_public,
	      NOISE_PUBLIC_KEY_LEN);
  s = format (s, "\n  allowed-ips:");
  vec_foreach (allowed_ip, peer->allowed_ips)
  {
    s = format (s, " %U", format_fib_prefix, &allowed_ip->prefix);
  }

  return s;
}

static clib_error_t *
wg_peer_module_init (vlib_main_t * vm)
{
  /*
   * use a priority better than interface source, so that
   * if the same subnet is added to the wg interface and is
   * used as an allowed IP, then the wireguard sourced prefix
   * wins and traffic is routed to the endpoint rather than dropped
   */
  wg_fib_source = fib_source_allocate ("wireguard", 0x2, FIB_SOURCE_BH_API);

  return (NULL);
}

VLIB_INIT_FUNCTION (wg_peer_module_init);

/*
 * fd.io coding-style-patch-verification: ON
 *
 * Local Variables:
 * eval: (c-set-style "gnu")
 * End:
 */