VXLAN EVPN

VXLAN

VXLAN is an encapsulation protocol (RFC 7348) that encapsulates L2 (Ethernet) frames into UDP packets, and therefore into IP packets. The advantages of VXLAN are:

  • uses UDP so it is transported over L3 networks, which can provide a loop-free, ECMP network

  • uses UDP, which reduces traffic polarization. Because the source VTEP varies the UDP source port per flow, the underlay sees more entropy (that is, more variation in its hashing inputs), which makes load sharing across multiple paths more effective. The destination port stays fixed (UDP 4789), which makes VXLAN traffic easy to identify. UDP doesn't provide reliability, but this can be handled by the application whose traffic is encapsulated in the VXLAN UDP packets

  • supports network segmentation at scale through the use of VXLAN VNIs (24 bits) resulting in 16 million segments instead of the traditional VLAN ID (12 bits - 4096 segments)
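The two numeric claims above can be checked with a short Python sketch. The segment counts follow directly from the field widths. The source-port function is only an illustration: real VTEPs compute the hash in hardware, and the choice of CRC32 and the name `vxlan_source_port` are assumptions made for this example. The 49152-65535 range is the dynamic port range that RFC 7348 recommends for the source port.

```python
import zlib

VLAN_ID_BITS = 12
VNI_BITS = 24


def segment_count(bits: int) -> int:
    """Number of distinct identifiers that fit in a field of `bits` bits."""
    return 2 ** bits


def vxlan_source_port(src_mac: str, dst_mac: str, src_ip: str, dst_ip: str,
                      low: int = 49152, high: int = 65535) -> int:
    """Derive a per-flow UDP source port from inner-frame fields.

    Hypothetical hash (CRC32) chosen for illustration only. The point is
    that the same inner flow always maps to the same port, while different
    flows tend to map to different ports, giving the underlay ECMP hash
    more to work with.
    """
    key = f"{src_mac}|{dst_mac}|{src_ip}|{dst_ip}".encode()
    span = high - low + 1
    return low + zlib.crc32(key) % span


if __name__ == "__main__":
    print(segment_count(VLAN_ID_BITS), segment_count(VNI_BITS))
    print(vxlan_source_port("0011.2233.4455", "00aa.bbcc.ddee",
                            "10.1.1.10", "10.1.1.20"))
```

Because the port depends only on the inner flow, packets of one flow stay on one underlay path (no reordering), while separate flows between the same pair of VTEPs can take different paths.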

A key element of VXLAN is the VTEP (VXLAN Tunnel Endpoint), also known as an NVE (Network Virtualization Edge). A VTEP encapsulates frames from its attached hosts and sends them to the VTEP where the destination host is attached.

The obvious question now is: how does a VTEP know where to forward a frame?

Flood and learn

The initial mechanism to learn about MAC addresses in VXLAN networks is called Flood and Learn:

  1. Frame arrives at Ingress VTEP

    1. the VTEP learns about the SRC MAC Address

  2. VTEP Floods the frame into the VXLAN segment using

    1. Head-end replication (manually maintained list of VTEPs), aka Ingress Replication

    2. Multicast groups (VTEPs in the same VNI join the same multicast group - which requires the underlay to support multicast)

  3. Flooded VXLAN reaches all VTEPs in the VNI

    1. Destination VTEP learns about source MAC address and the VTEP that forwarded it

    2. If the VTEP has an entry for the MAC address, it forwards the traffic to the destination port where the destination exists

    3. If the VTEP doesn't have an entry for the MAC address, it floods the frame to its local ports in that VNI, where hosts that are not the destination discard it.

  4. Reply traffic comes back to the attached VTEP

    1. since destination is now known (as it was previously learned), the traffic is forwarded as unicast to the destination VTEP

  5. Traffic reaches the initial VTEP and the frame is forwarded to the destination MAC, which is known since it was learned when the initial frame arrived.

    1. From now on, traffic will be all unicast since source and destination MACs are known by their attached VTEPs.

The main issue with this approach is that every broadcast or unknown-destination frame is replicated to all VTEPs in the VNI, so flooded traffic is multiplied across the entire fabric.
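The flood-and-learn steps above can be sketched as a small simulation. This is a simplified model, not how any particular switch implements it: it assumes head-end replication, a single VNI, and hypothetical names (`Vtep`, `send`). It only counts how many VXLAN copies enter the underlay for each frame.

```python
class Vtep:
    """Minimal VTEP state for one VNI."""

    def __init__(self, name: str):
        self.name = name
        self.local = {}   # MAC -> local port (learned from attached hosts)
        self.remote = {}  # MAC -> name of the VTEP behind which the MAC lives


def send(fabric: dict, ingress: Vtep, in_port: str,
         src_mac: str, dst_mac: str) -> int:
    """Forward one frame and return the number of VXLAN copies sent."""
    # The ingress VTEP learns the source MAC on the receiving port.
    ingress.local[src_mac] = in_port

    if dst_mac in ingress.local:
        return 0  # local switching, nothing is encapsulated

    if dst_mac in ingress.remote:
        # Known destination: a single unicast copy to the owning VTEP.
        targets = [fabric[ingress.remote[dst_mac]]]
    else:
        # Unknown destination: replicate to every other VTEP in the VNI.
        targets = [v for v in fabric.values() if v is not ingress]

    for vtep in targets:
        # Each receiving VTEP learns the source MAC and the sending VTEP.
        vtep.remote[src_mac] = ingress.name
    return len(targets)


if __name__ == "__main__":
    fabric = {n: Vtep(n) for n in ("leaf1", "leaf2", "leaf3", "leaf4")}
    a, b = "0011.2233.4455", "00aa.bbcc.ddee"
    print(send(fabric, fabric["leaf1"], "eth1", a, b))  # first frame is flooded
    print(send(fabric, fabric["leaf3"], "eth1", b, a))  # reply is unicast
    print(send(fabric, fabric["leaf1"], "eth1", a, b))  # now unicast as well
```

In a fabric with N VTEPs the first frame costs N-1 copies, which is the multiplication effect described above.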

EVPN

Essentially EVPN is a control plane mechanism that allows advertisements of MAC addresses to the VTEPs. Each VTEP advertises even before any data traffic flows:

  • its VTEP IP

  • the VNIs it participates in

  • the MAC/IP Addresses of local endpoints

So when a frame arrives at the ingress VTEP, the VTEP already knows which VTEP is the egress one, and it unicasts the frame encapsulated with the VXLAN header.

The EVPN approach drastically reduces the flooding of frames across the network.

Since EVPN relies on MP-BGP, we inherit some additional functionality, such as the possibility to use Route Reflectors (RR). The spines can be set up as RRs to reduce the number of BGP peerings required. So when a leaf sends EVPN updates, it sends them only to the RRs (the spines). By design all leafs are linked to the spines, so the RRs can advertise the learned information to the other leafs.

What can we advertise from a leaf using EVPN (most common)?

  • Host MAC advertisements (Route type 2): advertises the MAC Address, the VNI and Route Target and the Next Hop (advertising/attached leaf)

  • Host MAC+IP advertisements (Route type 2): advertises the MAC Address and the IP address, the VNI and Route Target and the Next Hop (advertising/attached leaf)

  • Subnet Advertisements (Route type 5): advertises the IP prefix, the VNI and Route Target and the next hop (Advertising /attached leaf)

Again, thanks to MP-BGP, EVPN inherits the possibility to distinguish resources across different overlays (VRFs) using the Route Distinguisher (RD), which makes MAC or IP addresses in one VRF distinguishable from identical MAC or IP addresses in another VRF, and the possibility to import/export routes using Route Targets.
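The roles of the RD and the Route Target can be illustrated with a minimal sketch. The `EvpnRoute` class, the `import_routes` function, and the RD/RT values are hypothetical and only model the idea: the RD keeps identical MAC addresses apart, and the Route Target decides which routes a VRF imports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvpnRoute:
    rd: str                    # Route Distinguisher, makes the route unique
    mac: str
    vni: int
    route_targets: frozenset   # export Route Targets attached by the sender
    next_hop: str              # advertising VTEP


def import_routes(routes, import_rts: frozenset) -> dict:
    """Keep only routes carrying at least one Route Target we import.

    The table is keyed by (RD, MAC), so the same MAC advertised under two
    different RDs results in two separate entries.
    """
    table = {}
    for route in routes:
        if route.route_targets & import_rts:
            table[(route.rd, route.mac)] = route
    return table


if __name__ == "__main__":
    same_mac = "0011.2233.4455"
    routes = [
        EvpnRoute("10.0.0.11:100", same_mac, 10100,
                  frozenset({"65000:10100"}), "10.0.0.11"),
        EvpnRoute("10.0.0.12:200", same_mac, 10200,
                  frozenset({"65000:10200"}), "10.0.0.12"),
    ]
    tenant_a = import_routes(routes, frozenset({"65000:10100"}))
    for key, route in tenant_a.items():
        print(key, "->", route.next_hop)
```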

Since leafs (VTEPs) are connected to all spines but to no other leaf, it makes perfect sense to use the spines as Route Reflectors (RR) thus reducing the need for full mesh iBGP peering between all VTEPs.

VXLAN Packet Flow

VXLAN Packet

A VXLAN packet encapsulates the original L2 frame so that it can be forwarded across the network to the destination.

First, a VXLAN header is prepended to the L2 frame. It contains the VNI (VXLAN Network Identifier), a 24-bit field. This allows the original frame's VLAN (12 bits: 4096 values, 0-4095, of which 1-4094 are usable) to be mapped to a VNI (24 bits: 16777216 values, 0-16777215).

Then the VXLAN header and the frame are encapsulated into a UDP packet whose destination port is always 4789 and whose source port is assigned per flow by the source VTEP.

When sent between VTEPs (via spines) the UDP packet goes into an IP packet that has as source the source VTEP IP Address and as destination the destination VTEP IP Address. This way, the packet can reach the destination VTEP where the original frame is extracted and sent towards the destination host.
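To make the header layout concrete, here is a minimal sketch that builds and parses the 8-byte VXLAN header described in RFC 7348: 8 bits of flags (with the "VNI valid" bit set), 24 reserved bits, the 24-bit VNI, and 8 more reserved bits. The function names are made up for this example, and the outer UDP/IP headers are left out.

```python
import struct

VXLAN_UDP_PORT = 4789
FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid


def build_vxlan_header(vni: int) -> bytes:
    """Return the 8-byte VXLAN header for the given VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits)
    # Word 2: VNI (24 bits) + reserved (8 bits)
    return struct.pack("!II", FLAG_VNI_VALID << 24, vni << 8)


def parse_vni(header: bytes) -> int:
    """Extract the VNI from a VXLAN header, checking the 'VNI valid' flag."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    return vni_word >> 8


def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the VXLAN header; the result is the UDP payload."""
    return build_vxlan_header(vni) + inner_frame


if __name__ == "__main__":
    payload = encapsulate(b"\x00" * 64, vni=10010)
    print(payload[:8].hex(), parse_vni(payload))
```

The payload returned by `encapsulate` is what would be carried in a UDP datagram to port 4789, inside an IP packet addressed from the source VTEP to the destination VTEP.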

Packet Flow within the same VLAN / VNI (L2VNI)

Whenever they learn about a new attached host, the VTEPs will advertise that information using EVPN Route Type 2 to the other VTEPs (in the same VRF) via the spine RRs.

What this accomplishes is that VTEPs start to build a table with known MAC addresses, the VNI they are mapped to and the next hop to reach them.

You can verify this on a Cisco Nexus using show mac-address table vni <VNI>

Here you can see that MAC address 0011.2233.4455 is locally attached, while MAC address 00aa.bbcc.ddee is reachable via VTEP 10.0.0.12

And if you don't know the VNI, you can see all known VNIs on a switch using

To find out how this information was received using EVPN, you can use show bgp l2vpn evpn <MAC-ADDR>

Now you can see 2 routes being received, both of which include the MAC address. One also carries the IP address and the other has 0.0.0.0 in the same place:

These are BGP EVPN NLRIs and they should be read as follows:

  • ROUTE-TYPE: 2 because this type is used to advertise host mac or host mac + IP Addresses

  • ESI: it's 0 if the host is attached to a single VTEP. It can be different than 0 for multi-homed hosts.

  • BRIDGE-DOMAIN: always 0 in EVPN VXLAN

  • MAC-LENGTH: 48 because that's the size of the MAC field

  • MAC-ADDRESS: the actual L2 address

  • IP-LENGTH: for route type 2 it's either 0 (no IP address), 32 (IPv4 address length) or 128 (IPv6 address length)

  • IP-ADDRESS: the actual L3 address

  • /LENGTH: the full length of the NLRI, telling BGP process how much data to read.
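The field list above can be turned into a small parser. This sketch assumes the NLRI is displayed as bracketed fields separated by colons, in the order listed above and followed by /LENGTH. The sample strings in the usage section are made up for illustration and are not taken from real device output.

```python
import re

# One bracketed group per field, in the order described above.
NLRI_RE = re.compile(
    r"^\[(\d+)\]"               # ROUTE-TYPE
    r":\[([0-9a-fA-F.:]+)\]"    # ESI
    r":\[(\d+)\]"               # BRIDGE-DOMAIN
    r":\[(\d+)\]"               # MAC-LENGTH
    r":\[([0-9a-fA-F.:]+)\]"    # MAC-ADDRESS
    r":\[(\d+)\]"               # IP-LENGTH
    r":\[([0-9a-fA-F.:]+)\]"    # IP-ADDRESS
    r"/(\d+)$"                  # /LENGTH
)


def parse_type2_nlri(text: str) -> dict:
    """Split a displayed route type 2 NLRI into named fields."""
    match = NLRI_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognised NLRI: {text!r}")
    rtype, esi, bd, mac_len, mac, ip_len, ip, length = match.groups()
    return {
        "route_type": int(rtype),
        "esi": esi,
        "bridge_domain": int(bd),
        "mac_length": int(mac_len),
        "mac": mac,
        "ip_length": int(ip_len),
        # An IP length of 0 means this is the MAC-only advertisement.
        "ip": ip if int(ip_len) > 0 else None,
        "length": int(length),
    }


if __name__ == "__main__":
    # Hypothetical sample strings, one MAC+IP and one MAC-only.
    for sample in (
        "[2]:[0]:[0]:[48]:[00aa.bbcc.ddee]:[32]:[10.1.1.20]/272",
        "[2]:[0]:[0]:[48]:[00aa.bbcc.ddee]:[0]:[0.0.0.0]/216",
    ):
        print(parse_type2_nlri(sample))
```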

So why are there 2 different NLRIs advertised? One with the IP Address and one without the IP Address. This is by design. For pure L2 forwarding as well as for non-IP or unknown IP traffic the MAC-only NLRI is sufficient.

By also advertising the MAC+IP NLRI, the remote VTEPs can respond to ARP requests on the host's behalf, thus achieving ARP suppression and limiting the amount of BUM traffic that needs to be replicated across the fabric. It is also useful for improving the efficiency of the anycast IP gateways used for routing between VNIs in the same VRF.
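ARP suppression can be sketched as a simple cache lookup. This is a conceptual model with hypothetical names (`learn_mac_ip`, `handle_arp_request`): the cache is filled from received MAC+IP routes, and an ARP request is only flooded when the target IP is not in the cache.

```python
def learn_mac_ip(cache: dict, vni: int, ip: str, mac: str) -> None:
    """Record a binding learned from a received MAC+IP (type 2) route."""
    cache[(vni, ip)] = mac


def handle_arp_request(cache: dict, vni: int, target_ip: str):
    """Decide what the local VTEP does with an ARP request from a host.

    Returns ("reply", mac) when the VTEP can answer locally on behalf of
    the remote host, or ("flood", None) when the request still has to be
    sent into the fabric as BUM traffic.
    """
    mac = cache.get((vni, target_ip))
    if mac is not None:
        return ("reply", mac)
    return ("flood", None)


if __name__ == "__main__":
    cache = {}
    learn_mac_ip(cache, 10010, "10.1.1.20", "00aa.bbcc.ddee")
    print(handle_arp_request(cache, 10010, "10.1.1.20"))
    print(handle_arp_request(cache, 10010, "10.1.1.99"))
```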

Packet Flow between different VLANs / VNIs (L3VNI)

The packet flow for traffic between 2 hosts in different VNIs of the same VRF is almost the same, but some fields in the packet will change.

First, because the host is configured with the Anycast IP Gateway as its default gateway, it will use the gateway's MAC address as the destination for traffic outside its subnet. So when the frame arrives at the VTEP, its destination MAC address is the VTEP's own Anycast IP Gateway MAC address.

Then, because this traffic is not part of an L2VNI, it will be forwarded to the destination VTEP using the L3VNI of the VRF. To look up the destination, you can check the routing table for the VRF

The route would appear as a /32 BGP-learned route, which is the type-2 route learned via BGP EVPN.

If traffic is destined for a host for which there is no type-2 route, it will be routed based on the VLAN subnet (e.g. /24), which is advertised by the remote VTEP using type-5 routes.
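The choice between the /32 host route and the subnet route is ordinary longest-prefix matching. The sketch below uses Python's `ipaddress` module. The prefixes and next hops are made-up example values, and `lookup` is a hypothetical helper, not a model of any specific platform's routing table.

```python
import ipaddress


def lookup(vrf_routes: dict, destination: str):
    """Return (prefix, next_hop) of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    matches = []
    for prefix, next_hop in vrf_routes.items():
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network, next_hop))
    if not matches:
        return None
    # The most specific route (largest prefix length) wins.
    network, next_hop = max(matches, key=lambda m: m[0].prefixlen)
    return str(network), next_hop


if __name__ == "__main__":
    vrf_routes = {
        "10.1.2.20/32": "10.0.0.12",  # host route derived from a type-2 route
        "10.1.2.0/24": "10.0.0.13",   # subnet route from a type-5 route
    }
    print(lookup(vrf_routes, "10.1.2.20"))  # host route is preferred
    print(lookup(vrf_routes, "10.1.2.99"))  # falls back to the subnet route
```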

At this point, the destination MAC address of the original L2 frame is rewritten so that the frame can be delivered to the destination host.

Upon arrival at the destination VTEP, the L2 frame is extracted and its source MAC address is replaced with the MAC address of the destination Anycast Gateway, which ensures that return traffic goes through the same VTEP.

Anycast IP Gateway

This functionality allows leafs to act as distributed default gateways for their directly attached VLANs. All VTEPs provide this service and use the same MAC Address, allowing hosts to always use their local leaf for first-hop routing. This is an improvement over the classic design, where the default gateway would sit on the distribution switches and HSRP or a similar protocol would manage the active gateway.

This functionality is enabled per SVI:
