NVIDIA® Cumulus Linux is the first full-featured Linux operating system for the networking industry. The Debian Jessie-based, networking-focused distribution runs on hardware produced by a broad partner ecosystem, ensuring unmatched customer choice regarding silicon, optics, cables, and systems.
This user guide provides in-depth documentation on the Cumulus Linux installation process, system configuration and management, network solutions, and monitoring and troubleshooting recommendations. In addition, the quick start guide provides an end-to-end setup process to get you started.
What’s New in this Release
For a list of the new features in this release, see What's New. For bug fixes and known issues present in this release, refer to the Cumulus Linux 3.7 Release Notes.
Open Source Contributions
To implement various Cumulus Linux features, Cumulus Networks has forked various software projects, like CFEngine, Netdev and some Puppet Labs packages. The forked code resides in the Cumulus Networks GitHub repository.
Hardware Compatibility List
You can find the most up-to-date hardware compatibility list (HCL)
here. Use the HCL to confirm that
your switch model is supported by Cumulus Linux. The HCL is updated
regularly, listing products by port configuration, manufacturer, and SKU
part number.
Extended Support Release
This version of Cumulus Linux is an Extended Support Release (ESR). Cumulus Linux 3.7 ESR started with Cumulus Linux 3.7.12, and all future releases in the 3.7 product family will be ESR releases. To learn about ESR, read this article.
The PDF of the 3.7.12 ESR user guide is available here.
PDFs of pre-ESR 3.7 versions are available below.
Cumulus Linux 3.7.16 contains bug fixes and security fixes.
What’s New in Cumulus Linux 3.7.15
Cumulus Linux 3.7.15 contains bug fixes and security fixes.
What’s New in Cumulus Linux 3.7.14.2
Cumulus Linux 3.7.14.2 contains bug fixes and security fixes.
What’s New in Cumulus Linux 3.7.14
Cumulus Linux 3.7.14 contains bug fixes and security fixes.
What’s New in Cumulus Linux 3.7.13
Cumulus Linux 3.7.13 contains bug fixes and security fixes.
What’s New in Cumulus Linux 3.7.12
Cumulus Linux 3.7.12 contains bug fixes.
Cumulus Linux 3.7.12 also includes a firmware update for Mellanox switches that addresses an issue with certain Virtium SSDs. The firmware update occurs automatically when you upgrade Cumulus Linux on a Mellanox switch and requires no user action.
What’s New in Cumulus Linux 3.7.11
Cumulus Linux 3.7.11 supports new platforms, provides bug fixes, and contains several new features and improvements.
Support for non-contiguous subnet masks in IPv4 and IPv6 address rule matches (for example, 10.0.0.1/255.0.255.0)
Multiple subnet support for a single VXLAN
switchd: increased reliability and fixed memory leaks identified by GCC address sanitizer checks
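To illustrate what a non-contiguous subnet mask such as 10.0.0.1/255.0.255.0 means, here is a minimal Python sketch. It assumes the usual address/mask semantics, where only the bits set in the mask are compared; the function name matches is hypothetical and is not part of Cumulus Linux.

```python
import ipaddress


def matches(address, rule_address, rule_mask):
    """Return True if address agrees with rule_address on every bit set in rule_mask.

    Assumption: the rule compares (address AND mask) with (rule_address AND mask).
    """
    addr = int(ipaddress.IPv4Address(address))
    rule = int(ipaddress.IPv4Address(rule_address))
    mask = int(ipaddress.IPv4Address(rule_mask))
    return (addr & mask) == (rule & mask)


# With mask 255.0.255.0, only the first and third octets are compared.
print(matches("10.5.0.9", "10.0.0.1", "255.0.255.0"))  # first and third octets agree
print(matches("10.5.1.9", "10.0.0.1", "255.0.255.0"))  # third octet differs
```

In this sketch, the second and fourth octets are ignored entirely, which is what distinguishes a non-contiguous mask from a prefix length.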
What’s New in Cumulus Linux 3.7.8
Cumulus Linux 3.7.8 contains bug fixes and the following new transceivers.
Mellanox 100G-PSM4 (MMS1C10-CM)
Wave Splitter WST-QS28-CM4C-D (100G-CWDM4-OCP) and WST-QS28-CM4-C (100G CWDM4)
What’s New in Cumulus Linux 3.7.7
Cumulus Linux 3.7.7 contains bug fixes only.
What’s New in Cumulus Linux 3.7.6
Cumulus Linux 3.7.6 contains bug fixes and the following new platform and power supply:
Dell N3048EP-ON (1G PoE Helix4) - Depending upon the revision of the switch you have, you might not be able to install Cumulus Linux on it. For more information, read this knowledge base article.
48V DC PSU for the Dell Z9100-ON switch
What’s New in Cumulus Linux 3.7.5
Cumulus Linux 3.7.5 fixes an issue with EVPN centralized routing on Tomahawk and Tomahawk+ switches (CM-24495) and an issue with switchd when IGMP snooping is enabled on a Broadcom switch (CM-24508), and includes additional security fixes.
Lightweight network virtualization (LNV) has been deprecated. The feature will be removed in Cumulus Linux 4.0. Use Ethernet virtual private network (EVPN) for network virtualization.
What’s New in Cumulus Linux 3.7.4
Cumulus Linux 3.7.4 is no longer available due to issues that are resolved in Cumulus Linux 3.7.5.
What’s New in Cumulus Linux 3.7.3
Cumulus Linux 3.7.3 supports new platforms, provides bug fixes, and contains several new features and improvements.
New Platforms
Dell Z9264F-ON (100G Broadcom Tomahawk2)
Edgecore AS7816-64X (100G Broadcom Tomahawk2)
Edgecore AS7726-32X (100G Broadcom Trident3)
Edgecore AS7326-56X (25G Broadcom Trident3)
HPE SN2700M (100G Mellanox Spectrum)
HPE SN2100M (100G Mellanox Spectrum)
HPE SN2410M (25G Mellanox Spectrum)
Lenovo NE0152TO (1G Broadcom Helix4) now generally available
This quick start guide provides an end-to-end setup process for installing and running Cumulus Linux, as well as a collection of example commands for getting started after installation is complete.
Intermediate-level Linux knowledge is assumed for this guide. You should be familiar with basic text editing, Unix file permissions, and process monitoring. A variety of text editors are pre-installed, including vi and nano.
You must have access to a Linux or UNIX shell. If you are running
Windows, use a Linux environment like Cygwin
as your command line tool for interacting with Cumulus Linux.
If you are a networking engineer but are unfamiliar with Linux concepts,
refer to this reference guide
to compare the Cumulus Linux CLI and configuration options, and their
equivalent Cisco Nexus 3000 NX-OS commands and settings. You can also
watch a series of short videos introducing you to
Linux and Cumulus Linux-specific concepts.
Install Cumulus Linux
To install Cumulus Linux, you use
ONIE (Open Network Install
Environment), an extension to the traditional U-Boot software that
allows for automatic discovery of a network installer image. This
facilitates the ecosystem model of procuring switches with an operating
system choice, such as Cumulus Linux.
If Cumulus Linux is already installed on your switch and you need to
upgrade the software only, skip to Upgrading Cumulus Linux.
The easiest way to install Cumulus Linux with ONIE is with local HTTP discovery:
If your host (laptop or server) is IPv6-enabled, make sure it is
running a web server. If the host is IPv4-enabled, make sure it is
running DHCP in addition to a web server.
Download the Cumulus Linux
installation file to the root directory of the web server. Rename
this file onie-installer.
Connect your host using an Ethernet cable to the management Ethernet
port of the switch.
Power on the switch. The switch downloads the ONIE image installer
and boots. You can watch the progress of the install in your
terminal. After the installation completes, the Cumulus Linux login
prompt appears in the terminal window.
These steps describe a flexible unattended installation method. You do not need a console cable. A fresh install with ONIE using a local web server typically completes in less than ten minutes.
You have more options for installing Cumulus Linux with ONIE. Read
Installing a New Cumulus Linux Image
to install Cumulus Linux using ONIE in the following ways:
DHCP/web server with and without DHCP options
Web server without DHCP
FTP or TFTP without a web server
Local file
USB
ONIE supports many other discovery mechanisms using USB (copy the
installer to the root of the drive), DHCPv6 and DHCPv4, and image copy
methods including HTTP, FTP, and TFTP. For more information on these
discovery methods, refer to the ONIE documentation.
After installing Cumulus Linux, you are ready to:
Log in to Cumulus Linux on the switch.
Install the Cumulus Linux license.
Configure Cumulus Linux. This quick start guide provides instructions on configuring switch ports and a loopback interface.
Getting Started
When starting Cumulus Linux for the first time, the management port makes a DHCPv4 request. To determine the IP address of the switch, you can cross-reference the MAC address of the switch with your DHCP server. The MAC address is typically located on the side of the switch or on the box in which the unit ships.
Login Credentials
The default installation includes one system account, root, with full system privileges, and one user account, cumulus, with sudo privileges. The root account password is set to null by default (which prohibits login). In Cumulus Linux 3.7.11 and earlier, the cumulus account is configured with this default password:
CumulusLinux!
For optimum security, change the default password (using the passwd command) before you configure Cumulus Linux on the switch.
In Cumulus Linux 3.7.12 and later, the cumulus account is configured with this default password:
cumulus
The first time you log in to Cumulus Linux 3.7.12 or later, you are required to change this default password. When prompted, enter a new password, then confirm the new password.
In this quick start guide, you use the cumulus account to configure Cumulus Linux.
All accounts except root are permitted remote SSH login; you can use sudo to grant a non-root account root-level access. Commands that change the system configuration require this elevated level of access.
You are encouraged to perform management and configuration over the network, either in band or out of band. Using a serial console is fully supported; however, many customers prefer the convenience of network-based management.
Typically, switches ship from the manufacturer with a mating DB9 serial cable. Switches with ONIE are always set to a 115200 baud rate.
Wired Ethernet Management
Switches supported in Cumulus Linux always contain at least one dedicated Ethernet management port, which is named eth0. This interface is geared specifically for out-of-band management use. The management interface uses DHCPv4 for addressing by default. You can set a static IP address with the Network Command Line Utility (NCLU).
To set a static IP address, run the interface address and interface gateway NCLU commands. For example:
cumulus@switch:~$ net add interface eth0 ip address 192.0.2.42/24
cumulus@switch:~$ net add interface eth0 ip gateway 192.0.2.1
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following snippet in the /etc/network/interfaces file:
auto eth0
iface eth0
address 192.0.2.42/24
gateway 192.0.2.1
Configure the Hostname and Timezone
To change the hostname, run net add hostname, which modifies both the /etc/hostname and /etc/hosts files with the desired hostname.
cumulus@switch:~$ net add hostname <hostname>
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Do not use an underscore (_) in the hostname; underscores are not permitted.
Avoid using apostrophes or non-ASCII characters in the hostname. Cumulus Linux does not parse these characters.
The command prompt in the terminal does not reflect the new hostname until you either log out of the switch or start a new shell.
When you use the NCLU command to set the hostname, DHCP does not override the hostname when you reboot the switch. However, if you disable the hostname setting with NCLU, DHCP does override the hostname the next time you reboot the switch.
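The hostname restrictions above (no underscores, no apostrophes, no non-ASCII characters) can be checked before you run net add hostname. The following Python sketch is illustrative only; the function hostname_problems is hypothetical and covers just the three restrictions stated in this guide, not every rule a hostname must satisfy.

```python
def hostname_problems(hostname):
    """Return a list of reasons why the hostname breaks the restrictions in this guide."""
    problems = []
    if "_" in hostname:
        problems.append("contains an underscore")
    if "'" in hostname:
        problems.append("contains an apostrophe")
    if not hostname.isascii():
        problems.append("contains non-ASCII characters")
    return problems


# Hypothetical candidate names, used only for illustration.
for candidate in ("leaf01", "leaf_01", "bob's-switch"):
    print(candidate, hostname_problems(candidate))
```

An empty list means the candidate passes these three checks.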
To update the timezone, use interactive mode:
Run the following command in a terminal:
sudo dpkg-reconfigure tzdata
Follow the on-screen menu options to select the geographic area and region.
Programs that are already running (including log files) and users who are currently logged in do not see timezone changes made with interactive mode. To set the timezone for all services and daemons, reboot the switch.
Verify the System Time
Before you install the license, verify that the date and time on the
switch are correct, and correct them if they are not. The wrong date and
time can cause problems on the switch, such as the inability to
synchronize with Puppet, or errors like this one after you restart
switchd:
Warning: Unit file of switchd.service changed on disk, systemctl daemon-reload recommended.
Install the License
Cumulus Linux is licensed on a per-instance basis. Each network system is fully operational, enabling any capability to be utilized on the switch with the exception of forwarding on switch panel ports. Only eth0 and console ports are activated on an unlicensed instance of Cumulus Linux. Enabling front panel ports requires a license.
NVIDIA provides a generic license for Cumulus Linux. Download the license from the NVIDIA Enterprise support portal and apply it.
There are three ways to install the license onto the switch:
Copy the license from a local server. Create a text file with the license and copy it to a server accessible from the switch. On the switch, use the following command to transfer the file directly on the switch, then install the license file:
It is not necessary to reboot the switch to activate the switch ports.
After you install the license, restart the switchd service. All front
panel ports become active and show up as swp1, swp2, and so on.
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
If a license is not installed on a Cumulus Linux switch, the switchd service does not start. After you install the license, start switchd as described above.
Configure Breakout Ports with Splitter Cables
If you are using 4x10G DAC or AOC cables, or want to break out 100G or
40G switch ports, configure the breakout ports. For more details, see
Breakout Ports.
Test Cable Connectivity
By default, all data plane ports (every Ethernet port except the management interface, eth0) are disabled.
To test cable connectivity, administratively enable a port:
cumulus@switch:~$ net add interface swp1
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
To administratively enable all physical ports, run the following command, where swp1-52 represents a switch with switch ports numbered from swp1 to swp52:
cumulus@switch:~$ net add interface swp1-52
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
To view link status, use the net show interface all command. The following examples show the output of ports in admin down, down, and up modes:
cumulus@switch:~$ net show interface all
State  Name           Spd  MTU    Mode           LLDP                    Summary
-----  -------------  ---  -----  -------------  ----------------------  -------------------------
UP     lo             N/A  65536  Loopback                               IP: 127.0.0.1/8
       lo                                                                IP: 10.0.0.11/32
       lo                                                                IP: 10.0.0.112/32
       lo                                                                IP: ::1/128
UP     eth0           1G   1500   Mgmt           oob-mgmt-switch (swp6)  Master: mgmt(UP)
       eth0                                                              IP: 192.168.0.11/24(DHCP)
UP     swp1           1G   9000   BondMember     server01 (eth1)         Master: bond01(UP)
UP     swp2           1G   9000   BondMember     server02 (eth1)         Master: bond02(UP)
ADMDN  swp45          N/A  1500   NotConfigured
ADMDN  swp46          N/A  1500   NotConfigured
ADMDN  swp47          N/A  1500   NotConfigured
ADMDN  swp48          N/A  1500   NotConfigured
UP     swp49          1G   9000   BondMember     leaf02 (swp49)          Master: peerlink(UP)
UP     swp50          1G   9000   BondMember     leaf02 (swp50)          Master: peerlink(UP)
UP     swp51          1G   9216   NotConfigured  spine01 (swp1)
UP     swp52          1G   9216   NotConfigured  spine02 (swp1)
UP     bond01         1G   9000   802.3ad                                Master: bridge(UP)
       bond01                                                            Bond Members: swp1(UP)
UP     bond02         1G   9000   802.3ad                                Master: bridge(UP)
       bond02                                                            Bond Members: swp2(UP)
UP     bridge         N/A  1500   Bridge/L2
UP     mgmt           N/A  65536  Interface/L3                           IP: 127.0.0.1/8
UP     peerlink       2G   9000   802.3ad                                Master: bridge(UP)
       peerlink                                                          Bond Members: swp49(UP)
       peerlink                                                          Bond Members: swp50(UP)
DN     peerlink.4094  2G   9000   SubInt/L3                              IP: 169.254.1.1/30
ADMDN  vagrant        N/A  1500   NotConfigured
UP     vlan13         N/A  1500   Interface/L3                           Master: vrf1(UP)
       vlan13                                                            IP: 10.1.3.11/24
UP     vlan13-v0      N/A  1500   Interface/L3                           Master: vrf1(UP)
       vlan13-v0                                                         IP: 10.1.3.1/24
UP     vlan24         N/A  1500   Interface/L3                           Master: vrf1(UP)
       vlan24                                                            IP: 10.2.4.11/24
UP     vlan24-v0      N/A  1500   Interface/L3                           Master: vrf1(UP)
       vlan24-v0                                                         IP: 10.2.4.1/24
UP     vlan4001       N/A  1500   NotConfigured                          Master: vrf1(UP)
UP     vni13          N/A  9000   Access/L2                              Master: bridge(UP)
UP     vni24          N/A  9000   Access/L2                              Master: bridge(UP)
UP     vrf1           N/A  65536  NotConfigured
UP     vxlan4001      N/A  1500   Access/L2                              Master: bridge(UP)
Configure Switch Ports
Layer 2 Port Configuration
Cumulus Linux does not put all ports into a bridge by default. To create a bridge and configure one or more front panel ports as members of the bridge, use the following examples as a guide.
Examples
In the following configuration example, the front panel port swp1 is placed into a bridge called bridge. The NCLU commands are:
cumulus@switch:~$ net add bridge bridge ports swp1
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
The commands above create the following /etc/network/interfaces snippet:
auto bridge
iface bridge
bridge-ports swp1
bridge-vlan-aware yes
You can add a range of ports in one command. For example, add swp1 through swp10, swp12, and swp14 through swp20 to bridge:
cumulus@switch:~$ net add bridge bridge ports swp1-10,12,14-20
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
The commands above create the following snippet in the /etc/network/interfaces file:
To view the changes in the kernel, use the brctl command:
cumulus@switch:~$ brctl show
bridge name     bridge id               STP enabled     interfaces
bridge          8000.443839000004       yes             swp1
                                                        swp2
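The port range syntax used earlier in this section (swp1-10,12,14-20) can be read as a comma-separated list of single ports and inclusive ranges. The following Python sketch expands such a specification to show which ports it covers. It is an illustration only, not NCLU's actual parser; the function expand_ports and its handling of the swp prefix are assumptions.

```python
def expand_ports(spec, prefix="swp"):
    """Expand a range specification such as 'swp1-10,12,14-20' into port names."""
    ports = []
    for part in spec.split(","):
        part = part.strip()
        # Only the first item carries the prefix in the example; strip it if present.
        if part.startswith(prefix):
            part = part[len(prefix):]
        if "-" in part:
            start, end = part.split("-")
            # Ranges are inclusive at both ends.
            ports.extend(f"{prefix}{n}" for n in range(int(start), int(end) + 1))
        else:
            ports.append(f"{prefix}{int(part)}")
    return ports


ports = expand_ports("swp1-10,12,14-20")
print(len(ports), ports)
```

Under these assumptions, the specification covers 18 ports and skips swp11 and swp13.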
Layer 3 Port Configuration
You can also use NCLU to configure a front panel port or bridge interface as a layer 3 port.
In the following configuration example, the front panel port swp1 is configured as a layer 3 access port:
cumulus@switch:~$ net add interface swp1 ip address 10.1.1.1/30
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
The commands above create the following snippet in the /etc/network/interfaces file:
auto swp1
iface swp1
address 10.1.1.1/30
To add an IP address to a bridge interface, you must put it into a VLAN interface. If you want to use a VLAN other than the native one, set the bridge PVID:
cumulus@switch:~$ net add vlan 100 ip address 10.2.2.1/24
cumulus@switch:~$ net add bridge bridge pvid 100
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
The commands above create the following snippet in the /etc/network/interfaces file:
auto bridge
iface bridge
bridge-ports swp1
bridge-pvid 100
bridge-vlan-aware yes
auto vlan100
iface vlan100
address 10.2.2.1/24
vlan-id 100
vlan-raw-device bridge
To view the changes in the kernel, use the ip addr show command:
cumulus@switch:~$ ip addr show
...
4: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bridge state UP group default qlen 1000
link/ether 44:38:39:00:6e:fe brd ff:ff:ff:ff:ff:ff
...
14: bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 44:38:39:00:00:04 brd ff:ff:ff:ff:ff:ff
inet6 fe80::4638:39ff:fe00:4/64 scope link
valid_lft forever preferred_lft forever
...
Configure a Loopback Interface
Cumulus Linux has a loopback preconfigured in the /etc/network/interfaces file. When the switch boots up, it has a loopback interface called lo, which is up and assigned an IP address of 127.0.0.1.
The loopback interface lo must always be specified in the /etc/network/interfaces file and must always be up.
To see the status of the loopback interface (lo), use the net show interface lo command:
cumulus@switch:~$ net show interface lo
    Name    MAC                Speed    MTU    Mode
--  ------  -----------------  -------  -----  --------
UP  lo      00:00:00:00:00:00  N/A      65536  Loopback
Alias
-----
loopback interface
IP Details
-------------------------  --------------------
IP:                        127.0.0.1/8, ::1/128
IP Neighbor(ARP) Entries:  0
Note that the loopback is up and is assigned an IP address of 127.0.0.1.
To add an IP address to a loopback interface, configure the lo interface with NCLU:
cumulus@switch:~$ net add loopback lo ip address 10.1.1.1/32
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
You can configure multiple loopback addresses by adding additional address lines:
cumulus@switch:~$ net add loopback lo ip address 172.16.2.1/24
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
The commands above create the following snippet in the /etc/network/interfaces file:
auto lo
iface lo inet loopback
address 10.1.1.1/32
address 172.16.2.1/24
Installation Management
You can only install one image of the operating system on a Cumulus Linux switch. This section discusses how to install new and update existing Cumulus Linux disk images, and configure those images with additional applications (using packages).
System Configuration
This section provides information to help you set up your system for authentication, configure packet filtering, set the time and date, and perform other related system tasks.
Layer 1 and Switch Ports
This section describes the physical layer configuration and how to configure switch ports.
VXLAN (Virtual Extensible LAN) is a standard overlay protocol that abstracts logical virtual networks from the physical network underneath. You can deploy simple and scalable layer 3 Clos architectures while extending layer 2 segments over that layer 3 network.
VXLAN uses a VLAN-like encapsulation technique to encapsulate MAC-based layer 2 Ethernet frames within layer 3 UDP packets. Each virtual network is a VXLAN logical layer 2 segment. VXLAN scales to 16 million segments (a 24-bit VXLAN network identifier (VNI ID) in the VXLAN header) for multi-tenancy.
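The 16 million figure follows directly from the 24-bit VNI field. As a quick worked check in Python (the comparison with the 12-bit VLAN ID of IEEE 802.1Q is added here for context and is not taken from this guide):

```python
VNI_BITS = 24       # width of the VXLAN network identifier, as stated above
VLAN_ID_BITS = 12   # width of the 802.1Q VLAN ID, added for comparison

vni_count = 2 ** VNI_BITS
vlan_count = 2 ** VLAN_ID_BITS

print(vni_count)                # number of distinct 24-bit values
print(vni_count // vlan_count)  # how many times larger than the VLAN ID space
```

The first value is 16,777,216, which is the "16 million segments" quoted above.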
Hosts on a given virtual network are joined together through an overlay protocol that initiates and terminates tunnels at the edge of the multi-tenant network, typically the hypervisor vSwitch or top of rack. These edge points are the VXLAN tunnel end points (VTEP).
Cumulus Linux can initiate and terminate VTEPs in hardware and supports wire-rate VXLAN. VXLAN provides an efficient hashing scheme across the IP fabric during the encapsulation process; the source UDP port is unique, with the hash based on layer 2 through layer 4 information from the original frame. The UDP destination port is the standard port 4789.
VXLAN is supported only on switches in the Cumulus Linux HCL using the Broadcom Tomahawk, Trident II, Trident II+ and Trident3 chipsets, as well as the Mellanox Spectrum chipset.
VXLAN encapsulation over layer 3 subinterfaces (for example, swp3.111) or SVIs is not supported as traffic transiting through the switch may get dropped; even if the subinterface is used only for underlay traffic and does not perform VXLAN encapsulation, traffic might still get dropped. Only configure VXLAN uplinks as layer 3 interfaces without any subinterfaces (for example, swp3).
The VXLAN tunnel endpoints cannot share a common subnet; there must be at least one layer 3 hop between the VXLAN source and destination.
Caveats and Errata
Cut-through Mode and Store and Forward Switching
On switches using Broadcom Tomahawk, Trident II, Trident II+, and Trident3 ASICs, Cumulus Linux supports store and forward switching for VXLANs but does not support cut-through mode.
On switches using Mellanox Spectrum ASICs, Cumulus Linux supports cut-through mode for VXLANs but does not support store and forward switching.
MTU Size for Virtual Network Interfaces
Ensure that the maximum transmission unit (MTU) size for a virtual network interface is 50 bytes smaller than the MTU for the physical interfaces on the switch. For more information on setting the MTU, read Switch Port Attributes.
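As a worked example of the 50-byte rule, the following Python sketch computes the virtual network interface MTU for a few physical MTU values. The function name vni_mtu and the sample values are illustrative; only the 50-byte difference comes from this guide.

```python
VXLAN_OVERHEAD_BYTES = 50  # difference required by this guide


def vni_mtu(physical_mtu):
    """Return the MTU to use on a virtual network interface for a given physical MTU."""
    return physical_mtu - VXLAN_OVERHEAD_BYTES


for mtu in (1500, 9000, 9216):
    print(mtu, "->", vni_mtu(mtu))
```

For example, a physical interface MTU of 9216 gives a virtual network interface MTU of 9166.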
Layer 3 and Layer 2 VNIs Cannot Share the Same ID
A layer 3 VNI and a layer 2 VNI cannot have the same ID. If the VNI IDs are the same, the layer 2 VNI does not get created.
TC Filters
NVIDIA recommends you run TC filter commands on each VLAN interface on the VTEP to install rules to protect the UDP port that Cumulus Linux uses for VXLAN encapsulation against VXLAN hopping vulnerabilities. If you have VRR configured on the VLAN, add a similar rule for the VRR device.
The following example installs an IPv4 and an IPv6 filter on vlan10 to protect the default port 4789:
cumulus@switch:mgmt:~$ tc filter add dev vlan10 prio 1 protocol ip ingress flower ip_proto udp dst_port 4789 action drop
cumulus@switch:mgmt:~$ tc filter add dev vlan10 prio 2 protocol ipv6 ingress flower ip_proto udp dst_port 4789 action drop
The following example installs an IPv4 and an IPv6 filter on VRR device vlan10-v0 to protect port 4789:
cumulus@switch:mgmt:~$ tc filter add dev vlan10-v0 prio 1 protocol ip ingress flower ip_proto udp dst_port 4789 action drop
cumulus@switch:mgmt:~$ tc filter add dev vlan10-v0 prio 2 protocol ipv6 ingress flower ip_proto udp dst_port 4789 action drop
This section describes layer 3 configuration. Read this section to understand routing protocols and learn how to configure routing on the Cumulus Linux switch.
Monitoring and Troubleshooting
This chapter introduces monitoring and troubleshooting Cumulus Linux.
Serial Console
The serial console can be a useful tool for debugging issues, especially
when you find yourself rebooting the switch often or if you do not have a
reliable network connection.
The default serial console baud rate is 115200, which is the baud rate
ONIE uses.
Configure the Serial Console on ARM Switches
On ARM switches, the U-Boot environment variable baudrate identifies
the baud rate of the serial console. To change the baudrate variable,
use the fw_setenv command:
cumulus@switch:~$ sudo fw_setenv baudrate 9600
Updating environment variable: `baudrate'
Proceed with update [N/y]? y
You must reboot the switch for the baudrate change to take effect.
The valid values for baudrate are:
300
600
1200
2400
4800
9600
19200
38400
115200
Configure the Serial Console on x86 Switches
On x86 switches, you configure serial console baud rate by editing grub.
Incorrect configuration settings in grub can cause the switch to be
inaccessible via the console. Grub changes should be carefully reviewed
before implementation.
The valid values for the baud rate are:
300
600
1200
2400
4800
9600
19200
38400
115200
To change the serial console baud rate:
Edit /etc/default/grub. The two relevant lines in
/etc/default/grub are as follows; replace the 115200 value with
a valid value specified above in the --speed variable in the first
line and in the console variable in the second line:
After you save your changes to the grub configuration, type the
following at the command prompt:
cumulus@switch:~$ sudo update-grub
If you plan on accessing the switch BIOS over the serial console, you need to update the baud rate in the switch BIOS. For more information, see this knowledge base article.
Reboot the switch.
Change the Console Log Level
By default, the console prints all log messages except debug messages. To tune console logging to be less verbose so that certain levels of messages are not printed, run the dmesg -n <level> command, where the log levels are:
Level  Description
0      Emergency messages (the system is about to crash or is unstable).
1      Serious conditions; you must take action immediately.
2      Critical conditions (serious hardware or software failures).
3      Error conditions (often used by drivers to indicate difficulties with the hardware).
4      Warning messages (nothing serious but might indicate problems).
5      Message notifications for many conditions, including security events.
6      Informational messages.
7      Debug messages.
Only messages with a value lower than the level specified are printed to the console. For example, if you specify level 3, only level 2 (critical conditions), level 1 (serious conditions), and level 0 (emergency messages) are printed to the console:
cumulus@switch:~$ sudo dmesg -n 3
Alternatively, you can run the dmesg --console-level <level> command, where the log levels are emerg, alert, crit, err, warn, notice, info, or debug. For example, to print critical conditions, run the following command:
cumulus@switch:~$ sudo dmesg --console-level crit
The console log level set with the dmesg command stays in effect only until the next reboot.
For more details about the dmesg command, run man dmesg.
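The numeric rule described above (only messages with a value lower than the specified level reach the console) can be sketched in a few lines of Python. This is an illustration of the rule as stated in this guide, not code from dmesg; the function printed_levels is hypothetical, and the level names are the ones listed above in numeric order from 0 to 7.

```python
# Level names in numeric order, 0 (emerg) through 7 (debug).
LEVELS = ["emerg", "alert", "crit", "err", "warn", "notice", "info", "debug"]


def printed_levels(console_level):
    """Return the names of the levels printed to the console for dmesg -n <console_level>."""
    return [name for number, name in enumerate(LEVELS) if number < console_level]


print(printed_levels(3))  # levels 0, 1, and 2
```

With a console level of 3, the sketch returns emerg, alert, and crit, matching the example in the text.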
Show General System Information
Two commands are helpful for getting general information about the
switch and the version of Cumulus Linux you are running. These are
helpful with system diagnostics and if you need to submit a support
request.
For information about the version of Cumulus Linux running on the
switch, run net show version, which displays the contents of
/etc/lsb-release:
cumulus@switch:~$ net show version
NCLU_VERSION=1.0
DISTRIB_ID="Cumulus Linux"
DISTRIB_RELEASE=3.4.0
DISTRIB_DESCRIPTION="Cumulus Linux 3.4.0"
For general information about the switch, run net show system, which
gathers information about the switch from a number of files in the
system:
cumulus@switch:~$ net show system
Hostname......... celRED
Build............ Cumulus Linux 3.7.4~1551312781.35d3264
Uptime........... 8 days, 12:24:01.770000
Model............ Cel REDSTONE
CPU.............. x86_64 Intel Atom C2538 2.4 GHz
Memory........... 4GB
Disk............. 14.9GB
ASIC............. Broadcom Trident2 BCM56854
Ports............ 48 x 10G-SFP+ & 6 x 40G-QSFP+
Base MAC Address. a0:00:00:00:00:50
Serial Number.... A1010B2A011212AB000001
Diagnostics Using cl-support
You can use cl-support to generate a single export file that contains
various details and the configuration from a switch. This is useful for
remote debugging and troubleshooting. For more information about
cl-support, read Understanding the cl-support Output File.
You should run cl-support before you submit a support request as this file helps in the investigation of issues.
cumulus@switch:~$ sudo cl-support -h
Usage: cl-support [-h] [-s] [-t] [-v] [reason]...
Args:
[reason]: Optional reason to give for invoking cl-support.
Saved into tarball's cmdline.args file.
Options:
-h: Print this usage statement
-s: Security sensitive collection
-t: User filename tag
-v: Verbose
-e MODULES: Enable modules. Comma separated module list (run with -e help for module names)
-d MODULES: Disable modules. Comma separated module list (run with -d help for module names)
Send Log Files to a syslog Server
The remote syslog server can be configured on the switch using the
following configuration:
cumulus@switch:~$ net add syslog host ipv4 192.168.0.254 port udp 514
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
This creates a file called /etc/rsyslog.d/11-remotesyslog.conf in the
rsyslog directory. The file has the following content:
cumulus@switch:~$ cat /etc/rsyslog.d/11-remotesyslog.conf
# This file was automatically generated by NCLU.
*.* @192.168.0.254:514 # UDP
Logging on Cumulus Linux is done with rsyslog.
rsyslog provides both local logging to the syslog file as well as the ability
to export logs to an external syslog server. High precision timestamps are enabled
for all rsyslog log files; here’s an example:
2015-08-14T18:21:43.337804+00:00 cumulus switchd[3629]: switchd.c:1409 switchd version 1.0-cl2.5+5
There are applications in Cumulus Linux that could write directly to a
log file without going through rsyslog. These files are typically
located in /var/log/.
All Cumulus Linux rules are stored in separate files in
/etc/rsyslog.d/, which are called at the end of the GLOBAL DIRECTIVES section of /etc/rsyslog.conf. As a result, the RULES
section at the end of rsyslog.conf is ignored because the messages
have to be processed by the rules in /etc/rsyslog.d and then dropped
by the last line in /etc/rsyslog.d/99-syslog.conf.
Local Logging
Most logs within Cumulus Linux are sent through rsyslog, which then
writes them to files in the /var/log directory. There are default
rules in the /etc/rsyslog.d/ directory that define where the logs are
written:
Rule               Purpose
10-rules.conf      Sets defaults for log messages, including log format and log rate limits.
15-crit.conf       Logs crit, alert or emerg log messages to /var/log/crit.log to ensure they are not rotated away rapidly.
20-clagd.conf      Logs clagd messages to /var/log/clagd.log for MLAG.
22-linkstate.conf  Logs link state changes for all physical and logical network links to /var/log/linkstate.
45-frr.conf        Logs routing protocol messages to /var/log/frr/frr.log. This includes BGP and OSPF log messages.
99-syslog.conf     Sends log messages from all remaining processes that use rsyslog to /var/log/syslog.
Log files that are rotated are compressed into an archive. Processes
that do not use rsyslog write to their own log files within the
/var/log directory. For more information on specific log files, see
Troubleshooting Log Files.
Enable Remote syslog
By default, not all log messages are sent to a remote server.
If you need to send other log files, such as switchd logs, to a
syslog server, do the following:
Create a file in /etc/rsyslog.d/. Make sure its name starts with a
number lower than 99 so that it is processed before 99-syslog.conf
drops the remaining log messages; for example, 20-clagd.conf or
25-switchd.conf. The example file here is called
/etc/rsyslog.d/11-remotesyslog.conf. Add content similar to the following:
## Logging switchd messages to remote syslog server
*.* @192.168.1.2:514
This configuration sends log messages to a remote syslog server
for the following processes: clagd, switchd, ptmd, rdnbrd,
netd and syslog. It follows the same syntax as the
/etc/rsyslog.conf file, where @ indicates UDP, 192.168.1.2 is
the IP address of the syslog server, and 514 is the UDP port.
For TCP-based syslog, use two at signs (@@) before the IP address: @@192.168.1.2:514.
The numbering of the files in /etc/rsyslog.d/ dictates the order in which rsyslog processes the rules. Lower numbered rules are processed first, and rsyslog processing terminates with the stop keyword. For example, the rsyslog configuration for FRR is stored in the 45-frr.conf file with an explicit stop at the bottom of the file. FRR messages are logged to the /var/log/frr/frr.log file on the local disk only (these messages are not sent to a remote server using the default configuration).
To log FRR messages remotely in addition to writing them to the local disk, rename the 99-syslog.conf file to 11-remotesyslog.conf. FRR messages are then processed first by the 11-remotesyslog.conf rule (transmit to remote server), then by the 45-frr.conf file (write to local disk in the /var/log/frr/frr.log file).
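You can check the ordering rule without touching the live configuration. The following is a minimal sketch: it recreates a few of the file names mentioned above in a scratch directory (a stand-in for /etc/rsyslog.d/, assumed purely for illustration) and lists them in sorted order, lowest number first.

```shell
# Scratch directory standing in for /etc/rsyslog.d/ (hypothetical, for illustration only)
rules_dir="$(mktemp -d)"

# Create empty placeholder files named after the rules discussed above
for name in 99-syslog.conf 45-frr.conf 11-remotesyslog.conf 20-clagd.conf; do
    : > "$rules_dir/$name"
done

# List the files in byte-wise sorted order; lower-numbered files come first
order="$(cd "$rules_dir" && LC_ALL=C ls -1 | tr '\n' ' ')"
echo "$order"

# Clean up the scratch directory
rm -rf "$rules_dir"
```

In this listing 11-remotesyslog.conf sorts ahead of 45-frr.conf, which is why a remote forwarding rule with that name sees FRR messages before the stop in 45-frr.conf ends processing.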
Do not use the imfile module with any file written by rsyslogd.
You can write to syslog with
management VRF enabled by applying the
following configuration; this configuration is commented out in the
/etc/rsyslog.d/11-remotesyslog.conf file:
cumulus@switch:~$ cat /etc/rsyslog.d/11-remotesyslog.conf
## Copy all messages to the remote syslog server at 192.168.0.254 port 514
action(type="omfwd" Target="192.168.0.254" Device="mgmt" Port="514" Protocol="udp")
For each syslog server, configure a unique action line. For example,
to configure two syslog servers at 192.168.0.254 and 10.0.0.1:
cumulus@switch:~$ cat /etc/rsyslog.d/11-remotesyslog.conf
## Copy all messages to the remote syslog servers at 192.168.0.254 and 10.0.0.1 port 514
action(type="omfwd" Target="192.168.0.254" Device="mgmt" Port="514" Protocol="udp")
action(type="omfwd" Target="10.0.0.1" Device="mgmt" Port="514" Protocol="udp")
If you configure remote logging to use the TCP protocol, local logging might stop when the remote syslog server is unreachable. To avoid this behavior, configure a disk queue size and maximum retry count in your rsyslog configuration:
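The following is a minimal sketch of such an action. It writes a candidate rule to a scratch file so that you can review it before copying it into /etc/rsyslog.d/. The parameters action.resumeRetryCount, queue.type, queue.size, queue.filename, and queue.maxDiskSpace are standard rsyslog action parameters, but every value, the queue file name remote_fwd, and the reuse of the 192.168.0.254 server address are assumptions chosen for illustration.

```shell
# Write a candidate rule to a scratch file (not the live /etc/rsyslog.d/ directory)
tcp_conf="$(mktemp)"
cat > "$tcp_conf" <<'EOF'
## Forward all messages over TCP with a bounded, disk-assisted queue (illustrative values)
action(type="omfwd" Target="192.168.0.254" Device="mgmt" Port="514" Protocol="tcp"
       action.resumeRetryCount="100"
       queue.type="LinkedList" queue.size="10000"
       queue.filename="remote_fwd" queue.maxDiskSpace="100m")
EOF

# Show the result for review
cat "$tcp_conf"
```

After you adapt the values and copy the rule into a file such as /etc/rsyslog.d/11-remotesyslog.conf, run sudo rsyslogd -N1 to check the syntax before restarting the rsyslog service.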
If you want to limit the number of syslog messages that can be written
to the syslog file from individual processes, add the following
configuration to /etc/rsyslog.conf. Adjust the interval and burst
values to rate-limit messages to the appropriate levels required by your
environment. For more information, read the
rsyslog documentation.
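As a sketch of what that configuration can look like, the rsyslog imuxsock input module accepts SysSock.RateLimit.Interval (in seconds) and SysSock.RateLimit.Burst (messages allowed per interval). The values below are assumptions for illustration only. The snippet writes the fragment to a scratch file for review instead of editing /etc/rsyslog.conf directly.

```shell
# Write a candidate rate-limit fragment to a scratch file for review
ratelimit_conf="$(mktemp)"
cat > "$ratelimit_conf" <<'EOF'
## Allow each process at most 50 messages per 2-second interval (illustrative values)
module(load="imuxsock"
       SysSock.RateLimit.Interval="2" SysSock.RateLimit.Burst="50")
EOF

# Show the result for review
cat "$ratelimit_conf"
```

If /etc/rsyslog.conf already loads imuxsock, add the two parameters to the existing load statement instead of loading the module a second time, then validate the file with sudo rsyslogd -N1.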
Harmless syslog Error: Failed to reset devices.list
The following message gets logged to /var/log/syslog when you run
systemctl daemon-reload and during system boot:
systemd[1]: Failed to reset devices.list on /system.slice: Invalid argument
This message is harmless, and can be ignored. It is logged when
systemd attempts to change cgroup attributes that are read only. The
upstream version of systemd has been modified to not log this message by
default.
The systemctl daemon-reload command is often issued when Debian
packages are installed, so the message may be seen multiple times when
upgrading packages.
Syslog Troubleshooting Tips
You can use the following commands to troubleshoot syslog issues.
Verify that rsyslog is Running
To verify that the rsyslog service is running, use the sudo systemctl status rsyslog.service command:
cumulus@leaf01:mgmt-vrf:~$ sudo systemctl status rsyslog.service
rsyslog.service - System Logging Service
Loaded: loaded (/lib/systemd/system/rsyslog.service; enabled)
Active: active (running) since Sat 2017-12-09 00:48:58 UTC; 7min ago
Docs: man:rsyslogd(8)
http://www.rsyslog.com/doc/
Main PID: 11751 (rsyslogd)
CGroup: /system.slice/rsyslog.service
└─11751 /usr/sbin/rsyslogd -n
Dec 09 00:48:58 leaf01 systemd[1]: Started System Logging Service.
Verify your rsyslog Configuration
After making manual changes to any files in the /etc/rsyslog.d
directory, use the sudo rsyslogd -N1 command to identify any errors in
the configuration files that might prevent the rsyslog service from
starting.
In the following example, a closing parenthesis is missing in the
11-remotesyslog.conf file, which is used to configure syslog for
management VRF:
cumulus@leaf01:mgmt-vrf:~$ cat /etc/rsyslog.d/11-remotesyslog.conf
action(type="omfwd" Target="192.168.0.254" Device="mgmt" Port="514" Protocol="udp"
cumulus@leaf01:mgmt-vrf:~$ sudo rsyslogd -N1
rsyslogd: version 8.4.2, config validation run (level 1), master config /etc/rsyslog.conf
rsyslogd: error during parsing file /etc/rsyslog.d/15-crit.conf, on or before line 3: invalid character '$' in object definition - is there an invalid escape sequence somewhere? [try http://www.rsyslog.com/e/2207 ]
rsyslogd: error during parsing file /etc/rsyslog.d/15-crit.conf, on or before line 3: syntax error on token 'crit_log' [try http://www.rsyslog.com/e/2207 ]
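The errors are reported against 15-crit.conf because the unclosed action statement runs on into the next file that rsyslog reads, so also check the file that precedes the one named in the error.
After correcting the invalid syntax, issuing the sudo rsyslogd -N1
command produces the following output.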
cumulus@leaf01:mgmt-vrf:~$ cat /etc/rsyslog.d/11-remotesyslog.conf
action(type="omfwd" Target="192.168.0.254" Device="mgmt" Port="514" Protocol="udp")
cumulus@leaf01:mgmt-vrf:~$ sudo rsyslogd -N1
rsyslogd: version 8.4.2, config validation run (level 1), master config /etc/rsyslog.conf
rsyslogd: End of config validation run. Bye.
tcpdump
If a syslog server is not accessible to validate that syslog messages
are being exported, you can use tcpdump.
In the following example, a syslog server has been configured at
192.168.0.254 for UDP syslogs on port 514:
cumulus@leaf01:mgmt-vrf:~$ sudo tcpdump -i eth0 host 192.168.0.254 and udp port 514
A simple way to generate syslog messages is to use sudo in another
session, such as sudo date. Using sudo generates an authpriv log.
cumulus@leaf01:mgmt-vrf:~$ sudo tcpdump -i eth0 host 192.168.0.254 and udp port 514
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
00:57:15.356836 IP leaf01.lab.local.33875 > 192.168.0.254.syslog: SYSLOG authpriv.notice, length: 105
00:57:15.364346 IP leaf01.lab.local.33875 > 192.168.0.254.syslog: SYSLOG authpriv.info, length: 103
00:57:15.369476 IP leaf01.lab.local.33875 > 192.168.0.254.syslog: SYSLOG authpriv.info, length: 85
To see the contents of the syslog messages, use the tcpdump -X option:
cumulus@leaf01:mgmt-vrf:~$ sudo tcpdump -i eth0 host 192.168.0.254 and udp port 514 -X
This section discusses the various architectures and strategies available with Cumulus Linux and describes different solutions, such as RDMA over Converged Ethernet (RoCE).
Managing Cumulus Linux Disk Images
The Cumulus Linux operating system resides on a switch as a disk image. This section discusses how to manage the disk image.
To determine if your switch is on an x86 or ARM platform, run the uname -m command.
For example, on an x86 platform, uname -m outputs x86_64:
cumulus@x86switch$ uname -m
x86_64
On an ARM platform, uname -m outputs armv7l:
cumulus@ARMswitch$ uname -m
armv7l
You can also visit the HCL (hardware compatibility list) to look at your hardware and determine the processor type.
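If you script this check, for example to pick the matching installer, a small helper can map the uname -m output to a platform family. This is a minimal sketch. The function name classify_arch is hypothetical, and treating every value that starts with arm as an ARM platform is an assumption based on the armv7l example above.

```shell
# classify_arch MACHINE
# Map a `uname -m` value to the platform family used in this guide.
classify_arch() {
    case "$1" in
        x86_64) echo "x86" ;;
        arm*)   echo "ARM" ;;      # for example armv7l
        *)      echo "unknown" ;;
    esac
}

# Classify the machine this script is running on
classify_arch "$(uname -m)"
```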
Reprovision the System (Restart the Installer)
Reprovisioning the system deletes all system data from the switch.
To stage an ONIE installer from the network (where ONIE automatically locates the installer), run the onie-select -i command. A reboot is required for the reinstall to begin.
cumulus@switch:~$ sudo onie-select -i
WARNING:
WARNING: Operating System install requested.
WARNING: This will wipe out all system data.
WARNING:
Are you sure (y/N)? y
Enabling install at next reboot...done.
Reboot required to take effect.
To cancel a pending reinstall operation, run the onie-select -c command:
cumulus@switch:~$ sudo onie-select -c
Cancelling pending install at next reboot...done.
To stage an installer located in a specific location, run the onie-install -i command. You can specify a local path (absolute or relative) or an HTTP, HTTPS, SCP, or FTP server. You can also stage a Zero Touch Provisioning (ZTP) script along with the installer. The onie-install command is typically used with the -a option to activate installation. If you do not specify the -a option, a reboot is required for the reinstall to begin.
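The following example stages the installer located at http://203.0.113.10/image-installer together with the ZTP script located at http://203.0.113.10/ztp-script and activates installation and ZTP:
cumulus@switch:~$ sudo onie-install -i http://203.0.113.10/image-installer
cumulus@switch:~$ sudo onie-install -z http://203.0.113.10/ztp-script
cumulus@switch:~$ sudo onie-install -a
You can also specify these options together in the same command. For example: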
cumulus@switch:~$ sudo onie-install -i http://203.0.113.10/image-installer -z http://203.0.113.10/ztp-script -a
To see more onie-install options, run man onie-install.
Uninstall All Images and Remove the Configuration
To remove all installed images and configurations and return the switch to its factory defaults, run the onie-select -k command.
The onie-select -k command takes a long time to run as it overwrites the entire NOS section of the flash. Only use this command if you want to erase all NOS data and take the switch out of service.
cumulus@switch:~$ sudo onie-select -k
WARNING:
WARNING: Operating System uninstall requested.
WARNING: This will wipe out all system data.
WARNING:
Are you sure (y/N)? y
Enabling uninstall at next reboot...done.
Reboot required to take effect.
A reboot is required for the uninstall to begin.
To cancel a pending uninstall operation, run the onie-select -c command:
cumulus@switch:~$ sudo onie-select -c
Cancelling pending uninstall at next reboot...done.
Boot into Rescue Mode
If your system becomes broken in some way, you can correct certain issues by booting into ONIE rescue mode. In rescue mode, the file systems are unmounted and you can use various Cumulus Linux utilities to try to resolve a problem.
To reboot the system into ONIE rescue mode, run the onie-select -r command:
cumulus@switch:~$ sudo onie-select -r
WARNING:
WARNING: Rescue boot requested.
WARNING:
Are you sure (y/N)? y
Enabling rescue at next reboot...done.
Reboot required to take effect.
A reboot is required to boot into rescue mode.
To cancel a pending rescue boot operation, run the onie-select -c command:
cumulus@switch:~$ sudo onie-select -c
Cancelling pending rescue at next reboot...done.
Inspect the Image File
The Cumulus Linux installation disk image file is executable. From a running switch, you can display, extract, and verify the contents of the image file.
To display the contents of the Cumulus Linux image file, pass the info option to the image file. For example, to display the contents of an image file called onie-installer located in the /var/lib/cumulus/installer directory:
cumulus@switch:~$ sudo /var/lib/cumulus/installer/onie-installer info
To extract the contents of the image file, use the extract <path> option. For example, to extract an image file called onie-installer located in the /var/lib/cumulus/installer directory to the mypath directory:
cumulus@switch:~$ sudo /var/lib/cumulus/installer/onie-installer extract mypath
total 181860
-rw-r--r-- 1 4000 4000 308 May 16 19:04 control
drwxr-xr-x 5 4000 4000 4096 Apr 26 21:28 embedded-installer
-rw-r--r-- 1 4000 4000 13273936 May 16 19:04 initrd
-rw-r--r-- 1 4000 4000 4239088 May 16 19:04 kernel
-rw-r--r-- 1 4000 4000 168701528 May 16 19:04 sysroot.tar
To verify the contents of the image file, use the verify option. For example, to verify the contents of an image file called onie-installer located in the /var/lib/cumulus/installer directory:
cumulus@switch:~$ sudo /var/lib/cumulus/installer/onie-installer verify
Verifying image checksum ...OK.
Preparing image archive ... OK.
./cumulus-linux-bcm-amd64.bin.1: 161: ./cumulus-linux-bcm-amd64.bin.1: onie-sysinfo: not found
Verifying image compatibility ...OK.
Verifying system ram ...OK.
This topic discusses how to install a new Cumulus Linux disk image using ONIE, an open source project (equivalent to PXE on servers) that enables the installation of network operating systems (NOS) on bare metal switches.
Before you install Cumulus Linux, the switch can be in two different states:
No image is installed on the switch (the switch is only running ONIE).
Cumulus Linux is already installed on the switch but you want to use ONIE to reinstall Cumulus Linux or upgrade to a newer version.
The sections below describe some of the different ways you can install the Cumulus Linux disk image, such as using a DHCP/web server, FTP, TFTP, a local file, or a USB drive. Steps are provided for both installing directly from ONIE (if no image is installed on the switch) and from Cumulus Linux (if the image is already installed on the switch), where applicable. For additional methods to find and install the Cumulus Linux image, see the ONIE Design Specification.
Installing the Cumulus Linux disk image is destructive; configuration files on the switch are not saved; copy them to a different server before installing.
In the example commands, [PLATFORM] can be any supported Cumulus Linux platform, such as x86_64 or arm.
Run the sudo onie-install -h command to show the ONIE installer options.
After you install the Cumulus Linux disk image, you need to install the license file. Refer to Install the License.
In Cumulus Linux 3.7.12, the default password for the cumulus user account has changed to cumulus. The first time you log into Cumulus Linux, you are required to change this default password. Be sure to update any automation scripts before you upgrade to Cumulus Linux 3.7.12.
Install Using a DHCP/Web Server with DHCP Options
To install Cumulus Linux using a DHCP/web server with DHCP options, set up a DHCP/web server on your laptop and connect the eth0 management port of the switch to your laptop. After you connect the cable, the installation proceeds as follows:
The bare metal switch boots up and requests an IP address (DHCP request).
The DHCP server acknowledges and responds with DHCP option 114 and the location of the installation image.
ONIE downloads the Cumulus Linux disk image, installs, and reboots.
Success! You are now running Cumulus Linux.
The most common method is to send DHCP option 114 with the entire URL to the web server (this can be the same system). However, there are many other ways to use DHCP even if you do not have full control over DHCP. See the ONIE user guide for help with partial installer URLs and advanced DHCP options; both articles list more supported DHCP options.
Here is an example DHCP configuration with an ISC DHCP server:
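The following is a minimal sketch of such a configuration. It uses the ISC DHCP default-url option, which corresponds to DHCP option 114. The subnet, address range, and the reuse of the http://203.0.113.10/image-installer URL from earlier in this guide are all assumptions for illustration. The snippet writes the fragment to a scratch file for review and does not modify a live dhcpd.conf.

```shell
# Write a candidate dhcpd.conf fragment to a scratch file for review
dhcpd_conf="$(mktemp)"
cat > "$dhcpd_conf" <<'EOF'
# Fragment for the ISC DHCP server configuration (illustrative addresses)
subnet 203.0.113.0 netmask 255.255.255.0 {
  range 203.0.113.20 203.0.113.200;
  # DHCP option 114 (default-url): full URL of the installer image
  option default-url "http://203.0.113.10/image-installer";
}
EOF

# Show the result for review
cat "$dhcpd_conf"
```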
Install Using a DHCP/Web Server without DHCP Options
Follow the steps below if you have a laptop on the same network and the switch can pull DHCP from the corporate network, but you cannot modify DHCP options (maybe it is controlled by another team).
Install from ONIE
Place the Cumulus Linux disk image in a directory on the web server.
From the Cumulus Linux command prompt, run the onie-install command, then reboot the switch.
cumulus@switch:~$ sudo onie-install -a -i /path/to/local/file/cumulus-install-[PLATFORM].bin
Install Using a USB Drive
Follow the steps below to install the Cumulus Linux disk image using a USB drive. Instructions are provided for x86 and ARM platforms.
Installing Cumulus Linux from a USB drive works for an occasional single switch but does not scale. Unlike USB installs, DHCP can scale to hundreds of switch installs with no manual input.
From a computer, prepare your USB drive by formatting it using one of the supported formats: FAT32, vFAT or EXT2.
Optional: Prepare a USB Drive inside Cumulus Linux
Use caution when performing the actions below; it is possible to severely damage your system with the following utilities.
Insert your USB drive into the USB port on the switch running Cumulus Linux and log in to the switch.
Examine output from cat /proc/partitions and sudo fdisk -l [device] to determine on which device your USB drive can be found. For example, sudo fdisk -l /dev/sdb.
These instructions assume your USB drive is the /dev/sdb device, which is typical if you insert the USB drive after the machine is already booted. However, if you insert the USB drive during the boot process, it is possible that your USB drive is the /dev/sda device. Make sure to modify the commands below to use the proper device for your USB drive.
Create a new partition table on the USB drive:
sudo parted /dev/sdb mklabel msdos
The parted utility should already be installed. However, if it is not, install it with: sudo -E apt-get install parted
Create a new partition on the USB drive:
sudo parted /dev/sdb -a optimal mkpart primary 0% 100%
Format the partition to your filesystem of choice using one of the examples below:
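As a hedged sketch, the helper below prints (but does not run) a formatting command for each of the supported formats listed earlier. The function name mkfs_cmd is hypothetical. mkfs.ext2 and mkfs.vfat are the standard utilities from the e2fsprogs and dosfstools packages, and their presence on the switch is an assumption. The /dev/sdb1 partition follows the device assumption stated above.

```shell
# mkfs_cmd FILESYSTEM PARTITION
# Print the formatting command for one of the supported USB formats.
# The command is only printed, never executed, so nothing is erased by accident.
mkfs_cmd() {
    case "$1" in
        ext2)  echo "sudo mkfs.ext2 $2" ;;
        fat32) echo "sudo mkfs.vfat -F 32 $2" ;;
        vfat)  echo "sudo mkfs.vfat $2" ;;
        *)     echo "unsupported filesystem: $1" >&2; return 1 ;;
    esac
}

# Print the command for an EXT2 filesystem on the first partition of /dev/sdb
mkfs_cmd ext2 /dev/sdb1
```

Review the printed command, confirm that the partition really belongs to the USB drive, and only then run it.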
When you use a Mac or Windows computer to rename the installation file, the file extension might still be present. Make sure to remove the file extension; otherwise, ONIE cannot detect the file.
Insert the USB drive into the switch, then continue with the appropriate instructions below for your x86 or ARM platform.
Instructions for x86 Platforms
Prepare the switch for installation:
If the switch is offline, connect to the console and power on the switch.
If the switch is already online in ONIE, use the reboot command.
SSH sessions to the switch get dropped after this step. To complete the remaining instructions, connect to the console of the switch. Cumulus Linux switches display their boot process to the console; you need to monitor the console specifically to complete the next step.
Monitor the console and select the ONIE option from the first GRUB screen.
Cumulus Linux on x86 uses GRUB chainloading to present a second GRUB menu specific to the ONIE partition. No action is necessary in this menu to select the default option ONIE: Install OS.
The USB drive is recognized and mounted automatically. The image file is located and automatic installation of Cumulus Linux begins. Here is some sample output:
ONIE: OS Install Mode ...
Version : quanta_common_rangeley-2014.05.05-6919d98-201410171013
Build Date: 2014-10-17T10:13+0800
Info: Mounting kernel filesystems... done.
Info: Mounting LABEL=ONIE-BOOT on /mnt/onie-boot ...
initializing eth0...
scsi 6:0:0:0: Direct-Access SanDisk Cruzer Facet 1.26 PQ: 0 ANSI: 6
sd 6:0:0:0: [sdb] 31266816 512-byte logical blocks: (16.0 GB/14.9 GiB)
sd 6:0:0:0: [sdb] Write Protect is off
sd 6:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
sd 6:0:0:0: [sdb] Attached SCSI disk
<...snip...>
ONIE: Executing installer: file://dev/sdb1/onie-installer-x86_64
Verifying image checksum ... OK.
Preparing image archive ... OK.
Dumping image info...
Control File Contents
=====================
Description: Cumulus Linux
OS-Release: 3.0.0-3b46bef-201509041633-build
Architecture: amd64
Date: Fri, 27 May 2016 17:10:30 -0700
Installer-Version: 1.2
Platforms: accton_as5712_54x accton_as6712_32x mlx_sx1400_i73612 dell_s6000_s1220 dell_s4000_c2338 dell_s3000_c2338 cel_redstone_xp cel_smallstone_xp cel_pebble quanta_panther quanta_ly8_rangeley quanta_ly6_rangeley quanta_ly9_rangeley
Homepage: http://www.cumulusnetworks.com/
After installation completes, the switch automatically reboots into the newly installed instance of Cumulus Linux.
Instructions for ARM Platforms
Prepare the switch for installation:
If the switch is offline, connect to the console and power on the switch.
If the switch is already online in ONIE, use the reboot command.
SSH sessions to the switch get dropped after this step. To complete the remaining instructions, connect to the console of the switch. Cumulus Linux switches display their boot process to the console; you need to monitor the console specifically to complete the next step.
Interrupt the normal boot process before the autoboot countdown completes. Press any key to stop the autoboot.
A command prompt appears so that you can run commands. Execute the following command:
run onie_bootcmd
The USB drive is recognized and mounted automatically. The image file is located and automatic installation of Cumulus Linux begins. Here is some sample output:
Loading Open Network Install Environment ...
Platform: arm-as4610_54p-r0
Version : 1.6.1.3
WARNING: adjusting available memory to 30000000
## Booting kernel from Legacy Image at ec040000 ...
Image Name: as6701_32x.1.6.1.3
Image Type: ARM Linux Multi-File Image (gzip compressed)
Data Size: 4456555 Bytes = 4.3 MiB
Load Address: 00000000
Entry Point: 00000000
Contents:
Image 0: 3738543 Bytes = 3.6 MiB
Image 1: 706440 Bytes = 689.9 KiB
Image 2: 11555 Bytes = 11.3 KiB
Verifying Checksum ... OK
## Loading init Ramdisk from multi component Legacy Image at ec040000 ...
## Flattened Device Tree from multi component Image at EC040000
Booting using the fdt at 0xec47d388
Uncompressing Multi-File Image ... OK
Loading Ramdisk to 2ff53000, end 2ffff788 ... OK
Loading Device Tree to 03ffa000, end 03fffd22 ... OK
<...snip...>
ONIE: Starting ONIE Service Discovery
ONIE: Executing installer: file://dev/sdb1/onie-installer-arm
Verifying image checksum ... OK.
Preparing image archive ... OK.
Dumping image info ...
Control File Contents
=====================
Description: Cumulus Linux
OS-Release: 3.0.0-3b46bef-201509041633-build
Architecture: arm
Date: Fri, 27 May 2016 17:08:35 -0700
Installer-Version: 1.2
Platforms: accton_as4600_54t, accton_as6701_32x, accton_5652, accton_as5610_52x, dni_6448, dni_7448, dni_c7448n, cel_kennisis, cel_redstone, cel_smallstone, cumulus_p2020, quanta_lb9, quanta_ly2, quanta_ly2r, quanta_ly6_p2020
Homepage: http://www.cumulusnetworks.com/
After installation completes, the switch automatically reboots into the newly installed instance of Cumulus Linux.
This topic describes how to upgrade Cumulus Linux on your switches to a more recent release.
Consider deploying, provisioning, configuring, and upgrading switches using automation, even with small networks or test labs. During the upgrade process, you can quickly upgrade dozens of devices in a repeatable manner. Using tools like Ansible, Chef, or Puppet for configuration management greatly increases the speed and accuracy of the next major upgrade; these tools also enable the quick swap of failed switch hardware.
In Cumulus Linux 3.7.12, the default password for the cumulus user account has changed to cumulus. The first time you log into Cumulus Linux, you are required to change this default password. Be sure to update any automation scripts before you upgrade to Cumulus Linux 3.7.12.
Understanding the location of configuration data is required for successful upgrades, migrations, and backup. As with other Linux distributions, the /etc directory is the primary location for all configuration data in Cumulus Linux. The following list is a likely set of files that you need to back up and migrate to a new release. Make sure you examine any file that has been changed. Consider making the following files and directories part of a backup strategy.
Network Configuration Files
File Name and Location
Explanation
Cumulus Linux Documentation
Debian Documentation
/etc/network/
Network configuration files, most notably /etc/network/interfaces and /etc/network/interfaces.d/
Best practice is to place changes in /etc/sudoers.d/ instead of /etc/sudoers; changes in the /etc/sudoers.d/ directory are not lost during upgrade. If you are upgrading from a release prior to 3.2 (such as 3.1.2) to a 3.2 or later release, be aware that the sudoers file changed in Cumulus Linux 3.2.
If you are using the root user account, consider including /root/.
If you have custom user accounts, consider including /home/<username>/.
Run the net show configuration files | grep -B 1 "===" command and back up the files listed in the command output.
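To make the backup repeatable, you can collect the chosen paths into a single archive and copy it off the switch. The following is a minimal sketch. The function name backup_configs, the archive name, and the particular paths in the usage comment are assumptions for illustration; substitute the files you identified above.

```shell
# backup_configs ARCHIVE PATH...
# Archive the given paths into a gzip-compressed tarball, skipping paths that do not exist.
backup_configs() {
    archive="$1"; shift
    # Rebuild the argument list so that it holds only the paths that exist
    for p in "$@"; do
        shift
        if [ -e "$p" ]; then
            set -- "$@" "$p"
        fi
    done
    if [ "$#" -eq 0 ]; then
        echo "nothing to back up" >&2
        return 1
    fi
    tar -czf "$archive" "$@"
}

# Example usage (run as root on the switch, then copy the archive to another server):
#   backup_configs /tmp/config-backup.tar.gz /etc/network /etc/sudoers.d
```

Keep the files and directories listed in the next section out of the archive, or exclude them when you restore, because they must not be migrated between versions or switches.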
Files to Never Migrate between Versions or Switches
| File Name and Location | Explanation |
| --- | --- |
| /etc/bcm.d/ | Per-platform hardware configuration directory, created on first boot. Do not copy. |
| /etc/mlx/ | Per-platform hardware configuration directory, created on first boot. Do not copy. |
| /etc/default/clagd | Created and managed by ifupdown2. Do not copy. |
| /etc/default/grub | Grub init table. Do not modify manually. |
| /etc/default/hwclock | Platform hardware-specific file. Created during first boot. Do not copy. |
| /etc/init | Platform initialization files. Do not copy. |
| /etc/init.d/ | Platform initialization files. Do not copy. |
| /etc/fstab | Static info on filesystem. Do not copy. |
| /etc/image-release | System version data. Do not copy. |
| /etc/os-release | System version data. Do not copy. |
| /etc/lsb-release | System version data. Do not copy. |
| /etc/lvm/archive | Filesystem files. Do not copy. |
| /etc/lvm/backup | Filesystem files. Do not copy. |
| /etc/modules | Created during first boot. Do not copy. |
| /etc/modules-load.d/ | Created during first boot. Do not copy. |
| /etc/sensors.d | Platform-specific sensor data. Created during first boot. Do not copy. |
| /root/.ansible | Ansible tmp files. Do not copy. |
| /home/cumulus/.ansible | Ansible tmp files. Do not copy. |
Create a cl-support File
Before and after you upgrade the switch, run the cl-support script to create a cl-support archive file. The file is a compressed archive of useful information for troubleshooting. If you experience any issues during upgrade, you can send this archive file to the Cumulus Linux support team to investigate.
Create the cl-support archive file with the cl-support command:
cumulus@switch:~$ sudo cl-support
Copy the cl-support file off the switch to a different location.
After upgrade is complete, run the cl-support command again to create a new archive file:
cumulus@switch:~$ sudo cl-support
Upgrade Cumulus Linux
You can upgrade Cumulus Linux in one of two ways:
Install a disk image of the new release, using ONIE.
Upgrade only the changed packages using the sudo -E apt-get update and sudo -E apt-get upgrade commands.
Upgrading an MLAG pair requires additional steps. If you are using MLAG to dual connect two Cumulus Linux switches in your environment, follow the steps in Upgrade Switches in an MLAG Pair below to ensure a smooth upgrade.
Should I Install a Disk Image or Upgrade Packages?
The decision to upgrade Cumulus Linux by either installing a disk image or upgrading packages depends on your environment and your preferences. Here are some recommendations for each upgrade method.
Installing a disk image is recommended if you are performing a rolling upgrade in a production environment and if you are using up-to-date and comprehensive automation scripts. This upgrade method enables you to choose the exact release to which you want to upgrade and is the only method available to upgrade your switch to a new release train (for example, from 2.5.6 to 3.7.0) or from a release earlier than 3.6.2.
Be aware of the following when installing the disk image:
Installing a disk image is destructive; any configuration files on the switch are not saved; copy them to a different server before you start the disk image install.
You must move configuration data to the new OS using ZTP or automation while the OS is first booted, or soon afterwards using out-of-band management.
Merge conflicts with configuration file changes in the new release might go undetected.
If configuration files are not restored correctly, you might be unable to ssh to the switch from in-band management. Out-of-band connectivity (eth0 or console) is recommended.
You must reinstall and reconfigure third-party applications after upgrade.
Package upgrade is recommended if you are upgrading from Cumulus Linux 3.6.2 or later, or if you use third-party applications (package upgrade does not replace or remove third-party applications, unlike disk image install).
Be aware of the following when upgrading packages:
You cannot upgrade the switch to a new release train. For example, you cannot upgrade the switch from 2.5.6 to 3.y.z.
If you are upgrading Cumulus Linux from a release earlier than 3.6.2, you might encounter certain issues due to package changes and service restarts.
You cannot choose the exact release that you want to run. When you upgrade, you upgrade all packages to the latest available release in the Cumulus Linux repository.
If you are upgrading from a release earlier than 3.6.2, certain upgrade operations terminate SSH sessions and/or routing on the in-band (front panel) ports, leaving you unable to monitor the upgrade process. (As a workaround, you can use the dtach tool.)
The sudo -E apt-get upgrade command might result in services being restarted or stopped as part of the upgrade process.
The sudo -E apt-get install command might disrupt core services by changing core service dependency packages.
After you upgrade, account UIDs and GIDs created by packages might be different on different switches, depending on the configuration and package installation history.
Disk Image Install (ONIE)
ONIE is an open source project (equivalent to PXE on servers) that enables the installation of network operating systems (NOS) on a bare metal switch.
To upgrade the switch with a new disk image using ONIE:
Back up the configurations off the switch.
Download the Cumulus Linux image you want to install.
Install the disk image with the onie-install -a -i <image-location> command, which boots the switch into ONIE. The following example command installs the image from a web server, then reboots the switch. There are additional ways to install the disk image, such as using FTP, a local file, or a USB drive. For more information, see Installing a New Cumulus Linux Image.
cumulus@switch:~$ sudo onie-install -a -i http://10.0.1.251/cumulus-linux-3.7.1-mlx-amd64.bin && sudo reboot
Restore the configuration files to the new release - ideally with automation.
Verify correct operation with the old configurations on the new release.
Reinstall third party applications and associated configurations.
Package Upgrade
Cumulus Linux completely embraces the Linux and Debian upgrade workflow, where you use an installer to install a base image, then perform any upgrades within that release train with the sudo -E apt-get update and sudo -E apt-get upgrade commands. Any packages that have been changed since the base install get upgraded in place from the repository. All switch configuration files remain untouched, or in rare cases merged (using the Debian merge function) during the package upgrade.
When you use package upgrade to upgrade your switch, configuration data stays in place while the packages are upgraded. If the new release updates a configuration file that you changed previously, you are prompted for the version you want to use or if you want to evaluate the differences.
To upgrade the switch using package upgrade:
Back up the configurations from the switch.
To upgrade to Cumulus Linux 3.7.16, you must download the new repository keys:
Fetch the latest update metadata from the repository.
cumulus@switch:~$ sudo -E apt-get update
Review potential upgrade issues (in some cases, upgrading new packages might also upgrade additional existing packages due to dependencies). Run the following command to see the additional packages that will be installed or upgraded.
cumulus@switch:~$ sudo -E apt-get upgrade --dry-run
Upgrade all the packages to the latest distribution.
cumulus@switch:~$ sudo -E apt-get upgrade
If no reboot is required after the upgrade completes, the upgrade ends, restarts all upgraded services, and logs messages in the /var/log/syslog file similar to the ones shown below. In the examples below, only the frr package was upgraded.
Policy: Service frr.service action stop postponed
Policy: Service frr.service action start postponed
Policy: Restarting services: frr.service
Policy: Finished restarting services
Policy: Removed /usr/sbin/policy-rc.d
Policy: Upgrade is finished
If the upgrade process encounters changed configuration files that have new versions in the release to which you are upgrading, you see a message similar to this:
Configuration file '/etc/frr/daemons'
==> Modified (by you or by a script) since installation.
==> Package distributor has shipped an updated version.
What would you like to do about it ? Your options are:
Y or I : install the package maintainer's version
N or O : keep your currently-installed version
D : show the differences between the versions
Z : start a shell to examine the situation
The default action is to keep your current version.
*** daemons (Y/I/N/O/D/Z) [default=N] ?
- To see the differences between the currently installed version and the new version, type `D`.
- To keep the currently installed version, type `N`. The new package version is installed with the suffix `.dpkg-dist` (for example, `/etc/frr/daemons.dpkg-dist`). When the upgrade is complete and **before** you reboot, merge your changes with the changes from the newly installed file.
- To install the new version, type `I`. Your currently installed version is saved with the suffix `.dpkg-old`.
When the upgrade is complete, you can search for the files with the `sudo find / -mount -type f -name '*.dpkg-*'` command.
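The search for leftover files can be wrapped in a small helper that also pairs each leftover with the live file it belongs to. This is a minimal sketch: the function name `list_dpkg_leftovers` is hypothetical, and it defaults to searching /etc instead of the whole root file system on the assumption that most package configuration files live there.

```shell
# List leftover dpkg configuration files under a directory and, for each one,
# print the leftover path followed by the live file it belongs to.
# Usage: list_dpkg_leftovers [directory]   (default: /etc, an assumption)
list_dpkg_leftovers() {
    local root="${1:-/etc}"
    # -mount keeps find on one file system, as in the find command above.
    find "$root" -mount -type f -name '*.dpkg-*' 2>/dev/null | sort |
    while IFS= read -r leftover; do
        # Strip the .dpkg-dist or .dpkg-old suffix to get the live file name.
        printf '%s -> %s\n' "$leftover" "${leftover%.dpkg-*}"
    done
}
```

Each pair can then be compared with a standard tool, for example `diff -u /etc/frr/daemons /etc/frr/daemons.dpkg-dist`, before you merge the changes by hand.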
If you see errors for expired GPG keys that prevent you from upgrading packages, follow the steps in Upgrading Expired GPG Keys.
Reboot the switch if the upgrade messages indicate that a system restart is required.
cumulus@switch:~$ sudo -E apt-get upgrade
... upgrade messages here ...
*** Caution: Service restart prior to reboot could cause unpredictable behavior
*** System reboot required ***
cumulus@switch:~$ sudo reboot
Verify correct operation with the old configurations on the new version.
Upgrade Notes
Package upgrade always updates to the latest available release in the Cumulus Linux repository. For example, if you are currently running Cumulus Linux 3.0.1 and run the sudo -E apt-get upgrade command on that switch, the packages are upgraded to the latest releases contained in the latest 3.y.z release.
Because Cumulus Linux is a collection of different Debian Linux packages, be aware of the following:
The /etc/os-release and /etc/lsb-release files are updated to the currently installed Cumulus Linux release when you upgrade the switch using either package upgrade or disk image install. For example, if you run sudo -E apt-get upgrade and the latest Cumulus Linux release on the repository is 3.7.1, these two files display the release as 3.7.1 after the upgrade.
The /etc/image-release file is updated only when you run a disk image install. Therefore, if you run a disk image install of Cumulus Linux 3.5.0, followed by a package upgrade to 3.7.1 using sudo -E apt-get upgrade, the /etc/image-release file continues to display Cumulus Linux 3.5.0, which is the originally installed base image.
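To check whether a switch has been package-upgraded since its base image was installed, you can compare the two files. The sketch below assumes that /etc/os-release carries the standard VERSION_ID key and that /etc/image-release stores the base version in a key named IMAGE_RELEASE; the second key name is an assumption, so confirm it on your switch. The function names are hypothetical.

```shell
# Read KEY=value from a file and print the value without surrounding quotes.
read_release_key() {
    local key="$1" file="$2"
    sed -n "s/^${key}=//p" "$file" | head -n 1 | tr -d '"'
}

# Report whether the running release differs from the base image.
# Arguments default to the two files described above.
# IMAGE_RELEASE is an assumed key name; adjust it if your file differs.
compare_releases() {
    local os_file="${1:-/etc/os-release}" image_file="${2:-/etc/image-release}"
    local running base
    running=$(read_release_key VERSION_ID "$os_file")
    base=$(read_release_key IMAGE_RELEASE "$image_file")
    if [ "$running" = "$base" ]; then
        echo "running $running (same as base image)"
    else
        echo "running $running, base image $base (package-upgraded)"
    fi
}
```

Running `compare_releases` with no arguments reads the real files; passing two paths lets you try it against copies.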
Upgrade Switches in an MLAG Pair
If you are using MLAG to dual connect two switches in your environment, follow the steps below according to the version of Cumulus Linux from which you are upgrading.
You must upgrade both switches in the MLAG pair to the same release of Cumulus Linux.
Only during the upgrade process does Cumulus Linux support different software versions between MLAG peer switches. After you upgrade the first MLAG switch in the pair, run the clagctl showtimers command to monitor the init-delay timer. When the timer expires, make the upgraded MLAG switch the primary, then upgrade the peer to the same version of Cumulus Linux.
Running different versions of Cumulus Linux on MLAG peer switches outside of the upgrade time period is untested and might have unexpected results.
For Cumulus Linux 3.7.10 and later, MLAG bonds stay single-connected during upgrade while the switches are running different major releases; for example, while leaf01 is running 3.7.12 and leaf02 is running 4.1.1.
This is due to a change in the bonding driver regarding how the actor port key is derived, which causes the port key to have a different value for links with the same speed/duplex settings across different major releases. The port key received from the LACP partner must remain consistent between all bond members in order for all bonds to be synchronized. When each MLAG switch sends LACPDUs with different port keys, only links to one MLAG switch are in sync.
Upgrade from Cumulus Linux 3.y.z to a Later 3.y.z Release
When you upgrade Cumulus Linux from 3.y.z to a later 3.y.z release, you can either install a disk image using ONIE or use package upgrade. Both methods are included below.
To upgrade the switches:
Verify the switch is in the secondary role:
cumulus@switch:~$ clagctl status
If you want to install a disk image, go to the next step. If you want to use package upgrade, update the Cumulus Linux repositories:
cumulus@switch:~$ sudo -E apt-get update
Shut down the core uplink layer 3 interfaces:
cumulus@switch:~$ sudo ip link set swpX down
Shut down the peerlink:
cumulus@switch:~$ sudo ip link set peerlink down
Perform the upgrade either by installing a disk image or upgrading packages. To install a disk image, run the onie-install -a -i <image-location> command to boot the switch into ONIE. The following example command installs the image from a web server. There are additional ways to install the disk image, such as using FTP, a local file, or a USB drive. For more information, see Installing a New Cumulus Linux Image.
cumulus@switch:~$ sudo onie-install -a -i http://10.0.1.251/downloads/cumulus-linux-3.7.1-mlx-amd64.bin
To use *package upgrade*, run the `sudo -E apt-get upgrade` command:
Verify the other switch is now in the secondary role.
Repeat steps 2-10 on the new secondary switch.
Remove the priority 2048 and restore the priority back to 32768 on the current primary switch:
cumulus@switch:~$ clagctl priority 32768
Upgrade from Cumulus Linux 2.y.z to 3.y.z
If you are using MLAG to dual connect two switches in your environment and those switches are still running Cumulus Linux 2.5 ESR or any other release earlier than 3.0.0, the switches are not dual-connected after you upgrade the first switch.
To upgrade the switches, you must install a new disk image using ONIE; you cannot use package upgrade:
Disable clagd in the /etc/network/interfaces file (set clagd-enable to no), then restart switchd, networking, and FRR services.
Run cl-img-select -fr to boot the switch in the secondary role into ONIE, then reboot the switch.
Install Cumulus Linux onto the secondary switch using ONIE. At this time, all traffic goes to the switch in the primary role.
After the install, copy the license file and all the configuration files you backed up, then restart the switchd, networking, and Quagga services. All traffic is still going to the primary switch.
Run cl-img-select -fr to boot the switch in the primary role into ONIE, then reboot the switch. Now, all traffic is going to the switch in the secondary role that you just upgraded.
Install Cumulus Linux onto the primary switch using ONIE.
After the install, copy the license file and all the configuration files you backed up.
Enable clagd again in the /etc/network/interfaces file (set clagd-enable to yes), then run ifreload -a.
cumulus@switch:~$ sudo ifreload -a
Bring up all the front panel ports:
cumulus@switch:~$ sudo ip link set swp<#> up
The two switches are dual-connected again and traffic flows to both switches.
Roll Back a Cumulus Linux Installation
Even the most well planned and tested upgrades can result in unforeseen problems; sometimes the best solution is to roll back to the previous state. There are three main strategies; all require detailed planning and execution:
Back out individual packages: If you identify the problematic package, you can downgrade the affected package directly. In rare cases, you might need to restore the configuration files from backup or edit to back out any changes made automatically by the upgrade package.
Flatten and rebuild: If the OS becomes unusable, you can use orchestration tools to reinstall the previous OS release from scratch and then rebuild the configuration automatically.
Backup and restore: Another common strategy is to restore to a previous state using a backup captured before the upgrade.
The method you employ is specific to your deployment strategy, so providing detailed steps for each scenario is outside the scope of this document.
Third Party Packages
Third party packages in the Linux host world often use the same package system as the distribution on which they are installed (for example, Debian uses apt-get). Alternatively, the system administrator might compile and install the package. Configuration and executable files generally follow the same filesystem hierarchy standards as other applications.
If you install any third party applications on a Cumulus Linux switch, configuration data is typically installed into the /etc directory, but it is not guaranteed. It is your responsibility to understand the behavior and configuration file information of any third party packages installed on the switch.
After you upgrade using a full disk image install, you need to reinstall any third party packages or any Cumulus Linux add-on packages, such as vxsnd or vxrd.
Cumulus Linux supports the ability to take snapshots of the complete file system as well as the ability to roll back to a previous snapshot. Snapshots are performed automatically right before and after you upgrade Cumulus Linux using package install, and right before and after you commit a switch configuration using NCLU. In addition, you can take a snapshot at any time. You can roll back the entire file system to a specific snapshot or just retrieve specific files.
The primary snapshot components include:
btrfs - an underlying file system in Cumulus Linux, which supports snapshots.
snapper - a userspace utility to create and manage snapshots on demand as well as taking snapshots automatically before and after running apt-get upgrade|install|remove|dist-upgrade. You can use snapper to roll back to earlier snapshots, view existing snapshots, or delete one or more snapshots.
NCLU - takes snapshots automatically before and after committing network configurations. You can use NCLU to roll back to earlier snapshots, view existing snapshots, or delete one or more snapshots.
Install the Snapshot Package
If you are upgrading from a version of Cumulus Linux earlier than version 3.2, you need to install the cumulus-snapshot package before you can use snapshots.
For more information about using snapper, run snapper --help or man snapper(8).
View Available Snapshots
You can use both NCLU and snapper to view available snapshots on the switch.
cumulus@switch:~$ net show commit history
# Date Description
--- ------------------------------- --------------------------------------
20 Thu 01 Dec 2016 01:43:29 AM UTC nclu pre 'net commit' (user cumulus)
21 Thu 01 Dec 2016 01:43:31 AM UTC nclu post 'net commit' (user cumulus)
22 Thu 01 Dec 2016 01:44:18 AM UTC nclu pre '20 rollback' (user cumulus)
23 Thu 01 Dec 2016 01:44:18 AM UTC nclu post '20 rollback' (user cumulus)
24 Thu 01 Dec 2016 01:44:22 AM UTC nclu pre '22 rollback' (user cumulus)
31 Fri 02 Dec 2016 12:18:08 AM UTC nclu pre 'ACL' (user cumulus)
32 Fri 02 Dec 2016 12:18:10 AM UTC nclu post 'ACL' (user cumulus)
However, net show commit history only displays snapshots taken when you update your switch configuration. It does not list any snapshots taken directly with snapper. To see all the snapshots on the switch, run the sudo snapper list command:
cumulus@switch:~$ sudo snapper list
Type | # | Pre # | Date | User | Cleanup | Description | Userdata
-------+----+-------+---------------------------------+------+---------+----------------------------------------+--------------
single | 0 | | | root | | current |
single | 1 | | Sat 24 Sep 2016 01:45:36 AM UTC | root | | first root filesystem |
pre | 20 | | Thu 01 Dec 2016 01:43:29 AM UTC | root | number | nclu pre 'net commit' (user cumulus) |
post | 21 | 20 | Thu 01 Dec 2016 01:43:31 AM UTC | root | number | nclu post 'net commit' (user cumulus) |
pre | 22 | | Thu 01 Dec 2016 01:44:18 AM UTC | root | number | nclu pre '20 rollback' (user cumulus) |
post | 23 | 22 | Thu 01 Dec 2016 01:44:18 AM UTC | root | number | nclu post '20 rollback' (user cumulus) |
single | 26 | | Thu 01 Dec 2016 11:23:06 PM UTC | root | | test_snapshot |
pre | 29 | | Thu 01 Dec 2016 11:55:16 PM UTC | root | number | pre-apt | important=yes
post | 30 | 29 | Thu 01 Dec 2016 11:55:21 PM UTC | root | number | post-apt | important=yes
pre | 31 | | Fri 02 Dec 2016 12:18:08 AM UTC | root | number | nclu pre 'ACL' (user cumulus) |
post | 32 | 31 | Fri 02 Dec 2016 12:18:10 AM UTC | root | number | nclu post 'ACL' (user cumulus) |
View Differences between Snapshots
To see a line by line comparison of changes between two snapshots, run the sudo snapper diff command:
Snapshot 0 is the running configuration. You cannot roll back to it or delete it. However, you can take a snapshot of it.
Snapshot 1 is the root file system.
The snapper utility preserves a number of snapshots and automatically deletes older snapshots after the limit is reached. It does this in two ways.
By default, snapper preserves 10 snapshots that are labeled important. A snapshot is labeled important if it is created when you run apt-get. To change this number, run:
Always make NUMBER_LIMIT_IMPORTANT an even number as two snapshots are always taken before and after an upgrade. This does not apply to NUMBER_LIMIT, described next.
snapper also deletes unlabeled snapshots. By default, snapper preserves five snapshots. To change this number, run:
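As a hedged sketch of adjusting both limits, the helper below only prints the commands instead of running them. It assumes snapper's standard `set-config` subcommand and the NUMBER_LIMIT_IMPORTANT and NUMBER_LIMIT keys named in this section; the function name `snapper_limit_commands` is hypothetical. It rejects an odd value for the important limit, following the even-number guidance above.

```shell
# Print (do not run) the snapper commands for new retention limits.
# $1: limit for snapshots labeled important (should be even, see above)
# $2: limit for unlabeled snapshots
snapper_limit_commands() {
    local important="$1" unlabeled="$2" n
    for n in "$important" "$unlabeled"; do
        case "$n" in
            ''|*[!0-9]*) echo "limits must be whole numbers" >&2; return 2 ;;
        esac
    done
    # An even number ends in 0, 2, 4, 6, or 8.
    case "$important" in
        *[02468]) ;;
        *) echo "NUMBER_LIMIT_IMPORTANT should be even (pre/post pairs)" >&2
           return 1 ;;
    esac
    echo "sudo snapper set-config NUMBER_LIMIT_IMPORTANT=$important"
    echo "sudo snapper set-config NUMBER_LIMIT=$unlabeled"
}
```

For example, `snapper_limit_commands 10 5` prints the two commands that correspond to the default values described in this section; review them, then run them yourself.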
You can prevent snapshots from being taken automatically before and after running apt-get upgrade|install|remove|dist-upgrade. Edit /etc/cumulus/apt-snapshot.conf and set:
APT_SNAPSHOT_ENABLE=no
Roll Back to Earlier Snapshots
If you need to restore Cumulus Linux to an earlier state, you can roll back to an older snapshot.
For a snapshot created with NCLU, you can revert to the configuration prior to a specific snapshot listed in the output from net show commit history by running net rollback SNAPSHOT_NUMBER. For example, if you have snapshots 10, 11 and 12 in your commit history and you run net rollback 11, the switch configuration reverts to the configuration captured by snapshot 10.
You can also revert to the previous snapshot by specifying last by running net rollback last.
cumulus@switch:~$ net rollback SNAPSHOT_NUMBER|last
If you provided a description when you committed changes, mentioning a description rolls the configuration back to the commit prior to the specified description. For example, consider the following commit history:
cumulus@switch:~$ net show commit history
# Date Description
-- ------------------------------- --------------------------------
10 Tue 06 Nov 2018 12:07:14 AM UTC nclu "net commit" (user cumulus)
12 Tue 06 Nov 2018 10:19:50 PM UTC nclu rocket
14 Tue 06 Nov 2018 10:20:22 PM UTC nclu turtle
Running net rollback description turtle rolls the configuration back to the state it was in when you ran net commit description rocket.
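The rule in both cases is that a rollback restores the entry immediately before the one you name. The sketch below only models that rule on a list of snapshot numbers; it does not call NCLU, and the function name `previous_snapshot` is hypothetical.

```shell
# Given a snapshot number followed by the commit history (oldest first),
# print the entry immediately before that snapshot, which is the state a
# rollback to that number restores according to the rule described above.
previous_snapshot() {
    local target="$1" prev="" n
    shift
    for n in "$@"; do
        if [ "$n" = "$target" ]; then
            if [ -n "$prev" ]; then
                echo "$prev"
                return 0
            fi
            echo "no snapshot before $target" >&2
            return 1
        fi
        prev="$n"
    done
    echo "snapshot $target not in history" >&2
    return 1
}
```

With the history 10, 12, 14 shown above, `previous_snapshot 14 10 12 14` prints 12, which matches rolling back from turtle to rocket.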
Roll Back with snapper
For any snapshot on the switch, you can use snapper to roll back to a specific snapshot. When running snapper rollback, you must reboot the switch for the rollback to complete:
You might notice that the root partition is mounted multiple times. This is due to the way the btrfs file system handles subvolumes, mounting the root partition once for each subvolume. btrfs keeps one subvolume for each snapshot taken, which stores the snapshot data. While all snapshots are subvolumes, not all subvolumes are snapshots.
Cumulus Linux excludes a number of directories when taking a snapshot of the root file system (and from any rollbacks):
| Directory | Reason |
| --- | --- |
| /home | This directory is excluded to avoid user data loss on rollbacks. |
| /var/log, /var/support | The log file and Cumulus support location. These directories are excluded from snapshots to allow post-rollback analysis. |
| /tmp, /var/tmp | There is no need to roll back temporary files. |
| /opt, /var/opt | Third-party software is typically installed in /opt. Exclude /opt to avoid re-installing these applications after rollbacks. |
| /srv | This directory contains data for HTTP and FTP servers. Exclude this directory to avoid server data loss on rollbacks. |
| /usr/local | This directory is used when installing locally built software. Exclude this directory to avoid re-installing this software after rollbacks. |
| /var/spool | Exclude this directory to avoid loss of mail after a rollback. |
| /var/lib/libvirt/images | This is the default directory for libvirt VM images. Exclude this directory from the snapshot. Additionally, disable Copy-On-Write (COW) for this subvolume as COW and VM image I/O access patterns are not compatible. |
The GRUB kernel modules must stay in sync with the GRUB kernel installed in the master boot record or UEFI system partition.
Adding and Updating Packages
You use the Advanced Packaging Tool (apt) to manage additional applications (in the form of packages) and to install the latest updates.
Updating, upgrading, and installing packages with apt causes disruptions to network services:
Upgrading a package might result in services being restarted or stopped as part of the upgrade process.
Installing a package might disrupt core services by changing core service dependency packages. In some cases, installing new packages might also upgrade additional existing packages due to dependencies.
If services are stopped, you might need to reboot the switch for those services to restart.
Update the Package Cache
To work properly, apt relies on a local cache listing of the available packages. You must populate the cache initially, and then periodically update it with sudo -E apt-get update:
Use the -E option with sudo whenever you run any apt-get command. This option preserves your environment variables (such as HTTP proxies) before you install new packages or upgrade your distribution.
List Available Packages
After the cache is populated, use the apt-cache command to search the cache and find the packages in which you are interested or to get information about an available package. Here are examples of the search and show sub-commands:
cumulus@switch:~$ apt-cache search tcp
socat - multipurpose relay for bidirectional data transfer
fakeroot - tool for simulating superuser privileges
tcpdump - command-line network traffic analyzer
openssh-server - secure shell (SSH) server, for secure access from remote machines
openssh-sftp-server - secure shell (SSH) sftp server module, for SFTP access from remote machines
python-dpkt - Python packet creation / parsing module
libfakeroot - tool for simulating superuser privileges - shared libraries
openssh-client - secure shell (SSH) client, for secure access to remote machines
rsyslog - reliable system and kernel logging daemon
libwrap0 - Wietse Venema's TCP wrappers library
netbase - Basic TCP/IP networking system
cumulus@switch:~$ apt-cache show tcpdump
Package: tcpdump
Status: install ok installed
Priority: optional
Section: net
Installed-Size: 1092
Maintainer: Romain Francoise <rfrancoise@debian.org>
Architecture: amd64
Multi-Arch: foreign
Version: 4.6.2-5+deb8u1
Depends: libc6 (>= 2.14), libpcap0.8 (>= 1.5.1), libssl1.0.0 (>= 1.0.0)
Description: command-line network traffic analyzer
This program allows you to dump the traffic on a network. tcpdump
is able to examine IPv4, ICMPv4, IPv6, ICMPv6, UDP, TCP, SNMP, AFS
BGP, RIP, PIM, DVMRP, IGMP, SMB, OSPF, NFS and many other packet
types.
.
It can be used to print out the headers of packets on a network
interface, filter packets that match a certain expression. You can
use this tool to track down network problems, to detect attacks
or to monitor network activities.
Description-md5: f01841bfda357d116d7ff7b7a47e8782
Homepage: http://www.tcpdump.org/
cumulus@switch:~$
The search commands look for the search terms not only in the package name but in other parts of the package information; the search matches on more packages than you might expect.
List Installed Packages
The APT cache contains information about all the packages available in the repository. To see which packages are actually installed on your system, use dpkg. The following example lists all the package names on the system that contain tcp:
cumulus@switch:~$ dpkg -l \*tcp\*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=============================-===================-===================-===============================================================
un tcpd <none> <none> (no description available)
ii tcpdump 4.6.2-5+deb8u1 amd64 command-line network traffic analyzer
cumulus@switch:~$
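In the output above, the first column is a two-letter state: `ii` marks a package that is installed, while `un` marks one that is known but not installed. A small filter can keep only the installed entries. This is a sketch, and the function name `installed_packages` is hypothetical.

```shell
# Read `dpkg -l` output on stdin and print "name version" for every package
# whose state column is "ii" (selected for install and currently installed).
# Header lines and packages in other states are skipped.
installed_packages() {
    awk '$1 == "ii" { print $2, $3 }'
}
```

For example, `dpkg -l '*tcp*' | installed_packages` would reduce the listing above to the tcpdump line.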
Display the Version of a Package
To show the version of a specific package installed on the system, run the net show package version <package> command. For example, the following command shows which version of the vrf package is installed on the system:
cumulus@switch:~$ net show package version vrf
1.0-cl3u11
As an alternative to the NCLU command described above, you can run the Linux dpkg -l <package_name> command.
To see a list of all packages installed on the system with their versions, run the net show package version command. For example:
To add a new package, first ensure the package is not already installed on the system:
cumulus@switch:~$ dpkg -l | grep <name of package>
If the package is installed already, you can update the package from the Cumulus Linux repository as part of the package upgrade process, which upgrades all packages on the system. See Upgrade Packages above.
If the package is not already installed, add it by running sudo -E apt-get install <name of package>. This retrieves the package from the Cumulus Linux repository and installs it on your system together with any other packages on which this package might depend. The following example adds the tcpreplay package to the system:
cumulus@switch:~$ sudo -E apt-get install tcpreplay
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
tcpreplay
0 upgraded, 1 newly installed, 0 to remove and 1 not upgraded.
Need to get 436 kB of archives.
After this operation, 1008 kB of additional disk space will be used.
Get:1 https://repo.cumulusnetworks.com/ CumulusLinux-1.5/main tcpreplay amd64 4.6.2-5+deb8u1 [436 kB]
Fetched 436 kB in 0s (1501 kB/s)
Selecting previously unselected package tcpreplay.
(Reading database ... 15930 files and directories currently installed.)
Unpacking tcpreplay (from .../tcpreplay_4.6.2-5+deb8u1_amd64.deb) ...
Processing triggers for man-db ...
Setting up tcpreplay (4.6.2-5+deb8u1) ...
cumulus@switch:~$
You can install several packages at the same time:
In some cases, installing a new package might also upgrade additional existing packages due to dependencies. To view these additional packages before you install, run the `apt-get install --dry-run` command.
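The dry-run output can be long, so it can help to filter it down to the packages that are already installed and would change. The sketch below assumes the simulation line format that apt-get normally prints, `Inst <name> [<old-version>] (<new-version> <origin> ...)`, where the bracketed old version appears only for upgrades. The function name `upgraded_by_install` is hypothetical.

```shell
# Read `apt-get --dry-run` output on stdin and print one line for each
# already-installed package that the operation would upgrade:
#   name old-version -> new-version
# Newly installed packages (no [old-version] field) are skipped.
upgraded_by_install() {
    awk '$1 == "Inst" && $3 ~ /^\[/ {
        old = $3; new = $4
        gsub(/\[|\]/, "", old)   # strip the brackets around the old version
        sub(/^\(/, "", new)      # strip the opening parenthesis
        sub(/\)$/, "", new)      # and a closing one, if present
        print $2, old, "->", new
    }'
}
```

A typical use would be `sudo -E apt-get install --dry-run tcpreplay | upgraded_by_install`; empty output means no existing package would be upgraded.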
Add Packages from Another Repository
As shipped, Cumulus Linux searches the Cumulus Linux repository for available packages. You can add additional repositories to search by adding them to the list of sources that apt-get consults. See man sources.list for more information.
NVIDIA has added features or made bug fixes to certain packages; you must not replace these packages with versions from other repositories. Cumulus Linux is configured to ensure that the packages from the Cumulus Linux repository are always preferred over packages from other repositories.
If you want to install packages that are not in the Cumulus Linux repository, the procedure is the same as above, but with one additional step.
Packages that are not part of the Cumulus Linux Repository are not typically tested and might not be supported by Cumulus Linux Technical Support.
Installing packages outside of the Cumulus Linux repository requires the use of sudo -E apt-get; however, depending on the package, you can use easy_install and other commands.
To install a new package, complete the following steps:
Run the dpkg command to ensure that the package is not already installed on the system:
cumulus@switch:~$ dpkg -l | grep {name of package}
If the package is installed already, ensure it is the version you need. If it is an older version, update the package from the Cumulus Linux repository:
If the package is not on the system, the package source location is most likely not in the /etc/apt/sources.list file. If the source for the new package is not in sources.list, edit and add the appropriate source to the file. For example, add the following if you want a package from the Debian repository that is not in the Cumulus Linux repository:
deb http://http.us.debian.org/debian jessie main
deb http://security.debian.org/ jessie/updates main
Otherwise, the repository might be listed in `/etc/apt/sources.list` but is commented out, as can be the case with the early-access repository:
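To find entries that are present but commented out, you can search the file for commented `deb` lines. This is a minimal sketch with a hypothetical function name; it can also match an ordinary comment that happens to start with the word deb, so read the results before editing.

```shell
# Print, with line numbers, every repository entry that is present but
# commented out in an APT sources file (default: /etc/apt/sources.list).
commented_sources() {
    grep -nE '^[[:space:]]*#[[:space:]]*deb(-src)?[[:space:]]' \
        "${1:-/etc/apt/sources.list}"
}
```

To enable a repository found this way, remove the leading `#` from its line in a file editor, then run sudo -E apt-get update.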
NVIDIA provides a Supplemental Repository that contains third party applications commonly installed on switches.
The repository is provided for convenience only. You can download and use these applications; however, the applications in this repository are not tested, developed, certified, or supported by NVIDIA.
Below is a non-exhaustive list of some of the packages present in the repository:
| Package | Description |
| --- | --- |
| htop | Lets you view CPU, memory, and process information. |
| scamper | ECMP traceroute utility. |
| mtr | ECMP traceroute utility. |
| dhcpdump | Similar to TCPdump but focused only on DHCP traffic. |
| vim | Text editor. |
| fping | Provides a list of targets through textfile to check reachability. |
| scapy | Custom packet generator for testing. |
| bwm-ng | Real-time bandwidth monitor. |
| iftop | Real-time traffic monitor. |
| tshark | CLI version of wireshark. |
| nmap | Network scanning utility. |
| minicom | USB/Serial console utility that turns your switch into a terminal server (useful for out of band management switches to provide a console on the dataplane switches in the rack). |
| apt-cacher-ng | Caches packages for mirroring purposes. |
| iptraf | ncurses-based traffic visualization utility. |
| swatch | Monitors system activity. It reads a configuration file that contains patterns for which to search and actions to perform when each pattern is found. |
| dos2unix | Converts line endings from Windows to Unix. |
| fail2ban | Monitors log files (such as /var/log/auth.log and /var/log/apache/access.log) and temporarily or persistently bans the login of failure-prone IP addresses by updating existing firewall rules. This utility is not hardware accelerated on a Cumulus Linux switch, so only affects the control plane. |
To enable the Supplemental Repository:
In a file editor, open the /etc/apt/sources.list file.
man pages for apt-get, dpkg, sources.list, apt_preferences
Zero Touch Provisioning - ZTP
Zero touch provisioning (ZTP) enables you to deploy network devices quickly in large-scale environments. On first boot, Cumulus Linux invokes ZTP, which executes the provisioning automation used to deploy the device for its intended role in the network.
The provisioning framework allows for a one-time, user-provided script to be executed. You can develop this script using a variety of automation tools and scripting languages, providing ample flexibility for you to design the provisioning scheme to meet your needs. You can also use it to add the switch to a configuration management (CM) platform such as Puppet, Chef, CFEngine or possibly a custom, proprietary tool.
While developing and testing the provisioning logic, you can use the ztp command in Cumulus Linux to manually invoke your provisioning script on a device.
ZTP in Cumulus Linux can occur automatically in one of the following ways, in this order:
Through a local file
Using a USB drive inserted into the switch (ZTP-USB)
Through DHCP
Each method is discussed in greater detail below.
In Cumulus Linux 3.7.12, the default password for the cumulus user account has changed to cumulus. The first time you log into Cumulus Linux, you are required to change this default password. Be sure to update any automation scripts before you upgrade to Cumulus Linux 3.7.12.
Zero Touch Provisioning Using a Local File
ZTP only looks once for a ZTP script on the local file system when the switch boots. ZTP searches for an install script that matches an ONIE-style waterfall in /var/lib/cumulus/ztp, looking for the most specific name first, and ending at the most generic:
You can also trigger the ZTP process manually by running the ztp --run <URL> command, where the URL is the path to the ZTP script.
Zero Touch Provisioning Using a USB Drive (ZTP-USB)
This feature has been tested only with USB thumb drives, not with large external USB hard drives.
If the ztp process does not discover a local script, it tries once to locate an inserted but unmounted USB drive. If it discovers one, it begins the ZTP process.
Cumulus Linux supports the use of a FAT32, FAT16, or VFAT-formatted USB drive as an installation source for ZTP scripts. You must plug in the USB drive before you power up the switch.
At minimum, the script must:
Install the Cumulus Linux operating system and license.
Copy over a basic configuration to the switch.
Restart the switch or the relevant services to get switchd up and running with that configuration.
Follow these steps to perform zero touch provisioning using a USB drive:
Copy the Cumulus Linux license and installation image to the USB drive.
The ztp process searches the root filesystem of the newly mounted drive for filenames matching an ONIE-style waterfall (see the patterns and examples above), looking for the most specific name first, and ending at the most generic.
The contents of the script are parsed to ensure it contains the CUMULUS-AUTOPROVISIONING flag.
The USB drive is mounted to a temporary directory under /tmp (for example, /tmp/tmpigGgjf/). To reference files on the USB drive, use the environment variable ZTP_USB_MOUNTPOINT to refer to the USB root partition.
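Inside a ZTP script, a small helper can turn a file name into a full path on the drive and fail early if the variable is missing. This is a sketch: only ZTP_USB_MOUNTPOINT comes from the text above, while the function name `usb_file` and any file names you pass to it are hypothetical.

```shell
# Resolve a file on the USB drive from inside a ZTP script.
# ZTP_USB_MOUNTPOINT is set by the ztp process, as described above.
# Prints the full path, or fails if the variable or the file is missing.
usb_file() {
    if [ -z "${ZTP_USB_MOUNTPOINT:-}" ]; then
        echo "ZTP_USB_MOUNTPOINT is not set; not running from ZTP-USB?" >&2
        return 1
    fi
    # Drop one trailing slash, if any, so the joined path has a single separator.
    local path="${ZTP_USB_MOUNTPOINT%/}/$1"
    if [ ! -f "$path" ]; then
        echo "missing on USB drive: $1" >&2
        return 1
    fi
    printf '%s\n' "$path"
}
```

A script could then copy a configuration file with, for example, `cp "$(usb_file interfaces)" /etc/network/interfaces`, where `interfaces` is whatever name you gave the file on the drive.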
Zero Touch Provisioning over DHCP
If the ztp process does not discover a local/ONIE script or applicable USB drive, it checks DHCP every ten seconds for up to five minutes for the presence of a ZTP URL specified in /var/run/ztp.dhcp. The URL can be any of HTTP, HTTPS, FTP or TFTP.
For ZTP using DHCP, provisioning initially takes place over the management network and is initiated through a DHCP hook. A DHCP option is used to specify a configuration script. This script is then requested from the Web server and executed locally on the switch.
The zero touch provisioning process over DHCP follows these steps:
The first time you boot Cumulus Linux, eth0 is configured for DHCP and makes a DHCP request.
The DHCP server offers a lease to the switch.
If option 239 is present in the response, the zero touch provisioning process starts.
The zero touch provisioning process requests the contents of the script from the URL, sending additional HTTP headers containing details about the switch.
The contents of the script are parsed to ensure it contains the CUMULUS-AUTOPROVISIONING flag (see example scripts).
If provisioning is necessary, the script executes locally on the switch with root privileges.
The return code of the script is examined. If it is 0, the provisioning state is marked as complete in the autoprovisioning configuration file.
Trigger ZTP over DHCP
If provisioning has not already occurred, you can trigger the zero touch provisioning process over DHCP when eth0 is set to use DHCP and one of the following events occurs:
The switch boots.
You plug a cable into or unplug a cable from the eth0 port.
You disconnect, then reconnect the switch power cord.
You can also run the ztp --run <URL> command, where the URL is the path to the ZTP script.
Configure the DHCP Server
During the DHCP process over eth0, Cumulus Linux requests DHCP option 239. This option is used to specify the custom provisioning script.
For example, the /etc/dhcp/dhcpd.conf file for an ISC DHCP server looks like:
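A minimal sketch of such a fragment, wrapped in a shell function so that you can write it to a file for review before merging it into your server configuration. Only option code 239 and the script URL come from this guide; the option name cumulus-provision-url, the subnet, the address range, and the function name are illustrative assumptions.

```shell
#!/bin/bash
# Sketch only: subnet, range, and option name are placeholder assumptions.

# Write a sample ISC dhcpd.conf fragment to the path given as $1.
write_ztp_dhcp_fragment() {
    cat > "$1" <<'EOF'
# Declare option 239 as a text option; the option name is arbitrary.
option cumulus-provision-url code 239 = text;

subnet 192.0.2.0 netmask 255.255.255.0 {
    range 192.0.2.100 192.0.2.200;
    # URL of the ZTP script that the switch downloads and runs.
    option cumulus-provision-url "http://192.0.2.1/demo.sh";
}
EOF
}

# Example call (the output path is a placeholder):
# write_ztp_dhcp_fragment /tmp/dhcpd-ztp.conf
```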
The following HTTP headers are sent in the request to the webserver to retrieve the provisioning script:
Header                     Value             Example
------                     -----             -------
User-Agent                                   CumulusLinux-AutoProvision/0.4
CUMULUS-ARCH               CPU architecture  x86_64
CUMULUS-BUILD                                3.7.3-5c6829a-201309251712-final
CUMULUS-LICENSE-INSTALLED  Either 0 or 1     1
CUMULUS-MANUFACTURER                         odm
CUMULUS-PRODUCTNAME                          switch_model
CUMULUS-SERIAL                               XYZ123004
CUMULUS-BASE-MAC                             44:38:39:FF:40:94
CUMULUS-MGMT-MAC                             44:38:39:FF:00:00
CUMULUS-VERSION                              3.7.3
CUMULUS-PROV-COUNT                           0
CUMULUS-PROV-MAX                             32
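If your web server selects a script based on these headers, it can help to simulate the request from a workstation. The following is a sketch under stated assumptions: the function name fetch_as_switch is hypothetical, only a subset of the headers is sent, and the header values are the example values from the table above rather than values read from a real switch.

```shell
#!/bin/bash
# Sketch only: imitates a subset of the ZTP request headers with curl.

# Request the URL given as $1 while presenting ZTP-style headers.
fetch_as_switch() {
    local url="$1"
    curl -fsS \
        -A "CumulusLinux-AutoProvision/0.4" \
        -H "CUMULUS-ARCH: x86_64" \
        -H "CUMULUS-SERIAL: XYZ123004" \
        -H "CUMULUS-VERSION: 3.7.3" \
        -H "CUMULUS-PROV-COUNT: 0" \
        -H "CUMULUS-PROV-MAX: 32" \
        "$url"
}

# Example call:
# fetch_as_switch http://192.0.2.1/demo.sh
```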
Write ZTP Scripts
Remember to include the following line in any of the supported scripts that you expect to run using the autoprovisioning framework.
# CUMULUS-AUTOPROVISIONING
The line can appear anywhere in the script file, but the script does not execute without it.
You can include the CUMULUS-AUTOPROVISIONING flag in a comment or remark; the flag does not need to be echoed or written to stdout.
You can write the script in any language currently supported by Cumulus Linux, such as:
Perl
Python
Ruby
Shell
The script must return an exit code of 0 upon success, as this triggers the autoprovisioning process to be marked as complete in the autoprovisioning configuration file.
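Because a script without the flag is not executed, a quick check before you publish a script can save a provisioning cycle. The following is a minimal sketch; the function names has_ztp_flag and check_ztp_script are hypothetical and are not part of Cumulus Linux.

```shell
#!/bin/bash
# Sketch only: a pre-flight check to run on the server that hosts the script.

# Return 0 when the file contains the CUMULUS-AUTOPROVISIONING flag.
has_ztp_flag() {
    grep -q 'CUMULUS-AUTOPROVISIONING' "$1"
}

# Report whether the script given as $1 carries the flag.
check_ztp_script() {
    local script="$1"
    if has_ztp_flag "$script"; then
        echo "OK: ${script} contains the CUMULUS-AUTOPROVISIONING flag"
    else
        echo "ERROR: ${script} is missing the CUMULUS-AUTOPROVISIONING flag" >&2
        return 1
    fi
}

# Example call (the path is a placeholder):
# check_ztp_script /var/www/html/demo.sh
```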
The following script adds Debian repositories, loads the interface and port configuration and a license from a USB drive, then applies the configuration:
#!/bin/bash
function error() {
    echo -e "\e[0;33mERROR: The Zero Touch Provisioning script failed while running the command $BASH_COMMAND at line $BASH_LINENO.\e[0m" >&2
    exit 1
}
# Log all output from this script
exec >> /var/log/autoprovision 2>&1
date "+%FT%T ztp starting script $0"
trap error ERR
#Add Debian Repositories
echo "deb http://http.us.debian.org/debian jessie main" >> /etc/apt/sources.list
echo "deb http://security.debian.org/ jessie/updates main" >> /etc/apt/sources.list
#Update Package Cache
apt-get update -y
#Load interface config from usb
cp "${ZTP_USB_MOUNTPOINT}/interfaces" /etc/network/interfaces
#Load port config from usb
# (if breakout cables are used for certain interfaces)
cp "${ZTP_USB_MOUNTPOINT}/ports.conf" /etc/cumulus/ports.conf
#Install a License from usb and restart switchd
/usr/cumulus/bin/cl-license -i "${ZTP_USB_MOUNTPOINT}/license.txt" && systemctl restart switchd.service
#Reload interfaces to apply loaded config
ifreload -a
#Output state of interfaces
net show interface
# CUMULUS-AUTOPROVISIONING
exit 0
ZTP scripts come in different forms and frequently perform many of the same tasks. Because Bash is the most common language for ZTP scripts, the following Bash snippets show how to perform common tasks with robust error checking.
Install a License
Use the following function to include error checking for license file installation.
function install_license(){
    # Install license
    echo "$(date) INFO: Installing License..."
    echo "$1" | /usr/cumulus/bin/cl-license -i
    return_code=$?
    if [ "$return_code" == "0" ]; then
        echo "$(date) INFO: License Installed."
    else
        echo "$(date) ERROR: License not installed. Return code was: $return_code"
        /usr/cumulus/bin/cl-license
        exit 1
    fi
}
Change the Default Password
In Cumulus Linux 3.7.12, the default password for the cumulus user account has changed to cumulus. The first time you log into Cumulus Linux, you are now required to change this default password. You can use the following function to change the default password to CumulusLinux!:
function change_password(){
    # Change default cumulus user password
    echo "cumulus:CumulusLinux!" | chpasswd
}
Test DNS Name Resolution
DNS names are frequently used in ZTP scripts. The ping_until_reachable function tests that each DNS name resolves into a reachable IP address. Call this function with each DNS target used in your script before you use the DNS name elsewhere in your script.
The function is defined as follows; the release check described later in this section calls it as part of a larger task.
function ping_until_reachable(){
    last_code=1
    max_tries=30
    tries=0
    while [ "0" != "$last_code" ] && [ "$tries" -lt "$max_tries" ]; do
        tries=$((tries+1))
        echo "$(date) INFO: ( Attempt $tries of $max_tries ) Pinging $1 Target Until Reachable."
        ping -c2 "$1" &> /dev/null
        last_code=$?
        sleep 1
    done
    if [ "$tries" -eq "$max_tries" ] && [ "$last_code" -ne "0" ]; then
        echo "$(date) ERROR: Reached maximum number of attempts to ping the target $1 ."
        exit 1
    fi
}
Check the Cumulus Linux Release
The following script segment demonstrates how to check which Cumulus Linux release is currently running and upgrade the node if it is not the target release. If the release is the target release, normal ZTP tasks execute. This script calls the ping_until_reachable function (described above) to make sure the server holding the image and the ZTP script is reachable.
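A minimal sketch of such a segment, under several assumptions: the target release, hostname, image path, and ZTP URL are placeholders; the running release is assumed to be available as DISTRIB_RELEASE in /etc/lsb-release; the image is assumed to be staged with onie-install using the -f, -a, -i, and -z options; and ping_until_reachable is assumed to be defined earlier in the same script. The helper names current_release, needs_upgrade, init_ztp, and main are hypothetical.

```shell
#!/bin/bash
# Sketch only: all values below are placeholder assumptions.
CUMULUS_TARGET_RELEASE="3.7.16"
IMAGE_SERVER_HOSTNAME="webserver.example.com"
IMAGE_URL="http://${IMAGE_SERVER_HOSTNAME}/${CUMULUS_TARGET_RELEASE}.bin"
ZTP_URL="http://${IMAGE_SERVER_HOSTNAME}/ztp.sh"

# Print the running release from an lsb-release style file.
current_release() {
    local file="${1:-/etc/lsb-release}"
    grep '^DISTRIB_RELEASE=' "$file" | cut -d '=' -f2
}

# Return 0 when the running release ($1) differs from the target ($2).
needs_upgrade() {
    [ "$1" != "$2" ]
}

# Placeholder for the normal ZTP tasks.
init_ztp() {
    :
}

main() {
    local current
    current="$(current_release)"
    if needs_upgrade "$current" "$CUMULUS_TARGET_RELEASE"; then
        # Assumes ping_until_reachable is defined earlier in the script.
        ping_until_reachable "$IMAGE_SERVER_HOSTNAME"
        /usr/cumulus/bin/onie-install -fa -i "$IMAGE_URL" -z "$ZTP_URL" && reboot
    else
        init_ztp && reboot
    fi
}

# CUMULUS-AUTOPROVISIONING
# In a real ZTP script, call main as the last step:
# main
```

Keeping the comparison in its own function makes the decision easy to test off the switch, while the commands that change the system stay inside main.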
If you apply a management VRF in your script, either apply it last or reboot instead. If you do not apply a management VRF last, you need to prepend any commands that require eth0 to communicate out with /usr/bin/ip vrf exec mgmt; for example, /usr/bin/ip vrf exec mgmt apt-get update -y.
Perform Ansible Provisioning Callbacks
After initially configuring a node with ZTP, use Provisioning Callbacks to inform Ansible Tower or AWX that the node is ready for more detailed provisioning. The following example demonstrates how to use a provisioning callback:
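A minimal sketch of a callback request, assuming the job template is reachable at the /api/v2/job_templates/<id>/callback/ path and accepts a host_config_key in a JSON body. The host name, template ID, key, and the function name request_provisioning_callback are placeholders; substitute the values from your own Ansible Tower or AWX job template.

```shell
#!/bin/bash
# Sketch only: host, template ID, and key are placeholder assumptions.

# Ask Ansible Tower or AWX to run a job template against this node.
# Arguments: Tower host, job template ID, host config key.
request_provisioning_callback() {
    local tower_host="$1" template_id="$2" host_config_key="$3"
    curl -fsS -X POST \
        -H "Content-Type: application/json" \
        --data "{\"host_config_key\": \"${host_config_key}\"}" \
        "https://${tower_host}/api/v2/job_templates/${template_id}/callback/"
}

# Example call:
# request_provisioning_callback ansible.example.com 1111 somekey
```

If the server uses a self-signed certificate, curl rejects the connection by default; either install the CA certificate on the switch or pass the appropriate curl option for your environment.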
Make sure to disable the DHCP hostname override setting in your script (NCLU does this for you in Cumulus Linux 3.5 and above).
function set_hostname(){
    # Remove DHCP Setting of Hostname
    sed -i 's/SETHOSTNAME="yes"/SETHOSTNAME="no"/g' /etc/dhcp/dhclient-exit-hooks.d/dhcp-sethostname
    hostnamectl set-hostname "$1"
}
NCLU in ZTP Scripts
Not all aspects of NCLU are supported when running during ZTP. Use traditional Linux methods of providing configuration to the switch during ZTP.
Most notably, using the net del all command in a ZTP script sets zebra=yes in /etc/frr/daemons. This causes ZTP to fail.
When you use NCLU in ZTP scripts, add the following loop to make sure NCLU has time to start up before being called.
# Waiting for NCLU to finish starting up
last_code=1
while [ "1" == "$last_code" ]; do
    net show interface &> /dev/null
    last_code=$?
done
net add vrf mgmt
net add time zone Etc/UTC
net add time ntp server 192.168.0.254 iburst
net commit
Test ZTP Scripts
There are a few commands you can use to test and debug your ZTP scripts.
You can use verbose mode to debug your script and see where your script failed. Include the -v option when you run ztp:
cumulus@switch:~$ sudo ztp -v -r http://192.0.2.1/demo.sh
Attempting to provision via ZTP Manual from http://192.0.2.1/demo.sh
Broadcast message from root@dell-s6000-01 (ttyS0) (Tue May 10 22:44:17 2016):
ZTP: Attempting to provision via ZTP Manual from http://192.0.2.1/demo.sh
ZTP Manual: URL response code 200
ZTP Manual: Found Marker CUMULUS-AUTOPROVISIONING
ZTP Manual: Executing http://192.0.2.1/demo.sh
error: ZTP Manual: Payload returned code 1
error: Script returned failure
To see if ZTP is enabled and to see results of the most recent execution, you can run the ztp -s command.
cumulus@switch:~$ ztp -s
ZTP INFO:
State enabled
Version 1.0
Result Script Failure
Date Tue May 10 22:42:09 2016 UTC
Method ZTP DHCP
URL http://192.0.2.1/demo.sh
If ZTP ran at boot rather than manually, run systemctl -l status ztp.service and then journalctl -l -u ztp.service to check for failures:
cumulus@switch:~$ sudo systemctl -l status ztp.service
● ztp.service - Cumulus Linux ZTP
Loaded: loaded (/lib/systemd/system/ztp.service; enabled)
Active: failed (Result: exit-code) since Wed 2016-05-11 16:38:45 UTC; 1min 47s ago
Docs: man:ztp(8)
Process: 400 ExecStart=/usr/sbin/ztp -b (code=exited, status=1/FAILURE)
Main PID: 400 (code=exited, status=1/FAILURE)
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP USB: Device not found
May 11 16:38:45 dell-s6000-01 ztp[400]: ztp [400]: ZTP DHCP: Looking for ZTP Script provided by DHCP
May 11 16:38:45 dell-s6000-01 ztp[400]: ztp [400]: Attempting to provision via ZTP DHCP from http://192.0.2.1/demo.sh
May 11 16:38:45 dell-s6000-01 ztp[400]: ztp [400]: ZTP DHCP: URL response code 200
May 11 16:38:45 dell-s6000-01 ztp[400]: ztp [400]: ZTP DHCP: Found Marker CUMULUS-AUTOPROVISIONING
May 11 16:38:45 dell-s6000-01 ztp[400]: ztp [400]: ZTP DHCP: Executing http://192.0.2.1/demo.sh
May 11 16:38:45 dell-s6000-01 ztp[400]: ztp [400]: ZTP DHCP: Payload returned code 1
May 11 16:38:45 dell-s6000-01 ztp[400]: ztp [400]: Script returned failure
May 11 16:38:45 dell-s6000-01 systemd[1]: ztp.service: main process exited, code=exited, status=1/FAILURE
May 11 16:38:45 dell-s6000-01 systemd[1]: Unit ztp.service entered failed state.
cumulus@switch:~$
cumulus@switch:~$ sudo journalctl -l -u ztp.service --no-pager
-- Logs begin at Wed 2016-05-11 16:37:42 UTC, end at Wed 2016-05-11 16:40:39 UTC. --
May 11 16:37:45 cumulus ztp[400]: ztp [400]: /var/lib/cumulus/ztp: State Directory does not exist. Creating it...
May 11 16:37:45 cumulus ztp[400]: ztp [400]: /var/run/ztp.lock: Lock File does not exist. Creating it...
May 11 16:37:45 cumulus ztp[400]: ztp [400]: /var/lib/cumulus/ztp/ztp_state.log: State File does not exist. Creating it...
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Looking for ZTP local Script
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell_s6000_s1220-rUNKNOWN
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell_s6000_s1220
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP USB: Looking for unmounted USB devices
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP USB: Parsing partitions
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP USB: Device not found
May 11 16:38:45 dell-s6000-01 ztp[400]: ztp [400]: ZTP DHCP: Looking for ZTP Script provided by DHCP
May 11 16:38:45 dell-s6000-01 ztp[400]: ztp [400]: Attempting to provision via ZTP DHCP from http://192.0.2.1/demo.sh
May 11 16:38:45 dell-s6000-01 ztp[400]: ztp [400]: ZTP DHCP: URL response code 200
May 11 16:38:45 dell-s6000-01 ztp[400]: ztp [400]: ZTP DHCP: Found Marker CUMULUS-AUTOPROVISIONING
May 11 16:38:45 dell-s6000-01 ztp[400]: ztp [400]: ZTP DHCP: Executing http://192.0.2.1/demo.sh
May 11 16:38:45 dell-s6000-01 ztp[400]: ztp [400]: ZTP DHCP: Payload returned code 1
May 11 16:38:45 dell-s6000-01 ztp[400]: ztp [400]: Script returned failure
May 11 16:38:45 dell-s6000-01 systemd[1]: ztp.service: main process exited, code=exited, status=1/FAILURE
May 11 16:38:45 dell-s6000-01 systemd[1]: Unit ztp.service entered failed state.
Instead of running journalctl, you can see the log history by running:
cumulus@switch:~$ cat /var/log/syslog | grep ztp
2016-05-11T16:37:45.132583+00:00 cumulus ztp [400]: /var/lib/cumulus/ztp: State Directory does not exist. Creating it...
2016-05-11T16:37:45.134081+00:00 cumulus ztp [400]: /var/run/ztp.lock: Lock File does not exist. Creating it...
2016-05-11T16:37:45.135360+00:00 cumulus ztp [400]: /var/lib/cumulus/ztp/ztp_state.log: State File does not exist. Creating it...
2016-05-11T16:37:45.185598+00:00 cumulus ztp [400]: ZTP LOCAL: Looking for ZTP local Script
2016-05-11T16:37:45.485084+00:00 cumulus ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell_s6000_s1220-rUNKNOWN
2016-05-11T16:37:45.486394+00:00 cumulus ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell_s6000_s1220
2016-05-11T16:37:45.488385+00:00 cumulus ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell
2016-05-11T16:37:45.489665+00:00 cumulus ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64
2016-05-11T16:37:45.490854+00:00 cumulus ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp
2016-05-11T16:37:45.492296+00:00 cumulus ztp [400]: ZTP USB: Looking for unmounted USB devices
2016-05-11T16:37:45.493525+00:00 cumulus ztp [400]: ZTP USB: Parsing partitions
2016-05-11T16:37:45.636422+00:00 cumulus ztp [400]: ZTP USB: Device not found
2016-05-11T16:38:43.372857+00:00 cumulus ztp [1805]: Found ZTP DHCP Request
2016-05-11T16:38:45.696562+00:00 cumulus ztp [400]: ZTP DHCP: Looking for ZTP Script provided by DHCP
2016-05-11T16:38:45.698598+00:00 cumulus ztp [400]: Attempting to provision via ZTP DHCP from http://192.0.2.1/demo.sh
2016-05-11T16:38:45.816275+00:00 cumulus ztp [400]: ZTP DHCP: URL response code 200
2016-05-11T16:38:45.817446+00:00 cumulus ztp [400]: ZTP DHCP: Found Marker CUMULUS-AUTOPROVISIONING
2016-05-11T16:38:45.818402+00:00 cumulus ztp [400]: ZTP DHCP: Executing http://192.0.2.1/demo.sh
2016-05-11T16:38:45.834240+00:00 cumulus ztp [400]: ZTP DHCP: Payload returned code 1
2016-05-11T16:38:45.835488+00:00 cumulus ztp [400]: Script returned failure
2016-05-11T16:38:45.876334+00:00 cumulus systemd[1]: ztp.service: main process exited, code=exited, status=1/FAILURE
2016-05-11T16:38:45.879410+00:00 cumulus systemd[1]: Unit ztp.service entered failed state.
If you see that the issue is a script failure, you can modify the script and then run ztp manually using ztp -v -r <URL/path to that script>, as above.
cumulus@switch:~$ sudo ztp -v -r http://192.0.2.1/demo.sh
Attempting to provision via ZTP Manual from http://192.0.2.1/demo.sh
Broadcast message from root@dell-s6000-01 (ttyS0) (Tue May 10 22:44:17 2016):
ZTP: Attempting to provision via ZTP Manual from http://192.0.2.1/demo.sh
ZTP Manual: URL response code 200
ZTP Manual: Found Marker CUMULUS-AUTOPROVISIONING
ZTP Manual: Executing http://192.0.2.1/demo.sh
error: ZTP Manual: Payload returned code 1
error: Script returned failure
cumulus@switch:~$ sudo ztp -s
State enabled
Version 1.0
Result Script Failure
Date Tue May 10 22:44:17 2016 UTC
Method ZTP Manual
URL http://192.0.2.1/demo.sh
Use the following command to check syslog for information about ZTP:
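A minimal sketch of such a check, assuming ZTP messages are written to /var/log/syslog as in the output shown earlier. The function name ztp_syslog_entries is hypothetical; on the switch you might need sudo to read the log.

```shell
#!/bin/bash
# Sketch only: filters ZTP-related lines out of a syslog-style file.

# Print every line that mentions ztp (case-insensitive).
# $1 is the log file; defaults to /var/log/syslog.
ztp_syslog_entries() {
    grep -i 'ztp' "${1:-/var/log/syslog}"
}

# Example call:
# ztp_syslog_entries /var/log/syslog
```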
Errors in syslog for ZTP like those shown above often occur if the script is created (or edited at some point) on a Windows machine. Check to make sure that the \r\n characters are not present in the end-of-line encodings.
Use the cat -v ztp.sh command to view the contents of the script and search for any hidden characters.
root@oob-mgmt-server:/var/www/html# cat -v ./ztp_oob_windows.sh
#!/bin/bash^M
^M
###################^M
# ZTP Script^M
###################^M
^M
/usr/cumulus/bin/cl-license -i http://192.168.0.254/license.txt^M
^M
# Clean method of performing a Reboot^M
nohup bash -c 'sleep 2; shutdown now -r "Rebooting to Complete ZTP"' &^M
^M
exit 0^M
^M
# The line below is required to be a valid ZTP script^M
#CUMULUS-AUTOPROVISIONING^M
root@oob-mgmt-server:/var/www/html#
The ^M characters in the cat -v output, as shown above, indicate the presence of Windows end-of-line encodings that you need to remove.
Use the translate (tr) command on any Linux system to remove the '\r' characters from the file.
root@oob-mgmt-server:/var/www/html# tr -d '\r' < ztp_oob_windows.sh > ztp_oob_unix.sh
root@oob-mgmt-server:/var/www/html# cat -v ./ztp_oob_unix.sh
#!/bin/bash
###################
# ZTP Script
###################
/usr/cumulus/bin/cl-license -i http://192.168.0.254/license.txt
# Clean method of performing a Reboot
nohup bash -c 'sleep 2; shutdown now -r "Rebooting to Complete ZTP"' &
exit 0
# The line below is required to be a valid ZTP script
#CUMULUS-AUTOPROVISIONING
root@oob-mgmt-server:/var/www/html#
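The detection and cleanup steps above can be combined into two small helpers that you run on the server hosting the script. This is a sketch; the function names has_crlf and strip_crlf are hypothetical.

```shell
#!/bin/bash
# Sketch only: detect and remove Windows line endings in a ZTP script.

# Return 0 when the file given as $1 contains a carriage return.
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Copy the file $1 to $2 with every carriage return removed.
strip_crlf() {
    tr -d '\r' < "$1" > "$2"
}

# Example calls (file names follow the example above):
# has_crlf ztp_oob_windows.sh && strip_crlf ztp_oob_windows.sh ztp_oob_unix.sh
```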
Manually Use the ztp Command
To enable zero touch provisioning, use the -e option:
cumulus@switch:~$ sudo ztp -e
Enabling ztp means that ztp tries to run the next time the switch boots. However, if ZTP already ran on a previous boot or if it finds a manual configuration, ZTP exits without looking for a script.
ZTP checks for these manual configurations during bootup:
Password changes
User and group changes
Package changes
Interface changes
The presence of an installed license
When the switch boots for the very first time, ZTP records the state of important files that are most likely to be modified after the switch is configured. If ZTP is still enabled after a reboot, ZTP compares the recorded state to the current state of these files. If they do not match, ZTP considers that the switch has already been provisioned and exits. These files are only erased after a reset.
To reset ztp to its original state, use the -R option and the -i option. This removes the ztp directory and ztp runs the next time the switch reboots.
To disable zero touch provisioning, use the -d option:
cumulus@switch:~$ sudo ztp -d
To force provisioning to occur and ignore the status listed in the configuration file, use the -r option:
cumulus@switch:~$ sudo ztp -r cumulus-ztp.sh
To see the current ztp state, use the -s option:
cumulus@switch:~$ sudo ztp -s
ZTP INFO:
State disabled
Version 1.0
Result success
Date Thu May 5 16:49:33 2016 UTC
Method Switch manually configured
URL None
In Cumulus Linux 3.7.11 and later, you can run the NCLU net show system ztp script or net show system ztp json command to see the current ztp state.
Notes
During the development of a provisioning script, the switch might need to be rebooted.
You can use the Cumulus Linux onie-select -i command to cause the switch to reprovision itself and install a network operating system again using ONIE.
Network Command Line Utility - NCLU
The Network Command Line Utility (NCLU) is a command line interface that simplifies the networking configuration process.
NCLU resides in the Linux user space and provides consistent access to
networking commands directly through bash, making configuration and
troubleshooting simple and easy; no need to edit files or enter modes
and sub-modes. NCLU provides these benefits:
Embeds help, examples, and automatic command checking with
suggestions in case you enter a typo.
Runs directly from and integrates with bash, while being
interoperable with the regular way of accessing underlying
configuration files.
Configures dependent features automatically so that you don’t have to.
The NCLU wrapper utility called net is capable of configuring layer 2
and layer 3 features of the networking stack, installing ACLs and
VXLANs, rolling back and deleting snapshots, as well as providing
monitoring and troubleshooting functionality for these features. You can
configure both the /etc/network/interfaces and /etc/frr/frr.conf
files with net, in addition to running show and clear commands related
to ifupdown2 and FRRouting.
If you use automation to configure your switches, NVIDIA recommends that
you do not use NCLU. Edit configuration files directly.
Install NCLU
If you upgraded Cumulus Linux from a version earlier than 3.2 instead of
performing a full disk image install, you need to install the nclu
package on your switch:
The nclu package installs a new bash completion script and displays
the following message:
Setting up nclu (1.0-cl3u3) ...
To enable the newly installed bash completion for nclu in this shell, execute...
source /etc/bash_completion
NCLU Basics
Use the following workflow to stage and commit changes to Cumulus Linux
with NCLU:
Use the net add and net del commands to stage and remove
configuration changes.
Use the net pending command to review staged changes.
Use net commit and net abort to commit and delete staged
changes.
net commit applies the changes to the relevant configuration files,
such as /etc/network/interfaces, then runs necessary follow on
commands to enable the configuration, such as ifreload -a.
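The stage, review, and commit workflow can be sketched as a small Bash function. It uses only net subcommands that appear in this guide (net add interface <interface> mtu, net pending, net commit, net abort); the function name set_interface_mtu is hypothetical.

```shell
#!/bin/bash
# Sketch only: stage one change, show it, then commit it.

# Arguments: interface name, MTU value.
set_interface_mtu() {
    local interface="$1" mtu="$2"
    # Stage the change; discard the commit buffer if staging fails.
    if ! net add interface "$interface" mtu "$mtu"; then
        net abort
        return 1
    fi
    # Show what is staged, then apply it.
    net pending
    net commit
}

# Example call:
# set_interface_mtu swp10 9216
```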
If two different users try to commit a change at the same time, NCLU
displays a warning but implements the change according to the first
commit received. The second user will need to abort the commit.
If you provision a new switch without setting the system clock (manually
or with NTP or PTP), the NCLU net commit command fails when the system
clock is earlier than the modification date of configuration files. Make
sure to set the system clock on the switch.
When you have a running configuration, you can review and update the
configuration with the following commands:
net show is a series of commands for viewing various parts of the
network configuration. For example, use net show configuration
to view the complete network configuration, net show commit history to view a history of commits using NCLU, and
net show bgp to view BGP status.
net clear provides a way to clear net show counters, BGP and
OSPF neighbor content, and more.
net rollback provides a mechanism to revert to an earlier
configuration.
net commit confirm requires you to press Enter to commit changes
using NCLU. If you run net commit confirm but do not press Enter
within 10 seconds, the commit automatically reverts and no changes
are made.
net commit description <description> enables you to provide a
descriptive summary of the changes you are about to commit.
net commit permanent retains the snapshot taken when committing the
change. Otherwise, the snapshots created from NCLU commands are
cleaned up periodically with a snapper cron job.
net commit delete deletes one or more snapshots created when
committing changes with NCLU.
net del all deletes all configurations.
The net del all command does not remove
management VRF configurations; NCLU
does not interact with eth0 interfaces and management VRF.
Tab Completion, Verification, and Inline Help
In addition to tab completion and partial keyword command
identification, NCLU includes verification checks to ensure correct
syntax is used. The examples below show the output for incorrect
commands:
cumulus@switch:~$ net add bgp router-id 1.1.1.1/32
ERROR: Command not found
Did you mean one of the following?
net add bgp router-id <ipv4>
This command is looking for an IP address, not an IP/prefixlen
cumulus@switch:~$ net add bgp router-id 1.1.1.1
cumulus@switch:~$ net add int swp10 mtu <TAB>
<552-9216> :
cumulus@switch:~$ net add int swp10 mtu 9300
ERROR: Command not found
Did you mean one of the following?
net add interface <interface> mtu <552-9216>
NCLU has a comprehensive built-in help system. In addition to the net
man page, you can use ? and help to display available commands:
cumulus@switch:~$ net help
Usage:
# net <COMMAND> [<ARGS>] [help]
#
# net is a command line utility for networking on Cumulus Linux switches.
#
# COMMANDS are listed below and have context specific arguments which can
# be explored by typing "<TAB>" or "help" anytime while using net.
#
# Use 'man net' for a more comprehensive overview.
net abort
net commit [verbose] [confirm] [description <wildcard>]
net commit delete (<number>|<number-range>)
net help [verbose]
net pending
net rollback (<number>|last)
net show commit (history|<number>|<number-range>|last)
net show rollback (<number>|last)
net show configuration [commands|files|acl|bgp|ospf|ospf6|interface <interface>]
Options:
# Help commands
help : context sensitive information; see section below
example : detailed examples of common workflows
# Configuration commands
add : add/modify configuration
del : remove configuration
# Commit buffer commands
abort : abandon changes in the commit buffer
commit : apply the commit buffer to the system
pending : show changes staged in the commit buffer
rollback : revert to a previous configuration state
# Status commands
show : show command output
clear : clear counters, BGP neighbors, etc
cumulus@switch:~$ net help bestpath
The following commands contain keyword(s) 'bestpath'
net (add|del) bgp bestpath as-path multipath-relax [as-set|no-as-set]
net (add|del) bgp bestpath compare-routerid
net (add|del) bgp bestpath med missing-as-worst
net (add|del) bgp vrf <text> bestpath as-path multipath-relax [as-set|no-as-set]
net (add|del) bgp vrf <text> bestpath compare-routerid
net (add|del) bgp vrf <text> bestpath med missing-as-worst
net add bgp debug bestpath <ip/prefixlen>
net del bgp debug bestpath [<ip/prefixlen>]
net show bgp (<ipv4>|<ipv4/prefixlen>) [bestpath|multipath] [json]
net show bgp (<ipv6>|<ipv6/prefixlen>) [bestpath|multipath] [json]
net show bgp vrf <text> (<ipv4>|<ipv4/prefixlen>) [bestpath|multipath] [json]
You can configure multiple interfaces at once:
cumulus@switch:~$ net add int swp7-9,12,15-17,22 mtu 9216
Search for Specific Commands
To search for specific NCLU commands so that you can identify the correct syntax to use, run the net help verbose | grep <term> command. For example, to show only commands that include clag (for MLAG):
cumulus@leaf01:mgmt:~$ net help verbose | grep clag
net example clag basic-clag
net example clag l2-with-server-vlan-trunks
net example clag l3-uplinks-virtual-address
net add clag peer sys-mac <mac-clag> interface <interface> (primary|secondary) [backup-ip <ipv4>]
net add clag peer sys-mac <mac-clag> interface <interface> (primary|secondary) [backup-ip <ipv4> vrf <text>]
net del clag peer
net add clag port bond <interface> interface <interface> clag-id <0-65535>
net del clag port bond <interface>
net show clag [our-macs|our-multicast-entries|our-multicast-route|our-multicast-router-ports|peer-macs|peer-multicast-entries|peer-multicast-route|peer-multicast-router-ports|params|backup-ip|id] [verbose] [json]
net show clag macs [<mac>] [json]
net show clag neighbors [verbose]
net show clag peer-lacp-rate
net show clag verify-vlans [verbose]
net show clag status [verbose] [json]
net add bond <interface> clag id <0-65535>
net add interface <interface> clag args <wildcard>
net add interface <interface> clag backup-ip (<ipv4>|<ipv4> vrf <text>)
net add interface <interface> clag enable (yes|no)
net add interface <interface> clag peer-ip (<ipv4>|<ipv6>|linklocal)
net add interface <interface> clag priority <0-65535>
net add interface <interface> clag sys-mac <mac>
net add loopback lo clag vxlan-anycast-ip <ipv4>
net del bond <interface> clag id [<0-65535>]
net del interface <interface> clag args [<wildcard>]
...
Add ? (Question Mark) Ability to NCLU
While tab completion is enabled by default, you can also configure NCLU
to use the ? (question mark character) to look at available
commands. To enable this feature for the cumulus user, open the
following file:
cumulus@leaf01:~$ sudo nano ~/.inputrc
Uncomment the very last line in the .inputrc file so that the file
changes from this:
# Uncomment to use ? as an alternative to
# ?: complete
to this:
# Uncomment to use ? as an alternative to
?: complete
Save the file and reconnect to the switch. The ? (question mark) ability
will work on all subsequent sessions on the switch.
cumulus@leaf01:~$ net
abort : abandon changes in the commit buffer
add : add/modify configuration
clear : clear counters, BGP neighbors, etc
commit : apply the commit buffer to the system
del : remove configuration
example : detailed examples of common workflows
help : Show this screen and exit
pending : show changes staged in the commit buffer
rollback : revert to a previous configuration state
show : show command output
When the question mark is typed, NCLU autocompletes and shows all
available options, but the question mark does not actually appear on the
terminal. This is normal, expected behavior.
Built-In Examples
NCLU has a number of built-in examples to guide users through basic
configuration setup:
cumulus@switch:~$ net example
acl : access-list
bgp : Border Gateway Protocol
bond : Bond, port-channel, etc
bridge : A layer2 bridge
clag : Multi-Chassis Link Aggregation
dot1x : Configure, Enable, Delete or Show IEEE 802.1X EAPOL
link-settings : Physical link parameters
lnv : Lightweight Network Virtualization
management-vrf : Management VRF
mlag : Multi-Chassis Link Aggregation
ospf : Open Shortest Path First (OSPFv2)
vlan-interfaces : IP interfaces for VLANs
cumulus@switch:~$ net example bridge
Scenario
========
We are configuring switch1 and would like to configure the following
- configure switch1 as an L2 switch for host-11 and host-12
- enable vlans 10-20
- place host-11 in vlan 10
- place host-12 in vlan 20
- create an SVI interface for vlan 10
- create an SVI interface for vlan 20
- assign IP 10.0.0.1/24 to the SVI for vlan 10
- assign IP 20.0.0.1/24 to the SVI for vlan 20
- configure swp3 as a trunk for vlans 10, 11, 12 and 20
swp3
*switch1 --------- switch2
/\
swp1 / \ swp2
/ \
/ \
host-11 host-12
switch1 net commands
====================
- enable vlans 10-20
switch1# net add vlan 10-20
- place host-11 in vlan 10
- place host-12 in vlan 20
switch1# net add int swp1 bridge access 10
switch1# net add int swp2 bridge access 20
- create an SVI interface for vlan 10
- create an SVI interface for vlan 20
- assign IP 10.0.0.1/24 to the SVI for vlan 10
- assign IP 20.0.0.1/24 to the SVI for vlan 20
switch1# net add vlan 10 ip address 10.0.0.1/24
switch1# net add vlan 20 ip address 20.0.0.1/24
- configure swp3 as a trunk for vlans 10, 11, 12 and 20
switch1# net add int swp3 bridge trunk vlans 10-12,20
# Review and commit changes
switch1# net pending
switch1# net commit
Verification
============
switch1# net show interface
switch1# net show bridge macs
Configure User Accounts
You can configure user accounts
in Cumulus Linux with read-only or edit permissions for NCLU:
You create user accounts with read-only permissions for NCLU by
adding them to the netshow group. A user in the netshow group
can run NCLU net show commands, such as net show interface or
net show config, and certain general Linux commands, such as ls,
cd or man, but cannot run net add, net del or net commit
commands.
You create user accounts with edit permissions for NCLU by
adding them to the netedit group. A user in the netedit group
can run NCLU configuration commands, such as net add, net del or
net commit, in addition to NCLU net show commands.
The examples below demonstrate how to add a new user account or modify
an existing user account called myuser.
To add a new user account with NCLU show permissions:
cumulus@switch:~$ sudo adduser --ingroup netshow myuser
Adding user `myuser' ...
Adding new user `myuser' (1001) with group `netshow' ...
To add NCLU show permissions to a user account that already exists:
cumulus@switch:~$ sudo addgroup myuser netshow
Adding user `myuser' to group `netshow' ...
Adding user myuser to group netshow
Done
To add a new user account with NCLU edit permissions:
cumulus@switch:~$ sudo adduser --ingroup netedit myuser
Adding user `myuser' ...
Adding new user `myuser' (1001) with group `netedit' ...
To add NCLU edit permissions to a user account that already exists:
cumulus@switch:~$ sudo addgroup myuser netedit
Adding user `myuser' to group `netedit' ...
Adding user myuser to group netedit
Done
You can use the adduser command for local user accounts only. You can
use the addgroup command for both local and remote user accounts. For
a remote user account, you must use the mapping username, such as
tacacs3 or radius_user, not the TACACS+
or RADIUS account name.
If the user tries to run commands that are not allowed, the following
error displays:
myuser@switch:~$ net add hostname host01
ERROR: User username does not have permission to make networking changes.
Edit the netd.conf File
Instead of using the NCLU commands described above, you can manually
configure users and groups to be able to run NCLU commands.
Edit the /etc/netd.conf file to add users to the users_with_edit
and users_with_show lines in the file, then save the file.
For example, if you want the user netoperator to be able to run both
edit and show commands, add the user to the users_with_edit and
users_with_show lines in the /etc/netd.conf file:
cumulus@switch:~$ sudo nano /etc/netd.conf
# Control which users/groups are allowed to run 'add', 'del',
# 'clear', 'net abort', 'net commit' and restart services
# to apply those changes
users_with_edit = root, cumulus, netoperator
groups_with_edit = netedit
# Control which users/groups are allowed to run 'show' commands
users_with_show = root, cumulus, netoperator
groups_with_show = netshow, netedit
To configure a new user group to use NCLU, add that group to the
groups_with_edit and groups_with_show lines in the file.
Use caution giving edit permissions to groups. For example, don’t give
edit permissions to the
tacacs group.
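To make the effect of these four lines concrete, the following Python sketch parses them and reports which NCLU command classes a given user can run. It is an illustration only, not how netd itself evaluates the file; the function names parse_netd_acl and permissions are hypothetical, and the user's group list is passed in as an argument rather than looked up on the system.

```python
def parse_netd_acl(text):
    """Collect the users_with_* and groups_with_* lines into sets of names."""
    acl = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith(("users_with_", "groups_with_")):
            acl[key] = {name.strip() for name in value.split(",") if name.strip()}
    return acl


def permissions(acl, user, groups):
    """Return the permission classes ('edit', 'show') granted to a user."""
    granted = set()
    for level in ("edit", "show"):
        by_user = user in acl.get("users_with_" + level, set())
        by_group = bool(set(groups) & acl.get("groups_with_" + level, set()))
        if by_user or by_group:
            granted.add(level)
    return granted


# Sample content copied from the netd.conf example above.
SAMPLE = """
users_with_edit = root, cumulus, netoperator
groups_with_edit = netedit
users_with_show = root, cumulus, netoperator
groups_with_show = netshow, netedit
"""

acl = parse_netd_acl(SAMPLE)
print(sorted(permissions(acl, "netoperator", [])))       # listed by name
print(sorted(permissions(acl, "myuser", ["netshow"])))   # granted through a group
```

Passing the groups explicitly keeps the sketch independent of local or remote (TACACS+, RADIUS) account lookups.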
Restart the netd Service
Whenever you modify netd.conf or NSS services change, you must restart
the netd service for the changes to take effect:
cumulus@switch:~$ sudo systemctl restart netd.service
You can easily back up your NCLU configuration to a file by outputting
the results of net show configuration commands to a file, then
retrieving the contents of the file using the source command. You can
then view the configuration at any time or copy it to other switches and
use the source command to apply that configuration to those switches.
For example, to copy the configuration of a leaf switch called leaf01,
run the following command:
cumulus@leaf01:~$ net show configuration commands >> leaf01.txt
With the commands all stored in a single file, you can now copy this
file to another ToR switch in your network called leaf02 and apply the
configuration by running:
cumulus@leaf02:~$ source leaf01.txt
Advanced Configuration
NCLU needs no initial configuration; however, if you need to modify its
configuration, you must manually update the /etc/netd.conf file. You
can configure this file to allow different permission levels for users
to edit configurations and run show commands. The file also contains a
blacklist that hides less frequently used terms from the tabbed
autocomplete.
After you edit the netd.conf file, restart the netd service for the
changes to take effect.
The blacklist hides corner-case command options from tab completion to simplify and streamline the output.
Net Tab Complete Output
net provides an environment variable to set where the net output is
directed. To only use stdout, set the NCLU_TAB_STDOUT environment
variable to true. The value is not case sensitive.
Caveats and Errata
Unsupported Interface Names
NCLU does not support interfaces named dev.
Bonds With No Configured Members
If a bond interface is configured but contains no members, NCLU reports that the interface does not exist.
Large NCLU Inputs
Each NCLU command must be parsed by the system. Large inputs, such as a large paste of NCLU commands, can take some time to process, sometimes minutes.
Setting Date and Time
Setting the time zone, date and time requires root privileges; use
sudo.
Set the Time Zone
You can use one of two methods to set the time zone on the switch:
Edit the /etc/timezone file.
Use the guided wizard.
Edit the /etc/timezone File
To see the current time zone, list the contents of /etc/timezone:
cumulus@switch:~$ cat /etc/timezone
US/Eastern
Edit the file to add your desired time zone. A list of valid time zones
can be found at the following
link.
Use the following command to apply the new time zone immediately.
cumulus@switch:~$ sudo dpkg-reconfigure --frontend noninteractive tzdata
To set the time zone using the guided wizard, run
dpkg-reconfigure tzdata as root:
cumulus@switch:~$ sudo dpkg-reconfigure tzdata
Then navigate the menus to enable the time zone you want. The following
example selects the US/Pacific time zone:
cumulus@switch:~$ sudo dpkg-reconfigure tzdata
Configuring tzdata
------------------
Please select the geographic area in which you live. Subsequent configuration
questions will narrow this down by presenting a list of cities, representing
the time zones in which they are located.
1. Africa 4. Australia 7. Atlantic 10. Pacific 13. Etc
2. America 5. Arctic 8. Europe 11. SystemV
3. Antarctica 6. Asia 9. Indian 12. US
Geographic area: 12
Please select the city or region corresponding to your time zone.
1. Alaska 4. Central 7. Indiana-Starke 10. Pacific
2. Aleutian 5. Eastern 8. Michigan 11. Pacific-New
3. Arizona 6. Hawaii 9. Mountain 12. Samoa
Time zone: 10
Current default time zone: 'US/Pacific'
Local time is now: Mon Jun 17 09:27:45 PDT 2013.
Universal Time is now: Mon Jun 17 16:27:45 UTC 2013.
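The two times printed at the end of the wizard output differ by the UTC offset of the selected zone. As a quick check, the following Python sketch converts the local time to UTC. It assumes a fixed offset of seven hours behind UTC for PDT, which holds for this timestamp but ignores daylight saving rules.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PDT is modeled as a fixed UTC-7 offset for this one timestamp.
PDT = timezone(timedelta(hours=-7), "PDT")

local = datetime(2013, 6, 17, 9, 27, 45, tzinfo=PDT)
utc = local.astimezone(timezone.utc)

print(local.strftime("%H:%M:%S %Z"))  # 09:27:45 PDT
print(utc.strftime("%H:%M:%S %Z"))    # 16:27:45 UTC
```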
The switch contains a battery-backed hardware clock that maintains the
time while the switch is powered off and in between reboots. When the
switch is running, the Cumulus Linux operating system maintains its own
software clock.
During boot up, the time from the hardware clock is copied into the
operating system’s software clock. The software clock is then used for
all timekeeping responsibilities. During system shutdown, the software
clock is copied back to the battery-backed hardware clock.
You can set the date and time on the software clock using the date
command. First, determine your current time zone:
cumulus@switch$ date +%Z
If you need to reconfigure the current time zone, refer to the
instructions above.
Then, to set the system clock according to the time zone configured:
cumulus@switch$ sudo date -s "Tue Jan 12 00:37:13 2016"
See man date(1) for more information.
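If you generate the date string from a script, the format matters. The following Python sketch renders a timestamp in the same layout as the date -s example above; the helper name date_set_argument is hypothetical, and the sketch only builds the command string without running it.

```python
from datetime import datetime

# Layout of the `date -s` example above: weekday, month, day, time, year.
DATE_FORMAT = "%a %b %d %H:%M:%S %Y"


def date_set_argument(moment):
    """Render a datetime as the quoted string passed to `date -s`."""
    return moment.strftime(DATE_FORMAT)


arg = date_set_argument(datetime(2016, 1, 12, 0, 37, 13))
print('sudo date -s "{}"'.format(arg))
```

The weekday and month names depend on the process locale; the sketch relies on the default C locale that Python starts with.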
You can write the current value of the system (software) clock to the
hardware clock using the hwclock command:
cumulus@switch$ sudo hwclock -w
The ntpd daemon running on the switch implements the NTP protocol. It
synchronizes the system time with time servers listed in
/etc/ntp.conf. The ntpd daemon is started at boot by default. See
man ntpd(8) for ntpd details. You can check this site for an explanation of the output.
If you intend to run this service within a VRF, including the management VRF,
follow these steps for configuring the service.
By default, /etc/ntp.conf contains some default time servers. You can
specify the NTP server or servers you want to use with NCLU; include the
iburst option to increase the sync speed.
cumulus@switch:~$ net add time ntp server 4.cumulusnetworks.pool.ntp.org iburst
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands add the NTP server to the list of servers in
/etc/ntp.conf:
# pool.ntp.org maps to about 1000 low-stratum NTP servers. Your server will
# pick a different set every time it starts up. Please consider joining the
# pool: <http://www.pool.ntp.org/join.html>
server 0.cumulusnetworks.pool.ntp.org iburst
server 1.cumulusnetworks.pool.ntp.org iburst
server 2.cumulusnetworks.pool.ntp.org iburst
server 3.cumulusnetworks.pool.ntp.org iburst
server 4.cumulusnetworks.pool.ntp.org iburst
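To confirm from a script which servers are active in the file, you can read the server lines back. The following Python sketch is one way to do that, assuming the simple "server host options" layout shown above; list_servers is a hypothetical helper and the sample text stands in for the real /etc/ntp.conf.

```python
def list_servers(conf_text):
    """Return (hostname, options) for every active `server` line."""
    servers = []
    for line in conf_text.splitlines():
        # Drop comments, then split the remaining directive into fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "server":
            servers.append((fields[1], fields[2:]))
    return servers


SAMPLE = """\
#server ntp.your-provider.example
server 0.cumulusnetworks.pool.ntp.org iburst
server 1.cumulusnetworks.pool.ntp.org iburst
server 2.cumulusnetworks.pool.ntp.org iburst
server 3.cumulusnetworks.pool.ntp.org iburst
server 4.cumulusnetworks.pool.ntp.org iburst
"""

servers = list_servers(SAMPLE)
for host, options in servers:
    print(host, "iburst" in options)
```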
To set the initial date and time via NTP before starting the ntpd
daemon, use ntpd -q. This is the same as ntpdate, which is being
retired and will no longer be available. See man ntp.conf(5) for details on
configuring ntpd using ntp.conf.
ntpd -q can hang if the time servers are not reachable.
cumulus@switch:~$ net show time ntp servers
remote refid st t when poll reach delay offset jitter
==============================================================================
+minime.fdf.net 58.180.158.150 3 u 140 1024 377 55.659 0.339 1.464
+69.195.159.158 128.138.140.44 2 u 259 1024 377 41.587 1.011 1.677
*chl.la 216.218.192.202 2 u 210 1024 377 4.008 1.277 1.628
+vps3.drown.org 17.253.2.125 2 u 743 1024 377 39.319 -0.316 1.384
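In this table, the character in front of each remote name is a tally code: an asterisk marks the peer currently selected for synchronization and a plus sign marks a candidate. The reach column is an octal value that records the last eight polls, so 377 means all eight were answered. The following Python sketch extracts those fields from the table; parse_peers is a hypothetical helper, and treating any leading punctuation character as the tally code is a simplification.

```python
def parse_peers(output):
    """Parse the peer table printed by `net show time ntp servers`."""
    lines = output.splitlines()
    # Peer rows start after the ===== separator line.
    start = next(i for i, text in enumerate(lines) if text.startswith("=")) + 1
    peers = []
    for line in lines[start:]:
        if not line.strip():
            continue
        # Simplification: a leading non-alphanumeric character is the tally code.
        has_tally = not line[0].isalnum()
        tally = line[0] if has_tally else " "
        fields = (line[1:] if has_tally else line).split()
        peers.append({
            "tally": tally,
            "remote": fields[0],
            "reach": int(fields[6], 8),      # octal shift register of recent polls
            "offset_ms": float(fields[8]),
        })
    return peers


SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+minime.fdf.net  58.180.158.150   3 u  140 1024  377   55.659    0.339   1.464
+69.195.159.158  128.138.140.44   2 u  259 1024  377   41.587    1.011   1.677
*chl.la          216.218.192.202  2 u  210 1024  377    4.008    1.277   1.628
+vps3.drown.org  17.253.2.125     2 u  743 1024  377   39.319   -0.316   1.384
"""

peers = parse_peers(SAMPLE)
selected = [p["remote"] for p in peers if p["tally"] == "*"]
print(selected)  # ['chl.la']
```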
To remove one or more NTP servers:
cumulus@switch:~$ net del time ntp server 0.cumulusnetworks.pool.ntp.org
cumulus@switch:~$ net del time ntp server 1.cumulusnetworks.pool.ntp.org
cumulus@switch:~$ net del time ntp server 2.cumulusnetworks.pool.ntp.org
cumulus@switch:~$ net del time ntp server 3.cumulusnetworks.pool.ntp.org
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Specify the NTP Source Interface
You can change the source interface that NTP uses if you want to use an
interface other than eth0, which is the default.
cumulus@switch:~$ net add time ntp source swp10
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following configuration snippet in the
ntp.conf file:
The default NTP configuration comprises the following servers, which are
listed in the /etc/ntp.conf file:
server 0.cumulusnetworks.pool.ntp.org iburst
server 1.cumulusnetworks.pool.ntp.org iburst
server 2.cumulusnetworks.pool.ntp.org iburst
server 3.cumulusnetworks.pool.ntp.org iburst
The contents of the default /etc/ntp.conf file are listed below.
# /etc/ntp.conf, configuration for ntpd; see ntp.conf(5) for help
driftfile /var/lib/ntp/ntp.drift
# Enable this if you want statistics to be logged.
#statsdir /var/log/ntpstats/
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable
# You do need to talk to an NTP server or two (or three).
#server ntp.your-provider.example
# pool.ntp.org maps to about 1000 low-stratum NTP servers. Your server will
# pick a different set every time it starts up. Please consider joining the
# pool: <http://www.pool.ntp.org/join.html>
server 0.cumulusnetworks.pool.ntp.org iburst
server 1.cumulusnetworks.pool.ntp.org iburst
server 2.cumulusnetworks.pool.ntp.org iburst
server 3.cumulusnetworks.pool.ntp.org iburst
# Access control configuration; see /usr/share/doc/ntp-doc/html/accopt.html for
# details. The web page <http://support.ntp.org/bin/view/Support/AccessRestrictions>
# might also be helpful.
#
# Note that "restrict" applies to both servers and clients, so a configuration
# that might be intended to block requests from certain clients could also end
# up blocking replies from your own upstream servers.
# By default, exchange time with everybody, but don't allow configuration.
restrict -4 default kod notrap nomodify nopeer noquery
restrict -6 default kod notrap nomodify nopeer noquery
# Local users may interrogate the ntp server more closely.
restrict 127.0.0.1
restrict ::1
# Clients from this (example!) subnet have unlimited access, but only if
# cryptographically authenticated.
#restrict 192.168.123.0 mask 255.255.255.0 notrust
# If you want to provide time to your local subnet, change the next line.
# (Again, the address is an example only.)
#broadcast 192.168.123.255
# If you want to listen to time broadcasts on your local subnet, de-comment the
# next lines. Please do this only if you trust everybody on the network!
#disable auth
#broadcastclient
# Specify interfaces, don't listen on switch ports
interface listen eth0
Configure NTP with Authorization Keys
For added security, you can configure NTP to use authorization keys.
Configure the NTP server:
Create a .keys file, such as /etc/ntp.keys. Specify a key identifier (a number from 1 to 65535), an encryption method (M for MD5), and the password. The following provides an example:
#
# PLEASE DO NOT USE THE DEFAULT VALUES HERE.
#
#65535 M akey
#1 M pass
1 M CumulusLinux!
In the /etc/ntp/ntp.conf file, add a pointer to the /etc/ntp.keys file you created above and specify the key identifier. For example:
Restart NTP with the sudo systemctl restart ntp command.
Configure the NTP client (the Cumulus Linux switch):
Create the same .keys file you created on the NTP server (/etc/ntp.keys). For example:
cumulus@switch:~$ sudo nano /etc/ntp.keys
#
# PLEASE DO NOT USE THE DEFAULT VALUES HERE.
#
#65535 M akey
#1 M pass
1 M CumulusLinux!
Edit the /etc/ntp.conf file to specify the server you want to use, the key identifier, and a pointer to the /etc/ntp.keys file you created in step 1. For example:
cumulus@switch:~$ sudo nano /etc/ntp.conf
...
# You do need to talk to an NTP server or two (or three).
#pool ntp.your-provider.example
# OR
#server ntp.your-provider.example
# pool.ntp.org maps to about 1000 low-stratum NTP servers. Your server will
# pick a different set every time it starts up. Please consider joining the
# pool: <http://www.pool.ntp.org/join.html>
#server 0.cumulusnetworks.pool.ntp.org iburst
#server 1.cumulusnetworks.pool.ntp.org iburst
#server 2.cumulusnetworks.pool.ntp.org iburst
#server 3.cumulusnetworks.pool.ntp.org iburst
server 10.50.23.121 key 1
#keys
keys /etc/ntp.keys
trustedkey 1
controlkey 1
requestkey 1
...
Restart NTP in the active VRF (default or management). For example:
Wait a few minutes, then run the ntpq -c as command to verify the configuration:
cumulus@switch:~$ ntpq -c as
ind assid status conf reach auth condition last_event cnt
===========================================================
1 40828 f014 yes yes ok reject reachable 1
After authorization is accepted, you see the following command output:
cumulus@switch:~$ ntpq -c as
ind assid status conf reach auth condition last_event cnt
===========================================================
1 40828 f61a yes yes ok sys.peer sys_peer 1
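A common cause of a rejected association is a key identifier in ntp.conf that does not exist in the keys file. The following Python sketch checks the two files against each other, assuming the "id type password" key layout and the directives shown in the steps above; parse_keys and referenced_key_ids are hypothetical helpers, and the sample strings stand in for the real files.

```python
def parse_keys(text):
    """Return {key_id: (key_type, password)} for each active line of a keys file."""
    keys = {}
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 2)
        if len(fields) != 3:
            raise ValueError("line {}: expected '<id> <type> <password>'".format(number))
        key_id = int(fields[0])
        if not 1 <= key_id <= 65535:
            raise ValueError("line {}: key identifier out of range".format(number))
        keys[key_id] = (fields[1], fields[2])
    return keys


def referenced_key_ids(conf_text):
    """Collect the key identifiers that an ntp.conf refers to."""
    ids = set()
    for line in conf_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if not fields:
            continue
        if fields[0] in ("trustedkey", "controlkey", "requestkey"):
            ids.update(int(value) for value in fields[1:])
        elif fields[0] == "server":
            options = fields[2:]
            if "key" in options[:-1]:
                ids.add(int(options[options.index("key") + 1]))
    return ids


KEYS = """\
#65535 M akey
#1 M pass
1 M CumulusLinux!
"""
CONF = """\
server 10.50.23.121 key 1
keys /etc/ntp.keys
trustedkey 1
controlkey 1
requestkey 1
"""

keys = parse_keys(KEYS)
missing = referenced_key_ids(CONF) - set(keys)
print(sorted(missing))  # [] when every referenced key is defined
```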
Precision Time Protocol (PTP) Boundary Clock
With the growth of low latency and high performance applications,
precision timing has become increasingly important. Precision Time
Protocol (PTP) is used to synchronize clocks in a network and is capable
of sub-microsecond accuracy. The clocks are organized in a master-slave
hierarchy. The slaves are synchronized to their masters, which can be
slaves to their own masters. The hierarchy is created and updated
automatically by the best master clock (BMC) algorithm, which runs on
every clock. The grandmaster clock is the top-level master and is
typically synchronized by using a Global Positioning System (GPS) time
source to provide a high degree of accuracy.
A boundary clock has multiple ports: one or more master ports and one or
more slave ports. The master ports provide time (the time can originate
from other masters further up the hierarchy) and the slave ports receive
time. The boundary clock absorbs sync messages in the slave port, uses
that port to set its clock, then generates new sync messages from this
clock out of all of its master ports.
Cumulus Linux includes the linuxptp package for PTP, which uses the
phc2sys daemon to synchronize the PTP clock with the system clock.
Cumulus Linux currently supports PTP on the Mellanox Spectrum ASIC only.
If you do not perform a full disk image install of Cumulus Linux 3.6
or later, you need to install the linuxptp package with the sudo -E apt-get install linuxptp command.
PTP is supported in boundary clock mode only (the switch provides
timing to downstream servers; it is a slave to a higher-level clock
and a master to downstream clocks).
The switch uses hardware time stamping to capture timestamps from an
Ethernet frame at the physical layer. This allows PTP to account for
delays in message transfer and greatly improves the accuracy of time
synchronization.
Only IPv4/UDP PTP packets are supported.
Only a single PTP domain per network is supported. A PTP domain is a
network or a portion of a network within which all the clocks are
synchronized.
In the following example, boundary clock 2 receives time from Master 1
(the grandmaster) on a PTP slave port, sets its clock and passes the
time down from the PTP master port to boundary clock 1. Boundary clock 1
receives the time on a PTP slave port, sets its clock and passes the
time down the hierarchy through the PTP master ports to the hosts that
receive the time.
Enable the PTP Boundary Clock on the Switch
To enable the PTP boundary clock on the switch:
Open the /etc/cumulus/switchd.conf file in a text editor and add
the following line:
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
Configure the PTP Boundary Clock
To configure a boundary clock:
Configure the interfaces on the switch that you want to use for PTP.
Each interface must be configured as a layer 3 routed interface with
an IP address.
PTP is supported on BGP unnumbered interfaces. PTP is not supported on switched virtual interfaces (SVIs).
cumulus@switch:~$ net add interface swp13s0 ip address 10.0.0.9/32
cumulus@switch:~$ net add interface swp13s1 ip address 10.0.0.10/32
Configure PTP options on the switch:
Set the gm-capable option to no to configure the switch to
be a boundary clock.
Set the priority, which selects the best master clock. You can
set priority1 or priority2. For each priority, you can use a number
between 0 and 255. The default priority is 255. For the boundary
clock, use a number above 128. The lower value takes precedence.
Add the time-stamping parameter. The switch automatically
enables hardware time-stamping to capture timestamps from an
Ethernet frame at the physical layer. If you are testing PTP in
a virtual environment, hardware time-stamping is not available;
however, the time-stamping parameter is still required.
Add the PTP master and slave interfaces. You do not specify
which is a master interface and which is a slave interface; this
is determined by the PTP packet received.
The following commands provide an example configuration:
cumulus@switch:~$ net add ptp global gm-capable no
cumulus@switch:~$ net add ptp global priority2 254
cumulus@switch:~$ net add ptp global priority1 254
cumulus@switch:~$ net add ptp global time-stamping
cumulus@switch:~$ net add ptp interface swp13s0
cumulus@switch:~$ net add ptp interface swp13s1
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
The ptp4l man page describes all the configuration parameters.
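The priority rule can be illustrated with a deliberately simplified comparison. The following Python sketch ranks clocks by priority1 and then priority2 only, with the lower number winning. The full best master clock algorithm also compares clock quality attributes and uses the clock identity as a final tie-breaker, all of which are omitted here; the clock names are hypothetical.

```python
from collections import namedtuple

Clock = namedtuple("Clock", "name priority1 priority2")


def best_master(clocks):
    """Pick the clock with the lowest (priority1, priority2) pair.

    Simplified: real PTP also weighs clock class, accuracy, variance,
    and clock identity.
    """
    return min(clocks, key=lambda clock: (clock.priority1, clock.priority2))


clocks = [
    Clock("grandmaster", 128, 128),
    Clock("boundary-1", 254, 254),
    Clock("boundary-2", 254, 253),
]

print(best_master(clocks).name)      # grandmaster
print(best_master(clocks[1:]).name)  # boundary-2
```

This is why a boundary clock is given a value above 128: any clock with a lower number is preferred as master.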
In the following example, the boundary clock on the switch receives time
from Master 1 (the grandmaster) on PTP slave port swp3s0, sets its clock
and passes the time down through PTP master ports swp3s1, swp3s2, and
swp3s3 to the hosts that receive the time.
The configuration for the above example is shown below. The example
assumes that you have already configured the layer 3 routed interfaces
(swp3s0, swp3s1, swp3s2, and swp3s3) you want to use for PTP.
cumulus@switch:~$ net add ptp global gm-capable no
cumulus@switch:~$ net add ptp global priority2 254
cumulus@switch:~$ net add ptp global priority1 254
cumulus@switch:~$ net add ptp global time-stamping
cumulus@switch:~$ net add ptp interface swp3s0
cumulus@switch:~$ net add ptp interface swp3s1
cumulus@switch:~$ net add ptp interface swp3s2
cumulus@switch:~$ net add ptp interface swp3s3
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
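When the same boundary clock settings apply to many switches, the command list can be generated rather than typed. The following Python sketch builds the NCLU commands shown above for a given set of interfaces. It only returns strings and does not run anything; the function name ptp_boundary_clock_commands and its default values are assumptions taken from the example.

```python
def ptp_boundary_clock_commands(interfaces, priority1=254, priority2=254):
    """Build the NCLU commands that configure a PTP boundary clock."""
    for value in (priority1, priority2):
        if not 0 <= value <= 255:
            raise ValueError("priority must be between 0 and 255")
    commands = [
        "net add ptp global gm-capable no",
        "net add ptp global priority2 {}".format(priority2),
        "net add ptp global priority1 {}".format(priority1),
        "net add ptp global time-stamping",
    ]
    commands += ["net add ptp interface {}".format(name) for name in interfaces]
    commands += ["net pending", "net commit"]
    return commands


for command in ptp_boundary_clock_commands(["swp3s0", "swp3s1", "swp3s2", "swp3s3"]):
    print(command)
```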
Verify PTP Boundary Clock Configuration
To view a summary of the PTP configuration on the switch, run the net show configuration ptp command:
To view additional PTP status information, including the delta in
nanoseconds from the master clock, run the sudo pmc -u -b 0 'GET TIME_STATUS_NP' command:
To delete PTP configuration, delete the PTP master and slave interfaces.
The following example commands delete the PTP interfaces swp3s0,
swp3s1, and swp3s2.
cumulus@switch:~$ net del ptp interface swp3s0
cumulus@switch:~$ net del ptp interface swp3s1
cumulus@switch:~$ net del ptp interface swp3s2
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Considerations
Use NTP in a DHCP Environment
If you use DHCP and want to specify your NTP servers, you must specify
an alternate configuration file for NTP.
Before you create the file, ensure that the DHCP-generated configuration
file exists. In Cumulus Linux 3.6.1 and later (which uses NTP 1:4.2.8),
the DHCP-generated file is named /run/ntp.conf.dhcp while in Cumulus
Linux 3.6.0 and earlier (which uses NTP 1:4.2.6) the file is named
/var/lib/ntp/ntp.conf.dhcp. This file is generated by the
/etc/dhcp/dhclient-exit-hooks.d/ntp script and is a copy of the
default /etc/ntp.conf with a modified server list from the DHCP
server. If this file does not exist and you plan on using DHCP in the
future, you can copy your current /etc/ntp.conf file to the location
of the DHCP file.
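Because the location of the DHCP-generated file depends on the release, a provisioning script needs to pick the right path. The following Python sketch encodes the rule from the paragraph above; dhcp_ntp_conf_path is a hypothetical helper that expects a dotted numeric version string.

```python
def dhcp_ntp_conf_path(version):
    """Return the DHCP-generated ntp.conf path for a Cumulus Linux version."""
    parts = tuple(int(part) for part in version.split("."))
    # 3.6.1 and later write the file under /run.
    if parts >= (3, 6, 1):
        return "/run/ntp.conf.dhcp"
    return "/var/lib/ntp/ntp.conf.dhcp"


print(dhcp_ntp_conf_path("3.7.12"))  # /run/ntp.conf.dhcp
print(dhcp_ntp_conf_path("3.6.0"))   # /var/lib/ntp/ntp.conf.dhcp
```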
To use an alternate configuration file that persists across upgrades of
Cumulus Linux, create a systemd unit override file called
/etc/systemd/system/ntp.service.d/config.conf and add the following
content:
If the state is not Active, or the alternate configuration file does
not appear in the ntp command line (for example, see below), it is likely that a mistake was made. Correct the mistake and rerun the three commands above to verify.
With this unit file override present, changing NTP settings using NCLU
does not take effect until the DHCP script regenerates the alternate NTP
configuration file.
System Clock and NCLU Commands
If you provision a new switch without setting the system clock (manually
or with NTP or PTP), the NCLU net commit command fails when the system
clock is earlier than the modification date of configuration files. Make
sure to set the system clock on the switch.
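You can detect this condition before running net commit by comparing file modification times with the current system time. The following Python sketch does that for a list of paths; the function name files_newer_than_clock is hypothetical, and the two paths in the usage line are examples chosen for illustration, not a complete list of the files NCLU manages.

```python
import os
import time


def files_newer_than_clock(paths, now=None):
    """Return the paths whose modification time is later than the system clock."""
    now = time.time() if now is None else now
    return [
        path for path in paths
        if os.path.exists(path) and os.path.getmtime(path) > now
    ]


# Example paths only; missing files are skipped.
suspect = files_newer_than_clock(["/etc/network/interfaces", "/etc/frr/frr.conf"])
print(suspect)
```

A non-empty result suggests that the system clock is behind and should be set before you commit changes.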
Spanning Tree and PTP
PTP frames are affected by STP filtering; events, such as an STP topology change (where ports temporarily go into the blocking state), can cause interruptions to PTP communications.
If you configure PTP on bridge ports, NVIDIA recommends that the bridge ports are spanning tree edge ports or in a bridge domain where spanning tree is disabled.
Read this section to understand how to set up SSH for remote access, use LDAP, TACACS+ and RADIUS authentication, and understand Cumulus Linux user accounts.
Netfilter - ACLs
Netfilter is the packet filtering framework
in Cumulus Linux as well as most other Linux distributions. There are a
number of tools available for configuring ACLs in Cumulus Linux:
iptables, ip6tables, and ebtables are Linux userspace tools
used to administer filtering rules for IPv4 packets, IPv6 packets,
and Ethernet frames (layer 2 using MAC addresses).
NCLU is a Cumulus Linux-specific userspace tool used to configure custom
ACLs.
cl-acltool is a Cumulus Linux-specific userspace tool used to
administer filtering rules and configure default ACLs.
NCLU and cl-acltool operate on various configuration files and use
iptables, ip6tables, and ebtables to install rules into the
kernel. In addition, NCLU and cl-acltool program rules in hardware for
interfaces involving switch port interfaces, which iptables,
ip6tables and ebtables cannot do on their own.
In many instances, you can use NCLU to configure ACLs; however, in some
cases, you must use cl-acltool. The examples below specify when to use
which tool.
If you need help to configure ACLs, run net example acl to see a basic
configuration:
cumulus@leaf01:~$ net example acl
Scenario
========
We would like to use access-lists on 'switch' to
- Restrict inbound traffic on swp1 to traffic from 10.1.1.0/24 destined for 10.1.2.0/24
- Restrict outbound traffic on swp2 to http, https, or ssh
         *switch
            /\
      swp1 /  \ swp2
          /    \
         /      \
   host-11      host-12
switch net commands
====================
Create an ACL that accepts traffic from 10.1.1.0/24 destined for 10.1.2.0/24
and drops all other traffic
switch# net add acl ipv4 MYACL accept source-ip 10.1.1.0/24 dest-ip 10.1.2.0/24
switch# net add acl ipv4 MYACL drop source-ip any dest-ip any
Apply MYACL inbound on swp1
switch# net add interface swp1 acl ipv4 MYACL inbound
Create an ACL that accepts http, https, or ssh traffic and drops all
other traffic.
switch# net add acl ipv4 WEB_OR_SSH accept tcp source-ip any source-port any dest-ip any dest-port http
switch# net add acl ipv4 WEB_OR_SSH accept tcp source-ip any source-port http dest-ip any dest-port any
switch# net add acl ipv4 WEB_OR_SSH accept tcp source-ip any source-port any dest-ip any dest-port https
switch# net add acl ipv4 WEB_OR_SSH accept tcp source-ip any source-port https dest-ip any dest-port any
switch# net add acl ipv4 WEB_OR_SSH accept tcp source-ip any source-port any dest-ip any dest-port ssh
switch# net add acl ipv4 WEB_OR_SSH accept tcp source-ip any source-port ssh dest-ip any dest-port any
switch# net add acl ipv4 WEB_OR_SSH drop source-ip any dest-ip any
Apply WEB_OR_SSH outbound on swp2
switch# net add interface swp2 acl ipv4 WEB_OR_SSH outbound
commit the staged changes
switch# net commit
Verification
============
switch# net show configuration acl
The interfaces in the sample configuration in net example acl are
layer 3; they are not layer 2 bridge members.
Traffic Rules In Cumulus Linux
Chains
Netfilter describes the mechanism by which packets are classified and
controlled in the Linux kernel. Cumulus Linux uses the Netfilter
framework to control the flow of traffic to, from, and across the
switch. Netfilter does not require a separate software daemon to run; it
is part of the Linux kernel itself. Netfilter asserts policies at layers
2, 3 and 4 of the OSI model
by inspecting packet and frame headers based on a list of rules. Rules
are defined using syntax provided by the iptables, ip6tables and
ebtables userspace applications.
The rules created by these programs inspect or operate on packets at
several points in the life of the packet through the system. These five
points are known as chains and are shown here:
The chains and their uses are:
PREROUTING touches packets before they are routed
INPUT touches packets after they are determined to be destined
for the local system but before they are received by the control
plane software
FORWARD touches transit traffic as it moves through the box
OUTPUT touches packets that are sourced by the control plane
software before they are put on the wire
POSTROUTING touches packets immediately before they are put on
the wire but after the routing decision has been made
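Which chains a packet touches depends on where it comes from and where it is going. The following Python sketch records the standard Netfilter traversal for the three cases described above; the dictionary keys and the function name chains_for are hypothetical labels used only for this illustration.

```python
# Order in which a packet passes through the chains, by kind of traffic.
CHAIN_PATHS = {
    "to_switch": ["PREROUTING", "INPUT"],                 # destined for the control plane
    "transit": ["PREROUTING", "FORWARD", "POSTROUTING"],  # moving through the box
    "from_switch": ["OUTPUT", "POSTROUTING"],             # sourced by the control plane
}


def chains_for(packet_kind):
    """Return the chains a packet of the given kind passes through, in order."""
    return CHAIN_PATHS[packet_kind]


print(" -> ".join(chains_for("transit")))  # PREROUTING -> FORWARD -> POSTROUTING
```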
Tables
When building rules to affect the flow of traffic, the individual chains
can be accessed by tables. Linux provides three tables by default:
Filter classifies traffic or filters traffic
NAT applies Network Address Translation rules
Cumulus Linux does not support NAT.
Mangle alters packets as they move through the switch
Each table has a set of default chains that can be used to modify or
inspect packets at different points of the path through the switch.
Chains contain the individual rules to influence traffic. Each table and
the default chains they support are shown below. Tables and chains in
green are supported by Cumulus Linux; those in red are not supported
(that is, they are not hardware accelerated) at this time.
Rules
Rules are the items that actually classify traffic to be acted upon.
Rules are applied to chains, which are attached to tables, similar to
the graphic below.
Rules have several different components; the examples below highlight
those different components.
Table: The first argument is the table. Notice that the second
example does not specify a table; the filter table
is implied if a table is not specified.
Chain: The second argument is the chain. Each table supports
several different chains. See Understanding Tables above.
Matches: The third argument(s) are called the matches. You can
specify multiple matches in a single rule. However, the more matches
you use in a rule, the more memory that rule consumes.
Jump: The jump specifies the target of the rule; that is, what
action to take if the packet matches the rule. If this option is
omitted in a rule, then matching the rule will have no effect on the
packet’s fate, but the counters on the rule will be incremented.
Target(s): The target can be a user-defined chain (other than
the one this rule is in), one of the special built-in targets that
decides the fate of the packet immediately (like DROP), or an
extended target. See the
Supported Rule Types section
below for examples of different targets.
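To see how these components map onto a rule, the following Python sketch splits a rule string into its table, chain, matches, and target. It is a simplified reading aid rather than a full iptables parser: it assumes the -j option comes last, as in the usual style, and the two sample rules are illustrations that use standard iptables options.

```python
import shlex


def split_rule(rule):
    """Split an iptables-style rule into table, chain, matches, and target."""
    tokens = shlex.split(rule)
    parts = {"table": "filter",  # implied when -t is omitted
             "chain": None, "matches": [], "target": None, "target_options": []}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in ("-t", "--table"):
            parts["table"] = tokens[i + 1]
            i += 2
        elif token in ("-A", "--append"):
            parts["chain"] = tokens[i + 1]
            i += 2
        elif token in ("-j", "--jump"):
            parts["target"] = tokens[i + 1]
            # Anything after the target name is treated as a target option.
            parts["target_options"] = tokens[i + 2:]
            break
        else:
            parts["matches"].append(token)
            i += 1
    return parts


print(split_rule("-t filter -A INPUT -i swp1 -p tcp --dport 22 -j ACCEPT"))
print(split_rule("-A FORWARD -i swp1 -s 10.1.1.0/24 -j DROP"))
```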
How Rules Are Parsed and Applied
All the rules from each chain are read from iptables, ip6tables, and
ebtables and entered in order into either the filter table or the
mangle table. The rules are read from the kernel in the following order:
IPv6 (ip6tables)
IPv4 (iptables)
ebtables
When rules are combined and put into one table, the order determines the
relative priority of the rules; iptables and ip6tables have the
highest precedence and ebtables has the lowest.
The Linux packet forwarding construct is an overlay for how the silicon
underneath processes packets. Be aware of the following:
The order of operations for how rules are processed is not perfectly
maintained when you compare how iptables and the switch silicon
process packets. The switch silicon reorders rules when switchd
writes to the ASIC, whereas traditional iptables executes the list
of rules in order.
All rules are terminating; after a rule matches, the action is
carried out and no more rules are processed. The exception to this
is when a SETCLASS rule is placed immediately before another rule;
this exists multiple times in the default ACL configuration. In the example below, the SETCLASS action, applied with the
--in-interface option, creates the internal ASIC classification
and continues to process the next rule, which does the
rate-limiting for the matched protocol:
If multiple contiguous rules with the same match criteria are
applied to --in-interface, only the first rule gets processed
and then terminates processing. This is a misconfiguration; there is
no reason to have duplicate rules with different actions.
When processing traffic, rules affecting the FORWARD chain that
specify an ingress interface are performed prior to rules that match
on an egress interface. As a workaround, rules that only affect the
egress interface can have an ingress interface wildcard (currently,
only swp+ and bond+ are supported as wildcard names; see below)
that matches any interface applied so that you can maintain order of
operations with other input interface rules. For example, with the
following rules:
-A FORWARD -i $PORTA -j ACCEPT
-A FORWARD -o $PORTA -j ACCEPT <-- This rule is performed LAST (because of egress interface matching)
-A FORWARD -i $PORTB -j DROP
If you modify the rules like this, they are performed in order:
-A FORWARD -i $PORTA -j ACCEPT
-A FORWARD -i swp+ -o $PORTA -j ACCEPT <-- These rules are performed in order (because of wildcard match on ingress interface)
-A FORWARD -i $PORTB -j DROP
When using rules that do a mangle and a filter lookup for a packet,
Cumulus Linux processes them in parallel and combines the action.
If a switch port is assigned to a bond, any egress rules must be
assigned to the bond.
When using the OUTPUT chain, rules must be assigned to the source.
For example, if a rule is assigned to the switch port in the
direction of traffic but the source is a bridge (VLAN), the traffic
is not affected by the rule; the rule must be applied to the bridge.
If all transit traffic needs to have a rule applied, use the FORWARD
chain, not the OUTPUT chain.
ebtables rules are put into either the IPv4 or IPv6 memory space
depending on whether the rule utilizes IPv4 or IPv6 to make a
decision. Layer 2-only rules that match the MAC address are put into
the IPv4 memory space.
On Broadcom switches, the ingress INPUT chain rules match layer 2 and
layer 3 multicast packets before multicast packet replication has
occurred; therefore, a DROP rule affects all copies.
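The ingress-before-egress behavior for FORWARD rules described in the list above can be modeled as a stable sort. The following Python sketch moves rules that match only on an egress interface after all other rules and otherwise keeps the written order. It is a model of the described behavior, not of switchd itself; the function names are hypothetical, the port names swp1 and swp2 are examples, and leaving rules that name neither interface in place is an assumption of the model.

```python
import shlex


def is_egress_only(rule):
    """True when the rule matches an egress interface but no ingress interface."""
    tokens = shlex.split(rule)
    has_in = "-i" in tokens or "--in-interface" in tokens
    has_out = "-o" in tokens or "--out-interface" in tokens
    return has_out and not has_in


def effective_order(rules):
    """Egress-only rules run last; sorted() is stable, so other order is kept."""
    return sorted(rules, key=is_egress_only)


as_written = [
    "-A FORWARD -i swp1 -j ACCEPT",
    "-A FORWARD -o swp1 -j ACCEPT",
    "-A FORWARD -i swp2 -j DROP",
]
with_wildcard = [
    "-A FORWARD -i swp1 -j ACCEPT",
    "-A FORWARD -i swp+ -o swp1 -j ACCEPT",
    "-A FORWARD -i swp2 -j DROP",
]

print(effective_order(as_written))     # the -o rule moves to the end
print(effective_order(with_wildcard))  # order is unchanged
```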
Rule Placement in Memory
INPUT and ingress (FORWARD -i) rules occupy the same memory space. A
rule counts as ingress if the -i option is set. If both input and
output options (-i and -o) are set, the rule is considered as
ingress and occupies that memory space. For example:
However, removing the -o option and interface makes it a valid rule.
Nonatomic Update Mode and Atomic Update Mode
In Cumulus Linux, atomic update mode is enabled by default. However, this
mode limits the number of ACL rules that you can configure.
To increase the number of ACL rules that can be configured, configure
the switch to operate in nonatomic mode.
Instead of reserving 50% of your TCAM space for atomic updates,
incremental update uses the available free space to write the new TCAM
rules and swap over to the new rules after this is complete. Cumulus
Linux then deletes the old rules and frees up the original TCAM space.
If there is insufficient free space to complete this task, the original
nonatomic update is performed, which interrupts traffic.
Enable Nonatomic Update Mode
You can enable nonatomic updates for switchd, which offer better
scaling because all TCAM resources are used to actively impact traffic.
With atomic updates, half of the hardware resources are on standby and
do not actively impact traffic.
Incremental nonatomic updates are table based, so they do not
interrupt network traffic when new rules are installed. The rules are
mapped into the following tables and are updated in this order:
mirror (ingress only)
ipv4-mac (can be both ingress and egress)
ipv6 (ingress only)
Only switches with the Broadcom ASIC support incremental nonatomic updates. Mellanox switches with the Spectrum-based ASIC only support standard nonatomic updates; using nonatomic mode on Spectrum-based ASICs impacts traffic on ACL updates.
The incremental nonatomic update operation follows this order:
Updates are performed incrementally, one table at a time without
stopping traffic.
Cumulus Linux checks if the rules in a table have changed since the
last time they were installed; if a table does not have any changes,
it is not reinstalled.
If there are changes in a table, the new rules are populated in new
groups or slices in hardware, then that table is switched over to
the new groups or slices.
Finally, old resources for that table are freed. This process is
repeated for each of the tables listed above.
If sufficient resources do not exist to hold both the new rule set
and old rule set, the regular nonatomic mode is attempted. This
interrupts network traffic.
If the regular nonatomic update fails, Cumulus Linux reverts to
the previous rules.
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
During regular non-incremental nonatomic updates, traffic is stopped
first, then enabled after the new configuration is written into the
hardware completely.
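The incremental update order described above can be sketched in Python. This is a simplified model, not the switchd implementation: the function name, the use of rule counts as a stand-in for TCAM space, and the single shared capacity figure are all assumptions made for illustration. The regular nonatomic fallback is only reported, not modeled.

```python
TABLE_ORDER = ["mirror", "ipv4-mac", "ipv6"]  # update order given in the text


def incremental_update(installed, desired, capacity):
    """Walk the tables in order and return (resulting_rules, log).

    installed/desired map a table name to a list of rules; capacity is a
    hypothetical total number of rule slots shared by all tables.
    """
    installed = {t: list(installed.get(t, [])) for t in TABLE_ORDER}
    log = []
    for table in TABLE_ORDER:
        old = installed[table]
        new = list(desired.get(table, []))
        if new == old:
            # Unchanged tables are not reinstalled.
            log.append(f"{table}: unchanged, skipped")
            continue
        used = sum(len(rules) for rules in installed.values())
        if len(new) > capacity - used:
            # Old and new rule sets do not fit side by side.
            log.append(f"{table}: insufficient free space, "
                       "fall back to regular nonatomic update")
            return installed, log
        # Write the new rules, switch over, then free the old ones.
        installed[table] = new
        log.append(f"{table}: wrote {len(new)} rules, switched over, "
                   f"freed {len(old)}")
    return installed, log


state, log = incremental_update(
    {"ipv4-mac": ["a", "b"], "ipv6": ["c"]},
    {"ipv4-mac": ["a", "b", "d"], "ipv6": ["c"]},
    capacity=8,
)
print("\n".join(log))
```

In this model the ipv4-mac table is rewritten without touching the other two tables, which mirrors why the incremental path does not interrupt traffic.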
Use iptables, ip6tables, and ebtables Directly
Using iptables, ip6tables, and ebtables directly is not
recommended because any rules installed this way are applied only
to the Linux kernel and are not synchronized to the switch silicon for
hardware acceleration. Running cl-acltool -i (the
installation command) resets all rules and deletes anything that is not
stored in /etc/cumulus/acl/policy.conf.
For example, running the following command:
cumulus@switch:~$ sudo iptables -A INPUT -p icmp --icmp-type echo-request -j DROP
appears to work, and the rule appears when you run cl-acltool -L:
cumulus@switch:~$ sudo cl-acltool -L ip
-------------------------------
Listing rules of type iptables:
-------------------------------
TABLE filter :
Chain INPUT (policy ACCEPT 72 packets, 5236 bytes)
pkts bytes target prot opt in out source destination
0 0 DROP icmp -- any any anywhere anywhere icmp echo-request
However, the rule is not synced to hardware when applied in this way, and
running cl-acltool -i or rebooting the switch removes the rule without replacing
it. To ensure all rules that can be in hardware are hardware
accelerated, place them in /etc/cumulus/acl/policy.conf and install
them by running cl-acltool -i.
Estimate the Number of Rules
To estimate the number of rules you can create from an ACL entry, first
determine if that entry is an ingress or an egress rule. Then, determine if
it is an IPv4-mac or IPv6 type rule. This determines the slice to which
the rule belongs. Use the following to determine how many entries are
used up for each type.
By default, each entry occupies one double-wide entry, except if the
entry is one of the following:
An entry with multiple comma-separated input interfaces is split
into one rule for each input interface (listed after
--in-interface below). For example, this entry splits into two
rules:
-A FORWARD --in-interface swp1s0,swp1s1 -p icmp -j ACCEPT
An entry with multiple comma-separated output interfaces is split
into one rule for each output interface (listed after
--out-interface below). This entry splits into two rules:
-A FORWARD --in-interface swp+ --out-interface swp1s0,swp1s1 -p icmp -j ACCEPT
An entry with both input and output comma-separated interfaces is
split into one rule for each combination of input and output
interface (listed after --in-interface and --out-interface
below). This entry splits into four rules:
-A FORWARD --in-interface swp1s0,swp1s1 --out-interface swp1s2,swp1s3 -p icmp -j ACCEPT
An entry with multiple layer 4 port ranges is split into one rule
for each range (listed after --dport below). For example, this
entry splits into two rules:
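The interface-splitting cases above can be reproduced with a short Python sketch. The function name expand_interfaces is hypothetical, and the sketch covers only comma-separated input and output interfaces; it does not attempt the layer 4 port range case.

```python
import itertools
import shlex

INTERFACE_FLAGS = ("-i", "--in-interface", "-o", "--out-interface")


def expand_interfaces(entry):
    """Split one ACL entry into one rule per interface combination."""
    tokens = shlex.split(entry)
    # Positions of the values that follow an interface flag.
    slots = [i + 1 for i, tok in enumerate(tokens[:-1])
             if tok in INTERFACE_FLAGS]
    choices = [tokens[i].split(",") for i in slots]
    rules = []
    for combo in itertools.product(*choices):
        expanded = list(tokens)
        for slot, value in zip(slots, combo):
            expanded[slot] = value
        rules.append(" ".join(expanded))
    return rules


entry = ("-A FORWARD --in-interface swp1s0,swp1s1 "
         "--out-interface swp1s2,swp1s3 -p icmp -j ACCEPT")
for rule in expand_interfaces(entry):
    print(rule)
```

The length of the returned list is the number of rules the entry consumes for its interface matches, which gives a quick estimate before installing a policy.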
Cumulus Linux supports matching ACL rules for both ingress and egress
interfaces on both
VLAN-aware
and traditional mode
bridges, including bridge SVIs (switch VLAN interfaces)
for input and output. However, keep the following in mind:
If a traditional mode bridge has a mix of different VLANs, or has
both access and trunk members, output interface matching is not
supported.
For iptables rules, all IP packets in a bridge are matched, not
just routed packets.
You cannot match both input and output interfaces in a rule.
For routed packets, Cumulus Linux cannot match the output bridge for
SPAN/ERSPAN.
Matching SVI interfaces in ebtables rules is supported on switches
based on Broadcom ASICs. This feature is not currently supported on
switches with Mellanox Spectrum ASICs.
Following are example rules for a VLAN-aware bridge:
[ebtables]
-A FORWARD -i br0.100 -p IPv4 --ip-protocol icmp -j DROP
-A FORWARD -o br0.100 -p IPv4 --ip-protocol icmp -j ACCEPT
[iptables]
-A FORWARD -i br0.100 -p icmp -j DROP
-A FORWARD --out-interface br0.100 -p icmp -j ACCEPT
-A FORWARD --in-interface br0.100 -j POLICE --set-mode pkt --set-rate 1 --set-burst 1 --set-class 0
And here are example rules for a traditional mode bridge:
[ebtables]
-A FORWARD -i br0 -p IPv4 --ip-protocol icmp -j DROP
-A FORWARD -o br0 -p IPv4 --ip-protocol icmp -j ACCEPT
[iptables]
-A FORWARD -i br0 -p icmp -j DROP
-A FORWARD --out-interface br0 -p icmp -j ACCEPT
-A FORWARD --in-interface br0 -j POLICE --set-mode pkt --set-rate 1 --set-burst 1 --set-class 0
Match on VLAN IDs on Layer 2 Interfaces
Cumulus Linux 3.7.9 and later enables you to match on VLAN IDs on layer 2 interfaces for ingress rules.
Matching VLAN IDs on layer 2 interfaces is supported on switches with
Spectrum ASICs only.
The following example matches on a VLAN and DSCP class, and sets the internal class of the packet. You can combine this with ingress iptables rules for extended matching on IP fields:
[ebtables]
-A FORWARD -p 802_1Q --vlan-id 100 -j mark --mark-set 0x66
[iptables]
-A FORWARD -i swp31 -m mark --mark 0x66 -m dscp --dscp-class CS1 -j SETCLASS --class 2
Install and Manage ACL Rules with NCLU
NCLU provides an easy way to create custom ACLs in Cumulus Linux. The
rules you create live in the /var/lib/cumulus/nclu/nclu_acl.conf file,
which gets converted to a rules file,
/etc/cumulus/acl/policy.d/50_nclu_acl.rules. This way, the rules you
create with NCLU are independent of the two default files in
/etc/cumulus/acl/policy.d/, 00control_plane.rules and
99control_plane_catch_all.rules, as the content in these files might
get updated after you upgrade Cumulus Linux.
Instead of crafting a rule by hand then installing it using
cl-acltool, NCLU handles many of the options automatically. For
example, consider the following iptables rule:
You create this rule, called EXAMPLE1, using NCLU like this:
cumulus@switch:~$ net add acl ipv4 EXAMPLE1 accept tcp source-ip 10.0.14.2/32 source-port any dest-ip 10.0.15.8/32 dest-port any
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
All options, such as the -j and -p, even FORWARD in the above
rule, are added automatically when you apply the rule to the control
plane; NCLU figures it all out for you.
You can also set a priority value, which specifies the order in which
the rules get executed and the order in which they appear in the rules
file. Lower numbers are executed first. To add a new rule in the middle,
first run net show config acl, which displays the priority numbers.
Otherwise, new rules get appended to the end of the list of rules in the
nclu_acl.conf and 50_nclu_acl.rules files.
If you need to hand edit a rule, do not edit the 50_nclu_acl.rules
file. Instead, edit the nclu_acl.conf file.
After you add the rule, you need to apply it to an inbound or outbound
interface using net add int acl. The inbound interface in our example
is swp1:
cumulus@switch:~$ net add int swp1 acl ipv4 EXAMPLE1 inbound
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
After you commit your changes, you can verify the rule you created with
NCLU by running net show configuration acl:
cumulus@switch:~$ net show configuration acl
acl ipv4 EXAMPLE1 priority 10 accept tcp source-ip 10.0.14.2/32 source-port any dest-ip 10.0.15.8/32 dest-port any
interface swp1
acl ipv4 EXAMPLE1 inbound
Or you can see all of the rules installed by running cat on the
50_nclu_acl.rules file:
For INPUT and FORWARD rules, apply the rule to a control plane interface
using net add control-plane:
cumulus@switch:~$ net add control-plane acl ipv4 EXAMPLE1 inbound
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
To remove a rule, use net del acl ipv4|ipv6|mac RULENAME:
cumulus@switch:~$ net del acl ipv4 EXAMPLE1
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
This deletes all rules from the 50_nclu_acl.rules file with that name.
It also deletes the interfaces referenced in the nclu_acl.conf file.
Install and Manage ACL Rules with cl-acltool
You can manage Cumulus Linux ACLs with cl-acltool. Rules are first
written to the iptables chains, as described above, and then synced to
hardware via switchd.
Use iptables/ip6tables/ebtables and cl-acltool to manage rules
in the default files, 00control_plane.rules and
99control_plane_catch_all.rules; they are not aware of rules created
using NCLU.
To examine the current state of chains and list all installed rules,
run:
cumulus@switch:~$ sudo cl-acltool -L all
-------------------------------
Listing rules of type iptables:
-------------------------------
TABLE filter :
Chain INPUT (policy ACCEPT 90 packets, 14456 bytes)
pkts bytes target prot opt in out source destination
0 0 DROP all -- swp+ any 240.0.0.0/5 anywhere
0 0 DROP all -- swp+ any loopback/8 anywhere
0 0 DROP all -- swp+ any base-address.mcast.net/8 anywhere
0 0 DROP all -- swp+ any 255.255.255.255 anywhere ...
To list installed rules using native iptables, ip6tables and
ebtables, use the -L option with the respective commands:
If the install fails, ACL rules in the kernel and hardware are rolled
back to the previous state. Errors from programming rules in the kernel
or ASIC are reported appropriately.
Install Packet Filtering (ACL) Rules
cl-acltool takes access control list (ACL) rules input in files. Each
ACL policy file contains iptables, ip6tables and ebtables
categories under the tags [iptables], [ip6tables] and [ebtables]
respectively.
Each rule in an ACL policy must be assigned to one of the rule
categories above.
See man cl-acltool(5) for ACL rule details. For iptables rule
syntax, see man iptables(8). For ip6tables rule syntax, see man ip6tables(8). For ebtables rule syntax, see man ebtables(8).
See man cl-acltool(5) and man cl-acltool(8) for further details on
using cl-acltool. Some examples are listed here and more are listed
later in this chapter.
By default:
ACL policy files are located in /etc/cumulus/acl/policy.d/.
All *.rules files in this directory are included in
/etc/cumulus/acl/policy.conf.
All files included in this policy.conf file are installed when the
switch boots up.
The policy.conf file expects rules files to have a .rules suffix
as part of the file name.
Here is an example ACL policy file:
[iptables]
-A INPUT --in-interface swp1 -p tcp --dport 80 -j ACCEPT
-A FORWARD --in-interface swp1 -p tcp --dport 80 -j ACCEPT
[ip6tables]
-A INPUT --in-interface swp1 -p tcp --dport 80 -j ACCEPT
-A FORWARD --in-interface swp1 -p tcp --dport 80 -j ACCEPT
[ebtables]
-A INPUT -p IPv4 -j ACCEPT
-A FORWARD -p IPv4 -j ACCEPT
You can use wildcards or variables to specify chain and interface lists
to ease administration of rules.
Interface Wildcards
Currently only swp+ and bond+ are supported as wildcard names. There
might be kernel restrictions in supporting more complex wildcards such as
swp1+.
swp+ rules are applied as an aggregate, not per port. If you want to apply per-port policing, specify a specific port instead of a wildcard.
You can write ACL rules for the system into multiple files under the
default /etc/cumulus/acl/policy.d/ directory. The ordering of rules
during installation follows the sort order of the files based on their
file names.
Use multiple files to stack rules. The example below shows two rules
files separating rules for management and datapath traffic:
cumulus@switch:~$ ls /etc/cumulus/acl/policy.d/
00sample_mgmt.rules 01sample_datapath.rules
cumulus@switch:~$ cat /etc/cumulus/acl/policy.d/00sample_mgmt.rules
INGRESS_INTF = swp+
INGRESS_CHAIN = INPUT
[iptables]
# protect the switch management
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -s 10.0.14.2 -d 10.0.15.8 -p tcp -j ACCEPT
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -s 10.0.11.2 -d 10.0.12.8 -p tcp -j ACCEPT
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -d 10.0.16.8 -p udp -j DROP
cumulus@switch:~$ cat /etc/cumulus/acl/policy.d/01sample_datapath.rules
INGRESS_INTF = swp+
INGRESS_CHAIN = INPUT, FORWARD
[iptables]
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -s 192.0.2.5 -p icmp -j ACCEPT
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -s 192.0.2.6 -d 192.0.2.4 -j DROP
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -s 192.0.2.2 -d 192.0.2.8 -j DROP
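To make the file ordering and variable handling above concrete, the following Python sketch lists rules files in name order and expands the variables in one policy file. It is an illustration only, not how cl-acltool parses files: the function names are hypothetical, plain string sorting is assumed for the file order, and only simple rule lines that start with -A are handled.

```python
import re


def install_order(filenames):
    """Rules files are installed in the sort order of their file names."""
    return sorted(name for name in filenames if name.endswith(".rules"))


def expand_policy(text):
    """Return (section, rule) pairs with variables and chain lists expanded."""
    variables, section, rules = {}, None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]  # iptables, ip6tables, or ebtables
            continue
        assignment = re.fullmatch(r"(\w+)\s*=\s*(.+)", line)
        if assignment:
            # "INPUT, FORWARD" is stored as "INPUT,FORWARD".
            variables[assignment.group(1)] = assignment.group(2).replace(" ", "")
            continue
        line = re.sub(r"\$(\w+)", lambda m: variables[m.group(1)], line)
        flag, chains, rest = line.split(None, 2)
        # One rule per chain in a comma-separated chain list.
        for chain in chains.split(","):
            rules.append((section, f"{flag} {chain} {rest}"))
    return rules


SAMPLE = """\
INGRESS_INTF = swp+
INGRESS_CHAIN = INPUT, FORWARD
[iptables]
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -s 192.0.2.5 -p icmp -j ACCEPT
"""

print(install_order(["01sample_datapath.rules", "00sample_mgmt.rules"]))
for section, rule in expand_policy(SAMPLE):
    print(section, rule)
```

Each rule that names two chains becomes two rules, so a variable such as INGRESS_CHAIN = INPUT, FORWARD doubles the number of entries the file contributes.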
Apply all rules and policies included in
/etc/cumulus/acl/policy.conf:
cumulus@switch:~$ sudo cl-acltool -i
In addition to ensuring that the rules and policies referenced by
/etc/cumulus/acl/policy.conf are installed, this will remove any
currently active rules and policies that are not contained in the
files referenced by /etc/cumulus/acl/policy.conf.
Specify the Policy Files to Install
By default, Cumulus Linux installs any .rules file you configure in
/etc/cumulus/acl/policy.d/. To add other policy files to an ACL, you
need to include them in /etc/cumulus/acl/policy.conf. For example, for
Cumulus Linux to install a rule in a policy file called
01_new.datapathacl, add include /etc/cumulus/acl/policy.d/01_new.datapathacl to policy.conf, as in this
example:
cumulus@switch:~$ sudo nano /etc/cumulus/acl/policy.conf
#
# This file is a master file for acl policy file inclusion
#
# Note: This is not a file where you list acl rules.
#
# This file can contain:
# - include lines with acl policy files
# example:
# include <filepath>
#
# see manpage cl-acltool(5) and cl-acltool(8) for how to write policy files
#
include /etc/cumulus/acl/policy.d/01_new.datapathacl
Hardware Limitations on Number of Rules
The maximum number of rules that can be handled in hardware is a
function of the following factors:
The platform type (switch silicon, like Tomahawk or Spectrum - see
the HCL to determine which
platform type applies to a particular switch).
The mix of IPv4 and IPv6 rules; Cumulus Linux does not support the
maximum number of rules for both IPv4 and IPv6 simultaneously.
The number of default rules provided by Cumulus Linux.
Whether the rules are applied on ingress or egress.
Whether the rules are in atomic or nonatomic mode; nonatomic mode
rules are used when nonatomic updates are enabled (see above).
If the maximum number of rules for a particular table is exceeded,
cl-acltool -i generates the following error:
error: hw sync failed (sync_acl hardware installation failed) Rolling back .. failed.
In the tables below, the default rules count toward the limits listed.
The raw limits below assume only one ingress and one egress table are
present.
Broadcom Tomahawk Limits
Direction | Atomic Mode IPv4 Rules | Atomic Mode IPv6 Rules | Nonatomic Mode IPv4 Rules | Nonatomic Mode IPv6 Rules
Ingress raw limit | 512 | 512 | 1024 | 1024
Ingress limit with default rules | 256 (36 default) | 256 (29 default) | 768 (36 default) | 768 (29 default)
Egress raw limit | 256 | 0 | 512 | 0
Egress limit with default rules | 256 (29 default) | 0 | 512 (29 default) | 0
Broadcom Trident3 Limits
The Trident3 ASIC is divided into 12 slices, organized into 4 groups for ACLs. Each group contains 3 slices. Each group can support a maximum of 768 rules. You cannot mix IPv4 and IPv6 rules within the same group. IPv4 and MAC rules can be programmed into the same group.
Direction | Atomic Mode IPv4 Rules | Atomic Mode IPv6 Rules | Nonatomic Mode IPv4 Rules | Nonatomic Mode IPv6 Rules
Ingress raw limit | 768 | 768 | 2304 | 2304
Ingress limit with default rules | 768 (44 default) | 768 (41 default) | 2304 (44 default) | 2304 (41 default)
Egress raw limit | 512 | 0 | 512 | 0
Egress limit with default rules | 512 (28 default) | 0 | 512 (28 default) | 0
Due to a hardware limitation on Trident3 switches, certain broadcast packets that are VXLAN decapsulated and sent to the CPU do not hit the normal INPUT chain ACL rules installed with cl-acltool. See Caveats and Errata.
Broadcom Trident II+ Limits
Direction | Atomic Mode IPv4 Rules | Atomic Mode IPv6 Rules | Nonatomic Mode IPv4 Rules | Nonatomic Mode IPv6 Rules
Ingress raw limit | 4096 | 4096 | 8192 | 8192
Ingress limit with default rules | 2048 (36 default) | 3072 (29 default) | 6144 (36 default) | 6144 (29 default)
Egress raw limit | 256 | 0 | 512 | 0
Egress limit with default rules | 256 (29 default) | 0 | 512 (29 default) | 0
Broadcom Trident II Limits
Direction | Atomic Mode IPv4 Rules | Atomic Mode IPv6 Rules | Nonatomic Mode IPv4 Rules | Nonatomic Mode IPv6 Rules
Ingress raw limit | 1024 | 1024 | 2048 | 2048
Ingress limit with default rules | 512 (36 default) | 768 (29 default) | 1536 (36 default) | 1536 (29 default)
Egress raw limit | 256 | 0 | 512 | 0
Egress limit with default rules | 256 (29 default) | 0 | 512 (29 default) | 0
Broadcom Helix4 Limits
Direction | Atomic Mode IPv4 Rules | Atomic Mode IPv6 Rules | Nonatomic Mode IPv4 Rules | Nonatomic Mode IPv6 Rules
Ingress raw limit | 1024 | 512 | 2048 | 1024
Ingress limit with default rules | 768 (36 default) | 384 (29 default) | 1792 (36 default) | 896 (29 default)
Egress raw limit | 256 | 0 | 512 | 0
Egress limit with default rules | 256 (29 default) | 0 | 512 (29 default) | 0
Mellanox Spectrum Limits
The Mellanox Spectrum ASIC has one common TCAM for both ingress and
egress, which can be used for other non-ACL-related resources. However,
the number of supported rules varies with the TCAM profile specified for
the switch.
Profile | Atomic Mode IPv4 Rules | Atomic Mode IPv6 Rules | Nonatomic Mode IPv4 Rules | Nonatomic Mode IPv6 Rules
default | 500 | 250 | 1000 | 500
ipmc-heavy | 750 | 500 | 1500 | 1000
acl-heavy | 1750 | 1000 | 3500 | 2000
ipmc-max | 1000 | 500 | 2000 | 1000
ip-acl-heavy | 7500 | 0 | 15000 | 0
Even though the table above specifies that zero IPv6 rules are supported
with the ip-acl-heavy profile, Cumulus Linux does not prevent you
from configuring IPv6 rules. However, there is no guarantee that IPv6
rules work under the ip-acl-heavy profile.
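When planning a rule set, the Spectrum figures above can be kept in a small lookup. The following Python sketch only restates the table; the names SPECTRUM_LIMITS and spectrum_limit are hypothetical, and the values are the per-family maximums, which cannot both be reached at the same time.

```python
SPECTRUM_LIMITS = {
    # profile: (atomic IPv4, atomic IPv6, nonatomic IPv4, nonatomic IPv6)
    "default": (500, 250, 1000, 500),
    "ipmc-heavy": (750, 500, 1500, 1000),
    "acl-heavy": (1750, 1000, 3500, 2000),
    "ipmc-max": (1000, 500, 2000, 1000),
    "ip-acl-heavy": (7500, 0, 15000, 0),
}


def spectrum_limit(profile, family, atomic=True):
    """Return the listed rule limit for a TCAM profile and address family."""
    row = SPECTRUM_LIMITS[profile]
    column = {"ipv4": 0, "ipv6": 1}[family] + (0 if atomic else 2)
    return row[column]


planned_ipv4_rules = 1200
limit = spectrum_limit("acl-heavy", "ipv4", atomic=True)
print(f"{planned_ipv4_rules} planned, {limit} allowed:",
      "fits" if planned_ipv4_rules <= limit else "too many")
```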
Supported Rule Types
The iptables/ip6tables/ebtables construct tries to layer the Linux
implementation on top of the underlying hardware but they are not always
directly compatible. Here are the supported rules for chains in
iptables, ip6tables and ebtables.
To learn more about any of the options shown in the tables below, run
iptables -h [name of option]. The same help syntax works for options
for ip6tables and ebtables.
The following is an example of help syntax for an ebtables target:
root@leaf1# ebtables -h tricolorpolice
<...snip...>
tricolorpolice option:
--set-color-mode STRING setting the mode in blind or aware
--set-cir INT setting committed information rate in kbits per second
--set-cbs INT setting committed burst size in kbyte
--set-pir INT setting peak information rate in kbits per second
--set-ebs INT setting excess burst size in kbyte
--set-conform-action-dscp INT setting dscp value if the action is accept for conforming packets
--set-exceed-action-dscp INT setting dscp value if the action is accept for exceeding packets
--set-violate-action STRING setting the action (accept/drop) for violating packets
--set-violate-action-dscp INT setting dscp value if the action is accept for violating packets
Supported chains for the filter table:
INPUT FORWARD OUTPUT
iptables/ip6tables Rule Support
Rule Element
Supported
Unsupported
Matches
Src/Dst, IP protocol
In/out interface
IPv4: icmp, ttl,
IPv6: icmp6, frag, hl,
IP common: tcp (with flags), udp, multiport, DSCP, addrtype
Rules with input/output Ethernet interfaces are ignored
Rules that have no matches and accept all packets in a chain are
currently ignored.
Chain default rules (that are ACCEPT) are also ignored.
IPv6 Egress Rules on Broadcom Switches
Cumulus Linux 3.7.2 and later supports IPv6 egress rules in ip6tables
on Broadcom switches. Because there are no slices to allocate in the
egress TCAM for IPv6, the matches are implemented using a combination of
the ingress IPv6 slice and the existing egress IPv4 MAC slice:
Cumulus Linux compares all the match fields in the IPv6 ingress
slice, except the --out-interface field, and marks the packet with
a classid.
The egress IPv4 MAC slice matches on the classid and the
out-interface, and performs the actions.
For example, the -A FORWARD --out-interface vlan100 -p icmp6 -j ACCEPT
rule is split into the following:
IPv6 ingress: -A FORWARD -p icmp6 → action mark (for example,
classid 4)
IPv4 MAC egress: <match mark 4> and --out-interface vlan100 -j ACCEPT
IPv6 egress rules in ip6tables are not supported on Hurricane2
switches.
You cannot match both input and output interfaces in the same rule.
The egress TCAM IPv4 MAC slice is shared with other rules, which
constrains the scale to a much lower limit.
Caveats
Splitting rules across the ingress TCAM and the egress TCAM causes the
ingress IPv6 part of the rule to match packets going to all
destinations, which can interfere with the regular expected linear rule
match in a sequence.
A higher rule can prevent a lower rule from being matched unexpectedly:
Rule 1: -A FORWARD --out-interface vlan100 -p icmp6 -j ACCEPT
Rule 1 matches all icmp6 packets to all out interfaces in the ingress TCAM. This
prevents rule 2, which is more specific but has a different out interface, from being matched.
Make sure to put more specific matches above more general matches even if the output interfaces are different.
When you have two rules with the same output interface, the lower rule might match unexpectedly depending on the presence of the previous rules.
Rule 1: -A FORWARD --out-interface vlan100 -p icmp6 -j ACCEPT
Rule 2: -A FORWARD --out-interface vlan101 -s 00::01 -j DROP
Rule 3: -A FORWARD --out-interface vlan101 -p icmp6 -j ACCEPT
Rule 3 still matches an icmp6 packet with source IP 00::01 going out of vlan101. Rule 1 interferes with the normal function of rule 2 and/or rule 3.
When you have two adjacent rules with the same match and different output interfaces, such as:
Rule 1: -A FORWARD --out-interface vlan100 -p icmp6 -j ACCEPT
Rule 2: -A FORWARD --out-interface vlan101 -p icmp6 -j DROP
Rule 2 is never matched on ingress. Both rules share the same mark.
Matching Untagged Packets (Trident3 Switches)
Untagged packets do not have an associated VLAN to match on egress;
therefore, the match must be on the underlying layer 2 port. For
example, for a bridge configured with pvid 100, member port swp1s0 and
swp1s1, and SVI vlan100, the output interface match on vlan100 has to be
expanded into each member port. The -A FORWARD -o vlan100 -p icmp6 -j ACCEPT rule must be specified as two rules:
Rule 1: -A FORWARD -o swp1s0 -p icmp6 -j ACCEPT
Rule 2: -A FORWARD -o swp1s1 -p icmp6 -j ACCEPT
Matching on an egress port matches all packets egressing the port,
tagged as well as untagged. Therefore, to match only untagged traffic on
the port, you must specify additional rules above this rule to prevent
tagged packets matching the rule. This is true for bridge member ports
as well as regular layer 2 ports. In the example rule above, if vlan101
is also present on the bridge, add a rule above rule 1 and rule 2 to
protect vlan101 tagged traffic:
Rule 0: -A FORWARD -o vlan101 -p icmp6 -j ACCEPT
Rule 1: -A FORWARD -o swp1s0 -p icmp6 -j ACCEPT
Rule 2: -A FORWARD -o swp1s1 -p icmp6 -j ACCEPT
For a standalone port or subinterface on swp1s2:
Rule 0: -A FORWARD -o swp1s2.101 -p icmp6 -j ACCEPT
Rule 1: -A FORWARD -o swp1s2 -p icmp6 -j ACCEPT
Common Examples
Control Plane and Data Plane Traffic
You can configure quality of service for traffic on both the control
plane and the data plane. By using QoS policers, you can rate limit
traffic so incoming packets get dropped if they exceed specified
thresholds.
Counters on POLICE ACL rules in iptables do not currently show the
packets that are dropped due to those rules.
Use the POLICE target with iptables. POLICE takes these arguments:
--set-class value sets the system internal class of service queue
configuration to value.
--set-rate value specifies the maximum rate in kilobytes (KB) or
packets.
--set-burst value specifies the number of packets or kilobytes
(KB) allowed to arrive sequentially.
--set-mode string sets the mode in KB (kilobytes) or pkt
(packets) for rate and burst size.
For example, to rate limit the incoming traffic on swp1 to 400 packets
per second with a burst of 100 packets and set the class of
the queue for the policed traffic as 0, set this rule in your
appropriate .rules file:
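A rule of this shape can be assembled from the POLICE arguments listed above. The following Python sketch is illustrative only: the helper name police_rule is hypothetical, and the FORWARD chain and --in-interface match are assumptions modeled on the POLICE rules shown earlier for bridges. Choose the chain that fits your traffic.

```python
def police_rule(interface, rate, burst, mode="pkt", traffic_class=0,
                chain="FORWARD"):
    """Build a POLICE rule line for the [iptables] section of a .rules file."""
    if mode not in ("pkt", "KB"):
        raise ValueError("mode must be 'pkt' (packets) or 'KB' (kilobytes)")
    return (f"-A {chain} --in-interface {interface} -j POLICE "
            f"--set-mode {mode} --set-rate {rate} --set-burst {burst} "
            f"--set-class {traffic_class}")


# 400 packets per second, burst of 100 packets, class 0, ingress on swp1.
print(police_rule("swp1", rate=400, burst=100))
```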
The examples here use the mangle table to modify the packet as it
transits the switch. DSCP is expressed in
decimal notation
in the examples below.
[iptables]
#Set SSH as high priority traffic.
-t mangle -A FORWARD -p tcp --dport 22 -j DSCP --set-dscp 46
#Set everything coming in SWP1 as AF13
-t mangle -A FORWARD --in-interface swp1 -j DSCP --set-dscp 14
#Set Packets destined for 10.0.100.27 as best effort
-t mangle -A FORWARD -d 10.0.100.27/32 -j DSCP --set-dscp 0
#Example using a range of ports for TCP traffic
-t mangle -A FORWARD -p tcp -s 10.0.0.17/32 --sport 10000:20000 -d 10.0.100.27/32 --dport 10000:20000 -j DSCP --set-dscp 34
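The decimal DSCP values in these rules map to the standard class names as follows. This Python sketch uses the standard DSCP arithmetic (AFxy = 8x + 2y, CSx = 8x, EF = 46) and also derives the hexadecimal form that cl-acltool -L prints and the ToS byte that packet generators report; the function names are illustrative.

```python
EF = 46  # Expedited Forwarding


def af(klass, drop):
    """Assured Forwarding class AF<klass><drop>, for example af(1, 3) for AF13."""
    if not (1 <= klass <= 4 and 1 <= drop <= 3):
        raise ValueError("AF classes run from AF11 to AF43")
    return 8 * klass + 2 * drop


def cs(klass):
    """Class Selector CS<klass>, for example cs(1) for CS1."""
    if not 0 <= klass <= 7:
        raise ValueError("class selectors run from CS0 to CS7")
    return 8 * klass


def tos_byte(dscp):
    """DSCP occupies the upper six bits of the IPv4 ToS byte."""
    return dscp << 2


for name, value in (("EF", EF), ("AF13", af(1, 3)), ("AF41", af(4, 1))):
    print(f"{name}: dscp={value} hex={value:#04x} tos={tos_byte(value)}")
```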
Verify DSCP Values on Transit Traffic
The examples here use the DSCP match criteria in combination with other
IP, TCP, and interface matches to identify traffic and count the number
of packets.
[iptables]
#Match and count the packets that match SSH traffic with DSCP EF
-A FORWARD -p tcp --dport 22 -m dscp --dscp 46 -j ACCEPT
#Match and count the packets coming in SWP1 as AF13
-A FORWARD --in-interface swp1 -m dscp --dscp 14 -j ACCEPT
#Match and count the packets with a destination 10.0.100.27 marked best effort
-A FORWARD -d 10.0.100.27/32 -m dscp --dscp 0 -j ACCEPT
#Match and count the packets in a port range with DSCP AF41
-A FORWARD -p tcp -s 10.0.0.17/32 --sport 10000:20000 -d 10.0.100.27/32 --dport 10000:20000 -m dscp --dscp 34 -j ACCEPT
Check the Packet and Byte Counters for ACL Rules
To verify the counters using the above example rules, first send test traffic matching the patterns through the network. The following example generates traffic with mz (or mausezahn), which can be installed on host servers or even on Cumulus Linux switches. After the traffic is sent, check on switch1 with cl-acltool that the counters match.
Policing counters do not increment on switches with the Spectrum ASIC.
# Send 100 TCP packets on host1 with a DSCP value of EF with a destination of host2 TCP port 22:
cumulus@host1$ mz eth1 -A 10.0.0.17 -B 10.0.100.27 -c 100 -v -t tcp "dp=22,dscp=46"
IP: ver=4, len=40, tos=184, id=0, frag=0, ttl=255, proto=6, sum=0, SA=10.0.0.17, DA=10.0.100.27,
payload=[see next layer]
TCP: sp=0, dp=22, S=42, A=42, flags=0, win=10000, len=20, sum=0,
payload=
# Verify the 100 packets are matched on switch1
cumulus@switch1$ sudo cl-acltool -L ip
-------------------------------
Listing rules of type iptables:
-------------------------------
TABLE filter :
Chain INPUT (policy ACCEPT 9314 packets, 753K bytes)
pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
100 6400 ACCEPT tcp -- any any anywhere anywhere tcp dpt:ssh DSCP match 0x2e
0 0 ACCEPT all -- swp1 any anywhere anywhere DSCP match 0x0e
0 0 ACCEPT all -- any any 10.0.0.17 anywhere DSCP match 0x00
0 0 ACCEPT tcp -- any any 10.0.0.17 10.0.100.27 tcp spts:webmin:20000 dpts:webmin:2002
# Send 100 packets with a small payload on host1 with a DSCP value of AF13 with a destination of host2:
cumulus@host1$ mz eth1 -A 10.0.0.17 -B 10.0.100.27 -c 100 -v -t ip
IP: ver=4, len=20, tos=0, id=0, frag=0, ttl=255, proto=0, sum=0, SA=10.0.0.17, DA=10.0.100.27,
payload=
# Verify the 100 packets are matched on switch1
cumulus@switch1$ sudo cl-acltool -L ip
-------------------------------
Listing rules of type iptables:
-------------------------------
TABLE filter :
Chain INPUT (policy ACCEPT 9314 packets, 753K bytes)
pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
100 6400 ACCEPT tcp -- any any anywhere anywhere tcp dpt:ssh DSCP match 0x2e
100 7000 ACCEPT all -- swp3 any anywhere anywhere DSCP match 0x0e
100 6400 ACCEPT all -- any any 10.0.0.17 anywhere DSCP match 0x00
0 0 ACCEPT tcp -- any any 10.0.0.17 10.0.100.27 tcp spts:webmin:20000 dpts:webmin:2002
# Send 100 packets on host1 with a destination of host2:
cumulus@host1$ mz eth1 -A 10.0.0.17 -B 10.0.100.27 -c 100 -v -t ip
IP: ver=4, len=20, tos=56, id=0, frag=0, ttl=255, proto=0, sum=0, SA=10.0.0.17, DA=10.0.100.27,
payload=
# Verify the 100 packets are matched on switch1
cumulus@switch1$ sudo cl-acltool -L ip
-------------------------------
Listing rules of type iptables:
-------------------------------
TABLE filter :
Chain INPUT (policy ACCEPT 9314 packets, 753K bytes)
pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
100 6400 ACCEPT tcp -- any any anywhere anywhere tcp dpt:ssh DSCP match 0x2e
100 7000 ACCEPT all -- swp3 any anywhere anywhere DSCP match 0x0e
0 0 ACCEPT all -- any any 10.0.0.17 anywhere DSCP match 0x00
0 0 ACCEPT tcp -- any any 10.0.0.17 10.0.100.27 tcp spts:webmin:20000 dpts:webmin:2002
Filter Specific TCP Flags
The example solution below creates rules on the INPUT and FORWARD chains
to drop ingress IPv4 and IPv6 TCP packets when the SYN bit is set and
the RST, ACK, and FIN bits are reset. The default for the INPUT and
FORWARD chains allows all other packets. The ACL is applied to ports
swp20 and swp21. After configuring this ACL, new TCP sessions that
originate from ingress ports swp20 and swp21 are not allowed. TCP
sessions that originate from any other port are allowed.
INGRESS_INTF = swp20,swp21
[iptables]
-A INPUT,FORWARD --in-interface $INGRESS_INTF -p tcp --syn -j DROP
[ip6tables]
-A INPUT,FORWARD --in-interface $INGRESS_INTF -p tcp --syn -j DROP
The --syn flag in the above rule matches packets with the SYN bit set
and the ACK, RST, and FIN bits cleared. It is equivalent to using
--tcp-flags SYN,RST,ACK,FIN SYN. For example, you can write the above
rule as:
-A INPUT,FORWARD --in-interface $INGRESS_INTF -p tcp --tcp-flags SYN,RST,ACK,FIN SYN -j DROP
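The mask and compare semantics of --tcp-flags can be checked with a few lines of Python. The first argument lists the flags to examine and the second lists the flags that must be set among them. The bit values are the standard TCP header flag bits; the function name is illustrative.

```python
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20


def tcp_flags_match(flags, mask, comp):
    """True when, among the bits in mask, exactly the bits in comp are set."""
    return (flags & mask) == comp


SYN_MASK = SYN | RST | ACK | FIN  # the flags examined by --syn

print(tcp_flags_match(SYN, SYN_MASK, SYN))        # new connection attempt
print(tcp_flags_match(SYN | ACK, SYN_MASK, SYN))  # reply to a connection attempt
print(tcp_flags_match(ACK, SYN_MASK, SYN))        # established traffic
```

Only the first packet matches, which is why the rule drops new sessions arriving on swp20 and swp21 while leaving replies and established traffic alone.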
Control Who Can SSH into the Switch
Run the following NCLU commands to control who can SSH into the switch.
In the following example, 10.0.0.11/32 is the interface IP address (or loopback IP address) of the switch and 10.255.4.0/24 can SSH into the switch.
cumulus@switch:~$ net add acl ipv4 test priority 10 accept source-ip 10.255.4.0/24 dest-ip 10.0.0.11/32
cumulus@switch:~$ net add acl ipv4 test priority 20 drop source-ip any dest-ip 10.0.0.11/32
cumulus@switch:~$ net add control-plane acl ipv4 test inbound
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Cumulus Linux does not support the keyword iprouter (typically used for traffic sent to the CPU, where the destination MAC address is that of the router but the destination IP address is not the router).
Example Scenario
The following example scenario demonstrates how several different rules are applied.
Following are the configurations for the two switches used in these
examples. The configuration for each switch appears in
/etc/network/interfaces on that switch.
Switch 1 Configuration
cumulus@switch1:~$ net show configuration files
...
/etc/network/interfaces
=======================
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
auto bond2
iface bond2
bond-slaves swp3 swp4
auto br-untagged
iface br-untagged
address 10.0.0.1/24
bridge_ports swp1 bond2
bridge_stp on
auto br-tag100
iface br-tag100
address 10.0.100.1/24
bridge_ports swp2.100 bond2.100
bridge_stp on
...
Switch 2 Configuration
cumulus@switch2:~$ net show configuration files
...
/etc/network/interfaces
=======================
auto swp3
iface swp3
auto swp4
iface swp4
auto br-untagged
iface br-untagged
address 10.0.0.2/24
bridge_ports bond2
bridge_stp on
auto br-tag100
iface br-tag100
address 10.0.100.2/24
bridge_ports bond2.100
bridge_stp on
auto bond2
iface bond2
bond-slaves swp3 swp4
...
Egress Rule
The following rule blocks any TCP traffic with destination port 200
going from host1 or host2 through the switch (corresponding to rule 1 in
the diagram above).
[iptables] -A FORWARD -o bond2 -p tcp --dport 200 -j DROP
Ingress Rule
The following rule blocks any UDP traffic with source port 200 going
from host1 through the switch (corresponding to rule 2 in the diagram
above).
[iptables] -A FORWARD -i swp2 -p udp --sport 200 -j DROP
Input Rule
The following rule blocks any UDP traffic with source port 200 and
destination port 50 going from host1 to the switch (corresponding to
rule 3 in the diagram above).
[iptables] -A INPUT -i swp1 -p udp --sport 200 --dport 50 -j DROP
Output Rule
The following rule blocks any TCP traffic with source port 123 and
destination port 123 going from Switch 1 to host2 (corresponding to rule
4 in the diagram above).
[iptables] -A OUTPUT -o br-tag100 -p tcp --sport 123 --dport 123 -j DROP
Combined Rules
The following rule blocks any TCP traffic with source port 123 and
destination port 123 going from any switch port egress or generated from
Switch 1 to host1 or host2 (corresponding to rules 1 and 4 in the
diagram above).
[iptables] -A OUTPUT,FORWARD -o swp+ -p tcp --sport 123 --dport 123 -j DROP
This also becomes two ACLs and is the same as:
[iptables]
-A FORWARD -o swp+ -p tcp --sport 123 --dport 123 -j DROP
-A OUTPUT -o swp+ -p tcp --sport 123 --dport 123 -j DROP
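The way a rule with a comma-separated chain list stands for one rule per chain can be sketched as a simple text expansion. This is an illustrative model only; the function name is hypothetical and it does not represent how cl-acltool processes rules internally. The expanded rules come out in the order the chains were written.

```python
def expand_chains(rule):
    """Split a rule such as '-A OUTPUT,FORWARD ...' into one rule per chain."""
    parts = rule.split()
    if len(parts) < 2 or parts[0] != "-A":
        raise ValueError("expected a rule that starts with '-A <chain>[,<chain>...]'")
    rest = " ".join(parts[2:])
    # Emit one '-A <chain> <rest>' line for every chain in the list.
    return ["-A {} {}".format(chain, rest).rstrip() for chain in parts[1].split(",")]
```

For the combined rule above, the result is two rules that differ only in the chain name.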
Layer 2-only Rules/ebtables
The following rule blocks any traffic with source MAC address
00:00:00:00:00:12 and destination MAC address 08:9e:01:ce:e2:04 going
from any switch port egress/ingress.
[ebtables] -A FORWARD -s 00:00:00:00:00:12 -d 08:9e:01:ce:e2:04 -j DROP
Not all iptables, ip6tables, or ebtables rules are supported.
Refer to the Supported Rules section above for specific rule
support.
Input Chain Rules on Broadcom Switches
Broadcom switches evaluate both IPv4 and IPv6 packets against INPUT chain iptables rules. For example, when you install the following rule, the switch drops both IPv6 and IPv4 packets with destination port 22.
[iptables]
-A INPUT -p tcp --dport 22 -j DROP
To work around this issue, use ebtables with IPv4 or IPv6 headers instead of the iptables and ip6tables generic INPUT chain DROP. For example:
[ebtables]
-A INPUT -i swp+ -p IPv4 --ip-protocol tcp --ip-destination-port 22 -j DROP
[ebtables]
-A INPUT -i swp+ -p IPv6 --ip6-protocol tcp --ip6-destination-port 22 -j DROP
ACL Log Policer Limits Traffic
To protect the CPU from overloading, traffic copied to the CPU is
limited to 1 pkt/s by an ACL Log Policer.
Bridge Traffic Limitations
Bridge traffic that matches LOG ACTION rules is not logged in syslog;
the kernel and hardware identify packets using different information.
Log Actions Cannot Be Forwarded
Logged packets cannot be forwarded. The hardware cannot both forward a
packet and send the packet to the control plane (or kernel) for logging.
To emphasize this, a log action must also have a drop action.
Broadcom Range Checker Limitations
Broadcom platforms have only 24 range checkers. This is a separate
resource from the total number of ACLs allowed. If you are creating a
large ACL configuration, use port ranges for large ranges of more than 5
ports.
Inbound LOG Actions Only for Broadcom Switches
On Broadcom-based switches, LOG actions can only be done on inbound
interfaces (the ingress direction), not on outbound interfaces (the
egress direction).
SPAN Sessions that Reference an Outgoing Interface
SPAN sessions that reference an outgoing interface create mirrored
packets based on the ingress interface before the routing/switching
decision.
Tomahawk Hardware Limitations
Rate Limiting per Pipeline, Not Global
On Tomahawk switches, the field processor (FP) polices on a per-pipeline
basis instead of globally, as with a Trident II switch. If packets come
in to different switch ports that are on different pipelines on the
ASIC, they might be rate limited differently.
For example, your switch is set so BFD is rate limited to 2000 packets
per second. When the BFD packets are received on port1/pipe1 and
port2/pipe2, they are each rate limited at 2000 pps; the switch is rate
limiting at 4000 pps overall. Because there are four pipelines on a
Tomahawk switch, you might see a fourfold increase of your configured
rate limits.
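The arithmetic in the BFD example can be written out as a short calculation. This is a sketch of the worst case described above, assuming traffic arrives evenly on ports that sit on different pipelines; the function name is hypothetical.

```python
def worst_case_aggregate_pps(configured_pps, pipelines_receiving):
    """Upper bound on the packets per second admitted across the whole switch.

    Each pipeline polices independently at the configured rate, so the
    ceiling is the configured rate multiplied by the number of pipelines
    that receive the traffic.
    """
    if pipelines_receiving < 1:
        raise ValueError("at least one pipeline must receive traffic")
    return configured_pps * pipelines_receiving
```

With a 2000 pps limit, two pipelines give a ceiling of 4000 pps and all four pipelines give 8000 pps.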
Atomic Update Mode Enabled by Default
In Cumulus Linux, atomic update mode is enabled by default. If you have
Tomahawk switches and plan to use SPAN and/or mangle rules, you must
disable atomic update mode.
To do so, enable nonatomic update mode by setting the value for
acl.non_atomic_update_mode to TRUE in /etc/cumulus/switchd.conf,
then restart switchd.
acl.non_atomic_update_mode = TRUE
Packets Undercounted during ACL Updates
On Tomahawk switches, when updating egress FP rules, some packets do not
get counted. This results in an underreporting of counts during
ping-pong or incremental switchover.
Trident II+ Hardware Limitations
On a Trident II+ switch, the TCAM allocation for ACLs is limited to 2048
rules in atomic mode for a default setup instead of 4096, as advertised
for ingress rules.
Trident3 Hardware Limitations
TCAM Allocation
On a Trident3 switch, the TCAM allocation for ACLs is limited to 2048
rules in atomic mode for a default setup instead of 4096, as advertised
for ingress rules.
Enable Nonatomic Mode
On a Trident3 switch, you must enable nonatomic update mode before you
can configure ERSPAN. To do so, set the value for
acl.non_atomic_update_mode to TRUE in /etc/cumulus/switchd.conf,
then restart switchd.
acl.non_atomic_update_mode = TRUE
Egress ACL Rules
On Trident3 switches, egress ACL rules matching on the output SVI
interface match layer 3 routed packets only, not bridged packets. To
match layer 2 traffic, use egress bridge member port-based rules.
iptables Interactions with cl-acltool
Because Cumulus Linux is a Linux operating system, the iptables
commands can be used directly. However, consider using cl-acltool
instead because:
Without using cl-acltool, rules are not installed into hardware.
Running cl-acltool -i (the installation command) resets all rules
and deletes anything that is not stored in
/etc/cumulus/acl/policy.conf.
For example, running the following command works:
cumulus@switch:~$ sudo iptables -A INPUT -p icmp --icmp-type echo-request -j DROP
And the rules appear when you run cl-acltool -L:
cumulus@switch:~$ sudo cl-acltool -L ip
-------------------------------
Listing rules of type iptables:
-------------------------------
TABLE filter :
Chain INPUT (policy ACCEPT 72 packets, 5236 bytes)
pkts bytes target prot opt in out source destination
0 0 DROP icmp -- any any anywhere anywhere icmp echo-request
However, running cl-acltool -i or rebooting the switch removes them. To ensure
all rules that can be in hardware are hardware accelerated, place
them in the /etc/cumulus/acl/policy.conf file, then run
cl-acltool -i.
Mellanox Spectrum Hardware Limitations
Due to hardware limitations in the Spectrum ASIC, BFD policers
are shared between all BFD-related control plane rules. Specifically, the
following default rules share the same policer in the
00control_plane.rules file:
To work around this limitation, set the rate and burst of all 6 of these
rules to the same values, using the --set-rate and --set-burst
options.
Where to Assign Rules
If a switch port is assigned to a bond, any egress rules must be
assigned to the bond.
When using the OUTPUT chain, rules must be assigned to the source.
For example, if a rule is assigned to the switch port in the
direction of traffic but the source is a bridge (VLAN), the traffic
is not affected by the rule; the rule must be applied to the bridge instead.
If all transit traffic needs to have a rule applied, use the FORWARD
chain, not the OUTPUT chain.
Generic Error Message Displayed after ACL Rule Installation Failure
After an ACL rule installation failure, a generic error message like the
following is displayed:
cumulus@switch:$ sudo cl-acltool -i -p 00control_plane.rules
Using user provided rule file 00control_plane.rules
Reading rule file 00control_plane.rules ...
Processing rules in file 00control_plane.rules ...
error: hw sync failed (sync_acl hardware installation failed)
Installing acl policy... Rolling back ..
failed.
Dell S3048-ON Supports only 24K MAC Addresses
The Dell S3048-ON has a limit of 24576 MAC address entries instead of
32K for other 1G switches.
Mellanox Spectrum ASICs and INPUT Chain Rules
On switches with Mellanox Spectrum ASICs, INPUT chain rules are implemented using a
trap mechanism. Packets headed to the CPU are assigned trap IDs. The
default INPUT chain rules are mapped to these trap IDs. However, if a
packet matches multiple traps, they are resolved by an internal priority
mechanism that might be different from the rule priorities. Packets
might not get policed by the default expected rule, but by another rule
instead. For example, ICMP packets headed to the CPU are policed by the
LOCAL rule instead of the ICMP rule. Also, multiple rules might share
the same trap. In this case the policer that is applied is the largest
of the policer values.
To work around this issue, create rules on the INPUT and FORWARD chains
(INPUT,FORWARD).
Hardware Policing of Packets in the Input Chain
On certain platforms, there are limitations on hardware policing of
packets in the INPUT chain. To work around these limitations, Cumulus
Linux supports kernel based policing of these packets in software using
limit/hashlimit matches. Rules with these matches are not hardware
offloaded, but are ignored during hardware install.
ACLs Do not Match when the Output Port on the ACL is a Subinterface
Packets do not match when a subinterface is configured as the output port.
The ACL matches on packets only if the primary port is configured as an output
port. If a subinterface is set as an input or ingress port, the packets match correctly.
For example, the following rule, which sets a subinterface as the output port, does not match:
-A FORWARD --out-interface swp49s1.100 -j ACCEPT
Mellanox Switches and Egress ACL Matching on Bonds
On Mellanox switches, ACL rules that match on an outbound bond interface are not supported. For example, the following rule is not supported:
[iptables]
-A FORWARD --out-interface <bond_intf> -j DROP
To work around this issue, duplicate the ACL rule on each physical port of the bond. For example:
[iptables]
-A FORWARD --out-interface <bond-member-port-1> -j DROP
-A FORWARD --out-interface <bond-member-port-2> -j DROP
Services and Daemons in Cumulus Linux
Services (also known as daemons) and processes are at the heart of
how a Linux system functions. Most of the time a service takes care of
itself; you just enable and start it, then let it run. However, because
a Cumulus Linux switch is a Linux system, you have the ability to dig
deeper if you like. Services may start multiple processes as they run.
Services tend to be the most important things to monitor on a Cumulus
Linux switch.
You manage services in Cumulus Linux in the following ways:
Identify currently active or stopped services
Identify boot time state of a specific service
Disable or enable a specific service
Identify active listener ports
systemd and the systemctl Command
In general, you manage services using systemd via the systemctl
command. You use it with any service on the switch to start, stop,
restart, reload, enable, disable, reenable, or get the status of the
service.
systemctl has a number of subcommands that perform a specific
operation on a given service.
status: Returns the status of the specified service.
start: Starts the service.
stop: Stops the service.
restart: Stops, then starts the service, all the while
maintaining state. So if there are dependent services or services
that mark the restarted service as Required, the other services
also get restarted. For example, running systemctl restart frr.service restarts any of the routing protocol services that are
enabled and running, such as bgpd or ospfd.
reload: Reloads a service’s configuration.
enable: Enables the service to start when the system boots, but
does not start it unless you use the systemctl start SERVICENAME.service command or reboot the switch.
disable: Disables the service, but does not stop it unless you
use the systemctl stop SERVICENAME.service command or reboot the
switch. A disabled service can still be started or stopped.
reenable: Disables, then enables a service. You might need to do
this so that any new Wants or WantedBy lines create the symlinks
necessary for ordering. This has no side effects on other services.
There is often little reason to interact with the services directly
using these commands. If a critical service should happen to crash or
hit an error it will be automatically respawned by systemd. Systemd is
effectively the caretaker of services in modern Linux systems and is
responsible for starting all the necessary services at boot time.
Ensure a Service Starts after Multiple Restarts
By default, systemd is configured to try to restart a particular
service only a certain number of times within a given interval before
the service fails to start at all. The settings for this are stored in
the service unit file. The settings are StartLimitInterval (which
defaults to 10 seconds) and StartLimitBurst (which defaults to 5
attempts), but many services override these defaults, sometimes with
much longer times. switchd.service, for example, sets
StartLimitInterval=10m and StartLimitBurst=3, which means if you
restart switchd more than 3 times in 10 minutes, it does not start.
When the restart fails for this reason, a message similar to the
following appears:
Job for switchd.service failed. See 'systemctl status switchd.service' and 'journalctl -xn' for details.
And systemctl status switchd.service shows output similar to:
Active: failed (Result: start-limit) since Thu 2016-04-07 21:55:14 UTC; 15s ago
To clear this error, run systemctl reset-failed switchd.service. If
you know you are going to restart frequently (multiple times within the
StartLimitInterval), you can run the same command before you issue the
restart request. This also applies to stop followed by start.
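The start-limit behavior can be approximated with a simple sliding-window check. This is a simplified model for reasoning about when a restart is refused, not systemd's actual accounting, which may differ in detail; the function and parameter names are hypothetical.

```python
def start_allowed(previous_starts, now, interval_s, burst):
    """Return True if a start attempt at time `now` stays within the limit.

    previous_starts: timestamps, in seconds, of earlier start attempts.
    interval_s:      length of the limit window in seconds.
    burst:           number of starts permitted within one window.
    """
    # Only attempts that fall inside the window count against the limit.
    recent = [t for t in previous_starts if now - t < interval_s]
    return len(recent) < burst
```

With the switchd.service values quoted above (a 10 minute interval and a burst of 3), three starts at 0, 60 and 120 seconds leave no room for a fourth at 180 seconds, but a start at 700 seconds is allowed again because the first two attempts have aged out of the window. Running systemctl reset-failed corresponds to clearing the list of previous attempts.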
Keep systemd Services from Hanging after Starting
If you start, restart, or reload any systemd service that can be
started from another systemd service, you must use the --no-block
option with systemctl. Otherwise, that service or even the switch
itself might hang after starting or restarting.
Identify Active Listener Ports for IPv4 and IPv6
You can identify the active listener ports under both IPv4 and IPv6
using the netstat command:
To determine which services are currently active or stopped, run the
cl-service-summary command:
cumulus@switch:~$ cl-service-summary
Service cron enabled active
Service ssh enabled active
Service syslog enabled active
Service asic-monitor enabled inactive
Service clagd enabled inactive
Service cumulus-poe inactive
Service lldpd enabled active
Service mstpd enabled active
Service neighmgrd enabled active
Service netd enabled active
Service netq-agent enabled active
Service ntp enabled active
Service portwd enabled active
Service ptmd enabled active
Service pwmd enabled active
Service smond enabled active
Service switchd enabled active
Service sysmonitor enabled active
Service vxrd disabled inactive
Service vxsnd disabled inactive
Service rdnbrd disabled inactive
Service frr enabled inactive
Service bgpd disabled inactive
Service eigrpd disabled inactive
Service isisd disabled inactive
Service ldpd disabled inactive
Service nhrpd disabled inactive
Service ospf6d disabled inactive
Service ospfd disabled inactive
Service pbrd disabled inactive
Service pimd disabled inactive
Service ripd disabled inactive
Service ripngd disabled inactive
Service zebra disabled inactive
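If you want to process this summary in a script, for example to flag services that are enabled at boot but not currently running, the output can be parsed line by line. The sketch below assumes the whitespace-separated layout shown in the sample above, and it assumes that a line with a single state column (such as cumulus-poe) reports only the run state; both are assumptions about the output format, and the function names are hypothetical.

```python
def parse_service_summary(text):
    """Parse 'Service <name> [<boot state>] <run state>' lines into a dict."""
    services = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] != "Service":
            continue  # skip prompts, blank lines and anything unexpected
        name = fields[1]
        if len(fields) >= 4:
            boot_state, run_state = fields[2], fields[3]
        else:
            # Assumption: a single column is the run state.
            boot_state, run_state = None, fields[2]
        services[name] = {"boot": boot_state, "run": run_state}
    return services


def enabled_but_inactive(services):
    """Names of services that start at boot but are not running now."""
    return sorted(
        name
        for name, state in services.items()
        if state["boot"] == "enabled" and state["run"] == "inactive"
    )
```

Applied to the sample output above, enabled_but_inactive reports asic-monitor, clagd and frr.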
You can also run the systemctl list-unit-files --type service command
to list all services on the switch and see which ones are enabled:
The following table lists the most important services in Cumulus Linux.
Service Name | Description | Affects Forwarding?
switchd | Hardware abstraction daemon, synchronizes the kernel with the ASIC. | YES
sx_sdk | Only on Mellanox switches, interfaces with the Spectrum ASIC. | YES
portwd | Reads pluggable information over the I2C bus. Identifies and classifies the optics that are inserted into the system. Sets interface speeds and capabilities to match the optics. | YES, eventually, if optics are added/removed
frr | FRRouting, handles routing protocols. There are separate processes for each routing protocol, like bgpd and ospfd. |
switchd is the daemon at the heart of Cumulus Linux. It communicates
between the switch and Cumulus Linux, and all the applications running
on Cumulus Linux.
The switchd configuration is stored in /etc/cumulus/switchd.conf.
The switchd File System
switchd also exports a file system, mounted on /cumulus/switchd,
that presents all the switchd configuration options as a series of
files arranged in a tree structure. You can see the contents by parsing
the switchd tree; run tree /cumulus/switchd. The output below is for
a switch with one switch port configured:
You can use cl-cfg to configure many switchd parameters at runtime
(like ACLs, interfaces, and route table utilization), which minimizes
disruption to your running switch. However, some options are read only
and cannot be configured at runtime.
You can show some of this information by running cl-resource-query. In Cumulus Linux 3.7.11 and later, you can run the NCLU command equivalent: net show system asic.
Restart switchd
Whenever you modify any switchd hardware configuration file (typically
changing any *.conf file that requires making a change to the
switching hardware, like /etc/cumulus/datapath/traffic.conf), you must
restart switchd for the change to take effect:
You do not have to restart the switchd service when you update a
network interface configuration (that is, edit
/etc/network/interfaces).
Restarting switchd causes all network ports to reset in addition to
resetting the switch hardware configuration.
Power over Ethernet - PoE
Cumulus Linux supports Power over Ethernet (PoE) and PoE+, so certain
Cumulus Linux switches can supply power from Ethernet switch ports to
enabled devices over the Ethernet cables that connect them. PoE is
capable of powering devices up to 15W, while PoE+ can power devices up to 30W.
Configuration for power negotiation is done over
LLDP.
PoE functionality is provided by the cumulus-poe package. When a
powered device is connected to the switch via an Ethernet cable:
If the available power is greater than the power required by the
connected device, power is supplied to the switch port, and the
device powers on
If available power is less than the power required by the connected
device and the switch port’s priority is less than the port priority
set on all powered ports, power is not supplied to the port
If available power is less than the power required by the connected
device and the switch port’s priority is greater than the priority
of a currently powered port, power is removed from lower priority
port(s) and power is supplied to the port
If the total consumed power exceeds the configured power limit of
the power source, low priority ports are turned off. In the case of
a tie, the port with the lower port number gets priority
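The admission rules above can be sketched as a small decision function. This is one interpretation of the listed behavior, not the poed implementation: it assumes that available power equal to the requirement is sufficient, that only ports with a strictly lower priority are powered off, and that among equal priorities the higher-numbered port is powered off first. It does not model the configured power limit of the power source. All names are hypothetical.

```python
PRIORITY = {"low": 0, "high": 1, "critical": 2}


def admit_port(available_w, required_w, priority, powered):
    """Decide whether a newly connected device gets power.

    powered: list of (port_number, priority, watts) for ports drawing power.
    Returns (granted, ports_to_power_off).
    """
    if available_w >= required_w:
        return True, []
    # Candidates to shed: ports with a strictly lower priority than the new
    # port, lowest priority first; on a tie, the higher port number goes first.
    candidates = sorted(
        (p for p in powered if PRIORITY[p[1]] < PRIORITY[priority]),
        key=lambda p: (PRIORITY[p[1]], -p[0]),
    )
    shed, freed = [], 0.0
    for port, _prio, watts in candidates:
        shed.append(port)
        freed += watts
        if available_w + freed >= required_w:
            return True, shed
    # Not enough power even after shedding every lower-priority port.
    return False, []
```

For example, with 5 W available, a high-priority device that needs 15 W displaces one of two low-priority ports drawing 15 W each, and the higher-numbered port is the one that loses power.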
Power is available as follows:
PSU 1 | PSU 2 | PoE Power Budget
920W | x | 750W
x | 920W | 750W
920W | 920W | 1650W
The AS4610-54P has an LED on the front panel to indicate PoE status:
Green: The poed daemon is running and no errors are detected
Yellow: One or more errors are detected or the poed daemon is not
running
Link state and PoE state are completely independent of each other. When
a link is brought down on a particular port using ip link set <port> down,
power on that port is not turned off; however, LLDP negotiation is not
possible.
Configure PoE
You use the poectl command utility to configure PoE on a switch that
supports the feature. You can:
Enable or disable PoE for a given switch port
Set a switch port’s PoE priority to one of three values: low,
high or critical
The PoE configuration resides in /etc/cumulus/poe.conf. The file lists
all the switch ports, whether PoE is enabled for those ports and the
priority for each port.
By default, PoE and PoE+ are enabled on all Ethernet/1G switch ports,
and these ports are set with a low priority. Switch ports can have low,
high or critical priority.
There is no additional configuration for PoE+.
To change the priority for one or more switch ports, run
poectl -p swp# [low|high|critical]. For example:
cumulus@switch:~$ sudo poectl -p swp1-swp5,swp7 high
To disable PoE for one or more ports, run poectl -d [port_numbers]:
cumulus@switch:~$ sudo poectl -d swp1-swp5,swp7
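The port list syntax used in these commands (for example, swp1-swp5,swp7) mixes ranges and single ports. When you generate such lists from a script, it can help to expand them to check which ports are covered. The sketch below models only the swpN and swpN-swpM forms shown in this section; how poectl parses the list internally is not covered here, and the function name is hypothetical.

```python
import re


def expand_port_list(port_list):
    """Expand 'swp1-swp5,swp7' into ['swp1', ..., 'swp5', 'swp7']."""
    ports = []
    for item in port_list.split(","):
        item = item.strip()
        span = re.fullmatch(r"swp(\d+)-swp(\d+)", item)
        if span:
            first, last = int(span.group(1)), int(span.group(2))
            if first > last:
                raise ValueError("range is reversed: " + item)
            ports.extend("swp{}".format(n) for n in range(first, last + 1))
        elif re.fullmatch(r"swp\d+", item):
            ports.append(item)
        else:
            raise ValueError("unrecognized port specification: " + item)
    return ports
```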
To display PoE information for a set of switch ports, run
poectl -i [port_numbers]:
cumulus@switch:~$ sudo poectl -i swp10-swp13
Port Status Allocated Priority PD type PD class Voltage Current Power
----- -------------------- ----------- -------- ----------- -------- ------- ------- ---------
swp10 connected negotiating low IEEE802.3at 4 53.5 V 25 mA 3.9 W
swp11 searching n/a low IEEE802.3at none 0.0 V 0 mA 0.0 W
swp12 connected n/a low IEEE802.3at 2 53.5 V 25 mA 1.4 W
swp13 connected 51.0 W low IEEE802.3at 4 53.6 V 72 mA 3.8 W
The Status can be one of the following:
searching: PoE is enabled but no device has been detected.
disabled: The PoE port has been configured as disabled.
connected: A powered device is connected and receiving power.
power-denied: There is insufficient PoE power available to
enable the connected device.
The Allocated column displays how much PoE power has been allocated
to the port, which can be one of the following:
n/a: No device is connected or the connected device does not
support LLDP negotiation.
negotiating: An LLDP-capable device is connected and is
negotiating for PoE power.
XX.X W: An LLDP-capable device has negotiated for XX.X watts of
power (for example, 51.0 watts for swp13 above).
To see all the PoE information for a switch, run poectl -s:
cumulus@switch:~$ poectl -s
System power:
Total: 730.0 W
Used: 11.0 W
Available: 719.0 W
Connected ports:
swp11, swp24, swp27, swp48
The set commands (priority, enable, disable) either succeed silently or
display an error message if the command fails.
poectl Arguments
The poectl command takes the following arguments:
Argument | Description
-h, --help | Show this help message and exit
-i, --port-info PORT_LIST | Returns detailed information for the specified ports. You can specify a range of ports. For example: -i swp1-swp5,swp10. On an Edge-Core AS4610-54P switch, the voltage reported by the poectl -i command and measured through a power meter connected to the device varies by 5V. The current and power readings are correct and no difference is seen for them.
-a, --all | Returns PoE status and detailed information for all ports.
-p, --priority PORT_LIST PRIORITY | Sets priority for the specified ports: low, high, critical.
-d, --disable-ports PORT_LIST | Disables PoE operation on the specified ports.
-e, --enable-ports PORT_LIST | Enables PoE operation on the specified ports.
-s, --system | Returns PoE status for the entire switch.
-r, --reset PORT_LIST | Performs a hardware reset on the specified ports. Use this if one or more ports are stuck in an error state. This does not reset any configuration settings for the specified ports.
-v, --version | Displays version information.
-j, --json | Displays output in JSON format.
--save | Saves the current configuration. The saved configuration is automatically loaded on system boot.
--load | Loads and applies the saved configuration.
Troubleshooting
You can troubleshoot PoE and PoE+ using the following utilities and
files:
poectl -s, as described above.
The Cumulus Linux cl-support script, which includes PoE-related
output from poed.conf, syslog, poectl --diag-info and
lldpctl.
lldpcli show neighbors ports <swp> protocol lldp hidden details
tcpdump -v -v -i <swp> ether proto 0x88cc
The contents of the PoE/PoE+ /etc/lldpd.d/poed.conf configuration
file, as described above.
Verify the Link Is Up
LLDP requires network connectivity, so verify that the link is up.
cumulus@switch:~$ net show interface swp20
Name MAC Speed MTU Mode
-- ------ ----------------- ------- ----- ---------
UP swp20 44:38:39:00:00:04 1G 1500 Access/L2
View LLDP Information Using lldpcli
You can run lldpcli to view the LLDP information that has been
received on a switch port. For example:
cumulus@switch:~$ sudo lldpcli show neighbors ports swp20 protocol lldp hidden details
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface: swp20, via: LLDP, RID: 2, Time: 0 day, 00:03:34
Chassis:
ChassisID: mac 68:c9:0b:25:54:7c
SysName: ihm-ubuntu
SysDescr: Ubuntu 14.04.2 LTS Linux 3.14.4+ #1 SMP Thu Jun 26 00:54:44 UTC 2014 armv7l
MgmtIP: fe80::6ac9:bff:fe25:547c
Capability: Bridge, off
Capability: Router, off
Capability: Wlan, off
Capability: Station, on
Port:
PortID: mac 68:c9:0b:25:54:7c
PortDescr: eth0
PMD autoneg: supported: yes, enabled: yes
Adv: 10Base-T, HD: yes, FD: yes
Adv: 100Base-TX, HD: yes, FD: yes
MAU oper type: 100BaseTXFD - 2 pair category 5 UTP, full duplex mode
MDI Power: supported: yes, enabled: yes, pair control: no
Device type: PD
Power pairs: spare
Class: class 4
Power type: 2
Power Source: Primary power source
Power Priority: low
PD requested power Value: 51000
PSE allocated power Value: 51000
UnknownTLVs:
TLV: OUI: 00,01,42, SubType: 1, Len: 1 05
TLV: OUI: 00,01,42, SubType: 1, Len: 1 0D
-------------------------------------------------------------------------------
View LLDP Information Using tcpdump
You can use tcpdump to view the LLDP frames being transmitted and
received. For example:
cumulus@switch:~$ sudo tcpdump -v -v -i swp20 ether proto 0x88cc
tcpdump: listening on swp20, link-type EN10MB (Ethernet), capture size 262144 bytes
18:41:47.559022 LLDP, length 211
Chassis ID TLV (1), length 7
Subtype MAC address (4): 00:30:ab:f2:d7:a5 (oui Unknown)
0x0000: 0400 30ab f2d7 a5
Port ID TLV (2), length 6
Subtype Interface Name (5): swp20
0x0000: 0573 7770 3230
Time to Live TLV (3), length 2: TTL 120s
0x0000: 0078
System Name TLV (5), length 13: dni-3048up-09
0x0000: 646e 692d 3330 3438 7570 2d30 39
System Description TLV (6), length 68
Cumulus Linux version 3.0.1~1466303042.2265c10 running on dni 3048up
0x0000: 4375 6d75 6c75 7320 4c69 6e75 7820 7665
0x0010: 7273 696f 6e20 332e 302e 317e 3134 3636
0x0020: 3330 3330 3432 2e32 3236 3563 3130 2072
0x0030: 756e 6e69 6e67 206f 6e20 646e 6920 3330
0x0040: 3438 7570
System Capabilities TLV (7), length 4
System Capabilities [Bridge, Router] (0x0014)
Enabled Capabilities [Router] (0x0010)
0x0000: 0014 0010
Management Address TLV (8), length 12
Management Address length 5, AFI IPv4 (1): 10.0.3.190
Interface Index Interface Numbering (2): 2
0x0000: 0501 0a00 03be 0200 0000 0200
Management Address TLV (8), length 24
Management Address length 17, AFI IPv6 (2): fe80::230:abff:fef2:d7a5
Interface Index Interface Numbering (2): 2
0x0000: 1102 fe80 0000 0000 0000 0230 abff fef2
0x0010: d7a5 0200 0000 0200
Port Description TLV (4), length 5: swp20
0x0000: 7377 7032 30
Organization specific TLV (127), length 9: OUI IEEE 802.3 Private (0x00120f)
Link aggregation Subtype (3)
aggregation status [supported], aggregation port ID 0
0x0000: 0012 0f03 0100 0000 00
Organization specific TLV (127), length 9: OUI IEEE 802.3 Private (0x00120f)
MAC/PHY configuration/status Subtype (1)
autonegotiation [supported, enabled] (0x03)
PMD autoneg capability [10BASE-T fdx, 100BASE-TX fdx, 1000BASE-T fdx] (0x2401)
MAU type 100BASEFX fdx (0x0012)
0x0000: 0012 0f01 0324 0100 12
Organization specific TLV (127), length 12: OUI IEEE 802.3 Private (0x00120f)
Power via MDI Subtype (2)
MDI power support [PSE, supported, enabled], power pair spare, power class class4
0x0000: 0012 0f02 0702 0513 01fe 01fe
Organization specific TLV (127), length 5: OUI Unknown (0x000142)
0x0000: 0001 4201 0d
Organization specific TLV (127), length 5: OUI Unknown (0x000142)
0x0000: 0001 4201 01
End TLV (0), length 0
Log poed Events in syslog
The poed service logs the following events to syslog:
When a switch provides power to a powered device.
When a device that was receiving power is removed.
When the power available to the switch changes.
When errors are detected.
Configuring a Global Proxy
You configure global HTTP and HTTPS proxies in the /etc/profile.d/ directory of Cumulus Linux. To do so, set the http_proxy and https_proxy variables, which tell the switch the address of the proxy server to use to fetch URLs on the command line. This is useful for programs such as apt/apt-get, curl and wget, which can all use this proxy.
In a terminal, create a new file in the /etc/profile.d/ directory. In the code example below, the file is called proxy.sh, and is created using the text editor nano.
Create a file in the /etc/apt/apt.conf.d directory and add the following lines to the file for acquiring the HTTP and HTTPS proxies; the example below uses http_proxy as the file name:
Cumulus Linux implements an HTTP (web) application programming interface
to the OpenStack ML2 driver and the NCLU API. Rather than accessing
Cumulus Linux using SSH, you can interact with the switch using an HTTP
client, such as cURL, HTTPie or a web browser.
The HTTP API service is enabled by default on chassis hardware only.
However, the associated server is configured to only listen to traffic
originating from within the chassis.
The service is not enabled by default on non-chassis hardware.
HTTP API Basics
If you are upgrading from a version of Cumulus Linux earlier than 3.4.0,
the supporting software for the API may not be installed. Install the
required software with the following command.
The first configuration file is used for non-chassis hardware; the
second, for chassis hardware.
Generally, only the configuration file relevant to your hardware needs
to be edited, as the associated services determine the appropriate
configuration file to use at run time.
Enable External Traffic on a Chassis
The HTTP API services are configured to listen on port 8080 for chassis
hardware by default. However, only HTTP traffic originating from
internal link-local management IPv6 addresses is allowed. To configure the
services to also accept HTTP requests originating from external sources:
Open /etc/nginx/sites-available/nginx-restapi-chassis.conf in a
text editor.
Uncomment the server block lines near the end of the file.
Change the port on the now uncommented listen line if the default
value, 8080, is not the preferred port, and save the configuration
file.
The IP:port combinations that services listen to can be modified by
changing the parameters of the listen directive(s). By default,
nginx-restapi.conf has only one listen parameter, whereas
/etc/nginx/sites-available/nginx-restapi-chassis.conf has two
independently configurable server blocks, each with a listen
directive. One server block is for external traffic, and the other for
internal traffic.
All URLs must use HTTPS, rather than HTTP.
For more information on the listen directive, refer to the
NGINX documentation.
Do not set the same listening port for internal and external chassis
traffic.
Security
Authentication
The default configuration requires all HTTP requests from external
sources (not internal switch traffic) to set the HTTP Basic
Authentication header.
The user and password should correspond to a user on the host switch.
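As an illustration of what an HTTP client sends, the sketch below builds the value of the Basic Authentication header for a user and password. The helper name basic_auth_header is hypothetical, and the credentials are the placeholder values user and pw used in the cURL examples in this section; cURL builds the same header for you when you pass -u user:pw.

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Return the Authorization header value for HTTP Basic Authentication."""
    # Basic Authentication is the base64 encoding of "user:password".
    credentials = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


if __name__ == "__main__":
    # Placeholder credentials; replace with a real user on the switch.
    print("Authorization:", basic_auth_header("user", "pw"))
```

Because base64 is an encoding and not encryption, the header is only protected by the TLS transport described in the next section.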
Transport Layer Security
All traffic must be secured in transport using TLSv1.2 by default. Cumulus Linux contains a self-signed certificate and private key used server-side in this application so that it works out of the box, but using your own certificates and keys is recommended. Certificates must be in the PEM format.
Do not copy the cumulus.pem or cumulus.key files. After
installation, edit the ssl_certificate and ssl_certificate_key
values in the configuration file for your hardware.
cURL Examples
This section contains several example cURL commands for sending HTTP
requests to a non-chassis host. The following settings are used for
these examples:
Username: user
Password: pw
IP: 192.168.0.32
Port: 8080
Requests for NCLU require the Content-Type request header to be
set to application/json.
The cURL -k flag is necessary when the server uses a self-signed
certificate. This is the default configuration (see the Security section). To display the response
headers, include the -D flag in the command.
To retrieve a list of all available HTTP endpoints:
cumulus@switch:~$ curl -X GET -k -u user:pw https://192.168.0.32:8080
To run net show counters on the host as a remote procedure call:
By default, ifupdown is quiet; use the verbose option -v when you
want to know what is going on when bringing an interface down or up.
Basic Commands
To bring up an interface or apply changes to an existing interface, run:
cumulus@switch:~$ sudo ifup <ifname>
To bring down a single interface, run:
cumulus@switch:~$ sudo ifdown <ifname>
ifdown always deletes logical interfaces after bringing them down. Use
the --admin-state option if you only want to administratively bring
the interface up or down.
To see the link and administrative state, use the ip link show
command:
cumulus@switch:~$ ip link show dev swp1
3: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 500
link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
In this example, swp1 is administratively UP and the physical link is UP
(LOWER_UP flag). More information on interface administrative state and
physical state can be found in this knowledge base article.
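The two states can also be read programmatically from the flag list in the ip link show output. The following is a minimal sketch; link_state is a hypothetical helper, and it assumes the first line of the output has the angle-bracket flag format shown above.

```python
import re


def link_state(ip_link_line: str) -> dict:
    """Extract administrative and physical link state from an 'ip link show' line."""
    match = re.search(r"<([^>]*)>", ip_link_line)
    flags = set(match.group(1).split(",")) if match else set()
    return {
        "admin_up": "UP" in flags,          # interface is administratively up
        "carrier_up": "LOWER_UP" in flags,  # physical link is up
    }


if __name__ == "__main__":
    sample = ("3: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 "
              "qdisc pfifo_fast state UP mode DEFAULT qlen 500")
    print(link_state(sample))
```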
To put an interface into an admin down state, use the link down option. The
interface remains down after any future reboots or after you apply configuration
changes with ifreload -a. For example:
cumulus@switch:~$ net add interface swp1 link down
These commands create the following configuration in the
/etc/network/interfaces file:
auto swp1
iface swp1
link-down yes
ifupdown2 Interface Classes
ifupdown2 provides for the grouping of interfaces into separate
classes, where a class is a user-defined label that groups interfaces
sharing a common function (like uplink, downlink or compute). You
specify classes in the /etc/network/interfaces file.
The most common class is auto, which you configure like this:
auto swp1
iface swp1
You can add other classes using the allow prefix. For example, if you
have multiple interfaces used for uplinks, you can make up a class
called uplinks:
auto swp1
allow-uplinks swp1
iface swp1 inet static
address 10.1.1.1/31
auto swp2
allow-uplinks swp2
iface swp2 inet static
address 10.1.1.3/31
This allows you to perform operations on only these interfaces using the
--allow=uplinks option, or still use the -a option since these
interfaces are also in the auto class:
cumulus@switch:~$ sudo ifup --allow=uplinks
cumulus@switch:~$ sudo ifreload -a
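To see how class membership follows from the auto and allow- lines, the sketch below collects the classes from an interfaces-style text. It is a simplified illustration, not the ifupdown2 parser: interface_classes and the SAMPLE text are hypothetical, and only the auto and allow-<class> keywords are handled.

```python
from collections import defaultdict

# Hypothetical sample in /etc/network/interfaces syntax.
SAMPLE = """\
auto swp1
allow-uplinks swp1
iface swp1 inet static
    address 10.1.1.1/31

auto swp2
allow-uplinks swp2
iface swp2 inet static
    address 10.1.1.3/31
"""


def interface_classes(text: str) -> dict:
    """Map each class name to the interfaces that belong to it."""
    classes = defaultdict(list)
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "auto":
            classes["auto"].extend(words[1:])
        elif words[0].startswith("allow-"):
            # "allow-uplinks swp1" puts swp1 in the class "uplinks".
            classes[words[0][len("allow-"):]].extend(words[1:])
    return dict(classes)


if __name__ == "__main__":
    print(interface_classes(SAMPLE))
```

The class name is whatever follows the allow- prefix, so the value you pass to --allow must match it exactly.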
If you are using a management VRF, you can use the special
interface class called mgmt, and put the management interface into
that class.
The mgmt interface class is not supported if you are configuring Cumulus
Linux using NCLU.
All ifupdown2 commands (ifup, ifdown, ifquery, ifreload) can
take a class. Include the --allow=<class> option when you run the
command. For example, to reload the configuration for the management
interface described above, run:
cumulus@switch:~$ sudo ifreload --allow=mgmt
You can easily bring up or down all interfaces marked with the common
auto class in /etc/network/interfaces. Use the -a option. For further details,
see individual man pages for ifup(8), ifdown(8), ifreload(8).
To administratively bring up all interfaces marked auto, run:
cumulus@switch:~$ sudo ifup -a
To administratively bring down all interfaces marked auto, run:
cumulus@switch:~$ sudo ifdown -a
To reload all network interfaces marked auto, use the ifreload
command. This is equivalent to running ifdown then ifup, except
that ifreload skips any configurations that did not
change:
cumulus@switch:~$ sudo ifreload -a
Some syntax checks are done by default. However, it can be safer to apply
the configuration only if the syntax check passes, using the following
compound command:
cumulus@switch:~$ sudo bash -c "ifreload -s -a && ifreload -a"
Configure a Loopback Interface
Cumulus Linux has a loopback preconfigured in /etc/network/interfaces.
When the switch boots up, it has a loopback interface called lo,
which is up and assigned an IP address of 127.0.0.1.
The loopback interface lo must always be specified in /etc/network/interfaces and must always be up.
ifupdown Behavior with Child Interfaces
By default, ifupdown recognizes and uses any interface present on the
system - whether a VLAN, bond or physical interface - that is listed as
a dependent of an interface. You are not required to list them in the interfaces file unless they need a specific configuration, such as MTU or link speed.
If you need to delete a child interface, delete all
references to that interface from the interfaces file.
For this example, swp1 and swp2 below do not need an entry in the interfaces file. The following stanzas defined in
/etc/network/interfaces provide the exact same configuration:
With Child Interfaces Defined
auto swp1
iface swp1
auto swp2
iface swp2
auto bridge
iface bridge
bridge-vlan-aware yes
bridge-ports swp1 swp2
bridge-vids 1-100
bridge-pvid 1
bridge-stp on
Without Child Interfaces Defined
auto bridge
iface bridge
bridge-vlan-aware yes
bridge-ports swp1 swp2
bridge-vids 1-100
bridge-pvid 1
bridge-stp on
Bridge in Traditional Mode - Example
For this example, swp1.100 and swp2.100 below do not need an entry in the interfaces file. The following stanzas defined in
/etc/network/interfaces provide the exact same configuration:
With Child Interfaces Defined
auto swp1.100
iface swp1.100
auto swp2.100
iface swp2.100
auto br-100
iface br-100
address 10.0.12.2/24
address 2001:dad:beef::3/64
bridge-ports swp1.100 swp2.100
bridge-stp on
Without Child Interfaces Defined
auto br-100
iface br-100
address 10.0.12.2/24
address 2001:dad:beef::3/64
bridge-ports swp1.100 swp2.100
bridge-stp on
For more information on the bridge in traditional mode vs the bridge in
VLAN-aware mode, please read this knowledge base article.
ifupdown2 Interface Dependencies
ifupdown2 understands interface dependency relationships. When ifup
and ifdown are run with all interfaces, they always run with all
interfaces in dependency order. When run with the interface list on the
command line, the default behavior is to not run with dependents. But if
there are any built-in dependents, they will be brought up or down.
To run with dependents when you specify the interface list, use the
--with-depends option. --with-depends walks through all dependents
in the dependency tree rooted at the interface you specify. Consider the
following example configuration:
auto bond1
iface bond1
address 100.0.0.2/16
bond-slaves swp29 swp30
auto bond2
iface bond2
address 100.0.0.5/16
bond-slaves swp31 swp32
auto br2001
iface br2001
address 12.0.1.3/24
bridge-ports bond1.2001 bond2.2001
bridge-stp on
Using ifup --with-depends br2001 brings up all dependents of br2001:
bond1.2001, bond2.2001, bond1, bond2, swp29,
swp30, swp31, swp32.
cumulus@switch:~$ sudo ifup --with-depends br2001
Similarly, specifying ifdown --with-depends br2001 brings down all
dependents of br2001: bond1.2001, bond2.2001, bond1, bond2,
swp29, swp30, swp31, swp32.
ifdown always deletes logical interfaces after bringing them down. Use the --admin-state option if you only want to
administratively bring the interface up or down. In the above
example, ifdown br2001 deletes br2001.
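The walk that --with-depends performs over the dependency tree can be pictured with a short sketch. The graph below is transcribed by hand from the example configuration; walk_dependents is a hypothetical helper, and the breadth-first order it produces is only an illustration, not necessarily the order ifupdown2 uses internally.

```python
from collections import deque

# Each interface maps to the interfaces it is built on (its dependents).
DEPENDS_ON = {
    "br2001": ["bond1.2001", "bond2.2001"],
    "bond1.2001": ["bond1"],
    "bond2.2001": ["bond2"],
    "bond1": ["swp29", "swp30"],
    "bond2": ["swp31", "swp32"],
}


def walk_dependents(root: str, graph: dict) -> list:
    """Return every dependent reachable from root, breadth first, without repeats."""
    order = []
    queue = deque(graph.get(root, []))
    while queue:
        name = queue.popleft()
        if name in order:
            continue  # already visited through another path
        order.append(name)
        queue.extend(graph.get(name, []))
    return order


if __name__ == "__main__":
    print(walk_dependents("br2001", DEPENDS_ON))
```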
To guide you through which interfaces will be brought down and up, use
the --print-dependency option to get the list of dependents.
Use ifquery --print-dependency=list -a to get the dependency list of
all interfaces:
You can use dot to render the graph on an external system where dot
is installed.
To print the dependency information of the entire interfaces file:
cumulus@switch:~$ sudo ifquery --print-dependency=dot -a >interfaces_all.dot
Subinterfaces
On Linux, an interface is a network device, and can be either a
physical device like a switch port (such as swp1), or virtual, like a VLAN
(vlan100). A VLAN subinterface is a VLAN device on an interface, and
the VLAN ID is appended to the parent interface using dot (.) VLAN
notation. For example, a VLAN with ID 100 that is a subinterface of swp1
is named swp1.100 in Cumulus Linux. The dot VLAN notation for a VLAN
device name is a standard way to specify a VLAN device on Linux. Many
Linux configuration tools, most notably ifupdown2 and its predecessor
ifupdown, recognize such a name as a VLAN interface name.
A VLAN subinterface only receives traffic tagged
for that VLAN, so swp1.100 only receives packets tagged with VLAN 100 on
switch port swp1. Similarly, any transmits from swp1.100 result in
tagging the packet with VLAN 100.
For an MLAG deployment, the peerlink interface that connects the two switches in the
MLAG pair has a VLAN subinterface named 4094 by default, provided you
configured the subinterface with NCLU.
The peerlink.4094 subinterface only receives traffic tagged for VLAN 4094.
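The dot notation is simple enough to split mechanically. The sketch below separates a subinterface name into its parent and VLAN ID; parse_subinterface is a hypothetical helper, and the 1 to 4094 range check reflects the usable 802.1Q VLAN ID range rather than anything specific to Cumulus Linux.

```python
def parse_subinterface(name: str):
    """Split 'parent.vlan' into (parent, vlan_id); return None if it is not a subinterface."""
    parent, sep, vlan = name.rpartition(".")
    if not sep or not parent or not vlan.isdigit():
        return None
    vlan_id = int(vlan)
    if not 1 <= vlan_id <= 4094:
        return None  # outside the usable 802.1Q VLAN ID range
    return parent, vlan_id


if __name__ == "__main__":
    for name in ("swp1.100", "peerlink.4094", "swp1"):
        print(name, "->", parse_subinterface(name))
```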
ifup and Upper (Parent) Interfaces
When you run ifup on a logical interface (like a bridge, bond or VLAN
interface), if the ifup resulted in the creation of the logical
interface, by default it implicitly tries to execute on the interface’s
upper (or parent) interfaces as well. This helps in most cases,
especially when a bond is brought down and up, as in the example below.
This section describes the behavior of bringing up the upper interfaces.
Consider this example configuration:
auto br100
iface br100
bridge-ports bond1.100 bond2.100
auto bond1
iface bond1
bond-slaves swp1 swp2
If you run ifdown bond1, ifdown deletes bond1 and the VLAN interface
on bond1 (bond1.100); it also removes bond1 from the bridge br100. Next,
when you run ifup bond1, it creates bond1 and the VLAN interface on
bond1 (bond1.100); it also executes ifup br100 to add the bond VLAN
interface (bond1.100) to the bridge br100.
As you can see above, implicitly bringing up the upper interface helps,
but there can be cases where an upper interface (like br100) is not in
the right state, which can result in warnings. The warnings are mostly
harmless.
If you want to disable these warnings, you can disable the implicit
upper interface handling by setting skip_upperifaces=1 in
/etc/network/ifupdown2/ifupdown2.conf.
With skip_upperifaces=1, you will have to explicitly execute ifup on
the upper interfaces. In this case, you will have to run ifup br100
after an ifup bond1 to add bond1 back to bridge br100.
Although specifying a subinterface like swp1.100 and then running ifup swp1.100 will also result in the automatic creation of the swp1 interface in the kernel, also specifying the parent interface swp1 is recommended. A parent interface is one where any physical layer configuration can reside, such as link-speed 1000 or link-duplex full.
It’s important to note that if you only create swp1.100 and not swp1,
then you cannot run ifup swp1 since you did not specify it.
Configure IP Addresses
IP addresses are configured with the net add interface command.
The following commands configure three IP addresses for swp1: two IPv4
addresses, and one IPv6 address.
cumulus@switch:~$ net add interface swp1 ip address 12.0.0.1/30
cumulus@switch:~$ net add interface swp1 ip address 12.0.0.2/30
cumulus@switch:~$ net add interface swp1 ipv6 address 2001:DB8::1/126
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following code snippet:
auto swp1
iface swp1
address 12.0.0.1/30
address 12.0.0.2/30
address 2001:DB8::1/126
You can specify both IPv4 and IPv6 addresses for the same interface.
For IPv6 addresses, you can create or modify the IP address for an
interface using either “::” or “0:0:0” notation. Both of the following
examples are valid:
cumulus@switch:~$ net add bgp neighbor 2620:149:43:c109:0:0:0:5 remote-as internal
cumulus@switch:~$ net add interface swp1 ipv6 address 2001:DB8::1/126
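Both notations describe the same 128-bit address, which you can confirm with the Python standard library ipaddress module. This is a small sketch; same_ipv6 and compressed are hypothetical helper names.

```python
import ipaddress


def same_ipv6(a: str, b: str) -> bool:
    """Return True if two textual IPv6 addresses are the same address."""
    return ipaddress.IPv6Address(a) == ipaddress.IPv6Address(b)


def compressed(address_with_prefix: str) -> str:
    """Return the address/prefix in compressed, lowercase form."""
    return str(ipaddress.IPv6Interface(address_with_prefix))


if __name__ == "__main__":
    print(same_ipv6("2620:149:43:c109:0:0:0:5", "2620:149:43:c109::5"))
    print(compressed("2001:DB8:0:0:0:0:0:1/126"))
```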
The address method and address family are added by NCLU when needed,
specifically when you are creating DHCP or loopback interfaces.
auto lo
iface lo inet loopback
To show the assigned address on an interface, use ip addr show:
cumulus@switch:~$ ip addr show dev swp1
3: swp1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 500
link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
inet 192.0.2.1/30 scope global swp1
inet 192.0.2.2/30 scope global swp1
inet6 2001:DB8::1/126 scope global tentative
valid_lft forever preferred_lft forever
Specify IP Address Scope
ifupdown2 does not honor the configured IP address scope setting in
/etc/network/interfaces, treating all addresses as global. It does not
report an error. Consider this example configuration:
auto swp2
iface swp2
address 35.21.30.5/30
address 3101:21:20::31/80
scope link
When you run ifreload -a on this configuration, ifupdown2 considers
all IP addresses as global.
cumulus@switch:~$ ip addr show swp2
5: swp2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 74:e6:e2:f5:62:82 brd ff:ff:ff:ff:ff:ff
inet 35.21.30.5/30 scope global swp2
valid_lft forever preferred_lft forever
inet6 3101:21:20::31/80 scope global
valid_lft forever preferred_lft forever
inet6 fe80::76e6:e2ff:fef5:6282/64 scope link
valid_lft forever preferred_lft forever
To work around this issue, configure the IP address scope:
cumulus@switch:~$ net add interface swp6 post-up ip address add 71.21.21.20/32 dev swp6 scope site
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following code snippet in the
/etc/network/interfaces file:
auto swp6
iface swp6
post-up ip address add 71.21.21.20/32 dev swp6 scope site
Now it has the correct scope:
cumulus@switch:~$ ip addr show swp6
9: swp6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 74:e6:e2:f5:62:86 brd ff:ff:ff:ff:ff:ff
inet 71.21.21.20/32 scope site swp6
valid_lft forever preferred_lft forever
inet6 fe80::76e6:e2ff:fef5:6286/64 scope link
valid_lft forever preferred_lft forever
Purge Existing IP Addresses on an Interface
By default, ifupdown2 purges existing IP addresses on an interface. If
you have other processes that manage IP addresses for an interface, you
can disable this feature by including the address-purge setting in the
interface’s configuration.
cumulus@switch:~$ net add interface swp1 address-purge no
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following configuration snippet in the
/etc/network/interfaces file:
auto swp1
iface swp1
address-purge no
Purging existing addresses on interfaces with multiple iface stanzas
is not supported. Doing so can result in the configuration of multiple
addresses for an interface after you change an interface address and
reload the configuration with ifreload -a. If this happens, you must
shut down and restart the interface with ifup and ifdown, or
manually delete superfluous addresses with ip address delete specify.ip.address.here/mask dev DEVICE. See also the Caveats and Errata
section below for some cautions about using multiple iface stanzas for the same interface.
Specify User Commands
You can specify additional user commands in the interfaces file. As
shown in the example below, the interface stanzas in
/etc/network/interfaces can have a command that runs at pre-up, up,
post-up, pre-down, down, and post-down:
cumulus@switch:~$ net add interface swp1 post-up /sbin/foo bar
cumulus@switch:~$ net add interface swp1 ip address 12.0.0.1/30
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following configuration in the
/etc/network/interfaces file:
auto swp1
iface swp1
address 12.0.0.1/30
post-up /sbin/foo bar
Any valid command can be hooked in the sequencing of bringing an
interface up or down, although commands should be limited in scope to
network-related commands associated with the particular interface.
For example, it wouldn’t make sense to install some Debian package on
ifup of swp1, even though that is technically possible. See man interfaces for more details.
If your post-up command also starts, restarts or reloads any systemd
service, you must use the --no-block option with systemctl.
Otherwise, that service or even the switch itself may hang after
starting or restarting.
For example, to restart the dhcrelay service after bringing up VLAN
100, first run:
cumulus@switch:~$ net add vlan 100 post-up systemctl --no-block restart dhcrelay.service
This command creates the following configuration in the
/etc/network/interfaces file:
Sourcing interface files helps organize and manage the interfaces
file. For example:
cumulus@switch:~$ cat /etc/network/interfaces
# The loopback network interface
auto lo
iface lo inet loopback
# The primary network interface
auto eth0
iface eth0 inet dhcp
source /etc/network/interfaces.d/bond0
NCLU supports globs to define port lists (that is, a range of ports).
The glob keyword is implied when you specify bridge ports and bond
slaves:
cumulus@switch:~$ net add bridge bridge ports swp1-4,6,10-12
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
While you must use commas to separate different ranges of ports in the
NCLU command, the /etc/network/interfaces file renders the list of ports individually, as in the example output below.
These commands produce the following snippet in the
/etc/network/interfaces file:
...
auto bridge
iface bridge
bridge-ports swp1 swp2 swp3 swp4 swp6 swp10 swp11 swp12
bridge-vlan-aware yes
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
auto swp6
iface swp6
auto swp10
iface swp10
auto swp11
iface swp11
auto swp12
iface swp12
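The expansion from the comma-separated range form to the individual port list can be sketched as follows. This is not the NCLU implementation: expand_ports is a hypothetical helper that handles only the simple prefix-plus-ranges form used in the example above.

```python
import re


def expand_ports(spec: str) -> list:
    """Expand a port list such as 'swp1-4,6,10-12' into individual port names."""
    match = re.match(r"([A-Za-z]+)(.+)", spec)
    if not match:
        raise ValueError(f"unrecognized port specification: {spec!r}")
    prefix, rest = match.groups()
    ports = []
    for part in rest.split(","):
        part = part.strip()
        if part.startswith(prefix):
            part = part[len(prefix):]  # allow "swp1-4,swp6" as well
        if "-" in part:
            first, last = part.split("-")
            ports.extend(f"{prefix}{n}" for n in range(int(first), int(last) + 1))
        else:
            ports.append(f"{prefix}{int(part)}")
    return ports


if __name__ == "__main__":
    print(" ".join(expand_ports("swp1-4,6,10-12")))
```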
Mako Templates
ifupdown2 supports Mako-style templates. The Mako template engine is run over the interfaces file before parsing.
While ifupdown2 supports Mako templates, NCLU does not understand them. As a result, NCLU cannot read or write to the /etc/network/interfaces file.
Use the template to declare cookie-cutter bridges in the interfaces
file:
%for v in [11,12]:
auto vlan${v}
iface vlan${v}
address 10.20.${v}.3/24
bridge-ports glob swp19-20.${v}
bridge-stp on
%endfor
And use it to declare addresses in the interfaces file:
%for i in [1,12]:
auto swp${i}
iface swp${i}
address 10.20.${i}.3/24
%endfor
Regarding Mako syntax, use square brackets ([1,12]) to specify a list
of individual numbers (in this case, 1 and 12). Use range(1,12) to
specify a range of interfaces.
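Mako evaluates the expression in a %for line as Python, so you can check the difference between the two forms in plain Python. The sketch below does not use Mako itself; render_stanzas is a hypothetical helper that mimics the loop body. Note that range(1,12) excludes its end value, so it covers 1 through 11; to include 12 you would write range(1,13).

```python
def render_stanzas(values) -> str:
    """Mimic the %for loop above: emit one swp stanza per value."""
    lines = []
    for i in values:
        lines.append(f"auto swp{i}")
        lines.append(f"iface swp{i}")
        lines.append(f"    address 10.20.{i}.3/24")
    return "\n".join(lines)


as_list = [1, 12]              # exactly two interfaces: swp1 and swp12
as_range = list(range(1, 12))  # swp1 through swp11; the end value is excluded

if __name__ == "__main__":
    print(render_stanzas(as_list))
    print(len(as_range), "interfaces in range(1, 12)")
```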
You can test your template and confirm it evaluates correctly by running
mako-render /etc/network/interfaces.
To comment out content in Mako templates, use double hash marks (##).
For example:
## % for i in range(1, 4):
## auto swp${i}
## iface swp${i}
## % endfor
##
Run ifupdown Scripts under /etc/network/ with ifupdown2
Unlike the traditional ifupdown system, ifupdown2 does not run scripts installed in /etc/network/*/ automatically to configure
network interfaces.
To enable or disable ifupdown2 scripting, edit the addon_scripts_support line in the /etc/network/ifupdown2/ifupdown2.conf file. 1 enables scripting and 2 disables scripting. The following example enables scripting.
cumulus@switch:~$ sudo nano /etc/network/ifupdown2/ifupdown2.conf
# Support executing of ifupdown style scripts.
# Note that by default python addon modules override scripts with the same name
addon_scripts_support=1
ifupdown2 sets the following environment variables when executing
commands:
$IFACE represents the physical name of the interface being processed; for example, br0 or vxlan42. The name is obtained from the /etc/network/interfaces file.
$LOGICAL represents the logical name (configuration name) of the interface being processed.
$METHOD represents the address method; for example, loopback, DHCP, DHCP6, manual, static, and so on.
$ADDRFAM represents the address families associated with the interface, formatted in a comma-separated list;
for example, "inet,inet6".
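A script that ifupdown2 runs can read these variables from its environment. The following is a minimal sketch of such a script written in Python; the function name ifupdown_context and the sample values are hypothetical, and the only assumption about ifupdown2 is that it exports the four variables described above.

```python
import os


def ifupdown_context(env=None) -> dict:
    """Collect the ifupdown2 variables from an environment mapping."""
    env = os.environ if env is None else env
    return {
        "iface": env.get("IFACE", ""),
        "logical": env.get("LOGICAL", ""),
        "method": env.get("METHOD", ""),
        # ADDRFAM is a comma-separated list such as "inet,inet6".
        "families": [f for f in env.get("ADDRFAM", "").split(",") if f],
    }


if __name__ == "__main__":
    context = ifupdown_context()
    print(f"{context['iface']}: method={context['method']} "
          f"families={context['families']}")
```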
Add Descriptions to Interfaces
You can add descriptions to the interfaces configured in
/etc/network/interfaces by using the alias keyword.
The following commands create an alias for swp1:
cumulus@switch:~$ net add interface swp1 alias hypervisor_port_1
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following code snippet:
auto swp1
iface swp1
alias hypervisor_port_1
You can query the interface description using NCLU:
cumulus@switch:~$ net show interface swp1
    Name  MAC                Speed    MTU    Mode
--  ----  -----------------  -------  -----  ---------
UP  swp1  44:38:39:00:00:04  1G       1500   Access/L2
Alias
-----
hypervisor_port_1
Interface descriptions also appear in the SNMP OID IF-MIB::ifAlias.
Aliases are limited to 256 characters.
Avoid using apostrophes or non-ASCII characters in the alias string. Cumulus Linux does not parse these characters.
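You can check an alias against these restrictions before you configure it. The sketch below is a hypothetical pre-check, not part of NCLU; alias_problems simply applies the three rules stated above.

```python
MAX_ALIAS_LENGTH = 256


def alias_problems(alias: str) -> list:
    """Return a list of reasons the alias breaks the documented restrictions."""
    problems = []
    if len(alias) > MAX_ALIAS_LENGTH:
        problems.append("longer than 256 characters")
    if "'" in alias:
        problems.append("contains an apostrophe")
    if not alias.isascii():
        problems.append("contains non-ASCII characters")
    return problems


if __name__ == "__main__":
    for candidate in ("hypervisor_port_1", "bob's port"):
        print(repr(candidate), alias_problems(candidate) or "ok")
```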
To show the interface description (alias) for all interfaces on the switch,
run the net show interface alias command. For example:
cumulus@switch:~$ net show interface alias
State  Name           Mode           Alias
-----  -------------  -------------  ------------------
UP     bond01         LACP
UP     bond02         LACP
UP     bridge         Bridge/L2
UP     eth0           Mgmt
UP     lo             Loopback       loopback interface
UP     mgmt           Interface/L3
UP     peerlink       LACP
UP     peerlink.4094  SubInt/L3
UP     swp1           BondMember     hypervisor_port_1
UP     swp2           BondMember     to Server02
...
To show the interface description for all interfaces on the switch in
JSON format, run the net show interface alias json command.
Caveats and Errata
While ifupdown2 supports the inclusion of multiple iface stanzas for
the same interface, use a single iface stanza for each interface, if possible.
There are cases where you must specify more than one iface stanza for
the same interface. For example, the configuration for a single
interface can come from many places, like a template or a sourced file.
If you do specify multiple iface stanzas for the same interface, make
sure the stanzas do not specify the same interface attributes.
Otherwise, unexpected behavior can result.
As well as /etc/network/interfaces.d/speed_settings
cumulus@switch:~$ cat /etc/network/interfaces.d/speed_settings
auto swp1
iface swp1
link-speed 1000
link-duplex full
ifupdown2 correctly parses a configuration like this because the same
attributes are not specified in multiple iface stanzas.
And, as stated in the note above, you cannot purge existing addresses on
interfaces with multiple iface stanzas.
ifupdown2 and sysctl
For sysctl commands in the pre-up, up, post-up, pre-down, down, and post-down lines that use the
$IFACE variable, if the interface name contains a dot (.), ifupdown2 does not change the name to work with sysctl. For example, the interface name bridge.1 is not
converted to bridge/1.
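If you generate such commands from a script, you can do the conversion yourself. The sketch below builds the sysctl key and the equivalent /proc/sys path for an interface; sysctl_key and proc_path are hypothetical helpers, and the sketch assumes the sysctl convention that a slash in a dotted key stands for a literal dot in the name.

```python
def sysctl_key(iface: str, setting: str, family: str = "ipv4") -> str:
    """Build a dotted sysctl key, replacing dots in the interface name with slashes."""
    escaped = iface.replace(".", "/")
    return f"net.{family}.conf.{escaped}.{setting}"


def proc_path(iface: str, setting: str, family: str = "ipv4") -> str:
    """Build the /proc/sys path, where the interface name is used unchanged."""
    return f"/proc/sys/net/{family}/conf/{iface}/{setting}"


if __name__ == "__main__":
    print(sysctl_key("bridge.1", "forwarding"))
    print(proc_path("bridge.1", "forwarding"))
```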
Interface Name Limitations
Interface names are limited to 15 characters in length, the first character cannot be a number and the name cannot include a dash (-). In addition, any name that matches with the regular expression .{0,13}\-v.* is not supported.
If you encounter issues, remove the interface name from the /etc/network/interfaces file, then restart the networking.service.
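The naming rules above can be checked before you add an interface to the configuration. The following is a minimal sketch; name_problems is a hypothetical helper that applies only the rules listed in this section.

```python
import re

# Pattern quoted from the limitation above.
UNSUPPORTED_PATTERN = re.compile(r".{0,13}\-v.*")


def name_problems(name: str) -> list:
    """Return a list of reasons the interface name breaks the documented limits."""
    problems = []
    if not name or len(name) > 15:
        problems.append("length must be 1 to 15 characters")
    if name[:1].isdigit():
        problems.append("first character is a number")
    if "-" in name:
        problems.append("contains a dash")
    if UNSUPPORTED_PATTERN.match(name):
        problems.append("matches the unsupported pattern")
    return problems


if __name__ == "__main__":
    for candidate in ("swp1", "1uplink", "vrf-v1"):
        print(candidate, name_problems(candidate) or "ok")
```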
Hardware datapath configuration manages packet buffering, queueing and
scheduling in hardware. There are two configuration input files:
/etc/cumulus/datapath/traffic.conf, which describes priority
groups and assigns the scheduling algorithm and weights
/usr/lib/python2.7/dist-packages/cumulus/__chip_config/[bcm|mlx]/datapath.conf, which assigns buffer space and egress queues
The default thresholds defined in the datapath.conf file are intended for data center environments, but certain workloads may require additional tuning. It is best to make small, incremental changes and validate each change against your application performance. Be sure to back up the original file before making changes.
Each packet is assigned to an ASIC Class of Service (CoS) value based on
the packet’s priority value stored in the 802.1p (Class of Service) or
DSCP (Differentiated Services Code Point) header field. The choice to
schedule packets based on CoS or DSCP is a configurable option in the
/etc/cumulus/datapath/traffic.conf file.
Priority groups include:
Control: Highest priority traffic
Service: Second-highest priority traffic
Bulk: All remaining traffic
The scheduler is configured to use a hybrid scheduling algorithm. It
applies strict priority to control traffic queues and a weighted round
robin selection from the remaining queues. Unicast packets and multicast
packets with the same priority value are assigned to separate queues,
which are assigned equal scheduling weights.
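To get a feel for the hybrid algorithm, the sketch below computes how the bandwidth left over after strict-priority traffic would be divided among the weighted groups. It is an idealized model that assumes every weighted group always has traffic waiting; weighted_shares is a hypothetical helper, and the weights are the ones used in the sample traffic.conf file later in this section (control 0, service 32, bulk 16, where 0 means strict priority).

```python
def weighted_shares(weights: dict) -> dict:
    """Return each weighted group's fraction of the non-strict bandwidth.

    Groups with weight 0 are strict priority and are served first,
    so they are left out of the weighted split.
    """
    weighted = {name: w for name, w in weights.items() if w > 0}
    total = sum(weighted.values())
    if total == 0:
        return {}
    return {name: w / total for name, w in weighted.items()}


if __name__ == "__main__":
    shares = weighted_shares({"control": 0, "service": 32, "bulk": 16})
    for name, share in shares.items():
        print(f"{name}: {share:.1%} of the remaining bandwidth")
```

With these weights, the service group receives twice the share of the bulk group whenever both are congested.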
Datapath configuration takes effect when you initialize switchd.
Changes to the traffic.conf file require you to restart the switchd service.
You can configure Quality of Service (QoS) for switches on the following
platforms only:
Broadcom Tomahawk, Trident II, Trident II+ and Trident3
Mellanox Spectrum
Commands
If you modify the configuration in the /etc/cumulus/datapath/traffic.conf file, you must restart switchd for the changes to take effect:
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
Example Configuration File
The following example /etc/cumulus/datapath/traffic.conf datapath
configuration file applies to 10G, 40G, and 100G switches on Broadcom
Tomahawk, Trident II, Trident II+, or Trident3 and Mellanox Spectrum
platforms only. However, see the note
above for all the supported ASICs.
Keep in mind the following about the configuration:
Regarding the default source packet fields and mapping, each
selected packet field should have a block of mapped values. Any
packet field value that is not specified in the configuration is
assigned to a default internal switch priority. The configuration
applies to every forwarding port unless a custom source
configuration is defined for that port (see below).
Regarding the default remark packet fields and mapping, each
selected packet field should have a block of mapped values. Any
internal switch priority value that is not specified in the
configuration is assigned to a default packet field value. The
configuration applies to every forwarding port unless a custom
remark configuration is defined for that port (see below).
Per-port source packet fields and mapping apply to the designated
set of ports.
Per-port remark packet fields and mapping apply to the designated
set of ports.
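The source mapping rule in the list above can be sketched as a lookup with a fallback. The DSCP ranges below are taken from the sample traffic.conf file in this section; internal_cos is a hypothetical helper, and the fallback value of 0 is an assumption made for illustration, because this section does not state which internal switch priority is the default.

```python
# DSCP values mapped to internal cos values, as in the sample traffic.conf file.
SOURCE_DSCP = {
    0: range(0, 8),
    1: range(8, 16),
    7: range(56, 64),
}

# Assumption for illustration only: unmapped values fall back to cos 0.
DEFAULT_COS = 0


def internal_cos(dscp: int, mapping=SOURCE_DSCP, default: int = DEFAULT_COS) -> int:
    """Return the internal cos value for a DSCP value, or the default if unmapped."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP values range from 0 to 63")
    for cos, values in mapping.items():
        if dscp in values:
            return cos
    return default


if __name__ == "__main__":
    for dscp in (3, 10, 30, 60):
        print(f"dscp {dscp} -> cos_{internal_cos(dscp)}")
```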
cumulus@switch:~$ cat /etc/cumulus/datapath/traffic.conf
#
# /etc/cumulus/datapath/traffic.conf
#
# packet header field used to determine the packet priority level
# fields include {802.1p, dscp}
traffic.packet_priority_source_set = [802.1p,dscp]
# remark packet priority value
# fields include {802.1p, dscp}
traffic.packet_priority_remark_set = [802.1p,dscp]
# packet priority remark values assigned from each internal cos value
# internal cos values {cos_0..cos_7}
# (internal cos 3 has been reserved for CPU-generated traffic)
#
# 802.1p values = {0..7}
traffic.cos_0.priority_remark.8021p = [1]
traffic.cos_1.priority_remark.8021p = [0]
traffic.cos_2.priority_remark.8021p = [3]
traffic.cos_3.priority_remark.8021p = [2]
traffic.cos_4.priority_remark.8021p = [4]
traffic.cos_5.priority_remark.8021p = [5]
traffic.cos_6.priority_remark.8021p = [7]
traffic.cos_7.priority_remark.8021p = [6]
# dscp values = {0..63}
traffic.cos_0.priority_remark.dscp = [1]
traffic.cos_1.priority_remark.dscp = [9]
traffic.cos_2.priority_remark.dscp = [17]
traffic.cos_3.priority_remark.dscp = [25]
traffic.cos_4.priority_remark.dscp = [33]
traffic.cos_5.priority_remark.dscp = [41]
traffic.cos_6.priority_remark.dscp = [49]
traffic.cos_7.priority_remark.dscp = [57]
# Per-port remark packet fields and mapping: applies to the designated set of ports.
remark.port_group_list = [remark_port_group]
remark.remark_port_group.packet_priority_remark_set = [802.1p,dscp]
remark.remark_port_group.port_set = swp1-swp4,swp6
remark.remark_port_group.cos_0.priority_remark.dscp = [2]
remark.remark_port_group.cos_1.priority_remark.dscp = [10]
remark.remark_port_group.cos_2.priority_remark.dscp = [18]
remark.remark_port_group.cos_3.priority_remark.dscp = [26]
remark.remark_port_group.cos_4.priority_remark.dscp = [34]
remark.remark_port_group.cos_5.priority_remark.dscp = [42]
remark.remark_port_group.cos_6.priority_remark.dscp = [50]
remark.remark_port_group.cos_7.priority_remark.dscp = [58]
# packet priority values assigned to each internal cos value
# internal cos values {cos_0..cos_7}
# (internal cos 3 has been reserved for CPU-generated traffic)
#
# 802.1p values = {0..7}
traffic.cos_0.priority_source.8021p = [0]
traffic.cos_1.priority_source.8021p = [1]
traffic.cos_2.priority_source.8021p = [2]
traffic.cos_3.priority_source.8021p = []
traffic.cos_4.priority_source.8021p = [3,4]
traffic.cos_5.priority_source.8021p = [5]
traffic.cos_6.priority_source.8021p = [6]
traffic.cos_7.priority_source.8021p = [7]
# dscp values = {0..63}
traffic.cos_0.priority_source.dscp = [0,1,2,3,4,5,6,7]
traffic.cos_1.priority_source.dscp = [8,9,10,11,12,13,14,15]
traffic.cos_2.priority_source.dscp = []
traffic.cos_3.priority_source.dscp = []
traffic.cos_4.priority_source.dscp = []
traffic.cos_5.priority_source.dscp = []
traffic.cos_6.priority_source.dscp = []
traffic.cos_7.priority_source.dscp = [56,57,58,59,60,61,62,63]
# Per-port source packet fields and mapping: applies to the designated set of ports.
source.port_group_list = [source_port_group]
source.source_port_group.packet_priority_source_set = [802.1p,dscp]
source.source_port_group.port_set = swp1-swp4,swp6
source.source_port_group.cos_0.priority_source.8021p = [7]
source.source_port_group.cos_1.priority_source.8021p = [6]
source.source_port_group.cos_2.priority_source.8021p = [5]
source.source_port_group.cos_3.priority_source.8021p = [4]
source.source_port_group.cos_4.priority_source.8021p = [3]
source.source_port_group.cos_5.priority_source.8021p = [2]
source.source_port_group.cos_6.priority_source.8021p = [1]
source.source_port_group.cos_7.priority_source.8021p = [0]
# priority groups
traffic.priority_group_list = [control, service, bulk]
# internal cos values assigned to each priority group
# each cos value should be assigned exactly once
# internal cos values {0..7}
priority_group.control.cos_list = [7]
priority_group.service.cos_list = [2]
priority_group.bulk.cos_list = [0,1,3,4,5,6]
# to configure priority flow control on a group of ports:
# -- assign cos value(s) to the cos list
# -- add or replace a port group names in the port group list
# -- for each port group in the list
# -- populate the port set, e.g.
# swp1-swp4,swp8,swp50s0-swp50s3
# -- set a PFC buffer size in bytes for each port in the group
# -- set the xoff byte limit (buffer limit that triggers PFC frame transmit to start)
# -- set the xon byte delta (buffer limit that triggers PFC frame transmit to stop)
# -- enable PFC frame transmit and/or PFC frame receive
# priority flow control
# pfc.port_group_list = [pfc_port_group]
# pfc.pfc_port_group.cos_list = []
# pfc.pfc_port_group.port_set = swp1-swp4,swp6
# pfc.pfc_port_group.port_buffer_bytes = 25000
# pfc.pfc_port_group.xoff_size = 10000
# pfc.pfc_port_group.xon_delta = 2000
# pfc.pfc_port_group.tx_enable = true
# pfc.pfc_port_group.rx_enable = true
# to configure pause on a group of ports:
# -- add or replace port group names in the port group list
# -- for each port group in the list
# -- populate the port set, e.g.
# swp1-swp4,swp8,swp50s0-swp50s3
# -- set a pause buffer size in bytes for each port in the group
# -- set the xoff byte limit (buffer limit that triggers pause frames transmit to start)
# -- set the xon byte delta (buffer limit that triggers pause frames transmit to stop)
# link pause
# link_pause.port_group_list = [pause_port_group]
# link_pause.pause_port_group.port_set = swp1-swp4,swp6
# link_pause.pause_port_group.port_buffer_bytes = 25000
# link_pause.pause_port_group.xoff_size = 10000
# link_pause.pause_port_group.xon_delta = 2000
# link_pause.pause_port_group.rx_enable = true
# link_pause.pause_port_group.tx_enable = true
# scheduling algorithm: algorithm values = {dwrr}
scheduling.algorithm = dwrr
# traffic group scheduling weight
# weight values = {0..127}
# '0' indicates strict priority
priority_group.control.weight = 0
priority_group.service.weight = 32
priority_group.bulk.weight = 16
# To turn on/off Denial of service (DOS) prevention checks
dos_enable = false
# Cut-through is disabled by default on all chips with the exception of
# Spectrum. On Spectrum cut-through cannot be disabled.
#cut_through_enable = false
# Enable resilient hashing
#resilient_hash_enable = FALSE
# Resilient hashing flowset entries per ECMP group
# Valid values - 64, 128, 256, 512, 1024
#resilient_hash_entries_ecmp = 128
# Enable symmetric hashing
#symmetric_hash_enable = TRUE
# Set sflow/sample ingress cpu packet rate and burst in packets/sec
# Values: {0..16384}
#sflow.rate = 16384
#sflow.burst = 16384
#Specify the maximum number of paths per route entry.
# Maximum paths supported is 200.
# Default value 0 takes the number of physical ports as the max path size.
#ecmp_max_paths = 0
#Specify the hash seed for Equal cost multipath entries
# Default value 0
# Value Range: {0..4294967295}
#ecmp_hash_seed = 42
# Specify the forwarding table resource allocation profile, applicable
# only on platforms that support universal forwarding resources.
#
# /usr/cumulus/sbin/cl-resource-query reports the allocated table sizes
# based on the profile setting.
#
# Values: one of {'default', 'l2-heavy', 'v4-lpm-heavy', 'v6-lpm-heavy'}
# Default value: 'default'
# Note: some devices may support more modes, please consult user
# guide for more details
#
#forwarding_table.profile = default
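The comment in the priority group section of the file above says that each internal CoS value should be assigned exactly once. A minimal Python sketch of that check follows; it is an illustration only, not a Cumulus Linux tool, and the function name `check_cos_assignment` is hypothetical.

```python
from collections import Counter

def check_cos_assignment(priority_groups):
    """Return (missing, duplicated) internal CoS values across all priority groups."""
    counts = Counter(cos for cos_list in priority_groups.values() for cos in cos_list)
    missing = sorted(set(range(8)) - set(counts))        # internal cos values are {0..7}
    duplicated = sorted(cos for cos, n in counts.items() if n > 1)
    return missing, duplicated

# cos_list values copied from the sample configuration above
groups = {"control": [7], "service": [2], "bulk": [0, 1, 3, 4, 5, 6]}
print(check_cos_assignment(groups))  # ([], [])
```

Two empty lists mean that every value from 0 through 7 appears in exactly one group.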
On Spectrum switches, packet priority remark must be enabled on
the ingress port. A packet received on a remark-enabled port is
remarked according to the priority mapping configured on the egress
port. If packet priority remark is configured the same way on every
port, the default configuration example above is correct. However,
per-port customized configurations require two port groups: one for the
ingress ports and one for the egress ports, as below:
You can mark traffic for egress packets through iptables or
ip6tables rule classifications. To enable these rules, do one of
the following:
Mark DSCP values in egress packets.
Mark 802.1p CoS values in egress packets.
To enable traffic marking, use cl-acltool. Add the -p option to
specify the location of the policy file. By default, if you don’t
include the -p option, cl-acltool looks for the policy file in
/etc/cumulus/acl/policy.d/.
The iptables-/ip6tables-based marking is supported via the following
action extension:
-j SETQOS --set-dscp 10 --set-cos 5
For ebtables, the setqos keyword must be in lowercase, as in:
[ebtables]
-A FORWARD -o swp5 -j setqos --set-cos 5
You can specify one of the following options for SETQOS/setqos:
Option | Description
--- | ---
--set-cos INT | Sets the datapath resource/queuing class value. Values are defined in IEEE P802.1p.
--set-dscp value | Sets the DSCP field in the packet header to a value, which can be either a decimal or hex value.
--set-dscp-class class | Sets the DSCP field in the packet header to the value represented by the DiffServ class value. This class can be EF, BE or any of the CSxx or AFxx classes.
You can specify either --set-dscp or --set-dscp-class, but not both.
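As a rough illustration of these constraints, the following Python sketch assembles the SETQOS portion of an iptables or ip6tables rule and rejects the combination of --set-dscp and --set-dscp-class. The helper name `setqos_target` is hypothetical, the value ranges (DSCP 0 through 63, CoS 0 through 7) are taken from the traffic.conf comments, and requiring at least one option is an assumption of this sketch.

```python
def setqos_target(cos=None, dscp=None, dscp_class=None):
    """Build the '-j SETQOS ...' fragment of an iptables/ip6tables rule."""
    if dscp is not None and dscp_class is not None:
        raise ValueError("use either --set-dscp or --set-dscp-class, not both")
    if cos is None and dscp is None and dscp_class is None:
        raise ValueError("specify at least one of cos, dscp, dscp_class")  # assumption
    parts = ["-j", "SETQOS"]
    if dscp is not None:
        if not 0 <= dscp <= 63:
            raise ValueError("DSCP must be in 0..63")
        parts += ["--set-dscp", str(dscp)]
    if dscp_class is not None:
        parts += ["--set-dscp-class", dscp_class]
    if cos is not None:
        if not 0 <= cos <= 7:
            raise ValueError("CoS must be in 0..7")
        parts += ["--set-cos", str(cos)]
    return " ".join(parts)

print(setqos_target(cos=5, dscp=10))  # -j SETQOS --set-dscp 10 --set-cos 5
```

For ebtables rules, remember that the keyword must be lowercase (setqos); this sketch covers only the uppercase iptables form.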
You can put the rule in either the mangle table or the default
filter table; the mangle table and filter table are put into separate
TCAM slices in the hardware.
To put the rule in the mangle table, include -t mangle; to put the
rule in the filter table, omit -t mangle.
Configure Priority Flow Control
Priority flow control, as defined in the
IEEE 802.1Qbb standard, provides a
link-level flow control mechanism that can be controlled independently
for each Class of Service (CoS) with the intention to ensure no data
frames are lost when congestion occurs in a bridged network.
PFC is a layer 2 mechanism that prevents congestion by throttling packet
transmission. When PFC is enabled for received packets on a set of
switch ports, the switch detects congestion in the ingress buffer of the
receiving port and signals the upstream switch to stop sending traffic.
If the upstream switch has PFC enabled for packet transmission on the
designated priorities, it responds to the downstream switch and stops
sending those packets for a period of time.
PFC operates between two adjacent neighbor switches; it does not provide
end-to-end flow control. However, when an upstream neighbor throttles
packet transmission, it could build up packet congestion and propagate
PFC frames further upstream: eventually the sending server could receive
PFC frames and stop sending traffic for a time.
The PFC mechanism can be enabled for individual switch priorities on
specific switch ports for RX and/or TX traffic. The switch port’s
ingress buffer occupancy is used to measure congestion. If congestion is
present, the switch transmits flow control frames to the upstream
switch. Packets with priority values that do not have PFC configured are
not counted during congestion detection; neither do they get throttled
by the upstream switch when it receives flow control frames.
PFC congestion detection is implemented on the switch using xoff and xon
threshold values for the specific ingress buffer which is used by the
targeted switch priorities. When a packet enters the buffer and the
buffer occupancy is above the xoff threshold, the switch transmits an
Ethernet PFC frame to the upstream switch to signal packet transmission
should stop. When the buffer occupancy drops below the xon threshold,
the switch sends another PFC frame upstream to signal that packet
transmission can resume. (PFC frames contain a quanta value to indicate
a timeout value for the upstream switch: packet transmission can resume
after the timer has expired, or when a PFC frame with quanta == 0 is
received from the downstream switch.)
After the downstream switch has sent a PFC frame upstream, it continues
to receive packets until the upstream switch receives and responds to
the PFC frame. The downstream ingress buffer must be large enough to
store those additional packets after the xoff threshold has been
reached.
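The xoff/xon behavior described above is a simple hysteresis. The following Python sketch models it under stated assumptions: it ignores the quanta timer, treats occupancy as a single number in bytes, and uses a class name (`PfcSignal`) invented for illustration. It is not how switchd or the ASIC implements PFC.

```python
class PfcSignal:
    """Toy model of xoff/xon signaling for one ingress buffer."""

    def __init__(self, xoff_threshold, xon_threshold):
        if xon_threshold > xoff_threshold:
            raise ValueError("xon threshold must not exceed xoff threshold")
        self.xoff = xoff_threshold
        self.xon = xon_threshold
        self.paused = False  # True after an xoff frame has been sent upstream

    def update(self, occupancy):
        """Return 'xoff', 'xon', or None for the given buffer occupancy in bytes."""
        if not self.paused and occupancy > self.xoff:
            self.paused = True
            return "xoff"   # ask the upstream switch to stop sending
        if self.paused and occupancy < self.xon:
            self.paused = False
            return "xon"    # tell the upstream switch it can resume
        return None

# Thresholds derived from the sample traffic.conf values (xoff 10000, xon delta 2000)
signal = PfcSignal(xoff_threshold=10000, xon_threshold=8000)
events = [signal.update(b) for b in (5000, 10500, 12000, 9000, 7900, 6000)]
print(events)  # [None, 'xoff', None, None, 'xon', None]
```

Note that occupancy keeps rising to 12000 bytes after the xoff event in this trace, which is the in-flight traffic the ingress buffer has to absorb.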
Before Cumulus Linux 3.1.1, PFC was designated as a lossless priority
group. The lossless priority group has been removed from Cumulus Linux.
Priority flow control is fully supported on both Broadcom and Mellanox switches.
PFC is disabled by default in Cumulus Linux. Enabling priority flow
control (PFC) requires configuring the following settings in
/etc/cumulus/datapath/traffic.conf on the switch:
Specifying the name of the port group in pfc.port_group_list in
brackets; for example, pfc.port_group_list =
[pfc_port_group].
Assigning a CoS value to the port group in
pfc.pfc_port_group.cos_list setting. Note that pfc_port_group
is the name of a port group you specified above and is used
throughout the following settings.
Populating the port group with its member ports in
pfc.pfc_port_group.port_set.
Setting a PFC buffer size in pfc.pfc_port_group.port_buffer_bytes.
This is the maximum number of bytes allocated for storing bursts of
packets, guaranteed at the ingress port. The default is 25000
bytes.
Setting the xoff byte limit in pfc.pfc_port_group.xoff_size. This
is a threshold for the PFC buffer; when this limit is reached, an
xoff transition is initiated, signaling the upstream port to stop
sending traffic, during which time packets continue to arrive due to
the latency of the communication. The default is 10000 bytes.
Setting the xon delta limit in pfc.pfc_port_group.xon_delta. This
is the number of bytes to subtract from the xoff limit, which
results in a second threshold at which the egress port resumes
sending traffic. After the xoff limit is reached and the upstream
port stops sending traffic, the buffer begins to drain. When the
buffer reaches 8000 bytes (assuming default xoff and xon settings),
the egress port signals that it can start receiving traffic again.
The default is 2000 bytes.
Enabling the egress port to signal the upstream port to stop sending
traffic (pfc.pfc_port_group.tx_enable). The default is true.
Enabling the egress port to receive notifications and act on them
(pfc.pfc_port_group.rx_enable). The default is true.
The switch priority value(s) are mapped to the specific ingress
buffer for each targeted switch port. Cumulus Linux looks at either
the 802.1p bits or the IP layer DSCP bits depending on which is
configured in the traffic.conf file to map packets to internal
switch priority values.
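To make the threshold arithmetic concrete, here is a small Python sketch that reads traffic.conf-style `key = value` lines and derives the xon threshold as the xoff limit minus the xon delta. The function names are hypothetical and the parser handles only the simple line format shown in this chapter.

```python
def parse_settings(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings

def pfc_thresholds(settings, group):
    """Return buffer size, xoff limit and derived xon threshold for a PFC port group."""
    prefix = f"pfc.{group}."
    xoff = int(settings[prefix + "xoff_size"])
    return {
        "port_buffer_bytes": int(settings[prefix + "port_buffer_bytes"]),
        "xoff": xoff,
        "xon": xoff - int(settings[prefix + "xon_delta"]),
    }

sample = """
pfc.port_group_list = [pfc_port_group]
pfc.pfc_port_group.port_buffer_bytes = 25000
pfc.pfc_port_group.xoff_size = 10000
pfc.pfc_port_group.xon_delta = 2000
"""
print(pfc_thresholds(parse_settings(sample), "pfc_port_group"))
# {'port_buffer_bytes': 25000, 'xoff': 10000, 'xon': 8000}
```

With the default values, the result matches the 8000-byte resume point mentioned above.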
The following configuration example shows PFC configured for ports swp1
through swp4 and swp6:
# to configure priority flow control on a group of ports:
# -- assign cos value(s) to the cos list
# -- add or replace a port group names in the port group list
# -- for each port group in the list
# -- populate the port set, e.g.
# swp1-swp4,swp8,swp50s0-swp50s3
# -- set a PFC buffer size in bytes for each port in the group
# -- set the xoff byte limit (buffer limit that triggers PFC frame transmit to start)
# -- set the xon byte delta (buffer limit that triggers PFC frame transmit to stop)
# -- enable PFC frame transmit and/or PFC frame receive
# priority flow control
pfc.port_group_list = [pfc_port_group]
pfc.pfc_port_group.cos_list = []
pfc.pfc_port_group.port_set = swp1-swp4,swp6
pfc.pfc_port_group.port_buffer_bytes = 25000
pfc.pfc_port_group.xoff_size = 10000
pfc.pfc_port_group.xon_delta = 2000
pfc.pfc_port_group.tx_enable = true
pfc.pfc_port_group.rx_enable = true
Port Groups
A port group refers to one or more sequences of contiguous ports.
Multiple port groups can be defined by:
Adding a comma-separated list of port group names to the
port_group_list.
Adding the port_set, rx_enable, and tx_enable configuration lines
for each port group.
You can specify the set of ports in a port group in comma-separated
sequences of contiguous ports; you can see which ports are contiguous in
/var/lib/cumulus/porttab. The syntax supports:
A single port (swp1s0 or swp5)
A sequence of regular swp ports (swp2-swp5)
A sequence within a breakout swp port (swp6s0-swp6s3)
A sequence of regular and breakout ports, provided they are all in a
contiguous range. For example:
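A rough Python sketch of how the simpler forms of this syntax expand is shown here. It is an illustration only: the function name `expand_port_set` is hypothetical, and ranges that mix regular and breakout ports are rejected because resolving them requires the port order in /var/lib/cumulus/porttab, which this sketch does not read.

```python
import re

PORT_RE = re.compile(r"^swp(\d+)(?:s(\d+))?$")

def expand_port_set(port_set):
    """Expand a port_set string such as 'swp1-swp4,swp6' into a list of port names."""
    ports = []
    for item in port_set.split(","):
        ends = [PORT_RE.match(name.strip()) for name in item.split("-")]
        if not all(ends) or len(ends) > 2:
            raise ValueError(f"unrecognized port or range: {item!r}")
        if len(ends) == 1:                      # a single port, e.g. swp6 or swp1s0
            ports.append(item.strip())
            continue
        (a_port, a_sub), (b_port, b_sub) = ends[0].groups(), ends[1].groups()
        if a_sub is None and b_sub is None:     # regular ports, e.g. swp2-swp5
            ports += [f"swp{n}" for n in range(int(a_port), int(b_port) + 1)]
        elif a_sub is not None and b_sub is not None and a_port == b_port:
            # sequence within one breakout port, e.g. swp6s0-swp6s3
            ports += [f"swp{a_port}s{n}" for n in range(int(a_sub), int(b_sub) + 1)]
        else:
            raise ValueError(f"mixed range needs porttab to resolve: {item!r}")
    return ports

print(expand_port_set("swp1-swp4,swp6"))
# ['swp1', 'swp2', 'swp3', 'swp4', 'swp6']
```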
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
Configure Link Pause
The PAUSE frame is a flow control mechanism that halts transmission
by the sender for a specified period of time. A server or other
network node within the data center might receive traffic faster than
it can handle it; the PAUSE frame lets that node ask the sender to stop
temporarily. In Cumulus Linux, you can configure individual
ports to execute link pause by:
Transmitting pause frames when its ingress buffers become congested
(TX pause enable) and/or
Responding to received pause frames (RX pause enable).
Link pause is disabled by default. Enabling link pause requires
configuring settings in /etc/cumulus/datapath/traffic.conf, similar to
how you configure priority flow control. The settings are explained in
that section as well.
What’s the difference between link pause and priority flow control?
Priority flow control is applied to an individual priority group for a
specific ingress port.
Link pause (also known as port pause or global pause) is applied to all
the traffic for a specific ingress port.
Here is an example configuration that enables both types of link pause
for swp1 through swp4 and swp6:
# to configure pause on a group of ports:
# -- add or replace port group names in the port group list
# -- for each port group in the list
# -- populate the port set, e.g.
# swp1-swp4,swp8,swp50s0-swp50s3
# -- set a pause buffer size in bytes for each port in the group
# -- set the xoff byte limit (buffer limit that triggers pause frames transmit to start)
# -- set the xon byte delta (buffer limit that triggers pause frames transmit to stop)
# link pause
link_pause.port_group_list = [pause_port_group]
link_pause.pause_port_group.port_set = swp1-swp4,swp6
link_pause.pause_port_group.port_buffer_bytes = 25000
link_pause.pause_port_group.xoff_size = 10000
link_pause.pause_port_group.xon_delta = 2000
link_pause.pause_port_group.rx_enable = true
link_pause.pause_port_group.tx_enable = true
Restart switchd to allow link pause configuration changes to take effect:
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
Configure Cut-through Mode and Store and Forward Switching
Cut-through mode is disabled in Cumulus Linux by default on switches
with Broadcom ASICs. With cut-through mode enabled and link pause
asserted, Cumulus Linux generates a TOVR and TUFL ERROR; certain error
counters increment on a given physical port.
On switches using Broadcom Tomahawk, Trident II, Trident II+, and
Trident3 ASICs, Cumulus Linux supports store and forward switching but
does not support cut-through mode.
On switches using Spectrum ASICs, Cumulus Linux supports cut-through mode but does not support store and forward switching.
Configure Explicit Congestion Notification
Explicit Congestion Notification (ECN) is defined by
RFC 3168. ECN gives a Cumulus
Linux switch the ability to mark a packet to signal impending congestion
instead of dropping the packet outright, which is how TCP typically
behaves when ECN is not enabled.
ECN is a layer 3 end-to-end congestion notification mechanism only.
Packets can be marked as ECN-capable transport (ECT) by the sending
server. If congestion is observed by any switch while the packet is
getting forwarded, the ECT-enabled packet can be marked by the switch to
indicate the congestion. The end receiver can respond to the ECN-marked
packets by signaling the sending server to slow down transmission. The
sending server marks a packet ECT by setting the two least significant
bits in the IP header DiffServ (ToS) field to 01 or 10. A packet
that has the two least significant bits set to 00 is a
non-ECT-enabled packet.
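The bit layout can be sketched in a few lines of Python. The codepoint names come from RFC 3168; the helper names are hypothetical, and `mark_congestion` only mirrors the behavior described in this section (ECT packets are marked with 11, non-ECT packets are left alone).

```python
# ECN codepoints in the two least significant bits of the ToS byte (RFC 3168)
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}

def ecn_bits(tos):
    """Return the 2-bit ECN field of a ToS byte."""
    return tos & 0b11

def dscp(tos):
    """Return the 6-bit DSCP field of a ToS byte."""
    return tos >> 2

def mark_congestion(tos):
    """Set the ECN field to 11 on ECT packets; leave non-ECT packets unchanged."""
    if ecn_bits(tos) == 0b00:
        return tos
    return tos | 0b11

tos = (10 << 2) | 0b10                 # DSCP 10, ECT(0)
print(ECN_NAMES[ecn_bits(tos)])        # ECT(0)
marked = mark_congestion(tos)
print(dscp(marked), ECN_NAMES[ecn_bits(marked)])  # 10 CE
```

Marking changes only the two ECN bits; the DSCP value in the upper six bits is preserved.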
The ECN mechanism on a switch only marks packets to notify the end
receiver. It does not take any other action or change packet handling in
any way, nor does it respond to packets that have already been marked
ECN by an upstream switch.
On Trident II switches only, if ECN is enabled on a specific queue,
the ASIC also enables RED on the same queue. If the packet is ECT marked
(the ECN bits are 01 or 10), the ECN mechanism executes as described
above. However, if it is entering an ECN-enabled queue but is not ECT
marked (the ECN bits are 00), then the RED mechanism uses the same
threshold and probability values to decide whether to drop the packet.
Packets entering a non-ECN-enabled queue do not get marked or dropped
due to ECN or RED in any case.
ECN is implemented on the switch using minimum and maximum threshold
values for the egress queue length. When a packet enters the queue and
the average queue length is between the minimum and maximum threshold
values, a configurable probability value will determine whether the
packet will be marked. If the average queue length is above the maximum
threshold value, the packet is always marked.
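The threshold logic can be summarized in a short Python sketch. Several points here are assumptions rather than documented behavior: the probability is treated as a percentage, it is applied as a flat value between the two thresholds (the text above does not say whether the ASIC ramps it), and the thresholds themselves are treated as part of the "between" range. The function name is hypothetical.

```python
def ecn_mark_probability(avg_queue_bytes, min_threshold_bytes,
                         max_threshold_bytes, probability):
    """Return the chance (0.0 to 1.0) that an ECT packet entering the queue is marked."""
    if avg_queue_bytes > max_threshold_bytes:
        return 1.0                      # above the maximum: always marked
    if avg_queue_bytes >= min_threshold_bytes:
        return probability / 100.0      # between thresholds: configured probability
    return 0.0                          # below the minimum: never marked

# Thresholds from the sample ECN configuration in this chapter; 50 is an arbitrary probability
for avg in (30000, 100000, 250000):
    print(avg, ecn_mark_probability(avg, 40000, 200000, 50))
```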
The downstream switches with ECN enabled perform the same actions as the
traffic is received. If the ECN bits are already set, they remain set. The
only way a switch overwrites the ECN bits is by marking the packet, that
is, by setting the ECN bits to 11.
ECN is disabled by default in Cumulus Linux. You can enable ECN for
individual switch priorities on specific switch ports. ECN requires
configuring the following settings in
/etc/cumulus/datapath/traffic.conf on the switch:
Specifying the name of the port group in ecn.port_group_list in
brackets; for example, ecn.port_group_list = [ecn_port_group].
Assigning a CoS value to the port group in
ecn.ecn_port_group.cos_list. If the CoS value of a packet matches
the value of this setting, then ECN is applied. Note that
ecn_port_group is the name of a port group you specified above.
Populating the port group with its member ports
(ecn.ecn_port_group.port_set), where ecn_port_group is the
name of the port group you specified above. Congestion is measured
on the egress port queue for the ports listed here, using the
average queue length: if congestion is present, a packet entering
the queue may be marked to indicate that congestion was observed.
Marking a packet involves setting the two least significant bits in
the IP header DiffServ (ToS) field to 11.
The switch priority value(s) are mapped to specific egress queues
for the target switch ports.
The ecn.ecn_port_group.probability value indicates the probability
of a packet being marked if congestion is experienced.
The following configuration example shows ECN configured for ports swp1
through swp4 and swp6:
# Explicit Congestion Notification
# to configure ECN on a group of ports:
# -- add or replace port group names in the port group list
# -- assign cos value(s) to the cos list *ECN will only be applied to traffic matching this COS*
# -- for each port group in the list
# -- populate the port set, e.g.
# swp1-swp4,swp8,swp50s0-swp50s3
ecn.port_group_list = [ecn_port_group]
ecn.ecn_port_group.cos_list = [0]
ecn.ecn_port_group.port_set = swp1-swp4,swp6
ecn.ecn_port_group.min_threshold_bytes = 40000
ecn.ecn_port_group.max_threshold_bytes = 200000
ecn.ecn_port_group.probability = 100
Restart switchd to allow the ECN configuration changes to take effect:
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
Check Interface Buffer Status
On switches with Spectrum ASICs, you can collect a fine-grained history of queue lengths using histograms maintained by the ASIC; see the ASIC monitoring chapter for details.
On Broadcom switches, the buffer status is not visible currently.
To run DHCP for both IPv4 and IPv6, initiate the DHCP relay once for
IPv4 and once for IPv6. Following are the configurations on the server
hosts, DHCP relay, and DHCP server using the following topology:
The dhcpd and dhcrelay services are disabled by default. After you
finish configuring the DHCP relays and servers, you need to start those
services. If you intend to run these services within a VRF, follow these
steps for configuring them.
Configure IPv4 DHCP Relays
Configure isc-dhcp-relay using NCLU, specifying the IP address of each DHCP server and the interfaces that are used as the uplinks.
In the examples below, the DHCP server IP address is 172.16.1.102, the VLAN is VLAN 1 (the SVI is vlan1), and the uplinks are swp51 and swp52.
You configure a DHCP relay on a per-VLAN basis, specifying the SVI, not
the parent bridge; in our example, you would specify vlan1 as the SVI
for VLAN 1; do not specify the bridge named bridge in this case.
As per RFC 3046, you can specify
as many server IP addresses as can fit in 255 octets, specifying each
address only once.
cumulus@leaf01:~$ net add dhcp relay interface swp51
cumulus@leaf01:~$ net add dhcp relay interface swp52
cumulus@leaf01:~$ net add dhcp relay interface vlan1
cumulus@leaf01:~$ net add dhcp relay server 172.16.1.102
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
These commands create the following configuration in the
/etc/default/isc-dhcp-relay file:
To see the DHCP relay status, use the systemctl status dhcrelay.service command:
cumulus@leaf01:~$ sudo systemctl status dhcrelay.service
● dhcrelay.service - DHCPv4 Relay Agent Daemon
Loaded: loaded (/lib/systemd/system/dhcrelay.service; enabled)
Active: active (running) since Fri 2016-12-02 17:09:10 UTC; 2min 16s ago
Docs: man:dhcrelay(8)
Main PID: 1997 (dhcrelay)
CGroup: /system.slice/dhcrelay.service
└─1997 /usr/sbin/dhcrelay --nl -d -q -i vlan1 -i swp51 -i swp52 172.16.1.102
DHCP Option 82
You can configure DHCP relays to inject the circuit-id field with the
-a option, which you add to the OPTIONS line in the
/etc/default/isc-dhcp-relay file. By default, the ingress SVI
interface against which the relayed DHCP discover packet is processed is
injected into this field. You can change this behavior by adding the
--use-pif-circuit-id option. With this option, the physical switch
port (swp) on which the discover packet arrives is placed in the
circuit-id field.
Control the Gateway IP Address with RFC 3527
When DHCP relay is required in an environment that relies on an anycast
gateway (such as EVPN), a unique IP address is necessary on each device
for return traffic. By default, in a BGP unnumbered environment with
DHCP relay, the source IP address is set to the loopback IP address and
the gateway IP address (giaddr) is set as the SVI IP address. However,
with anycast traffic, the SVI IP address is not unique to each rack; it
is typically shared amongst all racks. Most EVPN ToR deployments only
possess a single unique IP address, which is the loopback IP address.
RFC 3527 enables the DHCP server
to react to these environments by introducing a new parameter to the
DHCP header called the link selection sub-option, which is built by
the DHCP relay agent. The link selection sub-option takes on the normal
role of the giaddr in relaying to the DHCP server which subnet is
correlated to the DHCP request. When using this sub-option, the giaddr
continues to be present but only relays the return IP address that is to
be used by the DHCP server; the giaddr becomes the unique loopback IP address.
When enabling RFC 3527 support, you can specify an interface, such as
the loopback interface or a switchport interface to be used as the
giaddr. The relay picks the first IP address on that interface. If the
interface has multiple IP addresses, you can specify a specific IP
address for the interface.
RFC 3527 is supported for IPv4 DHCP relays only.
The following illustration demonstrates how you can control the giaddr
with RFC 3527.
To enable RFC 3527 support and control the giaddr, run the net add dhcp relay giaddr-interface command with the interface and, optionally, the IP address you want to
use.
The following example uses the first IP address on the loopback
interface as the giaddr:
cumulus@leaf01:~$ net add dhcp relay giaddr-interface lo
The above command creates the following configuration in the
/etc/default/isc-dhcp-relay file:
cumulus@leaf01:~$ cat /etc/default/isc-dhcp-relay
...
# Additional options that are passed to the DHCP relay daemon?
OPTIONS="-U lo"
The first IP address on the loopback interface is typically the
127.0.0.1 address. Use more specific syntax, as shown in the next example.
The following example uses IP address 10.0.0.1 on the loopback interface
as the giaddr:
cumulus@leaf01:~$ net add dhcp relay giaddr-interface lo 10.0.0.1
The above command creates the following configuration in the
/etc/default/isc-dhcp-relay file:
cumulus@leaf01:~$ cat /etc/default/isc-dhcp-relay
...
# Additional options that are passed to the DHCP relay daemon?
OPTIONS="-U 10.0.0.1%lo"
The following example uses the first IP address on swp2 as the giaddr:
cumulus@leaf01:~$ net add dhcp relay giaddr-interface swp2
The above command creates the following configuration in the
/etc/default/isc-dhcp-relay file:
cumulus@leaf01:~$ cat /etc/default/isc-dhcp-relay
...
# Additional options that are passed to the DHCP relay daemon?
OPTIONS="-U swp2"
The following example uses IP address 10.0.0.3 on swp2 as the giaddr:
cumulus@leaf01:~$ net add dhcp relay giaddr-interface swp2 10.0.0.3
The above command creates the following configuration in the
/etc/default/isc-dhcp-relay file:
cumulus@leaf01:~$ cat /etc/default/isc-dhcp-relay
...
# Additional options that are passed to the DHCP relay daemon?
OPTIONS="-U 10.0.0.3%swp2"
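The four examples above follow one pattern for the -U value: the interface name alone, or the IP address and interface joined by a percent sign with no space. A minimal Python sketch of that pattern follows; the helper name `giaddr_option` is hypothetical, and the IPv4 check reflects the statement above that RFC 3527 is supported for IPv4 DHCP relays only.

```python
import ipaddress

def giaddr_option(interface, address=None):
    """Return the -U option string as it appears in OPTIONS in the examples above."""
    if address is None:
        return f"-U {interface}"        # relay picks the first address on the interface
    ipaddress.IPv4Address(address)      # raises ValueError if not a valid IPv4 address
    return f"-U {address}%{interface}"  # no space on either side of '%'

print(giaddr_option("lo"))                # -U lo
print(giaddr_option("lo", "10.0.0.1"))    # -U 10.0.0.1%lo
print(giaddr_option("swp2", "10.0.0.3"))  # -U 10.0.0.3%swp2
```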
When enabling RFC 3527 support, you can specify an interface, such as the loopback interface or an swp interface, for the gateway address. The interface you use must be reachable in the tenant VRF that it is servicing and must be unique to the switch. In EVPN symmetric routing, fabrics running an anycast gateway that uses the same SVI IP address on multiple leaf switches need a unique IP address for the VRF interface and must include the layer 3 VNI for this VRF in the DHCP relay configuration.
Configure IPv6 DHCP Relays
If you are configuring IPv6, the /etc/default/isc-dhcp-relay6
variables file has a different format than the
/etc/default/isc-dhcp-relay file for IPv4 DHCP relays. Make sure to
configure the variables appropriately by editing this file.
After you finish configuring the DHCP relay, save your changes, restart
the dhcrelay6 service, then enable the dhcrelay6 service so the
configuration persists between reboots:
To see the status of the IPv6 DHCP relay, use the systemctl status dhcrelay6.service command:
cumulus@leaf01:~$ sudo systemctl status dhcrelay6.service
● dhcrelay6.service - DHCPv6 Relay Agent Daemon
Loaded: loaded (/lib/systemd/system/dhcrelay6.service; disabled)
Active: active (running) since Fri 2016-12-02 21:00:26 UTC; 1s ago
Docs: man:dhcrelay(8)
Main PID: 6152 (dhcrelay)
CGroup: /system.slice/dhcrelay6.service
└─6152 /usr/sbin/dhcrelay -6 --nl -d -q -l vlan1 -u 2001:db8:100::2 swp51 -u 2001:db8:100::2 swp52
Configure Multiple DHCP Relays
Cumulus Linux supports multiple DHCP relay daemons on a switch to enable
relaying of packets from different bridges to different upstreams.
To configure multiple DHCP relay daemons on a switch:
Create a config file in /etc/default using the following format
for each dhcrelay: isc-dhcp-relay-<dhcp-name>. An example file is
shown below:
# Defaults for isc-dhcp-relay initscript
# sourced by /etc/init.d/isc-dhcp-relay
# installed at /etc/default/isc-dhcp-relay by the maintainer scripts
#
# This is a POSIX shell fragment
#
# What servers should the DHCP relay forward requests to?
SERVERS="102.0.0.2"
# On what interfaces should the DHCP relay (dhrelay) serve DHCP requests?
# Always include the interface towards the DHCP server.
# This variable requires a -i for each interface configured above.
# This will be used in the actual dhcrelay command
# For example, "-i eth0 -i eth1"
INTF_CMD="-i swp2s2 -i swp2s3"
# Additional options that are passed to the DHCP relay daemon?
OPTIONS=""
Run the following command to start a dhcrelay instance. Replace
dhcp-name with the instance name or number:
The configuration procedure for DHCP relay with VRR is the same as
documented above. Note that DHCP relay
must run on the SVI and not on the -v0 interface.
Configure the DHCP Relay Service Manually (Advanced)
By default, Cumulus Linux configures the DHCP relay service
automatically. However, in older versions of Cumulus Linux, you needed
to edit the dhcrelay.service file as described below. The IPv4
dhcrelay.service unit script calls /etc/default/isc-dhcp-relay to
find launch variables.
cumulus@switch:~$ cat /lib/systemd/system/dhcrelay.service
[Unit]
Description=DHCPv4 Relay Agent Daemon
Documentation=man:dhcrelay(8)
After=network-online.target networking.service syslog.service
[Service]
Type=simple
EnvironmentFile=-/etc/default/isc-dhcp-relay
# Here, we are expecting the INTF_CMD to contain
# the -i for each interface specified,
# e.g. "-i eth0 -i swp1"
ExecStart=/usr/sbin/dhcrelay -d -q $INTF_CMD $SERVERS $OPTIONS
[Install]
WantedBy=multi-user.target
The /etc/default/isc-dhcp-relay variables file needs to reference both
interfaces participating in DHCP relay (facing the server and facing the
client) and the IP address of the server. If the client-facing interface
is a bridge port, specify the switch virtual interface (SVI) name if you
are using a VLAN-aware bridge
(for example, vlan100), or the bridge name if you are using traditional
bridging (for example, br100).
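As a sketch of how the three variables fit together, the following Python function renders the SERVERS, INTF_CMD and OPTIONS lines from a list of servers and interfaces. The function name `render_relay_defaults` is hypothetical and the output omits the comment header of the real file; treat it as an illustration of the format, not a replacement for NCLU.

```python
def render_relay_defaults(servers, interfaces, options=""):
    """Render the variable lines of an /etc/default/isc-dhcp-relay style file."""
    server_list = " ".join(servers)
    # INTF_CMD needs a -i for every interface, client-facing and server-facing
    intf_cmd = " ".join(f"-i {name}" for name in interfaces)
    lines = [
        f'SERVERS="{server_list}"',
        f'INTF_CMD="{intf_cmd}"',
        f'OPTIONS="{options}"',
    ]
    return "\n".join(lines) + "\n"

# Values from the IPv4 relay example earlier in this chapter
print(render_relay_defaults(["172.16.1.102"], ["vlan1", "swp51", "swp52"]), end="")
```

With these inputs, the ExecStart line in the unit file expands to the same interface and server arguments shown in the dhcrelay status output earlier in this chapter.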
Use the Gateway IP Address as the Source IP for Relayed DHCP Packets (Advanced)
You can configure the dhcrelay service to forward IPv4 (only) DHCP
packets to a server and ensure that the source IP address of the relayed
packet is the same as the gateway IP address. You do this by enabling
the giaddr-src option; when set, dhcrelay attempts to set the source
IP address of the packet to be the gateway IP address.
This option impacts all relayed packets globally.
To enable this feature:
cumulus@leaf:~$ net add dhcp relay use-giaddr-as-src
cumulus@leaf:~$ net pending
cumulus@leaf:~$ net commit
These commands create the following configuration in the
/etc/default/isc-dhcp-relay file:
cumulus@leaf01:~$ cat /etc/default/isc-dhcp-relay
# Defaults for isc-dhcp-relay initscript
# sourced by /etc/init.d/isc-dhcp-relay
# installed at /etc/default/isc-dhcp-relay by the maintainer scripts
#
# This is a POSIX shell fragment
#
# What servers should the DHCP relay forward requests to?
SERVERS=""
# On what interfaces should the DHCP relay (dhrelay) serve DHCP requests?
# Always include the interface towards the DHCP server.
# This variable requires a -i for each interface configured above.
# This will be used in the actual dhcrelay command
# For example, "-i eth0 -i eth1"
INTF_CMD=""
# Additional options that are passed to the DHCP relay daemon?
OPTIONS="--giaddr-src"
Troubleshooting
If you are experiencing issues with the DHCP relay, run the following
commands to determine if the issue is with systemd. The following
commands manually activate the DHCP relay process and they do not
persist when you reboot the switch:
Use the journalctl command to look at the behavior on the Cumulus
Linux switch that is providing the DHCP relay functionality:
cumulus@leaf01:~$ sudo journalctl -l -n 20 | grep dhcrelay
Dec 05 20:58:55 leaf01 dhcrelay[6152]: sending upstream swp52
Dec 05 20:58:55 leaf01 dhcrelay[6152]: sending upstream swp51
Dec 05 20:58:55 leaf01 dhcrelay[6152]: Relaying Reply to fe80::4638:39ff:fe00:3 port 546 down.
Dec 05 20:58:55 leaf01 dhcrelay[6152]: Relaying Reply to fe80::4638:39ff:fe00:3 port 546 down.
Dec 05 21:03:55 leaf01 dhcrelay[6152]: Relaying Renew from fe80::4638:39ff:fe00:3 port 546 going up.
Dec 05 21:03:55 leaf01 dhcrelay[6152]: sending upstream swp52
Dec 05 21:03:55 leaf01 dhcrelay[6152]: sending upstream swp51
Dec 05 21:03:55 leaf01 dhcrelay[6152]: Relaying Reply to fe80::4638:39ff:fe00:3 port 546 down.
Dec 05 21:03:55 leaf01 dhcrelay[6152]: Relaying Reply to fe80::4638:39ff:fe00:3 port 546 down.
You can run the journalctl command with the --since flag to specify
a time period:
cumulus@leaf01:~$ sudo journalctl -l --since "2 minutes ago" | grep dhcrelay
Dec 05 21:08:55 leaf01 dhcrelay[6152]: Relaying Renew from fe80::4638:39ff:fe00:3 port 546 going up.
Dec 05 21:08:55 leaf01 dhcrelay[6152]: sending upstream swp52
Dec 05 21:08:55 leaf01 dhcrelay[6152]: sending upstream swp51
Configuration Errors
If you configure DHCP relays by editing the
/etc/default/isc-dhcp-relay file
manually instead of running NCLU commands, you might introduce
configuration errors that can cause the switch to crash.
For example, if you see an error similar to the following, there might
be a space between the DHCP server address and the interface used as the
uplink.
Core was generated by `/usr/sbin/dhcrelay --nl -d -i vx-40 -i vlan100 10.0.0.4 -U 10.0.1.2 %vlan120'.
Program terminated with signal SIGSEGV, Segmentation fault.
To resolve the issue, manually edit the /etc/default/isc-dhcp-relay
file to remove the space, then run the
systemctl restart dhcrelay.service command to restart the dhcrelay
service and apply the configuration change.
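The check described above can be sketched in a few lines of Python. This is only an illustration, assuming the `address%interface` form shown in the core message; the helper name `find_split_uplinks` is hypothetical and not part of Cumulus Linux.

```python
import re

# An address token followed by whitespace and a %-prefixed interface,
# for example "10.0.1.2 %vlan120" (the broken form in the core message).
_SPLIT_UPLINK = re.compile(r"(\S+)\s+(%\S+)")


def find_split_uplinks(dhcrelay_args: str) -> list[str]:
    """Return the repaired 'address%interface' token for each broken pair."""
    return [addr + iface for addr, iface in _SPLIT_UPLINK.findall(dhcrelay_args)]


args = "-i vx-40 -i vlan100 10.0.0.4 -U 10.0.1.2 %vlan120"
print(find_split_uplinks(args))
```

An empty list means no stray space was found between a server address and its uplink interface.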
Caveats and Errata
Interface Names Cannot Be Longer than 14 Characters
The dhcrelay command does not bind to an interface if the interface’s name is
longer than 14 characters. To work around this issue, change the interface name
to be 14 or fewer characters if dhcrelay is required to bind to it.
This is a known limitation in dhcrelay.
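As a rough pre-check before editing the INTF_CMD variable, the length limit can be validated in a script. The sketch below is an assumption-laden illustration: the function name `build_intf_cmd` is hypothetical, and the output format follows the `-i eth0 -i eth1` example in the comments of the defaults file.

```python
MAX_IFNAME_LEN = 14  # limit stated above for dhcrelay


def build_intf_cmd(interfaces: list[str]) -> str:
    """Build an INTF_CMD value such as '-i eth0 -i eth1'."""
    too_long = [name for name in interfaces if len(name) > MAX_IFNAME_LEN]
    if too_long:
        raise ValueError("dhcrelay cannot bind to: " + ", ".join(too_long))
    return " ".join(f"-i {name}" for name in interfaces)


print(build_intf_cmd(["swp51", "swp52", "vlan100"]))
```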
DHCP Servers
To run DHCP for both IPv4 and IPv6, you need to initiate the DHCP server
twice: once for IPv4 and once for IPv6. The configurations in this chapter
use the following topology for the host, DHCP relay, and DHCP server:
For the configurations used in this chapter, the DHCP server is a switch
running Cumulus Linux; however, the DHCP server can also be located on a
dedicated server in your environment.
The dhcpd and dhcrelay services are disabled by default. After you
finish configuring the DHCP relays and servers, you need to start those
services. If you intend to run these services within a VRF, including
the management VRF, follow these steps for configuring them. See also
the VRF chapter.
Configure the DHCP Server on Cumulus Linux Switches
You can use the following sample configurations for dhcpd.conf and
dhcpd6.conf to start both an IPv4 and an IPv6 DHCP server. The
configuration files for the two DHCP server instances need to have two
pools:
Pool 1: Subnet overlaps interfaces
Pool 2: Subnet that includes the addresses
Configure the IPv4 DHCP Server
In a text editor, edit the dhcpd.conf file with a configuration
similar to the following:
Just as you did with the DHCP relay scripts, edit the DHCP server
configuration file so it can launch the DHCP server when the system
boots. Here is a sample configuration:
You can assign an IP address and other DHCP options based on physical
location or port regardless of MAC address to clients that are attached
directly to the Cumulus Linux switch through a switch port. This is
helpful when swapping out switches and servers; you can avoid the
inconvenience of collecting the MAC address and sending it to the
network administrator to modify the DHCP server configuration.
Edit the /etc/dhcp/dhcpd.conf file and add the interface name ifname
to assign an IP address through DHCP. The following provides an example:
The DHCP server knows whether a DHCP request is a relay or a non-relay
DHCP request. On isc-dhcp-server, for example, it is possible to tail
the log and look at the behavior firsthand:
cumulus@server02:~$ sudo tail /var/log/syslog | grep dhcpd
2016-12-05T19:03:35.379633+00:00 server02 dhcpd: Relay-forward message from 2001:db8:101::1 port 547, link address 2001:db8:101::1, peer address fe80::4638:39ff:fe00:3
2016-12-05T19:03:35.380081+00:00 server02 dhcpd: Advertise NA: address 2001:db8:1::110 to client with duid 00:01:00:01:1f:d8:75:3a:44:38:39:00:00:03 iaid = 956301315 valid for 600 seconds
2016-12-05T19:03:35.380470+00:00 server02 dhcpd: Sending Relay-reply to 2001:db8:101::1 port 547
Facebook Voyager Optical Interfaces
Facebook Voyager is a Broadcom Tomahawk-based switch with added Dense
Wavelength Division Multiplexing (DWDM) ports that can connect to another
switch thousands of kilometers away by adding transponders. DWDM allows
many separate connections on one fiber pair by sending them over
different wavelengths. Although the wavelengths are sent on the same
physical fiber, they do not interact with each other, similar to VLANs
on a trunk. Each wavelength can transport very high speeds over very
long distances.
The Voyager Platform
The Voyager platform has 16 ports on the front of the switch:
Twelve QSFP28 Ethernet ports labeled 1 through 12. These are
standard 100G ports that you configure like ports on other platforms
with a Tomahawk ASIC. The ports.conf file defines the breakout
configuration and the /etc/network/interfaces file defines the
other port parameters. When not broken out, they are named swp1 through swp12.
Four duplex LC ports labeled L1 through L4. L1 and L2 connect to
AC400 module 2. L3 and L4 connect to AC400 module 1. Each AC400
module connects to four Tomahawk ASIC ports.
The fc designations on the Tomahawk stand for Falcon Core. Each AC400
module has four 100G interfaces connected to the Tomahawk and two
interfaces connected to the front of the box.
Inside the AC400
The way in which the client ports are mapped to the network ports in an
AC400 depends on the modulation format and coupling mode. Cumulus Linux
supports five different modulation and coupling mode options on each
AC400 module.
| Network 0 Modulation | Network 1 Modulation | Independent/Coupled |
| --- | --- | --- |
| QPSK | QPSK | Independent |
| 16-QAM | 16-QAM | Independent |
| QPSK | 16-QAM | Independent |
| 16-QAM | QPSK | Independent |
| 8-QAM | 8-QAM | Coupled |
QPSK: Quadrature phase shift keying.
When a network interface is using QPSK modulation, it carries 100Gbps
and is therefore connected to only one client interface.
16-QAM: Quadrature amplitude modulation
with 4 bits per symbol. When a network interface is using 16-QAM
modulation, it carries 200Gbps and is therefore connected to two client
interfaces. Each of the two client interfaces carried on a network
interface is called a tributary. The AC400 adds extra information so
that these tributaries can be sorted out at the far end and delivered to
the appropriate client interface.
8-QAM: Quadrature amplitude modulation
with 3 bits per symbol. When a network interface is using 8-QAM
modulation, it carries 150Gbps. In this case, the two network interfaces
in an AC400 module must be coupled, so that the total bandwidth carried
by the two interfaces is 300Gbps. Three client interfaces are used with
this modulation format. However, unlike other modulation formats that
use independent mode, the coupled mode means that data from each client
interface is carried on both of the network interfaces.
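The arithmetic behind these rates can be sketched as follows. The 3 and 4 bits per symbol come from the text above; 2 bits per symbol for QPSK and the 50 Gbps-per-bit factor are assumptions chosen so that the results match the stated 100, 150, and 200 Gbps figures. The function names are illustrative only.

```python
# Bits per symbol for each modulation format. The 8-QAM and 16-QAM values
# are stated above; 2 bits per symbol for QPSK is the standard figure.
BITS_PER_SYMBOL = {"QPSK": 2, "8-QAM": 3, "16-QAM": 4}

GBPS_PER_BIT = 50        # assumption: 100 Gbps for QPSK / 2 bits per symbol
CLIENT_RATE_GBPS = 100   # each client interface is a 100G port


def line_rate_gbps(modulation: str) -> int:
    """Bandwidth carried by one network interface."""
    return BITS_PER_SYMBOL[modulation] * GBPS_PER_BIT


def clients_per_module(net0: str, net1: str) -> int:
    """Number of 100G client interfaces used by one AC400 module."""
    total = line_rate_gbps(net0) + line_rate_gbps(net1)
    return total // CLIENT_RATE_GBPS


for pair in [("QPSK", "QPSK"), ("16-QAM", "16-QAM"),
             ("QPSK", "16-QAM"), ("8-QAM", "8-QAM")]:
    print(pair, clients_per_module(*pair))
```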
Client to Network Connection
For each of the five supported modulation configurations, the client
interface to network interface connections are as follows:
| Configuration | Connections |
| --- | --- |
| QPSK and QPSK, independent | In this configuration, two client interfaces, 0 and 2, are mapped to the two network interfaces. Client interfaces 1 and 3 are not used. |
| 16-QAM and 16-QAM, independent | In this configuration, two client interfaces are mapped to each network interface. Each network interface, therefore, has two tributaries. |
| QPSK and 16-QAM, or 16-QAM and QPSK, independent | These configurations are combinations of the previous two. The network interface configured for QPSK connects to one client interface and the network interface configured for 16-QAM connects to two client interfaces. |
| 8-QAM and 8-QAM, coupled | This configuration uses three client interfaces, for a total of 300Gbps; 150Gbps on each network interface. Because the network interfaces are coupled, they cannot be connected to different far-end systems. Each network interface carries three tributaries. |
Configure the Voyager Ports
To configure the five modulation and coupling configurations described
above, edit the /etc/cumulus/ports.conf file. The ports do not exist
until you configure them.
The file has lines for the 12 QSFP28 ports. The four DWDM line ports are
labeled L1 through L4. To program the AC400 modulation and
coupling into the five configurations, configure these ports as follows:
| ports.conf | L1 Modulation | L2 Modulation | Independent/Coupled |
| --- | --- | --- | --- |
| L1=1x, L2=1x | QPSK | QPSK | Independent |
| L1=1x, L2=2x | QPSK | 16-QAM | Independent |
| L1=2x, L2=1x | 16-QAM | QPSK | Independent |
| L1=2x, L2=2x | 16-QAM | 16-QAM | Independent |
| L1=3/2, L2=3/2 | 8-QAM | 8-QAM | Coupled |
The following example /etc/cumulus/ports.conf file shows configuration
for all of the modes.
1=1x # Creates swp1
2=2x # Creates swp2s0 and swp2s1
3=4x # Creates four 25G ports: swp3s0, swp3s1, swp3s2, and swp3s3
4=1x40G # Creates swp4
5=4x10G # Creates four 10G ports: swp5s0, swp5s1, swp5s2, and swp5s3
6=1x
7=1x
8=1x
9=1x
10=1x
11=1x
12=1x
L1=2x # Creates swpL1s0 and swpL1s1
L2=1x # Creates swpL2
L3=3/2 # Creates swpL3s0, swpL3s1, and swpL3s2
L4=3/2 # Creates no "swpL4" ports since L4 is ganged with L3
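The naming pattern in the comments above can be expressed as a small lookup. This is a sketch only: the function name is hypothetical, and it assumes that in a coupled (3/2) pair the interfaces always appear on the first port of the pair (L1 or L3), as the example shows for L3 and L4.

```python
def line_port_interfaces(ports: dict[str, str]) -> dict[str, list[str]]:
    """Map each DWDM line port setting to the Linux interfaces it creates."""
    result = {}
    for label in ("L1", "L2", "L3", "L4"):
        mode = ports.get(label)
        if mode == "1x":
            result[label] = [f"swp{label}"]
        elif mode == "2x":
            result[label] = [f"swp{label}s0", f"swp{label}s1"]
        elif mode == "3/2":
            # Coupled pair: only the first port of the pair gets interfaces.
            is_first = label in ("L1", "L3")
            result[label] = [f"swp{label}s{i}" for i in range(3)] if is_first else []
        else:
            result[label] = []  # not configured, so no ports exist
    return result


example = {"L1": "2x", "L2": "1x", "L3": "3/2", "L4": "3/2"}
print(line_port_interfaces(example))
```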
Configure the Transponder Modules
The Voyager platform contains two AC400 transponder modules, which you
configure with NCLU commands.
Many commands include the <trans-port> parameter. This is the network
interface of the transponder or the port, as printed on the front of the
system: L1, L2, L3, or L4.
Using NCLU commands is the preferred way to configure the transponder
modules. However, as an alternative, you can edit the
/etc/cumulus/transponders.ini file to make configuration changes. See
Edit the transponders.ini file below.
Set the Transponder State
Each transponder module has a state, which is set to ready by default.
The available transponder states are listed below.
| Setting | Description |
| --- | --- |
| reset | The module is in the reset state. The module cannot be accessed and remains non-operational until the state is changed to one of the other states. |
| low-power | The module is in the low-power configuration state. The network interfaces are not powered up. This state can be used to configure the module before bringing it online. |
| tx-off | The receivers and transmitters are turned up, but there is nothing being transmitted. |
| ready | This is the fully operational state of the module. |
To change the state of the module, run the net add interface <trans-port> state (reset|low-power|tx-off|ready) command. For example,
to change the state of the transponder module to low power for L2, run
the following command:
cumulus@switch:~$ net add interface L2 state low-power
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
This command creates the following configuration snippet in the
/etc/cumulus/transponders.ini file:
Use caution when changing the setting; although this command specifies a
port, it affects an entire module. State changes on modules with
multiple ports affect all ports on the module, not just the port
specified.
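To see which ports a state change touches, the port-to-module wiring described earlier (L1 and L2 on module 2, L3 and L4 on module 1) can be put in a lookup table. The function name below is hypothetical; this is a sketch, not an NCLU feature.

```python
# Wiring described in "The Voyager Platform": module number -> line ports.
MODULE_PORTS = {1: ("L3", "L4"), 2: ("L1", "L2")}


def ports_affected_by_state_change(port: str) -> tuple[int, tuple[str, ...]]:
    """Return the module that owns the port and every port on that module."""
    for module, ports in MODULE_PORTS.items():
        if port in ports:
            return module, ports
    raise ValueError(f"unknown transponder port: {port}")


print(ports_affected_by_state_change("L2"))
```

For example, setting the state on L2 also changes L1, because both belong to module 2.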
Disable the Transmitter
You can disable or enable the transmitter of an individual network
interface.
To disable the transmitter of a network interface, run the net add interface <trans-port> transmit-disable command. The following example
command disables the L1 transmitter:
cumulus@switch:~$ net add interface L1 transmit-disable
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
This command creates the following configuration snippet in the
/etc/cumulus/transponders.ini file:
To enable the transmitter of an individual network interface, run the
net del interface <trans-port> transmit-disable command. The following
example command enables the L1 transmitter:
cumulus@switch:~$ net del interface L1 transmit-disable
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
This command creates the following configuration snippet in the
/etc/cumulus/transponders.ini file:
Set the Grid Spacing
You can set grid spacing between two adjacent channels (the distance
between channel frequencies) to 12.5 GHz or 50 GHz. The default spacing is
50 GHz.
To change the grid spacing, run the net add interface <trans-port> grid-spacing (12.5|50) command. The following command sets the grid
spacing on L2 to 12.5 GHz:
cumulus@switch:~$ net add interface L2 grid-spacing 12.5
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
This command creates the following configuration snippet in the /etc/cumulus/transponders.ini
file:
Set the Frequency
To set the frequency used by the network interface, run the net add interface <trans-port> frequency <trans-frequency> command.
<trans-frequency> is a floating point number in THz. The transponders
support 100 channels, from 191.15 THz to 196.10 THz. Tab-completion is
supported on this command and shows the available frequencies, together
with the corresponding channel number and wavelength.
The following example command sets the frequency used by L2 to 195.30 THz:
cumulus@switch:~$ net add interface L2 frequency 195.30
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
This command creates the following configuration snippet in the
/etc/cumulus/transponders.ini file:
To see a complete list of the frequencies, channels, and wavelengths,
run the net show transponder frequency-map command (described in
Display Available Frequencies).
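The relationship between channel number, frequency, and wavelength can be approximated with a short calculation. The sketch below assumes the default 50 GHz grid, with channel 1 at 191.15 THz and 100 channels, as stated above; it does not cover 12.5 GHz spacing, and the function names are illustrative rather than part of NCLU.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458
FIRST_CHANNEL_THZ = 191.15
CHANNEL_STEP_THZ = 0.05  # 50 GHz grid spacing


def channel_to_frequency_thz(channel: int) -> float:
    if not 1 <= channel <= 100:
        raise ValueError("channel must be between 1 and 100")
    return round(FIRST_CHANNEL_THZ + (channel - 1) * CHANNEL_STEP_THZ, 2)


def frequency_to_channel(freq_thz: float) -> int:
    channel = round((freq_thz - FIRST_CHANNEL_THZ) / CHANNEL_STEP_THZ) + 1
    if not 1 <= channel <= 100:
        raise ValueError("frequency is outside the supported range")
    if abs(channel_to_frequency_thz(channel) - freq_thz) > 1e-6:
        raise ValueError("frequency is not on the 50 GHz grid")
    return channel


def wavelength_nm(freq_thz: float) -> float:
    # wavelength = c / f, converted from meters to nanometers
    return round(SPEED_OF_LIGHT_M_PER_S / (freq_thz * 1e12) * 1e9, 2)


print(frequency_to_channel(195.30), wavelength_nm(195.30))
```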
Set the Transmit Power
To set the amount of transmit power for a network interface, run the
net add interface <trans-port> power <trans-dBm> command.
<trans-dBm> is the power as a floating point number in units of dBm.
This value can range from -35.0 to 10.0. The following example command
sets the transmit power for L1 to 10.0 dBm.
cumulus@switch:~$ net add interface L1 power 10.0
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
This command creates the following configuration snippet in the /etc/cumulus/transponders.ini
file:
Set the Modulation
To change the modulation technique used on a network interface, run the
net add interface <trans-port> modulation (16-qam|8-qam|pm-qpsk)
command. The available modulation options are 16-qam, 8-qam, and
pm-qpsk. The following example command changes the modulation on L1 to
8-qam:
cumulus@switch:~$ net add interface L1 modulation 8-qam
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Changing the modulation also changes the Linux interfaces available in
the system, removing existing interfaces and adding the new ones.
Therefore, you must remove network interfaces with the net del interface swpLx... command before you change the modulation. The
network interfaces created for each modulation are as follows (L1 is
used as an example):
| Modulation | Linux Interfaces |
| --- | --- |
| 16-qam | swpL1s0 and swpL1s1 |
| 8-qam | swpL1s0, swpL1s1, and swpL1s2 |
| pm-qpsk | swpL1 |
Because 8-qam modulation requires both network interfaces on a module to
operate together, changing the modulation on one interface also changes
it on the other. Also, the network mode of the module changes
automatically to coupled when changing to 8-qam and reverts to
independent when leaving 8-qam modulation.
The only modulation format that allows the 15%_ac100 FEC mode is
pm-qpsk. You cannot change the modulation from pm-qpsk while
15%_ac100 FEC is configured. First change the FEC mode to something
other than 15%_ac100, then change the modulation.
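The two rules above (8-qam implies coupled mode, and 15%_ac100 FEC is only valid with pm-qpsk) can be captured in a small pre-check. This is a sketch with hypothetical function names; it does not reflect how NCLU implements the checks.

```python
def network_mode_for(modulation: str) -> str:
    """8-qam runs both network interfaces together; other formats do not."""
    return "coupled" if modulation == "8-qam" else "independent"


def check_modulation_change(new_modulation: str, current_fec: str) -> None:
    """Raise if the FEC mode must be changed before the modulation."""
    if current_fec == "15%_ac100" and new_modulation != "pm-qpsk":
        raise ValueError(
            "change the FEC mode to something other than 15%_ac100 first"
        )


check_modulation_change("16-qam", "25%")  # allowed, returns None
print(network_mode_for("8-qam"))
```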
Set the Differential Encoding
To select non-differential encoding on the network interface, run the
net add interface <trans-port> non-differential command. To revert to
differential encoding (the default), run the net del interface <trans-port> non-differential command. The following example command
selects non-differential encoding for L1:
cumulus@switch:~$ net add interface L1 non-differential
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
This command creates the following configuration snippet in the
/etc/cumulus/transponders.ini file:
Set the FEC Mode
To select Forward Error Correction (FEC) mode, run the net add interface <trans-port> fec (15%|15%_ac100|25%) command. The available
modes are 15% (15% overhead SDFEC), 15%_ac100 (15% overhead SDFEC
compatible with AC100), and 25% (25% overhead SDFEC). The following
example command sets FEC mode on L1 to 15%:
cumulus@switch:~$ net add interface L1 fec 15%
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
This command creates the following configuration snippet in the
/etc/cumulus/transponders.ini file:
Enable Line Side Loopback Mode
Line side loopback mode enables you to send and receive data from the
same network interface port to verify that the port is operational.
To enable line side loopback mode, run the net add interface <interface> facility-loopback command. You can enable line side loopback mode on one or
multiple interfaces. The following example enables loopback mode
on the L1, L2, L3, and L4 network interfaces:
cumulus@switch:~$ net add interface L1-4 facility-loopback
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
To disable loopback mode, run the net del interface <interface> facility-loopback command. The following example disables loopback mode
on the L1, L2, L3, and L4 network interfaces:
cumulus@switch:~$ net del interface L1-4 facility-loopback
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
To enable loopback on the client interface (internal loopback for DWDM
testing), edit the /etc/cumulus/transponders.ini file. See
Edit the transponders.ini file below.
Display the Transponder Status
To display the current status of the transponder module, run the net show transponder command. The first two lines of command output
display the status of the module and the next section displays the
status of the network interfaces. This is repeated for each module in
the system.
cumulus@switch:~$ net show transponder
Module: 1 ready Acacia Comm Inc. AC400-004-330 S/N:170212599 53.88C 11.89V
Laser: 191.15 THz - 196.10 THz, 6.00 GHz fine tune, independent lanes
Network Interfaces
L3 L4
--------------------------- ---------------------------
Modulation 16-qam 16-qam
Frequency 193.70 THz, Channel 52 193.70 THz, Channel 52
Current BER 1.428e-04 1.387e-05
Current OSNR 84.90dBm 84.80dBm
Current Chromatic Disp 13ps/nm 9ps/nm
TX/RX Power 0.99dBm/0.66dBm 1.00dBm/0.43dBm
Encoding differential differential
Alignment TX & RX TX & RX
Grid Spacing 50ghz 50ghz
FEC Mode 25% 25%
Uncorrectable FEC Errs 0 0
TX/RX Turn-up power_adjusted/locked power_adjusted/locked
Module: 2 ready Acacia Comm Inc. AC400-004-330 S/N:170212585 55.00C 11.90V
Laser: 191.15 THz - 196.10 THz, 6.00 GHz fine tune, independent lanes
Network Interfaces
L1 L2
--------------------------- ---------------------------
Modulation 16-qam 16-qam
Frequency 193.70 THz, Channel 52 193.70 THz, Channel 52
Current BER 7.039e-05 7.404e-05
Current OSNR 84.90dBm 84.80dBm
Current Chromatic Disp 13ps/nm 9ps/nm
TX/RX Power 0.98dBm/0.48dBm 0.99dBm/-0.78dBm
Encoding differential differential
Alignment TX & RX TX & RX
Grid Spacing 50ghz 50ghz
FEC Mode 25% 25%
Uncorrectable FEC Errs 0 0
TX/RX Turn-up power_adjusted/locked power_adjusted/locked
To display only the status of a particular module, use the module <trans-module> option, which specifies the transponder module number.
The following example command displays the status of transponder module
1:
cumulus@switch:~$ net show transponder module 1
Module: 1 ready Acacia Comm Inc. AC400-004-330 S/N:170212599 53.75C 11.89V
Laser: 191.15 THz - 196.10 THz, 6.00 GHz fine tune, independent lanes
Network Interfaces
L3 L4
--------------------------- ---------------------------
Modulation 16-qam 16-qam
Frequency 193.70 THz, Channel 52 193.70 THz, Channel 52
Current BER 1.626e-04 1.343e-05
Current OSNR 84.90dBm 84.80dBm
Current Chromatic Disp 13ps/nm 9ps/nm
TX/RX Power 1.00dBm/0.67dBm 0.99dBm/0.42dBm
Encoding differential differential
Alignment TX & RX TX & RX
Grid Spacing 50ghz 50ghz
FEC Mode 25% 25%
Uncorrectable FEC Errs 0 0
TX/RX Turn-up power_adjusted/locked power_adjusted/locked
To display more information, including the host interfaces, use the
verbose option. The following example command displays more
information about the transponder module:
cumulus@switch:~$ net show transponder module 1 verbose
To display all status information in JSON format, use the json option.
The following example command displays all status information in JSON
format:
Edit the transponders.ini File
As an alternative to using NCLU commands to configure the transponder
modules (described above), you can edit the
/etc/cumulus/transponders.ini file, then initiate a hardware update.
Using NCLU commands to configure the transponder modules is the
preferred method. However, not all configuration options are available
with NCLU. If you want to change a transponder module configuration
setting that does not have an NCLU command, you can change the setting
manually in the transponders.ini file, then initiate the hardware
update. Use caution when editing the /etc/cumulus/transponders.ini
file.
The /etc/cumulus/transponders.ini file consists of groups of key-value
pairs, interspersed with comments. Configuration groups start with a
header line that contains the group name enclosed in square brackets
([ ]) and end implicitly by the start of the next group or the end of
the file. Key-value pairs have the form key=value. Spaces before and
after the = character are ignored. Lines beginning with # and blank
lines are considered comments.
Here is an example /etc/cumulus/transponders.ini file:
The Modules group identifies the names of the other groups in
the file. This is the root group from which all other groups are
referenced; it must always be the first group in the file and must be
named Modules.
There is only one key-value pair in this group. Each value in the list
represents a transponder in the system. There must be a group within the
file that has the same name as each value in the list.
The following example shows that there are two modules in the system
named AC400_1 and AC400_2. The transponders.ini file must contain
these two groups.
[Modules]
Names=AC400_1,AC400_2
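The rule that every name in the Names key needs a matching group can be checked with Python's standard configparser module. The sketch below is an illustration under assumptions: the sample text is trimmed to a few keys, and the helper names are hypothetical. Interpolation is disabled because values such as 25% contain a percent sign, and optionxform is overridden so that key names keep their case.

```python
import configparser

SAMPLE = """\
[Modules]
Names=AC400_1,AC400_2

[AC400_1]
Location=1

[AC400_2]
Location=2
"""


def load_transponders(text: str) -> configparser.ConfigParser:
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names case-sensitive
    parser.read_string(text)
    return parser


def missing_module_groups(parser: configparser.ConfigParser) -> list[str]:
    """Return names listed in [Modules] that have no group of their own."""
    names = [n.strip() for n in parser["Modules"]["Names"].split(",")]
    return [n for n in names if not parser.has_section(n)]


print(missing_module_groups(load_transponders(SAMPLE)))
```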
Module Groups
The module groups are individual groups for each of the predefined modules
and define the attributes of the transponders in the system. The name of a module group
is defined in the values of the Names key in the Modules group (shown
above).
The following table describes the key-value pairs in the module groups.
| Key | Value Type | Description |
| --- | --- | --- |
| Location | Integer: 1 or 2 | The location or identifier of the module within Voyager. Voyager has two modules, which are identified by indexes 1 and 2. Module 1 is connected to external network interfaces labeled L3 and L4. Module 2 is connected to L1 and L2. |
| NetworkMode | String: independent or coupled | The overall mode of the two network interfaces on the module. In coupled mode, traffic from a client interface travels on both network interfaces. In independent mode, traffic from a client interface travels on only one network interface. The default value is independent. Note: When network interfaces are configured in 8-qam mode, you must set this key to coupled. |
| NetworkInterfaces | Comma-separated list of network interface group names | Each value in the list represents a network interface connected to this module. There must be a group within the file that has the same name as each value in the list. Network interfaces are the module interfaces that leave the Voyager platform and are labeled L1, L2, L3, and L4 on the front of the Voyager. Note: Although you can use any string for the network interface group names, it is best to use the labels on the front of the Voyager to avoid confusion. |
| HostInterfaces | Comma-separated list of client interface group names | Each value in this list represents a client interface connected to this module. There must be a group within the file that has the same name as each value in the list. Client interfaces are the module interfaces that connect to the Tomahawk switching ASIC. |
| OperStatus | String: reset, low_power, tx_off, or ready | The operational status of the module. reset holds the module in the reset state. low_power configures the module before bringing the module to an operational state. tx_off means the module is fully functional, except that the transmitters on the network interfaces are turned off. ready means the module is fully functional. |
The following example provides the configuration for module 1. The
network interfaces are configured to operate independently and are
defined in the L3 and L4 groups in the file. The client interfaces
are defined in the Client0, Client1, Client2, and Client3 groups in the
file. The operational status of the module is ready.
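A module group matching this description can be sketched and validated in Python. The group text below is reconstructed from the prose and the key table, not copied from a switch; the group name AC400_1 is assumed to be module 1, and the helper `validate_module_group` is hypothetical.

```python
import configparser

# Reconstructed from the description above; an illustration only.
MODULE_1 = """\
[AC400_1]
Location=1
NetworkMode=independent
NetworkInterfaces=L3,L4
HostInterfaces=Client0,Client1,Client2,Client3
OperStatus=ready
"""

# Allowed values taken from the module group key table.
ALLOWED = {
    "Location": {"1", "2"},
    "NetworkMode": {"independent", "coupled"},
    "OperStatus": {"reset", "low_power", "tx_off", "ready"},
}


def validate_module_group(text: str, group: str) -> list[str]:
    """Return 'key=value' strings for every value outside the allowed set."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names case-sensitive
    parser.read_string(text)
    section = parser[group]
    return [
        f"{key}={section[key]}"
        for key, allowed in ALLOWED.items()
        if key in section and section[key] not in allowed
    ]


print(validate_module_group(MODULE_1, "AC400_1"))
```

Note that the file uses underscores (low_power, tx_off) where the NCLU commands use hyphens (low-power, tx-off), so a hyphenated value in the file is reported as invalid by this check.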
Network Interface Groups
The network interface groups define the attributes of the network
interfaces on the module. The name of a network interface group is
defined in the values of the NetworkInterfaces key in the module
groups.
The following table describes the key-value pairs in the network
interface groups.
Key
Value Type
Description
Location
Integer: 0-1
The location or index of the network interface within a module. The Voyager AC400 modules each have two network interfaces that are connected to the external ports as follows:
| Module Location | Network Interface Location | External Port |
| --- | --- | --- |
| 2 | 0 | L1 |
| 2 | 1 | L2 |
| 1 | 0 | L3 |
| 1 | 1 | L4 |
TxEnable
Boolean: true or false
Enable (true) or disable (false) the transmission of data.
TxGridSpacing
String: 100ghz, 50ghz, 33ghz, 25ghz, 12.5ghz, or 6.25ghz
Defines the channel spacing. The AC400 does not support variable-width channels; only different channel center frequencies.
The default is 50ghz. Only 50ghz and 12.5ghz are supported.
TxChannel
Integer: 1-100
The channel number upon which the network interface transmits and receives data.
Frequency and wavelength per channel:
| Channel Number | Frequency (THz) | Wavelength (nm) |
| --- | --- | --- |
| 1 | 191.15 | 1,568.36 |
| 2 | 191.20 | 1,567.95 |
| 3 | 191.25 | 1,567.54 |
| 4 | 191.30 | 1,567.13 |
| 5 | 191.35 | 1,566.72 |
| 6 | 191.40 | 1,566.31 |
| 7 | 191.45 | 1,565.91 |
| 8 | 191.50 | 1,565.50 |
| 9 | 191.55 | 1,565.09 |
| 10 | 191.60 | 1,564.68 |
| 11 | 191.65 | 1,564.27 |
| 12 | 191.70 | 1,563.86 |
| 13 | 191.75 | 1,563.46 |
| 14 | 191.80 | 1,563.05 |
| 15 | 191.85 | 1,562.64 |
| 16 | 191.90 | 1,562.23 |
| 17 | 191.95 | 1,561.83 |
| 18 | 192.00 | 1,561.42 |
| 19 | 192.05 | 1,561.01 |
| 20 | 192.10 | 1,560.61 |
| 21 | 192.15 | 1,560.20 |
| 22 | 192.20 | 1,559.79 |
| 23 | 192.25 | 1,559.39 |
| 24 | 192.30 | 1,558.98 |
| 25 | 192.35 | 1,558.58 |
| 26 | 192.40 | 1,558.17 |
| 27 | 192.45 | 1,557.77 |
| 28 | 192.50 | 1,557.36 |
| 29 | 192.55 | 1,556.96 |
| 30 | 192.60 | 1,556.56 |
| 31 | 192.65 | 1,556.15 |
| 32 | 192.70 | 1,555.75 |
| 33 | 192.75 | 1,555.34 |
| 34 | 192.80 | 1,554.94 |
| 35 | 192.85 | 1,554.54 |
| 36 | 192.90 | 1,554.13 |
| 37 | 192.95 | 1,553.73 |
| 38 | 193.00 | 1,553.33 |
| 39 | 193.05 | 1,552.93 |
| 40 | 193.10 | 1,552.52 |
| 41 | 193.15 | 1,552.12 |
| 42 | 193.20 | 1,551.72 |
| 43 | 193.25 | 1,551.32 |
| 44 | 193.30 | 1,550.92 |
| 45 | 193.35 | 1,550.52 |
| 46 | 193.40 | 1,550.12 |
| 47 | 193.45 | 1,549.72 |
| 48 | 193.50 | 1,549.32 |
| 49 | 193.55 | 1,548.92 |
| 50 | 193.60 | 1,548.52 |
| 51 | 193.65 | 1,548.12 |
| 52 | 193.70 | 1,547.72 |
| 53 | 193.75 | 1,547.32 |
| 54 | 193.80 | 1,546.92 |
| 55 | 193.85 | 1,546.52 |
| 56 | 193.90 | 1,546.12 |
| 57 | 193.95 | 1,545.72 |
| 58 | 194.00 | 1,545.32 |
| 59 | 194.05 | 1,544.92 |
| 60 | 194.10 | 1,544.53 |
| 61 | 194.15 | 1,544.13 |
| 62 | 194.20 | 1,543.73 |
| 63 | 194.25 | 1,543.33 |
| 64 | 194.30 | 1,542.94 |
| 65 | 194.35 | 1,542.54 |
| 66 | 194.40 | 1,542.14 |
| 67 | 194.45 | 1,541.75 |
| 68 | 194.50 | 1,541.35 |
| 69 | 194.55 | 1,540.95 |
| 70 | 194.60 | 1,540.56 |
| 71 | 194.65 | 1,540.16 |
| 72 | 194.70 | 1,539.77 |
| 73 | 194.75 | 1,539.37 |
| 74 | 194.80 | 1,538.98 |
| 75 | 194.85 | 1,538.58 |
| 76 | 194.90 | 1,538.19 |
| 77 | 194.95 | 1,537.79 |
| 78 | 195.00 | 1,537.40 |
| 79 | 195.05 | 1,537.00 |
| 80 | 195.10 | 1,536.61 |
| 81 | 195.15 | 1,536.22 |
| 82 | 195.20 | 1,535.82 |
| 83 | 195.25 | 1,535.43 |
| 84 | 195.30 | 1,535.04 |
| 85 | 195.35 | 1,534.64 |
| 86 | 195.40 | 1,534.25 |
| 87 | 195.45 | 1,533.86 |
| 88 | 195.50 | 1,533.47 |
| 89 | 195.55 | 1,533.07 |
| 90 | 195.60 | 1,532.68 |
| 91 | 195.65 | 1,532.29 |
| 92 | 195.70 | 1,531.90 |
| 93 | 195.75 | 1,531.51 |
| 94 | 195.80 | 1,531.12 |
| 95 | 195.85 | 1,530.73 |
| 96 | 195.90 | 1,530.33 |
| 97 | 195.95 | 1,529.94 |
| 98 | 196.00 | 1,529.55 |
| 99 | 196.05 | 1,529.16 |
| 100 | 196.10 | 1,528.77 |
OutputPower
Floating point number: 0 to +6
The output power of the network interface in dBm.
TxFineTuneFrequency
Integer
The fine tune frequency of the laser in units of 1 Hz. The AC400 modules on Voyager are only capable of 1 MHz resolution; you must specify this value in multiples of 1,000,000. The default value is 0.
MasterEnable
Boolean: true or false
Enables (true) or disables (false) the ability of the network lane modem to turn-up when leaving the low power state.
ModulationFormat
String: 16-qam, 8-qam, or pm-qpsk
Defines the modulation format used on the network interface:
16-qam operates at 200G
8-qam operates at 150G
pm-qpsk operates at 100G
Note: When selecting 8-qam, you must configure both network interfaces on a module for 8-qam and set the NetworkMode key of the module to coupled.
DifferentialEncoding
Boolean: true or false
Enables (true) or disables (false) differential encoding on the network interface.
FecMode
String: 15%, 15%_non_std, or 25%
Selects the type of forward error correction used on the network interface.
15% selects the 15% SDFEC
25% selects the 25% SDFEC
15%_non_std selects the 15% overhead AC100 compatible SDFEC
TxTributaryIndependent
List of two comma-separated integers
Defines which client interfaces map to this network interface when NetworkMode for the network interface is set to independent. The integers in the list are the Location values of the client interfaces. When operating in pm-qpsk, only the first client interface in the list is used.
Note: Do not change this value. The Tomahawk switching ASIC should be configured to steer data to the appropriate network interface, not this attribute.
TxTributaryCoupled
List of four comma-separated integers
Defines which client interfaces map to this network interface when NetworkMode for the network interface is set to coupled. The integers in the list are the Location values of the client interfaces. When operating in 8-qam, only the first three client interfaces in the list are used and only the attribute on the network interface at location 0 is used.
Note: Do not change this value. The Tomahawk switching ASIC should be configured to steer data to the appropriate network interface, not this attribute.
Loopback
Boolean: true or false
Enables (true) or disables (false) line side loopback mode on a network interface. When enabled, you send and receive data from the same network interface port to verify that the port is operational.
The following example shows a network interface at location 0, which
has transmission enabled and 50ghz channel spacing. Communication occurs
on channel 52 with 1dBm of power. The network interface becomes
operational when leaving the low power state. 16-qam modulation is used
(200G) with differential encoding and 25% overhead SDFEC. The tributary
mappings of the client interfaces are left unchanged. Loopback mode is
disabled.
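A network interface group with these settings can be sketched and read back in Python. The group text is reconstructed from the description and the key table rather than taken from a switch; the group name L3 is an assumption (location 0 on module 1), and the tributary keys are omitted because they are left at their defaults. Interpolation is disabled so that the percent sign in FecMode is read literally.

```python
import configparser

# Reconstructed from the description above; an illustration only.
NETWORK_INTERFACE = """\
[L3]
Location=0
TxEnable=true
TxGridSpacing=50ghz
TxChannel=52
OutputPower=1
MasterEnable=true
ModulationFormat=16-qam
DifferentialEncoding=true
FecMode=25%
Loopback=false
"""

parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str  # keep key names case-sensitive
parser.read_string(NETWORK_INTERFACE)

group = parser["L3"]
summary = {
    "channel": group.getint("TxChannel"),
    "power_dbm": group.getfloat("OutputPower"),
    "tx_enabled": group.getboolean("TxEnable"),
    "fec": group["FecMode"],
}
print(summary)
```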
Client Interface Groups
The client interface groups define the attributes of the client
interfaces on the module. The name of a client interface group is
defined in the values of the HostInterfaces key of the module group.
The following table describes the key-value pairs in the client
interface groups.
Because client interfaces are internal interfaces between the
transponder module and the Tomahawk switching ASIC, the default values
of these attributes do not typically need to be changed.
Key
Value Type
Description
Location
Integer: 0-3
The location or index of the client interface within a module. The Voyager AC400 modules each have four client interfaces that are connected to the Tomahawk ASIC as follows:
| Module Location | Client Interface Location | Tomahawk Falcon Core |
| --- | --- | --- |
| 1 | 0 | fc11 |
| 1 | 1 | fc12 |
| 1 | 2 | fc10 |
| 1 | 3 | fc9 |
| 2 | 0 | fc19 |
| 2 | 1 | fc18 |
| 2 | 2 | fc17 |
| 2 | 3 | fc16 |
Rate
String: otu4 or 100ge
The rate at which the client interface operates. Because the client interfaces on Voyager are always connected to a Tomahawk ASIC, always set this value to 100ge.
Enable
Boolean: true or false
Enables (true) or disables (false) the client interface.
FecDecoder
Boolean: true or false
Enables (true) or disables (false) FEC decoding for data received from the Tomahawk switching ASIC.
FecEncoder
Boolean: true or false
Enables (true) or disables (false) FEC encoding for data sent to the Tomahawk switching ASIC.
DeserialLfCtleGain
Integer: 0-8
These attributes configure the SERDES of the client interface. The values for these attributes have been carefully determined by hardware engineers; do not change them.
DeserialCtleGain
Integer: 0-20
DeserialDfeCoeff
Integer: 0-63
SerialTap0Gain
Integer: 0-7
SerialTap0Delay
Integer: 0-7
SerialTap1Gain
Integer: 0-7
SerialTap2Gain
Integer: 0-15
SerialTap2Delay
Integer: 0-7
RxTributaryIndependent
Integer: 0-1
Defines which network interface maps to this client interface when NetworkMode for the client interface is set to independent. The integer is the Location value of the network interface.
Note: Do not change this value. The Tomahawk switching ASIC should be configured to steer data from the appropriate network interface, not this attribute.
RxTributaryCoupled
Integer: 0-1
Defines which network interface maps to this client interface when NetworkMode for the client interface is set to coupled. The integer is the Location value of the network interface.
Note: Do not change this value. The Tomahawk switching ASIC should be configured to steer data from the appropriate network interface, not this attribute.
Loopback
Boolean: true or false
Enables (true) or disables (false) terminal loopback mode on a client interface. When enabled, you send and receive data from the same client interface port to verify that the port is operational. This is useful for DWDM testing.
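The value ranges in the table above lend themselves to a quick sanity check before you reload the daemon. The following Python sketch is one possible way to do that; it is not part of Cumulus Linux. It assumes that the group can be read with Python's configparser module, and the section name Client0 and its values are made up for illustration.

```python
import configparser

# Allowed integer ranges, copied from the table above.
INT_RANGES = {
    "Location": (0, 3),
    "DeserialLfCtleGain": (0, 8),
    "DeserialCtleGain": (0, 20),
    "DeserialDfeCoeff": (0, 63),
    "SerialTap0Gain": (0, 7),
    "SerialTap0Delay": (0, 7),
    "SerialTap1Gain": (0, 7),
    "SerialTap2Gain": (0, 15),
    "SerialTap2Delay": (0, 7),
    "RxTributaryIndependent": (0, 1),
    "RxTributaryCoupled": (0, 1),
}
BOOL_KEYS = {"Enable", "FecDecoder", "FecEncoder", "Loopback"}
RATES = {"otu4", "100ge"}


def check_group(items):
    """Return a list of problems found in one client interface group."""
    problems = []
    for key, value in items.items():
        if key in INT_RANGES:
            low, high = INT_RANGES[key]
            if not value.isdigit() or not low <= int(value) <= high:
                problems.append(f"{key}={value} is outside {low}-{high}")
        elif key in BOOL_KEYS:
            if value not in ("true", "false"):
                problems.append(f"{key}={value} is not true or false")
        elif key == "Rate" and value not in RATES:
            problems.append(f"Rate={value} is not otu4 or 100ge")
    return problems


# Hypothetical group: the section name and values are illustrative only.
SAMPLE = """
[Client0]
Location = 0
Rate = 100ge
Enable = true
SerialTap2Gain = 16
"""

parser = configparser.ConfigParser()
parser.optionxform = str  # keep the key names case-sensitive
parser.read_string(SAMPLE)
for section in parser.sections():
    for problem in check_group(dict(parser[section])):
        print(f"[{section}] {problem}")
```

With the sample above, the check reports that SerialTap2Gain is outside 0-15. Keys that are not in the table are ignored rather than rejected, because this sketch only knows about client interface attributes.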
The following example shows a sample configuration for a client
interface group.
After making a change to the transponders.ini file, you must program
the change into the hardware by issuing a systemd reload command:
sudo systemctl reload taihost.service
Depending on the configuration changes, programming the change into the
hardware can take a long time to complete (several minutes). The
systemd reload command initiates the configuration update and returns
immediately. To monitor the progress of the configuration changes,
review the syslog messages. The following is an example of the syslog
messages.
2018-04-24T18:18:49.847312+00:00 cumulus systemd[1]: Reloading TAI host daemon.
2018-04-24T18:18:49.859649+00:00 cumulus voyager_tai_adapter[5793]: SIGHUP received
2018-04-24T18:18:49.864101+00:00 cumulus voyager_tai_adapter[5793]: Setting TxChannel (5) to 52, was 48
2018-04-24T18:18:49.867615+00:00 cumulus voyager_tai_adapter[5793]: Setting OutputPower (6) to 1.000000, was 0.000000
2018-04-24T18:18:49.873785+00:00 cumulus voyager_tai_adapter[5793]: Setting FecMode (268435464) to 3, was 1
2018-04-24T18:18:49.890446+00:00 cumulus voyager_tai_adapter[5793]: Setting TxChannel (5) to 52, was 48
2018-04-24T18:18:49.893846+00:00 cumulus voyager_tai_adapter[5793]: Setting OutputPower (6) to 1.000000, was 0.000000
2018-04-24T18:18:49.900383+00:00 cumulus voyager_tai_adapter[5793]: Setting FecMode (268435464) to 3, was 1
2018-04-24T18:18:49.915172+00:00 cumulus voyager_tai_adapter[5793]: Setting Rate (268435456) to 1, was 0
2018-04-24T18:18:49.920618+00:00 cumulus voyager_tai_adapter[5793]: Setting FecDecoder (268435458) to false, was true
2018-04-24T18:18:49.924865+00:00 cumulus voyager_tai_adapter[5793]: Setting FecEncoder (268435459) to false, was true
2018-04-24T18:18:49.929181+00:00 cumulus voyager_tai_adapter[5793]: Setting DeserialLfCtleGain (268435462) to 1, was 5
2018-04-24T18:18:49.933236+00:00 cumulus voyager_tai_adapter[5793]: Setting DeserialCtleGain (268435463) to 18, was 19
2018-04-24T18:18:49.937091+00:00 cumulus systemd[1]: Reloaded TAI host daemon.
2018-04-24T18:18:49.941644+00:00 cumulus voyager_tai_adapter[5793]: Setting SerialTap0Delay (268435466) to 3, was 5
2018-04-24T18:18:49.946020+00:00 cumulus voyager_tai_adapter[5793]: Setting SerialTap1Gain (268435467) to 6, was 5
2018-04-24T18:18:49.948621+00:00 cumulus voyager_tai_adapter[5793]: Setting SerialTap2Gain (268435468) to 12, was 8
2018-04-24T18:18:49.952036+00:00 cumulus voyager_tai_adapter[5793]: Setting SerialTap2Delay (268435469) to 6, was 5
2018-04-24T18:18:49.957846+00:00 cumulus voyager_tai_adapter[5793]: Setting Rate (268435456) to 1, was 0
2018-04-24T18:18:49.962431+00:00 cumulus voyager_tai_adapter[5793]: Setting FecDecoder (268435458) to false, was true
2018-04-24T18:18:49.965701+00:00 cumulus voyager_tai_adapter[5793]: Setting FecEncoder (268435459) to false, was true
...
2018-04-24T18:21:24.164981+00:00 cumulus voyager_tai_adapter[5793]: Config has been reloaded
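Because the reload command returns before the hardware is programmed, it can help to scan the log for the completion message rather than rely on the systemd "Reloaded" line, which in the example above appears while attributes are still being set. The following Python sketch shows one way to summarize such messages. It assumes the message format shown in the example; the function name is hypothetical.

```python
import re

# Matches the "Setting <name> (<id>) to <new>, was <old>" messages shown above.
SETTING = re.compile(r"Setting (\w+) \((\d+)\) to (\S+), was (\S+)$")
DONE = "Config has been reloaded"


def summarize(lines):
    """Return (changes, finished) for an iterable of syslog lines."""
    changes = []
    finished = False
    for line in lines:
        # Only the adapter messages describe hardware programming.
        if "voyager_tai_adapter" not in line:
            continue
        text = line.rstrip()
        match = SETTING.search(text)
        if match:
            name, _attr_id, new, old = match.groups()
            changes.append((name, old, new))
        elif text.endswith(DONE):
            finished = True
    return changes, finished


sample = [
    "2018-04-24T18:18:49.864101+00:00 cumulus voyager_tai_adapter[5793]: Setting TxChannel (5) to 52, was 48",
    "2018-04-24T18:18:49.937091+00:00 cumulus systemd[1]: Reloaded TAI host daemon.",
    "2018-04-24T18:21:24.164981+00:00 cumulus voyager_tai_adapter[5793]: Config has been reloaded",
]
changes, finished = summarize(sample)
print(changes, finished)
```

The finished flag becomes true only when the adapter itself logs that the configuration has been reloaded.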
802.1X Interfaces
The IEEE 802.1X protocol provides a method of authenticating a client
(called a supplicant) over wired media. It also provides access for
individual MAC addresses on a switch (called the authenticator) after
those MAC addresses have been authenticated by an authentication
server, typically a RADIUS (Remote Authentication Dial In User Service,
defined by RFC 2865) server.
A Cumulus Linux switch acts as an intermediary between the clients
connected to the wired ports and the authentication server, which is
reachable over the existing network. EAPOL (Extensible Authentication
Protocol (EAP) over LAN - EtherType value of 0x888E, defined by
RFC 3748) operates on top of the
data link layer; the switch uses EAPOL to communicate with supplicants
connected to the switch ports.
Cumulus Linux implements 802.1X through the Debian hostapd package,
which has been modified to provide the PAE (port access entity).
Supported Features and Limitations
802.1X is supported on Broadcom-based switches (except the Hurricane2 switch). The Tomahawk, Tomahawk2, and Trident3 switches must be running in nonatomic mode.
The protocol is supported on physical interfaces only (bridged/access-only and routed interfaces), such as swp1 or swp2s0; these interfaces cannot be part of a bond. 802.1X is not supported on eth0.
Cumulus Linux 3.7.2 and later includes VRF support.
Cumulus Linux does not support 802.1X with MLAG; the switch cannot synchronize 802.1X authenticated MAC addresses over the peerlink.
MAB, parking VLAN and dynamic VLAN all require a bridge access port.
In traditional bridge mode, parking VLANs and dynamic VLANs both require the destination bridge to have a parking VLAN ID or dynamic VLAN ID tagged subinterface, respectively.
Enabling or disabling the 802.1X capability on ports results in hostapd reloading. However, existing authorized sessions do not get reset.
Changing any of the following RADIUS parameters restarts hostapd, which forces existing, authorized users to re-authenticate:
The RADIUS server IP address, shared secret, authentication port or accounting port
Parking VLAN ID
MAB activation delay
EAP reauthentication period
Removing all 802.1X interfaces
Changing the interface dot1x, dot1x mab, or dot1x parking-vlan settings does not reset existing authorized user ports.
You can configure up to three RADIUS servers for failover purposes.
NVIDIA performed tests with only a few wpa_supplicant (Debian), Windows 10 and Windows 7 supplicants.
RADIUS authentication is supported with FreeRADIUS and Cisco ACS.
Supports simple login/password, PEAP/MSCHAPv2 (Win7) and EAP-TLS (Debian).
802.1X supports RFC 5281 for EAP-TTLS, which provides more secure transport layer security.
There is no support for Mako template-based configurations.
Cumulus Linux 3.7.4 and later includes support for Multi Domain Authentication (MDA), where 802.1X is extended to allow authorization of multiple devices (a data and a voice device) on a single port and assign different VLANs to the devices based on authorization:
A maximum of four authorized devices (MAB + EAPOL) per port are supported.
The 802.1X enabled port must be a trunk port to allow tagged voice traffic from a phone; you cannot enable 802.1X on an access port.
Only one untagged VLAN and one tagged VLAN are supported on the 802.1X enabled ports.
Multiple MAB (non voice) devices on a port are supported for VLAN-aware bridges only. Authorization of multiple MAB devices for different VLANs is not supported.
If you upgraded Cumulus Linux from a version earlier than 3.3.0 instead
of performing a full disk install, you need to install the hostapd
package on your switch:
NCLU handles all the configuration of 802.1X interfaces, updating
hostapd and other components so you do not have to manually modify
configuration files. All the interfaces share the same RADIUS server
settings.
The 802.1X-specific settings are:
accounting-port: RADIUS accounting port, which defaults to 1813.
authentication-port: RADIUS authentication port, which defaults
to 1812.
server-ip: RADIUS Server IPv4 or IPv6 address, which has no
default, but is required. In Cumulus Linux 3.7.2 and later, you can
also specify a VRF.
shared-secret: RADIUS shared secret, which has no default, but
is required.
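The settings above map directly onto NCLU commands. As an illustration only, the following Python sketch assembles the command list from the four settings and an optional VRF. The helper name radius_commands is hypothetical; the command strings follow the net add dot1x radius syntax used elsewhere in this chapter.

```python
def radius_commands(server_ip, shared_secret, auth_port=1812, acct_port=1813, vrf=None):
    """Build the NCLU commands for the RADIUS settings listed above."""
    if not server_ip or not shared_secret:
        # Both settings have no default and are required.
        raise ValueError("server-ip and shared-secret are required")
    server = f"net add dot1x radius server-ip {server_ip}"
    if vrf:
        server += f" vrf {vrf}"  # VRF needs Cumulus Linux 3.7.2 or later
    commands = [server, f"net add dot1x radius shared-secret {shared_secret}"]
    # Only emit the port commands when they differ from the defaults.
    if auth_port != 1812:
        commands.append(f"net add dot1x radius authentication-port {auth_port}")
    if acct_port != 1813:
        commands.append(f"net add dot1x radius accounting-port {acct_port}")
    return commands + ["net pending", "net commit"]


for command in radius_commands("127.0.0.1", "testing123", vrf="blue"):
    print(command)
```

The sketch only prints the commands; running them on a switch is left to the operator or to an automation tool.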
Configure 802.1X Interfaces for a VLAN-aware Bridge
Make sure you configure the RADIUS server before you configure the 802.1X interfaces. See Configure the RADIUS Server, above, for details.
Create a simple interface bridge configuration on the switch and add the switch ports that are members of the bridge. You can use glob syntax to add a range of interfaces. The MAB and parking VLAN configurations require interfaces to be bridge access ports. The VLAN-aware bridge must be named bridge and there can be only one VLAN-aware bridge on a switch.
cumulus@switch:~$ net add bridge bridge ports swp1-4
Configure the settings for the 802.1X RADIUS server, including its IP address and shared secret:
cumulus@switch:~$ net add dot1x radius server-ip 127.0.0.1
cumulus@switch:~$ net add dot1x radius shared-secret testing123
In Cumulus Linux 3.7.2 and later, you can specify a VRF for outgoing RADIUS accounting and authorization packets. The following example specifies a VRF called blue:
cumulus@switch:~$ net add dot1x radius server-ip 127.0.0.1 vrf blue
cumulus@switch:~$ net add dot1x radius shared-secret mysecret
Enable 802.1X on interfaces.
cumulus@switch:~$ net add interface swp1-4 dot1x
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
In Cumulus Linux 3.7.4 and later, to assign a tagged VLAN for voice devices and assign different VLANs to the devices based on authorization, run these commands:
cumulus@switch:~$ net add interface swp1-4 dot1x voice-enable
cumulus@switch:~$ net add interface swp1-4 dot1x voice-enable vlan 200
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following configuration snippet in the
/etc/network/interfaces file:
cumulus@switch:~$ cat /etc/network/interfaces
...
auto swp1
iface swp1
    bridge-learning off
auto swp2
iface swp2
    bridge-learning off
auto swp3
iface swp3
    bridge-learning off
auto swp4
iface swp4
    bridge-learning off
...
auto bridge
iface bridge
    bridge-ports swp1 swp2 swp3 swp4
    bridge-vlan-aware yes
Verify the 802.1X configuration, showing the configuration and its
status:
cumulus@switch:~$ net show configuration commands | grep dot1x
dot1x radius server-ip 127.0.0.1
dot1x radius authentication-port 1812
dot1x radius accounting-port 1813
dot1x radius shared-secret testing123
interface swp2,swp3,swp1,swp4 dot1x
cumulus@switch:~$ net show dot1x status
IEEE802.1X Enabled Status: enabled
IEEE802.1X Active Status: active
Configure 802.1X Interfaces for a Traditional Mode Bridge
NCLU and hostapd might change traditional mode configurations on the
bridge-ports line in /etc/network/interfaces by adding or deleting
special 802.1X traditional mode bridge-ports configuration stanzas in
/etc/network/interfaces.d/. The source configuration command in
/etc/network/interfaces must include these special configuration
filenames. Include at least source /etc/network/interfaces.d/*.intf so
that these files are sourced during an ifreload.
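One way to confirm that the source line is in place is to test whether any source pattern in the file covers an .intf file under /etc/network/interfaces.d/. The Python sketch below illustrates the idea. The function name and the sample filename dot1x_example.intf are hypothetical, and the check uses simple glob matching rather than the exact rules that ifupdown2 applies.

```python
import fnmatch


def sources_intf_files(interfaces_text,
                       sample="/etc/network/interfaces.d/dot1x_example.intf"):
    """Return True if a source line's glob covers .intf files in interfaces.d."""
    for line in interfaces_text.splitlines():
        # Drop comments, then split the remaining text into words.
        words = line.split("#", 1)[0].split()
        if len(words) >= 2 and words[0] == "source":
            if any(fnmatch.fnmatch(sample, pattern) for pattern in words[1:]):
                return True
    return False


config = "source /etc/network/interfaces.d/*.intf\nauto lo\niface lo inet loopback\n"
print(sources_intf_files(config))
```

To check a real switch, pass the contents of /etc/network/interfaces to the function instead of the sample string.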
Create some uplink ports. The following example uses bonds:
cumulus@switch:~$ net add bond bond1 bond slaves swp5-6
cumulus@switch:~$ net add bond bond2 bond slaves swp7-8
Create a traditional mode bridge configuration on the switch and add the switch ports that are members of the bridge. A traditional mode bridge cannot be named bridge because that name is reserved for the single VLAN-aware bridge on the switch. You can use glob syntax to add a range of interfaces.
cumulus@switch:~$ net add bridge bridge1 ports swp1-4
Create bridge associations with the parking VLAN ID and the dynamic VLAN IDs. In this example, 600 is used for the parking VLAN ID and 700 is used for the dynamic VLAN ID:
cumulus@switch:~$ net add bridge br-vlan600 ports bond1.600
cumulus@switch:~$ net add bridge br-vlan700 ports bond2.700
Configure the settings for the 802.1X RADIUS server, including its IP address and shared secret:
cumulus@switch:~$ net add dot1x radius server-ip 127.0.0.1
cumulus@switch:~$ net add dot1x radius shared-secret testing123
In Cumulus Linux 3.7.2 and later, you can specify a VRF for outgoing RADIUS accounting and authorization packets. The following example specifies a VRF called blue:
cumulus@switch:~$ net add dot1x radius server-ip 127.0.0.1 vrf blue
cumulus@switch:~$ net add dot1x radius shared-secret mysecret
Enable 802.1X on interfaces, then review and commit the new configuration.
cumulus@switch:~$ net add interface swp1-2 dot1x
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Verify the 802.1X configuration, showing the configuration and its status:
cumulus@switch:~$ net show dot1x status
Hostapd IEEE 802.11 AP and IEEE 802.1X/WPA/WPA2/EAP Authenticator Daemon
Attribute Value
----------------------- ----------------
Current Status active (running)
Reload Status enabled
Interfaces swp1 swp2
MAB Interfaces
Parking VLAN Interfaces
Dynamic VLAN Status Disabled
cumulus@switch:~$ net show dot1x interface summary
Interface MAC Address Username State Authentication Type MAB VLAN
--------- ----------------- -------- ---------- ------------------- --- ----
swp1 00:02:00:00:00:01 host1 AUTHORIZED MD5 NO
swp2 00:02:00:00:00:02 host2 AUTHORIZED MD5 NO
Configure the Linux Supplicants
A sample FreeRADIUS server configuration needs to contain the entries
for users host1 and host2 on swp1 and swp2 for them to be placed in
a VLAN.
You can configure the accounting and authentication ports in Cumulus
Linux. The default values are 1813 for the accounting port and 1812
for the authentication port.
You can also change the reauthentication period for Extensible
Authentication Protocol (EAP). The period defaults to 0 (no
re-authentication is performed by the switch).
cumulus@switch:~$ net add dot1x radius authentication-port 2812
cumulus@switch:~$ net add dot1x radius accounting-port 2813
cumulus@switch:~$ net add dot1x eap-reauth-period 86400
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Configure MAC Authentication Bypass
MAC authentication bypass (MAB) enables bridge ports to allow devices to
bypass authentication based on their MAC address. This is useful for
devices that do not support PAE, such as printers or phones.
In Cumulus Linux 3.7.3 and earlier, MAB supports one authenticated
MAC address per port only. After a source MAC address is
authenticated, the port exits MAB mode. Cumulus Linux 3.7.4 and
later provides support for Multi Domain Authentication (MDA), where
802.1X is extended to allow authorization of multiple devices on a
single port and assign different VLANs to the devices based on
authorization.
You must configure MAB on both the RADIUS server and the RADIUS
client.
When using a VLAN-aware bridge, the switch port must be part of the
bridge named bridge.
To configure MAB in Cumulus Linux 3.7.3 and earlier, enable a bridge
port for MAB and change the MAB activation delay. You can change the MAB
activation delay from the default of 30 seconds, but the delay must be
between 5 and 30 seconds. After the delay limit is reached, the port
enters MAB mode.
cumulus@switch:~$ net add dot1x mab-activation-delay 20
cumulus@switch:~$ net add interface swp1 dot1x mab
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
To configure MAB in Cumulus Linux 3.7.4 and later, enable a bridge port
for MAB. The MAB activation delay is not used. For example:
cumulus@switch:~$ net add interface swp1 dot1x mab
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
To verify the configuration, run the net show dot1x status command:
cumulus@switch:~$ net show dot1x status
Hostapd IEEE 802.11 AP and IEEE 802.1X/WPA/WPA2/EAP Authenticator Daemon
Attribute Value
----------------------- ----------------
Current Status active (running)
Reload Status enabled
Interfaces swp1 swp2
MAB Interfaces swp1
Parking VLAN Interfaces
Dynamic VLAN Status Disabled
cumulus@switch:~$ net show dot1x interface summary
Interface MAC Address Username State Authentication Type MAB VLAN
--------- ----------------- ------------ ------------ ------------------- --- ----
swp1 00:02:00:00:00:08 000200000008 AUTHORIZED unknown YES
Configure a Parking VLAN
If a non-authorized supplicant tries to communicate with the switch, you
can route traffic from that device to a different VLAN and associate
that VLAN with one of the switch ports to which the supplicant is
attached.
For VLAN-aware bridges, the parking VLAN is assigned by manipulating the
PVID of the switch port. For traditional mode bridges, Cumulus Linux
identifies the bridge associated with the parking VLAN ID and moves the
switch port into that bridge. If an appropriate bridge is not found for
the move, then the port remains in an unauthenticated state where no
packets can be received or transmitted.
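The placement rules above can be modeled in a few lines. The following Python sketch is a simplified illustration of the described behavior, not the hostapd implementation. The function name, the dictionary layout, and the assumption that a tagged subinterface is named parent.vlan-id (as in bond1.600) are all for illustration.

```python
def place_port(port, vlan_id, vlan_aware, traditional_bridges):
    """Model where a port ends up when it is given a parking VLAN.

    traditional_bridges maps a bridge name to its member interfaces.
    """
    if vlan_aware:
        # VLAN-aware bridge: only the PVID of the port changes.
        return {"port": port, "bridge": "bridge", "pvid": vlan_id, "flags": []}
    suffix = f".{vlan_id}"
    for bridge, members in traditional_bridges.items():
        # Look for a bridge holding a subinterface tagged with the VLAN ID.
        if any(member.endswith(suffix) for member in members):
            return {"port": port, "bridge": bridge, "pvid": None, "flags": []}
    # No suitable bridge: the port stays unauthenticated and passes no traffic.
    return {"port": port, "bridge": None, "pvid": None, "flags": ["UNKNOWN_BR"]}


bridges = {"bridge1": ["swp1", "swp2"], "br-vlan600": ["bond1.600"]}
print(place_port("swp1", 600, False, bridges))
print(place_port("swp1", 777, False, bridges))
```

In this model, VLAN 600 finds the br-vlan600 bridge, while VLAN 777 finds no bridge and is flagged UNKNOWN_BR, which mirrors the failure case described above.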
When using a VLAN-aware bridge, the switch port must be part of the
bridge named bridge.
cumulus@switch:~$ net add dot1x parking-vlan-id 777
cumulus@switch:~$ net add interface swp1 dot1x parking-vlan
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
If the authentication for swp1 fails, the port is moved to the parking
VLAN:
cumulus@switch:~$ net show dot1x interface swp1 details
Interface MAC Address Attribute Value
--------- ----------------- ---------------------------- -----------------
swp1 00:02:00:00:00:08 Status Flags [PARKED_VLAN]
Username vlan60
Authentication Type MD5
VLAN 777
Session Time (seconds) 24772
EAPOL Frames RX 9
EAPOL Frames TX 12
EAPOL Start Frames RX 1
EAPOL Logoff Frames RX 0
EAPOL Response ID Frames RX 4
EAPOL Response Frames RX 8
EAPOL Request ID Frames TX 4
EAPOL Request Frames TX 8
EAPOL Invalid Frames RX 0
EAPOL Length Error Frames Rx 0
EAPOL Frame Version 2
EAPOL Auth Last Frame Source 00:02:00:00:00:08
EAPOL Auth Backend Responses 8
RADIUS Auth Session ID C2FED91A39D8D605
To verify the configuration, run the net show dot1x interface summary command:
cumulus@switch:~$ net show dot1x interface summary
Interface MAC Address Username State Authentication Type MAB VLAN
--------- ----------------- ------------ ------------ ------------------- --- ----
swp1 00:02:00:00:00:08 vlan60 PARKING VLAN MD5 NO 777
The following output shows a parking VLAN association failure. VLAN
association failure only occurs with traditional mode bridges when there
is no traditional bridge available with a parking VLAN ID-tagged
subinterface in it (notice the [UNKNOWN_BR] status in the output):
cumulus@switch:~$ net show dot1x interface swp1 details
Interface MAC Address Attribute Value
--------- ----------------- ---------------------------- -------------------------
swp1 00:02:00:00:00:08 Status Flags [PARKED_VLAN][UNKNOWN_BR]
Username vlan60
Authentication Type MD5
VLAN 777
Session Time (seconds) 24599
EAPOL Frames RX 3
EAPOL Frames TX 3
EAPOL Start Frames RX 1
EAPOL Logoff Frames RX 0
EAPOL Response ID Frames RX 1
EAPOL Response Frames RX 2
EAPOL Request ID Frames TX 1
EAPOL Request Frames TX 2
EAPOL Invalid Frames RX 0
EAPOL Length Error Frames Rx 0
EAPOL Frame Version 2
EAPOL Auth Last Frame Source 00:02:00:00:00:08
EAPOL Auth Backend Responses 2
RADIUS Auth Session ID C2FED91A39D8D605
Configure Dynamic VLAN Assignments
A common requirement for campus networks is to assign dynamic VLANs to
specific users in combination with IEEE 802.1X. After authenticating a
supplicant, the user is assigned a VLAN based on the RADIUS
configuration.
For VLAN-aware bridges, the dynamic VLAN is assigned by manipulating the
PVID of the switch port. For traditional mode bridges, Cumulus Linux
identifies the bridge associated with the dynamic VLAN ID and moves the
switch port into that bridge. If an appropriate bridge is not found for
the move, then the port remains in an unauthenticated state where no
packets can be received or transmitted.
To enable dynamic VLAN assignment globally, where VLAN attributes sent
from the RADIUS server are applied to the bridge, do the following:
cumulus@switch:~$ net add dot1x dynamic-vlan
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
You can specify the require option in the command so that VLAN
attributes are required. If VLAN attributes do not exist in the access
response packet returned from the RADIUS server, the user is not
authorized and has no connectivity. If the RADIUS server returns VLAN
attributes but the user has an incorrect password, the user is placed in
the parking VLAN (if you have configured parking VLAN).
cumulus@switch:~$ net add dot1x dynamic-vlan require
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
The following example shows a typical RADIUS configuration (shown for
FreeRADIUS, not typically configured or run on the Cumulus Linux device)
for a user with dynamic VLAN assignment:
# VLAN 100 Client Configuration for FreeRADIUS RADIUS Server.
# This is not part of the CL configuration.
vlan100client Cleartext-Password := "client1password"
    Service-Type = Framed-User,
    Tunnel-Type = VLAN,
    Tunnel-Medium-Type = "IEEE-802",
    Tunnel-Private-Group-ID = 100
Verify the configuration (notice the [AUTHORIZED] status in the
output):
cumulus@switch:~$ net show dot1x interface swp1 details
Interface MAC Address Attribute Value
--------- ----------------- ---------------------------- --------------------------
swp1 00:02:00:00:00:08 Status Flags [DYNAMIC_VLAN][AUTHORIZED]
Username host1
Authentication Type MD5
VLAN 888
Session Time (seconds) 799
EAPOL Frames RX 3
EAPOL Frames TX 3
EAPOL Start Frames RX 1
EAPOL Logoff Frames RX 0
EAPOL Response ID Frames RX 1
EAPOL Response Frames RX 2
EAPOL Request ID Frames TX 1
EAPOL Request Frames TX 2
EAPOL Invalid Frames RX 0
EAPOL Length Error Frames Rx 0
EAPOL Frame Version 2
EAPOL Auth Last Frame Source 00:02:00:00:00:08
EAPOL Auth Backend Responses 2
RADIUS Auth Session ID 939B1A53B624FC56
cumulus@switch:~$ net show dot1x interface summary
Interface MAC Address Username State Authentication Type MAB VLAN
--------- ----------------- ------------ ------------ ------------------- --- ----
swp1 00:02:00:00:00:08 000200000008 AUTHORIZED unknown NO 888
The following output shows a dynamic VLAN association failure. VLAN
association failure only occurs with traditional mode bridges when there
is no traditional bridge available with a dynamic VLAN ID-tagged
subinterface in it (notice the [UNKNOWN_BR] status in the output):
cumulus@switch:~$ net show dot1x interface swp1 details
Interface MAC Address Attribute Value
--------- ----------------- ---------------------------- --------------------------------------
swp1 00:02:00:00:00:08 Status Flags [DYNAMIC_VLAN][AUTHORIZED][UNKNOWN_BR]
Username host2
Authentication Type MD5
VLAN 888
Session Time (seconds) 11
EAPOL Frames RX 3
EAPOL Frames TX 3
EAPOL Start Frames RX 1
EAPOL Logoff Frames RX 0
EAPOL Response ID Frames RX 1
EAPOL Response Frames RX 2
EAPOL Request ID Frames TX 1
EAPOL Request Frames TX 2
EAPOL Invalid Frames RX 0
EAPOL Length Error Frames Rx 0
EAPOL Frame Version 2
EAPOL Auth Last Frame Source 00:02:00:00:00:08
EAPOL Auth Backend Responses 2
RADIUS Auth Session ID BDF731EF2B765B78
To disable dynamic VLAN assignment, so that VLAN attributes sent from
the RADIUS server are ignored and users are authenticated based on
existing credentials, run these commands:
cumulus@switch:~$ net del dot1x dynamic-vlan
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Enabling or disabling dynamic VLAN assignment restarts hostapd, which
forces existing, authorized users to re-authenticate.
Configure MAC Addresses per Port
In Cumulus Linux 3.7.4 and later, you can specify the maximum number of
authenticated MAC addresses allowed on a port with the net add dot1x max-number-stations <value> command. You can specify any number between
0 and 255. The default value is 4.
cumulus@switch:~$ net add dot1x max-number-stations 10
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Configure EAP Requests from the Switch
Cumulus Linux 3.7.3 and later provides the send-eap-request-id option,
which you can use to trigger EAP packets to be sent from the host side
of a connection. For example, this option is required in a configuration
where a PC connected to a phone attempts to send EAP packets to the
switch via the phone but the PC does not receive a response from the
switch (the phone might not be ready to forward packets to the switch
after a reboot). Because the switch does not receive EAP packets, it
attempts to authorize the PC with MAB instead of waiting for the
packets. In this case, the PC might be placed into a parking VLAN to
isolate it. To remove the PC from the parking VLAN, the switch needs to
send an EAP request to the PC to trigger EAP.
To configure the switch to send an EAP request, run these commands:
cumulus@switch:~$ net add dot1x send-eap-request-id
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Only run this command if MAB is configured on an interface.
The PC might attempt 802.1X authorization through the bridged connection in the back of the phone before the phone completes MAB authorization. In this case, 802.1X authentication fails.
The net del dot1x send-eap-request-id command disables this feature.
RADIUS Change of Authorization and Disconnect Requests
Extensions to the RADIUS protocol (RFC 5176) enable the Cumulus Linux
switch to act as a Dynamic Authorization Server (DAS) by listening for
Change of Authorization (CoA) requests from the RADIUS server (Dynamic
Authorization Client (DAC)) and taking action when needed, such as
bouncing a port or terminating a user session. The IEEE 802.1X server
(hostapd) running on Cumulus Linux has been adapted to handle these
additional, unsolicited RADIUS requests.
Configure DAS
To configure DAS, provide the UDP port (3799 is the default port), the
IP address, and the secret key for the DAS client.
The following example commands set the UDP port to the default port, the
IP address of the DAS client to 10.0.2.228, and the secret key to
myclientsecret:
cumulus@switch:~$ net add dot1x radius das-port default
cumulus@switch:~$ net add dot1x radius das-client-ip 10.0.2.228 das-client-secret myclientsecret
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
In Cumulus Linux 3.7.2 and later, you can specify a VRF so that incoming
RADIUS disconnect and CoA commands are received and acknowledged on the
correct interface when VRF is configured. The following example
specifies VRF blue:
cumulus@switch:~$ net add dot1x radius das-port default
cumulus@switch:~$ net add dot1x radius das-client-ip 10.0.2.228 vrf blue das-client-secret mysecret123
cumulus@switch:~$ net commit
In Cumulus Linux 3.7.4 and later, you can configure up to four DAS
clients to be authorized to send CoA commands. For example:
cumulus@switch:~$ net add dot1x radius das-port default
cumulus@switch:~$ net add dot1x radius das-client-ip 10.20.250.53 das-client-secret mysecret1
cumulus@switch:~$ net add dot1x radius das-client-ip 10.0.1.7 das-client-secret mysecret2
cumulus@switch:~$ net add dot1x radius das-client-ip 10.20.250.99 das-client-secret mysecret3
cumulus@switch:~$ net add dot1x radius das-client-ip 10.10.0.2 das-client-secret mysecret4
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
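When you generate DAS client commands from an inventory, it is worth checking the client count and the address format first. The following Python sketch is one possible pre-check, not part of NCLU. The helper name das_commands is hypothetical, and the limit of four clients is the one stated above for Cumulus Linux 3.7.4 and later.

```python
import ipaddress

MAX_DAS_CLIENTS = 4  # limit stated for Cumulus Linux 3.7.4 and later


def das_commands(clients):
    """Build NCLU commands for a list of (ip, secret) DAS client pairs."""
    if len(clients) > MAX_DAS_CLIENTS:
        raise ValueError(f"at most {MAX_DAS_CLIENTS} DAS clients are supported")
    commands = ["net add dot1x radius das-port default"]
    for ip, secret in clients:
        ipaddress.ip_address(ip)  # raises ValueError for a malformed address
        commands.append(
            f"net add dot1x radius das-client-ip {ip} das-client-secret {secret}"
        )
    return commands + ["net pending", "net commit"]


for command in das_commands([("10.20.250.53", "mysecret1"), ("10.0.1.7", "mysecret2")]):
    print(command)
```

A malformed address, such as one with five octets, raises ValueError before any command is produced.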
You can disable DAS in Cumulus Linux at any time by running the
following commands:
cumulus@switch:~$ net del dot1x radius das-port
cumulus@switch:~$ net del dot1x radius das-client-ip
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
To see DAS configuration information, run the net show configuration dot1x command. For example:
From the DAC, users can create a disconnect message using the
radclient utility (included in the Debian freeradius-utils package)
on the RADIUS server or other authorized client. A disconnect message is
sent as an unsolicited RADIUS Disconnect-Request packet to the switch to
terminate a user session and discard all associated session context. The
Disconnect-Request packet is used when the RADIUS server wants to
disconnect the user after the session has been accepted by the RADIUS
Access-Accept packet.
This is an example of a disconnect message created using the radclient
utility:
$ echo "Acct-Session-Id=D91FE8E51802097" > disconnect-packet.txt
$ ## OPTIONAL ## echo "User-Name=somebody" >> disconnect-packet.txt
$ echo "Message-Authenticator=1" >> disconnect-packet.txt
$ echo "Event-Timestamp=1532974019" >> disconnect-packet.txt
# now send the packet with the radclient utility (from freeradius-utils deb package)
$ cat disconnect-packet.txt | radclient -x 10.0.0.1:3799 disconnect myclientsecret
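The example above hard-codes the Event-Timestamp value. Because the switch compares this value with its local time, a script that sends disconnect messages would normally generate the timestamp at send time. The following Python sketch builds the same attribute file contents; the function name is hypothetical, and the sketch only produces the text that you would then pipe to radclient.

```python
import time


def disconnect_attributes(session_id=None, user_name=None, now=None):
    """Return the attribute lines for a RADIUS Disconnect-Request."""
    if not session_id and not user_name:
        raise ValueError("provide Acct-Session-Id, User-Name, or both")
    lines = []
    if session_id:
        lines.append(f"Acct-Session-Id={session_id}")
    if user_name:
        lines.append(f"User-Name={user_name}")
    lines.append("Message-Authenticator=1")
    # Use the current time unless the caller supplies a fixed value.
    timestamp = int(time.time() if now is None else now)
    lines.append(f"Event-Timestamp={timestamp}")
    return "\n".join(lines) + "\n"


print(disconnect_attributes(session_id="D91FE8E51802097", now=1532974019), end="")
```

Called with the session ID and timestamp from the example above, the function returns the same three lines that the echo commands write to disconnect-packet.txt.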
To prevent unauthorized servers from disconnecting users, the
Disconnect-Request packet must include certain identification attributes
(described below). For a session to be disconnected, all parameters must
match their expected values at the switch. If the parameters do not
match, the switch discards the Disconnect-Request packet and sends a
Disconnect-NAK (negative acknowledgment message).
The Message-Authenticator attribute is required.
If the packet comes from a different source IP address than the one
defined by das-client-ip, the session is not disconnected and
hostapd logs the debug message: DAS: Drop message from unknown client.
The Event-Timestamp attribute is required. If Event-Timestamp in
the packet is outside the time window, a debug message is shown in
the hostapd logs: DAS: Unacceptable Event-Timestamp (1532978602; local time 1532979367) in packet from 10.10.0.21:45263 - drop
If the Acct-Session-Id attribute is omitted, the User-Name
attribute is used to find the session. If the User-Name attribute
is omitted, the Acct-Session-Id attribute is used. If both the
User-Name and the Acct-Session-Id attributes are supplied, both
must match the same session: the username provided by the supplicant
and the Acct-Session-Id of that session. If neither is given or there
is no match, a Disconnect-NAK message is returned to the RADIUS server
with Error-Cause "Session-Context-Not-Found" and the following
debug messages are shown in the log:
RADIUS DAS: Acct-Session-Id match
RADIUS DAS: No matches remaining after User-Name check
hostapd_das_find_global_sta: checking ifname=swp2
RADIUS DAS: No matches remaining after Acct-Session-Id check
RADIUS DAS: No matching session found
DAS: Session not found for request from 10.10.0.1:58385
DAS: Reply to 10.10.0.1:58385
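The session lookup rules above amount to filtering the active sessions by whichever identification attributes the request carries. The following Python sketch models that logic for illustration only. It is not hostapd's code, and the session dictionaries and function name are hypothetical.

```python
def find_sessions(sessions, acct_session_id=None, user_name=None):
    """Return the sessions selected by a Disconnect-Request.

    sessions is a list of dicts with "acct_session_id" and "user_name" keys.
    An empty result stands for a Disconnect-NAK with Session-Context-Not-Found.
    """
    if acct_session_id is None and user_name is None:
        return []
    matches = sessions
    if acct_session_id is not None:
        matches = [s for s in matches if s["acct_session_id"] == acct_session_id]
    if user_name is not None:
        matches = [s for s in matches if s["user_name"] == user_name]
    return matches


active = [
    {"acct_session_id": "C2FED91A39D8D605", "user_name": "host1"},
    {"acct_session_id": "939B1A53B624FC56", "user_name": "host2"},
]
print(find_sessions(active, user_name="host2"))
print(find_sessions(active, acct_session_id="C2FED91A39D8D605", user_name="host2"))
```

The second call supplies a session ID and a username that belong to different sessions, so nothing matches and the result is empty.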
The following is an example of the Disconnect-Request packet received by
the switch:
You can create a CoA bounce-host-port message from the RADIUS server
using the radclient utility (included in the Debian freeradius-utils
package). The bounce port can cause a link flap on an authentication
port, which triggers DHCP renegotiation from one or more hosts connected
to the port.
The following is an example of a Cisco AVPair CoA bounce-host-port
message sent from the radclient utility:
To check connectivity between two supplicants, ping one host from the
other:
root@host1:/home/cumulus# ping 11.0.0.2
PING 11.0.0.2 (11.0.0.2) 56(84) bytes of data.
64 bytes from 11.0.0.2: icmp_seq=1 ttl=64 time=0.604 ms
64 bytes from 11.0.0.2: icmp_seq=2 ttl=64 time=0.552 ms
^C
--- 11.0.0.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.552/0.578/0
You can run net show dot1x with the following options for more data:
json: Prints the command output in JSON format.
macs: Displays MAC address information.
port-details: Shows counters from the IEEE8021-PAE-MIB for ports.
radius-details: Shows counters from the RADIUS-CLIENT MIB (RFC
2618) for ports.
status: Displays the status of the daemon.
To check which MAC addresses are authorized by RADIUS:
cumulus@switch:~$ net show dot1x macs
Interface Attribute Value
----------- ------------- -----------------
swp1 MAC Addresses 00:02:00:00:00:01
swp2 No Data
swp3 No Data
swp4 No Data
To check tc rules in /var/lib/hostapd/acl/tc_swpX.rules:
cumulus@switch:~$ sudo tc -s filter show dev swpXX parent 1:
cumulus@switch:~$ sudo tc -s filter show dev swpXX parent ffff:
Prescriptive Topology Manager - PTM
In data center topologies, cabling correctly is time consuming and error
prone. Prescriptive Topology Manager (PTM) is a dynamic cabling
verification tool that helps detect and eliminate such errors. It
takes a Graphviz-DOT specified network cabling plan (something many
operators already generate), stored in a topology.dot file, and
couples it with runtime information derived from LLDP to verify that the
cabling matches the specification. The check is performed on every link
transition on each node in the network.
You can customize the topology.dot file to control ptmd at both the
global/network level and the node/port level.
PTM runs as a daemon, named ptmd.
For more information, see man ptmd(8).
Supported Features
Topology verification using LLDP. ptmd creates a client connection
to the LLDP daemon, lldpd, and retrieves the neighbor relationship
between the nodes/ports in the network and compares them against the
prescribed topology specified in the topology.dot file.
Only physical interfaces, like swp1 or eth0, are currently
supported. Cumulus Linux does not support specifying virtual
interfaces like bonds or subinterfaces like eth0.200 in the topology
file.
Integration with FRRouting (PTM to FRRouting notification).
Client management: ptmd creates an abstract named socket
/var/run/ptmd.socket on startup. Other applications can connect to
this socket to receive notifications and send commands.
Event notifications: see Scripts below.
User configuration via a topology.dot file; see below.
Configure PTM
ptmd verifies the physical network topology against a DOT-specified
network graph file, /etc/ptm.d/topology.dot.
At startup, ptmd connects to lldpd, the LLDP daemon, over a Unix
socket and retrieves the neighbor name and port information. It then
compares the retrieved port information with the configuration
information that it read from the topology file. If the two match, the
result is PASS; otherwise, the result is FAIL.
PTM performs its LLDP neighbor check using the PortID ifname TLV
information. Previously, it used the PortID port description TLV
information.
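The comparison described above can be sketched in Python. This is a minimal illustration, not the ptmd implementation: the edge syntax ("host":"port" -- "host":"port"), the function name expected_neighbors, and the sample hosts spine1, leaf1, and leaf2 are all assumptions made for the example.

```python
import re

# Assumed edge syntax for the sketch: "hostA":"portA" -- "hostB":"portB"
EDGE_RE = re.compile(r'"([^"]+)":"([^"]+)"\s*--\s*"([^"]+)":"([^"]+)"')

def expected_neighbors(dot_text, self_node):
    """Map each local port of self_node to its prescribed (host, port) neighbor."""
    expected = {}
    for host_a, port_a, host_b, port_b in EDGE_RE.findall(dot_text):
        if host_a == self_node:
            expected[port_a] = (host_b, port_b)
        elif host_b == self_node:
            expected[port_b] = (host_a, port_a)
    return expected

# Hypothetical cabling plan used only for this illustration.
TOPOLOGY = '''
graph G {
    "spine1":"swp1" -- "leaf1":"swp51";
    "spine1":"swp2" -- "leaf2":"swp51";
}
'''

print(expected_neighbors(TOPOLOGY, "leaf1"))
```

Each node extracts only the edges that mention its own name, which is why the same file can be used unchanged on every switch.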
Basic Topology Example
This is a basic example DOT file and its corresponding topology diagram.
Use the same topology.dot file on all switches rather than splitting
the file per device; this makes automation easier because you can push
or pull the identical file to every device.
ptmd executes scripts at /etc/ptm.d/if-topo-pass and
/etc/ptm.d/if-topo-fail for each interface that goes through a
change, running if-topo-pass when an LLDP or BFD check passes and
running if-topo-fail when the check fails. The scripts receive an
argument string that is the result of the ptmctl command, described in
the ptmd commands section below.
You should modify these default scripts as needed.
Configuration Parameters
You can configure ptmd parameters in the topology file. The parameters
fall into four classes: host-only, global, per-port/node, and template.
Host-only Parameters
Host-only parameters apply to the entire host on which PTM is running.
You can include the hostnametype host-only parameter, which specifies
whether PTM should use only the host name (hostname) or the
fully-qualified domain name (fqdn) while looking for the self-node
in the graph file. For example, in the graph file below, PTM will ignore
the FQDN and only look for switch04, since that is the host name of
the switch it’s running on:
Always wrap the hostname in double quotes, like
"www.example.com". Otherwise, ptmd can fail if you specify a
fully-qualified domain name as the hostname and do not wrap it in double
quotes.
Further, to avoid errors when starting the ptmd process, make sure
that /etc/hosts and /etc/hostname both reflect the hostname you are
using in the topology.dot file.
However, in this next example, PTM will compare using the FQDN and look
for switch05.cumulusnetworks.com, which is the FQDN of the switch it’s
running on:
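Setting the graph file syntax aside, the effect of hostnametype can be sketched in Python. This is an illustration only: the function name self_node_name is hypothetical, and the FQDN used for switch04 is an assumption (the text above gives only its host name).

```python
def self_node_name(system_fqdn, hostnametype):
    """Return the node name to look for in the graph file."""
    if hostnametype == "fqdn":
        # Compare using the full name.
        return system_fqdn
    if hostnametype == "hostname":
        # Ignore everything after the first dot.
        return system_fqdn.split(".", 1)[0]
    raise ValueError(f"unsupported hostnametype: {hostnametype!r}")

# switch04.cumulusnetworks.com is an assumed FQDN for the first switch.
print(self_node_name("switch04.cumulusnetworks.com", "hostname"))
print(self_node_name("switch05.cumulusnetworks.com", "fqdn"))
```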
Global parameters apply to every port listed in the topology file.
There are two global parameters: LLDP and BFD. LLDP is enabled by
default; if no keyword is present, default values are used for all
ports. However, BFD is disabled if no keyword is present, unless there
is a per-port override configured. For example:
Templates provide flexibility in choosing different parameter
combinations and applying them to a given port. A template instructs
ptmd to reference a named parameter string instead of a default one.
ptmd supports two template parameter strings:
bfdtmpl, which specifies a custom parameter tuple for BFD.
lldptmpl, which specifies a custom parameter tuple for LLDP.
ptmd supports the following LLDP parameters:
match_type, which defaults to the interface name (ifname), but
can accept a port description (portdescr) instead if you want
lldpd to compare the topology against the port description instead
of the interface name. You can set this parameter globally or at the
per-port level.
match_hostname, which defaults to the host name (hostname), but
enables PTM to match the topology using the fully-qualified domain
name (fqdn) supplied by LLDP.
The following is an example of a topology with LLDP applied at the port
level:
When you specify match_hostname=fqdn, ptmd will match the entire
FQDN, like cumulus-2.domain.com in the example below. If you do not
specify anything for match_hostname, ptmd will match based on
hostname only, like cumulus-3 below, and ignore the rest of the domain name:
BFD provides low overhead and rapid detection of failures in the paths
between two network devices. It provides a unified mechanism for link
detection over all media and protocol layers. Use BFD to detect failures
for IPv4 and IPv6 single or multihop paths between any two network
devices, including unidirectional path failure detection. For
information about configuring BFD using PTM, see the BFD topic.
Check Link State with FRRouting
The FRRouting routing suite enables additional checks to ensure that
routing adjacencies are formed only on links that have connectivity
conformant to the specification, as determined by ptmd.
You only need to do this to check link state; you do not need to enable
PTM to determine BFD status.
When the global ptm-enable option is enabled, every interface has an
implied ptm-enable line in the configuration stanza in the interfaces
file.
To enable the global ptm-enable option, run the following FRRouting
command:
To disable the checks, delete the ptm-enable parameter from the
interface. For example:
cumulus@switch:~$ net del interface swp51 ptm-enable
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
If you need to reenable PTM for that interface, run:
cumulus@switch:~$ net add interface swp51 ptm-enable
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
With PTM enabled on an interface, the zebra daemon connects to ptmd
over a Unix socket. Any time there is a change of status for an
interface, ptmd sends notifications to zebra. Zebra maintains a
ptm-status flag per interface and evaluates routing adjacency based on
this flag. To check the per-interface ptm-status:
cumulus@switch:~$ net show interface swp1
Interface swp1 is up, line protocol is up
Link ups: 0 last: (never)
Link downs: 0 last: (never)
PTM status: disabled
vrf: Default-IP-Routing-Table
index 3 metric 0 mtu 1550
flags: <UP,BROADCAST,RUNNING,MULTICAST>
HWaddr: c4:54:44:bd:01:41
ptmd Service Commands
PTM sends client notifications in CSV format.
cumulus@switch:~$ sudo systemctl start|restart|force-reload ptmd.service: Starts or restarts the ptmd service. The topology.dot
file must be present in order for the service to start.
cumulus@switch:~$ sudo systemctl reload ptmd.service: Instructs ptmd
to read the topology.dot file again without restarting, applying the
new configuration to the running state.
cumulus@switch:~$ sudo systemctl stop ptmd.service: Stops the ptmd
service.
cumulus@switch:~$ sudo systemctl status ptmd.service: Retrieves the
current running state of ptmd.
ptmctl Commands
ptmctl is a client of ptmd; it retrieves the operational state of
the ports configured on the switch and information about BFD sessions
from ptmd. ptmctl parses the CSV notifications sent by ptmd.
See man ptmctl for more information.
ptmctl Examples
The examples below contain the following keywords in the output of the
cbl status column, which are described here:
pass: The interface is defined in the topology file, LLDP information is received on the interface, and the LLDP information for the interface matches the information in the topology file.
fail: The interface is defined in the topology file, LLDP information is received on the interface, and the LLDP information for the interface does not match the information in the topology file.
N/A: The interface is defined in the topology file, but no LLDP information is received on the interface. The interface may be down or disconnected, or the neighbor is not sending LLDP packets.
The N/A and fail statuses may indicate a wiring problem to investigate.
The N/A status is not shown when using the -l option with ptmctl. If you specify the -l option, ptmctl displays only those interfaces that are receiving LLDP information.
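The three cbl status keywords amount to a small decision rule, sketched here in Python. The function names and sample data are hypothetical; the sketch covers only interfaces that are defined in the topology file and does not reflect how ptmd is implemented internally.

```python
def cbl_status(expected, observed):
    """Classify one port.

    expected and observed are (sysname, port) tuples; observed is None
    when no LLDP information was received on the interface.
    """
    if observed is None:
        return "N/A"
    return "pass" if observed == expected else "fail"

def cable_report(plan, lldp_neighbors):
    """Return the cbl status for every port listed in the plan."""
    return {port: cbl_status(nbr, lldp_neighbors.get(port))
            for port, nbr in plan.items()}

# Hypothetical plan and LLDP data for illustration.
plan = {"swp45": ("h1", "swp1"), "swp46": ("h2", "swp1"), "swp47": ("h3", "swp1")}
lldp = {"swp45": ("h1", "swp1"), "swp46": ("h2", "swp2")}

print(cable_report(plan, lldp))
```

In this sample, swp46 is cabled to the wrong neighbor port and swp47 receives no LLDP information, so both deserve investigation.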
For basic output, use ptmctl without any options:
cumulus@switch:~$ sudo ptmctl
-------------------------------------------------------------
port cbl BFD BFD BFD BFD
status status peer local type
-------------------------------------------------------------
swp1 pass pass 11.0.0.2 N/A singlehop
swp2 pass N/A N/A N/A N/A
swp3 pass N/A N/A N/A N/A
For more detailed output, use the -d option:
cumulus@switch:~$ sudo ptmctl -d
--------------------------------------------------------------------------------------
port cbl exp act sysname portID portDescr match last BFD BFD
status nbr nbr on upd Type state
--------------------------------------------------------------------------------------
swp45 pass h1:swp1 h1:swp1 h1 swp1 swp1 IfName 5m: 5s N/A N/A
swp46 fail h2:swp1 h2:swp1 h2 swp1 swp1 IfName 5m: 5s N/A N/A
#continuation of the output
-------------------------------------------------------------------------------------------------
BFD BFD det_mult tx_timeout rx_timeout echo_tx_timeout echo_rx_timeout max_hop_cnt
peer DownDiag
-------------------------------------------------------------------------------------------------
N/A N/A N/A N/A N/A N/A N/A N/A
N/A N/A N/A N/A N/A N/A N/A N/A
cumulus@switch:~$
To return information on active BFD sessions ptmd is tracking, use the
-b option:
cumulus@switch:~$ sudo ptmctl -b
----------------------------------------------------------
port peer state local type diag
----------------------------------------------------------
swp1 11.0.0.2 Up N/A singlehop N/A
N/A 12.12.12.1 Up 12.12.12.4 multihop N/A
To return LLDP information, use the -l option. It returns only the
active neighbors currently being tracked by ptmd.
cumulus@switch:~$ sudo ptmctl -l
---------------------------------------------
port sysname portID port match last
descr on upd
---------------------------------------------
swp45 h1 swp1 swp1 IfName 5m:59s
swp46 h2 swp1 swp1 IfName 5m:59s
To return detailed information on active BFD sessions ptmd is
tracking, use the -b and -d options (results are for an
IPv6-connected peer):
cumulus@switch:~$ sudo ptmctl -b -d
----------------------------------------------------------------------------------------
port peer state local type diag det tx_timeout rx_timeout
mult
----------------------------------------------------------------------------------------
swp1 fe80::202:ff:fe00:1 Up N/A singlehop N/A 3 300 900
swp1 3101:abc:bcad::2 Up N/A singlehop N/A 3 300 900
#continuation of output
---------------------------------------------------------------------
echo echo max rx_ctrl tx_ctrl rx_echo tx_echo
tx_timeout rx_timeout hop_cnt
---------------------------------------------------------------------
0 0 N/A 187172 185986 0 0
0 0 N/A 501 533 0 0
ptmctl Error Outputs
If the topology file contains errors or no session exists, ptmctl
returns an error string. Typical error strings are:
Topology file error [/etc/ptm.d/topology.dot] [cannot find node cumulus] -
please check /var/log/ptmd.log for more info
Topology file error [/etc/ptm.d/topology.dot] [cannot open file (errno 2)] -
please check /var/log/ptmd.log for more info
No Hostname/MgmtIP found [Check LLDPD daemon status] -
please check /var/log/ptmd.log for more info
No BFD sessions . Check connections
No LLDP ports detected. Check connections
Unsupported command
For example:
cumulus@switch:~$ sudo ptmctl
-------------------------------------------------------------------------
cmd error
-------------------------------------------------------------------------
get-status Topology file error [/etc/ptm.d/topology.dot]
[cannot open file (errno 2)] - please check /var/log/ptmd.log
for more info
If you encounter errors with the topology.dot file, you can use dot
(included in the Graphviz package) to validate the syntax of the
topology file.
By simply opening the topology file with Graphviz, you can ensure that
it is readable and that the file format is correct.
If you edit the topology.dot file on a Windows system, double-check
the file formatting; extra characters might keep the graph from
working correctly.
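A quick scan for editor artifacts can be sketched in Python. The text above does not say which extra characters cause trouble, so the three checks here (byte order mark, CRLF line endings, curly quotes) are assumptions about common candidates, and the function name find_format_problems is hypothetical.

```python
def find_format_problems(raw):
    """Return a list of likely editor artifacts in the raw bytes of a topology file."""
    problems = []
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("UTF-8 byte order mark at start of file")
    if b"\r\n" in raw:
        problems.append("CRLF (Windows) line endings")
    curly_quotes = ("\u201c".encode("utf-8"), "\u201d".encode("utf-8"))
    if any(quote in raw for quote in curly_quotes):
        problems.append("typographic (curly) double quotes")
    return problems

# In practice you would read the file in binary mode, for example:
#   raw = open("/etc/ptm.d/topology.dot", "rb").read()
print(find_format_problems(b'graph G {\r\n}\r\n'))
```

An empty list means none of these particular artifacts was found; it does not prove that the DOT syntax is valid, which is what the dot utility checks.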
Caveats and Errata
Prior to version 2.1, Cumulus Linux stored the ptmd configuration
files in /etc/cumulus/ptm.d. When you upgrade to version 2.1 or
later, all the existing ptmd files are copied from their original
location to /etc/ptm.d with a dpkg-old extension, except for
topology.dot, which gets copied to /etc/ptm.d.
If you customized the if-topo-pass and if-topo-fail scripts,
they are also copied to dpkg-old, and you must modify them so they
can parse the CSV output correctly.
Sample if-topo-pass and if-topo-fail scripts are available in
/etc/ptm.d. A sample topology.dot file is available in
/usr/share/doc/ptmd/examples.
When ptmd is incorrectly in a failure state and the zebra interface
is enabled, BGP sessions on the physical interface (PIF) do not
establish routes, but the subinterface on top of it does establish routes.
If the subinterface is configured on the physical interface and the
physical interface is incorrectly marked as being in a PTM FAIL
state, routes on the physical interface are not processed in FRRouting,
but the subinterface continues to work.
If an LLDP neighbor advertises a PortDescr that contains commas, ptmctl -d splits the string on the commas and misplaces its components in other columns. Do not use commas in your port descriptions.
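The effect of commas in a port description can be reproduced with a naive comma split, sketched below in Python. The five-column record layout is hypothetical and is not the actual ptmd notification format; it only shows why extra commas shift fields into the wrong columns.

```python
def split_fields(line):
    """Naive comma split, as a consumer without quoting support would do."""
    return line.split(",")

# Hypothetical five-column record: port, cbl status, sysname, portID, portDescr
clean = "swp45,pass,h1,swp1,uplink to h1"
broken = "swp45,pass,h1,swp1,uplink, rack 4, row B"

print(split_fields(clean))   # five fields, one per column
print(split_fields(broken))  # seven fields, the description spills over
```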
Spanning tree protocol (STP) identifies links in the network and shuts down redundant links, preventing possible network loops and broadcast radiation on a bridged network. STP also provides redundant links for automatic failover when an active link fails. STP is enabled by default in Cumulus Linux for both VLAN-aware and traditional bridges.
Cumulus Linux supports RSTP, PVST, and PVRST modes:
Traditional bridges operate in both PVST and PVRST mode. The default is set to PVRST. Each traditional bridge has its own separate STP instance.
Per VLAN Spanning Tree (PVST) creates a spanning tree instance for a bridge. Rapid PVST (PVRST) supports RSTP enhancements for each spanning tree instance. To use PVRST with a traditional bridge, you must create a bridge corresponding to the untagged native VLAN and all the physical switch ports must be part of the same VLAN.
For maximum interoperability, when connected to a switch that has a native VLAN configuration, the native VLAN must be configured to be VLAN 1 only.
STP for a VLAN-aware Bridge
VLAN-aware bridges operate in RSTP mode only. RSTP on VLAN-aware bridges works with other modes in the following ways:
RSTP and STP
If a bridge running RSTP (802.1w) receives a common STP (802.1D) BPDU, it falls back to 802.1D automatically.
RSTP and PVST
The RSTP domain sends BPDUs on the native VLAN, whereas PVST sends BPDUs on a per VLAN basis. For both protocols to work together, you need to enable the native VLAN on the link between the RSTP and PVST domains; the spanning tree is built according to the native VLAN parameters.
The RSTP protocol does not send or parse BPDUs on other VLANs, but floods BPDUs across the network, enabling the PVST domain to maintain its spanning-tree topology and provide a loop-free network.
To enable proper BPDU exchange across the network, be sure to allow all VLANs participating in the PVST domain on the link between the RSTP and PVST domains.
When using RSTP together with an existing PVST network, you need to define the root bridge on the PVST domain. Either lower the priority on the PVST domain or change the priority of the RSTP switches to a higher number.
When connecting a VLAN-aware bridge to a proprietary PVST+ switch using STP, you must allow VLAN 1 on all 802.1Q trunks that interconnect them, regardless of the configured native VLAN. Only VLAN 1 enables the switches to address the BPDU frames to the IEEE multicast MAC address. The proprietary switch might be configured like this:
switchport trunk allowed vlan 1-100
RSTP and MST
RSTP works with MST seamlessly, creating a single instance of spanning tree that transmits BPDUs on the native VLAN.
RSTP treats the MST domain as one giant switch, whereas MST treats the RSTP domain as a different region. To enable proper communication between the regions, MST creates a Common Spanning Tree (CST) that connects all the boundary switches and forms the overall view of the MST domain. Because changes in the CST need to be reflected in all regions, the RSTP tree is included in the CST to ensure that changes on the RSTP domain are reflected in the CST domain. This does cause topology changes on the RSTP domain to impact the rest of the network but keeps the MST domain informed of every change occurring in the RSTP domain, ensuring a loop-free network.
Configure the root bridge within the MST domain by changing the priority on the relevant MST switch. When MST detects an RSTP link, it falls back into RSTP mode. The MST domain chooses the switch with the lowest cost to the CST root bridge as the CIST root bridge.
RSTP with MLAG
More than one spanning tree instance enables switches to load balance and use different links for different VLANs. With RSTP, there is only one instance of spanning tree. To better utilize the links, you can configure MLAG on the switches connected to the MST or PVST domain and set up these interfaces as an MLAG port. The PVST or MST domain thinks it is connected to a single switch and utilizes all the links connected to it. Load balancing is based on the port channel hashing mechanism instead of different spanning tree instances and uses all the links between the RSTP domain and the PVST or MST domain. For information about configuring MLAG, see Multi-Chassis Link Aggregation - MLAG.
Optional Configuration
There are a number of ways to customize STP in Cumulus Linux. Exercise caution when changing the settings below to prevent malfunctions in STP loop avoidance.
Spanning Tree Priority
If you have a multiple spanning tree instance (MSTI 0, also known as a common spanning tree, or CST), you can set the tree priority for a bridge. The bridge with the lowest priority is elected the root bridge. The priority must be a number between 0 and 61440, and must be a multiple of 4096. The default is 32768.
To set the tree priority, run the following commands:
The following example command sets the tree priority to 8192:
cumulus@switch:~$ net add bridge stp treeprio 8192
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Configure the tree priority (mstpctl-treeprio) under the bridge stanza in the /etc/network/interfaces file, then run the ifreload -a command. The following example command sets the tree priority to 8192:
cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto bridge
iface bridge
# bridge-ports includes all ports related to VxLAN and CLAG.
# does not include the Peerlink.4094 subinterface
bridge-ports bond01 bond02 peerlink vni13 vni24 vxlan4001
bridge-pvid 1
bridge-vids 13 24
bridge-vlan-aware yes
mstpctl-treeprio 8192
...
cumulus@switch:~$ sudo ifreload -a
Cumulus Linux supports MSTI 0 only. It does not support MSTI 1 through 15.
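The constraints on the tree priority (between 0 and 61440, in multiples of 4096) can be checked before you commit a value. The following Python sketch is illustrative; the function name validate_treeprio is hypothetical and is not part of any Cumulus Linux tool.

```python
def validate_treeprio(priority):
    """Return priority if it is a legal tree priority, else raise ValueError."""
    if not 0 <= priority <= 61440:
        raise ValueError(f"priority {priority} is outside 0-61440")
    if priority % 4096 != 0:
        raise ValueError(f"priority {priority} is not a multiple of 4096")
    return priority

# All legal values: 0, 4096, 8192, ..., 61440
VALID_PRIORITIES = list(range(0, 61441, 4096))

print(validate_treeprio(8192))
print(len(VALID_PRIORITIES))
```

Because only 16 values are legal, a lower priority than the default 32768 (for example, 8192) is enough to make a bridge win the root election against bridges left at the default.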
PortAdminEdge (PortFast Mode)
PortAdminEdge is equivalent to the PortFast feature offered by other vendors. It enables or disables the initial edge state of a port in a bridge.
All ports configured with PortAdminEdge bypass the listening and learning states to move immediately to forwarding.
PortAdminEdge mode might cause loops if it is not used with the BPDU guard feature.
It is common for edge ports to be configured as access ports for a simple end host; however, this is not mandatory. In the data center, edge ports typically connect to servers, which might pass both tagged and untagged traffic.
To configure PortAdminEdge mode:
The following example commands configure PortAdminEdge and BPDU guard for swp5.
cumulus@switch:~$ net add interface swp5 stp bpduguard
cumulus@switch:~$ net add interface swp5 stp portadminedge
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Configure PortAdminEdge and BPDU guard under the switch port interface stanza in the /etc/network/interfaces file, then run the ifreload -a command. The following example configures PortAdminEdge and BPDU guard on swp5.
PortAutoEdge is an enhancement to the standard PortAdminEdge (PortFast) mode, which allows for the automatic detection of edge ports. PortAutoEdge enables and disables the auto transition to and from the edge state of a port in a bridge.
Edge ports and access ports are not the same. Edge ports transition directly to the forwarding state and skip the listening and learning stages. Upstream topology change notifications are not generated when an edge port link changes state. Access ports only forward untagged traffic; however, there is no such restriction on edge ports, which can forward both tagged and untagged traffic.
When a BPDU is received on a port configured with PortAutoEdge, the port ceases to be in the edge port state and transitions into a normal STP port. When BPDUs are no longer received on the interface, the port becomes an edge port, and transitions through the discarding and learning states before resuming forwarding.
PortAutoEdge is enabled by default in Cumulus Linux.
To disable PortAutoEdge for an interface:
The following example commands disable PortAutoEdge on swp1:
cumulus@switch:~$ net add interface swp1 stp portautoedge no
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Edit the switch port interface stanza in the /etc/network/interfaces file to add the mstpctl-portautoedge no line, then run the ifreload -a command. The following example disables PortAutoEdge on swp1:
cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto swp1
iface swp1
alias to Server01
# Port to Server02
mstpctl-portautoedge no
...
cumulus@switch:~$ sudo ifreload -a
To reenable PortAutoEdge for an interface:
The following example commands reenable PortAutoEdge on swp1:
cumulus@switch:~$ net del interface swp1 stp portautoedge no
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Edit the switch port interface stanza in the /etc/network/interfaces file to remove mstpctl-portautoedge no, then run the ifreload -a command.
BPDU Guard
You can configure BPDU guard to protect the spanning tree topology from unauthorized switches affecting the forwarding path. For example, if you add a new switch to an access port off a leaf switch and this new switch is configured with a low priority, it might become the new root switch and affect the forwarding path for the entire layer 2 topology.
To configure BPDU guard:
The following example commands set BPDU guard for swp5:
cumulus@switch:~$ net add interface swp5 stp bpduguard
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Edit the switch port interface stanza in the /etc/network/interfaces file to add the mstpctl-bpduguard yes line, then run the ifreload -a command. The following example sets BPDU guard for interface swp5:
If a BPDU is received on the port, STP brings down the port and logs an error in /var/log/syslog. The following is a sample error:
mstpd: error, MSTP_IN_rx_bpdu: bridge:bond0 Recvd BPDU on BPDU Guard Port - Port Down
To determine whether BPDU guard is configured, or if a BPDU has been received:
cumulus@switch:~$ net show bridge spanning-tree | grep bpdu
bpdu guard port yes bpdu guard error yes
cumulus@switch:~$ mstpctl showportdetail bridge bond0
bridge:bond0 CIST info
enabled no role Disabled
port id 8.001 state discarding
external port cost 305 admin external cost 0
internal port cost 305 admin internal cost 0
designated root 8.000.6C:64:1A:00:4F:9C dsgn external cost 0
dsgn regional root 8.000.6C:64:1A:00:4F:9C dsgn internal cost 0
designated bridge 8.000.6C:64:1A:00:4F:9C designated port 8.001
admin edge port no auto edge port yes
oper edge port no topology change ack no
point-to-point yes admin point-to-point auto
restricted role no restricted TCN no
port hello time 10 disputed no
bpdu guard port yes bpdu guard error yes
network port no BA inconsistent no
Num TX BPDU 3 Num TX TCN 2
Num RX BPDU 488 Num RX TCN 2
Num Transition FWD 1 Num Transition BLK 2
bpdufilter port no
clag ISL no clag ISL Oper UP no
clag role unknown clag dual conn mac 0:0:0:0:0:0
clag remote portID F.FFF clag system mac 0:0:0:0:0:0
The only way to recover a port that has been placed in the disabled state is to manually bring up the port with the sudo ifup <interface> command. See Interface Configuration and Management for more information about ifupdown.
Bringing up the disabled port does not correct the problem if the configuration on the connected end-station has not been resolved.
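If you collect mstpctl showportdetail output from many ports, you can extract the two BPDU guard flags programmatically. The Python sketch below assumes the line layout shown in the sample output above; the function name bpdu_guard_state is hypothetical.

```python
import re

# Matches the "bpdu guard port ... bpdu guard error ..." line regardless of
# how much whitespace separates the columns.
GUARD_RE = re.compile(r"bpdu guard port\s+(yes|no)\s+bpdu guard error\s+(yes|no)")

def bpdu_guard_state(showportdetail_text):
    """Return the BPDU guard flags found in the text, or None if absent."""
    match = GUARD_RE.search(showportdetail_text)
    if match is None:
        return None
    return {"guard": match.group(1) == "yes", "error": match.group(2) == "yes"}

sample = "bpdu guard port yes bpdu guard error yes"
print(bpdu_guard_state(sample))
```

A result with both flags set indicates a port that has BPDU guard configured and has received a BPDU.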
Bridge Assurance
On a point-to-point link where RSTP is running, if you want to detect unidirectional links and put the port in a discarding state, you can enable bridge assurance on the port by enabling a port type network. The port is then in a bridge assurance inconsistent state until a BPDU is received from the peer. You need to configure the port type network on both ends of the link for bridge assurance to operate properly.
Bridge assurance is disabled by default.
To enable bridge assurance on an interface:
The following example commands enable bridge assurance on swp1:
cumulus@switch:~$ net add interface swp1 stp portnetwork
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Edit the switch port interface stanza in the /etc/network/interfaces file to add the mstpctl-portnetwork yes line, then run the ifreload -a command. The following example enables bridge assurance on swp5:
You can enable bpdufilter on a switch port, which filters BPDUs in both directions. This effectively disables STP on the port because no BPDUs transit it.
Using BPDU filter might cause layer 2 loops. Use this feature deliberately and with extreme caution.
To configure the BPDU filter on an interface:
The following example commands configure the BPDU filter on swp6:
cumulus@switch:~$ net add interface swp6 stp portbpdufilter
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Edit the switch port interface stanza in the /etc/network/interfaces file to add the mstpctl-portbpdufilter yes line, then run the ifreload -a command. The following example configures BPDU filter on swp6:
Spanning tree parameters are defined in the IEEE 802.1D and 802.1Q specifications.
The table below describes the STP configuration parameters available in Cumulus Linux.
Most of these parameters are blacklisted in the ifupdown_blacklist section of the /etc/netd.conf file. Before you configure these parameters, you must edit the file to remove them from the blacklist.
Parameter
NCLU Command
Description
mstpctl-maxage
net add bridge stp maxage <seconds>
Sets the maximum age of the bridge in seconds. The default is 20. The maximum age must meet the condition 2 * (Bridge Forward Delay - 1 second) >= Bridge Max Age.
mstpctl-ageing
net add bridge stp ageing <seconds>
Sets the Ethernet (MAC) address ageing time for the bridge in seconds when the running version is STP, but not RSTP/MSTP. The default is 1800.
mstpctl-fdelay
net add bridge stp fdelay <seconds>
Sets the bridge forward delay time in seconds. The default value is 15. The bridge forward delay must meet the condition 2 * (Bridge Forward Delay - 1 second) >= Bridge Max Age.
mstpctl-maxhops
net add bridge stp maxhops <max-hops>
Sets the maximum hops for the bridge. The default is 20.
mstpctl-txholdcount
net add bridge stp txholdcount <hold-count>
Sets the bridge transmit hold count. The default value is 6.
mstpctl-forcevers
net add bridge stp forcevers RSTP|STP
Sets the force STP version of the bridge to either RSTP or STP. The default is RSTP.
mstpctl-treeprio
net add bridge stp treeprio <priority>
Sets the tree priority of the bridge for an MSTI (multiple spanning tree instance). The priority value is a number between 0 and 61440 and must be a multiple of 4096. The bridge with the lowest priority is elected the root bridge. The default is 32768. See Spanning Tree Priority above. Note: Cumulus Linux supports MSTI 0 only. It does not support MSTI 1 through 15.
mstpctl-hello
net add bridge stp hello <seconds>
Sets the bridge hello time in seconds. The default is 2.
mstpctl-portpathcost
net add interface <interface> stp portpathcost <cost>
Sets the port cost of the interface. The default is 0. mstpd supports only long mode, which uses 32 bits for the path cost.
mstpctl-treeportprio
net add interface <interface> stp treeportprio <priority>
Sets the priority of the interface for the MSTI. The priority value is a number between 0 and 240 and must be a multiple of 16. The default is 128. Note: Cumulus Linux supports MSTI 0 only. It does not support MSTI 1 through 15.
mstpctl-portadminedge
net add interface <interface> stp portadminedge
Enables or disables the initial edge state of the interface in the bridge. The default is no. In NCLU, to use a setting other than the default, you must specify this attribute without setting an option. See PortAdminEdge above.
mstpctl-portautoedge
net add interface <interface> stp portautoedge
Enables or disables the auto transition to and from the edge state of the interface in the bridge. PortAutoEdge is enabled by default. See PortAutoEdge above.
mstpctl-portp2p
net add interface <interface> stp portp2p yes|no
Enables or disables the point-to-point detection mode of the interface in the bridge.
mstpctl-portrestrrole
net add interface <interface> stp portrestrrole
Enables or disables the ability of the interface in the bridge to take the restricted role. The default is no. To enable this feature with the NCLU command, you specify this attribute without an option (portrestrrole). To enable this feature by editing the /etc/network/interfaces file, you specify mstpctl-portrestrrole yes.
mstpctl-portrestrtcn
net add interface <interface> stp portrestrtcn
Enables or disables the ability of the interface in the bridge to propagate received topology change notifications. The default is no.
mstpctl-portnetwork
net add interface <interface> stp portnetwork
Enables or disables the bridge assurance capability for a network interface. The default is no. See Bridge Assurance above.
mstpctl-bpduguard
net add interface <interface> stp bpduguard
Enables or disables the BPDU guard configuration of the interface in the bridge. The default is no. See BPDU Guard above.
mstpctl-portbpdufilter
net add interface <interface> stp portbpdufilter
Enables or disables the BPDU filter functionality for an interface in the bridge. The default is no. See BPDU Filter above.
mstpctl-treeportcost
net add interface <interface> stp treeportcost <port-cost>
Sets the spanning tree port cost to a value from 0 to 255. The default is 0.
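The relationship between the maximum age and the forward delay given in the table, 2 * (Bridge Forward Delay - 1 second) >= Bridge Max Age, can be checked before you change either timer. The Python sketch below is illustrative; the function name timers_consistent is hypothetical.

```python
def timers_consistent(max_age, forward_delay):
    """Check 2 * (forward delay - 1 second) >= max age, with both values in seconds."""
    return 2 * (forward_delay - 1) >= max_age

# Defaults from the table: max age 20, forward delay 15 -> 2 * 14 = 28 >= 20
print(timers_consistent(20, 15))
# Lowering the forward delay to 10 breaks the condition: 2 * 9 = 18 < 20
print(timers_consistent(20, 10))
```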
Troubleshooting
To check STP status for a bridge:
Run the net show bridge spanning-tree command:
cumulus@switch:~$ net show bridge spanning-tree
Bridge info
enabled yes
bridge id 8.000.44:38:39:FF:40:94
Priority: 32768
Address: 44:38:39:FF:40:94
This bridge is root.
designated root 8.000.44:38:39:FF:40:94
Priority: 32768
Address: 44:38:39:FF:40:94
root port none
path cost 0 internal path cost 0
max age 20 bridge max age 20
forward delay 15 bridge forward delay 15
tx hold count 6 max hops 20
hello time 2 ageing time 300
force protocol version rstp
INTERFACE STATE ROLE EDGE
--------- ----- ---- ----
peerlink forw Desg Yes
vni13 forw Desg Yes
vni24 forw Desg Yes
vxlan4001 forw Desg Yes
The mstpctl utility provided by the mstpd service configures STP. The mstpd daemon is an open source project used by Cumulus Linux to implement IEEE 802.1D-2004 and IEEE 802.1Q-2011.
The mstpd daemon starts by default when the switch boots and logs errors to /var/log/syslog.
mstpctl is the preferred utility for interacting with STP on Cumulus Linux. brctl also provides certain tools for configuring STP; however, they are not as complete, and output from brctl might be misleading.
To show the bridge state, run the brctl show command:
cumulus@switch:~$ sudo brctl show
bridge name bridge id STP enabled interfaces
bridge 8000.001401010100 yes swp1
swp4
swp5
To show the mstpd bridge port state, run the mstpctl showport bridge command:
Storm control provides protection against excessive inbound BUM (broadcast, unknown unicast, multicast) traffic on layer 2 switch port interfaces, which can cause poor network performance.
Storm control is not supported on a switch with the Tomahawk2 ASIC.
On Broadcom switches, ARP requests over layer 2 VXLAN bypass broadcast storm control; they are forwarded to the CPU and subjected to embedded control plane QoS instead.
Configure Storm Control
To configure storm control for physical ports, edit the /etc/cumulus/switchd.conf file. For example, to enable broadcast storm control for swp1 at 400 packets per second (pps), multicast storm control at 3000 pps, and unknown unicast storm control at 500 pps, uncomment and set the storm_control.broadcast, storm_control.multicast, and storm_control.unknown_unicast lines:
cumulus@switch:~$ sudo nano /etc/cumulus/switchd.conf
...
# Storm Control setting on a port, in pps, 0 means disable
interface.swp1.storm_control.broadcast = 400
interface.swp1.storm_control.multicast = 3000
interface.swp1.storm_control.unknown_unicast = 500
...
When you update the /etc/cumulus/switchd.conf file, you must restart switchd for the changes to take effect.
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
Alternatively, you can run the following commands. The configuration below takes effect immediately, but does not persist if you reboot the switch. For a persistent configuration, edit the /etc/cumulus/switchd.conf file, as described above.
cumulus@switch:~$ sudo sh -c 'echo 400 > /cumulus/switchd/config/interface/swp1/storm_control/broadcast'
cumulus@switch:~$ sudo sh -c 'echo 3000 > /cumulus/switchd/config/interface/swp1/storm_control/multicast'
cumulus@switch:~$ sudo sh -c 'echo 500 > /cumulus/switchd/config/interface/swp1/storm_control/unknown_unicast'
To apply the same settings to a range of interfaces, use a for loop from the switch CLI, as in the following example:
cumulus@switch:mgmt:~$ for i in {1..5}; do
> sudo sh -c "echo 400 > /cumulus/switchd/config/interface/swp$i/storm_control/broadcast"
> sudo sh -c "echo 3000 > /cumulus/switchd/config/interface/swp$i/storm_control/multicast"
> sudo sh -c "echo 500 > /cumulus/switchd/config/interface/swp$i/storm_control/unknown_unicast"
> done
cumulus@switch:mgmt:~$
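The for loop above only changes the runtime settings. For the persistent form, the matching switchd.conf lines for a range of ports can be generated offline and pasted into the file. The following Python 3 sketch is only an illustration; the function name storm_control_lines is hypothetical, and the key format is copied from the swp1 example earlier in this section.

```python
def storm_control_lines(ports, broadcast, multicast, unknown_unicast):
    """Return switchd.conf lines that set storm control limits (pps) per port.

    The key layout mirrors the swp1 example in this section:
    interface.<port>.storm_control.<traffic type> = <pps>
    """
    limits = {
        "broadcast": broadcast,
        "multicast": multicast,
        "unknown_unicast": unknown_unicast,
    }
    lines = []
    for port in ports:
        for kind, pps in limits.items():
            lines.append(f"interface.{port}.storm_control.{kind} = {pps}")
    return lines


if __name__ == "__main__":
    # swp1 through swp5, same limits as the for loop example
    ports = [f"swp{i}" for i in range(1, 6)]
    print("\n".join(storm_control_lines(ports, 400, 3000, 500)))
```

The generated lines still need to be added to /etc/cumulus/switchd.conf by hand, and switchd must be restarted for them to take effect.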
Link Layer Discovery Protocol
The lldpd daemon implements the IEEE802.1AB (Link Layer Discovery
Protocol, or LLDP) standard. LLDP enables you to know which ports are
neighbors of a given port. By default, lldpd runs as a daemon and is
started at system boot. lldpd command line arguments are placed in
/etc/default/lldpd. lldpd configuration options are placed in
/etc/lldpd.conf or under /etc/lldpd.d/.
For more details on the command line arguments and config options, see
man lldpd(8).
lldpd supports CDP (Cisco Discovery Protocol, v1 and v2). lldpd logs
by default into /var/log/daemon.log with an lldpd prefix.
lldpcli is the CLI tool to query the lldpd daemon for neighbors,
statistics, and other running configuration information. See man lldpcli(8) for details.
Configure LLDP
You configure lldpd settings in /etc/lldpd.conf or /etc/lldpd.d/.
The last line in the example above shows that LLDP is disabled on eth0.
You can disable LLDP on a single port by editing the
/etc/default/lldpd file. This file specifies the default options to
present to the lldpd service when it starts. The following example
uses the -I option to disable LLDP on swp43:
cumulus@switch:~$ sudo nano /etc/default/lldpd
# Add "-x" to DAEMON_ARGS to start SNMP subagent
# Enable CDP by default
DAEMON_ARGS="-c -I *,!swp43"
lldpd has two timers defined by the tx-interval setting that affect each switch port:
The first timer catches any port-related changes.
The second is a system-based refresh timer on each port that looks for other changes like hostname. This timer uses the tx-interval value multiplied by 20.
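As a worked example of the multiplier above, the following Python sketch computes the system refresh timer from tx-interval. It also computes the time-to-live that LLDP peers are told to keep the information for, which in standard LLDP is tx-interval multiplied by tx-hold; that second relationship is not stated in this section and is included here as background. The function names are hypothetical.

```python
def system_refresh_seconds(tx_interval):
    """System-based refresh timer: tx-interval multiplied by 20."""
    return tx_interval * 20


def advertised_ttl_seconds(tx_interval, tx_hold):
    """TTL advertised to peers (standard LLDP: tx-interval * tx-hold).

    This relationship comes from the LLDP standard, not from this guide.
    """
    return tx_interval * tx_hold


if __name__ == "__main__":
    # 30 and 4 are the "Transmit delay" and "Transmit hold" values that
    # appear in the running-configuration output in this guide.
    print(system_refresh_seconds(30))
    print(advertised_ttl_seconds(30, 4))
```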
lldpd logs to /var/log/daemon.log with the lldpd prefix:
cumulus@switch:~$ sudo tail -f /var/log/daemon.log | grep lldp
Aug 7 17:26:17 switch lldpd[1712]: unable to get system name
Aug 7 17:26:17 switch lldpd[1712]: unable to get system name
Aug 7 17:26:17 switch lldpcli[1711]: lldpd should resume operations
Aug 7 17:26:32 switch lldpd[1805]: NET-SNMP version 5.4.3 AgentX subagent connected
Example lldpcli Commands
To show all neighbors on all ports/interfaces:
cumulus@switch:~$ sudo lldpcli show neighbors
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface: eth0, via: LLDP, RID: 1, Time: 0 day, 17:38:08
Chassis:
ChassisID: mac 08:9e:01:e9:66:5a
SysName: PIONEERMS22
SysDescr: Cumulus Linux version 2.5.4 running on quanta lb9
MgmtIP: 192.168.0.22
Capability: Bridge, on
Capability: Router, on
Port:
PortID: ifname swp47
PortDescr: swp47
-------------------------------------------------------------------------------
Interface: swp1, via: LLDP, RID: 10, Time: 0 day, 17:08:27
Chassis:
ChassisID: mac 00:01:00:00:09:00
SysName: MSP-1
SysDescr: Cumulus Linux version 3.0.0 running on QEMU Standard PC (i440FX + PIIX, 1996)
MgmtIP: 192.0.2.9
MgmtIP: fe80::201:ff:fe00:900
Capability: Bridge, off
Capability: Router, on
Port:
PortID: ifname swp1
PortDescr: swp1
-------------------------------------------------------------------------------
Interface: swp2, via: LLDP, RID: 10, Time: 0 day, 17:08:27
Chassis:
ChassisID: mac 00:01:00:00:09:00
SysName: MSP-1
SysDescr: Cumulus Linux version 3.0.0 running on QEMU Standard PC (i440FX + PIIX, 1996)
MgmtIP: 192.0.2.9
MgmtIP: fe80::201:ff:fe00:900
Capability: Bridge, off
Capability: Router, on
Port:
PortID: ifname swp2
PortDescr: swp2
-------------------------------------------------------------------------------
Interface: swp3, via: LLDP, RID: 11, Time: 0 day, 17:08:27
Chassis:
ChassisID: mac 00:01:00:00:0a:00
SysName: MSP-2
SysDescr: Cumulus Linux version 3.0.0 running on QEMU Standard PC (i440FX + PIIX, 1996)
MgmtIP: 192.0.2.10
MgmtIP: fe80::201:ff:fe00:a00
Capability: Bridge, off
Capability: Router, on
Port:
PortID: ifname swp1
PortDescr: swp1
-------------------------------------------------------------------------------
Interface: swp4, via: LLDP, RID: 11, Time: 0 day, 17:08:27
Chassis:
ChassisID: mac 00:01:00:00:0a:00
SysName: MSP-2
SysDescr: Cumulus Linux version 3.0.0 running on QEMU Standard PC (i440FX + PIIX, 1996)
MgmtIP: 192.0.2.10
MgmtIP: fe80::201:ff:fe00:a00
Capability: Bridge, off
Capability: Router, on
Port:
PortID: ifname swp2
PortDescr: swp2
-------------------------------------------------------------------------------
Interface: swp49s1, via: LLDP, RID: 9, Time: 0 day, 16:55:00
Chassis:
ChassisID: mac 00:01:00:00:0c:00
SysName: TORC-1-2
SysDescr: Cumulus Linux version 3.0.0 running on QEMU Standard PC (i440FX + PIIX, 1996)
MgmtIP: 192.0.2.12
MgmtIP: fe80::201:ff:fe00:c00
Capability: Bridge, on
Capability: Router, on
Port:
PortID: ifname swp6
PortDescr: swp6
-------------------------------------------------------------------------------
Interface: swp49s0, via: LLDP, RID: 9, Time: 0 day, 16:55:00
Chassis:
ChassisID: mac 00:01:00:00:0c:00
SysName: TORC-1-2
SysDescr: Cumulus Linux version 3.0.0 running on QEMU Standard PC (i440FX + PIIX, 1996)
MgmtIP: 192.0.2.12
MgmtIP: fe80::201:ff:fe00:c00
Capability: Bridge, on
Capability: Router, on
Port:
PortID: ifname swp5
PortDescr: swp5
-------------------------------------------------------------------------------
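When the neighbor list is long, it can help to reduce it to one record per local interface. The following Python sketch parses plain-text output in the layout shown above; the function name parse_neighbors and the choice of fields are assumptions made for illustration, and the parser has only been reasoned about against the sample output in this section.

```python
def parse_neighbors(text):
    """Return one dict per 'Interface:' block of lldpcli show neighbors output."""
    neighbors = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Interface:"):
            # e.g. "Interface: swp1, via: LLDP, RID: 10, Time: 0 day, 17:08:27"
            name = line.split(":", 1)[1].split(",", 1)[0].strip()
            current = {"interface": name}
            neighbors.append(current)
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            key = key.strip()
            if key in ("SysName", "MgmtIP", "PortID"):
                # Keep only the first value seen (a neighbor can list
                # several MgmtIP lines).
                current.setdefault(key, value.strip())
    return neighbors


if __name__ == "__main__":
    sample = """\
Interface: swp1, via: LLDP, RID: 10, Time: 0 day, 17:08:27
Chassis:
ChassisID: mac 00:01:00:00:09:00
SysName: MSP-1
MgmtIP: 192.0.2.9
MgmtIP: fe80::201:ff:fe00:900
Port:
PortID: ifname swp1
"""
    for entry in parse_neighbors(sample):
        print(entry)
```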
To show a summary of lldpd statistics:
cumulus@switch:~$ sudo lldpcli show statistics summary
---------------------------------------------------------------------
LLDP Global statistics:
---------------------------------------------------------------------
Summary of stats:
Transmitted: 648186
Received: 437557
Discarded: 0
Unrecognized: 0
Ageout: 10
Inserted: 38
Deleted: 10
To show the lldpd running configuration:
cumulus@switch:~$ sudo lldpcli show running-configuration
--------------------------------------------------------------------
Global configuration:
--------------------------------------------------------------------
Configuration:
Transmit delay: 30
Transmit hold: 4
Receive mode: no
Pattern for management addresses: (none)
Interface pattern: (none)
Interface pattern blacklist: (none)
Interface pattern for chassis ID: (none)
Override description with: (none)
Override platform with: Linux
Override system name with: (none)
Advertise version: yes
Update interface descriptions: no
Promiscuous mode on managed interfaces: no
Disable LLDP-MED inventory: yes
LLDP-MED fast start mechanism: yes
LLDP-MED fast start interval: 1
Source MAC for LLDP frames on bond slaves: local
Portid TLV Subtype for lldp frames: ifname
--------------------------------------------------------------------
Runtime Configuration (Advanced)
A runtime configuration does not persist when you reboot the switch;
all changes are lost.
To configure active interfaces:
cumulus@switch:~$ sudo lldpcli configure system interface pattern "swp*"
To configure inactive interfaces:
cumulus@switch:~$ sudo lldpcli configure system interface pattern '*,!eth0,swp*'
The active interface list always overrides the inactive interface list.
To reset any interface list to none:
cumulus@switch:~$ sudo lldpcli configure system interface pattern ""
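To reason about which interfaces a pattern such as *,!eth0,swp* selects, the following Python sketch models the matching in a simplified way: an interface is used when it matches at least one plain term and no term prefixed with an exclamation mark. This is an assumption chosen to agree with the examples in this section (eth0 and swp43 being disabled); it is not lldpd's actual implementation and does not model every precedence rule. Treating an empty pattern as "no restriction" is also an assumption.

```python
from fnmatch import fnmatchcase


def lldp_enabled(interface, pattern):
    """Simplified model of an lldpd interface pattern.

    Plain terms select interfaces, terms starting with '!' exclude them.
    Shell-style wildcards are matched with fnmatchcase.
    """
    terms = [t.strip() for t in pattern.split(",") if t.strip()]
    if not terms:
        # Assumption: an empty pattern places no restriction on interfaces.
        return True
    excluded = any(
        fnmatchcase(interface, t[1:]) for t in terms if t.startswith("!")
    )
    included = any(
        fnmatchcase(interface, t) for t in terms if not t.startswith("!")
    )
    return included and not excluded


if __name__ == "__main__":
    for name in ("eth0", "swp1", "swp43"):
        print(name, lldp_enabled(name, "*,!eth0,swp*"))
```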
VLAN (dot1) TLV
LLDPD in Cumulus Linux is compiled to not share VLAN information with peers. Cumulus Linux 3.7.11 and later provides the VLAN (dot1) TLV runtime option to enable advertisement of VLAN TLVs to LLDP peers.
To enable the VLAN (dot1) TLV option, run the following command:
cumulus@switch:~$ sudo lldpcli configure lldp dot1-tlv
To disable the VLAN (dot1) TLV option, run the lldpcli unconfigure lldp dot1-tlv command. When disabled, the sudo lldpcli show running-configuration command output shows DOT1 TLV advertise: no.
Scale Considerations
The number of VLAN TLVs that an LLDP frame can contain depends on the interface MTU and the number of other organizational TLVs. Because Cumulus Linux does not fragment LLDP frames, if the LLDP frame size (inclusive of all VLAN TLVs) exceeds the MTU, frames are dropped, which leads to an LLDP peering failure.
Use the following as guidance:
With an interface MTU of 1500 bytes, LLDP frames can carry approximately 50 VLAN TLVs.
With an interface MTU of 9216 bytes, LLDP frames can carry approximately 350 VLAN TLVs.
If you enable the VLAN (dot1) TLV option with a high number of VLANs, resulting in LLDP frames that are larger than the MTU, the frames are dropped and the following message is recorded in /var/log/syslog:
2019-12-09T00:23:39.183653+00:00 act-5812-11 lldpd[8585]: Cannot send LLDP packet for swpX, Too big message
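To find which interfaces are affected, the log can be scanned for this message. The following Python sketch is one possible way to do it; the regular expression is derived only from the single sample line above, so it is an assumption that other occurrences follow the same wording. The function name is hypothetical.

```python
import re

# Pattern based on the sample log line in this section.
TOO_BIG = re.compile(
    r"lldpd\[\d+\]: Cannot send LLDP packet for (\S+), Too big message"
)


def oversized_lldp_interfaces(log_lines):
    """Return the set of interfaces that logged a 'Too big message' error."""
    found = set()
    for line in log_lines:
        match = TOO_BIG.search(line)
        if match:
            found.add(match.group(1))
    return found


if __name__ == "__main__":
    # Reading /var/log/syslog usually requires elevated privileges.
    with open("/var/log/syslog", errors="replace") as log:
        for name in sorted(oversized_lldp_interfaces(log)):
            print(name)
```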
Enable the SNMP Subagent in LLDP
LLDP does not enable the SNMP subagent by default. You need to edit /etc/default/lldpd and enable the -x option.
cumulus@switch:~$ sudo nano /etc/default/lldpd
# Add "-x" to DAEMON_ARGS to start SNMP subagent
# Enable CDP by default
DAEMON_ARGS="-c -x"
Change CDP Settings
Cumulus Linux provides support for CDP so that the switch can advertise information about itself to Cisco routers that do not support LLDP. By default, the Cumulus Linux switch sends CDP packets only if the peer sends CDP packets. You can change this setting by replacing -c in the /etc/default/lldpd file with one of the following options:
Option    Description
-cc       The Cumulus Linux switch sends CDPv1 packets even when there is no detected CDP peer.
-ccc      The Cumulus Linux switch sends CDPv2 packets even when there is no detected CDP peer.
-cccc     The Cumulus Linux switch disables CDPv1 and enables CDPv2.
-ccccc    The Cumulus Linux switch disables CDPv1 and forces CDPv2.
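The following example changes the CDP setting to -ccc so that the switch sends CDPv2 packets even when there is no detected CDP peer:
cumulus@switch:~$ sudo nano /etc/default/lldpd
# Add "-x" to DAEMON_ARGS to start SNMP subagent
# Enable CDP by default
DAEMON_ARGS="-ccc"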
You must restart the lldpd service for the changes to take effect.
cumulus@switch:~$ sudo systemctl restart lldpd
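If the DAEMON_ARGS line already carries other options, only the CDP flag should change. The following Python sketch shows one way to rewrite the line while keeping the remaining arguments; the function name set_cdp_option is hypothetical, and the sketch assumes the CDP flag is written as its own argument (it does not handle flags combined into one word, such as -cx).

```python
import re

# CDP flags listed in this section, plus the default -c.
CDP_OPTIONS = {"-c", "-cc", "-ccc", "-cccc", "-ccccc"}


def set_cdp_option(daemon_args_line, option):
    """Replace the CDP flag in a DAEMON_ARGS="..." line, keeping other args."""
    if option not in CDP_OPTIONS:
        raise ValueError(f"unknown CDP option: {option}")
    match = re.match(r'^DAEMON_ARGS="(.*)"\s*$', daemon_args_line)
    if not match:
        raise ValueError("not a DAEMON_ARGS line")
    args = match.group(1).split()
    # Drop any existing CDP flag; keep everything else in its original order.
    others = [arg for arg in args if arg not in CDP_OPTIONS]
    return 'DAEMON_ARGS="{}"'.format(" ".join([option] + others))


if __name__ == "__main__":
    print(set_cdp_option('DAEMON_ARGS="-c -x"', "-ccc"))
```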
Caveats and Errata
Annex E (and hence Annex D) of IEEE 802.1AB (LLDP) is not supported.
If you configure both an eth0 IP address and a loopback IP address on the switch, LLDP advertises the loopback IP address as the management IP address. In this case, the Cumulus Linux switch behaves more like a typical Linux host than a networking appliance.
Linux bonding provides a method for aggregating multiple network
interfaces (slaves) into a single logical bonded interface (bond).
Cumulus Linux supports two bonding modes:
IEEE 802.3ad link aggregation mode, which allows one or more links to be aggregated together to form a link aggregation group (LAG), so that a media access control (MAC) client can treat the link aggregation group as if it were a single link. IEEE 802.3ad link aggregation is the default mode.
Balance-xor mode, where the bonding of slave interfaces is static and all slave interfaces are active for load balancing and fault tolerance purposes. This is useful for MLAG deployments.
The benefits of link aggregation include:
Linear scaling of bandwidth as links are added to the LAG
Load balancing
Failover protection
Cumulus Linux uses version 1 of the Link Aggregation Control Protocol (LACP).
To temporarily bring up a bond even when there is no LACP partner, use
LACP Bypass.
Hash Distribution
Egress traffic through a bond is distributed to a slave based on a
packet hash calculation, providing load balancing over the slaves; many
conversation flows are distributed over all available slaves to load
balance the total traffic. Traffic for a single conversation flow always
hashes to the same slave.
The hash calculation uses packet header data to choose the slave on
which to transmit the packet:
For IP traffic, IP header source and destination fields are used in the calculation.
For IP + TCP/UDP traffic, source and destination ports are included in the hash calculation.
In a failover event, the hash calculation is adjusted to steer traffic
over available slaves.
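The behavior described above (one flow always maps to one slave, and flows are re-spread over the remaining slaves after a failure) can be illustrated with a toy hash. The following Python sketch is not the algorithm used by the switch hardware or the Linux bonding driver; the XOR-and-modulo scheme and the function name pick_slave are assumptions chosen only to show the idea.

```python
import ipaddress


def pick_slave(slaves, src_ip, dst_ip, src_port=None, dst_port=None):
    """Choose a slave for a flow from a list of currently available slaves.

    Illustrative only: XOR the header fields and take the result modulo
    the number of available slaves.
    """
    if not slaves:
        raise ValueError("no available slaves")
    value = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    if src_port is not None and dst_port is not None:
        # TCP/UDP traffic: include the layer 4 ports in the calculation.
        value ^= src_port ^ dst_port
    return slaves[value % len(slaves)]


if __name__ == "__main__":
    flow = ("10.0.0.1", "10.0.0.2", 1000, 80)
    print(pick_slave(["swp1", "swp2", "swp3", "swp4"], *flow))
    # After a slave fails, the same flow is steered to a remaining slave.
    print(pick_slave(["swp1", "swp2", "swp3"], *flow))
```

Because the result depends only on the packet header fields and the list of available slaves, every packet of a given conversation lands on the same slave until the list changes.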
LAG Custom Hashing
LAG custom hashing is supported on Mellanox switches.
In Cumulus Linux 3.7.11 and later, you can configure which fields are used in the LAG hash calculation. For example, if you do not want to use source or destination port numbers in the hash calculation, you can disable the source port and destination port fields.
You can configure the following fields:
Source MAC
Destination MAC
Source IP
Destination IP
Ether type
VLAN ID
Source port
Destination port
Layer 3 protocol
To configure custom hashing, edit the /usr/lib/python2.7/dist-packages/cumulus/__chip_config/mlx/datapath.conf file:
To enable custom hashing, uncomment the lag_hash_config.enable = true line.
To enable a field, set the field to true. To disable a field, set the field to false.
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
The following shows an example datapath.conf file:
cumulus@switch:~$ sudo nano /usr/lib/python2.7/dist-packages/cumulus/__chip_config/mlx/datapath.conf
...
#LAG HASH config
#HASH config for LACP to enable custom fields
#Fields will be applicable for LAG hash
#calculation
#Uncomment to enable custom fields configured below
lag_hash_config.enable = true
lag_hash_config.smac = true
lag_hash_config.dmac = true
lag_hash_config.sip = true
lag_hash_config.dip = true
lag_hash_config.ether_type = true
lag_hash_config.vlan_id = true
lag_hash_config.sport = false
lag_hash_config.dport = false
lag_hash_config.ip_prot = true
...
Symmetric hashing is enabled by default on Mellanox switches running Cumulus Linux 3.7.11 and later. Make sure that the settings for the source IP (lag_hash_config.sip) and destination IP (lag_hash_config.dip) fields match, and that the settings for the source port (lag_hash_config.sport) and destination port (lag_hash_config.dport) fields match; otherwise symmetric hashing is disabled automatically. You can disable symmetric hashing manually in the /etc/cumulus/datapath/traffic.conf file by setting symmetric_hash_enable = FALSE.
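The matching rule for symmetric hashing can be checked before you apply a configuration. The following Python sketch parses lag_hash_config lines in the format shown in the example datapath.conf file and reports whether the source and destination settings match; the function names are hypothetical, and the check covers only the two conditions stated above.

```python
def parse_lag_hash_config(text):
    """Return {field: bool} for uncommented lag_hash_config.* lines."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        prefix = "lag_hash_config."
        if key.startswith(prefix):
            config[key[len(prefix):]] = value.lower() == "true"
    return config


def symmetric_hash_possible(config):
    """True when sip matches dip and sport matches dport."""
    return (
        config.get("sip") == config.get("dip")
        and config.get("sport") == config.get("dport")
    )


if __name__ == "__main__":
    example = """\
lag_hash_config.enable = true
lag_hash_config.sip = true
lag_hash_config.dip = true
lag_hash_config.sport = false
lag_hash_config.dport = false
"""
    print(symmetric_hash_possible(parse_lag_hash_config(example)))
```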
You can create and configure a bond with the Network Command Line Utility
(NCLU).
Follow the steps below to create a new bond:
SSH into the switch.
Add a bond using the net add bond command, replacing [bond-name]
with the name of the bond, and [slaves] with the list of slaves:
cumulus@switch:~$ net add bond [bond-name] bond slaves [slaves]
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
The bond is configured by default in IEEE 802.3ad link aggregation mode. To configure the bond in balance-xor mode, see bond mode below.
The name of the bond must be compliant with Linux interface naming conventions and unique within the switch.
Do not use a dash (-) in the bond name.
Configuration Options
The configuration options and their default values are listed in the table below.
Each bond configuration option, except for bond slaves, is set to the
recommended value by default in Cumulus Linux. Only configure an option
if a different setting is needed. For more information on configuration
values, refer to the Related Information section below.
NCLU Configuration Option
Description
Default Value
bond mode
The bonding mode. Cumulus Linux supports IEEE 802.3ad link aggregation mode and balance-xor mode. IEEE 802.3ad link aggregation is the default mode.
You can change the bond mode using NCLU. The following example changes bond1 to balance-xor mode.
Note: Use balance-xor mode only if you cannot use LACP. See below for more information.
cumulus@switch:~$ net add bond bond1 bond mode balance-xor
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
The following example changes bond1 to IEEE 802.3ad link aggregation mode:
cumulus@switch:~$ net add bond bond1 bond mode 802.3ad
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
802.3ad
bond slaves
The list of slaves in the bond.
N/A
bond miimon
Defines how often the link state of each slave is inspected for failures.
100
bond use-carrier
Determines the link state.
1
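bond xmit-hash-policy
The hash method used to select the slave for a given packet.
layer3+4
bond lacp-rate
The rate at which the link partner is asked to transmit LACP control packets. The following example sets the LACP rate to slow:
cumulus@switch:~$ net add bond bond01 bond lacp-rate slow
1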
bond min-links
Defines the minimum number of links that must be active before the bond is put into service.
A value greater than 1 is useful if higher level services need to ensure a minimum aggregate bandwidth level before activating a bond. Keeping bond-min-links set to 1 indicates the bond must have at least one active member. If the number of active members drops below the bond-min-links setting, the bond will appear to upper-level protocols as link-down. When the number of active links returns to greater than or equal to bond-min-links, the bond becomes link-up.
1
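The bond min-links rule described in the table reduces to a single comparison. The following Python sketch restates it; the function name bond_link_state is hypothetical, and the sketch models only the up/down decision, not how the bonding driver counts active members.

```python
def bond_link_state(active_members, min_links=1):
    """Return 'up' when enough members are active, otherwise 'down'."""
    return "up" if active_members >= min_links else "down"


if __name__ == "__main__":
    # With min-links set to 2, losing one of two members takes the bond down.
    print(bond_link_state(2, min_links=2))
    print(bond_link_state(1, min_links=2))
```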
Enable balance-xor Mode
When you enable balance-xor mode, the bonding of slave interfaces is
static and all slave interfaces are active for load balancing and fault
tolerance purposes. Packet transmission on the bond is based on the hash
policy specified by xmit-hash-policy.
When using balance-xor mode to dual-connect host-facing bonds in an
MLAG
environment, you must configure the clag-id parameter on the MLAG
bonds and it must be the same on both MLAG switches. Otherwise, the
bonds are treated by the MLAG switch pair as single-connected.
Use balance-xor mode only if you cannot use LACP; LACP can detect
mismatched link attributes between bond members and can even detect
misconnections.
To change the mode of an existing bond to balance-xor, run the net add bond <bond-name> bond mode balance-xor command. The following example
commands change bond1 to balance-xor mode:
cumulus@switch:~$ net add bond bond1 bond mode balance-xor
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
To create a new bond and configure the bond to use balance-xor mode,
create the bond, then configure the bond mode. The following example
commands create a bond called bond1 and configure bond mode to be
balance-xor:
cumulus@switch:~$ net add bond bond1 bond slaves swp3,4
cumulus@switch:~$ net add bond bond1 bond mode balance-xor
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following configuration in the
/etc/network/interfaces file:
auto bond1
iface bond1
bond-mode balance-xor
bond-slaves swp3 swp4
To view the bond, run the net show interface <bond-name> command:
cumulus@switch:~$ net show interface bond1
Name MAC Speed MTU Mode
-- ------ ----------------- ------- ----- ------
UP bond1 00:02:00:00:00:12 20G 1500 Bond
Bond Details
--------------- -------------
Bond Mode: Balance-XOR
Load Balancing: Layer3+4
Minimum Links: 1
In CLAG: CLAG Inactive
Port Speed TX RX Err Link Failures
-- ------- ------- ---- ---- ----- ---------------
UP swp3(P) 10G 0 0 0 0
UP swp4(P) 10G 0 0 0 0
LLDP
------- ---- ------------
swp3(P) ==== swp1(p1c1h1)
swp4(P) ==== swp2(p1c1h1)
Routing
-------
Interface bond1 is up, line protocol is up
Link ups: 3 last: 2017/04/26 21:00:38.26
Link downs: 2 last: 2017/04/26 20:59:56.78
PTM status: disabled
vrf: Default-IP-Routing-Table
index 31 metric 0 mtu 1500
flags: <UP,BROADCAST,RUNNING,MULTICAST>
Type: Ethernet
HWaddr: 00:02:00:00:00:12
inet6 fe80::202:ff:fe00:12/64
Interface Type Other
Example Configuration: Bonding 4 Slaves
In the following example, the front panel port interfaces swp1 through swp4
are slaves in bond0, while swp5 and swp6 are not part of bond0.
Example Bond Configuration
The following commands create a bond with four slaves:
cumulus@switch:~$ net add bond bond0 address 10.0.0.1/30
cumulus@switch:~$ net add bond bond0 bond slaves swp1-4
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create this code snippet in the /etc/network/interfaces
file:
If the bond is going to become part of a bridge, you do not need to
specify an IP address.
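NCLU accepts port ranges such as swp1-4 and swp3,4, while the /etc/network/interfaces file lists each slave by name. The following Python sketch expands such a range and renders a stanza in the style of the bond1 example earlier in this section. It is an approximation for illustration: the function names are hypothetical, and the attributes and ordering that NCLU actually writes might differ.

```python
import re


def expand_ports(spec):
    """Expand 'swp1-4' or 'swp3,4' into a list of interface names."""
    ports = []
    prefix = ""
    for item in spec.split(","):
        match = re.fullmatch(r"([A-Za-z]*)(\d+)(?:-(\d+))?", item.strip())
        if not match:
            raise ValueError(f"cannot parse port spec: {item!r}")
        name, start, end = match.groups()
        # 'swp3,4' reuses the prefix from the previous item.
        prefix = name or prefix
        first, last = int(start), int(end or start)
        ports.extend(f"{prefix}{number}" for number in range(first, last + 1))
    return ports


def bond_stanza(name, slaves, address=None, mode=None):
    """Render an interfaces-style stanza for a bond (illustrative layout)."""
    lines = [f"auto {name}", f"iface {name}"]
    if address:
        lines.append(f"    address {address}")
    if mode:
        lines.append(f"    bond-mode {mode}")
    lines.append("    bond-slaves " + " ".join(expand_ports(slaves)))
    return "\n".join(lines)


if __name__ == "__main__":
    print(bond_stanza("bond0", "swp1-4", address="10.0.0.1/30"))
```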
When networking is started on the switch, bond0 is created as MASTER and
interfaces swp1 through swp4 come up in SLAVE mode, as seen in the ip link show command:
cumulus@switch:~$ ip link show
...
3: swp1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT qlen 500
link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
4: swp2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT qlen 500
link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
5: swp3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT qlen 500
link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
6: swp4: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT qlen 500
link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
...
55: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT
link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
All slave interfaces within a bond have the same MAC address as the bond. Typically, the first slave you add to the bond donates its MAC address as the bond MAC address, and the MAC addresses of the other slaves are set to the bond MAC address. The bond MAC address is the source MAC address for all traffic leaving the bond and provides a single destination MAC address for traffic sent to the bond.
Removing a bond slave interface from which a bond derives its MAC address affects traffic when the bond interface flaps to update the MAC address.
Caveats and Errata
An interface cannot belong to multiple bonds.
A bond can have subinterfaces, but subinterfaces cannot have a bond.
A bond cannot enslave VLAN subinterfaces.
Set all slave ports within a bond to the same speed/duplex and make sure they match the link partner’s slave ports.
The detailed output in /proc/net/bonding/<filename> includes the actor/partner LACP information. This information is not necessary and requires you to use sudo to view the file.
On a Cumulus RMP switch, if you create a bond with multiple 10G member ports, traffic gets dropped when the bond uses members of the same unit listed in the /var/lib/cumulus/porttab file. For example, traffic gets dropped if both swp49 and swp52 are in the bond because they both are in the xe0 unit (or if both swp50 and swp51 are in the same bond because they are both in xe1):
swp49 xe0 0 0 -1 0
swp50 xe1 0 0 -1 0
swp51 xe1 1 0 -1 0
swp52 xe0 1 0 -1 0
Single port member bonds, bonds with different units (xe0 or xe1, as above), or layer 3 bonds do not have this issue.
On Cumulus RMP switches, which are built with two Hurricane2 ASICs, you cannot form an LACP bond on links that terminate on different Hurricane2 ASICs.
Ethernet bridges provide a means for hosts to communicate through layer
2, by connecting all of the physical and logical interfaces in the
system into a single layer 2 domain. The bridge is a logical interface
with a MAC address and an
MTU
(maximum transmission unit). The bridge MTU is the minimum MTU among all
its members. By default, the bridge’s MAC address
is the MAC address of the first port in the bridge-ports list. The
bridge can also be assigned an IP address, as discussed
below.
Bridge members can be individual physical interfaces, bonds or logical
interfaces that traverse an 802.1Q VLAN trunk.
Use VLAN-aware mode bridges,
rather than traditional mode bridges. The bridge driver in Cumulus Linux is
capable of VLAN filtering, which allows for configurations that are similar to
incumbent network devices. While Cumulus Linux supports Ethernet bridges in
traditional mode, it’s best to use VLAN-aware mode.
Cumulus Linux does not put all ports into a bridge by default.
You can configure both VLAN-aware and traditional mode bridges on the
same network in Cumulus Linux; however you cannot have more than one
VLAN-aware bridge on a given switch.
Create a VLAN-aware Bridge
To learn about VLAN-aware bridges and how to configure them, read VLAN-aware Bridge Mode.
The MAC address for a frame is learned when the frame enters the bridge
via an interface. The MAC address is recorded in the bridge table, and
the bridge forwards the frame to its intended destination by looking up
the destination MAC address. The MAC entry is then maintained for a
period of time defined by the bridge-ageing configuration option. If
the frame is seen with the same source MAC address before the MAC entry
age is exceeded, the MAC entry age is refreshed; if the MAC entry age is
exceeded, the MAC address is deleted from the bridge table.
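The learning and ageing behavior described above can be modeled in a few lines. The following Python sketch is a conceptual illustration only, not how the bridge driver is implemented; the class name MacTable is hypothetical, and the current time is passed in explicitly so that the behavior is easy to follow.

```python
class MacTable:
    """Toy model of a bridge MAC table with ageing."""

    def __init__(self, ageing=1800):
        self.ageing = ageing      # seconds an entry may stay unrefreshed
        self.entries = {}         # MAC address -> (port, time last seen)

    def learn(self, mac, port, now):
        """Record or refresh the entry for a frame's source MAC address."""
        self.entries[mac] = (port, now)

    def expire(self, now):
        """Delete entries whose age exceeds the ageing time."""
        stale = [
            mac for mac, (_, seen) in self.entries.items()
            if now - seen > self.ageing
        ]
        for mac in stale:
            del self.entries[mac]

    def lookup(self, mac, now):
        """Return the port for a destination MAC address, or None if unknown."""
        self.expire(now)
        entry = self.entries.get(mac)
        return entry[0] if entry else None


if __name__ == "__main__":
    table = MacTable(ageing=600)
    table.learn("44:38:39:00:00:03", "swp1", now=0)
    print(table.lookup("44:38:39:00:00:03", now=300))
    print(table.lookup("44:38:39:00:00:03", now=700))
```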
The following example output shows a MAC address table for the bridge:
cumulus@switch:~$ net show bridge macs
VLAN Master Interface MAC TunnelDest State Flags LastSeen
-------- -------- ----------- ----------------- ------------ --------- ------- -----------------
untagged bridge swp1 44:38:39:00:00:03 00:00:15
untagged bridge swp1 44:38:39:00:00:04 permanent 20 days, 01:14:03
MAC Address Ageing
By default, Cumulus Linux stores MAC addresses in the Ethernet switching
table for 1800 seconds (30 minutes). You can change this setting using NCLU.
For example, to change the setting to 600 seconds, run:
cumulus@switch:~$ net add bridge bridge ageing 600
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following configuration in the /etc/network/interfaces file:
Bridges can be included as part of a routing topology after being
assigned an IP address. This enables hosts within the bridge to
communicate with other hosts outside of the bridge, via a switch VLAN
interface (SVI), which provides layer 3 routing. The IP address of the
bridge is typically from the same subnet as the bridge’s member hosts.
When an interface is added to a bridge, it ceases to function as a
router interface, and the IP address on the interface, if any, becomes
unreachable.
cumulus@switch:~$ net add bridge bridge ports swp1-2
cumulus@switch:~$ net add vlan 10 ip address 10.100.100.1/24
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following SVI configuration in the /etc/network/interfaces file:
Notice the vlan-raw-device keyword, which NCLU includes automatically.
NCLU uses this keyword to associate the SVI with the VLAN-aware bridge.
Alternatively, you can use the bridge.VLAN-ID naming convention for the
SVI. You can manually create the following example configuration in the
/etc/network/interfaces file; it functions identically to the above
configuration:
auto bridge
iface bridge
bridge-ports swp1 swp2
bridge-vids 10
bridge-vlan-aware yes
auto bridge.10
iface bridge.10
address 10.100.100.1/24
When a switch is initially configured, all southbound bridge ports may
be down, which means that, by default, the SVI is also down. However,
you may want to force the SVI to always be up, to perform connectivity
testing, for example. To do this, you essentially need to disable
interface state tracking, leaving the SVI in the UP state always, even
if all member ports are down. Other implementations describe this
feature as no autostate.
In Cumulus Linux, you can keep the SVI perpetually UP by creating a
dummy interface, and making the dummy interface a member of the bridge.
Consider the following configuration, without a dummy interface in the
bridge:
With this configuration, when swp3 is down, the SVI is also down:
cumulus@switch:~$ ip link show swp3
5: swp3: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master bridge state DOWN mode DEFAULT group default qlen 1000
link/ether 2c:60:0c:66:b1:7f brd ff:ff:ff:ff:ff:ff
cumulus@switch:~$ ip link show bridge
35: bridge: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 2c:60:0c:66:b1:7f brd ff:ff:ff:ff:ff:ff
Now add the dummy interface to your network configuration:
Create a dummy interface, and add it to the bridge configuration.
You do this by editing the /etc/network/interfaces file and adding
the dummy interface stanza before the bridge stanza:
cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto dummy
iface dummy
link-type dummy
auto bridge
iface bridge
...
Continue editing the interfaces file. Add the dummy interface to
the bridge-ports line in the bridge configuration:
Save and exit the file, then reload the configuration:
cumulus@switch:~$ sudo ifreload -a
Now, even when swp3 is down, both the dummy interface and the bridge remain up:
cumulus@switch:~$ ip link show swp3
5: swp3: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master bridge state DOWN mode DEFAULT group default qlen 1000
link/ether 2c:60:0c:66:b1:7f brd ff:ff:ff:ff:ff:ff
cumulus@switch:~$ ip link show dummy
37: dummy: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master bridge state UNKNOWN mode DEFAULT group default
link/ether 66:dc:92:d4:f3:68 brd ff:ff:ff:ff:ff:ff
cumulus@switch:~$ ip link show bridge
35: bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 2c:60:0c:66:b1:7f brd ff:ff:ff:ff:ff:ff
IPv6 Link-local Address Generation
By default, Cumulus Linux automatically generates IPv6
link-local addresses on VLAN
interfaces. If you want to use a different mechanism to assign
link-local addresses, you should disable this feature. You can disable
link-local automatic address generation for both regular IPv6 addresses
and address-virtual (macvlan) addresses.
To disable automatic address generation for a regular IPv6 address on VLAN 100, run:
cumulus@switch:~$ net add vlan 100 ipv6-addrgen off
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following configuration in the /etc/network/interfaces file:
cumulus@switch:~$ cat /etc/network/interfaces
...
auto vlan100
iface vlan100
ipv6-addrgen off
vlan-id 100
vlan-raw-device bridge
...
To disable automatic address generation for a virtual IPv6 address on VLAN 100, run:
cumulus@switch:~$ net add vlan 100 address-virtual-ipv6-addrgen off
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following configuration in the /etc/network/interfaces file:
cumulus@switch:~$ cat /etc/network/interfaces
...
auto vlan100
iface vlan100
address-virtual-ipv6-addrgen off
vlan-id 100
vlan-raw-device bridge
...
To reenable automatic link-local address generation, run:
cumulus@switch:~$ net del vlan 100 ipv6-addrgen off
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
or
cumulus@switch:~$ net del vlan 100 address-virtual-ipv6-addrgen off
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
This removes the relevant configuration from the interfaces file.
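As a rough illustration of what automatic generation produces, the sketch below derives a link-local address from a MAC address. It assumes the common modified EUI-64 scheme (RFC 4291), which is the usual Linux default; the function name is hypothetical and the MAC address is the bridge MAC from the earlier ip link output, used only as sample input.

```python
import ipaddress


def eui64_link_local(mac):
    """Derive the modified EUI-64 IPv6 link-local address for a 48-bit MAC."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError(f"expected a 48-bit MAC address, got {mac!r}")
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe between the two halves of the MAC to form the interface ID.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    prefix = bytes.fromhex("fe80") + bytes(6)  # fe80::/64
    return str(ipaddress.IPv6Address(prefix + interface_id))


print(eui64_link_local("2c:60:0c:66:b1:7f"))  # fe80::2e60:cff:fe66:b17f
```

With ipv6-addrgen off, no such address is generated for the VLAN interface, so you must assign a link-local address by another mechanism if you need one.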
Understanding bridge fdb Output
The bridge fdb command in Linux interacts with the forwarding database
table, which the bridge uses to store MAC addresses it has learned and
on which ports it learned those MAC addresses. The bridge fdb show
command output contains some specific keywords that require further
explanation:
self: the Linux kernel FDB entry flag indicating the FDB entry belongs to the FDB on the device named by the dev keyword. For example, this FDB entry belongs to the VXLAN device vx-1000: 00:02:00:00:00:08 dev vx-1000 dst 27.0.0.10 self
master: the Linux kernel FDB entry flag indicating the FDB entry belongs to the FDB on the device’s master, and the FDB entry is pointing to a master’s port. For example, this FDB entry is from the master device named bridge and is pointing to the VXLAN bridge port vx-1001: 02:02:00:00:00:08 dev vx-1001 vlan 1001 master bridge
offload: the Linux kernel FDB entry flag indicating the FDB entry is managed (or offloaded) by an external control plane, such as the BGP control plane for EVPN.
Consider the following output of the bridge fdb show command:
cumulus@switch:~$ bridge fdb show | grep 02:02:00:00:00:08
02:02:00:00:00:08 dev vx-1001 vlan 1001 offload master bridge
02:02:00:00:00:08 dev vx-1001 dst 27.0.0.10 self offload
Some things you should note about the output:
02:02:00:00:00:08 is the MAC address learned via BGP EVPN.
The first FDB entry points to a Linux bridge entry pointing to the VXLAN device vx-1001.
The second FDB entry points to the same entry on the VXLAN device with additional remote destination information.
The VXLAN FDB augments the bridge FDB with additional remote destination information.
All FDB entries pointing to a VXLAN port appear as two such entries with the second entry augmenting the remote destination information.
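The pairing of bridge and VXLAN entries can be sketched in a few lines of Python. This is a minimal illustration, not a complete parser: it only understands the dev, vlan, dst, and master keywords shown above and treats every other token as a flag. The function names are hypothetical.

```python
def parse_fdb_line(line):
    """Split one `bridge fdb show` line into keyword values and bare flags."""
    tokens = line.split()
    entry = {"mac": tokens[0], "flags": set()}
    keywords = {"dev", "vlan", "dst", "master"}  # each is followed by a value
    i = 1
    while i < len(tokens):
        if tokens[i] in keywords and i + 1 < len(tokens):
            entry[tokens[i]] = tokens[i + 1]
            i += 2
        else:
            entry["flags"].add(tokens[i])  # for example: self, offload
            i += 1
    return entry


def vxlan_remote_destinations(lines):
    """Map (mac, dev) to the remote destination carried by the `self` entry."""
    remotes = {}
    for line in lines:
        if not line.strip():
            continue
        entry = parse_fdb_line(line)
        if "self" in entry["flags"] and "dst" in entry:
            remotes[(entry["mac"], entry["dev"])] = entry["dst"]
    return remotes


output = [
    "02:02:00:00:00:08 dev vx-1001 vlan 1001 offload master bridge",
    "02:02:00:00:00:08 dev vx-1001 dst 27.0.0.10 self offload",
]
print(vxlan_remote_destinations(output))
```

The first line parses as a bridge entry (master is bridge, flag offload); the second supplies the remote destination 27.0.0.10 for the same MAC address and device.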
Caveats and Errata
A bridge cannot contain multiple subinterfaces of the same port. Attempting this configuration results in an error.
In environments where both VLAN-aware and traditional bridges are in use, if a traditional bridge has a subinterface of a bond that is a normal interface in a VLAN-aware bridge, the bridge is flapped when the traditional bridge’s bond subinterface is brought down.
You cannot enslave a VLAN raw device to a different master interface (that is, you cannot edit the vlan-raw-device setting in the /etc/network/interfaces file). You need to delete the VLAN and create it again.
Cumulus Linux supports up to 2000 VLANs. This includes the internal interfaces, bridge interfaces, logical interfaces, and so on.
In Cumulus Linux, MAC learning is enabled by default on traditional or VLAN-aware bridge interfaces. Do not disable MAC learning unless you are using EVPN. See Ethernet Virtual Private Network - EVPN.
Multi-Chassis Link Aggregation (MLAG) enables a server or switch with a two-port bond, such as a link aggregation group/LAG, EtherChannel, port group or trunk, to connect those ports to different switches and operate as if they are connected to a single, logical switch. This provides greater redundancy and greater system throughput.
MLAG or CLAG? The Cumulus Linux implementation of MLAG is referred to by other vendors as CLAG, MC-LAG or VPC. You will even see references to CLAG in Cumulus Linux, including the management daemon, named clagd, and other options in the code, such as clag-id, which exist for historical purposes. The Cumulus Linux implementation is truly a multi-chassis link aggregation protocol, so we call it MLAG.
Dual-connected devices can create LACP bonds that contain links to each physical switch. Therefore, active-active links from the dual-connected devices are supported even though they are connected to two different physical switches.
A basic setup looks like this:
You can see an example of how to set up this configuration by running net example clag basic-clag.
The two switches, S1 and S2, known as peer switches, cooperate so that they appear as a single device to host H1’s bond. H1 distributes traffic between the two links to S1 and S2 in any way that you configure on the host. Similarly, traffic inbound to H1 can traverse S1 or S2 and arrive at H1.
MLAG Requirements
MLAG has these requirements:
There must be a direct connection between the two peer switches implementing MLAG (S1 and S2). This is typically a bond for increased reliability and bandwidth.
There must be only two peer switches in one MLAG configuration, but you can have multiple configurations in a network for switch-to-switch MLAG (see below).
You must specify a unique clag-id for every dual-connected bond on each peer switch; the value must be between 1 and 65535 and must be the same on both peer switches in order for the bond to be considered dual-connected.
The dual-connected devices (servers or switches) can use LACP (IEEE 802.3ad or 802.1ax) to form the bond. In this case, the peer switches must also use LACP.
Both switches in the MLAG pair must be identical; they must both be the same model of switch and run the same Cumulus Linux release.
Cumulus Linux does not support MLAG with 802.1X; the switch cannot synchronize 802.1X authenticated MAC addresses over the peerlink.
If for some reason you cannot use LACP, you can also use balance-xor mode to dual-connect host-facing bonds in an MLAG environment. If you do, you must still configure the same clag-id parameter on the MLAG bonds, and it must be the same on both MLAG switches. Otherwise, the MLAG switch pair treats the bonds as if they are single-connected.
More elaborate configurations are also possible. The number of links between the host and the switches can be greater than two, and does not have to be symmetrical:
Additionally, because S1 and S2 appear as a single switch to other bonding devices, you can also connect pairs of MLAG switches to each other in a switch-to-switch MLAG setup:
In this case, L1 and L2 are also MLAG peer switches, and present a two-port bond from a single logical system to S1 and S2. S1 and S2 do the same as far as L1 and L2 are concerned. For a switch-to-switch MLAG configuration, each switch pair must have a unique system MAC address. In the above example, switches L1 and L2 each have the same system MAC address configured. Switch pair S1 and S2 each have the same system MAC address configured; however, it is a different system MAC address than the one used by the switch pair L1 and L2.
LACP and Dual-Connectedness
For MLAG to operate correctly, the peer switches must know which links are dual-connected or are connected to the same host or switch. To do this, specify a clag-id for every dual-connected bond on each peer switch; the clag-id must be the same for the corresponding bonds on both peer switches. Typically, Link Aggregation Control Protocol (LACP), the IEEE standard protocol for managing bonds, is used for verifying dual-connectedness. LACP runs on the dual-connected device and on each of the peer switches. On the dual-connected device, the only configuration requirement is to create a bond that is managed by LACP.
However, if for some reason you cannot use LACP in your environment, you can configure the bonds in balance-xor mode. When using balance-xor mode to dual-connect host-facing bonds in an MLAG environment, you must configure the clag-id parameter on the MLAG bonds, which must be the same on both MLAG switches. Otherwise, the bonds are treated by the MLAG switch pair as if they are single-connected. In short, dual-connectedness is solely determined by matching clag-id and any misconnection will not be detected.
On each of the peer switches, you must place the links that are connected to the dual-connected host or switch in the bond. This is true even if the links are a single port on each peer switch, where each port is placed into a bond, as shown below:
All of the dual-connected bonds on the peer switches have their system ID set to the MLAG system ID. Therefore, from the point of view of the hosts, each of the links in its bond is connected to the same system, and so the host uses both links.
Each peer switch periodically makes a list of the LACP partner MAC addresses for all of its bonds and sends that list to its peer (using the clagd service; see below). The LACP partner MAC address is the MAC address of the system at the other end of a bond (hosts H1, H2, and H3 in the figure above). When a switch receives this list from its peer, it compares the list to the LACP partner MAC addresses on its own bonds. If any matches are found and the clag-id for those bonds match, then that bond is a dual-connected bond. You can also find the LACP partner MAC address by running the net show bridge macs command or by examining the /sys/class/net/<bondname>/bonding/ad_partner_mac sysfs file for each bond.
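The comparison described above can be sketched as follows. This is only an illustration of the matching rule (same LACP partner MAC address and same clag-id on both peers), not how clagd is implemented; the function name, data layout, and partner MAC addresses are hypothetical.

```python
def dual_connected_bonds(local_bonds, peer_bonds):
    """Return the local bonds that the peer also reports.

    Both arguments map a bond name to a (clag_id, lacp_partner_mac) tuple.
    A bond counts as dual-connected only when the peer has a bond with the
    same partner MAC address and the same clag-id.
    """
    peer_pairs = {(clag_id, mac.lower()) for clag_id, mac in peer_bonds.values()}
    return sorted(
        name
        for name, (clag_id, mac) in local_bonds.items()
        if (clag_id, mac.lower()) in peer_pairs
    )


# Hypothetical data: server2 has mismatched clag-ids on the two peers.
local = {"server1": (1, "00:03:00:11:11:01"), "server2": (2, "00:03:00:22:22:01")}
peer = {"server1": (1, "00:03:00:11:11:01"), "server2": (3, "00:03:00:22:22:01")}
print(dual_connected_bonds(local, peer))  # ['server1']
```

In this example server2 is treated as single-connected even though both switches see the same partner, because the clag-id values differ.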
Configure MLAG
To configure MLAG, you need to:
Create a bond that uses LACP, on the dual-connected devices.
Configure the interfaces, including bonds, VLANs, bridges and peer links, on each peer switch.
MLAG synchronizes the dynamic state between the two peer switches but it does not synchronize the switch configurations. After modifying the configuration of one peer switch, you must make the same changes to the configuration on the other peer switch. This applies to all configuration changes, including:
Port configuration; for example, VLAN membership, MTU, and bonding parameters.
Bridge configuration; for example, spanning tree parameters or bridge properties.
Static address entries; for example, static FDB entries and static IGMP entries.
QoS configuration; for example, ACL entries.
You can verify the configuration of VLAN membership with the net show clag verify-vlans verbose command.
To prevent MAC address conflicts with other interfaces in the same bridged network, Cumulus Linux has a reserved range of MAC addresses specifically to use with MLAG. This range of MAC addresses is 44:38:39:ff:00:00 to 44:38:39:ff:ff:ff. Use this range of MAC addresses when configuring MLAG.
You cannot use the same MAC address for different MLAG pairs. Make sure you specify a different clag sys-mac setting for each MLAG pair in the network.
You cannot use multicast MAC addresses as the clagd-sys-mac.
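The MAC address rules above lend themselves to a quick pre-deployment check. The sketch below is a hypothetical helper, not part of Cumulus Linux: it flags a clagd-sys-mac that is multicast or outside the reserved range, and reports MLAG pairs that share an address.

```python
RESERVED_FIRST = 0x443839FF0000  # 44:38:39:ff:00:00
RESERVED_LAST = 0x443839FFFFFF   # 44:38:39:ff:ff:ff


def mac_to_int(mac):
    """Convert a colon-separated MAC address to an integer."""
    parts = mac.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected six octets, got {mac!r}")
    return int("".join(part.zfill(2) for part in parts), 16)


def check_sys_mac(mac):
    """Return a list of problems found with a candidate clagd-sys-mac."""
    problems = []
    value = mac_to_int(mac)
    if (value >> 40) & 0x01:  # lowest bit of the first octet marks multicast
        problems.append("multicast address")
    if not RESERVED_FIRST <= value <= RESERVED_LAST:
        problems.append("outside the reserved MLAG range")
    return problems


def pairs_sharing_sys_mac(pairs):
    """pairs maps an MLAG pair name to its clagd-sys-mac; return the clashes."""
    by_mac = {}
    for pair_name, mac in pairs.items():
        by_mac.setdefault(mac_to_int(mac), []).append(pair_name)
    return [sorted(names) for names in by_mac.values() if len(names) > 1]


print(check_sys_mac("44:38:39:FF:40:94"))  # []
print(check_sys_mac("01:00:5e:00:00:01"))
print(pairs_sharing_sys_mac({"leaf01-02": "44:38:39:FF:00:01",
                             "leaf03-04": "44:38:39:ff:00:01"}))
```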
If you configure MLAG with NCLU commands, Cumulus Linux does not check against a possible collision with VLANs outside the default reserved range when creating the peer link interfaces, in case the reserved VLAN range has been modified.
Configure the Host or Switch
On your dual-connected device, create a bond that uses LACP. The method you use varies with the type of device you are configuring. The following image is a basic MLAG configuration, showing all the essential elements; a more detailed two-leaf/two-spine configuration is shown below.
Configure the Interfaces
Place every interface that connects to the MLAG pair from a dual-connected device into a bond, even if the bond contains only a single link on a single physical switch (even though the MLAG pair contains two or more links). Layer 2 data travels over this bond. In the examples throughout this chapter, peerlink is the name of the bond.
Single-attached hosts, also known as orphan ports, can be just a member of the bridge.
Additionally, configure the fast mode of LACP on the bond to allow more timely updates of the LACP state. These bonds are then placed in a bridge, which must include the peer link between the switches.
To enable communication between the clagd services on the peer switches, do the following:
Choose an unused VLAN (also known as a switched virtual interface or SVI here).
Assign the SVI an unrouteable link-local address to give the peer switches layer 3 connectivity between each other.
Configure the VLAN as a VLAN subinterface on the peer link bond (called peerlink) rather than on the VLAN-aware bridge. If you configure the subinterface with NCLU, the VLAN subinterface uses VLAN 4094 by default (the subinterface named peerlink.4094 below). If you are configuring the peer link without NCLU, use 4094 for the peer link VLAN if possible. This ensures that the VLAN is completely independent of the bridge and spanning tree forwarding decisions.
Include untagged traffic on the peer link, as this avoids issues with STP.
Specify a backup interface, which is any layer 3 backup interface for your peer links in case the peer link goes down. While a backup interface is optional, it’s best to configure one. More information about configuring the backup link and understanding various redundancy scenarios is available below.
For example, if peerlink is the inter-chassis bond, and VLAN 4094 is the peer link VLAN, configure peerlink.4094 as follows:
Cumulus Linux 3.7.6 and earlier
cumulus@leaf01:~$ net add bond peerlink bond slaves swp49-50
cumulus@leaf01:~$ net add interface peerlink.4094 ip address 169.254.1.1/30
cumulus@leaf01:~$ net add interface peerlink.4094 clag peer-ip 169.254.1.2
cumulus@leaf01:~$ net add interface peerlink.4094 clag backup-ip 192.0.2.50
cumulus@leaf01:~$ net add interface peerlink.4094 clag sys-mac 44:38:39:FF:40:94
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
The above commands save the configuration in the /etc/network/interfaces file.
auto peerlink
iface peerlink
bond-slaves swp49 swp50
auto peerlink.4094
iface peerlink.4094
address 169.254.1.1/30
clagd-peer-ip 169.254.1.2
clagd-backup-ip 192.0.2.50
clagd-sys-mac 44:38:39:FF:40:94
Cumulus Linux 3.7.7 and later
In Cumulus Linux 3.7.7 and later, you can use MLAG unnumbered:
cumulus@leaf01:~$ net add bond peerlink bond slaves swp49-50
cumulus@leaf01:~$ net add interface peerlink.4094 clag peer-ip linklocal
cumulus@leaf01:~$ net add interface peerlink.4094 clag backup-ip 192.0.2.50
cumulus@leaf01:~$ net add interface peerlink.4094 clag sys-mac 44:38:39:FF:40:94
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
The above commands save the configuration in the /etc/network/interfaces file.
auto peerlink
iface peerlink
bond-slaves swp49 swp50
auto peerlink.4094
iface peerlink.4094
clagd-backup-ip 192.0.2.50
clagd-peer-ip linklocal
clagd-sys-mac 44:38:39:FF:40:94
Do not add VLAN 4094 to the bridge VLAN list; VLAN 4094 for the peer link subinterface cannot also be configured as a bridged VLAN with bridge VIDs under the bridge.
To enable MLAG, peerlink must be added to a traditional or VLAN-aware bridge. The commands below add peerlink to a VLAN-aware bridge:
cumulus@leaf01:~$ net add bridge bridge ports peerlink
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
This creates the following configuration in the /etc/network/interfaces file:
auto bridge
iface bridge
bridge-ports peerlink
bridge-vlan-aware yes
If you change the MLAG configuration by editing the interfaces file, the changes take effect when you bring the peer link interface up with ifup. Do not use systemctl restart clagd.service to apply the new configuration.
Do not use 169.254.0.1 as the MLAG peer link IP address; Cumulus Linux uses this address exclusively for BGP unnumbered interfaces.
Switch Roles and Priority Setting
Each MLAG-enabled switch in the pair has a role. When the peering relationship is established between the two switches, one switch is put into the primary role, and the other into the secondary role. When an MLAG-enabled switch is in the secondary role, it does not send STP BPDUs on dual-connected links; it only sends BPDUs on single-connected links. The switch in the primary role sends STP BPDUs on all single- and dual-connected links.
Sends BPDUs Via          Primary    Secondary
Single-connected links   Yes        Yes
Dual-connected links     Yes        No
By default, the role is determined by comparing the MAC addresses of the two sides of the peering link; the switch with the lower MAC address assumes the primary role. You can override this by setting the clagd-priority option for the peer link:
cumulus@leaf01:~$ net add interface peerlink.4094 clag priority 2048
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
The switch with the lower priority value is given the primary role; the default value is 32768 and the range is 0 to 65535. Read the clagd(8) and clagctl(8) man pages for more information.
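The selection rule can be summarized in a short sketch: the lower clagd-priority wins, and when both switches use the same priority (for example, the default), the lower MAC address wins. The class, function, and MAC addresses below are hypothetical and only illustrate the comparison described above.

```python
from dataclasses import dataclass

DEFAULT_PRIORITY = 32768


@dataclass
class Peer:
    name: str
    mac: str  # MAC address of this side of the peer link
    priority: int = DEFAULT_PRIORITY

    def sort_key(self):
        if not 0 <= self.priority <= 65535:
            raise ValueError(f"priority out of range: {self.priority}")
        # Lower priority first; the MAC address breaks ties.
        return (self.priority, int(self.mac.replace(":", ""), 16))


def assign_roles(a, b):
    """Return a mapping of switch name to 'primary' or 'secondary'."""
    primary, secondary = sorted((a, b), key=Peer.sort_key)
    return {primary.name: "primary", secondary.name: "secondary"}


# Equal (default) priorities: the lower MAC address becomes primary.
print(assign_roles(Peer("leaf01", "44:38:39:00:00:10"),
                   Peer("leaf02", "44:38:39:00:00:0f")))
# Setting a lower priority on leaf01 overrides the MAC comparison.
print(assign_roles(Peer("leaf01", "44:38:39:00:00:10", priority=2048),
                   Peer("leaf02", "44:38:39:00:00:0f")))
```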
When the clagd service is exited during switch reboot or the service is stopped in the primary switch, the peer switch that is in the secondary role becomes the primary.
However, if the primary switch goes down for any reason without stopping the clagd service, or if the peer link goes down, the secondary switch does not change its role. If the peer switch is determined to be not alive, the switch in the secondary role rolls back the LACP system ID to the bond interface MAC address instead of the clagd-sys-mac, and the switch in the primary role uses the clagd-sys-mac as the LACP system ID on the bonds.
clagctl Timers
The clagd service has a number of timers that you can tune for enhanced performance. The relevant timers are:
--reloadTimer <SECONDS>: The number of seconds to wait for the peer switch to become active. If the peer switch does not become active after the timer expires, the MLAG bonds leave the initialization (protodown) state and become active. This provides clagd with sufficient time to determine whether the peer switch is coming up or if it is permanently unreachable. The default is 300 seconds.
--peerTimeout <SECONDS>: The number of seconds clagd waits without receiving any data from the peer switch before it determines that the peer is no longer active. If this parameter is not specified, clagd uses ten times the local lacpPoll value.
--initDelay <SECONDS>: The number of seconds clagd delays the bring up of MLAG bonds and anycast IP addresses. The default is 10 seconds.
--sendTimeout <SECONDS>: The number of seconds clagd waits until the sending socket times out. If it takes longer than the sendTimeout value to send data to the peer, clagd generates an exception. The default is 30 seconds.
To set a timer, use NCLU. For example, to set the peerTimeout to 900 seconds:
cumulus@switch:~$ net add interface peerlink.4094 clag args --peerTimeout 900
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
You can run clagctl params to see the settings for all of the clagd parameters.
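To make the timer defaults concrete, the sketch below computes the peer timeout that applies when --peerTimeout is not set and builds the NCLU command for a single timer. The helper names are hypothetical, and the lacpPoll value in the example is only a placeholder; check clagctl params for the value on your switch.

```python
# Defaults stated above; peerTimeout has no fixed default.
DOCUMENTED_DEFAULTS = {"reloadTimer": 300, "initDelay": 10, "sendTimeout": 30}
KNOWN_TIMERS = set(DOCUMENTED_DEFAULTS) | {"peerTimeout"}


def effective_peer_timeout(lacp_poll, peer_timeout=None):
    """Return the peer timeout in seconds: explicit value, else 10 x lacpPoll."""
    return 10 * lacp_poll if peer_timeout is None else peer_timeout


def timer_command(timer, seconds, interface="peerlink.4094"):
    """Build the NCLU command that sets one clagd timer."""
    if timer not in KNOWN_TIMERS:
        raise ValueError(f"unknown timer: {timer}")
    if seconds < 0:
        raise ValueError("seconds must not be negative")
    return f"net add interface {interface} clag args --{timer} {seconds}"


print(effective_peer_timeout(lacp_poll=2))  # placeholder lacpPoll value
print(timer_command("peerTimeout", 900))
```

Remember to follow the generated command with net pending and net commit, as in the example above.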
The example configuration below configures two bonds for MLAG, each with a single port, a peer link that is a bond with two member ports, and three VLANs on each port.
You can see a more traditional layer 2 example configuration in NCLU; run net example clag l2-with-server-vlan-trunks. For a very basic configuration with just one pair of switches and a single host, run net example clag basic-clag.
You configure these interfaces using NCLU, so the bridges are in VLAN-aware mode. The bridges use these Cumulus Linux-specific keywords:
bridge-vids, which defines the allowed list of tagged 802.1q VLAN IDs for all bridge member interfaces. You can specify non-contiguous ranges with a space-separated list, like bridge-vids 100-200 300 400-500.
bridge-pvid, which defines the untagged VLAN ID for each port. This is commonly referred to as the native VLAN.
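The bridge-vids syntax is easy to expand programmatically, which is handy for checking that the peer link VLAN is not also bridged. The following is a minimal sketch with a hypothetical function name; it accepts only the space-separated IDs and ranges shown above.

```python
def expand_vids(spec):
    """Expand a bridge-vids value such as '100-200 300 400-500' to a sorted list."""
    vids = set()
    for part in spec.split():
        if "-" in part:
            low, high = (int(value) for value in part.split("-", 1))
            if low > high:
                raise ValueError(f"descending range: {part}")
            vids.update(range(low, high + 1))
        else:
            vids.add(int(part))
    for vid in vids:
        if not 1 <= vid <= 4094:
            raise ValueError(f"VLAN ID out of range: {vid}")
    return sorted(vids)


vids = expand_vids("100-200 300 400-500")
print(len(vids))      # 203
print(4094 in vids)   # False: the peer link VLAN is not bridged
```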
The bridge configurations below indicate that each bond carries tagged frames on VLANs 10, 20, 30, 40, 50, and 100 to 200 (as specified by bridge-vids), but untagged frames on VLAN 1 (as specified by bridge-pvid). Also, note how you configure the VLAN subinterfaces used for clagd communication (peerlink.4094 in the sample configuration below). Finally, the host configurations for server01 through server04 are not shown here. The configurations for each corresponding node are almost identical, except for the IP addresses used for managing the clagd service.
At minimum, this VLAN subinterface should not be in your layer 2 domain. Give it a very high VLAN ID (up to 4094). Read more about the range of VLAN IDs you can use.
The commands to create the configurations for both spines look like the following. Note that the clag-id and clagd-sys-mac must be the same for the corresponding bonds on spine01 and spine02:
spine01 and spine02 configuration
spine01
cumulus@spine01:~$ net show configuration commands
net add interface swp1-4
net add loopback lo ip address 10.0.0.21/32
net add interface eth0 ip address dhcp
These commands create the following configuration in the /etc/network/interfaces file:
cumulus@spine01:~$ cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.0.0.21/32
auto eth0
iface eth0 inet dhcp
#downlinks
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
spine02
cumulus@spine02:~$ net show configuration commands
net add interface swp1-4
net add loopback lo ip address 10.0.0.22/32
net add interface eth0 ip address dhcp
These commands create the following configuration in the /etc/network/interfaces file:
cumulus@spine02:~$ cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.0.0.22/32
auto eth0
iface eth0 inet dhcp
#downlinks
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
Here is an example configuration for the switches leaf01 through leaf04. Note that the clag-id and clagd-sys-mac must be the same for the corresponding bonds on leaf01 and leaf02 as well as leaf03 and leaf04:
leaf01 thru leaf04 configuration
leaf01
cumulus@leaf01:~$ net show configuration commands
net add loopback lo ip address 10.0.0.11/32
net add bgp autonomous-system 65011
net add bgp router-id 10.0.0.11
net add bgp ipv4 unicast network 10.0.0.11/32
net add routing prefix-list ipv4 dc-leaf-in seq 10 permit 0.0.0.0/0
net add routing prefix-list ipv4 dc-leaf-in seq 20 permit 10.0.0.0/24 le 32
net add routing prefix-list ipv4 dc-leaf-in seq 30 permit 172.16.2.0/24
net add routing prefix-list ipv4 dc-leaf-out seq 10 permit 172.16.1.0/24
net add bgp neighbor fabric peer-group
net add bgp neighbor fabric remote-as external
net add bgp ipv4 unicast neighbor fabric prefix-list dc-leaf-in in
net add bgp ipv4 unicast neighbor fabric prefix-list dc-leaf-out out
net add bgp neighbor swp51-52 interface peer-group fabric
net add vlan 100 ip address 172.16.1.1/24
net add bgp ipv4 unicast network 172.16.1.1/24
net add clag peer sys-mac 44:38:39:FF:00:01 interface swp49-50 primary backup-ip 192.168.1.12
net add clag port bond server1 interface swp1 clag-id 1
net add clag port bond server2 interface swp2 clag-id 2
net add bond server1-2 bridge access 100
net add bond server1-2 stp portadminedge
net add bond server1-2 stp bpduguard
These commands create the following configuration in the /etc/network/interfaces file:
cumulus@leaf01:~$ cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.0.0.11/32
auto eth0
iface eth0 inet dhcp
auto swp1
iface swp1
auto swp2
iface swp2
#peerlink
auto swp49
iface swp49
post-up ip link set $IFACE promisc on # Only required on VX
auto swp50
iface swp50
post-up ip link set $IFACE promisc on # Only required on VX
#uplinks
auto swp51
iface swp51
auto swp52
iface swp52
#bridge to hosts
auto bridge
iface bridge
bridge-ports peerlink server1 server2
bridge-vids 100
bridge-vlan-aware yes
auto peerlink
iface peerlink
bond-slaves swp49 swp50
auto vlan100
iface vlan100
address 172.16.1.1/24
vlan-id 100
vlan-raw-device bridge
leaf02
cumulus@leaf02:~$ net show conf commands
net add loopback lo ip address 10.0.0.12/32
net add bgp autonomous-system 65012
net add bgp router-id 10.0.0.12
net add bgp ipv4 unicast network 10.0.0.12/32
net add routing prefix-list ipv4 dc-leaf-in seq 10 permit 0.0.0.0/0
net add routing prefix-list ipv4 dc-leaf-in seq 20 permit 10.0.0.0/24 le 32
net add routing prefix-list ipv4 dc-leaf-in seq 30 permit 172.16.2.0/24
net add routing prefix-list ipv4 dc-leaf-out seq 10 permit 172.16.1.0/24
net add bgp neighbor fabric peer-group
net add bgp neighbor fabric remote-as external
net add bgp ipv4 unicast neighbor fabric prefix-list dc-leaf-in in
net add bgp ipv4 unicast neighbor fabric prefix-list dc-leaf-out out
net add bgp neighbor swp51-52 interface peer-group fabric
net add vlan 100 ip address 172.16.1.2/24
net add bgp ipv4 unicast network 172.16.1.2/24
net add clag peer sys-mac 44:38:39:FF:00:01 interface swp49-50 secondary backup-ip 192.168.1.11
net add clag port bond server1 interface swp1 clag-id 1
net add clag port bond server2 interface swp2 clag-id 2
net add bond server1-2 bridge access 100
net add bond server1-2 stp portadminedge
net add bond server1-2 stp bpduguard
These commands create the following configuration in the /etc/network/interfaces file:
cumulus@leaf02:~$ cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.0.0.12/32
auto eth0
iface eth0 inet dhcp
auto swp1
iface swp1
auto swp2
iface swp2
#peerlink
auto swp49
iface swp49
post-up ip link set $IFACE promisc on # Only required on VX
auto swp50
iface swp50
post-up ip link set $IFACE promisc on # Only required on VX
#uplinks
auto swp51
iface swp51
auto swp52
iface swp52
#bridge to hosts
auto bridge
iface bridge
bridge-ports peerlink server1 server2
bridge-vids 100
bridge-vlan-aware yes
auto peerlink
iface peerlink
bond-slaves swp49 swp50
auto peerlink.4094
iface peerlink.4094
clagd-backup-ip 192.168.1.11
clagd-peer-ip linklocal
clagd-sys-mac 44:38:39:FF:00:01
auto vlan100
iface vlan100
address 172.16.1.2/24
vlan-id 100
vlan-raw-device bridge
leaf03
cumulus@leaf03:~$ net show conf commands
net add loopback lo ip address 10.0.0.13/32
net add bgp autonomous-system 65013
net add bgp router-id 10.0.0.13
net add bgp ipv4 unicast network 10.0.0.13/32
net add routing prefix-list ipv4 dc-leaf-in seq 10 permit 0.0.0.0/0
net add routing prefix-list ipv4 dc-leaf-in seq 20 permit 10.0.0.0/24 le 32
net add routing prefix-list ipv4 dc-leaf-in seq 30 permit 172.16.2.0/24
net add routing prefix-list ipv4 dc-leaf-out seq 10 permit 172.16.1.0/24
net add bgp neighbor fabric peer-group
net add bgp neighbor fabric remote-as external
net add bgp ipv4 unicast neighbor fabric prefix-list dc-leaf-in in
net add bgp ipv4 unicast neighbor fabric prefix-list dc-leaf-out out
net add bgp neighbor swp51-52 interface peer-group fabric
net add vlan 100 ip address 172.16.1.3/24
net add bgp ipv4 unicast network 172.16.1.3/24
net add clag peer sys-mac 44:38:39:FF:00:02 interface swp49-50 primary backup-ip 192.168.1.14
net add clag port bond server3 interface swp1 clag-id 3
net add clag port bond server4 interface swp2 clag-id 4
net add bond server3-4 bridge access 100
net add bond server3-4 stp portadminedge
net add bond server3-4 stp bpduguard
These commands create the following configuration in the /etc/network/interfaces file:
cumulus@leaf03:~$ cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.0.0.13/32
auto eth0
iface eth0 inet dhcp
auto swp1
iface swp1
auto swp2
iface swp2
#peerlink
auto swp49
iface swp49
post-up ip link set $IFACE promisc on # Only required on VX
auto swp50
iface swp50
post-up ip link set $IFACE promisc on # Only required on VX
#uplinks
auto swp51
iface swp51
auto swp52
iface swp52
#bridge to hosts
auto bridge
iface bridge
bridge-ports peerlink server3 server4
bridge-vids 100
bridge-vlan-aware yes
auto peerlink
iface peerlink
bond-slaves swp49 swp50
auto vlan100
iface vlan100
address 172.16.1.3/24
vlan-id 100
vlan-raw-device bridge
leaf04
cumulus@leaf04:~$ net show configuration commands
net add loopback lo ip address 10.0.0.14/32
net add bgp autonomous-system 65014
net add bgp router-id 10.0.0.14
net add bgp ipv4 unicast network 10.0.0.14/32
net add routing prefix-list ipv4 dc-leaf-in seq 10 permit 0.0.0.0/0
net add routing prefix-list ipv4 dc-leaf-in seq 20 permit 10.0.0.0/24 le 32
net add routing prefix-list ipv4 dc-leaf-in seq 30 permit 172.16.2.0/24
net add routing prefix-list ipv4 dc-leaf-out seq 10 permit 172.16.1.0/24
net add bgp neighbor fabric peer-group
net add bgp neighbor fabric remote-as external
net add bgp ipv4 unicast neighbor fabric prefix-list dc-leaf-in in
net add bgp ipv4 unicast neighbor fabric prefix-list dc-leaf-out out
net add bgp neighbor swp51-52 interface peer-group fabric
net add vlan 100 ip address 172.16.1.4/24
net add bgp ipv4 unicast network 172.16.1.4/24
net add clag peer sys-mac 44:38:39:FF:00:02 interface swp49-50 secondary backup-ip 192.168.1.13
net add clag port bond server3 interface swp1 clag-id 3
net add clag port bond server4 interface swp2 clag-id 4
net add bond server3-4 bridge access 100
net add bond server3-4 stp portadminedge
net add bond server3-4 stp bpduguard
These commands create the following configuration in the /etc/network/interfaces file:
cumulus@leaf04:~$ cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.0.0.14/32
auto eth0
iface eth0 inet dhcp
auto swp1
iface swp1
auto swp2
iface swp2
#peerlink
auto swp49
iface swp49
post-up ip link set $IFACE promisc on # Only required on VX
auto swp50
iface swp50
post-up ip link set $IFACE promisc on # Only required on VX
#uplinks
auto swp51
iface swp51
auto swp52
iface swp52
#bridge to hosts
auto bridge
iface bridge
bridge-ports peerlink server3 server4
bridge-vids 100
bridge-vlan-aware yes
auto peerlink
iface peerlink
bond-slaves swp49 swp50
auto peerlink.4094
iface peerlink.4094
clagd-backup-ip 192.168.1.13
clagd-peer-ip linklocal
clagd-sys-mac 44:38:39:FF:00:02
auto vlan100
iface vlan100
address 172.16.1.4/24
vlan-id 100
vlan-raw-device bridge
Disable clagd on an Interface
In the configurations above, the clagd-peer-ip and clagd-sys-mac parameters are mandatory, while the rest are optional. When mandatory clagd commands are present under a peer link subinterface, by default clagd-enable is set to yes and does not need to be specified; to disable clagd on the subinterface, set clagd-enable to no:
cumulus@spine01:~$ net add interface peerlink.4094 clag enable no
cumulus@spine01:~$ net pending
cumulus@spine01:~$ net commit
Use clagd-priority to set the role of the MLAG peer switch to primary or secondary. Each peer switch in an MLAG pair must have the same clagd-sys-mac setting. Each clagd-sys-mac setting must be unique to each MLAG pair in the network. For more details, refer to man clagd.
Check the MLAG Configuration Status
You can check the status of your MLAG configuration using the net show clag command.
cumulus@leaf01:~$ net show clag
The peer is alive
Peer Priority, ID, and Role: 4096 44:38:39:FF:00:01 primary
Our Priority, ID, and Role: 8192 44:38:39:FF:00:02 secondary
Peer Interface and IP: peerlink.4094 linklocal
Backup IP: 192.168.1.12 (inactive)
System MAC: 44:38:39:FF:00:01
CLAG Interfaces
Our Interface Peer Interface CLAG Id Conflicts Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
server1 server1 1 - -
server2 server2 2 - -
A command line utility called clagctl is available for interacting with a running clagd service to get status or alter operational behavior. For a detailed explanation of the utility, refer to the clagctl(8) man page.
Sample clagctl Output
The following is a sample output of the MLAG operational status displayed by clagctl:
The peer is alive
Peer Priority, ID, and Role: 4096 44:38:39:FF:00:01 primary
Our Priority, ID, and Role: 8192 44:38:39:FF:00:02 secondary
Peer Interface and IP: peerlink.4094 linklocal
Backup IP: 192.168.1.12 (inactive)
System MAC: 44:38:39:FF:00:01
CLAG Interfaces
Our Interface Peer Interface CLAG Id Conflicts Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
server1 server1 1 - -
server2 server2 2 - -
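If you collect this output in a script, the header lines are regular enough to parse. The following is a minimal sketch that assumes the text layout matches the sample above; parse_clag_status is a hypothetical helper, not part of clagctl, and the layout can differ between releases.

```python
import re

# Header lines copied from the sample output above.
SAMPLE = """The peer is alive
Peer Priority, ID, and Role: 4096 44:38:39:FF:00:01 primary
Our Priority, ID, and Role: 8192 44:38:39:FF:00:02 secondary
Peer Interface and IP: peerlink.4094 linklocal
Backup IP: 192.168.1.12 (inactive)
System MAC: 44:38:39:FF:00:01
"""

def parse_clag_status(text):
    """Extract peer liveness, roles, and backup link state from the header."""
    status = {"peer_alive": False}
    for line in text.splitlines():
        line = line.strip()
        if line == "The peer is alive":
            status["peer_alive"] = True
        m = re.match(r"(Peer|Our) Priority, ID, and Role:\s+(\d+)\s+(\S+)\s+(\S+)", line)
        if m:
            status[m.group(1).lower()] = {
                "priority": int(m.group(2)),
                "id": m.group(3),
                "role": m.group(4),
            }
        # The state is the parenthesized word at the end of the line.
        m = re.match(r"Backup IP:\s+(\S+).*\((\w+)\)", line)
        if m:
            status["backup_ip"] = m.group(1)
            status["backup_state"] = m.group(2)
    return status

if __name__ == "__main__":
    print(parse_clag_status(SAMPLE))
```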
To configure MLAG with a traditional mode bridge, the peer link and all dual-connected links must be configured as untagged/native ports on a bridge (note the absence of any VLANs in the bridge-ports line and the lack of the bridge-vlan-aware parameter below):
auto br0
iface br0
bridge-ports peerlink spine1-2 host1 host2
The following example shows you how to allow VLAN 100 across the peer link:
auto br0.100
iface br0.100
bridge-ports peerlink.100 bond1.100
In an MLAG and traditional bridge configuration, NVIDIA recommends that you set bridge learning to off on all VLANs over the peerlink except for the layer 3 peerlink subinterface; for example:
...
auto peerlink
iface peerlink
bridge-learning off
auto peerlink.1510
iface peerlink.1510
bridge-learning off
auto peerlink.4094
iface peerlink.4094
...
For a deeper comparison of traditional versus VLAN-aware bridge modes, read this knowledge base article.
Peer Link Interfaces and the protodown State
In addition to the standard UP and DOWN administrative states, an interface that is a member of an MLAG bond can also be in a protodown state. When MLAG detects a problem that might result in connectivity issues such as traffic black-holing or a network meltdown if the link carrier was left in an UP state, it can put that interface into protodown state. Such connectivity issues include:
When the peer link goes down but the peer switch is up (that is, the backup link is active).
When the bond is configured with an MLAG ID, but the clagd service is not running (whether it was deliberately stopped or simply died).
When an MLAG-enabled node is booted or rebooted, the MLAG bonds are placed in a protodown state until the node establishes a connection to its peer switch, or five minutes have elapsed.
When an interface goes into a protodown state, it results in a local OPER DOWN (carrier down) on the interface. As of Cumulus Linux 2.5.5, the protodown state can be manipulated with the ip link set command. Given its use in preventing network meltdowns, manually manipulating protodown is not recommended outside the scope of interaction with the Cumulus Linux support team.
The following command output shows an interface in protodown state. Notice that the link carrier is down (NO-CARRIER):
cumulus@switch:~$ net show bridge link swp1
3: swp1 state DOWN: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 9216 master pfifo_fast master host-bond1 state DOWN mode DEFAULT qlen 500 protodown on
link/ether 44:38:39:00:69:84 brd ff:ff:ff:ff:ff:ff
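To flag protodown interfaces from collected output, you can look for the protodown on token and the NO-CARRIER flag. This is a minimal sketch that assumes the line format shown above; is_protodown and link_flags are hypothetical helper names.

```python
# First line of the sample output above.
SAMPLE = ("3: swp1 state DOWN: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> "
          "mtu 9216 master pfifo_fast master host-bond1 state DOWN "
          "mode DEFAULT qlen 500 protodown on")

def link_flags(line):
    """Return the set of flags listed between '<' and '>'."""
    start = line.index("<") + 1
    end = line.index(">", start)
    return set(line[start:end].split(","))

def is_protodown(line):
    """Return True if the line contains the token pair 'protodown on'."""
    tokens = line.split()
    return any(
        tok == "protodown" and nxt == "on"
        for tok, nxt in zip(tokens, tokens[1:])
    )

if __name__ == "__main__":
    if is_protodown(SAMPLE) and "NO-CARRIER" in link_flags(SAMPLE):
        print("swp1 is held down by protodown")
```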
Specify a Backup Link
You should specify a backup link for your peer links in case the peer link goes down. When this happens, the clagd service uses the backup link to check the health of the peer switch. The backup link is specified in the clagd-backup-ip parameter.
In an anycast VTEP environment, if you do not specify the clagd-backup-ip parameter, large convergence times (around 5 minutes) can result when the primary MLAG switch is powered off. In that case, the secondary switch must wait until the reload delay timer expires (which defaults to 300 seconds, or 5 minutes) before bringing up a VNI with its unique loopback IP.
To configure a backup link, add clagd-backup-ip <ADDRESS> to the peer link configuration:
cumulus@spine01:~$ net add interface peerlink.4094 clag backup-ip 192.0.2.50
cumulus@spine01:~$ net pending
cumulus@spine01:~$ net commit
The backup IP address must be different from the peer link IP address (clagd-peer-ip). It must be reachable by a route that does not use the peer link, and it must be in the same network namespace as the peer link IP address.
Use the switch’s loopback or management IP address for this purpose. Which one should you choose?
If your MLAG configuration has routed uplinks (a modern approach to the data center fabric network), then configure clagd to use the peer switch loopback address for the health check. When the peer link is down, the secondary switch must route towards the loopback address using uplinks (towards the spine layer). If the primary switch is also suffering a more significant problem (for example, switchd is unresponsive or stopped), then the secondary switch eventually promotes itself to primary and traffic flows normally.
To ensure IP connectivity between the loopbacks, you must carefully consider what implications this has on the BGP ASN configured:
The two MLAG member switches must use unique BGP ASNs, or,
If the two MLAG member switches use the same BGP ASN, then you must bypass the BGP loop prevention check on AS_PATH attribute.
If your MLAG configuration has bridged uplinks (such as a campus network or a large, flat layer 2 network), then configure clagd to use the peer switch eth0 address for the health check. When the peer link is down, the secondary switch must route towards the eth0 address using the OOB network (provided you have implemented an OOB network).
You can also specify the backup UDP port. The port defaults to 5342, but you can configure it as an argument in clagd-args using --backupPort <PORT>.
cumulus@spine01:~$ net add interface peerlink.4094 clag args --backupPort 5400
cumulus@spine01:~$ net pending
cumulus@spine01:~$ net commit
To see the backup IP address, run the net show clag command:
cumulus@spine01:~$ net show clag
The peer is alive
Our Priority, ID, and Role: 32768 44:38:39:00:00:41 primary
Peer Priority, ID, and Role: 32768 44:38:39:00:00:42 secondary
Peer Interface and IP: peerlink.4094 linklocal
Backup IP: 192.168.0.22 (active)
System MAC: 44:38:39:FF:40:90
CLAG Interfaces
Our Interface Peer Interface CLAG Id Conflicts Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
leaf03-04 leaf03-04 1034 - -
exit01-02 - 2930 - -
leaf01-02 leaf01-02 1012 - -
Specify a Backup Link to a VRF
You can configure the backup link to a VRF or management VRF. Include the name of the VRF or management VRF with the clagd-backup-ip command. Here is a sample configuration:
cumulus@spine01:~$ net add interface peerlink.4094 clag backup-ip 192.168.0.22 vrf mgmt
cumulus@spine01:~$ net pending
cumulus@spine01:~$ net commit
You cannot use the VRF on a peer link subinterface.
Verify the backup link by running the net show clag backup-ip command:
cumulus@leaf01:~$ net show clag backup-ip
Backup info:
IP: 192.168.0.12; State: active; Role: primary
Peer priority and id: 32768 44:38:39:00:00:12; Peer role: secondary
The MLAG healthCheck module listens on UDP port 5342. If you have not configured a backup VRF, the module listens on all VRFs, which is normal UDP socket behavior. Make sure to configure a backup link and backup VRF so that the MLAG healthCheck module only listens on the backup VRF.
Comparing VRF and Management VRF Configurations
The configuration for both a VRF and management VRF is exactly the same. The following example shows a configuration where the backup interface is in a VRF:
cumulus@leaf01:~$ net show configuration
...
auto swp52s0
iface swp52s0
address 192.0.2.1/24
vrf green
auto green
iface green
vrf-table auto
auto peer5.4000
iface peer5.4000
address 192.0.2.15/24
clagd-peer-ip linklocal
clagd-backup-ip 192.0.2.2 vrf green
clagd-sys-mac 44:38:39:01:01:01
...
You can verify the configuration with the net show clag status verbose command:
cumulus@leaf01:~$ net show clag status verbose
The peer is alive
Peer Priority, ID, and Role: 32768 00:02:00:00:00:13 primary
Our Priority, ID, and Role: 32768 c4:54:44:f6:44:5a secondary
Peer Interface and IP: peer5.4000 linklocal
Backup IP: 192.0.2.2 vrf green (active)
System MAC: 44:38:39:01:01:01
CLAG Interfaces
Our Interface Peer Interface CLAG Id Conflicts Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
bond4 bond4 4 - -
bond1 bond1 1 - -
bond2 bond2 2 - -
bond3 bond3 3 - -
...
Monitor Dual-Connected Peers
Upon receipt of a valid message from its peer, the switch knows that clagd is alive and executing on that peer. This causes clagd to change the system ID of each bond that is assigned a clag-id from the default value (the MAC address of the bond) to the system ID assigned to both peer switches. This makes the hosts connected to each switch act as if they are connected to the same system so that they use all ports within their bond. Additionally, clagd determines which bonds are dual-connected and modifies the forwarding and learning behavior to accommodate these dual-connected bonds.
If a switch does not receive any messages from its peer for three update intervals, it assumes that the peer switch is no longer acting as an MLAG peer. In this case, the switch reverts all configuration changes so that it operates as a standard non-MLAG switch. This includes removing all statically assigned MAC addresses, clearing the egress forwarding mask, and allowing addresses to move from any port to the peer port. After a message is again received from the peer, MLAG operation starts again as described earlier. You can configure a custom timeout setting by adding --peerTimeout <VALUE> to clagd-args, like this:
cumulus@spine01:~$ net add interface peerlink.4094 clag args --peerTimeout 900
cumulus@spine01:~$ net pending
cumulus@spine01:~$ net commit
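The timeout rule described above can be sketched as a simple comparison. This is an illustration of the logic only, not the clagd implementation; the peer_timed_out function is hypothetical, and it assumes that all values are in seconds and that a configured --peerTimeout replaces the default of three update intervals.

```python
def peer_timed_out(seconds_since_last_message, update_interval, peer_timeout=None):
    """Return True if the peer should be treated as no longer an MLAG peer.

    Default limit: three update intervals without a message.
    peer_timeout, if given, is assumed to override that default.
    """
    if peer_timeout is not None:
        limit = peer_timeout
    else:
        limit = 3 * update_interval
    return seconds_since_last_message > limit

if __name__ == "__main__":
    # The interval and elapsed time here are example values only.
    print(peer_timed_out(seconds_since_last_message=31, update_interval=10))
```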
After bonds are identified as dual-connected, clagd sends more information to the peer switch for those bonds. The MAC addresses (and VLANs) that are dynamically learned on those ports are sent along with the LACP partner MAC address for each bond. When a switch receives MAC address information from its peer, it adds MAC address entries on the corresponding ports. As the switch learns and ages out MAC addresses, it informs the peer switch of these changes to its MAC address table so that the peer can keep its table synchronized. Periodically, at 45% of the bridge ageing time, a switch sends its entire MAC address table to the peer, so that the peer switch can verify that its MAC address table is properly synchronized.
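As a worked example of the 45% rule, the sketch below computes the full-table synchronization period from a bridge ageing time. The full_sync_interval function is hypothetical, and the ageing value of 1800 seconds is only an example input; check the ageing time configured on your bridge.

```python
def full_sync_interval(bridge_ageing_seconds):
    """Return the period, in seconds, between full MAC table syncs (45% of ageing)."""
    return bridge_ageing_seconds * 45 / 100

if __name__ == "__main__":
    ageing = 1800  # example value, not read from a switch
    print(f"Full MAC table sync every {full_sync_interval(ageing):.0f} seconds")
```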
The switch sends an update frequency value in the messages to its peer, which tells clagd how often the peer will send these messages. You can configure a different frequency by adding --lacpPoll <SECONDS> to clagd-args:
cumulus@spine01:~$ net add interface peerlink.4094 clag args --lacpPoll 900
cumulus@spine01:~$ net pending
cumulus@spine01:~$ net commit
Configure Layer 3 Routed Uplinks
In this scenario, the spine switches connect at layer 3, as shown in the image below. Alternatively, the spine switches can be singly connected to each core switch at layer 3 (not shown below).
In this design, the spine switches route traffic between the server hosts in the layer 2 domains and the core. The servers (host1 through host4) each have a layer 2 connection up to the spine layer, where the default gateway for the host subnets resides. However, because the spine switches communicate at layer 3 as gateway devices, you need to configure a protocol such as VRR (virtual router redundancy) between the spine switch pair to support active/active forwarding.
Then, to connect the spine switches to the core switches, you need to determine whether the routing is static or dynamic. If it is dynamic, you must choose which protocol - OSPF or BGP - to use.
When enabling a routing protocol in an MLAG environment, it is also necessary to manage the uplinks, because by default MLAG is not aware of layer 3 uplink interfaces. In the event of a peer link failure, MLAG does not remove static routes or bring down a BGP or OSPF adjacency unless a separate link state daemon such as ifplugd is used.
MLAG and Peer Link Peering
When using MLAG with VRR, set up a routed adjacency across the peerlink.4094 interface. If a routed connection is not built across the peer link, then during uplink failure on one of the switches in the MLAG pair, egress traffic can be blackholed if it hashes to the leaf whose uplinks are down.
To set up the adjacency, configure a BGP or OSPF unnumbered peering, as appropriate for your network.
For example, if you are using BGP, use a configuration like this:
cumulus@switch:~$ net add bgp neighbor peerlink.4094 interface remote-as internal
cumulus@switch:~$ net commit
If you are using OSPF, use a configuration like this:
cumulus@switch:~$ net add interface peerlink.4094 ospf area 0.0.0.1
cumulus@switch:~$ net commit
If you are using EVPN and MLAG, you need to enable the EVPN address family across the peerlink.4094 interface as well:
cumulus@switch:~$ net add bgp neighbor peerlink.4094 interface remote-as internal
cumulus@switch:~$ net add bgp l2vpn evpn neighbor peerlink.4094 activate
cumulus@switch:~$ net commit
Be aware of an existing issue: when you use NCLU to create an iBGP peering, it creates an eBGP peering instead. For more information, see this release note.
MLAG Routing Support
In addition to the routing adjacency over the peer link, Cumulus Linux supports routing adjacencies from attached network devices to MLAG switches under the following conditions:
The router must physically attach to a single interface of a switch.
The attached router must peer directly to a local address on the physically connected switch.
The router cannot:
Attach to the switch over an MLAG bond interface.
Form routing adjacencies to a virtual address (VRR or VRRP).
IGMP Snooping with MLAG
IGMP snooping processes IGMP reports received on a bridge port in a bridge to identify hosts that are configured to receive multicast traffic destined to that group. An IGMP query message received on a port is used to identify the port that is connected to a router and configured to receive multicast traffic.
IGMP snooping is enabled by default on the bridge. IGMP snooping multicast database entries and router port entries are synced to the peer MLAG switch. If there is no multicast router in the VLAN, you can configure the IGMP querier on the switch to generate IGMP query messages. For more information, read the IGMP and MLD Snooping chapter.
In an MLAG configuration, the switch in the secondary role does not send IGMP queries, even though the configuration is identical to the switch in the primary role. This is expected behavior, as there can be only one querier on each VLAN. Once the querier on the primary switch stops transmitting, the secondary switch starts transmitting.
Monitor the Status of the clagd Service
Due to the critical nature of the clagd service, systemd continuously monitors the status of clagd. systemd monitors the clagd service through the use of notify messages every 30 seconds. If the clagd service dies or becomes unresponsive for any reason and systemd receives no messages after 60 seconds, systemd restarts clagd. systemd logs these failures in /var/log/syslog, and, on the first failure, generates a cl-support file as well.
This monitoring is automatically configured and enabled as long as clagd is enabled (that is, clagd-peer-ip and clagd-sys-mac are configured for an interface) and the clagd service is running. When clagd is explicitly stopped, for example with the systemctl stop clagd.service command, monitoring of clagd is also stopped.
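The timing described above (a notify message every 30 seconds, a restart when none arrives for 60 seconds) can be illustrated by scanning a list of notify timestamps for gaps. This sketch only models the rule; it does not query systemd, and find_notify_gaps is a hypothetical helper.

```python
def find_notify_gaps(notify_times, timeout=60):
    """Return (previous, next) timestamp pairs separated by more than timeout seconds.

    notify_times is a sorted list of times, in seconds, at which a
    notify message was seen. Each returned pair marks a window in
    which the watchdog described above would have triggered a restart.
    """
    return [
        (prev, nxt)
        for prev, nxt in zip(notify_times, notify_times[1:])
        if nxt - prev > timeout
    ]

if __name__ == "__main__":
    # Example timeline: messages every 30 seconds, then a 90-second silence.
    print(find_notify_gaps([0, 30, 60, 150, 180]))
```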
Check clagd Status
You can check the status of clagd monitoring by using the cl-service-summary command:
cumulus@switch:~$ sudo cl-service-summary summary
The systemctl daemon 5.4 uptime: 15m
...
Service clagd enabled active
...
Or the systemctl status command:
cumulus@switch:~$ sudo systemctl status clagd.service
● clagd.service - Cumulus Linux Multi-Chassis LACP Bonding Daemon
Loaded: loaded (/lib/systemd/system/clagd.service; enabled)
Active: active (running) since Mon 2016-10-03 20:31:50 UTC; 4 days ago
Docs: man:clagd(8)
Main PID: 1235 (clagd)
CGroup: /system.slice/clagd.service
├─1235 /usr/bin/python /usr/sbin/clagd --daemon 169.254.255.2 peerlink.4094 44:38:39:FF:40:90 --prior...
└─1307 /sbin/bridge monitor fdb
Feb 01 23:19:30 leaf01 clagd[1717]: Cleanup is executing.
Feb 01 23:19:31 leaf01 clagd[1717]: Cleanup is finished
Feb 01 23:19:31 leaf01 clagd[1717]: Beginning execution of clagd version 1.3.0
Feb 01 23:19:31 leaf01 clagd[1717]: Invoked with: /usr/sbin/clagd --daemon 169.254.255.2 peerlink.4094 44:38:39:FF:40:94 --pri...168.0.12
Feb 01 23:19:31 leaf01 clagd[1717]: Role is now secondary
Feb 01 23:19:31 leaf01 clagd[1717]: Initial config loaded
Feb 01 23:19:31 leaf01 systemd[1]: Started Cumulus Linux Multi-Chassis LACP Bonding Daemon.
Feb 01 23:24:31 leaf01 clagd[1717]: HealthCheck: reload timeout.
Feb 01 23:24:31 leaf01 clagd[1717]: Role is now primary; Reload timeout
Hint: Some lines were ellipsized, use -l to show in full.
MLAG Best Practices
For MLAG to function properly, you must configure the dual-connected host interfaces identically on the pair of peering switches. See the note above in the Configure MLAG section.
Otherwise, the MTU of MLAG traffic is determined by the bridge MTU, which in turn is the lowest MTU setting among the interfaces that are members of the bridge. To set an MTU other than the default of 1500 bytes, you must configure the MTU on each physical interface and bond interface that is a member of the MLAG bridges in the entire bridged domain.
For example, if an MTU of 9216 is desired through the MLAG domain in the example shown above, on all four leaf switches, configure mtu 9216 for each of the following bond interfaces, as they are members of the bridge named bridge: peerlink, uplink, server01.
cumulus@leaf01:~$ net add bond peerlink mtu 9216
cumulus@leaf01:~$ net add bond uplink mtu 9216
cumulus@leaf01:~$ net add bond server01 mtu 9216
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
The above commands produce the following configuration in the /etc/network/interfaces file:
auto bridge
iface bridge
bridge-ports peerlink uplink server01
auto peerlink
iface peerlink
mtu 9216
auto server01
iface server01
mtu 9216
auto uplink
iface uplink
mtu 9216
Likewise, to ensure the MTU 9216 path is respected through the spine switches above, also change the MTU setting for the bridge named bridge by configuring mtu 9216 for each of the following members of that bridge on both spine01 and spine02: leaf01-02, leaf03-04, exit01-02, peerlink.
cumulus@spine01:~$ net add bond leaf01-02 mtu 9216
cumulus@spine01:~$ net add bond leaf03-04 mtu 9216
cumulus@spine01:~$ net add bond exit01-02 mtu 9216
cumulus@spine01:~$ net add bond peerlink mtu 9216
cumulus@spine01:~$ net pending
cumulus@spine01:~$ net commit
The above commands produce the following configuration in the /etc/network/interfaces file:
auto bridge
iface bridge
bridge-ports leaf01-02 leaf03-04 exit01-02 peerlink
auto exit01-02
iface exit01-02
mtu 9216
auto leaf01-02
iface leaf01-02
mtu 9216
auto leaf03-04
iface leaf03-04
mtu 9216
auto peerlink
iface peerlink
mtu 9216
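Because the bridge MTU is the lowest MTU among its members, a single interface left at the default keeps the whole bridge at 1500 bytes. The following minimal sketch shows that rule; bridge_mtu and members_below are hypothetical helpers, and the member MTU values are example inputs rather than values read from a switch.

```python
def bridge_mtu(member_mtus):
    """Return the effective bridge MTU: the lowest MTU among its members."""
    return min(member_mtus.values())

def members_below(member_mtus, target):
    """Return the sorted names of members whose MTU is below the target."""
    return sorted(name for name, mtu in member_mtus.items() if mtu < target)

if __name__ == "__main__":
    # Example: server01 was missed when raising the MTU on leaf01.
    members = {"peerlink": 9216, "uplink": 9216, "server01": 1500}
    print("Effective bridge MTU:", bridge_mtu(members))
    print("Still below 9216:", members_below(members, 9216))
```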
Peer Link Sizing
The peer link carries very little traffic when compared to the bandwidth consumed by dataplane traffic. In a typical MLAG configuration, almost every connection between the two switches in the MLAG pair is dual-connected, so the only traffic going across the peer link is traffic from the clagd process and some LLDP or LACP traffic; the traffic received on the peer link is not forwarded out of the dual-connected bonds.
However, there are some instances where a host is connected to only one switch in the MLAG pair; for example:
You have a hardware limitation on the host where there is only one PCIe slot, and therefore, one NIC on the system, so the host is only single-connected across that interface.
The host does not support 802.3ad and you cannot create a bond on it.
You are accounting for a link failure, where the host may become single-connected until the failure is rectified.
In general, you need to determine how much bandwidth is traveling across the single-connected interfaces, and allocate half of that bandwidth to the peer link. We recommend half of the single-connected bandwidth because, on average, one half of the traffic destined to the single-connected host arrives on the switch directly connected to the single-connected host and the other half arrives on the switch that is not directly connected to the single-connected host. When this happens, only the traffic that arrives on the switch that is not directly connected to the single-connected host needs to traverse the peer link, which is how you calculate 50% of the traffic.
In addition, you might want to add extra links to the peer link bond to handle link failures in the peer link bond itself.
In the illustration below, each host has two 10G links, with each 10G link going to each switch in the MLAG pair. Each host has 20G of dual-connected bandwidth, so all three hosts have a total of 60G of dual-connected bandwidth. We recommend you allocate at least 15G of bandwidth to each peer link bond, which represents half of the single-connected bandwidth.
Scaling this example out to a full rack, when planning for link failures, you need only allocate enough bandwidth to meet your site’s strategy for handling failure scenarios. Imagine a full rack with 40 servers and two switches. You might plan for four to six servers to lose connectivity to a single switch and become single-connected before you respond to the event. So expanding upon our previous example, if you have 40 hosts each with 20G of bandwidth dual-connected to the MLAG pair, you might allocate 20G to 30G of bandwidth to the peer link, which accounts for half of the single-connected bandwidth for four to six hosts.
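The sizing rule above reduces to a small calculation: count the hosts that might be single-connected, multiply by the speed of one host link, and halve the result. The sketch below reproduces the 15G, 20G, and 30G figures from the text; the function names are hypothetical, and the peer_links_needed helper assumes that peer link members run at the same speed as the host links.

```python
import math

def peer_link_bandwidth(single_connected_hosts, host_link_gbps):
    """Return the recommended peer link bandwidth in Gbps.

    Half of the single-connected bandwidth, because on average only the
    traffic that arrives on the switch without the direct connection
    has to cross the peer link.
    """
    return single_connected_hosts * host_link_gbps / 2

def peer_links_needed(bandwidth_gbps, member_link_gbps):
    """Return how many bond members cover the bandwidth (assumes equal-speed members)."""
    return math.ceil(bandwidth_gbps / member_link_gbps)

if __name__ == "__main__":
    for hosts in (3, 4, 6):
        bw = peer_link_bandwidth(hosts, 10)
        links = peer_links_needed(bw, 10)
        print(f"{hosts} hosts: {bw:g}G on the peer link, {links} x 10G members")
```

This does not include the extra peer link members you might add to survive a failure inside the peer link bond itself.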
Failover Redundancy Scenarios
To get a better understanding of how STP and LACP behave in response to various failover redundancy scenarios, read this knowledge base article.
STP Interoperability with MLAG
Always enable STP in your layer 2 network.
With MLAG, enable BPDU guard on the host-facing bond interfaces. For more information about BPDU guard, see BPDU Guard and Bridge Assurance.
Run the net show <interface> spanning-tree command to display MLAG information useful for debugging:
cumulus@switch:~$ net show bridge spanning-tree
bridge:peerlink CIST info
enabled yes role Designated
port id 8.002 state forwarding
..............
bpdufilter port no
clag ISL yes clag ISL Oper UP yes
clag role primary clag dual conn mac 00:00:00:00:00:00
clag remote portID F.FFF clag system mac 44:38:39:FF:40:90
Best Practices for STP with MLAG
The STP global configuration must be the same on both peer switches.
The STP configuration for dual-connected ports should be the same on both peer switches.
To minimize convergence times when a link transitions to the forwarding state, configure the edge ports (for tagged and untagged frames) with PortAdminEdge and BPDU guard enabled.
The STP priority must be the same on both peer switches. You set the priority with this command:
cumulus@switch:~$ net add bridge stp treeprio PRIORITY_VALUE
cumulus@switch:~$ net commit
Use NCLU (net) commands for all spanning tree configurations, including bridge priority, path cost and so forth. Do not use brctl commands for spanning tree, except for brctl stp on/off, as changes are not reflected to mstpd and can create conflicts.
Troubleshooting
Here are some troubleshooting tips.
View the MLAG Log File
By default, when clagd is running, it logs its status to the /var/log/clagd.log file and syslog. Example log file output is below:
cumulus@spine01:~$ sudo tail /var/log/clagd.log
2016-10-03T20:31:50.471400+00:00 spine01 clagd[1235]: Initial config loaded
2016-10-03T20:31:52.479769+00:00 spine01 clagd[1235]: The peer switch is active.
2016-10-03T20:31:52.496490+00:00 spine01 clagd[1235]: Initial data sync to peer done.
2016-10-03T20:31:52.540186+00:00 spine01 clagd[1235]: Role is now primary; elected
2016-10-03T20:31:54.250572+00:00 spine01 clagd[1235]: HealthCheck: role via backup is primary
2016-10-03T20:31:54.252642+00:00 spine01 clagd[1235]: HealthCheck: backup active
2016-10-03T20:31:54.537967+00:00 spine01 clagd[1235]: Initial data sync from peer done.
2016-10-03T20:31:54.538435+00:00 spine01 clagd[1235]: Initial handshake done.
2016-10-03T20:31:58.527464+00:00 spine01 clagd[1235]: leaf03-04 is now dual connected.
2016-10-03T22:47:35.255317+00:00 spine01 clagd[1235]: leaf01-02 is now dual connected.
Large Packet Drops on the Peer Link Interface
A large volume of packet drops across one of the peer link interfaces can be expected. These drops serve to prevent looping of BUM (broadcast, unknown unicast, multicast) packets. When a packet is received across the peer link, if the destination lookup results in an egress interface that is a dual-connected bond, the switch does not forward the packet to prevent loops. This results in a drop being recorded on the peer link.
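The loop-prevention rule described above can be modeled as a filter on the egress port. This is a conceptual sketch only, not how switchd or the hardware implements it; the function name, port names, and drop counter are hypothetical.

```python
def forward_from_peerlink(egress_port, dual_connected_bonds):
    """Return True if a frame received on the peer link may leave via egress_port.

    Frames whose egress port is a dual-connected bond are not forwarded,
    because the peer switch already delivered them over its own bond member.
    """
    return egress_port not in dual_connected_bonds

if __name__ == "__main__":
    dual = {"server1", "server2"}   # bonds that both peers report as dual-connected
    drops = 0
    for port in ["server1", "server2", "server3"]:
        if not forward_from_peerlink(port, dual):
            drops += 1              # this is the drop recorded on the peer link
    print("Frames dropped on the peer link:", drops)
```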
You can detect this issue by running the net show counters or the ethtool -S <interface> command.
Using NCLU, the number of dropped packets is displayed in the RX_DRP column when you run net show counters.
When you run clagctl, you may see output like this:
bond01 bond01 52 duplicate lacp - partner mac
This occurs when you have multiple LACP bonds between the same two LACP endpoints - for example, an MLAG switch pair is one endpoint and an ESXi host is another. These bonds have duplicate LACP identifiers, which are MAC addresses. This same warning could be triggered when you have a cabling or configuration error.
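To find which bonds share an LACP partner MAC address, you can group bonds by partner MAC. The sketch below assumes that you have already collected the partner MAC for each bond by some other means; duplicate_partner_macs is a hypothetical helper, and the bond names and addresses are example data.

```python
from collections import defaultdict

def duplicate_partner_macs(bond_partner_mac):
    """Return {partner_mac: [bonds]} for partner MACs seen on more than one bond."""
    by_mac = defaultdict(list)
    for bond, mac in bond_partner_mac.items():
        by_mac[mac.lower()].append(bond)
    return {mac: sorted(bonds) for mac, bonds in by_mac.items() if len(bonds) > 1}

if __name__ == "__main__":
    observed = {
        "bond01": "00:25:90:b2:12:34",
        "bond02": "00:25:90:B2:12:34",  # same host, second bond
        "bond03": "00:25:90:b2:56:78",
    }
    print(duplicate_partner_macs(observed))
```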
Caveats and Errata
If both the backup and peer connectivity are lost within a 30-second window, the switch in the secondary role misinterprets the event sequence, believing the peer switch is down, so it takes over as the primary.
MLAG is disabled on the chassis, including the Facebook Backpack and EdgeCore OMP-800.
LACP Bypass
On Cumulus Linux, LACP Bypass is a feature that allows a bond configured in 802.3ad mode to become active and forward traffic even when there is no LACP partner. A typical use case for this feature is to enable a host, without the capability to run LACP, to PXE boot while connected to a switch on a bond configured in 802.3ad mode. Once the pre-boot process finishes and the host is capable of running LACP, the normal 802.3ad link aggregation operation takes over.
LACP Bypass All-active Mode
When a bond has multiple slave interfaces, each bond slave interface
operates as an active link while the bond is in bypass mode. This is
known as all-active mode. This is useful during PXE boot of a server
with multiple NICs, when the user cannot determine beforehand which port
needs to be active.
Keep in mind the following caveats with all-active mode:
All-active mode is not supported on bonds that are not specified as bridge ports on the switch. To work around this limitation, do one of the following:
Configure the layer 3 interface on the physical link instead of using a bond
Configure the LACP bond on the switch port so that the AS has neighbor LACP information
Configure the bond interface as balance-xor mode instead of LACP
Spanning tree protocol (STP) does not run on the individual bond slave interfaces when the LACP bond is in all-active mode. Therefore, only use all-active mode on host-facing LACP bonds. Consider configuring STP BPDU guard together with all-active mode.
The following features are not supported:
priority mode
bond-lacp-bypass-period
bond-lacp-bypass-priority
bond-lacp-bypass-all-active
In an MLAG deployment where bond slaves of a host are connected to two switches and the bond is in all-active mode, all the slaves of the bond are active on both the primary and secondary MLAG nodes.
Configure LACP Bypass
To enable LACP bypass on the host-facing bond, configure bond-lacp-bypass-allow
using NCLU. The following commands create a VLAN-aware bridge with LACP bypass
enabled:
cumulus@switch:~$ net add bond bond1 bond slaves swp51s2,swp51s3
cumulus@switch:~$ net add bond bond1 clag id 1
cumulus@switch:~$ net add bond bond1 bond lacp-bypass-allow
cumulus@switch:~$ net add bond bond1 stp bpduguard
cumulus@switch:~$ net add bridge bridge ports bond1,bond2,bond3,bond4,peer5
cumulus@switch:~$ net add bridge bridge vids 100-105
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
clag-id is not a required parameter in the configuration shown above. While LACP bypass is often configured on bonds involved in MLAG, MLAG is not required to use LACP bypass.
These commands create the following stanzas in
/etc/network/interfaces:
Cumulus Linux provides the option of using Virtual Router Redundancy
(VRR) or Virtual Router Redundancy Protocol (VRRP).
VRR enables hosts to communicate with any redundant router without reconfiguration, by running dynamic router protocols or router redundancy protocols. Redundant routers respond to Address Resolution Protocol (ARP) requests from hosts. Routers are configured to respond in an identical manner, but if one fails, the other redundant routers continue to respond, leaving the hosts with the impression that nothing has changed. VRR is typically used in an MLAG configuration.
Use VRR when you have multiple devices connected to a single logical connection, such as an MLAG bond. A device connected to an MLAG bond believes there is a single device on the other end of the bond and only forwards one copy of the transit frames. If this frame is destined to the virtual MAC address and you are running VRRP, it is possible that the frame is sent to the link connected to the VRRP standby device, which will not forward the frame appropriately. Having the virtual MAC active on both MLAG devices ensures that either MLAG device handles the frame it receives correctly.
VRRP allows a single virtual default gateway to be shared between two or more network devices in an active/standby configuration. The physical VRRP router that forwards packets at any given time is called the master. If this VRRP router fails, another VRRP standby router automatically takes over as master. VRRP is used in a non-MLAG configuration.
Use VRRP when you have multiple distinct devices that connect to a layer 2 segment through multiple logical connections (not through a single bond). VRRP elects a single active forwarder that owns the virtual MAC address while it is active. This prevents the forwarding database of the layer 2 domain from continuously updating in response to MAC flaps as frames sourced from the virtual MAC address are received from discrete logical connections.
VRRP is supported in Cumulus Linux 3.7.4 and later.
You cannot configure both VRR and VRRP on the same switch.
VRR
The diagram below illustrates a basic VRR-enabled network configuration. The network includes several hosts and two routers running Cumulus Linux configured with Multi-chassis Link Aggregation (MLAG).
Cumulus Linux only supports VRR on switched virtual interfaces (SVIs). VRR is not supported on physical interfaces or virtual subinterfaces.
A production implementation has many more server hosts and network connections than are shown here. However, this basic configuration provides a complete description of the important aspects of the VRR setup.
As the bridges in each of the redundant routers are connected, they each receive and reply to ARP requests for the virtual router IP address.
Each ARP request made by a host receives replies from each router; these replies are identical, and the host receiving the replies either ignores replies after the first, or accepts them and overwrites the previous identical reply.
A range of MAC addresses is reserved for use with VRR to prevent MAC address conflicts with other interfaces in the same bridged network. The reserved range is 00:00:5E:00:01:00 to 00:00:5E:00:01:FF. Use MAC addresses from the reserved range when configuring VRR.
The reserved MAC address range for VRR is the same as for the Virtual Router Redundancy Protocol (VRRP), as they serve similar purposes.
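As a minimal sketch, the reserved range can be checked before a MAC address is used in an address-virtual line. The helper names below (`mac_to_int`, `in_vrr_reserved_range`) are hypothetical and are not part of Cumulus Linux or NCLU; the only values taken from this guide are the two range boundaries.

```python
def mac_to_int(mac: str) -> int:
    """Convert a colon-separated MAC address (any letter case) to an integer."""
    octets = mac.split(":")
    if len(octets) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    value = 0
    for octet in octets:
        byte = int(octet, 16)
        if not 0 <= byte <= 0xFF:
            raise ValueError(f"octet out of range in {mac!r}")
        value = (value << 8) | byte
    return value


# Boundaries of the range reserved for VRR, as stated in this guide.
VRR_RANGE_START = mac_to_int("00:00:5E:00:01:00")
VRR_RANGE_END = mac_to_int("00:00:5E:00:01:FF")


def in_vrr_reserved_range(mac: str) -> bool:
    """Return True if the MAC address lies inside the reserved VRR range."""
    return VRR_RANGE_START <= mac_to_int(mac) <= VRR_RANGE_END


if __name__ == "__main__":
    for candidate in ("00:00:5e:00:01:01", "44:38:39:FF:40:90"):
        print(candidate, in_vrr_reserved_range(candidate))
```

The comparison is done on integers so that upper-case and lower-case spellings of the same address are treated identically.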
Configure VRR
The following procedures describe how to configure routers and hosts to use VRR.
Configure the Routers
The routers implement the layer 2 network interconnecting the hosts and the redundant routers. To configure the routers, add a bridge with the following interfaces to each router:
One bond interface or switch port interface to each host. For networks using MLAG, use bond interfaces; otherwise, use switch port interfaces.
One or more interfaces to each peer router. Multiple inter-peer links are typically bonded interfaces that accommodate higher bandwidth between the routers and offer link redundancy.
The VLAN interface must have unique IP addresses for both the physical (the address option below) and virtual (the address-virtual option below) interfaces, as the unique address is used when the switch initiates an ARP request.
Example VRR Configuration
The example NCLU commands below create a VLAN-aware bridge interface for a VRR-enabled network:
cumulus@switch:~$ net add bridge
cumulus@switch:~$ net add vlan 500 ip address 192.0.2.252/24
cumulus@switch:~$ net add vlan 500 ip address-virtual 00:00:5e:00:01:00 192.0.2.254/24
cumulus@switch:~$ net add vlan 500 ipv6 address 2001:db8::1/32
cumulus@switch:~$ net add vlan 500 ipv6 address-virtual 00:00:5e:00:01:00 2001:db8::f/32
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
The NCLU commands above produce the following /etc/network/interfaces snippet:
Configure the Hosts
Each host must have two network interfaces. The routers configure the interfaces as bonds running LACP; the hosts must also configure the two interfaces using teaming, port aggregation, port group, or EtherChannel running LACP. Configure the hosts, either statically or via DHCP, with a gateway address that is the IP address of the virtual router; this default gateway address never changes.
Configure the links between the hosts and the routers in active-active mode for First Hop Redundancy Protocol.
Example VRR Configuration with MLAG
To create an MLAG configuration that incorporates VRR, use a configuration similar to the following.
The following examples use a single virtual MAC address for all VLANs. You can add a unique MAC address for each VLAN, but this is not necessary.
leaf01 Configuration
cumulus@leaf01:~$ net add interface eth0 ip address 192.168.0.21/24
cumulus@leaf01:~$ net add bond server01 bond slaves swp1
cumulus@leaf01:~$ net add bond server01 clag id 1
cumulus@leaf01:~$ net add bond server01 mtu 9216
cumulus@leaf01:~$ net add bond server01 alias LACP etherchannel to uplink on server01
cumulus@leaf01:~$ net add bond peerlink bond slaves swp49-50
cumulus@leaf01:~$ net add interface peerlink.4094 peerlink.4094
cumulus@leaf01:~$ net add interface peerlink.4094 ip address 169.254.255.1/30
cumulus@leaf01:~$ net add interface peerlink.4094 clag peer-ip 169.254.255.2
cumulus@leaf01:~$ net add interface peerlink.4094 clag backup-ip 192.168.0.22
cumulus@leaf01:~$ net add interface peerlink.4094 clag sys-mac 44:38:39:FF:40:90
cumulus@leaf01:~$ net add bridge bridge ports server01,peerlink
cumulus@leaf01:~$ net add bridge stp treeprio 4096
cumulus@leaf01:~$ net add vlan 100 ip address 10.0.1.2/24
cumulus@leaf01:~$ net add vlan 100 ip address-virtual 00:00:5E:00:01:01 10.0.1.1/24
cumulus@leaf01:~$ net add vlan 200 ip address 10.0.2.2/24
cumulus@leaf01:~$ net add vlan 200 ip address-virtual 00:00:5E:00:01:01 10.0.2.1/24
cumulus@leaf01:~$ net add vlan 300 ip address 10.0.3.2/24
cumulus@leaf01:~$ net add vlan 300 ip address-virtual 00:00:5E:00:01:01 10.0.3.1/24
cumulus@leaf01:~$ net add vlan 400 ip address 10.0.4.2/24
cumulus@leaf01:~$ net add vlan 400 ip address-virtual 00:00:5E:00:01:01 10.0.4.1/24
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
These commands create the following configuration in the /etc/network/interfaces file:
auto eth0
iface eth0
address 192.168.0.21/24
auto bridge
iface bridge
bridge-ports server01 peerlink
bridge-vids 100 200 300 400
bridge-vlan-aware yes
mstpctl-treeprio 4096
auto server01
iface server01
alias LACP etherchannel to uplink on server01
bond-slaves swp1
clag-id 1
mtu 9216
auto peerlink
iface peerlink
bond-slaves swp49 swp50
auto peerlink.4094
iface peerlink.4094
address 169.254.255.1/30
clagd-backup-ip 192.168.0.22
clagd-peer-ip 169.254.255.2
clagd-sys-mac 44:38:39:FF:40:90
auto vlan100
iface vlan100
address 10.0.1.2/24
address-virtual 00:00:5E:00:01:01 10.0.1.1/24
vlan-id 100
vlan-raw-device bridge
auto vlan200
iface vlan200
address 10.0.2.2/24
address-virtual 00:00:5E:00:01:01 10.0.2.1/24
vlan-id 200
vlan-raw-device bridge
auto vlan300
iface vlan300
address 10.0.3.2/24
address-virtual 00:00:5E:00:01:01 10.0.3.1/24
vlan-id 300
vlan-raw-device bridge
auto vlan400
iface vlan400
address 10.0.4.2/24
address-virtual 00:00:5E:00:01:01 10.0.4.1/24
vlan-id 400
vlan-raw-device bridge
leaf02 Configuration
cumulus@leaf02:~$ net add interface eth0 ip address 192.168.0.22/24
cumulus@leaf02:~$ net add bond server01 bond slaves swp1
cumulus@leaf02:~$ net add bond server01 clag id 1
cumulus@leaf02:~$ net add bond server01 mtu 9216
cumulus@leaf02:~$ net add bond server01 alias LACP etherchannel to uplink on server01
cumulus@leaf02:~$ net add bond peerlink bond slaves swp49-50
cumulus@leaf02:~$ net add interface peerlink.4094 peerlink.4094
cumulus@leaf02:~$ net add interface peerlink.4094 ip address 169.254.255.2/30
cumulus@leaf02:~$ net add interface peerlink.4094 clag peer-ip 169.254.255.1
cumulus@leaf02:~$ net add interface peerlink.4094 clag backup-ip 192.168.0.21
cumulus@leaf02:~$ net add interface peerlink.4094 clag sys-mac 44:38:39:FF:40:90
cumulus@leaf02:~$ net add bridge bridge ports server01,peerlink
cumulus@leaf02:~$ net add bridge stp treeprio 4096
cumulus@leaf02:~$ net add vlan 100 ip address 10.0.1.3/24
cumulus@leaf02:~$ net add vlan 100 ip address-virtual 00:00:5E:00:01:01 10.0.1.1/24
cumulus@leaf02:~$ net add vlan 200 ip address 10.0.2.3/24
cumulus@leaf02:~$ net add vlan 200 ip address-virtual 00:00:5E:00:01:01 10.0.2.1/24
cumulus@leaf02:~$ net add vlan 300 ip address 10.0.3.3/24
cumulus@leaf02:~$ net add vlan 300 ip address-virtual 00:00:5E:00:01:01 10.0.3.1/24
cumulus@leaf02:~$ net add vlan 400 ip address 10.0.4.3/24
cumulus@leaf02:~$ net add vlan 400 ip address-virtual 00:00:5E:00:01:01 10.0.4.1/24
cumulus@leaf02:~$ net pending
cumulus@leaf02:~$ net commit
These commands create the following configuration in the /etc/network/interfaces file:
auto eth0
iface eth0
address 192.168.0.22/24
auto bridge
iface bridge
bridge-ports server01 peerlink
bridge-vids 100 200 300 400
bridge-vlan-aware yes
mstpctl-treeprio 4096
auto server01
iface server01
alias LACP etherchannel to uplink on server01
bond-slaves swp1
clag-id 1
mtu 9216
auto peerlink
iface peerlink
bond-slaves swp49 swp50
auto peerlink.4094
iface peerlink.4094
address 169.254.255.2/30
clagd-backup-ip 192.168.0.21
clagd-peer-ip 169.254.255.1
clagd-sys-mac 44:38:39:FF:40:90
auto vlan100
iface vlan100
address 10.0.1.3/24
address-virtual 00:00:5E:00:01:01 10.0.1.1/24
vlan-id 100
vlan-raw-device bridge
auto vlan200
iface vlan200
address 10.0.2.3/24
address-virtual 00:00:5E:00:01:01 10.0.2.1/24
vlan-id 200
vlan-raw-device bridge
auto vlan300
iface vlan300
address 10.0.3.3/24
address-virtual 00:00:5E:00:01:01 10.0.3.1/24
vlan-id 300
vlan-raw-device bridge
auto vlan400
iface vlan400
address 10.0.4.3/24
address-virtual 00:00:5E:00:01:01 10.0.4.1/24
vlan-id 400
vlan-raw-device bridge
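The leaf01 and leaf02 configurations above differ only in the physical SVI address; the virtual MAC and virtual IP address are identical on both switches. The sketch below regenerates the `net add vlan` lines to make that pattern explicit. The function name `vrr_svi_commands` and the `host_octet` parameter are hypothetical, and the 10.0.N.0/24 numbering is simply the scheme used in this example, not a requirement.

```python
# Shared virtual MAC address used for every VLAN in the example above.
VIRTUAL_MAC = "00:00:5E:00:01:01"


def vrr_svi_commands(vlans, host_octet):
    """Build the NCLU lines for VRR-enabled SVIs on one MLAG peer.

    vlans      -- VLAN IDs in order; the Nth VLAN uses subnet 10.0.N.0/24
    host_octet -- last octet of this switch's own (physical) SVI address
    """
    commands = []
    for index, vlan in enumerate(vlans, start=1):
        subnet = f"10.0.{index}"
        # Unique per switch: used when the switch itself sources traffic.
        commands.append(f"net add vlan {vlan} ip address {subnet}.{host_octet}/24")
        # Identical on both peers: the gateway address the hosts point to.
        commands.append(
            f"net add vlan {vlan} ip address-virtual {VIRTUAL_MAC} {subnet}.1/24"
        )
    return commands


if __name__ == "__main__":
    for line in vrr_svi_commands([100, 200, 300, 400], host_octet=2):
        print(line)
```

Calling the function with `host_octet=3` yields the leaf02 lines; only the `ip address` commands change.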
server01 Configuration
Create a configuration similar to the following on an Ubuntu host:
auto eth0
iface eth0 inet dhcp
auto eth1
iface eth1 inet manual
bond-master uplink
auto eth2
iface eth2 inet manual
bond-master uplink
auto uplink
iface uplink inet static
bond-slaves eth1 eth2
bond-mode 802.3ad
bond-miimon 100
bond-lacp-rate 1
bond-min-links 1
bond-xmit-hash-policy layer3+4
address 172.16.1.101
netmask 255.255.255.0
post-up ip route add 172.16.0.0/16 via 172.16.1.1
post-up ip route add 10.0.0.0/8 via 172.16.1.1
auto uplink:200
iface uplink:200 inet static
address 10.0.2.101
auto uplink:300
iface uplink:300 inet static
address 10.0.3.101
auto uplink:400
iface uplink:400 inet static
address 10.0.4.101
# modprobe bonding
server02 Configuration
Create a configuration similar to the following on an Ubuntu host:
auto eth0
iface eth0 inet dhcp
auto eth1
iface eth1 inet manual
bond-master uplink
auto eth2
iface eth2 inet manual
bond-master uplink
auto uplink
iface uplink inet static
bond-slaves eth1 eth2
bond-mode 802.3ad
bond-miimon 100
bond-lacp-rate 1
bond-min-links 1
bond-xmit-hash-policy layer3+4
address 172.16.1.101
netmask 255.255.255.0
post-up ip route add 172.16.0.0/16 via 172.16.1.1
post-up ip route add 10.0.0.0/8 via 172.16.1.1
auto uplink:200
iface uplink:200 inet static
address 10.0.2.101
auto uplink:300
iface uplink:300 inet static
address 10.0.3.101
auto uplink:400
iface uplink:400 inet static
address 10.0.4.101
# modprobe bonding
VRRP
VRRP allows for a single virtual default gateway to be shared between two or more network devices in an active/standby configuration. The VRRP router that forwards packets at any given time is called the master. If this VRRP router fails, another VRRP standby router automatically takes over as master. The master sends VRRP advertisements to other VRRP routers in the same virtual router group, which include the priority and state of the master. VRRP router priority determines the role that each virtual router plays and who becomes the new master if the master fails.
All virtual routers use 00:00:5E:00:01:XX for IPv4 gateways or 00:00:5E:00:02:XX for IPv6 gateways as their MAC address. The last byte of the address is the Virtual Router IDentifier (VRID), which is different for each virtual router in the network. This MAC address is used by only one physical router at a time, which replies with this address when ARP requests or neighbor solicitation packets are sent for the IP addresses of the virtual router.
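The relationship between the VRID and the virtual MAC address can be sketched as follows. The function name `vrrp_virtual_mac` is hypothetical, and the accepted VRID range of 1 to 255 is an assumption based on the VRID occupying a single byte.

```python
def vrrp_virtual_mac(vrid: int, ipv6: bool = False) -> str:
    """Return the VRRP virtual MAC address for a virtual router ID.

    The fifth byte selects the address family (01 for IPv4, 02 for IPv6)
    and the last byte is the VRID itself.
    """
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must fit in one byte (1-255)")
    family_octet = 0x02 if ipv6 else 0x01
    return f"00:00:5e:00:{family_octet:02x}:{vrid:02x}"


if __name__ == "__main__":
    print(vrrp_virtual_mac(1))             # 00:00:5e:00:01:01
    print(vrrp_virtual_mac(1, ipv6=True))  # 00:00:5e:00:02:01
```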
VRRP is supported in Cumulus Linux 3.7.4 and later.
Cumulus Linux supports both VRRPv2 and VRRPv3. The default protocol version is VRRPv3.
255 virtual routers are supported per switch.
VRRP is not supported in an MLAG environment.
To configure VRRP on an SVI or traditional mode bridge, you need to edit the /etc/network/interfaces and /etc/frr/frr.conf files. The NCLU commands are not supported with SVIs or traditional mode bridges.
You cannot use VRRP in an EVPN configuration; use MLAG and VRR instead.
The following example illustrates a basic VRRP configuration.
Configure VRRP
To configure VRRP, specify the following information on each switch:
A virtual router ID (VRID) that identifies the group of VRRP routers. You must specify the same ID across all virtual routers in the group.
One or more virtual IP addresses that are assigned to the virtual router group. These are IP addresses that do not directly connect to a specific interface. Inbound packets sent to a virtual IP address are redirected to a physical network interface.
You can also set these optional parameters. If you do not set these parameters, the defaults are used:
| Optional Parameter | Default Value | Description |
| --- | --- | --- |
| priority | 100 | The priority level of the virtual router within the virtual router group, which determines the role that each virtual router plays and what happens if the master fails. Virtual routers have a priority between 1 and 254; the router with the highest priority becomes the master. |
| advertisement interval | 1000 milliseconds | The interval between successive advertisements by the master in a virtual router group. You can specify a value between 10 and 40950 milliseconds. |
| preempt | enabled | Preempt mode lets the router take over as master for a virtual router group if it has a higher priority than the current master. Preempt mode is enabled by default. To disable preempt mode, you need to edit the /etc/frr/frr.conf file and add the line no vrrp <VRID> preempt to the interface stanza, then restart the FRR service. |
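The defaults and value ranges for the optional parameters can be modeled in a few lines. This is an illustrative sketch only: the `VrrpRouter` class and `elect_master` function are hypothetical, and the election considers priority alone (it does not model tie-breaking between equal priorities or the preempt setting).

```python
from dataclasses import dataclass


@dataclass
class VrrpRouter:
    """One member of a virtual router group, with the documented defaults."""
    name: str
    priority: int = 100
    advertisement_interval_ms: int = 1000
    preempt: bool = True

    def __post_init__(self):
        # Ranges taken from the optional parameter descriptions above.
        if not 1 <= self.priority <= 254:
            raise ValueError("priority must be between 1 and 254")
        if not 10 <= self.advertisement_interval_ms <= 40950:
            raise ValueError("advertisement interval must be between 10 and 40950 ms")


def elect_master(routers):
    """Return the router with the highest priority (ties are not modeled)."""
    return max(routers, key=lambda router: router.priority)


if __name__ == "__main__":
    group = [
        VrrpRouter("spine01", priority=254, advertisement_interval_ms=5000),
        VrrpRouter("spine02"),  # all defaults: priority 100
    ]
    print(elect_master(group).name)  # spine01
```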
The NCLU commands write VRRP configuration to the /etc/network/interfaces file and the /etc/frr/frr.conf file.
When you commit a change that configures a new routing service such as VRRP, the FRR daemon restarts and might interrupt network operations for other configured routing services.
The following example commands configure two switches (spine01 and spine02) that form one virtual router group (VRID 44) with IPv4 address 10.0.0.1/24 and IPv6 address 2001:0db8::1/64. spine01 is the master; it has a priority of 254. spine02 is the backup VRRP router.
spine01
cumulus@spine01:~$ net add interface swp1 vrrp 44 10.0.0.1/24
cumulus@spine01:~$ net add interface swp1 vrrp 44 2001:0db8::1/64
cumulus@spine01:~$ net add interface swp1 vrrp 44 priority 254
cumulus@spine01:~$ net add interface swp1 vrrp 44 advertisement-interval 5000
cumulus@spine01:~$ net pending
cumulus@spine01:~$ net commit
spine02
cumulus@spine02:~$ net add interface swp1 vrrp 44 10.0.0.1/24
cumulus@spine02:~$ net add interface swp1 vrrp 44 2001:0db8::1/64
cumulus@spine02:~$ net pending
cumulus@spine02:~$ net commit
The NCLU commands save the configuration in the /etc/frr/frr.conf
file. For example:
To show virtual router information on a switch, run the net show vrrp <VRID> command. For example:
cumulus@spine01:~$ net show vrrp 44
Virtual Router ID 44
Protocol Version 3
Autoconfigured No
Shutdown No
Interface swp1
VRRP interface (v4) vrrp4-3-1
VRRP interface (v6) vrrp6-3-1
Primary IP (v4)
Primary IP (v6) fe80::54df:e543:5c12:7762
Virtual MAC (v4) 00:00:5e:00:01:01
Virtual MAC (v6) 00:00:5e:00:02:01
Status (v4) Master
Status (v6) Master
Priority 254
Effective Priority (v4) 254
Effective Priority (v6) 254
Preempt Mode Yes
Accept Mode Yes
Advertisement Interval 5000 ms
Master Advertisement Interval (v4) 0 ms
Master Advertisement Interval (v6) 5000 ms
Advertisements Tx (v4) 17
Advertisements Tx (v6) 17
Advertisements Rx (v4) 0
Advertisements Rx (v6) 0
Gratuitous ARP Tx (v4) 1
Neigh. Adverts Tx (v6) 1
State transitions (v4) 2
State transitions (v6) 2
Skew Time (v4) 0 ms
Skew Time (v6) 0 ms
Master Down Interval (v4) 0 ms
Master Down Interval (v6) 0 ms
IPv4 Addresses 1
. . . . . . . . . . . . . . . . . . 10.0.0.1
IPv6 Addresses 1
. . . . . . . . . . . . . . . . . . 2001:0db8::1
IGMP and MLD Snooping
IGMP (Internet Group Management Protocol) and MLD (Multicast Listener
Discovery) snooping are implemented in the bridge driver of the Cumulus
Linux kernel and are enabled by default. IGMP snooping processes IGMP
v1, v2, and v3 reports received on a bridge port in a bridge to identify the
hosts that want to receive multicast traffic destined to a given group.
In Cumulus Linux 3.7.4 and later, IGMP and MLD snooping is supported
over VXLAN bridges; however, this feature is not enabled by default.
To enable IGMP and MLD over VXLAN, see Configure IGMP/MLD Snooping over VXLAN.
When an IGMPv2 leave message is received, a group specific query is sent
to identify if there are any other hosts interested in that group,
before the group is deleted.
An IGMP query message received on a port is used to identify the port
that is connected to a router and is interested in receiving multicast
traffic.
MLD snooping processes MLD v1/v2 reports, queries and v1 done messages
for IPv6 groups. If IGMP or MLD snooping is disabled, multicast traffic
gets flooded to all the bridge ports in the bridge. Similarly, in the
absence of receivers in a VLAN, multicast traffic would be flooded to
all ports in the VLAN. The multicast group IP address is mapped to a
multicast MAC address and a forwarding entry is created with a list of
ports interested in receiving multicast traffic destined to that group.
Configure IGMP/MLD Snooping over VXLAN
On Broadcom switches, Cumulus Linux 3.7.4 and later supports IGMP/MLD snooping over VXLAN bridges, where VXLAN ports are set as router ports. On Mellanox Spectrum switches, IGMP/MLD snooping over VXLAN bridges is supported in Cumulus Linux 3.7.9 and later.
To enable IGMP/MLD snooping over VXLAN, run the net add bridge <bridge> mcsnoop yes command:
cumulus@switch:~$ net add bridge mybridge mcsnoop yes
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
To disable IGMP/MLD snooping over VXLAN, run the net add bridge <bridge> mcsnoop no command.
Additional Configuration for Spectrum Switches
In Cumulus Linux 3.7.13 and earlier, in addition to enabling IGMP/MLD snooping over VXLAN, you need to perform an additional configuration step, described below. This additional configuration step is not required for Cumulus Linux 3.7.14 and later.
For Spectrum switches, the IGMP reports received over VXLAN from remote hosts are not forwarded to the kernel, which, in certain cases, might result in local receivers not responding to the IGMP query. To work around this issue, you need to apply certain ACL rules to avoid the IGMP report packets being sent across to the hosts:
Add the following lines to the /etc/cumulus/acl/policy.d/23_acl_test.rules file (where <swp> is the port connected to the access host), then run the cl-acltool -i command:
[ebtables]
-A FORWARD -p IPv4 -o #<swp> --ip-proto igmp -j ACCEPT --ip-destination 224.0.0.0/24
-A FORWARD -p IPv4 -o #<swp> --ip-proto igmp -j DROP
DIP-based Multicast Forwarding
DIP-based multicast forwarding is supported on Broadcom switches only.
Cumulus Linux 3.7.10 and earlier performs layer 2 multicast bridging using the destination MAC address (DMAC) of the packet, which is programmed in the layer 2 table of the ASIC. Cumulus Linux 3.7.11 and later provides the option of using IP-based layer 2 multicast forwarding (DIP), where layer 2 multicast packets are forwarded based on the layer 3 forwarding table, using the VLAN as the key.
DIP-based multicast forwarding is a good solution if you want to have a separate bridge domain and multicast flood domain for two groups that map to the same MAC address. In multicast, multiple group addresses can map to the same MAC address because the MAC address is derived from only the low-order 23 bits of the group address; across the IPv4 multicast range, 32 group addresses share each MAC address.
DIP-based multicast forwarding is also a good solution if you use a group whose MAC address falls into the link-local range (for example, 228.0.0.1, which maps to the same MAC address as 224.0.0.1); such a group is not forwarded with DMAC-based multicast forwarding.
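To see why two different groups can collide on one MAC address, the sketch below applies the standard IPv4 multicast mapping, in which the low-order 23 bits of the group address are copied into a MAC address that begins with 01:00:5e. The function name `multicast_mac` is hypothetical; the mapping itself is the general IPv4 convention rather than anything specific to Cumulus Linux.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group address to its layer 2 multicast MAC."""
    address = ipaddress.IPv4Address(group)
    if not address.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    # Keep only the low-order 23 bits of the group address.
    low_23_bits = int(address) & 0x7FFFFF
    mac = 0x01005E000000 | low_23_bits
    # Emit the six bytes from most significant to least significant.
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -1, -8))


if __name__ == "__main__":
    # Both groups lose their differing high-order bits and collide.
    print(multicast_mac("224.0.0.1"))  # 01:00:5e:00:00:01
    print(multicast_mac("228.0.0.1"))  # 01:00:5e:00:00:01
```

Because 228.0.0.1 and 224.0.0.1 produce the same MAC address, a DMAC-keyed lookup cannot tell them apart, whereas a lookup keyed on the destination IP address and VLAN can.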
DIP-based multicast forwarding is not supported with IGMP Snooping over VXLAN or with IPv6 addresses (DMAC-based forwarding is used for IPv6 addresses).
To enable DIP-based multicast forwarding:
Edit the /etc/cumulus/switchd.conf file to uncomment the bridge.dip_based_l2multicast line and set the field to TRUE, then restart the switchd service.
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
The following example shows that the bridge.dip_based_l2multicast field is set to TRUE and the line is uncommented in the /etc/cumulus/switchd.conf file:
cumulus@switch:~$ sudo nano /etc/cumulus/switchd.conf
...
# configure IP based forwarding for L2 Multicast
bridge.dip_based_l2multicast = TRUE
...
Configure IGMP/MLD Querier
If no multicast router is sending queries, you can configure an IGMP/MLD
querier on the switch by adding a configuration similar to the following in
/etc/network/interfaces. To enable the querier for a bridge,
set bridge-mcquerier to 1 in the bridge stanza. By default, the
source IP address of IGMP queries is 0.0.0.0. To set the source IP
address of the queries to be the bridge IP address, configure
bridge-mcqifaddr 1.
For an explanation of the relevant parameters, see the
ifupdown-addons-interfaces man page.
For a VLAN-aware bridge, like bridge in the above example, to enable
querier functionality for VLAN 100 in the bridge, set bridge-mcquerier
to 1 in the bridge stanza and set bridge-igmp-querier-src to
123.1.1.1 in the bridge.100 stanza.
You can specify a range of VLANs as well. For example:
auto bridge.[1-200]
vlan bridge.[1-200]
bridge-igmp-querier-src 123.1.1.1
For a bridge in traditional mode, use a
configuration like the following:
auto br0
iface br0
address 192.0.2.10/24
bridge-ports swp1 swp2 swp3
bridge-vlan-aware no
bridge-mcquerier 1
bridge-mcqifaddr 1
Disable IGMP and MLD Snooping
To disable IGMP and MLD snooping, set the bridge-mcsnoop value to 0.
The example NCLU commands below disable IGMP and MLD snooping on the
bridge named bridge:
cumulus@switch:~$ net add bridge bridge mcsnoop no
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
The commands above add the bridge-mcsnoop line to the following
example bridge in /etc/network/interfaces:
To show the IGMP/MLD snooping bridge state, run brctl showstp <bridge>:
cumulus@switch:~$ sudo brctl showstp bridge
bridge
bridge id 8000.7072cf8c272c
designated root 8000.7072cf8c272c
root port 0 path cost 0
max age 20.00 bridge max age 20.00
hello time 2.00 bridge hello time 2.00
forward delay 15.00 bridge forward delay 15.00
ageing time 300.00
hello timer 0.00 tcn timer 0.00
topology change timer 0.00 gc timer 263.70
hash elasticity 4096 hash max 4096
mc last member count 2 mc init query count 2
mc router 1 mc snooping 1
mc last member timer 1.00 mc membership timer 260.00
mc querier timer 255.00 mc query interval 125.00
mc response interval 10.00 mc init query interval 31.25
mc querier 0 mc query ifaddr 0
flags
swp1 (1)
port id 8001 state forwarding
designated root 8000.7072cf8c272c path cost 2
designated bridge 8000.7072cf8c272c message age timer 0.00
designated port 8001 forward delay timer 0.00
designated cost 0 hold timer 0.00
mc router 1 mc fast leave 0
flags
swp2 (2)
port id 8002 state forwarding
designated root 8000.7072cf8c272c path cost 2
designated bridge 8000.7072cf8c272c message age timer 0.00
designated port 8002 forward delay timer 0.00
designated cost 0 hold timer 0.00
mc router 1 mc fast leave 0
flags
swp3 (3)
port id 8003 state forwarding
designated root 8000.7072cf8c272c path cost 2
designated bridge 8000.7072cf8c272c message age timer 0.00
designated port 8003 forward delay timer 8.98
designated cost 0 hold timer 0.00
mc router 1 mc fast leave 0
flags
To show the groups and bridge port state, run the NCLU net show bridge mdb command or the Linux bridge mdb show
command. To show detailed router ports and group information, run the bridge -d -s mdb show command:
cumulus@switch:~$ sudo bridge -d -s mdb show
dev bridge port swp2 grp 234.10.10.10 temp 241.67
dev bridge port swp1 grp 238.39.20.86 permanent 0.00
dev bridge port swp1 grp 234.1.1.1 temp 235.43
dev bridge port swp2 grp ff1a::9 permanent 0.00
router ports on bridge: swp3
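If you want to post-process this output in a script, a small parser is enough for the line layout shown above. The sketch below assumes exactly that layout (`dev <bridge> port <port> grp <group> <state> <timer>` and a trailing `router ports on <bridge>:` line); the function name `parse_mdb` is hypothetical, and the output of the bridge command can differ between versions, so treat the field positions as assumptions.

```python
def parse_mdb(output: str):
    """Parse 'bridge -d -s mdb show' text into group entries and router ports."""
    entries = []
    router_ports = {}
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[:3] == ["router", "ports", "on"] and len(fields) >= 4:
            # Example: "router ports on bridge: swp3"
            bridge = fields[3].rstrip(":")
            router_ports[bridge] = fields[4:]
        elif fields[0] == "dev" and len(fields) >= 8:
            # Example: "dev bridge port swp2 grp 234.10.10.10 temp 241.67"
            entries.append({
                "bridge": fields[1],
                "port": fields[3],
                "group": fields[5],
                "state": fields[6],
                "timer": float(fields[7]),
            })
    return entries, router_ports


if __name__ == "__main__":
    sample = """\
dev bridge port swp2 grp 234.10.10.10 temp 241.67
dev bridge port swp1 grp 238.39.20.86 permanent 0.00
router ports on bridge: swp3
"""
    groups, routers = parse_mdb(sample)
    print(groups[0]["group"], routers["bridge"])
```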
VXLAN is the de facto technology for implementing network virtualization
in the data center, enabling layer 2 segments to be extended over an IP
core (the underlay). The initial definition of VXLAN
(RFC 7348) did not include any
control plane and relied on a flood-and-learn approach for MAC address
learning. An alternate deployment model was to use a controller or a
technology such as Lightweight Network Virtualization (LNV) in Cumulus Linux.
You cannot use EVPN and LNV at the same time.
When using EVPN, you must disable data plane MAC learning on all VXLAN interfaces. This is described in Basic EVPN Configuration, below.
Ethernet Virtual Private Network (EVPN) is a standards-based control
plane for VXLAN defined in
RFC 7432 and
RFC 8365
that allows for building and deploying VXLANs at scale. It relies on
multi-protocol BGP (MP-BGP) for exchanging information and is based on
BGP-MPLS IP VPNs (RFC 4364). It
has provisions to enable not only bridging between end systems in the
same layer 2 segment but also routing between different segments
(subnets). There is also inherent support for multi-tenancy. EVPN is
often referred to as the means of implementing controller-less VXLAN.
Cumulus Linux fully supports EVPN as the control plane for VXLAN,
including for both intra-subnet bridging and inter-subnet routing. Key
features include:
VNI membership exchange between VTEPs using EVPN type-3 (Inclusive multicast Ethernet tag) routes.
Exchange of host MAC and IP addresses using EVPN type-2 (MAC/IP advertisement) routes.
Support for host/VM mobility (MAC and IP moves) through exchange of the MAC Mobility Extended community.
Support for dual-attached hosts via VXLAN active-active mode. MAC synchronization between the peer switches is done using MLAG.
Support for ARP/ND suppression, which provides VTEPs with the ability to suppress ARP flooding over VXLAN tunnels.
Support for exchange of static (sticky) MAC addresses through EVPN.
Support for distributed symmetric routing between different subnets.
Support for distributed asymmetric routing between different subnets.
Support for centralized routing.
Support for prefix-based routing using EVPN type-5 routes (EVPN IP prefix route).
Support for layer 3 multi-tenancy.
Support for IPv6 tenant routing.
Symmetric routing, asymmetric routing and prefix-based routing are supported for both IPv4 and IPv6 hosts and prefixes.
ECMP (equal cost multipath) support for overlay networks on RIOT-capable Broadcom switches (Trident 3, Maverick, Trident 2+) in addition to Tomahawk and Mellanox Spectrum-A1 switches. No configuration is needed; ECMP occurs in the overlay when there are multiple next hops.
EVPN address-family is supported with both eBGP and iBGP peering. If the
underlay routing is provisioned using eBGP, the same eBGP session can
also be used to carry EVPN routes. For example, in a typical 2-tier Clos
network topology where the leaf switches are the VTEPs, if eBGP sessions
are in use between the leaf and spine switches for the underlay routing,
the same sessions can be used to exchange EVPN routes; the spine
switches merely act as “route forwarders” and do not install any
forwarding state as they are not VTEPs. When EVPN routes are exchanged
over iBGP peering, OSPF can be used as the IGP or the next hops can also
be resolved using iBGP.
For Cumulus Linux 3.4 and later releases, the routing control plane
(including EVPN) is installed as part of the
FRRouting (FRR) package. For more information
about FRR, refer to the FRRouting Overview.
For information about VXLAN routing, including platform and hardware
limitations, see VXLAN Routing.
Basic EVPN Configuration
The following steps represent the fundamental configuration to use EVPN
as the control plane for VXLAN. These steps are in addition to
configuring VXLAN interfaces, attaching them to a bridge, and mapping
VLANs to VNIs.
Enable EVPN route exchange (that is, address-family layer 2
VPN/EVPN) between BGP peers.
Enable EVPN on the system to advertise VNIs and host reachability
information (MAC addresses learned on associated VLANs) to BGP peers.
Disable MAC learning on VXLAN interfaces as EVPN is responsible for
installing remote MACs.
Additional configuration is necessary to enable ARP/ND suppression,
provision inter-subnet routing, and so on. The configuration depends on
the deployment scenario. You can also configure various other BGP
parameters.
Enable EVPN between BGP Neighbors
You enable EVPN between
BGP neighbors by
adding the address family evpn to the existing neighbor address-family
activation command.
For a non-VTEP device that is merely participating in EVPN route
exchange, such as a spine switch (the network deployment uses hop-by-hop
eBGP or the switch is acting as an iBGP route reflector), activating the
interface for the EVPN address family is the fundamental configuration
needed in FRRouting.
Additional configuration options for specific scenarios are described
later on in this chapter.
The other BGP neighbor address-family-specific configurations supported
for EVPN are allowas-in and route-reflector-client.
To configure an EVPN route exchange with a BGP peer, you must activate
the peer or peer-group within the EVPN address-family:
cumulus@switch:~$ net add bgp autonomous-system 65000
cumulus@switch:~$ net add bgp neighbor swp1 interface remote-as external
cumulus@switch:~$ net add bgp l2vpn evpn neighbor swp1 activate
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Adjust the remote-as above to be appropriate for your environment.
The command syntax bgp evpn is also permitted for backwards
compatibility with prior versions of Cumulus Linux, but the syntax bgp l2vpn evpn is recommended to standardize the BGP address-family
configuration to the AFI/SAFI format.
The above commands create the following configuration snippet in the
/etc/frr/frr.conf file.
The above configuration does not result in BGP knowing about the local
VNIs defined on the system and advertising them to peers. This requires
additional configuration, as described below.
Advertise All VNIs
A single configuration variable enables the BGP control plane for all
VNIs configured on the switch. Set the variable advertise-all-vni to
provision all locally configured VNIs to be advertised by the BGP
control plane. FRR is not aware of any local VNIs and MACs and hosts
(neighbors) associated with those VNIs until advertise-all-vni is
configured.
To build upon the previous example, run the following commands to
advertise all VNIs:
cumulus@switch:~$ net add bgp autonomous-system 65000
cumulus@switch:~$ net add bgp neighbor swp1 interface remote-as external
cumulus@switch:~$ net add bgp l2vpn evpn neighbor swp1 activate
cumulus@switch:~$ net add bgp l2vpn evpn advertise-all-vni
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Adjust the remote-as above to be appropriate for your environment.
The above commands create the following configuration snippet in the
/etc/frr/frr.conf file.
This configuration is only needed on leaf switches that are VTEPs. EVPN
routes received from a BGP peer are accepted, even without this explicit
EVPN configuration. These routes are maintained in the global EVPN
routing table. However, they only become effective (that is, imported
into the per-VNI routing table and appropriate entries installed in the
kernel) when the VNI corresponding to the received route is locally
known.
Auto-derivation of RDs and RTs
When FRR learns about a local VNI and there is no explicit
configuration for that VNI in FRR, the route distinguisher (RD) and
import and export route targets (RTs) for this VNI are automatically
derived; the RD uses RouterId:VNI-Index and the import and export RTs
use AS:VNI. For routes that come from a layer 2 VNI (type-2 and type-3), the RD uses the vxlan-local-tunnelip from the layer 2 VNI interface instead of the RouterId (vxlan-local-tunnelip:VNI). The RD and RTs are used in the EVPN route exchange.
The RD disambiguates EVPN routes in different VNIs (as they may have the same
MAC and/or IP address) while the RTs describe the VPN membership for the
route. The “VNI-Index” used for the RD is a unique, internally generated
number for a VNI. It solely has local significance; on remote switches,
its only role is for route disambiguation. This number is used instead
of the VNI value itself because this number has to be less than or equal
to 65535. In the RT, the AS part is always encoded as a 2-byte value to
allow room for a large VNI. If the router has a 4-byte AS, only the
lower 2 bytes are used. This ensures a unique RT for different VNIs
while having the same RT for the same VNI across routers in the same AS.
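The derivation rules described above can be sketched as two small functions. The names `auto_rd` and `auto_rt` are hypothetical and do not correspond to FRR functions; the VNI-Index is passed in as a plain number because how FRR generates it internally is not described here.

```python
def auto_rd(router_id: str, vni_index: int) -> str:
    """Build an auto-derived route distinguisher of the form RouterId:VNI-Index."""
    if not 0 <= vni_index <= 65535:
        raise ValueError("VNI-Index must be less than or equal to 65535")
    return f"{router_id}:{vni_index}"


def auto_rt(asn: int, vni: int) -> str:
    """Build an auto-derived route target of the form AS:VNI.

    The AS part is encoded as a 2-byte value, so only the lower
    2 bytes of a 4-byte AS number are kept.
    """
    return f"{asn & 0xFFFF}:{vni}"


if __name__ == "__main__":
    print(auto_rd("172.16.100.1", 20))  # 172.16.100.1:20
    print(auto_rt(65000, 10200))        # 65000:10200
```

Two routers in the same AS therefore produce the same RT for a given VNI, while different VNIs always produce different RTs.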
For eBGP EVPN peering, the peers are in a different AS so using an
automatic RT of “AS:VNI” does not work for route import. Therefore, the
import RT is treated as “*:VNI” to determine which received routes are
applicable to a particular VNI. This only applies when the import RT is
auto-derived and not configured.
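The wildcard treatment of the auto-derived import RT amounts to ignoring the AS part and comparing only the VNI part. A minimal sketch, with the hypothetical function name `rt_applies_to_vni` and RTs represented as plain "AS:VNI" strings:

```python
def rt_applies_to_vni(received_rt: str, local_vni: int) -> bool:
    """Match a received route target against an auto-derived import RT.

    The import RT is treated as "*:VNI", so the AS part of the
    received RT is ignored and only the VNI part is compared.
    """
    _as_part, separator, vni_part = received_rt.rpartition(":")
    if not separator:
        raise ValueError(f"not an AS:VNI route target: {received_rt!r}")
    return vni_part == str(local_vni)


if __name__ == "__main__":
    # A route exported by a peer in a different AS still matches VNI 10200.
    print(rt_applies_to_vni("65001:10200", 10200))  # True
    print(rt_applies_to_vni("65001:10300", 10200))  # False
```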
User-defined RDs and RTs
EVPN also supports manual configuration of RDs and RTs, if you don’t
want them derived automatically. To manually define RDs and RTs, use the
vni option within NCLU to configure the switch:
cumulus@switch:~$ net add bgp l2vpn evpn vni 10200 rd 172.16.100.1:20
cumulus@switch:~$ net add bgp l2vpn evpn vni 10200 route-target import 65100:20
cumulus@switch:~$ net add bgp l2vpn evpn advertise-all-vni
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following configuration snippet in the
/etc/frr/frr.conf file.
These commands are per VNI and must be specified under address-family l2vpn evpn in BGP.
If you delete the RD or RT later, it reverts to its corresponding default value.
Route target auto-derivation does not support 4-byte AS numbers; if the router has a 4-byte AS, you must define the RTs manually.
You can configure multiple RT values for import or export for a VNI. In
addition, you can configure both the import and export route targets
with a single command by using route-target both:
cumulus@switch:~$ net add bgp evpn vni 10400 route-target import 100:400
cumulus@switch:~$ net add bgp evpn vni 10400 route-target import 100:500
cumulus@switch:~$ net add bgp evpn vni 10500 route-target both 65000:500
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
The above commands create the following configuration snippet in the
/etc/frr/frr.conf file:
Enable EVPN in an iBGP Environment with an OSPF Underlay
EVPN can be deployed with an OSPF
or static route underlay if needed. This is a more complex configuration than
using eBGP. In this case, iBGP advertises EVPN routes directly between
VTEPs, and the spines are unaware of EVPN or BGP.
The leaf switches peer with each other in a full mesh within the EVPN
address family without using route reflectors. The leafs generally peer
to their loopback addresses, which are advertised in OSPF. The receiving
VTEP imports routes into a specific VNI with a matching route target
community.
cumulus@switch:~$ net add bgp autonomous-system 65020
cumulus@switch:~$ net add bgp evpn neighbor 10.1.1.2 remote-as internal
cumulus@switch:~$ net add bgp evpn neighbor 10.1.1.3 remote-as internal
cumulus@switch:~$ net add bgp evpn neighbor 10.1.1.4 remote-as internal
cumulus@switch:~$ net add bgp evpn neighbor 10.1.1.2 activate
cumulus@switch:~$ net add bgp evpn neighbor 10.1.1.3 activate
cumulus@switch:~$ net add bgp evpn neighbor 10.1.1.4 activate
cumulus@switch:~$ net add bgp evpn advertise-all-vni
cumulus@switch:~$ net add ospf router-id 10.1.1.1
cumulus@switch:~$ net add loopback lo ospf area 0.0.0.0
cumulus@switch:~$ net add ospf passive-interface lo
cumulus@switch:~$ net add interface swp50 ospf area 0.0.0.0
cumulus@switch:~$ net add interface swp51 ospf area 0.0.0.0
cumulus@switch:~$ net add interface swp50 ospf network point-to-point
cumulus@switch:~$ net add interface swp51 ospf network point-to-point
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following configuration snippet in the
/etc/frr/frr.conf file.
interface lo
ip ospf area 0.0.0.0
!
interface swp50
ip ospf area 0.0.0.0
ip ospf network point-to-point
interface swp51
ip ospf area 0.0.0.0
ip ospf network point-to-point
!
router bgp 65020
neighbor 10.1.1.2 remote-as internal
neighbor 10.1.1.3 remote-as internal
neighbor 10.1.1.4 remote-as internal
!
address-family l2vpn evpn
neighbor 10.1.1.2 activate
neighbor 10.1.1.3 activate
neighbor 10.1.1.4 activate
advertise-all-vni
exit-address-family
!
router ospf
ospf router-id 10.1.1.1
passive-interface lo
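Because every leaf peers with every other leaf, the per-switch command list grows with the number of VTEPs. The following Python sketch generates the BGP portion of the NCLU commands shown above for each leaf in a full mesh. The helper names and the loopback list are hypothetical; the command strings are copied from the example.

```python
def full_mesh(loopbacks):
    """Map each leaf loopback to the list of all other leaf loopbacks."""
    return {me: [peer for peer in loopbacks if peer != me] for me in loopbacks}


def ibgp_evpn_commands(asn, peers):
    """Return the NCLU commands that configure iBGP EVPN peering with each peer."""
    cmds = [f"net add bgp autonomous-system {asn}"]
    cmds += [f"net add bgp evpn neighbor {p} remote-as internal" for p in peers]
    cmds += [f"net add bgp evpn neighbor {p} activate" for p in peers]
    cmds.append("net add bgp evpn advertise-all-vni")
    return cmds


# Hypothetical four-leaf fabric using the addresses from the example above.
leafs = ["10.1.1.1", "10.1.1.2", "10.1.1.3", "10.1.1.4"]
mesh = full_mesh(leafs)
for line in ibgp_evpn_commands(65020, mesh["10.1.1.1"]):
    print(line)
```

A fabric with n leafs needs n(n-1)/2 sessions in total, which is why this design is usually limited to small deployments.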
Disable Data Plane MAC Learning over VXLAN Tunnels
When EVPN is provisioned, you must disable data plane MAC learning for
VXLAN interfaces because the purpose of EVPN is to exchange MACs between
VTEPs in the control plane. In the /etc/network/interfaces file,
configure the bridge-learning value to off:
cumulus@switch:~$ net add loopback lo vxlan local-tunnelip 10.0.0.1
cumulus@switch:~$ net add vxlan vni200 vxlan id 10200
cumulus@switch:~$ net add vxlan vni200 bridge access 200
cumulus@switch:~$ net add vxlan vni200 bridge learning off
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following code snippet in the
/etc/network/interfaces file:
# The loopback network interface
auto lo
iface lo inet loopback
vxlan-local-tunnelip 10.0.0.1
auto vni200
iface vni200
bridge-access 200
bridge-learning off
vxlan-id 10200
For a bridge in traditional mode,
you must edit the bridge configuration in the /etc/network/interfaces
file using a text editor:
auto bridge1
iface bridge1
bridge-ports swp3.100 swp4.100 vni100
bridge-learning vni100=off
For a traditional-mode bridge on Broadcom switches, the bridge learning setting is per physical port; you cannot control MAC
learning behavior based on subinterface. For example, you cannot set
bridge learning off on some subinterfaces and on for other
subinterfaces of the same physical interface.
Cumulus Linux does not support different bridge-learning settings for
different VNIs of VXLAN tunnels between two VTEPs.
BUM Traffic and Head End Replication
With EVPN, the only method of generating BUM traffic in hardware is head end replication. Head end replication
is enabled by default in Cumulus Linux.
Broadcom switches with Tomahawk, Maverick, Trident3, Trident II+, and Trident II
ASICs and Mellanox switches with Spectrum ASICs are capable of head end
replication. The most scalable solution available with EVPN is to have
each VTEP (top of rack switch) generate all of its own BUM traffic
instead of relying on an external service node.
Cumulus Linux supports up to 128 VTEPs with head end replication.
ARP and ND Suppression
ARP suppression in an EVPN context refers to the ability of a VTEP to
suppress ARP flooding over VXLAN tunnels as much as possible. Instead, a
local proxy handles ARP requests received from locally attached hosts
for remote hosts. ARP suppression is the implementation for IPv4; ND
suppression is the implementation for IPv6.
ARP/ND suppression is not enabled by default. Enable ARP and ND suppression in all EVPN bridging and symmetric routing deployments to reduce flooding of ARP/ND packets over VXLAN tunnels.
You configure ARP/ND suppression on a VXLAN interface. You also need to create an SVI for the
neighbor entry.
On switches with the Mellanox Spectrum chipset, ND suppression only functions with the Spectrum A1 chip.
ARP/ND suppression must be enabled on all VXLAN interfaces on the switch. You cannot have ARP/ND suppression enabled on some VXLAN interfaces but not on others.
When ARP/ND suppression is enabled, you need to configure layer 3 interfaces even if the switch is configured only for layer 2 (that is, you are not using VXLAN routing). To prevent unnecessary layer 3 information from being installed, configure the ip forward off or ip6 forward off options as appropriate on the VLANs. See the example configuration below.
To configure ARP/ND suppression, use NCLU.
Here is an example configuration using two VXLANs (10100 and 10200) and two VLANs (100 and 200).
cumulus@switch:~$ net add loopback lo vxlan local-tunnelip 10.0.0.1
cumulus@switch:~$ net add bridge bridge ports vni100,vni200
cumulus@switch:~$ net add bridge bridge vids 100,200
cumulus@switch:~$ net add vxlan vni100 vxlan id 10100
cumulus@switch:~$ net add vxlan vni200 vxlan id 10200
cumulus@switch:~$ net add vxlan vni100 bridge learning off
cumulus@switch:~$ net add vxlan vni200 bridge learning off
cumulus@switch:~$ net add vxlan vni100 bridge access 100
cumulus@switch:~$ net add vxlan vni100 bridge arp-nd-suppress on
cumulus@switch:~$ net add vxlan vni200 bridge arp-nd-suppress on
cumulus@switch:~$ net add vxlan vni200 bridge access 200
cumulus@switch:~$ net add vlan 100 ip forward off
cumulus@switch:~$ net add vlan 100 ipv6 forward off
cumulus@switch:~$ net add vlan 200 ip forward off
cumulus@switch:~$ net add vlan 200 ipv6 forward off
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following configuration in the
/etc/network/interfaces file:
# The loopback network interface
auto lo
iface lo inet loopback
vxlan-local-tunnelip 10.0.0.1
auto bridge
iface bridge
bridge-ports vni100 vni200
bridge-stp on
bridge-vids 100 200
bridge-vlan-aware yes
auto vlan100
iface vlan100
ip6-forward off
ip-forward off
vlan-id 100
vlan-raw-device bridge
auto vlan200
iface vlan200
ip6-forward off
ip-forward off
vlan-id 200
vlan-raw-device bridge
auto vni100
iface vni100
bridge-access 100
bridge-arp-nd-suppress on
bridge-learning off
vxlan-id 10100
auto vni200
iface vni200
bridge-learning off
bridge-access 200
bridge-arp-nd-suppress on
vxlan-id 10200
For a bridge in traditional mode,
you must edit the bridge configuration in the /etc/network/interfaces
file using a text editor:
auto bridge1
iface bridge1
bridge-ports swp3.100 swp4.100 vni100
bridge-learning vni100=off
bridge-arp-nd-suppress vni100=on
ip6-forward off
ip-forward off
UFT Profiles Other than the Default
When deploying EVPN and VXLAN using a hardware profile other than the
default UFT profile, ensure that the Linux kernel ARP sysctl settings
gc_thresh2 and gc_thresh3 are both set to a value larger than the
number of neighbor (ARP/ND) entries anticipated in the deployment.
To configure these settings, edit the /etc/sysctl.d/neigh.conf file.
If your network has more hosts than the values used in the example
below, change the sysctl entries accordingly.
After you save your settings, reboot the switch to apply the new
configuration.
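As a rough aid for sizing these values, the Python sketch below prints sysctl entries that keep both thresholds above an anticipated neighbor count. The margin and the 2x ratio between gc_thresh2 and gc_thresh3 are assumptions chosen for illustration, not values recommended by Cumulus Linux; the sysctl key names are the standard Linux kernel ones.

```python
def neigh_sysctl_lines(expected_neighbors, margin=1000):
    """Return sysctl entries with gc_thresh2 and gc_thresh3 above the expected count."""
    thresh2 = expected_neighbors + margin  # assumed headroom above the expected count
    thresh3 = 2 * thresh2                  # assumed ratio between the two thresholds
    lines = []
    for family in ("ipv4", "ipv6"):
        lines.append(f"net.{family}.neigh.default.gc_thresh3={thresh3}")
        lines.append(f"net.{family}.neigh.default.gc_thresh2={thresh2}")
    return lines


# Hypothetical deployment with about 8000 ARP/ND entries.
for line in neigh_sysctl_lines(8000):
    print(line)
```

The printed lines can be compared against the entries in /etc/sysctl.d/neigh.conf before you edit the file.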
Support for EVPN Neighbor Discovery (ND) Extended Community
In an EVPN VXLAN deployment with ARP and ND suppression where the VTEPs
are only configured for layer 2, EVPN needs to carry additional
information for the attached devices so proxy ND can provide the correct
information to attached hosts. Without this information, hosts might not
be able to configure their default routers or might lose their existing
default router information. Cumulus Linux supports the EVPN Neighbor Discovery (ND) Extended
Community with a type field value of 0x06, a sub-type field value of
0x08 (ND Extended Community), and a router flag; this enables the switch
to determine if a particular IPv6-MAC pair belongs to a host or a
router.
Router Flag
The router flag (R-bit) is used in the following scenarios:
In a centralized VXLAN routing configuration with a gateway router.
In a layer 2 switch deployment with ARP/ND suppression.
When the MAC/IP (type-2) route contains the IPv6-MAC pair and the R-bit
is set, the route belongs to a router. If the R-bit is set to zero, the
route belongs to a host. If the router is in a local LAN segment, the
switch implementing the proxy ND function learns of this information by
snooping on neighbor advertisement messages for the associated IPv6
address. This information is then exchanged with other EVPN peers by
using the ND extended community in BGP updates.
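To make the encoding concrete, here is a minimal Python sketch that builds and parses an 8-byte ND extended community with the type and sub-type values given above. It assumes the flags octet follows the type and sub-type and that the R-bit is the low-order bit of that octet, as laid out in RFC 9047; the function names are hypothetical and this is not how FRR represents the attribute internally.

```python
ND_TYPE = 0x06      # EVPN extended community type (from the text above)
ND_SUBTYPE = 0x08   # ND extended community sub-type (from the text above)
ROUTER_FLAG = 0x01  # assumed position of the R-bit within the flags octet


def encode_nd_community(is_router: bool) -> bytes:
    """Build an 8-byte ND extended community; the remaining octets are reserved (zero)."""
    flags = ROUTER_FLAG if is_router else 0
    return bytes([ND_TYPE, ND_SUBTYPE, flags]) + bytes(5)


def belongs_to_router(community: bytes) -> bool:
    """Return True if the R-bit is set, meaning the IPv6-MAC pair belongs to a router."""
    if len(community) != 8 or community[0] != ND_TYPE or community[1] != ND_SUBTYPE:
        raise ValueError("not an ND extended community")
    return bool(community[2] & ROUTER_FLAG)


print(encode_nd_community(True).hex())
print(belongs_to_router(encode_nd_community(False)))
```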
To show the EVPN arp-cache that gets populated by the neighbor table and
see if the IPv6-MAC entry belongs to a router, run this command:
cumulus@switch:mgmt-vrf:~$ net show evpn arp-cache vni 101 ip fe80::202:ff:fe00:11
IP: fe80::202:ff:fe00:11
Type: remote
State: active
MAC: 00:02:00:00:00:11
Remote VTEP: 10.0.0.134
Flags: Router
Local Seq: 0 Remote Seq: 0
To show the BGP routing table entry for the IPv6-MAC EVPN route with the
ND extended community, run this command:
cumulus@switch:mgmt-vrf:~$ net show bgp l2vpn evpn route vni 101 mac 00:02:00:00:00:11 ip fe80::202:ff:fe00:11
BGP routing table entry for [2]:[0]:[0]:[48]:[00:02:00:00:00:11]:[128]:[fe80::202:ff:fe00:11]
Paths: (1 available, best #1)
Not advertised to any peer
Route [2]:[0]:[0]:[48]:[00:02:00:00:00:11]:[128]:[fe80::202:ff:fe00:11] VNI 101
Imported from 1.1.1.2:2:[2]:[0]:[0]:[48]:[00:02:00:00:00:11]:[128]:[fe80::202:ff:fe00:11]
65002
10.0.0.134 from leaf2(swp53s0) (10.0.0.134)
Origin IGP, valid, external, bestpath-from-AS 65002, best
Extended Community: RT:65002:101 ET:8 ND:Router Flag
AddPath ID: RX 0, TX 18
Last update: Thu Aug 30 14:12:09 2018
EVPN and VXLAN Active-active Mode
No additional EVPN-specific configuration is needed for
VXLAN active-active mode.
Both switches in the MLAG
pair establish EVPN peering with other EVPN speakers (for example, with
spine switches, if using hop-by-hop eBGP) and advertise their locally
known VNIs and MACs. When MLAG is active, both switches announce this
information with the shared anycast IP address.
For the active-active configuration, make sure that:
The clagd-vxlan-anycast-ip and vxlan-local-tunnelip parameters are under the loopback stanza on both peers.
The anycast address is advertised to the routed fabric from both peers.
The VNIs are configured identically on both peers.
MLAG synchronizes information between the two switches in the MLAG pair;
EVPN does not synchronize.
For information about active-active VTEPs and anycast IP behavior, and
for failure scenarios, read the VXLAN Active-Active Mode chapter.
Inter-subnet Routing
There are multiple models in EVPN for routing between different subnets
(VLANs), also known as inter-VLAN routing. These models arise due to the
following considerations:
Does every VTEP act as a layer 3 gateway and perform routing, or do only specific VTEPs perform routing?
Is routing done only at the ingress of the VXLAN tunnel or is it done at both the ingress and the egress of the VXLAN tunnel?
These models are:
Centralized routing: Specific VTEPs act as designated layer 3 gateways and perform routing between subnets; other VTEPs just perform bridging.
Distributed asymmetric routing: Every VTEP participates in routing, but all routing is done at the ingress VTEP; the egress VTEP only performs bridging.
Distributed symmetric routing: Every VTEP participates in routing and routing is done at both the ingress VTEP and the egress VTEP.
Distributed routing - asymmetric or symmetric - is commonly deployed
with the VTEPs configured with an anycast IP/MAC address for each
subnet. That is, each VTEP that has a particular subnet is configured
with the same IP/MAC for that subnet. Such a model facilitates easy
host/VM mobility as there is no need to change the host/VM configuration
when it moves from one VTEP to another.
EVPN in Cumulus Linux supports all of the routing models listed above.
The models are described further in the following sections.
All routing happens in the context of a tenant VRF (virtual routing and forwarding).
A VRF instance is provisioned for each tenant, and the subnets of the
tenant are associated with that VRF (the corresponding SVI is attached
to the VRF). Inter-subnet routing for each tenant occurs within the
context of that tenant’s VRF and is separate from the routing for other
tenants.
When configuring VXLAN routing, enable ARP suppression on all VXLAN interfaces. Otherwise, when a locally attached host ARPs for the gateway, it will receive multiple responses, one from each anycast gateway.
Centralized Routing
In centralized routing, a specific VTEP is configured to act as the
default gateway for all the hosts in a particular subnet throughout the
EVPN fabric. It is common to provision a pair of VTEPs in active-active
mode as the default gateway, using an anycast IP/MAC address for each
subnet. All subnets need to be configured on such gateway VTEP(s). When
a host in one subnet wants to communicate with a host in another subnet,
it addresses the packets to the gateway VTEP. The ingress VTEP (to which
the source host is attached) bridges the packets to the gateway VTEP
over the corresponding VXLAN tunnel. The gateway VTEP routes the packet
toward the destination host; after routing, the packet is
bridged to the egress VTEP (to which the destination host is attached).
The egress VTEP then bridges the packet on to the destination host.
Advertising the Default Gateway
To enable centralized routing, you must configure the gateway VTEPs to
advertise their IP/MAC address. Use the advertise-default-gw command,
as shown below.
cumulus@leaf01:~$ net add bgp autonomous-system 65000
cumulus@leaf01:~$ net add bgp l2vpn evpn advertise-default-gw
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
These commands create the following configuration snippet in the
/etc/frr/frr.conf file.
You can deploy centralized routing at the VNI level. Therefore, you
can configure the advertise-default-gw command per VNI so that
centralized routing is used for some VNIs while distributed routing
(described below) is used for other VNIs. This type of configuration
is not recommended unless the deployment requires it.
When centralized routing is in use, even if the source host and
destination host are attached to the same VTEP, the packets travel
to the gateway VTEP to get routed and then come back.
Asymmetric Routing
In distributed asymmetric routing, each VTEP acts as a layer 3 gateway,
performing routing for its attached hosts. The routing is called
asymmetric because only the ingress VTEP performs routing; the egress
VTEP only performs bridging. Asymmetric routing is easy to deploy as
it can be achieved with only host routing and does not involve any
interconnecting VNIs. However, each VTEP must be provisioned with all
VLANs/VNIs - the subnets between which communication can take place;
this is required even if there are no locally-attached hosts for a
particular VLAN.
The only additional configuration required to implement asymmetric
routing beyond the standard configuration for a layer 2 VTEP described
earlier is to ensure that each VTEP has all VLANs (and corresponding
VNIs) provisioned on it and the SVI for each such VLAN is configured
with an anycast IP/MAC address.
Symmetric Routing
In distributed symmetric routing, each VTEP acts as a layer 3 gateway,
performing routing for its attached hosts. This is the same as in
asymmetric routing. The difference is that with symmetric routing, both
the ingress VTEP and egress VTEP route the packets. Therefore, it can be
compared to the traditional routing behavior of routing to a next hop
router. In the VXLAN encapsulated packet, the inner destination MAC
address is set to the router MAC address of the egress VTEP as an
indication that the egress VTEP is the next hop and also needs to
perform routing. All routing happens in the context of a tenant (VRF).
For a packet received by the ingress VTEP from a locally attached host,
the SVI interface corresponding to the VLAN determines the VRF. For a
packet received by the egress VTEP over the VXLAN tunnel, the VNI in the
packet has to specify the VRF. For symmetric routing, this is a VNI
corresponding to the tenant and is different from either the source VNI
or the destination VNI. This VNI is referred to as the layer 3 VNI or
interconnecting VNI; it has to be provisioned by the operator and is
exchanged through the EVPN control plane. In order to make the
distinction clear, the regular VNI, which is used to map a VLAN, is
referred to as the layer 2 VNI.
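The difference between the two distributed models comes down to what the ingress VTEP places in the VXLAN encapsulation. The Python sketch below is a simplified model of the descriptions above, with hypothetical names and values: in asymmetric routing the ingress VTEP routes and then bridges on the destination layer 2 VNI, while in symmetric routing it uses the tenant's layer 3 VNI and the egress VTEP's router MAC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Encap:
    vni: int             # VNI carried in the VXLAN header
    inner_dst_mac: str   # destination MAC of the inner Ethernet frame


def encap_for(model, dst_l2_vni, l3_vni, dst_host_mac, egress_router_mac):
    """Return the VNI and inner destination MAC the ingress VTEP uses after routing."""
    if model == "asymmetric":
        # The egress VTEP only bridges, so the frame is addressed to the host itself.
        return Encap(dst_l2_vni, dst_host_mac)
    if model == "symmetric":
        # The egress VTEP must route again, so it is addressed as the next hop.
        return Encap(l3_vni, egress_router_mac)
    raise ValueError(f"unknown routing model: {model}")


# Hypothetical values for illustration only.
print(encap_for("asymmetric", 10200, 104001, "00:02:00:00:00:42", "44:39:39:ff:40:94"))
print(encap_for("symmetric", 10200, 104001, "00:02:00:00:00:42", "44:39:39:ff:40:94"))
```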
L3-VNI
There is a one-to-one mapping between a layer 3 VNI and a tenant (VRF).
The VRF to layer 3 VNI mapping has to be consistent across all VTEPs. The layer 3 VNI has to be provisioned by the operator.
Layer 3 VNI and layer 2 VNI cannot share the same number space; that is, you cannot have vlan10 and vxlan10 for example. Otherwise, the layer 2 VNI does not get created.
In an MLAG configuration, the SVI used for the layer 3 VNI cannot be
part of the bridge. This ensures that traffic tagged with that VLAN
ID is not forwarded on the peer link or other trunks.
In an EVPN symmetric routing configuration, when a type-2 (MAC/IP) route
is announced, in addition to containing two VNIs (the layer 2 VNI and
the layer 3 VNI), the route also contains separate RTs for layer 2 and
layer 3. The layer 3 RT associates the route with the tenant VRF. By
default, this is auto-derived in a similar way to the layer 2 RT, using
the layer 3 VNI instead of the layer 2 VNI; however you can also
explicitly configure it.
For EVPN symmetric routing, additional configuration is required:
Configure a per-tenant VXLAN interface that specifies the layer 3
VNI for the tenant. This VXLAN interface is part of the bridge and
router MAC addresses of remote VTEPs are installed over this
interface.
Configure an SVI (layer 3 interface) corresponding to the per-tenant
VXLAN interface. This is attached to the tenant’s VRF. Remote host
routes for symmetric routing are installed over this SVI.
Specify the mapping of VRF to layer 3 VNI. This configuration is for
the BGP control plane.
VXLAN Interface Corresponding to the Layer 3 VNI
cumulus@leaf01:~$ net add loopback lo vxlan local-tunnelip 10.0.0.1
cumulus@leaf01:~$ net add vxlan vni104001 vxlan id 104001
cumulus@leaf01:~$ net add vxlan vni104001 bridge access 4001
cumulus@leaf01:~$ net add vxlan vni104001 bridge learning off
cumulus@leaf01:~$ net add vxlan vni104001 bridge arp-nd-suppress on
cumulus@leaf01:~$ net add bridge bridge ports vni104001
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
The above commands create the following snippet in the
/etc/network/interfaces file:
# The loopback network interface
auto lo
iface lo inet loopback
vxlan-local-tunnelip 10.0.0.1
auto vni104001
iface vni104001
bridge-access 4001
bridge-arp-nd-suppress on
bridge-learning off
vxlan-id 104001
auto bridge
iface bridge
bridge-ports vni104001
bridge-vlan-aware yes
SVI for the Layer 3 VNI
cumulus@leaf01:~$ net add vlan 4001 vrf turtle
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
These commands create the following snippet in the
/etc/network/interfaces file:
auto vlan4001
iface vlan4001
vlan-id 4001
vlan-raw-device bridge
vrf turtle
When two VTEPs are operating in VXLAN active-active mode and performing
symmetric routing, you need to configure the router MAC corresponding to
each layer 3 VNI to ensure both VTEPs use the same MAC address. Specify
the hwaddress (MAC address) for the SVI corresponding to the layer 3
VNI. Use the same address on both switches in the MLAG pair. Cumulus
Networks recommends you use the MLAG system MAC address.
cumulus@leaf01:~$ net add vlan 4001 hwaddress 44:39:39:FF:40:94
This command creates the following snippet in the
/etc/network/interfaces file:
When configuring third party networking devices using MLAG and EVPN for interoperability, you must configure and announce a single shared router MAC value per advertised next hop IP address.
VRF to Layer 3 VNI Mapping
cumulus@leaf01:~$ net add vrf turtle vni 104001
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
These commands create the following configuration snippet in the
/etc/frr/frr.conf file.
vrf turtle
vni 104001
!
Configure RD and RTs for the Tenant VRF
If you do not want the RD and RTs (layer 3 RTs) for the tenant VRF to be
derived automatically, you can configure them manually by specifying
them under the l2vpn evpn address family for that specific VRF. For
example:
cumulus@switch:~$ net add bgp vrf tenant1 l2vpn evpn rd 172.16.100.1:20
cumulus@switch:~$ net add bgp vrf tenant1 l2vpn evpn route-target import 65100:20
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following configuration snippet in the
/etc/frr/frr.conf file:
Symmetric routing presents a problem in the presence of silent hosts. If
the ingress VTEP does not have the destination subnet and the host route
is not advertised for the destination host, the ingress VTEP cannot
route the packet to its destination. This problem can be overcome by
having VTEPs announce the subnet prefixes corresponding to their
connected subnets in addition to announcing host routes. These routes
are announced as EVPN prefix (type-5) routes.
Ensure that the routes corresponding to the connected subnets are
known in the BGP VRF routing table by injecting them using the
network command or redistributing them using the redistribute connected command.
This configuration is recommended only if the deployment is known to
have silent hosts. It is also recommended that you enable it on only one
VTEP per subnet, or two for redundancy.
An earlier version of this chapter referred to the advertise-subnet
command. That command is deprecated and should not be used.
Prefix-based Routing - EVPN Type-5 Routes
EVPN in Cumulus Linux supports prefix-based routing using EVPN type-5
(prefix) routes. Type-5 routes (or prefix routes) are primarily used to
route to destinations outside of the data center fabric.
EVPN prefix routes carry the layer 3 VNI and router MAC address and
follow the symmetric routing model for routing to the destination
prefix.
When connecting to a WAN edge router to reach destinations outside the
data center, it is highly recommended that specific border/exit leaf
switches be deployed to originate the type-5 routes.
On switches with the Mellanox Spectrum chipset, centralized routing,
symmetric routing and prefix-based routing only function with the
Spectrum A1 chip.
If you are using a Broadcom Trident II+ switch as a border/exit leaf, see
Caveats below for a necessary workaround; the workaround only applies
to Trident II+ switches, not Tomahawk or Spectrum.
Configure the Switch to Install EVPN Type-5 Routes
For a switch to be able to install EVPN type-5 routes into the routing
table, it must be configured with the layer 3 VNI related information.
This configuration is the same as for symmetric routing. You need to:
Configure a per-tenant VXLAN interface that specifies the layer 3
VNI for the tenant. This VXLAN interface is part of the bridge;
router MAC addresses of remote VTEPs are installed over this
interface.
Configure an SVI (layer 3 interface) corresponding to the per-tenant
VXLAN interface. This is attached to the tenant’s VRF. The remote
prefix routes are installed over this SVI.
Specify the mapping of the VRF to layer 3 VNI. This configuration is
for the BGP control plane.
Announce EVPN Type-5 Routes
The following configuration is needed in the tenant VRF to announce IP
prefixes in BGP’s RIB as EVPN type-5 routes.
cumulus@bl1:~$ net add bgp vrf vrf1 l2vpn evpn advertise ipv4 unicast
cumulus@bl1:~$ net pending
cumulus@bl1:~$ net commit
These commands create the following snippet in the /etc/frr/frr.conf
file:
Asymmetric routing is an ideal choice when all VLANs (subnets) are
configured on all leaf switches. It simplifies the routing configuration
and eliminates the potential need for advertising subnet routes to
handle silent hosts. However, most deployments need access to external
networks to reach the Internet or global destinations, or to do
subnet-based routing between pods or data centers; this requires EVPN
type-5 routes.
Cumulus Linux supports EVPN type-5 routes for prefix-based routing in
asymmetric configurations within the pod or data center by providing an
option to use the layer 3 VNI only for type-5 routes; type-2 routes
(host routes) only use the layer 2 VNI.
The following example commands show how to use the layer 3 VNI for
type-5 routes only:
cumulus@leaf01:~$ net add vrf turtle vni 104001 prefix-routes-only
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
These commands create the following snippet in the /etc/frr/frr.conf
file:
vrf turtle
vni 104001 prefix-routes-only
There is no command to delete the prefix-routes-only option. The net del vrf <vrf> vni <vni> prefix-routes-only command deletes the VNI.
Control Which RIB Routes Are Injected into EVPN
By default, when announcing IP prefixes in the BGP RIB as EVPN type-5
routes, all routes in the BGP RIB are picked for advertisement as EVPN
type-5 routes. You can use a route map to allow selective advertisement
of routes from the BGP RIB as EVPN type-5 routes.
The following command adds a route map filter to IPv4 EVPN type-5 route
advertisement:
cumulus@switch:~$ net add bgp vrf turtle l2vpn evpn advertise ipv4 unicast route-map map1
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
Originate Default EVPN Type-5 Routes
Cumulus Linux supports originating EVPN default type-5 routes. The
default type-5 route is originated from a border (exit) leaf and
advertised to all the other leafs within the pod. Any leaf within the
pod follows the default route towards the border leaf for all external
traffic (towards the Internet or a different pod).
To originate a default type-5 route in EVPN, you need to execute FRRouting
commands. The following shows an example:
MAC addresses that are intended to be pinned to a particular VTEP can be
provisioned on the VTEP as a static bridge FDB entry. EVPN picks up
these MAC addresses and advertises them to peers as remote static MACs.
You configure static bridge FDB entries for sticky MACs under the bridge
configuration using NCLU:
cumulus@switch:~$ net add bridge post-up bridge fdb add 00:11:22:33:44:55 dev swp1 vlan 101 master static
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands create the following configuration in the
/etc/network/interfaces file:
For a bridge in traditional mode,
you must edit the bridge configuration in the /etc/network/interfaces
file using a text editor:
auto br101
iface br101
bridge-ports swp1.101 vni10101
bridge-learning vni10101=off
post-up bridge fdb add 00:11:22:33:44:55 dev swp1.101 master static
Filter EVPN Routes Based on Type
In many situations, it is desirable to only exchange EVPN routes of a
particular type. For example, a common deployment scenario for large
data centers is to sub-divide the data center into multiple pods with
full host mobility within a pod but only do prefix-based routing across
pods. This can be achieved by only exchanging EVPN type-5 routes across
pods.
To filter EVPN routes based on the route-type and allow only certain
types of EVPN routes to be advertised in the fabric, use these commands:
net add routing route-map <route_map_name> (deny|permit) <1-65535> match evpn default-route
net add routing route-map <route_map_name> (deny|permit) <1-65535> match evpn route-type (macip|prefix|multicast)
The following example command configures EVPN to advertise type-5 routes
only:
cumulus@switch:~$ net add routing route-map map1 permit 1 match evpn route-type prefix
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
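The effect of a route map such as map1 can be modeled in a few lines of Python. This sketch assumes the usual FRR route-map semantics, where clauses are evaluated in sequence order, the first matching clause decides, and a route that matches no clause is denied. The data layout and function name are hypothetical.

```python
def route_map_permits(route_map, route_type):
    """Evaluate (sequence, action, match_route_type) clauses against an EVPN route type."""
    for _seq, action, match_type in sorted(route_map):
        if match_type == route_type:
            return action == "permit"
    return False  # implicit deny when no clause matches


# Equivalent of: route-map map1 permit 1 match evpn route-type prefix
map1 = [(1, "permit", "prefix")]

for rtype in ("macip", "prefix", "multicast"):
    print(rtype, route_map_permits(map1, rtype))
```

Only prefix (type-5) routes pass; macip (type-2) and multicast (type-3) routes fall through to the implicit deny.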
Filter EVPN Routes Based on VNI
In many situations, it is desirable to only exchange EVPN routes carrying
a particular VXLAN ID. For example, if data centers or pods within a data center share only certain tenants, you can use a route-map to control the EVPN routes to exchange based on the VNI.
To filter EVPN routes based on the VXLAN ID so that Cumulus Linux advertises only the EVPN routes
with a particular VNI in the fabric, use this command:
net add routing route-map <route_map_name> (deny|permit) <1-65535> match evpn vni <1-16777215>
You can only match type-2 and type-5 routes based on VNI.
Advertise SVI IP Addresses
In a typical EVPN deployment, you reuse SVI IP addresses on VTEPs
across multiple racks. However, if you use unique SVI IP addresses
across multiple racks and you want the local SVI IP address to be
reachable via remote VTEPs, you can enable the advertise-svi-ip
option. This option advertises the SVI IP/MAC address as a type-2 route
and eliminates the need for any flooding over VXLAN to reach the SVI IP
from a remote VTEP/rack.
Notes
The advertise-svi-ip option is available in Cumulus Linux 3.7.4 and later.
When you enable the advertise-svi-ip option, the anycast IP/MAC
address pair is not advertised. Be sure not to enable both the
advertise-svi-ip option and the advertise-default-gw option at
the same time. (The advertise-default-gw option configures the
gateway VTEPs to advertise their IP/MAC address. See
Advertising the Default Gateway).
To advertise all SVI IP/MAC addresses on the switch, run these
commands:
cumulus@switch:~$ net add bgp evpn advertise-svi-ip
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
These commands save the configuration in the /etc/frr/frr.conf file.
For example:
Extended Mobility
Cumulus Linux supports host and virtual machine mobility in an EVPN
deployment, including scenarios where the IP to MAC binding for a host
or virtual machine changes across the move. This is referred to as
extended mobility. Previous releases of Cumulus Linux support the simple
mobility scenario, where a host or virtual machine with a binding of
IP1, MAC1 moves from one rack to another. The EVPN enhancements support
additional scenarios where a host or virtual machine with a binding of
IP1, MAC1 moves and takes on a new binding of IP2, MAC1 or IP1, MAC2.
The EVPN protocol mechanism that handles extended mobility continues to
use the MAC mobility extended community and is the same as the standard
mobility procedures. Extended mobility defines how the sequence number
in this attribute is computed when binding changes occur.
Extended mobility not only supports virtual machine moves, but also a
scenario where one virtual machine shuts down and another is provisioned
on a different rack that uses the IP address or the MAC address of the
previous virtual machine. For example, in an EVPN deployment with
OpenStack, where virtual machines for a tenant are provisioned and shut
down very dynamically, a new virtual machine can use the same IP address
as an earlier virtual machine but with a different MAC address.
During mobility events, EVPN neighbor management relies on ARP and GARP to learn the new location for hosts and VMs. MAC learning is independent of this and happens in the hardware.
The support for extended mobility is enabled by default and does not
require any additional configuration.
You can examine the sequence numbers associated with a host or virtual
machine MAC address and IP address with NCLU commands. For example:
cumulus@switch:~$ net show evpn mac vni 10100 mac 00:02:00:00:00:42
MAC: 00:02:00:00:00:42
Remote VTEP: 10.0.0.2
Local Seq: 0 Remote Seq: 3
Neighbors:
10.1.1.74 Active
cumulus@switch:~$ net show evpn arp-cache vni 10100 ip 10.1.1.74
IP: 10.1.1.74
Type: local
State: active
MAC: 44:39:39:ff:00:24
Local Seq: 2 Remote Seq: 3
Duplicate Address Detection
Cumulus Linux 3.7.2 and later is able to detect duplicate MAC and
IPv4/IPv6 addresses on hosts or virtual machines in a VXLAN-EVPN
configuration. The Cumulus Linux switch (VTEP) considers a host MAC or
IP address to be duplicate if the address moves across the network more
than a certain number of times within a certain number of seconds (five
moves within 180 seconds by default). In addition to legitimate host or
VM mobility scenarios, address movement can occur when IP addresses are
misconfigured on hosts or when packet looping occurs in the network due
to faulty configuration or behavior.
Duplicate address detection is enabled by default and triggers when:
Two hosts have the same MAC address (the host IP addresses might be the same or different)
Two hosts have the same IP address but different MAC addresses
By default, when a duplicate address is detected, Cumulus Linux flags
the address as a duplicate and generates an error in syslog so that you
can troubleshoot the reason and address the fault, then clear the
duplicate address flag. No functional action is taken on the address.
If a MAC address is flagged as a duplicate, all IP addresses associated
with that MAC are flagged as duplicates.
In an MLAG configuration, MAC mobility detection runs independently on each switch in the MLAG pair. Based on the sequence in which local learning and/or route withdrawal from the remote VTEP occurs, a type-2 route might have its MAC mobility counter incremented only on one of the switches in the MLAG pair. In rare cases, it is possible for neither VTEP to increment the MAC mobility counter for the type-2 prefix.
When Does Duplicate Address Detection Trigger?
The VTEP that sees an address move from remote to local begins the detection
process by starting a timer. Each VTEP runs duplicate address detection independently.
Detection always starts with the first mobility event from remote to
local. If the address is initially remote, the detection count can
start with the very first move for the address. If the address is
initially local, the detection count starts only with the second or
higher move for the address.
If an address is undergoing a mobility
event between remote VTEPs, duplicate detection is not started.
The following illustration shows VTEP-A, VTEP-B, and VTEP-C in an EVPN
configuration. Duplicate address detection triggers on VTEP-A when there
is a duplicate MAC address for two hosts attached to VTEP-A and VTEP-B.
However, duplicate detection does not trigger on VTEP-A when mobility
events occur between two remote VTEPs (VTEP-B and VTEP-C).
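The rules in this section can be modeled in a few lines of Python. This is a simplified sketch for one address on one VTEP, not the FRR implementation: the class and method names are hypothetical, moves between remote VTEPs are ignored entirely, and the counter is assumed to reset when the detection window expires. The defaults follow the values stated above (five moves within 180 seconds).

```python
class DuplicateDetector:
    """Simplified model of duplicate address detection for one address on one VTEP."""

    def __init__(self, max_moves: int = 5, window: float = 180.0):
        self.max_moves = max_moves
        self.window = window
        self.started_at = None  # time of the remote-to-local move that started the timer
        self.count = 0
        self.duplicate = False

    def on_move(self, now: float, old: str, new: str) -> bool:
        """Record a move between 'local' and 'remote'; return True once flagged."""
        if self.duplicate:
            return True
        if old == "remote" and new == "remote":
            return False  # moves between remote VTEPs are ignored in this model
        if self.started_at is not None and now - self.started_at > self.window:
            self.started_at, self.count = None, 0  # assumed: window expired, start over
        if self.started_at is None:
            if (old, new) != ("remote", "local"):
                return False  # detection only starts on a remote-to-local move
            self.started_at = now
        self.count += 1
        self.duplicate = self.count >= self.max_moves
        return self.duplicate


# An address that flaps between a remote VTEP and this VTEP every 10 seconds.
detector = DuplicateDetector()
moves = [("remote", "local"), ("local", "remote")] * 3
flags = [detector.on_move(10 * i, old, new) for i, (old, new) in enumerate(moves)]
print(flags)
```

In this model, an address that starts out local is not counted until it has first moved to a remote VTEP and back, which matches the description above that counting starts with the second or later move.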
Configure Duplicate Address Detection
To change the threshold for MAC and IP address moves, run the net add bgp l2vpn evpn dup-addr-detection max-moves <number-of-events> time <duration> command. You can specify max-moves to be between 2 and
1000 and time to be between 2 and 1800 seconds.
The following example command sets the maximum number of address moves
allowed to 10 and the duplicate address detection time interval to 1200
seconds.
cumulus@switch:~$ net add bgp l2vpn evpn dup-addr-detection max-moves 10 time 1200
The following example shows the syslog message that is generated when
Cumulus Linux detects a MAC address as a duplicate during a local update:
2018/11/06 18:55:29.463327 ZEBRA: [EC 4043309149] VNI 1001: MAC 00:01:02:03:04:11 detected as duplicate during local update, last VTEP 172.16.0.16
The following example shows the syslog message that is generated when Cumulus
Linux detects an IP address as a duplicate during a remote update:
2018/11/09 22:47:15.071381 ZEBRA: [EC 4043309151] VNI 1002: MAC aa:22:aa:aa:aa:aa IP 10.0.0.9 detected as duplicate during remote update, from VTEP 172.16.0.16
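If you collect syslog centrally, a small parser can turn these messages into structured events. The following Python sketch is written against the two sample messages above only; the regular expression and the function name are assumptions, and other message variants might need additional patterns.

```python
import re

# Pattern derived from the two sample messages above (an assumption, not a spec).
DUP_RE = re.compile(
    r"VNI (?P<vni>\d+): MAC (?P<mac>[0-9a-fA-F:]{17})"
    r"(?: IP (?P<ip>\S+))? detected as duplicate during (?P<update>local|remote) update,"
    r" (?:last|from) VTEP (?P<vtep>\S+)"
)


def parse_duplicate_log(line: str):
    """Return a dict describing a duplicate address event, or None if no match."""
    match = DUP_RE.search(line)
    if match is None:
        return None
    event = match.groupdict()
    event["vni"] = int(event["vni"])
    return event


sample = ("2018/11/09 22:47:15.071381 ZEBRA: [EC 4043309151] VNI 1002: "
          "MAC aa:22:aa:aa:aa:aa IP 10.0.0.9 detected as duplicate during "
          "remote update, from VTEP 172.16.0.16")
print(parse_duplicate_log(sample))
```

The ip field is None for messages that report only a MAC address.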
Freeze a Detected Duplicate Address
Cumulus Linux 3.7.3 and later provides a freeze option that takes
action on a detected duplicate address. You can freeze the address
permanently (until you intervene) or for a defined amount of time,
after which it is cleared automatically.
When you enable the freeze option and a duplicate address is detected:
If the MAC or IP address is learned from a remote VTEP at the time
it is frozen, the forwarding information in the kernel and hardware
is not updated, leaving it in the prior state. Any future remote
updates are processed but they are not reflected in the kernel
entry. If the remote VTEP sends a MAC-IP route withdrawal, the local
VTEP removes the frozen remote entry. Then, if the local VTEP has a
locally-learned entry already present in its kernel, FRR will
originate a corresponding MAC-IP route and advertise it to all
remote VTEPs.
If the MAC or IP address is locally learned on this VTEP at the time
it is frozen, the address is not advertised to remote VTEPs. Future
local updates are processed but are not advertised to remote VTEPs.
If FRR receives a local entry delete event, the frozen entry is
removed from the FRR database. Any remote updates (from other VTEPs)
change the state of the entry to remote but the entry is not
installed in the kernel (until cleared).
To recover from a freeze, shut down the faulty host or VM or fix any
other misconfiguration in the network. If the address is frozen
permanently, issue the clear command
on the VTEP where the address is marked as duplicate. If the address is
frozen for a defined period of time, it is cleared automatically after
the timer expires (you can clear the duplicate address before the timer
expires with the clear command).
If you issue the clear command or the timer expires
before you address the fault, duplicate address detection might occur repeatedly.
After you clear a frozen address, if it is present behind a remote VTEP,
the kernel and hardware forwarding tables are updated. If the address is
locally learned on this VTEP, the address is advertised to remote VTEPs.
All VTEPs get the correct address as soon as the host communicates.
Silent hosts are learned only after the faulty entries age out, or you
intervene and clear the faulty MAC and ARP table entries.
Configure the Freeze Option
To enable Cumulus Linux to freeze detected duplicate addresses, run
the net add bgp l2vpn evpn dup-addr-detection freeze <duration>|permanent command.
The duration can be any number of seconds between 30 and 3600.
The following example command freezes duplicate addresses for
1000 seconds, after which they are cleared automatically:
cumulus@switch:~$ net add bgp l2vpn evpn dup-addr-detection freeze 1000
Set the freeze timer to be three times the duplicate address detection window. For example, if the duplicate address detection window is set to the default of 180 seconds, set the freeze timer to 540 seconds.
The following example command freezes duplicate addresses permanently
(until you issue the clear command):
cumulus@switch:~$ net add bgp l2vpn evpn dup-addr-detection freeze permanent
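The recommendation to set the freeze timer to three times the detection window interacts with the documented ranges (2 to 1800 seconds for the window, 30 to 3600 seconds for the freeze duration). The following Python sketch computes the recommended value and rejects results outside the accepted range. The function names are hypothetical, and raising an error rather than clamping is a choice made for this example.

```python
def recommended_freeze_seconds(detection_window: int = 180) -> int:
    """Return three times the detection window, checked against the documented ranges."""
    if not 2 <= detection_window <= 1800:
        raise ValueError("detection window must be between 2 and 1800 seconds")
    freeze = 3 * detection_window
    if not 30 <= freeze <= 3600:
        raise ValueError(f"{freeze} is outside the 30-3600 second range of the freeze option")
    return freeze


def freeze_command(detection_window: int = 180) -> str:
    """Build the NCLU command string for the recommended freeze duration."""
    seconds = recommended_freeze_seconds(detection_window)
    return f"net add bgp l2vpn evpn dup-addr-detection freeze {seconds}"


print(freeze_command())
```

For detection windows longer than 1200 seconds, three times the window exceeds 3600 seconds, so the recommendation cannot be met with a timed freeze.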
Clear Duplicate Addresses
To clear a duplicate MAC or IP address (and unfreeze a frozen address),
run the net clear evpn dup-addr vni <vni_id> ip <mac/ip address>
command. The following example command clears IP address 10.0.0.9 for
VNI 101.
cumulus@switch:~$ net clear evpn dup-addr vni 101 ip 10.0.0.9
To clear duplicate addresses for all VNIs, run the following command:
cumulus@switch:~$ net clear evpn dup-addr vni all
In an MLAG configuration, you need to run the clear command on both the MLAG
primary and secondary switch.
When you clear a duplicate MAC address, all its associated IP addresses
are also cleared. However, you cannot clear an associated IP address if
its MAC address is still in a duplicate state.
Disable Duplicate Address Detection
By default, duplicate address detection is enabled and a syslog error is
generated when a duplicate address is detected. To disable duplicate
address detection, run the following command.
cumulus@switch:~$ net del bgp l2vpn evpn dup-addr-detection
When you disable duplicate address detection, Cumulus Linux clears the
configuration and all existing duplicate addresses.
Show Detected Duplicate Address Information
During the duplicate address detection process, you can see the start
time and current detection count with the net show evpn mac vni <vni_id> mac <mac_addr> command. The following command example shows
that detection started for MAC address 00:01:02:03:04:11 for VNI 1001 on
Tuesday, Nov 6 at 18:55:05 and the number of moves detected is 1.
cumulus@switch:~$ net show evpn mac vni 1001 mac 00:01:02:03:04:11
MAC: 00:01:02:03:04:11
Intf: hostbond3(15) VLAN: 1001
Local Seq: 1 Remote Seq: 0
Duplicate detection started at Tue Nov 6 18:55:05 2018, detection count 1
Neighbors:
10.0.1.26 Active
After the MAC address is flagged as a duplicate, the net show evpn mac vni <vni_id> mac <mac_addr> command shows:
MAC: 00:01:02:03:04:11
Remote VTEP: 172.16.0.16
Local Seq: 13 Remote Seq: 14
Duplicate, detected at Tue Nov 6 18:55:29 2018
Neighbors:
10.0.1.26 Active
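The two outputs above differ in the line that begins with Duplicate: one reports a detection in progress with its count, the other reports an address already flagged. The following Python sketch extracts that state from captured command output. The function name and the returned keys are hypothetical, and the patterns are based only on the two samples above.

```python
import re


def duplicate_state(output: str) -> dict:
    """Extract duplicate detection details from 'net show evpn mac vni ... mac ...' output."""
    started = re.search(r"Duplicate detection started at (.+), detection count (\d+)", output)
    flagged = re.search(r"Duplicate, detected at (.+)", output)
    return {
        "detection_started": started.group(1).strip() if started else None,
        "detection_count": int(started.group(2)) if started else None,
        "duplicate_since": flagged.group(1).strip() if flagged else None,
    }


DETECTING = """MAC: 00:01:02:03:04:11
 Intf: hostbond3(15) VLAN: 1001
 Local Seq: 1 Remote Seq: 0
 Duplicate detection started at Tue Nov 6 18:55:05 2018, detection count 1
"""
FLAGGED = """MAC: 00:01:02:03:04:11
 Remote VTEP: 172.16.0.16
 Local Seq: 13 Remote Seq: 14
 Duplicate, detected at Tue Nov 6 18:55:29 2018
"""
print(duplicate_state(DETECTING))
print(duplicate_state(FLAGGED))
```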
To display information for a duplicate IP address, run the net show evpn arp-cache vni <vni_id> ip <ip_addr> command. The following
command example shows information for IP address 10.0.0.9 for VNI 1001.
cumulus@switch:~$ net show evpn arp-cache vni 1001 ip 10.0.0.9
IP: 10.0.0.9
Type: remote
State: inactive
MAC: 00:01:02:03:04:11
Remote VTEP: 10.0.0.34
Local Seq: 0 Remote Seq: 14
Duplicate, detected at Tue Nov 6 18:55:29 2018
To show a list of MAC addresses detected as duplicate for a specific VNI
or for all VNIs, run the net show evpn mac vni <vni-id|all> duplicate
command. The following example command shows a list of duplicate MAC
addresses for VNI 1001:
cumulus@switch:~$ net show evpn mac vni 1001 duplicate
Number of MACs (local and remote) known for this VNI: 16
MAC Type Intf/Remote VTEP VLAN
aa:bb:cc:dd:ee:ff local hostbond3 1001
To show a list of IP addresses detected as duplicate for a specific VNI
or for all VNIs, run the net show evpn arp-cache vni <vni-id|all> duplicate command. The following example command shows a list of
duplicate IP addresses for VNI 1001:
cumulus@switch:~$ net show evpn arp-cache vni 1001 duplicate
Number of ARPs (local and remote) known for this VNI: 20
IP                Type   State    MAC                Remote VTEP
10.0.0.8          local  active   aa:11:aa:aa:aa:aa
10.0.0.9          local  active   aa:11:aa:aa:aa:aa
10.10.0.12        remote active   aa:22:aa:aa:aa:aa  172.16.0.16
To show configured duplicate address detection parameters, run the net show evpn command:
cumulus@switch:~$ net show evpn
L2 VNIs: 4
L3 VNIs: 2
Advertise gateway mac-ip: No
Duplicate address detection: Enable
Detection max-moves 7, time 300
Detection freeze permanent
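To audit these settings across many switches, you can parse the captured net show evpn output. The following Python sketch reads the three detection lines shown above. The function name and dictionary keys are hypothetical, and treating any value other than Enable as disabled is an assumption.

```python
import re


def detection_settings(output: str) -> dict:
    """Extract duplicate address detection settings from 'net show evpn' output."""
    enabled = re.search(r"Duplicate address detection: (\w+)", output)
    limits = re.search(r"Detection max-moves (\d+), time (\d+)", output)
    freeze = re.search(r"Detection freeze (\S+)", output)
    return {
        # Assumption: any value other than "Enable" means detection is off.
        "enabled": bool(enabled) and enabled.group(1) == "Enable",
        "max_moves": int(limits.group(1)) if limits else None,
        "time": int(limits.group(2)) if limits else None,
        "freeze": freeze.group(1) if freeze else None,
    }


SHOW_EVPN = """L2 VNIs: 4
L3 VNIs: 2
Advertise gateway mac-ip: No
Duplicate address detection: Enable
 Detection max-moves 7, time 300
 Detection freeze permanent
"""
print(detection_settings(SHOW_EVPN))
```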
EVPN Operational Commands
General Linux Commands Related to EVPN
You can use various iproute2 commands to examine links, VLAN mappings
and the bridge MAC forwarding database known to the Linux kernel. You
can also use these commands to examine the neighbor cache and the
routing table (for the underlay or for a specific tenant VRF). Some of
the key commands are:
ip [-d] link show
bridge link show
bridge vlan show
bridge [-s] fdb show
ip neighbor show
ip route show [table <vrf-name>]
A sample output of ip -d link show type vxlan is shown below for one
VXLAN interface. Some relevant parameters are the VNI value, the state,
the local IP address for the VXLAN tunnel, the UDP port number (4789)
and the bridge that the interface is part of (bridge in the example
below). The output also shows that MAC learning is disabled (off) on
the VXLAN interface.
cumulus@leaf01:~$ ip -d link show type vxlan
9: vni100: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master bridge state UNKNOWN mode DEFAULT group default
link/ether 72:bc:b4:a3:eb:1e brd ff:ff:ff:ff:ff:ff promiscuity 1
vxlan id 10100 local 10.0.0.1 srcport 0 0 dstport 4789 nolearning ageing 300
bridge_slave state forwarding priority 8 cost 100 hairpin off guard off root_block off fastleave off learning off flood on port_id 0x8001 port_no 0x1 designated_port 32769 designated_cost 0 designated_bridge 8000.0:1:0:0:11:0 designated_root 8000.0:1:0:0:11:0 hold_timer 0.00 message_age_timer 0.00 forward_delay_timer 0.00 topology_change_ack 0 config_pending 0 proxy_arp off proxy_arp_wifi off mcast_router 1 mcast_fast_leave off mcast_flood on neigh_suppress on group_fwd_mask 0x0 group_fwd_mask_str 0x0 group_fwd_maskhi 0x0 group_fwd_maskhi_str 0x0 addrgenmode eui64
...
cumulus@leaf01:~$
A sample output of bridge fdb show is depicted below. Some interesting
information from this output includes:
swp3 and swp4 are access ports with VLAN ID 100. This is mapped to
VXLAN interface vni100.
00:02:00:00:00:01 is a local host MAC learned on swp3.
The remote VTEPs which participate in VLAN ID 100 are 10.0.0.3,
10.0.0.4 and 10.0.0.2. This is evident from the FDB entries with a
MAC address of 00:00:00:00:00:00. These entries are used for BUM
traffic replication.
00:02:00:00:00:06 is a remote host MAC reachable over the VXLAN
tunnel to 10.0.0.2.
cumulus@leaf01:~$ bridge fdb show
00:02:00:00:00:13 dev swp3 master bridge permanent
00:02:00:00:00:01 dev swp3 vlan 100 master bridge
00:02:00:00:00:02 dev swp4 vlan 100 master bridge
72:bc:b4:a3:eb:1e dev vni100 master bridge permanent
00:02:00:00:00:06 dev vni100 vlan 100 offload master bridge
00:00:00:00:00:00 dev vni100 dst 10.0.0.3 self permanent
00:00:00:00:00:00 dev vni100 dst 10.0.0.4 self permanent
00:00:00:00:00:00 dev vni100 dst 10.0.0.2 self permanent
00:02:00:00:00:06 dev vni100 dst 10.0.0.2 self offload
...
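As described above, the FDB entries with a MAC address of 00:00:00:00:00:00 identify the remote VTEPs used for BUM traffic replication. The following Python sketch collects those destinations per VXLAN interface from captured bridge fdb show output. The function name is hypothetical, and the parsing relies only on the dev and dst keywords visible in the sample.

```python
from collections import defaultdict

ALL_ZERO_MAC = "00:00:00:00:00:00"


def flood_vteps(fdb_output: str) -> dict:
    """Map each VXLAN device to the remote VTEPs listed in its all-zero MAC entries."""
    vteps = defaultdict(list)
    for line in fdb_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != ALL_ZERO_MAC:
            continue
        if "dev" not in fields[:-1] or "dst" not in fields[:-1]:
            continue  # the keyword must be followed by a value
        device = fields[fields.index("dev") + 1]
        vteps[device].append(fields[fields.index("dst") + 1])
    return dict(vteps)


FDB = """00:02:00:00:00:01 dev swp3 vlan 100 master bridge
00:02:00:00:00:06 dev vni100 vlan 100 offload master bridge
00:00:00:00:00:00 dev vni100 dst 10.0.0.3 self permanent
00:00:00:00:00:00 dev vni100 dst 10.0.0.4 self permanent
00:00:00:00:00:00 dev vni100 dst 10.0.0.2 self permanent
00:02:00:00:00:06 dev vni100 dst 10.0.0.2 self offload
"""
print(flood_vteps(FDB))
```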
A sample output of ip neigh show is shown below. Some interesting
information from this output includes:
172.16.120.11 is a locally-attached host on VLAN 100. It is shown
twice because of the configuration of the anycast IP/MAC on the switch.
172.16.120.42 is a remote host on VLAN 100 and 172.16.130.23 is a
remote host on VLAN 200. The MAC address of these hosts can be
examined using the bridge fdb show command described earlier to
determine the VTEPs behind which these hosts are located.
cumulus@leaf01:~$ ip neigh show
172.16.120.11 dev vlan100-v0 lladdr 00:02:00:00:00:01 STALE
172.16.120.42 dev vlan100 lladdr 00:02:00:00:00:0e offload REACHABLE
172.16.130.23 dev vlan200 lladdr 00:02:00:00:00:07 offload REACHABLE
172.16.120.11 dev vlan100 lladdr 00:02:00:00:00:01 REACHABLE
...
In Cumulus Linux 3.7.11 and later, you can use the NCLU net show neighbor command.
General BGP Operational Commands Relevant to EVPN
The following commands are not unique to EVPN but help troubleshoot
connectivity and route propagation. If BGP is used for the underlay
routing, you can view a summary of the layer 3 fabric connectivity by
running the net show bgp summary command:
cumulus@leaf01:~$ net show bgp summary
show bgp ipv4 unicast summary
=============================
BGP router identifier 10.0.0.1, local AS number 65001 vrf-id 0
BGP table version 9
RIB entries 11, using 1496 bytes of memory
Peers 2, using 42 KiB of memory
Peer groups 1, using 72 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
s1(swp49s0) 4 65100 43 49 0 0 0 02:04:00 4
s2(swp49s1) 4 65100 43 49 0 0 0 02:03:59 4
Total number of neighbors 2
show bgp ipv6 unicast summary
=============================
No IPv6 neighbor is configured
show bgp evpn summary
=====================
BGP router identifier 10.0.0.1, local AS number 65001 vrf-id 0
BGP table version 0
RIB entries 15, using 2040 bytes of memory
Peers 2, using 42 KiB of memory
Peer groups 1, using 72 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
s1(swp49s0) 4 65100 43 49 0 0 0 02:04:00 30
s2(swp49s1) 4 65100 43 49 0 0 0 02:03:59 30
Total number of neighbors 2
You can examine the underlay routing, which determines how remote VTEPs
are reached. Run the net show route command. Here is some sample
output from a leaf switch:
cumulus@leaf01:~$ net show route
show ip route
=============
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR,
> - selected route, * - FIB route
C>* 10.0.0.11/32 is directly connected, lo, 19:48:21
B>* 10.0.0.12/32 [20/0] via fe80::4638:39ff:fe00:54, swp51, 19:48:03
* via fe80::4638:39ff:fe00:25, swp52, 19:48:03
B>* 10.0.0.13/32 [20/0] via fe80::4638:39ff:fe00:54, swp51, 19:48:03
* via fe80::4638:39ff:fe00:25, swp52, 19:48:03
B>* 10.0.0.14/32 [20/0] via fe80::4638:39ff:fe00:54, swp51, 19:48:03
* via fe80::4638:39ff:fe00:25, swp52, 19:48:03
B>* 10.0.0.21/32 [20/0] via fe80::4638:39ff:fe00:54, swp51, 19:48:04
B>* 10.0.0.22/32 [20/0] via fe80::4638:39ff:fe00:25, swp52, 19:48:03
B>* 10.0.0.41/32 [20/0] via fe80::4638:39ff:fe00:54, swp51, 19:48:03
* via fe80::4638:39ff:fe00:25, swp52, 19:48:03
B>* 10.0.0.42/32 [20/0] via fe80::4638:39ff:fe00:54, swp51, 19:48:03
* via fe80::4638:39ff:fe00:25, swp52, 19:48:03
C>* 10.0.0.112/32 is directly connected, lo, 19:48:21
B>* 10.0.0.134/32 [20/0] via fe80::4638:39ff:fe00:54, swp51, 19:48:03
* via fe80::4638:39ff:fe00:25, swp52, 19:48:03
C>* 169.254.1.0/30 is directly connected, peerlink.4094, 19:48:21
show ipv6 route
===============
Codes: K - kernel route, C - connected, S - static, R - RIPng,
O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
v - VNC, V - VNC-Direct, A - Babel, D - SHARP, F - PBR,
> - selected route, * - FIB route
C * fe80::/64 is directly connected, bridge, 19:48:21
C * fe80::/64 is directly connected, peerlink.4094, 19:48:21
C * fe80::/64 is directly connected, swp52, 19:48:21
C>* fe80::/64 is directly connected, swp51, 19:48:21
cumulus@leaf01:~$
You can view the MAC forwarding database on the switch by running the
net show bridge macs command.
You can see the BGP peers participating in the layer 2 VPN/EVPN
address-family and their states using the net show bgp l2vpn evpn summary command. The following sample output from a leaf switch shows
eBGP peering with two spine switches for exchanging EVPN routes; both
peering sessions are in the established state.
cumulus@leaf01:~$ net show bgp l2vpn evpn summary
BGP router identifier 10.0.0.1, local AS number 65001 vrf-id 0
BGP table version 0
RIB entries 15, using 2280 bytes of memory
Peers 2, using 39 KiB of memory
Peer groups 1, using 64 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
s1(swp1) 4 65100 103 107 0 0 0 1d02h08m 30
s2(swp2) 4 65100 103 107 0 0 0 1d02h08m 30
Total number of neighbors 2
cumulus@leaf01:~$
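In the summary output, a number in the State/PfxRcd column means the session is established and shows how many prefixes were received; otherwise the column shows the session state. The following Python sketch applies that convention to captured output. The function name is hypothetical, and the third neighbor row in the sample data is invented to illustrate a session that is not established.

```python
def received_prefixes(summary: str) -> dict:
    """Map each neighbor to its received prefix count, or None if not established."""
    peers = {}
    for line in summary.splitlines():
        fields = line.split()
        # Neighbor rows have at least 10 columns with numeric V and AS values.
        if len(fields) < 10 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        state = " ".join(fields[9:])
        peers[fields[0]] = int(state) if state.isdigit() else None
    return peers


SUMMARY = """BGP router identifier 10.0.0.1, local AS number 65001 vrf-id 0
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
s1(swp1) 4 65100 103 107 0 0 0 1d02h08m 30
s2(swp2) 4 65100 103 107 0 0 0 1d02h08m 30
s3(swp3) 4 65100 0 0 0 0 0 never Active
Total number of neighbors 3
"""
print(received_prefixes(SUMMARY))
```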
Display VNIs in EVPN
Run the net show evpn vni command to display the configured VNIs
on a network device participating in BGP EVPN. This command is only
relevant on a VTEP. If symmetric routing is configured, this command
displays the special layer 3 VNIs that are configured per tenant VRF.
The following example from a leaf switch shows two layer 2 VNIs (10100
and 10200) as well as a layer 3 VNI (104001). For layer 2 VNIs, the
number of associated MAC and neighbor entries is shown. The VXLAN
interface and VRF corresponding to each VNI are also shown.
cumulus@leaf01:~$ net show evpn vni
VNI        Type VxLAN IF   # MACs   # ARPs   # Remote VTEPs  Tenant VRF
10200      L2   vni200     8        12       3               vrf1
10100      L2   vni100     8        12       3               vrf1
104001     L3   vni4001    3        3        n/a             vrf1
cumulus@leaf01:~$
You can examine the EVPN information for a specific VNI in detail. The
following output shows details for the layer 2 VNI 10100 as well as for
the layer 3 VNI 104001. For the layer 2 VNI, the remote VTEPs which have
that VNI are shown. For the layer 3 VNI, the router MAC and associated
layer 2 VNIs are shown. The state of the layer 3 VNI depends on the
state of its associated VRF as well as the states of its underlying
VXLAN interface and SVI.
cumulus@leaf01:~$ net show evpn vni 10100
VNI: 10100
Type: L2
Tenant VRF: vrf1
VxLAN interface: vni100
VxLAN ifIndex: 9
Local VTEP IP: 10.0.0.1
Remote VTEPs for this VNI:
10.0.0.2
10.0.0.4
10.0.0.3
Number of MACs (local and remote) known for this VNI: 8
Number of ARPs (IPv4 and IPv6, local and remote) known for this VNI: 12
Advertise-gw-macip: No
cumulus@leaf01:~$
cumulus@leaf01:~$ net show evpn vni 104001
VNI: 104001
Type: L3
Tenant VRF: vrf1
Local Vtep Ip: 10.0.0.1
Vxlan-Intf: vni4001
SVI-If: vlan4001
State: Up
Router MAC: 00:01:00:00:11:00
L2 VNIs: 10100 10200
cumulus@leaf01:~$
Examine Local and Remote MAC Addresses for a VNI in EVPN
Run net show evpn mac vni <vni> to examine all local and remote MAC
addresses for a VNI. This command is only relevant for a layer 2 VNI:
cumulus@leaf01:~$ net show evpn mac vni 10100
Number of MACs (local and remote) known for this VNI: 8
MAC               Type   Intf/Remote VTEP      VLAN
00:02:00:00:00:0e remote 10.0.0.4
00:02:00:00:00:06 remote 10.0.0.2
00:02:00:00:00:05 remote 10.0.0.2
00:02:00:00:00:02 local  swp4                  100
00:00:5e:00:01:01 local  vlan100-v0            100
00:02:00:00:00:09 remote 10.0.0.3
00:01:00:00:11:00 local  vlan100               100
00:02:00:00:00:01 local  swp3                  100
00:02:00:00:00:0a remote 10.0.0.3
00:02:00:00:00:0d remote 10.0.0.4
cumulus@leaf01:~$
Run the net show evpn mac vni all command to examine MAC addresses for
all VNIs.
You can examine the details for a specific MAC address or query all
remote MAC addresses behind a specific VTEP:
cumulus@leaf01:~$ net show evpn mac vni 10100 mac 00:02:00:00:00:02
MAC: 00:02:00:00:00:02
Intf: swp4(6) VLAN: 100
Local Seq: 0 Remote Seq: 0
Neighbors:
172.16.120.12 Active
cumulus@leaf01:~$ net show evpn mac vni 10100 mac 00:02:00:00:00:05
MAC: 00:02:00:00:00:05
Remote VTEP: 10.0.0.2
Neighbors:
172.16.120.21
cumulus@leaf01:~$ net show evpn mac vni 10100 vtep 10.0.0.3
VNI 10100
MAC Type Intf/Remote VTEP VLAN
00:02:00:00:00:09 remote 10.0.0.3
00:02:00:00:00:0a remote 10.0.0.3
cumulus@leaf01:~$
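Instead of querying one VTEP at a time, you can group the full per-VNI MAC table by remote VTEP offline. The following Python sketch does this for captured net show evpn mac vni output. The function name is hypothetical, and the sample data is a subset of the table shown earlier in this section.

```python
import re

MAC_RE = re.compile(r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$", re.IGNORECASE)


def remote_macs_by_vtep(output: str) -> dict:
    """Group remote MAC addresses by the VTEP behind which they were learned."""
    grouped = {}
    for line in output.splitlines():
        fields = line.split()
        # Remote rows have the form: <MAC> remote <VTEP IP>
        if len(fields) >= 3 and MAC_RE.match(fields[0]) and fields[1] == "remote":
            grouped.setdefault(fields[2], []).append(fields[0])
    return grouped


MAC_TABLE = """MAC Type Intf/Remote VTEP VLAN
00:02:00:00:00:0e remote 10.0.0.4
00:02:00:00:00:06 remote 10.0.0.2
00:02:00:00:00:02 local swp4 100
00:02:00:00:00:09 remote 10.0.0.3
00:02:00:00:00:0a remote 10.0.0.3
"""
print(remote_macs_by_vtep(MAC_TABLE))
```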
Examine Local and Remote Neighbors for a VNI in EVPN
Run the net show evpn arp-cache vni <vni> command to examine all local
and remote neighbors (ARP entries) for a VNI. This command is only
relevant for a layer 2 VNI and the output shows both IPv4 and IPv6
neighbor entries:
cumulus@leaf01:~$ net show evpn arp-cache vni 10100
Number of ARPs (local and remote) known for this VNI: 12
IP                      Type   MAC               Remote VTEP
172.16.120.11           local  00:02:00:00:00:01
172.16.120.12           local  00:02:00:00:00:02
172.16.120.22           remote 00:02:00:00:00:06 10.0.0.2
fe80::201:ff:fe00:1100  local  00:01:00:00:11:00
172.16.120.1            local  00:01:00:00:11:00
172.16.120.31           remote 00:02:00:00:00:09 10.0.0.3
fe80::200:5eff:fe00:101 local  00:00:5e:00:01:01
...
Run the net show evpn arp-cache vni all command to examine neighbor
entries for all VNIs.
Examine Remote Router MACs in EVPN
When symmetric routing is deployed, run the net show evpn rmac vni <vni> command to examine the router MACs corresponding to all remote
VTEPs. This command is only relevant for a layer 3 VNI:
cumulus@leaf01:~$ net show evpn rmac vni 104001
Number of Remote RMACs known for this VNI: 3
MAC Remote VTEP
00:01:00:00:14:00 10.0.0.4
00:01:00:00:12:00 10.0.0.2
00:01:00:00:13:00 10.0.0.3
cumulus@leaf01:~$
Run the net show evpn rmac vni all command to examine router MACs
for all layer 3 VNIs.
Examine Gateway Next Hops in EVPN
When symmetric routing is deployed, you can run the net show evpn next-hops vni <vni> command to examine the gateway next hops. This
command is only relevant for a layer 3 VNI. In general, the gateway next
hop IP addresses correspond to the remote VTEP IP addresses. Remote host
and prefix routes are installed using these next hops:
cumulus@leaf01:~$ net show evpn next-hops vni 104001
Number of NH Neighbors known for this VNI: 3
IP RMAC
10.0.0.3 00:01:00:00:13:00
10.0.0.4 00:01:00:00:14:00
10.0.0.2 00:01:00:00:12:00
cumulus@leaf01:~$
Run the net show evpn next-hops vni all command to examine gateway
next hops for all layer 3 VNIs.
You can query a specific next hop; the output displays the remote host
and prefix routes through this next hop:
cumulus@leaf01:~$ net show evpn next-hops vni 104001 ip 10.0.0.4
Ip: 10.0.0.4
RMAC: 00:01:00:00:14:00
Refcount: 4
Prefixes:
172.16.120.41/32
172.16.120.42/32
172.16.130.43/32
172.16.130.44/32
cumulus@leaf01:~$
Display the VRF Routing Table in FRR
Run the net show route vrf <vrf-name> command to examine the VRF
routing table. This command is not specific to EVPN. In the context of
EVPN, this command is relevant when symmetric routing is deployed and
can be used to verify that remote host and prefix routes are installed
in the VRF routing table and point to the appropriate gateway next hop.
cumulus@leaf01:~$ net show route vrf vrf1
show ip route vrf vrf1
=======================
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, P - PIM, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel,
> - selected route, * - FIB route
VRF vrf1:
K * 0.0.0.0/0 [255/8192] unreachable (ICMP unreachable), 1d02h42m
C * 172.16.120.0/24 is directly connected, vlan100-v0, 1d02h42m
C>* 172.16.120.0/24 is directly connected, vlan100, 1d02h42m
B>* 172.16.120.21/32 [20/0] via 10.0.0.2, vlan4001 onlink, 1d02h41m
B>* 172.16.120.22/32 [20/0] via 10.0.0.2, vlan4001 onlink, 1d02h41m
B>* 172.16.120.31/32 [20/0] via 10.0.0.3, vlan4001 onlink, 1d02h41m
B>* 172.16.120.32/32 [20/0] via 10.0.0.3, vlan4001 onlink, 1d02h41m
B>* 172.16.120.41/32 [20/0] via 10.0.0.4, vlan4001 onlink, 1d02h41m
...
In the output above, the next hops for these routes are specified by
EVPN to be onlink, or reachable over the specified SVI. This is
necessary because this interface is not required to have an IP address.
Even if the interface is configured with an IP address, the next hop is
not on the same subnet as it is usually the IP address of the remote
VTEP (part of the underlay IP network).
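To verify that remote host routes point to the expected gateway next hop, you can extract the onlink BGP routes from captured net show route vrf output. The following Python sketch is based on the line format in the sample above; the function name and the regular expression are assumptions.

```python
import re

# Matches selected, installed BGP routes with an onlink next hop, as in the sample above.
ONLINK_RE = re.compile(
    r"^\s*B>\*\s+(?P<prefix>\S+)\s+\[\d+/\d+\]\s+via\s+(?P<nexthop>[^,\s]+),"
    r"\s+(?P<interface>\S+)\s+onlink",
    re.MULTILINE,
)


def onlink_routes(output: str) -> dict:
    """Map each prefix to its (next hop, interface) pair for onlink BGP routes."""
    return {
        m.group("prefix"): (m.group("nexthop"), m.group("interface"))
        for m in ONLINK_RE.finditer(output)
    }


VRF_ROUTES = """K * 0.0.0.0/0 [255/8192] unreachable (ICMP unreachable), 1d02h42m
C>* 172.16.120.0/24 is directly connected, vlan100, 1d02h42m
B>* 172.16.120.21/32 [20/0] via 10.0.0.2, vlan4001 onlink, 1d02h41m
B>* 172.16.120.31/32 [20/0] via 10.0.0.3, vlan4001 onlink, 1d02h41m
"""
print(onlink_routes(VRF_ROUTES))
```

Each next hop returned here should match a gateway next hop listed by net show evpn next-hops vni for the layer 3 VNI of the VRF.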
Display the Global BGP EVPN Routing Table
Run the net show bgp l2vpn evpn route command to display all EVPN
routes, both local and remote. The routes are grouped by route
distinguisher (RD) because the table spans all VNIs and VRFs:
cumulus@leaf01:~$ net show bgp l2vpn evpn route
BGP table version is 0, local router ID is 10.0.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 10.0.0.1:1
*> [2]:[0]:[0]:[48]:[00:02:00:00:00:01]
10.0.0.1 32768 i
*> [2]:[0]:[0]:[48]:[00:02:00:00:00:01]:[32]:[172.16.120.11]
10.0.0.1 32768 i
*> [2]:[0]:[0]:[48]:[00:02:00:00:00:01]:[128]:[2001:172:16:120::11]
10.0.0.1 32768 i
*> [2]:[0]:[0]:[48]:[00:02:00:00:00:02]
10.0.0.1 32768 i
*> [2]:[0]:[0]:[48]:[00:02:00:00:00:02]:[32]:[172.16.120.12]
10.0.0.1 32768 i
*> [3]:[0]:[32]:[10.0.0.1]
10.0.0.1 32768 i
Route Distinguisher: 10.0.0.1:2
*> [2]:[0]:[0]:[48]:[00:02:00:00:00:01]
10.0.0.1 32768 i
*> [2]:[0]:[0]:[48]:[00:02:00:00:00:01]:[32]:[172.16.130.11]
10.0.0.1 32768 i
*> [2]:[0]:[0]:[48]:[00:02:00:00:00:02]
10.0.0.1 32768 i
*> [2]:[0]:[0]:[48]:[00:02:00:00:00:02]:[32]:[172.16.130.12]
10.0.0.1 32768 i
*> [3]:[0]:[32]:[10.0.0.1]
10.0.0.1 32768 i
...
You can filter the routing table based on EVPN route type. The available
options are shown below:
cumulus@leaf01:~$ net show bgp l2vpn evpn route type
macip : MAC-IP (Type-2) route
multicast : Multicast
prefix : An IPv4 or IPv6 prefix
cumulus@leaf01:~$
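The bracketed prefixes in the routing table follow the legend lines that these commands print, for example [2]:[ESI]:[EthTag]:[MAClen]:[MAC] with an optional [IPlen]:[IP] for type-2 routes. The following Python sketch splits a prefix into named fields according to those legends. The function and field names are hypothetical, and values are kept as strings.

```python
import re

# Field order per route type, taken from the legend lines in the command output.
FIELD_NAMES = {
    "2": ["esi", "eth_tag", "mac_len", "mac", "ip_len", "ip"],
    "3": ["eth_tag", "ip_len", "orig_ip"],
    "5": ["eth_tag", "ip_len", "ip"],
}


def parse_evpn_prefix(prefix: str) -> dict:
    """Split a bracketed EVPN prefix into a dict of named fields."""
    parts = re.findall(r"\[([^\]]*)\]", prefix)
    if not parts or parts[0] not in FIELD_NAMES:
        raise ValueError(f"unsupported EVPN prefix: {prefix!r}")
    names, values = FIELD_NAMES[parts[0]], parts[1:]
    if len(values) > len(names):
        raise ValueError(f"too many fields in EVPN prefix: {prefix!r}")
    route = {"type": int(parts[0])}
    route.update(zip(names, values))  # trailing optional fields are simply absent
    return route


print(parse_evpn_prefix("[2]:[0]:[0]:[48]:[00:02:00:00:00:01]:[32]:[172.16.120.11]"))
```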
Display a Specific EVPN Route
To drill down on a specific route for more information, run the net show bgp l2vpn evpn route rd <rd-value> command. This command displays
all EVPN routes with that RD and with the path attribute details for
each path. Additional filtering is possible based on route type or by
specifying the MAC and/or IP address. The following example shows a
specific MAC/IP route. The output shows that this remote host is behind
VTEP 10.0.0.4 and is reachable through two paths, one through each
spine switch. This example is from a symmetric routing deployment, so
the route shows both the layer 2 VNI (10200) and the layer 3 VNI
(104001) as well as the EVPN route target attributes corresponding to
each and the associated router MAC address.
cumulus@leaf01:~$ net show bgp l2vpn evpn route rd 10.0.0.4:3 mac 00:02:00:00:00:10 ip 172.16.130.44
BGP routing table entry for 10.0.0.4:3:[2]:[0]:[0]:[48]:[00:02:00:00:00:10]:[32]:[172.16.130.44]
Paths: (2 available, best #2)
Advertised to non peer-group peers:
s1(swp1) s2(swp2)
Route [2]:[0]:[0]:[48]:[00:02:00:00:00:10]:[32]:[172.16.130.44] VNI 10200/104001
65100 65004
10.0.0.4 from s2(swp2) (172.16.110.2)
Origin IGP, localpref 100, valid, external
Extended Community: RT:65004:10200 RT:65004:104001 ET:8 Rmac:00:01:00:00:14:00
AddPath ID: RX 0, TX 97
Last update: Sun Dec 17 20:57:24 2017
Route [2]:[0]:[0]:[48]:[00:02:00:00:00:10]:[32]:[172.16.130.44] VNI 10200/104001
65100 65004
10.0.0.4 from s1(swp1) (172.16.110.1)
Origin IGP, localpref 100, valid, external, bestpath-from-AS 65100, best
Extended Community: RT:65004:10200 RT:65004:104001 ET:8 Rmac:00:01:00:00:14:00
AddPath ID: RX 0, TX 71
Last update: Sun Dec 17 20:57:23 2017
Displayed 2 paths for requested prefix
cumulus@leaf01:~$
Only global VNIs are supported. Even though VNI values are exchanged
in the type-2 and type-5 routes, the received values are not used
when installing the routes into the forwarding plane; the local
configuration is used. You must ensure that the VLAN to VNI mappings
and the layer 3 VNI assignment for a tenant VRF are uniform
throughout the network.
If the remote host is dual attached, the next hop for the EVPN route is the anycast IP address of the remote MLAG pair, when MLAG is active.
The following example shows a prefix (type-5) route. Such a route has
only the layer 3 VNI and the route target corresponding to this VNI.
This route is learned through two paths, one through each spine switch.
cumulus@leaf01:~$ net show bgp l2vpn evpn route rd 172.16.100.2:3 type prefix
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]
BGP routing table entry for 172.16.100.2:3:[5]:[0]:[30]:[172.16.100.0]
Paths: (2 available, best #2)
Advertised to non peer-group peers:
s1(swp1) s2(swp2)
Route [5]:[0]:[30]:[172.16.100.0] VNI 104001
65100 65050
10.0.0.5 from s2(swp2) (172.16.110.2)
Origin incomplete, localpref 100, valid, external
Extended Community: RT:65050:104001 ET:8 Rmac:00:01:00:00:01:00
AddPath ID: RX 0, TX 112
Last update: Tue Dec 19 00:12:18 2017
Route [5]:[0]:[30]:[172.16.100.0] VNI 104001
65100 65050
10.0.0.5 from s1(swp1) (172.16.110.1)
Origin incomplete, localpref 100, valid, external, bestpath-from-AS 65100, best
Extended Community: RT:65050:104001 ET:8 Rmac:00:01:00:00:01:00
AddPath ID: RX 0, TX 71
Last update: Tue Dec 19 00:12:17 2017
Displayed 1 prefixes (2 paths) with this RD (of requested type)
cumulus@leaf01:~$
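The bracketed route keys in the output follow the legend lines printed by the command. As a minimal sketch, the following Python function splits such a key into named fields. The function name `parse_evpn_prefix` is an assumption for illustration; the field names come from the legend lines shown in the sample output.

```python
import re
from typing import Dict

# Field names per route type, copied from the legend lines in the command output.
EVPN_FIELDS = {
    "2": ["ESI", "EthTag", "MAClen", "MAC", "IPlen", "IP"],
    "3": ["EthTag", "IPlen", "OrigIP"],
    "5": ["EthTag", "IPlen", "IP"],
}


def parse_evpn_prefix(text: str) -> Dict[str, str]:
    """Split a key such as [5]:[0]:[30]:[172.16.100.0] into named fields."""
    values = re.findall(r"\[([^\]]*)\]", text)
    if not values or values[0] not in EVPN_FIELDS:
        raise ValueError(f"not a recognized EVPN route key: {text!r}")
    route_type, rest = values[0], values[1:]
    names = EVPN_FIELDS[route_type]
    if len(rest) > len(names):
        raise ValueError(f"too many fields for type-{route_type}: {text!r}")
    parsed = {"type": route_type}
    # A MAC-only type-2 route has no IPlen/IP fields; zip stops early for it.
    parsed.update(zip(names, rest))
    return parsed


prefix_route = parse_evpn_prefix("[5]:[0]:[30]:[172.16.100.0]")
mac_ip_route = parse_evpn_prefix(
    "[2]:[0]:[0]:[48]:[00:02:00:00:00:10]:[32]:[172.16.130.44]"
)
print(prefix_route)
print(mac_ip_route)
```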
Display the per-VNI EVPN Routing Table
Received EVPN routes are maintained in the global EVPN routing table
(described above), even if there are no appropriate local VNIs to
import them into. For example, a spine switch maintains the global
EVPN routing table even though there are no VNIs present on it. When
local VNIs are present, received EVPN routes are imported into the
per-VNI routing tables based on the route target attributes. You can
examine the per-VNI routing table with the net show bgp l2vpn evpn route vni <vni> command:
cumulus@leaf01:~$ net show bgp l2vpn evpn route vni 10110
BGP table version is 8, local router ID is 10.0.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
Network Next Hop Metric LocPrf Weight Path
*> [2]:[0]:[0]:[48]:[00:02:00:00:00:07]
10.0.0.1 32768 i
*> [2]:[0]:[0]:[48]:[00:02:00:00:00:07]:[32]:[172.16.120.11]
10.0.0.1 32768 i
*> [2]:[0]:[0]:[48]:[00:02:00:00:00:07]:[128]:[fe80::202:ff:fe00:7]
10.0.0.1 32768 i
*> [2]:[0]:[0]:[48]:[00:02:00:00:00:08]
10.0.0.1 32768 i
*> [2]:[0]:[0]:[48]:[00:02:00:00:00:08]:[32]:[172.16.120.12]
10.0.0.1 32768 i
*> [2]:[0]:[0]:[48]:[00:02:00:00:00:08]:[128]:[fe80::202:ff:fe00:8]
10.0.0.1 32768 i
*> [3]:[0]:[32]:[10.0.0.1]
10.0.0.1 32768 i
Displayed 7 prefixes (7 paths)
cumulus@leaf01:~$
To display the VNI routing table for all VNIs, run the net show bgp l2vpn evpn route vni all command.
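The import step described in this section can be sketched in Python as follows. This is a simplified model, not the FRR implementation: it assumes each local VNI has an explicit set of import route targets and uses exact string matching, and the names `import_routes`, `leaf_import` and the data layout are hypothetical.

```python
from typing import Dict, List, Set


def import_routes(routes: List[dict], import_rts: Dict[int, Set[str]]) -> Dict[int, List[str]]:
    """Copy each global route into every local VNI table whose import RTs match."""
    tables: Dict[int, List[str]] = {vni: [] for vni in import_rts}
    for route in routes:
        for vni, rts in import_rts.items():
            if rts & set(route["rts"]):  # at least one route target in common
                tables[vni].append(route["prefix"])
    return tables


# Global EVPN table, using route keys and route targets from the sample outputs.
global_table = [
    {"prefix": "[2]:[0]:[0]:[48]:[00:02:00:00:00:10]:[32]:[172.16.130.44]",
     "rts": ["RT:65004:10200", "RT:65004:104001"]},
    {"prefix": "[5]:[0]:[30]:[172.16.100.0]",
     "rts": ["RT:65050:104001"]},
]

# Hypothetical leaf with one layer 2 VNI and one layer 3 VNI.
leaf_import = {
    10200: {"RT:65004:10200"},
    104001: {"RT:65004:104001", "RT:65050:104001"},
}
leaf_tables = import_routes(global_table, leaf_import)

# A spine has no local VNIs, so nothing is imported, yet global_table is kept.
spine_tables = import_routes(global_table, {})
print(leaf_tables)
print(spine_tables)
```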
Display the per-VRF BGP Routing Table
When symmetric routing is deployed, received type-2 and type-5 routes
are imported into the VRF routing table (against the corresponding
address-family: IPv4 unicast or IPv6 unicast) based on a match on the
route target attributes. You can examine BGP’s VRF routing table using
the net show bgp vrf <vrf-name> ipv4 unicast command or the net show bgp vrf <vrf-name> ipv6 unicast command.
cumulus@leaf01:~$ net show bgp vrf vrf1 ipv4 unicast
BGP table version is 8, local router ID is 172.16.120.250
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
* 172.16.120.21/32 10.0.0.2 0 65100 65002 i
*> 10.0.0.2 0 65100 65002 i
* 172.16.120.22/32 10.0.0.2 0 65100 65002 i
*> 10.0.0.2 0 65100 65002 i
* 172.16.120.31/32 10.0.0.3 0 65100 65003 i
*> 10.0.0.3 0 65100 65003 i
* 172.16.120.32/32 10.0.0.3 0 65100 65003 i
*> 10.0.0.3 0 65100 65003 i
* 172.16.120.41/32 10.0.0.4 0 65100 65004 i
*> 10.0.0.4 0 65100 65004 i
* 172.16.120.42/32 10.0.0.4 0 65100 65004 i
*> 10.0.0.4 0 65100 65004 i
* 172.16.100.0/24 10.0.0.5 0 65100 65050 ?
*> 10.0.0.5 0 65100 65050 ?
* 172.16.100.0/24 10.0.0.6 0 65100 65050 ?
*> 10.0.0.6 0 65100 65050 ?
Displayed 8 routes and 16 total paths
cumulus@leaf01:~$
Examine MAC Moves
The first time a MAC moves from behind one VTEP to behind another, BGP
attaches a MAC Mobility (MM) extended community attribute with sequence
number 1 to the type-2 route for that MAC. After that, each time the
MAC moves to a new VTEP, the MM sequence number increments by 1. You can
examine the MM sequence number associated with a MAC’s type-2 route with
the net show bgp l2vpn evpn route vni <vni> mac <mac> command. The
sample output below shows the type-2 route for a MAC that has moved
three times:
cumulus@switch:~$ net show bgp l2vpn evpn route vni 10109 mac 00:02:22:22:22:02
BGP routing table entry for [2]:[0]:[0]:[48]:[00:02:22:22:22:02]
Paths: (1 available, best #1)
Not advertised to any peer
Route [2]:[0]:[0]:[48]:[00:02:22:22:22:02] VNI 10109
Local
6.0.0.184 from 0.0.0.0 (6.0.0.184)
Origin IGP, localpref 100, weight 32768, valid, sourced, local, bestpath-from-AS Local, best
Extended Community: RT:650184:10109 ET:8 MM:3
AddPath ID: RX 0, TX 10350121
Last update: Tue Feb 14 18:40:37 2017
Displayed 1 paths for requested prefix
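The sequence numbering described above can be modeled with a small Python sketch. The class `MacMobilityTracker` and the VTEP names are hypothetical; the model only counts moves and does not represent how BGP selects between competing routes.

```python
from typing import Dict


class MacMobilityTracker:
    """Toy model of the MM sequence number kept per MAC address."""

    def __init__(self) -> None:
        self.location: Dict[str, str] = {}
        self.sequence: Dict[str, int] = {}

    def learn(self, mac: str, vtep: str) -> int:
        """Record that mac was learned behind vtep; return its MM sequence."""
        previous = self.location.get(mac)
        if previous is None:
            # First time the MAC is seen: it has not moved yet.
            self.sequence[mac] = 0
        elif previous != vtep:
            # The MAC moved to a different VTEP: increment by 1.
            self.sequence[mac] += 1
        self.location[mac] = vtep
        return self.sequence[mac]


tracker = MacMobilityTracker()
mac = "00:02:22:22:22:02"
history = [tracker.learn(mac, vtep) for vtep in ("vtep-a", "vtep-b", "vtep-c", "vtep-a")]
print(history)
```

In this model, the fourth entry corresponds to a MAC that has moved three times, matching the MM:3 value in the sample output.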
Examine Sticky MAC Addresses
You can identify static or sticky MACs in EVPN by the presence of
MM:0, sticky MAC in the Extended Community line of the output from
net show bgp l2vpn evpn route vni <vni> mac <mac>:
cumulus@switch:~$ net show bgp l2vpn evpn route vni 10101 mac 00:02:00:00:00:01
BGP routing table entry for [2]:[0]:[0]:[48]:[00:02:00:00:00:01]
Paths: (1 available, best #1)
Not advertised to any peer
Route [2]:[0]:[0]:[48]:[00:02:00:00:00:01] VNI 10101
Local
172.16.130.18 from 0.0.0.0 (172.16.130.18)
Origin IGP, localpref 100, weight 32768, valid, sourced, local, bestpath-from-AS Local, best
Extended Community: ET:8 RT:60176:10101 MM:0, sticky MAC
AddPath ID: RX 0, TX 46
Last update: Tue Apr 11 21:44:02 2017
Displayed 1 paths for requested prefix
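When you process this output in a script, you can read the MM value and the sticky marker from the Extended Community line. The following Python sketch assumes the line format shown in the sample outputs in this section; the function name `parse_mobility` is illustrative.

```python
import re
from typing import Optional, Tuple


def parse_mobility(line: str) -> Tuple[Optional[int], bool]:
    """Return (MM sequence number or None, True if the MAC is marked sticky)."""
    match = re.search(r"\bMM:(\d+)(, sticky MAC)?", line)
    if match is None:
        return None, False
    return int(match.group(1)), match.group(2) is not None


sticky = parse_mobility("Extended Community: ET:8 RT:60176:10101 MM:0, sticky MAC")
moved = parse_mobility("Extended Community: RT:650184:10109 ET:8 MM:3")
plain = parse_mobility(
    "Extended Community: RT:65004:10200 RT:65004:104001 ET:8 Rmac:00:01:00:00:14:00"
)
print(sticky, moved, plain)
```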
Troubleshooting
To troubleshoot EVPN, enable FRR debug logs. The relevant debug options
are as follows:
debug zebra vxlan traces VNI addition and deletion (local and
remote) as well as MAC and neighbor addition and deletion (local and
remote).
debug zebra kernel traces actual netlink messages exchanged with
the kernel, which includes everything, not just EVPN.
debug bgp updates traces BGP update exchanges, including all
updates. Output is extended to show EVPN-specific information.
debug bgp zebra traces interactions between BGP and zebra for EVPN
(and other) routes.
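The options listed above are FRR commands. As a sketch, the following Python snippet builds shell commands that pass each option to vtysh, the FRR command shell, with its -c option. Running the options through vtysh in this way is an assumption and is not stated in this section; the snippet only prints the commands and does not execute them.

```python
import shlex

# Debug options named in the list above, with a short note on what each traces.
EVPN_DEBUG_OPTIONS = {
    "debug zebra vxlan": "VNI, MAC and neighbor addition and deletion",
    "debug zebra kernel": "netlink messages exchanged with the kernel",
    "debug bgp updates": "BGP update exchanges, including EVPN details",
    "debug bgp zebra": "interactions between BGP and zebra",
}


def vtysh_command(option: str) -> str:
    """Build (but do not run) a shell command that enters one debug option."""
    if option not in EVPN_DEBUG_OPTIONS:
        raise ValueError(f"unknown debug option: {option}")
    return "sudo vtysh -c " + shlex.quote(option)


commands = [vtysh_command(option) for option in EVPN_DEBUG_OPTIONS]
for command, traces in zip(commands, EVPN_DEBUG_OPTIONS.values()):
    print(f"{command}  # {traces}")
```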
Caveats
The following caveats apply to EVPN in this version of Cumulus Linux:
When EVPN is enabled on a switch (VTEP), all locally defined VNIs on
that switch and other information (such as MAC addresses) pertaining
to them are advertised to EVPN peers. There is no provision to only
announce certain VNIs.
For VXLAN type-5 routes, ECMP does not work when the VTEP is directly connected to remote VTEPs. To work around this issue, add an additional device in the VXLAN fabric between the local and remote VTEPs, so that local and remote VTEPs are not directly connected.
In a VXLAN active-active
configuration, ARPs are sometimes not suppressed even if ARP
suppression is enabled. This is because no control plane synchronizes
the neighbor entries between the two switches operating in
active-active mode. This has no impact on forwarding.
You must configure the overlay (tenants) in one or more specific VRFs,
separate from the underlay, which resides in the default VRF. A
layer 3 VNI mapping for the default VRF is not supported.
In an EVPN deployment, Cumulus Linux supports a single BGP ASN which represents the ASN of the core as well as the ASN for any tenant VRFs if they have BGP peerings. If you need to change the ASN, you must first remove the layer 3 VNI in the /etc/frr/frr.conf file, modify the BGP ASN, then add back the layer 3 VNI in the /etc/frr/frr.conf file.
When you run the ping -I command and specify an interface, you don’t get an ICMP echo reply. However, when you run the ping command without the -I option, everything works as expected.
ping -I command example:
cumulus@switch:default:~:# ping -I swp2 10.0.10.1
PING 10.0.10.1 (10.0.10.1) from 10.0.0.2 swp1.5: 56(84) bytes of data.
ping command example:
cumulus@switch:default:~:# ping 10.0.10.1
PING 10.0.10.1 (10.0.10.1) 56(84) bytes of data.
64 bytes from 10.0.10.1: icmp_req=1 ttl=63 time=4.00 ms
64 bytes from 10.0.10.1: icmp_req=2 ttl=63 time=0.000 ms
64 bytes from 10.0.10.1: icmp_req=3 ttl=63 time=0.000 ms
64 bytes from 10.0.10.1: icmp_req=4 ttl=63 time=0.000 ms
^C
--- 10.0.10.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 0.000/1.000/4.001/1.732 ms
This is expected behavior with Cumulus Linux; when you send an ICMP echo request to an IP address that is not in the same subnet using the ping -I command, Cumulus Linux creates a failed ARP entry for the destination IP address.
On Broadcom Trident II+ and Maverick-based switches,
when a lookup is done after VXLAN decapsulation on the
external-facing switch (exit/border leaf), the switch does not
rewrite the MAC addresses or TTL; for through traffic, packets are
dropped by the next hop instead of correctly routing from a VXLAN
overlay network into a non-VXLAN external network (such as the
Internet). This applies to all forms of VXLAN routing (centralized,
asymmetric and symmetric) and affects all traffic from VXLAN overlay
hosts that need to be routed after VXLAN decapsulation on an
exit/border leaf, including traffic destined to external networks
(through traffic) and traffic destined to the exit leaf SVI address.
To work around this issue, modify the external-facing interface for
each VLAN sub-interface on the exit leaf by creating a temporary VNI
and associating it with the existing VLAN ID.
For example, if the expected interface configuration is:
auto swp3.2001
iface swp3.2001
vrf vrf1
address 10.0.0.2/24
# where swp3 is the external facing port and swp3.2001 is the VLAN sub-interface
auto bridge
iface bridge
bridge-vlan-aware yes
bridge-ports vx-4001
bridge-vids 4001
auto vx-4001
iface vx-4001
vxlan-id 4001
<... usual vxlan config ...>
bridge-access 4001
# where vnid 4001 represents the L3 VNI
auto vlan4001
iface vlan4001
vlan-id 4001
vlan-raw-device bridge
vrf vrf1
Modify the configuration as follows:
auto swp3
iface swp3
bridge-access 2001
# associate the port (swp3) with bridge 2001
auto bridge
iface bridge
bridge-vlan-aware yes
bridge-ports swp3 vx-4001 vx-16000000
bridge-vids 2001
# where vx-4001 is the existing VNI and vx-16000000 is a new temporary VNI
# this is now bridging the port (swp3), the VNI (vx-4001),
# and the new temporary VNI (vx-16000000)
# the bridge VLAN ID is now 2001
auto vlan2001
iface vlan2001
vlan-id 2001
vrf vrf1
address 10.0.0.2/24
vlan-raw-device bridge
# create a VLAN 2001 with the associated VRF and IP address
auto vx-16000000
iface vx-16000000
vxlan-id 16000000
bridge-access 2001
<... usual vxlan config ...>
# associate the temporary VNI (vx-16000000) with bridge 2001
auto vx-4001
iface vx-4001
vxlan-id 4001
<... usual vxlan config ...>
bridge-access 4001
# where vnid 4001 represents the L3 VNI
auto vlan4001
iface vlan4001
vlan-id 4001
vlan-raw-device bridge
vrf vrf1
If an MLAG pair is used instead of a single exit/border leaf, add
the same temporary VNIs on both switches of the MLAG pair.
EVPN is not supported when Redistribute Neighbor is also configured. Enabling both features simultaneously causes instability in IPv4 and IPv6 neighbor entries.
Example Configurations
Basic Clos (4x2) for bridging
Clos with MLAG and centralized routing
Clos with MLAG and asymmetric routing
Basic Clos with symmetric routing and exit leafs
Basic Clos (4x2) for Bridging
The following example configuration shows a basic Clos topology for bridging.
Leaf01 and Leaf02 Configurations
Leaf01 /etc/network/interfaces
cumulus@Leaf01:~$ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5)
# The primary network interface
auto eth0
iface eth0 inet dhcp
# Include any platform-specific interface configuration
#source /etc/network/interfaces.d/*.if
auto lo
iface lo
address 10.0.0.7/32
alias BGP un-numbered Use for Vxlan Src Tunnel
clagd-vxlan-anycast-ip 172.16.100.7
auto uplink-1
iface uplink-1
bond-slaves swp1 swp2
mtu 9216
auto uplink-2
iface uplink-2
bond-slaves swp3 swp4
mtu 9216
auto peerlink-3
iface peerlink-3
bond-slaves swp5 swp6
mtu 9216
auto peerlink-3.4094
iface peerlink-3.4094
address 169.254.0.9/30
mtu 9216
clagd-priority 4096
clagd-sys-mac 44:38:39:ff:ff:01
clagd-peer-ip 169.254.0.10
# post-up sysctl -w net.ipv4.conf.peerlink-3/4094.accept_local=1
clagd-backup-ip 10.0.0.8
auto hostbond4
iface hostbond4
bond-slaves swp7
mtu 9152
clag-id 1
bridge-pvid 1000
auto hostbond5
iface hostbond5
bond-slaves swp8
mtu 9152
clag-id 2
bridge-pvid 1001
auto vx-101000
iface vx-101000
vxlan-id 101000
bridge-access 1000
vxlan-local-tunnelip 10.0.0.7
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
mtu 9152
auto vx-101001
iface vx-101001
vxlan-id 101001
bridge-access 1001
vxlan-local-tunnelip 10.0.0.7
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
mtu 9152
auto VxLanA-1
iface VxLanA-1
bridge-vlan-aware yes
bridge-ports vx-101000 vx-101001 peerlink-3 hostbond4 hostbond5
bridge-stp on
bridge-vids 1000-1001
bridge-pvid 1
auto vlan1
iface vlan1
vlan-id 1
vlan-raw-device VxLanA-1
ip-forward off
auto vlan1000
iface vlan1000
vlan-id 1000
vlan-raw-device VxLanA-1
ip-forward off
auto vlan1001
iface vlan1001
vlan-id 1001
vlan-raw-device VxLanA-1
ip-forward off
Leaf02 /etc/network/interfaces
cumulus@Leaf02:~$ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5)
# The primary network interface
auto eth0
iface eth0 inet dhcp
# Include any platform-specific interface configuration
#source /etc/network/interfaces.d/*.if
auto lo
iface lo
address 10.0.0.8/32
alias BGP un-numbered Use for Vxlan Src Tunnel
clagd-vxlan-anycast-ip 172.16.100.7
auto uplink-1
iface uplink-1
bond-slaves swp1 swp2
mtu 9216
auto uplink-2
iface uplink-2
bond-slaves swp3 swp4
mtu 9216
auto peerlink-3
iface peerlink-3
bond-slaves swp5 swp6
mtu 9216
auto peerlink-3.4094
iface peerlink-3.4094
address 169.254.0.10/30
mtu 9216
clagd-priority 8192
clagd-sys-mac 44:38:39:ff:ff:01
clagd-peer-ip 169.254.0.9
# post-up sysctl -w net.ipv4.conf.peerlink-3/4094.accept_local=1
clagd-backup-ip 10.0.0.7
auto hostbond4
iface hostbond4
bond-slaves swp7
mtu 9152
clag-id 1
bridge-pvid 1000
auto hostbond5
iface hostbond5
bond-slaves swp8
mtu 9152
clag-id 2
bridge-pvid 1001
auto vx-101000
iface vx-101000
vxlan-id 101000
bridge-access 1000
vxlan-local-tunnelip 10.0.0.8
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
mtu 9152
auto vx-101001
iface vx-101001
vxlan-id 101001
bridge-access 1001
vxlan-local-tunnelip 10.0.0.8
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
mtu 9152
auto VxLanA-1
iface VxLanA-1
bridge-vlan-aware yes
bridge-ports vx-101000 vx-101001 peerlink-3 hostbond4 hostbond5
bridge-stp on
bridge-vids 1000-1001
bridge-pvid 1
auto vlan1
iface vlan1
vlan-id 1
vlan-raw-device VxLanA-1
ip-forward off
auto vlan1000
iface vlan1000
vlan-id 1000
vlan-raw-device VxLanA-1
ip-forward off
auto vlan1001
iface vlan1001
vlan-id 1001
vlan-raw-device VxLanA-1
ip-forward off
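In the Leaf01 and Leaf02 listings above, the two MLAG peers share the same clagd-sys-mac and clagd-vxlan-anycast-ip, and each clagd-peer-ip is the address of the other peer's peerlink subinterface. The following Python sketch checks those three relationships on excerpts of the two files. The parser is deliberately simplistic and the function names are hypothetical; it is an illustration, not a replacement for the MLAG consistency checks that the switch performs.

```python
from typing import Dict, List

Stanzas = Dict[str, Dict[str, str]]


def parse_interfaces(text: str) -> Stanzas:
    """Collect 'key value' lines under each 'iface <name>' stanza."""
    stanzas: Stanzas = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("auto "):
            continue
        words = line.split(None, 1)
        if words[0] == "iface" and len(words) == 2:
            current = stanzas.setdefault(words[1].split()[0], {})
        elif current is not None and len(words) == 2:
            current[words[0]] = words[1]
    return stanzas


def mlag_mismatches(a: Stanzas, b: Stanzas, peerlink: str = "peerlink-3.4094") -> List[str]:
    """Return the names of settings that are inconsistent between two peers."""
    problems = []
    for iface, key in (("lo", "clagd-vxlan-anycast-ip"), (peerlink, "clagd-sys-mac")):
        if a.get(iface, {}).get(key) != b.get(iface, {}).get(key):
            problems.append(key)
    for local, remote in ((a, b), (b, a)):
        peer_ip = local.get(peerlink, {}).get("clagd-peer-ip")
        remote_addr = remote.get(peerlink, {}).get("address", "").split("/")[0]
        if peer_ip != remote_addr and "clagd-peer-ip" not in problems:
            problems.append("clagd-peer-ip")
    return problems


# Excerpts of the Leaf01 and Leaf02 files shown above.
LEAF01 = """
auto lo
iface lo
address 10.0.0.7/32
clagd-vxlan-anycast-ip 172.16.100.7
auto peerlink-3.4094
iface peerlink-3.4094
address 169.254.0.9/30
clagd-sys-mac 44:38:39:ff:ff:01
clagd-peer-ip 169.254.0.10
"""
LEAF02 = """
auto lo
iface lo
address 10.0.0.8/32
clagd-vxlan-anycast-ip 172.16.100.7
auto peerlink-3.4094
iface peerlink-3.4094
address 169.254.0.10/30
clagd-sys-mac 44:38:39:ff:ff:01
clagd-peer-ip 169.254.0.9
"""
leaf01, leaf02 = parse_interfaces(LEAF01), parse_interfaces(LEAF02)
print(mlag_mismatches(leaf01, leaf02))
```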
Leaf03 /etc/network/interfaces
cumulus@Leaf03:~$ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5)
# The primary network interface
auto eth0
iface eth0 inet dhcp
# Include any platform-specific interface configuration
#source /etc/network/interfaces.d/*.if
auto lo
iface lo
address 10.0.0.9/32
alias BGP un-numbered Use for Vxlan Src Tunnel
clagd-vxlan-anycast-ip 172.16.100.9
auto uplink-1
iface uplink-1
bond-slaves swp1 swp2
mtu 9216
auto uplink-2
iface uplink-2
bond-slaves swp3 swp4
mtu 9216
auto peerlink-3
iface peerlink-3
bond-slaves swp5 swp6
mtu 9216
auto peerlink-3.4094
iface peerlink-3.4094
address 169.254.0.9/30
mtu 9216
alias clag and vxlan communication primary path
clagd-priority 4096
clagd-sys-mac 44:38:39:ff:ff:02
clagd-peer-ip 169.254.0.10
# post-up sysctl -w net.ipv4.conf.peerlink-3/4094.accept_local=1
clagd-backup-ip 10.0.0.10
auto hostbond4
iface hostbond4
bond-slaves swp7
mtu 9152
clag-id 1
bridge-pvid 1000
auto hostbond5
iface hostbond5
bond-slaves swp8
mtu 9152
clag-id 2
bridge-pvid 1001
auto vx-101000
iface vx-101000
vxlan-id 101000
bridge-access 1000
vxlan-local-tunnelip 10.0.0.9
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
mtu 9152
auto vx-101001
iface vx-101001
vxlan-id 101001
bridge-access 1001
vxlan-local-tunnelip 10.0.0.9
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
mtu 9152
auto VxLanA-1
iface VxLanA-1
bridge-vlan-aware yes
bridge-ports vx-101000 vx-101001 peerlink-3 hostbond4 hostbond5
bridge-stp on
bridge-vids 1000-1001
bridge-pvid 1
auto vlan1
iface vlan1
vlan-id 1
vlan-raw-device VxLanA-1
ip-forward off
auto vlan1000
iface vlan1000
vlan-id 1000
vlan-raw-device VxLanA-1
ip-forward off
auto vlan1001
iface vlan1001
vlan-id 1001
vlan-raw-device VxLanA-1
ip-forward off
Leaf04 /etc/network/interfaces
cumulus@Leaf04:~$ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5)
# The primary network interface
auto eth0
iface eth0 inet dhcp
# Include any platform-specific interface configuration
#source /etc/network/interfaces.d/*.if
auto lo
iface lo
address 10.0.0.10/32
alias BGP un-numbered Use for Vxlan Src Tunnel
clagd-vxlan-anycast-ip 172.16.100.9
auto uplink-1
iface uplink-1
bond-slaves swp1 swp2
mtu 9216
auto uplink-2
iface uplink-2
bond-slaves swp3 swp4
mtu 9216
auto peerlink-3
iface peerlink-3
bond-slaves swp5 swp6
mtu 9216
auto peerlink-3.4094
iface peerlink-3.4094
address 169.254.0.10/30
mtu 9216
alias clag and vxlan communication primary path
clagd-priority 8192
clagd-sys-mac 44:38:39:ff:ff:02
clagd-peer-ip 169.254.0.9
# post-up sysctl -w net.ipv4.conf.peerlink-3/4094.accept_local=1
clagd-backup-ip 10.0.0.9
auto hostbond4
iface hostbond4
bond-slaves swp7
mtu 9152
clag-id 1
bridge-pvid 1000
auto hostbond5
iface hostbond5
bond-slaves swp8
mtu 9152
clag-id 2
bridge-pvid 1001
auto vx-101000
iface vx-101000
vxlan-id 101000
bridge-access 1000
vxlan-local-tunnelip 10.0.0.10
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
mtu 9152
auto vx-101001
iface vx-101001
vxlan-id 101001
bridge-access 1001
vxlan-local-tunnelip 10.0.0.10
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
mtu 9152
auto VxLanA-1
iface VxLanA-1
bridge-vlan-aware yes
bridge-ports vx-101000 vx-101001 peerlink-3 hostbond4 hostbond5
bridge-stp on
bridge-vids 1000-1001
bridge-pvid 1
auto vlan1
iface vlan1
vlan-id 1
vlan-raw-device VxLanA-1
ip-forward off
auto vlan1000
iface vlan1000
vlan-id 1000
vlan-raw-device VxLanA-1
ip-forward off
auto vlan1001
iface vlan1001
vlan-id 1001
vlan-raw-device VxLanA-1
ip-forward off
Spine01 /etc/network/interfaces
cumulus@Spine01:~$ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5)
# The primary network interface
auto eth0
iface eth0 inet dhcp
# Include any platform-specific interface configuration
#source /etc/network/interfaces.d/*.if
auto lo
iface lo
address 10.0.0.5/32
alias BGP un-numbered Use for Vxlan Src Tunnel
auto downlink-1
iface downlink-1
bond-slaves swp1 swp2
mtu 9216
auto downlink-2
iface downlink-2
bond-slaves swp3 swp4
mtu 9216
auto downlink-3
iface downlink-3
bond-slaves swp5 swp6
mtu 9216
auto downlink-4
iface downlink-4
bond-slaves swp7 swp8
mtu 9216
Spine02 /etc/network/interfaces
cumulus@Spine02:~$ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5)
# The primary network interface
auto eth0
iface eth0 inet dhcp
# Include any platform-specific interface configuration
#source /etc/network/interfaces.d/*.if
auto lo
iface lo
address 10.0.0.6/32
alias BGP un-numbered Use for Vxlan Src Tunnel
auto downlink-1
iface downlink-1
bond-slaves swp1 swp2
mtu 9216
auto downlink-2
iface downlink-2
bond-slaves swp3 swp4
mtu 9216
auto downlink-3
iface downlink-3
bond-slaves swp5 swp6
mtu 9216
auto downlink-4
iface downlink-4
bond-slaves swp7 swp8
mtu 9216
Leaf03 /etc/network/interfaces
cumulus@Leaf03:~$ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5)
# The primary network interface
auto eth0
iface eth0 inet dhcp
# Include any platform-specific interface configuration
#source /etc/network/interfaces.d/*.if
auto lo
iface lo
address 10.0.0.9/32
alias BGP un-numbered Use for Vxlan Src Tunnel
clagd-vxlan-anycast-ip 172.16.100.9
auto uplink-1
iface uplink-1
bond-slaves swp1 swp2
mtu 9216
auto uplink-2
iface uplink-2
bond-slaves swp3 swp4
mtu 9216
auto peerlink-3
iface peerlink-3
bond-slaves swp5 swp6
mtu 9216
auto peerlink-3.4094
iface peerlink-3.4094
address 169.254.0.9/30
mtu 9216
alias clag and vxlan communication primary path
clagd-priority 4096
clagd-sys-mac 44:38:39:ff:ff:02
clagd-peer-ip 169.254.0.10
clagd-backup-ip 10.0.0.10
auto hostbond4
iface hostbond4
bond-slaves swp7
mtu 9152
clag-id 1
bridge-pvid 1000
auto hostbond5
iface hostbond5
bond-slaves swp8
mtu 9152
clag-id 2
bridge-pvid 1001
auto vx-101000
iface vx-101000
vxlan-id 101000
bridge-access 1000
vxlan-local-tunnelip 10.0.0.9
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
mtu 9152
auto vx-101001
iface vx-101001
vxlan-id 101001
bridge-access 1001
vxlan-local-tunnelip 10.0.0.9
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
mtu 9152
auto vx-101002
iface vx-101002
vxlan-id 101002
bridge-access 1002
vxlan-local-tunnelip 10.0.0.9
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
mtu 9152
auto vx-101003
iface vx-101003
vxlan-id 101003
bridge-access 1003
vxlan-local-tunnelip 10.0.0.9
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
mtu 9152
auto bridge
iface bridge
bridge-vlan-aware yes
bridge-ports vx-101000 vx-101001 vx-101002 vx-101003 peerlink-3 hostbond4 hostbond5
bridge-stp on
bridge-vids 1000-1003
bridge-pvid 1
auto vrf1
iface vrf1
vrf-table auto
auto vlan1000
iface vlan1000
vlan-id 1000
vlan-raw-device bridge
ip-forward off
auto vlan1001
iface vlan1001
vlan-id 1001
vlan-raw-device bridge
ip-forward off
auto vrf2
iface vrf2
vrf-table auto
auto vlan1002
iface vlan1002
vlan-id 1002
vlan-raw-device bridge
ip-forward off
auto vlan1003
iface vlan1003
vlan-id 1003
vlan-raw-device bridge
ip-forward off
Leaf04 /etc/network/interfaces
cumulus@Leaf04:~$ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5)
# The primary network interface
auto eth0
iface eth0 inet dhcp
# Include any platform-specific interface configuration
#source /etc/network/interfaces.d/*.if
auto lo
iface lo
address 10.0.0.10/32
alias BGP un-numbered Use for Vxlan Src Tunnel
clagd-vxlan-anycast-ip 172.16.100.9
auto uplink-1
iface uplink-1
bond-slaves swp1 swp2
mtu 9216
auto uplink-2
iface uplink-2
bond-slaves swp3 swp4
mtu 9216
auto peerlink-3
iface peerlink-3
bond-slaves swp5 swp6
mtu 9216
auto peerlink-3.4094
iface peerlink-3.4094
address 169.254.0.10/30
mtu 9216
alias clag and vxlan communication primary path
clagd-priority 8192
clagd-sys-mac 44:38:39:ff:ff:02
clagd-peer-ip 169.254.0.9
clagd-backup-ip 10.0.0.9
auto hostbond4
iface hostbond4
bond-slaves swp7
mtu 9152
clag-id 1
bridge-pvid 1000
auto hostbond5
iface hostbond5
bond-slaves swp8
mtu 9152
clag-id 2
bridge-pvid 1001
auto vx-101000
iface vx-101000
vxlan-id 101000
bridge-access 1000
vxlan-local-tunnelip 10.0.0.10
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
mtu 9152
auto vx-101001
iface vx-101001
vxlan-id 101001
bridge-access 1001
vxlan-local-tunnelip 10.0.0.10
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
mtu 9152
auto vx-101002
iface vx-101002
vxlan-id 101002
bridge-access 1002
vxlan-local-tunnelip 10.0.0.10
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
mtu 9152
auto vx-101003
iface vx-101003
vxlan-id 101003
bridge-access 1003
vxlan-local-tunnelip 10.0.0.10
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
mtu 9152
auto bridge
iface bridge
bridge-vlan-aware yes
bridge-ports vx-101000 vx-101001 vx-101002 vx-101003 peerlink-3 hostbond4 hostbond5
bridge-stp on
bridge-vids 1000-1003
bridge-pvid 1
auto vrf1
iface vrf1
vrf-table auto
auto vlan1000
iface vlan1000
vlan-id 1000
vlan-raw-device bridge
ip-forward off
auto vlan1001
iface vlan1001
vlan-id 1001
vlan-raw-device bridge
ip-forward off
auto vrf2
iface vrf2
vrf-table auto
auto vlan1002
iface vlan1002
vlan-id 1002
vlan-raw-device bridge
ip-forward off
auto vlan1003
iface vlan1003
vlan-id 1003
vlan-raw-device bridge
ip-forward off
Spine01 /etc/network/interfaces
cumulus@Spine01:~$ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5)
# The primary network interface
auto eth0
iface eth0 inet dhcp
# Include any platform-specific interface configuration
#source /etc/network/interfaces.d/*.if
auto lo
iface lo
address 10.0.0.5/32
alias BGP un-numbered Use for Vxlan Src Tunnel
auto downlink-1
iface downlink-1
bond-slaves swp1 swp2
mtu 9216
auto downlink-2
iface downlink-2
bond-slaves swp3 swp4
mtu 9216
auto downlink-3
iface downlink-3
bond-slaves swp5 swp6
mtu 9216
auto downlink-4
iface downlink-4
bond-slaves swp7 swp8
mtu 9216
Spine02 /etc/network/interfaces
cumulus@Spine02:~$ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5)
# The primary network interface
auto eth0
iface eth0 inet dhcp
# Include any platform-specific interface configuration
#source /etc/network/interfaces.d/*.if
auto lo
iface lo
address 10.0.0.6/32
alias BGP un-numbered Use for Vxlan Src Tunnel
auto downlink-1
iface downlink-1
bond-slaves swp1 swp2
mtu 9216
auto downlink-2
iface downlink-2
bond-slaves swp3 swp4
mtu 9216
auto downlink-3
iface downlink-3
bond-slaves swp5 swp6
mtu 9216
auto downlink-4
iface downlink-4
bond-slaves swp7 swp8
mtu 9216
Spine01 /etc/network/interfaces
cumulus@Spine01:~$ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5)
# The primary network interface
auto eth0
iface eth0 inet dhcp
# Include any platform-specific interface configuration
#source /etc/network/interfaces.d/*.if
auto lo
iface lo
address 10.0.0.5/32
alias BGP un-numbered Use for Vxlan Src Tunnel
auto downlink-1
iface downlink-1
bond-slaves swp1 swp2
mtu 9216
auto downlink-2
iface downlink-2
bond-slaves swp3 swp4
mtu 9216
auto downlink-3
iface downlink-3
bond-slaves swp5 swp6
mtu 9216
auto downlink-4
iface downlink-4
bond-slaves swp7 swp8
mtu 9216
Spine02 /etc/network/interfaces
cumulus@Spine02:~$ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5)
# The primary network interface
auto eth0
iface eth0 inet dhcp
# Include any platform-specific interface configuration
#source /etc/network/interfaces.d/*.if
auto lo
iface lo
address 10.0.0.6/32
alias BGP un-numbered Use for Vxlan Src Tunnel
auto downlink-1
iface downlink-1
bond-slaves swp1 swp2
mtu 9216
auto downlink-2
iface downlink-2
bond-slaves swp3 swp4
mtu 9216
auto downlink-3
iface downlink-3
bond-slaves swp5 swp6
mtu 9216
auto downlink-4
iface downlink-4
bond-slaves swp7 swp8
mtu 9216
Basic Clos Configuration with EVPN Symmetric Routing
The following example configuration is a basic Clos topology with EVPN
symmetric routing and external prefix (type-5) routing through dual,
non-MLAG exit leafs connected to an edge router.
Leaf01 and Leaf02 Configurations
Leaf01 /etc/network/interfaces
cumulus@Leaf01:~$ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
###############
# Loopback
###############
auto lo
iface lo inet loopback
address 10.10.10.1/32
clagd-vxlan-anycast-ip 10.0.1.1
vxlan-local-tunnelip 10.10.10.1
###############
# Mgmt interface
###############
auto mgmt
iface mgmt
vrf-table auto
address 127.0.0.1/8
address ::1/128
auto eth0
iface eth0 inet dhcp
vrf mgmt
###############
# VRFs
###############
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
###############
# Clag Bonds
###############
auto bond1
iface bond1
bridge-access 10
bond-slaves swp1
clag-id 1
bond-lacp-bypass-allow yes
auto swp1
iface swp1
alias bond member of bond1
auto bond2
iface bond2
bridge-access 20
bond-slaves swp2
clag-id 2
bond-lacp-bypass-allow yes
auto swp2
iface swp2
alias bond member of bond2
auto bond3
iface bond3
bridge-access 30
bond-slaves swp3
clag-id 3
bond-lacp-bypass-allow yes
auto swp3
iface swp3
alias bond member of bond3
###############
# L2VNIs
###############
auto vni30010
iface vni30010
bridge-access 10
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 30010
auto vni30020
iface vni30020
bridge-access 20
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 30020
auto vni30030
iface vni30030
bridge-access 30
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 30030
###############
# L3VNIs
###############
auto L3VNI_RED
iface L3VNI_RED
bridge-access 4001
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 3004001
auto vlan4001
iface vlan4001
hwaddress 44:38:39:BE:EF:01
vlan-id 4001
vlan-raw-device bridge
vrf RED
auto L3VNI_BLUE
iface L3VNI_BLUE
bridge-access 4002
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 3004002
auto vlan4002
iface vlan4002
hwaddress 44:38:39:BE:EF:01
vlan-id 4002
vlan-raw-device bridge
vrf BLUE
###############
# Fabric Links
###############
auto swp51
iface swp51
alias fabric link
auto swp52
iface swp52
alias fabric link
auto swp53
iface swp53
alias fabric link
auto swp54
iface swp54
alias fabric link
###############
# Mlag and peerlink
###############
auto swp49
iface swp49
alias peerlink
auto swp50
iface swp50
alias peerlink
auto peerlink
iface peerlink
bond-slaves swp49 swp50
auto peerlink.4094
iface peerlink.4094
clagd-backup-ip 10.10.10.2
clagd-peer-ip linklocal
clagd-priority 1000
clagd-sys-mac 44:38:39:FF:01:01
###############
# Bridge
###############
auto bridge
iface bridge
bridge-ports peerlink \
bond1 bond2 bond3 \
vni30010 vni30020 vni30030 \
L3VNI_RED L3VNI_BLUE
bridge-vids 10 20 30 \
4001 4002
bridge-vlan-aware yes
###############
# SVI
###############
auto vlan10
iface vlan10
address 10.1.10.2/24
address-virtual 00:00:00:00:00:1a 10.1.10.1/24
vrf RED
vlan-raw-device bridge
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.2/24
address-virtual 00:00:00:00:00:1b 10.1.20.1/24
vrf RED
vlan-raw-device bridge
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.2/24
address-virtual 00:00:00:00:00:1c 10.1.30.1/24
vrf BLUE
vlan-raw-device bridge
vlan-id 30
Leaf02 /etc/network/interfaces
cumulus@Leaf02:~$ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
###############
# Loopback
###############
auto lo
iface lo inet loopback
address 10.10.10.2/32
clagd-vxlan-anycast-ip 10.0.1.1
vxlan-local-tunnelip 10.10.10.2
###############
# Mgmt interface
###############
auto mgmt
iface mgmt
vrf-table auto
address 127.0.0.1/8
address ::1/128
auto eth0
iface eth0 inet dhcp
vrf mgmt
###############
# VRFs
###############
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
###############
# Clag Bonds
###############
auto bond1
iface bond1
bridge-access 10
bond-slaves swp1
clag-id 1
bond-lacp-bypass-allow yes
auto swp1
iface swp1
alias bond member of bond1
auto bond2
iface bond2
bridge-access 20
bond-slaves swp2
clag-id 2
bond-lacp-bypass-allow yes
auto swp2
iface swp2
alias bond member of bond2
auto bond3
iface bond3
bridge-access 30
bond-slaves swp3
clag-id 3
bond-lacp-bypass-allow yes
auto swp3
iface swp3
alias bond member of bond3
###############
# L2VNIs
###############
auto vni30010
iface vni30010
bridge-access 10
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 30010
auto vni30020
iface vni30020
bridge-access 20
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 30020
auto vni30030
iface vni30030
bridge-access 30
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 30030
###############
# L3VNIs
###############
auto L3VNI_RED
iface L3VNI_RED
bridge-access 4001
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 3004001
auto vlan4001
iface vlan4001
hwaddress 44:38:39:BE:EF:01
vlan-id 4001
vlan-raw-device bridge
vrf RED
auto L3VNI_BLUE
iface L3VNI_BLUE
bridge-access 4002
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 3004002
auto vlan4002
iface vlan4002
hwaddress 44:38:39:BE:EF:01
vlan-id 4002
vlan-raw-device bridge
vrf BLUE
###############
# Fabric Links
###############
auto swp51
iface swp51
alias fabric link
auto swp52
iface swp52
alias fabric link
auto swp53
iface swp53
alias fabric link
auto swp54
iface swp54
alias fabric link
###############
# Mlag and peerlink
###############
auto swp49
iface swp49
alias peerlink
auto swp50
iface swp50
alias peerlink
auto peerlink
iface peerlink
bond-slaves swp49 swp50
auto peerlink.4094
iface peerlink.4094
clagd-backup-ip 10.10.10.1
clagd-peer-ip linklocal
clagd-priority 32768
clagd-sys-mac 44:38:39:FF:01:01
###############
# Bridge
###############
auto bridge
iface bridge
bridge-ports peerlink \
bond1 bond2 bond3 \
vni30010 vni30020 vni30030 \
L3VNI_RED L3VNI_BLUE
bridge-vids 10 20 30 \
4001 4002
bridge-vlan-aware yes
###############
# SVI
###############
auto vlan10
iface vlan10
address 10.1.10.3/24
address-virtual 00:00:00:00:00:1a 10.1.10.1/24
vrf RED
vlan-raw-device bridge
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.3/24
address-virtual 00:00:00:00:00:1b 10.1.20.1/24
vrf RED
vlan-raw-device bridge
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.3/24
address-virtual 00:00:00:00:00:1c 10.1.30.1/24
vrf BLUE
vlan-raw-device bridge
vlan-id 30
Leaf03 /etc/network/interfaces
cumulus@Leaf03:~$ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
###############
# Loopback
###############
auto lo
iface lo inet loopback
address 10.10.10.3/32
clagd-vxlan-anycast-ip 10.0.1.2
vxlan-local-tunnelip 10.10.10.3
###############
# Mgmt interface
###############
auto mgmt
iface mgmt
vrf-table auto
address 127.0.0.1/8
address ::1/128
auto eth0
iface eth0 inet dhcp
vrf mgmt
###############
# VRFs
###############
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
###############
# Clag Bonds
###############
auto bond1
iface bond1
bridge-access 10
bond-slaves swp1
clag-id 1
bond-lacp-bypass-allow yes
auto swp1
iface swp1
alias bond member of bond1
auto bond2
iface bond2
bridge-access 20
bond-slaves swp2
clag-id 2
bond-lacp-bypass-allow yes
auto swp2
iface swp2
alias bond member of bond2
auto bond3
iface bond3
bridge-access 30
bond-slaves swp3
clag-id 3
bond-lacp-bypass-allow yes
auto swp3
iface swp3
alias bond member of bond3
###############
# L2VNIs
###############
auto vni30010
iface vni30010
bridge-access 10
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 30010
auto vni30020
iface vni30020
bridge-access 20
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 30020
auto vni30030
iface vni30030
bridge-access 30
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 30030
###############
# L3VNIs
###############
auto L3VNI_RED
iface L3VNI_RED
bridge-access 4001
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 3004001
auto vlan4001
iface vlan4001
hwaddress 44:38:39:BE:EF:02
vlan-id 4001
vlan-raw-device bridge
vrf RED
auto L3VNI_BLUE
iface L3VNI_BLUE
bridge-access 4002
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 3004002
auto vlan4002
iface vlan4002
hwaddress 44:38:39:BE:EF:02
vlan-id 4002
vlan-raw-device bridge
vrf BLUE
###############
# Fabric Links
###############
auto swp51
iface swp51
alias fabric link
auto swp52
iface swp52
alias fabric link
auto swp53
iface swp53
alias fabric link
auto swp54
iface swp54
alias fabric link
###############
# Mlag and peerlink
###############
auto swp49
iface swp49
alias peerlink
auto swp50
iface swp50
alias peerlink
auto peerlink
iface peerlink
bond-slaves swp49 swp50
auto peerlink.4094
iface peerlink.4094
clagd-backup-ip 10.10.10.4
clagd-peer-ip linklocal
clagd-priority 1000
clagd-sys-mac 44:38:39:FF:01:02
###############
# Bridge
###############
auto bridge
iface bridge
bridge-ports peerlink \
bond1 bond2 bond3 \
vni30010 vni30020 vni30030 \
L3VNI_RED L3VNI_BLUE
bridge-vids 10 20 30 \
4001 4002
bridge-vlan-aware yes
###############
# SVI
###############
auto vlan10
iface vlan10
address 10.1.10.4/24
address-virtual 00:00:00:00:00:1a 10.1.10.1/24
vrf RED
vlan-raw-device bridge
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.4/24
address-virtual 00:00:00:00:00:1b 10.1.20.1/24
vrf RED
vlan-raw-device bridge
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.4/24
address-virtual 00:00:00:00:00:1c 10.1.30.1/24
vrf BLUE
vlan-raw-device bridge
vlan-id 30
Leaf04 /etc/network/interfaces
cumulus@Leaf04:~$ cat /etc/network/interfaces
###############
# Loopback
###############
auto lo
iface lo inet loopback
address 10.10.10.4/32
clagd-vxlan-anycast-ip 10.0.1.2
vxlan-local-tunnelip 10.10.10.4
###############
# Mgmt interface
###############
auto mgmt
iface mgmt
vrf-table auto
address 127.0.0.1/8
address ::1/128
auto eth0
iface eth0 inet dhcp
vrf mgmt
###############
# VRFs
###############
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
###############
# Clag Bonds
###############
auto bond1
iface bond1
bridge-access 10
bond-slaves swp1
clag-id 1
bond-lacp-bypass-allow yes
auto swp1
iface swp1
alias bond member of bond1
auto bond2
iface bond2
bridge-access 20
bond-slaves swp2
clag-id 2
bond-lacp-bypass-allow yes
auto swp2
iface swp2
alias bond member of bond2
auto bond3
iface bond3
bridge-access 30
bond-slaves swp3
clag-id 3
bond-lacp-bypass-allow yes
auto swp3
iface swp3
alias bond member of bond3
###############
# L2VNIs
###############
auto vni30010
iface vni30010
bridge-access 10
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 30010
auto vni30020
iface vni30020
bridge-access 20
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 30020
auto vni30030
iface vni30030
bridge-access 30
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 30030
###############
# L3VNIs
###############
auto L3VNI_RED
iface L3VNI_RED
bridge-access 4001
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 3004001
auto vlan4001
iface vlan4001
hwaddress 44:38:39:BE:EF:02
vlan-id 4001
vlan-raw-device bridge
vrf RED
auto L3VNI_BLUE
iface L3VNI_BLUE
bridge-access 4002
bridge-arp-nd-suppress on
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 3004002
auto vlan4002
iface vlan4002
hwaddress 44:38:39:BE:EF:02
vlan-id 4002
vlan-raw-device bridge
vrf BLUE
###############
# Fabric Links
###############
auto swp51
iface swp51
alias fabric link
auto swp52
iface swp52
alias fabric link
auto swp53
iface swp53
alias fabric link
auto swp54
iface swp54
alias fabric link
###############
# Mlag and peerlink
###############
auto swp49
iface swp49
alias peerlink
auto swp50
iface swp50
alias peerlink
auto peerlink
iface peerlink
bond-slaves swp49 swp50
auto peerlink.4094
iface peerlink.4094
clagd-backup-ip 10.10.10.3
clagd-peer-ip linklocal
clagd-priority 32768
clagd-sys-mac 44:38:39:FF:01:02
###############
# Bridge
###############
auto bridge
iface bridge
bridge-ports peerlink \
bond1 bond2 bond3 \
vni30010 vni30020 vni30030 \
L3VNI_RED L3VNI_BLUE
bridge-vids 10 20 30 \
4001 4002
bridge-vlan-aware yes
###############
# SVI
###############
auto vlan10
iface vlan10
address 10.1.10.5/24
address-virtual 00:00:00:00:00:1a 10.1.10.1/24
vrf RED
vlan-raw-device bridge
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.5/24
address-virtual 00:00:00:00:00:1b 10.1.20.1/24
vrf RED
vlan-raw-device bridge
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.5/24
address-virtual 00:00:00:00:00:1c 10.1.30.1/24
vrf BLUE
vlan-raw-device bridge
vlan-id 30
Spine01 /etc/network/interfaces
cumulus@Spine01:~$ cat /etc/network/interfaces
###############
# Loopback
###############
auto lo
iface lo inet loopback
address 10.10.10.101/32
###############
# Mgmt interface
###############
auto mgmt
iface mgmt
vrf-table auto
address 127.0.0.1/8
address ::1/128
auto eth0
iface eth0 inet dhcp
vrf mgmt
###############
# Fabric Links
###############
auto swp1
iface swp1
alias fabric link
auto swp2
iface swp2
alias fabric link
auto swp3
iface swp3
alias fabric link
auto swp4
iface swp4
alias fabric link
auto swp5
iface swp5
alias fabric link
auto swp6
iface swp6
alias fabric link
Spine02 /etc/network/interfaces
cumulus@Spine02:~$ cat /etc/network/interfaces
###############
# Loopback
###############
auto lo
iface lo inet loopback
address 10.10.10.102/32
###############
# Mgmt interface
###############
auto mgmt
iface mgmt
vrf-table auto
address 127.0.0.1/8
address ::1/128
auto eth0
iface eth0 inet dhcp
vrf mgmt
###############
# Fabric Links
###############
auto swp1
iface swp1
alias fabric link
auto swp2
iface swp2
alias fabric link
auto swp3
iface swp3
alias fabric link
auto swp4
iface swp4
alias fabric link
auto swp5
iface swp5
alias fabric link
auto swp6
iface swp6
alias fabric link
As of Cumulus Linux 3.7, the lightweight network virtualization feature (LNV) has been deprecated. The feature will be removed in Cumulus Linux 4.0. Use EVPN for network virtualization.
Lightweight Network Virtualization (LNV) is a technique for deploying
VXLANs without a central
controller on bare metal switches. This solution requires no external
controller or software suite; it runs the VXLAN service and registration
daemons on Cumulus Linux itself. The data path between bridge entities
is established on top of a layer 3 fabric by means of a simple service
node coupled with traditional MAC address learning.
To see an example of a full solution before reading the following
background information, read this chapter.
The two switches running Cumulus Linux, called leaf1 and leaf2, each
have a bridge configured. These two bridges contain the physical switch
port interfaces connecting to the servers as well as the logical VXLAN
interface associated with the bridge. By creating a logical VXLAN
interface on both leaf switches, the switches become VTEPs (virtual
tunnel end points). The IP address associated with this VTEP is most
commonly configured as its loopback address; in the image above, the
loopback address is 10.2.1.1 for leaf1 and 10.2.1.2 for leaf2.
Acquire the Forwarding Database at the Service Node
To connect these two VXLANs together and forward BUM (Broadcast,
Unknown-unicast, Multicast) packets to members of a VXLAN, the service
node needs to acquire the addresses of all the VTEPs for every VXLAN it
serves. The service node daemon does this through a registration daemon
running on each leaf switch that contains a VTEP participating in LNV.
The registration process informs the service node of all the VXLANs to
which the switch belongs.
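The registration process described above can be pictured as a table on the service node that maps each VNI to the VTEPs that registered for it. The following Python sketch is only a toy model, not the vxsnd implementation; the class and method names are hypothetical, and the 90-second hold time is the documented vxrd default.

```python
class ServiceNodeTable:
    """Toy model of the VNI-to-VTEP table a service node builds from registrations."""

    def __init__(self):
        # (vni, vtep_ip) -> time (in seconds) at which the entry ages out
        self._expiry = {}

    def register(self, vtep_ip, vnis, now, holdtime=90):
        """Record (or refresh) a VTEP's membership in each VNI it reports."""
        for vni in vnis:
            self._expiry[(vni, vtep_ip)] = now + holdtime

    def vteps_for(self, vni, now):
        """Return the VTEPs whose registration for this VNI has not aged out."""
        return sorted(
            ip for (v, ip), expires in self._expiry.items()
            if v == vni and expires > now
        )


table = ServiceNodeTable()
table.register("10.2.1.1", [10, 2000, 30], now=0)   # leaf1 registers three VNIs
table.register("10.2.1.2", [10, 30], now=60)        # leaf2 registers two VNIs

print(table.vteps_for(10, now=80))    # both registrations are still fresh
print(table.vteps_for(10, now=100))   # leaf1 was never refreshed and has aged out
```

In this model a VTEP stays in the table only while it keeps refreshing its registration within the hold time.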
MAC Learning and Flooding
With LNV, as with traditional bridging of physical LANs or VLANs, a
bridge automatically learns the location of hosts as a side effect of
receiving packets on a port.
For example, when server1 sends a layer 2 packet to server3, leaf2
learns that the MAC address for server1 is located on that particular
VXLAN and the VXLAN interface learns that the IP address of the VTEP for
server1 is 10.2.1.1. So when server3 sends a packet to server1, the
bridge on leaf2 forwards the packet out of the port to the VXLAN
interface and the VXLAN interface sends it, encapsulated in a UDP
packet, to the address 10.2.1.1.
But what if server3 sends a packet to some address that has yet to send
it a packet (server2, for example)? In this case, the VXLAN interface
sends the packet to the service node, which sends a copy to every other
VTEP that belongs to the same VXLAN. This is called service node
replication and is one of two techniques for handling BUM (Broadcast
Unknown-unicast and Multicast) traffic.
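The learning and flooding behavior described above can be sketched in a few lines of Python. This is an illustrative model only, not how the Linux bridge or VXLAN driver is implemented; the class name and the MAC addresses are hypothetical.

```python
class VtepBridge:
    """Toy model of a VTEP's forwarding database for one VXLAN."""

    def __init__(self):
        # learned source MAC -> IP address of the remote VTEP it sits behind
        self.mac_to_vtep = {}

    def receive(self, src_mac, remote_vtep_ip):
        """Learn where src_mac lives as a side effect of receiving a packet."""
        self.mac_to_vtep[src_mac] = remote_vtep_ip

    def forward(self, dst_mac):
        """Return ('unicast', vtep_ip) for a learned MAC, else ('replicate', None)."""
        vtep = self.mac_to_vtep.get(dst_mac)
        if vtep is None:
            return ("replicate", None)
        return ("unicast", vtep)


SERVER1_MAC = "00:00:5e:00:53:01"  # hypothetical address for server1
SERVER2_MAC = "00:00:5e:00:53:02"  # hypothetical address for server2

leaf2 = VtepBridge()
leaf2.receive(SERVER1_MAC, "10.2.1.1")   # server1's packet arrives from leaf1's VTEP

known = leaf2.forward(SERVER1_MAC)       # learned: send to 10.2.1.1
unknown = leaf2.forward(SERVER2_MAC)     # never seen: hand off for BUM replication
print(known, unknown)
```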
BUM Traffic
Cumulus Linux has two ways of handling BUM (Broadcast Unknown-unicast
and Multicast) traffic:
- Head end replication
- Service node replication
Head end replication is enabled by default in Cumulus Linux.
You cannot have both service node and head end replication configured
simultaneously, as this causes the BUM traffic to be duplicated; both
the source VTEP and the service node send their own copy of each packet
to every remote VTEP.
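A small Python model makes the duplication concrete by counting the copies of one BUM packet that each VTEP receives in each mode. This is a sketch under simplifying assumptions, not the actual data path; the function names are hypothetical and the third VTEP address is invented for illustration.

```python
from collections import Counter


def head_end_copies(source, vteps):
    """Head end replication: the source VTEP sends one copy to every other VTEP."""
    return [(source, dst) for dst in vteps if dst != source]


def service_node_copies(source, vteps, service_node):
    """Service node replication: the source sends one copy to the service node,
    which then sends one copy to every other VTEP in the VXLAN."""
    to_node = [(source, service_node)]
    fan_out = [(service_node, dst) for dst in vteps if dst != source]
    return to_node + fan_out


def copies_received(transmissions, vteps):
    """Count how many copies of one BUM packet each VTEP receives."""
    counts = Counter(dst for _, dst in transmissions if dst in vteps)
    return {v: counts.get(v, 0) for v in vteps}


vteps = ["10.2.1.1", "10.2.1.2", "10.2.1.5"]  # the third VTEP is hypothetical
service_node = "10.2.1.3"
source = "10.2.1.1"

her = head_end_copies(source, vteps)
snr = service_node_copies(source, vteps, service_node)

her_only = copies_received(her, vteps)
snr_only = copies_received(snr, vteps)
both = copies_received(her + snr, vteps)
print(her_only, snr_only, both)
```

With either mode alone, every remote VTEP receives one copy; with both enabled, every remote VTEP receives two.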
Head End Replication
Broadcom switches with Tomahawk, Trident II+, and Trident II ASICs
and switches with Spectrum ASICs are capable of head end replication (HER), which is the ability to generate all the BUM traffic in hardware. The most scalable solution available with LNV is to have each VTEP (top of rack switch) generate all of its own BUM traffic instead of relying on an external service node. HER is enabled by default in Cumulus Linux.
Cumulus Linux verified support for up to 128 VTEPs with head end
replication.
To disable head end replication, edit the /etc/vxrd.conf file and set
head_rep to False.
Service Node Replication
Cumulus Linux also supports service node replication for VXLAN BUM
packets. This is useful with LNV if you have more than 128 VTEPs.
However, it is not recommended because it forces the spine switches
running the vxsnd (service node daemon) to replicate the packets in
software instead of in hardware, unlike head end replication.
To enable service node replication:
1. Disable head end replication; set head_rep to False in the /etc/vxrd.conf file.
2. Configure a service node IP address for every VXLAN interface using the vxlan-svcnodeip parameter:
   cumulus@switch:~$ net add vxlan VXLAN vxlan svcnodeip IP_ADDRESS
   You only specify this parameter when head end replication is disabled. For the loopback, the parameter is still named vxrd-svcnode-ip.
3. Edit the /etc/vxsnd.conf file and configure the following:
   - Set the same service node IP address that you configured in the previous step:
     svcnode_ip = <>
   - To forward VXLAN data traffic, set the following variable to True:
     enable_vxlan_listen = true
Requirements
Hardware Requirements
Switches with the Broadcom Tomahawk, Trident II+, or Trident II ASIC or switches with the Mellanox Spectrum ASIC running Cumulus Linux 2.5.4 or later. Refer to the hardware compatibility list for a list of supported switch models.
Configuration Requirements
The VXLAN has an associated VXLAN Network Identifier (VNI), also interchangeably called a VXLAN ID.
The VNI cannot be 0 or 16777215, as these two numbers are reserved values under Cumulus Linux.
The VXLAN link and physical interfaces are added to the bridge to create the association between the port, VLAN, and VXLAN instance.
Each bridge on the switch has only one VXLAN interface. Cumulus Linux does not support more than one VXLAN link in a bridge; however, a switch can have multiple bridges.
An SVI (Switch VLAN Interface) or layer 3 address on the bridge is not supported. For example, you cannot ping from the leaf1 SVI to the leaf2 SVI through the VXLAN tunnel; you need to use server1 and server2 to verify.
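The VNI restriction above is easy to check before committing a configuration. The following Python sketch assumes the standard 24-bit VNI field and the two reserved values named above; the function name is hypothetical.

```python
MAX_VNI = 2**24 - 1            # the VNI is a 24-bit field, so 16777215 is the largest value
RESERVED_VNIS = {0, MAX_VNI}   # both values are reserved under Cumulus Linux


def is_valid_vni(vni):
    """Return True if vni is an integer in the 24-bit range and not reserved."""
    if isinstance(vni, bool) or not isinstance(vni, int):
        return False
    return 0 <= vni <= MAX_VNI and vni not in RESERVED_VNIS


for candidate in (10, 2000, 0, 16777215, 16777216):
    print(candidate, is_valid_vni(candidate))
```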
Install the LNV Packages
vxfld is installed by default on all new installations of Cumulus
Linux 3.x. If you are upgrading from an earlier version, run sudo -E apt-get install python-vxfld to install the LNV package.
Sample LNV Configuration
The following images illustrate the configuration that is referenced
throughout this chapter.
Physical Cabling Diagram
Network Virtualization Diagram
Want to try out configuring LNV and do not have a Cumulus Linux switch?
Check out Cumulus VX.
Network Connectivity
There must be full network connectivity before you can configure LNV.
The layer 3 IP addressing information as well as the OSPF configuration
(/etc/frr/frr.conf) below is provided to make the LNV example easier
to understand.
OSPF is not a requirement for LNV; LNV only requires layer 3
connectivity. With Cumulus Linux, you can achieve this with static
routes, OSPF, or BGP.
Layer 3 IP Addressing
Here is the configuration for the IP addressing information used in this
example.
spine1:
cumulus@spine1:~$ net add interface swp49 ip address 10.1.1.2/30
cumulus@spine1:~$ net add interface swp50 ip address 10.1.1.6/30
cumulus@spine1:~$ net add interface swp51 ip address 10.1.1.50/30
cumulus@spine1:~$ net add interface swp52 ip address 10.1.1.54/30
cumulus@spine1:~$ net add loopback lo ip address 10.2.1.3/32
cumulus@spine1:~$ net pending
cumulus@spine1:~$ net commit
These commands create the following configuration:
cumulus@spine1:~$ cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.2.1.3/32
auto eth0
iface eth0 inet dhcp
auto swp49
iface swp49
address 10.1.1.2/30
auto swp50
iface swp50
address 10.1.1.6/30
auto swp51
iface swp51
address 10.1.1.50/30
auto swp52
iface swp52
address 10.1.1.54/30
spine2:
cumulus@spine2:~$ net add interface swp49 ip address 10.1.1.18/30
cumulus@spine2:~$ net add interface swp50 ip address 10.1.1.22/30
cumulus@spine2:~$ net add interface swp51 ip address 10.1.1.34/30
cumulus@spine2:~$ net add interface swp52 ip address 10.1.1.38/30
cumulus@spine2:~$ net add loopback lo ip address 10.2.1.4/32
cumulus@spine2:~$ net pending
cumulus@spine2:~$ net commit
These commands create the following configuration:
cumulus@spine2:~$ cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.2.1.4/32
auto eth0
iface eth0 inet dhcp
auto swp49
iface swp49
address 10.1.1.18/30
auto swp50
iface swp50
address 10.1.1.22/30
auto swp51
iface swp51
address 10.1.1.34/30
auto swp52
iface swp52
address 10.1.1.38/30
leaf1:
cumulus@leaf1:~$ net add interface swp1 breakout 4x
cumulus@leaf1:~$ net add interface swp1s0 ip address 10.1.1.1/30
cumulus@leaf1:~$ net add interface swp1s1 ip address 10.1.1.5/30
cumulus@leaf1:~$ net add interface swp1s2 ip address 10.1.1.33/30
cumulus@leaf1:~$ net add interface swp1s3 ip address 10.1.1.37/30
cumulus@leaf1:~$ net add loopback lo ip address 10.2.1.1/32
cumulus@leaf1:~$ net pending
cumulus@leaf1:~$ net commit
These commands create the following configuration:
cumulus@leaf1:~$ cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.2.1.1/32
auto eth0
iface eth0 inet dhcp
auto swp1s0
iface swp1s0
address 10.1.1.1/30
auto swp1s1
iface swp1s1
address 10.1.1.5/30
auto swp1s2
iface swp1s2
address 10.1.1.33/30
auto swp1s3
iface swp1s3
address 10.1.1.37/30
leaf2:
cumulus@leaf2:~$ net add interface swp1 breakout 4x
cumulus@leaf2:~$ net add interface swp1s0 ip address 10.1.1.17/30
cumulus@leaf2:~$ net add interface swp1s1 ip address 10.1.1.21/30
cumulus@leaf2:~$ net add interface swp1s2 ip address 10.1.1.49/30
cumulus@leaf2:~$ net add interface swp1s3 ip address 10.1.1.53/30
cumulus@leaf2:~$ net add loopback lo ip address 10.2.1.2/32
cumulus@leaf2:~$ net pending
cumulus@leaf2:~$ net commit
These commands create the following configuration:
cumulus@leaf2:~$ cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.2.1.2/32
auto eth0
iface eth0 inet dhcp
auto swp1s0
iface swp1s0
address 10.1.1.17/30
auto swp1s1
iface swp1s1
address 10.1.1.21/30
auto swp1s2
iface swp1s2
address 10.1.1.49/30
auto swp1s3
iface swp1s3
address 10.1.1.53/30
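One way to sanity-check the point-to-point addressing above is to confirm that both ends of every link fall in the same /30 subnet. The Python sketch below uses the standard ipaddress module. The leaf-to-spine pairing in the list is an assumption inferred from the addresses, because the cabling diagram is not reproduced in the text.

```python
import ipaddress

# (leaf port, leaf address, spine port, spine address); pairing inferred from the /30 subnets
links = [
    ("leaf1 swp1s0", "10.1.1.1/30", "spine1 swp49", "10.1.1.2/30"),
    ("leaf1 swp1s1", "10.1.1.5/30", "spine1 swp50", "10.1.1.6/30"),
    ("leaf1 swp1s2", "10.1.1.33/30", "spine2 swp51", "10.1.1.34/30"),
    ("leaf1 swp1s3", "10.1.1.37/30", "spine2 swp52", "10.1.1.38/30"),
    ("leaf2 swp1s0", "10.1.1.17/30", "spine2 swp49", "10.1.1.18/30"),
    ("leaf2 swp1s1", "10.1.1.21/30", "spine2 swp50", "10.1.1.22/30"),
    ("leaf2 swp1s2", "10.1.1.49/30", "spine1 swp51", "10.1.1.50/30"),
    ("leaf2 swp1s3", "10.1.1.53/30", "spine1 swp52", "10.1.1.54/30"),
]


def same_subnet(a, b):
    """True if two interface addresses share a network but are different hosts."""
    ia, ib = ipaddress.ip_interface(a), ipaddress.ip_interface(b)
    return ia.network == ib.network and ia.ip != ib.ip


for leaf_port, leaf_addr, spine_port, spine_addr in links:
    status = "ok" if same_subnet(leaf_addr, spine_addr) else "MISMATCH"
    print(f"{leaf_port} <-> {spine_port}: {status}")
```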
Layer 3 Fabric
The service nodes and registration nodes must all be routable between
each other. The layer 3 fabric on Cumulus Linux can be either BGP or OSPF. In this
example, OSPF is used to demonstrate full reachability.
FRRouting configuration using OSPF:
spine1:
cumulus@spine1:~$ net add ospf network 10.2.1.3/32 area 0.0.0.0
cumulus@spine1:~$ net add interface swp49 ospf network point-to-point
cumulus@spine1:~$ net add interface swp50 ospf network point-to-point
cumulus@spine1:~$ net add interface swp51 ospf network point-to-point
cumulus@spine1:~$ net add interface swp52 ospf network point-to-point
cumulus@spine1:~$ net add interface swp49 ospf area 0.0.0.0
cumulus@spine1:~$ net add interface swp50 ospf area 0.0.0.0
cumulus@spine1:~$ net add interface swp51 ospf area 0.0.0.0
cumulus@spine1:~$ net add interface swp52 ospf area 0.0.0.0
cumulus@spine1:~$ net add ospf router-id 10.2.1.3
cumulus@spine1:~$ net pending
cumulus@spine1:~$ net commit
These commands create the following configuration:
interface swp49
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
interface swp50
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
interface swp51
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
interface swp52
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
router ospf
ospf router-id 10.2.1.3
network 10.2.1.3/32 area 0.0.0.0
spine2:
cumulus@spine2:~$ net add ospf network 10.2.1.4/32 area 0.0.0.0
cumulus@spine2:~$ net add interface swp49 ospf network point-to-point
cumulus@spine2:~$ net add interface swp50 ospf network point-to-point
cumulus@spine2:~$ net add interface swp51 ospf network point-to-point
cumulus@spine2:~$ net add interface swp52 ospf network point-to-point
cumulus@spine2:~$ net add interface swp49 ospf area 0.0.0.0
cumulus@spine2:~$ net add interface swp50 ospf area 0.0.0.0
cumulus@spine2:~$ net add interface swp51 ospf area 0.0.0.0
cumulus@spine2:~$ net add interface swp52 ospf area 0.0.0.0
cumulus@spine2:~$ net add ospf router-id 10.2.1.4
cumulus@spine2:~$ net pending
cumulus@spine2:~$ net commit
These commands create the following configuration:
interface swp49
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
interface swp50
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
interface swp51
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
interface swp52
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
router ospf
ospf router-id 10.2.1.4
network 10.2.1.4/32 area 0.0.0.0
leaf1:
cumulus@leaf1:~$ net add ospf network 10.2.1.1/32 area 0.0.0.0
cumulus@leaf1:~$ net add interface swp1s0 ospf network point-to-point
cumulus@leaf1:~$ net add interface swp1s1 ospf network point-to-point
cumulus@leaf1:~$ net add interface swp1s2 ospf network point-to-point
cumulus@leaf1:~$ net add interface swp1s3 ospf network point-to-point
cumulus@leaf1:~$ net add interface swp1s0 ospf area 0.0.0.0
cumulus@leaf1:~$ net add interface swp1s1 ospf area 0.0.0.0
cumulus@leaf1:~$ net add interface swp1s2 ospf area 0.0.0.0
cumulus@leaf1:~$ net add interface swp1s3 ospf area 0.0.0.0
cumulus@leaf1:~$ net add ospf router-id 10.2.1.1
cumulus@leaf1:~$ net pending
cumulus@leaf1:~$ net commit
These commands create the following configuration:
interface swp1s0
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
interface swp1s1
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
interface swp1s2
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
interface swp1s3
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
router ospf
ospf router-id 10.2.1.1
network 10.2.1.1/32 area 0.0.0.0
leaf2:
cumulus@leaf2:~$ net add ospf network 10.2.1.2/32 area 0.0.0.0
cumulus@leaf2:~$ net add interface swp1s0 ospf network point-to-point
cumulus@leaf2:~$ net add interface swp1s1 ospf network point-to-point
cumulus@leaf2:~$ net add interface swp1s2 ospf network point-to-point
cumulus@leaf2:~$ net add interface swp1s3 ospf network point-to-point
cumulus@leaf2:~$ net add interface swp1s0 ospf area 0.0.0.0
cumulus@leaf2:~$ net add interface swp1s1 ospf area 0.0.0.0
cumulus@leaf2:~$ net add interface swp1s2 ospf area 0.0.0.0
cumulus@leaf2:~$ net add interface swp1s3 ospf area 0.0.0.0
cumulus@leaf2:~$ net add ospf router-id 10.2.1.2
cumulus@leaf2:~$ net pending
cumulus@leaf2:~$ net commit
These commands create the following configuration:
interface swp1s0
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
interface swp1s1
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
interface swp1s2
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
interface swp1s3
ip ospf network point-to-point
ip ospf area 0.0.0.0
!
router ospf
ospf router-id 10.2.1.2
network 10.2.1.2/32 area 0.0.0.0
In this example, the servers are running Ubuntu 14.04. There needs to be
a trunk mapped from server1 and server2 to the respective switch. In
Ubuntu this is done with subinterfaces, as shown in the host
configurations below.
server1:
auto eth3.10
iface eth3.10 inet static
address 10.10.10.1/24
auto eth3.20
iface eth3.20 inet static
address 10.10.20.1/24
auto eth3.30
iface eth3.30 inet static
address 10.10.30.1/24
server2:
auto eth3.10
iface eth3.10 inet static
address 10.10.10.2/24
auto eth3.20
iface eth3.20 inet static
address 10.10.20.2/24
auto eth3.30
iface eth3.30 inet static
address 10.10.30.2/24
On Ubuntu, it is more reliable to use `ifup` and `ifdown` to bring the
interfaces up and down individually, rather than restarting networking
entirely (there is no equivalent to `ifreload` like there is in Cumulus
Linux):
cumulus@server1:~$ sudo ifup eth3.10
Set name-type for VLAN subsystem. Should be visible in /proc/net/vlan/config
Added VLAN with VID == 10 to IF -:eth3:-
cumulus@server1:~$ sudo ifup eth3.20
Set name-type for VLAN subsystem. Should be visible in /proc/net/vlan/config
Added VLAN with VID == 20 to IF -:eth3:-
cumulus@server1:~$ sudo ifup eth3.30
Set name-type for VLAN subsystem. Should be visible in /proc/net/vlan/config
Added VLAN with VID == 30 to IF -:eth3:-
Configure the VLAN to VXLAN Mapping
Configure the VLANs and associated VXLANs. In this example, there are 3
VLANs and 3 VXLAN IDs (VNIs). VLANs 10, 20 and 30 are used and
associated with VNIs 10, 2000 and 30 respectively. The loopback address,
used as the vxlan-local-tunnelip, is the only difference between leaf1
and leaf2 for this demonstration.
leaf1:
cumulus@leaf1:~$ net add loopback lo ip address 10.2.1.1/32
cumulus@leaf1:~$ net add loopback lo vxrd-src-ip 10.2.1.1
cumulus@leaf1:~$ net add loopback lo vxrd-svcnode-ip 10.2.1.3
cumulus@leaf1:~$ net add vxlan vni-10 vxlan id 10
cumulus@leaf1:~$ net add vxlan vni-10 vxlan local-tunnelip 10.2.1.1
cumulus@leaf1:~$ net add vxlan vni-10 bridge access 10
cumulus@leaf1:~$ net add vxlan vni-2000 vxlan id 2000
cumulus@leaf1:~$ net add vxlan vni-2000 vxlan local-tunnelip 10.2.1.1
cumulus@leaf1:~$ net add vxlan vni-2000 bridge access 20
cumulus@leaf1:~$ net add vxlan vni-30 vxlan id 30
cumulus@leaf1:~$ net add vxlan vni-30 vxlan local-tunnelip 10.2.1.1
cumulus@leaf1:~$ net add vxlan vni-30 bridge access 30
cumulus@leaf1:~$ net add bridge bridge ports swp32s0.10
cumulus@leaf1:~$ net pending
cumulus@leaf1:~$ net commit
These commands create the following configuration in the /etc/network/interfaces file:
auto lo
iface lo
address 10.2.1.1/32
vxrd-src-ip 10.2.1.1
leaf2:
cumulus@leaf2:~$ net add loopback lo ip address 10.2.1.2/32
cumulus@leaf2:~$ net add loopback lo vxrd-src-ip 10.2.1.2
cumulus@leaf2:~$ net add loopback lo vxrd-svcnode-ip 10.2.1.3
cumulus@leaf2:~$ net add vxlan vni-10 vxlan id 10
cumulus@leaf2:~$ net add vxlan vni-10 vxlan local-tunnelip 10.2.1.2
cumulus@leaf2:~$ net add vxlan vni-10 bridge access 10
cumulus@leaf2:~$ net add vxlan vni-2000 vxlan id 2000
cumulus@leaf2:~$ net add vxlan vni-2000 vxlan local-tunnelip 10.2.1.2
cumulus@leaf2:~$ net add vxlan vni-2000 bridge access 20
cumulus@leaf2:~$ net add vxlan vni-30 vxlan id 30
cumulus@leaf2:~$ net add vxlan vni-30 vxlan local-tunnelip 10.2.1.2
cumulus@leaf2:~$ net add vxlan vni-30 bridge access 30
cumulus@leaf2:~$ net add bridge bridge ports swp32s0.10
cumulus@leaf2:~$ net pending
cumulus@leaf2:~$ net commit
These commands create the following configuration in the /etc/network/interfaces file:
auto lo
iface lo
address 10.2.1.2/32
vxrd-src-ip 10.2.1.2
Why is the interface named vni-2000 and not vni-20? VLAN and VXLAN
numbers do not need to match, and this example maps VLAN 20 to VNI 2000
to show that. However, if you are using fewer than 4096 VLANs, nothing
prevents you from keeping things simple and using the same number for
each VLAN and its VNI. The choice is completely up to you.
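Because the per-VNI commands follow a fixed pattern, the mapping can be expressed as data and the NCLU commands generated from it. The Python sketch below reproduces the command forms used in this example; the function name is hypothetical, and the output is meant to be reviewed before being run on a switch.

```python
def vxlan_commands(vlan_to_vni, local_tunnel_ip):
    """Build the three NCLU commands used above for each VLAN-to-VNI pair."""
    commands = []
    for vlan, vni in vlan_to_vni.items():
        name = f"vni-{vni}"
        commands.append(f"net add vxlan {name} vxlan id {vni}")
        commands.append(f"net add vxlan {name} vxlan local-tunnelip {local_tunnel_ip}")
        commands.append(f"net add vxlan {name} bridge access {vlan}")
    return commands


# VLAN -> VNI mapping from this example; VLAN 20 deliberately maps to VNI 2000
mapping = {10: 10, 20: 2000, 30: 30}
leaf1_cmds = vxlan_commands(mapping, "10.2.1.1")

for cmd in leaf1_cmds:
    print(cmd)
```

Calling the same function with 10.2.1.2 yields the leaf2 variant, which matches the observation that the tunnel IP is the only difference between the two leaf switches.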
Verify the VLAN to VXLAN Mapping
Use the brctl show command to see the physical and logical interfaces
associated with that bridge:
cumulus@leaf1:~$ brctl show
bridge name bridge id STP enabled interfaces
bridge 8000.443839008404 yes swp32s0.10
vni-10
vni-2000
vni-30
As with any logical interfaces on Linux, the name does not matter (other
than a 15-character limit). To verify the associated VNI for the logical
name, use the ip -d link show command:
cumulus@leaf1:~$ ip -d link show vni-10
43: vni-10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-10 state UNKNOWN mode DEFAULT
link/ether 02:ec:ec:bd:7f:c6 brd ff:ff:ff:ff:ff:ff
vxlan id 10 srcport 32768 61000 dstport 4789 ageing 1800
bridge_slave
The vxlan id 10 indicates the VXLAN ID/VNI is indeed 10 as the logical
name suggests.
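When there are many VXLAN interfaces, the same check can be scripted by extracting the VNI from the command output. The Python sketch below parses the sample output shown above; the function name is hypothetical, and the regular expressions assume the output format stays as shown.

```python
import re

SAMPLE = """43: vni-10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-10 state UNKNOWN mode DEFAULT
    link/ether 02:ec:ec:bd:7f:c6 brd ff:ff:ff:ff:ff:ff
    vxlan id 10 srcport 32768 61000 dstport 4789 ageing 1800
    bridge_slave"""


def parse_vxlan(output):
    """Return the VNI and UDP destination port from `ip -d link show` output,
    or None if the interface is not a VXLAN interface."""
    vni = re.search(r"\bvxlan id (\d+)", output)
    if vni is None:
        return None
    port = re.search(r"\bdstport (\d+)", output)
    return {
        "vni": int(vni.group(1)),
        "dstport": int(port.group(1)) if port else None,
    }


info = parse_vxlan(SAMPLE)
print(info)
```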
Enable and Manage Service Node and Registration Daemons
Every VTEP must run the registration daemon (vxrd). Typically, every
leaf switch acts as a VTEP. A minimum of 1 switch (a switch not already
acting as a VTEP) must run the service node daemon (vxsnd). The
instructions for enabling these daemons follow.
Enable the Service Node Daemon
The service node daemon (vxsnd) is included in the Cumulus Linux
repository as vxfld-vxsnd. The service node daemon can run on any
switch running Cumulus Linux as long as that switch is not also a VXLAN
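VTEP. In this example, enable the service node only on the spine1
switch, then restart the service:
cumulus@spine1:~$ sudo systemctl enable vxsnd.service
cumulus@spine1:~$ sudo systemctl restart vxsnd.service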
Do not run vxsnd on a switch that is already acting as a VTEP.
Enable the Registration Daemon
The registration daemon (vxrd) is included in the Cumulus Linux
package as vxfld-vxrd. The registration daemon must run on each VTEP
participating in LNV, so you must enable it on every TOR (leaf) switch
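acting as a VTEP, then restart the vxrd daemon. For example, on leaf1:
cumulus@leaf1:~$ sudo systemctl enable vxrd.service
cumulus@leaf1:~$ sudo systemctl restart vxrd.service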
To determine if the daemon is running, use the systemctl status <daemon name>.service command.
For the service node daemon:
cumulus@spine1:~$ sudo systemctl status vxsnd.service
● vxsnd.service - Lightweight Network Virt Discovery Svc and Replicator
Loaded: loaded (/lib/systemd/system/vxsnd.service; enabled)
Active: active (running) since Wed 2016-05-11 11:42:55 UTC; 10min ago
Main PID: 774 (vxsnd)
CGroup: /system.slice/vxsnd.service
└─774 /usr/bin/python /usr/bin/vxsnd
May 11 11:42:55 cumulus vxsnd[774]: INFO: Starting (pid 774) ...
For the registration daemon:
cumulus@leaf1:~$ sudo systemctl status vxrd.service
● vxrd.service - Lightweight Network Virtualization Peer Discovery Daemon
Loaded: loaded (/lib/systemd/system/vxrd.service; enabled)
Active: active (running) since Wed 2016-05-11 11:42:55 UTC; 10min ago
Main PID: 929 (vxrd)
CGroup: /system.slice/vxrd.service
└─929 /usr/bin/python /usr/bin/vxrd
May 11 11:42:55 cumulus vxrd[929]: INFO: Starting (pid 929) ...
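To check the daemons on several switches, the state can be read from the Active line of the status output. The following Python sketch parses text in the format shown above; the function name is hypothetical, and collecting the output from each switch is left out.

```python
import re

SAMPLE = """vxrd.service - Lightweight Network Virtualization Peer Discovery Daemon
   Loaded: loaded (/lib/systemd/system/vxrd.service; enabled)
   Active: active (running) since Wed 2016-05-11 11:42:55 UTC; 10min ago
 Main PID: 929 (vxrd)"""


def active_state(status_output):
    """Return (state, sub_state) from the 'Active:' line, or None if it is missing."""
    match = re.search(r"^\s*Active:\s+(\S+)\s+\((\S+?)\)", status_output, re.MULTILINE)
    if match is None:
        return None
    return (match.group(1), match.group(2))


state = active_state(SAMPLE)
print(state)
```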
Configure the Registration Node
The registration node was configured earlier in /etc/network/interfaces in the
VXLAN mapping section above; no additional
configuration is typically needed. However, if you need to modify the registration
node configuration, edit /etc/vxrd.conf.
To configure the registration node in /etc/vxrd.conf, open the file on leaf1:
cumulus@leaf1:~$ sudo nano /etc/vxrd.conf
Then edit the svcnode_ip variable:
svcnode_ip = 10.2.1.3
Then perform the same on leaf2:
cumulus@leaf2:~$ sudo nano /etc/vxrd.conf
And again edit the svcnode_ip variable:
svcnode_ip = 10.2.1.3
Enable, then restart the registration node daemon for the change to take
effect:
cumulus@leaf1:~$ sudo systemctl enable vxrd.service
cumulus@leaf1:~$ sudo systemctl restart vxrd.service
cumulus@leaf2:~$ sudo systemctl enable vxrd.service
cumulus@leaf2:~$ sudo systemctl restart vxrd.service
The complete list of options you can configure is listed below:
| Name | Description | Default |
| --- | --- | --- |
| loglevel | The log level: DEBUG, INFO, WARNING, ERROR, CRITICAL. | INFO |
| logdest | The destination for log messages. The destination can be a file name, stdout, or syslog. | syslog |
| logfilesize | The log file size in bytes. Used when logdest is a file name. | 512000 |
| logbackupcount | The maximum number of log files stored on the disk. Used when logdest is a file name. | 14 |
| pidfile | The PID file location for the vxrd daemon. | /var/run/vxrd.pid |
| udsfile | The file name for the Unix domain socket used for management. | /var/run/vxrd.sock |
| vxfld_port | The UDP port used for VXLAN control messages. | 10001 |
| svcnode_ip | The address to which registration daemons send control messages for registration and/or BUM packets for replication. You can also configure this option in the /etc/network/interfaces file with the vxrd-svcnode-ip keyword. | |
| holdtime | The hold time (in seconds) for soft state, which is how long the service node waits before aging out an IP address for a VNI. The vxrd includes this in the register messages it sends to a vxsnd. | 90 seconds |
| src_ip | The local IP address to bind to for receiving control traffic from the service node daemon. | |
| refresh_rate | The number of times to refresh within the hold time. The higher this number, the more lost UDP refresh messages can be tolerated. | 3 |
| config_check_rate | The interval, in seconds, at which to poll the system for current VXLAN membership. | 5 seconds |
| head_rep | Enables self replication. Instead of using the service node to replicate BUM packets, replication is done in hardware on the VTEP switch. | true |
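The holdtime and refresh_rate options work together. As a rough illustration, the sketch below assumes that the registration daemon spaces its refreshes evenly across the hold time; the guide does not state this, so treat the even spacing and the function name refresh_interval as assumptions.

```python
def refresh_interval(holdtime: float = 90, refresh_rate: int = 3) -> float:
    """Seconds between refreshes, assuming they are evenly spaced."""
    if refresh_rate <= 0:
        raise ValueError("refresh_rate must be a positive integer")
    return holdtime / refresh_rate


# With the defaults listed above (holdtime 90, refresh_rate 3).
print(refresh_interval())

# A higher refresh_rate shortens the interval, so more refresh messages
# fit inside one hold time.
print(refresh_interval(holdtime=90, refresh_rate=6))
```

Under this assumption, raising refresh_rate means more refresh messages can be lost before the service node ages out the entry, at the cost of more control traffic.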
For each Boolean option, use 1, yes, true, or on for True, and use
0, no, false, or off for False.
Configure the Service Node
To configure the service node daemon, edit the /etc/vxsnd.conf
configuration file.
For the example configuration, default values are used, except for the
svcnode_ip field.
cumulus@spine1:~$ sudo nano /etc/vxsnd.conf
Set the svcnode_ip field to the loopback address of the switch running
the vxsnd daemon:
svcnode_ip = 10.2.1.3
Enable, then restart the service node daemon for the change to take
effect:
cumulus@spine1:~$ sudo systemctl enable vxsnd.service
cumulus@spine1:~$ sudo systemctl restart vxsnd.service
The complete list of options you can configure is listed below:
| Name | Description | Default |
| --- | --- | --- |
| loglevel | The log level: DEBUG, INFO, WARNING, ERROR, CRITICAL. | INFO |
| logdest | The destination for log messages. The destination can be a file name, stdout, or syslog. | syslog |
| logfilesize | The log file size in bytes. Used when logdest is a file name. | 512000 |
| logbackupcount | The maximum number of log files stored on disk. Used when logdest is a file name. | 14 |
| pidfile | The PID file location for the vxsnd daemon. | /var/run/vxrd.pid |
| udsfile | The file name for the Unix domain socket used for management. | /var/run/vxrd.sock |
| vxfld_port | The UDP port used for VXLAN control messages. | 10001 |
| svcnode_ip | The address to which registration daemons send control messages for registration and/or BUM packets for replication. | 0.0.0.0 |
| holdtime | The hold time (in seconds) for soft state. This option is used when sending a register message to peers in response to learning a <vni, addr> from a VXLAN data packet. | 90 seconds |
| src_ip | The local IP address to bind to for receiving inter-vxsnd control traffic. | 0.0.0.0 |
| svcnode_peers | A space-separated list of IP addresses with which the vxsnd shares its state. | |
| enable_vxlan_listen | When set to true, the service node listens for VXLAN data traffic. | true |
| install_svcnode_ip | When set to true, the snd_peer_address gets installed on the loopback interface. It gets withdrawn when the vxsnd is not in service. If set to true, you must define the snd_peer_address configuration variable. | false |
| age_check | The number of seconds to wait before checking the database to age out stale entries. | 90 seconds |
For each Boolean option, use 1, yes, true, or on for True, and use
0, no, false, or off for False.
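To make the accepted Boolean spellings concrete, here is a small sketch of a parser that maps them to True or False. The name parse_bool is hypothetical, and treating the values as case-insensitive is an assumption; this is not the daemons' own parsing code.

```python
TRUE_VALUES = {"1", "yes", "true", "on"}
FALSE_VALUES = {"0", "no", "false", "off"}


def parse_bool(value: str) -> bool:
    """Map a configuration value to a Boolean using the spellings above."""
    token = value.strip().lower()  # case-insensitivity is an assumption
    if token in TRUE_VALUES:
        return True
    if token in FALSE_VALUES:
        return False
    raise ValueError("not a recognized Boolean value: {!r}".format(value))


# Example: the documented default for enable_vxlan_listen is true.
print(parse_bool("true"))
print(parse_bool("off"))
```

Python's standard configparser module accepts the same eight spellings in its getboolean method, which may be convenient if you script checks against these files.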
Advanced LNV Usage
Scale LNV by Load Balancing with Anycast
The above configuration assumes a single service node, which can quickly
be overwhelmed by BUM traffic. To load balance BUM traffic across
multiple service nodes, use anycast. Anycast enables BUM traffic to
reach the topologically nearest service node instead of overwhelming a
single service node.
Enable the Service Node Daemon on Additional Spine Switches
In this example, spine1 already has the service node daemon enabled.
Enable it on the spine2 switch, then restart the vxsnd daemon:
cumulus@spine2:~$ sudo systemctl enable vxsnd.service
cumulus@spine2:~$ sudo systemctl restart vxsnd.service
Configure the Anycast Address on All Participating Service Nodes
spine1:
Add the 10.10.10.10/32 address to the loopback interface:
cumulus@spine1:~$ net add loopback lo ip address 10.10.10.10/32
cumulus@spine1:~$ net pending
cumulus@spine1:~$ net commit
These commands create the following configuration in the /etc/network/interfaces file:
auto lo
iface lo inet loopback
address 10.2.1.3/32
address 10.10.10.10/32
Verify the IP address is configured:
cumulus@spine1:~$ ip addr show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet 10.2.1.3/32 scope global lo
inet 10.10.10.10/32 scope global lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
spine2:
Add the 10.10.10.10/32 address to the loopback interface:
cumulus@spine2:~$ net add loopback lo ip address 10.10.10.10/32
cumulus@spine2:~$ net pending
cumulus@spine2:~$ net commit
These commands create the following configuration in the /etc/network/interfaces file:
auto lo
iface lo inet loopback
address 10.2.1.4/32
address 10.10.10.10/32
Verify the IP address is configured:
cumulus@spine2:~$ ip addr show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet 10.2.1.4/32 scope global lo
inet 10.10.10.10/32 scope global lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
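The verification step can also be scripted. The sketch below extracts the IPv4 addresses from captured ip addr show lo output and checks that the anycast address is present. The function name loopback_addresses is hypothetical, and the sample text is copied from the spine2 output above.

```python
from typing import List


def loopback_addresses(ip_addr_output: str) -> List[str]:
    """Return the IPv4 addresses (with prefix length) from 'ip addr show' text."""
    addresses = []
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # IPv4 lines look like: "inet 10.10.10.10/32 scope global lo"
        if len(fields) >= 2 and fields[0] == "inet":
            addresses.append(fields[1])
    return addresses


sample = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet 10.2.1.4/32 scope global lo
inet 10.10.10.10/32 scope global lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
"""

print("10.10.10.10/32" in loopback_addresses(sample))
```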
Configure the Service Node vxsnd.conf File
spine1:
Use a text editor to edit the service node configuration file:
cumulus@spine1:~$ sudo nano /etc/vxsnd.conf
Change the following values:
svcnode_ip = 10.10.10.10
svcnode_peers = 10.2.1.4
src_ip = 10.2.1.3
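This sets the address on which the service node listens for VXLAN messages to the configured anycast address (10.10.10.10), sets the service node to synchronize its state with spine2 (10.2.1.4), and binds inter-vxsnd control traffic to the spine1 loopback address (10.2.1.3).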
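Each participating service node needs matching settings: the shared anycast address as svcnode_ip, its own loopback address as src_ip, and the other service nodes as svcnode_peers. The sketch below derives these values for every node. The spine2 values it prints are inferred by symmetry with the spine1 values above and are an assumption, as is the helper name anycast_settings.

```python
from typing import Dict


def anycast_settings(anycast_ip: str,
                     loopbacks: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Build the vxsnd.conf values for each service node.

    loopbacks maps a switch name to its unique loopback address.
    """
    settings = {}
    for name, own_ip in loopbacks.items():
        # Every other service node is a peer of this one.
        peers = [ip for other, ip in loopbacks.items() if other != name]
        settings[name] = {
            "svcnode_ip": anycast_ip,
            "svcnode_peers": " ".join(peers),  # space-separated list
            "src_ip": own_ip,
        }
    return settings


spines = {"spine1": "10.2.1.3", "spine2": "10.2.1.4"}
for switch, values in anycast_settings("10.10.10.10", spines).items():
    print(switch + ":")
    for key, value in values.items():
        print("  {} = {}".format(key, value))
```

After changing vxsnd.conf on a service node, restart the vxsnd daemon on that node for the change to take effect, as in the earlier steps.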
Repeat the ping tests from the previous section. Here is the table again
for reference:
| VNI | server1 | server2 |
| --- | --- | --- |
| 10 | 10.10.10.1 | 10.10.10.2 |
| 2000 | 10.10.20.1 | 10.10.20.2 |
| 30 | 10.10.30.1 | 10.10.30.2 |
cumulus@server1:~$ ping 10.10.10.2
PING 10.10.10.2 (10.10.10.2) 56(84) bytes of data.
64 bytes from 10.10.10.2: icmp_seq=1 ttl=64 time=5.32 ms
64 bytes from 10.10.10.2: icmp_seq=2 ttl=64 time=0.206 ms
^C
--- 10.10.10.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.206/2.767/5.329/2.562 ms
cumulus@server1:~$ ping 10.10.20.2
PING 10.10.20.2 (10.10.20.2) 56(84) bytes of data.
64 bytes from 10.10.20.2: icmp_seq=1 ttl=64 time=1.64 ms
64 bytes from 10.10.20.2: icmp_seq=2 ttl=64 time=0.187 ms
^C
--- 10.10.20.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.187/0.914/1.642/0.728 ms
cumulus@server1:~$ ping 10.10.30.2
PING 10.10.30.2 (10.10.30.2) 56(84) bytes of data.
64 bytes from 10.10.30.2: icmp_seq=1 ttl=64 time=1.63 ms
64 bytes from 10.10.30.2: icmp_seq=2 ttl=64 time=0.191 ms
^C
--- 10.10.30.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.191/0.913/1.635/0.722 ms
Restart Network Removes vxsnd Anycast IP Address from Loopback Interface
If you have not configured a loopback anycast IP address in the
/etc/network/interfaces file, but you have enabled vxsnd (the service
node daemon) to add the anycast IP address automatically, the anycast
IP address is removed from the loopback interface when you restart
networking (with systemctl restart networking).
To prevent this issue, specify the anycast IP address for the loopback
interface in both the /etc/network/interfaces file and the vxsnd.conf
file. This way, if vxsnd fails, you can still withdraw the IP address.
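One way to catch this mismatch before restarting networking is to compare the two files. The sketch below is a simplified check, not a full parser for either file format: it reads the address lines of the lo stanza and the svcnode_ip line, then reports whether the anycast address appears in both. All function names are hypothetical.

```python
import re
from typing import List, Optional


def lo_addresses(interfaces_text: str) -> List[str]:
    """Collect 'address' values (without prefix length) from the lo stanza."""
    addresses = []
    in_lo = False
    for raw in interfaces_text.splitlines():
        words = raw.split("#", 1)[0].split()
        if not words:
            continue
        if words[0] == "auto":
            in_lo = False  # an 'auto' line starts a new block
        elif words[0] == "iface":
            in_lo = len(words) > 1 and words[1] == "lo"
        elif in_lo and words[0] == "address" and len(words) > 1:
            addresses.append(words[1].split("/")[0])
    return addresses


def svcnode_ip(vxsnd_conf_text: str) -> Optional[str]:
    """Return the uncommented svcnode_ip value, or None if it is not set."""
    for raw in vxsnd_conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        match = re.match(r"svcnode_ip\s*=\s*(\S+)$", line)
        if match:
            return match.group(1)
    return None


def anycast_is_persistent(interfaces_text: str, vxsnd_conf_text: str) -> bool:
    """True if the svcnode_ip address is also configured on lo."""
    ip = svcnode_ip(vxsnd_conf_text)
    return ip is not None and ip in lo_addresses(interfaces_text)


interfaces = """\
auto lo
iface lo inet loopback
address 10.2.1.3/32
address 10.10.10.10/32
"""
vxsnd_conf = "svcnode_ip = 10.10.10.10\n"

print(anycast_is_persistent(interfaces, vxsnd_conf))
```

In practice you would read the text from /etc/network/interfaces and /etc/vxsnd.conf instead of the inline samples.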