NVIDIA Cumulus Linux

Cumulus Linux 4.4 User Guide

NVIDIA® Cumulus Linux is the first full-featured Debian Buster-based, Linux operating system for the networking industry.

This user guide provides in-depth documentation on the Cumulus Linux installation process, system configuration and management, network solutions, and monitoring and troubleshooting recommendations. In addition, the quick start guide provides an end-to-end setup process to get you started.

Cumulus Linux 4.4 includes the NVIDIA Cumulus NetQ agent and CLI. You can use Cumulus NetQ to monitor and manage your data center network infrastructure and operational health. Refer to the NVIDIA Cumulus NetQ documentation for details.

For a list of the new features in this release, see What's New. For bug fixes and known issues present in this release, refer to the Cumulus Linux 4.4 Release Notes.

Open Source Contributions

To implement various Cumulus Linux features, NVIDIA has forked various software projects, like CFEngine Netdev and some Puppet Labs packages. Some of the forked code resides in the NVIDIA Networking GitHub repository and some is available as part of the Cumulus Linux repository as Debian source packages.

NVIDIA has also developed and released new applications as open source. The list of open source projects is on the open source software page.

Hardware Compatibility List

You can find the most up-to-date hardware compatibility list (HCL) here. Use the HCL to confirm that Cumulus Linux supports your switch model. The HCL lists products by port configuration, manufacturer and SKU part number.

Download the User Guide

You can view the complete Cumulus Linux 4.4 user guide as a single page to print to PDF here.

What's New

This document supports the Cumulus Linux 4.4 release, and lists new platforms and features.

What’s New in Cumulus Linux 4.4.1

Cumulus Linux 4.4.1 provides bug fixes and contains several enhancements.

Enhancements

What’s New in Cumulus Linux 4.4.0

Cumulus Linux 4.4.0 supports new platforms, provides bug fixes, and contains several new features and improvements.

New Platforms

New Features and Enhancements

Unsupported Platforms

Cumulus Linux 4.4 supports NVIDIA Spectrum-based switches only. Cumulus Linux 3.7 and 4.3 continue to support Broadcom-based networking ASICs.

Deprecated Features

Cumulus Linux 4.4. no longer supports GRE and OVSDB High Availability.

What’s New in the Documentation

The Cumulus Linux 4.4 user guide (this guide) provides pre-built Try It demos for certain Cumulus Linux features. The Try It demos run a simulation in NVIDIA Air; a cloud hosted platform that works exactly like a real world production deployment. Use the Try It demos to examine switch configuration for a feature.

The following Try It demos are currently available:

For more information, see Try It Pre-built Demos.

Quick Start Guide

This quick start guide provides an end-to-end setup process for installing and running Cumulus Linux.

Prerequisites

This guide assumes you have intermediate-level Linux knowledge. You need to be familiar with basic text editing, Unix file permissions, and process monitoring. A variety of text editors are pre-installed, including vi and nano.

You must have access to a Linux or UNIX shell. If you are running Windows, use a Linux environment like Cygwin as your command line tool for interacting with Cumulus Linux.

Install Cumulus Linux

To install Cumulus Linux, you use ONIE (Open Network Install Environment), an extension to the traditional U-Boot software that allows for automatic discovery of a network installer image. This facilitates the ecosystem model of procuring switches with an operating system choice, such as Cumulus Linux. The easiest way to install Cumulus Linux with ONIE is with local HTTP discovery:

  1. If your host (laptop or server) is IPv6-enabled, make sure it is running a web server. If your host is IPv4-enabled, make sure it is running DHCP in addition to a web server.

  2. Download the Cumulus Linux installation file to the root directory of the web server. Rename this file onie-installer.

  3. Connect your host using an Ethernet cable to the management Ethernet port of the switch.

  4. Power on the switch. The switch downloads the ONIE image installer and boots. You can watch the installation progress in your terminal. After the installation completes, the Cumulus Linux login prompt appears in the terminal window.

These steps describe a flexible unattended installation method; you do not need a console cable. A fresh install with ONIE using a local web server typically completes in less than ten minutes. However, you have more options for installing Cumulus Linux with ONIE, such as using a local file, FTP or USB. See Installing a New Cumulus Linux Image for more options.

After installing Cumulus Linux, you are ready to:

Get Started

When starting Cumulus Linux for the first time, the management port makes a DHCPv4 request. To determine the IP address of the switch, you can cross reference the MAC address of the switch with your DHCP server. The MAC address is typically located on the side of the switch or on the box in which the unit ships.

Login Credentials

The default installation includes two accounts:

ONIE includes options that allow you to change the default password for the cumulus account automatically when you install a new Cumulus Linux image. Refer to ONIE Installation Options. You can also change the default password using a ZTP script.

In this quick start guide, you use the cumulus account to configure Cumulus Linux.

All accounts except root can use remote SSH login; you can use sudo to grant a non-root account root-level access. Commands that change the system configuration require this elevated level of access.

For more information about sudo, see Using sudo to Delegate Privileges.

Serial Console Management

NVIDIA recommends you perform management and configuration over the network, either in band or out of band. A serial console is fully supported.

Typically, switches ship from the manufacturer with a mating DB9 serial cable. Switches with ONIE are always set to a 115200 baud rate.

Wired Ethernet Management

A Cumulus Linux switch always provides at least one dedicated Ethernet management port called eth0. This interface is specifically for out-of-band management use. The management interface uses DHCPv4 for addressing by default.

To set a static IP address:

cumulus@switch:~$ net add interface eth0 ip address 192.0.2.42/24
cumulus@switch:~$ net add interface eth0 ip gateway 192.0.2.1
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set interface eth0 ip address 192.0.2.42/24
cumulus@switch:~$ nv set interface eth0 ip gateway 192.0.2.1
cumulus@switch:~$ nv config apply

Edit the /etc/network/interfaces file:

cumulus@switch:~$ sudo nano /etc/network/interfaces
# Management interface
auto eth0
iface eth0
    address 192.0.2.42/24
    gateway 192.0.2.1

Configure the Hostname

The hostname identifies the switch; make sure you configure the hostname to be unique and descriptive.

Do not use an underscore (_), apostrophe ('), or non-ASCII characters in the hostname.

To change the hostname:

cumulus@switch:~$ net add hostname leaf01
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

The above command modifies both the /etc/hostname and /etc/hosts files.

cumulus@switch:~$ nv set platform hostname value leaf01
cumulus@switch:~$ nv config apply
  1. Edit the /etc/hostname file with the desired hostname:

    cumulus@switch:~$ sudo nano /etc/hostname
    
  2. In the /etc/hosts file, replace the 127.0.1.1 IP address with the new hostname:

    cumulus@switch:~$ sudo nano /etc/hosts
    

The command prompt in the terminal does not reflect the new hostname until you either log out of the switch or start a new shell.

Configure the Time Zone

The default time zone on the switch is UTC (Coordinated Universal Time). Change the time zone on your switch to be the time zone for your location.

To update the time zone, use NTP interactive mode:

  1. In a terminal, run the following command:

    cumulus@switch:~$ sudo dpkg-reconfigure tzdata
    
  2. Follow the on screen menu options to select the geographic area and region.

Programs that are already running (including log files) and logged in users, do not see time zone changes made with interactive mode. To set the time zone for all services and daemons, reboot the switch.

Verify the System Time

Verify that the date and time on the switch are correct, and correct the date and time if necessary. If the date and time is incorrect, the switch does not synchronize with Puppet and returns errors after you restart switchd:

Warning: Unit file of switchd.service changed on disk, 'systemctl daemon-reload' recommended.

Configure Breakout Ports with Splitter Cables

If you are using 4x10G DAC or AOC cables, or you want to break out 100G or 40G switch ports, configure the breakout ports. For more details, see Switch Port Attributes.

Test Cable Connectivity

By default, Cumulus Linux disables all data plane ports (every Ethernet port except the management interface, eth0). To test cable connectivity, administratively enable physical ports.

To administratively enable a port:

cumulus@switch:~$ net add interface swp1
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

To administratively enable all physical ports on a switch that has ports numbered from swp1 to swp52:

cumulus@switch:~$ net add interface swp1-52
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

To view link status, run the net show interface all command.

To administratively enable a port:

cumulus@switch:~$ nv set interface swp1
cumulus@switch:~$ nv config apply

To administratively enable all physical ports on a switch that has ports numbered from swp1 to swp52:

cumulus@switch:~$ nv set interface swp1-52
cumulus@switch:~$ nv config apply

To view link status, run the nv show interface command.

To administratively enable a port:

cumulus@switch:~$ sudo ip link set swp1 up

To administratively enable all physical ports, run the following bash script:

cumulus@switch:~$ sudo su -
cumulus@switch:~$ for i in /sys/class/net/*; do iface=`basename $i`; if [[ $iface == swp* ]]; then ip link set $iface up fi done

To view link status, run the ip link show command.

Configure Layer 2 Ports

Cumulus Linux does not put all ports into a bridge by default. To create a bridge and configure one or more front panel ports as members of the bridge:

In the following configuration example, the front panel port swp1 is in a bridge called bridge.

cumulus@switch:~$ net add bridge bridge ports swp1
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

You can add a range of ports in one command. For example, to add swp1 through swp10, swp12, and swp14 through swp20 to bridge:

cumulus@switch:~$ net add bridge bridge ports swp1-10,12,14-20
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

The following configuration example places the front panel port swp1 into the default bridge called br_default.

cumulus@switch:~$ nv set interface swp1 bridge domain br_default
cumulus@switch:~$ nv config apply

You can add a range of ports in one command. For example, to add swp1 through swp3, swp10, and swp14 through swp20 to the bridge:

cumulus@switch:~$ nv set interface swp1-3,swp6,swp14-20 bridge domain br_default
cumulus@switch:~$ nv config apply

The following configuration example places the front panel port swp1 into the default bridge called br_default:

...
auto br_default
iface br_default
    bridge-ports swp1
...

To put a range of ports into a bridge, use the glob keyword. For example, to add swp1 through swp10, swp12, and swp14 through swp20 to the bridge called br_default:

...
auto br_default
iface br_default
    bridge-ports glob swp1-10 swp12 glob swp14-20
...

To apply the configuration, check for typos:

cumulus@switch:~$ sudo ifquery -a

If there are no errors, run the following command:

cumulus@switch:~$ sudo ifup -a

For more information about Ethernet bridges, see Ethernet Bridging - VLANs.

Configure Layer 3 Ports

You can configure a front panel port or bridge interface as a layer 3 port.

In the following configuration example, the front panel port swp1 is a layer 3 access port:

cumulus@switch:~$ net add interface swp1 ip address 10.1.1.1/30
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

To add an IP address to a bridge interface, you must put it into a VLAN interface. If you want to use a VLAN other than the native one, set the bridge PVID:

cumulus@switch:~$ net add vlan 100 ip address 10.2.2.1/24
cumulus@switch:~$ net add bridge bridge pvid 100
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

The following configuration example configures the front panel port swp1 as a layer 3 access port:

cumulus@switch:~$ nv set interface swp1 ip address 10.1.1.1/30
cumulus@switch:~$ nv config apply

To add an IP address to a bridge interface, you must put it into a VLAN interface. If you want to use a VLAN other than the native one, set the bridge PVID:

cumulus@switch:~$ nv set interface swp1-2 bridge domain br_default
cumulus@switch:~$ nv set bridge domain br_default vlan 100
cumulus@switch:~$ nv set interface vlan100 ip address 10.2.2.1/24
cumulus@switch:~$ nv set bridge domain br_default untagged 100
cumulus@switch:~$ nv config apply

The following configuration example configures the front panel port swp1 as a layer 3 access port:

auto swp1
iface swp1
  address 10.1.1.1/30

To add an IP address to a bridge interface, include the address under the iface stanza in the /etc/network/interfaces file. If you want to use a VLAN other than the native one, set the bridge PVID:

auto br_default
iface br_default
    address 10.2.2.1/24
    bridge-ports glob swp1-10 swp12 glob swp14-20
    bridge-pvid 100

To apply the configuration, check for typos:

cumulus@switch:~$ sudo ifquery -a

If there are no errors, run the following command:

cumulus@switch:~$ sudo ifup -a

Configure a Loopback Interface

Cumulus Linux has a preconfigured loopback interface. When the switch boots up, the loopback interface, called lo, is up and assigned an IP address of 127.0.0.1.

The loopback interface lo must always exist on the switch and must always be up.

To check the status of the loopback interface:

cumulus@switch:~$ net show interface lo
    Name    MAC                Speed    MTU    Mode
--  ------  -----------------  -------  -----  --------
UP  lo      00:00:00:00:00:00  N/A      65536  Loopback

Alias
-----
loopback interface
IP Details
-------------------------  --------------------
IP:                        127.0.0.1/8, ::1/128
IP Neighbor(ARP) Entries:  0

The loopback is up with the IP address 127.0.0.1.

cumulus@switch:~$ nv show interface lo
cumulus@switch:~$ ip addr show lo

To add an IP address to a loopback interface, configure the lo interface:

cumulus@switch:~$ net add loopback lo ip address 10.1.1.1/32
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@switch:~$ nv config apply

Add the IP address directly under the iface lo inet loopback definition in the /etc network/interfaces file:

auto lo
iface lo inet loopback
    address 10.10.10.1

If you configure an IP address without a subnet mask, it becomes a /32 IP address. For example, 10.10.10.1 is 10.10.10.1/32.

You can add multiple loopback addresses. For more information, see Interface Configuration and Management.

If you run NVUE Commands to configure the switch, run the nv config save command before you reboot to save the applied configuration to the startup configuration so that the changes persist after the reboot.

cumulus@switch:~$ nv config save

Installation Management

This section describes how to manage, install, and upgrade Cumulus Linux on your switch.

Managing Cumulus Linux Disk Images

The Cumulus Linux operating system resides on a switch as a disk image. This section discusses how to manage the image.

To install a new Cumulus Linux image, refer to Installing a New Cumulus Linux Image. To upgrade Cumulus Linux, refer to Upgrading Cumulus Linux.

Reprovision the System (Restart the Installer)

Reprovisioning the system deletes all system data from the switch.

To stage an ONIE installer from the network (where ONIE automatically locates the installer), run the onie-select -i command. You must reboot the switch to start the install process.

cumulus@switch:~$ sudo onie-select -i
WARNING:
WARNING: Operating System install requested.
WARNING: This will wipe out all system data.
WARNING:
Are you sure (y/N)? y
Enabling install at next reboot...done.
Reboot required to take effect.

To cancel a pending reinstall operation, run the onie-select -c command:

cumulus@switch:~$ sudo onie-select -c
Cancelling pending install at next reboot...done.

To stage an installer located in a specific location, run the onie-install -i <location> command. You can specify a local, absolute or relative path, an HTTP or HTTPS server, SCP or FTP server. You can also stage a Zero Touch Provisioning (ZTP) script along with the installer. You typically use the onie-install command with the -a option to activate installation. If you do not specify the -a option, you must reboot the switch to start the installation process.

The following example stages the installer located at http://203.0.113.10/image-installer together with the ZTP script located at http://203.0.113.10/ztp-script and activates installation and ZTP:

cumulus@switch:~$ sudo onie-install -i http://203.0.113.10/image-installer
cumulus@switch:~$ sudo onie-install -z http://203.0.113.10/ztp-script
cumulus@switch:~$ sudo onie-install -a

You can also specify these options together in the same command. For example:

cumulus@switch:~$ sudo onie-install -i http://203.0.113.10/image-installer -z http://203.0.113.10/ztp-script -a

To see more onie-install options, run man onie-install.

Uninstall All Images and Remove the Configuration

To remove all installed images and configurations, and return the switch to its factory defaults, run the onie-select -k command.

The onie-select -k command takes a long time to run as it overwrites the entire NOS section of the flash. Only use this command if you want to erase all NOS data and take the switch out of service.

cumulus@switch:~$ sudo onie-select -k
WARNING:
WARNING: Operating System uninstall requested.
WARNING: This will wipe out all system data.
WARNING:
Are you sure (y/N)? y
Enabling uninstall at next reboot...done.
Reboot required to take effect.

You must reboot the switch to start the uninstallation process.

To cancel a pending uninstall operation, run the onie-select -c command:

cumulus@switch:~$ sudo onie-select -c
Cancelling pending uninstall at next reboot...done.

Boot Into Rescue Mode

If your system becomes unresponsive, you can correct certain issues by booting into ONIE rescue mode, which uses unmounted file systems. You can use various Cumulus Linux utilities to try and resolve a problem.

To reboot the system into ONIE rescue mode, run the onie-select -r command:

cumulus@switch:~$ sudo onie-select -r
WARNING:
WARNING: Rescue boot requested.
WARNING:
Are you sure (y/N)? y
Enabling rescue at next reboot...done.
Reboot required to take effect.

You must reboot the system to boot into rescue mode.

To cancel a pending rescue boot operation, run the onie-select -c command:

cumulus@switch:~$ sudo onie-select -c
Cancelling pending rescue at next reboot...done.

Inspect the Image File

The Cumulus Linux image file is executable. From a running switch, you can display, extract, and verify the contents of the image file.

To display the contents of the Cumulus Linux image file, pass the info option to the image file. For example, to display the contents of an image file called onie-installer located in the /var/lib/cumulus/installer directory:

cumulus@switch:~$ sudo /var/lib/cumulus/installer/onie-installer info
Verifying image checksum ...OK.
Preparing image archive ... OK.
Control File Contents
=====================
Description: Cumulus Linux 4.1.0
Release: 4.1.0
Architecture: amd64
Switch-Architecture: bcm-amd64
Build-Id: dirtyz224615f
Build-Date: 2019-05-17T16:34:22+00:00
Build-User: clbuilder
Homepage: http://www.cumulusnetworks.com/
Min-Disk-Size: 1073741824
Min-Ram-Size: 536870912
mkimage-version: 0.11.111_gbcf0

To extract the contents of the image file, use with the extract <path> option. For example, to extract an image file called onie-installer located in the /var/lib/cumulus/installer directory to the mypath directory:

cumulus@switch:~$ sudo /var/lib/cumulus/installer/onie-installer extract mypath
total 181860
-rw-r--r-- 1 4000 4000       308 May 16 19:04 control
drwxr-xr-x 5 4000 4000      4096 Apr 26 21:28 embedded-installer
-rw-r--r-- 1 4000 4000  13273936 May 16 19:04 initrd
-rw-r--r-- 1 4000 4000   4239088 May 16 19:04 kernel
-rw-r--r-- 1 4000 4000 168701528 May 16 19:04 sysroot.tar

To verify the contents of the image file, use with the verify option. For example, to verify the contents of an image file called onie-installer located in the /var/lib/cumulus/installer directory:

cumulus@switch:~$ sudo /var/lib/cumulus/installer/onie-installer verify
Verifying image checksum ...OK.
Preparing image archive ... OK.
./cumulus-linux-bcm-amd64.bin.1: 161: ./cumulus-linux-bcm-amd64.bin.1: onie-sysinfo: not found
Verifying image compatibility ...OK.
Verifying system ram ...OK.
Open Network Install Environment (ONIE) Home Page

Installing a New Cumulus Linux Image

The default password for the cumulus user account is cumulus. The first time you log into Cumulus Linux, you must change this default password. Be sure to update any automation scripts before installing a new image. Cumulus Linux provides command line options to change the default password automatically during the installation process. Refer to ONIE Installation Options.

You can install a new Cumulus Linux image using ONIE, an open source project (equivalent to PXE on servers) that enables the installation of network operating systems (NOS) on bare metal switches.

Before you install Cumulus Linux, the switch can be in two different states:

The sections below describe some of the different ways you can install the Cumulus Linux image. Steps show how to install directly from ONIE (if no image is on the switch) and from Cumulus Linux (if the image is already on the switch). For additional methods to find and install the Cumulus Linux image, see the ONIE Design Specification.

You can download a Cumulus Linux image from the MyMellanox Downloads page.

Installing the Cumulus Linux image is destructive; configuration files on the switch are not saved; copy them to a different server before installing.

In the following procedures:

Install Using a DHCP/Web Server With DHCP Options

To install Cumulus Linux using a DHCP/web server with DHCP options, set up a DHCP/web server on your laptop and connect the eth0 management port of the switch to your laptop. After you connect the cable, the installation proceeds as follows:

  1. The switch boots up and requests an IP address (DHCP request).

  2. The DHCP server acknowledges and responds with DHCP option 114 and the location of the installation image.

  3. ONIE downloads the Cumulus Linux image, installs, and reboots.

    You are now running Cumulus Linux.

The most common way is to send DHCP option 114 with the entire URL to the web server (this can be the same system). However, there are other ways you can use DHCP even if you do not have full control over DHCP. See the ONIE user guide for information on partial installer URLs and advanced DHCP options; both articles list more supported DHCP options.

Here is an example DHCP configuration with an ISC DHCP server:

subnet 172.0.24.0 netmask 255.255.255.0 {
  range 172.0.24.20 172.0.24.200;
  option default-url = "http://172.0.24.14/onie-installer-x86_64";
}

Here is an example DHCP configuration with dnsmasq (static address assignment):

dhcp-host=sw4,192.168.100.14,6c:64:1a:00:03:ba,set:sw4
dhcp-option=tag:sw4,114,"http://roz.rtplab.test/onie-installer-x86_64"

If you do not have a web server, you can use this free Apache example.

Install Using a DHCP/Web Server without DHCP Options

Follow the steps below if you can log into the switch on a serial console (ONIE), or log in on the console or with ssh (Install from Cumulus Linux).

  1. Place the Cumulus Linux image in a directory on the web server.

  2. Run the onie-nos-install command:

    ONIE:/ #onie-nos-install http://10.0.1.251/path/to/cumulus-install-x86_64.bin
    
  1. Place the Cumulus Linux image in a directory on the web server.

  2. From the Cumulus Linux command prompt, run the onie-install command, then reboot the switch.

    cumulus@switch:~$ sudo onie-install -a -i http://10.0.1.251/path/to/cumulus-install-x86_64.bin
    

Install Using a Web Server With no DHCP

Follow the steps below if you can log into the switch on a serial console (ONIE), or you can log in on the console or with ssh (Install from Cumulus Linux) but no DHCP server is available.

You need a console connection to access the switch; you cannot perform this procedure remotely.

  1. ONIE is in discovery mode. You must disable discovery mode with the following command:

    onie# onie-discovery-stop
    

    On older ONIE versions, if the onie-discovery-stop command is not supported, run:

    onie# /etc/init.d/discover.sh stop
    
  2. Assign a static address to eth0 with the ip addr add command:

    ONIE:/ #ip addr add 10.0.1.252/24 dev eth0
    
  3. Place the Cumulus Linux image in a directory on your web server.

  4. Run the installer manually (because there are no DHCP options):

    ONIE:/ #onie-nos-install http://10.0.1.251/path/to/cumulus-install-x86_64.bin
    
  1. Place the Cumulus Linux image in a directory on your web server.

  2. From the Cumulus Linux command prompt, run the onie-install command, then reboot the switch.

    cumulus@switch:~$ sudo onie-install -a -i http://10.0.1.251/path/to/cumulus-install-x86_64.bin
    

Install Using FTP Without a Web Server

Follow the steps below if your laptop is on the same network as the switch eth0 interface but no DHCP server is available.

  1. Set up DHCP or static addressing for eth0. The following example assigns a static address to eth0:

    ONIE:/ #ip addr add 10.0.1.252/24 dev eth0
    
  2. If you are using static addressing, disable ONIE discovery mode:

    onie# onie-discovery-stop
    

    On older ONIE versions, if the onie-discovery-stop command is not supported, run:

    onie# /etc/init.d/discover.sh stop
    
  3. Place the Cumulus Linux image into a TFTP or FTP directory.

  4. If you are not using DHCP options, run one of the following commands (tftp for TFTP or ftp for FTP):

    ONIE# onie-nos-install ftp://local-ftp-server/cumulus-install-x86_64.bin
    
    ONIE# onie-nos-install tftp://local-tftp-server/cumulus-install-[PLATFORM].bin
    
  1. Place the Cumulus Linux image into an FTP directory (TFTP is not supported in Cumulus Linux).

  2. From the Cumulus Linux command prompt, run the following command, then reboot the switch.

    cumulus@switch:~$ sudo onie-install -a -i ftp://local-ftp-server/cumulus-install-x86_64.bin
    

Install Using a Local File

Follow the steps below to install the Cumulus Linux image referencing a local file.

  1. Set up DHCP or static addressing for eth0. The following example assigns a static address to eth0:

    ONIE:/ #ip addr add 10.0.1.252/24 dev eth0
    
  2. If you are using static addressing, disable ONIE discovery mode.

    onie# onie-discovery-stop
    

    On older ONIE versions, if the onie-discovery-stop command is not supported, run:

    onie# /etc/init.d/discover.sh stop
    
  3. Use scp to copy the Cumulus Linux image to the switch.

  4. Run the installer manually from ONIE:

    ONIE:/ #onie-nos-install /path/to/local/file/cumulus-install-x86_64.bin
    
  1. Copy the Cumulus Linux image to the switch.

  2. From the Cumulus Linux command prompt, run the onie-install command, then reboot the switch.

    cumulus@switch:~$ sudo onie-install -a -i /path/to/local/file/cumulus-install-x86_64.bin
    

Install Using a USB Drive

Follow the steps below to install the Cumulus Linux image using a USB drive.

Installing Cumulus Linux using a USB drive is fine for a single switch here and there but is not scalable. DHCP can scale to hundreds of switch installs with zero manual input unlike USB installs.

Prepare for USB Installation

  1. From the MyMellanox Downloads page, download the appropriate Cumulus Linux image for your platform.

  2. From a computer, prepare your USB drive by formatting it using one of the supported formats: FAT32, vFAT or EXT2.

    Optional: Prepare a USB Drive inside Cumulus Linux

    a. Insert your USB drive into the USB port on the switch running Cumulus Linux and log in to the switch. Examine output from cat /proc/partitions and sudo fdisk -l [device] to determine the location of your USB drive. For example, sudo fdisk -l /dev/sdb.

    These instructions assume your USB drive is the /dev/sdb device, which is typical if you insert the USB drive after the machine is already booted. However, if you insert the USB drive during the boot process, it is possible that your USB drive is the /dev/sda device. Make sure to modify the commands below to use the proper device for your USB drive.

    b. Create a new partition table on the USB drive. If the parted utility is not on the system, install it with sudo -E apt-get install parted.

    sudo parted /dev/sdb mklabel msdos
    

    c. Create a new partition on the USB drive:

    sudo parted /dev/sdb -a optimal mkpart primary 0% 100%
    

    d. Format the partition to your filesystem of choice using one of the examples below:

    sudo mkfs.ext2 /dev/sdb1
    sudo mkfs.msdos -F 32 /dev/sdb1
    sudo mkfs.vfat /dev/sdb1
    

    To use mkfs.msdos or mkfs.vfat, you need to install the dosfstools package from the Debian software repositories, as they are not included by default.

    e. To continue installing Cumulus Linux, mount the USB drive to move files:

    sudo mkdir /mnt/usb
    sudo mount /dev/sdb1 /mnt/usb
    
  3. Copy the Cumulus Linux image to the USB drive, then rename the image file to onie-installer-x86_64.

    You can also use any of the ONIE naming schemes mentioned here.

    When using a MAC or Windows computer to rename the installation file, the file extension can still be present. Make sure you remove the file extension so that ONIE can detect the file.

  4. Insert the USB drive into the switch, then prepare the switch for installation:

    • If the switch is offline, connect to the console and power on the switch.
    • If the switch is already online in ONIE, use the reboot command.

    SSH sessions to the switch get dropped after this step. To complete the remaining instructions, connect to the console of the switch. Cumulus Linux switches display their boot process to the console; you need to monitor the console specifically to complete the next step.

  5. Monitor the console and select the ONIE option from the first GRUB screen shown below.

  6. Cumulus Linux on x86 uses GRUB chainloading to present a second GRUB menu specific to the ONIE partition. No action is necessary in this menu to select the default option ONIE: Install OS.

  7. The switch recognizes the USB drive and mounts it automatically. Cumulus Linux installation begins.

  8. After installation completes, the switch automatically reboots into the newly installed instance of Cumulus Linux.

ONIE Installation Options

You can run several installer command line options from ONIE to perform basic switch configuration automatically after installation completes and Cumulus Linux boots for the first time. These options enable you to:

The onie-nos-install command does not allow you specify command line parameters. You must access the switch from the console and transfer a disk image to the switch. You must then make the disk image executable and install the image directly from the ONIE command line with the options you want to use.

The following example commands transfer a disk image to the switch, make the image executable, and install the image with the --password option to change the default cumulus user password:

ONIE:/ # wget http://myserver.datacenter.com/cumulus-linux-4.4.0-mlx-amd64.bin
ONIE:/ # chmod 755 cumulus-linux-4.4.0-mlx-amd64.bin
ONIE:/ # ./cumulus-linux-4.4.0-mlx-amd64.bin --password 'MyP4$$word'

You can run more than one option in the same command.

Set the cumulus User Password

The default cumulus user account password is cumulus. When you log into Cumulus Linux for the first time, you must provide a new password for the cumulus account, then log back into the system.

To automate this process, you can specify a new password from the command line of the installer with the --password '<clear text-password>' option. For example, to change the default cumulus user password to MyP4$$word:

ONIE:/ # ./cumulus-linux-4.4.0-mlx-amd64.bin --password 'MyP4$$word'

To provide a hashed password instead of a clear text password, use the --hashed-password '<hash>' option. An encrypted hash maintains a secure management network.

  1. Generate a sha-512 password hash with the following python command. The example command generates a sha-512 password hash for the password MyP4$$word.

    user@host:~$ python3 -c "import crypt; print(crypt.crypt('MyP4$$word',salt=crypt.mksalt()))"
    $6$hs7OPmnrfvLNKfoZ$iB3hy5N6Vv6koqDmxixpTO6lej6VaoKGvs5E8p5zNo4tPec0KKqyQnrFMII3jGxVEYWntG9e7Z7DORdylG5aR/
    
  2. Specify the new password from the command line of the installer with the --hashed-password '<hash>' command:

    ONIE:/ # ./cumulus-linux-4.4.0-mlx-amd64.bin  --hashed-password '$6$hs7OPmnrfvLNKfoZ$iB3hy5N6Vv6koqDmxixpTO6lej6VaoKGvs5E8p5zNo4tPec0KKqyQnrFMII3jGxVEYWntG9e7Z7DORdylG5aR/'
    

If you specify both the --password and --hashed-password options, the --hashed-password option takes precedence and the switch ignores he --password option.

Provide Initial Network Configuration

To provide initial network configuration automatically when Cumulus Linux boots for the first time after installation, use the --interfaces-file <filename> option. For example, to copy the contents of a file called network.intf into the /etc/network/interfaces file and run the ifreload -a command:

ONIE:/ # ./cumulus-linux-4.4.0-mlx-amd64.bin  --interfaces-file network.intf

Execute a ZTP Script

To run a ZTP script that contains commands to execute after Cumulus Linux boots for the first time after installation, use the --ztp <filename> option. For example, to run a ZTP script called initial-conf.ztp:

ONIE:/ # ./cumulus-linux-4.4.0-mlx-amd64.bin --ztp initial-conf.ztp

The ZTP script must contain the CUMULUS-AUTOPROVISIONING string near the beginning of the file and must reside on the ONIE filesystem. Refer to Zero Touch Provisioning - ZTP.

If you use the --ztp option together with any of the other command line options, the ZTP script takes precedence and the switch ignores other command line options.

Edit the Cumulus Linux Image (Advanced)

The Cumulus Linux disk image file contains a BASH script that includes a set of variables. You can set these variables to be able to install a fully configured system with a single image file.

To edit the image

Example Image File

The Cumulus Linux disk image file is a self-extracting executable. The executable part of the file is a BASH script at the beginning of the file. Towards the beginning of this BASH script are a set of variables with empty strings:

...
CL_INSTALLER_PASSWORD=''
CL_INSTALLER_HASHED_PASSWORD=''
CL_INSTALLER_LICENSE=''
CL_INSTALLER_INTERFACES_FILENAME=''
CL_INSTALLER_INTERFACES_CONTENT=''
CL_INSTALLER_ZTP_FILENAME=''
CL_INSTALLER_QUIET=""
CL_INSTALLER_FORCEINST=""
CL_INSTALLER_INTERACTIVE=""
CL_INSTALLER_EXTRACTDIR=""
CL_INSTALLER_PAYLOAD_SHA256="72a8c3da28cda7a610e272b67fa1b3a54c50248bf6abf720f73ff3d10e79ae76"

You can set these variables:

VariableDescription
CL_INSTALLER_PASSWORDDefines the clear text password.
This variable is equivalent to the ONIE installer command line option --password.
CL_INSTALLER_HASHED_PASSWORDDefines the hashed password.
This variable is equivalent to the ONIE installer command line option --hashed-password.
If you set both the CL_INSTALLER_PASSWORD and CL_INSTALLER_HASHED_PASSWORD variable, the CL_INSTALLER_HASHED_PASSWORD takes precedence.
CL_INSTALLER_INTERFACES_FILENAMEDefines the name of the file on the ONIE filesystem you want to use as the /etc/network/interfaces file.
This variable is equivalent to the ONIE installer command line option --interfaces-file.
CL_INSTALLER_INTERFACES_CONTENTDescribes the network interfaces available on your system and how to activate them. Setting this variable defines the contents of the /etc/network/interfaces file.
There is no equivalent ONIE installer command line option.
If you set both the CL_INSTALLER_INTERFACES_FILENAME and CL_INSTALLER_INTERFACES_CONTENT variables, the CL_INSTALLER_INTERFACES_FILENAME takes precedence.
CL_INSTALLER_ZTP_FILENAMEDefines the name of the ZTP file on the ONIE filesystem you want to execute at first boot after installation.
This variable is equivalent to the ONIE installer command line option --ztp

Edit the Image File

Because the Cumulus Linux image file is a binary file, you cannot use standard text editors to edit the file directly. Instead, you must split the file into two parts, edit the first part, then put the two parts back together.

  1. Copy the first 20 lines to an empty file:
head -20 cumulus-linux-4.4.0-mlx-amd64.bin > cumulus-linux-4.4.0-mlx-amd64.bin.1
  1. Remove the first 20 lines of the image, then copy the remaining lines into another empty file:
sed -e '1,20d' cumulus-linux-4.4.0-mlx-amd64.bin > cumulus-linux-4.4.0-mlx-amd64.bin.2

The original file is now split, with the first 20 lines in cumulus-linux-4.4.0-mlx-amd64.bin.1 and the remaining lines in cumulus-linux-4.4.0-mlx-amd64.bin.2.

  1. Use a text editor to change the variables in cumulus-linux-4.4.0-mlx-amd64.bin.1.

  2. Put the two pieces back together using cat:

cat cumulus-linux-4.4.0-mlx-amd64.bin.1 cumulus-linux-4.4.0-mlx-amd64.bin.2 > cumulus-linux-4.4.0-mlx-amd64.bin.final
  1. Calculate the new checksum and update the CL_INSTALLER_PAYLOAD_SHA256 variable.
    sed -e '1,/^exit_marker$/d' "cumulus-linux-4.4.0-mlx-amd64.bin.final" | sha256sum | awk '{ print $1 }'

This following example shows a modified image file:

...
CL_INSTALLER_PAYLOAD_SHA256='d14a028c2a3a2bc9476102bb288234c415a2b01f828ea62ac332e42f'
CL_INSTALLER_PASSWORD='MyP4$$word'
CL_INSTALLER_HASHED_PASSWORD=''
CL_INSTALLER_LICENSE='customer@datacenter.com|4C3YMCACDiK0D/EnrxlXpj71FBBNAg4Yrq+brza4ZtJFCInvalid'
CL_INSTALLER_INTERFACES_FILENAME=''
CL_INSTALLER_INTERFACES_CONTENT='# This file describes the network interfaces available on your system and how to activate them.

source /etc/network/interfaces.d/*.intf

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
iface eth0 inet dhcp
	vrf mgmt

auto bridge
iface bridge
    bridge-ports swp1 swp2
    bridge-pvid 1
    bridge-vids 10 11
    bridge-vlan-aware yes

auto mgmt
iface mgmt
	address 127.0.0.1/8
	address ::1/128
	vrf-table auto
'
CL_INSTALLER_ZTP_FILENAME=''
...

You can install this edited image file in the usual way, by using the ONIE install waterfall or the onie-nos-install command.

If you install the modified installation image and specify installer command line parameters, the command line parameters take precedence over the variables modified in the image.

Secure Boot

Secure Boot validates each binary image loaded during system boot with key signatures that correspond to a stored trusted key in firmware.

Secure Boot is only on the NVIDIA SN3700C-S switch.

Secure Boot settings are in the BIOS Security menu. To access BIOS, press Ctrl+B through the serial console during system boot while the BIOS version prints:


To access the BIOS menu, use admin which is the default BIOS password:


NVIDIA recommends changing the default BIOS password. To change the BIOS password, select Administrator Password from the Security menu:

To validate or change the Secure Boot mode, navigate to Security and select Secure Boot:


In the Secure Boot menu, you can enable and disable Secure Boot mode. To install an unsigned version of Cumulus Linux or access ONIE without a prompt for a username and password, set Secure Boot to disabled:


To access ONIE when Secure Boot is enabled, authentication is necessary. The default username and password are both root:

​ONIE: Rescue Mode ...
Platform  : x86_64-mlnx_x86-r0
Version   : 2021.02-5.3.0006-rc3-115200
Build Date: 2021-05-20T14:27+03:00
Info: Mounting kernel filesystems... done.

Info: Mounting ONIE-BOOT on /mnt/onie-boot ...
[   17.011057] ext4 filesystem being mounted at /mnt/onie-boot supports timestamps until 2038 (0x7fffffff)
Info: Mounting EFI System on /boot/efi ...
Info: BIOS mode: UEFI
Info: Using eth0 MAC address: b8:ce:f6:3c:62:06
Info: eth0:  Checking link... up.
Info: Trying DHCPv4 on interface: eth0
ONIE: Using DHCPv4 addr: eth0: 10.20.84.226 / 255.255.255.0
Starting: klogd... done.
Starting: dropbear ssh daemon... done.
Starting: telnetd... done.
discover: Rescue mode detected.  Installer disabled.

Please press Enter to activate this console. To check the install status inspect /var/log/onie.log.
Try this:  tail -f /var/log/onie.log

** Rescue Mode Enabled **
login: root
Password: root
ONIE:~ #

To validate the Secure Boot status of a system from Cumulus Linux, run the mokutil --sb-state command.

cumulus@leaf01:mgmt:~$ mokutil --sb-state
SecureBoot enabled

Upgrading Cumulus Linux

The default password for the cumulus user account is cumulus. The first time you log into Cumulus Linux, you must change this default password. Be sure to update any automation scripts before you upgrade. You can use ONIE command line options to change the default password automatically during the Cumulus Linux image installation process. Refer to ONIE Installation Options.

This topic describes how to upgrade Cumulus Linux on your switch.

Consider deploying, provisioning, configuring, and upgrading switches using automation, even with small networks or test labs. During the upgrade process, you can upgrade dozens of devices in a repeatable manner. Using tools like Ansible, Chef, or Puppet for configuration management greatly increases the speed and accuracy of the next major upgrade; these tools also enable the quick swap of failed switch hardware.

Before You Upgrade

Be sure to read the knowledge base article Upgrades: Network Device and Linux Host Worldview Comparison, which provides a detailed comparison between the network device and Linux host worldview of upgrade and installation.

Back up Configuration Files

Understanding the location of configuration data is important for successful upgrades, migrations, and backup. As with other Linux distributions, the /etc directory is the primary location for all configuration data in Cumulus Linux. The following list is the set of files that you need to back up and migrate to a new release. Make sure you examine any changed files. Make the following files and directories part of a backup strategy.

File Name and LocationDescriptionCumulus Linux DocumentationDebian Documentation
/etc/network/Network configuration files, most notably /etc/network/interfaces and /etc/network/interfaces.d/Switch Port AttributesN/A
/etc/resolv.confDNS resolutionNot unique to Cumulus Linux: wiki.debian.org/NetworkConfigurationhttps://www.debian.org/doc/manuals/debian-reference/ch05.en.html
/etc/frr/Routing application (responsible for BGP and OSPF)FRRoutingN/A
/etc/hostnameConfiguration file for the hostname of the switchQuick Start Guidehttps://wiki.debian.org/HowTo/ChangeHostname
/etc/hostsConfiguration file for the hostname of the switchQuick Start Guidehttps://wiki.debian.org/HowTo/ChangeHostname
/etc/cumulus/acl/*Netfilter configurationNetfilter - ACLsN/A
/etc/cumulus/ports.confBreakout cable configuration fileSwitch Port AttributesN/A; read the guide on breakout cables
/etc/cumulus/switchd.confswitchd configurationConfiguring switchdN/A; read the guide on switchd configuration
File Name and LocationDescriptionCumulus Linux DocumentationDebian Documentation
/etc/motdMessage of the dayNot unique to Cumulus Linuxwiki.debian.org/motd
/etc/passwdUser account informationNot unique to Cumulus Linuxhttps://www.debian.org/doc/manuals/debian-reference/ch04.en.html
/etc/shadowSecure user account informationNot unique to Cumulus Linuxhttps://www.debian.org/doc/manuals/debian-reference/ch04.en.html
/etc/groupDefines user groups on the switchNot unique to Cumulus Linuxhttps://www.debian.org/doc/manuals/debian-reference/ch04.en.html
/etc/lldpd.confLink Layer Discover Protocol (LLDP) daemon configurationLink Layer Discovery Protocolhttps://packages.debian.org/buster/lldpd
/etc/lldpd.d/Configuration directory for lldpdLink Layer Discovery Protocolhttps://packages.debian.org/buster/lldpd
/etc/nsswitch.confName Service Switch (NSS) configuration fileTACACSN/A
/etc/ssh/SSH configuration filesSSH for Remote Accesshttps://wiki.debian.org/SSH
/etc/sudoers, /etc/sudoers.dBest practice is to place changes in /etc/sudoers.d/ instead of /etc/sudoers; changes in the /etc/sudoers.d/ directory are not lost during upgradeUsing sudo to Delegate Privileges

  • If you are using the root user account, consider including /root/.
  • If you have custom user accounts, consider including /home/<username>/.
  • Run the net show configuration files | grep -B 1 "===" command and back up the files listed in the command output.

File Name and LocationDescription
/etc/mlx/Per-platform hardware configuration directory, created on first boot. Do not copy.
/etc/default/clagdCreated and managed by ifupdown2. Do not copy.
/etc/default/grubGrub init table. Do not modify manually.
/etc/default/hwclockPlatform hardware-specific file. Created during first boot. Do not copy.
/etc/initPlatform initialization files. Do not copy.
/etc/init.d/Platform initialization files. Do not copy.
/etc/fstabStatic information on filesystem. Do not copy.
/etc/image-releaseSystem version data. Do not copy.
/etc/os-releaseSystem version data. Do not copy.
/etc/lsb-releaseSystem version data. Do not copy.
/etc/lvm/archiveFilesystem files. Do not copy.
/etc/lvm/backupFilesystem files. Do not copy.
/etc/modulesCreated during first boot. Do not copy.
/etc/modules-load.d/Created during first boot. Do not copy.
/etc/sensors.dPlatform-specific sensor data. Created during first boot. Do not copy.
/root/.ansibleAnsible tmp files. Do not copy.
/home/cumulus/.ansibleAnsible tmp files. Do not copy.

If you use certain forms of network virtualization, such as VMware NSX-V, you update the /usr/share/openvswitch/scripts/ovs-ctl-vtep file. This file is not marked as a configuration file; therefore, if the file contents change in a newer release of Cumulus Linux, they overwrite any changes you make to the file. Be sure to back up this file and the database file conf.db before upgrading.

The following commands show you which files changed after the previous Cumulus Linux installation. Be sure to back up any changed files.

  • Run the sudo dpkg --verify command to show a list of changed files.
  • Run the egrep -v '^$|^#|=""$' /etc/default/isc-dhcp-* command to see if any of the generated /etc/default/isc-* files have changes.

Upgrade Cumulus Linux

You can upgrade Cumulus Linux in one of two ways:

Cumulus Linux also provides the Smart System Manager that enables you to upgrade an active switch with minimal disruption to the network. See Smart System Manager.

Upgrading an MLAG pair requires additional steps. If you are using MLAG to dual connect two Cumulus Linux switches in your environment, follow the steps in Upgrade Switches in an MLAG Pair below to ensure a smooth upgrade.

Install a Cumulus Linux Image or Upgrade Packages?

The decision to upgrade Cumulus Linux by either installing a Cumulus Linux image or upgrading packages depends on your environment and your preferences. Here are some recommendations for each upgrade method.

Installing a Cumulus Linux image is best if you are performing a rolling upgrade in a production environment and if are using up-to-date and comprehensive automation scripts. This upgrade method enables you to choose the exact release to which you want to upgrade and is the only method available to upgrade your switch to a new release train (for example, from 3.7.14 to 4.4.0).

To upgrade to Cumulus Linux 4.4.0 from a previous release, you must install a Cumulus Linux image; package upgrade is not supported.

Be aware of the following when installing the Cumulus Linux image:

Package upgrade is best if you are upgrading from Cumulus Linux 4.4.0 to a later 4.4.x release, or if you use third-party applications. Package upgrade does not replace or remove third-party applications, unlike the Cumulus Linux image install.

Be aware of the following when upgrading packages:

Cumulus Linux Image Install (ONIE)

ONIE is an open source project, equivalent to PXE on servers, that you use to install a NOS on a bare metal switch.

To upgrade the switch:

  1. Back up the configurations off the switch.

  2. Download the Cumulus Linux image.

  3. Install the Cumulus Linux image with the onie-install -a -i <image-location> command, which boots the switch into ONIE. The following example command installs the image from a web server, then reboots the switch. There are additional ways to install the Cumulus Linux image, such as using FTP, a local file, or a USB drive. For more information, see Installing a New Cumulus Linux Image.

    cumulus@switch:~$ sudo onie-install -a -i http://10.0.1.251/cumulus-linux-4.1.0-mlx-amd64.bin && sudo reboot
    
  4. Restore the configuration files to the new release (NVIDIA recommends that you do not restore files with automation).

  5. Verify correct operation with the old configurations on the new release.

  6. Reinstall third party applications and associated configurations.

Package Upgrade

Cumulus Linux completely embraces the Linux and Debian upgrade workflow, where you use an installer to install a base image, then perform any upgrades within that release train with sudo -E apt-get update and sudo -E apt-get upgrade commands. Any packages that have changed after the base install get upgraded in place from the repository. All switch configuration files remain untouched, or in rare cases merged (using the Debian merge function) during the package upgrade.

When you use package upgrade to upgrade your switch, configuration data stays in place while the packages upgrade. If the new release updates a configuration file that you changed previously, the upgrade prompts you for the version you want to use or to evaluate the differences.

To upgrade the switch using package upgrade:

  1. Back up the configurations from the switch.

  2. Fetch the latest update metadata from the repository.

    cumulus@switch:~$ sudo -E apt-get update
    
  3. Review potential upgrade issues (in some cases, upgrading new packages might also upgrade additional existing packages due to dependencies). Run the following command to see the additional packages for installation or upgrade.

    cumulus@switch:~$ sudo -E apt-get upgrade --dry-run
    
  4. Upgrade all the packages to the latest distribution.

    cumulus@switch:~$ sudo -E apt-get upgrade
    

    If you do not need to reboot the switch after the upgrade completes, the upgrade ends, restarts all upgraded services, and log messages in the /var/log/syslog file similar to the ones shown below. In the examples below, only the frr package upgrades.

    Policy: Service frr.service action stop postponed
    Policy: Service frr.service action start postponed
    Policy: Restarting services: frr.service
    Policy: Finished restarting services
    Policy: Removed /usr/sbin/policy-rc.d
    Policy: Upgrade is finished
    

    If the upgrade process encounters changed configuration files that have new versions in the release to which you are upgrading, you see a message similar to this:

    Configuration file '/etc/frr/daemons'
    ==> Modified (by you or by a script) since installation.
    ==> Package distributor has shipped an updated version.
    What would you like to do about it ? Your options are:
    Y or I : install the package maintainer's version
    N or O : keep your currently-installed version
    D : show the differences between the versions
    Z : start a shell to examine the situation
    The default action is to keep your current version.
    *** daemons (Y/I/N/O/D/Z) [default=N] ?
    
    • To see the differences between the currently installed version and the new version, type D.
    • To keep the currently installed version, type N. The new package version installs with the suffix .dpkg-dist (for example, /etc/frr/daemons.dpkg-dist). When upgrade is complete and before you reboot, merge your changes with the changes from the newly installed file.
    • To install the new version, type I. The upgrade saves your currently installed version with the suffix .dpkg-old. When the upgrade is complete, you can search for the files with the sudo find / -mount -type f -name '*.dpkg-*' command.
  5. Reboot the switch if the upgrade messages indicate that you need a system restart.

    cumulus@switch:~$ sudo -E apt-get upgrade
    ... upgrade messages here ...
    
    *** Caution: Service restart prior to reboot could cause unpredictable behavior
    *** System reboot required ***
    cumulus@switch:~$ sudo reboot
    
  6. Verify correct operation with the old configurations on the new version.

Upgrade Notes

Package upgrade always updates to the latest available release in the Cumulus Linux repository. For example, if you are currently running Cumulus Linux 4.4.0 and run the sudo -E apt-get upgrade command on that switch, the packages upgrade to the latest releases contained in the latest 4.4.x release.

Because Cumulus Linux is a collection of different Debian Linux packages, be aware of the following:

Upgrade Switches in an MLAG Pair

If you are using MLAG to dual connect two switches in your environment, follow the steps below to upgrade the switches.

You must upgrade both switches in the MLAG pair to the same release of Cumulus Linux.

For networks with MLAG deployments, you can only upgrade to Cumulus Linux 4.4 from version 3.7.10 or later. If you are using a version of Cumulus Linux earlier than 3.7.10, you must upgrade to version 3.7.10 first, then upgrade to version 4.4. Version 3.7.10 is available on the MyMellanox downloads page.

During upgrade, MLAG bonds stay single-connected while the switches are running different major releases; for example, while leaf01 is running 3.7.12 and leaf02 is running 4.4.0.

The bonding driver changes how it derives the actor port key, which causes the port key to have a different value for links with the same speed or duplex settings across different major releases. The port key from the LACP partner must remain consistent between all bond members for all bonds to synchronize. When each MLAG switch sends LACPDUs with different port keys, only links to one MLAG switch synchronize.

  1. Verify the switch is in the secondary role:

    cumulus@switch:~$ clagctl status
    
  2. Shut down the core uplink layer 3 interfaces:

    cumulus@switch:~$ sudo ip link set <switch-port> down
    
  3. Shut down the peer link:

    cumulus@switch:~$ sudo ip link set peerlink down
    
  4. Run the onie-install -a -i <image-location> command to boot the switch into ONIE. The following example command installs the image from a web server. There are additional ways to install the Cumulus Linux image, such as using FTP, a local file, or a USB drive. For more information, see Installing a New Cumulus Linux Image.

    cumulus@switch:~$ sudo onie-install -a -i http://10.0.1.251/downloads/cumulus-linux-4.1.0-mlx-amd64.bin
    
  5. Reboot the switch:

    cumulus@switch:~$ sudo reboot
    
  6. Verify STP convergence across both switches:

    cumulus@switch:~$ mstpctl showall
    
  7. Verify core uplinks and peer links are UP:

    cumulus@switch:~$ nv show interface
    
  8. Verify MLAG convergence:

    cumulus@switch:~$ clagctl status
    
  9. Make this secondary switch the primary:

    cumulus@switch:~$ clagctl priority 2048
    
  10. Verify the other switch is now in the secondary role.

  11. Repeat steps 2-8 on the new secondary switch.

  12. Remove the priority 2048 and restore the priority back to 32768 on the current primary switch:

    cumulus@switch:~$ clagctl priority 32768
    

Roll Back a Cumulus Linux Installation

Even the most well planned and tested upgrades can result in unforeseen problems and sometimes the best solution is to roll back to the previous state. These main strategies require detailed planning and execution:

The method you employ is specific to your deployment strategy. Providing detailed steps for each scenario is outside the scope of this document.

Third Party Packages

If you install any third party applications on a Cumulus Linux switch, configuration data is typically installed in the /etc directory, but it is not guaranteed. It is your responsibility to understand the behavior and configuration file information of any third party packages installed on the switch.

After you upgrade using a full Cumulus Linux image install, you need to reinstall any third party packages or any Cumulus Linux add-on packages.

Adding and Updating Packages

To manage additional applications in the form of packages and to install the latest updates, use the Advanced Packaging Tool (apt).

Updating, upgrading, and installing packages with apt causes disruptions to network services:

  • Upgrading a package can cause services to restart or stop.
  • Installing a package sometimes disrupts core services by changing core service dependency packages. In some cases, installing new packages also upgrades additional existing packages due to dependencies.
  • If services stop, you need to reboot the switch to restart the services.

Update the Package Cache

To work correctly, apt relies on a local cache listing of the available packages. You must populate the cache initially, then periodically update it with sudo -E apt-get update:

cumulus@switch:~$ sudo -E apt-get update
Ign:1 copy:/var/lib/cumulus/cumulus-local-apt-archive cumulus-local-apt-archive InRelease
Get:2 copy:/var/lib/cumulus/cumulus-local-apt-archive cumulus-local-apt-archive Release [1,115 B]
Ign:3 copy:/var/lib/cumulus/cumulus-local-apt-archive cumulus-local-apt-archive Release.gpg
Hit:4 http://deb.debian.org/debian buster InRelease
Get:5 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Get:6 http://deb.debian.org/debian buster-backports InRelease [46.7 kB]
Get:7 http://security.debian.org buster/updates InRelease [65.4 kB]
Get:8 http://apps3.cumulusnetworks.com/repos/deb CumulusLinux-4 InRelease [35.7 kB]
Hit:9 http://apt.cumulusnetworks.com/repo CumulusLinux-4-latest InRelease
Get:10 http://deb.debian.org/debian buster-backports/main Sources.diff/Index [27.8 kB]
Get:11 http://deb.debian.org/debian buster-backports/main amd64 Packages.diff/Index [27.8 kB]
Get:12 http://deb.debian.org/debian buster-backports/main Translation-en.diff/Index [27.8 kB]
Get:13 http://deb.debian.org/debian buster-backports/main Sources 2021-06-04-1401.15.pdiff [1,731 B]
Get:13 http://deb.debian.org/debian buster-backports/main Sources 2021-06-04-1401.15.pdiff [1,731 B]
Get:14 http://deb.debian.org/debian buster-backports/main amd64 Packages 2021-06-04-1401.15.pdiff [653 B]
Get:15 http://deb.debian.org/debian buster-backports/main Translation-en 2021-06-04-1401.15.pdiff [358 B]
Get:14 http://deb.debian.org/debian buster-backports/main amd64 Packages 2021-06-04-1401.15.pdiff [653 B]
Get:15 http://deb.debian.org/debian buster-backports/main Translation-en 2021-06-04-1401.15.pdiff [358 B]
Get:16 http://security.debian.org buster/updates/main Sources [187 kB]
Get:17 http://security.debian.org buster/updates/main amd64 Packages [291 kB]
Get:18 http://security.debian.org buster/updates/main Translation-en [151 kB]
Get:19 http://apps3.cumulusnetworks.com/repos/deb CumulusLinux-4/netq-latest amd64 Packages [1,418 B]
Fetched 917 kB in 3s (265 kB/s)
Reading package lists... Done

Use the -E option with sudo whenever you run any apt-get command. This option preserves your environment variables (such as HTTP proxies) before you install new packages or upgrade your distribution.

List Available Packages

After the cache populates, use the apt-cache command to search the cache and find the packages of interest or to get information about an available package.

Here are examples of the search and show sub-commands:

cumulus@switch:~$ apt-cache search tcp
collectd-core - statistics collection and monitoring daemon (core system)
fakeroot - tool for simulating superuser privileges
iperf - Internet Protocol bandwidth measuring tool
iptraf-ng - Next Generation Interactive Colorful IP LAN Monitor
libfakeroot - tool for simulating superuser privileges - shared libraries
libfstrm0 - Frame Streams (fstrm) library
libibverbs1 - Library for direct userspace use of RDMA (InfiniBand/iWARP)
libnginx-mod-stream - Stream module for Nginx
libqt4-network - Qt 4 network module
librtr-dev - Small extensible RPKI-RTR-Client C library - development files
librtr0 - Small extensible RPKI-RTR-Client C library
libwiretap8 - network packet capture library -- shared library
libwrap0 - Wietse Venema's TCP wrappers library
libwrap0-dev - Wietse Venema's TCP wrappers library, development files
netbase - Basic TCP/IP networking system
nmap-common - Architecture independent files for nmap
nuttcp - network performance measurement tool
openssh-client - secure shell (SSH) client, for secure access to remote machines
openssh-server - secure shell (SSH) server, for secure access from remote machines
openssh-sftp-server - secure shell (SSH) sftp server module, for SFTP access from remote machines
python-dpkt - Python 2 packet creation / parsing module for basic TCP/IP protocols
rsyslog - reliable system and kernel logging daemon
socat - multipurpose relay for bidirectional data transfer
tcpdump - command-line network traffic analyzer
cumulus@switch:~$ apt-cache show tcpdump
Package: tcpdump
Version: 4.9.3-1~deb10u1
Installed-Size: 1109
Maintainer: Romain Francoise <rfrancoise@debian.org>
Architecture: amd64
Replaces: apparmor-profiles-extra (<< 1.12~)
Depends: libc6 (>= 2.14), libpcap0.8 (>= 1.5.1), libssl1.1 (>= 1.1.0)
Suggests: apparmor (>= 2.3)
Breaks: apparmor-profiles-extra (<< 1.12~)
Size: 400060
SHA256: 3a63be16f96004bdf8848056f2621fbd863fadc0baf44bdcbc5d75dd98331fd3
SHA1: 2ab9f0d2673f49da466f5164ecec8836350aed42
MD5sum: 603baaf914de63f62a9f8055709257f3
Description: command-line network traffic analyzer
 This program allows you to dump the traffic on a network. tcpdump
 is able to examine IPv4, ICMPv4, IPv6, ICMPv6, UDP, TCP, SNMP, AFS
 BGP, RIP, PIM, DVMRP, IGMP, SMB, OSPF, NFS and many other packet
 types.
 .
 It can be used to print out the headers of packets on a network
 interface, filter packets that match a certain expression. You can
 use this tool to track down network problems, to detect attacks
 or to monitor network activities.
Description-md5: f01841bfda357d116d7ff7b7a47e8782
Homepage: http://www.tcpdump.org/
Multi-Arch: foreign
Section: net
Priority: optional
Filename: pool/upstream/t/tcpdump/tcpdump_4.9.3-1~deb10u1_amd64.deb

The search commands look for the search terms not only in the package name but in other parts of the package information; the search matches on more packages than you expect.

List Packages Installed on the System

The apt-cache command shows information about all the packages available in the repository. To see which packages are actually installed on your system with the version, run the following command.

cumulus@switch:~$ net show package version
Package                            Installed Version(s)
---------------------------------  -----------------------------------------------------------------------
acpi                               1.7-1.1
acpi-support-base                  0.142-8
acpid                              1:2.0.31-1
adduser                            3.118
apt                                1.8.2
arping                             2.19-6
arptables                          0.0.4+snapshot20181021-4
...
cumulus@switch:~$ nv show platform software installed
                                       description                                                                                                                   package                                version
-------------------------------------  ----------------------------------------------------------------------------------------------------------------------------  -------------------------------------  ----------------------------------------------
acpi                                   displays information on ACPI devices                                                                                          acpi                                   1.7-1.1
acpi-support-base                      scripts for handling base ACPI events such as the power button                                                                acpi-support-base                      0.142-8
acpid                                  Advanced Configuration and Power Interface event daemon                                                                       acpid                                  1:2.0.31-1
...
cumulus@switch:~$ dpkg -l
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                Version                   Architecture Description
+++-===================-=========================-============-=================================
ii  acpi                1.7-1.1                   amd64        displays information on ACPI devices
ii  acpi-support-base   0.142-8                   all          scripts for handling base ACPI events such as th
ii  acpid               1:2.0.31-1                amd64        Advanced Configuration and Power Interface event
ii  adduser             3.118                     all          add and remove users and groups
ii  apt                 1.8.2                     amd64        commandline package manager
ii  arping              2.19-6                    amd64        sends IP and/or ARP pings (to the MAC address)
ii  arptables           0.0.4+snapshot20181021-4  amd64        ARP table administration
...

Show the Version of a Package

To show the version of a specific package installed on the system:

The following example command shows which version of the vrf package is on the system:

cumulus@switch:~$ net show package version vrf
1.0-cl4.4.0u0 

The following example command shows which version of the vrf package is on the system:

cumulus@switch:~$ nv show platform software installed vrf
             running              applied  pending  description
-----------  -------------------  -------  -------  -----------
description  Linux tools for VRF                    Description
package      vrf                                    Package
version      1.0-cl4.4.0u0                         Version

The following example command shows which version of the vrf package is on the system:

cumulus@switch:~$ dpkg -l vrf
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name       Version      Architecture Description
+++-==========-============-============-=================================
ii  vrf        1.0-cl4.4.0u0    amd64        Linux tools for VRF

Upgrade Packages

To upgrade all the packages installed on the system to their latest versions, run the following commands:

cumulus@switch:~$ sudo -E apt-get update
cumulus@switch:~$ sudo -E apt-get upgrade

The system lists the packages for upgrade and prompts you to continue.

The above commands upgrade all installed versions with their latest versions but do not install any new packages.

Add New Packages

To add a new package, first ensure the package is not already on the system:

cumulus@switch:~$ dpkg -l | grep <name of package>
cumulus@switch:~$ sudo -E apt-get update
cumulus@switch:~$ sudo -E apt-get install tcpreplay
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
tcpreplay
0 upgraded, 1 newly installed, 0 to remove and 1 not upgraded.
Need to get 436 kB of archives.
After this operation, 1008 kB of additional disk space will be used
...

You can install several packages at the same time:

cumulus@switch:~$ sudo -E apt-get install <package1> <package2> <package3>

In some cases, installing a new package also upgrades additional existing packages due to dependencies. To view these additional packages before you install, run the apt-get install --dry-run command.

Add Packages From Another Repository

As shipped, Cumulus Linux searches the Cumulus Linux repository for available packages. You can add additional repositories to search by adding them to the list of sources that apt-get consults. See man sources.list for more information.

NVIDIA adds features or makes bug fixes to certain packages; do not replace these packages with versions from other repositories.

If you want to install packages that are not in the Cumulus Linux repository, the procedure is the same as above, but with one additional step.

NVIDIA does not test and Cumulus Linux Technical Support does not support packages that are not part of the Cumulus Linux repository.

Installing packages outside of the Cumulus Linux repository requires the use of sudo -E apt-get; however, depending on the package, you can use easy-install and other commands.

To install a new package, complete the following steps:

  1. Run the dpkg command to ensure that the package is not already installed on the system:

    cumulus@switch:~$ dpkg -l | grep <name of package>
    
  2. If the package is already on the system, ensure it is the version you need. If it is an older version, update the package from the Cumulus Linux repository:

    cumulus@switch:~$ sudo -E apt-get update
    cumulus@switch:~$ sudo -E apt-get install <name of package>
    cumulus@switch:~$ sudo -E apt-get upgrade
    
  3. If the package is not on the system, the package source location is not in the /etc/apt/sources.list file. Edit and add the appropriate source to the file. For example, add the following if you want a package from the Debian repository that is not in the Cumulus Linux repository:

    deb http://http.us.debian.org/debian buster main
    deb http://security.debian.org/ buster/updates main
    

    Otherwise, /etc/apt/sources.list lists the repository but comments it out. To uncomment the repository, remove the # at the start of the line, then save the file.

  4. Run sudo -E apt-get update, then install the package and upgrade:

    cumulus@switch:~$ sudo -E apt-get update
    cumulus@switch:~$ sudo -E apt-get install <name of package>
    cumulus@switch:~$ sudo -E apt-get upgrade
    

Add Packages from the Cumulus Linux Local Archive

Cumulus Linux contains a local archive embedded in the Cumulus Linux image. This archive, cumulus-local-apt-archive, contains the packages you need to install ifplugd, LDAP, RADIUS or TACACS+ without a network connection.

The archive contains the following packages:

Add these packages with apt-get update && apt-get install, as described above.

Zero Touch Provisioning - ZTP

Use Zero touch provisioning (ZTP) to deploy network devices in large-scale environments. On first boot, Cumulus Linux invokes ZTP, which executes the provisioning automation that deploys the device for its intended role in the network.

The provisioning framework allows you to execute a one-time, user-provided script. You can develop this script using a variety of automation tools and scripting languages. You can also use it to add the switch to a configuration management (CM) platform such as Puppet, Chef, CFEngine or a custom, proprietary tool.

While developing and testing the provisioning logic, you can use the ztp command in Cumulus Linux to manually invoke your provisioning script on a device.

ZTP in Cumulus Linux can occur automatically in one of the following ways, in this order:

  1. Through a local file
  2. Using a USB drive inserted into the switch (ZTP-USB)
  3. Through DHCP

Use a Local File

ZTP only looks one time for a ZTP script on the local file system when the switch boots. ZTP searches for an install script that matches an ONIE-style waterfall in /var/lib/cumulus/ztp, looking for the most specific name first, and ending at the most generic:

You can also trigger the ZTP process manually by running the ztp --run <URL> command, where the URL is the path to the ZTP script.

Use a USB Drive

NVIDIA tests this feature only with thumb drives, not an external large USB hard drive.

If the ztp process does not discover a local script, it tries one time to locate an inserted but unmounted USB drive. If it discovers one, it begins the ZTP process. Cumulus Linux supports the use of a FAT32, FAT16, or VFAT-formatted USB drive as an installation source for ZTP scripts. You must plug in the USB drive before you power up the switch.

At minimum, the script must:

Follow these steps to perform ZTP using a USB drive:

  1. Copy the installation image to the USB drive.
  2. The ztp process searches the root filesystem of the newly mounted drive for filenames matching an ONIE-style waterfall (see the patterns and examples above), looking for the most specific name first, and ending at the most generic.
  3. ZTP parses the contents of the script to ensure it contains the CUMULUS-AUTOPROVISIONING flag (see example scripts).

The USB drive mounts to a temporary directory under /tmp (for example, /tmp/tmpigGgjf/). To reference files on the USB drive, use the environment variable ZTP_USB_MOUNTPOINT to refer to the USB root partition.

ZTP Over DHCP

If the ztp process does not discover a local ONIE script or applicable USB drive, it checks DHCP every ten seconds for up to five minutes for the presence of a ZTP URL specified in /var/run/ztp.dhcp. The URL can be any of HTTP, HTTPS, FTP, or TFTP.

For ZTP using DHCP, provisioning initially takes place over the management network and initiates through a DHCP hook. A DHCP option specifies a configuration script. The ZTP process requests this script from the Web server and the script executes locally.

The ZTP process over DHCP follows these steps:

  1. The first time you boot Cumulus Linux, eth0 makes a DHCP request.
  2. The DHCP server offers a lease to the switch.
  3. If option 239 is in the response, the ZTP process starts.
  4. The ZTP process requests the contents of the script from the URL, sending additional HTTP headers containing details about the switch.
  5. ZTP parses the contents of the script to ensure it contains the CUMULUS-AUTOPROVISIONING flag (see example scripts).
  6. If provisioning is necessary, the script executes locally on the switch with root privileges.
  7. ZTP examines the return code of the script. If the return code is 0, ZTP marks the provisioning state as complete in the autoprovisioning configuration file.

Trigger ZTP Over DHCP

If you have not yet provisioned the switch, you can trigger the ZTP process over DHCP when eth0 uses DHCP and one of the following events occur:

You can also run the ztp --run <URL> command, where the URL is the path to the ZTP script.

Configure the DHCP Server

During the DHCP process over eth0, Cumulus Linux requests DHCP option 239. This option specifies the custom provisioning script.

For example, the /etc/dhcp/dhcpd.conf file for an ISC DHCP server looks like:

option cumulus-provision-url code 239 = text;

  subnet 192.0.2.0 netmask 255.255.255.0 {
  range 192.0.2.100 192.168.0.200;
  option cumulus-provision-url "http://192.0.2.1/demo.sh";
}

In addition, you can specify the hostname of the switch with the host-name option:

subnet 192.168.0.0 netmask 255.255.255.0 {
  range 192.168.0.100 192.168.0.200;
  option cumulus-provision-url "http://192.0.2.1/demo.sh";
  host dc1-tor-sw1 { hardware ethernet 44:38:39:00:1a:6b; fixed-address 192.168.0.101; option host-name "dc1-tor-sw1"; }
}

Do not use an underscore (_) in the hostname; underscores are not permitted in hostnames.

DHCP on Front Panel Ports

In case eth0 is not operational or you prefer to use a front panel port, you can configure ZTP to bring all the front panel ports that are operational and run DHCP on any active interface. ZTP reassesses the list of active ports on every retry cycle. When it receives the DHCP lease and option 239 is present in the response, ZTP begins executing the script.

To configure ZTP to bring up the front panel ports and run DHCP on any active interface, add the CUMULUS-AUTOPROVISIONING-FRONT-PANEL directive to the local ZTP script.

Inspect HTTP Headers

The following HTTP headers in the request to the web server retrieve the provisioning script:

Header                        Value                 Example
------                        -----                 -------
User-Agent                                          CumulusLinux-AutoProvision/0.4
CUMULUS-ARCH                  CPU architecture      x86_64
CUMULUS-BUILD                                       5.0.0
CUMULUS-MANUFACTURER                                odm
CUMULUS-PRODUCTNAME                                 switch_model
CUMULUS-SERIAL                                      XYZ123004
CUMULUS-BASE-MAC                                    44:38:39:FF:40:94
CUMULUS-MGMT-MAC                                    44:38:39:FF:00:00
CUMULUS-VERSION                                     5.0.0
CUMULUS-PROV-COUNT                                  0
CUMULUS-PROV-MAX                                    32

Write ZTP Scripts

You must include the following line in any of the supported scripts that you expect to run using the autoprovisioning framework.

# CUMULUS-AUTOPROVISIONING

The script must contain the CUMULUS-AUTOPROVISIONING flag. You can include this flag in a comment or remark; you do not need to echo or write the flag to stdout.

You can write the script in any language that Cumulus Linux supports, such as:

The script must return an exit code of 0 upon success, as this marks the process as complete in the autoprovisioning configuration file.

The following script installs Cumulus Linux from a USB drive and applies a configuration:

#!/bin/bash
function error() {
  echo -e "\e[0;33mERROR: The ZTP script failed while running the command $BASH_COMMAND at line $BASH_LINENO.\e[0m" >&2
  exit 1
}

# Log all output from this script
exec >> /var/log/autoprovision 2>&1
date "+%FT%T ztp starting script $0"

trap error ERR

#Add Debian Repositories
echo "deb http://http.us.debian.org/debian buster main" >> /etc/apt/sources.list
echo "deb http://security.debian.org/ buster/updates main" >> /etc/apt/sources.list

#Update Package Cache
apt-get update -y

#Load interface config from usb
cp ${ZTP_USB_MOUNTPOINT}/interfaces /etc/network/interfaces

#Load port config from usb
#   (if breakout cables are used for certain interfaces)
cp ${ZTP_USB_MOUNTPOINT}/ports.conf /etc/cumulus/ports.conf

#Reload interfaces to apply loaded config
ifreload -a

#Output state of interfaces
net show interface

# CUMULUS-AUTOPROVISIONING
exit 0

Several ZTP example scripts are available in the Cumulus GitHub repository.

Continue Provisioning

Typically ZTP exits after executing the script locally and does not continue. To continue with provisioning so that you do not have to intervene manually or embed an Ansible callback into the script, you can add the CUMULUS-AUTOPROVISION-CASCADE directive.

Best Practices

ZTP scripts come in different forms and frequently perform the same tasks. As BASH is the most common language for ZTP scripts, use the following BASH snippets to perform common tasks with robust error checking.

Set the Default Cumulus User Password

The default cumulus user account password is cumulus. When you log into Cumulus Linux for the first time, you must provide a new password for the cumulus account, then log back into the system.

Add the following function to your ZTP script to change the default cumulus user account password to a clear-text password. The example changes the password cumulus to MyP4$$word.

function set_password(){
     # Unexpire the cumulus account
     passwd -x 99999 cumulus
     # Set the password
     echo 'cumulus:MyP4$$word' | chpasswd
}
set_password

If you have an insecure management network, set the password with an encrypted hash instead of a clear-text password.

Test DNS Name Resolution

DNS names are frequently used in ZTP scripts. The ping_until_reachable function tests that each DNS name resolves into a reachable IP address. Call this function with each DNS target used in your script before you use the DNS name elsewhere in your script.

The following example shows how to call the ping_until_reachable function in the context of a larger task.

function ping_until_reachable(){
    last_code=1
    max_tries=30
    tries=0
    while [ "0" != "$last_code" ] && [ "$tries" -lt "$max_tries" ]; do
        tries=$((tries+1))
        echo "$(date) INFO: ( Attempt $tries of $max_tries ) Pinging $1 Target Until Reachable."
        ping $1 -c2 &> /dev/null
        last_code=$?
            sleep 1
    done
    if [ "$tries" -eq "$max_tries" ] && [ "$last_code" -ne "0" ]; then
        echo "$(date) ERROR: Reached maximum number of attempts to ping the target $1 ."
        exit 1
    fi
}

Check the Cumulus Linux Release

The following script segment demonstrates how to check which Cumulus Linux release is running and upgrades the node if the release is not the target release. If the release is the target release, normal ZTP tasks execute. This script calls the ping_until_reachable script (described above) to make sure the server holding the image server and the ZTP script is reachable.

function init_ztp(){
    #do normal ZTP tasks
}

CUMULUS_TARGET_RELEASE=5.0.0
CUMULUS_CURRENT_RELEASE=$(cat /etc/lsb-release  | grep RELEASE | cut -d "=" -f2)
IMAGE_SERVER_HOSTNAME=webserver.example.com
IMAGE_SERVER= "http:// "$IMAGE_SERVER_HOSTNAME "/ "$CUMULUS_TARGET_RELEASE ".bin "
ZTP_URL= "http:// "$IMAGE_SERVER_HOSTNAME "/ztp.sh "

if [ "$CUMULUS_TARGET_RELEASE" != "$CUMULUS_CURRENT_RELEASE" ]; then
    ping_until_reachable $IMAGE_SERVER_HOSTNAME
    /usr/cumulus/bin/onie-install -fa -i $IMAGE_SERVER -z $ZTP_URL && reboot
else
    init_ztp && reboot
fi
exit 0

Apply Management VRF Configuration

If you apply a management VRF in your script, either apply it last or reboot instead. If you do not apply a management VRF last, you need to prepend any commands that require eth0 to communicate out with /usr/bin/ip vrf exec mgmt; for example, /usr/bin/ip vrf exec mgmt apt-get update -y.

Perform Ansible Provisioning Callbacks

After initially configuring a node with ZTP, use Provisioning Callbacks to inform Ansible Tower or AWX that the node is ready for more detailed provisioning. The following example demonstrates how to use a provisioning callback:

/usr/bin/curl -H "Content-Type:application/json" -k -X POST --data '{"host_config_key":"'somekey'"}' -u username:password http://ansible.example.com/api/v2/job_templates/1111/callback/

Disable the DHCP Hostname Override Setting

Make sure to disable the DHCP hostname override setting in your script (NCLU does this automatically).

function set_hostname(){
    # Remove DHCP Setting of Hostname
    sed s/'SETHOSTNAME="yes"'/'SETHOSTNAME="no"'/g -i /etc/dhcp/dhclient-exit-hooks.d/dhcp-sethostname
    hostnamectl set-hostname $1
}

NCLU in ZTP Scripts

You cannot use all NCLU commands with ZTP.

When you use NCLU in ZTP scripts, add the following loop to make sure NCLU has time to start up.

# Waiting for NCLU to finish starting up
last_code=1
while [ "1" == "$last_code" ]; do
    net show interface &> /dev/null
    last_code=$?
done

net add vrf mgmt
net add time zone Etc/UTC
net add time ntp server 192.168.0.254 iburst
net commit

Test ZTP Scripts

Use these commands to test and debug your ZTP scripts.

You can use verbose mode to debug your script and see where your script failed. Include the -v option when you run ZTP:

cumulus@switch:~$ sudo ztp -v -r http://192.0.2.1/demo.sh
Attempting to provision via ZTP Manual from http://192.0.2.1/demo.sh

Broadcast message from root@dell-s6010-01 (ttyS0) (Tue May 10 22:44:17 2016):  

ZTP: Attempting to provision via ZTP Manual from http://192.0.2.1/demo.sh
ZTP Manual: URL response code 200
ZTP Manual: Found Marker CUMULUS-AUTOPROVISIONING
ZTP Manual: Executing http://192.0.2.1/demo.sh
error: ZTP Manual: Payload returned code 1
error: Script returned failure

To see results of the most recent ZTP execution, you can run the ztp -s command.

cumulus@switch:~$ ztp -s
ZTP INFO:

State              enabled
Version            1.0
Result             Script Failure
Date               Mon 20 May 2019 09:31:27 PM UTC
Method             ZTP DHCP
URL                http://192.0.2.1/demo.sh

If ZTP runs when the switch boots and not manually, you can run the systemctl -l status ztp.service then journalctl -l -u ztp.service to see if any failures occur:

cumulus@switch:~$ sudo systemctl -l status ztp.service
● ztp.service - Cumulus Linux ZTP
    Loaded: loaded (/lib/systemd/system/ztp.service; enabled)
    Active: failed (Result: exit-code) since Wed 2016-05-11 16:38:45 UTC; 1min 47s ago
        Docs: man:ztp(8)
    Process: 400 ExecStart=/usr/sbin/ztp -b (code=exited, status=1/FAILURE)
    Main PID: 400 (code=exited, status=1/FAILURE)

May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP USB: Device not found
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: Looking for ZTP Script provided by DHCP
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: Attempting to provision via ZTP DHCP from http://192.0.2.1/demo.sh
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: URL response code 200
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: Found Marker CUMULUS-AUTOPROVISIONING
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: Executing http://192.0.2.1/demo.sh
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: Payload returned code 1
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: Script returned failure
May 11 16:38:45 dell-s6010-01 systemd[1]: ztp.service: main process exited, code=exited, status=1/FAILURE
May 11 16:38:45 dell-s6010-01 systemd[1]: Unit ztp.service entered failed state.
cumulus@switch:~$
cumulus@switch:~$ sudo journalctl -l -u ztp.service --no-pager
-- Logs begin at Wed 2016-05-11 16:37:42 UTC, end at Wed 2016-05-11 16:40:39 UTC. --
May 11 16:37:45 cumulus ztp[400]: ztp [400]: /var/lib/cumulus/ztp: Sate Directory does not exist. Creating it...
May 11 16:37:45 cumulus ztp[400]: ztp [400]: /var/run/ztp.lock: Lock File does not exist. Creating it...
May 11 16:37:45 cumulus ztp[400]: ztp [400]: /var/lib/cumulus/ztp/ztp_state.log: State File does not exist. Creating it...
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Looking for ZTP local Script
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell_s6010_s1220-rUNKNOWN
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell_s6010_s1220
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP USB: Looking for unmounted USB devices
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP USB: Parsing partitions
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP USB: Device not found
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: Looking for ZTP Script provided by DHCP
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: Attempting to provision via ZTP DHCP from http://192.0.2.1/demo.sh
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: URL response code 200
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: Found Marker CUMULUS-AUTOPROVISIONING
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: Executing http://192.0.2.1/demo.sh
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: Payload returned code 1
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: Script returned failure
May 11 16:38:45 dell-s6010-01 systemd[1]: ztp.service: main process exited, code=exited, status=1/FAILURE
May 11 16:38:45 dell-s6010-01 systemd[1]: Unit ztp.service entered failed state.

Instead of running journalctl, you can see the log history by running:

cumulus@switch:~$ cat /var/log/syslog | grep ztp
2016-05-11T16:37:45.132583+00:00 cumulus ztp [400]: /var/lib/cumulus/ztp: State Directory does not exist. Creating it...
2016-05-11T16:37:45.134081+00:00 cumulus ztp [400]: /var/run/ztp.lock: Lock File does not exist. Creating it...
2016-05-11T16:37:45.135360+00:00 cumulus ztp [400]: /var/lib/cumulus/ztp/ztp_state.log: State File does not exist. Creating it...
2016-05-11T16:37:45.185598+00:00 cumulus ztp [400]: ZTP LOCAL: Looking for ZTP local Script
2016-05-11T16:37:45.485084+00:00 cumulus ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell_s6010_s1220-rUNKNOWN
2016-05-11T16:37:45.486394+00:00 cumulus ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell_s6010_s1220
2016-05-11T16:37:45.488385+00:00 cumulus ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell
2016-05-11T16:37:45.489665+00:00 cumulus ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64
2016-05-11T16:37:45.490854+00:00 cumulus ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp
2016-05-11T16:37:45.492296+00:00 cumulus ztp [400]: ZTP USB: Looking for unmounted USB devices
2016-05-11T16:37:45.493525+00:00 cumulus ztp [400]: ZTP USB: Parsing partitions
2016-05-11T16:37:45.636422+00:00 cumulus ztp [400]: ZTP USB: Device not found
2016-05-11T16:38:43.372857+00:00 cumulus ztp [1805]: Found ZTP DHCP Request
2016-05-11T16:38:45.696562+00:00 cumulus ztp [400]: ZTP DHCP: Looking for ZTP Script provided by DHCP
2016-05-11T16:38:45.698598+00:00 cumulus ztp [400]: Attempting to provision via ZTP DHCP from http://192.0.2.1/demo.sh
2016-05-11T16:38:45.816275+00:00 cumulus ztp [400]: ZTP DHCP: URL response code 200
2016-05-11T16:38:45.817446+00:00 cumulus ztp [400]: ZTP DHCP: Found Marker CUMULUS-AUTOPROVISIONING
2016-05-11T16:38:45.818402+00:00 cumulus ztp [400]: ZTP DHCP: Executing http://192.0.2.1/demo.sh
2016-05-11T16:38:45.834240+00:00 cumulus ztp [400]: ZTP DHCP: Payload returned code 1
2016-05-11T16:38:45.835488+00:00 cumulus ztp [400]: Script returned failure
2016-05-11T16:38:45.876334+00:00 cumulus systemd[1]: ztp.service: main process exited, code=exited, status=1/FAILURE
2016-05-11T16:38:45.879410+00:00 cumulus systemd[1]: Unit ztp.service entered failed state.

If you see that the issue is a script failure, you can modify the script and then run ZTP manually using ztp -v -r <URL/path to that script>, as above.

cumulus@switch:~$ sudo ztp -v -r http://192.0.2.1/demo.sh
Attempting to provision via ZTP Manual from http://192.0.2.1/demo.sh

Broadcast message from root@dell-s6010-01 (ttyS0) (Tue May 10 22:44:17 2019):  

ZTP: Attempting to provision via ZTP Manual from http://192.0.2.1/demo.sh
ZTP Manual: URL response code 200
ZTP Manual: Found Marker CUMULUS-AUTOPROVISIONING
ZTP Manual: Executing http://192.0.2.1/demo.sh
error: ZTP Manual: Payload returned code 1
error: Script returned failure
cumulus@switch:~$ sudo ztp -s
State         enabled
Version       1.0
Result        Script Failure
Date          Mon 20 May 2019 09:31:27 PM UTC
Method        ZTP Manual
URL           http://192.0.2.1/demo.sh

Use the following command to check syslog for information about ZTP:

cumulus@switch:~$ sudo grep -i ztp /var/log/syslog

Common ZTP Script Errors

Could not find referenced script/interpreter in downloaded payload

cumulus@leaf01:~$ sudo cat /var/log/syslog | grep ztp
2018-04-24T15:06:08.887041+00:00 leaf01 ztp [13404]: Attempting to provision via ZTP Manual from http://192.168.0.254/ztp_oob_windows.sh
2018-04-24T15:06:09.106633+00:00 leaf01 ztp [13404]: ZTP Manual: URL response code 200
2018-04-24T15:06:09.107327+00:00 leaf01 ztp [13404]: ZTP Manual: Found Marker CUMULUS-AUTOPROVISIONING
2018-04-24T15:06:09.107635+00:00 leaf01 ztp [13404]: ZTP Manual: Executing http://192.168.0.254/ztp_oob_windows.sh
2018-04-24T15:06:09.132651+00:00 leaf01 ztp [13404]: ZTP Manual: Could not find referenced script/interpreter in downloaded payload.
2018-04-24T15:06:14.135521+00:00 leaf01 ztp [13404]: ZTP Manual: Retrying
2018-04-24T15:06:14.138915+00:00 leaf01 ztp [13404]: ZTP Manual: URL response code 200
2018-04-24T15:06:14.139162+00:00 leaf01 ztp [13404]: ZTP Manual: Found Marker CUMULUS-AUTOPROVISIONING
2018-04-24T15:06:14.139448+00:00 leaf01 ztp [13404]: ZTP Manual: Executing http://192.168.0.254/ztp_oob_windows.sh
2018-04-24T15:06:14.143261+00:00 leaf01 ztp [13404]: ZTP Manual: Could not find referenced script/interpreter in downloaded payload.
2018-04-24T15:06:24.147580+00:00 leaf01 ztp [13404]: ZTP Manual: Retrying
2018-04-24T15:06:24.150945+00:00 leaf01 ztp [13404]: ZTP Manual: URL response code 200
2018-04-24T15:06:24.151177+00:00 leaf01 ztp [13404]: ZTP Manual: Found Marker CUMULUS-AUTOPROVISIONING
2018-04-24T15:06:24.151374+00:00 leaf01 ztp [13404]: ZTP Manual: Executing http://192.168.0.254/ztp_oob_windows.sh
2018-04-24T15:06:24.155026+00:00 leaf01 ztp [13404]: ZTP Manual: Could not find referenced script/interpreter in downloaded payload.
2018-04-24T15:06:39.164957+00:00 leaf01 ztp [13404]: ZTP Manual: Retrying
2018-04-24T15:06:39.165425+00:00 leaf01 ztp [13404]: Script returned failure
2018-04-24T15:06:39.175959+00:00 leaf01 ztp [13404]: ZTP script failed. Exiting...

Errors in syslog for ZTP like those shown above often occur if you create or edit the script on a Windows machine. Check to make sure that the \r\n characters are not present in the end-of-line encodings.

Use the cat -v ztp.sh command to view the contents of the script and search for any hidden characters.

root@oob-mgmt-server:/var/www/html# cat -v ./ztp_oob_windows.sh 
#!/bin/bash^M
^M
###################^M
#   ZTP Script^M
###################^M
^M
/usr/cumulus/bin/cl-license -i http://192.168.0.254/license.txt^M
^M
# Clean method of performing a Reboot^M
nohup bash -c 'sleep 2; shutdown now -r "Rebooting to Complete ZTP"' &^M
^M
exit 0^M
^M
# The line below is required to be a valid ZTP script^M
#CUMULUS-AUTOPROVISIONING^M
root@oob-mgmt-server:/var/www/html#

The ^M characters in the output of your ZTP script, as shown above, indicate the presence of Windows end-of-line encodings that you need to remove.

Use the translate (tr) command on any Linux system to remove the '\r' characters from the file.

root@oob-mgmt-server:/var/www/html# tr -d '\r' < ztp_oob_windows.sh > ztp_oob_unix.sh
root@oob-mgmt-server:/var/www/html# cat -v ./ztp_oob_unix.sh 
#!/bin/bash
###################
#   ZTP Script
###################
/usr/cumulus/bin/cl-license -i http://192.168.0.254/license.txt
# Clean method of performing a Reboot
nohup bash -c 'sleep 2; shutdown now -r "Rebooting to Complete ZTP"' &
exit 0
# The line below is required to be a valid ZTP script
#CUMULUS-AUTOPROVISIONING
root@oob-mgmt-server:/var/www/html#

Manually Use the ztp Command

To enable ZTP, use the -e option:

cumulus@switch:~$ sudo ztp -e

When you enable ZTP, it tries to run the next time the switch boots. However, if ZTP already ran on a previous boot up or if there is a manual configuration, ZTP exits without trying to look for a script.

ZTP checks for these manual configurations when the switch boots:

  • Password changes
  • Users and groups changes
  • Packages changes
  • Interfaces changes

When the switch boots for the first time, ZTP records the state of important files that can update after you configure the switch. After a reboot, ZTP compares the recorded state to the current state of these files. If they do not match, ZTP considers the switch as already provisioned and exits. ZTP only deletes these files after a reset.

To reset ZTP to its original state, use the -R option. This removes the ztp directory and ZTP runs the next time the switch reboots.

cumulus@switch:~$ sudo ztp -R

To disable ZTP, use the -d option:

cumulus@switch:~$ sudo ztp -d

To force provisioning to occur and ignore the status listed in the configuration file, use the -r option:

cumulus@switch:~$ sudo ztp -r cumulus-ztp.sh

To see the current ZTP state, use the -s option:

cumulus@switch:~$ sudo ztp -s
ZTP INFO:
State          disabled
Version        1.0
Result         success
Date           Mon May 20 21:51:04 2019 UTC
Method         Switch manually configured  
URL            None

Considerations

System Configuration

This section describes how to configure your Cumulus Linux switch. You can set the date and time, configure authentication, authorization, and accounting and control the traffic entering your network with access control lists (ACLs).

This section also describes the services and daemons that Cumulus Linux uses, and describes how to configure switchd, the daemon at the heart of Cumulus Linux.

Network Command Line Utility - NCLU

The Network Command Line Utility (NCLU) is a command line interface that simplifies the networking configuration process for all users.

NCLU resides in the Linux user space and provides consistent access to networking commands directly through bash, making configuration and troubleshooting simple and easy; no need to edit files or enter modes and sub-modes. NCLU provides these benefits:

The NCLU wrapper utility called net is capable of configuring layer 2 and layer 3 features of the networking stack, installing ACLs and VXLANs, restoring configuration files, as well as providing monitoring and troubleshooting functionality for these features. You can configure both the /etc/network/interfaces and /etc/frr/frr.conf files with net, in addition to running show and clear commands related to ifupdown2 and FRRouting.

NCLU Basics

Use the following workflow to stage and commit changes to Cumulus Linux with NCLU:

  1. Use the net add and net del commands to stage and remove configuration changes.
  2. Use the net pending command to review staged changes.
  3. Use net commit and net abort to commit and delete staged changes.

net commit applies the changes to the relevant configuration files, such as /etc/network/interfaces, then runs necessary follow on commands to enable the configuration, such as ifreload -a.

If two different users try to commit a change at the same time, NCLU displays a warning but implements the change according to the first commit received. The second user needs to abort the commit.

When you have a running configuration, you can review and update the configuration with the following commands:

Tab Completion, Verification, and Inline Help

In addition to tab completion and partial keyword command identification, NCLU includes verification checks to ensure you use the correct syntax. The examples below show the output for incorrect commands:

cumulus@switch:~$ net add bgp router-id 1.1.1.1/32
ERROR: Command not found

Did you mean one of the following?
    net add bgp router-id <ipv4>
        This command is looking for an IP address, not an IP/prefixlen

cumulus@switch:~$ net add bgp router-id 1.1.1.1
cumulus@switch:~$ net add int swp10 mtu <TAB>
    <552-9216> :
cumulus@switch:~$ net add int swp10 mtu 9300
ERROR: Command not found

Did you mean one of the following?
    net add interface <interface> mtu <552-9216>

NCLU has a comprehensive built in help system. In addition to the net man page, you can use ?and help to display available commands:

cumulus@switch:~$ net help

Usage:
    # net <COMMAND> [<ARGS>] [help]
    #
    # net is a command line utility for networking on Cumulus Linux switches.
    #
    # COMMANDS are listed below and have context specific arguments which can
    # be explored by typing "<TAB>" or "help" anytime while using net.
    #
    # Use 'man net' for a more comprehensive overview.

    net abort
    net commit [verbose] [confirm [<number-seconds>]] [description <wildcard>]
    net commit permanent <wildcard>
    net del all
    net help [verbose]
    net pending [json]
    net rollback (<number>|last)
    net rollback description <wildcard-snapshot>
    net show commit (history|<number>|last)
    net show rollback (<number>|last)
    net show rollback description <wildcard-snapshot>
    net show configuration [commands|files|acl|bgp|multicast|ospf|ospf6]
    net show configuration interface [<interface>] [json]

Options:

    # Help commands
    help     : context sensitive information; see section below
    example  : detailed examples of common workflows

    # Configuration commands
    add      : add/modify configuration
    del      : remove configuration


    # Commit buffer commands
    abort    : abandon changes in the commit buffer
    commit   : apply the commit buffer to the system
    pending  : show changes staged in the commit buffer
    rollback : revert to a previous configuration state

    # Status commands
    show     : show command output
    clear    : clear counters, BGP neighbors, etc

cumulus@switch:~$ net help bestpath
The following commands contain keyword(s) 'bestpath'

    net (add|del) bgp bestpath as-path multipath-relax [as-set|no-as-set]
    net (add|del) bgp bestpath compare-routerid
    net (add|del) bgp bestpath med missing-as-worst
    net (add|del) bgp ipv4 labeled-unicast neighbor <bgppeer> addpath-tx-bestpath-per-AS
    net (add|del) bgp ipv4 unicast neighbor <bgppeer> addpath-tx-bestpath-per-AS
    net (add|del) bgp ipv6 labeled-unicast neighbor <bgppeer> addpath-tx-bestpath-per-AS
    net (add|del) bgp ipv6 unicast neighbor <bgppeer> addpath-tx-bestpath-per-AS
    net (add|del) bgp neighbor <bgppeer> addpath-tx-bestpath-per-AS
    net (add|del) bgp vrf <text> bestpath as-path multipath-relax [as-set|no-as-set]
    net (add|del) bgp vrf <text> bestpath compare-routerid
    net (add|del) bgp vrf <text> bestpath med missing-as-worst
    net (add|del) bgp vrf <text> ipv4 labeled-unicast neighbor <bgppeer> addpath-tx-bestpath-per-AS
    net (add|del) bgp vrf <text> ipv4 unicast neighbor <bgppeer> addpath-tx-bestpath-per-AS
    net (add|del) bgp vrf <text> ipv6 labeled-unicast neighbor <bgppeer> addpath-tx-bestpath-per-AS
    net (add|del) bgp vrf <text> ipv6 unicast neighbor <bgppeer> addpath-tx-bestpath-per-AS
    net (add|del) bgp vrf <text> neighbor <bgppeer> addpath-tx-bestpath-per-AS
    net add bgp debug bestpath <ip/prefixlen>
    net del bgp debug bestpath [<ip/prefixlen>]
    net show bgp (<ipv4>|<ipv4/prefixlen>|<ipv6>|<ipv6/prefixlen>) [bestpath|multipath] [json]
    net show bgp vrf <text> (<ipv4>|<ipv4/prefixlen>|<ipv6>|<ipv6/prefixlen>) [bestpath|multipath] [json]

You can configure multiple interfaces at the same time:

cumulus@switch:~$ net add int swp7-9,12,15-17,22 mtu 9216

Search for Specific Commands

To search for specific NCLU commands so that you can identify the correct syntax to use, run the net help verbose | <term> command. For example, to show only commands that include clag (for MLAG):

cumulus@leaf01:mgmt:~$ net help verbose | grep clag
    net example clag basic-clag
    net example clag l2-with-server-vlan-trunks
    net example clag l3-uplinks-virtual-address
    net add clag peer sys-mac <mac-clag> interface <interface> (primary|secondary) [backup-ip <ipv4>]
    net add clag peer sys-mac <mac-clag> interface <interface> (primary|secondary) [backup-ip <ipv4> vrf <text>]
    net del clag peer
    net add clag port bond <interface> interface <interface> clag-id <0-65535>
    net del clag port bond <interface>
    net show clag [our-macs|our-multicast-entries|our-multicast-route|our-multicast-router-ports|peer-macs|peer-multicast-entries|peer-multicast-route|peer-multicast-router-ports|params|backup-ip|id] [verbose] [json]
    net show clag macs [<mac>] [json]
    net show clag neighbors [verbose]
    net show clag peer-lacp-rate
    net show clag verify-vlans [verbose]
    net show clag status [verbose] [json]
    net add bond <interface> clag id <0-65535>
    net add interface <interface> clag args <wildcard>
    net add interface <interface> clag backup-ip (<ipv4>|<ipv4> vrf <text>)
    net add interface <interface> clag enable (yes|no)
    net add interface <interface> clag peer-ip (<ipv4>|<ipv6>|linklocal)
    net add interface <interface> clag priority <0-65535>
    net add interface <interface> clag sys-mac <mac>
    net add loopback lo clag vxlan-anycast-ip <ipv4>
    net del bond <interface> clag id [<0-65535>]
    net del interface <interface> clag args [<wildcard>]
    ...

Add ? (Question Mark) Ability to NCLU

Cumulus Linux enables tab completion by default. You can also configure NCLU to use the ? (question mark character) to look at available commands. To enable this feature for the cumulus user, open the following file:

cumulus@switch:~$ sudo nano ~/.inputrc

Uncomment the very last line in the .inputrc file so that the file changes from this:

# Uncomment to use ? as an alternative to
# ?: complete

to this:

# Uncomment to use ? as an alternative to
?: complete

Save the file and reconnect to the switch. The ? (question mark) ability does not work on all subsequent sessions on the switch.

cumulus@switch:~$ net
    abort     :  abandon changes in the commit buffer
    add       :  add/modify configuration
    clear     :  clear counters, BGP neighbors, etc
    commit    :  apply the commit buffer to the system
    del       :  remove configuration
    example   :  detailed examples of common workflows
    help      :  Show this screen and exit
    pending   :  show changes staged in the commit buffer
    rollback  :  revert to a previous configuration state
    show      :  show command output

When you type the question mark, NCLU shows all available options, but the question mark does not actually appear on the terminal.

Built-in Examples

NCLU has built in examples to guide you through basic configuration setup:

cumulus@switch:~$ net example
    acl              :  access-list
    bgp              :  Border Gateway Protocol
    bond             :  bond, port-channel, etc
    bridge           :  a layer2 bridge
    clag             :  Multi-Chassis Link Aggregation
    dhcp             :  Dynamic Host Configuration Protocol
    dot1x            :  Configure, Enable, Delete or Show IEEE 802.1X EAPOL
    evpn             :  Ethernet VPN
    link-settings    :  Physical link parameters
    management-vrf   :  Management VRF
    mlag             :  Multi-Chassis Link Aggregation
    ospf             :  Open Shortest Path First (OSPFv2)
    snmp-server      :  Configure the SNMP server
    syslog           :  Set syslog logging
    vlan-interfaces  :  IP interfaces for VLANs
    voice-vlan       :  VLAN used for IP Phones
    vrr              :  add help text
cumulus@switch:~$ net example bridge

Scenario
========
We are configuring switch1 and would like to configure the following
- configure switch1 as an L2 switch for host-11 and host-12
- enable vlans 10-20
- place host-11 in vlan 10
- place host-12 in vlan 20
- create an SVI interface for vlan 10
- create an SVI interface for vlan 20
- assign IP 10.0.0.1/24 to the SVI for vlan 10
- assign IP 20.0.0.1/24 to the SVI for vlan 20
- configure swp3 as a trunk for vlans 10, 11, 12 and 20
                  swp3
         *switch1 --------- switch2
            /\
      swp1 /  \ swp2
          /    \
         /      \
     host-11   host-12

switch1 net commands
====================
- enable vlans 10-20
switch1# net add vlan 10-20
- place host-11 in vlan 10
- place host-12 in vlan 20
switch1# net add int swp1 bridge access 10
switch1# net add int swp2 bridge access 20
- create an SVI interface for vlan 10
- create an SVI interface for vlan 20
- assign IP 10.0.0.1/24 to the SVI for vlan 10
- assign IP 20.0.0.1/24 to the SVI for vlan 20
switch1# net add vlan 10 ip address 10.0.0.1/24
switch1# net add vlan 20 ip address 20.0.0.1/24
- configure swp3 as a trunk for vlans 10, 11, 12 and 20
switch1# net add int swp3 bridge trunk vlans 10-12,20
switch1# net pending
switch1# net commit

Verification
============
switch1# net show interface
switch1# net show bridge macs

Configure User Accounts

You can configure user accounts in Cumulus Linux with read-only or edit permissions for NCLU:

The examples below add a new user account and modify an existing user account called myuser.

To add a new user account with NCLU show permissions:

cumulus@switch:~$ sudo adduser --ingroup netshow myuser
Adding user `myuser' ...
Adding new user `myuser' (1001) with group `netshow'...
...

To add NCLU show permissions to a user account that already exists:

cumulus@switch:~$ sudo addgroup myuser netshow
Adding user `myuser' to group `netshow' ...
Adding user myuser to group netshow
Done

To add a new user account with NCLU edit permissions:

cumulus@switch:~$ sudo adduser --ingroup netedit myuser
Adding user `myuser' ...
Adding new user `myuser' (1001) with group `netedit'
...

To add NCLU edit permissions to a user account that already exists:

cumulus@switch:~$ sudo addgroup myuser netedit
Adding user `myuser' to group `netedit' ...
Adding user myuser to group netedit
Done

You can use the adduser command for local user accounts only. You can use the addgroup command for both local and remote user accounts. For a remote user account, you must use the mapping username, such as tacacs3 or radius_user, not the TACACS or RADIUS account name.

If the user tries to run commands that are not allowed, the following error displays:

myuser@switch:~$ net add hostname host01
ERROR: User username does not have permission to make networking changes.

Edit the netd.conf File

Instead of using the NCLU commands described above, you can manually configure users and groups to be able to run NCLU commands.

Edit the /etc/netd.conf file to add users to the users_with_edit and users_with_show lines in the file, then save the file.

For example, if you want the user netoperator to be able to run both edit and show commands, add the user to the users_with_edit and users_with_show lines in the /etc/netd.conf file:

cumulus@switch:~$ sudo nano /etc/netd.conf

# Control which users/groups are allowed to run 'add', 'del',
# 'clear', 'net abort', 'net commit' and restart services
# to apply those changes
users_with_edit = root, cumulus, netoperator
groups_with_edit = netedit

# Control which users/groups are allowed to run 'show' commands
users_with_show = root, cumulus, netoperator
groups_with_show = netshow, netedit

To configure a new user group to use NCLU, add that group to the groups_with_edit and groups_with_show lines in the file.

Use caution giving edit permissions to groups. For example, do not give edit permissions to the tacacs group.

Restart the netd Service

Whenever you modify netd.conf or when NSS services change, you must restart the netd service for the changes to take effect:

cumulus@switch:~$ sudo systemctl restart netd.service

Back Up the Configuration to a Single File

You can back up your NCLU configuration to a file by outputting the results of net show configuration commands to a file, then retrieving the contents of the file using the source command. You can then view the configuration at any time or copy it to other switches and use the source command to apply that configuration to those switches.

For example, to copy the configuration of a leaf switch called leaf01, run the following command:

cumulus@leaf01:~$ net show configuration commands >> leaf01.txt

With the commands all stored in a single file, you can now copy this file to another ToR switch in your network called leaf01 and apply the configuration by running:

cumulus@leaf01:~$ source leaf01.txt

Advanced Configuration

NCLU needs no initial configuration; however, if you need to modify certain configuration, you must manually update the /etc/netd.conf file. You can configure this file to allow different permission levels for users to edit configurations and run show commands. The file also contains a blacklist that hides less frequently used terms from the tabbed autocomplete.

After you edit the netd.conf file, restart the netd service for the changes to take effect.

cumulus@switch:~$ sudo nano /etc/netd.conf
cumulus@switch:~$ sudo systemctl restart netd.service
Configuration Variable Default Setting Description
show_linux_commandFalseWhen true, displays the Linux command running in the background.
color_diffsTrueWhen true, the diffs shown in net pending and net commit use colors.
enable_<component>TrueWhen true, enables you to configure the component with NCLU. For example, when enable_frr is true, you can use NCLU to configure FRR.
users_with_editroot, cumulusSets the Linux users with root edit privileges.
groups_with_editroot, cumulusSets the Linux groups with root edit privileges.
users_with_showroot, cumulusControls which users are allowed to run show commands.
groups_with_showroot, cumulusControls which groups are allowed to run show commands.
ifupdown_blacklistaddress-purge, bond-ad-actor-sys-prio, bond-ad-actor-system, bond-num-grat-arp,bond-num-unsol-na, bond-use-carrier, bond-xmit-hash-policy, bridge-bridgeprio, bridge-fd, bridge-hashel, bridge-hashmax, bridge-hello, bridge-igmp-querier-src, bridge-maxage, bridge-maxwait, bridge-mclmc, bridge-mclmi bridge-mcmi, bridge-mcqi, bridge-mcqpi, bridge-mcqri, bridge-mcrouter, bridge-mcsqc, bridge-mcsqi, bridge-pathcosts, bridge-port-pvids, bridge-port-vids, bridge-portprios, bridge-waitport, broadcast, link-type, mstpctl-ageing, mstpctl-fdelay, mstpctl-forcevers, mstpctl-hello, mstpctl-maxage, mstpctl-maxhops, mstpctl-portp2p, mstpctl-portpathcost, mstpctl-portrestrtcn, mstpctl-treeportcost, mstpctl-treeportprio, mstpctl-txholdcount, netmask, preferred-lifetime, scope, vxlan-ageing, vxlan-learning, vxlan-port, up, down, bridge-gcint, bridge-mcqifaddr, bridge-mcqv4srcHides corner case command options from tab complete, to simplify and streamline output.

net provides an environment variable to set the location of the net output. To only use stdout, set the NCLU_TAB_STDOUT environment variable to true. The value is not case sensitive.

Considerations

Unsupported Interface Names

NCLU does not support interfaces named dev.

Bonds With No Configured Members

If you configure a bond interface that contains no members, NCLU reports that the interface does not exist.

Large NCLU Inputs

The system must parse each NCLU command. Large inputs, such as a large paste of NCLU commands can take some time, sometimes minutes, to process.

NVIDIA User Experience - NVUE

NVUE is an object-oriented, schema driven model of a complete Cumulus Linux system (hardware and software) providing a robust API that allows for multiple interfaces to both view (show) and configure (set and unset) any element within a system running the NVUE software. The NVUE command line interface (CLI) and the REST API leverage the same API to interface with Cumulus Linux.

NVUE follows a declarative model, removing context-specific commands and settings. It is structured as a big tree that represents the entire state of a Cumulus Linux instance. At the base of the tree are high level branches representing objects, such as router and interface. Under each of these branches are further branches. As you navigate through the tree, you gain a more specific context. At the leaves of the tree are actual attributes, represented as key-value pairs. The path through the tree is similar to a filesystem path.

Start the NVUE Service

Cumulus Linux installs NVUE by default but disables the NVUE service. To run NVUE commands, you must enable and start the NVUE service (nvued):

cumulus@switch:~$ sudo systemctl enable nvued
cumulus@switch:~$ sudo systemctl start nvued

Do not mix NVUE and NCLU commands to configure the switch; use either the NCLU CLI or the NVUE CLI.

NVUE REST API

To access the NVUE API, run these commands on the switch:

cumulus@switch:~$ sudo ln -s /etc/nginx/sites-{available,enabled}/nvue.conf
cumulus@switch:~$ sudo sed -i 's/listen localhost:8765 ssl;/listen \[::\]:8765 ipv6only=off ssl;/g' /etc/nginx/sites-available/nvue.conf
cumulus@switch:~$ sudo systemctl restart nginx

You can run the cURL commands from the command line. For example:

cumulus@switch:~$ curl  -u 'cumulus:CumulusLinux!' --insecure https://127.0.0.1:8765/cue_v1/interface
{
  "eth0": {
    "ip": {
      "address": {
        "192.168.200.12/24": {}
      }
    },
    "link": {
      "mtu": 1500,
      "state": {
        "up": {}
      },
      "stats": {
        "carrier-transitions": 2,
        "in-bytes": 184151,
        "in-drops": 0,
        "in-errors": 0,
        "in-pkts": 2371,
        "out-bytes": 117506,
        "out-drops": 0,
        "out-errors": 0,
        "out-pkts": 762
      }
...

For information about using the NVUE API, refer to the NVUE API documentation.

NVUE CLI

The NVUE CLI has a flat structure as opposed to a modal structure. This means that you can run all commands from the primary prompt instead of only in a specific mode.

The NVUE commands and outputs in this documentation are subject to change.

Command Syntax

NVUE commands all begin with nv and fall into one of three syntax categories:

Command Completion

As you enter commands, you can get help with the valid keywords or options using the Tab key. For example, using Tab completion with nv set displays the possible options for the command, and returns you to the command prompt to complete the command.

cumulus@switch:~$ nv set <<press Tab>>
acl        evpn       mlag       platform   router     system     
bridge     interface  nve        qos        service    vrf 

cumulus@switch:~$ nv set

Command Help

As you enter commands, you can get help with command syntax by entering -h or --help at various points within a command entry. For example, to find out what options are available for nv set interface, enter nv set interface -h or nv set interface --help.

cumulus@switch:~$ nv set interface -h
Usage:
  nv set interface [options] <interface-id> ...

Description:
  Interfaces

Identifiers:
  <interface-id>    Interface

General Options:
  -h, --help        Show help.

Command List

You can list all the NVUE commands by running nv list-commands. See List All NVUE Commands below.

Command History

At the command prompt, press the Up Arrow and Down Arrow keys to move back and forth through the list of commands you entered. When you find a given command, you can run the command by pressing Enter. Optionally, you can modify the command before you run it.

Command Categories

The NVUE CLI has a flat structure; however, the commands are in three functional categories:

Configuration Commands

The NVUE configuration commands modify switch configuration. You can set and unset configuration options.

The nv set and nv unset commands are in the following categories. Each command group includes subcommands. Use command completion (Tab key) to list the subcommands.

Command Group
Description
nv set acl
nv unset acl
Configures ACLs in Cumulus Linux.
nv set bridge
nv unset bridge
Configures a bridge domain. This is where you configure the bridge type (such as VLAN-aware), 802.1Q encapsulation, the STP state and priority, and the VLANs in the bridge domain.
nv set evpn
nv unset evpn
Configures EVPN. This is where you enable and disable the EVPN control plane, and set EVPN route advertise options, default gateway configuration for centralized routing, multihoming, and duplicate address detection options.
nv set interface <interface-id>
nv unset interface <interface-id>
Configures the switch interfaces. Use this command to configure bond interfaces, bridge interfaces, interface IP addresses, VLAN IDs, and links (MTU, FEC, speed, duplex, and so on).
nv set mlag
nv unset mlag
Configures MLAG. This is where you configure the backup IP address or interface, MLAG system MAC address, peer IP address, MLAG priority, and the delay before bonds come up.
nv set nve
nv unset nve
Configures network virtualization (VXLAN) settings. This is where you configure the UDP port for VXLAN frames, control dynamic MAC learning over VXLAN tunnels, enable and disable ARP/ND suppression, and configure how Cumulus Linux handles BUM traffic in the overlay.
nv set platform
nv unset platform
Configures the hostname of the switch and sets how configuration apply operations work (which files to ignore and which files to overwrite).
nv set qos
nv unset qos
Configures QoS RoCE.
nv set router
nv unset router
Configures router policies (prefix list rules and route maps), global BGP options (enable and disable BGP, set the ASN and the router ID, and configure BGP graceful restart and shutdown), global OSPF options (enable and disable OSPF, set the router ID, and configure OSPF timers), and PBR.
nv set service
nv unset service
Configures DHCP relays and servers, NTP, PTP, LLDP, and syslog.
nv set system
nv unset system
Configures global system settings, such as the anycast ID, the system MAC address, and the anycast MAC address. This is also where you configure SPAN and ERSPAN sessions.
nv set vrf <vrf-id>
nv unset vrf <vrf-id>
Configures VRFs. This is where you configure VRF-level router configuration including PTP, BGP, OSPF, and EVPN.

Monitoring Commands

The NVUE monitoring commands show various parts of the network configuration. For example, you can show the complete network configuration or only interface configuration. The monitoring commands are in the following categories. Each command group includes subcommands. Use command completion (Tab key) to list the subcommands.

Command Group
Description
nv show aclShows ACL configuration.
nv show bridgeShows bridge domain configuration.
nv show evpnShows EVPN configuration.
nv show interfaceShows interface configuration.
nv show mlagShows MLAG configuration.
nv show nveShows network virtualization configuration, such as VXLAN-specfic MLAG configuration and VXLAN flooding.
nv show platformShows platform configuration, such as hardware and software components, and the hostname of the switch.
nv show qosShows QoS RoCE configuration.
nv show routerShows router configuration, such as router policies, and global BGP and OSPF configuration.
nv show serviceShows DHCP relays and server, NTP, PTP, LLDP, and syslog configuration.
nv show systemShows global system settings, such as the reserved routing table range for PBR and the reserved VLAN range for layer 3 VNIs. You can also see switch reboot history with the reason why the switch rebooted.
nv show vrfShows VRF configuration.

The following example shows the nv show router commands after pressing the TAB key, then shows the output of the nv show router bgp command.

cumulus@leaf01:mgmt:~$ nv show router <<TAB>>
bgp     ospf    pbr     policy
cumulus@leaf01:mgmt:~$ nv show router bgp
                                operational  applied  pending      description
------------------------------  -----------  -------  -----------  ----------------------------------------------------------------------
enable                                       off      on           Turn the feature 'on' or 'off'.  The default is 'off'.
autonomous-system                                     none         ASN for all VRFs, if a single AS is in use.  If "none", then ASN mu...
graceful-shutdown                                     off          Graceful shutdown enable will initiate the GSHUT community to be an...
policy-update-timer                                   5            Wait time in seconds before processing updates to policies to ensur...
router-id                                             none         BGP router-id for all VRFs, if a common one is used.  If "none", th...
wait-for-install                                      off          bgp waits for routes to be installed into kernel/asic before advert...
convergence-wait
  establish-wait-time                                 0            Maximum time to wait to establish BGP sessions. Any peers which do...
  time                                                0            Time to wait for peers to send end-of-RIB before router performs pa...
graceful-restart
  mode                                                helper-only  Role of router during graceful restart. helper-only, router is in h...
  path-selection-deferral-time                        360          Used by the restarter as an upper-bounds for waiting for peering es...
  restart-time                                        120          Amount of time taken to restart by router. It is advertised to the...
  stale-routes-time                                   360          Specifies an upper-bounds on how long we retain routes from a resta...
cumulus@leaf01:mgmt:~$ 

If there are no pending or applied configuration changes, the nv show command only shows the running configuration.

Additional options are available for the nv show commands. For example, you can choose the configuration you want to show (pending, applied, startup, or operational). You can also turn on colored output, and paginate specific output.

Option
Description
--appliedShows configuration applied with the nv config apply command. For example, nv show --applied interface bond1.
--colorTurns colored output on or off. For example, nv show --color on interface bond1
--helpShows help for the NVUE commands.
--operationalShows the running configuration (the actual system state). For example, nv show --operational interface bond1 shows the running configuration for bond1. The running and applied configuration should be the same. If different, inspect the logs.
--outputShows command output in table format (auto), json format or yaml format. For example:
nv show --ouptut auto interface bond1
nv show --ouptut json interface bond1
nv show --ouptut yaml interface bond1
--paginatePaginates the output. For example, nv show --paginate on interface bond1.
--pendingShows configuration that is set and unset but not yet applied or saved. For example, nv show --pending interface bond1.
--rev <revision>Shows a detached pending configuration. See the nv config detach configuration management command below. For example, nv show --rev changeset/cumulus/2021-06-11_16.16.41_FPKK interface bond1.
--startupShows configuration saved with the nv config save command. This is the configuration after the switch boots.

The following example shows pending BGP graceful restart configuration:

cumulus@switch:~$ nv show router bgp graceful-restart --pending
                             pending_20210128_212626_4WSY  description
----------------------------  ----------------------------  ----------------------------------------------------------------------
mode                          helper-only                   Role of router during graceful restart. helper-only, router is in h...
path-selection-deferral-time  360                           Used by the restarter as an upper-bounds for waiting for peeringes...
restart-time                  120                           Amount of time taken to restart by router. It is advertised to the...
stale-routes-time             360                           Specifies an upper-bounds on how long we retain routes from a resta...

Show Legacy Commands

Cumulus Linux includes show legacy commands that provide the same output as the NCLU show commands. To use the show legacy commands, the NCLU service (netd) must be running.

When the NVUE service is running, Cumulus Linux disables NCLU commands (net add, net del, net show) to prevent you from configuring the switch with both NVUE and NVUE commands. For example, if you try to run the net show interface command, you see the error Using NCLU with 'nvued' running is not supported.

To run the show legacy commands, replace net show with nv show --legacy.

For example, to show the link and administrative state of an interface (such as swp1), run the nv show --legacy interface swp1 command:

cumulus@switch:~$ nv show --legacy interface swp1
    Name  MAC                Speed  MTU   Mode
--  ----  -----------------  -----  ----  ------------
UP  swp1  44:38:39:00:00:31  1G     9216  Interface/L3

IP Details
-------------------------  -----------
IP:                        10.0.0.9/32
IP Neighbor(ARP) Entries:  0

cl-netstat counters
-------------------
RX_OK  RX_ERR  RX_DRP  RX_OVR  TX_OK  TX_ERR  TX_DRP  TX_OVR
-----  ------  ------  ------  -----  ------  ------  ------
    4       0       4       0  55425       0       0       0

Routing
-------
  Interface swp1 is up, line protocol is up
  Link ups:       0    last: (never)
  Link downs:     0    last: (never)
  PTM status: disabled
  vrf: default
  index 3 metric 0 mtu 9216 speed 1000
  flags: <UP,BROADCAST,RUNNING,MULTICAST>
  Type: Ethernet
  HWaddr: 44:38:39:00:00:31
  inet 10.0.0.9/32 unnumbered
  inet6 fe80::4638:39ff:fe00:31/64
  Interface Type Other
  protodown: off

To see a summary of layer 3 fabric connectivity, run the nv show --legacy bgp command:

cumulus@switch:~$ nv show --legacy bgp
show bgp ipv4 unicast
=====================
BGP table version is 0, local router ID is 10.10.10.2, vrf id 0
Default local pref 100, local AS 65102
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
   10.10.10.2/32    0.0.0.0                  0         32768 i

Displayed  1 routes and 1 total paths


show bgp ipv6 unicast
=====================
No BGP prefixes displayed, 0 exist
...

Not all net show commands are available as nv show --legacy commands.

Configuration Management Commands

The NVUE configuration management commands manage and apply configurations.

Command
Description
nv config applyApplies the pending configuration to become the applied configuration.
You can also use these prompt options:
  • --y or --assume-yes to automatically reply yes to all prompts.
  • --assume-no to automatically reply no to all prompts.

Cumulus Linux applies but does not save the configuration; the configuration does not persist after a reboot.

You can also use these apply options:
--confirm applies the configuration change but you must confirm the applied configuration. If you do not confirm within ten minutes, the configuration rolls back automatically. You can change the default time with the apply --confirm <time> command. For example, apply --confirm 60 requires you to confirm within one hour.
--confirm-status shows the amount of time left before the automatic rollback.
nv config detachDetaches the configuration from the current pending configuration. Cumulus Linux names the detached configuration pending and includes a timestamp with extra characters. For example: pending_20210128_212626_4WSY
nv config diff <revision> <revision>Shows differences between configurations, such as the pending configuration and the applied configuration or the detached configuration and the pending configuration.
nv config history <nvue-file>Shows the apply history for the revision.
nv config patch <nvue-file>Updates the pending configuration with the specified YAML configuration file.
nv config replace <nvue-file>Replaces the pending configuration with the specified YAML configuration file.
nv config saveOverwrites the startup configuration with the applied configuration by writing to the /etc/nvue.d/startup.yaml file. The configuration persists after a reboot.

List All NVUE Commands

To show the full list of NVUE commands, run nv list-commands. For example:

cumulus@switch:~$ nv list-commands
nv show router
nv show router pbr
nv show router pbr nexthop-group
nv show router pbr nexthop-group <nexthop-group-id>
nv show router pbr nexthop-group <nexthop-group-id> via
nv show router pbr nexthop-group <nexthop-group-id> via <nhg-via-id>
nv show router pbr map
nv show router pbr map <pbr-map-id>
nv show router pbr map <pbr-map-id> rule
nv show router pbr map <pbr-map-id> rule <rule-id>
nv show router pbr map <pbr-map-id> rule <rule-id> match
nv show router pbr map <pbr-map-id> rule <rule-id> action
nv show router policy
nv show router policy community-list
...

You can show the list of commands for a command grouping. For example, to show the list of interface commands:

cumulus@switch:~$ nv list-commands interface
nv show interface
nv show interface <interface-id>
nv show interface <interface-id> router
nv show interface <interface-id> router pbr
nv show interface <interface-id> router ospf
nv show interface <interface-id> router ospf timers
nv show interface <interface-id> router ospf authentication
nv show interface <interface-id> router ospf bfd
nv show interface <interface-id> bond
nv show interface <interface-id> bond member
nv show interface <interface-id> bond member <member-id>
nv show interface <interface-id> bond mlag
nv show interface <interface-id> bridge
...

Use the Tab key to get help for the command lists you want to see. For example, to show the list of command options available for the interface swp1, run:

cumulus@switch:~$ nv list-commands interface swp1 <<press Tab>>
acl     bond    bridge  evpn    ip      link    ptp     qos     router 

NVUE Configuration File

When you save network configuration using NVUE, Cumulus Linux writes the configuration to the /etc/nvue.d/startup.yaml file.

NVUE also writes to underlying Linux files, such as /etc/network/interfaces and /etc/frr/frr.conf, when you apply a configuration. You can view these configuration files; however NVIDIA recommends that you do not manually edit them while using NVUE.

You can edit or replace the contents of the /etc/nvue.d/startup.yaml file. NVUE applies the configuration in the /etc/nvue.d/startup.yaml file during system boot only if the nvue-startup.service is running. If this service is not running, the running configuration that you apply last loads when the switch boots.

To start nvue-startup.service:

cumulus@switch:~$ sudo systemctl enable nvue-startup.service
cumulus@switch:~$ sudo systemctl start nvue-startup.service

Example Configuration Commands

This section provides examples of how to configure a Cumulus Linux switch using NVUE commands.

Configure the System Hostname

The example below shows the NVUE commands required to change the hostname for the switch to leaf01:

cumulus@switch:~$ nv set platform hostname value leaf01
cumulus@switch:~$ nv config apply

Configure the System DNS Server

The example below shows the NVUE commands required to define the DNS server for the switch:

cumulus@switch:~$ nv set service dns mgmt server 192.168.200.1
cumulus@switch:~$ nv config apply

Configure an Interface

The example below shows the NVUE commands required to bring up swp1.

cumulus@switch:~$ nv set interface swp1 link state up
cumulus@switch:~$ nv config apply 

Configure a Bond

The example below shows the NVUE commands required to configure the front panel port interfaces swp1 thru swp4 to be slaves in bond0.

cumulus@switch:~$ nv set interface bond0 bond member swp1-4
cumulus@switch:~$ nv config apply

Configure a Bridge

The example below shows the NVUE commands required to create a VLAN-aware bridge that contains two switch ports (swp1 and swp2) and includes 3 VLANs; tagged VLANs 10 and 20 and an untagged (native) VLAN of 1.

With NVUE, there is a default bridge called br_default, which has no ports assigned to it. The example below configures this default bridge.

cumulus@switch:~$ nv set interface swp1-2 bridge domain br_default
cumulus@switch:~$ nv set bridge domain br_default vlan 10,20
cumulus@switch:~$ nv set bridge domain br_default untagged 1
cumulus@switch:~$ nv config apply

Configure MLAG

The example below shows the NVUE commands required to configure MLAG on leaf01. The commands:

cumulus@leaf01:~$ nv set interface bond1 bond member swp1
cumulus@leaf01:~$ nv set interface bond2 bond member swp2
cumulus@leaf01:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf01:~$ nv set interface bond2 bond mlag id 2
cumulus@switch:~$ nv set interface bond1-2 bridge domain br_default 
cumulus@leaf01:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf01:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf01:~$ nv set mlag backup 10.10.10.2
cumulus@leaf01:~$ nv set mlag peer-ip linklocal
cumulus@leaf01:~$ nv config apply

Configure BGP Unnumbered

The example below shows the NVUE commands required to configure BGP unnumbered on leaf01. The commands:

cumulus@leaf01:~$ nv set router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf default router bgp peer swp51 remote-as external
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv4-unicast static-network 10.10.10.1/32
cumulus@leaf01:~$ nv config apply

Example Monitoring Commands

This section provides monitoring command examples.

Show Installed Software

The following example command lists the software installed on the switch:

cumulus@switch:~$ nv show platform software
Installed Software
=====================
                      description                                                     package                version
--------------------- ----------------------------                                    --------------------   ------------
acpi                  displays information on ACPI devices                            acpi                   1.7-1.1                   
acpi-support-base     scripts for handling base ACPI events such as the power button  acpi-support-base      0.142-8
acpid                 Advanced Configuration and Power Interface event daemon         acpid                  1:2.0.31-1
adduser               add and remove users and groups                                 adduser                3.118
apt                   commandline package manager                                     apt                    1.8.2.3
arping                sends IP and/or ARP pings (to the MAC address)                  arping                 2.19-6
arptables             ARP table administration                                        arptables              0.0.4+snapshot20181021-4
atftp                 advanced TFTP client                                            atftp                  0.7.git20120829-3.2~deb10u1                 
atftpd                advanced TFTP server                                            atftpd                 0.7.git20120829-3.2~deb10u1 
auditd                User space tools for security auditing                          auditd                 1:2.8.4-3              
base-files            Debian base system miscellaneous files                          base-files             10.3+deb10u9                 
base-passwd           Debian base system master password and group files              base-passwd            3.5.46 
bash                  GNU Bourne Again SHell                                          bash                   5.0-4
...

Show Interface Configuration

The following example command shows the running, applied, and pending swp1 interface configuration.

cumulus@leaf01:~$ nv show interface swp1
                         operational  applied  description
-----------------------  -----------  -------  ----------------------------------------------------------------------
type                     swp                   The type of interface
ip
  [address]                                    ipv4 and ipv6 address
link
  mtu                    9216                  interface mtu
  state                  down                  The state of the interface
  stats
    carrier-transitions  3                     Number of times the interface state has transitioned between up and...
    in-bytes             300 Bytes             total number of bytes received on the interface
    in-drops             5                     number of received packets dropped
    in-errors            0                     number of received packets with errors
    in-pkts              5                     total number of packets received on the interface
    out-bytes            0 Bytes               total number of bytes transmitted out of the interface
    out-drops            0                     The number of outbound packets that were chosen to be discarded eve...
    out-errors           0                     The number of outbound packets that could not be transmitted becaus...
    out-pkts             0                     total number of packets transmitted out of the interface
...

Example Configuration Management Commands

This section provides examples of how to use the configuration management commands to apply, save, and detach configurations.

Apply and Save a Configuration

The following example command configures the front panel port interfaces swp1 thru swp4 to be slaves in bond0. The configuration is only in a pending configuration state. The configuration is not applied. NVUE has not yet made any changes to the running configuration.

cumulus@switch:~$ nv set interface bond0 bond member swp1-4

To apply the pending configuration to the running configuration, run the nv config apply command. The configuration does not persist after a reboot.

cumulus@switch:~$ nv config apply

To save the applied configuration to the startup configuration, run the nv config save command. This command overwrites the startup configuration with the applied configuration by writing to the /etc/nvue.d/startup.yaml file. The configuration persists after a reboot.

cumulus@switch:~$ nv config save

Detach a Pending Configuration

The following example configures the IP address of the loopback interface, then detaches the configuration from the current pending configuration. Cumulus Linux saves the detached configuration to a file changeset/cumulus/<date>_<time>_xxxx that includes a timestamp with extra characters to distinguish it from other pending configurations; for example, changeset/cumulus/2021-06-11_18.35.06_FPKP.

cumulus@switch:~$ nv set interface lo ip address 10.10.10.1
cumulus@switch:~$ nv config detach

View Differences Between Configurations

To view differences between configurations, run the nv config diff command.

To view differences between two detached pending configurations, run the nv config diff «TAB» command to list all the current detached pending configurations, then run the nv config diff command with the pending configurations you want to diff:

cumulus@switch:~$ nv config diff <<press Tab>>
applied                                     changeset/cumulus/2021-06-11_18.35.06_FPKP
changeset/cumulus/2021-06-11_16.16.41_FPKK  empty
changeset/cumulus/2021-06-11_17.05.12_FPKN  startup
cumulus@switch:~$ nv config diff changeset/cumulus/2021-06-11_18.35.06_FPKP changeset/cumulus/2021-06-11_17.05.12_FPKN

To view differences between a detached pending configuration and the applied configuration:

cumulus@switch:~$ nv config diff changeset/cumulus/2021-06-11_18.35.06_FPKP applied

Replace and Patch a Pending Configuration

The following example replaces the pending configuration with the contents of the YAML configuration file called nv-02/13/2021.yaml located in the /deps directory:

cumulus@switch:~$ nv config replace /deps/nv-02/13/2021.yaml

The following example patches the pending configuration (runs the set or unset commands from the configuration in the nv-02/13/2021.yaml file located in the /deps directory):

cumulus@switch:~$ nv config patch /deps/nv-02/13/2021.yaml

How Is NVUE Different from NCLU?

This section lists some of the differences between NVUE CLI and the NCLU CLI.

Configuration File

When you save network configuration using NVUE, Cumulus Linux saves the configuration in the /etc/nvue.d/startup.yaml file.

NVUE also writes to underlying Linux files when you apply a configuration, such as the /etc/network/interfaces and /etc/frr/frr.conf files. You can view these configuration files; however NVIDIA recommends that you do not manually edit them while using NVUE.

Bridge Configuration

You set global bridge configuration on the bridge domain. For example:

cumulus@leaf01:~$ nv set bridge domain br_default vlan 10,20

However, you set specific bridge interface options with interface commands. For example:

cumulus@leaf01:~$ nv set interface swp1 bridge domain br_default learning on

The default VLAN-aware bridge in NVUE is br_default. The default VLAN-aware bridge in NCLU is bridge.

BGP Configuration

You can set global BGP configuration, such as the ASN, router ID, graceful shutdown and restart with the nv set router bgp command. For example:

cumulus@leaf01:~$ nv set router bgp autonomous-system 65101

However, BGP peer and peer group, route information, timer, and address family configuration requires a VRF. For example:

cumulus@leaf01:~$ nv set vrf default router bgp peer swp51 remote-as external

Date and Time

This section discusses how to:

Setting the Date and Time

This section discusses how to set the time zone, and how to set the date and time on the software clock on the switch. To configure NTP, see Network Time Protocol - NTP. To configure PTP, see Precision Time Protocol - PTP.

Setting the time zone, and the date and time on the software clock requires root privileges; use sudo.

Set the Time Zone

You can use one of two methods to set the time zone on the switch:

Edit the /etc/timezone File

To see the current time zone, list the contents of /etc/timezone:

cumulus@switch:~$ cat /etc/timezone
US/Eastern

Edit the file to add your desired time zone. You can see a list of valid time zones here.

Use the following command to apply the new time zone:

cumulus@switch:~$ sudo dpkg-reconfigure --frontend noninteractive tzdata

Use the following command to change /etc/localtime to reflect your current time zone. Use the same value as the previous step.

sudo ln -sf /usr/share/zoneinfo/US/Eastern /etc/localtime

Follow the Guided Wizard

To set the time zone using the guided wizard, run the following command:

cumulus@switch:~$ sudo dpkg-reconfigure tzdata

For more information, see the Debian System Administrator's Manual - Time.

Set the Date and Time

The switch contains a battery backed hardware clock that maintains the time while the switch powers off and between reboots. When the switch is running, the Cumulus Linux operating system maintains its own software clock.

During boot up, the switch copies the time from the hardware clock to the operating system software clock. The software clock takes care of all the timekeeping. During system shutdown, the switch copies the software clock back to the battery backed hardware clock.

You can set the date and time on the software clock with the date command. First, determine your current time zone:

cumulus@switch:~$ date +%Z

If you need to reconfigure the current time zone, refer to the instructions above.

To set the software clock according to the configured time zone:

cumulus@switch:~$ sudo date -s "Tue Jan 26 00:37:13 2021"

You can write the current value of the software clock to the hardware clock using the hwclock command:

cumulus@switch:~$ sudo hwclock -w

See man hwclock(8) for more information.

Network Time Protocol - NTP

The ntpd daemon running on the switch implements the NTP protocol. It synchronizes the system time with time servers in the /etc/ntp.conf file. The ntpd daemon starts at boot by default.

If you intend to run this service within a VRF, including the management VRF, follow these steps to configure the service.

Configure NTP Servers

The default NTP configuration includes the following servers, which are in the /etc/ntp.conf file:

To add the NTP servers you want to use, run the following commands. Include the iburst option to increase the sync speed.

cumulus@switch:~$ net add time ntp server 4.cumulusnetworks.pool.ntp.org iburst
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

These commands add the NTP server to the list of servers in the /etc/ntp.conf file:

# pool.ntp.org maps to about 1000 low-stratum NTP servers.  Your server will
# pick a different set every time it starts up.  Please consider joining the
# pool: <http://www.pool.ntp.org/join.html>
server 0.cumulusnetworks.pool.ntp.org iburst
server 1.cumulusnetworks.pool.ntp.org iburst
server 2.cumulusnetworks.pool.ntp.org iburst
server 3.cumulusnetworks.pool.ntp.org iburst
server 4.cumulusnetworks.pool.ntp.org iburst

The NVUE command requires a VRF. The following command adds the NTP servers in the default VRF.

cumulus@switch:~$ nv set service ntp default server 4.cumulusnetworks.pool.ntp.org iburst on
cumulus@switch:~$ nv config apply

Edit the /etc/ntp.conf file to add or update NTP server information:

cumulus@switch:~$ sudo nano /etc/ntp.conf
# pool.ntp.org maps to about 1000 low-stratum NTP servers.  Your server will
# pick a different set every time it starts up.  Please consider joining the
# pool: <http://www.pool.ntp.org/join.html>
server 0.cumulusnetworks.pool.ntp.org iburst
server 1.cumulusnetworks.pool.ntp.org iburst
server 2.cumulusnetworks.pool.ntp.org iburst
server 3.cumulusnetworks.pool.ntp.org iburst
server 4.cumulusnetworks.pool.ntp.org iburst

To set the initial date and time with NTP before starting the ntpd daemon, run the ntpd -q command. Be aware that ntpd -q can hang if the time servers are not reachable.

To verify that ntpd is running on the system:

cumulus@switch:~$ ps -ef | grep ntp
ntp       4074     1  0 Jun20 ?        00:00:33 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 101:102

To check the NTP peer status:

cumulus@switch:~$ net show time ntp servers
      remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+minime.fdf.net  58.180.158.150   3 u  140 1024  377   55.659    0.339   1.464
+69.195.159.158  128.138.140.44   2 u  259 1024  377   41.587    1.011   1.677
*chl.la          216.218.192.202  2 u  210 1024  377    4.008    1.277   1.628
+vps3.drown.org  17.253.2.125     2 u  743 1024  377   39.319   -0.316   1.384
cumulus@switch:~$ nv show service ntp default server
cumulus@switch:~$ ntpq -p
      remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+ec2-34-225-6-20 129.6.15.30      2 u   73 1024  377   70.414   -2.414   4.110
+lax1.m-d.net    132.163.96.1     2 u   69 1024  377   11.676    0.155   2.736
*69.195.159.158  199.102.46.72    2 u  133 1024  377   48.047   -0.457   1.856
-2.time.dbsinet. 198.60.22.240    2 u 1057 1024  377   63.973    2.182   2.692

The following example commands remove some of the default NTP servers:

cumulus@switch:~$ net del time ntp server 0.cumulusnetworks.pool.ntp.org
cumulus@switch:~$ net del time ntp server 1.cumulusnetworks.pool.ntp.org
cumulus@switch:~$ net del time ntp server 2.cumulusnetworks.pool.ntp.org
cumulus@switch:~$ net del time ntp server 3.cumulusnetworks.pool.ntp.org
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv unset service ntp default server 0.cumulusnetworks.pool.ntp.org
cumulus@switch:~$ nv unset service ntp default server 1.cumulusnetworks.pool.ntp.org
cumulus@switch:~$ nv unset service ntp default server 2.cumulusnetworks.pool.ntp.org
cumulus@switch:~$ nv unset service ntp default server 3.cumulusnetworks.pool.ntp.org
cumulus@switch:~$ nv config apply

Edit the /etc/ntp.conf file to delete NTP servers.

cumulus@switch:~$ sudo nano /etc/ntp.conf
...
# pool.ntp.org maps to about 1000 low-stratum NTP servers.  Your server will
# pick a different set every time it starts up.  Please consider joining the
# pool: <http://www.pool.ntp.org/join.html>
server 4.cumulusnetworks.pool.ntp.org iburst
...

Specify the NTP Source Interface

By default, the source interface that NTP uses is eth0. The following example command configures the NTP source interface to be swp10.

cumulus@switch:~$ net add time ntp source swp10
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

These commands create the following configuration snippet in the ntp.conf file:

...
# Specify interfaces
interface listen swp10
...
cumulus@switch:~$ nv set service ntp default listen swp10
cumulus@switch:~$ nv config apply

Edit the /etc/ntp.conf file and modify the entry under the Specify interfaces comment.

cumulus@switch:~$ sudo nano /etc/ntp.conf
...
# Specify interfaces
interface listen swp10
...

Use NTP in a DHCP Environment

You can use DHCP to specify your NTP servers. Ensure that the DHCP-generated configuration file /run/ntp.conf.dhcp exists. This file is generated by the /etc/dhcp/dhclient-exit-hooks.d/ntp script and is a copy of the default /etc/ntp.conf file with a modified server list from the DHCP server. If this file does not exist and you plan on using DHCP in the future, you can copy your current /etc/ntp.conf file to the location of the DHCP file.

To use DHCP to specify your NTP servers, run the sudo -E systemctl edit ntp.service command and add the ExecStart= line:

cumulus@switch:~$ sudo -E systemctl edit ntp.service
[Service]
ExecStart=
ExecStart=/usr/sbin/ntpd -n -u ntp:ntp -g -c /run/ntp.conf.dhcp

The sudo -E systemctl edit ntp.service command always updates the base ntp.service even if ntp@mgmt.service is used. The ntp@mgmt.service is re-generated automatically.

To validate that your configuration, run these commands:

cumulus@switch:~$ sudo systemctl restart ntp
cumulus@switch:~$ sudo systemctl status -n0 ntp.service

If the state is not Active, or the alternate configuration file does not appear in the ntp command line, it is likely that you made a configuration mistake. Correct the mistake and rerun the commands above to verify.

Configure NTP with Authorization Keys

For added security, you can configure NTP to use authorization keys.

Configure the NTP Server

  1. Create a .keys file, such as /etc/ntp.keys. Specify a key identifier (a number from 1-65535), an encryption method (M for MD5), and the password. The following provides an example:

    #
    # PLEASE DO NOT USE THE DEFAULT VALUES HERE.
    #
    #65535  M  akey
    #1      M  pass
    
    1  M  CumulusLinux!
    
  2. In the /etc/ntp/ntp.conf file, add a pointer to the /etc/ntp.keys file you created above and specify the key identifier. For example:

    keys /etc/ntp/ntp.keys
    trustedkey 1
    controlkey 1
    requestkey 1
    
  3. Restart NTP with the sudo systemctl restart ntp command.

Configure the NTP Client

The NTP client is the Cumulus Linux switch.

  1. Create the same .keys file you created on the NTP server (/etc/ntp.keys). For example:

    cumulus@switch:~$  sudo nano /etc/ntp.keys
    #
    # DO NOT USE THE DEFAULT VALUES HERE.
    #
    #65535  M  akey
    #1      M  pass
    
    1  M  CumulusLinux!
    
  2. Edit the /etc/ntp.conf file to specify the server you want to use, the key identifier, and a pointer to the /etc/ntp.keys file you created in step 1. For example:

    cumulus@switch:~$ sudo nano /etc/ntp.conf
    ...
    # You do need to talk to an NTP server or two (or three).
    #pool ntp.your-provider.example
    # OR
    #server ntp.your-provider.example
    
    # pool.ntp.org maps to about 1000 low-stratum NTP servers.  Your server will
    # pick a different set every time it starts up.  Please consider joining the
    # pool: <http://www.pool.ntp.org/join.html>
    #server 0.cumulusnetworks.pool.ntp.org iburst
    #server 1.cumulusnetworks.pool.ntp.org iburst
    #server 2.cumulusnetworks.pool.ntp.org iburst
    #server 3.cumulusnetworks.pool.ntp.org iburst
    server 10.50.23.121 key 1
    
    #keys
    keys /etc/ntp.keys
    trustedkey 1
    controlkey 1
    requestkey 1
    ...
    
  3. Restart NTP in the active VRF (default or management). For example:

    cumulus@switch:~$ systemctl restart ntp@mgmt.service
    
  4. Wait a few minutes, then run the ntpq -c as command to verify the configuration:

    cumulus@switch:~$ ntpq -c as
    
    ind assid status  conf reach auth condition  last_event cnt
    ===========================================================
      1 40828  f014   yes   yes   ok     reject   reachable  1
    

    After authorization is accepted, you see the following command output:

    cumulus@switch:~$ ntpq -c as
    
    ind assid status  conf reach auth condition  last_event cnt
    ===========================================================
      1 40828  f61a   yes   yes   ok   sys.peer    sys_peer  1
    

Considerations

NTP in Cumulus Linux uses the /usr/share/zoneinfo/leap-seconds.list file, which expires periodically and results in generated log messages about the expiration. When the file expires, update it from https://www.ietf.org/timezones/data/leap-seconds.list or upgrade the tzdata package to the newest version.

Precision Time Protocol - PTP

Cumulus Linux supports IEEE 1588-2008 Precision Timing Protocol (PTPv2), which defines the algorithm and method for synchronizing clocks of various devices across packet-based networks, including Ethernet switches and IP routers.

PTP is capable of sub-microsecond accuracy. The clocks are in a master-slave hierarchy, where the slaves synchronize to their masters, which can be slaves to their own masters. The best master clock (BMC) algorithm, which runs on every clock, creates and updates the hierarchy automatically. The grandmaster clock is the top-level master. To provide a high-degree of accuracy, a Global Positioning System (GPS) time source typically synchronizes the grandmaster clock.

In the following example:

Cumulus Linux and PTP

PTP in Cumulus Linux uses the linuxptp package that includes the following programs:

Cumulus Linux supports:

  • PTP boundary clock mode only (the switch provides timing to downstream servers; it is a slave to a higher-level clock and a master to downstream clocks).
  • Both IPv4 and IPv6 UDP PTP encapsulation. Cumulus Linux does not support 802.3 encapsulation.
  • Only a single PTP domain per network.
  • PTP on layer 3 interfaces, trunk ports, and switch ports belonging to a VLAN. Cumulus Linux does not support PTP on bonds.
  • Multicast and mixed message mode but not unicast only message mode.
  • End-to-End delay mechanism (not Peer-to-Peer).
  • Two-step clock correction mode, where PTP notes time when the packet goes out of the port and sends the time in a separate (follow-up) message. Cumulus Linux does not support one-step mode.
  • Hardware time stamping for PTP packets. This allows PTP to avoid inaccuracies caused by message transfer delays and improves the accuracy of time synchronization.

You cannot run both PTP and NTP on the switch. By default, Cumulus Linux enables PTP in the default VRF and in any new VRFs you create. You can disable PTP on a VRF to isolate PTP traffic.

Basic Configuration

Basic PTP configuration requires you:

The basic configuration shown below uses the default PTP settings:

To configure optional settings, such as the PTP domain, priority, transport mode, DSCP, and timers, see Optional Configuration below.

You can configure PTP with NVUE or by manually editing /etc/cumulus/switchd.conf file. You cannot configure PTP with NCLU.

The NVUE nv set service PTP commands require an instance number (1 in the example command below) for management purposes.

cumulus@switch:~$ nv set service ptp 1 enable on
cumulus@switch:~$ nv set interface swp1 ip address 10.0.0.9/32
cumulus@switch:~$ nv set interface swp2 ip address 10.0.0.10/32
cumulus@switch:~$ nv set interface swp1 ptp enable on
cumulus@switch:~$ nv set interface swp2 ptp enable on
cumulus@switch:~$ nv config apply

The configuration writes to the /etc/ptp4l.conf file.

  1. Enable and start the ptp4l and phc2sys services:

    cumulus@switch:~$ sudo systemctl enable ptp4l.service phc2sys.service
    cumulus@switch:~$ sudo systemctl start ptp4l.service phc2sys.service
    
  2. Edit the Default interface options section of the /etc/ptp4l.conf file to configure the interfaces on the switch that you want to use for PTP.

cumulus@switch:~$ sudo nano /etc/ptp4l.conf
...
[global]
#
# Default Data Set
#
slaveOnly               0
priority1               128
priority2               128
domainNumber            0

twoStepFlag             1
dscp_event              43
dscp_general            43

offset_from_master_min_threshold   -50
offset_from_master_max_threshold   50
mean_path_delay_threshold          200

#
# Run time options
#
logging_level           6
path_trace_enabled      0
use_syslog              1
verbose                 0
summary_interval        0

#
# servo parameters
#
pi_proportional_const   0.000000
pi_integral_const       0.000000
pi_proportional_scale   0.700000
pi_proportional_exponent -0.300000
pi_proportional_norm_max 0.700000
pi_integral_scale       0.300000
pi_integral_exponent    0.400000
pi_integral_norm_max    0.300000
step_threshold          0.000002
first_step_threshold    0.000020
max_frequency           900000000
sanity_freq_limit       0

#
# Default interface options
#
time_stamping           hardware


# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.

[swp1]
logAnnounceInterval     0
logSyncInterval         -3
logMinDelayReqInterval  -3
announceReceiptTimeout  3
udp_ttl                 1
masterOnly              0
delay_mechanism         E2E
network_transport       UDPv4

[swp2]
logAnnounceInterval     0
logSyncInterval         -3
logMinDelayReqInterval  -3
announceReceiptTimeout  3
udp_ttl                 1
masterOnly              0
delay_mechanism         E2E
network_transport       UDPv4
  1. Restart the ptp4l service:

    cumulus@switch:~$ sudo systemctl restart ptp4l.service
    

Optional Configuration

Clock Domains

PTP domains allow different independent timing systems to be present in the same network without confusing each other. A PTP domain is a network or a portion of a network within which all the clocks synchronize. Every PTP message contains a domain number. A PTP instance works in only one domain and ignores messages that contain a different domain number.

You can specify multiple PTP clock domains. PTP isolates each domain from other domains so that each domain is a different PTP network. You can specify a number between 0 and 127.

The following example commands configure domain 3:

cumulus@switch:~$ nv set service ptp 1 domain 3
cumulus@switch:~$ nv config apply

Edit the Default Data Set section of the /etc/ptp4l.conf file to change the domainNumber setting, then restart the ptp4l service.

cumulus@switch:~$ sudo nano /etc/ptp4l.conf
[global]
#
# Default Data Set
#
slaveOnly               0
priority1               254
priority2               254
domainNumber            3
...
cumulus@switch:~$ sudo systemctl restart ptp4l.service

PTP Priority

Use the PTP priority to select the best master clock. You can set priority 1 and 2:

The range for both priority1 and priority2 is between 0 and 255. The default priority is 128. For the boundary clock, use a number above 128. The lower priority applies first.

The following example commands set priority 1 and priority 2 to 200:

cumulus@switch:~$ nv set service ptp 1 priority1 200
cumulus@switch:~$ nv set service ptp 1 priority2 200
cumulus@switch:~$ nv config apply

Edit the Default Data Set section of the /etc/ptp4l.conf file to change the priority1 and, or priority2 setting, then restart the ptp4l service.

cumulus@switch:~$ sudo nano /etc/ptp4l.conf
[global]
#
# Default Data Set
#
slaveOnly               0
priority1               254
priority2               254
domainNumber            3
...
cumulus@switch:~$ sudo systemctl restart ptp4l.service

DSCP

You can configure the DiffServ code point (DSCP) value for all PTP IPv4 packets originated locally. You can set a value between 0 and 63.

cumulus@switch:~$ nv set service ptp 1 ip-dscp 22
cumulus@switch:~$ nv config apply

Edit the Default Data Set section of the /etc/ptp4l.conf file to change the dscp_event setting for PTP messages that trigger a timestamp read from the clock and the dscp_general setting for PTP messages that carry commands, responses, information, or timestamps.

After you save the /etc/ptp4l.conf file, restart the ptp4l service.

cumulus@switch:~$ sudo nano /etc/ptp4l.conf
[global]
#
# Default Data Set
#
slaveOnly               0
priority1               254
priority2               254
domainNumber            3

twoStepFlag             1
dscp_event              22
dscp_general            22
...
cumulus@switch:~$ sudo systemctl restart ptp4l.service

Transport Mode

By default, Cumulus Linux encapsulates PTP messages in UDP/IPV4 frames. To encapsulate PTP messages on an interface in UDP/IPV6 frames:

cumulus@switch:~$ nv set interface swp1 ptp transport ipv6
cumulus@switch:~$ nv config apply

Edit the Default interface options section of the /etc/ptp4l.conf file to change the network_transport setting for the interface, then restart the ptp4l service.

cumulus@switch:~$ sudo nano /etc/ptp4l.conf
...
# Default interface options
#
time_stamping           hardware

# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.

[swp1]
logAnnounceInterval     0
logSyncInterval         -3
logMinDelayReqInterval  -3
announceReceiptTimeout  3
udp_ttl                 1
masterOnly              0
delay_mechanism         E2E
network_transport       UDPv6

[swp2]
logAnnounceInterval     0
logSyncInterval         -3
logMinDelayReqInterval  -3
announceReceiptTimeout  3
udp_ttl                 1
masterOnly              0
delay_mechanism         E2E
network_transport       UDPv6
...
cumulus@switch:~$ sudo systemctl restart ptp4l.service

Forced Master

By default, PTP ports are in auto mode, where the BMC algorithm determines the state of the port.

You can configure Forced Master mode on a PTP port so that it is always in a master state and the BMC algorithm does not run for this port. This port ignores any Announce messages it receives.

cumulus@switch:~$ nv set interface swp1 ptp forced-master on
cumulus@switch:~$ nv config apply

Edit the Default interface options section of the /etc/ptp4l.conf file to change the masterOnly setting for the interface, then restart the ptp4l service.

cumulus@switch:~$ sudo nano /etc/ptp4l.conf
...
# Default interface options
#
time_stamping           hardware

# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.

[swp1]
logAnnounceInterval     0
logSyncInterval         -3
logMinDelayReqInterval  -3
announceReceiptTimeout  3
udp_ttl                 1
masterOnly              1
delay_mechanism         E2E
network_transport       UDPv4
...
cumulus@switch:~$ sudo systemctl restart ptp4l.service

Mixed Mode

Cumulus Linux supports the following PTP message modes:

Mixed mode is an early access feature in Cumulus Linux.

Multicast mode is the default setting. To set the message mode to mixed on an interface:

NVUE commands are not supported.

Edit the Default interface options section of the /etc/ptp4l.conf file to change the Hybrid_e2e setting to 1 for the interface, then restart the ptp4l service.

cumulus@switch:~$ sudo nano /etc/ptp4l.conf
...
# Default interface options
#
time_stamping           hardware

# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.

[swp1]
logAnnounceInterval     0
logSyncInterval         -3
logMinDelayReqInterval  -3
announceReceiptTimeout  3
udp_ttl                 20
masterOnly              1
delay_mechanism         E2E
network_transport       UDPv4
Hybrid_e2e              1
...
cumulus@switch:~$ sudo systemctl restart ptp4l.service

TTL for a PTP Message

TTL for a PTP message is an early access feature in Cumulus Linux.

To restrict the number of hops a PTP message can travel, set the TTL on the PTP interface. You can set a value between 1 and 255.

cumulus@switch:~$ nv set interface swp1 ptp ttl 20
cumulus@switch:~$ nv config apply

Edit the Default interface options section of the /etc/ptp4l.conf file to change the udp_ttl setting for the interface, then restart the ptp4l service.

cumulus@switch:~$ sudo nano /etc/ptp4l.conf
...
# Default interface options
#
time_stamping           hardware

# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.

[swp1]
logAnnounceInterval     0
logSyncInterval         -3
logMinDelayReqInterval  -3
announceReceiptTimeout  3
udp_ttl                 20
masterOnly              1
delay_mechanism         E2E
network_transport       UDPv4
...
cumulus@switch:~$ sudo systemctl restart ptp4l.service

Acceptable Master Table

The acceptable master table option is a security feature that prevents a rogue player from pretending to be the Grandmaster to take over the PTP network. To use this feature, you configure the clock IDs of known Grandmasters in the acceptable master table and set the acceptable master table option on a PTP port. The BMC algorithm checks if the Grandmaster received on the Announce message is in this table before proceeding with the master selection. Cumulus Linux disables this option by default on PTP ports.

The following example command adds the Grandmaster clock ID 24:8a:07:ff:fe:f4:16:06 to the acceptable master table and enable the PTP acceptable master table option for swp1:

cumulus@switch:~$ nv set service ptp 1 acceptable-master 24:8a:07:ff:fe:f4:16:06
cumulus@switch:~$ nv config apply

You can also configure an alternate priority 1 value for the Grandmaster:

cumulus@switch:~$ nv set service ptp 1 acceptable-master 24:8a:07:ff:fe:f4:16:06 alt-priority 2

To enable the PTP acceptable master table option for swp1:

cumulus@switch:~$ nv set interface swp1 ptp acceptable-master on
cumulus@switch:~$ nv config apply

Edit the Default interface options section of the /etc/ptp4l.conf file to add acceptable_master_clockIdentity 248a07.fffe.f41606.

cumulus@switch:~$ sudo nano /etc/ptp4l.conf
...
#
# Default interface options
#
time_stamping           hardware


[acceptable_master_table]
maxTableSize 16
acceptable_master_clockIdentity 248a07.fffe.f41606
...

You can also configure an alternate priority 1 value for the Grandmaster.

cumulus@switch:~$ sudo nano /etc/ptp4l.conf
...
#
# Default interface options
#
time_stamping           hardware


[acceptable_master_table]
maxTableSize 16
acceptable_master_clockIdentity 248a07.fffe.f41606 2

To enable the PTP acceptable master table option for swp1, add acceptable_master on under [swp1].

...
# Default interface options
#
time_stamping           hardware

# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.

[swp1]
logAnnounceInterval     0
logSyncInterval         -3
logMinDelayReqInterval  -3
announceReceiptTimeout  3
udp_ttl                 20
masterOnly              1
delay_mechanism         E2E
network_transport       UDPv4
acceptable_master       on
...

Restart the ptp4l service:

cumulus@switch:~$ sudo systemctl restart ptp4l.service

PTP Timers

You can set the following timers for PTP messages.

TimerDescription
announce-intervalThe average interval between successive Announce messages. Specify the value as a power of two in seconds.
announce-timeoutThe number of announce intervals that have to occur without receiving an Announce message before a timeout occurs.
Make sure that this value is longer than the announce-interval in your network.
delay-req-intervalThe minimum average time interval allowed between successive Delay Required messages.
sync-intervalThe interval between PTP synchronization messages on an interface. Specify the value as a power of two in seconds.

The following example sets the announce interval between successive Announce messages on swp1 to -1.

cumulus@switch:~$ nv set interface swp1 ptp timers announce-interval -1
cumulus@switch:~$ nv config apply

The following example sets the mean sync-interval for multicast messages on swp1 to -5.

cumulus@switch:~$ nv set interface swp1 ptp timers sync-interval -5
cumulus@switch:~$ nv config apply

Edit the Default interface options section of the /etc/ptp4l.conf file:

  • To set the announce interval between successive Announce messages on swp1 to -1, change the logAnnounceInterval setting for the interface to -1.
  • To set the mean sync-interval for multicast messages on swp1 to -5, change the logSyncInterval setting for the interface to -5.

After you edit the /etc/ptp4l.conf file, restart the ptp4l service.

cumulus@switch:~$ sudo nano /etc/ptp4l.conf
...
# Default interface options
#
time_stamping           hardware

# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.

[swp1]
logAnnounceInterval     -1
logSyncInterval         -5
logMinDelayReqInterval  -3
announceReceiptTimeout  3
udp_ttl                 20
masterOnly              1
delay_mechanism         E2E
network_transport       UDPv4
...
cumulus@switch:~$ sudo systemctl restart ptp4l.service

PTP on a VRF

By default, Cumulus Linux enables PTP in the default VRF and in any VRFs you create. To isolate traffic to a specific VRF, disable PTP on any other VRFs.

PTP in a VRF other than the default is an early access feature in Cumulus Linux.

cumulus@switch:~$ nv set vrf RED ptp enable off
cumulus@switch:~$ nv config apply
Linux commands are not supported.

Delete PTP Configuration

To delete PTP configuration, delete the PTP master and slave interfaces. The following example commands delete the PTP interfaces swp1, swp2, and swp3.

cumulus@switch:~$ nv unset interface swp1 ptp
cumulus@switch:~$ nv unset interface swp2 ptp
cumulus@switch:~$ nv unset interface swp3 ptp
cumulus@switch:~$ nv config apply

Edit the /etc/ptp4l.conf file to remove the interfaces from the Default interface options section, then restart the ptp4l service.

cumulus@switch:~$ sudo nano /etc/ptp4l.conf
...
# Default interface options
#
time_stamping           hardware

# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.
cumulus@switch:~$ sudo systemctl restart ptp4l.service

To disable PTP on the switch and stop the ptp4l and phc2sys processes:

cumulus@switch:~$ nv set service ptp 1 enable off
cumulus@switch:~$ nv config apply
cumulus@switch:~$ sudo systemctl stop ptp4l.service phc2sys.service
cumulus@switch:~$ sudo systemctl disable ptp4l.service phc2sys.service

Troubleshooting

NVUE provides several show commands for PTP. You can view the current PTP configuration, monitor violations, and see time attributes of the clock. For example, to show a summary of the PTP configuration on the switch, run the nv show service ptp <instance> command:

cumulus@switch:~$ nv show service ptp 1

                       operational  applied   description
----------------------  -----------  --------  ----------------------------------------------------------------------
enable                  off          on        Turn the feature 'on' or 'off'.  The default is 'off'.
clock-mode                           boundary  Clock mode
domain                               3         Domain number of the current syntonization
ipv4-dscp                            43        Sets the Diffserv code point for all PTP packets originated locally.
message-mode                         mixed     Mode in which PTP delay message is transmitted.
priority1                            254       Priority1 attribute of the local clock
priority2                            254       Priority2 attribute of the local clock
two-step                             off       Determines if the Clock is a 2 step clock
monitor
  max-offset-threshold               200       Maximum offset threshold in nano seconds
  min-offset-threshold               -200      Minimum offset threshold in nano seconds
  path-delay-threshold               1         Path delay threshold in nano seconds
...

To see the list of NVUE show commands for PTP, run nv list-commands service ptp:

cumulus@switch:~$ nv list-commands service ptp
nv show service ptp
nv show service ptp <instance-id>
nv show service ptp <instance-id> acceptable-master
nv show service ptp <instance-id> acceptable-master <clock-id>
nv show service ptp <instance-id> monitor
nv show service ptp <instance-id> monitor violations
nv show service ptp <instance-id> monitor violations forced-master
nv show service ptp <instance-id> monitor violations forced-master <clock-id>
nv show service ptp <instance-id> monitor violations acceptable-master
nv show service ptp <instance-id> monitor violations acceptable-master <clock-id>
nv show service ptp <instance-id> current
nv show service ptp <instance-id> clock-quality
nv show service ptp <instance-id> parent
nv show service ptp <instance-id> parent grandmaster-clock-quality
...

To view PTP status information, including the delta in nanoseconds from the master clock:

cumulus@switch:~$ sudo pmc -u -b 0 'GET TIME_STATUS_NP'
sending: GET TIME_STATUS_NP
    7cfe90.fffe.f56dfc-0 seq 0 RESPONSE MANAGEMENT TIME_STATUS_NP
        master_offset              12610
        ingress_time               1525717806521177336
        cumulativeScaledRateOffset +0.000000000
        scaledLastGmPhaseChange    0
        gmTimeBaseIndicator        0
        lastGmPhaseChange          0x0000'0000000000000000.0000
        gmPresent                  true
        gmIdentity                 000200.fffe.000005
    000200.fffe.000005-1 seq 0 RESPONSE MANAGEMENT TIME_STATUS_NP
        master_offset              0
        ingress_time               0
        cumulativeScaledRateOffset +0.000000000
        scaledLastGmPhaseChange    0
        gmTimeBaseIndicator        0
        lastGmPhaseChange          0x0000'0000000000000000.0000
        gmPresent                  false
        gmIdentity                 000200.fffe.000005
    000200.fffe.000006-1 seq 0 RESPONSE MANAGEMENT TIME_STATUS_NP
        master_offset              5544033534
        ingress_time               1525717812106811842
        cumulativeScaledRateOffset +0.000000000
        scaledLastGmPhaseChange    0
        gmTimeBaseIndicator        0
        lastGmPhaseChange          0x0000'0000000000000000.0000
        gmPresent                  true
        gmIdentity                 000200.fffe.000005

Example Configuration

In the following example, the boundary clock on the switch receives time from Master 1 (the grandmaster) on PTP slave port swp1, sets its clock and passes the time down through PTP master ports swp2, swp3, and swp4 to the hosts that receive the time.

The following example configuration assumes that you have already configured the layer 3 routed interfaces (swp1, swp2, swp3, and swp4) you want to use for PTP.

cumulus@switch:~$ nv set service ptp 1 enable on
cumulus@switch:~$ nv set service ptp 1 priority2 254
cumulus@switch:~$ nv set service ptp 1 priority1 254
cumulus@switch:~$ nv set service ptp 1 domain 3
cumulus@switch:~$ nv set interface swp1 ptp enable on
cumulus@switch:~$ nv set interface swp2 ptp enable on
cumulus@switch:~$ nv set interface swp3 ptp enable on
cumulus@switch:~$ nv set interface swp4 ptp enable on
cumulus@switch:~$ nv config apply
cumulus@switch:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
    interface:
      lo:
        ip:
          address:
            10.10.10.1/32: {}
        type: loopback
      swp1:
        type: swp
        service:
          ptp:
            enable: on
      swp2:
        type: swp
        service:
          ptp:
            enable: on
      swp3:
        type: swp
        service:
          ptp:
            enable: on
      swp4:
        type: swp
        service:
          ptp:
            enable: on
    service:
      ptp:
        '1':
          enable: on
          priority1: 254
          priority2: 254
          domain: 3
cumulus@switch:~$ sudo cat /etc/ptp4l.conf
...
[global]
#
# Default Data Set
#
slaveOnly               0
priority1               254
priority2               254
domainNumber            3

clock_type              BC

twoStepFlag             1
dscp_event              43
dscp_general            43

offset_from_master_min_threshold   -200
offset_from_master_max_threshold   200
mean_path_delay_threshold          1

#
# Run time options
#
logging_level           6
path_trace_enabled      0
use_syslog              1
verbose                 0
summary_interval        0

#
# Default interface options
#
time_stamping           hardware

# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.

[swp1]
logAnnounceInterval     0
logSyncInterval         -3
logMinDelayReqInterval  -3
announceReceiptTimeout  3
udp_ttl                 1
masterOnly              0
delay_mechanism         E2E
network_transport       UDPv4

[swp2]
logAnnounceInterval     0
logSyncInterval         -3
logMinDelayReqInterval  -3
announceReceiptTimeout  3
udp_ttl                 1
masterOnly              0
delay_mechanism         E2E
network_transport       UDPv4

[swp3]
logAnnounceInterval     0
logSyncInterval         -3
logMinDelayReqInterval  -3
announceReceiptTimeout  3
udp_ttl                 1
masterOnly              0
delay_mechanism         E2E
network_transport       UDPv4

[swp4]
logAnnounceInterval     0
logSyncInterval         -3
logMinDelayReqInterval  -3
announceReceiptTimeout  3
udp_ttl                 1
masterOnly              0
delay_mechanism         E2E
network_transport       UDPv4

Authentication Authorization and Accounting

This section describes how to set up user accounts and ssh for remote access, and configure LDAP authentication, TACACS+, and RADIUS AAA.

SSH for Remote Access

You can generate authentication keys to access a Cumulus Linux switch securely with the ssh-keygen component of the Secure Shell (SSH) protocol. Cumulus Linux uses the OpenSSH package to provide this functionality. This section describes how to generate an SSH key pair.

Generate an SSH Key Pair

  1. To generate the SSH key pair, run the ssh-keygen command and follow the prompts:

    To configure the system without a password, do not enter a passphrase when prompted in the following step.

    cumulus@leaf01:~$ ssh-keygen
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/cumulus/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/cumulus/.ssh/id_rsa.
    Your public key has been saved in /home/cumulus/.ssh/id_rsa.pub.
    The key fingerprint is:
    5a:b4:16:a0:f9:14:6b:51:f6:f6:c0:76:1a:35:2b:bb cumulus@leaf04
    The key's randomart image is:
    +---[RSA 2048]----+
    |      +.o   o    |
    |     o * o . o   |
    |    o + o O o    |
    |     + . = O     |
    |      . S o .    |
    |       +   .     |
    |      .   E      |
    |                 |
    |                 |
    +-----------------+
    
  2. To copy the generated public key to the desired location, run the ssh-copy-id command and follow the prompts:

    cumulus@leaf01:~$ ssh-copy-id -i /home/cumulus/.ssh/id_rsa.pub cumulus@leaf02
    The authenticity of host 'leaf02 (192.168.0.11)' can't be established.
    ECDSA key fingerprint is b1:ce:b7:6a:20:f4:06:3a:09:3c:d9:42:de:99:66:6e.
    Are you sure you want to continue connecting (yes/no)? yes
    /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
    /usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
    cumulus@leaf01's password:
    
    Number of key(s) added: 1
    

    ssh-copy-id does not work if the username on the remote switch is different from the username on the local switch. To work around this issue, use the scp command instead:

    cumulus@leaf01:~$ scp .ssh/id_rsa.pub cumulus@leaf02:.ssh/authorized_keys
    Enter passphrase for key '/home/cumulus/.ssh/id_rsa':
    id_rsa.pub
    
  3. Connect to the remote switch to confirm that the authentication keys are in place:

    cumulus@leaf01:~$ ssh cumulus@leaf02
    
    Welcome to Cumulus VX (TM)
    
    Cumulus VX (TM) is a community supported virtual appliance designed for
    experiencing, testing and prototyping the latest technology.
    For any questions or technical support, visit our community site at:
    http://community.cumulusnetworks.com
    
    The registered trademark Linux (R) is used pursuant to a sublicense from LMI,
    the exclusive licensee of Linus Torvalds, owner of the mark on a world-wide basis.
    Last login: Thu Sep 29 16:56:54 2016
    
Debian Documentation - Password-less logins with OpenSSH

User Accounts

By default, Cumulus Linux has two user accounts: cumulus and root.

The cumulus account:

The root account:

You can add additional user accounts as needed. Like the cumulus account, these accounts must use sudo to execute privileged commands; be sure to include them in the sudo group. For example:

cumulus@switch:~$ sudo adduser NEWUSERNAME sudo

To access the switch without a password, you need to boot into a single shell/user mode.

Enable Remote Access for the root User

The root user does not have a password and cannot log into a switch using SSH. This default account behavior is consistent with Debian. To connect to a switch using the root account, you can do one of the following:

Generate an SSH Key for the root Account

  1. In a terminal on your host system (not the switch), check to see if a key already exists:

    root@host:~# ls -al ~/.ssh/
    

    The name of the key is similar to id_dsa.pub, id_rsa.pub, or id_ecdsa.pub.

  2. If a key does not exist, generate a new one by first creating the RSA key pair:

    root@host:~# ssh-keygen -t rsa
    
  1. At the prompt, enter a file in which to save the key (/root/.ssh/id_rsa). Press Enter to use the home directory of the root user or provide a different destination.
  1. At the prompt, enter a passphrase (empty for no passphrase). This is optional but it does provide an extra layer of security.

  2. The public key is now located in /root/.ssh/id_rsa.pub. The private key (identification) is now located in /root/.ssh/id_rsa.

  3. Copy the public key to the switch. SSH to the switch as the cumulus user, then run:

    cumulus@switch:~$ sudo mkdir -p /root/.ssh
    cumulus@switch:~$ echo <SSH public key string> | sudo tee -a /root/.ssh/authorized_keys
    

Set the root User Password

  1. Run the following command:

    cumulus@switch:~$ sudo passwd root
    
  2. Change the PermitRootLogin setting in the /etc/ssh/sshd_config file from without-password to yes.

    cumulus@switch:~$ sudo nano /etc/ssh/sshd_config
    ...
    # Authentication:
    LoginGraceTime 120
    PermitRootLogin yes
    StrictModes yes
    ...  
    
  3. Restart the ssh service:

    cumulus@switch:~$ sudo systemctl reload ssh.service
    

Using sudo to Delegate Privileges

By default, Cumulus Linux has two user accounts: root and cumulus. The cumulus account is a normal user and is in the group sudo.

You can add more user accounts as needed. Like the cumulus account, these accounts must use sudo to execute privileged commands.

sudo Basics

sudo allows you to execute a command as superuser or another user as specified by the security policy.

The default security policy is sudoers, which you configure in the /etc/sudoers file. Use /etc/sudoers.d/ to add to the default sudoers policy.

Use visudo only to edit the sudoers file; do not use another editor like vi or emacs.

When creating a new file in /etc/sudoers.d, use visudo -f. This option performs sanity checks before writing the file to avoid errors that prevent sudo from working.

Errors in the sudoers file can result in losing the ability to elevate privileges to root. You can fix this issue only by power cycling the switch and booting into single user mode. Before modifying sudoers, enable the root user by setting a password for the root user.

By default, users in the sudo group can use sudo to execute privileged commands. To add users to the sudo group, use the useradd(8) or usermod(8) command. To see which users belong to the sudo group, see /etc/group (man group(5)).

You can run any command as sudo, including su. You must enter a password.

The example below shows how to use sudo as a non-privileged user cumulus to bring up an interface:

cumulus@switch:~$ ip link show dev swp1
3: swp1: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master br0 state DOWN mode DEFAULT qlen 500
link/ether 44:38:39:00:27:9f brd ff:ff:ff:ff:ff:ff

cumulus@switch:~$ ip link set dev swp1 up
RTNETLINK answers: Operation not permitted

cumulus@switch:~$ sudo ip link set dev swp1 up
Password:

umulus@switch:~$ ip link show dev swp1
3: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br0 state UP mode DEFAULT qlen 500
link/ether 44:38:39:00:27:9f brd ff:ff:ff:ff:ff:ff

sudoers Examples

The following examples show how you grant as few privileges as necessary to a user or group of users to allow them to perform the required task. Each example uses the system group noc; groups include the prefix %.

When an unprivileged user runs a command, the command must include the sudo prefix.

CategoryPrivilegeExample Commandsudoers Entry
MonitoringSwitch port informationethtool -m swp1%noc ALL=(ALL) NOPASSWD:/sbin/ethtool
MonitoringSystem diagnosticscl-support%noc ALL=(ALL) NOPASSWD:/usr/cumulus/bin/cl-support
MonitoringRouting diagnosticscl-resource-query%noc ALL=(ALL) NOPASSWD:/usr/cumulus/bin/cl-resource-query
Image managementInstall imagesonie-select http://lab/install.bin%noc ALL=(ALL) NOPASSWD:/usr/cumulus/bin/onie-select
Package managementAny apt-get commandapt-get update or apt-get install%noc ALL=(ALL) NOPASSWD:/usr/bin/apt-get
Package managementJust apt-get updateapt-get update%noc ALL=(ALL) NOPASSWD:/usr/bin/apt-get update
Package managementInstall packagesapt-get install vim%noc ALL=(ALL) NOPASSWD:/usr/bin/apt-get install *
Package managementUpgradingapt-get upgrade%noc ALL=(ALL) NOPASSWD:/usr/bin/apt-get upgrade
NetfilterInstall ACL policiescl-acltool -i%noc ALL=(ALL) NOPASSWD:/usr/cumulus/bin/cl-acltool
NetfilterList iptables rulesiptables -L%noc ALL=(ALL) NOPASSWD:/sbin/iptables
Layer 1 and 2Any LLDP commandlldpcli show neighbors / configure%noc ALL=(ALL) NOPASSWD:/usr/sbin/lldpcli
Layer 1 and 2Just show neighborslldpcli show neighbors%noc ALL=(ALL) NOPASSWD:/usr/sbin/lldpcli show neighbors*
InterfacesModify any interfaceip link set dev swp1 {updown}
InterfacesUp any interfaceifup swp1%noc ALL=(ALL) NOPASSWD:/sbin/ifup
InterfacesDown any interfaceifdown swp1%noc ALL=(ALL) NOPASSWD:/sbin/ifdown
InterfacesUp/down only swp2ifup swp2 / ifdown swp2%noc ALL=(ALL) NOPASSWD:/sbin/ifup swp2,/sbin/ifdown swp2
InterfacesAny IP address changeip addr {adddel} 192.0.2.1/30 dev swp1
InterfacesOnly set IP addressip addr add 192.0.2.1/30 dev swp1%noc ALL=(ALL) NOPASSWD:/sbin/ip addr add *
Ethernet bridgingAny bridge commandbrctl addbr br0 / brctl delif br0 swp1%noc ALL=(ALL) NOPASSWD:/sbin/brctl
Ethernet bridgingAdd bridges and interfacesbrctl addbr br0 / brctl addif br0 swp1%noc ALL=(ALL) NOPASSWD:/sbin/brctl addbr *,/sbin/brctl addif *
Spanning treeSet STP propertiesmstpctl setmaxage br2 20%noc ALL=(ALL) NOPASSWD:/sbin/mstpctl
TroubleshootingRestart switchdsystemctl restart switchd.service%noc ALL=(ALL) NOPASSWD:/usr/sbin/service switchd *
TroubleshootingRestart any servicesystemctl cron switchd.service%noc ALL=(ALL) NOPASSWD:/usr/sbin/service
TroubleshootingPacket capturetcpdump%noc ALL=(ALL) NOPASSWD:/usr/sbin/tcpdump
Layer 3Add static routesip route add 10.2.0.0/16 via 10.0.0.1%noc ALL=(ALL) NOPASSWD:/bin/ip route add *
Layer 3Delete static routesip route del 10.2.0.0/16 via 10.0.0.1%noc ALL=(ALL) NOPASSWD:/bin/ip route del *
Layer 3Any static route changeip route *%noc ALL=(ALL) NOPASSWD:/bin/ip route *
Layer 3Any iproute commandip *%noc ALL=(ALL) NOPASSWD:/bin/ip
Layer 3Non-modal OSPFcl-ospf area 0.0.0.1 range 10.0.0.0/24%noc ALL=(ALL) NOPASSWD:/usr/bin/cl-ospf

LDAP Authentication and Authorization

Cumulus Linux uses Pluggable Authentication Modules (PAM) and Name Service Switch (NSS) for user authentication. NSS enables PAM to use LDAP to provide user authentication, group mapping, and information for other services on the system.

To configure LDAP authentication on Linux, you can use libnss-ldap, libnss-ldapd, or libnss-sss. This chapter describes libnss-ldapd only. From internal testing, this library worked best with Cumulus Linux and is the easiest to configure, automate, and troubleshoot.

Install libnss-ldapd

The libldap-2.4-2 and libldap-common LDAP packages are already installed on the Cumulus Linux image; however you need to install these additional packages to use LDAP authentication:

To install the additional packages, run the following command:

cumulus@switch:~$ sudo apt-get install libnss-ldapd libpam-ldapd ldap-utils nslcd

You can also install these packages even if the switch does not connect to the internet, as they are in the cumulus-local-apt-archive repository that is embedded in the Cumulus Linux image.

Follow the interactive prompts to specify the LDAP URI, search base distinguished name (DN), and services that must have LDAP lookups enabled. You need to select at least the passwd, group, and shadow services (press space to select a service). When done, select OK. This creates a basic LDAP configuration using anonymous bind and initiates user search under the base DN specified.

After the dialog closes, the install process prints information similar to the following:

/etc/nsswitch.conf: enable LDAP lookups for group
/etc/nsswitch.conf: enable LDAP lookups for passwd
/etc/nsswitch.conf: enable LDAP lookups for shadow

After the installation is complete, the name service caching daemon (nslcd) runs. This service handles all the LDAP protocol interactions and caches information that returns from the LDAP server. nslcd appends ldap to the /etc/nsswitch.conf file, as well as the secondary information source for passwd, group, and shadow. nslcd references the local files (/etc/passwd, /etc/groups and /etc/shadow) first, as specified by the compat source.

passwd: compat ldap
group: compat ldap
shadow: compat ldap

Keep compat as the first source in NSS for passwd, group, and shadow. This prevents you from getting locked out of the system.

Entering incorrect information during the installation process produces configuration errors. You can correct the information after installation by editing certain configuration files.

Be sure to restart netd after editing the files.

cumulus@switch:~$ sudo systemctl restart netd.service
Alternative Installation Method Using debconf-utils

Instead of running the installer and following the interactive prompts, as described above, you can pre-seed the installer parameters using debconf-utils.

  1. Run apt-get install debconf-utils and create the pre-seeded parameters using debconf-set-selections. Provide the appropriate answers.

  2. Run debconf-show <pkg> to check the settings. Here is an example of how to pre-seed answers to the installer questions using debconf-set-selections:

    root# debconf-set-selections <<'zzzEndOfFilezzz'
    
    # LDAP database user. Leave blank will be populated later!
    
    nslcd nslcd/ldap-binddn  string
    
    # LDAP user password. Leave blank!
    nslcd nslcd/ldap-bindpw   password
    
    # LDAP server search base:
    nslcd nslcd/ldap-base string  ou=support,dc=rtp,dc=example,dc=test
    
    # LDAP server URI. Using ldap over ssl.
    nslcd nslcd/ldap-uris string  ldaps://myadserver.rtp.example.test
    
    # New to 0.9. restart cron, exim and others libraries without asking
    nslcd libraries/restart-without-asking: boolean true
    
    # LDAP authentication to use:
    # Choices: none, simple, SASL
    # Using simple because its easy to configure. Security comes by using LDAP over SSL
    # keep /etc/nslcd.conf 'rw' to root for basic security of bindDN password
    nslcd nslcd/ldap-auth-type    select  simple
    
    # Don't set starttls to true
    nslcd nslcd/ldap-starttls     boolean false
    
    # Check server's SSL certificate:
    # Choices: never, allow, try, demand
    nslcd nslcd/ldap-reqcert      select  never
    
    # Choices: Ccreds credential caching - password saving, Unix authentication, LDAP Authentication , Create home directory on first time login, Ccreds credential caching - password checking
    # This is where "mkhomedir" pam config is activated that allows automatic creation of home directory
    libpam-runtime        libpam-runtime/profiles multiselect     ccreds-save, unix, ldap, mkhomedir , ccreds-check
    
    # for internal use; can be preseeded
    man-db        man-db/auto-update      boolean true
    
    # Name services to configure:
    # Choices: aliases, ethers, group, hosts, netgroup, networks, passwd, protocols, rpc, services,  shadow
    libnss-ldapd  libnss-ldapd/nsswitch   multiselect     group, passwd, shadow
    libnss-ldapd  libnss-ldapd/clean_nsswitch     boolean false
    
    ## define platform specific libnss-ldapd debconf questions/answers. 
    ## For demo used amd64.
    libnss-ldapd:amd64    libnss-ldapd/nsswitch   multiselect     group, passwd, shadow
    libnss-ldapd:amd64    libnss-ldapd/clean_nsswitch     boolean false
    # libnss-ldapd:powerpc   libnss-ldapd/nsswitch   multiselect     group, passwd, shadow
    # libnss-ldapd:powerpc    libnss-ldapd/clean_nsswitch     boolean false
    

Update the nslcd.conf File

After installation, update the main configuration file (/etc/nslcd.conf) to accommodate the expected LDAP server settings.

This section documents some of the more important options that relate to security and queries. For details on all the available configuration options, read the nslcd.conf man page.

After first editing the /etc/nslcd.conf file and/or enabling LDAP in the /etc/nsswitch.conf file, you must restart netd with the sudo systemctl restart netd command. If you disable LDAP, you need to restart the netd service.

Connection

The LDAP client starts a session by connecting to the LDAP server on TCP and UDP port 389 or on port 636 for LDAPS. Depending on the configuration, this connection establishes without authentication (anonymous bind); otherwise, the client must provide a bind user and password. The variables you use to define the connection to the LDAP server are the URI and bind credentials.

The URI is mandatory and specifies the LDAP server location using the FQDN or IP address. The URI also designates whether to use ldap:// for clear text transport, or ldaps:// for SSL/TLS encrypted transport. You can also specify an alternate port in the URI. In production environments, use the LDAPS protocol so that all communications are secure.

After the connection to the server is complete, the BIND operation authenticates the session. The BIND credentials are optional; if you do not specify the credentials, the switch assumes an anonymous bind. Configure authenticated (Simple) BIND by specifying the user (binddn) and password (bindpw) in the configuration. Another option is to use SASL (Simple Authentication and Security Layer) BIND, which provides authentication services using other mechanisms, like Kerberos. Contact your LDAP server administrator for this information as it depends on the configuration of the LDAP server and the credentials for the client device.

# The location at which the LDAP server(s) should be reachable.
uri ldaps://ldap.example.com
# The DN to bind with for normal lookups.
binddn cn=CLswitch,ou=infra,dc=example,dc=com
bindpw CuMuLuS

Search Function

When an LDAP client requests information about a resource, it must connect and bind to the server. Then, it performs one or more resource queries depending on the lookup. All search queries to the LDAP server use the configured search base, filter, and the desired entry (uid=myuser). If the LDAP directory is large, this search takes a long time. Define a more specific search base for the common maps (passwd and group).

# The search base that will be used for all queries.
base dc=example,dc=com
# Mapped search bases to speed up common queries.
base passwd ou=people,dc=example,dc=com
base group ou=groups,dc=example,dc=com

Search Filters

To limit the search scope when authenticating users, use search filters to specify criteria when searching for objects within the directory. The default filters applied are:

filter passwd (objectClass=posixAccount)
filter group (objectClass=posixGroup)

Attribute Mapping

The map configuration allows you to override the attributes pushed from LDAP. To override an attribute for a given map, specify the attribute name and the new value. This is useful to ensure that the shell is bash and the home directory is /home/cumulus:

map    passwd homeDirectory "/home/cumulus"
map    passwd shell "/bin/bash"

In LDAP, the map refers to one of the supported maps specified in the manpage for nslcd.conf (such as passwd or group).

Create Home Directory on Login

If you want to use unique home directories, run the sudo pam-auth-update command and select Create home directory on login in the PAM configuration dialog (press the space bar to select the option). Select OK, then press Enter to save the update and close the dialog.

cumulus@switch:~$ sudo pam-auth-update

The home directory for any user that logs in (using LDAP or not) populates with the standard dotfiles from /etc/skel.

When nslcd starts, an error message similar to the following (where 5816 is the nslcd PID) sometimes appears:

nslcd[5816]: unable to dlopen /usr/lib/x86_64-linux-gnu/sasl2/libsasldb.so: libdb-5.3.so: cannot open
shared object file: No such file or directory

You can ignore this message. The libdb package and resulting log messages from nslcd do not cause any issues when you use LDAP as a client for login and authentication.

Example Configuration

Here is an example configuration using Cumulus Linux.

# /etc/nslcd.conf
# nslcd configuration file. See nslcd.conf(5)
# for details.

# The user and group nslcd should run as.
uid nslcd
gid nslcd

# The location at which the LDAP server(s) should be reachable.
uri ldaps://myadserver.rtp.example.test

# The search base that will be used for all queries.
base ou=support,dc=rtp,dc=example,dc=test

# The LDAP protocol version to use.
#ldap_version 3

# The DN to bind with for normal lookups.
# defconf-set-selections doesn't seem to set this. so have to manually set this.
binddn CN=cumulus admin,CN=Users,DC=rtp,DC=example,DC=test
bindpw 1Q2w3e4r!

# The DN used for password modifications by root.
#rootpwmoddn cn=admin,dc=example,dc=com

# SSL options
#ssl off (default)
# Not good does not prevent man in the middle attacks
#tls_reqcert demand(default)
tls_cacertfile /etc/ssl/certs/rtp-example-ca.crt

# The search scope.
#scope sub

# Add nested group support
# Supported in nslcd 0.9 and higher.
# default wheezy install of nslcd supports on 0.8. wheezy-backports has 0.9
nss_nested_groups yes

# Mappings for Active Directory
# (replace the SIDs in the objectSid mappings with the value for your domain)
# "dsquery * -filter (samaccountname=testuser1) -attr ObjectSID" where cn == 'testuser1'
pagesize 1000
referrals off
idle_timelimit 1000

# Do not allow uids lower than 100 to login (aka Administrator)
# not needed as pam already has this support
# nss_min_uid 1000

# This filter says to get all users who are part of the cumuluslnxadm group. Supports nested groups.
# Example, mary is part of the snrnetworkadm group which is part of cumuluslnxadm group
# Ref: http://msdn.microsoft.com/en-us/library/aa746475%28VS.85%29.aspx (LDAP_MATCHING_RULE_IN_CHAIN)
filter passwd (&(Objectclass=user)(!(objectClass=computer))(memberOf:1.2.840.113556.1.4.1941:=cn=cumuluslnxadm,ou=groups,ou=support,dc=rtp,dc=example,dc=test))
map    passwd uid           sAMAccountName
map    passwd uidNumber     objectSid:S-1-5-21-1391733952-3059161487-1245441232
map    passwd gidNumber     objectSid:S-1-5-21-1391733952-3059161487-1245441232
map    passwd homeDirectory "/home/$sAMAccountName"
map    passwd gecos         displayName
map    passwd loginShell    "/bin/bash"

# Filter for any AD group or user in the baseDN. the reason for filtering for the
# user to make sure group listing for user files don't say '<user> <gid>'. instead will say '<user> <user>'
# So for cosmetic reasons..nothing more.
filter group (&(|(objectClass=group)(Objectclass=user))(!(objectClass=computer)))
map    group gidNumber     objectSid:S-1-5-21-1391733952-3059161487-1245441232
map    group cn            sAMAccountName

Configure LDAP Authorization

Linux uses the sudo command to allow non-administrator users (such as the default cumulus user account) to perform privileged operations. To control the users that can use sudo, define a series of rules in the /etc/sudoers file and files in the /etc/sudoers.d/ directory. The rules apply to groups but you can also define specific users. You can add sudo rules using the group names from LDAP. For example, if a group of users are in the group netadmin, you can add a rule to give those users sudo privileges. Refer to the sudoers manual (man sudoers) for a complete usage description. The following shows an example in the /etc/sudoers file:

# The basic structure of a user specification is "who where = (as_whom) what ".
%sudo ALL=(ALL:ALL) ALL
%netadmin ALL=(ALL:ALL) ALL

Active Directory Configuration

Active Directory (AD) is a fully featured LDAP-based NIS server create by Microsoft. It offers unique features that classic OpenLDAP servers do not have. AD can be more complicated to configure on the client and each version works a little differently with Linux-based LDAP clients. Some more advanced configuration examples, from testing LDAP clients on Cumulus Linux with Active Directory (AD/LDAP), are available in the knowledge base.

LDAP Verification Tools

The LDAP client daemon retrieves and caches password and group information from LDAP. To verify the LDAP interaction, use these command-line tools to trigger an LDAP query from the device.

Identify a User with the id Command

The id command performs a username lookup by following the lookup information sources in NSS for the passwd service. This returns the user ID, group ID and the group list retrieved from the information source. In the following example, the user cumulus is locally defined in /etc/passwd, and myuser is on LDAP. The NSS configuration has the passwd map configured with the sources compat ldap:

cumulus@switch:~$ id cumulus
uid=1000(cumulus) gid=1000(cumulus) groups=1000(cumulus),24(cdrom),25(floppy),27(sudo),29(audio),30(dip),44(video),46(plugdev)
cumulus@switch:~$ id myuser
uid=1230(myuser) gid=3000(Development) groups=3000(Development),500(Employees),27(sudo)

getent

The getent command retrieves all records found with NSS for a given map. It can also retrieve a specific entry under that map. You can perform tests with the passwd, group, shadow, or any other map in the /etc/nsswitch.conf file. The output from this command formats according to the map requested. For the passwd service, the structure of the output is the same as the entries in /etc/passwd. The group map outputs the same structure as /etc/group.

In this example, looking up a specific user in the passwd map, the user cumulus is locally defined in /etc/passwd, and myuser is only in LDAP.

cumulus@switch:~$ getent passwd cumulus
cumulus:x:1000:1000::/home/cumulus:/bin/bash
cumulus@switch:~$ getent passwd myuser
myuser:x:1230:3000:My Test User:/home/myuser:/bin/bash

In the next example, looking up a specific group in the group service, the group cumulus is locally defined in /etc/groups, and netadmin is on LDAP.

cumulus@switch:~$ getent group cumulus
cumulus:x:1000:
cumulus@switch:~$ getent group netadmin
netadmin:*:502:larry,moe,curly,shemp

Running the command getent passwd or getent group without a specific request returns all local and LDAP entries for the passwd and group maps.

The ldapsearch command performs LDAP operations directly on the LDAP server. This does not interact with NSS. This command displays the information that the LDAP daemon process receives back from the server. The command has several options. The simplest option uses anonymous bind to the host and specifies the search DN and the attribute to look up.

cumulus@switch:~$ ldapsearch -H ldap://ldap.example.com -b dc=example,dc=com -x uid=myuser
Click to expand the command output
# extended LDIF
#
# LDAPv3
# base <dc=example,dc=com> with scope subtree
# filter: uid=myuser
# requesting: ALL
#
# myuser, people, example.com
dn: uid=myuser,ou=people,dc=example,dc=com
cn: My User
displayName: My User
gecos: myuser
gidNumber: 3000
givenName: My
homeDirectory: /home/myuser
initials: MU
loginShell: /bin/bash
mail: myuser@example.com
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: shadowAccount
objectClass: top
shadowExpire: -1
shadowFlag: 0
shadowMax: 999999
shadowMin: 8
shadowWarning: 7
sn: User
uid: myuser
uidNumber: 1234

# search result
search: 2
result: 0 Success

# numResponses: 2
# numEntries: 1

NCLU

To use NCLU, a user must be in either the netshow or netedit NCLU group in the LDAP database. You can either:

In the following example, a user that is not in the netshow or netedit NCLU group in the LDAP database runs the NCLU net show version command, which produces an error:

hsolo@switch:~$ net show version
ERROR: 'getpwuid(): uid not found: 0922'
See /var/log/netd.log for more details

To add user to the netshow or netedit NCLU group in the LDAP database, either edit the /etc/group file manually or use the sudo adduser USERNAME netshow command, then restart netd. For example, to add the user bill to the netshow group:

cumulus@switch:~$ sudo adduser hsolo netshow
Adding user `hsolo' to group `netshow' ...
Adding user hsolo to group netshow
Done.

cumulus@switch:~$ sudo systemctl restart netd

Now, the user can run the NCLU net show commands:

hsolo@switch:~$ net show version
NCLU_VERSION=1.0-cl4u5
DISTRIB_ID="Cumulus Linux"
DISTRIB_RELEASE=4.1.0
DISTRIB_DESCRIPTION="Cumulus Linux 4.1.0"

LDAP Browsers

The GUI LDAP clients are free tools that show the structure of the LDAP database graphically.

Troubleshooting

nslcd Debug Mode

When setting up LDAP authentication for the first time, turn off the nslcd service using the systemctl stop nslcd.service command (or the systemctl stop nslcd@mgmt.service if you are running the service in a management VRF) and run it in debug mode. Debug mode works whether you are using LDAP over SSL (port 636) or an unencrypted LDAP connection (port 389).

cumulus@switch:~$ sudo systemctl stop nslcd.service
cumulus@switch:~$ sudo nslcd -d

After you enable debug mode, run the following command to test LDAP queries:

cumulus@switch:~$ getent passwd

If you configure LDAP correctly, the following messages appear after you run the getent command:

nslcd: DEBUG: accept() failed (ignored): Resource temporarily unavailable
nslcd: [8e1f29] DEBUG: connection from pid=11766 uid=0 gid=0
nslcd: [8e1f29] <passwd(all)> DEBUG: myldap_search(base="dc=example,dc=com", filter="(objectClass=posixAccount)")
nslcd: [8e1f29] <passwd(all)> DEBUG: ldap_result(): uid=myuser,ou=people,dc=example,dc=com
nslcd: [8e1f29] <passwd(all)> DEBUG: ldap_result(): ... 152 more results
nslcd: [8e1f29] <passwd(all)> DEBUG: ldap_result(): end of results (162 total)

In the example output above, <passwd(all)> shows a query of the entire directory structure.

You can query a specific user with the following command:

cumulus@switch:~$ getent passwd myuser

You can replace myuser with any username on the switch. The following debug output indicates that user myuser exists:

nslcd: DEBUG: add_uri(ldap://10.50.21.101)
nslcd: version 0.8.10 starting
nslcd: DEBUG: unlink() of /var/run/nslcd/socket failed (ignored): No such file or directory
nslcd: DEBUG: setgroups(0,NULL) done
nslcd: DEBUG: setgid(110) done
nslcd: DEBUG: setuid(107) done
nslcd: accepting connections
nslcd: DEBUG: accept() failed (ignored): Resource temporarily unavailable
nslcd: [8b4567] DEBUG: connection from pid=11369 uid=0 gid=0
nslcd: [8b4567] <passwd="myuser"> DEBUG: myldap_search(base="dc=cumulusnetworks,dc=com", filter="(&(objectClass=posixAccount)(uid=myuser))")
nslcd: [8b4567] <passwd="myuser"> DEBUG: ldap_initialize(ldap://<ip_address>)
nslcd: [8b4567] <passwd="myuser"> DEBUG: ldap_set_rebind_proc()
nslcd: [8b4567] <passwd="myuser"> DEBUG: ldap_set_option(LDAP_OPT_PROTOCOL_VERSION,3)
nslcd: [8b4567] <passwd="myuser"> DEBUG: ldap_set_option(LDAP_OPT_DEREF,0)
nslcd: [8b4567] <passwd="myuser"> DEBUG: ldap_set_option(LDAP_OPT_TIMELIMIT,0)
nslcd: [8b4567] <passwd="myuser"> DEBUG: ldap_set_option(LDAP_OPT_TIMEOUT,0)
nslcd: [8b4567] <passwd="myuser"> DEBUG: ldap_set_option(LDAP_OPT_NETWORK_TIMEOUT,0)
nslcd: [8b4567] <passwd="myuser"> DEBUG: ldap_set_option(LDAP_OPT_REFERRALS,LDAP_OPT_ON)
nslcd: [8b4567] <passwd="myuser"> DEBUG: ldap_set_option(LDAP_OPT_RESTART,LDAP_OPT_ON)
nslcd: [8b4567] <passwd="myuser"> DEBUG: ldap_simple_bind_s(NULL,NULL) (uri="ldap://<ip_address>")
nslcd: [8b4567] <passwd="myuser"> DEBUG: ldap_result(): end of results (0 total)

Common Problems

SSL/TLS

NSCD

If you are running the nslcd service in a management VRF, you need to run the systemctl restart nslcd@mgmt.service command instead of the systemctl restart nslcd.service command. For example:

cumulus@switch:~$ sudo nscd -K
cumulus@switch:~$ sudo systemctl restart nslcd@mgmt.service

LDAP

TACACS

Cumulus Linux implements TACACS+ client AAA (Accounting, Authentication, and Authorization) in a transparent way with minimal configuration. The client implements the TACACS+ protocol as described in this IETF document. There is no need to create accounts or directories on the switch. Accounting records go to all configured TACACS+ servers by default. Using per-command authorization requires additional setup on the switch.

Supported Features

Install the TACACS+ Client Packages

You can install the TACACS+ packages even if the switch is not connected to the internet; the packages are in the cumulus-local-apt-archive repository in the Cumulus Linux image.

To install all required packages, run these commands:

cumulus@switch:~$ sudo -E apt-get update
cumulus@switch:~$ sudo -E apt-get install tacplus-client

Configure the TACACS+ Client

After installing TACACS+, edit the /etc/tacplus_servers file to add at least one server and one shared secret (key). You can specify the server and secret parameters in any order anywhere in the file. Whitespace (spaces or tabs) are not allowed. For example, if your TACACS+ server IP address is 192.168.0.30 and your shared secret is tacacskey, add these parameters to the /etc/tacplus_servers file:

secret=tacacskey
server=192.168.0.30

Cumulus Linux supports a maximum of seven TACACS+ servers. To specify multiple servers, add one per line to the /etc/tacplus_servers file.

Connections establish in the order that this file lists. In most cases, you do not need to change any other parameters. You can add parameters the packages use to this file, which affects all the TACACS+ client software. For example, the timeout value for NSS lookups is 5 seconds by default in the /etc/tacplus_nss.conf file, whereas the timeout value for other packages is 10 seconds in the /etc/tacplus_servers file. The timeout value applies to each connection to the TACACS+ servers. (If you configure authorization per command, the timeout occurs for each command.) There are several connections to the server per login attempt from PAM, as well as two or more through NSS. Therefore, with the default timeout values, a TACACS+ server that is not reachable can delay logins by a minute or more for each unreachable server. If you must list unreachable TACACS+ servers, place them at the end of the server list and consider reducing the timeout values.

When you add or remove TACACS+ servers, you must restart auditd (with the systemctl restart auditd command) or you must send a signal (with killall -HUP audisp-tacplus) before audisp-tacplus rereads the configuration to see the changed server list.

You can also configure the IP address used as the source IP address when communicating with the TACACS+ server. See TACACS Configuration Parameters below for the full list of TACACS+ parameters.

Following is the complete list of the TACACS+ client configuration files, and their use.

Filename
Description
/etc/tacplus_serversThe primary file that requires configuration after installation. All packages with include=/etc/tacplus_servers parameters use this file. Typically, this file contains the shared secrets; make sure that the Linux file mode is 600.
/etc/nsswitch.confWhen the libnss_tacplus package installs, this file configures tacplus lookups through libnss_tacplus. If you replace this file by automation, you need to add tacplus as the first lookup method for the passwd database line.
/etc/tacplus_nss.confSets the basic parameters for libnss_tacplus. The file includes a debug variable for debugging NSS lookups separately from other client packages.
/usr/share/pam-configs/tacplusThe configuration file for pam-auth-update to generate the files in the next row. The file uses these configurations at login, by su, and by ssh.
/etc/pam.d/common-*The /etc/pam.d/common-* files update for tacplus authentication. The files update with pam-auth-update, when you install or remove libpam-tacplus.
/etc/sudoers.d/tacplusAllows TACACS+ privilege level 15 users to run commands with sudo. The file includes an example (commented out) of how to enable privilege level 15 TACACS users to use sudo without a password and provides an example of how to enable all TACACS users to run specific commands with sudo. Only edit this file with the visudo -f /etc/sudoers.d/tacplus command.
/etc/audisp/plugins.d/audisp-tacplus.confThe audisp plugin configuration file. You do not need to modify this file.
/etc/audisp/audisp-tac_plus.confThe TACACS+ server configuration file for accounting. You do not need to modify this file. You can use this configuration file when you only want to debug TACACS+ accounting issues, not all TACACS+ users.
/etc/audit/rules.d/audisp-tacplus.rulesThe auditd rules for TACACS+ accounting. The augenrules command uses all rule files to generate the rules file (described below).
/etc/audit/audit.rulesThe audit rules file that generate when you install auditd.

You can edit the /etc/pam.d/common-* files manually. However, if you run pam-auth-update again after making the changes, the update fails. Only configure /usr/share/pam-configs/tacplus, then run pam-auth-update.

TACACS+ Authentication (login)

PAM modules and an updated version of the libpam-tacplus package configure authentication initially. When you install the package, the pam-auth-update command updates the PAM configuration in /etc/pam.d. If you make changes to your PAM configuration, you need to integrate these changes. If you also use LDAP with the libpam-ldap package, you need to edit the PAM configuration with the LDAP and TACACS ordering you prefer. The libpam-tacplus package ignore rules and the values in success=2 require adjustments to ignore LDAP rules.

The TACACS+ privilege attribute priv_lvl determines the privilege level for the user that the TACACS+ server returns during the user authorization exchange. The client accepts the attribute in either the mandatory or optional forms and also accepts priv-lvl as the attribute name. The attribute value must be a numeric string in the range 0 to 15, with 15 the most privileged level.

By default, TACACS+ users at privilege levels other than 15 cannot run sudo commands and can only run commands with standard Linux user permissions.

TACACS+ Client Sequencing

Due to SSH and login processing mechanisms, Cumulus Linux needs to know the following at the beginning of the AAA sequence:

For non-local users (users not in the local password file) you need to send a TACACS+ authorization request as the first communication with the TACACS+ server, before authentication and before the user logging in requests a password.

You need to configure certain TACACS+ servers to allow authorization requests before authentication. Contact your TACACS+ server vendor for information.

Local Fallback Authentication

If a site wants to allow local fallback authentication for a user when none of the TACACS servers are reachable, you can add a privileged user account as a local account on the switch.

To configure local fallback authentication:

  1. Edit the /etc/nsswitch.conf file to remove the keyword tacplus from the line starting with passwd. (You need to add the keyword back in step 3.)

    The following example shows the /etc/nsswitch.conf file with no tacplus keyword in the line starting with passwd.

    cumulus@switch:~$ sudo vi /etc/nsswitch.conf
    #
    # Example configuration of GNU Name Service Switch functionality.
    # If you have the `glibc-doc-reference' and `info' packages installed, try:
    # `info libc "Name Service Switch"' for information about this file.
    
    passwd:         files
    group:          tacplus files
    shadow:         files
    gshadow:        files
    ...
    
  2. To enable the local privileged user to run sudo and NCLU commands, run the adduser commands shown below. In the example commands, the TACACS account name is tacadmin.

    The first adduser command prompts for information and a password. You can skip most of the requested information by pressing ENTER.

    cumulus@switch:~$ sudo adduser --ingroup tacacs tacadmin
    cumulus@switch:~$ sudo adduser tacadmin netedit
    cumulus@switch:~$ sudo adduser tacadmin sudo
    
  3. Edit the /etc/nsswitch.conf file to add the keyword tacplus back to the line starting with passwd (the keyword you removed in the first step).

  4. Restart the netd service with the following command:

    cumulus@switch:~$ sudo systemctl restart netd
    

TACACS+ Accounting

TACACS+ accounting uses the audisp module, with an additional plugin for auditd/audisp. The plugin maps the auid in the accounting record to a TACACS login, which it bases on the auid and sessionid. The audisp module requires libnss_tacplus and uses the libtacplus_map.so library interfaces as part of the modified l1ipam_tacplus` package.

Communication with the TACACS+ servers occurs with the libsimple-tacact1 library, through dlopen(). A maximum of 240 bytes of command name and arguments send in the accounting record, due to the TACACS+ field length limitation of 255 bytes.

All Linux commands result in an accounting record, including login commands and sub-processes of other commands. This can generate a lot of accounting records.

Configure the IP address and encryption key of the server in the /etc/tacplus_servers file. Minimal configuration to auditd and audisp is necessary to enable the audit records needed for accounting. These records install as part of the package.

audisp-tacplus installs the audit rules for command accounting. When you configure a management VRF, you must add the vrf parameter and signal the audisp-tacplus process to reread the configuration. The example below shows that the management VRF name is mgmt. You can add the vrf parameter to either the /etc/tacplus_servers file or the /etc/audisp/audisp-tac_plus.conf file.

vrf=mgmt

After editing the configuration file, send the HUP signal killall -HUP audisp-tacplus to notify the accounting process to reread the file.

All sudo commands run by TACACS+ users generate accounting records against the original TACACS+ login name.

For more information, refer to the audisp.8 and auditd.8 man pages.

Configure NCLU for TACACS+ Users

When you install or upgrade TACACS+ packages, the installation and update process maps user accounts automatically and adds all tacacs0 through tacacs15 users to the netshow group.

If you want any TACACS+ users to execute net add, net del, and net commit commands and restart services with NCLU, you need to add those users to the users_with_edit variable in the /etc/netd.conf file. Add the tacacs15 user and, depending upon your policies, other users (tacacs1 through tacacs14) to this variable.

To allow a TACACS+ user access to the show commands, add the tacacs group to the groups_with_show variable.

Do not add the tacacs group to the groups_with_edit variable; any user can log into the switch as the root user.

To add the users, edit the /etc/netd.conf file:

cumulus@switch:~$ sudo nano /etc/netd.conf
...
# Control which users/groups are allowed to run "add", "del",
# "clear", "abort", and "commit" commands.
users_with_edit = root, cumulus, tacacs15
groups_with_edit = netedit

# Control which users/groups are allowed to run "show" commands
users_with_show = root, cumulus
groups_with_show = netshow, netedit, tacacs
...

After you save and exit the netd.conf file, restart the netd service. Run:

cumulus@switch:~$ sudo systemctl restart netd

TACACS+ Per-command Authorization

The tacplus-auth command handles authorization for each command. To make this an enforced authorization, change the TACACS+ login to use a restricted shell, with a very limited executable search path. Otherwise, the user can bypass the authorization. The tacplus-restrict utility simplifies setting up the restricted environment. The example below initializes the environment for the tacacs0 user account. This is the account for TACACS+ users at privilege level 0.

tacuser0@switch:~$ sudo tacplus-restrict -i -u tacacs0 -a command1 command2 ... commandN

If the TACACS+ server does not allow the user or command combination, you see a message similar to the following:

tacuser0@switch:~$ net show version
net not authorized by TACACS+ with given arguments, not executing

The following table provides the command options:

OptionDescription
-iInitializes the environment. You only need to issue this option one time per username.
-aYou can invoke the utility with the -a option as often as you like. For each command in the -a list, the utility creates a symbolic link from tacplus-auth to the relative portion of the command name in the local bin subdirectory. You also need to enable these commands on the TACACS+ server (refer to the TACACS+ server documentation). It is common for the server to allow some options to a command, but not others.
-fRe-initializes the environment. If you need to restart, issue the -f option with -i to force re-initialization; otherwise, the utility ignores repeated use of -i.
As part of the initialization:
- The user shell changes to /bin/rbash.
- The utility saves any existing dot files.

For example, if you want to allow the user to be able to run the net and ip commands (if authorized by the TACACS+ server):

cumulus@switch:~$ sudo tacplus-restrict -i -u tacacs0 -a ip net

After running this command, examine the tacacs0 directory::

cumulus@switch:~$ sudo ls -lR ~tacacs0
total 12
lrwxrwxrwx 1 root root 22 Nov 21 22:07 ip -> /usr/sbin/tacplus-auth
lrwxrwxrwx 1 root root 22 Nov 21 22:07 net -> /usr/sbin/tacplus-auth

Other than shell built-ins, privilege level 0 TACACS users can only run the ip and net commands.

If add commands with the -a option by mistake, you can remove them. The example below removes the net command:

cumulus@switch:~$ sudo rm ~tacacs0/bin/net

You can remove all commands:

cumulus@switch:~$ sudo rm ~tacacs0/bin/*

For more information on tacplus-auth and tacplus-restrict, run the man command.

cumulus@switch:~$ man tacplus-auth tacplus-restrict

NSS Plugin

With pam_tacplus, TACACS+ authenticated users can log in without a local account on the system using the NSS plugin that comes with the tacplus_nss package. The plugin uses the mapped tacplus information if the user is not in the local password file, provides the getpwnam() and getpwuid()entry points, and uses the TACACS+ authentication functions.

The plugin asks the TACACS+ server if it knows the user, and then for relevant attributes to determine the privilege level of the user. When you install the libnss_tacplus package, nsswitch.conf changes to set tacplus as the first lookup method for passwd. If you change the order, lookups return the local accounts, such as tacacs0

If TACACS+ server does not find the user, it uses the libtacplus.so exported functions to do a mapped lookup. The privilege level appends to tacacs and the lookup searches for the name in the local password file. For example, privilege level 15 searches for the tacacs15 user. If the TACACS+ server finds the user, it adds information for the user in the password structure.

If the TACACS+ server does not find the user, it decrements the privilege level and checks again until it reaches privilege level 0 (user tacacs0). This allows you to use only the two local users tacacs0 and tacacs15, for minimal configuration.

TACACS Configuration Parameters

The recognized configuration options are the same as the libpam_tacplus command line arguments; however, Cumulus Linux does not support all pam_tacplus options. For a description of the configuration parameters, refer to the tacplus_servers.5 man page, which is part of the libpam-tacplus package.

The table below describes the configuration options available:

Configuration OptionDescription
debugThe output debugging information through syslog(3).
Note: Debugging is heavy, including passwords. Do not leave debugging enabled on a production switch after you have completed troubleshooting.
secret=STRINGThe secret key to encrypt and decrypt packets sent to and received from the server.
You can specify the secret key more than one time in any order.
Only use this parameter in files such as /etc/tacplus_servers that are not world readable.
server=hostname
server=ip-address
Adds a TACACS+ server to the servers list. Cumulus Linux queries servers in turn until it finds a match or no servers remain in the list. You can provide an IP address with a port number, preceded by a colon (:). The default port is 49.
Note: When sending accounting records, the record sends to all servers in the list if acct_all=1, which is the default.
source_ip=ipv4-addressSets the IP address as the source IP address when communicating with the TACACS+ server. You must specify an IPv4 address. You cannot use IPv6 addresses and hostnames. The address must be valid for the interface you use.
timeout=secondsTACACS+ servers communication timeout.
This parameter defaults to 10 seconds in the /etc/tacplus_servers file, but defaults to 5 seconds in the /etc/tacplus_nss.conf file.
include=/file/nameA supplemental configuration file to avoid duplicating configuration information. You can include up to 8 more configuration files.
min_uid=valueThe minimum user ID that the NSS plugin looks up. 0 specifies that the plugin never looks up uid 0 (root). Do not specify a value greater than the local TACACS+ user IDs (0 through 15).
exclude_users=user1,user2,A comma-separated list of usernames in the tacplus_nss.conf file that the NSS plugin never looks up. You cannot use * (asterisk) as a wild card in the list. While it is not a legal username, bash can look up this asterisk as a username during pathname completion, so it is in this list as a username string.
Note: Do not remove the cumulus user from the exclude_users list; doing so can make it impossible to log in as the cumulus user, which is the primary administrative account in Cumulus Linux. If you do remove the cumulus user, add some other local fallback user that does not rely on TACACS but is a member of sudo and netedit groups, so that these accounts can run sudo and NCLU commands.
login=stringTACACS+ authentication service (pap, chap, or login).
The default value is pap.
user_homedir=1This option is off by default. When you enable this option, Cumulus Linux creates a separate home directory for each TACACS+ user when the TACACS+ user first logs in. By default, the switch uses the home directory in the mapping accounts in /etc/passwd (/home/tacacs0 through /home/tacacs15). If the home directory does not exist, the mkhomedir_helper program creates it, in the same way as pam_mkhomedir.
This option does not apply for accounts with restricted shells when you enable per-command authorization.
acct_all=1Configuration option for audisp_tacplus and pam_tacplus sending accounting records to all supplied servers (1), or the first server to respond (0).
The default value is 1.
timeout=secondsSets the timeout in seconds for connections to each TACACS+ server.
The default is 10 seconds for all lookups. NSS lookups use a 5 second timeout.
vrf=vrf-nameIf the management network is in a VRF, set this variable to the VRF name. This is typically mgmt. When you set this variable, the connection to the TACACS+ accounting servers establishes through the named VRF.
serviceTACACS+ accounting and authorization service. Examples include shell, pap, raccess, ppp, and slip.
The default value is shell.
protocolTACACS+ protocol field. This option is user dependent. PAM uses the SSH protocol.

Remove the TACACS+ Client Packages

To remove all the TACACS+ client packages, use the following commands:

cumulus@switch:~$ sudo -E apt-get remove tacplus-client
cumulus@switch:~$ sudo -E apt-get autoremove

To remove the TACACS+ client configuration files as well as the packages (recommended), use this command:

cumulus@switch:~$ sudo -E apt-get autoremove --purge

Troubleshooting

Basic Server Connectivity or NSS Issues

You can use the getent command to determine if you configured TACACS+ correctly and if the local password is in the configuration files. In the example commands below, the cumulus user represents the local user, while cumulusTAC represents the TACACS user.

To look up the username within all NSS methods:

cumulus@switch:~$ sudo getent passwd cumulusTAC
cumulusTAC:x:1016:1001:TACACS+ mapped user at privilege level 15,,,:/home/tacacs15:/bin/bash

To look up the user within the local database only:

cumulus@switch:~$ sudo getent -s compat passwd cumulus
cumulus:x:1000:1000:cumulus,,,:/home/cumulus:/bin/bash

To look up the user within the TACACS+ database only:

cumulus@switch:~$ sudo getent -s tacplus passwd cumulusTAC
cumulusTAC:x:1016:1001:TACACS+ mapped user at privilege level 15,,,:/home/tacacs15:/bin/bash

If TACACS is not working correctly, debug the following configuration files by adding the debug=1 parameter to one or more of these files:

You can also add debug=1 to individual pam_tacplus lines in /etc/pam.d/common*.

All log messages are in /var/log/syslog.

Incorrect Shared Key

The TACACS client on the switch and the TACACS server must have the same shared secret key. If this key is incorrect, the following message prints to syslog:

2017-09-05T19:57:00.356520+00:00 leaf01 sshd[3176]: nss_tacplus: TACACS+ server 192.168.0.254:49 read failed with protocol error (incorrect shared secret?) user cumulus

Issues with Per-command Authorization

To debug TACACS user command authorization, have the TACACS+ user enter the following command at a shell prompt, then try the command again:

tacuser0@switch:~$ export TACACSAUTHDEBUG=1

When you enable debugging, the command authorization conversation with the TACACS+ server shows additional information:

tacuser0@switch:~$ net pending
tacplus-auth: found matching command (/usr/bin/net) request authorization
tacplus-auth: error connecting to 10.0.3.195:49 to request authorization for net: Transport endpoint is not connected
tacplus-auth: cmd not authorized (16)
tacplus-auth: net not authorized from 192.168.3.189:49
net not authorized by TACACS+ with given arguments, not executing
tacuser0@switch:~$ net show version
tacplus-auth: found matching command (/usr/bin/net) request authorization
tacplus-auth: error connecting to 10.0.3.195:49 to request authorization for net: Transport endpoint is not connected
tacplus-auth: 192.168.3.189:49 authorized command net
tacplus-auth: net authorized, executing
DISTRIB_ID="Cumulus Linux"
DISTRIB_RELEASE=4.1.0
DISTRIB_DESCRIPTION="Cumulus Linux 4.1.0"

To disable debugging:

tacuser0@switch:~$ export -n TACACSAUTHDEBUG

Debug Issues with Accounting Records

If you add or delete TACACS+ servers from the configuration files, make sure you notify the audisp plugin with this command:

cumulus@switch:~$ sudo killall -HUP audisp-tacplus

If accounting records do not send, add debug=1 to the /etc/audisp/audisp-tac_plus.conf file, then run the command above to notify the plugin. Ask the TACACS+ user to run a command and examine the end of /var/log/syslog for messages from the plugin. You can also check the auditing log file /var/log/audit audit.log to be sure the auditing records exist. If the auditing records do not exist, restart the audit daemon with:

cumulus@switch:~$ sudo systemctl restart auditd.service

TACACS Component Software Descriptions

Cumulus Linux uses the following packages for TACACS.

Package NameDescription
audisp-tacplus_1.0.0-1-cl3u3This package uses auditing data from auditd to send accounting records to the TACACS+ server and starts as part of auditd.
libtac2_1.4.0-cl3u2Basic TACACS+ server utility and communications routines.
libnss-tacplus_1.0.1-cl3u3Provides an interface between libc username lookups, the mapping functions, and the TACACS+ server.
tacplus-auth-1.0.0-cl3u1Includes the tacplus-restrict setup utility, which enables you to perform per-command TACACS+ authorization. Per-command authorization is not the default.
libpam-tacplus_1.4.0-1-cl3u2A modified version of the standard Debian package.
libtacplus-map1_1.0.0-cl3u2The mapping functionality between local and TACACS+ users on the server. Sets the immutable sessionid and auditing UID to ensure that you can track the original user through multiple processes and privilege changes. Sets the auditing loginuid as immutable. Creates and maintains a status database in /run/tacacs_client_map to manage and lookup mappings.
libsimple-tacacct1_1.0.0-cl3u2Provides an interface for programs to send accounting records to the TACACS+ server. audisp-tacplus uses this package.
libtac2-bin_1.4.0-cl3u2Provides the tacc testing program and TACACS+ man page.

Considerations

TACACS+ Client Is only Supported through the Management Interface

The TACACS+ client is only supported through the management interface on the switch: eth0, eth1, or the VRF management interface. The TACACS+ client is not supported through bonds, switch virtual interfaces (SVIs), or switch port interfaces (swp).

Multiple TACACS+ Users

If two or more TACACS+ users log in simultaneously with the same privilege level, while the accounting records are correct, a lookup on either name matches both users, while a UID lookup only returns the user that logs in first.

This means that any processes that either user runs apply to both, and all files either user creates apply to the first name matched. This is similar to adding two local users to the password file with the same UID and GID, and is an inherent limitation of using the UID for the base user from the password file.

The current algorithm returns the first name matching the UID from the mapping file; either the first or the second user that logs in.

To work around this issue, you can use the switch audit log or the TACACS server accounting logs to determine which processes and files each user creates.

The Linux auditd system does not always generate audit events for processes when terminated with a signal (with the kill system call or internal errors such as SIGSEGV). As a result, processes that exit on a signal that you do not handle, generate a STOP accounting record.

Issues with deluser Command

TACACS+ and other non-local users that run the deluser command with the --remove-home option see the following error:

tacuser0@switch: deluser --remove-home USERNAME
userdel: cannot remove entry 'USERNAME' from /etc/passwd
/usr/sbin/deluser: `/usr/sbin/userdel USERNAME' returned error code 1. Exiting

The command does remove the home directory. The user can still log in on that account but does not have a valid home directory. This is a known upstream issue with the deluser command for all non-local users.

Only use the --remove-home option with the user_homedir=1 configuration command.

Both TACACS+ and RADIUS AAA Clients

When you install both the TACACS+ and the RADIUS AAA client, Cumulus Linux does not attempt RADIUS login. As a workaround, do not install both the TACACS+ and the RADIUS AAA client on the same switch.

RADIUS AAA

Various add-on packages enable RADIUS users to log in to Cumulus Linux switches in a transparent way with minimal configuration. There is no need to create accounts or directories on the switch. Authentication uses PAM and includes login, ssh, sudo and su.

Install the RADIUS Packages

You can install the RADIUS packages even if the switch is not connected to the internet, as they are in the cumulus-local-apt-archive repository, which is embedded in the Cumulus Linux image.

To install the RADIUS packages:

cumulus@switch:~$ sudo apt-get update
cumulus@switch:~$ sudo apt-get install libnss-mapuser libpam-radius-auth

After installation is complete, either reboot the switch or run the sudo systemctl restart netd command.

The libpam-radius-auth package supplied with the Cumulus Linux RADIUS client is a newer version than the one in Debian Buster. This package contains support for IPv6, the src_ip option described below, as well as bug fixes and minor features. The package also includes VRF support, provides man pages describing the PAM and RADIUS configuration, and sets the SUDO_PROMPT environment variable to the login name for RADIUS mapping support.

The libnss-mapuser package is specific to Cumulus Linux and supports the getgrent, getgrnam and getgrgid library interfaces. These interfaces add logged in RADIUS users to the group member list for groups that contain the mapped_user (radius_user) if the RADIUS account does not have privileges, and add privileged RADIUS users to the group member list for groups that contain the mapped_priv_user (radius_priv_user) during the group lookups.

During package installation:

Configure the RADIUS Client

To configure the RADIUS client, edit the /etc/pam_radius_auth.conf file:

  1. Add the hostname or IP address of at least one RADIUS server (such as a freeradius server on Linux), and the shared secret used to authenticate and encrypt communication with each server.

    You must be able to resolve the hostname of the switch to an IP address. If for some reason you cannot find the hostname in DNS, you can add the hostname to the /etc/hosts file manually. However, this can cause problems because DHCP assigns the IP address, which can change at any time.

    Multiple server configuration lines are verified in the order listed. Other than memory, there is no limit to the number of RADIUS servers you can use.

    The server port number or name is optional. The system looks up the port in the /etc/services file. However, you can override the ports in the /etc/pam_radius_auth.conf file.

  2. If the server is slow or latencies are high, change the timeout setting. The setting defaults to 3 seconds.

  3. If you want to use a specific interface to reach the RADIUS server, specify the src_ip option. You can specify the hostname of the interface, an IPv4, or an IPv6 address. If you specify the src_ip option, you must also specify the timeout option.

  4. Set the vrf-name field. This is typically set to mgmt if you are using a management VRF. You cannot specify more than one VRF.

The configuration file includes the mapped_priv_user field that sets the account used for privileged RADIUS users and the priv-lvl field that sets the minimum value for the privilege level to be a privileged login (the default value is 15). If you edit these fields, make sure the values match those set in the /etc/nss_mapuser.conf file.

The following example provides a sample /etc/pam_radius_auth.conf file configuration:

mapped_priv_user   radius_priv_user
# server[:port]    shared_secret   timeout (secs)  src_ip
192.168.0.254      secretkey
other-server       othersecret     3               192.168.1.10
# when mgmt vrf is in use
vrf-name mgmt

If this is the first time you are configuring the RADIUS client, uncomment the debug line for troubleshooting. The debugging messages write to /var/log/syslog. When the RADIUS client is working correctly, comment out the debug line.

As an optional step, you can set PAM configuration keywords by editing the /usr/share/pam-configs/radius file. After you edit the file, you must run the pam-auth-update --package command. The pam_radius_auth (8) man page describes the PAM configuration keywords.

The value of the VSA (Vendor Specific Attribute) shell:priv-lvl determines the privilege level for the user on the switch. If the attribute does not return, the user does not have privileges. The following shows an example using the freeradius server for a fully privileged user.

Service-Type = Administrative-User,
Cisco-AVPair = "shell:roles=network-administrator",
Cisco-AVPair += "shell:priv-lvl=15"

The VSA vendor name (Cisco-AVPair in the example above) can have any content. The RADIUS client only checks for the string shell:priv-lvl.

Enable Login without Local Accounts

LDAP is not commonly used with switches and adding accounts locally is cumbersome, Cumulus Linux includes a mapping capability with the libnss-mapuser package.

Mapping uses two NSS (Name Service Switch) plugins, one for account name, and one for UID lookup. The installation process configures these accounts automatically in the /etc/nsswitch.conf file and removes them when you delete the package. See the nss_mapuser (8) man page for the full description of this plugin.

A username is mapped at login to a fixed account specified in the configuration file, with the fields of the fixed account used as a template for the user that is logging in.

For example, if you look up the name dave and the fixed account in the configuration file is radius\_user, and that entry in /etc/passwd is:

radius_user:x:1017:1002:radius user:/home/radius_user:/bin/bash

then the matching line that returns when you run getent passwd dave is:

cumulus@switch:~$ getent passwd dave
dave:x:1017:1002:dave mapped user:/home/dave:/bin/bash

The login process creates the home directory /home/dave if it does not already exist and populates it with the standard skeleton files by the mkhomedir_helper command.

The configuration file /etc/nss_mapuser.conf configures the plugins. The file includes the mapped account name, which is radius_user by default. You can change the mapped account name by editing the file. The nss_mapuser (5) man page describes the configuration file.

A flat file mapping derives from the session number assigned during login, which persists across su and sudo. Cumulus Linux removes the mapping at logout.

Local Fallback Authentication

If a site wants to allow local fallback authentication for a user when none of the RADIUS servers are reachable, you can add a privileged user account as a local account on the switch. The local account must have the same unique identifier as the privileged user and the shell must be the same.

To configure local fallback authentication:

  1. Add a local privileged user account. For example, if the radius_priv_user account in the /etc/passwd file is radius_priv_user:x:1002:1001::/home/radius_priv_user:/sbin/radius_shell, run the following command to add a local privileged user account named johnadmin:

    cumulus@switch:~$ sudo useradd -u 1002 -g 1001 -o -s /sbin/radius_shell johnadmin
    
  2. To enable the local privileged user to run sudo and NCLU commands, run the following commands:

    cumulus@switch:~$ sudo adduser johnadmin netedit
    cumulus@switch:~$ sudo adduser johnadmin sudo
    cumulus@switch:~$ sudo systemctl restart netd
    
  3. Edit the /etc/passwd file to move the local user line before to the radius_priv_user line:

    cumulus@switch:~$ sudo vi /etc/passwd
    ...
    johnadmin:x:1002:1001::/home/johnadmin:/sbin/radius_shell
    radius_priv_user:x:1002:1001::/home/radius_priv_user:/sbin/radius_shell
    
  4. To set the local password for the local user, run the following command:

    cumulus@switch:~$ sudo passwd johnadmin
    

Verify RADIUS Client Configuration

To verify that you configured the RADIUS client correctly, log in as a non-privileged user and run a net add interface command.

In this example, the ops user is not a privileged RADIUS user so the ops user cannot add an interface.

ops@leaf01:~$ net add interface swp1
ERROR: User ops does not have permission to make networking changes.

In this example, the admin user is a privileged RADIUS user (with privilege level 15) so is able to add interface swp1.

admin@leaf01:~$ net add interface swp1
admin@leaf01:~$ net pending
--- /etc/network/interfaces    2018-04-06 14:49:33.099331830 +0000
+++ /var/run/nclu/iface/interfaces.tmp    2018-04-06 16:01:16.057639999 +0000
@@ -3,10 +3,13 @@

  source /etc/network/interfaces.d/*.intf

  # The loopback network interface
  auto lo
  iface lo inet loopback

  # The primary network interface
  auto eth0
  iface eth0 inet dhcp
+
+auto swp1
iface swp1
...

Remove RADIUS Client Packages

Remove the RADIUS packages with the following command:

cumulus@switch:~$ sudo apt-get remove libnss-mapuser libpam-radius-auth

When you remove the packages, Cumulus Linux deletes the plugins from the /etc/nsswitch.conf file and from the PAM files.

To remove all configuration files for these packages, run:

cumulus@switch:~$ sudo apt-get purge libnss-mapuser libpam-radius-auth

The RADIUS fixed account is not removed from the /etc/passwd or /etc/group file and the home directories are not removed. They remain in case there are modifications to the account or files in the home directories.

To remove the home directories of the RADIUS users, first get the list by running:

cumulus@switch:~$ sudo ls -l /home | grep radius

For all users listed, except the radius_user, run this command to remove the home directories:

cumulus@switch:~$ sudo deluser --remove-home USERNAME

where USERNAME is the account name (the home directory relative portion). This command gives the following warning because the user is not listed in the /etc/passwd file.

userdel: cannot remove entry 'USERNAME' from /etc/passwd
/usr/sbin/deluser: `/usr/sbin/userdel USERNAME' returned error code 1. Exiting.

After you remove all the RADIUS users, run the command to remove the fixed account. If there are changes to the account in the /etc/nss_mapuser.conf file, use that account name instead of radius_user.

cumulus@switch:~$ sudo deluser --remove-home radius_user
cumulus@switch:~$ sudo deluser --remove-home radius_priv_user
cumulus@switch:~$ sudo delgroup radius_users

Considerations

Netfilter - ACLs

Netfilter is the packet filtering framework in Cumulus Linux and most other Linux distributions. There several tools available for configuring ACLs in Cumulus Linux:

Traffic Rules

Chains

Netfilter describes the way that the Linux kernel classifies and controls packets to, from, and across the switch. Netfilter does not require a separate software daemon to run; it is part of the Linux kernel itself. Netfilter asserts policies at layer 2, 3 and 4 of the OSI model by inspecting packet and frame headers according to a list of rules. The iptables, ip6tables, and ebtables userspace applications provide syntax you use to define rules.

The rules inspect or operate on packets at several points (chains) in the life of the packet through the system:

Tables

When you build rules to affect the flow of traffic, tables can access the individual chains. Linux provides three tables by default:

Each table has a set of default chains that modify or inspect packets at different points of the path through the switch. Chains contain the individual rules to influence traffic. The figure below shows each table and the default chains they support. Cumulus Linux supports the tables and chains in green. Cumulus Linux does not support the tables and chains in red.

Rules

Rules classify the traffic you want to control. You apply rules to chains, which attach to tables.

Rules have several different components:

How Rules Parse and Apply

The switch reads all the rules from each chain from iptables, ip6tables, and ebtables and enters them in order into either the filter table or the mangle table. The switch reads the rules from the kernel in the following order:

When you combine and put rules into one table, the order determines the relative priority of the rules; iptables and ip6tables have the highest precedence and ebtables has the lowest.

The Linux packet forwarding construct is an overlay for how the silicon underneath processes packets. Be aware of the following:

Rule Placement in Memory

INPUT and ingress (FORWARD -i) rules occupy the same memory space. A rule counts as ingress if you set the -i option. If you set both input and output options (-i and -o), the switch considers the rule as ingress and occupies that memory space. For example:

-A FORWARD -i swp1 -o swp2 -s 10.0.14.2 -d 10.0.15.8 -p tcp -j ACCEPT

If you set an output flag with the INPUT chain, you see an error. For example:

-A FORWARD,INPUT -i swp1 -o swp2 -s 10.0.14.2 -d 10.0.15.8 -p tcp -j ACCEPT
error: line 2 : output interface specified with INPUT chain error processing rule '-A FORWARD,INPUT -i swp1 -o swp2 -s 10.0.14.2 -d 10.0.15.8 -p tcp -j ACCEPT'

If you remove the -o option and the interface, it is a valid rule.

Nonatomic Update Mode and Atomic Update Mode

Cumulus Linux enables atomic update mode by default. However, this mode limits the number of ACL rules that you can configure.

To increase the number of configurable ACL rules, configure the switch to operate in nonatomic mode.

Instead of reserving 50% of your TCAM space for atomic updates, incremental update uses the available free space to write the new TCAM rules and swap over to the new rules after this is complete. Cumulus Linux then deletes the old rules and frees up the original TCAM space. If there is insufficient free space to complete this task, the original nonatomic update runs, which interrupts traffic.

You can enable nonatomic updates for switchd, which offer better scaling because all TCAM resources actively impact traffic. With atomic updates, half of the hardware resources are on standby and do not actively impact traffic.

Incremental nonatomic updates are table based, so they do not interrupt network traffic when you install new rules. The rules map to the following tables and update in this order:

The incremental nonatomic update operation follows this order:

  1. Updates are incremental, one table at a time without stopping traffic.
  2. Cumulus Linux checks if the rules in a table are different from installation time; if a table does not have any changes, it does not reinstall the rules.
  3. If there are changes in a table, the new rules populate in new groups or slices in hardware, then that table switches over to the new groups or slices.
  4. Finally, old resources for that table free up. This process repeats for each of the tables listed above.
  5. If there are isufficient resources to hold both the new rule set and old rule set, Cumulus Linux tries the regular nonatomic mode, which interrupts network traffic.
  6. If the regular nonatomic update fails, Cumulus Linux reverts back to the previous rules.

To always start switchd with nonatomic updates:

  1. Edit /etc/cumulus/switchd.conf.

  2. Add the following line to the file:

    acl.non_atomic_update_mode = TRUE
    
  3. Restart switchd:

    cumulus@switch:~$ sudo systemctl restart switchd.service

    Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.

During regular non-incremental nonatomic updates, traffic stops, then continues after all the new configuration is in the hardware.

iptables, ip6tables, and ebtables

Do not use iptables, ip6tables, ebtables directly; installed rules only apply to the Linux kernel and Cumulus Linux does not hardware accelerate. When you run cl-acltool -i, Cumulus Linux resets all rules and deletes anything that is not in /etc/cumulus/acl/policy.conf.

For example, the following rule appears to work:

cumulus@switch:~$ sudo iptables -A INPUT -p icmp --icmp-type echo-request -j DROP

The cl-acltool -L command shows the rule:

cumulus@switch:~$ sudo cl-acltool -L ip
-------------------------------
Listing rules of type iptables:
-------------------------------

TABLE filter :
Chain INPUT (policy ACCEPT 72 packets, 5236 bytes)
pkts bytes target prot opt in out source destination
0 0 DROP icmp -- any any anywhere anywhere icmp echo-request

However, Cumulus Linux does not synchronize the rule to hardware. Running cl-acltool -i or reboot removes the rule without replacing it. To ensure that Cumulus Linux hardware accelerates all rules that can be in hardware, add them to /etc/cumulus/acl/policy.conf and install them with the cl-acltool -i command.

Estimate the Number of Rules

To estimate the number of rules you can create from an ACL entry, first determine if that entry is an ingress or an egress. Then, determine if it is an IPv4-mac or IPv6 type rule. This determines the slice to which the rule belongs. Use the following to determine how many entries the switch uses for each type.

By default, each entry occupies one double wide entry, except if the entry is one of the following:

Install and Manage ACL Rules with NCLU

NCLU provides an easy way to create custom ACLs. The rules you create live in the /var/lib/cumulus/nclu/nclu_acl.conf file, which Cumulus Linux converts to a rules file, /etc/cumulus/acl/policy.d/50_nclu_acl.rules. The rules you create with NCLU are independent of the default files in /etc/cumulus/acl/policy.d/ 00control_plane.rules and 99control_plane_catch_all.rules. If you update the content in these files after a Cumulus Linux upgrade, you do not lose the rules.

Instead of crafting a rule by hand then installing it with cl-acltool, NCLU handles most options automatically. For example, consider the following iptables rule:

-A FORWARD -i swp1 -o swp2 -s 10.0.14.2 -d 10.0.15.8 -p tcp -j ACCEPT

To create this rule with NCLU and call it EXAMPLE1:

cumulus@switch:~$ net add acl ipv4 EXAMPLE1 accept tcp source-ip 10.0.14.2/32 source-port any dest-ip 10.0.15.8/32 dest-port any
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

NCLU adds alll options, such as the -j and -p, even FORWARD in the above rule automatically when you apply the rule to the control plane; NCLU figures it all out for you.

You can also set a priority value, which specifies the order in which the rules execute and the order in which they appear in the rules file. Lower numbers execute first. To add a new rule in the middle, first run net show config acl, which displays the priority numbers. Otherwise, new rules append to the end of the list of rules in the nclu_acl.conf and 50_nclu_acl.rules files.

If you need to edit a rule manually, do not edit the 50_nclu_acl.rules file. Instead, edit the nclu_acl.conf file.

After you add the rule, you need to apply it to an inbound or outbound interface with the net add int acl command. The inbound interface in the following example is swp1:

cumulus@switch:~$ net add int swp1 acl ipv4 EXAMPLE1 inbound
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

After you commit your changes, verify the rule; run net show configuration acl:

cumulus@switch:~$ net show configuration acl
acl ipv4 EXAMPLEv4 priority 10 accept tcp source-ip 10.0.14.2/32 source-port any dest-ip 10.0.15.8/32 dest-port any

interface swp1
acl ipv4 EXAMPLE1 inbound

To see all installed rules, run cat on the 50_nclu_acl.rules file:

cumulus@switch:~$ cat /etc/cumulus/acl/policy.d/50_nclu_acl.rules
[iptables]
# swp1: acl ipv4 EXAMPLE1 inbound
-A FORWARD --in-interface swp1 --out-interface swp2 -j ACCEPT -p tcp -s 10.0.14.2/32 -d 10.0.15.8/32 --dport 110

For INPUT and FORWARD rules, apply the rule to a control plane interface with net add control-plane:

cumulus@switch:~$ net add control-plane acl ipv4 EXAMPLE1 inbound
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

To remove a rule, run net del acl ipv4|ipv6|mac RULENAME. This command deletes all rules from the 50_nclu_acl.rules file with the name you specify. The command also deletes the interfaces referenced in the nclu_acl.conf file.

cumulus@switch:~$ net del acl ipv4 EXAMPLE1
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

Install and Manage ACL Rules with cl-acltool

You can manage Cumulus Linux ACLs with cl-acltool. Rules write first to the iptables chains, as described above, and then synchronize to hardware through switchd.

To examine the current state of chains and list all installed rules, run:

cumulus@switch:~$ sudo cl-acltool -L all
 -------------------------------
Listing rules of type iptables:
-------------------------------

TABLE filter :
Chain INPUT (policy ACCEPT 90 packets, 14456 bytes)
pkts bytes target prot opt in out source destination
0 0 DROP all -- swp+ any 240.0.0.0/5 anywhere
0 0 DROP all -- swp+ any loopback/8 anywhere
0 0 DROP all -- swp+ any base-address.mcast.net/8 anywhere
0 0 DROP all -- swp+ any 255.255.255.255 anywhere ...

To list installed rules using native iptables, ip6tables and ebtables, use the -L option with the respective commands:

cumulus@switch:~$ sudo iptables -L
cumulus@switch:~$ sudo ip6tables -L
cumulus@switch:~$ sudo ebtables -L

To flush all installed rules, run:

cumulus@switch:~$ sudo cl-acltool -F all

To flush only the IPv4 iptables rules, run:

cumulus@switch:~$ sudo cl-acltool -F ip

If the install fails, ACL rules in the kernel and hardware roll back to the previous state. You also see errors from programming rules in the kernel or ASIC.

Install Packet Filtering (ACL) Rules

cl-acltool takes access control list (ACL) rule input in files. Each ACL policy file includes iptables, ip6tables and ebtables categories under the tags [iptables], [ip6tables] and [ebtables]. You must assign each rule in an ACL policy to one of the rule categories.

See man cl-acltool(5) for ACL rule details. For iptables rule syntax, see man iptables(8). For ip6tables rule syntax, see man ip6tables(8). For ebtables rule syntax, see man ebtables(8).

See man cl-acltool(5) and man cl-acltool(8) for more details on using cl-acltool.

By default:

  • ACL policy files are in /etc/cumulus/acl/policy.d/.
  • All *.rules files in /etc/cumulus/acl/policy.d/ directory are also in /etc/cumulus/acl/policy.conf.
  • All files in the policy.conf file install when the switch boots up.
  • The policy.conf file expects rule files to have a .rules suffix as part of the file name.

Here is an example ACL policy file:

[iptables]
-A INPUT --in-interface swp1 -p tcp --dport 80 -j ACCEPT
-A FORWARD --in-interface swp1 -p tcp --dport 80 -j ACCEPT

[ip6tables]
-A INPUT --in-interface swp1 -p tcp --dport 80 -j ACCEPT
-A FORWARD --in-interface swp1 -p tcp --dport 80 -j ACCEPT

[ebtables]
-A INPUT -p IPv4 -j ACCEPT
-A FORWARD -p IPv4 -j ACCEPT

You can use wildcards or variables to specify chain and interface lists.

You can only use swp+ and bond+ as wildcard names.

swp+ rules apply as an aggregate, not per port. If you want to apply per port policing, specify a specific port instead of the wildcard.

INGRESS = swp+
INPUT_PORT_CHAIN = INPUT,FORWARD

[iptables]
-A $INPUT_PORT_CHAIN --in-interface $INGRESS -p tcp --dport 80 -j ACCEPT

[ip6tables]
-A $INPUT_PORT_CHAIN --in-interface $INGRESS -p tcp --dport 80 -j ACCEPT

[ebtables]
-A INPUT -p IPv4 -j ACCEPT

You can write ACL rules for the system into multiple files under the default /etc/cumulus/acl/policy.d/ directory. The ordering of rules during installation follows the sort order of the files according to their file names.

Use multiple files to stack rules. The example below shows two rule files that separate rules for management and datapath traffic:

cumulus@switch:~$ ls /etc/cumulus/acl/policy.d/
00sample_mgmt.rules 01sample_datapath.rules
cumulus@switch:~$ cat /etc/cumulus/acl/policy.d/00sample_mgmt.rules

INGRESS_INTF = swp+
INGRESS_CHAIN = INPUT

[iptables]
# protect the switch management
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -s 10.0.14.2 -d 10.0.15.8 -p tcp -j ACCEPT
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -s 10.0.11.2 -d 10.0.12.8 -p tcp -j ACCEPT
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -d 10.0.16.8 -p udp -j DROP

cumulus@switch:~$ cat /etc/cumulus/acl/policy.d/01sample_datapath.rules
INGRESS_INTF = swp+
INGRESS_CHAIN = INPUT, FORWARD

[iptables]
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -s 192.0.2.5 -p icmp -j ACCEPT
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -s 192.0.2.6 -d 192.0.2.4 -j DROP
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -s 192.0.2.2 -d 192.0.2.8 -j DROP

Install all ACL policies under a directory:

cumulus@switch:~$ sudo cl-acltool -i -P ./rules
Reading files under rules
Reading rule file ./rules/01_http_rules.txt ...
Processing rules in file ./rules/01_http_rules.txt ...
Installing acl policy ...
Done.

Apply all rules and policies included in /etc/cumulus/acl/policy.conf:

cumulus@switch:~$ sudo cl-acltool -i

Specify the Policy Files to Install

By default, Cumulus Linux installs any .rules file you configure in /etc/cumulus/acl/policy.d/. To add other policy files to an ACL, you need to include them in /etc/cumulus/acl/policy.conf. For example, for Cumulus Linux to install a rule in a policy file called 01_new.datapathacl, add include /etc/cumulus/acl/policy.d/01_new.rules to policy.conf:

cumulus@switch:~$ sudo nano /etc/cumulus/acl/policy.conf

#
# This file is a master file for acl policy file inclusion
#
# Note: This is not a file where you list acl rules.
#
# This file can contain:
# - include lines with acl policy files
#   example:
#     include <filepath>
#
# see manpage cl-acltool(5) and cl-acltool(8) for how to write policy files 
#

include /etc/cumulus/acl/policy.d/01_new.datapathacl

Hardware Limitations on Number of Rules

The maximum number of rules that the hardware process depends on:

If you exceed the maximum number of rules for a particular table, cl-acltool -i generates the following error:

error: hw sync failed (sync_acl hardware installation failed) Rolling back .. failed.

In the table below, the default rules count toward the limits listed. The raw limits below assume only one ingress and one egress table are present.

The NVIDIA Spectrum ASIC has one common TCAM for both ingress and egress, which you can use for other non-ACL-related resources. However, the number of supported rules varies with the TCAM profile for the switch.

ProfileAtomic Mode IPv4 RulesAtomic Mode IPv6 RulesNonatomic Mode IPv4 RulesNonatomic Mode IPv6 Rules
default5002501000500
ipmc-heavy75050015001000
acl-heavy1750100035002000
ipmc-max100050020001000
ip-acl-heavy75000150000

Even though the table above specifies the ip-acl-heavy profile supports no IPv6 rules, Cumulus Linux does not prevent you from configuring IPv6 rules. However, there is no guarantee that IPv6 rules work under the ip-acl-heavy profile.

Supported Rule Types

The iptables/ip6tables/ebtables construct tries to layer the Linux implementation on top of the underlying hardware but they are not always directly compatible. This section provides the supported rules for chains in iptables, ip6tables and ebtables.

To learn more about any of the options shown in the tables below, run iptables -h [name of option]. The same help syntax works for options for ip6tables and ebtables.

root@leaf1# ebtables -h tricolorpolice
...
tricolorpolice option:
--set-color-mode STRING setting the mode in blind or aware
--set-cir INT setting committed information rate in kbits per second
--set-cbs INT setting committed burst size in kbyte
--set-pir INT setting peak information rate in kbits per second
--set-ebs INT setting excess burst size in kbyte
--set-conform-action-dscp INT setting dscp value if the action is accept for conforming packets
--set-exceed-action-dscp INT setting dscp value if the action is accept for exceeding packets
--set-violate-action STRING setting the action (accept/drop) for violating packets
--set-violate-action-dscp INT setting dscp value if the action is accept for violating packets
Supported chains for the filter table:
INPUT FORWARD OUTPUT

iptables and ip6tables Rule Support

Rule ElementSupportedUnsupported
MatchesSrc/Dst, IP protocol
In/out interface
IPv4: icmp, ttl,
IPv6: icmp6, frag, hl,
IP common: tcp (with flags), udp, multiport, DSCP, addrtype
Rules with input/output Ethernet interfaces do not apply
Inverse matches
Standard TargetsACCEPT, DROPRETURN, QUEUE, STOP, Fall Thru, Jump
Extended TargetsLOG (IPv4/IPv6); UID is not supported for LOG
TCP SEQ, TCP options or IP options
ULOG
SETQOS
DSCP
Unique to Cumulus Linux:
SPAN
ERSPAN (IPv4/IPv6)
POLICE
TRICOLORPOLICE
SETCLASS

ebtables Rule Support

Rule ElementSupportedUnsupported
Matchesether type
input interface/wildcard
output interface/wildcard
Src/Dst MAC
IP: src, dest, tos, proto, sport, dport
IPv6: tclass, icmp6: type, icmp6: code range, src/dst addr, sport, dport
802.1p (CoS)
VLAN
Inverse matches
Proto length
Standard TargetsACCEPT, DROPRETURN, CONTINUE, Jump, Fall Thru
Extended TargetsULOG
LOG
Unique to Cumulus Linux:
SPAN
ERSPAN
POLICE
TRICOLORPOLICE
SETCLASS

Unsupported Rules

Splitting Rules Across the Ingress TCAM and the Egress TCAM

Splitting rules across the ingress TCAM and the egress TCAM causes the ingress IPv6 part of the rule to match packets going to all destinations, which can interfere with the regular expected linear rule match in a sequence. For example:

A higher rule can prevent a lower rule from matching:

Rule 1: -A FORWARD --out-interface vlan100 -p icmp6 -j ACCEPT

Rule 2: -A FORWARD --out-interface vlan101 -p icmp6 -s 01::02 -j ACCEPT

Rule 1 matches all icmp6 packets from to all out interfaces in the ingress TCAM.`

This prevents rule 2 from matching, which is more specific but with a different out interface. Make sure to put more specific matches above more general matches even if the output interfaces are different.

When you have two rules with the same output interface, the lower rule might match depending on the presence of the previous rules.

Rule 1: -A FORWARD --out-interface vlan100 -p icmp6 -j ACCEPT

Rule 2: -A FORWARD --out-interface vlan101 -s 00::01 -j DROP

Rule 3: -A FORWARD --out -interface vlan101 -p icmp6 -j ACCEPT

Rule 3 still matches for an icmp6 packet with sip 00:01 going out of vlan101. Rule 1 interferes with the normal function of rule 2 and/or rule 3.

When you have two adjacent rules with the same match and different output interfaces, such as:

Rule 1: -A FORWARD --out-interface vlan100 -p icmp6 -j ACCEPT

Rule 2: -A FORWARD --out-interface vlan101 -p icmp6 -j DROP

Rule 2 never matches on ingress. Both rules share the same mark.

Control Plane and Data Plane Traffic

You can configure quality of service for traffic on both the control plane and the data plane. By using QoS policers, you can rate limit traffic so incoming packets get dropped if they exceed specified thresholds.

Counters on POLICE ACL rules in iptables do not show dropped packets due to those rules.

Use the POLICE target with iptables. POLICE takes these arguments:

For example, to rate limit the incoming traffic on swp1 to 400 packets per second with a burst of 100 packets per second and set the class of the queue for the policed traffic as 0, set this rule in your appropriate .rules file:

-A INPUT --in-interface swp1 -j POLICE --set-mode pkt --set-rate 400 --set-burst 100 --set-class 0

Here is another example of control plane ACL rules to lock down the switch. You specify the rules in /etc/cumulus/acl/policy.d/00control_plane.rules:

View the contents of the file
INGRESS_INTF = swp+
INGRESS_CHAIN = INPUT
INNFWD_CHAIN = INPUT,FORWARD
MARTIAN_SOURCES_4 = "240.0.0.0/5,127.0.0.0/8,224.0.0.0/8,255.255.255.255/32"
MARTIAN_SOURCES_6 = "ff00::/8,::/128,::ffff:0.0.0.0/96,::1/128"

# Custom Policy Section
SSH_SOURCES_4 = "192.168.0.0/24"
NTP_SERVERS_4 = "192.168.0.1/32,192.168.0.4/32"
DNS_SERVERS_4 = "192.168.0.1/32,192.168.0.4/32"
SNMP_SERVERS_4 = "192.168.0.1/32"

[iptables]
-A $INNFWD_CHAIN --in-interface $INGRESS_INTF -s $MARTIAN_SOURCES_4 -j DROP
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p ospf -j POLICE --set-mode pkt --set-rate 2000 --set-burst 2000 --set-class 7
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p tcp --dport bgp -j POLICE --set-mode pkt --set-rate 2000 --set-burst 2000 --set-class 7
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p tcp --sport bgp -j POLICE --set-mode pkt --set-rate 2000 --set-burst 2000 --set-class 7
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p icmp -j POLICE --set-mode pkt --set-rate 100 --set-burst 40 --set-class 2
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p udp --dport bootps:bootpc -j POLICE --set-mode pkt --set-rate 100 --set-burst 100 --set-class 2
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p tcp --dport bootps:bootpc -j POLICE --set-mode pkt --set-rate 100 --set-burst 100 --set-class 2
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p igmp -j POLICE --set-mode pkt --set-rate 300 --set-burst 100 --set-class 6

# Custom policy
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p tcp --dport 22 -s $SSH_SOURCES_4 -j ACCEPT
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p udp --sport 123 -s $NTP_SERVERS_4 -j ACCEPT
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p udp --sport 53 -s $DNS_SERVERS_4 -j ACCEPT
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p udp --dport 161 -s $SNMP_SERVERS_4 -j ACCEPT


# Allow UDP traceroute when we are the current TTL expired hop 
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p udp --dport 1024:65535 -m ttl --ttl-eq 1 -j ACCEPT
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -j DROP

Set DSCP on Transit Traffic

The examples here use the mangle table to modify the packet as it transits the switch. DSCP is in decimal notation in the examples below.

[iptables]

#Set SSH as high priority traffic.
-t mangle -A FORWARD -p tcp --dport 22  -j DSCP --set-dscp 46

#Set everything coming in SWP1 as AF13
-t mangle -A FORWARD --in-interface swp1 -j DSCP --set-dscp 14

#Set Packets destined for 10.0.100.27 as best effort
-t mangle -A FORWARD -d 10.0.100.27/32 -j DSCP --set-dscp 0

#Example using a range of ports for TCP traffic
-t mangle -A FORWARD -p tcp -s 10.0.0.17/32 --sport 10000:20000 -d 10.0.100.27/32 --dport 10000:20000 -j DSCP --set-dscp 34

Verify DSCP Values on Transit Traffic

The examples here use the DSCP match criteria in combination with other IP, TCP, and interface matches to identify traffic and count the number of packets.

[iptables]

#Match and count the packets that match SSH traffic with DSCP EF
-A FORWARD -p tcp --dport 22 -m dscp --dscp 46 -j ACCEPT

#Match and count the packets coming in SWP1 as AF13
-A FORWARD --in-interface swp1 -m dscp --dscp 14 -j ACCEPT
#Match and count the packets with a destination 10.0.0.17 marked best effort
-A FORWARD -d 10.0.100.27/32 -m dscp --dscp 0 -j ACCEPT

#Match and count the packets in a port range with DSCP AF41
-A FORWARD -p tcp -s 10.0.0.17/32 --sport 10000:20000 -d 10.0.100.27/32 --dport 10000:20000 -m dscp --dscp 34 -j ACCEPT

Check the Packet and Byte Counters for ACL Rules

To verify the counters using the above example rules, first send test traffic matching the patterns through the network. The following example generates traffic with mz (or mausezahn), which you can install on host servers or on Cumulus Linux switches. After you send traffic to validate the counters, they match on switch1 using cl-acltool.

Policing counters do not increment on switches with the Spectrum ASIC.

# Send 100 TCP packets on host1 with a DSCP value of EF with a destination of host2 TCP port 22:

cumulus@host1$ mz eth1 -A 10.0.0.17 -B 10.0.100.27 -c 100 -v -t tcp "dp=22,dscp=46"
  IP:  ver=4, len=40, tos=184, id=0, frag=0, ttl=255, proto=6, sum=0, SA=10.0.0.17, DA=10.0.100.27,
      payload=[see next layer]
  TCP: sp=0, dp=22, S=42, A=42, flags=0, win=10000, len=20, sum=0,
      payload=

# Verify the 100 packets are matched on switch1

cumulus@switch1$ sudo cl-acltool -L ip
-------------------------------
Listing rules of type iptables:
-------------------------------
TABLE filter :
Chain INPUT (policy ACCEPT 9314 packets, 753K bytes)
  pkts bytes target     prot opt in     out     source               destination
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
  pkts bytes target     prot opt in     out     source               destination
  100  6400 ACCEPT     tcp  --  any    any     anywhere             anywhere             tcp dpt:ssh DSCP match 0x2e
    0     0 ACCEPT     all  --  swp1   any     anywhere             anywhere             DSCP match 0x0e
    0     0 ACCEPT     all  --  any    any     10.0.0.17            anywhere             DSCP match 0x00
    0     0 ACCEPT     tcp  --  any    any     10.0.0.17            10.0.100.27          tcp spts:webmin:20000
    dpts:webmin:2002

# Send 100 packets with a small payload on host1 with a DSCP value of AF13 with a destination of host2:

cumulus@host1$ mz eth1 -A 10.0.0.17 -B 10.0.100.27 -c 100 -v -t ip
  IP:  ver=4, len=20, tos=0, id=0, frag=0, ttl=255, proto=0, sum=0, SA=10.0.0.17, DA=10.0.100.27,
      payload=

# Verify the 100 packets are matched on switch1

cumulus@switch1$ sudo cl-acltool -L ip
-------------------------------
Listing rules of type iptables:
-------------------------------
TABLE filter :
Chain INPUT (policy ACCEPT 9314 packets, 753K bytes)
  pkts bytes target     prot opt in     out     source               destination
  chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
  pkts bytes target     prot opt in     out     source               destination
  100  6400 ACCEPT     tcp  --  any    any     anywhere             anywhere             tcp dpt:ssh DSCP match 0x2e
  100  7000 ACCEPT     all  --  swp3   any     anywhere             anywhere             DSCP match 0x0e
  100  6400 ACCEPT     all  --  any    any     10.0.0.17            anywhere             DSCP match 0x00
    0     0 ACCEPT     tcp  --  any    any     10.0.0.17            10.0.100.27          tcp spts:webmin:20000 dpts:webmin:2002

# Send 100 packets on host1 with a destination of host2:

cumulus@host1$ mz eth1 -A 10.0.0.17 -B 10.0.100.27 -c 100 -v -t ip
 IP:  ver=4, len=20, tos=56, id=0, frag=0, ttl=255, proto=0, sum=0, SA=10.0.0.17, DA=10.0.100.27,
     payload=

# Verify the 100 packets are matched on switch1

cumulus@switch1$ sudo cl-acltool -L ip
-------------------------------
Listing rules of type iptables:
-------------------------------
TABLE filter :
Chain INPUT (policy ACCEPT 9314 packets, 753K bytes)
  pkts bytes target     prot opt in     out     source               destination
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
  pkts bytes target     prot opt in     out     source               destination
  100  6400 ACCEPT     tcp  --  any    any     anywhere             anywhere             tcp dpt:ssh DSCP match 0x2e
  100  7000 ACCEPT     all  --  swp3   any     anywhere             anywhere             DSCP match 0x0e
    0     0 ACCEPT     all  --  any    any     10.0.0.17            anywhere             DSCP match 0x00
    0     0 ACCEPT     tcp  --  any    any     10.0.0.17            10.0.100.27          tcp spts:webmin:20000 dpts:webmin:2002Still working

Filter Specific TCP Flags

The example solution below creates rules on the INPUT and FORWARD chains to drop ingress IPv4 and IPv6 TCP packets when you set the SYN bit and reset the RST, ACK, and FIN bits. The default for the INPUT and FORWARD chains allows all other packets. The ACL apply to ports swp20 and swp21. After configuring this ACL, you cannot establish new TCP sessions that originate from ingress ports swp20 and swp21. You can establish TCP sessions that originate from any other port.

INGRESS_INTF = swp20,swp21

[iptables]
-A INPUT,FORWARD --in-interface $INGRESS_INTF -p tcp --syn -j DROP
[ip6tables]
-A INPUT,FORWARD --in-interface $INGRESS_INTF -p tcp --syn -j DROP

The --syn flag in the above rule matches packets with the SYN bit set and the ACK, RST, and FIN bits cleared. It is equivalent to using -tcp-flags SYN,RST,ACK,FIN SYN. For example, you can write the above rule as:

-A INPUT,FORWARD --in-interface $INGRESS_INTF -p tcp --tcp-flags SYN,RST,ACK,FIN SYN -j DROP

Control Who Can SSH into the Switch

Run the following NCLU commands to control who can SSH into the switch. In the following example, 10.0.0.11/32 is the interface IP address (or loopback IP address) of the switch and 10.255.4.0/24 can SSH into the switch.

cumulus@switch:~$ net add acl ipv4 test priority 10 accept source-ip 10.255.4.0/24 dest-ip 10.0.0.11/32
cumulus@switch:~$ net add acl ipv4 test priority 20 drop source-ip any dest-ip 10.0.0.11/32
cumulus@switch:~$ net add control-plane acl ipv4 test inbound
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

Cumulus Linux does not support the keyword iprouter (typically used for traffic that goes to the CPU, where the destination MAC address is that of the router but the destination IP address is not the router).

Match on VLAN IDs on Layer 2 Interfaces

You can match on VLAN IDs on layer 2 interfaces for ingress rules. The following example matches on a VLAN and DSCP class, and sets the internal class of the packet. For extended matching on IP fields, combine this rule with ingress iptable rules.

[ebtables]
-A FORWARD -p 802_1Q --vlan-id 100 -j mark --mark-set 0x66

[iptables]
-A FORWARD -i swp31 -m mark --mark 0x66 -m dscp --dscp-class CS1 -j SETCLASS --class 2

Match on ECN Bits in the TCP IP Header

ECN allows end-to-end notification of network congestion without dropping packets. In Cumulus Linux 4.4.1 and later, you can add ECN rules to match on the ECE, CWR, and ECT flags in the TCP IPv4 header.

By default, ECN rules match a packet with the bit set. You can reverse the match by using an explanation point (!).

Match on the ECE Bit

After an endpoint receives a packet with the CE bit set by a router, it sets the ECE bit in the returning ACK packet to notify the other endpoint that it needs to slow down.

To match on the ECE bit:

Run the net add acl ipv4 <rule-name> <action> tcp ece command:

cumulus@switch:~$ net add acl ipv4 ece-rule accept tcp ece
cumulus@switch:~$ net commit

Create a rules file in the /etc/cumulus/acl/policy.d directory and add the following rule under [iptables]:

cumulus@switch:~$ sudo nano /etc/cumulus/acl/policy.d/30-tcp-flags.rules
[iptables]
-A FORWARD -i swp1 -p tcp -m ecn --ecn-tcp-ece -j ACCEPT 

Apply the rule:

cumulus@switch:~$ sudo cl-acltool -i

Match on the CWR Bit

The CWR bit notifies the other endpoint of the connection that it received and reacted to an ECE.

To match on the CWR bit:

Run the net add acl ipv4 <rule-name> <action> tcp cwr command:

cumulus@switch:~$ net add acl ipv4 cwr-rule accept tcp cwr
cumulus@switch:~$ net commit

Create a rules file in the /etc/cumulus/acl/policy.d directory and add the following rule under [iptables]:

cumulus@switch:~$ sudo nano /etc/cumulus/acl/policy.d/30-tcp-flags.rules
[iptables]
-A FORWARD -i swp1 -p tcp -m ecn --ecn-tcp-cwr -j ACCEPT 

Apply the rule:

cumulus@switch:~$ sudo cl-acltool -i

Match on the ECT Bit

The ECT codepoints negotiate if the connection is ECN capable by setting one of the two bits to 1. Routers also use the ECT bit to indicate that they are experiencing congestion by setting both the ECT codepoints to 1.

To match on the ECT bit:

Run the net add acl ipv4 <acl-name> <action> tcp ecn <value> command. You can specify a value between 0 and 3.

cumulus@switch:~$ net add acl ipv4 ect-rule accept tcp ecn 1
cumulus@switch:~$ net commit

Create a rules file in the /etc/cumulus/acl/policy.d directory and add the following rule under [iptables]:

cumulus@switch:~$ sudo nano /etc/cumulus/acl/policy.d/30-tcp-flags.rules
[iptables]
-A FORWARD -i swp1 -p tcp -m ecn --ecn-ip-ect 1 -j ACCEPT

Apply the rule:

cumulus@switch:~$ sudo cl-acltool -i

Example Configuration

The following example demonstrates how Cumulus Linux applies several different rules.

These are the configurations for the two switches in these examples. The configuration for each switch is in the /etc/network/interfaces file on that switch.

Switch 1 Configuration

cumulus@switch1:~$ sudo cat /etc/network/interfaces
...
/etc/network/interfaces
=======================

auto swp1
iface swp1

auto swp2
iface swp2

auto swp3
iface swp3

auto swp4
iface swp4

auto bond2
iface bond2
  bond-slaves swp3 swp4

auto br-untagged
iface br-untagged
  address 10.0.0.1/24
  bridge_ports swp1 bond2
  bridge_stp on

auto br-tag100
iface br-tag100
  address 10.0.100.1/24
  bridge_ports swp2.100 bond2.100
  bridge_stp on
...

Switch 2 Configuration

cumulus@switch2:~$ sudo cat /etc/network/interfaces
...
/etc/network/interfaces
=======================

auto swp3
iface swp3

auto swp4
iface swp4

auto br-untagged
iface br-untagged
  address 10.0.0.2/24
  bridge_ports bond2
  bridge_stp on

auto br-tag100
iface br-tag100
  address 10.0.100.2/24
  bridge_ports bond2.100
  bridge_stp on

auto bond2
iface bond2
  bond-slaves swp3 swp4
...

Egress Rule

The following rule blocks any TCP traffic with destination port 200 going from host1 or host2 through the switch (corresponding to rule 1 in the diagram above).

[iptables] -A FORWARD -o bond2 -p tcp --dport 200 -j DROP

Ingress Rule

The following rule blocks any UDP traffic with source port 200 going from host1 through the switch (corresponding to rule 2 in the diagram above).

[iptables] -A FORWARD -i swp2 -p udp --sport 200 -j DROP

Input Rule

The following rule blocks any UDP traffic with source port 200 and destination port 50 going from host1 to the switch (corresponding to rule 3 in the diagram above).

[iptables] -A INPUT -i swp1 -p udp --sport 200 --dport 50 -j DROP

Output Rule

The following rule blocks any TCP traffic with source port 123 and destination port 123 going from Switch 1 to host2 (corresponding to rule 4 in the diagram above).

[iptables] -A OUTPUT -o br-tag100 -p tcp --sport 123 --dport 123 -j DROP

Combined Rules

The following rule blocks any TCP traffic with source port 123 and destination port 123 going from any switch port egress or generated from Switch 1 to host1 or host2 (corresponding to rules 1 and 4 in the diagram above).

[iptables] -A OUTPUT,FORWARD -o swp+ -p tcp --sport 123 --dport 123 -j DROP

This also becomes two ACLs and is the same as:

[iptables]
-A FORWARD -o swp+ -p tcp --sport 123 --dport 123 -j DROP 
-A OUTPUT -o swp+ -p tcp --sport 123 --dport 123 -j DROP

Layer 2 Rules/ebtables

The following rule blocks any traffic with source MAC address 00:00:00:00:00:12 and destination MAC address 08:9e:01:ce:e2:04 going from any switch port egress/ingress.

[ebtables] -A FORWARD -s 00:00:00:00:00:12 -d 08:9e:01:ce:e2:04 -j DROP

Considerations

Not All Rules Supported

Cumulus Linux does not support all iptables, ip6tables, or ebtables rules. Refer to Supported Rules for specific rule support.

ACL Log Policer Limits Traffic

To protect the CPU from overloading, Cumulus Linux limits traffic copied to the CPU to 1 packet per second by an ACL Log Policer.

Bridge Traffic Limitations

Bridge traffic that matches LOG ACTION rules do not log to syslog; the kernel and hardware identify packets using different information.

You Cannot Forward Log Actions

You cannot forward logged packets. The hardware cannot both forward a packet and send the packet to the control plane (or kernel) for logging. A log action must also have a drop action.

SPAN Sessions that Reference an Outgoing Interface

SPAN sessions that reference an outgoing interface create mirrored packets based on the ingress interface before the routing decision. See SPAN Sessions that Reference an Outgoing Interface and Use the CPU Port as the SPAN Destination in the Network Troubleshooting section.

iptables Interactions with cl-acltool

Because Cumulus Linux is a Linux operating system, you can use the iptables commands. However, consider using cl-acltool instead because:

For example, running the following command works:

cumulus@switch:~$ sudo iptables -A INPUT -p icmp --icmp-type echo-request -j DROP

And the rules appear when you run cl-acltool -L:

cumulus@switch:~$ sudo cl-acltool -L ip
-------------------------------
Listing rules of type iptables:
-------------------------------
TABLE filter :
Chain INPUT (policy ACCEPT 72 packets, 5236 bytes)
pkts bytes target  prot opt in   out   source    destination

0     0 DROP    icmp --  any  any   anywhere  anywhere      icmp echo-request

However, running cl-acltool -i or reboot removes them. To ensure that Cumulus Linux can hardware accelerate all rules that can be in hardware, place them in the /etc/cumulus/acl/policy.conf file, then run cl-acltool -i.

Hardware Limitations

Due to hardware limitations in the Spectrum ASIC, BFD policers and the BFD-related control plane share rules. The following default rules share the same policer in the 00control_plan.rules file:

[iptables]
-A $INGRESS_CHAIN -p udp --dport $BFD_ECHO_PORT -j POLICE --set-mode pkt --set-rate 2000 --set-burst 2000
-A $INGRESS_CHAIN -p udp --dport $BFD_PORT -j POLICE --set-mode pkt --set-rate 2000 --set-burst 2000
-A $INGRESS_CHAIN -p udp --dport $BFD_MH_PORT -j POLICE --set-mode pkt --set-rate 2000 --set-burst 2000

[ip6tables]
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p udp --dport $BFD_ECHO_PORT -j POLICE --set-mode pkt --set-rate 2000 --set-burst 2000 --set-class 7
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p udp --dport $BFD_PORT -j POLICE --set-mode pkt --set-rate 2000 --set-burst 2000 --set-class 7
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p udp --dport $BFD_MH_PORT -j POLICE --set-mode pkt --set-rate 2000 --set-burst 2000 --set-class 7

To work around this limitation, set the rate and burst for all these rules to the same values with the --set-rate and --set-burst options.

Where to Assign Rules

ACL Rule Installation Failure

After an ACL rule installation failure, you see a generic error message like the following:

cumulus@switch:$ sudo cl-acltool -i -p 00control_plane.rules
Using user provided rule file 00control_plane.rules
Reading rule file 00control_plane.rules ...
Processing rules in file 00control_plane.rules ...
error: hw sync failed (sync_acl hardware installation failed)
Installing acl policy... Rolling back ..
failed.

INPUT Chain Rules

Cumulus Linux implements INPUT chain rules using a trap mechanism and assigns trap IDs to packets that go to the CPU. The default INPUT chain rules map to these trap IDs. However, if a packet matches multiple traps, an internal priority mechanism resolves them. which can be different from the rule priorities. The default expected rule does not police the packet but another rule polices it instead. For example, the LOCAL rule polices ICMP packets that go to the CPU instead of the ICMP rule. Also, multiple rules can share the same trap, where the largest of the policer values applies.

To work around this issue, create rules on the INPUT and FORWARD chains (INPUT,FORWARD).

Hardware Policing of Packets in the Input Chain

Certain platforms have limitations on hardware policing packets in the INPUT chain. To work around these limitations, Cumulus Linux supports kernel based policing of these packets in software using limit or hashlimit matches. Cumulus Linux does not hardware offload rules with these matches, but ignores them during hardware install.

ACLs Do not Match when the Output Port on the ACL is a Subinterface

The ACL does not match on packets when you configure a subinterface as the output port. The ACL matches on packets only if the primary port is as an output port. If a subinterface is an output or egress port, the packets match correctly.

For example:

-A FORWARD --out-interface swp49s1.100 -j ACCEPT

Egress ACL Matching on Bonds

Cumulus Linux does not support ACL rules that match on an outbound bond interface. For example, you cannot create the following rule:

[iptables]
-A FORWARD --out-interface <bond_intf> -j DROP

To work around this issue, duplicate the ACL rule on each physical port of the bond. For example:

[iptables]
-A FORWARD --out-interface <bond-member-port-1> -j DROP
-A FORWARD --out-interface <bond-member-port-2> -j DROP

Default Cumulus Linux ACL Configuration

The Cumulus Linux default ACL configuration includes iptables, ip6tables, and ebtables. The sections below describe the default configurations for each one. You can see the default file by clicking the Default ACL Configuration link:

Default ACL Configuration
cumulus@switch:~$ sudo cl-acltool -L all
-------------------------------
Listing rules of type iptables:
-------------------------------
TABLE filter :
Chain INPUT (policy ACCEPT 167 packets, 16481 bytes)
    pkts bytes target     prot opt in     out     source               destination
      0     0 DROP       all  --  swp+   any     240.0.0.0/5          anywhere
      0     0 DROP       all  --  swp+   any     loopback/8           anywhere
      0     0 DROP       all  --  swp+   any     base-address.mcast.net/8  anywhere
      0     0 DROP       all  --  swp+   any     255.255.255.255      anywhere
      0     0 SETCLASS   udp  --  swp+   any     anywhere             anywhere             udp dpt:3785 SETCLASS  class:7
      0     0 POLICE     udp  --  any    any     anywhere             anywhere             udp dpt:3785 POLICE  mode:pkt rate:2000 burst:2000
      0     0 SETCLASS   udp  --  swp+   any     anywhere             anywhere             udp dpt:3784 SETCLASS  class:7
      0     0 POLICE     udp  --  any    any     anywhere             anywhere             udp dpt:3784 POLICE  mode:pkt rate:2000 burst:2000
      0     0 SETCLASS   udp  --  swp+   any     anywhere             anywhere             udp dpt:4784 SETCLASS  class:7
      0     0 POLICE     udp  --  any    any     anywhere             anywhere             udp dpt:4784 POLICE  mode:pkt rate:2000 burst:2000
      0     0 SETCLASS   ospf --  swp+   any     anywhere             anywhere             SETCLASS  class:7
      0     0 POLICE     ospf --  any    any     anywhere             anywhere             POLICE  mode:pkt rate:2000 burst:2000
      0     0 SETCLASS   tcp  --  swp+   any     anywhere             anywhere             tcp dpt:bgp SETCLASS  class:7
      0     0 POLICE     tcp  --  any    any     anywhere             anywhere             tcp dpt:bgp POLICE  mode:pkt rate:2000 burst:2000
      0     0 SETCLASS   tcp  --  swp+   any     anywhere             anywhere             tcp spt:bgp SETCLASS  class:7
      0     0 POLICE     tcp  --  any    any     anywhere             anywhere             tcp spt:bgp POLICE  mode:pkt rate:2000 burst:2000
      0     0 SETCLASS   tcp  --  swp+   any     anywhere             anywhere             tcp dpt:5342 SETCLASS  class:7
      0     0 POLICE     tcp  --  any    any     anywhere             anywhere             tcp dpt:5342 POLICE  mode:pkt rate:2000 burst:2000
      0     0 SETCLASS   tcp  --  swp+   any     anywhere             anywhere             tcp spt:5342 SETCLASS  class:7
      0     0 POLICE     tcp  --  any    any     anywhere             anywhere             tcp spt:5342 POLICE  mode:pkt rate:2000 burst:2000
      0     0 SETCLASS   icmp --  swp+   any     anywhere             anywhere             SETCLASS  class:2
      1    84 POLICE     icmp --  any    any     anywhere             anywhere             POLICE  mode:pkt rate:100 burst:40
      0     0 SETCLASS   udp  --  swp+   any     anywhere             anywhere             udp dpts:bootps:bootpc SETCLASS  class:2
      0     0 POLICE     udp  --  any    any     anywhere             anywhere             udp dpt:bootps POLICE  mode:pkt rate:100 burst:100
      0     0 POLICE     udp  --  any    any     anywhere             anywhere             udp dpt:bootpc POLICE  mode:pkt rate:100 burst:100
      0     0 SETCLASS   tcp  --  swp+   any     anywhere             anywhere             tcp dpts:bootps:bootpc SETCLASS  class:2
      0     0 POLICE     tcp  --  any    any     anywhere             anywhere             tcp dpt:bootps POLICE  mode:pkt rate:100 burst:100
      0     0 POLICE     tcp  --  any    any     anywhere             anywhere             tcp dpt:bootpc POLICE  mode:pkt rate:100 burst:100
      0     0 SETCLASS   udp  --  swp+   any     anywhere             anywhere             udp dpt:10001 SETCLASS  class:3
      0     0 POLICE     udp  --  any    any     anywhere             anywhere             udp dpt:10001 POLICE  mode:pkt rate:2000 burst:2000
      0     0 SETCLASS   igmp --  swp+   any     anywhere             anywhere             SETCLASS  class:6
      1    32 POLICE     igmp --  any    any     anywhere             anywhere             POLICE  mode:pkt rate:300 burst:100
      0     0 POLICE     all  --  swp+   any     anywhere             anywhere             addrtype match dst-type LOCAL POLICE  mode:pkt rate:1000 burst:1000 class:0
      0     0 POLICE     all  --  swp+   any     anywhere             anywhere             addrtype match dst-type IPROUTER POLICE  mode:pkt rate:400 burst:100 class:0
      0     0 SETCLASS   all  --  swp+   any     anywhere             anywhere             SETCLASS  class:0

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
    pkts bytes target     prot opt in     out     source               destination
      0     0 DROP       all  --  swp+   any     240.0.0.0/5          anywhere
      0     0 DROP       all  --  swp+   any     loopback/8           anywhere
      0     0 DROP       all  --  swp+   any     base-address.mcast.net/8  anywhere
      0     0 DROP       all  --  swp+   any     255.255.255.255      anywhere

Chain OUTPUT (policy ACCEPT 107 packets, 12590 bytes)
    pkts bytes target     prot opt in     out     source               destination

TABLE mangle :
Chain PREROUTING (policy ACCEPT 172 packets, 17871 bytes)
    pkts bytes target     prot opt in     out     source               destination

Chain INPUT (policy ACCEPT 172 packets, 17871 bytes)
    pkts bytes target     prot opt in     out     source               destination

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
    pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 111 packets, 18134 bytes)
    pkts bytes target     prot opt in     out     source               destination

Chain POSTROUTING (policy ACCEPT 111 packets, 18134 bytes)
    pkts bytes target     prot opt in     out     source               destination


TABLE raw :
Chain PREROUTING (policy ACCEPT 173 packets, 17923 bytes)
    pkts bytes target     prot opt in     out     source               destination

    Chain OUTPUT (policy ACCEPT 112 packets, 18978 bytes)
     pkts bytes target     prot opt in     out     source               destination


--------------------------------
Listing rules of type ip6tables:
--------------------------------
TABLE filter :
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
    pkts bytes target    prot opt in     out     source               destination
      0     0 DROP       all      swp+   any     ip6-mcastprefix/8    anywhere
      0     0 DROP       all      swp+   any     ::/128               anywhere
      0     0 DROP       all      swp+   any     ::ffff:0.0.0.0/96    anywhere
      0     0 DROP       all      swp+   any     localhost/128        anywhere
      0     0 POLICE     udp      swp+   any     anywhere             anywhere             udp dpt:3785 POLICE  mode:pkt rate:2000 burst:2000 class:7
      0     0 POLICE     udp      swp+   any     anywhere             anywhere             udp dpt:3784 POLICE  mode:pkt rate:2000 burst:2000 class:7
      0     0 POLICE     udp      swp+   any     anywhere             anywhere             udp dpt:4784 POLICE  mode:pkt rate:2000 burst:2000 class:7
      0     0 POLICE     ospf     swp+   any     anywhere             anywhere             POLICE  mode:pkt rate:2000 burst:2000 class:7
      0     0 POLICE     tcp      swp+   any     anywhere             anywhere             tcp dpt:bgp POLICE  mode:pkt rate:2000 burst:2000 class:7
      0     0 POLICE     tcp      swp+   any     anywhere             anywhere             tcp spt:bgp POLICE  mode:pkt rate:2000 burst:2000 class:7
      0     0 POLICE     ipv6-icmp    swp+   any     anywhere             anywhere             ipv6-icmp router-solicitation POLICE  mode:pkt rate:100 burst:100 class:2
      0     0 POLICE     ipv6-icmp    swp+   any     anywhere             anywhere             ipv6-icmp router-advertisement POLICE  mode:pkt rate:500 burst:500 class:2
      0     0 POLICE     ipv6-icmp    swp+   any     anywhere             anywhere             ipv6-icmp neighbour-solicitation POLICE  mode:pkt rate:400 burst:400 class:2
      0     0 POLICE     ipv6-icmp    swp+   any     anywhere             anywhere             ipv6-icmp neighbour-advertisement POLICE  mode:pkt rate:400 burst:400 class:2
      0     0 POLICE     ipv6-icmp    swp+   any     anywhere             anywhere             ipv6-icmptype 130 POLICE  mode:pkt rate:200 burst:100 class:6
      0     0 POLICE     ipv6-icmp    swp+   any     anywhere             anywhere             ipv6-icmptype 131 POLICE  mode:pkt rate:200 burst:100 class:6
      0     0 POLICE     ipv6-icmp    swp+   any     anywhere             anywhere             ipv6-icmptype 132 POLICE  mode:pkt rate:200 burst:100 class:6
      0     0 POLICE     ipv6-icmp    swp+   any     anywhere             anywhere             ipv6-icmptype 143 POLICE  mode:pkt rate:200 burst:100 class:6
      0     0 POLICE     ipv6-icmp    swp+   any     anywhere             anywhere             POLICE  mode:pkt rate:64 burst:40 class:2
      0     0 POLICE     udp      swp+   any     anywhere             anywhere             udp dpts:dhcpv6-client:dhcpv6-server POLICE  mode:pkt rate:100 burst:100 class:2
      0     0 POLICE     tcp      swp+   any     anywhere             anywhere             tcp dpts:dhcpv6-client:dhcpv6-server POLICE  mode:pkt rate:100 burst:100 class:2
      0     0 POLICE     all      swp+   any     anywhere             anywhere             addrtype match dst-type LOCAL POLICE  mode:pkt rate:1000 burst:1000 class:0
      0     0 POLICE     all      swp+   any     anywhere             anywhere             addrtype match dst-type IPROUTER POLICE  mode:pkt rate:400 burst:100 class:0
      0     0 SETCLASS   all      swp+   any     anywhere             anywhere             SETCLASS  class:0

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
    pkts bytes target    prot opt in     out     source               destination
      0     0 DROP       all      swp+   any     ip6-mcastprefix/8    anywhere
      0     0 DROP       all      swp+   any     ::/128               anywhere
      0     0 DROP       all      swp+   any     ::ffff:0.0.0.0/96    anywhere
      0     0 DROP       all      swp+   any     localhost/128        anywhere

Chain OUTPUT (policy ACCEPT 5 packets, 408 bytes)
    pkts bytes target     prot opt in     out     source               destination


TABLE mangle :
Chain PREROUTING (policy ACCEPT 7 packets, 718 bytes)
    pkts bytes target     prot opt in     out     source               destination

Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
    pkts bytes target     prot opt in     out     source               destination

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
    pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
    pkts bytes target     prot opt in     out     source               destination

Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
    pkts bytes target     prot opt in     out     source               destination


TABLE raw :
Chain PREROUTING (policy ACCEPT 7 packets, 718 bytes)
    pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
    pkts bytes target     prot opt in     out     source               destination

-------------------------------
Listing rules of type ebtables:
-------------------------------
TABLE filter :
Bridge table: filter

Bridge chain: INPUT, entries: 16, policy: ACCEPT
-d BGA -i swp+ -j setclass --class 7 , pcnt = 0 -- bcnt = 0
-d BGA -j police --set-mode pkt --set-rate 2000 --set-burst 2000 , pcnt = 0 -- bcnt = 0
-d 1:80:c2:0:0:2 -i swp+ -j setclass --class 7 , pcnt = 0 -- bcnt = 0
-d 1:80:c2:0:0:2 -j police --set-mode pkt --set-rate 2000 --set-burst 2000 , pcnt = 0 -- bcnt = 0
-d 1:80:c2:0:0:e -i swp+ -j setclass --class 6 , pcnt = 0 -- bcnt = 0
-d 1:80:c2:0:0:e -j police --set-mode pkt --set-rate 200 --set-burst 200 , pcnt = 0 -- bcnt = 0
-d 1:0:c:cc:cc:cc -i swp+ -j setclass --class 6 , pcnt = 0 -- bcnt = 0
-d 1:0:c:cc:cc:cc -j police --set-mode pkt --set-rate 200 --set-burst 200 , pcnt = 0 -- bcnt = 0
-p ARP -i swp+ -j setclass --class 2 , pcnt = 0 -- bcnt = 0
-p ARP -j police --set-mode pkt --set-rate 400 --set-burst 100 , pcnt = 0 -- bcnt = 0
-d 1:0:c:cc:cc:cd -i swp+ -j setclass --class 7 , pcnt = 0 -- bcnt = 0
-d 1:0:c:cc:cc:cd -j police --set-mode pkt --set-rate 2000 --set-burst 2000 , pcnt = 0 -- bcnt = 0
-p IPv4 -i swp+ -j ACCEPT , pcnt = 0 -- bcnt = 0
-p IPv6 -i swp+ -j ACCEPT , pcnt = 0 -- bcnt = 0
-i swp+ -j setclass --class 0 , pcnt = 0 -- bcnt = 0
-j police --set-mode pkt --set-rate 100 --set-burst 100 , pcnt = 0 -- bcnt = 0

Bridge chain: FORWARD, entries: 0, policy: ACCEPT

Bridge chain: OUTPUT, entries: 0, policy: ACCEPT

iptables

Action/Value Protocol/IP Address
Drop
Destination IP: Any
Source IPv4:
240.0.0.0/5
loopback/8
224.0.0.0/4
255.255.255.255
Set class: 7
Police: Packet rate 2000 burst 2000
Source IP: Any
Destination IP: Any
Protocol:
UDP/BFD Echo
UDP/BFD Control
UDP BFD Multihop Control
OSPF
TCP/BGP (spt dpt 179)
TCP/MLAG (spt dpt 5342)
Set Class: 6
Police: Rate 300 burst 100
Source IP: Any
Destination IP: Any
Protocol:
IGMP
Set class: 2
Police: Rate 100 burst 40
Source IP : Any
Destination IP: Any
Protocol:
ICMP
Set class: 2
Police: Rate 100 burst 100
Source IP: Any
Destination IP: Any
Protocol:
UDP/bootpc, bootps
Set class: 0
Police: Rate 1000 burst 1000
Source IP: Any
Destination IP: Any
addrtype match dst-type LOCAL
Note: LOCAL is any local address -> Receiving a packet with a destination matching a local IP address on the switch goes to the CPU.
Set class: 0
Police: Rate 400 burst 100
Source IP: Any
Destination IP: Any
addrtype match dst-type IPROUTER
Note: IPROUTER is any unresolved address -> On a layer 2 or layer3 boundary receiving a packet from layer 3 and needs to go to CPU to ARP for the destination.
Set class 0All

Set class is internal to the switch - it does not set any precedence bits.

ip6tables

Action/Value Protocol/IP Address
DropSource IPv6:
ff00::/8
::
::ffff:0.0.0.0/96
localhost
Set class: 7
Police: Packet rate 2000 burst 2000
Source IPv6: Any
Destination IPv6: Any
Protocol:
UDP/BFD Echo
UDP/BFD Control
UDP BFD Multihop Control
OSPF
TCP/BGP (spt dpt 179)
Set class: 6
Police: Packet Rte: 200 burst 100
Source IPv6: Any
Destination IPv6: Any
Protocol:
Multicast Listener Query (MLD)
Multicast
Listener Report (MLD)
Multicast Listener Done (MLD
Multicast Listener Report V2
Set class: 2
Police: Packet rate: 100 burst 100
Source IPv6: Any
Destination IPv6: Any
Protocol:
ipv6-icmp router-solicitation
Set class: 2
Police: Packet rate: 500 burst 500
Source IPv6: Any
Destination IPv6: Any
Protocol:
ipv6-icmp router-advertisement POLICE
Set class: 2
Police: Packet rate: 400 burst 400
Source IPv6: Any
Destination IPv6: Any
Protocol:
ipv6-icmp neighbour-solicitation
ipv6-icmp neighbour-advertisement
Set class: 2
Police: Packet rate: 64 burst: 40
Source IPv6: Any
Destination IPv6: Any
Protocol:
Ipv6 icmp
Set class: 2
Police: Packet rate: 100 burst: 100
Source IPv6: Any
Destination IPv6: Any
Protocol:
UDP/dhcpv6-client:dhcpv6-server (Spts & dpts)
Police: Packet rate: 1000 burst 1000
Source IPv6: Any
Destination IPv6: Any
addrtype match dst-type LOCAL
Note: LOCAL is any local address -> Receiving a packet with a destination matching a local IPv6 address on the switch goes to the CPU.
Set class: 0
Police: Packet rate: 400 burst 100
addrtype match dst-type IPROUTER
Note: IPROUTER is an unresolved address -> On a layer 2 or layer 3 boundary receiving a packet from layer 3 and needs to go to CPU to ARP for the destination.
Set class 0All

Set class is internal to the switch - it does not set any precedence bits.

ebtables

Action/ValueProtocol/MAC Address
Set Class: 7
Police: packet rate: 2000 burst rate:2000
Any switchport input interface
BDPU
LACP=
Cisco PVST
Set Class: 6
Police: packet rate: 200 burst rate: 200
Any switchport input inteface
LLDP
CDP
Set Class: 2
Police: packet rate: 400 burst rate: 100
Any switchport input interface
ARP
Catch All:
Allow all traffic
Any switchport input interface
IPv4
IPv6
Catch All (applied at end):
Set class: 0
Police: packet rate 100 burst rate 100
Any switchport
ALL OTHER

Set class is internal to the switch. It does not set any precedence bits.

Services and Daemons in Cumulus Linux

Services (also known as daemons) and processes are at the heart of how a Linux system functions. Most of the time, a service takes care of itself; you just enable and start it, then let it run. However, because a Cumulus Linux switch is a Linux system, you can dig deeper if you like. Services can start multiple processes as they run. Services are important to monitor on a Cumulus Linux switch.

You manage services in Cumulus Linux in the following ways:

systemd and the systemctl Command

You manage services using systemd with the systemctl command. You run the systemctl command with any service on the switch to start, stop, restart, reload, enable, disable, reenable, or get the status of the service.

cumulus@switch:~$ sudo systemctl start | stop | restart | status | reload | enable | disable | reenable SERVICENAME.service

For example to restart networking, run the command:

cumulus@switch:~$ sudo systemctl restart networking.service

Add the service name after the systemctl argument.

To show all running services, use the systemctl status command. For example:

cumulus@switch:~$ sudo systemctl status
● switch
    State: running
      Jobs: 0 queued
    Failed: 0 units
    Since: Thu 2019-01-10 00:19:34 UTC; 23h ago
    CGroup: /
            ├─init.scope
            │ └─1 /sbin/init
            └─system.slice
              ├─haveged.service
              │ └─234 /usr/sbin/haveged --Foreground --verbose=1 -w 1024
              ├─sysmonitor.service
              │ ├─  658 /bin/bash /usr/lib/cumulus/sysmonitor
              │ └─26543 sleep 60
              ├─systemd-udevd.service
              │ └─218 /lib/systemd/systemd-udevd
              ├─system-ntp.slice
              │ └─ntp@mgmt.service
              │   └─vrf
              │     └─mgmt
              │       └─12108 /usr/sbin/ntpd -n -u ntp:ntp -g
              ├─cron.service
              │ └─274 /usr/sbin/cron -f -L 38
              ├─system-serial\x2dgetty.slice
              │ └─serial-getty@ttyS0.service
              │   └─745 /sbin/agetty -o -p -- \u --keep-baud 115200,38400,9600 ttyS0 vt220
              ├─nginx.service
              │ ├─332 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
              │ └─333 nginx: worker process
              ├─auditd.service
              │ └─235 /sbin/auditd
              ├─rasdaemon.service
              │ └─275 /usr/sbin/rasdaemon -f -r
              ├─clagd.service
              │ └─11443 /usr/bin/python /usr/sbin/clagd --daemon 169.254.1.2 peerlink.4094 44:39:39:ff:40:9
              --priority 100 --vxlanAnycas
              ├─switchd.service
              │ └─430 /usr/sbin/switchd -vx
              ...

systemctl Commands

systemctl has commands that perform a specific operation on a given service:

You do not need to interact with the services directly using these commands. If a critical service crashes or encounters an error, systemd restarts it automatically. systemd is the caretaker of services in modern Linux systems and responsible for starting all the necessary services at boot time.

Ensure a Service Starts after Multiple Restarts

By default, systemd tries to restart a particular service only a certain number of times within a given interval before the service fails to start. The settings StartLimitInterval (which defaults to 10 seconds) and StartBurstLimit (which defaults to 5 attempts) are in the service script; however, certain services override these defaults, sometimes with much longer times. For example, switchd.service sets StartLimitInterval=10m and StartBurstLimit=3; therefore, if you restart switchd more than three times in ten minutes, it does not start.

When the restart fails for this reason, you see a message similar to the following:

Job for switchd.service failed. See 'systemctl status switchd.service' and 'journalctl -xn' for details.

systemctl status switchd.service shows output similar to:

Active: failed (Result: start-limit) since Thu 2016-04-07 21:55:14 UTC; 15s ago

To clear this error, run systemctl reset-failed switchd.service. If you know you are going to restart frequently (multiple times within the StartLimitInterval), you can run the same command before you issue the restart request. This also applies to stop followed by start.

Keep systemd Services from Hanging after Starting

If you start, restart, or reload a systemd service that you can start from another systemd service, you must use the --no-block option with systemctl.

Identify Active Listener Ports for IPv4 and IPv6

You can identify the active listener ports under both IPv4 and IPv6 using the netstat command:

cumulus@switch:~$ netstat -nlp --inet --inet6
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:53              0.0.0.0:*               LISTEN      444/dnsmasq
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      874/sshd
tcp6       0      0 :::53                   :::*                    LISTEN      444/dnsmasq
tcp6       0      0 :::22                   :::*                    LISTEN      874/sshd
udp        0      0 0.0.0.0:28450           0.0.0.0:*                           839/dhclient
udp        0      0 0.0.0.0:53              0.0.0.0:*                           444/dnsmasq
udp        0      0 0.0.0.0:68              0.0.0.0:*                           839/dhclient
udp        0      0 192.168.0.42:123        0.0.0.0:*                           907/ntpd
udp        0      0 127.0.0.1:123           0.0.0.0:*                           907/ntpd
udp        0      0 0.0.0.0:123             0.0.0.0:*                           907/ntpd
udp        0      0 0.0.0.0:4784            0.0.0.0:*                           909/ptmd
udp        0      0 0.0.0.0:3784            0.0.0.0:*                           909/ptmd
udp        0      0 0.0.0.0:3785            0.0.0.0:*                           909/ptmd
udp6       0      0 :::58352                :::*                                839/dhclient
udp6       0      0 :::53                   :::*                                444/dnsmasq
udp6       0      0 fe80::a200:ff:fe00::123 :::*                                907/ntpd
udp6       0      0 ::1:123                 :::*                                907/ntpd
udp6       0      0 :::123                  :::*                                907/ntpd
udp6       0      0 :::4784                 :::*                                909/ptmd
udp6       0      0 :::3784                 :::*                                909/ptmd

Identify Active or Stopped Services

To see active or stopped services, run the cl-service-summary command:

cumulus@switch:~$ cl-service-summary
Service cron               enabled    active
Service ssh                enabled    active
Service syslog             enabled    active
Service asic-monitor       enabled    inactive
Service clagd              enabled    inactive
Service cumulus-poe                   inactive
Service lldpd              enabled    active
Service mstpd              enabled    active
Service neighmgrd          enabled    active
Service nvued               enabled    active
Service netq-agent         enabled    active
Service ntp                enabled    active
Service ptmd               enabled    active
Service pwmd               enabled    active
Service smond              enabled    active
Service switchd            enabled    active
Service sysmonitor         enabled    active
Service rdnbrd             disabled   inactive
Service frr                enabled    inactive
...

You can also run the systemctl list-unit-files --type service command to list all services on the switch and to see their status:

cumulus@switch:~$ systemctl list-unit-files --type service
UNIT FILE                              STATE
aclinit.service                        enabled
acltool.service                        enabled
acpid.service                          disabled
asic-monitor.service                   enabled
auditd.service                         enabled
autovt@.service                        disabled
bmcd.service                           disabled
bootlog.service                        enabled
bootlogd.service                       masked  
bootlogs.service                       masked  
bootmisc.service                       masked  
checkfs.service                        masked  
checkroot-bootclean.service            masked  
checkroot.service                      masked
clagd.service                          enabled
console-getty.service                  disabled
console-shell.service                  disabled
container-getty@.service               static  
cron.service                           enabled
cryptdisks-early.service               masked  
cryptdisks.service                     masked  
cumulus-aclcheck.service               static  
cumulus-core.service                   static  
cumulus-fastfailover.service           enabled
cumulus-firstboot.service              disabled
cumulus-platform.service               enabled  
...

Identify Essential Services

To identify which services must run when the switch boots:

cumulus@switch:~$ systemctl list-dependencies --before basic.target

To identify which services you need for networking:

cumulus@switch:~$ systemctl list-dependencies --after network.target
   ├─switchd.service
   ├─wd_keepalive.service
   └─network-pre.target

To identify the services needed for a multi-user environment, run:

cumulus@switch:~$ systemctl list-dependencies --before multi-user.target

 ●  ├─bootlog.service
   ├─systemd-readahead-done.service
   ├─systemd-readahead-done.timer
   ├─systemd-update-utmp-runlevel.service
   └─graphical.target
   └─systemd-update-utmp-runlevel.service

Important Services

The following table lists the most important services in Cumulus Linux.

Service NameDescriptionAffects Forwarding?
switchdHardware abstraction daemon. Synchronizes the kernel with the ASIC.YES
sx_sdkInterfaces with the Spectrum ASIC. Only on Spectrum switches.YES
frrFRRouting. Handles routing protocols. There are separate processes for each routing protocol, such as bgpd and ospfd.YES if routing
clagdCumulus link aggregation daemon. Handles MLAG.YES if using MLAG
neighmgrdSynchronizes MAC address information if using MLAG.YES if using MLAG
mstpdSpanning tree protocol daemon.YES if using layer 2
ptmdPrescriptive Topology Manager. Verifies cabling based on LLDP output. Also sets up BFD sessions.YES if using BFD
netdNCLU back end.
nvuedHandles the NVUE object model.NO
rsyslogHandles logging of syslog messages.NO
ntpNetwork time protocol.NO
ledmgrdLED manager. Reads the state of system LEDs.NO
sysmonitorWatches and logs critical system load (free memory, disk, CPU).NO
lldpdHandles Tx/Rx of LLDP information.NO
smondReads platform sensors and fan information from pwmd.NO
pwmdReads and sets fan speeds.NO

Configuring switchd

switchd is the daemon at the heart of Cumulus Linux. It communicates between the switch and Cumulus Linux, and all the applications running on Cumulus Linux.

Cumulus Linux stores the switchd configuration in the /etc/cumulus/switchd.conf file.

The switchd File System

switchd also exports a file system, mounted on /cumulus/switchd, that presents all the switchd configuration options as a series of files arranged in a tree structure. To show the contents, run the tree /cumulus/switchd command. The following example shows output for a switch with one switch port configured:

cumulus@switch:~$ sudo tree /cumulus/switchd/
/cumulus/switchd/
├── clear
│   └── stats
│       ├── vlan
│       └── vxlan
├── config
│   ├── acl
│   │   ├── flow_based_mirroring
│   │   ├── non_atomic_update_mode
│   │   ├── optimize_hw
│   │   └── vxlan_tnl_arp_punt_disable
│   ├── arp
│   │   ├── drop_during_failed_state
│   │   └── next_hops
│   ├── bridge
│   │   ├── broadcast_frame_to_cpu
│   │   └── optimized_mcast_flood
│   ├── buf_util
│   │   ├── measure_interval
│   │   └── poll_interval
│   ├── coalesce
│   │   ├── offset
│   │   ├── reducer
│   │   └── timeout
│   ├── disable_internal_hw_err_restart
│   ├── disable_internal_parity_restart
│   ├── hal
│   │   └── bcm
│   │       ├── l3
│   │       │   └── per_vlan_router_mac_lookup_for_vrrp
│   │       ├── linkscan_interval
│   │       ├── logging
│   │       │   └── l3mc
│   │       ├── per_vlan_router_mac_lookup
│   │       └── vxlan_support
│   ├── ignore_non_swps
│   ├── interface
│   │   ├── swp1
│   │   │   ├── ethtool_mode
│   │   │   ├── interface_mode
│   │   │   ├── port_security
│   │   │   │   ├── enable
│   │   │   │   ├── mac_limit
│   │   │   │   ├── static_mac
│   │   │   │   ├── sticky_aging
│   │   │   │   ├── sticky_mac
│   │   │   │   ├── sticky_timeout
│   │   │   │   ├── violation_mode
│   │   │   │   └── violation_timeout
│   │   │   └── storm_control
│   │   │       ├── broadcast
│   │   │       ├── multicast
│   │   │       └── unknown_unicast
...

Configure switchd Parameters

To configure the switchd parameters, edit the /etc/cumulus/switchd.conf file.

cumulus@switch:~$ sudo nano /etc/cumulus/switchd.conf
#
# /etc/cumulus/switchd.conf - switchd configuration file
#

# Statistic poll interval (in msec)
#stats.poll_interval = 2000

# Buffer utilization poll interval (in msec), 0 means disable
#buf_util.poll_interval = 0

# Buffer utilization measurement interval (in mins)
#buf_util.measure_interval = 0

# Optimize ACL HW resources for better utilization
#acl.optimize_hw = FALSE

# Enable Flow based mirroring.
#acl.flow_based_mirroring = TRUE

# Enable non atomic acl update
acl.non_atomic_update_mode = FALSE

# Send ARPs for next hops
#arp.next_hops = TRUE

# Kernel routing table ID, range 1 - 2^31, default 254
#route.table = 254
...

When you update the /etc/cumulus/switchd.conf file, you must restart switchd for the changes to take effect. See Restart switchd, below.

Restart switchd

Whenever you modify a switchd hardware configuration file (for example, you update any *.conf file that requires making a change to the switching hardware, like /etc/cumulus/datapath/traffic.conf), you must restart the switchd service for the change to take effect:

cumulus@switch:~$ sudo systemctl restart switchd.service

You do not have to restart the switchd service when you update a network interface configuration (for example, when you edit the /etc/network/interfaces file).

Restarting the switchd service causes all network ports to reset in addition to resetting the switch hardware configuration.

Configuring a Global Proxy

You configure global HTTP and HTTPS proxies in the /etc/profile.d/ directory of Cumulus Linux. Set the http_proxy and https_proxy variables to configure the switch with the address of the proxy server you want to use to get URLs on the command line. This is useful for programs such as apt, apt-get, curl and wget, which can all use this proxy.

  1. In a terminal, create a new file in the /etc/profile.d/ directory.

    cumulus@switch:~$ sudo nano /etc/profile.d/proxy.sh
    
  2. Add a line to the file to configure either an HTTP or an HTTPS proxy, or both:

    • HTTP proxy:

      http_proxy=http://myproxy.domain.com:8080
      export http_proxy
      
    • HTTPS proxy:

      https_proxy=https://myproxy.domain.com:8080
      export https_proxy
      
  3. Create a file in the /etc/apt/apt.conf.d directory and add the following lines to the file to get the HTTP and HTTPS proxies. The example below uses http_proxy as the file name:

    cumulus@switch:~$ sudo nano /etc/apt/apt.conf.d/http_proxy
    Acquire::http::Proxy "http://myproxy.domain.com:8080";
    Acquire::https::Proxy "https://myproxy.domain.com:8080";
    
  4. Add the proxy addresses to the /etc/wgetrc file, then uncomment the http_proxy and https_proxy lines, if necessary:

    cumulus@switch:~$ sudo nano /etc/wgetrc
    ...
    https_proxy = https://myproxy.domain.com:8080
    http_proxy = http://myproxy.domain.com:8080
    ...
    
  5. To execute the /etc/profile.d/proxy.sh file in the current environment, run the source command:

    cumulus@switch:~$ source /etc/profile.d/proxy.sh
    

Use the echo command to confirm the configuration:

Set up an apt package cache

HTTP API

Cumulus Linux implements an HTTP application programming interface to NCLU. Instead of accessing Cumulus Linux using SSH, you can interact with the switch using an HTTP client, such as cURL, HTTPie or web browser.

HTTP API Basics

Cumulus Linux includes the supporting software for the API. To use the REST API, you must enable nginx on the switch:

cumulus@switch:~$ sudo systemctl enable nginx; systemctl restart nginx

To enable the HTTP API service, run the following command:

cumulus@switch:~$ sudo systemctl enable restserver

To start or stop the service, run the following commands:

cumulus@switch:~$ sudo systemctl start restserver
cumulus@switch:~$ sudo systemctl stop restserver

To disable the service from running at startup, run the following command:

cumulus@switch:~$ sudo systemctl disable restserver

Each service runs as a background daemon.

Configure API Services

To configure the HTTP API services, edit the /etc/nginx/sites-available/nginx-restapi.conf configuration file, enter the IP address on which the REST API listens, then run the sudo systemctl restart nginx command.

IP and Port Settings

You can modify the IP and port combinations on which services listen by changing the parameters of the listen directives. By default, nginx-restapi.conf has only one listen parameter.

All URLs must use HTTPS instead of HTTP.

For more information on the listen directive, refer to the NGINX documentation.

Configure Security

Authentication

The default configuration requires all HTTP requests from external sources (not internal switch traffic) to set the HTTP Basic Authentication header.

The user and password must correspond to a user on the host switch.

Transport Layer Security

You secure all traffic in transport using TLSv1.2 by default. Cumulus Linux contains a self-signed certificate and private key used server-side in this application so that it works out of the box; however, NVIDIA recommends you use your own certificates and keys. Certificates must be in PEM format.

For step by step documentation on generating self-signed certificates and keys, and installing them to the switch, refer to the Ubuntu Certificates and Security documentation.

Do not copy the cumulus.pem or cumulus.key files. After installation, edit the ssl_certificate and ssl_certificate_key values in the configuration file for your hardware.

cURL Examples

This section includes several example cURL commands you can use to send HTTP requests to a host. The examples use the following settings:

For NCLU requests, set the Content-Type request header to application/json.

The cURL -k flag is necessary when the server uses a self-signed certificate. This is the default configuration (see the Security section). To display the response headers, include the -D flag in the command.

To retrieve a list of all available HTTP endpoints:

cumulus@switch:~$ curl -X GET -k -u user:pw https://192.168.0.32:8080

To run net show counters on the host as a remote procedure call:

cumulus@switch:~$ curl -X POST -k -u user:pw -H "Content-Type: application/json" -d '{"cmd": "show counters"}' https://192.168.0.32:8080/nclu/v1/rpc

To add a bridge using ML2:

cumulus@switch:~$ curl -X PUT -k -u user:pw https://192.168.0.32:8080/ml2/v1/bridge/"br1"/200

Considerations

The net show configuration files command output does not list the /etc/restapi.conf file.

Smart System Manager

Use Smart System Manager to upgrade and troubleshoot an active switch with minimal disruption to the network.

Smart System Manager includes the following modes:

  • The Smart System Manager NCLU commands do not require a net commit.

Restart Mode

You can restart the switch in one of the following modes.

The following command restarts the system in cold mode:

cumulus@switch:~$ net system maintenance restart cold
NVUE command is not supported.
cumulus@switch:~$ sudo csmgrctl -c

The following command restarts the system in fast mode:

cumulus@switch:~$ net system maintenance restart fast
NVUE command is not supported.
cumulus@switch:~$ sudo csmgrctl -f

The following command restarts the system in warm mode:

Warm boot resets any manually configured FEC settings.

cumulus@switch:~$ net system maintenance restart warm
NVUE command is not supported.
cumulus@switch:~$ sudo csmgrctl -w

Upgrade Mode

Upgrade mode updates all the components and services on the switch to the latest Cumulus Linux release without traffic loss. After upgrade is complete, you must restart the switch with either a cold or fast restart.

Upgrade mode includes the following options:

The following command upgrades all the system components:

cumulus@switch:~$ net system maintenance upgrade all
NVUE command is not supported.
cumulus@switch:~$ sudo csmgrctl -u

The following command provides information on the components you want to upgrade:

cumulus@switch:~$ net system maintenance upgrade dry-run
NVUE command is not supported.
cumulus@switch:~$ sudo csmgrctl -d

Maintenance Mode

Maintenance mode isolates the system from the rest of the network so that you can perform intrusive troubleshooting tasks and data collection or perform system changes, such as break out ports and replace optics or cables with minimal disruption.

Depending on your configuration and network topology, complete isolation is not possible.

Enable Maintenance Mode

Run the following command to enable maintenance mode. When maintenance mode is on, Smart System Manager performs a graceful BGP shutdown, redirects traffic over the peerlink and brings down the MLAG port link. switchd maintains full capability.

cumulus@switch:~$ net system maintenance mode enable
NVUE command is not supported.
cumulus@switch:~$ sudo csmgrctl -m1

You can run additional commands to bring all the ports down, then up to restore the port admin state.

cumulus@switch:~$ net system maintenance ports down
cumulus@switch:~$ net system maintenance ports up
NVUE command is not supported.
cumulus@switch:~$ sudo csmgrctl -p0
cumulus@switch:~$ sudo csmgrctl -p1

Before you disable maintenance mode, be sure to bring the ports back up.

Disable Maintenance Mode

Run the following command to disable maintenance mode and restore normal operation. When maintenance mode is off, Smart System Manager performs a soft restart, runs a BGP graceful restart, and brings the MLAG port link back up. switchd maintains full capability.

cumulus@switch:~$ net system maintenance mode disable
NVUE command is not supported.
cumulus@switch:~$ sudo csmgrctl -m0

Show Maintenance Mode Status

To see the status of maintenance mode, run the NCLU net system maintenance show status command or the Linux sudo csmgrctl -s command. For example:

cumulus@switch:~$ net system maintenance show status
Current System Mode: Maintenance since Tue Jan  5 00:13:37 2021 (Duration: 00:00:31)
 Boot Mode: reboot_cold  
 2 registered modules
 frr     : Maintenance, down
 switchd : Maintenance, down 

Layer 1 and Switch Ports

This section discusses layer 1 and switch port configuration.

Interface Configuration and Management

This section discusses how to configure and manage network interfaces.

Cumulus Linux uses ifupdown2 to manage network interfaces, which is a new implementation of the Debian network interface manager ifupdown.

Basic Commands

Bring Up the Physical Connection to an Interface

To bring up the physical connection to an interface or apply changes to an existing interface, run the sudo ifup <interface> command. The following example command brings up the physical connection to swp1:

cumulus@switch:~$ sudo ifup swp1

To bring down the physical connection to a single interface, run the sudo ifdown <interface> command. The following example command brings down the physical connection to swp1:

cumulus@switch:~$ sudo ifdown swp1

The ifdown command always deletes logical interfaces after bringing them down. When you bring down the physical connection to an interface, it comes back up automatically after you reboot the switch or apply configuration changes with ifreload -a.

By default, ifupdown is quiet. Use the verbose option (-v) to show commands as they execute when you bring an interface down or up.

Bring Up an Interface Administratively

When you bring an interface up or down administratively (admin up or admin down), you bring down a port, bridge, or bond but not the physical connection for the port, bridge, or bond.

When you put an interface into an admin down state, the interface remains down after any future reboots or configuration changes with ifreload -a.

To put an interface into an admin down state:

cumulus@switch:~$ net add interface swp1 link down
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

These commands create the following configuration in the /etc/network/interfaces file:

auto swp1
iface swp1
    link-down yes

To bring the interface back up:

cumulus@switch:~$ net del interface swp1 link down
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

To put an interface into an admin down state:

cumulus@switch:~$ nv set interface swp1 link state down
cumulus@switch:~$ nv config apply

To bring the interface back up:

cumulus@switch:~$ nv set interface swp1 link state up
cumulus@switch:~$ nv config apply

To put an interface into an admin down state:

cumulus@switch:~$ sudo ifdown swp1 --admin-state

To bring the interface back up:

cumulus@switch:~$ sudo ifup swp1 --admin-state

For additional information on interface administrative state and physical state, refer to this knowledge base article.

Interface Classes

ifupdown2 enables you to group interfaces into separate classes. A class is a user-defined label that groups interfaces that share a common function (such as uplink, downlink or compute). You specify classes in the /etc/network/interfaces file.

The most common class is auto, which you configure like this:

auto swp1
iface swp1

You can add other classes using the allow prefix. For example, if you have multiple interfaces used for uplinks, you can define a class called uplinks:

auto swp1
allow-uplink swp1
iface swp1 inet static
    address 10.1.1.1/31

auto swp2
allow-uplink swp2
iface swp2 inet static
    address 10.1.1.3/31

This allows you to perform operations on only these interfaces using the --allow=uplinks option. You can still use the -a options because these interfaces are also in the auto class:

cumulus@switch:~$ sudo ifup --allow=uplinks
cumulus@switch:~$ sudo ifreload -a

If you are using Management VRF, you can use the special interface class called mgmt and put the management interface into that class. The management VRF must have an IPv6 address in addition to an IPv4 address to work correctly.

allow-mgmt eth0
iface eth0 inet dhcp
    vrf mgmt

allow-mgmt mgmt
iface mgmt
    address 127.0.0.1/8
    address ::1/128
    vrf-table auto

All ifupdown2 commands (ifup, ifdown, ifquery, ifreload) can take a class. Include the --allow=<class> option when you run the command. For example, to reload the configuration for the management interface described above, run:

cumulus@switch:~$ sudo ifreload --allow=mgmt

Use the -a option to bring up or down all interfaces with the common auto class in the /etc/network/interfaces file.

To administratively bring up all interfaces marked auto, run:

cumulus@switch:~$ sudo ifup -a

To administratively bring down all interfaces marked auto, run:

cumulus@switch:~$ sudo ifdown -a

To reload all network interfaces marked auto, use the ifreload command. This command is equivalent to running ifdown then ifup; however, ifreload skips unchanged configurations:

cumulus@switch:~$ sudo ifreload -a

Cumulus Linux checks syntax by default. As a precaution, apply configurations only if the syntax check passes. Use the following compound command:

cumulus@switch:~$ sudo bash -c "ifreload -s -a && ifreload -a"

For more information, see the individual man pages for ifup(8), ifdown(8), ifreload(8).

Loopback Interface

Cumulus Linux has a preconfigured loopback interface. When the switch boots up, the loopback interface called lo is up and assigned an IP address of 127.0.0.1.

The loopback interface lo must always exist on the switch and must always be up.

To configure an IP address for the loopback interface:

cumulus@switch:~$ net add loopback lo ip address 10.10.10.1
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set interface lo ip address 10.10.10.1
cumulus@switch:~$ nv config apply

Edit the /etc/network/interfaces file to add an address line:

auto lo
iface lo inet loopback
    address 10.10.10.1

  • If the IP address has no subnet mask, it automatically becomes a /32 IP address. For example, 10.10.10.1 is 10.10.10.1/32.
  • You can configure multiple IP addresses for the loopback interface.

Child Interfaces

By default, ifupdown2 recognizes and uses any interface present on the system that is a dependent (child) of an interface (for example, a VLAN, bond, or physical interface). You do not need to list interfaces in the /etc/network/interfaces file unless the interfaces need specific configuration for MTU, link speed, and so on. If you need to delete a child interface, delete all references to that interface from the /etc/network/interfaces file.

In the following example, swp1 and swp2 do not need an entry in the interfaces file. The following stanzas in /etc/network/interfaces provide the exact same configuration:

With Child Interfaces Defined

auto swp1
iface swp1

auto swp2
iface swp2

auto bridge
iface bridge
    bridge-vlan-aware yes
    bridge-ports swp1 swp2
    bridge-vids 1-100
    bridge-pvid 1
    bridge-stp on

Without Child Interfaces Defined

auto bridge
iface bridge
    bridge-vlan-aware yes
    bridge-ports swp1 swp2
    bridge-vids 1-100
    bridge-pvid 1
    bridge-stp on

In the following example, swp1.100 and swp2.100 do not need an entry in the interfaces file. The following stanzas defined in /etc/network/interfaces provide the exact same configuration:

With Child Interfaces Defined

auto swp1.100
iface swp1.100

auto swp2.100
iface swp2.100

auto br-100
iface br-100
    address 10.0.12.2/24
    address 2001:dad:beef::3/64
    bridge-ports swp1.100 swp2.100
    bridge-stp on

Without Child Interfaces Defined

auto br-100
iface br-100
    address 10.0.12.2/24
    address 2001:dad:beef::3/64
    bridge-ports swp1.100 swp2.100
    bridge-stp on

Interface Dependencies

ifupdown2 understands interface dependency relationships. When you run ifup and ifdown with all interfaces, the commands always run with all interfaces in dependency order. When you run ifup and ifdown with the interface list on the command line, the default behavior is to not run with dependents; however, if there are any built-in dependents, they do come up or go down.

To run with dependents when you specify the interface list, use the --with-depends option. The --with-depends option walks through all dependents in the dependency tree rooted at the interface you specify. Consider the following example configuration:

auto bond1
iface bond1
    address 100.0.0.2/16
    bond-slaves swp29 swp30

auto bond2
iface bond2
    address 100.0.0.5/16
    bond-slaves swp31 swp32

auto br2001
iface br2001
    address 12.0.1.3/24
    bridge-ports bond1.2001 bond2.2001
    bridge-stp on

The ifup --with-depends br2001 command brings up all dependents of br2001: bond1.2001, bond2.2001, bond1, bond2, bond1.2001, bond2.2001, swp29, swp30, swp31, swp32.

cumulus@switch:~$ sudo ifup --with-depends br2001

The ifdown --with-depends br2001 command brings down all dependents of br2001: bond1.2001, bond2.2001, bond1, bond2, bond1.2001, bond2.2001, swp29, swp30, swp31, swp32.

cumulus@switch:~$ sudo ifdown --with-depends br2001

ifdown2 always deletes logical interfaces after bringing them down. Use the --admin-state option if you only want to administratively bring the interface up or down. In the above example, ifdown br2001 deletes br2001.

To guide you through which interfaces go down and come up, use the --print-dependency option.

For example, run ifquery --print-dependency=list -a to show the dependency list for all interfaces:

cumulus@switch:~$ sudo ifquery --print-dependency=list -a
lo : None
eth0 : None
bond0 : ['swp25', 'swp26']
bond1 : ['swp29', 'swp30']
bond2 : ['swp31', 'swp32']
br0 : ['bond1', 'bond2']
bond1.2000 : ['bond1']
bond2.2000 : ['bond2']
br2000 : ['bond1.2000', 'bond2.2000']
bond1.2001 : ['bond1']
bond2.2001 : ['bond2']
br2001 : ['bond1.2001', 'bond2.2001']
swp40 : None
swp25 : None
swp26 : None
swp29 : None
swp30 : None
swp31 : None
swp32 : None

To print the dependency list of a single interface, run the ifquery --print-dependency=list <interface> command.

To show the dependency information for an interface in dot format, run the ifquery --print-dependency=dot <interface> command. The following example command shows the dependency information for interface br2001 in dot format:

cumulus@switch:~$ sudo ifquery --print-dependency=dot br2001
/* Generated by GvGen v.0.9 (http://software.inl.fr/trac/wiki/GvGen) */
digraph G {
    compound=true;
    node1 [label="br2001"];
    node2 [label="bond1.2001"];
    node3 [label="bond2.2001"];
    node4 [label="bond1"];
    node5 [label="bond2"];
    node6 [label="swp29"];
    node7 [label="swp30"];
    node8 [label="swp31"];
    node9 [label="swp32"];
    node1->node2;
    node1->node3;
    node2->node4;
    node3->node5;
    node4->node6;
    node4->node7;
    node5->node8;
    node5->node9;
}

You can use dot to render the graph on an external system.

To print the dependency information of the entire interfaces file, run the following command:

cumulus@switch:~$ sudo ifquery --print-dependency=dot -a >interfaces_all.dot

Subinterfaces

On Linux, an interface is a network device that can be either physical, (for example, swp1) or virtual (for example, vlan100). A VLAN subinterface is a VLAN device on an interface, and the VLAN ID appends to the parent interface using dot (.) VLAN notation. For example, a VLAN with ID 100 that is a subinterface of swp1 is swp1.100. The dot VLAN notation for a VLAN device name is a standard way to specify a VLAN device on Linux.

A VLAN subinterface only receives traffic tagged for that VLAN; therefore, swp1.100 only receives packets that have a VLAN 100 tag on switch port swp1. Any packets that transmit from swp1.100 have a VLAN 100 tag.

In an MLAG configuration, the peer link interface that connects the two switches in the MLAG pair has a VLAN subinterface named 4094. The peerlink.4094 subinterface only receives traffic tagged for VLAN 4094.

If you are using a VLAN subinterface, do not add that VLAN under the bridge stanza.

Parent Interfaces

When you run ifup on a logical interface (like a bridge, bond, or VLAN interface), if the ifup creates the logical interface, it also tries to execute on the upper (or parent) interfaces of the interface.

Consider this example configuration:

auto br100
iface br100
    bridge-ports bond1.100 bond2.100

auto bond1
iface bond1
    bond-slaves swp1 swp2

If you run ifdown bond1, ifdown deletes bond1 and the VLAN interface on bond1 (bond1.100); it also removes bond1 from the bridge br100. Next, when you run ifup bond1, it creates bond1 and the VLAN interface on bond1 (bond1.100); it also executes ifup br100 to add the bond VLAN interface (bond1.100) to the bridge br100.

There can be cases where an upper interface (like br100) is not in the right state, which can result in warnings. The warnings are harmless.

If you want to disable these warnings, set skip_upperifaces=1 in the /etc/network/ifupdown2/ifupdown2.conf file.

With skip_upperifaces=1, you have to execute ifup on the upper interfaces. In this case, you must run ifup br100 after an ifup bond1 to add bond1 back to bridge br100.

If you specify a subinterface, such as swp1.100, then run ifup swp1.100, Cumulus Linux creates the swp1 interface automatically in the kernel. Consider also specifying the parent interface swp1. A parent interface is one where any physical layer configuration can reside, such as link-speed 1000 or link-duplex full. If you create only swp1.100 and not swp1, you cannot run ifup swp1.

Interface IP Addresses

You can specify both IPv4 and IPv6 addresses for the same interface.

For IPv6 addresses, you can create or modify the IP address for an interface using either :: or 0:0:0 notation. For example,both 2620:149:43:c109:0:0:0:5 and 2001:DB8::1/126 are valid.

The following example commands configure three IP addresses for swp1; two IPv4 addresses and one IPv6 address.

cumulus@switch:~$ net add interface swp1 ip address 12.0.0.1/30
cumulus@switch:~$ net add interface swp1 ip address 12.0.0.2/30
cumulus@switch:~$ net add interface swp1 ipv6 address 2001:DB8::1/126
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

These commands create the following code snippet in the /etc/network/interfaces file:

auto swp1
iface swp1
    address 12.0.0.1/30
    address 12.0.0.2/30
    address 2001:DB8::1/126
  • NCLU adds the address method and address family when needed:

    auto lo
    iface lo inet loopback
    
cumulus@switch:~$ nv set interface swp1 ip address 10.0.0.1/30
cumulus@switch:~$ nv set interface swp1 ip address 10.0.0.2/30
cumulus@switch:~$ nv set interface swp1 ip address 2001:DB8::1/126
cumulus@switch:~$ nv config apply

In the /etc/network/interfaces file, list all IP addresses under the iface section.

auto swp1
iface swp1
    address 10.0.0.1/30
    address 10.0.0.2/30
    address 2001:DB8::1/126

The address method and address family are not mandatory; they default to inet/inet6 and static. However, you must specify inet/inet6 when you are creating DHCP or loopback interfaces.

auto lo
iface lo inet loopback

To make non-persistent changes to interfaces at runtime, use ip addr add:

cumulus@switch:~$ sudo ip addr add 10.0.0.1/30 dev swp1
cumulus@switch:~$ sudo ip addr add 2001:DB8::1/126 dev swp1

To remove an addresses from an interface, use ip addr del:

cumulus@switch:~$ sudo ip addr del 10.0.0.1/30 dev swp1
cumulus@switch:~$ sudo ip addr del 2001:DB8::1/126 dev swp1

Interface Descriptions

You can add a description (alias) to an interface.

Interface descriptions also appear in the SNMP OID IF-MIB::ifAlias

  • Interface descriptions can have a maximum of 256 characters.
  • Avoid using apostrophes or non-ASCII characters. Cumulus Linux does not parse these characters.

The following example commands create a description for swp1:

cumulus@switch:~$ net add interface swp1 alias hypervisor_port_1
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
NVUE command is not supported.

In the /etc/network/interfaces file, add a description using the alias keyword:

cumulus@switch:~# sudo nano /etc/network/interfaces

auto swp1
iface swp1
    alias swp1 hypervisor_port_1

Interface Commands

You can specify user commands for an interface that run at pre-up, up, post-up, pre-down, down, and post-down.

You can add any valid command in the sequence to bring an interface up or down; however, limit the scope to network-related commands associated with the particular interface. For example, it does not make sense to install a Debian package on ifup of swp1, even though it is technically possible. See man interfaces for more details.

The following examples adds a command to an interface to enable proxy ARP:

cumulus@switch:~$ net add interface swp1 post-up echo 1 > /proc/sys/net/ipv4/conf/swp1/proxy_arp
cumulus@switch:~$ net add interface ip address 10.0.0.1/30
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

If your post-up command also starts, restarts, or reloads any systemd service, you must use the --no-block option with systemctl. Otherwise, that service or even the switch itself might hang after starting or restarting. For example, to restart the dhcrelay service after bringing up VLAN 100, first run:

cumulus@switch:~$ net add vlan 100 post-up systemctl --no-block restart dhcrelay.service

NVUE command is not supported.
cumulus@switch:~# sudo nano /etc/network/interfaces
auto swp1
iface swp1
    address 12.0.0.1/30
    post-up echo 1 > /proc/sys/net/ipv4/conf/swp1/proxy_arp

If your post-up command also starts, restarts, or reloads any systemd service, you must use the --no-block option with systemctl. Otherwise, that service or even the switch itself might hang after starting or restarting. For example, to restart the dhcrelay service after bringing up a VLAN, the /etc network/interfaces configuration looks like this:

auto bridge.100
iface bridge.100
    post-up systemctl --no-block restart dhcrelay.service

Source Interface File Snippets

Sourcing interface files helps organize and manage the /etc/network/interfaces file. For example:

cumulus@switch:~$ sudo cat /etc/network/interfaces
# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
iface eth0 inet dhcp

source /etc/network/interfaces.d/bond0

The contents of the sourced file used above are:

cumulus@switch:~$ sudo cat /etc/network/interfaces.d/bond0
auto bond0
iface bond0
    address 14.0.0.9/30
    address 2001:ded:beef:2::1/64
    bond-slaves swp25 swp26

Port Ranges

To specify port ranges in commands:

Use commas to separate different port ranges (for example, swp1-46,10-12):

cumulus@switch:~$ net add bridge bridge ports swp1-4,6,10-12
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

These commands produce the following snippet in the /etc/network/interfaces file. The file renders the list of ports individually.

...
auto bridge
iface bridge
    bridge-ports swp1 swp2 swp3 swp4 swp6 swp10 swp11 swp12
    bridge-vlan-aware yes
auto swp1
iface swp1
...

Use commas to separate different port ranges (for example, swp1-46,10-12):

cumulus@switch:~$ nv set interface swp1-4,6,10-12 bridge domain br_default
cumulus@switch:~$ nv config apply

Use the glob keyword to specify bridge ports and bond slaves:

auto br0
iface br0
    bridge-ports glob swp1-6.100

auto br1
iface br1
    bridge-ports glob swp7-9.100  swp11.100 glob swp15-18.100

Mako Templates

ifupdown2 supports Mako-style templates. The Mako template engine processes the interfaces file before parsing.

Use the template to declare cookie-cutter bridges and to declare addresses in the interfaces file:

%for i in [1,12]:
auto swp${i}
iface swp${i}
    address 10.20.${i}.3/24

  • In Mako syntax, use square brackets ([1,12]) to specify a list of individual numbers. Use range(1,12) to specify a range of interfaces.
  • To test your template and confirm it evaluates correctly, run mako-render /etc/network/interfaces.

To comment out content in Mako templates, use double hash marks (##). For example:

## % for i in range(1, 4):
## auto swp${i}
## iface swp${i}
## % endfor
##

For more Mako template examples, refer to this knowledge base article.

ifupdown Scripts

Unlike the traditional ifupdown system, ifupdown2 does not run scripts installed in /etc/network/*/ automatically to configure network interfaces.

To enable or disable ifupdown2 scripting, edit the addon_scripts_support line in the /etc/network/ifupdown2/ifupdown2.conf file. 1 enables scripting and 2 disables scripting. For example:

cumulus@switch:~$ sudo nano /etc/network/ifupdown2/ifupdown2.conf
# Support executing of ifupdown style scripts.
# Note that by default python addon modules override scripts with the same name
addon_scripts_support=1

ifupdown2 sets the following environment variables when executing commands:

Troubleshooting

To see the link and administrative state of an interface:

cumulus@switch:~$ net show interface swp1
cumulus@switch:~$ nv show interface swp1 link state

In the following example, swp1 is administratively UP and the physical link is UP (LOWER_UP).

cumulus@switch:~$ ip link show dev swp1
3: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 500
    link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff

To show the assigned IP address on an interface:

cumulus@switch:~$ net show interface swp1
cumulus@switch:~$ nv show interface swp1 ip address
cumulus@switch:~$ ip addr show swp1
3: swp1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 500
    link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
    inet 192.0.2.1/30 scope global swp1
    inet 192.0.2.2/30 scope global swp1
    inet6 2001:DB8::1/126 scope global tentative
        valid_lft forever preferred_lft forever

To show the description (alias) for an interface:

cumulus@switch$ net show interface swp1
    Name   MAC                Speed     MTU   Mode
--  ----   -----------------  -------   -----  ---------
UP  swp1   44:38:39:00:00:04  1G        1500   Access/L2
Alias
-----
hypervisor_port_1

To show the interface description (alias) for all interfaces on the switch:

cumulus@switch:~$ net show interface alias
State    Name            Mode              Alias
-----    -------------   -------------     ------------------
UP       bond01          LACP
UP       bond02          LACP
UP       bridge          Bridge/L2
UP       eth0            Mgmt
UP       lo              Loopback          loopback interface
UP       mgmt            Interface/L3
UP       peerlink        LACP
UP       peerlink.4094   SubInt/L3
UP       swp1            BondMember        hypervisor_port_1
UP       swp2            BondMember        to Server02
...

To show the interface description for all interfaces on the switch in JSON format, run the net show interface alias json command.

cumulus@switch$ nv show interface swp1
cumulus@switch$ ip link show swp1
3: swp1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT qlen 500
    link/ether aa:aa:aa:aa:aa:bc brd ff:ff:ff:ff:ff:ff
    alias hypervisor_port_1

Considerations

Even though ifupdown2 supports the inclusion of multiple iface stanzas for the same interface, use a single iface stanza for each interface. If you must specify more than one iface stanza; for example, if the configuration for a single interface comes from many places, like a template or a sourced file, make sure the stanzas do not specify the same interface attributes. Otherwise, you see unexpected behavior.

In the following example, swp1 is in two file: /etc/network/interfaces and /etc/network/interfaces.d/speed_settings. ifupdown2 parses this configuration because the same attributes are not in multiple iface stanzas.

cumulus@switch:~$ sudo cat /etc/network/interfaces

source /etc/network/interfaces.d/speed_settings

auto swp1
iface swp1
  address 10.0.14.2/24

cumulus@switch:~$ cat /etc/network/interfaces.d/speed_settings

auto swp1
iface swp1
  link-speed 1000
  link-duplex full

ifupdown2 and sysctl

For sysctl commands in the pre-up, up, post-up, pre-down, down, and post-down lines that use the $IFACE variable, if the interface name contains a dot (.), ifupdown2 does not change the name to work with sysctl. For example, the interface name bridge.1 does not convert to bridge/1.

ifupdown2 and the gateway Parameter

The default route that the gateway parameter creates in ifupdown2 does not install in FRRouting, therefore does not redistribute into other routing protocols. Define a static default route instead, which installs in FRR and redistributes, if needed.

The following shows an example of the /etc/network/interfaces file when you use a static route instead of a gateway parameter:

auto swp2
iface swp2
address 172.16.3.3/24
up ip route add default via 172.16.3.2

Interface Name Limitations

Interface names can be a maximum of 15 characters. You cannot use a number for the first character and you cannot include a dash (-) in the name. In addition, you cannot use any name that matches with the regular expression .{0,13}\-v.*.

If you encounter issues, remove the interface name from the /etc/network/interfaces file, then restart the networking.service.

cumulus@switch:~$ sudo nano /etc/network/interfaces
cumulus@switch:~$ sudo systemctl restart networking.service

IP Address Scope

ifupdown2 does not honor the configured IP address scope setting in the /etc/network/interfaces file and treats all addresses as global. It does not report an error. Consider this example configuration:

auto swp2
iface swp2
    address 35.21.30.5/30
    address 3101:21:20::31/80
    scope link

When you run ifreload -a on this configuration, ifupdown2 considers all IP addresses as global.

cumulus@switch:~$ ip addr show swp2
5: swp2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 74:e6:e2:f5:62:82 brd ff:ff:ff:ff:ff:ff
inet 35.21.30.5/30 scope global swp2
valid_lft forever preferred_lft forever
inet6 3101:21:20::31/80 scope global
valid_lft forever preferred_lft forever
inet6 fe80::76e6:e2ff:fef5:6282/64 scope link
valid_lft forever preferred_lft forever

To work around this issue, configure the IP address scope:

cumulus@switch:~$ net add interface swp6 post-up ip address add 71.21.21.20/32 dev swp6 scope site
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
NVUE command is not supported.

In the /etc/network/interfaces file, configure the IP address scope using post-up ip address add <address> dev <interface> scope <scope>. For example:

auto swp6
iface swp6
    post-up ip address add 71.21.21.20/32 dev swp6 scope site

Then run the ifreload -a command on this configuration.

The following configuration shows the correct scope:

cumulus@switch:~$ ip addr show swp6
9: swp6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 74:e6:e2:f5:62:86 brd ff:ff:ff:ff:ff:ff
inet 71.21.21.20/32 scope site swp6
valid_lft forever preferred_lft forever
inet6 fe80::76e6:e2ff:fef5:6286/64 scope link
valid_lft forever preferred_lft forever

Switch Port Attributes

Cumulus Linux exposes network interfaces for several types of physical and logical devices:

Each physical network interface (port) has several settings:

For Spectrum ASICs, MTU is the only port attribute you can directly configure. The Spectrum firmware configures FEC, link speed, duplex mode and auto-negotiation automatically, following a predefined list of parameter settings until the link comes up. However, you can disable FEC if necessary, which forces the firmware to not try any FEC options.

MTU

Interface MTU applies to traffic traversing the management port, front panel or switch ports, bridge, VLAN subinterfaces, and bonds (both physical and logical interfaces). MTU is the only interface setting that you must set manually.

In Cumulus Linux, ifupdown2 assigns 9216 as the default MTU setting. The initial MTU value set by the driver is 9238. After you configure the interface, the default MTU setting is 9216.

To change the MTU setting, run the following commands. The example command sets the MTU to 1500 for the swp1 interface.

cumulus@switch:~$ net add interface swp1 mtu 1500
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

These commands create the following code snippet in the /etc/network/interfaces file:

auto swp1
iface swp1
    mtu 1500
cumulus@switch:~$ nv set interface swp1 link mtu 1500
cumulus@switch:~$ nv config apply

Edit the /etc/network/interfaces file, then run the ifreload -a command.

cumulus@switch:~$ sudo nano /etc/network/interfaces

auto swp1
iface swp1
    mtu 1500
cumulus@switch:~$ sudo ifreload -a

Runtime Configuration (Advanced)

Run the ip link set command. The following example command sets the swp1 interface MTU to 1500.

cumulus@switch:~$ sudo ip link set dev swp1 mtu 1500

A runtime configuration is non-persistent; the configuration you create does not persist after you reboot the switch.

Set a Global Policy

To set a global MTU policy, create a policy document (called mtu.json). For example:

cumulus@switch:~$ sudo cat /etc/network/ifupdown2/policy.d/mtu.json
{
  "address": {"defaults": { "mtu": "9216" }
            }
}

The policies and attributes in any file in /etc/network/ifupdown2/policy.d/ override the default policies and attributes in /var/lib/ifupdown2/policy.d/.

Bridge MTU

The MTU setting is the lowest MTU of any interface that is a member of the bridge (every interface specified in bridge-ports in the bridge configuration of the /etc/network/interfaces file). You are not required to specify an MTU on the bridge. Consider this bridge configuration:

auto bridge
iface bridge
    bridge-ports bond1 bond2 bond3 bond4 peer5
    bridge-vids 100-110
    bridge-vlan-aware yes

For a bridge to have an MTU of 9000, set the MTU for each of the member interfaces (bond1 to bond 4, and peer5) to 9000 at minimum.

When configuring MTU for a bond, configure the MTU value directly under the bond interface; the member links or slave interfaces inherit the configured value. If you need a different MTU on the bond, set it on the bond interface, as this ensures the slave interfaces pick it up. You do not have to specify an MTU on the slave interfaces.

VLAN interfaces inherit their MTU settings from their physical devices or their lower interface; for example, swp1.100 inherits its MTU setting from swp1. Therefore, specifying an MTU on swp1 ensures that swp1.100 inherits the MTU setting for swp1.

If you are working with VXLANs, the MTU for a virtual network interface (VNI must be 50 bytes smaller than the MTU of the physical interfaces on the switch, as various headers and other data require those 50 bytes. Also, consider setting the MTU much higher than 1500.

The MTU for an SVI interface, such as vlan10, comes from the bridge. When you use NCLU to change the MTU for an SVI and the MTU setting is higher than it is for the other bridge member interfaces, the MTU for all bridge member interfaces changes to the new setting. If you need to use a mixed MTU configuration for SVIs, (if some SVIs have a higher MTU and some lower), set the MTU for all member interfaces to the maximum value, then set the MTU on the specific SVIs that need to run at a lower MTU.

To show the MTU setting for an interface:

cumulus@switch:~$ net show interface swp1
    Name    MAC                Speed      MTU  Mode
--  ------  -----------------  -------  -----  ---------
UP  swp1    44:38:39:00:00:04  1G        9216  Access/L2
cumulus@switch:~$ nv show interface swp1
cumulus@switch:~$ ip link show dev swp1
3: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc pfifo_fast state UP mode DEFAULT qlen 500
   link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff

Drop Packets that Exceed the Egress Layer 3 MTU

The switch forwards all packets that are within the MTU value set for the egress layer 3 interface. However, when packets are larger in size than the MTU value, the switch fragments the packets that do not have the DF bit set and drops the packets that do have the DF bit set.

In Cumulus Linux 4.4.1 and later, run the following command to drop all IP packets that are larger in size than the MTU value for the egress layer 3 interface instead of fragmenting packets:

cumulus@switch:~$ net add trap l3-mtu-err action off
cumulus@switch:~$ net commit
cumulus@switch:~$ echo "0 >" /cumulus/switchd/config/trap/l3-mtu-err/enable

FEC

Forward Error Correction (FEC) is an encoding and decoding layer that enables the switch to detect and correct bit errors introduced over the cable between two interfaces. The target IEEE bit error rate (BER) on high speed Ethernet links is 10-12. Because 25G transmission speeds can introduce a higher than acceptable BER on a link, FEC is often required to correct errors to achieve the target BER at 25G, 4x25G, 100G, and higher link speeds. The type and grade of a cable or module and the medium of transmission determine which FEC setting is necessary.

For the link to come up, the two interfaces on each end must use the same FEC setting.

FEC requires small latency overhead. For most applications, this small amount of latency is preferable to error packet retransmission latency.

The two FEC types are:

Cumulus Linux includes additional FEC options:

While Auto FEC is the default setting on the NVIDIA Spectrum switch, do not explicitly configure the fec auto option on the switch as this leads to a link flap whenever you run net commit or ifreload -a.

For 25G DAC, 4x25G Breakouts DAC and 100G DAC cables, the IEEE 802.3by specification creates 3 classes:

The IEEE classification specifies various dB loss measurements and minimum achievable cable length. You can build longer and shorter cables if they comply to the dB loss and BER requirements.

If a cable has a CA-25G-S classification and FEC is not on, the BER might be unacceptable in a production network. It is important to set the FEC according to the cable class (or better) to have acceptable bit error rates. See Determining Cable Class below.

You can check bit errors using cl-netstat (RX_ERR column) or ethtool -S (HwIfInErrors counter) after a large amount of traffic passes through the link. A non-zero value indicates bit errors. Expect error packets to be zero or extremely low compared to good packets. If a cable has an unacceptable rate of errors with FEC enabled, replace the cable.

For 25G, 4x25G Breakout, and 100G Fiber modules and AOCs, there is no classification of 25G cable types for dB loss, BER or length. Use FEC if the BER is low enough.

Cable Class of 100G and 25G DACs

You can determine the cable class for 100G and 25G DACs from the Extended Specification Compliance Code field (SFP28: 0Ah, byte 35, QSFP28: Page 0, byte 192) in the cable EEPROM programming.

For 100G DACs, most manufacturers use the 0x0Bh 100GBASE-CR4 or 25GBASE-CR CA-L value (the 100G DAC specification predates the IEEE 802.3by 25G DAC specification). Use RS FEC for 100G DAC; shorter or better cables might not need this setting.

A manufacturer’s EEPROM setting might not match the dB loss on a cable or the actual bit error rates that a particular cable introduces. Use the designation as a guide, but set FEC according to the bit error rate tolerance in the design criteria for the network. For most applications, the highest mutual FEC ability of both end devices is the best choice.

You can determine for which grade the manufacturer has designated the cable as follows.

For the SFP28 DAC, run the following command:

cumulus@switch:~$ sudo ethtool -m swp1 hex on | grep 0020 | awk '{ print $6}'
0c

The values at location 0x0024 are:

For the QSFP28 DAC, run the following command:

cumulus@switch:~$ sudo ethtool -m swp1s0 hex on | grep 00c0 | awk '{print $2}'
0b

The values at 0x00c0 are:

In each example below, the Compliance field comes from the method described above; the ethool -m output does not show it.

3meter cable that does not require FEC
(CA-N)
Cost: More expensive
Cable size: 26AWG (Note that AWG does not necessarily correspond to overall dB loss or BER performance)
Compliance Code: 25GBASE-CR CA-N

3meter cable that requires Base-R FEC
(CA-S)
Cost: Less expensive
Cable size: 26AWG
Compliance Code: 25GBASE-CR CA-S

When in doubt, consult the manufacturer directly to determine the cable classification.

Spectrum ASIC FEC Behavior

The firmware in a Spectrum ASIC applies an FEC configuration to 25G and 100G cables based on the cable type and whether the peer switch also has a Spectrum ASIC.

When the link is between two switches with Spectrum ASICs:

Cable Type
FEC Mode
25G optical cablesBase-R/FC-FEC
25G 1,2 meters: CA-N, loss <13dbBase-R/FC-FEC
25G 2.5,3 meters: CA-S, loss <16dbBase-R/FC-FEC
25G 2.5,3,4,5 meters: CA-L, loss > 16dbRS-FEC
100G DAC or opticalRS-FEC

When linking to a non-Spectrum peer, the firmware lets the peer decide. The Spectrum ASIC supports RS-FEC (for both 100G and 25G), Base-R/FC-FEC (25G only), or no-FEC (for both 100G and 25G).

Cable Type
FEC Mode
25G optical cablesLet peer decide
25G 1,2 meters: CA-N, loss <13dbLet peer decide
25G 2.5,3 meters: CA-S, loss <16dbLet peer decide
25G 2.5,3,4,5 meters: CA-L, loss > 16dbLet peer decide
100GLet peer decide: RS-FEC or No FEC

How Does Cumulus Linux use FEC?

A Spectrum switch enables FEC automatically when it powers up. The port firmware tests and determines the correct FEC mode to bring the link up with the neighbor. It is possible to get a link up to a switch without enabling FEC on the remote device as the switch eventually finds a working combination to the neighbor without FEC.

The following sections describe how to show the current FEC mode, and how to enable and disable FEC.

Show the Current FEC Mode

On a Spectrum switch, the --show-fec output shows you the current active state of FEC only if the link is up; if the FEC modes matches that of the neighbor. If the link is not up, the value displays None, which is not valid.

To show the FEC mode on a switch port, run the ethtool --show-fec <interface> command.

cumulus@switch:~$ sudo ethtool --show-fec swp1
FEC parameters for swp1:
Configured FEC encodings: Auto
Active FEC encoding: Off

Enable or Disable FEC

To enable Reed Solomon (RS) FEC on a link:

cumulus@switch:~$ sudo net add interface swp1 link fec rs
cumulus@switch:~$ sudo net pending
cumulus@switch:~$ sudo net commit
cumulus@switch:~$ nv set interface swp1 link fec rs
cumulus@switch:~$ nv config apply

Edit the /etc/network/interfaces file, then run the ifreload -a command. The following example enables RS FEC for the swp1 interface (link-fec rs):

cumulus@switch:~$ sudo nano /etc/network/interfaces

auto swp1
iface swp1
    link-autoneg off
    link-speed 100000
    link-fec rs
cumulus@switch:~$ sudo ifreload -a

Runtime Configuration (Advanced)

Run the ethtool --set-fec <interface> encoding RS command. For example:

cumulus@switch:~$ sudo ethtool --set-fec swp1 encoding RS

A runtime configuration is non-persistent. The configuration you create does not persist after you reboot the switch.

To enable Base-R/FireCode FEC on a link:

cumulus@switch:~$ sudo net add interface swp1 link fec baser
cumulus@switch:~$ sudo net pending
cumulus@switch:~$ sudo net commit
cumulus@switch:~$ nv set interface swp1 link fec baser
cumulus@switch:~$ nv config apply

Edit the /etc/network/interfaces file, then run the ifreload -a command. The following example enables Base-R FEC for the swp1 interface (link-fec baser):

cumulus@switch:~$ sudo nano /etc/network/interfaces

auto swp1
iface swp1
    link-autoneg off
    link-speed 100000
    link-fec baser
cumulus@switch:~$ sudo ifreload -a

Runtime Configuration (Advanced)

Run the ethtool --set-fec <interface> encoding baser command. For example:

cumulus@switch:~$ sudo ethtool --set-fec swp1 encoding BaseR

A runtime configuration is non-persistent. The configuration you create does not persist after you reboot the switch.

To enable FEC with Auto-negotiation:

You can use FEC with auto-negotiation on DACs only.

cumulus@switch:~$ sudo net add interface swp1 link autoneg on
cumulus@switch:~$ sudo net pending
cumulus@switch:~$ sudo net commit
cumulus@switch:~$ nv set interface swp1 link auto-negotiate on
cumulus@switch:~$ nv config apply

Edit the /etc/network/interfaces file to set auto-negotiation to on, then run the ifreload -a command:

cumulus@switch:~$ sudo nano /etc/network/interfaces

auto swp1
iface swp1
link-autoneg on
cumulus@switch:~$ sudo ifreload -a

Runtime Configuration (Advanced)

You can use ethtool to enable FEC with auto-negotiation. For example:

ethtool -s swp1 speed 10000 duplex full autoneg on

A runtime configuration is non-persistent. The configuration you create does not persist after you reboot the switch.

To show the FEC and auto-negotiation settings for an interface:

cumulus@switch:~$ sudo ethtool swp1 | egrep 'FEC|auto'
Supports auto-negotiation: Yes
Supported FEC modes: RS
Advertised auto-negotiation: Yes
Advertised FEC modes: RS
Link partner advertised auto-negotiation: Yes
Link partner advertised FEC modes: Not reported

To disable FEC on a link:

cumulus@switch:~$ sudo net add interface swp1 link fec off
cumulus@switch:~$ sudo net pending
cumulus@switch:~$ sudo net commit
cumulus@switch:~$ nv set interface swp1 link fec off
cumulus@switch:~$ nv config apply

To configure FEC to the default value, run the nv unset interface swp1 link fec command.

Edit the /etc/network/interfaces file, then run the ifreload -a command. The following example disables Base-R FEC for the swp1 interface (link-fec baser):

cumulus@switch:~$ sudo nano /etc/network/interfaces

auto swp1
iface swp1
link-fec off
cumulus@switch:~$ sudo ifreload -a

Runtime Configuration (Advanced)

Run the ethtool --set-fec <interface> encoding off command. For example:

cumulus@switch:~$ sudo ethtool --set-fec swp1 encoding off

A runtime configuration is non-persistent. The configuration you create does not persist after you reboot the switch.

Default Policies for Interface Settings

Instead of configuring settings for each individual interface, you can specify a policy for all interfaces on a switch or tailor custom settings for each interface. Create a file in /etc/network/ifupdown2/policy.d/ and populate the settings accordingly. The following example shows a file called address.json.

cumulus@switch:~$ cat /etc/network/ifupdown2/policy.d/address.json
{
    "ethtool": {
        "defaults": {
            "link-duplex": "full"
        },
        "iface_defaults": {
            "swp1": {
                "link-autoneg": "on",
                "link-speed": "1000"
          },
            "swp16": {
                "link-autoneg": "off",
                "link-speed": "10000"
            },
            "swp50": {
                "link-autoneg": "off",
                "link-speed": "100000",
                "link-fec": "rs"
            }
        }
    },
    "address": {
        "defaults": { "mtu": "9000" },
        "iface_defaults": {
            "eth0": {"mtu": "1500"}
        }
    }
}

Setting the default MTU also applies to the management interface. Be sure to add the iface_defaults to override the MTU for eth0, to remain at 9216.

Breakout Ports

Cumulus Linux supports the following ports breakout options:

18x SFP+ 25G and 4x QSFP28 100G interfaces only support NRZ encoding. You can set all speeds down to 1G.

All 4x QSFP28 ports can break out into 4x SFP28 or 2x QSFP28.

  • 18x 10G - 18x SFP28 set to 10G
  • 16x 10G - 4x QSFP28 configured as 4x25G breakouts and set to 10G

Maximum 10G ports: 34

  • 18x 25G - 18x SFP28 (native speed)
  • 16x 25G - 4x QSFP28 breakouts to 4x25G

Maximum 25G ports: 34

4x 40G - 4x QSFP28 set to 40G

Maximum 40G ports: 4

8x 50G - 4x QSFP28 break out into 2x 50G

Maximum 50G ports: 8

4x 100G - 4x QSFP28 (native speed)

Maximum 100G ports: 4

16x QSFP28 100G interfaces only support NRZ encoding. You can set all speeds down to 1G.

All QSFP28 ports can break out into 4x SFP28 or 2x QSFP28.

64x 10G - 16x QSFP28 break out into 4x 25G and set to 10G

Maximum 10G ports: 64

64x 25G - 16x QSFP28 break out into 4x 25G

Maximum 25G ports: 64

16x 40G - 4x QSFP28 set to 40G

Maximum 40G ports: 16

32x 50G - 16x QSFP28 break out into 2x 50G

Maximum 50G ports: 32

16x 100G - 16x QSFP28 (native speed)

Maximum 100G ports: 16

48x SFP28 25G and 8x QSFP28 100G interfaces only support NRZ encoding. You can set all speeds down to 1G.

The top 4x QSFP28 ports can break out into 4x SFP28. You cannot use the lower 4x QSFP28 disabled ports.

All 8x QSFP28 ports can break out into 2x QSFP28 without disabling ports.

  • 48x 10G - 48x SFP28 set to 10G
  • 16x 10G - 4x QSPF28 break out into 4x25G and set to 10G

Maximum 10G ports: 64

  • 48x 25G - 48x SFP28 (native speed)
  • 16x 25G - Top 4x QSFP28 break out into 4x25G (bottom 4x QSFP28 disabled)

Maximum 25G ports: 64

8x 40G - 8x QSFP28 set to 40G

Maximum 40G ports: 8

16x 50G - 8x QSFP28 break out into 2x 50G

Maximum 50G ports: 16

8x 100G - 16x QSFP28 (native speed)

Maximum 100G ports: 8

32x QSFP28 100G interfaces only support NRZ encoding. You can set all speeds down to 1G.

The top 16x QSFP28 ports can break out into 4x SFP28. You cannot use the lower 4x QSFP28 disabled ports.

All 32x QSFP28 ports can break out into 2x QSFP28 without disabling ports.

64x 10G - Top 16x QSFP28 break out into 4x 25G and set to 10G (bottom 16x QSFP28 disabled)

Maximum 10G ports: 64

64x 25G - Top 16x QSFP28 break out into 4x25G (bottom 16x QSFP28 disabled)

Maximum 25G ports: 64

32x 40G - 32x QSFP28 set to 40G

Maximum 40G ports: 32

64x 50G - 64x QSFP28 break out into 2x 50G

Maximum 50G ports: 64

32x 100G - 32x QSFP28 (native speed)

Maximum 100G ports: 32

48x SFP28 25G and 12x QSFP28 100G interfaces only support NRZ encoding.

All 12x QSFP28 ports can break out into 4x SFP28 or 2x QSFP28.

  • 48x 10G - 48x SFP28 set to 10G
  • 48x 10G - 12x QSPF28 break out into 4x 25G and set to 10G

Maximum 10G ports: 96

  • 48x 25G - 48x SFP28 (native speed)
  • 48x 25G - 12x QSPF28 break out into 4x 25G

Maximum 25G ports: 96

12x 40G - 12x QSFP28 set to 40G

Maximum 40G ports: 12

24x 50G - 12x QSFP28 break out into 2x 50G

Maximum 50G ports: 24

12x 100G - 12x QSFP28 (native speed)

Maximum 100G ports: 12

32x QSFP28 100G interfaces only support NRZ encoding.

All 32x QSFP28 ports can break out into 4x SFP28 or 2x QSFP28.

128x 10G - 32x QSFP28 break out into 4x 25G and set to 10G

Maximum 10G ports: 128

128x 25G - 32x QSFP28 break out into 4x 25G

Maximum 25G ports: 128

32x 40G - 32x QSFP28 set to 40G

Maximum 40G ports: 32

64x 50G - 32x QSFP28 break out into 2x 50G

Maximum 50G ports: 64

32x 100G - 32x QSFP28 (native speed)

Maximum 100G ports: 32

32x QSFP56 200G interfaces support both PAM4 and NRZ encodings.

For lower speed interface configurations, PAM4 is automatically converted to NRZ encoding.

All 32x QSFP56 ports can break out into 4xSFP56 or 2x QSFP56.

128x 10G - 32x QSFP56 break out into 4x 50G and set to 10G

Maximum 10G ports: 128

128x 25G - 32x QSFP56 break out into 4x 50G and set to 25G

Maximum 25G ports: 128

32x 40G - 32x QSFP56 set to 40G

Maximum 40G ports: 32

128x 50G - 32x QSFP56 break out into 4x 50G

Maximum 50G ports: 128

64x 100G - 32x QSFP56 break out into 2x 100G

Maximum 100G ports: 64

32x 200G - 32x QSFP56 (native speed)

Maximum 200G ports: 32

64x QSFP28 100G interfaces only support NRZ encoding.

Only 32x QSFP28 ports can break out into 4x SFP28. You must disable the adjacent QSFP28 port. Only the first and third or second and forth rows can break out into 4xSFP28.

All 64x QSFP28 ports can break out into 2x QSFP28 without disabling ports.

128x 10G - 32x QSFP28 break out into 4x 25G and set to 10G

Maximum 10G ports: 128

128x 25G - 32x QSFP28 break out into 4x 25G

Maximum 25G ports: 128

64x 40G - 64x QSFP28 set to 40G

Maximum 40G ports: 64

128x 50G - 64x QSFP28 break out into 2x 50G

Maximum 50G ports: 128

64x 100G - 64x QSFP28 (native speed)

Maximum 100G ports: 80

SN4700 32x QSFP-DD 400GbE interfaces support both PAM4 and NRZ encodings.

For lower speed interface configurations, PAM4 is automatically converted to NRZ encoding.

Only the top or the bottom 16x QSFP-DD ports can break out into 8x SFP56. You must disable the adjacent QSFP-DD port.

All 32x QSFP-DD ports can break out into 2x QSFP56 at 2x200G or 4x QSFP56 at 4x 100G without disabling ports.

128x 10G - 16x QSFP-DD break out into 8x 50G and set to 10G

Maximum 10G ports: 128

*Cumulus Linux supports other QSFP-DD breakout combinations up to maximum of 128x 10G ports.

128x 25G - 16x QSFP-DD break out into 8x 50G and set to 25G

Maximum 25G ports: 128

*Cumulus Linux supports other QSFP-DD breakout combinations up to maximum of 128x 25G ports.

32x 40G - 32x QSFP-DD set to 40G

Maximum 40G ports: 32

128x 50G - 16x QSFP-DD break out into 8x 50G

Maximum 50G ports: 128

*Cumulus Linux supports other QSFP-DD breakout combinations up to maximum of 128x 50G ports.

128x 100G - 32x QSFP-DD break out into 4x 100G

Maximum 100G ports: 128

64x 200G - 64x QSFP-DD break out into 2x 200G

Maximum 200G ports: 64

32x 400G - 32x QSFP-DD (native speed)

Maximum 400G ports: 32

  • You can use a single SFP (10/25/50G) transceiver in a QSFP (100/200/400G) port with QSFP-to-SFP Adapter (QSA). Set the port speed to the SFP speed with the nv set interface <interface> link speed <speed> command. Do not configure this port as a breakout port.

  • If you break out a port, then reload the switchd service on a switch running in nonatomic ACL mode, temporary disruption to traffic occurs while the ACLs reinstall.

  • Cumulus Linux does not support port ganging.

  • Switches with the Spectrum 1 ASIC have a limit of 64 logical ports. If you want to break ports out to 4x25G or 4x10G:

    • You can only break out odd-numbered ports into four logical ports.
    • You must disable the next even numbered port. For example, if you break out port 11 into four logical ports, you must disable port 12. These restrictions do not apply to a 2x50G breakout configuration or to the NVIDIA SN2100 and SN2010 switch.
  • Switches with the Spectrum 2 and Spectrum 3 ASIC have a limit of 128 logical ports. To ensure that the number of total logical interfaces does not exceed the limit, if you split ports into four interfaces on Spectrum 2 and Spectrum 3 switches with 64 interfaces, you must disable the adjacent port. For example, when splitting port 1 into four 25G interfaces, you must disable port 2 in the /etc/cumulus/ports.conf file:

    1=4x25G
    2=disabled
    

    When you split a port into two interfaces, such as 2x50G, you do not have to disable the adjacent port.

For valid port configuration and breakout guidance, see the /etc/cumulus/ports.conf file.

Configure a Breakout Port

To configure a breakout port:

This example command breaks out the 100G port on swp1 into four 25G ports:

cumulus@switch:~$ net add interface swp1 breakout 4x25G
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

To break out a port into four 10G ports, you must also disable the next port.

cumulus@switch:~$ net add interface swp2 breakout disabled
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

These commands break out swp1 into four 25G interfaces in the /etc/cumulus/ports.conf file and create four interfaces in the /etc/network/interfaces file:

cumulus@switch:~$ cat /etc/network/interfaces
...
auto swp1s0
iface swp1s0

auto swp1s1
iface swp1s1

auto swp1s2
iface swp1s2

auto swp1s3
iface swp1s3
...

This example command breaks out the 100G port on swp1 into four 25G ports:

cumulus@switch:~$ nv set interface swp1 link breakout 4x25G 
cumulus@switch:~$ nv config apply

To break out a port into four 10G ports, you must also disable the next port.

cumulus@switch:~$ nv set interface swp1 link breakout 4x10G
cumulus@switch:~$ nv set interface swp2 link breakout disabled
cumulus@switch:~$ nv config apply
  1. Edit the /etc/cumulus/ports.conf file to configure the port breakout. The following example breaks out the 100G port on swp1 into four 25G ports. (To break out swp1 into four 10G ports, use 1=4x10G.) You also need to disable the next port. The example also disables swp2.

    cumulus@switch:~$ sudo cat /etc/cumulus/ports.conf
    ...
    1=4x25G
    2=disabled
    3=100G
    4=100G
    ...
    
  2. Configure the breakout ports in the /etc/network/interfaces file. The following example shows the swp1 breakout ports (swp1s0, swp1s1, swp1s2, and swp1s3).

cumulus@switch:~$ sudo cat /etc/network/interfaces
...
auto swp1s0
iface swp1s0

auto swp1s1
iface swp1s1

auto swp1s2
iface swp1s2

auto swp1s3
iface swp1s3
...

Reload switchd with the sudo systemctl reload switchd.service command. The reload does not interrupt network services.

cumulus@switch:~$ sudo systemctl reload switchd.service

Remove a Breakout Port

To remove a breakout port:

  1. Run the net del interface <interface> command. For example:

    cumulus@switch:~$ net del interface swp1s0
    cumulus@switch:~$ net del interface swp1s1
    cumulus@switch:~$ net del interface swp1s2
    cumulus@switch:~$ net del interface swp1s3
    cumulus@switch:~$ net pending
    cumulus@switch:~$ net commit
    
  2. Manually edit the /etc/cumulus/ports.conf file to configure the interface for the original speed. For example:

    cumulus@switch:~$ sudo nano /etc/cumulus/ports.conf
    ...
    
    1=100G
    2=100G
    3=100G
    4=100G
    ...
    
  3. Reload switchd. The reload does not interrupt network services.

    cumulus@switch:~$ sudo systemctl reload switchd.service
    
  1. Run the nv unset interface <interface> command. For example:

    cumulus@switch:~$ nv unset interface swp1s0
    cumulus@switch:~$ nv unset interface swp1s1
    cumulus@switch:~$ nv unset interface swp1s2
    cumulus@switch:~$ nv unset interface swp1s3
    cumulus@switch:~$ nv config apply
    
  2. Run the nv unset interface <interface> link breakout command to configure the interface for the original speed. For example:

    cumulus@switch:~$ nv unset interface swp1 link breakout
    cumulus@switch:~$ nv config apply
    
  1. Edit the /etc/cumulus/ports.conf file to configure the interface for the original speed.

    cumulus@switch:~$ sudo nano /etc/cumulus/ports.conf
    ...
    
    1=100G
    2=100G
    3=100G
    4=100G
    ...
    
  2. Reload switchd. The reload does not interrupt network services.

    cumulus@switch:~$ sudo systemctl reload switchd.service
    

Logical Switch Port Limitations

100G and 40G switches can support a certain number of logical ports depending on the switch. Before you configure any logical ports on a switch, check the limitations listed in the /etc/cumulus/ports.conffile.

Troubleshooting

This section shows basic commands for troubleshooting switch ports. For a more comprehensive troubleshooting guide, see Troubleshoot Layer 1.

Statistics

To show high-level interface statistics, run the NCLU net show interface command. The NVUE Command is nv show interface.

cumulus@switch:~$ net show interface swp1

    Name    MAC                Speed      MTU  Mode
--  ------  -----------------  -------  -----  ---------
UP  swp1    44:38:39:00:00:04  1G        1500  Access/L2

Vlans in disabled State
-------------------------
br0

Counters      TX    RX
----------  ----  ----
errors         0     0
unicast        0     0
broadcast      0     0
multicast      0     0

LLDP
------  ----  ---------------------------
swp1    ====  44:38:39:00:00:03(server01)

To show low-level interface statistics, run the following ethtool command:

cumulus@switch:~$ sudo ethtool -S swp1
NIC statistics:
      HwIfInOctets: 21870
      HwIfInUcastPkts: 0
      HwIfInBcastPkts: 0
      HwIfInMcastPkts: 243
      HwIfOutOctets: 1148217
      HwIfOutUcastPkts: 0
      HwIfOutMcastPkts: 11353
      HwIfOutBcastPkts: 0
      HwIfInDiscards: 0
      HwIfInL3Drops: 0
      HwIfInBufferDrops: 0
      HwIfInAclDrops: 0
      HwIfInBlackholeDrops: 0
      HwIfInDot3LengthErrors: 0
      HwIfInErrors: 0
      SoftInErrors: 0
      HwIfOutErrors: 0
      HwIfOutQDrops: 0
      HwIfOutNonQDrops: 0
      SoftOutErrors: 0
      SoftOutDrops: 0
      SoftOutTxFifoFull: 0
      HwIfOutQLen: 0

Query SFP Port Information

To verify SFP settings, run the ethtool -m command. The following example shows the vendor, type and power output for the swp1 interface.

cumulus@switch:~$ sudo ethtool -m swp1 | egrep 'Vendor|type|power\s+:'
        Transceiver type                          : 10G Ethernet: 10G Base-LR
        Vendor name                               : FINISAR CORP.
        Vendor OUI                                : 00:90:65
        Vendor PN                                 : FTLX2071D327
        Vendor rev                                : A
        Vendor SN                                 : UY30DTX
        Laser output power                        : 0.5230 mW / -2.81 dBm
        Receiver signal average optical power     : 0.7285 mW / -1.38 dBm

Considerations

Auto-negotiation and FEC

If auto-negotiation is off on 100G and 25G interfaces, you must set FEC to OFF, RS, or BaseR to match the neighbor. The FEC default setting of auto does not link up when auto-negotiation is off.

Port Speed and the ifreload -a Command

When configuring port speed or break outs in the /etc/cumulus/ports.conf file, you need to run the ifreload -a command to reload the configuration after restarting switchd in the following cases:

Port Speed Configuration

If you change the port speed in the /etc/cumulus/ports.conf file but the speed for that port is also in the /etc/network/interfaces file, after you edit the /etc/cumulus/ports.conf file and restart switchd, you must also run the ifreload -a command.

1000BASE-T SFP Modules Supported Only on Certain 25G Platforms

The following 25G switches support 1000BASE-T SFP modules:

100G or faster switches do not support 1000BASE-T SFP modules.

After rebooting the NVIDIA SN2100 switch, eth0 always has a speed of 100Mb/s. If you bring the interface down and then back up again, the interface negotiates 1000Mb. This only occurs the first time the interface comes up.

To work around this issue, add the following commands to the /etc/rc.local file to flap the interface automatically when the switch boots:

modprobe -r igb
sleep 20
modprobe igb

Delay in Reporting Interface as Operational Down

When you remove two transceivers simultaneously from a switch, both interfaces show the carrier down status immediately. However, it takes one second for the second interface to show the operational down status. In addition, the services on this interface also take an extra second to come down.

NVIDIA Spectrum-2 Switches and FEC Mode

The NVIDIA Spectrum-2 (25G) switch only supports RS FEC.

ifplugd

ifplugd is an Ethernet link-state monitoring daemon that executes scripts to configure an Ethernet device when you plug in or remove a cable. Follow the steps below to install and configure the ifplugd daemon.

Install ifplugd

You can install this package even if the switch does not connect to the internet. The package is in the cumulus-local-apt-archive repository on the Cumulus Linux image.

To install ifplugd:

  1. Update the switch before installing the daemon:

    cumulus@switch:~$ sudo -E apt-get update
    
  2. Install the ifplugd package:

    cumulus@switch:~$ sudo -E apt-get install ifplugd
    

Configure ifplugd

After you install ifplugd, you must edit two configuration files:

The example configuration below configures ifplugd to bring down all uplinks when the peer bond goes down in an MLAG environment.

  1. Open /etc/default/ifplugd in a text editor and configure the file as appropriate. Add the peerbond name before you save the file.

    INTERFACES="peerbond"
    HOTPLUG_INTERFACES=""
    ARGS="-q -f -u0 -d1 -w -I"
    SUSPEND_ACTION="stop"
    
  2. Open the /etc/ifplugd/action.d/ifupdown file in a text editor. Configure the script, then save the file.

    #!/bin/sh
    set -e
    case "$2" in
    up)
            clagrole=$(clagctl | grep "Our Priority" | awk '{print $8}')
            if [ "$clagrole" = "secondary" ]
            then
                #List all the interfaces below to bring up when clag peerbond comes up.
                for interface in swp1 bond1 bond3 bond4
                do
                    echo "bringing up : $interface"  
                    ip link set $interface up
                done
            fi
        ;;
    down)
            clagrole=$(clagctl | grep "Our Priority" | awk '{print $8}')
            if [ "$clagrole" = "secondary" ]
            then
                #List all the interfaces below to bring down when clag peerbond goes down.
                for interface in swp1 bond1 bond3 bond4
                do
                    echo "bringing down : $interface"
                    ip link set $interface down
                done
            fi
        ;;
    esac
    
  3. Restart the ifplugd daemon to implement the changes:

    cumulus@switch:~$ sudo systemctl restart ifplugd.service
    

Considerations

The default shell for ifplugd is dash (/bin/sh) instead of bash, as it provides a faster and more nimble shell. However, dash contains fewer features than bash (for example, dash is unable to handle multiple uplinks).

Quality of Service

This section refers to frames for all internal QoS functionality. Unless explicitly stated, the actions are independent of layer 2 frames or layer 3 packets.

Cumulus Linux supports several different QoS features and standards including:

Cumulus Linux uses two configuration files for QoS:

Cumulus Linux 4.4 does not use the traffic.conf and datapath.conf files but uses the qos_features.conf and qos_infra.conf files instead. Review your existing QoS configuration to determine the changes you need to make.

switchd and QoS

You apply QoS changes to the ASIC with the following command:

cumulus@switch:~$ sudo systemctl reload switchd.service

Unlike the restart command, the reload switchd.service command does not impact traffic forwarding except when the following conditions occur:

These conditions require modifications to the ASIC buffer which might result in momentary packet loss.

When you run the reload switchd.service command, Cumulus Linux always runs the Syntax Checker before applying changes.

Classification

When a frame or packet arrives on the switch, Cumulus Linux maps it to an internal COS value. This value never writes to the frame or packet but classifies and schedules traffic internally through the switch.

You can define which values are trusted in the qos_features.conf file by configuring the traffic.packet_priority_source_set setting.

The traffic.port_default_priority setting accepts a value between 0 and 7 and defines the internal COS marking to use with the port value.

If traffic.packet_priority_source_set is cos or dscp, you can map the ingress values to an internal COS value.

To apply the settings, reload the switchd service. See the switchd section for more information.

The following table describes the default classifications for various frame and packet_priority_source_set configurations:

packet_priority_source_set settingVLAN Tagged?IP or Non-IPResult
802.1pYesIPAccept incoming 802.1p COS marking.
802.1pYesNon-IPAccept incoming 802.1p COS marking.
802.1pNoIPUse the port_default_priority setting.
802.1pNoNon-IPUse the port_default_priority setting.
dscpYesIPAccept incoming DSCP IP header marking.
dscpYesNon-IPUse the port_default_priority setting.
dscpNoIPAccept incoming DSCP IP header marking.
dscpNoNon-IPUse the port_default_priority setting.
802.1p, dscpYesIPAccept incoming DSCP IP header marking.
802.1p, dscpYesNon-IPAccept incoming 802.1p COS marking.
802.1p, dscpNoIPAccept incoming DSCP IP header marking.
802.1p, dscpNoNon-IPUse the port_default_priority setting.
portEitherEitherIgnore any existing markings and use port_default_priority setting.

Trust COS

To trust ingress COS markings, set traffic.packet_priority_source_set = [802.1p].

When COS is trusted, the following lines classify ingress COS values to internal COS values:

traffic.cos_0.priority_source.8021p = [0]
traffic.cos_1.priority_source.8021p = [1]
traffic.cos_2.priority_source.8021p = [2]
traffic.cos_3.priority_source.8021p = [3]
traffic.cos_4.priority_source.8021p = [4]
traffic.cos_5.priority_source.8021p = [5]
traffic.cos_6.priority_source.8021p = [6]
traffic.cos_7.priority_source.8021p = [7]

The traffic.cos_ number is the internal COS value; for example traffic.cos_0 defines the mapping for internal COS 0.
To map ingress COS 0 to internal COS 4, configure the traffic.cos_4.priority_source.8021p setting.

You can map multiple ingress COS values to the same internal value. For example, to map ingress COS values 0, 1, and 2 to internal COS 0:

traffic.cos_0.priority_source.8021p = [0, 1, 2]

You can also choose not to use an internal COS value. This example does not use internal COS values 3 and 4.

traffic.cos_0.priority_source.8021p = [0]
traffic.cos_1.priority_source.8021p = [1]
traffic.cos_2.priority_source.8021p = [2,3,4]
traffic.cos_3.priority_source.8021p = []
traffic.cos_4.priority_source.8021p = []
traffic.cos_5.priority_source.8021p = [5]
traffic.cos_6.priority_source.8021p = [6]
traffic.cos_7.priority_source.8021p = [7]

You can configure additional settings using Port Groups.

To apply the settings, reload the switchd service. See the switchd section for more information.

Trust DSCP

To trust ingress DSCP markings, configure traffic.packet_priority_source_set = [dscp].

If DSCP is trusted, the following lines classify ingress DSCP values to internal COS values:

traffic.cos_0.priority_source.dscp = [0,1,2,3,4,5,6,7]
traffic.cos_1.priority_source.dscp = [8,9,10,11,12,13,14,15]
traffic.cos_2.priority_source.dscp = [16,17,18,19,20,21,22,23]
traffic.cos_3.priority_source.dscp = [24,25,26,27,28,29,30,31]
traffic.cos_4.priority_source.dscp = [32,33,34,35,36,37,38,39]
traffic.cos_5.priority_source.dscp = [40,41,42,43,44,45,46,47]
traffic.cos_6.priority_source.dscp = [48,49,50,51,52,53,54,55]
traffic.cos_7.priority_source.dscp = [56,57,58,59,60,61,62,63]

The # in the configuration file is a comment. By default, the file comments out the traffic.cos_*.priority_source.dscp lines.
You must uncomment them for them to take effect.

The traffic.cos_ number is the internal COS value; for example traffic.cos_0 defines the mapping for internal COS 0. To map ingress DSCP 22 to internal COS 4, configure the traffic.cos_4.priority_source.dscp setting.

You can map multiple ingress DSCP values to the same internal COS value. For example, to map ingress DSCP values 10, 21, and 36 to internal COS 0:

traffic.cos_0.priority_source.dscp = [10,21,36]

You can also choose not to use an internal COS value. This example does not use internal COS values 3 and 4:

traffic.cos_0.priority_source.dscp = [0,1,2,3,4,5,6,7]
traffic.cos_1.priority_source.dscp = [8,9,10,11,12,13,14,15]
traffic.cos_2.priority_source.dscp = [16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
traffic.cos_3.priority_source.dscp = []
traffic.cos_4.priority_source.dscp = []
traffic.cos_5.priority_source.dscp = [40,41,42,43,44,45,46,47,32,33,34,35,36,37,38,39]
traffic.cos_6.priority_source.dscp = [48,49,50,51,52,53,54,55]
traffic.cos_7.priority_source.dscp = [56,57,58,59,60,61,62,63]

You can configure additional settings using Port Groups.

To apply the settings, reload the switchd service. See the switchd section for more information.

Trust Port

To assign all traffic to an internal COS queue regardless of the ingress marking, configure traffic.packet_priority_source_set = [port].

The traffic.port_default_priority setting defines the COS value that all traffic uses. You can configure additional settings using Port Groups.

To apply the settings, reload the switchd service. See the switchd section for more information.

Mark and Remark Traffic

You can mark or remark traffic in two ways:

iptables

Cumulus Linux supports ACLs through ebtables, iptables or ip6tables for egress packet marking and remarking.

Cumulus Linux uses ebtables to mark layer 2, 802.1p COS values. Cumulus Linux uses iptables to match IPv4 traffic and ip6tables to match IPv6 traffic for DSCP marking.

For more information on configuring and applying ACLs, refer to Netfilter - ACLs.

Mark Layer 2 COS

You must use ebtables to match and mark layer 2 bridged traffic. You can match traffic with any supported ebtables rule.

To set the new COS value when traffic matches, use -A FORWARD -o <interface> -j setqos --set-cos <value>.

You can only set COS on a per-egress interface basis. Cumulus Linux does not support ebtables based matching on ingress.

The configured action always has the following conditions:

For example, to set traffic leaving interface swp5 to COS value 4:

-A FORWARD -o swp5 -j setqos --set-cos 4

Mark Layer 3 DSCP

You must use iptables (for IPv4 traffic) or ip6tables (for IPv6 traffic) to match and mark layer 3 traffic.

You can match traffic with any supported iptable or ip6tables rule. To set the new COS or DSCP value when traffic is matches, use -A FORWARD -o <interface> -j setqos [--set-dscp <value> | --set-cos <value> | --set-dscp-class <name>].

The configured action always has the following conditions:

You can configure COS markings with --set-cos and a value between 0 and 7 (inclusive).

You can use only one of --set-dscp or --set-dscp-class.
--set-dscp supports decimal or hex DSCP values between 0 and 77. --set-dscp-class supports standard DSCP naming, described in RFC3260, including ef, be, CS and AF classes.

You can specify either --set-dscp or --set-dscp-class, but not both.

For example, to set traffic leaving interface swp5 to DSCP value 32:

-A FORWARD -o swp5 -j setqos --set-dscp 32

To set traffic leaving interface swp11 to DSCP class value CS6:

-A FORWARD -o swp11 -j setqos --set-dscp-class cs6

Ingress COS or DSCP for Marking

To remark COS or DSCP values, modify the traffic.packet_priority_remark_set value in the qos_features.conf file.

This configuration allows an internal COS value to determine the egress COS or DSCP value. For example, to enable the remarking of only DSCP values:

traffic.packet_priority_remark_set = [dscp]

You can remark both COS and DSCP with traffic.packet_priority_remark_set = [cos,dscp].

Remark COS

You remark COS with the priority_remark.8021p setting in the qos_features.conf file. The internal cos_ value determines the egress 802.1p COS remarking. For example, to remark internal COS 0 to egress COS 4:

traffic.cos_0.priority_remark.8021p = [4]

The # in the configuration file is a comment. The file comments out the traffic.cos_*.priority_remark.8021p lines by default.
You must uncomment them to set the configuration.

You can remap multiple internal COS values to the same external COS value. For example, to map internal COS 1 and internal COS 2 to external COS 3:

traffic.cos_1.priority_remark.8021p = [3]
traffic.cos_2.priority_remark.8021p = [3]

To apply the settings, reload the switchd service. See the switchd section for more information.

Remark DSCP

You remark DSCP with the priority_remark.dscp component of the qos_features.conf file. The internal cos_ value determines the egress DSCP remark. For example, to remark internal COS 0 to egress DSCP 22:

traffic.cos_0.priority_remark.dscp = [22]

The # in the configuration file is a comment. The file comments out the traffic.cos_*.priority_remark.dscp lines by default.
You must uncomment them to set the configuration.

You can remap multiple internal COS values to the same external DSCP value. For example, to map internal COS 1 and internal COS 2 to external DSCP 40:

traffic.cos_1.priority_remark.dscp = [40]
traffic.cos_2.priority_remark.dscp = [40]

You can configure additional settings using Port Groups.

To apply the settings, reload the switchd service. See the switchd section for more information.

Flow Control

Congestion control prevents traffic loss during times of congestion and helps identify which traffic to keep if you need to drop packets.

Cumulus Linux supports the following congestion control mechanisms:

Flow Control Buffers

Before configuring pause frames or PFC, set buffer pools and limits for lossless flows.

Edit the following lines in the /etc/mlx/datapath/qos/qos_infra.conf file:

  1. Modify the existing ingress_service_pool.0.percent and egress_service_pool.0.percent buffer allocation. Change the existing ingress setting to ingress_service_pool.0.percent = 50. Change the existing egress setting to egress_service_pool.0.percent = 50.

  2. Add the following lines to create a new service_pool, set flow_control to the service pool, and define buffer reservations:

ingress_service_pool.1.percent = 50.0
ingress_service_pool.1.mode = 1
egress_service_pool.1.percent = 50.0
egress_service_pool.1.mode = 1
egress_service_pool.1.infinite_flag = TRUE
#
flow_control.ingress_service_pool = 1
flow_control.egress_service_pool = 1
#
port.service_pool.1.ingress_buffer.reserved = 0
port.service_pool.1.ingress_buffer.dynamic_quota = ALPHA_1
port.service_pool.1.egress_buffer.reserved = 0
port.service_pool.1.egress_buffer.dynamic_quota = ALPHA_INFINITY
#
flow_control.ingress_buffer.dynamic_quota = ALPHA_1
flow_control.egress_buffer.reserved = 0
flow_control.egress_buffer.dynamic_quota = ALPHA_INFINITY

To apply the settings, reload the switchd service. See the switchd section for more information.

Pause Frames

Pause frames are an older congestion control mechanism that causes all traffic on a link between two devices (two switches or a host and switch) to stop transmitting during times of congestion. Pause frames start and stop depending on how congested the buffer is. The value that determines when pause frames start is the xoff value (transmit off). When the buffer congestion reaches the xoff point, the switch sends a pause frame to one or more neighbors. When congestion drops below the xon point (transmit on), the switch sends an updated pause frame so that the neighbor resumes sending traffic.

Use Priority Flow Control (PFC) instead of pause frames.

Before configuring pause frames, you must first modify the switch buffer allocation. Refer to Flow Control Buffers.

You configure pause frames on a per-direction, per-interface basis under the link_pause section of the qos_features.conf file.
Setting link_pause.pause_port_group.rx_enable = true receives pause frames to stop the switch from transmitting when requested. Setting link_pause.pause_port_group.tx_enable = true sends pause frames to request neighboring devices to stop transmitting. You can use pause frames for either receive (rx), transmit (tx), or both.

Cumulus Linux automatically enables or derives the following settings when link pause is on an interface with link_pause.port_group_list:

  • link_pause.pause_port_group.rx_enable
  • link_pause.pause_port_group.tx_enable
  • link_pause.pause_port_group.port_buffer_bytes
  • link_pause.pause_port_group.xoff_size
  • link_pause.pause_port_group.xon_delta

To process pause frames, you must enable link pause on the specific interfaces.

The following is an example link_pause configuration.

link_pause.port_group_list = [my_pause_ports]
link_pause.my_pause_ports.port_set = swp1-swp4,swp6

Pause frame buffer calculation is a complex topic that IEEE 802.1Q-2012 defines. This attempts to incorporate the delay between signaling congestion and the reception of the signal by the neighboring device. This calculation includes the delay that the PHY and MAC layers (interface delay) introduce as well as the distance between end points (cable length).

Incorrect cable length settings can cause wasted buffer space (triggering congestion too early) or packet drops (congestion occurs before flow control activates).

Unless NVIDIA support or engineering asks you to, do not change these values.

To apply the settings, reload the switchd service. See the switchd section for more information.

All Link Pause configuration options
ConfigurationExampleDescription
link_pause.port_group_listlink_pause.port_group_list = [my_pause_ports]Creates a port group to use with pause frame settings. In this example, the group is my_pause_ports.
link_pause.my_pause_ports.port_setlink_pause.my_pause_ports.port_set = swp1-swp4,swp6Define the set of interfaces to apply pause frame configuration. In this example, ports swp1, swp2, swp3, swp4 and swp6 have pause frame on.
link_pause.my_pause_ports.port_buffer_byteslink_pause.my_pause_ports.port_buffer_bytes = 25000The amount of reserved buffer space for the set of ports in the port group list (reserved from the global shared buffer).
link_pause.my_pause_ports.xoff_sizelink_pause.my_pause_ports.xoff_size = 10000Set the amount of reserved buffer to consume before thew switch sends a pause frame out of the set of interfaces in the port group list when transmitting pause frames is on. In this example, after you consume 10000 bytes of reserved buffer, the switch sends pause frames.
link_pause.my_pause_ports.xon_deltalink_pause.my_pause_ports.xon_delta = 2000The number of bytes below the xoff threshold that the buffer consumption must drop below before sending pause frame stops, if transmitting pause frames is on. In this example, the buffer congestion must reduce by 2000 bytes (to 8000 bytes) before pause frame stops.
link_pause.my_pause_ports.rx_enablelink_pause.my_pause_ports.tx_enable = trueEnable (true) or disable (false) sending pause frames. The default value is true. In this example, sending pause frames is on.
link_pause.my_pause_ports.tx_enablelink_pause.my_pause_ports.rx_enable = trueEnable (true) or disable (false) the switch to receive pause frames. The default value is true. In this example, the receiving pause frames is on.
link_pause.my_pause_ports.cable_lengthlink_pause.pause_port_group.cable_length = 5The length, in meters, of the cable that attaches to the port in the port group list. Cumulus Linux uses this value internally to determine the latency between generating a pause frame and receiving the pause frame. The default is 100 meters. In this example, the attached cable is 5 meters.

Priority Flow Control (PFC)

Priority flow control extends the capabilities of pause frames by the frames for a specific COS value instead of stopping all traffic on a link. If a switch supports PFC and receives a PFC pause frame for a given COS value, the switch stops transmitting frames from that queue, but continues transmitting frames for other queues.

PFC is typically used with RDMA over Converged Ethernet - RoCE. The RoCE section provides information to specifically deploy PFC and ECN for RoCE environments.

Before configuring PFC, first modify the switch buffer allocation according to Flow Control Buffers.

You configure PFC pause frames on a per-direction, per-interface basis under the pfc section of the qos_features.conf file.
Setting pfc.pfc_port_group.rx_enable = true supports the reception of PFC pause frames causing the switch to stop transmitting when requested.

Setting pfc.pfc_port_group.tx_enable = true supports the sending of PFC pause frames for the defined COS values, causing the switch to request neighboring devices to stop transmitting.

Cumulus Linux supports PFC pause frames for either receive (rx), transmit (tx), or both.

Cumulus Linux automatically enables or derives the following settings when PFC is on an interface with pfc.port_group_list:

  • pfc.pause_port_group.rx_enable
  • pfc.pause_port_group.tx_enable
  • pfc.pause_port_group.port_buffer_bytes
  • pfc.pause_port_group.xoff_size
  • pfc.pause_port_group.xon_delta

The following is an example pfc configuration.

pfc.port_group_list = [my_pfc_ports]
pfc.my_pfc_ports.cos_list = [3,5]
pfc.my_pfc_ports.port_set = swp1-swp4,swp6

PFC buffer calculation is a complex topic defined in IEEE 802.1Q-2012. This attempts to incorporate the delay between signaling congestion and receiving the signal by the neighboring device. This calculation includes the delay that the PHY and MAC layers (called the interface delay) introduce as well as the distance between end points (cable length).
Incorrect cable length settings cause wasted buffer space (triggering congestion too early) or packet drops (congestion occurs before flow control activates).

Unless directed by NVIDIA support or engineering, do not change these values.

To apply the settings, reload the switchd service. See the switchd section for more information.

All PFC configuration options
ConfigurationExampleExplanation
pfc.port_group_listpfc.port_group_list = [my_pfc_ports]Creates a port group to use with PFC pause frame settings. In this example, the group is my_pfc_ports.
pfc.my_pfc_ports.cos_listpfc.my_pfc_ports.cos_list = [3,5]Define the COS values that support sending PFC pause frames, if sending PFC pause frames is on. This example enables COS values 3 and 5 to send PFC pause frames.
pfc.my_pfc_ports.port_setpfc.my_pfc_ports.port_set = swp1-swp4,swp6Define the set of interfaces to which you want to apply PFC pause frame configuration. In this example, ports swp1, swp2, swp3, swp4 and swp6 have pause frame configurations on.
pfc.my_pfc_ports.port_buffer_bytespfc.my_pfc_ports.port_buffer_bytes = 25000The amount of reserved buffer space for the set of ports defined in the port group list (reserved from the global shared buffer).
pfc.my_pfc_ports.xoff_sizepfc.my_pfc_ports.xoff_size = 10000Set the amount of reserved buffer that the switch must consume before sending a PFC pause frame out the set of interfaces in the port group list, if sending pause frames is on. This example sends PFC pause frames after consuming 10000 bytes of reserved buffer.
pfc.my_pfc_ports.xon_deltapfc.my_pfc_ports.xon_delta = 2000The number of bytes below the xoff threshold that the buffer consumption must drop below before sending PFC pause frames stops, if sending pause frames is on. This example the buffer congestion must reduce by 2000 bytes (to 8000 bytes) before PFC pause frames stop.
pfc.my_pfc_ports.rx_enablepfc.my_pfc_ports.tx_enable = trueEnable (true) or disable (false) sending PFC pause frames. The default value is true. This example enables sending PFC pause frames.
pfc.my_pfc_ports.tx_enablepfc.my_pfc_ports.rx_enable = trueEnable (true) or disable (false) receiving PFC pause frames. You do not need to define the COS values for rx_enable. The switch receives any COS value. The default value is true. This example enables receiving PFC pause frames.
pfc.my_pfc_ports.cable_lengthpfc.my_pfc_ports.cable_length = 5The length, in meters, of the cable that attaches to the port in the port group list. Cumulus Linux uses this value internally to determine the latency between generating a PFC pause frame and receiving the PFC pause frame. The default is 10 meters. In this example, the cable is 5 meters.

Explicit Congestion Notification (ECN)

Unlike pause frames or PFC, ECN is an end-to-end flow control technology. Instead of telling adjacent devices to stop transmitting during times of buffer congestion, ECN sets the ECN bits of the transit IPv4 or IPv6 header to indicate to end-hosts that congestion might occur. As a result, the sending hosts reduce their sending rate until the transit switch no longer setss ECN bits.

You use ECN with RDMA over Converged Ethernet - RoCE. The RoCE section describes how to deploy PFC and ECN for RoCE environments.

ECN operates by having a transit switch mark packets between two end-hosts.

  1. Transmitting host indicates it is ECN-capable by setting the ECN bits in the outgoing IP header to 01 or 10
  2. If the buffer of a transit switch is greater than the configured min_threshold_bytes, the switch remarks the ECN bits to 11 indicating Congestion Encountered or CE.
  3. The receiving host marks any reply packets, like a TCP-ACK, as CE (11).
  4. The original transmitting host reduces its transmission rate.
  5. When the switch buffer congestion falls below the configured min_threshold_bytes, the switch stops remarking ECN bits, setting them back to 01 or 10.
  6. A receiving host reflects this new ECN marking in the next reply so that the transmitting host resumes sending at normal speeds.

This is the default ECN configuration:

default_ecn_conf.egress_queue_list = [0]
default_ecn_conf.ecn_enable = true
default_ecn_conf.min_threshold_bytes = 150000
default_ecn_conf.max_threshold_bytes = 1500000
default_ecn_conf.probability = 100

To apply the settings, reload the switchd service. See the switchd section for more information.

All ECN configuration options
ConfigurationExampleExplanation
default_ecn_conf.egress_queue_listdefault_ecn_conf.egress_queue_list = [0]The list of ECN enabled queues. By default a single queue exists.
default_ecn_conf.ecn_enabledefault_ecn_conf.ecn_enable = trueEnable (true) or disable (false) ECN bit mmarking.
default_ecn_conf.min_threshold_bytesdefault_ecn_conf.min_threshold_bytes = 150000The minimum threshold of the buffer in bytes. Random ECN marking starts when buffer congestion crosses this threshold. The default_ecn_conf.probability value determines if ECN marking occurs.
default_ecn_conf.max_threshold_bytesdefault_ecn_conf.max_threshold_bytes = 1500000The maximum threshold of the buffer in bytes. Cumulus Linux marks all ECN-capable packets when buffer congestion crosses this threshold.
default_ecn_conf.probabilitydefault_ecn_conf.probability = 100The probability, in percent, that Cumulus Linux marks an ECN-capable packet when buffer congestion is between the default_ecn_conf.min_threshold_bytes and default_ecn_conf.max_threshold_bytes. The default is 100 (marks all ECN-capable packets).
default_ecn_conf.red_enabledefault_ecn_conf.red_enable = falseEnable or disable Random Early Detection. The default value is false.

Random Early Detection (RED)

ECN prevents packet drops in the network due to congestion by signaling hosts to transmit less. However, if congestion continues after ECN marking, packets drop after the switch buffer is full. By default, Cumulus Linux tail-drops packets when the buffer is full.

You can configure Random Early Detection (RED) to drop packets that are in the queue randomly instead of always dropping the last arriving packet. This might improve overall performance of TCP based flows.

To configure RED, change the value of default_ecn_conf.red_enable to true.

default_ecn_conf.red_enable = true

To apply the settings, reload the switchd service. See the switchd section for more information.

Egress Queues

Cumulus Linux supports eight egress queues to provide different classes of service.

You configure egress queues in the following section of the qos_infra.conf file.

cos_egr_queue.cos_0.uc  = 0
cos_egr_queue.cos_1.uc  = 1
cos_egr_queue.cos_2.uc  = 2
cos_egr_queue.cos_3.uc  = 3
cos_egr_queue.cos_4.uc  = 4
cos_egr_queue.cos_5.uc  = 5
cos_egr_queue.cos_6.uc  = 6
cos_egr_queue.cos_7.uc  = 7

By default internal COS values map directly to the matching egress queue. For example, cos_egr_queue.cos_0.uc = 0 maps internal COS value 0 to egress queue 0.

You can remap queues by changing the .cos_ to the corresponding queue value. For example, to assign internal COS 2 to queue 7 cos_egr_queue.cos_2.uc = 7, map multiple internal COS values to a single egress queue. You do not have to assign all egress queues.

Egress Schedules

Cumulus Linux supports 802.1Qaz, Enhanced Transmission Selection, which allows the switch to assign bandwidth to egress queues and then schedule the transmission of traffic from each queue. 802.1Qaz supports Priority Queuing.

You configure the egress scheduling policy in the following section of the qos_features.conf file:

default_egress_sched.egr_queue_0.bw_percent = 12
default_egress_sched.egr_queue_1.bw_percent = 13
default_egress_sched.egr_queue_2.bw_percent = 12
default_egress_sched.egr_queue_3.bw_percent = 13
default_egress_sched.egr_queue_4.bw_percent = 12
default_egress_sched.egr_queue_5.bw_percent = 13
default_egress_sched.egr_queue_6.bw_percent = 12
default_egress_sched.egr_queue_7.bw_percent = 13

The egr_queue_ value defines the egress queue where you want to assign bandwidth. For example, default_egress_sched.egr_queue_0 defines the bandwidth allocation for egress queue 0.

The combined total of values you assign to bw_percent must be less than or equal to 100.

If you do not define a queue, there is no bandwidth reservation.

A value of 0 uses strict priority scheduling. This queue always processes ahead of other queues.

The use of strict priority does not define a maximum bandwidth allocation. This can lead to starvation of other queues.

Configured schedules apply on a per-interface basis. Using the default_egress_sched applies the settings to all ports. To customize the scheduler for other interfaces, configure a port_group.

All egress scheduling options
ConfigurationExampleExplanation
default_egress_sched.egr_queue_0.bw_percentdefault_egress_sched.egr_queue_0.bw_percent = 12Define the bandwidth percentage for queue 0.
default_egress_sched.egr_queue_1.bw_percentdefault_egress_sched.egr_queue_1.bw_percent = 13Define the bandwidth percentage for queue 1.
default_egress_sched.egr_queue_2.bw_percentdefault_egress_sched.egr_queue_2.bw_percent = 0Define the bandwidth percentage for queue 2. In this example, a value of 0 means strict priority scheduling.
default_egress_sched.egr_queue_3.bw_percentdefault_egress_sched.egr_queue_3.bw_percent = 13Define the bandwidth percentage for queue 3.
default_egress_sched.egr_queue_4.bw_percentdefault_egress_sched.egr_queue_4.bw_percent = 12Define the bandwidth percentage for queue 4.
default_egress_sched.egr_queue_5.bw_percentdefault_egress_sched.egr_queue_5.bw_percent = 13Define the bandwidth percentage for queue 5.
default_egress_sched.egr_queue_6.bw_percentdefault_egress_sched.egr_queue_6.bw_percent = 12Define the bandwidth percentage for queue 6.
default_egress_sched.egr_queue_7.bw_percentdefault_egress_sched.egr_queue_7.bw_percent = 13Define the bandwidth percentage for queue 7.

Policing and Shaping

Traffic shaping and policing control the rate at which the switch sends or receives traffic on a network to prevent congestion.

Traffic shaping typically occurs at egress and traffic policing at ingress.

Shaping

Traffic shaping allows a switch to send traffic at an average bitrate lower than the physical interface. Traffic shaping prevents a receiving device from dropping bursty traffic if the device is either not capable of that rate of traffic or has a policer that limits what it accepts; for example, an ISP.

Traffic shaping works by holding packets in the buffer and releasing them at time intervals called the tc.

Cumulus Linux supports two levels of hierarchical traffic shaping: one at the egress queue level and one at the port level. This allows for minimum and maximum bandwidth guarantees for each egress-queue and a defined interface traffic shaping rate.

You configure traffic shaping in the shaping section of the qos_features.conf file. Traffic shaping configuration supports Port Groups so that you can apply different shaping profiles to different ports.

Cumulus Linux base the egr_queue value on the configured egress queue.

This is an example traffic shaping configuration:

shaping.port_group_list = [shaper_port_group]
shaping.shaper_port_group.port_set = swp1-swp3,swp5
shaping.shaper_port_group.egr_queue_0.shaper = [50000, 100000]
shaping.shaper_port_group.port.shaper = 900000
All Shaping configuration options
ConfigurationExampleExplanation
shaping.port_group_listshaping.port_group_list = [shaper_port_group]Creates a port group to use with traffic shaping settings. In this example, the group is shaper_port_group
shaping.shaper_port_group.port_setshaping.shaper_port_group.port_set = swp1-swp3,swp5Define the set of interfaces to which you want apply traffic shaping. This example enables traffic shaping on swp1, swp2, swp3 and swp5.
shaping.shaper_port_group.egr_queue_0.shapershaping.shaper_port_group.egr_queue_0.shaper = [50000, 100000]Applies a minimum and maximum bandwidth value in kbps for internal COS group 0. In this example, internal COS 0 always has at least 50000 kbps of bandwidth with a maximum of 100000 kbps.
shaping.shaper_port_group.port.shapershaping.shaper_port_group.port.shaper = 900000Applies the maximum packet shaper rate at the interface level. In this example, swp1, swp2, swp3, and swp5 do not transmit greater than 900000 kbps.

If you define a queue minimum shaping value of 0, there is no bandwidth guarantee for this queue. The maximum queue shaping value must not exceed the interface shaping value defined by port.shaper. The port.shaper value must not exceed the physical interface speed.

Policing

Traffic policing prevents an interface from receiving more traffic than intended. You use policing to enforce a maximum transmission rate on an interface. The switch drops any traffic you send above the policing level.

Cumulus Linux supports both a single-rate policer and a dual-rate policer (tricolor policer).

You configure traffic policing using ebtables, iptables, or ip6table rules.

For more information on configuring and applying ACLs, refer to Netfilter - ACLs.

Single-rate Policer

To configure a single-rate policer, use iptables JUMP action -j POLICE.

Cumulus Linux supports the following iptables flags with a single-rate policer.

iptables FlagDescription
--set-mode [pkt | KB]Define the policer to count packets or kilobytes.
--set-rate [<kbytes> | <packets>]The maximum rate of traffic in kilobytes or packets per second.
--set-burst <kilobytes>The allowed burst size in kilobytes.

For example, to create a policer to allow 400 packets per second with 100 packet burst:
-j POLICE --set-mode pkt --set-rate 400 --set-burst 100

Dual-rate Policer

To configure a policer, use the iptables JUMP action -j TRICOLORPOLICE.

Cumulus Linux supports the following iptables flags with a dual-rate policer.

iptables FlagDescription
--set-color-mode [blind | aware]Define the policing mode as single-rate (blind) or dual-rate (aware). The default is aware.
--set-cir <kbps>Committed information rate (CIR) in kilobits per second.
--set-cbs <kbytes>Committed burst size (CBS) in kilobytes.
--set-pir <kbps>Peak information rate (PIR) in kilobits per second.
--set-ebs <kbytes>Excess burst size (EBS) in kilobytes.
--set-conform-action-dscp <dscp value>The numerical DSCP value to mark for traffic that conforms to the policer rate.
--set-exceed-action-dscp <dscp value>The numerical DSCP value to mark for traffic that exceeds to the policer rate.
--set-violate-action-dscp <dscp value>The numerical DSCP value to mark for traffic that violates the policer rate.
--set-violate-action [accept | drop]Cumulus Linux either accepts and remarks, or drops packets that violate the policer rate.

For example, to configure a dual-rate, three-color policer, with a 3 Mbps CIR, 500 KB CBS, 10 Mbps PIR, and 1 MB EBS and drops packets that violate the policer:

-j TRICOLORPOLICE --set-color-mode blind --set-cir 3000 --set-cbs 500 --set-pir 10000 --set-ebs 1000 --set-violate-action drop

Port Groups

The qos_features.conf file supports port groups to apply similar QoS configurations to a set of ports. Cumulus Linux supports port groups for all features including ECN and RED .

  • Configurations with port groups override the global settings for the ingress ports in the port group.
  • Ports not in a port group use the global settings.
  • You can add all ports to a port_set with the allports value.

Trust and Marking

You define port groups with the source.port_group_list configuration in the qos_features.conf file.

A source.port_group_list is one or more names used for group settings. The name is a label for configuration settings. For example, if a source.port_group_list includes test, Cumulus Linux configures the following port_default_priority with source.test.port_default_priority.

The following is an example source.port_group_list configuration.

source.port_group_list = [customer1,customer2]
source.customer1.packet_priority_source_set = [dscp]
source.customer1.port_set = swp1-swp4,swp6
source.customer1.port_default_priority = 0
source.customer1.cos_0.priority_source.dscp = [0,1,2,3,4,5,6,7]
source.customer2.packet_priority_source_set = [cos]
source.customer2.port_set = swp5,swp7
source.customer2.port_default_priority = 0
source.customer2.cos_1.priority_source.8021p = [4]
ConfigurationExampleDescription
source.port_group_listsource.port_group_list = [customer1,customer2]Defines the names of the port groups you want to use (customer1 and customer2).
source.customer1.packet_priority_source_setsource.customer1.packet_priority_source_set = [dscp]Defines the ingress marking trust. In this example, ingress DSCP values are for group customer1.
source.customer1.port_setsource.customer1.port_set = swp1-swp4,swp6The set of ports to which to apply the ingress marking trust policy. In this example, ports swp1, swp2, swp3, swp4, and swp6 are for customer1.
source.customer1.port_default_prioritysource.customer1.port_default_priority = 0Define the default internal COS marking for unmarked or untrusted traffic. In this example, Cumulus Linux marks unmarked traffic or layer 2 traffic for customer1 ports with internal COS 0.
source.customer1.cos_0.priority_sourcesource.customer1.cos_0.priority_source.dscp = [0,1,2,3,4,5,6,7]Map the ingress DSCP values to an internal COS value for customer1. In this example, the set of DSCP values from 0 through 7 map to internal COS 0.
source.customer2.packet_priority_source_setsource.packet_priority_source_set = [cos]Defines the ingress marking trust for customer2. In this example, COS is trusted.
source.customer2.port_setsource.customer2.port_set = swp5,swp7The set of ports to which to apply the ingress marking trust policy. In this example, ports swp5 and swp7 apply for customer2.
source.customer2.port_default_prioritysource.customer2.port_default_priority = 0Define the default internal COS marking for unmarked or untrusted traffic. In this example, Cumulus Linux marks unmarked tagged layer 2 traffic or unmarked VLAN tagged traffic for customer1 ports with internal COS 0.
source.customer2.cos_0.priority_sourcesource.customer2.cos_1.priority_source.8021p = [4]Map the ingress COS values to an internal COS value for customer2. This example maps ingress COS value 4 to internal COS 1 .

To apply the settings, reload the switchd service. See the switchd section for more information.

Remarking

You can also use port groups to remark COS or DSCP on egress according to the internal COS value. You define these port groups with remark.port_group_list in the qos_features.conf file.

A remark.port_group_list includes the names for the group settings. The name is a label for configuration settings. For example, if a remark.port_group_list includes test, Cumulus Linux configures the following remark.port_set with remark.test.port_set.

The following is an example remark.group_list configuration.

remark.port_group_list = [list1,list2]
remark.list1.port_set = swp1-swp3,swp6
remark.list1.cos_3.priority_remark.dscp = [24]
remark.list2.packet_priority_remark_set = [802.1p]
remark.list2.port_set = swp9,swp10
remark.list2.cos_3.priority_remark.8021p = [2]
ConfigurationExampleExplanation
remark.port_group_listremark.port_group_list = [list1,list2]Defines the names of the port groups to use (list1 and list2).
remark.list1.packet_priority_remark_setremark.list1.packet_priority_remark_set = [dscp]Defines the egress marking to apply, 802.1p or dscp. This example rewrites the egress DSCP marking.
remark.list1.port_setremark.list1.port_set = swp1-swp3,swp6The set of ingress ports that receives frames or packets that has remarking, regardless of egress interface. This example remarks the egress DSCP values of traffic arriving on ports swp1, swp2, swp3 and swp6.
remark.list1.cos_3.priority_remark.dscpremark.list1.cos_3.priority_remark.dscp = [24]The egress DSCP value to write to the packet according to the internal COS value. In this example, traffic in internal COS 3 sets the egress DSCP to 24.
remark.list2.packet_priority_remark_setremark.list2.packet_priority_remark_set = [802.1p]Defines the egress marking to apply, cos or dscp. This example rewrites the egress COS marking.
remark.list2.port_setremark.list2.port_set = swp9,swp10The set of ingress ports that receives frames or packets with remarking, regardless of egress interface. This example remarks the egress COS values for traffic arriving on ports swp9 and swp10.
remark.list2.cos_4.priority_remark.8021premark.list1.cos_3.priority_remark.8021p = [2]The egress COS value to write to the frame according to the internal COS value. In this example, traffic in internal COS 4 sets the egress COS 2.

Egress Scheduling

You can use port groups with egress scheduling weights to assign different profiles to different egress ports. You define these port groups with egress_sched.port_group_list in the qos_features.conf file.

An egress_sched.port_group_list includes the names for the group settings. The name is a label for the configuration settings. For example, if an egress_sched.port_group_list includes test, Cumulus Linux configures the following egress_sched.port_set with egress_sched.test.port_set.

The following is an example egress_sched.group_list configuration:

egress_sched.port_group_list = [list1,list2]
egress_sched.list1.port_set = swp2
egress_sched.list1.egr_queue_0.bw_percent = 10
egress_sched.list1.egr_queue_1.bw_percent = 20
egress_sched.list1.egr_queue_2.bw_percent = 30
egress_sched.list1.egr_queue_3.bw_percent = 10
egress_sched.list1.egr_queue_4.bw_percent = 10
egress_sched.list1.egr_queue_5.bw_percent = 10
egress_sched.list1.egr_queue_6.bw_percent = 10
egress_sched.list1.egr_queue_7.bw_percent = 0
#
egress_sched.list2.port_set = [swp1,swp3,swp18]
egress_sched.list2.egr_queue_2.bw_percent = 50
egress_sched.list2.egr_queue_5.bw_percent = 50
egress_sched.list2.egr_queue_6.bw_percent = 0
ConfigurationExampleExplanation
egress_sched.port_group_listegress_sched.port_group_list = [list1,list2]Defines the names of the port groups to use (list1 and list2).
egress_sched.list1.port_setegress_sched.list1.port_set = swp2Assigns a port to a port group. In this example, swp2 is now part of port group list1.
egress_sched.list1.egr_queue_0.bw_percentegress_sched.list1.egr_queue_0.bw_percent = 10Assigns the percentage of bandwidth to egress queue 0. In this example, 10% of egress bandwidth.
egress_sched.list1.egr_queue_1.bw_percentegress_sched.list1.egr_queue_1.bw_percent = 20Assigns the percentage of bandwidth to egress queue 1. In this example, 20% of egress bandwidth.
egress_sched.list1.egr_queue_2.bw_percentegress_sched.list1.egr_queue_2.bw_percent = 30Assigns the percentage of bandwidth to egress queue 2. In this example, 13% of egress bandwidth.
egress_sched.list1.egr_queue_3.bw_percentegress_sched.list1.egr_queue_3.bw_percent = 10Assigns the percentage of bandwidth to egress queue 3. In this example, 10% of egress bandwidth.
egress_sched.list1.egr_queue_4.bw_percentegress_sched.list1.egr_queue_4.bw_percent = 10Assigns the percentage of bandwidth to egress queue 4. In this example, 10% of egress bandwidth.
egress_sched.list1.egr_queue_5.bw_percentegress_sched.list1.egr_queue_5.bw_percent = 10Assigns the percentage of bandwidth to egress queue 5. In this example, 10% of egress bandwidth.
egress_sched.list1.egr_queue_6.bw_percentegress_sched.list1.egr_queue_6.bw_percent = 10Assigns the percentage of bandwidth to egress queue 6. In this example, 10% of egress bandwidth.
egress_sched.list1.egr_queue_7.bw_percentegress_sched.list1.egr_queue_7.bw_percent = 0Assigns the percentage of bandwidth to egress queue 7. In this example, 0 indicates a strict priority queue.
egress_sched.list2.port_setegress_sched.list2.port_set = [swp1,swp3,swp18]Assigns ports swp1, swp3 and swp18 to port group list2.
egress_sched.list2.egr_queue_2.bw_percentegress_sched.list2.egr_queue_2.bw_percent = 50Assigns the percentage of bandwidth to egress queue 2. In this example, 50% of egress bandwidth.
egress_sched.list2.egr_queue_5.bw_percentegress_sched.list2.egr_queue_5.bw_percent = 50Assigns the percentage of bandwidth to egress queue 5. In this example, 50% of egress bandwidth.
egress_sched.list2.egr_queue_6.bw_percentegress_sched.list2.egr_queue_6.bw_percent = 0Assigns the percentage of bandwidth to egress queue 6. In this example, 0 indicates a strict priority queue.

The above example only assigns weights to queues 2, 5, and 6 to the port group list2 and schedules the other queues on a best-effort basis when there is no congestion in queues 2, 5, or 6.

Syntax Checker

Cumulus Linux provides a syntax checker for the qos_features.conf and qos_infra.conf files to check for errors, such missing parameters or invalid parameter labels and values.

The syntax checker runs automatically with every switchd reload.

You can run the syntax checker manually from the command line with the cl-consistency-check --datapath-syntax-check command. If errors exist, they write to stderr by default. If you run the command with -q, errors write to the /var/log/switchd.log file.

The cl-consistency-check --datapath-syntax-check command takes the following options:

Option
Description
-hDisplays this list of command options.
-qRuns the command in quiet mode. Errors write to the /var/log/switchd.log file instead of stderr.
-qiRuns the syntax checker against a specified qos_infra.conf file.
-qfRuns the syntax checker against a specified qos_features.conf file.

By default the syntax checker assumes:

You can run the syntax checker when switchd is either running or stopped.

Default Configuration Files

qos_features.conf
#
# /etc/cumulus/datapath/qos/qos_features.conf
# Copyright (C) 2021 NVIDIA Corporation. ALL RIGHTS RESERVED.
#

# packet header field used to determine the packet priority level
# fields include {802.1p, dscp, port}
traffic.packet_priority_source_set = [802.1p]
traffic.port_default_priority      = 0

# packet priority source values assigned to each internal cos value
# internal cos values {cos_0..cos_7}
# (internal cos 3 has been reserved for CPU-generated traffic)
#
# 802.1p values = {0..7}
traffic.cos_0.priority_source.8021p = [0]
traffic.cos_1.priority_source.8021p = [1]
traffic.cos_2.priority_source.8021p = [2]
traffic.cos_3.priority_source.8021p = [3]
traffic.cos_4.priority_source.8021p = [4]
traffic.cos_5.priority_source.8021p = [5]
traffic.cos_6.priority_source.8021p = [6]
traffic.cos_7.priority_source.8021p = [7]

# dscp values = {0..63}
#traffic.cos_0.priority_source.dscp = [0,1,2,3,4,5,6,7]
#traffic.cos_1.priority_source.dscp = [8,9,10,11,12,13,14,15]
#traffic.cos_2.priority_source.dscp = [16,17,18,19,20,21,22,23]
#traffic.cos_3.priority_source.dscp = [24,25,26,27,28,29,30,31]
#traffic.cos_4.priority_source.dscp = [32,33,34,35,36,37,38,39]
#traffic.cos_5.priority_source.dscp = [40,41,42,43,44,45,46,47]
#traffic.cos_6.priority_source.dscp = [48,49,50,51,52,53,54,55]
#traffic.cos_7.priority_source.dscp = [56,57,58,59,60,61,62,63]

# remark packet priority value
# fields include {802.1p, dscp}
traffic.packet_priority_remark_set = []

# packet priority remark values assigned from each internal cos value
# internal cos values {cos_0..cos_7}
# (internal cos 3 has been reserved for CPU-generated traffic)
#
# 802.1p values = {0..7}
#traffic.cos_0.priority_remark.8021p = [0]
#traffic.cos_1.priority_remark.8021p = [1]
#traffic.cos_2.priority_remark.8021p = [2]
#traffic.cos_3.priority_remark.8021p = [3]
#traffic.cos_4.priority_remark.8021p = [4]
#traffic.cos_5.priority_remark.8021p = [5]
#traffic.cos_6.priority_remark.8021p = [6]
#traffic.cos_7.priority_remark.8021p = [7]

# dscp values = {0..63}
#traffic.cos_0.priority_remark.dscp = [0]
#traffic.cos_1.priority_remark.dscp = [8]
#traffic.cos_2.priority_remark.dscp = [16]
#traffic.cos_3.priority_remark.dscp = [24]
#traffic.cos_4.priority_remark.dscp = [32]
#traffic.cos_5.priority_remark.dscp = [40]
#traffic.cos_6.priority_remark.dscp = [48]
#traffic.cos_7.priority_remark.dscp = [56]

# source.port_group_list = [source_port_group]
# source.source_port_group.packet_priority_source_set = [dscp]
# source.source_port_group.port_set = swp1-swp4,swp6
# source.source_port_group.port_default_priority = 0
# source.source_port_group.cos_0.priority_source.dscp = [0,1,2,3,4,5,6,7]
# source.source_port_group.cos_1.priority_source.dscp = [8,9,10,11,12,13,14,15]
# source.source_port_group.cos_2.priority_source.dscp = [16,17,18,19,20,21,22,23]
# source.source_port_group.cos_3.priority_source.dscp = [24,25,26,27,28,29,30,31]
# source.source_port_group.cos_4.priority_source.dscp = [32,33,34,35,36,37,38,39]
# source.source_port_group.cos_5.priority_source.dscp = [40,41,42,43,44,45,46,47]
# source.source_port_group.cos_6.priority_source.dscp = [48,49,50,51,52,53,54,55]
# source.source_port_group.cos_7.priority_source.dscp = [56,57,58,59,60,61,62,63]

# remark.port_group_list = [remark_port_group]
# remark.remark_port_group.packet_priority_remark_set = [dscp]
# remark.remark_port_group.port_set = swp1-swp4,swp6
# remark.remark_port_group.cos_0.priority_remark.dscp = [0]
# remark.remark_port_group.cos_1.priority_remark.dscp = [8]
# remark.remark_port_group.cos_2.priority_remark.dscp = [16]
# remark.remark_port_group.cos_3.priority_remark.dscp = [24]
# remark.remark_port_group.cos_4.priority_remark.dscp = [32]
# remark.remark_port_group.cos_5.priority_remark.dscp = [40]
# remark.remark_port_group.cos_6.priority_remark.dscp = [48]
# remark.remark_port_group.cos_7.priority_remark.dscp = [56]

# to configure priority flow control on a group of ports:
# -- assign cos value(s) to the cos list
# -- add or replace a port group names in the port group list
# -- for each port group in the list
#    -- populate the port set, e.g.
#       swp1-swp4,swp8,swp50s0-swp50s3
#    -- set a PFC buffer size in bytes for each port in the group
#    -- set the xoff byte limit (buffer limit that triggers PFC frames transmit to start)
#    -- set the xon byte delta (buffer limit that triggers PFC frames transmit to stop)
#    -- enable PFC frame transmit and/or PFC frame receive

# priority flow control
#pfc.port_group_list = [pfc_port_group]
#pfc.pfc_port_group.cos_list = []
#pfc.pfc_port_group.port_set = swp1-swp4,swp6
#pfc.pfc_port_group.port_buffer_bytes = 25000
#pfc.pfc_port_group.xoff_size = 10000
#pfc.pfc_port_group.xon_delta = 2000
#pfc.pfc_port_group.tx_enable = true
#pfc.pfc_port_group.rx_enable = true
#
#Specify cable length in mts
#pfc.pfc_port_group.cable_length = 10

# to configure pause on a group of ports:
# -- add or replace port group names in the port group list
# -- for each port group in the list
#    -- populate the port set, e.g.
#       swp1-swp4,swp8,swp50s0-swp50s3
#    -- set a pause buffer size in bytes for each port
#    -- set the xoff byte limit (buffer limit that triggers pause frames transmit to start)
#    -- set the xon byte delta (buffer limit that triggers pause frames transmit to stop)
#    -- enable pause frame transmit and/or pause frame receive

# link pause
# link_pause.port_group_list = [pause_port_group]
# link_pause.pause_port_group.port_set = swp1-swp4,swp6
# link_pause.pause_port_group.port_buffer_bytes = 25000
# link_pause.pause_port_group.xoff_size = 10000
# link_pause.pause_port_group.xon_delta = 2000
# link_pause.pause_port_group.rx_enable = true
# link_pause.pause_port_group.tx_enable = true
#
# Specify cable length in mts
# link_pause.pause_port_group.cable_length = 10

# Explicit Congestion Notification
# to configure ECN and RED on a group of ports:
# -- add or replace port group names in the port group list
# -- assign cos value(s) to the cos list
# -- for each port group in the list
#    -- populate the port set, e.g.
#       swp1-swp4,swp8,swp50s0-swp50s3
# -- to enable RED requires the latest qos_features.conf
#ecn_red.port_group_list = [ecn_red_port_group]
#ecn_red.ecn_red_port_group.egress_queue_list = []
#ecn_red.ecn_red_port_group.port_set = swp1-swp4,swp6
#ecn_red.ecn_red_port_group.ecn_enable = true
#ecn_red.ecn_red_port_group.red_enable = false
#ecn_red.ecn_red_port_group.min_threshold_bytes = 40000
#ecn_red.ecn_red_port_group.max_threshold_bytes = 200000
#ecn_red.ecn_red_port_group.probability = 100

#Default ECN configuration on TC0
default_ecn_conf.egress_queue_list = [0]
default_ecn_conf.ecn_enable = true
default_ecn_conf.red_enable = false
default_ecn_conf.min_threshold_bytes = 150000
default_ecn_conf.max_threshold_bytes = 1500000
default_ecn_conf.probability = 100

# Hierarchical traffic shaping
# to configure shaping at 2 levels:
#     - per egress queue egr_queue_0 - egr_queue_7
#     - port level aggregate
# -- add or replace a port group names in the port group list
# -- for each port group in the list
#    -- populate the port set, e.g.
#       swp1-swp4,swp8,swp50s0-swp50s3
#    -- set min and max rates in kbps for each egr_queue [min, max]
#    -- set max rate in kbps at port level
# shaping.port_group_list = [shaper_port_group]
# shaping.shaper_port_group.port_set = swp1-swp3,swp5,swp7s0-swp7s3
# shaping.shaper_port_group.egr_queue_0.shaper = [50000, 100000]
# shaping.shaper_port_group.egr_queue_1.shaper = [51000, 150000]
# shaping.shaper_port_group.egr_queue_2.shaper = [52000, 200000]
# shaping.shaper_port_group.egr_queue_3.shaper = [53000, 250000]
# shaping.shaper_port_group.egr_queue_4.shaper = [54000, 300000]
# shaping.shaper_port_group.egr_queue_5.shaper = [55000, 350000]
# shaping.shaper_port_group.egr_queue_6.shaper = [56000, 400000]
# shaping.shaper_port_group.egr_queue_7.shaper = [57000, 450000]
# shaping.shaper_port_group.port.shaper = 900000

# default egress scheduling weight per egress queue 
# To be applied to all the ports if port_group profile not configured
# If you do not specify any bw_percent of egress_queues, those egress queues 
# will assume DWRR weight 0 - no egress scheduling for those queues
# '0' indicates strict priority
default_egress_sched.egr_queue_0.bw_percent = 12
default_egress_sched.egr_queue_1.bw_percent = 13
default_egress_sched.egr_queue_2.bw_percent = 12
default_egress_sched.egr_queue_3.bw_percent = 13
default_egress_sched.egr_queue_4.bw_percent = 12
default_egress_sched.egr_queue_5.bw_percent = 13
default_egress_sched.egr_queue_6.bw_percent = 12
default_egress_sched.egr_queue_7.bw_percent = 13

# port_group profile for egress scheduling weight per egress queue 
# If you do not specify any bw_percent of egress_queues, those egress queues 
# will assume DWRR weight 0 - no egress scheduling for those queues
# '0' indicates strict priority
#egress_sched.port_group_list = [sched_port_group1]
#egress_sched.sched_port_group1.port_set = swp2
#egress_sched.sched_port_group1.egr_queue_0.bw_percent = 10
#egress_sched.sched_port_group1.egr_queue_1.bw_percent = 20
#egress_sched.sched_port_group1.egr_queue_2.bw_percent = 30
#egress_sched.sched_port_group1.egr_queue_3.bw_percent = 10
#egress_sched.sched_port_group1.egr_queue_4.bw_percent = 10
#egress_sched.sched_port_group1.egr_queue_5.bw_percent = 10
#egress_sched.sched_port_group1.egr_queue_6.bw_percent = 10
#egress_sched.sched_port_group1.egr_queue_7.bw_percent = 0

# Cut-through is disabled by default on all chips with the exception of
# Spectrum.  On Spectrum cut-through cannot be disabled.
#cut_through_enable = false
qos_infra.conf
### qos_infra.conf
#
# Default qos_infra configuration for Mellanox Spectrum chip
# Copyright (C) 2021 NVIDIA Corporation. ALL RIGHTS RESERVED.
#
# scheduling algorithm: algorithm values = {dwrr}
scheduling.algorithm = dwrr

# priority groups
# supported group names are control, bulk, service1-6
traffic.priority_group_list = [bulk]

# internal cos values assigned to each priority group
# each cos value should be assigned exactly once
# internal cos values {0..7}
priority_group.bulk.cos_list = [0,1,2,3,4,5,6,7]

# Alias Name defined for each priority group
# Valid string between 0-255 chars
# Sample alias support for naming priority groups
#priority_group.bulk.alias = "Bulk"

# priority group ID assigned to each priority group
#priority_group.control.id = 7
#priority_group.service2.id = 2
priority_group.bulk.id = 0

# all priority groups share a service pool on Spectrum
# service pools assigned to each priority group
priority_group.bulk.service_pool = 0

# service pool assigned for lossless PGs
#flow_control.ingress_service_pool = 0

# --- ingress buffer space allocations ---
#
# total buffer
#  - ingress minimum buffer allocations
#  - ingress service pool buffer allocations
#  - priority group ingress headroom allocations
#  - ingress global headroom allocations
#  = total ingress shared buffer size

# ingress service pool buffer allocation: percent of total buffer
# If a service pool has no priority groups, the buffer is added
# to the shared buffer space.
ingress_service_pool.0.percent = 100.0
# all priority groups

# Ingress buffer port.pool buffer : size in bytes
#port.service_pool.0.ingress_buffer.reserved = 10240
#port.service_pool.0.ingress_buffer.shared_size = 9000
#port.management.ingress_buffer.reserved = 0

# priority group minimum buffer allocation: size in bytes
# priority group shared buffer allocation: shared buffer size in bytes
# if a priority group has no packet priority values assigned to it, the buffers will not be allocated

#priority_group.bulk.ingress_buffer.reserved           = 0
#priority_group.bulk.ingress_buffer.shared_size        = 15

# ---- ingress dynamic buffering settings
# To enable ingress static pool, set the mode to 0
#ingress_service_pool.0.mode = 0

# The ALPHA defines the max% of buffers (quota) available on a
# per ingress port OR ipool, Ingress PG, Egress TC, Egress port OR epool.
# ALPHA value equates to the following buffer limit calculated as:
# alpha%(alpha+1) = Max Buffer percentage

# https://community.mellanox.com/s/article/understanding-the-alpha-parameter-in-the-buffer-configuration-of-mellanox-spectrum-switches
# Each shared buffer pool can use a maximum of [total_buffer * (alpha / (alpha+1))]
# Configure quota values mapped to the following alpha values:
# Configuration value = alpha level:
# Both ALPHA_*(string representation) as well as integer values (old representation) will be supported for alpha
# 0/ALPHA_0  = alpha 0
# 1/ALPHA_1_128  = alpha 1/128
# 2/ALPHA_1_64  = alpha 1/64
# 3/ALPHA_1_32  = alpha 1/32
# 4/ALPHA_1_16  = alpha 1/16
# 5/ALPHA_1_8  = alpha 1/8
# 6/ALPHA_1_4  = alpha 1/4
# 7/ALPHA_1_2  = alpha 1/2
# 8/ALPHA_1  = alpha  1
# 9/ALPHA_2  = alpha  2
# 10/ALPHA_4 = alpha  4
# 11/ALPHA_8 = alpha  8
# 12/ALPHA_16 = alpha 16
# 13/ALPHA_32 = alpha 32
# 14/ALPHA_64 = alpha 64
# 15/ALPHA_INFINITY = alpha Infinity

# Ingress buffer per-port dynamic buffering alpha (Default: ALPHA_8)
#port.service_pool.0.ingress_buffer.dynamic_quota = ALPHA_8
#port.management.ingress_buffer.dynamic_quota = ALPHA_8

# Ingress buffer dynamic buffering alpha for lossless PGs (if any; Default: ALPHA_1)
#flow_control.ingress_buffer.dynamic_quota = ALPHA_1

# Ingress buffer per-PG dynamic buffering alpha (Default: ALPHA_8)
#priority_group.bulk.ingress_buffer.dynamic_quota      = ALPHA_8

# --- egress buffer space allocations ---
#
# total egress buffer
#  - minimum buffer allocations
#  = total service pool buffer size
#
# service pool assigned for lossless PGs
#flow_control.egress_service_pool = 0
#
# service pool assigned for egress queues
egress_buffer.egr_queue_0.uc.service_pool = 0
egress_buffer.egr_queue_1.uc.service_pool = 0
egress_buffer.egr_queue_2.uc.service_pool = 0
egress_buffer.egr_queue_3.uc.service_pool = 0
egress_buffer.egr_queue_4.uc.service_pool = 0
egress_buffer.egr_queue_5.uc.service_pool = 0
egress_buffer.egr_queue_6.uc.service_pool = 0
egress_buffer.egr_queue_7.uc.service_pool = 0
#
# Service pool buffer allocation: percent of total
# buffer size.
egress_service_pool.0.percent = 100.0
# all priority groups, UC and MC

# Egress buffer port.pool buffer : size in bytes
#port.service_pool.0.egress_buffer.uc.reserved = 10240
#port.service_pool.0.egress_buffer.uc.shared_size = 9000
#port.management.egress_buffer.reserved = 0

# Front panel port egress buffer limits enforced for each
# priority group.
# Unlimited egress buffers not supported on Spectrum.
#priority_group.bulk.unlimited_egress_buffer     = false

#
# if a priority group has no cos values assigned to it, the buffers will not be allocated
#

# Service pool mapping for MC.SP region
egress_buffer.cos_0.mc.service_pool = 0
egress_buffer.cos_1.mc.service_pool = 0
egress_buffer.cos_2.mc.service_pool = 0
egress_buffer.cos_3.mc.service_pool = 0
egress_buffer.cos_4.mc.service_pool = 0
egress_buffer.cos_5.mc.service_pool = 0
egress_buffer.cos_6.mc.service_pool = 0
egress_buffer.cos_7.mc.service_pool = 0
#
# Reserved and static shared buffer allocation for MC.SP region: size in bytes
#egress_buffer.cos_0.mc.reserved = 10240
#egress_buffer.cos_1.mc.reserved = 10240
#egress_buffer.cos_2.mc.reserved = 10240
#egress_buffer.cos_3.mc.reserved = 10240
#egress_buffer.cos_4.mc.reserved = 10240
#egress_buffer.cos_5.mc.reserved = 10240
#egress_buffer.cos_6.mc.reserved = 10240
#egress_buffer.cos_7.mc.reserved = 10240
#
#egress_buffer.cos_0.mc.shared_size = 40
#egress_buffer.cos_1.mc.shared_size = 40
#egress_buffer.cos_2.mc.shared_size =  5
#egress_buffer.cos_3.mc.shared_size = 40
#egress_buffer.cos_4.mc.shared_size = 40
#egress_buffer.cos_5.mc.shared_size = 40
#egress_buffer.cos_6.mc.shared_size = 40
#egress_buffer.cos_7.mc.shared_size = 30

# Shared buffer allocation for ePort.TC region : size in bytes.
#egress_buffer.egr_queue_0.uc.shared_size   = 40
#egress_buffer.egr_queue_1.uc.shared_size   = 40
#egress_buffer.egr_queue_2.uc.shared_size   =  5
#egress_buffer.egr_queue_3.uc.shared_size   = 40
#egress_buffer.egr_queue_4.uc.shared_size   = 40
#egress_buffer.egr_queue_5.uc.shared_size   = 40
#egress_buffer.egr_queue_6.uc.shared_size   = 40
#egress_buffer.egr_queue_7.uc.shared_size   = 30

# Minimum buffer allocation for ePort.TC region: size in bytes
#egress_buffer.egr_queue_0.uc.reserved  = 1024
#egress_buffer.egr_queue_1.uc.reserved  = 1024
#egress_buffer.egr_queue_2.uc.reserved  = 1024
#egress_buffer.egr_queue_3.uc.reserved  = 1024
#egress_buffer.egr_queue_4.uc.reserved  = 1024
#egress_buffer.egr_queue_5.uc.reserved  = 1024
#egress_buffer.egr_queue_6.uc.reserved  = 1024
#egress_buffer.egr_queue_7.uc.reserved  = 1024

# Reserved Egress buffer for TCs mapped to lossless SPs
#flow_control.egress_buffer.reserved = 0

# Egress buffer ePort.MC buffer : size in bytes
# the per-port limit on multicast packets (applies to all switch priorities)
#port.egress_buffer.mc.reserved = 10240
#port.egress_buffer.mc.shared_size = 92160

# To enable egress static pool, set the mode to 0
#egress_service_pool.0.mode = 0
 
# Egress dynamic buffer pool configuration
# Replace the shared_size parameter with the dynamic_quota=n/ALPHA_x,
# where ‘n’ should be the configuration value for alpha.
# 		‘ALPHA_x’ should be string representation for alpha.
# Pls note : Same alpha configuration values can be used as mentioned in Ingress Dynamic Buffering section above
#
# Egress buffer per-port dynamic buffering quota (alpha ; Default: ALPHA_16)
#port.service_pool.0.egress_buffer.uc.dynamic_quota = ALPHA_16
#port.management.egress_buffer.dynamic_quota = ALPHA_8

# Egress buffer per-egress-queue dynamic buffering quota (alpha) for lossless egress queues (Default: ALPHA_INFINITY)
#flow_control.egress_buffer.dynamic_quota = ALPHA_INFINITY

# Egress buffer per-egress-queue dynamic buffering quota (alpha) for unicast (Default: ALPHA_8)
#egress_buffer.egr_queue_0.uc.dynamic_quota    = ALPHA_2
#egress_buffer.egr_queue_1.uc.dynamic_quota = ALPHA_4
#egress_buffer.egr_queue_2.uc.dynamic_quota = ALPHA_1
#egress_buffer.egr_queue_3.uc.dynamic_quota = ALPHA_1_2
#egress_buffer.egr_queue_4.uc.dynamic_quota = ALPHA_1_4
#egress_buffer.egr_queue_5.uc.dynamic_quota = ALPHA_1_8
#egress_buffer.egr_queue_6.uc.dynamic_quota = ALPHA_1_16
#egress_buffer.egr_queue_7.uc.dynamic_quota = ALPHA_1_32

# Egress buffer per-egress-queue dynamic buffering quota (alpha) for multicast (Default: ALPHA_INFINITY)
#egress_buffer.egr_queue_0.mc.dynamic_quota    = ALPHA_2
#egress_buffer.egr_queue_1.mc.dynamic_quota = ALPHA_4
#egress_buffer.egr_queue_2.mc.dynamic_quota = ALPHA_1
#egress_buffer.egr_queue_3.mc.dynamic_quota = ALPHA_1_2
#egress_buffer.egr_queue_4.mc.dynamic_quota = ALPHA_1_4
#egress_buffer.egr_queue_5.mc.dynamic_quota = ALPHA_1_8
#egress_buffer.egr_queue_6.mc.dynamic_quota = ALPHA_1_16
#egress_buffer.egr_queue_7.mc.dynamic_quota = ALPHA_INFINITY

# These parameters can be assigned to the virtual Multicast port as well (Default: ALPHA_1_4)
#egress_buffer.cos_0.mc.dynamic_quota = ALPHA_1_4
#egress_buffer.cos_1.mc.dynamic_quota = ALPHA_8
#egress_buffer.cos_2.mc.dynamic_quota = ALPHA_4
#egress_buffer.cos_3.mc.dynamic_quota = ALPHA_2
#egress_buffer.cos_4.mc.dynamic_quota = ALPHA_1_8
#egress_buffer.cos_5.mc.dynamic_quota = ALPHA_1
#egress_buffer.cos_6.mc.dynamic_quota = ALPHA_1_2
#egress_buffer.cos_7.mc.dynamic_quota = ALPHA_1_4

# internal cos values mapped to egress queues
# multicast queue: same as unicast queue
cos_egr_queue.cos_0.uc  = 0
cos_egr_queue.cos_0.cpu = 0

cos_egr_queue.cos_1.uc  = 1
cos_egr_queue.cos_1.cpu = 1

cos_egr_queue.cos_2.uc  = 2
cos_egr_queue.cos_2.cpu = 2

cos_egr_queue.cos_3.uc  = 3
cos_egr_queue.cos_3.cpu = 3

cos_egr_queue.cos_4.uc  = 4
cos_egr_queue.cos_4.cpu = 4

cos_egr_queue.cos_5.uc  = 5
cos_egr_queue.cos_5.cpu = 5

cos_egr_queue.cos_6.uc  = 6
cos_egr_queue.cos_6.cpu = 6

cos_egr_queue.cos_7.uc  = 7

Caveats

Configure QoS and Breakout Ports Simultaneously

If you configure both breakout ports by modifying ports.conf and QoS settings by modifying qos_features.conf, then apply the settings with reload switchd, errors might occur.

You must apply breakout port configuration before QoS configuration on the breakout ports. Modify ports.conf first, reload switchd, then modify qos_features.conf and reload switchd a second time.

QoS Settings on Bond Member Interfaces

If you apply QoS settings on bond member interfaces instead of the logical bond interface, the members must share identical QoS configuration. If the configuration is not identical between bond interfaces, the bond inherits the _last_ interface you apply to the bond.

If QoS settings do not match, switchd reload fails; however, switchd restart does not fail.

Cut-through Switching

You cannot disable cut-through switching on Spectrum ASICs. Cumulus Linux ignores the cut_through_enable = false setting in the qos_features.conf file.

RDMA over Converged Ethernet - RoCE

RDMA over Converged Ethernet (RoCE) enables you to write to compute or storage elements using remote direct memory access (RDMA) over an Ethernet network instead of using host CPUs. RoCE relies on explicit congestion notification (ECN) and priority flow control (PFC) to operate. Cumulus Linux supports features that can enable lossless Ethernet for RoCE environments.

While Cumulus Linux can support RoCE environments, the end hosts must support the RoCE protocol.

RoCE helps you obtain a converged network, where all services run over the Ethernet infrastructure, including Infiniband apps.

Enable RDMA over Converged Ethernet lossless (with PFC and ECN)

RoCE uses the Infiniband (IB) Protocol over converged Ethernet. The IB global route header rides directly on top of the Ethernet header. The lossless Ethernet layer handles congestion hop by hop.

To configure RoCE with PFC and ECN:

cumulus@switch:~$ net add roce lossless
cumulus@switch:~$ net commit

NVUE defaults to roce mode lossless. The command nv set qos roce and nv set qos roce mode lossless are equivalent.

If you enable mode lossy, configuring nv set qos roce without a mode does not change the RoCE mode. To change to lossless, you must configure mode lossless.

cumulus@switch:~$ nv set qos roce
cumulus@switch:~$ nv config apply

Link pause is another way to provide lossless ethernet; however, PFC is the preferred method. PFC allows more granular control by pausing the traffic flow for a given CoS group instead of the entire link.

Enable RDMA over Converged Ethernet lossy (with ECN)

RoCEv2 requires flow control for lossless Ethernet. RoCEv2 uses the Infiniband (IB) Transport Protocol over UDP. The IB transport protocol includes an end-to-end reliable delivery mechanism and has its own sender notification mechanism.

RoCEv2 congestion management uses RFC 3168 to signal congestion experienced to the receiver. The receiver generates an RoCEv2 congestion notification packet directed to the source of the packet.

To configure RoCE with ECN:

cumulus@switch:~$ net add roce lossy
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set qos roce mode lossy
cumulus@switch:~$ nv config apply

Remove RoCE Configuration

To remove RoCE configurations:

cumulus@switch:~$ net del roce
cumulus@switch:~$ net commit
cumulus@switch:~$ nv unset qos roce
cumulus@switch:~$ nv config apply

Verify RoCE Configuration

You can verify RoCE configuration with NVUE nv show commands.

To show detailed information about the configured buffers, utilization and DSCP markings, run the nv show qos roce command:

cumulus@switch:mgmt:~$ nv show qos roce
                    operational  applied   description
------------------  -----------  --------  ------------------------------------------------------
enable                           on        Turn the feature 'on' or 'off'.  The default is 'off'.
mode                lossless     lossless  Roce Mode
cable-length        100          100       Cable Length(in meters) for Roce Lossless Config
congestion-control
  congestion-mode   ECN                    Congestion config mode
  enabled-tc        0,3                    Congestion config enabled Traffic Class
  max-threshold     1.43 MB                Congestion config max-threshold
  min-threshold     146.48 KB              Congestion config min-threshold
pfc
  pfc-priority      3                      switch-prio on which PFC is enabled
  rx-enabled        enabled                PFC Rx Enabled status
  tx-enabled        enabled                PFC Tx Enabled status
trust
  trust-mode        pcp,dscp               Trust Setting on the port for packet classification
 
 
RoCE PCP/DSCP->SP mapping configurations
===========================================
        pcp  dscp                     switch-prio
    --  ---  -----------------------  -----------
    0   0    0,1,2,3,4,5,6,7          0
    1   1    8,9,10,11,12,13,14,15    1
    2   2    16,17,18,19,20,21,22,23  2
    3   3    24,25,26,27,28,29,30,31  3
    4   4    32,33,34,35,36,37,38,39  4
    5   5    40,41,42,43,44,45,46,47  5
    6   6    48,49,50,51,52,53,54,55  6
    7   7    56,57,58,59,60,61,62,63  7
 
 
RoCE SP->TC mapping and ETS configurations
=============================================
        switch-prio  traffic-class  scheduler-weight
    --  -----------  -------------  ----------------
    0   0            0              DWRR-50%
    1   1            0              DWRR-50%
    2   2            0              DWRR-50%
    3   3            3              DWRR-50%
    4   4            0              DWRR-50%
    5   5            0              DWRR-50%
    6   6            6              strict-priority
    7   7            0              DWRR-50%
 
 
RoCE pool config
===================
        name                   mode     size   switch-priorities  traffic-class
    --  ---------------------  -------  -----  -----------------  -------------
    0   lossy-default-ingress  Dynamic  50.0%  0,1,2,4,5,6,7      -
    1   roce-reserved-ingress  Dynamic  50.0%  3                  -
    2   lossy-default-egress   Dynamic  50.0%  -                  0,6
    3   roce-reserved-egress   Dynamic  inf    -                  3
 
 
Exception List
=================
        description
    --  -----------

To show detailed RoCE information about a single interface, run the nv show interface qos roce status command.

cumulus@switch:mgmt:~$ nv show interface swp16 qos roce status
                    operational    applied  description
------------------  -------------  -------  ---------------------------------------------------
congestion-control
  congestion-mode   ecn, absolute           Congestion config mode
  enabled-tc        0,3                     Congestion config enabled Traffic Class
  max-threshold     1.43 MB                 Congestion config max-threshold
  min-threshold     153.00 KB               Congestion config min-threshold
pfc
  pfc-priority      3                       switch-prio on which PFC is enabled
  rx-enabled        yes                     PFC Rx Enabled status
  tx-enabled        yes                     PFC Tx Enabled status
trust
  trust-mode        pcp,dscp                Trust Setting on the port for packet classification
mode                lossless                Roce Mode
 
 
RoCE PCP/DSCP->SP mapping configurations
===========================================
          pcp  dscp  switch-prio
    ----  ---  ----  -----------
    cnp   6    48    6
    roce  3    26    3
 
 
RoCE SP->TC mapping and ETS configurations
=============================================
          switch-prio  traffic-class  scheduler-weight
    ----  -----------  -------------  ----------------
    cnp   6            6              strict priority
    roce  3            3              dwrr-50%
 
 
RoCE Pool Status
===================
        name                   mode     pool-id  switch-priorities  traffic-class  size      current-usage  max-usage
    --  ---------------------  -------  -------  -----------------  -------------  --------  -------------  ---------
    0   lossy-default-ingress  DYNAMIC  2        0,1,2,4,5,6,7      -              15.16 MB  0 Bytes        16.00 MB
    1   roce-reserved-ingress  DYNAMIC  3        3                  -              15.16 MB  7.30 MB        7.90 MB
    2   lossy-default-egress   DYNAMIC  13       -                  0,6            15.16 MB  0 Bytes        16.01 MB
    3   roce-reserved-egress   DYNAMIC  14       -                  3              inf       7.29 MB        13.47 MB

To show detailed information about current buffer utilization as well as historic RoCE byte and packet counts, run the nv show interface qos roce counters command:

cumulus@switch:mgmt:~$ nv show interface swp16 qos roce counters
                               operational   applied  description
-----------------------------  ------------  -------  ------------------------------------------------------
rx-stats
  rx-non-roce-stats
    buffer-max-usage           144 Bytes              Max Ingress Pool-buffer usage for non-RoCE traffic
    buffer-usage               0 Bytes                Current Ingress Pool-buffer usage for non-RoCE traffic
    no-buffer-discard          55                     Rx buffer discards for non-RoCE traffic
    non-roce-bytes             56.52 MB               non-roce rx bytes
    non-roce-packets           462975                 non-roce rx packets
    pg-max-usage               144 Bytes              Max PG-buffer usage for non-RoCE traffic
    pg-usage                   0 Bytes                Current PG-buffer usage for non-RoCE traffic
  rx-pfc-stats
    pause-duration             0                      Rx PFC pause duration for RoCE traffic
    pause-packets              0                      Rx PFC pause packets for RoCE traffic
  rx-roce-stats
    buffer-max-usage           0 Bytes                Max Ingress Pool-buffer usage for RoCE traffic
    buffer-usage               0 Bytes                Current Ingress Pool-buffer usage for RoCE traffic
    no-buffer-discard          0                      Rx buffer discards for RoCE traffic
    pg-max-usage               0 Bytes                Max PG-buffer usage for RoCE traffic
    pg-usage                   0 Bytes                Current PG-buffer usage for RoCE traffic
    roce-bytes                 0 Bytes                Rx RoCE Bytes
    roce-packets               0                      Rx RoCE Packets
tx-stats
  tx-cnp-stats
    buffer-max-usage           16.02 MB               Max Egress Pool-buffer usage for CNP traffic
    buffer-usage               0 Bytes                Current Egress Pool-buffer usage for CNP traffic
    cnp-bytes                  0 Bytes                Tx CNP Packet Bytes
    cnp-packets                0                      Tx CNP Packets
    tc-max-usage               0 Bytes                Max TC-buffer usage for CNP traffic
    tc-usage                   0 Bytes                Current TC-buffer usage for CNP traffic
    unicast-no-buffer-discard  0                      Tx buffer discards for CNP traffic
  tx-ecn-stats
    ecn-marked-packets         693777677344           Tx ECN marked packets
  tx-pfc-stats
    pause-duration             0                      Tx PFC pause duration for RoCE traffic
    pause-packets              0                      Tx PFC pause packets for RoCE traffic
  tx-roce-stats
    buffer-max-usage           13.47 MB               Max Egress Pool-buffer usage for RoCE traffic
    buffer-usage               7.29 MB                Current Egress Pool-buffer usage for RoCE traffic
    roce-bytes                 92824.38 GB            Tx RoCE Packet bytes
    roce-packets               803785675319           Tx RoCE Packets
    tc-max-usage               16.02 MB               Max TC-buffer usage for RoCE traffic
    tc-usage                   7.29 MB                Current TC-buffer usage for RoCE traffic
    unicast-no-buffer-discard  663060754115           Tx buffer discards for RoCE traffic

DHCP

This section describes how to configure DHCP relays and DHCP servers.

DHCP Relays

DHCP is a client server protocol that automatically provides IP hosts with IP addresses and other related configuration information. A DHCP relay (agent) is a host that forwards DHCP packets between clients and servers that are not on the same physical subnet.

This topic describes how to configure DHCP relays for IPv4 and IPv6 using the following topology:

If you intend to run the dhcrelay service within a VRF, including the management VRF, follow these steps.

Basic Configuration

To set up DHCP relay, you need to provide the IP address of the DHCP server and the interfaces participating in DHCP relay (facing the server and facing the client). The number of IP addresses must fit in 255 octets.

In the example commands below, the DHCP server IP address is 172.16.1.102, vlan10 is the SVI for VLAN 10 and the uplinks are swp51 and swp52.

cumulus@leaf01:~$ net add dhcp relay interface swp51
cumulus@leaf01:~$ net add dhcp relay interface swp52
cumulus@leaf01:~$ net add dhcp relay interface vlan10
cumulus@leaf01:~$ net add dhcp relay server 172.16.1.102
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
You cannot configure IPv6 relays with NCLU commands. Use the Linux Commands.

Specify the IP address of each DHCP server and both interfaces participating in DHCP relay (facing the server and facing the client).

In the example commands below, the DHCP server IP address is 172.16.1.102, vlan10 is the SVI for VLAN 10, and the uplinks are swp51 and swp52. As per RFC 3046, you can specify as many server IP addresses that can fit in 255 octets. You can specify each address only once.

cumulus@leaf01:~$ nv set service dhcp-relay default interface swp51
cumulus@leaf01:~$ nv set service dhcp-relay default interface swp52
cumulus@leaf01:~$ nv set service dhcp-relay default interface vlan10
cumulus@leaf01:~$ nv set service dhcp-relay default server 172.16.1.102
cumulus@leaf01:~$ nv config apply

Specify the IP address of each DHCP server and both interfaces participating in DHCP relay (facing the server and facing the client).

In the example commands below, the DHCP server IP address is 2001:db8:100::2, vlan10 is the SVI for VLAN 10, and the uplinks are swp51 and swp52. As per RFC 3046, you can specify as many server IP addresses that can fit in 255 octets. You can specify each address only once.

cumulus@leaf01:~$ nv set service dhcp-relay6 default interface swp51
cumulus@leaf01:~$ nv set service dhcp-relay6 default interface swp52
cumulus@leaf01:~$ nv set service dhcp-relay6 default interface vlan10
cumulus@leaf01:~$ nv set service dhcp-relay6 default server 2001:db8:100::2
cumulus@leaf01:~$ nv config apply
  1. Edit the /etc/default/isc-dhcp-relay file to add the IP address of the DHCP server and the interfaces participating in DHCP relay. In the example below, the DHCP server IP address is 172.16.1.102, vlan10 is the SVI for VLAN 10, and the uplinks are swp51 and swp52.

    cumulus@leaf01:~$ sudo nano /etc/default/isc-dhcp-relay
    SERVERS="172.16.1.102"
    INTF_CMD="-i vlan10 -i swp51 -i swp52"
    OPTIONS=""
    
  2. Enable, then restart the dhcrelay service so that the configuration persists between reboots:

    cumulus@leaf01:~$ sudo systemctl enable dhcrelay.service
    cumulus@leaf01:~$ sudo systemctl restart dhcrelay.service
    
  1. Edit the /etc/default/isc-dhcp-relay6 file to add the IP address of the DHCP server and the interfaces participating in DHCP relay. In the example below, the DHCP server IP address is 2001:db8:100::2, vlan10 is the SVI for VLAN 10, and the uplinks are swp51 and swp52.

    cumulus@leaf01:$ sudo nano /etc/default/isc-dhcp-relay6
    SERVERS=" -u 2001:db8:100::2%swp51 -u 2001:db8:100::2%swp52"
    INTF_CMD="-l vlan10"
    
  2. Enable, then restart the dhcrelay6 service so that the configuration persists between reboots:

    cumulus@switch:~$ sudo systemctl enable dhcrelay6.service
    cumulus@switch:~$ sudo systemctl restart dhcrelay6.service
    

  • You configure a DHCP relay on a per-VLAN basis, specifying the SVI, not the parent bridge. In the example above, you specify vlan10 as the SVI for VLAN 10 but you do not specify the bridge named bridge.
  • When you configure DHCP relay with VRR, the DHCP relay client must run on the SVI; not on the -v0 interface.

Optional Configuration

This section describes optional DHCP relay configuration. The steps provided in this section assume that you already done basic DHCP relay configuration, described above.

DHCP Agent Information Option (Option 82)

Cumulus Linux supports DHCP Agent Information Option 82, which allows a DHCP relay to insert circuit or relay specific information into a request that the switch forwards to a DHCP server. You can use the following options:

To configure DHCP Agent Information Option 82:

  1. Edit the /etc/default/isc-dhcp-relay file and add one of the following options:

    To inject the ingress SVI interface against which DHCP processes the relayed DHCP discover packet, add -a to the OPTIONS line:

    cumulus@leaf01:~$ sudo nano /etc/default/isc-dhcp-relay
    ...
    # Additional options that are passed to the DHCP relay daemon?
    OPTIONS="-a"
    

    To inject the physical switch port on which the relayed DHCP discover packet arrives instead of the SVI, add -a --use-pif-circuit-id to the OPTIONS line:

    cumulus@leaf01:~$ sudo nano /etc/default/isc-dhcp-relay
    ...
    # Additional options that are passed to the DHCP relay daemon?
    OPTIONS="-a --use-pif-circuit-id"
    

    To customize the Remote ID sub-option, add -a -r to the OPTIONS line followed by a custom string (up to 255 characters):

    cumulus@leaf01:~$ sudo nano /etc/default/isc-dhcp-relay
    ...
    # Additional options that are passed to the DHCP relay daemon?
    OPTIONS="-a -r CUSTOMVALUE"
    
  2. Restart the dhcrelay service to apply the new configuration:

    cumulus@leaf01:~$ sudo systemctl restart dhcrelay.service
    

Control the Gateway IP Address with RFC 3527

When you need DHCP relay in an environment that relies on an anycast gateway (such as EVPN), a unique IP address is necessary on each device for return traffic. By default, in a BGP unnumbered environment with DHCP relay, the source IP address is the loopback IP address and the gateway IP address (giaddr) is the SVI IP address. However with anycast traffic, the SVI IP address is not unique to each rack; it is typically shared between racks. Most EVPN ToR deployments only use a single unique IP address, which is the loopback IP address.

RFC 3527 enables the DHCP server to react to these environments by introducing a new parameter to the DHCP header called the link selection sub-option, which the DHCP relay agent builds. The link selection sub-option takes on the normal role of the giaddr in relaying to the DHCP server which subnet correlates to the DHCP request. When using this sub-option, the giaddr continues to be present but only relays the return IP address that the DHCP server uses; the giaddr becomes the unique loopback IP address.

When enabling RFC 3527 support, you can specify an interface, such as the loopback interface or a switch port interface to use as the giaddr. The relay picks the first IP address on that interface. If the interface has multiple IP addresses, you can specify a specific IP address for the interface.

RFC 3527 supports IPv4 DHCP relays only.

To enable RFC 3527 support and control the giaddr:

Run the net add dhcp relay giaddr-interface command with the interface or the interface and IP address you want to use.

This example uses the first IP address on the loopback interface as the giaddr:

cumulus@leaf01:~$ net add dhcp relay giaddr-interface lo

The first IP address on the loopback interface is typically the 127.0.0.1 address. This example uses IP address 10.10.10.1 on the loopback interface as the giaddr:

cumulus@leaf01:~$ net add dhcp relay giaddr-interface lo 10.10.10.1

This example uses the first IP address on swp2 as the giaddr:

cumulus@leaf01:~$ net add dhcp relay giaddr-interface swp2

This example uses IP address 10.0.0.4 on swp2 as the giaddr:

cumulus@leaf01:~$ net add dhcp relay giaddr-interface swp2 10.0.0.4

Run the nv set service dhcp-relay default giaddress-interface command with the interface/IP address you want to use. The following example uses the first IP address on the loopback interface as the gateway IP address:

cumulus@leaf01:~$ nv set service dhcp-relay default giaddress-interface lo

The first IP address on the loopback interface is typically the 127.0.0.1 address. This example uses IP address 10.10.10.1 on the loopback interface as the giaddr:

cumulus@leaf01:~$ nv set service dhcp-relay default giaddress-interface lo 10.10.10.1

This example uses the first IP address on swp2 as the giaddr:

cumulus@leaf01:~$ nv set service dhcp-relay default giaddr-interface swp2

This example uses IP address 10.0.0.4 on swp2 as the giaddr:

cumulus@leaf01:~$ nv set service dhcp-relay default giaddr-interface swp2 10.0.0.4
  1. Edit the /etc/default/isc-dhcp-relay file and provide the -U option with the interface or IP address you want to use as the giaddr.

    This example uses the first IP address on the loopback interface as the giaddr:

    cumulus@leaf01:~$ sudo nano /etc/default/isc-dhcp-relay
    ...
    # Additional options that are passed to the DHCP relay daemon?
    OPTIONS="-U lo"
    

    The first IP address on the loopback interface is typically the 127.0.0.1 address. This example uses IP address 10.10.10.1 on the loopback interface as the giaddr:

    cumulus@leaf01:~$ sudo nano /etc/default/isc-dhcp-relay
    ...
    # Additional options that are passed to the DHCP relay daemon?
    OPTIONS="-U 10.10.10.1%lo"
    

    This example uses the first IP address on swp2 as the giaddr:

    cumulus@leaf01:~$ sudo nano /etc/default/isc-dhcp-relay
    ...
    # Additional options that are passed to the DHCP relay daemon?
    OPTIONS="-U swp2"
    

    This example uses IP address 10.0.0.4 on swp2 as the giaddr:

    cumulus@leaf01:~$ sudo nano /etc/default/isc-dhcp-relay
    ...
    # Additional options that are passed to the DHCP relay daemon?
    OPTIONS="-U 10.0.0.4%swp2"
    
  2. Restart the dhcrelay service to apply the configuration change:

    cumulus@leaf01:~$ sudo systemctl restart dhcrelay.service
    

Gateway IP Address as Source IP for Relayed DHCP Packets (Advanced)

You can configure the dhcrelay service to forward IPv4 (only) DHCP packets to a DHCP server and ensure that the source IP address of the relayed packet is the same as the gateway IP address.

This option impacts all relayed IPv4 packets globally.

To use the gateway IP address as the source IP address:

cumulus@leaf:~$ net add dhcp relay use-giaddr-as-src
cumulus@leaf:~$ net pending
cumulus@leaf:~$ net commit
cumulus@leaf01:~$ nv set service dhcp-relay default source-ip giaddress
cumulus@leaf01:~$ nv config apply
  1. Edit the /etc/default/isc-dhcp-relay file to add --giaddr-src to the OPTIONS line.

    cumulus@leaf01:~$ sudo nano /etc/default/isc-dhcp-relay
    SERVERS="172.16.1.102"
    INTF_CMD="-i vlan10 -i swp51 -i swp52 -U swp2"
    OPTIONS="--giaddr-src"
    
  2. Restart the dhcrelay service to apply the configuration change:

    cumulus@leaf01:~$ sudo systemctl restart dhcrelay.service
    

Configure Multiple DHCP Relays

Cumulus Linux supports multiple DHCP relay daemons on a switch to enable relaying of packets from different bridges to different upstream interfaces.

To configure multiple DHCP relay daemons on a switch:

  1. In the /etc/default directory, create a configuration file for each DHCP relay daemon. Use the naming scheme isc-dhcp-relay-<dhcp-name> for IPv4 or isc-dhcp-relay6-<dhcp-name> for IPv6. This is an example configuration file for IPv4:

    # Defaults for isc-dhcp-relay initscript
    # sourced by /etc/init.d/isc-dhcp-relay
    # installed at /etc/default/isc-dhcp-relay by the maintainer scripts
    
    #
    # This is a POSIX shell fragment
    #
    
    # What servers should the DHCP relay forward requests to?
    SERVERS="102.0.0.2"
    # On what interfaces should the DHCP relay (dhrelay) serve DHCP requests?
    # Always include the interface towards the DHCP server.
    # This variable requires a -i for each interface configured above.
    # This will be used in the actual dhcrelay command
    # For example, "-i eth0 -i eth1"
    INTF_CMD="-i swp2s2 -i swp2s3"
    
    # Additional options that are passed to the DHCP relay daemon?
    OPTIONS=""
    
  2. Run the following command to start a dhcrelay instance, where <dhcp-name> is the instance name or number.

    cumulus@leaf01:~$ sudo systemctl start dhcrelay@<dhcp-name>
    

Troubleshooting

This section provides troubleshooting tips.

Show DHCP Relay Status

To show the DHCP relay status:

Run the nv show service dhcp-relay command for IPv4 or the nv show service dhcp-relay6 command for IPv6:

cumulus@leaf01:~$ nv show service dhcp-relay
           source-ip  Summary
---------  ---------  -----------------------
+ default  auto       giaddress-interface: lo
  default             interface:        swp51
  default             interface:        swp52
  default             interface:        vlan10
  default             server:    172.16.1.102

Run the Linux systemctl status dhcrelay.service command for IPv4 or the systemctl status dhcrelay6.service command for IPv6:

cumulus@leaf01:~$ sudo systemctl status dhcrelay.service
● dhcrelay.service - DHCPv4 Relay Agent Daemon
    Loaded: loaded (/lib/systemd/system/dhcrelay.service; enabled)
    Active: active (running) since Fri 2016-12-02 17:09:10 UTC; 2min 16s ago
      Docs: man:dhcrelay(8)
Main PID: 1997 (dhcrelay)
    CGroup: /system.slice/dhcrelay.service
            └─1997 /usr/sbin/dhcrelay --nl -d -q -i vlan10 -i swp51 -i swp52 172.16.1.102

Check systemd

If you are experiencing issues with DHCP relay, check if there is a problem with systemd:

To see how DHCP relay is working on your switch, run the journalctl command:

cumulus@leaf01:~$ sudo journalctl -l -n 20 | grep dhcrelay
Dec 05 20:58:55 leaf01 dhcrelay[6152]: sending upstream swp52
Dec 05 20:58:55 leaf01 dhcrelay[6152]: sending upstream swp51
Dec 05 20:58:55 leaf01 dhcrelay[6152]: Relaying Reply to fe80::4638:39ff:fe00:3 port 546 down.
Dec 05 20:58:55 leaf01 dhcrelay[6152]: Relaying Reply to fe80::4638:39ff:fe00:3 port 546 down.
Dec 05 21:03:55 leaf01 dhcrelay[6152]: Relaying Renew from fe80::4638:39ff:fe00:3 port 546 going up.
Dec 05 21:03:55 leaf01 dhcrelay[6152]: sending upstream swp52
Dec 05 21:03:55 leaf01 dhcrelay[6152]: sending upstream swp51
Dec 05 21:03:55 leaf01 dhcrelay[6152]: Relaying Reply to fe80::4638:39ff:fe00:3 port 546 down.
Dec 05 21:03:55 leaf01 dhcrelay[6152]: Relaying Reply to fe80::4638:39ff:fe00:3 port 546 down.

To specify a time period with the journalctl command, use the --since flag:

cumulus@leaf01:~$ sudo journalctl -l --since "2 minutes ago" | grep dhcrelay
Dec 05 21:08:55 leaf01 dhcrelay[6152]: Relaying Renew from fe80::4638:39ff:fe00:3 port 546 going up.
Dec 05 21:08:55 leaf01 dhcrelay[6152]: sending upstream swp52
Dec 05 21:08:55 leaf01 dhcrelay[6152]: sending upstream swp51

Configuration Errors

If you configure DHCP relays by editing the /etc/default/isc-dhcp-relay file manually, you can introduce configuration errors that cause the switch to crash.

For example, if you see an error similar to the following, check that there is no space between the DHCP server address and the interface you use as the uplink.

Core was generated by /usr/sbin/dhcrelay --nl -d -i vx-40 -i vlan10 10.0.0.4 -U 10.0.1.2  %vlan20.
Program terminated with signal SIGSEGV, Segmentation fault.

To resolve the issue, manually edit the /etc/default/isc-dhcp-relay file to remove the space, then run the systemctl restart dhcrelay.service command to restart the dhcrelay service and apply the configuration change.

Considerations

The dhcrelay command does not bind to an interface if the interface name is longer than 14 characters. This is a known limitation in dhcrelay.

DHCP Servers

A DHCP server automatically provides and assigns IP addresses and other network parameters to client devices. It relies on DHCP to respond to broadcast requests from clients.

If you intend to run the dhcpd service within a VRF, including the management VRF, follow these steps.

Basic Configuration

This section shows you how to configure a DHCP server using the following topology, where the DHCP server is a switch running Cumulus Linux.

To configure the DHCP server on a Cumulus Linux switch:

In addition, you can configure a static IP address for a resource, such as a server or printer:

  • To configure static IP address assignments, you must first configure a pool.
  • You can set the DNS server IP address and domain name globally or specify different DNS server IP addresses and domain names for different pools. The following example commands configure a DNS server IP address and domain name for a pool.

cumulus@switch:~$ nv set service dhcp-server default pool 10.1.10.0/24 pool-name storage-servers
cumulus@switch:~$ nv set service dhcp-server default pool 10.1.10.0/24 domain-name example.com
cumulus@switch:~$ nv set service dhcp-server default pool 10.1.10.0/24 domain-name-server 192.168.200.53
cumulus@switch:~$ nv set service dhcp-server default pool 10.1.10.0/24 range 10.1.10.100 to 10.1.10.199
cumulus@switch:~$ nv set service dhcp-server default pool 10.1.10.0/24 gateway 10.1.10.1
cumulus@switch:~$ nv set service dhcp-server default static server1
cumulus@switch:~$ nv set service dhcp-server default static server1 ip-address 10.0.0.2
cumulus@switch:~$ nv set service dhcp-server default static server1 mac-address 44:38:39:00:01:7e
cumulus@switch:~$ nv config apply

To set the DNS server IP address and domain name globally, use the nv set service dhcp-server <vrf> domain-name-server <address> and nv set service dhcp-server <vrf> domain-name <domain> commands.

cumulus@switch:~$ nv set service dhcp-server6 default pool 2001:db8::1/128 
cumulus@switch:~$ nv set service dhcp-server6 default pool 2001:db8::1/128 pool-name storage-servers
cumulus@switch:~$ nv set service dhcp-server6 default pool 2001:db8::1/128 domain-name-server 2001:db8:100::64
cumulus@switch:~$ nv set service dhcp-server6 default pool 2001:db8::1/128 domain-name example.com
cumulus@switch:~$ nv set service dhcp-server6 default pool 2001:db8::1/128 range 2001:db8:1::100 to 2001:db8:1::199 
cumulus@switch:~$ nv set service dhcp-server6 default static server1
cumulus@switch:~$ nv set service dhcp-server6 default static server1 ip-address 2001:db8:1::100
cumulus@switch:~$ nv set service dhcp-server6 default static server1 mac-address 44:38:39:00:01:7e
cumulus@switch:~$ nv config apply

To set the DNS server IP address and domain name globally, use the nv set service dhcp-server6 <vrf> domain-name-server <address> and nv set service dhcp-server6 <vrf> domain-name <domain> commands.

  1. In a text editor, edit the /etc/dhcp/dhcpd.conf file. Use following configuration as an example:

    cumulus@switch:~$ sudo nano /etc/dhcp/dhcpd.conf
    authoritative;
    subnet 10.1.10.0 netmask 255.255.255.0 {
       option domain-name-servers 192.168.200.53;
       option domain-name example.com;
       option routers 10.1.10.1;
       default-lease-time 3600;
       max-lease-time 3600;
       default-url ;
    pool {
           range 10.1.10.100 10.1.10.199;
           }
    }
    #Statics
    group {
       host server1 {
          hardware ethernet 44:38:39:00:01:7e;
          fixed-address 10.0.0.2;
       }
    }
    

To set the DNS server IP address and domain name globally, add the DNS server IP address and domain name before the pool information in the /etc/dhcp/dhcpd.conf file. For example:

cumulus@switch:~$ sudo nano /etc/dhcp/dhcpd.conf
authoritative;
option domain-name servers;
option domain-name-servers 192.168.200.51;
subnet 10.1.10.0 netmask 255.255.255.0 {
   option routers 10.10.10.1;
   default-lease-time 3600;
   max-lease-time 3600;
...
  1. Edit the /etc/default/isc-dhcp-server configuration file so that the DHCP server starts when the system boots. Here is an example configuration:

    cumulus@switch:~$ sudo nano /etc/default/isc-dhcp-server
    DHCPD_CONF="-cf /etc/dhcp/dhcpd.conf"
    

    INTERFACES="swp1"

  2. Enable and start the dhcpd service:

    cumulus@switch:~$ sudo systemctl enable dhcpd.service
    cumulus@switch:~$ sudo systemctl start dhcpd.service
    
  1. In a text editor, edit the /etc/dhcp/dhcpd6.conf file. Use following configuration as an example:

    cumulus@switch:~$ sudo nano /etc/dhcp/dhcpd6.conf
    authoritative;
    subnet6 2001:db8::1/128 {
       option domain-name-servers 2001:db8:100::64;
       option domain-name example.com;
       option routers 2001:db8::a0a:0a01;
       default-lease-time 3600;
       max-lease-time 3600;
       default-url ;
       pool {
           range6 2001:db8:1::100 2001:db8:1::199;
       }
    }
    #Statics
    group {
       host server1 {
           hardware ethernet 44:38:39:00:01:7e;
           fixed-address6 2001:db8:1::100;
       }
    }
    

To set the DNS server IP address and domain name globally, add the DNS server IP address and domain name before the pool information in the /etc/dhcp/dhcpd6.conf file. For example:

cumulus@switch:~$ sudo nano /etc/dhcp/dhcpd6.conf
authoritative;
option domain-name servers;
option domain-name-servers 2001:db8:100::64;
subnet6 2001:db8::1/128 {
   option routers 2001:db8::a0a:0a01;
   default-lease-time 3600;
   max-lease-time 3600;
...
  1. Edit the /etc/default/isc-dhcp-server6 file so that the DHCP server launches when the system boots. Here is an example configuration:

    cumulus@switch:~$ sudo nano /etc/default/isc-dhcp-server6
    DHCPD_CONF="-cf /etc/dhcp/dhcpd6.conf"
    

    INTERFACES="swp1"

  2. Enable and start the dhcpd6 service:

    cumulus@switch:~$ sudo systemctl enable dhcpd6.service
    cumulus@switch:~$ sudo systemctl start dhcpd6.service
    

Optional Configuration

Lease Time

You can set the network address lease time assigned to DHCP clients. You can specify a number between 180 and 31536000. The default lease time is 600 seconds.

cumulus@switch:~$ nv set service dhcp-server default pool 10.1.10.0/24 lease-time 200000
cumulus@switch:~$ nv config apply
cumulus@switch:~$ nv set service dhcp-server6 default pool 2001:db8::1/128 lease-time 200000
cumulus@switch:~$ nv config apply
  1. Edit the /etc/dhcp/dhcpd.conf file to set the lease time (in seconds):

    cumulus@switch:~$ sudo nano /etc/dhcp/dhcpd.conf
    authoritative;
    subnet 10.1.10.0 netmask 255.255.255.0 {
       option domain-name-servers 192.168.200.53;
       option domain-name example.com;
       option routers 10.1.10.1;
       default-lease-time 200000;
       max-lease-time 200000;
       default-url ;
    pool {
           range 10.1.10.100 10.1.10.199;
           }
    }
    
  2. Restart the dhcpd service:

    cumulus@switch:~$ sudo systemctl restart dhcpd.service
    
  1. Edit the /etc/dhcp/dhcpd6.conf file to set the lease time (in seconds):

    cumulus@switch:~$ sudo nano /etc/dhcp/dhcpd6.conf
    authoritative;
    subnet6 2001:db8::1/128 {
       option domain-name-servers 2001:db8:100::64;
       option domain-name example.com;
       option routers 2001:db8::a0a:0a01;
       default-lease-time 200000;
       max-lease-time 200000;
       default-url ;
       pool {
           range6 2001:db8:1::100 2001:db8:1::199;
       }
    }
    
  2. Restart the dhcpd6 service:

    cumulus@switch:~$ sudo systemctl restart dhcpd6.service
    

Ping Check

Configure the DHCP server to ping the address you want to assign to a client before issuing the IP address. If there is no response, DHCP delivers the IP address; otherwise, it attempts the next available address in the range.

cumulus@switch:~$ nv set service dhcp-server default pool 10.1.10.0/24 ping-check on
cumulus@switch:~$ nv config apply
cumulus@switch:~$ nv set service dhcp-server6 default pool 2001:db8::1/128 ping-check on
cumulus@switch:~$ nv config apply
  1. Edit the /etc/dhcp/dhcpd.conf file to add ping-check true;:

    cumulus@switch:~$ sudo nano /etc/dhcp/dhcpd.conf
    authoritative;
    subnet 10.1.10.0 netmask 255.255.255.0 {
       option domain-name-servers 192.168.200.53;
       option domain-name example.com;
       option routers 10.1.10.1;
       default-lease-time 200000;
       max-lease-time 200000;
       ping-check true;
       default-url ;
    pool {
           range 10.1.10.100 10.1.10.199;
           }
    }
    
  2. Restart the dhcpd service:

    cumulus@switch:~$ sudo systemctl restart dhcpd.service
    
  1. Edit the /etc/dhcp/dhcpd6.conf file to add ping-check true;:

    cumulus@switch:~$ sudo nano /etc/dhcp/dhcpd6.conf
    authoritative;
    subnet6 2001:db8::1/128 {
       option domain-name-servers 2001:db8:100::64;
       option domain-name example.com;
       option routers 2001:db8::a0a:0a01;
       default-lease-time 200000;
       max-lease-time 200000;
       ping-check true;
       default-url ;
       pool {
           range6 2001:db8:1::100 2001:db8:1::199;
       }
    }
    
  2. Restart the dhcpd6 service:

    cumulus@switch:~$ sudo systemctl restart dhcpd6.service
    

Assign Port-based IP Addresses

You can assign an IP address and other DHCP options based on physical location or port regardless of MAC address to clients that attach directly to the Cumulus Linux switch through a switch port. This is helpful when swapping out switches and servers; you can avoid the inconvenience of collecting the MAC address and sending it to the network administrator to modify the DHCP server configuration.

cumulus@switch:~$ nv set service dhcp-server default static server1 ip-address 10.0.0.2
cumulus@switch:~$ nv set service dhcp-server default interface swp1
cumulus@switch:~$ nv config apply
cumulus@switch:~$ nv set service dhcp-server6 default static server1 ip-address 2001:db8:1::100
cumulus@switch:~$ nv set service dhcp-server6 default interface swp1
cumulus@switch:~$ nv config apply
  1. Edit the /etc/dhcp/dhcpd.conf file to add the interface and IP address:

    cumulus@switch:~$ sudo nano /etc/dhcp/dhcpd.conf
    ...
    host myhost {
        ifname "swp1" ;
        fixed-address 10.0.0.10 ;
    }
    ...
    
  2. Restart the dhcpd service:

    cumulus@switch:~$ sudo systemctl restart dhcpd.service
    
  1. Edit the /etc/dhcp/dhcpd6.conf file to add the interface and IP address:

    cumulus@switch:~$ sudo nano /etc/dhcp/dhcpd6.conf
    ...
    host myhost {
        ifname "swp1" ;
        fixed-address 2001:db8:1::100 ;
    }
    ...
    
  2. Restart the dhcpd6 service:

    cumulus@switch:~$ sudo systemctl restart dhcpd6.service
    

Troubleshooting

To show the current DHCP server settings, run the nv show service dhcp-server command:

cumulus@leaf01:mgmt:~$ nv show service dhcp-server
           Summary
---------  ------------------
+ default  interface:   "swp1
  default  pool: 10.1.10.0/24
  default  static:    server1

The DHCP server determines if a DHCP request is a relay or a non-relay DHCP request. Run the following command to see the DHCP request:

cumulus@server02:~$ sudo tail /var/log/syslog | grep dhcpd
2016-12-05T19:03:35.379633+00:00 server02 dhcpd: Relay-forward message from 2001:db8:101::1 port 547, link address 2001:db8:101::1, peer address fe80::4638:39ff:fe00:3
2016-12-05T19:03:35.380081+00:00 server02 dhcpd: Advertise NA: address 2001:db8:1::110 to client with duid 00:01:00:01:1f:d8:75:3a:44:38:39:00:00:03 iaid = 956301315 valid for 600 seconds
2016-12-05T19:03:35.380470+00:00 server02 dhcpd: Sending Relay-reply to 2001:db8:101::1 port 547

Prescriptive Topology Manager - PTM

In data center topologies, right cabling is time consuming and error prone. Prescriptive Topology Manager (PTM) is a dynamic cabling verification tool that can detect and eliminate errors. PTM uses a Graphviz-DOT specified network cabling plan in a topology.dot file and couples it with runtime information from LLDP to verify that the cabling matches the specification. The check occurs on every link transition on each node in the network.

You can customize the topology.dot file to control ptmd at both the global/network level and the node/port level.

PTM runs as a daemon, named ptmd.

Supported Features

Configure PTM

ptmd verifies the physical network topology against a DOT-specified network graph file, /etc/ptm.d/topology.dot.

PTM supports undirected graphs.

At startup, ptmd connects to lldpd (the LLDP daemon) over a Unix socket and retrieves the neighbor name and port information. It then compares the retrieved port information with the configuration information that it reads from the topology file. If there is a match, it is a PASS, otherwise it is a FAIL.

PTM performs its LLDP neighbor check using the PortID ifname TLV information.

ptmd Scripts

ptmd executes scripts at /etc/ptm.d/if-topo-pass and /etc/ptm.d/if-topo-failfor each interface that goes through a change and runs if-topo-pass when an LLDP or BFD check passes or if-topo-fails when the check fails. The scripts receive an argument string that is the result of the ptmctl command; see ptmd commands below.

You can modify these default scripts.

Configuration Parameters

You can configure ptmd parameters in the topology file. The parameters are host-only, global, per-port/node and templates.

Host-only Parameters

Host-only parameters apply to the entire host on which PTM is running. You can include the hostnametype host-only parameter, which specifies if PTM uses only the hostname (hostname) or the fully qualified domain name (fqdn) while looking for the self-node in the graph file. For example, in the graph file below PTM ignores the FQDN and only looks for switch04 because that is the hostname of the switch on which it is running:

  • Always wrap the hostname in double quotes; for example, "www.example.com" to prevent ptmd from failing.
  • To avoid errors when starting the ptmd process, make sure that /etc/hosts and /etc/hostname both reflect the hostname you are using in the topology.dot file.

graph G {
          hostnametype="hostname"
          BFD="upMinTx=150,requiredMinRx=250"
          "cumulus":"swp44" -- "switch04.cumulusnetworks.com":"swp20"
          "cumulus":"swp46" -- "switch04.cumulusnetworks.com":"swp22"
}

In this next example, PTM compares using the FQDN and looks for switch05.cumulusnetworks.com, which is the FQDN of the switch ion which it is running:

graph G {
          hostnametype="fqdn"
          "cumulus":"swp44" -- "switch05.cumulusnetworks.com":"swp20"
          "cumulus":"swp46" -- "switch05.cumulusnetworks.com":"swp22"
}

Global Parameters

Global parameters apply to every port in the topology file. There are two global parameters: LLDP and BFD. LLDP is on by default; if no keyword is present, PTM uses the default values for all ports. However, BFD is off if no keyword is present unless a per-port override exists. For example:

graph G {
          LLDP=""
          BFD="upMinTx=150,requiredMinRx=250,afi=both"
          "cumulus":"swp44" -- "qct-ly2-04":"swp20"
          "cumulus":"swp46" -- "qct-ly2-04":"swp22"
}

Per-port Parameters

Per-port parameters provide finer-grained control at the port level. These parameters override any global or compiled defaults. For example:

graph G {
          LLDP=""
          BFD="upMinTx=300,requiredMinRx=100"
          "cumulus":"swp44" -- "qct-ly2-04":"swp20" [BFD="upMinTx=150,requiredMinRx=250,afi=both"]
          "cumulus":"swp46" -- "qct-ly2-04":"swp22"
}

Templates

Templates provide flexibility in choosing different parameter combinations and applying them to a given port. A template instructs ptmd to reference a named parameter string instead of a default one. There are two parameter strings ptmd supports:

For example:

graph G {
          LLDP=""
          BFD="upMinTx=300,requiredMinRx=100"
          BFD1="upMinTx=200,requiredMinRx=200"
          BFD2="upMinTx=100,requiredMinRx=300"
          LLDP1="match_type=ifname"
          LLDP2="match_type=portdescr"
          "cumulus":"swp44" -- "qct-ly2-04":"swp20" [BFD="bfdtmpl=BFD1", LLDP="lldptmpl=LLDP1"]
          "cumulus":"swp46" -- "qct-ly2-04":"swp22" [BFD="bfdtmpl=BFD2", LLDP="lldptmpl=LLDP2"]
          "cumulus":"swp46" -- "qct-ly2-04":"swp22"
}

In this template, LLDP1 and LLDP2 are templates for LLDP parameters. BFD1 and BFD2 are templates for BFD parameters.

Supported BFD and LLDP Parameters

ptmd supports the following BFD parameters:

The following is an example of a topology with BFD at the port level:

graph G {
          "cumulus-1":"swp44" -- "cumulus-2":"swp20" [BFD="upMinTx=300,requiredMinRx=100,afi=v6"]
          "cumulus-1":"swp46" -- "cumulus-2":"swp22" [BFD="detectMult=4"]
}

ptmd supports the following LLDP parameters:

The following is an example of a topology with LLDP at the port level:

graph G {
          "cumulus-1":"swp44" -- "cumulus-2":"swp20" [LLDP="match_hostname=fqdn"]
          "cumulus-1":"swp46" -- "cumulus-2":"swp22" [LLDP="match_type=portdescr"]
}

When you specify match_hostname=fqdn, ptmd matches the entire FQDN, (cumulus-2.domain.com in the example below). If you do not specify anything for match_hostname, ptmd matches based on hostname only, (cumulus-3 below), and ignores the rest of the URL:

graph G {
          "cumulus-1":"swp44" -- "cumulus-2.domain.com":"swp20" [LLDP="match_hostname=fqdn"]
          "cumulus-1":"swp46" -- "cumulus-3":"swp22" [LLDP="match_type=portdescr"]
}

Bidirectional Forwarding Detection (BFD)

BFD provides low overhead and rapid detection of failures in the paths between two network devices. It provides a unified mechanism for link detection over all media and protocol layers. Use BFD to detect failures for IPv4 and IPv6 single or multihop paths between any two network devices, including unidirectional path failure detection. For information about configuring BFD using PTM, see BFD.

The FRRouting routing suite enables additional checks to ensure that routing adjacencies form only on links that have connectivity that conform to the specification that ptmd defines.

You only need to do this to check link state; you do not need to enable PTM to determine BFD status.

When you enable the global ptm-enable option, every interface has an implied ptm-enable line in the configuration stanza in the /etc/network/interfaces file.

To enable the global ptm-enable option, run the following FRRouting command:

cumulus@switch:~$ sudo vtysh

switch# configure terminal
switch(config)# ptm-enable
switch(config)# end
switch# write memory
switch# exit
cumulus@switch:~$

To disable the checks, delete the ptm-enable parameter from the interface:

cumulus@switch:~$ sudo vtysh
switch# conf t
switch(config)# interface swp51
switch(config-if)# no ptm-enable
switch(config-if)# end
switch# write memory
switch# exit
cumulus@switch:~$

If you need to reenable PTM for that interface:

cumulus@switch:~$ sudo vtysh
switch# conf t
switch(config)# interface swp51
switch(config-if)# ptm-enable

switch(config-if)# end
switch# write memory
switch# exit
cumulus@switch:~$

With PTM on an interface, the zebra daemon connects to ptmd over a Unix socket. Any time there is a change of status for an interface, ptmd sends notifications to zebra. Zebra maintains a ptm-status flag per interface and evaluates routing adjacency based on this flag. To check the per-interface ptm-status, run the NVUE nv show interface <interface> command or the vtysh show interface <interface> command.

cumulus@switch:~$ sudo vtysh
switch# show interface swp1
Interface swp1 is up, line protocol is up
  Link ups:       0    last: (never)
  Link downs:     0    last: (never)
  PTM status: disabled
  vrf: Default-IP-Routing-Table
  index 3 metric 0 mtu 1550
  flags: <UP,BROADCAST,RUNNING,MULTICAST>
  HWaddr: c4:54:44:bd:01:41
...

ptmd Service Commands

PTM sends client notifications in CSV format.

To start or restart the ptmd service, run the following command. The topology.dot file must be present for the service to start.

cumulus@switch:~$ sudo systemctl start|restart|force-reload ptmd.service

To instruct ptmd to read the topology.dot file again to apply the new configuration to the running state without restarting:

cumulus@switch:~$ sudo systemctl reload ptmd.service

To stop the ptmd service:

cumulus@switch:~$ sudo systemctl stop ptmd.service

To retrieve the current running state of ptmd:

cumulus@switch:~$ sudo systemctl status ptmd.service

ptmctl Commands

ptmctl is a client of ptmd that retrieves the operational state of the ports configured on the switch and information about BFD sessions from ptmd. ptmctl parses the CSV notifications sent by ptmd. See man ptmctl for more information.

ptmctl Examples

The examples below contain the following keywords in the output of the cbl status column:

cbl status KeywordDefinition
passThe topology file defines the interface, the interface receives LLDP information, and the LLDP information for the interface matches the information in the topology file.
failThe topology file defines the interface, the interface receives LLDP information, and the LLDP information for the interface does not match the information in the topology file.
N/AThe topology file defines the interface but the interface does not receive LLDP information. The interface might be down or disconnected, or the neighbor is not sending LLDP packets.
The N/A and fail status might indicate a wiring problem to investigate.
The N/A status does not show when you use the -l option with ptmctl; The output shows only interfaces that are receiving LLDP information.

For basic output, use ptmctl without any options:

cumulus@switch:~$ sudo ptmctl

-------------------------------------------------------------
port  cbl     BFD     BFD                  BFD    BFD
      status  status  peer                 local  type
-------------------------------------------------------------
swp1  pass    pass    11.0.0.2             N/A    singlehop
swp2  pass    N/A     N/A                  N/A    N/A
swp3  pass    N/A     N/A                  N/A    N/A

For more detailed output, use the -d option:

cumulus@switch:~$ sudo ptmctl -d

--------------------------------------------------------------------------------------
port  cbl    exp     act      sysname  portID  portDescr  match  last    BFD   BFD
      status nbr     nbr                                  on     upd     Type  state
--------------------------------------------------------------------------------------
swp45 pass   h1:swp1 h1:swp1  h1       swp1    swp1       IfName 5m: 5s  N/A   N/A
swp46 fail   h2:swp1 h2:swp1  h2       swp1    swp1       IfName 5m: 5s  N/A   N/A

#continuation of the output
-------------------------------------------------------------------------------------------------
BFD   BFD       det_mult  tx_timeout  rx_timeout  echo_tx_timeout  echo_rx_timeout  max_hop_cnt
peer  DownDiag
-------------------------------------------------------------------------------------------------
N/A   N/A       N/A       N/A         N/A         N/A              N/A              N/A
N/A   N/A       N/A       N/A         N/A         N/A              N/A              N/A

To return information on active BFD sessions ptmd is tracking, use the -b option:

cumulus@switch:~$ sudo ptmctl -b

----------------------------------------------------------
port  peer        state  local         type       diag

----------------------------------------------------------
swp1  11.0.0.2    Up     N/A           singlehop  N/A
N/A   12.12.12.1  Up     12.12.12.4    multihop   N/A

To return LLDP information, use the -l option. The output returns only the active neighbors that ptmd is tracking.

cumulus@switch:~$ sudo ptmctl -l

---------------------------------------------
port  sysname  portID  port   match  last
                       descr  on     upd
---------------------------------------------
swp45 h1       swp1    swp1   IfName 5m:59s
swp46 h2       swp1    swp1   IfName 5m:59s

To return detailed information on active BFD sessions ptmd is tracking, use the -b and -d option (results are for an IPv6-connected peer):

cumulus@switch:~$ sudo ptmctl -b -d

----------------------------------------------------------------------------------------
port  peer                 state  local  type       diag  det   tx_timeout  rx_timeout
                                                          mult
----------------------------------------------------------------------------------------
swp1  fe80::202:ff:fe00:1  Up     N/A    singlehop  N/A   3     300         900
swp1  3101:abc:bcad::2     Up     N/A    singlehop  N/A   3     300         900

#continuation of output
---------------------------------------------------------------------
echo        echo        max      rx_ctrl  tx_ctrl  rx_echo  tx_echo
tx_timeout  rx_timeout  hop_cnt
---------------------------------------------------------------------
0           0           N/A      187172   185986   0        0
0           0           N/A      501      533      0        0

ptmctl Error Outputs

If there are errors in the topology file or there is no session, PTM returns appropriate outputs. Typical error strings are:

Topology file error [/etc/ptm.d/topology.dot] [cannot find node cumulus] -
please check /var/log/ptmd.log for more info

Topology file error [/etc/ptm.d/topology.dot] [cannot open file (errno 2)] -
please check /var/log/ptmd.log for more info

No Hostname/MgmtIP found [Check LLDPD daemon status] -
please check /var/log/ptmd.log for more info

No BFD sessions . Check connections

No LLDP ports detected. Check connections

Unsupported command

For example:

cumulus@switch:~$ sudo ptmctl
-------------------------------------------------------------------------
cmd         error
-------------------------------------------------------------------------
get-status  Topology file error [/etc/ptm.d/topology.dot]
            [cannot open file (errno 2)] - please check /var/log/ptmd.log
            for more info

If you encounter errors with the topology.dot file, you can use dot (included in the Graphviz package) to validate the syntax of the topology file.

Open the topology file with Graphviz to ensure that it is readable and that the file format is correct.

If you edit topology.dot file from a Windows system, be sure to double check the file formatting; there might be extra characters that keep the graph from working correctly.

Basic Topology Example

The following example shows a basic example DOT file and its corresponding topology diagram. Use the same topology.dot file on all switches and do not split the file for each device to allow for easy automation by using the same exact file on each device.

graph G {
    "spine1":"swp1" -- "leaf1":"swp1";
    "spine1":"swp2" -- "leaf2":"swp1";
    "spine2":"swp1" -- "leaf1":"swp2";
    "spine2":"swp2" -- "leaf2":"swp2";
    "leaf1":"swp3" -- "leaf2":"swp3";
    "leaf1":"swp4" -- "leaf2":"swp4";
    "leaf1":"swp5s0" -- "server1":"eth1";
    "leaf2":"swp5s0" -- "server2":"eth1";
}

Considerations

ptmd Is in an Incorrect Failure State

When ptmd is in an incorrect failure state and you enable the Zebra interface, PIF BGP sessions do not establish the route but the subinterface does establish routes.

If the subinterface is on the physical interface and PTM marks the physical interface in a PTM FAIL state, FRR does not process routes on the physical interface, but the subinterface is working.

Commas in Port Descriptions

If an LLDP neighbor advertises a PortDescr that contains commas, ptmctl -d splits the string on the commas and misplaces its components in other columns. Do not use commas in your port descriptions.

Layer 2

This section describes layer 2 configuration, such as Ethernet bridging, bonding, spanning tree protocol, multi-chassis link aggregation (MLAG), link layer discovery protocol (LLDP), LACP bypass, virtual router redundancy (VRR) and IGMP and MLD snooping.

Link Layer Discovery Protocol

The lldpd daemon implements the IEEE802.1AB Link Layer Discovery Protocol (LLDP) standard. LLDP shows which ports are neighbors of a given port.

By default, lldpd runs as a daemon and starts at system boot. lldpd command line arguments are in the /etc/default/lldpd file. Cumulus Linux saves all lldpd configuration options in the /etc/lldpd.conf file or under /etc/lldpd.d/.

lldpd supports CDP (Cisco Discovery Protocol, v1 and v2) and logs by default into /var/log/daemon.log with an lldpd prefix.

You can use the lldpcli CLI tool to query the lldpd daemon for neighbors, statistics, and other running configuration information. See man lldpcli(8) for details.

Configure LLDP Timers

You can configure the frequency of LLDP updates (between 10 and 300 seconds) and the amount of time to hold the information before discarding it. The hold time interval is a multiple of the tx-interval.

The following example commands configure the frequency of LLDP updates to 100 and the hold time to 3.

cumulus@switch:~$ nv set service lldp tx-interval 100
cumulus@switch:~$ nv set service lldp tx-hold-multiplier 3
cumulus@switch:~$ nv config apply

Add the timers to the /etc/lldpd.conf file or to your .conf file in the /etc/lldpd.d/ directory.

Cumulus Linux does not ship with a /etc/lldpd.conf file. You must create the /etc/lldpd.conf file or create a .conf file in the /etc/lldpd.d/ directory.

cumulus@switch:~$ sudo nano /etc/lldpd.conf
configure lldp tx-interval 40
configure lldp tx-hold 3
...

Disable LLDP on an Interface

To disable LLDP on a single interface, edit the /etc/default/lldpd file. This file specifies the default options to present to the lldpd service when it starts. The following example uses the -I option to disable LLDP on swp43:

cumulus@switch:~$ sudo nano /etc/default/lldpd

# Add "-x" to DAEMON_ARGS to start SNMP subagent
# Enable CDP by default
DAEMON_ARGS="-c -I *,!swp43"
Runtime Configuration (Advanced)

A runtime configuration does not persist when you reboot the switch; you lose all changes.

To configure active interfaces:

cumulus@switch:~$ sudo lldpcli configure system interface pattern "swp*"

To configure inactive interfaces:

cumulus@switch:~$ sudo lldpcli configure system interface pattern *,!eth0,swp*

The active interface list always overrides the inactive interface list.

To reset any interface list to none:

cumulus@switch:~$ sudo lldpcli configure system interface pattern ""

Enable the SNMP Subagent

LLDP does not enable the SNMP subagent by default. To enable the SNMP subagent, edit the /etc/default/lldpd file and add the -x option:

cumulus@switch:~$ sudo nano /etc/default/lldpd

# Add "-x" to DAEMON_ARGS to start SNMP subagent

# Enable CDP by default
DAEMON_ARGS="-c -x -M 4"

The -c option enables backwards compatibility with CDP and the -M 4 option sends a field in discovery packets to indicate that the switch is a network device.

Troubleshooting

To show all neighbors on all ports and interfaces:

cumulus@switch:~$ sudo lldpcli show neighbors
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface:    eth0, via: LLDP, RID: 1, Time: 0 day, 17:38:08
  Chassis:
    ChassisID:    mac 08:9e:01:e9:66:5a
    SysName:      PIONEERMS22
    SysDescr:     Cumulus Linux version 4.1.0 running on quanta lb9
    MgmtIP:       192.168.0.22
    Capability:   Bridge, on
    Capability:   Router, on
  Port:
    PortID:       ifname swp47
    PortDescr:    swp47
-------------------------------------------------------------------------------
Interface:    swp1, via: LLDP, RID: 10, Time: 0 day, 17:08:27
  Chassis:
    ChassisID:    mac 00:01:00:00:09:00
    SysName:      MSP-1
    SysDescr:     Cumulus Linux version 4.1.0 running on QEMU Standard PC (i440FX + PIIX, 1996)
    MgmtIP:       192.0.2.9
    MgmtIP:       fe80::201:ff:fe00:900
    Capability:   Bridge, off
    Capability:   Router, on
  Port:
    PortID:       ifname swp1
    PortDescr:    swp1
-------------------------------------------------------------------------------
Interface:    swp2, via: LLDP, RID: 10, Time: 0 day, 17:08:27
  Chassis:
    ChassisID:    mac 00:01:00:00:09:00
    SysName:      MSP-1
    SysDescr:     Cumulus Linux version 4.1.0 running on QEMU Standard PC (i440FX + PIIX, 1996)
    MgmtIP:       192.0.2.9
    MgmtIP:       fe80::201:ff:fe00:900
    Capability:   Bridge, off
    Capability:   Router, on
  Port:
    PortID:       ifname swp2
    PortDescr:    swp2
-------------------------------------------------------------------------------
Interface:    swp3, via: LLDP, RID: 11, Time: 0 day, 17:08:27
  Chassis:
    ChassisID:    mac 00:01:00:00:0a:00
    SysName:      MSP-2
    SysDescr:     Cumulus Linux version 4.1.0 running on QEMU Standard PC (i440FX + PIIX, 1996)
    MgmtIP:       192.0.2.10
    MgmtIP:       fe80::201:ff:fe00:a00
    Capability:   Bridge, off
    Capability:   Router, on
  Port:
    PortID:       ifname swp1
    PortDescr:    swp1
...

To show lldpd statistics for all ports:

cumulus@switch:~$ sudo lldpcli show statistics
----------------------------------------------------------------------
LLDP statistics:
----------------------------------------------------------------------
Interface:    eth0
  Transmitted:  9423
  Received:     17634
  Discarded:    0
  Unrecognized: 0
  Ageout:       10
  Inserted:     20
  Deleted:      10
--------------------------------------------------------------------
Interface:    swp1
  Transmitted:  9423
  Received:     6264
  Discarded:    0
  Unrecognized: 0
  Ageout:       0
  Inserted:     2
  Deleted:      0
---------------------------------------------------------------------
Interface:    swp2
  Transmitted:  9423
  Received:     6264
  Discarded:    0
  Unrecognized: 0
  Ageout:       0
  Inserted:     2
  Deleted:      0
---------------------------------------------------------------------
Interface:    swp3
  Transmitted:  9423
  Received:     6265
  Discarded:    0
  Unrecognized: 0
  Ageout:       0
  Inserted:     2
  Deleted:      0
----------------------------------------------------------------------
...

To show a summary of lldpd statistics for all ports:

cumulus@switch:~$ sudo lldpcli show statistics summary
---------------------------------------------------------------------
LLDP Global statistics:
---------------------------------------------------------------------
Summary of stats:
  Transmitted:  648186
  Received:     437557
  Discarded:    0
  Unrecognized: 0
  Ageout:       10
  Inserted:     38
  Deleted:      10

To show the running LLDP configuration:

cumulus@switch:~$ sudo lldpcli show running-configuration
--------------------------------------------------------------------
Global configuration:
--------------------------------------------------------------------
Configuration:
  Transmit delay: 30
  Transmit hold: 4
  Receive mode: no
  Pattern for management addresses: (none)
  Interface pattern: (none)
  Interface pattern blacklist: (none)
  Interface pattern for chassis ID: (none)
  Override description with: (none)
  Override platform with: Linux
  Override system name with: (none)
  Advertise version: yes
  Update interface descriptions: no
  Promiscuous mode on managed interfaces: no
  Disable LLDP-MED inventory: yes
  LLDP-MED fast start mechanism: yes
  LLDP-MED fast start interval: 1
  Source MAC for LLDP frames on bond slaves: local
  Portid TLV Subtype for lldp frames: ifname
--------------------------------------------------------------------

You can also run the NVUE nv show system lldp command to show the running LLDP configuration.

Considerations

Annex E (and hence Annex D) of IEEE802.1AB (lldp) is not supported.

Ethernet Bridging - VLANs

Ethernet bridges enable hosts to communicate through layer 2 by connecting the physical and logical interfaces in the system into a single layer 2 domain. The bridge is a logical interface with a MAC address and an MTU (maximum transmission unit). The bridge MTU is the minimum MTU among all its members. By default, the bridge’s MAC address is the MAC address of the first port in the bridge-ports list. You can also assign an IP address to the bridge; see below.

  • Bridge members can be individual physical interfaces, bonds, or logical interfaces that traverse an 802.1Q VLAN trunk.
  • Cumulus Linux does not put all ports into a bridge by default.

Ethernet Bridge Types

The Cumulus Linux bridge driver supports two configuration modes; one that is VLAN-aware and one that follows a more traditional Linux bridge model.

NVIDIA recommends that you use VLAN-aware mode bridges instead of traditional mode bridges. The Cumulus Linux bridge driver is capable of VLAN filtering, which allows for configurations that are similar to incumbent network devices. For a comparison of traditional and VLAN-aware modes, see this knowledge base article.

You can configure both VLAN-aware and traditional mode bridges on the same network in Cumulus Linux.

Bridge MAC Addresses

The switch learns the MAC address for a frame when the frame enters the bridge through an interface and records the MAC address in the bridge table. The bridge forwards the frame to its intended destination by looking up the destination MAC address. Cumulus Linux maintains the MAC entry for 1800 seconds (30 minutes). If the switch sees the frame with the same source MAC address before the MAC entry age expires, it refreshes the MAC entry age; if the MAC entry age expires, the switch deletes the MAC address from the bridge table.

The following example NCLU command output shows a MAC address table for the bridge.

cumulus@switch:~$ net show bridge macs
VLAN      Master    Interface    MAC                  TunnelDest  State      Flags    LastSeen
--------  --------  -----------  -----------------  ------------  ---------  -------  -----------------
untagged  bridge    swp1         44:38:39:00:00:03                                    00:00:15
untagged  bridge    swp1         44:38:39:00:00:04                permanent           20 days, 01:14:03

bridge fdb Command Output

The Linux bridge fdb command interacts with the forwarding database table (FDB), which the bridge uses to store the MAC addresses it learns and the ports on which it learns those MAC addresses. The bridge fdb show command output contains some specific keywords:

KeywordDescription
selfThe FDB entry belongs to the FDB on the device referenced by the device.
For example, this FDB entry belongs to the VXLAN device:
vx-1000: 00:02:00:00:00:08 dev vx-1000 dst 27.0.0.10 self
masterThe FDB entry belongs to the FDB on the device’s master and the FDB entry is pointing to a master’s port.
For example, this FDB entry is from the master device named bridge and is pointing to the VXLAN bridge port:
vx-1001: 02:02:00:00:00:08 dev vx-1001 vlan 1001 master bridge
extern_learnAn external control plane, such as the BGP control plane for EVPN, manages (offloads) the FDB entry.

The following example shows the bridge fdb show command output:

cumulus@switch:~$ bridge fdb show | grep 02:02:00:00:00:08
02:02:00:00:00:08 dev vx-1001 vlan 1001 extern_learn master bridge
02:02:00:00:00:08 dev vx-1001 dst 27.0.0.10 self extern_learn

Considerations

VLAN-aware Bridge Mode

VLAN-aware bridge mode in Cumulus Linux implements a configuration model for large-scale layer 2 environments, with one single instance of spanning tree protocol. Each physical bridge member port includes the list of allowed VLANs as well as the port VLAN ID, either the primary VLAN Identifier (PVID) or native VLAN. MAC address learning, filtering and forwarding are VLAN-aware. This reduces the configuration size, and eliminates the large overhead of managing the port and VLAN instances as subinterfaces, replacing them with lightweight VLAN bitmaps and state updates.

On NVIDIA Spectrum-2 and Spectrum-3 switches, Cumulus Linux supports multiple VLAN-aware bridges but with the following limitations:

Configure a VLAN-aware Bridge

The example commands below create a VLAN-aware bridge for STP that contains two switch ports and includes 3 VLANs; tagged VLANs 10 and 20, and untagged (native) VLAN 1.

cumulus@switch:~$ net add bridge bridge ports swp1-2
cumulus@switch:~$ net add bridge bridge vids 10,20
cumulus@switch:~$ net add bridge bridge pvid 1
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

The above commands create the following code snippet in the /etc/network/interfaces file:

auto bridge
iface bridge
    bridge-ports swp1 swp2
    bridge-pvid 1
    bridge-vids 10 20
    bridge-vlan-aware yes

With NVUE, there is a default bridge called br_default, which has no ports assigned. The example below configures this default bridge.

cumulus@switch:~$ nv set interface swp1-2 bridge domain br_default
cumulus@switch:~$ nv set bridge domain br_default vlan 10,20
cumulus@switch:~$ nv set bridge domain br_default untagged 1
cumulus@switch:~$ nv config apply

Edit the /etc/network/interfaces file and add the bridge:

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto br_default
iface br_default
    bridge-ports swp1 swp2
    bridge-vids 10 20
    bridge-pvid 1
    bridge-vlan-aware yes
...

Run the ifreload -a command to load the new configuration:

cumulus@switch:~$ ifreload -a

The Primary VLAN Identifier (PVID) of the bridge defaults to 1. You do not have to specify bridge-pvid for a bridge or a port. However, even though this does not affect the configuration, it helps other users for readability. The following configurations are identical to each other and the configuration above:

auto br_default
iface br_default
    bridge-ports swp1 swp2
    bridge-vids 1 10 20
    bridge-vlan-aware yes
auto br_default
iface br_default
    bridge-ports swp1 swp2
    bridge-pvid 1
    bridge-vids 1 10 20
    bridge-vlan-aware yes
auto br_default
iface br_default
    bridge-ports swp1 swp2
    bridge-vids 10 20
    bridge-vlan-aware yes

  • If you specify bridge-vids or bridge-pvid at the bridge level, all ports in the bridge inherit these configurations. However, specifying any of these settings for a specific port overrides the setting in the bridge.
  • Do not bridge the management port eth0 with any switch ports. For example, if you create a bridge with eth0 and swp1, the bridge does not work correctly and disrupts access to the management interface.

Configure Multiple VLAN-aware Bridges

This example shows the commands required to create two VLAN-aware bridges on the switch.

Bridges are independent so you can reuse VLANs between bridges. Each VLAN-aware bridge maintains its own MAC address and VLAN tag table; MAC and VLAN tags in one bridge are not visible to the other table.

cumulus@switch:~$ net add bridge bridge1 ports swp1-2
cumulus@switch:~$ net add bridge bridge1 vids 10,20
cumulus@switch:~$ net add bridge bridge1 pvid 1
cumulus@switch:~$ net add bridge bridge2 ports swp3
cumulus@switch:~$ net add bridge bridge2 vids 10
cumulus@switch:~$ net add bridge bridge2 pvid 1
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set interface swp1-2 bridge domain bridge1
cumulus@switch:~$ nv set bridge domain bridge1 vlan 10,20
cumulus@switch:~$ nv set bridge domain bridge1 untagged 1
cumulus@switch:~$ nv set interface swp3 bridge domain bridge2
cumulus@switch:~$ nv set bridge domain bridge2 vlan 10
cumulus@switch:~$ nv set bridge domain bridge2 untagged 1
cumulus@switch:~$ nv config apply

Edit the /etc/network/interfaces file and add the bridge:

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto bridge1
iface bridge1
    bridge-ports swp1 swp2
    bridge-vlan-aware yes
    bridge-vids 10 20
    bridge-pvid 1

auto bridge2
iface bridge2
    bridge-ports swp3
    bridge-vlan-aware yes
    bridge-vids 10
    bridge-pvid 1
...

Run the ifreload -a command to load the new configuration:

cumulus@switch:~$ ifreload -a

NVIDIA Spectrum switches support a maximum of 6000 VLAN elements and calculate the total number of VLAN elements as the number of VLANs times the number of configured bridges. For example, 6 bridges, each containing 1000 VLANS totals 6000 VLAN elements.

Reserved VLAN Range

For hardware data plane internal operations, the switching silicon requires VLANs for every physical port, Linux bridge, and layer 3 subinterface. Cumulus Linux reserves a range of VLANs by default; the reserved range is 3725-3999.

If the reserved VLAN range conflicts with any user-defined VLANs, you can modify the range. The new range must be a contiguous set of VLANs with IDs between 2 and 4094. For a single VLAN-aware bridge, the minimum size of the range is 2 VLANs. For multiple VLAN-aware bridges, the minimum size of the range is the number of VLAN-aware bridges on the system plus one.

To configure the reserved range, edit the /etc/cumulus/switchd.conf file to uncomment the resv_vlan_range line and specify a new range. After you save the file, you must restart switchd.

VLAN Pruning

By default, the bridge port inherits the bridge VIDs, however, you can configure a port to override the bridge VIDs.

This example commands configure swp3 to override the bridge VIDs:

cumulus@switch:~$ net add bridge bridge ports swp1-3
cumulus@switch:~$ net add bridge bridge vids 10,20
cumulus@switch:~$ net add bridge bridge pvid 1
cumulus@switch:~$ net add interface swp3 bridge vids 20
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set interface swp1-3 bridge domain br_default
cumulus@switch:~$ nv set bridge domain br_default vlan 10,20
cumulus@switch:~$ nv set bridge domain br_default untagged 1
cumulus@switch:~$ nv set interface swp3 bridge domain br_default vlan 20
cumulus@switch:~$ nv config apply

Edit the /etc/network/interfaces file, then run the ifreload -a command. The following example commands configure swp3 to override the bridge VIDs:

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto br_default
iface br_default
    bridge-ports swp1 swp2 swp3
    bridge-pvid 1
    bridge-vids 10 20
    bridge-vlan-aware yes

auto swp3
iface swp3
  bridge-vids 20
...
cumulus@switch:~$ ifreload -a

Access Ports and Tagged Packets

Access ports ignore all tagged packets. In the configuration below, swp1 and swp2 are access ports, while all untagged traffic goes to VLAN 10:

cumulus@switch:~$ net add bridge bridge ports swp1-2
cumulus@switch:~$ net add bridge bridge vids 10,20
cumulus@switch:~$ net add bridge bridge pvid 1
cumulus@switch:~$ net add interface swp1 bridge access 10
cumulus@switch:~$ net add interface swp2 bridge access 10
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set interface swp1-2 bridge domain br_default
cumulus@switch:~$ nv set bridge domain br_default vlan 10,20
cumulus@switch:~$ nv set bridge domain br_default untagged 1
cumulus@switch:~$ nv set interface swp1 bridge domain br_default access 10
cumulus@switch:~$ nv set interface swp2 bridge domain br_default access 10
cumulus@switch:~$ nv config apply

Edit the /etc/network/interfaces file, then run the ifreload -a command.

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto br_default
iface br_default
    bridge-ports swp1 swp2
    bridge-pvid 1
    bridge-vids 10 20
    bridge-vlan-aware yes

auto swp1
iface swp1
    bridge-access 10

auto swp2
iface swp2
    bridge-access 10
...
cumulus@switch:~$ ifreload -a

Drop Untagged Frames

With VLAN-aware bridge mode, you can configure a switch port to drop any untagged frames. To do this, add bridge-allow-untagged no to the switch port (not to the bridge). The bridge port is without a PVID and drops untagged packets.

The following example command configures swp2 to drop untagged frames:

cumulus@switch:~$ net add interface swp2 bridge allow-untagged no
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit 
cumulus@switch:~$ nv set interface swp2 bridge domain br_default untagged none
cumulus@switch:~$ nv config apply

Edit the /etc/network/interfaces file to add the bridge-allow-untagged no line under the switch port interface stanza, then run the ifreload -a command.

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto swp1
iface swp1

auto swp2
iface swp2
    bridge-allow-untagged no

auto br_default
iface br_default
    bridge-ports swp1 swp2
    bridge-pvid 1
    bridge-vids 10 20
    bridge-vlan-aware yes
...
cumulus@switch:~$ sudo ifreload -a

When you check VLAN membership for that port, it shows that there is no untagged VLAN.

cumulus@switch:~$ bridge -c vlan show
portvlan ids
swp1 1 PVID Egress Untagged
  10 20

swp2 10 20

bridge 1

VLAN Layer 3 Addressing

When configuring the VLAN attributes for the bridge, specify the attributes for each VLAN interface. If you are configuring the switch virtual interface (SVI) for the native VLAN, you must declare the native VLAN and specify its IP address. Specifying the IP address in the bridge stanza itself returns an error.

The following example commands declare native VLAN 10 with IPv4 address 10.1.10.2/24 and IPv6 address 2001:db8::1/32.

The NVUE and Linux commands also show an example with multiple VLAN-aware bridges.

cumulus@switch:~$ net add vlan 10 ip address 10.1.10.2/24
cumulus@switch:~$ net add vlan 10 ipv6 address 2001:db8::1/32
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set interface vlan10 ip address 10.1.10.2/24
cumulus@switch:~$ nv set interface vlan10 ip address 2001:db8::1/32
cumulus@switch:~$ nv config apply
cumulus@switch:~$ nv set interface bridge2_vlan10 type svi
cumulus@switch:~$ nv set interface bridge2_vlan10 vlan 10
cumulus@switch:~$ nv set interface bridge2_vlan10 base-interface bridge2
cumulus@switch:~$ nv set interface bridge2_vlan10 ip address 10.1.10.2/24
cumulus@switch:~$ nv set interface bridge1_vlan10 type svi
cumulus@switch:~$ nv set interface bridge1_vlan10 vlan 10
cumulus@switch:~$ nv set interface bridge1_vlan10 base-interface bridge1
cumulus@switch:~$ nv set interface bridge1_vlan10 ip address 12.1.10.2/24
cumulus@switch:~$ nv config apply

Edit the /etc/network/interfaces file, then run the ifreload -a command.

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto bridge
iface bridge
    bridge-ports swp1 swp2
    bridge-pvid 1
    bridge-vids 10 20
    bridge-vlan-aware yes

auto vlan10 iface vlan10 address 10.1.10.2/24 address 2001:db8::1/32 vlan-id 10 vlan-raw-device br_default

cumulus@switch:~$ ifreload -a
cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto bridge2_vlan10
iface bridge2_vlan10
    address 10.1.10.2/24
    hwaddress 1c:34:da:1d:e6:fd
    vlan-raw-device bridge2
    vlan-id 10

auto bridge1_vlan10 iface bridge1_vlan10 address 12.1.10.2/24 hwaddress 1c:34:da:1d:e6:fd vlan-raw-device bridge1 vlan-id 10

The first time you configure a switch, all southbound bridge ports are down; therefore, by default, the SVI is also down. You can force the SVI to always be up by disabling interface state tracking so that the SVI is always in the UP state, even if all member ports are down. Other implementations describe this feature as no autostate. This is beneficial if you want to perform connectivity testing.

To keep the SVI perpetually UP, create a dummy interface, then make the dummy interface a member of the bridge.

Example Configuration

Consider the following configuration, without a dummy interface in the bridge:

cumulus@switch:~$ sudo cat /etc/network/interfaces
...
auto br_default
iface br_default
    bridge-vlan-aware yes
    bridge-ports swp3
    bridge-vids 10
    bridge-pvid 1
...

With this configuration, when swp3 is down, the SVI is also down:

cumulus@switch:~$ ip link show swp3
5: swp3: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master br_default state DOWN mode DEFAULT group default qlen 1000
    link/ether 2c:60:0c:66:b1:7f brd ff:ff:ff:ff:ff:ff
cumulus@switch:~$ ip link show br_default
35: br_default: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 2c:60:0c:66:b1:7f brd ff:ff:ff:ff:ff:ff

Now add the dummy interface to your network configuration:

  1. Edit the /etc/network/interfaces file and add the dummy interface stanza before the bridge stanza:

    cumulus@switch:~$ sudo nano /etc/network/interfaces
    ...
    auto dummy
    iface dummy
        link-type dummy
    
    auto br_default
    iface br_default
    ...
    
  2. Add the dummy interface to the bridge-ports line in the bridge configuration:

    auto br_default
    iface br_default
        bridge-vlan-aware yes
        bridge-ports swp3 dummy
        bridge-vids 10
        bridge-pvid 1
    
  3. Save and exit the file, then reload the configuration:

    cumulus@switch:~$ sudo ifreload -a
    

    Now, even when swp3 is down, both the dummy interface and the bridge remain up:

    cumulus@switch:~$ ip link show swp3
    5: swp3: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master br_default state DOWN mode DEFAULT group default qlen 1000
        link/ether 2c:60:0c:66:b1:7f brd ff:ff:ff:ff:ff:ff
    cumulus@switch:~$ ip link show dummy
    37: dummy: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master br_default state UNKNOWN mode DEFAULT group default
        link/ether 66:dc:92:d4:f3:68 brd ff:ff:ff:ff:ff:ff
    cumulus@switch:~$ ip link show br_default
    35: br_default: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
        link/ether 2c:60:0c:66:b1:7f brd ff:ff:ff:ff:ff:ff
    

By default, Cumulus Linux automatically generates IPv6 link-local addresses on VLAN interfaces. If you want to use a different mechanism to assign link-local addresses, you can disable this feature. You can disable link-local automatic address generation for both regular IPv6 addresses and address-virtual (macvlan) addresses.

To disable automatic address generation for a regular IPv6 address on a VLAN, run the following command. The following example command disables automatic address generation for a regular IPv6 address on VLAN 10.

cumulus@switch:~$ net add vlan 10 ipv6-addrgen off
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
NVUE commands are not supported.

Edit the /etc/network/interfaces file to add the line ipv6-addrgen off to the VLAN stanza, then run the ifreload -a command.

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto vlan10
iface vlan 10
    ipv6-addrgen off
    vlan-id 10
    vlan-raw-device br_default
...
cumulus@switch:~$ ifreload -a

To reenable automatic link-local address generation for a VLAN:

cumulus@switch:~$ net del vlan 10 ipv6-addrgen off
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
NVUE commands are not supported.
Edit the /etc/network/interfaces file to remove the line ipv6-addrgen off from the VLAN stanza, then run the ifreload -a command.

Static MAC Address Entries

You can add a static MAC address entry to the layer 2 table for an interface within the VLAN-aware bridge by running a command similar to the following:

cumulus@switch:~$ sudo bridge fdb add 12:34:56:12:34:56 dev swp1 vlan 150 master static
cumulus@switch:~$ sudo bridge fdb show
44:38:39:00:00:7c dev swp1 master bridge permanent
12:34:56:12:34:56 dev swp1 vlan 150 master bridge static
44:38:39:00:00:7c dev swp1 self permanent
12:12:12:12:12:12 dev swp1 self permanent
12:34:12:34:12:34 dev swp1 self permanent
12:34:56:12:34:56 dev swp1 self permanent
12:34:12:34:12:34 dev bridge master bridge permanent
44:38:39:00:00:7c dev bridge vlan 500 master bridge permanent
12:12:12:12:12:12 dev bridge master bridge permanent

Example Configuration

The following example configuration contains a pruned access port and switch port; the ports only send and receive traffic tagged to and from a specific set of VLANs that the bridge-vids attribute specifies. The example also contains other switch ports that send and receive traffic from all the defined VLANs.

...
# ports swp3-swp48 are trunk ports which inherit vlans from the 'bridge'
# ie vlans 310,700,707,712,850,910
#
auto br_default
iface br_default
      bridge-ports swp1 swp2 swp3 ... swp51 swp52
      bridge-vids 310 700 707 712 850 910
      bridge-vlan-aware yes
auto swp1
iface swp1
      bridge-access 310
      mstpctl-bpduguard yes
      mstpctl-portadminedge yes
# The following is a trunk port that is "pruned".
# native vlan is 1, but only .1q tags of 707, 712, 850 are
# sent and received
#
auto swp2
iface swp2
      mstpctl-bpduguard yes
      mstpctl-portadminedge yes
      bridge-vids 707 712 850
# The following port is the trunk uplink and inherits all vlans
# from 'bridge'; bridge assurance is enabled using 'portnetwork' attribute
auto swp49
iface swp49
      mstpctl-portnetwork yes
      mstpctl-portpathcost 10
# The following port is the trunk uplink and inherits all vlans
# from 'br_default'; bridge assurance is enabled using 'portnetwork' attribute
 auto swp50
 iface swp50
      mstpctl-portnetwork yes
      mstpctl-portpathcost 0
...

Considerations

Spanning Tree Protocol (STP)

VLAN Translation

You cannot enable VLAN translation on a bridge in VLAN-aware mode. Only traditional mode bridges support VLAN translation.

Bridge Conversion

You cannot convert traditional mode bridges automatically to and from a VLAN-aware bridge. You must delete the original configuration and bring down all member switch ports before creating a new bridge.

Traditional Bridge Mode

For traditional Linux bridges, the kernel supports VLANs in the form of VLAN subinterfaces. Enabling bridging on multiple VLANs means configuring a bridge for each VLAN and, for each member port on a bridge, creating one or more VLAN subinterfaces out of that port. This mode can pose scalability challenges with configuration size as well as boot time and run time state management, when the number of ports times the number of VLANs becomes large.

Use VLAN-aware mode bridges instead of traditional mode bridges. Use traditional mode bridges if you need to use PVSTP+.

Configure a Traditional Mode Bridge

The following examples show how to create a simple traditional mode bridge configuration on the switch. The example uses some optional elements:

To configure spanning tree options for a bridge interface, refer to Spanning Tree and Rapid Spanning Tree - STP.

The following example commands configure a traditional mode bridge called my_bridge with IP address 10.10.10.10/24. swp1, swp2, swp3, and swp4 are members of the bridge.

cumulus@switch:~$ net add bridge my_bridge ports swp1-4
cumulus@switch:~$ net add bridge my_bridge ip address 10.10.10.10/24
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
NVUE commands are not supported.

Edit the /etc/network/interfaces file, then run the ifreload -a command.

...
auto swp1
iface swp1

auto swp2
iface swp2

auto swp3
iface swp3

auto swp4
iface swp4

auto my_bridge
iface my_bridge
    address 10.10.10.10/24
    bridge-ports swp1 swp2 swp3 swp4
    bridge-vlan-aware no
...
cumulus@switch:~$ sudo ifreload -a

  • Do not bridge the management port, eth0, with any switch ports (swp0, swp1, and so on). For example, if you create a bridge with eth0 and swp1, it does not work.
  • The name of the bridge must be compliant with Linux interface naming conventions and unique within the switch.

Configure Multiple Traditional Mode Bridges

You can configure multiple bridges to logically divide a switch into multiple layer 2 domains. This allows for hosts to communicate with other hosts in the same domain, while separating them from hosts in other domains.

The example below shows a multiple bridge configuration, where host-1 and host-2 connect to bridge-A, and host-3 and host-4 connect to bridge-B:

This example configuration looks like this in the /etc/network/interfaces file:

...
auto bridge-A
iface bridge-A
    bridge-ports swp1 swp2
    bridge-vlan-aware no

auto bridge-B
iface bridge-B
    bridge-ports swp3 swp4
    bridge-vlan-aware no
...

Trunks in Traditional Bridge Mode

The standard for trunking is 802.1Q. The 802.1Q specification adds a four byte header within the Ethernet frame that identifies the VLAN of which the frame is a member.

802.1Q also identifies an untagged frame as belonging to the native VLAN (most network devices default their native VLAN to 1). In Cumulus Linux:

A bridge in traditional mode has no concept of trunks, just tagged or untagged frames. With a trunk of 200 VLANs, there would need to be 199 bridges, each containing a tagged physical interface, and one bridge containing the native untagged VLAN.

The interaction of tagged and un-tagged frames on the same trunk often leads to undesired and unexpected behavior. A switch that uses VLAN 1 for the native VLAN can send frames to a switch that uses VLAN 2 for the native VLAN, merging those two VLANs and their spanning tree state.

To create the above example:

cumulus@switch:~$ net add bridge br-VLAN10 ports swp1.10,swp2.10
cumulus@switch:~$ net add bridge br-VLAN20 ports swp1.20,swp2.20
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
NVUE commands are not supported.

Add the following configuration to the /etc/network/interfaces file:

...
auto br-VLAN10
iface br-VLAN10
   bridge-ports swp1.10 swp2.10

auto br-VLAN20
iface br-VLAN20
   bridge-ports swp1.20 swp2.20
...

For more examples of VLAN tagging, see VLAN Tagging.

VLAN Tagging

This topic shows two examples of VLAN tagging, one basic and one more advanced. Both examples show streamlined interface configuration from ifupdown2.

Basic VLAN Tagging Example

The following simple VLAN tagging configuration shows two hosts that connect to a switch.

To configure the above example, edit the /etc/network/interfaces file and add a configuration like the following:

# Config for host1

auto swp1
iface swp1

auto swp1.100
iface swp1.100

# Config for host2
# swp2 must exist to create the .1Q subinterfaces, but it is not assigned an address

auto swp2
iface swp2

auto swp2.120
iface swp2.120

auto swp2.130
iface swp2.130

Advanced VLAN Tagging Example

The following advanced VLAN tagging configuration shows three hosts and two switches, with several bridges and a bond that connects them all.

The bridge member ports function as 802.1Q access ports and trunk ports. To compare Cumulus Linux with a traditional Cisco device:

To create the above configuration, edit the /etc/network/interfaces file and add a configuration like the following:

# Config for host1

# swp1 does not need an iface section unless it has a specific setting,
# it will be picked up as a dependent of swp1.100.
# And swp1 must exist in the system to create the .1q subinterfaces..
# but it is not applied to any bridge..or assigned an address.

auto swp1.100
iface swp1.100

# Config for host2
# swp2 does not need an iface section unless it has a specific setting,
# it will be picked up as a dependent of swp2.100 and swp2.120.
# And swp2 must exist in the system to create the .1q subinterfaces..
# but it is not applied to any bridge..or assigned an address.

auto swp2.100
iface swp2.100

auto swp2.120
iface swp2.120

# Config for host3
# swp3 does not need an iface section unless it has a specific setting,
# it will be picked up as a dependent of swp3.120 and swp3.130.
# And swp3 must exist in the system to create the .1q subinterfaces..
# but it is not applied to any bridge..or assigned an address.

auto swp3.120
iface swp3.120

auto swp3.130
iface swp3.130

# Configure the bond

auto bond2
iface bond2
  bond-slaves glob swp4-7

# configure the bridges

auto br-untagged
iface br-untagged
    address 10.0.0.1/24
    bridge-ports swp1 bond2
    bridge-stp on

auto br-tag100
iface br-tag100
    address 10.0.100.1/24
    bridge-ports swp1.100 swp2.100 bond2.100
    bridge-stp on

auto br-vlan120
iface br-vlan120
    address 10.0.120.1/24
    bridge-ports swp2.120 swp3.120 bond2.120
    bridge-stp on

auto v130
iface v130
    address 10.0.130.1/24
    bridge-ports swp3.130 bond2.130
    bridge-stp on

#

To verify the configuration:

cumulus@switch:~$ sudo mstpctl showbridge br-tag100
br-tag100 CIST info
  enabled         yes
  bridge id       8.000.44:38:39:00:32:8B
  designated root 8.000.44:38:39:00:32:8B
  regional root   8.000.44:38:39:00:32:8B
  root port       none
  path cost     0          internal path cost   0
  max age       20         bridge max age       20
  forward delay 15         bridge forward delay 15
  tx hold count 6          max hops             20
  hello time    2          ageing time          300
  force protocol version     rstp
  time since topology change 333040s
  topology change count      1
  topology change            no
  topology change port       swp2.100
  last topology change port  None
cumulus@switch:~$ sudo mstpctl showportdetail br-tag100  | grep -B 2 state
br-tag100:bond2.100 CIST info
  enabled            yes                     role                 Designated
  port id            8.003                   state                forwarding
--
br-tag100:swp1.100 CIST info
  enabled            yes                     role                 Designated
  port id            8.001                   state                forwarding
--
  br-tag100:swp2.100 CIST info
  enabled            yes                     role                 Designated
  port id            8.002                   state                forwarding
cumulus@switch:~$ cat /proc/net/vlan/config
VLAN Dev name    | VLAN ID
Name-Type: VLAN_NAME_TYPE_RAW_PLUS_VID_NO_PAD
bond2.100      | 100  | bond2
bond2.120      | 120  | bond2
bond2.130      | 130  | bond2
swp1.100       | 100  | swp1
swp2.100       | 100  | swp2
swp2.120       | 120  | swp2
swp3.120       | 120  | swp3
swp3.130       | 130  | swp3
cumulus@switch:~$ cat /proc/net/bonding/bond2
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
    Aggregator ID: 3
    Number of ports: 4
    Actor Key: 33
    Partner Key: 33
    Partner Mac Address: 44:38:39:00:32:cf

Slave Interface: swp4
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 44:38:39:00:32:8e
Aggregator ID: 3
Slave queue ID: 0

Slave Interface: swp5
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 44:38:39:00:32:8f
Aggregator ID: 3
Slave queue ID: 0

Slave Interface: swp6
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 44:38:39:00:32:90
Aggregator ID: 3
Slave queue ID: 0

Slave Interface: swp7
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 44:38:39:00:32:91
Aggregator ID: 3
Slave queue ID: 0

A single bridge cannot contain multiple subinterfaces of the same port. If you try to apply this configuration, you see an error:

cumulus@switch:~$ sudo brctl addbr another_bridge
cumulus@switch:~$ sudo brctl addif another_bridge swp9 swp9.100
bridge cannot contain multiple subinterfaces of the same port: swp9, swp9.100

VLAN Translation

By default, Cumulus Linux does not allow VLAN subinterfaces associated with different VLAN IDs to be part of the same bridge. Base interfaces do not associate with any VLAN IDs and are exempt from this restriction.

In some cases, it is useful to relax this restriction. For example, when two servers connect to the switch using VLAN trunks, but the VLAN numbering on the two servers is not consistent. You can bridge two VLAN subinterfaces of different VLAN IDs from the servers by enabling the sysctl net.bridge.bridge-allow-multiple-vlans option. Packets that enter a bridge from a member VLAN subinterface egress another member VLAN subinterface with the VLAN ID translated.

You cannot enable VLAN translation on a bridge in VLAN-aware mode; only bridges in traditional mode can use VLAN translation.

The following example enables the VLAN translation sysctl:

cumulus@switch:~$ echo net.bridge.bridge-allow-multiple-vlans = 1 | sudo tee /etc/sysctl.d/multiple_vlans.conf
net.bridge.bridge-allow-multiple-vlans = 1
cumulus@switch:~$ sudo sysctl -p /etc/sysctl.d/multiple_vlans.conf
net.bridge.bridge-allow-multiple-vlans = 1

After you enable sysctl, you can add ports with different VLAN IDs to the same bridge. In the following example, the switch bridges packets that enter the bridge br-mix from swp10.100 to swp11.200. Cumulus Linux translates the VLAN ID from 100 to 200:

cumulus@switch:~$ sudo brctl addif br_mix swp10.100 swp11.200

cumulus@switch:~$ sudo brctl show br_mix
bridge name     bridge id               STP enabled     interfaces
br_mix          8000.4438390032bd       yes             swp10.100
                                                        swp11.200

Spanning Tree and Rapid Spanning Tree - STP

Spanning tree protocol (STP) identifies links in the network and shuts down redundant links, preventing possible network loops and broadcast radiation on a bridged network. STP also provides redundant links for automatic failover when an active link fails. Cumulus Linux enables STP by default for both VLAN-aware and traditional bridges.

Cumulus Linux supports RSTP, PVST, and PVRST modes:

STP for a Traditional Mode Bridge

Per VLAN Spanning Tree (PVST) creates a spanning tree instance for a bridge. Rapid PVST (PVRST) supports RSTP enhancements for each spanning tree instance. To use PVRST with a traditional bridge, you must create a bridge corresponding to the untagged native VLAN and all the physical switch ports must be part of the same VLAN.

For maximum interoperability, when connected to a switch that has a native VLAN configuration, you must configure the native VLAN to VLAN 1.

STP for a VLAN-aware Bridge

VLAN-aware bridges operate in RSTP mode only. RSTP on VLAN-aware bridges works with other modes in the following ways:

RSTP and STP

If a bridge running RSTP (802.1w) receives a common STP (802.1D) BPDU, it falls back to 802.1D automatically.

RSTP and PVST

The RSTP domain sends BPDUs on the native VLAN, whereas PVST sends BPDUs on a per VLAN basis. For both protocols to work together, you need to enable the native VLAN on the link between the RSTP to PVST domain; the spanning tree builds according to the native VLAN parameters.

The RSTP protocol does not send or parse BPDUs on other VLANs, but floods BPDUs across the network, enabling the PVST domain to maintain its spanning-tree topology and provide a loop-free network.

RSTP and MST

RSTP works with MST seamlessly, creating a single instance of spanning tree that transmits BPDUs on the native VLAN.

RSTP treats the MST domain as one giant switch, whereas MST treats the RSTP domain as a different region. To ensure proper communication between the regions, MST creates a Common Spanning Tree (CST) that connects all the boundary switches and forms the overall view of the MST domain. Because changes in the CST must reflect in all regions, the RSTP tree exists is in the CST to ensure that changes on the RSTP domain are in the CST domain. Topology changes on the RSTP domain impact the rest of the network but inform the MST domain of every change occurring in the RSTP domain, ensuring a loop-free network.

Configure the root bridge within the MST domain by changing the priority on the relevant MST switch. When MST detects an RSTP link, it falls back into RSTP mode. The MST domain chooses the switch with the lowest cost to the CST root bridge as the CIST root bridge.

RSTP with MLAG

More than one spanning tree instance enables switches to load balance and use different links for different VLANs. With RSTP, there is only one instance of spanning tree. To better utilize the links, you can configure MLAG on the switches connected to the MST or PVST domain and set up these interfaces as an MLAG port. The PVST or MST domain thinks it connects to a single switch and utilizes all the links connected to it. Load balancing depends on the port channel hashing mechanism instead of different spanning tree instances and uses all the links between the RSTP to the PVST or MST domains. For information about configuring MLAG, see Multi-Chassis Link Aggregation - MLAG.

Configure STP

There several ways to customize STP in Cumulus Linux. Exercise caution when changing the settings below to prevent malfunctions in STP loop avoidance.

Spanning Tree Priority

If you have a multiple spanning tree instance (MSTI 0, also known as a common spanning tree, or CST), you can set the tree priority for a bridge. The bridge with the lowest priority is the root bridge. The priority must be a number between 0 and 61440, and must be a multiple of 4096. The default is 32768.

The following example command sets the tree priority to 8192:

cumulus@switch:~$ net add bridge stp treeprio 8192
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set bridge domain br_default stp priority 8192
cumulus@switch:~$ nv config apply

Configure the tree priority (mstpctl-treeprio) under the bridge stanza in the /etc/network/interfaces file, then run the ifreload -a command.

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto bridge
iface bridge
    # bridge-ports includes all ports related to VxLAN and CLAG.
    # does not include the Peerlink.4094 subinterface
    bridge-ports bond01 bond02 peerlink vni13 vni24 vxlan4001
    bridge-pvid 1
    bridge-vids 13 24
    bridge-vlan-aware yes
    mstpctl-treeprio 8192
...
cumulus@switch:~$ ifreload -a

Cumulus Linux supports MSTI 0 only. It does not support MSTI 1 through 15.

PortAdminEdge (PortFast Mode)

PortAdminEdge is equivalent to the PortFast feature offered by other vendors. It enables or disables the initial edge state of a port in a bridge. All ports with PortAdminEdge bypass the listening and learning states and go straight to forwarding.

PortAdminEdge mode causes loops if you do not use it with BPDU guard.

You typically configure edge ports as access ports for a simple end host. In the data center, edge ports connect to servers, which pass both tagged and untagged traffic.

The following example commands configure PortAdminEdge and BPDU guard for swp5:

cumulus@switch:~$ net add interface swp5 stp bpduguard
cumulus@switch:~$ net add interface swp5 stp portadminedge
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set interface swp5 bridge domain br_default stp admin-edge on
cumulus@switch:~$ nv set interface swp5 bridge domain br_default stp bpdu-guard on
cumulus@switch:~$ nv config apply

Configure PortAdminEdge and BPDU guard under the switch port interface stanza in the /etc/network/interfaces file, then run the ifreload -a command.

cumulus@switch:~$ sudo nano /etc/netowrk/interfaces
...
auto swp5
iface swp5
    mstpctl-bpduguard yes
    mstpctl-portadminedge yes
...
cumulus@switch:~$ sudo ifreload -a

Runtime Configuration (Advanced)

A runtime configuration is non-persistent, which means the configuration you create here does not persist after you reboot the switch.

To configure PortAdminEdge and BPDU guard at runtime, run the following commands:

cumulus@switch:~$ sudo mstpctl setportadminedge br2 swp1 yes
cumulus@switch:~$ sudo mstpctl setbpduguard br2 swp1 yes

PortAutoEdge

PortAutoEdge is an enhancement to the standard PortAdminEdge (PortFast) mode, which allows for the automatic detection of edge ports. PortAutoEdge enables and disables the auto transition to and from the edge state of a port in a bridge.

Edge ports and access ports are not the same. Edge ports transition directly to the forwarding state and skip the listening and learning stages. Upstream topology change notifications are not generated when an edge port link changes state. Access ports only forward untagged traffic; however, there is no such restriction on edge ports, which can forward both tagged and untagged traffic.

When a port with PortAutoEdge receives a BPDU, the port stops being in the edge port state and transitions into a normal STP port. When the interface no longer receives BPDUs, the port becomes an edge port, and transitions through the discarding and learning states before it resumes forwarding.

Cumulus Linux enables PortAutoEdge by default.

The following example commands disable PortAutoEdge on swp1:

cumulus@switch:~$ net add interface swp1 stp portautoedge no
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set interface swp1 bridge domain br_default stp auto-edge off
cumulus@switch:~$ nv config apply

Edit the switch port interface stanza in the /etc/network/interfaces file to add the mstpctl-portautoedge no line, then run the ifreload -a command.

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto swp1
iface swp1
    alias to Server01
    # Port to Server02
    mstpctl-portautoedge no
...
cumulus@switch:~$ sudo ifreload -a

The following example commands reenable PortAutoEdge on swp1:

cumulus@switch:~$ net del interface swp1 stp portautoedge no
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set interface swp1 bridge domain br_default stp auto-edge on
cumulus@switch:~$ nv config apply
Edit the switch port interface stanza in the /etc/network/interfaces file to remove mstpctl-portautoedge no, then run the ifreload -a command.

BPDU Guard

You can configure BPDU guard to protect the spanning tree topology from unauthorized switches affecting the forwarding path. For example, if you add a new switch to an access port off a leaf switch and this new switch has a low priority, it can become the new root switch and affect the forwarding path for the entire layer 2 topology.

The following example commands set BPDU guard for swp5:

cumulus@switch:~$ net add interface swp5 stp bpduguard
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set interface swp5 bridge domain br_default stp bpdu-guard on
cumulus@switch:~$ nv config apply

Edit the switch port interface stanza in the /etc/network/interfaces file to add the mstpctl-bpduguard yes line, then run the ifreload -a command.

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto swp5
iface swp5
    mstpctl-bpduguard yes
...
cumulus@switch:~$ sudo ifreload -a

If a port receives a BPDU, STP brings down the port and logs an error in /var/log/syslog. The following is a sample error:

mstpd: error, MSTP_IN_rx_bpdu: bridge:bond0 Recvd BPDU on BPDU Guard Port - Port Down

To see if a port has BPDU guard on or if the port receives a BPDU:

cumulus@switch:~$ net show bridge spanning-tree | grep bpdu
  bpdu guard port    yes                bpdu guard error     yes
cumulus@switch:~$ nv show bridge domain br_default stp
cumulus@switch:~$ mstpctl showportdetail bridge bond0
bridge:bond0 CIST info
  enabled            no                      role                 Disabled
  port id            8.001                   state                discarding
  external port cost 305                     admin external cost  0
  internal port cost 305                     admin internal cost  0
  designated root    8.000.6C:64:1A:00:4F:9C dsgn external cost   0
  dsgn regional root 8.000.6C:64:1A:00:4F:9C dsgn internal cost   0
  designated bridge  8.000.6C:64:1A:00:4F:9C designated port      8.001
  admin edge port    no                      auto edge port       yes
  oper edge port     no                      topology change ack  no
  point-to-point     yes                     admin point-to-point auto
  restricted role    no                      restricted TCN       no
  port hello time    10                      disputed             no
  bpdu guard port    yes                      bpdu guard error     yes
  network port       no                      BA inconsistent      no
  Num TX BPDU        3                       Num TX TCN           2
  Num RX BPDU        488                     Num RX TCN           2
  Num Transition FWD 1                       Num Transition BLK   2
  bpdufilter port    no
  clag ISL           no                      clag ISL Oper UP     no
  clag role          unknown                 clag dual conn mac   0:0:0:0:0:0
  clag remote portID F.FFF                   clag system mac      0:0:0:0:0:0

The only way to recover a port that is in the disabled state is to manually bring up the port with the sudo ifup <interface> command. See Interface Configuration and Management for more information about ifupdown.

Bringing up the disabled port does not correct the problem if the configuration on the connected end-station does not resolve.

Bridge Assurance

On a point-to-point link where RSTP is running, if you want to detect unidirectional links and put the port in a discarding state, you can enable bridge assurance on the port by enabling a port type network. The port is then in a bridge assurance inconsistent state until it receives a BPDU from the peer. You need to configure the port type network on both ends of the link for bridge assurance.

Cumulus Linux disables bridge assurance by default.

The following example commands enable bridge assurance on swp1:

cumulus@switch:~$ net add interface swp1 stp portnetwork
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set interface swp5 bridge domain br_default stp network on
cumulus@switch:~$ nv config apply

Edit the switch port interface stanza in the /etc/network/interfaces file to add the mstpctl-portnetwork yes line, then run the ifreload -a command.

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto swp5
iface swp5
    mstpctl-portnetwork yes
...
cumulus@switch:~$ sudo ifreload -a

Runtime Configuration (Advanced)

A runtime configuration is non-persistent, which means the configuration you create here does not persist after you reboot the switch.

To enable bridge assurance at runtime, run mstpctl:

cumulus@switch:~$ sudo mstpctl setportnetwork br1007 swp1.1007 yes

cumulus@switch:~$ sudo mstpctl showportdetail br1007 swp1.1007 | grep network
  network port       yes                     BA inconsistent      yes

To monitor logs for bridge assurance messages, run the following command:

cumulus@switch:~$ sudo grep -in assurance /var/log/syslog | grep mstp
  1365:Jun 25 18:03:17 mstpd: br1007:swp1.1007 Bridge assurance inconsistent

BPDU Filter

You can enable bpdufilter on a switch port, which filters BPDUs in both directions. This disables STP on the port as no BPDUs are transiting.

Using BDPU filter sometimes causes layer 2 loops. Use this feature with caution.

The following example commands configure the BPDU filter on swp6:

cumulus@switch:~$ net add interface swp6 stp portbpdufilter
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set interface swp6 bridge domain br_default stp bpdu-filter on
cumulus@switch:~$ nv config apply

Edit the switch port interface stanza in the /etc/network/interfaces file to add the mstpctl-portbpdufilter yes line, then run the ifreload -a command.

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto swp6
iface swp6
    mstpctl-portbpdufilter yes
...
cumulus@switch:~$ sudo ifreload -a

Runtime Configuration (Advanced)

A runtime configuration is non-persistent, which means the configuration you create here does not persist after you reboot the switch.

To enable BPDU filter at runtime, run mstpctl. For example:

cumulus@switch:~$ sudo mstpctl setportbpdufilter br100 swp1.100=yes swp2.100=yes

Root Role

To enable the interface in the bridge to take the root role:

cumulus@switch:~$ nv set interface swp1 bridge domain br_default stp restrrole on
cumulus@switch:~$ nv config apply

Edit the switch port interface stanza in the /etc/network/interfaces file to add the mstpctl-portrestrrole yes line, then run the ifreload -a command.

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto swp6
iface swp6
    mstpctl-portrestrrole yes
...
cumulus@switch:~$ sudo ifreload -a

Additional STP Parameters

The table below describes additional STP configuration parameters available in Cumulus Linux. You can set these optional parameters manually by editing the /etc/network/interfaces file. NVUE commands are not supported.

The IEEE 802.1D and 802.1Q specifications describe STP parameters. For a comparison of STP parameter configuration between mstpctl and other vendors, read this knowledge base article.

ParameterDescription
mstpctl-maxageSets the maximum age of the bridge in seconds. The default is 20. The maximum age must meet the condition 2 * (Bridge Forward Delay - 1 second) >= Bridge Max Age.
Add this parameter to the bridge stanza of the /etc/network/interfaces file.
mstpctl-ageingSets the MAC address ageing time for the bridge in seconds when the running version is STP, but not RSTP or MSTP. The default is 1800.
Add this parameter to the bridge stanza of the /etc/network/interfaces file.
mstpctl-fdelaySets the bridge forward delay time in seconds. The default value is 15. The bridge forward delay must meet the condition 2 * (Bridge Forward Delay - 1 second) >= Bridge Max Age.
Add this parameter to the bridge stanza of the /etc/network/interfaces file.
mstpctl-maxhopsSets the maximum hops for the bridge. The default is 20.
Add this parameter to the bridge stanza of the /etc/network/interfaces file.
mstpctl-txholdcountSets the bridge transmit hold count. The default value is 6 seconds.
Add this parameter to the bridge stanza of the /etc/network/interfaces file.
mstpctl-forceversSets the force STP version of the bridge to either RSTP/STP. The default is RSTP.
Add this parameter to the bridge stanza of the /etc/network/interfaces file.
mstpctl-helloSets the bridge hello time in seconds. The default is 2.
Add this parameter to the bridge stanza of the /etc/network/interfaces file.
mstpctl-portpathcostSets the port cost of the interface in the bridge. The default is 0.
mstpd supports only long mode; 32 bits for the path cost.
Add this parameter to the interface stanza of the /etc/network/interfaces file.
mstpctl-portp2pEnables or disables point-to-point detection mode of the interface in the bridge.
Add this parameter to the interface stanza of the /etc/network/interfaces file.
mstpctl-portrestrtcnEnables or disables the interface in the bridge to propagate received topology change notifications. The default is no.
Add this parameter to the interface stanza of the /etc/network/interfaces file.
mstpctl-treeportcostSets the spanning tree port cost to a value from 0 to 255. The default is 0.
Add this parameter to the interface stanza of the /etc/network/interfaces file.

Be sure to run the sudo ifreload -a command after you set the STP parameter in the /etc/network/interfaces file.

Troubleshooting

To check STP status for a bridge:

cumulus@switch:~$ net show bridge spanning-tree
Bridge info
  enabled         yes
  bridge id       8.000.44:38:39:FF:40:94
    Priority:     32768
    Address:      44:38:39:FF:40:94
  This bridge is root.

  designated root 8.000.44:38:39:FF:40:94
    Priority:     32768
    Address:      44:38:39:FF:40:94

  root port       none
  path cost     0          internal path cost   0
  max age       20         bridge max age       20
  forward delay 15         bridge forward delay 15
  tx hold count 6          max hops             20
  hello time    2          ageing time          300
  force protocol version     rstp

INTERFACE  STATE  ROLE  EDGE
---------  -----  ----  ----
peerlink   forw   Desg  Yes
vni13      forw   Desg  Yes
vni24      forw   Desg  Yes
vxlan4001  forw   Desg  Yes
cumulus@switch:~$ nv show bridge domain br_default stp

The mstpctl utility provided by the mstpd service configures STP. The mstpd daemon is an open source project used by Cumulus Linux to implement IEEE802.1D 2004 and IEEE802.1Q 2011.

The mstpd daemon starts by default when the switch boots and logs errors to /var/log/syslog.

mstpd is the preferred utility for interacting with STP on Cumulus Linux. brctl also provides certain tools for configuring STP; however, they are not as complete and output from brctl is sometimes misleading.

To show the bridge state, run the brctl show command:

cumulus@switch:~$ sudo brctl show
  bridge name     bridge id               STP enabled     interfaces
  bridge          8000.001401010100       yes             swp1
                                                          swp4
                                                          swp5

To show the mstpd bridge port state, run the mstpctl showport bridge command:

cumulus@switch:~$ sudo mstpctl showport bridge
  E swp1 8.001 forw F.000.00:14:01:01:01:00 F.000.00:14:01:01:01:00 8.001 Desg
    swp4 8.002 forw F.000.00:14:01:01:01:00 F.000.00:14:01:01:01:00 8.002 Desg
  E swp5 8.003 forw F.000.00:14:01:01:01:00 F.000.00:14:01:01:01:00 8.003 Desg

Storm Control

Storm control provides protection against excessive inbound BUM (broadcast, unknown unicast, multicast) traffic on layer 2 switch port interfaces, which can cause poor network performance.

Configure Storm Control

To configure storm control for physical ports, edit the /etc/cumulus/switchd.conf file. For example, to enable broadcast storm control for swp1 at 400 packets per second (pps), multicast storm control at 3000 pps, and unknown unicast at 500 pps, edit the /etc/cumulus/switchd.conf file and uncomment the storm_control.broadcast, storm_control.multicast, and storm_control.unknown_unicast lines:

cumulus@switch:~$ sudo nano /etc/cumulus/switchd.conf
...
# Storm Control setting on a port, in pps, 0 means disable
interface.swp1.storm_control.broadcast = 400
interface.swp1.storm_control.multicast = 3000
interface.swp1.storm_control.unknown_unicast = 500
...

When you update the /etc/cumulus/switchd.conf file, you must restart switchd for the changes to take effect.

cumulus@switch:~$ sudo systemctl restart switchd.service

Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.

You can also run the following commands. The configuration below takes effect, but does not persist if you reboot the switch. For a persistent configuration, edit the /etc/cumulus/switchd.conf file, as described above.

cumulus@switch:~$ sudo sh -c 'echo 400 > /cumulus/switchd/config/interface/swp1/storm_control/broadcast'
cumulus@switch:~$ sudo sh -c 'echo 3000 > /cumulus/switchd/config/interface/swp1/storm_control/multicast'
cumulus@switch:~$ sudo sh -c 'echo 500 > /cumulus/switchd/config/interface/swp1/storm_control/unknown_unicast'

Bonding - Link Aggregation

Linux bonding provides a method for aggregating multiple network interfaces (slaves) into a single logical bonded interface (bond). Link aggregation is useful for linear scaling of bandwidth, load balancing, and failover protection.

Cumulus Linux supports two bonding modes:

Cumulus Linux uses version 1 of the LAG control protocol (LACP).

To temporarily bring up a bond even when there is no LACP partner, use LACP Bypass.

Hash Distribution

The switch distributes egress traffic through a bond to a slave based on a packet hash calculation, providing load balancing over the slaves; the switch distributes conversation flows over all available slaves to load balance the total traffic. Traffic for a single conversation flow always hashes to the same slave.

The hash calculation uses packet header data to choose to which slave to transmit the packet:

In a failover event, the switch adjusts the hash calculation to steer traffic over available slaves.

LAG Custom Hashing

You can configure the fields you want to use in the LAG hash calculation. For example, if you do not want to use source or destination port numbers in the hash calculation, you can disable the source port and destination port fields.

You can configure the following fields:

To configure custom hash, edit the /etc/cumulus/datapath/traffic.conf file:

  1. To enable custom hashing, uncomment the lag_hash_config.enable = true line.
  2. To enable a field, set the field to true. To disable a field, set the field to false.
  3. Run the echo 1 > /cumulus/switchd/ctrl/hash_config_reload command. This command does not cause any traffic interruptions.

The following shows an example /etc/cumulus/datapath/traffic.conf file:

cumulus@switch:~$ sudo nano /etc/cumulus/datapath/traffic.conf
...
#LAG HASH config
#HASH config for LACP to enable custom fields
#Fields will be applicable for LAG hash
#calculation
#Uncomment to enable custom fields configured below
lag_hash_config.enable = true

lag_hash_config.smac = true
lag_hash_config.dmac = true
lag_hash_config.sip  = true
lag_hash_config.dip  = true
lag_hash_config.ether_type = true
lag_hash_config.vlan_id = true
lag_hash_config.sport = false
lag_hash_config.dport = false
lag_hash_config.ip_prot = true
...

Cumulus Linux enables symmetric hashing by default. Make sure that the settings for the source IP (lag_hash_config.sip) and destination IP (lag_hash_config.dip) fields match, and that the settings for the source port (lag_hash_config.sport) and destination port (lag_hash_config.dport) fields match; otherwise symmetric hashing disables automatically. You can disable symmetric hashing manually in the /etc/cumulus/datapath/traffic.conf file by setting symmetric_hash_enable = FALSE.

You can set a unique hash seed for each switch to avoid hash polarization. See Configure a Hash Seed to Avoid Hash Polarization.

Create a Bond

In the example below, the front panel port interfaces swp1 thru swp4 are slaves in bond0, while swp5 and swp6 are not part of bond0.

To create and configure a bond:

cumulus@switch:~$ net add bond bond0 bond slaves swp1-4
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set interface bond0 bond member swp1-4
cumulus@switch:~$ nv config apply

Edit the /etc/network/interfaces file to add a stanza for the bond, then run the ifreload -a command.

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto bond0
iface bond0
    bond-slaves swp1 swp2 swp3 swp4
...
cumulus@switch:~$ ifreload -a

  • By default, the bond uses IEEE 802.3ad link aggregation mode. To configure the bond in balance-xor mode, see Configuration Parameters below.
  • If the bond is not going to be part of a bridge, you must specify an IP address.
  • Make sure the name of the bond adheres to Linux interface naming conventions and is unique within the switch.
  • Cumulus Linux does not support bond members at 200G or greater.

When you start networking, the switch creates bond0 as MASTER and interfaces swp1 thru swp4 come up in SLAVE mode:

cumulus@switch:~$ ip link show
...

3: swp1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT qlen 500
    link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
4: swp2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT qlen 500
    link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
5: swp3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT qlen 500
    link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
6: swp4: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT qlen 500
    link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
...

55: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT
    link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff

All slave interfaces within a bond have the same MAC address as the bond. Typically, the first slave you add to the bond donates its MAC address as the bond MAC address, whereas the MAC addresses of the other slaves are the bond MAC address. The bond MAC address is the source MAC address for all traffic leaving the bond and provides a single destination MAC address to address traffic to the bond.

Removing a bond slave interface from which a bond derives its MAC address affects traffic when the bond interface flaps to update the MAC address.

Configure Bond Options

The table below describes the configuration options for a bond. To configure a bond, run the commands shown or add the parameter to the bond stanza in the /etc/network/interfaces file.

The following example sets the bond mode for bond01 to balance-xor (static):

cumulus@switch:~$ net add bond bond1 bond mode balance-xor
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set interface bond01 bond mode static 
cumulus@switch:~$ nv config apply

Edit the /etc/network/interfaces file to add the balance-xor parameter to the bond stanza, then run the ifreload -a command:

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto bond1
iface bond1
    bond-mode balance-xor
    bond-slaves swp1 swp2 swp3 swp4
...
cumulus@switch:~$ ifreload -a

Each bond configuration option, except for bond slaves, has the recommended value by default in Cumulus Linux. Only configure an option if you need a different setting. For more information on configuration values, refer to the Related Information section below.

ParameterDescription
bond-mode 802.3ad

bond-mode balance-xor
Cumulus Linux supports IEEE 802.3ad link aggregation mode (802.3ad) and balance-xor mode.
The default mode is 802.3ad.

Note: When you enable balance-xor mode, the bonding of slave interfaces are static and all slave interfaces are active for load balancing and fault tolerance. The switch bases packet transmission on the bond on the hash policy in xmit-hash-policy.

When you use balance-xor mode to dual-connect host-facing bonds in an MLAG environment, you must configure the clag-id parameter with the same value on both MLAG switches. Otherwise, the MLAG switch pair treats the bonds as single-connected.

Use balance-xor mode only if you cannot use LACP; LACP can detect mismatched link attributes between bond members and can even detect misconnections.

NCLU command: net add bond <bond-name> bond mode balance-xor
NVUE command: nv set interface <bond-name> bond mode static
bond miimon <value>Defines how often to inspect the link state of each slave for failures. You can specify a value between 0 and 255. The default value is 100.
bond-use-carrier noDetermines the link state.
bond-lacp-bypass-allowEnables LACP bypass.

NCLU command: net add bond <bond-name> bond lacp-bypass-allow
NVUE command: nv set interface <bond-name> bond lacp-bypass on
bond-lacp-rate slowSets the rate to ask the link partner to transmit LACP control packets. slow is the only option.
bond-min-linksDefines the minimum number of links (between 0 and 255) that must be active before the bond goes into service. The default value is 1.

Use a value greater than 1 if you need higher level services to ensure a minimum aggregate bandwidth level before activating a bond. Keeping the bond-min-links value at 1 indicates the bond must have at least one active member. If the number of active members drops below the bond-min-links setting, the bond appears to upper-level protocols as link-down. When the number of active links returns to greater than or equal to bond-min-links, the bond becomes link-up.

Show Bond Information

To show information for a bond:

Run the net show interface <bond> command:

cumulus@switch:~$ net show interface bond1
    Name    MAC                Speed    MTU    Mode
--  ------  -----------------  -------  -----  ------
UP  bond1   00:02:00:00:00:12  20G      1500   Bond


Bond Details
---------------  -------------
Bond Mode:       Balance-XOR
Load Balancing:  Layer3+4
Minimum Links:   1
In CLAG:         CLAG Inactive


    Port     Speed      TX    RX    Err    Link Failures
--  -------  -------  ----  ----  -----  ---------------
UP  swp3(P)  10G         0     0      0                0
UP  swp4(P)  10G         0     0      0                0


LLDP
-------  ----  ------------
swp3(P)  ====  swp1(p1c1h1)
swp4(P)  ====  swp2(p1c1h1)Routing
-------
  Interface bond1 is up, line protocol is up
  Link ups:       3    last: 2017/04/26 21:00:38.26
  Link downs:     2    last: 2017/04/26 20:59:56.78
  PTM status: disabled
  vrf: Default-IP-Routing-Table
  index 31 metric 0 mtu 1500
  flags: <UP,BROADCAST,RUNNING,MULTICAST>
  Type: Ethernet
  HWaddr: 00:02:00:00:00:12
  inet6 fe80::202:ff:fe00:12/64
  Interface Type Other

Run the sudo cat /proc/net/bonding/<bond> command:

cumulus@switch:~$ sudo cat /proc/net/bonding/bond01

Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: load balancing (xor)
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0


Slave Interface: swp1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 44:38:39:00:00:03
Slave queue ID: 0

The detailed output in /proc/net/bonding/<filename> includes the actor/partner LACP information. This information is not necessary and requires you to use sudo to view the file.

Considerations

Multi-Chassis Link Aggregation - MLAG

MLAG or CLAG: Other vendors refer to the Cumulus Linux implementation of MLAG as CLAG, MC-LAG or VPC. You even see references to CLAG in Cumulus Linux, including the management daemon, named clagd, and other options in the code, such as clag-id, which exist for historical purposes. The Cumulus Linux implementation is truly a multi-chassis link aggregation protocol so this document uses MLAG.

Multi-Chassis Link Aggregation (MLAG) enables a server or switch with a two-port bond, such as a link aggregation group (LAG), EtherChannel, port group or trunk, to connect those ports to different switches and operate as if they connect to a single, logical switch. This provides greater redundancy and greater system throughput.

Dual-connected devices can create LACP bonds that contain links to each physical switch; Cumulus Linux supports active-active links from the dual-connected devices even though they connect to two different physical switches.

How Does MLAG Work?

A basic MLAG configuration looks like this:


  • The two switches, leaf01 and leaf02, known as peer switches, appear as a single device to the bond on server01.
  • server01 distributes traffic between the two links to leaf01 and leaf02 in the way you configure on the host.
  • Traffic inbound to server01 can traverse leaf01 or leaf02 and arrive at server01.

More elaborate configurations are also possible. The number of links between the host and the switches can be greater than two and does not have to be symmetrical. Also, because the two peer switches appear as a single switch to other bonding devices, you can also connect pairs of MLAG switches to each other in a switch-to-switch MLAG configuration:



  • leaf01 and leaf02 are also MLAG peer switches and present a two-port bond from a single logical system to spine01 and spine02.

Link Aggregation Control Protocol (LACP), the IEEE standard protocol for managing bonds, verifies dual-connectedness. LACP runs on the dual-connected devices and on each of the MLAG peer switches. On a dual-connected device, the only configuration requirement is to create a bond that LACP manages.

On each of the peer switches, you must place the links that connect to the dual-connected host or switch in the bond. This is true even if the links are a single port on each peer switch, where each port is in a bond, as shown below:

The dual-connected bonds on the peer switches have their system ID set to the MLAG system ID. Therefore, from the point of view of the hosts, each of the links in its bond connects to the same system and so the host uses both links.

Each peer switch periodically makes a list of the LACP partner MAC addresses for its bonds and sends that list to its peer (using the clagd service). The LACP partner MAC address is the MAC address of the system at the other end of a bond (server01, server02, and server03 in the figure above). When a switch receives this list from its peer, it compares the list to the LACP partner MAC addresses on its switch. If there are any matches and the clag-id for those bonds match, then that bond is a dual-connected bond. You can find the LACP partner MAC address by the running net show bridge macs command.

Requirements

MLAG has these requirements:

  • MLAG is not supported in a multiple VLAN-aware bridge configuration.
  • Both MLAG peers must use the same VXLAN device type (single or traditional).

Basic Configuration

To configure MLAG, you need to create a bond that uses LACP on the dual-connected devices and configure the interfaces (including bonds, VLANs, bridges, and peer links) on each peer switch.

Follow these steps on each peer switch in the MLAG pair:

  1. On the dual-connected device, such as a host or server that sends traffic to and from the switch, create a bond that uses LACP. The method you use varies with the type of device you are configuring.

    If you cannot use LACP in your environment, you can configure the bonds in balance-xor mode.

  2. Place every interface that connects to the MLAG pair from a dual-connected device into a bond, even if the bond contains only a single link on a single physical switch.

    The following examples place swp1 in bond1 and swp2 in bond2.

    The example also adds a description for the bonds (an alias), which is optional.

    cumulus@leaf01:~$ net add bond bond1 bond slaves swp1
    cumulus@leaf01:~$ net add bond bond1 alias bond1 on swp1
    cumulus@leaf01:~$ net add bond bond2 bond slaves swp2
    cumulus@leaf01:~$ net add bond bond2 alias bond2 on swp2
    cumulus@leaf01:~$ net pending
    cumulus@leaf01:~$ net commit
    
    cumulus@leaf01:~$ nv set interface bond1 bond member swp1
    cumulus@leaf01:~$ nv set interface bond2 bond member swp2
    cumulus@leaf01:~$ nv config apply
    

    Add the following lines to the /etc/network/interfaces file. The example also adds a description for the bonds (an alias), which is optional.

    cumulus@leaf01:~$ sudo nano /etc/network/interfaces
    ...
    auto bond1
    iface bond1
        alias bond1 on swp1
        bond-slaves swp1
    ...
    
    auto bond2
    iface bond2
        alias bond2 on swp2
        bond-slaves swp2
    ...
    
  3. Add a unique MLAG ID (clag-id) to each bond.

    You must specify a unique MLAG ID (clag-id) for every dual-connected bond on each peer switch so that switches know which links dual-connect or connect to the same host or switch. The value must be between 1 and 65535 and must be the same on both peer switches. A value of 0 disables MLAG on the bond.

    The example commands below add an MLAG ID of 1 to bond1 and 2 to bond2:

    cumulus@leaf01:~$ net add bond bond1 clag id 1
    cumulus@leaf01:~$ net add bond bond2 clag id 2
    cumulus@leaf01:~$ net pending
    cumulus@leaf01:~$ net commit
    
    cumulus@leaf01:~$ nv set interface bond1 bond mlag id 1
    cumulus@leaf01:~$ nv set interface bond2 bond mlag id 2 
    cumulus@leaf01:~$ nv config apply
    

    In the /etc/network/interfaces file, add the line clag-id 1 to the auto bond1 stanza and clag-id 2 to auto bond2 stanza:

    cumulus@switch:~$ sudo nano /etc/network/interfaces
    ...
    auto bond1
    iface bond1
        alias bond1 on swp1
        bond-slaves swp1
        clag-id 1
    
    auto bond2
    iface bond2
        alias bond2 on swp2
        bond-slaves swp2
        clag-id 2
    ...
    
  4. Add the bonds you created above to a bridge. The example commands below add bond1 and bond2 to a VLAN-aware bridge.

    You must add all VLANs configured on the MLAG bond to the bridge so that traffic to the downstream device connected in MLAG redirects over the peerlink in case the MLAG bond fails.

    cumulus@leaf01:~$ net add bridge bridge ports bond1,bond2
    cumulus@leaf01:~$ net pending
    cumulus@leaf01:~$ net commit
    
    cumulus@leaf01:~$ nv set interface bond1-2 bridge domain br_default 
    cumulus@leaf01:~$ nv config apply
    

    Edit the /etc/network/interfaces file to add the bridge-ports bond1 bond2 lines to the auto bridge stanza:

    cumulus@leaf01:~$ sudo nano /etc/network/interfaces
    ...
    auto bridge
    iface bridge
        bridge-ports bond1 bond2
        bridge-vlan-aware yes
    ...
    
  5. Create the inter-chassis bond and the peer link VLAN (as a VLAN subinterface). You also need to provide the peer link IP address, the MLAG bond interfaces, the MLAG system MAC address, and the backup interface.

    • By default, Cumulus Linux configures the inter-chassis bond with the name peerlink and the peer link VLAN with the name peerlink.4094. Use peerlink.4094 to ensure that the VLAN is independent of the bridge and spanning tree forwarding decisions.
    • The peer link IP address is a link-local address that provides layer 3 connectivity between the peer switches.
    • NVIDIA provides a reserved range of MAC addresses for MLAG (between 44:38:39:ff:00:00 and 44:38:39:ff:ff:ff). Use a MAC address from this range to prevent conflicts with other interfaces in the same bridged network.
      • Do not to use a multicast MAC address.
      • Do not use the same MAC address for different MLAG pairs; make sure you specify a different MAC address for each MLAG pair in the network.
    • The backup IP address is any layer 3 backup interface for the peer link, which the switch uses when the peer link goes down. You must add the backup IP address, which must be different than the peer link IP address. Make sure that any route that does not use the peer link can reach the backup IP address. Use the loopback or management IP address of the switch.
      Loopback or Management IP Address?
      • If your MLAG configuration has bridged uplinks (such as a campus network or a large, flat layer 2 network), use the peer switch eth0 address. When the peer link is down, the secondary switch routes towards the eth0 address using the OOB network (provided you have implemented an OOB network).
      • If your MLAG configuration has routed uplinks (a modern approach to the data center fabric network), use the peer switch loopback address. When the peer link is down, the secondary switch routes towards the loopback address using uplinks (towards the spine layer). If the primary switch also has a more significant problem (for example, switchd does not respond or stops), the secondary switch promotes itself to primary and the traffic flows.

      When using BGP, to ensure IP connectivity between the loopbacks, the MLAG peer switches must use unique BGP ASNs; if they use the same ASN, you must bypass the BGP loop prevention check on the AS_PATH attribute.

    The following examples show commands for both MLAG peers (leaf01 and leaf02).

    The NCLU command is a macro command that:

    • Automatically creates the inter-chassis bond (peerlink) and the peer link VLAN subinterface (peerlink.4094), and adds the peerlink bond to the bridge
    • Configures the peer link IP address (primary is the link-local address)
    • Adds the MLAG system MAC address, the MLAG bond interfaces, and the backup IP address you specify
    cumulus@leaf01:~$ net add clag peer sys-mac 44:38:39:BE:EF:AA interface swp49-50 primary backup-ip 10.10.10.2
    cumulus@leaf01:~$ net pending
    cumulus@leaf01:~$ net commit
    

    To configure the backup link to a VRF, include the name of the VRF with the backup-ip parameter. The following example configures the backup link to VRF RED:

    cumulus@leaf01:~$ net add clag peer sys-mac 44:38:39:BE:EF:AA interface swp49-50 primary backup-ip 10.10.10.2 vrf RED
    cumulus@leaf01:~$ net pending
    cumulus@leaf01:~$ net commit
    
    cumulus@leaf02:~$ net add clag peer sys-mac 44:38:39:BE:EF:AA interface swp49-50 primary backup-ip 10.10.10.1
    cumulus@leaf02:~$ net pending
    cumulus@leaf02:~$ net commit
    

    To configure the backup link to a VRF, include the name of the VRF with the backup-ip parameter. The following example configures the backup link to VRF RED:

    cumulus@leaf02:~$ net add clag peer sys-mac 44:38:39:BE:EF:AA interface swp49-50 primary backup-ip 10.10.10.1 vrf RED
    cumulus@leaf02:~$ net pending
    cumulus@leaf02:~$ net commit
    
    cumulus@leaf01:~$ nv set interface peerlink bond member swp49-50
    cumulus@leaf01:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
    cumulus@leaf01:~$ nv set mlag backup 10.10.10.2
    cumulus@leaf01:~$ nv set mlag peer-ip linklocal
    cumulus@leaf01:~$ nv config apply
    

    To configure the backup link to a VRF, include the name of the VRF with the backup-ip parameter. The following example configures the backup link to VRF RED:

    cumulus@leaf01:~$ nv set mlag backup 10.10.10.2 vrf RED
    cumulus@leaf01:~$ nv config apply
    
    cumulus@leaf02:~$ nv set interface peerlink bond member swp49-50
    cumulus@leaf02:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
    cumulus@leaf02:~$ nv set mlag backup 10.10.10.1
    cumulus@leaf02:~$ nv set mlag peer-ip linklocal
    cumulus@leaf02:~$ nv config apply
    

    To configure the backup link to a VRF, include the name of the VRF with the backup-ip parameter. The following example configures the backup link to VRF RED:

    cumulus@leaf02:~$ nv set mlag backup 10.10.10.1 vrf RED
    cumulus@leaf02:~$ nv config apply
    

    Edit the /etc/network/interfaces file to add the following parameters, then run the sudo ifreload -a command.

    • The inter-chasis bond (peerlink) with two ports in the bond (swp49 and swp50 in the example command below)
    • The peerlink bond to the bridge
    • The peer link VLAN (peerlink.4094) with the backup IP address, the peer link IP address (linklocal), and the MLAG system MAC address (from the reserved range of addresses).
    cumulus@leaf01:~$ sudo nano /etc/network/interfaces
    ...
    auto br_default
    iface br_default
        bridge-ports bond1 bond2 peerlink
        bridge-vlan-aware yes
    ...
    auto peerlink
    iface peerlink
        bond-slaves swp49 swp50
    

    auto peerlink.4094 iface peerlink.4094 clagd-backup-ip 10.10.10.2 clagd-peer-ip linklocal clagd-sys-mac 44:38:39:BE:EF:AA …

    To configure the backup link to a VRF, include the name of the VRF with the clagd-backup-ip parameter. The following example configures the backup link to VRF RED:

    cumulus@leaf01:~$ sudo nano /etc/network/interfaces
    ...
    auto peerlink.4094
    iface peerlink.4094
        clagd-backup-ip 10.10.10.2 vrf RED
        clagd-peer-ip linklocal
        clagd-sys-mac 44:38:39:BE:EF:AA
    ...
    

    Run the sudo ifreload -a command to apply all the configuration changes:

    cumulus@leaf01:~$ sudo ifreload -a
    
    cumulus@leaf02:~$ sudo nano /etc/network/interfaces
    ...
    auto br_default
    iface br_default
        bridge-ports bond1 bond2 peerlink
        bridge-vlan-aware yes
    ...
    auto peerlink
    iface peerlink
        bond-slaves swp49 swp50
    

    auto peerlink.4094 iface peerlink.4094 clagd-backup-ip 10.10.10.1 clagd-peer-ip linklocal clagd-sys-mac 44:38:39:BE:EF:AA …

    To configure the backup link to a VRF, include the name of the VRF with the clagd-backup-ip parameter. The following example configures the backup link to VRF RED:

    cumulus@leaf02:~$ sudo nano /etc/network/interfaces
    ...
    auto peerlink.4094
    iface peerlink.4094
        clagd-backup-ip 10.10.10.1 vrf RED
        clagd-peer-ip linklocal
        clagd-sys-mac 44:38:39:BE:EF:AA
    ...
    

    Run the sudo ifreload -a command to apply all the configuration changes:

    cumulus@leaf02:~$ sudo ifreload -a
    

  • Do not add VLAN 4094 to the bridge VLAN list; You cannot configure VLAN 4094 for the peer link subinterface as a bridged VLAN with bridge VIDs under the bridge.
  • Do not use 169.254.0.1 as the MLAG peer link IP address; Cumulus Linux uses this address for BGP unnumbered interfaces.
  • When you configure MLAG manually in the /etc/network/interfaces file, the changes take effect when you bring the peer link interface up with the sudo ifreload -a command. Do not use systemctl restart clagd.service to apply the new configuration.

MLAG synchronizes the dynamic state between the two peer switches but it does not synchronize the switch configurations. After modifying the configuration of one peer switch, you must make the same changes to the configuration on the other peer switch. This applies to all configuration changes, including:

Optional Configuration

This section describes optional configuration procedures.

Set Roles and Priority

Each MLAG-enabled switch in the pair has a role. When the peering relationship establishes between the two switches, one switch goes into the primary role and the other into the secondary role. When an MLAG-enabled switch is in the secondary role, it does not send STP BPDUs on dual-connected links; it only sends BPDUs on single-connected links. The switch in the primary role sends STP BPDUs on all single- and dual-connected links.

By default, the switch determines the role by comparing the MAC addresses of the two sides of the peering link; the switch with the lower MAC address assumes the primary role. You can override this by setting the priority option for the peer link:

cumulus@leaf01:~$ net add interface peerlink.4094 clag priority 2048
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
cumulus@leaf01:~$ nv set mlag priority 2084
cumulus@leaf01:~$ nv config apply

Edit the /etc/network/interfaces file and add the clagd-priority option, then run the ifreload -a command.

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto peerlink.4094
iface peerlink.4094
    clagd-peer-ip linklocal
    clagd-backup-ip 10.10.10.2
    clagd-sys-mac 44:38:39:BE:EF:AA
    clagd-priority 2048
...
cumulus@switch:~$ sudo ifreload -a

The switch with the lower priority value is in the primary role; the default value is 32768 and the range is between 0 and 65535.

When the MLAG service exits during switch reboot or if you stop the service on the primary switch, the peer switch that is in the secondary role becomes the primary.

However, if the primary switch goes down without stopping the MLAG service or if the peer link goes down, the secondary switch does not change its role. If the peer switch is not alive, the switch in the secondary role rolls back the LACP system ID to be the bond interface MAC address instead of the MLAG system MAC address (clagd-sys-mac). The switch in the primary role uses the MLAG system MAC address as the LACP system ID on the bonds.

Set clagctl Timers

The clagd service has several timers that you can tune for enhanced performance:

Timer
Description
--reloadTimer <seconds>The number of seconds to wait for the peer switch to become active. If the peer switch does not become active after the timer expires, the MLAG bonds leave the initialization (protodown) state and become active. This provides clagd with sufficient time to determine whether the peer switch is coming up or if it is permanently unreachable.
The default is 300 seconds.
--peerTimeout <seconds>
The number of seconds clagd waits without receiving any messages from the peer switch before it determines that the peer is no longer active. At this point, the switch reverts all configuration changes so that it operates as a standard non-MLAG switch. This includes removing all statically assigned MAC addresses, clearing the egress forwarding mask, and allowing addresses to move from any port to the peer port. After a message is again received from the peer, MLAG operation restarts. If this parameter is not specified, clagd uses ten times the local lacpPoll value.
--initDelay <seconds>The number of seconds clagd delays bringing up MLAG bonds and anycast IP addresses.
The default is 180 seconds.
This timer sets to 0 automatically under the following conditions:
  • When the peer is not alive and the backup link is not active after a reload timeout
  • When the peer sends a goodbye (through the peerlink or the backup link)
  • When both MLAG sessions come up at the same time
--sendTimeout <seconds>The number of seconds clagd waits until the sending socket times out. If it takes longer than the sendTimeout value to send data to the peer, clagd generates an exception.
The default is 30 seconds.
--lacpPoll <seconds>The number of seconds clagd waits before obtaining local LACP information.
The default is 2 seconds.

Run the net add interface peerlink.4094 clag args <timer> <value> command. The following example command sets the peerlink timer to 900 seconds:

cumulus@leaf01:~$ net add interface peerlink.4094 clag args --initDelay 100
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit

The only timer you can set with NVUE is the initial delay timer. The following example NVUE Command sets the initial delay to 100 seconds:

cumulus@leaf01:~$ nv set mlag init-delay 100
cumulus@leaf01:~$ nv config apply

To set the clagd timers, edit the /etc/network/interfaces file to add the clagd-args --<timer> line to the peerlink.4094 stanza, then run the ifreload -a command.

The following example command sets the initial delay timer to 100 seconds:

cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto peerlink.4094
iface peerlink.4094
    clagd-args --initDelay 100
    clagd-peer-ip linklocal
    clagd-backup-ip 10.10.10.2
    clagd-sys-mac 44:38:39:BE:EF:AA
    clagd-priority 2048
...
cumulus@leaf01:~$ sudo ifreload -a

The following example command sets the peer timeout to 900 seconds:

cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto peerlink.4094
iface peerlink.4094
    clagd-args --peerTimeout 900
    clagd-peer-ip linklocal
    clagd-backup-ip 10.10.10.2
    clagd-sys-mac 44:38:39:BE:EF:AA
    clagd-priority 2048
...
cumulus@leaf01:~$ sudo ifreload -a

Configure MLAG with a Traditional Mode Bridge

To configure MLAG with a traditional mode bridge instead of a VLAN-aware mode bridge, you must configure the peer link and all dual-connected links as untagged (native) ports on a bridge (note the absence of any VLANs in the bridge-ports line and the lack of the bridge-vlan-aware parameter below):

...
auto br0
iface br0
    bridge-ports peerlink bond1 bond2
...

The following example shows you how to allow VLAN 10 across the peer link:

...
auto br0.10
iface br0.10
    bridge-ports peerlink.10 bond1.10 bond2.10
    bridge-stp on
...

In an MLAG and traditional bridge configuration, NVIDIA recommends that you set bridge learning to off on all VLANs over the peerlink except for the layer 3 peerlink subinterface.

Configure a Backup UDP Port

By default, Cumulus Linux uses UDP port 5342 with the backup IP address. To change the backup UDP port, edit the /etc/network/interfaces file to add clagd-args --backupPort <port> to the auto peerlink.4094 stanza. For example:

cumulus@leaf01:~$ net add interface peerlink.4094 clag args --backupPort 5400
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
...
auto peerlink.4094
iface peerlink.4094
    clagd-args --backupPort 5400
    clagd-backup-ip 10.10.10.2
    clagd-peer-ip linklocal
    clagd-sys-mac 44:38:39:BE:EF:AA
...

Run the sudo ifreload -a command to apply all the configuration changes:

cumulus@leaf01:~$ sudo ifreload -a

Best Practices

Follow these best practices when configuring MLAG on your switches.

MTU and MLAG

The bridge MTU determines the MTU in MLAG traffic. The lowest MTU setting of an interface that is a member of the bridge determines the bridge MTU. If you want to set an MTU other than the default of 9216 bytes, you must configure the MTU on each physical interface and the bond interface that is a member of every MLAG bridge in the entire bridged domain.

The following example commands set an MTU of 1500 for each of the bond interfaces (peerlink, uplink, bond1, bond2), which are members of bridge bridge:

cumulus@leaf01:~$ net add bond peerlink mtu 1500
cumulus@leaf01:~$ net add bond uplink mtu 1500
cumulus@leaf01:~$ net add bond bond1 mtu 1500
cumulus@leaf01:~$ net add bond bond2 mtu 1500
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
cumulus@leaf01:~$ nv set interface peerlink.4094 link mtu 1500
cumulus@leaf01:~$ nv set interface uplink link mtu 1500
cumulus@leaf01:~$ nv set interface bond1 link mtu 1500
cumulus@leaf01:~$ nv set interface bond2 link mtu 1500
cumulus@leaf01:~$ nv config apply

Edit the /etc/network/interfaces file, then run the ifreload -a command. For example:

cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto br_default
iface br_default
    bridge-ports peerlink uplink bond1 bond2

auto peerlink
iface peerlink
    mtu 1500

auto bond1
iface bond1
    mtu 1500

auto bond2
iface bond2
    mtu 1500

auto uplink
iface uplink
    mtu 1500
...
cumulus@leaf01:~$ sudo ifreload -a

STP and MLAG

Always enable STP in your layer 2 network and BPDU Guard on the host-facing bond interfaces.

The peer link carries little traffic when compared to the bandwidth consumed by data plane traffic. In a typical MLAG configuration, most connections between the two switches in the MLAG pair are dual-connected; the only traffic going across the peer link is traffic from the clagd process and some LLDP or LACP traffic. The switch does not forward traffic received on the peer link out of the dual-connected bonds.

However, there are some instances where a host connects to only one switch in the MLAG pair; for example:

Determine how much bandwidth is traveling across the single-connected interfaces and set half of that bandwidth to the peer link. On average, one half of the traffic destined to the single-connected host arrives on the switch directly connected to the single-connected host and the other half arrives on the switch that is not directly connected to the single-connected host. When this happens, only the traffic that arrives on the switch that is not directly connected to the single-connected host needs to traverse the peer link.

In addition, you can add extra links to the peer link bond to handle link failures in the peer link bond itself.


  • Each host has two 10G links, with each 10G link going to each switch in the MLAG pair.
  • Each host has 20G of dual-connected bandwidth; all three hosts have a total of 60G of dual-connected bandwidth.
  • Set at least 15G of bandwidth for each peer link bond, which represents half of the single-connected bandwidth.

When planning for link failures for a full rack, you need only set enough bandwidth to meet your site strategy for handling failure scenarios. For example, for a full rack with 40 servers and two switches, you can plan for four to six servers to lose connectivity to a single switch and become single connected before you respond to the event. Therefore, if you have 40 hosts each with 20G of bandwidth dual-connected to the MLAG pair, you can set between 20G and 30G of bandwidth to the peer link, which accounts for half of the single-connected bandwidth for four to six hosts.

When enabling a routing protocol in an MLAG environment, it is also necessary to manage the uplinks; by default MLAG is not aware of layer 3 uplink interfaces. If there is a peer link failure, MLAG does not remove static routes or bring down a BGP or OSPF adjacency unless you use a separate link state daemon such as ifplugd.

When you use MLAG with VRR, set up a routed adjacency across the peerlink.4094 interface. If a routed connection is not built across the peer link, during an uplink failure on one of the switches in the MLAG pair, egress traffic does not forward if the destination is on the switch whose uplinks are down.

To set up the adjacency, configure a BGP or OSPF unnumbered peering, as appropriate for your network.

For BGP, use a configuration like this:

cumulus@leaf01:~$ net add bgp neighbor peerlink.4094 interface remote-as internal
cumulus@leaf01:~$ net commit

If you are using EVPN and MLAG, you need to enable the EVPN address family across the peerlink.4094 interface as well:

cumulus@leaf01:~$ net add bgp neighbor peerlink.4094 interface remote-as internal
cumulus@leaf01:~$ net add bgp l2vpn evpn neighbor peerlink.4094 activate
cumulus@leaf01:~$ net commit

If you use NCLU to create an iBGP peering across the peer link, the net add bgp l2vpn evpn neighbor peerlink.4094 activate command creates a new eBGP neighbor when one is already configured for iBGP. The existing iBGP configuration is still valid.

cumulus@leaf01:~$ nv set vrf default router bgp peer peerlink remote-as internal
cumulus@leaf01:~$ nv config apply
cumulus@leaf01:~$ sudo vtysh
leaf01# configure terminal
leaf01(config)# router bgp 65101
leaf01(config-router)# bgp router-id 10.10.10.1
leaf01(config-router)# neighbor peerlink remote-as external
leaf01(config-router)# end
leaf01# write memory
leaf01# exit
cumulus@leaf01:~$

If you are using EVPN and MLAG, you need to enable the EVPN address family across the peerlink.4094 interface as well:

cumulus@leaf01:~$ sudo vtysh
leaf01# configure terminal
leaf01(config)# router bgp 65101
leaf01(config-router)# bgp router-id 10.10.10.1
leaf01(config-router)# neighbor peerlink remote-as external
leaf01(config-router)# address-family l2vpn evpn
leaf01(config-router-af)# neighbor peerlink activate
leaf01(config-router-af)# end
leaf01# write memory
leaf01# exit
cumulus@leaf01:~$

For OSPF, use a configuration like this:

cumulus@leaf01:~$ net add interface peerlink.4094 ospf area 0.0.0.1
cumulus@v:~$ net commit
cumulus@leaf01:~$ nv set interface peerlink.4094 router ospf area 0.0.0.1
cumulus@leaf01:~$ nv config apply

Configuration Example

The example below shows a basic MLAG configuration, where:

For an example configuration with MLAG and BGP, see the BGP configuration example.

NCLU Commands

cumulus@leaf01:~$ net add loopback lo ip address 10.10.10.1/32
cumulus@leaf01:~$ net add interface swp1-3,swp49-51
cumulus@leaf01:~$ net add bond bond1 bond slaves swp1
cumulus@leaf01:~$ net add bond bond2 bond slaves swp2
cumulus@leaf01:~$ net add bond bond3 bond slaves swp3
cumulus@leaf01:~$ net add bond bond1 clag id 1
cumulus@leaf01:~$ net add bond bond2 clag id 2
cumulus@leaf01:~$ net add bond bond3 clag id 3
cumulus@leaf01:~$ net add bridge bridge ports bond1,bond2,bond3
cumulus@leaf01:~$ net add vlan 10 vlan-id 10
cumulus@leaf01:~$ net add vlan 20 vlan-id 20
cumulus@leaf01:~$ net add vlan 30 vlan-id 30
cumulus@leaf01:~$ net add vlan 10 ip address 10.1.10.2/24
cumulus@leaf01:~$ net add vlan 20 ip address 10.1.20.2/24
cumulus@leaf01:~$ net add vlan 30 ip address 10.1.30.2/24
cumulus@leaf01:~$ net add bridge bridge vids 10,20,30
cumulus@leaf01:~$ net add clag peer sys-mac 44:38:39:BE:EF:AA interface swp49-50 primary backup-ip 10.10.10.2
cumulus@leaf01:~$ net add interface peerlink.4094 clag args --initDelay 100
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
cumulus@leaf02:~$ net add loopback lo ip address 10.10.10.2/32
cumulus@leaf02:~$ net add interface swp1-3,swp49-51
cumulus@leaf02:~$ net add bond bond1 bond slaves swp1
cumulus@leaf02:~$ net add bond bond2 bond slaves swp2
cumulus@leaf02:~$ net add bond bond3 bond slaves swp3
cumulus@leaf02:~$ net add bond bond1 clag id 1
cumulus@leaf02:~$ net add bond bond2 clag id 2
cumulus@leaf02:~$ net add bond bond3 clag id 3
cumulus@leaf02:~$ net add bridge bridge ports bond1,bond2,bond3
cumulus@leaf02:~$ net add vlan 10 vlan-id 10
cumulus@leaf02:~$ net add vlan 20 vlan-id 20
cumulus@leaf02:~$ net add vlan 30 vlan-id 30
cumulus@leaf02:~$ net add vlan 10 ip address 10.1.10.3/24
cumulus@leaf02:~$ net add vlan 20 ip address 10.1.20.3/24
cumulus@leaf02:~$ net add vlan 30 ip address 10.1.30.3/24
cumulus@leaf02:~$ net add bridge bridge vids 10,20,30
cumulus@leaf02:~$ net add clag peer sys-mac 44:38:39:BE:EF:AA interface swp49-50 primary backup-ip 10.10.10.1
cumulus@leaf02:~$ net add interface peerlink.4094 clag args --initDelay 100
cumulus@leaf02:~$ net pending
cumulus@leaf02:~$ net commit
cumulus@spine01:~$ net add loopback lo ip address 10.10.10.101/32
cumulus@spine01:~$ net add interface swp1-2
cumulus@spine01:~$ net pending
cumulus@spine01:~$ net commit
cumulus@leaf01:~$ cat /etc/network/interfaces

auto lo iface lo inet loopback address 10.10.10.1/32

auto swp1 iface swp1

auto swp2 iface swp2

auto swp3 iface swp3

auto swp49 iface swp49

auto swp50 iface swp50

auto swp51 iface swp51

auto bond1 iface bond1 bond-slaves swp1 clag-id 1

auto bond2 iface bond2 bond-slaves swp2 clag-id 2

auto bond3 iface bond3 bond-slaves swp3 clag-id 3

auto bridge iface bridge bridge-ports bond1 bond2 bond3 peerlink bridge-vids 10 20 30 bridge-vlan-aware yes

auto mgmt iface mgmt vrf-table auto address 127.0.0.1/8 address ::1/128

auto eth0 iface eth0 inet dhcp vrf mgmt

auto peerlink iface peerlink bond-slaves swp49 swp50

auto peerlink.4094 iface peerlink.4094 clagd-args –initDelay 100 clagd-backup-ip 10.10.10.2 clagd-peer-ip linklocal clagd-priority 1000 clagd-sys-mac 44:38:39:BE:EF:AA

auto vlan10 iface vlan10 address 10.1.10.2/24 vlan-id 10 vlan-raw-device bridge

auto vlan20 iface vlan20 address 10.1.20.2/24 vlan-id 20 vlan-raw-device bridge

auto vlan30 iface vlan30 address 10.1.30.2/24 vlan-id 30 vlan-raw-device bridge

cumulus@leaf02:~$ cat /etc/network/interfaces

auto lo iface lo inet loopback address 10.10.10.2/32

auto swp1 iface swp1

auto swp2 iface swp2

auto swp3 iface swp3

auto swp49 iface swp49

auto swp50 iface swp50

auto swp51 iface swp51

auto bond1 iface bond1 bond-slaves swp1 clag-id 1

auto bond2 iface bond2 bond-slaves swp2 clag-id 2

auto bond3 iface bond3 bond-slaves swp3 clag-id 3

auto bridge iface bridge bridge-ports bond1 bond2 bond3 peerlink bridge-vids 10 20 30 bridge-vlan-aware yes

auto mgmt iface mgmt vrf-table auto address 127.0.0.1/8 address ::1/128

auto eth0 iface eth0 inet dhcp vrf mgmt

auto peerlink iface peerlink bond-slaves swp49 swp50

auto peerlink.4094 iface peerlink.4094 clagd-args –initDelay 100 clagd-backup-ip 10.10.10.1 clagd-peer-ip linklocal clagd-priority 32768 clagd-sys-mac 44:38:39:BE:EF:AA

auto vlan10 iface vlan10 address 10.1.10.3/24 vlan-id 10 vlan-raw-device bridge

auto vlan20 iface vlan20 address 10.1.20.3/24 vlan-id 20 vlan-raw-device bridge

auto vlan30 iface vlan30 address 10.1.30.3/24 vlan-id 30 vlan-raw-device bridge

cumulus@spine01:~$ cat /etc/network/interfaces

auto lo iface lo inet loopback address 10.10.10.101/32

auto swp1 iface swp1

auto swp2 iface swp2

auto mgmt iface mgmt vrf-table auto address 127.0.0.1/8 address ::1/128

auto eth0 iface eth0 inet dhcp vrf mgmt

NVUE Commands

cumulus@leaf01:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@leaf01:~$ nv set interface swp1-3,swp49-51
cumulus@leaf01:~$ nv set interface bond1 bond member swp1
cumulus@leaf01:~$ nv set interface bond2 bond member swp2
cumulus@leaf01:~$ nv set interface bond3 bond member swp3
cumulus@leaf01:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf01:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf01:~$ nv set interface bond3 bond mlag id 3
cumulus@leaf01:~$ nv set interface vlan10 ip address 10.1.10.2/24
cumulus@leaf01:~$ nv set interface vlan20 ip address 10.1.20.2/24
cumulus@leaf01:~$ nv set interface vlan30 ip address 10.1.30.2/24
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf01:~$ nv set interface bond1-3 bridge domain br_default 
cumulus@leaf01:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf01:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf01:~$ nv set mlag backup 10.10.10.2
cumulus@leaf01:~$ nv set mlag peer-ip linklocal
cumulus@leaf01:~$ nv set mlag init-delay 100
cumulus@leaf01:~$ nv config apply
cumulus@leaf02:~$ nv set interface lo ip address 10.10.10.2/32
cumulus@leaf02:~$ nv set interface swp1-3,swp49-51
cumulus@leaf02:~$ nv set interface bond1 bond member swp1
cumulus@leaf02:~$ nv set interface bond2 bond member swp2
cumulus@leaf02:~$ nv set interface bond3 bond member swp3
cumulus@leaf02:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf02:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf02:~$ nv set interface bond3 bond mlag id 3
cumulus@leaf02:~$ nv set interface vlan10 ip address 10.1.10.3/24
cumulus@leaf02:~$ nv set interface vlan20 ip address 10.1.20.3/24
cumulus@leaf02:~$ nv set interface vlan30 ip address 10.1.30.3/24
cumulus@leaf02:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf02:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf02:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf02:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf02:~$ nv set mlag backup 10.10.10.1
cumulus@leaf02:~$ nv set mlag peer-ip linklocal
cumulus@leaf02:~$ nv set mlag init-delay 100
cumulus@leaf02:~$ nv config apply
cumulus@spine01:~$ nv set interface lo ip address 10.10.10.101/32
cumulus@spine01:~$ nv set interface swp1-2
cumulus@spine01:~$ nv config apply
- set:
     interface:
      lo:
        ip:
          address:
            10.10.10.1/32: {}
        type: loopback
      swp1:
        type: swp
      swp2:
        type: swp
      swp3:
        type: swp
      swp49:
        type: swp
      swp50:
        type: swp
      swp51:
        type: swp
      bond1:
        bond:
          member:
            swp1: {}
          mlag:
            id: 1
        type: bond
        bridge:
          domain:
            br_default: {}
      bond2:
        bond:
          member:
            swp2: {}
          mlag:
            id: 2
        type: bond
        bridge:
          domain:
            br_default: {}
      bond3:
        bond:
          member:
            swp3: {}
          mlag:
            id: 3
        type: bond
        bridge:
          domain:
            br_default: {}
      vlan10:
        ip:
          address:
            10.1.10.2/24: {}
        type: svi
        vlan: 10
      vlan20:
        ip:
          address:
            10.1.20.2/24: {}
        type: svi
        vlan: 20
      vlan30:
        ip:
          address:
            10.1.30.2/24: {}
        type: svi
        vlan: 30
      peerlink:
        bond:
          member:
            swp49: {}
            swp50: {}
        type: peerlink
      peerlink.4094:
        type: sub
        base-interface: peerlink
        vlan: 4094
    bridge:
      domain:
        br_default:
          vlan:
            '10': {}
            '20': {}
            '30': {}
    mlag:
      mac-address: 44:38:39:BE:EF:AA
      backup:
        10.10.10.2: {}
      peer-ip: linklocal
      init-delay: 100
- set:
    interface:
      lo:
        ip:
          address:
            10.10.10.2/32: {}
        type: loopback
      swp1:
        type: swp
      swp2:
        type: swp
      swp3:
        type: swp
      swp49:
        type: swp
      swp50:
        type: swp
      swp51:
        type: swp
      bond1:
        bond:
          member:
            swp1: {}
          mlag:
            id: 1
        type: bond
        bridge:
          domain:
            br_default: {}
      bond2:
        bond:
          member:
            swp2: {}
          mlag:
            id: 2
        type: bond
        bridge:
          domain:
            br_default: {}
      bond3:
        bond:
          member:
            swp3: {}
          mlag:
            id: 3
        type: bond
        bridge:
          domain:
            br_default: {}
      vlan10:
        ip:
          address:
            10.1.10.3/24: {}
        type: svi
        vlan: 10
      vlan20:
        ip:
          address:
            10.1.20.3/24: {}
        type: svi
        vlan: 20
      vlan30:
        ip:
          address:
            10.1.30.3/24: {}
        type: svi
        vlan: 30
      peerlink:
        bond:
          member:
            swp49: {}
            swp50: {}
        type: peerlink
      peerlink.4094:
        type: sub
        base-interface: peerlink
        vlan: 4094
    bridge:
      domain:
        br_default:
          vlan:
            '10': {}
            '20': {}
            '30': {}
    mlag:
      mac-address: 44:38:39:BE:EF:AA
      backup:
        10.10.10.1: {}
      peer-ip: linklocal
      init-delay: 100
- set:
    nterface:
      lo:
        ip:
          address:
            10.10.10.101/32: {}
        type: loopback
      swp1:
        type: swp
      swp2:
        type: swp
auto lo
iface lo inet loopback
    address 10.10.10.1/32
auto mgmt
iface mgmt
    address 127.0.0.1/8
    address ::1/128
    vrf-table auto
auto eth0
iface eth0 inet dhcp
    ip-forward off
    ip6-forward off
    vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto bond1
iface bond1
    bond-slaves swp1
    bond-mode 802.3ad
    bond-lacp-bypass-allow no
    clag-id 1
auto bond2
iface bond2
    bond-slaves swp2
    bond-mode 802.3ad
    bond-lacp-bypass-allow no
    clag-id 2
auto bond3
iface bond3
    bond-slaves swp3
    bond-mode 802.3ad
    bond-lacp-bypass-allow no
    clag-id 3

auto vlan10 iface vlan10 address 10.1.10.2/24 hwaddress 44:38:39:22:01:b1 vlan-raw-device br_default vlan-id 10 auto vlan20 iface vlan20 address 10.1.20.2/24 hwaddress 44:38:39:22:01:b1 vlan-raw-device br_default vlan-id 20 auto vlan30 iface vlan30 address 10.1.30.2/24 hwaddress 44:38:39:22:01:b1 vlan-raw-device br_default vlan-id 30 auto peerlink iface peerlink bond-slaves swp49 swp50 bond-mode 802.3ad bond-lacp-bypass-allow no auto peerlink.4094 iface peerlink.4094 clagd-peer-ip linklocal clagd-backup-ip 10.10.10.2 clagd-sys-mac 44:38:39:BE:EF:AA clagd-args –initDelay 100 auto br_default iface br_default bridge-ports bond1 bond2 bond3 peerlink hwaddress 44:38:39:22:01:b1 bridge-vlan-aware yes bridge-vids 10 20 30 bridge-pvid 1

auto lo
iface lo inet loopback
    address 10.10.10.2/32
auto mgmt
iface mgmt
    address 127.0.0.1/8
    address ::1/128
    vrf-table auto
auto eth0
iface eth0 inet dhcp
    ip-forward off
    ip6-forward off
    vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto bond1
iface bond1
    bond-slaves swp1
    bond-mode 802.3ad
    bond-lacp-bypass-allow no
    clag-id 1
auto bond2
iface bond2
    bond-slaves swp2
    bond-mode 802.3ad
    bond-lacp-bypass-allow no
    clag-id 2
auto bond3
iface bond3
    bond-slaves swp3
    bond-mode 802.3ad
    bond-lacp-bypass-allow no
    clag-id 3
auto vlan10
iface vlan10
    address 10.1.10.3/24
    hwaddress 44:38:39:22:01:af
    vlan-raw-device br_default
    vlan-id 10
auto vlan20
iface vlan20
    address 10.1.20.3/24
    hwaddress 44:38:39:22:01:af
    vlan-raw-device br_default
    vlan-id 20
auto vlan30
iface vlan30
    address 10.1.30.3/24
    hwaddress 44:38:39:22:01:af
    vlan-raw-device br_default
    vlan-id 30
auto peerlink
iface peerlink
    bond-slaves swp49 swp50
    bond-mode 802.3ad
    bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
    clagd-peer-ip linklocal
    clagd-backup-ip 10.10.10.1
    clagd-sys-mac 44:38:39:BE:EF:AA
    clagd-args --initDelay 100
auto br_default
iface br_default
    bridge-ports bond1 bond2 bond3 peerlink
    hwaddress 44:38:39:22:01:af
    bridge-vlan-aware yes
    bridge-vids 10 20 30
    bridge-pvid 1
auto lo
iface lo inet loopback
    address 10.10.10.101/32
auto mgmt
iface mgmt
    address 127.0.0.1/8
    address ::1/128
    vrf-table auto
auto eth0
iface eth0 inet dhcp
    ip-forward off
    ip6-forward off
    vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2

This simulation starts with the example MLAG configuration. The demo is pre-configured using NVUE commands.

To validate the configuration, run the commands listed in the troubleshooting section below.

Troubleshooting

Use the following troubleshooting tips to check MLAG configuration.

Check MLAG Status

To check the status of your MLAG configuration, run the NCLU net show clag command or the Linux clagctl command. For example:

cumulus@leaf01:~$ net show clag
The peer is alive
     Our Priority, ID, and Role: 32768 44:38:39:00:00:59 primary
    Peer Priority, ID, and Role: 32768 44:38:39:00:00:5a secondary
          Peer Interface and IP: peerlink.4094 fe80::4638:39ff:fe00:5a (linklocal)
                      Backup IP: 10.10.10.2 (inactive)
                     System MAC: 44:38:39:be:ef:aa

CLAG Interfaces
Our Interface      Peer Interface     CLAG Id   Conflicts              Proto-Down Reason
----------------   ----------------   -------   --------------------   -----------------
           bond1   -                  1         -                      -              
           bond2   -                  2         -                      -              
           bond3   -                  3         -                      -              

Show All MLAG Settings

To see all MLAG settings, run the clagctl params command:

cumulus@leaf01:~$ clagctl params
clagVersion = 1.4.0
clagDataVersion = 1.4.0
clagCmdVersion = 1.1.0
peerIp = linklocal
peerIf = peerlink.4094
sysMac = 44:38:39:be:ef:aa
lacpPoll = 2
currLacpPoll = 2
peerConnect = 1
cmdConnect = 1
peerLinkPoll = 1
switchdReadyTimeout = 120
reloadTimer = 300
periodicRun = 4
priority = 32768
quiet = False
debug = 0x0
verbose = False
log = syslog
vm = True
peerPort = 5342
peerTimeout = 20
initDelay = 100
sendTimeout = 30
sendBufSize = 65536
forceDynamic = False
dormantDisable = False
redirectEnable = False
redirect2Enable = True
backupIp = 10.10.10.2
backupVrf = None
backupPort = 5342
vxlanAnycast = None
neighSync = True
permanentMacSync = True
cmdLine = /usr/sbin/clagd --daemon linklocal peerlink.4094 44:38:39:BE:EF:AA --priority 32768 --backupIp 10.10.10.2 --initDelay 100
peerlinkLearnEnable = False

View the MLAG Log File

By default, when running, the clagd service logs status messages to the /var/log/clagd.log file and to syslog:

cumulus@spine01:~$ sudo tail /var/log/clagd.log
2016-10-03T20:31:50.471400+00:00 spine01 clagd[1235]: Initial config loaded
2016-10-03T20:31:52.479769+00:00 spine01 clagd[1235]: The peer switch is active.
2016-10-03T20:31:52.496490+00:00 spine01 clagd[1235]: Initial data sync to peer done.
2016-10-03T20:31:52.540186+00:00 spine01 clagd[1235]: Role is now primary; elected
2016-10-03T20:31:54.250572+00:00 spine01 clagd[1235]: HealthCheck: role via backup is primary
2016-10-03T20:31:54.252642+00:00 spine01 clagd[1235]: HealthCheck: backup active
2016-10-03T20:31:54.537967+00:00 spine01 clagd[1235]: Initial data sync from peer done.
2016-10-03T20:31:54.538435+00:00 spine01 clagd[1235]: Initial handshake done.
2016-10-03T22:47:35.255317+00:00 spine01 clagd[1235]: leaf01-02 is now dual connected.

Monitor the clagd Service

Due to the critical nature of the clagd service, systemd continuously monitors its status by receiving notify messages every 30 seconds. If the clagd service terminates or becomes unresponsive for any reason and systemd receives no messages after 60 seconds, systemd restarts the clagd service. systemd logs these failures in the /var/log/syslog file and, on the first failure, also generates a cl-supportfile.

Monitoring occurs automatically as long as:

You can check if clagd is running with the cl-service-summary or the systemctl status command:

cumulus@leaf01:~$ cl-service-summary
Service cron               enabled    active
Service ssh                enabled    active
Service syslog             enabled    active
Service asic-monitor       enabled    inactive
Service clagd              enabled    active
...
cumulus@leaf01:~$ systemctl status clagd.service
 ● clagd.service - Cumulus Linux Multi-Chassis LACP Bonding Daemon
    Loaded: loaded (/lib/systemd/system/clagd.service; enabled)
    Active: active (running) since Fri 2021-06-11 16:17:19 UTC; 12min ago
        Docs: man:clagd(8)
    Main PID: 27078 (clagd)
    CGroup: /system.slice/clagd.service
            └─27078 /usr/bin/python3 /usr/sbin/clagd --daemon linklocal peerlink.4094 44:38:39:BE:EF:AA --priority 32768

You can expect a large volume of packet drops across one of the peer link interfaces. These drops serve to prevent looping of BUM (broadcast, unknown unicast, multicast) packets. When the switch receives a packet across the peer link, if the destination lookup results in an egress interface that is a dual-connected bond, the switch does not forward the packet (to prevent loops). The peer link records a dropped packet.

To check packet drops across peer link interfaces, run the following command:

Run the net show counters command. The number of dropped packets shows in the RX_DRP column.

cumulus@leaf01:~$ net show counters

Kernel Interface table
Iface            MTU    RX_OK    RX_ERR    RX_DRP    RX_OVR    TX_OK    TX_ERR    TX_DRP    TX_OVR  Flg
-------------  -----  -------  --------  --------  --------  -------  --------  --------  --------  -----
bond1           9216        0         0         0         0      542         0         0         0  BMmU
bond2           9216        0         0         0         0      542         0         0         0  BMmU
bridge          9216        0         0         0         0       17         0         0         0  BMRU
eth0            1500     5497         0         0         0      933         0         0         0  BMRU
lo             65536     1328         0         0         0     1328         0         0         0  LRU
mgmt           65536      790         0         0         0        0         0        33         0  OmRU
peerlink        9216    23626         0       520         0    23665         0         0         0  BMmRU
peerlink.4094   9216     8013         0         0         0     8017         0         0         0  BMRU
swp1            9216        5         0         0         0      553         0         0         0  BMsRU
swp2            9216        3         0         0         0      552         0         0         0  BMsRU
swp49           9216    11822         0         0         0    11852         0         0         0  BMsRU
swp50           9216    11804         0         0         0    11841         0         0         0  BMsRU
swp51           9216        0         0         0         0      292         0         0         0  BMRU

Run the ethtool -S <interface> command:

cumulus@leaf01:mgmt:~$ ethtool -S swp49
NIC statistics:
     rx_queue_0_packets: 136
     rx_queue_0_bytes: 36318
     rx_queue_0_drops: 0
     rx_queue_0_xdp_packets: 0
     rx_queue_0_xdp_tx: 0
     rx_queue_0_xdp_redirects: 0
     rx_queue_0_xdp_drops: 0
     rx_queue_0_kicks: 1
     tx_queue_0_packets: 200
     tx_queue_0_bytes: 44244
     tx_queue_0_xdp_tx: 0
     tx_queue_0_xdp_tx_drops: 0
     tx_queue_0_kicks: 195

In addition to the standard UP and DOWN administrative states, an interface that is a member of an MLAG bond can also be in a protodown state. When MLAG detects a problem that can result in connectivity issues, it puts that interface into protodown state. Such connectivity issues include:

When an interface goes into a protodown state, it results in a local OPER DOWN (carrier down) on the interface.

To show an interface in protodown state, run the NCLU net show bridge link command or the Linux ip link show command. For example:

cumulus@leaf01:~$ net show bridge link
3: swp1 state DOWN: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 9216 master pfifo_fast master host-bond1 state DOWN mode DEFAULT qlen 500 protodown on
    link/ether 44:38:39:00:69:84 brd ff:ff:ff:ff:ff:ff

LACP Partner MAC Address Duplicate or Mismatch

Cumulus Linux puts interfaces in a protodown state under the following conditions:

After you make the necessary cable or configuration changes to avoid the protodown state and you want MLAG to reevaluate the LACP partners, use the clagctl clearconflictstate command to remove duplicate-partner-mac or partner-mac-mismatch from the protodown bonds, allowing them to come back up.

LACP Bypass

On Cumulus Linux, LACP bypass allows a bond configured in 802.3ad mode to become active and forward traffic even when there is no LACP partner. For example, you can enable a host that does not have the capability to run LACP to PXE boot while connected to a switch on a bond configured in 802.3ad mode. After the pre-boot process completes and the host is capable of running LACP, the normal 802.3ad link aggregation operation takes over.

LACP Bypass All-active Mode

In all-active mode, when a bond has multiple slave interfaces, each bond slave interface operates as an active link while the bond is in bypass mode. This is useful during PXE boot of a server with multiple NICs, when you cannot determine beforehand which port needs to be active.

  • All-active mode is not supported on bonds that are not specified as bridge ports on the switch.
  • STP does not run on the individual bond slave interfaces when the LACP bond is in all-active mode. Only use all-active mode on host-facing LACP bonds. Configure STP BPDU guard together with all-active mode.
  • In an MLAG deployment where bond slaves of a host connect to two switches and the bond is in all-active mode, all the slaves of the bond are active on both the primary and secondary MLAG nodes. If multiple physical NIC interfaces or more than one physical NIC is present on the physical host, NVIDIA recommends that you define which physical NIC or interface runs the PXE boot inside the PXE boot configuration file. If you do not define a specific NIC or interface, the switch sends a PXE boot request on all the interfaces in the bond and the PXE request fails.
  • LACP bypass works with EVPN multihoming.
  • Cumulus Linux does not support priority mode, bond-lacp-bypass-period, bond-lacp-bypass-priority, and bond-lacp-bypass-all-active.

Configure LACP Bypass

To enable LACP bypass on the host-facing bond:

The following commands create a VLAN-aware bridge with LACP bypass enabled:

cumulus@switch:~$ net add bond bond1 bond slaves swp1,swp2
cumulus@switch:~$ net add bond bond1 clag id 1
cumulus@switch:~$ net add bond bond1 bond lacp-bypass-allow
cumulus@switch:~$ net add bond bond1 stp bpduguard
cumulus@switch:~$ net add bridge bridge ports bond1,bond2,bond3
cumulus@switch:~$ net add bridge bridge vids 10,20,30
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit

The following commands create a VLAN-aware bridge with LACP bypass enabled:

cumulus@leaf01:~$ nv set interface bond1 bond member swp1-2
cumulus@leaf01:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf01:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf01:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf01:~$ nv config apply

Edit the /etc/network/interfaces file to add the set bond-lacp-bypass-allow to yes option, then run the ifreload -a command. The following configuration creates a VLAN-aware bridge with LACP bypass enabled.

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto bond1
iface bond1
    bond-slaves swp1 swp2
    bond-mode 802.3ad
    bond-lacp-bypass-allow yes
    clag-id 1
...
auto bridge
iface bridge
    bridge-ports bond1 bond2 bond3
    bridge-vids 10 20 30
    bridge-vlan-aware yes
...
cumulus@switch:~$ sudo ifreload -a

To check the status of the configuration, run the NCLU net show interface <bond> command or the Linux ip link show command on the bond and its slave interfaces. The NVUE Command is nv show interface <bond>.

cumulus@switch:~$ ip link show bond1
164: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UP mode DORMANT group default
    link/ether c4:54:44:f6:44:5a brd ff:ff:ff:ff:ff:ff
cumulus@switch:~$ ip link show swp1
55: swp1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond1 state UP mode DEFAULT group default qlen 1000
    link/ether c4:54:44:f6:44:5a brd ff:ff:ff:ff:ff:ff
cumulus@switch:~$ ip link show swp2
56: swp2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond1 state UP mode DEFAULT group default qlen 1000
    link/ether c4:54:44:f6:44:5a brd ff:ff:ff:ff:ff:ff

Virtual Router Redundancy - VRR and VRRP

Cumulus Linux provides the option of using Virtual Router Redundancy (VRR) or Virtual Router Redundancy Protocol (VRRP).

You cannot configure both VRR and VRRP on the same switch.

VRR

The diagram below illustrates a basic VRR-enabled network configuration.

The network includes three hosts and two routers running Cumulus Linux that use multi-chassis link aggregation (MLAG).

Configure the Routers

The routers implement the layer 2 network interconnecting the hosts and the redundant routers. To configure the routers, add a bridge with the following interfaces to each router:

Cumulus Linux only supports VRR on switched virtual interfaces (SVIs). You cannot configure VRR on physical interfaces or virtual subinterfaces.

The example commands below create a VLAN-aware bridge interface for a VRR-enabled network. The example assumes you have already configured a VLAN-aware bridge with VLAN 10 and that VLAN 10 has an IP address:

cumulus@switch:~$ net add vlan 10 ip address-virtual 00:00:5E:00:01:01 10.1.10.1/24
cumulus@switch:~$ net add vlan 10 ipv6 address-virtual 00:00:5e:00:01:00 10.1.20.1/24
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@switch:~$ nv set interface vlan10 ip vrr mac-address 00:00:5e:00:01:00
cumulus@switch:~$ nv set interface vlan10 ip vrr state up
cumulus@switch:~$ nv config apply

Use the same commands for IPV6 addresses; for example:

cumulus@switch:~$ nv set interface vlan10 ip vrr address 2001:db8::1/32
cumulus@switch:~$ nv set interface vlan10 ip vrr mac-address 00:00:5e:00:01:00
cumulus@switch:~$ nv set interface vlan10 ip vrr state up

Edit the /etc/network/interfaces file, then run the ifreload -a command.

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto vlan10
iface vlan10
    address 10.1.10.2/24
    address-virtual 00:00:5e:00:01:00 10.1.10.1/24
    vlan-raw-device br_default
    vlan-id 10
...
cumulus@switch:~$ sudo ifreload -a

Configure the Hosts

Each host must have two network interfaces. The routers configure the interfaces as bonds running LACP; the hosts must also configure the two interfaces using teaming, port aggregation, port group, or EtherChannel running LACP. Configure the hosts either statically or with DHCP, with a gateway address that is the IP address of the virtual router; this default gateway address never changes.

Configure the links between the hosts and the routers in active-active mode for First Hop Redundancy Protocol.

Example VRR Configuration with MLAG

To create an MLAG configuration that incorporates VRR, use a configuration similar to the following.

The following examples uses a single virtual MAC address for VLANs. You can add a unique MAC address for each VLAN, but this is not necessary.

cumulus@leaf01:~$ net add interface eth0 ip address 192.168.200.11/24
cumulus@leaf01:~$ net add bond bond1 bond slaves swp1
cumulus@leaf01:~$ net add bond bond2 bond slaves swp2
cumulus@leaf01:~$ net add bond bond1 clag id 1
cumulus@leaf01:~$ net add bond bond2 clag id 2
cumulus@leaf01:~$ net add bridge bridge ports bond1,bond2
cumulus@leaf01:~$ net add clag peer sys-mac 44:38:39:BE:EF:AA interface swp49-50 primary backup-ip 10.10.10.2
cumulus@leaf01:~$ net add vlan 10 ip address 10.1.10.2/24
cumulus@leaf01:~$ net add vlan 10 ip address-virtual 00:00:5E:00:01:01 10.1.10.1/24
cumulus@leaf01:~$ net add vlan 20 ip address 10.1.20.2/24
cumulus@leaf01:~$ net add vlan 20 ip address-virtual 00:00:5E:00:01:01 10.1.20.1/24
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
cumulus@leaf01:~$ nv set interface eth0 ip address 192.168.200.11/24
cumulus@leaf01:~$ nv set interface bond1 bond member swp1
cumulus@leaf01:~$ nv set interface bond2 bond member swp2
cumulus@leaf01:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf01:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf01:~$ nv set interface bond1-2 bridge domain br_default
cumulus@leaf01:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf01:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf01:~$ nv set mlag backup 10.10.10.2
cumulus@leaf01:~$ nv set mlag peer-ip linklocal
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10,20
cumulus@leaf01:~$ nv set bridge domain br_default untagged 1
cumulus@leaf01:~$ nv set interface vlan10 ip address 10.1.10.2/24
cumulus@leaf01:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@leaf01:~$ nv set interface vlan10 ip vrr mac-address 00:00:5e:00:01:00
cumulus@leaf01:~$ nv set interface vlan10 ip vrr state up
cumulus@leaf01:~$ nv set interface vlan20 ip address 10.1.20.2/24
cumulus@leaf01:~$ nv set interface vlan20 ip vrr address 10.1.20.1/24
cumulus@leaf01:~$ nv set interface vlan20 ip vrr mac-address 00:00:5e:00:01:00
cumulus@leaf01:~$ nv set interface vlan20 ip vrr state up
cumulus@leaf01:~$ nv config apply
auto lo
iface lo inet loopback

auto mgmt iface mgmt address 127.0.0.1/8 address ::1/128 vrf-table auto

auto eth0 iface eth0 inet dhcp address 192.168.200.11/24 ip-forward off ip6-forward off vrf mgmt

auto bond1 iface bond1 bond-slaves swp1 bond-mode 802.3ad bond-lacp-bypass-allow no clag-id 1

auto bond2 iface bond2 bond-slaves swp2 bond-mode 802.3ad bond-lacp-bypass-allow no clag-id 2

auto peerlink iface peerlink bond-slaves swp49 swp50 bond-mode 802.3ad bond-lacp-bypass-allow no

auto peerlink.4094 iface peerlink.4094 clagd-peer-ip linklocal clagd-backup-ip 10.10.10.2 clagd-sys-mac 44:38:39:BE:EF:AA clagd-args –initDelay 180

auto vlan10 iface vlan10 address 10.1.10.2/24 address-virtual 00:00:5e:00:01:00 10.1.10.1/24 vlan-raw-device br_default vlan-id 10

auto vlan20 iface vlan20 address 10.1.20.2/24 address-virtual 00:00:5e:00:01:00 10.1.20.1/24 vlan-raw-device br_default vlan-id 20

auto br_default iface br_default bridge-ports peerlink bond1 bond2 bridge-vlan-aware yes bridge-vids 10 20 bridge-pvid 1

cumulus@leaf02:~$ nv set interface eth0 ip address 192.168.200.12/24
cumulus@leaf02:~$ nv set interface bond1 bond member swp1
cumulus@leaf02:~$ nv set interface bond2 bond member swp2
cumulus@leaf02:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf02:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf02:~$ nv set interface bond1-2 bridge domain br_default
cumulus@leaf02:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf02:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf02:~$ nv set mlag backup 10.10.10.1
cumulus@leaf02:~$ nv set mlag peer-ip linklocal
cumulus@leaf02:~$ nv set bridge domain br_default vlan 10,20
cumulus@leaf02:~$ nv set bridge domain br_default untagged 1
cumulus@leaf02:~$ nv set interface vlan10 ip address 10.1.10.3/24
cumulus@leaf02:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@leaf02:~$ nv set interface vlan10 ip vrr mac-address 00:00:5e:00:01:00
cumulus@leaf02:~$ nv set interface vlan10 ip vrr state up
cumulus@leaf02:~$ nv set interface vlan20 ip address 10.1.20.3/24
cumulus@leaf02:~$ nv set interface vlan20 ip vrr address 10.1.20.1/24
cumulus@leaf02:~$ nv set interface vlan20 ip vrr mac-address 00:00:5e:00:01:00
cumulus@leaf02:~$ nv set interface vlan20 ip vrr state up
cumulus@leaf02:~$ nv config apply
auto lo
iface lo inet loopback

auto mgmt iface mgmt address 127.0.0.1/8 address ::1/128 vrf-table auto

auto eth0 iface eth0 address 192.168.200.12/24 ip-forward off ip6-forward off vrf mgmt

auto bond1 iface bond1 bond-slaves swp1 bond-mode 802.3ad bond-lacp-bypass-allow no clag-id 1

auto bond2 iface bond2 bond-slaves swp2 bond-mode 802.3ad bond-lacp-bypass-allow no clag-id 2

auto peerlink iface peerlink bond-slaves swp49 swp50 bond-mode 802.3ad bond-lacp-bypass-allow no

auto peerlink.4094 iface peerlink.4094 clagd-peer-ip linklocal clagd-backup-ip 10.10.10.1 clagd-sys-mac 44:38:39:BE:EF:AA clagd-args –initDelay 180

auto vlan10 iface vlan10 address 10.1.10.3/24 address-virtual 00:00:5e:00:01:00 10.1.10.1/24 vlan-raw-device br_default vlan-id 10

auto vlan20 iface vlan20 address 10.1.20.3/24 address-virtual 00:00:5e:00:01:00 10.1.20.1/24 vlan-raw-device br_default vlan-id 20

auto br_default iface br_default bridge-ports peerlink bond1 bond2 bridge-vlan-aware yes bridge-vids 10 20 bridge-pvid 1

Create a configuration similar to the following on an Ubuntu host:

auto eth0
iface eth0 inet dhcp

auto eth1
iface eth1 inet manual
    bond-master uplink

auto eth2
iface eth2 inet manual
    bond-master uplink

auto uplink
iface uplink inet static
    bond-slaves eth1 eth2
    bond-mode 802.3ad
    bond-miimon 100
    bond-lacp-rate 1
    bond-min-links 1
    bond-xmit-hash-policy layer3+4
    address 172.16.1.101
    netmask 255.255.255.0
    post-up ip route add 172.16.0.0/16 via 172.16.1.1
    post-up ip route add 10.0.0.0/8 via 172.16.1.1

auto uplink:200
iface uplink:200 inet static
    address 10.0.2.101

auto uplink:300
iface uplink:300 inet static
    address 10.0.3.101

auto uplink:400
iface uplink:400 inet static
    address 10.0.4.101

# modprobe bonding

Create a configuration similar to the following on an Ubuntu host:

auto eth0
iface eth0 inet dhcp

auto eth1
iface eth1 inet manual
    bond-master uplink

auto eth2
iface eth2 inet manual
    bond-master uplink

auto uplink
iface uplink inet static
    bond-slaves eth1 eth2
    bond-mode 802.3ad
    bond-miimon 100
    bond-lacp-rate 1
    bond-min-links 1
    bond-xmit-hash-policy layer3+4
    address 172.16.1.101
    netmask 255.255.255.0
    post-up ip route add 172.16.0.0/16 via 172.16.1.1
    post-up ip route add 10.0.0.0/8 via 172.16.1.1

auto uplink:200
iface uplink:200 inet static
    address 10.0.2.101

auto uplink:300
iface uplink:300 inet static
    address 10.0.3.101

auto uplink:400
iface uplink:400 inet static
    address 10.0.4.101

# modprobe bonding

VRRP

VRRP allows two or more network devices in an active standby configuration to share a single virtual default gateway. The VRRP router that forwards packets at any given time is the master. If this VRRP router fails, another VRRP standby router automatically takes over as master. The master sends VRRP advertisements to other VRRP routers in the same virtual router group, which include the priority and state of the master. VRRP router priority determines the role that each virtual router plays and who becomes the new master if the master fails.

All virtual routers use 00:00:5E:00:01:XX for IPv4 gateways or 00:00:5E:00:02:XX for IPv6 gateways as their MAC address. The last byte of the address is the Virtual Router IDentifier (VRID), which is different for each virtual router in the network. Only one physical router uses this MAC address at a time. The router replies with this address when it receives ARP requests or neighbor solicitation packets for the IP addresses of the virtual router.

  • Cumulus Linux supports both VRRPv2 and VRRPv3. The default protocol version is VRRPv3.
  • You can configure a maximum of 255 virtual routers on a switch.
  • You cannot use VRRP with MLAG.
  • To configure VRRP on an SVI or traditional mode bridge, you need to edit the etc/network/interfaces and /etc/frr/frr.conf files. NCLU commands do not support SVIs or traditional mode bridges.
  • You can use VRRP with EVPN, and on layer 3 interfaces and subinterfaces that are part of a VRF.

RFC 5798 describes VRRP in detail.

The following example illustrates a basic VRRP configuration.

Configure VRRP

To configure VRRP, specify the following information on each switch:

You can also set these optional parameters:

Optional ParameterDefault ValueDescription
priority100The priority level of the virtual router within the virtual router group, which determines the role that each virtual router plays and what happens if the master fails. Virtual routers have a priority between 1 and 254; the router with the highest priority becomes the master.
advertisement interval1000 millisecondsThe advertisement interval is the interval between successive advertisements by the master in a virtual router group. You can specify a value between 10 and 40950.
preemptenabledPreempt mode lets the router take over as master for a virtual router group if it has a higher priority than the current master. Preempt mode is on by default. To disable preempt mode, edit the /etc/frr/frr.conf file to add the line no vrrp <VRID> preempt to the interface stanza, then restart the FRR service.

The following example commands configure two switches (spine01 and spine02) that form one virtual router group (VRID 44) with IPv4 address 10.0.0.1/24 and IPv6 address 2001:0db8::1/64. spine01 is the master; it has a priority of 254. spine02 is the backup VRRP router.

The parent interface must use a primary address as the source address on VRRP advertisement packets.

cumulus@spine01:~$ net add interface swp1 ip address 10.0.0.2/24
cumulus@spine01:~$ net add interface swp1 ipv6 address 2001:0db8::2/64
cumulus@spine01:~$ net add interface swp1 vrrp 44 10.0.0.1/24
cumulus@spine01:~$ net add interface swp1 vrrp 44 2001:0db8::1/64
cumulus@spine01:~$ net add interface swp1 vrrp 44 priority 254
cumulus@spine01:~$ net add interface swp1 vrrp 44 advertisement-interval 5000
cumulus@spine01:~$ net pending
cumulus@spine01:~$ net commit
cumulus@spine02:~$ net add interface swp1 ip address 10.0.0.3/24
cumulus@spine02:~$ net add interface swp1 ipv6 address 2001:0db8::3/64
cumulus@spine02:~$ net add interface swp1 vrrp 44 10.0.0.1/24
cumulus@spine02:~$ net add interface swp1 vrrp 44 2001:0db8::1/64
cumulus@spine02:~$ net pending
cumulus@spine02:~$ net commit
NVUE commands are not supported.
NVUE commands are not supported.
  1. Edit the /etc/network/interface file to assign an IP address to the parent interface; for example:

    cumulus@spine01:~$ sudo vi /etc/network/interfaces
    ...
    auto swp1
    iface swp1
        address 10.0.0.2/24
        address 2001:0db8::2/64
    
  2. Enable the vrrpd daemon, then start the FRRouting service. See FRRouting.

  3. From the vtysh shell, configure VRRP.

    cumulus@spine01:~$ sudo vtysh
    

    spine01# configure terminal spine01(config)# interface swp1 spine01(config-if)# vrrp 44 ip 10.0.0.1 spine01(config-if)# vrrp 44 ipv6 2001:0db8::1 spine01(config-if)# vrrp 44 priority 254 spine01(config-if)# vrrp 44 advertisement-interval 5000 spine01(config-if)# end spine01# write memory spine01# exit

  1. Edit the /etc/network/interface file to assign an IP address to the parent interface; for example:

    cumulus@spine02:~$ sudo vi /etc/network/interfaces
    ...
    auto swp1
    iface swp1
        address 10.0.0.3/24
        address 2001:0db8::3/64
    
  2. Enable the vrrpd daemon, then start the FRRouting service. See FRRouting.

  3. From the vtysh shell, configure VRRP.

    cumulus@spine02:~$ sudo vtysh
    

    spine02# configure terminal spine02(config)# interface swp1 spine02(config-if)# vrrp 44 ip 10.0.0.1 spine02(config-if)# vrrp 44 ipv6 2001:0db8::1 spine02(config-if)# end spine02# write memory spine02# exit

The NCLU and vtysh commands save the configuration in the /etc/network/interfaces file and the /etc/frr/frr.conf file. For example:

cumulus@spine01:~$ sudo cat /etc/network/interfaces
...
auto swp1
iface swp1
    address 10.0.0.2/24
    address 2001:0db8::2/64
    vrrp 44 10.0.0.1/24 2001:0db8::1/64
...
cumulus@spine01:~$ sudo cat /etc/frr/frr.conf
...
interface swp1
vrrp 44
vrrp 44 advertisement-interval 5000
vrrp 44 priority 254
vrrp 44 ip 10.0.0.1
vrrp 44 ipv6 2001:0db8::1
...

Show VRRP Configuration

To show virtual router information on a switch, run the NCLU net show vrrp <VRID> command or the vtysh show vrrp <VRID> command. For example:

cumulus@spine01:~$ net show vrrp 44
Virtual Router ID                    44
Protocol Version                     3
Autoconfigured                       No
Shutdown                             No
Interface                            swp1
 VRRP interface (v4)                 vrrp4-3-1
VRRP interface (v6)                  vrrp6-3-1
Primary IP (v4)                      10.0.0.2
Primary IP (v6)                      2001:0db8::2
Virtual MAC (v4)                     00:00:5e:00:01:01
Virtual MAC (v6)                     00:00:5e:00:02:01
Status (v4)                          Master
Status (v6)                          Master
Priority                             254
Effective Priority (v4)              254
Effective Priority (v6)              254
Preempt Mode                         Yes
Accept Mode                          Yes
Advertisement Interval               5000 ms
Master Advertisement Interval (v4)   0 ms
Master Advertisement Interval (v6)   5000 ms
Advertisements Tx (v4)               17
Advertisements Tx (v6)               17
Advertisements Rx (v4)               0
Advertisements Rx (v6)               0
Gratuitous ARP Tx (v4)               1
Neigh. Adverts Tx (v6)               1
State transitions (v4)               2
State transitions (v6)               2
Skew Time (v4)                       0 ms
Skew Time (v6)                       0 ms
Master Down Interval (v4)            0 ms
Master Down Interval (v6)            0 ms
IPv4 Addresses                       1
. . . . . . . . . . . . . . . . . .  10.0.0.1
IPv6 Addresses                       1
. . . . . . . . . . . . . . . . . .  2001:0db8::1

IGMP and MLD Snooping

Internet Group Management Protocol (IGMP) snooping and Multicast Listener Discovery (MLD) snooping prevent hosts on a local network from receiving traffic for a multicast group they have not explicitly joined. IGMP snooping is for IPv4 environments and MLD snooping is for IPv6 environments.

The bridge driver in Cumulus Linux kernel includes IGMP and MLD snooping. If you disable IGMP or MLD snooping, multicast traffic floods to all the bridge ports in the bridge. Similarly, in the absence of receivers in a VLAN, multicast traffic floods to all ports in the VLAN.

Configure the IGMP and MLD Querier

Without a multicast router, a single switch in an IP subnet can coordinate multicast traffic flows. This switch is the querier or the designated router. The querier generates query messages to check group membership, and processes membership reports and leave messages.

To configure the querier on the switch for a VLAN-aware bridge, enable the multicast querier on the bridge and add the source IP address of the queries to the VLAN.

The following configuration example enables the multicast querier and sets source IP address of the queries to 10.10.10.1 (the loopback address of the switch).

NCLU commands are not supported.
cumulus@switch:~$ nv set bridge domain br_default multicast snooping querier enable on
cumulus@switch:~$ nv set bridge domain br_default vlan vlan10 multicast snooping querier source-ip 10.10.10.1
cumulus@switch:~$ nv config apply

NVUE commands for a bridge in traditional mode are not supported.

Edit the /etc/network/interfaces file to add bridge-mcquerier 1 to the bridge stanza (this enables the multicast querier on the bridge) and add bridge-igmp-querier-src <ip-address> to the VLAN stanza (the is the source IP address of the queries).

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto vlan10
iface vlan10
  address 10.1.10.2/24
  vlan-id 10
  vlan-raw-device bridge
  bridge-igmp-querier-src 10.10.10.1

auto br_default
iface br_default
  bridge-ports swp1 swp2 swp3
  bridge-vlan-aware yes
  bridge-vids 10 20
  bridge-pvid 1
  bridge-mcquerier 1
...

Run the ifreload -a command to reload the configuration:

cumulus@switch:~$ sudo ifreload -a

To configure the querier on the switch for a bridge in traditional mode, edit the bridge stanza in the /etc/network/interfaces file to add bridge-mcquerier 1 (this enables the multicast querier on the bridge) and bridge-mcqifaddr to 1 (this configures the source IP address of the queries to be the bridge IP address).

...
auto br0
iface br0
  address 10.10.10.10/24
  bridge-ports swp1 swp2 swp3
  bridge-vlan-aware no
  bridge-mcquerier 1
  bridge-mcqifaddr 1
...

Run the ifreload -a command to reload the configuration:

cumulus@switch:~$ sudo ifreload -a

Optimized Multicast Flooding (OMF)

IGMP snooping restricts multicast forwarding only to the ports that receive IGMP report messages. If the ports do not receive IGMP reports, multicast traffic floods to all ports in the bridge domain (also know as unregistered multicast (URMC) traffic). To restrict this flooding to only mrouter ports, you can enable OMF.

To enable OMF:

  1. Configure an IGMP querier. See Configure the IGMP and MLD Querier above.

  2. In the IGMP snooping unregistered L2 multicast flood control section of the /etc/cumulus/switchd.conf file, uncomment and change these settings to TRUE, then restart switchd.

    • bridge.unreg_mcast_init
    • bridge.unreg_v4_mcast_prune
    • bridge.unreg_v6_mcast_prune
    cumulus@switch:~$ sudo nano /etc/cumulus/switchd.conf
    ...
    #IGMP snooping unregistered L2 multicast flood control
    #
    #Initialize prune module:
    bridge.unreg_mcast_init = TRUE
    #
    #Note:
    #Below configuration allowed only when bridge.unreg_mcast_init is set to TRUE
    #
    #Set below to TRUE to enable unregistered L2 multicast prune to mrouter ports.
    #Default is to flood the unregistered L2 multicast
    #
    bridge.unreg_v4_mcast_prune = TRUE
    bridge.unreg_v6_mcast_prune = TRUE
    
cumulus@switch:~$ sudo systemctl restart switchd.service

Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.

When IGMP reports go to a multicast group, OMF has no effect; normal IGMP snooping occurs.

OMF increases memory usage, which can impact scaling on Spectrum 1 switches.

Improve Multicast Convergence

For large multicast environments, the default CoPP policer might be too restrictive. You can adjust the policer to improve multicast convergence.

For both IGMP and MLD, the default forwarding rate is set to 300 packets per second and the default burst rate is set to 100 packets. To tune the IGMP and MLD forwarding and burst rates, edit the /etc/cumulus/acl/policy.d/00control_plane.rules file and change --set-rate and --set-burst in the IGMP and MLD policer lines.

The following command example changes the IGMP forwarding rate to 400 packets per second and the burst rate to 200 packets.

-A $INGRESS_CHAIN -p igmp -j POLICE --set-mode pkt --set-rate 400 --set-burst 200

For MLD, you need to change several lines in the /etc/cumulus/acl/policy.d/00control_plane.rules file.

All the MLD packet types use same policer internally; you must set all the lines with the same rates.

The following command examples change the MLD forwarding rate to 400 packets per second and the burst rate to 200 packets.

# link-local multicast receiver: Listener Query
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p ipv6-icmp -m icmp6 --icmpv6-type 130 -j POLICE --set-mode pkt --set-rate 400 --set-burst 200 --set-class 6

# link-local multicast receiver: Listener Report
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p ipv6-icmp -m icmp6 --icmpv6-type 131 -j POLICE --set-mode pkt --set-rate 400 --set-burst 200 --set-class 6

# link-local multicast receiver: Listener Done
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p ipv6-icmp -m icmp6 --icmpv6-type 132 -j POLICE --set-mode pkt --set-rate 400 --set-burst 200 --set-class 6

# link-local multicast receiver: Listener Report v2
-A $INGRESS_CHAIN --in-interface $INGRESS_INTF -p ipv6-icmp -m icmp6 --icmpv6-type 143 -j POLICE --set-mode pkt --set-rate 400 --set-burst 200 --set-class 6

Apply the rules with the sudo cl-acltool -i command:

cumulus@switch:~$ sudo cl-acltool -i

Disable IGMP and MLD Snooping

If you do not use mirroring functions or other types of multicast traffic, you can disable IGMP and MLD snooping.

cumulus@switch:~$ net add bridge bridge mcsnoop no
cumulus@switch:~$ net pending
cumulus@switch:~$ net commit
cumulus@switch:~$ nv set bridge domain br_default multicast snooping enable off
cumulus@switch:~$ nv config apply

Edit the /etc/network/interfaces file and set bridge-mcsnoop to 0 in the bridge stanza:

cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto bridge
iface bridge
  bridge-mcquerier 1
  bridge-mcsnoop 0
  bridge-ports swp1 swp2 swp3
  bridge-pvid 1
  bridge-vids 100 200
  bridge-vlan-aware yes
...

Run the ifreload -a command to reload the configuration:

cumulus@switch:~$ sudo ifreload -a

Troubleshooting

To show the IGMP/MLD snooping bridge state, run the brctl showstp <bridge> command:

cumulus@switch:~$ sudo brctl showstp bridge
  bridge
  bridge id              8000.7072cf8c272c
  designated root        8000.7072cf8c272c
  root port                 0                    path cost                  0
  max age                  20.00                 bridge max age            20.00
  hello time                2.00                 bridge hello time          2.00
  forward delay            15.00                 bridge forward delay      15.00
  ageing time             300.00
  hello timer               0.00                 tcn timer                  0.00
  topology change timer     0.00                 gc timer                 263.70
  hash elasticity        4096                    hash max                4096
  mc last member count      2                    mc init query count        2
  mc router                 1                    mc snooping                1
  mc last member timer      1.00                 mc membership timer      260.00
  mc querier timer        255.00                 mc query interval        125.00
  mc response interval     10.00                 mc init query interval    31.25
  mc querier                0                    mc query ifaddr            0
  flags

swp1 (1)
  port id                8001                    state                forwarding
  designated root        8000.7072cf8c272c       path cost                  2
  designated bridge      8000.7072cf8c272c       message age timer          0.00
  designated port        8001                    forward delay timer        0.00
  designated cost           0                    hold timer                 0.00
  mc router                 1                    mc fast leave              0
  flags

swp2 (2)
  port id                8002                    state                forwarding
  designated root        8000.7072cf8c272c       path cost                  2
  designated bridge      8000.7072cf8c272c       message age timer          0.00
  designated port        8002                    forward delay timer        0.00
  designated cost           0                    hold timer                 0.00
  mc router                 1                    mc fast leave              0
  flags

swp3 (3)
  port id                8003                    state                forwarding
  designated root        8000.7072cf8c272c       path cost                  2
  designated bridge      8000.7072cf8c272c       message age timer          0.00
  designated port        8003                    forward delay timer        8.98
  designated cost           0                    hold timer                 0.00
  mc router                 1                    mc fast leave              0
  flags

To show the groups and bridge port state, run the NCLU net show bridge mdb command or the Linux sudo bridge mdb show command. To show detailed router ports and group information, run the sudo bridge -d -s mdb show command:

cumulus@switch:~$ sudo bridge -d -s mdb show
  dev bridge port swp2 grp 234.10.10.10 temp 241.67
  dev bridge port swp1 grp 238.39.20.86 permanent 0.00
  dev bridge port swp1 grp 234.1.1.1 temp 235.43
  dev bridge port swp2 grp ff1a::9 permanent 0.00
  router ports on bridge: swp3

Network Virtualization

Cumulus Linux supports several forms of network virtualization.

VXLAN (Virtual Extensible LAN) is a standard overlay protocol that abstracts logical virtual networks from the physical network underneath. You can deploy simple and scalable layer 3 Clos architectures while extending layer 2 segments over that layer 3 network.

VXLAN uses a VLAN-like encapsulation technique to encapsulate MAC-based layer 2 Ethernet frames within layer 3 UDP packets. Each virtual network is a VXLAN logical layer 2 segment. VXLAN scales to 16 million segments - a 24-bit VXLAN network identifier (VNI ID) in the VXLAN header - for multi-tenancy.

Hosts on a given virtual network join together through an overlay protocol that initiates and terminates tunnels at the edge of the multi-tenant network, typically the hypervisor vSwitch or top of rack. These edge points are the VXLAN tunnel end points (VTEP).

Cumulus Linux can start and stop VTEPs in hardware and supports wire-rate VXLAN. VXLAN provides an efficient hashing scheme across the IP fabric during the encapsulation process; the source UDP port is unique, with the hash based on layer 2 through layer 4 information from the original frame. The UDP destination port is the standard port 4789.

Cumulus Linux includes the native Linux VXLAN kernel support and integrates with controller-based overlay solutions, such as VMware NSX.

Cumulus Linux does not support VXLAN encapsulation over layer 3 subinterfaces (for example, swp3.111) or SVIs as traffic transiting through the switch drops, even if you use the subinterface only for underlay traffic and it does not perform VXLAN encapsulation. Only configure VXLAN uplinks as layer 3 interfaces without any subinterfaces (for example, swp3). The VXLAN tunnel endpoints cannot share a common subnet; there must be at least one layer 3 hop between the VXLAN source and destination.

Considerations

Cut-through Mode and Store and Forward Switching

On switches with the NVIDIA Spectrum ASICs, Cumulus Linux supports cut-through mode for VXLANs but does not support store and forward switching.

MTU Size for Virtual Network Interfaces

The maximum transmission unit (MTU) size for a virtual network interface should be 50 bytes smaller than the MTU for the physical interfaces on the switch. For more information on setting MTU, read Layer 1 and Switch Port Attributes.

Layer 3 and Layer 2 VNIs Cannot Share the Same ID

A layer 3 VNI and a layer 2 VNI cannot have the same ID. If the VNI IDs are the same, the layer 2 VNI does not get created.

Ethernet Virtual Private Network - EVPN

VXLAN enables layer 2 segments to extend over an IP core (the underlay). The initial definition of VXLAN (RFC 7348) does not include any control plane and relied on a flood-and-learn approach for MAC address learning.

Ethernet Virtual Private Network (EVPN) is a standards-based control plane for VXLAN defined in RFC 7432 and draft-ietf-bess-evpn-overlay that allows for building and deploying VXLANs at scale. It relies on multi-protocol BGP (MP-BGP) to exchange information and uses BGP-MPLS IP VPNs (RFC 4364). It enables not only bridging between end systems in the same layer 2 segment but also routing between different segments (subnets). There is also inherent support for multi-tenancy.

Cumulus Linux installs the routing control plane (including EVPN) as part of the FRRouting (FRR) package. For more information about FRR, refer to FRRouting.

Key Features

Cumulus Linux fully supports EVPN as the control plane for VXLAN, including for both intra-subnet bridging and inter-subnet routing, and provides these key features:

Cumulus Linux supports the EVPN address family with both eBGP and iBGP peering. If you configure underlay routing with eBGP, you can use the same eBGP session to carry EVPN routes. In a typical 2-tier Clos network where the leafs are VTEPs, if you use eBGP sessions between the leafs and spines for underlay routing, the same sessions exchange EVPN routes. The spine switches act as route forwarders and do not install any forwarding state as they are not VTEPs. When the switch exchanges EVPN routes over iBGP peering, you can use OSPF as the IGP or resolve next hops using iBGP.

Cumulus Linux disables data plane MAC learning by default on VXLAN interfaces. Do not enable MAC learning on VXLAN interfaces: EVPN installs remote MACs.

Basic Configuration

The following sections provide the basic configuration needed to use EVPN as the control plane for VXLAN in a BGP-EVPN-based layer 2 extension deployment. For layer 3 multi-tenancy configuration, see Inter-subnet Routing. For additional EVPN configuration, see EVPN Enhancements.

Basic EVPN Configuration Commands

Basic configuration in a BGP-EVPN-based layer 2 extension deployment requires you to:

For a non-VTEP device that is only participating in EVPN route exchange, such as a spine switch where the network deployment uses hop-by-hop eBGP or the switch is acting as an iBGP route reflector, configuring VXLAN interfaces is not required.

  1. Configure VXLAN Interfaces. The following example creates two VXLAN interfaces (vni10 and vni20), adds the VXLAN devices to the bridge, and sets the VXLAN local tunnel IP address to 10.10.10.1. For more information on how to configure VXLAN interfaces, see VXLAN Devices.

    cumulus@leaf01:~$ net add vxlan vni10 vxlan id 10
    cumulus@leaf01:~$ net add vxlan vni20 vxlan id 20
    cumulus@leaf01:~$ net add bridge bridge ports vni10,vni20
    cumulus@leaf01:~$ net add bridge bridge vids 10,20
    cumulus@leaf01:~$ net add vxlan vni10 bridge access 10
    cumulus@leaf01:~$ net add vxlan vni20 bridge access 20
    cumulus@leaf01:~$ net add loopback lo vxlan local-tunnelip 10.10.10.1
    
  2. Configure BGP. The following example commands assign an ASN and router ID to leaf01 and spine01, specify the interfaces between the two BGP peers, and the prefixes to originate. For complete information on how to configure BGP, see Border Gateway Protocol - BGP.

cumulus@leaf01:~$ net add bgp autonomous-system 65101
cumulus@leaf01:~$ net add bgp router-id 10.10.10.1
cumulus@leaf01:~$ net add bgp neighbor swp51 interface remote-as external
cumulus@leaf01:~$ net add bgp ipv4 unicast network 10.10.10.1/32
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
cumulus@spine01:~$ net add bgp autonomous-system 65199
cumulus@spine01:~$ net add bgp router-id 10.10.10.101
cumulus@spine01:~$ net add bgp neighbor swp1 remote-as external
cumulus@spine01:~$ net add bgp ipv4 unicast network 10.10.10.101/32
cumulus@spine01:~$ net pending
cumulus@spine01:~$ net commit
  1. Activate the EVPN address family and enable EVPN between BGP neighbors. The following example commands enable EVPN between leaf01 and spine01:
cumulus@leaf01:~$ net add bgp l2vpn evpn neighbor swp51 activate
cumulus@spine01:~$ net add bgp l2vpn evpn neighbor swp1 activate
  1. FRR is not aware of any local VNIs, MAC addresses, or neighbors associated with those VNIs until you enable the BGP control plane for all VNIs configured on the switch by setting the advertise-all-vni option.
cumulus@leaf01:~$ net add bgp l2vpn evpn advertise-all-vni
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit

You only need this configuration on leafs that are VTEPs. The switch accepts EVPN routes from a BGP peer even without this explicit EVPN configuration and maintains the routes in the global EVPN routing table. However, Cumulus Linux only imports the routes into the per-VNI routing table and installs the appropriate entries in the kernel when the VNI corresponding to the received route is locally known.

  1. Configure VXLAN Interfaces. The following example creates a single VXLAN interface (vxlan0), maps VLAN 10 to vni10 and VLAN 20 to vni20, adds the VXLAN device to the default bridge br_default, and sets the VXLAN local tunnel IP address to 10.10.10.10.

    cumulus@leaf01:~$ nv set bridge domain br_default vlan 10 vni 10
    cumulus@leaf01:~$ nv set bridge domain br_default vlan 20 vni 20
    cumulus@leaf01:~$ nv set nve vxlan source address 10.10.10.10
    cumulus@leaf01:~$ nv config apply
    

    To create a traditional VXLAN device, where each VNI represents a separate device instead of a set of VNIs in a single device model, see VXLAN-Devices.

  2. Configure BGP. The following example commands assign an ASN and router ID to leaf01 and spine01, specify the interfaces between the two BGP peers, and the prefixes to originate. For complete information on how to configure BGP, see Border Gateway Protocol - BGP.

cumulus@leaf01:~$ nv set router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf default router bgp peer swp51 remote-as external
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv4-unicast static-network 10.10.10.1/32
cumulus@leaf01:~$ nv config apply
cumulus@spine01:~$ nv set router bgp autonomous-system 65199
cumulus@spine01:~$ nv set router bgp router-id 10.10.10.101
cumulus@spine01:~$ nv set vrf default router bgp peer swp1 remote-as external
cumulus@spine01:~$ nv set vrf default router bgp address-family ipv4-unicast static-network 10.10.10.101/32
cumulus@spine01:~$ nv config apply
  1. Activate the EVPN address family and enable EVPN between BGP neighbors. The following example commands enable EVPN between leaf01 and spine01:
cumulus@leaf01:~$ nv set evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp peer swp51 address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv config apply

Unlike with NCLU, you do not need enable the BGP control plane for all VNIs configured on the switch with NVUE with the advertise-all-vni option. FRR is aware of any local VNIs and MACs, and hosts (neighbors) associated with those VNIs.

The NVUE Commands create the following configuration snippet in the /etc/nvue.d/startup.yaml file:

cumulus@leaf01:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
    interface:
      lo:
        ip:
          address:
            10.10.10.1/32: {}
        type: loopback
    bridge:
      domain:
        br_default:
          vlan:
            '10':
              vni:
                '10': {}
            '20':
              vni:
                '20': {}
    nve:
      vxlan:
        enable: on
        source:
          address: 10.10.10.10
    router:
      bgp:
        autonomous-system: 65101
        enable: on
        router-id: 10.10.10.1
    vrf:
      default:
        router:
          bgp:
            peer:
              swp51:
                remote-as: external
                type: unnumbered
            enable: on
            address-family:
              ipv4-unicast:
                static-network:
                  10.10.10.1/32: {}
                enable: on
cumulus@spine01:~$ nv set evpn enable on
cumulus@spine01:~$ nv set vrf default router bgp address-family l2vpn-evpn enable on
cumulus@spine01:~$ nv set vrf default router bgp peer swp1 address-family l2vpn-evpn enable on
cumulus@spine01:~$ nv config apply

The NVUE Commands create the following configuration snippet in the /etc/nvue.d/startup.yaml file:

cumulus@spine01:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
    interface:
      lo:
        ip:
          address:
            10.10.10.101/32: {}
        type: loopback
    bridge:
      domain:
        br_default:
          vlan:
            '10':
              vni:
                '10': {}
            '20':
              vni:
                '20': {}
    nve:
      vxlan:
        enable: on
        source:
          address: 10.10.10.101
    router:
      bgp:
        autonomous-system: 65199
        enable: on
        router-id: 10.10.10.101
    vrf:
      default:
        router:
          bgp:
            peer:
              swp1:
                remote-as: external
                type: unnumbered
            enable: on
            address-family:
              ipv4-unicast:
                static-network:
                  10.10.10.101/32: {}
                enable: on
  1. Configure VXLAN Interfaces. Edit the /etc/network/interfaces file to create the VXLAN interfaces, attach them to a bridge, map the VLANs to the VNIs, and set the VXLAN local tunnel IP address. The example below creates two VXLAN interfaces (vni10 and vni20), maps VLAN 10 to vni10 and VLAN 20 to vni20, and sets the VXLAN local tunnel IP address to 10.10.10.10.

    cumulus@leaf01:~$ sudo nano /etc/network/interfaces
    ...
    auto lo
    iface lo inet loopback
            address 10.10.10.1/32
            vxlan-local-tunnelip 10.10.10.10
    
    auto vni10
    iface vni10
            bridge-access 10
            bridge-learning off
            vxlan-id 10
    
    auto vni20
    iface vni20
            bridge-access 20
            bridge-learning off
            vxlan-id 20
    
    auto vlan10
    iface vlan10
        vlan-raw-device br_default
        vlan-id 10
    
    auto vlan20
    iface vlan20
        vlan-raw-device br_default
        vlan-id 20
    
    auto br_default
    iface br_default
            bridge-ports vni10 vni20
            bridge-vlan-aware yes
            bridge-vids 10 20
            bridge-pvid 1
    
  2. Configure BGP with vtysh commands. The following example commands assign an ASN and router ID to leaf01 and spine01, specify the interfaces between the two BGP peers, and the prefixes to originate. For complete information on how to configure BGP, see Border Gateway Protocol - BGP.

cumulus@leaf01:~$ sudo vtysh
leaf01# configure terminal
leaf01(config)# router bgp 65101
leaf01(config-router)# bgp router-id 10.10.10.1
leaf01(config-router)# neighbor swp51 remote-as external
leaf01(config-router)# address-family ipv4
leaf01(config-router-af)# network 10.10.10.1/32
leaf01(config-router-af)# end
leaf01# write memory
leaf01# exit
cumulus@leaf01:~$
cumulus@spine01:~$ sudo vtysh
spine01# configure terminal
spine01(config)# router bgp 65199
spine01(config-router)# bgp router-id 10.10.10.101
spine01(config-router)# neighbor swp1 remote-as external
spine01(config-router)# address-family ipv4
spine01(config-router-af)# network 10.10.10.101/32
spine01(config-router-af)# end
spine01# write memory
spine01# exit
cumulus@spine01:~$
  1. Activate the EVPN address family and enable EVPN between BGP neighbors. The following example commands enable EVPN between leaf01 and spine01. The commands automatically provision all locally configured VNIs so the BGP control plane can advertise them.
cumulus@leaf01:~$ sudo vtysh

leaf01# configure terminal leaf01(config)# router bgp 65101 leaf01(config-router)# bgp router-id 10.10.10.1 leaf01(config-router)# neighbor swp51 interface remote-as external leaf01(config-router)# address-family l2vpn evpn leaf01(config-router-af)# neighbor swp51 activate leaf01(config-router-af)# advertise-all-vni leaf01(config-router-af)# end leaf01)# write memory leaf01)# exit cumulus@leaf01:~$

The vtysh commands create the following configuration snippet in the /etc/frr/frr.conf file.

...
router bgp 65101
  bgp router-id 10.10.10.1
  neighbor swp51 interface remote-as external
  address-family l2vpn evpn
neighbor swp1 activate
  advertise-all-vni
...
cumulus@spine01:~$ sudo vtysh

spine01# configure terminal spine01(config)# router bgp 65199 spine01(config-router)# bgp router-id 10.10.10.101 spine01(config-router)# neighbor swp1 interface remote-as external spine01(config-router)# address-family l2vpn evpn spine01(config-router-af)# neighbor swp51 activate spine01(config-router-af)# end spine01)# write memory spine01)# exit cumulus@spine01:~$

The vtysh commands create the following configuration snippet in the /etc/frr/frr.conf file:

...
router bgp 65199
  bgp router-id 10.10.10.101
  neighbor swp1 interface remote-as external
  address-family l2vpn evpn
neighbor swp1 activate
...

You only need to set the advertise-all-vni option on leafs that are VTEPs. The switch accepts EVPN routes from a BGP peer even without this option. The routes are in the global EVPN routing table but Cumulus Linux only imports them into the per-VNI routing table and installs the appropriate entries in the kernel when the VNI corresponding to the received route is locally known.

EVPN and VXLAN Active-active Mode

For EVPN in VXLAN active-active mode, both switches in the MLAG pair establish EVPN peering with other EVPN speakers (for example, with spine switches if using hop-by-hop eBGP) and inform about their locally known VNIs and MACs. When MLAG is active, both switches announce this information with the shared anycast IP address.

For active-active configuration, make sure that:

MLAG synchronizes information between the two switches in the MLAG pair; EVPN does not synchronize.

For type-5 routes in an EVPN symmetric configuration with VXLAN active-active mode, Cumulus Linux uses Primary IP Address Advertisement. For information on configuring Primary IP Address Advertisement, see Advertise Primary IP Address.

For information about active-active VTEPs and anycast IP behavior, and for failure scenarios, see VXLAN Active-active Mode.

Considerations

EVPN Enhancements

This section describes EVPN enhancements.

Define RDs and RTs

When FRR learns about a local VNI and there is no explicit configuration for that VNI in FRR, the switch derives the route distinguisher (RD), and import and export route targets (RTs) for this VNI automatically. The RD uses RouterId:VNI-Index and the import and export RTs use AS:VNI. For routes that come from a layer 2 VNI (type-2 and type-3), the RD uses the vxlan-local-tunnelip from the layer 2 VNI interface instead of the RouterId (vxlan-local-tunnelip:VNI). EVPN route exchange uses the RD and RTs.

The RD disambiguates EVPN routes in different VNIs (they can have the same MAC and/or IP address) while the RTs describe the VPN membership for the route. The VNI-Index for the RD is a unique number that the switch generates. It only has local significance; on remote switches, its only role is for route disambiguation. The switch uses this number instead of the VNI value itself because this number has to be less than or equal to 65535. In the RT, the AS is always a 2-byte value to allow room for a large VNI. If the router has a 4-byte AS, it only uses the lower 2 bytes. This ensures a unique RT for different VNIs while having the same RT for the same VNI across routers in the same AS.

For eBGP EVPN peering, the peers are in a different AS so using an automatic RT of AS:VNI does not work for route import. Therefore, Cumulus Linux treats the import RT as *:VNI to determine which received routes apply to a particular VNI. This only applies when the switch auto-derives the import RT.

If you do not want to derive RDs and RTs automatically, you can define them manually. The following example commands are per VNI.

cumulus@leaf01:~$ net add bgp l2vpn evpn vni 10 rd 10.10.10.1:20
cumulus@leaf01:~$ net add bgp l2vpn evpn vni 10 route-target export 65101:10
cumulus@leaf01:~$ net add bgp l2vpn evpn vni 10 route-target import 65102:10
cumulus@leaf01:~$ net add bgp l2vpn evpn advertise-all-vni
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit

The NCLU commands create the following configuration snippet in the /etc/frr/frr.conf file.

...
address-family l2vpn evpn
  advertise-all-vni
  vni 10
   rd 10.10.10.1:20
   route-target export 65101:10
   route-target import 65102:10
...
cumulus@leaf03:~$ net add bgp l2vpn evpn vni 10 rd 10.10.10.3:20
cumulus@leaf03:~$ net add bgp l2vpn evpn vni 10 route-target export 65102:10
cumulus@leaf03:~$ net add bgp l2vpn evpn vni 10 route-target import 65101:10
cumulus@leaf03:~$ net add bgp l2vpn evpn advertise-all-vni
cumulus@leaf03:~$ net pending
cumulus@leaf03:~$ net commit

The NCLU commands create the following configuration snippet in the /etc/frr/frr.conf file.

...
address-family l2vpn evpn
  advertise-all-vni
  vni 10
   rd 10.10.10.3:20
   route-target export 65102:10
   route-target import 65101:10
...
cumulus@leaf01:~$ nv set evpn evi 10 rd 10.10.10.1:20
cumulus@leaf01:~$ nv set evpn evi 10 route-target export 65101:10
cumulus@leaf01:~$ nv set evpn evi 10 route-target import 65102:10
cumulus@leaf01:~$ nv config apply
cumulus@leaf03:~$ nv set evpn evi 10 rd 10.10.10.3:20
cumulus@leaf03:~$ nv set evpn evi 10 route-target export 65102:10
cumulus@leaf03:~$ nv set evpn evi 10 route-target import 65101:10
cumulus@leaf03:~$ nv config apply
cumulus@leaf01:~$ sudo vtysh

leaf01# configure terminal leaf01(config)# router bgp 65101 leaf01(config-router)# address-family l2vpn evpn leaf01(config-router-af)# vni 10 leaf01(config-router-af-vni)# rd 10.10.10.1:20 leaf01(config-router-af-vni)# route-target export 65101:10 leaf01(config-router-af-vni)# route-target import 65102:10 leaf01(config-router-af-vni)# exit leaf01(config-router-af)# advertise-all-vni leaf01(config-router-af)# end leaf01)# write memory leaf01)# exit cumulus@leaf01:~$

The vtysh commands create the following configuration snippet in the /etc/frr/frr.conf file.

...
address-family l2vpn evpn
  advertise-all-vni
  vni 10
   rd 10.10.10.1:20
   route-target export 65101:10
   route-target import 65102:10
...
cumulus@leaf03:~$ sudo vtysh

leaf03# configure terminal leaf03(config)# router bgp 65102 leaf03(config-router)# address-family l2vpn evpn leaf03(config-router-af)# vni 10 leaf03(config-router-af-vni)# rd 10.10.10.3:20 leaf03(config-router-af-vni)# route-target export 65102:10 leaf03(config-router-af-vni)# route-target import 65101:10 leaf03(config-router-af-vni)# exit leaf03(config-router-af)# advertise-all-vni leaf03(config-router-af)# end leaf03)# write memory leaf03)# exit cumulus@leaf03:~$

The vtysh commands create the following configuration snippet in the /etc/frr/frr.conf file:

...
address-family l2vpn evpn
  advertise-all-vni
  vni 10
   rd 10.10.10.3:20
   route-target export 65102:10
   route-target import 65101:10

If you delete the RD or RT later, it reverts back to its corresponding default value.

You can configure multiple RT values. In addition, you can configure both the import and export route targets with a single command by using route-target both:

cumulus@leaf01:~$ net add bgp l2vpn evpn vni 10 route-target import 65102:10
cumulus@leaf01:~$ net add bgp l2vpn evpn vni 10 route-target import 65102:20
cumulus@leaf01:~$ net add bgp l2vpn evpn vni 20 route-target both 65101:10
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit

The NCLU commands create the following configuration snippet in the /etc/frr/frr.conf file:

...
address-family l2vpn evpn
  vni 10
    route-target import 65102:10
    route-target import 65102:20
  vni 20
    route-target import 65101:10
    route-target export 65101:10
...
cumulus@leaf03:~$ net add bgp l2vpn evpn vni 10 route-target import 65101:10
cumulus@leaf03:~$ net add bgp l2vpn evpn vni 10 route-target import 65101:20
cumulus@leaf03:~$ net add bgp l2vpn evpn vni 20 route-target both 65102:10
cumulus@leaf03:~$ net pending
cumulus@leaf03:~$ net commit

The NCLU commands create the following configuration snippet in the /etc/frr/frr.conf file:

...
address-family l2vpn evpn
  vni 10
    route-target import 65101:10
    route-target import 65101:20
  vni 20
    route-target import 65102:10
    route-target export 65102:10
...
cumulus@leaf01:~$ nv set evpn evi 10 route-target import 65102:10
cumulus@leaf01:~$ nv set evpn evi 10 route-target import 65102:20
cumulus@leaf01:~$ nv set evpn evi 20 route-target both 65101:10
cumulus@leaf01:~$ nv config apply
cumulus@leaf03:~$ nv set evpn evi 10 route-target import 65101:10
cumulus@leaf03:~$ nv set evpn evi 10 route-target import 65101:20
cumulus@leaf03:~$ nv set evpn evi 20 route-target both 65102:10
cumulus@leaf03:~$ nv config apply
cumulus@leaf01:~$ sudo vtysh

leaf01# configure terminal leaf01(config)# router bgp 65101 leaf01(config-router)# address-family l2vpn evpn leaf01(config-router-af)# vni 10 leaf01(config-router-af-vni)# route-target import 65102:10 leaf01(config-router-af-vni)# route-target import 65102:20 leaf01(config-router-af-vni)# exit leaf01(config-router-af)# vni 20 leaf01(config-router-af-vni)# route-target both 65101:10 leaf01(config-router-af)# end leaf01)# write memory leaf01)# exit cumulus@leaf01:~$

The vtysh commands create the following configuration snippet in the /etc/frr/frr.conf file:

...
address-family l2vpn evpn
  vni 10
    route-target import 65102:10
    route-target import 65102:20
  vni 20
    route-target import 65101:10
    route-target export 65101:10
...
cumulus@leaf03:~$ sudo vtysh

leaf03# configure terminal leaf03(config)# router bgp 65102 leaf03(config-router)# address-family l2vpn evpn leaf03(config-router-af)# vni 10 leaf03(config-router-af-vni)# route-target import 65101:10 leaf03(config-router-af-vni)# route-target import 65101:20 leaf03(config-router-af-vni)# exit leaf03(config-router-af)# vni 20 leaf03(config-router-af-vni)# route-target both 65102:10 leaf03(config-router-af)# end leaf03)# write memory leaf03)# exit cumulus@leaf03:~$

The vtysh commands create the following configuration snippet in the /etc/frr/frr.conf file:

...
address-family l2vpn evpn
  vni 10
    route-target import 65101:10
    route-target import 65101:20
  vni 20
    route-target import 65102:10
    route-target export 65102:10
...

Enable EVPN in an iBGP Environment with an OSPF Underlay

You can use EVPN with an OSPF or static route underlay. This is a more complex configuration than using eBGP. In this case, iBGP advertises EVPN routes directly between VTEPs and the spines are unaware of EVPN or BGP.

The leafs peer with each other in a full mesh within the EVPN address family without using route reflectors. The leafs generally peer to their loopback addresses, which advertise in OSPF. The receiving VTEP imports routes into a specific VNI with a matching route target community.

cumulus@leaf01:~$ net add bgp autonomous-system 65101
cumulus@leaf01:~$ net add bgp l2vpn evpn neighbor 10.10.10.2 remote-as internal
cumulus@leaf01:~$ net add bgp l2vpn evpn neighbor 10.10.10.3 remote-as internal
cumulus@leaf01:~$ net add bgp l2vpn evpn neighbor 10.10.10.4 remote-as internal
cumulus@leaf01:~$ net add bgp l2vpn evpn neighbor 10.10.10.2 activate
cumulus@leaf01:~$ net add bgp l2vpn evpn neighbor 10.10.10.3 activate
cumulus@leaf01:~$ net add bgp l2vpn evpn neighbor 10.10.10.4 activate
cumulus@leaf01:~$ net add bgp l2vpn evpn advertise-all-vni
cumulus@leaf01:~$ net add ospf router-id 10.10.10.1
cumulus@leaf01:~$ net add loopback lo ospf area 0.0.0.0
cumulus@leaf01:~$ net add ospf passive-interface lo
cumulus@leaf01:~$ net add interface swp49 ospf area 0.0.0.0
cumulus@leaf01:~$ net add interface swp50 ospf area 0.0.0.0
cumulus@leaf01:~$ net add interface swp51 ospf area 0.0.0.0
cumulus@leaf01:~$ net add interface swp52 ospf area 0.0.0.0
cumulus@leaf01:~$ net add interface swp49 ospf network point-to-point
cumulus@leaf01:~$ net add interface swp50 ospf network point-to-point
cumulus@leaf01:~$ net add interface swp51 ospf network point-to-point
cumulus@leaf01:~$ net add interface swp52 ospf network point-to-point
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit

The NCLU commands create the following configuration snippet in the /etc/frr/frr.conf file.

...
interface lo
  ip ospf area 0.0.0.0
!
interface swp49
  ip ospf area 0.0.0.0
  ip ospf network point-to-point
!
interface swp50
  ip ospf area 0.0.0.0
  ip ospf network point-to-point
!
interface swp51
  ip ospf area 0.0.0.0
  ip ospf network point-to-point
!
interface swp52
  ip ospf area 0.0.0.0
  ip ospf network point-to-point
!
router bgp 65101
  neighbor 10.10.10.2 remote-as internal
  neighbor 10.10.10.3 remote-as internal
  neighbor 10.10.10.4 remote-as internal
  !
  address-family l2vpn evpn
  neighbor 10.10.10.2 activate
  neighbor 10.10.10.3 activate
  neighbor 10.10.10.4 activate
  advertise-all-vni
  exit-address-family
  !
Router ospf
  Ospf router-id 10.10.10.1
  Passive-interface lo
...
cumulus@leaf01:~$ nv set router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf default router bgp peer 10.10.10.2 remote-as internal
cumulus@leaf01:~$ nv set vrf default router bgp peer 10.10.10.3 remote-as internal
cumulus@leaf01:~$ nv set vrf default router bgp peer 10.10.10.4 remote-as internal
cumulus@leaf01:~$ nv set evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp peer 10.10.10.2 address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp peer 10.10.10.3 address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp peer 10.10.10.4 address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv set vrf default router ospf router-id 10.10.10.1
cumulus@leaf01:~$ nv set interface lo router ospf area 0
cumulus@leaf01:~$ nv set interface lo router ospf passive on
cumulus@leaf01:~$ nv set interface swp49 router ospf area 0.0.0.0
cumulus@leaf01:~$ nv set interface swp50 router ospf area 0.0.0.0
cumulus@leaf01:~$ nv set interface swp51 router ospf area 0.0.0.0
cumulus@leaf01:~$ nv set interface swp52 router ospf area 0.0.0.0
cumulus@leaf01:~$ nv set interface swp49 router ospf network-type point-to-point
cumulus@leaf01:~$ nv set interface swp50 router ospf network-type point-to-point
cumulus@leaf01:~$ nv set interface swp51 router ospf network-type point-to-point
cumulus@leaf01:~$ nv set interface swp52 router ospf network-type point-to-point
cumulus@leaf01:~$ nv config apply

The NVUE commands create the following configuration snippet in the /etc/nvue.d/startup.yaml file:

cumulus@leaf01:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
      lo:
        ip:
          address:
            10.10.10.1/32: {}
        router:
          ospf:
            area: 0
            enable: on
            network-type: point-to-point    
        type: loopback
      swp49:
        router:
          ospf:
            area: 0.0.0.0
            enable: on
        type: swp
      swp50:
        router:
          ospf:
            area: 0.0.0.0
            enable: on
            network-type: point-to-point
        type: swp
      swp51:
        router:
          ospf:
            area: 0.0.0.0
            enable: on
            network-type: point-to-point
        type: swp
      swp52:
        router:
          ospf:
            area: 0.0.0.0
            enable: on
            network-type: point-to-point
        type: swp
    bridge:
      domain:
        br_default:
          multicast:
            snooping:
              enable: off
              querier:
                enable: on
    router:
      bgp:
        autonomous-system: 65101
        enable: on
        router-id: 10.10.10.1
      ospf:
        router-id: 10.10.10.1
        enable: on
    vrf:
      default:
        router:
          bgp:
            peer:
              10.10.10.2:
                remote-as: internal
                type: numbered
                address-family:
                  l2vpn-evpn:
                    enable: on
              10.10.10.3:
                remote-as: internal
                type: numbered
                address-family:
                  l2vpn-evpn:
                    enable: on
              10.10.10.4:
                remote-as: internal
                type: numbered
                address-family:
                  l2vpn-evpn:
                    enable: on
            enable: on
            address-family:
              l2vpn-evpn:
                enable: on
    evpn:
      enable: on
    nve:
      vxlan:
        enable: on
cumulus@leaf01:~$ sudo vtysh
leaf01# configure terminal
leaf01(config)# router bgp 65101
leaf01(config-router)# neighbor 10.10.10.2 remote-as internal
leaf01(config-router)# neighbor 10.10.10.3 remote-as internal
leaf01(config-router)# neighbor 10.10.10.4 remote-as internal
leaf01(config-router)# address-family l2vpn evpn
leaf01(config-router-af)# neighbor 10.10.10.2 activate
leaf01(config-router-af)# neighbor 10.10.10.3 activate
leaf01(config-router-af)# neighbor 10.10.10.4 activate
leaf01(config-router-af)# advertise-all-vni
leaf01(config-router-af)# exit
leaf01(config-router)# exit
leaf01(config)# router ospf
leaf01(config-router)# router-id 10.10.10.1
leaf01(config-router)# passive-interface lo
leaf01(config-router)# exit
leaf01(config)# interface lo
leaf01(config-if)# ip ospf area 0.0.0.0
leaf01(config-if)# exit
leaf01(config)# interface swp49
leaf01(config-if)# ip ospf area 0.0.0.0
leaf01(config-if)# ospf network point-to-point
leaf01(config-if)# exit
leaf01(config)# interface swp50
leaf01(config-if)# ip ospf area 0.0.0.0
leaf01(config-if)# ospf network point-to-point
leaf01(config-if)# exit
leaf01(config)# interface swp51
leaf01(config-if)# ip ospf area 0.0.0.0
leaf01(config-if)# ospf network point-to-point
leaf01(config-if)# exit
leaf01(config)# interface swp52
leaf01(config-if)# ip ospf area 0.0.0.0
leaf01(config-if)# ospf network point-to-point
leaf01(config-if)# end
leaf01)# write memory
leaf01)# exit
cumulus@leaf01:~$

The vtysh commands create the following configuration snippet in the /etc/frr/frr.conf file.

...
interface lo
  ip ospf area 0.0.0.0
!
interface swp49
  ip ospf area 0.0.0.0
  ip ospf network point-to-point
!
interface swp50
  ip ospf area 0.0.0.0
  ip ospf network point-to-point
!
interface swp51
  ip ospf area 0.0.0.0
  ip ospf network point-to-point
!
interface swp52
  ip ospf area 0.0.0.0
  ip ospf network point-to-point
!
router bgp 65101
  neighbor 10.10.10.2 remote-as internal
  neighbor 10.10.10.3 remote-as internal
  neighbor 10.10.10.4 remote-as internal
  !
  address-family l2vpn evpn
  neighbor 10.10.10.2 activate
  neighbor 10.10.10.3 activate
  neighbor 10.10.10.4 activate
  advertise-all-vni
  exit-address-family
  !
Router ospf
  Ospf router-id 10.10.10.1
  Passive-interface lo
...

ARP and ND Suppression

ARP suppression with EVPN allows a VTEP to suppress ARP flooding over VXLAN tunnels as much as possible. A local proxy handles ARP requests from locally attached hosts for remote hosts. ARP suppression is for IPv4; ND suppression is for IPv6.

Cumulus Linux enables ARP and ND suppression by default on all VNIs to reduce ARP and ND packet flooding over VXLAN tunnels.

ARP and ND suppression only suppresses the flooding of known hosts. To disable all flooding refer to the Disable BUM Flooding section.

In a centralized routing deployment, you must configure layer 3 interfaces even if you configure the switch only for layer 2 (you are not using VXLAN routing). To avoid installing unnecessary layer 3 information, you can turn off IP forwarding.

The following example commands turn off IPv4 and IPv6 forwarding on VLAN 10 and VLAN 20.

cumulus@leaf01:~$ net add bridge bridge ports vni10,vni20
cumulus@leaf01:~$ net add bridge bridge vids 10,20
cumulus@leaf01:~$ net add vxlan vni10 vxlan id 10
cumulus@leaf01:~$ net add vxlan vni20 vxlan id 20
cumulus@leaf01:~$ net add vxlan vni10 bridge access 10
cumulus@leaf01:~$ net add vxlan vni20 bridge access 20
cumulus@leaf01:~$ net add vxlan vni10 vxlan local-tunnelip 10.10.10.1
cumulus@leaf01:~$ net add vxlan vni20 vxlan local-tunnelip 10.10.10.1
cumulus@leaf01:~$ net add vlan 10 ip forward off
cumulus@leaf01:~$ net add vlan 10 ipv6 forward off
cumulus@leaf01:~$ net add vlan 20 ip forward off
cumulus@leaf01:~$ net add vlan 20 ipv6 forward off
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
cumulus@leaf01:~$ nv set interface vlan10 ip ipv4 forward off
cumulus@leaf01:~$ nv set interface vlan10 ip ipv6 forward off
cumulus@leaf01:~$ nv set interface vlan20 ip ipv4 forward off
cumulus@leaf01:~$ nv set interface vlan20 ip ipv6 forward off
cumulus@leaf01:~$ nv config apply

Edit the /etc/network/interfaces file.

cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto vlan10
iface vlan10
    ip6-forward off
    ip-forward off
    vlan-id 10
    vlan-raw-device bridge

auto vlan20
iface vlan20
    ip6-forward off
    ip-forward off
    vlan-id 20
    vlan-raw-device bridge

auto vni10
iface vni10
    bridge-access 10
    vxlan-id 10
    bridge-learning off

auto vni20
iface vni20
      bridge-access 20
      vxlan-id 20
      bridge-learning off
...

For a bridge in traditional mode, you must edit the bridge configuration in the /etc/network/interfaces file using a text editor:

cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto bridge1
iface bridge1
    bridge-ports swp1.10 swp2.10 vni10
    ip6-forward off
    ip-forward off
...

When deploying EVPN and VXLAN using a hardware profile other than the default Forwarding Table Profile, ensure that the Linux kernel ARP sysctl settings gc_thresh2 and gc_thresh3 both have a value larger than the number of neighbor (ARP and ND) entries you expect in the deployment. To configure these settings, edit the /etc/sysctl.d/neigh.conf file, then reboot the switch. If your network has more hosts than the values in the example below, change the sysctl entries accordingly.

Example /etc/sysctl.d/neigh.conf file
cumulus@leaf01:~$ sudo nano /etc/sysctl.d/neigh.conf
...
net.ipv4.neigh.default.gc_thresh3=14336
net.ipv6.neigh.default.gc_thresh3=16384
net.ipv4.neigh.default.gc_thresh2=7168
net.ipv6.neigh.default.gc_thresh2=8192
...

Keep ARP and ND suppression on to reduce ARP and ND packet flooding over VXLAN tunnels. However, if you need to disable ARP and ND suppression, follow the example commands below.

cumulus@leaf01:~$ net del vxlan vni10 bridge arp-nd-suppress
cumulus@leaf01:~$ net del vxlan vni20 bridge arp-nd-suppress
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
cumulus@leaf01:~$ nv set nve vxlan arp-nd-suppress off
cumulus@leaf01:~$ nv config apply

Edit the /etc/network/interfaces file to remove bridge-arp-nd-suppress on from the VNI.

cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...

auto vni10
iface vni10
    bridge-access 10
    vxlan-id 10
    vxlan-local-tunnelip 10.10.10.1

auto vni20
iface vni20
      bridge-access 20
      vxlan-id 20
      vxlan-local-tunnelip 10.10.10.1
...

Configure Static MAC Addresses

You can configure a MAC address that you intend to pin to a particular VTEP on the VTEP as a static bridge FDB entry. EVPN picks up these MAC addresses and advertises them to peers as remote static MACs. You configure static bridge FDB entries for MAC addresses under the bridge configuration:

cumulus@leaf01:~$ net add bridge post-up bridge fdb add 26:76:e6:93:32:78 dev bond1 vlan 10 master static
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
NVUE commands are not supported.

Edit the /etc/network/interfaces file. For example:

cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto bridge
iface bridge
    bridge-ports bond1 vni10
    bridge-vids 10
    bridge-vlan-aware yes
    post-up bridge fdb add 26:76:e6:93:32:78 dev bond1 vlan 10 master static
...

For a bridge in traditional mode, you must edit the bridge configuration in the /etc/network/interfaces file using a text editor:

cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto br10
iface br10
    bridge-ports swp1.10 vni10
    post-up bridge fdb add 26:76:e6:93:32:78 dev swp1.10 master static
...

Filter EVPN Routes

It is common to sub divide the data center into multiple pods with full host mobility within a pod but only do prefix-based routing across pods. You can achieve this by only exchanging EVPN type-5 routes across pods.

The following example commands configure EVPN to advertise type-5 routes:

cumulus@leaf01:~$ net add routing route-map map1 permit 1 match evpn route-type prefix
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
cumulus@leaf01:~$ nv set router policy route-map map1 rule 10 match evpn-route-type ip-prefix
cumulus@leaf01:~$ nv set router policy route-map map1 rule 10 action permit
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv4-unicast route-export to-evpn route-map map1
cumulus@leaf01:~$ sudo vtysh

leaf01# configure terminal
leaf01(config)# route-map map1 permit 1
leaf01(config)# match evpn route-type prefix
leaf01(config)# end
leaf01# write memory
leaf01# exit
cumulus@leaf01:~$

You must apply the route map for the configuration to take effect. See Route Maps for more information.

In a typical EVPN deployment, you reuse SVI IP addresses on VTEPs across multiple racks. However, if you use unique SVI IP addresses across multiple racks and you want the local SVI IP address to be reachable via remote VTEPs, you can enable the advertise SVI IP/MAC address option. This option advertises the SVI IP/MAC address as a type-2 route and eliminates the need for any flooding over VXLAN to reach the IP address from a remote VTEP or rack.

  • When you enable the advertise SVI IP/MAC address option, the anycast IP and MAC address pair is not advertised. Be sure not to enable both the advertise-svi-ip option and the advertise-default-gw option at the same time. (The advertise-default-gw option configures the gateway VTEPs to advertise their IP and MAC address. See Advertising the Default Gateway.
  • If you use MLAG on your switch, refer to Advertise Primary IP Address.

To advertise all SVI IP/MAC addresses on the switch, run these commands:

cumulus@leaf01:~$ net add bgp l2vpn evpn advertise-svi-ip
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit
cumulus@leaf01:~$ nv set evpn route-advertise svi-ip on
cumulus@leaf01:~$ nv config apply
cumulus@leaf01:~$ sudo vtysh

leaf01# configure terminal
leaf01(config)# router bgp 65101
leaf01(config-router)# address-family l2vpn evpn
leaf01(config-router-af)# advertise-svi-ip
leaf01(config-router-af)# end
leaf01)# write memory
leaf01)# exit
cumulus@leaf01:~$

To advertise a specific SVI IP/MAC address, run these commands:

cumulus@leaf01:~$ net add bgp l2vpn evpn vni 10 advertise-svi-ip
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit

The NCLU commands save the configuration in the /etc/frr/frr.conf file. For example:

cumulus@leaf01:~$ sudo cat /etc/frr/frr.conf
...
address-family l2vpn evpn
  vni 10
  advertise-svi-ip
exit-address-family
...
cumulus@leaf01:~$ nv set evpn evi 10 route-advertise svi-ip on
cumulus@leaf01:~$ nv config apply

The NVUE Commands create the following configuration snippet in the /etc/nvue.d/startup.yaml file:

cumulus@leaf01:~$ sudo cat /etc/nvue.d/startup.yaml
cumulus@leaf01:~$ sudo vtysh

leaf01# configure terminal
leaf01(config)# router bgp 65101
leaf01(config-router)# address-family l2vpn evpn
leaf01(config-router-af)# vni 10
leaf01(config-router-af-vni)# advertise-svi-ip
leaf01(config-router-af-vni)# end
leaf01)# write memory
leaf01)# exit
cumulus@leaf01:~$

The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:

cumulus@leaf01:~$ sudo cat /etc/frr/frr.conf
...
address-family l2vpn evpn
  vni 10
  advertise-svi-ip
exit-address-family
...

Disable BUM Flooding

By default, the VTEP floods all broadcast, and unknown unicast and multicast packets (such as ARP, NS, or DHCP) it receives to all interfaces (except for the incoming interface) and to all VXLAN tunnel interfaces in the same broadcast domain. When the switch receives such packets on a VXLAN tunnel interface, it floods the packets to all interfaces in the packet’s broadcast domain.

You can disable BUM flooding over VXLAN tunnels so that EVPN does not advertise type-3 routes for each local VNI and stops taking action on received type-3 routes.

Disabling BUM flooding is useful in a deployment with a controller or orchestrator, where the switch is pre-provisioned and there is no need to flood any ARP, NS, or DHCP packets.

For information on EVPN BUM flooding with PIM, refer to EVPN BUM Traffic with PIM-SM.

To disable BUM flooding:

cumulus@leaf01:~$ net add bgp l2vpn evpn disable-flooding
cumulus@leaf01:~$ net pending
cumulus@leaf01:~$ net commit

The NCLU commands save the configuration in the /etc/frr/frr.conf file. For example:

...
router bgp 65101
 !
 address-family l2vpn evpn
  flooding disable
 exit-address-family
...

To reenable BUM flooding, run the NCLU net del bgp l2vpn evpn disable-flooding command.

cumulus@leaf01:~$ nv set nve vxlan flooding enable off
cumulus@leaf01:~$ nv config apply

The NVUE Commands create the following configuration snippet in the /etc/nvue.d/startup.yaml file:

cumulus@leaf01:~$ sudo cat /etc/nvue.d/startup.yaml

To reenable BUM flooding, run the nv set nve vxlan flooding enable on command.

cumulus@leaf01:~$ sudo vtysh
leaf01# configure terminal
leaf01(config)# router bgp 65101
leaf01(config-router)# address-family l2vpn evpn
leaf01(config-router-af)# flooding disable
leaf01(config-router-af)# end
leaf01)# write memory
leaf01)# exit
cumulus@leaf01:~$

The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:

...
router bgp 65101
 !
 address-family l2vpn evpn
  flooding disable
 exit-address-family
...

To reenable BUM flooding, run the vtysh flooding head-end-replication command.

cumulus@leaf01:~$ sudo vtysh
leaf01# configure terminal
leaf01(config)# router bgp 65101
leaf01(config-router)# address-family l2vpn evpn
leaf01(config-router-af)# flooding head-end-replication
leaf01(config-router-af)# end
leaf01)# write memory
leaf01)# exit
cumulus@leaf01:~$

To show that BUM flooding is off, run the NCLU net show bgp l2vpn evpn vni command or the vtysh show bgp l2vpn evpn vni command. For example:

cumulus@leaf01:~$ net show bgp l2vpn evpn vni
Advertise Gateway Macip: Disabled
Advertise SVI Macip: Disabled
Advertise All VNI flag: Enabled
BUM flooding: Disabled
Number of L2 VNIs: 3
Number of L3 VNIs: 2
Flags: * - Kernel
  VNI        Type RD                    Import RT                 Export RT                Tenant VRF
* 20         L2   10.10.10.1:3          65101:20                  65101:20                 RED
* 30         L2   10.10.10.1:4          65101:30                  65101:30                 BLUE
* 10         L2   10.10.10.1:6          65101:10                  65101:10                 RED
* 4002       L3   10.1.30.2:2           65101:4002                65101:4002               BLUE
* 4001       L3   10.1.20.2:5           65101:4001                65101:4001               RED

Run the NCLU net show bgp l2vpn evpn route type multicast command to make sure there are no EVPN type-3 routes that originate locally.

Extended Mobility

Cumulus Linux supports scenarios where the IP to MAC binding for a host or virtual machine changes across the move. In addition to the simple mobility scenario where a host or virtual machine with a binding of IP1, MAC1 moves from one rack to another, Cumulus Linux supports additional scenarios where a host or virtual machine with a binding of IP1, MAC1 moves and takes on a new binding of IP2, MAC1 or IP1, MAC2. The EVPN protocol mechanism to handle extended mobility continues to use the MAC mobility extended community and is the same as the standard mobility procedures. Extended mobility defines how to compute the sequence number in this attribute when binding changes occur.

Extended mobility not only supports virtual machine moves, but also where one virtual machine shuts down and you provision another on a different rack that uses the IP address or the MAC address of the previous virtual machine. For example, in an EVPN deployment with OpenStack, where virtual machines for a tenant are provision and shut down dynamically, a new virtual machine can use the same IP address as an earlier virtual machine but with a different MAC address.

Cumulus Linux enables extended mobility by default.

To examine the sequence numbers for a host or virtual machine MAC address and IP address, run the NCLU net show evpn mac vni <vni> mac <address> command or the vtysh show evpn mac vni <vni> mac <address> command. For example:

cumulus@switch:~$ net show evpn mac vni 10100 mac 00:02:00:00:00:42
MAC: 00:02:00:00:00:42
  Remote VTEP: 10.0.0.2
  Local Seq: 0 Remote Seq: 3
  Neighbors:
    10.1.1.74 Active

cumulus@switch:~$ net show evpn arp vni 10100 ip 10.1.1.74
IP: 10.1.1.74
  Type: local
  State: active
  MAC: 44:39:39:ff:00:24
  Local Seq: 2 Remote Seq: 3

Duplicate Address Detection

Cumulus Linux can detect duplicate MAC and IPv4 or IPv6 addresses on hosts or virtual machines in a VXLAN-EVPN configuration. The Cumulus Linux switch (VTEP) considers a host MAC or IP address to be duplicate if the address moves across the network more than a certain number of times within a certain number of seconds (five moves within 180 seconds by default). In addition to legitimate host or VM mobility scenarios, address movement can occur when you configure IP addresses incorrectly on a host or when packet looping occurs in the network due to faulty configuration or behavior.

Cumulus Linux enables duplicate address detection by default, which triggers when:

By default, when the switch detects a duplicate address, it flags the address as a duplicate and generates an error in syslog so that you can troubleshoot the reason and address the fault, then clear the duplicate address flag. The switch does not take any functional action on the address.

  • If the switch flags a MAC address as duplicate, it also flags all IP addresses associated with that MAC as duplicates. However, in an MLAG configuration, sometimes only one of the MLAG peers flags the associated IP addresses as duplicates.

  • In an MLAG configuration, MAC mobility detection runs independently on each switch in the MLAG pair. Based on the sequence in which local learning and, or route withdrawal from the remote VTEP occurs, the MAC mobility counter for a type-2 route increments only on one of the switches in the MLAG pair. In rare cases, it is possible for neither VTEP to increment the MAC mobility counter for the type-2 prefix.

When Does Duplicate Address Detection Trigger?

The VTEP that sees an address move from remote to local begins the detection process by starting a timer. Each VTEP runs duplicate address detection independently. Detection always starts with the first mobility event from remote to local. If the address is initially remote, the detection count can start with the first move for the address. If the address is initially local, the detection count starts only with the second or higher move for the address. If an address is undergoing a mobility event between remote VTEPs, duplicate detection does not start.

The following illustration shows VTEP-A, VTEP-B, and VTEP-C in an EVPN configuration. D