This Reference Deployment Guide (RDG) describes a practical and scalable Red Hat OpenStack deployment suitable for high-performance workloads.
The deployment utilizes a single physical network based on NVIDIA high-speed NICs and switches.
Abbreviations and Acronyms
| Abbreviation | Definition | Abbreviation | Definition |
|---|---|---|---|
| AI | Artificial Intelligence | MLAG | Multi-Chassis Link Aggregation |
| ASAP2 | Accelerated Switching and Packet Processing® | MLNX_OFED | NVIDIA OpenFabrics Enterprise Distribution for Linux (network driver) |
| BGP | Border Gateway Protocol | NFV | Network Functions Virtualization |
| BOM | Bill of Materials | NIC | Network Interface Card |
| CPU | Central Processing Unit | OS | Operating System |
| CUDA | Compute Unified Device Architecture | OVS | Open vSwitch |
| DHCP | Dynamic Host Configuration Protocol | RDG | Reference Deployment Guide |
| DPDK | Data Plane Development Kit | RDMA | Remote Direct Memory Access |
| DVR | Distributed Virtual Routing | RHEL | Red Hat Enterprise Linux |
| ECMP | Equal Cost Multi-Pathing | RH-OSP | Red Hat OpenStack Platform |
| FW | Firmware | RoCE | RDMA over Converged Ethernet |
| GPU | Graphics Processing Unit | SDN | Software Defined Networking |
| HA | High Availability | SR-IOV | Single Root Input/Output Virtualization |
| IP | Internet Protocol | VF | Virtual Function |
| IPMI | Intelligent Platform Management Interface | VF-LAG | Virtual Function Link Aggregation |
| L3 | IP Network Layer 3 | VLAN | Virtual LAN |
| LACP | Link Aggregation Control Protocol | VM | Virtual Machine |
| MGMT | Management | VNF | Virtualized Network Function |
| ML2 | Modular Layer 2 OpenStack Plugin | | |
This document demonstrates the deployment of a large-scale OpenStack cloud over a single high-speed fabric.
The fabric provides the cloud with a mix of L3-routed networks and L2-stretched EVPN networks:
L2-stretched networks are used for the "Deployment/Provisioning" network (as they greatly simplify DHCP and PXE operations) and for the "External" network (as they allow attaching a single external network segment to the cluster, typically a subnet that has real Internet addressing).
Red Hat OpenStack Platform (RH-OSP) is a cloud computing solution that enables the creation, deployment, scale and management of a secure and reliable public or private OpenStack-based cloud. This production-ready platform offers a tight integration with NVIDIA networking and data processing technologies.
The solution demonstrated in this article can be easily applied to diverse use cases, such as core or edge computing, with hardware accelerated packet and data processing for NFV, Big Data, and AI workloads over IP, DPDK, and RoCE stacks.
RH-OSP16.1 Release Notes
Red Hat OpenStack Platform 16.1 supports offloading of the OVS switching function to SmartNIC hardware. This enhancement reduces the processing resources required and accelerates the data path. In Red Hat OpenStack Platform 16.1, this feature has graduated from Technology Preview and is now fully supported.
All configuration files used in this article can be found here: 47036708-1.0.0.zip
Key Components and Technologies
- ConnectX®-6 Dx is a member of the world-class, award-winning ConnectX series of network adapters. ConnectX-6 Dx delivers two ports of 10/25/40/50/100Gb/s or a single port of 200Gb/s Ethernet connectivity, paired with best-in-class hardware capabilities that accelerate and secure cloud and data center workloads.
- NVIDIA Spectrum™ Ethernet Switch product family includes a broad portfolio of top-of-rack and aggregation switches that can be deployed in layer-2 and layer-3 cloud designs, in overlay-based virtualized networks, or as part of high-performance, mission-critical Ethernet storage fabrics.
- LinkX® product family of cables and transceivers provides the industry's most complete line of 10, 25, 40, 50, 100, 200, and 400GbE Ethernet and EDR, HDR, and NDR InfiniBand products for cloud, HPC, Web 2.0, enterprise, telco, storage, and artificial intelligence data center applications. LinkX cables and transceivers are often used to link top-of-rack switches downward to network adapters in NVIDIA GPU and CPU servers and to storage, and/or upward in switch-to-switch applications throughout the network infrastructure.
- Cumulus Linux is the world's most robust open networking operating system. It includes a comprehensive list of advanced, modern networking features and is built for scale.
- Red Hat OpenStack Platform is a cloud computing platform that virtualizes resources from industry-standard hardware, organizes those resources into clouds, and manages them so users can access what they need, when they need it.
A typical OpenStack deployment uses several infrastructure networks, as portrayed in the following diagram:
A straightforward approach would be to use a separate physical network for each of the infrastructure networks above, but this is not practical. Most deployments therefore converge several networks onto a physical fabric, typically one or more 1/10GbE fabrics and sometimes an additional high-speed fabric (25/100/200GbE).
In this document, we demonstrate a deployment that converges all of these networks onto a single 100/200GbE high-speed fabric.
Network Fabric Design
- In this network design, Compute Nodes are connected to the External network as required for a Distributed Virtual Routing (DVR) configuration. For more information, please refer to Red Hat OpenStack Platform DVR.
- A routed Spine-Leaf networking architecture is used in this RDG for the high-speed Control and Data networks, while the Provisioning and External networks use L2-stretched EVPN networks. For more information, please see Red Hat OpenStack Platform Spine Leaf Networking.
The deployment includes two separate networking layers:
- High-speed Ethernet fabric
- IPMI and switch management (not covered in this document)
The high-speed fabric described in this document contains the following building blocks:
- Routed Spine-Leaf networking architecture
- 2 x MSN3700C Spine switches
- L3 BGP unnumbered with ECMP is configured on the Leaf and Spine switches to allow multipath routing between racks located on different L3 network segments
- 2 x MSN2100 Leaf switches per rack in MLAG topology
- 2 x 100GbE ports for MLAG peerlink between Leaf pairs
- L3 VRR VLAN interfaces on each Leaf pair, used as the default gateway for the host servers' VLAN interfaces
- VXLAN VTEPs with a BGP EVPN control plane for creating stretched L2 networks for provisioning and for the external network
- Host servers with 2 x 100GbE ports configured with LACP Active-Active bonding with multiple VLAN interfaces
- The entire fabric is configured to support Jumbo Frames (MTU=9000)
This document demonstrates a minimal scale of two racks with one compute node each. Using the same design, however, the fabric can be scaled to accommodate up to 236 compute nodes with 2x100GbE connectivity and a non-blocking fabric.
The following diagram demonstrates the maximum possible scale for a non-blocking deployment that uses 2x100GbE to the hosts (16 racks with 15 servers each, using 15 spines and 32 leafs):
Host Accelerated Bonding Logical Design
In order to use MLAG-based network high availability, the hosts must have two high-speed interfaces bonded together with an LACP bond.
In the solution described in this article, enhanced SR-IOV with bonding support (ASAP2 VF-LAG) is used to offload network processing from the host and VM into the network adapter hardware, while providing fast data plane with high availability functionality.
Two Virtual Functions, each on a different physical port, are bonded and allocated to the VM as a single LAGed VF. The bonded interface is connected to one or more ToR switches, using Active-Standby or Active-Active bond modes.
For additional information, refer to QSG: High Availability with ASAP2 Enhanced SR-IOV (VF-LAG).
Host and Application Logical Design
Compute host components:
- ConnectX-6 Dx high-speed NIC with dual physical ports, configured with LACP bonding in an MLAG topology and providing VF-LAG redundancy to the VM
- Storage Drives for local OS usage
- RHEL as a base OS
- Red Hat OpenStack Platform containerized software stack with the following:
- KVM-based hypervisor
- Open vSwitch (OVS) with hardware offload support
- Distributed Virtual Routing (DVR) configuration
Virtual Machine components:
- CentOS 7.x/8.x as a base OS
- SR-IOV VF allocated using PCI passthrough, which allows bypassing the compute server hypervisor
- perftest-tools Performance and benchmark testing tool set
Software Stack Components
Bill of Materials
Deployment and Configuration
|Hostname||Router ID||Autonomous System||Downlinks|
|Rack||Hostname||Router ID||Autonomous System||Uplinks||ISL ports||CLAG System MAC||CLAG Priority||VXLAN_Anycast_IP|
|VLAN ID||Virtual MAC||Virtual IP||Primary Router IP||Secondary Router IP||Purpose|
|L2-Stretched VLANs (EVPN)|
|VLAN ID||VNI||Used Subnet||Purpose|
|Rack||VLAN ID||Access Ports||Trunk Ports||Network Purpose|
Director Node (Undercloud)
ens2f0 → Leaf1A, swp1
ens2f1 → Leaf1B, swp1
ens2f0 → Leaf1A, swp2
ens2f1 → Leaf1B, swp2
ens2f0 → Leaf1A, swp3
ens2f1 → Leaf1B, swp3
ens2f0 → Leaf1A, swp4
ens2f1 → Leaf1B, swp4
enp57s0f0 → Leaf1A, swp5
enp57s0f1 → Leaf1B, swp5
enp57s0f0 → Leaf2A, swp1
enp57s0f1 → Leaf2B, swp1
The wiring principle for the high-speed Ethernet fabric is as follows:
- Each server in the racks is wired to two leaf (or "ToR") switches
- Leaf switches are interconnected using two ports (to create an MLAG)
- Every leaf is wired to all the spines
Below is the full wiring diagram for the demonstrated fabric:
Updating Cumulus Linux
As a best practice, make sure to use the latest released Cumulus Linux NOS version.
Please see this guide on how to upgrade Cumulus Linux.
Configuring the Cumulus Linux switch
Make sure your Cumulus Linux switch has passed its initial configuration stages (for additional information, see the Quick-Start Guide for version 4.4):
- License installation
- Creation of switch interfaces (e.g., swp1-32)
The configuration of the spine switches includes the following:
- Routing & BGP
- Downlinks configuration
The configuration of the leaf switches includes the following:
- Routing & BGP
- Uplinks configuration
- L3-Routed VLANs
- L2-Stretched VLANs
- Bonds configuration
Following are the configuration commands:
- The configuration below does not include the connection of an external router/gateway for external/Internet connectivity
- The "external" network (stretched L2 over VLAN 60) is assumed to have a gateway (at 18.104.22.168) and to provide Internet connectivity for the cluster
- The director (undercloud) node has an interface on VLAN 60 (bond0.60 with address 22.214.171.124)
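For the spine switches, a minimal configuration sketch (NCLU syntax, Cumulus Linux 4.x) might look as follows. The hostname, loopback address, ASN, and port ranges are placeholders and must be aligned with the Router ID / Autonomous System tables above:

```
# Hypothetical example for Spine1
net add hostname spine1
net add loopback lo ip address 10.10.10.101/32
net add interface swp1-32 mtu 9216

# BGP unnumbered with ECMP towards all leafs
net add bgp autonomous-system 65100
net add bgp router-id 10.10.10.101
net add bgp bestpath as-path multipath-relax
net add bgp neighbor swp1-32 interface remote-as external
net add bgp ipv4 unicast redistribute connected
net add bgp l2vpn evpn neighbor swp1-32 activate
net commit
```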
Repeat the following commands on all the leafs:
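Below is a minimal per-leaf sketch (NCLU syntax, Cumulus Linux 4.x) for Leaf1A. The loopback and VRR addresses, ASN, CLAG system MAC, VLAN/VNI IDs, and port numbers are placeholders that must match the tables above; each leaf uses its own values, and the MLAG peer (Leaf1B) mirrors the configuration with the "secondary" CLAG role and the same vxlan-anycast-ip:

```
# Hypothetical example for Leaf1A
net add hostname leaf1a
net add loopback lo ip address 10.10.10.1/32
net add loopback lo clag vxlan-anycast-ip 10.10.10.10
net add interface swp1-32 mtu 9216

# Uplinks - BGP unnumbered with ECMP towards the spines
net add bgp autonomous-system 65201
net add bgp router-id 10.10.10.1
net add bgp neighbor swp31-32 interface remote-as external
net add bgp ipv4 unicast redistribute connected
net add bgp l2vpn evpn neighbor swp31-32 activate
net add bgp l2vpn evpn advertise-all-vni

# MLAG peerlink and a host-facing LACP bond
net add clag peer sys-mac 44:38:39:FF:00:01 interface swp13-14 primary backup-ip 192.168.200.2
net add bond bond1 bond slaves swp1
net add bond bond1 clag id 1
net add bond bond1 mtu 9000
net add bond bond1 bridge trunk vlans 50,60,70

# L3-routed VLAN with a VRR virtual gateway for the hosts
net add vlan 70 ip address 172.16.70.2/24
net add vlan 70 ip address-virtual 00:00:5E:00:01:01 172.16.70.1/24

# L2-stretched VLAN (e.g., provisioning VLAN 50) over VXLAN with EVPN
net add vxlan vni50 vxlan id 50
net add vxlan vni50 bridge access 50
net add vxlan vni50 vxlan local-tunnelip 10.10.10.1
net add bridge bridge ports bond1,peerlink,vni50
net commit
```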
In order to achieve optimal results for DPDK use cases, the compute node must be correctly optimized for DPDK performance.
The optimization might require specific BIOS and NIC Firmware settings. Please refer to the official DPDK performance document provided by your CPU vendor.
For our deployment we used the following document from AMD: https://www.amd.com/system/files/documents/epyc-7Fx2-processors-dpdk-nw-performance-brief.pdf
Please note that the host boot settings (Linux GRUB command line) are configured later as part of the overcloud image configuration, and that the hugepages allocation is done on the actual VM used for testing.
Red Hat OpenStack Platform (RH-OSP) version 16.1 is used.
The deployment is divided into three major steps:
- Installing the director node
- Deploying the undercloud on the director node
- Deploying the overcloud using the undercloud
Make sure that the BIOS settings on the worker nodes servers have SR-IOV enabled and that the servers are tuned for maximum performance.
All nodes which belong to the same profile (e.g., controller, compute) must have the same PCIe placement for the NIC and must expose the same interface name.
The director node needs to have the following network interfaces (a possible NetworkManager configuration for the bond interfaces is sketched after this list):
- Untagged over bond0—used for provisioning the overcloud nodes (connected to stretched VLAN 50 on the switch)
- Tagged VLAN 60 over bond0.60—used for Internet access over the external network segment (connected to stretched VLAN 60 on the switch)
- An interface on the IPMI network—used for accessing the bare metal nodes' BMCs
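A possible way to configure the bond and VLAN interfaces on a RHEL-based director node with NetworkManager is sketched below. The interface names, addresses, and gateway are placeholders and must match your environment:

```
# Hypothetical example - LACP bond over the two high-speed ports
nmcli con add type bond con-name bond0 ifname bond0 bond.options "mode=802.3ad,miimon=100"
nmcli con add type ethernet con-name bond0-ens2f0 ifname ens2f0 master bond0
nmcli con add type ethernet con-name bond0-ens2f1 ifname ens2f1 master bond0

# Untagged provisioning address on the bond (stretched VLAN 50 on the switch side)
nmcli con modify bond0 ipv4.method manual ipv4.addresses 192.168.50.1/24

# Tagged VLAN 60 interface for external/Internet access
nmcli con add type vlan con-name bond0.60 ifname bond0.60 dev bond0 id 60 \
  ipv4.method manual ipv4.addresses 192.0.2.10/24 ipv4.gateway 192.0.2.1

nmcli con up bond0 ; nmcli con up bond0.60
```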
Undercloud Director Installation
Follow RH-OSP Preparing for Director Installation up to Preparing Container Images.
Use the following environment file for OVS-based RH-OSP 16.1 container preparation. Remember to update your Red Hat registry credentials:
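If you do not already have a container preparation file, one way to generate a starting point (which you then edit to add your registry.redhat.io credentials) is:

```
openstack tripleo container image prepare default \
  --local-push-destination \
  --output-env-file /home/stack/containers-prepare-parameter.yaml
```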
- Proceed with the director installation steps, described in RH-OSP Installing Director, up to the RH-OSP Installing Director execution.
The following undercloud configuration file was used in our deployment:
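A minimal sketch of such an undercloud.conf is shown below. The interface name, addresses, and DHCP/inspection ranges are placeholders; the actual file used in this deployment is included in the attached configuration archive:

```
cat > /home/stack/undercloud.conf <<'EOF'
[DEFAULT]
undercloud_hostname = rhosp-director.localdomain
local_interface = bond0
local_ip = 192.168.50.1/24
undercloud_public_host = 192.168.50.2
undercloud_admin_host = 192.168.50.3
subnets = ctlplane-subnet
local_subnet = ctlplane-subnet
overcloud_domain_name = localdomain
container_images_file = /home/stack/containers-prepare-parameter.yaml

[ctlplane-subnet]
cidr = 192.168.50.0/24
dhcp_start = 192.168.50.100
dhcp_end = 192.168.50.150
inspection_iprange = 192.168.50.200,192.168.50.250
gateway = 192.168.50.1
masquerade = true
EOF
```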
- Follow the instructions in RH-OSP Obtain Images for Overcloud Nodes (section 4.9.1, steps 1–3), without importing the images into the director yet.
- Once obtained, follow the customization process described in RH-OSP Working with Overcloud Images:
- Start from 3.3. QCOW: Installing virt-customize to director
- Skip 3.4
- Run 3.5. QCOW: Setting the root password (optional)
- Run 3.6. QCOW: Registering the image
Run the following command to locate your subscription pool ID:
Replace [subscription-pool] in the below command with your relevant subscription pool ID:
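A hedged example of both steps, using virt-customize as in the Red Hat image customization procedure and assuming the overcloud-full.qcow2 image is in the current directory; the credentials and pool ID must be replaced with your own:

```
# Locate the pool ID (run on a system registered with your Red Hat account):
sudo subscription-manager list --available --all --matches="Red Hat OpenStack"

# Register the image and attach the pool:
virt-customize --selinux-relabel -a overcloud-full.qcow2 \
  --run-command 'subscription-manager register --username=<RHN_USER> --password=<RHN_PASSWORD>' \
  --run-command 'subscription-manager attach --pool=[subscription-pool]'
```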
- Skip 3.7 and 3.8
Run the following command to add mstflint to the overcloud image to allow NIC firmware provisioning during overcloud deployment (similar to 3.9).
mstflint is required for the overcloud nodes to support the automatic NIC firmware upgrade by the cloud orchestration system during deployment.
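For example (assuming the image is in the current working directory):

```
virt-customize --selinux-relabel -a overcloud-full.qcow2 --install mstflint
```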
- Run 3.10. QCOW: Cleaning the subscription pool
- Run 3.11. QCOW: Unregistering the image
- Run 3.12. QCOW: Reset the machine ID
- Run 3.13. Uploading the images to director
Undercloud Director Preparation for Automatic NIC Firmware Provisioning
- Download the latest ConnectX NIC firmware binary file (fw-<NIC-Model>.bin) from NVIDIA Networking Firmware Download Site.
- Create a directory named /var/lib/ironic/httpboot/ on the Director node, and place the firmware binary file in it (see the sketch below).
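A minimal sketch, assuming the downloaded binary is named fw-ConnectX6Dx.bin (use your actual file name):

```
sudo mkdir -p /var/lib/ironic/httpboot/
sudo cp fw-ConnectX6Dx.bin /var/lib/ironic/httpboot/
sudo chmod a+r /var/lib/ironic/httpboot/fw-ConnectX6Dx.bin
```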
Extract the connectx_first_boot.yaml file from the configuration files attached to this guide, and place it in the /home/stack/templates/ directory in the Director node
The connectx_first_boot.yaml file is called by another deployment configuration file (env-ovs-dvr.yaml), so please use the instructed location, or change the configuration files accordingly.
Overcloud Nodes Introspection
A full overcloud introspection procedure is described in RH-OSP Configuring a Basic Overcloud. In this RDG, the following configuration steps were used to introspect the overcloud bare metal nodes, which are later deployed over two routed Spine-Leaf racks:
Prepare a bare metal inventory file, instackenv.json, with the overcloud nodes' information. In this case, the inventory file lists 5 bare metal nodes to be deployed as overcloud nodes: 3 controller nodes and 2 compute nodes (one in each routed rack). Make sure to update the file with the IPMI server addresses and credentials, and with the MAC address of one of the physical ports, in order to avoid issues during the introspection process:
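A shortened, hypothetical sketch of such an inventory is shown below (two of the five nodes shown; IPMI addresses, credentials, and MAC addresses are placeholders, and field names follow the RH-OSP 16.1 bare-metal inventory format):

```
cat > /home/stack/instackenv.json <<'EOF'
{
  "nodes": [
    {
      "name": "controller-0",
      "pm_type": "ipmi",
      "pm_addr": "192.168.1.11",
      "pm_user": "ADMIN",
      "pm_password": "ADMIN_PASSWORD",
      "mac": ["aa:bb:cc:dd:ee:01"]
    },
    {
      "name": "compute-r0-0",
      "pm_type": "ipmi",
      "pm_addr": "192.168.1.14",
      "pm_user": "ADMIN",
      "pm_password": "ADMIN_PASSWORD",
      "mac": ["aa:bb:cc:dd:ee:04"]
    }
  ]
}
EOF
```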
Import the overcloud baremetal nodes inventory, and wait until all nodes are listed in "manageable" state.
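For example:

```
openstack overcloud node import /home/stack/instackenv.json
openstack baremetal node list
```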
Start the baremetal nodes introspection:
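For example:

```
openstack overcloud node introspect --all-manageable
```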
Set the root device for deployment, and provide all baremetal nodes so they reach the "available" state:
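A sketch of both steps, assuming /dev/sda is the intended root disk on every node (adjust the root-device hint to your hardware):

```
for node in $(openstack baremetal node list -f value -c UUID); do
  openstack baremetal node set --property root_device='{"name": "/dev/sda"}' $node
done
openstack overcloud node provide --all-manageable
```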
Tag the controller nodes into the "control" profile, which is later mapped to the overcloud controller role:
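For example (repeat per controller node UUID):

```
openstack baremetal node set \
  --property capabilities='profile:control,boot_option:local' <controller-node-uuid>
```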
Note: The Role to Profile mapping is specified in the node-info.yaml file used during the overcloud deployment.
Create a new compute flavor, and tag one compute node into the "compute-r0" profile, which is later mapped to the overcloud "compute in rack 0" role:
Create a new compute flavor, and tag the last compute node into the "compute-r1" profile, which is later mapped to the overcloud "compute in rack 1" role:
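A sketch covering both rack profiles (flavor sizes are illustrative; node UUIDs are placeholders):

```
# Rack0
openstack flavor create --id auto --ram 4096 --disk 40 --vcpus 1 compute-r0
openstack flavor set --property "capabilities:boot_option"="local" \
  --property "capabilities:profile"="compute-r0" compute-r0
openstack baremetal node set \
  --property capabilities='profile:compute-r0,boot_option:local' <rack0-compute-node-uuid>

# Rack1
openstack flavor create --id auto --ram 4096 --disk 40 --vcpus 1 compute-r1
openstack flavor set --property "capabilities:boot_option"="local" \
  --property "capabilities:profile"="compute-r1" compute-r1
openstack baremetal node set \
  --property capabilities='profile:compute-r1,boot_option:local' <rack1-compute-node-uuid>
```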
Verify the overcloud nodes profiles allocation:
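For example:

```
openstack overcloud profiles list
```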
Overcloud Deployment Configuration Files
Prepare the following cloud deployment configuration files, and place them under the /home/stack/templates/dvr directory.
Note: The full files are attached to this article, and can be found here: 47036708-1.0.0.zip
Some configuration files are customized specifically to the /home/stack/templates/dvr location. If you place the template files in a different location, adjust them accordingly.
This template file contains the network settings for the controller nodes, including large MTU and bonding configuration.
This template file contains the network settings for the compute nodes, including SR-IOV VFs, large MTU, and accelerated bonding (VF-LAG) configuration for the data path.
This environment file contains the node count per role and the role-to-baremetal-profile mapping.
This environment file contains the services enabled on each cloud role and the networks associated with its rack location.
This environment file contains a cloud network configuration for routed Spine-Leaf topology with large MTU. Rack0 and Rack1 L3 segments are listed as subnets of each cloud network. For further information refer to RH-OSP Configuring the Overcloud Leaf Networks.
This environment file contains the following settings:
- Overcloud nodes time settings
- ConnectX First Boot parameters (by calling /home/stack/templates/connectx_first_boot.yaml file)
- Neutron Jumbo Frame MTU and DVR mode
- Compute nodes CPU partitioning and isolation adjusted to the NUMA topology
- Nova PCI passthrough settings adjusted to VXLAN hardware offload
Issue the overcloud deploy command to start cloud deployment with the prepared configuration files.
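A hedged sketch of the deploy command is shown below. The exact list of -n/-r/-e files depends on the templates in the attached archive (only node-info.yaml, env-ovs-dvr.yaml, and the templates directory are referenced above; the other file names are illustrative):

```
openstack overcloud deploy --templates \
  -n /home/stack/templates/dvr/network_data.yaml \
  -r /home/stack/templates/dvr/roles_data.yaml \
  -e /home/stack/containers-prepare-parameter.yaml \
  -e /home/stack/templates/dvr/node-info.yaml \
  -e /home/stack/templates/dvr/env-ovs-dvr.yaml
```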
Once deployed, load the necessary environment variables to interact with your overcloud:
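For example:

```
source /home/stack/overcloudrc
openstack server list
```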
Applications and Use Cases
Accelerated Packet Processing (SDN Acceleration)
Note: The following use case demonstrates SDN layer acceleration using hardware offload capabilities.
The appendix below includes benchmarks that demonstrate SDN offload performance and usability.
- Build a VM cloud image (qcow2) with packet processing performance tools and cloud-init elements as described in How-to: Create OpenStack Cloud Image with Performance Tools.
Upload the image to the overcloud image store:
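For example, assuming the image built in the previous step is named perftest.qcow2:

```
openstack image create --disk-format qcow2 --container-format bare \
  --file perftest.qcow2 perftest-image
```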
Create a flavor:
Set hugepages and cpu-pinning parameters:
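A sketch of both steps (the flavor name, sizes, and property values are illustrative):

```
openstack flavor create --ram 8192 --disk 40 --vcpus 8 perf.flavor
openstack flavor set perf.flavor \
  --property hw:mem_page_size=1GB \
  --property hw:cpu_policy=dedicated
```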
VM Networks and Ports
Create a VXLAN network with normal ports to be used for instance management and access:
Create a VXLAN network to be used for accelerated data traffic between the VM instances with Jumbo Frames support:
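A sketch of both networks (names and subnet ranges are illustrative; the vx_data network is reused later for the direct ports):

```
# Management network for instance access (normal ports)
openstack network create vx_mgmt
openstack subnet create --network vx_mgmt --subnet-range 10.10.1.0/24 vx_mgmt_subnet

# Accelerated data network with Jumbo Frames support
openstack network create --mtu 9000 vx_data
openstack subnet create --network vx_data --subnet-range 10.10.2.0/24 --gateway none vx_data_subnet
```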
Create 2 x SR-IOV direct ports with hardware offload capabilities:
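For example, on the vx_data network:

```
openstack port create --network vx_data --vnic-type direct \
  --binding-profile '{"capabilities": ["switchdev"]}' direct1
openstack port create --network vx_data --vnic-type direct \
  --binding-profile '{"capabilities": ["switchdev"]}' direct2
```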
Create an external network for public access:
Create a public router, and add the management network subnet:
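A sketch of both steps. The physical network name, VLAN segment, and external subnet details are assumptions that must match the "external" stretched VLAN 60 configured on the fabric:

```
openstack network create --external --provider-network-type vlan \
  --provider-physical-network datacentre --provider-segment 60 public
openstack subnet create --network public --no-dhcp \
  --subnet-range 192.0.2.0/24 --gateway 192.0.2.254 \
  --allocation-pool start=192.0.2.100,end=192.0.2.150 public_subnet

openstack router create public_router
openstack router set --external-gateway public public_router
openstack router add subnet public_router vx_mgmt_subnet
```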
Create a VM instance with a management port and a direct port on the compute node located on the Rack0 L3 network segment:
Create a VM instance with a management port and a direct port on the compute node located on the Rack1 L3 network segment:
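A sketch covering both instances. The flavor, image, and port names follow the previous steps; the network ID and compute host names are placeholders for your environment:

```
openstack server create --flavor perf.flavor --image perftest-image \
  --nic net-id=<vx_mgmt-network-id> --nic port-id=direct1 \
  --availability-zone nova:<rack0-compute-host> vm1

openstack server create --flavor perf.flavor --image perftest-image \
  --nic net-id=<vx_mgmt-network-id> --nic port-id=direct2 \
  --availability-zone nova:<rack1-compute-host> vm2
```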
Wait until the VM instances' status changes to ACTIVE:
In order to verify that external/public network access is functioning properly, create and assign external floating IP addresses for the VMs:
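For example:

```
openstack floating ip create public
openstack server add floating ip vm1 <allocated-floating-ip>
openstack floating ip create public
openstack server add floating ip vm2 <allocated-floating-ip>
```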
The performance results listed in this document are indicative and should not be considered as formal performance targets for NVIDIA products.
Note: The tools used in the below tests are included in the VM image built with the perf-tools element, as instructed in previous steps.
iperf TCP Test
On the Transmitter VM, disable XPS (Transmit Packet Steering):
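A minimal sketch, assuming the data interface inside the VM is eth0:

```
# Disable XPS on all TX queues of the data interface
for q in /sys/class/net/eth0/queues/tx-*/xps_cpus; do echo 0 | sudo tee $q; done
```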
- This step is required to allow packet distribution of the iperf traffic using ConnectX VF-LAG.
- Use interface statistics/counters on the Leaf switches' bond interfaces to confirm that traffic is distributed by the Transmitter VM to both switches using VF-LAG (Cumulus "net show counters" command).
On the Receiver VM, start multiple iperf3 servers:
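For example, 8 servers on consecutive ports:

```
for i in $(seq 0 7); do iperf3 -s -D -p $((5201 + i)); done
```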
On the Transmitter VM, start multiple iperf3 clients for multi-core parallel streams:
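For example, one client per server port, each pinned to a different core (replace <receiver-data-ip> with the Receiver VM's data network address):

```
for i in $(seq 0 7); do
  iperf3 -c <receiver-data-ip> -p $((5201 + i)) -t 60 -A $i &
done
wait
```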
Check the results:
The test results above demonstrate a total of around 100Gbps line rate for IP TCP traffic.
Before proceeding to the next test, stop all iperf servers on the Receiver VM:
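For example:

```
sudo pkill iperf3
```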
RoCE Bandwidth Test over the bond
On the Receiver VM, start ib_write_bw server with 2 QPs in order to utilize the VF-LAG infrastructure:
On the Transmitter VM, start ib_write_bw client with 2 QPs and a duration of 10 seconds:
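A sketch of both sides (device selection flags omitted; add -d <device> if multiple RDMA devices are present):

```
# Receiver VM (server):
ib_write_bw -q 2 --report_gbits

# Transmitter VM (client):
ib_write_bw -q 2 -D 10 --report_gbits <receiver-data-ip>
```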
Check the test results:
Single-core DPDK packet-rate test over a single port
The traffic generator sends 64-byte packets to the testpmd tool, and the tool sends them back (thus testing the DPDK packet-processing pipeline).
We tested single-core packet processing rate between two VMs using 2 tx and rx queues.
Please note that this test requires specific versions of DPDK and of the VM image kernel; compatibility issues might exist when using the latest releases.
For this test we used a CentOS 7.8 image with kernel 3.10.0-1160.42.2 and DPDK 20.11.
On one VM, we activated the dpdk-testpmd utility. This is the command line used to run testpmd (the command also displays the MAC address of the port, which is used later as the destination MAC for TRex):
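A hedged example of such a testpmd invocation for DPDK 20.11. The PCI address and core list are placeholders specific to the VM, and hugepages must already be allocated inside the guest:

```
dpdk-testpmd -l 0-1 -n 4 -a 0000:05:00.0 -- \
  --rxq=2 --txq=2 --nb-cores=1 \
  --forward-mode=macswap --auto-start --stats-period 1
```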
Next, we created a second VM with two direct ports (on the vx_data network) and used it for running the trex traffic generator (version 2.82):
We ran the DPDK port setup interactive wizard and used the MAC address of the TestPMD VM collected beforehand:
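The wizard is part of the TRex package:

```
cd /root/trex/v2.82
./dpdk_setup_ports.py -i
```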
We created the following UDP packet stream configuration file under the /root/trex/v2.82 directory:
Next, we ran the TRex application in the background over 8 out of 10 cores:
And finally ran the TRex Console:
In the TRex Console, we entered the UI (TUI):
And started a 35MPPS stream using the stream configuration file created in previous steps:
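A consolidated sketch of the TRex steps above (the stream file name is a placeholder for the file created earlier):

```
cd /root/trex/v2.82
nohup ./t-rex-64 -i -c 8 &   # run TRex in stateless/interactive mode over 8 cores

./trex-console               # in a second shell, open the TRex Console
# inside the console:
#   tui                                        # enter the textual UI
#   start -f udp_stream.py -m 35mpps --port 0  # start the 35 MPPS stream
```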
And these are the measured results on the testpmd VM (~32.5mpps):
Troubleshooting the Nodes in Case Network Is Not Functioning
In case there is a need to debug the overcloud nodes due to issues in the deployment, the only way to connect to them is using an SSH key that is placed in the director node.
Connecting through a BMC console is useless since there is no way to log in to the nodes using a username and a password.
Since the nodes are connected using a single network, they can become completely inaccessible for troubleshooting if, for example, there is an issue with the network YAML templates.
The solution is to add the following lines to the env-ovs-dvr.yaml file (under the mentioned sections), which allow setting a root password for the nodes so they become accessible from the BMC console:
The overcloud needs to be deleted and redeployed in order for the above changes to take effect.
Note: This feature and the "Automatic NIC Firmware Provisioning" feature are mutually exclusive and can't be used in parallel, as they both use the "NodeUserData" variable!
Note: Make sure to remove the above lines in a production cloud, as they pose a security hazard.
Once the nodes are accessible, an additional helpful step to troubleshoot network issues on the nodes would be to run the following script on the problematic nodes:
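One possible check (paths assumed): re-apply the node's generated os-net-config configuration with verbose output to surface interface, bond, or VLAN configuration errors:

```
sudo os-net-config --debug -c /etc/os-net-config/config.json
```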
This script will try to reconfigure the network and will indicate any relevant issues.
Over the past few years, Itai Levy has worked as a Solutions Architect and member of the NVIDIA Networking “Solutions Labs” team. Itai designs and executes cutting-edge solutions around Cloud Computing, SDN, SDS and Security. His main areas of expertise include NVIDIA BlueField Data Processing Unit (DPU) solutions and accelerated OpenStack/K8s platforms.
Shachar Dor joined the Solutions Lab team after working more than ten years as a software architect at NVIDIA Networking (previously Mellanox Technologies), where he was responsible for the architecture of network management products and solutions. Shachar's focus is on networking technologies, especially around fabric bring-up, configuration, monitoring, and life-cycle management.
Shachar has a strong background in software architecture, design, and programming through his work on multiple projects and technologies also prior to joining the company.