NVIDIA Docs Hub Homepage NVIDIA Networking Networking Solutions RDG for Red-Hat OpenStack Cloud over NVIDIA Converged High-Performance Ethernet Network

RDG for Red-Hat OpenStack Cloud over NVIDIA Converged High-Performance Ethernet Network

Scope

This R eference D eployment G uide ( RDG ) document is aimed at having a practical and scalable Red-Hat OpenStack deployment suitable for high-performance workloads.

The deployment utilizes a single physical network based on NVIDIA high-speed NICs and switches.

Abbreviations and Acronyms

Term	Definition	Term	Definition
AI	Artificial Intelligence	MLAG	Multi-Chassis Link Aggregation
ASAP²	Accelerated Switching and Packet Processing®	MLNX_OFED	NVIDIA OpenFabrics Enterprise Distribution for Linux (network driver)
BGP	Border Gateway Protocol	NFV	Network Functions Virtualization
BOM	Bill of Materials	NIC	Network Interface Card
CPU	Central Processing Unit	OS	Operating System
CUDA	Compute Unified Device Architecture	OVS	Open vSwitch
DHCP	Dynamic Host Configuration Protocol	RDG	Reference Deployment Guide
DPDK	Data Plane Development Kit	RDMA	Remote Direct Memory Access
DVR	Distributed Virtual Routing	RHEL	Red Hat Enterprise Linux
ECMP	Equal Cost Multi-Pathing	RH-OSP	Red Hat OpenStack Platform
FW	FirmWare	RoCE	RDMA over Converged Ethernet
GPU	Graphics Processing Unit	SDN	Software Defined Networking
HA	High Availability	SR-IOV	Single Root Input/Output Virtualization
IP	Internet Protocol	VF	Virtual Function
IPMI	Intelligent Platform Management Interface	VF-LAG	Virtual Function Link Aggregation
L3	IP Network Layer 3	VLAN	Virtual LAN
LACP	Link Aggregation Control Protocol	VM	Virtual Machine
MGMT	Management	VNF	Virtualized Network Function
ML2	Modular Layer 2 Openstack Plugin

Introduction

This document demonstrates the deployment of a large-scale OpenStack cloud over a single high-speed fabric.

The fabric provides the cloud a mix of L3-routed networks and L2-stretched EVPN networks -

L2-stretched networks are used for the "Deployment/Provisioning" network (as they greatly simplify DHCP and PXE operations) and for the "External" network (as they allow attaching a single external network segment to the cluster, typically a subnet that has real Internet addressing).

Red Hat OpenStack Platform (RH-OSP) is a cloud computing solution that enables the creation, deployment, scale and management of a secure and reliable public or private OpenStack-based cloud. This production-ready platform offers a tight integration with NVIDIA networking and data processing technologies.

The solution demonstrated in this article can be easily applied to diverse use cases, such as core or edge computing, with hardware accelerated packet and data processing for NFV, Big Data, and AI workloads over IP, DPDK, and RoCE stacks.

Note

RH-OSP16.1 Release Notes

Red Hat OpenStack Platform 16.1, supports offloading of the OVS switching function to the SmartNIC hardware. This enhancement reduces the processing resources required and accelerates the data path. In Red Hat OpenStack Platform 16.1, this feature has graduated from Technology Preview and is now fully supported.

Note

Downloadable Content

All configuration files used in this article can be found here: 47036708-1.0.0.zip

References

Red Hat OpenStack Platform 16.1 Installation Guide

Red Hat Openstack Platform 16.1 Spine Leaf Networking

QSG: High Availability with ASAP2 Enhanced SR-IOV (VF-LAG).

Solution Architecture

Key Components and Technologies

CONNECTX®-6 Dx is a member of the world-class, award-winning ConnectX series of network adapters. ConnectX-6 Dx delivers two ports of 10/25/40/50/100Gb/s or a single-port of 200Gb/s Ethernet connectivity paired with best-in-class hardware capabilities that accelerate and secure cloud and data center workloads.
NVIDIA Spectrum™ Ethernet Switch product family includes a broad portfolio of top-of-rack and aggregation switches, that can be deployed in layer-2 and layer-3 cloud designs, in overlay-based virtualized networks, or as part of high-performance, mission-critical ethernet storage fabrics.
LinkX® product family of cables and transceivers provides the industry’s most complete line of 10, 25, 40, 50, 100, 200, and 400GbE in Ethernet and EDR, HDR, and NDR in InfiniBand products for Cloud, HPC, Web 2.0, Enterprise, telco, storage, and artificial intelligence and data center applications. LinkX cables and transceivers are often used to link top-of-rack switches downwards to network adapters in NVIDIA GPUs and CPU servers, and storage and/or upwards in switch-to-switch applications throughout the network infrastructure.
CUMULUS Linux is the world’s most robust open networking operating system. It includes a comprehensive list of advanced, modern networking features, and is built for scale.
Red Hat OpenStack Platform is a cloud computing platform that virtualizes resources from industry-standard hardware, organizes those resources into clouds, and manages them so users can access what they need, when they need it.

Logical Design

A typical OpenStack deployment uses several infrastructure networks, as portrayed in the following diagram:

A straight forward approach would be to use a separate physical network for each infrastructure network above, but this is not practical. Most deployments converge several networks onto a physical fabric and typically use one or more 1/10GbE fabrics and sometimes an additional high-speed fabric (25/100/200GbE).

In this document, we will demonstrate a deployment that converges all the below networks onto a single 100/200GbE high-speed fabric.

Network Fabric Design

Note

Note

In this network design, Compute Nodes are connected to the External network as required for Distributed Virtual Routing (DVR) configuration. For more information, please refer to Red Hat Openstack Platform DVR.
Routed Spine-Leaf networking architecture is used in this RDG for high speed Control and Data networks. However, the provisioning and External network use L2-Stretched EVPN networks. For more information, please see Red Hat Openstack Platform Spine Leaf Networking.

The deployment includes two separate networking layers:

High-speed Ethernet fabric
IPMI and switch management (not covered in this document)

The high-speed fabric described in this document contains the following building blocks:

Routed Spine-Leaf networking architecture
2 x MSN3700C Spine switches
L3 BGP unnumbered with ECMP is configured on the Leaf and Spine switches to allow multipath routing between racks located on different L3 network segments
2 x MSN2100 Leaf switches per rack in MLAG topology
2 x 100GbE ports for MLAG peerlink between Leaf pairs
L3 VRR VLAN interfaces on a Leaf pair that is used as the default gateway for the host servers VLAN interfaces
VXLAN VTEPs with BGP EVPN control plane for creating stretch L2 networks for provisioning and for the external network
Host servers with 2 x 100GbE ports configured with LACP Active-Active bonding with multiple VLAN interfaces
The entire fabric is configured to support Jumbo Frames (MTU=9000)

This document will demonstrate a minimalistic scale of two racks with 1 compute node each. Though using the same design, the fabric can be scaled to accommodate up to 236 compute nodes with 2x100GbE connectivity and a non-blocking fabric.

This is a diagram demonstrating the maximum possible scale for a non-blocking deployment that uses 2x100GbE to the hosts (16 racks, 15 servers each using 15 spines and 32 leafs):

Host Accelerated Bonding Logical Design

In order to use the MLAG-based network high availability, the hosts must have two high-speed interfaces that are bonded together with an LACP bond.

In the solution described in this article, enhanced SR-IOV with bonding support ( ASAP ² VF-LAG) is used to offload network processing from the host and VM into the network adapter hardware, while providing fast data plane with high availability functionality.

Two Virtual Functions, each on a different physical port, are bonded and allocated to the VM as a single LAGed VF. The bonded interface is connected to a single or multiple ToR switches, using Active-Standby or Active-Active bond modes.

For additional information, refer to QSG: High Availability with ASAP2 Enhanced SR-IOV (VF-LAG).

Host and Application Logical Design

Compute host components:

ConnectX-6 Dx High Speed NIC with a dual physical port that is configured with LACP bonding in MLAG topology and providing VF-LAG redundancy to the VM
Storage Drives for local OS usage
RHEL as a base OS
Red Hat OpenStack Platform containerized software stack with the following:
- KVM-based hypervisor
- Openvswitch (OVS) with hardware offload support
- Distributed Virtual Routing (DVR) configuration

Virtual Machine components:

CentOS 7.x/8.x as a base OS
SR-IOV VF allocated using PCI passthrough which allows to bypass the compute server hypervisor
perftest-tools Performance and benchmark testing tool set

Software Stack Components

Bill of Materials

Deployment and Configuration

Spines
Hostname	Router ID	Autonomous System	Downlinks
spine1	10.10.10.101/32	65199	swp1-4
spine2	10.10.10.102/32	65199	swp1-4

Leafs
Rack	Hostname	Router ID	Autonomous System	Uplinks	ISL ports	CLAG System MAC	CLAG Priority	VXLAN_Anycast_IP
1	leaf1a	10.10.10.1/32	65100	swp13-14	swp15-16	44:38:39:BE:EF:AA	1000	10.0.0.10
1	leaf1b	10.10.10.2/32	65100	swp13-14	swp15-16	44:38:39:BE:EF:AA	32768	10.0.0.10
2	leaf2a	10.10.10.3/32	65101	swp13-14	swp15-16	44:38:39:BE:EF:BB	1000	10.0.0.20
2	leaf2b	10.10.10.4/32	65101	swp13-14	swp15-16	44:38:39:BE:EF:BB	32768	10.0.0.20

L3-Routed VLANs
VLAN ID	Virtual MAC	Virtual IP	Primary Router IP	Secondary Router IP	Purpose
10	00:00:5E:00:01:00	172.16.0.254/24	172.16.0.252/24	172.16.0.253/24	Internal_API
20	00:00:5E:00:01:00	172.17.0.254/24	172.17.0.252/24	172.17.0.253/24	Storage
30	00:00:5E:00:01:00	172.18.0.254/24	172.18.0.252/24	172.18.0.253/24	Storage Mgmt
40	00:00:5E:00:01:00	172.19.0.254/24	172.19.0.252/24	172.19.0.253/24	Tenant
11	00:00:5E:00:01:01	172.16.1.254/24	172.16.1.252/24	172.16.1.253/24	Internal_API
21	00:00:5E:00:01:01	172.17.1.254/24	172.17.1.252/24	172.17.1.253/24	Storage
31	00:00:5E:00:01:01	172.18.1.254/24	172.18.1.252/24	172.18.1.253/24	Storage Mgmt
41	00:00:5E:00:01:01	172.19.1.254/24	172.19.1.252/24	172.19.1.253/24	Tenant

L2-Stretched VLANs (EVPN)
VLAN ID	VNI	Used Subnet	Purpose
50	10050	192.168.24.0/24	Provisioning (generated by the undercloud node)
60	10060	172.60.0.0/24	External/Internet Access The undercloud node has an address in this subnet (172.60.0.1) and its default gateway is 172.60.0.254

Server Ports
Rack	VLAN ID	Access Ports	Trunk Ports	Network Purpose
1	10		swp1-5	Internal API
1	20		swp1-5	Storage
1	30		swp1-5	Storage Mgmt
1	40		swp1-5	Tenant
1	50	swp1-5		Provisioning/Mgmt (Undercloud-Overcloud)
1	60		swp1-5	External
2	11		swp1	Internal API
2	21		swp1	Storage
2	31		swp1	Storage Mgmt
2	41		swp1	Tenant
2	50	swp1		Provisioning/Mgmt (Undercloud-Overcloud)
2	60		swp1	External

Server Wiring
Rack1	Director Node (Undercloud)	ens2f0 → Leaf1A, swp1 ens2f1 → Leaf1B, swp1
Controller-1	ens2f0 → Leaf1A, swp2 ens2f1 → Leaf1B, swp2
Controller-2	ens2f0 → Leaf1A, swp3 ens2f1 → Leaf1B, swp3
Controller-3	ens2f0 → Leaf1A, swp4 ens2f1 → Leaf1B, swp4
Compute-1	enp57s0f0 → Leaf1A, swp5 enp57s0f1 → Leaf1B, swp5
Rack2	Compute-2	enp57s0f1 → Leaf2A, swp1 enp57s0f0 → Leaf2A, swp1

Server Wiring

Rack1

Director Node (Undercloud)

ens2f0 → Leaf1A, swp1

ens2f1 → Leaf1B, swp1

Controller-1

ens2f0 → Leaf1A, swp2

ens2f1 → Leaf1B, swp2

Controller-2

ens2f0 → Leaf1A, swp3

ens2f1 → Leaf1B, swp3

Controller-3

ens2f0 → Leaf1A, swp4

ens2f1 → Leaf1B, swp4

Compute-1

enp57s0f0 → Leaf1A, swp5

enp57s0f1 → Leaf1B, swp5

Rack2

Compute-2

enp57s0f1 → Leaf2A, swp1

enp57s0f0 → Leaf2A, swp1

Wiring

The wiring principal for the high-speed Ethernet fabric is as follows:

Each server in the racks is wired to two leaf(or "TOR") switch
Leaf switches are interconnected using two ports (to create an MLAG)
Every leaf is wired to all the spines

Below is the full wiring diagram for the demonstrated fabric:

Network/Fabric

Updating Cumulus Linux

As a best practice, make sure to use the latest released Cumulus Linux NOS version.

Please see this guide on how to upgrade Cumulus Linux.

Configuring the Cumulus Linux switch

Make sure your Cumulus Linux switch has passed its initial configuration stages (for additional information, see the Quick-Start Guide for version 4.4):

License installation
Creation of switch interfaces (e.g., swp1-32)

The configuration of the spine switches includes the following:

Routing & BGP
EVPN
Downlinks configuration

The configuration of the leaf switches includes the following:

Routing & BGP
EVPN
Uplinks configuration
MLAG
L3-Routed VLANs
L2-Stretched VLANs
Bonds configuration

Following are the configuration commands:

Warning

The configuration below does not include the connection of an external router/gateway for external/Internet connectivity
The "external" network (stretched L2 over VLAN 60) is assumed to have a gateway (at 172.60.0.254) and to provide Internet connectivity for the cluster
The director (undercloud) node has an interface on VLAN 60 (bond0.60 with address 172.60.0.1)

Spine1 Console

Copy
Copied!

            
            net add loopback lo ip address 10.10.10.101/32    
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp autonomous-system 65199
net add bgp router-id 10.10.10.101   
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add interface swp1 mtu 9216
net add bgp neighbor swp1 interface peer-group underlay
net add interface swp2 mtu 9216
net add bgp neighbor swp2 interface peer-group underlay
net add interface swp3 mtu 9216
net add bgp neighbor swp3 interface peer-group underlay
net add interface swp4 mtu 9216
net add bgp neighbor swp4 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn  neighbor underlay activate
net add bgp l2vpn evpn  advertise-all-vni
net commit

Spine2 Console

Copy
Copied!

            
            net add loopback lo ip address 10.10.10.102/32    
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp autonomous-system 65199
net add bgp router-id 10.10.10.102   
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add interface swp1 mtu 9216
net add bgp neighbor swp1 interface peer-group underlay
net add interface swp2 mtu 9216
net add bgp neighbor swp2 interface peer-group underlay
net add interface swp3 mtu 9216
net add bgp neighbor swp3 interface peer-group underlay
net add interface swp4 mtu 9216
net add bgp neighbor swp4 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn  neighbor underlay activate
net add bgp l2vpn evpn  advertise-all-vni
net commit

Leaf1A Console

Copy
Copied!

            
            net add interface swp15,swp16 mtu 9216
net add bond peerlink bond slaves swp15,swp16
net add interface swp1 mtu 9216
net add bond bond1 bond slaves swp1
net add bond bond1 clag id 1
net add interface swp2 mtu 9216
net add bond bond2 bond slaves swp2
net add bond bond2 clag id 2
net add interface swp3 mtu 9216
net add bond bond3 bond slaves swp3
net add bond bond3 clag id 3
net add interface swp4 mtu 9216
net add bond bond4 bond slaves swp4
net add bond bond4 clag id 4
net add interface swp5 mtu 9216
net add bond bond5 bond slaves swp5
net add bond bond5 clag id 5
net add bond bond1,bond2,bond3,bond4,bond5 bond lacp-bypass-allow
net add bond bond1,bond2,bond3,bond4,bond5 stp bpduguard
net add bond bond1,bond2,bond3,bond4,bond5 stp portadminedge
net add bridge bridge ports bond1,bond2,bond3,bond4,bond5,peerlink
net add bridge bridge vlan-aware
net add bgp autonomous-system 65100    
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp router-id 10.10.10.1
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor peerlink.4094 interface remote-as internal
net add interface swp13 mtu 9216
net add bgp neighbor swp13 interface peer-group underlay
net add interface swp14 mtu 9216
net add bgp neighbor swp14 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add loopback lo ip address 10.10.10.1/32   
net add interface peerlink.4094 clag backup-ip 10.10.10.2  
net add interface peerlink.4094 clag priority 1000  
net add interface peerlink.4094 clag sys-mac 44:38:39:BE:EF:AA   
net add interface peerlink.4094 clag peer-ip linklocal
net add interface peerlink.4094 clag args --initDelay 10
net add bridge bridge vids 10  
net add vlan 10 ip address 172.16.0.252/24
net add vlan 10 ip address-virtual 00:00:5E:00:01:00 172.16.0.254/24    
net add vlan 10 vlan-id 10    
net add vlan 10 vlan-raw-device bridge
net add bridge bridge vids 20  
net add vlan 20 ip address 172.17.0.252/24
net add vlan 20 ip address-virtual 00:00:5E:00:01:00 172.17.0.254/24    
net add vlan 20 vlan-id 20    
net add vlan 20 vlan-raw-device bridge
net add bridge bridge vids 30  
net add vlan 30 ip address 172.18.0.252/24
net add vlan 30 ip address-virtual 00:00:5E:00:01:00 172.18.0.254/24    
net add vlan 30 vlan-id 30    
net add vlan 30 vlan-raw-device bridge
net add bridge bridge vids 40  
net add vlan 40 ip address 172.19.0.252/24
net add vlan 40 ip address-virtual 00:00:5E:00:01:00 172.19.0.254/24    
net add vlan 40 vlan-id 40    
net add vlan 40 vlan-raw-device bridge
net add bgp l2vpn evpn  neighbor underlay activate
net add bgp l2vpn evpn  neighbor peerlink.4094 activate
net add bgp l2vpn evpn  advertise-all-vni
net add bgp l2vpn evpn  advertise-default-gw
net add bgp l2vpn evpn  advertise ipv4 unicast
net add bgp ipv6 unicast neighbor underlay activate
net add loopback lo clag vxlan-anycast-ip 10.0.0.10
net add loopback lo vxlan local-tunnelip 10.10.10.1
net add vlan 60 vlan-id 60
net add vlan 60 vlan-raw-device bridge
net add vlan 50 vlan-id 50
net add vlan 50 vlan-raw-device bridge
net add vxlan vtep60 vxlan id 10060
net add vxlan vtep50 vxlan id 10050
net add vxlan vtep60,50 bridge arp-nd-suppress on
net add vxlan vtep60,50 bridge learning off
net add vxlan vtep60,50 stp bpduguard
net add vxlan vtep60,50 stp portbpdufilter
net add vxlan vtep60,50 vxlan local-tunnelip 10.10.10.1
net add vxlan vtep60 bridge access 60
net add vxlan vtep50 bridge access 50
net add bridge bridge ports vtep60,vtep50
net add bridge bridge vids 60,50
net add bond bond1-5 bridge pvid 50
net commit

Leaf1B Console

Copy
Copied!

            
            net add interface swp15,swp16 mtu 9216
net add bond peerlink bond slaves swp15,swp16
net add interface swp1 mtu 9216
net add bond bond1 bond slaves swp1
net add bond bond1 clag id 1
net add interface swp2 mtu 9216
net add bond bond2 bond slaves swp2
net add bond bond2 clag id 2
net add interface swp3 mtu 9216
net add bond bond3 bond slaves swp3
net add bond bond3 clag id 3
net add interface swp4 mtu 9216
net add bond bond4 bond slaves swp4
net add bond bond4 clag id 4
net add interface swp5 mtu 9216
net add bond bond5 bond slaves swp5
net add bond bond5 clag id 5
net add bond bond1,bond2,bond3,bond4,bond5 bond lacp-bypass-allow
net add bond bond1,bond2,bond3,bond4,bond5 stp bpduguard
net add bond bond1,bond2,bond3,bond4,bond5 stp portadminedge
net add bridge bridge ports bond1,bond2,bond3,bond4,bond5,peerlink
net add bridge bridge vlan-aware
net add bgp autonomous-system 65100    
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp router-id 10.10.10.2
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor peerlink.4094 interface remote-as internal
net add interface swp13 mtu 9216
net add bgp neighbor swp13 interface peer-group underlay
net add interface swp14 mtu 9216
net add bgp neighbor swp14 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add loopback lo ip address 10.10.10.2/32   
net add interface peerlink.4094 clag backup-ip 10.10.10.1  
net add interface peerlink.4094 clag priority 32768  
net add interface peerlink.4094 clag sys-mac 44:38:39:BE:EF:AA   
net add interface peerlink.4094 clag peer-ip linklocal
net add interface peerlink.4094 clag args --initDelay 10
net add bridge bridge vids 10  
net add vlan 10 ip address 172.16.0.253/24
net add vlan 10 ip address-virtual 00:00:5E:00:01:00 172.16.0.254/24    
net add vlan 10 vlan-id 10    
net add vlan 10 vlan-raw-device bridge
net add bridge bridge vids 20  
net add vlan 20 ip address 172.17.0.253/24
net add vlan 20 ip address-virtual 00:00:5E:00:01:00 172.17.0.254/24    
net add vlan 20 vlan-id 20    
net add vlan 20 vlan-raw-device bridge
net add bridge bridge vids 30  
net add vlan 30 ip address 172.18.0.253/24
net add vlan 30 ip address-virtual 00:00:5E:00:01:00 172.18.0.254/24    
net add vlan 30 vlan-id 30    
net add vlan 30 vlan-raw-device bridge
net add bridge bridge vids 40  
net add vlan 40 ip address 172.19.0.253/24
net add vlan 40 ip address-virtual 00:00:5E:00:01:00 172.19.0.254/24    
net add vlan 40 vlan-id 40    
net add vlan 40 vlan-raw-device bridge
net add bgp l2vpn evpn  neighbor underlay activate
net add bgp l2vpn evpn  neighbor peerlink.4094 activate
net add bgp l2vpn evpn  advertise-all-vni
net add bgp l2vpn evpn  advertise-default-gw
net add bgp l2vpn evpn  advertise ipv4 unicast
net add bgp ipv6 unicast neighbor underlay activate
net add loopback lo clag vxlan-anycast-ip 10.0.0.10
net add loopback lo vxlan local-tunnelip 10.10.10.2
net add vlan 60 vlan-id 60
net add vlan 60 vlan-raw-device bridge
net add vlan 50 vlan-id 50
net add vlan 50 vlan-raw-device bridge
net add vxlan vtep60 vxlan id 10060
net add vxlan vtep50 vxlan id 10050
net add vxlan vtep60,50 bridge arp-nd-suppress on
net add vxlan vtep60,50 bridge learning off
net add vxlan vtep60,50 stp bpduguard
net add vxlan vtep60,50 stp portbpdufilter
net add vxlan vtep60,50 vxlan local-tunnelip 10.10.10.2
net add vxlan vtep60 bridge access 60
net add vxlan vtep50 bridge access 50
net add bridge bridge ports vtep60,vtep50
net add bridge bridge vids 60,50
net add bond bond1-5 bridge pvid 50
net commit

Leaf2A Console

Copy
Copied!

            
            net add interface swp15,swp16 mtu 9216
net add bond peerlink bond slaves swp15,swp16
net add interface swp1 mtu 9216
net add bond bond1 bond slaves swp1
net add bond bond1 clag id 1
net add bond bond1 bond lacp-bypass-allow
net add bond bond1 stp bpduguard
net add bond bond1 stp portadminedge
net add bridge bridge ports bond1,peerlink
net add bridge bridge vlan-aware
net add bgp autonomous-system 65101    
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp router-id 10.10.10.3
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor peerlink.4094 interface remote-as internal
net add interface swp13 mtu 9216
net add bgp neighbor swp13 interface peer-group underlay
net add interface swp14 mtu 9216
net add bgp neighbor swp14 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add loopback lo ip address 10.10.10.3/32   
net add interface peerlink.4094 clag backup-ip 10.10.10.4  
net add interface peerlink.4094 clag priority 1000  
net add interface peerlink.4094 clag sys-mac 44:38:39:BE:EF:BB   
net add interface peerlink.4094 clag peer-ip linklocal
net add interface peerlink.4094 clag args --initDelay 10
net add bridge bridge vids 11  
net add vlan 11 ip address 172.16.1.252/24
net add vlan 11 ip address-virtual 00:00:5E:00:01:01 172.16.1.254/24    
net add vlan 11 vlan-id 11    
net add vlan 11 vlan-raw-device bridge
net add bridge bridge vids 21  
net add vlan 21 ip address 172.17.1.252/24
net add vlan 21 ip address-virtual 00:00:5E:00:01:01 172.17.1.254/24    
net add vlan 21 vlan-id 21    
net add vlan 21 vlan-raw-device bridge
net add bridge bridge vids 31  
net add vlan 31 ip address 172.18.1.252/24
net add vlan 31 ip address-virtual 00:00:5E:00:01:01 172.18.1.254/24    
net add vlan 31 vlan-id 31    
net add vlan 31 vlan-raw-device bridge
net add bridge bridge vids 41  
net add vlan 41 ip address 172.19.1.252/24
net add vlan 41 ip address-virtual 00:00:5E:00:01:01 172.19.1.254/24    
net add vlan 41 vlan-id 41    
net add vlan 41 vlan-raw-device bridge
net add bgp l2vpn evpn  neighbor underlay activate
net add bgp l2vpn evpn  neighbor peerlink.4094 activate
net add bgp l2vpn evpn  advertise-all-vni
net add bgp l2vpn evpn  advertise-default-gw
net add bgp l2vpn evpn  advertise ipv4 unicast
net add bgp ipv6 unicast neighbor underlay activate
net add loopback lo clag vxlan-anycast-ip 10.0.0.20
net add loopback lo vxlan local-tunnelip 10.10.10.3
net add vlan 60 vlan-id 60
net add vlan 60 vlan-raw-device bridge
net add vlan 50 vlan-id 50
net add vlan 50 vlan-raw-device bridge
net add vxlan vtep60 vxlan id 10060
net add vxlan vtep50 vxlan id 10050
net add vxlan vtep60,50 bridge arp-nd-suppress on
net add vxlan vtep60,50 bridge learning off
net add vxlan vtep60,50 stp bpduguard
net add vxlan vtep60,50 stp portbpdufilter
net add vxlan vtep60,50 vxlan local-tunnelip 10.10.10.3
net add vxlan vtep60 bridge access 60
net add vxlan vtep50 bridge access 50
net add bridge bridge ports vtep60,vtep50
net add bridge bridge vids 60,50
net add bond bond1 bridge pvid 50
net commit

Leaf2B Console

Copy
Copied!

            
            net add interface swp15,swp16 mtu 9216
net add bond peerlink bond slaves swp15,swp16
net add interface swp1 mtu 9216
net add bond bond1 bond slaves swp1
net add bond bond1 clag id 1
net add bond bond1 bond lacp-bypass-allow
net add bond bond1 stp bpduguard
net add bond bond1 stp portadminedge
net add bridge bridge ports bond1,peerlink
net add bridge bridge vlan-aware
net add bgp autonomous-system 65101    
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp router-id 10.10.10.4
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor peerlink.4094 interface remote-as internal
net add interface swp13 mtu 9216
net add bgp neighbor swp13 interface peer-group underlay
net add interface swp14 mtu 9216
net add bgp neighbor swp14 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add loopback lo ip address 10.10.10.4/32   
net add interface peerlink.4094 clag backup-ip 10.10.10.3  
net add interface peerlink.4094 clag priority 32768  
net add interface peerlink.4094 clag sys-mac 44:38:39:BE:EF:BB   
net add interface peerlink.4094 clag peer-ip linklocal
net add interface peerlink.4094 clag args --initDelay 10
net add bridge bridge vids 11  
net add vlan 11 ip address 172.16.1.253/24
net add vlan 11 ip address-virtual 00:00:5E:00:01:01 172.16.1.254/24    
net add vlan 11 vlan-id 11    
net add vlan 11 vlan-raw-device bridge
net add bridge bridge vids 21  
net add vlan 21 ip address 172.17.1.253/24
net add vlan 21 ip address-virtual 00:00:5E:00:01:01 172.17.1.254/24    
net add vlan 21 vlan-id 21    
net add vlan 21 vlan-raw-device bridge
net add bridge bridge vids 31  
net add vlan 31 ip address 172.18.1.253/24
net add vlan 31 ip address-virtual 00:00:5E:00:01:01 172.18.1.254/24    
net add vlan 31 vlan-id 31    
net add vlan 31 vlan-raw-device bridge
net add bridge bridge vids 41  
net add vlan 41 ip address 172.19.1.253/24
net add vlan 41 ip address-virtual 00:00:5E:00:01:01 172.19.1.254/24    
net add vlan 41 vlan-id 41    
net add vlan 41 vlan-raw-device bridge
net add bgp l2vpn evpn  neighbor underlay activate
net add bgp l2vpn evpn  neighbor peerlink.4094 activate
net add bgp l2vpn evpn  advertise-all-vni
net add bgp l2vpn evpn  advertise-default-gw
net add bgp l2vpn evpn  advertise ipv4 unicast
net add bgp ipv6 unicast neighbor underlay activate
net add loopback lo clag vxlan-anycast-ip 10.0.0.20
net add loopback lo vxlan local-tunnelip 10.10.10.4
net add vlan 60 vlan-id 60
net add vlan 60 vlan-raw-device bridge
net add vlan 50 vlan-id 50
net add vlan 50 vlan-raw-device bridge
net add vxlan vtep60 vxlan id 10060
net add vxlan vtep50 vxlan id 10050
net add vxlan vtep60,50 bridge arp-nd-suppress on
net add vxlan vtep60,50 bridge learning off
net add vxlan vtep60,50 stp bpduguard
net add vxlan vtep60,50 stp portbpdufilter
net add vxlan vtep60,50 vxlan local-tunnelip 10.10.10.4
net add vxlan vtep60 bridge access 60
net add vxlan vtep50 bridge access 50
net add bridge bridge ports vtep60,vtep50
net add bridge bridge vids 60,50
net add bond bond1 bridge pvid 50
net commit

Repeat the following commands on all the leafs:

Leaf Switch Console

Copy
Copied!

            
            cumulus@leaf1a:mgmt:~$ net show clag
The peer is alive
     Our Priority, ID, and Role: 1000 1c:34:da:b4:09:40 primary
    Peer Priority, ID, and Role: 32768 1c:34:da:b4:06:40 secondary
          Peer Interface and IP: peerlink.4094 fe80::1e34:daff:feb4:640 (linklocal)
               VxLAN Anycast IP: 10.0.0.10
                      Backup IP: 10.10.10.2 (active)
                     System MAC: 44:38:39:be:ef:aa
 
CLAG Interfaces
Our Interface      Peer Interface     CLAG Id   Conflicts              Proto-Down Reason
----------------   ----------------   -------   --------------------   -----------------
           bond1   bond1              1         -                      -              
           bond2   bond2              2         -                      -              
           bond3   bond3              3         -                      -              
           bond4   bond4              4         -                      -              
           bond5   bond5              5         -                      -              
          vtep50   vtep50             -         -                      -              
          vtep60   vtep60             -         -                      -     
 
cumulus@leaf1a:mgmt:~$ net show route
show ip route
=============
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure
C>* 10.0.0.10/32 is directly connected, lo, 01w5d21h
B>* 10.0.0.20/32 [20/0] via fe80::1e34:daff:feb3:ff6c, swp30, weight 1, 01w5d21h
  *                     via fe80::1e34:daff:feb4:6c, swp29, weight 1, 01w5d21h
C>* 10.10.10.1/32 is directly connected, lo, 01w6d19h
B>* 10.10.10.2/32 [200/0] via fe80::1e34:daff:feb4:640, peerlink.4094, weight 1, 01w5d21h
B>* 10.10.10.3/32 [20/0] via fe80::1e34:daff:feb3:ff6c, swp30, weight 1, 01w5d21h
  *                      via fe80::1e34:daff:feb4:6c, swp29, weight 1, 01w5d21h
B>* 10.10.10.4/32 [20/0] via fe80::1e34:daff:feb3:ff6c, swp30, weight 1, 01w5d21h
  *                      via fe80::1e34:daff:feb4:6c, swp29, weight 1, 01w5d21h
B>* 10.10.10.101/32 [20/0] via fe80::1e34:daff:feb4:6c, swp29, weight 1, 01w5d21h
B>* 10.10.10.102/32 [20/0] via fe80::1e34:daff:feb3:ff6c, swp30, weight 1, 01w5d21h
C * 172.16.0.0/24 [0/1024] is directly connected, vlan10-v0, 01w5d21h
C>* 172.16.0.0/24 is directly connected, vlan10, 01w5d21h
B>* 172.16.1.0/24 [20/0] via fe80::1e34:daff:feb3:ff6c, swp30, weight 1, 01w5d21h
  *                      via fe80::1e34:daff:feb4:6c, swp29, weight 1, 01w5d21h
C * 172.17.0.0/24 [0/1024] is directly connected, vlan20-v0, 01w5d21h
C>* 172.17.0.0/24 is directly connected, vlan20, 01w5d21h
B>* 172.17.1.0/24 [20/0] via fe80::1e34:daff:feb3:ff6c, swp30, weight 1, 01w5d21h
  *                      via fe80::1e34:daff:feb4:6c, swp29, weight 1, 01w5d21h
C * 172.18.0.0/24 [0/1024] is directly connected, vlan30-v0, 01w5d21h
C>* 172.18.0.0/24 is directly connected, vlan30, 01w5d21h
B>* 172.18.1.0/24 [20/0] via fe80::1e34:daff:feb3:ff6c, swp30, weight 1, 01w5d21h
  *                      via fe80::1e34:daff:feb4:6c, swp29, weight 1, 01w5d21h
C * 172.19.0.0/24 [0/1024] is directly connected, vlan40-v0, 01w5d21h
C>* 172.19.0.0/24 is directly connected, vlan40, 01w5d21h
B>* 172.19.1.0/24 [20/0] via fe80::1e34:daff:feb3:ff6c, swp30, weight 1, 01w5d21h
  *                      via fe80::1e34:daff:feb4:6c, swp29, weight 1, 01w5d21h
 
 
 
show ipv6 route
===============
Codes: K - kernel route, C - connected, S - static, R - RIPng,
       O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
       v - VNC, V - VNC-Direct, A - Babel, D - SHARP, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure
C * fe80::/64 is directly connected, vlan50, 01w5d21h
C * fe80::/64 is directly connected, vlan60, 01w5d21h
C * fe80::/64 is directly connected, vlan40, 01w5d21h
C * fe80::/64 is directly connected, vlan40-v0, 01w5d21h
C * fe80::/64 is directly connected, vlan30-v0, 01w6d19h
C * fe80::/64 is directly connected, vlan30, 01w6d19h
C * fe80::/64 is directly connected, vlan20-v0, 01w6d19h
C * fe80::/64 is directly connected, vlan20, 01w6d19h
C * fe80::/64 is directly connected, vlan10-v0, 01w6d19h
C * fe80::/64 is directly connected, vlan10, 01w6d19h
C * fe80::/64 is directly connected, bridge, 01w6d19h
C * fe80::/64 is directly connected, peerlink.4094, 01w6d19h
C * fe80::/64 is directly connected, swid0_eth, 01w6d19h
C * fe80::/64 is directly connected, swp30, 01w6d19h
C>* fe80::/64 is directly connected, swp29, 01w6d19h

Host Preparation

In order to achieve optimal results for DPDK use cases, the compute node must be correctly optimized for DPDK performance.

The optimization might require specific BIOS and NIC Firmware settings. Please refer to the official DPDK performance document provided by your CPU vendor.

For our deployment we used the following document from AMD: https://www.amd.com/system/files/documents/epyc-7Fx2-processors-dpdk-nw-performance-brief.pdf

Note

Please note that the host boot settings (Linux grub command line) are done later on as part of the overcloud image configuration and that the hugepages allocation is done on the actual VM used for testing.

OpenStack Deployment

The RedHat OpenStack Platform (RHOSP) version 16.1 will be used.

The deployment is divided into three major steps:

Installing the director node
Deploying the undercloud on the director node
Deploying the overcloud using the undercloud

Important

Make sure that the BIOS settings on the worker nodes servers have SR-IOV enabled and that the servers are tuned for maximum performance.

Important

All nodes which belong to the same profile (e.g., controller, compute) must have the same PCIe placement for the NIC and must expose the same interface name.

The director node needs to have the following network interfaces:

Untagged over bond0—used for provisioning the overcloud nodes (connected to stretched VLAN 50 on the switch)
Tagged VLAN 60 over bond0.60—used for Internet access over the external network segment (connected to stretched VLAN 60 on the switch)
An interface on the IPMI network—used for accessing to the bare metal nodes BMCs

Undercloud Director Installation

Follow RH-OSP Preparing for Director Installation up to Preparing Container Images.
Use the following environment file for OVS-based RH-OSP 16.1 container preparation. Remember to update your Red Hat registry credentials:

/home/stack/containers-prepare-parameter.yaml

Copy
Copied!

            
            #global
parameter_defaults:
  ContainerImagePrepare:
  - push_destination: true
    excludes:
      - ceph
      - prometheus
    set:
      name_prefix: openstack-
      name_suffix: ''
      namespace: registry.redhat.io/rhosp-rhel8
      neutron_driver: null
      rhel_containers: false
      tag: '16.1'
    tag_from_label: '{version}-{release}'
  ContainerImageRegistryCredentials:
    registry.redhat.io:
      '<username>': '<password>'

Proceed with the director installation steps, described in RH-OSP Installing Director, up to the RH-OSP Installing Director execution.
The following undercloud configuration file was used in our deployment:

/home/stack/undercloud.conf

Copy
Copied!

            
            [DEFAULT]
undercloud_hostname = rhosp-director.localdomain
local_ip = 192.168.24.1/24
network_gateway = 192.168.24.1
undercloud_public_host = 192.168.24.2
undercloud_admin_host = 192.168.24.3
undercloud_nameservers = 8.8.8.8,8.8.4.4
undercloud_ntp_servers = 10.211.0.134,10.211.0.124
subnets = ctlplane-subnet
local_subnet = ctlplane-subnet
generate_service_certificate = True
certificate_generation_ca = local
local_interface = bond0
inspection_interface = br-ctlplane
undercloud_debug = true
enable_tempest = false
enable_telemetry = false
enable_validations = true
enable_novajoin = false
clean_nodes = true
custom_env_files = /home/stack/custom-undercloud-params.yaml,/home/stack/ironic_ipmitool.yaml
container_images_file = /home/stack/containers-prepare-parameter.yaml
 
[auth]
 
[ctlplane-subnet]
cidr = 192.168.24.0/24
dhcp_start = 192.168.24.5
dhcp_end = 192.168.24.30
inspection_iprange = 192.168.24.100,192.168.24.120
gateway = 192.168.24.1
masquerade = true

Follow the instructions in RH-OSP Obtain Images for Overcloud Nodes (section 4.9.1, steps 1–3), without importing the images into the director yet.
Once obtained, follow the customization process described in RH-OSP Working with Overcloud Images:
1. Start from 3.3. QCOW: Installing virt-customize to director
2. Skip 3.4
3. Run 3.5. QCOW: Setting the root password (optional)
4. Run 3.6. QCOW: Registering the image
5. Run the following command to locate your subscription pool ID:
  Copy
  
  Copied!
```
            
            $ sudo subscription-manager list --available --all --matches="Red Hat OpenStack"
        
```
6. Replace [subscription-pool] in the below command with your relevant subscription pool ID:
  Copy
  
  Copied!
```
            
            $ virt-customize --selinux-relabel -a overcloud-full.qcow2 --run-command 'subscription-manager attach --pool [subscription-pool]'
        
```
7. Skip 3.7 and 3.8
8. Run the following command to add mstflint to overcloud image to allow the NIC firmware provisioning during overcloud deployment (similar to 3.9).
  Copy
  
  Copied!
```
            
            $ virt-customize --selinux-relabel -a overcloud-full.qcow2 --install mstflint
        
```
  Note
  
  mstflint is required for the overcloud nodes to support the automatic NIC firmware upgrade by the cloud orchestration system during deployment.
9. Run 3.10. QCOW: Cleaning the subscription pool
10. Run 3.11. QCOW: Unregistering the image
11. Run 3.12. QCOW: Reset the machine ID
12. Run 3.13. Uploading the images to director

Undercloud Director Preparation for Automatic NIC Firmware Provisioning

Download the latest ConnectX NIC firmware binary file (fw-<NIC-Model>.bin) from NVIDIA Networking Firmware Download Site.
Create a directory named mlnx_fw under /var/lib/ironic/httpboot/ in the Director node, and place the firmware binary file in it.
Extract the connectx_first_boot.yaml file from the configuration files attached to this guide, and place it in the /home/stack/templates/ directory in the Director node

Warning

The connectx_first_boot.yaml file is called by another deployment configuration file ( env-ovs-dvr.yaml) , so please use the instructed location, or change the configuration files accordingly.

Overcloud Nodes Introspection

A full overcloud introspection procedure is described in RH-OSP Configuring a Basic Overcloud. In this RDG, the following configuration steps were used for introspecting overcloud bare metal nodes to be deployed later-on over two routed Spine-Leaf racks:

Prepare a bare metal inventory file - instackenv.json, with the overcloud nodes information. In this case, the inventory file is listing 5 bare metal nodes to be deployed as overcloud nodes: 3 controller nodes and 2 compute nodes (1 in each routed rack). Make sure to update the file with the IPMI servers addresses and credentials and with the MAC address of one of the physical ports in order to avoid issues in the introspection process:

instackenv.json

Copy
Copied!

            
            {
    "nodes": [
        {
            "name": "controller-1",
            "pm_type":"ipmi",
            "pm_user":"root",
            "pm_password":"******",
            "pm_addr":"10.7.214.1",
			"ports":[
                {
                    "address":"<MAC_ADDRESS>", 
                    "pxe_enabled":true
                }
            ]
        },
        {
            "name": "controller-2",
            "pm_type":"ipmi",
            "pm_user":"root",
            "pm_password":"******",
            "pm_addr":"10.7.214.2",
			"ports":[
                {
                    "address":"<MAC_ADDRESS>", 
                    "pxe_enabled":true
                }
            ]
        },
        {
            "name": "controller-3",
            "pm_type":"ipmi",
            "pm_user":"root",
            "pm_password":"******",
            "pm_addr":"10.7.214.3",
			"ports":[
                {
                    "address":"<MAC_ADDRESS>", 
                    "pxe_enabled":true
                }
            ]
        },
        {
            "name": "compute-1",
            "pm_type":"ipmi",
            "pm_user":"root",
            "pm_password":"******",
            "pm_addr":"10.7.214.4",
			"ports":[
                {
                    "address":"<MAC_ADDRESS>", 
                    "pxe_enabled":true
                }
            ]
        },
        {
            "name": "compute-2",
            "pm_type":"ipmi",
            "pm_user":"root",
            "pm_password":"******",
            "pm_addr":"10.7.214.5",
			"ports":[
                {
                    "address":"<MAC_ADDRESS>", 
                    "pxe_enabled":true
                }
            ]
        }
    ]
}

Import the overcloud baremetal nodes inventory, and wait until all nodes are listed in "manageable" state.

Copy
Copied!

            
            [stack@rhosp-director ~]$ source ~/stackrc
(undercloud) [stack@rhosp-director ~]$ openstack overcloud node import /home/stack/instackenv.json

Copy
Copied!

            
            $ openstack baremetal node list
+--------------------------------------+--------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+---------------+-------------+--------------------+-------------+
| 476c7659-abc2-4d8c-9532-1756abbfd18a | controller-1 | None          | power off   | manageable         | False       |
| 3cbb74e5-6508-4ec8-91a8-870dbf28baed | controller-2 | None          | power off   | manageable         | False       |
| 457b329e-f1bc-476a-996d-eb82a56998e8 | controller-3 | None          | power off   | manageable         | False       |
| 870445b7-650f-40fc-8ac2-5c3df700ccdc | compute-1    | None          | power off   | manageable         | False       |
| baa7356b-11ca-4cb0-b58c-16c110bbbea0 | compute-2    | None          | power off   | manageable         | False       |
+--------------------------------------+--------------+---------------+-------------+--------------------+-------------+

Start the baremetal nodes introspection:

Copy
Copied!

            
            $ openstack overcloud node introspect --all-manageable

Set the root device for deployment, and provide all baremetal nodes to reach "available" state:

Copy
Copied!

            
            $ openstack overcloud node configure --all-manageable --instance-boot-option local --root-device largest
$ openstack overcloud node provide --all-manageable

Tag the controller nodes into the "control" profile, which is later mapped to the overcloud controller role:

Copy
Copied!

            
            $ openstack baremetal node set --property capabilities='profile:control,boot_option:local' controller-1
$ openstack baremetal node set --property capabilities='profile:control,boot_option:local' controller-2
$ openstack baremetal node set --property capabilities='profile:control,boot_option:local' controller-3

Warning

Note: The Role to Profile mapping is specified in the node-info.yaml file used during the overcloud deployment.

Create a new compute flavor, and tag one compute node into the "compute-r0" profile, which is later mapped to the overcloud "compute in rack 0" role:

Copy
Copied!

            
            $ openstack flavor create --id auto --ram 4096 --disk 40 --vcpus 1 compute-r0
$ openstack flavor set --property "capabilities:boot_option"="local" --property "capabilities:profile"="compute-r0" --property "resources:CUSTOM_BAREMETAL"="1" --property "resources:DISK_GB"="0" --property "resources:MEMORY_MB"="0" --property "resources:VCPU"="0" compute-r0
 
$ openstack baremetal node set --property capabilities='profile:compute-r0,boot_option:local' compute-1

Create a new compute flavor, and tag the last compute node into the "compute-r1" profile, which is later mapped to the overcloud "compute in rack 1" role:

Copy
Copied!

            
            $ openstack flavor create --id auto --ram 4096 --disk 40 --vcpus 1 compute-r1
$ openstack flavor set --property "capabilities:boot_option"="local" --property "capabilities:profile"="compute-r1" --property "resources:CUSTOM_BAREMETAL"="1" --property "resources:DISK_GB"="0" --property "resources:MEMORY_MB"="0" --property "resources:VCPU"="0" compute-r1
 
$ openstack baremetal node set --property capabilities='profile:compute-r1,boot_option:local' compute-2

Verify the overcloud nodes profiles allocation:

Copy
Copied!

            
            $ openstack overcloud profiles list
+--------------------------------------+--------------+-----------------+-----------------+-------------------+
| Node UUID                            | Node Name    | Provision State | Current Profile | Possible Profiles |
+--------------------------------------+--------------+-----------------+-----------------+-------------------+
| 476c7659-abc2-4d8c-9532-1756abbfd18a | controller-1 | available       | control         |                   |
| 3cbb74e5-6508-4ec8-91a8-870dbf28baed | controller-2 | available       | control         |                   |
| 457b329e-f1bc-476a-996d-eb82a56998e8 | controller-3 | available       | control         |                   |
| 870445b7-650f-40fc-8ac2-5c3df700ccdc | compute-1    | available       | compute-r0      |                   |
| baa7356b-11ca-4cb0-b58c-16c110bbbea0 | compute-2    | available       | compute-r1      |                   |
+--------------------------------------+--------------+-----------------+-----------------+-------------------+

Overcloud Deployment Configuration Files

Prepare the following cloud deployment configuration files, and place it under the /home/stack/templates/dvr directory.

Warning

Note: The full files are attached to this article, and can be found here: 47036708-1.0.0.zip

Some configuration files are customized specifically the to the /home/stack/templates/dvr location. If you place the template files in a different location, adjust it accordingly.

containers-prepare-parameter.yaml
network-environment-dvr.yaml
controller.yaml

Note

This template file contains the network settings for the controller nodes, including large MTU and bonding configuration
computesriov-dvr.yaml

Note

This template file contains the network settings for compute node, including SR-IOV VFs, large MTU and accelerated bonding (VF-LAG) configuration for data path.
node-info.yaml

Note

This environment file contains the count of nodes per role and the role to the baremetal profile mapping.
roles_data_dvr.yaml

Note

This environment file contains the services enabled on each cloud role and the networks associated with its rack location.
network_data.yaml

Note

This environment file contains a cloud network configuration for routed Spine-Leaf topology with large MTU. Rack0 and Rack1 L3 segments are listed as subnets of each cloud network. For further information refer to RH-OSP Configuring the Overcloud Leaf Networks.
env-ovs-dvr.yaml
Note
- This environment file contains the following settings:
  - Overcloud nodes time settings
  - ConnectX First Boot parameters (by calling /home/stack/templates/connectx_first_boot.yaml file)
  - Neutron Jumbo Frame MTU and DVR mode
  - Compute nodes CPU partitioning and isolation adjusted to Numa topology
  - Nova PCI passthrough settings adjusted to VXLAN hardware offload

Overcloud Deployment

Issue the overcloud deploy command to start cloud deployment with the prepared configuration files.

Copy
Copied!

            
            $ openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates \
--libvirt-type kvm \
-n /home/stack/templates/dvr/network_data.yaml \
-r /home/stack/templates/dvr/roles_data_dvr.yaml \
--validation-warnings-fatal \
-e /home/stack/templates/dvr/node-info.yaml \
-e /home/stack/templates/dvr/containers-prepare-parameter.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/neutron-ovs-dvr.yaml \
-e /home/stack/templates/dvr/network-environment-dvr.yaml \
-e /home/stack/templates/dvr/env-ovs-dvr.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml

Once deployed, load the necessary environment variables to interact with your overcloud:
Copy

Copied!
```
            
            $ source ~/overcloudrc
        
```

Applications and Use Cases

Accelerated Packet Processing (SDN Acceleration)

Warning

Note: The following use case is demonstrating SDN layer acceleration using hardware offload capabilities.
The appendix below includes benchmarks that demonstrate SDN offload performance and usability.

VM Image

Build a VM cloud image (qcow2) with packet processing performance tools and cloud-init elements as described in How-to: Create OpenStack Cloud Image with Performance Tools.

Upload the image to the overcloud image store:

Copy
Copied!

            
            $ openstack image create perf --public --disk-format qcow2 --container-format bare --file /home/stack/images/guest/centos8-perf.qcow2

VM Flavor

Create a flavor:

Copy
Copied!

            
            $ openstack flavor create m1.packet --id auto --ram 8192 --disk 20 --vcpus 10

Set hugepages and cpu-pinning parameters:

Copy
Copied!

            
            $ openstack flavor set m1.packet --property hw:mem_page_size=large
$ openstack flavor set m1.packet --property hw:cpu_policy=dedicated

VM Networks and Ports

Create a VXLAN network with normal ports to be used for instance management and access:

Copy
Copied!

            
            $ openstack network create vx_mgmt --provider-network-type vxlan --share
$ openstack subnet create vx_mgmt_subnet --dhcp --network vx_mgmt --subnet-range 22.22.22.0/24 --dns-nameserver 8.8.8.8
$ openstack port create normal1 --network vx_mgmt --no-security-group --disable-port-security
$ openstack port create normal2 --network vx_mgmt --no-security-group --disable-port-security

Create a VXLAN network to be used for accelerated data traffic between the VM instances with Jumbo Frames support:

Copy
Copied!

            
            $ openstack network create vx_data --provider-network-type vxlan --share --mtu 8950
$ openstack subnet create vx_data_subnet --dhcp --network vx_data --subnet-range 33.33.33.0/24 --gateway none

Create 2 x SR-IOV direct ports with hardware offload capabilities:

Copy
Copied!

            
            $ openstack port create direct1 --vnic-type=direct --network vx_data --binding-profile '{"capabilities":["switchdev"]}'
$ openstack port create direct2 --vnic-type=direct --network vx_data --binding-profile '{"capabilities":["switchdev"]}'

Create an external network for public access:

Copy
Copied!

            
            $ openstack network create public --provider-physical-network datacentre --provider-network-type vlan --provider-segment 60 --share --mtu 8950 --external
$ openstack subnet create public_subnet --dhcp --network public --subnet-range 172.60.0.0/24 --allocation-pool start=172.60.0.100,end=172.60.0.150 --gateway 172.60.0.254

Create a public router, and add the management network subnet:

Copy
Copied!

            
            $ openstack router create public_router
$ openstack router set public_router --external-gateway public
$ openstack router add subnet public_router vx_mgmt_subnet

VM Instances

Create a VM instance with a management port and a direct port on the compute node located on the Rack0 L3 network segment:

Copy
Copied!

            
            $ openstack server create --flavor m1.packet --image perf --port normal1 --port direct1 vm1 --availability-zone nova:overcloud-computesriov-rack0-0.localdomain

Create a VM instance with a management port and a direct port on the compute node located on the Rack1 L3 network segment:

Copy
Copied!

            
            $ openstack server create --flavor m1.packet --image perf --port normal2 --port direct2 vm2 --availability-zone nova:overcloud-computesriov-rack1-0.localdomain

Wait until the VM instances status is changed to ACTIVE:

Copy
Copied!

            
            $ openstack server list
+--------------------------------------+---------+--------+---------------------------------------------------------+-------+--------+
| ID                                   | Name    | Status | Networks                                                | Image | Flavor |
+--------------------------------------+---------+--------+---------------------------------------------------------+-------+--------+
| 000d3e17-9583-4885-aa2d-79c3b5df3cc8 | vm2     | ACTIVE | vx_data=33.33.33.244; vx_mgmt=22.22.22.151              | perf  |        |
| 150013ad-170d-4587-850e-70af692aa74c | vm1     | ACTIVE | vx_data=33.33.33.186; vx_mgmt=22.22.22.59               | perf  |        |
+--------------------------------------+---------+--------+---------------------------------------------------------+-------+--------+

In order to verify that external/public network access is functioning properly, create and assign external floating IP addresses for the VMs:

Copy
Copied!

            
            $ openstack floating ip create --floating-ip-address 172.60.0.151 public
$ openstack server add floating ip vm1 172.60.0.151
$ openstack floating ip create --floating-ip-address 172.60.0.152 public
$ openstack server add floating ip vm2 172.60.0.152

Appendix

Performance Testing

Warning

The performance results listed in this document are indicative and should not be considered as formal performance targets for NVIDIA products.

Warning

Note: The tools used in the below tests are included in the VM image built with the perf-tools element, as instructed in previous steps.

iperf TCP Test

On the Transmitter VM, disable XPS (Transmit Packet Steering):
Copy

Copied!
```
            
            # for i in {0..7}; do echo 0 > /sys/class/net/eth1/queues/tx-$i/xps_cpus;  done;
        
```
Note
- This step is required to allow packet distribution of the iperf traffic using ConnectX VF-LAG.
- Use interface statistic/counters on the Leaf switches' bond interfaces to confirm traffic is distributed by the Transmitter VM to both switches using VF-LAG (Cumulus "net show counters" command).

On the Receiver VM, start multiple iperf3 servers:

Copy
Copied!

            
            # iperf3 -s -p 5101&
# iperf3 -s -p 5102&
# iperf3 -s -p 5103&
# iperf3 -s -p 5104&

On the Transmitter VM, start multiple iperf3 clients for multi-core parallel streams:

Copy
Copied!

            
            # iperf3 -c 33.33.33.180 -T s1 -p 5101 -t 15 &
# iperf3 -c 33.33.33.180 -T s2 -p 5102 -t 15 &
# iperf3 -c 33.33.33.180 -T s3 -p 5103 -t 15 &
# iperf3 -c 33.33.33.180 -T s4 -p 5104 -t 15 &

Check the results:

Copy
Copied!

            
            s1:  - - - - - - - - - - - - - - - - - - - - - - - - -
s1:  [ ID] Interval           Transfer     Bitrate         Retr
s1:  [  5]   0.00-15.00  sec  38.4 GBytes  22.0 Gbits/sec  18239             sender
s1:  [  5]   0.00-15.04  sec  38.4 GBytes  22.0 Gbits/sec                  receiver
s1:  
s1:  iperf Done.
s2:  [  5]  14.00-15.00  sec  3.24 GBytes  27.8 Gbits/sec  997   1008 KBytes       
s2:  - - - - - - - - - - - - - - - - - - - - - - - - -
s2:  [ ID] Interval           Transfer     Bitrate         Retr
s2:  [  5]   0.00-15.00  sec  45.1 GBytes  25.8 Gbits/sec  22465             sender
s2:  [  5]   0.00-15.04  sec  45.1 GBytes  25.7 Gbits/sec                  receiver
s2:  
s2:  iperf Done.
s4:  [  5]  14.00-15.00  sec  2.26 GBytes  19.4 Gbits/sec  1435    443 KBytes       
s4:  - - - - - - - - - - - - - - - - - - - - - - - - -
s4:  [ ID] Interval           Transfer     Bitrate         Retr
s4:  [  5]   0.00-15.00  sec  46.5 GBytes  26.6 Gbits/sec  19964             sender
s4:  [  5]   0.00-15.04  sec  46.5 GBytes  26.6 Gbits/sec                  receiver
s4:  
s4:  iperf Done.
s3:  [  5]  14.00-15.00  sec  2.70 GBytes  23.2 Gbits/sec  1537    643 KBytes       
s3:  - - - - - - - - - - - - - - - - - - - - - - - - -
s3:  [ ID] Interval           Transfer     Bitrate         Retr
s3:  [  5]   0.00-15.00  sec  39.5 GBytes  22.6 Gbits/sec  18447             sender
s3:  [  5]   0.00-15.04  sec  39.5 GBytes  22.6 Gbits/sec                  receiver
s3:  
s3:  iperf Done.

Note

The test results above demonstrate a total of around 100Gbps line rate for IP TCP traffic.

Before proceeding to the next test, stop all iperf servers on the Receiver VM:

Copy
Copied!

            
            # killall iperf3
iperf3: interrupt - the server has terminated
iperf3: interrupt - the server has terminated
iperf3: interrupt - the server has terminated
iperf3: interrupt - the server has terminated

RoCE Bandwidth Test over the bond

On the Receiver VM, start ib_write_bw server with 2 QPs in order to utilize the VF-LAG infrastructure:
Copy

Copied!
```
            
            # ib_write_bw -R --report_gbits --qp 2
        
```

On the Transmitter VM, start ib_write_bw client with 2 QPs and a duration of 10 seconds:

Copy
Copied!

            
            # ib_write_bw -R --report_gbits --qp 2 -D 10 33.33.33.180

Check the test results:

Copy
Copied!

            
            ---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 2            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x016e PSN 0x3ab6d6
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:173
 local address: LID 0000 QPN 0x016f PSN 0xc9fa3f
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:173
 remote address: LID 0000 QPN 0x016e PSN 0x8d928c
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:180
 remote address: LID 0000 QPN 0x016f PSN 0xc89786
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:180
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      22065492         0.00               192.81             0.367756
---------------------------------------------------------------------------------------

Single-core DPDK packet-rate test over a single port

In this test, we used two VMs - one running the trex traffic generator and the other running the dpdk-testpmd tool.

The traffic generator sends 64 byte packets to the test-pmd tool and the tool sends them back (thus testing the DPDK packet-processing pipeline).

We tested single-core packet processing rate between two VMs using 2 tx and rx queues.

Note

Please note that this test requires using specific versions of DPDK and the VM image kernel and compatibility issues might exist when using the latest releases.

For this test we used a CentOS 7.8 image with kernel 3.10.0-1160.42.2 and DPDK 20.11.

On one VM we activated the dpdk-testpmd utility, this is the command line used to run testpmd (this command displays the MAC address of the port to be used later on with trex as the destination port MAC):

testpmd VM console

Copy
Copied!

            
            # ./dpdk-testpmd -c 0x1ff -n 4 -m 1024 -a 00:05.0 -- --burst=64 --txd=1024 --rxd=1024 --mbcache=512 --rxq=4 --txq=4 --nb-cores=1 --rss-udp --forward-mode=macswap -a -i

Next, we created a second VM with two direct ports (on the vx_data network) and used it for running the trex traffic generator (version 2.82):

We ran the DPDK port setup interactive wizard and used the MAC address of the TestPMD VM collected beforehand:

trex VM console

Copy
Copied!

            
            # cd /root/trex/v2.82
# ./dpdk_setup_ports.py -i
By default, IP based configuration file will be created. Do you want to use MAC based config? (y/N)y
+----+------+---------+-------------------+----------------------------------------------+------------+----------+----------+
| ID | NUMA |   PCI   |        MAC        |                     Name                     |   Driver   | Linux IF |  Active  |
+====+======+=========+===================+==============================================+============+==========+==========+
| 0  | -1   | 00:03.0 | fa:16:3e:11:3e:64 | Virtio network device                        | virtio-pci | eth0     | *Active* |
+----+------+---------+-------------------+----------------------------------------------+------------+----------+----------+
| 1  | -1   | 00:05.0 | fa:16:3e:cb:a4:82 | ConnectX Family mlx5Gen Virtual Function     | mlx5_core  | eth1     |          |
+----+------+---------+-------------------+----------------------------------------------+------------+----------+----------+
| 2  | -1   | 00:06.0 | fa:16:3e:58:79:e2 | ConnectX Family mlx5Gen Virtual Function     | mlx5_core  | eth2     |          |
+----+------+---------+-------------------+----------------------------------------------+------------+----------+----------+
Please choose an even number of interfaces from the list above, either by ID, PCI or Linux IF
Stateful will use order of interfaces: Client1 Server1 Client2 Server2 etc. for flows.
Stateless can be in any order.
Enter list of interfaces separated by space (for example: 1 3) : 1 2
 
For interface 1, assuming loopback to its dual interface 2.
Destination MAC is fa:16:3e:58:79:e2. Change it to MAC of DUT? (y/N).y
Please enter a new destination MAC of interface 1: FA:16:3E:32:5C:A4
For interface 2, assuming loopback to its dual interface 1.
Destination MAC is fa:16:3e:cb:a4:82. Change it to MAC of DUT? (y/N).y
Please enter a new destination MAC of interface 2: FA:16:3E:32:5C:A4
Print preview of generated config? (Y/n)
### Config file generated by dpdk_setup_ports.py ###
 
- version: 2
  interfaces: ['00:05.0', '00:06.0']
  port_info:
      - dest_mac: fa:16:3e:32:5c:a4
        src_mac:  fa:16:3e:cb:a4:82
      - dest_mac: fa:16:3e:32:5c:a4
        src_mac:  fa:16:3e:58:79:e2
 
  platform:
      master_thread_id: 0
      latency_thread_id: 1
      dual_if:
        - socket: 0
          threads: [2,3,4,5,6,7,8,9]
 
 
Save the config to file? (Y/n)y
Default filename is /etc/trex_cfg.yaml
Press ENTER to confirm or enter new file:
Saved to /etc/trex_cfg.yaml.

We create the following UDP packet stream configuration file under the /root/trex/v2.82 directory:

udp_rss.py

Copy
Copied!

            
            from trex_stl_lib.api import *
 
class STLS1(object):
 
    def create_stream (self):
        
        pkt = Ether()/IP(src="16.0.0.1",dst="48.0.0.1")/UDP(dport=12)/(22*'x')
                  
        vm = STLScVmRaw( [
                                STLVmFlowVar(name="v_port",
                                                min_value=4337,
                                                  max_value=5337,
                                                  size=2, op="inc"),
                                STLVmWrFlowVar(fv_name="v_port",
                                            pkt_offset= "UDP.sport" ),
                                STLVmFixChecksumHw(l3_offset="IP",l4_offset="UDP",l4_type=CTRexVmInsFixHwCs.L4_TYPE_UDP),
 
                            ]
                        )
 
        return STLStream(packet = STLPktBuilder(pkt = pkt ,vm = vm ) ,
                                mode = STLTXCont(pps = 8000000) )
 
 
    def get_streams (self, direction = 0, **kwargs):
        # create 1 stream
        return [ self.create_stream() ]
 
 
# dynamic load - used for trex console or simulator
def register():
    return STLS1()

Next, we ran the TRex application in the background over 8 out of 10 cores:

trex VM console

Copy
Copied!

            
            # nohup ./t-rex-64 --no-ofed-check -i -c 8 &

And finally ran the TRex Console:

trex VM console

Copy
Copied!

            
            # ./trex-console 
 
Using 'python' as Python interpeter
 
 
Connecting to RPC server on localhost:4501                   [SUCCESS]
 
 
Connecting to publisher server on localhost:4500             [SUCCESS]
 
 
Acquiring ports [0, 1]:                                      [SUCCESS]
 
 
Server Info:
 
Server version:   v2.82 @ STL
Server mode:      Stateless
Server CPU:       8 x AMD EPYC Processor (with IBPB)
Ports count:      2 x 100Gbps @ ConnectX Family mlx5Gen Virtual Function
 
-=TRex Console v3.0=-
 
Type 'help' or '?' for supported actions
 
trex>

In the TRex Console, entered the UI (TUI):

trex VM console

Copy
Copied!

            
            trex>tui

And started a 35MPPS stream using the stream configuration file created in previous steps:

trex VM console

Copy
Copied!

            
            tui>start -f udp_rss.py -m 35mpps -p 0

And these are the measured results on the testpmd VM (~32.5mpps):

testpmd tool console

Copy
Copied!

            
            testpmd> show port stats all
 
######################## NIC statistics for port 0 ########################
RX-packets: 486825081 RX-missed: 14768224 RX-bytes: 29209504860
RX-errors: 0
RX-nombuf: 0 
TX-packets: 486712558 TX-errors: 0 TX-bytes: 29202753480
 
Throughput (since last show)
Rx-pps: 32542689 Rx-bps: 15620491128
Tx-pps: 32542685 Tx-bps: 15620488872
############################################################################

Troubleshooting the Nodes in Case Network Is Not Functioning

In case there is a need to debug the overcloud nodes due to issues in the deployment, the only way to connect to them is using an SSH key that is placed in the director node.

Connecting through a BMC console is useless since there is no way to log in to the nodes using a username and a password.

Since the nodes are connected using a single network, there can be a situation in which they are completely inaccessible for troubleshooting in case there is an issue with the YAML files and so forth.

The solution for this is to add the following lines to the env-ovs-dvr.yaml file (under the mentioned sections) that will allow for setting a root password for the nodes so they become accessible from the BMC console:

Copy
Copied!

            
            resource_registry:
   OS::TripleO::NodeUserData: /usr/share/openstack-tripleo-heat-templates/firstboot/userdata_root_password.yaml
 
parameter_defaults:
  NodeRootPassword: "******"

The overcloud to e deleted and redeployed in order for the above changes to affect.

Warning

Note: this feature and the "Automatic NIC Firmware Provisioning" feature are mutually exclusive and can't be used in parallel as they both use the "NodeUserData" variable!

Warning

Note: Make sure to remove the above lines in case of a production cloud as they pose a security hazard

Once the nodes are accessible, an additional helpful step to troubleshoot network issues on the nodes would be to run the following script on the problematic nodes:

Copy
Copied!

            
            # os-net-config -c /etc/os-net-config/config.json -d

This script will try to reconfigure the network and will indicate any relevant issues.

Authors

	Itai Levy Over the past few years, Itai Levy has worked as a Solutions Architect and member of the NVIDIA Networking “Solutions Labs” team. Itai designs and executes cutting-edge solutions around Cloud Computing, SDN, SDS and Security. His main areas of expertise include NVIDIA BlueField Data Processing Unit (DPU) solutions and accelerated OpenStack/K8s platforms.

Itai Levy

Over the past few years, Itai Levy has worked as a Solutions Architect and member of the NVIDIA Networking “Solutions Labs” team. Itai designs and executes cutting-edge solutions around Cloud Computing, SDN, SDS and Security. His main areas of expertise include NVIDIA BlueField Data Processing Unit (DPU) solutions and accelerated OpenStack/K8s platforms.

	Shachar Dor Shachar Dor joined the Solutions Lab team after working more than ten years as a software architect at NVIDIA Networking (previously Mellanox Technologies), where he was responsible for the architecture of network management products and solutions. Shachar's focus is on networking technologies, especially around fabric bring-up, configuration, monitoring, and life-cycle management. Shachar has a strong background in software architecture, design, and programming through his work on multiple projects and technologies also prior to joining the company.

Shachar Dor

Shachar Dor joined the Solutions Lab team after working more than ten years as a software architect at NVIDIA Networking (previously Mellanox Technologies), where he was responsible for the architecture of network management products and solutions. Shachar's focus is on networking technologies, especially around fabric bring-up, configuration, monitoring, and life-cycle management.

Shachar has a strong background in software architecture, design, and programming through his work on multiple projects and technologies also prior to joining the company.

Spectrum Ethernet Switches Openstack HPC / Scientific Computing ASAP² Cumulus Linux RHEL and CoreOS Enterprise Cloud Services VF-LAG Ethernet RoCE testpmd ConnectX DPDK Red Hat SR-IOV

Last updated on Sep 12, 2023

On This Page

Spine1 Console

Spine2 Console

Leaf1A Console

Leaf1B Console

Leaf2A Console

Leaf2B Console

Leaf Switch Console

/home/stack/containers-prepare-parameter.yaml

/home/stack/undercloud.conf

instackenv.json

testpmd VM console

trex VM console

udp_rss.py

trex VM console

trex VM console

trex VM console

trex VM console

testpmd tool console