RDG for Red-Hat OpenStack Cloud over NVIDIA Converged High-Performance Ethernet Network


Scope

This Reference Deployment Guide (RDG) describes a practical and scalable Red Hat OpenStack deployment suitable for high-performance workloads.

The deployment utilizes a single physical network based on NVIDIA high-speed NICs and switches.

Abbreviations and Acronyms

| Term | Definition |
|------|------------|
| AI | Artificial Intelligence |
| ASAP2 | Accelerated Switching and Packet Processing® |
| BGP | Border Gateway Protocol |
| BOM | Bill of Materials |
| CPU | Central Processing Unit |
| CUDA | Compute Unified Device Architecture |
| DHCP | Dynamic Host Configuration Protocol |
| DPDK | Data Plane Development Kit |
| DVR | Distributed Virtual Routing |
| ECMP | Equal Cost Multi-Pathing |
| FW | Firmware |
| GPU | Graphics Processing Unit |
| HA | High Availability |
| IP | Internet Protocol |
| IPMI | Intelligent Platform Management Interface |
| L3 | IP Network Layer 3 |
| LACP | Link Aggregation Control Protocol |
| MGMT | Management |
| ML2 | Modular Layer 2 OpenStack Plugin |
| MLAG | Multi-Chassis Link Aggregation |
| MLNX_OFED | NVIDIA OpenFabrics Enterprise Distribution for Linux (network driver) |
| NFV | Network Functions Virtualization |
| NIC | Network Interface Card |
| OS | Operating System |
| OVS | Open vSwitch |
| RDG | Reference Deployment Guide |
| RDMA | Remote Direct Memory Access |
| RHEL | Red Hat Enterprise Linux |
| RH-OSP | Red Hat OpenStack Platform |
| RoCE | RDMA over Converged Ethernet |
| SDN | Software Defined Networking |
| SR-IOV | Single Root Input/Output Virtualization |
| VF | Virtual Function |
| VF-LAG | Virtual Function Link Aggregation |
| VLAN | Virtual LAN |
| VM | Virtual Machine |
| VNF | Virtualized Network Function |

Introduction

This document demonstrates the deployment of a large-scale OpenStack cloud over a single high-speed fabric.

The fabric provides the cloud with a mix of L3-routed networks and L2-stretched EVPN networks:

L2-stretched networks are used for the “Deployment/Provisioning” network (as they greatly simplify DHCP and PXE operations) and for the “External” network (as they allow attaching a single external network segment to the cluster, typically a subnet that has real Internet addressing).

Red Hat OpenStack Platform (RH-OSP) is a cloud computing solution that enables the creation, deployment, scaling, and management of a secure and reliable public or private OpenStack-based cloud. This production-ready platform offers tight integration with NVIDIA networking and data processing technologies.

The solution demonstrated in this article can be easily applied to diverse use cases, such as core or edge computing, with hardware accelerated packet and data processing for NFV, Big Data, and AI workloads over IP, DPDK, and RoCE stacks.

Note

RH-OSP 16.1 Release Notes

Red Hat OpenStack Platform 16.1 supports offloading of the OVS switching function to the SmartNIC hardware. This enhancement reduces the processing resources required and accelerates the data path. In Red Hat OpenStack Platform 16.1, this feature has graduated from Technology Preview and is now fully supported.

Note

Downloadable Content

All configuration files used in this article can be found here: 47036708-1.0.0.zip

References

Red Hat OpenStack Platform 16.1 Installation Guide

Red Hat OpenStack Platform 16.1 Spine Leaf Networking

QSG: High Availability with ASAP2 Enhanced SR-IOV (VF-LAG).

Solution Architecture

Key Components and Technologies

  • ConnectX®-6 Dx is a member of the world-class, award-winning ConnectX series of network adapters. ConnectX-6 Dx delivers two ports of 10/25/40/50/100Gb/s or a single port of 200Gb/s Ethernet connectivity, paired with best-in-class hardware capabilities that accelerate and secure cloud and data center workloads.

  • NVIDIA Spectrum™ Ethernet Switch product family includes a broad portfolio of top-of-rack and aggregation switches that can be deployed in layer-2 and layer-3 cloud designs, in overlay-based virtualized networks, or as part of high-performance, mission-critical Ethernet storage fabrics.

  • LinkX® product family of cables and transceivers provides the industry's most complete line of 10, 25, 40, 50, 100, 200, and 400GbE Ethernet products and EDR, HDR, and NDR InfiniBand products for cloud, HPC, Web 2.0, enterprise, telco, storage, and artificial intelligence data center applications. LinkX cables and transceivers are often used to link top-of-rack switches downwards to network adapters in NVIDIA GPU and CPU servers and storage, and/or upwards in switch-to-switch applications throughout the network infrastructure.

  • Cumulus Linux is the world's most robust open networking operating system. It includes a comprehensive list of advanced, modern networking features, and is built for scale.

  • Red Hat OpenStack Platform is a cloud computing platform that virtualizes resources from industry-standard hardware, organizes those resources into clouds, and manages them so users can access what they need, when they need it.

Logical Design

A typical OpenStack deployment uses several infrastructure networks, as portrayed in the following diagram:

image2021-10-27_13-12-50.png

A straightforward approach would be to use a separate physical network for each infrastructure network above, but this is not practical. Most deployments converge several networks onto a physical fabric, typically using one or more 1/10GbE fabrics and sometimes an additional high-speed fabric (25/100/200GbE).

In this document, we demonstrate a deployment that converges all of these networks onto a single 100/200GbE high-speed fabric.

Network Fabric Design

image2021-10-27_13-14-19.png

image2021-8-22_14-2-47.png

image2021-8-25_14-22-7.png

Note

  • In this network design, Compute Nodes are connected to the External network as required for Distributed Virtual Routing (DVR) configuration. For more information, please refer to Red Hat OpenStack Platform DVR.

  • A routed Spine-Leaf networking architecture is used in this RDG for the high-speed Control and Data networks. However, the Provisioning and External networks use L2-stretched EVPN networks. For more information, please see Red Hat OpenStack Platform Spine Leaf Networking.

The deployment includes two separate networking layers:

  1. High-speed Ethernet fabric

  2. IPMI and switch management (not covered in this document)

The high-speed fabric described in this document contains the following building blocks:

  • Routed Spine-Leaf networking architecture

  • 2 x MSN3700C Spine switches

  • L3 BGP unnumbered with ECMP is configured on the Leaf and Spine switches to allow multipath routing between racks located on different L3 network segments

  • 2 x MSN2100 Leaf switches per rack in MLAG topology

  • 2 x 100GbE ports for MLAG peerlink between Leaf pairs

  • L3 VRR VLAN interfaces on each Leaf pair, used as the default gateway for the host servers' VLAN interfaces

  • VXLAN VTEPs with a BGP EVPN control plane for creating stretched L2 networks for the provisioning and external networks

  • Host servers with 2 x 100GbE ports configured with LACP Active-Active bonding and multiple VLAN interfaces (see the illustrative sketch after this list)

  • The entire fabric is configured to support Jumbo Frames (MTU=9000)
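For illustration only, the sketch below shows roughly what this host-side bond and VLAN layout looks like using plain iproute2 commands. On the overcloud nodes this configuration is actually rendered by os-net-config from the deployment templates described later, and the interface names, VLAN IDs, and the address shown are placeholders taken from this guide's tables.

# Illustrative sketch of the host bond/VLAN layout (not applied manually in this RDG)
ip link add bond0 type bond mode 802.3ad miimon 100 lacp_rate fast xmit_hash_policy layer3+4
ip link set enp57s0f0 down && ip link set enp57s0f0 master bond0
ip link set enp57s0f1 down && ip link set enp57s0f1 master bond0
ip link set bond0 mtu 9000 up

# Tagged VLAN interfaces on top of the LACP bond (VLAN IDs per the tables below)
ip link add link bond0 name bond0.10 type vlan id 10   # Internal API
ip link add link bond0 name bond0.20 type vlan id 20   # Storage
ip addr add 172.16.0.10/24 dev bond0.10                # placeholder host address
ip link set bond0.10 up && ip link set bond0.20 up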

image2021-8-22_14-16-45.png

This document demonstrates a minimal scale of two racks with one compute node each. Using the same design, however, the fabric can be scaled to accommodate up to 236 compute nodes with 2x100GbE connectivity on a non-blocking fabric.

This is a diagram demonstrating the maximum possible scale for a non-blocking deployment that uses 2x100GbE to the hosts (16 racks with 15 servers each, using 15 spines and 32 leafs):

image2021-4-11_15-24-5.png

Host Accelerated Bonding Logical Design

In order to use the MLAG-based network high availability, the hosts must have two high-speed interfaces that are bonded together with an LACP bond.

In the solution described in this article, enhanced SR-IOV with bonding support (ASAP2 VF-LAG) is used to offload network processing from the host and VM into the network adapter hardware, while providing a fast data plane with high availability.

Two Virtual Functions, each on a different physical port, are bonded and allocated to the VM as a single LAGed VF. The bonded interface is connected to a single or multiple ToR switches, using Active-Standby or Active-Active bond modes.

For additional information, refer to QSG: High Availability with ASAP2 Enhanced SR-IOV (VF-LAG).
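In this deployment, the switchdev and VF-LAG configuration is applied automatically by the overcloud first-boot template and the deployment environment files. The sketch below only illustrates the kind of host-side steps involved; PCI addresses, interface names, and VF counts are placeholders.

# Enable SR-IOV VFs on both physical ports of the ConnectX-6 Dx (placeholder names)
echo 2 > /sys/class/net/enp57s0f0/device/sriov_numvfs
echo 2 > /sys/class/net/enp57s0f1/device/sriov_numvfs

# Switch both PFs to switchdev mode so OVS flows can be offloaded to the NIC
# (the VFs must be unbound from their driver before changing the eswitch mode)
devlink dev eswitch set pci/0000:39:00.0 mode switchdev
devlink dev eswitch set pci/0000:39:00.1 mode switchdev

# With both PFs enslaved to an LACP bond (bond0), the driver exposes VF-LAG:
# a VF assigned to a VM can transmit and receive over either physical port.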

image2021-2-22_16-19-4.png

Host and Application Logical Design

Compute host components:

  • ConnectX-6 Dx high-speed NIC with dual physical ports configured with LACP bonding in an MLAG topology, providing VF-LAG redundancy to the VM

  • Storage Drives for local OS usage

  • RHEL as a base OS

  • Red Hat OpenStack Platform containerized software stack with the following:

    • KVM-based hypervisor

    • Open vSwitch (OVS) with hardware offload support

    • Distributed Virtual Routing (DVR) configuration

Virtual Machine components:

  • CentOS 7.x/8.x as a base OS

  • SR-IOV VF allocated using PCI passthrough, which allows bypassing the compute server hypervisor

  • perftest-tools performance and benchmark testing tool set

image2021-2-3_18-46-8.png

Software Stack Components

image2021-8-29_18-39-19.png

Bill of Materials

image2021-8-22_12-9-30.png

Deployment and Configuration

Spines

| Hostname | Router ID | Autonomous System | Downlinks |
|----------|-----------|-------------------|-----------|
| spine1 | 10.10.10.101/32 | 65199 | swp1-4 |
| spine2 | 10.10.10.102/32 | 65199 | swp1-4 |

Leafs

| Rack | Hostname | Router ID | Autonomous System | Uplinks | ISL Ports | CLAG System MAC | CLAG Priority | VXLAN Anycast IP |
|------|----------|-----------|-------------------|---------|-----------|-----------------|---------------|------------------|
| 1 | leaf1a | 10.10.10.1/32 | 65100 | swp13-14 | swp15-16 | 44:38:39:BE:EF:AA | 1000 | 10.0.0.10 |
| 1 | leaf1b | 10.10.10.2/32 | 65100 | swp13-14 | swp15-16 | 44:38:39:BE:EF:AA | 32768 | 10.0.0.10 |
| 2 | leaf2a | 10.10.10.3/32 | 65101 | swp13-14 | swp15-16 | 44:38:39:BE:EF:BB | 1000 | 10.0.0.20 |
| 2 | leaf2b | 10.10.10.4/32 | 65101 | swp13-14 | swp15-16 | 44:38:39:BE:EF:BB | 32768 | 10.0.0.20 |

L3-Routed VLANs

| VLAN ID | Virtual MAC | Virtual IP | Primary Router IP | Secondary Router IP | Purpose |
|---------|-------------|------------|-------------------|---------------------|---------|
| 10 | 00:00:5E:00:01:00 | 172.16.0.254/24 | 172.16.0.252/24 | 172.16.0.253/24 | Internal_API |
| 20 | 00:00:5E:00:01:00 | 172.17.0.254/24 | 172.17.0.252/24 | 172.17.0.253/24 | Storage |
| 30 | 00:00:5E:00:01:00 | 172.18.0.254/24 | 172.18.0.252/24 | 172.18.0.253/24 | Storage Mgmt |
| 40 | 00:00:5E:00:01:00 | 172.19.0.254/24 | 172.19.0.252/24 | 172.19.0.253/24 | Tenant |
| 11 | 00:00:5E:00:01:01 | 172.16.1.254/24 | 172.16.1.252/24 | 172.16.1.253/24 | Internal_API |
| 21 | 00:00:5E:00:01:01 | 172.17.1.254/24 | 172.17.1.252/24 | 172.17.1.253/24 | Storage |
| 31 | 00:00:5E:00:01:01 | 172.18.1.254/24 | 172.18.1.252/24 | 172.18.1.253/24 | Storage Mgmt |
| 41 | 00:00:5E:00:01:01 | 172.19.1.254/24 | 172.19.1.252/24 | 172.19.1.253/24 | Tenant |

L2-Stretched VLANs (EVPN)

| VLAN ID | VNI | Used Subnet | Purpose |
|---------|-----|-------------|---------|
| 50 | 10050 | 192.168.24.0/24 | Provisioning (subnet generated by the undercloud node) |
| 60 | 10060 | 172.60.0.0/24 | External/Internet access. The undercloud node has an address in this subnet (172.60.0.1), and its default gateway is 172.60.0.254 |

Server Ports

| Rack | VLAN ID | Access Ports | Trunk Ports | Network Purpose |
|------|---------|--------------|-------------|-----------------|
| 1 | 10 | | swp1-5 | Internal API |
| 1 | 20 | | swp1-5 | Storage |
| 1 | 30 | | swp1-5 | Storage Mgmt |
| 1 | 40 | | swp1-5 | Tenant |
| 1 | 50 | swp1-5 | | Provisioning/Mgmt (Undercloud-Overcloud) |
| 1 | 60 | | swp1-5 | External |
| 2 | 11 | | swp1 | Internal API |
| 2 | 21 | | swp1 | Storage |
| 2 | 31 | | swp1 | Storage Mgmt |
| 2 | 41 | | swp1 | Tenant |
| 2 | 50 | swp1 | | Provisioning/Mgmt (Undercloud-Overcloud) |
| 2 | 60 | | swp1 | External |

Server Wiring

| Rack | Node | NIC Port | Switch Port |
|------|------|----------|-------------|
| 1 | Director Node (Undercloud) | ens2f0 | Leaf1A, swp1 |
| 1 | Director Node (Undercloud) | ens2f1 | Leaf1B, swp1 |
| 1 | Controller-1 | ens2f0 | Leaf1A, swp2 |
| 1 | Controller-1 | ens2f1 | Leaf1B, swp2 |
| 1 | Controller-2 | ens2f0 | Leaf1A, swp3 |
| 1 | Controller-2 | ens2f1 | Leaf1B, swp3 |
| 1 | Controller-3 | ens2f0 | Leaf1A, swp4 |
| 1 | Controller-3 | ens2f1 | Leaf1B, swp4 |
| 1 | Compute-1 | enp57s0f0 | Leaf1A, swp5 |
| 1 | Compute-1 | enp57s0f1 | Leaf1B, swp5 |
| 2 | Compute-2 | enp57s0f0 | Leaf2A, swp1 |
| 2 | Compute-2 | enp57s0f1 | Leaf2B, swp1 |

Wiring

The wiring principle for the high-speed Ethernet fabric is as follows:

  • Each server in the racks is wired to two leaf (ToR) switches

  • Leaf switches are interconnected using two ports (to create an MLAG)

  • Every leaf is wired to all the spines

Below is the full wiring diagram for the demonstrated fabric:

Network/Fabric

Updating Cumulus Linux

As a best practice, make sure to use the latest released Cumulus Linux NOS version.

Please see this guide on how to upgrade Cumulus Linux.
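A package-based upgrade is typically performed as sketched below; refer to the linked guide for image-based upgrades and for backing up the switch configuration first.

# Check the currently installed Cumulus Linux version
net show system

# Package upgrade from the Cumulus repositories (requires repository access from the switch)
sudo -E apt-get update
sudo -E apt-get upgrade
sudo reboot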

Configuring the Cumulus Linux switch

Make sure your Cumulus Linux switch has passed its initial configuration stages (for additional information, see the Quick-Start Guide for version 4.4; a brief sketch follows this list):

  1. License installation

  2. Creation of switch interfaces (e.g., swp1-32)
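For reference, these initial steps typically look as follows; this is a sketch based on the Cumulus Linux 4.x tooling, and the license URL is a placeholder.

# Install the Cumulus Linux license (placeholder URL) and restart switchd
sudo cl-license -i http://myserver/license.txt
sudo systemctl restart switchd.service

# Create the front-panel switch ports and apply the change
net add interface swp1-32
net commit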

The configuration of the spine switches includes the following:

  1. Routing & BGP

  2. EVPN

  3. Downlinks configuration

The configuration of the leaf switches includes the following:

  1. Routing & BGP

  2. EVPN

  3. Uplinks configuration

  4. MLAG

  5. L3-Routed VLANs

  6. L2-Stretched VLANs

  7. Bonds configuration

Following are the configuration commands:

Warning
  • The configuration below does not include the connection of an external router/gateway for external/Internet connectivity

  • The “external” network (stretched L2 over VLAN 60) is assumed to have a gateway (at 172.60.0.254) and to provide Internet connectivity for the cluster

  • The director (undercloud) node has an interface on VLAN 60 (bond0.60 with address 172.60.0.1)

Spine1 Console


net add loopback lo ip address 10.10.10.101/32
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp autonomous-system 65199
net add bgp router-id 10.10.10.101
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add interface swp1 mtu 9216
net add bgp neighbor swp1 interface peer-group underlay
net add interface swp2 mtu 9216
net add bgp neighbor swp2 interface peer-group underlay
net add interface swp3 mtu 9216
net add bgp neighbor swp3 interface peer-group underlay
net add interface swp4 mtu 9216
net add bgp neighbor swp4 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn advertise-all-vni
net commit

Spine2 Console


net add loopback lo ip address 10.10.10.102/32
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp autonomous-system 65199
net add bgp router-id 10.10.10.102
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add interface swp1 mtu 9216
net add bgp neighbor swp1 interface peer-group underlay
net add interface swp2 mtu 9216
net add bgp neighbor swp2 interface peer-group underlay
net add interface swp3 mtu 9216
net add bgp neighbor swp3 interface peer-group underlay
net add interface swp4 mtu 9216
net add bgp neighbor swp4 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn advertise-all-vni
net commit

Leaf1A Console


net add interface swp15,swp16 mtu 9216
net add bond peerlink bond slaves swp15,swp16
net add interface swp1 mtu 9216
net add bond bond1 bond slaves swp1
net add bond bond1 clag id 1
net add interface swp2 mtu 9216
net add bond bond2 bond slaves swp2
net add bond bond2 clag id 2
net add interface swp3 mtu 9216
net add bond bond3 bond slaves swp3
net add bond bond3 clag id 3
net add interface swp4 mtu 9216
net add bond bond4 bond slaves swp4
net add bond bond4 clag id 4
net add interface swp5 mtu 9216
net add bond bond5 bond slaves swp5
net add bond bond5 clag id 5
net add bond bond1,bond2,bond3,bond4,bond5 bond lacp-bypass-allow
net add bond bond1,bond2,bond3,bond4,bond5 stp bpduguard
net add bond bond1,bond2,bond3,bond4,bond5 stp portadminedge
net add bridge bridge ports bond1,bond2,bond3,bond4,bond5,peerlink
net add bridge bridge vlan-aware
net add bgp autonomous-system 65100
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp router-id 10.10.10.1
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor peerlink.4094 interface remote-as internal
net add interface swp13 mtu 9216
net add bgp neighbor swp13 interface peer-group underlay
net add interface swp14 mtu 9216
net add bgp neighbor swp14 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add loopback lo ip address 10.10.10.1/32
net add interface peerlink.4094 clag backup-ip 10.10.10.2
net add interface peerlink.4094 clag priority 1000
net add interface peerlink.4094 clag sys-mac 44:38:39:BE:EF:AA
net add interface peerlink.4094 clag peer-ip linklocal
net add interface peerlink.4094 clag args --initDelay 10
net add bridge bridge vids 10
net add vlan 10 ip address 172.16.0.252/24
net add vlan 10 ip address-virtual 00:00:5E:00:01:00 172.16.0.254/24
net add vlan 10 vlan-id 10
net add vlan 10 vlan-raw-device bridge
net add bridge bridge vids 20
net add vlan 20 ip address 172.17.0.252/24
net add vlan 20 ip address-virtual 00:00:5E:00:01:00 172.17.0.254/24
net add vlan 20 vlan-id 20
net add vlan 20 vlan-raw-device bridge
net add bridge bridge vids 30
net add vlan 30 ip address 172.18.0.252/24
net add vlan 30 ip address-virtual 00:00:5E:00:01:00 172.18.0.254/24
net add vlan 30 vlan-id 30
net add vlan 30 vlan-raw-device bridge
net add bridge bridge vids 40
net add vlan 40 ip address 172.19.0.252/24
net add vlan 40 ip address-virtual 00:00:5E:00:01:00 172.19.0.254/24
net add vlan 40 vlan-id 40
net add vlan 40 vlan-raw-device bridge
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn neighbor peerlink.4094 activate
net add bgp l2vpn evpn advertise-all-vni
net add bgp l2vpn evpn advertise-default-gw
net add bgp l2vpn evpn advertise ipv4 unicast
net add bgp ipv6 unicast neighbor underlay activate
net add loopback lo clag vxlan-anycast-ip 10.0.0.10
net add loopback lo vxlan local-tunnelip 10.10.10.1
net add vlan 60 vlan-id 60
net add vlan 60 vlan-raw-device bridge
net add vlan 50 vlan-id 50
net add vlan 50 vlan-raw-device bridge
net add vxlan vtep60 vxlan id 10060
net add vxlan vtep50 vxlan id 10050
net add vxlan vtep60,50 bridge arp-nd-suppress on
net add vxlan vtep60,50 bridge learning off
net add vxlan vtep60,50 stp bpduguard
net add vxlan vtep60,50 stp portbpdufilter
net add vxlan vtep60,50 vxlan local-tunnelip 10.10.10.1
net add vxlan vtep60 bridge access 60
net add vxlan vtep50 bridge access 50
net add bridge bridge ports vtep60,vtep50
net add bridge bridge vids 60,50
net add bond bond1-5 bridge pvid 50
net commit

Leaf1B Console


net add interface swp15,swp16 mtu 9216
net add bond peerlink bond slaves swp15,swp16
net add interface swp1 mtu 9216
net add bond bond1 bond slaves swp1
net add bond bond1 clag id 1
net add interface swp2 mtu 9216
net add bond bond2 bond slaves swp2
net add bond bond2 clag id 2
net add interface swp3 mtu 9216
net add bond bond3 bond slaves swp3
net add bond bond3 clag id 3
net add interface swp4 mtu 9216
net add bond bond4 bond slaves swp4
net add bond bond4 clag id 4
net add interface swp5 mtu 9216
net add bond bond5 bond slaves swp5
net add bond bond5 clag id 5
net add bond bond1,bond2,bond3,bond4,bond5 bond lacp-bypass-allow
net add bond bond1,bond2,bond3,bond4,bond5 stp bpduguard
net add bond bond1,bond2,bond3,bond4,bond5 stp portadminedge
net add bridge bridge ports bond1,bond2,bond3,bond4,bond5,peerlink
net add bridge bridge vlan-aware
net add bgp autonomous-system 65100
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp router-id 10.10.10.2
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor peerlink.4094 interface remote-as internal
net add interface swp13 mtu 9216
net add bgp neighbor swp13 interface peer-group underlay
net add interface swp14 mtu 9216
net add bgp neighbor swp14 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add loopback lo ip address 10.10.10.2/32
net add interface peerlink.4094 clag backup-ip 10.10.10.1
net add interface peerlink.4094 clag priority 32768
net add interface peerlink.4094 clag sys-mac 44:38:39:BE:EF:AA
net add interface peerlink.4094 clag peer-ip linklocal
net add interface peerlink.4094 clag args --initDelay 10
net add bridge bridge vids 10
net add vlan 10 ip address 172.16.0.253/24
net add vlan 10 ip address-virtual 00:00:5E:00:01:00 172.16.0.254/24
net add vlan 10 vlan-id 10
net add vlan 10 vlan-raw-device bridge
net add bridge bridge vids 20
net add vlan 20 ip address 172.17.0.253/24
net add vlan 20 ip address-virtual 00:00:5E:00:01:00 172.17.0.254/24
net add vlan 20 vlan-id 20
net add vlan 20 vlan-raw-device bridge
net add bridge bridge vids 30
net add vlan 30 ip address 172.18.0.253/24
net add vlan 30 ip address-virtual 00:00:5E:00:01:00 172.18.0.254/24
net add vlan 30 vlan-id 30
net add vlan 30 vlan-raw-device bridge
net add bridge bridge vids 40
net add vlan 40 ip address 172.19.0.253/24
net add vlan 40 ip address-virtual 00:00:5E:00:01:00 172.19.0.254/24
net add vlan 40 vlan-id 40
net add vlan 40 vlan-raw-device bridge
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn neighbor peerlink.4094 activate
net add bgp l2vpn evpn advertise-all-vni
net add bgp l2vpn evpn advertise-default-gw
net add bgp l2vpn evpn advertise ipv4 unicast
net add bgp ipv6 unicast neighbor underlay activate
net add loopback lo clag vxlan-anycast-ip 10.0.0.10
net add loopback lo vxlan local-tunnelip 10.10.10.2
net add vlan 60 vlan-id 60
net add vlan 60 vlan-raw-device bridge
net add vlan 50 vlan-id 50
net add vlan 50 vlan-raw-device bridge
net add vxlan vtep60 vxlan id 10060
net add vxlan vtep50 vxlan id 10050
net add vxlan vtep60,50 bridge arp-nd-suppress on
net add vxlan vtep60,50 bridge learning off
net add vxlan vtep60,50 stp bpduguard
net add vxlan vtep60,50 stp portbpdufilter
net add vxlan vtep60,50 vxlan local-tunnelip 10.10.10.2
net add vxlan vtep60 bridge access 60
net add vxlan vtep50 bridge access 50
net add bridge bridge ports vtep60,vtep50
net add bridge bridge vids 60,50
net add bond bond1-5 bridge pvid 50
net commit

Leaf2A Console


net add interface swp15,swp16 mtu 9216
net add bond peerlink bond slaves swp15,swp16
net add interface swp1 mtu 9216
net add bond bond1 bond slaves swp1
net add bond bond1 clag id 1
net add bond bond1 bond lacp-bypass-allow
net add bond bond1 stp bpduguard
net add bond bond1 stp portadminedge
net add bridge bridge ports bond1,peerlink
net add bridge bridge vlan-aware
net add bgp autonomous-system 65101
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp router-id 10.10.10.3
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor peerlink.4094 interface remote-as internal
net add interface swp13 mtu 9216
net add bgp neighbor swp13 interface peer-group underlay
net add interface swp14 mtu 9216
net add bgp neighbor swp14 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add loopback lo ip address 10.10.10.3/32
net add interface peerlink.4094 clag backup-ip 10.10.10.4
net add interface peerlink.4094 clag priority 1000
net add interface peerlink.4094 clag sys-mac 44:38:39:BE:EF:BB
net add interface peerlink.4094 clag peer-ip linklocal
net add interface peerlink.4094 clag args --initDelay 10
net add bridge bridge vids 11
net add vlan 11 ip address 172.16.1.252/24
net add vlan 11 ip address-virtual 00:00:5E:00:01:01 172.16.1.254/24
net add vlan 11 vlan-id 11
net add vlan 11 vlan-raw-device bridge
net add bridge bridge vids 21
net add vlan 21 ip address 172.17.1.252/24
net add vlan 21 ip address-virtual 00:00:5E:00:01:01 172.17.1.254/24
net add vlan 21 vlan-id 21
net add vlan 21 vlan-raw-device bridge
net add bridge bridge vids 31
net add vlan 31 ip address 172.18.1.252/24
net add vlan 31 ip address-virtual 00:00:5E:00:01:01 172.18.1.254/24
net add vlan 31 vlan-id 31
net add vlan 31 vlan-raw-device bridge
net add bridge bridge vids 41
net add vlan 41 ip address 172.19.1.252/24
net add vlan 41 ip address-virtual 00:00:5E:00:01:01 172.19.1.254/24
net add vlan 41 vlan-id 41
net add vlan 41 vlan-raw-device bridge
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn neighbor peerlink.4094 activate
net add bgp l2vpn evpn advertise-all-vni
net add bgp l2vpn evpn advertise-default-gw
net add bgp l2vpn evpn advertise ipv4 unicast
net add bgp ipv6 unicast neighbor underlay activate
net add loopback lo clag vxlan-anycast-ip 10.0.0.20
net add loopback lo vxlan local-tunnelip 10.10.10.3
net add vlan 60 vlan-id 60
net add vlan 60 vlan-raw-device bridge
net add vlan 50 vlan-id 50
net add vlan 50 vlan-raw-device bridge
net add vxlan vtep60 vxlan id 10060
net add vxlan vtep50 vxlan id 10050
net add vxlan vtep60,50 bridge arp-nd-suppress on
net add vxlan vtep60,50 bridge learning off
net add vxlan vtep60,50 stp bpduguard
net add vxlan vtep60,50 stp portbpdufilter
net add vxlan vtep60,50 vxlan local-tunnelip 10.10.10.3
net add vxlan vtep60 bridge access 60
net add vxlan vtep50 bridge access 50
net add bridge bridge ports vtep60,vtep50
net add bridge bridge vids 60,50
net add bond bond1 bridge pvid 50
net commit

Leaf2B Console


net add interface swp15,swp16 mtu 9216
net add bond peerlink bond slaves swp15,swp16
net add interface swp1 mtu 9216
net add bond bond1 bond slaves swp1
net add bond bond1 clag id 1
net add bond bond1 bond lacp-bypass-allow
net add bond bond1 stp bpduguard
net add bond bond1 stp portadminedge
net add bridge bridge ports bond1,peerlink
net add bridge bridge vlan-aware
net add bgp autonomous-system 65101
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp router-id 10.10.10.4
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor peerlink.4094 interface remote-as internal
net add interface swp13 mtu 9216
net add bgp neighbor swp13 interface peer-group underlay
net add interface swp14 mtu 9216
net add bgp neighbor swp14 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add loopback lo ip address 10.10.10.4/32
net add interface peerlink.4094 clag backup-ip 10.10.10.3
net add interface peerlink.4094 clag priority 32768
net add interface peerlink.4094 clag sys-mac 44:38:39:BE:EF:BB
net add interface peerlink.4094 clag peer-ip linklocal
net add interface peerlink.4094 clag args --initDelay 10
net add bridge bridge vids 11
net add vlan 11 ip address 172.16.1.253/24
net add vlan 11 ip address-virtual 00:00:5E:00:01:01 172.16.1.254/24
net add vlan 11 vlan-id 11
net add vlan 11 vlan-raw-device bridge
net add bridge bridge vids 21
net add vlan 21 ip address 172.17.1.253/24
net add vlan 21 ip address-virtual 00:00:5E:00:01:01 172.17.1.254/24
net add vlan 21 vlan-id 21
net add vlan 21 vlan-raw-device bridge
net add bridge bridge vids 31
net add vlan 31 ip address 172.18.1.253/24
net add vlan 31 ip address-virtual 00:00:5E:00:01:01 172.18.1.254/24
net add vlan 31 vlan-id 31
net add vlan 31 vlan-raw-device bridge
net add bridge bridge vids 41
net add vlan 41 ip address 172.19.1.253/24
net add vlan 41 ip address-virtual 00:00:5E:00:01:01 172.19.1.254/24
net add vlan 41 vlan-id 41
net add vlan 41 vlan-raw-device bridge
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn neighbor peerlink.4094 activate
net add bgp l2vpn evpn advertise-all-vni
net add bgp l2vpn evpn advertise-default-gw
net add bgp l2vpn evpn advertise ipv4 unicast
net add bgp ipv6 unicast neighbor underlay activate
net add loopback lo clag vxlan-anycast-ip 10.0.0.20
net add loopback lo vxlan local-tunnelip 10.10.10.4
net add vlan 60 vlan-id 60
net add vlan 60 vlan-raw-device bridge
net add vlan 50 vlan-id 50
net add vlan 50 vlan-raw-device bridge
net add vxlan vtep60 vxlan id 10060
net add vxlan vtep50 vxlan id 10050
net add vxlan vtep60,50 bridge arp-nd-suppress on
net add vxlan vtep60,50 bridge learning off
net add vxlan vtep60,50 stp bpduguard
net add vxlan vtep60,50 stp portbpdufilter
net add vxlan vtep60,50 vxlan local-tunnelip 10.10.10.4
net add vxlan vtep60 bridge access 60
net add vxlan vtep50 bridge access 50
net add bridge bridge ports vtep60,vtep50
net add bridge bridge vids 60,50
net add bond bond1 bridge pvid 50
net commit

Run the following verification commands on all the leaf switches to confirm the MLAG and routing state (output shown for leaf1a):

Leaf Switch Console


cumulus@leaf1a:mgmt:~$ net show clag
The peer is alive
     Our Priority, ID, and Role: 1000 1c:34:da:b4:09:40 primary
    Peer Priority, ID, and Role: 32768 1c:34:da:b4:06:40 secondary
          Peer Interface and IP: peerlink.4094 fe80::1e34:daff:feb4:640 (linklocal)
               VxLAN Anycast IP: 10.0.0.10
                      Backup IP: 10.10.10.2 (active)
                     System MAC: 44:38:39:be:ef:aa

CLAG Interfaces
Our Interface      Peer Interface     CLAG Id   Conflicts              Proto-Down Reason
----------------   ----------------   -------   --------------------   -----------------
          bond1    bond1              1         -                      -
          bond2    bond2              2         -                      -
          bond3    bond3              3         -                      -
          bond4    bond4              4         -                      -
          bond5    bond5              5         -                      -
         vtep50    vtep50             -         -                      -
         vtep60    vtep60             -         -                      -

cumulus@leaf1a:mgmt:~$ net show route

show ip route
=============
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

C>* 10.0.0.10/32 is directly connected, lo, 01w5d21h
B>* 10.0.0.20/32 [20/0] via fe80::1e34:daff:feb3:ff6c, swp30, weight 1, 01w5d21h
  *                     via fe80::1e34:daff:feb4:6c, swp29, weight 1, 01w5d21h
C>* 10.10.10.1/32 is directly connected, lo, 01w6d19h
B>* 10.10.10.2/32 [200/0] via fe80::1e34:daff:feb4:640, peerlink.4094, weight 1, 01w5d21h
B>* 10.10.10.3/32 [20/0] via fe80::1e34:daff:feb3:ff6c, swp30, weight 1, 01w5d21h
  *                      via fe80::1e34:daff:feb4:6c, swp29, weight 1, 01w5d21h
B>* 10.10.10.4/32 [20/0] via fe80::1e34:daff:feb3:ff6c, swp30, weight 1, 01w5d21h
  *                      via fe80::1e34:daff:feb4:6c, swp29, weight 1, 01w5d21h
B>* 10.10.10.101/32 [20/0] via fe80::1e34:daff:feb4:6c, swp29, weight 1, 01w5d21h
B>* 10.10.10.102/32 [20/0] via fe80::1e34:daff:feb3:ff6c, swp30, weight 1, 01w5d21h
C * 172.16.0.0/24 [0/1024] is directly connected, vlan10-v0, 01w5d21h
C>* 172.16.0.0/24 is directly connected, vlan10, 01w5d21h
B>* 172.16.1.0/24 [20/0] via fe80::1e34:daff:feb3:ff6c, swp30, weight 1, 01w5d21h
  *                      via fe80::1e34:daff:feb4:6c, swp29, weight 1, 01w5d21h
C * 172.17.0.0/24 [0/1024] is directly connected, vlan20-v0, 01w5d21h
C>* 172.17.0.0/24 is directly connected, vlan20, 01w5d21h
B>* 172.17.1.0/24 [20/0] via fe80::1e34:daff:feb3:ff6c, swp30, weight 1, 01w5d21h
  *                      via fe80::1e34:daff:feb4:6c, swp29, weight 1, 01w5d21h
C * 172.18.0.0/24 [0/1024] is directly connected, vlan30-v0, 01w5d21h
C>* 172.18.0.0/24 is directly connected, vlan30, 01w5d21h
B>* 172.18.1.0/24 [20/0] via fe80::1e34:daff:feb3:ff6c, swp30, weight 1, 01w5d21h
  *                      via fe80::1e34:daff:feb4:6c, swp29, weight 1, 01w5d21h
C * 172.19.0.0/24 [0/1024] is directly connected, vlan40-v0, 01w5d21h
C>* 172.19.0.0/24 is directly connected, vlan40, 01w5d21h
B>* 172.19.1.0/24 [20/0] via fe80::1e34:daff:feb3:ff6c, swp30, weight 1, 01w5d21h
  *                      via fe80::1e34:daff:feb4:6c, swp29, weight 1, 01w5d21h

show ipv6 route
===============
Codes: K - kernel route, C - connected, S - static, R - RIPng, O - OSPFv3,
       I - IS-IS, B - BGP, N - NHRP, T - Table, v - VNC, V - VNC-Direct,
       A - Babel, D - SHARP, F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

C * fe80::/64 is directly connected, vlan50, 01w5d21h
C * fe80::/64 is directly connected, vlan60, 01w5d21h
C * fe80::/64 is directly connected, vlan40, 01w5d21h
C * fe80::/64 is directly connected, vlan40-v0, 01w5d21h
C * fe80::/64 is directly connected, vlan30-v0, 01w6d19h
C * fe80::/64 is directly connected, vlan30, 01w6d19h
C * fe80::/64 is directly connected, vlan20-v0, 01w6d19h
C * fe80::/64 is directly connected, vlan20, 01w6d19h
C * fe80::/64 is directly connected, vlan10-v0, 01w6d19h
C * fe80::/64 is directly connected, vlan10, 01w6d19h
C * fe80::/64 is directly connected, bridge, 01w6d19h
C * fe80::/64 is directly connected, peerlink.4094, 01w6d19h
C * fe80::/64 is directly connected, swid0_eth, 01w6d19h
C * fe80::/64 is directly connected, swp30, 01w6d19h
C>* fe80::/64 is directly connected, swp29, 01w6d19h

Host Preparation

To achieve optimal results for DPDK use cases, the compute nodes must be properly tuned for DPDK performance.

The optimization might require specific BIOS and NIC Firmware settings. Please refer to the official DPDK performance document provided by your CPU vendor.

For our deployment we used the following document from AMD: https://www.amd.com/system/files/documents/epyc-7Fx2-processors-dpdk-nw-performance-brief.pdf

Note

Please note that the host boot settings (Linux grub command line) are applied later as part of the overcloud image and deployment configuration, and that hugepage allocation is performed on the actual VM used for testing.
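For illustration, the kernel boot parameters applied later through the overcloud templates typically cover IOMMU passthrough, 1G hugepages, and CPU isolation. The values below are placeholders and must be adjusted to your CPU model and NUMA layout; they are not taken verbatim from this guide's templates.

# Example kernel command-line fragment for a DPDK-tuned AMD EPYC compute node (placeholders)
GRUB_CMDLINE_LINUX="... iommu=pt amd_iommu=on default_hugepagesz=1G hugepagesz=1G hugepages=64 isolcpus=2-31 nohz_full=2-31 rcu_nocbs=2-31"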

OpenStack Deployment

Red Hat OpenStack Platform (RH-OSP) version 16.1 is used.

The deployment is divided into three major steps:

  1. Installing the director node

  2. Deploying the undercloud on the director node

  3. Deploying the overcloud using the undercloud

Important

Make sure that SR-IOV is enabled in the BIOS settings of the overcloud node servers and that the servers are tuned for maximum performance.

Important

All nodes which belong to the same profile (e.g., controller, compute) must have the same PCIe placement for the NIC and must expose the same interface name.
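This can be verified on each node before deployment with standard Linux tools, for example:

# Verify that the ConnectX NIC sits at the same PCIe address on every node of a given role
lspci -D | grep -i mellanox

# Verify that the NIC exposes the same interface names on every node (e.g., enp57s0f0/enp57s0f1)
ls -l /sys/class/net/ | grep -v virtual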

The director node needs the following network interfaces (an illustrative configuration sketch follows this list):

  1. Untagged over bond0—used for provisioning the overcloud nodes (connected to stretched VLAN 50 on the switch)

  2. Tagged VLAN 60 over bond0.60—used for Internet access over the external network segment (connected to stretched VLAN 60 on the switch)

  3. An interface on the IPMI network—used for accessing the bare metal nodes' BMCs
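The sketch below shows one possible way to create the bond and VLAN interfaces on a RHEL-based director node with NetworkManager. Interface names, bond options, and addresses follow this guide's examples and should be adapted to your environment; bond0 itself is left unaddressed because the undercloud installation assigns 192.168.24.1 to it (local_interface).

# LACP bond over the two high-speed ports (provisioning/VLAN 50 is untagged on bond0)
nmcli con add type bond con-name bond0 ifname bond0 \
    bond.options "mode=802.3ad,miimon=100,lacp_rate=fast,xmit_hash_policy=layer3+4"
nmcli con add type ethernet con-name bond0-p1 ifname ens2f0 master bond0
nmcli con add type ethernet con-name bond0-p2 ifname ens2f1 master bond0

# Tagged VLAN 60 for the external network (director address per this guide)
nmcli con add type vlan con-name bond0.60 ifname bond0.60 dev bond0 id 60 \
    ipv4.method manual ipv4.addresses 172.60.0.1/24 ipv4.gateway 172.60.0.254

nmcli con up bond0 && nmcli con up bond0.60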

Undercloud Director Installation

  • Follow RH-OSP Preparing for Director Installation up to Preparing Container Images.

  • Use the following environment file for OVS-based RH-OSP 16.1 container preparation. Remember to update your Red Hat registry credentials:

/home/stack/containers-prepare-parameter.yaml


#global
parameter_defaults:
  ContainerImagePrepare:
  - push_destination: true
    excludes:
    - ceph
    - prometheus
    set:
      name_prefix: openstack-
      name_suffix: ''
      namespace: registry.redhat.io/rhosp-rhel8
      neutron_driver: null
      rhel_containers: false
      tag: '16.1'
      tag_from_label: '{version}-{release}'
  ContainerImageRegistryCredentials:
    registry.redhat.io:
      '<username>': '<password>'

/home/stack/undercloud.conf


[DEFAULT]
undercloud_hostname = rhosp-director.localdomain
local_ip = 192.168.24.1/24
network_gateway = 192.168.24.1
undercloud_public_host = 192.168.24.2
undercloud_admin_host = 192.168.24.3
undercloud_nameservers = 8.8.8.8,8.8.4.4
undercloud_ntp_servers = 10.211.0.134,10.211.0.124
subnets = ctlplane-subnet
local_subnet = ctlplane-subnet
generate_service_certificate = True
certificate_generation_ca = local
local_interface = bond0
inspection_interface = br-ctlplane
undercloud_debug = true
enable_tempest = false
enable_telemetry = false
enable_validations = true
enable_novajoin = false
clean_nodes = true
custom_env_files = /home/stack/custom-undercloud-params.yaml,/home/stack/ironic_ipmitool.yaml
container_images_file = /home/stack/containers-prepare-parameter.yaml

[auth]

[ctlplane-subnet]
cidr = 192.168.24.0/24
dhcp_start = 192.168.24.5
dhcp_end = 192.168.24.30
inspection_iprange = 192.168.24.100,192.168.24.120
gateway = 192.168.24.1
masquerade = true
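With containers-prepare-parameter.yaml and undercloud.conf in place, the undercloud installation itself is performed as described in the Red Hat installation guide, typically by running the following on the director node as the stack user:

[stack@rhosp-director ~]$ openstack undercloud install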

  • Follow the instructions in RH-OSP Obtain Images for Overcloud Nodes (section 4.9.1, steps 1–3), without importing the images into the director yet.

  • Once obtained, follow the customization process described in RH-OSP Working with Overcloud Images:

    1. Start from 3.3. QCOW: Installing virt-customize to director

    2. Skip 3.4

    3. Run 3.5. QCOW: Setting the root password (optional)

    4. Run 3.6. QCOW: Registering the image

    5. Run the following command to locate your subscription pool ID:


      $ sudo subscription-manager list --available --all --matches="Red Hat OpenStack"

    6. Replace [subscription-pool] in the below command with your relevant subscription pool ID:


      $ virt-customize --selinux-relabel -a overcloud-full.qcow2 --run-command 'subscription-manager attach --pool [subscription-pool]'

    7. Skip 3.7 and 3.8

    8. Run the following command to add mstflint to the overcloud image, to allow NIC firmware provisioning during overcloud deployment (similar to 3.9).


      $ virt-customize --selinux-relabel -a overcloud-full.qcow2 --install mstflint

      Note

      mstflint is required for the overcloud nodes to support the automatic NIC firmware upgrade by the cloud orchestration system during deployment.

    9. Run 3.10. QCOW: Cleaning the subscription pool

    10. Run 3.11. QCOW: Unregistering the image

    11. Run 3.12. QCOW: Reset the machine ID

    12. Run 3.13. Uploading the images to director

Undercloud Director Preparation for Automatic NIC Firmware Provisioning

  1. Download the latest ConnectX NIC firmware binary file (fw-<NIC-Model>.bin) from NVIDIA Networking Firmware Download Site.

  2. Create a directory named mlnx_fw under /var/lib/ironic/httpboot/ in the Director node, and place the firmware binary file in it.

  3. Extract the connectx_first_boot.yaml file from the configuration files attached to this guide, and place it in the /home/stack/templates/ directory on the Director node (see the sketch below).
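For example, steps 2 and 3 can be performed as follows; the firmware file name is kept as the generic placeholder from step 1 and must match your exact NIC model and PSID.

# Step 2: place the ConnectX firmware binary where it is served from the director node
sudo mkdir -p /var/lib/ironic/httpboot/mlnx_fw
sudo cp fw-<NIC-Model>.bin /var/lib/ironic/httpboot/mlnx_fw/

# Step 3: place the first-boot template referenced by env-ovs-dvr.yaml
mkdir -p /home/stack/templates
cp connectx_first_boot.yaml /home/stack/templates/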

Warning

The connectx_first_boot.yaml file is called by another deployment configuration file (env-ovs-dvr.yaml), so please use the instructed location, or change the configuration files accordingly.

Overcloud Nodes Introspection

A full overcloud introspection procedure is described in RH-OSP Configuring a Basic Overcloud. In this RDG, the following configuration steps were used for introspecting the overcloud bare metal nodes, which are later deployed over two routed Spine-Leaf racks:

  • Prepare a bare metal inventory file, instackenv.json, with the overcloud nodes' information. In this case, the inventory file lists 5 bare metal nodes to be deployed as overcloud nodes: 3 controller nodes and 2 compute nodes (1 in each routed rack). Make sure to update the file with the servers' IPMI addresses and credentials, and with the MAC address of one of the physical ports, in order to avoid issues in the introspection process:

    instackenv.json


    { "nodes": [ { "name": "controller-1", "pm_type":"ipmi", "pm_user":"root", "pm_password":"******", "pm_addr":"10.7.214.1", "ports":[ { "address":"<MAC_ADDRESS>", "pxe_enabled":true } ] }, { "name": "controller-2", "pm_type":"ipmi", "pm_user":"root", "pm_password":"******", "pm_addr":"10.7.214.2", "ports":[ { "address":"<MAC_ADDRESS>", "pxe_enabled":true } ] }, { "name": "controller-3", "pm_type":"ipmi", "pm_user":"root", "pm_password":"******", "pm_addr":"10.7.214.3", "ports":[ { "address":"<MAC_ADDRESS>", "pxe_enabled":true } ] }, { "name": "compute-1", "pm_type":"ipmi", "pm_user":"root", "pm_password":"******", "pm_addr":"10.7.214.4", "ports":[ { "address":"<MAC_ADDRESS>", "pxe_enabled":true } ] }, { "name": "compute-2", "pm_type":"ipmi", "pm_user":"root", "pm_password":"******", "pm_addr":"10.7.214.5", "ports":[ { "address":"<MAC_ADDRESS>", "pxe_enabled":true } ] } ] }

  • Import the overcloud baremetal nodes inventory, and wait until all nodes are listed in “manageable” state.


    [stack@rhosp-director ~]$ source ~/stackrc
    (undercloud) [stack@rhosp-director ~]$ openstack overcloud node import /home/stack/instackenv.json


    $ openstack baremetal node list
    +--------------------------------------+--------------+---------------+-------------+--------------------+-------------+
    | UUID                                 | Name         | Instance UUID | Power State | Provisioning State | Maintenance |
    +--------------------------------------+--------------+---------------+-------------+--------------------+-------------+
    | 476c7659-abc2-4d8c-9532-1756abbfd18a | controller-1 | None          | power off   | manageable         | False       |
    | 3cbb74e5-6508-4ec8-91a8-870dbf28baed | controller-2 | None          | power off   | manageable         | False       |
    | 457b329e-f1bc-476a-996d-eb82a56998e8 | controller-3 | None          | power off   | manageable         | False       |
    | 870445b7-650f-40fc-8ac2-5c3df700ccdc | compute-1    | None          | power off   | manageable         | False       |
    | baa7356b-11ca-4cb0-b58c-16c110bbbea0 | compute-2    | None          | power off   | manageable         | False       |
    +--------------------------------------+--------------+---------------+-------------+--------------------+-------------+

  • Start the baremetal nodes introspection:


    $ openstack overcloud node introspect --all-manageable

  • Set the root device for deployment, and provide all bare metal nodes so that they reach the "available" state:


    $ openstack overcloud node configure --all-manageable --instance-boot-option local --root-device largest
    $ openstack overcloud node provide --all-manageable

  • Tag the controller nodes into the “control” profile, which is later mapped to the overcloud controller role:


    $ openstack baremetal node set --property capabilities='profile:control,boot_option:local' controller-1
    $ openstack baremetal node set --property capabilities='profile:control,boot_option:local' controller-2
    $ openstack baremetal node set --property capabilities='profile:control,boot_option:local' controller-3

    Warning

    Note: The Role to Profile mapping is specified in the node-info.yaml file used during the overcloud deployment.

  • Create a new compute flavor, and tag one compute node into the "compute-r0" profile, which is later mapped to the overcloud "compute in rack 0" role:


    $ openstack flavor create --id auto --ram 4096 --disk 40 --vcpus 1 compute-r0
    $ openstack flavor set --property "capabilities:boot_option"="local" --property "capabilities:profile"="compute-r0" --property "resources:CUSTOM_BAREMETAL"="1" --property "resources:DISK_GB"="0" --property "resources:MEMORY_MB"="0" --property "resources:VCPU"="0" compute-r0

    $ openstack baremetal node set --property capabilities='profile:compute-r0,boot_option:local' compute-1

  • Create a new compute flavor, and tag the last compute node into the "compute-r1" profile, which is later mapped to the overcloud "compute in rack 1" role:


    $ openstack flavor create --id auto --ram 4096 --disk 40 --vcpus 1 compute-r1
    $ openstack flavor set --property "capabilities:boot_option"="local" --property "capabilities:profile"="compute-r1" --property "resources:CUSTOM_BAREMETAL"="1" --property "resources:DISK_GB"="0" --property "resources:MEMORY_MB"="0" --property "resources:VCPU"="0" compute-r1

    $ openstack baremetal node set --property capabilities='profile:compute-r1,boot_option:local' compute-2

  • Verify the overcloud nodes profiles allocation:


    $ openstack overcloud profiles list
    +--------------------------------------+--------------+-----------------+-----------------+-------------------+
    | Node UUID                            | Node Name    | Provision State | Current Profile | Possible Profiles |
    +--------------------------------------+--------------+-----------------+-----------------+-------------------+
    | 476c7659-abc2-4d8c-9532-1756abbfd18a | controller-1 | available       | control         |                   |
    | 3cbb74e5-6508-4ec8-91a8-870dbf28baed | controller-2 | available       | control         |                   |
    | 457b329e-f1bc-476a-996d-eb82a56998e8 | controller-3 | available       | control         |                   |
    | 870445b7-650f-40fc-8ac2-5c3df700ccdc | compute-1    | available       | compute-r0      |                   |
    | baa7356b-11ca-4cb0-b58c-16c110bbbea0 | compute-2    | available       | compute-r1      |                   |
    +--------------------------------------+--------------+-----------------+-----------------+-------------------+

Overcloud Deployment Configuration Files

Prepare the following cloud deployment configuration files, and place them under the /home/stack/templates/dvr directory.

Warning

Note: The full files are attached to this article, and can be found here: 47036708-1.0.0.zip

Some configuration files are customized specifically for the /home/stack/templates/dvr location. If you place the template files in a different location, adjust them accordingly.
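Assuming the attached archive has been copied to the director node, the templates can be extracted in place as sketched below; adjust the paths to the archive's actual layout.

mkdir -p /home/stack/templates/dvr
unzip 47036708-1.0.0.zip -d /home/stack/templates/dvr
ls /home/stack/templates/dvr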

  • containers-prepare-parameter.yaml

  • network-environment-dvr.yaml

  • controller.yaml

    Note

    This template file contains the network settings for the controller nodes, including large MTU and bonding configuration

  • computesriov-dvr.yaml

    Note

    This template file contains the network settings for the compute nodes, including SR-IOV VFs, large MTU, and accelerated bonding (VF-LAG) configuration for the data path.

  • node-info.yaml

    Note

    This environment file contains the count of nodes per role and the role to the baremetal profile mapping.

  • roles_data_dvr.yaml

    Note

    This environment file contains the services enabled on each cloud role and the networks associated with its rack location.

  • network_data.yaml

    Note

    This environment file contains a cloud network configuration for routed Spine-Leaf topology with large MTU. Rack0 and Rack1 L3 segments are listed as subnets of each cloud network. For further information refer to RH-OSP Configuring the Overcloud Leaf Networks.

  • env-ovs-dvr.yaml

    Note
    • This environment file contains the following settings:

      • Overcloud nodes time settings

      • ConnectX First Boot parameters (by calling /home/stack/templates/connectx_first_boot.yaml file)

      • Neutron Jumbo Frame MTU and DVR mode

      • Compute nodes' CPU partitioning and isolation adjusted to the NUMA topology

      • Nova PCI passthrough settings adjusted to VXLAN hardware offload

Overcloud Deployment

  • Issue the overcloud deploy command to start cloud deployment with the prepared configuration files.


    $ openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates \
      --libvirt-type kvm \
      -n /home/stack/templates/dvr/network_data.yaml \
      -r /home/stack/templates/dvr/roles_data_dvr.yaml \
      --validation-warnings-fatal \
      -e /home/stack/templates/dvr/node-info.yaml \
      -e /home/stack/templates/dvr/containers-prepare-parameter.yaml \
      -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml \
      -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
      -e /usr/share/openstack-tripleo-heat-templates/environments/neutron-ovs-dvr.yaml \
      -e /home/stack/templates/dvr/network-environment-dvr.yaml \
      -e /home/stack/templates/dvr/env-ovs-dvr.yaml \
      -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml

  • Once deployed, load the necessary environment variables to interact with your overcloud:


    $ source ~/overcloudrc

Applications and Use Cases

Accelerated Packet Processing (SDN Acceleration)

Warning

Note: The following use case demonstrates SDN-layer acceleration using hardware offload capabilities.
The appendix below includes benchmarks that demonstrate SDN offload performance and usability.

image2021-10-27_13-19-59.png

VM Image

  • Build a VM cloud image (qcow2) with packet processing performance tools and cloud-init elements as described in How-to: Create OpenStack Cloud Image with Performance Tools.

  • Upload the image to the overcloud image store:


    $ openstack image create perf --public --disk-format qcow2 --container-format bare --file /home/stack/images/guest/centos8-perf.qcow2

VM Flavor

  • Create a flavor:


    $ openstack flavor create m1.packet --id auto --ram 8192 --disk 20 --vcpus 10

  • Set hugepages and cpu-pinning parameters:


    $ openstack flavor set m1.packet --property hw:mem_page_size=large $ openstack flavor set m1.packet --property hw:cpu_policy=dedicated

VM Networks and Ports

  • Create a VXLAN network with normal ports to be used for instance management and access:


    $ openstack network create vx_mgmt --provider-network-type vxlan --share
    $ openstack subnet create vx_mgmt_subnet --dhcp --network vx_mgmt --subnet-range 22.22.22.0/24 --dns-nameserver 8.8.8.8
    $ openstack port create normal1 --network vx_mgmt --no-security-group --disable-port-security
    $ openstack port create normal2 --network vx_mgmt --no-security-group --disable-port-security

  • Create a VXLAN network to be used for accelerated data traffic between the VM instances with Jumbo Frames support:


    $ openstack network create vx_data --provider-network-type vxlan --share --mtu 8950
    $ openstack subnet create vx_data_subnet --dhcp --network vx_data --subnet-range 33.33.33.0/24 --gateway none

  • Create 2 x SR-IOV direct ports with hardware offload capabilities:


    $ openstack port create direct1 --vnic-type=direct --network vx_data --binding-profile '{"capabilities":["switchdev"]}'
    $ openstack port create direct2 --vnic-type=direct --network vx_data --binding-profile '{"capabilities":["switchdev"]}'

  • Create an external network for public access:


    $ openstack network create public --provider-physical-network datacentre --provider-network-type vlan --provider-segment 60 --share --mtu 8950 --external
    $ openstack subnet create public_subnet --dhcp --network public --subnet-range 172.60.0.0/24 --allocation-pool start=172.60.0.100,end=172.60.0.150 --gateway 172.60.0.254

  • Create a public router, and add the management network subnet:


    $ openstack router create public_router
    $ openstack router set public_router --external-gateway public
    $ openstack router add subnet public_router vx_mgmt_subnet

VM Instances

  • Create a VM instance with a management port and a direct port on the compute node located on the Rack0 L3 network segment:


    $ openstack server create --flavor m1.packet --image perf --port normal1 --port direct1 vm1 --availability-zone nova:overcloud-computesriov-rack0-0.localdomain

  • Create a VM instance with a management port and a direct port on the compute node located on the Rack1 L3 network segment:


    $ openstack server create --flavor m1.packet --image perf --port normal2 --port direct2 vm2 --availability-zone nova:overcloud-computesriov-rack1-0.localdomain

  • Wait until the VM instances' status changes to ACTIVE:


$ openstack server list
+--------------------------------------+------+--------+--------------------------------------------+-------+--------+
| ID                                   | Name | Status | Networks                                   | Image | Flavor |
+--------------------------------------+------+--------+--------------------------------------------+-------+--------+
| 000d3e17-9583-4885-aa2d-79c3b5df3cc8 | vm2  | ACTIVE | vx_data=33.33.33.244; vx_mgmt=22.22.22.151 | perf  |        |
| 150013ad-170d-4587-850e-70af692aa74c | vm1  | ACTIVE | vx_data=33.33.33.186; vx_mgmt=22.22.22.59  | perf  |        |
+--------------------------------------+------+--------+--------------------------------------------+-------+--------+

  • In order to verify that external/public network access is functioning properly, create and assign external floating IP addresses for the VMs:


    $ openstack floating ip create --floating-ip-address 172.60.0.151 public
    $ openstack server add floating ip vm1 172.60.0.151
    $ openstack floating ip create --floating-ip-address 172.60.0.152 public
    $ openstack server add floating ip vm2 172.60.0.152
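To confirm that the data-path traffic between the instances is actually processed in the NIC hardware, the offloaded OVS datapath flows can be inspected on the compute nodes. This is an optional verification step using standard OVS tooling, not a step taken from this guide:

# On a compute node: verify that OVS hardware offload is enabled
sudo ovs-vsctl get Open_vSwitch . other_config:hw-offload

# List the datapath flows that were offloaded to the ConnectX hardware while traffic is running
sudo ovs-appctl dpctl/dump-flows type=offloaded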

Appendix

Performance Testing

Warning

The performance results listed in this document are indicative and should not be considered as formal performance targets for NVIDIA products.

Warning

Note: The tools used in the tests below are included in the VM image built with the perf-tools element, as instructed in the previous steps.

iperf TCP Test

  • On the Transmitter VM, disable XPS (Transmit Packet Steering):


    # for i in {0..7}; do echo 0 > /sys/class/net/eth1/queues/tx-$i/xps_cpus;  done;

    Note
    • This step is required to allow packet distribution of the iperf traffic using ConnectX VF-LAG.

    • Use interface statistic/counters on the Leaf switches’ bond interfaces to confirm traffic is distributed by the Transmitter VM to both switches using VF-LAG (Cumulus “net show counters” command).

  • On the Receiver VM, start multiple iperf3 servers:


    # iperf3 -s -p 5101 &
    # iperf3 -s -p 5102 &
    # iperf3 -s -p 5103 &
    # iperf3 -s -p 5104 &

  • On the Transmitter VM, start multiple iperf3 clients for multi-core parallel streams:


    # iperf3 -c 33.33.33.180 -T s1 -p 5101 -t 15 &
    # iperf3 -c 33.33.33.180 -T s2 -p 5102 -t 15 &
    # iperf3 -c 33.33.33.180 -T s3 -p 5103 -t 15 &
    # iperf3 -c 33.33.33.180 -T s4 -p 5104 -t 15 &

  • Check the results:


    s1: - - - - - - - - - - - - - - - - - - - - - - - - -
    s1: [ ID] Interval           Transfer     Bitrate         Retr
    s1: [  5]   0.00-15.00  sec  38.4 GBytes  22.0 Gbits/sec  18239  sender
    s1: [  5]   0.00-15.04  sec  38.4 GBytes  22.0 Gbits/sec         receiver
    s1:
    s1: iperf Done.
    s2: [  5]  14.00-15.00  sec  3.24 GBytes  27.8 Gbits/sec  997    1008 KBytes
    s2: - - - - - - - - - - - - - - - - - - - - - - - - -
    s2: [ ID] Interval           Transfer     Bitrate         Retr
    s2: [  5]   0.00-15.00  sec  45.1 GBytes  25.8 Gbits/sec  22465  sender
    s2: [  5]   0.00-15.04  sec  45.1 GBytes  25.7 Gbits/sec         receiver
    s2:
    s2: iperf Done.
    s4: [  5]  14.00-15.00  sec  2.26 GBytes  19.4 Gbits/sec  1435   443 KBytes
    s4: - - - - - - - - - - - - - - - - - - - - - - - - -
    s4: [ ID] Interval           Transfer     Bitrate         Retr
    s4: [  5]   0.00-15.00  sec  46.5 GBytes  26.6 Gbits/sec  19964  sender
    s4: [  5]   0.00-15.04  sec  46.5 GBytes  26.6 Gbits/sec         receiver
    s4:
    s4: iperf Done.
    s3: [  5]  14.00-15.00  sec  2.70 GBytes  23.2 Gbits/sec  1537   643 KBytes
    s3: - - - - - - - - - - - - - - - - - - - - - - - - -
    s3: [ ID] Interval           Transfer     Bitrate         Retr
    s3: [  5]   0.00-15.00  sec  39.5 GBytes  22.6 Gbits/sec  18447  sender
    s3: [  5]   0.00-15.04  sec  39.5 GBytes  22.6 Gbits/sec         receiver
    s3:
    s3: iperf Done.

    Note

    The test results above demonstrate a total of around 100Gbps line rate for IP TCP traffic.

Before proceeding to the next test, stop all iperf servers on the Receiver VM:


# killall iperf3
iperf3: interrupt - the server has terminated
iperf3: interrupt - the server has terminated
iperf3: interrupt - the server has terminated
iperf3: interrupt - the server has terminated

RoCE Bandwidth Test over the bond

  • On the Receiver VM, start ib_write_bw server with 2 QPs in order to utilize the VF-LAG infrastructure:


    # ib_write_bw -R --report_gbits --qp 2

  • On the Transmitter VM, start ib_write_bw client with 2 QPs and a duration of 10 seconds:


    # ib_write_bw -R --report_gbits --qp 2 -D 10 33.33.33.180

  • Check the test results:


    ---------------------------------------------------------------------------------------
                        RDMA_Write BW Test
     Dual-port       : OFF          Device         : mlx5_0
     Number of qps   : 2            Transport type : IB
     Connection type : RC           Using SRQ      : OFF
     PCIe relax order: ON
     ibv_wr* API     : ON
     TX depth        : 128
     CQ Moderation   : 1
     Mtu             : 4096[B]
     Link type       : Ethernet
     GID index       : 3
     Max inline data : 0[B]
     rdma_cm QPs     : ON
     Data ex. method : rdma_cm
    ---------------------------------------------------------------------------------------
     local address:  LID 0000 QPN 0x016e PSN 0x3ab6d6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:173
     local address:  LID 0000 QPN 0x016f PSN 0xc9fa3f GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:173
     remote address: LID 0000 QPN 0x016e PSN 0x8d928c GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:180
     remote address: LID 0000 QPN 0x016f PSN 0xc89786 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:33:33:180
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
     65536      22065492       0.00               192.81               0.367756
    ---------------------------------------------------------------------------------------
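
To see how the bandwidth scales with the number of QPs spread over the VF-LAG bond, the same test can be scripted with increasing QP counts. This is only a sketch, assuming the 33.33.33.180 receiver address used above and that the ib_write_bw server is restarted on the Receiver VM with the matching QP count before each run:

#!/bin/bash
# Minimal sketch: run on the Transmitter VM; restart "ib_write_bw -R --report_gbits --qp <N>"
# on the Receiver VM before each iteration.
RECEIVER=33.33.33.180
for QPS in 1 2 4; do
    echo "=== ib_write_bw with $QPS QP(s) ==="
    ib_write_bw -R --report_gbits --qp $QPS -D 10 $RECEIVER
done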

Single-core DPDK packet-rate test over a single port

In this test, we used two VMs - one running the trex traffic generator and the other running the dpdk-testpmd tool.

The traffic generator sends 64-byte packets to the testpmd tool, which sends them back (thus exercising the DPDK packet-processing pipeline).

We tested the single-core packet processing rate between the two VMs using 2 tx and rx queues.

Note

Please note that this test requires specific versions of DPDK and of the VM image kernel; compatibility issues might exist when using the latest releases.

For this test we used a CentOS 7.8 image with kernel 3.10.0-1160.42.2 and DPDK 20.11.

On one VM we ran the dpdk-testpmd utility. The following command line was used to run testpmd (its startup output displays the MAC address of the port, which is used later as the destination MAC in the trex configuration):

testpmd VM console


# ./dpdk-testpmd -c 0x1ff -n 4 -m 1024 -a 00:05.0 -- --burst=64 --txd=1024 --rxd=1024 --mbcache=512 --rxq=4 --txq=4 --nb-cores=1 --rss-udp --forward-mode=macswap -a -i
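
If the MAC address is needed again after startup, it can also be read from the testpmd interactive prompt (a quick check, assuming port 0 is the VF under test):

testpmd> show port info 0

The "MAC address" field in this output is the value to enter later in the trex port setup wizard.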

Next, we created a second VM with two direct ports (on the vx_data network) and used it to run the trex traffic generator (version 2.82).

We ran the interactive DPDK port setup wizard and entered the MAC address of the testpmd VM collected beforehand:

trex VM console


# cd /root/trex/v2.82
# ./dpdk_setup_ports.py -i
By default, IP based configuration file will be created. Do you want to use MAC based config? (y/N)y
+----+------+---------+-------------------+----------------------------------------------+------------+----------+----------+
| ID | NUMA |   PCI   |        MAC        |                     Name                     |   Driver   | Linux IF |  Active  |
+====+======+=========+===================+==============================================+============+==========+==========+
| 0  | -1   | 00:03.0 | fa:16:3e:11:3e:64 | Virtio network device                        | virtio-pci | eth0     | *Active* |
+----+------+---------+-------------------+----------------------------------------------+------------+----------+----------+
| 1  | -1   | 00:05.0 | fa:16:3e:cb:a4:82 | ConnectX Family mlx5Gen Virtual Function     | mlx5_core  | eth1     |          |
+----+------+---------+-------------------+----------------------------------------------+------------+----------+----------+
| 2  | -1   | 00:06.0 | fa:16:3e:58:79:e2 | ConnectX Family mlx5Gen Virtual Function     | mlx5_core  | eth2     |          |
+----+------+---------+-------------------+----------------------------------------------+------------+----------+----------+
Please choose an even number of interfaces from the list above, either by ID, PCI or Linux IF
Stateful will use order of interfaces: Client1 Server1 Client2 Server2 etc. for flows.
Stateless can be in any order.
Enter list of interfaces separated by space (for example: 1 3) : 1 2

For interface 1, assuming loopback to its dual interface 2.
Destination MAC is fa:16:3e:58:79:e2. Change it to MAC of DUT? (y/N).y
Please enter a new destination MAC of interface 1: FA:16:3E:32:5C:A4
For interface 2, assuming loopback to its dual interface 1.
Destination MAC is fa:16:3e:cb:a4:82. Change it to MAC of DUT? (y/N).y
Please enter a new destination MAC of interface 2: FA:16:3E:32:5C:A4
Print preview of generated config? (Y/n)
### Config file generated by dpdk_setup_ports.py ###

- version: 2
  interfaces: ['00:05.0', '00:06.0']
  port_info:
      - dest_mac: fa:16:3e:32:5c:a4
        src_mac:  fa:16:3e:cb:a4:82
      - dest_mac: fa:16:3e:32:5c:a4
        src_mac:  fa:16:3e:58:79:e2

  platform:
      master_thread_id: 0
      latency_thread_id: 1
      dual_if:
        - socket: 0
          threads: [2,3,4,5,6,7,8,9]

Save the config to file? (Y/n)y
Default filename is /etc/trex_cfg.yaml
Press ENTER to confirm or enter new file:
Saved to /etc/trex_cfg.yaml.

We created the following UDP packet stream configuration file under the /root/trex/v2.82 directory. The stream increments the UDP source port per packet so that traffic is spread across the receiver's RSS queues:

udp_rss.py


from trex_stl_lib.api import *


class STLS1(object):

    def create_stream (self):
        pkt = Ether()/IP(src="16.0.0.1",dst="48.0.0.1")/UDP(dport=12)/(22*'x')

        vm = STLScVmRaw( [
                STLVmFlowVar(name="v_port",
                             min_value=4337,
                             max_value=5337,
                             size=2, op="inc"),
                STLVmWrFlowVar(fv_name="v_port",
                               pkt_offset= "UDP.sport" ),
                STLVmFixChecksumHw(l3_offset="IP",l4_offset="UDP",l4_type=CTRexVmInsFixHwCs.L4_TYPE_UDP),
              ]
            )

        return STLStream(packet = STLPktBuilder(pkt = pkt ,vm = vm ) ,
                         mode = STLTXCont(pps = 8000000) )

    def get_streams (self, direction = 0, **kwargs):
        # create 1 stream
        return [ self.create_stream() ]


# dynamic load - used for trex console or simulator
def register():
    return STLS1()

Next, we ran the TRex application in the background over 8 out of 10 cores:

trex VM console


# nohup ./t-rex-64 --no-ofed-check -i -c 8 &

Finally, we ran the TRex Console:

trex VM console


# ./trex-console

Using 'python' as Python interpeter

Connecting to RPC server on localhost:4501                   [SUCCESS]
Connecting to publisher server on localhost:4500             [SUCCESS]
Acquiring ports [0, 1]:                                      [SUCCESS]

Server Info:

Server version:   v2.82 @ STL
Server mode:      Stateless
Server CPU:       8 x AMD EPYC Processor (with IBPB)
Ports count:      2 x 100Gbps @ ConnectX Family mlx5Gen Virtual Function

-=TRex Console v3.0=-

Type 'help' or '?' for supported actions

trex>

In the TRex Console, we entered the Text UI (TUI):

trex VM console


trex>tui

We then started a 35 MPPS stream using the stream configuration file created in the previous steps:

trex VM console


tui>start -f udp_rss.py -m 35mpps -p 0

These are the results measured on the testpmd VM (~32.5 MPPS):

testpmd tool console


testpmd> show port stats all

  ######################## NIC statistics for port 0  ########################
  RX-packets: 486825081  RX-missed: 14768224   RX-bytes:  29209504860
  RX-errors: 0
  RX-nombuf:  0
  TX-packets: 486712558  TX-errors: 0          TX-bytes:  29202753480

  Throughput (since last show)
  Rx-pps:     32542689          Rx-bps:  15620491128
  Tx-pps:     32542685          Tx-bps:  15620488872
  ############################################################################
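
When repeating the measurement, it helps to reset the counters first so that the per-second rates are not mixed with earlier runs. These are standard testpmd console commands:

testpmd> clear port stats all
testpmd> show port stats all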

Troubleshooting the Nodes in Case the Network Is Not Functioning

If there is a need to debug the overcloud nodes due to deployment issues, the only way to connect to them is by using the SSH key stored on the director node.

Connecting through a BMC console is not helpful by default, since there is no way to log in to the nodes with a username and password.
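
For reference, a typical connection from the director node looks like the sketch below (assuming the default heat-admin user and the stack user's SSH key; the node IP is a placeholder for the node's ctlplane address):

# On the director (undercloud) node, as the stack user:
$ source ~/stackrc
$ openstack server list                 # find the node's ctlplane IP address
$ ssh heat-admin@<node-ctlplane-ip>     # uses the stack user's SSH key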

Since the nodes are connected over a single network, a mistake in the YAML files (or a similar issue) can leave them completely inaccessible for troubleshooting.

The solution is to add the following lines to the env-ovs-dvr.yaml file (under the sections shown below). This sets a root password for the nodes so that they become accessible from the BMC console:


resource_registry:
  OS::TripleO::NodeUserData: /usr/share/openstack-tripleo-heat-templates/firstboot/userdata_root_password.yaml

parameter_defaults:
  NodeRootPassword: "******"

The overcloud must be deleted and redeployed in order for the above changes to take effect.
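
A minimal sketch of that flow, assuming the stack is named "overcloud" and the original deployment script is reused, could be:

$ source ~/stackrc
$ openstack overcloud delete overcloud        # prompts for confirmation before removing the stack
$ ./deploy-overcloud.sh                       # hypothetical name for the original deployment script/command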

Warning

Note: this feature and the “Automatic NIC Firmware Provisioning” feature are mutually exclusive and can’t be used in parallel as they both use the “NodeUserData” variable!

Warning

Note: Make sure to remove the above lines in a production cloud, as they pose a security hazard.

Once the nodes are accessible, an additional helpful step for troubleshooting network issues is to run the following command on the problematic nodes:


# os-net-config -c /etc/os-net-config/config.json -d

This command will try to reapply the network configuration and will report any relevant issues.
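
After the configuration is re-applied, the resulting state can be inspected with standard tools; for example (a minimal, generic check):

# ip -br link show             # link state of the physical interfaces and bonds
# ip -br addr show             # addresses assigned to the host interfaces/VLANs
# ovs-vsctl show               # OVS bridges, bonds and ports created by os-net-config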

Authors


Itai Levy

Over the past few years, Itai Levy has worked as a Solutions Architect and member of the NVIDIA Networking “Solutions Labs” team. Itai designs and executes cutting-edge solutions around Cloud Computing, SDN, SDS and Security. His main areas of expertise include NVIDIA BlueField Data Processing Unit (DPU) solutions and accelerated OpenStack/K8s platforms.


Shachar Dor

Shachar Dor joined the Solutions Lab team after working more than ten years as a software architect at NVIDIA Networking (previously Mellanox Technologies), where he was responsible for the architecture of network management products and solutions. Shachar’s focus is on networking technologies, especially around fabric bring-up, configuration, monitoring, and life-cycle management.

Shachar has a strong background in software architecture, design, and programming through his work on multiple projects and technologies also prior to joining the company.

Last updated on Sep 12, 2023.