RDG for Deploying Media Streaming Applications using Rivermax, DeepStream over Accelerated K8s Cluster

Created on June 15, 2022.

Scope

The following Reference Deployment Guide (RDG) shows how to deploy Rivermax and DeepStream streaming applications over an accelerated Kubernetes (K8s) cluster.

Abbreviations and Acronyms

Term    | Definition
--------|------------------------------------------
CDN     | Content Delivery Network
CNI     | Container Network Interface
CR      | Custom Resource
CRD     | Custom Resource Definition
CRI     | Container Runtime Interface
DHCP    | Dynamic Host Configuration Protocol
DNS     | Domain Name System
DP      | Device Plugin
DS      | DeepStream
IPAM    | IP Address Management
K8s     | Kubernetes
LLDP    | Link Layer Discovery Protocol
NCCL    | NVIDIA Collective Communication Library
NFD     | Node Feature Discovery
OCI     | Open Container Initiative
PF      | Physical Function
QSG     | Quick Start Guide
RDG     | Reference Deployment Guide
RDMA    | Remote Direct Memory Access
RoCE    | RDMA over Converged Ethernet
SR-IOV  | Single Root Input/Output Virtualization
VF      | Virtual Function

Introduction

This guide provides a complete solution cycle of K8s cluster deployment, including a technology overview, the design, component selection, deployment steps and application workload examples. The solution is delivered on top of standard servers, with an NVIDIA end-to-end Ethernet infrastructure carrying the workload.
In this guide, we use the NVIDIA GPU Operator and the NVIDIA Network Operator, which manage the deployment and configuration of the GPU and network components in the K8s cluster. These components allow you to accelerate workloads using CUDA, RDMA and GPUDirect technologies.

This guide shows the design of a K8s cluster with two K8s worker nodes and provides detailed instructions for deploying a K8s cluster.

A Greenfield deployment is assumed for this guide.

The information presented is written for experienced Media and Entertainment Broadcast System Admins, System Engineers and Solution Architects who need to deploy the Rivermax streaming apps for their customers.

Solution Architecture

Key Components and Technologies

  • NVIDIA DGX A100

    NVIDIA DGX™ A100 is the universal system for all AI workloads, offering unprecedented compute density, performance, and flexibility in the world’s first 5 petaFLOPS AI system. NVIDIA DGX A100 features the world’s most advanced accelerator, the NVIDIA A100 Tensor Core GPU, enabling enterprises to consolidate training, inference, and analytics into a unified, easy-to-deploy AI infrastructure that includes direct access to NVIDIA AI experts.

  • NVIDIA ConnectX SmartNICs
    10/25/40/50/100/200 and 400G Ethernet Network Adapters
    The industry-leading NVIDIA® ConnectX® family of smart network interface cards (SmartNICs) offers advanced hardware offloads and accelerations.
    NVIDIA Ethernet adapters enable the highest ROI and lowest Total Cost of Ownership for hyperscale, public and private clouds, storage, machine learning, AI, big data, and telco platforms.

  • NVIDIA LinkX Cables

    The NVIDIA® LinkX® product family of cables and transceivers provides the industry's most complete line of 10, 25, 40, 50, 100, 200, and 400GbE Ethernet and 100, 200 and 400Gb/s InfiniBand products for Cloud, HPC, hyperscale, Enterprise, telco, storage and artificial intelligence data center applications.

  • NVIDIA Spectrum Ethernet Switches

    Flexible form-factors with 16 to 128 physical ports, supporting 1GbE through 400GbE speeds.
    Based on a ground-breaking silicon technology optimized for performance and scalability, NVIDIA Spectrum switches are ideal for building high-performance, cost-effective, and efficient Cloud Data Center Networks, Ethernet Storage Fabric, and Deep Learning Interconnects.
    NVIDIA combines the benefits of NVIDIA Spectrum switches, based on an industry-leading application-specific integrated circuit (ASIC) technology, with a wide variety of modern network operating system choices, including NVIDIA Cumulus® Linux, SONiC and NVIDIA Onyx®.

  • Kubernetes
    Kubernetes is an open-source container orchestration platform for deployment automation, scaling, and management of containerized applications.

  • Kubespray
    Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS and Kubernetes cluster configuration management tasks, and provides:

    • A highly available cluster

    • Composable attributes

    • Support for most popular Linux distributions

  • NVIDIA GPU Operator

    The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM-based monitoring, and more.

  • NVIDIA Network Operator

    An analog to the NVIDIA GPU Operator, the NVIDIA Network Operator simplifies scale-out network design for Kubernetes by automating aspects of network deployment and configuration that would otherwise require manual work. It loads the required drivers, libraries, device plugins, and CNIs on any cluster node with an NVIDIA network interface. Paired with the NVIDIA GPU Operator, the Network Operator enables GPUDirect RDMA, a key technology that accelerates cloud-native AI workloads by orders of magnitude. The NVIDIA Network Operator uses Kubernetes CRD and the Operator Framework to provision the host software needed for enabling accelerated networking.

  • NVIDIA CUDA

    CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs. In GPU-accelerated applications, the sequential part of the workload runs on the CPU – which is optimized for single-threaded performance – while the compute-intensive portion of the application runs on thousands of GPU cores in parallel.

  • NVIDIA Rivermax SDK
    NVIDIA Rivermax offers a unique IP-based solution for any media and data streaming use case. Rivermax together with NVIDIA GPU accelerated computing technologies unlocks innovation for a wide range of applications in Media and Entertainment (M&E), Broadcast, Healthcare, Smart Cities and more. Rivermax leverages NVIDIA ConnectX and BlueField DPU hardware streaming acceleration technology that enables direct data transfers to and from the GPU, delivering best-in-class throughput and latency with minimal CPU utilization for streaming workloads.

  • NVIDIA DeepStream SDK
    NVIDIA DeepStream allows the rapid development and deployment of Vision AI applications and services. DeepStream provides multi-platform, scalable, TLS-encrypted security that can be deployed on-premises, on the edge, and in the cloud. It delivers a complete streaming analytics toolkit for AI-based multi-sensor processing and video, audio and image understanding. DeepStream is principally aimed at vision AI developers, software partners, startups and OEMs building IVA apps and services.

  • Networked Media Open Specifications (NMOS)
    NMOS specifications are a family of open, free-of-charge specifications that enable interoperability between media devices on an IP infrastructure. The core specifications, IS-04 Registration and Discovery and IS-05 Device Connection Management, provide uniform mechanisms to enable media devices and services to advertise their capabilities onto the network, and control systems to configure the video, audio and data streams between the devices' senders and receivers. NMOS is extensible and, for example, includes specifications for audio channel mapping, for exchange of event and tally information, and for securing the APIs, leveraging IT best practices. There are open-source NMOS implementations available, and NVIDIA provides a free NMOS Node library in the DeepStream SDK.

Logical Design

The logical design includes the following parts:

  • Deployment node running Kubespray that deploys Kubernetes cluster

  • K8s Master node running all Kubernetes management components

  • K8s Worker nodes with NVIDIA GPUs and NVIDIA ConnectX-6 Dx adapters

  • High-speed Ethernet fabric (Secondary K8s network)

  • Deployment and K8s Management networks

sol.png

Application Logical Design

In this guide, we deploy the following applications:

  1. Rivermax Media node

  2. NMOS registry controller

  3. DeepStream gateway

  4. Time synchronization service

  5. VNC apps for internal GUI access

apps.png

Software Stack Components

soft.png

Bill of Materials

The following hardware setup is utilized in this guide to build a K8s cluster with two K8s Worker Nodes.

Warning

You can use any suitable hardware according to the network topology and software stack.

bom.png

Deployment and Configuration

Network / Fabric

This RDG describes K8s cluster deployment with multiple K8s Worker Nodes.

The high-performance network is a secondary network for the Kubernetes cluster and requires an L2 network topology.

The Deployment/Management network topology and the DNS/DHCP network services are part of the IT infrastructure. Their installation and configuration procedures are not covered in this guide.

Network IP Configuration

Below are the server names with their relevant network configurations.

Server/Switch Type | Server/Switch Name | High-Speed Network    | Management Network
-------------------|--------------------|-----------------------|-----------------------------
Deployment node    | depserver          | N/A                   | eth0: DHCP, 192.168.100.202
K8s Master node    | node1              | N/A                   | eth0: DHCP, 192.168.100.29
K8s Worker Node1   | node2              | enp57s0f0: no IP set  | eth0: DHCP, 192.168.100.34
K8s Worker Node2   | node3              | enp57s0f0: no IP set  | eth0: DHCP, 192.168.100.39
High-speed switch  | switch             | N/A                   | mgmt0: DHCP, 192.168.100.49

enpXXs0f0 high-speed network interfaces do not require additional configuration.

Wiring

On each K8s Worker Node, only the first port of the NVIDIA network adapter is wired to the NVIDIA switch in the high-performance fabric using NVIDIA LinkX DAC cables.

The below figure illustrates the required wiring for building a K8s cluster.

network.png

Fabric Configuration

The switch configuration is provided below:

Switch console

##
## Running database "initial"
## Generated at 2022/05/10 15:49:25 +0200
## Hostname: switch
## Product release: 3.9.3202
##

##
## Running-config temporary prefix mode setting
##
no cli default prefix-modes enable

##
## Interface Ethernet configuration
##
interface ethernet 1/1-1/32 speed 100GxAuto force
interface ethernet 1/1-1/32 switchport mode hybrid

##
## VLAN configuration
##
vlan 2
vlan 1001
vlan 2 name "RiverData"
vlan 1001 name "PTP"
interface ethernet 1/1-1/32 switchport hybrid allowed-vlan all
interface ethernet 1/5 switchport access vlan 1001
interface ethernet 1/7 switchport access vlan 1001
interface ethernet 1/5 switchport hybrid allowed-vlan add 2
interface ethernet 1/7 switchport hybrid allowed-vlan add 2

##
## STP configuration
##
no spanning-tree

##
## L3 configuration
##
interface vlan 1001
interface vlan 1001 ip address 172.20.0.1/24 primary

##
## IGMP Snooping configuration
##
ip igmp snooping unregistered multicast forward-to-mrouter-ports
ip igmp snooping
vlan 1001 ip igmp snooping
vlan 1001 ip igmp snooping querier
interface ethernet 1/5 ip igmp snooping fast-leave
interface ethernet 1/7 ip igmp snooping fast-leave

##
## Local user account configuration
##
username admin password 7 $6$mSW1WwYI$M5xfvsphrTRht6J2ByfF.J475tq8YuGKR6K1FwSgvkdb1QQFZbx/PtqK.GVJEBoMcmXsnB57QycP7jSp.Hy/Q.
username monitor password 7 $6$V/Og9kzY$qc.oU2Ma9MPJClZlbvymOrb1wtE0N5yfQYPamhzRYeN2npVY/lOE5iisHUpxNqm3Ku8lIWDTPiO/bklyCMi2o.

##
## AAA remote server configuration
##
# ldap bind-password ********
ldap vrf default enable
radius-server vrf default enable
# radius-server key ********
tacacs-server vrf default enable
# tacacs-server key ********

##
## Password restriction configuration
##
no password hardening enable

##
## SNMP configuration
##
snmp-server vrf default enable

##
## Network management configuration
##
# web proxy auth basic password ********
clock timezone Asia Middle_East Jerusalem
ntp vrf default disable
terminal sysrq enable
web vrf default enable

##
## PTP protocol
##
protocol ptp
ptp priority1 1
ptp vrf default enable
interface ethernet 1/5 ptp enable
interface ethernet 1/7 ptp enable
interface vlan 1001 ptp enable

##
## X.509 certificates configuration
##
#
# Certificate name system-self-signed, ID ca9888a2ed650c5c4bd372c055bdc6b4da65eb1e
# (public-cert config omitted since private-key config is hidden)

##
## Persistent prefix mode setting
##
cli default prefix-modes enable

Host

General Configuration

General Prerequisites:

  • Hardware

    Ensure that all the K8s Worker Nodes have the same hardware specification (see the BoM for details).

  • Host BIOS

    Verify that an SR-IOV-capable server platform is used, and review the BIOS settings in the server platform vendor documentation to enable SR-IOV in the BIOS.

  • Host OS

    The Ubuntu Server 20.04 operating system should be installed on all servers with OpenSSH server packages.

  • Experience with Kubernetes

    Familiarization with the Kubernetes Cluster architecture is essential.

Important

Make sure that the BIOS settings on the K8s Worker Nodes are tuned for maximum performance.

All K8s Worker Nodes must have the same PCIe placement for the NIC and must expose the same interface name.

Host OS Prerequisites

Ensure that a non-root depuser account is created during the deployment of the Ubuntu Server 20.04 operating system.

Update the Ubuntu software packages by running the following commands:

Server console

$ sudo apt-get update
$ sudo apt-get install linux-image-lowlatency -y
$ sudo apt-get upgrade -y
$ sudo reboot

Grant the non-root depuser account sudo privileges without a password.

In this solution, the following line was added to the end of /etc/sudoers:

Server console

$ sudo vim /etc/sudoers

#includedir /etc/sudoers.d
#K8s cluster deployment user with sudo privileges without password
depuser ALL=(ALL) NOPASSWD:ALL

OFED Installation and Configuration

OFED installation is required only on the K8s Worker Nodes. To download the latest OFED version, please visit Linux Drivers (nvidia.com).
The download and installation procedure is provided below. All steps must be performed with root privileges.
After the OFED installation, reboot your node.

Server console

wget https://content.mellanox.com/ofed/MLNX_OFED-5.5-1.0.3.2/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64.iso
mount -o loop ./MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64.iso /mnt/
/mnt/mlnxofedinstall --vma --without-fw-update
reboot
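
Once the node is back up, you can confirm that the OFED driver stack is installed. The command below ships with the MLNX_OFED package, and the printed version string should match the installed release:

Worker Node console

ofed_info -s
MLNX_OFED_LINUX-5.5-1.0.3.2: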

K8s Cluster Deployment

The Kubernetes cluster in this solution is installed using Kubespray with a non-root depuser account from the deployment node.

SSH Private Key and SSH Passwordless Login

Log in to the Deployment Node as a deployment user (in our case, depuser) and create an SSH private key for configuring the passwordless authentication on your computer by running the following commands:

Deployment Node console

$ ssh-keygen

Generating public/private rsa key pair.
Enter file in which to save the key (/home/depuser/.ssh/id_rsa):
Created directory '/home/depuser/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/depuser/.ssh/id_rsa
Your public key has been saved in /home/depuser/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:IfcjdT/spXVHVd3n6wm1OmaWUXGuHnPmvqoXZ6WZYl0 depuser@depserver
The key's randomart image is:
+---[RSA 3072]----+
| *|
| .*|
| . o . . o=|
| o + . o +E|
| S o .**O|
| . .o=OX=|
| . o%*.|
| O.o.|
| .*.ooo|
+----[SHA256]-----+

Copy your SSH public key (e.g., ~/.ssh/id_rsa.pub) to all nodes in the deployment by running the following command (example):

Deployment Node console

$ ssh-copy-id depuser@192.168.100.29

/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/depuser/.ssh/id_rsa.pub"
The authenticity of host '192.168.100.29 (192.168.100.29)' can't be established.
ECDSA key fingerprint is SHA256:6nhUgRlt9gY2Y2ofukUqE0ltH+derQuLsI39dFHe0Ag.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
depuser@192.168.100.29's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'depuser@192.168.100.29'"
and check to make sure that only the key(s) you wanted were added.

Verify that you have passwordless SSH connectivity to all nodes in your deployment by running the following command (example):

Deployment Node console

$ ssh depuser@192.168.100.29

Kubespray Deployment and Configuration

General Setting

To install dependencies for running Kubespray with Ansible on the Deployment Node, please run the following commands:

Deployment Node console

$ cd ~
$ sudo apt -y install python3-pip jq
$ wget https://github.com/kubernetes-sigs/kubespray/archive/v2.18.1.tar.gz
$ tar -zxf v2.18.1.tar.gz
$ cd kubespray-2.18.1
$ sudo pip3 install -r requirements.txt

Warning

The default folder for subsequent commands is ~/kubespray-2.18.1.
To download the latest Kubespray version please visit Releases · kubernetes-sigs/kubespray · GitHub.

Deployment Customization

Create a new cluster configuration and host configuration file.
Replace the IP addresses below with your nodes' IP addresses:

Deployment Node console

$ cp -rfp inventory/sample inventory/mycluster
$ declare -a IPS=(192.168.100.29 192.168.100.34 192.168.100.39)
$ CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

As a result, the inventory/mycluster/hosts.yaml file will be created.
Review and change the host configuration in the file. Below is an example of this deployment:

inventory/mycluster/hosts.yaml

all:
  hosts:
    node1:
      ansible_host: 192.168.100.29
      ip: 192.168.100.29
      access_ip: 192.168.100.29
    node2:
      ansible_host: 192.168.100.34
      ip: 192.168.100.34
      access_ip: 192.168.100.34
    node3:
      ansible_host: 192.168.100.39
      ip: 192.168.100.39
      access_ip: 192.168.100.39
  children:
    kube_control_plane:
      hosts:
        node1:
    kube_node:
      hosts:
        node2:
        node3:
    etcd:
      hosts:
        node1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}

Deploying the Cluster Using KubeSpray Ansible Playbook

Run the following line to start the deployment procedure:

Deployment Node console

$ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml

It takes a while for the K8s cluster deployment to complete. Please make sure no errors are encountered in the playbook log.

Below is an example of a successful result:

Deployment Node console

...
PLAY RECAP ***************************************************************************************************************
localhost   : ok=4     changed=0     unreachable=0    failed=0    skipped=0      rescued=0    ignored=0
node1       : ok=501   changed=111   unreachable=0    failed=0    skipped=1131   rescued=0    ignored=2
node2       : ok=360   changed=40    unreachable=0    failed=0    skipped=661    rescued=0    ignored=1
node3       : ok=360   changed=40    unreachable=0    failed=0    skipped=660    rescued=0    ignored=1

Sunday 9 May 2021  19:39:17 +0000 (0:00:00.064)       0:06:54.711 ********
===============================================================================
kubernetes/control-plane : kubeadm | Initialize first master ------------------------------------------------ 28.13s
kubernetes/control-plane : Master | wait for kube-scheduler ------------------------------------------------- 12.78s
download : download_container | Download image if required -------------------------------------------------- 10.56s
container-engine/containerd : ensure containerd packages are installed --------------------------------------- 9.48s
download : download_container | Download image if required --------------------------------------------------- 9.36s
download : download_container | Download image if required --------------------------------------------------- 9.08s
download : download_container | Download image if required --------------------------------------------------- 9.05s
download : download_file | Download item --------------------------------------------------------------------- 8.91s
download : download_container | Download image if required --------------------------------------------------- 8.47s
kubernetes/preinstall : Install packages requirements -------------------------------------------------------- 8.30s
download : download_container | Download image if required --------------------------------------------------- 7.49s
download : download_container | Download image if required --------------------------------------------------- 7.39s
kubernetes-apps/ansible : Kubernetes Apps | Start Resources -------------------------------------------------- 7.07s
download : download_container | Download image if required --------------------------------------------------- 5.99s
container-engine/containerd : ensure containerd repository is enabled ---------------------------------------- 5.59s
container-engine/crictl : download_file | Download item ------------------------------------------------------ 5.45s
download : download_file | Download item --------------------------------------------------------------------- 5.34s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down CoreDNS templates --------------------------------------- 5.00s
download : download_container | Download image if required --------------------------------------------------- 4.95s
download : download_file | Download item --------------------------------------------------------------------- 4.50s

K8s Cluster Customization and Verification

Now that the K8s cluster is deployed, you can connect to it from any K8s Master Node with the root user account, or from another server with the kubectl command installed and KUBECONFIG=<path-to-config-file> configured, in order to customize the deployment.
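
As a reference, below is a minimal sketch of preparing such a remote kubectl session. The file locations are examples only; Kubespray (via kubeadm) places the admin kubeconfig at /etc/kubernetes/admin.conf on the Master Node, and you may need to adjust the server address inside the copied file so that it points at the Master Node IP instead of a local endpoint:

Remote server console

$ scp root@192.168.100.29:/etc/kubernetes/admin.conf ~/.kube/mycluster.conf
$ export KUBECONFIG=~/.kube/mycluster.conf
$ kubectl get nodes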

In this guide, we continue the deployment from the K8s Master Node with the root user account.

Label the Worker Nodes:

Master Node console

$ kubectl label nodes node2 node-role.kubernetes.io/worker=
$ kubectl label nodes node3 node-role.kubernetes.io/worker=

Important

K8s Worker Node labeling is required for a proper installation of the NVIDIA Network Operator.

Below is an output example of the K8s cluster deployment information using the Calico CNI plugin.

To ensure that the Kubernetes cluster is installed correctly, run the following commands:

Master Node console

## Get cluster node status

kubectl get node -o wide

NAME    STATUS   ROLES                  AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION         CONTAINER-RUNTIME
node1   Ready    control-plane,master   9d    v1.22.8   192.168.100.29   <none>        Ubuntu 20.04.4 LTS   5.4.0-109-generic      containerd://1.5.8
node2   Ready    worker                 9d    v1.22.8   192.168.100.34   <none>        Ubuntu 20.04.4 LTS   5.4.0-109-lowlatency   containerd://1.5.8
node3   Ready    worker                 9d    v1.22.8   192.168.100.39   <none>        Ubuntu 20.04.4 LTS   5.4.0-109-lowlatency   containerd://1.5.8

## Get system pods status

kubectl -n kube-system get pods -o wide

NAME                                      READY   STATUS    RESTARTS   AGE   IP               NODE    NOMINATED NODE   READINESS GATES
calico-kube-controllers-5788f6558-bm5h9   1/1     Running   0          9d    192.168.100.29   node1   <none>           <none>
calico-node-4f748                         1/1     Running   0          9d    192.168.100.34   node2   <none>           <none>
calico-node-jhbjh                         1/1     Running   0          9d    192.168.100.39   node3   <none>           <none>
calico-node-m78p6                         1/1     Running   0          9d    192.168.100.29   node1   <none>           <none>
coredns-8474476ff8-dczww                  1/1     Running   0          9d    10.233.90.23     node1   <none>           <none>
coredns-8474476ff8-ksvkd                  1/1     Running   0          9d    10.233.96.234    node2   <none>           <none>
dns-autoscaler-5ffdc7f89d-h6nc8           1/1     Running   0          9d    10.233.90.20     node1   <none>           <none>
kube-apiserver-node1                      1/1     Running   0          9d    192.168.100.29   node1   <none>           <none>
kube-controller-manager-node1             1/1     Running   0          9d    192.168.100.29   node1   <none>           <none>
kube-proxy-2bq45                          1/1     Running   0          9d    192.168.100.34   node2   <none>           <none>
kube-proxy-4c8p7                          1/1     Running   0          9d    192.168.100.39   node3   <none>           <none>
kube-proxy-j226w                          1/1     Running   0          9d    192.168.100.29   node1   <none>           <none>
kube-scheduler-node1                      1/1     Running   0          9d    192.168.100.29   node1   <none>           <none>
nginx-proxy-node2                         1/1     Running   0          9d    192.168.100.34   node2   <none>           <none>
nginx-proxy-node3                         1/1     Running   0          9d    192.168.100.39   node3   <none>           <none>
nodelocaldns-9rffq                        1/1     Running   0          9d    192.168.100.39   node3   <none>           <none>
nodelocaldns-fdnr7                        1/1     Running   0          9d    192.168.100.34   node2   <none>           <none>
nodelocaldns-qhpxk                        1/1     Running   0          9d    192.168.100.29   node1   <none>           <none>

NVIDIA GPU Operator Installation for K8s cluster

The preferred method to deploy the GPU Operator is using Helm from the K8s Master Node. To install Helm, use the following command:

$ snap install helm --classic

Add the NVIDIA GPU Operator Helm repository.

$ helm repo add nvidia https://nvidia.github.io/gpu-operator
$ helm repo update

Deploy the NVIDIA GPU Operator.

The GPU Operator should be deployed with the GPUDirect RDMA kernel module enabled: driver.rdma.enabled=true.

$ helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.rdma.enabled=true --set driver.rdma.useHostMofed=true

$ helm ls -n gpu-operator
NAME                      NAMESPACE      REVISION   UPDATED                                   STATUS     CHART                  APP VERSION
gpu-operator-1652190420   gpu-operator   1          2022-05-10 13:47:01.106147933 +0000 UTC   deployed   gpu-operator-v1.10.0   v1.10.0

Once the Helm chart is installed, check the status of the pods to ensure all the containers are running and the validation is complete:

$ kubectl get pod -n gpu-operator -o wide

NAME                                                              READY   STATUS      RESTARTS        AGE     IP              NODE    NOMINATED NODE   READINESS GATES
gpu-feature-discovery-bcc22                                       1/1     Running     1 (3d8h ago)    5d18h   10.233.96.3     node2   <none>           <none>
gpu-feature-discovery-vl68h                                       1/1     Running     0               5d18h   10.233.92.58    node3   <none>           <none>
gpu-operator-1652190420-node-feature-discovery-master-5b5fx8zlx   1/1     Running     1 (4m5s ago)    5d18h   10.233.90.17    node1   <none>           <none>
gpu-operator-1652190420-node-feature-discovery-worker-czsb4       1/1     Running     0               4s      10.233.92.75    node3   <none>           <none>
gpu-operator-1652190420-node-feature-discovery-worker-fnlj6       1/1     Running     0               4s      10.233.96.253   node2   <none>           <none>
gpu-operator-1652190420-node-feature-discovery-worker-r44r8       1/1     Running     1 (4m5s ago)    5d18h   10.233.90.22    node1   <none>           <none>
gpu-operator-6497cbf9cd-vcsrg                                     1/1     Running     1 (4m6s ago)    5d18h   10.233.90.19    node1   <none>           <none>
nvidia-container-toolkit-daemonset-4h9dr                          1/1     Running     0               5d18h   10.233.96.246   node2   <none>           <none>
nvidia-container-toolkit-daemonset-rv7sn                          1/1     Running     1 (5d18h ago)   5d18h   10.233.92.50    node3   <none>           <none>
nvidia-cuda-validator-kr6q9                                       0/1     Completed   0               5d18h   10.233.92.61    node3   <none>           <none>
nvidia-cuda-validator-zb4p8                                       0/1     Completed   0               5d18h   10.233.96.4     node2   <none>           <none>
nvidia-dcgm-exporter-5hdzh                                        1/1     Running     0               5d18h   10.233.96.198   node2   <none>           <none>
nvidia-dcgm-exporter-lnqzb                                        1/1     Running     0               5d18h   10.233.92.57    node3   <none>           <none>
nvidia-device-plugin-daemonset-dxgnz                              1/1     Running     0               5d18h   10.233.92.62    node3   <none>           <none>
nvidia-device-plugin-daemonset-w692b                              1/1     Running     0               5d18h   10.233.96.9     node2   <none>           <none>
nvidia-device-plugin-validator-pqns8                              0/1     Completed   0               5d18h   10.233.92.64    node3   <none>           <none>
nvidia-device-plugin-validator-sgtmt                              0/1     Completed   0               5d18h   10.233.96.10    node2   <none>           <none>
nvidia-driver-daemonset-l9x4n                                     2/2     Running     1 (2d19h ago)   5d18h   10.233.92.30    node3   <none>           <none>
nvidia-driver-daemonset-tf2tl                                     2/2     Running     5 (2d21h ago)   5d18h   10.233.96.244   node2   <none>           <none>
nvidia-operator-validator-p6794                                   1/1     Running     0               5d18h   10.233.96.6     node2   <none>           <none>
nvidia-operator-validator-xjrg9                                   1/1     Running     0               5d18h   10.233.92.54    node3   <none>           <none>
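
As an optional sanity check that is not part of the original flow, a minimal test pod can confirm that the operator exposes the nvidia.com/gpu resource. This is only a sketch; the CUDA image tag is an example and can be replaced with any CUDA-enabled image available to the cluster:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    # Example image - any CUDA-enabled image reachable from the cluster will do
    image: nvidia/cuda:11.6.2-base-ubuntu20.04
    command: [ "nvidia-smi" ]
    resources:
      limits:
        nvidia.com/gpu: 1

After applying the manifest, kubectl logs gpu-smoke-test should print the nvidia-smi table listing the node's A100 GPUs.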

NVIDIA Network Operator Installation

The NVIDIA Network Operator leverages Kubernetes CRDs and the Operator SDK to manage networking-related components in order to enable fast networking and RDMA for workloads in a K8s cluster. The fast network is a secondary network of the K8s cluster for applications that require high bandwidth or low latency.

To make it work, several components need to be provisioned and configured. Helm is required for the Network Operator deployment.

Add the NVIDIA Network Operator Helm repository:

## Add REPO

helm repo add mellanox https://mellanox.github.io/network-operator \
  && helm repo update

Create the values.yaml file to customize the Network Operator deployment (example):

nfd:
  enabled: true

sriovNetworkOperator:
  enabled: true

ofedDriver:
  deploy: false

nvPeerDriver:
  deploy: false

rdmaSharedDevicePlugin:
  deploy: false

sriovDevicePlugin:
  deploy: false

deployCR: true
secondaryNetwork:
  deploy: true
  cniPlugins:
    deploy: true
  multus:
    deploy: true
  ipamPlugin:
    deploy: true

Deploy the operator:

helm install -f ./values.yaml -n network-operator --create-namespace --wait mellanox/network-operator --generate-name

helm ls -n network-operator
NAME                          NAMESPACE          REVISION   UPDATED                                   STATUS     CHART                    APP VERSION
network-operator-1648457278   network-operator   1          2022-03-28 08:47:59.548667592 +0000 UTC   deployed   network-operator-1.1.0   v1.1.0

Once the Helm chart is installed, check the status of the pods to ensure all the containers are running:

## PODs status in namespace - network-operator

kubectl -n network-operator get pods -o wide
NAME                                                               READY   STATUS    RESTARTS   AGE   IP               NODE    NOMINATED NODE   READINESS GATES
network-operator-1648457278-5885dbfff5-wjgsc                       1/1     Running   0          5m    10.233.90.15     node1   <none>           <none>
network-operator-1648457278-node-feature-discovery-master-zbcx8   1/1     Running   0          5m    10.233.90.16     node1   <none>           <none>
network-operator-1648457278-node-feature-discovery-worker-kk4qs   1/1     Running   0          5m    10.233.90.18     node1   <none>           <none>
network-operator-1648457278-node-feature-discovery-worker-n44b6   1/1     Running   0          5m    10.233.92.221    node3   <none>           <none>
network-operator-1648457278-node-feature-discovery-worker-xhzfw   1/1     Running   0          5m    10.233.96.233    node2   <none>           <none>
network-operator-1648457278-sriov-network-operator-5cd4bdb6mm9f   1/1     Running   0          5m    10.233.90.21     node1   <none>           <none>
sriov-device-plugin-cxnrl                                          1/1     Running   0          5m    192.168.100.34   node2   <none>           <none>
sriov-device-plugin-djlmn                                          1/1     Running   0          5m    192.168.100.39   node3   <none>           <none>
sriov-network-config-daemon-rgfvk                                  3/3     Running   0          5m    192.168.100.39   node3   <none>           <none>
sriov-network-config-daemon-zzchs                                  3/3     Running   0          5m    192.168.100.34   node2   <none>           <none>

## PODs status in namespace - nvidia-network-operator-resources

kubectl -n nvidia-network-operator-resources get pods -o wide
NAME                   READY   STATUS    RESTARTS   AGE   IP               NODE    NOMINATED NODE   READINESS GATES
cni-plugins-ds-snf6x   1/1     Running   0          5m    192.168.100.39   node3   <none>           <none>
cni-plugins-ds-zjb27   1/1     Running   0          5m    192.168.100.34   node2   <none>           <none>
kube-multus-ds-mz7nd   1/1     Running   0          5m    192.168.100.39   node3   <none>           <none>
kube-multus-ds-xjxgd   1/1     Running   0          5m    192.168.100.34   node2   <none>           <none>
whereabouts-jgt24      1/1     Running   0          5m    192.168.100.34   node2   <none>           <none>
whereabouts-sphx4      1/1     Running   0          5m    192.168.100.39   node3   <none>           <none>

High-Speed Network Configuration

After installing the operator, check the SriovNetworkNodeState CRs to see all SR-IOV-enabled devices on each node.
In this deployment, the network interface enp57s0f0 has been chosen.

To review the interface status please use the following command:

NICs status

## NIC status
kubectl -n network-operator get sriovnetworknodestates.sriovnetwork.openshift.io node2 -o yaml
...
status:
  interfaces:
  - deviceID: 101d
    driver: mlx5_core
    eSwitchMode: legacy
    linkSpeed: 100000 Mb/s
    linkType: ETH
    mac: 0c:42:a1:2b:73:fa
    mtu: 9000
    name: enp57s0f0
    numVfs: 8
    pciAddress: "0000:39:00.0"
    totalvfs: 8
    vendor: 15b3
  - deviceID: 101d
    driver: mlx5_core
...

Create a SriovNetworkNodePolicy CR for the chosen network interface in a policy.yaml file by specifying the chosen interface in the nicSelector.

According to the application design, VF0 is allotted to a separate pool (timepool) from the rest of the VFs (rdmapool):

policy.yaml

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics-sw1
  namespace: network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/custom-rdma.capable: "true"
  resourceName: timepool
  priority: 99
  mtu: 9000
  numVfs: 8
  nicSelector:
    pfNames: [ "enp57s0f0#0-0" ]
  deviceType: netdevice
  isRdma: true
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics-sw2
  namespace: network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/custom-rdma.capable: "true"
  resourceName: rdmapool
  priority: 99
  mtu: 9000
  numVfs: 8
  nicSelector:
    pfNames: [ "enp57s0f0#1-7" ]
  deviceType: netdevice
  isRdma: true

Deploy policy.yaml:

kubectl apply -f policy.yaml
sriovnetworknodepolicy.sriovnetwork.openshift.io/mlnxnics-sw1 created
sriovnetworknodepolicy.sriovnetwork.openshift.io/mlnxnics-sw2 created

Important

This step takes a while. The duration depends on the number of K8s Worker Nodes to which the configuration is applied and on the number of VFs for each selected network interface.
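
Progress can be tracked through the syncStatus field of the SriovNetworkNodeState CRs, which is expected to move from InProgress to Succeeded once the VFs are configured on a node. A quick check:

Master Node console

kubectl -n network-operator get sriovnetworknodestates.sriovnetwork.openshift.io \
  -o custom-columns=NODE:.metadata.name,STATUS:.status.syncStatus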

Create SriovNetwork CRs for the chosen network interface in a network.yaml file, which refers to the resourceName defined in the SriovNetworkNodePolicy.

In the example below, the following networks are created:

  • timenet - K8s network name for PTP time sync

  • rdmanet - K8s network name with dynamic IPAM

  • rdma-static - K8s network name with static IPAM

network.yaml

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: timenet
  namespace: network-operator
spec:
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": {"kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"},
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "172.20.0.0/24",
      "exclude": [ "172.20.0.1/32" ]
    }
  networkNamespace: default
  resourceName: timepool
  trust: "on"
  vlan: 0
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rdmanet
  namespace: network-operator
spec:
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": { "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig" },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.102.0/24",
      "exclude": [ "192.168.102.254/32", "192.168.102.253/32" ]
    }
  networkNamespace: default
  resourceName: rdmapool
  vlan: 2
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rdmanet-static
  namespace: network-operator
spec:
  ipam: |
    {
      "type": "static"
    }
  networkNamespace: default
  resourceName: rdmapool
  vlan: 2

Deploy network.yaml:

kubectl apply -f network.yaml
sriovnetwork.sriovnetwork.openshift.io/timenet created
sriovnetwork.sriovnetwork.openshift.io/rdmanet created
sriovnetwork.sriovnetwork.openshift.io/rdmanet-static created
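
Optionally, before deploying the media applications, the secondary network can be smoke-tested with a simple pod attached to rdmanet. The following is only a sketch; the image is an example network-debugging image and only needs to include the iproute2 tools:

apiVersion: v1
kind: Pod
metadata:
  name: sriov-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmanet
spec:
  containers:
  - name: test
    # Example image - replace with any image that includes the "ip" command
    image: nicolaka/netshoot
    command: [ "sh", "-c", "sleep inf" ]
    resources:
      requests:
        nvidia.com/rdmapool: 1
      limits:
        nvidia.com/rdmapool: 1

Once the pod is running, kubectl exec sriov-test-pod -- ip addr show net1 should show a VF interface with an address from the 192.168.102.0/24 whereabouts range.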

Manage HugePages

Kubernetes supports the allocation and consumption of pre-allocated HugePages by applications in a Pod. The nodes automatically discover and report all HugePages resources as schedulable resources. For additional information about K8s HugePages management, please refer here.

In order to allocate HugePages, modify the GRUB_CMDLINE_LINUX_DEFAULT parameter in /etc/default/grub on each K8s Worker Node. The setting below allocates 2MB * 8192 pages = 16GB of HugePages at boot time:

/etc/default/grub

...

GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=2M hugepagesz=2M hugepages=8192"

...

Run update-grub to apply the configuration and reboot the server:

Worker Node console

# update-grub
# reboot

After the server comes back up, check the HugePages allocation from the Master Node with the following command:

Master Node console

# kubectl describe nodes node2
...
Capacity:
  cpu:                  48
  ephemeral-storage:    459923528Ki
  hugepages-1Gi:        0
  hugepages-2Mi:        16Gi
  memory:               264050900Ki
  nvidia.com/gpu:       2
  nvidia.com/rdmapool:  7
  nvidia.com/timepool:  1
  pods:                 110
Allocatable:
  cpu:                  46
  ephemeral-storage:    423865522704
  hugepages-1Gi:        0
  hugepages-2Mi:        16Gi
  memory:               246909140Ki
  nvidia.com/gpu:       2
  nvidia.com/rdmapool:  7
  nvidia.com/timepool:  1
  pods:                 110
...
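
A pod then consumes the pre-allocated HugePages by requesting the hugepages-2Mi resource together with memory and by mounting an emptyDir volume with medium: HugePages, exactly as the application manifests later in this guide do. A minimal sketch (the image is just a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: hugepages-example
spec:
  containers:
  - name: app
    # Placeholder image - any workload image can be used here
    image: ubuntu:20.04
    command: [ "sh", "-c", "sleep inf" ]
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages-2Mi: 2Gi
        memory: 2Gi
      limits:
        hugepages-2Mi: 2Gi
        memory: 2Gi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages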

Enable CPU and Topology Management

CPU Manager manages groups of CPUs and constrains workloads to specific CPUs.

CPU Manager is useful for workloads that have some of these attributes:

  • Require as much CPU time as possible

  • Are sensitive to processor cache misses

  • Are low-latency network applications

  • Coordinate with other processes and benefit from sharing a single processor cache

Topology Manager uses topology information from collected hints to decide if a pod can be accepted or rejected on a node, based on the configured Topology Manager policy and Pod resources requested. In order to extract the best performance, optimizations related to CPU isolation and memory and device locality are required.

Topology Manager is useful for workloads that use hardware accelerators to support latency-critical execution and high throughput parallel computation.

Important

To use Topology Manager, CPU Manager with static policy must be used.

For additional information, please refer to Control CPU Management Policies on the Node and Control Topology Management Policies on a Node.

In order to enable CPU Manager and Topology Manager, add the following lines to the kubelet configuration file /etc/kubernetes/kubelet-config.yaml:

/etc/kubernetes/kubelet-config.yaml

...
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 10s
topologyManagerPolicy: single-numa-node
featureGates:
  CPUManager: true
  TopologyManager: true

Due to changes in cpuManagerPolicy, remove /var/lib/kubelet/cpu_manager_state and restart kubelet service on each affected K8s worker node.

Worker Node console

# rm -f /var/lib/kubelet/cpu_manager_state
# service kubelet restart
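
After the restart, kubelet re-creates the state file with the new policy. A quick check on the worker node is to confirm that the policyName field in the file now reads "static":

Worker Node console

# cat /var/lib/kubelet/cpu_manager_state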

Application

Below are the K8s-specific components and the K8s YAML configuration files used to deploy the Rivermax applications in the K8s cluster.

Note

For proper application execution, a Rivermax license is required. To obtain a license, please refer to the Rivermax License Generation Guidelines.

Note

To download the Rivermax application container images from the container repository and the application pipeline, you need to register and log in to the Rivermax portal by clicking "Get Started".

Rivermax license

Upload the Rivermax license as a ConfigMap value in the K8s cluster:

kubectl create configmap rivermax-config --from-file=rivermax.lic=./rivermax.lic
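
To verify that the license was stored correctly, check that the rivermax.lic key appears in the ConfigMap data:

Master Node console

kubectl describe configmap rivermax-config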

Media Node application

This pod definition contains an implementation of the AMWA Networked Media Open Specifications (NMOS) with the NMOS Rivermax Node implementation. For more information about AMWA, NMOS and the Networked Media Incubator, please refer to http://amwa.tv/. For more information about the Rivermax SDK, please refer to https://developer.nvidia.com/networking/Rivermax.
The YAML configuration file for the Media Node deployment is provided below. Please fill in your container image name and your registry secret.

apiVersion: v1
kind: ConfigMap
metadata:
  name: river-config
data:
  container-config: |-
    #media_node JSON file to run
    config_json=/var/home/config.json
    #Output registry stdout/stderr output to a log inside container
    log_output=FALSE
    #Update/insert label parameter with container hostname on entrypoint script run
    update_label=TRUE
    #Allow these network interfaces in /etc/avahi/avahi-daemon.conf
    allow_interfaces=net1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: "mnc"
  labels:
    apps: rivermax
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rivermax
  template:
    metadata:
      labels:
        app: rivermax
      annotations:
        k8s.v1.cni.cncf.io/networks: rdmanet
    spec:
      containers:
      - command:
        image: < media node container image >
        name: "medianode"
        env:
        - name: DISPLAY
          value: "192.168.102.253:0.0"
        resources:
          requests:
            nvidia.com/rdmapool: 1
            hugepages-2Mi: 4Gi
            memory: 8Gi
            cpu: 4
          limits:
            nvidia.com/rdmapool: 1
            hugepages-2Mi: 4Gi
            memory: 8Gi
            cpu: 4
        securityContext:
          capabilities:
            add: [ "IPC_LOCK", "SYS_RESOURCE", "NET_RAW", "NET_ADMIN" ]
        volumeMounts:
        - name: config
          mountPath: /var/home/ext/
        - name: licconfig
          mountPath: /opt/mellanox/rivermax/
        - mountPath: /hugepages
          name: hugepage
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: config
        configMap:
          name: river-config
      - name: licconfig
        configMap:
          name: rivermax-config
      - name: hugepage
        emptyDir:
          medium: HugePages
      - name: dshm
        emptyDir: { medium: 'Memory', sizeLimit: '4Gi' }
      imagePullSecrets:
      - name: < Container registry secret >
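
After applying the manifest, the three Media Node replicas should be scheduled on the K8s Worker Nodes, each with one VF from the rdmapool resource and an address from the whereabouts range on its net1 interface. A quick check:

Master Node console

kubectl get pods -l app=rivermax -o wide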

NMOS controller

AMWA NMOS controller is a device that can interact with NMOS APIs, which are a family of open specifications for networked media for professional applications. NMOS controller can discover, register, connect and manage media devices on an IP infrastructure using common methods and protocols. NMOS controller can also handle event and tally, audio channel mapping, authorization and other functions that are part of the NMOS roadmap. For more information, please look at README.md.

apiVersion: v1
kind: Pod
metadata:
  name: nmos-cpp
  labels:
    app.kubernetes.io/name: nmos
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {
          "name": "rdmanet-static",
          "ips": [ "192.168.102.254/24" ]
        }
      ]
spec:
  containers:
  - name: nmos-pod
    image: docker.io/rhastie/nmos-cpp:latest
    env:
    - name: RUN_NODE
      value: "true"
    resources:
      requests:
        cpu: 2
        memory: 1Gi
        nvidia.com/rdmapool: 1
      limits:
        cpu: 2
        memory: 1Gi
        nvidia.com/rdmapool: 1
    ports:
    - containerPort: 8010
      name: port-8010
    - containerPort: 8011
      name: port-8011
    - containerPort: 11000
      name: port-11000
    - containerPort: 11001
      name: port-11001
    - containerPort: 1883
      name: port-1883
    - containerPort: 5353
      name: port-5353
      protocol: UDP
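
To verify that the static address 192.168.102.254 was attached to the pod's secondary interface, the Multus network-status annotation can be inspected, for example:

Master Node console

kubectl describe pod nmos-cpp | grep -A 10 'Annotations:'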

DeepStream Media Gateway

One of the applications of the DeepStream SDK is to encode RAW data into an SRT stream. This application can capture video frames from a camera or a file, encode them using the H.264 or H.265 codec, and send them over a network using the SRT protocol. SRT stands for Secure Reliable Transport, a low-latency and secure streaming technology. This application can be useful for scenarios such as remote surveillance, live broadcasting, or video conferencing.
The YAML configuration file for the Media Gateway deployment is provided below. Please fill in your container image name and your registry secret.

apiVersion: v1
kind: Pod
metadata:
  name: ds-rmax
  labels:
    name: dsrmax-app
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmanet
spec:
  containers:
  - name: dsrmax
    image: < DeepStream media gateway container image >
    command:
    - sh
    - -c
    - sleep inf
    env:
    - name: DISPLAY
      value: "192.168.102.253:0.0"
    ports:
    - containerPort: 7001
      name: udp-port
    securityContext:
      capabilities:
        add: [ "IPC_LOCK", "SYS_RESOURCE", "NET_RAW", "NET_ADMIN" ]
    resources:
      requests:
        nvidia.com/rdmapool: 1
        nvidia.com/gpu: 1
        hugepages-2Mi: 2Gi
        memory: 8Gi
        cpu: 8
      limits:
        nvidia.com/rdmapool: 1
        nvidia.com/gpu: 1
        hugepages-2Mi: 2Gi
        memory: 8Gi
        cpu: 8
    volumeMounts:
    - name: config
      mountPath: /var/home/ext/
    - name: licconfig
      mountPath: /opt/mellanox/rivermax/
    - mountPath: /hugepages
      name: hugepage
    - mountPath: /dev/shm
      name: dshm
  volumes:
  - name: config
    configMap:
      name: river-config
  - name: licconfig
    configMap:
      name: rivermax-config
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: dshm
    emptyDir: { medium: 'Memory', sizeLimit: '4Gi' }
  imagePullSecrets:
  - name: < Container registry secret >
---
apiVersion: v1
kind: Service
metadata:
  name: rmax-service
spec:
  type: NodePort
  selector:
    name: dsrmax-app
  ports:
    # By default and for convenience, the `targetPort` is set to the same value as the `port` field.
    - port: 7001
      name: udp-port
      protocol: UDP
      targetPort: 7001
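
Because the service is of type NodePort, the SRT/UDP stream is reachable on any worker node's IP at the allocated node port. The assigned port can be looked up with:

Master Node console

kubectl get svc rmax-service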

VNC container with GUI

This pod definition allows you to access a web VNC interface with an Ubuntu LXDE/LXQT desktop environment inside the Kubernetes cluster. It uses an interface on the K8s secondary network to manage applications via a GUI on your cluster nodes.
The YAML configuration file for the VNC deployment is provided below. Please fill in your container image name.

Note

An example of this application can be found at GitHub - theasp/docker-novnc: noVNC Display Container for Docker, but you can create your own container image.

apiVersion: v1
kind: Pod
metadata:
  name: ub-vnc
  labels:
    name: ubuntu-vnc
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {
          "name": "rdmanet-static",
          "ips": [ "192.168.102.253/24" ]
        }
      ]
spec:
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory
  containers:
  - image: < NOVNC container image >
    name: vnc-container
    resources:
      limits:
        cpu: 4
        memory: 8Gi
        nvidia.com/rdmapool: 1
    env:
    - name: DISPLAY_WIDTH
      value: "1920"
    - name: DISPLAY_HEIGHT
      value: "1080"
    - name: RUN_XTERM
      value: "yes"
    - name: RUN_FLUXBOX
      value: "yes"
    ports:
    - containerPort: 8080
      name: http-port
    volumeMounts:
    - mountPath: /dev/shm
      name: dshm
---
apiVersion: v1
kind: Service
metadata:
  name: vnc-service
spec:
  type: NodePort
  selector:
    name: ubuntu-vnc
  ports:
  - port: 8080
    name: http-port
    targetPort: 8080
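
The noVNC web interface is likewise published through a NodePort service. Look up the allocated port and open it from a browser on the management network; the exact URL path (for example /vnc.html) depends on the noVNC image used:

Master Node console

kubectl get svc vnc-service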

Authors

ID-2.jpg

Vitaliy Razinkov

Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for complex Kubernetes/OpenShift and Microsoft's leading solutions, research and design. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference design guides on Microsoft technologies, RoCE/RDMA accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website.

garethsb-badge-photo.jpg

Gareth Sylvester-Bradley

Gareth Sylvester-Bradley is a Principal Engineer at NVIDIA, and currently serving as the chair of the Networked Media Open Specifications (NMOS) Architecture Review group in the Advanced Media Workflow Association (AMWA). He is focused on building software toolkits and agile, collaborative industry specifications to deliver open, software-defined, hardware-accelerated media workflows for broadcast, live production, medical imaging, industrial video, etc.

Last updated on Sep 12, 2023.