



Created on June 15, 2022.

Scope

The following Reference Deployment Guide (RDG) shows deployment of Rivermax and DeepStream streaming apps over accelerated Kubernetes cluster. 

Abbreviations and Acronyms

Term       Definition

CDN        Content Delivery Network
CNI        Container Network Interface
CR         Custom Resource
CRD        Custom Resource Definition
CRI        Container Runtime Interface
DHCP       Dynamic Host Configuration Protocol
DNS        Domain Name System
DP         Device Plugin
DS         DeepStream
IPAM       IP Address Management
K8s        Kubernetes
LLDP       Link Layer Discovery Protocol
NCCL       NVIDIA Collective Communication Library
NFD        Node Feature Discovery
OCI        Open Container Initiative
PF         Physical Function
QSG        Quick Start Guide
RDG        Reference Deployment Guide
RDMA       Remote Direct Memory Access
RoCE       RDMA over Converged Ethernet
SR-IOV     Single Root Input/Output Virtualization
VF         Virtual Function

Introduction

This guide provides a complete solution cycle of K8s cluster deployment, including a technology overview, design, component selection, deployment steps and application workload examples. The solution is delivered on top of standard servers, with NVIDIA end-to-end Ethernet infrastructure carrying the workload.
In this guide, we use the NVIDIA GPU Operator and the NVIDIA Network Operator, which manage the deployment and configuration of the GPU and network components in the K8s cluster. These components allow you to accelerate workloads using CUDA, RDMA and GPUDirect technologies.

This guide shows the design of a K8s cluster with two K8s worker nodes and provides detailed instructions for deploying a K8s cluster.
A Greenfield deployment is assumed for this guide.

The information presented is written for experienced Media and Entertainment (M&E) broadcast system administrators, system engineers and solution architects who need to deploy Rivermax streaming apps for their customers.


Solution Architecture

Key Components and Technologies

  • NVIDIA DGX A100 
    NVIDIA DGX™ A100 is the universal system for all AI workloads, offering unprecedented compute density, performance, and flexibility in the world’s first 5 petaFLOPS AI system. NVIDIA DGX A100 features the world’s most advanced accelerator, the NVIDIA A100 Tensor Core GPU, enabling enterprises to consolidate training, inference, and analytics into a unified, easy-to-deploy AI infrastructure that includes direct access to NVIDIA AI experts.

  • NVIDIA ConnectX SmartNICs
    10/25/40/50/100/200 and 400G Ethernet Network Adapters
    The industry-leading NVIDIA® ConnectX® family of smart network interface cards (SmartNICs) offer advanced hardware offloads and accelerations.
    NVIDIA Ethernet adapters enable the highest ROI and lowest Total Cost of Ownership for hyperscale, public and private clouds, storage, machine learning, AI, big data, and telco platforms.

  • NVIDIA LinkX Cables 
    The NVIDIA® LinkX® product family of cables and transceivers provides the industry’s most complete line of 10, 25, 40, 50, 100, 200, and 400GbE in Ethernet and 100, 200 and 400Gb/s InfiniBand products for Cloud, HPC, hyperscale, Enterprise, telco, storage and artificial intelligence, data center applications.

  • NVIDIA Spectrum Ethernet Switches
    Flexible form-factors with 16 to 128 physical ports, supporting 1GbE through 400GbE speeds.
    Based on a ground-breaking silicon technology optimized for performance and scalability, NVIDIA Spectrum switches are ideal for building high-performance, cost-effective, and efficient Cloud Data Center Networks, Ethernet Storage Fabric, and Deep Learning Interconnects. 
    NVIDIA combines the benefits of NVIDIA Spectrum switches, based on an industry-leading application-specific integrated circuit (ASIC) technology, with a wide variety of modern network operating system choices, including NVIDIA Cumulus® Linux, SONiC and NVIDIA Onyx®.

  • Kubernetes
    Kubernetes is an open-source container orchestration platform for deployment automation, scaling, and management of containerized applications.

  • Kubespray 
    Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS and Kubernetes cluster configuration management tasks, and provides:
    • A highly available cluster
    • Composable attributes
    • Support for most popular Linux distributions

  • NVIDIA GPU Operator
    The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM-based monitoring, and more.

  • NVIDIA Network Operator
    Analogous to the NVIDIA GPU Operator, the NVIDIA Network Operator simplifies scale-out network design for Kubernetes by automating aspects of network deployment and configuration that would otherwise require manual work. It loads the required drivers, libraries, device plugins, and CNIs on any cluster node with an NVIDIA network interface. Paired with the NVIDIA GPU Operator, the Network Operator enables GPUDirect RDMA, a key technology that accelerates cloud-native AI workloads by orders of magnitude. The NVIDIA Network Operator uses Kubernetes CRDs and the Operator Framework to provision the host software needed for enabling accelerated networking.

  • NVIDIA CUDA 
    CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs. In GPU-accelerated applications, the sequential part of the workload runs on the CPU – which is optimized for single-threaded performance – while the compute-intensive portion of the application runs on thousands of GPU cores in parallel.

  • NVIDIA Rivermax SDK
    NVIDIA Rivermax offers a unique IP-based solution for any media and data streaming use case. Rivermax together with NVIDIA GPU accelerated computing technologies unlocks innovation for a wide range of applications in Media and Entertainment (M&E), Broadcast, Healthcare, Smart Cities and more. Rivermax leverages NVIDIA ConnectX and BlueField DPU hardware streaming acceleration technology that enables direct data transfers to and from the GPU, delivering best-in-class throughput and latency with minimal CPU utilization for streaming workloads.

  • NVIDIA DeepStream SDK
    NVIDIA DeepStream allows the rapid development and deployment of Vision AI applications and services. DeepStream provides multi-platform, scalable, TLS-encrypted security that can be deployed on-premises, on the edge, and in the cloud. It delivers a complete streaming analytics toolkit for AI-based multi-sensor processing, and video, audio and image understanding. DeepStream is principally intended for vision AI developers, software partners, startups and OEMs building IVA apps and services.

  • Networked Media Open Specifications (NMOS)
    NMOS specifications are a family of open, free-of-charge specifications that enable interoperability between media devices on an IP infrastructure. The core specifications, IS-04 Registration and Discovery and IS-05 Device Connection Management, provide uniform mechanisms to enable media devices and services to advertise their capabilities onto the network, and control systems to configure the video, audio and data streams between the devices' senders and receivers. NMOS is extensible and, for example, includes specifications for audio channel mapping, for exchange of event and tally information, and for securing the APIs, leveraging IT best practices. There are open-source NMOS implementations available, and NVIDIA provides a free NMOS Node library in the DeepStream SDK.



Logical Design

The logical design includes the following parts:

  • Deployment node running Kubespray that deploys Kubernetes cluster

  • K8s Master node running all Kubernetes management components

  • K8s Worker nodes with NVIDIA GPUs and NVIDIA ConnectX-6 Dx adapters

  • High-speed Ethernet fabric (Secondary K8s network) 

  • Deployment and K8s Management networks

Application Logical Design

In this guide, we deploy the following applications:

  1. Rivermax Media node
  2. NMOS registry controller
  3. DeepStream gateway
  4. Time synchronization service
  5. VNC apps for internal GUI access

Software Stack Components

Bill of Materials

The following hardware setup is utilized in this guide to build a K8s cluster with two K8s Worker Nodes.

You can use any suitable hardware according to the network topology and software stack.


Deployment and Configuration

Network / Fabric

This RDG describes K8s cluster deployment with multiple K8s Worker Nodes.

The high-performance network is a secondary network for the Kubernetes cluster and requires an L2 network topology.

The Deployment/Management network topology and the DNS/DHCP network services are part of the IT infrastructure. The installation and configuration of these components are not covered in this guide.

Network IP Configuration

Below are the server names with their relevant network configurations.


Server/Switch Type    Server/Switch Name    High-Speed Network      Management Network

Deployment node       depserver             N/A                     eth0: DHCP, 192.168.100.202
K8s Master node       node1                 N/A                     eth0: DHCP, 192.168.100.29
K8s Worker Node1      node2                 enp57s0f0: no IP set    eth0: DHCP, 192.168.100.34
K8s Worker Node2      node3                 enp57s0f0: no IP set    eth0: DHCP, 192.168.100.39
High-speed switch     switch                -                       mgmt0: DHCP, 192.168.100.49

enpXXs0f0 high-speed network interfaces do not require additional configuration.

Wiring

On each K8s Worker Node, only the first port of the NVIDIA network adapter is wired to an NVIDIA switch in the high-performance fabric using NVIDIA LinkX DAC cables.

The below figure illustrates the required wiring for building a K8s cluster.

Fabric Configuration

Switch configuration is provided below: 

Switch console
##
## Running database "initial"
## Generated at 2022/05/10 15:49:25 +0200
## Hostname: switch
## Product release: 3.9.3202
##

##
## Running-config temporary prefix mode setting
##
no cli default prefix-modes enable

##
## Interface Ethernet configuration
##
   interface ethernet 1/1-1/32 speed 100GxAuto force
   interface ethernet 1/1-1/32 switchport mode hybrid
   
##
## VLAN configuration
##
   vlan 2
   vlan 1001
   vlan 2 name "RiverData"
   vlan 1001 name "PTP"
   interface ethernet 1/1-1/32 switchport hybrid allowed-vlan all
   interface ethernet 1/5 switchport access vlan 1001
   interface ethernet 1/7 switchport access vlan 1001
   interface ethernet 1/5 switchport hybrid allowed-vlan add 2
   interface ethernet 1/7 switchport hybrid allowed-vlan add 2

   
##
## STP configuration
##
no spanning-tree
   
##
## L3 configuration
##
   interface vlan 1001
   interface vlan 1001 ip address 172.20.0.1/24 primary
   
##
## IGMP Snooping configuration
##
   ip igmp snooping unregistered multicast forward-to-mrouter-ports
   ip igmp snooping
   vlan 1001 ip igmp snooping
   vlan 1001 ip igmp snooping querier 
   interface ethernet 1/5 ip igmp snooping fast-leave
   interface ethernet 1/7 ip igmp snooping fast-leave


   
##
## Local user account configuration
##
   username admin password 7 $6$mSW1WwYI$M5xfvsphrTRht6J2ByfF.J475tq8YuGKR6K1FwSgvkdb1QQFZbx/PtqK.GVJEBoMcmXsnB57QycP7jSp.Hy/Q.
   username monitor password 7 $6$V/Og9kzY$qc.oU2Ma9MPJClZlbvymOrb1wtE0N5yfQYPamhzRYeN2npVY/lOE5iisHUpxNqm3Ku8lIWDTPiO/bklyCMi2o.
   
##
## AAA remote server configuration
##
# ldap bind-password ********
   ldap vrf default enable
   radius-server vrf default enable
# radius-server key ********
   tacacs-server vrf default enable
# tacacs-server key ********
   
##
## Password restriction configuration
##
no password hardening enable
   
##
## SNMP configuration
##
   snmp-server vrf default enable
   
##
## Network management configuration
##
# web proxy auth basic password ********
   clock timezone Asia Middle_East Jerusalem
   ntp vrf default disable
   terminal sysrq enable
   web vrf default enable
   
##
## PTP protocol
##
   protocol ptp
   ptp priority1 1
   ptp vrf default enable
   interface ethernet 1/5 ptp enable
   interface ethernet 1/7 ptp enable
   interface vlan 1001 ptp enable
   
##
## X.509 certificates configuration
##
#
# Certificate name system-self-signed, ID ca9888a2ed650c5c4bd372c055bdc6b4da65eb1e
# (public-cert config omitted since private-key config is hidden)

##
## Persistent prefix mode setting
##
cli default prefix-modes enable

Host

General Configuration

General Prerequisites:

  • Hardware
    Ensure that all the K8s Worker Nodes have the exact same hardware specification (see the BoM for details).

  • Host BIOS
    Verify that an SR-IOV-capable server platform is used, and review the BIOS settings in the server platform vendor documentation to enable SR-IOV in the BIOS.

  • Host OS
    The Ubuntu Server 20.04 operating system should be installed on all servers with OpenSSH server packages.

  • Experience with Kubernetes
    Familiarization with the Kubernetes Cluster architecture is essential. 

Make sure that the BIOS settings on the K8s Worker Nodes are tuned for maximum performance.

All K8s Worker Nodes must have the same PCIe placement for the NIC and should expose the same interface name.

Host OS Prerequisites 

Ensure that a non-root depuser account is created during the deployment of the Ubuntu Server 20.04 operating system.

Update the Ubuntu software packages by running the following commands:

Server console
$ sudo apt-get update
$ sudo apt-get install linux-image-lowlatency -y
$ sudo apt-get upgrade -y
$ sudo reboot 

Grant the non-root depuser account sudo privileges without requiring a password.

In this solution, the following line was added to the end of /etc/sudoers:

Server console
$ sudo vim /etc/sudoers
  
#includedir /etc/sudoers.d
  
#K8s cluster deployment user with sudo privileges without password
depuser ALL=(ALL) NOPASSWD:ALL

OFED Installation and Configuration

OFED installation is required only on the K8s Worker Nodes. To download the latest OFED version, please visit Linux Drivers (nvidia.com).
The download and installation procedures are provided below. All steps require root privileges.
After the OFED installation, please reboot your node.

Server console
wget https://content.mellanox.com/ofed/MLNX_OFED-5.5-1.0.3.2/MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64.iso
mount -o loop ./MLNX_OFED_LINUX-5.5-1.0.3.2-ubuntu20.04-x86_64.iso /mnt/
/mnt/mlnxofedinstall --vma --without-fw-update
reboot
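
To verify the OFED installation, you can check the installed driver version and the mapping between RDMA devices and network interfaces. The commands below are an optional sanity check; the interface names on your nodes may differ.

Server console
ofed_info -s
ibdev2netdev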


K8s Cluster Deployment

The Kubernetes cluster in this solution is installed using Kubespray with a non-root depuser account from the deployment node.

SSH Private Key and SSH Passwordless Login

Log in to the Deployment Node as a deployment user (in our case, depuser) and create an SSH private key for configuring the passwordless authentication on your computer by running the following commands:

Deployment Node console
$ ssh-keygen 

Generating public/private rsa key pair.
Enter file in which to save the key (/home/depuser/.ssh/id_rsa): 
Created directory '/home/depuser/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/depuser/.ssh/id_rsa
Your public key has been saved in /home/depuser/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:IfcjdT/spXVHVd3n6wm1OmaWUXGuHnPmvqoXZ6WZYl0 depuser@depserver
The key's randomart image is:
+---[RSA 3072]----+
|                *|
|               .*|
|      . o . .  o=|
|       o + . o +E|
|        S o  .**O|
|         . .o=OX=|
|           . o%*.|
|             O.o.|
|           .*.ooo|
+----[SHA256]-----+

Copy your SSH public key (e.g. ~/.ssh/id_rsa.pub) to all nodes in the deployment by running the following command (example):

Deployment Node console
$ ssh-copy-id depuser@192.168.100.29

/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/depuser/.ssh/id_rsa.pub"
The authenticity of host '192.168.100.29 (192.168.100.29)' can't be established.
ECDSA key fingerprint is SHA256:6nhUgRlt9gY2Y2ofukUqE0ltH+derQuLsI39dFHe0Ag.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
depuser@192.168.100.29's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'depuser@192.168.100.29'"
and check to make sure that only the key(s) you wanted were added.

Verify that you have passwordless SSH connectivity to all nodes in your deployment by running the following command (example):

Deployment Node console
$ ssh depuser@192.168.100.29

Kubespray Deployment and Configuration

General Settings

To install dependencies for running Kubespray with Ansible on the Deployment Node, please run the following commands:

Deployment Node console
$ cd ~
$ sudo apt -y install python3-pip jq
$ wget https://github.com/kubernetes-sigs/kubespray/archive/v2.18.1.tar.gz
$ tar -zxf v2.18.1.tar.gz
$ cd kubespray-2.18.1
$ sudo pip3 install -r requirements.txt

The default folder for subsequent commands is ~/kubespray-2.18.1.
To download the latest Kubespray version please visit Releases · kubernetes-sigs/kubespray · GitHub.

Deployment Customization

Create a new cluster configuration and host configuration file.
Replace the IP addresses below with your nodes' IP addresses:

Deployment Node console
$ cp -rfp inventory/sample inventory/mycluster
$ declare -a IPS=(192.168.100.29 192.168.100.34 192.168.100.39)
$ CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

As a result, the inventory/mycluster/hosts.yaml file will be created.
Review and change the host configuration in the file. Below is an example of this deployment:

inventory/mycluster/hosts.yaml
all:
  hosts:
    node1:
      ansible_host: 192.168.100.29
      ip: 192.168.100.29
      access_ip: 192.168.100.29
    node2:
      ansible_host: 192.168.100.34
      ip: 192.168.100.34
      access_ip: 192.168.100.34
    node3:
      ansible_host: 192.168.100.39
      ip: 192.168.100.39
      access_ip: 192.168.100.39
      
  children:
    kube_control_plane:
      hosts:
        node1:
    kube_node:
      hosts:
        node2:
        node3:
    etcd:
      hosts:
        node1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}

Deploying the Cluster Using the Kubespray Ansible Playbook

Run the following line to start the deployment procedure:

Deployment Node console
$ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml

It takes a while for the K8s cluster deployment to complete. Please make sure no errors are encountered in the playbook log.

Below is an example of a successful result:

Deployment Node console
...
PLAY RECAP ***************************************************************************************************************************************************
localhost                  : ok=4    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
node1                      : ok=501  changed=111  unreachable=0    failed=0    skipped=1131 rescued=0    ignored=2   
node2                      : ok=360  changed=40   unreachable=0    failed=0    skipped=661  rescued=0    ignored=1   
node3                      : ok=360  changed=40   unreachable=0    failed=0    skipped=660  rescued=0    ignored=1     


Sunday 9 May 2021  19:39:17 +0000 (0:00:00.064)       0:06:54.711 ******** 
=============================================================================== 
kubernetes/control-plane : kubeadm | Initialize first master ----------------------------------------------------------------------------------------- 28.13s
kubernetes/control-plane : Master | wait for kube-scheduler ------------------------------------------------------------------------------------------ 12.78s
download : download_container | Download image if required ------------------------------------------------------------------------------------------- 10.56s
container-engine/containerd : ensure containerd packages are installed -------------------------------------------------------------------------------- 9.48s
download : download_container | Download image if required -------------------------------------------------------------------------------------------- 9.36s
download : download_container | Download image if required -------------------------------------------------------------------------------------------- 9.08s
download : download_container | Download image if required -------------------------------------------------------------------------------------------- 9.05s
download : download_file | Download item -------------------------------------------------------------------------------------------------------------- 8.91s
download : download_container | Download image if required -------------------------------------------------------------------------------------------- 8.47s
kubernetes/preinstall : Install packages requirements ------------------------------------------------------------------------------------------------- 8.30s
download : download_container | Download image if required -------------------------------------------------------------------------------------------- 7.49s
download : download_container | Download image if required -------------------------------------------------------------------------------------------- 7.39s
kubernetes-apps/ansible : Kubernetes Apps | Start Resources ------------------------------------------------------------------------------------------- 7.07s
download : download_container | Download image if required -------------------------------------------------------------------------------------------- 5.99s
container-engine/containerd : ensure containerd repository is enabled --------------------------------------------------------------------------------- 5.59s
container-engine/crictl : download_file | Download item ----------------------------------------------------------------------------------------------- 5.45s
download : download_file | Download item -------------------------------------------------------------------------------------------------------------- 5.34s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down CoreDNS templates -------------------------------------------------------------------------------- 5.00s
download : download_container | Download image if required -------------------------------------------------------------------------------------------- 4.95s
download : download_file | Download item -------------------------------------------------------------------------------------------------------------- 4.50s

K8s Cluster Customization and Verification

Now that the K8s cluster is deployed, the deployment can be customized from any K8s Master Node with the root user account, or from another server with the kubectl command installed and KUBECONFIG=<path-to-config-file> configured (see the example below).
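
For example, to customize the deployment from a remote host, a copy of the cluster admin kubeconfig can be used. The commands below are a minimal sketch, assuming the default kubeadm location of the admin kubeconfig on the K8s Master Node and root SSH access to it:

Deployment Node console
$ mkdir -p ~/.kube
$ scp root@192.168.100.29:/etc/kubernetes/admin.conf ~/.kube/config
$ export KUBECONFIG=~/.kube/config
$ kubectl get nodes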

In this guide, we continue the deployment from the K8s Master Node with the root user account.

Label the Worker Nodes:

Master Node console
$ kubectl label nodes node2 node-role.kubernetes.io/worker=
$ kubectl label nodes node3 node-role.kubernetes.io/worker=

K8s Worker Node labeling is required for a proper installation of the NVIDIA Network Operator.

Below is an output example of the K8s cluster deployment information using the Calico CNI plugin.

To ensure that the Kubernetes cluster is installed correctly, run the following commands:

Master Node console
## Get cluster node status
 
kubectl get node -o wide

NAME    STATUS   ROLES                  AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION         CONTAINER-RUNTIME
node1   Ready    control-plane,master    9d   v1.22.8   192.168.100.29   <none>        Ubuntu 20.04.4 LTS   5.4.0-109-generic      containerd://1.5.8
node2   Ready    worker                  9d   v1.22.8   192.168.100.34   <none>        Ubuntu 20.04.4 LTS   5.4.0-109-lowlatency   containerd://1.5.8
node3   Ready    worker                  9d   v1.22.8   192.168.100.39   <none>        Ubuntu 20.04.4 LTS   5.4.0-109-lowlatency   containerd://1.5.8

## Get system pods status
 
kubectl -n kube-system get pods -o wide

NAME                                      READY   STATUS    RESTARTS       AGE   IP               NODE    NOMINATED NODE   READINESS GATES
calico-kube-controllers-5788f6558-bm5h9   1/1     Running   0               9d   192.168.100.29   node1   <none>           <none>
calico-node-4f748                         1/1     Running   0               9d   192.168.100.34   node2   <none>           <none>
calico-node-jhbjh                         1/1     Running   0               9d   192.168.100.39   node3   <none>           <none>
calico-node-m78p6                         1/1     Running   0               9d   192.168.100.29   node1   <none>           <none>
coredns-8474476ff8-dczww                  1/1     Running   0               9d   10.233.90.23     node1   <none>           <none>
coredns-8474476ff8-ksvkd                  1/1     Running   0               9d   10.233.96.234    node2   <none>           <none>
dns-autoscaler-5ffdc7f89d-h6nc8           1/1     Running   0               9d   10.233.90.20     node1   <none>           <none>
kube-apiserver-node1                      1/1     Running   0               9d   192.168.100.29   node1   <none>           <none>
kube-controller-manager-node1             1/1     Running   0               9d   192.168.100.29   node1   <none>           <none>
kube-proxy-2bq45                          1/1     Running   0               9d   192.168.100.34   node2   <none>           <none>
kube-proxy-4c8p7                          1/1     Running   0               9d   192.168.100.39   node3   <none>           <none>
kube-proxy-j226w                          1/1     Running   0               9d   192.168.100.29   node1   <none>           <none>
kube-scheduler-node1                      1/1     Running   0               9d   192.168.100.29   node1   <none>           <none>
nginx-proxy-node2                         1/1     Running   0               9d   192.168.100.34   node2   <none>           <none>
nginx-proxy-node3                         1/1     Running   0               9d   192.168.100.39   node3   <none>           <none>
nodelocaldns-9rffq                        1/1     Running   0               9d   192.168.100.39   node3   <none>           <none>
nodelocaldns-fdnr7                        1/1     Running   0               9d   192.168.100.34   node2   <none>           <none>
nodelocaldns-qhpxk                        1/1     Running   0               9d   192.168.100.29   node1   <none>           <none>

NVIDIA GPU Operator Installation for the K8s Cluster

The preferred method to deploy the GPU Operator is to use Helm from the K8s Master Node. To install Helm, simply run the following command:

$ snap install helm --classic


Add the NVIDIA GPU Operator Helm repository:

$ helm repo add nvidia https://nvidia.github.io/gpu-operator 
$ helm repo update


Deploy the NVIDIA GPU Operator.
The GPU Operator should be deployed with the GPUDirect RDMA kernel module enabled (driver.rdma.enabled=true).

$ helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.rdma.enabled=true --set driver.rdma.useHostMofed=true 

$ helm ls -n gpu-operator
NAME                   	NAMESPACE   	REVISION	UPDATED                                	STATUS  	CHART               	APP VERSION
gpu-operator-1652190420	gpu-operator	1       	2022-05-10 13:47:01.106147933 +0000 UTC	deployed	gpu-operator-v1.10.0	v1.10.0


Once the Helm chart is installed, check the status of the pods to ensure all the containers are running and the validation is complete:

$ kubectl get pod -n gpu-operator -o wide

NAME                                                              READY   STATUS      RESTARTS        AGE     IP              NODE    NOMINATED NODE   READINESS GATES
gpu-feature-discovery-bcc22                                       1/1     Running     1 (3d8h ago)    5d18h   10.233.96.3     node2   <none>           <none>
gpu-feature-discovery-vl68h                                       1/1     Running     0               5d18h   10.233.92.58    node3   <none>           <none>
gpu-operator-1652190420-node-feature-discovery-master-5b5fx8zlx   1/1     Running     1 (4m5s ago)    5d18h   10.233.90.17    node1   <none>           <none>
gpu-operator-1652190420-node-feature-discovery-worker-czsb4       1/1     Running     0               4s      10.233.92.75    node3   <none>           <none>
gpu-operator-1652190420-node-feature-discovery-worker-fnlj6       1/1     Running     0               4s      10.233.96.253   node2   <none>           <none>
gpu-operator-1652190420-node-feature-discovery-worker-r44r8       1/1     Running     1 (4m5s ago)    5d18h   10.233.90.22    node1   <none>           <none>
gpu-operator-6497cbf9cd-vcsrg                                     1/1     Running     1 (4m6s ago)    5d18h   10.233.90.19    node1   <none>           <none>
nvidia-container-toolkit-daemonset-4h9dr                          1/1     Running     0               5d18h   10.233.96.246   node2   <none>           <none>
nvidia-container-toolkit-daemonset-rv7sn                          1/1     Running     1 (5d18h ago)   5d18h   10.233.92.50    node3   <none>           <none>
nvidia-cuda-validator-kr6q9                                       0/1     Completed   0               5d18h   10.233.92.61    node3   <none>           <none>
nvidia-cuda-validator-zb4p8                                       0/1     Completed   0               5d18h   10.233.96.4     node2   <none>           <none>
nvidia-dcgm-exporter-5hdzh                                        1/1     Running     0               5d18h   10.233.96.198   node2   <none>           <none>
nvidia-dcgm-exporter-lnqzb                                        1/1     Running     0               5d18h   10.233.92.57    node3   <none>           <none>
nvidia-device-plugin-daemonset-dxgnz                              1/1     Running     0               5d18h   10.233.92.62    node3   <none>           <none>
nvidia-device-plugin-daemonset-w692b                              1/1     Running     0               5d18h   10.233.96.9     node2   <none>           <none>
nvidia-device-plugin-validator-pqns8                              0/1     Completed   0               5d18h   10.233.92.64    node3   <none>           <none>
nvidia-device-plugin-validator-sgtmt                              0/1     Completed   0               5d18h   10.233.96.10    node2   <none>           <none>
nvidia-driver-daemonset-l9x4n                                     2/2     Running     1 (2d19h ago)   5d18h   10.233.92.30    node3   <none>           <none>
nvidia-driver-daemonset-tf2tl                                     2/2     Running     5 (2d21h ago)   5d18h   10.233.96.244   node2   <none>           <none>
nvidia-operator-validator-p6794                                   1/1     Running     0               5d18h   10.233.96.6     node2   <none>           <none>
nvidia-operator-validator-xjrg9                                   1/1     Running     0               5d18h   10.233.92.54    node3   <none>           <none>
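
As an optional sanity check (not part of the operator validation above), a minimal pod that requests a GPU and runs nvidia-smi can be used. The pod name and CUDA image tag below are assumptions; replace them with values available in your environment:

Master Node console
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    # Example CUDA base image - replace with an image available to your cluster
    image: nvcr.io/nvidia/cuda:11.6.2-base-ubuntu20.04
    command: [ "nvidia-smi" ]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
$ kubectl logs gpu-smoke-test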


NVIDIA Network Operator Installation  

The NVIDIA Network Operator leverages Kubernetes CRDs and the Operator SDK to manage networking-related components in order to enable fast networking and RDMA for workloads in a K8s cluster. This fast network is a secondary network of the K8s cluster for applications that require high bandwidth or low latency.

To make it work, several components need to be provisioned and configured. Helm is required for the Network Operator deployment.

Add the NVIDIA Network Operator Helm repository:

## Add REPO  
helm repo add mellanox https://mellanox.github.io/network-operator \
  && helm repo update            

Create the values.yaml file to customize the Network Operator deployment (example):

nfd:
  enabled: true
 
sriovNetworkOperator:
  enabled: true

ofedDriver:
  deploy: false
nvPeerDriver:
  deploy: false
rdmaSharedDevicePlugin:
  deploy: false
sriovDevicePlugin:
  deploy: false
 
deployCR: true
secondaryNetwork:
  deploy: true
  cniPlugins:
    deploy: true
  multus:
    deploy: true
  ipamPlugin:
    deploy: true           

Deploy the operator:

helm install -f ./values.yaml -n network-operator --create-namespace --wait mellanox/network-operator --generate-name
 
helm ls -n network-operator 
NAME                       	NAMESPACE       	REVISION	UPDATED                                	STATUS  	CHART                 	APP VERSION
network-operator-1648457278	network-operator	1       	2022-03-28 08:47:59.548667592 +0000 UTC	deployed	network-operator-1.1.0	v1.1.0   

Once the Helm chart is installed, check the status of the pods to ensure all the containers are running:

## PODs status in namespace - network-operator

kubectl -n network-operator get pods -o wide
NAME                                                              READY   STATUS    RESTARTS        AGE   IP               NODE    NOMINATED NODE   READINESS GATES
network-operator-1648457278-5885dbfff5-wjgsc                      1/1     Running   0                5m   10.233.90.15     node1   <none>           <none>
network-operator-1648457278-node-feature-discovery-master-zbcx8   1/1     Running   0                5m   10.233.90.16     node1   <none>           <none>
network-operator-1648457278-node-feature-discovery-worker-kk4qs   1/1     Running   0                5m   10.233.90.18     node1   <none>           <none>
network-operator-1648457278-node-feature-discovery-worker-n44b6   1/1     Running   0                5m   10.233.92.221    node3   <none>           <none>
network-operator-1648457278-node-feature-discovery-worker-xhzfw   1/1     Running   0                5m   10.233.96.233    node2   <none>           <none>
network-operator-1648457278-sriov-network-operator-5cd4bdb6mm9f   1/1     Running   0                5m   10.233.90.21     node1   <none>           <none>
sriov-device-plugin-cxnrl                                         1/1     Running   0                5m   192.168.100.34   node2   <none>           <none>
sriov-device-plugin-djlmn                                         1/1     Running   0                5m   192.168.100.39   node3   <none>           <none>
sriov-network-config-daemon-rgfvk                                 3/3     Running   0                5m   192.168.100.39   node3   <none>           <none>
sriov-network-config-daemon-zzchs                                 3/3     Running   0                5m   192.168.100.34   node2   <none>           <none>

## PODs status in namespace - nvidia-network-operator-resources 

kubectl -n nvidia-network-operator-resources get pods -o wide
NAME                   READY   STATUS    RESTARTS       AGE   IP               NODE    NOMINATED NODE   READINESS GATES
cni-plugins-ds-snf6x   1/1     Running   0               5m   192.168.100.39   node3   <none>           <none>
cni-plugins-ds-zjb27   1/1     Running   0               5m   192.168.100.34   node2   <none>           <none>
kube-multus-ds-mz7nd   1/1     Running   0               5m   192.168.100.39   node3   <none>           <none>
kube-multus-ds-xjxgd   1/1     Running   0               5m   192.168.100.34   node2   <none>           <none>
whereabouts-jgt24      1/1     Running   0               5m   192.168.100.34   node2   <none>           <none>
whereabouts-sphx4      1/1     Running   0               5m   192.168.100.39   node3   <none>           <none>
     

High-Speed Network Configuration

After installing the operator, check the SriovNetworkNodeState CRs to see all the SR-IOV-enabled devices on your nodes.
In this deployment, the network interface named enp57s0f0 was chosen.

To review the interface status, please use the following command:

NICs status
## NIC status 
kubectl -n network-operator get sriovnetworknodestates.sriovnetwork.openshift.io node2 -o yaml  

...
status:
  interfaces:
  - deviceID: 101d
    driver: mlx5_core
    eSwitchMode: legacy
    linkSpeed: 100000 Mb/s
    linkType: ETH
    mac: 0c:42:a1:2b:73:fa
    mtu: 9000
    name: enp57s0f0
    numVfs: 8
    pciAddress: "0000:39:00.0"
    totalvfs: 8
    vendor: 15b3
  - deviceID: 101d
    driver: mlx5_core
...  

Create a SriovNetworkNodePolicy CR for the chosen network interface (the policy.yaml file) by specifying the chosen interface in the 'nicSelector'.

According to the application design, VF0 is allocated into a separate pool (timepool) from the rest of the VFs (rdmapool):

policy.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics-sw1
  namespace: network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/custom-rdma.capable: "true"
  resourceName: timepool
  priority: 99
  mtu: 9000
  numVfs: 8
  nicSelector:
    pfNames: [ "enp57s0f0#0-0" ]
  deviceType: netdevice
  isRdma: true

---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics-sw2
  namespace: network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/custom-rdma.capable: "true"
  resourceName: rdmapool
  priority: 99
  mtu: 9000
  numVfs: 8
  nicSelector:
    pfNames: [ "enp57s0f0#1-7" ]
  deviceType: netdevice
  isRdma: true

Deploy policy.yaml:

kubectl apply -f policy.yaml
sriovnetworknodepolicy.sriovnetwork.openshift.io/mlnxnics-sw1 created
sriovnetworknodepolicy.sriovnetwork.openshift.io/mlnxnics-sw2 created
This step may take a while, depending on the number of K8s Worker Nodes that need to apply the configuration and the number of VFs for each selected network interface.

Create SriovNetwork CRs for the chosen network interface (the network.yaml file), which refer to the 'resourceName' defined in the SriovNetworkNodePolicy.
The example below creates:

  • timenet - K8s network for PTP time synchronization
  • rdmanet - K8s network with dynamic IPAM
  • rdmanet-static - K8s network with static IPAM
network.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: timenet
  namespace: network-operator
spec:
  ipam: |
    {
         "datastore": "kubernetes",
         "kubernetes": {"kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"},
         "log_file": "/tmp/whereabouts.log",
         "log_level": "debug",
         "type": "whereabouts",
         "range": "172.20.0.0/24",
         "exclude": [ "172.20.0.1/32" ]
    }
  networkNamespace: default
  resourceName: timepool
  trust: "on"
  vlan: 0

---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rdmanet
  namespace: network-operator
spec:
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": { "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig" },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.102.0/24",
      "exclude": [ "192.168.102.254/32", "192.168.102.253/32" ]
    }
  networkNamespace: default
  resourceName: rdmapool
  vlan: 2

---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rdmanet-static
  namespace: network-operator
spec:
  ipam: |
    {
      "type": "static"
    }
  networkNamespace: default
  resourceName: rdmapool
  vlan: 2

Deploy network.yaml:

kubectl apply -f network.yaml
sriovnetwork.sriovnetwork.openshift.io/timenet created
sriovnetwork.sriovnetwork.openshift.io/rdmanet created
sriovnetwork.sriovnetwork.openshift.io/rdmanet-static created
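
For each SriovNetwork CR, the operator creates a matching NetworkAttachmentDefinition in the configured networkNamespace (default in this example). As an optional check, verify that timenet, rdmanet and rdmanet-static are listed:

Master Node console
$ kubectl -n default get network-attachment-definitions.k8s.cni.cncf.io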

Manage HugePages

Kubernetes supports the allocation and consumption of pre-allocated HugePages by applications in a Pod. The nodes automatically discover and report all HugePages resources as schedulable resources. For additional information about HugePages management in K8s, please refer to the Kubernetes documentation.

To allocate HugePages, modify the GRUB_CMDLINE_LINUX_DEFAULT parameter in /etc/default/grub. The setting below allocates 2MB * 8192 pages = 16GB of HugePages at boot time:

/etc/default/grub
...

GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=2M hugepagesz=2M hugepages=8192"

...

Run update-grub to apply the configuration and reboot the server:

Worker Node console
# update-grub
# reboot
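
Optionally, the allocation can be verified directly on the Worker Node after the reboot with a standard Linux check; HugePages_Total should report 8192 pages of 2048 kB:

Worker Node console
# grep Huge /proc/meminfo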

After the server comes back up, check the HugePages allocation from the Master Node with the following command:

Master Node console
# kubectl describe nodes node2
...
Capacity:
  cpu:                  48
  ephemeral-storage:    459923528Ki
  hugepages-1Gi:        0
  hugepages-2Mi:        16Gi
  memory:               264050900Ki
  nvidia.com/gpu:       2
  nvidia.com/rdmapool:  7
  nvidia.com/timepool:  1
  pods:                 110
Allocatable:
  cpu:                  46
  ephemeral-storage:    423865522704
  hugepages-1Gi:        0
  hugepages-2Mi:        16Gi
  memory:               246909140Ki
  nvidia.com/gpu:       2
  nvidia.com/rdmapool:  7
  nvidia.com/timepool:  1
  pods:                 110
...

Enable CPU and Topology Management

CPU Manager manages groups of CPUs and constrains workloads to specific CPUs.

CPU Manager is useful for workloads that have some of these attributes:

  • Require as much CPU time as possible
  • Are sensitive to processor cache misses
  • Are low-latency network applications
  • Coordinate with other processes and benefit from sharing a single processor cache

Topology Manager uses topology information from collected hints to decide if a pod can be accepted or rejected on a node, based on the configured Topology Manager policy and Pod resources requested. In order to extract the best performance, optimizations related to CPU isolation and memory and device locality are required.

Topology Manager is useful for workloads that use hardware accelerators to support latency-critical execution and high throughput parallel computation.

To use Topology Manager, CPU Manager with static policy must be used.

For additional information, please refer to Control CPU Management Policies on the Node and Control Topology Management Policies on a Node in the Kubernetes documentation.

To enable the CPU Manager and Topology Manager, add the following lines to the kubelet configuration file /etc/kubernetes/kubelet-config.yaml:

/etc/kubernetes/kubelet-config.yaml
...
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 10s
topologyManagerPolicy: single-numa-node
featureGates:
  CPUManager: true
  TopologyManager: true

Due to the change in cpuManagerPolicy, remove /var/lib/kubelet/cpu_manager_state and restart the kubelet service on each affected K8s Worker Node.

Worker Node console
# rm -f /var/lib/kubelet/cpu_manager_state
# service kubelet restart
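
To confirm that the static policy took effect, inspect the regenerated state file on the Worker Node; it should report "policyName":"static" (the CPU set and checksum values will differ per node):

Worker Node console
# cat /var/lib/kubelet/cpu_manager_state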

Application

This section provides the K8s-specific components and YAML configuration files used to deploy the Rivermax applications in the K8s cluster.

For proper application execution, a Rivermax license is required. To obtain a license, please refer to the Rivermax License Generation Guidelines.
To download the Rivermax apps container images from the container repository and the application pipeline, you need to register and log in to the Rivermax portal by clicking 'Get Started'.

Rivermax License

Upload the Rivermax license as a ConfigMap in the K8s cluster:

kubectl create configmap rivermax-config --from-file=rivermax.lic=./rivermax.lic
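
Optionally, verify that the license content was stored as expected:

Master Node console
$ kubectl describe configmap rivermax-config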

Media Node Application

This pod definition contains the Rivermax Media Node together with an implementation of the AMWA Networked Media Open Specifications (NMOS) node. For more information about AMWA, NMOS and the Networked Media Incubator, please refer to http://amwa.tv/. For more information about the Rivermax SDK, please refer to https://developer.nvidia.com/networking/Rivermax.
The YAML configuration file for the Media Node deployment is provided below. Please fill in your container image name and your registry secret.

apiVersion: v1
kind: ConfigMap
metadata:
  name: river-config
data:
  container-config: |-
    #media_node JSON file to run
    config_json=/var/home/config.json
    #Output registry stdout/stderr output to a log inside container
    log_output=FALSE
    #Update/insert label parameter with container hostname on entrypoint script run
    update_label=TRUE
    #Allow these network interfaces in /etc/avahi/avahi-daemon.conf
    allow_interfaces=net1

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: "mnc"
  labels:
    app: rivermax
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rivermax
  template:
    metadata:
      labels:
        app: rivermax
      annotations:
        k8s.v1.cni.cncf.io/networks: rdmanet
    spec:
      containers:
      - command:
        image: < media node container image >
        name: "medianode"
        env:
          - name: DISPLAY
            value: "192.168.102.253:0.0"
        resources:
          requests:
            nvidia.com/rdmapool: 1
            hugepages-2Mi: 4Gi
            memory: 8Gi
            cpu: 4
          limits:
            nvidia.com/rdmapool: 1
            hugepages-2Mi: 4Gi
            memory: 8Gi
            cpu: 4
        securityContext:
          capabilities:
            add: [ "IPC_LOCK", "SYS_RESOURCE", "NET_RAW","NET_ADMIN" ]
        volumeMounts:
        - name: config
          mountPath: /var/home/ext/
        - name: licconfig
          mountPath: /opt/mellanox/rivermax/
        - mountPath: /hugepages
          name: hugepage
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: config
        configMap:
          name: river-config
      - name: licconfig
        configMap:
          name: rivermax-config
      - name: hugepage
        emptyDir:
          medium: HugePages
      - name: dshm
        emptyDir: {
          medium: 'Memory',
          sizeLimit: '4Gi'
          }
      imagePullSecrets:
      - name: < Container registry secret >   
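
To deploy the Media Node, save the manifest above (the file name medianode.yaml below is only an example) and apply it, then verify that the replicas are running on the Worker Nodes:

Master Node console
$ kubectl apply -f medianode.yaml
$ kubectl get pods -l app=rivermax -o wide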

NMOS Controller

An AMWA NMOS Controller is a device that can interact with the NMOS APIs, which are a family of open specifications for networked media for professional applications. An NMOS Controller can discover, register, connect and manage media devices on an IP infrastructure using common methods and protocols. It can also handle event and tally, audio channel mapping, authorization and other functions that are part of the NMOS roadmap. For more information, please refer to the project's README.md.

apiVersion: v1
kind: Pod
metadata:
  name: nmos-cpp
  labels:
    app.kubernetes.io/name: nmos
  annotations:
    k8s.v1.cni.cncf.io/networks: |
          [
            { "name": "rdmanet-static", 
              "ips": [ "192.168.102.254/24" ] 
            }
          ]
spec:
  containers:
  - name: nmos-pod
    image: docker.io/rhastie/nmos-cpp:latest
    env:
    - name: RUN_NODE
      value: "true"
    resources:
      requests:
        cpu: 2
        memory: 1Gi
        nvidia.com/rdmapool: 1
      limits:
        cpu: 2
        memory: 1Gi
        nvidia.com/rdmapool: 1
    ports:
      - containerPort: 8010
        name: port-8010
      - containerPort: 8011
        name: port-8011
      - containerPort: 11000
        name: port-11000
      - containerPort: 11001
        name: port-11001
      - containerPort: 1883
        name: port-1883 
      - containerPort: 5353
        name: port-5353
        protocol: UDP
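
To deploy the NMOS controller, save the manifest above (the file name nmos.yaml is only an example) and apply it. The pod description should show the static secondary IP 192.168.102.254 requested in the network annotation:

Master Node console
$ kubectl apply -f nmos.yaml
$ kubectl get pod nmos-cpp -o wide
$ kubectl describe pod nmos-cpp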

DeepStream Media Gateway

One of the applications of the DeepStream SDK is to encode raw data into an SRT stream. This application can capture video frames from a camera or a file, encode them using the H.264 or H.265 codec, and send them over the network using the SRT protocol. SRT stands for Secure Reliable Transport, a low-latency and secure streaming technology. This application can be useful for scenarios such as remote surveillance, live broadcasting, or video conferencing.
The YAML configuration file for the Media Gateway deployment is provided below. Please fill in your container image name and your registry secret.

apiVersion: v1
kind: Pod
metadata:
  name: ds-rmax
  labels:
    name: dsrmax-app
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmanet
spec:
  containers:
  - name: dsrmax
    image: < DeepStream media gateway container image >
    command:
      - sh
      - -c
      - sleep inf
    env:
      - name: DISPLAY
        value: "192.168.102.253:0.0"
    ports:
      - containerPort: 7001
        name: udp-port        
    securityContext:
      capabilities:
        add: [ "IPC_LOCK", "SYS_RESOURCE", "NET_RAW","NET_ADMIN"]        
    resources:
      requests:
        nvidia.com/rdmapool: 1
        nvidia.com/gpu: 1
        hugepages-2Mi: 2Gi
        memory: 8Gi
        cpu: 8
      limits:
        nvidia.com/rdmapool: 1
        nvidia.com/gpu: 1
        hugepages-2Mi: 2Gi
        memory: 8Gi
        cpu: 8
    volumeMounts:
    - name: config
      mountPath: /var/home/ext/
    - name: licconfig
      mountPath: /opt/mellanox/rivermax/
    - mountPath: /hugepages
      name: hugepage
    - mountPath: /dev/shm
      name: dshm
  volumes:
  - name: config
    configMap:
      name: river-config
  - name: licconfig
    configMap:
      name: rivermax-config
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: dshm
    emptyDir: {
      medium: 'Memory',
      sizeLimit: '4Gi'
      }
  imagePullSecrets:
  - name: < Container registry secret >
---
apiVersion: v1
kind: Service
metadata:
  name: rmax-service
spec:
  type: NodePort
  selector:
    name: dsrmax-app
  ports:
      # By default and for convenience, the `targetPort` is set to the same value as the `port` field.
    - port: 7001
      name: udp-port
      protocol: UDP
      targetPort: 7001
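
To deploy the Media Gateway, save the manifest above (the file name ds-rmax.yaml is only an example) and apply it. Since rmax-service is of type NodePort, the node port mapped to UDP port 7001 is shown in the service output:

Master Node console
$ kubectl apply -f ds-rmax.yaml
$ kubectl get pod ds-rmax -o wide
$ kubectl get svc rmax-service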
    

VNC container with GUI

This pod definition allows you to access a web VNC interface with the Ubuntu LXDE/LXQT desktop environment inside the Kubernetes cluster. It uses an interface on the K8s secondary network to manage applications via a GUI on your cluster nodes.
The YAML configuration file for the VNC deployment is provided below. Please fill in your container image name.

An example of this application can be found on GitHub (theasp/docker-novnc: noVNC Display Container for Docker), but you can create your own container image.

apiVersion: v1
kind: Pod
metadata:
  name: ub-vnc  
  labels:
    name: ubuntu-vnc
  annotations:
    k8s.v1.cni.cncf.io/networks: |
       [
         { "name": "rdmanet-static",
           "ips": [ "192.168.102.253/24" ]
         }
       ]
spec:
  volumes:                          
    - name: dshm
      emptyDir:
        medium: Memory
  containers:
    - image: < NOVNC container image >
      name: vnc-container
      resources:
        limits:
          cpu: 4 
          memory: 8Gi
          nvidia.com/rdmapool: 1
      env:
        - name: DISPLAY_WIDTH
          value: "1920"
        - name: DISPLAY_HEIGHT
          value: "1080"
        - name: RUN_XTERM
          value: "yes"
        - name: RUN_FLUXBOX
          value: "yes"
      ports:
        - containerPort: 8080
          name: http-port
      volumeMounts:                 
        - mountPath: /dev/shm
          name: dshm          

---
apiVersion: v1
kind: Service
metadata:
  name: vnc-service
spec:
  type: NodePort
  selector:
    name: ubuntu-vnc
  ports:
    - port: 8080
      name: http-port
      targetPort: 8080
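
To deploy the VNC application, save the manifest above (the file name vnc.yaml is only an example) and apply it. The NodePort assigned to vnc-service can then be used to open the noVNC web interface from a browser on the management network:

Master Node console
$ kubectl apply -f vnc.yaml
$ kubectl get svc vnc-service

Browse to http://<any K8s node management IP>:<assigned NodePort> to reach the noVNC desktop.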




Authors

Vitaliy Razinkov

Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for the research and design of complex Kubernetes/OpenShift and leading Microsoft solutions. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference design guides on Microsoft technologies, RoCE/RDMA-accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website.



Gareth Sylvester-Bradley

Gareth Sylvester-Bradley is a Principal Engineer at NVIDIA and currently serves as the chair of the Networked Media Open Specifications (NMOS) Architecture Review group in the Advanced Media Workflow Association (AMWA). He is focused on building software toolkits and agile, collaborative industry specifications to deliver open, software-defined, hardware-accelerated media workflows for broadcast, live production, medical imaging, industrial video, and more.































Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. Neither NVIDIA Corporation nor any of its direct or indirect subsidiaries and affiliates (collectively: “NVIDIA”) make any representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

Trademarks
NVIDIA, the NVIDIA logo, and Mellanox are trademarks and/or registered trademarks of NVIDIA Corporation and/or Mellanox Technologies Ltd. in the U.S. and in other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright
© 2023 NVIDIA Corporation & affiliates. All Rights Reserved.