Deployment Steps


Deploying GDS with NVIDIA AI Enterprise on Kubernetes involves three high-level steps:

  • Installing Kubernetes

  • Installing the NVIDIA Network Operator

  • Installing the NVIDIA GPU Operator with GDS Support

This section provides further details on each step.

Install Kubernetes

To install upstream Kubernetes on your system, follow the upstream Kubernetes installation instructions. Use containerd as the container engine.
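
Once the cluster is up, you can confirm that containerd is the active container runtime on each node. This is a minimal check; the node names and versions in your output will differ.


kubectl get nodes -o wide
# The CONTAINER-RUNTIME column should report containerd://<version> on every node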

Install Network Operator

For GDS to function on local NVMe devices or NFSoRDMA shares, the Network Operator must be installed. The operator patches the nvme and rpcrdma kernel modules with additional symbols that facilitate GDS operations. Installation can be done by following the Installing NVIDIA Network Operator instructions, which also cover the additional parameters for the Helm-based installation.
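
As an illustrative sketch, a Helm-based Network Operator installation typically looks like the following; the chart location, release name, and namespace below are taken from the public Network Operator documentation and may differ for an NVIDIA AI Enterprise install.


helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Release name and namespace are illustrative; consult the Network Operator guide for NVAIE-specific values
helm install network-operator nvidia/network-operator -n nvidia-network-operator --create-namespace --wait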


It can take up to 10 minutes for the Network Operator deployment to finish installing the MOFED driver.
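
You can watch the Network Operator pods while the driver compiles and loads; the namespace below assumes the default used in the sketch above.


kubectl get pods -n nvidia-network-operator
# Proceed once the MOFED driver pod reports Running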

Install GPU Operator with GDS Support

On a fresh system with a clean install of Ubuntu, completely uninstall any NVIDIA GPU driver and NVIDIA container runtime from the host, and instead install the GPU Operator, which runs the NVIDIA GPU driver in a Kubernetes pod. The reason for doing this is that, classically, when the driver is installed on the bare metal system you must schedule your hardware resources for each container/pod yourself. For example, if you have multiple workloads running in multiple containers but only one system with one GPU, you cannot schedule that single GPU as a resource to every running container. The GPU Operator deploys a discovery DaemonSet that discovers GPUs on the host and the other connected Kubernetes nodes and labels the GPUs as a resource on whichever node they were found. Each node with labeled GPUs then has the NVIDIA GPU driver DaemonSet installed; the driver is thus containerized on each node and exposed to every other running container on that node. A device plugin DaemonSet is also installed, which allows the labeled GPUs on nodes to be scheduled into application containers.
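
To illustrate how the device plugin exposes labeled GPUs to the scheduler, a pod can request them as an ordinary resource. This is a generic sketch, not part of the GDS deployment itself; the pod name and CUDA image tag are assumptions and can be replaced with any GPU-capable image.


cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test        # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed image; any CUDA image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # scheduled against a GPU advertised by the device plugin
EOF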

GPU Operator can be installed on an NVAIE 3.0 compatible system by following the NVIDIA AI Enterprise GPU Operator installation instructions. The install requires an NLS client configuration token, which can be obtained from the NVIDIA Licensing Portal.
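
A hedged sketch of providing the NLS token to the operator is shown below; the ConfigMap name and token file name are assumptions, so use the names required by your GPU Operator version and the linked instructions.


kubectl create namespace gpu-operator
# ConfigMap name and token file name are illustrative assumptions
kubectl create configmap licensing-config -n gpu-operator --from-file=./client_configuration_token.tok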

To install with GDS support, run the Helm install command as follows:


helm install --wait gpu-operator nvaie/gpu-operator-3-0 -n gpu-operator \
  --set driver.image=driver \
  --set driver.version=525.60.13 \
  --set gds.enabled=true \
  --set driver.rdma.enabled=true
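
After the install completes, you can confirm that the GDS-related values were applied to the release; the output should include gds.enabled: true and driver.rdma.enabled: true.


helm get values gpu-operator -n gpu-operator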

You can verify that the GPU operator is installed and running by issuing the following command:


kubectl get pods -n gpu-operator

You should see similar output:


NAME                                                               READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-jgcph                                        1/1     Running     0          7h33m
gpu-operator-1663780918-node-feature-discovery-master-6c97sl7gv   1/1     Running     0          15d
gpu-operator-1663780918-node-feature-discovery-worker-6xkjg        1/1     Running     0          15d
gpu-operator-6764d9bc9b-8sm2c                                      1/1     Running     0          15d
nvidia-container-toolkit-daemonset-w4bvg                           1/1     Running     0          7h33m
nvidia-cuda-validator-692qk                                        0/1     Completed   0          7h19m
nvidia-dcgm-exporter-mxmr6                                         1/1     Running     0          7h33m
nvidia-device-plugin-daemonset-2r76s                               1/1     Running     0          7h33m
nvidia-device-plugin-validator-hjld4                               0/1     Completed   0          7h18m
nvidia-driver-daemonset-8kmwc                                      3/3     Running     0          7h21m
nvidia-operator-validator-7hcbv                                    1/1     Running     0          7h21m

You can verify that GDS is ready on the system with the following command:


lsmod | grep nvidia

You should see nvidia_fs in the output:


nvidia_fs             249856  0
nvidia_peermem         16384  0
nvidia_modeset       1171456  0
nvidia_uvm           1191936  0
nvidia              55463936  132 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_core               348160  10 rdma_cm,ib_ipoib,rpcrdma,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
drm                   491520  7 drm_kms_helper,drm_vram_helper,ast,nvidia,ttm
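
As an additional check, the nvidia_fs driver exposes GDS statistics under /proc on the host once the module is loaded; the exact counters listed will vary by driver version.


cat /proc/driver/nvidia-fs/stats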
