Deployment Steps

Deploying GDS with NVIDIA AI Enterprise on Kubernetes involves three high-level steps:

1. Installing Kubernetes

2. Installing the NVIDIA Network Operator

3. Installing the NVIDIA GPU Operator with GDS Support

This section provides further details on each step.

Creating GDS Enabled Kubernetes

Install Kubernetes

To install upstream Kubernetes on your system, follow the instructions at https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#ubuntu-lts. Use containerd as the container engine.

Install Network Operator

For GDS to function on local NVMe drives or NFSoRDMA shares, the Network Operator must be installed. The operator patches the nvme and rpcrdma kernel modules with additional symbols that GDS operations rely on. To install it, follow the Installing NVIDIA Network Operator instructions. That guide also describes additional parameters for the Helm-based installation.

Note

It can take up to 10 minutes for the Network Operator deployment to finish installing the MOFED driver.
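Rather than polling by hand during those 10 minutes, the wait can be scripted. The following is a minimal sketch, not part of the official procedure: wait_for_pods is a hypothetical helper name, and the namespace is an argument because it depends on how the Network Operator was installed. The default timeout of 600s mirrors the 10-minute figure in the note.

```shell
# Sketch only: block until every pod in a namespace reports Ready.
# wait_for_pods is a hypothetical helper; pass the namespace that your
# Network Operator installation actually uses (it is not assumed here).
wait_for_pods() {
    namespace="$1"
    timeout="${2:-600s}"   # 600s matches the "up to 10 minutes" note
    kubectl wait --for=condition=Ready pod --all \
        -n "$namespace" --timeout="$timeout"
}

# Example call (commented out so the snippet has no side effects):
# wait_for_pods <network-operator-namespace>
```

Pods that run to completion never report Ready, so this command may time out in a namespace that contains such pods. In that case, inspect the pod list directly.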

Install GPU Operator with GDS support

On a system with a clean install of Ubuntu, completely uninstall the NVIDIA GPU driver and NVIDIA container runtime from the host, and install the GPU Operator instead, which runs the NVIDIA GPU driver in a Kubernetes pod. The reason is that when the driver is installed on the bare-metal system in the classic way, you need to schedule your hardware resources for each container or pod. For example, if multiple workloads run in multiple containers but you have only one system with one GPU, you cannot schedule that single GPU as a resource for every running container.

The GPU Operator includes a discovery DaemonSet that discovers GPUs on the host and on other connected Kubernetes nodes, and labels them as a resource on the node where they were found. Each node with labeled GPUs then receives the NVIDIA GPU driver DaemonSet. The driver is thus containerized on each node and is also exposed to every other running container on that node. A device plugin DaemonSet is also installed, which allows the labeled GPUs on each node to be scheduled for other application containers.
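Once the operator is running, the effect of the device plugin can be observed as an allocatable resource on each node. The sketch below assumes the resource is advertised under the name nvidia.com/gpu, which is what the NVIDIA device plugin uses by default; list_gpu_capacity is a hypothetical helper name introduced for illustration.

```shell
# Sketch only: show how many GPUs each node advertises to the scheduler.
# Assumes the device plugin exposes GPUs as the "nvidia.com/gpu" resource.
# list_gpu_capacity is a hypothetical helper name.
list_gpu_capacity() {
    # The dot in "nvidia.com" is escaped so it is read as part of the
    # resource name rather than as a field separator.
    kubectl get nodes \
        -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# Example call (commented out so the snippet has no side effects):
# list_gpu_capacity
```

Nodes without a labeled GPU are expected to show `<none>` in the GPUS column.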

The GPU Operator can be installed on an NVAIE 3.0 compatible system by following the instructions. The installation requires an NLS token, which can be obtained from the NVIDIA Licensing Portal.

To install with GDS support, run the Helm install command as follows:

helm install --wait gpu-operator nvaie/gpu-operator-3-0 -n gpu-operator \
    --set driver.repository=nvcr.io/nvidia \
    --set driver.image=driver \
    --set driver.version=525.60.13 \
    --set driver.licensingConfig.config.name="" \
    --set gds.enabled=true \
    --set driver.rdma.enabled=true

You can verify that the GPU Operator is installed and running with the following command:

kubectl get pods -n gpu-operator

You should see output similar to the following:

NAME                                                             READY STATUS    RESTARTS  AGE
gpu-feature-discovery-jgcph                                      1/1   Running   0         7h33m
gpu-operator-1663780918-node-feature-discovery-master-6c97sl7gv  1/1   Running   0         15d
gpu-operator-1663780918-node-feature-discovery-worker-6xkjg      1/1   Running   0         15d
gpu-operator-6764d9bc9b-8sm2c                                    1/1   Running   0         15d
nvidia-container-toolkit-daemonset-w4bvg                         1/1   Running   0         7h33m
nvidia-cuda-validator-692qk                                      0/1   Completed 0         7h19m
nvidia-dcgm-exporter-mxmr6                                       1/1   Running   0         7h33m
nvidia-device-plugin-daemonset-2r76s                             1/1   Running   0         7h33m
nvidia-device-plugin-validator-hjld4                             0/1   Completed 0         7h18m
nvidia-driver-daemonset-8kmwc                                    3/3   Running   0         7h21m
nvidia-operator-validator-7hcbv                                  1/1   Running   0         7h21m
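In the listing above, every pod is either Running or Completed. That check can be automated with a small filter over the same output. This is a minimal sketch: unhealthy_pods is a hypothetical helper name, and treating only Running and Completed as healthy is an assumption based on the sample output.

```shell
# Sketch only: read `kubectl get pods` table output on stdin and print the
# name of every pod whose STATUS is neither Running nor Completed.
# Exits non-zero if at least one such pod is found.
# unhealthy_pods is a hypothetical helper name.
unhealthy_pods() {
    awk 'NF >= 3 && $1 != "NAME" && $3 != "Running" && $3 != "Completed" {
             print $1
             bad = 1
         }
         END { exit bad + 0 }'
}

# Example call (commented out so the snippet has no side effects):
# kubectl get pods -n gpu-operator | unhealthy_pods && echo "all pods healthy"
```

The filter reads from standard input, so it can also be run against saved output when debugging a node after the fact.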

You can verify that GDS is ready on the system with the following command:

lsmod | grep nvidia

You should see the nvidia-fs module, listed as nvidia_fs, in the output:

nvidia_fs 249856 0
nvidia_peermem 16384 0
nvidia_modeset 1171456 0
nvidia_uvm 1191936 0
nvidia 55463936 132 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_core 348160 10 rdma_cm,ib_ipoib,rpcrdma,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
drm 491520 7 drm_kms_helper,drm_vram_helper,ast,nvidia,ttm
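Note that grep nvidia also matches lines such as ib_core and drm, because nvidia appears in their dependency column. For a scripted check, matching on the module name column alone is more precise. The following is a minimal sketch; has_module is a hypothetical helper name and not part of any NVIDIA tool.

```shell
# Sketch only: read lsmod-style text on stdin and succeed only if a module
# with exactly the given name appears in the first column.
# has_module is a hypothetical helper name.
has_module() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

# Example call (commented out so the snippet has no side effects):
# lsmod | has_module nvidia_fs && echo "nvidia_fs loaded" || echo "nvidia_fs missing"
```

Because only the first column is compared, a module that merely depends on nvidia_fs or another NVIDIA module does not produce a false positive.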