Deployment Steps

Deploying GDS with NVIDIA AI Enterprise on Kubernetes involves three high-level steps:

  • Installing Kubernetes

  • Installing the NVIDIA Network Operator

  • Installing the NVIDIA GPU Operator with GDS Support

This section provides further details on each step.

Creating GDS Enabled Kubernetes

Install Kubernetes

To install upstream Kubernetes on your system, follow these instructions: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#ubuntu-lts. Use containerd as the container engine.

Install Network Operator

For GDS to function on local NVMe drives or NFSoRDMA shares, the Network Operator must be installed. The operator patches the nvme and rpcrdma modules with additional symbols that facilitate GDS operations. To install it, follow the Installing NVIDIA Network Operator instructions. That guide also documents additional parameters for the Helm-based installation.

Note

It can take up to 10 minutes for the Network Operator deployment to finish installing the MOFED driver.
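Because the driver installation can take several minutes, a scripted deployment may want to poll rather than sleep for a fixed time. The following is a minimal sketch of a generic polling helper; `wait_until` is a hypothetical name introduced here for illustration, not part of kubectl or the Network Operator.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or the timeout
# elapses. Returns 0 on success and 1 on timeout.
wait_until() {
    local timeout="$1" interval="$2"
    shift 2
    local waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

As a usage example, `wait_until 600 15 kubectl wait --for=condition=Ready pods --all -n <namespace> --timeout=5s` would retry for about 10 minutes, where `<namespace>` is a placeholder for whichever namespace the Network Operator was installed into. Wrapping `kubectl wait` in a retry loop is a design choice: `kubectl wait` on its own reports an error if the pods have not been created yet.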

Install GPU Operator with GDS support

On a fresh system with a clean install of Ubuntu, you must completely uninstall the NVIDIA GPU driver and NVIDIA container runtime from the host and install the GPU Operator instead, which houses the NVIDIA GPU driver in a Kubernetes pod.

The reason is that, classically, when the driver is installed on the bare-metal system, you need to schedule your hardware resources for each container or pod yourself. For example, if you have multiple workloads running in multiple containers but only one system with one GPU, you cannot schedule that single GPU as a resource to every running container.

The GPU Operator houses a discovery DaemonSet that discovers GPUs on the host and on other connected Kubernetes nodes, and labels the GPUs as a resource on whichever node they were found on. Each node with labeled GPUs then has the NVIDIA GPU driver DaemonSet installed. The driver is thus containerized on each node and is also exposed to every other running container on that node. A device plugin DaemonSet is also installed, which allows the labeled GPUs on each node to be scheduled for other application containers.
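To illustrate how an application container consumes a GPU advertised by the device plugin, the sketch below prints a Pod manifest that requests GPUs through the `nvidia.com/gpu` extended resource. The helper name `gpu_pod_manifest`, the pod name, and the container image are all assumptions made for illustration; substitute an image that is available in your environment.

```shell
# gpu_pod_manifest NAME IMAGE [GPU_COUNT]
# Hypothetical helper: prints a Pod manifest that asks the scheduler for
# GPU_COUNT GPUs (default 1) via the nvidia.com/gpu resource limit.
gpu_pod_manifest() {
    local name="$1" image="$2" gpus="${3:-1}"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: Never
  containers:
  - name: ${name}
    image: ${image}
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: ${gpus}
EOF
}
```

For example, `gpu_pod_manifest gpu-smoke-test <cuda-image> 1 | kubectl apply -f -` would submit the pod, where `<cuda-image>` is a placeholder. The pod stays Pending if no node has an unallocated GPU, which is the scheduling behavior the paragraph above describes.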

The GPU Operator can be installed on an NVAIE 3.0-compatible system by following the instructions. The installation requires an NLS token, which can be obtained from the NVIDIA Licensing Portal.

To install with GDS support, run the Helm install command as follows:

helm install --wait gpu-operator nvaie/gpu-operator-3-0 -n gpu-operator \
    --set driver.repository=nvcr.io/nvidia \
    --set driver.image=driver \
    --set driver.version=525.60.13 \
    --set driver.licensingConfig.config.name="" \
    --set gds.enabled=true \
    --set driver.rdma.enabled=true

You can verify that the GPU Operator is installed and running by issuing the following command:

kubectl get pods -n gpu-operator

You should see output similar to the following:

NAME                                                             READY STATUS    RESTARTS  AGE
gpu-feature-discovery-jgcph                                      1/1   Running   0         7h33m
gpu-operator-1663780918-node-feature-discovery-master-6c97sl7gv  1/1   Running   0         15d
gpu-operator-1663780918-node-feature-discovery-worker-6xkjg      1/1   Running   0         15d
gpu-operator-6764d9bc9b-8sm2c                                    1/1   Running   0         15d
nvidia-container-toolkit-daemonset-w4bvg                         1/1   Running   0         7h33m
nvidia-cuda-validator-692qk                                      0/1   Completed 0         7h19m
nvidia-dcgm-exporter-mxmr6                                       1/1   Running   0         7h33m
nvidia-device-plugin-daemonset-2r76s                             1/1   Running   0         7h33m
nvidia-device-plugin-validator-hjld4                             0/1   Completed 0         7h18m
nvidia-driver-daemonset-8kmwc                                    3/3   Running   0         7h21m
nvidia-operator-validator-7hcbv                                  1/1   Running   0         7h21m
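Reading the pod table by eye works for one node, but the check can also be scripted. The sketch below is one possible approach, not an official tool: a hypothetical `check_pod_status` function that reads `kubectl get pods` output on stdin and flags any pod that is neither Completed nor Running with all of its containers ready. It assumes the default column order (NAME, READY, STATUS) shown above.

```shell
# check_pod_status: hypothetical helper, reads a `kubectl get pods` table
# on stdin. Prints one line per unhealthy pod and exits non-zero if any
# pod is found that is not Completed or fully-ready Running.
check_pod_status() {
    awk '
        BEGIN { bad = 0 }
        NR == 1 || NF == 0 { next }            # skip the header row and blank lines
        $3 == "Completed" { next }             # finished validator pods are expected
        $3 == "Running" {
            split($2, ready, "/")              # READY looks like "3/3"
            if (ready[1] != ready[2]) {
                print $1 " is Running but only " $2 " containers are ready"
                bad = 1
            }
            next
        }
        { print $1 " is " $3; bad = 1 }        # any other status needs attention
        END { exit bad }
    '
}
```

A typical invocation would be `kubectl get pods -n gpu-operator | check_pod_status && echo "GPU Operator pods look healthy"`.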

You can verify that GDS is ready on the system with the following command:

lsmod | grep nvidia

You should see nvidia_fs, the nvidia-fs kernel module used by GDS, in the output:

nvidia_fs 249856 0
nvidia_peermem 16384 0
nvidia_modeset 1171456 0
nvidia_uvm 1191936 0
nvidia 55463936 132 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_core 348160 10 rdma_cm,ib_ipoib,rpcrdma,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
drm 491520 7 drm_kms_helper,drm_vram_helper,ast,nvidia,ttm
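Note that `grep nvidia` also matches lines such as ib_core and drm, where "nvidia" appears only in the dependency column. For a scripted check, it can help to match the module name exactly. The following is a minimal sketch; `has_module` is a hypothetical helper name introduced for illustration.

```shell
# has_module MODULE_NAME
# Hypothetical helper: reads `lsmod` output on stdin and succeeds only if
# MODULE_NAME appears in the first column (the module name itself, not the
# list of modules that depend on it).
has_module() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}
```

For example, `lsmod | has_module nvidia_fs && echo "nvidia_fs is loaded"` prints the message only when the GDS kernel module is present.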