Deploying GDS with NVIDIA AI Enterprise on Kubernetes involves three high-level steps:
Installing upstream Kubernetes
Installing the NVIDIA Network Operator
Installing the NVIDIA GPU Operator with GDS support
This section provides further details on each step.
Install Upstream Kubernetes
To install upstream Kubernetes on your system, follow the instructions at https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#ubuntu-lts. Use containerd as the container engine.
Install Network Operator
For GDS to work with local NVMe drives or NFSoRDMA shares, the Network Operator must be installed. The operator patches the nvme and rpcrdma kernel modules with additional symbols that GDS operations depend on. To install it, follow the Installing NVIDIA Network Operator instructions; that guide also describes additional parameters for the Helm-based installation.
It can take up to 10 minutes for the Network Operator deployment to finish installing the MOFED driver.
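A minimal sketch of how this wait could be scripted instead of watched by hand. The wait_until and pods_settled helpers are hypothetical names introduced here, not part of any NVIDIA tooling, and the namespace passed to pods_settled is an assumption that depends on how the Network Operator was installed.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or roughly TIMEOUT_SECONDS have elapsed.
wait_until() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while ! "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# pods_settled NAMESPACE
# Succeeds when the namespace has at least one pod and every pod reports
# a STATUS of Running or Completed. The namespace is an assumption; pass
# the one used for your Network Operator installation.
pods_settled() {
  kubectl get pods -n "$1" --no-headers 2>/dev/null |
    awk '$3 != "Running" && $3 != "Completed" { bad = 1 }
         END { exit (bad || NR == 0) ? 1 : 0 }'
}
```

For example, wait_until 600 15 pods_settled my-network-operator-namespace would poll every 15 seconds for up to 10 minutes, where my-network-operator-namespace is a placeholder for the real namespace.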
Install GPU Operator with GDS support
On a fresh system with a clean install of Ubuntu, completely uninstall the NVIDIA GPU driver and the NVIDIA container runtime from the system, and install the GPU Operator instead. The GPU Operator runs the NVIDIA GPU driver in a Kubernetes pod.
The reason is that when the driver is installed directly on the bare-metal system, you must schedule hardware resources for each container or pod yourself. For example, if multiple workloads run in multiple containers on a single system with one GPU, you cannot schedule that one GPU as a resource for every running container.
The GPU Operator includes a discovery DaemonSet that finds GPUs on the host and on other connected Kubernetes nodes, and labels them as a resource on the node where they were found. Each node with labeled GPUs then has the NVIDIA GPU driver DaemonSet installed. The driver is thus containerized on each node and is exposed to every other container running on that node. A device plugin DaemonSet is also installed so that the labeled GPUs can be scheduled for other application containers.
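One way to observe the result of the discovery and device plugin DaemonSets is to list how many GPUs each node advertises as allocatable. The sketch below assumes the GPUs are exposed under the nvidia.com/gpu resource name, which this section does not state; list_gpu_counts and nodes_with_gpus are hypothetical helper names.

```shell
# Print one row per node: its name and the number of allocatable GPUs.
# Assumption: GPUs are advertised as the extended resource nvidia.com/gpu.
# Nodes without that resource show <none> in the GPUS column.
list_gpu_counts() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# Read the table produced by list_gpu_counts on stdin and print only the
# names of nodes that advertise at least one GPU.
nodes_with_gpus() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}
```

Running list_gpu_counts | nodes_with_gpus after the GPU Operator is installed should print every node on which GPU workloads can be scheduled.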
On an NVAIE 3.0 compatible system, install the GPU Operator by following its installation instructions. The installation requires an NLS token, which can be obtained from the NVIDIA Licensing Portal <https://ui.licensing.nvidia.com/login>.
To install with GDS support, run the Helm install command as follows:
helm install --wait gpu-operator nvaie/gpu-operator-3-0 -n gpu-operator \
    --set driver.repository=nvcr.io/nvidia \
    --set driver.image=driver \
    --set driver.version=525.60.13 \
    --set driver.licensingConfig.config.name="" \
    --set gds.enabled=true \
    --set driver.rdma.enabled=true
You can verify that the GPU Operator is installed and running with the following command:
$ kubectl get pods -n gpu-operator
You should see output similar to the following:
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-jgcph                                       1/1     Running     0          7h33m
gpu-operator-1663780918-node-feature-discovery-master-6c97sl7gv   1/1     Running     0          15d
gpu-operator-1663780918-node-feature-discovery-worker-6xkjg       1/1     Running     0          15d
gpu-operator-6764d9bc9b-8sm2c                                     1/1     Running     0          15d
nvidia-container-toolkit-daemonset-w4bvg                          1/1     Running     0          7h33m
nvidia-cuda-validator-692qk                                       0/1     Completed   0          7h19m
nvidia-dcgm-exporter-mxmr6                                        1/1     Running     0          7h33m
nvidia-device-plugin-daemonset-2r76s                              1/1     Running     0          7h33m
nvidia-device-plugin-validator-hjld4                              0/1     Completed   0          7h18m
nvidia-driver-daemonset-8kmwc                                     3/3     Running     0          7h21m
nvidia-operator-validator-7hcbv                                   1/1     Running     0          7h21m
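Reading this table by eye works for a single node but gets tedious on larger clusters. As a rough sketch, the hypothetical unhealthy_pods helper below filters the same kubectl output down to pods that are neither Completed nor Running with all of their containers ready. Treating exactly those two states as healthy is an assumption drawn from the sample output above.

```shell
# Read `kubectl get pods` table output (with header) on stdin and print the
# names of pods that look unhealthy: anything that is not Completed and not
# Running with READY showing all containers up (for example 3/3).
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")
    ok = ($3 == "Completed") || ($3 == "Running" && ready[1] == ready[2])
    if (!ok) print $1
  }'
}
```

With this in place, kubectl get pods -n gpu-operator | unhealthy_pods prints nothing when every pod has settled, and prints the pod names that still need attention otherwise.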
You can verify that GDS is ready on the system with the following command:
$ lsmod | grep nvidia
You should see the nvidia_fs module in the output:
nvidia_fs             249856  0
nvidia_peermem         16384  0
nvidia_modeset       1171456  0
nvidia_uvm           1191936  0
nvidia              55463936  132 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_core               348160  10 rdma_cm,ib_ipoib,rpcrdma,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
drm                   491520  7 drm_kms_helper,drm_vram_helper,ast,nvidia,ttm
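Because grep nvidia also matches lines where an NVIDIA module appears only in the "used by" column (the ib_core and drm lines above), a stricter check can compare just the module-name column. The sketch below uses a hypothetical missing_modules helper; which modules to require is an assumption, and only nvidia_fs is named by this section as the GDS indicator.

```shell
# missing_modules MODULE...
# Read lsmod output on stdin and print each required MODULE that does not
# appear in the first (module name) column. Prints nothing if all are loaded.
missing_modules() {
  loaded=$(awk '{ print $1 }')
  for module in "$@"; do
    if ! printf '%s\n' "$loaded" | grep -qxF "$module"; then
      printf '%s\n' "$module"
    fi
  done
}
```

For example, lsmod | missing_modules nvidia_fs prints nothing when the GDS kernel module is loaded, and prints nvidia_fs when it is not.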