The nvidia-driver-daemonset houses an nvidia-fs-sidecar container, which loads the NVIDIA GPU driver and GDS kernel modules into the host. Privileged containers running in other pods can then access these modules. To use GDS from your own application container, the GDS user-space libraries must be installed in that container. The simplest way to do this is to use a CUDA container image as the base image in the application container's Dockerfile:
FROM nvcr.io/nvidia/cuda:11.7.1-devel-ubuntu20.04
RUN apt-get update && apt-get install -y libcufile-dev
If the full CUDA base container image is not desired, start from an Ubuntu base image and add the following commands to the Dockerfile. They install the CUDA toolkit and the libcufile user-space libraries without installing CUDA in its entirety:
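FROM ubuntu:20.04
# wget and CA certificates are not included in the base Ubuntu image
RUN apt-get update && apt-get install -y wget ca-certificates && \
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb && \
    dpkg -i cuda-keyring_1.0-1_all.deb && \
    apt-get update && \
    apt-get -y install cuda-toolkit-<major>-<minor> libcufile-dev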
Here, <major> and <minor> correspond to the CUDA version. For example, CUDA 11.8 has major = 11 and minor = 8.
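The mapping from a CUDA version to the package name can be sketched in shell as follows. The function name cuda_toolkit_package is hypothetical and only illustrates the cuda-toolkit-<major>-<minor> naming pattern described above; it assumes the version is given as <major>.<minor>, optionally followed by a patch component.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any NVIDIA tooling): build the apt
# package name cuda-toolkit-<major>-<minor> from a CUDA version string.
cuda_toolkit_package() {
  local version="$1"
  # Require at least <major>.<minor>
  case "$version" in
    *.*) ;;
    *) echo "expected a version like 11.8" >&2; return 1 ;;
  esac
  local major="${version%%.*}"   # text before the first dot
  local rest="${version#*.}"     # text after the first dot
  local minor="${rest%%.*}"      # drop any patch component (11.7.1 -> 7)
  echo "cuda-toolkit-${major}-${minor}"
}

cuda_toolkit_package 11.8   # prints cuda-toolkit-11-8
```

The printed name can then be passed to apt-get install in place of the cuda-toolkit-<major>-<minor> placeholder.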
After the application container is built, create a YAML file that describes the pod to launch in Kubernetes. If your application container image is hosted locally with Docker, first run the following commands to create a private Docker registry and push the image to it so that Kubernetes can access it:
$ sudo docker run -d -p 5000:5000 --restart=always --name registry registry:2
$ sudo docker tag <your image name>:<your image version> localhost:5000/<your image name>:<your image version>
$ sudo docker push localhost:5000/<your image name>:<your image version>
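The tag and push steps above could be wrapped in a small script, sketched here under a few assumptions: the function names registry_ref and push_to_local_registry and the REGISTRY variable are hypothetical, and the final curl call assumes the registry answers the standard Docker Registry HTTP API v2 tags endpoint on localhost:5000.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; only the docker and curl commands they call are real CLIs.
REGISTRY="${REGISTRY:-localhost:5000}"

# Compose the registry-qualified image reference, e.g. localhost:5000/app:1.0
registry_ref() {
  local name="$1" version="$2"
  echo "${REGISTRY}/${name}:${version}"
}

# Tag a locally built image, push it, then list the tags the registry holds.
push_to_local_registry() {
  local name="$1" version="$2"
  local ref
  ref="$(registry_ref "$name" "$version")"
  sudo docker tag "${name}:${version}" "$ref" &&
    sudo docker push "$ref" &&
    curl -s "http://${REGISTRY}/v2/${name}/tags/list"
}

# Example (requires Docker and the running registry container):
#   push_to_local_registry gds-app 1.0
```

The reference printed by registry_ref is the value to use for the image field of the pod specification.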
The following YAML example shows the minimum that the pod specification must contain, aside from the name, image location, and command to run, which you should adapt to your application:
apiVersion: v1
kind: Pod
metadata:
  name: gds-application
spec:
  hostNetwork: true
  hostIPC: true
  containers:
  - name: gds-application
    image: <your application image>
    imagePullPolicy: Always
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
    securityContext:
      privileged: true
    volumeMounts:
    - name: udev
      mountPath: /run/udev
    - name: kernel-config
      mountPath: /sys/kernel/config
    - name: dev
      mountPath: /run/dev
    - name: sys
      mountPath: /sys
    - name: results
      mountPath: /results
    - name: lib
      mountPath: /lib/modules
  volumes:
  - name: udev
    hostPath:
      path: /run/udev
  - name: kernel-config
    hostPath:
      path: /sys/kernel/config
  - name: dev
    hostPath:
      path: /run/dev
  - name: sys
    hostPath:
      path: /sys
  - name: results
    hostPath:
      path: /results
  - name: lib
    hostPath:
      path: /lib/modules
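Once the specification is saved to a file, launching and checking the pod might look like the sketch below. The file name gds-application.yaml and the function name launch_gds_pod are assumptions for illustration, and the final check assumes the GDS kernel module appears under the name nvidia_fs in /proc/modules.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create the pod, wait until it is Ready, then confirm
# that the GDS kernel module loaded on the host is visible from the container.
launch_gds_pod() {
  local manifest="${1:-gds-application.yaml}"   # assumed file name
  kubectl apply -f "$manifest" &&
    kubectl wait --for=condition=Ready pod/gds-application --timeout=120s &&
    kubectl exec gds-application -- grep nvidia_fs /proc/modules
}

# Example (requires access to the cluster):
#   launch_gds_pod gds-application.yaml
```

If the last command prints no matching line, the module is most likely not loaded on the node where the pod was scheduled.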