Validate the GPU/RDMA access within the container#
Validate GPU access from the container#
Create a file named ‘gpu-test.yaml’ with the following content:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-pod
    image: nvcr.io/nvidia/cuda:12.6.3-runtime-ubuntu22.04
    imagePullPolicy: IfNotPresent
    command: ["/bin/sh"]
    args: ["-c", "nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 8
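The manifest above requests all eight GPUs on a single node. Before applying it, you can confirm that a worker node actually advertises eight allocatable GPUs (this assumes the NVIDIA GPU device plugin or GPU Operator is already deployed on the cluster):

# Each GPU-equipped worker should report nvidia.com/gpu: 8 under Capacity and Allocatable
kubectl describe nodes | grep nvidia.com/gpu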
With the Pod manifest created, load the Kubernetes environment module using the following command:
module load kubernetes
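If the module loaded correctly, kubectl should now be on the PATH and pointed at the cluster; a quick sanity check:

# Confirm kubectl resolves and can reach the API server
which kubectl
kubectl get nodes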
Now that the environment module is loaded, run the example by applying the manifest:
kubectl apply -f gpu-test.yaml
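If the pod remains in Pending, the scheduler events usually explain why (for example, no node with enough free nvidia.com/gpu resources):

# Inspect the scheduling events for the test pod
kubectl describe pod gpu-test | tail -n 10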
Next, monitor the progress of the pod using the following command:
k8suser@bcm10-headnode1:~$ kubectl get pods
NAME       READY   STATUS    RESTARTS   AGE
gpu-test   1/1     Running   0          8m20s
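Instead of polling with kubectl get pods, you can block until the test finishes; a minimal sketch, assuming kubectl v1.23 or later for the jsonpath condition:

# Wait up to two minutes for the pod to reach the Succeeded phase
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=120s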
Once the pod has finished, verify the results using kubectl logs. The output should list all eight GPUs in the node where the pod was executed.
k8suser@bcm10-headnode1:~$ kubectl logs gpu-test
Wed Feb 12 23:49:34 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:1B:00.0 Off |                    0 |
| N/A   27C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:43:00.0 Off |                    0 |
| N/A   28C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:52:00.0 Off |                    0 |
| N/A   31C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:61:00.0 Off |                    0 |
| N/A   29C    P0             72W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9D:00.0 Off |                    0 |
| N/A   28C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:C3:00.0 Off |                    0 |
| N/A   26C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:D1:00.0 Off |                    0 |
| N/A   29C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DF:00.0 Off |                    0 |
| N/A   29C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
Reference: NVIDIA System Management Interface
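To check the result programmatically rather than reading the table, count the GPU rows in the log and then remove the test pod; the grep pattern below assumes H100 GPUs as in the output above:

# Should print 8 on an 8-GPU HGX H100 node
kubectl logs gpu-test | grep -c 'NVIDIA H100'
# Clean up the GPU test pod
kubectl delete -f gpu-test.yaml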
Validate the RDMA Network from the container#
Create a test file named ‘ib-network-validation.yaml’ with the following content:
apiVersion: v1
kind: Pod
metadata:
  name: network-validation-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: ibp192s0,ibp206s0,ibp154s0,ibp220s0,ibp24s0,ibp64s0,ibp79s0,ibp94s0
spec:
  containers:
  - name: network-validation-pod
    image: docker.io/deepops/nccl-tests:latest
    imagePullPolicy: IfNotPresent
    command:
    - sh
    - -c
    - sleep inf
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    resources:
      requests:
        nvidia.com/resibp192s0: "1"
        nvidia.com/resibp206s0: "1"
        nvidia.com/resibp154s0: "1"
        nvidia.com/resibp220s0: "1"
        nvidia.com/resibp24s0: "1"
        nvidia.com/resibp64s0: "1"
        nvidia.com/resibp79s0: "1"
        nvidia.com/resibp94s0: "1"
      limits:
        nvidia.com/resibp192s0: "1"
        nvidia.com/resibp206s0: "1"
        nvidia.com/resibp154s0: "1"
        nvidia.com/resibp220s0: "1"
        nvidia.com/resibp24s0: "1"
        nvidia.com/resibp64s0: "1"
        nvidia.com/resibp79s0: "1"
        nvidia.com/resibp94s0: "1"
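The network names in the annotation and the nvidia.com/res* resource names above are specific to this cluster's RDMA device plugin configuration; substitute the names defined in your environment. To see which RDMA resources the nodes actually advertise, a quick check:

# List the RDMA device-plugin resource names advertised by the nodes
kubectl get nodes -o json | grep -o '"nvidia.com/res[^"]*"' | sort -u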
Apply the ‘ib-network-validation.yaml’ file:
kubectl apply -f ib-network-validation.yaml
Confirm the pod is in the Running state:
k8suser@bcm10-headnode1:~$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
network-validation-pod   1/1     Running   0          5s
Verify that the InfiniBand/RDMA interfaces are available in the container:
root@bcm10-headnode1:~# kubectl exec -it network-validation-pod -- /usr/sbin/ibdev2netdev
mlx5_14 port 1 ==> net5 (Up)
mlx5_25 port 1 ==> net6 (Up)
mlx5_33 port 1 ==> net7 (Up)
mlx5_43 port 1 ==> net8 (Up)
mlx5_49 port 1 ==> net3 (Up)
mlx5_58 port 1 ==> net1 (Up)
mlx5_61 port 1 ==> net2 (Up)
mlx5_70 port 1 ==> net4 (Up)
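For a check that goes beyond the port-to-netdev mapping, you can query one of the listed devices directly; this assumes the ibv_devinfo utility from rdma-core is present in the nccl-tests image:

# A healthy InfiniBand link reports state: PORT_ACTIVE
kubectl exec -it network-validation-pod -- ibv_devinfo -d mlx5_14 | grep -E 'state|link_layer'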
Delete the test pod:
root@bcm10-headnode1:~# kubectl delete pod network-validation-pod
pod "network-validation-pod" deleted