Validate the GPU/RDMA access within the container#

Validate GPU access from container#

Create a file named 'gpu-test.yaml' with the following contents:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-pod
      image: nvcr.io/nvidia/cuda:12.6.3-runtime-ubuntu22.04
      imagePullPolicy: IfNotPresent
      command: ["/bin/sh"]
      args: ["-c", "nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 8

With the Pod manifest created, load the Kubernetes environment module using the following command:

module load kubernetes

Now that the environment module is loaded, create the test pod:

kubectl apply -f gpu-test.yaml

Next, monitor the pod's progress using the following command:

k8suser@bcm10-headnode1:~$ kubectl get pods
NAME       READY   STATUS    RESTARTS   AGE
gpu-test   1/1     Running   0          8m20s
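Instead of polling with kubectl get pods, you can block until the test pod finishes. A minimal sketch, assuming kubectl 1.23 or later (which added JSONPath conditions to kubectl wait):

```shell
# Block until the gpu-test pod reports phase Succeeded, or time out
# after two minutes. A non-zero exit code indicates the pod did not
# complete in time.
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=120s
```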

Once the pod has finished, verify the results using kubectl logs. The output should list all 8 GPUs in the node where the pod ran:

k8suser@bcm10-headnode1:~$ kubectl logs gpu-test
Wed Feb 12 23:49:34 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:1B:00.0 Off |                    0 |
| N/A   27C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:43:00.0 Off |                    0 |
| N/A   28C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:52:00.0 Off |                    0 |
| N/A   31C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:61:00.0 Off |                    0 |
| N/A   29C    P0             72W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9D:00.0 Off |                    0 |
| N/A   28C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:C3:00.0 Off |                    0 |
| N/A   26C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:D1:00.0 Off |                    0 |
| N/A   29C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DF:00.0 Off |                    0 |
| N/A   29C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
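After verifying the output, the completed test pod can be removed so the validation can be re-run cleanly later (a cleanup step analogous to the one in the RDMA validation):

```shell
# Delete the GPU test pod using the same manifest it was created from.
kubectl delete -f gpu-test.yaml
```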

Reference: NVIDIA System Management Interface (nvidia-smi)

Validate the RDMA Network from the container#

Create a test file named 'ib-network-validation.yaml' with the following contents:

apiVersion: v1
kind: Pod
metadata:
  name: network-validation-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: ibp192s0,ibp206s0,ibp154s0,ibp220s0,ibp24s0,ibp64s0,ibp79s0,ibp94s0
spec:
  containers:
    - name: network-validation-pod
      image: docker.io/deepops/nccl-tests:latest
      imagePullPolicy: IfNotPresent
      command:
        - sh
        - -c
        - sleep inf
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]
      resources:
        requests:
          nvidia.com/resibp192s0: "1"
          nvidia.com/resibp206s0: "1"
          nvidia.com/resibp154s0: "1"
          nvidia.com/resibp220s0: "1"
          nvidia.com/resibp24s0: "1"
          nvidia.com/resibp64s0: "1"
          nvidia.com/resibp79s0: "1"
          nvidia.com/resibp94s0: "1"
        limits:
          nvidia.com/resibp192s0: "1"
          nvidia.com/resibp206s0: "1"
          nvidia.com/resibp154s0: "1"
          nvidia.com/resibp220s0: "1"
          nvidia.com/resibp24s0: "1"
          nvidia.com/resibp64s0: "1"
          nvidia.com/resibp79s0: "1"
          nvidia.com/resibp94s0: "1"
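Note that the network names in the k8s.v1.cni.cncf.io/networks annotation and the nvidia.com/res* resource names are specific to this cluster's configuration. Before applying the manifest, you can confirm that the referenced Multus NetworkAttachmentDefinitions exist in the namespace:

```shell
# Each name listed in the pod's k8s.v1.cni.cncf.io/networks annotation
# (ibp192s0, ibp206s0, ...) should appear in this output.
kubectl get network-attachment-definitions
```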

Apply the ‘ib-network-validation.yaml’ file:

kubectl apply -f ib-network-validation.yaml

Confirm the pod is in the Running state:

k8suser@bcm10-headnode1:~$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
network-validation-pod   1/1     Running   0          5s

Verify that the InfiniBand/RDMA interfaces are available in the container:

root@bcm10-headnode1:~# kubectl exec -it network-validation-pod -- /usr/sbin/ibdev2netdev
mlx5_14 port 1 ==> net5 (Up)
mlx5_25 port 1 ==> net6 (Up)
mlx5_33 port 1 ==> net7 (Up)
mlx5_43 port 1 ==> net8 (Up)
mlx5_49 port 1 ==> net3 (Up)
mlx5_58 port 1 ==> net1 (Up)
mlx5_61 port 1 ==> net2 (Up)
mlx5_70 port 1 ==> net4 (Up)
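As an additional sanity check, the port state of each RDMA device can be queried with ibv_devinfo (assuming the rdma-core utilities are present in the image, as they typically are in deepops/nccl-tests). Every attached device should report PORT_ACTIVE:

```shell
# Print each device name and its port state; every port allocated to
# the pod should show "state: PORT_ACTIVE (4)".
kubectl exec -it network-validation-pod -- ibv_devinfo | grep -E 'hca_id|state'
```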

Delete the test pod:

root@bcm10-headnode1:~# kubectl delete pod network-validation-pod

pod "network-validation-pod" deleted