This is a single-node Kubernetes (k8s) cluster with an A100 GPU that has been partitioned into three Multi-Instance GPU (MIG) instances.
Use the following command to view the current GPU information. The GPU should be an NVIDIA A100 80GB split into three MIG instances.
kubectl run nvidia-smi --rm -t -i --restart=Never --image=nvidia/cuda:11.4.0-base nvidia-smi
The output should look similar to the one below.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:CA:00.0 Off |                   On |
| N/A   42C    P0    81W / 300W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                 |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    3   0   0  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    4   0   1  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    5   0   2  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
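Optionally, you can also confirm that the MIG devices are advertised as allocatable resources on the Kubernetes node. The exact resource name depends on the MIG strategy configured for the NVIDIA device plugin (for example, nvidia.com/gpu with the single strategy, or nvidia.com/mig-* profile names with the mixed strategy), so the command below simply greps for any nvidia.com resource:
kubectl describe node | grep nvidia.com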
You will see a running Spark client pod called sparkrunner-0. Bash into the pod.
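A minimal way to open a shell in the client pod (assuming bash is available in the sparkrunner-0 image) is:
kubectl exec -it sparkrunner-0 -- bash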
Get the IP of the client pod; it will be used later.
kubectl describe pod sparkrunner-0 | grep IP
The output should look similar to the one below.
cni.projectcalico.org/podIP: 192.168.34.30/32
cni.projectcalico.org/podIPs: 192.168.34.30/32
IP: 192.168.34.30
IPs:
  IP:  192.168.34.30
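If you prefer to capture just the pod IP for later steps, a jsonpath query is a convenient alternative; the SPARK_DRIVER_IP variable name below is only illustrative:
SPARK_DRIVER_IP=$(kubectl get pod sparkrunner-0 -o jsonpath='{.status.podIP}')
echo $SPARK_DRIVER_IP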