General Cluster Information

This is a single-node Kubernetes (k8s) cluster with an NVIDIA A100 GPU that has been split into three Multi-Instance GPU (MIG) instances.

Use the following command to see the current GPU information. The GPU should be an NVIDIA A100 80GB that has been partitioned into three Multi-Instance GPU (MIG) instances.


kubectl run nvidia-smi --rm -t -i --restart=Never --image=nvidia/cuda:11.4.0-base nvidia-smi

Output should look similar to the one below.


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...   On  | 00000000:CA:00.0 Off |                   On |
| N/A   42C    P0    81W / 300W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    3   0   0  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    4   0   1  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    5   0   2  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                       Usage  |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
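Optionally, you can confirm that the MIG instances are also exposed to Kubernetes as schedulable resources by inspecting the node's allocatable resources. A minimal sketch is shown below; the exact resource name (for example nvidia.com/mig-3g.20gb) depends on how the device plugin is configured, so treat the name in the output as an assumption.

kubectl describe node | grep -i nvidia.com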

You will see a running Spark client pod called sparkrunner-0. Bash into the pod, as shown in the example below.
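For example, assuming the pod runs in your current namespace and has bash available, you can open a shell in it with kubectl exec:

kubectl exec -it sparkrunner-0 -- bash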

Get the IP of the client pod; it will be used later.


kubectl describe pod sparkrunner-0 | grep IP

Output should look similar to the one below.


cni.projectcalico.org/podIP: 192.168.34.30/32
cni.projectcalico.org/podIPs: 192.168.34.30/32
IP:           192.168.34.30
IPs:
  IP:         192.168.34.30
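As an alternative to grepping the describe output, the pod IP can also be read directly with a JSONPath query, which returns only the IP value:

kubectl get pod sparkrunner-0 -o jsonpath='{.status.podIP}'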
