General Cluster Information

This is a single-node Kubernetes (k8s) cluster with an NVIDIA A100 or H100 GPU that has been partitioned into three Multi-Instance GPU (MIG) instances.
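
As a quick sanity check, you can also look at the GPU resources the node advertises to Kubernetes. The sketch below assumes the NVIDIA device plugin is installed on the cluster; depending on its MIG strategy, the MIG slices may appear either as nvidia.com/gpu (single strategy) or as profile-specific nvidia.com/mig-<profile> resources (mixed strategy).

# Sketch: list the GPU resources the node exposes to the scheduler
# (resource names depend on the NVIDIA device plugin's MIG strategy).
kubectl describe nodes | grep -i "nvidia.com/"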

Use the following command to see the current GPU information. The GPU should be an NVIDIA A100 or H100 80GB that has been partitioned into three Multi-Instance GPU (MIG) instances.

kubectl run nvidia-smi --rm -t -i --restart=Never --image=nvidia/cuda:12.0.0-base-ubuntu20.04 -- nvidia-smi

The output should look similar to the following.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:CA:00.0 Off |                   On |
| N/A   42C    P0    81W / 300W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    3   0   0  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    4   0   1  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    5   0   2  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                 GPU Memory  |
|        ID   ID                                                  Usage       |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
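
To schedule work onto one of the MIG slices, a pod requests it as an extended resource. The following is a minimal sketch only, not one of the benchmark pods: the pod name mig-smoke-test and the resource name nvidia.com/mig-2g.20gb are assumptions (the profile-specific name applies to the mixed MIG strategy; with the single strategy the slice is requested as nvidia.com/gpu instead). Check the node's allocatable resources, as shown earlier, before using it.

# Sketch: run nvidia-smi -L on a single MIG slice.
# The resource name below is an assumption; adjust it to match what the
# node actually advertises (nvidia.com/gpu or nvidia.com/mig-<profile>).
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: mig-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.0.0-base-ubuntu20.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-2g.20gb: 1
EOF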

The benchmarks use interactive Jupyter notebook applications that run on the LaunchPad Kubernetes cluster. You can access the cluster by clicking the System Console link in the left menu. Once connected, attach to the sparkrunner pod as described in each benchmark guide and run the benchmarks. Several services, such as the Spark history server, Spark, and Jupyter, are also deployed; you can access them from the Desktop tab to monitor application status and view event logs. You can also SSH to the cluster from a terminal on the Desktop.
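
For example, from the System Console you can open a shell in the sparkrunner pod with kubectl. The command below is a sketch: the exact pod name and namespace may differ in your deployment, so follow the specific benchmark guide.

# Open an interactive shell in the sparkrunner pod
# (pod name/namespace are assumptions; adjust to match your deployment).
kubectl exec -it sparkrunner -- /bin/bash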

© Copyright 2022-2023, NVIDIA. Last updated on Jun 23, 2023.