Set Up the Prerequisites
Supported Platforms
The following NVIDIA GPU configurations are supported by the default Helm chart:
8 x H100 (80 GB)
8 x A100 (80 GB)
8 x L40S (48 GB)
Prerequisites
256+ GB system memory
Ubuntu 22.04
NVIDIA driver 535.161.08 (recommended minimum version)
CUDA 12.2+ (the CUDA driver is installed with the NVIDIA driver)
Kubernetes v1.31.2
NVIDIA GPU Operator v23.9
Helm v3.x
NGC API Key
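Several items in the list above can be checked from a shell before installing anything. The following is a minimal sketch, not part of the VSS Blueprint: the function and variable names are hypothetical, it assumes bash, GNU coreutils (for sort -V), and that nvidia-smi and helm are on PATH, and it reads "256 GB" as 256 x 10^9 bytes. It does not check the OS, CUDA, Kubernetes, or GPU Operator versions.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight checks for the prerequisite list above.
# Assumes bash, GNU coreutils (sort -V), and nvidia-smi/helm on PATH.

MIN_DRIVER="535.161.08"
# "256 GB" read as 256 * 10^9 bytes, expressed in KiB as /proc/meminfo reports it.
MIN_MEM_KIB=$((256 * 1000 * 1000 * 1000 / 1024))

# Succeeds when version $1 >= version $2 (compared with sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Reads `helm version --short` output (e.g. v3.14.4+g81c902a) and prints the major number.
helm_major() {
  sed -E 's/^v?([0-9]+).*/\1/'
}

check_driver() {
  command -v nvidia-smi >/dev/null || { echo "FAIL: nvidia-smi not found"; return 1; }
  local v
  v=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
  if version_ge "$v" "$MIN_DRIVER"; then
    echo "OK: NVIDIA driver $v"
  else
    echo "WARN: NVIDIA driver $v is older than $MIN_DRIVER"
    return 1
  fi
}

check_memory() {
  local kib
  kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  if [ "$kib" -ge "$MIN_MEM_KIB" ]; then
    echo "OK: MemTotal ${kib} KiB"
  else
    echo "WARN: MemTotal ${kib} KiB is below ${MIN_MEM_KIB} KiB"
    return 1
  fi
}

check_helm() {
  command -v helm >/dev/null || { echo "FAIL: helm not found"; return 1; }
  local major
  major=$(helm version --short | helm_major)
  if [ "$major" = "3" ]; then
    echo "OK: Helm v3"
  else
    echo "WARN: Helm major version is $major, expected 3"
    return 1
  fi
}
```

After sourcing the file, run check_driver; check_memory; check_helm. Each function prints one line and returns non-zero on a miss, so they can also be chained with && in a larger script.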
Note
GPU Operator-related errors have been observed with microk8s when using certain newer NVIDIA driver versions.
Install the NVIDIA Driver
Download and install NVIDIA driver 535.161.08 from the NVIDIA Unix drivers page:
https://www.nvidia.com/Download/driverResults.aspx/222416/en-us/
Run the following commands:
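# Make the downloaded installer executable
chmod 755 NVIDIA-Linux-x86_64-535.161.08.run
# Run the installer; --no-cc-version-check skips the compiler version check
sudo ./NVIDIA-Linux-x86_64-535.161.08.run --no-cc-version-check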
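Once the installer finishes, nvidia-smi should list every GPU with the newly installed driver version. As a sketch, the hypothetical gpu_summary function below condenses the output of nvidia-smi --query-gpu=name,driver_version --format=csv,noheader into a GPU count and the distinct driver versions reported; on one of the supported platforms you would expect 8 GPUs and a single driver version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize nvidia-smi CSV output read from stdin.
# Expected input, one line per GPU: "<name>, <driver_version>"

gpu_summary() {
  awk -F', ' '
    NF >= 2 { n++; drv[$2] = 1 }          # count GPUs, remember each driver version
    END {
      out = ""
      for (d in drv) out = out (out == "" ? "" : ",") d
      printf "gpus=%d driver=%s\n", n, out
    }'
}
```

Usage: nvidia-smi --query-gpu=name,driver_version --format=csv,noheader | gpu_summary. More than one driver version, or fewer GPUs than installed, suggests the installation did not complete cleanly.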
Install a Kubernetes Cluster
Use the following commands to install a microk8s cluster on a single 8 x H100 node running Ubuntu 22.04:
# Install microk8s
sudo snap install microk8s --classic
# Enable nvidia and hostpath-storage add-ons
sudo microk8s enable nvidia
sudo microk8s enable hostpath-storage
# Install kubectl
sudo snap install kubectl --classic
# Verify microk8s is installed correctly
sudo microk8s kubectl get pod -A
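The snap command above installs microk8s from its default channel, which may not correspond to the Kubernetes v1.31.2 listed under Prerequisites. One option, assuming a 1.31/stable channel is published for the microk8s snap (confirm with snap info microk8s), is to install from that track explicitly and then read the server version back. This pins the 1.31 minor release, not the exact patch level. The helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the 1.31/stable channel name is an assumption to verify
# with `snap info microk8s` before use.

# Use instead of the unpinned `snap install microk8s --classic`.
install_pinned_microk8s() {
  sudo snap install microk8s --classic --channel=1.31/stable
}

# Prints "major.minor" from the "Server Version:" line of `kubectl version`.
server_minor() {
  sed -nE 's/^Server Version: v?([0-9]+\.[0-9]+).*/\1/p'
}

# Succeeds when the running cluster reports Kubernetes 1.31.
check_k8s_version() {
  [ "$(sudo microk8s kubectl version 2>/dev/null | server_minor)" = "1.31" ]
}
```

Run install_pinned_microk8s, enable the add-ons as above, then check_k8s_version && echo "Kubernetes 1.31".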
Note
For instructions on joining the microk8s group for admin access without sudo, and for other microk8s setup and usage information, see: https://microk8s.io/docs/getting-started.
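As a rough sketch of the non-root steps from that guide, plus exporting a kubeconfig so that the standalone kubectl installed with snap can reach the cluster, the hypothetical function below could be used. Note that it overwrites any existing ~/.kube/config, and the group change only takes effect after you log out and back in (or start a new shell with newgrp microk8s).

```shell
#!/usr/bin/env bash
# Hypothetical helper based on the microk8s getting-started steps.
# Run as the regular user who will administer the cluster.

setup_microk8s_user() {
  # Add the current user to the microk8s group (effective at next login).
  sudo usermod -a -G microk8s "$USER"
  # Private kubeconfig directory for the standalone kubectl.
  mkdir -p "$HOME/.kube"
  chmod 0700 "$HOME/.kube"
  # Export the microk8s cluster credentials; overwrites any existing config.
  sudo microk8s config > "$HOME/.kube/config"
  chmod 0600 "$HOME/.kube/config"
}
```

After running setup_microk8s_user and logging in again, microk8s kubectl get pod -A and kubectl get pod -A should both work without sudo.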
Make sure the output of sudo microk8s kubectl get pod -A
shows all pods in the Running or Completed status.
This may take some time.
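Instead of re-running that command by hand, the check can be polled. The sketch below assumes the default column layout of kubectl get pod -A --no-headers (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE), so STATUS is the fourth column. The function names, the 10-second interval, and the 600-second default timeout are arbitrary, hypothetical choices.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: wait until every pod is Running or Completed.

# Prints the pods (read from stdin) whose STATUS is neither Running nor Completed.
pods_not_ready() {
  awk '$4 != "Running" && $4 != "Completed"'
}

# Re-checks every 10 s; gives up after $1 seconds (default 600).
wait_for_pods() {
  local timeout=${1:-600} waited=0 out pending
  while :; do
    out=$(sudo microk8s kubectl get pod -A --no-headers 2>/dev/null)
    pending=$(printf '%s\n' "$out" | pods_not_ready)
    # An empty listing is treated as "not ready yet", not as success.
    if [ -n "$out" ] && [ -z "$pending" ]; then
      echo "All pods are Running or Completed."
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      printf 'Still pending after %ss:\n%s\n' "$waited" "$pending"
      return 1
    fi
    sleep 10
    waited=$((waited + 10))
  done
}
```

For example, wait_for_pods 900 waits up to 15 minutes and prints the pods that are still pending if it times out.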
For a two-node setup, run the following command on the control plane node:
sudo microk8s add-node
Run the following commands on the second (worker) node, using the join string printed by the add-node command above when joining the cluster.
# Install microk8s
sudo snap install microk8s --classic
# Enable nvidia and hostpath-storage add-ons
sudo microk8s enable nvidia
sudo microk8s enable hostpath-storage
sudo microk8s join <JOIN_STRING> # This may take a few seconds to complete.
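After sudo microk8s join returns, both nodes should appear as Ready from the control plane node, and each should advertise its GPUs through the nvidia.com/gpu resource registered by the GPU Operator's device plugin. A hypothetical check, assuming the default kubectl get nodes --no-headers columns (NAME, STATUS, ROLES, AGE, VERSION):

```shell
#!/usr/bin/env bash
# Hypothetical helpers: confirm node readiness and allocatable GPUs.

# Counts nodes (read from stdin) whose STATUS column is exactly "Ready".
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Sums allocatable GPUs from "name<TAB>count" lines; empty counts add 0.
total_gpus() {
  awk -F'\t' '{ s += $2 } END { print s + 0 }'
}

# Run on the control plane node; succeeds when at least 2 nodes are Ready.
check_cluster() {
  local ready gpus
  ready=$(sudo microk8s kubectl get nodes --no-headers | count_ready_nodes)
  gpus=$(sudo microk8s kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' \
    | total_gpus)
  echo "ready nodes: $ready, allocatable GPUs: $gpus"
  [ "$ready" -ge 2 ]
}
```

On two 8-GPU nodes, check_cluster would be expected to report 2 ready nodes and 16 allocatable GPUs once the GPU Operator pods on the new node have started.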
Obtain NGC API Key
To obtain the required NGC API Key, follow the steps in:
Next, deploy the VSS Blueprint.