Set Up the Prerequisites#

Prerequisites#

  • Refer to Supported Platforms

  • 256+ GB system memory

  • 500 GB of storage

  • Ubuntu 22.04

  • NVIDIA driver 535.183.06 (Recommended minimum version), or NVIDIA driver 570.86.15 for H200 GPUs

  • CUDA 12.2+ (CUDA driver installed with NVIDIA driver)

  • Kubernetes v1.31.2

  • NVIDIA GPU Operator v23.9 (Recommended minimum version)

  • Helm v3.x

  • NGC API Key
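
Several of these prerequisites can be sanity-checked from a shell before continuing (a minimal sketch; the driver, Kubernetes, and GPU Operator items are installed in the steps below):

# System memory (expect 256 GB or more)
free -g | awk '/^Mem:/ {print $2 " GB total"}'
# Available storage (expect 500 GB or more on the target filesystem)
df -h /
# Ubuntu release (expect 22.04)
lsb_release -rs
# Helm version (expect v3.x)
helm version --short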

Install the NVIDIA Driver#

  1. Download and install the NVIDIA driver 535.183.06 from the NVIDIA Unix Drivers page at:

    https://www.nvidia.com/en-us/drivers/details/228697/

  2. Run the following commands:

    # Make the installer executable, then run it, skipping the compiler version check
    chmod 755 NVIDIA-Linux-x86_64-535.183.06.run
    sudo ./NVIDIA-Linux-x86_64-535.183.06.run --no-cc-version-check
    

Note

If you are using an H200 GPU, install the NVIDIA driver 570.86.15 from https://www.nvidia.com/en-us/drivers/details/239776/.
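
After the installer finishes, you can confirm the driver is loaded and reports the expected version:

# Lists the GPUs along with the installed driver and CUDA versions
nvidia-smi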

Install the NVIDIA Fabric Manager#

Note

There are two ways to handle NVIDIA Fabric Manager:

  1. GPU Operator + driver container: starts the Fabric Manager process automatically in the driver container.

To use this method and avoid manual Fabric Manager installation, follow the documentation here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#microk8s

  2. GPU Operator without driver container: the user must manually install NVIDIA Fabric Manager.

This is the method VSS documents and recommends in this section.

Requirements when manually installing NVIDIA Fabric Manager:

  • The Fabric Manager version must exactly match the installed driver version (e.g., driver 535.216.03 requires Fabric Manager 535.216.03).

Only some systems require NVIDIA Fabric Manager to be installed. For more information, see: When do I need NVIDIA Fabric Manager in the deployment node(s)?.
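
As a quick way to determine which Fabric Manager version is needed, you can check the installed driver version (a minimal sketch for Ubuntu; package names may differ on other distributions):

# Report the installed NVIDIA driver version; the Fabric Manager package must match it exactly
cat /proc/driver/nvidia/version

# If a Fabric Manager package is already installed, confirm its version matches the driver
dpkg -l | grep nvidia-fabricmanager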

For VSS deployment on clean machines, NVIDIA Fabric Manager, where required, can be installed using the following steps:

  1. Download the NVIDIA Fabric Manager debian for the required driver version:

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-535_535.183.06-1_amd64.deb
    
    # OR
    
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-570_570.86.15-1_amd64.deb
    
  2. Install the debian file using:

    sudo apt-get install ./nvidia-fabricmanager-535_535.183.06-1_amd64.deb
    
    # OR
    
    sudo apt-get install ./nvidia-fabricmanager-570_570.86.15-1_amd64.deb
    
  3. Start the NVIDIA Fabric Manager service:

    sudo systemctl start nvidia-fabricmanager
    
  4. Verify the NVIDIA Fabric Manager service is running:

    sudo systemctl status nvidia-fabricmanager.service
    
  5. To enable the NVIDIA Fabric Manager service to start automatically at boot, run the following command:

    sudo systemctl enable nvidia-fabricmanager.service
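
On NVSwitch-based systems, you can optionally confirm that fabric initialization completed. This is a supplementary check; the exact fields shown depend on the driver version and platform:

# After Fabric Manager finishes initialization, nvidia-smi should report the fabric state as completed
nvidia-smi -q -i 0 | grep -i -A 3 "fabric"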
    

Obtain NGC API Key#

Log in to https://ngc.nvidia.com, go to the NGC API Keys page, and generate a “Legacy API Key”.


Note

Please make sure to generate a “Legacy API Key” at the bottom of the API Keys page.

More information can be found in the NGC User Guide.
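
To quickly verify the key, one option is to log in to the NGC container registry (nvcr.io) with Docker. This sketch assumes Docker is installed and that the key is exported in an environment variable named NGC_API_KEY (a name chosen here for illustration):

# The username is the literal string "$oauthtoken"; the password is the NGC API key itself
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin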

Once the above prerequisites have been met, either:

  1. Complete the Helm prerequisites below, or

  2. Continue to deploy the VSS Blueprint using Docker Compose.

Helm Prerequisites#

Installing a Kubernetes Cluster#

Use the following commands to install a microk8s cluster on a single Ubuntu 22.04 node:

# Install microk8s
sudo snap install microk8s --classic
# Enable nvidia and hostpath-storage add-ons
sudo microk8s enable nvidia
sudo microk8s enable hostpath-storage
# Install kubectl
sudo snap install kubectl --classic
# Verify microk8s is installed correctly
sudo microk8s kubectl get pod -A

Note

If you see errors from the cuda-validator pod, try forcing the NVIDIA GPU Operator to use the system driver:

sudo microk8s enable nvidia force-system-driver

Note

To join the microk8s group for admin access (so you can avoid using sudo), and for other information about microk8s setup and usage, see: https://microk8s.io/docs/getting-started.

Make sure sudo microk8s kubectl get pod -A shows all pods in Running or Completed status. This may take some time.
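
One convenient way to monitor the pods until they settle, for example:

# Refresh the cluster-wide pod list every 5 seconds; press Ctrl+C once all pods are Running or Completed
watch -n 5 sudo microk8s kubectl get pod -A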

Multi-Node Setup (Optional)#

Multi-node deployments can be used in cases where more resources (e.g., GPUs) are required than are available on a single node. For more information, refer to Multi-Node Deployment.

For a multi-node setup, run the following command on the control plane node:

sudo microk8s add-node

Run the following commands on each of the other worker nodes, using the join string printed by the command above:

# Install microk8s
sudo snap install microk8s --classic
# Enable nvidia and hostpath-storage add-ons
sudo microk8s enable nvidia
sudo microk8s enable hostpath-storage
sudo microk8s join <JOIN_STRING>  # This may take a few seconds to complete.
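
Once the workers have joined, you can confirm cluster membership from the control plane node:

# Each node should eventually report a STATUS of "Ready"
sudo microk8s kubectl get nodes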

Next, deploy the VSS Blueprint using Helm.