Prerequisites#

Prerequisites Overview#

Refer to Supported Platforms
256 GB or more of system memory
500 GB of storage
Ubuntu 22.04
NVIDIA driver 580.65.06 (Recommended minimum version)
CUDA 13.0+ (CUDA driver installed with NVIDIA driver)
Kubernetes v1.31.2
NVIDIA GPU Operator v23.9 (Recommended minimum version)
Helm v3.x
NGC API Key
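
The memory, storage, and OS items above can be compared against a host with a short preflight script. This is a minimal sketch, not part of the VSS tooling: the function names (`total_mem_gib`, `free_disk_gb`, `report_min`) are hypothetical, and checking free space on `/` is an assumption about where the deployment will store data.

```shell
#!/usr/bin/env sh
# Preflight sketch: compares this host with the memory, storage, and OS
# prerequisites. All function names here are illustrative assumptions.

# Prints total system memory in whole GiB, read from /proc/meminfo.
# The kernel reports slightly less than the installed RAM, so a "warn"
# close to the threshold should be checked by hand.
total_mem_gib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo 2>/dev/null
}

# Prints the available space, in whole GiB, on the filesystem holding path $1.
free_disk_gb() {
    df -BG --output=avail "$1" 2>/dev/null | tail -n 1 | tr -dc '0-9'
}

# Prints an "ok" or "warn" line. Arguments: label, actual value, minimum value.
report_min() {
    if [ "${2:-0}" -ge "$3" ]; then
        echo "ok: $1 is ${2:-0} (minimum $3)"
    else
        echo "warn: $1 is ${2:-0} (minimum $3)"
    fi
}

report_min "system memory in GiB" "$(total_mem_gib)" 256
report_min "free storage on / in GiB" "$(free_disk_gb /)" 500

# Report the OS release so it can be compared with Ubuntu 22.04.
if [ -r /etc/os-release ]; then
    . /etc/os-release
    echo "info: OS is ${NAME:-unknown} ${VERSION_ID:-unknown} (this guide targets Ubuntu 22.04)"
fi
```

The script only reports; it does not install or change anything, so it can be run before any of the steps below.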

Install the NVIDIA Driver#

Download and install the NVIDIA driver for your GPU type:

wget https://us.download.nvidia.com/tesla/580.65.06/NVIDIA-Linux-x86_64-580.65.06.run
chmod 755 NVIDIA-Linux-x86_64-580.65.06.run
sudo ./NVIDIA-Linux-x86_64-580.65.06.run --no-cc-version-check
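
After the installer finishes, the driver version can be compared against the recommended minimum. The following is a sketch under stated assumptions: `version_ge` is a hypothetical helper built on `sort -V`, and the check relies on `nvidia-smi` being on the `PATH` after installation.

```shell
#!/usr/bin/env sh
# Sketch: check the installed NVIDIA driver against the recommended minimum.

# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

MIN_DRIVER="580.65.06"

if command -v nvidia-smi >/dev/null 2>&1; then
    # Query the driver version reported for the first GPU.
    installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)"
    if version_ge "$installed" "$MIN_DRIVER"; then
        echo "driver $installed meets the recommended minimum $MIN_DRIVER"
    else
        echo "driver '${installed:-unknown}' is older than the recommended minimum $MIN_DRIVER"
    fi
else
    echo "nvidia-smi not found; the driver may not be installed"
fi
```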

Install the NVIDIA Fabric Manager#

Some systems require the NVIDIA Fabric Manager to be installed. For more information, see When do I need NVIDIA Fabric Manager in the deployment nodes?

Note

There are two ways to handle NVIDIA Fabric Manager:

GPU Operator with the driver container: the Fabric Manager process starts automatically inside the driver container.

To use this method and avoid a manual Fabric Manager installation, follow: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#microk8s
GPU Operator without the driver container: you must install NVIDIA Fabric Manager manually.

This is the method that VSS documents and recommends in this section.

Requirements when manually installing NVIDIA Fabric Manager:
- The Fabric Manager version must exactly match the driver version (for example, the 580.65.06 driver requires Fabric Manager 580.65.06).

For VSS deployments on clean machines that require NVIDIA Fabric Manager, install it using the following steps:

Download the NVIDIA Fabric Manager Debian package for the required driver version:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager_580.65.06-1_amd64.deb

Install the Debian package:

sudo apt-get install ./nvidia-fabricmanager_580.65.06-1_amd64.deb
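
Because the Fabric Manager version must match the driver version exactly, it can be worth confirming the match right after installation. This sketch assumes the installed package is named `nvidia-fabricmanager` (inferred from the `.deb` file name) and that its Debian version is the driver version followed by a revision suffix such as `-1`. The helper names are hypothetical.

```shell
#!/usr/bin/env sh
# Sketch: confirm that the Fabric Manager package matches the driver version.

# Strips the Debian revision suffix, for example "580.65.06-1" -> "580.65.06".
upstream_version() {
    printf '%s\n' "${1%%-*}"
}

# Succeeds when Fabric Manager package version $2 matches driver version $1.
fm_matches_driver() {
    [ -n "$1" ] && [ "$1" = "$(upstream_version "$2")" ]
}

if command -v nvidia-smi >/dev/null 2>&1 && command -v dpkg-query >/dev/null 2>&1; then
    driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)"
    # Package name is an assumption taken from the .deb file name.
    fm="$(dpkg-query -W -f='${Version}' nvidia-fabricmanager 2>/dev/null || true)"
    if fm_matches_driver "$driver" "$fm"; then
        echo "Fabric Manager $fm matches driver $driver"
    else
        echo "mismatch: driver=${driver:-unknown} fabric-manager=${fm:-not installed}"
    fi
fi
```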

Start the NVIDIA Fabric Manager service:

sudo systemctl start nvidia-fabricmanager

Verify the NVIDIA Fabric Manager service is running:

sudo systemctl status nvidia-fabricmanager.service

To enable the NVIDIA Fabric Manager service to start automatically at boot, run the following command:
```
sudo systemctl enable nvidia-fabricmanager.service
```
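
The start, verify, and enable steps can be checked together with the `is-active` and `is-enabled` subcommands of `systemctl`. This is a sketch only; `fm_service_ready` is a hypothetical helper that compares the two state strings.

```shell
#!/usr/bin/env sh
# Sketch: report whether Fabric Manager is running now and enabled at boot.

# Succeeds only when $1 (is-active output) is "active" and
# $2 (is-enabled output) is "enabled".
fm_service_ready() {
    [ "$1" = "active" ] && [ "$2" = "enabled" ]
}

unit="nvidia-fabricmanager.service"

if command -v systemctl >/dev/null 2>&1; then
    # Both subcommands exit non-zero for other states, so ignore the exit code
    # and compare the printed state instead.
    active="$(systemctl is-active "$unit" 2>/dev/null || true)"
    enabled="$(systemctl is-enabled "$unit" 2>/dev/null || true)"
    if fm_service_ready "$active" "$enabled"; then
        echo "$unit is running and will start at boot"
    else
        echo "$unit state: active=${active:-unknown} enabled=${enabled:-unknown}"
    fi
fi
```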

Helm Prerequisites#

Installing a Kubernetes Cluster#

Use the following commands to install a microk8s cluster on a single Ubuntu 22.04 node:

# Install microk8s
sudo snap install microk8s --classic
# Enable nvidia and hostpath-storage add-ons
sudo microk8s enable nvidia
sudo microk8s enable hostpath-storage
# Install kubectl
sudo snap install kubectl --classic
# Verify microk8s is installed correctly
sudo microk8s kubectl get pod -A

Note

If you observe errors from the cuda-validator pod, try forcing the NVIDIA GPU Operator to use the system driver:
sudo microk8s enable nvidia force-system-driver
For information about joining the microk8s group for admin access (so that sudo is not needed) and other microk8s setup and usage details, see https://microk8s.io/docs/getting-started.

Ensure that the output of sudo microk8s kubectl get pod -A shows all pods in Running or Completed status. This can take some time.
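
Scanning the pod list by eye gets tedious while pods are still starting. The sketch below filters the output down to pods that are not yet Running or Completed. `pods_not_ready` is a hypothetical helper, and it assumes the default column order of `kubectl get pod -A --no-headers` (namespace, name, ready, status).

```shell
#!/usr/bin/env sh
# Sketch: list pods whose STATUS is neither Running nor Completed.

# Reads `kubectl get pod -A --no-headers` output on stdin and prints
# "namespace/name status" for each pod that is not ready yet.
pods_not_ready() {
    awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

if command -v microk8s >/dev/null 2>&1; then
    # Block until microk8s itself reports that its services are ready.
    sudo microk8s status --wait-ready >/dev/null
    pending="$(sudo microk8s kubectl get pod -A --no-headers | pods_not_ready)"
    if [ -z "$pending" ]; then
        echo "all pods are Running or Completed"
    else
        echo "pods still starting or failing:"
        echo "$pending"
    fi
fi
```

Rerunning the script every minute or so gives a compact view of what is still pending.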

Multi-Node Setup (Optional)#

Multi-node deployments are useful when a workload needs more resources (for example, GPUs) than a single node provides. For more information, refer to Multi-Node Deployment.

For a multi-node setup:

Run the following command on the control plane node:
```
sudo microk8s add-node
```

Run the following commands on each worker node. Use the join string printed by the add-node command above when joining the cluster.

# Install microk8s
sudo snap install microk8s --classic
# Enable nvidia and hostpath-storage add-ons
sudo microk8s enable nvidia
sudo microk8s enable hostpath-storage
sudo microk8s join <JOIN_STRING>  # This can take a few seconds to complete.
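
Once the join completes, it helps to confirm from the control plane node that every node reports Ready. The following is a sketch; `nodes_not_ready` is a hypothetical helper that assumes the default column order of `kubectl get nodes --no-headers` (name, then status).

```shell
#!/usr/bin/env sh
# Sketch: list cluster nodes whose STATUS is not exactly "Ready".

# Reads `kubectl get nodes --no-headers` output on stdin and prints
# "name status" for each node that is not Ready. A cordoned node
# ("Ready,SchedulingDisabled") is also listed.
nodes_not_ready() {
    awk '$2 != "Ready" { print $1, $2 }'
}

if command -v microk8s >/dev/null 2>&1; then
    waiting="$(sudo microk8s kubectl get nodes --no-headers | nodes_not_ready)"
    if [ -z "$waiting" ]; then
        echo "all nodes are Ready"
    else
        echo "nodes not Ready yet:"
        echo "$waiting"
    fi
fi
```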

Deploy the VSS Blueprint using Helm.