Prerequisites#
Prerequisites Overview#
Refer to Supported Platforms
256 GB or more of system memory
500 GB of storage
Ubuntu 22.04
NVIDIA driver 580.65.06 (Recommended minimum version)
CUDA 13.0+ (CUDA driver installed with NVIDIA driver)
Kubernetes v1.31.2
NVIDIA GPU Operator v23.9 (Recommended minimum version)
Helm v3.x
NGC API Key
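The memory, storage, and OS items above can be spot-checked from a shell before anything is installed. The following is a minimal sketch, assuming a Linux host with `/proc/meminfo`, `/etc/os-release`, and a POSIX `df`. The helper name `mem_gib` and the choice of reporting free space on `/` are illustrative assumptions, not part of any VSS tooling.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight report. The helper name and the "/" mount point
# are assumptions made for illustration.

# Convert the MemTotal line of /proc/meminfo (read from stdin) to whole GiB.
mem_gib() {
    awk '/^MemTotal:/ {printf "%d\n", $2 / 1024 / 1024}'
}

# System memory. MemTotal is usually slightly below the nominal installed size.
if [ -r /proc/meminfo ]; then
    echo "Memory: $(mem_gib < /proc/meminfo) GiB (this page lists 256 GB or more)"
fi

# Free space on the root filesystem. Adjust the path to your storage location.
df -hP / | awk 'NR == 2 {print "Free space on /: " $4 " (this page lists 500 GB of storage)"}'

# OS release, read in a subshell so the variables do not leak into this shell.
if [ -r /etc/os-release ]; then
    ( . /etc/os-release; echo "OS: $NAME $VERSION_ID (this page lists Ubuntu 22.04)" )
fi
```

The script only reports values and leaves the comparison to the reader, because the kernel reserves part of the installed memory and a strict numeric threshold could reject a machine that meets the requirement.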
Install the NVIDIA Driver#
Download and install the NVIDIA driver for your GPU type:
wget https://us.download.nvidia.com/tesla/580.65.06/NVIDIA-Linux-x86_64-580.65.06.run
chmod 755 NVIDIA-Linux-x86_64-580.65.06.run
sudo ./NVIDIA-Linux-x86_64-580.65.06.run --no-cc-version-check
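After the installer finishes, the loaded driver version can be compared against the recommended minimum. The sketch below assumes `nvidia-smi` is on the `PATH` and that GNU `sort -V` is available. The helper `version_ge` and the variable `MIN_DRIVER` are hypothetical names introduced for this example.

```shell
#!/usr/bin/env bash
# Hypothetical driver check. version_ge and MIN_DRIVER are illustrative names.

MIN_DRIVER="580.65.06"

# Return success if version $1 is greater than or equal to version $2.
# Relies on GNU sort -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Query the driver version of the first GPU reported by nvidia-smi.
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if version_ge "$driver" "$MIN_DRIVER"; then
        echo "Driver $driver meets the recommended minimum $MIN_DRIVER."
    else
        echo "Driver '$driver' is below the recommended minimum $MIN_DRIVER."
    fi
else
    echo "nvidia-smi not found; the driver may not be installed."
fi
```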
Install the NVIDIA Fabric Manager#
Some systems require the NVIDIA Fabric Manager to be installed. For more information, see: When do I need NVIDIA Fabric Manager in the deployment nodes?
Note
There are two ways to handle NVIDIA Fabric Manager:
GPU Operator + driver container: the Fabric Manager process starts automatically in the driver container.
To use this method and avoid manual Fabric Manager installation, follow: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#microk8s
GPU Operator without driver container: you need to manually install NVIDIA Fabric Manager.
This is the method VSS documents and recommends in this section.
Requirements when manually installing NVIDIA Fabric Manager:
Must match the exact driver version (for example, the 580.65.06 driver requires Fabric Manager 580.65.06)
For VSS deployment on clean machines, NVIDIA Fabric Manager, where required, can be installed using the following steps:
Download the NVIDIA Fabric Manager Debian package for the required driver version:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager_580.65.06-1_amd64.deb
Install the Debian package using:
sudo apt-get install ./nvidia-fabricmanager_580.65.06-1_amd64.deb
Start the NVIDIA Fabric Manager service:
sudo systemctl start nvidia-fabricmanager
Verify the NVIDIA Fabric Manager service is running:
sudo systemctl status nvidia-fabricmanager.service
To enable the NVIDIA Fabric Manager service to start automatically at boot, run the following command:
sudo systemctl enable nvidia-fabricmanager.service
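Because Fabric Manager must match the driver version exactly, it can help to compare the two after installation. The following is a sketch under stated assumptions: the package name `nvidia-fabricmanager` is inferred from the `.deb` filename used above, the Debian revision suffix (for example `-1`) is stripped before comparing, and the helpers `strip_deb_revision` and `versions_match` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical version-match check. Helper names are illustrative, and the
# package name is an assumption inferred from the .deb filename.

FM_PACKAGE="nvidia-fabricmanager"

# Strip the Debian revision (for example "-1") from a package version string.
strip_deb_revision() {
    printf '%s\n' "${1%-*}"
}

# Return success if driver version $1 equals package version $2 without its revision.
versions_match() {
    [ "$(strip_deb_revision "$2")" = "$1" ]
}

if command -v nvidia-smi >/dev/null 2>&1 && command -v dpkg-query >/dev/null 2>&1; then
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    fm_version=$(dpkg-query -W -f='${Version}' "$FM_PACKAGE" 2>/dev/null)
    if versions_match "$driver" "$fm_version"; then
        echo "Fabric Manager $fm_version matches driver $driver."
    else
        echo "Mismatch: driver '$driver', Fabric Manager package '$fm_version'."
    fi
    # Report whether the service is currently running.
    if systemctl is-active --quiet nvidia-fabricmanager; then
        echo "nvidia-fabricmanager service is active."
    else
        echo "nvidia-fabricmanager service is not active."
    fi
else
    echo "nvidia-smi or dpkg-query not found; skipping Fabric Manager check."
fi
```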
Helm Prerequisites#
Installing a Kubernetes Cluster#
Use the following commands to install a microk8s cluster on an Ubuntu 22.04 single node:
# Install microk8s
sudo snap install microk8s --classic
# Enable nvidia and hostpath-storage add-ons
sudo microk8s enable nvidia
sudo microk8s enable hostpath-storage
# Install kubectl
sudo snap install kubectl --classic
# Verify microk8s is installed correctly
sudo microk8s kubectl get pod -A
Note
If you observe errors from the cuda-validator pod, try forcing the NVIDIA GPU Operator to use the system driver:
sudo microk8s enable nvidia force-system-driver
For information on joining the group for admin access (to avoid using sudo) and other microk8s setup and usage details, check: https://microk8s.io/docs/getting-started.
Ensure sudo microk8s kubectl get pod -A shows all pods in Running or Completed status. This can take some time.
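Rather than re-running the command by hand, the pod listing can be polled until every pod reports Running or Completed. This is a minimal sketch: the helper `count_pending`, the 15-second interval, and the limit of 40 attempts are assumptions chosen for illustration. It checks only the STATUS column, which is the criterion stated above, and does not inspect the READY column.

```shell
#!/usr/bin/env bash
# Hypothetical readiness poll. count_pending and the retry limits are assumptions.

# Count pods whose STATUS column (4th field of `kubectl get pod -A --no-headers`)
# is neither Running nor Completed. Reads the listing from stdin.
count_pending() {
    awk '$4 != "Running" && $4 != "Completed" {n++} END {print n + 0}'
}

if command -v microk8s >/dev/null 2>&1; then
    attempt=0
    while [ "$attempt" -lt 40 ]; do
        listing=$(sudo microk8s kubectl get pod -A --no-headers 2>/dev/null)
        pending=$(printf '%s\n' "$listing" | count_pending)
        # Require a non-empty listing so a failed query is not read as success.
        if [ -n "$listing" ] && [ "$pending" -eq 0 ]; then
            echo "All pods are Running or Completed."
            break
        fi
        echo "$pending pod(s) not ready yet; retrying in 15 seconds..."
        attempt=$((attempt + 1))
        sleep 15
    done
else
    echo "microk8s not found; skipping readiness poll."
fi
```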
Multi Node Setup (Optional)#
Multi-node deployments can be used when more resources (for example, GPUs) are required than are available on a single node. For more information, refer to Multi-Node Deployment.
For a multi-node setup:
Run the following command on the control plane node:
sudo microk8s add-node
Run the following commands on the other worker nodes. Use the join string from the above command when joining the cluster.
# Install microk8s
sudo snap install microk8s --classic
# Enable nvidia and hostpath-storage add-ons
sudo microk8s enable nvidia
sudo microk8s enable hostpath-storage
sudo microk8s join <JOIN_STRING> # This can take a few seconds to complete.
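Once the workers have joined, the node list on the control plane node can confirm that each one reports Ready. The sketch below is illustrative only: the helper `count_ready_nodes` and the `EXPECTED_NODES` variable (defaulting to 2) are hypothetical names, and the check treats only an exact `Ready` status as ready.

```shell
#!/usr/bin/env bash
# Hypothetical node check. count_ready_nodes and EXPECTED_NODES are illustrative names.

EXPECTED_NODES="${EXPECTED_NODES:-2}"

# Count nodes whose STATUS column (2nd field of `kubectl get nodes --no-headers`)
# is exactly "Ready". Reads the listing from stdin.
count_ready_nodes() {
    awk '$2 == "Ready" {n++} END {print n + 0}'
}

if command -v microk8s >/dev/null 2>&1; then
    ready=$(sudo microk8s kubectl get nodes --no-headers 2>/dev/null | count_ready_nodes)
    if [ "$ready" -ge "$EXPECTED_NODES" ]; then
        echo "$ready node(s) Ready (expected $EXPECTED_NODES)."
    else
        echo "Only $ready node(s) Ready; expected $EXPECTED_NODES."
    fi
else
    echo "microk8s not found; skipping node check."
fi
```

Set `EXPECTED_NODES` to the total number of nodes in the cluster, including the control plane node, before running the check.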