Prerequisites#

Prerequisites Overview#

  • Refer to Supported Platforms

  • 256 GB or more of system memory

  • 500 GB of storage

  • Ubuntu 22.04

  • NVIDIA driver 580.65.06 (Recommended minimum version)

  • CUDA 13.0+ (CUDA driver installed with NVIDIA driver)

  • Kubernetes v1.31.2

  • NVIDIA GPU Operator v23.9 (Recommended minimum version)

  • Helm v3.x

  • NGC API Key
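The hardware and OS items in the list above can be spot-checked from a shell before anything is installed. The following is a minimal sketch, assuming a Linux host with GNU coreutils; the helper names `version_ge` and `report_prereqs` are hypothetical and not part of VSS.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for spot-checking the prerequisites listed above.

# version_ge A B: succeeds when dotted version A is >= version B.
# Relies on GNU `sort -V` (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# report_prereqs: prints detected values next to the documented minimums.
report_prereqs() {
    local mem_gib disk_gib os_name
    # MemTotal is reported in kB and is usually a little below installed RAM.
    mem_gib=$(awk '/^MemTotal:/ {printf "%d", $2 / 1048576}' /proc/meminfo)
    # Free space on the root filesystem in GiB (assumes data lives under /).
    disk_gib=$(df -BG --output=avail / | tail -n 1 | tr -dc '0-9')
    os_name=$(. /etc/os-release && echo "$PRETTY_NAME")
    echo "memory:  ${mem_gib} GiB (documented minimum: 256 GB)"
    echo "storage: ${disk_gib} GiB free on / (documented minimum: 500 GB)"
    echo "os:      ${os_name} (documented: Ubuntu 22.04)"
}
```

Calling `report_prereqs` prints the detected values for manual comparison. `version_ge` can be reused for any of the version minimums, for example `version_ge "$installed" 580.65.06`, where `$installed` is a placeholder for a version string you have already obtained.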

Install the NVIDIA Driver#

  1. Download and install the NVIDIA driver for your GPU type:

    wget https://us.download.nvidia.com/tesla/580.65.06/NVIDIA-Linux-x86_64-580.65.06.run
    chmod 755 NVIDIA-Linux-x86_64-580.65.06.run
    sudo ./NVIDIA-Linux-x86_64-580.65.06.run --no-cc-version-check
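After the installer finishes, the loaded driver version can be confirmed with `nvidia-smi`. This is a minimal sketch; the function name `driver_version_ok` is hypothetical, and the expected version is the one used in the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical check: every GPU must report the expected driver version.
# Reads one version string per line on stdin and fails on empty input.
driver_version_ok() {
    local expected="$1" line seen=0
    while IFS= read -r line; do
        [ -z "$line" ] && continue
        seen=1
        [ "$line" = "$expected" ] || return 1
    done
    [ "$seen" -eq 1 ]
}

# Example use (requires the NVIDIA driver to be installed):
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader \
#       | driver_version_ok 580.65.06 && echo "driver ok"
```

The function reads from stdin rather than calling `nvidia-smi` itself, so the comparison logic can be exercised on a machine without a GPU.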
    

Install the NVIDIA Fabric Manager#

Some systems require NVIDIA Fabric Manager. For more information, see When do I need NVIDIA Fabric Manager in the deployment nodes?

Note

There are two ways to handle NVIDIA Fabric Manager:

  1. GPU Operator with the driver container: the Fabric Manager process starts automatically inside the driver container.

    To use this method and avoid installing Fabric Manager manually, follow: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#microk8s

  2. GPU Operator without the driver container: you must install NVIDIA Fabric Manager manually.

    This is the method that VSS recommends and that this section documents.

    Requirement when installing NVIDIA Fabric Manager manually:

    • The Fabric Manager version must exactly match the driver version (for example, the 580.65.06 driver requires Fabric Manager 580.65.06).

For VSS deployments on clean machines that require NVIDIA Fabric Manager, install it using the following steps:

  1. Download the NVIDIA Fabric Manager Debian package for the required driver version:

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager_580.65.06-1_amd64.deb
    
  2. Install the Debian package:

    sudo apt-get install ./nvidia-fabricmanager_580.65.06-1_amd64.deb
    
  3. Start the NVIDIA Fabric Manager service:

    sudo systemctl start nvidia-fabricmanager
    
  4. Verify the NVIDIA Fabric Manager service is running:

    sudo systemctl status nvidia-fabricmanager.service
    
  5. To enable the NVIDIA Fabric Manager service to start automatically at boot, run the following command:

    sudo systemctl enable nvidia-fabricmanager.service
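Because the Fabric Manager version must match the driver exactly, the download and install steps above can be parameterized on the installed driver version. The sketch below assumes that other versions follow the same URL and file-name pattern as the 580.65.06 package shown above, which is not confirmed here; `fabricmanager_deb_url` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the Fabric Manager package URL for a driver
# version, ASSUMING the naming pattern of the 580.65.06 package holds.
fabricmanager_deb_url() {
    local version="$1"
    local base="https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64"
    echo "${base}/nvidia-fabricmanager_${version}-1_amd64.deb"
}

# Example use (requires the NVIDIA driver to be installed already):
#   ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   wget "$(fabricmanager_deb_url "$ver")"
#   sudo apt-get install "./nvidia-fabricmanager_${ver}-1_amd64.deb"
```

Deriving the version from `nvidia-smi` removes one place where the driver and Fabric Manager versions could drift apart. If the download fails for a given version, check the repository listing rather than relying on the assumed pattern.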
    

Helm Prerequisites#

Installing a Kubernetes Cluster#

Use the following commands to install a single-node microk8s cluster on Ubuntu 22.04:

# Install microk8s
sudo snap install microk8s --classic
# Enable nvidia and hostpath-storage add-ons
sudo microk8s enable nvidia
sudo microk8s enable hostpath-storage
# Install kubectl
sudo snap install kubectl --classic
# Verify microk8s is installed correctly
sudo microk8s kubectl get pod -A

Note

  • If you observe errors from the cuda-validator pod, try forcing the NVIDIA GPU Operator to use the system driver:

    sudo microk8s enable nvidia force-system-driver
    
  • For information on joining the microk8s group for admin access (which avoids the need for sudo) and on other aspects of microk8s setup and usage, see: https://microk8s.io/docs/getting-started.

    Ensure that sudo microk8s kubectl get pod -A shows all pods with a status of Running or Completed. This can take some time.
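Since the pods can take a while to settle, the status check can be polled instead of re-run by hand. This is a minimal sketch; the function names, the 10-second interval, and the 60-attempt limit are arbitrary assumptions, not values from the VSS documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: counts pods whose STATUS is neither Running nor
# Completed. Expects `kubectl get pod -A --no-headers` output on stdin,
# where STATUS is the fourth column.
count_pending_pods() {
    awk '$4 != "Running" && $4 != "Completed" {n++} END {print n + 0}'
}

# Hypothetical polling loop: re-checks every 10 seconds, up to 60 times.
# An empty or failed listing is treated as "not ready yet".
wait_for_pods() {
    local tries=60 out pending
    while [ "$tries" -gt 0 ]; do
        if out=$(sudo microk8s kubectl get pod -A --no-headers) && [ -n "$out" ]; then
            pending=$(printf '%s\n' "$out" | count_pending_pods)
            [ "$pending" -eq 0 ] && return 0
            echo "waiting: ${pending} pod(s) not yet Running or Completed"
        fi
        sleep 10
        tries=$((tries - 1))
    done
    return 1
}
```

Run `wait_for_pods && echo "cluster ready"` after enabling the add-ons. A non-zero exit means some pods never reached Running or Completed within the limit, and the cuda-validator note above is a reasonable first thing to check.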

Multi Node Setup (Optional)#

Multi-node deployments are useful when a workload needs more resources (for example, GPUs) than a single node provides. For more information, refer to Multi-Node Deployment.

For a multi-node setup:

  1. Run the following command on the control plane node:

    sudo microk8s add-node
    
  2. Run the following commands on each worker node. Use the join string printed by the command above when joining the cluster.

    # Install microk8s
    sudo snap install microk8s --classic
    # Enable nvidia and hostpath-storage add-ons
    sudo microk8s enable nvidia
    sudo microk8s enable hostpath-storage
    sudo microk8s join <JOIN_STRING>  # This can take a few seconds to complete.
    
  3. Deploy the VSS Blueprint using Helm.
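Before deploying, it can help to confirm that every joined node reports Ready. This is a minimal sketch; `count_ready_nodes` is a hypothetical helper, and it counts only nodes whose status is exactly `Ready` (a cordoned node shown as `Ready,SchedulingDisabled` is not counted).

```shell
#!/usr/bin/env bash
# Hypothetical helper: counts nodes whose STATUS column is exactly "Ready".
# Expects `kubectl get nodes --no-headers` output on stdin, where STATUS
# is the second column.
count_ready_nodes() {
    awk '$2 == "Ready" {n++} END {print n + 0}'
}

# Example use on the control plane node:
#   sudo microk8s kubectl get nodes --no-headers | count_ready_nodes
```

Compare the printed count with the number of nodes you expect in the cluster (the control plane node plus each joined worker).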