Set Up the Prerequisites#

Prerequisites#

  • Refer to Supported Platforms

  • 256+ GB system memory

  • 500 GB of storage

  • Ubuntu 22.04

  • NVIDIA driver 535.183.06 (Recommended minimum version), or NVIDIA driver 570.86.15 for H200 GPUs

  • CUDA 12.2+ (CUDA driver installed with NVIDIA driver)

  • Kubernetes v1.31.2

  • NVIDIA GPU Operator v23.9 (Recommended minimum version)

  • Helm v3.x

  • NGC API Key
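
Several of these prerequisites can be sanity-checked from a shell before continuing (a minimal sketch; the driver, Kubernetes, and GPU Operator items are installed in the steps below):

# System memory (expect 256 GB or more)
free -g | awk '/^Mem:/ {print $2 " GB total"}'
# Available storage (expect 500 GB or more on the target filesystem)
df -h /
# Ubuntu release (expect 22.04)
lsb_release -rs
# Helm version (expect v3.x)
helm version --short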

Install the NVIDIA Driver#

  1. Download and install the NVIDIA driver 535.183.06 from the NVIDIA Unix Drivers page at:

    https://www.nvidia.com/en-us/drivers/details/228697/

  2. Run the following commands:

    # Make the installer executable, then run it, skipping the compiler version check
    chmod 755 NVIDIA-Linux-x86_64-535.183.06.run
    sudo ./NVIDIA-Linux-x86_64-535.183.06.run --no-cc-version-check
    

Note

If you are using an H200 GPU, install the NVIDIA driver 570.86.15 from https://www.nvidia.com/en-us/drivers/details/239776/.
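
After the installer finishes, you can confirm the driver is loaded and reports the expected version:

# Lists the GPUs along with the installed driver and CUDA versions
nvidia-smi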

Install the NVIDIA Fabric Manager#

Note

There are two ways to handle NVIDIA Fabric Manager:

  1. GPU Operator + driver container: starts the Fabric Manager process automatically in the driver container.

To use this method and avoid manual Fabric Manager installation, follow the documentation here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#microk8s

  2. GPU Operator without driver container: the user must manually install NVIDIA Fabric Manager.

This is the method VSS documents and recommends in this section.

Requirements when manually installing NVIDIA Fabric Manager:

  • The Fabric Manager version must exactly match the installed driver version (e.g., driver 535.216.03 requires Fabric Manager 535.216.03).

Only some systems require NVIDIA Fabric Manager to be installed. For more information, see: When do I need NVIDIA Fabric Manager in the deployment node(s)?.
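
As a quick way to determine which Fabric Manager version is needed, you can check the installed driver version (a minimal sketch for Ubuntu; package names may differ on other distributions):

# Report the installed NVIDIA driver version; the Fabric Manager package must match it exactly
cat /proc/driver/nvidia/version

# If a Fabric Manager package is already installed, confirm its version matches the driver
dpkg -l | grep nvidia-fabricmanager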

For VSS deployment on clean machines, NVIDIA Fabric Manager, where required, can be installed using the following steps:

  1. Download the NVIDIA Fabric Manager debian for the required driver version:

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-535_535.183.06-1_amd64.deb
    
    # OR
    
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-570_570.86.15-1_amd64.deb
    
  2. Install the debian file using:

    sudo apt-get install ./nvidia-fabricmanager-535_535.183.06-1_amd64.deb
    
    # OR
    
    sudo apt-get install ./nvidia-fabricmanager-570_570.86.15-1_amd64.deb
    
  3. Start the NVIDIA Fabric Manager service:

    sudo systemctl start nvidia-fabricmanager
    
  4. Verify the NVIDIA Fabric Manager service is running:

    sudo systemctl status nvidia-fabricmanager.service
    
  5. To enable the NVIDIA Fabric Manager service to start automatically at boot, run the following command:

    sudo systemctl enable nvidia-fabricmanager.service
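
On NVSwitch-based systems, you can optionally confirm that fabric initialization completed. This is a supplementary check; the exact fields shown depend on the driver version and platform:

# After Fabric Manager finishes initialization, nvidia-smi should report the fabric state as completed
nvidia-smi -q -i 0 | grep -i -A 3 "fabric"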
    

Obtain NGC API Key#

Log in to https://ngc.nvidia.com, go to the NGC API Keys page, and generate a “Legacy API Key”.


Note

Please make sure to generate a “Legacy API Key” at the bottom of the API Keys page.

More information can be found in the NGC User Guide.
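
To quickly verify the key, one option is to log in to the NGC container registry (nvcr.io) with Docker. This sketch assumes Docker is installed and that the key is exported in an environment variable named NGC_API_KEY (a name chosen here for illustration):

# The username is the literal string "$oauthtoken"; the password is the NGC API key itself
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin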

Once the above prerequisites have been met, either:

  1. Complete the Helm prerequisites below, or

  2. Continue to deploy the VSS Blueprint using Docker Compose.

Helm Prerequisites#

Installing a Kubernetes Cluster#

Use the following commands to install a microk8s cluster on a single Ubuntu 22.04 node:

# Install microk8s
sudo snap install microk8s --classic
# Enable nvidia and hostpath-storage add-ons
sudo microk8s enable nvidia
sudo microk8s enable hostpath-storage
# Install kubectl
sudo snap install kubectl --classic
# Verify microk8s is installed correctly
sudo microk8s kubectl get pod -A

Note

If you see errors from the cuda-validator pod, try forcing the NVIDIA GPU Operator to use the system driver:

sudo microk8s enable nvidia force-system-driver

Note

To join the microk8s group for admin access (so you can avoid using sudo), and for other information about microk8s setup and usage, see: https://microk8s.io/docs/getting-started.

Make sure sudo microk8s kubectl get pod -A shows all pods in Running or Completed status. This may take some time.
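
One convenient way to monitor the pods until they settle, for example:

# Refresh the cluster-wide pod list every 5 seconds; press Ctrl+C once all pods are Running or Completed
watch -n 5 sudo microk8s kubectl get pod -A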

Multi-Node Setup (Optional)#

Multi-node deployments can be used in cases where more resources (e.g., GPUs) are required than are available on a single node. For more information, refer to Multi-Node Deployment.

For a multi-node setup, run the following command on the control plane node:

sudo microk8s add-node

Run the following commands on each of the other worker nodes, using the join string printed by the command above:

# Install microk8s
sudo snap install microk8s --classic
# Enable nvidia and hostpath-storage add-ons
sudo microk8s enable nvidia
sudo microk8s enable hostpath-storage
sudo microk8s join <JOIN_STRING>  # This may take a few seconds to complete.
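
Once the workers have joined, you can confirm cluster membership from the control plane node:

# Each node should eventually report a STATUS of "Ready"
sudo microk8s kubectl get nodes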

Next, deploy the VSS Blueprint using Helm.