Prerequisites#

Before you begin using the OpenFold3 NIM, ensure the following requirements described on this page are met.

The installation and setup workflows work with the following system architectures:

  • Ubuntu 22.04 / 24.04 with amd64 (x86_64)

  • Ubuntu 24.04 with arm64 (aarch64)

  • Systems without NVSwitch. For systems with NVSwitch, you may need fabricmanager; to install it, refer to Installing the GPU Driver.

NGC (NVIDIA GPU Cloud) Account#

  1. Create an account on NGC.

  2. Generate an API key.

  3. Log in to the NVIDIA Container Registry, using your NGC API key as the password:

  • NVIDIA Docker images will be used later to verify the NVIDIA Driver, CUDA, Docker, and NVIDIA Container Toolkit stack

docker login nvcr.io --username='$oauthtoken'

NGC CLI Tool#

  1. Download the NGC CLI tool for your OS.

    Important: Use NGC CLI version 3.41.1 or newer. The following commands install version 3.41.3 on AMD64 Linux in your home directory:

    wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.41.3/files/ngccli_linux.zip -O ~/ngccli_linux.zip && \
    unzip ~/ngccli_linux.zip -d ~/ngc && \
    chmod u+x ~/ngc/ngc-cli/ngc && \
    echo "export PATH=\"\$PATH:~/ngc/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
    
  2. Set up your NGC CLI Tool locally (You’ll need your API key for this!):

    ngc config set
    

    Note: After you enter your API key, you may see multiple options for the org and team. Select as desired, or press Enter to accept the default.
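Once configured, you can confirm that the installed CLI meets the 3.41.1 minimum stated above. This is a sketch: `version_ge` is a local helper using version-aware sort, not part of the NGC CLI.

```shell
# Sketch: confirm the installed NGC CLI meets the 3.41.1 minimum stated above.
version_ge() {
    # returns 0 (true) when $1 >= $2, comparing as version numbers
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

if command -v ngc >/dev/null 2>&1; then
    ver=$(ngc --version | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
    if version_ge "$ver" "3.41.1"; then
        echo "NGC CLI ${ver} is recent enough"
    else
        echo "NGC CLI ${ver} is older than 3.41.1; reinstall" >&2
    fi
else
    echo "ngc is not on PATH; open a new shell or re-run the install step" >&2
fi
```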

Set up your NIM cache#

The NIM needs a directory on your system called the NIM cache, where it can

  • Download the model artifact (checkpoints and TRT engines)

  • Read the model artifact if it has been previously downloaded

The NIM cache directory must:

  • Reside on a disk with at least 15 GB of free storage

  • Have permissions that allow the NIM to read, write, and execute

If your home directory (~) is on a disk with enough storage, you can set up the NIM cache directory as follows.

## Create the NIM cache directory in a location with sufficient storage
mkdir -p ~/.cache/nim

## Set the NIM cache directory permissions to allow all (a) users to read, write, and execute (rwx)
sudo chmod -R a+rwx ~/.cache/nim
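The two commands above can be combined with a guard on the 15 GB requirement. This is a sketch: `NIM_CACHE` and `REQUIRED_GB` are convenience variables introduced here, not names the NIM reads.

```shell
# Sketch: create the NIM cache and warn if the backing disk has < 15 GB free.
NIM_CACHE="${NIM_CACHE:-$HOME/.cache/nim}"
REQUIRED_GB=15

mkdir -p "$NIM_CACHE"
# No sudo needed inside your own home directory
chmod -R a+rwx "$NIM_CACHE"

# df -Pk prints available space in 1 KiB blocks in column 4 (POSIX output format)
avail_kb=$(df -Pk "$NIM_CACHE" | awk 'NR==2 {print $4}')
avail_gb=$((avail_kb / 1024 / 1024))

if [ "$avail_gb" -lt "$REQUIRED_GB" ]; then
    echo "WARNING: only ${avail_gb} GB free at ${NIM_CACHE}; need at least ${REQUIRED_GB} GB" >&2
else
    echo "OK: ${avail_gb} GB free at ${NIM_CACHE}"
fi
```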

Now you should be able to pull the NIM container; refer to Getting Started. You won’t be able to run the NIM until you have installed the NVIDIA Driver, CUDA, Docker, and the NVIDIA Container Toolkit.

Installing the NVIDIA Driver, CUDA, Docker, and NVIDIA Container Toolkit Stack#

Collect System Information#

Before installation, collect your system information to determine the appropriate installation path.

  1. Determine the OS version:

# Check OS version
cat /etc/os-release
# Example output for Ubuntu:
# NAME="Ubuntu"
# VERSION="24.04.3 LTS (Noble Numbat)"
# ID=ubuntu
# VERSION_ID="24.04"

# Set OS version as environment variable for use in subsequent commands
export OS_VERSION=$( . /etc/os-release && echo "$VERSION_ID" | tr -d '.' )
echo "OS Version: $OS_VERSION"
# Example output for Ubuntu 24.04:
# OS Version: 2404

  2. Determine the GPU model:

# Check GPU model
nvidia-smi | grep -i "NVIDIA" | awk '{print $3, $4}'
# Example output:
# 590.44.01 Driver
# H100 PCIe

If you see a message like Command 'nvidia-smi' not found, then attempt to determine GPU model with the command below:

# Check GPU model
lspci | grep -i "3D controller"
# Example output:
# 01:00.0 3D controller: NVIDIA Corporation GH100 [H100 SXM5 80GB] (rev a1)

  3. Determine the CPU architecture:

# Set CPU arch as environment variable, on Ubuntu/Debian system
export CPU_ARCH=$(dpkg --print-architecture)
echo "CPU_ARCH: ${CPU_ARCH}"
# Example output:
# amd64

# Set CPU arch as environment variable, on a non-Ubuntu/Debian system
export CPU_ARCH=$(uname -m)
echo "CPU_ARCH: ${CPU_ARCH}"
# Example output:
# x86_64
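The two cases above can be folded into one snippet that also normalizes kernel-style names to Debian-style arch names, so the same CPU_ARCH value works in the download URLs later. This is a sketch of that idea, not an NVIDIA-provided command:

```shell
# Sketch: prefer dpkg where it exists, fall back to uname -m, and normalize
# x86_64 -> amd64 / aarch64 -> arm64 so one variable works everywhere.
if command -v dpkg >/dev/null 2>&1; then
    CPU_ARCH=$(dpkg --print-architecture)
else
    case "$(uname -m)" in
        x86_64)  CPU_ARCH=amd64 ;;
        aarch64) CPU_ARCH=arm64 ;;
        *)       CPU_ARCH=$(uname -m) ;;
    esac
fi
export CPU_ARCH
echo "CPU_ARCH: ${CPU_ARCH}"
```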

Installation Instructions by Architecture#

Select the appropriate section based on your CPU architecture identified in the previous step:

Installation for amd64 / x86_64 Systems#

For systems with an amd64 / x86_64 CPU architecture (H100, H200, A100, L40S, B200):

1. Find and Download Driver Package#

a. On your local machine (with a browser), visit the NVIDIA Drivers download page and locate the fields in the ‘Manual Driver Search’ dialog box.

b. Enter your system information:

For H100, H200, A100, L40S:

  • Product Category: Data Center / Tesla

  • Product Series: H-Series, A-Series, or L-Series

  • Product: H100, H200, A100

  • OS: Linux 64-bit Ubuntu 24.04

  • CUDA Toolkit Version: 13.1

  • Language: English (US)

For B200:

  • Product Category: Data Center / Tesla

  • Product Series: HGX-Series

  • Product: HGX B200

  • OS: Linux 64-bit Ubuntu 24.04

  • CUDA Toolkit Version: 13.1

  • Language: English (US)

c. Click Find to locate driver version 590.44.01 or higher.

d. On the results page, click View

e. On the next page, right-click the Download button and select Copy Link Address

Note: Some distributions like Ubuntu, Debian, or RHEL have distribution-specific packages (.deb, .rpm). For other distributions, use the .run installer.

2. Direct Driver URLs#

For Ubuntu 24.04 (Noble):

# Driver 590.44.01 for H100/H200/B200/A100/L40S on x86_64 system
https://us.download.nvidia.com/tesla/590.44.01/nvidia-driver-local-repo-ubuntu2404-590.44.01_1.0-1_amd64.deb

For Ubuntu 22.04 (Jammy):

# Driver 590.44.01 for H100/H200/B200/A100/L40S on x86_64 system
https://us.download.nvidia.com/tesla/590.44.01/nvidia-driver-local-repo-ubuntu2204-590.44.01_1.0-1_amd64.deb

For RHEL 8/Rocky Linux 8:

# Driver 590.44.01 for H100/H200/B200/A100/L40S
https://us.download.nvidia.com/tesla/590.44.01/nvidia-driver-local-repo-rhel8-590.44.01-1.0-1.x86_64.rpm

Important: Always check the NVIDIA Driver Downloads page for the latest driver version compatible with your GPU and OS.

3. Download the Driver#

# Download driver using OS_VERSION environment variable
# For Ubuntu (automatically uses correct version: 2204, 2404, etc.)
wget https://us.download.nvidia.com/tesla/590.44.01/nvidia-driver-local-repo-ubuntu${OS_VERSION}-590.44.01_1.0-1_${CPU_ARCH}.deb
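The URL above can be assembled from the variables collected earlier and checked before downloading. A sketch: `DRIVER_VERSION` and `DRIVER_URL` are convenience variables introduced here, not NVIDIA-defined names.

```shell
# Sketch: assemble the driver URL from OS_VERSION and CPU_ARCH (set in the
# "Collect System Information" step) and print it before downloading.
DRIVER_VERSION="590.44.01"
DRIVER_URL="https://us.download.nvidia.com/tesla/${DRIVER_VERSION}/nvidia-driver-local-repo-ubuntu${OS_VERSION}-${DRIVER_VERSION}_1.0-1_${CPU_ARCH}.deb"
echo "Driver URL: ${DRIVER_URL}"

# Then download it:
# wget "${DRIVER_URL}"
```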

4. Install the Local Repository#

For Ubuntu/Debian:

sudo dpkg -i nvidia-driver-local-repo-ubuntu${OS_VERSION}-590.44.01_1.0-1_${CPU_ARCH}.deb

For RHEL/CentOS/Rocky Linux:

sudo rpm -i nvidia-driver-local-repo-rhel8-590.44.01-1.0-1.${CPU_ARCH}.rpm

5. Update Package Lists and Install Driver#

For Ubuntu/Debian:

# Copy the GPG key
sudo cp /var/nvidia-driver-local-repo-ubuntu${OS_VERSION}-590.44.01/nvidia-driver-local-*-keyring.gpg /usr/share/keyrings/

# Update package cache
sudo apt-get update

# Install the driver
sudo apt-get install -y cuda-drivers

For RHEL/CentOS/Rocky Linux:

# Update package cache
sudo dnf clean all
sudo dnf makecache

# Install the driver
sudo dnf install -y cuda-drivers

6. Reboot System#

sudo reboot

7. Verify Driver Installation#

After reboot, verify the driver:

nvidia-smi

Expected output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 590.44.01    Driver Version: 590.44.01    CUDA Version: 13.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA H100 PCIe    Off  | 00001E:00:00.0   Off |                    0 |
| N/A   30C    P0    68W / 350W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

8. Install Docker#

Verify Docker is installed with version >=23.0.1:

docker --version
# Example output:
# Docker version 29.1.3, build f52814d

If Docker is not installed or is older than 23.0.1, install or upgrade it by following the official Docker Engine installation instructions for your distribution.
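The version requirement can also be checked mechanically. A sketch: `version_ge` is a local helper using version-aware sort, not a Docker command.

```shell
# Sketch: compare the installed Docker version against the 23.0.1 minimum above.
version_ge() {
    # returns 0 (true) when $1 >= $2, comparing as version numbers
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

if command -v docker >/dev/null 2>&1; then
    current=$(docker --version | sed -E 's/^Docker version ([0-9]+\.[0-9]+\.[0-9]+).*/\1/')
    if version_ge "$current" "23.0.1"; then
        echo "Docker ${current} meets the minimum version"
    else
        echo "Docker ${current} is older than 23.0.1; upgrade it" >&2
    fi
else
    echo "Docker is not installed" >&2
fi
```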

9. Install NVIDIA Container Toolkit#

Verify the NVIDIA Container Toolkit:

nvidia-container-cli --version

If not installed:

  1. Follow Installing the NVIDIA Container Toolkit

  2. Configure Docker: Configuring Docker

10. Verify the Complete Stack#

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Example output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 590.44.01    Driver Version: 590.44.01    CUDA Version: 13.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA H100 ...     Off  | 00000000:01:00.0 Off |                  N/A |
| 41%   30C    P8     1W / 260W |   2244MiB / 81559MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Note: For more information on enumerating multi-GPU systems, refer to the NVIDIA Container Toolkit’s GPU Enumeration Docs

Installation for arm64 / aarch64 DGX Systems#

For arm64 / aarch64 DGX Systems (e.g., DGX GB200 Compute Tray)

Note: These steps follow the NVIDIA DGX OS 7 User Guide: Installing the GPU Driver, customized for DGX GB200 Compute Tray with:

  • 2x Grace CPUs (arm64 / aarch64)

  • 4x Blackwell GPUs

  • Ubuntu 24.04

  • Linux kernel version 6.8.0-1044-nvidia-64k

1. Check NVIDIA Driver State#

Check the running driver version:

nvidia-smi

Example successful output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.44.01              Driver Version: 590.44.01     CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB200                   On  |   00000008:01:00.0 Off |                    0 |
| N/A   29C    P0            130W / 1200W |       0MiB / 189471MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
  • If the running driver version is 590+, skip to Step 9

  • If nvidia-smi fails, proceed to Step 2

2. Confirm OS Sees NVIDIA GPUs#

sudo lshw -class display -json | jq '.[] | select(.description=="3D controller")'

Product-specific information:

sudo lshw -class system -json | jq '.[0]'

3. Verify System Requirements#

Check your Linux distribution, kernel version, and gcc version:

. /etc/os-release && echo "$PRETTY_NAME"   # Linux distribution
uname -r  # Kernel version
gcc --version  # GCC version

Example output:

Ubuntu 24.04.2 LTS
6.8.0-1044-nvidia-64k
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

Verify against Table 3: Supported Linux Distributions.

4. Update Linux Kernel Version (If Needed)#

For GB200 systems, use kernel version 6.8.0-1044-nvidia-64k or 6.8.0-1043-nvidia-64k.

If you have a different kernel version, configure grub:

# Update grub default menu entry
sudo sed --in-place=.bak \
  '/^[[:space:]]*GRUB_DEFAULT=/c\GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 6.8.0-1044-nvidia-64k"' \
  /etc/default/grub

# Verify update
cat /etc/default/grub

# Update grub and reboot
sudo update-grub
sudo reboot
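Before editing grub, you can check whether a change is needed at all. This is a sketch based on the two kernel versions listed above; `kernel_supported` is a local helper, not a DGX tool.

```shell
# Sketch: check whether the running kernel is already a supported GB200 kernel.
supported="6.8.0-1044-nvidia-64k 6.8.0-1043-nvidia-64k"
current=$(uname -r)

kernel_supported() {
    for k in $supported; do
        [ "$1" = "$k" ] && return 0
    done
    return 1
}

if kernel_supported "$current"; then
    echo "Kernel ${current} is supported; no grub change needed"
else
    echo "Kernel ${current} is not in the supported list; update GRUB_DEFAULT and reboot"
fi
```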

5. Remove NVIDIA Libraries to Avoid Conflicts#

Check for existing NVIDIA libraries:

ls /usr/lib/aarch64-linux-gnu/ | grep -i nvidia

If not empty, remove them:

sudo apt remove --autoremove --purge -Vy \
  cuda-compat\* \
  cuda-drivers\*  \
  libnvidia-cfg1\* \
  libnvidia-compute\* \
  libnvidia-decode\* \
  libnvidia-encode\* \
  libnvidia-extra\* \
  libnvidia-fbc1\* \
  libnvidia-gl\* \
  libnvidia-gpucomp\* \
  libnvidia-nscq\* \
  libnvsdm\* \
  libxnvctrl\* \
  nvidia-dkms\* \
  nvidia-driver\* \
  nvidia-fabricmanager\* \
  nvidia-firmware\* \
  nvidia-headless\* \
  nvidia-imex\* \
  nvidia-kernel\* \
  nvidia-modprobe\* \
  nvidia-open\* \
  nvidia-persistenced\* \
  nvidia-settings\* \
  nvidia-xconfig\* \
  xserver-xorg-video-nvidia\*

6. Download Package Repositories and Install DGX Tools#

Follow Installing DGX System Configurations and Tools:

a. Download and unpack ARM64-specific packages:

curl https://repo.download.nvidia.com/baseos/ubuntu/noble/arm64/dgx-repo-files.tgz | sudo tar xzf - -C /

b. Update local APT database:

sudo apt update

c. Install DGX system tools:

sudo apt install -y nvidia-system-core
sudo apt install -y nvidia-system-utils
sudo apt install -y nvidia-system-extra

d. Install linux-tools for your kernel:

sudo apt install -y linux-tools-nvidia-64k

e. Install NVIDIA peermem loader:

sudo apt install -y nvidia-peermem-loader

7. Install GPU Driver#

Follow Installing the GPU Driver:

a. Pin the driver version:

sudo apt install nvidia-driver-pinning-590

b. Install the open GPU kernel module:

sudo apt install --allow-downgrades \
  nvidia-driver-590-open \
  libnvidia-nscq \
  nvidia-modprobe \
  nvidia-imex \
  datacenter-gpu-manager-4-cuda13 \
  nv-persistence-mode

c. Enable the persistence daemon:

sudo systemctl enable nvidia-persistenced nvidia-dcgm nvidia-imex

d. Reboot:

sudo reboot

8. Verify Driver Installation#

After reboot, repeat Step 1 to check the NVIDIA driver state.

9. Install Docker and NVIDIA Container Toolkit#

Follow Installing Docker and the NVIDIA Container Toolkit.

Verify the stack:

sudo docker run --rm --gpus=all nvcr.io/nvidia/cuda:12.6.2-base-ubuntu24.04 nvidia-smi

10. Enable Docker for Non-Root User (Optional)#

See Manage Docker as non-root user.

11. Verify Complete Stack#

a. Log into NGC:

docker login nvcr.io --username '$oauthtoken'

b. Run verification:

sudo docker run --rm --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  nvcr.io/nvidia/pytorch:25.12-py3 \
  python -c \
"import torch, pynvml;
pynvml.nvmlInit();
print('Driver:', pynvml.nvmlSystemGetDriverVersion());
print('CUDA:', torch.version.cuda);
print('GPU count:', torch.cuda.device_count())"

Expected output:

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 13.1 driver version 590.44.01 with kernel driver version 590.44.01.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

Driver: 590.44.01
CUDA: 13.1
GPU count: 4

Running on Slurm With Enroot#

This section describes how to run the OpenFold3 NIM on a Slurm-based HPC cluster using Enroot to run the NGC container.

Environment#

  • Enroot: Version 3.4.1 or newer is supported.

1. Check Enroot is Available#

On the login node, verify Enroot is installed:

enroot version
which enroot

If Enroot is not installed, contact your cluster administrator. Typical path: /usr/bin/enroot.

Create the Enroot config directory if it does not exist:

mkdir -p ~/.config/enroot

2. Create and Add NGC Credentials#

Create the credentials file so Enroot can pull images from NGC. Replace YOUR_NGC_API_KEY with your NGC API key:

echo 'machine nvcr.io login $oauthtoken password YOUR_NGC_API_KEY' > ~/.config/enroot/.credentials

Restrict permissions and confirm the file content:

chmod 600 ~/.config/enroot/.credentials
cat ~/.config/enroot/.credentials

Note: Use vi ~/.config/enroot/.credentials or another editor if you prefer not to echo the key on the command line.
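A malformed credentials line is a common cause of failed pulls, so it can be worth checking the format. This is a sketch; `CRED_FILE` is a convenience variable introduced here.

```shell
# Sketch: sanity-check the netrc-style credentials line Enroot expects
# (machine / login / password fields, with the literal login $oauthtoken).
CRED_FILE="${CRED_FILE:-$HOME/.config/enroot/.credentials}"

if [ -f "$CRED_FILE" ] && \
   grep -Eq '^machine nvcr\.io login \$oauthtoken password .+' "$CRED_FILE"; then
    echo "credentials file looks well-formed"
else
    echo "credentials file missing or malformed: ${CRED_FILE}" >&2
fi
```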

3. Configure and Run the Slurm Job Script#

Copy the script below to a file, such as run_openfold3_slurm.sh. Set the variables at the top for your cluster and paths:

  • ACCOUNT: Slurm account name

  • PARTITION: Slurm partition (for example, interactive or gpu)

  • ROOT_DIR: Your working directory on the cluster (for example, Lustre or NFS path)

  • NGC_IMAGE: NGC image with tag; use # between registry and image path for Enroot (refer to the example script below)

  • NGC_API_KEY: Export this in your environment before running, or ensure it is available inside the job for enroot import

Use a shared filesystem path for ROOT_DIR and CACHE_PATH so that the SQSH image and cache are visible from the node where the job runs. Use a node-local or large-quota path (for example, Lustre scratch) for TMP_PATH if /tmp has strict quotas.

Save the following as run_openfold3_slurm.sh and run it with bash run_openfold3_slurm.sh to avoid copy-paste or quoting issues:

#!/bin/bash

# --- Slurm job configuration (customize for your cluster) ---
ACCOUNT="your_slurm_account"
PARTITION="interactive"
GPUS_PER_NODE=1
MEMORY="32G"
TIME="04:00:00"

# --- Paths (customize for your cluster) ---
# Root directory on shared storage (e.g. Lustre/NFS)
ROOT_DIR="/path/to/your/workspace"

# NIM cache on shared storage
CACHE_PATH="${ROOT_DIR}/.nim_cache"
WORKING_CACHE_DIR="/opt/nim/.cache"

# Project/data mount inside container
WORKING_PATH="${ROOT_DIR}"
MOUNT_PATH="/workspace/data"

# Temp directory (use shared storage if node /tmp has quota limits)
TMP_PATH="${ROOT_DIR}/tmp"
MOUNT_TMP_PATH="${MOUNT_PATH}/tmp"

# --- NGC image (use # between registry and image path for Enroot) ---
NGC_IMAGE="docker://nvcr.io#nim/openfold/openfold3:1.4.0"

CONTAINER_NAME="openfold3"
ACTIVE_SQSH_PATH="${ROOT_DIR}/openfold3.sqsh"

# --- Helpers ---
handle_error() {
    echo "Error: $1"
    exit 1
}

mkdir -p "$(dirname "${ACTIVE_SQSH_PATH}")" || handle_error "Failed to create directory for SQSH file"
mkdir -p "${CACHE_PATH}" || handle_error "Failed to create cache directory"
mkdir -p "${TMP_PATH}" || handle_error "Failed to create tmp directory"

echo "Step 1: Requesting interactive resources..."
srun --account="${ACCOUNT}" \
     --partition="${PARTITION}" \
     --gpus-per-node="${GPUS_PER_NODE}" \
     --mem="${MEMORY}" \
     --time="${TIME}" \
     --export=ALL \
     -o /dev/tty -e /dev/tty \
     bash -c "
    echo \"Loading environment...\"

    # Step 2: Import Docker image (if not present)
    echo \"Step 2: Importing Docker image\"
    if [ ! -f \"${ACTIVE_SQSH_PATH}\" ]; then
        echo \"Importing from NGC...\"
        mkdir -p \"\$(dirname \"${ACTIVE_SQSH_PATH}\")\" || { echo \"Failed to create directory\"; exit 1; }
        enroot import -o \"${ACTIVE_SQSH_PATH}\" ${NGC_IMAGE} || { echo \"Failed to import Docker image\"; exit 1; }
    else
        echo \"Docker image already exists, skipping import.\"
    fi

    # Step 3: Create Enroot container
    echo \"Step 3: Creating Enroot container...\"
    if enroot list 2>/dev/null | grep -q \"${CONTAINER_NAME}\"; then
        echo \"Removing existing container...\"
        enroot remove -f ${CONTAINER_NAME} || true
    fi
    enroot create --name ${CONTAINER_NAME} \"${ACTIVE_SQSH_PATH}\" || { echo \"Failed to create container\"; exit 1; }

    # Step 4: Start container and NIM server
    NODE=\$(hostname)
    echo \"\$NODE\" > \"${WORKING_PATH}/.openfold3_node\"
    echo \"Step 4: OpenFold3 container ready.\"
    echo \"=====================================\"
    echo \"Working directory in container: ${MOUNT_PATH}\"
    echo \"\"
    echo \"From another terminal (login node or your machine), run:\"
    echo \"  ssh -L 8000:localhost:8000 \$NODE\"
    echo \"Then keep that SSH session open and call the API (see Step 4 in docs).\"
    echo \"Node name saved to: ${WORKING_PATH}/.openfold3_node\"
    echo \"Type exit to leave the container.\"
    echo \"=====================================\"

    cat > \"\${TMPDIR:-/tmp}/rc.local\" << RCEOF
#!/bin/sh
export TMPDIR=${MOUNT_TMP_PATH}
export TEMP=${MOUNT_TMP_PATH}
export TMP=${MOUNT_TMP_PATH}
export HOME=${MOUNT_PATH}
export XDG_CACHE_HOME=${WORKING_CACHE_DIR}
export XDG_DATA_HOME=${MOUNT_PATH}/.local/share
mkdir -p ${MOUNT_TMP_PATH} ${MOUNT_PATH}/.local/share
/opt/nim/start_server.sh &
exec /bin/bash
RCEOF
    chmod +x \"\${TMPDIR:-/tmp}/rc.local\"
    enroot start \\
      --mount ${CACHE_PATH}:${WORKING_CACHE_DIR} \\
      --mount ${WORKING_PATH}:${MOUNT_PATH} \\
      --mount \"\${TMPDIR:-/tmp}/rc.local:/etc/rc.local\" \\
      -e NGC_API_KEY \\
      -e TMPDIR=${MOUNT_TMP_PATH} \\
      -e TEMP=${MOUNT_TMP_PATH} \\
      -e TMP=${MOUNT_TMP_PATH} \\
      -e HOME=${MOUNT_PATH} \\
      -e XDG_CACHE_HOME=${WORKING_CACHE_DIR} \\
      -e XDG_DATA_HOME=${MOUNT_PATH}/.local/share \\
      ${CONTAINER_NAME} || { echo \"Failed to start container\"; exit 1; }

    echo \"=====================================\"
    echo \"Workflow completed.\"
"

Submit the job (interactive):

bash run_openfold3_slurm.sh

Or, for a batch job, wrap the same script body in an sbatch script and set ACCOUNT, PARTITION, and other Slurm directives as needed.
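A minimal batch wrapper might look like the following. This is a sketch with placeholder values: the account, partition, and SCRIPT path are assumptions you must replace, and in a real batch job you would typically also drop the interactive srun wrapper from the script body.

```shell
#!/bin/bash
#SBATCH --account=your_slurm_account    # placeholder: your Slurm account
#SBATCH --partition=gpu                 # placeholder: a GPU-capable partition
#SBATCH --gpus-per-node=1
#SBATCH --mem=32G
#SBATCH --time=04:00:00
#SBATCH --output=openfold3_%j.log

# Placeholder path: point at your copy of the interactive script above.
SCRIPT="/path/to/run_openfold3_slurm.sh"

if [ -f "$SCRIPT" ]; then
    bash "$SCRIPT"
else
    echo "Edit SCRIPT to point at run_openfold3_slurm.sh" >&2
fi
```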

4. Call the API From Your Machine#

After the container is running on a compute node, the script prints the node name, such as batch-block1-2106, and saves it to ROOT_DIR/.openfold3_node.

  1. From your laptop or the login node, create an SSH tunnel to that node:

    ssh -L 8000:localhost:8000 $(cat /path/to/your/workspace/.openfold3_node)
    

    Or use the node name directly if you know it:

    ssh -L 8000:localhost:8000 <compute-node-name>
    
  2. In another terminal (with the tunnel still open), run a test request:

    curl -s -X POST "http://localhost:8000/biology/openfold/openfold3/predict" \
      -H "Content-Type: application/json" \
      -d '{"inputs":[{"input_id":"my_first_prediction","molecules":[{"type":"protein","sequence":"MKTVRQERLKSIVR","msa":{"main":{"a3m":{"alignment":">query\nMKTVRQERLKSIVR","format":"a3m"}}}}],"output_format":"pdb"}]}' \
      --max-time 300
    

If the request succeeds, you will get a JSON response containing the prediction result.
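Quoting mistakes from copy-paste are a common reason this request fails, so it can help to validate the payload locally before sending it. A sketch, using python3 only because it is commonly available; the payload is the same one used in the curl example above.

```shell
# Sketch: confirm the request body parses as JSON before sending it.
PAYLOAD='{"inputs":[{"input_id":"my_first_prediction","molecules":[{"type":"protein","sequence":"MKTVRQERLKSIVR","msa":{"main":{"a3m":{"alignment":">query\nMKTVRQERLKSIVR","format":"a3m"}}}}],"output_format":"pdb"}]}'

if printf '%s' "$PAYLOAD" | python3 -c 'import json,sys; json.load(sys.stdin)' 2>/dev/null; then
    echo "payload is valid JSON"
else
    echo "payload is malformed JSON" >&2
fi
```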

Troubleshooting#

Common Issues#

Driver version mismatch: If nvidia-smi shows an older driver version, ensure you’ve rebooted after installation.

CUDA version mismatch: The driver must support CUDA 13.1 or higher. Check the CUDA version in the nvidia-smi output. If your system shows CUDA 12.x or CUDA 13.0, you need to install driver 590.44.01 or higher.

To verify CUDA compatibility:

  1. Check current driver: nvidia-smi

  2. Verify CUDA version shows 13.1 or higher

  3. If not, refer to NVIDIA CUDA Compatibility
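The check above can be scripted rather than eyeballed. A sketch: `cuda_ok` is a local helper using version-aware sort, not an NVIDIA tool.

```shell
# Sketch: numeric check that the CUDA version nvidia-smi reports meets 13.1.
cuda_ok() {
    # returns 0 (true) when $1 >= 13.1
    [ "$(printf '%s\n13.1\n' "$1" | sort -V | head -n1)" = "13.1" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
    ver=$(nvidia-smi | grep -Eo 'CUDA Version: [0-9]+\.[0-9]+' | grep -Eo '[0-9]+\.[0-9]+' | head -n1)
    if cuda_ok "$ver"; then
        echo "CUDA ${ver} is sufficient"
    else
        echo "CUDA ${ver} is too old; install driver 590.44.01 or higher" >&2
    fi
fi
```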

Secure Boot (amd64 systems): If you have Secure Boot enabled, you may need to sign the NVIDIA kernel modules or disable Secure Boot in your BIOS.

Library version conflicts: If you encounter library version conflicts, ensure all old NVIDIA packages are removed before installing the new driver.

Architecture-Specific Troubleshooting#

For amd64 / x86_64 Systems:

Previous driver versions:

# Remove old drivers
sudo apt-get remove --purge nvidia-*
sudo apt-get autoremove

# Verify removal
ls /usr/lib/x86_64-linux-gnu/ | grep -i nvidia

Package conflicts:

# Clean package cache
sudo apt-get clean
sudo apt-get update

# Try installation again
sudo apt-get install -y cuda-drivers

For arm64 / aarch64 DGX Systems:

Kernel version issues:

# Check current kernel
uname -r

# List available kernels
dpkg --list | grep linux-image

# Configure grub to use correct kernel (see Step 4 in installation)

DGX-specific issues:

# Check DGX system status
sudo nvidia-bug-report.sh

# Verify fabricmanager (if using NVSwitch)
systemctl status nvidia-fabricmanager

# Check NVIDIA services
systemctl status nvidia-persistenced
systemctl status nvidia-dcgm

Build errors for older kernel:

  • Ignore build errors for modules built for 6.14.0-1015-nvidia-64k

  • These errors are expected and do not affect functionality

Getting Additional Help#

If you continue to experience issues:

  1. Check NVIDIA driver logs: dmesg | grep -i nvidia

  2. Review Docker logs: sudo journalctl -u docker.service

  3. Consult NVIDIA Driver Installation Guide

  4. For DGX systems: NVIDIA DGX OS 7 User Guide