Troubleshooting CUDA Driver Initialization Failures

At startup, NIM calls torch.cuda.init() to verify that the CUDA driver can initialize. If the check fails, startup aborts with a RuntimeError:

RuntimeError: CUDA driver failed to initialize: <error message>.
This may indicate a driver mismatch — ensure the NVIDIA Container Toolkit
is bind-mounting the host driver into the container.
Common causes: compat shims loaded instead of host driver,
or missing nvidia-container-toolkit configuration.
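
The check described above can be pictured with a minimal sketch. This is not NIM's actual implementation: the function name validate_cuda_driver and the init_fn parameter are hypothetical, and the initializer is passed in as a callable so that the sketch runs on machines without PyTorch or a GPU.

```python
def validate_cuda_driver(init_fn):
    """Run a CUDA initialization callable and re-raise failures with guidance.

    init_fn is a stand-in for torch.cuda.init (an assumption of this sketch);
    any callable that raises on failure works.
    """
    try:
        init_fn()
    except Exception as exc:
        # Wrap the original error so the message carries troubleshooting hints.
        raise RuntimeError(
            f"CUDA driver failed to initialize: {exc}. "
            "This may indicate a driver mismatch."
        ) from exc
```

Inside a container with PyTorch installed, the call would look like validate_cuda_driver(torch.cuda.init).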

Common error messages include:

  • “Error 803: system has unsupported display driver / cuda driver combination”: The container is loading a bundled CUDA compat driver that does not match the host’s kernel driver. This is the most common cause.

  • “Error 802: system not yet initialized”: Either the NVIDIA kernel driver is not loaded, or Fabric Manager (nvidia-fabricmanager) is not running. Fabric Manager is required on NVSwitch systems such as DGX A100/H100.
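
When triaging logs, it can help to map the error code in the message to the likely cause listed above. The following is an illustrative sketch only: the diagnose function and the HINTS table are hypothetical names, and the hint text paraphrases this page rather than any official error catalog.

```python
import re

# Hints paraphrased from the list above; illustrative, not exhaustive.
HINTS = {
    803: "bundled CUDA compat driver does not match the host kernel driver",
    802: "NVIDIA kernel driver not loaded, or Fabric Manager not running",
}


def diagnose(message):
    """Return (code, hint) for a CUDA init error message, or (None, None)."""
    match = re.search(r"Error (\d+)", message)
    if match is None:
        return None, None
    code = int(match.group(1))
    # Unknown codes are returned with a hint of None.
    return code, HINTS.get(code)
```

For example, diagnose("Error 802: system not yet initialized") returns the code 802 together with the Fabric Manager hint.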

Causes

The NIM container image ships CUDA compat libraries at /usr/local/cuda-*/compat/. At runtime, the NVIDIA Container Toolkit should bind-mount the host’s userspace CUDA driver into the container, overriding the bundled compat libraries. If this mechanism fails, the container loads the bundled compat driver, which may not match the host’s kernel driver.

Common reasons the host driver is not mounted:

  • NVIDIA Container Toolkit is not installed or not configured

  • The container runtime is not set to nvidia (--runtime=nvidia or equivalent)

  • CDI mode is enabled but not properly configured

  • The LD_LIBRARY_PATH inside the container prioritizes compat libraries over the host driver
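
The last item can be checked by looking for compat directories in LD_LIBRARY_PATH. The sketch below is a minimal illustration, assuming the compat libraries live under /usr/local/cuda-*/compat as described above; the helper name compat_entries is hypothetical.

```python
import fnmatch

# Pattern taken from the compat location described above.
COMPAT_PATTERN = "/usr/local/cuda-*/compat"


def compat_entries(ld_library_path):
    """Return (index, entry) pairs for entries that match the compat pattern.

    A low index means the dynamic loader searches that directory early.
    """
    entries = [e for e in ld_library_path.split(":") if e]
    return [
        (i, e)
        for i, e in enumerate(entries)
        if fnmatch.fnmatchcase(e.rstrip("/"), COMPAT_PATTERN)
    ]
```

Inside the container, calling compat_entries(os.environ.get("LD_LIBRARY_PATH", "")) after importing os lists any compat directories and their search position.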

Resolution

  1. Verify the NVIDIA Container Toolkit is installed and configured:

    nvidia-ctk --version
    
  2. Ensure the container uses the NVIDIA runtime:

    docker run --runtime=nvidia --gpus all ...
    
  3. Check which libcuda.so is loaded inside the container:

    docker exec "${CONTAINER}" python3 -c \
      "import ctypes; ctypes.CDLL('libcuda.so.1'); \
       print(next(l.split(None,5)[-1].strip().removesuffix(' (deleted)') for l in open('/proc/self/maps') if 'libcuda.so' in l))"
    

    The path should point to the host driver (e.g., /usr/lib/x86_64-linux-gnu/libcuda.so.1, or the versioned file that this symlink resolves to), not to a file under /usr/local/cuda-*/compat/.

  4. For Error 802 on NVSwitch systems, ensure nvidia-fabricmanager is running and its version matches the driver:

    nvidia-smi  # check driver version
    nv-fabricmanager --version  # must match
    sudo systemctl start nvidia-fabricmanager
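
The check in step 3 can also be written as a small script, which is easier to read and reuse than a one-liner. This is a sketch under stated assumptions: the function names libcuda_paths and is_compat are hypothetical, and the compat pattern is the /usr/local/cuda-*/compat/ location mentioned above.

```python
import fnmatch


def libcuda_paths(maps_lines):
    """Return the distinct libcuda.so paths found in /proc/<pid>/maps lines."""
    paths = []
    for line in maps_lines:
        if "libcuda.so" not in line:
            continue
        # Fields: address, perms, offset, dev, inode, pathname.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue
        path = fields[5].strip()
        if path.endswith(" (deleted)"):
            path = path[: -len(" (deleted)")]
        if path not in paths:
            paths.append(path)
    return paths


def is_compat(path):
    """True if the path lies under a bundled CUDA compat directory."""
    return fnmatch.fnmatchcase(path, "/usr/local/cuda-*/compat/*")
```

To use it inside the container, load the library first with ctypes.CDLL('libcuda.so.1'), then pass open('/proc/self/maps') to libcuda_paths and test each result with is_compat. A True result suggests the bundled compat driver was loaded instead of the host driver.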