Troubleshooting CUDA Driver Initialization Failures
At startup, NIM calls torch.cuda.init() to validate that the CUDA driver can initialize. If this check fails, startup aborts with a RuntimeError:
RuntimeError: CUDA driver failed to initialize: <error message>.
This may indicate a driver mismatch — ensure the NVIDIA Container Toolkit
is bind-mounting the host driver into the container.
Common causes: compat shims loaded instead of host driver,
or missing nvidia-container-toolkit configuration.
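To reproduce the failure outside NIM, the startup check can be approximated with a small wrapper. This is a minimal sketch, not NIM's actual code: the function name check_cuda_driver and the injectable init_fn parameter are assumptions made for illustration. Only torch.cuda.init() and the message prefix come from the description above.

```python
from typing import Callable


def check_cuda_driver(init_fn: Callable[[], None]) -> None:
    """Run a CUDA init callable and re-raise any failure with a troubleshooting hint.

    init_fn is a hypothetical injection point; on a GPU host it would be
    torch.cuda.init. Injecting it keeps this helper importable without torch.
    """
    try:
        init_fn()
    except Exception as exc:  # driver errors such as 802/803 typically surface as RuntimeError
        raise RuntimeError(
            f"CUDA driver failed to initialize: {exc}. "
            "This may indicate a driver mismatch."
        ) from exc
```

On a host with PyTorch installed, running `import torch` followed by `check_cuda_driver(torch.cuda.init)` inside the container should either return silently or raise with the underlying driver error preserved in the message.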
Common error messages include:
“Error 803: system has unsupported display driver / cuda driver combination” — The container is loading a bundled CUDA compat driver that does not match the host’s kernel driver. This is the most common cause.
“Error 802: system not yet initialized” — The NVIDIA kernel driver is not loaded or fabric manager is not running (required on NVSwitch systems like DGX A100/H100).
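When triaging logs, the two messages above can be mapped to their likely causes mechanically. The sketch below is illustrative only: the helper name likely_cause and the KNOWN_CAUSES table are hypothetical, and the table covers just the two error codes listed above.

```python
import re

# Causes summarized from the two messages listed above; other codes are not covered.
KNOWN_CAUSES = {
    803: "bundled CUDA compat driver does not match the host kernel driver",
    802: "NVIDIA kernel driver not loaded, or fabric manager not running",
}


def likely_cause(error_message: str) -> str:
    """Return a likely cause for a CUDA init error message, or 'unknown'."""
    match = re.search(r"Error (\d+)", error_message)
    if match is None:
        return "unknown"
    return KNOWN_CAUSES.get(int(match.group(1)), "unknown")
```

Any code outside the table returns "unknown" rather than a guess, so the helper does not mislabel errors this page does not describe.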
Causes
The NIM container image ships CUDA compat libraries at /usr/local/cuda-*/compat/. At runtime, the NVIDIA Container Toolkit should bind-mount the host’s userspace CUDA driver into the container, overriding the bundled compat libraries. If this mechanism fails, the container loads the bundled compat driver, which may not match the host’s kernel driver.
Common reasons the host driver is not mounted:
NVIDIA Container Toolkit is not installed or not configured
The container runtime is not set to nvidia (--runtime=nvidia or equivalent)
CDI mode is enabled but not properly configured
The LD_LIBRARY_PATH inside the container prioritizes compat libraries over the host driver
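The last cause in the list can be checked directly, because directories on LD_LIBRARY_PATH are searched before the system default library locations. The following sketch, run inside the container, reports any compat directory on the search path. The helper name compat_entries is hypothetical, and the regular expression assumes the /usr/local/cuda-*/compat/ layout described above.

```python
import os
import re

# Matches directories such as /usr/local/cuda-<version>/compat (layout taken from the text above).
COMPAT_DIR = re.compile(r"^/usr/local/cuda-[^/]*/compat/?$")


def compat_entries(ld_library_path: str) -> list[tuple[int, str]]:
    """Return (position, directory) for each compat directory on the search path."""
    entries = [p for p in ld_library_path.split(":") if p]  # drop empty components
    return [(i, p) for i, p in enumerate(entries) if COMPAT_DIR.match(p)]


if __name__ == "__main__":
    for position, directory in compat_entries(os.environ.get("LD_LIBRARY_PATH", "")):
        print(f"compat directory at search position {position}: {directory}")
```

An empty result means LD_LIBRARY_PATH is not what pulls in the compat driver, which points back to the toolkit, runtime, or CDI configuration.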
Resolution
Verify the NVIDIA Container Toolkit is installed and configured:
nvidia-ctk --version
Ensure the container uses the NVIDIA runtime:
docker run --runtime=nvidia --gpus all ...
Check which libcuda.so is loaded inside the container:
docker exec ${CONTAINER} python3 -c \
  "import ctypes; ctypes.CDLL('libcuda.so.1'); \
  [print(l.split(None,5)[-1].strip().removesuffix(' (deleted)')) for l in open('/proc/self/maps') if 'libcuda.so' in l][:1]"
The path should point to the host driver (e.g., /usr/lib/x86_64-linux-gnu/libcuda.so.1), not to /usr/local/cuda-*/compat/.
For Error 802 on NVSwitch systems, ensure nvidia-fabricmanager is running and its version matches the driver:
nvidia-smi # check driver version
nv-fabricmanager --version # must match
sudo systemctl start nvidia-fabricmanager
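The libcuda.so check above is compact but hard to read as a one-liner. A more readable sketch of the same idea follows, split into a parser for /proc/<pid>/maps content and a path classifier. The function names libcuda_paths and is_compat_path are hypothetical, and the classifier assumes the /usr/local/cuda-*/compat/ layout mentioned earlier.

```python
def libcuda_paths(maps_text: str) -> list[str]:
    """Return the distinct libcuda.so paths found in /proc/<pid>/maps content."""
    paths: list[str] = []
    for line in maps_text.splitlines():
        if "libcuda.so" not in line:
            continue
        # The pathname is the sixth whitespace-separated field of a maps line.
        path = line.split(None, 5)[-1].strip().removesuffix(" (deleted)")
        if path not in paths:
            paths.append(path)
    return paths


def is_compat_path(path: str) -> bool:
    """True if the path sits under a bundled /usr/local/cuda-*/compat/ directory."""
    return path.startswith("/usr/local/cuda-") and "/compat/" in path
```

To use it inside the container, load the library first with `ctypes.CDLL('libcuda.so.1')`, then pass the contents of /proc/self/maps to libcuda_paths. A path for which is_compat_path returns True suggests the bundled compat driver was loaded instead of the host driver.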