Troubleshoot AI Workbench

Use the information in this documentation to troubleshoot issues that arise when you work with NVIDIA AI Workbench. Typically, you can find the cause of an issue in the log and runtime files.

Get help with the following issues:

Clone project fails because of bad project spec file

If you attempt to clone a project and it fails, you might see an error message such as “Problem validating new project”. To fix this problem, verify that the project that you are cloning has a valid spec file.

Clone project fails repeatedly

If you attempt to clone a project, it might fail for reasons such as a network failure. After the clone fails the first time, the project is left in an intermediate state, which can cause later clone attempts to fail. Before you clone the project again, clean it up by doing the following:

1. Find the file inventory.json in the ~/.nvwb/ folder. If you are on Windows, look in Linux\NVIDIA-Workbench in File Explorer.
2. Inspect the file to see if there is an entry for the new project.
   1. Delete the new project’s entry from the file. Be careful not to change the JSON structure of the file.
   2. Go to the ~/.nvwb/project-runtime-info folder and completely remove the new project’s folder. For more information, see AI Workbench Project Runtime Files.
3. Clone the project again.
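Because a hand edit can easily break inventory.json, it may help to back up the file first and confirm that it still parses afterwards. The following is a minimal sketch, assuming a Linux or WSL shell with python3 available. The check_json helper and the INVENTORY variable are illustrative names, not part of AI Workbench.

```shell
# Path taken from the steps above; override INVENTORY if your file lives elsewhere.
INVENTORY="${INVENTORY:-$HOME/.nvwb/inventory.json}"

# check_json FILE (hypothetical helper): succeeds only if FILE parses as JSON.
# Uses the json.tool module from the Python standard library.
check_json() {
    python3 -m json.tool "$1" > /dev/null
}

if [ -f "$INVENTORY" ]; then
    # Keep a copy before you edit the file by hand.
    cp "$INVENTORY" "$INVENTORY.bak"
fi

# After editing, confirm that the structure is still valid:
#   check_json "$INVENTORY" && echo "inventory.json is valid JSON"
```

If the check fails, you can restore the inventory.json.bak copy and try the edit again.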

Container won’t build after adding and removing an invalid package

After you add packages to your AI Workbench project, sometimes your container won’t build. If you add an invalid package and then remove it, you can get stuck in a quick build loop. To resolve this issue, do the following:

1. Open your project in the AI Workbench Desktop Application.
2. Open the Environment page.
3. Use the Packages list to remove the package (or confirm that it is removed).
4. Rebuild your project.
   - To rebuild your project from the desktop application, click Build in the status bar, and then click Clear Cache and Build.
   - To rebuild your project from the CLI, use the --full-build flag to build your project.

CUDA and other errors on Windows

On Windows, you might see a CUDA (or other) error message when you run a notebook or do another action that uses the GPU. This can happen if your NVIDIA GPU driver is version 555 or later and your Docker Desktop version is earlier than 4.33.0. To resolve this issue, update to Docker Desktop version 4.33.0 or later.
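To check whether this combination applies to you, compare your installed versions against the thresholds above. The following is a minimal sketch, assuming a WSL or other Linux-style shell with GNU sort. The version_ge helper is hypothetical, and the example version values are placeholders that you replace with the driver version that nvidia-smi reports and the version shown in the Docker Desktop app.

```shell
# version_ge A B (hypothetical helper): succeeds when version A >= version B.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report the installed NVIDIA driver version, if nvidia-smi is available.
if command -v nvidia-smi > /dev/null 2>&1; then
    nvidia-smi --query-gpu=driver_version --format=csv,noheader || true
fi

# Placeholder values; replace them with your own versions.
driver_version="555.85"
docker_desktop_version="4.32.0"

if version_ge "$driver_version" "555" && ! version_ge "$docker_desktop_version" "4.33.0"; then
    echo "Update Docker Desktop to 4.33.0 or later"
fi
```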

Docker container build fails on macOS

On macOS, your Docker containers run in a VM that Docker manages. If not enough resources are allocated to the VM, your container build fails, and you see a message that the disk is full. To resolve this issue, open the settings in the Docker Desktop app and increase the system resources (CPU, memory, disk) that are allocated to the Docker VM.

Docker installed incorrectly or missing buildx

If a project doesn’t build, or behaves in an unexpected way, Docker might not be installed correctly on your computer. In this case, you might see error messages that mention buildx or “failed to read dockerfile” in the output window or log files. To resolve this issue, uninstall Docker from your computer and let AI Workbench install it for you, or follow Docker’s installation instructions.

Podman container build fails on macOS

On macOS, your Podman containers run in a VM that Podman manages. If you are using Podman on macOS, your container might not build or start. To resolve this issue, try stopping and restarting AI Workbench, which stops and restarts the Podman VM. You can also manage the Podman VM directly by using the podman machine commands. During installation on macOS, AI Workbench creates a Podman machine called nvidia-workbench.
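The podman machine commands mentioned above can restart the VM without restarting AI Workbench. The following is a minimal sketch that assumes the machine name nvidia-workbench from this section. The restart_machine helper is an illustrative name, not an AI Workbench command.

```shell
# restart_machine [NAME] (hypothetical helper): stop and then start a Podman machine.
# Defaults to the machine that AI Workbench creates on macOS.
restart_machine() {
    name="${1:-nvidia-workbench}"
    podman machine stop "$name" && podman machine start "$name"
}

# List the machines and their state first, then restart if needed:
#   podman machine list
#   restart_machine
```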

Podman container slow on first start

Rootless Podman containers that use the --userns=keep-id flag with the native overlay driver have a known issue of being extremely slow during container start. In some cases, it can take several minutes for a container to start after the first time it is built.

When a container uses a new user namespace with different ID mappings, Podman must present the container image with the right ownership inside that namespace. To do so, Podman creates a copy of the image and chowns every file to the expected user, which takes several minutes for a moderately large image (larger than 10 GB). In addition to the slow start, a locking bug in Podman during the image copy causes all podman commands to hang until the copy completes and the container starts.

A workaround for this issue is to use fuse-overlayfs instead of native overlayfs for rootless Podman containers that use the --userns=keep-id flag, until idmapped mounts are supported from user namespaces for rootless containers. However, because fuse-overlayfs is a FUSE file system, it is inherently slower than native overlayfs. Also, before you switch from native overlayfs to fuse-overlayfs, or the reverse, you must delete all Podman volumes and images by using the podman system reset command. To enable fuse-overlayfs, set it as the mount_program in the storage.conf file for Podman.

However, because fuse-overlayfs considerably increases build time (in some cases 2-5x slower), Podman containers continue to use native overlayfs with AI Workbench. Instead, as a temporary fix, Podman containers are briefly started and stopped after they are built. As a result, the Podman process of creating an ID-mapped copy of each layer, during which podman commands may hang, runs as part of the container build flow instead of the container start flow. This mitigates the slow container start or container start failure. A message that describes the issue appears in the build output.
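For reference, the fuse-overlayfs workaround described above corresponds to a storage.conf fragment along the following lines. This is a sketch under stated assumptions, not something that AI Workbench configures for you: the file location (commonly ~/.config/containers/storage.conf for rootless Podman) and the path to the fuse-overlayfs binary vary by system.

```toml
# Assumed location for rootless Podman: ~/.config/containers/storage.conf
[storage]
driver = "overlay"

[storage.options.overlay]
# Assumed install path; confirm it with `command -v fuse-overlayfs`.
mount_program = "/usr/bin/fuse-overlayfs"
```

As noted above, switching between native overlayfs and fuse-overlayfs requires a podman system reset, which deletes all Podman volumes and images.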

Remote location fails to connect when needing reboot

If your remote location has GPUs, sometimes a driver update or other system update requires you to restart the computer before AI Workbench can connect to it. To verify this issue, SSH into the remote computer and run nvidia-smi. If you see an error, reboot the remote computer to resolve the issue.
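The check described above can be run in one step from your local computer. The following is a minimal sketch, assuming that you already have SSH access to the remote computer. The check_remote_gpu helper and the user@remote-host value are hypothetical.

```shell
# check_remote_gpu USER@HOST (hypothetical helper): run nvidia-smi over SSH
# and report whether it succeeded.
check_remote_gpu() {
    if ssh "$1" nvidia-smi; then
        echo "nvidia-smi succeeded on $1"
    else
        echo "nvidia-smi failed on $1; consider rebooting the remote computer"
        return 1
    fi
}

# Usage (replace with your own user and host):
#   check_remote_gpu user@remote-host
```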

Remote location inaccessible after you update AI Workbench locally

The version of AI Workbench installed on your local computer must match the version of AI Workbench installed on your remote locations. If you update AI Workbench on your local computer, but not a remote location, and then try to connect to the remote location, an error occurs.

You might see the following error in your AI Workbench Desktop Application.

Error connecting to <remote location name>

You might see the following error in your log files.

{
  "level":"error",
  "error":"service version (0.34.0) does not match expected version (0.34.1)",
  ...
  "message":"AI Workbench Server Incompatible"
}

To resolve this issue, see Update AI Workbench on a Remote System.
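To confirm that a connection error comes from a version mismatch, you can search your log files for the error text shown above. The following is a minimal sketch. The find_version_mismatch helper is a hypothetical name, and because this section does not state where the log files are stored, you pass the log file path yourself.

```shell
# find_version_mismatch [FILE...] (hypothetical helper): print version-mismatch
# errors from the given log files, or from standard input if no file is given.
# In a basic regular expression, the parentheses match literally.
find_version_mismatch() {
    grep -o 'service version ([0-9.]*) does not match expected version ([0-9.]*)' "$@"
}

# Usage (replace the path with one of your own log files):
#   find_version_mismatch /path/to/logfile
```

The two versions in the matched text tell you which side is out of date.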

Ubuntu 24.04 install or run fails

If you install and run AI Workbench on Ubuntu 24.04, you might see errors such as “The SUID sandbox helper binary was found, but is not configured correctly”. You can try to fix this problem by running the following command.

sudo sysctl -w kernel.apparmor_restrict_unprivileged_unconfined=0

VS Code says it failed to start, but the window opens without error

AI Workbench has built-in support for VS Code. Sometimes it can take a long time for VS Code to install and initialize inside the project container. Try opening the VS Code app again. To prevent AI Workbench from shutting down the container when VS Code fails to start, start another app like JupyterLab first.

VS Code fails to find or connect to the container

If VS Code fails to find or connect to the container, verify that you have correctly configured VS Code for AI Workbench. Both Podman and Windows require configuration. For instructions, see Visual Studio Code Integration.

Windows install fails to configure WSL

AI Workbench uses Windows Subsystem for Linux 2 (WSL). The installer configures WSL and installs the AI Workbench WSL distribution. If problems occur during installation, try the following steps:

- If any Windows updates are pending, reboot your computer.
- Run wsl --update manually in Windows PowerShell. If your internet connection is unstable, you might need to run this command multiple times.
- Check whether you are connected to a corporate VPN. Your corporate VPN might prevent you from downloading the NVIDIA-Workbench WSL distribution from the Windows store.
- If you have an old version of WSL, manually install WSL 2.

Windows install fails to import WSL distro

The AI Workbench installer might fail to import the WSL distribution. In this case, you might see the following error message.

[error]  (configure-distro)         importDistro Command failed: wsl --import NVIDIA-Workbench "C:\ProgramData\NVIDIA Corporation\workbench" "C:\Users\<user name>\AppData\Local\Temp\ubuntu-jammy-wsl-amd64-wsl.rootfs.tar.gz" --version 2
[info]   (configure-distro)         importDistroResponse { success: false, error: 'import-error' }
[info]   (configure-distro)         installDistroResponse(): error installDistroResponse.error import-error
[error]  (configure-distro-channel) Unknown Error

This can happen if there is an issue with the virtualization on your computer. To resolve this issue, do the following:

1. Open a Windows terminal and run systeminfo.
2. Find the item Hyper-V Requirements.
   1. If it says A hypervisor has been detected…, stop and contact AI Workbench support.
   2. Otherwise, ensure that virtualization is correctly installed on your computer. Follow the instructions from Microsoft, for example, Enable virtualization on Windows.

Windows install fails to install Docker Desktop

If you want to use Docker as your container runtime, you need Docker Desktop. When you install AI Workbench on Windows, in most cases the installer can install Docker Desktop for you. In cases where you install AI Workbench as a user that doesn’t have administrator privileges, the AI Workbench installer fails to install Docker Desktop. Use the instructions in the Docker documentation to add the user to the docker-users group, and then restart the AI Workbench installer.