Troubleshoot AI Workbench
Use the information in this documentation to troubleshoot issues that arise when you work with NVIDIA AI Workbench. Typically, you find issues in the log and runtime files.
Get help with the following issues:
After you add packages to your AI Workbench project, sometimes your container won’t build. If you add an invalid package and then remove it, you can get stuck in a quick build loop. To resolve this issue, do the following:
Open your project in the AI Workbench desktop application.
Open the Environment page.
Use the Packages list to remove the package (or confirm that it is removed).
Rebuild your project.
To rebuild your project from the desktop application, click Build in the status bar, and then click Clear Cache and Build.
To rebuild your project from the CLI, use the --full-build flag.
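A full rebuild from the CLI might look like the following sketch. The nvwb subcommand name is an assumption; only the --full-build flag comes from this guide, so check the CLI help for the exact syntax in your version.
# Force a full rebuild so cached layers from the invalid package are not reused.
# The subcommand name is an assumption; run the CLI help to confirm it for your version.
nvwb build --full-build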
On Windows, if your NVIDIA GPU driver is version 555 or later, and your Docker Desktop version is earlier than 4.33.0, when you run a notebook or do another action that uses the GPU, you might see a CUDA (or other) error message. To resolve this issue, update to Docker Desktop version 4.33.0 or later.
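To check which versions you have, you can run the following commands in a Windows terminal. The Docker Desktop app also shows its own version in its UI, and on Docker Desktop installations the docker version output typically names Docker Desktop in the server section.
# Show the installed NVIDIA driver version (555 or later is affected).
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Show the Docker client and server versions.
docker version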
On macOS, your Docker containers run in a VM that Docker manages. If not enough resources are allocated to the VM, your container build fails, and you see a message that the disk is full. To resolve this issue, open the settings in the Docker Desktop app and increase the system resources (CPU, memory, disk) that are allocated to the Docker VM.
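To see what the VM currently has before you change the settings, you can check from a terminal. This is a quick sketch; the exact output format depends on your Docker version.
# Show the CPU count and total memory currently available to the Docker VM.
docker info | grep -E 'CPUs|Total Memory'
# Show how much of the VM disk is used by images, containers, and volumes.
docker system df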
If a project doesn’t build, or behaves in an unexpected way, Docker might not be installed correctly on your computer. Error messages that mention buildx or failed to read dockerfile in the output window or log files indicate this problem. To resolve this issue, uninstall Docker from your computer and either let AI Workbench install it for you or follow Docker’s instructions.
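A quick way to sanity-check the installation is to confirm that the Docker CLI, the daemon, and the buildx plugin all respond; for example:
# Confirm the Docker CLI can reach the daemon.
docker version
# Confirm the buildx plugin is installed; a missing plugin matches the buildx errors described above.
docker buildx version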
AI Workbench uses Windows Subsystem for Linux 2 (WSL). The installer configures WSL and installs the AI Workbench WSL distribution. If problems occur during installation, try the following steps:
If any Windows updates are pending, reboot your computer.
Run wsl --update manually in Windows PowerShell (see the commands after these steps). If your internet connection is unstable, you might need to run this command multiple times.
Your corporate VPN might prevent you from downloading the NVIDIA-Workbench WSL distribution from the Microsoft Store.
If you have an old version of WSL, manually install WSL 2.
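The following PowerShell commands cover these checks. Note that wsl --version is only available in newer WSL releases, so it might not work on older installations.
# Show the WSL version and the registered distributions.
wsl --version
wsl --list --verbose
# Update WSL; rerun the command if the download fails over an unstable connection or a VPN.
wsl --update
# Ensure that new distributions use WSL 2 by default.
wsl --set-default-version 2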
The AI Workbench installer might fail to import the WSL distribution. In this case, you might see the following error message.
[error] (configure-distro) importDistro Command failed: wsl --import NVIDIA-Workbench "C:\ProgramData\NVIDIA Corporation\workbench" "C:\Users\<user name>\AppData\Local\Temp\ubuntu-jammy-wsl-amd64-wsl.rootfs.tar.gz" --version 2
[info] (configure-distro) importDistroResponse { success: false, error: 'import-error' }
[info] (configure-distro) installDistroResponse(): error installDistroResponse.error import-error
[error] (configure-distro-channel) Unknown Error
This can happen if there is an issue with the virtualization on your computer. To resolve this issue, do the following:
Open a Windows terminal and run systeminfo (a PowerShell example follows these steps).
Find the item Hyper-V Requirements.
If it says A hypervisor has been detected…, stop and contact AI Workbench support.
Otherwise, ensure that virtualization is correctly installed on your computer. Follow the instructions from Microsoft, for example, Enable virtualization on Windows.
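In PowerShell, you can filter the systeminfo output down to the relevant lines; for example:
# Show the Hyper-V Requirements line and the few lines that follow it.
systeminfo | Select-String "Hyper-V" -Context 0,4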
If you want to use Docker as your container runtime, you need Docker Desktop. When you install AI Workbench on Windows, in most cases the installer can install Docker Desktop for you. If you install AI Workbench as a user who doesn’t have administrator privileges, the AI Workbench installer fails to install Docker Desktop. Use the instructions in the Docker documentation to add the user to the docker-users group, and then restart the AI Workbench installer.
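If you prefer the command line, Docker’s documentation describes adding the account to the group with net localgroup from an elevated (administrator) terminal, roughly as follows. Sign out and back in so the group change takes effect before you restart the installer.
# Run from an elevated terminal; replace <user name> with the Windows account.
net localgroup docker-users "<user name>" /add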
On macOS your Podman containers run in a VM that Podman manages.
If you are using Podman on macOS, your container might not build or start.
To resolve this issue, try stopping and restarting AI Workbench, which stops and restarts the Podman VM.
You can manage the Podman VM by using the podman machine commands.
For Podman on macOS, AI Workbench creates a machine called nvidia-workbench during installation.
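For example, you can inspect and restart the VM manually:
# List the Podman machines on this computer.
podman machine list
# Stop and start the machine that AI Workbench created.
podman machine stop nvidia-workbench
podman machine start nvidia-workbench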
Rootless Podman containers that use the --userns=keep-id flag with the native overlay driver have a known issue of being extremely slow during container start.
In some cases, it can take several minutes for a container to start after the first time it is built.
When a container uses a new user namespace with different ID mappings, Podman creates a copy of the image and chowns every file to the expected user so that the image has the correct ownership inside the new user namespace. This copy takes several minutes for a moderately large image (>10 GB). In addition, a locking bug in Podman during the image copy causes all podman commands to hang and freeze until the copy completes and the container starts.
A workaround for this issue is to use fuse-overlayfs instead of the native overlayfs for rootless Podman containers that use the --userns=keep-id flag, until ID-mapped mounts are supported from a user namespace for rootless containers.
However, since fuse-overlayfs is a FUSE file system, it is inherently slower than the native overlayfs.
Also, all Podman volumes and images must be deleted with the podman system reset command before switching from native overlayfs to fuse-overlayfs, and vice versa.
Enable fuse-overlayfs by setting it as the mount_program in the Podman storage.conf file (see the sketch below).
However, because fuse-overlayfs considerably increases build time (in some cases 2-5x slower), AI Workbench continues to use the native overlayfs for Podman containers. Instead, as a temporary fix, Podman containers are briefly started and stopped after they are built, so that the ID-mapped copy of each layer, during which podman commands can hang, is created as part of the container build flow instead of the container start flow. This mitigates the slow container start or container start failure. A message describing the issue appears in the build output.
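If you decide to switch to fuse-overlayfs yourself, accepting the slower builds and the required podman system reset, the storage.conf entry looks roughly like the following sketch. The file location (commonly ~/.config/containers/storage.conf for rootless Podman) and the path to the fuse-overlayfs binary can differ on your system.
[storage]
driver = "overlay"

[storage.options.overlay]
# Path to the fuse-overlayfs binary; adjust for your installation.
mount_program = "/usr/bin/fuse-overlayfs"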
If your remote location has GPUs, sometimes a driver update or other system update
requires you to restart the computer before AI Workbench can connect to it.
To verify this issue, SSH into the remote computer and run nvidia-smi.
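For example:
# Replace <user> and <remote host> with your SSH user and the remote location's address.
ssh <user>@<remote host> nvidia-smi
# A common symptom after a driver update is an NVML "Driver/library version mismatch" error.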
If you see an error, reboot the remote computer to resolve the issue.
The version of AI Workbench installed on your local computer must match the version of AI Workbench installed on your remote locations. If you update AI Workbench on your local computer, but not a remote location, and then try to connect to the remote location, an error occurs.
You might see the following error in your AI Workbench desktop application.
Error connecting to <remote location name>
You might see the following error in your log files.
{
"level":"error",
"error":"service version (0.34.0) does not match expected version (0.34.1)",
...
"message":"AI Workbench Server Incompatible"
}
To resolve this issue, see Update AI Workbench on a Remote Computer.
AI Workbench has built-in support for VS Code. Sometimes it can take a long time for VS Code to install and initialize inside the project container. If VS Code doesn’t open, try opening it again. To prevent AI Workbench from shutting down the container when VS Code fails to start, start another app, such as JupyterLab, first.
If VS Code fails to find or connect to the container, verify that you have correctly configured VS Code for AI Workbench. Both Podman and Windows require configuration. For instructions, see VS Code in AI Workbench.