Troubleshoot AI Workbench

AI Workbench uses Windows Subsystem for Linux 2 (WSL2). The installer configures WSL and installs the AI Workbench WSL distribution. If problems occur during installation, try the following steps:

  • If any Windows updates are pending, reboot your computer.

  • Run wsl --update manually in Windows PowerShell. If your internet connection is unstable, you might need to run this command multiple times.

  • Check whether your corporate VPN is preventing you from downloading the NVIDIA-Workbench WSL distribution from the Windows store.

  • If you have an old version of WSL, manually install WSL2.
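The repeated wsl --update attempts described above can be scripted. The following is a minimal sketch, not part of AI Workbench: the function name update_wsl, the retry count, and the delay are hypothetical, and it assumes that wsl --update exits with status 0 on success.

```python
import subprocess
import time


def update_wsl(max_attempts=3, delay_seconds=5.0, run=subprocess.run):
    """Run `wsl --update` until it succeeds or the attempts run out.

    Returns the 1-based number of the attempt that succeeded, or None.
    Assumption: an exit status of 0 means the update succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = run(["wsl", "--update"])
        except FileNotFoundError:
            # wsl is not on PATH, so retrying cannot help.
            return None
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            # Give an unstable connection a moment before retrying.
            time.sleep(delay_seconds)
    return None
```

Calling update_wsl() from a Python prompt on Windows returns the attempt number that succeeded, or None if every attempt failed.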

After you add packages to your AI Workbench project, the container sometimes fails to build. If you add an invalid package and then remove it, the project can get stuck in a quick build loop. To resolve this issue from the AI Workbench desktop application, click Build in the status bar, and then click Clear Cache and Build. To resolve this issue from the CLI, use the --full-build flag to build your project.

On macOS, your Docker containers run in a VM that Docker manages. If the VM does not have enough resources allocated, your container build fails and you see a message that the disk is full. To resolve this issue, open the settings in the Docker Desktop app and increase the system resources (CPU, memory, disk) that are allocated to the Docker VM.

On macOS, your Podman containers run in a VM that Podman manages, and sometimes your container might not build or start. To resolve this issue, try stopping and restarting AI Workbench, which stops and restarts the Podman VM. You can also manage the Podman VM directly by using the podman machine commands. For Podman on macOS, AI Workbench creates a machine called nvidia-workbench during installation.
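If restarting AI Workbench is not convenient, the Podman VM can be restarted by hand with the podman machine stop and podman machine start commands. The following is a minimal sketch under that assumption: the helper name restart_podman_machine is hypothetical, and only the machine name nvidia-workbench comes from the text above.

```python
import subprocess

MACHINE_NAME = "nvidia-workbench"  # machine name given in the text above


def restart_podman_machine(name=MACHINE_NAME, run=subprocess.run):
    """Stop and then start the named Podman machine.

    Returns True if `podman machine start` exits with status 0.
    """
    # A failed stop (for example, the machine is already stopped)
    # is ignored on purpose; only the start result matters here.
    run(["podman", "machine", "stop", name])
    started = run(["podman", "machine", "start", name])
    return started.returncode == 0
```

To inspect the machine state before or after the restart, run podman machine list in a terminal.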

Rootless Podman containers that use the --userns=keep-id flag with the native overlay driver have a known issue: they are extremely slow to start. In some cases, a container can take several minutes to start the first time after it is built.

When a container uses a new user namespace with different ID mappings, Podman must present the container image with the right ownership inside that namespace. To do so, Podman creates a copy of the image and chowns every file to the expected user, which takes several minutes for a moderately large image (>10GB). In addition to the slow start, a locking bug in Podman during the image copy causes all podman commands to hang until the copy completes and the container starts.

A workaround is to use fuse-overlayfs instead of native overlayfs for rootless Podman containers that use the --userns=keep-id flag, until idmapped mounts are supported from within a user namespace for rootless containers. However, because fuse-overlayfs is a FUSE file system, it is inherently slower than native overlayfs. Also, before you switch from native overlayfs to fuse-overlayfs, or the reverse, you must delete all Podman volumes and images by running the podman system reset command. To enable fuse-overlayfs, set it as the mount_program in the storage.conf file for Podman.
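For illustration only, a storage.conf fragment that sets fuse-overlayfs as the mount program might look like the following. The file location and the binary path are assumptions that vary by system; rootless Podman typically reads ~/.config/containers/storage.conf, and the path to fuse-overlayfs depends on how it was installed.

```toml
# Assumed location for rootless Podman: ~/.config/containers/storage.conf
[storage]
driver = "overlay"

[storage.options.overlay]
# Assumed binary path; confirm it with `which fuse-overlayfs`.
mount_program = "/usr/bin/fuse-overlayfs"
```

Remember that podman system reset deletes all existing images and volumes, so run it only when you are ready to rebuild them.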

However, because fuse-overlayfs considerably increases build time (in some cases builds are 2-5x slower), AI Workbench continues to use native overlayfs for Podman containers. Instead, as a temporary fix, AI Workbench briefly starts and stops each Podman container right after it is built. As a result, Podman creates the ID-mapped copy of each layer, during which podman commands may hang, as part of the container build flow instead of the container start flow. This mitigates slow container starts and container start failures. A message describing the issue appears in the build output.

If your remote location has GPUs, sometimes a driver update or other system update requires you to restart the computer before AI Workbench can connect to it. To check for this issue, SSH into the remote computer and run nvidia-smi. If you see an error, reboot the remote computer to resolve the issue.
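The nvidia-smi check can also be scripted and run on the remote computer. The following is a minimal sketch, not an AI Workbench feature: the function name gpu_driver_responds and the timeout are hypothetical, and it assumes that a nonzero exit status, a missing binary, or a hang all count as "an error".

```python
import subprocess


def gpu_driver_responds(command=("nvidia-smi",), timeout_seconds=30):
    """Return True if the command runs and exits with status 0."""
    try:
        result = subprocess.run(
            list(command),
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary, permission problem, or a hung driver query.
        return False
    return result.returncode == 0
```

If gpu_driver_responds() returns False on the remote computer, a reboot is the next step described above.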

The version of AI Workbench installed on your local computer must match the version of AI Workbench installed on your remote locations. If you update AI Workbench on your local computer, but not a remote location, and then try to connect to the remote location, an error occurs.

You might see the following error in your AI Workbench desktop application.

  • Error connecting to <remote location name>

You might see the following error in your log files.

{ "level":"error", "error":"service version (0.34.0) does not match expected version (0.34.1)", ... "message":"AI Workbench Server Incompatible" }

To resolve this issue, see Update AI Workbench on a remote computer.
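To spot this error in a log file quickly, you can scan for the version mismatch message. The following is a minimal sketch that assumes each log entry is a single JSON object on one line with an "error" field shaped like the example above; the function name find_version_mismatch is hypothetical, and the sample line omits the fields that the example elides.

```python
import json
import re

# Pattern taken from the "error" text in the example log entry.
VERSION_MISMATCH = re.compile(
    r"service version \((?P<service>[^)]+)\) does not match "
    r"expected version \((?P<expected>[^)]+)\)"
)


def find_version_mismatch(log_line):
    """Return (service_version, expected_version), or None if no mismatch."""
    try:
        entry = json.loads(log_line)
    except json.JSONDecodeError:
        return None  # not a JSON log line
    if not isinstance(entry, dict):
        return None
    match = VERSION_MISMATCH.search(str(entry.get("error", "")))
    if match is None:
        return None
    return match.group("service"), match.group("expected")


# Sample entry with the elided fields left out.
sample = (
    '{"level":"error", '
    '"error":"service version (0.34.0) does not match expected version (0.34.1)", '
    '"message":"AI Workbench Server Incompatible"}'
)
print(find_version_mismatch(sample))  # ('0.34.0', '0.34.1')
```

Any line that is not JSON, or that has no matching "error" text, returns None, so the function can be applied to every line of a log file.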

AI Workbench has built-in support for VS Code. Sometimes it can take a long time for VS Code to install and initialize inside the project container. Try opening the VS Code app again. To prevent AI Workbench from shutting down the container when VS Code fails to start, start another app like JupyterLab first.

If VS Code fails to find or connect to the container, verify that you have correctly configured VS Code for AI Workbench. Both Podman and Windows require configuration. For instructions, see VS Code.

© Copyright 2024, NVIDIA Corporation. Last updated on Jun 10, 2024.