Common Issues and Debugging Tips

For support with your cluster, reach out to your NVIDIA TAM. Alternatively, you can file a ticket directly at the NVIDIA Enterprise Support Portal, or, for non-urgent inquiries, use the shared communications channel with your TAM that was created during your onboarding process.

Note

NVIDIA does not have access to your namespaces. If you need NVIDIA to debug issues in your namespace or node, NVIDIA will request your approval (via your TAM), and an approval process will follow that grants NVIDIA admin permissions to access your namespace or node.

Common Issues and Resolutions

This section covers common issues you may encounter while working on the cluster and how to resolve them.


Issue: You are unable to carry out an action on the cluster; for example, an action, department, or project is greyed out or missing.

Resolution: This is likely due to a lack of scope access or limited permissions. Reach out to your cluster admin to adjust your user roles and scopes. Refer to the Overview section for more information about user roles and scopes.

Alternatively, you may be trying to access a feature that is not enabled or supported in DGX Cloud Create. Refer to the Limitations section for more information.

Issue: You receive a “There are issues with your connection to the cluster” error message.

Resolution: Verify that you are on a network that has been allowlisted for access to the cluster. Reach out to your cluster admin for instructions on verifying this. If you are a cluster admin and need an additional network added to the allowlist, reach out to your NVIDIA TAM.

If you continue to receive error messages like this after verifying the network, such as when trying to run a Workload, submit a support ticket or work with your TAM for additional investigation.
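
A quick way to confirm basic connectivity from your current network is to query the Kubernetes API with kubectl. This is a minimal sketch, assuming you already have a kubeconfig configured for the cluster; a timeout here usually indicates the network has not been allowlisted:

    # Basic connectivity check against the cluster's Kubernetes API.
    # Assumes a kubeconfig for this cluster is already configured; a timeout
    # usually means the current network has not been allowlisted.
    kubectl cluster-info
    kubectl get nodes   # should list the cluster's nodes if the connection works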

Issue: When running a job, you get a “library not found” error message.

Resolution: Verify that the LD_LIBRARY_PATH environment variable is set correctly. By default, DGX Cloud Create adds the following paths to this variable: /usr/lib/x86_64-linux-gnu and /usr/local/nvidia/lib64. Depending on the container or workload, the correct path for this variable may be different.

If needed, override the environment variable in the pod’s manifest file, or by using NVIDIA Run:ai’s +ENVIRONMENT VARIABLE option under Runtime settings when configuring the Environment for your Workload.
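
For example, a minimal sketch of setting the variable in a pod manifest; the pod name, container name, image, and the /opt/my-custom-libs path are placeholders, and the correct value depends on your container:

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-workload                      # placeholder pod name
    spec:
      containers:
      - name: main                                # placeholder container name
        image: nvcr.io/nvidia/pytorch:24.01-py3   # placeholder image
        env:
        - name: LD_LIBRARY_PATH                   # overrides the default library search path
          value: /usr/local/nvidia/lib64:/usr/lib/x86_64-linux-gnu:/opt/my-custom-libs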

Issue: PVCs created using the K8s API or kubectl are not visible or mountable in NVIDIA Run:ai.

Resolution: This is by design. You will need to create a data source in NVIDIA Run:ai, then select an existing PVC, choosing the PVC you created manually. You’ll then be able to select and mount this PVC in your NVIDIA Run:ai-submitted workloads.
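
To find the exact claim name to choose when creating the data source, you can list the PVCs in your project’s namespace with kubectl. This is a sketch; the runai-<project-name> namespace is an assumption, so substitute your project’s actual namespace:

    # List PVCs in the project's namespace to find the claim name to select
    # when creating the NVIDIA Run:ai data source. The namespace shown is a
    # placeholder; substitute your project's actual namespace.
    kubectl get pvc -n runai-<project-name>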

Issue: Workloads submitted and/or deleted in the CLI are not visible in the UI, or vice versa. Alternatively, workloads stopped in the UI are still running in the CLI, or vice versa.

Resolution: This can happen when the NVIDIA Run:ai version of the cluster and the NVIDIA Run:ai version of the control plane are out of sync. Contact your NVIDIA TAM for assistance in updating the cluster to the latest version.

Issue: Workloads are stuck in Pending or Failed states.

Resolution: This can happen for a variety of reasons, but most commonly, the workload does not meet the requirements for the security policies set on the DGX Cloud Create cluster, or is trying to use a feature that is not currently supported on DGX Cloud Create.

For more information about these policies, see the Security Restrictions for Kubernetes section and/or the Security Restrictions and Cluster Limitations section. To troubleshoot stuck jobs, we recommend using kubectl commands to get more detailed information from the cluster, as sketched below.
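
For example, a minimal sketch of inspecting a stuck workload with kubectl; the namespace and pod name shown are placeholders:

    # Inspect the stuck workload's pods; the Events section of describe and
    # the namespace events usually explain why scheduling or startup failed.
    kubectl get pods -n runai-<project-name>
    kubectl describe pod <pod-name> -n runai-<project-name>
    kubectl get events -n runai-<project-name> --sort-by=.lastTimestamp
    kubectl logs <pod-name> -n runai-<project-name>   # container logs, if the pod started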

Issue: Unable to mount additional shared memory.

Resolution: Workloads submitted with NVIDIA Run:ai include 16 GB of shared memory (SHM) in all pods, mounted at /dev/shm inside containers by default. Attempting to mount additional memory by manually specifying SHM sizes for /dev/shm in manifest files will result in an error, as the directory is already pre-allocated by NVIDIA Run:ai. If additional SHM is needed, contact your NVIDIA TAM.
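
To confirm how much shared memory a running container actually has, you can check /dev/shm from inside the pod. This is a sketch; the pod and namespace names are placeholders:

    # Show the size and usage of the pre-allocated shared memory mount.
    kubectl exec -it <pod-name> -n runai-<project-name> -- df -h /dev/shm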

Issue: Workloads are stuck in Pending states after attaching a new data source.

Resolution: Check the Workload’s Event History under the Show Details menu for a FailedVolumeBinding event. If you discover that event, attempt to launch the workload without the data source attached.

If that new job launches, ensure you have followed the guidance from Recommended Storage Classes. If you have, it is possible that the cluster’s allocated storage capacity has been exhausted. Contact your TAM for further guidance.

Issue: Jupyter Workspace workloads may throw a 403 error.

Resolution: When running a Jupyter workload of type Workspace, you may encounter a 403 error when attempting to connect to Jupyter in your browser. The NVIDIA Run:ai logs may display one or more of the following errors:

    Blocking request with non-local 'Host'. If the server should be accessible at that name, set ServerApp.allow_remote_access to disable the check.

    No web browser found: Error('could not locate runnable browser')

    Jupyter Server 2.14.2 is running at:
    [I 2024-11-18 22:35:25.737 ServerApp] http://localhost:0
    [I 2024-11-18 22:35:25.737 ServerApp] http://127.0.0.1:0

To resolve this issue, ensure that jupyter-lab is set as the runtime command and that the following runtime arguments are included in the Runtime command and arguments section of the Environment that you created: --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token='' --ServerApp.allow_remote_access=true --allow-root --port=8888 --no-browser
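
Put together, the runtime command and arguments correspond to the following command line (shown combined for readability):

    # Full Jupyter launch command implied by the runtime command and arguments above.
    jupyter-lab \
      --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} \
      --NotebookApp.token='' \
      --ServerApp.allow_remote_access=true \
      --allow-root \
      --port=8888 \
      --no-browser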

Issue: A pod with a new PVC is taking a long time to be created.

Resolution: If this is the first time a PVC has been provisioned for a particular storage class, there can be a lengthy delay for the initial creation of the volume. You may see errors such as the following in the output of kubectl describe for the pod:

    error: binding volumes: timed out waiting for the condition

    rpc error: code = DeadlineExceeded desc = context deadline exceeded

These errors are transient; the PVC should eventually be provisioned, and the pod should then become ready and enter the Running state. This is a one-time penalty, and future launches should happen more quickly.
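
If you want to watch the provisioning progress, you can follow the PVC directly with kubectl. This is a sketch; the claim and namespace names are placeholders:

    # Watch the PVC until its STATUS changes from Pending to Bound, and check
    # its events for provisioning errors.
    kubectl get pvc <pvc-name> -n runai-<project-name> -w
    kubectl describe pvc <pvc-name> -n runai-<project-name>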

Issue: The namespace for a project remains even after the project has been deleted.

Resolution: This is a known behavior in NVIDIA Run:ai. If the namespace is no longer required, a customer admin can delete it using kubectl delete ns <the-orphan>, substituting the actual namespace.
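
Before deleting it, you may want to confirm that nothing you still need remains in the orphaned namespace. A minimal sketch, with <the-orphan> standing in for the namespace name:

    # Check for leftover workloads and PVCs in the orphaned namespace.
    kubectl get all,pvc -n <the-orphan>
    # If nothing needed remains, delete the namespace and everything still in it.
    kubectl delete ns <the-orphan>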