# Common Issues and Debugging Tips
For support with your cluster, reach out to your NVIDIA TAM. Alternatively, you can file a ticket directly at the NVIDIA Enterprise Support Portal, or, for non-urgent inquiries, use the shared communications channel with your TAM that was created during your onboarding process.
Note
NVIDIA does not have access to your namespaces. If you need NVIDIA to debug issues in your namespace or on a node, NVIDIA will request your approval (via your TAM), and an approval process will follow that grants NVIDIA admin permissions to access the namespace or node.
## Common Issues and Resolutions
This section covers common issues you may encounter while working on the cluster, along with their resolutions.
| Issues | Resolutions |
|---|---|
| You are unable to carry out an action on the cluster; for example, an action, department, or project is greyed out or missing. | This is likely due to a lack of scope access or limited permissions; reach out to your cluster admin to adjust your user roles and scopes. Refer to the Overview section for more information about user roles and scopes. Alternatively, you may be trying to access a feature that is not enabled or supported in DGX Cloud Create; refer to the Limitations section for more information. |
| You receive a `There are issues with your connection to the cluster` error message. | Verify that you are on a network that has been allowlisted for access to the cluster; reach out to your cluster admin for instructions on verifying this. If you are a cluster admin and need an additional network added to the allowlist, reach out to your NVIDIA TAM. If you continue to receive error messages like this after verifying the network, such as when trying to run a Workload, submit a support ticket or work with your TAM for additional investigation. |
| When running a job, you get an error related to an environment variable. | Verify that the environment variable in question is set to the expected value. If needed, override the environment variable in the pod's manifest file (see the environment variable sketch after this table) or by using NVIDIA Run:ai's +ENVIRONMENT VARIABLE option under Runtime settings when configuring the Environment for your Workload. |
| PVCs created using the K8s API or kubectl are not available to select in NVIDIA Run:ai workloads. | This is by design. Create a data source in NVIDIA Run:ai, then select an existing PVC, choosing the PVC you created manually. You will then be able to select and mount this PVC in your NVIDIA Run:ai-submitted workloads (see the PVC creation sketch after this table). |
| Workloads submitted or deleted in the CLI are not visible in the UI, or vice versa. Alternatively, workloads are stopped in the UI but still running in the CLI, or vice versa. | This can happen when the NVIDIA Run:ai version of the cluster and the NVIDIA Run:ai version of the control plane are out of sync. Contact your NVIDIA TAM for assistance in updating the cluster to the latest version. |
| Workloads are stuck in Pending or Failed states. | This can happen for a variety of reasons, but most commonly the workload does not meet the requirements of the security policies set on the DGX Cloud Create cluster, or it is trying to use a feature that is not currently supported on DGX Cloud Create. For more information about these policies, see the Security Restrictions for Kubernetes section and/or the Security Restrictions and Cluster Limitations section. For troubleshooting stuck jobs, we recommend reviewing the workload's event history and pod events (see the Pending/Failed troubleshooting sketch after this table). |
| Unable to mount additional shared memory. | Workloads submitted with NVIDIA Run:ai include 16 GB of shared memory (SHM) in all pods, mounted at /dev/shm (see the shared memory sketch after this table). |
| Workloads are stuck in Pending states after attaching a new data source. | Check the Workload's Event History under the Show Details menu for errors related to provisioning the data source's storage, then try launching a new job. If that new job launches, ensure you have followed the guidance from Recommended Storage Classes. If you have, it is possible that the cluster's allocated storage capacity has been exhausted; contact your TAM for further guidance. |
| Jupyter Workspace workloads may throw a 403 error. | When running a Jupyter workload of type Workspace, you may encounter a 403 error when attempting to connect to Jupyter in your browser, and the NVIDIA Run:ai logs may display authentication or authorization errors. To resolve this issue, ensure the Jupyter command and arguments in your Environment are configured so that NVIDIA Run:ai can connect to the notebook server. You can confirm the error in the pod logs (see the Jupyter log sketch after this table). |
| A pod with a new PVC is taking a long time to be created. | If this is the first time a PVC has been provisioned for a particular storage class, there can be a lengthy delay for the initial creation of the volume, and you may see transient provisioning errors in the output of kubectl describe for the pod. The PVC should eventually be provisioned and the pod should reach the Running state. This is a one-time penalty, and future launches should happen more rapidly (see the PVC provisioning sketch after this table). |
| The namespace for a project remains even after the project has been deleted. | This is a known behavior in NVIDIA Run:ai. If the namespace is no longer required, a customer admin can delete it using kubectl (see the namespace cleanup sketch after this table). |
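
Environment variable sketch (referenced above): a minimal example of overriding an environment variable through a pod manifest. The variable name `MY_VARIABLE`, the workload name, the namespace, and the container image are hypothetical placeholders, not values from this guide; substitute the variable and values your workload actually needs.

```bash
# Hypothetical sketch: set/override an environment variable in a pod manifest.
# MY_VARIABLE, my-workload, runai-my-project, and the image are placeholders.
cat <<'EOF' > my-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-workload
  namespace: runai-my-project        # your project's namespace
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: nvcr.io/nvidia/pytorch:24.01-py3
      command: ["sleep", "infinity"]
      env:
        - name: MY_VARIABLE          # the variable to override
          value: "desired-value"
EOF
kubectl apply -f my-workload.yaml
```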
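
PVC creation sketch (referenced above): a minimal example of creating a PVC directly against the Kubernetes API before registering it as a NVIDIA Run:ai data source. The PVC name, namespace, storage class, and size are placeholder assumptions; use a recommended storage class for your cluster.

```bash
# Hypothetical sketch: create a PVC directly with kubectl.
# Name, namespace, storage class, and size are placeholders.
cat <<'EOF' > my-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-manual-pvc
  namespace: runai-my-project            # your project's namespace
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: my-storage-class     # use a recommended storage class
  resources:
    requests:
      storage: 100Gi
EOF
kubectl apply -f my-pvc.yaml

# NVIDIA Run:ai does not pick this PVC up automatically: create a data source in the
# NVIDIA Run:ai UI, choose the existing-PVC option, and select my-manual-pvc before
# mounting it in a workload.
```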
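
Pending/Failed troubleshooting sketch (referenced above): a minimal example of inspecting a stuck workload with kubectl. The namespace and pod name are placeholders; the first command lists the actual pod names.

```bash
# Hypothetical sketch: inspect why a workload's pod is Pending or Failed.
# runai-my-project and my-workload-0-0 are placeholders.
kubectl get pods -n runai-my-project
kubectl describe pod my-workload-0-0 -n runai-my-project      # check the Events section
kubectl get events -n runai-my-project --sort-by=.lastTimestamp | tail -n 20
```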
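
Shared memory sketch (referenced above): a minimal example of confirming the default SHM allocation from inside a running pod. The namespace and pod name are placeholders.

```bash
# Hypothetical sketch: verify the shared memory mount inside a running workload's pod.
kubectl exec -n runai-my-project my-workload-0-0 -- df -h /dev/shm
```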
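
Jupyter log sketch (referenced above): a minimal example of checking a Jupyter Workspace pod's logs for the 403 error. The namespace and pod name are placeholders.

```bash
# Hypothetical sketch: look for 403 / authentication errors in the Jupyter pod's logs.
kubectl logs -n runai-my-project my-jupyter-workspace-0-0 | grep -iE "403|forbidden|token"
```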
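
PVC provisioning sketch (referenced above): a minimal example of monitoring a newly created PVC while the underlying volume is provisioned for the first time. The namespace and PVC name are placeholders.

```bash
# Hypothetical sketch: watch a new PVC while the volume is provisioned.
kubectl get pvc -n runai-my-project
kubectl describe pvc my-manual-pvc -n runai-my-project        # Events show provisioning progress
kubectl get events -n runai-my-project --field-selector involvedObject.name=my-manual-pvc
```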
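
Namespace cleanup sketch (referenced above): a minimal example of removing a leftover project namespace, assuming the default `runai-<project name>` namespace naming; the name shown is a placeholder. Confirm nothing in the namespace is still needed before deleting it.

```bash
# Hypothetical sketch: remove a leftover project namespace after the project is deleted.
# runai-my-project is a placeholder for the actual namespace name.
kubectl get all -n runai-my-project          # confirm nothing needed is still running
kubectl delete namespace runai-my-project
```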