Job Failure Diagnosis

DGX Cloud Lepton provides a built-in feature to automatically diagnose job failures. This feature helps users quickly identify and resolve issues in batch jobs, ensuring optimal performance and reliability.

Let's walk through an example of how to use this feature to diagnose a job failure.

Create the Job

Navigate to the Batch Jobs page to create the job with the following configuration:

Resource

Since we just want to test the job failure diagnosis feature, we can create the job with any GPU type and its corresponding node group.
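If you want to confirm at runtime which GPU the scheduler actually assigned, a one-line check inside the job is enough. This is a minimal sketch using PyTorch's standard device query:

import torch

# Print the name of the GPU allocated to this job (device 0)
print(torch.cuda.get_device_name(0))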

Container

In the container section, we will use the default image and paste the following command to run the job:

cat > bad_job.py << 'EOF'
import torch
import time
import argparse

def create_gpu_memory_leak(target_gb=100):
    if not torch.cuda.is_available():
        print("No GPU available. Exiting.")
        return

    print(f"Starting GPU memory leak test (target: {target_gb}GB)...")
    stored_tensors = []  # List to prevent garbage collection

    try:
        while True:
            # Create a 1GB tensor (approximately)
            # Using float32 (4 bytes) * 250M elements ≈ 1GB
            tensor = torch.rand(250_000_000, device='cuda:0')
            stored_tensors.append(tensor)  # Prevent garbage collection

            current_memory = torch.cuda.memory_allocated('cuda:0') / (1024**3)  # Convert to GB
            print(f"Current GPU memory usage: {current_memory:.2f} GB")

            if current_memory > target_gb:  # Use the target parameter
                print(f"Reached {target_gb}GB memory usage target")
                break

            time.sleep(0.1)  # Small delay to prevent system from becoming unresponsive
    except KeyboardInterrupt:
        print("\nTest interrupted by user")
    finally:
        print("Test completed")

if __name__ == "__main__":
    # python bad_job.py -t 100 for 100GB target
    parser = argparse.ArgumentParser(description='GPU Memory Leak Test')
    parser.add_argument('-t', '--target', type=float, default=100,
                      help='Target memory usage in GB (default: 100)')
    args = parser.parse_args()

    create_gpu_memory_leak(args.target)
EOF

python bad_job.py
Note

The above script creates a memory leak on the GPU by continuously allocating 1GB tensors with PyTorch and keeping references to them so they cannot be garbage collected, until it reaches a target memory usage (defaulting to 100GB, configurable via the -t/--target command-line option). Since the default target far exceeds the memory of any single GPU, the allocation will eventually fail with a CUDA out-of-memory error.
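To see why the default target guarantees a failure, you can compare it against the physical memory of the allocated GPU. The following is a minimal sketch using PyTorch's standard device-property query; the 100GB target mirrors the script's default:

import torch

# Compare the script's allocation target against the physical memory of GPU 0
props = torch.cuda.get_device_properties(0)
total_gb = props.total_memory / (1024 ** 3)
target_gb = 100  # the script's default target

print(f"GPU 0 capacity: {total_gb:.2f} GB, allocation target: {target_gb} GB")
if target_gb > total_gb:
    print("Target exceeds physical memory; the job is guaranteed to fail with CUDA OOM.")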

Create

Click Create to submit the job. Once the job is created, you can view it in the Lepton Dashboard.

Diagnose Job Failure

The job will fail once an allocation exceeds the GPU's physical memory, before the 100GB target is ever reached. You will then see a failure message on the job details page, and for this example an error tag indicating the job failed due to ERR_GPU_OUT_OF_MEMORY. Hover over the error tag to see the detailed error message.
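Under the hood, the failure originates from PyTorch's CUDA allocator. If you want to reproduce the underlying error locally and inspect the message that the diagnosis feature presumably classifies as ERR_GPU_OUT_OF_MEMORY (that mapping is an assumption, not documented here), a minimal sketch looks like this:

import torch

try:
    # Deliberately over-allocate: keep 1GB tensors alive until allocation fails
    blocks = []
    while True:
        blocks.append(torch.rand(250_000_000, device='cuda:0'))
except torch.cuda.OutOfMemoryError as e:
    # torch.cuda.OutOfMemoryError (PyTorch >= 1.13) is the RuntimeError subclass
    # raised by the CUDA caching allocator on failed allocations
    print(f"Caught CUDA OOM: {e}")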

Job Failure