Job Failure Diagnosis
Learn how to diagnose job failures in DGX Cloud Lepton.
DGX Cloud Lepton includes a built-in feature that automatically diagnoses job failures, helping you quickly identify and resolve issues in batch jobs.
Let's walk through an example of how to use this feature to diagnose a job failure.
Create the Job
Navigate to the Batch Jobs page to create the job with the following configuration:
Resource
Because this example only tests the job failure diagnosis feature, you can create the job with any GPU type and its corresponding node group.
Container
In the container section, use the default image and paste the following command to run the job:
The above script creates a memory leak on a GPU by continuously allocating tensors with PyTorch and keeping them from being garbage collected until it reaches a target memory usage (10 GB by default, configurable via the command line).
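For reference, a script with the behavior described above could look like the following minimal sketch. It assumes PyTorch is installed in the image and a CUDA device is visible to the job. The function names and the `--target-gb` flag are hypothetical, and this is not necessarily the exact script used in this walkthrough.

```python
import argparse
import math

try:
    import torch
except ImportError:  # lets the helper below run on machines without PyTorch
    torch = None


def chunks_needed(target_gb: float, chunk_mb: int) -> int:
    """Number of chunk_mb-sized allocations needed to reach target_gb."""
    return math.ceil(target_gb * 1024 / chunk_mb)


def leak_gpu_memory(target_gb: float = 10.0, chunk_mb: int = 256) -> int:
    """Allocate GPU tensors and keep references so they are never freed.

    Returns the number of chunks held when the target is reached.
    """
    if torch is None or not torch.cuda.is_available():
        print("No CUDA device available; nothing to allocate.")
        return 0

    held = []  # holding references prevents garbage collection
    n_floats = chunk_mb * 1024 * 1024 // 4  # a float32 element is 4 bytes
    for i in range(chunks_needed(target_gb, chunk_mb)):
        held.append(torch.empty(n_floats, dtype=torch.float32, device="cuda"))
        allocated_gb = torch.cuda.memory_allocated() / 1024**3
        print(f"chunk {i + 1}: {allocated_gb:.2f} GB allocated", flush=True)
    return len(held)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Simulate a GPU memory leak.")
    # Hypothetical flag name; the default mirrors the 10 GB target in the text.
    parser.add_argument("--target-gb", type=float, default=10.0)
    args = parser.parse_args()
    leak_gpu_memory(args.target_gb)
```

Saved under a hypothetical filename such as `gpu_leak.py`, the sketch could be run with `python gpu_leak.py --target-gb 10`. If the GPU has less free memory than the target, PyTorch raises a CUDA out-of-memory error and the process exits with a non-zero status.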
Create
Click Create. Once the job is created, you can view it in the Lepton Dashboard.
Diagnose Job Failure
The job will fail after reaching the target memory usage. You will then see a failure message on the job details page.
For this example, you will see an error tag showing the job failed due to ERR_GPU_OUT_OF_MEMORY.
Hover over the error tag to see the detailed error message.
