Migrate Slurm Scripts to DGX Cloud Lepton Jobs

Slurm to Lepton Terminology

The following table outlines the key terminology mapping between Slurm and Lepton for job migration.

Slurm                                    Lepton
--job-name                               Job Name
--partition, --cluster, --qos            Node Group and Queue Priority
--nodes (-N), --ntasks (-n)              Workers
--gres=gpu:X, --cpus-per-task, --mem     Resource Shape
--time                                   Job Timeout
--constraint, --nodelist                 Node Group / Node ID
srun, mpirun command                     Run Command (DGX Cloud Lepton auto-wraps it with /bin/bash -c)
--export, environment                    Environment Variables and Secrets
Singularity / Docker image               Container Image
squeue, scancel                          Jobs List and Stop Job
Job arrays, dependencies                 Not supported yet
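
For reference, the sketch below annotates a typical Slurm batch header with the Lepton concept each directive maps to, following the table above. The time limit, partition, image path, and train.py script are placeholder values, not part of this guide's example.

#!/bin/bash
#SBATCH --job-name=test-job            # Lepton: Job Name
#SBATCH --partition=YOUR_PARTITION     # Lepton: Node Group and Queue Priority
#SBATCH --nodes=1                      # Lepton: Workers
#SBATCH --gres=gpu:8                   # Lepton: Resource Shape (e.g. gpu.8xh100-80gb)
#SBATCH --time=04:00:00                # Lepton: Job Timeout
#SBATCH --export=ALL                   # Lepton: Environment Variables and Secrets

# Lepton: Run Command (auto-wrapped with /bin/bash -c)
srun --container-image YOUR_CONTAINER_IMAGE_PATH python train.py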

Example

This example demonstrates how to migrate a Slurm script to a DGX Cloud Lepton job.

The following Slurm script launches an interactive job:

  • Requests 1 node with 8 GPUs
  • Uses a specific partition
  • Runs a container from a Singularity image file
  • Exports all current environment variables to the container
  • Opens an interactive bash shell inside the container

srun -N 1 -A sw_aidot \
    --job-name test-job \
    --gpus-per-node 8 \
    --partition=YOUR_PARTITION \
    --container-image YOUR_CONTAINER_IMAGE_PATH \
    --container-name YOUR_CONTAINER_NAME \
    --export=ALL \
    --pty bash -i

Now, convert this to a batch job on DGX Cloud Lepton.

Create the Job on DGX Cloud Lepton

Navigate to the Create a Job page, and configure the following fields:

  • Name: Enter test-job, matching the --job-name value configured in the Slurm script.
  • Resource:
    • Select a node group with available GPUs. If you don't have one, refer to this guide to request a node group. For this example, we request 1 node with 8 GPUs, so we can select a GPU type like H100-80GB-HBM3 and specify x8 for the GPU count.
    • To specify the priority, click on the dropdown in the Resource section and select one of the priorities. Refer to this guide for more details.
    • Set Workers to 1, matching the single node (-N 1) requested in the Slurm script. You can also specify a higher number of workers to run the job across multiple nodes and accelerate training.
  • Container Image: Specify a container image, such as a private image, in the Container section; this replaces the Singularity image path in the Slurm script. For this example, use the default image.
  • Advanced Settings: Configure advanced settings in the Advanced Settings section.
    • Add environment variables and secrets under Environment Variables. For example, you can add HUGGING_FACE_HUB_TOKEN as a securely stored secret on DGX Cloud Lepton, which you can then reference as an environment variable within the container.
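
For example, if HUGGING_FACE_HUB_TOKEN is exposed as an environment variable in the container as described above, the job's run command can read it directly. The snippet below is a minimal sketch; download_model.py is a hypothetical script that reads the token from the environment:

# Fail fast if the secret was not injected into the container
if [ -z "$HUGGING_FACE_HUB_TOKEN" ]; then
  echo "HUGGING_FACE_HUB_TOKEN is not set" >&2
  exit 1
fi

# Hypothetical workload that authenticates to Hugging Face via the token
python download_model.py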

Create the Job Using the DGX Cloud Lepton CLI

Jobs can also be submitted using the Lepton CLI. Install the latest version of the CLI with the following command:

pip install --upgrade leptonai

Once installed, the CLI can be invoked with lep. To see the list of available commands, run:

lep -h
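
Subcommands provide their own help as well. For example, the options for creating a job, including the list of available resource shapes referenced later in this guide, can be shown with:

lep job create -h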

The example srun command above can be submitted through the CLI as follows:

lep job create \
    --name test-job \
    --resource-shape gpu.8xh100-80gb \
    --node-group YOUR_NODE_GROUP \
    --container-image YOUR_CONTAINER_IMAGE \
    --env ENV1=ENV1VALUE \
    --env ENV2=ENV2VALUE \
    --secret NAME=SECRET_NAME \
    --command "sleep infinity"

Each flag breaks down as follows:

  • --name: This is the name of the job and is analogous to the --job-name flag in Slurm.
  • --resource-shape: This describes which resources should be allocated for the container. For multi-GPU workloads, it has the format gpu.Nx<gpu-type>, where N is the number of GPUs to allocate for the job and <gpu-type> is the name of the GPU to run on, such as h100-80gb. For single-GPU workloads, this would be gpu.h100-80gb. The list of available resource shapes can be viewed by running lep job create -h and checking the --resource-shape option. This is analogous to the --gpus-per-node flag in Slurm.
  • --node-group: This specifies which node group to run the job in. The list of available node groups can be found in the Nodes section of the DGX Cloud Lepton UI. This most closely resembles partitions in Slurm, but it goes further as different node groups could span different NVIDIA Cloud Providers (NCPs).
  • --container-image: This is the container image to use for the job from a container registry, such as nvcr.io/nvidia/pytorch:YY.MM-py3. This is analogous to the --container-image flag in Slurm.
  • --env: This is a KEY=VALUE pair to add as an environment variable. Multiple environment variables can be set by repeating the --env flag.
  • --secret: You can reference a secret that has been added to DGX Cloud Lepton so it is available in the job. Set NAME to the environment variable name the secret will be exposed as inside the container, and SECRET_NAME to the name of the secret stored in DGX Cloud Lepton.
  • --command: This is the actual command to run inside the job. This can either be something like sleep infinity to let the job run indefinitely in the background and allow users to connect remotely to the container, or one or multiple commands strung together to be run on container start.
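
As a sketch of a non-interactive submission, the same flags can be combined so the job runs a workload directly instead of sleeping. This assumes a secret named HUGGING_FACE_HUB_TOKEN already exists on DGX Cloud Lepton, and requirements.txt and train.py are hypothetical stand-ins for your own code:

lep job create \
    --name train-job \
    --resource-shape gpu.8xh100-80gb \
    --node-group YOUR_NODE_GROUP \
    --container-image nvcr.io/nvidia/pytorch:YY.MM-py3 \
    --secret HUGGING_FACE_HUB_TOKEN=HUGGING_FACE_HUB_TOKEN \
    --command "pip install -r requirements.txt && python train.py"

Because DGX Cloud Lepton wraps the run command with /bin/bash -c, chaining commands with && behaves as it would in a batch script.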

List Jobs

To list all jobs, run:

lep job list

To list jobs in specific states, run one of the following (the second form uses the short flags and state abbreviations):

lep job list --state running --state failed
lep job list -s r -s f
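
If the list grows long, the output can be filtered with standard shell tools. For example, assuming the job name appears in the list output:

lep job list | grep test-job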

Stop a Job

To stop a job, pass its job ID to the stop command:

lep job stop -i <job_id>

Create 10 Jobs

To submit several jobs at once, wrap lep job create in a shell loop. For example, to create 10 small CPU jobs:

for i in $(seq 1 10); do
  lep job create -n "test-job-$i" \
    --resource-shape "cpu.small" \
    --node-group YOUR_NODE_GROUP \
    --command "echo $i"
done

Copyright © 2025, NVIDIA Corporation.