Templates

DGX Cloud Lepton provides two templates to help you get started quickly:

  • MPI - For distributed computing with MPI
  • Torchrun - For PyTorch distributed training

To use these templates, select the Template option at the top of the create job page.


MPI Template

To use the MPI template, follow the steps below:

  1. Go to the Batch Jobs tab and click on Create Job.

  2. Select MPI at the top of the create job page.

  3. Fill in the MPI Command, job name, and other configurations.

    1. MPI Command: The template provides a default command. You can customize it using the predefined environment variables provided by the MPI template; a sample customized command is shown after these steps.
    Note

    In this documentation, we will use the term 'primary' instead of 'master' to align with modern terminology. Please note that UI, commands, and environment variables may still reference 'master'.

    The MPI template generates a bash script that sets up the MPI runtime environment. The following environment variables are automatically available in your job:

    • MASTER_ADDR: Hostname of the primary node (first node)
    • MASTER_IP: IP address of the primary node
    • THIS_ADDR: Hostname of the current node
    • HOST_ADDRS: Comma-separated list of all node hostnames
    • HOST_IPS: Comma-separated list of all node IP addresses
    • NNODES: Total number of worker nodes for the current job
    • NODE_RANK: Rank of the current node among all workers
    • NGPUS: Number of GPUs on the current node
    • HOSTFILE: Path to the file containing all node IPs (default: "/tmp/hostfile.txt")
    2. Job Name: Enter a descriptive name like mpi-job or your preferred identifier.
    3. Other Configurations: Use the default settings or customize as needed. See batch job configurations for detailed options.
  4. Click Create to launch your job.
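
As noted in step 3, the following is a minimal sketch of a customized MPI Command. It assumes an Open MPI installation inside your container image and a hypothetical training script named train.py; adjust both to match your environment.

    # Launch one rank per GPU on every node (Open MPI syntax).
    mpirun --allow-run-as-root \
      --hostfile "$HOSTFILE" \
      -np $((NNODES * NGPUS)) \
      --map-by "ppr:${NGPUS}:node" \
      -x MASTER_ADDR -x MASTER_IP \
      python train.py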

Torchrun Template

The Torchrun template streamlines PyTorch distributed training. Follow these steps:

  1. Go to the Batch Jobs tab and click Create Job.

  2. Select Torchrun at the top of the create job page.

  3. Fill in the Torchrun Command, job name, and other configurations.

    1. Torchrun Command: The template provides a default command. You can customize it using the predefined environment variables; a sample customized command is shown after these steps.
    Note

    In this documentation, we will use the term 'primary' instead of 'master' to align with modern terminology. Please note that UI, commands, and environment variables may still reference 'master'.

    The Torchrun template generates a bash script that predefines environment variables for your job:

    • PET_MASTER_ADDR: Hostname of the primary node (first node)
    • MASTER_IP: IP address of the primary node
    • THIS_ADDR: Hostname of the current node
    • PET_NNODES: Total number of worker nodes for the current job
    • PET_NODE_RANK: Rank of the current node among all workers
    • PET_NPROC_PER_NODE: Number of GPUs on the current node

    Environment variables prefixed with PET_ are automatically parsed as corresponding torchrun parameters. For example, PET_MASTER_ADDR becomes --master_addr. You can override these by explicitly setting parameters in your torchrun command.

    2. Job Name: Enter a descriptive name like torchrun-job or your preferred identifier.
    3. Other Configurations: Use the default settings or customize as needed. Refer to batch job configurations for detailed options.
  4. Click Create to launch your job.
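
As noted in step 3, the following is a minimal sketch of a customized Torchrun Command; train.py is a placeholder for your own training script. Because torchrun reads the PET_-prefixed variables automatically, the minimal form needs no distributed flags, and any flag you pass explicitly takes precedence over the corresponding PET_ value.

    # Minimal form: torchrun picks up PET_MASTER_ADDR, PET_NNODES,
    # PET_NODE_RANK, and PET_NPROC_PER_NODE automatically.
    torchrun train.py

    # Explicit flags override the PET_ values, for example to pin the
    # rendezvous port or run fewer processes per node than GPUs:
    torchrun --master_port=29500 --nproc_per_node=2 train.py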

Copyright © 2025, NVIDIA Corporation.