Environment Variables Reference

Basic Environment Variables

When a batch job is launched on DGX Cloud Lepton, the following environment variables are automatically set, with values corresponding to the job configuration.

| Environment Variable Name | Description | Sample Value |
| --- | --- | --- |
| LEPTON_RESOURCE_ACCELERATOR_NUM | Number of hardware accelerators allocated | 1 |
| LEPTON_JOB_WORKER_HOSTNAME_PREFIX | Prefix used for naming worker hostnames | worker |
| LEPTON_WORKSPACE_ID | Identifier for the current workspace | prod01awsuswest |
| LEPTON_RESOURCE_ACCELERATOR_TYPE | Type of hardware accelerator used | NVIDIA-A100-80GB |
| LEPTON_WORKER_ID | Unique identifier for the current worker | env-job-98bw-0-2nm7s |
| LEPTON_JOB_FAILURE_COUNT | Number of failed job attempts | 0 |
| LEPTON_JOB_TOTAL_WORKERS | Total number of workers assigned to the job | 1 |
| LEPTON_JOB_WORKER_INDEX | Index of the current worker within the job | 0 |
| LEPTON_SUBDOMAIN | Subdomain name assigned to the job service | env-job-98bw-job-svc |
| LEPTON_JOB_SERVICE_PREFIX | Prefix used for naming services related to the job | env-job-98bw |
| LEPTON_JOB_NAME | Name assigned to the job | env-job-98bw |
| LEPTON_VIRTUAL_ENV | Path to the Python virtual environment | /opt/lepton/venv |
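
For instance, a minimal sketch for inspecting these variables from inside a running worker, which can be useful when debugging a job configuration (the variable names are those listed in the table above):

#!/usr/bin/env bash
# Print the Lepton-provided job variables, or "<unset>" if one is missing.
for var in LEPTON_JOB_NAME LEPTON_JOB_TOTAL_WORKERS LEPTON_JOB_WORKER_INDEX \
           LEPTON_RESOURCE_ACCELERATOR_TYPE LEPTON_RESOURCE_ACCELERATOR_NUM \
           LEPTON_WORKSPACE_ID LEPTON_SUBDOMAIN; do
    # ${!var} is bash indirect expansion: the value of the variable named by $var.
    echo "${var}=${!var:-<unset>}"
done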

Reference for torch.distributed.launch

Using the DGX Cloud Lepton environment variables, you can construct the environment variables required by common AI training framework launchers.

For example, if you use torch.distributed.launch (or its successor, torch.distributed.run), the required environment variables can be constructed as follows:

| Environment Variable Name | Meaning | Construction Method |
| --- | --- | --- |
| SERVICE_PREFIX | Prefix for the job service, defaulting to LEPTON_JOB_SERVICE_PREFIX or LEPTON_JOB_NAME | ${LEPTON_JOB_SERVICE_PREFIX:-${LEPTON_JOB_NAME}} |
| SUBDOMAIN | Subdomain for the job service, defaulting to LEPTON_SUBDOMAIN or LEPTON_JOB_NAME-job-svc | ${LEPTON_SUBDOMAIN:-${LEPTON_JOB_NAME}-job-svc} |
| MASTER_ADDR | Address of the master node for distributed training | ${SERVICE_PREFIX}-0.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local |
| MASTER_PORT | Port used for master node communication | 29400 |
| WORLD_SIZE | Total number of workers assigned to the job | ${LEPTON_JOB_TOTAL_WORKERS} |
| WORKER_ADDRS | Comma-separated addresses of the non-master worker nodes | seq 1 $((LEPTON_JOB_TOTAL_WORKERS - 1)) \| xargs -I {} echo ${SERVICE_PREFIX}-{}.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local \| paste -sd ',' - |
| NODE_RANK | Rank of the current worker node in the distributed setup | ${LEPTON_JOB_WORKER_INDEX} |
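
Because MASTER_ADDR is a cluster-local DNS name, a quick sanity check before launching is to wait until it resolves. A minimal sketch, assuming the container image provides the getent utility (common in glibc-based images):

# Wait until the master worker's DNS name resolves inside the cluster.
until getent hosts "${MASTER_ADDR}" >/dev/null; do
    echo "Waiting for ${MASTER_ADDR} to resolve..."
    sleep 2
done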

Here is an example of a short script that sets up these environment variables and runs the training job:

#!/usr/bin/env bash

# Construct the service prefix and subdomain, falling back to the job name.
SERVICE_PREFIX="${LEPTON_JOB_SERVICE_PREFIX:-${LEPTON_JOB_NAME}}"
SUBDOMAIN="${LEPTON_SUBDOMAIN:-${LEPTON_JOB_NAME}-job-svc}"

# Worker 0 acts as the master node for the distributed run.
export MASTER_ADDR="${SERVICE_PREFIX}-0.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local"
export MASTER_PORT=29400
export WORLD_SIZE="${LEPTON_JOB_TOTAL_WORKERS}"

# Build a comma-separated list of the remaining (non-master) worker addresses.
export WORKER_ADDRS="$(seq 1 $((LEPTON_JOB_TOTAL_WORKERS - 1)) | xargs -I {} echo "${SERVICE_PREFIX}-{}.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local" | paste -sd ',' -)"
export NODE_RANK="${LEPTON_JOB_WORKER_INDEX}"

# Run the distributed training script.
python -m torch.distributed.run \
    --nnodes="$WORLD_SIZE" \
    --nproc_per_node="$LEPTON_RESOURCE_ACCELERATOR_NUM" \
    --node_rank="$NODE_RANK" \
    --master_addr="$MASTER_ADDR" \
    --master_port="$MASTER_PORT" \
    /path/to/train.py
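
Note that torch.distributed.run is the successor to the deprecated torch.distributed.launch, and recent PyTorch releases expose it directly as the torchrun command. As a sketch, an equivalent launch using torchrun's c10d rendezvous (which assigns node ranks automatically, so NODE_RANK is not needed) would be:

# Alternative: torchrun with c10d rendezvous instead of explicit master_addr/node_rank.
torchrun \
    --nnodes="$WORLD_SIZE" \
    --nproc_per_node="$LEPTON_RESOURCE_ACCELERATOR_NUM" \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${MASTER_ADDR}:${MASTER_PORT}" \
    --rdzv_id="$LEPTON_JOB_NAME" \
    /path/to/train.py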