Environment Variables Reference
Basic Environment Variables
When a batch job is launched on DGX Cloud Lepton, the following environment variables are automatically set, with values corresponding to the job's configuration.
| Environment Variable Name | Description | Sample Value |
|---|---|---|
| LEPTON_RESOURCE_ACCELERATOR_NUM | Number of hardware accelerators allocated | 1 |
| LEPTON_JOB_WORKER_HOSTNAME_PREFIX | Prefix used for naming worker hostnames | worker |
| LEPTON_WORKSPACE_ID | Identifier for the current workspace | prod01awsuswest |
| LEPTON_RESOURCE_ACCELERATOR_TYPE | Type of hardware accelerator used | NVIDIA-A100-80GB |
| LEPTON_WORKER_ID | Unique identifier for the current worker | env-job-98bw-0-2nm7s |
| LEPTON_JOB_FAILURE_COUNT | Number of failed job attempts | 0 |
| LEPTON_JOB_TOTAL_WORKERS | Total number of workers assigned to the job | 1 |
| LEPTON_JOB_WORKER_INDEX | Index of the current worker within the job | 0 |
| LEPTON_SUBDOMAIN | Subdomain name assigned to the job service | env-job-98bw-job-svc |
| LEPTON_JOB_SERVICE_PREFIX | Prefix used for naming services related to the job | env-job-98bw |
| LEPTON_JOB_WORKER_PREFIX | Prefix used for naming workers | worker |
| LEPTON_JOB_NAME | Name assigned to the job | env-job-98bw |
| LEPTON_VIRTUAL_ENV | Path to the Python virtual environment | /opt/lepton/venv |
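A job process can read these variables at runtime to adapt its behavior. The following is a minimal Python sketch; the variable names are the ones listed above, while the fallback defaults and the printed summary are illustrative assumptions, not part of the platform:

```python
import os

# Read the Lepton-provided environment variables; the defaults below are
# illustrative fallbacks for running the same script outside a Lepton job.
num_accelerators = int(os.environ.get("LEPTON_RESOURCE_ACCELERATOR_NUM", "1"))
accelerator_type = os.environ.get("LEPTON_RESOURCE_ACCELERATOR_TYPE", "unknown")
total_workers = int(os.environ.get("LEPTON_JOB_TOTAL_WORKERS", "1"))
worker_index = int(os.environ.get("LEPTON_JOB_WORKER_INDEX", "0"))
job_name = os.environ.get("LEPTON_JOB_NAME", "local")

print(f"Job {job_name}: worker {worker_index + 1}/{total_workers} "
      f"with {num_accelerators} x {accelerator_type}")
```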
Reference for torch.distributed.launch
Using the DGX Cloud Lepton environment variables, you can construct the environment variables required by various AI training frameworks.
For example, if you use torch.distributed.launch, the required environment variables can be set up as follows:
| Environment Variable Name | Meaning | Construction Method |
|---|---|---|
| MASTER_ADDR | Address of the master node for distributed training | ${LEPTON_JOB_WORKER_PREFIX}-0.${LEPTON_SUBDOMAIN} |
| MASTER_PORT | Port for master node communication | 29400 |
| WORLD_SIZE | Total number of workers assigned to the job | ${LEPTON_JOB_TOTAL_WORKERS} |
| WORKER_ADDRS | Comma-separated addresses of the worker nodes (excluding the master) | $(seq 1 $((LEPTON_JOB_TOTAL_WORKERS - 1)) \| xargs -I {} echo ${LEPTON_JOB_WORKER_PREFIX}-{}.${LEPTON_SUBDOMAIN} \| paste -sd ',' -) |
| NODE_RANK | Rank of the current worker node in the distributed setup | ${LEPTON_JOB_WORKER_INDEX} |
Here is an example of a short script that sets up the environment variables for a job and launches the training:
```bash
#!/usr/bin/env bash
# Set the environment variables required by torch.distributed.
export MASTER_ADDR=${LEPTON_JOB_WORKER_PREFIX}-0.${LEPTON_SUBDOMAIN}
export MASTER_PORT=29400
export WORLD_SIZE=${LEPTON_JOB_TOTAL_WORKERS}
export WORKER_ADDRS=$(seq 1 $((LEPTON_JOB_TOTAL_WORKERS - 1)) | xargs -I {} echo ${LEPTON_JOB_WORKER_PREFIX}-{}.${LEPTON_SUBDOMAIN} | paste -sd ',' -)
export NODE_RANK=${LEPTON_JOB_WORKER_INDEX}

# Run the distributed training script.
python -m torch.distributed.run \
    --nnodes=$WORLD_SIZE \
    --nproc_per_node=$LEPTON_RESOURCE_ACCELERATOR_NUM \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    /path/to/train.py
```
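On the training side, torch.distributed.run spawns one process per accelerator and exports RANK, LOCAL_RANK, and WORLD_SIZE for each process, so the training script can initialize the default process group directly from the environment. Below is a minimal sketch of what the training script might contain; the model and the elided training loop are hypothetical placeholders, not part of the Lepton configuration:

```python
import os

import torch
import torch.distributed as dist


def main():
    # torch.distributed.run exports RANK, LOCAL_RANK, and WORLD_SIZE for each
    # spawned process; init_process_group() reads them via the default env:// method.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Hypothetical model; wrap it in DistributedDataParallel for multi-GPU training.
    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    # ... training loop goes here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```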