Environment Variables Reference
Basic Environment Variables
When a batch job is launched on DGX Cloud Lepton, the following environment variables are automatically set, with values that correspond to the job's configuration.
| Environment Variable Name | Description | Sample Value |
| --- | --- | --- |
| LEPTON_RESOURCE_ACCELERATOR_NUM | Number of hardware accelerators allocated to each worker | 1 |
| LEPTON_JOB_WORKER_HOSTNAME_PREFIX | Prefix used for naming worker hostnames | worker |
| LEPTON_WORKSPACE_ID | Identifier for the current workspace | prod01awsuswest |
| LEPTON_RESOURCE_ACCELERATOR_TYPE | Type of hardware accelerator used | NVIDIA-A100-80GB |
| LEPTON_WORKER_ID | Unique identifier for the current worker | env-job-98bw-0-2nm7s |
| LEPTON_JOB_FAILURE_COUNT | Number of failed job attempts | 0 |
| LEPTON_JOB_TOTAL_WORKERS | Total number of workers assigned to the job | 1 |
| LEPTON_JOB_WORKER_INDEX | Zero-based index of the current worker within the job | 0 |
| LEPTON_SUBDOMAIN | Subdomain name assigned to the job service | env-job-98bw-job-svc |
| LEPTON_JOB_SERVICE_PREFIX | Prefix used for naming services related to the job | env-job-98bw |
| LEPTON_JOB_NAME | Name assigned to the job | env-job-98bw |
| LEPTON_VIRTUAL_ENV | Path to the Python virtual environment | /opt/lepton/venv |
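These are ordinary process environment variables, so the job's command can read them directly in any language. As a minimal sketch (the log format is illustrative, not part of the platform), a job could record its placement like this:
#!/usr/bin/env bash
# All variables below are set automatically by DGX Cloud Lepton at launch.
echo "Worker ${LEPTON_JOB_WORKER_INDEX} of ${LEPTON_JOB_TOTAL_WORKERS} in job ${LEPTON_JOB_NAME} (ID: ${LEPTON_WORKER_ID})"
echo "Accelerators: ${LEPTON_RESOURCE_ACCELERATOR_NUM} x ${LEPTON_RESOURCE_ACCELERATOR_TYPE}"
echo "Workspace: ${LEPTON_WORKSPACE_ID}, Python venv: ${LEPTON_VIRTUAL_ENV}"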
Reference for torch.distributed.launch
Using the DGX Cloud Lepton environment variables, you can construct the environment variables required by common distributed training frameworks. For example, if you use torch.distributed.launch (or its successor, torch.distributed.run), the required environment variables can be derived as follows:
| Environment Variable Name | Meaning | Construction Method |
| --- | --- | --- |
| SERVICE_PREFIX | Prefix for the job service, defaulting to LEPTON_JOB_SERVICE_PREFIX or LEPTON_JOB_NAME | ${LEPTON_JOB_SERVICE_PREFIX:-${LEPTON_JOB_NAME}} |
| SUBDOMAIN | Subdomain for the job service, defaulting to LEPTON_SUBDOMAIN or LEPTON_JOB_NAME-job-svc | ${LEPTON_SUBDOMAIN:-${LEPTON_JOB_NAME}-job-svc} |
| MASTER_ADDR | Address of the master node (worker 0) for distributed training | ${SERVICE_PREFIX}-0.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local |
| MASTER_PORT | Port used for master node communication | 29400 |
| WORLD_SIZE | Total number of workers assigned to the job | ${LEPTON_JOB_TOTAL_WORKERS} |
| WORKER_ADDRS | Comma-separated addresses of the non-master worker nodes | $(seq 1 $((LEPTON_JOB_TOTAL_WORKERS - 1)) \| xargs -I {} echo ${SERVICE_PREFIX}-{}.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local \| paste -sd ',' -) |
| NODE_RANK | Rank of the current worker node in the distributed setup | ${LEPTON_JOB_WORKER_INDEX} |
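As a worked expansion, here are the construction rules applied to the sample values from the tables above. LEPTON_JOB_TOTAL_WORKERS=3 is assumed for illustration, since the sample value of 1 would leave the non-master worker list empty:
# Worked example with the sample values above (3 workers assumed for illustration).
LEPTON_JOB_NAME=env-job-98bw
LEPTON_JOB_SERVICE_PREFIX=env-job-98bw
LEPTON_SUBDOMAIN=env-job-98bw-job-svc
LEPTON_WORKSPACE_ID=prod01awsuswest
SERVICE_PREFIX="${LEPTON_JOB_SERVICE_PREFIX:-${LEPTON_JOB_NAME}}"
SUBDOMAIN="${LEPTON_SUBDOMAIN:-${LEPTON_JOB_NAME}-job-svc}"
echo "${SERVICE_PREFIX}-0.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local"
# Prints: env-job-98bw-0.env-job-98bw-job-svc.ws-prod01awsuswest.svc.cluster.local
# Workers 1 and 2 would similarly resolve as:
#   env-job-98bw-1.env-job-98bw-job-svc.ws-prod01awsuswest.svc.cluster.local
#   env-job-98bw-2.env-job-98bw-job-svc.ws-prod01awsuswest.svc.cluster.local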
Here is an example of a short script that sets up these environment variables for a job:
#!/usr/bin/env bash
# Derive the service prefix and subdomain, falling back to values based on the job name.
SERVICE_PREFIX="${LEPTON_JOB_SERVICE_PREFIX:-${LEPTON_JOB_NAME}}"
SUBDOMAIN="${LEPTON_SUBDOMAIN:-${LEPTON_JOB_NAME}-job-svc}"
# Worker 0 acts as the master node for rendezvous.
export MASTER_ADDR="${SERVICE_PREFIX}-0.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local"
export MASTER_PORT=29400
export WORLD_SIZE="${LEPTON_JOB_TOTAL_WORKERS}"
# Comma-separated addresses of workers 1..N-1 (empty for single-worker jobs).
export WORKER_ADDRS=$(seq 1 $((LEPTON_JOB_TOTAL_WORKERS - 1)) | xargs -I {} echo "${SERVICE_PREFIX}-{}.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local" | paste -sd ',' -)
export NODE_RANK="${LEPTON_JOB_WORKER_INDEX}"
# Run the distributed training script.
python -m torch.distributed.run \
  --nnodes="$WORLD_SIZE" \
  --nproc_per_node="$LEPTON_RESOURCE_ACCELERATOR_NUM" \
  --node_rank="$NODE_RANK" \
  --master_addr="$MASTER_ADDR" \
  --master_port="$MASTER_PORT" \
  /path/to/train.py
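Note that python -m torch.distributed.run is the module form of the torchrun entry point in recent PyTorch releases, so the launch line above can equivalently start with torchrun. Because the constructed addresses are Kubernetes service DNS names (the .svc.cluster.local suffix), they may take a moment to become resolvable after the job starts. As a hedged sketch (the retry count and sleep interval are arbitrary choices, not platform requirements), a worker can wait for the master before launching:
# Optional: wait for the master hostname to resolve before starting training.
# The 60-attempt / 5-second values below are illustrative only.
for _ in $(seq 1 60); do
  getent hosts "${MASTER_ADDR}" > /dev/null && break
  sleep 5
done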