Environment Variables Reference
Basic Environment Variables
When a batch job is launched on DGX Cloud Lepton, the following environment variables are automatically set, with values that correspond to the job's configuration.
| Environment Variable Name | Description | Sample Value |
| --- | --- | --- |
| LEPTON_RESOURCE_ACCELERATOR_NUM | Number of hardware accelerators allocated to each worker | 1 |
| LEPTON_JOB_WORKER_HOSTNAME_PREFIX | Prefix used for naming worker hostnames | worker |
| LEPTON_WORKSPACE_ID | Identifier for the current workspace | prod01awsuswest |
| LEPTON_RESOURCE_ACCELERATOR_TYPE | Type of hardware accelerator used | NVIDIA-A100-80GB |
| LEPTON_WORKER_ID | Unique identifier for the current worker | env-job-98bw-0-2nm7s |
| LEPTON_JOB_FAILURE_COUNT | Number of failed job attempts | 0 |
| LEPTON_JOB_TOTAL_WORKERS | Total number of workers assigned to the job | 1 |
| LEPTON_JOB_WORKER_INDEX | Zero-based index of the current worker within the job | 0 |
| LEPTON_SUBDOMAIN | Subdomain name assigned to the job service | env-job-98bw-job-svc |
| LEPTON_JOB_SERVICE_PREFIX | Prefix used for naming services related to the job | env-job-98bw |
| LEPTON_JOB_NAME | Name assigned to the job | env-job-98bw |
| LEPTON_VIRTUAL_ENV | Path to the Python virtual environment | /opt/lepton/venv |
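These are ordinary process environment variables, so the job's command can read them directly in any language. As a minimal sketch (the log format is illustrative, not part of the platform), a job could record its placement like this:
#!/usr/bin/env bash
# All variables below are set automatically by DGX Cloud Lepton at launch.
echo "Worker ${LEPTON_JOB_WORKER_INDEX} of ${LEPTON_JOB_TOTAL_WORKERS} in job ${LEPTON_JOB_NAME} (ID: ${LEPTON_WORKER_ID})"
echo "Accelerators: ${LEPTON_RESOURCE_ACCELERATOR_NUM} x ${LEPTON_RESOURCE_ACCELERATOR_TYPE}"
echo "Workspace: ${LEPTON_WORKSPACE_ID}, Python venv: ${LEPTON_VIRTUAL_ENV}"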
Reference for torch.distributed.launch
Using the DGX Cloud Lepton environment variables, you can construct the environment variables required by common distributed training frameworks. For example, if you use torch.distributed.launch (or its successor, torch.distributed.run), the required environment variables can be derived as follows:
| Environment Variable Name | Meaning | Construction Method |
| --- | --- | --- |
| SERVICE_PREFIX | Prefix for the job service, defaulting to LEPTON_JOB_SERVICE_PREFIX or LEPTON_JOB_NAME | ${LEPTON_JOB_SERVICE_PREFIX:-${LEPTON_JOB_NAME}} |
| SUBDOMAIN | Subdomain for the job service, defaulting to LEPTON_SUBDOMAIN or LEPTON_JOB_NAME-job-svc | ${LEPTON_SUBDOMAIN:-${LEPTON_JOB_NAME}-job-svc} |
| MASTER_ADDR | Address of the master node (worker 0) for distributed training | ${SERVICE_PREFIX}-0.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local |
| MASTER_PORT | Port used for master node communication | 29400 |
| WORLD_SIZE | Total number of workers assigned to the job | ${LEPTON_JOB_TOTAL_WORKERS} |
| WORKER_ADDRS | Comma-separated addresses of the non-master worker nodes | $(seq 1 $((LEPTON_JOB_TOTAL_WORKERS - 1)) \| xargs -I {} echo ${SERVICE_PREFIX}-{}.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local \| paste -sd ',' -) |
| NODE_RANK | Rank of the current worker node in the distributed setup | ${LEPTON_JOB_WORKER_INDEX} |
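As a worked expansion, here are the construction rules applied to the sample values from the tables above. LEPTON_JOB_TOTAL_WORKERS=3 is assumed for illustration, since the sample value of 1 would leave the non-master worker list empty:
# Worked example with the sample values above (3 workers assumed for illustration).
LEPTON_JOB_NAME=env-job-98bw
LEPTON_JOB_SERVICE_PREFIX=env-job-98bw
LEPTON_SUBDOMAIN=env-job-98bw-job-svc
LEPTON_WORKSPACE_ID=prod01awsuswest
SERVICE_PREFIX="${LEPTON_JOB_SERVICE_PREFIX:-${LEPTON_JOB_NAME}}"
SUBDOMAIN="${LEPTON_SUBDOMAIN:-${LEPTON_JOB_NAME}-job-svc}"
echo "${SERVICE_PREFIX}-0.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local"
# Prints: env-job-98bw-0.env-job-98bw-job-svc.ws-prod01awsuswest.svc.cluster.local
# Workers 1 and 2 would similarly resolve as:
#   env-job-98bw-1.env-job-98bw-job-svc.ws-prod01awsuswest.svc.cluster.local
#   env-job-98bw-2.env-job-98bw-job-svc.ws-prod01awsuswest.svc.cluster.local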
Here is an example of a short script that sets up these environment variables for a job:
#!/usr/bin/env bash
# Derive the service prefix and subdomain, falling back to values based on the job name.
SERVICE_PREFIX="${LEPTON_JOB_SERVICE_PREFIX:-${LEPTON_JOB_NAME}}"
SUBDOMAIN="${LEPTON_SUBDOMAIN:-${LEPTON_JOB_NAME}-job-svc}"
# Worker 0 acts as the master node for rendezvous.
export MASTER_ADDR="${SERVICE_PREFIX}-0.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local"
export MASTER_PORT=29400
export WORLD_SIZE="${LEPTON_JOB_TOTAL_WORKERS}"
# Comma-separated addresses of workers 1..N-1 (empty for single-worker jobs).
export WORKER_ADDRS=$(seq 1 $((LEPTON_JOB_TOTAL_WORKERS - 1)) | xargs -I {} echo "${SERVICE_PREFIX}-{}.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local" | paste -sd ',' -)
export NODE_RANK="${LEPTON_JOB_WORKER_INDEX}"
# Run the distributed training script.
python -m torch.distributed.run \
  --nnodes="$WORLD_SIZE" \
  --nproc_per_node="$LEPTON_RESOURCE_ACCELERATOR_NUM" \
  --node_rank="$NODE_RANK" \
  --master_addr="$MASTER_ADDR" \
  --master_port="$MASTER_PORT" \
  /path/to/train.py
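Note that python -m torch.distributed.run is the module form of the torchrun entry point in recent PyTorch releases, so the launch line above can equivalently start with torchrun. Because the constructed addresses are Kubernetes service DNS names (the .svc.cluster.local suffix), they may take a moment to become resolvable after the job starts. As a hedged sketch (the retry count and sleep interval are arbitrary choices, not platform requirements), a worker can wait for the master before launching:
# Optional: wait for the master hostname to resolve before starting training.
# The 60-attempt / 5-second values below are illustrative only.
for _ in $(seq 1 60); do
  getent hosts "${MASTER_ADDR}" > /dev/null && break
  sleep 5
done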