Environment Variables

To run jobs successfully and with good performance on your DGX Cloud cluster (via sbatch or srun, for example), specific environment variables should be set within the job. This is especially important for multi-node jobs and large-scale usage of the cluster, where these variables are required.

The job examples in this user guide, such as those in the Running Example Jobs section, each source env-vars.sh.

The contents of this script and instructions for creating this file are covered in the sections below.
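Because the job examples rely on env-vars.sh being sourced, a job that cannot read the file would otherwise continue without these settings. A minimal sketch of a guarded source step is shown here. The helper name load_env_vars is hypothetical (it is not part of the guide), and the default path assumes the file was saved in /lustrefs/fs0/scratch; adjust it to wherever your copy lives.

```shell
# Hypothetical helper (not part of the guide): source the shared environment
# file and fail loudly if it is missing or unreadable.
load_env_vars() {
    # Assumed default location; pass a different path as the first argument.
    local env_file="${1:-/lustrefs/fs0/scratch/env-vars.sh}"
    if [ ! -r "$env_file" ]; then
        echo "environment file not readable: $env_file" >&2
        return 1
    fi
    # Source into the current shell so the exports reach later commands.
    . "$env_file"
}

# Example use near the top of a job script:
#   load_env_vars /lustrefs/fs0/scratch/env-vars.sh || exit 1
```

Stopping the job when the file is absent is a design choice: it trades a quick, obvious failure for a multi-node run that starts with default settings.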

The required environment variables and their values differ depending on where your DGX Cloud cluster is running. If you do not know which cloud service provider hosts your DGX Cloud cluster, contact your cluster admin or NVIDIA Technical Account Manager for details.

For DGX Cloud clusters running in Azure using A100 GPUs, an example environment variable script is provided below. You can create this script in a shared location such as /lustrefs/fs0/scratch so that it is accessible to all users of the cluster.

To create the script, first ensure that the directory where you intend to save it already exists, then use the text editor of your choice to create a file called env-vars.sh in that directory.

Paste the following content into the file, then save and exit the editor.

#!/bin/bash
export OMPI_MCA_coll_hcoll_enable=0
export UCX_TLS=rc
export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_TOPO_FILE=/cm/shared/etc/ndv4-topo.xml
export NCCL_DEBUG=INFO
export NCCL_PROTO=LL,LL128,Simple
export NCCL_ALGO=Tree,Ring,CollnetDirect,CollnetChain,NVLS
export MELLANOX_VISIBLE_DEVICES=all
export PMIX_MCA_gds=hash
export PMIX_MCA_psec=native

These environment variables specify a topology file via NCCL_TOPO_FILE. When present for a given deployment, the topology file is available under the /cm/shared/etc/ path. Passing the /cm/shared value to the --container-mounts argument makes this path visible inside the resulting job.
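Since the topology file exists only on some deployments and is visible in a container only when /cm/shared is mounted, a job can check for it before starting. The sketch below is one way to do that. The helper name topo_file_visible is hypothetical, and the srun line in the comments is illustrative only: the image name is a placeholder and the exact arguments depend on your job.

```shell
# Hypothetical helper (not part of the guide): succeed only if NCCL_TOPO_FILE
# is set and points at a file readable from the current environment.
topo_file_visible() {
    [ -n "${NCCL_TOPO_FILE:-}" ] && [ -r "$NCCL_TOPO_FILE" ]
}

# Example use inside the job, after sourcing env-vars.sh:
#   topo_file_visible || echo "warning: NCCL_TOPO_FILE not readable: ${NCCL_TOPO_FILE:-unset}" >&2
#
# Illustrative launch that mounts /cm/shared into the container
# (<image> is a placeholder, not a real image name):
#   srun --container-image=<image> --container-mounts=/cm/shared:/cm/shared ...
```

Running the check inside the container rather than on the login node is what confirms that the mount took effect.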

For DGX Cloud clusters running in OCI using A100 GPUs, an example environment variable script is provided below. You can create this script in a shared location such as /lustrefs/fs0/scratch so that it is accessible to all users of the cluster.

To create the script, first ensure that the directory where you intend to save it already exists, then use the text editor of your choice to create a file called env-vars.sh in that directory.

Paste the following content into the file, then save and exit the editor.

#!/bin/bash
export NVIDIA_DRIVER_CAPABILITIES=all
export MELLANOX_VISIBLE_DEVICES=all
export MEM_AFFINITY="0:0:0:0:1:1:1:1"
export GPU_AFFINITY="0:0:0:0:1:1:1:1"
export CPU_AFFINITY="0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111"
export OMPI_MCA_pml="ucx"
export OMPI_MCA_coll="^hcoll"
export OMPI_MCA_coll_hcoll_enable=0
export HCOLL_ENABLE_MCAST_ALL=0
export NCCL_IB_TIMEOUT=18
export NCCL_IB_SL=0
export NCCL_IB_TC=41
export NCCL_IGNORE_CPU_AFFINITY=0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_CROSS_NIC=0
export NCCL_IB_HCA="^=mlx5_0,mlx5_13"
export RX_QUEUE_LEN=8192
export IB_RX_QUEUE_LEN=8192
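The creation steps described in this section (make sure the directory exists, create env-vars.sh, paste the content) can also be done without an interactive editor. The following is a minimal sketch under that assumption. The helper name create_env_script is hypothetical, and the permission mode is a suggestion rather than a requirement from this guide.

```shell
# Hypothetical helper (not part of the guide): create the target directory if
# needed and write env-vars.sh there from standard input.
create_env_script() {
    local target_dir="$1"
    mkdir -p "$target_dir" || return 1
    cat > "$target_dir/env-vars.sh" || return 1
    # Assumed mode: readable by all users. The file is sourced, not executed.
    chmod 644 "$target_dir/env-vars.sh"
}

# Example use with a quoted here-document, so the shell writes the content
# exactly as typed (paste the full listing for your cluster between the markers):
#   create_env_script /lustrefs/fs0/scratch <<'EOF'
#   #!/bin/bash
#   export NVIDIA_DRIVER_CAPABILITIES=all
#   export MELLANOX_VISIBLE_DEVICES=all
#   EOF
```

Reading the content from standard input keeps the helper independent of which provider's listing is used.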

For DGX Cloud clusters running in OCI using H100 GPUs, an example environment variable script is provided below. It matches the OCI A100 script above except for the NCCL_IB_HCA value. You can create this script in a shared location such as /lustrefs/fs0/scratch so that it is accessible to all users of the cluster.

To create the script, first ensure that the directory where you intend to save it already exists, then use the text editor of your choice to create a file called env-vars.sh in that directory.

Paste the following content into the file, then save and exit the editor.

#!/bin/bash
export NVIDIA_DRIVER_CAPABILITIES=all
export MELLANOX_VISIBLE_DEVICES=all
export MEM_AFFINITY="0:0:0:0:1:1:1:1"
export GPU_AFFINITY="0:0:0:0:1:1:1:1"
export CPU_AFFINITY="0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111"
export OMPI_MCA_pml="ucx"
export OMPI_MCA_coll="^hcoll"
export OMPI_MCA_coll_hcoll_enable=0
export HCOLL_ENABLE_MCAST_ALL=0
export NCCL_IB_TIMEOUT=18
export NCCL_IB_SL=0
export NCCL_IB_TC=41
export NCCL_IGNORE_CPU_AFFINITY=0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_CROSS_NIC=0
export NCCL_IB_HCA="^mlx5_2,mlx5_11"
export RX_QUEUE_LEN=8192
export IB_RX_QUEUE_LEN=8192
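After a job sources env-vars.sh, it can be useful to confirm which of these settings actually reached the job environment. The sketch below lists the exported variables whose names begin with the prefixes used in the scripts on this page. The helper name show_comm_env is hypothetical, and the prefix list is an assumption that you may want to extend (for example, to cover the affinity variables).

```shell
# Hypothetical helper (not part of the guide): print exported variables whose
# names start with the NCCL, UCX, Open MPI, PMIx, or HCOLL prefixes.
show_comm_env() {
    env | grep -E '^(NCCL|UCX|OMPI_MCA|PMIX_MCA|HCOLL)_' | sort
}

# Example use in a job script, after sourcing env-vars.sh:
#   show_comm_env
```

Printing the values into the job log gives a record of the settings each run used, which helps when comparing runs across clusters.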