1. Introduction

This guide is intended to provide cluster users with information they need to run jobs and develop workflows on NVIDIA DGX™ Cloud.

Information for cluster owners and admins can be found in the NVIDIA DGX Cloud Cluster Admin Guide.

The NVIDIA DGX Cloud Onboarding Quick Start Guide gives cluster owners, cluster admins, and cluster users the information they need to get started with the DGX Cloud cluster.

1.1. NVIDIA DGX Cloud Overview

NVIDIA DGX Cloud is an AI platform for enterprise developers, optimized for the demands of generative AI.

DGX Cloud Stack Diagram

DGX Cloud delivers an integrated, full-stack solution co-engineered with leading cloud partners, combining their best-of-breed architectures with the newest NVIDIA AI technologies across accelerated computing, networking fabric, and software, and providing direct access to NVIDIA AI experts.

DGX Cloud includes NVIDIA AI Enterprise, offering accelerated data science libraries, optimized frameworks, and pre-trained models that enable developers to quickly and easily deliver production-ready AI.

1.2. Overview of Your Cluster

The DGX Cloud cluster includes the following components provisioned and configured for use:

  • Head (management) node(s)

  • Login nodes

  • CPU worker nodes

  • GPU worker nodes

  • Shared NFS storage

  • Shared high-performance Lustre storage

The exact configuration and capacity of each component is customized to the specific requirements provided during the onboarding process.

The Slurm workload management tool is provided as the primary interface of the cluster for users familiar with HPC schedulers who want direct access to underlying systems.

The Slurm implementation is deployed and configured by NVIDIA Base Command Manager, and leverages Pyxis and Enroot as the container plugin and runtime.

The cluster configuration via Base Command Manager (BCM) and the Slurm configuration can be further modified and customized by the customer for specific use cases.

For more information on BCM, refer to the BCM documentation.

For more information on Slurm, refer to the Slurm Quick Start Guide.

2. Accessing Your DGX Cloud Cluster

Note

The following sections assume that your cluster admin has worked with you to create a cluster user account. If you do not yet have a user account and SSH key pair for logging in to the cluster, please contact your cluster admin to get onboarded.

2.1. DGX Cloud Access Methods

As a user, you have a few connection options for accessing a DGX Cloud environment:

  • Login Nodes: Two login nodes are accessible via public internet from network addresses defined by your cluster administrator, using a public SSH key that has been registered with the user account provisioned for you in the DGX Cloud deployment.

  • User Portal: A web interface called the User Portal is accessible via an SSH tunnel to port 8081 on either login node. Instructions for accessing the User Portal can be found in the User Portal section of this guide.

Note

The addresses to access the login nodes should be provided to you by your cluster admin.

2.2. Accessing Login Nodes

Note

Non-root users are not permitted to SSH to the head node - they will only be able to SSH to the cluster’s Slurm login nodes.

To access a Slurm login node, you need a user account and an SSH key pair on your system whose public key is attached to that account. Your cluster admin or owner must have created this user account for you and registered your public SSH key.

Instructions for creating an SSH key pair can be found in the Creating an SSH Key Pair to Send to Your Admin For User Creation section below. For more information, see the NVIDIA DGX Cloud Cluster Admin Guide.

2.2.1. Logging In with an SSH Key

Cluster users will have SSH access to the login nodes. Cluster users can also access the User Portal from the login nodes.

To access a login node, follow these steps:

  1. Obtain the login node addresses from your cluster admin.

  2. Log in with the user account(s) created by the cluster admin:

ssh -i /path/to/ssh_cert <cluster-user>@ip-addr-of-login-node

2.2.2. Updating Passwords

You will access the login nodes using your SSH key. However, to access the web interfaces for the User Portal you will use a password.

You should have been given a default account password by your cluster admin. To change your password after you have logged in with an SSH key, use the passwd command:

passwd

Make note of your new password. If you do not remember your password, please consult with the cluster admin to reset it.

2.2.3. Creating an SSH Key Pair to Send to Your Admin For User Creation

For a cluster admin or owner to create your user account, you will need to create an SSH key pair and send the public key to the admin. Follow the steps below to do so.

  1. Open a Linux terminal, then run the following command, making sure to change the <cluster-user> and your_email@example.com to your specific information:

    ssh-keygen -t rsa -b 4096 -f ~/.ssh/<cluster-user>-rsa-dgxc -C "your_email@example.com"
    
  2. Run the following command to output the content of the public key. Copy the output from running this command, and send it to your cluster admin or owner:

    cat ~/.ssh/<cluster-user>-rsa-dgxc.pub
    

3. Overview of Working in Your DGX Cloud Cluster

3.1. Home Directories

Your home directory is your default file path upon logging into a DGX Cloud Slurm cluster. This is of the form /home/demo-user.

In a DGX Cloud Slurm cluster, all user home directories reside on a network filesystem (NFS). This makes it possible for all nodes in the cluster to access the same data concurrently when you run a job.

The network filesystem means you can:

  • Write a script once and run it on multiple nodes simultaneously (see the sketch after this list).

  • Clone from a repository once and make the repository available on the entire cluster

  • Use a shared directory to log a job’s output from multiple producers

  • Run your code on whatever systems are available - data availability is not dependent on any one system.
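For instance, because the home directory is shared, a script saved once from a login node can be executed directly on compute nodes. A minimal sketch, assuming a two-node allocation on the default partition and using Slurm's srun command (covered later in this guide); the script name is illustrative:

# Create a small script in the shared home directory (on a login node)
cat > ~/hello.sh << 'EOF'
#!/bin/bash
echo "Hello from $(hostname)"
EOF
chmod +x ~/hello.sh

# The same file is visible on every node, so Slurm can run it anywhere
srun --nodes=2 --ntasks-per-node=1 ~/hello.sh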

All user data is isolated via filesystem permissions, which means that user-a can only read and write data in /home/user-a, and not in /home/user-b. However, users can impact each other if the home filesystem is overtaxed.

3.1.1. Best Practices with Home Directories

When working with Home Directories, there are some best practices to keep in mind:

  • Home directory storage is a great fit for small amounts of data, particularly code, scripts, and configuration files.

  • Home directory storage can also be an acceptable fit for log files, but alternative storage should be considered for log files that will grow to be large and/or are frequently written to.

  • Home directory storage has a per-user quota - the default is 10 GB unless your administrator has specified a different quota in collaboration with your NVIDIA TAM.

For datasets and more intensive storage IO use cases, a parallel filesystem is provided. Key information is discussed in the following section.

3.1.2. Home Directory Quota Management

You may exhaust your quota - when that happens, a message like the following will appear in a log file or your terminal:

disk quota exceeded

To write any additional data to your home directory, you must remove data or move data to Lustre (which is discussed in further detail in the following section).

To investigate your home directory utilization, the du command combined with the sort command can be used to identify large files in your home directory by sorting from smallest to largest.

$ du -sh /home/demo-user/* | sort -h
4.0K      /home/demo-user/a-script.sh
4.0K      /home/demo-user/an-sbatch-file
8.0K      /home/demo-user/log.out
226M      /home/demo-user/an-executable
7G        /home/demo-user/a-big-file

In the case shown above, a-big-file could cause quota problems with the default quota of 10 GB.

To remove the file if it is no longer needed, use the rm command.

rm a-big-file

To move it instead, use the mv command. The second argument is the target path, in this case, the demo-user Lustre scratch directory.

mv a-big-file /lustre/fs0/scratch/demo-user/

3.2. Lustre Shared Filesystem Directories

The parallel filesystem made available as part of DGX Cloud clusters is Lustre. In DGX Cloud clusters, the Lustre filesystem will be mounted at /lustre/fs0. This can be seen by using the df command and looking for the /lustre/fs0 path in the Mounted on column:

df -h

Filesystem                Size  Used Avail Use% Mounted on
tmpfs                      63G  1.3M   63G   1% /run
/dev/sda1                 199G   14G  186G   7% /
tmpfs                      63G     0   63G   0% /dev/shm
tmpfs                     5.0M     0  5.0M   0% /run/lock
tmpfs                     4.0M     0  4.0M   0% /sys/fs/cgroup
/dev/sdb3                 4.4G  4.0G  404M  91% /cm/node-installer-ebs
10.14.1.4:/bcm-cm-shared  1.0T  2.5G 1022G   1% /cm/shared
10.14.1.4:/bcm-home       1.0T  308M  1.0T   1% /home
10.14.2.9@tcp:/lustrefs    79T   87G   75T   1% /lustre/fs0
tmpfs                      13G  4.0K   13G   1% /run/user/0

Users cannot create files or directories inside of the Lustre filesystem by default - administrators must create a dedicated directory that a user has access to, which is commonly referred to as a scratch directory. DGX Cloud documentation suggests the /lustre/fs0/scratch directory for these user-specific directories.

You can check whether the cluster admin has created this directory for your user with the ls command, targeting the /lustre/fs0/scratch directory.

ls /lustre/fs0/scratch/

charlie  demo-user  alice  bob

If your username is not present in this directory or if an alternative path has not been communicated to you, reach out to your cluster admin.

3.2.1. Accessing Shared Directories

This Lustre filesystem path, similar to the NFS home directory, will be available at the same location (/lustre/fs0) across all login, CPU, and GPU nodes that are part of the DGX Cloud Slurm cluster.
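As a quick sanity check, the same df command can be run inside a Slurm job to confirm the mount is present on a compute node. This is a hedged example that assumes the defq partition used elsewhere in this guide (srun is covered in the Basic Slurm Commands section):

srun --partition defq --nodes=1 df -h /lustre/fs0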

3.2.2. Shared Directory Permissions

It is important to understand that data written to a shared scratch directory may be publicly viewable by default. You can confirm this with the ls -l command.

ls -l /lustre/fs0/scratch/
total 40
...
drwxr-xr-x 9 demo-user      demo-user      4096 Apr 22 10:25 demo-user
...

This output indicates the following information about the /lustre/fs0/scratch/demo-user directory:

  • The user who owns the directory (in this case, demo-user) has read, write, and execute permissions for it.

  • The group that owns the directory (the demo-user group) has read and execute permissions for it.

  • Any user with access to the Lustre filesystem has read and execute permissions for it.

3.2.3. Restricting File and Directory Permissions

Sensitive information should have more restricted permissions. While applications can manage this, you can also adjust permissions using the chmod tool.

To revoke read and write access for other users for a given file in the /lustre/fs0/scratch/demo-user directory, the demo-user can execute the following:

chmod o-rw /lustre/fs0/scratch/demo-user/example-file

Here, o refers to other users and the - indicates that the listed permissions should be removed. This results in the permission set for other users being completely empty, indicating a lack of read, write, and execute permissions for arbitrary users.

ls -l /lustre/fs0/scratch/demo-user/example-file
-rw-rw---- 1 demo-user demo-user 0 Apr 29 14:44 /lustre/fs0/scratch/demo-user/example-file

To restrict access to an entire directory, use the same strategy:

chmod o-rw /lustre/fs0/scratch/demo-user/private

This action results in a similar output for the directory.

ls -l /lustre/fs0/scratch/demo-user/
total 1024
...
drwxrwx---  2 demo-user demo-user   4096 Apr 29 14:47 private
...

3.2.4. Viewing Quota

Administrators are able to set user quotas in specific directories (such as a user’s dedicated scratch directory). A user can check their own quota with the lfs quota command:

lfs quota -u demo-user -v /lustre/fs0/scratch/demo-user

Disk quotas for usr demo-user (uid 1004):
    Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
  /lustre/fs0/scratch/demo-user/
                      4  104857600 104857600       -       1   10000   11000       -
  lustrefs-MDT0000_UUID
                      4*      -       4       -       1*      -       1       -
  lustrefs-OST0000_UUID
                      0       -       0       -       -       -       -       -
  lustrefs-OST0001_UUID
                      0       -       0       -       -       -       -       -
  lustrefs-OST0002_UUID
                      0       -       0       -       -       -       -       -
  lustrefs-OST0003_UUID
                      0       -       0       -       -       -       -       -
  lustrefs-OST0004_UUID
                      0       -       0       -       -       -       -       -

3.2.5. Best Practices

When using Lustre, you should take note of the following best practices:

  • Avoid accessing filesystem metadata when possible. Commands like ls -l should target specific files instead of entire directories to minimize load on the Lustre metadata targets.

  • Avoid having a large number of files (roughly 1000 or more) in a single directory. Split them into multiple subdirectories to minimize directory contention.

  • Avoid accessing small files (1 MB or less) on Lustre - either keep small configuration files in your /home directory when possible, aggregate many small files into a larger file (see the sketch after this list), or cache small files locally on the GPU systems during compute jobs (noting that data cached to those systems will not persist after the job completes).

  • Avoid accessing executables on Lustre when possible - in the event that the Lustre filesystem experiences a high load, the executable may become temporarily unavailable, resulting in corruption.
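One hedged way to follow the small-file guidance above is to aggregate a directory of many small files into a single archive stored on Lustre, then extract it to node-local storage inside a job. The paths and file names below are illustrative only:

# On a login node: bundle many small files into one archive on Lustre
tar -czf /lustre/fs0/scratch/demo-user/small-files.tar.gz -C /home/demo-user small-files/

# Inside a job: unpack to node-local storage (not persistent after the job ends)
srun --nodes=1 bash -c "mkdir -p /tmp/small-files && tar -xzf /lustre/fs0/scratch/demo-user/small-files.tar.gz -C /tmp/small-files"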

3.3. User Portal

The DGX Cloud User Portal is a GUI that shows you a dashboard of your workloads in the cluster. In the portal you can view information about job queues, job history, and cluster node status.

3.3.1. Logging In to the User Portal

You can log into the User Portal using these steps.

  1. Use this command to create an SSH tunnel back to your local desktop host:

    ssh <cluster-user>@ip-addr-of-login-node -L 8081:master:8081
    
  2. Navigate to https://localhost:8081/userportal/ from your web browser.

  3. In the browser, enter your cluster username and password to sign in to the User Portal.

    Note

    This is the same user that you used to connect via SSH from the Accessing Login Nodes section.

    Portal Login Page

Note

The first time a browser is used to log in to the cluster portal, a warning about the site certificate being untrusted appears in a default cluster configuration. This can safely be accepted.

3.3.3. Monitoring

At the top right of the User Portal, users can access a Monitoring page that initially displays two empty plot panels within the dashboard section of the page. Measurables can be dragged and dropped into the panels from the measurables navigation tree on the left-hand side of the page. The tree can be partly or fully expanded.

A filter can be used to select from the visible measurables. For example, after expanding the tree, it is possible to find a measurable for the cluster occupation rate by using the keyword “occupation” in the filter. The measurable can then be dragged from the options that remain visible.

Extra plot panel widgets can be added to the page by clicking on the “Add new widget” option (the + button) in the panels section of the dashboard page. A new dashboard tab can be added to the page by clicking on the “Add new dashboard” option (the + button) in the dashboard tabs section of the page.

Monitoring Page

3.3.4. Accounting and Reporting

At the top right of the User Portal, users can access an Accounting and Reporting page that shows resource use via PromQL queries. A wizard allows resource reports to be created per user and per account, which can then be grouped and organized into individual tabs. Reports can be saved and exported as a download in CSV or Microsoft Excel format.

Account-Report

4. Setting Up to Run Jobs

4.1. Loading Slurm Modules

Your cluster admin should have configured your user account to automatically load the Slurm module upon login. However, if you need to load the Slurm module manually, run the following command before running any Slurm commands.

module load slurm

To configure your user account to load the Slurm module automatically, run the following command, then log out, then log back into the login node.

module initadd slurm
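To verify that the Slurm module is available and loaded in your current session, the standard module commands can be used (output varies with cluster configuration):

module avail slurm
module list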

4.2. Basic Slurm Commands

This section covers some common commands that can be used to interact with Slurm on your DGX Cloud cluster.

4.2.1. sinfo

To see all nodes in the cluster and their current state, SSH to the Slurm login node for your cluster and run the sinfo command:

sinfo

PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
cpuq       up     infinite    2     idle   cpu[001-002]
defq*      up     infinite   64     idle   gpu[001-064]

In this example there are two CPU nodes and 64 GPU nodes available, all in an idle state.

When a node is in use, its state will change from idle to alloc.

sinfo

PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
cpuq       up     infinite    2     idle   cpu[001-002]
defq*      up     infinite   1      alloc  gpu001
defq*      up     infinite   63     idle   gpu[002-064]

Other possible node states include drain, indicating a node is currently unavailable for scheduling; mix, indicating a node is partially in use; and maint, indicating a node is down for maintenance.

Note

Depending on the details of the configuration of your cluster, your specific partitions may vary. CPU nodes may be in the same defq partition as the GPU nodes, for example, or additional partitions may exist and/or have been created.

4.2.2. srun

The simplest way of running a job on Slurm is to use srun and specify the number of nodes and parameters to run the job on. The following command runs a job on two nodes and prints out the hostname of each node:

srun --nodes=2 --ntasks-per-node=1 hostname

gpu001
gpu002

srun is a blocking command which prevents users from running other commands in the current terminal session until the job completes or is canceled.

4.2.3. sbatch

To run non-blocking jobs that are queued and run in the background, sbatch scripts are used. An sbatch script is a bash script which allows additional functions and code to be wrapped around the job submission. All sbatch scripts must have at least one srun command in the file that launches the job. Here is an example of an sbatch script:

#!/bin/bash
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

srun python3 trainer.py

To run the script, use sbatch <filename>, where <filename> is the name of the bash file above. This job allocates 2 nodes with 8 GPUs each and launches four python3 trainer.py tasks on each node.
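For example, if the script above were saved to a file named trainer.sbatch (an illustrative name), it could be submitted and then monitored with the commands below; squeue is covered in the next section:

sbatch trainer.sbatch
squeue -u $USER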

4.2.4. squeue

To view all jobs currently running and in the queue in a cluster, run squeue. The following example shows a job with id 1234 running on 64 nodes for 30 minutes on nodes gpu001-gpu064. The other two jobs, 1235 and 1236 are currently in the queue waiting for resources to be available. Once job 1234 completes and frees up resources, both of the pending jobs should get allocated:

squeue

JOBID  PARTITION      NAME       USER  ST  TIME   NODES  NODELIST(REASON)
1235   defq       llm-train   demo-use PD   0:00   2     (Dependency)
1236   defq       llm-ft      demo-use PD   0:00   2     (Dependency)
1234   defq       trainer     demo-use  R  30:00   64    gpu[001-064]

4.2.5. scancel

To cancel a running job, run scancel followed by the job ID. This will send a signal to the job to terminate and it will have a period of time to gracefully exit.

To cancel a job with ID 1234, you would run:

scancel 1234

To verify that it was canceled, check the queue. The state of CG indicates a job is completing:

squeue

JOBID  PARTITION      NAME       USER  ST  TIME   NODES  NODELIST(REASON)
1235   defq       llm-train   demo-use PD   0:00   2     (Dependency)
1236   defq       llm-ft      demo-use PD   0:00   2     (Dependency)
1234   defq       trainer     demo-use CG  30:00   64    gpu[001-064]

4.2.6. scontrol show

The scontrol show command allows you to view greater detail on different parts of the Slurm cluster. Some of the more common areas to show more information on are specific nodes or job details.

To view information on a specific node such as the current state, available resources, uptime, and more, run scontrol show node <node name>:

scontrol show node gpu001

NodeName=gpu001 Arch=x86_64 CoresPerSocket=62
CPUAlloc=0 CPUEfctv=248 CPUTot=248 CPULoad=2.55
AvailableFeatures=location=us-chicago-1
ActiveFeatures=location=us-chicago-1
Gres=gpu:8(S:0-1)
NodeAddr=gpu001 NodeHostName=gpu001 Version=23.02.5
OS=Linux 5.15.0-1032-oracle #38~20.04.1-Ubuntu SMP Thu Mar 23 20:47:49 UTC 2023
RealMemory=1878738 AllocMem=0 FreeMem=1945682 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=batch
BootTime=2024-02-21T12:31:03 SlurmdStartTime=2024-03-08T09:28:09
LastBusyTime=2024-03-08T13:37:46 ResumeAfterTime=None
CfgTRES=cpu=248,mem=1878738M,billing=60,gres/gpu=8
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Additionally, view details of a specific job including current state and reason for that state, submission time, log files, and requests with scontrol show job <job ID>:

scontrol show job 1234

JobId=1234 JobName=trainer
UserId=jdoe(1004) GroupId=jdoe(30) MCS_label=N/A
Priority=50 Nice=0 Account=general QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:24:15 TimeLimit=04:00:00 TimeMin=N/A
SubmitTime=2024-03-08T13:17:23 EligibleTime=2024-03-08T13:17:23
AccrueTime=2024-03-08T13:17:23
StartTime=2024-03-08T13:20:39 EndTime=2024-03-08T17:20:39 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-03-08T13:20:39 Scheduler=Main
Partition=defq AllocNode:Sid=login-01
ReqNodeList=(null) ExcNodeList=(null)
NodeList=gpu[001-064]
BatchHost=gpu[001-064]
NumNodes=64 NumCPUs=64 NumTasks=4 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=64,mem=1000G,node=1,billing=25,gres/gpu=8
AllocTRES=cpu=64,mem=1000G,node=1,billing=25,gres/gpu=8
Socks/Node=* NtasksPerN:B:S:C=4:0:*:* CoreSpec=*
MinCPUsNode=64 MinMemoryNode=1000G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=trainer.sh
WorkDir=/lustre/fs0/scratch/demo-user
StdErr=/lustre/fs0/scratch/demo-user/slurm-%N-%J.out
StdIn=/dev/null
StdOut=/lustre/fs0/scratch/demo-user/slurm-%N-%J.out
Power=
TresPerNode=gres:gpu:4
MailUser=

4.3. Setting Up NGC Integration

4.3.1. Accessing Your NGC Org

As a part of the DGX Cloud subscription, your organization has received access to NVIDIA NGC, with Private Registry and NVIDIA AI Enterprise subscriptions enabled.

Your cluster owner will be able to invite you to NGC. Once you have received the invitation, follow the instructions in the email to set up your account.

4.3.2. Setting Up Your NGC API Key

To generate your NGC API key, follow these steps:

  1. Go to ngc.nvidia.com and log in.

  2. Click on your user profile in the top right of the screen and click the “Setup” button.

  3. Click “Generate Personal Key” and generate the key in the new form that opens. Save the displayed key in a safe place as this will only be shown once and is needed for future steps.

4.3.3. Setting Up Your Enroot Config File

Once you have generated your NGC API key, you must put the key in a config file. This will be used to authenticate with Enroot when you run a job.

Save your NGC API key to a config file with the following steps:

  1. Run the following commands on the login node:

    mkdir -p ~/.config/enroot
    touch ~/.config/enroot/.credentials
    
  2. Open ~/.config/enroot/.credentials and enter the following in the file, replacing <API KEY> with your NGC key.

    machine nvcr.io login $oauthtoken password <API KEY>
    
  3. Save the file.

4.3.4. Installing NGC CLI

NGC CLI simplifies interacting with your DGX Cloud Private Registry by providing commands to upload and download various artifacts (such as trained models).

To access the latest version of NGC CLI, navigate to the CLI Install page. The following documentation is based on version 3.41.4.

  1. From a login node, download the NGC CLI tool via wget.

    wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.41.4/files/ngccli_linux.zip -O ngccli_linux.zip && unzip ngccli_linux.zip
    
  2. Check the binary’s md5 hash to ensure it has not been corrupted.

    find ngc-cli/ -type f -exec md5sum {} + | LC_ALL=C sort | md5sum -c ngc-cli.md5
    
  3. You can also check the downloaded archive’s SHA256 hash.

    sha256sum ngccli_linux.zip
    

    The following value should be returned for version 3.41.4.

    2c86681048ab8e2980bdd6aa6c17f086eff988276ad90a28f1307b69fdb50252
    
  4. When the tool is confirmed to be ready for use, make NGC CLI executable and add it to your user’s PATH.

    chmod u+x ngc-cli/ngc; echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
    
  5. Configure NGC CLI.

    ngc config set
    

    Enter the requested information at each step. The most important field is the requested API key, which is the same as the value used for Enroot configuration.

    Enter API key [no-apikey]. Choices: [<VALID_APIKEY>, 'no-apikey']: <API KEY>
    

Once you have completed these steps, you can use the NGC CLI from the login nodes or in Slurm jobs to upload and download artifacts from your private registry.

4.3.5. Running a Hello World Job

Once your NGC key has been added to the Enroot config file, a simple job can be run with an NGC container. The following command will launch a single-node job against a PyTorch container and print the PyTorch version inside the container:

srun --container-image nvcr.io/nvidia/pytorch:24.02-py3 -N1 bash -c "python3 -c 'import torch; print(torch.__version__)'"

4.4. Downloading Containers Onto the Cluster

Within your DGX Cloud cluster you can save container images onto the Lustre shared storage directory, to avoid repeated downloads of the same image. This can be particularly useful when working with large multi-node jobs or repetitive job runs.

The DGX Cloud cluster uses Enroot as the container runtime, which can save container images as squash files.

4.4.1. Creating a Squash File From an NGC Container

A squash file is the compressed representation of an NGC container that Enroot uses.

The following steps create a squash file from an NGC Container:

  1. In your Lustre directory, create a directory for squash files if one does not exist already, and enter that directory.

    mkdir /lustre/fs0/scratch/demo-user/sqsh-files
    cd /lustre/fs0/scratch/demo-user/sqsh-files
    
  2. At the terminal, run the following command:

    srun -N 1 --pty --partition cpuq --exclusive --job-name "enroot-import:interactive" bash
    

    You should now be running on a CPU node. This will be indicated by the shell prompt:

    demo-user@cpu001:~$
    
  3. Use enroot to import the target container - in this case, we are importing a NeMo Framework container from NGC.

    enroot import 'docker://$oauthtoken@nvcr.io#nvidia/nemo:24.01.framework'
    

    The output from a CPU node with 32 cores follows - the output may differ slightly from deployment to deployment.

    [INFO] Querying registry for permission grant
    [INFO] Authenticating with user: $oauthtoken
    [INFO] Using credentials from file: /home/demo-user/.config/enroot/.credentials
    [INFO] Authentication succeeded
    [INFO] Fetching image manifest list
    [INFO] Fetching image manifest
    [INFO] Downloading 74 missing layers... 100% 74:0=0s 9ee9f649846387c87376a6ed73d7938fb7ebc0de62ae5e6b5285643682dc12c2
    [INFO] Extracting image layers... 100% 78:0=0s a486411936734b0d1d201c8a0ed8e9d449a64d5033fdc33411ec95bc26460efb
    [INFO] Converting whiteouts... 100% 78:0=0s a486411936734b0d1d201c8a0ed8e9d449a64d5033fdc33411ec95bc26460efb
    [INFO] Creating squashfs filesystem... Parallel mksquashfs: Using 32 processors Creating 4.0 filesystem on /lustre/fs0/scratch/demo-user/sqsh-files-cpu/nvidia+nemo+24.01.framework.sqsh, block size 131072.
    [=============================================================================================================================================================================================\] 594940/594940 100% Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072
    uncompressed data, uncompressed metadata, uncompressed fragments,       uncompressed xattrs, uncompressed ids   duplicates are not removed Filesystem size 34918700.97 Kbytes (34100.29 Mbytes)
    99.98% of uncompressed filesystem size (34924117.67 Kbytes) Inode table size 13604323 bytes (13285.47 Kbytes)   100.00% of uncompressed inode table size (13604323 bytes) Directory table size 12122240 bytes (11838.12 Kbytes)
    100.00% of uncompressed directory table size (12122240 bytes) No duplicate files removed Number of inodes 387307 Number of files 345979 Number of fragments 21833 Number of symbolic links 1737 Number of device nodes 0 Number of fifo nodes 0 Number of socket nodes 0 Number of directories 39591
    Number of ids (unique uids + gids) 1 Number of uids 1   root (0) Number of gids 1       root (0)
    
  4. End the interactive job.

    exit
    

Subsequent srun commands can now use this squash file instead of pulling a container from NGC. This example modifies an interactive job to use the squash file that was just created:

srun -N 1 --pty --gpus 8 --container-image /lustre/fs0/scratch/demo-user/sqsh-files/nvidia+nemo+24.01.framework.sqsh --exclusive --job-name "use-squash-file:interactive" bash

5. Running Example Jobs

5.1. Example Single-Node Batch Job

This section gives more detail on running a single-node batch job using Slurm on DGX Cloud. The example job runs on multiple GPUs within a single node.

5.1.1. Basic srun Command

A Slurm srun command that uses Pyxis and Enroot to leverage NGC containers has the following structure:

srun --nodes 1 --job-name example-srun-job --exclusive --gpus-per-node 8 --mpi=pmix --container-mounts /lustre/fs0 --container-mounts /cm/shared --container-image="nvcr.io#nvidia/pytorch:24.02-py3" --partition defq <command>

Where the parameters are:

  • nodes - The number of nodes to launch to support this job.

  • job-name - An arbitrary string defining the job’s name.

  • exclusive - Requests exclusive access to the node that Slurm provides for this job. No other jobs will run on the provided node for the duration of this job.

  • gpus-per-node - Requests a quantity of GPUs from each provided node.

  • mpi - Specifies the MPI version to be used for this job. The pmix value is commonly the correct value - to learn more about the options available, visit the SchedMD documentation.

  • container-mounts - Specifies a path to map from the host into the job on each node that is part of that job. Multiple instances of this flag may be specified - one per desired path. In the example above, the path will be mirrored into the job as it exists on the host - to specify a different mapping, the argument would change to the format /host-path:/new-job-path (see the example after this list). In the example above, a Lustre filesystem path and a shared NFS path are specified - to learn more about the purpose of the Lustre path, please see the DGX Cloud Lustre filesystem overview. The shared NFS path will be explained later in this section.

  • container-image - Specifies the NGC container that will be used through Enroot as the job’s host environment once the job begins. The argument will be of the format nvcr.io#nvidia/container:version for public NGC containers and nvcr.io#org/container:version or nvcr.io#org/team/container:version if using a Private Registry container.

  • partition - Specifies the queue to select a node from. By default, the defq queue will be used.
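As a hedged illustration of the remapping format described for container-mounts above, the following command mounts a user's Lustre scratch directory into the container under a different path (/data); the paths shown are examples only:

srun --nodes 1 --job-name example-remap-job --exclusive --gpus-per-node 8 --mpi=pmix \
  --container-mounts /lustre/fs0/scratch/demo-user:/data \
  --container-image="nvcr.io#nvidia/pytorch:24.02-py3" \
  --partition defq ls /data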

5.1.2. Environment Variables

In addition to the arguments specified as part of the srun command, various environment variables may be necessary. For the example above, the following environment variables are needed (particularly for multi-node usage models, which we will detail in a later section):

export OMPI_MCA_coll_hcoll_enable=0
export UCX_TLS=rc
export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_TOPO_FILE=/cm/shared/etc/ndv4-topo.xml
export NCCL_DEBUG=INFO
export NCCL_PROTO=LL,LL128,Simple
export NCCL_ALGO=Tree,Ring,CollnetDirect,CollnetChain,NVLS
export MELLANOX_VISIBLE_DEVICES=all
export PMIX_MCA_gds=hash
export PMIX_MCA_psec=native

As part of these environment variables a topology file is specified via NCCL_TOPO_FILE. When present for a given deployment, the topology file will be available in the /cm/shared/etc/ path. The container-mounts argument with the /cm/shared value makes this path visible in the resulting job.

5.1.3. Running an srun Command

The example below runs a GPU-based job.

  1. Create a script at $HOME/hostname.sh using the text editor of your choice, with the following content:

    #!/bin/bash
    export OMPI_MCA_coll_hcoll_enable=0
    export UCX_TLS=rc
    export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
    export CUDA_DEVICE_ORDER=PCI_BUS_ID
    export NCCL_SOCKET_IFNAME=eth0
    export NCCL_IB_PCI_RELAXED_ORDERING=1
    export NCCL_TOPO_FILE=/cm/shared/etc/ndv4-topo.xml
    export NCCL_DEBUG=INFO
    export NCCL_PROTO=LL,LL128,Simple
    export NCCL_ALGO=Tree,Ring,CollnetDirect,CollnetChain,NVLS
    export MELLANOX_VISIBLE_DEVICES=all
    export PMIX_MCA_gds=hash
    export PMIX_MCA_psec=native

    srun --nodes 1 --job-name example-srun-job --exclusive --gpus-per-node 8 --mpi=pmix --container-mounts /lustre/fs0 --container-mounts /cm/shared --container-image="nvcr.io#nvidia/pytorch:24.02-py3" --partition defq hostname
    
  2. Make the script executable by running the following command:

    chmod +x hostname.sh
    
  3. You can now run the script:

    ./hostname.sh
    

The command output (stdout and stderr) will be written to the terminal from which the above script was submitted. Use of srun is blocking unless it is run in the background with &.
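If you would rather not block the terminal, one option is to redirect the script's output to a file and run it in the background; the log file name here is illustrative:

./hostname.sh > hostname-job.log 2>&1 &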

An expected output example will show the hostname of the system that the job ran on (which Slurm takes care of determining for you), such as the following:

pyxis: importing docker image: nvcr.io#nvidia/pytorch:24.02-py3
pyxis: imported docker image: nvcr.io#nvidia/pytorch:24.02-py3
gpu001

5.1.4. Running an sbatch Command

Alternatively, sbatch can be used to submit jobs:

#!/bin/bash
#SBATCH --partition defq --nodes 1
#SBATCH --exclusive
#SBATCH --job-name=example-sbatch-job
#SBATCH --gpus-per-node=8

export OMPI_MCA_coll_hcoll_enable=0
export UCX_TLS=rc
export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_TOPO_FILE=/cm/shared/etc/ndv4-topo.xml
export NCCL_DEBUG=INFO
export NCCL_PROTO=LL,LL128,Simple
export NCCL_ALGO=Tree,Ring,CollnetDirect,CollnetChain,NVLS
export MELLANOX_VISIBLE_DEVICES=all
export PMIX_MCA_gds=hash
export PMIX_MCA_psec=native

srun --mpi=pmix --container-mounts /lustre/fs0 --container-mounts /cm/shared --container-image="nvcr.io#nvidia/pytorch:24.02-py3" hostname

Arguments can be specified at the top of an sbatch script, or within individual commands as desired. srun commands can be specified within sbatch scripts, as shown.

To submit the script above, save the text to a file named hostname.sbatch and run sbatch hostname.sbatch. This will be non-blocking, meaning that additional jobs can be queued immediately if desired.

5.1.5. Viewing sbatch Log Files

A log file corresponding to the job number will automatically be created in the directory from which sbatch was run, with the format slurm-<job number>.out. This file name and path are configurable - consult the SchedMD documentation for these options and more.
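For example, standard Slurm #SBATCH directives can set the output and error paths explicitly, with %j expanding to the job ID; the directory shown is illustrative:

#SBATCH --output=/lustre/fs0/scratch/demo-user/logs/%j.out
#SBATCH --error=/lustre/fs0/scratch/demo-user/logs/%j.err

Note that Slurm does not create missing directories, so the target log directory must already exist before the job is submitted.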

To check the job’s status while in the queue or while running, the squeue command can be used as shown in the Cluster Overview documentation:

squeue

JOBID  PARTITION  NAME       USER  ST  TIME   NODES  NODELIST(REASON)
42     batch      example-s  jdoe  R   0:01   1     gpu001

5.1.6. SSH Access to Compute Nodes

While a job is executing, the user who owns the active job on a node can SSH to that node if necessary or desired.

ssh gpu001

5.1.7. Access the Container Running on a Node

There are cases where it is advantageous to create an interactive terminal with access to a job that has been launched via sbatch, or a secondary terminal as part of an interactive job. To accomplish this, you will need to access the container running as part of the job through Enroot.

  1. To access the container running as part of a job on a node, first determine which node is running the job.

    $ squeue
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      517      defq example- demo-use  R       2:44      1 gpu002
    
  2. SSH to the listed node.

    ssh gpu002
    
  3. Use the enroot list --fancy command to generate a list of the containers that are running on the node.

    $ enroot list --fancy
    NAME         PID      COMM  STATE  STARTED  TIME        MNTNS       USERNS      COMMAND
    pyxis_517.0  2351732  bash  S      May02    3-15:37:17  4026540834  4026540785  /usr/bin/bash -c <command>
    
  4. Copy the PID value for the target container and use it as an argument to the enroot exec command. Run the bash command inside the container.

    enroot exec 2351732 bash
    

The end result will be an interactive session running inside the target container.

$ enroot exec 2351732 bash
bash: module: command not found
demo-user@gpu002:/workspace$

5.2. Example Single-Node Interactive Bash Job

An interactive job can be launched with the following srun command. The key argument to remember is the --pty flag, which allows the job to run interactively in a terminal.

srun -N 1 --pty \
--container-image="nvcr.io#nvidia/pytorch:24.02-py3" \
--container-mounts "/lustre/fs0" \
--gpus 8 \
--exclusive \
--job-name "my-job:interactive" \
bash

The shell should now indicate connection to a node with 8 GPUs, which can be confirmed by running nvidia-smi.

demo-user@gpu003:/workspace$ nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000001:00:00.0 Off |                    0 |
| N/A   35C    P0             63W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000002:00:00.0 Off |                    0 |
| N/A   35C    P0             65W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  |   00000003:00:00.0 Off |                    0 |
| N/A   35C    P0             64W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  |   00000004:00:00.0 Off |                    0 |
| N/A   35C    P0             65W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  |   0000000B:00:00.0 Off |                    0 |
| N/A   35C    P0             62W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  |   0000000C:00:00.0 Off |                    0 |
| N/A   34C    P0             64W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  |   0000000D:00:00.0 Off |                    0 |
| N/A   34C    P0             63W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  |   0000000E:00:00.0 Off |                    0 |
| N/A   34C    P0             62W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

5.3. Example Single-Node JupyterLab Job

NGC containers support launching the Jupyter service for the interactive use of Jupyter notebooks. This section provides an example sbatch script that installs the RAPIDS Jupyterlab Dashboard and launches Jupyter in a Slurm job. We will then walk through how to connect to the resulting Jupyter service via your local web browser.

5.3.1. Example Jupyter sbatch Script

#!/bin/bash
#SBATCH --partition defq --nodes 1
#SBATCH --exclusive
#SBATCH --job-name=example-jupyter-sbatch-job
#SBATCH --gpus-per-node=8

srun --mpi=pmix --container-mounts /lustre/fs0/scratch/demo-user --container-image="nvcr.io#nvidia/pytorch:24.02-py3" \
  bash -c "set -x; pip install --extra-index-url https://pypi.anaconda.org/rapidsai-wheels-nightly/simple --pre jupyterlab_nvdashboard;
  jupyter lab --NotebookApp.token='' --notebook-dir=/ --no-browser --ip=0.0.0.0 --NotebookApp.allow_origin='*'; sleep 1d"

Save the text above in a file named jupyterlab.sbatch, and run the following command.

sbatch jupyterlab.sbatch

5.3.2. Connecting to the Jupyter Service

  1. Verify that the job has started and is running on a node via squeue. Take note of the node it is running on.

    $ squeue
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      517      defq example- demo-use  R       2:44      1 gpu002
    
  2. Watch the job’s output logs for output similar to the following, indicating that the server is accessible.

    [I 2024-05-02 15:37:10.540 ServerApp] Jupyter Server 2.14.0 is running at:
    [I 2024-05-02 15:37:10.540 ServerApp] http://gpu002:8888/lab
    [I 2024-05-02 15:37:10.540 ServerApp]     http://127.0.0.1:8888/lab
    
  3. From a separate SSH session, log into a login node and pass port 8888 from the node running this job (in this example, gpu002) to your local workstation.

    ssh -i /path/to/ssh_cert <cluster-user>@ip-addr-of-login-node -L 8888:gpu002:8888
    
  4. Navigate to http://localhost:8888 from your web browser. You should see a Jupyter interface with the RAPIDS Jupyterlab Dashboard integrated.

Jupyterlab with RAPIDS Jupyterlab Dashboard

5.4. Example Single-Node VS Code Job

For developer workflows, the ability to run Microsoft Visual Studio Code on a node inside the DGX Cloud cluster can eliminate the need to synchronize code while iteratively experimenting with software changes. As an example, we will walk through the use of an interactive job to install the vscode CLI and leverage its Remote Tunnel capability to connect from your local workstation.

  1. Run the following job to get started. Note that a CPU node could be selected instead of a GPU node if you prefer to make code changes on a node that is not involved in running an application test.

    srun -N 1 --pty \
    --container-image="nvcr.io#nvidia/pytorch:24.02-py3" \
    --container-mounts "/lustre/fs0" \
    --gpus 8 \
    --exclusive \
    --job-name "vscode-job:interactive" \
    bash
    
  2. Download and extract the code tool.

    curl -Lk 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' --output vscode_cli.tar.gz && tar -xf vscode_cli.tar.gz
    
  3. Create a tunnel with the code tunnel command.

    ./code tunnel --accept-server-license-terms --name dgxc-remote-tunnel
    
  4. Select your preferred login method and follow the authentication prompts.

    *
    * Visual Studio Code Server
    *
    * By using the software, you agree to
    * the Visual Studio Code Server License Terms (https://aka.ms/vscode-server-license) and
    * the Microsoft Privacy Statement (https://privacy.microsoft.com/en-US/privacystatement).
    *
    ? How would you like to log in to Visual Studio Code? ›
    ❯ Microsoft Account
      GitHub Account
    
  5. Once authentication is successful, you will see a URL that allows you to open a web browser and use the remote tunnel.

    Open this link in your browser https://vscode.dev/tunnel/dgxc-remote-tunnel/workspace
    

5.5. Example Multi-Node Batch Jobs

Multi-node jobs bear many similarities to single-node jobs - to modify a Slurm command to run a multi-node job, simply request multiple nodes. An example sbatch script that requests multiple nodes follows:

#!/bin/bash
#SBATCH --partition defq --nodes 2
#SBATCH --exclusive
#SBATCH --job-name=example-mn-sbatch-job
#SBATCH --gpus-per-node=8

export OMPI_MCA_coll_hcoll_enable=0
export UCX_TLS=rc
export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_TOPO_FILE=/cm/shared/etc/ndv4-topo.xml
export NCCL_PROTO=LL,LL128,Simple
export NCCL_ALGO=Tree,Ring,CollnetDirect,CollnetChain,NVLS
export MELLANOX_VISIBLE_DEVICES=all
export PMIX_MCA_gds=hash
export PMIX_MCA_psec=native

srun --mpi=pmix --container-mounts /lustre/fs0 --container-mounts /cm/shared --container-image="nvcr.io#nvidia/pytorch:24.02-py3" hostname

The only change between this sbatch script and the previously discussed single-node version is that the --nodes argument has been incremented from 1 to 2.

Save the text above in a file named multinode-hostname.sbatch, and run the following command.

sbatch multinode-hostname.sbatch

Expected output will now change as follows, and can be found in the resulting .out file (in this case, shown using the cat command).

cat slurm-225.out

pyxis: imported docker image: nvcr.io#nvidia/pytorch:24.02-py3
pyxis: imported docker image: nvcr.io#nvidia/pytorch:24.02-py3
gpu001
gpu002

5.5.1. Key Environment Variables

There are a handful of environment variables, included in the previous examples, that are essential for multi-node functionality. They fall into several categories: ensuring devices are presented to software correctly (CUDA_DEVICE_ORDER, MELLANOX_VISIBLE_DEVICES), ensuring MPI and PMIx are configured correctly (environment variables that begin with OMPI_MCA or PMIX), ensuring NCCL is configured correctly (environment variables that begin with NCCL), and ensuring UCX is configured correctly (environment variables that begin with UCX).

In addition to the environment variables noted, there are many environment variables that Slurm automatically creates and sets depending on the arguments provided via srun or sbatch. A full list of the options available via srun and the associated environment variables can be found in the official documentation.

The relevant SLURM_GPU environment variables and what they mean are listed below, along with the mechanisms that ensure each environment variable will be visible during jobs.

SLURM_GPUS
    Explanation: The total number of GPUs required for the job.
    Associated flag: --gpus
    Associated sbatch environment variable: SBATCH_GPUS

SLURM_GPUS_ON_NODE
    Explanation: The number of GPUs available to the step on this node.
    Associated flag: --exclusive
    Associated sbatch environment variable: SBATCH_EXCLUSIVE

SLURM_GPUS_PER_NODE
    Explanation: The number of GPUs required for the job on each node included in the job’s resource allocation.
    Associated flag: --gpus-per-node
    Associated sbatch environment variable: SBATCH_GPUS_PER_NODE

SLURM_GPUS_PER_TASK
    Explanation: The number of GPUs required for the job on each task to be spawned in the job’s resource allocation.
    Associated flag: --gpus-per-task
    Associated sbatch environment variable: SBATCH_GPUS_PER_TASK

SLURM_GPU_BIND
    Explanation: The way tasks are bound to specific GPUs in a job.
    Associated flag: --gpu-bind
    Associated sbatch environment variable: SBATCH_GPU_BIND

SLURM_GPU_FREQ
    Explanation: The requested frequency values for GPUs allocated to the job.
    Associated flag: --gpu-freq
    Associated sbatch environment variable: SBATCH_GPU_FREQ

5.5.2. Example PyTorch Job

To run a more complex example using the PyTorch NGC container, we will first clone the PyTorch examples repository to our home directory. If we were generating log files or downloading data inside this repository, it would make sense to clone this repository to our Lustre scratch directory instead.

git clone https://github.com/pytorch/examples

We will use PyTorch’s torchrun tool to launch the multinode.py example.

Create a file named torchrun.sh and add the following text to it. We have specified environment variables as arguments to torchrun, which Slurm creates for us based on how we launch our Slurm job.

#!/bin/bash

torchrun \
--nnodes $SLURM_JOB_NUM_NODES \
--nproc_per_node $SLURM_GPUS_PER_NODE \
--master-addr $MASTER_ADDR \
--master-port $MASTER_PORT \
--node-rank $RANK \
/home/demo-user/examples/distributed/ddp-tutorial-series/multinode.py 100 25

Make the torchrun.sh file executable.

chmod +x torchrun.sh

We will create an sbatch script to run torchrun.sh. Create a file named multinode-torchrun.sbatch and add the following text to it.

#!/bin/bash

#SBATCH --job-name=multinode-example
#SBATCH --partition defq
#SBATCH --exclusive
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --gpus-per-task=1

export LOGLEVEL=INFO
export OMPI_MCA_coll_hcoll_enable=0
export UCX_TLS=rc
export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_TOPO_FILE=/cm/shared/etc/ndv4-topo.xml
export NCCL_DEBUG=INFO
export NCCL_PROTO=LL,LL128,Simple
export NCCL_ALGO=Tree,Ring,CollnetDirect,CollnetChain,NVLS
export MELLANOX_VISIBLE_DEVICES=all
export PMIX_MCA_gds=hash
export PMIX_MCA_psec=native

srun --container-image="nvcr.io#nvidia/pytorch:24.02-py3" --mpi=pmix --container-mounts /lustre/fs0 /home/demo-user/torchrun.sh

This sbatch script will use a single GPU on each of the two requested nodes. It is important to specify --ntasks-per-node=1, as torchrun will subsequently launch any additional necessary processes itself. Run multinode-torchrun.sbatch with the following command.

sbatch multinode-torchrun.sbatch

An output similar to the following can be found in the resulting .out file (in this case, shown using the cat command).

cat slurm-226.out

pyxis: imported docker image: nvcr.io#nvidia/pytorch:24.02-py3
pyxis: imported docker image: nvcr.io#nvidia/pytorch:24.02-py3
[2024-06-13 14:24:39,312] torch.distributed.launcher.api: [INFO] Starting elastic_operator with launch configs:
[2024-06-13 14:24:39,312] torch.distributed.launcher.api: [INFO]   entrypoint       : /home/demo-user/examples/distributed/ddp-tutorial-series/multinode.py
[2024-06-13 14:24:39,312] torch.distributed.launcher.api: [INFO]   min_nodes        : 2
[2024-06-13 14:24:39,312] torch.distributed.launcher.api: [INFO]   max_nodes        : 2
[2024-06-13 14:24:39,312] torch.distributed.launcher.api: [INFO]   nproc_per_node   : 1
[2024-06-13 14:24:39,312] torch.distributed.launcher.api: [INFO]   run_id           : none
[2024-06-13 14:24:39,312] torch.distributed.launcher.api: [INFO]   rdzv_backend     : static
[2024-06-13 14:24:39,312] torch.distributed.launcher.api: [INFO]   rdzv_endpoint    : gpu001:49228
[2024-06-13 14:24:39,312] torch.distributed.launcher.api: [INFO]   rdzv_configs     : {'rank': 1, 'timeout': 900}
[2024-06-13 14:24:39,312] torch.distributed.launcher.api: [INFO]   max_restarts     : 0
[2024-06-13 14:24:39,312] torch.distributed.launcher.api: [INFO]   monitor_interval : 5
[2024-06-13 14:24:39,312] torch.distributed.launcher.api: [INFO]   log_dir          : None
[2024-06-13 14:24:39,312] torch.distributed.launcher.api: [INFO]   metrics_cfg      : {}
[2024-06-13 14:24:39,312] torch.distributed.launcher.api: [INFO]
...
[GPU0] Epoch 99 | Batchsize: 32 | Steps: 32
gpu002:862118:862201 [0] NCCL INFO [Service thread] Connection closed by localRank 0
gpu001:875891:875974 [0] NCCL INFO [Service thread] Connection closed by localRank 0
gpu001:875891:875891 [0] NCCL INFO comm 0x55555cee37a0 rank 0 nranks 2 cudaDev 0 busId 100000 - Abort COMPLETE
gpu002:862118:862118 [0] NCCL INFO comm 0x55555cee3e20 rank 1 nranks 2 cudaDev 0 busId 100000 - Abort COMPLETE
[2024-06-13 14:24:51,350] torch.distributed.elastic.agent.server.api: [INFO] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
[2024-06-13 14:24:51,350] torch.distributed.elastic.agent.server.api: [INFO] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
[2024-06-13 14:24:51,351] torch.distributed.elastic.agent.server.api: [INFO] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
[2024-06-13 14:24:51,352] torch.distributed.elastic.agent.server.api: [INFO] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
[2024-06-13 14:24:51,353] torch.distributed.elastic.agent.server.api: [INFO] Done waiting for other agents. Elapsed: 0.0012083053588867188 seconds
[2024-06-13 14:24:51,352] torch.distributed.elastic.agent.server.api: [INFO] Done waiting for other agents. Elapsed: 0.002357959747314453 seconds

6. Moving Data Into Your DGX Cloud Cluster

There are several factors to consider with regard to data movement in DGX Cloud:

  • Is the data available on your local workstation that can access the DGX Cloud cluster?

  • Is the data available on a shared filesystem or repository within your corporate network?

  • Is the data available on the public internet outside of an object store?

  • Is the data available on the public internet in a CSP object store?

  • What protocols, ports, or client software are necessary to transfer the data?

  • How much data will be transferred?

  • Is the person transferring the data the same person who will be using the data?

6.1. Moving Data from Your Local Workstation to DGX Cloud

To move data from your local workstation to DGX Cloud, consider the scale of your data.

If you are working with publicly available code (GitHub, GitLab), it is preferable to clone that code using git from either a DGX Cloud login node or a job.

git clone https://git*b.com/my-repo

If the code is only available from your local workstation or a fileshare accessible from your workstation, a copy is warranted. You can write to either your home directory (of the form /home/<cluster user>) or a Lustre directory you have access to (an example is /lustre/fs0/scratch/<cluster user>).

Remember the guidance noted for home directories and Lustre, and choose the appropriate location for your data transfer. A small amount of code and configuration files is a great fit for your home directory. However, larger sets of code that include data (or download it into the repository structure) will make the most sense on Lustre.
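
If you are unsure how much free space each location has, a quick check from a login node can help before starting a large transfer (a sketch; the mount points shown are the defaults used in this guide and may differ on your cluster):

df -h /home /lustre/fs0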

These strategies work well for local workstations and internal compute resources configured to access your DGX Cloud environment.

Note

Depending on the scale of local data you are working with, uploading that data to an object store may make sense instead of using tools like scp, sftp, or rsync.
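
If you do transfer directly with these tools, packing many small files into a single archive first usually speeds up the copy. For example, on your local workstation (data/ is a placeholder for your local directory):

tar -czf archive.tgz data/

The resulting archive.tgz can then be transferred with any of the methods below and unpacked on the cluster with tar -xzf archive.tgz.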

6.1.1. scp

scp is a common default method for transferring files, archives, and small directories. The following example copies the local archive.tgz to the user’s home directory.

scp archive.tgz <cluster user>@ip-addr-of-login-node:/home/<cluster user>

The following command will copy archive.tgz to the user’s Lustre scratch directory instead.

scp archive.tgz <cluster user>@ip-addr-of-login-node:/lustre/fs0/scratch/<cluster user>

To copy a directory instead of a file or archive, use the -r flag.

scp -r data/ <cluster user>@ip-addr-of-login-node:/lustre/fs0/scratch/<cluster user>

Note that scp expects the remote directory to exist. If the remote directory does not exist, the following behavior will be observed.

scp archive.tgz <cluster user>@ip-addr-of-login-node:/lustre/fs0/scratch/<cluster user>/fake-dir/
scp: dest open "/lustre/fs0/scratch/<cluster user>/fake-dir/": Failure
scp: failed to upload archive.tgz to /lustre/fs0/scratch/<cluster user>/fake-dir/

To resolve this problem, SSH to the login node and create the expected directory.

ssh <cluster user>@ip-addr-of-login-node
mkdir /lustre/fs0/scratch/<cluster user>/fake-dir/
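
Alternatively, you can create the remote directory and retry the copy without opening an interactive session; the -p flag also creates any missing parent directories:

ssh <cluster user>@ip-addr-of-login-node 'mkdir -p /lustre/fs0/scratch/<cluster user>/fake-dir/'
scp archive.tgz <cluster user>@ip-addr-of-login-node:/lustre/fs0/scratch/<cluster user>/fake-dir/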

6.1.2. sftp

sftp is another option to transfer files from your workstation to DGX Cloud. To get started, have sftp connect to a login node.

sftp <cluster user>@ip-addr-of-login-node

You will see the prompt change, indicating authentication succeeded and you are using sftp.

sftp>

sftp offers commands to move data from your local system to the remote system and from the remote system to your local system.

Some operations apply to both local and remote directories and files. For those operations, an l is prepended to the version of the command that operates on your local workstation.

For example, to list the local workstation directory’s contents, run lls. To list the directory contents in your DGX Cloud directory, run ls.

A full inventory of operations can be found with the ? command.

sftp> ?
Available commands:
...

To change your remote directory, use the cd command. Then, check your remote directory with the pwd command and your local directory with lpwd. Files will be copied to or from the directory you change to in both local and remote contexts.

sftp> cd /lustre/fs0/scratch/demo-user
sftp> pwd
Remote working directory: /lustre/fs0/scratch/demo-user
sftp> lpwd
Local working directory: /Users/demo-user

To upload a file to DGX Cloud, use the put command.

sftp> put archive.tgz

To upload a directory and its contents to DGX Cloud, add the -R flag.

sftp> put -R my-directory

To easily download data from DGX Cloud to your local workstation, use the get command (with the -R flag for directory download).

sftp> get -R my-log-file-directory

To exit your sftp session, use the bye command.

sftp> bye
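
For repeatable transfers, sftp can also run non-interactively with its -b option, reading commands from a plain-text file. A minimal sketch, assuming a file you create named sftp-commands.txt:

cat sftp-commands.txt
cd /lustre/fs0/scratch/<cluster user>
put archive.tgz
bye

sftp -b sftp-commands.txt <cluster user>@ip-addr-of-login-node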

6.1.3. rsync

rsync is also a commonly used tool for data transfer, particularly when multiple files or directories need to be moved, or when data is synchronized periodically between local and remote directories.

For example, use rsync to copy a local workstation file to a user’s DGX Cloud home directory.

rsync archive.tgz <cluster user>@ip-addr-of-login-node:/home/<cluster user>

rsync usage is similar to scp, but there are some important differences.

To copy a directory from your local workstation to DGX Cloud, add the -r (or -a) flag and note that a trailing / on the local directory argument changes the behavior. With a trailing /, rsync copies the contents of the local directory directly into the target directory:

rsync -r local-directory/ <cluster user>@ip-addr-of-login-node:/home/<cluster user>

Without the trailing /, rsync recreates local-directory itself inside the target directory, preserving the local directory structure.

There are a large number of flags to consider when using rsync. Check the official documentation to explore all possible flags and optimize for your use case.
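
A commonly used starting point is a sketch like the following, which recursively copies a dataset directory to Lustre while preserving permissions and timestamps (-a), compressing data in transit (-z), and showing per-file progress:

rsync -avz --progress data/ <cluster user>@ip-addr-of-login-node:/lustre/fs0/scratch/<cluster user>/data/

Because rsync only transfers files that are new or have changed, re-running the same command later synchronizes updates instead of copying everything again.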

6.2. Moving Data from Your Cloud-based Storage to DGX Cloud

When the data you need to move into DGX Cloud resides in a cloud-based object store, the general strategy below applies regardless of the specific tool.

  1. Write an sbatch script to download the appropriate tool, configure authentication, and place the data in your target directory (likely Lustre scratch).

  2. Run the sbatch script.

  3. Verify that the results in your target DGX Cloud directory are correct.

  4. Repeat for additional data.

An example process for how to do this on CPU nodes in the cluster can be found in the section below.

6.2.1. Loading Data Via Batch Jobs on CPU Nodes

This section shows an example of how to run a batch job on CPU nodes to download data from a cloud object store and then store the data on the Lustre filesystem.

6.2.1.1. Identifying Nodes

You can use the sinfo command to see which nodes are in the cluster:

sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpuq         up   infinite      2   idle cpu[001-002]
defq*        up   infinite     64   idle gpu[001-064]

In this example, the CPU nodes are available in a discrete cpuq partition, which we will continue to reference in the following sections. In the next step, we will write sample jobs that target the CPU nodes.

Note

Depending on how your cluster is configured, your partitions may differ. For example, CPU nodes may share the defq partition with the GPU nodes, or additional partitions may have been created.
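
To confirm how the partitions on your cluster are laid out before submitting, you can print a per-partition summary and the full partition definitions, for example:

sinfo -s
scontrol show partition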

6.2.1.2. S3

6.2.1.2.1. Loading Data From S3 via an AWS CLI sbatch Job

In this section, we write a batch job to load data from S3. The job carries out three main tasks:

  1. Downloads the AWS CLI tool.

  2. Unpacks the downloaded CLI tool into your home directory.

  3. Uses the tool to download a dataset to Lustre via AWS S3.

#!/bin/bash
#SBATCH --partition cpuq --nodes 1
#SBATCH --exclusive
#SBATCH --job-name=aws-s3-download
srun bash -c 'curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"; unzip awscliv2.zip; mkdir /lustre/fs0/scratch/demo-user/s3-dl; /home/demo-user/aws/dist/aws s3 sync --no-sign-request --region=us-east-1 s3://noaa-swpc-pds /lustre/fs0/scratch/demo-user/s3-dl/'

To run the script:

  1. Save the script above to your home directory as aws-s3-download.sbatch.

  2. Update demo-user in the script to your username.

  3. Submit the job to the queue with this command:

    sbatch aws-s3-download.sbatch
    

An open dataset is used in this example, so no AWS credentials are required.

The output from sinfo will show activity in the cpuq partition on one of the CPU nodes.

sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpuq         up   infinite      1  alloc cpu001
cpuq         up   infinite      1   idle cpu002
defq*        up   infinite     64   idle gpu[001-064]

6.2.1.2.2. Accessing S3 Data with Authentication

You can use environment variables to access an S3 bucket that requires security authorization.

A modified sbatch script that uses AWS environment variables follows; it assumes the AWS CLI has already been downloaded and unpacked as in the previous example.

#!/bin/bash
#SBATCH --partition cpuq --nodes 1
#SBATCH --exclusive
#SBATCH --job-name=aws-secure-s3-download

export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-access-key
export AWS_DEFAULT_REGION=target-region

srun bash -c 'mkdir /lustre/fs0/scratch/demo-user/secure-s3-dl; /home/demo-user/aws/dist/aws s3 sync s3://your-secure-bucket /lustre/fs0/scratch/demo-user/secure-s3-dl/'

Different authentication mechanisms may be required, depending on the location of data or company policies.
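
If you prefer not to store keys in the sbatch script itself, one option is to source them from a separate file that only you can read. A minimal sketch, assuming a hypothetical file named ~/.aws_s3_credentials:

# On a login node, create the credentials file and restrict its permissions.
cat > ~/.aws_s3_credentials << 'EOF'
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-access-key
export AWS_DEFAULT_REGION=target-region
EOF
chmod 600 ~/.aws_s3_credentials

In the sbatch script, the three export lines can then be replaced with:

source ~/.aws_s3_credentials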

6.2.1.2.3. Accessing S3 Data with the s5cmd Tool

An alternative tool to consider for S3 downloads is s5cmd. It can be significantly faster than the standard aws CLI, which can translate into meaningful time savings for large data movement operations on the order of tens of gigabytes or more.

The following sbatch script mirrors the previous examples, but downloads and uses the s5cmd tool instead.

#!/bin/bash
#SBATCH --partition cpuq --nodes 1
#SBATCH --exclusive
#SBATCH --job-name=s5cmd-s3-download
srun bash -c 'wget https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz; tar -xf s5cmd_2.2.2_Linux-64bit.tar.gz; mkdir /lustre/fs0/scratch/demo-user/s5cmd-dl; /home/demo-user/s5cmd --no-sign-request cp s3://noaa-swpc-pds/* /lustre/fs0/scratch/demo-user/s5cmd-dl/'

6.2.1.3. Azure

6.2.1.3.1. Loading Data From Azure Storage via an AzCopy sbatch Job

In this section, we write a batch job to load data from Azure Storage using AzCopy. The job carries out three main tasks:

  1. Downloads the AzCopy tool.

  2. Unpacks the downloaded CLI tool into your home directory.

  3. Uses the tool to download a dataset to Lustre via Azure Storage.

#!/bin/bash
#SBATCH --partition cpuq --nodes 1
#SBATCH --exclusive
#SBATCH --job-name=azcopy-download

srun bash -c 'wget -O azcopy_v10.tar.gz https://aka.ms/downloadazcopy-v10-linux && tar -xf azcopy_v10.tar.gz --strip-components=1; mkdir /lustre/fs0/scratch/demo-user/azcopy; ./azcopy copy "https://azureopendatastorage.blob.core.windows.net/mnist/train-images-idx3-ubyte.gz" "/lustre/fs0/scratch/demo-user/azcopy"'

To run the script:

  1. Save the script above to your home directory as azcopy-download.sbatch.

  2. Update demo-user in the script to your username.

  3. Submit the job to the queue with this command:

    sbatch azcopy-download.sbatch
    

An open dataset is used in this example, so no Azure credentials are required.

The output from sinfo will show activity in the cpuq partition on one of the CPU nodes.

sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpuq         up   infinite      1  alloc cpu001
cpuq         up   infinite      1   idle cpu002
defq*        up   infinite     64   idle gpu[001-064]

6.2.1.4. CPU Job sbatch Logs

As with other examples in this section, use the tail -f command to follow the output of the resulting Slurm log file in your home directory.

tail -f /home/demo-user/slurm-22.out

download: s3://noaa-swpc-pds/json/goes/primary/differential-protons-7-day.json to s3-dl/json/goes/primary/differential-protons-7-day.json
download: s3://noaa-swpc-pds/text/3-day-geomag-forecast.txt to s3-dl/text/3-day-geomag-forecast.txt
download: s3://noaa-swpc-pds/text/aurora-nowcast-hemi-power.txt to s3-dl/text/aurora-nowcast-hemi-power.txt
download: s3://noaa-swpc-pds/text/45-day-ap-forecast.txt to s3-dl/text/45-day-ap-forecast.txt
download: s3://noaa-swpc-pds/json/solar-cycle/sunspots.json to s3-dl/json/solar-cycle/sunspots.json
download: s3://noaa-swpc-pds/json/goes/primary/xrays-7-day.json to s3-dl/json/goes/primary/xrays-7-day.json
download: s3://noaa-swpc-pds/text/3-day-forecast.txt to s3-dl/text/3-day-forecast.txt
download: s3://noaa-swpc-pds/text/27-day-outlook.txt to s3-dl/text/27-day-outlook.txt
download: s3://noaa-swpc-pds/products/alerts.json to s3-dl/products/alerts.json
download: s3://noaa-swpc-pds/text/relativistic-electron-fluence-tabular.txt to s3-dl/text/relativistic-electron-fluence-tabular.txt

6.2.1.5. Verifying Data

The downloaded data will be present in the Lustre filesystem, ready for local manipulation or ingest into a GPU-based job if already preprocessed.

ls /lustre/fs0/scratch/demo-user/s3-dl/

index.html  json  products  text
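
To sanity-check the transfer beyond a directory listing, you can compare the total size and file count against the source, for example:

du -sh /lustre/fs0/scratch/demo-user/s3-dl/
find /lustre/fs0/scratch/demo-user/s3-dl/ -type f | wc -l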

7. Managing Jobs

7.1. Job Status and Monitoring

To see the status of jobs on the cluster, use the squeue command:

squeue -a -l

Tue Nov 17 19:08:18 2020
JOBID PARTITION NAME USER STATE TIME TIME_LIMIT NODES NODELIST(REASON)
9     batch     bash user01 RUNNING 5:43 UNLIMITED 1 dgx1
10    batch     Bash user02 RUNNING 6:33 UNLIMITED 2 dgx[2-3]

This shows the JOBID as well as the state of all jobs in the queue.

You can use the -u flag to view jobs for a particular username:

squeue -l -u USERNAME
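
To monitor your own jobs continuously from a login node, you can wrap this in the standard watch utility (assuming it is available), refreshing every 10 seconds:

watch -n 10 squeue -l -u $USER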

To view more detailed information about a particular job, use the scontrol command:

scontrol show job JOBID

7.2. Adjusting Job Priority

You can change the job priority for a queued job using the following command:

scontrol update JobId=JOBID Priority=<Priority-Integer>

Where <Priority-Integer> is replaced by an integer between 1 and 2^32 - 1. The higher the integer, the higher the priority the job is given.
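
For example, to raise the priority of a queued job with ID 12345 (a placeholder job ID) and then confirm the new value via squeue's priority output field:

scontrol update JobId=12345 Priority=100000
squeue -j 12345 -o "%i %Q"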

7.3. Pausing and Resuming a Running Job

It is possible to pause a running job using the scontrol suspend command:

scontrol suspend JOBID

The job will be stopped at its current step, so it can be resumed at a later time.

To resume a paused job, use the scontrol resume command:

scontrol resume JOBID

7.4. Holding and Releasing a Queued Job

It is possible to prevent a queued job from running using the scontrol hold command:

scontrol hold JOBID

The job will remain in the queue in a held state and will not run until it is released:

scontrol release JOBID

7.5. Resubmitting a Job

You can use scontrol requeue to return a batch job in the CONFIGURING, RUNNING, STOPPED, or SUSPENDED state to the pending queue:

scontrol requeue JOBID

7.6. Cancelling a Job

To cancel a running job, first make sure you know the JOBID, which can be found using the squeue command described above. You can then cancel the job with scancel:

scancel JOBID

Multiple jobs can be canceled in one command by listing the job IDs separated by spaces:

scancel JOBID1 JOBID2
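
To cancel all jobs belonging to your user account in one command, scancel also accepts a user filter:

scancel -u $USER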