1. Introduction & Personas

Congratulations on your new DGX Cloud cluster!

This guide is intended to provide the necessary information for the cluster owner, cluster admins, and cluster users to get started with their primary responsibilities on their DGX Cloud cluster.

The intended workflow of the guide starts with the cluster owner, who is the main contact for managing the DGX Cloud subscription and cluster.

More detailed information about user roles and functionalities can be found in the corresponding guides referenced throughout this document.

1.1. Cluster Owner

The cluster owner is responsible for:

  • Onboarding cluster admins and users via cmsh

  • Enrolling and inviting admins and users to NGC

  • Activating their subscription and registering for NVIDIA Enterprise Support

  • Collaborating with NVIDIA and cluster admins to troubleshoot and report any issues related to their DGX Cloud cluster

1.2. Cluster Admin

For a cluster admin, common tasks include:

  • Onboarding cluster users via Base View and creating/managing teams/accounts

  • Managing the cluster’s high-performance filesystem

  • Configuring Slurm queues and user quotas

  • Deeper inspection and manipulation of the Slurm job queue state

  • Debugging cluster behavior concerns

1.3. Cluster User

For a cluster user, common tasks include:

  • Scheduling compute jobs in Slurm

  • Ad hoc Slurm job queue interaction

  • Downloading source code or datasets

  • Manipulating configuration files

2. Cluster Owner Steps

2.1. Prerequisites

As a cluster owner, ensure the following steps have been completed:

  • Your Technical Account Manager (TAM) should have already reached out to you as the organization admin. During this process:

    • The TAM should have created a shared communication channel with you. Use this channel for any questions or issues during your experience on DGX Cloud.

    • You should have created an SSH key pair, and sent the public key to your TAM for initial access to the cluster.

  • To access the cluster, the TAM will provide the following information:

    • Head Node: <IP address of head node>

    • Login Nodes: <IP addresses of login nodes>

You will use the head node to manage the BCM installation, cluster configuration, and admin/user onboarding. The tool you will use for cluster configuration is cmsh (cluster management shell).

You will have SSH access to the head node via the root user and can SSH into the login nodes from the head node.

Cluster admins and users will primarily use the login nodes for their day-to-day work on the DGX Cloud cluster. They will have SSH and Base View access to the login nodes only.

As a cluster owner, you will be responsible for creating user accounts for cluster admins and users.

Important

As a security best practice, usage of the root account should be minimized. Instructions for creating a non-root user that the cluster owner can use to access the head node can be found in the Enable an Alternative Cluster Owner Account section of the NVIDIA DGX Cloud Cluster Administration Guide.

The root account used by the cluster owner on the head node should not be used to run jobs in the Slurm cluster. Only cluster admins and users should run jobs on the Slurm cluster, via the login nodes.

If needed, the cluster owner should create their own separate admin and/or user accounts to access the login nodes for work that does not require root access.

2.2. Accessing the Head Node as Root (Cluster Owner Only)

As the cluster owner, you can access the head node using the SSH key pair you provided to your Technical Account Manager (TAM). To do this, use the following command:

ssh -i /path/to/ssh_cert root@ip-addr-of-head-node

Note

If you encounter any issues while trying SSH, refer to Troubleshooting for assistance.
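If you connect frequently, an entry in the SSH config on your local workstation avoids retyping the key path and IP each time. A minimal sketch; the alias dgxc-head, the IP placeholder, and the key path are example values to replace with your own:

```shell
# Add a host alias on your local workstation (not on the cluster).
# "dgxc-head", the HostName, and the IdentityFile path below are
# placeholders -- substitute your own values.
mkdir -p ~/.ssh
cat >> ~/.ssh/config <<'EOF'
Host dgxc-head
    HostName ip-addr-of-head-node
    User root
    IdentityFile ~/.ssh/ssh_cert
EOF
chmod 600 ~/.ssh/config
```

After this, `ssh dgxc-head` is equivalent to the full command above.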

2.3. Adding Cluster Admins Via cmsh

As a cluster owner, you can add admins to help manage the Slurm cluster using the following steps.

  1. Compile a list of cluster admins: Make a list of people who will require admin access to the cluster.

  2. Create SSH key pairs: Ask each cluster admin to create an SSH key pair for themselves using the following command:

    ssh-keygen -t rsa -b 4096 -f ~/.ssh/<cluster-admin>-rsa-dgxc -C "cluster_admin_email@example.com"
    
  3. Obtain public keys: Once each admin has an SSH key pair, have each send you the contents of their public key (<cluster-admin>-rsa-dgxc.pub) file generated in their ~/.ssh/ directory. You will use this information in the following steps to create the cluster admin user.

  4. Create a cluster admin user: From the head node as root, run the following commands to create a cluster admin.

    1. Enter cmsh with this command:

      cmsh
      
    2. Run the following commands within cmsh:

      user
      add <cluster-admin>
      set password
      set profile tenantadmin
      commit

      group
      use tenantadmin
      append members <cluster-admin>
      commit
      quit
      
    3. Switch to the user’s account:

      sudo su - <cluster-admin>
      
    4. Add their SSH public key (obtained during Step 3 above) to the authorized_keys file using a text editor of your choice. For example,

      nano $HOME/.ssh/authorized_keys
      
    5. Configure their admin user account to automatically load the slurm module upon login, so they do not have to run the module load slurm command at every login:

      module initadd slurm
      
    6. Exit the admin user’s account:

      exit
      
    7. Run the following commands to add the admin user as a Slurm admin:

      module load slurm
      sacctmgr add User User=<cluster-admin> Account=root AdminLevel=Administrator
      

      Commit the changes when prompted.

  5. (Optional) Create a shared scratch space on LustreFS for the admin user: If the cluster admin will be running Slurm jobs, you can configure a scratch space for their user on the Lustre shared filesystem, or they can configure it themselves if needed (using sudo). Follow the steps below to do so.

    1. Run the following commands to create the admin user’s scratch space on the Lustre filesystem:

      mkdir -p /lustre/fs0/scratch/<cluster-admin>
      
      chown <cluster-admin>:<cluster-admin> /lustre/fs0/scratch/<cluster-admin>
      
    2. (Optional) You can then assign a quota to the admin user if necessary, using the commands below. More details can be found in the Managing Lustre Storage section of the NVIDIA DGX Cloud Cluster Administration Guide.

       # see current quota
       lfs quota -u <cluster-admin> -v /lustre/fs0/scratch/<cluster-admin>

       # example output of lfs quota for a user named demo-user
       Disk quotas for usr demo-user (uid 1004):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
       /lustre/fs0/scratch/demo-user/
                           4       0       0       -       1       0       0       -
       lustrefs-MDT0000_UUID
                           4       -       0       -       1       -       0       -
       lustrefs-OST0000_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0001_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0002_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0003_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0004_UUID
                           0       -       0       -       -       -       -       -
       Total allocated inode limit: 0, total allocated block limit: 0
       uid 1004 is using default block quota setting
       uid 1004 is using default file quota setting

       # set quota, e.g., 100G, and inodes
       lfs setquota -u <cluster-admin> -b 100G -B 100G -i 10000 -I 11000 /lustre/fs0/scratch/<cluster-admin>

       # example output after running setquota for a user named demo-user
       Disk quotas for usr demo-user (uid 1004):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
       /lustre/fs0/scratch/demo-user/
                           4  104857600 104857600       -       1   10000   11000       -
       lustrefs-MDT0000_UUID
                           4*      -       4       -       1*      -       1       -
       lustrefs-OST0000_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0001_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0002_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0003_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0004_UUID
                           0       -       0       -       -       -       -       -
      
  6. Send the information to the admin: The admin is now set up to access the cluster and start working. Send the following information to the admin:

    • Login node addresses

    • Their username and password information

    • Which SSH public key you configured for their user

    • (Optional) Their LustreFS scratch directory information

    Each cluster admin should now be able to log on to the login nodes listed in the Prerequisites section using the following command:

    ssh -i /path/to/cluster_admin_ssh_cert <cluster-admin>@ip-addr-of-login-node
    

Repeat Steps 1 through 6 for each cluster admin user you want to create.
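When onboarding several admins at once, the cmsh commands from Step 4 can be generated in bulk. A hedged sketch: it only writes a command file for review, which can then be fed to cmsh on the head node as root via standard input. The names alice and bob are placeholders, and because set password is interactive, passwords are left to be set per user afterwards:

```shell
# Generate the cmsh command sequence for a list of new cluster admins.
# Review add-admins.cmsh, then on the head node run:  cmsh < add-admins.cmsh
# ("set password" is interactive, so set each admin's password separately.)
ADMINS="alice bob"   # placeholder usernames
: > add-admins.cmsh
for a in $ADMINS; do
  cat >> add-admins.cmsh <<EOF
user
add $a
set profile tenantadmin
commit
group
use tenantadmin
append members $a
commit
EOF
done
echo quit >> add-admins.cmsh
```

Inspect the generated file before running it, since every line will be executed as a cmsh command.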

2.4. Adding Cluster Users Via cmsh

As a cluster owner, you’ll need to gather some information to onboard users to the cluster.

  1. Compile a list of cluster users: Start by compiling a list of users who need access to the cluster.

  2. Create SSH key pairs: Each user will need to create an SSH key pair for themselves with the following command:

    ssh-keygen -t rsa -b 4096 -f ~/.ssh/<cluster-user>-rsa-dgxc -C "your_email@example.com"
    
  3. Obtain public keys: Once each user has created an SSH key pair, have them send you the contents of their public key (<cluster-user>-rsa-dgxc.pub) file located in their ~/.ssh/ directory. You will use this in the following steps to create the cluster user.

  4. Create a cluster user: From the head node as root, run the following commands to create a cluster user.

    1. Enter cmsh with this command:

      cmsh
      
    2. Within cmsh, run the following commands to create a cluster user:

      user
      add <cluster-user>
      set password
      set profile portal
      commit
      quit
      
    3. Switch to the user’s account using the following command:

      sudo su - <cluster-user>
      
    4. Add the user’s SSH public key (obtained in Step 3) to the authorized_keys file in their ~/.ssh/ directory, using the text editor of your choice. For example,

      nano $HOME/.ssh/authorized_keys
      
    5. Configure their user account to automatically load the slurm module upon login, so they do not have to run the module load slurm command at every login:

      module initadd slurm
      
    6. Exit the user’s account by running the following command:

      exit
      
  5. Create a shared scratch space on LustreFS for the user: Next, create a LustreFS directory for the user. Follow the steps below to create and configure shared storage for the user.

    1. Run the following commands to create the user scratch space on the Lustre filesystem:

      mkdir -p /lustre/fs0/scratch/<cluster-user>
      
      chown <cluster-user>:<cluster-user> /lustre/fs0/scratch/<cluster-user>
      
    2. (Optional) You can then assign a quota to the user if necessary, using the commands below. More details can be found in the Managing Lustre Storage section of the NVIDIA DGX Cloud Cluster Administration Guide.

       # see current quota
       lfs quota -u <cluster-user> -v /lustre/fs0/scratch/<cluster-user>

       # example output of lfs quota for a user named demo-user
       Disk quotas for usr demo-user (uid 1004):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
       /lustre/fs0/scratch/demo-user/
                           4       0       0       -       1       0       0       -
       lustrefs-MDT0000_UUID
                           4       -       0       -       1       -       0       -
       lustrefs-OST0000_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0001_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0002_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0003_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0004_UUID
                           0       -       0       -       -       -       -       -
       Total allocated inode limit: 0, total allocated block limit: 0
       uid 1004 is using default block quota setting
       uid 1004 is using default file quota setting

       # set quota, e.g., 100G, and inodes
       lfs setquota -u <cluster-user> -b 100G -B 100G -i 10000 -I 11000 /lustre/fs0/scratch/<cluster-user>

       # example output after running setquota for a user named demo-user
       Disk quotas for usr demo-user (uid 1004):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
       /lustre/fs0/scratch/demo-user/
                           4  104857600 104857600       -       1   10000   11000       -
       lustrefs-MDT0000_UUID
                           4*      -       4       -       1*      -       1       -
       lustrefs-OST0000_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0001_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0002_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0003_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0004_UUID
                           0       -       0       -       -       -       -       -
      

      Note

      If no quota is set, the user has unlimited storage access to the whole Lustre filesystem, and if not careful, can consume the entire filesystem.

  6. Send the information to the user: The user is now set up to access the cluster and start working. Send the following information to the user:

    • Login node addresses

    • Their username and password information

    • Which SSH public key you configured for their user

    • Their LustreFS scratch directory information

    Each user should now be able to log on to the login nodes listed in the Prerequisites section using the following command:

    ssh -i /path/to/cluster_user_ssh_cert <cluster-user>@ip-addr-of-login-node
    
  7. Repeat Steps 1 through 6 for each cluster user you want to create.

  8. (Optional) Create a list of cluster teams or Slurm accounts: Refer to Setting Up Fair Share Scheduling and Teams for more information.
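The scratch-space and quota commands from Step 5 can also be scripted when many users are onboarded together. A hedged sketch that only prints the commands for review rather than running them; the user names are placeholders and the quota values mirror the example above:

```shell
# Print (not run) the per-user scratch and quota commands for review.
# Once satisfied, run the saved file on the head node as root:
#   sh scratch-setup.sh
SCRATCH_ROOT=/lustre/fs0/scratch   # scratch root used in this guide
USERS="demo-user another-user"     # placeholder usernames
for u in $USERS; do
  echo "mkdir -p $SCRATCH_ROOT/$u"
  echo "chown $u:$u $SCRATCH_ROOT/$u"
  echo "lfs setquota -u $u -b 100G -B 100G -i 10000 -I 11000 $SCRATCH_ROOT/$u"
done | tee scratch-setup.sh
```

Printing first keeps a reviewable record and avoids accidentally setting quotas for the wrong accounts.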

2.5. Setting Up NGC

As a part of the DGX Cloud subscription, your organization has received access to NVIDIA NGC, with Private Registry and NVIDIA AI Enterprise subscriptions enabled. As the cluster owner, you will be responsible for managing your NGC org, and inviting your admins and users to NGC.

For more information on setting up your NGC org, please see the NGC User Guide.

To invite users to the NGC org, follow the steps in the NGC User Guide here.

3. Cluster Admin Steps (Optional)

Cluster admins can manage the configuration of the Slurm job scheduler, run jobs, and execute tasks that require sudo access from the login, CPU, and GPU nodes. Additionally, cluster admins can onboard other cluster admins and users via Base View.

Note

The following sections assume your cluster owner created a cluster admin user. If you haven’t been set up with an admin user and an SSH key pair for logging into the cluster, please contact your cluster owner to get onboarded.

3.1. Accessing the Login Node

As a cluster admin, you have SSH access to the login nodes but not to the head node. Cluster admins can also access Base View via the login nodes.

To access the login node, follow these steps:

  1. Obtain the login node IPs from the cluster owner.

  2. Log in via SSH with the user account(s) created by the cluster owner:

    ssh -i /path/to/ssh_cert <cluster-admin>@ip-addr-of-login-node
    

    Note

    If you encounter any errors while trying SSH, refer to the Troubleshooting section for help.

3.2. Accessing Base View

Base View is a browser-based GUI that provides a dashboard view of the cluster.

Refer to the Accessing Base View section of the NVIDIA DGX Cloud Cluster Administration Guide for details.

3.3. Adding Users Via Base View

Note

The steps in this section do not need to be performed if the cluster owner has already completed the user creation via cmsh. If the cluster admin is creating users via Base View, proceed with the steps below.

For more information about creating and onboarding users in Base View, refer to the Adding Users Via Base View section of the NVIDIA DGX Cloud Cluster Administration Guide.

3.4. Creating and Configuring Lustre Shared Storage for Cluster Admins and Users

Note

The steps in this section do not need to be performed if the cluster owner has already created and configured the users’ Lustre scratch directories via the steps in Section 2. If the cluster admin is managing storage for users onboarded via Base View, proceed with the steps below.

Cluster admins have the capability to create and manage directories in shared storage for users. For more information about this, refer to the Managing Lustre Storage section of the NVIDIA DGX Cloud Cluster Administration Guide.

3.5. Setting Up Fair Share Scheduling and Teams

For information about setting up accounts (teams) and fair share scheduling, refer to Managing Slurm Cluster in the NVIDIA DGX Cloud Cluster Administration Guide.

4. Cluster User Steps

Cluster users can perform the following actions on the login nodes:

  • Use Slurm commands such as sinfo and squeue to determine the state of the Slurm job queue

  • Interact with NFS and Lustre storage attached to the cluster

  • Target jobs between CPU and GPU nodes depending on the use case

  • Schedule blocking or interactive jobs on the Slurm job queue

  • Schedule batch jobs on the Slurm job queue

Note

The following sections assume that your cluster admin has worked with you to create a cluster user. If you do not have a user and SSH key pair for logging in to the cluster yet, please contact your cluster admin to get onboarded.

4.1. Accessing the Login Node

Cluster users will have SSH access to the login nodes only. Cluster users can also access the User Portal through the login nodes.

To access the login node, follow these steps:

  1. Obtain the login node IPs from your cluster admin.

  2. Log in via SSH with the user account(s) created by the cluster admin:

    ssh -i /path/to/ssh_cert <cluster-user>@ip-addr-of-login-node
    

    Note

    If you encounter any errors while trying SSH, refer to the Troubleshooting section for help.

4.2. Setting Up NGC Integration

For more information on setting up your user account to be able to pull containers from NGC, refer to Setting Up NGC Integration in the DGX Cloud Cluster User Guide.
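As one common example (a sketch, not a substitute for that guide): pulling private container images from nvcr.io through enroot/pyxis typically requires an NGC API key in an enroot credentials file. The file location and netrc-style format below should be verified against your cluster's enroot configuration, and <NGC_API_KEY> is a placeholder for the key generated in NGC:

```shell
# Sketch: store NGC credentials for enroot (verify path and format
# against your cluster's enroot configuration).
# $oauthtoken is the literal NGC username; <NGC_API_KEY> is a placeholder.
mkdir -p ~/.config/enroot
cat > ~/.config/enroot/.credentials <<'EOF'
machine nvcr.io login $oauthtoken password <NGC_API_KEY>
EOF
chmod 600 ~/.config/enroot/.credentials   # keep the key private
```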

4.3. Running Jobs

The following sections guide you on how to set up and run basic jobs from the login nodes.

4.3.1. Loading Slurm Modules

To interact with software that has been installed in a DGX Cloud cluster, the appropriate modules must be loaded. Modules provide a quick method for loading and unloading specific sets of software and configuration data in a DGX Cloud environment. For more information about modules, see section 2.2 of the Base Command Manager administrator manual.

module load slurm

If you would like to configure your user account to load the Slurm module automatically, run the following command, log out, then log back into the login node.

module initadd slurm

4.3.2. Running a Single-Node Job

The example below runs a common single-node GPU-based job with the NCCL tests tool from an NGC container. Refer to Single-node Jobs in the DGX Cloud Cluster User Guide for more information.

  1. Create a script at $HOME/run-sn.sh using the text editor of your choice, with the following content:

     #!/bin/bash

     # should be encoded in enroot/environment.d/...
     export PMIX_MCA_gds=hash
     export PMIX_MCA_psec=native
     export OMPI_MCA_coll_hcoll_enable=0
     export CUDA_DEVICE_ORDER=PCI_BUS_ID
     export NCCL_SOCKET_IFNAME=eth0
     export NCCL_IB_PCI_RELAXED_ORDERING=1
     export NCCL_TOPO_FILE=/cm/shared/etc/ndv4-topo.xml
     export MELLANOX_VISIBLE_DEVICES=all
     export UCX_TLS=rc
     export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1

     export NCCL_PROTO=LL,LL128,Simple
     export NCCL_ALGO=Tree,Ring,CollnetDirect,CollnetChain,NVLS

     srun -N1 --exclusive --gpus-per-node 8 --mpi=pmix --container-mounts=/cm/shared/etc/ndv4-topo.xml:/cm/shared/etc/ndv4-topo.xml --container-image="nvcr.io#nvidia/pytorch:23.12-py3" -p defq all_reduce_perf_mpi -b 1G -e 4G -f 2 -g 8
    
  2. Make the script executable by running the following command:

    chmod +x $HOME/run-sn.sh
    
  3. Now you can run the script:

    cd $HOME
    ./run-sn.sh
    

    You should see output similar to the following example:

     pyxis: importing docker image: nvcr.io#nvidia/pytorch:23.12-py3
     pyxis: imported docker image: nvcr.io#nvidia/pytorch:23.12-py3
     # nThread 1 nGpus 8 minBytes 1073741824 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
     #
     # Using devices
     #  Rank  0 Group  0 Pid 848629 on     gpu008 device  0 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  1 Group  0 Pid 848629 on     gpu008 device  1 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  2 Group  0 Pid 848629 on     gpu008 device  2 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  3 Group  0 Pid 848629 on     gpu008 device  3 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  4 Group  0 Pid 848629 on     gpu008 device  4 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  5 Group  0 Pid 848629 on     gpu008 device  5 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  6 Group  0 Pid 848629 on     gpu008 device  6 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  7 Group  0 Pid 848629 on     gpu008 device  7 [0x00] NVIDIA A100-SXM4-80GB
     #
     #                                                              out-of-place                       in-place
     #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
     #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1073741824     268435456     float     sum      -1   8215.7  130.69  228.71      0   8214.6  130.71  228.75      0
     2147483648     536870912     float     sum      -1    16274  131.95  230.92      0    16273  131.97  230.95      0
     4294967296    1073741824     float     sum      -1    32231  133.25  233.20      0    33012  130.10  227.68      0
     # Out of bounds values : 0 OK
     # Avg bus bandwidth    : 230.034
     #
    

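The same single-node job can also be submitted non-interactively with sbatch. A hedged sketch: the directives below mirror the srun flags above, but the output filename, time limits, and any additional directives are assumptions to adjust for your cluster, and the remaining environment exports from run-sn.sh should be carried over as needed.

```shell
# Sketch: batch-submission variant of the single-node NCCL test.
# Written to a file here for review; submit it with:  sbatch run-sn.sbatch
cat > run-sn.sbatch <<'EOF'
#!/bin/bash
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH -p defq
#SBATCH -o nccl-sn-%j.out

# Carry over the remaining environment exports from run-sn.sh here.
export NCCL_TOPO_FILE=/cm/shared/etc/ndv4-topo.xml

srun --mpi=pmix \
     --container-mounts=/cm/shared/etc/ndv4-topo.xml:/cm/shared/etc/ndv4-topo.xml \
     --container-image="nvcr.io#nvidia/pytorch:23.12-py3" \
     all_reduce_perf_mpi -b 1G -e 4G -f 2 -g 8
EOF
```

Unlike the srun example, sbatch returns immediately and writes the job output to the file named by the -o directive.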
4.3.3. Running a Multi-Node Job

The example below runs a multi-node variant of the GPU-based job above with the NCCL tests tool from an NGC container. For more information on multi-node jobs, refer to Multi-node Jobs in the NVIDIA DGX Cloud Cluster User Guide.

  1. Create a script at $HOME/run-mn.sh using the text editor of your choice, with the following content:

     #!/bin/bash

     export OMPI_MCA_coll_hcoll_enable=0
     export UCX_TLS=rc
     export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
     export CUDA_DEVICE_ORDER=PCI_BUS_ID
     export NCCL_SOCKET_IFNAME=eth0
     export NCCL_IB_PCI_RELAXED_ORDERING=1
     export NCCL_TOPO_FILE=/cm/shared/etc/ndv4-topo.xml
     export NCCL_PROTO=LL,LL128,Simple
     export NCCL_ALGO=Tree,Ring,CollnetDirect,CollnetChain,NVLS
     export MELLANOX_VISIBLE_DEVICES=all
     export PMIX_MCA_gds=hash
     export PMIX_MCA_psec=native

     srun -N2 --exclusive --gpus-per-node 8 --mpi=pmix --container-mounts=/cm/shared/etc/ndv4-topo.xml:/cm/shared/etc/ndv4-topo.xml --container-image="nvcr.io#nvidia/pytorch:23.12-py3" -p defq all_reduce_perf_mpi -b 1G -e 4G -f 2 -g 8
    
  2. Make the script executable by running the following command:

    chmod +x $HOME/run-mn.sh
    
  3. Now you can run the script:

    ./run-mn.sh
    

    You should see output similar to the following example:

     pyxis: importing docker image: nvcr.io#nvidia/pytorch:23.12-py3
     pyxis: imported docker image: nvcr.io#nvidia/pytorch:23.12-py3
     # nThread 1 nGpus 8 minBytes 1073741824 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
     #
     # Using devices
     #  Rank  0 Group  0 Pid 824960 on     gpu005 device  0 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  1 Group  0 Pid 824960 on     gpu005 device  1 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  2 Group  0 Pid 824960 on     gpu005 device  2 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  3 Group  0 Pid 824960 on     gpu005 device  3 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  4 Group  0 Pid 824960 on     gpu005 device  4 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  5 Group  0 Pid 824960 on     gpu005 device  5 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  6 Group  0 Pid 824960 on     gpu005 device  6 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  7 Group  0 Pid 824960 on     gpu005 device  7 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  8 Group  0 Pid 822704 on     gpu006 device  0 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  9 Group  0 Pid 822704 on     gpu006 device  1 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank 10 Group  0 Pid 822704 on     gpu006 device  2 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank 11 Group  0 Pid 822704 on     gpu006 device  3 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank 12 Group  0 Pid 822704 on     gpu006 device  4 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank 13 Group  0 Pid 822704 on     gpu006 device  5 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank 14 Group  0 Pid 822704 on     gpu006 device  6 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank 15 Group  0 Pid 822704 on     gpu006 device  7 [0x00] NVIDIA A100-SXM4-80GB
     #
     #                                                              out-of-place                       in-place
     #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
     #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1073741824     268435456     float     sum      -1    12870   83.43  156.43      0    12611   85.14  159.64      0
     2147483648     536870912     float     sum      -1    23848   90.05  168.84      0    23997   89.49  167.80      0
     4294967296    1073741824     float     sum      -1    46598   92.17  172.82      0    48401   88.74  166.38      0
     # Out of bounds values : 0 OK
     # Avg bus bandwidth    : 165.318
     #
    

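Once jobs are submitted, the standard Slurm commands squeue and sacct show their state from a login node. A small sketch, guarded so it degrades gracefully when the slurm module has not been loaded yet; the tee simply keeps a copy of the status for reference:

```shell
# Show your queued/running jobs and recent job history (run on a login
# node after "module load slurm"). Falls back to a hint if Slurm
# commands are not on the PATH.
me=${USER:-$(id -un)}
{
  if command -v squeue >/dev/null 2>&1; then
    squeue -u "$me"   # jobs currently queued or running
    sacct -u "$me" -X --format=JobID,JobName,Partition,State,Elapsed
  else
    echo "Slurm commands not found; run 'module load slurm' first."
  fi
} | tee slurm-status.log
```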
4.4. Accessing the User Portal

User Portal is a browser-based GUI designed specifically for cluster users so that they can have a dashboard of their own workloads in the cluster.

Refer to User Portal in the NVIDIA DGX Cloud Cluster User Guide for more information.

5. Troubleshooting

5.1. SSH Key Permissions

5.1.1. Unprotected Private Key File in WSL

You may see this error when trying to SSH to the head node while using WSL on Windows.

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0777 for '<ssh-key-file>' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "<ssh-key-file>": bad permissions
<cluster-user>@slogin001: Permission denied (publickey,gssapi-with-mic).

To fix this, you need to update your WSL conf to allow you to own and change the file permissions for the SSH private key:

  1. Create an /etc/wsl.conf file (as sudo) with the following contents:

    [automount]
    options="metadata"
    
  2. Exit WSL

  3. Terminate the instance via command prompt (wsl --terminate <distro-name>) or shut it down (wsl --shutdown)

  4. Restart WSL

Then, from WSL, run the following command to change the permissions of the private key:

user@local:$HOME/.ssh$ chmod 600 <ssh-key-file>

Then, check the permissions:

user@local:$HOME/.ssh$ ls -l <ssh-key-file>

It should look like:

-rw------- 1 user local 2610 Apr  2 19:19 <ssh-key-file>
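The permission check can also be done non-interactively. A small sketch, demonstrated on a throwaway file so it is safe to run anywhere; point KEY at your real private key instead:

```shell
# Verify a private key is readable by its owner only (mode 600).
# KEY is a throwaway demo file here -- substitute your real key path.
KEY=demo-key
touch "$KEY"
chmod 600 "$KEY"
mode=$(stat -c %a "$KEY" 2>/dev/null || stat -f %Lp "$KEY")  # GNU or BSD stat
if [ "$mode" = "600" ]; then
  echo "OK: $KEY has mode 600"
else
  echo "WARNING: $KEY has mode $mode; run: chmod 600 $KEY"
fi
```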

5.2. Base View Permissions

In general, cluster users should use the User Portal rather than Base View, which is primarily intended for cluster admins. If a cluster user with insufficient permissions tries to log in to Base View, they will see an access denied error similar to the following.

[Screenshot: Base View access denied error]