1. Introduction & Personas

Congratulations on your new DGX Cloud cluster!

This guide is intended to provide the necessary information for the cluster owner, cluster admins, and cluster users to get started with their primary responsibilities on their DGX Cloud cluster.

The intended workflow of the guide starts with the cluster owner, who is the main contact for managing the DGX Cloud subscription and cluster.

More detailed information about user roles and functionalities can be found in the corresponding guides referenced throughout this document.

1.1. Cluster Owner

The cluster owner is responsible for:

  • Onboarding cluster admins and users via cmsh

  • Enrolling and inviting admins and users to NGC

  • Activating their subscription and registering for NVIDIA Enterprise Support

  • Collaborating with NVIDIA and cluster admins to troubleshoot and report any issues related to their DGX Cloud cluster

1.2. Cluster Admin

For a cluster admin, common tasks include:

  • Onboarding cluster users via Base View and creating/managing teams/accounts

  • Managing the cluster’s high-performance filesystem

  • Configuring Slurm queues and user quotas

  • Deeper inspection and manipulation of the Slurm job queue state

  • Debugging cluster behavior concerns

1.3. Cluster User

For a cluster user, common tasks include:

  • Scheduling compute jobs in Slurm

  • Ad hoc Slurm job queue interaction

  • Downloading source code or datasets

  • Manipulating configuration files

2. Cluster Owner Steps

2.1. Prerequisites

As a cluster owner, ensure the following steps have been completed:

  • Your Technical Account Manager (TAM) should have already reached out to you as the organization admin. During this process:

    • The TAM should have created a shared communication channel with you. Use this channel for any questions or issues during your experience on DGX Cloud.

    • You should have created an SSH key pair, and sent the public key to your TAM for initial access to the cluster.

  • To access the cluster, the TAM will provide the following information:

    • Head Node: <IP address of head node>

    • Login Nodes: <IP addresses of login nodes>

You will use the head node to manage the BCM installation, cluster configuration, and admin/user onboarding. The tool you will use for cluster configuration is cmsh (cluster management shell).

You will have SSH access to the head node via the root user and can SSH into the login nodes from the head node.

Cluster admins and users will primarily use the login nodes for their day-to-day work on the DGX Cloud cluster. They will have SSH and Base View access to the login nodes only.

As a cluster owner, you will be responsible for creating user accounts for cluster admins and users.

Important

As a security best practice, usage of the root account should be minimized. Instructions for creating a non-root user that the cluster owner can use to access the head node can be found in the Enable an Alternative Cluster Owner Account section of the NVIDIA DGX Cloud Cluster Administration Guide.

The root account used by the cluster owner on the head node should not be used to run jobs in the Slurm cluster. Only cluster admins and users should run jobs on the Slurm cluster, via the login nodes.

If needed, the cluster owner should create their own separate admin and/or user accounts to access the login nodes for work that does not require root access.

2.2. Accessing the Head Node as Root (Cluster Owner Only)

As the cluster owner, you can access the head node using the SSH key pair you provided to your Technical Account Manager (TAM). To do this, use the following command:

ssh -i /path/to/ssh_cert root@ip-addr-of-head-node

Note

If you encounter any issues while trying SSH, refer to Troubleshooting for assistance.
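If you connect frequently, an entry in the SSH config on your local workstation avoids retyping the key path and IP each time. A minimal sketch; the alias dgxc-head, the IP placeholder, and the key path are example values to replace with your own:

```shell
# Add a host alias on your local workstation (not on the cluster).
# "dgxc-head", the HostName, and the IdentityFile path below are
# placeholders -- substitute your own values.
mkdir -p ~/.ssh
cat >> ~/.ssh/config <<'EOF'
Host dgxc-head
    HostName ip-addr-of-head-node
    User root
    IdentityFile ~/.ssh/ssh_cert
EOF
chmod 600 ~/.ssh/config
```

After this, `ssh dgxc-head` is equivalent to the full command above.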

2.3. Adding Cluster Admins Via cmsh

As a cluster owner, you can add admins to help manage the Slurm cluster using the following steps.

  1. Compile a list of cluster admins: Make a list of people who will require admin access to the cluster.

  2. Create SSH key pairs: Ask each cluster admin to create an SSH key pair for themselves using the following command:

    ssh-keygen -t rsa -b 4096 -f ~/.ssh/<cluster-admin>-rsa-dgxc -C "cluster_admin_email@example.com"
    
  3. Obtain public keys: Once each admin has an SSH key pair, have each send you the contents of their public key (<cluster-admin>-rsa-dgxc.pub) file generated in their ~/.ssh/ directory. You will use this information in the following steps to create the cluster admin user.

  4. Create a cluster admin user: From the head node as root, run the following commands to create a cluster admin.

    1. Enter cmsh with this command:

      cmsh
      
    2. Run the following commands within cmsh:

      user
      add <cluster-admin>
      set password
      set profile tenantadmin
      commit

      group
      use tenantadmin
      append members <cluster-admin>
      commit
      quit
      
    3. Switch to the user’s account:

      sudo su - <cluster-admin>
      
    4. Add their SSH public key (obtained during Step 3 above) to the authorized_keys file using a text editor of your choice. For example,

      nano $HOME/.ssh/authorized_keys
      
    5. Configure their admin user account to automatically load the slurm module upon login, so they do not have to run the module load slurm command at every login:

      module initadd slurm
      
    6. Exit the admin user’s account:

      exit
      
    7. Run the following commands to add the admin user as a Slurm admin:

      module load slurm
      sacctmgr add User User=<cluster-admin> Account=root AdminLevel=Administrator
      

      Commit the changes when prompted.

  5. (Optional) Create a shared scratch space on LustreFS for the admin user: If the cluster admin will be running Slurm jobs, you can configure a scratch space for their user on the Lustre shared filesystem, or they can configure it themselves if needed (using sudo). Follow the steps below to do so.

    1. Run the following commands to create the admin user’s scratch space on the Lustre filesystem:

      mkdir -p /lustre/fs0/scratch/<cluster-admin>
      
      chown <cluster-admin>:<cluster-admin> /lustre/fs0/scratch/<cluster-admin>
      
    2. (Optional) You can then assign a quota to the admin user if necessary, using the commands below. More details can be found in the Managing Lustre Storage section of the NVIDIA DGX Cloud Cluster Administration Guide.

       # see current quota
       lfs quota -u <cluster-admin> -v /lustre/fs0/scratch/<cluster-admin>

       # example output of lfs quota for a user named demo-user
       Disk quotas for usr demo-user (uid 1004):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
       /lustre/fs0/scratch/demo-user/
                           4       0       0       -       1       0       0       -
       lustrefs-MDT0000_UUID
                           4       -       0       -       1       -       0       -
       lustrefs-OST0000_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0001_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0002_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0003_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0004_UUID
                           0       -       0       -       -       -       -       -
       Total allocated inode limit: 0, total allocated block limit: 0
       uid 1004 is using default block quota setting
       uid 1004 is using default file quota setting

       # set quota, e.g., 100G, and inodes
       lfs setquota -u <cluster-admin> -b 100G -B 100G -i 10000 -I 11000 /lustre/fs0/scratch/<cluster-admin>

       # example output after running setquota for a user named demo-user
       Disk quotas for usr demo-user (uid 1004):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
       /lustre/fs0/scratch/demo-user/
                           4  104857600 104857600       -       1   10000   11000       -
       lustrefs-MDT0000_UUID
                           4*      -       4       -       1*      -       1       -
       lustrefs-OST0000_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0001_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0002_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0003_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0004_UUID
                           0       -       0       -       -       -       -       -
      
  6. Send the information to the admin: The admin is now set up to access the cluster and start working. Send the following information to the admin:

    • Login node addresses

    • Their username and password information

    • Which SSH public key you configured for their user

    • (Optional) Their LustreFS scratch directory information

    Each cluster admin should now be able to log on to the login nodes listed in the Prerequisites section using the following command:

    ssh -i /path/to/cluster_admin_ssh_cert <cluster-admin>@ip-addr-of-login-node
    

Repeat Steps 1 through 6 for each cluster admin user you want to create.
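When onboarding several admins at once, the cmsh commands from Step 4 can be generated in bulk. A hedged sketch: it only writes a command file for review, which can then be fed to cmsh on the head node as root via standard input. The names alice and bob are placeholders, and because set password is interactive, passwords are left to be set per user afterwards:

```shell
# Generate the cmsh command sequence for a list of new cluster admins.
# Review add-admins.cmsh, then on the head node run:  cmsh < add-admins.cmsh
# ("set password" is interactive, so set each admin's password separately.)
ADMINS="alice bob"   # placeholder usernames
: > add-admins.cmsh
for a in $ADMINS; do
  cat >> add-admins.cmsh <<EOF
user
add $a
set profile tenantadmin
commit
group
use tenantadmin
append members $a
commit
EOF
done
echo quit >> add-admins.cmsh
```

Inspect the generated file before running it, since every line will be executed as a cmsh command.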

2.4. Adding Cluster Users Via cmsh

As a cluster owner, you’ll need to gather some information to onboard users to the cluster.

  1. Compile a list of cluster users: Start by compiling a list of users who need access to the cluster.

  2. Create SSH key pairs: Each user will need to create an SSH key pair for themselves with the following command:

    ssh-keygen -t rsa -b 4096 -f ~/.ssh/<cluster-user>-rsa-dgxc -C "your_email@example.com"
    
  3. Obtain public keys: Once each user has created an SSH key pair, have them send you the contents of their public key (<cluster-user>-rsa-dgxc.pub) file located in their ~/.ssh/ directory. You will use this in the following steps to create the cluster user.

  4. Create a cluster user: From the head node as root, run the following commands to create a cluster user.

    1. Enter cmsh with this command:

      cmsh
      
    2. Within cmsh, run the following commands to create a cluster user:

      user
      add <cluster-user>
      set password
      set profile portal
      commit
      quit
      
    3. Switch to the user’s account using the following command:

      sudo su - <cluster-user>
      
    4. Add the user’s SSH public key (obtained in Step 3) to the authorized_keys file in their ~/.ssh/ directory, using the text editor of your choice. For example,

      nano $HOME/.ssh/authorized_keys
      
    5. Configure their user account to automatically load the slurm module upon login, so they do not have to run the module load slurm command at every login:

      module initadd slurm
      
    6. Exit the user’s account by running the following command:

      exit
      
  5. Create a shared scratch space on LustreFS for the user: Next, create a LustreFS directory for the user. Follow the steps below to create and configure shared storage for the user.

    1. Run the following commands to create the user scratch space on the Lustre filesystem:

      mkdir -p /lustre/fs0/scratch/<cluster-user>
      
      chown <cluster-user>:<cluster-user> /lustre/fs0/scratch/<cluster-user>
      
    2. (Optional) You can then assign a quota to the user if necessary, using the commands below. More details can be found in the Managing Lustre Storage section of the NVIDIA DGX Cloud Cluster Administration Guide.

       # see current quota
       lfs quota -u <cluster-user> -v /lustre/fs0/scratch/<cluster-user>

       # example output of lfs quota for a user named demo-user
       Disk quotas for usr demo-user (uid 1004):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
       /lustre/fs0/scratch/demo-user/
                           4       0       0       -       1       0       0       -
       lustrefs-MDT0000_UUID
                           4       -       0       -       1       -       0       -
       lustrefs-OST0000_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0001_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0002_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0003_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0004_UUID
                           0       -       0       -       -       -       -       -
       Total allocated inode limit: 0, total allocated block limit: 0
       uid 1004 is using default block quota setting
       uid 1004 is using default file quota setting

       # set quota, e.g., 100G, and inodes
       lfs setquota -u <cluster-user> -b 100G -B 100G -i 10000 -I 11000 /lustre/fs0/scratch/<cluster-user>

       # example output after running setquota for a user named demo-user
       Disk quotas for usr demo-user (uid 1004):
           Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
       /lustre/fs0/scratch/demo-user/
                           4  104857600 104857600       -       1   10000   11000       -
       lustrefs-MDT0000_UUID
                           4*      -       4       -       1*      -       1       -
       lustrefs-OST0000_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0001_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0002_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0003_UUID
                           0       -       0       -       -       -       -       -
       lustrefs-OST0004_UUID
                           0       -       0       -       -       -       -       -
      

      Note

      If no quota is set, the user has unlimited storage access to the whole Lustre filesystem, and if not careful, can consume the entire filesystem.

  6. Send the information to the user: The user is now set up to access the cluster and start working. Send the following information to the user:

    • Login node addresses

    • Their username and password information

    • Which SSH public key you configured for their user

    • Their LustreFS scratch directory information

    Each user should now be able to log on to the login nodes listed in the Prerequisites section using the following command:

    ssh -i /path/to/cluster_user_ssh_cert <cluster-user>@ip-addr-of-login-node
    
  7. Repeat Steps 1 through 6 for each cluster user you want to create.

  8. (Optional) Create a list of cluster teams or Slurm accounts: Refer to Setting Up Fair Share Scheduling and Teams for more information.
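The scratch-space and quota commands from Step 5 can also be scripted when many users are onboarded together. A hedged sketch that only prints the commands for review rather than running them; the user names are placeholders and the quota values mirror the example above:

```shell
# Print (not run) the per-user scratch and quota commands for review.
# Once satisfied, run the saved file on the head node as root:
#   sh scratch-setup.sh
SCRATCH_ROOT=/lustre/fs0/scratch   # scratch root used in this guide
USERS="demo-user another-user"     # placeholder usernames
for u in $USERS; do
  echo "mkdir -p $SCRATCH_ROOT/$u"
  echo "chown $u:$u $SCRATCH_ROOT/$u"
  echo "lfs setquota -u $u -b 100G -B 100G -i 10000 -I 11000 $SCRATCH_ROOT/$u"
done | tee scratch-setup.sh
```

Printing first keeps a reviewable record and avoids accidentally setting quotas for the wrong accounts.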

2.5. Setting Up NGC

As a part of the DGX Cloud subscription, your organization has received access to NVIDIA NGC, with Private Registry and NVIDIA AI Enterprise subscriptions enabled. As the cluster owner, you will be responsible for managing your NGC org, and inviting your admins and users to NGC.

For more information on setting up your NGC org, please see the NGC User Guide.

To invite users to the NGC org, follow the steps in the NGC User Guide here.

3. Cluster Admin Steps (Optional)

Cluster admins can manage the configuration of the Slurm job scheduler, run jobs, and execute tasks that require sudo access from the login, CPU, and GPU nodes. Additionally, cluster admins can onboard other cluster admins and users via Base View.

Note

The following sections assume your cluster owner created a cluster admin user. If you haven’t been set up with an admin user and an SSH key pair for logging into the cluster, please contact your cluster owner to get onboarded.

3.1. Accessing the Login Node

As a cluster admin, you have SSH access to the login nodes but not to the head node. Cluster admins can also access Base View via the login nodes.

To access the login node, follow these steps:

  1. Obtain the login node IPs from the cluster owner.

  2. Log in via SSH with the user account(s) created by the cluster owner:

    ssh -i /path/to/ssh_cert <cluster-admin>@ip-addr-of-login-node
    

    Note

    If you encounter any errors while trying SSH, refer to the Troubleshooting section for help.

3.2. Accessing Base View

Base View is a browser-based GUI that provides a dashboard view of the cluster.

Refer to the Accessing Base View section of the NVIDIA DGX Cloud Cluster Administration Guide for details.

3.3. Adding Users Via Base View

Note

The steps in this section do not need to be performed if the cluster owner has already completed the user creation via cmsh. If the cluster admin is creating users via Base View, proceed with the steps below.

For more information about creating and onboarding users in Base View, refer to the Adding Users Via Base View section of the NVIDIA DGX Cloud Cluster Administration Guide.

3.4. Creating and Configuring Lustre Shared Storage for Cluster Admins and Users

Note

The steps in this section do not need to be performed if the cluster owner has already created and configured the users’ Lustre scratch directories via the steps in Section 2. If the cluster admin is managing storage for users onboarded via Base View, proceed with the steps below.

Cluster admins have the capability to create and manage directories in shared storage for users. For more information about this, refer to the Managing Lustre Storage section of the NVIDIA DGX Cloud Cluster Administration Guide.

3.5. Setting Up Fair Share Scheduling and Teams

For information about setting up accounts (teams) and fair share scheduling, refer to Managing Slurm Cluster in the NVIDIA DGX Cloud Cluster Administration Guide.

4. Cluster User Steps

Cluster users can perform the following actions on the login nodes:

  • Use Slurm commands such as sinfo and squeue to determine the state of the Slurm job queue

  • Interact with NFS and Lustre storage attached to the cluster

  • Target jobs between CPU and GPU nodes depending on the use case

  • Schedule blocking or interactive jobs on the Slurm job queue

  • Schedule batch jobs on the Slurm job queue

Note

The following sections assume that your cluster admin has worked with you to create a cluster user. If you do not have a user and SSH key pair for logging in to the cluster yet, please contact your cluster admin to get onboarded.

4.1. Accessing the Login Node

Cluster users will have SSH access to the login nodes only. Cluster users can also access the User Portal through the login nodes.

To access the login node, follow these steps:

  1. Obtain the login node IPs from your cluster admin.

  2. Log in via SSH with the user account(s) created by the cluster admin:

    ssh -i /path/to/ssh_cert <cluster-user>@ip-addr-of-login-node
    

    Note

    If you encounter any errors while trying SSH, refer to the Troubleshooting section for help.

4.2. Setting Up NGC Integration

For more information on setting up your user account to be able to pull containers from NGC, refer to Setting Up NGC Integration in the DGX Cloud Cluster User Guide.
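As one common example (a sketch, not a substitute for that guide): pulling private container images from nvcr.io through enroot/pyxis typically requires an NGC API key in an enroot credentials file. The file location and netrc-style format below should be verified against your cluster's enroot configuration, and <NGC_API_KEY> is a placeholder for the key generated in NGC:

```shell
# Sketch: store NGC credentials for enroot (verify path and format
# against your cluster's enroot configuration).
# $oauthtoken is the literal NGC username; <NGC_API_KEY> is a placeholder.
mkdir -p ~/.config/enroot
cat > ~/.config/enroot/.credentials <<'EOF'
machine nvcr.io login $oauthtoken password <NGC_API_KEY>
EOF
chmod 600 ~/.config/enroot/.credentials   # keep the key private
```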

4.3. Running Jobs

The following sections guide you on how to set up and run basic jobs from the login nodes.

4.3.1. Loading Slurm Modules

To interact with software that has been installed in a DGX Cloud cluster, the appropriate modules must be loaded. Modules provide a quick method for loading and unloading specific sets of software and configuration data in a DGX Cloud environment. For more information about modules, see section 2.2 of the Base Command Manager administrator manual.

module load slurm

If you would like to configure your user account to load the Slurm module automatically, run the following command, log out, then log back into the login node.

module initadd slurm

4.3.2. Running a Single-Node Job

The example below runs a common single-node GPU-based job with the NCCL tests tool from an NGC container. Refer to Single-node Jobs in the DGX Cloud Cluster User Guide for more information.

  1. Create a script at $HOME/run-sn.sh using the text editor of your choice, with the following content:

     #!/bin/bash

     # should be encoded in enroot/environment.d/...
     export PMIX_MCA_gds=hash
     export PMIX_MCA_psec=native
     export OMPI_MCA_coll_hcoll_enable=0
     export CUDA_DEVICE_ORDER=PCI_BUS_ID
     export NCCL_SOCKET_IFNAME=eth0
     export NCCL_IB_PCI_RELAXED_ORDERING=1
     export NCCL_TOPO_FILE=/cm/shared/etc/ndv4-topo.xml
     export MELLANOX_VISIBLE_DEVICES=all
     export UCX_TLS=rc
     export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1

     export NCCL_PROTO=LL,LL128,Simple
     export NCCL_ALGO=Tree,Ring,CollnetDirect,CollnetChain,NVLS

     srun -N1 --exclusive --gpus-per-node 8 --mpi=pmix --container-mounts=/cm/shared/etc/ndv4-topo.xml:/cm/shared/etc/ndv4-topo.xml --container-image="nvcr.io#nvidia/pytorch:23.12-py3" -p defq all_reduce_perf_mpi -b 1G -e 4G -f 2 -g 8
    
  2. Make the script executable by running the following command:

    chmod +x $HOME/run-sn.sh
    
  3. Now you can run the script:

    cd $HOME
    ./run-sn.sh
    

    You should see output similar to the following example:

     pyxis: importing docker image: nvcr.io#nvidia/pytorch:23.12-py3
     pyxis: imported docker image: nvcr.io#nvidia/pytorch:23.12-py3
     # nThread 1 nGpus 8 minBytes 1073741824 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
     #
     # Using devices
     #  Rank  0 Group  0 Pid 848629 on     gpu008 device  0 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  1 Group  0 Pid 848629 on     gpu008 device  1 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  2 Group  0 Pid 848629 on     gpu008 device  2 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  3 Group  0 Pid 848629 on     gpu008 device  3 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  4 Group  0 Pid 848629 on     gpu008 device  4 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  5 Group  0 Pid 848629 on     gpu008 device  5 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  6 Group  0 Pid 848629 on     gpu008 device  6 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  7 Group  0 Pid 848629 on     gpu008 device  7 [0x00] NVIDIA A100-SXM4-80GB
     #
     #                                                              out-of-place                       in-place
     #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
     #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1073741824     268435456     float     sum      -1   8215.7  130.69  228.71      0   8214.6  130.71  228.75      0
     2147483648     536870912     float     sum      -1    16274  131.95  230.92      0    16273  131.97  230.95      0
     4294967296    1073741824     float     sum      -1    32231  133.25  233.20      0    33012  130.10  227.68      0
     # Out of bounds values : 0 OK
     # Avg bus bandwidth    : 230.034
     #
    

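The same single-node job can also be submitted non-interactively with sbatch. A hedged sketch: the directives below mirror the srun flags above, but the output filename, time limits, and any additional directives are assumptions to adjust for your cluster, and the remaining environment exports from run-sn.sh should be carried over as needed.

```shell
# Sketch: batch-submission variant of the single-node NCCL test.
# Written to a file here for review; submit it with:  sbatch run-sn.sbatch
cat > run-sn.sbatch <<'EOF'
#!/bin/bash
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH -p defq
#SBATCH -o nccl-sn-%j.out

# Carry over the remaining environment exports from run-sn.sh here.
export NCCL_TOPO_FILE=/cm/shared/etc/ndv4-topo.xml

srun --mpi=pmix \
     --container-mounts=/cm/shared/etc/ndv4-topo.xml:/cm/shared/etc/ndv4-topo.xml \
     --container-image="nvcr.io#nvidia/pytorch:23.12-py3" \
     all_reduce_perf_mpi -b 1G -e 4G -f 2 -g 8
EOF
```

Unlike the srun example, sbatch returns immediately and writes the job output to the file named by the -o directive.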
4.3.3. Running a Multi-Node Job

The example below runs a multi-node variant of the GPU-based job above with the NCCL tests tool from an NGC container. For more information on multi-node jobs, refer to Multi-node Jobs in the NVIDIA DGX Cloud Cluster User Guide.

  1. Create a script at $HOME/run-mn.sh using the text editor of your choice, with the following content:

     #!/bin/bash

     export OMPI_MCA_coll_hcoll_enable=0
     export UCX_TLS=rc
     export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
     export CUDA_DEVICE_ORDER=PCI_BUS_ID
     export NCCL_SOCKET_IFNAME=eth0
     export NCCL_IB_PCI_RELAXED_ORDERING=1
     export NCCL_TOPO_FILE=/cm/shared/etc/ndv4-topo.xml
     export NCCL_PROTO=LL,LL128,Simple
     export NCCL_ALGO=Tree,Ring,CollnetDirect,CollnetChain,NVLS
     export MELLANOX_VISIBLE_DEVICES=all
     export PMIX_MCA_gds=hash
     export PMIX_MCA_psec=native

     srun -N2 --exclusive --gpus-per-node 8 --mpi=pmix --container-mounts=/cm/shared/etc/ndv4-topo.xml:/cm/shared/etc/ndv4-topo.xml --container-image="nvcr.io#nvidia/pytorch:23.12-py3" -p defq all_reduce_perf_mpi -b 1G -e 4G -f 2 -g 8
    
  2. Make the script executable by running the following command:

    chmod +x $HOME/run-mn.sh
    
  3. Now you can run the script:

    ./run-mn.sh
    

    You should see output similar to the following example:

     pyxis: importing docker image: nvcr.io#nvidia/pytorch:23.12-py3
     pyxis: imported docker image: nvcr.io#nvidia/pytorch:23.12-py3
     # nThread 1 nGpus 8 minBytes 1073741824 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
     #
     # Using devices
     #  Rank  0 Group  0 Pid 824960 on     gpu005 device  0 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  1 Group  0 Pid 824960 on     gpu005 device  1 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  2 Group  0 Pid 824960 on     gpu005 device  2 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  3 Group  0 Pid 824960 on     gpu005 device  3 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  4 Group  0 Pid 824960 on     gpu005 device  4 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  5 Group  0 Pid 824960 on     gpu005 device  5 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  6 Group  0 Pid 824960 on     gpu005 device  6 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  7 Group  0 Pid 824960 on     gpu005 device  7 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  8 Group  0 Pid 822704 on     gpu006 device  0 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank  9 Group  0 Pid 822704 on     gpu006 device  1 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank 10 Group  0 Pid 822704 on     gpu006 device  2 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank 11 Group  0 Pid 822704 on     gpu006 device  3 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank 12 Group  0 Pid 822704 on     gpu006 device  4 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank 13 Group  0 Pid 822704 on     gpu006 device  5 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank 14 Group  0 Pid 822704 on     gpu006 device  6 [0x00] NVIDIA A100-SXM4-80GB
     #  Rank 15 Group  0 Pid 822704 on     gpu006 device  7 [0x00] NVIDIA A100-SXM4-80GB
     #
     #                                                              out-of-place                       in-place
     #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
     #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1073741824     268435456     float     sum      -1    12870   83.43  156.43      0    12611   85.14  159.64      0
     2147483648     536870912     float     sum      -1    23848   90.05  168.84      0    23997   89.49  167.80      0
     4294967296    1073741824     float     sum      -1    46598   92.17  172.82      0    48401   88.74  166.38      0
     # Out of bounds values : 0 OK
     # Avg bus bandwidth    : 165.318
     #
    

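Once jobs are submitted, the standard Slurm commands squeue and sacct show their state from a login node. A small sketch, guarded so it degrades gracefully when the slurm module has not been loaded yet; the tee simply keeps a copy of the status for reference:

```shell
# Show your queued/running jobs and recent job history (run on a login
# node after "module load slurm"). Falls back to a hint if Slurm
# commands are not on the PATH.
me=${USER:-$(id -un)}
{
  if command -v squeue >/dev/null 2>&1; then
    squeue -u "$me"   # jobs currently queued or running
    sacct -u "$me" -X --format=JobID,JobName,Partition,State,Elapsed
  else
    echo "Slurm commands not found; run 'module load slurm' first."
  fi
} | tee slurm-status.log
```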
4.4. Accessing the User Portal

User Portal is a browser-based GUI designed specifically for cluster users so that they can have a dashboard of their own workloads in the cluster.

Refer to User Portal in the NVIDIA DGX Cloud Cluster User Guide for more information.

5. Troubleshooting

5.1. SSH Key Permissions

5.1.1. Unprotected Private Key File in WSL

You may see this error when trying to SSH to the head node while using WSL on Windows.

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0777 for '<ssh-key-file>' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "<ssh-key-file>": bad permissions
<cluster-user>@slogin001: Permission denied (publickey,gssapi-with-mic).

To fix this, you need to update your WSL conf to allow you to own and change the file permissions for the SSH private key:

  1. Create an /etc/wsl.conf file (as sudo) with the following contents:

    [automount]
    options="metadata"
    
  2. Exit WSL

  3. Terminate the instance via command prompt (wsl --terminate <distro-name>) or shut it down (wsl --shutdown)

  4. Restart WSL

Then, from WSL, run the following command to change the permissions of the private key:

user@local:$HOME/.ssh$ chmod 600 <ssh-key-file>

Then, check the permissions:

user@local:$HOME/.ssh$ ls -l <ssh-key-file>

It should look like:

-rw------- 1 user local 2610 Apr  2 19:19 <ssh-key-file>
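The permission check can also be done non-interactively. A small sketch, demonstrated on a throwaway file so it is safe to run anywhere; point KEY at your real private key instead:

```shell
# Verify a private key is readable by its owner only (mode 600).
# KEY is a throwaway demo file here -- substitute your real key path.
KEY=demo-key
touch "$KEY"
chmod 600 "$KEY"
mode=$(stat -c %a "$KEY" 2>/dev/null || stat -f %Lp "$KEY")  # GNU or BSD stat
if [ "$mode" = "600" ]; then
  echo "OK: $KEY has mode 600"
else
  echo "WARNING: $KEY has mode $mode; run: chmod 600 $KEY"
fi
```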

5.2. Base View Permissions

In general, cluster users should use the User Portal rather than Base View, which is primarily intended for cluster admins. If a cluster user with insufficient permissions tries to log in to Base View, they will see an access denied error similar to the following.

[Screenshot: Base View access denied error]