1. Introduction

1.1. NVIDIA DGX Cloud Overview

NVIDIA DGX™ Cloud is an AI platform for enterprise developers, optimized for the demands of generative AI.

DGX Cloud Stack Diagram

DGX Cloud delivers a full-stack solution co-engineered with leading cloud partners, combining their best-of-breed architectures with the newest NVIDIA AI technologies across accelerated computing, networking fabric, and software. It also provides direct access to NVIDIA AI experts.

DGX Cloud includes NVIDIA AI Enterprise, offering accelerated data science libraries, optimized frameworks, and pre-trained models. This enables developers to quickly and easily deliver production-ready AI.

1.2. Overview of Your Cluster

The DGX Cloud cluster includes the following components provisioned and configured for use:

  • Head (management) node(s)

  • Login nodes

  • CPU worker nodes

  • GPU worker nodes

  • Shared NFS storage

  • Shared high-performance Lustre storage

The exact configuration and capacity of each component is customized to the requirements specified during the onboarding process.

The Slurm workload management tool is provided as the primary interface to the cluster for users familiar with HPC schedulers who want direct access to the underlying systems.

The Slurm implementation is deployed and configured by NVIDIA Base Command Manager, and leverages Pyxis and Enroot as the container plugin and runtime.
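
For example, a containerized workload can be launched directly through srun using Pyxis's --container-image flag. The image tag and resource request below are illustrative only; any container your organization can pull will work.

srun --gres=gpu:1 --container-image=nvcr.io#nvidia/pytorch:24.05-py3 nvidia-smi -L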

The cluster configuration via Base Command Manager (BCM) and the Slurm configuration can be further modified and customized by the customer for specific use cases.

For more information on BCM, refer to the BCM documentation.

For more information on Slurm, refer to the Slurm Quick Start Guide.

1.3. Overview of Cluster Administration

This section provides a high-level overview of what you, as the customer's cluster owners and admins, can do on your DGX Cloud cluster. This information is provided for guidance.

For more information about the personas mentioned, see the DGX Cloud Onboarding Quick Start Guide.

For details on a shared responsibility model between NVIDIA and the customer, see the Shared Responsibility Model section below.

1.3.1. Cluster Owner Tasks

For a cluster owner, common tasks include:

  • Onboarding cluster admins and users to the cluster via cmsh.

  • Enrolling and inviting admins and users to your NGC Org.

  • Activating your subscription and registering for NVIDIA Enterprise Support.

  • Working with NVIDIA and your cluster admins to troubleshoot and file issues associated with your DGX Cloud cluster.

1.3.2. Cluster Admin Tasks

For a cluster admin, common tasks include:

  • Onboarding cluster users via Base View.

  • Creating and managing teams (Slurm accounts).

  • Managing the cluster’s high-performance filesystem.

  • Configuring Slurm queues and user quotas.

  • Deeper inspection and manipulation of the Slurm job queue state.

  • Debugging cluster behavior concerns.

2. Security of DGX Cloud

2.1. Shared Responsibility Model

NVIDIA DGX Cloud provides a high-performance compute cluster provisioned and configured for the customer to use. Management and operation of the cluster is the customer’s responsibility (self-managed). Details of NVIDIA’s and the customer’s responsibilities are provided in the sections below.

2.1.1. NVIDIA Responsibility

NVIDIA is responsible for the security of DGX Cloud. For customer-managed clusters, this includes:

  1. Managing the cloud service provider’s account, infrastructure resources, and security controls.

    • Separating resources (compute, network, and storage) between customers.

    • Implementing ingress restrictions via Classless Inter-Domain Routing (CIDR) ranges for the Base Command Manager (BCM) head node and login nodes based on customer guidance.

    • Configuring data encryption at rest at the infrastructure layer to protect customer data.

  2. Provisioning and pre-configuring a BCM Slurm cluster to simplify the customer’s onboarding process.

  3. Onboarding customer admins to NGC and the DGX Cloud cluster for initial access.

  4. Publishing regular software updates and security patches for BCM software and notifying impacted customers.

  5. Deprovisioning clusters and associated resources at the end of use as part of the offboarding process.

2.1.2. Customer Responsibility

Customers are responsible for the security and compliance of their self-managed BCM cluster and its resources in DGX Cloud. This includes:

  1. Network Security

    • Providing NVIDIA with ingress rule guidance via CIDR ranges for restricting access at the CSP infrastructure layer.

    • Reaching out to NVIDIA support channels for assistance in updating CIDR restrictions as needed throughout the cluster’s lifetime.

    • Configuring software node-level firewall on the cluster nodes for added protection.

  2. User Access Management

    • Implementing access control methods to ensure only authorized users have access to the head node and login nodes through SSH.

    • Protecting credentials used for access management to the cluster.

    • Assigning appropriate roles to users on the DGX Cloud cluster.

    • Managing users on NGC.

  3. Data Protection

    • Securely transferring data between their managed storage and DGX Cloud storage.

    • Creating appropriate data backups and additional encryption as necessary to comply with their security compliance requirements.

  4. Software Security Updates

    • Ensuring the operating system is regularly updated with the latest security patches.

    • Performing updates to BCM software at a regular cadence.

  5. Security Logging, Threat Detection, and Incident Response

    • Collecting and forwarding application, system, and node firewall logs to appropriate tools, such as security information and event management (SIEM) software, as recommended by their IT and security departments.

  6. Compliance

    • Abiding by the End User License Agreements or Terms of Use for the NVIDIA Software & Services used by you or your organization, including but not limited to the TOU for NVIDIA GPU Cloud and DGX Cloud.

    • Complying with your internal policies and other regulatory compliance requirements.

2.2. Additional Security Considerations

As the operator of a DGX Cloud cluster, you must consider your own security requirements once you take possession of the cluster. This cluster is yours to operate according to your organization’s requirements. However, during the provisioning and onboarding process for the cluster, you may want to take into account a few different security considerations:

  • Before onboarding, your organization will be asked to specify Classless Inter-Domain Routing (CIDR) ranges to define the IP addresses that can access the DGX Cloud cluster. To minimize risk, these ranges should be as small as possible, and can even be restricted to a single IP address, such as a jump host, to maximize security (see the example after this list).

  • For some organizations, the root account may not be allowed to be used for SSH access. In this scenario, consider following the Enable an Alternative Cluster Owner Account (Cluster Owner Only) documentation to implement a non-root privileged account on the head node. By default, only the root account can be used to access the head node, but enabling an alternative non-root privileged account and storing the root SSH certificate in a secure manner as a backup is a good practice to consider.
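
For reference, a CIDR range pairs a base IP address with a prefix length that controls how many addresses it covers. The examples below use documentation placeholder addresses and are illustrative only:

203.0.113.0/28    # a small office egress range (16 addresses)
198.51.100.10/32  # a single jump host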

3. DGX Cloud Cluster Onboarding

A quick start guide providing step-by-step instructions for the cluster owner, cluster admins, and cluster users to get started on their DGX Cloud cluster is available.

For more information, refer to the DGX Cloud Onboarding Quick Start Guide.

4. Accessing Your DGX Cloud Cluster as Admins

As an admin, you have several connection options for accessing your DGX Cloud environment:

  • Head Node: The cluster owner can access the head node to manage Base Command Manager (BCM) as the root user. This is done via the public internet using Classless Inter-Domain Routing (CIDR) ranges and a public SSH key provided by the user to NVIDIA during the initial cluster creation.

  • Login Nodes: Two login nodes are accessible via public internet using CIDR ranges provided by the user to NVIDIA, using public SSH keys registered with a DGX Cloud deployment.

  • Base View: A web interface accessible via the public internet using CIDR ranges provided by the user to NVIDIA. Base View is accessed through the login nodes.

Note

NVIDIA provides the cluster owner with the IP addresses for the head nodes and login nodes. Please work with the cluster owner to retrieve the correct addresses to access the cluster using the steps below.

Important

The root account used by the cluster owner on the head node should not be used to run jobs in the Slurm cluster. Only cluster admins and users should run jobs on the Slurm cluster, via the login nodes.

If needed, the cluster owner should create their own admin and/or user accounts to access the login nodes for work that does not require root access.

4.1. Accessing the Head Node as Root (Cluster Owner Only)

As the cluster owner, you can access the head node using the SSH key pair you provided to your Technical Account Manager (TAM). To do this, use the following command:

ssh -i /path/to/ssh_cert root@ip-addr-of-head-node

Note

If you encounter any issues while trying SSH, refer to SSH Key Permissions for assistance.

4.2. Accessing the Head Node as Non-Root (Cluster Owner Only)

Optionally, a non-root privileged account can be created to access the head node. Follow Enable an Alternative Cluster Owner Account (Cluster Owner Only) to configure a non-root privileged account. To use that account instead (named dgxc-admin in this example), use the following command:

ssh -i /path/to/ssh_cert_for_dgxc_admin dgxc-admin@ip-addr-of-head-node

Once logged in to the head node, this user account can then use sudo to elevate to the root user by running the following command:

dgxc-admin@dgxc-bcm-prod-002-cluster1:~$ sudo su -

Node Status: running in active master mode

-------------------------------------------------------------------------------
Cluster Manager license expiration date: 03/Jul/2024, time remaining: 8w 2d
root@dgxc-bcm-prod-002-cluster1:~#

Now you can leverage cmsh to make configuration and user account changes as root.

root@dgxc-bcm-prod-002-cluster1:~# cmsh
[dgxc-bcm-prod-002-cluster1]% quit
root@dgxc-bcm-prod-002-cluster1:~#

4.3. Accessing Login Nodes as Root (Cluster Owner Only)

  1. To access login nodes as root, first log in to the head node as root, following the steps in the previous sections.

  2. Once you’ve logged in to the head node as root, you can then ssh to the login nodes from the head node using the following command (from inside of the head node session):

    ssh slogin001
    

    Note

    slogin002 is also available.

4.4. Accessing Login Nodes as Cluster Admins

Non-root users are not permitted to SSH to the head node; they can only SSH to the cluster’s Slurm login nodes.

To SSH to a Slurm login node, ensure that a user account has been created for you and that the matching SSH key pair is on your system, with the public key registered to that account.

4.4.1. Logging In with an SSH Key

Cluster admins will have SSH access to the login nodes. They can also access Base View via the login nodes.

  1. Obtain the login node addresses from your cluster owner.

  2. Log in with the user account(s) created by the cluster owner:

ssh -i /path/to/ssh_cert <cluster-admin>@ip-addr-of-login-node

Note

If you encounter any issues while trying SSH, refer to SSH Key Permissions for assistance.

4.5. Accessing Base View

Base View is a browser-based GUI that provides a dashboard view of the cluster. This web application is hosted on the head node and provides a uniform view of the cluster’s status, health, and performance.

Base View is designed for use by cluster admins and is compatible with the latest versions of all major desktop web browsers. It is currently not supported on mobile devices.

4.5.1. Logging In

To log in to Base View, you will need a password, which your cluster owner should have provided when onboarding your user. If you do not remember your password, you must work with your cluster owner to reset it.

  1. Assuming that your user can already SSH to a login node, create an SSH tunnel back to your local desktop host at port 8081 using the following command:

    ssh -i /path/to/ssh_cert <cluster-admin>@ip-addr-of-login-node -L8081:master:8081
    

    The hostname or IP address of the head node should have been provided as part of the cluster prerequisite information. Once the correct URL has been entered in your browser, a login screen will appear.

  2. Navigate to https://localhost:8081/base-view/ from your web browser.

  3. In the browser, enter your admin username and password to sign in to Base View.

    Note

    The first time a browser is used to log in to the cluster portal, a warning about the site certificate being untrusted appears in a default cluster configuration. This can safely be accepted.

    Base View Login Page

Once logged in, you will see an overview of the cluster, including graphs for occupation rate, memory used, CPU cycles used, node statuses, GPU usage, and other cluster details.

Base View Overview Page

Base View provides additional information for cluster admins through several tabs, described below.

  • Settings: Controls various cluster settings such as cluster name and node name.

  • License Information: Shows various license information for the cluster.

  • System Information: Shows the main hardware specifications of the node (CPU, memory, BIOS), along with the OS version that it runs.

  • Version Information: Shows version information for important cluster software components, such as the CMDaemon database version, and the cluster manager version and builds.

  • Run Command: Allows a specified command to be run on a selected node of the cluster.

4.6. Accessing NGC

As a part of the DGX Cloud subscription, your organization has received access to NVIDIA NGC, with Private Registry and NVIDIA AI Enterprise subscriptions enabled.

For more information on setting up your NGC org, please see the NGC User Guide.

Note

The cluster owner is the org owner on NGC. The cluster owner can then also invite and set up admins to access and manage users within the NGC org.

5. Administering Your DGX Cloud Cluster

5.1. Managing Users and Admins

5.1.1. Adding Cluster Admins Via cmsh (Cluster Owner Only)

As a cluster owner, you can add admins to help manage the Slurm cluster using the following steps.

  1. Compile a list of cluster admins: Make a list of people who will require admin access to the cluster.

  2. Create SSH key pairs: Ask each cluster admin to create an SSH key pair for themselves using the following command:

    ssh-keygen -t rsa -b 4096 -f ~/.ssh/<cluster-admin>-rsa-dgxc -C "cluster_admin_email@example.com"
    
  3. Obtain public keys: Once each admin has an SSH key pair, have each send you the contents of their public key (<cluster-admin>-rsa-dgxc.pub) file generated in their ~/.ssh/ directory. You will use this information in the following steps to create the cluster admin user.

  4. Create a cluster admin user: From the head node as root, run the following commands to create a cluster admin.

    1. Enter cmsh with this command:

      cmsh
      
    2. Run the following commands within cmsh:

      user
      add <cluster-admin>
      set password
      set profile tenantadmin
      commit

      group
      use tenantadmin
      append members <cluster-admin>
      commit
      quit
      
    3. Switch to the user’s account:

      sudo su - <cluster-admin>
      
    4. Add their SSH public key (obtained during Step 3 above) to the authorized_keys file using a text editor of your choice. For example,

      nano $HOME/.ssh/authorized_keys
      
    5. (Optional) If the cluster admin will be running Slurm jobs, you can configure their user account to automatically load the slurm module upon login. This avoids the user having to run the module load slurm command on every login. Run the following command to do so:

      module initadd slurm
      
    6. Exit the user’s account:

      exit
      
    7. Run the following commands to add the user as a Slurm admin:

      module load slurm
      sacctmgr add User User=<cluster-admin> Account=root AdminLevel=Administrator
      

      Commit the changes when prompted.

  5. (Optional) Create a shared scratch space on LustreFS for the admin user: If the cluster admin will be running Slurm jobs, you can configure their user to have a scratch space on the Lustre shared filesystem, or they can configure this themselves if needed. Follow the steps in Creating and Configuring Lustre Shared Storage for Cluster Admins and Users to do so.

  6. Send the information to the admin: The admin is now set up to access the cluster and start working. Send the following information to the admin:

    • Login node addresses

    • Their username and password information

    • Which SSH public key you configured for their user

    • (Optional) Their LustreFS scratch directory information

    Each cluster admin should now be able to log on to the login nodes using the following command:

    ssh -i /path/to/cluster_admin_ssh_cert <cluster-admin>@ip-addr-of-login-node
    
  7. Add additional users: Repeat the steps in Adding Cluster Admins Via cmsh (Cluster Owner Only) and Creating and Configuring Lustre Shared Storage for Cluster Admins and Users for each cluster admin user you want to create. Then the cluster admins can follow the steps in the next section to create cluster users.
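
If you have many admins to onboard, the per-user cmsh steps above can be scripted. The following is a rough sketch only, run as root on the head node; the admins.csv file and its username,public-key-path layout are assumptions, and Base View passwords still need to be set per user afterwards (for example, with passwd or cmsh set password).

#!/bin/bash
# Sketch: batch-create cluster admins from a CSV of "username,path-to-public-key".
module load slurm
while IFS=, read -r username pubkey; do
    # Create the user with the tenantadmin profile and add them to the tenantadmin group
    cmsh -c "user; add ${username}; set profile tenantadmin; commit; group; use tenantadmin; append members ${username}; commit"
    # Install the admin's SSH public key (assumes BCM has created /home/${username})
    mkdir -p "/home/${username}/.ssh"
    cp "${pubkey}" "/home/${username}/.ssh/authorized_keys"
    chmod 700 "/home/${username}/.ssh"
    chmod 600 "/home/${username}/.ssh/authorized_keys"
    chown -R "${username}:${username}" "/home/${username}/.ssh"
    # Register the admin in Slurm's accounting database (-i commits without prompting)
    sacctmgr -i add User User="${username}" Account=root AdminLevel=Administrator
done < admins.csv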

5.1.2. Adding Cluster Users Via cmsh (Cluster Owner Only)

As a cluster owner, you’ll need to gather some information to onboard users to the cluster.

  1. Compile a list of cluster users: Start by compiling a list of users who need access to the cluster.

  2. Create SSH key pairs: Each user will need to create an SSH key pair for themselves with the following command:

    ssh-keygen -t rsa -b 4096 -f ~/.ssh/<cluster-user>-rsa-dgxc -C "your_email@example.com"
    
  3. Obtain public keys: Once each user has created an SSH key pair, have them send you the contents of their public key (<cluster-user>-rsa-dgxc.pub) file located in their ~/.ssh/ directory. You will use this in the following steps to create the cluster user.

  4. Create a cluster user: From the head node as root, run the following commands to create a cluster user.

    1. Enter cmsh with this command:

      cmsh
      
    2. Within cmsh, run the following commands to create a cluster user:

      user
      add <cluster-user>
      set password
      set profile portal
      commit
      quit
      
    3. Switch to the user’s account using the following command:

      sudo su - <cluster-user>
      
    4. Add the user’s SSH public key (obtained in Step 3 above) into the authorized_keys file in their ~/.ssh/ directory, using the text editor of your choice. For example,

      nano $HOME/.ssh/authorized_keys
      
    5. Configure their user account to automatically load the slurm module upon login. This avoids the user having to run the module load slurm command on every login. Run the following command to do so:

      module initadd slurm
      
    6. Exit the user’s account by running the following command:

      exit
      
  5. Create a shared scratch space on LustreFS for the user: Next, create a LustreFS directory for the user.

    Follow the steps in Creating and Configuring Lustre Shared Storage for Cluster Admins and Users to create and configure shared storage for the user.

  6. Send the information to the user: The user is now set up to access the cluster and start working. Send the following information to the user:

    • Login node addresses

    • Their username and password information

    • Which SSH public key you configured for their user

    • Their LustreFS scratch directory information

    Each user should now be able to log on to the login nodes using the following command:

    ssh -i /path/to/cluster_user_ssh_cert <cluster-user>@ip-addr-of-login-node
    
  7. Add additional users: Repeat the steps in Adding Cluster Users Via cmsh (Cluster Owner Only) and Creating and Configuring Lustre Shared Storage for Cluster Admins and Users for each cluster user you want to create.

  8. (Optional) Create a list of cluster teams or Slurm accounts: Refer to Managing the Slurm Cluster for more information.

5.1.3. (Optional) Elevating a User to a Cluster Admin (Cluster Owner Only)

Important

The following steps should only be performed if the user needs cluster admin capabilities.

  1. From the head node as root, enter cmsh with this command:

    cmsh
    
  2. Run the following commands within cmsh:

    user
    use <target-cluster-admin>
    set profile tenantadmin
    commit

    group
    use tenantadmin
    append members <target-cluster-admin>
    commit
    quit
    
  3. Run the following commands to add the cluster admin as a Slurm admin:

    module load slurm
    sacctmgr add User User=<target-cluster-admin> Account=root AdminLevel=Administrator
    

    Commit the changes when prompted.

5.1.4. Enable an Alternative Cluster Owner Account (Cluster Owner Only)

To enable an alternative non-root cluster owner account with sudo privileges, follow the steps below.

This new user (which we will call dgxc-admin in this section) can then log in to the head node instead of the root user, and will have sudo permissions and the ability to elevate to root.

  1. First, follow the onboarding procedures defined in Adding Cluster Admins Via cmsh (Cluster Owner Only) to create the new user.

  2. Once the user has been created, in the head node’s /etc/ssh/sshd_config file, append the new user to the AllowUsers line, and save the file. An example of what that line should look like is found below.

    AllowUsers root dgxc-admin
    
  3. Next, add the user to the sudo group on the head node.

    usermod -a -G sudo dgxc-admin
    
  4. Restart the ssh service.

    service ssh restart
    
  5. Verify the update was successful with a test ssh to the head node.

    ssh -i /path/to/ssh_cert_for_dgxc_admin dgxc-admin@ip-addr-of-head-node
    

    Ensure that cmsh can be used with the account.

    dgxc-admin@dgxc-bcm-prod-002-cluster1:~$ sudo su -
    [sudo] password for dgxc-admin:

    Node Status: running in active master mode

    -------------------------------------------------------------------------------
    Cluster Manager license expiration date: 03/Jul/2024, time remaining: 8w 2d
    root@dgxc-bcm-prod-002-cluster1:~# cmsh
    [dgxc-bcm-prod-002-cluster1]%
    

Further changes can be made to the /etc/ssh/sshd_config file to restrict SSH access for the root user entirely. Align any further changes with your organization’s requirements.
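
For example, once the alternative account has been verified, root SSH logins can be disabled outright. This is a sketch only; confirm that dgxc-admin can log in and elevate to root before applying it, and restart the ssh service afterwards.

# /etc/ssh/sshd_config
PermitRootLogin no
AllowUsers dgxc-admin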

5.1.5. Adding Users Via Base View

5.1.5.1. Prerequisites

Before creating users in Base View, you’ll need to gather some information to onboard users to the cluster.

  1. Compile a list of cluster users: Start by compiling a list of users who need access to the cluster.

  2. Create SSH key pairs: Each user will need to create an SSH key pair for themselves with the following command:

    ssh-keygen -t rsa -b 4096 -f ~/.ssh/<cluster-user>-rsa-dgxc -C "your_email@example.com"
    
  3. Obtain public keys: Once each user has created an SSH key pair, have them send you the contents of their public key (<cluster-user>-rsa-dgxc.pub) file located in their ~/.ssh/ directory. You will use this in the following steps to create the cluster user.

5.1.5.2. Creating Users in Base View

  1. To create users in Base View, first log in with your cluster admin user account, following the instructions in Accessing Base View.

  2. Once logged in to Base View, click the Identity Management section in the navigation bar and choose the Users option.

    Base View Users Menu
  3. Click the + ADD button at the bottom of the page, then select the User option that appears.

    Base View Add Button
  4. Fill in the Name field along with both the Password and Confirm password fields for the user you want to create; do not alter any other values.

    Base View User Creation Form
  5. Click the Save button at the bottom right of the form. A successful user creation will be indicated with a green Operation Successful banner at the top of the page.

    Base View Operation Successful Banner

5.1.5.3. Configuring SSH Access for Users After Account Creation with Base View

  1. Once the user account is created, you’ll need to add their SSH key for access. From a new terminal, SSH to a login node:

    ssh -i /path/to/ssh_cert <cluster-admin>@ip-addr-of-login-node
    
  2. Switch to the user’s account:

    sudo su - <cluster-admin-or-user>
    
  3. Add their SSH public key (obtained during the Prerequisites section above) to the authorized_keys file using a text editor of your choice. For example:

    nano $HOME/.ssh/authorized_keys
    
  4. Configure their user account to automatically load the slurm module upon login. This avoids the user having to run the module load slurm command on every login. Run the following command to do so:

    module initadd slurm
    
  5. Exit the user’s account:

    exit
    
  6. Proceed to the Creating and Configuring Lustre Shared Storage for Cluster Admins and Users section to create a scratch directory for the user if necessary.

  7. To complete cluster admin onboarding, have the cluster owner proceed to the (Optional) Elevating a User to a Cluster Admin (Cluster Owner Only) section.

  8. The user is now set up to access the cluster and start working. Send the following information to the user:

    • Login node addresses

    • Their username and password information

    • Which SSH public key you configured for their user

    • Their LustreFS scratch directory information

    Each user should now be able to log on to the login nodes using the following command:

    ssh -i /path/to/cluster_user_ssh_cert <cluster-admin-or-user>@ip-addr-of-login-node
    

5.1.6. Creating and Configuring Lustre Shared Storage for Cluster Admins and Users

Note

If you are executing these steps as a cluster admin, you may need to run these commands with sudo.

  1. From the login node, run the following commands to create the user scratch space on the Lustre filesystem:

    mkdir -p /lustre/fs0/scratch/<cluster-user>
    chown <cluster-user>:<cluster-user> /lustre/fs0/scratch/<cluster-user>
    
  2. (Optional) You can then assign a quota to the user if necessary. To view and set quota, see the Managing Lustre Storage section below.

    Note

    If no quota is set, the user has unlimited storage access to the whole Lustre filesystem and, if not careful, can consume the entire filesystem.

  3. The user is now set up with shared storage to use in the cluster.
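
If many users need scratch directories, the steps above can be wrapped in a small helper script. This is a sketch only; the script name and the 100G/200G quota values are illustrative, not defaults.

#!/bin/bash
# Sketch: create a Lustre scratch directory for a user and apply a capacity quota.
# Usage: ./make-scratch.sh <cluster-user>   (run with root privileges on a login node)
user="$1"
scratch="/lustre/fs0/scratch/${user}"
mkdir -p "${scratch}"
chown "${user}:${user}" "${scratch}"
# Soft limit of 100G and hard limit of 200G on capacity; adjust to your policy
lfs setquota -u "${user}" -b 100G -B 200G "${scratch}"
lfs quota -u "${user}" "${scratch}"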

5.1.7. Updating Password

Password-based logins are disabled for SSH, but a password is required to access Base View.

A default account password should have been created for each onboarded user or admin.

To change your password after initial login, use the passwd command.

passwd

If you do not remember your password, please work with the cluster owner to reset your user account’s password.

5.1.7.1. Resetting User Passwords as Root (Cluster Owner Only)

To reset another user’s password, log in to the head node as root. Once logged in, run the following command to reset the user password:

passwd <cluster-user>

Or to reset the password via cmsh, run the following commands as root:

cmsh
user
use <cluster-user>
set password
commit

5.2. Managing Home Directories

A user’s home directory is their default file path upon logging into a DGX Cloud Slurm cluster. This is of the form /home/demo-user.

In a DGX Cloud Slurm cluster, user home directories reside on a Network File System (NFS) share. This allows all nodes in the cluster to access the same data concurrently when a user is running a job.

This design choice makes the following behaviors possible:

  • Writing a script once and running it on multiple nodes simultaneously

  • Cloning code from a repository once and making it available on the entire cluster

  • Using a shared directory to log job output from multiple producers

This allows users to run their code on whatever systems are available, since data availability is not dependent on any one system.

User data is isolated via filesystem permissions (user-a can only read and write data to /home/user-a, not /home/user-b). However, users can impact each other if the home filesystem is overtaxed.

Some best practices and limitations to keep in mind:

  • Home directory storage is a great fit for small amounts of data, particularly code, scripts, and configuration files.

  • Home directory storage can also be an acceptable fit for log files, but alternative storage should be considered for log files that will grow large or are frequently written to.

  • Home directory storage has a quota configured for each user’s home directory; the default per-user quota is 10 GB unless you have specified a different quota in collaboration with your NVIDIA TAM. Administrators can monitor home directory capacity if desired and occasionally encourage users to remove unneeded data or migrate that data elsewhere.

A parallel filesystem is provided for datasets and more intensive storage I/O use cases; see Managing Lustre Storage below for details.

5.2.1. Checking Utilization

To check the utilization of all home directories relative to total available capacity, use the df -h command from either the head node or a login node. An example of usage and output follows.

df -h

Filesystem                Size  Used Avail Use% Mounted on
tmpfs                      13G  1.6M   13G   1% /run
/dev/sdb3                 500G   56G  445G  12% /
tmpfs                      63G   28K   63G   1% /dev/shm
tmpfs                     5.0M     0  5.0M   0% /run/lock
/dev/sda1                 251G   32K  239G   1% /mnt/resource
10.14.1.4:/bcm-home       1.0T  308M  1.0T   1% /home
10.14.1.4:/bcm-cm-shared  1.0T  2.5G 1022G   1% /cm/shared

In the example above, the /home path (identified by the text in the Mounted on column) shows that 308 MB of the 1 TB of available capacity is used.

To get a breakdown of capacity used by each user in their home directory, the du -sh command can be used. An example usage and output follows.

du -sh /home/*

92K     /home/charlie
40K     /home/cmsupport
196K    /home/demo-admin
288M    /home/demo-user
52K     /home/alice
48K     /home/bob

When aggregate home directory utilization across all users approaches maximum capacity, 80-90% utilization is a common range at which to begin proactively removing or migrating data. The du command is a quick way to identify heavy NFS storage users.
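
To surface the heaviest consumers directly, the per-user output can be sorted. The one-liner below assumes GNU sort (for the human-readable -h option):

du -sh /home/* | sort -h | tail -n 5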

5.3. Managing Lustre Storage

The parallel filesystem made available as part of DGX Cloud clusters is Lustre. In DGX Cloud clusters, the Lustre filesystem is mounted at /lustre/fs0. This can be seen by using the df command and looking for the /lustre/fs0 path in the Mounted on column.

df -h

Filesystem                Size  Used Avail Use% Mounted on
tmpfs                      63G  1.3M   63G   1% /run
/dev/sda1                 199G   14G  186G   7% /
tmpfs                      63G     0   63G   0% /dev/shm
tmpfs                     5.0M     0  5.0M   0% /run/lock
tmpfs                     4.0M     0  4.0M   0% /sys/fs/cgroup
/dev/sdb3                 4.4G  4.0G  404M  91% /cm/node-installer-ebs
10.14.1.4:/bcm-cm-shared  1.0T  2.5G 1022G   1% /cm/shared
10.14.1.4:/bcm-home       1.0T  308M  1.0T   1% /home
10.14.2.9@tcp:/lustrefs    79T   87G   75T   1% /lustre/fs0
tmpfs                      13G  4.0K   13G   1% /run/user/0

5.3.1. Checking Utilization

To check the utilization of the Lustre filesystem relative to its total available capacity, use the df -h command from a login node. An example usage and output follows.

df -h

Filesystem                Size  Used Avail Use% Mounted on
tmpfs                      63G  1.3M   63G   1% /run
/dev/sda1                 199G   14G  186G   7% /
tmpfs                      63G     0   63G   0% /dev/shm
tmpfs                     5.0M     0  5.0M   0% /run/lock
tmpfs                     4.0M     0  4.0M   0% /sys/fs/cgroup
/dev/sdb3                 4.4G  4.0G  404M  91% /cm/node-installer-ebs
10.14.1.4:/bcm-cm-shared  1.0T  2.5G 1022G   1% /cm/shared
10.14.1.4:/bcm-home       1.0T  308M  1.0T   1% /home
10.14.2.9@tcp:/lustrefs    79T   87G   75T   1% /lustre/fs0
tmpfs                      13G  4.0K   13G   1% /run/user/0

In the example above, the /lustre/fs0 path (identified by the text in the Mounted on column, as noted above) shows that 87 GB of the 79 TB of available capacity is used.

To get a breakdown of capacity used by directory, the du -sh command can be used. Example usage and output follows.

du -sh /lustre/fs0/*

4.0K    /lustre/fs0/admin-dir
87G     /lustre/fs0/scratch

The scratch directory can be further investigated by targeting that path, given that the admin-dir in the above case uses minimal capacity.

du -sh /lustre/fs0/scratch/*

4.0K    /lustre/fs0/scratch/charlie
87G     /lustre/fs0/scratch/demo-user
4.0K    /lustre/fs0/scratch/alice
4.0K    /lustre/fs0/scratch/bob

When Lustre capacity utilization approaches maximum capacity, 80-90% utilization is a common range at which to begin proactively removing or migrating data. The du command is a quick way to identify heavy Lustre storage users.
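
Because du can be slow on a large Lustre filesystem, per-user usage can also be read from the quota subsystem. A sketch, assuming scratch directories are named after their owning users:

for d in /lustre/fs0/scratch/*; do
    u=$(basename "$d")
    echo "== ${u} =="
    lfs quota -u "$u" /lustre/fs0
done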

5.3.2. Viewing Quota

To view the quota for a directory for a specific user (in this example, demo-user), the following lfs quota command can be used.

lfs quota -u demo-user -v /lustre/fs0/scratch/demo-user

5.3.3. Setting Quota

As an administrator, to set a user’s quota on their scratch directory, use the lfs setquota command. The following example sets quotas for a user named demo-user on their dedicated scratch directory; -b and -B set the soft and hard block (capacity) limits, and -i and -I set the soft and hard inode (file count) limits.

lfs setquota -u demo-user -b 100G -B 100G -i 10000 -I 11000 /lustre/fs0/scratch/demo-user

When either the admin or the user checks the quota for the directory, the lfs quota command and resulting output are as follows.

lfs quota -u demo-user -v /lustre/fs0/scratch/demo-user

Disk quotas for usr demo-user (uid 1004):
    Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
/lustre/fs0/scratch/demo-user/
                    4  104857600 104857600       -       1   10000   11000       -
lustrefs-MDT0000_UUID
                    4*      -       4       -       1*      -       1       -
lustrefs-OST0000_UUID
                    0       -       0       -       -       -       -       -
lustrefs-OST0001_UUID
                    0       -       0       -       -       -       -       -
lustrefs-OST0002_UUID
                    0       -       0       -       -       -       -       -
lustrefs-OST0003_UUID
                    0       -       0       -       -       -       -       -
lustrefs-OST0004_UUID
                    0       -       0       -       -       -       -       -

For more details, refer to Quota Administration.

5.4. Managing the Slurm Cluster

5.4.1. Configuring Fair Share Scheduling

If needed, cluster admins can organize cluster users into teams (called accounts in Slurm), with fair share scheduling across the various accounts.

For example:

sacctmgr create account name=science fairshare=50
sacctmgr create account name=chemistry parent=science fairshare=30
sacctmgr create account name=physics parent=science fairshare=20
sacctmgr create user name=User1 cluster=slurm \
account=physics fairshare=10
sacctmgr list associations cluster=slurm \
format=Account,Cluster,User,Fairshare

# set max jobs
sacctmgr modify user where name=User1 cluster=slurm \
account=physics set maxjobs=2

Refer to SLURM Fair Sharing for more information.
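
To verify how the configured shares and accumulated usage translate into scheduling priority, the sshare command can be used. The field selection below is one possible choice:

module load slurm
sshare -a -o Account,User,RawShares,NormShares,RawUsage,FairShare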

5.4.2. Viewing Node Information

To view all nodes in the cluster and their current state, ssh to the Slurm login node for your cluster and run the sinfo command:

sinfo

PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
defq*      up     infinite   64     idle   gpu[001-064]

There are 64 nodes in this example, all in the idle state. When a node is allocated to a job, its state changes from idle to alloc:

sinfo

PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
defq*      up     infinite   1      alloc  gpu001
defq*      up     infinite   63     idle   gpu[002-064]

Additional states include drain, indicating a node is not accepting new jobs; mix, indicating a node is partially in use; and maint, when a node is down for maintenance.
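
To list only nodes in a problem state, together with the recorded reason, sinfo can be filtered by state. The format string below is one possible choice; sinfo -R similarly lists the reasons recorded for down and drained nodes.

sinfo -N -t drain,down,maint -o "%N %T %E"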

To get additional information about a specific node, use scontrol as follows.

scontrol show node gpu007

NodeName=gpu007 Arch=x86_64 CoresPerSocket=48
CPUAlloc=0 CPUEfctv=96 CPUTot=96 CPULoad=0.24
AvailableFeatures=location=southcentralus
ActiveFeatures=location=southcentralus
Gres=gpu:a100:8(S:0-1)
NodeAddr=gpu007 NodeHostName=gpu007 Version=23.02.7
OS=Linux 5.15.0-1056-azure #64-Ubuntu SMP Tue Feb 6 19:23:34 UTC 2024
RealMemory=1814258 AllocMem=0 FreeMem=1792041 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=defq
BootTime=2024-04-10T14:58:04 SlurmdStartTime=2024-04-15T14:37:44
LastBusyTime=2024-04-15T14:37:45 ResumeAfterTime=None
CfgTRES=cpu=96,mem=1814258M,billing=96,gres/gpu=8
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

5.4.3. Draining Nodes

If a node needs to be taken down for maintenance or is otherwise unhealthy, it can be put in the drain state. This lets Slurm know that the node should not be selected for jobs until it returns to the idle state. If a node is currently running jobs, it enters a draining state, in which it continues running its current jobs to completion. Once those jobs finish, the node automatically becomes drained, so no new jobs can be run on it.

To drain a node, run the following command, replacing the node names with the nodes to drain and providing a reason for draining the node, such as maintenance:

scontrol update NodeName=<Enter name of node(s) to put in DRAIN state> State=Drain Reason="Specify reason for draining node here"

To confirm the node has been drained, run sinfo:

sinfo

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
default   up    4:00:00   1     drain node001

5.4.3.1. Restarting Failed Nodes

If a node is suspected to be unhealthy, it can be disabled for use by jobs using the previous steps.

If, in coordination with NVEX support, it is determined that the node is unhealthy and must be reprovisioned, the root administrator can initiate that process via cmsh as shown.

cmsh -c "device; terminate -n <name(s) of node(s) placed in DRAIN state>"

When the cmsh device submenu indicates the Status field of the node(s) as [DOWN], terminated, the following can be run to re-provision those nodes.

cmsh -c "device; power on -n <name(s) of node(s) placed in DRAIN state>"

When the cmsh device submenu indicates the Status field of the node(s) as [UP], an administrator can use scontrol again to remove the node(s) from the drain state.

scontrol update NodeName=<Enter name of node(s) to remove from DRAIN state> State=Idle

If there are continued issues with the node, reach out to NVEX support for next steps.

5.5. Managing Users on NGC

Note

The cluster owner is the org owner on NGC. The cluster owner can then also invite and set up admins to access and manage users within the NGC org.

To invite users to the NGC org, follow the steps in the NGC User Guide.

6. Troubleshooting

6.1. SSH Key Permissions

6.1.1. Unprotected Private Key File

You may see this error when trying to SSH to the head node using WSL on Windows.

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0777 for '<path-to-ssh-key>' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "<path-to-ssh-key>": bad permissions
<cluster-user>@slogin001: Permission denied (publickey,gssapi-with-mic).

To fix this, you need to update your WSL configuration so that you can own and change the file permissions of the SSH private key.

  1. Create an /etc/wsl.conf file (as sudo) with the following contents:

    [automount]
    options="metadata"
    
  2. Exit WSL.

  3. Terminate the instance (wsl --terminate <distroname>) or shut it down (wsl --shutdown).

  4. Restart WSL.

Then, from WSL, run the following command to change the permissions of the private key:

user@local:$HOME/.ssh$ chmod 600 <ssh-key-file>

Then, check the permissions:

user@local:$HOME/.ssh$ ls -l <ssh-key-file>

It should look like:

-rw------- 1 user local 2610 Apr  2 19:19 <ssh-key-file>

6.2. Base View Permissions

In general, cluster users should use the User Portal rather than Base View as a web UI for the cluster. If a cluster user with insufficient permissions tries to log in to Base View, they will see an error similar to the following.

Base View Permissions Error

Base View is primarily intended for cluster admins.

6.3. Issues With Jobs

In a DGX Cloud Slurm cluster, job-related errors typically fall into two categories:

  • Issues that the user caused or that are within the user’s control.

  • Issues stemming from factors beyond the user’s control, such as system health problems.

Distinguishing between these requires familiarity with Slurm, GPU infrastructure, and the particular software in use.

6.3.1. Basic Questions

To determine whether the environment is functioning as expected, perform a basic inspection of the job queue and hardware state as a sanity check. Consider the following:

  • Are jobs completing across the cluster at all, or are issues limited to jobs interacting with specific nodes in the cluster?

  • Are problems isolated to specific applications or datasets?

  • Do basic tools like nvidia-smi and ibdev2netdev report the expected resources? (See the example after this list.)

  • At the administrator level, do lower-level tools like dmesg indicate any errors related to the symptoms?
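
For example, a quick way to confirm that a set of suspect nodes reports the expected GPUs and InfiniBand devices is to run those tools through Slurm. The node list and GPU count below are illustrative only:

srun -w gpu[001-004] -N 4 --gres=gpu:8 bash -c 'hostname; nvidia-smi -L; ibdev2netdev'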

If any significant discrepancies from expected behavior are identified during this debugging process, there are several ways to proceed.

6.3.2. Job and Application Issues

If a job is causing problems, whether due to the user or the application itself, administrators can often resolve the issue by using scontrol to manage that job.

To use scontrol to view a specific job, first run squeue to identify the JOBID value:

squeue -u demo-user

            JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            235      defq bad-job  demo-use  R      49:40      4 gpu[002-005]

We can then take the JOBID value of 235 and use it as an argument to scontrol.

scontrol show job 235

JobId=235 JobName=bad-job
UserId=demo-user(1004) GroupId=demo-user(1005) MCS_label=N/A
Priority=4294901758 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:52:44 TimeLimit=02:00:00 TimeMin=N/A
SubmitTime=2024-04-15T14:54:08 EligibleTime=2024-04-15T14:54:08
AccrueTime=2024-04-15T14:54:08
StartTime=2024-04-15T14:54:09 EndTime=2024-04-15T16:54:09 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-04-15T14:54:09 Scheduler=Main
Partition=defq AllocNode:Sid=slogin001:709779
ReqNodeList=(null) ExcNodeList=(null)
NodeList=gpu[002-005]
BatchHost=gpu002
NumNodes=4 NumCPUs=384 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=32,mem=515232M,node=4,billing=32,gres/gpu=32
AllocTRES=cpu=384,node=4,billing=384,gres/gpu=32
Socks/Node=* NtasksPerN:B:S:C=8:0:*:* CoreSpec=*
MinCPUsNode=8 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/lustre/fs0/scratch/demo-user/nemo-megatron-launcher/launcher_scripts/results/gpt3_5b/nemo-megatron-gpt3_5b_submission.sh
WorkDir=/lustre/fs0/scratch/demo-user/nemo-megatron-launcher/launcher_scripts
StdErr=/lustre/fs0/scratch/demo-user/nemo-megatron-launcher/launcher_scripts/results/gpt3_5b/log-nemo-megatron-gpt3_5b_235.err
StdIn=/dev/null
StdOut=/lustre/fs0/scratch/demo-user/nemo-megatron-launcher/launcher_scripts/results/gpt3_5b/log-nemo-megatron-gpt3_5b_235.out
Power=
TresPerNode=gres:gpu:8

This output provides information on where to find the log files and how a job was initiated. If a job’s time limit needs to be changed, use the following scontrol command to set it.

scontrol update JobId=235 TimeLimit=01:00:00

Or if the job needs to be canceled, use the scancel command.

scancel 235

6.3.3. Stability or Health Issues

Repeated attempts to run a job that fail with errors can be frustrating. For example, NCCL could emit errors like the following:

gpu001:7127:17776 [2] socket.c:540 NCCL WARN Net : Connection closed by remote peer dgxc-bcm-prod-002-cluster1.cm.cluster<59366>

A recommended first step is to enable additional debug output. For NCCL-related problems, setting the following environment variable for a new run of the job experiencing the error will provide additional information. It is advisable not to keep it enabled constantly, as it generates extensive logging output.

export NCCL_DEBUG=INFO
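
In a batch job, the variable can be set in the submission script so it is inherited by every rank. A minimal sketch; the resource requests and workload script are placeholders:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --time=01:00:00
export NCCL_DEBUG=INFO             # verbose NCCL logging for this run only
export NCCL_DEBUG_SUBSYS=INIT,NET  # optional: limit output to the init and network subsystems
srun ./my_training_step.sh         # placeholder for the actual workload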

An appropriate next step, assuming the information provided continues to indicate a problem with the node, is to completely remove the potentially unhealthy node (gpu001, in this case) from use while it is being triaged. As noted in previous sections, using scontrol to drain the node accomplishes this goal.

scontrol update NodeName=gpu001 State=Drain Reason="NCCL errors"

If similar or identical jobs now pass without issues, then there is likely a problem with the node that you have placed into a drain state. Contact your NVIDIA TAM to get customer support involved in triaging the system and providing a replacement.

6.3.4. Performance Issues

Diagnosing performance issues often involves combining the debugging methods discussed in the stability and application-specific sections above. In addition to these methods, verifying the software state of both the system and the application is essential.

For example, have there been any recent updates to the codebase or container? Has the operating system software on the cluster been updated?

The interaction between hardware, systems, and software can evolve, leading to unforeseen consequences. Experimenting with job configurations that have worked previously is a good first step in assessing performance issues.

If you are confident that there is a new problem that is not an application bug, please contact your NVIDIA TAM to open a support ticket.

7. Requesting Modifications To Your DGX Cloud Cluster

While most administrative tasks are available to you, some tasks require assistance from NVIDIA.

These include actions such as expanding and adding additional nodes to the cluster, increasing shared storage capacity on NFS or Lustre, and changing quota for user home directories.

For assistance with these tasks, please work with your TAM to ensure the necessary modifications are made to your cluster.