2. Cluster Onboarding Guide#

2.1. Introduction#

Congratulations on your new DGX Cloud Create cluster!

The following sections guide you through the onboarding process and provide an overview of DGX Cloud Create’s features. For detailed information, refer to the DGX Cloud Create documentation.

DGX Cloud Create includes access to the following:

  • NVIDIA DGX Cloud Create Cluster

    • NVIDIA Run:ai installed on the cluster

  • NVIDIA AI Enterprise

  • NVIDIA GPU Cloud (NGC) Private Registry

  • NVIDIA Enterprise Support Portal

Your Technical Account Manager (TAM) will schedule an onboarding session to review these components, including how to access and use them. Use this guide as a reference during that session.

During your onboarding session, you will perform the following tasks:

  1. Access the DGX Cloud Create Cluster

  2. Set up the NVIDIA Run:ai and Kubernetes CLI

  3. Create a Department in NVIDIA Run:ai

  4. Create a Project in NVIDIA Run:ai

  5. Assign User Roles in NVIDIA Run:ai

  6. Create Environments

  7. Create Data Sources

  8. Run a Single Node NCCL Test

  9. Run a Multi Node NCCL Test

  10. (Optional) Run an Interactive Workload

  11. (Optional) Download Data into the Data Source

2.2. Admin Steps#

Before onboarding, a cluster admin will be designated to manage the resources provided in DGX Cloud Create. The cluster admin has the following roles:

  • Application Administrator for the cluster and NVIDIA Run:ai

  • Org Owner for NGC (used for Private Registry and AI Enterprise)

The cluster admin and TAM complete these steps before onboarding:

  1. Provide Classless Inter-Domain Routing (CIDR) ranges for your company networks to NVIDIA so that those networks can reach the cluster (see the example after this list)

  2. Configure single sign-on (SSO) to the cluster and NVIDIA Run:ai

  3. (Optional) Configure private access connectivity to the cluster and NVIDIA Run:ai as described in (Optional) Private Access

  4. Set up their NGC Org for access to NGC Private Registry and AI Enterprise

  5. Register for the NVIDIA Enterprise Support Portal
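
CIDR ranges are written as a base address plus a prefix length. For example, 203.0.113.0/24 (a documentation-reserved range used here purely as an illustration) covers the 256 addresses 203.0.113.0 through 203.0.113.255; provide the equivalent ranges for your corporate networks, VPN egress points, and any other networks that need cluster access.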

2.2.1. Overview of Personas in DGX Cloud Create#

Interacting with DGX Cloud Create generally involves two personas:

  • Admin Persona: Manages users, resource access, and cluster utilization.

  • User Persona: Includes researchers, ML engineers, and others leveraging DGX Cloud Create components for their work.

For more information on user roles, refer to Cluster User Scopes and User Roles.

The first half of the onboarding session focuses on admin-level tasks that prepare the cluster and resources for use; the second half walks through basic user examples to get you started running jobs on the cluster.

2.2.2. Accessing the DGX Cloud Create Cluster#

During the cluster setup, NVIDIA works alongside your assigned cluster administrator to configure SSO using your OIDC provider. This setup will allow your organization to use standard credentials to sign in to the cluster.

During onboarding, your TAM will provide the following information to access the cluster:

  • A URL to access the NVIDIA Run:ai cloud control plane

  • A kubeconfig file containing the cluster DNS/URL

To log in, navigate to the NVIDIA Run:ai URL, enter your email, and select Continue with SSO.

We will use the kubeconfig file in the Setting up the NVIDIA Run:ai and Kubernetes CLI section.

2.2.3. Setting up the NVIDIA Run:ai and Kubernetes CLI#

Your DGX Cloud Create cluster includes access to the NVIDIA Run:ai CLI, as well as to the Kubernetes cluster and its CLI (kubectl).

Your TAM will provide a kubeconfig file for this purpose.

Use that kubeconfig and follow the Accessing the NVIDIA Run:ai CLI instructions to set up your CLI.
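
Once the CLI is set up, a quick sanity check from your workstation can confirm that the kubeconfig works and the cluster is reachable. This is a minimal sketch; the kubeconfig path is a placeholder, and the runai login step assumes the NVIDIA Run:ai CLI has been installed per the instructions linked above.

    # Point kubectl at the kubeconfig provided by your TAM (placeholder path).
    export KUBECONFIG=~/dgxc-kubeconfig.yaml

    # Confirm the cluster responds and its nodes are visible.
    kubectl get nodes

    # Authenticate the NVIDIA Run:ai CLI against the control plane (opens SSO).
    runai login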

2.2.4. Creating a Department in NVIDIA Run:ai#

Next, you’ll need to create a department to use in NVIDIA Run:ai. A default department may already be provided.

However, if additional departments are needed, or if the default department’s quota needs to be modified, follow the instructions in Departments to make the appropriate changes.

2.2.5. Creating a Project in NVIDIA Run:ai#

With your department set up, you can now create a project in which to run user workloads.

Each project is assigned a more granular quota within the quota available to the department you just created.

Both Application Administrators and Department Administrators can create projects in NVIDIA Run:ai.

For instructions on creating a project, refer to Creating a Project.

Once you’ve created a department and project, you can assign users to specific projects or departments as needed.

2.2.6. Assigning User Roles in NVIDIA Run:ai#

Note

This step is optional during onboarding. You can proceed to the next sections to run workloads within the department or project you just created.

After logging in to NVIDIA Run:ai and setting up departments and projects, you can grant access to additional users.

For more information, refer to Managing Users.

2.2.7. Creating an Environment#

Environments are an NVIDIA Run:ai construct for configuring the container image a workload uses. To run the example workloads in the following sections, create the following two environments, both using the nvcr.io/nvidia/nemo:25.02 base image:

  • A standard NeMo environment

  • A distributed NeMo environment

For detailed instructions, refer to Creating a New Environment.
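
Optionally, you can confirm that the base image is accessible before creating the environments by pulling it on a local machine. This is a minimal sketch assuming Docker is installed and you have an NGC API key for your Private Registry org; the cluster pulls the image itself when a workload runs, so this check is not required.

    # Log in to the NGC container registry; use the literal username $oauthtoken
    # and your NGC API key as the password.
    docker login nvcr.io

    # Pull the base image used by both example environments.
    docker pull nvcr.io/nvidia/nemo:25.02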

2.2.8. Creating a Data Source#

A Data Source provides a persistent mount point for data to be downloaded and saved on the cluster. You will need to create a Data Source to run the example workloads in the next sections.

Follow the instructions here to create Data Sources.
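
For reference, a Data Source backed by a persistent volume claim (PVC) ultimately appears as a Kubernetes PersistentVolumeClaim in your project's namespace. A minimal sketch of verifying this from the CLI, assuming project namespaces follow the usual runai-<project-name> convention (the project name below is a placeholder):

    # List PersistentVolumeClaims for a project named "demo" (placeholder name;
    # the runai- prefix reflects the typical NVIDIA Run:ai namespace convention).
    kubectl get pvc -n runai-demo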

2.3. User Steps#

2.3.1. Running Sample Workloads#

Note

The following instructions describe how to run single-node and multi-node NCCL tests on DGX Cloud Create that measure the overall bus bandwidth of direct GPU-to-GPU communication.

Keep in mind that there are theoretical bandwidth speeds and then the effective rates measured by the test. The effective rates will vary with the high-speed networking interconnect provided by each CSP.

Also, NCCL may apply the tree and ring algorithms differently in an all-reduce test depending on the number of ranks (GPU processes) used. In some cases, a 2-node H100 (16 GPUs total) all-reduce test may show twice the GB/s performance of a 4-node, 8-node, or larger test because NCCL can apply the tree algorithm exclusively. Refer to these notes for more information on how NCCL calculates bus bandwidth performance in its tests. If you have concerns about the performance metrics observed on your DGX Cloud Create cluster, contact your TAM.
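
For reference, the NCCL tests derive the reported bus bandwidth from the measured algorithm bandwidth. For all-reduce the relationship is:

    busbw = algbw × 2 × (n − 1) / n

where n is the number of ranks. With 8 ranks (one node) the factor is 2 × 7 / 8 = 1.75, and with 32 ranks (four nodes) it is 2 × 31 / 32 ≈ 1.94, which is what makes busbw comparable across different node counts even though algbw alone is not.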

2.3.1.1. Running a Single Node NCCL Test#

The following are prerequisites and steps for running a single-node NCCL Test.

2.3.1.1.1. Prerequisites#
  • A Department with a quota for 8 GPUs

  • A Project in the above department with a quota for 8 GPUs

  • A user with a role capable of running workloads

  • An 8-GPU compute resource

  • A standard environment based on the nvcr.io/nvidia/nemo:25.02 image

2.3.1.1.2. Instructions#
  1. Navigate to the Workloads menu.

  2. Click +New Workload, and select Training.

  3. Select the correct Project.

  4. Select Standard Workload architecture.

  5. Name your training “nccl-test”.

  6. Select the environment that uses the image above.

  7. Under Runtime settings, click + Commands & Arguments.

  8. Set the variables as follows (the flags are explained after the sample output at the end of this procedure):

    • Command: all_reduce_perf_mpi

    • Arguments: -b 1G -e 16G -f 2 -g 8

  9. Select the correct compute resources (likely h100-8g).

  10. Click Create Training.

  11. View the logs to see the output:

    #                                                               out-of-place                       in-place
    #          size         count      type    redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
    #           (B)    (elements)                                  (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
    1073741824     268435456     float     sum      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    2147483648     536870912     float     sum      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    4294967296    1073741824     float     sum      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    8589934592    2147483648     float     sum      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    17179869184   4294967296     float     sum      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    # Out of bounds values : 0 OK
    # Avg bus bandwidth : XXX.XX
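
For reference, the Command and Arguments fields above assemble into the following invocation inside the container. These are standard nccl-tests flags: -b and -e set the beginning and ending message sizes, -f is the multiplication factor between sizes, and -g is the number of GPUs used per process.

    # Assembled single-node invocation: sweep message sizes from 1 GiB to 16 GiB,
    # doubling each step, across all 8 GPUs in a single process.
    all_reduce_perf_mpi -b 1G -e 16G -f 2 -g 8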
    

2.3.1.2. Running a Multi-Node NCCL Test#

2.3.1.2.1. Prerequisites#
  • A Department with a quota for 32 GPUs

  • A Project in the above department with a quota for 32 GPUs

  • A user with a role capable of running workloads

  • An 8-GPU compute resource

  • A distributed environment based on the nvcr.io/nvidia/nemo:25.02 image

2.3.1.2.2. Instructions#
  1. Navigate to the Workloads menu.

  2. Click +New Workload, and select Training.

  3. Select the correct Project.

  4. Select Distributed Workload architecture.

  5. Select the MPI framework for the workload.

  6. Name your training “nccl-test”.

  7. Click CONTINUE.

  8. Select the distributed environment you have configured.

  9. Ensure the runtime settings for the workers are as follows (the purpose of the worker and master commands is summarized after the sample output at the end of this procedure):

    • Command: /usr/sbin/sshd

    • Arguments: -De

  10. Set the number of workers to four.

  11. Select the correct compute resources for the workers (the 8-GPU H100 configuration).

  12. Select CONTINUE.

  13. Ensure that Allow different setup for the master is toggled on.

  14. Select the same environment as the one used with the workers.

  15. Under Runtime settings, click + Commands & Arguments. Set the values in the tab corresponding to the CSP your cluster is on (that is, if running on AWS, select the AWS tab):

    • Command (AWS): /opt/amazon-efa-ofi/openmpi/bin/mpirun

    • Arguments:

      --allow-run-as-root -np 32 -x LD_LIBRARY_PATH -x OPAL_PREFIX -x FI_EFA_USE_DEVICE_RDMA -x NCCL_PROTO=simple all_reduce_perf_mpi -b 1g -e 16g -f2 -g1
      
    • Command (GCP): mpirun

    • Arguments:

      --allow-run-as-root -np 32 -x LD_LIBRARY_PATH="/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64" -x NCCL_ALGO="Ring,Tree" -x NCCL_BUFFSIZE="8388608" -x NCCL_CROSS_NIC="0" -x NCCL_DYNAMIC_CHUNK_SIZE="524288" -x NCCL_FASTRAK_CTRL_DEV="eth0" -x NCCL_FASTRAK_ENABLE_CONTROL_CHANNEL="0" -x NCCL_FASTRAK_ENABLE_HOTPATH_LOGGING="0" -x NCCL_FASTRAK_IFNAME="eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8" -x NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY="/dev/aperture_devices" -x NCCL_FASTRAK_NUM_FLOWS="2" -x NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS="600000" -x NCCL_FASTRAK_USE_LLCM="1" -x NCCL_FASTRAK_USE_SNAP="1" -x NCCL_MIN_NCHANNELS="4" -x NCCL_NET_GDR_LEVEL="PIX" -x NCCL_NVLS_ENABLE="0" -x NCCL_P2P_NET_CHUNKSIZE="524288" -x NCCL_P2P_NVL_CHUNKSIZE="1048576" -x NCCL_P2P_PCI_CHUNKSIZE="524288" -x NCCL_PROTO="Simple" -x NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE="/usr/local/nvidia/lib64/a3plus_guest_config.textproto" -x NCCL_SOCKET_IFNAME="eth0" -x NCCL_TUNER_CONFIG_PATH="/usr/local/nvidia/lib64/a3plus_tuner_config.textproto" -x NCCL_TUNER_PLUGIN="libnccl-tuner.so" all_reduce_perf_mpi  -b 1G -e 16G -f 2 -g 1
      
  16. Select the correct Compute Resources (choose a cpu or other CPU-only compute resource for the launcher).

  17. Click Create Training.

  18. View the logs to see the output.

    #                                                               out-of-place                       in-place
    #          size         count      type    redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
    #           (B)    (elements)                                  (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
    1073741824     268435456     float     sum      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    2147483648     536870912     float     sum      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    4294967296    1073741824     float     sum      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    8589934592    2147483648     float     sum      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    17179869184   4294967296     float     sum      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    # Out of bounds values : 0 OK
    # Avg bus bandwidth : XXX.XX
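
To recap how the pieces fit together: each of the four workers runs sshd in the foreground (-D prevents it from detaching and -e sends log output to stderr) so the launcher can reach it over SSH, while the master's mpirun starts 32 ranks in total with one GPU per rank (-g 1), matching 4 workers × 8 GPUs.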
    

2.3.1.3. (Optional) Running a Single Node NeMo Interactive Workload#

If desired, you can now also follow the instructions to set up and run an interactive workload using NeMo.

2.3.1.4. (Optional) Downloading Data into the Data Source#

You can also use the same interactive Jupyter environment to install additional CLI tools in the container and download your data from an object store into the data source that was created and mounted.
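
As a hedged example, assuming your data lives in an S3-compatible object store and the Data Source is mounted at /data inside the interactive workload (the bucket name and mount path below are placeholders), you could install the AWS CLI in the running container and sync the data into the mount:

    # Install the AWS CLI inside the running container (use whichever CLI
    # matches your object store).
    pip install awscli

    # Copy the dataset from the object store into the mounted Data Source.
    # s3://your-bucket/dataset and /data are placeholder values.
    aws s3 sync s3://your-bucket/dataset /data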

2.4. Conclusion#

That’s it! You can now access DGX Cloud Create and successfully run workloads on the cluster.

You can continue to customize and configure your cluster, download your data onto the cluster, and start running your workloads.

For any questions or assistance, reach out to your NVIDIA TAM.