4. Cluster User Guide#

Welcome to your Run:ai on DGX Cloud cluster! This section of the guide is targeted at practitioners who want to run workloads on the cluster. For complete Run:ai documentation, refer to the Run:ai Documentation Library.

4.1. Logging Into Your Cluster#

There are two ways to interact with a Run:ai on DGX Cloud cluster:

  • Using the Run:ai UI: Your cluster administrator should provide you with a URL to access the cluster. Use this URL to log in to the Run:ai UI.

  • Using the Run:ai CLI: This option is limited to certain user types. To use the Run:ai CLI, you must first log in using the UI. To learn how to set up the CLI, visit Accessing the Run:ai CLI in Advanced Usage.

4.1.1. Accessing the Run:ai UI#

To access the Run:ai console in a browser, follow these steps:

  1. Go to the cluster URL given to you by your cluster admin.

  2. In the login dialog, click Continue with SSO. Your SSO login dialog will open.

  3. Enter your details. Once logged in, you will be taken to the Run:ai Cluster Overview page.

4.2. Understanding Your Role and Scope#

When working in Run:ai, your user roles determine the actions you can perform within the cluster. This section explains how to identify your role and understand your permissions. You may be assigned multiple roles within a single cluster.

4.2.1. Learning Your User Role(s)#

Your cluster administrator should have informed you of your assigned role(s) when you first received the cluster URL. If you haven’t received this information, contact your cluster administrator.

To view the permissions and actions allowed by your user type, visit the Run:ai RBAC documentation.

4.3. Walking Through Your Run:ai Environment on DGX Cloud#

This section covers the key concepts of Run:ai on DGX Cloud, focusing on areas which are specific to the DGX Cloud-based cluster. For more information, refer to the full Run:ai Documentation Library.

4.3.1. Overview#

After logging into the Run:ai UI, you will be taken to the Run:ai Overview page.

[Figure: Run:ai UI overview]

The Overview page displays key metrics about your cluster, including resource utilization and the number of allocated GPU devices. The name of your cluster is listed in the top navigation bar.

The left navigation menu provides quick access to different sections of the UI. Use it to navigate to the Projects, Workloads, Departments, and Nodes pages, among others, as discussed below.

[Figure: Run:ai UI navigation]

For more information about analytics, quota management, and nodes, visit the Run:ai documentation.

4.3.2. Departments and Projects#

Visit Run:ai Departments for a detailed introduction to departments and projects.

4.3.2.1. Departments#

Departments allow you to associate quotas with different teams and groups of users. Users are members of a department, and all projects must be created within a department.

Note

Only users with the Application Administrator or Editor role can create departments.

4.3.2.1.1. Creating a Department#

For instructions on creating a department, refer to the Creating a Department section of the Cluster Administrators guide.

4.3.2.2. Projects#

Projects are used to allocate resources and define clear guardrails between different research initiatives. Groups of users (or, in some cases, individuals) are associated with a project and can run workloads within that project against a fixed project allocation.

All projects are associated with a department. It is important to take note of the department’s quota limits, as these will impact the allocation and quota of any project associated with that department.

Multiple users can be scoped into the same project. Project-level information, including credentials, is visible to all users in that project.

4.3.2.2.1. Creating Projects#

Note

Only users with the Application Administrator, Department Administrator, Editor, or Research Manager role can create and manage projects within their scope on the Run:ai cluster.

Follow Creating a Project in the Cluster Administrators guide to create a project.

4.3.2.2.2. Editing an Existing Project#

If your permissions allow, you can update an existing project by clicking on the checkbox to the left of the project name and then clicking Edit on the menu bar at the top of the projects overview page. This will take you back to the project creation page where you can update any entries as required.

You can update user access for the project by clicking on the checkbox to the left of the project name, then clicking Access Rules on the menu bar at the top of the projects overview page. This will bring up a pop-up where you can enter additional user email addresses to grant them access to the project.

4.3.3. Workloads#

A workload is a computational job or task that you submit to the cluster for execution. There are two types of workloads enabled on your Run:ai on DGX Cloud cluster:

  1. Workspaces - interactive environments best suited for development and exploration of data and code.

  2. Trainings - distributed training jobs which are not interactive.

Note

Inference is not enabled in the cluster, and the deprecated Jobs workload has not been validated.

Workloads run within a project scope and can consume a range of resources. Detailed information about these resources is available in the next section of this guide. Here is a brief overview of the resources:

  • Environments - A workload must have an environment. An environment is based on a container image and holds libraries, requirements, and optionally code and data.

  • Compute Resources - A workload requires allocated compute resources, which determine the number of CPUs and GPUs it can utilize.

  • Volumes - Volumes can be associated with a job, giving access to your data. They are dedicated to specific workloads and cannot be transferred.

  • Data Sources - You can connect to a range of data sources including Git and Persistent Volume Claims (PVCs). Data sources are created separately from workloads, and can be used between different workloads, if desired.

  • Credentials - Credentials scoped at the project level will automatically be injected into any workload running in the project.

  • Templates - Templates allow you to pre-fill the workload creation form and save it for use by users in the project at a later date. This is useful for frequently used workloads.

In the next sections, we cover each of these workload components in more detail.

We also provide full examples of creating and launching a variety of workloads in the Interactive Workload Examples guide.

4.3.4. Environments#

Run:ai environments specify container URLs as well as image pull policies and entry commands for the containers. All workloads must specify an environment.

When creating an environment, you can specify the scope, indicating which departments and projects can view and use it.

4.3.4.1. Creating a New Environment#

Note

Only users with the Application Administrator, Department Administrator, Editor, Environment Administrator, L1 Researcher, or Research Manager role can create environments.

To create a new environment using the Run:ai UI:

  1. In the left navigation menu, select Environments. You will be taken to the Environments overview page which shows information on existing environments in the cluster.

  2. Click + NEW ENVIRONMENT in the top left of the page. You will be taken to the New environment creation page.

  3. Select a Scope for the environment. This determines which clusters, departments, groups or projects can deploy the environment.

  4. Enter an Environment name and description.

  5. Insert the URL for a container image, and select the pull policy for the image.

  6. Select the architecture and type of the workload.

  7. (Optional) Select connections for your tools.

  8. (Optional) Under the Runtime settings drop-down menu, set any commands and arguments for running the container in the pod. Add any environment variables and a working directory for the container.

    Note

    When using a container image, be aware of its default launch command or entry point; any command and arguments set here typically take precedence over them when the workload starts (see the sketch after these steps).

  9. (Optional) Under the Security menu, set additional Linux capabilities for the container and indicate whether the UID, GID and groups should be taken from the image or from a custom location.

  10. Click CREATE ENVIRONMENT. You will be taken back to the Environments overview page where your environment is listed in the table.
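
As an illustration of how the Runtime settings reach your code, below is a hypothetical entry script that such a command could invoke. The script path, the `--epochs` argument, and the `DATA_DIR` variable are placeholders, not values defined by Run:ai; the sketch only shows that arguments and environment variables set in the form are visible to the containerized process.

```python
# A hypothetical entry script (e.g. /workspace/train.py) that a Runtime
# settings command could invoke. --epochs and DATA_DIR are placeholder names.
import argparse
import os


def main() -> None:
    parser = argparse.ArgumentParser(description="Example workload entry point")
    parser.add_argument("--epochs", type=int, default=1)
    args = parser.parse_args()

    # Environment variables set under Runtime settings appear in os.environ.
    data_dir = os.environ.get("DATA_DIR", "/workspace/data")
    print(f"Running {args.epochs} epoch(s) with data from {data_dir}")


if __name__ == "__main__":
    main()
```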

4.3.4.2. Editing an Environment#

Note

Only users with the Application Administrator, Department Administrator, Editor, Environment Administrator, L1 Researcher, or Research Manager role can edit environments.

You can edit an existing environment by creating an ‘edited copy.’ To edit an environment, from the Environment overview page in the UI:

  1. Click on the environment you wish to edit. The menu bar will appear at the top of the page.

  2. Click COPY & EDIT. You will be taken to the New environment overview page, which will be pre-filled with the selected environment details.

  3. Edit the environment as you wish, then click CREATE ENVIRONMENT.

4.3.5. Compute Resources#

In Run:ai, a compute resource refers to a resource request consisting of CPU devices and memory, and optionally GPUs and GPU memory. When a workspace is requested with a specific compute resource, the scheduler searches for those resources; if they are available, the workspace will be launched with access to them.

Note

On your Run:ai on DGX Cloud cluster, you cannot use partial GPUs. Therefore, avoid selecting or creating compute resources which use partial GPUs.
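
If you want to confirm from inside a running workload which whole GPU devices were allocated to it, a minimal sketch is shown below. It relies on the NVIDIA_VISIBLE_DEVICES variable set by the NVIDIA container runtime and falls back to nvidia-smi; both are assumed to be present in GPU containers on the cluster.

```python
# A minimal sketch for confirming, from inside a running workload, which GPU
# devices were allocated to it.
import os
import subprocess

visible = os.environ.get("NVIDIA_VISIBLE_DEVICES", "")
if visible and visible not in ("all", "void"):
    print("Allocated GPU devices:", visible.split(","))
else:
    # nvidia-smi prints one line per visible GPU
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout.strip())
```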

4.3.5.1. Creating a New Compute Resource#

Note

Only users with the Application Administrator, Compute Resource Administrator, Department Administrator, Editor, L1 Researcher, or Research Manager role can create new compute resources.

To create a new compute resource:

  1. Navigate to the Compute Resources overview page using the left navigation menu.

  2. Click + NEW COMPUTE RESOURCE. You will be taken to the New compute resource creation page.

  3. Set a Scope for the resource. All users with access to create workloads within that scope will be able to use that compute resource.

  4. Give your compute resource a name and description.

  5. Under the Resources section of the page, select the number of CPUs and GPU devices per pod, as well as memory for both CPU and GPU devices.

    Note

    Partial GPUs are not supported on your Run:ai on DGX Cloud cluster.

  6. Once you have filled in the page, click CREATE COMPUTE RESOURCE. You will be taken to the Compute resource overview page, where your new compute resource will appear in the resources list.

4.3.6. Storage#

Within Run:ai, there are two types of storage:

  • Data Sources - persistent storage which can be attached to workloads.

  • Volumes - non-persistent storage which is created at the launch of the workload and persists only while the workload is running.

In the next sections, we will cover data sources and volumes in more detail and give examples of using these storage types to access and read data.

In the current release, DGX Cloud provides shared persistent storage through GCP Filestore on GCP-based clusters and FSx for Lustre on AWS-based clusters, each with specific storage classes available to users.
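
If you have Kubernetes API access to the cluster (see Setting up Your Kubernetes Configuration File in Advanced Usage), you can list the storage classes that are actually available. A minimal sketch using the official Kubernetes Python client:

```python
# A minimal sketch that lists the storage classes available on the cluster.
# It assumes a kubeconfig is set up as described in Advanced Usage.
from kubernetes import client, config

config.load_kube_config()

for sc in client.StorageV1Api().list_storage_class().items:
    print(sc.metadata.name, sc.provisioner)
```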

4.3.6.1. Data Sources#

Data sources are persistent general-purpose storage or connections to external data, like a Git repository.

The data sources supported by DGX Cloud are:

  • Persistent Volume Claim (PVC) (refer to details below)

  • Git

  • ConfigMap

DGX Cloud does not support these data sources:

  • NFS

  • S3 Bucket

  • Host path

Every data source must be assigned a scope when it is created, defining which projects and departments can use or connect to it. The following sections give more details on the supported data sources.

4.3.6.1.1. PVC#

A Persistent Volume Claim (PVC) is a Kubernetes request for storage capacity initiated by a user. In the context of a Run:ai data source, the PVC is a pool of storage that can be provisioned against a set of available storage types in the DGX Cloud deployment, with varying capabilities and capacity.

PVC data sources should be used conceptually as a high performance shared directory for storing data that needs to be accessed in each project or namespace for your workloads.
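
Once a PVC data source is mounted into a workload at its container path, using it is ordinary file I/O. The sketch below assumes a hypothetical container path of /data; in practice the path is whatever you set when creating the data source.

```python
# A minimal sketch of using a PVC data source mounted at a hypothetical
# container path of /data. Files written here persist beyond the workload
# and are visible to other workloads in the same project that mount the
# same PVC.
from pathlib import Path

data_root = Path("/data")                      # hypothetical container path
sample = data_root / "datasets" / "sample.txt"

sample.parent.mkdir(parents=True, exist_ok=True)
sample.write_text("example record\n")
print(sample.read_text())
```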

Important

Creating a PVC counts against the storage quota available for your cluster. Care should be taken to ensure that each PVC is used efficiently, to not exhaust your total available storage quota too quickly.

Recommended Storage Classes (GCP)

GCP-based clusters leverage shared storage provided by Google Filestore. The following PVC storage classes from Filestore are recommended for use in DGX Cloud. Their capabilities and performance characteristics are presented here.

| Service Tier | GKE Storage Class | GKE PVC Size | Deployment | Access Mode     |
|--------------|-------------------|--------------|------------|-----------------|
| Zonal        | zonal-rwx         | 10-100 TiB   | Zonal      | read/write/many |

All other PVC storage classes (none, standard, standard-rwo, standard-rwx, premium-rwo, premium-rwx, Enterprise-rwx, Enterprise-multishare-rwx) are unsupported and their use will result in job failures.

Recommended Storage Classes (AWS)

AWS-based clusters leverage shared storage provided by FSx Lustre. The following storage class is available for use:

| Service Tier     | Storage Class | PVC Size   | Deployment   | Access Mode     |
|------------------|---------------|------------|--------------|-----------------|
| FSx Lustre (SSD) | lustre-sc     | 12-160 TiB | PERSISTENT-1 | read/write/many |

All other PVC storage classes (none, gp2, ebs) are unsupported and their use will result in job failures.

Note

For information on total storage capacity available for each storage class, or current storage utilization, refer to Managing Your Storage Utilization in Advanced Usage.

Creating a PVC

Note

Only users with the Application Administrator, Data Source Administrator, Department Administrator, Editor, L1 Researcher, or Research Manager role can create a PVC.

To create a PVC from the Run:ai UI:

  1. From the Data sources overview page, click + NEW DATA SOURCE. A drop-down menu will appear.

  2. From the drop-down menu, select PVC. You will be taken to the New data source creation page.

  3. Set a Scope for the PVC, and enter a name and description.

    Important

    PVC data sources created at the cluster or department level do not replicate data across projects or namespaces. Each project or namespace is provisioned with a separate PVC backed by different underlying PVs; therefore, the data in each PVC is not replicated.

  4. Fill out the Data mount section of the form:

    1. Select a Storage class. Be sure to review the DGX Cloud recommended storage classes.

    2. Select the access mode configuration for the PVC - either read/write by one node, read only by many nodes, or read/write by many nodes.

    3. Specify a claim size to ensure a minimum capacity for the PVC.

      Note

      The PVC may be larger than necessary, depending on the minimum partition size of the underlying storage. Refer to the table above when provisioning storage greater than 1 TiB to ensure compatibility with the target storage class.

    4. Choose the Filesystem option as the Volume mode.

      Note

      The Block Volume mode is unsupported. Selecting it may lead to errors when launching a workload that uses the resulting PVC.

    5. Specify a Container path to define what path the PVC will be accessible from in a running job.

      Note

      If you do not specify a /scratch volume, one will be provisioned implicitly for GPU-based compute resources using ephemeral storage.

    6. (Optional) In the Restrictions pane, you can use the toggle switch to make the storage read-only if desired.

      Note

      This is an alternative to the PVC access mode configuration.

  5. Click CREATE DATA SOURCE. You will be taken to the Data sources overview page, where you can view your new PVC data source.

Note

When creating a new data source, you can also select an existing PVC. This PVC can be created using the K8s API or kubectl on the cluster directly. For more information on interacting with the K8s cluster, refer to Setting up Your Kubernetes Configuration File.
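
As an example, the following is a minimal sketch of creating such a PVC with the official Kubernetes Python client; the PVC name, namespace (Run:ai project namespaces are typically prefixed with runai-), storage class, and claim size are placeholders to adjust for your project.

```python
# A minimal sketch of creating a PVC through the Kubernetes API so it can be
# selected as an existing PVC in the Run:ai data source form. The name,
# namespace, storage class, and size below are placeholder values.
from kubernetes import client, config

config.load_kube_config()  # uses the kubeconfig described in Advanced Usage

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="shared-datasets"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],
        storage_class_name="zonal-rwx",  # use a supported class for your cluster
        resources=client.V1ResourceRequirements(requests={"storage": "10Ti"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="runai-my-project",  # placeholder project namespace
    body=pvc,
)
```

The resulting PVC can then be selected as an existing PVC when you create the data source in the UI.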

4.3.6.1.2. Git#

Git is a version control system (VCS) used to manage iterative code modification and collaboration within a software project. As a Run:ai data source, it takes a public or private git repository and makes it available inside a workload as a filesystem path.

Creating a Git Data Source

Note

Only users with the Application Administrator, Data Source Administrator, Department Administrator, Editor, L1 Researcher, or Research Manager role can create a Git data source.

To create a Git data source within your Run:ai cluster using the UI:

  1. From the Data sources overview page, click + NEW DATA SOURCE. A drop-down menu will appear.

  2. From the drop-down menu, select Git. You will be taken to the New data source creation page.

  3. Set the scope for the Git data source.

  4. Enter a name and a description into the relevant fields.

  5. Fill in the URL for the Git repository, and optionally specify the branch.

  6. Select the relevant credentials for the data source. These are required if the Git repository is private. (You can create the necessary credentials in the Run:ai Credentials web interface).

  7. Set the Container path. This defines the path from which the Git data source will be accessible in a running job.

    Note

    If you do not specify a /scratch volume, one will be provisioned implicitly for GPU-based compute resources using ephemeral storage.

  8. Click CREATE DATA SOURCE. You will be taken to the Data sources overview page, where your new data source will be shown in the table.

4.3.6.1.3. ConfigMap#

A ConfigMap is a Kubernetes object used to store non-confidential data as key-value pairs. As a Run:ai data source, it takes a ConfigMap that exists in the Run:ai Kubernetes cluster and makes it available inside a workload as a file.

Note

Only users with the Application Administrator, Data Source Administrator, Department Administrator, Editor, L1 Researcher, or Research Manager role can create a ConfigMap data source.

To create a ConfigMap data source in your Run:ai cluster from the UI:

  1. From the Data sources overview page, click + NEW DATA SOURCE. A drop-down menu will appear.

  2. From the drop-down menu, select ConfigMap. You will be taken to the New data source creation page.

  3. Select a Scope for the ConfigMap.

    Note

    ConfigMaps can only be scoped at the individual project level. You cannot select the whole cluster or a department as the scope for a ConfigMap.

  4. Enter a name and description.

  5. From the ConfigMap name drop-down menu, select a ConfigMap that already exists on your Run:ai Kubernetes cluster (a sketch for creating one follows these steps).

  6. Set the Container path, which defines what path the ConfigMap will be accessible from in a running job.

    Note

    If you do not specify a /scratch volume, one will be provisioned implicitly for GPU-based compute resources using ephemeral storage.

  7. Click CREATE DATA SOURCE. You will be taken to the Data sources overview page, where the new ConfigMap will appear in the table.
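
Because the form in step 5 only lists ConfigMaps that already exist in the cluster, you may need to create one first. A minimal sketch using the Kubernetes Python client is shown below; the ConfigMap name, namespace, and contents are placeholders.

```python
# A minimal sketch of creating a ConfigMap in a project namespace so that it
# can be selected as a Run:ai data source. The name, namespace, and data are
# placeholder values.
from kubernetes import client, config

config.load_kube_config()

config_map = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(name="training-config"),
    data={"config.yaml": "batch_size: 32\nlearning_rate: 0.001\n"},
)

client.CoreV1Api().create_namespaced_config_map(
    namespace="runai-my-project",  # placeholder project namespace
    body=config_map,
)
```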

4.3.6.2. Volumes#

Volumes are workload-specific storage in Run:ai. At workload creation time, a user can create a new volume. Most configuration choices for a volume mirror those documented in the PVC Data Source section. Additionally, a user can specify the volume’s contents as either persistent or ephemeral. A persistent volume retains all data written to it until the workload it is attached to is deleted. An ephemeral volume’s contents are removed every time a workload is stopped.

Workloads can use ephemeral local storage for scratch space, caching, and logs. The lifetime of local ephemeral storage does not extend beyond the life of the individual pod. It is exposed to pods through the container’s writable layer, logs directory, and EmptyDir volumes. The data in an EmptyDir volume is preserved across container crashes.

When a user submits a job which requires a GPU, an EmptyDir on the node's local SSD is attached to the pod as a scratch disk. The /scratch directory appears to have 6 TiB of capacity, which is the total local storage on the node; however, each Pod is limited to 200 GiB. If a Pod exceeds this 200 GiB limit, it will be evicted from the node (i.e., terminated). CPU-only workloads do not get a /scratch directory provisioned.
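
Because exceeding the 200 GiB limit causes the pod to be evicted, it can be useful for a workload to track how much it has written to /scratch. A minimal sketch:

```python
# A minimal sketch for tracking how much data a workload has written to the
# ephemeral /scratch volume, to help stay under the per-pod 200 GiB limit.
import os
from pathlib import Path

SCRATCH_LIMIT_BYTES = 200 * 1024**3  # the 200 GiB per-pod limit described above


def scratch_usage_bytes(root: str = "/scratch") -> int:
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_file():
                total += path.stat().st_size
    return total


used = scratch_usage_bytes()
print(f"/scratch usage: {used / 1024**3:.1f} GiB "
      f"({used / SCRATCH_LIMIT_BYTES:.0%} of the 200 GiB limit)")
```

Note that data written to the container's writable layer and logs may also count toward ephemeral storage, so this figure should be treated as a lower bound.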

4.3.7. Credentials#

The Run:ai UI supports credentials for accessing containers and applications that are otherwise gated. The supported credential types on your cluster are:

  • Docker registries

  • Access keys

  • Usernames and passwords

Credentials of type Generic Secret are not enabled on the cluster.

When you add credentials to your cluster, you must select a scope. The credentials are then usable by any workload which is deployed at that scope level. The smallest scope which can be applied is the project-level scope. There is no user-specific scoping available for credentials.

As part of your subscription to Run:ai on DGX Cloud, you are granted access to the NVIDIA GPU Cloud (NGC) Catalog. The NGC Catalog contains GPU-accelerated AI models and SDKs that help you rapidly infuse AI into your applications.

4.3.7.1. Accessing Your NGC Org#

Your cluster administrator can invite you to NVIDIA GPU Cloud (NGC). Once you have received the invitation, follow the instructions in the email to set up your account.

4.3.7.2. Setting Up Your NGC API Key#

To generate your NGC API key, follow these steps:

  1. Log into NGC.

  2. Click on your user account menu at the top right of the screen and select Setup.

  3. Click Generate Personal Key and generate the key in the new form that opens. Save the displayed key in a safe place as this will only be shown once and is required for future steps.

4.3.7.3. Adding NGC Credentials to the Run:ai Cluster#

Note

Only users with the Application Administrator, Credentials Administrator, Department Administrator, Editor, or L1 Researcher role can add credentials to the cluster.

To add NGC Credentials to the Run:ai Cluster:

  1. Access the Credentials page from the Run:ai left navigation menu.

  2. Click + NEW CREDENTIALS and select Docker registry from the drop-down menu. You will be taken to the New credential creation page.

  3. Select the Scope for your NGC credential. The secret will be usable by any workload launched within the scope. For example, if your scope is set at the department level, all workloads launched in any project associated with that department can use the secret, regardless of which user created the credential, or launched the workload.

  4. Enter a name and description for the credential. This will be visible to any cluster user.

  5. Select New secret.

  6. For username, use $oauthtoken.

  7. For password, paste your token.

  8. Under Docker Registry URL, enter nvcr.io.

  9. Click CREATE CREDENTIALS. Your credentials will now be saved in the cluster.
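
The UI flow above is the supported way to register NGC credentials with Run:ai. Purely as an illustration of what a Docker registry credential corresponds to in Kubernetes terms, the following is a minimal sketch that creates an equivalent image pull secret with the Kubernetes Python client; the secret name and namespace are placeholders, and creating such a secret manually is not required when using the UI.

```python
# A hedged sketch of the Kubernetes image pull secret that a Docker registry
# credential for nvcr.io amounts to. The secret name and namespace below are
# placeholder values.
import base64
import json
from kubernetes import client, config

config.load_kube_config()

ngc_api_key = "<your NGC personal key>"
auth = base64.b64encode(f"$oauthtoken:{ngc_api_key}".encode()).decode()
docker_config = {
    "auths": {
        "nvcr.io": {"username": "$oauthtoken", "password": ngc_api_key, "auth": auth}
    }
}

secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="ngc-pull-secret"),
    type="kubernetes.io/dockerconfigjson",
    string_data={".dockerconfigjson": json.dumps(docker_config)},
)

client.CoreV1Api().create_namespaced_secret(
    namespace="runai-my-project",  # placeholder project namespace
    body=secret,
)
```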

4.3.7.4. Adding Git Credentials#

To add Git credentials to your Run:ai on DGX Cloud cluster, you must first create a Personal Access Token on your Git server and grant it access to the desired repositories as part of the creation process. You can create a token on GitHub or GitLab by following the respective provider's instructions.

Note

Only users with the Application Administrator, Credentials Administrator, Department Administrator, Editor, or L1 Researcher role can add credentials to the cluster.

Once you have generated a Personal Access Token, add your credentials to your Run:ai cluster by following these instructions:

  1. Access the Credentials page from the Run:ai left navigation menu.

  2. Click + NEW CREDENTIALS and select Username & password from the drop-down menu. You will be taken to the New credential creation page.

  3. Select the Scope for your credential. The secret will be usable by any workload launched within the scope.

  4. Enter a name and description for the credential. This will be visible to any cluster user.

  5. Select New secret.

  6. For username, use your Git username.

  7. For password, paste your Personal Access Token.

  8. Click CREATE CREDENTIALS. Your credentials will now be saved in the cluster.
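
Within a running workload, you can also use a personal access token directly to clone a private repository, for example when not using a Git data source. A minimal sketch is below; the GIT_TOKEN environment variable, username, and repository URL are hypothetical placeholders rather than values provided by Run:ai.

```python
# A hedged sketch of cloning a private repository with a personal access token
# from inside a running workload. GIT_TOKEN, the username, and the repository
# URL are hypothetical placeholders.
import os
import subprocess

token = os.environ["GIT_TOKEN"]                      # placeholder variable holding the PAT
username = "your-git-username"                       # placeholder Git username
repo = "github.com/your-org/your-private-repo.git"  # placeholder repository

subprocess.run(
    ["git", "clone", f"https://{username}:{token}@{repo}", "/workspace/your-private-repo"],
    check=True,
)
```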

4.3.8. Templates#

Templates are used to pre-define commonly used workloads. They make it quick and easy to launch a workload based on pre-set instructions.

Note

Only users with the Application Administrator, Credentials Administrator, Department Administrator, Editor, L1 Researcher, Research Manager, or Template Administrator role can create templates.

To create a template, navigate to the Templates overview page from the left navigation menu then:

  1. Click + NEW TEMPLATE. The New Template Creation page will open.

  2. Set the Scope for the template. This determines which departments and projects will be able to use this template.

  3. Enter a template name and description.

  4. Select or create an Environment for your template to use.

  5. Choose a compute resource.

  6. (Optional) Select values for Volumes and Data Sources.

  7. Once you have selected all your desired settings for the template, click CREATE TEMPLATE. You will be taken back to the Templates overview page.

Once created, templates can be selected for use as part of the New Workload creation process.