3. Cluster Administrator Guide#

Congratulations on your new Run:ai on DGX Cloud cluster!

This section provides Cluster Administrators with essential information for setting up and managing Run:ai on DGX Cloud clusters while supporting users. It covers the setup process specific to the Run:ai on DGX Cloud offering. For comprehensive administrative details, refer to the Run:ai Documentation.

3.1. Cluster Handover#

In preparation for your onboarding, you should have engaged with your NVIDIA Technical Account Manager (TAM). Your TAM will provide the required documentation to accomplish the following prior to an onboarding call:

  • Designate an administrator for NGC and Run:ai

  • Federate your IdP with the NVIDIA identity federation

  • Specify a classless inter-domain routing (CIDR) range for the cluster allowlist
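Before sending an allowlist range to your TAM, it can help to confirm that the range is well-formed and that it covers the addresses your users connect from. The following is a minimal sketch using Python's standard-library `ipaddress` module. The function name is hypothetical, and the range and addresses are documentation placeholders, not values supplied by NVIDIA.

```python
import ipaddress

def check_allowlist(cidr: str, client_ips: list[str]) -> dict[str, bool]:
    """Return, for each client IP, whether it falls inside the CIDR range."""
    # strict=True rejects ranges with host bits set, such as 203.0.113.5/24.
    network = ipaddress.ip_network(cidr, strict=True)
    return {ip: ipaddress.ip_address(ip) in network for ip in client_ips}

# Placeholder values for illustration only.
result = check_allowlist("203.0.113.0/24", ["203.0.113.10", "198.51.100.7"])
print(result)
```

An address that maps to `False` would not be covered by the range you submit.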

During your onboarding call, the following will be provided:

  • A URL endpoint to access the Run:ai cloud control plane

  • A shared support channel (Slack or Teams) for your team and NVIDIA

  • A kubeconfig file, required to set up CLI access to the cluster

  • A URL for the cluster (within the kubeconfig file)

  • A URL for the OIDC Issuer for the cluster (within the kubeconfig file)
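The last two items are read from the kubeconfig file rather than supplied separately. As a rough sketch, the Python below collects both URLs from an already-parsed kubeconfig. The `clusters[].cluster.server` field is standard kubeconfig. Where the OIDC issuer appears depends on how the file was generated, so the `auth-provider` layout and the `idp-issuer-url` key used here are assumptions, and all names and URLs are placeholders. Parsing the YAML file itself is left out.

```python
def extract_endpoints(kubeconfig: dict) -> dict:
    """Collect cluster server URLs and OIDC issuer URLs from a parsed kubeconfig."""
    servers = [c["cluster"]["server"] for c in kubeconfig.get("clusters", [])]
    issuers = []
    for entry in kubeconfig.get("users", []):
        # Assumed layout: user -> auth-provider -> config -> idp-issuer-url.
        provider = entry.get("user", {}).get("auth-provider", {})
        issuer = provider.get("config", {}).get("idp-issuer-url")
        if issuer:
            issuers.append(issuer)
    return {"servers": servers, "oidc_issuers": issuers}

# Placeholder structure standing in for the file provided at onboarding.
sample = {
    "clusters": [
        {"name": "example", "cluster": {"server": "https://cluster.example.com"}}
    ],
    "users": [
        {
            "name": "example-user",
            "user": {
                "auth-provider": {
                    "name": "oidc",
                    "config": {"idp-issuer-url": "https://issuer.example.com"},
                }
            },
        }
    ],
}

endpoints = extract_endpoints(sample)
print(endpoints)
```

If your file stores the issuer elsewhere, only the lookup inside the loop would need to change.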

3.2. Accessing your Cluster#

There are two ways that administrators can interact with a Run:ai on DGX Cloud cluster:

  • Using the Run:ai UI: NVIDIA will provide you with the URL and an initial user login to the cluster. This user will be given the role of Application Administrator. That individual will then be able to create departments and projects, invite other users, and assign them access rules.

  • Using the Run:ai CLI: To use the Run:ai CLI, you must first log in using the UI. After logging into the UI, users will be able to set up the CLI. To learn how to set up the Run:ai CLI, see Accessing the Run:ai CLI.

3.2.1. Accessing the Run:ai UI#

To access the Run:ai console in a browser:

  1. Go to the cluster URL provided by your TAM in the onboarding instructions.

  2. In the login dialog, click CONTINUE WITH SSO. Your SSO login page will open as a pop-up.

  3. Enter your details. Once logged in, you will be taken to the Run:ai cluster overview page.

    Login Box in UI.

3.3. Managing Users#

Run:ai uses role-based access control (RBAC) to determine users’ access and ability to interact with cluster components. Each user can be assigned more than one role. See Cluster Users for more details on cluster user types.

3.3.1. Creating Users#

User management for Run:ai on DGX Cloud utilizes your existing identity provider (IdP) for authentication via NVIDIA’s identity federation service. As part of the onboarding process, your NVIDIA TAM should have worked with you to register your IdP with the service.

Run:ai access and role assignments are managed separately within the Run:ai platform.

  • Customer administrators cannot add local users or change the SSO configuration within Run:ai; they can only assign and remove roles.

  • After a role is assigned to a user within Run:ai, the user can log in to Run:ai via SSO using the email address associated with their IdP account. The IdP email address must match the email address associated with the assigned role in Run:ai.

  • Removing a user from your IdP does not automatically revoke their Run:ai access. To fully revoke a user’s access, their role must be removed within the Run:ai platform.
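Because IdP removal and Run:ai role removal are separate steps, a periodic comparison of the two user lists can surface roles that should be revoked. The sketch below assumes you have already exported both lists of email addresses by your own means; it calls no Run:ai or IdP API. The function name and addresses are hypothetical, and lowercasing the addresses is a convenience for this audit only, not a statement about how Run:ai matches emails.

```python
def find_stale_rules(access_rule_emails: list[str], idp_emails: list[str]) -> list[str]:
    """Return emails that still hold Run:ai roles but are absent from the IdP."""
    def normalize(email: str) -> str:
        # Assumption for this audit: ignore case and surrounding whitespace.
        return email.strip().lower()

    active = {normalize(e) for e in idp_emails}
    assigned = {normalize(e) for e in access_rule_emails}
    return sorted(assigned - active)

# Hypothetical exports.
runai_emails = ["alice@example.com", "bob@example.com"]
idp_users = ["Alice@example.com"]

stale = find_stale_rules(runai_emails, idp_users)
print(stale)
```

Each address returned still has a role in Run:ai and needs that role removed in the platform to fully revoke access.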

3.3.1.1. Assigning Roles in Run:ai#

Note

Only users with the role of Application Administrator, Department Administrator, Editor, or Research Manager can assign roles within the cluster.

To assign a role:

  1. From the Run:ai overview page, select the Tools & Settings menu in the top right corner.

    Select the Tools & Settings drop-down menu.
  2. From the menu, select Access rules & Roles. You will be taken to the Access rules & Roles overview page.

  3. Select + NEW ACCESS RULE to assign a new role to a user. The New Access rule pop-up will appear.

    Select New Access Role.
  4. Under Subject, select User.

    Select User.
  5. Enter the user email address associated with their SSO user account.

  6. Select the role and the scope for the user to be invited, as described in the User Scopes section of the Overview Guide.

    Note

    You must notify the user of their role and scope, as these determine what they can and cannot do within the cluster.

  7. Click SAVE RULE to save the rule.

Now the user can log in to the Run:ai environment using the SSO option on the login screen, at the URL provided by their customer administrator or their NVIDIA TAM.

3.3.1.2. Access Rule Creation in Run:ai#

Important

Any user with the ‘CREATE’ permission on Access rules in the Run:ai Cluster can create additional permissions (Access rules) in any SCOPE they are a part of, at a ‘level’ equal to or lesser than their role in that SCOPE.

The following roles have the ‘Create’ Access Rule permissions:

  • Application Administrator

  • Department Administrator

  • Editor

  • Research Manager

Because the ability to create access rules is not restricted to a specific scope, it is important to manage and monitor the cluster users.

3.3.1.2.1. Example: Assigning Access Roles#

This example shows one consequence of the Create Access Rule permission applying across the cluster even when it is granted through a role in a single department.

Alice has the role of Editor in Department 1 and the role of L1 Researcher in Department 2. Because Alice has the ‘CREATE’ access rule permission from her role in Department 1, she can add users to Project 2-A, a project in Department 2, with the role of L1 Researcher or lower (including L2 Researcher and Viewer). These new users can run jobs and use resources in the cluster. They can also access any credentials scoped to Project 2-A.

However, the administration of Department 2 may not be aware that Alice has an Editor role in Department 1. Therefore, do not assume that Alice has the authority to add users to Project 2-A, especially if it contains proprietary information.
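To reason about cases like Alice's, the rule described in this section can be written down as a small model. The Python below is a sketch of that rule as stated here, not Run:ai's implementation. The ranking of L1 Researcher, L2 Researcher, and Viewer follows the example; placing Editor above them, the scope tree, and all function names are assumptions made for illustration. Only the roles listed in `ROLE_LEVEL` are handled.

```python
# L1 Researcher > L2 Researcher > Viewer comes from the example above.
# Ranking Editor above them is an assumption made for this sketch.
ROLE_LEVEL = {"Viewer": 1, "L2 Researcher": 2, "L1 Researcher": 3, "Editor": 4}

# Roles listed in this section as holding the Create Access Rule permission.
CREATE_ROLES = {
    "Application Administrator",
    "Department Administrator",
    "Editor",
    "Research Manager",
}

# Hypothetical scope tree: each project maps to its parent department.
PARENT = {"Project 1-A": "Department 1", "Project 2-A": "Department 2"}

def role_in_scope(assignments: dict, scope: str):
    """Find the user's role for a scope, inheriting from the parent department."""
    while scope is not None:
        if scope in assignments:
            return assignments[scope]
        scope = PARENT.get(scope)
    return None

def can_grant(assignments: dict, target_scope: str, new_role: str) -> bool:
    """Apply the rule: CREATE held anywhere, level capped by role in the target scope."""
    has_create = any(role in CREATE_ROLES for role in assignments.values())
    own_role = role_in_scope(assignments, target_scope)
    if not has_create or own_role is None:
        return False
    return ROLE_LEVEL[new_role] <= ROLE_LEVEL[own_role]

alice = {"Department 1": "Editor", "Department 2": "L1 Researcher"}
print(can_grant(alice, "Project 2-A", "L2 Researcher"))  # True
print(can_grant(alice, "Project 2-A", "Editor"))         # False
```

A user who holds only L1 Researcher in Department 2 has no role with the Create permission, so the same call returns `False` for them.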

3.4. Administrator Cluster Setup#

To prepare your cluster for users to run workloads, you must set up the following:

  • Departments

  • Projects

  • Compute resources

Refer to Run:ai Departments for a detailed introduction to departments and projects.

3.4.1. Departments#

Departments allow you to associate quotas with different teams and groups of users. Users are members of a department, and all projects must be created within a department.

Note

Only users with the role of Application Administrator or Editor can create and manage departments.

3.4.1.1. Modifying the Default Department#

When your Run:ai on DGX Cloud cluster is provisioned, a “default” department will be created. However, this department won’t have any resource quota allocated to it, so no workloads can be run within the department.

Note

Only users with the role of Application Administrator or Editor can modify departments.

To add quota to the default department:

  1. In the Run:ai UI, navigate to the Departments overview page using the left navigation menu.

  2. Click the ‘default’ department entry. The blue ‘selected’ menu should appear at the top of the page.

  3. Click EDIT to access the Edit department page.

  4. (Optional) Rename the department.

  5. In the Quota management section, select how many GPU devices, CPU cores, and how much CPU memory the department should be allocated.

  6. Click SAVE to update the department. You will be taken back to the Departments overview page, where the new quotas should appear in the table.

3.4.1.2. Creating a Department#

Having multiple departments in your cluster allows you to manage groups of projects and set quotas for separate groups.

Note

Only users with the role of Application Administrator or Editor can create departments.

You can create a department using the following steps:

  1. Navigate to the Departments overview page.

  2. Click + NEW DEPARTMENT. You will be taken to the New department creation page.

    Department Creation page.
  3. Enter a department name.

  4. Select a quota for the department, setting a value for each of the three resource types: GPU devices, CPU cores, and CPU memory.

  5. Select whether to enable Allow department to go over quota. A department that is allowed to go over quota can use spare GPUs in the cluster, when they are available, beyond its listed quota.

  6. Click CREATE DEPARTMENT to save. You will be taken back to the Departments overview page.

3.4.2. Projects#

Projects are used to implement resource allocation, and define clear guardrails between different research initiatives. Groups of users (or in some cases, an individual) are associated with a project and can then run workloads within that project, against a fixed project allocation.

All projects are associated with a department. It is important to take note of the department’s quota limits, as these will impact the allocation and quota of any project associated with that department.
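One way to keep track of this is to compare planned project quotas with the department quota before creating the projects. The sketch below assumes you want the GPU quotas of a department's projects to sum to no more than the department's GPU quota. That is a planning convention for this example, not a rule stated in this guide, and all names and numbers are hypothetical.

```python
def remaining_gpu_quota(department_gpus: int, project_gpus: dict[str, int]) -> int:
    """Return the GPUs left in the department after the planned project quotas."""
    return department_gpus - sum(project_gpus.values())

# Hypothetical plan for a department with a 32-GPU quota.
planned = {"project-a": 8, "project-b": 16}
left = remaining_gpu_quota(32, planned)

if left < 0:
    print(f"Planned project quotas exceed the department quota by {-left} GPUs")
else:
    print(f"{left} GPUs remain unallocated in the department")
```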

Multiple users can be scoped into the same project. Project-level information, including credentials, is visible to all users in that project.

3.4.2.1. Creating a Project#

Note

Only users with the role of Application Administrator, Department Administrator, Editor, or Research Manager can create projects.

To create a project:

  1. Navigate to the Projects overview page.

  2. Click + NEW PROJECT. You will be taken to the New project creation page.

  3. Select a department under which the project will be created. The project will be viewable to all users within this department.

  4. In the Project name section, enter a name for the project.

  5. In the Quota management section, select the total number of resources that can be used by the project.

  6. (Optional) Use the scheduling rules drop-down menu to set up any rules for the cluster.

  7. Click CREATE PROJECT. You will be taken to the Projects overview page, where you can see the status of your newly created project.

3.4.2.2. Editing a Project#

Note

Only users with the role of Application Administrator, Department Administrator, Editor, or Research Manager can edit projects.

You can update the project by clicking on the checkbox to the left of the project name, then clicking Edit on the menu bar at the top of the projects overview page. This will take you back to the project creation page, and you can update any entries as required.

3.4.2.3. Updating Access to Projects#

Note

Only users with the role of Application Administrator, Department Administrator, Editor, or Research Manager can update access to projects.

If a user has access to a department and a new project is created within that department, they will automatically be granted access to the project.

To update access to a project:

  1. From the Projects overview page, click on the checkbox to the left of the project name.

  2. Click ACCESS RULES on the menu bar at the top of the Projects overview page. This will bring up a pop-up where you can enter additional user email addresses to grant them access to the project.

3.4.3. Compute Resources#

In Run:ai, a compute resource is a resource request consisting of CPU devices, memory, and (optionally) GPUs and GPU memory. When a workspace is requested with a specific compute resource, the scheduler looks for the requested resources and, if they are available, launches the workspace with access to them.

You can view the compute resources available within your cluster by navigating to Compute Resources from the left navigation menu in the Run:ai UI.

Note

Only users with the role of Application Administrator, Compute Resource Administrator, Data Source Administrator, Department Administrator, Editor, L1 Researcher, or Research Manager can create and manage compute resources.

As a cluster administrator, when you first receive your cluster you should check that the pre-loaded compute resources are sensible for your users and their use cases. We recommend that you:

  1. Remove any compute resources that use partial GPUs. GPU splitting is not possible on your Run:ai on DGX Cloud cluster.

  2. Set up a CPU only compute resource.

  3. Set up a 1-GPU compute resource and an 8-GPU compute resource.

These suggestions are only guidelines and you will likely require alternative compute resources for your teams. For instructions on how to set up compute resources, see the Compute Resources section in the Getting Started for Cluster Users guide.
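When reviewing the pre-loaded compute resources, a quick check for fractional GPU requests can help identify candidates for removal. The sketch below works on a hand-written inventory. The `name` and `gpus` fields are hypothetical and do not correspond to a Run:ai API or export format, and the check covers only fractional GPU counts, not partial GPUs expressed as GPU memory.

```python
def partial_gpu_resources(resources: list[dict]) -> list[str]:
    """Return the names of compute resources that request a fraction of a GPU."""
    return [r["name"] for r in resources if r["gpus"] != int(r["gpus"])]

# Hypothetical inventory copied by hand from the Compute Resources page.
inventory = [
    {"name": "cpu-only", "gpus": 0},
    {"name": "half-gpu", "gpus": 0.5},
    {"name": "one-gpu", "gpus": 1},
    {"name": "eight-gpu", "gpus": 8},
]

to_remove = partial_gpu_resources(inventory)
print(to_remove)
```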

3.4.4. Cluster Updates and Upgrades#

As stated in the Shared Responsibility section of Product Overview, NVIDIA is responsible for all cluster updates and upgrades provided as part of the DGX Cloud managed service.

These updates and upgrades can be either disruptive or non-disruptive. Before an update or upgrade takes place, your NVIDIA TAM will reach out to notify you of the coming update or upgrade.

In the case of a non-disruptive update or upgrade, the TAM will communicate the scope and anticipated time window of the work.

In the case of a disruptive update or upgrade, the TAM will communicate the scope of the work and, if possible, enable you to select a time window from a range of options. It may not always be possible to provide a range of options for the maintenance time window depending on the scope of the work.

For disruptive updates or upgrades that incur downtime, NVIDIA has put the following rules in place:

  • For clusters with 20 GPU nodes or fewer, surge upgrades are performed with the maxUnavailable attribute set to 100%.

  • For clusters with more than 20 GPU nodes, the entire GPU node pool is torn down and recreated using automation.
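The two rules above reduce to a single threshold on the GPU node count. The following Python sketch encodes that threshold so you can see which method applies to a cluster of a given size. The function name and return strings are illustrative only; NVIDIA performs the upgrade itself.

```python
def disruptive_upgrade_method(gpu_nodes: int) -> str:
    """Return the disruptive upgrade method described above for a node count."""
    if gpu_nodes < 0:
        raise ValueError("gpu_nodes must be non-negative")
    if gpu_nodes <= 20:
        return "surge upgrade with maxUnavailable set to 100%"
    return "GPU node pool torn down and recreated"

for count in (8, 20, 21):
    print(count, "->", disruptive_upgrade_method(count))
```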