3. Cluster Administrator Guide
Congratulations on your new Run:ai on DGX Cloud cluster!
This section of the guide provides the Cluster Administrator with the essential information required to set up and manage your Run:ai on DGX Cloud cluster and support cluster users. The focus here is on the setup process specific to the Run:ai on DGX Cloud offering. For comprehensive administrative documentation, refer to the Run:ai Documentation.
3.1. Cluster Handover
In preparation for your onboarding, you should have engaged with your NVIDIA Technical Account Manager (TAM). Your TAM will provide the required documentation to accomplish the following prior to an onboarding call:
Designate an administrator for NGC and Run:ai
Federate your IdP with the NVIDIA identity federation
Specify an allowlist classless inter-domain routing (CIDR) range
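Before submitting your allowlist, it can help to confirm that the addresses your administrators and users connect from actually fall inside the CIDR ranges you plan to specify. The following is a minimal sketch using Python's standard `ipaddress` module. The ranges and addresses are hypothetical placeholders taken from the documentation-reserved blocks, not values supplied by NVIDIA.

```python
import ipaddress


def is_allowed(client_ip: str, allowlist: list[str]) -> bool:
    """Return True if client_ip falls inside any CIDR range in allowlist."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowlist)


# Hypothetical allowlist; replace with the ranges you give to your TAM.
ALLOWLIST = ["203.0.113.0/24", "198.51.100.16/28"]

# Hypothetical client addresses to check against the allowlist.
for ip in ("203.0.113.42", "198.51.100.40"):
    status = "allowed" if is_allowed(ip, ALLOWLIST) else "blocked"
    print(f"{ip}: {status}")
```

An address reported as blocked here would need a wider or additional range in the allowlist you specify.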
During your onboarding call, the following will be provided:
A URL endpoint to access the Run:ai cloud control plane
A shared support channel (Slack or Teams) between your team and NVIDIA
A kubeconfig file, required to set up CLI access to the cluster
A URL for the cluster (within the kubeconfig file)
A URL for the OIDC Issuer for the cluster (within the kubeconfig file)
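To locate the two URLs inside the kubeconfig file, a short script can read them from the parsed file. The sketch below assumes the kubeconfig has already been loaded into a Python dictionary (for example with a YAML parser of your choice). The `clusters[].cluster.server` field is standard kubeconfig. That the OIDC issuer is stored under `auth-provider.config.idp-issuer-url` is an assumption about the file you receive, and the sample values are hypothetical.

```python
from typing import Any


def cluster_urls(kubeconfig: dict[str, Any]) -> list[str]:
    """Collect the API server URL of every cluster entry."""
    return [entry["cluster"]["server"] for entry in kubeconfig.get("clusters", [])]


def oidc_issuers(kubeconfig: dict[str, Any]) -> list[str]:
    """Collect OIDC issuer URLs, assuming the auth-provider layout."""
    issuers = []
    for entry in kubeconfig.get("users", []):
        config = entry.get("user", {}).get("auth-provider", {}).get("config", {})
        issuer = config.get("idp-issuer-url")
        if issuer:
            issuers.append(issuer)
    return issuers


# Hypothetical parsed kubeconfig; real files contain additional fields.
SAMPLE = {
    "clusters": [
        {"name": "example", "cluster": {"server": "https://cluster.example.com"}}
    ],
    "users": [
        {
            "name": "example-user",
            "user": {
                "auth-provider": {
                    "name": "oidc",
                    "config": {"idp-issuer-url": "https://issuer.example.com"},
                }
            },
        }
    ],
}

print(cluster_urls(SAMPLE))
print(oidc_issuers(SAMPLE))
```

If your kubeconfig stores the issuer elsewhere, only `oidc_issuers` needs adjusting.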
3.2. Accessing your Cluster
There are two ways that administrators can interact with a Run:ai on DGX Cloud cluster:
Using the Run:ai UI: NVIDIA will provide you with the URL and an initial user login to the cluster. This user will be given the role of Application Administrator. That individual will then be able to create departments and projects, invite other users, and assign them access rules.
Using the Run:ai CLI: To use the Run:ai CLI, you must first log in using the UI. After logging into the UI, users will be able to set up the CLI. To learn how to set up the Run:ai CLI, see Accessing the Run:ai CLI.
3.2.1. Accessing the Run:ai UI
To access the Run:ai console in a browser:
Go to the cluster URL given to you by your TAM in the onboarding instructions.
In the login dialog, click CONTINUE WITH SSO. Your SSO login page will open as a pop-up.
Enter your details. Once logged in, you will be taken to the Run:ai cluster overview page.
3.3. Managing Users
Run:ai uses role-based access control (RBAC) to determine users’ access and ability to interact with cluster components. Each user can be assigned more than one role. See Cluster Users for more details on cluster user types.
3.3.1. Creating Users
User management for Run:ai on DGX Cloud is integrated with NVIDIA GPU Cloud (NGC) via NVIDIA’s identity federation service. As part of the onboarding process, your NVIDIA TAM should have worked with you to register your identity provider (IdP) with the service.
Customer administrators cannot add local users or change the SSO configuration within Run:ai; they can only assign roles.
After a role has been assigned, the user can log in to Run:ai via SSO using the email address associated with the assigned role.
3.3.1.1. Assigning Roles in Run:ai
Note
Only users with the role of Application Administrator, Editor, Department Administrator, and Research Manager can assign roles within the cluster.
To assign a role:
From the Run:ai overview page, select the Tools & Settings menu in the top right corner.
From the menu, select Access rules & Roles. You will be taken to the Access rules & Roles overview page.
Select + NEW ACCESS RULE to assign a new role to a user. The New Access rule pop-up will appear.
Under Subject, select User.
Enter the user email address associated with their SSO user account.
Select the role and the scope for the user to be invited, keeping in mind the information in the User Scopes section of the Overview Guide.
Note
You must notify the user of their role and scope, as this determines what they can and cannot do within the cluster.
Click SAVE RULE to save the rule.
Now the user can log in to the Run:ai environment using the SSO option on the login screen, at the URL provided by their customer administrator or their NVIDIA TAM.
3.3.1.2. Access Rule Creation in Run:ai
Important
Any user with the ‘CREATE’ permission on Access rules in the Run:ai Cluster can create additional permissions (Access rules) in any SCOPE they are a part of, at a ‘level’ equal to or lesser than their role in that SCOPE.
The following roles have the ‘Create’ Access Rule permissions:
Application Administrator
Department Administrator
Editor
Research Manager
Because the ability to create access rules is not restricted to a specific scope, it is important to manage and monitor the cluster users.
3.3.1.2.1. Example: Assigning Access Roles
This example shows a consequence of the Create Access Rule permission applying at the cluster level, even when it is granted through a role in a specific department.
Alice has the role of Editor in Department 1 and the role of L1 Researcher in Department 2. Because Alice has the ‘CREATE’ access rule permission from her role in Department 1, she can add users to Project 2-A, a project in Department 2, with the role of L1 Researcher or lower (including L2 Researcher and Viewer). These new users can run jobs and use resources in the cluster. They can also access any credentials scoped to Project 2-A.
However, the administration of Department 2 may not be aware that Alice has an Editor role in Department 1. Therefore, do not assume that Alice has the authority to add users to Project 2-A, especially if it contains proprietary information.
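The rule behind this example can be sketched as a small model. This is an illustration of the behavior described above, not Run:ai's implementation. The `AccessRule` type, the scope tuples, and the numeric role levels are hypothetical. The guide only states that L2 Researcher and Viewer rank below L1 Researcher, so the position of Editor in the ordering is an assumption.

```python
from dataclasses import dataclass

# Illustrative ordering only; the rank of Editor is an assumption.
ROLE_LEVEL = {"Viewer": 0, "L2 Researcher": 1, "L1 Researcher": 2, "Editor": 3}

# Roles the guide lists as carrying the 'Create' access rule permission.
ROLES_WITH_CREATE = {
    "Application Administrator",
    "Department Administrator",
    "Editor",
    "Research Manager",
}


@dataclass(frozen=True)
class AccessRule:
    user: str
    role: str
    scope: tuple[str, ...]  # e.g. ("Department 2", "Project 2-A")


def covers(rule_scope: tuple[str, ...], target: tuple[str, ...]) -> bool:
    """A rule applies to its own scope and to everything nested under it."""
    return target[: len(rule_scope)] == rule_scope


def can_grant(rules: list[AccessRule], actor: str, role: str,
              target: tuple[str, ...]) -> bool:
    mine = [r for r in rules if r.user == actor]
    # The create permission counts wherever in the cluster it was granted.
    if not any(r.role in ROLES_WITH_CREATE for r in mine):
        return False
    # The grantable level is capped by the actor's role in the target scope.
    # Roles without a level in this sketch are skipped.
    levels = [ROLE_LEVEL[r.role] for r in mine
              if covers(r.scope, target) and r.role in ROLE_LEVEL]
    return bool(levels) and ROLE_LEVEL[role] <= max(levels)


RULES = [
    AccessRule("alice", "Editor", ("Department 1",)),
    AccessRule("alice", "L1 Researcher", ("Department 2",)),
    AccessRule("bob", "L1 Researcher", ("Department 2",)),
]
PROJECT_2A = ("Department 2", "Project 2-A")

print(can_grant(RULES, "alice", "L2 Researcher", PROJECT_2A))
print(can_grant(RULES, "alice", "Editor", PROJECT_2A))
print(can_grant(RULES, "bob", "Viewer", PROJECT_2A))
```

In this model Alice can grant L2 Researcher on Project 2-A but not Editor, while Bob, who holds the same Department 2 role but no role with the create permission, cannot grant anything.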
3.4. Administrator Cluster Setup
To prepare your cluster for users to run workloads, you must set up the following:
Departments
Projects
Compute resources
Refer to Run:ai Departments for a detailed introduction to departments and projects.
3.4.1. Departments
Departments allow you to associate quotas with different teams and groups of users. Users are members of a department, and all projects must be created within a department.
Note
Only users with the role of Application Administrator and Editor can create and manage departments.
3.4.1.1. Modifying the Default Department
When your Run:ai on DGX Cloud cluster is provisioned, a “default” department will be created. However, this department won’t have any resource quota allocated to it, so no workloads can be run within the department.
Note
Only users with the role of Application Administrator and Editor can modify departments.
To add quota to the default department:
In the Run:ai UI, navigate to the Departments overview page using the left navigation menu.
Click the ‘default’ department entry. The blue ‘selected’ menu should appear at the top of the page.
Click EDIT to access the Edit department page.
(Optional) Rename the department.
In the Quota management section, select how many GPU devices, CPU cores, and how much CPU memory the department should be allocated.
Click SAVE to update the department. You will be taken back to the Departments overview page, where the new quotas should appear in the table.
3.4.1.2. Creating a Department
Having multiple departments in your cluster allows you to manage groups of projects and set quotas for separate groups.
Note
Only users with the role of Application Administrator and Editor can create departments.
You can create a department using the following steps:
Navigate to the Departments overview page.
Click + NEW DEPARTMENT. You will be taken to the New department creation page.
Enter a department name.
Select a quota for the department, setting a value for each of the three options: GPU devices, CPU cores, and CPU memory.
Select whether you will Allow department to go over quota. If the department is allowed to go over quota, it can use spare GPUs in the cluster beyond its listed quota whenever they are available.
Click CREATE DEPARTMENT to save. You will be taken back to the Departments overview page.
3.4.2. Projects
Projects are used to implement resource allocation, and define clear guardrails between different research initiatives. Groups of users (or in some cases, an individual) are associated with a project and can then run workloads within that project, against a fixed project allocation.
All projects are associated with a department. It is important to take note of the department’s quota limits, as these will impact the allocation and quota of any project associated with that department.
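As a planning aid, you can compare the quotas you intend to give each project against the quota of the department that contains them. The sketch below is simple arithmetic on hypothetical project names and GPU counts. It does not call Run:ai and makes no claim about how Run:ai itself enforces or reports this relationship.

```python
def remaining_department_quota(department_gpus: int,
                               project_gpus: dict[str, int]) -> int:
    """Return the GPUs left in the department after the planned project quotas.

    A negative result means the projects together request more GPUs
    than the department has been allocated.
    """
    return department_gpus - sum(project_gpus.values())


# Hypothetical plan: a 32-GPU department with two projects.
PLANNED = {"project-a": 8, "project-b": 16}

print(remaining_department_quota(32, PLANNED))
```

The same check can be repeated for CPU cores and CPU memory.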
Multiple users can be scoped into the same project. Project-level information, including credentials, is visible to all users in that project.
3.4.2.1. Creating a Project
Note
Only users with the role of Application Administrator, Department Administrator, Editor and Research Manager can create projects.
To create a project:
Navigate to the Projects overview page.
Click + NEW PROJECT. You will be taken to the New project creation page.
Select a department under which the project will be created. The project will be visible to all users within this department.
In the Project name section, enter a name for the project.
In the Quota management section, select the total number of resources that can be used by the project.
(Optional) Use the scheduling rules drop-down menu to set up any rules for the cluster.
Click CREATE PROJECT. You will be taken to the Projects overview page, where you can see the status of your newly created project.
3.4.2.2. Editing a Project
Note
Only users with the role of Application Administrator, Department Administrator, Editor and Research Manager can edit projects.
You can update the project by clicking on the checkbox to the left of the project name, then clicking Edit on the menu bar at the top of the projects overview page. This will take you back to the project creation page, and you can update any entries as required.
3.4.2.3. Updating Access to Projects
Note
Only users with the role of Application Administrator, Department Administrator, Editor and Research Manager can update access to Projects.
If a user has access to a department and a new project is created within that department, they will automatically be granted access to the project.
To update access to a project:
From the Projects overview page, click on the checkbox to the left of the project name.
Click ACCESS RULES on the menu bar at the top of the Projects overview page. This will bring up a pop-up where you can enter additional user email addresses to grant them access to the project.
3.4.3. Compute Resources
In Run:ai, a compute resource is a resource request consisting of CPU devices, memory, and (optionally) GPUs and GPU memory. When a workspace is requested with a specific compute resource, the scheduler looks for the requested resources and, if they are available, launches the workspace with access to them.
You can view the compute resources available within your cluster by navigating to Compute Resources from the left navigation menu in the Run:ai UI.
Note
Only users with the role of Application Administrator, Compute Resource Administrator, Data Source Administrator, Department Administrator, Editor, L1 Researcher and Research Manager can create and manage compute resources.
As a cluster admin, when you first receive your cluster, you should ensure that the compute resources preloaded within it are sensible for your users and their use cases. We recommend that you:
Remove any cluster resources that are using partial GPUs. GPU splitting is not possible on your Run:ai on DGX Cloud cluster.
Set up a CPU-only compute resource.
Set up 1-GPU and 8-GPU compute resources.
These suggestions are only guidelines and you will likely require alternative compute resources for your teams. For instructions on how to set up compute resources, see the Compute Resources section in the Getting Started for Cluster Users guide.
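The recommendations above can be sketched as a simple review of a compute resource list. The `ComputeResource` type, its field names, and all names and values below are hypothetical and chosen for illustration. They are not the Run:ai data model or recommended sizes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeResource:
    name: str
    gpu_devices: float  # a fractional value models a partial-GPU request
    cpu_cores: float
    cpu_memory_gb: float


def uses_partial_gpu(resource: ComputeResource) -> bool:
    """True when the resource requests a non-whole number of GPUs."""
    return not float(resource.gpu_devices).is_integer()


# Hypothetical resources found on a newly provisioned cluster.
PRELOADED = [
    ComputeResource("half-gpu", 0.5, 2, 16),
    ComputeResource("one-gpu", 1, 8, 64),
]

# Hypothetical resources matching the recommendations in this section.
RECOMMENDED = [
    ComputeResource("cpu-only", 0, 4, 16),
    ComputeResource("one-gpu", 1, 8, 64),
    ComputeResource("eight-gpu", 8, 64, 512),
]

to_remove = [r.name for r in PRELOADED if uses_partial_gpu(r)]
existing = {r.name for r in PRELOADED}
to_create = [r.name for r in RECOMMENDED if r.name not in existing]

print("Remove:", to_remove)
print("Create:", to_create)
```

The resulting lists describe changes you would then make by hand in the Compute Resources page of the Run:ai UI.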