1. Product Overview#

DGX Cloud Create is a Kubernetes-based AI workload management platform that empowers teams to efficiently schedule and run AI jobs and optimize GPU resource allocation for their AI initiatives.

Offered as a managed service, this solution is designed for enterprises and institutions to quickly enable and execute their data science and AI projects at any scale, without having to manage the infrastructure themselves. GPU clusters are provisioned, managed, and maintained by NVIDIA.

NVIDIA DGX Cloud Create stack diagram.

DGX Cloud Create provides:

  • A Compute Cluster: Customers get access to a dedicated AI cluster provisioned with state-of-the-art NVIDIA GPU capacity, provided by NVIDIA cloud service provider (CSP) partners, along with storage and networking, as part of their DGX Cloud Create subscription.

  • A User Interface: NVIDIA Run:ai provides a UI and CLI for interacting with the cluster.

  • AI Training Capabilities: DGX Cloud Create supports distributed AI training workloads for model development, fine-tuning and batch jobs, as well as interactive workloads for experimentation and data science workflows.

  • Optimized Resource Utilization: DGX Cloud Create offers automated GPU cluster management, orchestration and job queuing for efficient resource sharing and optimized utilization.

  • User and Resource Management: The features and functionalities available to the end user are managed with role-based access control (RBAC). NVIDIA Run:ai uses the concept of projects and departments to manage access to resources across the cluster.

  • Cluster and Workload Observability: The NVIDIA Run:ai GUI provides dashboards for monitoring and managing workloads, users, and resource utilization.

  • NVIDIA AI Enterprise Subscription: All DGX Cloud Create subscriptions include access to NVIDIA AI Enterprise, which provides access to NVIDIA’s suite of GPU-optimized software.

  • Support: Customers have access to 24/7 enterprise-grade support from NVIDIA. NVIDIA will be the primary support for customers. Customers will have access to a Technical Account Manager (TAM).

Note

NVIDIA Run:ai is an NVIDIA product and can be deployed to any modern Kubernetes cluster. However, DGX Cloud Create is a managed service co-engineered with NVIDIA CSP partners AWS and GCP. As such, there are architectural and security differences for DGX Cloud Create. Most features provided by NVIDIA Run:ai also apply to DGX Cloud Create; see the standard NVIDIA Run:ai documentation suite for details.

For more information regarding differences and limitations for DGX Cloud Create, please consult the Security Restrictions and Cluster Limitations section.

1.1. Cluster Architecture#

The architecture stack consists of an NVIDIA-optimized Kubernetes cluster on a CSP connected to an NVIDIA Run:ai Software as a Service (SaaS) cloud control plane. The exact configuration and capacity of each component is customized during the onboarding process based on an enterprise’s requirements.

NVIDIA DGX Cloud Create architecture and shared management model.

This diagram shows the high-level cluster architecture and indicates the shared management of the system.

The cluster is provisioned with CSP-specific compute, storage, and networking. Each GPU compute node consists of eight NVIDIA H100 GPUs.

Within the Kubernetes cluster, there are NVIDIA namespaces and customer namespaces. Customers are responsible for operating the customer namespaces, while NVIDIA is responsible for the NVIDIA namespaces.

For shared file storage, DGX Cloud Create leverages high-performance storage available from the CSP in which a given deployment operates. Storage made available as part of the deployment can provision both persistent data sources and temporary volumes. Some data sources are not supported in DGX Cloud Create. See the Managing Data Storage on the Cluster section of the User Guide for more information.

Local solid-state storage is attached to each node used by workloads that consume GPU compute resources.

The compute cluster is connected to the NVIDIA Run:ai SaaS control plane for resource management, workload submission, and cluster monitoring. Every customer is assigned a Realm in the NVIDIA Run:ai control plane, where each Realm is considered a tenant in the multi-tenant control plane.

NVIDIA Run:ai’s GUI and CLI are the primary interfaces through which users interact with the platform.

1.2. Shared Responsibility Model#

NVIDIA manages your DGX Cloud Create cluster. This includes updating components, tuning components, sizing systems and node pools, health monitoring, and remediation.

As the customer, you will not be able to install cluster components that are cluster-scoped or require cluster-scoped privileges. This includes custom resource definitions (CRDs) and operators.

As the customer, you are responsible for user management and resource allocation, user workloads, data protection, network security, support and issue reporting, and compliance. Full details of NVIDIA and customer responsibilities are as follows.

1.2.1. NVIDIA Responsibilities#

As the provider of Software and Underlying Infrastructure, NVIDIA is responsible for:

  1. Infrastructure Activities

    • Monitoring health and availability of the CSP-hosted infrastructure: compute, storage, and networking. Metric sources include:

      • Kubernetes control plane audit logs

      • Node system-level metrics

      • Pods in system namespaces, including NVIDIA Run:ai management pods

      • Infrastructure-related metrics

        Note

        No customer personally identifiable information (PII), including dataset information, model names and information, and workload names, is captured in this monitoring data.

    • Deployment and administration of the Kubernetes cluster

    • Coordinating with the cloud service provider (CSP) for availability, maintenance, and updates

  2. Software and Support

    • Deployment and administration of The Software

    • Coordinating with the Third Party for availability, maintenance, and updates

    • Deploying updates and vulnerability and patch management of The Software

    • Resolution of issues reported by End User and coordinating with Third Party and/or the CSP for issue resolution, as required

  3. Security and Compliance

    • Security of cloud services, including NGC, NVIDIA DGX Cloud Create, and underlying infrastructure

    • Application security over NVIDIA apps

    • Security and isolation of customer workloads, applications, and data

  4. Alerting and Notifications

    • Notifying users and organizations of any planned outages affecting the ability to run training workloads and use of CSP capacity

    • Notifying users and organizations of any software updates promptly via pre-defined support channels

    • Notifying users and organizations of any planned outages affecting the ability to access NGC and The Software

1.2.2. End User Responsibilities#

As the user of The Software and underlying infrastructure, you are responsible for:

  1. User Management and Resource Allocation

    • Ensuring only authorized users are given access to DGX Cloud Create

    • Assigning appropriate roles to users on NVIDIA Run:ai

    • Ensuring appropriate quota allocations per NVIDIA Run:ai projects

    • Managing all user modifications to NVIDIA Run:ai projects and corresponding Kubernetes namespaces

  2. User Workloads

    • Responsible for all create, read, update, and delete (CRUD) operations of end-user initiated workloads

    • Responsible for monitoring end-user initiated workloads

    • Responsible for the data and storage used by end-user initiated workloads

  3. Data Protection

    • Responsible for all creation, modification, and deletion operations to user content and sensitive data in DGX Cloud Create

    • Responsible for all images stored by the End User on DGX Cloud Create

    • Responsible for all images stored by the End User on the NGC Private Registry

    • Ensuring the proper controls and procedures are in place for user content and sensitive data in DGX Cloud Create. This includes, but is not limited to, training data, container images, and models.

    • Responsible for any backup of user content and data

  4. Network Security

    • Providing NVIDIA with initial ingress rule guidance via classless inter-domain routing (CIDR) ranges for restricting access at the CSP infrastructure layer

    • Ensuring the ingress rules are configured appropriately for the duration of access to DGX Cloud Create

    • Configuring additional protection such as VPNs or proxies at the configured CIDR ranges for ingress to further secure access to the cluster

  5. Support and Issue Reporting

    • Reporting issues with relevant logs, a detailed issue description, and other required information to your TAM or via NVEX

  6. Compliance

    • Abiding by the End User License Agreements or Terms of Use for the NVIDIA Software & Services you or your organization use. This includes, but is not limited to, the TOU for NVIDIA GPU Cloud.

1.3. Cluster Ingress and Egress#

As shown in the Overview section, each DGX Cloud Create cluster can be accessed through either the NVIDIA Run:ai cloud control plane or the NVIDIA-managed Kubernetes control plane.

All ingress to the NVIDIA Run:ai cloud control plane is controlled via Single Sign-On (SSO), which is configured by following instructions provided during the onboarding process as noted in the Admin Guide. Port 443 is externally accessible.

Ingress to the NVIDIA-managed Kubernetes control plane is controlled by a customer-specified CIDR range, which is also configured during the onboarding process through your TAM. Port 443 is also externally accessible.
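The CIDR-based restriction can be illustrated with Python's standard `ipaddress` module. This is a sketch only; the ranges below are documentation placeholders, not real cluster values, and the actual enforcement happens at the CSP infrastructure layer:

```python
import ipaddress

# Hypothetical customer-provided CIDR allowlist (placeholder ranges).
ALLOWED_CIDRS = [
    ipaddress.ip_network("203.0.113.0/24"),   # e.g., corporate VPN egress range
    ipaddress.ip_network("198.51.100.0/28"),  # e.g., office network
]

def is_allowed(client_ip: str) -> bool:
    """Return True if the client IP falls inside any allowed CIDR range."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in net for net in ALLOWED_CIDRS)

print(is_allowed("203.0.113.42"))  # True: inside the /24 range
print(is_allowed("192.0.2.10"))    # False: outside both ranges
```

Narrower ranges such as these are exactly what the guidance above recommends in place of 0.0.0.0/0.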

By default, egress is restricted only for embargoed destinations.

The customer can modify the ingress and egress restrictions using the instructions provided in Configuring the Ingress/Egress CIDRs for the Cluster (CLI).

Important

NVIDIA advises against exposing the Kubernetes API to the internet. Such exposure goes against Kubernetes best practices. However, NVIDIA enables the customer to configure ingress rules by following the instructions described above. In these cases, NVIDIA recommends that the customer set up a VPN or proxy to restrict inbound IPs and use a single or limited set of IPs for access. While NVIDIA won’t block 0.0.0.0/0, it cannot recommend allowing it under any circumstances.

1.4. Cluster User Scopes and User Roles#

This section is intended as a high-level overview of user scopes and roles in your DGX Cloud Create cluster. In subsequent sections, we cover how to create and manage user roles.

1.4.1. User Scopes#

NVIDIA Run:ai uses the concept of scope to define which components of the cluster are accessible to each user, based on their assigned role or roles. Scopes can include the entire cluster, departments, and projects. In an NVIDIA Run:ai cluster, each workload runs within a project, and projects are part of departments.

When users are added to the NVIDIA Run:ai cluster, they can be allocated to any combination of departments and/or projects.

Example departments and projects with assigned users.

This image shows an example cluster with three departments: Department 1 and Department 3 have one project each, while Department 2 contains two projects. User X has access to Project 1-a and Project 2-b. User Y has access to Department 1 and Department 2. User Z has access to Department 2 and Project 3-a.
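The scope hierarchy in this example can be sketched as a small data model. This is illustrative only, not the NVIDIA Run:ai API; the key point it demonstrates is that a department-level assignment implies access to every project in that department:

```python
# Illustrative model of the example above; not the NVIDIA Run:ai API.
departments = {
    "Department 1": ["Project 1-a"],
    "Department 2": ["Project 2-a", "Project 2-b"],
    "Department 3": ["Project 3-a"],
}

# Grants may target whole departments or individual projects.
grants = {
    "User X": {"projects": ["Project 1-a", "Project 2-b"], "departments": []},
    "User Y": {"projects": [], "departments": ["Department 1", "Department 2"]},
    "User Z": {"projects": ["Project 3-a"], "departments": ["Department 2"]},
}

def accessible_projects(user: str) -> set[str]:
    """A department-level grant implies access to all of its projects."""
    grant = grants[user]
    projects = set(grant["projects"])
    for dept in grant["departments"]:
        projects.update(departments[dept])
    return projects

# User Z's Department 2 grant covers Projects 2-a and 2-b,
# in addition to the direct grant on Project 3-a.
print(sorted(accessible_projects("User Z")))
```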

1.4.1.1. NVIDIA Run:ai Departments#

Each NVIDIA Run:ai project is associated with a department, and multiple projects can be associated with the same department. Departments are assigned a resource quota, and it is recommended that the quota be greater than the sum of all its associated projects’ quotas.

1.4.1.2. NVIDIA Run:ai Projects#

Projects implement resource allocation and define clear boundaries between different research initiatives. Groups of users, or in some cases an individual, are associated with a project and can run workloads within it, utilizing the project’s resource allocation.

1.4.2. Cluster Users#

NVIDIA Run:ai uses role-based access control (RBAC) to determine users’ access and ability to interact with components of the cluster.

Note

More than one role can be assigned to each user.

The NVIDIA Run:ai RBAC documentation details the NVIDIA Run:ai user roles and the permissions for each role.
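Because more than one role can be assigned to a user, the user's effective permissions are the union of the permissions of each assigned role. A minimal illustration follows; the role names mirror those discussed later in this section, but the permission strings are invented for the sketch and the authoritative mapping lives in the NVIDIA Run:ai RBAC documentation:

```python
# Invented permission strings for illustration only; consult the
# NVIDIA Run:ai RBAC documentation for the real role permissions.
ROLE_PERMISSIONS = {
    "L1 Researcher": {"workloads.submit", "dashboards.view"},
    "ML Engineer": {"workloads.view", "inference.manage", "dashboards.view"},
}

def effective_permissions(roles: list[str]) -> set[str]:
    """Union the permission sets of every role assigned to the user."""
    perms: set[str] = set()
    for role in roles:
        perms |= ROLE_PERMISSIONS[role]
    return perms

print(sorted(effective_permissions(["L1 Researcher", "ML Engineer"])))
```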

Your DGX Cloud Create cluster supports all the roles listed in the NVIDIA Run:ai documentation, except for the Department Administrator role, which has been modified for the DGX Cloud Create cluster. See below for more details.

1.4.2.1. Customer Admin Roles and NVIDIA Admin Roles#

Both the customer and NVIDIA have administrative roles within the DGX Cloud Create cluster, with different permissions for each role. By default, the customer administrator is given the Department Administrator and Application Administrator roles. NVIDIA support staff are given the NVIDIA Cloud Operator and NVIDIA Cloud Support roles. All NVIDIA roles are for cluster management and support only, and do not have access to customer datasets.

1.4.2.1.1. Customer Admin Roles#

The two customer admin roles are the Application Administrator and the Department Administrator.

Note

The Department Administrator role exists in the standard NVIDIA Run:ai roles, but on the DGX Cloud Create cluster, the Department Administrator has different permissions from those listed in the NVIDIA Run:ai documentation.

Application Administrator

The Application Administrator can:

  • Create and edit departments and projects

  • Manage and assign access roles to users

  • Create and edit components in the cluster, including compute resources, data sources and credentials

The Application Administrator cannot:

  • Edit nodes or node pools on the cluster

Department Administrator

The Department Administrator can:

  • Create and edit projects

  • Manage and assign access roles to users

  • Create and edit components in the cluster, including compute resources, data sources and credentials

The Department Administrator cannot:

  • Create departments

1.4.2.1.2. NVIDIA Admin Roles#

NVIDIA support staff will be given the NVIDIA Cloud Operator and NVIDIA Cloud Support roles.

Note

All NVIDIA roles are for cluster management and support only and do not have access to customer datasets or credentials.

NVIDIA Cloud Operator

The NVIDIA Cloud Operator can:

  • Manage settings

  • Manage SSO

  • Manage clusters

  • Manage node pools

  • View nodes

  • View users

  • View access rules

  • View event history

The NVIDIA Cloud Operator cannot:

  • Create users

  • Assign roles

  • View customer private data

NVIDIA Cloud Support

The NVIDIA Cloud Support role can:

  • View all entities in the system

The NVIDIA Cloud Support role cannot:

  • Perform any actions in the cluster

1.4.2.2. Customer User Roles#

Here, we detail three user roles: L1-Researcher, Research Manager, and ML Engineer. For complete information on NVIDIA Run:ai roles, including the Editor and Viewer (which we do not cover here), refer to the NVIDIA Run:ai documentation or visit the Access rules & Roles page under the cluster Tools and Settings menu in the NVIDIA Run:ai UI. The Roles tab provides a full list of roles and permissions.

  • L1 Researcher

    L1 Researchers in the NVIDIA Run:ai platform can submit ML workloads, view the overview dashboard, and see cluster-level analytics. Researchers must be assigned to specific projects and can only submit workloads within these projects.

  • Research Manager

    Research managers can view the workloads running within their scope on the cluster. They can create environments, resources, templates and data sources, but cannot submit workloads.

  • ML Engineer

    ML Engineers can view departments, projects, node pools, nodes and dashboards within the cluster. They can view workloads and can also manage inference workloads.

1.5. Next Steps#

To get started, try out our Interactive Workload Examples. For more information on accessing your cluster and your primary responsibilities, refer to the Cluster Administrator Guide or the Cluster User Guide.

Detailed information about using NVIDIA Run:ai is available in the NVIDIA Run:ai documentation.