NVIDIA Requirements for AI Clouds | NVIDIA DSX Documentation

Version	Date	Description of Change
2.1	Feb 26, 2026	Initial version
2.2	Apr 10, 2026	Update to v2.2
2.3	Jun 25, 2026	Update to v2.3

Purpose and Intent

This document serves three main purposes:

Setting requirements for NVIDIA Cloud Partners (NCPs) delivering GPU capacity to NVIDIA
This is the primary requirements document from NVIDIA to any NCP providing NVIDIA GPU/AI compute and software services. These requirements cover the full stack of AI cloud infrastructure services and operations needed to run NVIDIA DGX Cloud, expanding on the NVIDIA hardware reference design.
Providing a reference set of requirements for the industry
NVIDIA is publishing this document openly so that NCPs, GPU datacenter operators, and AI practitioners can use it as a reference for the capabilities a large GPU consumer requires.
Defining NVIDIA’s service delivery expectations
NVIDIA expects services to be delivered as generally available services, not as bespoke implementations built for NVIDIA alone. NVIDIA also expects operational excellence, transparent communication, and proactive engagement from all partners.

NVIDIA will consider additional services that an NCP offers or plans to offer beyond what is described here.

Service Delivery SLAs

NCPs should be able to demonstrate their ability to meet the SLAs and operational requirements below, by category, to be considered for offtake.

Service Delivery Timelines

The NCP must demonstrate API readiness and transport establishment at least 12 weeks ahead of GPU delivery. Additionally, the NCP must provide development capacity (ancillary CPU nodes only) and high-performance storage capacity with the API integrated 8 weeks before GPU and cluster delivery. At this T-minus-8-week milestone, the Data Movement Systems Requirements must be met.
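
The two milestones above are simple offsets from the GPU delivery date. The following is a minimal sketch of that arithmetic; the function name, the dictionary keys, and the delivery date are hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta


def delivery_milestones(gpu_delivery: date) -> dict[str, date]:
    """Return the readiness deadlines implied by a GPU delivery date."""
    return {
        # API readiness and transport establishment: 12 weeks ahead.
        "api_and_transport_ready": gpu_delivery - timedelta(weeks=12),
        # Development capacity and high-performance storage: 8 weeks ahead.
        "dev_capacity_and_storage_ready": gpu_delivery - timedelta(weeks=8),
    }


# Hypothetical delivery date, used only to show the arithmetic.
for name, due in delivery_milestones(date(2026, 9, 1)).items():
    print(f"{name}: {due.isoformat()}")
```

For a delivery on 2026-09-01, this yields 2026-06-09 for the 12-week milestone and 2026-07-07 for the 8-week milestone.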

SLAs and SLOs

Managed K8s

Control Plane SLA target: Financially-backed 99.95%+ uptime for production.

Storage

Performance (QoS): Must provision the requested minimum throughput and IOPS.
Home Directory Storage:
- Availability: Over 99.99% availability for unplanned incidents, exclusive of scheduled maintenance.
- Durability: Over 99.99% for any FS less than 1 PB.
High-Speed Storage Service Requirements:
- Availability (SLO): Must meet 99.99% availability in a 30-day rolling SLO exclusive of maintenance.
High-Speed Storage Filesystem Requirements:
- End-to-end Availability: Over 99.5% uptime per PB.
- Durability: Over 99.999% durability per PB annually.
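
To make the availability targets above concrete, the sketch below converts a percentage into a downtime budget. It assumes a 30-day window for every target, which the text states only for the High-Speed Storage Service SLO; the function name is illustrative.

```python
def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - availability_pct) / 100.0


# Targets taken from the list above; the 30-day window is an assumption
# for all but the High-Speed Storage Service SLO.
for label, target in [
    ("Managed K8s control plane", 99.95),
    ("Home directory / high-speed storage service", 99.99),
    ("High-speed filesystem, per PB", 99.5),
]:
    budget = allowed_downtime_minutes(target)
    print(f"{label}: {target}% -> {budget:.2f} minutes per 30 days")
```

Under that assumption, 99.95% allows about 21.6 minutes of downtime per 30 days, 99.99% about 4.3 minutes, and 99.5% about 216 minutes.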

Operational Requirements

Dedicated technical specialist or engineer available to NVIDIA.
Slack channel monitored by a technical specialist or engineer.
24x7 support available per the partner’s standard incident severity procedures, including emergency access recovery.
Service-impacting incidents and planned and unplanned maintenance events are communicated to NVIDIA.
For planned maintenance, NVIDIA can schedule maintenance windows through APIs or console tools, avoid unexpected outages, and provide feedback.
NCP must remediate critical vulnerabilities in a timely manner while providing transparent disclosure of any issues.

Telemetry

NCP shall deliver all required telemetry, including metrics and logs, with a latency of no longer than 120 seconds.

Testing Compliance

The requirements described in this document can largely be validated using the tests in the AI Cloud Ready test suite. See the AI Cloud Ready documentation for details on how to run the tests. Tests are added regularly, so review the Requirements Test Matrix to find which tests validate which requirements.

Exemplar Cloud Workload Performance

NVIDIA Exemplar Cloud seeks to improve performance per TCO with hardware and software recipes, references, tools, and capabilities. Run the latest publicly available release from https://github.com/NVIDIA/dgxc-benchmarking. The release must complete successfully on one uniform hardware cluster type. Run all workloads for a given release and share the results in the template below.

Req ID	Feature	Min Size	Description
BM01	Benchmarking for Exemplar Cloud	Run per Scalable Unit (e.g. 512 GPU cluster)	Achieve within 5% of an NVIDIA-provided target performance number. Run this on every Scalable Unit (SU) handed off.
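
One way to read the 5% criterion in BM01 is sketched below. It assumes a higher-is-better metric (such as throughput) and treats results above the target as passing; both are assumptions, not stated in the requirement, and the function name is hypothetical.

```python
def meets_target(measured: float, target: float, tolerance: float = 0.05) -> bool:
    """Return True when the measured result is within tolerance of the target.

    Assumes a higher-is-better metric, so only a shortfall counts against
    the result; exceeding the target always passes.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    shortfall = (target - measured) / target
    return shortfall <= tolerance


# Hypothetical numbers for one Scalable Unit.
print(meets_target(measured=960.0, target=1000.0))  # 4% below target
print(meets_target(measured=900.0, target=1000.0))  # 10% below target
```

For a lower-is-better metric such as time per step, the sign of the shortfall would need to be flipped.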

Compute and Network Provisioning

This section outlines the requirements for provisioning compute and network resources. Compute instances can be provided as either bare-metal instances (through BMaaS) or virtual machines (through VMaaS) to support the NVIDIA DGX Cloud engagement. All operations must be controlled through a fully documented and secure API; gRPC or REST is preferred. All systems are expected to scale and perform at scale.

General, Compute, and Lifecycle Management

Req ID	Requirement Area	Description
CNP01	API/CLI Access	DGXC must have API or CLI access to the NCP provisioning system for: (1) Node lifecycle management (create, update, delete, list, or manage power states (reboot, on/off power cycle)); (2) Network configuration; (3) Inventory and topology discovery; (4) Security configuration (users, service accounts, groups, roles); (5) Maintenance and operations (see later section)
CNP02	Declarative Resource Interfaces	For resources whose provisioning requires multiple steps or a workflow (e.g. automated filesystem provisioning), provide an appropriate declarative mechanism. A Terraform provider is preferred.
CNP03	NVLink-Aware Allocation	For NVL72 the API must support NVLink domain-aware allocation.
CNP04	Resource States	Must support clear resource (e.g. instance, network) states where applicable. For example, provisioning, running, degraded, maintenance required, stopping, stopped, terminating, terminated.
CNP05	Tagging	Support for user-defined tags/labels and cloud-init metadata on instances.
CNP06	Console Access	Serial console access is required (read-only sufficient, interactive preferred). Serial console output shall be logged and be available for historic queries (at least 1 day retention, 1 month desired).
CNP07	If VMaaS Present: # VMs/Node	GPU nodes: no more than one VM per node. General-purpose CPU nodes: more than one VM per node, with the ability to select a shape by memory and core count.
CNP08	Stable Identifiers	All resources (e.g. nodes, switches) must have a stable and persistent ID that does not change during the lifespan, even when it goes offline for a service event. VMs must also have a stable identifier.
CNP09	Firmware	Between tenants, all firmware must be brought to a known good state. All firmware must be cryptographically signed and attested during boot.
CNP10	Remote Management	Platform management solutions (e.g., BMC) must support Redfish over TLS; IPMI must be disabled.
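
The resource states listed in CNP04 map naturally onto an enumeration. The sketch below is illustrative only: the class and helper names are hypothetical, and the grouping of provisioning, stopping, and terminating as transitional states is an assumption about how a client might poll, not part of the requirement.

```python
from enum import Enum


class ResourceState(Enum):
    """Example states taken from CNP04."""
    PROVISIONING = "provisioning"
    RUNNING = "running"
    DEGRADED = "degraded"
    MAINTENANCE_REQUIRED = "maintenance required"
    STOPPING = "stopping"
    STOPPED = "stopped"
    TERMINATING = "terminating"
    TERMINATED = "terminated"


# Assumption: these states are expected to change without further client action.
TRANSITIONAL = {
    ResourceState.PROVISIONING,
    ResourceState.STOPPING,
    ResourceState.TERMINATING,
}


def parse_state(raw: str) -> ResourceState:
    """Map a state string from an API response to a known state."""
    return ResourceState(raw.strip().lower())  # raises ValueError if unknown


def is_transitional(state: ResourceState) -> bool:
    """Return True when a client should keep polling for a settled state."""
    return state in TRANSITIONAL


print(parse_state(" Running "), is_transitional(parse_state("stopping")))
```

Rejecting unknown strings with a ValueError keeps a client from silently treating an undocumented state as healthy.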

Boot Process and Disks

Req ID	Requirement Area	Description
BOOT01	Image Deployment & Updates	API-driven workflow allowing DGXC to deploy, update, and manage vendor-provided or custom disk images via bare-metal, VM, or k8s node pool provisioning.
BOOT02	Access to Instance Metadata from Guest OS	Support for cloud-init and instance metadata discovery via link-local addresses or virtual devices.
BOOT03	Custom Disk Images	Support for tenant-created custom OS images (e.g. raw or qcow2). API calls: get, list, create, delete. Images should be accessible across all tenant projects/clusters/environments.
BOOT04	Node Local Storage	GPU and CPU nodes support access to node local storage (NVMe / SSD) for use as scratch storage or for caching services.

SDN and Virtual Networking

This section covers the virtual networking requirements. Physical transport and networking requirements are discussed later in this document.

Req ID	Requirement Area	Description
SDN01	Virtual Networking	Full API/CLI lifecycle management (Create, Read, Update, Delete, List) for software-defined private networks. Must support non-conflicting BYOIP (including 7.0.0.0/8) and stable private IP allocations. Applicable to all types of resource nodes (CPU, GPU, Storage, etc).
SDN02	Security Groups	Support for VPC-style security groups (or equivalent), including IP/CIDR-based allow and deny rules. Must define scope/application at workload, node, service (e.g. K8s API Service) and subnet/tenant levels.
SDN03	Security Operations	Full API/CLI capabilities to Create, Read, Update, and Delete security groups, including defined audit processes
SDN04	Tenant Isolation	Hard logical or physical network segmentation for out-of-band management (BMC), user traffic, and storage-specific operations.
SDN05	Floating/Movable IP	Ability to move a floating private IP between nodes, either automatically or through the API, in under 10 seconds and without requiring an instance reboot.
SDN06	Localized DNS	Support for tenant-defined localized DNS configuration to enable internal domain resolution to private endpoints (e.g. storage endpoints)
SDN07	VPC Peering	Support for cross-virtual-network connectivity with full bandwidth and no “hairpin” routing.
SDN08	Storage Mesh Connectivity	The virtual network (from SDN01) must provide unrestricted L3 routing between all storage hosts, enabling full-mesh, all-to-all communication across different subnets (without going through a gateway).
SDN09	Observability	The platform shall provide comprehensive logging for network infrastructure, including hardware faults, latency/performance fluctuations, and a detailed audit trail of all configuration changes to network filtering rules.
SDN10	DNS Private Domain	Must allow each node's DNS resolver to forward tenant-defined private domains (e.g. *.nvidia.com) to a tenant-specified DNS server.
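
SDN01 asks for non-conflicting BYOIP. One way to detect a conflict is to check a requested range for overlap with ranges already in use, as sketched below with Python's standard ipaddress module. The function name and the example ranges are hypothetical, and reading "non-conflicting" as "no CIDR overlap" is an assumption.

```python
import ipaddress


def find_conflicts(requested: str, in_use: list[str]) -> list[str]:
    """Return the in-use CIDR ranges that overlap the requested range."""
    new_net = ipaddress.ip_network(requested)
    return [
        cidr for cidr in in_use
        if new_net.overlaps(ipaddress.ip_network(cidr))
    ]


# Hypothetical ranges: one tenant already brought 7.0.0.0/8.
in_use = ["7.0.0.0/8", "10.0.0.0/8"]
print(find_conflicts("7.1.0.0/16", in_use))      # overlaps 7.0.0.0/8
print(find_conflicts("192.168.1.0/24", in_use))  # no overlap
```

An empty result means the requested range can be allocated without colliding with the listed ranges.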

Kubernetes as a Service (KaaS) Requirements

Kubernetes Conformance, Versioning, and Compliance

Req ID	Requirement Area	Description
K8S01	Certified Versions	Certified Upstream Versions: Official CNCF-certified versions only; no proprietary forks, and passes the standard Kubernetes conformance tests.
K8S02	Version Updates	Support the three most recent minor releases (in the maintenance window); new minor versions must be available within 4-6 weeks of the upstream release; automated control plane security patching.
K8S03	EOL Policy	Defined notification periods for version deprecation.
K8S04	Kubernetes Security Response	Must participate in the Kubernetes Security Response Committee (SRC) process. Must be attempting to join if not part of the security committee. Must be able to: (1) Responsibly disclose any discovered vulnerabilities to the Kubernetes SRC; (2) Receive and respond to embargo notifications from the SRC; (3) Patch disclosed vulnerabilities in the managed service during embargo, prior to public disclosure and in compliance with direction provided by the Kubernetes SRC, ensuring that the patching process does not violate embargo or SRC guidance.
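
K8S02 requires support for the three most recent minor releases. The sketch below shows one way to derive that set from a list of available versions. The function name is hypothetical, the version strings are examples only, and the input is assumed to use plain "major.minor[.patch]" strings without a "v" prefix.

```python
def supported_minors(available: list[str], count: int = 3) -> list[str]:
    """Return the newest `count` minor releases, newest first."""
    # Compare versions numerically so that 1.10 sorts after 1.9.
    minors = {tuple(int(part) for part in v.split(".")[:2]) for v in available}
    newest_first = sorted(minors, reverse=True)
    return [f"{major}.{minor}" for major, minor in newest_first[:count]]


# Example versions only; patch releases collapse into their minor release.
print(supported_minors(["1.31.4", "1.32.0", "1.33.1", "1.34.0", "1.34.1"]))
```

Comparing integer tuples rather than strings avoids the common mistake of sorting "1.9" above "1.10".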

Kubernetes Operational Excellence

Req ID	Requirement Area	Description
K8S05	Lifecycle Management - Control Plane	API/CLI/Terraform Provider for CRUD provisioning; <30 min control plane bring-up.
K8S06	Lifecycle Management - Node Pool	API/CLI/Terraform for CRUD provisioning (e.g., create node pool, update node pool, delete node pool, scale a node pool to a target count). Must be able to specify node type (specific CPU or GPU instance type), including CPU-only node pools with high-performance networking for data movement and ingest workloads. Ability to specify default node labels and node taints that are applied when a node in the pool joins the cluster. When down-scaling a node pool, ability to remove bad or specific nodes.
K8S07	API Server Metrics	Share API Server metrics in a Prometheus scrapable format to allow NVIDIA to measure API Server SLO.
K8S8	Versioning	Provider-managed control plane upgrade processes.
K8S9	Zero-Downtime Upgrades	Minor version control plane updates without app downtime or maintenance windows.
K8S10	Node Upgrades	User-initiated rolling updates respecting pod disruption budgets.
K8S11	HA Control Plane	Redundant architecture with etcd separation.
K8S12	Backup & Disaster Recovery	Supported recovery within defined RPO/RTO; must be auditable and testable.

Robust Kubernetes Security

Req ID	Requirement Area	Description
K8S14	Control Plane Isolation	Per-tenant K8s control plane nodes must be separate from worker nodes and outside of the tenant cluster/VPC.
K8S15	Access Controls	Cluster endpoint must provide network access controls.
K8S16	IAM Integration	Kubernetes Service Accounts shall integrate with the platform IAM system to enable workloads to assume platform-managed identities and roles with appropriate scopes.
K8S17	Service Accounts	Kubernetes shall support standard Service Accounts and projected tokens as the workload identity mechanism, including a cluster-specific OIDC issuer to enable workload identity federation. The cluster shall expose OIDC discovery and JWKS endpoints that are reachable by configured external identity consumers (e.g. AWS IAM, GCP workload identity)
K8S19	Encryption	At-Rest Encryption for etcd and secrets.
K8S20	Logging	Ability to view or export Kubernetes control plane logs (apiserver, kcm).

Kubernetes Component and Extension Requirements

Req ID	Requirement Area	Description
K8S21	API Extensions	Mandatory support for CRDs and Validating/Mutating Admission Controllers.
K8S22	CNI	Standard compliance; supports Network Policies; IPv4/IPv6 dual-stack desired.
K8S23	CSI	NCP provides a CSI driver installable by NVIDIA (Helm or Kustomize) for block, shared FS, and NFS storage. Support for static and dynamic provisioning, snapshots, and resizing via PVs and PVCs. CSI credentials are scoped to the tenant cluster (no cross-cluster access). APIs to query storage usage against the overall cluster quota, with per-PVC/volume usage, to manage utilization across PVCs and manage quotas using the provided credentials. Vendor-provided storage kernel modules and tools are delivered through one of: (1) installation by the CSI driver, (2) pre-installation in the NCP-provided machine image, or (3) installable packages.
K8S24	DRA	Enable Dynamic Resource Allocation (DRA) regardless of upstream feature status (Beta/GA). Some DRA features require feature gates to be enabled on the control plane, which customers may need in order to run AI workloads with new DRA features.
K8S25	Operator Support	Support standard operator-based management of hardware accelerators and associated drivers. Provider-default accelerator operators and drivers shall be replaceable or overridable to allow installation of tenant-required operator and driver versions (e.g., GPU Operator, Network Operator). Provide golden configurations for GPU Operator and Network Operator.

Kubernetes Functionality

Req ID	Requirement Area	Description
K8S26	Clusters	Support multiple clusters in the same tenancy; support multiple clusters in the same VPC.
K8S27	Kubernetes Control Plane Size Pinning	Pin Control Plane instances to handle a particular load-limit.
K8S28	Performance	Meet the standard Kubernetes performance tests, certified up to 5000 nodes (or the maximum size of the cluster, whichever is smaller); size the control plane as necessary. Managed Kubernetes control plane SLO and performance must meet or exceed the standard Kubernetes results.
K8S29	Kubernetes LoadBalancer Service Support	The platform shall support Kubernetes Service resources of type LoadBalancer, including: (1) external load balancers with publicly routable IPs; (2) internal load balancers with private IPs reachable via private network access; (3) static IP assignment.
K8S30	DNS Configuration	The platform shall support configuring Kubernetes internal DNS (e.g. CoreDNS) with conditional forwarding rules for specified DNS zones to designated enterprise or internal DNS resolvers.
K8S31	Configurable Kubernetes CIDR Ranges	Ability to configure Kubernetes service IP range, Node IP range, and Pod IP range.

Security and Identity Management

Identity and Access Management (IAM)

The platform must provide a centralized system for authentication, identity federation, authorization, and lifecycle provisioning across all platform services. It must integrate with a trusted external or platform-hosted identity provider and consume OIDC-based identity tokens for user authentication. Upstream identity sources and protocols, such as enterprise directories, may be used through federation with the identity provider.

Req ID	Requirement Area	Description
SEC01	Authentication	Users: Support standards-based user authentication via OIDC and SAML 2.0, including federation with external identity providers (e.g. NVIDIA’s enterprise IdP) for single sign-on (SSO) across platform and tenant-facing services. Validate OIDC-issued tokens including signature, issuer, audience, expiration, and required claims for identity and authorization decisions.
SEC03	Authentication	External Services: Support authentication of out-of-cluster service accounts for service-to-service access. Must support credential-based access, including long-lived credentials where required. If long-lived credentials (e.g. API keys) are issued, the platform shall support configurable expiration and rotation. Ownership attribution is required for all service accounts. The platform shall provide account information (such as detection of unused accounts).
SEC04	Authorization (RBAC)	The platform shall enforce least-privilege RBAC for all managed services and infrastructure, featuring granular API actions (e.g. CRUD), scopes (e.g. dev vs staging vs prod), and function (e.g. image builder, provisioner, auditor). Roles and permissions shall be assignable to groups, with users inheriting access through group membership (GBAC); group membership may be sourced from OIDC claims and/or SCIM-provisioned groups.
SEC05	Identity / Directory Services	The platform shall integrate with the NVIDIA LDAP (RFC2307bis) directory service such that user identities and group membership can be resolved by dependent services for authentication and authorization decisions (e.g. POSIX-based access control for storage).
SEC06	Workload/Service Identity	Support standard workload, service, and node security identities using short-lived credentials, including OIDC-based workload identity federation and Kubernetes Service Accounts where applicable.
SEC07	Admin Interfaces	All administrative interfaces, whether UI, CLI, or API (e.g. the management API), must be protected by Multi-Factor Authentication.
SEC08	Audit Logs	Audit logs must be generated and retained for all security-relevant events, including management and control plane API calls, authentication events, and authorization decisions. Audit logs shall be retained for a minimum of 30 days and accessible to authorized platform operators. Must provide a log export mechanism (such as publishing to an S3 bucket). Exported logs should include sufficient metadata to identify tenant, project/account, region, service, resource identifier, actor, event timestamp, source IP where applicable, action, and authorization result.
SEC23	Provisioning	The platform shall support SCIM 2.0 for automated user and group lifecycle management from enterprise identity providers. SCIM endpoints shall require authenticated and authorized access, support core User and Group resource operations, and synchronize group membership changes with the platform authorization engine. Synchronized groups shall be first-class RBAC objects targetable by role bindings and IAM policies; membership changes shall propagate promptly across all managed services.
SEC24	Authentication	Must support domain-based IdP routing, mapping multiple email domains to a designated identity provider (e.g., nvidia.com and nvw.nvidia.com to the NVIDIA enterprise IdP)
SEC25	Organization-Level Policies	The platform shall support organization-level security guardrail policies that cascade across all subordinate tenant resources (networks, clusters, storage, compute) and cannot be weakened or bypassed by lower-level configuration. Policy violations shall be denied at resource creation/update time and recorded in audit logs.
SEC26	SSO Enforcement	The platform shall allow authorized administrators to enforce federated SSO for a tenant, restricting local username/password and other non-federated login for regular users. Enforcement shall apply consistently across UI, CLI, API, and administrative interfaces.
SEC27	Account Management	The platform shall expose a programmatic mechanism to create and manage: (1) isolation units (e.g., projects, sub-projects); (2) IAM users; (3) service accounts; (4) logs.
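
SEC01 lists the checks a token must pass: signature, issuer, audience, expiration, and required claims. The sketch below covers only the claim checks on an already-decoded payload. Signature verification is deliberately out of scope here and would need a JOSE/JWT library. The function name, the error strings, and the example issuer and audience values are hypothetical.

```python
import time


def validate_claims(claims, expected_issuer, expected_audience,
                    required=("sub",), now=None):
    """Return a list of validation errors; an empty list means the claims pass.

    Assumes the token signature has already been verified elsewhere.
    """
    now = time.time() if now is None else now
    errors = []
    if claims.get("iss") != expected_issuer:
        errors.append("issuer mismatch")
    # The "aud" claim may be a single string or a list of strings.
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if expected_audience not in audiences:
        errors.append("audience mismatch")
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        errors.append("token expired or missing exp")
    errors.extend(f"missing claim: {name}" for name in required if name not in claims)
    return errors


# Hypothetical decoded payload and fixed clock for a repeatable result.
payload = {"iss": "https://idp.example.com", "aud": ["platform-api"],
           "exp": 2000, "sub": "user-1"}
print(validate_claims(payload, "https://idp.example.com", "platform-api", now=1000))
```

Returning every failure, rather than stopping at the first, makes the result easier to record in the audit logs described in SEC08.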

Cryptography and Key Management

Req ID	Requirement Area	Description
SEC09	Key & Certificate Lifecycle	The platform shall support secure issuance, distribution, storage, rotation, and revocation of cryptographic keys and certificates used across platform services. It shall support automated rotation of provider-managed and customer-managed keys and certificates, with configurable rotation intervals. Key and certificate operations must be auditable, and every key and certificate must have an expiration date.
SEC10	Key Usage	The platform shall support use of managed keys and certificates across platform services for encryption, authentication, and signing.

Network Isolation and Encryption

Req ID	Requirement Area	Description
SEC11	Tenancy Model	Hard physical or logical isolation for network, data, and compute. Separation of control planes and tenants is mandatory. This includes separation of storage resources. Provide hierarchical tenancy (at least organization → project).
SEC12	BMC Security	Out-of-band management (BMC) must be on a dedicated, restricted network (physically separate or VLAN/VRF-isolated). Direct access from the public internet or general corporate networks must be blocked; the BMC network must be reachable only through a hardened bastion (jumphost) server.
SEC13	Network Traffic Encryption	Encryption and mutual authentication (mTLS or equivalent) for all east-west and north-south network traffic

Edge Network Security

Req ID	Requirement Area	Description
SEC14	Private Access	No public internet access by default; all API endpoints (e.g. K8s API Server) must be restricted via firewall/private link.
SEC15	Edge Network Security Policy	All traffic must be filtered via Security Groups and/or user customizable ACLs using 5-tuple rules.
SEC16	Enforcement	NCP must specify the enforcement technology (e.g., Hardware firewalls, SDN, DPUs/SmartNICs) and its specific placement in the packet path.
SEC17	Threat Intelligence & Scale	Ability to subscribe to GeoIP threat & Embargo feeds and import them into security groups. NCP should share the max supported records/rules.
SEC18	MACsec Protection Links	Protect links between the NCP data center and the NVIDIA POP with MACsec.
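
SEC15 calls for filtering on 5-tuple rules, that is, protocol, source address, source port, destination address, and destination port. The sketch below shows how such a rule could be matched against a packet. The Rule class, first-match evaluation order, and default-deny fallback are assumptions for illustration and do not describe any particular enforcement product.

```python
import ipaddress
from dataclasses import dataclass

ANY_PORT = range(0, 65536)


@dataclass(frozen=True)
class Rule:
    action: str        # "allow" or "deny"
    protocol: str      # "tcp", "udp", or "any"
    src: str           # source CIDR
    dst: str           # destination CIDR
    src_ports: range
    dst_ports: range


def evaluate(rules, protocol, src_ip, src_port, dst_ip, dst_port, default="deny"):
    """Return the action of the first matching rule (assumed first-match order)."""
    for rule in rules:
        if (rule.protocol in ("any", protocol)
                and ipaddress.ip_address(src_ip) in ipaddress.ip_network(rule.src)
                and ipaddress.ip_address(dst_ip) in ipaddress.ip_network(rule.dst)
                and src_port in rule.src_ports
                and dst_port in rule.dst_ports):
            return rule.action
    return default  # assumed default-deny when nothing matches


# Hypothetical rule: allow HTTPS from 10.0.0.0/8 to one private endpoint.
rules = [Rule("allow", "tcp", "10.0.0.0/8", "10.1.2.3/32", ANY_PORT, range(443, 444))]
print(evaluate(rules, "tcp", "10.9.9.9", 50000, "10.1.2.3", 443))
print(evaluate(rules, "udp", "10.9.9.9", 50000, "10.1.2.3", 443))
```

With the single rule above, the TCP packet is allowed and the UDP packet falls through to the default action.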

Hardware Security and Compliance

Req ID	Requirement Area	Description
SEC19	SOC 2	SOC 2 Type 1 or better is required, covering Security, Availability, and Confidentiality across all services and DC infrastructure.
SEC20	At-Rest Data Protection	Mandatory encryption of all data at rest (e.g. local NVMe/SSD, network-attached storage) via Self-Encrypted Drives (SED).
SEC21	Data Sanitization	Data sanitization must be performed between tenants or on a hardware replacement, including cryptographic erase of all data drives between tenants; sanitization/wipe of any persistent or volatile memory including SRAM/GPU memory; resetting of TPM and BIOS.
SEC22	Root of Trust + Secure Boot	Mandatory support across all platforms for Hardware Root of Trust mechanisms (TPM 2.0). The platform must enable UEFI OS Secure Boot with TPM 2.0.

Breakfix Requirements

The NCP must provide a specific Breakfix API to support fleet reliability. Any node-level remediation must not impact other parts of the tenancy. Specifically, NVLink must be reconfigured properly to take a node out of the tenancy.

The API must enable the following actions:

Req ID	Requirement Area	Description
BFX01	Breakfix Lifecycle	Compute: Power-cycle individual nodes or reset a VM instance. GPU: Reset GPUs on an individual node (as needed - k8s). Maintenance: Return/Report an individual node and a rack to the Provider for maintenance. Cordon: Mark a node as unschedulable for new workloads (but finish existing). Replace: Request a host-replacement when health thresholds are breached
BFX02	Breakfix Events	Query for any upcoming or current maintenance events for a node or rack. Query for any retirement notices for a node or rack. Query for historical and status information for equipment repair. Event information should include: (1) ticket open date; (2) ticket update date; (3) ticket close date; (4) hardware stable identifier (e.g., node ID); (5) hardware category/type impacted (e.g., GPU, fan, interconnect); (6) maintenance/error/fault description (a short description of the issue); (7) action: categorization of action (e.g. repairs done on faulty GPUs to resolve the fault); (8) provider account ID; (9) ticket ID; (10) node handover date (date when the node was deployed in production).
BFX03	Diagnostics	Identify serial numbers of installed hardware (chassis, baseboard, network adapters, CPU, GPU, etc). Obfuscated but stable identifiers are also OK. Inspect firmware versions of compute nodes and NV switch trays.
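
A client consuming the event information described in BFX02 may want to confirm that each event carries the listed fields. The sketch below does that for an event represented as a dictionary. The field names are hypothetical, since the requirement names the information but not a schema.

```python
# Hypothetical field names, one per item of event information in BFX02.
REQUIRED_EVENT_FIELDS = (
    "ticket_id",
    "ticket_open_date",
    "ticket_update_date",
    "ticket_close_date",
    "hardware_id",
    "hardware_category",
    "description",
    "action",
    "provider_account_id",
    "node_handover_date",
)


def missing_event_fields(event: dict) -> list[str]:
    """Return the required fields absent from an event record.

    A field that is present but None (for example, the close date of a
    ticket that is still open) is not reported as missing.
    """
    return [name for name in REQUIRED_EVENT_FIELDS if name not in event]


# Hypothetical partial event, as it might appear while a ticket is open.
event = {"ticket_id": "T-1", "ticket_open_date": "2026-03-01",
         "hardware_id": "node-42", "hardware_category": "GPU"}
print(missing_event_fields(event))
```

A check like this can run in integration tests against the Breakfix API before a cluster is handed off.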

Telemetry Requirements

The telemetry requirements comprise two core components that require alignment between DGX Cloud and the NCP:

Delivery Method: How telemetry will be delivered by the NCP to DGX Cloud for ingestion.
Telemetry Scope: What telemetry the NCP will deliver to DGX Cloud.

Delivery Method

NCPs must deliver all required telemetry, including metrics and logs, in a manner that allows ingestion into DGX Cloud systems with a latency of no longer than 120 seconds. Native OpenTelemetry Protocol delivery is preferred.
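
The 120-second limit can be checked per record by comparing two timestamps. The sketch below assumes latency is measured from the moment a metric or log line is emitted at the source to the moment it is ingested by DGX Cloud; that measurement point and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone

MAX_DELIVERY_LATENCY = timedelta(seconds=120)


def within_latency_budget(emitted_at: datetime, ingested_at: datetime) -> bool:
    """Return True when a record arrived within the 120-second limit."""
    return ingested_at - emitted_at <= MAX_DELIVERY_LATENCY


# Hypothetical timestamps; both must be timezone-aware (or both naive).
emitted = datetime(2026, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
print(within_latency_budget(emitted, emitted + timedelta(seconds=90)))
print(within_latency_budget(emitted, emitted + timedelta(seconds=121)))
```

Using timezone-aware timestamps on both sides avoids false violations when the source and the ingestion system run in different time zones.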

Telemetry Scope

DGX Cloud will provide the NCP with a specification document with the required metrics and logs. Upon receipt, the NCP must provide a formal written response detailing:

Confirmation of its ability to deliver the specified metrics and logs.
Projected timelines for delivery.
Specific technical details, including metric names, label names, and label values.

Network Telemetry

The NCP must provide network telemetry across the following domains:

North-South (front-end) network, including client-facing and external interconnects.
East-West (back-end) network, including GPU-to-GPU interconnects.
Management network, including control-plane and orchestration traffic.
NVSwitch fabric, including intra-node GPU switching, applicable only for GB200 and later clusters.
Host network, including NIC-level and server connectivity.

Logs

DGX Cloud will require the NCP to provide logs from various network technologies, including, but not limited to:

Fabric Manager logs for the NVLink domain, where applicable.
Subnet Manager logs for the NVLink domain, where applicable.
VPC Flow logs for all ingress and egress traffic.
UFM event logs.
General switch logs.
Switch syslogs.
Switch kernel logs.
BMC SEL logs.
Syslogs.
Management logs.

Storage Requirements

NCPs must provide shared storage solutions, where applicable, that are manageable through standard APIs and UIs, including auditing rights for NVIDIA access.

Home Directory Storage

Quota Feature: Configurable filesystem-wide limit, default user/gid quota settings, and per-uid/gid overrides.
Accounting: Usage accounting for uid/gids must be available when the feature is enabled.

Req ID	Requirement Area	Description
DIR01	File Service UID/GID Quota Feature	Configurable filesystem-wide limit, default user/gid quota settings, and per uid/gid overrides available. Usage accounting for uid/gids when the feature is enabled.
DIR02	Must Be NFS Storage	NVIDIA requires shared storage that serves the NFSv4 protocol. Access control based on DLs requires POSIX.
DIR03	Snapshots	The file system must support the ability to provide snapshot / restore functionality.
DIR04	LDAP	File Service must support integration with an NVIDIA-managed LDAP (see SEC05)

High-Speed Storage Service Requirements

These are the requirements for provisioning and interacting with the provider’s service offering.

Req ID	Requirement Area	Description
HSS01	Provisioning APIs	Storage provisioning may be via vendor portal/API or NCP portal/API.
HSS02	Performance	Must provision needed throughput requested for minimum bandwidth and IOPS.
HSS03	Integration	K8s: CSI support. Breakfix API required to report storage issues.
HSS04	Quota Support	Configurable filesystem-wide limit, default user/gid quota settings, and per uid/gid overrides available. Usage accounting for uid/gids when the feature is enabled. Configurable directory quota settings … it must be possible to apply a quota for a given directory. Usage accounting for directory quotas when enabled.
HSS05	Upgrade, Maintenance	Provider / NCP initiates desired maintenance. NVIDIA can schedule actual maintenance and can defer maintenance up to 2 weeks. Upgrades should be non-disruptive.
HSS06	RDMA Memory Protection	Storage systems using RDMA must enforce memory protection via authorization keys for both local and remote access.
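
HSS05 lets NVIDIA defer provider-initiated maintenance by up to 2 weeks. The sketch below validates a requested start time against that limit. It assumes the deferral is counted from the provider's originally proposed start and that a window cannot be moved earlier; both are assumptions, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

MAX_DEFERRAL = timedelta(weeks=2)


def can_reschedule(proposed_start: datetime, requested_start: datetime) -> bool:
    """Return True when the requested start is within the allowed deferral."""
    return proposed_start <= requested_start <= proposed_start + MAX_DEFERRAL


# Hypothetical maintenance window proposed by the provider.
proposed = datetime(2026, 5, 4, 2, 0)
print(can_reschedule(proposed, proposed + timedelta(days=10)))
print(can_reschedule(proposed, proposed + timedelta(days=15)))
```

A scheduling API could apply the same check before accepting a deferral request.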

High-Speed Storage Filesystem Requirements

These capabilities are required for the high-speed filesystem.

Req ID	Requirement Area	Description
HSS07	Parallel High Speed Filesystem	Parallel or multi-path high-speed filesystem that supports scaling to thousands of simultaneous clients while sustaining requested performance.
HSS08	Single File System Size	It must be possible to allocate a file system of at least 1 PiB even if the initial request is smaller, and to grow it beyond 10 PiB as cluster size increases. This hard requirement may be higher for a specific site; if so, it will be communicated via the ancillary services document.
HSS09	Multiple Filesystems (Fungible Total Capacity)	Must support more than one filesystem within NVIDIA's total capacity. The minimum file system size must be 50 TiB or smaller.
HSS10	Filesystem Expansion	Live file system expansion is supported, in terms of capacity, inodes, IO performance, and metadata operations performance. Performance should scale linearly with capacity.
HSS11	Client	Ability to describe your client: In-Kernel, userspace, or bare-metal client installation requirements. Support integration with client kernels / OS used by NVIDIA, as needed. DKMS-enabled packages available for Ubuntu 20.04, 22.04, and 24.04-based operating systems. ARM64 versions compatible with GB200-ready kernels are mandatory, e.g. Linux 6.8.x. Managed Storage Service Provider will provide client configuration best practices and configuration guidelines for filesystem options and kernel module configuration to reliably achieve optimal performance on ARM and x86_64-based clients.
HSS12	Quota (User, Project & Group)	Must support soft and hard quotas - uid / gid / project(directory)-id quotas with enforcement.
HSS13	Root-Squash	NVIDIA needs to be able to enable or disable and manage root-squash at any time.
HSS14	Flock	It must be possible to mount the file system with flock.
HSS15	Ability to Audit Changes	Enable NVIDIA to have access to changelog data for filesystem auditing and detailed tracking of user operations. Tracking must cover, by uid/gid: file creation, directory creation, file renames, directory renames, file deletion, and directory deletion.
HSS16	HA	All services are required to tolerate any critical component failure in the backend and provide continued client access to all storage services in such cases.
HSS17	Multi-Node Coherency	One second or less for client attribute and dentry cache updates/invalidates
HSS18	Client Multipathing	Clients must have multipathing to all storage servers.
HSS19	LDAP (for NFS)	NFS-based high-speed filesystem services must support integration with an NVIDIA-managed LDAP server (including unix uid group membership for users with > 16 group memberships) as per SEC05.
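
HSS12 requires soft and hard quotas with enforcement. The sketch below shows one common interpretation: a write that would cross the hard limit is refused, while a write that crosses only the soft limit is permitted with a warning. Grace periods, which many filesystems attach to soft limits, are omitted. The function name and the returned labels are hypothetical.

```python
def check_quota(used_bytes: int, requested_bytes: int,
                soft_limit: int, hard_limit: int) -> str:
    """Classify a write against soft and hard quota limits.

    Returns "deny" above the hard limit, "warn" above the soft limit,
    and "allow" otherwise. The same check applies to uid, gid, or
    project (directory) quotas.
    """
    if soft_limit > hard_limit:
        raise ValueError("soft limit must not exceed hard limit")
    projected = used_bytes + requested_bytes
    if projected > hard_limit:
        return "deny"
    if projected > soft_limit:
        return "warn"
    return "allow"


TIB = 2**40
# Hypothetical limits: 8 TiB soft, 10 TiB hard.
print(check_quota(7 * TIB, 2 * TIB, soft_limit=8 * TIB, hard_limit=10 * TIB))
```

In the example, 7 TiB of existing usage plus a 2 TiB write lands between the two limits, so the write is allowed with a warning.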

Data Movement Systems Requirements

The Data Movement system copies data from an external data source, such as NVIDIA or another cloud, to the NCP data center. These requirements must be met eight weeks before the first tranche of GPU delivery.

Req ID	Requirement	Description
DMS01	Dedicated K8s Cluster	Provider-managed K8s cluster (or the ability to stand up our own) for the Data Mover stack, available ahead of GPU cluster bring-up to pre-stage data.
DMS02	Data Mover Nodes (CPU)	Dedicated CPU nodes for running the data mover, with high-performance networking (exact quantity will be communicated via the ancillary services document).
DMS03	Access to Same GPU Storage	Same filesystem as mounted on GPU nodes mounted on the Data Mover nodes (or ability to mount the same filesystem via CSI)
DMS04	Access to NVIDIA Corp Net	Dedicated link (as described in network transport) to NVIDIA corp net, preferably with VPN, but otherwise with a stable IP for allowlisting.
DMS05	Stable Egress IP	Stable egress IP for IP-allowlisted access to NVIDIA services (e.g., similar to a NAT gateway).

DGXC-Managed Storage System Deployment

For scenarios where DGXC, rather than the NCP, deploys and manages the storage-system software, the following requirements apply. These requirements enable DGXC to operate storage systems, such as high-speed parallel filesystems, capacity object storage, or block storage, using NCP-provided infrastructure while maintaining operational control. For storage systems provided by the NCP, disregard this section.

Host Provisioning and Lifecycle

Req ID	Requirement Area	Description
STG01	Operating System Support	NCP must support a workflow that allows DGXC storage operators to integrate vendor-provided or storage-specific operating system images via bare-metal or VM provisioning for storage servers. The workflow must: (a) Allow DGXC to deploy custom OS images (e.g., vendor-enhanced kernels for Lustre, Rocky Linux, Ubuntu 20.04/22.04/24.04).
STG02	Drive Sanitization Policy	Cryptographically erase data drive contents between storage system tenants with full attestation of host firmware. Must support an optional flag to skip drive sanitization during break/fix flows (e.g., power supply replacement) where tenancy does not change. Critical hardware component replacements may require sanitization without override; this includes GPU/CPU node local storage.
STG03	Stable IP Assignment	Storage nodes must support static IP addressing that remains stable during host lifecycle operations and does not reset between maintenance events.
STG04	Out-of-Band Failure Detection	NCP must provide the ability to detect system failures out-of-band, including device, network, memory, and drive failures, enabling DGXC to proactively respond to hardware issues.
STG05	Topology Observability	NCP must provide visibility into failure domains to enable DGXC to provision storage nodes with physical diversity. Storage systems must be able to provision nodes that purposefully span failure domains for resilience.
STG06	BlueField/DPU Support	For storage systems utilizing BlueField-based architectures, the host provisioning system must support lifecycle management and specific configuration requirements for BlueField “JBOF” systems that export NVMe-oF to hosts.
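
As an illustration of STG05, the sketch below picks storage nodes so that the selection spans as many failure domains as possible. The node names, the `rack-*` failure-domain labels, and the `spread_across_domains` helper are hypothetical; real topology data would come from whatever visibility interface the NCP exposes.

```python
from collections import defaultdict
from itertools import zip_longest


def spread_across_domains(nodes, count):
    """Pick `count` nodes, taking one node per failure domain per pass.

    `nodes` maps a node name to its failure-domain label (hypothetical
    labels, e.g. a rack or power-feed identifier reported by the NCP).
    """
    by_domain = defaultdict(list)
    for name, domain in sorted(nodes.items()):
        by_domain[domain].append(name)

    # Interleave the per-domain lists so early picks land in distinct domains.
    interleaved = [
        name
        for row in zip_longest(*(by_domain[d] for d in sorted(by_domain)))
        for name in row
        if name is not None
    ]
    if count > len(interleaved):
        raise ValueError("not enough nodes to satisfy the request")
    return interleaved[:count]


# Hypothetical inventory: five storage nodes in three racks.
inventory = {
    "stg-01": "rack-a", "stg-02": "rack-a",
    "stg-03": "rack-b", "stg-04": "rack-b",
    "stg-05": "rack-c",
}
chosen = spread_across_domains(inventory, 3)
print(chosen)  # ['stg-01', 'stg-03', 'stg-05']
```

With three nodes requested and three racks available, each chosen node sits in a different failure domain; larger requests wrap around and reuse domains as evenly as the inventory allows.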

Network Transport and Fabric Visibility

Backend Switch Fabric API

The purpose of this API is to expose sufficient information about the cluster network topology to enable efficient scheduling, placement, and optimization of multi-node GPU workloads. Understanding the network hierarchy between compute instances and switches, as well as intra- and inter-node NVLink domains, is essential for minimizing communication latency and maximizing throughput. This applies to north-south, east-west, and NVLink networks, but not to management networks. See the appendix for a DGXC-recommended reference implementation.

Req ID	Requirement Area	Description
NET01	Backend Switch Fabric API	For each compute node, the API must provide visibility into the backend network switches connecting the node to the core. Identification: Each switch must be identified by a unique, stable identifier. A “switch” may represent a physical switch or a logical connectivity domain. Structure: API may be gRPC or REST. Response structure may include multiple nodes (pagination expected). Topology: Switch info can be returned as an ordered array of IDs (e.g., leaf, spine, core) or separate fields for each tier.
NET02	NVLink Domain API	Requirement: For compute nodes supporting NVLink (e.g., GB200, GB300, Vera Rubin), the API shall return the unique identifier of the NVLink domain associated with each node. Implementation: Can be a separate API method or part of the Backend Switch Fabric API.
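
NET01 and NET02 leave the response shape to the NCP. As a rough sketch only, the record below models one node entry with an ordered array of switch IDs and an NVLink domain ID, and shows how a scheduler might compare two nodes. The `NodeTopology` type, its field names, and all identifiers are assumptions for illustration, not a defined schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class NodeTopology:
    node_id: str
    # Ordered from the tier closest to the node outward, e.g. leaf, spine, core.
    switch_ids: Tuple[str, ...]
    # None for nodes that are not part of an NVLink domain.
    nvlink_domain_id: Optional[str] = None


def closest_common_tier(a, b):
    """Return the index of the first tier where both nodes share a switch."""
    for tier, (sa, sb) in enumerate(zip(a.switch_ids, b.switch_ids)):
        if sa == sb:
            return tier
    return None  # no shared switch at any reported tier


def same_nvlink_domain(a, b):
    return a.nvlink_domain_id is not None and a.nvlink_domain_id == b.nvlink_domain_id


# Hypothetical page of an API response, already decoded.
page = [
    NodeTopology("node-1", ("leaf-1", "spine-1", "core-1"), "nvl-A"),
    NodeTopology("node-2", ("leaf-1", "spine-1", "core-1"), "nvl-A"),
    NodeTopology("node-3", ("leaf-2", "spine-1", "core-1"), "nvl-B"),
]

print(closest_common_tier(page[0], page[1]))  # 0: both hang off leaf-1
print(closest_common_tier(page[0], page[2]))  # 1: paths first meet at spine-1
print(same_nvlink_domain(page[0], page[2]))   # False
```

A lower common tier means fewer switch hops between two nodes, which is the kind of signal a placement engine can use when packing a multi-node job.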

Transport and Networking Requirements

Non-Conflicting IP Space Allocation for the DGXC Cluster

Purpose: Ensure DGXC GPU clusters deployed in an NCP can access the NVIDIA DGXC/CorpIT network directly through routing exchange. DGXC cluster IP address space must not conflict with existing NVIDIA private IP space.

Req ID	Requirement Area	Description
NET03	Non-Conflicting IP Space Allocation for the DGXC Cluster	Bring Your Own IP (BYOIP): NCP shall support the ability for NVIDIA to bring and allocate its own private IP address space for DGXC GPU clusters. Stable IP: NCP shall provide the ability to create static IP allocations that persist across instance restarts and re-creations, including floating IP allocations. DoD space: NCP shall support allocation and use of the 7.0.0.0/8 IPv4 address space for DGXC GPU cluster deployments. This IP space shall be considered equivalent to RFC1918 addresses. Routing Support: NCP must support advertising and routing of BYOIP prefixes within the NCP environment and across interconnects (Private Cloud Interconnect, IPSec, etc.).
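
The non-conflict rule can be checked mechanically before a prefix is allocated. The following is a minimal sketch using Python's standard `ipaddress` module; the helper names and the example prefixes are hypothetical, and the list of in-use ranges would in practice come from NVIDIA.

```python
import ipaddress

# RFC 1918 ranges plus 7.0.0.0/8, which NET03 treats as equivalent private space.
PRIVATE_EQUIVALENT = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "7.0.0.0/8")
]


def is_private_equivalent(prefix):
    """True if the IPv4 prefix lies inside RFC 1918 space or 7.0.0.0/8."""
    net = ipaddress.ip_network(prefix)
    return any(net.subnet_of(parent) for parent in PRIVATE_EQUIVALENT)


def find_conflicts(candidate, existing):
    """Return the prefixes in `existing` that overlap `candidate`."""
    cand = ipaddress.ip_network(candidate)
    return [e for e in existing if cand.overlaps(ipaddress.ip_network(e))]


# Hypothetical prefixes already in use elsewhere.
in_use = ["10.20.0.0/16", "7.1.0.0/16"]

print(is_private_equivalent("7.8.0.0/16"))     # True
print(find_conflicts("7.1.128.0/17", in_use))  # ['7.1.0.0/16']
print(find_conflicts("7.8.0.0/16", in_use))    # []
```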

Connection to NVIDIA CorpIT Network

Purpose: Provide a connection from DGXC GPU clusters within the NCP to NVIDIA CorpIT for internal command, control, and administrative access.

Req ID	Requirement Area	Description
NET04	Connection to NVIDIA CorpIT Network	Bandwidth: Low bandwidth (up to 10 Gbps). Transport: Private Cloud Interconnect + VIF + BGP (preferred for better performance/security). DGXC will establish connectivity to NCP through a mutually agreed Point of Presence (POP) using Private Cloud Interconnect, functionally equivalent to AWS Direct Connect, GCP Dedicated Interconnect, Azure ExpressRoute, and OCI FastConnect. Connectivity will be provisioned with a Virtual Interface (VIF) and routing established via BGP. The interconnect will be used to exchange private IP space (RFC1918, as well as 7.0.0.0/8) between DGXC and NCP.

Corporate network connectivity diagram

Figure: Private Cloud Interconnect + VIF + BGP for CorpIT access

Connection to DGXC Storage

Purpose: Enable high-bandwidth, end-to-end MACsec-encrypted, fail-closed access between DGXC GPU clusters within the NCP and NVIDIA DGXC on-premises object storage for large-scale data movement.

Req ID	Requirement Area	Description
NET05	Connection to DGXC Storage	Transport: Private Cloud interconnect + VIF + BGP (preferred for better performance/security). DGXC will establish connectivity to NCP through a mutually agreed Point of Presence (POP) using Private Cloud Interconnect, functionally equivalent to AWS Direct Connect, GCP Dedicated Interconnect, Azure ExpressRoute, and OCI FastConnect. Connectivity will be provisioned with a Virtual Interface (VIF) and routing established via BGP. The interconnect will be used to exchange private IP space (RFC1918, as well as 7.0.0.0/8) between DGXC and NCP.

Storage connectivity diagram

Cluster Local Internet Access

Purpose: Provide general internet access from DGXC GPU clusters within the NCP to the internet, including NVIDIA DGXC hosted services on third-party public-cloud services.

Req ID	Requirement Area	Description
NET06	Cluster Local Internet Access	Cluster Internet access: Egress NAT IPs should be a static pool dedicated solely to the NVIDIA cluster/tenancy/VPC. These persistent IP addresses must be used exclusively for DGXC traffic and shall not be shared with or carry traffic from other NCP tenants. Availability: Must support redundant upstream paths to ensure connectivity under failure.
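
One simple way to validate the static egress pool is to compare the source address observed by an external service with the pool the NCP has declared. The sketch below assumes a hypothetical pool expressed as CIDR blocks; the addresses come from the RFC 5737 documentation range and are placeholders only.

```python
import ipaddress


def in_dedicated_pool(observed_ip, pool_cidrs):
    """True if the observed egress address belongs to the declared pool."""
    ip = ipaddress.ip_address(observed_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in pool_cidrs)


# Hypothetical dedicated egress pool (RFC 5737 documentation addresses).
dedicated_pool = ["203.0.113.8/30"]

print(in_dedicated_pool("203.0.113.9", dedicated_pool))   # True
print(in_dedicated_pool("203.0.113.12", dedicated_pool))  # False
```

A check like this can run periodically from inside the cluster so that an unexpected change in egress addressing is caught before allowlisted services start rejecting traffic.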

Internet access diagram

Figure: Public internet access for DGXC-hosted services

Capacity and Fleet Management

This section defines the essential metrics required for standardized monitoring and reporting of fleet health in partner engagements, supporting operations and contractual SLAs.

Req ID	Requirement Area	Description
CAP01	Governance Metrics	The core metrics needed to track fleet health are: Delivered: Nodes/GPUs provisioned and available to NVIDIA, allocated to a specific account/project/tenant. Healthy: Nodes/GPUs functioning and meeting SLA requirements, allocated to a specific account/project/tenant. Reserved: Resources allocated to a specific account/project/tenant. Total Active/In-Use: Nodes/GPUs currently in use within a specific account/project/tenant.
CAP02	Resource Governance API Metrics	The Resource Governance API must return the following information for each node: Node ID (unique identifier for a GPU node); Health State (Healthy/Unhealthy classification); Instance ID (identifier for the virtual workload); Creation Timestamp (time the workload/node was created); Hardware Type (descriptor for the hardware model); GPU Count (number of GPUs per node); Top-Level Account/ID (identifier for the top-level organization/account); Sub-Level Project/ID (identifier for the nested project/sub-account); In Use (True/False status indicating whether the GPU node is turned on and in use); Region (region of the data center where nodes are deployed).
CAP03	Resource Discovery APIs	Capacity must not be “handed” to DGXC through a phone call, Slack, or email message. For example, when a cluster first comes online, nodes/racks are likely to be handed off weekly (or more frequently). Instead, provide the following mechanism, which DGXC can poll: Programmatic Capacity Discovery: All newly delivered capacity must be discoverable via a centralized API. This “Resource Index” must provide a stable resource identifier and the reason the resource is being provided (e.g., capacity fulfillment on a GB300 project, break-fix/RMA return to cluster).
CAP04	Logical Compartmentalization & Resource Isolation	To ensure performance consistency and security, the NCP must support strict logical and physical isolation of NVIDIA’s reserved capacity. Capacity Reservations: A mechanism to logically group and “pin” a set of resources (compute, network, storage) to accounts (or equivalent constructs) in an NVIDIA tenancy. Atomic Allocation: Support for reserving a “topology block” as a single unit, ensuring all resources in that block share identical performance characteristics and security boundaries.
CAP05	Unified Health & Lifecycle APIs	NVIDIA requires a “single source of truth” for the health of both physical hosts and logical compute primitives. Per-Host Health: Real-time API access to the health bits of physical hardware (GPU state, thermal status, memory health). Primitive-Level Status: Health aggregation at the cluster, nodegroup, or reservation level to identify broad infrastructure failures (e.g., a spine switch failure affecting a whole block).
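
To make the relationship between CAP01 and CAP02 concrete, the sketch below models the per-node fields listed in CAP02 and derives GPU-level counts for three of the governance metrics. The `NodeRecord` type, its field names, and the sample values are hypothetical. It also assumes that every record returned for a tenant counts as delivered; Reserved is not derived because it depends on reservation data that is not part of the per-node record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRecord:
    node_id: str
    health_state: str   # "Healthy" or "Unhealthy"
    instance_id: str
    created_at: str     # creation timestamp, kept as an ISO 8601 string here
    hardware_type: str
    gpu_count: int
    account_id: str     # top-level account/ID
    project_id: str     # sub-level project/ID
    in_use: bool
    region: str


def governance_summary(records):
    """Sum GPUs per metric across the records returned for one tenant."""
    totals = {"delivered": 0, "healthy": 0, "in_use": 0}
    for r in records:
        # Assumption: a node that appears in the API response is delivered.
        totals["delivered"] += r.gpu_count
        if r.health_state == "Healthy":
            totals["healthy"] += r.gpu_count
        if r.in_use:
            totals["in_use"] += r.gpu_count
    return totals


# Hypothetical response for one account/project.
records = [
    NodeRecord("n1", "Healthy", "i-1", "2026-06-01T00:00:00Z", "example-hw", 8,
               "acct-1", "proj-1", True, "region-1"),
    NodeRecord("n2", "Healthy", "i-2", "2026-06-01T00:00:00Z", "example-hw", 8,
               "acct-1", "proj-1", False, "region-1"),
    NodeRecord("n3", "Unhealthy", "i-3", "2026-06-02T00:00:00Z", "example-hw", 8,
               "acct-1", "proj-1", True, "region-1"),
]

print(governance_summary(records))
# {'delivered': 24, 'healthy': 16, 'in_use': 16}
```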

Appendix

This section contains links to reference documents and implementation guidance that provide additional details for NCPs.

Implementation Guidance

The following reference documents provide additional information on implementing some of the requirements in this guide.

Network Topology Discovery: NVIDIA/topograph. Aligning with this YAML format may be useful. Topograph currently provides in-cluster topology.
Breakfix B200: B200 DGXC Lazarus BreakFix Requirements.
Breakfix GB300: GB300 DGXC Lazarus BreakFix Requirements.
Breakfix scenarios: DGXC Breakfix Maintenance Events.
Networking: Revised GNI - NCP - Cluster Connection Requirements for DGX Cloud.
Exemplar Cloud: NVIDIA Exemplar Cloud and the DGXC benchmarking repository.
Kubernetes security guidance: Kubernetes Security Response Committee.

Other Feature Considerations (Not Minimum Requirements)

Disk Cloning: Disk-cloning capability for network-attached block devices. It should be possible to clone a disk even on a running instance.
Managed Control Plane Autoscaling: Strong preference for the control plane to automatically add capacity when load increases.
Threat Detection: Control planes, management planes, and hosts under the service provider’s control should deploy threat- and anomaly-detection solutions, for example HIDS and NIDS, that can identify active threats.
Break-Glass Administrative Access: The platform should support a limited break-glass access mechanism for designated NVIDIA administrative users when federated SSO is unavailable, misconfigured, compromised, or otherwise prevents authorized access to the tenant. Break-glass accounts should use local platform credentials independent of the external identity provider, be explicitly excluded from SSO enforcement, and be protected by strong authentication controls including MFA. Their use should be auditable, generate security-relevant logs and alerts, and support periodic review, rotation, disablement, and testing.