Capacity and Fleet Management

This section defines the metrics and API capabilities required for standardized monitoring and reporting of fleet health in partner engagements, in support of operations and contractual SLAs.

| Req ID | Test Details (Legend) | Requirement Area | Description |
| --- | --- | --- | --- |
| CAP01 | INFO | Governance metrics | **Required Governance Metrics.** The core metrics needed to track fleet health are:<br>**Delivered:** Nodes/GPUs provisioned and available to NVIDIA, allocated to a specific account/project/tenant.<br>**Healthy:** Nodes/GPUs functioning and meeting SLA requirements, allocated to a specific account/project/tenant.<br>**Reserved:** Resources allocated to a specific account/project/tenant.<br>**Total Active/In-Use:** Nodes/GPUs currently in use within a specific account/project/tenant. (See the metrics rollup sketch after this table.) |
| CAP02 | add | Resource Governance API Metrics | The Resource Governance API must return the following information for each node:<br>**Node ID:** Unique identifier for a GPU node.<br>**Health State:** Healthy/Unhealthy classification.<br>**Instance ID:** Identifier for the virtual workload.<br>**Creation Timestamp:** Time the workload/node was created.<br>**Hardware Type:** Descriptor for the hardware model.<br>**GPU Count:** Number of GPUs per node.<br>**Top-Level Account ID:** Identifier for the top-level organization/account.<br>**Sub-Level Project ID:** Identifier for the nested project/sub-account.<br>**In Use:** True/False status indicating whether the GPU node is powered on and in use.<br>**Region:** Region of the data center where nodes are deployed. (See the schema sketch after this table.) |
| CAP03 | add | Resource Discovery APIs | It is not acceptable for capacity to be “handed” to DGXC through a phone call, Slack, or email message. For example, when a cluster first comes online, nodes/racks are likely to be handed off weekly (or more frequently). Instead, provide the following mechanism, which NVIDIA can poll:<br>**Programmatic Capacity Discovery:** All newly delivered capacity must be discoverable via a centralized API. This “Resource Index” must provide a stable resource identifier and information on why the capacity is being provided (e.g., capacity fulfillment for a gb300 project, break-fix/RMA return to the cluster). (See the polling sketch after this table.) |
| CAP04 | INFO | Logical Compartmentalization & Resource Isolation | To ensure performance consistency and security, the NCP must support strict logical and physical isolation of NVIDIA’s reserved capacity.<br>**Capacity Reservations:** A mechanism to logically group and “pin” a set of resources (compute, network, storage) to accounts (or equivalent constructs) in an NVIDIA tenancy.<br>**Atomic Allocation:** Support for reserving a “topology block” as a single unit, ensuring all resources in that block share identical performance characteristics and security boundaries. (See the reservation sketch after this table.) |
| CAP05 | INFO | Unified Health & Lifecycle APIs | NVIDIA requires a “single source of truth” for the health of both physical hosts and logical compute primitives.<br>**Per-Host Health:** Real-time API access to the health bits of physical hardware (GPU state, thermal status, memory health).<br>**Primitive-Level Status:** Health aggregation at the cluster, nodegroup, or reservation level to identify broad infrastructure failures (e.g., a spine switch failure affecting a whole block). (See the aggregation sketch after this table.) |
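
To make the CAP02 field list concrete, here is a minimal sketch of the per-node record the Resource Governance API could return. The class and field names are illustrative assumptions, not a mandated schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class NodeRecord:
    """One node record from the Resource Governance API (illustrative)."""
    node_id: str          # Node ID: unique identifier for a GPU node
    health_state: str     # Health State: "Healthy" or "Unhealthy"
    instance_id: str      # Instance ID: identifier for the virtual workload
    created_at: datetime  # Creation Timestamp: when the workload/node was created
    hardware_type: str    # Hardware Type: descriptor for the hardware model
    gpu_count: int        # GPU Count: number of GPUs per node
    account_id: str       # Top-Level Account ID: top-level organization/account
    project_id: str       # Sub-Level Project ID: nested project/sub-account
    in_use: bool          # In Use: True if the node is powered on and in use
    region: str           # Region: data-center region where the node is deployed
```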
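
Given records of that shape, the CAP01 governance metrics reduce to simple rollups per account/project/tenant. A sketch, reusing the `NodeRecord` type from above:

```python
def governance_metrics(nodes: list[NodeRecord]) -> dict[str, int]:
    """Roll per-node records up into the CAP01 fleet-health metrics."""
    return {
        # Every record in this sketch is already allocated to a tenant, so
        # Delivered and Reserved coincide here; a real implementation would
        # distinguish them with an explicit reservation attribute.
        "delivered": len(nodes),
        "reserved": len(nodes),
        "healthy": sum(n.health_state == "Healthy" for n in nodes),
        "total_active": sum(n.in_use for n in nodes),
    }
```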
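
For CAP03, a minimal polling loop against a hypothetical Resource Index endpoint; the URL, response fields, and reason strings are assumptions about a contract the NCP would define.

```python
import time

import requests

# Hypothetical endpoint; the real Resource Index URL and response shape
# would be defined by the NCP's API contract.
RESOURCE_INDEX_URL = "https://ncp.example.com/v1/resource-index"

def poll_resource_index(seen: set[str]) -> None:
    """Fetch the Resource Index once and report newly delivered capacity."""
    resp = requests.get(RESOURCE_INDEX_URL, timeout=30)
    resp.raise_for_status()
    for entry in resp.json()["resources"]:
        if entry["resource_id"] not in seen:
            seen.add(entry["resource_id"])
            # "reason" explains why the capacity appears, e.g.
            # "capacity-fulfillment:gb300" or "break-fix-rma-return".
            print(f"new capacity: {entry['resource_id']} ({entry['reason']})")

if __name__ == "__main__":
    seen_ids: set[str] = set()
    while True:
        poll_resource_index(seen_ids)
        time.sleep(3600)  # poll hourly; handoffs may arrive weekly or faster
```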
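
One way the CAP04 reservation mechanism could surface is a single atomic request that pins a whole topology block to an account; the endpoint and payload shape here are purely illustrative.

```python
import requests

# Hypothetical endpoint and payload; the reservation API is the NCP's to define.
RESERVATION_URL = "https://ncp.example.com/v1/reservations"

payload = {
    "account_id": "nvidia-tenant-001",  # account (or equivalent construct)
    "topology_block": "block-a17",      # reserved as a single unit
    "resources": ["compute", "network", "storage"],
    "atomic": True,                     # all-or-nothing: partial allocation fails
}

resp = requests.post(RESERVATION_URL, json=payload, timeout=30)
resp.raise_for_status()  # a 2xx means the whole block is pinned to the account
```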
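
For CAP05, per-host health bits can be aggregated up to the primitive level so that a broad failure (e.g., a spine switch) shows up as a whole group going unhealthy at once. A sketch, again assuming the `NodeRecord` type:

```python
from collections import defaultdict

def aggregate_health(nodes: list[NodeRecord],
                     group_of: dict[str, str]) -> dict[str, float]:
    """Fraction of healthy nodes per primitive (cluster/nodegroup/reservation).

    `group_of` maps node_id to a primitive name. A sharp drop confined to
    one primitive (e.g., a whole block) points at shared infrastructure,
    such as a failed spine switch, rather than isolated host faults.
    """
    healthy: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for n in nodes:
        group = group_of[n.node_id]
        total[group] += 1
        healthy[group] += (n.health_state == "Healthy")
    return {g: healthy[g] / total[g] for g in total}
```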