This section defines the essential metrics required for standardized monitoring and reporting of fleet health in partner engagements to support operations and contractual SLAs.
| Req ID | Test Details (Legend) | Requirement Area | Description |
|---|---|---|---|
| CAP01 | INFO | Governance metrics | The core metrics needed to track fleet health are: Delivered (nodes/GPUs provisioned and available to NVIDIA, allocated to a specific account/project/tenant); Healthy (nodes/GPUs functioning and meeting SLA requirements, allocated to a specific account/project/tenant); Reserved (resources allocated to a specific account/project/tenant); Total Active/In-Use (nodes/GPUs currently in use within a specific account/project/tenant). |
| CAP02 | add | Resource Governance API Metrics | The Resource Governance API must return the following information for each node: Node ID (unique identifier for a GPU node); Health State (Healthy/Unhealthy classification); Instance ID (identifier for the virtual workload); Creation Timestamp (time the workload/node was created); Hardware Type (descriptor for the hardware model); GPU Count (number of GPUs per node); Top-Level Account/ID (identifier for the top-level organization/account); Sub-Level Project/ID (identifier for the nested project/sub-account); In Use (True/False status indicating whether the GPU node is turned on and in use); Region (region of the data center where the nodes are deployed). |
| CAP03 | add | Resource Discovery APIs | Capacity must not be “handed” to DGXC through a phone call, Slack message, or email. For example, when a cluster first comes online, nodes/racks are likely to be handed off weekly (or more frequently). Instead, please provide the following mechanism, which we can poll: Programmatic Capacity Discovery: All newly delivered capacity must be discoverable via a centralized API. This “Resource Index” must provide a stable resource identifier and some information on why the resource is being provided (e.g., capacity fulfillment on the gb300 project, break-fix/RMA return to cluster). |
| CAP04 | INFO | Logical Compartmentalization & Resource Isolation | To ensure performance consistency and security, the NCP must support strict logical and physical isolation of NVIDIA’s reserved capacity. Capacity Reservations: A mechanism to logically group and “pin” a set of resources (compute, network, storage) to accounts (or equivalent constructs) in an NVIDIA tenancy. Atomic Allocation: Support for reserving a “topology block” as a single unit, ensuring all resources in that block share identical performance characteristics and security boundaries. |
| CAP05 | INFO | Unified Health & Lifecycle APIs | NVIDIA requires a “single source of truth” for the health of both physical hosts and logical compute primitives. Per-Host Health: Real-time API access to the health bits of physical hardware (GPU state, thermal status, memory health). Primitive-Level Status: Health aggregation at the cluster, nodegroup, or reservation level to identify broad infrastructure failures (e.g., a spine switch failure affecting a whole block). |
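
As an illustration of how the per-node fields in CAP02 could feed the governance metrics in CAP01, the following is a minimal Python sketch, not a defined interface. The `NodeRecord` class, its field names, and the `summarize` function are hypothetical; the sketch also assumes that every node returned by the API counts as Delivered, and it omits Reserved because that metric cannot be derived from the CAP02 fields alone.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class NodeRecord:
    """One node as listed in CAP02. All field names are hypothetical."""
    node_id: str
    health_state: str      # "Healthy" or "Unhealthy"
    instance_id: str
    created_at: datetime
    hardware_type: str
    gpu_count: int
    account_id: str        # top-level account/ID
    project_id: str        # sub-level project/ID
    in_use: bool
    region: str


def summarize(nodes):
    """Count delivered, healthy, and in-use nodes and GPUs per (account, project).

    Assumption: every node returned by the API is treated as Delivered.
    """
    summary = defaultdict(lambda: {
        "delivered_nodes": 0, "delivered_gpus": 0,
        "healthy_nodes": 0, "healthy_gpus": 0,
        "in_use_nodes": 0, "in_use_gpus": 0,
    })
    for node in nodes:
        row = summary[(node.account_id, node.project_id)]
        row["delivered_nodes"] += 1
        row["delivered_gpus"] += node.gpu_count
        if node.health_state == "Healthy":
            row["healthy_nodes"] += 1
            row["healthy_gpus"] += node.gpu_count
        if node.in_use:
            row["in_use_nodes"] += 1
            row["in_use_gpus"] += node.gpu_count
    return dict(summary)
```

Calling `summarize` on the list of records returned by one poll yields a dictionary keyed by `(account_id, project_id)`, which can be reported per tenant or rolled up further by account.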