Capacity and Fleet Management
This section defines the essential metrics required for standardized monitoring and reporting of fleet health in partner engagements to support operations and contractual SLAs.
| Req ID | Test Details (Legend) | Requirement Area | Description |
|---|---|---|---|
| CAP01 | INFO | Governance metrics | Required governance metrics. The core metrics needed to track fleet health are: Delivered (nodes/GPUs provisioned and available to NVIDIA, allocated to a specific account/project/tenant); Healthy (nodes/GPUs functioning and meeting SLA requirements, allocated to a specific account/project/tenant); Reserved (resources allocated to a specific account/project/tenant); Total Active/In-Use (nodes/GPUs currently in use within a specific account/project/tenant). |
| CAP02 | add | Resource Governance API Metrics | The Resource Governance API must return the following information for each node: Node ID (unique identifier for a GPU node); Health State (Healthy/Unhealthy classification); Instance ID (identifier for the virtual workload); Creation Timestamp (time the workload/node was created); Hardware Type (descriptor for the hardware model); GPU Count (number of GPUs per node); Top-Level Account/ID (identifier for the top-level organization/account); Sub-Level Project/ID (identifier for the nested project/sub-account); In Use (True/False status indicating whether the GPU node is turned on and in use); Region (region of the data center where the nodes are deployed). |
| CAP03 | add | Resource Discovery APIs | Capacity must not be “handed” to DGXC through a phone call, Slack message, or email. For example, when a cluster first comes online, nodes/racks are likely to be handed off weekly (or more frequently). Instead, please provide the following mechanism (which we can poll). Programmatic Capacity Discovery: All newly delivered capacity must be discoverable via a centralized API. This “Resource Index” must provide a stable resource identifier and some information on why the resource is being provided (e.g., capacity fulfillment on gb300 project, break-fix/RMA return to cluster). |
| CAP04 | INFO | Logical Compartmentalization & Resource Isolation | To ensure performance consistency and security, the NCP must support strict logical and physical isolation of NVIDIA’s reserved capacity. Capacity Reservations: A mechanism to logically group and “pin” a set of resources (compute, network, storage) to accounts (or equivalent constructs) in an NVIDIA tenancy. Atomic Allocation: Support for reserving a “topology block” as a single unit, ensuring all resources in that block share identical performance characteristics and security boundaries. |
| CAP05 | INFO | Unified Health & Lifecycle APIs | NVIDIA requires a “single source of truth” for the health of both physical hosts and logical compute primitives. Per-Host Health: Real-time API access to the health bits of physical hardware (GPU state, thermal status, memory health). Primitive-Level Status: Health aggregation at the cluster, nodegroup, or reservation level to identify broad infrastructure failures (e.g., a spine switch failure affecting a whole block). |
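
To make the CAP02 field list and the CAP01 metrics concrete, the following is a minimal Python sketch of one possible per-node record and a roll-up per account/project. It is illustrative only: the class name `NodeRecord`, the function `summarize_fleet`, the snake_case field names, and the string values `"Healthy"`/`"Unhealthy"` are assumptions, not part of any defined API. It also assumes every record returned by the API counts as Delivered. Reserved is not computed, because none of the CAP02 fields carries reservation information.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class NodeRecord:
    """Hypothetical per-node record mirroring the CAP02 field list."""
    node_id: str
    health_state: str            # assumed values: "Healthy" or "Unhealthy"
    instance_id: Optional[str]   # None when no workload is placed on the node
    creation_timestamp: datetime
    hardware_type: str
    gpu_count: int
    top_level_account_id: str
    sub_level_project_id: str
    in_use: bool
    region: str


def summarize_fleet(records):
    """Roll up GPU counts per (account, project) for the CAP01 metrics.

    Assumption: every record returned by the API counts as Delivered.
    """
    summary = defaultdict(
        lambda: {"delivered_gpus": 0, "healthy_gpus": 0, "in_use_gpus": 0}
    )
    for record in records:
        key = (record.top_level_account_id, record.sub_level_project_id)
        bucket = summary[key]
        bucket["delivered_gpus"] += record.gpu_count
        if record.health_state == "Healthy":
            bucket["healthy_gpus"] += record.gpu_count
        if record.in_use:
            bucket["in_use_gpus"] += record.gpu_count
    return dict(summary)


if __name__ == "__main__":
    created = datetime(2025, 1, 1, tzinfo=timezone.utc)
    # Made-up sample data for illustration only.
    sample = [
        NodeRecord("node-1", "Healthy", "inst-1", created, "example-hw", 8,
                   "acct-a", "proj-1", True, "region-1"),
        NodeRecord("node-2", "Unhealthy", None, created, "example-hw", 8,
                   "acct-a", "proj-1", False, "region-1"),
    ]
    for key, metrics in summarize_fleet(sample).items():
        print(key, metrics)
```

Counting GPUs rather than nodes is a choice made for this sketch; a node-level count would increment by 1 instead of `gpu_count`.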