
# Capacity and Fleet Management

This section defines the essential metrics required for standardized monitoring and reporting of fleet health in partner engagements to support operations and contractual SLAs.

| Req ID    | Test Details [(Legend)](/dsx/ncp/nvidia-requirements-for-ai-clouds/appendix#test-legend) | Requirement Area                                  | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| :-------- | :--------------------------------------------------------------------------------------- | :------------------------------------------------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **CAP01** | INFO                                                                                     | Governance metrics                                | **Required Governance Metrics** The core metrics needed to track fleet health are: **Delivered**: Nodes/GPUs provisioned and available to NVIDIA, allocated to a specific account/project/tenant. **Healthy**: Nodes/GPUs functioning and meeting SLA requirements, allocated to a specific account/project/tenant. **Reserved**: Resources allocated to a specific account/project/tenant. **Total Active/In-Use**: Nodes/GPUs currently in use within a specific account/project/tenant. |
| **CAP02** | add                                                                                      | Resource Governance API Metrics                   | The Resource Governance API must return the following information for each node: **Node ID** (unique identifier for a GPU node); **Health State** (Healthy/Unhealthy classification); **Instance ID** (identifier for the virtual workload); **Creation Timestamp** (time the workload/node was created); **Hardware Type** (descriptor for the hardware model); **GPU Count** (number of GPUs per node); **Top-Level Account/ID** (identifier for the top-level organization/account); **Sub-Level Project/ID** (identifier for the nested project/sub-account); **In Use** (True/False status indicating whether the GPU node is turned on and in use); **Region** (region of the data center where the nodes are deployed). |
| **CAP03** | add                                                                                      | Resource Discovery APIs                           | It is not acceptable for capacity to be "handed" to DGXC through a phone call, Slack, or email message. For example, when a cluster first comes online, nodes/racks are likely being handed off weekly (or more frequently). Instead, please provide the following mechanism (which we can poll): **Programmatic Capacity Discovery**: All newly delivered capacity must be discoverable via a centralized API. This "Resource Index" must provide a stable resource identifier and some information on why the resource is being provided (e.g., capacity fulfillment on the GB300 project, break-fix/RMA return to cluster). |
| **CAP04** | INFO                                                                                     | Logical Compartmentalization & Resource Isolation | To ensure performance consistency and security, the NCP must support strict logical and physical isolation of NVIDIA's reserved capacity. **Capacity Reservations**: A mechanism to logically group and "pin" a set of resources (compute, network, storage) to accounts (or equivalent constructs) in an NVIDIA tenancy. **Atomic Allocation**: Support for reserving a "topology block" as a single unit, ensuring all resources in that block share identical performance characteristics and security boundaries. |
| **CAP05** | INFO                                                                                     | Unified Health & Lifecycle APIs                   | NVIDIA requires a "single source of truth" for the health of both physical hosts and logical compute primitives. **Per-Host Health:** Real-time API access to the health bits of physical hardware (GPU state, thermal status, memory health). **Primitive-Level Status:** Health aggregation at the cluster, nodegroup, or reservation level to identify broad infrastructure failures (e.g., a spine switch failure affecting a whole block).                                                                                                                                                                                                                                                   |
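
As an illustration of how the per-node fields in CAP02 could feed the governance metrics in CAP01, the following is a minimal Python sketch, not a definitive implementation. The `NodeRecord` class, its field names, the `summarize` helper, and all sample values are hypothetical; the requirements above do not prescribe a schema or wire format. The sketch assumes that every node returned by the API counts as Delivered, and it omits Reserved because that value cannot be derived from the CAP02 fields alone.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class NodeRecord:
    """One per-node entry mirroring the CAP02 field list (names are hypothetical)."""
    node_id: str
    health_state: str            # "Healthy" or "Unhealthy"
    instance_id: Optional[str]   # None when no workload is placed on the node
    created_at: datetime
    hardware_type: str
    gpu_count: int
    account_id: str              # top-level account/ID
    project_id: str              # sub-level project/ID
    in_use: bool
    region: str


def summarize(nodes):
    """Roll node records up into GPU counts per (account, project).

    Assumption: every record returned by the API is Delivered capacity.
    """
    totals = defaultdict(lambda: {"delivered": 0, "healthy": 0, "in_use": 0})
    for node in nodes:
        bucket = totals[(node.account_id, node.project_id)]
        bucket["delivered"] += node.gpu_count
        if node.health_state == "Healthy":
            bucket["healthy"] += node.gpu_count
        if node.in_use:
            bucket["in_use"] += node.gpu_count
    return dict(totals)


# Made-up sample data for illustration only.
created = datetime(2025, 1, 6, tzinfo=timezone.utc)
nodes = [
    NodeRecord("node-001", "Healthy", "inst-a", created, "example-hw", 8,
               "acct-1", "proj-x", True, "region-1"),
    NodeRecord("node-002", "Unhealthy", None, created, "example-hw", 8,
               "acct-1", "proj-x", False, "region-1"),
    NodeRecord("node-003", "Healthy", None, created, "example-hw", 8,
               "acct-1", "proj-y", False, "region-1"),
]

summary = summarize(nodes)
for key, counts in sorted(summary.items()):
    print(key, counts)
```

Keying the rollup on the account and project pair follows the wording of CAP01, which scopes each metric to a specific account/project/tenant. A real integration would populate the records from the partner's Resource Governance API response rather than from literals.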