Capacity and Fleet Management

This section defines the metrics and API capabilities required for standardized monitoring and reporting of fleet health in partner engagements, in support of operations and contractual SLAs.

| Req ID | Test Details (Legend) | Requirement Area | Description |
| --- | --- | --- | --- |
| CAP01 | INFO | Governance metrics | **Required Governance Metrics.** The core metrics needed to track fleet health are:<br>**Delivered:** Nodes/GPUs provisioned and available to NVIDIA, allocated to a specific account/project/tenant.<br>**Healthy:** Nodes/GPUs functioning and meeting SLA requirements, allocated to a specific account/project/tenant.<br>**Reserved:** Resources allocated to a specific account/project/tenant.<br>**Total Active/In-Use:** Nodes/GPUs currently in use within a specific account/project/tenant. (See the metrics rollup sketch after this table.) |
| CAP02 | add | Resource Governance API Metrics | The Resource Governance API must return the following information for each node:<br>**Node ID:** Unique identifier for a GPU node.<br>**Health State:** Healthy/Unhealthy classification.<br>**Instance ID:** Identifier for the virtual workload.<br>**Creation Timestamp:** Time the workload/node was created.<br>**Hardware Type:** Descriptor for the hardware model.<br>**GPU Count:** Number of GPUs per node.<br>**Top-Level Account ID:** Identifier for the top-level organization/account.<br>**Sub-Level Project ID:** Identifier for the nested project/sub-account.<br>**In Use:** True/False status indicating whether the GPU node is powered on and in use.<br>**Region:** Region of the data center where nodes are deployed. (See the schema sketch after this table.) |
| CAP03 | add | Resource Discovery APIs | It is not acceptable for capacity to be “handed” to DGXC through a phone call, Slack, or email message. For example, when a cluster first comes online, nodes/racks are likely to be handed off weekly (or more frequently). Instead, provide the following mechanism, which NVIDIA can poll:<br>**Programmatic Capacity Discovery:** All newly delivered capacity must be discoverable via a centralized API. This “Resource Index” must provide a stable resource identifier and information on why the capacity is being provided (e.g., capacity fulfillment for a gb300 project, break-fix/RMA return to the cluster). (See the polling sketch after this table.) |
| CAP04 | INFO | Logical Compartmentalization & Resource Isolation | To ensure performance consistency and security, the NCP must support strict logical and physical isolation of NVIDIA’s reserved capacity.<br>**Capacity Reservations:** A mechanism to logically group and “pin” a set of resources (compute, network, storage) to accounts (or equivalent constructs) in an NVIDIA tenancy.<br>**Atomic Allocation:** Support for reserving a “topology block” as a single unit, ensuring all resources in that block share identical performance characteristics and security boundaries. (See the reservation sketch after this table.) |
| CAP05 | INFO | Unified Health & Lifecycle APIs | NVIDIA requires a “single source of truth” for the health of both physical hosts and logical compute primitives.<br>**Per-Host Health:** Real-time API access to the health bits of physical hardware (GPU state, thermal status, memory health).<br>**Primitive-Level Status:** Health aggregation at the cluster, nodegroup, or reservation level to identify broad infrastructure failures (e.g., a spine switch failure affecting a whole block). (See the aggregation sketch after this table.) |
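
To make the CAP02 field list concrete, here is a minimal sketch of the per-node record the Resource Governance API could return. The class and field names are illustrative assumptions, not a mandated schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class NodeRecord:
    """One node record from the Resource Governance API (illustrative)."""
    node_id: str          # Node ID: unique identifier for a GPU node
    health_state: str     # Health State: "Healthy" or "Unhealthy"
    instance_id: str      # Instance ID: identifier for the virtual workload
    created_at: datetime  # Creation Timestamp: when the workload/node was created
    hardware_type: str    # Hardware Type: descriptor for the hardware model
    gpu_count: int        # GPU Count: number of GPUs per node
    account_id: str       # Top-Level Account ID: top-level organization/account
    project_id: str       # Sub-Level Project ID: nested project/sub-account
    in_use: bool          # In Use: True if the node is powered on and in use
    region: str           # Region: data-center region where the node is deployed
```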
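
Given records of that shape, the CAP01 governance metrics reduce to simple rollups per account/project/tenant. A sketch, reusing the `NodeRecord` type from above:

```python
def governance_metrics(nodes: list[NodeRecord]) -> dict[str, int]:
    """Roll per-node records up into the CAP01 fleet-health metrics."""
    return {
        # Every record in this sketch is already allocated to a tenant, so
        # Delivered and Reserved coincide here; a real implementation would
        # distinguish them with an explicit reservation attribute.
        "delivered": len(nodes),
        "reserved": len(nodes),
        "healthy": sum(n.health_state == "Healthy" for n in nodes),
        "total_active": sum(n.in_use for n in nodes),
    }
```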
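
For CAP03, a minimal polling loop against a hypothetical Resource Index endpoint; the URL, response fields, and reason strings are assumptions about a contract the NCP would define.

```python
import time

import requests

# Hypothetical endpoint; the real Resource Index URL and response shape
# would be defined by the NCP's API contract.
RESOURCE_INDEX_URL = "https://ncp.example.com/v1/resource-index"

def poll_resource_index(seen: set[str]) -> None:
    """Fetch the Resource Index once and report newly delivered capacity."""
    resp = requests.get(RESOURCE_INDEX_URL, timeout=30)
    resp.raise_for_status()
    for entry in resp.json()["resources"]:
        if entry["resource_id"] not in seen:
            seen.add(entry["resource_id"])
            # "reason" explains why the capacity appears, e.g.
            # "capacity-fulfillment:gb300" or "break-fix-rma-return".
            print(f"new capacity: {entry['resource_id']} ({entry['reason']})")

if __name__ == "__main__":
    seen_ids: set[str] = set()
    while True:
        poll_resource_index(seen_ids)
        time.sleep(3600)  # poll hourly; handoffs may arrive weekly or faster
```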
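
One way the CAP04 reservation mechanism could surface is a single atomic request that pins a whole topology block to an account; the endpoint and payload shape here are purely illustrative.

```python
import requests

# Hypothetical endpoint and payload; the reservation API is the NCP's to define.
RESERVATION_URL = "https://ncp.example.com/v1/reservations"

payload = {
    "account_id": "nvidia-tenant-001",  # account (or equivalent construct)
    "topology_block": "block-a17",      # reserved as a single unit
    "resources": ["compute", "network", "storage"],
    "atomic": True,                     # all-or-nothing: partial allocation fails
}

resp = requests.post(RESERVATION_URL, json=payload, timeout=30)
resp.raise_for_status()  # a 2xx means the whole block is pinned to the account
```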
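
For CAP05, per-host health bits can be aggregated up to the primitive level so that a broad failure (e.g., a spine switch) shows up as a whole group going unhealthy at once. A sketch, again assuming the `NodeRecord` type:

```python
from collections import defaultdict

def aggregate_health(nodes: list[NodeRecord],
                     group_of: dict[str, str]) -> dict[str, float]:
    """Fraction of healthy nodes per primitive (cluster/nodegroup/reservation).

    `group_of` maps node_id to a primitive name. A sharp drop confined to
    one primitive (e.g., a whole block) points at shared infrastructure,
    such as a failed spine switch, rather than isolated host faults.
    """
    healthy: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for n in nodes:
        group = group_of[n.node_id]
        total[group] += 1
        healthy[group] += (n.health_state == "Healthy")
    return {g: healthy[g] / total[g] for g in total}
```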