Inference power pilot: NVL72 fleet power management

Overview

This guide walks you through running an AI inference workload on GB200 NVL72 or GB300 NVL72 racks while DPS (Domain Power Service) applies automatic, telemetry-driven power management across the fleet.

You will establish a multi-rack baseline without DPS in the control path, then enable DPS so the same operating power envelope can safely cover more racks and GPUs than static MaxP-style planning would allow. Workloads stay on your existing launch path; DPS enforces power at the infrastructure layer without coupling to the workload schedulers or provisioning systems that own placement and fleet membership.

In other words, the pilot asks whether a fixed facility power budget can deliver higher aggregate inference throughput if you deploy more GPUs inside that budget than static, per-rack sizing would justify—provided a control plane continuously reconciles draw against the envelope. NVIDIA uses the name MaxLPS (Maximum Land Power Shell) for that pattern; the following section defines the term and the facility-level reasoning that makes it credible for inference.

Compared to the unmanaged baseline, a successful managed run should show these outcomes:

  • Improved aggregate token throughput or your service’s primary tokens/sec KPI from operating more inference capacity inside the same power shell.
  • Only a modest impact on latency. Per-request and tail behavior may shift slightly as per-GPU limits track telemetry, so quantify the change rather than assuming it is negligible.
  • Power compliance maintained for the full steady-state window, with aggregate draw and the agreed operating envelope aligned and no hidden excursions.

Collect evidence liberally: workload-side inference metrics, per-rack and aggregate power, DPS limit traces, BMC readings, and any facility or orchestration alarms you already rely on. Use consistent sampling and timestamps so you can tie service behavior to policy action. That record becomes the basis for the KPI-oriented comparison in Part 8.

What does MaxLPS mean?

Modern AI inference workloads rarely sustain peak GPU power draw. Because of this, a data center that statically provisions compute strictly against MaxP power leaves a large fraction of its power envelope unused. NVIDIA MaxLPS (Maximum Land Power Shell) names the data center pattern of intentionally deploying more compute than a naive power cap would allow, while a control plane keeps aggregate draw within the agreed operating envelope. A common rule of thumb is safe overprovisioning on the order of 40% more compute against that same envelope when power is actively managed.

This runbook does not treat MaxLPS as an abstract initiative. It shows how to validate the pattern on NVL72 using DPS: you hold the envelope fixed, put more GPUs in the fleet, measure inference throughput, and confirm the envelope is respected.

Scheduler-independent configuration

Here, “scheduling and provisioning” means workload schedulers (Slurm, Kubernetes, and similar job or pod placement layers) and provisioning systems (cluster installers, fleet managers, and automation that registers, scales, or re-images nodes). Together these are the layers that normally own where workloads run and which hardware is in service. They are separate from DPS’s power plane.

DPS is designed to follow those layers when you want it to: workload schedulers and provisioning systems can drive resource group lifecycle and policy through the nvidia.dcpower.v1 resource group API and the matching dpsctl resource-group commands so membership stays aligned with jobs, reservations, or capacity changes. Managing Resource Groups documents that integrated style of operation step by step.

You can still pursue the MaxLPS pattern without that coupling by defining one or more resource groups whose members you manage outside both workload schedulers and provisioning systems—fixed topology entities, dpsctl resource-group create / add / activate only—while inference continues to run through your existing scheduling and provisioning unchanged. Scheduler-independent configuration, in this runbook, is that decoupled path: the smallest setup that enforces a shared GPU power budget across the fleet while scheduling and provisioning never call DPS to add or remove nodes for you.

That configuration consists of:

  • A simplified topology that models the 4 racks and the operating power domain above them.
  • A single resource group containing every compute node in the topology, created with three capabilities enabled together:
    • DPM (Dynamic Power Management): DPS actively enforces per-node power policy.
    • PRS (Power Reservation Steering): DPS’s mechanism that redistributes headroom across GPUs based on live telemetry. See Power Reservation Steering.
    • Shared GPU: the resource group’s GPU budget is enforced as a group sum rather than per-node, which is what makes operating more racks within one envelope possible. See Resource Groups.

If any one of these three is missing, the pilot configuration described here is incomplete and the comparative results in this runbook will not be reproducible. Creating these conditions is covered in the following sections.

What you will do

  1. Validate the deployment — Authenticate, run dpsctl verify, and interpret failures so you are not debugging topology on a broken server (Part 1).
  2. Model the site and gate end-to-end BMC health — Import devices if needed, build the 4-rack topology with a single PowerDomain budget, run the deep BMC health check against every node, and activate the topology once as a temporary DPS-to-BMC end-to-end gate (Part 2).
  3. Deactivate before the baseline — After the BMC health and activation gates in Part 2 pass, deactivate the topology so the unmanaged baseline runs with DPS out of the control path (Part 3).
  4. Prepare the baseline run — Stage inference consistently, define the steady-state window and KPIs, and document the 3-rack footprint and power context (Part 4).
  5. Run the unmanaged baseline — Execute the workload without DPS in the control path and capture baseline metrics (Part 5).
  6. Enable fleet power policy — Re-activate the topology, then create and activate the resource group with DPM, PRS, and Shared GPU, aligned to the topology, and verify policy is live (Part 6).
  7. Run the managed comparison — Repeat the same inference profile across 4 racks under the same envelope with DPS enforcing the topology budget (Part 7).
  8. Compare and decide — Align metrics windows, quantify throughput uplift, and watch for envelope breaches or throttling (Part 8).

Prerequisites

Before starting this runbook, confirm each of the following:

  • Hardware: Four GB200 NVL72 or GB300 NVL72 racks identified for the pilot. All 18 compute nodes per rack must be powered on and BMC-reachable.

  • BMC Redfish compatibility assessed: Before deploying DPS, complete a Day 0 Redfish assessment for every Baseboard Management Controller (BMC) on the target GB200 NVL72 or GB300 NVL72 hardware.

    Confirm the following compatibility requirements:

    • The BMC supports all Redfish endpoints required by DPS. The Redfish API guide lists the endpoints DPS uses to collect telemetry and set GPU, CPU, module, or node power limits.
    • For GB200 NVL and GB300 NVL systems, the BMC firmware is 25.2.0 or later and the Redfish version is 17.1 or later.

    Confirm the following BMC client connection requirements:

    • For GB200 and GB300 systems, the NVIDIA BMC and DPS teams recommend no more than four BMC client connections, including two DPS client connections.
    • Use session token authentication instead of basic authentication.
    • Use keep-alive to maintain the connection.

    Collect the following inventory details for each BMC:

    • Manufacturer
    • Firmware build
    • BMC firmware version
    • Redfish version
    • BMC type: OEM-based or NVBMC-based

    Document any gaps before deployment:

    • Document unsupported or mismatched endpoints for any BMC that does not expose every required endpoint for the target hardware, and share that inventory with NVIDIA.
    • Use third-party tools such as redfish-crawler to collect BMC API responses. You can provide the generated output to the NVIDIA DPS team for support assessment.

  • DPS server deployed to a Kubernetes cluster following the Deployment Guide. BMC credential secrets for every node are already created in the Kubernetes cluster. Use the latest available DPS release and matching dpsctl client from DPS on NGC before starting the pilot.

  • PRS (Power Reservation Steering) in Helm values.yaml — For results comparable to this runbook, the deployment chart must enable PRS with the values below. Confirm (or set) these keys in the values.yaml you pass to Helm before or as part of install/upgrade:

    dps:
      prs:
        enabled: true

  • dpsctl installed and authenticated per Install dpsctl. dpsctl must be able to reach the DPS server.

  • Inference workload and dataset staged on every node in all 4 racks, launchable from your job scheduler or equivalent orchestration.

  • PowerDomain budget for the pilot is known and documented — this runbook uses the same 3-rack operating envelope as the baseline.

Note: This runbook assumes all of the above are already complete. If DPS has not yet been deployed, stop here and complete the Deployment Guide first.

Note: If you do not have the physical hardware to run this pilot but still want to try DPS, you can deploy the DPS SDK Simulator to a VM to explore DPS in a simulated environment.

Part 1: Verify DPS Deployment

Before configuring DPS, confirm the deployment is healthy end-to-end. Treat this as a gate: if any component is unhealthy, stop and fix it before moving on to Part 2 — a misconfigured deployment will surface as confusing failures later in the runbook.

Note: If you haven’t already, export a few dpsctl environment variables to keep the commands that follow concise:

export DPSCTL_HOST="api.dps.your-domain.com"
export DPSCTL_PORT="443"
export DPSCTL_INSECURE_TLS_SKIP_VERIFY=true # Needed if TLS was not configured

Otherwise, you must add --host, --port, and --insecure-tls-skip-verify to every dpsctl command.

Step 1: Authenticate with dpsctl login

Before any other dpsctl command will succeed, authenticate against the DPS server. The token is cached under ~/.dpsctl/credentials.yaml and reused by subsequent commands until it expires.

dpsctl login -u <username>

dpsctl login prompts for the password interactively, or reads DPSCTL_USERNAME and DPSCTL_PASSWORD from the environment if set. See Install dpsctl for the one-time install and initial configuration.

Step 2: Confirm server and client versions

Confirm authentication succeeded and capture the server version:

dpsctl server-version

A successful call returns the server’s version payload. Confirm the server version matches the latest DPS release available on NGC.

Confirm the local dpsctl client matches the server version:

dpsctl --version

Confirm this reports the same version as the server. If dpsctl server-version returned an AUTHENTICATION_FAILED or missing-token error instead of a version payload, the dpsctl login in Step 1 did not complete; rerun it before continuing.

Step 3: Run the verify gate

Proceed to Part 2 only when status.ok is true and no component reports healthy: false. A skipped: true component is acceptable — it means that component is not part of this deployment.

Run dpsctl verify with no component flags. It checks the DPS server, database, authentication service, UI ingress, and BCM connectivity in a single call:

dpsctl verify

Each component (dps_server, database, auth, ui, bcm) returns healthy, message, and skipped. The details field is informational and is mainly useful for troubleshooting.

Troubleshooting: If deployment verification fails, see Deployment Verification.

Part 2: DPS Configuration

To manage power effectively, DPS needs to know the structure of the data center in which it is operating. We call this the data center topology. DPS also needs to know of any custom devices it will encounter in that topology.

Step 1: Import Device Specifications (if needed)

DPS ships with a set of supported device specifications out of the box. The supported devices include, at time of writing:

  • Compute: DGX_GB200, DGX_GB300, DGX_B200, DGX_B300, DGX_H100
  • GPUs: GB200, GB300, B200, B300, H100
  • CPUs: Grace, Grace_GB300
  • Power infrastructure: FloorPDU95 (Generic Floor PDU with 95% efficiency factor), RackPDU95_57500W (Generic Rack PDU with 95% efficiency factor and max load of 57,500 W), PowerSupply95_3300W (Generic Power Supply with 95% efficiency factor and max load of 3,300 W)

GB300 GPU minimum power (optional). On GB300 NVL72 pilots, you can refine the GB300 GPU entry in your device-model YAML (devices.yaml) before dpsctl device upsert. Set each GPU’s minLoadWatts (see Device Specifications) to roughly 100 W below the average GPU power you measured during the baseline inference run—using the same steady-state window you plan to compare in Part 8. That lowers the floor DPM assumes for each GPU so the control loop tracks your real workload band instead of a conservative idle default, which usually sharpens dynamic power management without changing the upper cap, provided maxLoadWatts stays above minLoadWatts and still reflects hardware limits. Treat this as a deliberate model override; document the before/after values and re-import per Managing Devices.
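
The arithmetic behind that override can be sketched in a few lines of shell. This is only an illustration: the helper name suggest_min_load and the sample wattages are hypothetical and not part of DPS. Only the 100 W offset and the requirement that maxLoadWatts stays above minLoadWatts come from this runbook.

```shell
# Hypothetical helper (not part of dpsctl): suggest a minLoadWatts value
# from the average GPU power measured in the baseline steady-state window.
# Usage: suggest_min_load AVG_GPU_WATTS MAX_LOAD_WATTS   (whole watts)
suggest_min_load() {
  avg_w=$1
  max_w=$2
  # Rule of thumb from this runbook: roughly 100 W below the measured average.
  min_w=$(( avg_w - 100 ))
  # Reject results that are non-positive or not strictly below maxLoadWatts.
  if [ "$min_w" -le 0 ] || [ "$min_w" -ge "$max_w" ]; then
    echo "invalid: computed minLoadWatts=$min_w, maxLoadWatts=$max_w" >&2
    return 1
  fi
  echo "$min_w"
}

# Example with made-up numbers: 840 W measured average, 1200 W upper cap.
suggest_min_load 840 1200   # prints 740
```

Record the value the helper prints next to the previous minLoadWatts so the before/after pair is documented when you re-import the device model.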

If your data center uses PDUs, PSUs, or other equipment whose model does not match the list above, you must register a specification for it before it can appear in a topology. Follow the full procedure in Managing Devices, then import your additions:

dpsctl device upsert devices.yaml

Step 2: Verify device specifications

Confirm that every model you reference in your topology appears in the output:

dpsctl device list

Step 3: Define the topology and static load

The topology is how you describe your power-distribution network to DPS. For this runbook, the topology must contain at minimum:

Entity type | Count | Purpose
PowerDomain | 1 | Top-level 3-rack operating envelope applied across all 4 racks.
PowerDistribution | 4 (one per rack) | Rack PDU.
ComputerSystem | 72 (18 per rack) | GB200 or GB300 compute nodes, parented under their rack PDU.

The PowerDomain’s PowerValue is the top-level operating envelope DPS enforces across the 4-rack pilot topology. Set it to the same 3-rack envelope used by the baseline: 375 kW for GB200 NVL72 or 405 kW for GB300 NVL72.
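
As a quick sanity check on those numbers, the sketch below reproduces the envelope arithmetic. It assumes the 3-rack envelope equals three times the rack PDU rating used in the reference topologies (135 kW for GB300, 125 kW for GB200), which is consistent with the 405 kW and 375 kW figures above. The function names are illustrative only.

```shell
# Sketch: pilot envelope derived from the per-rack PDU rating (assumption:
# envelope = rack PDU rating x number of baseline racks).
# Usage: envelope_w RACK_PDU_WATTS BASELINE_RACKS
envelope_w() { echo $(( $1 * $2 )); }

# Average share of the envelope per rack when MANAGED_RACKS share it.
# Usage: per_rack_share_w ENVELOPE_WATTS MANAGED_RACKS
per_rack_share_w() { echo $(( $1 / $2 )); }

# Extra racks in the managed run relative to the baseline, in whole percent.
# Usage: extra_compute_pct BASELINE_RACKS MANAGED_RACKS
extra_compute_pct() { echo $(( ($2 - $1) * 100 / $1 )); }

gb300_envelope=$(envelope_w 135000 3)   # 405000 W, i.e. 405 kW
gb200_envelope=$(envelope_w 125000 3)   # 375000 W, i.e. 375 kW
echo "GB300: ${gb300_envelope} W envelope, $(per_rack_share_w "$gb300_envelope" 4) W average per rack across 4 racks"
echo "GB200: ${gb200_envelope} W envelope, $(per_rack_share_w "$gb200_envelope" 4) W average per rack across 4 racks"
echo "Extra racks versus the 3-rack baseline: $(extra_compute_pct 3 4)%"
```

With 4 racks sharing a 3-rack envelope, each rack averages 101,250 W (GB300) or 93,750 W (GB200), and the fleet carries about 33% more racks than the baseline.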

Each GB300 NVL72 rack contains 9 NVSwitch trays. Their draw is not managed by DPS, but it must still be reflected in the rack PDU and PowerDomain budget.

The runbook represents the per-rack NVSwitch aggregate with the StaticLoad field on each FloorPDU-Rack0X (PowerDistribution) entity. StaticLoad declares fixed power consumption from unmanaged devices and is propagated up the power-distribution network automatically—see Entities → Static Load.

Use 7,416 W per rack for a standard GB300 NVL72 9-switch aggregate (replace with measured or vendor-specified values when available). The example topology in Step 4 applies this value on every rack PDU. For a standard GB200 NVL72 rack, we recommend a 9-switch aggregate power value of 11,160 W per rack.
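
To see how much of the envelope remains for DPS-managed devices, the following sketch subtracts the per-rack static load for all 4 racks from the PowerDomain value. It assumes the propagated static load simply counts against the PowerDomain budget; how DPS accounts for it internally may differ, and the function name is hypothetical.

```shell
# Sketch: power left for DPS-managed devices after unmanaged static load.
# Assumption: every rack's StaticLoad counts against the PowerDomain value.
# Usage: managed_budget_w ENVELOPE_WATTS RACKS STATIC_LOAD_WATTS_PER_RACK
managed_budget_w() { echo $(( $1 - $2 * $3 )); }

managed_budget_w 405000 4 7416    # GB300: 405000 - 29664 = 375336
managed_budget_w 375000 4 11160   # GB200: 375000 - 44640 = 330360
```

If you replace the static-load defaults with measured values, rerun the same subtraction so the documented managed budget matches the topology you import.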

Step 4: Prepare the topology file

Download the reference topology that matches your hardware:

  • maxlps-pilot-gb300.json — 4 GB300 NVL72 racks, 72 DGX_GB300 nodes, 4 FloorPDU95 rack PDUs at 135 kW (StaticLoad 7,416 W), PDU-Root PowerDomain at the 3-rack envelope of 405 kW. Ships with the GB300-* policy presets the runbook references in later steps.
  • maxlps-pilot-gb200.json — 4 GB200 NVL72 racks, 72 DGX_GB200 nodes, 4 FloorPDU95 rack PDUs at 125 kW (StaticLoad 11,160 W), PDU-Root PowerDomain at the 3-rack envelope of 375 kW. Ships with the GB200-Per-* policy presets derived from a live GB200 environment.

Replace the BMC URL and SecretName values with those for your fleet before importing.

For additional topology examples, see Managing Topologies.

Step 5: Validate and import the topology

Validate the topology file against the DPS schema and import it. Importing only registers the topology and its inventory in the DPS database — no policy is applied to any BMC yet, so this step is safe to run on a live fleet.

dpsctl topology validate topology.json # Confirms the topology matches the DPS topology schema
dpsctl topology import topology.json

Troubleshooting: If topology validation or import fails, see Topology Validation and Import.

Step 6: Run the BMC health check for every node in the topology

Before activating, run the deep BMC prerequisite health check against every BMC in the imported pilot topology. This check is broader than a reachability ping: it validates Redfish endpoint reachability, firmware inventory, power-limit write and read-back behavior, telemetry sampling, GPU power-policy state, and in-band/out-of-band power drift across every node. Passing --topology runs the check against every node in that topology in a single call. The topology only needs to be imported — not active — for this check to run.

On large clusters the probe can take up to an hour. dpsctl verify bmc-health start starts the check asynchronously and returns a task_id. This server-side task continues running even after dpsctl exits. Use the task_id to query the status of the health check:

dpsctl verify bmc-health start --topology maxlps-pilot --force-writes --expected-edpp-pct 100 --samples-per-telemetry 2500 --telemetry-interval 500ms
# Query the status until the health check is done
dpsctl verify bmc-health status <task-id>
# Once done, generate a report
dpsctl verify bmc-health report <task-id> --wait --output json > bmc-health-maxlps-pilot.json
dpsctl verify bmc-health report <task-id> --summary-only

Step 7: Review BMC health results

This run is also where you collect BMC latency metrics for the pilot. From bmc-health-maxlps-pilot.json, capture the following latency summaries for every node:

  • endpoint_latency_ms
  • telemetry_latency_ms
  • power_get_latency_ms
  • power_set_latency_ms
  • edpp_get_latency_ms
  • wpps_get_latency_ms

For MaxLPS readiness, target p50 below 0.5 seconds, p95 below 0.7 seconds, p99 below 0.8 seconds, and max near 1 second for read and write paths. If p99 rises above 1 second, and especially if it approaches 2 seconds, document the latency measurements and assess cluster-wide BMC client connection pressure before activation.
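
A small helper can apply those targets consistently across all six latency summaries. The sketch below takes percentiles as whole milliseconds on the command line because the exact layout of the summaries inside the report JSON is not described here. The function name and the three result labels are assumptions for illustration.

```shell
# Hypothetical helper: classify one latency summary against the targets above.
# Usage: check_latency P50_MS P95_MS P99_MS   (whole milliseconds)
check_latency() {
  p50=$1
  p95=$2
  p99=$3
  if [ "$p50" -lt 500 ] && [ "$p95" -lt 700 ] && [ "$p99" -lt 800 ]; then
    echo "ok"            # all three targets met
  elif [ "$p99" -gt 1000 ]; then
    echo "investigate"   # p99 above 1 s: document and assess BMC client pressure
  else
    echo "marginal"      # a target was missed, but p99 is still at or below 1 s
  fi
}

check_latency 300 500 700    # prints ok
check_latency 300 500 1500   # prints investigate
```

Run it once per node and per latency summary, and keep the "marginal" and "investigate" results with the pilot record.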

Prepare other BMC clients before you run this check. Confirm the BMC client connection requirements from the prerequisites are still in place:

  • For GB200 and GB300 systems, the NVIDIA BMC and DPS teams recommend no more than four BMC client connections, including two DPS client connections.
  • Use session token authentication instead of basic authentication.
  • Use keep-alive to maintain the connection.

In the summary report generated in Step 6, confirm:

  • status.ok is true.
  • cluster_summary.total_nodes is 72, matching the node count in maxlps-pilot.
  • cluster_summary.failed is 0.
  • cluster_summary.nodes_unreachable is 0.
  • issues contains no SEVERITY_ERROR entries.

Known B200/B300 firmware-validation exception: on systems with B200 or B300 GPUs, DPS 0.8.x can report a FIRMWARE_VALIDATION_FAILED issue with message: firmware validation failed even when the BMC is usable. If this is the only SEVERITY_ERROR for the affected B200 or B300 node, cluster_summary.nodes_unreachable is still 0, and the rest of the health report is clean, document the exception and continue with the runbook.

If any of these do not hold, other than the known B200/B300 firmware-validation exception described here, the issues list identifies what to fix. Each entry includes the affected node, the resource within that node, an issue code, the observed value, the threshold it violated, and a human-readable message. Fix the affected nodes and re-run the health check until the report is clean before activating the topology.

Troubleshooting: For issue-specific remediation, see BMC Health Check.

Step 8: Activate the topology

Activate the topology as a temporary end-to-end gate. Activation exercises the full DPS-to-BMC set-policy path against every node so any device-model or policy-application issue that the BMC health review in Step 7 could not catch surfaces now, at setup time, instead of immediately before the 4-rack managed run in Part 7. The topology will be deactivated in Part 3 so the unmanaged baseline in Part 5 runs with DPS out of the control path, and re-activated later in this runbook.

dpsctl topology activate --topology maxlps-pilot # Must match the name given in topology.json

Troubleshooting: If topology activation fails, see Topology Activation.

Step 9: Verify topology activation

dpsctl tp list

Confirm the topology appears and leaf_node_names lists all 72 compute nodes.

Part 3: Deactivate the Topology Before the Baseline Run

We activated the topology in Part 2, rather than waiting until Part 6, so any DPS-to-BMC issue would surface at setup time rather than immediately before the 4-rack managed run. The BMC health check (dpsctl verify bmc-health) and the activation in Part 2 both passed; deactivate now so the unmanaged baseline in Part 5 runs with DPS out of the control path.

Step 1: Deactivate the topology

dpsctl topology deactivate --topology maxlps-pilot

Step 2: Verify no topology is active

dpsctl tp list --active

Confirm no topologies are listed as active. The topology definition itself remains imported and will be re-activated at the start of Part 6.

Part 4: Preparing for Testing

Workload requirements

What “baseline” means here. The baseline run is the unmanaged reference. DPS is installed and running, but neither a topology nor a resource group is active, so it is not enforcing power policy on any node. This is the state produced by dpsctl topology deactivate in Part 3; confirm it with dpsctl tp list --active (no rows) and dpsctl rg list (no resource groups) before launching the workload.

For the 3-rack baseline and the 4-rack DPS-managed run to be directly comparable, the workload must meet these requirements:

  • It is an inference workload. Training workloads, synthetic power stressors (gpu-burn, etc.), and pure benchmarks do not exhibit the bursty, non-sustained power profile that envelope-based inference pilots target and will produce misleading results.
  • It is representative of the production workload. Ideally it is the actual model, framework, and request pattern (batch size, concurrency, input length distribution) you plan to run once the pilot succeeds.
  • It runs long enough to reach a steady-state comparison window. The “steady-state window” is the contiguous slice of the run on which the baseline and the 4-rack managed run will be compared in Part 8. Identify it as follows, and apply the same procedure on every run:
    • Run the workload for at least one hour total so the run is not dominated by ramp-up.
    • Trim ramp-up and cool-down from the start and end of the run.
    • Within the trimmed interval, require that both per-rack power draw and aggregate tokens/sec have visibly flattened on your time-series plots — no monotonic climb, no large step changes, only normal short-term workload jitter.
    • The remaining window must be at least 30 minutes long and contiguous; if it is shorter or broken up, extend the run rather than relying on a noisy window.
    • Record the start and end timestamps of this window and use exactly those timestamps when capturing every metric in What to collect below. The same procedure applies to the 4-rack managed run in Part 7 and to the comparison in Part 8.
  • The 3 pilot racks are under load in isolation. No other workloads should be sharing these racks or competing for the same facility power budget during the baseline window; otherwise the “3-rack baseline” no longer describes what the 3 racks alone can do.
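
The mechanical parts of the steady-state procedure can be approximated numerically. The sketch below checks that a candidate window spans at least 30 minutes and that a metric stays within a tolerance band around its mean. The function names and the idea of a percentage tolerance are assumptions; the runbook itself only asks that the plots have visibly flattened, and this check does not replace looking at them (for example, it will not flag a slow climb that stays inside the band).

```shell
# window_long_enough START_EPOCH END_EPOCH
# Succeeds when the window covers at least 30 minutes (1800 s).
window_long_enough() {
  [ $(( $2 - $1 )) -ge 1800 ]
}

# is_flat TOLERANCE_PCT
# Reads one numeric sample per line on stdin (for example per-rack watts or
# aggregate tokens/sec inside the candidate window). Succeeds when every
# sample lies within TOLERANCE_PCT percent of the window mean.
is_flat() {
  awk -v tol="$1" '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) exit 1            # no samples: cannot be called flat
      mean = sum / NR
      for (i = 1; i <= NR; i++) {
        d = v[i] - mean
        if (d < 0) d = -d
        if (d > mean * tol / 100) exit 1
      }
    }
  '
}
```

For example, `window_long_enough "$start" "$end" && is_flat 5 < rack_power_samples.txt` would accept a window only if it is long enough and every power sample is within 5% of the mean; the file name and the 5% figure are placeholders to tune for your workload.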

What to collect

While the baseline inference test runs, capture the following for the full duration of the run. You will compare these signals against the 4-rack managed run in Part 7 and interpret them using Part 8:

  • Aggregate workload throughput — requests/sec, queries/sec, or whichever KPI matches the workload.
  • Aggregate token throughput (tokens per second across the cluster) and latency percentiles (see Part 8 for the KPI set).
  • Workload metadata and configuration: application, model, ISL/OSL, and any other relevant configuration parameters.
  • Per-rack and aggregate power draw, sampled at a consistent interval. Keep both the time series and a steady-state average.
  • Wall-clock duration of the run and of the steady-state window within it.
  • Per-GPU power draw distribution across the 3 racks, if available — useful for showing how headroom is being left on the table by static provisioning.
  • EDPp settings per node. The same EDPp setting should be used across the entire fleet.

Use whatever stack you already trust for time-series capture (BMC or rack telemetry, DCGM or vendor exporters, inference server metrics, facility BMS). The important part is consistent sampling, aligned clocks, and the same definitions on baseline and managed runs.

The baseline run establishes the reference numbers the 4-rack DPS-managed run will be compared against. It runs without DPS in the control path, that is, while no topology or resource group is active.

This runbook does not prescribe how you launch the workload. Whatever you already use to orchestrate AI inference on these racks — Slurm, a Kubernetes inference operator, a Triton deployment, your own job runner — is exactly what you should use here. DPS cares about the power side of the run, not the scheduling side.

Part 5: Running the Baseline Test

Launch the workload across 3 of the 4 pilot racks (54 compute nodes) using your existing orchestration tooling. After the deactivate step in Part 3, no topology is active and no resource group exists yet, so DPS is not in the control path during this baseline run.

Part 6: Create and Activate DPS Resource Group

Step 1: Re-activate the topology

The topology was deactivated in Part 3 so the baseline could run with DPS out of the control path. Re-activate it now before creating the resource group; the topology is already imported, so this is the only setup step required:

dpsctl topology activate --topology maxlps-pilot

Troubleshooting: If topology activation fails, see Topology Activation.

Step 2: Verify the topology is active

dpsctl tp list --active

Confirm maxlps-pilot appears as the active topology.

Step 3: Create the resource group

Create one resource group containing every compute node in the topology, with DPM, PRS, and shared-GPU all enabled. The --shared-gpu-enable flag is what binds the resource group’s GPU power consumption to a shared, group-wide budget rather than per-node caps. Set the power policy of the nodes to 75% of MaxP as a starting point. This is a tunable parameter and can be changed in subsequent runs if desired. If you are using the example topologies from Part 2, then the power policy for 75% of MaxP is GB300-Per-75 or GB200-Per-75, depending on your GPU model.

dpsctl resource-group create \
  --resource-group maxlps-pilot \
  --external-id 1 \
  --dpm-enable \
  --prs-enabled \
  --shared-gpu-enable \
  --policy GB300-Per-75 # For GB200, use GB200-Per-75

Troubleshooting: If resource group creation fails, see Resource Group Creation.

--external-id is required and must be a 64-bit integer. It is the handle a workload scheduler or provisioning system would normally use to correlate this resource group with one of its own records. In the scheduler-independent configuration this runbook describes, no external system owns the resource group — pick any positive integer and keep a note of it for your own records. DPS does not enforce uniqueness on this value, but using a distinct ID per resource group makes dpsctl rg list output easier to read once you have more than one.

Step 4: Add every compute node

Add every compute node in the topology to the resource group. Substitute your real node names for gb300-nvl-001-compute01,gb300-nvl-001-compute02,...:

dpsctl resource-group add \
  --resource-group maxlps-pilot \
  --entities "gb300-nvl-001-compute01,gb300-nvl-001-compute02,gb300-nvl-001-compute03..."
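
Typing 72 node names by hand is error-prone, so you may prefer to generate the list. The sketch below builds the comma-separated value, assuming your nodes follow the same <prefix>-<rack>-compute<node> pattern as the example names above. That pattern and the helper name build_entities are assumptions; use the names from your own topology file if they differ.

```shell
# Sketch: build the comma-separated --entities value for the pilot.
# The naming pattern mirrors the example names in this step and is an
# assumption; adjust the printf format to match your topology.
# Usage: build_entities RACKS NODES_PER_RACK [PREFIX]
build_entities() {
  racks=$1
  nodes=$2
  prefix=${3:-gb300-nvl}
  list=""
  r=1
  while [ "$r" -le "$racks" ]; do
    n=1
    while [ "$n" -le "$nodes" ]; do
      list="${list}$(printf '%s-%03d-compute%02d' "$prefix" "$r" "$n"),"
      n=$(( n + 1 ))
    done
    r=$(( r + 1 ))
  done
  # Drop the trailing comma before printing.
  printf '%s\n' "${list%,}"
}
```

You could then pass the result directly, for example `dpsctl resource-group add --resource-group maxlps-pilot --entities "$(build_entities 4 18)"`, after checking that the generated names match the ComputerSystem entries in the imported topology.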

See dpsctl resource-group add for additional flags (partial-update behavior, strict policy, etc.).

Troubleshooting: If adding resources fails, see Adding Resources to the Resource Group.

Step 5: Activate the resource group

Activate the resource group. Use --sync so the command blocks until every node has its policy applied:

dpsctl resource-group activate \
  --resource-group maxlps-pilot \
  --allow-reprovision \
  --sync

See dpsctl resource-group activate for the full activation flag set, including strict-policy and partial-timeout options.

Step 6: Verify resource group activation

dpsctl rg list --active

Confirm the maxlps-pilot resource group appears with:

  • activation_status: active
  • dpm_enable: true
  • prs_enabled: true
  • shared_gpu_enable: true
  • resource_names listing all 72 compute nodes

Troubleshooting: If resource group activation fails, see Resource Group Activation.

Step 7: Verify PRS is updating GPU limits

After the resource group is active, confirm the PRS pod is running control-loop iterations and writing new GPU limits. Find the PRS pod, then inspect recent logs from the prs container:

kubectl get pods -n dps -l app=prs
kubectl logs -n dps <prs-pod-name> -c prs --since=10m

Look for repeated Devices power draw:, New power limits:, and Loop latency: blocks. The node names, timestamps, and watt values will differ, but a healthy PRS loop should look similar to this:

[2026-06-12 16:42:10,214][DEBUG] Devices power draw:
   node_name                 device_type  device_index  domain_name   min_watts  max_watts  value_watts
0  gb300-nvl-001-compute01  gpu          0             maxlps-pilot  200.0      1200.0     842.7
1  gb300-nvl-001-compute01  gpu          1             maxlps-pilot  200.0      1200.0     816.4
2  gb300-nvl-001-compute02  gpu          0             maxlps-pilot  200.0      1200.0     905.2

[2026-06-12 16:42:10,319][DEBUG] New power limits:
   node_name                 device_type  device_index  next_limit
0  gb300-nvl-001-compute01  gpu          0             985.3
1  gb300-nvl-001-compute01  gpu          1             962.1
2  gb300-nvl-001-compute02  gpu          0             1034.8

[2026-06-12 16:42:11,002][INFO] Loop latency: 0.68 s. About to sleep: 5.32 s

The important signal is that PRS keeps emitting New power limits: blocks after activation.
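
To turn that visual check into a number, you can pipe the log text through a small filter. The sketch below counts New power limits: blocks and reports the largest Loop latency: value, relying only on the log line shapes shown in the sample above. The function name is hypothetical, and the parsing will need adjusting if your PRS version formats these lines differently.

```shell
# Sketch: summarize PRS activity from log text on standard input.
# Relies on the "New power limits:" and "Loop latency: <seconds> s" lines
# shown in the sample output; other log formats are not handled.
prs_log_summary() {
  awk '
    /New power limits:/ { updates++ }
    /Loop latency:/ {
      # The latency value is the field right after the "latency:" token.
      for (i = 1; i < NF; i++) {
        if ($i == "latency:") {
          v = $(i + 1) + 0
          if (v > max) max = v
        }
      }
    }
    END { printf "updates=%d max_loop_latency_s=%.2f\n", updates, max }
  '
}
```

For example, `kubectl logs -n dps <prs-pod-name> -c prs --since=10m | prs_log_summary` prints one summary line. Running it twice a few minutes apart should show updates still being produced.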

Troubleshooting: If PRS does not emit ongoing power-limit updates, see PRS Log Checks.

Part 7: 4-Rack Inference Test Under DPS

The only intended changes between Part 5 and Part 7 are the number of racks under load and that DPS is now actively managing power across all 4 pilot racks. Workload, model, framework, batching, concurrency, and the operating envelope are unchanged; if anything else differs, the comparison in Part 8 is invalid.

The two runs differ on exactly two dimensions:

Dimension | Baseline (Part 5) | Managed (Part 7)
Racks Under Load | 3 racks (54 compute nodes) | 4 racks (72 compute nodes)
DPS Control Path | Not in control path; no topology and no resource group active | In control path; topology maxlps-pilot active and resource group maxlps-pilot active
Operating Power Envelope | Same | Same
Workload, Model, Framework, Batching, Concurrency | Same | Same

The Operating Power Envelope row is held by the topology in the managed run: the active PDU-Root PowerDomain already sets the same 3-rack operating limit used by the baseline, and DPS enforces that limit across all 4 racks through the resource group’s shared-GPU policy.

With the topology active and the pilot resource group (maxlps-pilot) in place, both created and maintained without any integration with your workload schedulers or provisioning systems, re-run the workload you used for the 3-rack baseline. Keep the model, framework, and request pattern identical, but run across all 4 racks (72 nodes) with DPS in the control path.

As in Part 5, this runbook does not prescribe how you launch the workload. Use the same orchestration tooling (Slurm, Kubernetes inference operator, Triton, your own job runner) you used for the baseline. The only thing that changes between the two runs is the number of racks under load and whether DPS is enforcing power policy — everything else must stay the same so the comparison is meaningful.

Workload requirements

The 4-rack run must be directly comparable to the 3-rack baseline. In addition to the baseline’s requirements:

  • It is the same workload as the baseline. Same model, same framework, same batch size, same concurrency, same input distribution. If anything about the workload itself changes, the throughput numbers are no longer comparable.
  • All 4 pilot racks are under load in isolation. No other workloads should be sharing these racks or competing for the same facility power budget during the run. The same operating envelope that constrained the 3-rack baseline should constrain the 4-rack run.
  • DPS is in the control path. Confirm the pilot topology from Part 2 and the active resource group from Part 6 are in place (dpsctl tp list, dpsctl rg list) and that the BMC health gate from Part 2 still passes (dpsctl verify bmc-health start --topology maxlps-pilot --wait --summary-only) before launching the workload.
  • It produces a comparable steady-state window. DPS’s per-GPU caps are set dynamically from telemetry, so the run needs enough duration for the control loop to settle into steady-state redistribution, not just the workload itself. Apply the same steady-state window procedure defined in Part 4 under “Workload requirements”, and make the resulting window at least as long as the one used for the 3-rack baseline.

Enable DPS Prometheus metrics

dps-server exports operational metrics in Prometheus text format on its HTTP service at /metrics. Enable scraping for the pilot so the DPS-side signals called out in What to record below — per-GPU cap movement and group-sum allocation versus observed aggregate draw — land in the same time-series store as the workload and BMC telemetry you captured in Part 4, and can be cross-checked against the steady-state window with PromQL. The chart-managed ServiceMonitor (dps.serviceMonitor.enabled: true) is the recommended path for in-cluster Prometheus and is applied through the Helm upgrade flow in the Deployment Guide; see Prometheus Monitoring for the full setup, including external Prometheus and TLS variants.

Note: Land this Helm change once, before the 4-rack managed run starts, so that Prometheus samples cover the entire steady-state window described in What to record.

Running the workload

Launch the workload across all 4 pilot racks (72 compute nodes) using the same orchestration tooling and submission flow you used for the 3-rack baseline. DPS is now in the control path and will actively redistribute per-GPU power headroom across the resource group in response to live BMC telemetry.

What to record

Capture the same metrics you captured for the 3-rack baseline, over a comparable steady-state window, so the two runs can be compared directly.

Additionally, capture the following observations, which are specific to the DPS-managed run:

  • Per-GPU cap movement during the run — the per-GPU power_max that DPS is actively setting, sampled over time. This demonstrates DPS redistributing budget live, rather than statically provisioning.
  • GPU power allocations vs. observed aggregate draw — confirms the group sum is being enforced rather than per-node caps leaving headroom on the table.
  • DPS-provided Prometheus metrics — once scraping is enabled per Enable DPS Prometheus metrics, capture the DPS server’s /metrics series for the same steady-state window so per-GPU cap movement and group-sum enforcement above can be cross-checked against PromQL.

To sample per-GPU caps and utilization over time, we recommend using DCGM on a representative node set on a fixed interval (for example every 60 seconds via a simple loop or your automation), or export the same fields from your existing BMC/telemetry pipeline. Store set_limit / max_limit alongside observed GPU power so cap movement can be plotted against workload phase.
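A minimal sketch of such a sampling loop follows. It deliberately swaps in nvidia-smi for DCGM, because the query fields used here (timestamp, index, power.draw, power.limit, power.max_limit) are standard nvidia-smi --query-gpu fields. The function names, the dictionary keys, and the 60-second default are assumptions for illustration, not part of DPS or DCGM.

```python
import csv
import io
import subprocess
import time

QUERY_FIELDS = "timestamp,index,power.draw,power.limit,power.max_limit"


def parse_samples(csv_text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(rec) != 5:
            continue  # skip blank or unexpected lines
        ts, index, draw, limit, max_limit = (field.strip() for field in rec)
        rows.append({
            "timestamp": ts,
            "gpu": int(index),
            "power_w": float(draw),          # observed GPU power
            "set_limit_w": float(limit),     # cap currently in force
            "max_limit_w": float(max_limit), # highest cap the GPU accepts
        })
    return rows


def sample_once():
    """Run nvidia-smi once on the local node and return the parsed rows."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_samples(result.stdout)


def sample_forever(interval_s=60, sink=print):
    """Emit one batch of rows every interval_s seconds until interrupted."""
    while True:
        for row in sample_once():
            sink(row)
        time.sleep(interval_s)
```

Calling sample_forever() on each representative node, with sink pointed at your own writer, gives a per-GPU series of cap versus draw. A GPU that reports a non-numeric value such as [N/A] would raise ValueError in this sketch, so handle that case before relying on it.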

Part 8: Evaluate Metrics

This part turns the raw captures from Part 4, Part 5, and Part 7 into a decision. Compare runs only on their steady-state windows — at minimum 1 hour total run time, ramp-up and cool-down trimmed, both per-rack power and aggregate tokens/sec visibly flattened, contiguous window of at least 30 minutes, and the exact start/end timestamps recorded. The full procedure is defined in Part 4 under “Workload requirements”. Do not compare headline numbers unless the steady-state windows, concurrency, and request mix match between the 3-rack baseline and the 4-rack managed run.
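One way to locate a candidate window in an already-trimmed series is sketched below. The 30-minute minimum comes from the text above; the flatness test (window maximum no more than 10% above window minimum) is an assumption made for illustration, and the procedure in Part 4 remains the authority. The function name is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def find_steady_window(
    samples: Sequence[Tuple[float, float]],
    min_seconds: float = 30 * 60,
    spread_tol: float = 0.10,
) -> Optional[Tuple[float, float]]:
    """Return (start_ts, end_ts) of the first contiguous window that lasts at
    least min_seconds and whose maximum stays within spread_tol of its
    minimum, or None if no such window exists.

    samples: (timestamp_seconds, value) pairs in time order, for example
    aggregate tokens/sec or per-rack power.
    """
    n = len(samples)
    for i in range(n):
        start_ts = samples[i][0]
        lo = hi = samples[i][1]
        for j in range(i, n):
            ts, value = samples[j]
            lo, hi = min(lo, value), max(hi, value)
            if lo <= 0 or hi > lo * (1 + spread_tol):
                break  # not flat from this start point; try the next one
            if ts - start_ts >= min_seconds:
                return (start_ts, ts)
    return None


# Hypothetical series sampled every 60 s: a 10-minute ramp, then a plateau.
ramp = [(t * 60, 100 * t) for t in range(10)]
plateau = [(t * 60, 1000 + (t % 2) * 10) for t in range(10, 60)]
print(find_steady_window(ramp + plateau))
```

Run it on both per-rack power and aggregate tokens/sec, and record only a window that both series agree on.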

Pilot success quick-check

Before reading the KPI tables, confirm the run satisfies every pass condition below. A miss on any row means the pilot did not meet its goals on this configuration.

| Check | Pass condition | Source KPI in this Part |
|---|---|---|
| Compliance | Zero new compliance events in the managed run’s steady-state window | Compliance events (Data center power and compliance KPIs) |
| Throughput uplift | Aggregate output tokens/sec strictly greater than baseline at matched load | Aggregate output tokens/sec (Inference service KPIs) |
| Service stability | Error rate matches baseline within ±0.1 percentage points | Success / error rate (Inference service KPIs) |
| Envelope adherence | Overall observed power ≤ binding data center constraint for ~100% of steady-state samples | Overall observed power (Data center power and compliance KPIs) |
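The four pass conditions can be expressed as a small function, sketched here under stated assumptions: the dictionary keys and the function name are hypothetical, and "~100% of steady-state samples" is modeled as a configurable fraction that defaults to 1.0 (every sample).

```python
def quick_check(baseline, managed, constraint_w, min_adherence=1.0):
    """Evaluate the four pass conditions; returns one boolean per check.

    baseline needs: tokens_per_sec, error_rate_pct
    managed needs:  tokens_per_sec, error_rate_pct, compliance_events,
                    power_samples_w (overall observed power, steady state)
    """
    samples = managed["power_samples_w"]
    within = sum(1 for watts in samples if watts <= constraint_w)
    adherence = within / len(samples)
    error_delta = abs(managed["error_rate_pct"] - baseline["error_rate_pct"])
    return {
        "compliance": managed["compliance_events"] == 0,
        "throughput_uplift": managed["tokens_per_sec"] > baseline["tokens_per_sec"],
        "service_stability": error_delta <= 0.1,  # percentage points
        "envelope_adherence": adherence >= min_adherence,
    }


# Placeholder numbers for illustration only, not measured results.
baseline_run = {"tokens_per_sec": 1000.0, "error_rate_pct": 0.20}
managed_run = {
    "tokens_per_sec": 1200.0,
    "error_rate_pct": 0.25,
    "compliance_events": 0,
    "power_samples_w": [300_000.0, 310_000.0, 305_000.0],
}
result = quick_check(baseline_run, managed_run, constraint_w=360_000.0)
print(result, all(result.values()))
```

A False on any key corresponds to a miss on that row of the table.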

Time alignment

  • Clock authority. Synchronize every sampling agent — BMC/BMS exporter, DCGM, inference server, load generator — to the cluster’s NTP/chrony source (typically the cluster head node or site-wide NTP server) before either run. All timestamps in this Part are interpreted relative to that authority. If a sampling agent cannot sync to the cluster source, document the offset and treat its data as advisory, not authoritative.
  • Use the same wall-clock sampling interval (or finer on the managed run) for power and service metrics.
  • If you change anything about the client (concurrency, prompt distribution, batching), treat the run as a new experiment—do not attribute deltas to DPS.

Data center power and compliance KPIs

The definitions below assume you can obtain an overall facility or pilot-segment power budget (the data center constraint), provisioned electrical or thermal capacity for the GPUs or systems under test (from nameplate TDP, derated caps, or your capacity model), and metered or telemetry-derived actual draw for the same boundary. Map each to your monitoring source (BMS, rack PDU, BCM, or aggregated BMC).

| KPI | Definition | What to expect in a successful pilot |
|---|---|---|
| Compliance events | Distinct count of intervals (or events) where overall observed power exceeds any applicable data center constraint (contract limit, breaker headroom, pilot PowerDomain budget; use the tightest binding constraint you track). | Target: zero new compliance events in the managed run’s steady-state window. The baseline may already be clean; the managed run must not introduce excursions. |
| Overall provisioned power | Total system power (provisioned): sum (or budgeted envelope) of provisioned capacity for the systems in scope (for example aggregate GPU or node TDP limits you could enforce statically). | Between runs: provisioned total for the 4-rack fleet can be higher than for the 3-rack baseline because more hardware sits in the topology, while the same pilot envelope caps observed aggregate draw. You are measuring whether you provisioned more compute than naive sizing would have allowed under that envelope. |
| Overall observed power | Total system power (observed utilization): metered or telemetry-summed draw for the same boundary as provisioned and constraints. | Must remain ≤ the binding data center constraint for essentially 100% of steady-state samples on the managed run (allow rare spikes only if your constraint definition already includes agreed crest factor). Expect observed at or below the pilot envelope you set in topology. |
| Capacity efficiency | Power budget minus total system power (provisioned). | Interpret as headroom between envelope and what you provisioned on paper. With MaxLPS-style sizing, this number is often smaller than in conservative MaxP-only planning because you deliberately provision closer to the envelope; that is expected if compliance stays clean. |
| Provisioning efficiency | Total system power (provisioned) minus total system power (observed utilization). | Interpret as stranded provisioned headroom not appearing as draw. You often want this gap to shrink when inference loads the fleet more evenly; a large persistent gap under load suggests under-utilization or caps binding unevenly. |
| Resource group efficiency | Aggregated average power (W) divided by aggregated provisioned TDP limit (W) for all members of the pilot resource group, using the same time window. | Ratios in roughly 0.35–0.85 are common for bursty inference once warm; values sustained below 0.25 under declared steady load suggest the workload or caps are not exercising the fleet, while values sustained above 0.95 leave little margin for spikes, so watch compliance. |
| Node efficiency | Node average power (W) divided by node provisioned TDP limit (W) per compute node, then compare distributions across racks. | Expect a similar spread to baseline at matched load, with managed runs sometimes showing higher averages on more nodes because PRS moves limits toward demand. Flag nodes stuck below 0.2 while peers approach limits: possible mapping, cooling, or telemetry issues. |

If your tooling cannot yet compute one of the aggregates exactly as defined, document the substitute signal (for example sum of BMC inlet metrics vs. rack PDU) and apply it consistently on both runs.
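The arithmetic behind these definitions can be sketched as follows. Two choices here are assumptions rather than something the definitions fix: a compliance event is counted as one contiguous run of samples above the constraint, and observed utilization is taken as the mean over the window. Function names and the example wattages are placeholders.

```python
from statistics import mean


def count_excursions(observed_w, constraint_w):
    """Count contiguous runs of samples above the constraint as one event each."""
    events, above = 0, False
    for watts in observed_w:
        if watts > constraint_w and not above:
            events += 1
        above = watts > constraint_w
    return events


def power_kpis(budget_w, provisioned_w, observed_w, constraint_w):
    """Compute the aggregate power KPIs over one steady-state window."""
    observed_avg_w = mean(observed_w)
    return {
        "compliance_events": count_excursions(observed_w, constraint_w),
        "overall_observed_max_w": max(observed_w),
        "capacity_efficiency_w": budget_w - provisioned_w,
        "provisioning_efficiency_w": provisioned_w - observed_avg_w,
        "resource_group_efficiency": observed_avg_w / provisioned_w,
    }


def low_efficiency_nodes(node_avg_w, node_tdp_w, threshold=0.2):
    """Return nodes whose average power / provisioned TDP is below threshold."""
    return sorted(
        name for name, avg in node_avg_w.items()
        if avg / node_tdp_w[name] < threshold
    )


# Placeholder values for illustration only.
kpis = power_kpis(
    budget_w=500_000.0,
    provisioned_w=480_000.0,
    observed_w=[300_000.0, 310_000.0, 505_000.0, 320_000.0],
    constraint_w=500_000.0,
)
print(kpis)
```

Apply the same functions, with the same substitutes for any missing signal, to both the baseline and the managed window.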

Inference service KPIs

These metrics come from the inference stack (vLLM, Triton, TensorRT-LLM, proprietary routers, etc.), not from DPS. Prefer server-reported token and latency counters over client-only estimates.

| KPI | Definition | What to expect in a successful pilot |
|---|---|---|
| Aggregate output tokens/sec | Total output tokens completed per second across all replicas in scope (cluster-wide), measured over the steady-state window. Exclude failed generations if your stack exposes goodput separately. | Strictly greater than the 3-rack baseline steady-state value when load generator settings are unchanged; this is the headline uplift. Relative uplift depends on how much headroom the baseline left; single-digit to mid-double-digit percent improvements are plausible when the fleet was envelope-starved, but your measured baseline is the reference, not a fixed percentage from this table. |
| Aggregate total tokens/sec | Input + output tokens per second if your economics or SLAs track both. | Should move roughly in proportion to output tokens/sec unless the prompt length distribution changed. |
| Tokens/sec per concurrent user (or per active session) | Aggregate output tokens/sec divided by a defined concurrency measure (open sessions, active clients, or in-flight requests; pick one and keep it fixed). Proxy for interactive saturation. | At fixed concurrency, expect within ~±15% of the baseline value if power is not the dominant bottleneck; a large sustained drop warrants checking for cap-induced queueing or straggler GPUs. |
| Time to first token (TTFT) | Latency from request accept to first output token. Report p50/p95/p99 over the window. | Expect p95 within ~1.2× of baseline p95 and p99 not more than ~1.5× baseline p99 when configuration is healthy; PRS can reorder headroom and introduce modest tail movement. Investigate if p99 regresses beyond ~2× baseline. |
| Inter-token latency (ITL) / decode token interval | Spacing between successive output tokens (stream quality). | Similar tolerance band as TTFT: p95 ~1.2× baseline, p99 ~1.5× baseline as a first-pass gate; tighten to your product SLO if stricter. |
| End-to-end request latency | Wall time for full response. | Same percentile bands as TTFT/ITL for streaming workloads; for non-streaming, this is often the primary user-facing metric, so use the same ~1.2× / ~1.5× p95/p99 gates unless your SLO says otherwise. |
| Success / error rate | Fraction of requests completing without server-side error. | Match baseline within ±0.1 percentage points on error rate; regressions here invalidate throughput comparisons. |

The ±15%, 1.2×, and 1.5× bands above are starting heuristics for a first pilot read, not substitutes for contractual SLOs. Replace them with your own targets where you have historical production data.
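As an illustration of how the percentile gates might be applied, the sketch below computes p50/p95/p99 from raw latency samples with Python's statistics.quantiles and compares managed against baseline using the 1.2×, 1.5×, and 2× bands as defaults. The function names and result keys are hypothetical, and the interpolation method is one reasonable choice; if your inference stack reports percentiles itself, prefer those.

```python
from statistics import quantiles


def percentiles(latencies_ms):
    """Return p50/p95/p99 using linear interpolation over the sample range."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def latency_gates(baseline, managed, p95_max=1.2, p99_max=1.5,
                  p99_investigate=2.0):
    """Compare managed percentiles to baseline using ratio bands."""
    p95_ratio = managed["p95"] / baseline["p95"]
    p99_ratio = managed["p99"] / baseline["p99"]
    return {
        "p95_ratio": p95_ratio,
        "p99_ratio": p99_ratio,
        "p95_ok": p95_ratio <= p95_max,
        "p99_ok": p99_ratio <= p99_max,
        "investigate": p99_ratio > p99_investigate,
    }


# Placeholder TTFT percentiles in milliseconds, for illustration only.
gates = latency_gates(
    baseline={"p95": 100.0, "p99": 200.0},
    managed={"p95": 115.0, "p99": 320.0},
)
print(gates)
```

The same gate can be reused for ITL and end-to-end latency by passing the corresponding percentile dictionaries and, where your SLO is stricter, tighter band values.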

How to document the outcome

  1. Compliance table — List each binding constraint, observed max, and count of excursions (baseline vs. managed).
  2. Throughput and latency table — Baseline vs. managed for aggregate tokens/sec, tokens/sec per concurrent user, and p50/p95/p99 for TTFT (or TTFT + E2E).
  3. Efficiency plots — Time series for overall observed vs. constraint, resource group efficiency, and (optional) per-rack node efficiency histograms.
  4. DPS overlay — Example plots: per-GPU set_limit vs. observed GPU power for a sample of nodes, keyed to the same timestamps as inference metrics.

Use the following one-page layout for the final comparison output. Each section should be a single paragraph, table, or plot — anything longer belongs in an appendix.

  1. Run metadata — date, operator, topology name, resource group name, policy, workload identifier, model, framework.
  2. Steady-state windows — start/end timestamps for the baseline run and the managed run, plus total run duration.
  3. Clock authority — name of the NTP/chrony source used and any sampling agents whose offset was outside it.
  4. Pilot success quick-check — a copy of the Pilot success quick-check table from this Part with the actual pass/fail per row.
  5. Compliance summary — binding constraints, observed maxima, and excursion counts (baseline vs. managed).
  6. Power KPI table — baseline vs. managed values for each row of Data center power and compliance KPIs.
  7. Service KPI table — baseline vs. managed values for each row of Inference service KPIs.
  8. Plots — overall observed power vs. constraint over time; per-GPU set_limit vs. observed GPU power for a sample of nodes; (optional) per-rack node-efficiency histograms.
  9. Verdict and next action — pass / fail / partial pass, plus the single next change to make if not a pass (per the discipline in Next Steps).

If compliance stays clean, aggregate output tokens/sec rises materially, and latency percentiles remain inside your agreed bands, you have a documented case that the pilot configuration met its goals. If not, capture the same package of telemetry before changing topology, policy, or workload shape so the next iteration is evidence-driven.

Cleanup

When the pilot is complete, undo the configuration in the reverse of the order in which you applied it.

Step 1: Delete the resource group

DPS does not expose a standalone resource-group deactivate subcommand — deleting an active resource group automatically deactivates it first. See dpsctl resource-group delete.

dpsctl resource-group delete --resource-group maxlps-pilot

Step 2: Deactivate the topology

dpsctl topology deactivate --topology maxlps-pilot

See dpsctl topology deactivate for details.

At this point DPS is no longer in the power-control path for the pilot racks.

Step 3: Verify per-GPU caps returned to default

DPS does not actively reset per-GPU power.limit on deactivate. The pilot racks should already be at their default limit because the topology is no longer enforcing one, but dpsctl topology deactivate will not clear a cap left behind by a failed activation, a manual nvidia-smi -pl ..., or another tool. Confirm on a representative compute node from each pilot rack:

nvidia-smi --query-gpu=index,power.default_limit,power.limit --format=csv

Every GPU should report power.limit equal to power.default_limit (within rounding). If any GPU shows a lower power.limit, reset it explicitly before moving on:

sudo nvidia-smi -i <gpu-index> -pl <default-limit-watts>

Repeat on each affected GPU. Do not proceed to Next Steps or to the optional uninstall below until every sampled GPU reads back its default limit.
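When checking many nodes, the comparison can be scripted. The sketch below parses the CSV text produced by the query above and lists GPUs whose power.limit sits below power.default_limit. The function name and the 1 W rounding tolerance are assumptions, and the sample output in the usage lines is illustrative rather than captured from a real node.

```python
def gpus_below_default(csv_text, tolerance_w=1.0):
    """Return (index, default_w, limit_w) for GPUs capped below their default.

    Expects rows of the form '0, 1200.00 W, 1200.00 W' and skips the header
    row or any line that does not start with a numeric GPU index.
    """
    below = []
    for line in csv_text.strip().splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        default_w = float(parts[1].split()[0])
        limit_w = float(parts[2].split()[0])
        if limit_w < default_w - tolerance_w:
            below.append((int(parts[0]), default_w, limit_w))
    return below


# Illustrative output shape; real values depend on the GPU model.
sample_output = """\
index, power.default_limit [W], power.limit [W]
0, 1200.00 W, 1200.00 W
1, 1200.00 W, 985.00 W
"""
print(gpus_below_default(sample_output))
```

An empty list means every GPU on that node reads back its default; any entry names a GPU to reset with the nvidia-smi -pl command above.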

Step 4: (Optional) Uninstall DPS

If you have more tuning passes planned (different policies or additional workloads from Next Steps), skip this step. DPS can remain installed indefinitely; with the resource group deleted and the topology deactivated, it is not in the power-control path and is safe to leave in place between runs.

Only if you are done with the pilot and want to remove DPS entirely:

helm uninstall dps -n dps

Optionally remove the namespace and any BMC secrets created for the pilot:

kubectl delete namespace dps

Next Steps

The pilot topology and decoupled resource group are intentionally small so you can prove behavior end-to-end. After Part 8 shows the outcome you want, work through the three priorities below in order, re-running the compliance and service KPI captures from Part 8 after each change.

1. Model a larger, more complex topology

Expand beyond the 4-rack sketch to full rows, feeds, and redundancy paths your facility actually runs: additional PowerDomain boundaries, floor and rack PDUs, diversity groups, and any devices that were simplified or omitted here. The goal is to practice the same import → activate → verify loop on data that resembles a future data center rollout, so surprises surface in the lab—not the first production change window.

See Topologies for the entity model and Managing Topologies for operational procedures. Re-run the deployment-health gate from Part 1 and the BMC health gate (dpsctl verify bmc-health start --topology <name> --wait --summary-only) from Part 2 after each major topology change.

2. Tune power budgets and overprovisioning carefully

Many pilots still show observed draw below 100% of the binding budget during steady inference: that is normal when burstiness or conservative caps leave slack. Tune the topology model before you expand the pilot:

  • Topology PowerDomain: The operating envelope is expressed as the domain’s OperatingLimit (the cap DPS enforces against).
  • Topology overprovisioning: Overprovisioning is not a Helm chart setting; it is the optional numeric field OverProvisioningPercentage on the PowerDomain entity in your topology JSON. Per the topology JSON schema (PowerDomain definition), provisioning headroom scales with the operating limit as provisioning limit = operating limit × (1 + OverProvisioningPercentage / 100). See the schema for exact typing, constraints, and the dependency that OverProvisioningPercentage requires an OperatingLimit. After editing the topology, re-import and activate it following Managing Topologies, then re-validate metrics before comparing runs.
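To get a feel for how the provisioning limit responds to the percentage, here is a small worked sketch of the formula above. The function name, the 360 kW operating limit, and the step values are placeholders chosen for illustration; the topology JSON schema remains the authority on field types and constraints.

```python
def provisioning_limit(operating_limit_w, over_provisioning_pct):
    """provisioning limit = operating limit x (1 + OverProvisioningPercentage / 100)"""
    return operating_limit_w * (1 + over_provisioning_pct / 100)


# Placeholder operating limit; sweep the percentage in small steps.
operating_limit_w = 360_000.0
for pct in (0, 5, 10, 15):
    limit_w = provisioning_limit(operating_limit_w, pct)
    extra_w = limit_w - operating_limit_w
    print(f"{pct:>3}% -> provisioning limit {limit_w:,.0f} W (+{extra_w:,.0f} W)")
```

Each 5-point step on a 360 kW operating limit adds about 18 kW of provisioning headroom, which is one way to size the small steps this section recommends.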

Use small steps on OperatingLimit and OverProvisioningPercentage so DPS can hand out a little more effective GPU budget—only while aggregate telemetry and facility constraints prove you remain safe.

Treat this as an experiment matrix, not a single knob turn:

  • Change one family of settings at a time (PowerDomain’s OperatingLimit or PowerDomain’s OverProvisioningPercentage), apply it through the topology workflow, and re-run a comparable workload window.
  • Stay within roughly 10% of the true electrical or contractual limit on the binding constraint when you first push outward—enough to learn sensitivity without jumping past derated margins you have not modeled here.
  • Observe heavily: the same compliance, efficiency, and inference tables from Part 8 should improve or at least not regress; any increase in compliance events or tail latency is a signal to roll back that change before stacking the next one.

3. Experiment with different workloads

Repeat measurement passes with other inference models, batch sizes, concurrency levels, and prompt length distributions that matter for your roadmap, including bursty or seasonal patterns if you can simulate them. The goal is to learn where token throughput and latency stay acceptable as power policy moves, not to optimize a single demo model forever. Non-inference loads (training, stress tools) remain poor substitutes for the envelope story this runbook targets.

Once those three are stable, common production-readiness follow-ons are: integrating scheduling and provisioning (Managing Resource Groups); promoting Part 8 plots into dashboards and SLOs; running multi-hour or multi-day soaks; validating the topology model against as-built power; defining change control for PowerDomain OperatingLimit and OverProvisioningPercentage; and rehearsing failure and maintenance drills with DPS active. Sequence them by risk — topology fidelity and compliance first, then throughput tuning, then integration and operations.