Inference power pilot: NVL72 fleet power management

Overview

This guide walks you through running an AI inference workload on GB200 NVL72 or GB300 NVL72 racks while DPS (Domain Power Service) applies automatic, telemetry-driven power management across the fleet. You will establish a multi-rack baseline without DPS in the control path, then enable DPS so the same operating power envelope can safely cover more racks and GPUs than static MaxP-style planning would allow. DPS is never coupled to the workload schedulers or provisioning systems that own placement and fleet membership: workloads stay on your existing launch path, and DPS enforces power at the infrastructure layer.

In other words, the pilot asks whether a fixed facility power budget can deliver higher aggregate inference throughput if you deploy more GPUs inside that budget than static, per-rack sizing would justify—provided a control plane continuously reconciles draw against the envelope. NVIDIA uses the name MaxLPS (Maximum Land Power Shell) for that pattern; the following section defines the term and the facility-level reasoning that makes it credible for inference.

Expected outcomes. Compared to the unmanaged baseline, a successful managed run should show three things at once:

  • Improved aggregate token throughput (or your service’s primary tokens/sec KPI) from operating more inference capacity inside the same power shell.
  • Only a modest impact on latency—per-request and tail behavior may shift slightly as per-GPU limits track telemetry, which you should quantify rather than assume is negligible.
  • Power compliance maintained for the full steady-state window, meaning aggregate draw and the agreed operating envelope stay aligned with no hidden excursions.

None of that is persuasive without evidence, so plan to collect telemetry liberally—workload-side inference metrics, per-rack and aggregate power, DPS limit traces, BMC readings, and any facility or orchestration alarms you already rely on—using consistent sampling and timestamps so you can tie service behavior to policy action. That record becomes the basis for the KPI-oriented comparison in Part 9.

What does MaxLPS mean?

Modern AI inference workloads rarely sustain peak GPU power draw. Because of this, a data center that statically provisions compute strictly against MaxP power leaves a large fraction of its power envelope unused. NVIDIA MaxLPS (Maximum Land Power Shell) names the data center pattern of intentionally deploying more compute than a naive power cap would allow, while a control plane keeps aggregate draw within the agreed operating envelope. A common rule of thumb is safe overprovisioning on the order of 40% more compute against that same envelope when power is actively managed.

This runbook does not treat MaxLPS as an abstract initiative. It shows how to validate that pattern on NVL72 using DPS: you reproduce a fixed envelope with more GPUs in the fleet, measure inference throughput, and confirm the envelope is respected.

Scheduler-independent configuration

In this runbook, “scheduling and provisioning” means workload schedulers (Slurm, Kubernetes, and similar job or pod placement layers) and provisioning systems (cluster installers, fleet managers, and automation that registers, scales, or re-images nodes)—everything that normally owns where nodes run and which hardware is in service—not DPS’s power plane.

DPS is designed to follow those layers when you want it to: workload schedulers and provisioning systems can drive resource group lifecycle and policy through the nvidia.dcpower.v1 resource group API and the matching dpsctl resource-group commands so membership stays aligned with jobs, reservations, or capacity changes. Managing Resource Groups documents that integrated style of operation step by step.

You can still pursue the MaxLPS pattern without that coupling by defining one or more resource groups whose members you manage outside both workload schedulers and provisioning systems—fixed topology entities, dpsctl resource-group create / add / activate only—while inference continues to run through your existing scheduling and provisioning unchanged. Scheduler-independent configuration, in this runbook, is that decoupled path: the smallest setup that enforces a shared GPU power budget across the fleet while scheduling and provisioning never call DPS to add or remove nodes for you.

That configuration consists of:

  • A simplified topology that models the 5 racks and the operating power domain above them.
  • A single resource group containing every compute node in the topology, created with three capabilities enabled together:
    • DPM (Dynamic Power Management) — DPS actively enforces per-node power policy.
    • PRS (Power Reservation Steering) — DPS’s mechanism that redistributes headroom across GPUs based on live telemetry. See Power Reservation Steering.
    • Shared GPU — the resource group’s GPU budget is enforced as a group sum rather than per-node, which is what makes operating more racks within one envelope possible. See Resource Groups.

If any one of these three is missing, the pilot configuration described here is incomplete and the comparative results in this runbook will not be reproducible. Creating these conditions is covered in the following sections.

What you will do

  1. Validate the deployment — Authenticate, run dpsctl verify, and interpret failures so you are not debugging topology on a broken server (Part 1).
  2. Model the site and gate end-to-end connectivity — Import devices if needed, build the 5-rack topology with a single PowerDomain budget, and activate it once as a temporary DPS-to-BMC connectivity gate (Part 2).
  3. Deactivate before the baseline — After the connectivity and activation gates in Part 2 pass, deactivate the topology so the unmanaged baseline runs with DPS out of the control path (Part 3).
  4. Prepare the baseline run — Stage inference consistently, define the steady-state window and KPIs, and document the 3-rack footprint and power context (Part 4).
  5. Run the unmanaged baseline — Execute the workload without DPS in the control path and capture baseline metrics (Part 5).
  6. Enable fleet power policy — Re-activate the topology, then create and activate the resource group with DPM, PRS, and Shared GPU, aligned to the topology, and verify policy is live (Part 6).
  7. Constrain the data center to a 3-rack envelope — Use dpsctl nvgrid set-load-target to publish the 3-rack MaxP envelope (125 kW × 3 for GB200, 135 kW × 3 for GB300) to DPS as a load target on the root-pdu feed (Part 7).
  8. Run the managed comparison — Repeat the same inference profile across 5 racks under the same envelope with DPS enforcing the budget (Part 8).
  9. Compare and decide — Align metrics windows, quantify throughput uplift, and watch for envelope breaches or throttling (Part 9).

Prerequisites

Before starting this runbook, confirm each of the following:

  • Hardware: 5 x GB200 NVL72 or GB300 NVL72 racks identified for the pilot. All 18 compute nodes per rack must be powered on and BMC-reachable.

  • DPS server deployed to a Kubernetes cluster following the Deployment Guide. BMC credential secrets for every node are already created in the Kubernetes cluster.

  • PRS (Power Reservation Steering) in Helm values.yaml — For results comparable to this runbook, the deployment chart must enable PRS with the values below. Confirm (or set) these keys in the values.yaml you pass to Helm before or as part of install/upgrade:

    dps:
      prs:
        enabled: true
  • dpsctl installed and authenticated per Install dpsctl. dpsctl must be able to reach the DPS server.

  • Inference workload and dataset staged on every node in all 5 racks, launchable from your job scheduler or equivalent orchestration.

  • Operating power envelope for the 5-rack group is known and documented — this is the top-level PowerDomain budget DPS will enforce.

Note: This runbook assumes all of the above are already complete. If DPS has not yet been deployed, stop here and complete the Deployment Guide first.

Note: If you do not have the physical hardware to run this pilot but still want to try DPS, you can deploy the DPS SDK Simulator to a VM to explore DPS in a simulated environment.

Part 1: Verify DPS Deployment

Before configuring DPS, confirm the deployment is healthy end-to-end. Treat this as a gate: if any component is unhealthy, stop and fix it before moving on to Part 2 — a misconfigured deployment will surface as confusing failures later in the runbook.

Note: If you haven’t already, export a few dpsctl environment variables to keep the commands that follow concise:

export DPSCTL_HOST="api.dps.your-domain.com"
export DPSCTL_PORT="443"
export DPSCTL_INSECURE_TLS_SKIP_VERIFY=true # Needed if TLS was not configured

Otherwise, you must add --host, --port, and --insecure-tls-skip-verify to every dpsctl command.

Step 1: Authenticate with dpsctl login

Before any other dpsctl command will succeed, authenticate against the DPS server. The token is cached under ~/.dpsctl/credentials.yaml and reused by subsequent commands until it expires.

dpsctl login -u <username>

dpsctl login prompts for the password interactively, or reads DPSCTL_USERNAME and DPSCTL_PASSWORD from the environment if set. See Install dpsctl for the one-time install and initial configuration.

Confirm authentication succeeded:

dpsctl server-version

A successful call returns the server’s version payload. An AUTHENTICATION_FAILED or missing token error here means the previous dpsctl login did not complete — rerun it before continuing.

Step 2: Run the verify gate

Run dpsctl verify with no component flags. It checks the DPS server, database, authentication service, UI ingress, and BCM connectivity in a single call:

dpsctl verify

Each component (dps_server, database, auth, ui, bcm) returns healthy, message, and skipped. The details field is informational and is mainly useful for troubleshooting.

Proceed to Part 2 only when status.ok is true and no component reports healthy: false. A skipped: true component is acceptable — it means that component is not part of this deployment.

Troubleshooting

To re-check a single component after a fix, pass its flag — for example dpsctl verify --database. Any combination of --dps-server, --database, --auth, --ui, and --bcm is valid.

  • dps_server unhealthy — Check the DPS server pod is Running; kubectl -n dps logs statefulset/dps-server usually shows the underlying error.
  • database unhealthy — Check the Postgres pod and the database connection secret; a failed ping is almost always reachability or credentials, not load.
  • auth unhealthy — For LDAP, verify ServerURL, BindDN, BindPassword, and CA/client certificates. For JWT, verify the private key file is mounted and readable.
  • ui unhealthy or missing — Confirm the dps-ui Ingress exists in the same namespace as the DPS server with at least one rule and a non-empty host.
  • bcm unhealthy — Confirm the BCM credential secret is present, the URL is reachable from the cluster, and the account has API access. If BCM is not deployed, expect skipped: true instead.

Part 2: DPS Configuration

To manage power effectively, DPS needs to know the structure of the data center in which it is operating. We call this the data center topology. DPS also needs to know of any custom devices it will encounter in that topology.

Step 1: Import Device Specifications (if needed)

DPS ships with a set of supported device specifications out of the box. The supported devices include, at time of writing:

  • Compute: DGX_GB200, DGX_GB300, DGX_B200, DGX_B300, DGX_H100
  • GPUs: GB200, GB300, B200, B300, H100
  • CPUs: Grace, Grace_GB300
  • Power infrastructure: FloorPDU95 (Generic Floor PDU with 95% efficiency factor), RackPDU95_57500W (Generic Rack PDU with 95% efficiency factor and max load of 57,500 W), PowerSupply95_3300W (Generic Power Supply with 95% efficiency factor and max load of 3,300 W)

If your data center uses PDUs, PSUs, or other equipment whose model does not appear in the list above, you must register a specification for it before it can appear in a topology. Follow the full procedure in Managing Devices, then import your additions:

dpsctl device upsert devices.yaml

Verify the import:

dpsctl device list

Confirm that every model you reference in your topology appears in the output.

GB300 GPU minimum power (optional). On GB300 NVL72 pilots, you can refine the GB300 GPU entry in your device-model YAML (devices.yaml) before dpsctl device upsert. Set each GPU’s minLoadWatts (see Device Specifications) to roughly 100 W below the average GPU power you measured during the baseline inference run—using the same steady-state window you plan to compare in Part 9. That lowers the floor DPM assumes for each GPU so the control loop tracks your real workload band instead of a conservative idle default, which usually sharpens dynamic power management without changing the upper cap, provided maxLoadWatts stays above minLoadWatts and still reflects hardware limits. Treat this as a deliberate model override; document the before/after values and re-import per Managing Devices.
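
As a concrete illustration, the fragment below sketches what such an override might look like in devices.yaml. Only the GB300 model name and the minLoadWatts / maxLoadWatts fields come from this runbook; the surrounding layout and the wattage values are hypothetical placeholders, so follow Device Specifications for the real schema and use your own baseline measurement for the numbers.

# Hypothetical devices.yaml fragment; layout and numbers are placeholders.
- Model: GB300
  Type: GPU
  minLoadWatts: 620    # measured baseline GPU average (for example ~720 W) minus ~100 W
  maxLoadWatts: 1400   # unchanged hardware limit; must remain above minLoadWatts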

Step 2: Build and Activate the Topology

The topology is how you describe your power-distribution network to DPS. For this runbook, the topology must contain at minimum:

Entity type        Count              Purpose
PowerDomain        1                  Top-level operating power envelope across all 5 racks.
PowerDistribution  5 (one per rack)   Rack PDU.
ComputerSystem     90 (18 per rack)   GB200 or GB300 compute nodes, parented under their rack PDU.

The PowerDomain’s PowerValue is the single most important knob: it is the operating power envelope DPS will enforce across the whole pilot. Set it to the same envelope you were operating under during the 3-rack baseline.

Accounting for NVSwitch power

Each GB200 NVL72 or GB300 NVL72 rack contains 9 NVSwitch trays. Their draw is not managed by DPS, but it must still be reflected in the rack PDU and PowerDomain budget.

The runbook represents the per-rack NVSwitch aggregate with the StaticLoad field on each FloorPDU-Rack0X (PowerDistribution) entity. StaticLoad declares fixed power consumption from unmanaged devices and is propagated up the power-distribution network automatically—see Entities → Static Load.

Use 7,416 W per rack for a standard GB300 NVL72 9-switch aggregate and 11,160 W per rack for a standard GB200 NVL72 rack (replace either value with measured or vendor-specified figures when available). The example topology files below apply these values on every rack PDU.

Example topology file

Download the reference topology that matches your hardware:

  • maxlps-pilot-gb300.json — 5 GB300 NVL72 racks, 90 DGX_GB300 nodes, 5 FloorPDU95 rack PDUs at 135 kW (StaticLoad 7,416 W), PDU-Root PowerDomain at 675 kW. Ships with the GB300-Per-* policy presets the runbook references in later steps.
  • maxlps-pilot-gb200.json — 5 GB200 NVL72 racks, 90 DGX_GB200 nodes, 5 FloorPDU95 rack PDUs at 125 kW (StaticLoad 11,160 W), PDU-Root PowerDomain at 625 kW. Ships with the GB200-Per-* policy presets derived from a live GB200 environment.

Replace the BMC URL and SecretName values with those for your fleet before importing; a scripted substitution sketch follows the preview below.

Preview: topology root, one rack PDU, and the PowerDomain
{
  "Topology": {
    "Name": "maxlps-pilot",
    "Entities": [
      {
        "Name": "FloorPDU-Rack01",
        "Children": [
          "gb300-nvl-001-compute01",
          "...",
          "gb300-nvl-001-compute18"
        ]
      },
      {
        "Name": "PDU-Root",
        "Children": [
          "FloorPDU-Rack01",
          "FloorPDU-Rack02",
          "FloorPDU-Rack03",
          "FloorPDU-Rack04",
          "FloorPDU-Rack05"
        ]
      }
    ]
  },
  "Policies": [
    {
      "Name": "GB300-Per-40",
      "Limits": [
        { "ElementType": "Node", "PowerLimit": { "Watts": 2520 } }
      ],
      "Properties": {}
    }
  ],
  "Entities": [
    {
      "OperatingLimit": { "PowerValue": { "Type": "W", "Value": 135000 } },
      "Model": "FloorPDU95",
      "Name": "FloorPDU-Rack01",
      "Redfish": { "URL": "https://localhost" },
      "StaticLoad": { "Type": "W", "Value": 7416 },
      "Type": "PowerDistribution"
    },
    {
      "OperatingLimit": { "PowerValue": { "Type": "W", "Value": 675000 } },
      "Name": "PDU-Root",
      "Redfish": { "URL": "https://localhost" },
      "Type": "PowerDomain"
    },
    {
      "Model": "DGX_GB300",
      "Name": "gb300-nvl-001-compute01",
      "Redfish": {
        "SecretName": "gb300-nvl-001-compute01",
        "URL": "https://10.0.0.1"
      },
      "Type": "ComputerSystem"
    }
  ]
}

For additional topology examples, see Managing Topologies.
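
If you keep BMC endpoints in a flat inventory, the substitution called out before the preview can be scripted rather than edited by hand. The sketch below is illustrative only: the CSV layout (node_name,bmc_url,secret_name), the file names, and the assumption that topology.json uses the flat Entities array shown in the preview are all choices you would adapt to your own inventory source.

# Sketch: patch each ComputerSystem's Redfish URL and SecretName from a CSV of
# node_name,bmc_url,secret_name rows (file names and layout are assumptions).
while IFS=, read -r node url secret; do
  jq --arg n "$node" --arg u "$url" --arg s "$secret" \
    '(.Entities[] | select(.Name == $n) | .Redfish) |= {"URL": $u, "SecretName": $s}' \
    topology.json > topology.json.tmp && mv topology.json.tmp topology.json
done < bmc_map.csv

Re-run dpsctl topology validate after any scripted edit so schema problems surface before import.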

Validate and import

Validate the topology file against the DPS schema and import it. Importing only registers the topology and its inventory in the DPS database — no policy is applied to any BMC yet, so this step is safe to run on a live fleet.

dpsctl topology validate topology.json # Confirms the topology matches the DPS topology Schema
dpsctl topology import topology.json

Confirm BMC connectivity for every node in the topology

Before activating, confirm DPS can reach every BMC in the imported pilot topology. Passing --topology checks every node in that topology in a single call. The topology only needs to be imported — not active — for this check to run.

dpsctl check connection --topology maxlps-pilot

See dpsctl check connection for the full response schema.

In the response, confirm:

  • success_nodes equals total_nodes (every node in maxlps-pilot was reached).
  • failure_nodes is [].

If failure_nodes is non-empty, each entry lists the failing node_name and an error_msg describing the BMC failure (typically connection timeout, TLS error, or authentication failure). Fix BMC reachability and credentials for those nodes — see the connectivity and credential troubleshooting later in this section — and re-run dpsctl check connection --topology maxlps-pilot until failure_nodes is empty before activating the topology.

Activate the topology

Activate the topology as a temporary end-to-end gate. Activation exercises the full DPS-to-BMC set-policy path against every node so any device-model or policy-application issue that the connectivity gate above could not catch surfaces now, at setup time, instead of immediately before the 5-rack managed run in Part 8. The topology will be deactivated in Part 3 so the unmanaged baseline in Part 5 runs with DPS out of the control path, and re-activated later in this runbook.

dpsctl topology activate --topology maxlps-pilot # Must match the name given in topology.json

Verify:

dpsctl tp list

Confirm the topology appears and leaf_node_names lists all 90 compute nodes.

Troubleshooting

If any of the commands above fail, the error reported by dpsctl will match one of the cases below. Each case lists the fix.

Validating the topology

dpsctl topology validate prints a list of ValidationError entries; the error field identifies the class of failure. The most common ones on a hand-authored topology file:

  • invalid_model — the file does not match the topology JSON schema (missing top-level Topology or Entities, wrong field casing, malformed JSON). Re-check against Managing Topologies.
  • device_not_found — an entity’s Type/Model pair is not in the DPS device registry (typo in Model, or the device was not seeded). Run dpsctl device list and correct the Model string to an exact match.
  • invalid_name — an entity, topology, or policy name has disallowed characters. Use lowercase letters, digits, and -.
  • invalid_secret_name — the Redfish.SecretName contains characters that are not valid for a Kubernetes Secret name. Use lowercase letters, digits, -, and ..
  • duplicate_entity — two entities share the same Name in the file. Make each node name unique.
  • referenced_entity_not_found / referenced_topology_entity_not_found — a topology entry lists a child that has no matching entity block. Compare the Topology.Entities[].Children list against the top-level Entities list.
  • self_reference / circular_dependency — a topology entity lists itself or creates a cycle through its children. Rebuild the parent/child chain so every leaf is reached exactly once from the root.
  • disconnected_graph — one or more compute nodes are not reachable from the topology root. Every compute node must be a descendant of the topology root entity.
  • invalid_connection — a parent entity cannot legally have the given child device type (for example, a rack entity listing another rack as a child). Correct the parent/child relationship to match the reference topology examples.

Fix the file, then re-run dpsctl topology validate until it prints Topology validation passed.

Importing the topology

  • failed to create topology / failed to upsert entities / failed to add topology entities — generic database failure. Check the DPS pod logs and the dps-postgresql pod, then retry once the database is healthy.

Activating the topology

  • topology maxlps-pilot is already active — activate was re-run after it succeeded. Skip ahead to dpsctl tp list to verify, or deactivate first with dpsctl topology deactivate --topology maxlps-pilot.
  • topology <name> not found / load-topology error — the value passed to --topology does not match any imported topology. Confirm the name used in topology.json and re-run dpsctl tp list to see what DPS has.

Part 3: Deactivate the Topology Before the Baseline Run

We activated the topology in Part 2, rather than waiting until Part 6, so any DPS-to-BMC issue would surface at setup time rather than immediately before the 5-rack managed run. The connectivity check (dpsctl check connection) and the activation in Part 2 both passed; deactivate now so the unmanaged baseline in Part 5 runs with DPS out of the control path.

dpsctl topology deactivate --topology maxlps-pilot

Verify:

dpsctl tp list --active

Confirm no topologies are listed as active. The topology definition itself remains imported and will be re-activated at the start of Part 6.

Part 4: Preparing for Testing

Workload requirements

What “baseline” means here. The baseline run is the unmanaged reference. DPS is installed and running, but no topology is active and no resource group exists, so it is not enforcing power policy on any node. This is the state produced by dpsctl topology deactivate in Part 3; confirm it with dpsctl tp list --active (no rows) and dpsctl rg list (no resource groups) before launching the workload.

For the 3-rack baseline and the 5-rack DPS-managed run to be directly comparable, the workload must meet these requirements:

  • It is an inference workload. Training workloads, synthetic power stressors (gpu-burn, etc.), and pure benchmarks do not exhibit the bursty, non-sustained power profile that envelope-based inference pilots target and will produce misleading results.
  • It is representative of the production workload. Ideally it is the actual model, framework, and request pattern (batch size, concurrency, input length distribution) you plan to run once the pilot succeeds.
  • It runs long enough to reach a steady-state comparison window. The “steady-state window” is the contiguous slice of the run on which the baseline and the 5-rack managed run will be compared in Part 9. Identify it as follows, and apply the same procedure on every run:
    • Run the workload for at least one hour total so the run is not dominated by ramp-up.
    • Trim ramp-up and cool-down from the start and end of the run.
    • Within the trimmed interval, require that both per-rack power draw and aggregate tokens/sec have visibly flattened on your time-series plots — no monotonic climb, no large step changes, only normal short-term workload jitter.
    • The remaining window must be at least 30 minutes long and contiguous; if it is shorter or broken up, extend the run rather than relying on a noisy window.
    • Record the start and end timestamps of this window and use exactly those timestamps when capturing every metric in What to collect below. The same procedure applies to the 5-rack managed run in Part 8 and to the comparison in Part 9. (A small averaging sketch after this list shows one way to apply the recorded timestamps to a raw power capture.)
  • The 3 pilot racks are under load in isolation. No other workloads should be sharing these racks or competing for the same facility power budget during the baseline window; otherwise the “3-rack baseline” no longer describes what the 3 racks alone can do.
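
Once the window’s timestamps are recorded, slicing and averaging a raw power capture against them can be scripted with standard tools. The sketch below assumes a two-column CSV of epoch seconds and watts per rack; the file name, layout, and timestamp values are placeholders, not a DPS or BMC export format.

# Sketch: steady-state mean power over the recorded window (epoch_seconds,watts CSV).
START=1718000000   # recorded window start (epoch seconds); placeholder value
END=1718001800     # recorded window end (epoch seconds); placeholder value
awk -F, -v s="$START" -v e="$END" \
  '$1 >= s && $1 <= e { sum += $2; n++ }
   END { if (n) printf "steady-state mean: %.1f W over %d samples\n", sum / n, n }' \
  rack01_power.csv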

What to collect

Before you run the baseline inference test, capture the following for the duration of the baseline run. You will compare these signals against the 5-rack managed run in Part 8 and interpret them using Part 9:

  • Aggregate workload throughput — requests/sec, queries/sec, or whichever KPI matches the workload.
  • Aggregate token throughput (tokens per second across the cluster) and latency percentiles (see Part 9 for the KPI set).
  • Workload metadata and configuration — application, model, ISL/OSL (input and output sequence length distributions), and any other relevant configuration parameters.
  • Per-rack and aggregate power draw, sampled at a consistent interval. Keep both the time series and a steady-state average.
  • Wall-clock duration of the run and of the steady-state window within it.
  • Per-GPU power draw distribution across the 3 racks, if available — useful for showing how headroom is being left on the table by static provisioning.
  • EDPp settings per node. The same EDPp setting should be used across the entire fleet.

Use whatever stack you already trust for time-series capture (BMC or rack telemetry, DCGM or vendor exporters, inference server metrics, facility BMS). The important part is consistent sampling, aligned clocks, and the same definitions on baseline and managed runs.

The baseline run establishes the reference numbers the 5-rack DPS-managed run will be compared against. It is run without DPS in the control path — that is, with no topology or resource group active.

This runbook does not prescribe how you launch the workload. Whatever you already use to orchestrate AI inference on these racks — Slurm, a Kubernetes inference operator, a Triton deployment, your own job runner — is exactly what you should use here. DPS cares about the power side of the run, not the scheduling side.

Prepping DPS parameters for testing

Before launching the baseline run, pick a value for prs.headroomPercent and apply it to the deployment. This Helm-side knob — first introduced in Prerequisites and revisited under Next Steps — sets how aggressively PRS reclaims slack from the per-node policy cap when it computes the resource group’s domain budget. Set it too high and PRS rarely engages, which shrinks the MaxLPS uplift you are trying to measure. Set it too low and PRS clips real workload peaks, which causes setpoint thrashing and observable overshoot. The value is read at DPS server startup, so you must land the Helm change before Part 6 activates the resource group.

Cluster default

Use prs.headroomPercent = 15 as the default for this pilot unless you have a specific reason to deviate. This is the largest value that still produces meaningful PRS engagement across the most common policies (from GB300-Per-90 down to GB300-Per-70) without forcing measurable overshoot on workloads that run close to the policy cap. Tighter or looser values exist for specific policy and workload combinations — the override list below covers them.

Per-policy override values

If you know which policy you will pass to dpsctl resource-group create --policy <name> in Part 6, pick the matching prs.headroomPercent value (abbreviated h below). The values are tuned for the mid utilisation tier (per-GPU mean draw roughly 50–80% of the policy cap), which is the sweet spot for PRS engagement and the best single proxy for typical inference workloads. See When to deviate from the default below if your workload sits outside the mid tier.

  • GB300-Per-90 — h = 20
  • GB300-Per-80 — h = 20
  • GB300-Per-75 — h = 15
  • GB300-Per-70 — h = 10
  • GB300-Per-65 — h = 5 (extrapolated; not yet validated on cluster)
  • GB300-Per-60 — h = 15 (PRS rarely engages at the mid tier on this policy; the value is chosen for provisioning headroom rather than active PRS engagement)
  • GB300-Per-50 — any (PRS does not engage; the policy cap sits below sustained workload draw)

When to deviate from the default

Map your workload to one of three tiers using the per-GPU power draw distribution you collect in What to collect above (or, if you have not yet run the baseline, an estimate from a previous unmanaged run on similar hardware). Compare per-GPU draw to the per-GPU implied cap of the policy you plan to use; a rough classification sketch follows the list:

  • Low / idle tier — per-GPU mean draw stays well below the policy cap (roughly less than 50% of the cap). PRS rarely engages on this tier; use the largest value that still produces some PRS activity. If no value does, prs.headroomPercent acts purely as a provisioning knob and the per-policy value above is safe.
  • Mid tier — per-GPU mean draw sits roughly 50–80% of the cap with consistent slack. The per-policy values above are tuned for this tier; use them as-is.
  • High tier — workload p95 per-GPU draw is within roughly 10% of the policy cap. PRS will clip real peaks here, so prefer one notch higher than the per-policy value (for example, GB300-Per-80 at h = 20 rather than the mid-tier value, or GB300-Per-70 at h = 15). Watch for intervals where observed draw exceeds the setpoint in BMC telemetry; if you see them, raise prs.headroomPercent further.
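
As a rough numerical proxy for the tier mapping above, the sketch below computes mean and p95 per-GPU draw from a flat capture and compares them to the policy’s implied per-GPU cap. The input layout (one watts sample per line), the file name, and the cap value are assumptions you supply; the boundaries mirror the tiers above, so treat the output as a starting point rather than a substitute for inspecting the full distribution.

# Sketch: classify the utilisation tier from per-GPU power samples (one watts value per line).
POLICY_CAP_W=1000   # per-GPU implied cap of the policy you plan to use (placeholder)
sort -n gpu_power_watts.txt | awk -v cap="$POLICY_CAP_W" '
  { a[NR] = $1; sum += $1 }
  END {
    idx = int(NR * 0.95); if (idx < 1) idx = 1
    mean = sum / NR; p95 = a[idx]
    tier = (p95 >= 0.9 * cap) ? "high" : (mean < 0.5 * cap ? "low" : "mid")
    printf "mean/cap = %.2f  p95/cap = %.2f  ->  %s tier\n", mean / cap, p95 / cap, tier
  }'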

Apply the value via helm upgrade

Land the chosen value on the cluster before Part 6. The mechanics match the Deployment Guide upgrade flow:

  1. Edit the same values.yaml you used at install time and set prs.headroomPercent alongside prs.enabled:

    dps:
      prs:
        enabled: true
        headroomPercent: 15
  2. Apply the change with helm upgrade against the dps release in the dps namespace:

    helm repo update ngc
    helm upgrade dps ngc/dps \
      --namespace dps \
      --values values.yaml
  3. Wait for the DPS server pod to roll over, then confirm the new value is set on the running container by reading the PRS_HEADROOM_PERCENT environment variable that the chart renders from prs.headroomPercent:

    kubectl rollout status -n dps statefulset/dps-server
    kubectl get -n dps statefulset/dps-server \
      -o jsonpath='{.spec.template.spec.containers[?(@.name=="dps-server")].env[?(@.name=="PRS_HEADROOM_PERCENT")].value}'

    The second command should print the value you set (for example, 15).

Note: Change prs.headroomPercent only between runs, never mid-run, and pair every change with a fresh telemetry capture so you can attribute KPI deltas to the headroom change rather than to confounding workload variation. This is the same single-variable discipline called out under Next Steps.

Part 5: Running the Baseline Test

Launch the workload across the first 3 of the 5 pilot racks (54 compute nodes) using your existing orchestration tooling. After the deactivate step in Part 3, no topology is active and no resource group exists yet, so DPS is not in the control path during this baseline run.

Part 6: Create and Activate DPS Resource Group

The topology was deactivated in Part 3 so the baseline could run with DPS out of the control path. Re-activate it now before creating the resource group; the topology is already imported, so this is the only setup step required:

dpsctl topology activate --topology maxlps-pilot

Verify:

dpsctl tp list --active

Confirm maxlps-pilot appears as the active topology.

Create one resource group containing every compute node in the topology, with DPM, PRS, and shared-GPU all enabled. The --shared-gpu-enable flag is what binds the resource group’s GPU power consumption to a shared, group-wide budget rather than per-node caps.

dpsctl resource-group create \
  --resource-group maxlps-pilot \
  --external-id 1 \
  --dpm-enable \
  --prs-enabled \
  --shared-gpu-enable \
  --policy GB300-Per-90

--external-id is required and must be a 64-bit integer. It is the handle a workload scheduler or provisioning system would normally use to correlate this resource group with one of its own records. In the scheduler-independent configuration this runbook describes, no external system owns the resource group — pick any positive integer and keep a note of it for your own records. DPS does not enforce uniqueness on this value, but using a distinct ID per resource group makes dpsctl rg list output easier to read once you have more than one.

Add every compute node in the topology to the resource group. Substitute your real node names for gb300-nvl-001-compute01,gb300-nvl-001-compute02,...:

dpsctl resource-group add \
  --resource-group maxlps-pilot \
  --entities "gb300-nvl-001-compute01,gb300-nvl-001-compute02,gb300-nvl-001-compute03..."

See dpsctl resource-group add for additional flags (partial-update behavior, strict policy, etc.).
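
Building the 90-entry --entities string by hand is error-prone, so a short loop can assemble it instead. The hostname pattern below (racks gb300-nvl-001 through gb300-nvl-005, computes 01 through 18) is an assumption extrapolated from the preview topology; substitute your real naming scheme.

# Sketch: assemble the comma-separated entity list for 5 racks x 18 nodes each.
ENTITIES=$(for rack in 001 002 003 004 005; do
  for node in $(seq -w 1 18); do
    printf 'gb300-nvl-%s-compute%s,' "$rack" "$node"
  done
done)
ENTITIES=${ENTITIES%,}   # drop the trailing comma

dpsctl resource-group add \
  --resource-group maxlps-pilot \
  --entities "$ENTITIES"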

Activate the resource group. Use --sync so the command blocks until every node has its policy applied:

dpsctl resource-group activate \
  --resource-group maxlps-pilot \
  --sync

See dpsctl resource-group activate for the full activation flag set, including strict-policy and partial-timeout options.

Verify:

dpsctl rg list --active

Confirm the maxlps-pilot resource group appears with:

  • activation_status: active
  • dpm_enable: true
  • prs_enabled: true
  • resource_names listing all 90 compute nodes

Troubleshooting

If any of the three commands above fails, the error message returned by dpsctl matches one of the cases below. Each case lists the fix.

Creating the resource group
  • resource group already exists — an RG named maxlps-pilot is already in the database. Run dpsctl rg list to confirm, then either reuse it or remove it first with dpsctl resource-group delete --resource-group maxlps-pilot.
  • resource-group name validation error — the --resource-group value contains disallowed characters. Use lowercase letters, digits, and -, matching the naming convention used elsewhere in this runbook.
  • failed to create resource group — generic database failure. Check the DPS pod logs and the dps-postgresql pod, then retry once the database is healthy.
Adding resources to the resource group
  • no active topologies — the topology activation step in Part 2 was skipped or rolled back. Run dpsctl tp list --active and activate the pilot topology before retrying.
  • some entities are not in the active topology — one or more node names in --entities are not part of the active topology. Compare the list against dpsctl tp list --active and correct the typo, or reimport the topology JSON.
  • some entities are already in a resource group — those nodes belong to another RG. Remove them from the other RG first, or delete that RG.
Activating the resource group
  • resource group maxlps-pilot is already active — activate was re-run after it succeeded. Skip ahead to the verify step, or remove the RG first with dpsctl resource-group delete (DPS has no standalone deactivate subcommand).
  • resource group maxlps-pilot has no devices — the resource-group add step did not land. Run dpsctl rg list and confirm resource_names is populated before retrying.
  • power budget exceeded for resource group maxlps-pilot, but reprovision is false — the sum of per-node policy caps, scaled by PRSHeadroomPercent, exceeds the topology’s available budget. Lower the policy cap or raise the topology budget.
  • failed to activate resource group with per-node failures in the --sync response — the BMCs for those nodes rejected the set-limit request. Re-run dpsctl check connection --nodes <node1>,<node2> against the failing nodes to confirm BMC reachability and credentials, then retry activation. Nodes that succeeded stay active; only the failing nodes need to be resolved.

Part 7: Constrain the Data Center with NvGrid

The 5-rack managed run in Part 8 must execute inside the same operating power envelope as the 3-rack baseline in Part 5. DPS publishes that envelope to its control loop through NvGrid, which sets a load target on a power feed in the active topology. This step is what converts “5 racks of hardware running an inference workload” into “5 racks of hardware running an inference workload inside a 3-rack power envelope,” which is the MaxLPS pattern this runbook validates.

The constraint targets the root-pdu feed — the FeedTag set on the PDU-Root PowerDomain entity in both maxlps-pilot-gb300.json and maxlps-pilot-gb200.json. For complete coverage of NvGrid, see NvGrid Concepts and Managing NvGrid.

Choose the constraint value

The constraint is the per-rack MaxP (the rack PDU OperatingLimit from the topology) multiplied by the 3 racks that carried the baseline. Assuming all compute nodes in the pilot racks are healthy:

  • GB200 NVL72: 125 kW per rack × 3 racks = 375 kW (matches maxlps-pilot-gb200.json).
  • GB300 NVL72: 135 kW per rack × 3 racks = 405 kW (matches maxlps-pilot-gb300.json).

If any compute nodes in the pilot racks are out of service, recompute the constraint as MaxP × the count of racks the baseline actually exercised so the managed run still operates inside an apples-to-apples envelope.

Apply the load target

Set the load target with an immediate start time and no end time so the constraint stays in effect for the duration of Part 8. The constraint is explicitly cleared in Cleanup — leaving it active after the pilot will continue to enforce a 3-rack envelope on whatever runs next on this fleet.

The example below uses the GB300 value of 405 kW. Substitute 375 kW for GB200:

dpsctl nvgrid set-load-target \
  --value 405000 \
  --unit watt \
  --feed-tags root-pdu \
  --start-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --best-effort true

Verify:

dpsctl nvgrid get-current

Confirm the active target reports the value you set on the root-pdu feed.

Troubleshooting

If the command above fails or the verify step does not show the expected target, the error reported by dpsctl will match one of the cases below. Each case lists the fix.

  • feed tag not found — the active topology’s PDU-Root entity is missing FeedTag: "root-pdu". Compare against the reference topology files linked in Part 2, correct the JSON, and re-import and re-activate the topology.
  • no active topologies — the topology was deactivated since Part 6. Re-activate with dpsctl topology activate --topology maxlps-pilot before retrying.
  • Load target accepted but get-current shows it inactive — confirm the resource group from Part 6 is active with dpm_enable: true. NvGrid can only reduce power on DPM-enabled resource groups; without DPM, the target is recorded but not enforced. See DPM-Enabled Jobs.

Part 8: 5-Rack Inference Test Under DPS

The only intended change between Part 5 and Part 8 is that DPS is now actively managing power across all 5 pilot racks. Workload, model, framework, batching, concurrency, and the operating envelope are unchanged; if anything else differs, the comparison in Part 9 is invalid.

The two runs differ on exactly two dimensions:

  • Racks under load — Baseline (Part 5): 3 racks (54 compute nodes). Managed (Part 8): 5 racks (90 compute nodes).
  • DPS control path — Baseline: not in the control path; no topology and no resource group active. Managed: in the control path; topology maxlps-pilot and resource group maxlps-pilot active.
  • Operating power envelope — Same in both runs.
  • Workload, model, framework, batching, concurrency — Same in both runs.

The operating power envelope is held the same by different mechanisms in each run: in the baseline, the envelope is held implicitly by running on only 3 racks; in the managed run, it is held explicitly by the NvGrid load target applied in Part 7 on the root-pdu feed, which DPS enforces across all 5 racks via the resource group’s shared-GPU policy.

With the topology active and the pilot resource group (maxlps-pilot) in place—created and maintained without integration with your workload schedulers or provisioning systems—re-run the same workload you used for the 3-rack baseline (same model, same framework, same request pattern), now across all 5 racks (90 nodes) and with DPS in the control path.

As in Part 5, this runbook does not prescribe how you launch the workload. Use the same orchestration tooling (Slurm, Kubernetes inference operator, Triton, your own job runner) you used for the baseline. The only thing that changes between the two runs is the number of racks under load and whether DPS is enforcing power policy — everything else must stay the same so the comparison is meaningful.

Workload requirements

The 5-rack run must be directly comparable to the 3-rack baseline. In addition to the baseline’s requirements:

  • It is the same workload as the baseline. Same model, same framework, same batch size, same concurrency, same input distribution. If anything about the workload itself changes, the throughput numbers are no longer comparable.
  • All 5 pilot racks are under load in isolation. No other workloads should be sharing these racks or competing for the same facility power budget during the run. The same operating envelope that constrained the 3-rack baseline should constrain the 5-rack run.
  • DPS is in the control path. Confirm the pilot topology from Part 2 and the active resource group from Part 6 are in place (dpsctl tp list, dpsctl rg list) and that BMC connectivity is healthy (dpsctl check connection --topology maxlps-pilot) before launching the workload.
  • It produces a comparable steady-state window. DPS’s per-GPU caps are set dynamically from telemetry, so the run needs enough duration for the control loop to settle into steady-state redistribution, not just the workload itself. Apply the same steady-state window procedure defined in Part 4 under “Workload requirements”, and make the resulting window at least as long as the one used for the 3-rack baseline.

Enable DPS Prometheus metrics

dps-server exports operational metrics in Prometheus text format on its HTTP service at /metrics. Enable scraping for the pilot so the DPS-side signals called out in What to record below — per-GPU cap movement and group-sum allocation versus observed aggregate draw — land in the same time-series store as the workload and BMC telemetry you captured in Part 4, and can be cross-checked against the steady-state window with PromQL. The chart-managed ServiceMonitor (dps.serviceMonitor.enabled: true) is the recommended path for in-cluster Prometheus and is applied via the same helm upgrade flow used for prs.headroomPercent in Part 4; see Prometheus Monitoring for the full setup, including external Prometheus and TLS variants.
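
In practice this is usually a one-key addition to the same values.yaml used in Part 4, applied with the same helm upgrade invocation. The fragment below assumes the dps.serviceMonitor.enabled key named above is the only change needed for an in-cluster Prometheus; see Prometheus Monitoring for the authoritative key set and the external or TLS variants.

    dps:
      prs:
        enabled: true
        headroomPercent: 15    # value chosen in Part 4
      serviceMonitor:
        enabled: true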

Note: Land this Helm change once, before the 5-rack managed run starts so the steady-state window in What to record is fully covered by Prometheus samples — the same single-variable, pre-run discipline applied to prs.headroomPercent.

Running the workload

Launch the workload across all 5 pilot racks (90 compute nodes) using the same orchestration tooling and submission flow you used for the 3-rack baseline. DPS is now in the control path and will actively redistribute per-GPU power headroom across the resource group in response to live BMC telemetry.

What to record

Capture the same metrics you captured for the 3-rack baseline, over a comparable steady-state window, so the two runs can be compared directly.

In addition, capture the following observations specific to the DPS-managed run:

  • Per-GPU cap movement during the run — the per-GPU power_max that DPS is actively setting, sampled over time. This demonstrates DPS redistributing budget live, rather than statically provisioning.
  • GPU power allocations vs. observed aggregate draw — confirms the group sum is being enforced rather than per-node caps leaving headroom on the table.
  • DPS-provided Prometheus metrics — once scraping is enabled per Enable DPS Prometheus metrics, capture the DPS server’s /metrics series for the same steady-state window so per-GPU cap movement and group-sum enforcement above can be cross-checked against PromQL.

To sample per-GPU caps and utilization over time, we recommend using DCGM on a representative node set on a fixed interval (for example every 60 seconds via a simple loop or your automation), or export the same fields from your existing BMC/telemetry pipeline. Store set_limit / max_limit alongside observed GPU power so cap movement can be plotted against workload phase.
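
A minimal sampling loop along those lines, using nvidia-smi fields as a stand-in for the DCGM or exporter fields named above (the output path and 60-second interval are arbitrary choices):

# Sketch: append per-GPU draw and caps to a CSV every 60 s on one representative node.
while true; do
  nvidia-smi \
    --query-gpu=timestamp,index,power.draw,power.limit,power.max_limit \
    --format=csv,noheader >> /var/tmp/gpu_power_caps.csv
  sleep 60
done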

Part 9: Evaluate Metrics

This part turns the raw captures from Part 4, Part 5, and Part 8 into a decision. Compare runs only on their steady-state windows — at minimum 1 hour total run time, ramp-up and cool-down trimmed, both per-rack power and aggregate tokens/sec visibly flattened, contiguous window of at least 30 minutes, and the exact start/end timestamps recorded. The full procedure is defined in Part 4 under “Workload requirements”. Do not compare headline numbers unless the steady-state windows, concurrency, and request mix match between the 3-rack baseline and the 5-rack managed run.

Pilot success quick-check

Before reading the KPI sections, confirm the run satisfies every pass condition below. A miss on any item means the pilot did not meet its goals on this configuration.

  • Compliance — Pass condition: zero new compliance events in the managed run’s steady-state window. Source KPI: Compliance events (Data center power and compliance KPIs).
  • Throughput uplift — Pass condition: aggregate output tokens/sec strictly greater than baseline at matched load. Source KPI: Aggregate output tokens/sec (Inference service KPIs).
  • Service stability — Pass condition: error rate matches baseline within ±0.1 percentage points. Source KPI: Success / error rate (Inference service KPIs).
  • Envelope adherence — Pass condition: overall observed power ≤ the binding data center constraint for ~100% of steady-state samples. Source KPI: Overall observed power (Data center power and compliance KPIs).

Time alignment

  • Clock authority. Synchronize every sampling agent — BMC/BMS exporter, DCGM, inference server, load generator — to the cluster’s NTP/chrony source (typically the cluster head node or site-wide NTP server) before either run. All timestamps in this Part are interpreted relative to that authority. If a sampling agent cannot sync to the cluster source, document the offset and treat its data as advisory, not authoritative.
  • Use the same wall-clock sampling interval (or finer on the managed run) for power and service metrics.
  • If you change anything about the client (concurrency, prompt distribution, batching), treat the run as a new experiment—do not attribute deltas to DPS.

Data center power and compliance KPIs

The definitions below assume you can obtain an overall facility or pilot-segment power budget (the data center constraint), provisioned electrical or thermal capacity for the GPUs or systems under test (from nameplate TDP, derated caps, or your capacity model), and metered or telemetry-derived actual draw for the same boundary. Map each to your monitoring source (BMS, rack PDU, BCM, or aggregated BMC).

For each KPI below, the definition is followed by what to expect in a successful pilot.

  • Compliance events — Definition: distinct count of intervals (or events) where overall observed power exceeds any applicable data center constraint (contract limit, breaker headroom, pilot PowerDomain budget—use the tightest binding constraint you track). Expect: zero new compliance events in the managed run’s steady-state window. The baseline may already be clean; the managed run must not introduce excursions.
  • Overall provisioned power — Definition: total system power (provisioned), the sum (or budgeted envelope) of provisioned capacity for the systems in scope (for example, aggregate GPU or node TDP limits you could enforce statically). Expect: the provisioned total for the 5-rack fleet can be higher than for the 3-rack baseline because more hardware sits in the topology, while the same pilot envelope caps observed aggregate draw. You are measuring whether you provisioned more compute than naive sizing would have allowed under that envelope.
  • Overall observed power — Definition: total system power (observed utilization), the metered or telemetry-summed draw for the same boundary as provisioned and constraints. Expect: must remain ≤ the binding data center constraint for essentially 100% of steady-state samples on the managed run (allow rare spikes only if your constraint definition already includes an agreed crest factor). Expect observed draw at or below the pilot envelope you set in the topology.
  • Capacity efficiency — Definition: power budget minus total system power (provisioned). Expect: interpret as headroom between the envelope and what you provisioned on paper. With MaxLPS-style sizing, this number is often smaller than in conservative MaxP-only planning because you deliberately provision closer to the envelope—that is expected if compliance stays clean.
  • Provisioning efficiency — Definition: total system power (provisioned) minus total system power (observed utilization). Expect: interpret as stranded provisioned headroom not appearing as draw. You often want this gap to shrink when inference loads the fleet more evenly; a large persistent gap under load suggests under-utilization or caps binding unevenly.
  • Resource group efficiency — Definition: aggregated average power (W) divided by aggregated provisioned TDP limit (W) for all members of the pilot resource group, using the same time window. Expect: ratios of roughly 0.35–0.85 are common for bursty inference once warm; values sustained below 0.25 under declared steady load suggest the workload or caps are not exercising the fleet, while values sustained above 0.95 leave little margin for spikes—watch compliance.
  • Node efficiency — Definition: node average power (W) divided by node provisioned TDP limit (W) per compute node, then compare distributions across racks. Expect: a similar spread to baseline at matched load, with managed runs sometimes showing higher averages on more nodes because PRS moves limits toward demand. Flag nodes stuck below 0.2 while peers approach limits—possible mapping, cooling, or telemetry issues.

If your tooling cannot yet compute one of the aggregates exactly as defined, document the substitute signal (for example sum of BMC inlet metrics vs. rack PDU) and apply it consistently on both runs.

Inference service KPIs

These metrics come from the inference stack (vLLM, Triton, TensorRT-LLM, proprietary routers, etc.), not from DPS. Prefer server-reported token and latency counters over client-only estimates.

For each KPI below, the definition is followed by what to expect in a successful pilot.

  • Aggregate output tokens/sec — Definition: total output tokens completed per second across all replicas in scope (cluster-wide), measured over the steady-state window. Exclude failed generations if your stack exposes goodput separately. Expect: strictly greater than the 3-rack baseline steady-state value when load generator settings are unchanged—this is the headline uplift. Relative uplift depends on how much headroom the baseline left; single-digit to mid-double-digit percent improvements are plausible when the fleet was envelope-starved, but your measured baseline is the reference, not a fixed percentage from this list.
  • Aggregate total tokens/sec — Definition: input + output tokens per second, if your economics or SLAs track both. Expect: should move roughly in proportion to output tokens/sec unless the prompt length distribution changed.
  • Tokens/sec per concurrent user (or per active session) — Definition: aggregate output tokens/sec divided by a defined concurrency measure (open sessions, active clients, or in-flight requests—pick one and keep it fixed). A proxy for interactive saturation. Expect: at fixed concurrency, within ~±15% of the baseline value if power is not the dominant bottleneck; a large sustained drop warrants checking for cap-induced queueing or straggler GPUs.
  • Time to first token (TTFT) — Definition: latency from request accept to first output token. Report p50/p95/p99 over the window. Expect: p95 within ~1.2× of baseline p95 and p99 not more than ~1.5× baseline p99 when configuration is healthy—PRS can reorder headroom and introduce modest tail movement. Investigate if p99 regresses beyond ~2× baseline.
  • Inter-token latency (ITL) / decode token interval — Definition: spacing between successive output tokens (stream quality). Expect: a similar tolerance band as TTFT—p95 ~1.2× baseline, p99 ~1.5× baseline as a first-pass gate; tighten to your product SLO if stricter.
  • End-to-end request latency — Definition: wall time for the full response. Expect: same percentile bands as TTFT/ITL for streaming workloads; for non-streaming, this is often the primary user-facing metric—use the same ~1.2× / ~1.5× p95/p99 gates unless your SLO says otherwise.
  • Success / error rate — Definition: fraction of requests completing without server-side error. Expect: match baseline within ±0.1 percentage points on error rate; regressions here invalidate throughput comparisons.

The ±15%, 1.2×, and 1.5× bands above are starting heuristics for a first pilot read—not substitutes for contractual SLOs. Replace them with your own targets where you have historical production data.

How to document the outcome

  1. Compliance table — List each binding constraint, observed max, and count of excursions (baseline vs. managed).
  2. Throughput and latency table — Baseline vs. managed for aggregate tokens/sec, tokens/sec per concurrent user, and p50/p95/p99 for TTFT (or TTFT + E2E).
  3. Efficiency plots — Time series for overall observed vs. constraint, resource group efficiency, and (optional) per-rack node efficiency histograms.
  4. DPS overlay — Example plots: per-GPU set_limit vs. observed GPU power for a sample of nodes, keyed to the same timestamps as inference metrics.

Use the following one-page layout for the final comparison output. Each section should be a single paragraph, table, or plot — anything longer belongs in an appendix.

  1. Run metadata — date, operator, topology name, resource group name, policy, prs.headroomPercent, workload identifier, model, framework.
  2. Steady-state windows — start/end timestamps for the baseline run and the managed run, plus total run duration.
  3. Clock authority — name of the NTP/chrony source used and any sampling agents whose offset was outside it.
  4. Pilot success quick-check — a copy of the Pilot success quick-check from this Part with the actual pass/fail result per check.
  5. Compliance summary — binding constraints, observed maxima, and excursion counts (baseline vs. managed).
  6. Power KPI table — baseline vs. managed values for each KPI under Data center power and compliance KPIs.
  7. Service KPI table — baseline vs. managed values for each KPI under Inference service KPIs.
  8. Plots — overall observed power vs. constraint over time; per-GPU set_limit vs. observed GPU power for a sample of nodes; (optional) per-rack node-efficiency histograms.
  9. Verdict and next action — pass / fail / partial pass, plus the single next change to make if not a pass (per the discipline in Next Steps).

If compliance stays clean, aggregate output tokens/sec rises materially, and latency percentiles remain inside your agreed bands, you have a documented case that the pilot configuration met its goals. If not, capture the same package of telemetry before changing topology, headroomPercent, or workload shape so the next iteration is evidence-driven.

Cleanup

When the pilot is complete, reverse the configuration in the opposite order you applied it.

1. Reset the NvGrid load target

The 3-rack envelope set in Part 7 was applied with no end time, so it stays in effect until you explicitly clear it. Reset the root-pdu feed to its default constraint before deleting the resource group — otherwise DPS continues to enforce the 3-rack envelope on whatever runs next on this fleet. You can do this by setting the load target to 0.

dpsctl nvgrid set-load-target \
  --value 0 \
  --unit watt \
  --feed-tags root-pdu

Verify:

dpsctl nvgrid get-current

Confirm no active target is reported on the root-pdu feed.

2. Delete the resource group

DPS does not expose a standalone resource-group deactivate subcommand — deleting an active resource group automatically deactivates it first. See dpsctl resource-group delete.

dpsctl resource-group delete --resource-group maxlps-pilot

3. Deactivate the topology

dpsctl topology deactivate --topology maxlps-pilot

See dpsctl topology deactivate for details.

At this point DPS is no longer in the power-control path for the pilot racks.

Verify per-GPU caps returned to default. DPS does not actively reset per-GPU power.limit on deactivate; the pilot’s racks should already be at their default limit because the topology is no longer enforcing one, but a lingering cap from a failed activation, a manual nvidia-smi -pl ..., or another tool will not be cleared by dpsctl topology deactivate. Confirm on a representative compute node from each pilot rack:

nvidia-smi --query-gpu=index,power.default_limit,power.limit --format=csv

Every GPU should report power.limit equal to power.default_limit (within rounding). If any GPU shows a lower power.limit, reset it explicitly before moving on:

sudo nvidia-smi -i <gpu-index> -pl <default-limit-watts>

Repeat on each affected GPU. Do not proceed to Next Steps or to the optional uninstall below until every sampled GPU reads back its default limit.
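
If more than a handful of GPUs need resetting, a short loop over the queried defaults avoids retyping the command per GPU. This is a sketch that assumes the driver on your platform accepts the reported default limit as a -pl value:

# Sketch: reset every GPU on this node to its reported default limit.
nvidia-smi --query-gpu=index,power.default_limit --format=csv,noheader,nounits |
while IFS=', ' read -r idx watts; do
  sudo nvidia-smi -i "$idx" -pl "${watts%.*}"   # strip decimals for -pl
done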

4. (Optional) Uninstall DPS

If you have more tuning passes planned (different prs.headroomPercent values, different policies, additional workloads from Next Steps), skip this section. DPS can remain installed indefinitely; with the resource group deleted and the topology deactivated, it is not in the power-control path and is safe to leave in place between runs.

Only if you are done with the pilot and want to remove DPS entirely:

helm uninstall dps -n dps

Optionally remove the namespace and any BMC secrets created for the pilot:

kubectl delete namespace dps

Next Steps

The pilot topology and decoupled resource group are intentionally small so you can prove behavior end-to-end. After Part 9 shows the outcome you want, work the three priorities below in order, re-running the compliance and service KPI captures from Part 9 after each change.

1. Model a larger, more complex topology

Expand beyond the 5-rack sketch to full rows, feeds, and redundancy paths your facility actually runs: additional PowerDomain boundaries, floor and rack PDUs, diversity groups, and any devices that were simplified or omitted here. The goal is to practice the same import → activate → verify loop on data that resembles a future data center rollout, so surprises surface in the lab—not the first production change window.

See Topologies for the entity model and Managing Topologies for operational procedures. Re-run the deployment-health gate from Part 1 and the connectivity gate (dpsctl check connection --topology <name>) from Part 2 after each major topology change.

2. Tune power budgets, PRS headroom, and overprovisioning—carefully

Many pilots still show observed draw below 100% of the binding budget during steady inference: that is normal when burstiness, headroomPercent, or conservative caps leave slack. You tune that in two different places:

  • Topology PowerDomain — The operating envelope is expressed as the domain’s OperatingLimit (the cap DPS enforces against). Overprovisioning is not a Helm chart setting: it is the optional numeric field OverProvisioningPercentage on the PowerDomain entity in your topology JSON. Per the topology JSON schema (PowerDomain definition), provisioning headroom scales with the operating limit as provisioning limit = operating limit × (1 + OverProvisioningPercentage / 100)—see the schema for exact typing, constraints, and the dependency that OverProvisioningPercentage requires an OperatingLimit. After editing the topology, re-import and activate it following Managing Topologies, then re-validate metrics before comparing runs. A minimal JSON sketch follows this list.
  • Helm PRS — Adjust prs.headroomPercent only through the deployment chart (see Prerequisites and Power Reservation Steering), using the Deployment Guide for install/upgrade mechanics.
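
To make the topology path concrete, the sketch below adds an illustrative overprovisioning value to the pilot’s PowerDomain entity. The 20 is purely a placeholder: by the formula above it would raise the provisioning limit of the 675 kW pilot domain to 810 kW, while the 675 kW OperatingLimit remains the envelope DPS enforces.

{
  "OperatingLimit": { "PowerValue": { "Type": "W", "Value": 675000 } },
  "OverProvisioningPercentage": 20,
  "FeedTag": "root-pdu",
  "Name": "PDU-Root",
  "Redfish": { "URL": "https://localhost" },
  "Type": "PowerDomain"
}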

Use small steps on OverProvisioningPercentage and prs.headroomPercent so DPS can hand out a little more effective GPU budget—only while aggregate telemetry and facility constraints prove you remain safe.

Treat this as an experiment matrix, not a single knob turn:

  • Change one family of settings at a time (PowerDomain’s OperatingLimit, PowerDomain’s OverProvisioningPercentage, or Helm prs.headroomPercent), apply it through the correct path (topology workflow vs. Helm upgrade), and re-run a comparable workload window.
  • Stay within roughly 10% of the true electrical or contractual limit on the binding constraint when you first push outward—enough to learn sensitivity without jumping past derated margins you have not modeled here.
  • Observe heavily: the same compliance, efficiency, and inference tables from Part 9 should improve or at least not regress; any increase in compliance events or tail latency is a signal to roll back that change before stacking the next one.

3. Experiment with different workloads

Repeat measurement passes with other inference models, batch sizes, concurrency levels, and prompt length distributions that matter for your road map—including bursty or seasonal patterns if you can simulate them. The goal is to learn where token throughput and latency stay acceptable as power policy moves, not to optimize a single demo model forever. Non-inference loads (training, stress tools) remain poor substitutes for the envelope story this runbook targets.

Once those three are stable, common production-readiness follow-ons are: integrating scheduling and provisioning (Managing Resource Groups); promoting Part 9 plots into dashboards and SLOs; running multi-hour or multi-day soaks; validating the topology model against as-built power; defining change control for PowerDomain OperatingLimit and OverProvisioningPercentage; and rehearsing failure and maintenance drills with DPS active. Sequence them by risk — topology fidelity and compliance first, then throughput tuning, then integration and operations.