Troubleshooting the NVL72 inference power pilot
Use this page when a command in the Inference power pilot: NVL72 fleet power management runbook fails or returns an unhealthy state. Start with the section that matches the step you were running.
- Deployment Verification
- Topology Validation and Import
- BMC Health Check
- Topology Activation
- Resource Group Creation
- Adding Resources to the Resource Group
- Resource Group Activation
- PRS Log Checks
Deployment Verification
Use this section when dpsctl verify reports an unhealthy component during Part 1, Step 3 of the runbook.
To re-check a single component after a fix, pass its flag — for example dpsctl verify --database. Any combination of --dps-server, --database, --auth, --ui, and --bcm is valid.
- dps_server unhealthy: Check that the DPS server pod is Running; kubectl -n dps logs statefulset/dps-server usually shows the underlying error.
- database unhealthy: Check the Postgres pod and the database connection secret; a failed ping is almost always reachability or credentials, not load.
- auth unhealthy: For LDAP, verify ServerURL, BindDN, BindPassword, and CA/client certificates. For JWT, verify the private key file is mounted and readable.
- ui unhealthy or missing: Confirm the dps-ui Ingress exists in the same namespace as the DPS server with at least one rule and a non-empty host.
- bcm unhealthy: Confirm the BCM credential secret is present, the URL is reachable from the cluster, and the account has API access. If BCM is not deployed, expect skipped: true instead.
Topology Validation and Import
Use this section when dpsctl topology validate topology.json or dpsctl topology import topology.json fails during Part 2, Step 5 of the runbook.
dpsctl topology validate prints a list of ValidationError entries; the error field identifies the class of failure. The most common ones on a hand-authored topology file:
- invalid_model: the file does not match the topology JSON schema (missing top-level Topology or Entities, wrong field casing, malformed JSON). Re-check against Managing Topologies.
- device_not_found: an entity’s Type/Model pair is not in the DPS device registry (typo in Model, or the device was not seeded). Run dpsctl device list and correct the Model string to an exact match.
- invalid_name: an entity, topology, or policy name has disallowed characters. Use lowercase letters, digits, and hyphens (-).
- invalid_secret_name: the Redfish.SecretName contains characters that are not valid for a Kubernetes Secret name. Use lowercase letters, digits, hyphens (-), and periods (.).
- duplicate_entity: two entities share the same Name in the file. Make each node name unique.
- referenced_entity_not_found / referenced_topology_entity_not_found: a topology entry lists a child that has no matching entity block. Compare the Topology.Entities[].Children list against the top-level Entities list.
- self_reference / circular_dependency: a topology entity lists itself or creates a cycle through its children. Rebuild the parent/child chain so every leaf is reached exactly once from the root.
- disconnected_graph: one or more compute nodes are not reachable from the topology root. Every compute node must be a descendant of the topology root entity.
- invalid_connection: a parent entity cannot legally have the given child device type (for example, a rack entity listing another rack as a child). Correct the parent/child relationship to match the reference topology examples.
Fix the file, then re-run dpsctl topology validate until it prints Topology validation passed.
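A local pre-check can catch some of these error classes before another round trip to dpsctl topology validate. The sketch below is not the DPS validator. It assumes a file shape inferred from the field names above (a top-level Entities list with Name, and Topology.Entities entries with Name and Children), and the root parameter is a hypothetical argument naming the topology root entity. It does not detect multi-step cycles, and it flags any unreached entity rather than only compute nodes.

```python
import re
from collections import Counter

NAME_RE = re.compile(r"[a-z0-9-]+")  # lowercase letters, digits, and "-"


def precheck_topology(doc, root):
    """Return (error_class, detail) pairs for a parsed topology document.

    Assumed shape (not confirmed against the DPS schema):
      doc["Entities"]             -> list of {"Name": ...}
      doc["Topology"]["Entities"] -> list of {"Name": ..., "Children": [...]}
    `root` is the name of the topology root entity (hypothetical parameter).
    """
    problems = []
    names = [e["Name"] for e in doc.get("Entities", [])]
    known = set(names)

    for name, count in Counter(names).items():
        if count > 1:
            problems.append(("duplicate_entity", name))
    for name in sorted(known):
        if not NAME_RE.fullmatch(name):
            problems.append(("invalid_name", name))

    children = {
        t["Name"]: list(t.get("Children", []))
        for t in doc.get("Topology", {}).get("Entities", [])
    }
    for parent, kids in children.items():
        for kid in kids:
            if kid == parent:
                problems.append(("self_reference", parent))
            elif kid not in known:
                problems.append(("referenced_entity_not_found", kid))

    # Walk down from the root; any entity never reached is disconnected.
    reached, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in reached:
            continue
        reached.add(node)
        stack.extend(children.get(node, []))
    for name in sorted(known - reached):
        problems.append(("disconnected_graph", name))
    return problems
```

To try it, load topology.json with json.load, pass the resulting dictionary and the root entity name to precheck_topology, and fix anything it reports before re-running the real validation.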
If dpsctl topology import topology.json fails:
- failed to create topology / failed to upsert entities / failed to add topology entities: generic database failure. Check the DPS pod logs and the dps-postgresql pod, then retry once the database is healthy.
BMC Health Check
Use this section when dpsctl verify bmc-health reports issues during Part 2, Step 6 or Part 2, Step 7 of the runbook.
Use the issues list in the dpsctl verify bmc-health report as the starting point. Each issue includes a node, resource, code, observed value, threshold, and message. The dpsctl verify bmc-health reference documents the command flags and asynchronous task model.
After fixing an affected node, scope the retry to that node before re-running the full topology check:
dpsctl verify bmc-health start --topology maxlps-pilot --nodes <node-name> --force-writes --expected-edpp-pct 100 --samples-per-telemetry 2500 --telemetry-interval 500ms --wait --summary-only
If a failure looks like slow BMC readback rather than a hard incompatibility, rerun the check with a larger telemetry interval and write-resolution timeout:
dpsctl verify bmc-health start --topology maxlps-pilot --force-writes --expected-edpp-pct 100 --samples-per-telemetry 2500 --telemetry-interval 5s --write-resolution-timeout 5s --wait --summary-only
BMC Unreachable or Firmware Validation Failed
Issue codes: BMC_UNREACHABLE, FIRMWARE_VALIDATION_FAILED
The BMC did not respond to the Redfish service root, the fallback BMC ping failed, or firmware validation failed.
On systems with B200 or B300 GPUs, a standalone FIRMWARE_VALIDATION_FAILED issue with message: firmware validation failed is a known DPS 0.8.x health-check false failure. If the affected B200 or B300 node is still reachable, this is the only SEVERITY_ERROR for that node, and no other BMC health checks failed, document the exception and continue the runbook. This exception will be removed in a future DPS version. Treat FIRMWARE_VALIDATION_FAILED on other hardware, or any firmware failure combined with BMC_UNREACHABLE or another SEVERITY_ERROR, as a blocker.
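The exception rule above can be written down as a small decision function, which helps when triaging many nodes from a saved report. This is a sketch under stated assumptions: the dictionary keys "code" and "severity" are hypothetical names for the issue fields, the GPU model string is supplied by the caller, and lower-severity issues are treated as non-blocking, which the runbook does not confirm.

```python
def firmware_exception_applies(gpu_model, node_issues):
    """Decide whether one node matches the documented DPS 0.8.x exception.

    `node_issues` is a list of dicts for ONE node with the assumed keys
    "code" and "severity"; the real report field names may differ.
    """
    if gpu_model not in {"B200", "B300"}:
        return False  # the exception covers B200 and B300 systems only
    codes = {issue.get("code") for issue in node_issues}
    if "BMC_UNREACHABLE" in codes:
        return False  # an unreachable BMC is always a blocker
    errors = [i for i in node_issues if i.get("severity") == "SEVERITY_ERROR"]
    # The firmware failure must be the only SEVERITY_ERROR on the node.
    return len(errors) == 1 and errors[0].get("code") == "FIRMWARE_VALIDATION_FAILED"
```

A True result means the node is a candidate for the documented exception; it still needs to be recorded, and any other result should be treated as a blocker.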
To remediate:
- Confirm BMC network reachability from the cluster.
- Confirm the Kubernetes BMC credential secret for the affected node.
- Confirm the Day 0 BMC compatibility assessment from the runbook prerequisites.
- Attempt a BMC reset if compatibility and credentials are correct.
- Review the hardware model if the compatibility assessment shows the required GB200 or GB300 Redfish endpoints but the model is different. DPS might support the node through a custom device plugin. For example, a Supermicro BMC based on NVBMC and OEM extensions for HGX B300 can expose the environment metrics endpoints needed for TGP power settings and metrics, and might be supportable under the GB200 plugin with a custom device definition.
- Contact the NVIDIA DPS support team if the node appears supportable through a custom device definition.
- Exclude the node if the BMC remains unreachable. DPS cannot operate on the affected node.
Power Limit Drift
Issue code: IB_OOB_LIMIT_DRIFT
The in-band and out-of-band BMC power limits do not match.
To remediate:
- Rerun the health check with --telemetry-interval 5s and --write-resolution-timeout 5s to rule out slow BMC readback.
- Query the current, default, minimum, and maximum power limits before choosing the reset value:
  nvidia-smi -q -d POWER | grep -E 'Default Power Limit|Max Power Limit|Min Power Limit|Current Power Limit'
- Reset the in-band GPU power limits to the default or maximum TGP:
  sudo nvidia-smi -pl <watts> # Set all GPUs.
  sudo nvidia-smi -i 0 -pl <watts> # Set one GPU.
- Rerun the node-scoped health check.
- Attempt a BMC reset or node restart if drift persists.
- Exclude the node if the issue cannot be resolved. DPS cannot operate on the affected node.
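When many GPUs need checking, the query-then-reset steps above can be scripted. The following is a minimal sketch that parses the filtered nvidia-smi output for a single GPU and picks the value to pass to -pl. It assumes the "<Kind> Power Limit : <number> W" line layout matched by the grep pattern above; older drivers label some fields differently, and fields reported as N/A are simply skipped. The function names are illustrative.

```python
import re

# Matches lines such as "Default Power Limit : 1000.00 W".
LIMIT_RE = re.compile(
    r"^\s*(Current|Default|Min|Max) Power Limit\s*:\s*([0-9.]+)\s*W", re.M
)


def parse_power_limits(text):
    """Return {'current': w, 'default': w, 'min': w, 'max': w} for one GPU.

    Pass output for a single GPU (for example from `nvidia-smi -i 0 ...`);
    with several GPUs in `text`, later values overwrite earlier ones.
    """
    return {kind.lower(): float(watts) for kind, watts in LIMIT_RE.findall(text)}


def choose_reset_watts(limits, prefer="default"):
    """Pick the default (or 'max') limit and confirm it is within range."""
    target = limits[prefer]
    if not limits["min"] <= target <= limits["max"]:
        raise ValueError(f"{prefer} limit {target} W is outside the min/max range")
    return target
```

The returned value is the number to substitute for <watts> in the reset command; a KeyError indicates a field was missing or N/A and the node needs manual inspection.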
Power Write or Restore Mismatch
Issue codes: POWER_WRITE_READBACK_MISMATCH, POWER_LIMIT_RESTORE_FAILED
DPS wrote a power limit through the BMC, but the readback did not converge to the requested value, or the probe could not verify restoration to the original limit.
To remediate:
- Rerun the health check with --telemetry-interval 5s and --write-resolution-timeout 5s.
- Use the DPS BMC credentials to manually test Redfish setpoint writes on the affected node if the issue remains.
- Follow the Redfish API guide examples for PowerLimitWatts.SetPoint.
- Verify TGP, TMP where available, and TCP where available.
- Attempt a BMC reset if manual writes fail or do not read back correctly.
- Exclude the node if the issue cannot be resolved. DPS cannot operate on the affected node.
WPPS or EDPp State
Issue codes: WPPS_PROFILES_ACTIVE, WPPS_INACCESSIBLE, WPPS_RESET_FAILED, EDPP_BELOW_REFERENCE
Workload power profile settings (WPPS) or EDPp settings differ from the expected default state. No workload power profile should be active, and EDPp should report the expected current value of 100% for this pilot gate. WPPS can also affect the EDPp setpoint.
To remediate:
- Reset WPPS through both in-band and out-of-band paths where applicable.
- Reset EDPp directly where supported if EDPp remains below 100%.
- Attempt a BMC reset if the settings persist.
- Document the node, active settings, and measured values if no immediate fix is available. These settings affect MaxLPS performance metrics and baselines.
BMC Latency
Issue codes: TELEMETRY_LATENCY_MAX_HIGH, TELEMETRY_LATENCY_AVG_HIGH, POWER_GET_LATENCY_AVG_HIGH, POWER_SET_LATENCY_AVG_HIGH, POWER_LATENCY_HIGH, EDPP_GET_LATENCY_AVG_HIGH, EDPP_LATENCY_HIGH
Latency failures appear in issue codes and latency summaries.
To remediate:
- Use the saved full report from the health-check step to review per-node latency summaries.
- Compare the summaries against the MaxLPS readiness targets from the health-check step in the pilot runbook.
- Treat the BMC health-check issue-code thresholds as coarse failure thresholds: 5 seconds average for power GET, power SET, EDPp GET, and telemetry latency, and 10 seconds maximum telemetry latency.
- If the BMC latency p99 is greater than 10 seconds, tune the PRS iteration loop so PRS does not run faster than the BMC can reliably answer. Set the Helm value prs.config.loopIntervalSeconds to 10 as the first adjustment. If p99 is much higher, set the interval to a comparable value; for example, use 30 when p99 is around 30 seconds.
  prs:
    config:
      loopIntervalSeconds: 10
  Changing the PRS loop interval changes how quickly PRS responds to workload and telemetry changes. Record the value used for each pilot run because it can affect workload-performance comparison in Part 8.
- Assess cluster-wide BMC client connection pressure if latency exceeds the readiness targets.
- Keep GB200 and GB300 systems at no more than four BMC client connections, including two DPS client connections.
- Use session token authentication instead of basic authentication, and use keep-alive to maintain the connection.
- Review the DMTF Redfish Host Interface Specification for Redfish connection guidance.
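The loop-interval guidance above can be expressed as a small helper for recording a consistent choice across pilot runs. This is a sketch, not a DPS tool: the cut-off for "much higher" (twice the 10-second threshold) and the rounding to a multiple of 5 seconds are assumptions made for illustration, and the function name is hypothetical.

```python
import math

P99_THRESHOLD_S = 10.0      # from the guidance above
MUCH_HIGHER_FACTOR = 2.0    # assumption: "much higher" means at least 2x


def suggest_loop_interval_seconds(p99_latency_s):
    """Suggest a value for prs.config.loopIntervalSeconds, or None.

    None means p99 is within 10 seconds and no change is suggested.
    """
    if p99_latency_s <= P99_THRESHOLD_S:
        return None
    if p99_latency_s < P99_THRESHOLD_S * MUCH_HIGHER_FACTOR:
        return 10  # first adjustment
    # Comparable to p99, rounded up to a multiple of 5 seconds (assumption).
    return int(math.ceil(p99_latency_s / 5.0) * 5)
```

With these assumptions, a p99 of 12 seconds maps to the first adjustment of 10, and a p99 of 30 seconds maps to 30, which matches the example in the guidance.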
No Nodes Probed
Issue code: NO_NODES_PROBED
The server could not build a usable BMC health report for any requested node.
To remediate:
- Confirm the topology name.
- Confirm the --nodes values.
- Confirm that the requested nodes belong to the imported topology.
- Rerun the health check.
Topology Activation
Use this section when dpsctl topology activate --topology maxlps-pilot fails during Part 2, Step 8 or Part 6, Step 1 of the runbook.
- topology maxlps-pilot is already active: activate was re-run after it succeeded. Skip ahead to dpsctl tp list to verify, or deactivate first with dpsctl topology deactivate --topology maxlps-pilot.
- topology <name> not found / load-topology error: the value passed to --topology does not match any imported topology. Confirm the name used in topology.json and re-run dpsctl tp list to see what DPS has.
Resource Group Creation
Use this section when dpsctl resource-group create fails during Part 6, Step 3 of the runbook.
- resource group already exists: an RG named maxlps-pilot is already in the database. Run dpsctl rg list to confirm, then either reuse it or remove it first with dpsctl resource-group delete --resource-group maxlps-pilot.
- resource-group name validation error: the --resource-group value contains disallowed characters. Use lowercase letters, digits, and hyphens (-), matching the naming convention used elsewhere in this runbook.
- failed to create resource group: generic database failure. Check the DPS pod logs and the dps-postgresql pod, then retry once the database is healthy.
Adding Resources to the Resource Group
Use this section when dpsctl resource-group add fails during Part 6, Step 4 of the runbook.
- no active topologies: the topology activation step in Part 6, Step 1 was skipped or rolled back. Run dpsctl tp list --active and activate the pilot topology before retrying.
- some entities are not in the active topology: one or more node names in --entities are not part of the active topology. Compare the list against dpsctl tp list --active and correct the typo, or reimport the topology JSON.
- some entities are already in a resource group: those nodes belong to another RG. Remove them from the other RG first, or delete that RG.
Resource Group Activation
Use this section when dpsctl resource-group activate fails during Part 6, Step 5 of the runbook.
- resource group maxlps-pilot is already active: activate was re-run after it succeeded. Skip ahead to the verify step, or remove the RG first with dpsctl resource-group delete (DPS has no standalone deactivate subcommand).
- resource group maxlps-pilot has no devices: the resource-group add step did not land. Run dpsctl rg list and confirm resource_names is populated before retrying.
- power budget exceeded for resource group maxlps-pilot, but reprovision is false: the sum of per-node policy caps, scaled by PRSHeadroomPercent, exceeds the topology’s available budget. Lower the policy cap or raise the topology budget.
- failed to activate resource group with per-node failures in the --sync response: the BMCs for those nodes rejected the set-limit request. Re-run dpsctl verify bmc-health start --topology maxlps-pilot --nodes <node1>,<node2> --wait --summary-only against the failing nodes to confirm BMC reachability, credentials, and power-control health, then retry activation. Nodes that succeeded stay active; only the failing nodes need to be resolved.
PRS Log Checks
Use this section when the PRS pod logs do not show ongoing limit updates during Part 6, Step 7 of the runbook.
The important signal is that PRS continues to emit fresh "New power limits:" blocks after activation. If the PRS pod logs contain only startup messages, or if "New power limits:" stops appearing while the workload is running:
- Verify the resource group is active with prs_enabled: true.
- Confirm dps.prs.enabled: true in the Helm values.
- Check the PRS pod logs for errors.
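One way to confirm that updates are ongoing is to capture the PRS pod logs twice, a few loop intervals apart, and compare how often the marker appears. The sketch below assumes both captures start from the same point in the log (for example, full pod logs rather than a moving tail window); the function name is illustrative, and only the "New power limits:" marker comes from this page.

```python
MARKER = "New power limits:"


def prs_still_updating(earlier_logs, later_logs):
    """Return True if the later capture contains more limit-update blocks.

    Both arguments are plain log text. The comparison is only meaningful
    when the later capture is a superset of the earlier one.
    """
    return later_logs.count(MARKER) > earlier_logs.count(MARKER)
```

A False result while the workload is running corresponds to the stalled case described above and is the cue to work through the checks in the list.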