Troubleshooting the NVL72 inference power pilot

Use this page when a command in the "Inference power pilot: NVL72 fleet power management" runbook fails or returns an unhealthy state. Start with the section that matches the step you were running.

Deployment Verification

Use this section when dpsctl verify reports an unhealthy component during Part 1, Step 3 of the runbook.

To re-check a single component after a fix, pass its flag, for example dpsctl verify --database. The flags --dps-server, --database, --auth, --ui, and --bcm can be combined freely.

  • dps_server unhealthy — Check the DPS server pod is Running; kubectl -n dps logs statefulset/dps-server usually shows the underlying error.
  • database unhealthy — Check the Postgres pod and the database connection secret; a failed ping is almost always reachability or credentials, not load.
  • auth unhealthy — For LDAP, verify ServerURL, BindDN, BindPassword, and CA/client certificates. For JWT, verify the private key file is mounted and readable.
  • ui unhealthy or missing — Confirm the dps-ui Ingress exists in the same namespace as the DPS server with at least one rule and a non-empty host.
  • bcm unhealthy — Confirm the BCM credential secret is present, the URL is reachable from the cluster, and the account has API access. If BCM is not deployed, expect skipped: true instead.
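The component names in the verify report map one-to-one onto the re-check flags listed above. The sketch below shows that mapping as a small shell function. verify_flag is a hypothetical helper name, not part of dpsctl, and it only prints the flag so the final command can be reviewed before it runs.

```shell
# Hypothetical helper (not part of dpsctl): map a component name from the
# verify report to the matching dpsctl verify flag named in this section.
verify_flag() {
  case "$1" in
    dps_server) echo "--dps-server" ;;
    database)   echo "--database" ;;
    auth)       echo "--auth" ;;
    ui)         echo "--ui" ;;
    bcm)        echo "--bcm" ;;
    *) echo "unknown component: $1" >&2; return 1 ;;
  esac
}

# Example: re-check only the database after fixing its secret.
#   dpsctl verify "$(verify_flag database)"
```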

Topology Validation and Import

Use this section when dpsctl topology validate topology.json or dpsctl topology import topology.json fails during Part 2, Step 5 of the runbook.

dpsctl topology validate prints a list of ValidationError entries; the error field identifies the class of failure. The most common errors in a hand-authored topology file are:

  • invalid_model — the file does not match the topology JSON schema (missing top-level Topology or Entities, wrong field casing, malformed JSON). Re-check against Managing Topologies.
  • device_not_found — an entity’s Type/Model pair is not in the DPS device registry (typo in Model, or the device was not seeded). Run dpsctl device list and correct the Model string to an exact match.
  • invalid_name — an entity, topology, or policy name has disallowed characters. Use lowercase letters, digits, and -.
  • invalid_secret_name — the Redfish.SecretName contains characters that are not valid for a Kubernetes Secret name. Use lowercase letters, digits, -, and ..
  • duplicate_entity — two entities share the same Name in the file. Make each node name unique.
  • referenced_entity_not_found / referenced_topology_entity_not_found — a topology entry lists a child that has no matching entity block. Compare the Topology.Entities[].Children list against the top-level Entities list.
  • self_reference / circular_dependency — a topology entity lists itself or creates a cycle through its children. Rebuild the parent/child chain so every leaf is reached exactly once from the root.
  • disconnected_graph — one or more compute nodes are not reachable from the topology root. Every compute node must be a descendant of the topology root entity.
  • invalid_connection — a parent entity cannot legally have the given child device type (for example, a rack entity listing another rack as a child). Correct the parent/child relationship to match the reference topology examples.

Fix the file, then re-run dpsctl topology validate until it prints Topology validation passed.
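Some of these errors can be caught locally before calling dpsctl. The sketch below pre-checks the character rules for invalid_name and invalid_secret_name and finds repeated names for duplicate_entity. All three function names are hypothetical. The checks cover only the character sets stated above, so the DPS validator may apply further rules (length limits, for example) that this sketch does not.

```shell
# Return 0 if the name uses only lowercase letters, digits, and "-".
is_valid_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eqx '[a-z0-9-]+'
}

# Return 0 if the secret name uses only lowercase letters, digits, "-", and ".".
is_valid_secret_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eqx '[a-z0-9.-]+'
}

# Read names on stdin, one per line, and print each name that appears twice or more.
find_duplicate_names() {
  sort | uniq -d
}

# Example, assuming jq is installed and entity names live at .Entities[].Name
# (an assumption about the topology file layout):
#   jq -r '.Entities[].Name' topology.json | find_duplicate_names
```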

If dpsctl topology import topology.json fails:

  • failed to create topology / failed to upsert entities / failed to add topology entities — generic database failure. Check the DPS pod logs and the dps-postgresql pod, then retry once the database is healthy.

BMC Health Check

Use this section when dpsctl verify bmc-health reports issues during Part 2, Step 6 or Part 2, Step 7 of the runbook.

Use the issues list in the dpsctl verify bmc-health report as the starting point. Each issue includes a node, resource, code, observed value, threshold, and message. The dpsctl verify bmc-health reference documents the command flags and asynchronous task model.

After fixing an affected node, scope the retry to that node before re-running the full topology check:

dpsctl verify bmc-health start --topology maxlps-pilot --nodes <node-name> --force-writes --expected-edpp-pct 100 --samples-per-telemetry 2500 --telemetry-interval 500ms --wait --summary-only

If a failure looks like slow BMC readback rather than a hard incompatibility, rerun the check with a larger telemetry interval and write-resolution timeout:

dpsctl verify bmc-health start --topology maxlps-pilot --force-writes --expected-edpp-pct 100 --samples-per-telemetry 2500 --telemetry-interval 5s --write-resolution-timeout 5s --wait --summary-only

BMC Unreachable or Firmware Validation Failed

Issue codes: BMC_UNREACHABLE, FIRMWARE_VALIDATION_FAILED

The BMC did not respond to the Redfish service root, the fallback BMC ping failed, or firmware validation failed.

On systems with B200 or B300 GPUs, a standalone FIRMWARE_VALIDATION_FAILED issue with message: firmware validation failed is a known DPS 0.8.x health-check false failure. If the affected B200 or B300 node is still reachable, this is the only SEVERITY_ERROR for that node, and no other BMC health checks failed, document the exception and continue the runbook. This exception will be removed in a future DPS version. Treat FIRMWARE_VALIDATION_FAILED on other hardware, or any firmware failure combined with BMC_UNREACHABLE or another SEVERITY_ERROR, as a blocker.
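The exception rule above can be written as a small decision function, which can help when triaging many nodes from a saved report. This is a sketch only: classify_firmware_failure is a hypothetical name, and it assumes you pass the GPU model followed by every SEVERITY_ERROR issue code reported for a node that has FIRMWARE_VALIDATION_FAILED. You still need to confirm separately that no other BMC health checks failed for the node.

```shell
# Hypothetical helper. $1: GPU model. Remaining arguments: every
# SEVERITY_ERROR issue code reported for that node.
# Prints "exception" only when the model is B200 or B300 and the sole
# error is FIRMWARE_VALIDATION_FAILED; prints "blocker" otherwise.
classify_firmware_failure() {
  model="$1"
  shift
  case "$model" in
    B200|B300) ;;
    *) echo "blocker"; return 0 ;;
  esac
  if [ "$#" -eq 1 ] && [ "$1" = "FIRMWARE_VALIDATION_FAILED" ]; then
    echo "exception"
  else
    echo "blocker"
  fi
}
```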

To remediate:

  1. Confirm BMC network reachability from the cluster.
  2. Confirm the Kubernetes BMC credential secret for the affected node.
  3. Confirm the Day 0 BMC compatibility assessment from the runbook prerequisites.
  4. Attempt a BMC reset if compatibility and credentials are correct.
  5. Review the hardware model if the compatibility assessment shows the required GB200 or GB300 Redfish endpoints but the model is different. DPS might support the node through a custom device plugin. For example, a Supermicro BMC based on NVBMC and OEM extensions for HGX B300 can expose the environment metrics endpoints needed for TGP power settings and metrics, and might be supportable under the GB200 plugin with a custom device definition.
  6. Contact the NVIDIA DPS support team if the node appears supportable through a custom device definition.
  7. Exclude the node if the BMC remains unreachable. DPS cannot operate on the affected node.

Power Limit Drift

Issue code: IB_OOB_LIMIT_DRIFT

The in-band and out-of-band BMC power limits do not match.

To remediate:

  1. Rerun the health check with --telemetry-interval 5s and --write-resolution-timeout 5s to rule out slow BMC readback.

  2. Query the current, default, minimum, and maximum power limits before choosing the reset value:

    nvidia-smi -q -d POWER | grep -E 'Default Power Limit|Max Power Limit|Min Power Limit|Current Power Limit'

  3. Reset the in-band GPU power limits to the default or maximum TGP:

    sudo nvidia-smi -pl <watts>      # Set all GPUs.
    sudo nvidia-smi -i 0 -pl <watts> # Set one GPU.

  4. Rerun the node-scoped health check.

  5. Attempt a BMC reset or node restart if drift persists.

  6. Exclude the node if the issue cannot be resolved. DPS cannot operate on the affected node.
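Before resetting limits, it can help to see which GPUs on the node are away from their default in-band limit. The sketch below filters the CSV output of nvidia-smi for that case. gpus_off_default is a hypothetical helper name. It inspects in-band values only, so the in-band versus out-of-band comparison still comes from the health check.

```shell
# Read "index, current_limit, default_limit" CSV rows on stdin and print
# the index of each GPU whose current limit differs from its default.
gpus_off_default() {
  awk -F', *' '($2 + 0) != ($3 + 0) { print $1 }'
}

# Example on the affected node:
#   nvidia-smi --query-gpu=index,power.limit,power.default_limit \
#     --format=csv,noheader,nounits | gpus_off_default
```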

Power Write or Restore Mismatch

Issue codes: POWER_WRITE_READBACK_MISMATCH, POWER_LIMIT_RESTORE_FAILED

DPS wrote a power limit through the BMC, but the readback did not converge to the requested value, or the probe could not verify restoration to the original limit.

To remediate:

  1. Rerun the health check with --telemetry-interval 5s and --write-resolution-timeout 5s.
  2. Use the DPS BMC credentials to manually test Redfish setpoint writes on the affected node if the issue remains.
  3. Follow the Redfish API guide examples for PowerLimitWatts.SetPoint.
  4. Verify that the setpoint reads back correctly for TGP, and for TMP and TCP where available.
  5. Attempt a BMC reset if manual writes fail or do not read back correctly.
  6. Exclude the node if the issue cannot be resolved. DPS cannot operate on the affected node.
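A manual setpoint test can be sketched with curl. The request body below follows the PowerLimitWatts.SetPoint property named above. Everything else is an assumption: setpoint_body is a hypothetical helper, and BMC_HOST, BMC_USER, BMC_PASS, and METRICS_URI are placeholders. Take the real resource path from the Redfish API guide rather than from this sketch.

```shell
# Hypothetical helper: build the JSON PATCH body for a power-limit setpoint
# given in whole watts.
setpoint_body() {
  printf '{"PowerLimitWatts":{"SetPoint":%d}}\n' "$1"
}

# Example write followed by a readback. All variables are placeholders, and
# -k skips TLS verification, which is only reasonable for a one-off lab test.
#   curl -k -u "$BMC_USER:$BMC_PASS" -X PATCH \
#     -H 'Content-Type: application/json' \
#     -d "$(setpoint_body 900)" "https://$BMC_HOST$METRICS_URI"
#   curl -k -u "$BMC_USER:$BMC_PASS" "https://$BMC_HOST$METRICS_URI"
```

If the readback does not show the requested SetPoint after the write-resolution timeout used in the health check, treat the manual write as failed and continue with the BMC reset step.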

WPPS or EDPp State

Issue codes: WPPS_PROFILES_ACTIVE, WPPS_INACCESSIBLE, WPPS_RESET_FAILED, EDPP_BELOW_REFERENCE

Workload power profile settings (WPPS) or EDPp settings differ from the expected default state. No workload power profile should be active, and EDPp should report the expected current value of 100% for this pilot gate. WPPS can also affect the EDPp setpoint.

To remediate:

  1. Reset WPPS through both in-band and out-of-band paths where applicable.
  2. Reset EDPp directly where supported if EDPp remains below 100%.
  3. Attempt a BMC reset if the settings persist.
  4. Document the node, active settings, and measured values if no immediate fix is available. These settings affect MaxLPS performance metrics and baselines.

BMC Latency

Issue codes: TELEMETRY_LATENCY_MAX_HIGH, TELEMETRY_LATENCY_AVG_HIGH, POWER_GET_LATENCY_AVG_HIGH, POWER_SET_LATENCY_AVG_HIGH, POWER_LATENCY_HIGH, EDPP_GET_LATENCY_AVG_HIGH, EDPP_LATENCY_HIGH

Latency failures appear in issue codes and latency summaries.

To remediate:

  1. Use the saved full report from the health-check step to review per-node latency summaries.

  2. Compare the summaries against the MaxLPS readiness targets from the health-check step in the pilot runbook.

  3. Treat the BMC health-check issue-code thresholds as coarse failure thresholds: 5 seconds average for power GET, power SET, EDPp GET, and telemetry latency, and 10 seconds maximum telemetry latency.

  4. If the BMC latency p99 is greater than 10 seconds, tune the PRS iteration loop so PRS does not run faster than the BMC can reliably answer. Set the Helm value prs.config.loopIntervalSeconds to 10 as the first adjustment. If p99 is much higher, set the interval to a comparable value; for example, use 30 when p99 is around 30 seconds.

    prs:
      config:
        loopIntervalSeconds: 10

    Changing the PRS loop interval changes how quickly PRS responds to workload and telemetry changes. Record the value used for each pilot run because it can affect workload-performance comparison in Part 8.

  5. Assess cluster-wide BMC client connection pressure if latency exceeds the readiness targets.

  6. Keep GB200 and GB300 systems at no more than four BMC client connections, including two DPS client connections.

  7. Use session token authentication instead of basic authentication, and use keep-alive to maintain the connection.

  8. Review the DMTF Redfish Host Interface Specification for Redfish connection guidance.
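The coarse thresholds from step 3 can be applied to a per-node summary with a short function. This is a sketch: latency_check is a hypothetical helper, both arguments are in seconds, and it assumes a value must be strictly greater than the threshold to count as high.

```shell
# Hypothetical helper. $1: average latency in seconds.
# $2: maximum telemetry latency in seconds.
# Prints "ok" or the coarse thresholds that were exceeded.
latency_check() {
  awk -v avg="$1" -v max="$2" 'BEGIN {
    status = "ok"
    if (avg + 0 > 5) status = "avg-over-5s"
    if (max + 0 > 10) {
      status = (status == "ok") ? "max-over-10s" : (status " max-over-10s")
    }
    print status
  }'
}

# Example: latency_check 6 12.5
```

A node that passes these coarse thresholds can still miss the tighter MaxLPS readiness targets, so compare against those separately.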

No Nodes Probed

Issue code: NO_NODES_PROBED

The server could not build a usable BMC health report for any requested node.

To remediate:

  1. Confirm the topology name.
  2. Confirm the --nodes values.
  3. Confirm that the requested nodes belong to the imported topology.
  4. Rerun the health check.
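Steps 2 and 3 amount to a set comparison between the --nodes value and the node names in the imported topology. A minimal sketch, assuming you can produce the topology node names one per line (for example, extracted from topology.json); missing_nodes is a hypothetical helper name.

```shell
# Hypothetical helper. $1: comma-separated --nodes value.
# stdin: node names known to the topology, one per line.
# Prints each requested node that is not in the topology.
missing_nodes() {
  known="$(cat)"
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r node; do
    [ -n "$node" ] || continue
    printf '%s\n' "$known" | grep -Fxq -- "$node" || printf '%s\n' "$node"
  done
}

# Example: printf 'node-01\nnode-02\n' | missing_nodes 'node-01,node-03'
```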

Topology Activation

Use this section when dpsctl topology activate --topology maxlps-pilot fails during Part 2, Step 8 or Part 6, Step 1 of the runbook.

  • topology maxlps-pilot is already active — activate was re-run after it succeeded. Skip ahead to dpsctl tp list to verify, or deactivate first with dpsctl topology deactivate --topology maxlps-pilot.
  • topology <name> not found / load-topology error — the value passed to --topology does not match any imported topology. Confirm the name used in topology.json and re-run dpsctl tp list to see what DPS has.

Resource Group Creation

Use this section when dpsctl resource-group create fails during Part 6, Step 3 of the runbook.

  • resource group already exists — an RG named maxlps-pilot is already in the database. Run dpsctl rg list to confirm, then either reuse it or remove it first with dpsctl resource-group delete --resource-group maxlps-pilot.
  • resource-group name validation error — the --resource-group value contains disallowed characters. Use lowercase letters, digits, and -, matching the naming convention used elsewhere in this runbook.
  • failed to create resource group — generic database failure. Check the DPS pod logs and the dps-postgresql pod, then retry once the database is healthy.

Adding Resources to the Resource Group

Use this section when dpsctl resource-group add fails during Part 6, Step 4 of the runbook.

  • no active topologies — the topology activation step in Part 6, Step 1 was skipped or rolled back. Run dpsctl tp list --active and activate the pilot topology before retrying.
  • some entities are not in the active topology — one or more node names in --entities are not part of the active topology. Compare the list against dpsctl tp list --active and correct the typo, or reimport the topology JSON.
  • some entities are already in a resource group — those nodes belong to another RG. Remove them from the other RG first, or delete that RG.

Resource Group Activation

Use this section when dpsctl resource-group activate fails during Part 6, Step 5 of the runbook.

  • resource group maxlps-pilot is already active — activate was re-run after it succeeded. Skip ahead to the verify step, or remove the RG first with dpsctl resource-group delete (DPS has no standalone deactivate subcommand).
  • resource group maxlps-pilot has no devices — the resource-group add step did not land. Run dpsctl rg list and confirm resource_names is populated before retrying.
  • power budget exceeded for resource group maxlps-pilot, but reprovision is false — the sum of per-node policy caps, scaled by PRSHeadroomPercent, exceeds the topology’s available budget. Lower the policy cap or raise the topology budget.
  • failed to activate resource group with per-node failures in the --sync response — the BMCs for those nodes rejected the set-limit request. Re-run dpsctl verify bmc-health start --topology maxlps-pilot --nodes <node1>,<node2> --wait --summary-only against the failing nodes to confirm BMC reachability, credentials, and power-control health, then retry activation. Nodes that succeeded stay active; only the failing nodes need to be resolved.
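When several nodes fail activation, the node-scoped health check needs them as one comma-separated --nodes value. The sketch below builds that value from a list of names. join_nodes is a hypothetical helper and the node names are placeholders.

```shell
# Hypothetical helper: join the node names given as arguments into the
# comma-separated form used with --nodes.
join_nodes() {
  (IFS=,; printf '%s\n' "$*")
}

# Example:
#   dpsctl verify bmc-health start --topology maxlps-pilot \
#     --nodes "$(join_nodes node-01 node-02)" --wait --summary-only
```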

PRS Log Checks

Use this section when the PRS pod logs do not show ongoing limit updates during Part 6, Step 7 of the runbook.

The important signal is that PRS keeps emitting "New power limits:" blocks after activation. If the PRS pod logs show only startup messages, or if "New power limits:" stops appearing while the workload is running:

  1. Verify the resource group is active with prs_enabled: true.
  2. Confirm dps.prs.enabled: true in the Helm values.
  3. Check the PRS pod logs for errors.
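To tell whether limit updates are still arriving, count the marker lines in a recent window of the PRS log. A minimal sketch: count_limit_updates is a hypothetical helper, <prs-pod> is a placeholder for the PRS pod name, and the dps namespace is assumed to match the one used for the DPS server.

```shell
# Count log lines on stdin that contain the PRS limit-update marker.
# grep -c exits non-zero when the count is 0, so the exit status is ignored.
count_limit_updates() {
  grep -c 'New power limits:' || true
}

# Example:
#   kubectl -n dps logs <prs-pod> --since=5m | count_limit_updates
```

A count of 0 over a window longer than the PRS loop interval, while the workload is running, points to the checks listed above.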