Excursion Mitigation
Scope: This document covers DPS-side behavior only. For configuration, see the Deployment Guide.
Overview
This document describes how DPS receives power telemetry from Zapp, detects power excursions, and automatically mitigates them by reducing power to lower-priority workloads.
The simple strategy automatically reduces power to lower-priority resource groups (RGs) when an excursion is detected.
Simple Mitigation Algorithm
Identify Targets → Reduce Power → Wait → Verify
       ↑                                   |
       |                                   |
       +--- still in excursion, retry -----+
Step 1: Identify Targets
DPS determines which resource groups to take power from:
- It identifies the highest-level devices in excursion (e.g., if both a PDU and one of its nodes are in excursion, only the PDU is considered to avoid double-counting).
- It calculates the power deficit: how many watts the system is over its limit.
- It sets a steal target of oversteal_factor x deficit (default: 2x the deficit) to account for the fact that actual power reduction may not exactly match estimates.
- It finds all DPM-enabled resource groups with devices under the excursion points, ordered by priority. Lower-importance RGs are targeted first (higher priority number = lower importance).
- If no resource groups can be reduced, the mitigation fails and a critical alert is fired.
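The target-selection rules above can be sketched in Python. This is an illustrative sketch, not the DPS implementation: the Device and ResourceGroup classes, the parent_id topology link, and the identify_targets function are hypothetical names, and summing the deficit across the top-level excursion points is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Device:
    device_id: str
    measured_w: float
    limit_w: float
    parent_id: Optional[str] = None  # hypothetical topology link (node -> PDU)

    @property
    def in_excursion(self) -> bool:
        return self.measured_w > self.limit_w


@dataclass
class ResourceGroup:
    name: str
    priority: int       # higher number = lower importance
    dpm_enabled: bool
    device_ids: Set[str]


def ancestors(dev_id: str, devices: Dict[str, Device]) -> List[str]:
    """Return the IDs of all ancestors of dev_id, nearest first."""
    chain = []
    parent = devices[dev_id].parent_id
    while parent is not None:
        chain.append(parent)
        parent = devices[parent].parent_id
    return chain


def top_level_excursions(devices: Dict[str, Device]) -> List[Device]:
    """Keep only excursion devices that have no ancestor also in excursion."""
    return [
        d for d in devices.values()
        if d.in_excursion
        and not any(devices[a].in_excursion for a in ancestors(d.device_id, devices))
    ]


def identify_targets(devices, groups, oversteal_factor=2.0):
    points = top_level_excursions(devices)
    # Assumption: the deficit is summed over the top-level excursion points.
    deficit = sum(p.measured_w - p.limit_w for p in points)
    steal_target = oversteal_factor * deficit
    point_ids = {p.device_id for p in points}

    def under_excursion(dev_id: str) -> bool:
        return dev_id in point_ids or any(
            a in point_ids for a in ancestors(dev_id, devices)
        )

    candidates = sorted(
        (g for g in groups
         if g.dpm_enabled and any(under_excursion(d) for d in g.device_ids)),
        key=lambda g: g.priority,
        reverse=True,  # least important (highest priority number) first
    )
    return deficit, steal_target, candidates


# Example: a PDU and one of its nodes are both over their limits.
devices = {
    "pdu-1": Device("pdu-1", 10500.0, 10000.0),
    "node-a": Device("node-a", 1200.0, 1000.0, parent_id="pdu-1"),
    "node-b": Device("node-b", 800.0, 1000.0, parent_id="pdu-1"),
}
groups = [
    ResourceGroup("training", 1, True, {"node-a"}),
    ResourceGroup("batch", 5, True, {"node-b"}),
    ResourceGroup("legacy", 9, False, {"node-b"}),
]
deficit, steal_target, candidates = identify_targets(devices, groups)
```

In this example only pdu-1 counts as an excursion point, since node-a sits beneath it. The candidate list comes back with batch ahead of training, and legacy is skipped because it is not DPM-enabled. An empty candidate list corresponds to the failure case in the last bullet.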
Step 2: Reduce Power
DPS reduces power policies on the identified resource groups:
- Starting with the least important RG, DPS lowers its power policy one step at a time (e.g., High to Medium, then Medium to Low).
- Each RG is fully exhausted (lowered as far as it can go) before moving to the next one.
- At each step, DPS estimates the watts saved based on the difference between the old and new policy limits.
- The process stops when either the steal target is reached or all candidate RGs have been lowered as far as possible.
- Power policy changes are sent to the device plugins to take effect.
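A minimal sketch of the reduction loop described above, assuming a three-level policy ladder. The wattage values, the RGState class, and the reduce_power function are hypothetical and chosen only for illustration; real policy limits come from the configured power policies.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical policy ladder, highest power first, with illustrative limits in watts.
POLICY_LADDER = ["High", "Medium", "Low"]
POLICY_LIMIT_W = {"High": 3000.0, "Medium": 2000.0, "Low": 1200.0}


@dataclass
class RGState:
    name: str
    priority: int
    policy: str


def reduce_power(
    candidates: List[RGState], steal_target_w: float
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Lower policies one step at a time, exhausting each RG before the next."""
    stolen = 0.0
    changes = []  # (rg name, old policy, new policy)
    for rg in candidates:  # expected order: least important first
        idx = POLICY_LADDER.index(rg.policy)
        while idx < len(POLICY_LADDER) - 1 and stolen < steal_target_w:
            old, new = POLICY_LADDER[idx], POLICY_LADDER[idx + 1]
            # Estimated savings: difference between old and new policy limits.
            stolen += POLICY_LIMIT_W[old] - POLICY_LIMIT_W[new]
            rg.policy = new
            changes.append((rg.name, old, new))
            idx += 1
        if stolen >= steal_target_w:
            break  # steal target reached
    return stolen, changes


candidates = [RGState("batch", 5, "High"), RGState("training", 1, "High")]
stolen, changes = reduce_power(candidates, steal_target_w=2500.0)
```

With a 2500 W target, batch is lowered all the way to Low (an estimated 1800 W) before training is touched, and one step on training brings the estimate to 2800 W. The returned changes list stands in for the policy updates that would be sent to the device plugins.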
Step 3: Wait
DPS waits for the configured interval (default 10 seconds) to allow the power changes to take effect and for fresh telemetry readings to arrive.
Step 4: Verify
DPS re-evaluates the excursion using fresh telemetry data:
- If all excursion points are now within their provisioning limits, the mitigation is considered successful. DPS then attempts to give back the stolen power where headroom allows.
- If the excursion persists and the maximum number of iterations has not been reached, the process loops back to Step 1 with updated telemetry.
- If the maximum number of iterations is reached without resolution, the mitigation fails and an ExcursionMitigationFailed alert is fired.
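The four steps together form a bounded retry loop, which could be sketched as follows. The mitigate function and its callable parameters are hypothetical, and the default of 3 for max_iterations is an assumption; this document does not state the actual default.

```python
import time
from typing import Callable


def mitigate(
    in_excursion: Callable[[], bool],
    run_iteration: Callable[[], bool],
    max_iterations: int = 3,        # assumed value, not a documented default
    wait_seconds: float = 10.0,     # documented default wait interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the excursion was resolved, False if mitigation failed."""
    for _ in range(max_iterations):
        # Steps 1-2: identify targets and reduce power.
        # run_iteration returns False when no resource group can be reduced.
        if not run_iteration():
            return False
        sleep(wait_seconds)         # Step 3: wait for fresh telemetry
        if not in_excursion():      # Step 4: verify
            return True
    return False                    # iterations exhausted


# Example with canned telemetry: still in excursion once, then resolved.
readings = iter([True, False])
resolved = mitigate(
    in_excursion=lambda: next(readings),
    run_iteration=lambda: True,
    sleep=lambda seconds: None,     # skip real waiting in the example
)
```

A caller would start power give-back when the function returns True and fire ExcursionMitigationFailed when it returns False.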
Power Give-Back
After a successful mitigation, DPS automatically attempts to restore power to the resource groups that were reduced. This is a best-effort process: DPS restores as much power as current conditions allow, and how much that is depends on the real source of the excursion. If the underlying cause persists (e.g., sustained high workload), there may not be enough headroom to give power back, and the affected RGs will remain at their reduced policy levels.
Alerts
DPS generates two excursion-related alerts via Alertmanager:
PowerExcursion (warning)
Fired for every device where aggregated power exceeds the operational limit.
- alertname: PowerExcursion
- severity: warning
- labels: entity (device ID)
- summary: Power excursion on <entity>: <measured>W measured exceeds <expected>W expected
ExcursionMitigationFailed (critical)
Fired when the mitigation algorithm has exhausted all options without resolving the excursion.
- alertname: ExcursionMitigationFailed
- severity: critical
- labels: entities (comma-separated device IDs)
- summary: Excursion mitigation failed for <entities> after stealing <watts>W from <count> resource groups
Note: Excursion alerts require an Alertmanager instance. Without it, excursions are still logged but no alerts are fired. See the Deployment Guide for Alertmanager configuration.
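For readers building routes or receivers, the two alerts above might look roughly like this when expressed in the JSON shape that Alertmanager's v2 alerts endpoint accepts (a list of objects with labels and annotations). The helper functions are hypothetical, and placing the summary under annotations and rounding watts to whole numbers are assumptions about how DPS formats the alerts.

```python
import json
from typing import Dict, List


def power_excursion_alert(entity: str, measured_w: float, expected_w: float) -> Dict:
    """Build a PowerExcursion alert for one device (illustrative shape only)."""
    return {
        "labels": {
            "alertname": "PowerExcursion",
            "severity": "warning",
            "entity": entity,
        },
        "annotations": {  # assumption: summary is carried as an annotation
            "summary": (
                f"Power excursion on {entity}: {measured_w:.0f}W measured "
                f"exceeds {expected_w:.0f}W expected"
            ),
        },
    }


def mitigation_failed_alert(entities: List[str], stolen_w: float, rg_count: int) -> Dict:
    """Build an ExcursionMitigationFailed alert (illustrative shape only)."""
    joined = ",".join(entities)  # comma-separated device IDs
    return {
        "labels": {
            "alertname": "ExcursionMitigationFailed",
            "severity": "critical",
            "entities": joined,
        },
        "annotations": {
            "summary": (
                f"Excursion mitigation failed for {joined} after stealing "
                f"{stolen_w:.0f}W from {rg_count} resource groups"
            ),
        },
    }


# Alertmanager expects a JSON array of alerts in the request body.
body = json.dumps([
    power_excursion_alert("pdu-1", 10500.0, 10000.0),
    mitigation_failed_alert(["pdu-1", "pdu-2"], 2800.0, 2),
])
```

Routing on the alertname and severity labels keeps the warning-level PowerExcursion traffic separate from the critical ExcursionMitigationFailed alert.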
Further Reading
- Telemetry - Real-time power data that drives excursion detection
- Power Policies - Power configurations adjusted during mitigation
- Resource Groups - Workload groupings targeted by excursion mitigation
- Enabling Telemetry Provider - Configuring telemetry and excursion mitigation settings