Power Reservation Steering (PRS)
Overview
Power Reservation Steering (PRS) is an advanced dynamic power management feature in DPS that automatically redistributes power across compute nodes based on real-time telemetry data. PRS optimizes power allocation within a power domain by monitoring GPU utilization and dynamically adjusting power limits to maximize workload performance while staying within overall power budgets.
Unlike static power policies that set fixed power limits, PRS continuously analyzes workload behavior and steers unused power from idle or underutilized GPUs to those that need more power, enabling more efficient use of datacenter power capacity.
Key Concepts
Dynamic Power Redistribution
PRS operates in a continuous loop that:
- Collects Telemetry - Gathers real-time power usage and GPU utilization data from all nodes in the domain
- Analyzes Workload - Identifies GPUs that are underutilizing their allocated power
- Calculates Recommendations - Determines optimal power distribution based on actual demand
- Applies Changes - Dynamically adjusts GPU power limits through DPS
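The four steps above can be sketched in Python. Everything here is an illustrative assumption — the class, the 0.5/0.9 utilization thresholds, and the proportional-headroom rule are stand-ins, not the actual PRS algorithm:

```python
from dataclasses import dataclass

@dataclass
class Gpu:
    id: str
    limit_w: float       # current GPU power limit (watts)
    power_w: float       # measured power draw (watts)
    utilization: float   # 0.0 - 1.0

def prs_cycle(gpus, donor_frac=0.5):
    """One hypothetical optimization cycle: steer unused headroom
    from lightly loaded GPUs to busy ones."""
    # Analyze: lightly loaded GPUs donate headroom; busy GPUs receive it.
    donors = [g for g in gpus if g.utilization < 0.5]
    receivers = [g for g in gpus if g.utilization > 0.9]
    if not donors or not receivers:
        return {g.id: g.limit_w for g in gpus}  # nothing to steer

    # Calculate: reclaim a fraction of each donor's unused headroom...
    steered = sum((g.limit_w - g.power_w) * donor_frac for g in donors)
    new_limits = {g.id: g.limit_w for g in gpus}
    for g in donors:
        new_limits[g.id] -= (g.limit_w - g.power_w) * donor_frac
    # ...and grant it evenly to the busy GPUs. The total allocation is
    # conserved, so the domain stays within its overall power budget.
    for g in receivers:
        new_limits[g.id] += steered / len(receivers)
    return new_limits
```

For example, with an idle 700 W GPU drawing only 200 W and a busy GPU pinned at its 700 W limit, one cycle moves 250 W of the idle GPU's headroom to the busy one while leaving the domain total unchanged.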
Integration with Resource Groups
PRS integrates seamlessly with DPS’s resource group system. When a resource group is activated with PRS enabled:
- The resource group’s nodes are added to the PRS domain
- PRS begins monitoring and optimizing power for those nodes
- When the resource group is deactivated, nodes return to their topology defaults
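That lifecycle can be modeled in a few lines. The class and field names below are assumptions for illustration, not the DPS API:

```python
class PrsDomain:
    """Toy model of PRS domain membership (illustrative, not the DPS API)."""
    def __init__(self):
        self.nodes = set()

    def activate(self, resource_group, prs_enabled=True):
        # Nodes join the PRS domain only when the group opts in.
        if prs_enabled:
            self.nodes |= set(resource_group["nodes"])

    def deactivate(self, resource_group):
        # On deactivation, nodes leave the domain and revert to
        # their topology-default power limits.
        self.nodes -= set(resource_group["nodes"])

domain = PrsDomain()
rg = {"name": "ml-training", "nodes": ["node-1", "node-2"]}
domain.activate(rg)    # node-1 and node-2 are now optimized by PRS
domain.deactivate(rg)  # both return to topology defaults
```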
Configuration Methods
PRS can be configured at multiple levels, providing flexibility for different deployment scenarios.
1. Helm Chart Configuration
Enable PRS deployment and connection when installing DPS via Helm:
```yaml
# Enable PRS deployment
prs:
  enabled: true
  image:
    repository: nvcr.io/nvidia/prs
    tag: 1.0.5-prs-0.1.1.6-dpsapi
  config:
    configServerPort: 8880
    jobSchedServerPort: 8881
    loopIntervalSeconds: 90
    rpcTimeoutSeconds: 10
    windowSize: 5

# Connect DPS-Server to PRS
dps:
  prs:
    enabled: true
    # hostPort defaults to internal PRS service if not specified
    hostPort: ""
```

PRS Configuration Parameters
| Parameter | Description | Default |
|---|---|---|
| prs.enabled | Deploy the PRS service | false |
| prs.config.loopIntervalSeconds | Interval between PRS optimization cycles | 90 |
| prs.config.rpcTimeoutSeconds | Timeout for RPC calls to DPS | 10 |
| prs.config.windowSize | Number of samples for averaging power metrics | 5 |
| dps.prs.enabled | Enable PRS integration in DPS-Server | false |
| dps.prs.hostPort | PRS service endpoint (auto-configured if empty) | "" |
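The role of windowSize can be sketched as a simple moving average over the most recent power samples. This is an illustrative assumption about the smoothing; the exact averaging PRS applies is not documented here:

```python
from collections import deque

def smoothed_power(samples, window_size=5):
    """Average the last `window_size` power readings, mimicking the
    role of prs.config.windowSize (illustrative, not the PRS code)."""
    window = deque(maxlen=window_size)
    averages = []
    for sample in samples:
        window.append(sample)
        averages.append(sum(window) / len(window))
    return averages

# A single 650 W spike in a 300 W stream is damped to ~370-390 W,
# so one noisy sample does not trigger an immediate resteer.
print(smoothed_power([300, 300, 300, 650, 300, 300, 300]))
```

A larger windowSize makes PRS slower to react but more resistant to transient spikes; a smaller one does the opposite.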
2. Global Config API
Enable or disable PRS at runtime using the Global Configuration API:
Get Current Configuration
```shell
# gRPC request
grpcurl -plaintext localhost:50051 \
  nvidia.dcpower.v1.ConfigurationManagementService/GetConfiguration
```

Response:

```json
{
  "status": { "code": "OK" },
  "settings": {
    "prs_enabled": "true",
    "metrics_interval_seconds": "30"
  }
}
```

Update PRS Setting
```shell
# Enable PRS
grpcurl -plaintext -d '{"settings": {"prs_enabled": "true"}}' \
  localhost:50051 nvidia.dcpower.v1.ConfigurationManagementService/UpsertConfiguration

# Disable PRS
grpcurl -plaintext -d '{"settings": {"prs_enabled": "false"}}' \
  localhost:50051 nvidia.dcpower.v1.ConfigurationManagementService/UpsertConfiguration
```

3. dpsctl Command Line
Manage PRS settings using the dpsctl command-line tool:
List Current Settings
```shell
dpsctl settings list
```

Output:

```json
{
  "status": { "code": "OK" },
  "settings": {
    "prs_enabled": "false",
    "metrics_interval_seconds": "30"
  }
}
```

Enable PRS Globally

```shell
dpsctl settings update --set prs_enabled=true
```

Disable PRS Globally

```shell
dpsctl settings update --set prs_enabled=false
```

4. User Interface (UI)
The DPS web interface provides a visual way to manage PRS:
- Navigate to Settings in the DPS UI
- Locate the Dynamic Power Integrations section
- Find the Power Reservation Steering (PRS) card
- Toggle the switch to enable or disable PRS
- Click Save to apply changes
The UI displays PRS with descriptive tags:
- workload-agnostic - Works with any workload type
- topology - Applies across the entire topology
- resource-group - Integrates with resource group management
5. Resource Group Configuration
Control PRS behavior on a per-resource-group basis:
Create Resource Group with PRS Disabled
```shell
dpsctl resource-group create \
  --resource-group "ml-inference" \
  --external-id 12345 \
  --policy "Node-Med" \
  --prs-enabled=false
```

Create Resource Group with PRS Enabled (Default)

```shell
# PRS is enabled by default when creating resource groups
dpsctl resource-group create \
  --resource-group "ml-training" \
  --external-id 12346 \
  --policy "Node-High"
```

SLURM Integration
When using SLURM, specify PRS settings in job comments:
```shell
# Submit job with PRS disabled
sbatch --comment="dps_policy:Node-High,dps_prs:false" job_script.sh

# Submit job with PRS enabled (default)
sbatch --comment="dps_policy:Node-High,dps_prs:true" job_script.sh
```

The SLURM prolog script parses these settings and configures the resource group accordingly.
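A prolog could parse the comment string roughly as follows. This is a sketch of the comma-separated key:value format shown above, not the actual DPS prolog script:

```python
def parse_dps_comment(comment):
    """Split a SLURM --comment string such as
    'dps_policy:Node-High,dps_prs:false' into a settings dict
    (illustrative; not the actual DPS prolog logic)."""
    settings = {}
    for field in comment.split(","):
        key, _, value = field.partition(":")
        if key.startswith("dps_"):
            settings[key] = value
    return settings

print(parse_dps_comment("dps_policy:Node-High,dps_prs:false"))
# → {'dps_policy': 'Node-High', 'dps_prs': 'false'}
```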
PRS vs Dynamic Power Management (DPM)
DPS supports two complementary dynamic power features:
| Feature | PRS | DPM |
|---|---|---|
| Scope | Cross-node optimization within domain | Per-resource-group policy management |
| Mechanism | Telemetry-based power steering | Policy-based power allocation |
| Control | Global setting + per-resource-group | Per-resource-group flag |
| Use Case | Maximize utilization of power budget | Flexible policy application |
Both can be used together:
- DPM enabled allows DPS to dynamically adjust policies
- PRS enabled allows automatic power redistribution based on telemetry
```shell
# Full dynamic power management
dpsctl resource-group create \
  --resource-group "full-dynamic" \
  --policy "Node-High" \
  --dpm-enable=true \
  --prs-enabled=true

# Static policy only (no dynamic adjustment)
dpsctl resource-group create \
  --resource-group "strict-policy" \
  --policy "Node-Med" \
  --dpm-enable=false
```

Architecture
PRS Service Components
PRS runs as a separate service in the dps namespace with the following components:
- Configuration Server (port 8880) - Manages PRS configuration and domain definitions
- Job Scheduler Server (port 8881) - Handles job-related power optimization requests
- Controller Loop - Continuously monitors and optimizes power distribution
Data Flow
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   DPS-Server    │────▶│   PRS Service   │────▶│   DPS-Server    │
│  (Telemetry)    │     │ (Optimization)  │     │ (Apply Limits)  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
        │                       │                       │
        ▼                       ▼                       ▼
  GPU Metrics            Power Budget           GPU Power Limits
  Collection             Calculation            Application
```

Best Practices
When to Enable PRS
- Heterogeneous workloads - When GPUs in a domain run different workload types
- Variable utilization - When GPU utilization fluctuates over time
- Power-constrained environments - When maximizing work within a fixed power budget is critical
- Large-scale deployments - When manual power tuning is impractical
When to Disable PRS
- Consistent workloads - When all GPUs run identical, steady-state workloads
- Strict power requirements - When exact power limits must be maintained
- Debugging scenarios - When isolating power-related issues
- Short jobs - When job duration is shorter than PRS optimization cycles
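The short-jobs point can be made concrete with a rough rule of thumb. The threshold below is an assumption for illustration, not a documented DPS rule: with the defaults of loopIntervalSeconds=90 and windowSize=5, PRS needs several full cycles of telemetry before its averages reflect the job at all.

```python
def prs_worth_enabling(job_seconds, loop_interval_s=90, window_size=5):
    """Rough heuristic (an assumption, not a documented DPS rule):
    a job should span at least window_size optimization cycles
    before PRS has enough averaged telemetry to help it."""
    return job_seconds >= loop_interval_s * window_size

print(prs_worth_enabling(300))   # 5-minute job
print(prs_worth_enabling(3600))  # 1-hour job
```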
Configuration Recommendations
- Start with defaults - PRS default settings work well for most deployments
- Monitor before tuning - Observe PRS behavior before adjusting parameters
- Use per-resource-group control - Disable PRS for specific workloads that need strict power limits
- Coordinate with policies - Ensure power policies provide reasonable bounds for PRS optimization
Troubleshooting
PRS Not Redistributing Power
- Check PRS is enabled globally:

  ```shell
  dpsctl settings list
  ```

- Verify PRS service is running:

  ```shell
  kubectl get pods -l app=prs
  ```

- Check resource group PRS setting:

  ```shell
  dpsctl resource-group list
  ```
PRS Service Connection Issues
- Verify DPS-Server can reach PRS:

  ```shell
  kubectl logs deployment/dps-server | grep -i prs
  ```

- Check PRS configuration:

  ```shell
  kubectl get configmap prs-config -o yaml
  ```
Further Reading
- Power Policies - Define static power limits
- Resource Groups - Manage workload power allocations
- Workload Power Profiles Settings - GPU-specific power optimization
- SLURM Integration - Automated job power management
- Topologies - Datacenter infrastructure modeling