Power Distribution Optimization Development
Overview
This use case demonstrates how to integrate a power distribution optimization algorithm with DPS and observe its effect on power consumption.
Goal
Test integration of algorithms that distribute power across datacenter resources.
Workflow Steps
- Enable Optimization Integration - Utilize DPS interfaces for power limit changes
- Implement Power Distribution Algorithm - Distribute power across datacenter equipment
- Observe Aggregate Power Utilization - Monitor total datacenter power consumption
- Observe Node-Level Power Utilization - Monitor individual node power patterns
- Observe GPU-Level Power Utilization - Monitor detailed GPU power and compare results
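The "Implement Power Distribution Algorithm" step can be illustrated with a minimal sketch. It assumes one simple strategy, a demand-proportional split clamped to per-node minimum and maximum limits; this is not the algorithm DPS itself uses. The function name `distribute_power` and its inputs (`budget_w`, `demand_w`, `limits_w`) are hypothetical and not part of the DPS SDK. Applying the resulting limits would go through the DPS interfaces described in this guide.

```python
from typing import Dict, Tuple


def distribute_power(
    budget_w: float,
    demand_w: Dict[str, float],
    limits_w: Dict[str, Tuple[float, float]],
) -> Dict[str, float]:
    """Split budget_w across nodes in proportion to demand, within (min, max) limits.

    Hypothetical helper for illustration; not a DPS API.
    """
    # Start every node at its minimum limit.
    alloc = {node: limits_w[node][0] for node in demand_w}
    remaining = budget_w - sum(alloc.values())
    if remaining < 0:
        raise ValueError("budget is below the sum of per-node minimums")

    # Nodes that can still accept more power.
    open_nodes = {n for n in demand_w if alloc[n] < limits_w[n][1]}
    while remaining > 1e-9 and open_nodes:
        total_demand = sum(demand_w[n] for n in open_nodes)
        if total_demand <= 0:
            break
        granted = 0.0
        for node in list(open_nodes):
            # Proportional share of what is left, capped by the node's headroom.
            share = remaining * demand_w[node] / total_demand
            headroom = limits_w[node][1] - alloc[node]
            give = min(share, headroom)
            alloc[node] += give
            granted += give
            if alloc[node] >= limits_w[node][1] - 1e-9:
                open_nodes.discard(node)
        remaining -= granted
        if granted <= 1e-9:
            break
    return alloc
```

Power that a node cannot accept because it has reached its maximum is redistributed to the remaining nodes on the next pass, so the budget is fully used unless every node is at its limit.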
SDK Configuration
Configure your DPS environment with:
- Topology file with your datacenter nodes
- DPS configuration (dps-values.yaml)
- BMC simulator configuration (bmc-sim-values.yaml)
Environment Setup
# Deploy DPS with your configuration
helm upgrade --reuse-values dps <dps-chart> -f <your-dps-values.yaml>
# Login to DPS
dpsctl --host api.dps --port 443 --insecure-tls-skip-verify login --username <username> --password <password>
# Import and activate your topology
dpsctl --host api.dps --port 443 --insecure-tls-skip-verify topology import --filename <topology-file.json>
dpsctl --host api.dps --port 443 --insecure-tls-skip-verify topology activate --topology <topology-name>
Observable Metrics
Aggregate Power Distribution Metrics
- Total Power Consumption: Aggregate power across all managed nodes
- Power Distribution Variance: Variance in power allocation across nodes
- Power Utilization Rate: Ratio of actual power draw to available power capacity
- Distribution Balance: Power distribution balance based on workload requirements
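As a rough illustration of how the first three aggregate metrics could be derived, the sketch below assumes per-node power readings have already been collected into a dictionary of watts. The function name `aggregate_metrics`, the input shape, and the `capacity_w` parameter are assumptions for this example, not part of the DPS SDK.

```python
from statistics import pvariance
from typing import Dict


def aggregate_metrics(node_power_w: Dict[str, float], capacity_w: float) -> Dict[str, float]:
    """Summarize per-node power readings (watts) into aggregate metrics.

    Hypothetical helper; expects at least one node and capacity_w > 0.
    """
    readings = list(node_power_w.values())
    total = sum(readings)
    return {
        # Total Power Consumption: sum across all managed nodes.
        "total_power_w": total,
        # Power Distribution Variance: population variance across nodes.
        "power_variance_w2": pvariance(readings),
        # Power Utilization Rate: actual draw divided by available capacity.
        "utilization": total / capacity_w,
    }
```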
Node-Level Power Metrics
- Per-Node Power Consumption: Individual node power usage patterns
- Power Policy Compliance: Adherence to assigned power policies
- Node Power Stability: Consistency of power consumption over time
- Inter-Node Power Balance: Power distribution balance across nodes
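Two of the node-level metrics above can be sketched from a time series of power samples for one node. This example assumes stability is expressed as a coefficient of variation and compliance as the fraction of samples at or below an assigned limit; both definitions, and the names `power_stability` and `policy_compliance`, are illustrative choices rather than DPS definitions.

```python
from statistics import mean, pstdev
from typing import Sequence


def power_stability(samples_w: Sequence[float]) -> float:
    """Coefficient of variation of a node's power samples; lower means steadier.

    Hypothetical helper; expects a non-empty sequence of watts.
    """
    avg = mean(samples_w)
    return pstdev(samples_w) / avg if avg else 0.0


def policy_compliance(samples_w: Sequence[float], limit_w: float) -> float:
    """Fraction of samples at or below the assigned power limit.

    Hypothetical helper; expects a non-empty sequence of watts.
    """
    within = sum(1 for sample in samples_w if sample <= limit_w)
    return within / len(samples_w)
```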
GPU-Level Power Metrics
- Per-GPU Power Utilization: Individual GPU power consumption
- Intra-Node GPU Balance: Power distribution among GPUs within nodes
- GPU Power Stability: Stability of GPU power consumption
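Intra-node GPU balance could be quantified in several ways. The sketch below assumes one of them: the spread between the highest and lowest GPU reading on a node, relative to that node's mean GPU power. The function name `gpu_imbalance` and the `(node, gpu_index)` key layout are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def gpu_imbalance(gpu_power_w: Dict[Tuple[str, int], float]) -> Dict[str, float]:
    """Per-node relative spread of GPU power: (max - min) / mean.

    Hypothetical helper; 0.0 means all GPUs on the node draw the same power.
    """
    by_node: Dict[str, List[float]] = defaultdict(list)
    for (node, _gpu_index), watts in gpu_power_w.items():
        by_node[node].append(watts)

    result: Dict[str, float] = {}
    for node, readings in by_node.items():
        avg = sum(readings) / len(readings)
        result[node] = (max(readings) - min(readings)) / avg if avg else 0.0
    return result
```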
Command Line Example
# Create resource group with power policy
dpsctl --host api.dps --port 443 --insecure-tls-skip-verify resource-group create --resource-group opt-test --external-id 1 --policy GB200-High
# Add nodes to the resource group
dpsctl --host api.dps --port 443 --insecure-tls-skip-verify resource-group add --resource-group opt-test --entities <node1>,<node2>
# Activate resource group to apply power policy
dpsctl --host api.dps --port 443 --insecure-tls-skip-verify resource-group activate --sync --resource-group opt-test
# Monitor power consumption
dpsctl --host api.dps --port 443 --insecure-tls-skip-verify check metrics --nodes <node1>,<node2>
# Clean up when done
dpsctl --host api.dps --port 443 --insecure-tls-skip-verify resource-group delete --resource-group opt-test
API Integration Example (Python)
from dpsapi.api import DpsApi
# Initialize DPS API
api = DpsApi("api.dps", 443).with_username("<username>").with_password("<password>")
# Create resource group with policy
api.create_resource_group(
    external_id=1,
    resource_group_name="opt-test",
    policy_name="GB200-High",
)
# Add nodes to resource group
api.add_entities_to_resource_group("opt-test", ["<node-name>"])
# Activate resource group to apply power policy
api.activate_resource_group("opt-test")
# Monitor power consumption
metrics = api.request_metrics(requested_gpus=[["<node-name>", 0]])
# Clean up when done
api.delete_resource_group("opt-test")
Testing and Validation
Test the workflow by:
- Creating resource groups with different power policies
- Monitoring power consumption at aggregate, node, and GPU levels
- Observing power distribution across your datacenter resources
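The first two checks can be combined into a small loop. The sketch below assumes an `api` object that exposes the same methods used in the Python example in this guide (`create_resource_group`, `add_entities_to_resource_group`, `activate_resource_group`, `request_metrics`, `delete_resource_group`) and calls them with the same arguments. The function name `run_policy_trials` and the `opt-test-<n>` group naming are hypothetical, and the return value of `request_metrics` is stored as-is because its structure is not described here.

```python
from typing import Any, Dict, List


def run_policy_trials(api: Any, policies: List[str], nodes: List[str]) -> Dict[str, Any]:
    """Apply each power policy in turn and collect the metrics returned for it.

    Hypothetical helper; `api` is assumed to behave like the DpsApi object
    shown in the API integration example.
    """
    results: Dict[str, Any] = {}
    for external_id, policy in enumerate(policies, start=1):
        group = f"opt-test-{external_id}"  # hypothetical naming scheme
        api.create_resource_group(
            external_id=external_id,
            resource_group_name=group,
            policy_name=policy,
        )
        try:
            api.add_entities_to_resource_group(group, nodes)
            api.activate_resource_group(group)
            # Request GPU 0 on each node, mirroring the example call.
            results[policy] = api.request_metrics(
                requested_gpus=[[node, 0] for node in nodes]
            )
        finally:
            # Always clean up so the next policy starts from a clean state.
            api.delete_resource_group(group)
    return results
```

Deleting the group in a `finally` block means a failed activation or metrics request does not leave a stale resource group behind for the next trial.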