5.3.1. Power Distribution Optimization Development
Overview
This use case demonstrates how to integrate power distribution optimization algorithms with DPS.
Goal
Test the integration of algorithms that distribute power across datacenter resources.
Workflow Steps
- Enable Optimization Integration - Utilize DPS interfaces for power limit changes
- Implement Power Distribution Algorithm - Distribute power across datacenter equipment (see the sketch after this list)
- Observe Aggregate Power Utilization - Monitor total datacenter power consumption
- Observe Node-Level Power Utilization - Monitor individual node power patterns
- Observe GPU-Level Power Utilization - Monitor detailed GPU power and compare results
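The distribution step itself can be prototyped independently of DPS. Below is a minimal sketch of a proportional redistribution loop under a fixed datacenter budget; read_node_power and apply_node_limit are hypothetical callables that you would back with the DPS interfaces shown later in this section, and the budget and per-node caps are illustrative values, not DPS defaults.
# Minimal proportional redistribution loop (sketch). read_node_power and
# apply_node_limit are hypothetical callables to be backed by DPS interfaces;
# the budget and per-node caps below are illustrative, not DPS defaults.
import time

DATACENTER_BUDGET_W = 50000         # illustrative total power budget
NODE_MIN_W, NODE_MAX_W = 500, 2500  # illustrative per-node limits

def redistribute(node_power):
    """Split the budget proportionally to observed demand, clamped to the caps."""
    total_demand = sum(node_power.values()) or 1.0
    limits = {}
    for node, power in node_power.items():
        share = DATACENTER_BUDGET_W * power / total_demand
        limits[node] = min(max(share, NODE_MIN_W), NODE_MAX_W)
    return limits

def control_loop(nodes, read_node_power, apply_node_limit, period_s=30):
    while True:
        observed = {node: read_node_power(node) for node in nodes}  # observe
        for node, limit in redistribute(observed).items():          # decide
            apply_node_limit(node, limit)                           # act
        time.sleep(period_s)
Proportional sharing is only one possible policy; the same observe/decide/act loop structure also fits fair-share or priority-based schemes.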
SDK Configuration
Configure your DPS environment with:
- Topology file with your datacenter nodes (a quick validation sketch follows this list)
- DPS configuration (dps-values.yaml)
- BMC simulator configuration (bmc-sim-values.yaml)
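Before importing a topology, it can help to sanity-check the file. The sketch below assumes a JSON file with a top-level "nodes" list whose entries carry a "name" field; these key names are illustrative and may differ from your actual topology schema.
# Quick pre-import sanity check of a topology file. The "nodes" and "name"
# keys are illustrative assumptions; adapt them to your actual schema.
import json

with open("topology-file.json") as f:
    topology = json.load(f)

nodes = topology.get("nodes", [])
print(f"{len(nodes)} nodes defined")
for node in nodes:
    print(" -", node.get("name", "<unnamed>"))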
Environment Setup
# Deploy DPS with your configuration
helm upgrade --reuse-values dps <dps-chart> -f <your-dps-values.yaml>
# Login to DPS
dpsctl --host api.dps --port 443 --insecure-tls-skip-verify login --username <username> --password <password>
# Import and activate your topology
dpsctl --host api.dps --port 443 --insecure-tls-skip-verify topology import --filename <topology-file.json>
dpsctl --host api.dps --port 443 --insecure-tls-skip-verify topology activate --topology <topology-name>
Observable Metrics
Aggregate Power Distribution Metrics
- Total Power Consumption: Aggregate power across all managed nodes
- Power Distribution Variance: Variance in power allocation across nodes (computed in the sketch after this list)
- Power Utilization Rate: Actual vs available power capacity utilization
- Distribution Balance: Power distribution balance based on workload requirements
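The first three aggregate metrics can be computed directly from per-node readings. A minimal sketch with illustrative wattage values and an assumed capacity figure; in practice the readings come from dpsctl check metrics or api.request_metrics():
# Aggregate metrics from per-node power readings (watts). The readings and the
# capacity value are illustrative inputs for this sketch.
from statistics import pvariance

node_power = {"node1": 1850.0, "node2": 1720.0, "node3": 1930.0}  # example readings
available_capacity_w = 6000.0                                     # example capacity

total_power = sum(node_power.values())
distribution_variance = pvariance(node_power.values())
utilization_rate = total_power / available_capacity_w

print(f"total={total_power:.0f} W, variance={distribution_variance:.1f}, "
      f"utilization={utilization_rate:.1%}")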
Node-Level Power Metrics
- Per-Node Power Consumption: Individual node power usage patterns
- Power Policy Compliance: Adherence to assigned power policies (checked in the sketch after this list)
- Node Power Stability: Consistency of power consumption over time
- Inter-Node Power Balance: Power distribution balance across nodes
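Policy compliance and stability can be evaluated from a short series of readings for one node. A sketch with illustrative sample values and an assumed 2000 W policy limit:
# Policy compliance and power stability for one node over a sampling window.
# The samples and the 2000 W limit are illustrative; collect real samples with
# repeated dpsctl check metrics calls or api.request_metrics().
from statistics import mean, pstdev

policy_limit_w = 2000.0                       # assumed policy cap for this example
samples_w = [1820.0, 1905.0, 1880.0, 1990.0]  # example readings over time

compliant = all(sample <= policy_limit_w for sample in samples_w)
stability = pstdev(samples_w) / mean(samples_w)  # coefficient of variation

print(f"compliant={compliant}, stability (CoV)={stability:.2%}")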
GPU-Level Power Metrics
- Per-GPU Power Utilization: Individual GPU power consumption
- Intra-Node GPU Balance: Power distribution among GPUs within nodes (see the sketch after this list)
- GPU Power Stability: Stability of GPU power consumption
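Intra-node balance can be expressed as the relative spread of per-GPU power within a node. A sketch with illustrative per-GPU readings; real values come from api.request_metrics() as shown in the API example below:
# Intra-node GPU balance: relative spread of per-GPU power within one node.
# The readings below are illustrative example values.
from statistics import mean

gpu_power_w = [640.0, 655.0, 610.0, 675.0]  # example readings for one node's GPUs

avg = mean(gpu_power_w)
imbalance = (max(gpu_power_w) - min(gpu_power_w)) / avg  # 0.0 means perfectly balanced

print(f"mean GPU power={avg:.0f} W, imbalance={imbalance:.1%}")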
Command Line Example
# Create resource group with power policy
dpsctl --host api.dps --port 443 --insecure-tls-skip-verify resource-group create --resource-group opt-test --external-id 1 --policy GB200-High
# Add nodes to the resource group
dpsctl --host api.dps --port 443 --insecure-tls-skip-verify resource-group add --resource-group opt-test --entities <node1>,<node2>
# Activate resource group to apply power policy
dpsctl --host api.dps --port 443 --insecure-tls-skip-verify resource-group activate --sync --resource-group opt-test
# Monitor power consumption
dpsctl --host api.dps --port 443 --insecure-tls-skip-verify check metrics --nodes <node1>,<node2>
# Clean up when done
dpsctl --host api.dps --port 443 --insecure-tls-skip-verify resource-group delete --resource-group opt-test
API Integration Example (Python)
from dpsapi.api import DpsApi
# Initialize DPS API
api = DpsApi("api.dps", 443).with_username("<username>").with_password("<password>")
# Create resource group with policy
api.create_resource_group(
    external_id=1,
    resource_group_name="opt-test",
    policy_name="GB200-High"
)
# Add nodes to resource group
api.add_entities_to_resource_group("opt-test", ["<node-name>"])
# Activate resource group to apply power policy
api.activate_resource_group("opt-test")
# Monitor power consumption
metrics = api.request_metrics(requested_gpus=[["<node-name>", 0]])
# Clean up when done
api.delete_resource_group("opt-test")
Testing and Validation
Test the workflow by:
- Creating resource groups with different power policies
- Monitoring power consumption at aggregate, node, and GPU levels
- Observing power distribution across your datacenter resources (an end-to-end sketch follows this list)
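A minimal end-to-end validation sketch using the API calls shown above. The node names and policy name are placeholders, and how you extract wattage from the request_metrics() response depends on the SDK's return format:
# End-to-end validation sketch: create a group, apply a policy, sample metrics,
# then clean up. Placeholders in angle brackets must be replaced; asserting on
# specific power values depends on the structure request_metrics() returns.
import time
from dpsapi.api import DpsApi

api = DpsApi("api.dps", 443).with_username("<username>").with_password("<password>")

api.create_resource_group(
    external_id=1,
    resource_group_name="opt-test",
    policy_name="GB200-High"
)
api.add_entities_to_resource_group("opt-test", ["<node1>", "<node2>"])
api.activate_resource_group("opt-test")

try:
    for _ in range(5):  # short sampling window
        metrics = api.request_metrics(requested_gpus=[["<node1>", 0], ["<node2>", 0]])
        print(metrics)  # inspect or assert on reported power here
        time.sleep(30)
finally:
    api.delete_resource_group("opt-test")  # always clean up the test group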