Developer Guide
This guide provides a high-level overview of developing with and integrating against DPS using the DPS SDK. Whether you're integrating with workload schedulers, building power optimization systems, or developing grid integration solutions, this guide will help you get started.
Quick Start
Setting Up Your Environment
Download the DPS SDK files from the NVIDIA NVOnline Portal and unarchive them.
Install Task
# Linux
sh -c "$(curl --location https://taskfile.dev/install.sh)" -- -d -b /usr/local/bin
# macOS
brew install go-task/tap/go-task
# Alternative (any platform)
go install github.com/go-task/task/v3/cmd/task@latest
Install SDK dependencies (kubectl, helm, dpsctl)
task setup
Deploy the SDK
This task will install and run a k3s cluster locally and deploy the dps-sdk to that cluster.
# Using helm chart defaults
task sdk
# Using custom helm chart values
task sdk VALUES_FILE=path/to/my-values.yaml
Verify the SDK is running properly:
kubectl get pods -n dps
Your output should look similar to the following:
alertmanager-dps-sdk-kube-prometheus-st-alertmanager-0 2/2 Running 0 15s
bmc-gb200-simulator-b6b88c546-687n2 1/1 Running 0 15s
dps-docs-67bb469ccc-mzslz 1/1 Running 0 15s
dps-sdk-alloy-0 2/2 Running 0 15s
dps-sdk-grafana-d676f6f8f-hnkhd 3/3 Running 0 15s
dps-sdk-kube-prometheus-st-operator-6fd968d45c-hj4pl 1/1 Running 0 15s
dps-sdk-kube-state-metrics-55d94c9858-b2hh5 1/1 Running 0 15s
dps-sdk-loki-0 2/2 Running 0 15s
dps-sdk-loki-chunks-cache-0 2/2 Running 0 15s
dps-sdk-loki-gateway-85c47bddc-972t7 1/1 Running 0 15s
dps-sdk-loki-results-cache-0 2/2 Running 0 15s
dps-sdk-postgresql-0 1/1 Running 0 15s
dps-sdk-prometheus-node-exporter-p4gtt 1/1 Running 0 15s
dps-sdk-promtail-dtbwc 1/1 Running 0 15s
dps-sdk-pyroscope-0 1/1 Running 0 15s
dps-sdk-tempo-0 1/1 Running 0 15s
dps-server-0 1/1 Running 0 15s
dps-ui-69cb67b98d-9fsqx 1/1 Running 0 15s
loki-canary-6qwwm 1/1 Running 0 15s
prometheus-dps-sdk-kube-prometheus-st-prometheus-0 2/2 Running 0 15s
prs-65f6f75b8f-qnmct 1/1 Running 0 15s
The dpsctl CLI provides its own command to verify that the DPS application is running properly:
dpsctl -H api.dps.sdk.local -p 443 --insecure-tls-skip-verify verify
A successful deployment should produce output similar to the following:
{
"auth": {
"details": "JWT secret key is configured",
"healthy": true,
"message": "Authentication service is configured with JWT",
"name": "Authentication Service"
},
"bcm": {
"message": "BCM check skipped - not configured",
"name": "BCM",
"skip_reason": "BCM not configured or not using BCM credentials store",
"skipped": true
},
"database": {
"details": "Database ping successful",
"healthy": true,
"message": "Database connectivity successful",
"name": "Database"
},
"dps_server": {
"details": "gRPC endpoint accessible, topology server responsive",
"healthy": true,
"message": "DPS Server is healthy",
"name": "DPS Server"
},
"status": {
"diag_msg": "All requested DPS components are healthy",
"ok": true
},
"ui": {
"details": "URL: http://ui.dps.sdk.local. Note: You may need to update your /etc/hosts file to resolve ui.dps.sdk.local to the cluster's IP address. The UI should be tested from an external browser.",
"healthy": true,
"message": "UI ingress endpoint retrieved successfully",
"name": "DPS UI"
}
}
Access services
- DPS API: http://api.dps.sdk.local
- DPS UI: http://ui.dps.sdk.local (testuser/testuser)
- Grafana: http://grafana.dps.sdk.local (admin/dps)
- Prometheus: http://prometheus.dps.sdk.local
- AlertManager: http://alertmanager.dps.sdk.local
- Pyroscope: http://pyroscope.dps.sdk.local
Key Concepts for Developers
Topology
The topology represents the physical datacenter power distribution:
- Entities - Physical components (nodes, racks, PDUs, PSUs)
- Relationships - Power distribution hierarchy
- Baseline policies - Default power settings for each entity
The simulator includes a pre-configured topology with 96 emulated nodes across 8 racks.
Resource Groups
Resource groups are the primary abstraction for managing power policies for workloads. They follow a lifecycle:
- CREATE - Initialize an empty resource group with optional default policy
- ADD - Add compute nodes to the group
- ACTIVATE - Apply power policies to the hardware
- UPDATE (optional) - Dynamically adjust policies during execution
- DELETE - Clean up and restore topology defaults
Resource groups are ideal for:
- Workload schedulers (SLURM, Kubernetes) - Map jobs to resource groups
- Power optimization - Apply custom policies to workload collections
- Temporary overrides - Override topology policies for specific workloads
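Purely as an illustration (not part of the SDK or its API), the lifecycle above can be sketched as a small state machine that rejects out-of-order operations, such as activating a group before it has been created and populated:

```python
# Illustrative only: the resource group lifecycle as a state machine.
# ADD and UPDATE may repeat; DELETE is terminal.
VALID_TRANSITIONS = {
    "NEW": {"CREATE"},
    "CREATED": {"ADD"},
    "POPULATED": {"ADD", "ACTIVATE"},
    "ACTIVE": {"UPDATE", "DELETE"},
}
STATE_AFTER = {
    "CREATE": "CREATED",
    "ADD": "POPULATED",
    "ACTIVATE": "ACTIVE",
    "UPDATE": "ACTIVE",
    "DELETE": "DELETED",
}

def run_lifecycle(operations):
    """Replay a sequence of lifecycle operations, rejecting invalid orderings."""
    state = "NEW"
    for op in operations:
        if op not in VALID_TRANSITIONS.get(state, set()):
            raise ValueError(f"{op} not allowed in state {state}")
        state = STATE_AFTER[op]
    return state

print(run_lifecycle(["CREATE", "ADD", "ACTIVATE", "UPDATE", "DELETE"]))  # DELETED
```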
Power Policies
Power policies define power constraints for hardware:
- Node-level policies - Overall node power limits
- GPU-level policies - Per-GPU power limits (watts or percentage)
- CPU-level policies - CPU power constraints
- Memory-level policies - Memory power constraints
- Workload profiles - Hardware-specific optimizations for GPU workloads
The SDK includes pre-configured policies:
- GB200-High: 5800W node, 5600W GPU
- GB200-Med: 3400W node, 3200W GPU
- GB200-Low: 1700W node, 1600W GPU
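Treating these caps as plain per-node and per-GPU watt limits, a hypothetical helper (`pick_policy` is not part of the SDK) could select the highest pre-configured policy that fits a per-node power budget:

```python
# Pre-configured GB200 policies from this guide (caps in watts).
POLICIES = {
    "GB200-High": {"node_w": 5800, "gpu_w": 5600},
    "GB200-Med": {"node_w": 3400, "gpu_w": 3200},
    "GB200-Low": {"node_w": 1700, "gpu_w": 1600},
}

def pick_policy(per_node_budget_w):
    """Return the policy with the highest node cap that fits the budget, or None."""
    fitting = [(p["node_w"], name) for name, p in POLICIES.items()
               if p["node_w"] <= per_node_budget_w]
    return max(fitting)[1] if fitting else None

print(pick_policy(4000))  # GB200-Med
```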
Dynamic Power Management (DPM)
DPS supports dynamic power optimization:
- Power Reservation Steering (PRS) - Automatic power redistribution based on telemetry
- Runtime adjustments - Update policies during workload execution
- Telemetry-driven - React to actual power consumption patterns
Common Integration Patterns
1. Workload Scheduler Integration - Integrate with Workload Schedulers like SLURM using prolog/epilog scripts
2. Power Distribution Optimization - Optimize power distribution across topologies
3. Grid Integration - Implement demand response for grid events using the NvGrid API
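As a sketch of pattern 1, hypothetical SLURM prolog/epilog hooks could drive the resource group lifecycle around each job. This mirrors the Python client calls shown later in this guide; the `api` object is assumed to be a connected DpsApi instance, and real SLURM node lists would need expansion (e.g. via `scontrol show hostnames`) rather than a plain Python list:

```python
# Hypothetical prolog/epilog hooks; `api` is assumed to be a connected
# DpsApi instance as shown in the Python client section of this guide.

def prolog(api, job_id, node_names, policy="GB200-High"):
    """Before the job starts: create, populate, and activate a resource group."""
    group = f"job_{job_id}"
    api.resource_groups.create(external_id=int(job_id), group_name=group,
                               policy_name=policy, prs_enabled=True)
    api.resource_groups.add(group_name=group, entity_names=node_names)
    api.resource_groups.activate(group_name=group)
    return group

def epilog(api, job_id):
    """After the job ends: delete the group, restoring topology defaults."""
    api.resource_groups.delete(group_name=f"job_{job_id}")

# In a real prolog script, job context would come from SLURM's environment,
# e.g. SLURM_JOB_ID and SLURM_JOB_NODELIST.
```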
Integration Options
DPS provides a gRPC API for integration, with pre-built libraries for Golang and Python.
Python Client
The Python Client provides a Pythonic interface to DPS:
from dpsapi import DpsApi
# Connect to DPS
api = DpsApi(
dps_host="api.dps.sdk.local",
dps_port=80,
dps_insecure_tls_skip_verify=True
).with_username("testuser").with_password("testuser")
# Create resource group for workload
api.resource_groups.create(
external_id=12345,
group_name="job_12345",
policy_name="GB200-High",
prs_enabled=True
)
# Add compute nodes
api.resource_groups.add(
group_name="job_12345",
entity_names=["gb200-r01-0001", "gb200-r01-0002"]
)
# Activate with power policies
api.resource_groups.activate(group_name="job_12345")
# Query metrics
metrics = api.metrics.request_metrics([
("gb200-r01-0001", 0), # node, gpu_id
("gb200-r01-0001", 1)
])
# Cleanup when done
api.resource_groups.delete(group_name="job_12345")
Complete Example: See this Power Distribution Optimization Demo for a full power optimization workflow demonstrating:
- Power distribution algorithms across datacenter resources
- Multi-level power monitoring (aggregate, node-level, and GPU-level metrics)
- Resource group lifecycle with power policy management
- Integration patterns for power optimization systems
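Because the DELETE step restores topology defaults, it should run even when the workload fails. One way to guarantee that (a sketch around the same resource_groups methods used in the example above, not a helper shipped with the client) is a context manager:

```python
from contextlib import contextmanager

# Sketch: wrap the lifecycle so the group is always deleted, restoring
# topology defaults even if the workload raises mid-run.

@contextmanager
def resource_group(api, external_id, group_name, policy_name, nodes):
    api.resource_groups.create(external_id=external_id, group_name=group_name,
                               policy_name=policy_name, prs_enabled=True)
    try:
        api.resource_groups.add(group_name=group_name, entity_names=nodes)
        api.resource_groups.activate(group_name=group_name)
        yield group_name
    finally:
        # Runs on normal exit and on exceptions alike.
        api.resource_groups.delete(group_name=group_name)
```

Usage: `with resource_group(api, 12345, "job_12345", "GB200-High", nodes): ...` runs the workload inside the block, and cleanup happens on every exit path.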
Go gRPC Client
For Go applications, use the gRPC client directly:
package main
import (
"context"
"log"
"google.golang.org/grpc"
"nvidia.com/NVIDIA/dcpower/api/v1"
)
func main() {
// Connect to DPS
conn, err := grpc.Dial("api.dps.sdk.local:50051", grpc.WithInsecure())
if err != nil {
log.Fatalf("Failed to connect: %v", err)
}
defer conn.Close()
// Create resource group client
client := api.NewResourceGroupManagementServiceClient(conn)
// Create resource group
createReq := &api.ResourceGroupCreateRequest{
ExternalId: 12345,
GroupName: "job_12345",
PolicyName: "GB200-High",
PrsEnabled: true,
DpmEnable: true,
}
_, err = client.ResourceGroupCreate(context.Background(), createReq)
if err != nil {
log.Fatalf("Failed to create resource group: %v", err)
}
// Add entities
addReq := &api.ResourceGroupAddRequest{
GroupName: "job_12345",
EntityNames: []string{"gb200-r01-0001", "gb200-r01-0002"},
}
_, err = client.ResourceGroupAddEntities(context.Background(), addReq)
if err != nil {
log.Fatalf("Failed to add entities: %v", err)
}
// Activate
activateReq := &api.ResourceGroupActivateRequest{
GroupName: "job_12345",
}
_, err = client.ResourceGroupActivate(context.Background(), activateReq)
if err != nil {
log.Fatalf("Failed to activate: %v", err)
}
log.Println("Resource group created and activated successfully")
}
Complete Example: See the Grid Integrator guide for complete integration examples demonstrating:
- NvGrid API client for grid integration
- Load target scheduling and management
- Webhook event handling
- Power feed metadata queries
Using the API:
The API reference can be found here.
gRPC API
For any language, use the gRPC API via grpcurl:
# Get Token
grpcurl -insecure -d '{
"passwordCredential": {
"username": "testuser",
"password": "testuser"
}
}' \
api.dps.sdk.local:443 \
nvidia.dcpower.v1.AuthService/Token
# Use the token to call APIs
export DPS_ACCESS_TOKEN="<token-from-above>"
# List resource groups
grpcurl -insecure -H "authorization: Bearer $DPS_ACCESS_TOKEN" \
api.dps.sdk.local:443 \
nvidia.dcpower.v1.ResourceGroupManagementService/ResourceGroupList
# Create resource group
grpcurl -insecure -H "authorization: Bearer $DPS_ACCESS_TOKEN" \
-d '{
"external_id": 12345,
"group_name": "job_12345",
"policy_name": "GB200-High"
}' \
api.dps.sdk.local:443 \
nvidia.dcpower.v1.ResourceGroupManagementService/ResourceGroupCreate
# Add entities
grpcurl -insecure -H "authorization: Bearer $DPS_ACCESS_TOKEN" \
-d '{
"group_name": "job_12345",
"resource_names": ["gb200-r01-0001", "gb200-r01-0002"]
}' \
api.dps.sdk.local:443 \
nvidia.dcpower.v1.ResourceGroupManagementService/ResourceGroupAddResources
# Activate
grpcurl -insecure -H "authorization: Bearer $DPS_ACCESS_TOKEN" \
-d '{"group_name": "job_12345"}' \
api.dps.sdk.local:443 \
nvidia.dcpower.v1.ResourceGroupManagementService/ActivateResourceGroup
# Delete
grpcurl -insecure -H "authorization: Bearer $DPS_ACCESS_TOKEN" \
-d '{"group_name": "job_12345"}' \
api.dps.sdk.local:443 \
nvidia.dcpower.v1.ResourceGroupManagementService/ResourceGroupDelete
CLI (dpsctl)
For scripting and manual operations:
# Create resource group
dpsctl -H api.dps.sdk.local -p 443 --insecure-tls-skip-verify rg create \
--external-id 12345 \
--resource-group "job_12345" \
--policy "GB200-High"
# Add entities
dpsctl -H api.dps.sdk.local -p 443 --insecure-tls-skip-verify rg add \
--resource-group "job_12345" \
--entities "gb200-r01-0001,gb200-r01-0002"
# Activate
dpsctl -H api.dps.sdk.local -p 443 --insecure-tls-skip-verify rg activate \
--resource-group "job_12345"
# List
dpsctl -H api.dps.sdk.local -p 443 --insecure-tls-skip-verify rg list
# Delete
dpsctl -H api.dps.sdk.local -p 443 --insecure-tls-skip-verify rg delete \
--resource-group "job_12345"
Available APIs
DPS provides multiple gRPC services:
- ResourceGroupManagementService - Resource group lifecycle and management
- TopologyManagementService - Topology import, activation, and queries
- MetricsManagementService - Power telemetry and metrics
- PolicyManagementService - Power policy management
- NvGridService - Grid integration and load management
- AuthService - Authentication and authorization
- VersionService - Version information
Development Workflow
Local Development with DPS SDK
# Starts k3s and deploys SDK
task deploy:sdk
# Initialize simulator (optional)
task sim
# Run your integration code
python my_integration.py
# Monitor with Grafana
open http://grafana.dps.sdk.local # admin/dps
# View logs
kubectl logs -n dps -l app=dps-server -f
# Cleanup (tears down SDK and uninstalls k3s)
task sdk:down
Testing Your Integration
Use the simulator to test integration scenarios:
# Run automated resource group simulation
task sim:rgs
# Custom simulation parameters
task sim:rgs MAX_RGS=10 MIN_DURATION=60 MAX_DURATION=180
# Monitor metrics in Grafana while testing
# - DPS Server Dashboard
# - DPS SDK Datacenter dashboard
# - DPS SDK Resource Groups dashboard
Troubleshooting
Common Issues
Connection refused:
# Verify DPS is running
kubectl get pods -n dps
kubectl get svc -n dps
dpsctl -H api.dps.sdk.local -p 443 --insecure-tls-skip-verify verify
# Check /etc/hosts entries
grep dps.sdk.local /etc/hosts
Authentication errors:
# Login to the server
dpsctl -H api.dps.sdk.local -p 443 --insecure-tls-skip-verify login
Resource group activation fails:
# View server logs
kubectl logs -n dps -l app=dps-server --tail=100
Additional Resources
- Administrator Documentation - Deployment and operations
- Partner Integration Guides - Integration patterns
- API Reference - Detailed API Reference