Developer Guide

This guide provides a high-level overview of developing with and integrating against DPS using the DPS SDK. Whether you’re integrating with workload schedulers, building power optimization systems, or developing grid integration solutions, this guide will help you get started.

Quick Start

Setting Up Your Environment

Download the DPS SDK files from the NVIDIA NVOnline Portal and unarchive them.

Install Task

# Linux
sh -c "$(curl --location https://taskfile.dev/install.sh)" -- -d -b /usr/local/bin
# macOS
brew install go-task/tap/go-task
# Alternative (any platform)
go install github.com/go-task/task/v3/cmd/task@latest

Install SDK dependencies (kubectl, helm, dpsctl)

task setup

Deploy the SDK

This task installs and runs a k3s cluster locally and deploys the dps-sdk to that cluster.

# Using helm chart defaults
task sdk

# Using custom helm chart values
task sdk VALUES_FILE=path/to/my-values.yaml

Verify the SDK is running properly:

kubectl get pods -n dps

Your output should look similar to the following:

alertmanager-dps-sdk-kube-prometheus-st-alertmanager-0   2/2     Running   0          15s
bmc-gb200-simulator-b6b88c546-687n2                      1/1     Running   0          15s
dps-docs-67bb469ccc-mzslz                                1/1     Running   0          15s
dps-sdk-alloy-0                                          2/2     Running   0          15s
dps-sdk-grafana-d676f6f8f-hnkhd                          3/3     Running   0          15s
dps-sdk-kube-prometheus-st-operator-6fd968d45c-hj4pl     1/1     Running   0          15s
dps-sdk-kube-state-metrics-55d94c9858-b2hh5              1/1     Running   0          15s
dps-sdk-loki-0                                           2/2     Running   0          15s
dps-sdk-loki-chunks-cache-0                              2/2     Running   0          15s
dps-sdk-loki-gateway-85c47bddc-972t7                     1/1     Running   0          15s
dps-sdk-loki-results-cache-0                             2/2     Running   0          15s
dps-sdk-postgresql-0                                     1/1     Running   0          15s
dps-sdk-prometheus-node-exporter-p4gtt                   1/1     Running   0          15s
dps-sdk-promtail-dtbwc                                   1/1     Running   0          15s
dps-sdk-pyroscope-0                                      1/1     Running   0          15s
dps-sdk-tempo-0                                          1/1     Running   0          15s
dps-server-0                                             1/1     Running   0          15s
dps-ui-69cb67b98d-9fsqx                                  1/1     Running   0          15s
loki-canary-6qwwm                                        1/1     Running   0          15s
prometheus-dps-sdk-kube-prometheus-st-prometheus-0       2/2     Running   0          15s
prs-65f6f75b8f-qnmct                                     1/1     Running   0          15s

The dpsctl CLI provides its own command to verify that the DPS application is running properly:

dpsctl -H api.dps.sdk.local -p 443 --insecure-tls-skip-verify verify

A successful deployment should produce output similar to the following:

{
  "auth": {
    "details": "JWT secret key is configured",
    "healthy": true,
    "message": "Authentication service is configured with JWT",
    "name": "Authentication Service"
  },
  "bcm": {
    "message": "BCM check skipped - not configured",
    "name": "BCM",
    "skip_reason": "BCM not configured or not using BCM credentials store",
    "skipped": true
  },
  "database": {
    "details": "Database ping successful",
    "healthy": true,
    "message": "Database connectivity successful",
    "name": "Database"
  },
  "dps_server": {
    "details": "gRPC endpoint accessible, topology server responsive",
    "healthy": true,
    "message": "DPS Server is healthy",
    "name": "DPS Server"
  },
  "status": {
    "diag_msg": "All requested DPS components are healthy",
    "ok": true
  },
  "ui": {
    "details": "URL: http://ui.dps.sdk.local. Note: You may need to update your /etc/hosts file to resolve ui.dps.sdk.local to the cluster's IP address. The UI should be tested from an external browser.",
    "healthy": true,
    "message": "UI ingress endpoint retrieved successfully",
    "name": "DPS UI"
  }
}

Access services

  • DPS API: http://api.dps.sdk.local
  • DPS UI: http://ui.dps.sdk.local (testuser/testuser)
  • Grafana: http://grafana.dps.sdk.local (admin/dps)
  • Prometheus: http://prometheus.dps.sdk.local
  • AlertManager: http://alertmanager.dps.sdk.local
  • Pyroscope: http://pyroscope.dps.sdk.local

Key Concepts for Developers

Topology

The topology represents the physical datacenter power distribution:

  • Entities - Physical components (nodes, racks, PDUs, PSUs)
  • Relationships - Power distribution hierarchy
  • Baseline policies - Default power settings for each entity

The simulator includes a pre-configured topology with 96 emulated nodes across 8 racks.
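As a quick illustration, the emulated entity names can be enumerated in Python. The gb200-rXX-NNNN pattern below is inferred from the entity names used elsewhere in this guide, and per-rack node numbering is an assumption; verify the scheme against your deployed topology:

```python
# Enumerate entity names for the simulated topology: 96 nodes across 8 racks.
# ASSUMPTION: names follow the "gb200-r01-0001" pattern seen in this guide,
# with node numbering restarting in each rack -- check your actual topology.
NODES, RACKS = 96, 8
NODES_PER_RACK = NODES // RACKS  # 12

def entity_names():
    return [
        f"gb200-r{rack:02d}-{node:04d}"
        for rack in range(1, RACKS + 1)
        for node in range(1, NODES_PER_RACK + 1)
    ]

names = entity_names()
print(len(names), names[0], names[-1])  # 96 gb200-r01-0001 gb200-r08-0012
```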

Resource Groups

Resource groups are the primary abstraction for managing power policies for workloads. They follow a lifecycle:

  1. CREATE - Initialize an empty resource group with optional default policy
  2. ADD - Add compute nodes to the group
  3. ACTIVATE - Apply power policies to the hardware
  4. UPDATE (optional) - Dynamically adjust policies during execution
  5. DELETE - Clean up and restore topology defaults

Resource groups are ideal for:

  • Workload schedulers (SLURM, Kubernetes) - Map jobs to resource groups
  • Power optimization - Apply custom policies to workload collections
  • Temporary overrides - Override topology policies for specific workloads

Power Policies

Power policies define power constraints for hardware:

  • Node-level policies - Overall node power limits
  • GPU-level policies - Per-GPU power limits (watts or percentage)
  • CPU-level policies - CPU power constraints
  • Memory-level policies - Memory power constraints
  • Workload profiles - Hardware-specific optimizations for GPU workloads

The SDK includes pre-configured policies:

  • GB200-High: 5800W node, 5600W GPU
  • GB200-Med: 3400W node, 3200W GPU
  • GB200-Low: 1700W node, 1600W GPU

Dynamic Power Management (DPM)

DPS supports dynamic power optimization:

  • Power Reservation Steering (PRS) - Automatic power redistribution based on telemetry
  • Runtime adjustments - Update policies during workload execution
  • Telemetry-driven - React to actual power consumption patterns
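As a sketch of telemetry-driven adjustment (lifecycle step 4, UPDATE), the following pairs the pre-configured GB200 policies from the previous section with a simple threshold rule. The thresholds are illustrative, and the update call is an assumption modeled on the create/add/activate calls shown later in this guide; confirm the real method name against the Python client reference:

```python
# Pick one of the pre-configured GB200 policies based on measured node power.
# Thresholds are illustrative, derived from the policy node limits in this
# guide (GB200-Low: 1700W, GB200-Med: 3400W, GB200-High: 5800W).
def pick_policy(measured_node_watts: float) -> str:
    if measured_node_watts <= 1700:
        return "GB200-Low"
    if measured_node_watts <= 3400:
        return "GB200-Med"
    return "GB200-High"

def adjust(api, group_name: str, measured_node_watts: float) -> str:
    """Runtime policy adjustment during workload execution.

    ASSUMPTION: an update method analogous to create/add/activate exists
    on api.resource_groups -- check the Python client reference.
    """
    policy = pick_policy(measured_node_watts)
    api.resource_groups.update(group_name=group_name, policy_name=policy)
    return policy

print(pick_policy(1500))  # GB200-Low
```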

Common Integration Patterns

1. Workload Scheduler Integration - Integrate with workload schedulers such as SLURM using prolog/epilog scripts

2. Power Distribution Optimization - Optimize power distribution across topologies

3. Grid Integration - Implement demand response for grid events using the NvGrid API
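For pattern 1, a prolog/epilog pair can drive the resource group lifecycle from the scheduler. The sketch below uses the Python client calls shown later in this guide; the SLURM environment variable names and the comma-split node-list parsing are simplified assumptions (real SLURM node lists may need expansion with scontrol):

```python
import os

# Map a scheduler job to a DPS resource group, matching the "job_12345"
# naming convention used throughout this guide.
def group_name_for_job(job_id: str) -> str:
    return f"job_{job_id}"

def _connect():
    # Deferred import so the pure helpers above work without the SDK installed.
    from dpsapi import DpsApi
    return (
        DpsApi(dps_host="api.dps.sdk.local", dps_port=80,
               dps_insecure_tls_skip_verify=True)
        .with_username("testuser")
        .with_password("testuser")
    )

def prolog() -> None:
    """Run from a SLURM prolog: create, populate, and activate a group.

    ASSUMPTION: SLURM_JOB_ID / SLURM_JOB_NODELIST hold the job id and a
    comma-separated node list.
    """
    job_id = os.environ["SLURM_JOB_ID"]
    nodes = os.environ["SLURM_JOB_NODELIST"].split(",")
    group = group_name_for_job(job_id)

    api = _connect()
    api.resource_groups.create(external_id=int(job_id), group_name=group,
                               policy_name="GB200-High", prs_enabled=True)
    api.resource_groups.add(group_name=group, entity_names=nodes)
    api.resource_groups.activate(group_name=group)

def epilog() -> None:
    """Run from a SLURM epilog: delete the group, restoring topology defaults."""
    group = group_name_for_job(os.environ["SLURM_JOB_ID"])
    _connect().resource_groups.delete(group_name=group)
```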

Integration Options

DPS provides a gRPC API for integration, with pre-built client libraries for Go and Python.

Python Client

The Python Client provides a Pythonic interface to DPS:

from dpsapi import DpsApi

# Connect to DPS
api = (
    DpsApi(
        dps_host="api.dps.sdk.local",
        dps_port=80,
        dps_insecure_tls_skip_verify=True
    )
    .with_username("testuser")
    .with_password("testuser")
)

# Create resource group for workload
api.resource_groups.create(
    external_id=12345,
    group_name="job_12345",
    policy_name="GB200-High",
    prs_enabled=True
)

# Add compute nodes
api.resource_groups.add(
    group_name="job_12345",
    entity_names=["gb200-r01-0001", "gb200-r01-0002"]
)

# Activate with power policies
api.resource_groups.activate(group_name="job_12345")

# Query metrics
metrics = api.metrics.request_metrics([
    ("gb200-r01-0001", 0),  # node, gpu_id
    ("gb200-r01-0001", 1)
])

# Cleanup when done
api.resource_groups.delete(group_name="job_12345")

Complete Example: See the Power Distribution Optimization Demo for a full power optimization workflow demonstrating:

  • Power distribution algorithms across datacenter resources
  • Multi-level power monitoring (aggregate, node-level, and GPU-level metrics)
  • Resource group lifecycle with power policy management
  • Integration patterns for power optimization systems
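The multi-level monitoring above reduces to building (node, gpu_id) pairs for the request_metrics call shown earlier. A minimal helper; the GPUs-per-node count is deployment-specific and only a placeholder here:

```python
# Build the (node, gpu_id) request list for api.metrics.request_metrics,
# covering every GPU on every node in a resource group.
# ASSUMPTION: gpus_per_node varies by deployment; 4 is only a placeholder.
def metric_targets(nodes, gpus_per_node=4):
    return [(node, gpu) for node in nodes for gpu in range(gpus_per_node)]

targets = metric_targets(["gb200-r01-0001", "gb200-r01-0002"], gpus_per_node=2)
print(targets)
# [('gb200-r01-0001', 0), ('gb200-r01-0001', 1),
#  ('gb200-r01-0002', 0), ('gb200-r01-0002', 1)]

# Usage against a live SDK (requires the dpsapi client and a connected `api`):
#     metrics = api.metrics.request_metrics(metric_targets(nodes))
```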

Go gRPC Client

For Go applications, use the gRPC client directly:

package main

import (
    "context"
    "log"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    api "nvidia.com/NVIDIA/dcpower/api/v1"
)

func main() {
    // Connect to DPS (plaintext transport; use TLS credentials in production)
    conn, err := grpc.NewClient("api.dps.sdk.local:50051", grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatalf("Failed to connect: %v", err)
    }
    defer conn.Close()

    // Create resource group client
    client := api.NewResourceGroupManagementServiceClient(conn)

    // Create resource group
    createReq := &api.ResourceGroupCreateRequest{
        ExternalId: 12345,
        GroupName:  "job_12345",
        PolicyName: "GB200-High",
        PrsEnabled: true,
        DpmEnable:  true,
    }

    _, err = client.ResourceGroupCreate(context.Background(), createReq)
    if err != nil {
        log.Fatalf("Failed to create resource group: %v", err)
    }

    // Add entities
    addReq := &api.ResourceGroupAddRequest{
        GroupName:   "job_12345",
        EntityNames: []string{"gb200-r01-0001", "gb200-r01-0002"},
    }

    _, err = client.ResourceGroupAddEntities(context.Background(), addReq)
    if err != nil {
        log.Fatalf("Failed to add entities: %v", err)
    }

    // Activate
    activateReq := &api.ResourceGroupActivateRequest{
        GroupName: "job_12345",
    }

    _, err = client.ResourceGroupActivate(context.Background(), activateReq)
    if err != nil {
        log.Fatalf("Failed to activate: %v", err)
    }

    log.Println("Resource group created and activated successfully")
}

Complete Example: See the Grid Integrator guide for complete integration examples demonstrating:

  • NvGrid API client for grid integration
  • Load target scheduling and management
  • Webhook event handling
  • Power feed metadata queries

Using the API

The API reference can be found here.

gRPC API

For any language, use the gRPC API via grpcurl:

# Get Token
grpcurl -insecure -d '{
  "passwordCredential": {
    "username": "testuser",
    "password": "testuser"
  }
}' \
  api.dps.sdk.local:443 \
  nvidia.dcpower.v1.AuthService/Token

# Use the token to call APIs
export DPS_ACCESS_TOKEN="<token-from-above>"

# List resource groups
grpcurl -insecure -H "authorization: Bearer $DPS_ACCESS_TOKEN" \
  api.dps.sdk.local:443 \
  nvidia.dcpower.v1.ResourceGroupManagementService/ResourceGroupList


# Create resource group
grpcurl -insecure -H "authorization: Bearer $DPS_ACCESS_TOKEN" \
-d '{
    "external_id": 12345,
    "group_name": "job_12345",
    "policy_name": "GB200-High"
    }' \
  api.dps.sdk.local:443 \
  nvidia.dcpower.v1.ResourceGroupManagementService/ResourceGroupCreate

# Add entities
grpcurl -insecure -H "authorization: Bearer $DPS_ACCESS_TOKEN" \
-d '{
    "group_name": "job_12345",
    "resource_names": ["gb200-r01-0001", "gb200-r01-0002"]
    }' \
  api.dps.sdk.local:443 \
  nvidia.dcpower.v1.ResourceGroupManagementService/ResourceGroupAddResources

# Activate
grpcurl -insecure -H "authorization: Bearer $DPS_ACCESS_TOKEN" \
-d '{"group_name": "job_12345"}' \
  api.dps.sdk.local:443 \
  nvidia.dcpower.v1.ResourceGroupManagementService/ResourceGroupActivate

# Delete
grpcurl -insecure -H "authorization: Bearer $DPS_ACCESS_TOKEN" \
-d '{"group_name": "job_12345"}' \
  api.dps.sdk.local:443 \
  nvidia.dcpower.v1.ResourceGroupManagementService/ResourceGroupDelete

CLI (dpsctl)

For scripting and manual operations:

# Create resource group
dpsctl -H api.dps.sdk.local -p 443 --insecure-tls-skip-verify rg create \
  --external-id 12345 \
  --resource-group "job_12345" \
  --policy "GB200-High"

# Add entities
dpsctl -H api.dps.sdk.local -p 443 --insecure-tls-skip-verify rg add \
  --resource-group "job_12345" \
  --entities "gb200-r01-0001,gb200-r01-0002"

# Activate
dpsctl -H api.dps.sdk.local -p 443 --insecure-tls-skip-verify rg activate \
  --resource-group "job_12345"

# List
dpsctl -H api.dps.sdk.local -p 443 --insecure-tls-skip-verify rg list

# Delete
dpsctl -H api.dps.sdk.local -p 443 --insecure-tls-skip-verify rg delete \
  --resource-group "job_12345"

Available APIs

DPS provides multiple gRPC services:

  • ResourceGroupManagementService - Resource group lifecycle and management
  • TopologyManagementService - Topology import, activation, and queries
  • MetricsManagementService - Power telemetry and metrics
  • PolicyManagementService - Power policy management
  • NvGridService - Grid integration and load management
  • AuthService - Authentication and authorization
  • VersionService - Version information

Development Workflow

Local Development with DPS SDK

# Starts k3s and deploys SDK
task deploy:sdk

# Initialize simulator (optional)
task sim

# Run your integration code
python my_integration.py

# Monitor with Grafana
open http://grafana.dps.sdk.local  # admin/dps

# View logs
kubectl logs -n dps -l app=dps-server -f

# Cleanup (tears down SDK and uninstalls k3s)
task sdk:down

Testing Your Integration

Use the simulator to test integration scenarios:

# Run automated resource group simulation
task sim:rgs

# Custom simulation parameters
task sim:rgs MAX_RGS=10 MIN_DURATION=60 MAX_DURATION=180

# Monitor metrics in Grafana while testing
# - DPS Server Dashboard
# - DPS SDK Datacenter dashboard
# - DPS SDK Resource Groups dashboard

Troubleshooting

Common Issues

Connection refused:

# Verify DPS is running
kubectl get pods -n dps
kubectl get svc -n dps
dpsctl -H api.dps.sdk.local -p 443 --insecure-tls-skip-verify verify

# Check /etc/hosts entries
grep dps.sdk.local /etc/hosts

Authentication errors:

# Login to the server
dpsctl -H api.dps.sdk.local -p 443 --insecure-tls-skip-verify login

Resource group activation fails:

# View server logs
kubectl logs -n dps -l app=dps-server --tail=100

Additional Resources