SDK Simulator Playbooks
This guide walks through the SDK’s playbooks for exercising Domain Power Service (DPS). The Default-profile playbook is the recommended starting point - it uses dpsctl against a blank DPS environment so you can learn the API step by step. The Hardware Emulation playbooks automate resource group, grid, and load shedding scenarios against the 144-node GB300 topology.
Overview
Playbooks fall into two categories:
- Default-profile playbook: Manual dpsctl exercises against a blank DPS environment. Ideal for learning the API, experimenting with topologies, and running controlled demos.
- Hardware Emulation playbooks: Automated shell scripts (sim/sim_rgs.sh, sim/sim_grid.sh, sim/sim_loadshed.sh) that drive the full 144-node GB300 topology. Ideal for soak testing, CI runs, and reproducing realistic workload churn.
For workload-aware playbooks that generate realistic GPU power traces, see the Workload-Aware Simulation guide.
Default-Profile Playbook
The Default profile deploys DPS with no topology loaded and no simulators running. Use it to explore the DPS API, practice dpsctl commands, or run step-by-step demos against a live DPS server.
Deploy the Default Environment
task setup
task deploy

task setup installs SDK dependencies (docker, kubectl, k3d, helm, helm-git, uv, dpsctl). task deploy creates a k3d cluster and deploys DPS with default chart values - a blank environment with no topology or simulators.
Configure Your Shell
After task deploy completes, paste these exports into your shell once. They configure dpsctl to connect to the local SDK endpoint so you can run commands without repeating connection flags:
export DPSCTL_HOST=api.dps.sdk
export DPSCTL_PORT=80
export DPSCTL_INSECURE_TLS_SKIP_VERIFY=true

Load a Topology
The Default environment starts with no data. Register the custom device types used by the topology, then import and activate it:
# Register custom device definitions (required before topology import)
dpsctl device upsert sim/custom-devices.yaml
# Import and activate the topology
dpsctl tp import sim/topology.json
dpsctl tp activate --topology dps-simulator

Verify the topology is active:
dpsctl tp list
dpsctl tp list-entities

Create Resource Groups
A resource group represents an active workload. Its policy sets the power cap for every node in the group. The default topology provides three policies: GB300-High (5600 W), GB300-Med (3200 W), and GB300-Low (1600 W).
# Create a resource group
dpsctl rg create --external-id 1 --resource-group my-workload --policy GB300-High
# Add nodes from the topology
dpsctl rg add --resource-group my-workload --entities "gb300-r01-0001,gb300-r01-0002,gb300-r01-0003"
# Activate - DPS enforces the power cap on all nodes in the group
dpsctl rg activate --resource-group my-workload --sync
# List all resource groups
dpsctl rg list
# Delete when done
dpsctl rg delete --resource-group my-workload

Three-Tier Demo
Create, add, and activate three tiers in one sequence to mimic a multi-tenant datacenter or multiple workloads:
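Typing the long node lists below by hand is error-prone. Because the default topology uses a regular gb300-rNN-NNNN naming scheme with 18 nodes per rack, the lists can be generated instead; gen_nodes here is a helper of my own for illustration, not part of the SDK:

```shell
# Hypothetical helper: emit the comma-separated node IDs for a rack range,
# assuming the default gb300-rNN-NNNN scheme with 18 nodes per rack.
gen_nodes() {
  local first_rack=$1 last_rack=$2
  local ids=()
  for rack in $(seq "$first_rack" "$last_rack"); do
    for node in $(seq 1 18); do
      ids+=("$(printf 'gb300-r%02d-%04d' "$rack" "$node")")
    done
  done
  # Join the array with commas
  (IFS=,; echo "${ids[*]}")
}

# Racks 1-2 (36 nodes), as used for premium-tier below:
gen_nodes 1 2
```

The output can be passed directly to dpsctl rg add via --entities "$(gen_nodes 1 2)".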
# premium-tier - GB300-High (5600 W), Racks 1-2 (36 nodes)
dpsctl rg create --external-id 1 --resource-group premium-tier --policy GB300-High --priority 0
dpsctl rg add --resource-group premium-tier --entities "gb300-r01-0001,gb300-r01-0002,gb300-r01-0003,gb300-r01-0004,gb300-r01-0005,gb300-r01-0006,gb300-r01-0007,gb300-r01-0008,gb300-r01-0009,gb300-r01-0010,gb300-r01-0011,gb300-r01-0012,gb300-r01-0013,gb300-r01-0014,gb300-r01-0015,gb300-r01-0016,gb300-r01-0017,gb300-r01-0018,gb300-r02-0001,gb300-r02-0002,gb300-r02-0003,gb300-r02-0004,gb300-r02-0005,gb300-r02-0006,gb300-r02-0007,gb300-r02-0008,gb300-r02-0009,gb300-r02-0010,gb300-r02-0011,gb300-r02-0012,gb300-r02-0013,gb300-r02-0014,gb300-r02-0015,gb300-r02-0016,gb300-r02-0017,gb300-r02-0018"
dpsctl rg activate --resource-group premium-tier --sync
# value-tier - GB300-Med (3200 W), Racks 3-4 (36 nodes)
dpsctl rg create --external-id 2 --resource-group value-tier --policy GB300-Med --priority 1
dpsctl rg add --resource-group value-tier --entities "gb300-r03-0001,gb300-r03-0002,gb300-r03-0003,gb300-r03-0004,gb300-r03-0005,gb300-r03-0006,gb300-r03-0007,gb300-r03-0008,gb300-r03-0009,gb300-r03-0010,gb300-r03-0011,gb300-r03-0012,gb300-r03-0013,gb300-r03-0014,gb300-r03-0015,gb300-r03-0016,gb300-r03-0017,gb300-r03-0018,gb300-r04-0001,gb300-r04-0002,gb300-r04-0003,gb300-r04-0004,gb300-r04-0005,gb300-r04-0006,gb300-r04-0007,gb300-r04-0008,gb300-r04-0009,gb300-r04-0010,gb300-r04-0011,gb300-r04-0012,gb300-r04-0013,gb300-r04-0014,gb300-r04-0015,gb300-r04-0016,gb300-r04-0017,gb300-r04-0018"
dpsctl rg activate --resource-group value-tier --sync
# research - GB300-Low (1600 W), Racks 5-8 (72 nodes)
dpsctl rg create --external-id 3 --resource-group research --policy GB300-Low --priority 2
dpsctl rg add --resource-group research --entities "gb300-r05-0001,gb300-r05-0002,gb300-r05-0003,gb300-r05-0004,gb300-r05-0005,gb300-r05-0006,gb300-r05-0007,gb300-r05-0008,gb300-r05-0009,gb300-r05-0010,gb300-r05-0011,gb300-r05-0012,gb300-r05-0013,gb300-r05-0014,gb300-r05-0015,gb300-r05-0016,gb300-r05-0017,gb300-r05-0018,gb300-r06-0001,gb300-r06-0002,gb300-r06-0003,gb300-r06-0004,gb300-r06-0005,gb300-r06-0006,gb300-r06-0007,gb300-r06-0008,gb300-r06-0009,gb300-r06-0010,gb300-r06-0011,gb300-r06-0012,gb300-r06-0013,gb300-r06-0014,gb300-r06-0015,gb300-r06-0016,gb300-r06-0017,gb300-r06-0018,gb300-r07-0001,gb300-r07-0002,gb300-r07-0003,gb300-r07-0004,gb300-r07-0005,gb300-r07-0006,gb300-r07-0007,gb300-r07-0008,gb300-r07-0009,gb300-r07-0010,gb300-r07-0011,gb300-r07-0012,gb300-r07-0013,gb300-r07-0014,gb300-r07-0015,gb300-r07-0016,gb300-r07-0017,gb300-r07-0018,gb300-r08-0001,gb300-r08-0002,gb300-r08-0003,gb300-r08-0004,gb300-r08-0005,gb300-r08-0006,gb300-r08-0007,gb300-r08-0008,gb300-r08-0009,gb300-r08-0010,gb300-r08-0011,gb300-r08-0012,gb300-r08-0013,gb300-r08-0014,gb300-r08-0015,gb300-r08-0016,gb300-r08-0017,gb300-r08-0018"
dpsctl rg activate --resource-group research --sync

Set Grid Load Targets
A grid load target tells DPS that the datacenter power feed is constrained. Each set-load-target call immediately overrides the previous one. Run this sequence after creating resource groups to observe DPS reacting to demand-response events in real time.
These commands use GNU date syntax (-d '+1 minute'), available natively on Linux and on macOS after running task setup (which installs GNU coreutils via Homebrew). The second command below also carries a BSD date fallback (-v+1M) in case GNU date is unavailable.
# 1. Set an initial constraint - 700 kW, active until overridden
dpsctl nvgrid set-load-target --value 700000 --unit watt --feed-tags Utility --start-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --best-effort
# Check active load targets
dpsctl nvgrid get-current
# 2. Reduce further - 500 kW for one minute, then expires automatically
dpsctl nvgrid set-load-target --value 500000 --unit watt --feed-tags Utility --start-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --end-time "$(date -u -d '+1 minute' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v+1M +%Y-%m-%dT%H:%M:%SZ)" --best-effort
# Check active load targets
dpsctl nvgrid get-current
# 3. Raise the constraint back up - 750 kW, active until overridden
dpsctl nvgrid set-load-target --value 750000 --unit watt --feed-tags Utility --start-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --best-effort

Open the Web UI at http://ui.dps.sdk (auth: dps/dps) or Grafana at http://grafana.dps.sdk (auth: admin/dps) while running the sequence above to watch DPS respond to each constraint change in real time.
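The inline GNU/BSD date fallback in step 2 can be factored into a small helper so each command stays readable. iso_utc_in is a name chosen here for illustration, not an SDK function:

```shell
# Print a UTC ISO-8601 timestamp N minutes from now.
# Tries GNU date (-d) first, then falls back to BSD date (-v).
iso_utc_in() {
  local minutes=$1
  date -u -d "+${minutes} minute" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null \
    || date -u -v+"${minutes}"M +%Y-%m-%dT%H:%M:%SZ
}

iso_utc_in 1
```

With it, the one-minute constraint in step 2 can be written as --end-time "$(iso_utc_in 1)".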
Hardware Emulation Playbooks
The Hardware Emulation profile provides a complete emulated datacenter environment using pseudorandom hardware responses based on device characteristics. Deploy it before running the playbooks below:
task sdk
task sim

task sdk creates the k3d cluster and deploys DPS. task sim imports the topology, runs the combined resource-groups plus grid simulation, and keeps driving load until you stop it with Ctrl+C.
Single-Script Playbooks
Resource Group Simulation
Description: Automated creation and deletion of resource groups with configurable parameters. Simulates realistic workload lifecycle patterns with randomized power management features.
The task sim:rgs target runs the sim/sim_rgs.sh script, which handles the full lifecycle of resource group management.
Script Dependencies:
- dpsctl - DPS command-line interface
- jq - JSON parsing
Quick Start:
# Start basic simulation
task sim:rgs
# Quick test simulation
task sim:rgs END_AFTER=120 MAX_RGS=5 MIN_DURATION=30 MAX_DURATION=60
# Extended simulation
task sim:rgs END_AFTER=1800 MAX_RGS=20 MIN_DURATION=120 MAX_DURATION=300
# High-intensity simulation
task sim:rgs MAX_RGS=25 MIN_RG_SIZE=5 MAX_RG_SIZE=15

Parameters:
Core parameters:
- MAX_RGS: Maximum number of resource groups (default: 15)
- MIN_DURATION: Minimum RG lifecycle duration in seconds (default: 60)
- MAX_DURATION: Maximum RG lifecycle duration in seconds (default: 300)
- MIN_RG_SIZE: Minimum resources per RG (default: 2)
- MAX_RG_SIZE: Maximum resources per RG (default: 10)
- END_AFTER: End simulation after specified seconds (default: 600)
Feature randomization flags:
- NO_WPPS: Disable workload profile (WPPS) randomization (default: false)
- NO_PRS: Disable PRS randomization (default: false)
- NO_DPM: Disable DPM randomization (default: false)
- NO_ALLOW_REPROVISION: Disable allow-reprovision randomization (default: false)
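These parameters are plain environment variables passed through task. A script typically applies the documented defaults with the ${VAR:-default} idiom; the following is a sketch of that pattern, not the actual sim_rgs.sh source:

```shell
# Apply documented defaults when a tunable is not set in the environment
# (sketch of the common bash pattern; not the real sim_rgs.sh).
MAX_RGS="${MAX_RGS:-15}"
MIN_DURATION="${MIN_DURATION:-60}"
MAX_DURATION="${MAX_DURATION:-300}"
MIN_RG_SIZE="${MIN_RG_SIZE:-2}"
MAX_RG_SIZE="${MAX_RG_SIZE:-10}"
END_AFTER="${END_AFTER:-600}"
NO_WPPS="${NO_WPPS:-false}"

echo "running for ${END_AFTER}s with up to ${MAX_RGS} RGs"
```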
What It Does:
1. Initialization Phase:
   - Fetches available resources from topology
   - Loads available power policies from DPS server
   - Can optionally clean up existing resource groups
2. Simulation Loop:
   - Creates resource groups up to the MAX_RGS limit
   - Randomly selects resources (between MIN_RG_SIZE and MAX_RG_SIZE)
   - Randomly selects a power policy from available policies
   - Randomly enables or disables optional features (WPPS, PRS, DPM, allow-reprovision)
   - Tracks resource allocation to prevent conflicts
   - Automatically deletes RGs after their lifecycle duration expires
3. Resource Group Lifecycle:
   - Create: Creates RG with external ID and selected policy
   - Add: Adds selected resources to the RG
   - Activate: Activates the RG with optional features
   - Delete: Removes RG after duration expires
Script Implementation Details:
- Maintains associative arrays for active RGs and resource usage
- Prevents resource conflicts by tracking which resources are in use
- Generates unique RG IDs in format: sim-rg-XXXX (e.g., sim-rg-0001)
- Uses unique external IDs starting from 1000
- Default check interval: 10 seconds
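The naming and sizing rules above can be sketched in a few lines of bash. This is illustrative only; the real logic lives in sim/sim_rgs.sh and the variable names here are my own:

```shell
# Illustrative sketch of the RG naming and sizing rules.
MIN_RG_SIZE=2
MAX_RG_SIZE=10

counter=1                                  # increments for each created RG
rg_id=$(printf 'sim-rg-%04d' "$counter")   # unique ID in sim-rg-XXXX format
external_id=$(( 1000 + counter - 1 ))      # external IDs start at 1000

# Random group size between MIN_RG_SIZE and MAX_RG_SIZE inclusive
size=$(( MIN_RG_SIZE + RANDOM % (MAX_RG_SIZE - MIN_RG_SIZE + 1) ))

echo "$rg_id (external id $external_id) -> $size resources"
```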
Feature Randomization:
When enabled (default), the script randomly applies these features:
- WPPS (Workload Profile): Randomly selects 1-2 profiles from IDs 3-15
- PRS (Power Regulation Service): Randomly enables or disables
- DPM (Dynamic Power Management): Randomly enables or disables
- Allow Reprovision: Randomly enables or disables during activation
To disable specific features:
# Disable WPPS and PRS randomization
task sim:rgs NO_WPPS=true NO_PRS=true
# Disable only allow-reprovision randomization
task sim:rgs NO_ALLOW_REPROVISION=true
# Disable all randomization features
task sim:rgs NO_WPPS=true NO_PRS=true NO_DPM=true NO_ALLOW_REPROVISION=true

Grid Simulation
Description: Simulates nvgrid load targets for testing domain-level power management and grid integration. Schedules multiple concurrent load targets across power domains.
The task sim:grid target runs the sim/sim_grid.sh script, which manages nvgrid load target scheduling across power domains.
Script Dependencies:
- dpsctl - DPS command-line interface
- jq - JSON parsing
Quick Start:
# Start grid simulation with defaults
task sim:grid
# Run longer simulation (1 hour)
task sim:grid END_AFTER=3600
# Quick test with higher load range
task sim:grid END_AFTER=600 MIN_LOAD_PERCENT=80 MAX_LOAD_PERCENT=95
# Custom check interval
task sim:grid INTERVAL=120 MIN_DURATION=60 MAX_DURATION=300

Parameters:
- MAX_RGS: Maximum number of resource groups (default: 15)
- MIN_DURATION: Minimum RG lifecycle duration in seconds (default: 120)
- MAX_DURATION: Maximum RG lifecycle duration in seconds (default: 600)
- MIN_RG_SIZE: Minimum resources per RG (default: 2)
- MAX_RG_SIZE: Maximum resources per RG (default: 10)
- END_AFTER: End simulation after specified seconds (default: 1800)
- INTERVAL: Check interval in seconds (default: 300)
- MIN_LOAD_PERCENT: Minimum load as percentage of domain capacity (default: 40)
- MAX_LOAD_PERCENT: Maximum load as percentage of domain capacity (default: 90)
- NO_WPPS: Disable workload profile (WPPS) randomization (default: false)
- NO_PRS: Disable PRS randomization (default: false)
- NO_DPM: Disable DPM randomization (default: false)
- NO_ALLOW_REPROVISION: Disable allow-reprovision randomization (default: false)
What It Does:
1. Initialization Phase:
   - Fetches topology and identifies all power domains
   - Retrieves capacity limits for each power domain
   - Displays current load targets before starting
2. Simulation Loop:
   - Every INTERVAL seconds, checks active targets
   - Schedules 2-5 new random grid events if under capacity limit (max 5 concurrent)
   - Each event targets a random power domain
   - Load values are randomized within specified percentage range
   - Duration is randomized between min and max values
   - Events are scheduled with random gaps (2-30 minutes) between them
3. Event Scheduling:
   - Events can be scheduled on the same domain (non-conflicting times)
   - Automatically adjusts start times to avoid conflicts
   - Maintains at most 5 active targets at any time
   - Tracks earliest start and latest end times per domain
4. Cleanup:
   - On interruption (Ctrl+C), expires all scheduled events
   - Uses bulk expiration by setting load to 0 W for the scheduled timeframe
Implementation Details:
- Maintains associative arrays for power domains, active targets, and domain timelines
- Generates unique target IDs in format: domain:counter
- Cleans up expired targets automatically before scheduling new ones
- Supports multiple load targets on the same domain with time separation
Scheduling Strategy:
- Minimum events per interval: 2
- Maximum events per interval: 5
- Maximum concurrent active targets: 5
- Event gap range: 120-1800 seconds (2-30 minutes)
- Adds 30-second buffer between events on same domain
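The strategy above reduces to a little arithmetic. The sketch below is not the script's actual code (names are my own); it just shows the concurrency cap and gap randomization:

```shell
# Illustrative arithmetic behind the grid scheduling strategy.
MAX_ACTIVE=5          # maximum concurrent active targets
GAP_MIN=120           # minimum inter-event gap in seconds (2 min)
GAP_MAX=1800          # maximum inter-event gap in seconds (30 min)
SAME_DOMAIN_BUFFER=30 # extra buffer between events on one domain

active=3                                 # pretend three targets are live
to_schedule=$(( MAX_ACTIVE - active ))   # never exceed the concurrency cap

# Random gap in the 2-30 minute range
gap=$(( GAP_MIN + RANDOM % (GAP_MAX - GAP_MIN + 1) ))
next_start=$(( $(date +%s) + gap + SAME_DOMAIN_BUFFER ))

echo "schedule $to_schedule targets; next in $(( gap + SAME_DOMAIN_BUFFER ))s"
```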
Load Shedding Simulation
Description: Simulates a single load shedding and reactivation cycle for testing power reduction scenarios and recovery. All events are scheduled upfront, making it ideal for testing system response to planned power constraints.
The task sim:loadshed (alias: task sim:ls) target runs the sim/sim_loadshed.sh script, which schedules and monitors a complete load shedding cycle.
Script Dependencies:
- dpsctl - DPS command-line interface
- jq - JSON parsing
- bc - Floating-point calculations
Quick Start:
# Start load shedding simulation with defaults
# (100% to 5% in 5% steps, hold 5 min, reactivate to 100%)
task sim:loadshed
# Quick test with faster intervals
task sim:loadshed SHED_INTERVAL=30 HOLD_TIME=120
# Shed to 10% in 10% increments
task sim:loadshed SHED_INCREMENT=10 MIN_LOAD_PERCENT=10
# Aggressive load shedding (20% steps down to 20%)
task sim:loadshed SHED_INCREMENT=20 MIN_LOAD_PERCENT=20
# Target specific power domain
task sim:loadshed FEED_TAG=PDU-GB200-1
# Use short alias
task sim:ls SHED_INCREMENT=10 MIN_LOAD_PERCENT=10

Parameters:
- SHED_INTERVAL: Time between load shedding steps in seconds (default: 60)
- SHED_INCREMENT: Load percentage increment for steps (default: 5)
- HOLD_TIME: Time to hold at minimum load in seconds (default: 300)
- MIN_LOAD_PERCENT: Minimum load percentage to shed down to (default: 5)
- FEED_TAG: Specific feed tag (power domain) to target (default: auto-select first domain)
What It Does:
Phase 1: Load Shedding
- Reduces load from 100% to MIN_LOAD_PERCENT in SHED_INCREMENT steps
- Each step lasts for SHED_INTERVAL seconds
- Example: 100% to 95% to 90% to … to 5% (with default 5% increment)
Phase 2: Hold
- Maintains minimum load for HOLD_TIME seconds
- Tests system stability under sustained low load
Phase 3: Reactivation
- Increases load from MIN_LOAD_PERCENT back to 100% in SHED_INCREMENT steps
- Each step lasts for SHED_INTERVAL seconds
- Example: 10% to 15% to 20% to … to 100% (with 5% increment from 10% min)
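The three phases produce a simple percent sequence. This sketch builds it with the default parameters; it is illustrative only and not the script itself, which schedules real load targets instead of printing:

```shell
# Build the phase sequence with the default parameters (sketch).
SHED_INCREMENT=5
MIN_LOAD_PERCENT=5

steps=()
# Phase 1: shed from 100% down to the minimum
for (( p = 100; p >= MIN_LOAD_PERCENT; p -= SHED_INCREMENT )); do
  steps+=("$p")
done
# Phase 2: one hold event at the minimum
steps+=("hold@${MIN_LOAD_PERCENT}")
# Phase 3: reactivate from min + increment back up to 100%
for (( p = MIN_LOAD_PERCENT + SHED_INCREMENT; p <= 100; p += SHED_INCREMENT )); do
  steps+=("$p")
done

echo "${#steps[@]} events: ${steps[*]}"
```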
Implementation Details:
1. Domain Selection:
   - If no FEED_TAG is specified, auto-selects the first power domain
   - Retrieves domain capacity and power factor from topology
   - Calculates maximum load as capacity x power_factor
2. Event Scheduling:
   - All events are pre-scheduled at simulation start
   - Calculates total simulation duration upfront
   - Each load target is scheduled with precise start and end times
   - No dynamic adjustments during execution
3. Calculation Example (with defaults: 5% increment, 5% min load, 60s interval, 300s hold):
   - Shedding steps: 100%, 95%, 90%, …, 5% = 20 events
   - Hold: 1 event
   - Reactivation steps: 10%, 15%, 20%, …, 100% = 19 events
   - Total events: 40 events
   - Total duration: (20 x 60) + 300 + (19 x 60) = 2,640 seconds (~44 minutes)
4. Cleanup:
   - On interruption (Ctrl+C), cancels all scheduled events
   - Sets load to 0 W for the entire simulation timeframe
   - Ensures clean exit with no lingering targets
Behavior Notes:
- The simulator schedules all events upfront and then monitors progress
- Actual power adjustments happen automatically per the schedule
- The script does not dynamically adjust based on system response
- Power factor from topology is automatically applied to capacity calculations
Multi-Script Playbooks
Combined Simulation (Resource Groups and Grid)
Description: Runs both resource group and grid simulations simultaneously for thorough testing of workload management and power domain optimization.
Script Location: Uses both sim/sim_rgs.sh and sim/sim_grid.sh.
Quick Start:
# Start combined simulation with defaults
task sim
# Run longer simulation (1 hour)
task sim END_AFTER=3600
# Quick test with higher load range
task sim END_AFTER=600 MIN_LOAD_PERCENT=80 MAX_LOAD_PERCENT=95

Parameters:
Resource Group parameters:
- MAX_RGS: Maximum number of resource groups (default: 15)
- MIN_DURATION: Minimum RG lifecycle duration in seconds (default: 60)
- MAX_DURATION: Maximum RG lifecycle duration in seconds (default: 300)
- MIN_RG_SIZE: Minimum resources per RG (default: 2)
- MAX_RG_SIZE: Maximum resources per RG (default: 10)
- END_AFTER: End simulation after specified seconds (default: 600)
Grid simulation parameters:
- INTERVAL: Check interval in seconds (default: 300)
- MIN_LOAD_PERCENT: Minimum load as percentage of domain capacity (default: 40)
- MAX_LOAD_PERCENT: Maximum load as percentage of domain capacity (default: 90)
What It Does:
- Continuously creates and deletes resource groups based on configured lifecycle parameters
- Schedules 2-5 random grid load targets across power domains every interval
- Tests both workload churn and power optimization simultaneously
- Monitors resource utilization and power distribution
Implementation Details:
- Runs two simulator processes in parallel
- Automatically handles cleanup on interruption (Ctrl+C)
- Both simulators share the same END_AFTER duration for synchronized termination
- Resource groups and grid targets operate independently
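The parallel-run pattern can be sketched with stand-in commands. Everything here is illustrative; the real target launches sim/sim_rgs.sh and sim/sim_grid.sh rather than the toy run_sim function:

```shell
# Two background workers sharing one duration, with joint cleanup (sketch).
duration=1   # stands in for the shared END_AFTER (shortened for the demo)

run_sim() { sleep "$duration"; echo "$1 done"; }   # stand-in simulator

run_sim rgs  & rgs_pid=$!
run_sim grid & grid_pid=$!

# Interruption kills both workers so neither outlives the other
cleanup() { kill "$rgs_pid" "$grid_pid" 2>/dev/null; }
trap cleanup INT TERM

wait "$rgs_pid" "$grid_pid"
echo "both simulators finished"
```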
Load Shedding With Resource Groups
Description: Runs both load shedding and resource groups simulators in parallel, providing thorough testing of system behavior under combined workload churn and power constraints.
Script Location: Uses both sim/sim_loadshed.sh and sim/sim_rgs.sh.
Quick Start:
# Run with defaults
task sim:loadshed-rgs
# Faster load shedding with aggressive RG spawning
task sim:loadshed-rgs SHED_INTERVAL=30 MAX_RGS=25
# Custom configuration for both simulators
task sim:loadshed-rgs SHED_INCREMENT=10 MIN_LOAD_PERCENT=10 MAX_RGS=20 END_AFTER=2000
# Use short alias
task sim:ls-rgs SHED_INCREMENT=10 MAX_RGS=25

Parameters:
Load Shedding:
- SHED_INTERVAL: Time between load shedding steps in seconds (default: 60)
- SHED_INCREMENT: Load percentage increment for steps (default: 5)
- HOLD_TIME: Time to hold at minimum load in seconds (default: 300)
- MIN_LOAD_PERCENT: Minimum load percentage to shed down to (default: 5)
- FEED_TAG: Specific feed tag (power domain) to target (default: auto-select)
Resource Groups:
- MIN_DURATION: Minimum duration for RG lifecycle in seconds (default: 60)
- MAX_DURATION: Maximum duration for RG lifecycle in seconds (default: 180)
- END_AFTER: End RG simulation after specified seconds (default: 2640 - matches load shed duration)
- MAX_RGS: Maximum number of RGs to spawn (default: 15)
- MIN_RG_SIZE: Minimum resources per RG (default: 2)
- MAX_RG_SIZE: Maximum resources per RG (default: 10)
What It Does:
1. Parallel Execution:
   - Starts load shedding simulator in background
   - Starts resource group simulator in background
   - Both run simultaneously and independently
2. Load Shedding Component:
   - Executes full load shedding cycle as described in Load Shedding Simulation
   - Schedules all power reduction and reactivation events upfront
   - Runs for calculated duration based on shed parameters
3. Resource Groups Component:
   - Continuously creates and deletes resource groups as described in Resource Group Simulation
   - Runs until END_AFTER duration (defaults to match load shed duration)
   - Maintains up to MAX_RGS concurrent resource groups
4. Synchronized Termination:
   - By default, both simulators end at approximately the same time
   - END_AFTER for RGs is set to match load shedding total duration (2640s with defaults)
   - Both simulators clean up gracefully on completion or interruption
5. Cleanup:
   - Interrupt signal (Ctrl+C) triggers cleanup for both processes
   - Kills both simulator PIDs and waits for clean exit
   - Prevents recursive cleanup with guard variable
Implementation Details:
- Uses bash process management with background jobs
- Trap handles INT, TERM, HUP, QUIT, ABRT, ALRM, USR1, USR2 signals
- Cleanup function ensures both processes terminate together
- Exit code 130 (128 + SIGINT) on manual interruption
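The guarded cleanup described above is a common bash pattern; here is a minimal sketch with names of my own choosing (the real kill/wait logic is commented out since the PIDs only exist in the actual script):

```shell
# Guarded cleanup handler (sketch of the pattern, not the real script).
cleaned_up=0

cleanup() {
  # Guard variable prevents recursive cleanup if another signal arrives
  [ "$cleaned_up" -eq 1 ] && return
  cleaned_up=1
  echo "cleaning up simulators"
  # kill "$loadshed_pid" "$rgs_pid" 2>/dev/null; wait   # real scripts do this
}

on_interrupt() { cleanup; exit 130; }   # 130 = 128 + SIGINT(2)

trap on_interrupt INT TERM HUP QUIT ABRT ALRM USR1 USR2

cleanup   # normal-completion path runs cleanup once
cleanup   # a second call is a no-op thanks to the guard
```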
Timing Coordination:
With default values:
- Load shedding duration: ~2640 seconds (44 minutes)
- Resource groups END_AFTER: 2640 seconds (synchronized)
- Both simulators complete at approximately the same time
Cleanup
Interrupt simulators cleanly:
- Press Ctrl+C to trigger cleanup handlers
- Scripts automatically clean up created resources
- Wait for cleanup to complete before restarting
Summary
The SDK provides a complete set of playbooks for testing datacenter power management scenarios:
- Default-profile playbook: Manual dpsctl walkthrough for learning the API, loading the topology, creating resource groups, and setting grid load targets.
- Hardware Emulation single-script playbooks: Resource Group Simulation, Grid Simulation, and Load Shedding Simulation focus on one aspect each.
- Hardware Emulation multi-script playbooks: Combined Simulation and Load Shedding With Resource Groups combine multiple simulators for more complex scenarios.
Start with the Default-profile playbook to understand DPS concepts, then move to the Hardware Emulation playbooks for automated, long-running scenarios. For realistic GPU power traces driven by a workload model, see the Workload-Aware Simulation guide.