SDK Simulator Playbooks
This document provides complete documentation for all available playbooks in the SDK Simulator. These playbooks enable testing of various power management scenarios in a simulated datacenter environment.
Overview
The SDK includes several automated playbooks that simulate realistic datacenter power management scenarios. The Resource Group Simulation continuously creates and deletes workload groups to test lifecycle management. Grid Simulation schedules random power loads across domains to test grid integration. Load Shedding Simulation runs planned power reduction and recovery cycles. Each playbook can be run independently or in combination to test different aspects of the Domain Power Service.
Playbooks are organized into two categories:
- Single-Script Playbooks: Focus on one specific aspect (Resource Groups, Grid, or Load Shedding)
- Multi-Script Playbooks: Combine multiple simulators for more complex testing scenarios
Single-Script Playbooks
1. Resource Group Simulation
Description: Automated creation and deletion of resource groups with configurable parameters. Simulates realistic workload lifecycle patterns with randomized power management features.
The task sim:rgs target runs the sim/sim_rgs.sh script, which handles the full lifecycle of resource group management.
Script Dependencies:
- dpsctl - DPS command-line interface
- jq - JSON parsing
Quick Start:
# Start basic simulation
task sim:rgs
# Quick test simulation
task sim:rgs END_AFTER=120 MAX_RGS=5 MIN_DURATION=30 MAX_DURATION=60
# Extended simulation
task sim:rgs END_AFTER=1800 MAX_RGS=20 MIN_DURATION=120 MAX_DURATION=300
# High-intensity simulation
task sim:rgs MAX_RGS=25 MIN_RG_SIZE=5 MAX_RG_SIZE=15
Parameters:
Core parameters:
- MAX_RGS: Maximum number of resource groups (default: 15)
- MIN_DURATION: Minimum RG lifecycle duration in seconds (default: 60)
- MAX_DURATION: Maximum RG lifecycle duration in seconds (default: 300)
- MIN_RG_SIZE: Minimum resources per RG (default: 2)
- MAX_RG_SIZE: Maximum resources per RG (default: 10)
- END_AFTER: End simulation after specified seconds (default: 600)
Feature randomization flags:
- NO_WPPS: Disable workload profile (WPPS) randomization (default: false)
- NO_PRS: Disable PRS randomization (default: false)
- NO_DPM: Disable DPM randomization (default: false)
- NO_ALLOW_REPROVISION: Disable allow-reprovision randomization (default: false)
What It Does:
Initialization Phase:
- Fetches available resources from topology
- Loads available power policies from DPS server
- Can optionally clean up existing resource groups
Simulation Loop:
- Creates resource groups up to the MAX_RGS limit
- Randomly selects resources (between MIN_RG_SIZE and MAX_RG_SIZE)
- Randomly selects a power policy from available policies
- Randomly enables/disables optional features (WPPS, PRS, DPM, allow-reprovision)
- Tracks resource allocation to prevent conflicts
- Automatically deletes RGs after their lifecycle duration expires
Resource Group Lifecycle:
- Create: Creates RG with external ID and selected policy
- Add: Adds selected resources to the RG
- Activate: Activates the RG with optional features
- Delete: Removes RG after duration expires
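The expiry step of the simulation loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of sim/sim_rgs.sh: the names rg_expires_at, track_rg, and reap_expired are hypothetical, and the dpsctl calls that would create and delete the groups are deliberately left out.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of lifecycle tracking; names are illustrative and the
# actual dpsctl calls are omitted.

MIN_DURATION=${MIN_DURATION:-60}
MAX_DURATION=${MAX_DURATION:-300}

declare -A rg_expires_at   # RG ID -> epoch second at which the RG is due for deletion

# Start tracking an RG created at time "now" with a random lifetime
# in [MIN_DURATION, MAX_DURATION] seconds.
track_rg() {
    local rg_id=$1 now=$2
    local lifetime=$(( MIN_DURATION + RANDOM % (MAX_DURATION - MIN_DURATION + 1) ))
    rg_expires_at[$rg_id]=$(( now + lifetime ))
}

# Move every RG whose lifetime has elapsed into the EXPIRED array and stop
# tracking it. The caller would then delete each one through dpsctl.
reap_expired() {
    local now=$1 rg_id
    EXPIRED=()
    for rg_id in "${!rg_expires_at[@]}"; do
        if (( ${rg_expires_at[$rg_id]} <= now )); then
            EXPIRED+=("$rg_id")
            unset "rg_expires_at[$rg_id]"
        fi
    done
}
```

A loop built on this sketch would call reap_expired "$(date +%s)" once per check interval and then create new groups while fewer than MAX_RGS are tracked.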
Script Implementation Details:
- Maintains associative arrays for active RGs and resource usage
- Prevents resource conflicts by tracking which resources are in use
- Generates unique RG IDs in format: sim-rg-XXXX (e.g., sim-rg-0001)
- Uses unique external IDs starting from 1000
- Default check interval: 10 seconds
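The bookkeeping listed above might look roughly like this. Only the ID format and the starting external ID of 1000 come from the text; the function and variable names are hypothetical, and the sketch assumes that external IDs increase by one per group.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; names below are illustrative, not taken from sim/sim_rgs.sh.

declare -A resource_in_use   # resource name -> RG ID currently holding it
rg_counter=0                 # drives the sim-rg-XXXX suffix
next_external_id=1000        # external IDs start at 1000 (per the text)

# Set RG_ID to the next zero-padded ID (sim-rg-0001, sim-rg-0002, ...) and
# EXTERNAL_ID to the next external ID (assumed to increase by one each time).
next_rg_id() {
    rg_counter=$(( rg_counter + 1 ))
    printf -v RG_ID 'sim-rg-%04d' "$rg_counter"
    EXTERNAL_ID=$next_external_id
    next_external_id=$(( next_external_id + 1 ))
}

# Claim a resource for an RG; fail if another RG already holds it.
claim_resource() {
    local resource=$1 rg_id=$2
    if [[ -n ${resource_in_use[$resource]:-} ]]; then
        return 1
    fi
    resource_in_use[$resource]=$rg_id
}

# Release every resource held by an RG (called when the RG is deleted).
release_resources() {
    local rg_id=$1 resource
    for resource in "${!resource_in_use[@]}"; do
        if [[ ${resource_in_use[$resource]} == "$rg_id" ]]; then
            unset "resource_in_use[$resource]"
        fi
    done
}
```

The ID helper writes to variables instead of printing, so the counters advance in the calling shell; wrapping it in a command substitution would increment them only inside a subshell.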
Feature Randomization:
When enabled (default), the script randomly applies these features:
- WPPS (Workload Profile): Randomly selects 1-2 profiles from IDs 3-15
- PRS (Power Regulation Service): Randomly enables or disables
- DPM (Dynamic Power Management): Randomly enables or disables
- Allow Reprovision: Randomly enables or disables during activation
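A minimal sketch of these random choices is shown below. The helper names are hypothetical. Two behaviors are assumptions, not documented facts: when two profiles are drawn they are distinct, and a feature whose NO_* flag is set is simply left off.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the randomization choices; names are illustrative
# and not taken from sim/sim_rgs.sh.

# Set RAND to a random integer in [min, max] (inclusive).
rand_between() {
    RAND=$(( $1 + RANDOM % ($2 - $1 + 1) ))
}

# Set PROFILES to one or two workload profile IDs drawn from 3..15.
# Assumption: when two IDs are drawn, they are distinct.
pick_profiles() {
    local count first
    rand_between 1 2; count=$RAND
    rand_between 3 15; first=$RAND
    PROFILES=$first
    if (( count == 2 )); then
        rand_between 3 15
        while (( RAND == first )); do
            rand_between 3 15
        done
        PROFILES="$first $RAND"
    fi
}

# Set ENABLED to true or false at random. If the matching NO_* flag is
# "true", skip the coin flip and leave the feature off (an assumption).
random_toggle() {
    local no_flag=${1:-false}
    ENABLED=false
    if [[ $no_flag != true ]] && (( RANDOM % 2 )); then
        ENABLED=true
    fi
}
```

For example, random_toggle "${NO_PRS:-false}" followed by reading ENABLED would give the PRS setting for one resource group.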
To disable specific features:
# Disable WPPS and PRS randomization
task sim:rgs NO_WPPS=true NO_PRS=true
# Disable only allow-reprovision randomization
task sim:rgs NO_ALLOW_REPROVISION=true
# Disable all randomization features
task sim:rgs NO_WPPS=true NO_PRS=true NO_DPM=true NO_ALLOW_REPROVISION=true
2. Grid Simulation
Description: Simulates nvgrid load targets for testing domain-level power management and grid integration. Schedules multiple concurrent load targets across power domains.
The task sim:grid target runs the sim/sim_grid.sh script, which manages nvgrid load target scheduling across power domains.
Script Dependencies:
- dpsctl - DPS command-line interface
- jq - JSON parsing
Quick Start:
# Start grid simulation with defaults
task sim:grid
# Run longer simulation (1 hour)
task sim:grid END_AFTER=3600
# Quick test with higher load range
task sim:grid END_AFTER=600 MIN_LOAD_PERCENT=80 MAX_LOAD_PERCENT=95
# Custom check interval
task sim:grid INTERVAL=120 MIN_DURATION=60 MAX_DURATION=300
Parameters:
- INTERVAL: Check interval in seconds (default: 300)
- MIN_DURATION: Minimum duration for load target in seconds (default: 120)
- MAX_DURATION: Maximum duration for load target in seconds (default: 600)
- END_AFTER: End simulation after specified seconds (default: 1800)
- MIN_LOAD_PERCENT: Minimum load as % of domain capacity (default: 40)
- MAX_LOAD_PERCENT: Maximum load as % of domain capacity (default: 90)
What It Does:
Initialization Phase:
- Fetches topology and identifies all power domains
- Retrieves capacity limits for each power domain
- Displays current load targets before starting
Simulation Loop:
- Every INTERVAL seconds, checks active targets
- Schedules 2-5 new random grid events if fewer than 5 targets are active (max 5 concurrent)
- Each event targets a random power domain
- Load values are randomized within the specified percentage range
- Duration is randomized between MIN_DURATION and MAX_DURATION
- Events are scheduled with random gaps (2-30 minutes) between them
Event Scheduling:
- Events can be scheduled on the same domain (non-conflicting times)
- Automatically adjusts start times to avoid conflicts
- Maintains at most 5 active targets at any time
- Tracks earliest start and latest end times per domain
Cleanup:
- On interruption (Ctrl+C), expires all scheduled events
- Uses bulk expiration by setting load to 0W for the scheduled timeframe
Implementation Details:
- Maintains associative arrays for power domains, active targets, and domain timelines
- Generates unique target IDs in format: domain:counter
- Cleans up expired targets automatically before scheduling new ones
- Supports multiple load targets on the same domain with time separation
Scheduling Strategy:
- Minimum events per interval: 2
- Maximum events per interval: 5
- Maximum concurrent active targets: 5
- Event gap range: 120-1800 seconds (2-30 minutes)
- Adds 30-second buffer between events on same domain
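The arithmetic behind this strategy can be sketched as follows. The function names are hypothetical, capacity is assumed to be an integer number of watts, and capping the event count by the free target slots is one plausible reading of the 5-target limit, not confirmed script behavior.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the scheduling arithmetic; names are illustrative.

MAX_CONCURRENT=5        # maximum concurrent active targets
SAME_DOMAIN_BUFFER=30   # seconds between events on the same domain

# Set RAND to a random integer in [min, max] (inclusive).
rand_between() {
    RAND=$(( $1 + RANDOM % ($2 - $1 + 1) ))
}

# Set EVENT_COUNT to 2-5 new events, capped by the free target slots.
events_to_schedule() {
    local active=$1
    local free=$(( MAX_CONCURRENT - active ))
    rand_between 2 5
    EVENT_COUNT=$(( RAND < free ? RAND : free ))
    if (( EVENT_COUNT < 0 )); then EVENT_COUNT=0; fi
}

# Set LOAD_WATTS to a random load within [min_pct, max_pct] of capacity.
random_load() {
    local capacity_watts=$1 min_pct=$2 max_pct=$3
    rand_between "$min_pct" "$max_pct"
    LOAD_WATTS=$(( capacity_watts * RAND / 100 ))
}

# Set START_TIME for a new event on a domain: the candidate start, pushed
# past the domain's latest scheduled end plus the buffer if they would overlap.
adjust_start() {
    local candidate=$1 domain_latest_end=$2
    local earliest=$(( domain_latest_end + SAME_DOMAIN_BUFFER ))
    START_TIME=$(( candidate > earliest ? candidate : earliest ))
}
```

The gap between consecutive events would come from rand_between 120 1800, matching the 2-30 minute range above.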
3. Load Shedding Simulation
Description: Simulates a single load shedding and reactivation cycle for testing power reduction scenarios and recovery. All events are scheduled upfront, making it ideal for testing system response to planned power constraints.
The task sim:loadshed (alias: task sim:ls) target runs the sim/sim_loadshed.sh script, which schedules and monitors a complete load shedding cycle.
Script Dependencies:
- dpsctl - DPS command-line interface
- jq - JSON parsing
- bc - Floating-point calculations
Quick Start:
# Start load shedding simulation with defaults
# (100% → 5% in 5% steps, hold 5 min, reactivate to 100%)
task sim:loadshed
# Quick test with faster intervals
task sim:loadshed SHED_INTERVAL=30 HOLD_TIME=120
# Shed to 10% in 10% increments
task sim:loadshed SHED_INCREMENT=10 MIN_LOAD_PERCENT=10
# Aggressive load shedding (20% steps down to 20%)
task sim:loadshed SHED_INCREMENT=20 MIN_LOAD_PERCENT=20
# Target specific power domain
task sim:loadshed FEED_TAG=PDU-GB200-1
# Use short alias
task sim:ls SHED_INCREMENT=10 MIN_LOAD_PERCENT=10
Parameters:
- SHED_INTERVAL: Time between load shedding steps in seconds (default: 60)
- SHED_INCREMENT: Load percentage increment for steps (default: 5)
- HOLD_TIME: Time to hold at minimum load in seconds (default: 300)
- MIN_LOAD_PERCENT: Minimum load percentage to shed down to (default: 5)
- FEED_TAG: Specific feed tag (power domain) to target (default: auto-select first domain)
What It Does:
Phase 1: Load Shedding
- Reduces load from 100% to MIN_LOAD_PERCENT in SHED_INCREMENT steps
- Each step lasts for SHED_INTERVAL seconds
- Example: 100% → 95% → 90% → … → 5% (with default 5% increment)
Phase 2: Hold
- Maintains minimum load for HOLD_TIME seconds
- Tests system stability under sustained low load
Phase 3: Reactivation
- Increases load from MIN_LOAD_PERCENT back to 100% in SHED_INCREMENT steps
- Each step lasts for SHED_INTERVAL seconds
- Example: 10% → 15% → 20% → … → 100% (with the default 5% increment and 5% minimum, so the first step above the hold level is 10%)
Implementation Details:
Domain Selection:
- If no FEED_TAG is specified, auto-selects the first power domain
- Retrieves domain capacity and power factor from topology
- Calculates maximum load as capacity × power_factor
Event Scheduling:
- All events are pre-scheduled at simulation start
- Calculates total simulation duration upfront
- Each load target is scheduled with precise start/end times
- No dynamic adjustments during execution
Calculation Example: With defaults (5% increment, 5% min load, 60s interval, 300s hold):
- Shedding steps: 100%, 95%, 90%, …, 5% = 20 events
- Hold: 1 event
- Reactivation steps: 10%, 15%, 20%, …, 100% = 19 events
- Total events: 40 events
- Total duration: (20 × 60) + 300 + (19 × 60) = 2,640 seconds (~44 minutes)
Cleanup:
- On interruption (Ctrl+C), cancels all scheduled events
- Sets load to 0W for the entire simulation timeframe
- Ensures clean exit with no lingering targets
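The calculation example above can be reproduced with a short script. This is a sketch, not sim/sim_loadshed.sh itself: the SCHEDULE array and both function names are hypothetical, and it assumes that (100 − MIN_LOAD_PERCENT) is a whole multiple of SHED_INCREMENT.

```shell
#!/usr/bin/env bash
# Hypothetical sketch reproducing the calculation example; not sim_loadshed.sh itself.

SHED_INTERVAL=${SHED_INTERVAL:-60}
SHED_INCREMENT=${SHED_INCREMENT:-5}
HOLD_TIME=${HOLD_TIME:-300}
MIN_LOAD_PERCENT=${MIN_LOAD_PERCENT:-5}

# Build the schedule as "percent:duration_seconds" entries.
build_schedule() {
    SCHEDULE=()
    local pct
    # Phase 1: shed from 100% down to the minimum.
    for (( pct = 100; pct >= MIN_LOAD_PERCENT; pct -= SHED_INCREMENT )); do
        SCHEDULE+=("$pct:$SHED_INTERVAL")
    done
    # Phase 2: hold at the minimum.
    SCHEDULE+=("$MIN_LOAD_PERCENT:$HOLD_TIME")
    # Phase 3: reactivate, starting one step above the minimum.
    for (( pct = MIN_LOAD_PERCENT + SHED_INCREMENT; pct <= 100; pct += SHED_INCREMENT )); do
        SCHEDULE+=("$pct:$SHED_INTERVAL")
    done
}

# Sum the durations of all scheduled events into TOTAL_SECONDS.
total_duration() {
    TOTAL_SECONDS=0
    local entry
    for entry in "${SCHEDULE[@]}"; do
        TOTAL_SECONDS=$(( TOTAL_SECONDS + ${entry#*:} ))
    done
}

build_schedule
total_duration
echo "${#SCHEDULE[@]} events, ${TOTAL_SECONDS}s total"   # defaults: 40 events, 2640s total
```

Running the same functions with SHED_INCREMENT=10 and MIN_LOAD_PERCENT=10 gives 20 events over 1440 seconds, which is a quick way to pick a matching END_AFTER for a parallel resource group run.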
Behavior Notes:
- The simulator schedules all events upfront and then monitors progress
- Actual power adjustments happen automatically per the schedule
- The script doesn’t dynamically adjust based on system response
- Power factor from topology is automatically applied to capacity calculations
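As a worked instance of the capacity × power_factor rule, the sketch below computes the load for a given percentage step. The function name and the sample values are made up for illustration. The script lists bc for floating-point math; awk is swapped in here only to keep the example dependency-light.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; capacity and power factor values are made up.

# Print capacity * power_factor * percent / 100, rounded to a whole number,
# in the same unit as the capacity argument.
# awk stands in for bc here (a substitution, not what the script uses).
load_for_percent() {
    local capacity=$1 power_factor=$2 percent=$3
    awk -v c="$capacity" -v pf="$power_factor" -v p="$percent" \
        'BEGIN { printf "%.0f\n", c * pf * p / 100 }'
}

load_for_percent 100000 0.95 100   # prints 95000 (the maximum load)
load_for_percent 100000 0.95 5     # prints 4750 (the 5% step)
```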
Multi-Script Playbooks
4. Combined Simulation (Resource Groups + Grid)
Description: Runs both resource group and grid simulations simultaneously for thorough testing of workload management and power domain optimization.
Script Location: Uses both sim/sim_rgs.sh and sim/sim_grid.sh
Quick Start:
# Start combined simulation with defaults
task sim
# Run longer simulation (1 hour)
task sim END_AFTER=3600
# Quick test with higher load range
task sim END_AFTER=600 MIN_LOAD_PERCENT=80 MAX_LOAD_PERCENT=95
Parameters:
Resource Group parameters:
- MAX_RGS: Maximum number of resource groups (default: 15)
- MIN_DURATION: Minimum RG lifecycle duration in seconds (default: 60)
- MAX_DURATION: Maximum RG lifecycle duration in seconds (default: 300)
- MIN_RG_SIZE: Minimum resources per RG (default: 2)
- MAX_RG_SIZE: Maximum resources per RG (default: 10)
- END_AFTER: End simulation after specified seconds (default: 1800)
Grid simulation parameters:
- INTERVAL: Check interval in seconds (default: 300)
- MIN_LOAD_PERCENT: Minimum load as % of domain capacity (default: 40)
- MAX_LOAD_PERCENT: Maximum load as % of domain capacity (default: 90)
What It Does:
- Continuously creates and deletes resource groups based on configured lifecycle parameters
- Schedules 2-5 random grid load targets across power domains every interval
- Tests both workload churn and power optimization simultaneously
- Monitors resource utilization and power distribution
Implementation Details:
- Runs two simulator processes in parallel
- Automatically handles cleanup on interruption (Ctrl+C)
- Both simulators share the same END_AFTER duration for synchronized termination
- Resource groups and grid targets operate independently
5. Load Shedding + Resource Groups Simulation
Description: Runs both load shedding and resource groups simulators in parallel, providing thorough testing of system behavior under combined workload churn and power constraints.
Script Location: Uses both sim/sim_loadshed.sh and sim/sim_rgs.sh
Quick Start:
# Run with defaults
task sim:loadshed-rgs
# Faster load shedding with aggressive RG spawning
task sim:loadshed-rgs SHED_INTERVAL=30 MAX_RGS=25
# Custom configuration for both simulators
task sim:loadshed-rgs SHED_INCREMENT=10 MIN_LOAD_PERCENT=10 MAX_RGS=20 END_AFTER=2000
# Use short alias
task sim:ls-rgs SHED_INCREMENT=10 MAX_RGS=25
Parameters:
Load Shedding:
- SHED_INTERVAL: Time between load shedding steps in seconds (default: 60)
- SHED_INCREMENT: Load percentage increment for steps (default: 5)
- HOLD_TIME: Time to hold at minimum load in seconds (default: 300)
- MIN_LOAD_PERCENT: Minimum load percentage to shed down to (default: 5)
- FEED_TAG: Specific feed tag (power domain) to target (default: auto-select)
Resource Groups:
- MIN_DURATION: Minimum duration for RG lifecycle in seconds (default: 60)
- MAX_DURATION: Maximum duration for RG lifecycle in seconds (default: 180)
- END_AFTER: End RG simulation after specified seconds (default: 2640, which matches the load shed duration with default settings)
- MAX_RGS: Maximum number of RGs to spawn (default: 15)
- MIN_RG_SIZE: Minimum resources per RG (default: 2)
- MAX_RG_SIZE: Maximum resources per RG (default: 10)
What It Does:
Parallel Execution:
- Starts load shedding simulator in background
- Starts resource group simulator in background
- Both run simultaneously and independently
Load Shedding Component:
- Executes full load shedding cycle as described in Playbook #3
- Schedules all power reduction/reactivation events upfront
- Runs for calculated duration based on shed parameters
Resource Groups Component:
- Continuously creates/deletes resource groups as described in Playbook #1
- Runs until END_AFTER duration (defaults to match load shed duration)
- Maintains up to MAX_RGS concurrent resource groups
Synchronized Termination:
- By default, both simulators end at approximately the same time
- END_AFTER for RGs is set to match load shedding total duration (2640s with defaults)
- Both simulators clean up gracefully on completion or interruption
Cleanup:
- Interrupt signal (Ctrl+C) triggers cleanup for both processes
- Kills both simulator PIDs and waits for clean exit
- Prevents recursive cleanup with guard variable
Implementation Details:
- Uses bash process management with background jobs
- Trap handles INT, TERM, HUP, QUIT, ABRT, ALRM, USR1, USR2 signals
- Cleanup function ensures both processes terminate together
- Exit code 130 (128 + 2, the signal number of SIGINT) on manual interruption
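The process management pattern described here can be sketched as below. The names cleanup_done, sim_pids, cleanup, on_signal, and run_pair are hypothetical. Exiting with 130 for every trapped signal is a simplification: the text only states that code for manual interruption.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the parallel-run-and-cleanup pattern; names are
# illustrative and not taken from the playbook's task definition.

cleanup_done=false   # guard variable so cleanup runs only once
sim_pids=()          # PIDs of the background simulators

# Kill both simulators and wait for them to exit.
cleanup() {
    if [[ $cleanup_done == true ]]; then
        return 0
    fi
    cleanup_done=true
    local pid
    for pid in "${sim_pids[@]}"; do
        kill "$pid" 2>/dev/null || true
    done
    wait "${sim_pids[@]}" 2>/dev/null || true
    return 0
}

# Signal handler: clean up, then exit with 130 (128 + 2 for SIGINT).
# Assumption: the same code is used for every trapped signal.
on_signal() {
    cleanup
    exit 130
}

# Start two simulator scripts in the background and wait for both.
run_pair() {
    trap on_signal INT TERM HUP QUIT ABRT ALRM USR1 USR2
    "$1" & sim_pids+=("$!")
    "$2" & sim_pids+=("$!")
    wait "${sim_pids[@]}"
    cleanup
}
```

Under these assumptions, run_pair sim/sim_loadshed.sh sim/sim_rgs.sh would start both simulators. Blocking in wait instead of running a simulator in the foreground lets bash run the trap as soon as a signal arrives.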
Timing Coordination: With default values:
- Load shedding duration: ~2640 seconds (44 minutes)
- Resource groups END_AFTER: 2640 seconds (synchronized)
- Both simulators complete at approximately the same time
Cleanup
Interrupt simulators cleanly:
- Press Ctrl+C to trigger cleanup handlers
- Scripts will automatically clean up created resources
- Wait for cleanup to complete before restarting
Summary
The SDK provides a complete set of playbooks for testing datacenter power management scenarios:
Single-Script Playbooks:
- Resource Group Simulation: Focuses on workload lifecycle and resource allocation
- Grid Simulation: Tests domain-level power management and grid integration
- Load Shedding Simulation: Tests power reduction and recovery scenarios
Multi-Script Playbooks:
- Combined Simulation: Tests both workload management and power optimization simultaneously
- Load Shedding + RGs: Combines workload churn with power constraints
Each playbook is designed to test specific aspects of the DPS system and can be customized through various parameters to simulate different scenarios. Start with single-script playbooks to understand individual components, then use multi-script playbooks for more complex testing scenarios.