SDK Simulator Playbooks

This document describes every playbook available in the SDK Simulator. The playbooks exercise power management scenarios in a simulated datacenter environment.

Overview

The SDK includes several automated playbooks that simulate realistic datacenter power management scenarios. The Resource Group Simulation continuously creates and deletes workload groups to test lifecycle management. Grid Simulation schedules random power loads across domains to test grid integration. Load Shedding Simulation runs planned power reduction and recovery cycles. Each playbook can be run independently or in combination to test different aspects of the Domain Power Service.

Playbooks are organized into two categories:

  • Single-Script Playbooks: Focus on one specific aspect (Resource Groups, Grid, or Load Shedding)
  • Multi-Script Playbooks: Combine multiple simulators for more complex testing scenarios

Single-Script Playbooks

1. Resource Group Simulation

Description: Automated creation and deletion of resource groups with configurable parameters. Simulates realistic workload lifecycle patterns with randomized power management features.

The task sim:rgs target runs the sim/sim_rgs.sh script, which handles the full lifecycle of resource group management.

Script Dependencies:

  • dpsctl - DPS command-line interface
  • jq - JSON parsing

Quick Start:

# Start basic simulation
task sim:rgs

# Quick test simulation
task sim:rgs END_AFTER=120 MAX_RGS=5 MIN_DURATION=30 MAX_DURATION=60

# Extended simulation
task sim:rgs END_AFTER=1800 MAX_RGS=20 MIN_DURATION=120 MAX_DURATION=300

# High-intensity simulation
task sim:rgs MAX_RGS=25 MIN_RG_SIZE=5 MAX_RG_SIZE=15

Parameters:

Core parameters:

  • MAX_RGS: Maximum number of resource groups (default: 15)
  • MIN_DURATION: Minimum RG lifecycle duration in seconds (default: 60)
  • MAX_DURATION: Maximum RG lifecycle duration in seconds (default: 300)
  • MIN_RG_SIZE: Minimum resources per RG (default: 2)
  • MAX_RG_SIZE: Maximum resources per RG (default: 10)
  • END_AFTER: End simulation after specified seconds (default: 600)

Feature randomization flags:

  • NO_WPPS: Disable workload profile (WPPS) randomization (default: false)
  • NO_PRS: Disable PRS randomization (default: false)
  • NO_DPM: Disable DPM randomization (default: false)
  • NO_ALLOW_REPROVISION: Disable allow-reprovision randomization (default: false)

What It Does:

  1. Initialization Phase:

    • Fetches available resources from topology
    • Loads available power policies from DPS server
    • Optionally cleans up existing resource groups
  2. Simulation Loop:

    • Creates resource groups up to MAX_RGS limit
    • Randomly selects resources (between MIN_RG_SIZE and MAX_RG_SIZE)
    • Randomly selects a power policy from available policies
    • Randomly enables/disables optional features (WPPS, PRS, DPM, allow-reprovision)
    • Tracks resource allocation to prevent conflicts
    • Automatically deletes RGs after their lifecycle duration expires
  3. Resource Group Lifecycle:

    • Create: Creates RG with external ID and selected policy
    • Add: Adds selected resources to the RG
    • Activate: Activates the RG with optional features
    • Delete: Removes RG after duration expires

Script Implementation Details:

  • Maintains associative arrays for active RGs and resource usage
  • Prevents resource conflicts by tracking which resources are in use
  • Generates unique RG IDs in format: sim-rg-XXXX (e.g., sim-rg-0001)
  • Uses unique external IDs starting from 1000
  • Default check interval: 10 seconds
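The ID generation and conflict tracking described above can be sketched as follows. This is a minimal illustration, not the contents of sim/sim_rgs.sh: the function and variable names (next_rg_id, claim_resources, release_resources, RESOURCE_IN_USE) are hypothetical, and only the sim-rg-XXXX format and the external ID start value of 1000 come from this document.

```bash
#!/usr/bin/env bash
# Hypothetical bookkeeping sketch; names are illustrative, not from sim_rgs.sh.
# Requires bash 4+ for associative arrays.

declare -A RESOURCE_IN_USE=()   # resource name -> RG ID that currently holds it
RG_COUNTER=0
NEXT_EXTERNAL_ID=1000           # external IDs start at 1000

# Produce the next RG ID in the sim-rg-XXXX format.
# Results are stored in the globals NEW_RG_ID and NEW_EXTERNAL_ID so that the
# counters are updated in the calling shell rather than in a subshell.
next_rg_id() {
  RG_COUNTER=$((RG_COUNTER + 1))
  printf -v NEW_RG_ID 'sim-rg-%04d' "$RG_COUNTER"
  NEW_EXTERNAL_ID=$NEXT_EXTERNAL_ID
  NEXT_EXTERNAL_ID=$((NEXT_EXTERNAL_ID + 1))
}

# Claim resources for an RG. Fails without claiming anything if any of the
# requested resources is already held by another RG.
claim_resources() {
  local rg_id=$1; shift
  local r
  for r in "$@"; do
    [[ -n ${RESOURCE_IN_USE[$r]:-} ]] && return 1
  done
  for r in "$@"; do
    RESOURCE_IN_USE[$r]=$rg_id
  done
  return 0
}

# Release every resource held by an RG, for example when its lifecycle expires.
release_resources() {
  local rg_id=$1 r
  for r in "${!RESOURCE_IN_USE[@]}"; do
    [[ ${RESOURCE_IN_USE[$r]} == "$rg_id" ]] && unset "RESOURCE_IN_USE[$r]"
  done
  return 0
}
```

Checking all requested resources before claiming any of them keeps a failed claim from leaving a partially allocated group behind.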

Feature Randomization:

When enabled (default), the script randomly applies these features:

  • WPPS (Workload Profile): Randomly selects 1-2 profiles from IDs 3-15
  • PRS (Power Regulation Service): Randomly enables or disables
  • DPM (Dynamic Power Management): Randomly enables or disables
  • Allow Reprovision: Randomly enables or disables during activation

To disable specific features:

# Disable WPPS and PRS randomization
task sim:rgs NO_WPPS=true NO_PRS=true

# Disable only allow-reprovision randomization
task sim:rgs NO_ALLOW_REPROVISION=true

# Disable all randomization features
task sim:rgs NO_WPPS=true NO_PRS=true NO_DPM=true NO_ALLOW_REPROVISION=true
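As a rough sketch of how this randomization might look in bash: the function names below are hypothetical, and two behaviors are assumptions that this document does not confirm. First, that two selected profile IDs are always distinct. Second, that setting a NO_* flag to true leaves the corresponding feature off rather than fixed to some other value.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; function names are hypothetical.

# Print 1 or 2 workload profile IDs from the 3-15 range, comma-separated.
# Assumption: when two IDs are picked, they are distinct.
pick_wpps_profiles() {
  local count=$(( 1 + RANDOM % 2 ))
  local first=$(( 3 + RANDOM % 13 ))   # 13 possible values: 3..15
  if (( count == 1 )); then
    echo "$first"
    return 0
  fi
  local second=$first
  while (( second == first )); do
    second=$(( 3 + RANDOM % 13 ))
  done
  echo "$first,$second"
}

# Print "true" or "false" at random, unless the matching NO_* flag is "true".
# Assumption: a disabled flag means the feature stays off.
# Usage: random_feature "${NO_PRS:-false}"
random_feature() {
  local disabled=${1:-false}
  if [[ $disabled == true ]]; then
    echo false
  elif (( RANDOM % 2 )); then
    echo true
  else
    echo false
  fi
}
```

For example, random_feature "${NO_DPM:-false}" prints either true or false when DPM randomization is enabled, and always false when NO_DPM=true.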

2. Grid Simulation

Description: Simulates nvgrid load targets for testing domain-level power management and grid integration. Schedules multiple concurrent load targets across power domains.

The task sim:grid target runs the sim/sim_grid.sh script, which manages nvgrid load target scheduling across power domains.

Script Dependencies:

  • dpsctl - DPS command-line interface
  • jq - JSON parsing

Quick Start:

# Start grid simulation with defaults
task sim:grid

# Run longer simulation (1 hour)
task sim:grid END_AFTER=3600

# Quick test with higher load range
task sim:grid END_AFTER=600 MIN_LOAD_PERCENT=80 MAX_LOAD_PERCENT=95

# Custom check interval and load target durations
task sim:grid INTERVAL=120 MIN_DURATION=60 MAX_DURATION=300

Parameters:

  • INTERVAL: Check interval in seconds (default: 300)
  • MIN_DURATION: Minimum duration for load target in seconds (default: 120)
  • MAX_DURATION: Maximum duration for load target in seconds (default: 600)
  • END_AFTER: End simulation after specified seconds (default: 1800)
  • MIN_LOAD_PERCENT: Minimum load as % of domain capacity (default: 40)
  • MAX_LOAD_PERCENT: Maximum load as % of domain capacity (default: 90)

What It Does:

  1. Initialization Phase:

    • Fetches topology and identifies all power domains
    • Retrieves capacity limits for each power domain
    • Displays current load targets before starting
  2. Simulation Loop:

    • Every INTERVAL seconds, checks active targets
    • Schedules 2-5 new random grid events if the number of active targets is below the concurrency limit (max 5 concurrent)
    • Each event targets a random power domain
    • Load values are randomized within specified percentage range
    • Duration is randomized between min and max values
    • Events are scheduled with random gaps (2-30 minutes) between them
  3. Event Scheduling:

    • Events can be scheduled on the same domain (non-conflicting times)
    • Automatically adjusts start times to avoid conflicts
    • Maintains at most 5 active targets at any time
    • Tracks earliest start and latest end times per domain
  4. Cleanup:

    • On interruption (Ctrl+C), expires all scheduled events
    • Uses bulk expiration by setting load to 0W for the scheduled timeframe

Implementation Details:

  • Maintains associative arrays for power domains, active targets, and domain timelines
  • Generates unique target IDs in format: domain:counter
  • Cleans up expired targets automatically before scheduling new ones
  • Supports multiple load targets on the same domain with time separation

Scheduling Strategy:

  • Minimum events per interval: 2
  • Maximum events per interval: 5
  • Maximum concurrent active targets: 5
  • Event gap range: 120-1800 seconds (2-30 minutes)
  • Adds 30-second buffer between events on same domain
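One way to picture the 30-second buffer is a per-domain timeline that pushes a new event back when it would start too soon after the previous one. The sketch below is a simplification under stated assumptions: it tracks only the latest end time per domain, and the names reserve_slot, DOMAIN_LATEST_END, EVENT_START, and EVENT_END are hypothetical.

```bash
#!/usr/bin/env bash
# Simplified sketch of conflict avoidance; names are hypothetical.
# Requires bash 4+ for associative arrays.

declare -A DOMAIN_LATEST_END=()   # domain -> end time (epoch seconds) of its last event
BUFFER_SECONDS=30

# Reserve a slot on a domain. If the proposed start falls less than
# BUFFER_SECONDS after the latest event on that domain, the start is pushed
# back. Results are stored in the globals EVENT_START and EVENT_END.
reserve_slot() {
  local domain=$1 proposed_start=$2 duration=$3
  local latest_end=${DOMAIN_LATEST_END[$domain]:-0}
  local earliest=$(( latest_end + BUFFER_SECONDS ))
  EVENT_START=$proposed_start
  if (( latest_end > 0 && proposed_start < earliest )); then
    EVENT_START=$earliest
  fi
  EVENT_END=$(( EVENT_START + duration ))
  DOMAIN_LATEST_END[$domain]=$EVENT_END
}
```

In this sketch, an event proposed at t=1100 on a domain whose last event ends at t=1300 is moved to t=1330, while the same proposal on an idle domain keeps its original start.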

3. Load Shedding Simulation

Description: Simulates a single load shedding and reactivation cycle for testing power reduction scenarios and recovery. All events are scheduled upfront, making it ideal for testing system response to planned power constraints.

The task sim:loadshed (alias: task sim:ls) target runs the sim/sim_loadshed.sh script, which schedules and monitors a complete load shedding cycle.

Script Dependencies:

  • dpsctl - DPS command-line interface
  • jq - JSON parsing
  • bc - Floating-point calculations

Quick Start:

# Start load shedding simulation with defaults
# (100% → 5% in 5% steps, hold 5 min, reactivate to 100%)
task sim:loadshed

# Quick test with faster intervals
task sim:loadshed SHED_INTERVAL=30 HOLD_TIME=120

# Shed to 10% in 10% increments
task sim:loadshed SHED_INCREMENT=10 MIN_LOAD_PERCENT=10

# Aggressive load shedding (20% steps down to 20%)
task sim:loadshed SHED_INCREMENT=20 MIN_LOAD_PERCENT=20

# Target specific power domain
task sim:loadshed FEED_TAG=PDU-GB200-1

# Use short alias
task sim:ls SHED_INCREMENT=10 MIN_LOAD_PERCENT=10

Parameters:

  • SHED_INTERVAL: Time between load shedding steps in seconds (default: 60)
  • SHED_INCREMENT: Load percentage increment for steps (default: 5)
  • HOLD_TIME: Time to hold at minimum load in seconds (default: 300)
  • MIN_LOAD_PERCENT: Minimum load percentage to shed down to (default: 5)
  • FEED_TAG: Specific feed tag (power domain) to target (default: auto-select first domain)

What It Does:

Phase 1: Load Shedding

  • Reduces load from 100% to MIN_LOAD_PERCENT in SHED_INCREMENT steps
  • Each step lasts for SHED_INTERVAL seconds
  • Example: 100% → 95% → 90% → … → 5% (with default 5% increment)

Phase 2: Hold

  • Maintains minimum load for HOLD_TIME seconds
  • Tests system stability under sustained low load

Phase 3: Reactivation

  • Increases load from MIN_LOAD_PERCENT back to 100% in SHED_INCREMENT steps
  • Each step lasts for SHED_INTERVAL seconds
  • Example: 5% (hold level) → 10% → 15% → … → 100% (with default 5% increment and 5% minimum)
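The three phases can be enumerated with a small loop. This is a sketch rather than the script's actual logic: shed_schedule is a hypothetical name, and it assumes that 100 minus MIN_LOAD_PERCENT is a multiple of SHED_INCREMENT, so that the sequence lands exactly on the minimum and returns exactly to 100%.

```bash
#!/usr/bin/env bash
# Illustrative sketch; shed_schedule is a hypothetical helper.

# Print one line per scheduled step, tagged with its phase.
# Arguments: increment (default 5), minimum load percent (default 5).
# Assumption: (100 - min) is evenly divisible by the increment.
shed_schedule() {
  local inc=${1:-5} min=${2:-5} pct
  for (( pct = 100; pct >= min; pct -= inc )); do
    echo "shed $pct"
  done
  echo "hold $min"
  for (( pct = min + inc; pct <= 100; pct += inc )); do
    echo "reactivate $pct"
  done
}
```

Running shed_schedule 20 20 lists the shedding steps 100, 80, 60, 40, 20, a single hold at 20, and the reactivation steps 40, 60, 80, 100.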

Implementation Details:

  1. Domain Selection:

    • If no FEED_TAG specified, auto-selects first power domain
    • Retrieves domain capacity and power factor from topology
    • Calculates maximum load as capacity × power_factor
  2. Event Scheduling:

    • All events are pre-scheduled at simulation start
    • Calculates total simulation duration upfront
    • Each load target is scheduled with precise start/end times
    • No dynamic adjustments during execution
  3. Calculation Example: With defaults (5% increment, 5% min load, 60s interval, 300s hold):

    • Shedding steps: 100%, 95%, 90%, …, 5% = 20 events
    • Hold: 1 event
    • Reactivation steps: 10%, 15%, 20%, …, 100% = 19 events
    • Total events: 40 events
    • Total duration: (20 × 60) + 300 + (19 × 60) = 2,640 seconds (~44 minutes)
  4. Cleanup:

    • On interruption (Ctrl+C), cancels all scheduled events
    • Sets load to 0W for the entire simulation timeframe
    • Ensures clean exit with no lingering targets
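The calculation example above generalizes to other parameter values. The following sketch reproduces the arithmetic in bash; loadshed_totals is a hypothetical helper, and it assumes that 100 minus MIN_LOAD_PERCENT is a multiple of SHED_INCREMENT.

```bash
#!/usr/bin/env bash
# Illustrative sketch; loadshed_totals is a hypothetical helper.

# Print "<event count> <total seconds>" for one load shedding cycle.
# Arguments: SHED_INTERVAL, SHED_INCREMENT, HOLD_TIME, MIN_LOAD_PERCENT
# (defaults: 60, 5, 300, 5).
loadshed_totals() {
  local interval=${1:-60} inc=${2:-5} hold=${3:-300} min=${4:-5}
  local shed_steps=$(( (100 - min) / inc + 1 ))   # 100% down to min, inclusive
  local react_steps=$(( (100 - min) / inc ))      # min + inc up to 100%
  local events=$(( shed_steps + 1 + react_steps ))
  local duration=$(( shed_steps * interval + hold + react_steps * interval ))
  echo "$events $duration"
}

loadshed_totals            # defaults: prints "40 2640"
loadshed_totals 60 10 300 10
```

With SHED_INCREMENT=10 and MIN_LOAD_PERCENT=10 the second call yields 20 events over 1440 seconds: 10 shedding steps, 1 hold, and 9 reactivation steps.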

Behavior Notes:

  • The simulator schedules all events upfront and then monitors progress
  • Actual power adjustments happen automatically per the schedule
  • The script doesn’t dynamically adjust based on system response
  • Power factor from topology is automatically applied to capacity calculations
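To make the capacity calculation concrete, the sketch below applies a power factor with bc, which this playbook lists as a dependency. The helper names and the sample values are hypothetical, capacity is assumed to be in watts, and truncating to whole watts is a choice made for this example only.

```bash
#!/usr/bin/env bash
# Illustrative sketch; helper names and sample values are hypothetical.

# Print capacity × power_factor, truncated to whole watts, using bc.
max_load_watts() {
  local capacity_w=$1 power_factor=$2
  echo "scale=0; ($capacity_w * $power_factor) / 1" | bc
}

# Print the load in watts that corresponds to a percentage of the maximum load.
load_at_percent() {
  local max_w=$1 pct=$2
  echo $(( max_w * pct / 100 ))
}
```

For a hypothetical 10000 W domain with a power factor of 0.95, max_load_watts 10000 0.95 gives 9500, and the 5% step computed by load_at_percent 9500 5 is 475 W.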

Multi-Script Playbooks

4. Combined Simulation (Resource Groups + Grid)

Description: Runs both resource group and grid simulations simultaneously for thorough testing of workload management and power domain optimization.

Script Location: Uses both sim/sim_rgs.sh and sim/sim_grid.sh

Quick Start:

# Start combined simulation with defaults
task sim

# Run longer simulation (1 hour)
task sim END_AFTER=3600

# Quick test with higher load range
task sim END_AFTER=600 MIN_LOAD_PERCENT=80 MAX_LOAD_PERCENT=95

Parameters:

Resource Group parameters:

  • MAX_RGS: Maximum number of resource groups (default: 15)
  • MIN_DURATION: Minimum RG lifecycle duration in seconds (default: 60)
  • MAX_DURATION: Maximum RG lifecycle duration in seconds (default: 300)
  • MIN_RG_SIZE: Minimum resources per RG (default: 2)
  • MAX_RG_SIZE: Maximum resources per RG (default: 10)
  • END_AFTER: End simulation after specified seconds (default: 1800)

Grid simulation parameters:

  • INTERVAL: Check interval in seconds (default: 300)
  • MIN_LOAD_PERCENT: Minimum load as % of domain capacity (default: 40)
  • MAX_LOAD_PERCENT: Maximum load as % of domain capacity (default: 90)

What It Does:

  • Continuously creates and deletes resource groups based on configured lifecycle parameters
  • Schedules 2-5 random grid load targets across power domains every interval
  • Tests both workload churn and power optimization simultaneously
  • Monitors resource utilization and power distribution

Implementation Details:

  • Runs two simulator processes in parallel
  • Automatically handles cleanup on interruption (Ctrl+C)
  • Both simulators share the same END_AFTER duration for synchronized termination
  • Resource groups and grid targets operate independently

5. Load Shedding + Resource Groups Simulation

Description: Runs both load shedding and resource groups simulators in parallel, providing thorough testing of system behavior under combined workload churn and power constraints.

Script Location: Uses both sim/sim_loadshed.sh and sim/sim_rgs.sh

Quick Start:

# Run with defaults
task sim:loadshed-rgs

# Faster load shedding with aggressive RG spawning
task sim:loadshed-rgs SHED_INTERVAL=30 MAX_RGS=25

# Custom configuration for both simulators
task sim:loadshed-rgs SHED_INCREMENT=10 MIN_LOAD_PERCENT=10 MAX_RGS=20 END_AFTER=2000

# Use short alias
task sim:ls-rgs SHED_INCREMENT=10 MAX_RGS=25

Parameters:

Load Shedding:

  • SHED_INTERVAL: Time between load shedding steps in seconds (default: 60)
  • SHED_INCREMENT: Load percentage increment for steps (default: 5)
  • HOLD_TIME: Time to hold at minimum load in seconds (default: 300)
  • MIN_LOAD_PERCENT: Minimum load percentage to shed down to (default: 5)
  • FEED_TAG: Specific feed tag (power domain) to target (default: auto-select)

Resource Groups:

  • MIN_DURATION: Minimum duration for RG lifecycle in seconds (default: 60)
  • MAX_DURATION: Maximum duration for RG lifecycle in seconds (default: 180)
  • END_AFTER: End RG simulation after specified seconds (default: 2640, matching the default load shed duration)
  • MAX_RGS: Maximum number of RGs to spawn (default: 15)
  • MIN_RG_SIZE: Minimum resources per RG (default: 2)
  • MAX_RG_SIZE: Maximum resources per RG (default: 10)

What It Does:

  1. Parallel Execution:

    • Starts load shedding simulator in background
    • Starts resource group simulator in background
    • Both run simultaneously and independently
  2. Load Shedding Component:

    • Executes full load shedding cycle as described in Playbook #3
    • Schedules all power reduction/reactivation events upfront
    • Runs for calculated duration based on shed parameters
  3. Resource Groups Component:

    • Continuously creates/deletes resource groups as described in Playbook #1
    • Runs until END_AFTER duration (defaults to match load shed duration)
    • Maintains up to MAX_RGS concurrent resource groups
  4. Synchronized Termination:

    • By default, both simulators end at approximately the same time
    • END_AFTER for RGs is set to match load shedding total duration (2640s with defaults)
    • Both simulators clean up gracefully on completion or interruption
  5. Cleanup:

    • Interrupt signal (Ctrl+C) triggers cleanup for both processes
    • Kills both simulator PIDs and waits for clean exit
    • Prevents recursive cleanup with guard variable

Implementation Details:

  • Uses bash process management with background jobs
  • Trap handles INT, TERM, HUP, QUIT, ABRT, ALRM, USR1, USR2 signals
  • Cleanup function ensures both processes terminate together
  • Exit code 130 (128 + 2, the signal number of SIGINT) on manual interruption
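The process management pattern described above might look roughly like the following. This is a sketch of the general bash technique, not the actual task definition: sleep stands in for the two simulator scripts, and the names PIDS, CLEANING_UP, cleanup, on_signal, and start_simulators are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch of the pattern; `sleep` is a placeholder for the simulator scripts.

PIDS=()
CLEANING_UP=0   # guard variable so a second signal does not re-enter cleanup

# Stop both background processes and wait for them to exit.
cleanup() {
  (( CLEANING_UP )) && return 0
  CLEANING_UP=1
  local pid
  for pid in "${PIDS[@]}"; do
    kill "$pid" 2>/dev/null
  done
  for pid in "${PIDS[@]}"; do
    wait "$pid" 2>/dev/null
  done
  return 0
}

# Signal handler: clean up, then exit with 128 + 2 (SIGINT).
on_signal() {
  cleanup
  exit 130
}

# Launch both simulators as background jobs and record their PIDs.
start_simulators() {
  sleep 300 & PIDS+=("$!")   # placeholder for sim/sim_loadshed.sh
  sleep 300 & PIDS+=("$!")   # placeholder for sim/sim_rgs.sh
}

trap on_signal INT TERM HUP QUIT ABRT ALRM USR1 USR2
```

A driver would call start_simulators and then wait "${PIDS[@]}". Keeping cleanup separate from the exiting handler lets the same function run on normal completion as well as on interruption.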

Timing Coordination: With default values:

  • Load shedding duration: ~2640 seconds (44 minutes)
  • Resource groups END_AFTER: 2640 seconds (synchronized)
  • Both simulators complete at approximately the same time

Cleanup

Interrupt simulators cleanly:

  • Press Ctrl+C to trigger cleanup handlers
  • Scripts will automatically clean up created resources
  • Wait for cleanup to complete before restarting

Summary

The SDK provides a complete set of playbooks for testing datacenter power management scenarios:

Single-Script Playbooks:

  1. Resource Group Simulation: Focuses on workload lifecycle and resource allocation
  2. Grid Simulation: Tests domain-level power management and grid integration
  3. Load Shedding Simulation: Tests power reduction and recovery scenarios

Multi-Script Playbooks:

  4. Combined Simulation: Tests both workload management and power optimization simultaneously
  5. Load Shedding + RGs: Combines workload churn with power constraints

Each playbook is designed to test specific aspects of the DPS system and can be customized through various parameters to simulate different scenarios. Start with single-script playbooks to understand individual components, then use multi-script playbooks for more complex testing scenarios.