Workload-Aware Simulation

The Workload-Aware profile deploys the DPS BMC Simulator with a workload-model service that generates realistic power phases, durations, and signal characteristics based on recorded workloads. Unlike Hardware Emulation (which returns pseudorandom power values), the workload model accepts a workload name and a power cap, then delivers a realistic GPU power trace.

The DPS BMC Simulator calls this model over gRPC and serves the resulting trace through its Redfish API, so DPS manages power exactly as it would with real hardware.

The SDK includes a built-in example model. You can replace it with your own implementation — for example, an ML model trained on production telemetry that predicts how different workloads behave under various power caps.

Overview

The DPS BMC simulator runs a surrogate plugin that delegates GPU power predictions to an external gRPC service (the workload model). The workload model runs as a sidecar container alongside the BMC simulator:

┌─────────────────────────────────────────────────┐
│ bmc-surrogate-simulator pod                     │
│                                                 │
│  ┌─────────────────┐    gRPC    ┌────────────┐  │
│  │  BMC simulator  │◄──────────►│  Workload  │  │
│  │  (surrogate     │   :50052   │  model     │  │
│  │   plugin)       │            │  (Python)  │  │
│  └─────────────────┘            └────────────┘  │
└─────────────────────────────────────────────────┘

The gRPC interface (surrogate.proto) has two methods:

  • Predict — Given a power cap and hardware/job context, return a predicted GPU power time series.
  • GetInfo — Return model metadata. Called at startup for health checks.

When a workload starts on a node, the surrogate plugin calls the workload model’s Predict method with the current power cap and job context. The model returns a finite series of power and utilization values that the surrogate plugin replays in a loop.

The SDK includes a complete, runnable example workload model in examples/workload-model/. The example generates a sine-shaped power trace for demonstration purposes. Copy it and replace the prediction logic in workload_model_server.py with your own trained model to produce realistic GPU power traces for any workload.
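The example's prediction logic can be sketched as a cap-aware sine generator. This is an illustrative sketch only — the function name, defaults, and tuple layout below are assumptions, not the SDK's actual API; the real example lives in workload_model_server.py:

```python
import math

def generate_sine_timeseries(cap_w, duration_sec=600, period_sec=30,
                             base_w=400.0, amplitude_w=250.0):
    """Generate a sine-shaped per-GPU power trace, clipped at the power cap.

    Returns a finite list of (timestamp_sec, power_usage_w, utilization_pct)
    tuples, one sample every `period_sec` seconds, mirroring the shape of the
    TimeSeriesPoint message.
    """
    points = []
    for t in range(0, duration_sec, period_sec):
        # One full sine period over the trace duration.
        raw_w = base_w + amplitude_w * math.sin(2 * math.pi * t / duration_sec)
        power_w = min(raw_w, cap_w)  # never exceed the per-GPU cap
        util_pct = 100.0 * power_w / max(cap_w, 1e-9)
        points.append((t, round(power_w, 1), round(util_pct, 1)))
    return points
```

Swapping in real model inference means keeping the same output contract: a finite series of 30-second-spaced samples that respects the requested power cap.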

Use Cases

By building a custom workload model that captures real GPU power behavior, the Workload-Aware profile enables power-aware GPU workload management — showing how power capping affects job performance and cluster power consumption before changes are applied.

Key use cases include:

  • Cluster Power Simulation - Assess the impact of power limits on job runtimes and overall cluster power consumption without disrupting production workloads
  • Workload Optimization - Explore performance-power trade-offs across caps to find optimal operating points for GPU applications
  • Power-Aware Scheduling - Inform scheduling decisions about placement and power allocation based on predicted power behavior
  • Cost Analysis - Evaluate cost-benefit of capping strategies by balancing performance requirements and energy costs
  • Research and Development - Study power management behavior across workloads and caps using synthetic but realistic traces

Prerequisites

Before deploying the Workload-Aware profile, ensure you have completed the SDK setup described in the SDK Simulator User Guide:

  • Linux (Ubuntu/Debian) or macOS (limited support)
  • Minimum 8GB RAM, 20GB free disk space
  • Internet connection for dependencies
  • SDK dependencies installed (task setup)

Deployment

Step 1: Create the Cluster

task deploy:cluster:create

Step 2: Deploy the SDK in Workload-Aware Mode

task deploy:sdk:surrogate

This command builds the example workload model, publishes it to the local cluster, and deploys the DPS SDK using the surrogate environment configuration (environments/surrogate/values.yaml), which:

  • Disables all default BMC simulator profiles (Hardware Emulation)
  • Deploys the example Python workload model as a sidecar alongside the BMC simulator
  • Configures all 144 DGX GB300 nodes for workload-aware simulation
  • Enables monitoring (Prometheus, Grafana) and the BMC simulator admin API ingress

Step 3: Import and Activate Topology

task simulator:topology

This command imports the default topology (sim/topology.json) and activates the simulator under the name dps-simulator. For details about the topology structure, refer to the Simulator Topology Guide.

To override the topology file or name:

task simulator:topology TOPOLOGY_FILE=./sim/example.json TOPOLOGY_NAME=example

Note: A topology must be active before running workload-aware simulations.

Step 4: Run a Simulation

Run a quick simulation to verify the deployment is working:

task simulator:surrogate:install-deps
task sim:surrogate:rgs END_AFTER=120 MAX_RGS=3

Open Grafana at http://grafana.dps.sdk (admin/dps) to see GPU power traces generated by the workload model. With the default example model, you will see sine-shaped power curves. See Simulation Playbooks for all available simulation scenarios.

Custom Workload Model

The SDK includes an example Python workload model in examples/workload-model/ that implements the gRPC surrogate interface. The example generates a sine-shaped power trace for demonstration. Replace the prediction logic in workload_model_server.py with your own ML model to produce realistic GPU power traces.

Deployment Options

Default deployment — the surrogate environment deploys the example model automatically:

task deploy:sdk:surrogate

Deploy with the built-in Go model (no custom image needed) — override the workload model in environments/surrogate/values.yaml:

dps-bmc-simulator:
  dpsBmcSimulator:
    surrogateMocks:
      surrogate:
        workloadModel:
          enabled: true
          command: ["/app/dps-bmc-simulator"]
          args: ["serve", "fake-workload-model", "--port", "50052"]

Deploy your own custom model — build and push your image, then set:

        workloadModel:
          enabled: true
          image: "your-registry/your-workload-model:latest"

Then redeploy:

helm upgrade --install dps-sdk chart --namespace dps \
  --values environments/surrogate/values.yaml

Development Loop

The inner development loop for building a custom workload model is:

  1. Edit workload_model_server.py — replace generate_fake_timeseries() with your model inference
  2. Build — task example:workload-model:build
  3. Publish — task example:workload-model:push
  4. Restart — kubectl rollout restart deploy/bmc-surrogate-simulator -n dps
  5. Verify — check logs and run a simulation

Run locally (without Kubernetes):

task example:workload-model:run

Regenerate gRPC stubs (after editing surrogate.proto):

task example:workload-model:generate

See examples/workload-model/README.md in the SDK repository for the full development guide, gRPC interface reference, and tips.

gRPC Interface

The workload model implements a gRPC service defined in surrogate.proto with two methods:

  • Predict — Given a power cap and hardware/job context, return a predicted GPU power time series.
  • GetInfo — Return model metadata. Called at startup for health checks.

What Predict Receives

Field              | Type             | Description
cap_w              | double           | Per-GPU power cap in watts
hardware.num_gpus  | int32            | GPUs per node
hardware.gpu_model | string           | GPU model (e.g. “GB300”)
hardware.hostname  | string           | Simulator hostname
hardware.gpu_id    | int32            | GPU index within the node
job.job_id         | string           | DPS job identifier
job.workload.*     | WorkloadMetadata | Workload info (model name, batch size)
duration_hint_sec  | double           | Optional hint for trace length

What Predict Must Return

A PredictResponse with a list of TimeSeriesPoint:

Field           | Type   | Description
timestamp_sec   | int64  | Seconds from job start (0, 30, 60, …)
power_usage_w   | double | Per-GPU power consumption in watts
utilization_pct | double | GPU utilization percentage (0-100)

The series must be finite. The surrogate plugin replays it in a loop. Sample period is 30 seconds by convention.
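The replay behavior can be sketched as modular indexing into the finite series. This is an illustration of the contract, not the surrogate plugin's actual code; the function name is hypothetical:

```python
def sample_replayed_trace(series, elapsed_sec, period_sec=30):
    """Return the series point the surrogate plugin would report at
    `elapsed_sec` seconds after job start, looping over the finite series.

    `series` is a list of (timestamp_sec, power_usage_w, utilization_pct)
    samples spaced `period_sec` seconds apart, as returned by Predict.
    """
    if not series:
        raise ValueError("Predict must return a non-empty series")
    # Integer division picks the sample slot; modulo wraps around the series.
    index = (elapsed_sec // period_sec) % len(series)
    return series[index]
```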

Simulation Playbooks

The Workload-Aware profile provides dedicated playbooks that leverage the surrogate workload model. These playbooks mirror the Hardware Emulation playbooks but produce power traces driven by the workload model instead of pseudorandom values.

Install Dependencies

Before running workload-aware playbooks, install the required Python dependencies:

task simulator:surrogate:install-deps

Single-Node Workload Simulation

Simulates a single workload execution on a specific node using the workload-aware BMC simulator.

task simulator:surrogate:workload NODE=gb300-r01-0001 WORKLOAD=nemotron_nano_12b

List available options:

task simulator:surrogate:workload -- --list-workloads
task simulator:surrogate:workload -- --list-hosts
task simulator:surrogate:workload -- --list-jobs

Resource Group Simulation

Creates and manages resource groups with realistic workloads. Nodes are randomly distributed between resource groups, and when a workload completes, nodes are recycled for new resource groups.

# Start basic simulation (10 minutes, 10 RGs max, 3 parallel)
task sim:surrogate:rgs

# Quick test simulation
task sim:surrogate:rgs END_AFTER=120 MAX_RGS=5 MAX_PARALLEL=2

# Extended simulation with more concurrent RGs
task sim:surrogate:rgs END_AFTER=1800 MAX_RGS=20 MAX_PARALLEL=5

# Custom workload and time scale
task sim:surrogate:rgs WORKLOAD=canonical_trace_100pct TIME_SCALE=100

List available options:

task sim:surrogate:rgs -- --list-nodes
task sim:surrogate:rgs -- --list-workloads

Parameters:

  • END_AFTER: End simulation after N seconds (default: 600)
  • MAX_RGS: Maximum total resource groups to create (default: 10)
  • MAX_PARALLEL: Maximum concurrent resource groups (default: 3)
  • MIN_RG_SIZE: Minimum nodes per resource group (default: 1)
  • MAX_RG_SIZE: Maximum nodes per resource group (default: 20)
  • WORKLOAD: Surrogate workload name (default: nemotron_nano_12b)
  • TIME_SCALE: Time acceleration factor (default: 50.0)
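Assuming TIME_SCALE acts as a simple divisor on wall-clock time (an illustration of the acceleration factor, not a statement about the playbook's internals), the expected replay duration can be estimated as:

```python
def wall_clock_sec(simulated_sec, time_scale=50.0):
    """Estimate the real seconds needed to replay `simulated_sec` of
    simulated time at the given acceleration factor."""
    return simulated_sec / time_scale

# Under this assumption, a 50-minute (3000 s) workload trace at the
# default TIME_SCALE=50 replays in about one minute of wall-clock time.
```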

Combined Simulation (Resource Groups and Grid)

Runs both workload-aware resource group simulation and grid simulation together. This is the workload-aware equivalent of task sim.

# Start combined simulation (default 10 minutes)
task sim:surrogate

# Run longer simulation (1 hour)
task sim:surrogate END_AFTER=3600

# Quick test with more RGs and higher grid load
task sim:surrogate END_AFTER=300 MAX_PARALLEL=5 MIN_LOAD_PERCENT=80

# Custom workload configuration
task sim:surrogate WORKLOAD=canonical_trace_100pct TIME_SCALE=100 MAX_RGS=15

Parameters (combines resource group and grid parameters):

  • END_AFTER: End simulation after N seconds (default: 600)
  • MAX_RGS: Maximum total resource groups to create (default: 10)
  • MAX_PARALLEL: Maximum concurrent resource groups (default: 3)
  • WORKLOAD: Surrogate workload name (default: nemotron_nano_12b)
  • TIME_SCALE: Time acceleration factor (default: 50.0)
  • MIN_LOAD_PERCENT: Minimum grid load percentage (default: 40)
  • MAX_LOAD_PERCENT: Maximum grid load percentage (default: 90)

Load Shedding with Resource Groups

Runs load shedding simulation alongside workload-aware resource group simulation. This is the workload-aware equivalent of task sim:loadshed-rgs, testing how load shedding affects nodes running realistic workloads.

# Run with default settings
task sim:surrogate:loadshed-rgs

# Run with faster load shedding
task sim:surrogate:loadshed-rgs SHED_INTERVAL=30 MAX_PARALLEL=5

# Run with custom parameters
task sim:surrogate:loadshed-rgs SHED_INTERVAL=45 HOLD_TIME=180 MAX_RGS=15

Parameters (combines load shedding and surrogate resource group parameters):

  • SHED_INTERVAL: Time between load shedding steps in seconds (default: 60)
  • SHED_INCREMENT: Load percentage increment for steps (default: 5)
  • HOLD_TIME: Time to hold at minimum load in seconds (default: 300)
  • MIN_LOAD_PERCENT: Minimum load percentage to shed down to (default: 5)
  • END_AFTER: End resource group simulation after N seconds (default: 2640)
  • MAX_RGS: Maximum total resource groups to create (default: 10)
  • MAX_PARALLEL: Maximum concurrent resource groups (default: 3)
  • WORKLOAD: Surrogate workload name (default: nemotron_nano_12b)

GPU Power Limit Control

The Workload-Aware profile supports direct GPU power limit manipulation through two methods.

Using the Redfish API (BMC Simulator)

Sets the total node power limit by patching the Redfish EnvironmentMetrics PowerLimitWatts property directly on the BMC simulator. Power is distributed evenly across all GPUs.

# Set total node power limit to 2800W (700W per GPU for 4 GPUs)
task sim:gpu-power-limit NODE=gb300-r01-0001 -- --power-limit 2800

Using the DPS API

Sets the total node power limit using the DPS gpu-policy command (UpdateGPUPolicies API). No resource group is required. Changes propagate through DPS to the BMC simulator. Power is distributed evenly across all GPUs.

# Set total node power limit to 2800W (700W per GPU for 4 GPUs)
task sim:gpu-policy NODE=gb300-r01-0001 -- --power-limit 2800
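Both methods distribute the node limit evenly across GPUs. The arithmetic is a straight division (a sketch of the stated behavior, not the DPS implementation):

```python
def per_gpu_cap_w(node_limit_w, num_gpus):
    """Split a total node power limit evenly across GPUs."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return node_limit_w / num_gpus

# 2800 W across a 4-GPU GB300 node gives a 700 W per-GPU cap,
# matching the comments in the commands above.
```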

Surrogate Environment Configuration

The Workload-Aware profile uses environments/surrogate/values.yaml to configure the SDK deployment. This file controls which BMC simulator profiles are active and how the surrogate workload model is configured.

Key Configuration Settings

The surrogate values file makes the following changes compared to the default deployment:

  • Disables all default BMC mock profiles (h100Mocks, b200Mocks, b300Mocks, gb200Mocks, gb300Mocks are all set to enabled: false)
  • Enables the surrogate mock profile (surrogateMocks.enabled: true)
  • Configures the BMC simulator to use the GB300 mockup directory and the surrogate metrics plugin
  • Lists all 144 nodes from the topology as surrogate-enabled nodes

Surrogate Plugin Settings

The surrogate BMC simulator is configured with:

dps-bmc-simulator:
  dpsBmcSimulator:
    surrogateMocks:
      enabled: true
      ssl: true
      debug: true
      mockupDir: /app/mockups/gb300
      ingress:
        enabled: true
        hostname: bmc-sim.dps.sdk
      surrogate:
        workloadModel:
          enabled: true
          image: "registry:5001/dps-example-workload-model:latest"
      nodes:
        - gb300-r01-0001
        - gb300-r01-0002
        # ... all 144 nodes

The surrogateMocks section enables the surrogate plugin, which connects to the workload model sidecar over gRPC. The workload model runs as a sidecar container in the same pod when configured through the Helm chart.

Customizing the Configuration

To override surrogate settings, create a custom values file and deploy with:

task deploy:custom VALUES_FILE=path/to/my-surrogate-values.yaml

Operations

Restart the BMC Simulator

If the workload-aware BMC simulator needs to be restarted (for example, after a configuration change):

task simulator:surrogate:restart-bmc

Access the BMC Simulator Admin API

When the surrogate ingress is enabled, the BMC simulator admin API is available at:

https://bmc-sim.dps.sdk/api/v1/admin/Plugins/Surrogate/Status

This endpoint returns the current plugin status, active workloads, and model information.

Monitoring

The Workload-Aware profile shares the same monitoring stack as the default profile:

  • Grafana: http://grafana.dps.sdk (admin/dps)
  • Prometheus: http://prometheus.dps.sdk

The Grafana dashboards display GPU power traces generated by the workload model. With the default example model you will see sine-shaped curves; with a custom model the traces will reflect your model’s power predictions.

Comparison: Hardware Emulation Versus Workload-Aware

  • Metrics source — Hardware Emulation: pseudorandom values based on device characteristics. Workload-Aware: power traces from a pluggable workload model (the example produces a sine-shaped demo trace; replace it with your own ML model for realistic traces).
  • Power phases — Hardware Emulation: not modeled. Workload-Aware: determined by the workload model (custom models can preserve startup, compute, idle, and cooldown phases).
  • Duration awareness — Hardware Emulation: fixed or random duration. Workload-Aware: determined by the workload model (custom models can scale duration with the power cap).
  • Deploy command — Hardware Emulation: task sdk. Workload-Aware: task deploy:sdk:surrogate.
  • Simulate command — Hardware Emulation: task sim. Workload-Aware: task sim:surrogate.
  • Use case — Hardware Emulation: general API testing and integration development. Workload-Aware: power-aware analysis, workload optimization, and custom model development.