Runbooks

Overview

Runbooks are guided, end-to-end procedures for exercising specific DPS configurations against real hardware. Whereas the Tasks section documents individual operations, each runbook walks through a complete pilot scenario, from baseline measurement through cleanup.

The Inference power pilot: NVL72 fleet management runbook covers inference on GB200/GB300 NVL72 systems, with fleet power managed automatically and independently of workload schedulers and provisioning systems. It first establishes a 3-rack baseline without DPS in the control path, then runs the same workload across 4 racks under DPS while keeping the operating envelope fixed.

The runbook defines MaxLPS as a foundational concept and follows a scheduler-independent path: DPS manages fleet-level GPU power across a fixed 4-rack topology, while existing schedulers and provisioning systems continue to own workload placement. If any step in the runbook fails, see Troubleshooting the NVL72 inference power pilot for targeted remediation.