Concepts and Components
Concepts
Power Domain (PD)
A Power Domain (PD) is a group of compute nodes that share a common power source, typically protected by a single electrical breaker or power distribution unit. Each PD is assigned a power budget, which defines its maximum allowable power draw: the total power consumption of all nodes within the PD must not exceed this budget.
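The budget constraint can be illustrated with a minimal sketch. The `PowerDomain` class, its field names, and the wattage values below are hypothetical and are not part of PRS; they only model the rule that the summed draw of a PD's nodes must stay at or below the PD's budget.

```python
from dataclasses import dataclass, field


@dataclass
class PowerDomain:
    """Hypothetical model of a PD: a name, a budget, and per-node power draw."""
    name: str
    budget_watts: float
    node_draw_watts: dict[str, float] = field(default_factory=dict)

    def total_draw(self) -> float:
        # Sum the current draw of every node in this PD.
        return sum(self.node_draw_watts.values())

    def headroom(self) -> float:
        # Remaining power before the budget is reached (negative if exceeded).
        return self.budget_watts - self.total_draw()

    def within_budget(self) -> bool:
        # The PD constraint: total draw must not exceed the budget.
        return self.total_draw() <= self.budget_watts


# Illustrative values only.
pd = PowerDomain("pd-rack-01", budget_watts=40_000.0)
pd.node_draw_watts.update(
    {"node001": 9_500.0, "node002": 10_200.0, "node003": 8_800.0}
)
print(pd.total_draw(), pd.headroom(), pd.within_budget())
```

In this sketch the three nodes draw 28,500 W against a 40,000 W budget, so the PD is within budget and has 11,500 W of headroom.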
Power Domain Network (PDN)
The Power Domain Network (PDN) is the cluster-wide configuration that maps each compute node to a PD, reflecting the physical power distribution layout of the data center. The PDN is a core concept in PRS: using it, PRS can manage power dynamically across the cluster, enforce the power budget of each PD, and minimize the impact on workload performance.
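As a rough illustration, a PDN can be thought of as a mapping from node name to PD name. The sketch below assumes that representation; the dictionary layout, the function names, and all values are hypothetical and do not reflect how PRS stores or evaluates its configuration.

```python
from collections import defaultdict

# Hypothetical PDN: each compute node is mapped to exactly one PD.
pdn = {"node001": "pd-a", "node002": "pd-a", "node003": "pd-b"}

# Hypothetical per-PD power budgets in watts.
budgets = {"pd-a": 20_000.0, "pd-b": 12_000.0}


def draw_per_domain(pdn, node_draw):
    """Aggregate per-node power draw into a total per PD."""
    totals = defaultdict(float)
    for node, domain in pdn.items():
        totals[domain] += node_draw.get(node, 0.0)
    return dict(totals)


def over_budget(pdn, budgets, node_draw):
    """Return the sorted names of PDs whose total draw exceeds their budget."""
    totals = draw_per_domain(pdn, node_draw)
    return sorted(d for d, total in totals.items() if total > budgets[d])


# Illustrative telemetry: pd-a draws 20,500 W, pd-b draws 8,000 W.
node_draw = {"node001": 11_000.0, "node002": 9_500.0, "node003": 8_000.0}
print(draw_per_domain(pdn, node_draw))
print(over_budget(pdn, budgets, node_draw))
```

Keeping the mapping per node rather than per PD mirrors the description above, where every node belongs to one PD and budgets are checked at the PD level.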
Components
PRS Daemon
The PRS daemon (nvidia-prs) is the central service responsible for managing all power allocation in the cluster. It runs on the head node (or on both head nodes in HA configurations) and executes periodic control loops that collect GPU and CPU telemetry and enforce power limits so that PDs stay within their power budgets. It also maintains power history for predictions, manages configuration, interfaces with Slurm for power-aware job scheduling, and stores metrics in BCM.
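One iteration of such a control loop can be sketched as: read telemetry, compute limits that fit the budget, apply them. The proportional scaling policy, the function names, and the callback interface below are assumptions made for illustration; the source does not describe the allocation algorithm that PRS actually uses.

```python
def compute_limits(requested_watts, budget_watts):
    """Return per-node limits whose sum does not exceed the budget.

    Assumed policy (not PRS's documented behavior): if the requests fit,
    grant them unchanged; otherwise scale every request by the same factor.
    """
    total = sum(requested_watts.values())
    if total <= budget_watts:
        return dict(requested_watts)
    scale = budget_watts / total
    return {node: watts * scale for node, watts in requested_watts.items()}


def control_step(read_telemetry, apply_limits, budget_watts):
    """Run one control iteration for a single PD."""
    requested = read_telemetry()          # hypothetical telemetry callback
    limits = compute_limits(requested, budget_watts)
    apply_limits(limits)                  # hypothetical enforcement callback
    return limits


# Illustrative stand-ins for telemetry collection and limit enforcement.
applied = {}
limits = control_step(
    read_telemetry=lambda: {"node001": 6_000.0, "node002": 4_000.0},
    apply_limits=applied.update,
    budget_watts=8_000.0,
)
print(limits)
```

A real daemon would repeat this step on a timer for every PD and would use history to predict demand; the sketch covers only the budget-enforcement part of a single iteration.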
BCM - CMDaemon
The BCM CMDaemon acts as the communication and enforcement layer between PRS and power-managed devices such as GPUs and CPUs. It collects real-time power telemetry from these devices across the cluster, distributes PRS-calculated power limits to individual devices, and reports node availability and failures to PRS. It also handles batch operations for scalability and provides the RPC interface that PRS uses for all device interactions.
BCM - Management Interfaces
Management interfaces, including cmsh (command-line) and Base View (web UI), enable administrators to configure and monitor PRS operations. They provide access to PD management, power budget configuration, node assignment, real-time status monitoring, and historical metrics. They communicate with PRS via mTLS-secured REST APIs and offer both scriptable automation and visual dashboards.