Components and Concepts#

Core Components#

WPPS is composed of several essential elements that together provide streamlined, efficient power management across large-scale compute environments:

  • Optimized Workload Power Profiles: Carefully tuned configurations that target the unique requirements of various workload classes.

  • Arbitration Engine: Manages conflicts and prioritizes competing requests for power profile adjustments, keeping profile changes consistent across the system.

  • BCM-CMDaemon Scheduler Integration: Coordinates profile application with job scheduling so that profiles take effect at the right point in a job's lifecycle.

  • DCGM APIs: Offer advanced control and comprehensive monitoring to model builders, enabling granular oversight and adjustment of GPU power and performance settings (a minimal monitoring sketch follows this list).
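
DCGM itself exposes richer fleet-level APIs; as a minimal, self-contained sketch of the kind of per-GPU telemetry involved, the snippet below instead uses the lower-level NVML Python bindings (the nvidia-ml-py package), which back many of the same readings. The choice of device index 0 and the printed fields are illustrative assumptions, not part of WPPS.

```python
# Sketch of GPU power/clock telemetry via NVML (pip install nvidia-ml-py).
# DCGM offers richer, fleet-level monitoring; this only illustrates the
# underlying per-GPU queries.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0: illustrative choice

    power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)            # current draw, mW
    limit_mw = pynvml.nvmlDeviceGetPowerManagementLimit(handle)  # active limit, mW
    sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)    # MHz
    mem_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_MEM)  # MHz

    print(f"power: {power_mw / 1000:.1f} W of {limit_mw / 1000:.1f} W limit")
    print(f"clocks: SM {sm_clock} MHz, memory {mem_clock} MHz")
finally:
    pynvml.nvmlShutdown()
```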

Scalability and Performance#

Scalability#

WPPS is purpose-built for scalability, supporting a diverse range of model complexities, memory configurations, and network bandwidth scenarios. Its architecture scales flexibly with workload demands, whether that means deploying larger models, accommodating greater memory requirements, or optimizing for network latency and throughput. As workloads grow in complexity and size, WPPS continues to deliver consistent optimization for both HPC and AI tasks.

Performance Optimization#

At its core, WPPS is engineered to maximize performance across compute, memory, and network domains while maintaining stringent power management goals. By balancing resource utilization and tuning power profiles to specific workload characteristics, WPPS delivers robust, efficient performance tailored to each data center environment. Performance gains are realized through intelligent orchestration of hardware resources, drawing on NVIDIA's extensive GPU expertise to deliver the best possible outcomes for both throughput and energy efficiency.

In summary, the Workload Power Profile Solution offers a flexible, scalable, and expertly engineered framework for administrators and model builders to achieve superior workload performance and power optimization across modern AI and HPC data centers.

Concepts#

Workload Power Profiles#

Max-P Training Profile#

The Max-P Training Profile is a finely tuned configuration in which every available power knob is set to deliver peak application performance. Separate knob settings have been derived for each class of training workload, so the profile is in practice a collection of configurations tailored to distinct training scenarios, each aimed at the highest possible throughput. NVIDIA plans to release detailed white papers, supporting documentation, and ablation studies in the coming months to further explain the methodologies and findings behind these optimized profiles.

Max-Q Training Profile#

The Max-Q Training Profile is similarly engineered with all power knobs set to deliver the highest application performance per watt, maximizing energy efficiency without a significant compromise in speed. Configurations for each category of training workload have been carefully optimized to deliver minimal performance loss compared to peak settings, while achieving substantial power savings. Each profile corresponds to a specific class of training tasks, reflecting the nuanced tuning applied. NVIDIA will soon publish comprehensive white papers, documentation, and ablation studies to share deeper insights into the development and efficacy of these profiles.

Max-P Inference Profile#

This profile is designed for inference workloads, with every knob meticulously adjusted to the optimal setting for maximum application performance. Each configuration is tailored to a specific category of inference task, forming a comprehensive suite of profiles tuned to different inference scenarios. As with the training profiles, future white papers, technical documentation, and ablation studies from NVIDIA will detail the strategies and empirical results underlying these configurations.

Max-Q Inference Profile#

The Max-Q Inference Profile provides optimally balanced settings for inference workloads, prioritizing higher application performance per watt and significant energy savings. Each knob is fine-tuned for the characteristics of distinct inference tasks, resulting in profiles that achieve near-peak performance with substantial reductions in power consumption. These profiles have been crafted for various classes of inference workloads, and NVIDIA will soon provide additional documentation and research publications to detail these advancements.

Key Power Profile Knobs#

TGP (Total Graphics Power)#

TGP is the total amount of electrical power (measured in watts) that a GPU draws under maximum load. It is a critical metric for system design, as it determines the necessary power supply capacity and influences the effectiveness of cooling solutions. TGP is the sum of the three contributions listed below. On many GPUs, from laptops to data center boards, the power limit is adjustable, allowing users or administrators to dial power consumption up or down for efficiency or performance. Unlike TDP (Thermal Design Power), which is focused on the amount of heat to be dissipated, TGP strictly measures total power usage. Understanding TGP ensures that a system has an adequate power supply and cooling for intense workloads such as gaming, rendering, scientific simulations, or deep learning tasks.

  • Core power consumption: The power used by the main GPU processor itself.

  • Memory power consumption: The energy needed for the onboard memory modules.

  • Other components: Power for supporting circuits and hardware on the GPU card.
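
As a minimal, hedged sketch, the enforced power limit and its configurable range, which together define the board's TGP budget, can be read through the NVML Python bindings; the 300 W cap in the comment is an assumed example, not a recommended value.

```python
# Sketch: reading and (optionally) lowering a GPU's board power limit
# via NVML (pip install nvidia-ml-py). Setting limits requires root.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0: illustrative

    # Min/max configurable limits and the currently enforced limit (mW).
    min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
    enforced_mw = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle)
    print(f"limit range: {min_mw / 1000:.0f}-{max_mw / 1000:.0f} W, "
          f"enforced: {enforced_mw / 1000:.0f} W")

    # Example (requires root): cap the board at 300 W, an assumed value,
    # which must lie within the constraints reported above.
    # pynvml.nvmlDeviceSetPowerManagementLimit(handle, 300_000)
finally:
    pynvml.nvmlShutdown()
```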

EDPp/TGP Ratio#

This ratio compares the GPU's peak electrical power draw (EDP-peak) to its rated sustained board power (TGP). EDP-peak, especially critical in data center GPUs, describes the maximum instantaneous power a GPU can draw during the most demanding phases of intense workloads, such as AI model training or large-scale simulations. Knowing the EDP-peak is vital for:

  • Power supply planning: Ensuring the infrastructure (including PDUs and wiring) can handle simultaneous peak loads.

  • Cooling needs: Providing sufficient cooling capacity to dissipate heat generated at peak power levels.

  • System efficiency: Balancing maximum performance and optimal energy use, especially in large clusters.

The EDP-peak is distinct from TDP and TGP; it does not primarily indicate thermal requirements but rather the actual instantaneous electrical draw during the most demanding tasks. High-performance GPUs (e.g., NVIDIA A100, H100) are engineered with robust power and cooling features to sustain such loads.
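
NVML does not expose EDP-peak directly, so the sketch below only approximates a headroom ratio: it assumes the maximum configurable power limit as a stand-in for peak electrical draw and the default power limit as TGP. Both assumptions are illustrative, not part of the WPPS definition.

```python
# Rough sketch of an EDPp/TGP-style headroom ratio. NVML does not expose
# EDP-peak, so this ASSUMES max configurable limit ~ peak draw and the
# default limit ~ TGP (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    default_mw = pynvml.nvmlDeviceGetPowerManagementDefaultLimit(handle)  # ~TGP
    _, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)

    ratio = max_mw / default_mw
    print(f"TGP (default limit): {default_mw / 1000:.0f} W")
    print(f"max configurable limit: {max_mw / 1000:.0f} W")
    print(f"headroom ratio (proxy for EDPp/TGP): {ratio:.2f}")
    # Size PDUs, wiring, and cooling for simultaneous peak draw across
    # all GPUs, not just the sum of their TGPs.
finally:
    pynvml.nvmlShutdown()
```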

XBAR:GPC Ratio#

This architectural ratio reflects the relationship between the number of crossbar units (which route data internally within the GPU) and the number of Graphics Processing Clusters (GPCs) that manage compute workloads.

  • XBAR (Crossbar): Functions as the interconnect between cores, memory, and other processing blocks.

  • GPC (Graphics Processing Cluster): The primary compute blocks within the GPU, each containing multiple Streaming Multiprocessors (SMs).

A higher XBAR:GPC ratio generally means more bandwidth and lower data-traffic congestion between GPCs and the memory subsystem, supporting better scaling and performance, especially under compute-heavy workloads. Optimizing this ratio minimizes internal bottlenecks, enabling efficiency in parallel processing tasks such as AI, HPC, and high-resolution rendering. Different GPU architectures (Ampere, Hopper, etc.) feature various XBAR:GPC ratios depending on their intended use cases—balancing throughput, efficiency, and scalability.

Fmax (Maximum Frequency)#

Fmax is the highest clock speed, typically reported in MHz or GHz, that the GPU cores can achieve under ideal conditions. It serves as the upper limit for processing speed and is dynamically adjusted based on workload, power limits, and temperature.

  • Clock Speed: The number of cycles per second the GPU can execute; higher Fmax means more operations in less time.

  • Dynamic Adjustment: Modern GPUs change their core frequency on-the-fly to balance performance, power consumption, and thermal output.

  • Turbo/Boost Clock: Many GPUs can temporarily exceed their base clock to reach Fmax during demanding tasks, as long as temperature and power limits allow.

Several factors influence how reliably a GPU can sustain Fmax, including cooling efficiency, workload intensity, and power delivery. Comparing base clock and Fmax gives users a sense of the potential performance range for the GPU.
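
As a brief sketch under the same assumptions as the earlier examples, NVML can report both the live clock and the architectural maximum, and, with root privileges, lock clocks into a fixed band; the 1410 MHz figure in the comment is an assumed example, not a recommendation.

```python
# Sketch: querying Fmax vs. the live SM clock, and optionally locking
# clocks, via NVML (pip install nvidia-ml-py). Locking requires root.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    fmax = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)  # MHz
    now = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)      # MHz
    print(f"SM clock: {now} MHz (Fmax {fmax} MHz)")

    # Example (requires root): pin the GPU clock into a fixed band to trade
    # peak speed for predictable power. 1410 MHz is an assumed value.
    # pynvml.nvmlDeviceSetGpuLockedClocks(handle, 1410, 1410)
    # pynvml.nvmlDeviceResetGpuLockedClocks(handle)  # undo the lock
finally:
    pynvml.nvmlShutdown()
```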

MCLK (Memory Clock)#

MCLK is the operational frequency (in MHz) of the GPU’s memory subsystem. It directly dictates the memory bandwidth—the rate at which data can move between GPU cores and memory modules.

  • Bandwidth Impact: Higher MCLK settings mean more data throughput, which is essential for AI, scientific computing, and large dataset tasks.

  • Programmability: Certain systems and tools allow users to adjust the MCLK to suit workload requirements, optimizing for speed or power efficiency.

  • Monitoring: nvidia-smi and similar tools can be used to track and configure memory clock speeds.

Fine-tuning MCLK and other power knobs helps optimize GPU performance and efficiency for specific use cases, whether maximizing throughput for training, reducing energy usage for inference, or balancing the two in a shared environment.
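
Closing with one more hedged sketch: the live and maximum memory clocks can be read through NVML (nvidia-smi reports the same counters), and on GPUs that support it the memory clock can be locked, which requires root. The commented calls below are illustrative only.

```python
# Sketch: reading the live and maximum memory clock via NVML
# (pip install nvidia-ml-py). nvidia-smi reports the same values.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    mclk = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_MEM)        # MHz
    mclk_max = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_MEM) # MHz
    print(f"memory clock: {mclk} MHz (max {mclk_max} MHz)")

    # On GPUs that support it (requires root), the memory clock can be
    # locked to a range; pinning to the maximum is an assumed example.
    # pynvml.nvmlDeviceSetMemoryLockedClocks(handle, mclk_max, mclk_max)
    # pynvml.nvmlDeviceResetMemoryLockedClocks(handle)  # undo the lock
finally:
    pynvml.nvmlShutdown()
```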