ASIC and Power Profiles
This section is only relevant to GB300 systems.
NVOS includes features for monitoring and managing the power usage of the platform's ASICs, which helps maintain performance and system reliability. These features report each ASIC's power usage over varying time windows and let you configure power profiles that change the system's power usage.
Power Profiles: Configurable pre-defined profiles that limit the ASIC's power usage.
ASIC Power Usage Monitoring: Monitors the power usage of each ASIC using short-term and long-term averages and histograms.
Power capping reduces the power that must be provisioned for the system by letting the switch meet a lower power budget. The budget is typically applied over a 1-second interval, which is longer than most short workload spikes. If high bandwidth is needed for extended periods, select the Networking profile, which removes the power limit.
There are three selectable profiles:
Compute: Limits switch power to a budget defined by NVIDIA. Suitable for AI workloads.
Networking: Removes power limits, so the system must be provisioned for maximum switch power. Suitable for NCCL benchmarks.
Reduced-bandwidth: Uses the lowest power, available when only half the switch links are active.
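The profile descriptions above can be read as a simple decision rule. The following Python sketch is illustrative only: the function name, its parameters, and the order in which the conditions are checked are assumptions, not part of NVOS, and the returned strings are just the profile names used in this section.

```python
def suggest_profile(active_links, total_links, sustained_high_bandwidth):
    """Suggest a power profile name (hypothetical helper, not an NVOS API).

    Assumptions made for this sketch:
    - "only half the switch links are active" is treated as "at most half".
    - The link-count condition is checked before the bandwidth condition.
    """
    if total_links <= 0 or not 0 <= active_links <= total_links:
        raise ValueError("link counts must satisfy 0 <= active <= total, total > 0")

    # Reduced-bandwidth: lowest power, usable when at most half the links are active.
    if active_links * 2 <= total_links:
        return "reduced-bandwidth"

    # Networking: no power limit, for extended periods of high bandwidth.
    if sustained_high_bandwidth:
        return "networking"

    # Compute: power limited to the predefined budget, for AI workloads.
    return "compute"


if __name__ == "__main__":
    # The link counts here are made-up example values.
    print(suggest_profile(active_links=72, total_links=72, sustained_high_bandwidth=False))
```

A real deployment would also weigh how much power has been provisioned for the system, which this sketch does not model.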
Switch power depends on network traffic. In the "compute" profile, power usage can briefly exceed the set budget, as long as the average over several hundred milliseconds stays within limits. Prolonged high power use triggers traffic limits to maintain the budget. In these cases, switch to the "networking" profile to remove limits, accepting increased power consumption. The "reduced-bandwidth" profile is best when fewer links are active and more power is needed for computation.
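The difference between a brief spike and prolonged high power use can be illustrated with a rolling average. The sketch below is a minimal model under stated assumptions, not the switch's actual enforcement mechanism: the function name, the sample values, the budget, and the window length are all hypothetical.

```python
from collections import deque


def over_budget_indices(samples_w, budget_w, window):
    """Return the sample indices at which the rolling average exceeds the budget.

    samples_w: power samples in watts, taken at a fixed interval (assumed).
    budget_w:  power budget in watts (hypothetical value chosen by the caller).
    window:    number of samples in the averaging window (assumed).
    """
    if window <= 0:
        raise ValueError("window must be positive")

    recent = deque(maxlen=window)
    total = 0.0
    over = []
    for i, power in enumerate(samples_w):
        if len(recent) == window:
            total -= recent[0]  # this sample is about to be evicted by append()
        recent.append(power)
        total += power
        # Only evaluate once a full window of samples is available.
        if len(recent) == window and total / window > budget_w:
            over.append(i)
    return over


if __name__ == "__main__":
    brief_spike = [100, 100, 130, 100, 100, 100]
    sustained = [100, 100, 130, 130, 130, 130]
    print(over_budget_indices(brief_spike, budget_w=110, window=4))
    print(over_budget_indices(sustained, budget_w=110, window=4))
```

In this model a single 130 W sample never pushes the 4-sample average above 110 W, while a sustained run at 130 W does. That is the kind of situation in which the text describes traffic limits being applied.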