Workload Power Profiles Settings

Overview

Workload Power Profiles Settings (WPPS) are NVIDIA’s pre-tuned GPU power optimization profiles. DPS provides a mechanism to enable and disable these profiles on supported GPUs through resource group settings.

WPPS profiles are designed for different performance and power optimization goals (e.g. Max-Q vs Max-P) and automatically configure GPU power settings without requiring manual tuning.

Key Concepts

Profile IDs

WPPS profiles are identified by numeric IDs ranging from 3 to 258. DPS allows you to:

  • Enable specific profile IDs for your workloads
  • Enable multiple profiles simultaneously (GPU firmware handles conflicts automatically)
  • Apply profiles to entire resource groups for consistent optimization

Important: The power profile values configured via DPS use Out-Of-Band (OOB) management (Redfish APIs on the BMC), which uses a different indexing system than the one used in DCGMI and NVSMI. Specifically, the OOB values are offset by 3:

OOB Value = DCGMI/NVSMI Value + 3

For example, if a profile is set to 1 in DCGMI or NVSMI, it should be set to 4 in OOB management.
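
As a quick illustration, the offset can be applied with simple shell arithmetic (a minimal sketch; the variable names are only for illustration):

# Convert a DCGMI/NVSMI profile index to the OOB value expected by DPS
dcgmi_id=1
oob_id=$((dcgmi_id + 3))
echo "DCGMI/NVSMI profile ${dcgmi_id} maps to OOB profile ${oob_id}"   # prints 4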

Example Usage

# Enable single profile
--workload-profile-ids 5

# Enable multiple profiles
--workload-profile-ids 3,7,12

Profile States

For reference, each GPU maintains three profile state masks that are managed automatically by WPPS:

  • Supported Profile Mask - Profiles available on the hardware (read-only)
  • Requested Profile Mask - Profiles requested by DPS
  • Enforced Profile Mask - Profiles currently active on the GPU

DPS users don’t need to manage these states directly - they are handled automatically by the GPU firmware.
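
For context, a profile mask is typically a bitmask in which each bit position corresponds to a profile ID. The snippet below is a minimal sketch of what such a membership check could look like; the mask value and the one-bit-per-profile-ID layout are assumptions for illustration, since DPS and the GPU firmware handle these masks for you.

# Illustrative only: check whether a profile ID is set in a mask value,
# assuming one bit per profile ID (DPS manages these masks automatically)
mask=0xA8        # example mask value with bits 3, 5, and 7 set
profile_id=5
if (( (mask >> profile_id) & 1 )); then
  echo "profile ${profile_id} is enabled in the mask"
fi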

DPS Integration and Operations

Resource Group Management

DPS provides simple commands to manage WPPS through resource groups:

# Create resource group with WPPS
dpsctl resource-group create \
  --resource-group "ml-training" \
  --policy "Node-High" \
  --workload-profile-ids 5,7

# Update existing resource group profiles
dpsctl resource-group update \
  --resource-group "ml-training" \
  --workload-profile-ids 3,12

# Remove all profiles
dpsctl resource-group update \
  --resource-group "training-job" \
  --remove-all-workload-profiles

# Delete an existing resource group with WPPS
dpsctl resource-group delete \
  --resource-group "ml-training"

DPS applies the profiles to all GPUs in the resource group and handles any conflicts automatically. When a resource group configured with WPPS is deleted, DPS removes the WPPS profiles it configured.

SLURM Integration

Use SLURM job comments to automatically apply WPPS:

# Submit job with WPPS
sbatch --comment="dps_policy:Node-High,dps_wpps:3,9" training_job.sh

Device Requirements

GPUs must have wppsSupport: true in their device specification to use WPPS.
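
For example, a supported GPU’s device specification will contain the flag below. The file name and retrieval step are placeholders for illustration; how the spec is obtained depends on your deployment.

# Hypothetical check against a device specification file (path is a placeholder)
grep "wppsSupport" gpu-device-spec.yaml
# Expected output on a supported GPU:
#   wppsSupport: true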

Note: WPPS is not supported on GPU architectures before Blackwell.

Best Practices

  • Choose appropriate profiles for your workload type (training vs inference)
  • Test profile combinations to find optimal performance for your specific use case
  • Use SLURM job comments for automatic profile selection
  • Monitor job performance to validate profile effectiveness

Troubleshooting

If WPPS profiles aren’t working:

  1. Check device support: Verify GPU specifications include wppsSupport: true
  2. Verify profile IDs: Ensure IDs are in the valid range (3-258); see the quick check after this list
  3. Check logs: Monitor DPS server logs for profile application errors
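
The range check from step 2 can be done ahead of time with a short shell loop (a minimal sketch; the example IDs are arbitrary):

# Flag any requested profile IDs outside the valid 3-258 range
for id in 5 7 300; do
  if (( id < 3 || id > 258 )); then
    echo "profile ID ${id} is out of range (valid: 3-258)"
  fi
done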

Further Reading