resourcegroup

API Reference: v1/resourcegroup.proto

Resource Groups manage power policies for temporary workload allocations on datacenter hardware. They provide ephemeral power management that overrides topology defaults during job execution.

A resource group is a collection of compute resources (nodes) allocated for a specific workload, such as a SLURM job or machine learning training run. Resource groups temporarily override the topology-specified power policies of their entities during workload execution, then restore topology defaults when the workload completes.

Resource groups follow a 5-step lifecycle:

  1. CREATE - Create an empty, inactive resource group with optional default power policy
  2. ADD - Add compute resources (nodes) to the group before activation
  3. ACTIVATE - Apply power policies to hardware and mark the group as active
  4. UPDATE (Optional) - Dynamically adjust power policies during workload execution
  5. DELETE - Deactivate and cleanup, restoring topology defaults
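
The lifecycle rules above can be modeled on the client side as a small state machine. The following is a minimal Python sketch for illustration only; the class ResourceGroupModel and its method names are hypothetical and are not part of the API. It encodes only rules stated in this reference: a group starts empty and inactive, resources can be added only while the group is inactive, activation requires an inactive group with resources added, and deletion deactivates the group first.

```python
class ResourceGroupModel:
    """Hypothetical in-memory model of the lifecycle rules; not the server implementation."""

    def __init__(self, name, policy=None):
        # CREATE: the group starts empty and inactive.
        self.name = name
        self.policy = policy
        self.resources = set()
        self.active = False
        self.deleted = False

    def add_resources(self, names):
        # ADD: only allowed while the group is inactive.
        if self.active:
            raise RuntimeError("resources cannot be added to an active resource group")
        self.resources.update(names)

    def activate(self):
        # ACTIVATE: the group must be inactive and must have resources added.
        if self.active:
            raise RuntimeError("resource group is already active")
        if not self.resources:
            raise RuntimeError("resource group has no resources")
        self.active = True

    def update_policy(self, policy):
        # UPDATE: allowed on active or inactive groups.
        self.policy = policy

    def delete(self):
        # DELETE: an active group is deactivated first, then removed.
        self.active = False
        self.resources.clear()
        self.deleted = True


group = ResourceGroupModel("rg_12345", policy="Node-Med")
group.add_resources(["node001", "node002"])
group.activate()
group.update_policy("Node-High")
group.delete()
```

A model like this can be useful in scheduler hooks to reject out-of-order calls before they reach the service.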

Power Policy Hierarchy

Resource groups use a 3-level policy hierarchy to determine effective power settings:

  1. Entity-level policy - Specific policy for individual nodes (highest priority)
  2. Resource group policy - Default policy for all nodes in the group
  3. Topology policy - Baseline policy from the datacenter topology (lowest priority)
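
The precedence above amounts to "first policy that is set wins". The following Python sketch illustrates that rule under stated assumptions: the function name effective_policy is hypothetical, the policy name "Node-Low" is invented for the example, and adjustments made by dynamic power management (which can cause the applied policy to differ from the requested one) are ignored.

```python
from typing import Optional


def effective_policy(entity_policy: Optional[str],
                     group_policy: Optional[str],
                     topology_policy: str) -> str:
    """Return the first policy that is set, from highest to lowest priority."""
    if entity_policy:       # 1. entity-level policy (highest priority)
        return entity_policy
    if group_policy:        # 2. resource group policy
        return group_policy
    return topology_policy  # 3. topology policy (lowest priority)


print(effective_policy("Node-High", "Node-Med", "Node-Low"))  # Node-High
print(effective_policy(None, "Node-Med", "Node-Low"))         # Node-Med
print(effective_policy(None, None, "Node-Low"))               # Node-Low
```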

Dynamic Power Management

Resource groups support dynamic power management (DPM) through:

  • Power Reservation Steering (PRS) - Automatic power redistribution based on telemetry
  • Policy Updates - Runtime policy adjustments for optimization
  • GPU Workload Profiles - Hardware-specific power optimization for GPU workloads

Integration with Workload Schedulers

Resource groups are designed to integrate with workload schedulers like SLURM:

  • Use external IDs to map to scheduler job IDs (e.g., SLURM_JOB_ID)
  • Follow scheduler lifecycle events (job start/end)
  • Support scheduler-driven power policy updates
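
As an illustration of the external ID mapping, a scheduler prolog could derive both identifiers from the job environment. This Python sketch assumes the "rg_<job id>" naming pattern used in the dpsctl examples in this reference; the helper name resource_group_identity is hypothetical.

```python
import os


def resource_group_identity(env=None):
    """Derive (group_name, external_id) from the SLURM_JOB_ID variable."""
    env = os.environ if env is None else env
    job_id = env.get("SLURM_JOB_ID")
    if job_id is None:
        raise KeyError("SLURM_JOB_ID is not set")
    # external_id is an int64 field, so the job ID must parse as an integer.
    external_id = int(job_id)
    # Naming convention assumed from the dpsctl examples: rg_$SLURM_JOB_ID.
    return f"rg_{job_id}", external_id


print(resource_group_identity({"SLURM_JOB_ID": "12345"}))  # ('rg_12345', 12345)
```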

Power policies are defined in policy.proto, and topology entities are defined in topology.proto. Workload-specific optimizations use telemetry data structures from common.proto.

Services

ResourceGroupManagementService

ResourceGroupManagementService manages ephemeral power policy allocations for workloads

This service provides APIs to create, manage, and monitor resource groups - collections of compute resources with customized power policies for specific workloads. Resource groups temporarily override topology-specified power policies during workload execution, enabling dynamic power management optimized for specific computational tasks.

The service integrates with workload schedulers like SLURM to provide power management throughout the workload lifecycle. It supports both static policy assignment and dynamic policy updates based on real-time telemetry data.

Resource groups must be created from an active topology. The topology defines the available entities and baseline power policies that resource groups can override.

ResourceGroupCreate

rpc ResourceGroupCreate(ResourceGroupCreateRequest) ResourceGroupCreateResponse

Create a new empty and inactive resource group.

This is the first step in the resource group lifecycle. The created resource group is initially empty (no resources) and inactive (no policies applied to hardware). Resources must be added and the group activated before power policies take effect.

ResourceGroupDelete

rpc ResourceGroupDelete(ResourceGroupDeleteRequest) ResourceGroupDeleteResponse

Deactivate and delete a given resource group

This permanently removes the resource group and restores all associated hardware resources to their topology-specified power policies. If the resource group is active, it is automatically deactivated before deletion. This is typically called when a workload completes.

ResourceGroupList

rpc ResourceGroupList(ResourceGroupListAllRequest) ResourceGroupListAllResponse

List all available resource groups

Returns comprehensive information about all resource groups in the system, including their activation status, assigned resources, and power policies. This is used for monitoring, administration, and troubleshooting resource group state.

ResourceGroupAddResources

rpc ResourceGroupAddResources(ResourceGroupAddResourcesRequest) ResourceGroupAddResourcesResponse

Add resources to a given resource group. The resource group must be inactive.

This assigns compute resources (nodes) to the resource group before activation. Resources can be assigned group-level or entity-specific power policies. The resource group must be inactive - resources cannot be added to active resource groups.

ResourceGroupRemoveResources

rpc ResourceGroupRemoveResources(ResourceGroupRemoveResourcesRequest) ResourceGroupRemoveResourcesResponse

Remove resources from a given resource group. The resource group must be inactive.

This removes compute resources from the resource group before activation. Used to adjust resource allocation based on workload requirements. The resource group must be inactive - resources cannot be removed from active resource groups.

ActivateResourceGroup

rpc ActivateResourceGroup(ActivateResourceGroupRequest) ActivateResourceGroupResponse

Activate a given resource group. The resource group must be inactive.

This applies the resource group’s power policies to the assigned hardware resources and marks the group as active. The activation process validates power allocation constraints and applies policies to the hardware. Once active, the resource group manages power policies for its assigned resources until deactivation.

ResourceGroupUpdate

rpc ResourceGroupUpdate(ResourceGroupUpdateRequest) ResourceGroupUpdateResponse

Update a resource group policy. The new policy affects all entities in the resource group that do not have entity-level policies set. If the resource group is active, the policy is validated and applied immediately.

This changes the default power policy for the entire resource group. Entities with entity-specific policies are not affected. If the resource group is active, the new policy is immediately validated and applied to hardware. This enables dynamic power optimization during workload execution.
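
To make the scope of a group-level update concrete, the following Python sketch lists the resources that would follow a new group policy. It is a client-side illustration only: the function name and the dictionary used to track entity-level policies are assumptions, not part of the API.

```python
def entities_using_group_policy(resource_names, entity_policies):
    """Return the resources that follow the group-level policy.

    entity_policies maps resource name -> entity-level policy name; a
    missing or empty entry means no entity-level policy is set.
    """
    return [name for name in resource_names if not entity_policies.get(name)]


resources = ["node001", "node002", "node003"]
overrides = {"node002": "Node-High"}  # node002 keeps its entity-level policy
print(entities_using_group_policy(resources, overrides))  # ['node001', 'node003']
```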

ResourceGroupUpdateResources

rpc ResourceGroupUpdateResources(ResourceGroupUpdateResourcesRequest) ResourceGroupUpdateResourcesResponse

Update policies for some of the resources in the resource group. If the resource group is active, the changes are applied immediately. This API may perform partial updates.

This updates entity-specific power policies for selected resources in the group. If the resource group is active, changes are immediately applied to hardware. The API supports partial updates - some policy changes may succeed while others fail, allowing for selective power optimization.

UpdateGPUPolicies

rpc UpdateGPUPolicies(UpdateGPUPoliciesRequest) UpdateGPUPoliciesResponse

UpdateGPUPolicies allows updating GPU policies without specifying the resource group. Because of this, a single call may end up updating multiple resource groups. If the resource groups are active, the changes are applied immediately.

This provides direct GPU power management without requiring knowledge of resource group assignments. It’s designed for telemetry-driven power optimization where external monitoring systems can adjust GPU power limits based on real-time performance data. A single call may affect multiple resource groups if GPUs span multiple groups.

AsyncOperationStatus

rpc AsyncOperationStatus(ResourceGroupAsyncOperationStatusRequest) ResourceGroupAsyncOperationStatus

Queries the server about the status of an asynchronous operation started by one of the service APIs

Resource group operations (especially activation) can be performed asynchronously for better scalability. This API allows clients to check the progress and results of these asynchronous operations. The results of an operation remain available until a status query reports that the operation completed, after which they are forgotten.

Messages

ActivateResourceGroupRequest

ActivateResourceGroupRequest is used by ResourceGroupManagementService.ActivateResourceGroup.

Validates and applies power policies to all hardware resources in the resource group. The resource group must be inactive and must have resources added. After activation, the workload can start.

Examples:

  dpsctl resource-group activate --resource-group "rg_$SLURM_JOB_ID" --sync  # Synchronous activation
  dpsctl resource-group activate --resource-group "rg_$SLURM_JOB_ID"  # Asynchronous activation
  dpsctl resource-group activate --resource-group "rg_$SLURM_JOB_ID" --partial-n-hosts 8  # Succeed if ≥8 hosts activate

Field Type Description
group_name string Name of resource group to activate
strict bool If true, and if the given resource group cannot be activated because of power limits, do not try reducing the resource group policy. If false, the server will try a lower power policy if the given one fails.
async ResourceGroupAsyncStrategy Asynchronous activation strategy. If set, activation is asynchronous. If not set, activation is synchronous.
partial_activation ResourceGroupPartialActivation Options for partial activation with failure tolerance. Allows the resource group to be deemed active even if some hosts fail, as long as at least the specified number/fraction of hosts can be activated.
allow_reprovision bool If true, a resource group that does not have sufficient power to activate will lower the power policy of other resource groups in order to activate. If false, activation fails when there is not sufficient power to activate the resource group.

ActivateResourceGroupResponse

ActivateResourceGroupResponse is returned if resource group activation was successful.

Field Type Description
node_statuses map ActivateResourceGroupResponse.NodeStatusesEntry Node statuses is a map where key is entity name and value is a struct with policy apply status and workload profile results
status Status Operation status
operation_id string The operation id for asynchronous operations, if the status is “async”

ActivateResourceGroupResponse.NodeStatusesEntry

Field Type Description
key string none
value NodeStatusResponse none

DeactivateResourceGroupRequest

DeactivateResourceGroupRequest is used to deactivate a resource group and remove applied policies from physical entities.

Removes power policies from hardware resources and returns them to topology defaults. Typically called automatically during resource group deletion. The resource group must be active.

Field Type Description
group_name string Name of resource group to deactivate
wpps_disable_async_verification bool If true, workload power profile service (WPPS) operations will use synchronous verification instead of asynchronous. Default is false (asynchronous verification enabled).

DeactivateResourceGroupResponse

Response returned from deactivating a resource group. It carries only the operation status.

Field Type Description
status Status Operation status

GPUWorkloadProfileResponse

GPUWorkloadProfileResponse describes the result of setting workload profile on one GPU

Field Type Description
gpu_id string The GPU id within the node
ok bool Whether or not the operation was successful
enforced_workload_profile_ids repeated int32 The actual workload profiles set
diag_msg string Diagnostic msg, if any

GPUWorkloadProfileResponses

GPUWorkloadProfileResponses collects the per-GPU results of setting workload profiles, together with an overall operation status

Field Type Description
workload_profile_result repeated GPUWorkloadProfileResponse Per-GPU workload profile results
status Status Operation status

NodeStatusResponse

NodeStatusResponse combines the node status and workload profile results

Field Type Description
policy_apply_status PolicyApplyStatus Struct with an activation status and an actual policy
workload_profile_results GPUWorkloadProfileResponses Result of the per-GPU workload profile operation

ResourceGroupAddResourcesRequest

ResourceGroupAddResourcesRequest is used by ResourceGroupManagementService.ResourceGroupAddResources.

Adds compute resources (nodes) to an existing inactive resource group before activation. Resource group must be inactive. Resources cannot be added to active resource groups.

Examples:

  dpsctl resource-group add --resource-group "rg_$SLURM_JOB_ID" --entities "node001,node002" --policy "Node-High"  # Add nodes with policy

Field Type Description
group_name string Name of resource group to add resource entities to
resource_names repeated string Names of resources to add to the resource group
oneof _policy_name.policy_name optional string Optional policy to assign to each added resource. If no policy is given, the resources use the resource group policy.

ResourceGroupAddResourcesResponse

ResourceGroupAddResourcesResponse is returned if adding resources to a resource group was successful.

Field Type Description
status Status Operation status

ResourceGroupAsyncOperationStatus

ResourceGroupAsyncOperationStatus returns the status of an asynchronous resource group operation. After it returns that the operation was completed, the results are forgotten.

Field Type Description
operation_id string Internal ID of the asynchronous operation
completed bool Completed flag
n_hosts uint32 Number of hosts in the request
n_success uint32 Number of requests completed successfully
n_failed uint32 Number of requests failed
n_in_progress uint32 Number of requests in progress
oneof status.activate ActivateResourceGroupResponse Activate resource group status object
oneof status.update ResourceGroupUpdateResponse Update resource group status object
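
Because the results are forgotten once a query reports completion, a client should keep the final status object it receives. The following Python sketch shows one way to poll under that constraint. It is not a client for the real service: fetch_status is a caller-supplied, hypothetical function that performs the AsyncOperationStatus call and returns a dict with the fields listed above, and the polling interval and retry limit are arbitrary.

```python
import time


def wait_for_operation(fetch_status, operation_id, poll_interval=1.0, max_polls=60):
    """Poll until the operation reports completed, then return that final status."""
    for _ in range(max_polls):
        status = fetch_status(operation_id)
        if status["completed"]:
            # Keep this object: the server forgets the results after
            # it has reported completion once.
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"operation {operation_id} not completed after {max_polls} polls")


# Stand-in for the real RPC, returning canned replies for the demonstration.
_replies = iter([
    {"completed": False, "n_hosts": 4, "n_success": 1, "n_failed": 0, "n_in_progress": 3},
    {"completed": True, "n_hosts": 4, "n_success": 4, "n_failed": 0, "n_in_progress": 0},
])
final = wait_for_operation(lambda op_id: next(_replies), "op-1", poll_interval=0)
print(final["n_success"], "of", final["n_hosts"], "hosts succeeded")
```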

ResourceGroupAsyncOperationStatusRequest

ResourceGroupAsyncOperationStatusRequest is used to query the status of an asynchronous operation.

Checks the progress and result of asynchronous resource group operations like activation or updates. Returns operation progress including success count, failure count, and completion status.

Field Type Description
oneof id.operation_id string Operation ID returned from async resource group operation
oneof id.resource_group_info ResourceGroupAsyncOperationStatusRequest.ResourceGroupStatusInfo Resource group information

ResourceGroupAsyncOperationStatusRequest.ResourceGroupStatusInfo

Resource group information

Field Type Description
resource_group_name string Resource group name
operation_type string Operation type, i.e. “activate” or “update”

ResourceGroupAsyncStrategy

ResourceGroupAsyncStrategy specifies the asynchronous resource group activation strategy

Field Type Description
oneof Options.nHosts uint32 Wait until the operation is complete for nHosts, then return. The operation continues asynchronously.
oneof Options.fracHosts double Wait until the operation is completed for the given fraction of hosts, then return. The operation continues asynchronously.
oneof Options.wait google.protobuf.Duration Wait this long before returning. The operation continues asynchronously. If 0, the operation returns immediately.
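
The three options belong to a single oneof, so at most one of them can be set in a message. The following Python sketch mirrors that constraint in a client-side helper. The class AsyncStrategy is hypothetical, and two of its checks are assumptions rather than documented behavior: that exactly one option should be chosen, and that the fraction lies between 0 and 1.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class AsyncStrategy:
    """Hypothetical client-side mirror of ResourceGroupAsyncStrategy."""
    n_hosts: Optional[int] = None
    frac_hosts: Optional[float] = None
    wait: Optional[timedelta] = None

    def __post_init__(self):
        chosen = [v for v in (self.n_hosts, self.frac_hosts, self.wait) if v is not None]
        if len(chosen) != 1:
            # Assumption: require exactly one option of the oneof.
            raise ValueError("set exactly one of n_hosts, frac_hosts, wait")
        if self.frac_hosts is not None and not 0.0 <= self.frac_hosts <= 1.0:
            # Assumption: the fraction is expressed in the range [0, 1].
            raise ValueError("frac_hosts must be between 0 and 1")


# A zero duration asks the call to return immediately.
immediate = AsyncStrategy(wait=timedelta(0))
half = AsyncStrategy(frac_hosts=0.5)
```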

ResourceGroupCreateRequest

ResourceGroupCreateRequest is used by the ResourceGroupManagementService.ResourceGroupCreate.

Creates a new, empty, and inactive resource group for managing power policies during workload execution. A resource group must be identified by a unique name, such as a SLURM job ID. When created, the resource group has no resources and is not active.

Examples:

  dpsctl resource-group create --resource-group "rg_$SLURM_JOB_ID" --external-id "$SLURM_JOB_ID" --policy "Node-Med"

Field Type Description
external_id int64 External ID (e.g. SLURM Job ID) - Unique identifier from external workload scheduler (e.g. SLURM_JOB_ID)
group_name string Unique resource group name - Human-readable identifier, must be unique (e.g. “rg_12345”, “job12345”)
oneof _policy_name.policy_name optional string Optional policy for the whole resource group. If no policy is given, entities will use the topology policy. The policy with this name must already exist in the topology.
workload_profile_ids repeated int32 Array of requested workload profile IDs associated with the resource group
oneof _prs_enabled.prs_enabled optional bool Optional flag to enable or disable PRS dynamic power management for the resource group. Default is enabled.
properties google.protobuf.Struct Properties of the resource group
dpm_enable bool Enables all dynamic power management for the resource group. A resource group with dpm_enable set to false follows strict policy management: if enough power for the selected policy is not available, activation fails, and allocated power is not dynamically adjusted at any time during the lifetime of the resource group. Default is true (dynamic power management enabled).

ResourceGroupCreateResponse

ResourceGroupCreateResponse is returned if the resource group is created successfully.

Field Type Description
status Status Operation status

ResourceGroupDeleteRequest

ResourceGroupDeleteRequest is used by ResourceGroupManagementService.ResourceGroupDelete.

Deactivates and deletes a resource group, returning all hardware resources to their topology defaults. The resource group is automatically deactivated if active before deletion.

Field Type Description
group_name string Name of resource group to delete
wpps_disable_async_verification bool If true, workload power profile service (WPPS) operations will use synchronous verification instead of asynchronous. Default is false (asynchronous verification enabled).

ResourceGroupDeleteResponse

ResourceGroupDeleteResponse is returned if the resource group deletion was successful.

Field Type Description
status Status Operation status

ResourceGroupListAllRequest

ResourceGroupListAllRequest is used to list resource groups.

Queries all resource groups in the system with optional filtering by activation status. Used for monitoring, administration, and troubleshooting resource group state. Returns comprehensive information including policies, resources, and status for each resource group.

Field Type Description
oneof _list_active_only.list_active_only optional bool Set to true to filter by active resource groups only

ResourceGroupListAllResponse

Response containing all (filtered) resource group information

Field Type Description
status Status none
resource_groups repeated ResourceGroupListAllResponse.ResourceGroupInfo List of all resource groups

ResourceGroupListAllResponse.ResourceGroupInfo

Field Type Description
group_name string Name of resource group
external_id int64 External ID of resource group
activation_status string Activation status of the resource group
oneof _policy_name.policy_name optional string Optional default policy for resource group
resource_names repeated string Names of resources in resource group
oneof _workload_profile_ids.workload_profile_ids optional WorkloadProfileIDs Resource group workload profile ids
resource_policies repeated ResourceGroupListAllResponse.ResourceGroupInfo.ResourcePolicy List of policy/entity pairs
properties google.protobuf.Struct Properties of the resource group
oneof _prs_enabled.prs_enabled optional bool Optional flag to enable or disable PRS dynamic power management for the resource group. Default is enabled.
dpm_enable bool Whether dynamic power management is enabled for the resource group. Default is true (DPM enabled).

ResourceGroupListAllResponse.ResourceGroupInfo.ResourcePolicy

Define policy for a resource

Field Type Description
resource_name string Name of the resource
policy_name string Name of the policy
oneof _applied_policy_name.applied_policy_name optional string Optional name of applied policy when DPM is enabled

ResourceGroupPartialActivation

ResourceGroupPartialActivation specifies the parameters for acceptable level of failure during host configuration

Field Type Description
oneof Options.atleast_n_hosts uint32 At least this many hosts must be activated for the resource group activation to be successful
oneof Options.atleast_frac_hosts double At least this fraction of hosts must be activated for the resource group activation to be successful
host_activation_timeout google.protobuf.Duration Host activation timeout. If a host is not accessible after this duration, host is deemed inaccessible
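
The thresholds above can be read as a simple acceptance test on the number of hosts that activated. The following Python sketch shows that reading. The function name is hypothetical, the comparison semantics (greater than or equal, with no rounding of the fractional threshold) are an assumption based on the "at least" wording, and requiring every host when no option is given is likewise assumed.

```python
def activation_succeeded(n_success, n_hosts, atleast_n_hosts=None, atleast_frac_hosts=None):
    """Decide whether the number of activated hosts meets the requested threshold."""
    if atleast_n_hosts is not None and atleast_frac_hosts is not None:
        # Both options live in the same oneof, so only one can be set.
        raise ValueError("atleast_n_hosts and atleast_frac_hosts are mutually exclusive")
    if atleast_n_hosts is not None:
        return n_success >= atleast_n_hosts
    if atleast_frac_hosts is not None:
        return n_success >= atleast_frac_hosts * n_hosts
    # Assumption: without partial-activation options, every host must activate.
    return n_success == n_hosts


print(activation_succeeded(8, 16, atleast_n_hosts=8))       # True
print(activation_succeeded(7, 16, atleast_n_hosts=8))       # False
print(activation_succeeded(8, 16, atleast_frac_hosts=0.5))  # True
```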

ResourceGroupRemoveResourcesRequest

ResourceGroupRemoveResourcesRequest is used by ResourceGroupManagementService.ResourceGroupRemoveResources.

Removes compute resources (nodes) from an existing inactive resource group before activation. Used to adjust resource allocation before the workload starts. Resource group must be inactive. Resources cannot be removed from active resource groups.

Field Type Description
group_name string Name of the resource group to remove resource entities from
resource_names repeated string Names of resource entities to remove from the resource group

ResourceGroupRemoveResourcesResponse

ResourceGroupRemoveResourcesResponse is returned if entity removal was successful.

Field Type Description
status Status Operation status

ResourceGroupUpdateRequest

ResourceGroupUpdateRequest is used by ResourceGroupManagementService.ResourceGroupUpdate to modify the resource group level policy setting.

Updates the default power policy for the entire resource group. Can be used on active or inactive resource groups. If resource group is active, changes are applied immediately to hardware.

Examples:

  dpsctl resource-group update --resource-group "rg_$SLURM_JOB_ID" --policy "Node-High" --sync  # Synchronous update

Field Type Description
group_name string Name of resource group to update
oneof _policy_name.policy_name optional string Name of policy to update. Null policy removes the policy assignment from resource group. Modifying the resource group level policy setting updates the policies of all entities that do not have entity-level policies. If the policy assignment was removed, those entities will revert back to topology-specified policies.
strict bool If the resource group is already active and if strict is true, and if the given resource group cannot be activated because of power limits, do not try reducing the resource group policy. If strict is false, the server will try a lower power policy if the given one fails.
workload_profile_ids WorkloadProfileIDs Array of new requested workload profile IDs
async ResourceGroupAsyncStrategy Asynchronous activation strategy. If set, activation is asynchronous. If not set, activation is synchronous.
partial_activation ResourceGroupPartialActivation Options for partial activation with failure tolerance
wpps_disable_async_verification bool If true, workload power profile service (WPPS) operations will use synchronous verification instead of asynchronous. Default is false (asynchronous verification enabled).

ResourceGroupUpdateResourcesRequest

ResourceGroupUpdateResourcesRequest is used by ResourceGroupManagementService.ResourceGroupUpdateResources.

Updates power policies for individual entities (nodes) within the resource group. If resource group is active, changes are applied immediately to hardware.

Field Type Description
group_name string Name of resource group to update
updates repeated ResourceGroupUpdateResourcesRequest.ResourcePolicy List of policy/resource pairs to be updated
async ResourceGroupAsyncStrategy Asynchronous activation strategy. If set, activation is asynchronous. If not set, activation is synchronous.
partial_activation ResourceGroupPartialActivation Options for partial activation with failure tolerance

ResourceGroupUpdateResourcesRequest.ResourcePolicy

Specifies the resource and the policy to apply to that resource. To remove a policy from a resource, don’t include optional policy

Field Type Description
resource_name string Name of the resource
oneof PolicyUpdate.policy_name string Set the entity-level policy for the resource. The entity-level policy overrides the resource-group level or topology-level policy. To remove entity level policy assignment, use empty policy name
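
The set-or-remove behavior of policy_name can be illustrated with a client-side bookkeeping helper. This Python sketch is an assumption-laden model, not the server logic: the function name and the dictionary of entity-level policies are hypothetical, and updates are represented as plain (resource_name, policy_name) pairs.

```python
def apply_resource_updates(entity_policies, updates):
    """Return a new mapping of entity-level policies after applying updates.

    updates is a list of (resource_name, policy_name) pairs; an empty
    policy_name removes the entity-level assignment for that resource.
    """
    result = dict(entity_policies)  # leave the caller's mapping untouched
    for resource_name, policy_name in updates:
        if policy_name:
            result[resource_name] = policy_name
        else:
            result.pop(resource_name, None)
    return result


current = {"node001": "Node-High"}
updated = apply_resource_updates(current, [("node001", ""), ("node002", "Node-Med")])
print(updated)  # {'node002': 'Node-Med'}
```

After the removal, node001 falls back to the resource group policy, or to the topology policy if the group has none.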

ResourceGroupUpdateResourcesResponse

Response returned from updating resources for a resource group

Field Type Description
node_statuses map ResourceGroupUpdateResourcesResponse.NodeStatusesEntry Node statuses is a map where key is entity name and value is a struct with policy apply status and workload profile results
status Status Operation status
operation_id string The operation id for asynchronous operations, if the status is “async”

ResourceGroupUpdateResourcesResponse.NodeStatusesEntry

Field Type Description
key string none
value NodeStatusResponse none

ResourceGroupUpdateResponse

ResourceGroupUpdateResponse is returned if resource group update was successful.

Field Type Description
node_statuses map ResourceGroupUpdateResponse.NodeStatusesEntry Node statuses is a map where key is entity name and value is a struct with policy apply status and workload profile results
status Status Operation status
operation_id string The operation id for asynchronous operations, if the status is “async”

ResourceGroupUpdateResponse.NodeStatusesEntry

Field Type Description
key string none
value NodeStatusResponse none

UpdateGPUPoliciesRequest

UpdateGPUPoliciesRequest is used by ResourceGroupManagementService.UpdateGPUPolicies to update GPU power policies without a reference to the resource group.

Updates individual GPU power limits based on real-time telemetry data from external monitoring systems. Used for dynamic power optimization during workload execution without knowing resource group details. All GPUs in a node must be specified or the update fails. Updates node-level policy to satisfy aggregate GPU requirements.

Field Type Description
node_gpu_policies map UpdateGPUPoliciesRequest.NodeGpuPoliciesEntry A map of node name -> GPU Policies
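
Since an update fails unless all GPUs in a node are specified, a client may want to check a request before sending it. The following Python sketch shows such a pre-check under loudly stated assumptions: the GPUPolicies message is defined in policy.proto and its structure is not shown in this reference, so it is modeled here as a plain mapping of GPU id to power limit, and the per-node GPU inventory is supplied by the caller. All names are hypothetical.

```python
def missing_gpus(node_gpu_policies, node_gpu_inventory):
    """Return {node: sorted missing GPU ids} for nodes whose request omits GPUs.

    node_gpu_policies:  node name -> {gpu_id: power limit} (assumed shape)
    node_gpu_inventory: node name -> iterable of all GPU ids in that node
    """
    missing = {}
    for node, gpu_limits in node_gpu_policies.items():
        expected = set(node_gpu_inventory.get(node, ()))
        absent = expected - set(gpu_limits)
        if absent:
            missing[node] = sorted(absent)
    return missing


inventory = {"node001": [0, 1, 2, 3]}
request = {"node001": {0: 300.0, 1: 300.0, 2: 250.0}}  # GPU 3 is not specified
print(missing_gpus(request, inventory))  # {'node001': [3]}
```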

UpdateGPUPoliciesRequest.NodeGpuPoliciesEntry

Field Type Description
key string none
value GPUPolicies none

UpdateGPUPoliciesResponse

UpdateGPUPoliciesResponse describes the result of each update operation

Field Type Description
results repeated UpdateGPUPoliciesResponse.Result Results of the per-GPU update operation
status Status Operation status

UpdateGPUPoliciesResponse.Result

Result describes the result of setting the power limit of one GPU

Field Type Description
resource_name string The node name containing this GPU
gpu_id uint32 The GPU id within the node
ok bool Whether or not the operation was successful
set_limit double The actual limit set
diag_msg string Diagnostic msg, if any
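
Because results are reported per GPU, a caller typically needs to separate failures from successes. The following Python sketch groups failed results by node. It treats each Result as a dict keyed by the field names in the table above; the function name is hypothetical and the sample values are invented for the example.

```python
from collections import defaultdict


def failed_gpus_by_node(results):
    """Group failed per-GPU results by node name as (gpu_id, diag_msg) pairs."""
    failures = defaultdict(list)
    for result in results:
        if not result["ok"]:
            failures[result["resource_name"]].append((result["gpu_id"], result["diag_msg"]))
    return dict(failures)


sample = [
    {"resource_name": "node001", "gpu_id": 0, "ok": True, "set_limit": 300.0, "diag_msg": ""},
    {"resource_name": "node001", "gpu_id": 1, "ok": False, "set_limit": 0.0, "diag_msg": "limit rejected"},
]
print(failed_gpus_by_node(sample))  # {'node001': [(1, 'limit rejected')]}
```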

WorkloadProfileIDs

WorkloadProfileIDs is a wrapper around an array of workload profile IDs

Field Type Description
ids repeated int32 none

Scalar Value Types

Each entry lists the .proto type, any notes, and the corresponding C++, Java, and Python types.

  • double - C++: double, Java: double, Python: float
  • float - C++: float, Java: float, Python: float
  • int32 - Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint32 instead. C++: int32, Java: int, Python: int
  • int64 - Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint64 instead. C++: int64, Java: long, Python: int/long
  • uint32 - Uses variable-length encoding. C++: uint32, Java: int, Python: int/long
  • uint64 - Uses variable-length encoding. C++: uint64, Java: long, Python: int/long
  • sint32 - Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. C++: int32, Java: int, Python: int
  • sint64 - Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. C++: int64, Java: long, Python: int/long
  • fixed32 - Always four bytes. More efficient than uint32 if values are often greater than 2^28. C++: uint32, Java: int, Python: int
  • fixed64 - Always eight bytes. More efficient than uint64 if values are often greater than 2^56. C++: uint64, Java: long, Python: int/long
  • sfixed32 - Always four bytes. C++: int32, Java: int, Python: int
  • sfixed64 - Always eight bytes. C++: int64, Java: long, Python: int/long
  • bool - C++: bool, Java: boolean, Python: boolean
  • string - A string must always contain UTF-8 encoded or 7-bit ASCII text. C++: string, Java: String, Python: str/unicode
  • bytes - May contain any arbitrary sequence of bytes. C++: string, Java: ByteString, Python: str