resourcegroup

API Reference: v1/resourcegroup.proto

Resource Groups manage power policies for temporary workload allocations on datacenter hardware. They provide ephemeral power management that overrides topology defaults during job execution.

A resource group is a collection of compute resources (nodes) allocated for a specific workload, such as a SLURM job or machine learning training run. Resource groups temporarily override the topology-specified power policies of their entities during workload execution, then restore topology defaults when the workload completes.

Resource groups follow a 5-step lifecycle:

CREATE - Create an empty, inactive resource group with optional default power policy
ADD - Add compute resources (nodes) to the group before activation
ACTIVATE - Apply power policies to hardware and mark the group as active
UPDATE (Optional) - Dynamically adjust power policies during workload execution
DELETE - Deactivate and clean up, restoring topology defaults
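
The lifecycle above can be sketched as a small in-memory state model. This is an illustrative toy, not the service implementation: the class `ResourceGroupModel` and the exception `LifecycleError` are hypothetical names, and only the ordering rules stated in this reference are encoded (resources are added while inactive, activation needs an inactive group with resources, updates are allowed in either state, deletion deactivates first).

```python
from dataclasses import dataclass, field
from typing import Optional, Set


class LifecycleError(Exception):
    """Raised when an operation is not allowed in the current state."""


@dataclass
class ResourceGroupModel:
    """Toy model of a resource group (hypothetical, for illustration only)."""
    name: str                               # CREATE: group starts empty and inactive
    policy: Optional[str] = None            # optional default power policy
    active: bool = False
    resources: Set[str] = field(default_factory=set)

    def add(self, *nodes: str) -> None:
        # ADD: resources can only be added while the group is inactive.
        if self.active:
            raise LifecycleError("cannot add resources to an active group")
        self.resources.update(nodes)

    def activate(self) -> None:
        # ACTIVATE: the group must be inactive and must have resources.
        if self.active:
            raise LifecycleError("group is already active")
        if not self.resources:
            raise LifecycleError("group has no resources")
        self.active = True

    def update(self, policy: Optional[str]) -> None:
        # UPDATE: allowed on active or inactive groups.
        self.policy = policy

    def delete(self) -> None:
        # DELETE: deactivate first, then release the resources.
        self.active = False
        self.resources.clear()
```

A typical sequence in this model is `add(...)`, `activate()`, optionally `update(...)`, then `delete()`; calling `add` after `activate` raises `LifecycleError`.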

Power Policy Hierarchy

Resource groups use a 3-level policy hierarchy to determine effective power settings:

Entity-level policy - Specific policy for individual nodes (highest priority)
Resource group policy - Default policy for all nodes in the group
Topology policy - Baseline policy from the datacenter topology (lowest priority)
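
The precedence rule can be written as a short function. This is a minimal sketch of the three-level lookup described above, not the server's code; the function name `effective_policy` is hypothetical, and a missing policy is represented as `None` or an empty string.

```python
from typing import Optional


def effective_policy(topology_policy: str,
                     group_policy: Optional[str] = None,
                     entity_policy: Optional[str] = None) -> str:
    """Return the highest-priority policy that is set for a node."""
    if entity_policy:       # entity-level policy wins when present
        return entity_policy
    if group_policy:        # otherwise fall back to the resource group default
        return group_policy
    return topology_policy  # baseline from the datacenter topology
```

For example, a node whose group policy is "Node-Med" and which has no entity-level policy resolves to "Node-Med"; removing the group policy makes it fall back to the topology policy.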

Dynamic Power Management

Resource groups support dynamic power management (DPM) through:

Power Reservation Steering (PRS) - Automatic power redistribution based on telemetry
Policy Updates - Runtime policy adjustments for optimization
GPU Workload Profiles - Hardware-specific power optimization for GPU workloads

Integration with Workload Schedulers

Resource groups are designed to integrate with workload schedulers like SLURM:

Use external IDs to map to scheduler job IDs (e.g., SLURM_JOB_ID)
Follow scheduler lifecycle events (job start/end)
Support scheduler-driven power policy updates
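
As one possible way to wire this into a scheduler's job-start hook, the sketch below builds the dpsctl command lines for the create, add, and activate steps. The subcommands and flags are taken from the examples in this reference, reading the dashes in those examples as `--`. The function name `prolog_commands`, the `rg_<job id>` naming convention, and omitting `--policy` on the add step (so that nodes inherit the group policy) are assumptions.

```python
from typing import List, Sequence


def prolog_commands(job_id: int, nodes: Sequence[str], policy: str) -> List[List[str]]:
    """Build argv lists for the create, add, and activate steps of one job."""
    group = f"rg_{job_id}"  # assumed naming convention, mirrors "rg_$SLURM_JOB_ID"
    return [
        # CREATE: empty, inactive group with a default policy
        ["dpsctl", "resource-group", "create",
         "--resource-group", group,
         "--external-id", str(job_id),
         "--policy", policy],
        # ADD: comma-separated node list; no per-node policy is given here
        ["dpsctl", "resource-group", "add",
         "--resource-group", group,
         "--entities", ",".join(nodes)],
        # ACTIVATE: synchronous, so the job starts only after policies apply
        ["dpsctl", "resource-group", "activate",
         "--resource-group", group,
         "--sync"],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from a job-start script, with the job ID read from the `SLURM_JOB_ID` environment variable.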

Power policies are defined in policy.proto, and topology entities are defined in topology.proto. Workload-specific optimizations use telemetry data structures from common.proto.

Services

ResourceGroupManagementService

ResourceGroupManagementService manages ephemeral power policy allocations for workloads

This service provides APIs to create, manage, and monitor resource groups - collections of compute resources with customized power policies for specific workloads. Resource groups temporarily override topology-specified power policies during workload execution, enabling dynamic power management optimized for specific computational tasks.

The service integrates with workload schedulers like SLURM to provide power management throughout the workload lifecycle. It supports both static policy assignment and dynamic policy updates based on real-time telemetry data.

Resource groups must be created from an active topology. The topology defines the available entities and baseline power policies that resource groups can override.

ResourceGroupCreate

rpc ResourceGroupCreate(ResourceGroupCreateRequest) ResourceGroupCreateResponse

Create a new empty and inactive resource group.

This is the first step in the resource group lifecycle. The created resource group is initially empty (no resources) and inactive (no policies applied to hardware). Resources must be added and the group activated before power policies take effect.

ResourceGroupDelete

rpc ResourceGroupDelete(ResourceGroupDeleteRequest) ResourceGroupDeleteResponse

Deactivate and delete a given resource group

This permanently removes the resource group and restores all associated hardware resources to their topology-specified power policies. If the resource group is active, it is automatically deactivated before deletion. This is typically called when a workload completes.

ResourceGroupList

rpc ResourceGroupList(ResourceGroupListAllRequest) ResourceGroupListAllResponse

List all available resource groups

Returns comprehensive information about all resource groups in the system, including their activation status, assigned resources, and power policies. This is used for monitoring, administration, and troubleshooting resource group state.

ResourceGroupAddResources

rpc ResourceGroupAddResources(ResourceGroupAddResourcesRequest) ResourceGroupAddResourcesResponse

Add resources to a given resource group. The resource group must be inactive.

This assigns compute resources (nodes) to the resource group before activation. Resources can be assigned group-level or entity-specific power policies. The resource group must be inactive - resources cannot be added to active resource groups.

ResourceGroupRemoveResources

rpc ResourceGroupRemoveResources(ResourceGroupRemoveResourcesRequest) ResourceGroupRemoveResourcesResponse

Remove resources from a given resource group. The resource group must be inactive.

This removes compute resources from the resource group before activation. Used to adjust resource allocation based on workload requirements. The resource group must be inactive - resources cannot be removed from active resource groups.

ActivateResourceGroup

rpc ActivateResourceGroup(ActivateResourceGroupRequest) ActivateResourceGroupResponse

Activate a given resource group. The resource group must be inactive.

This applies the resource group’s power policies to the assigned hardware resources and marks the group as active. The activation process validates power allocation constraints and applies policies to the hardware. Once active, the resource group manages power policies for its assigned resources until deactivation.

ResourceGroupUpdate

rpc ResourceGroupUpdate(ResourceGroupUpdateRequest) ResourceGroupUpdateResponse

Update a resource group policy. The new policy affects all entities in the resource group that do not have entity-level policies set. If the resource group is active, the policy is validated and applied immediately.

This changes the default power policy for the entire resource group. Entities with entity-specific policies are not affected. If the resource group is active, the new policy is immediately validated and applied to hardware. This enables dynamic power optimization during workload execution.

ResourceGroupUpdateResources

rpc ResourceGroupUpdateResources(ResourceGroupUpdateResourcesRequest) ResourceGroupUpdateResourcesResponse

Update policies for some of the resources in the resource group. If the resource group is active, the changes are applied immediately. This API may perform partial updates.

This updates entity-specific power policies for selected resources in the group. If the resource group is active, changes are immediately applied to hardware. The API supports partial updates - some policy changes may succeed while others fail, allowing for selective power optimization.

UpdateGPUPolicies

rpc UpdateGPUPolicies(UpdateGPUPoliciesRequest) UpdateGPUPoliciesResponse

UpdateGPUPolicies allows updating GPU policies without specifying the resource group. Because of this, a single call may end up updating multiple resource groups. If the resource groups are active, the changes are applied immediately.

This provides direct GPU power management without requiring knowledge of resource group assignments. It’s designed for telemetry-driven power optimization where external monitoring systems can adjust GPU power limits based on real-time performance data. A single call may affect multiple resource groups if GPUs span multiple groups.

AsyncOperationStatus

rpc AsyncOperationStatus(ResourceGroupAsyncOperationStatusRequest) ResourceGroupAsyncOperationStatus

Queries the server about the status of an asynchronous operation started by one of the service APIs.

Resource group operations (especially activation) can be performed asynchronously for better scalability. This API allows clients to check the progress and results of these asynchronous operations. Results remain available until a query reports the operation as completed, after which they are forgotten.

Messages

ActivateResourceGroupRequest

ActivateResourceGroupRequest is used by ResourceGroupManagementService.ActivateResourceGroup.

Validates and applies power policies to all hardware resources in the resource group. Resource group must be inactive with resources added. After activation, workload can start.

Examples:
    dpsctl resource-group activate --resource-group "rg_$SLURM_JOB_ID" --sync  # Synchronous activation
    dpsctl resource-group activate --resource-group "rg_$SLURM_JOB_ID"  # Asynchronous activation
    dpsctl resource-group activate --resource-group "rg_$SLURM_JOB_ID" --partial-n-hosts 8  # Succeed if ≥8 hosts activate

Field	Type	Description
group_name	string	Name of resource group to activate
strict	bool	If true, and if the given resource group cannot be activated because of power limits, do not try reducing the resource group policy. If false, the server will try a lower power policy if the given one fails.
async	ResourceGroupAsyncStrategy	Asynchronous activation strategy. If set, activation is asynchronous. If not set, activation is synchronous.
partial_activation	ResourceGroupPartialActivation	Options for partial activation with failure tolerance. Allows the resource group to be deemed active even if some hosts fail, as long as at least the specified number/fraction of hosts can be activated.
allow_reprovision	bool	If true, a resource group that does not have sufficient power to activate will lower the power policy of other resource groups in order to be able to activate. If false, activation fails when there is not sufficient power to activate the resource group.
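
The effect of the strict flag can be illustrated with a small sketch. This is not the server's algorithm, which this reference does not describe. It assumes a hypothetical ordered list of policies from highest to lowest power, made-up per-node wattages, and a single power budget for the group. The function name `choose_policy` and the policy name "Node-Low" are invented for the illustration.

```python
from typing import Dict, List, Optional


def choose_policy(requested: str,
                  ladder: List[str],
                  watts_per_node: Dict[str, float],
                  n_nodes: int,
                  budget_watts: float,
                  strict: bool) -> Optional[str]:
    """Pick the policy to apply, or None if activation would fail."""
    start = ladder.index(requested)
    # strict: only the requested policy is tried.
    # non-strict: the requested policy, then each lower one in order.
    candidates = ladder[start:start + 1] if strict else ladder[start:]
    for policy in candidates:
        if watts_per_node[policy] * n_nodes <= budget_watts:
            return policy
    return None


# Hypothetical policy ladder and wattages, for illustration only.
LADDER = ["Node-High", "Node-Med", "Node-Low"]
WATTS = {"Node-High": 1000.0, "Node-Med": 800.0, "Node-Low": 600.0}
```

With four nodes and a 3500 W budget, a strict request for "Node-High" fails in this model, while a non-strict request settles on "Node-Med".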

ActivateResourceGroupResponse

ActivateResourceGroupResponse is returned if resource group activation was successful.

Field	Type	Description
node_statuses	map ActivateResourceGroupResponse.NodeStatusesEntry	Node statuses is a map where key is entity name and value is a struct with policy apply status and workload profile results
status	Status	Operation status
operation_id	string	The operation id for asynchronous operations, if the status is “async”

ActivateResourceGroupResponse.NodeStatusesEntry

Field	Type	Description
key	string	none
value	NodeStatusResponse	none

DeactivateResourceGroupRequest

DeactivateResourceGroupRequest is used to deactivate a resource group and remove applied policies from physical entities.

Removes power policies from hardware resources and returns them to topology defaults. Typically called automatically during resource group deletion. Resource group must be active. After deactivation, resources return to topology defaults.

Field	Type	Description
group_name	string	Name of resource group to deactivate
wpps_disable_async_verification	bool	If true, workload power profile service (WPPS) operations will use synchronous verification instead of asynchronous. Default is false (asynchronous verification enabled).

DeactivateResourceGroupResponse

Response returned from deactivating a resource group. It carries only the operation status.

Field	Type	Description
status	Status	Operation status

GPUWorkloadProfileResponse

GPUWorkloadProfileResponse describes the result of setting workload profile on one GPU

Field	Type	Description
gpu_id	string	The GPU id within the node
ok	bool	Whether or not the operation was successful
enforced_workload_profile_ids	repeated int32	The actual workload profiles set
diag_msg	string	Diagnostic msg, if any

GPUWorkloadProfileResponses

GPUWorkloadProfileResponses collects the per-GPU workload profile results together with an overall operation status

Field	Type	Description
workload_profile_result	repeated GPUWorkloadProfileResponse	Per-GPU workload profile results
status	Status	Operation status

NodeStatusResponse

NodeStatusResponse combines the node status and workload profile results

Field	Type	Description
policy_apply_status	PolicyApplyStatus	Struct with an activation status and an actual policy
workload_profile_results	GPUWorkloadProfileResponses	Result of the per-GPU workload profile operation

ResourceGroupAddResourcesRequest

ResourceGroupAddResourcesRequest is used by ResourceGroupManagementService.ResourceGroupAddResources.

Adds compute resources (nodes) to an existing inactive resource group before activation. Resource group must be inactive. Resources cannot be added to active resource groups.

Examples:
    dpsctl resource-group add --resource-group "rg_$SLURM_JOB_ID" --entities "node001,node002" --policy "Node-High"  # Add nodes with policy

Field	Type	Description
group_name	string	Name of resource group to add resource entities to
resource_names	repeated string	Names of resources to add to the resource group
oneof _policy_name.policy_name	optional string	Optional policy to add to each resource added, no policy defaults to resource group policy

ResourceGroupAddResourcesResponse

ResourceGroupAddResourcesResponse is returned if adding resources to a resource group was successful.

Field	Type	Description
status	Status	Operation status

ResourceGroupAsyncOperationStatus

ResourceGroupAsyncOperationStatus returns the status of an asynchronous resource group operation. After it returns that the operation was completed, the results are forgotten.

Field	Type	Description
operation_id	string	Internal ID of the asynchronous operation
completed	bool	Completed flag
n_hosts	uint32	Number of hosts in the request
n_success	uint32	Number of requests completed successfully
n_failed	uint32	Number of requests failed
n_in_progress	uint32	Number of requests in progress
oneof status.activate	ActivateResourceGroupResponse	Activate resource group status object
oneof status.update	ResourceGroupUpdateResponse	Update resource group status object
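
Because results are forgotten once a query reports completion, a client should keep the final status object it receives. The sketch below polls until `completed` is true and returns that last status. The callable `fetch_status` is a stand-in for whatever client call issues AsyncOperationStatus; it, the dictionary representation of the response, and the polling parameters are assumptions.

```python
import time
from typing import Any, Callable, Dict


def wait_for_operation(fetch_status: Callable[[str], Dict[str, Any]],
                       operation_id: str,
                       poll_interval_s: float = 1.0,
                       max_polls: int = 60) -> Dict[str, Any]:
    """Poll an async operation and return the final status exactly once."""
    for _ in range(max_polls):
        status = fetch_status(operation_id)
        if status["completed"]:
            # Keep this object: a later query may no longer find the operation.
            return status
        time.sleep(poll_interval_s)
    raise TimeoutError(f"operation {operation_id} did not complete in time")
```

The returned dictionary is expected to mirror the fields in the table above, such as `n_hosts`, `n_success`, `n_failed`, and `n_in_progress`.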

ResourceGroupAsyncOperationStatusRequest

ResourceGroupAsyncOperationStatusRequest is used to query the status of an asynchronous operation.

Checks the progress and result of asynchronous resource group operations like activation or updates. Returns operation progress including success count, failure count, and completion status.

Field	Type	Description
oneof id.operation_id	string	Operation ID returned from async resource group operation
oneof id.resource_group_info	ResourceGroupAsyncOperationStatusRequest.ResourceGroupStatusInfo	Resource group information

ResourceGroupAsyncOperationStatusRequest.ResourceGroupStatusInfo

Resource group information

Field	Type	Description
resource_group_name	string	Resource group name
operation_type	string	Operation type, i.e. “activate” or “update”

ResourceGroupAsyncStrategy

ResourceGroupAsyncStrategy specifies the asynchronous resource group activation strategy

Field	Type	Description
oneof Options.nHosts	uint32	Wait until the operation is complete for nHosts, then return. The operation continues asynchronously.
oneof Options.fracHosts	double	Wait until the operation is completed for the given fraction of hosts, then return. The operation continues asynchronously.
oneof Options.wait	google.protobuf.Duration	Wait this long before returning. The operation continues asynchronously. If 0, the operation returns immediately.

ResourceGroupCreateRequest

ResourceGroupCreateRequest is used by the ResourceGroupManagementService.ResourceGroupCreate.

Creates a new, empty, and inactive resource group for managing power policies during workload execution. A resource group must be identified by a unique name, for example one derived from the SLURM job ID. When created, the resource group does not have any resources and is not active.

Examples:
    dpsctl resource-group create --resource-group "rg_$SLURM_JOB_ID" --external-id "$SLURM_JOB_ID" --policy "Node-Med"

Field	Type	Description
external_id	int64	External ID (e.g. SLURM Job ID) - Unique identifier from external workload scheduler (e.g. SLURM_JOB_ID)
group_name	string	Unique resource group name - Human-readable identifier, must be unique (e.g. “rg_12345”, “job12345”)
oneof _policy_name.policy_name	optional string	Optional policy for the whole resource group. If no policy is given, entities will use the topology policy. The policy with this name must already exist in the topology.
workload_profile_ids	repeated int32	Array of requested workload profile IDs associated with the resource group
oneof _prs_enabled.prs_enabled	optional bool	Optional flag that enables or disables PRS dynamic power management for the resource group. Default is enabled.
properties	google.protobuf.Struct	Properties of the resource group
dpm_enable	bool	Enables all dynamic power management. Resource groups with `dpm_enable` set to false follow strict policy management: if enough power for the selected policy is not available, activation fails, and allocated power is not dynamically adjusted at any time during the lifetime of the resource group. Default is true (dynamic power management enabled).

ResourceGroupCreateResponse

ResourceGroupCreateResponse is returned if the resource group is created successfully.

Field	Type	Description
status	Status	Operation status

ResourceGroupDeleteRequest

ResourceGroupDeleteRequest is used by ResourceGroupManagementService.ResourceGroupDelete.

Deactivates and deletes a resource group, returning all hardware resources to their topology defaults. The resource group is automatically deactivated if active before deletion.

Field	Type	Description
group_name	string	Name of resource group to delete
wpps_disable_async_verification	bool	If true, workload power profile service (WPPS) operations will use synchronous verification instead of asynchronous. Default is false (asynchronous verification enabled).

ResourceGroupDeleteResponse

ResourceGroupDeleteResponse is returned if the resource group deletion was successful.

Field	Type	Description
status	Status	Operation status

ResourceGroupListAllRequest

ResourceGroupListAllRequest is used to list resource groups.

Queries all resource groups in the system with optional filtering by activation status. Used for monitoring, administration, and troubleshooting resource group state. Returns comprehensive information including policies, resources, and status for each resource group.

Field	Type	Description
oneof _list_active_only.list_active_only	optional bool	Set to true to filter by active resource groups only

ResourceGroupListAllResponse

Response containing all (filtered) resource group information

Field	Type	Description
status	Status	none
resource_groups	repeated ResourceGroupListAllResponse.ResourceGroupInfo	List of all resource groups

ResourceGroupListAllResponse.ResourceGroupInfo

Field	Type	Description
group_name	string	Name of resource group
external_id	int64	External ID of resource group
activation_status	string	Activation status of the resource group
oneof _policy_name.policy_name	optional string	Optional default policy for resource group
resource_names	repeated string	Names of resources in resource group
oneof _workload_profile_ids.workload_profile_ids	optional WorkloadProfileIDs	Resource group workload profile ids
resource_policies	repeated ResourceGroupListAllResponse.ResourceGroupInfo.ResourcePolicy	List of policy/entity pairs
properties	google.protobuf.Struct	Properties of the resource group
oneof _prs_enabled.prs_enabled	optional bool	Optional flag that enables or disables PRS dynamic power management for the resource group. Default is enabled.
dpm_enable	bool	Enables dynamic power management for the resource group. Default is true (DPM enabled).

ResourceGroupListAllResponse.ResourceGroupInfo.ResourcePolicy

Define policy for a resource

Field	Type	Description
resource_name	string	Name of the resource
policy_name	string	Name of the policy
oneof _applied_policy_name.applied_policy_name	optional string	Optional name of applied policy when DPM is enabled

ResourceGroupPartialActivation

ResourceGroupPartialActivation specifies the parameters for acceptable level of failure during host configuration

Field	Type	Description
oneof Options.atleast_n_hosts	uint32	At least this many hosts must be activated for the resource group activation to be successful
oneof Options.atleast_frac_hosts	double	At least this fraction of hosts must be activated for the resource group activation to be successful
host_activation_timeout	google.protobuf.Duration	Host activation timeout. If a host is not accessible after this duration, host is deemed inaccessible
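
The two threshold options can be read as a simple success test over the activation counts. The following sketch shows that reading; it is an interpretation of the field descriptions above, not the server's code. The function name `activation_succeeds` is hypothetical, and requiring every host when neither option is set is an assumption.

```python
from typing import Optional


def activation_succeeds(n_hosts: int,
                        n_activated: int,
                        atleast_n_hosts: Optional[int] = None,
                        atleast_frac_hosts: Optional[float] = None) -> bool:
    """Decide whether a (possibly partial) activation counts as successful."""
    if atleast_n_hosts is not None and atleast_frac_hosts is not None:
        # The two thresholds are members of one oneof, so only one may be set.
        raise ValueError("set atleast_n_hosts or atleast_frac_hosts, not both")
    if atleast_n_hosts is not None:
        return n_activated >= atleast_n_hosts
    if atleast_frac_hosts is not None:
        return n_activated >= atleast_frac_hosts * n_hosts
    # Assumption: without a partial-activation option, every host must activate.
    return n_activated == n_hosts
```

For a 16-host group with `atleast_frac_hosts` set to 0.75, activation is treated as successful once 12 hosts have activated.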

ResourceGroupRemoveResourcesRequest

ResourceGroupRemoveResourcesRequest is used by ResourceGroupManagementService.ResourceGroupRemoveResources.

Removes compute resources (nodes) from an existing inactive resource group before activation. Used to adjust resource allocation before the workload starts. Resource group must be inactive. Resources cannot be removed from active resource groups.

Field	Type	Description
group_name	string	Name of the resource group to remove resource entities from
resource_names	repeated string	Names of resource entities to remove from the resource group

ResourceGroupRemoveResourcesResponse

ResourceGroupRemoveResourcesResponse is returned if entity removal was successful.

Field	Type	Description
status	Status	Operation status

ResourceGroupUpdateRequest

ResourceGroupUpdateRequest is used by ResourceGroupManagementService.ResourceGroupUpdate to modify the resource group level policy setting.

Updates the default power policy for the entire resource group. Can be used on active or inactive resource groups. If resource group is active, changes are applied immediately to hardware.

Examples:
    dpsctl resource-group update --resource-group "rg_$SLURM_JOB_ID" --policy "Node-High" --sync  # Synchronous update

Field	Type	Description
group_name	string	Name of resource group to update
oneof _policy_name.policy_name	optional string	Name of policy to update. Null policy removes the policy assignment from resource group. Modifying the resource group level policy setting updates the policies of all entities that do not have entity-level policies. If the policy assignment was removed, those entities will revert back to topology-specified policies.
strict	bool	Applies when the resource group is already active. If true and the new policy cannot be applied because of power limits, do not try reducing the resource group policy. If false, the server will try a lower power policy if the given one fails.
workload_profile_ids	WorkloadProfileIDs	Array of new requested workload profile IDs
async	ResourceGroupAsyncStrategy	Asynchronous activation strategy. If set, activation is asynchronous. If not set, activation is synchronous.
partial_activation	ResourceGroupPartialActivation	Options for partial activation with failure tolerance
wpps_disable_async_verification	bool	If true, workload power profile service (WPPS) operations will use synchronous verification instead of asynchronous. Default is false (asynchronous verification enabled).

ResourceGroupUpdateResourcesRequest

ResourceGroupUpdateResourcesRequest is used by ResourceGroupManagementService.ResourceGroupUpdateResources.

Updates power policies for individual entities (nodes) within the resource group. If resource group is active, changes are applied immediately to hardware.

Field	Type	Description
group_name	string	Name of resource group to update
updates	repeated ResourceGroupUpdateResourcesRequest.ResourcePolicy	List of policy/resource pairs to be updated
async	ResourceGroupAsyncStrategy	Asynchronous activation strategy. If set, activation is asynchronous. If not set, activation is synchronous.
partial_activation	ResourceGroupPartialActivation	Options for partial activation with failure tolerance

ResourceGroupUpdateResourcesRequest.ResourcePolicy

Specifies the resource and the policy to apply to that resource. To remove a policy from a resource, don’t include optional policy

Field	Type	Description
resource_name	string	Name of the resource
oneof PolicyUpdate.policy_name	string	Set the entity-level policy for the resource. The entity-level policy overrides the resource-group level or topology-level policy. To remove entity level policy assignment, use empty policy name

ResourceGroupUpdateResourcesResponse

Response returned from updating resources for a resource group

Field	Type	Description
node_statuses	map ResourceGroupUpdateResourcesResponse.NodeStatusesEntry	Node statuses is a map where key is entity name and value is a struct with policy apply status and workload profile results
status	Status	Operation status
operation_id	string	The operation id for asynchronous operations, if the status is “async”

ResourceGroupUpdateResourcesResponse.NodeStatusesEntry

Field	Type	Description
key	string	none
value	NodeStatusResponse	none

ResourceGroupUpdateResponse

ResourceGroupUpdateResponse is returned if resource group update was successful.

Field	Type	Description
node_statuses	map ResourceGroupUpdateResponse.NodeStatusesEntry	Node statuses is a map where key is entity name and value is a struct with policy apply status and workload profile results
status	Status	Operation status
operation_id	string	The operation id for asynchronous operations, if the status is “async”

ResourceGroupUpdateResponse.NodeStatusesEntry

Field	Type	Description
key	string	none
value	NodeStatusResponse	none

UpdateGPUPoliciesRequest

UpdateGPUPoliciesRequest is used by ResourceGroupManagementService.UpdateGPUPolicies to update GPU power policies without a reference to the resource group.

Updates individual GPU power limits based on real-time telemetry data from external monitoring systems. Used for dynamic power optimization during workload execution without knowing resource group details. All GPUs in a node must be specified or the update fails. Updates node-level policy to satisfy aggregate GPU requirements.

Field	Type	Description
node_gpu_policies	map UpdateGPUPoliciesRequest.NodeGpuPoliciesEntry	A map of node name -> GPU Policies
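
Since an update fails unless all GPUs in a node are specified, a client may want to check a request before sending it. The sketch below does that check under stated assumptions: the GPUPolicies message is not defined in this file, so the request is modeled as a plain mapping of node name to a mapping of GPU ID to power limit, and the inventory `gpus_per_node` is a hypothetical input the caller would have to supply.

```python
from typing import Dict, List, Set


def nodes_with_missing_gpus(node_gpu_policies: Dict[str, Dict[int, float]],
                            gpus_per_node: Dict[str, Set[int]]) -> Dict[str, List[int]]:
    """Return, per node in the request, the GPU IDs that were left out."""
    missing: Dict[str, List[int]] = {}
    for node, policies in node_gpu_policies.items():
        expected = gpus_per_node.get(node, set())
        absent = sorted(expected - set(policies))
        if absent:
            missing[node] = absent
    return missing
```

An empty result means every node named in the request lists all of its known GPUs; a non-empty result identifies which entries to complete before calling UpdateGPUPolicies.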

UpdateGPUPoliciesRequest.NodeGpuPoliciesEntry

Field	Type	Description
key	string	none
value	GPUPolicies	none

UpdateGPUPoliciesResponse

UpdateGPUPoliciesResponse describes the result of each update operation

Field	Type	Description
results	repeated UpdateGPUPoliciesResponse.Result	Results of the per-GPU update operation
status	Status	Operation status

UpdateGPUPoliciesResponse.Result

Result describes the result of setting the power limit of one GPU

Field	Type	Description
resource_name	string	The node name containing this GPU
gpu_id	uint32	The GPU id within the node
ok	bool	Whether or not the operation was successful
set_limit	double	The actual limit set
diag_msg	string	Diagnostic msg, if any
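
Because each GPU reports its own `ok` flag, a caller usually needs to pick out the failures. The sketch below groups failed results by node. The dataclass `GpuResult` is a hypothetical stand-in that mirrors the fields in the table above; it is not the generated protobuf class.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class GpuResult:
    """Stand-in for UpdateGPUPoliciesResponse.Result (illustrative only)."""
    resource_name: str   # node name containing the GPU
    gpu_id: int          # GPU id within the node
    ok: bool             # whether the operation succeeded
    set_limit: float     # the actual limit set
    diag_msg: str = ""   # diagnostic message, if any


def failed_by_node(results: Iterable[GpuResult]) -> Dict[str, List[GpuResult]]:
    """Group the unsuccessful per-GPU results by node name."""
    failures: Dict[str, List[GpuResult]] = {}
    for result in results:
        if not result.ok:
            failures.setdefault(result.resource_name, []).append(result)
    return failures
```

A caller can log `diag_msg` for each failed entry and retry only the affected nodes.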

WorkloadProfileIDs

WorkloadProfileIDs is a wrapper around an array of workload profile IDs

Field	Type	Description
ids	repeated int32	none

Scalar Value Types

.proto Type	Notes	C++ Type	Java Type	Python Type
double		double	double	float
float		float	float	float
int32	Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead.	int32	int	int
int64	Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead.	int64	long	int/long
uint32	Uses variable-length encoding.	uint32	int	int/long
uint64	Uses variable-length encoding.	uint64	long	int/long
sint32	Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s.	int32	int	int
sint64	Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s.	int64	long	int/long
fixed32	Always four bytes. More efficient than uint32 if values are often greater than 2^28.	uint32	int	int
fixed64	Always eight bytes. More efficient than uint64 if values are often greater than 2^56.	uint64	long	int/long
sfixed32	Always four bytes.	int32	int	int
sfixed64	Always eight bytes.	int64	long	int/long
bool		bool	boolean	bool
string	A string must always contain UTF-8 encoded or 7-bit ASCII text.	string	String	str/unicode
bytes	May contain any arbitrary sequence of bytes.	string	ByteString	str