devicecontroller

API Reference: v1/devicecontroller.proto

The device controller communicates with devices in the datacenter to control power.

Services

DeviceController

The DeviceController service handles device requests. When a request completes, the service calls back the topology service.

SendDeviceRequest

rpc SendDeviceRequest(DeviceControllerRequest) returns (google.protobuf.Empty)

Messages

DeviceControllerRequest

DeviceControllerRequest is sent to device plugins for power management. The message contains the target device ID as specified in the topology, and a request ID. Once the plugin has handled the request, it must return a response with a matching request_id.
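The request/response correlation described above can be sketched as follows. This is a minimal illustration, not the generated API: the two dataclasses are hypothetical stand-ins for the protobuf classes and model only the `device` and `request_id` fields, and `make_response` and `matches` are names invented for this example.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the generated protobuf classes; only the
# fields needed for correlation are modelled here.
@dataclass
class DeviceControllerRequest:
    device: str
    request_id: str


@dataclass
class DeviceControllerResponse:
    device: str
    request_id: str


def make_response(request: DeviceControllerRequest) -> DeviceControllerResponse:
    """Build a response that echoes the identifiers of the request."""
    return DeviceControllerResponse(device=request.device,
                                    request_id=request.request_id)


def matches(request: DeviceControllerRequest,
            response: DeviceControllerResponse) -> bool:
    """Return True when the response belongs to the given request."""
    return (request.device == response.device
            and request.request_id == response.request_id)
```

A plugin would fill in the payload fields as well; the point here is only that both identifiers are copied unchanged from the request.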

Field Type Description
device string The device ID in topology
request_id string The request ID is initialized by the topology server when it sends requests out to the devices. The device should respond with a DeviceControllerResponse carrying the same request_id.
timeout_seconds uint64 Timeout for this device request, in seconds. If not set, defaults to the controller’s DeviceCallTimeout.
device_model_json bytes Device model JSON from the topology. This is sent to the device plugin so it can parse necessary bits from the topology
device_state_json bytes Device state JSON that is returned by previous calls to this device
oneof request.resource_group_activate DeviceControllerRequest.AllocationRequest none
oneof request.resource_group_deactivate DeviceControllerRequest.AllocationRequest none
oneof request.resource_group_update DeviceControllerRequest.AllocationRequest none
oneof request.topology_activate DeviceControllerRequest.PolicySetRequest none
oneof request.topology_deactivate DeviceControllerRequest.PolicySetRequest none
oneof request.gpu_policies_update DeviceControllerRequest.GPUPoliciesSetRequest none
oneof request.enable_workload_profile DeviceControllerRequest.PowerProfileRequest none
oneof request.disable_workload_profile DeviceControllerRequest.PowerProfileRequest none
oneof request.update_workload_profile DeviceControllerRequest.PowerProfileRequest none
oneof request.ping DeviceControllerRequest.PingRequest none
oneof request.gpu_metrics DeviceControllerRequest.GPUMetricsRequest none
oneof request.raw_metrics google.protobuf.Empty none
oneof request.std_metrics google.protobuf.Empty none
oneof request.modify_entity_model DeviceControllerRequest.EntityModificationRequest none
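The defaulting rule for timeout_seconds can be sketched as below. Two things are assumptions rather than facts from this reference: that an unset proto3 `uint64` reads back as 0 (so 0 is treated as "not set"), and the 30-second value used for the controller's DeviceCallTimeout, which is a placeholder.

```python
# Placeholder for the controller's DeviceCallTimeout; the real value is
# configuration-dependent and is not given in this reference.
DEVICE_CALL_TIMEOUT_SECONDS = 30


def effective_timeout(timeout_seconds: int,
                      default: int = DEVICE_CALL_TIMEOUT_SECONDS) -> int:
    """Return the timeout, in seconds, to apply to a device request.

    Assumes an unset uint64 field reads back as 0, so 0 means "not set"
    and falls back to the controller default.
    """
    if timeout_seconds < 0:
        raise ValueError("timeout_seconds is unsigned and cannot be negative")
    return timeout_seconds if timeout_seconds > 0 else default
```

Under this reading a caller cannot request a zero-second timeout, because 0 is indistinguishable from an unset field.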

DeviceControllerRequest.AllocationRequest

AllocationRequest is sent for resource group activation, deactivation and update

Field Type Description
resource_group string Resource group name
oneof _policy.policy optional DeviceControllerRequest.PolicySetRequest The policy request. This can be null, meaning only GPU policies are being set
oneof _gpu_policies.gpu_policies optional DeviceControllerRequest.GPUPoliciesSetRequest The GPU policies request. This can be null, meaning no changes to GPU policies
oneof _workload_profiles.workload_profiles optional DeviceControllerRequest.PowerProfileRequest The workload power profiles to apply. This can be null, meaning no changes to workload power profiles.

DeviceControllerRequest.EntityModificationRequest

EntityModificationRequest is sent to device plugins as part of the two-phase entity modification cycle. When an entity is modified during operation, plugins are first notified with this message (phase=0) before the modification is committed to the in-memory topology. The plugin may then request a second call. If it does, the same message is sent again with phase=1 after the topology has been updated with the new entity.

Field Type Description
phase int32 The current phase of the operation, 0 or 1.
old_device_model bytes The device model before the modification
old_default_policy PolicyObject Old topology-level node policy
new_device_model bytes The device model after the modification
new_default_policy PolicyObject New topology-level node policy
resource_group string Resource group name, if the node is allocated for a resource group
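The two-phase cycle can be sketched from the controller's side as below. This is an illustration under stated assumptions: `notify_plugin` and `commit_to_topology` are hypothetical callbacks, and the plugin's wish for a second call is modelled as a plain boolean, because the values of EntityModificationResponse.Response are not listed in this reference.

```python
from typing import Callable, List


def run_entity_modification(notify_plugin: Callable[[int], bool],
                            commit_to_topology: Callable[[], None]) -> List[str]:
    """Drive the two-phase cycle and return the ordered steps taken.

    notify_plugin(phase) is a hypothetical callback that sends an
    EntityModificationRequest with the given phase and returns True
    when the plugin asks for a second call.
    """
    steps = []
    wants_second_call = notify_plugin(0)  # phase 0: before the commit
    steps.append("phase0")
    commit_to_topology()                  # in-memory topology is updated
    steps.append("commit")
    if wants_second_call:
        notify_plugin(1)                  # phase 1: after the commit
        steps.append("phase1")
    return steps
```

The ordering is the point of the sketch: the phase 0 call always precedes the commit, and the phase 1 call happens only on request and only after the commit.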

DeviceControllerRequest.GPUMetricsRequest

Out-of-band GPU metrics request

Field Type Description
gpu_ids repeated uint32 none

DeviceControllerRequest.GPUPoliciesSetRequest

GPUPoliciesSetRequest is sent to update GPU-specific power policies

Field Type Description
resource_group string Resource group name
gpu_policy_action DeviceControllerRequest.PolicyAction Whether the policies are being set or reset. When resetting, the gpu_policies field is ignored.
current_policy PolicyObject The current node policy for the node containing these GPUs
gpu_policies GPUPolicies The GPU policies. This can be nil
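A plugin-side reading of the set/reset rule might look like the sketch below. The enum value names `SET` and `RESET` are hypothetical (the values of PolicyAction are not listed here), and the GPUPolicies message is modelled as a plain mapping from GPU ID to a power cap in watts, which is an assumption about its shape.

```python
from enum import Enum
from typing import Dict, Optional


class PolicyAction(Enum):
    # Hypothetical value names; the real enum values are not documented here.
    SET = 0
    RESET = 1


def policies_to_apply(action: PolicyAction,
                      gpu_policies: Optional[Dict[int, float]]) -> Dict[int, float]:
    """Return the per-GPU power caps (watts) a plugin would apply.

    gpu_policies stands in for the GPUPolicies message and may be None.
    """
    if action is PolicyAction.RESET:
        return {}                    # gpu_policies is ignored when resetting
    return dict(gpu_policies or {})  # nil policies: nothing to set
```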

DeviceControllerRequest.PingRequest

Out-of-band ping request

Field Type Description
retry_count int32 Max number of times to retry pinging the target host
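One plausible reading of retry_count is "one initial attempt plus up to retry_count retries". The sketch below implements that reading, which is an assumption; `ping_once` is a hypothetical callable that returns True when the target host answers.

```python
from typing import Callable, Tuple


def ping_with_retries(ping_once: Callable[[], bool],
                      retry_count: int) -> Tuple[bool, int]:
    """Ping until success or until the retries are used up.

    Returns (success, attempts_made). Assumes retry_count counts retries
    after the first attempt; negative values are treated as 0.
    """
    attempts = 0
    for _ in range(1 + max(retry_count, 0)):
        attempts += 1
        if ping_once():
            return True, attempts
    return False, attempts
```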

DeviceControllerRequest.PolicySetRequest

PolicySetRequest is used to set node policies during topology activation/deactivation

Field Type Description
policy_action DeviceControllerRequest.PolicyAction Set or unset node policy
policy PolicyObject The policy to set

DeviceControllerRequest.PowerProfileRequest

Field Type Description
enable_profile_ids repeated int32 none
disable_profile_ids repeated int32 none
disable_async_verification bool If true, asynchronous profile verification is disabled

DeviceControllerResponse

DeviceControllerResponse is returned by the device once it finishes handling the request. It contains the device ID, the request ID, and the response payload. The response payload type matches the request payload type.

Field Type Description
device string The device ID as specified in the topology
request_id string The request_id must match the id of the request that initiated this response.
device_state_json bytes The device plugin state data, in JSON format, that should be stored and passed in subsequent requests. If nil, the existing plugin state data should not be modified.
device_ping_time google.protobuf.Timestamp The time at which an attempt was made to access the device
device_ping_error string The error message, if the attempt to access the device failed
oneof response.resource_group_activate DeviceControllerResponse.AllocationResponse none
oneof response.resource_group_deactivate DeviceControllerResponse.AllocationResponse none
oneof response.resource_group_update DeviceControllerResponse.AllocationResponse none
oneof response.topology_activate DeviceControllerResponse.AllocationResponse none
oneof response.topology_deactivate DeviceControllerResponse.AllocationResponse none
oneof response.gpu_policies_update DeviceControllerResponse.GPUPoliciesUpdateResponse none
oneof response.enable_workload_profile DeviceControllerResponse.PowerProfileResponse none
oneof response.disable_workload_profile DeviceControllerResponse.PowerProfileResponse none
oneof response.update_workload_profile DeviceControllerResponse.PowerProfileResponse none
oneof response.ping DeviceControllerResponse.PingResponse none
oneof response.gpu_metrics DeviceControllerResponse.GPUMetricsResponse none
oneof response.raw_metrics DeviceControllerResponse.RawMetricsResponse none
oneof response.std_metrics DeviceControllerResponse.StdMetricsResponse none
oneof response.modify_entity_model DeviceControllerResponse.EntityModificationResponse none
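The device_state_json rule (a nil value leaves the stored state untouched) can be sketched as a small update function. Here `None` models nil and `next_plugin_state` is a name invented for this example. An empty byte string is passed through as a new state, which is an assumption, since this reference does not say how an empty value is treated.

```python
from typing import Optional


def next_plugin_state(stored: bytes, returned: Optional[bytes]) -> bytes:
    """Return the plugin state to keep after handling a response.

    stored   -- state saved from earlier calls to this device
    returned -- device_state_json from the response, or None for nil
    """
    return stored if returned is None else returned
```

The result is what would be sent as device_state_json in the next DeviceControllerRequest to the same device.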

DeviceControllerResponse.AllocationResponse

AllocationResponse is the device response to an AllocationRequest

Field Type Description
success bool If not true, then error field provides the diagnostic message
error string If the call is not successful, the diagnostic message
policy PolicyObject The policy applied on the node. Can be nil.
power_cap_watts double The power cap
allocated_gpu_policies DeviceControllerResponse.GPUPoliciesUpdateResponse The GPU policies applied on the node. Can be nil.
workload_profile_result DeviceControllerResponse.PowerProfileResponse The result of applying workload power profiles. Can be nil.
resource_group string The resource group name associated with this allocation
allocated_node_policy DeviceControllerResponse.PolicyComponentResult The result of applying the node policy. Can be nil.
allocated_cpu_policy DeviceControllerResponse.PolicyComponentResult The result of applying the CPU policy. Can be nil.
allocated_memory_policy DeviceControllerResponse.PolicyComponentResult The result of applying the memory policy. Can be nil.

DeviceControllerResponse.EntityModificationResponse

Field Type Description
response DeviceControllerResponse.EntityModificationResponse.Response none

DeviceControllerResponse.GPUMetricsResponse

Out-of-band GPU metrics response

Field Type Description
metrics repeated DeviceControllerResponse.GPUMetricsResponse.GPUMetrics Metrics for each GPU
success bool none
error string none

DeviceControllerResponse.GPUMetricsResponse.GPUMetrics

Field Type Description
gpu_id uint32 GPU ID
usage double GPU power usage

DeviceControllerResponse.GPUPoliciesUpdateResponse

GPUPoliciesUpdateResponse is returned for GPU policies update call

Field Type Description
results repeated DeviceControllerResponse.GPUPolicyResult GPU-specific results
resource_group string The resource group name associated with this update

DeviceControllerResponse.GPUPolicyResult

GPUPolicyResult is the result of applying a power cap to a single GPU

Field Type Description
gpu_id uint32 The GPU id within the node
ok bool Whether or not the operation was successful
power_cap_watts double The actual limit set
diag_msg string Diagnostic msg, if any

DeviceControllerResponse.GPUProfileStatus

Field Type Description
gpu_id string none
success bool none
enforced_workload_profile_ids repeated int32 none
diag_msg string none

DeviceControllerResponse.PingResponse

PingResponse reports whether the ping succeeded and, if it did not, the error message.

Field Type Description
success bool none
error string none

DeviceControllerResponse.PolicyComponentResult

PolicyComponentResult represents the result of applying a policy to a specific component

Field Type Description
ok bool Whether or not the operation was successful
power_cap_watts double The actual limit set in watts
diag_msg string Diagnostic message, if any

DeviceControllerResponse.PowerProfileResponse

Field Type Description
success bool none
error string none
gpu_statuses repeated DeviceControllerResponse.GPUProfileStatus none
verification_id string Optional: verification ID for async verification tracking

DeviceControllerResponse.RawMetricsResponse

Raw metrics response

Field Type Description
raw_metrics google.protobuf.Any none
success bool none
error string none

DeviceControllerResponse.StdMetricsResponse

StdMetricsResponse

Field Type Description
metrics NodeMetricsResponse none
success bool none
error string none

Scalar Value Types

.proto Type Notes C++ Type Java Type Python Type

double
double double float

float
float float float

int32
Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. int32 int int

int64
Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. int64 long int/long

uint32
Uses variable-length encoding. uint32 int int/long

uint64
Uses variable-length encoding. uint64 long int/long

sint32
Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. int32 int int

sint64
Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. int64 long int/long

fixed32
Always four bytes. More efficient than uint32 if values are often greater than 2^28. uint32 int int

fixed64
Always eight bytes. More efficient than uint64 if values are often greater than 2^56. uint64 long int/long

sfixed32
Always four bytes. int32 int int

sfixed64
Always eight bytes. int64 long int/long

bool
bool boolean boolean

string
A string must always contain UTF-8 encoded or 7-bit ASCII text. string String str/unicode

bytes
May contain any arbitrary sequence of bytes. string ByteString str
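The notes on int32, sint32 and fixed32 above follow from how varints work: each byte carries 7 payload bits, a negative int32 is sign-extended to 64 bits before encoding, and sint32 applies ZigZag mapping first. The sketch below computes encoded sizes only; it is an illustration of the wire format, not part of this API, and the function names are invented for the example.

```python
def varint_size(value: int) -> int:
    """Number of bytes needed to varint-encode an unsigned 64-bit value."""
    if not 0 <= value < (1 << 64):
        raise ValueError("value must fit in an unsigned 64-bit integer")
    size = 1
    while value >= 0x80:  # 7 payload bits per byte
        value >>= 7
        size += 1
    return size


def int32_wire_size(value: int) -> int:
    """int32: negative values are sign-extended to 64 bits before encoding."""
    return varint_size(value & 0xFFFFFFFFFFFFFFFF)


def zigzag32(value: int) -> int:
    """sint32: map signed values to unsigned so small magnitudes stay small."""
    return ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF


def sint32_wire_size(value: int) -> int:
    """Encoded size of a sint32 value (ZigZag, then varint)."""
    return varint_size(zigzag32(value))
```

For example, -1 takes 10 bytes as an int32 but 1 byte as a sint32, and an unsigned value of 2^28 or more needs 5 varint bytes, one more than the 4 bytes a fixed32 always uses.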