devicecontroller
API Reference: v1/devicecontroller.proto
The device controller communicates with the devices in the datacenter to control power.
Table of Contents
- Services
- Messages
- DeviceControllerRequest
- DeviceControllerRequest.AllocationRequest
- DeviceControllerRequest.EntityModificationRequest
- DeviceControllerRequest.GPUMetricsRequest
- DeviceControllerRequest.GPUPoliciesSetRequest
- DeviceControllerRequest.PingRequest
- DeviceControllerRequest.PolicySetRequest
- DeviceControllerRequest.PowerProfileRequest
- DeviceControllerResponse
- DeviceControllerResponse.AllocationResponse
- DeviceControllerResponse.EntityModificationResponse
- DeviceControllerResponse.GPUMetricsResponse
- DeviceControllerResponse.GPUMetricsResponse.GPUMetrics
- DeviceControllerResponse.GPUPoliciesUpdateResponse
- DeviceControllerResponse.GPUPolicyResult
- DeviceControllerResponse.GPUProfileStatus
- DeviceControllerResponse.PingResponse
- DeviceControllerResponse.PolicyComponentResult
- DeviceControllerResponse.PowerProfileResponse
- DeviceControllerResponse.RawMetricsResponse
- DeviceControllerResponse.StdMetricsResponse
Services
DeviceController
The DeviceController service handles device requests. When a request completes, the service calls back the topology service.
SendDeviceRequest
rpc SendDeviceRequest(DeviceControllerRequest) returns (.google.protobuf.Empty)
Messages
DeviceControllerRequest
DeviceControllerRequest is sent to device plugins for power management. This message contains the target device ID as specified in the topology, and a request ID. Once the plugin has handled the request, it must return a response with a matching request_id.
| Field | Type | Description |
|---|---|---|
| device | string | The device ID in topology |
| request_id | string | Request ID, initialized by the topology server when it sends requests out to the devices. The device should respond with a DeviceControllerResponse carrying the same request_id |
| timeout_seconds | uint64 | Timeout for this device request in seconds. If not set, defaults to the controller’s DeviceCallTimeout |
| device_model_json | bytes | Device model JSON from the topology. This is sent to the device plugin so it can parse necessary bits from the topology |
| device_state_json | bytes | Device state JSON that is returned by previous calls to this device |
| oneof request.resource_group_activate | DeviceControllerRequest.AllocationRequest | none |
| oneof request.resource_group_deactivate | DeviceControllerRequest.AllocationRequest | none |
| oneof request.resource_group_update | DeviceControllerRequest.AllocationRequest | none |
| oneof request.topology_activate | DeviceControllerRequest.PolicySetRequest | none |
| oneof request.topology_deactivate | DeviceControllerRequest.PolicySetRequest | none |
| oneof request.gpu_policies_update | DeviceControllerRequest.GPUPoliciesSetRequest | none |
| oneof request.enable_workload_profile | DeviceControllerRequest.PowerProfileRequest | none |
| oneof request.disable_workload_profile | DeviceControllerRequest.PowerProfileRequest | none |
| oneof request.update_workload_profile | DeviceControllerRequest.PowerProfileRequest | none |
| oneof request.ping | DeviceControllerRequest.PingRequest | none |
| oneof request.gpu_metrics | DeviceControllerRequest.GPUMetricsRequest | none |
| oneof request.raw_metrics | google.protobuf.Empty | none |
| oneof request.std_metrics | google.protobuf.Empty | none |
| oneof request.modify_entity_model | DeviceControllerRequest.EntityModificationRequest | none |
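
The request envelope described above can be sketched as follows. This is a minimal illustration, not the generated protobuf API: plain Python dataclasses stand in for the generated classes, the `request_kind`/`request_payload` pair is a simplified model of the `request` oneof, and the 30-second default, the UUID request ID format, and the helper names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional
import uuid

# Placeholder for the controller's DeviceCallTimeout; the real value is not
# given in this reference, so 30 is an assumption.
DEFAULT_DEVICE_CALL_TIMEOUT_SECONDS = 30


@dataclass
class PingRequest:
    retry_count: int = 0


@dataclass
class DeviceControllerRequest:
    device: str
    request_id: str
    timeout_seconds: int = 0  # 0 is treated as "not set"
    device_model_json: bytes = b""
    device_state_json: bytes = b""
    # Simplified oneof: the name of the member that is set, plus its payload.
    request_kind: Optional[str] = None
    request_payload: Optional[object] = None

    def set_request(self, kind, payload):
        """Set one oneof member; a later call replaces the earlier member."""
        self.request_kind = kind
        self.request_payload = payload


def effective_timeout(req):
    """Fall back to the controller default when timeout_seconds is unset."""
    return req.timeout_seconds or DEFAULT_DEVICE_CALL_TIMEOUT_SECONDS


def new_ping_request(device, retry_count):
    """Build a ping request with a fresh request ID (hypothetical helper)."""
    req = DeviceControllerRequest(device=device, request_id=str(uuid.uuid4()))
    req.set_request("ping", PingRequest(retry_count=retry_count))
    return req
```

As with a real oneof, only one payload is carried per request, so a caller that needs both a ping and metrics would send two requests.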
DeviceControllerRequest.AllocationRequest
AllocationRequest is sent for resource group activation, deactivation and update
| Field | Type | Description |
|---|---|---|
| resource_group | string | Resource group name |
| oneof _policy.policy | optional DeviceControllerRequest.PolicySetRequest | The policy request. This can be null, meaning only GPU policies are being set |
| oneof _gpu_policies.gpu_policies | optional DeviceControllerRequest.GPUPoliciesSetRequest | The GPU policies request. This can be null, meaning no changes to GPU policies |
| oneof _workload_profiles.workload_profiles | optional DeviceControllerRequest.PowerProfileRequest | The Workload Power Profiles to apply. This can be null, meaning no changes to WPPS. |
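
Because each optional field can be unset, a plugin would typically check for presence before acting. The sketch below assumes plain Python dataclasses in place of the generated classes, uses `None` for an unset optional field, and uses dicts as stand-ins for the nested request messages; `planned_steps` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AllocationRequest:
    resource_group: str
    # None stands in for an unset optional field.
    policy: Optional[dict] = None
    gpu_policies: Optional[dict] = None
    workload_profiles: Optional[dict] = None


def planned_steps(req):
    """List which parts of the allocation carry a change, in field order."""
    steps = []
    if req.policy is not None:
        steps.append("policy")
    if req.gpu_policies is not None:
        steps.append("gpu_policies")
    if req.workload_profiles is not None:
        steps.append("workload_profiles")
    return steps
```

The check is on presence rather than truthiness, so a present but empty sub-request still counts as a change to apply.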
DeviceControllerRequest.EntityModificationRequest
EntityModificationRequest is sent to device plugins as part of the two-phase entity modification cycle. When an entity is modified during operation, plugins are notified with this message in the first phase (phase=0), before the entity modifications are committed to the in-memory topology. The plugin may request a second call. If it does, the second call is made after the topology is updated with the new entity, and the same message is sent with phase=1.
| Field | Type | Description |
|---|---|---|
| phase | int32 | The current phase of the operation, 0 or 1. |
| old_device_model | bytes | The device model before the modification |
| old_default_policy | PolicyObject | Old topology-level node policy |
| new_device_model | bytes | The device model after the modification |
| new_default_policy | PolicyObject | New topology-level node policy |
| resource_group | string | Resource group name, if the node is allocated for a resource group |
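
The two-phase cycle can be sketched from the controller's point of view. This is an illustration under assumptions: the plugin is modeled as a callable that returns `True` when it wants a second call (the actual values of EntityModificationResponse.Response are not listed in this reference), the topology is a plain dict, and `run_entity_modification` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class EntityModificationRequest:
    phase: int
    old_device_model: bytes
    new_device_model: bytes
    resource_group: str = ""


def run_entity_modification(plugin, topology, device, old_model, new_model):
    """Drive the two-phase cycle for one device; return the phases sent."""
    phases_sent = []

    # Phase 0: the plugin is notified before the topology is updated.
    wants_second_call = plugin(EntityModificationRequest(0, old_model, new_model))
    phases_sent.append(0)

    # Commit the modification to the in-memory topology.
    topology[device] = new_model

    # Phase 1: only if the plugin asked for it, and only after the commit.
    if wants_second_call:
        plugin(EntityModificationRequest(1, old_model, new_model))
        phases_sent.append(1)
    return phases_sent
```

A plugin that only needs to validate the change can decline the second call; one that must act on the committed topology asks for phase 1.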
DeviceControllerRequest.GPUMetricsRequest
Out-of-band GPU metrics request
| Field | Type | Description |
|---|---|---|
| gpu_ids | repeated uint32 | none |
DeviceControllerRequest.GPUPoliciesSetRequest
GPUPoliciesSetRequest is sent to update GPU-specific power policies
| Field | Type | Description |
|---|---|---|
| resource_group | string | Resource group name |
| gpu_policy_action | DeviceControllerRequest.PolicyAction | Whether policies are being set or reset. When resetting, the gpu_policies field is ignored |
| current_policy | PolicyObject | The current node policy for the node containing these GPUs |
| gpu_policies | GPUPolicies | The GPU policies. This can be nil |
DeviceControllerRequest.PingRequest
Out-of-band ping request
| Field | Type | Description |
|---|---|---|
| retry_count | int32 | Max number of times to retry pinging the target host |
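
A plugin might honor `retry_count` along these lines. The sketch assumes that `retry_count` counts retries after an initial attempt (the reference does not say whether the first attempt is included), that a failed ping raises `OSError`, and that `ping_once` is a caller-supplied function; none of these come from the proto definition.

```python
from dataclasses import dataclass


@dataclass
class PingResponse:
    success: bool
    error: str = ""


def ping_with_retries(ping_once, retry_count):
    """Try once, then retry up to retry_count times; keep the last error."""
    last_error = ""
    for _ in range(1 + max(retry_count, 0)):
        try:
            ping_once()
            return PingResponse(success=True)
        except OSError as exc:
            last_error = str(exc)
    return PingResponse(success=False, error=last_error)
```

The result mirrors the PingResponse message: `error` is only meaningful when `success` is false.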
DeviceControllerRequest.PolicySetRequest
PolicySetRequest is used to set node policies during topology activation/deactivation
| Field | Type | Description |
|---|---|---|
| policy_action | DeviceControllerRequest.PolicyAction | Set or unset node policy |
| policy | PolicyObject | The policy to set |
DeviceControllerRequest.PowerProfileRequest
| Field | Type | Description |
|---|---|---|
| enable_profile_ids | repeated int32 | none |
| disable_profile_ids | repeated int32 | none |
| disable_async_verification | bool | If true, asynchronous profile verification is disabled |
DeviceControllerResponse
DeviceControllerResponse is returned by the device once it finishes handling the request. It contains the device ID, the request ID, and the response payload. The response payload type matches the request payload type.
| Field | Type | Description |
|---|---|---|
| device | string | The device ID as specified in the topology |
| request_id | string | The request_id must match the id of the request that initiated this response. |
| device_state_json | bytes | The device plugin state data, in JSON format, that should be stored and passed in subsequent requests. If nil, the existing plugin state data should not be modified |
| device_ping_time | google.protobuf.Timestamp | The timestamp of the attempt to access the device |
| device_ping_error | string | The error message, if the attempt to access the device failed |
| oneof response.resource_group_activate | DeviceControllerResponse.AllocationResponse | none |
| oneof response.resource_group_deactivate | DeviceControllerResponse.AllocationResponse | none |
| oneof response.resource_group_update | DeviceControllerResponse.AllocationResponse | none |
| oneof response.topology_activate | DeviceControllerResponse.AllocationResponse | none |
| oneof response.topology_deactivate | DeviceControllerResponse.AllocationResponse | none |
| oneof response.gpu_policies_update | DeviceControllerResponse.GPUPoliciesUpdateResponse | none |
| oneof response.enable_workload_profile | DeviceControllerResponse.PowerProfileResponse | none |
| oneof response.disable_workload_profile | DeviceControllerResponse.PowerProfileResponse | none |
| oneof response.update_workload_profile | DeviceControllerResponse.PowerProfileResponse | none |
| oneof response.ping | DeviceControllerResponse.PingResponse | none |
| oneof response.gpu_metrics | DeviceControllerResponse.GPUMetricsResponse | none |
| oneof response.raw_metrics | DeviceControllerResponse.RawMetricsResponse | none |
| oneof response.std_metrics | DeviceControllerResponse.StdMetricsResponse | none |
| oneof response.modify_entity_model | DeviceControllerResponse.EntityModificationResponse | none |
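
On the receiving side, a response is matched to its request by `request_id`, and plugin state is replaced only when the response carries new state. The sketch below is one possible bookkeeping scheme, not the controller's actual implementation: `PendingRequests` is a hypothetical class, and `None` stands in for a nil `device_state_json`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceControllerResponse:
    device: str
    request_id: str
    device_state_json: Optional[bytes] = None  # None stands in for nil


class PendingRequests:
    """Tracks outstanding requests and the last stored state per device."""

    def __init__(self):
        self._pending = {}  # request_id -> device
        self.state = {}     # device -> last stored plugin state

    def track(self, request_id, device):
        self._pending[request_id] = device

    def handle(self, resp):
        """Return True if the response matched a pending request."""
        device = self._pending.get(resp.request_id)
        if device is None or device != resp.device:
            return False  # unknown, duplicate, or mismatched response
        del self._pending[resp.request_id]
        # A nil state means "leave the existing plugin state untouched".
        if resp.device_state_json is not None:
            self.state[device] = resp.device_state_json
        return True
```

Rejecting responses with an unknown `request_id` also guards against late duplicates arriving after the request has already been completed.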
DeviceControllerResponse.AllocationResponse
AllocationResponse is the device response to an AllocationRequest
| Field | Type | Description |
|---|---|---|
| success | bool | If not true, then error field provides the diagnostic message |
| error | string | If the call is not successful, the diagnostic message |
| policy | PolicyObject | The policy applied on the node. Can be nil. |
| power_cap_watts | double | The power cap |
| allocated_gpu_policies | DeviceControllerResponse.GPUPoliciesUpdateResponse | The GPU policies applied on the node. Can be nil. |
| workload_profile_result | DeviceControllerResponse.PowerProfileResponse | The result of applying workload power profiles. Can be nil. |
| resource_group | string | The resource group name associated with this allocation |
| allocated_node_policy | DeviceControllerResponse.PolicyComponentResult | The result of applying the node policy. Can be nil. |
| allocated_cpu_policy | DeviceControllerResponse.PolicyComponentResult | The result of applying the CPU policy. Can be nil. |
| allocated_memory_policy | DeviceControllerResponse.PolicyComponentResult | The result of applying the memory policy. Can be nil. |
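
Since several fields of AllocationResponse can be nil, a consumer has to skip absent component results when collecting diagnostics. A minimal sketch, assuming Python dataclasses in place of the generated classes, `None` for nil, and a hypothetical helper named `allocation_problems`:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PolicyComponentResult:
    ok: bool
    power_cap_watts: float = 0.0
    diag_msg: str = ""


@dataclass
class AllocationResponse:
    success: bool
    error: str = ""
    allocated_node_policy: Optional[PolicyComponentResult] = None
    allocated_cpu_policy: Optional[PolicyComponentResult] = None
    allocated_memory_policy: Optional[PolicyComponentResult] = None


def allocation_problems(resp):
    """Collect diagnostics from the top-level error and failed components."""
    problems = []
    if not resp.success:
        problems.append(f"allocation: {resp.error}")
    for name in ("node", "cpu", "memory"):
        result = getattr(resp, f"allocated_{name}_policy")
        # Absent (nil) component results are skipped, not treated as failures.
        if result is not None and not result.ok:
            problems.append(f"{name}: {result.diag_msg}")
    return problems
```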
DeviceControllerResponse.EntityModificationResponse
| Field | Type | Description |
|---|---|---|
| response | DeviceControllerResponse.EntityModificationResponse.Response | none |
DeviceControllerResponse.GPUMetricsResponse
Out-of-band GPU metrics response
| Field | Type | Description |
|---|---|---|
| metrics | repeated DeviceControllerResponse.GPUMetricsResponse.GPUMetrics | Metrics for each GPU |
| success | bool | none |
| error | string | none |
DeviceControllerResponse.GPUMetricsResponse.GPUMetrics
| Field | Type | Description |
|---|---|---|
| gpu_id | uint32 | GPU ID |
| usage | double | GPU power usage |
DeviceControllerResponse.GPUPoliciesUpdateResponse
GPUPoliciesUpdateResponse is returned for GPU policies update call
| Field | Type | Description |
|---|---|---|
| results | repeated DeviceControllerResponse.GPUPolicyResult | GPU-specific results |
| resource_group | string | The resource group name associated with this update |
DeviceControllerResponse.GPUPolicyResult
GPUPolicyResult is the result of applying a power cap to a single GPU
| Field | Type | Description |
|---|---|---|
| gpu_id | uint32 | The GPU id within the node |
| ok | bool | Whether or not the operation was successful |
| power_cap_watts | double | The actual limit set |
| diag_msg | string | Diagnostic message, if any |
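
The per-GPU results in a GPUPoliciesUpdateResponse can be split into applied caps and failures. The sketch assumes a Python dataclass in place of the generated GPUPolicyResult class; `split_gpu_results` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class GPUPolicyResult:
    gpu_id: int
    ok: bool
    power_cap_watts: float = 0.0
    diag_msg: str = ""


def split_gpu_results(results):
    """Return (applied caps by GPU ID, diagnostic messages by GPU ID)."""
    applied = {r.gpu_id: r.power_cap_watts for r in results if r.ok}
    failed = {r.gpu_id: r.diag_msg for r in results if not r.ok}
    return applied, failed
```

Reading `power_cap_watts` from the result rather than from the request matters because the field reports the limit actually set, which may differ from the one requested.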
DeviceControllerResponse.GPUProfileStatus
| Field | Type | Description |
|---|---|---|
| gpu_id | string | none |
| success | bool | none |
| enforced_workload_profile_ids | repeated int32 | none |
| diag_msg | string | none |
DeviceControllerResponse.PingResponse
PingResponse contains the success flag and, if success is false, the error message.
| Field | Type | Description |
|---|---|---|
| success | bool | none |
| error | string | none |
DeviceControllerResponse.PolicyComponentResult
PolicyComponentResult represents the result of applying a policy to a specific component
| Field | Type | Description |
|---|---|---|
| ok | bool | Whether or not the operation was successful |
| power_cap_watts | double | The actual limit set in watts |
| diag_msg | string | Diagnostic message, if any |
DeviceControllerResponse.PowerProfileResponse
| Field | Type | Description |
|---|---|---|
| success | bool | none |
| error | string | none |
| gpu_statuses | repeated DeviceControllerResponse.GPUProfileStatus | none |
| verification_id | string | Optional: verification ID for async verification tracking |
DeviceControllerResponse.RawMetricsResponse
Raw metrics response
| Field | Type | Description |
|---|---|---|
| raw_metrics | google.protobuf.Any | none |
| success | bool | none |
| error | string | none |
DeviceControllerResponse.StdMetricsResponse
Standard metrics response
| Field | Type | Description |
|---|---|---|
| metrics | NodeMetricsResponse | none |
| success | bool | none |
| error | string | none |