devicecontroller
API Reference: v1/devicecontroller.proto
The device controller communicates with the devices in the datacenter to control power.
Services
DeviceController
DeviceController is the service interface implemented by device plugins. The topology server
sends power management requests via SendDeviceRequest. The RPC returns Empty; device plugins
deliver DeviceControllerResponse asynchronously via a callback to the topology server.
SendDeviceRequest
rpc SendDeviceRequest(DeviceControllerRequest) returns (.google.protobuf.Empty)
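Because the RPC itself returns Empty, the caller has to correlate each later callback with the request that triggered it. The following is a minimal sketch of such a correlation table, not the topology server's actual implementation. The class name `PendingRequests`, the use of UUID strings as request ids, and the method names are all assumptions made for illustration.

```python
import threading
import uuid


class PendingRequests:
    """Tracks requests that were sent and still await an asynchronous response."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}  # request_id -> device_id

    def register(self, device_id):
        """Record a new outstanding request and return its (assumed string) request id."""
        request_id = str(uuid.uuid4())
        with self._lock:
            self._pending[request_id] = device_id
        return request_id

    def resolve(self, request_id, device_id):
        """Called from the response callback.

        Returns True only if the response matches a request that is still pending.
        """
        with self._lock:
            expected = self._pending.get(request_id)
            if expected is None or expected != device_id:
                return False
            del self._pending[request_id]
            return True
```

A response whose request id is unknown, already resolved, or paired with a different device id is rejected, which keeps late or duplicated callbacks from being counted twice.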
Messages
DeviceControllerRequest
DeviceControllerRequest is sent to device plugins for power management. This message contains the
target device id as specified in the topology, and a request id. Once the request is handled
by the plugin, a response with a matching requestId must be returned.
DeviceControllerRequest.AllocationRequest
AllocationRequest is sent for resource group activation, deactivation and update
DeviceControllerRequest.BMCHealthDeviceRequest
Per-device BMC pre-requisite health check request. Carries the per-node
probe parameters; the server-side BMCHealthCheckTask fans this out to one
request per entity. Plugins that implement device.BMCHealthSupport perform
the deep probe (firmware, power write/read, telemetry, EDPp, WPPS, IB/OOB
drift); plugins that do not implement it fall back to a ping + common
endpoint surface check (firmware inventory, ServiceRoot, WPPS endpoint
probe).
| Field | Type | Description |
| --- | --- | --- |
| samples_per_telemetry | int32 | Number of telemetry samples to collect for each component on this node. Zero means use the agent-side default. |
| skip_writes | bool | When true, the agent must skip the power-limit write/read-back/MaxP/restore cycle and only perform read-only probes. |
| telemetry_interval_ms | int32 | Sleep between telemetry samples on this node, in milliseconds. Zero means no inter-sample delay (back-to-back samples). |
| write_resolution_timeout_ms | int32 | Per-PATCH readback window for the write-cycle probe, in milliseconds. Zero uses the agent-side default (1s). |
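The zero-value rules above differ per field: zero selects a default for two fields, but is a literal value for the telemetry interval. A minimal sketch of how an agent might resolve the effective parameters follows. The dataclass is a hypothetical stand-in for the generated message, and `DEFAULT_SAMPLES_PER_TELEMETRY` is an assumed placeholder, since the source states only the 1 s write-resolution default.

```python
from dataclasses import dataclass

# Assumed placeholder: the source does not state the default sample count.
DEFAULT_SAMPLES_PER_TELEMETRY = 5
# Stated in the field description: the agent-side default is 1 s.
DEFAULT_WRITE_RESOLUTION_TIMEOUT_MS = 1000


@dataclass
class BMCHealthDeviceRequest:
    """Hypothetical stand-in for the generated protobuf message."""
    samples_per_telemetry: int = 0
    skip_writes: bool = False
    telemetry_interval_ms: int = 0
    write_resolution_timeout_ms: int = 0


def effective_params(req):
    """Resolve zero values to defaults where the field description says so."""
    return {
        "samples": req.samples_per_telemetry or DEFAULT_SAMPLES_PER_TELEMETRY,
        # Zero is meaningful here: back-to-back samples, no default substitution.
        "interval_ms": req.telemetry_interval_ms,
        "write_timeout_ms": (
            req.write_resolution_timeout_ms or DEFAULT_WRITE_RESOLUTION_TIMEOUT_MS
        ),
        "run_write_cycle": not req.skip_writes,
    }
```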
DeviceControllerRequest.EntityModificationRequest
EntityModificationRequest is sent to device plugins as part of the two-phase entity modification
cycle. When an entity is modified during operation, the plugins are notified by this message
at the first phase (phase=0), before the entity modifications are committed to the in-memory
topology. The plugin may request a second call after the first one. If the plugin requests
a second call, this is done after the topology is updated with the new entity, and the same
message is sent with phase=1.
| Field | Type | Description |
| --- | --- | --- |
| phase | int32 | The current phase of the operation, 0 or 1. |
| old_device_model | bytes | The previous device model |
| old_default_policy | PolicyObject | Old topology-level node policy |
| new_device_model | bytes | The new device model |
| new_default_policy | PolicyObject | New topology-level node policy |
| resource_group | string | Resource group name, if the node is allocated for a resource group |
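The ordering of the two phases relative to the topology commit can be sketched as follows. This is an illustration under stated assumptions: the topology is modelled as a plain dict, and `plugin_notify` is a hypothetical callable whose boolean return value stands in for the plugin asking for the second call. The source does not say how that request is actually signalled.

```python
def modify_entity(topology, entity_id, new_model, plugin_notify):
    """Apply an entity modification using the two-phase notification cycle.

    plugin_notify(phase, old_model, new_model) is a hypothetical callable that
    returns True when the plugin wants the second (phase=1) call.
    """
    old_model = topology.get(entity_id)

    # Phase 0: notify before the change is committed to the in-memory topology.
    wants_second_call = plugin_notify(0, old_model, new_model)

    # Commit the modification.
    topology[entity_id] = new_model

    # Phase 1: sent only if requested, and only after the topology is updated.
    if wants_second_call:
        plugin_notify(1, old_model, new_model)
```

During phase 0 a plugin that inspects the topology still sees the old entity; during phase 1 it sees the new one, while both calls carry the same old and new models.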
DeviceControllerRequest.GPUMetricsRequest
Out-of-band GPU metrics request
DeviceControllerRequest.GPUPoliciesSetRequest
GPUPoliciesSetRequest is sent to update GPU-specific power policies
| Field | Type | Description |
| --- | --- | --- |
| resource_group | string | Resource group name |
| gpu_policy_action | DeviceControllerRequest.PolicyAction | Whether policies are being set or reset. When resetting, the gpu_policies field is ignored. |
| current_policy | PolicyObject | The current node policy for the node containing these GPUs |
| gpu_policies | GPUPolicies | The GPU policies. May be nil. |
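A plugin handling this request has to branch on the action before touching gpu_policies. The sketch below shows one way to do that. The `PolicyAction` member names `SET` and `RESET` are hypothetical, since the enum values are not listed in this reference, and treating a nil gpu_policies on a set as "nothing to apply" is likewise an assumption.

```python
from enum import Enum


class PolicyAction(Enum):
    """Hypothetical members; the real enum values are not listed in the source."""
    SET = 0
    RESET = 1


def policies_to_apply(action, gpu_policies):
    """Return the per-GPU policies to apply, or None when resetting."""
    if action is PolicyAction.RESET:
        # On reset the gpu_policies field is ignored, whatever it contains.
        return None
    # gpu_policies may be nil; assumed here to mean there is nothing to apply.
    return gpu_policies or {}
```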
DeviceControllerRequest.PingRequest
Out-of-band ping request
| Field | Type | Description |
| --- | --- | --- |
| retry_count | int32 | Max number of times to retry pinging the target host |
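A retry loop driven by retry_count might look like the sketch below. It assumes that retry_count counts attempts after the first one, so the host is probed at most 1 + retry_count times; the source does not state this explicitly. `probe` is a hypothetical callable that raises `OSError` when the host is unreachable, and the returned pair mirrors the success and error fields of PingResponse.

```python
def ping_with_retries(probe, retry_count):
    """Call probe() once, then up to retry_count more times.

    Returns (success, error), where error is the last failure message.
    """
    last_error = ""
    for _ in range(1 + max(retry_count, 0)):
        try:
            probe()
            return True, ""
        except OSError as exc:
            last_error = str(exc)
    return False, last_error
```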
DeviceControllerRequest.PolicySetRequest
PolicySetRequest is used to set node policies during topology activation/deactivation
and resource group operations
DeviceControllerRequest.PowerProfileRequest
| Field | Type | Description |
| --- | --- | --- |
| enable_profile_ids | repeated int32 | none |
| disable_profile_ids | repeated int32 | none |
| disable_async_verification | bool | If true, asynchronous profile verification is disabled. |
DeviceControllerRequest.RawMetricsRequest
| Field | Type | Description |
| --- | --- | --- |
| effective | bool | Read power-limit fields from BMC EnvironmentMetrics. |
DeviceControllerRequest.StdMetricsRequest
| Field | Type | Description |
| --- | --- | --- |
| effective | bool | Read power-limit fields from BMC EnvironmentMetrics. |
DeviceControllerResponse
DeviceControllerResponse is returned by the device once it finishes handling the request.
It contains the device ID, the request ID, and the response payload. The response payload
type matches the request payload type.
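The request/response pairing can be checked with a simple lookup table. The pairs below are inferred from the message names in this reference and are not stated by the source. PolicySetRequest is left out because no response message with an obviously matching name is listed here.

```python
# Pairing inferred from message names; treat it as an assumption.
RESPONSE_FOR_REQUEST = {
    "AllocationRequest": "AllocationResponse",
    "BMCHealthDeviceRequest": "BMCHealthDeviceResponse",
    "EntityModificationRequest": "EntityModificationResponse",
    "GPUMetricsRequest": "GPUMetricsResponse",
    "GPUPoliciesSetRequest": "GPUPoliciesUpdateResponse",
    "PingRequest": "PingResponse",
    "PowerProfileRequest": "PowerProfileResponse",
    "RawMetricsRequest": "RawMetricsResponse",
    "StdMetricsRequest": "StdMetricsResponse",
}


def response_matches(request_type, response_type):
    """True if response_type is the payload expected for request_type."""
    expected = RESPONSE_FOR_REQUEST.get(request_type)
    return expected is not None and expected == response_type
```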
DeviceControllerResponse.AllocationResponse
AllocationResponse is the device response to an AllocationRequest
DeviceControllerResponse.BMCHealthDeviceResponse
Per-device BMC pre-requisite health check response. Carries the per-node
NodeReport plus any node-scoped Issues raised during the probe. Cluster-
level aggregation (ClusterSummary, threshold-based Issues that span more
than one node) is the BMCHealthCheckTask’s job on the server side.
| Field | Type | Description |
| --- | --- | --- |
| success | bool | True if the agent was able to reach the BMC and produce at least a partial NodeReport; false on hard reachability or auth failure. |
| error | string | Diagnostic message when success is false. May be set on success too if a non-fatal probe step failed (the matching Issue is also emitted). |
| report | NodeReport | Per-node report. Always populated, even on partial failure (with the bits that did succeed). |
| issues | repeated Issue | Node-scoped Issues raised during the probe (e.g. EDPP_NO_REFERENCE, POWER_LATENCY_HIGH localized to this node). |
| deep_probe | bool | True when the agent dispatched the deep probe path (plugin implements BMCHealthSupport). False when the ping+common fallback was used. |
| endpoint_durations_us | repeated int64 | Raw per-endpoint request durations in microseconds. Populated by the deep-probe path; left empty by the fallback path. The server-side BMCHealthCheckTask uses these to compute exact cross-node ClusterSummary percentiles via stats.ComputeDurations - the per-node EndpointStat already carries summary statistics, but the raw samples are needed for cross-node re-percentiling. |
| telemetry_durations_us | repeated int64 | Raw per-telemetry-fetch durations in microseconds. |
| power_get_durations_us | repeated int64 | Raw per-power-limit-GET durations in microseconds. |
| power_set_durations_us | repeated int64 | Raw per-power-limit-SET (PATCH) durations in microseconds. |
| edpp_get_durations_us | repeated int64 | Raw per-EDPp-percent-GET durations in microseconds. |
| wpps_get_durations_us | repeated int64 | Raw per-WPPS-GET durations in microseconds. |
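The reason raw samples are carried alongside per-node summaries is that percentiles do not combine: a cluster-wide percentile has to be computed over the pooled samples. The sketch below illustrates this with a nearest-rank percentile. It is not the algorithm used by stats.ComputeDurations, which this reference does not describe, and the function names are chosen for illustration only.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of durations."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def cluster_percentile(per_node_durations_us, pct):
    """Pool the raw samples from every node, then take the percentile once."""
    pooled = [d for node in per_node_durations_us for d in node]
    return percentile(pooled, pct)


# One node with four fast samples, one node with a single slow sample.
node_a = [100, 110, 120, 130]
node_b = [900]

pooled_p50 = cluster_percentile([node_a, node_b], 50)
mean_of_p50s = (percentile(node_a, 50) + percentile(node_b, 50)) / 2
```

Here the pooled median is 120 µs, while averaging the two per-node medians (110 µs and 900 µs) gives 505 µs, which overweights the node that contributed a single sample.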
DeviceControllerResponse.EntityModificationResponse
DeviceControllerResponse.GPUMetricsResponse
Out-of-band GPU metrics response
DeviceControllerResponse.GPUMetricsResponse.GPUMetrics
DeviceControllerResponse.GPUPoliciesUpdateResponse
GPUPoliciesUpdateResponse is returned for GPU policies update call
DeviceControllerResponse.GPUPolicyResult
GPUPolicyResult is the result of applying a power cap to a single GPU
| Field | Type | Description |
| --- | --- | --- |
| gpu_id | optional uint32 | The GPU id within the node |
| ok | bool | Whether or not the operation was successful |
| power_cap_watts | double | The actual limit set |
| diag_msg | string | Diagnostic msg, if any |
DeviceControllerResponse.GPUProfileStatus
DeviceControllerResponse.PingResponse
PingResponse contains success, and if success is false, the error msg.
| Field | Type | Description |
| --- | --- | --- |
| success | bool | True if the ping succeeded. |
| error | string | Error message when success is false. |
DeviceControllerResponse.PolicyComponentResult
PolicyComponentResult represents the result of applying a policy to a specific component
| Field | Type | Description |
| --- | --- | --- |
| ok | bool | Whether or not the operation was successful |
| power_cap_watts | double | The actual limit set in watts |
| diag_msg | string | Diagnostic message, if any |
DeviceControllerResponse.PowerProfileResponse
DeviceControllerResponse.RawMetricsResponse
Raw metrics response
DeviceControllerResponse.StdMetricsResponse
StdMetricsResponse
Scalar Value Types
| .proto Type | Notes | C++ Type | Java Type | Python Type |
| --- | --- | --- | --- | --- |
| double | | double | double | float |
| float | | float | float | float |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint32 instead. | int32 | int | int |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers; if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long |
| uint32 | Uses variable-length encoding. | uint32 | int | int/long |
| uint64 | Uses variable-length encoding. | uint64 | long | int/long |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long |
| sfixed32 | Always four bytes. | int32 | int | int |
| sfixed64 | Always eight bytes. | int64 | long | int/long |
| bool | | bool | boolean | boolean |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode |
| bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str |
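The efficiency notes for int32, sint32, and fixed32 follow from how varints work: each byte carries 7 bits of payload, negative int32 values are sign-extended to 64 bits, and sint32 applies ZigZag encoding first. The sketch below computes encoded payload sizes only (field tags excluded). The helper names are made up for illustration and are not part of any protobuf library.

```python
def varint_len(value):
    """Number of bytes needed to varint-encode an unsigned 64-bit value."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def int32_len(value):
    """Encoded size of an int32; negatives are sign-extended to 64 bits."""
    return varint_len(value & 0xFFFFFFFFFFFFFFFF)


def sint32_len(value):
    """Encoded size of a sint32; ZigZag maps 0, -1, 1, -2 to 0, 1, 2, 3."""
    zigzag = ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF
    return varint_len(zigzag)


sizes = {
    "int32(-1)": int32_len(-1),            # 10 bytes
    "sint32(-1)": sint32_len(-1),          # 1 byte
    "uint32(2**28 - 1)": varint_len(2**28 - 1),  # 4 bytes
    "uint32(2**28)": varint_len(2**28),    # 5 bytes, versus 4 for fixed32
}
```

A value of 2^28 or more needs five varint bytes, which is why fixed32, at a constant four bytes, wins once values are often that large.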