NMX Telemetry (NMX-T) Documentation v1.0.0

gRPC Interface

The NMX-T instance runs a gRPC server that allows clients to retrieve application information and subscribe to telemetry data. The full Protocol Buffers definition of the gRPC interface, nmx-telemetry.proto, can be found in the ./proto subdirectory of the package installation directory.

service TelemetryService {
    rpc Hello(ClientHello) returns (ServerHello);
    rpc SubscribeTelemetryData(TelemetrySubscription) returns (stream TelemetryData);
}

The gRPC interface can optionally be secured with TLS or mTLS. By default, the gRPC interface runs unsecured. The following security modes are available:

  • disabled - no communication security is enforced

  • tls - TLS encryption is enforced, and the client can verify the trust of the gRPC server

  • mtls - mutual TLS is enforced, and the gRPC server also verifies the trust of each connecting client
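
As a rough illustration of how a client could choose a channel type for each of the three modes, the Python sketch below uses the grpcio package. The function name open_channel and its parameters are hypothetical and not part of NMX-T; the certificate material is assumed to be supplied by the caller as PEM-encoded bytes.

```python
def open_channel(target, mode="disabled", ca_pem=None,
                 client_key_pem=None, client_cert_pem=None):
    """Open a gRPC channel that matches the server's security mode (sketch)."""
    if mode not in ("disabled", "tls", "mtls"):
        raise ValueError(f"unknown security mode: {mode!r}")
    if mode == "mtls" and not (client_key_pem and client_cert_pem):
        raise ValueError("mtls requires a client key and certificate")

    import grpc  # provided by the grpcio package; imported only when needed

    if mode == "disabled":
        # Plaintext channel, matching the default unsecured server.
        return grpc.insecure_channel(target)
    if mode == "tls":
        # The client verifies the server against the given CA bundle
        # (or the system default roots when ca_pem is None).
        creds = grpc.ssl_channel_credentials(root_certificates=ca_pem)
    else:
        # Mutual TLS: the client also presents its own key and certificate.
        creds = grpc.ssl_channel_credentials(
            root_certificates=ca_pem,
            private_key=client_key_pem,
            certificate_chain=client_cert_pem,
        )
    return grpc.secure_channel(target, creds)
```

The mode has to match what the server is configured to enforce; a plaintext channel against a tls or mtls server will fail to connect.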

The gRPC interface can be enabled or disabled. By default, it is enabled.

The nmx-telemetry-grpc-interface parameter in the user_config.json file controls whether the interface is on or off.

The "Hello" remote procedure call is used to synchronize the client and server versions, and if needed, enforce version matching and adjust the logic accordingly.

service TelemetryService {
    rpc Hello(ClientHello) returns (ServerHello);
}

Client parameters for the handshake:

message ClientHello {
    string gatewayId = 1;
    ProtoMsgMajorVersion major_version = 2;
    ProtoMsgMinorVersion minor_version = 3;
}

In addition to other application-specific data, the telemetry service returns the application instance and environment identifiers.

  • domain_uuid - environment domain identifier, the unique identifier of the GB200 instance

  • app_uuid - application instance unique identifier

  • app_ver - application version string
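
One way a client might act on the handshake result is sketched below in Python. It assumes the client can obtain both its own and the server's protocol versions as (major, minor) integer pairs; the function name compare_versions and the policy itself (reject on a major mismatch, degrade when the client is newer than the server) are illustrative assumptions, not behavior defined by NMX-T.

```python
def compare_versions(client, server):
    """Classify compatibility of two (major, minor) version pairs."""
    client_major, client_minor = client
    server_major, server_minor = server
    if client_major != server_major:
        # Assumed policy: a major version mismatch is not recoverable.
        return "incompatible"
    if client_minor > server_minor:
        # Assumed policy: the client should avoid features newer than the server.
        return "degraded"
    return "ok"
```

A client could refuse to subscribe on "incompatible" and switch to a reduced feature set on "degraded".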

The SubscribeTelemetryData remote procedure call enables clients to receive a stream of telemetry data collected by NMX-Telemetry.

service TelemetryService {
    rpc SubscribeTelemetryData(TelemetrySubscription) returns (stream TelemetryData);
}

The TelemetrySubscription message defines the subscription parameters.

message TelemetrySubscription {
    string data_type = 1;  // * | ib_counters | sys_log | gpu_counters
    string source_id = 2;
    string source_tag = 3;
}

Set the parameter values to select the types or sources of data to receive, or leave the values blank to subscribe to all available data.

  • data_type - the type of data to subscribe to

    • an empty string or an asterisk (*) subscribes to all data types

    • a comma-separated list of data types gives a fine-grained subscription

  • source_id - the identifier of the data source to receive data from

  • source_tag - the data source tag

Note

Leave all the parameters empty to receive all telemetry data as it is collected, without any filtering or pre-selection.
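
The filtering rules above can be sketched as a small Python helper that produces the three field values. The helper name make_subscription is hypothetical, and it returns a plain dict rather than the generated protobuf message so that the example stays self-contained.

```python
def make_subscription(data_types=(), source_id="", source_tag=""):
    """Build TelemetrySubscription field values as a plain dict (sketch)."""
    cleaned = [t.strip() for t in data_types if t and t.strip()]
    if "*" in cleaned:
        # An asterisk selects every data type, the same as an empty string.
        cleaned = []
    return {
        "data_type": ",".join(cleaned),
        "source_id": source_id,
        "source_tag": source_tag,
    }

# Subscribe to everything: all three fields stay empty.
everything = make_subscription()

# Fine-grained subscription using data type names from the proto comment.
selected = make_subscription(["ib_counters", "sys_log"])
```

Since the keys match the message field names, the dict could be passed as keyword arguments to the message class generated from nmx-telemetry.proto.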

The telemetry data response includes metadata fields and the actual data payload. The format of the payload may vary depending on the type of data received.

message TelemetryData {
    string aggregator_id = 1;
    string source_id = 2;
    string source_tag = 3;
    string data_type = 4;
    int64 timestamp = 6;
    Encoding encoding_type = 7;
    bytes message = 8;
}

Metadata fields describe the payload:

  • aggregator_id - the unique identifier of the application domain (Oberon domain UUID)

  • data_type - the name of the type of data the payload contains, for example "counters"

  • source_id - the identifier of the data source: the device GUID for NVLink telemetry counters, the switch IP and port for gNMI aggregation, or the server IP for syslog message aggregation

  • timestamp - the time at which the message was formed, in microseconds

  • encoding_type - a hint for interpreting the payload, which could be JSON or BYTES

  • message - the actual data payload, as described in the section below

For example, a message representing an event of type nvl_packet_types_counters may have the following values:

aggregator_id = b954ce10-be66-4d75-a538-405ac8517c38
data_type = nvl_packet_types_counters
source_id = 0x1070fd030058c216
source_tag = nvlink
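
A client typically needs to turn the timestamp and message fields into something usable. The Python sketch below shows one way to do that, assuming the timestamp counts microseconds since the Unix epoch and that the encoding is available to the caller as the enum value's name (for example "JSON"). Both function names are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def timestamp_to_datetime(timestamp_us):
    """Convert a microsecond Unix timestamp to an aware UTC datetime."""
    # Integer timedelta arithmetic avoids float rounding on large values.
    return EPOCH + timedelta(microseconds=timestamp_us)

def decode_message(encoding_name, message):
    """Return parsed JSON for "JSON" payloads and the raw bytes otherwise."""
    if encoding_name == "JSON":
        return json.loads(message.decode("utf-8"))
    return message
```

Payloads with any other encoding are returned untouched so that the caller can decide how to interpret them.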

Telemetry data, including counters and events, is presented as comma-separated values (CSV) wrapped in JSON.

The payload is a JSON array of objects, each consisting of:

  • Timestamp: The time at which the data is collected.

  • Fields: A comma-separated list of data fields contained in the payload.

  • Values: A list of strings, each holding one comma-separated row of values in the same order as the fields.

Message payload of data type nvl_packet_types_counters may look like the following:

[
    {
        "timestamp": 100,
        "fields": "node_guid,port_guid,port_num,port_rcv_ibg1_nvl_pkts,port_rcv_ibg1_non_nvl_pkts,port_rcv_ibg2_pkts,port_xmit_ibg1_nvl_pkts,port_xmit_ibg1_non_nvl_pkts,port_xmit_ibg2_pkts",
        "values": [
            "0x1070fd0300580000,0x1070fd030058c216,9,0,0,0,0,0,0",
            "0x1070fd0300580002,0x1070fd030058c216,9,0,0,0,0,0,0"
        ]
    },
    {
        "timestamp": 200,
        "fields": "node_guid,port_guid,port_num,port_rcv_ibg1_nvl_pkts,port_rcv_ibg1_non_nvl_pkts,port_rcv_ibg2_pkts,port_xmit_ibg1_nvl_pkts,port_xmit_ibg1_non_nvl_pkts,port_xmit_ibg2_pkts",
        "values": [
            "0x1070fd0300580000,0x1070fd030058c216,9,0,0,0,0,0,0",
            "0x1070fd0300580002,0x1070fd030058c216,9,0,0,0,0,0,0"
        ]
    }
]
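
To work with a payload of this shape, a client can pair each value row with the field names. The following Python sketch does so under the assumption that values never contain embedded commas or quoting; the function name payload_to_rows and the block_timestamp key are hypothetical, and the sample is a trimmed version of the payload above.

```python
import json

def payload_to_rows(payload_text):
    """Expand a CSV-in-JSON payload into one dict per value row."""
    rows = []
    for block in json.loads(payload_text):
        fields = block["fields"].split(",")
        for line in block["values"]:
            # Pair field names with values by position; values stay strings.
            row = dict(zip(fields, line.split(",")))
            # Kept under its own key so it cannot clash with a field name.
            row["block_timestamp"] = block["timestamp"]
            rows.append(row)
    return rows

sample = """[{"timestamp": 100,
  "fields": "node_guid,port_guid,port_num",
  "values": ["0x1070fd0300580000,0x1070fd030058c216,9",
             "0x1070fd0300580002,0x1070fd030058c216,9"]}]"""

for row in payload_to_rows(sample):
    print(row["node_guid"], row["port_num"], row["block_timestamp"])
```

Because zip stops at the shorter sequence, a row with more values than field names is silently truncated, so a production client may want to compare the two lengths first.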

Another example, the data payload of the "counters" data type:

[
    {
        "timestamp": 1729872473718869,
        "fields": "node_guid,port_guid,port_num,node_description,roundtrip_time_port_counters_extended",
        "values": [
            "0xb83fd20300f9b7dc,0xb83fd20300f9b7dc,1,swx-proton03-bf3-2 HCA-1,,0"
        ]
    }
]

A TelemetryData response produced from gNMI aggregated data consists of the following:

  • aggregator_id: The unique identifier for the application domain (Oberon domain UUID).

  • data_type: The name of the gNMI subscription.

  • source_id: The address and port of the gNMI target from which the data is being aggregated.

  • timestamp: The time, in microseconds, when the message was formed.

  • encoding_type: A hint for interpreting the payload, which could be either JSON or PROTO.

  • message: The gNMI update response received from the aggregation target, either in its original binary form (encoded in PROTO) or as a JSON representation of the gNMI update message.

For example, a JSON-marshalled gNMI response could look like the following:

{
    "update": {
        "prefix": {
            "elem": [
                {
                    "name": "interfaces"
                },
                {
                    "key": {
                        "name": "fnma1p1"
                    },
                    "name": "interface"
                }
            ],
            "target": "netq"
        },
        "timestamp": "1729513043599315230",
        "update": [
            {
                "path": {
                    "elem": [
                        {
                            "name": "state"
                        },
                        {
                            "name": "counters"
                        },
                        {
                            "name": "in-octets"
                        }
                    ]
                },
                "val": {
                    "uintVal": "353952"
                }
            }
        ]
    }
}
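
A JSON-encoded gNMI message of this shape can be flattened into path and value pairs. The Python sketch below assumes the structure shown in the example above (a top-level "update" notification with an optional prefix and exactly one typed entry under "val"); the function names are hypothetical.

```python
import json

def elems_to_path(elems):
    """Render gNMI path elements as /name[key=value]/... text."""
    parts = []
    for elem in elems:
        keys = "".join(f"[{k}={v}]" for k, v in sorted(elem.get("key", {}).items()))
        parts.append(elem["name"] + keys)
    return "/" + "/".join(parts)

def flatten_gnmi(message_text):
    """Return (path, value, timestamp_ns) for each update in the notification."""
    notification = json.loads(message_text)["update"]
    prefix = notification.get("prefix", {}).get("elem", [])
    # gNMI timestamps are nanoseconds since the Unix epoch, sent here as a string.
    timestamp_ns = int(notification["timestamp"])
    results = []
    for update in notification.get("update", []):
        path = elems_to_path(prefix + update["path"].get("elem", []))
        # "val" holds one typed entry such as {"uintVal": "353952"}.
        (value,) = update["val"].values()
        results.append((path, value, timestamp_ns))
    return results
```

For the example message this yields a single entry whose path joins the prefix and the update path; the value is left as the string found in the JSON.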

A TelemetryData response produced from syslog collection consists of the following:

  • aggregator_id: The unique identifier for the application domain (Oberon domain UUID).

  • data_type: The value "log_message".

  • source_id: The address and port of the log message's source.

  • source_tag: The name of the process that sent the log message.

  • timestamp: The time, in microseconds, when the message was generated.

  • encoding_type: The encoding format, either JSON or ASCII.

  • message: The syslog message, which may be in its original text form (encoded in BYTES) or a JSON-serialized OpenTelemetry message.

Example:

{
    "time_unix_nano": 1731603557000000000,
    "observed_time_unix_nano": 1731596357165630000,
    "severity_number": 10,
    "severity_text": "notice",
    "body": {
        "Value": {
            "StringValue": "Nov 14 16:59:17 swx-proton04: Hey!"
        }
    },
    "attributes": [
        {
            "key": "facility",
            "value": {
                "Value": {
                    "IntValue": 1
                }
            }
        },
        {
            "key": "hostname",
            "value": {
                "Value": {
                    "StringValue": "swx-proton04"
                }
            }
        },
        {
            "key": "message",
            "value": {
                "Value": {
                    "StringValue": "Hey!"
                }
            }
        },
        {
            "key": "priority",
            "value": {
                "Value": {
                    "IntValue": 13
                }
            }
        },
        {
            "key": "appname",
            "value": {
                "Value": {
                    "StringValue": "bash"
                }
            }
        }
    ]
}
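
The nested "Value" wrappers in a JSON-serialized log record are verbose to navigate, so a client may prefer a flat view. The Python sketch below assumes the layout shown in the example above; the function names unwrap and summarize_log are hypothetical.

```python
import json

def unwrap(any_value):
    """Return the scalar inside a {"Value": {"StringValue": ...}} wrapper."""
    inner = any_value.get("Value", {})
    # The wrapper holds a single typed entry; None if it is empty.
    return next(iter(inner.values()), None)

def summarize_log(message_text):
    """Flatten a JSON-serialized log record into a simple dict."""
    record = json.loads(message_text)
    attributes = {
        item["key"]: unwrap(item["value"])
        for item in record.get("attributes", [])
    }
    return {
        "severity": record.get("severity_text"),
        "body": unwrap(record.get("body", {})),
        "attributes": attributes,
    }
```

For the example record, the resulting attributes dict maps "hostname" to "swx-proton04" and "priority" to 13.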

© Copyright 2025, NVIDIA. Last updated on Apr 30, 2025.