High-Frequency (Primary) Telemetry Fields

NVIDIA UFM Enterprise User Manual v6.17.2

The following list of available counters covers timestamps, port and node information, error statistics, firmware versions, temperatures, cable details, power levels, and other telemetry-related data.

| Field Name | Description |
|---|---|
| timestamp | |
| source_id | |
| tag | |
| node_guid | Node GUID |
| port_guid | Port GUID |
| port_num | Port number |
| PortXmitDataExtended | Transmitted data rate per egress port, in bytes passing through the port during the sample period |
| PortRcvDataExtended | Received data on the ingress port, in bytes, during the sample period |
| PortXmitPktsExtended | Total number of packets transmitted on the port |
| PortRcvPktsExtended | Total number of packets received on the port |
| SymbolErrorCounterExtended | Information on error bits that were not corrected by PHY correction mechanisms |
| LinkErrorRecoveryCounterExtended | Total number of times the Port Training state machine has successfully completed the link error recovery process |
| LinkDownedCounterExtended | Perf.PortCounters |
| PortRcvErrorsExtended | Total number of packets containing an error that were received on the port |
| PortRcvRemotePhysicalErrorsExtended | Total number of packets marked with the EBP delimiter received on the port |
| PortRcvSwitchRelayErrorsExtended | Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay |
| PortXmitDiscardsExtended | Total number of outbound packets discarded by the port because the port is down or congested |
| PortXmitConstraintErrorsExtended | Total number of packets not transmitted from the switch physical port |
| PortRcvConstraintErrorsExtended | Total number of packets received on the switch physical port that are discarded |
| LocalLinkIntegrityErrorsExtended | Number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors |
| ExcessiveBufferOverrunErrorsExtended | Number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error |
| VL15DroppedExtended | Number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) in the port |
| PortXmitWaitExtended | Time an egress port had data to send but could not send it due to lack of credits or arbitration, in time ticks within the sample-time window |
| hist[0-4] | hist[i] gives the number of FEC blocks that had RS-FEC symbol errors of value i or range of errors |
| infiniband_CBW | |
| Normalized_CBW | |
| NormalizedXW | |
| Normalized_XmitData | |
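
Several of the counters above are described as running totals ("Total number of ..."), so a monitoring script usually cares about the change between two samples rather than the raw value. The following is a minimal sketch of that calculation, not part of UFM. It assumes each sample is available as a dictionary keyed by the field names above, that `timestamp` is expressed in microseconds, and that the extended counters are 64 bits wide. None of these assumptions is stated on this page, and the helper names `counter_delta` and `per_second_rates` are hypothetical.

```python
from typing import Dict, Iterable


def counter_delta(prev: int, curr: int, width_bits: int = 64) -> int:
    """Change between two readings of a running-total counter.

    The modulo tolerates a single wraparound of a counter that is
    width_bits wide (64 is an assumption, not taken from the manual).
    """
    return (curr - prev) % (1 << width_bits)


def per_second_rates(
    prev: Dict[str, int],
    curr: Dict[str, int],
    fields: Iterable[str],
    ticks_per_second: int = 1_000_000,  # assumption: timestamp is in microseconds
) -> Dict[str, float]:
    """Per-second rate of each named counter between two samples."""
    elapsed = (curr["timestamp"] - prev["timestamp"]) / ticks_per_second
    if elapsed <= 0:
        raise ValueError("samples must be in increasing timestamp order")
    return {
        name: counter_delta(prev[name], curr[name]) / elapsed
        for name in fields
    }


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    earlier = {"timestamp": 1_000_000, "PortXmitPktsExtended": 100, "PortRcvErrorsExtended": 5}
    later = {"timestamp": 3_000_000, "PortXmitPktsExtended": 500, "PortRcvErrorsExtended": 5}
    print(per_second_rates(earlier, later, ["PortXmitPktsExtended", "PortRcvErrorsExtended"]))
```

Fields described as already covering "the sample period" (for example PortXmitDataExtended) may not need the delta step. Check how a given deployment reports them before applying it.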

© Copyright 2024, NVIDIA. Last updated on Jun 27, 2024.