Low-Frequency (Secondary) Telemetry Fields

NVIDIA UFM Enterprise User Manual v6.17.0

The following table lists the available low-frequency (secondary) telemetry counters. They cover timestamps, port and node information, error statistics, firmware versions, temperatures, cable details, power levels, and other telemetry-related data.

Field Name

Description

Node_GUID

node GUID

Device_ID

PCI device ID

node_description

node description

lid

lid

Port_Number

port number

port_label

port label

Phy_Manager_State

FW Phy Manager FSM state

phy_state

physical state

logical_state

Port Logical link state

Link_speed_active

ib link active speed

Link_width_active

ib link active width

Active_FEC

Active FEC

Total_Raw_BER

Pre-FEC monitor parameters

Effective_BER

Post FEC monitor parameters

Symbol_BER

BER after all phy correction mechanism: post FEC + PLR monitor parameters

Raw_Errors_Lane_[0-3]

This counter provides information on error bits that were identified on lane X. When FEC is enabled, this indication corresponds to corrected errors. In PRBS test mode, it indicates the number of PRBS errors on lane X.

Effective_Errors

This counter provides information on error bits that were not corrected by the FEC correction algorithm, or on all error bits when FEC is not active.

Symbol_Errors

This counter provides information on error bits that were not corrected by phy correction mechanisms.

Time_since_last_clear_Min

The time passed since the last counters clear event in msec. (physical layer statistical counters)

hist[0-15]

hist[i] gives the number of FEC blocks that had i RS-FEC symbol errors, or a number of errors within the range that bin i represents
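As a rough illustration of how the histogram bins might be summarized, the sketch below assumes the sixteen hist[0-15] values have already been read into a Python list. The function name and the sample values are hypothetical, and the sketch treats bin i as "i symbol errors"; since a bin may also represent a range of errors, the actual bin mapping should be checked for the device in question.

```python
def fec_histogram_summary(hist):
    """Summarize FEC histogram bins given as a list of 16 block counts."""
    if len(hist) != 16:
        raise ValueError("expected 16 bins (hist[0-15])")
    total = sum(hist)
    if total == 0:
        # No FEC blocks recorded yet, so there is nothing to normalize.
        return {"total_blocks": 0,
                "fraction_per_bin": [0.0] * 16,
                "highest_nonzero_bin": None}
    nonzero = [i for i, count in enumerate(hist) if count > 0]
    return {
        "total_blocks": total,
        # Share of all FEC blocks that fell into each bin.
        "fraction_per_bin": [count / total for count in hist],
        # Worst bin observed; a high value suggests little FEC margin.
        "highest_nonzero_bin": nonzero[-1],
    }


# Hypothetical sample: most blocks are error free, a few have 1-3 errors.
sample = [900, 80, 15, 5] + [0] * 12
summary = fec_histogram_summary(sample)
print(summary["total_blocks"], summary["highest_nonzero_bin"])
```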

FW_Version

Node FW version

Chip_Temp

switch temperature

Link_Down

Perf.PortCounters(LinkDownedCounter)

Link_Down_IB

Total number of times the Port Training state machine has failed the link error recovery process and downed the link.

LinkErrorRecoveryCounter

Total number of times the Port Training state machine has successfully completed the link error recovery process.

PlrRcvCodes

Number of received PLR codewords

PlrRcvCodeErr

The total number of rejected codewords received

PlrRcvUncorrectableCode

The number of uncorrectable codewords received

PlrXmitCodes

Number of transmitted PLR codewords

PlrXmitRetryCodes

The total number of codewords retransmitted

PlrXmitRetryEvents

The total number of retransmission events

PlrSyncEvents

The number of sync events

HiRetransmissionRate

Received bandwidth loss due to code retransmission

PlrXmitRetryCodesWithinTSecMax

The maximum number of retransmitted events in t sec window
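The PLR counters above lend themselves to simple ratios. The following is a minimal sketch, assuming the counters for one port have been collected into a dictionary keyed by field name; the function name, the result keys, and the sample values are hypothetical and are not part of UFM.

```python
def plr_ratios(counters):
    """Derive retransmission and rejection ratios from PLR counters."""
    xmit = counters.get("PlrXmitCodes", 0)
    rcv = counters.get("PlrRcvCodes", 0)
    return {
        # Fraction of transmitted codewords that had to be retransmitted.
        "xmit_retry_ratio":
            counters.get("PlrXmitRetryCodes", 0) / xmit if xmit else None,
        # Fraction of received codewords that were rejected.
        "rcv_reject_ratio":
            counters.get("PlrRcvCodeErr", 0) / rcv if rcv else None,
    }


# Hypothetical counter values for one port.
ratios = plr_ratios({
    "PlrXmitCodes": 1_000_000,
    "PlrXmitRetryCodes": 250,
    "PlrRcvCodes": 2_000_000,
    "PlrRcvCodeErr": 100,
})
print(ratios)
```

A ratio of None means the corresponding total was zero, so no ratio could be computed.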

link_partner_description

node description of the link partner

link_partner_node_guid

node_guid of the link partner

link_partner_lid

lid of the link partner

link_partner_port_num

port number of the link partner

Cable_PN

Vendor Part Number

Cable_SN

Vendor Serial Number

cable_technology

cable_type

Cable/module type

cable_vendor

cable_length

cable_identifier

vendor_rev

Vendor revision

cable_fw_version

rx_power_lane_[0-7]

RX measured power

tx_power_lane_[0-7]

TX measured power

Module_Voltage

Internally measured supply voltage

Module_Temperature

Module temperature

fast_link_up_status

Indicates if fast link-up was performed in the link

time_to_link_up_ext_msec

Time in msec to link up from disable until phy up state. While the phy manager did not reach phy up state the timer will return 0.

Advanced_Status_Opcode

Status opcode: PHY FW indication

Status_Message

ASCII code message

down_blame

Which receiver caused last link down

local_reason_opcode

Opcode of link down reason - local

remote_reason_opcode

Opcode of link down reason - remote

e2e_reason_opcode

For a local reason, see local_reason_opcode. For a remote reason, the value is local_reason_opcode + 100.
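The "+100" rule for e2e_reason_opcode can be sketched as a small decoder. This is only an illustration: the function name is hypothetical, and it assumes that every local reason opcode is below 100, so that any value of 100 or more is a remote reason. That assumption is not stated in the table and should be verified against the opcode list.

```python
def split_e2e_reason_opcode(e2e_opcode):
    """Return (side, opcode), where side is 'local' or 'remote'.

    Assumes local reason opcodes are always below 100 (unverified).
    """
    if e2e_opcode >= 100:
        # Remote reasons are reported as local_reason_opcode + 100.
        return ("remote", e2e_opcode - 100)
    return ("local", e2e_opcode)


print(split_e2e_reason_opcode(5))
print(split_e2e_reason_opcode(105))
```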

PortRcvRemotePhysicalErrors

Total number of packets marked with the EBP delimiter received on the port.

PortRcvErrors

Total number of packets containing an error that were received on the port

PortXmitDiscards

Total number of outbound packets discarded by the port because the port is down or congested.

PortRcvSwitchRelayErrors

Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay.

ExcessiveBufferOverrunErrors

The number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error

LocalLinkIntegrityErrors

The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors

PortRcvConstraintErrors

Total number of packets received on the switch physical port that are discarded.

PortXmitConstraintErrors

Total number of packets not transmitted from the switch physical port.

VL15Dropped

Number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) in the port

PortXmitWait

The time an egress port had data to send but could not send it due to lack of credits or arbitration, in time ticks within the sample-time window

PortXmitDataExtended

Transmitted data rate per egress port in bytes passing through the port during the sample period

PortRcvDataExtended

The received data on the ingress port in bytes during the sample period
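Because the two data counters above are described as byte counts over the sample period, an average throughput can be derived from them. The sketch below assumes the sample period is known from the telemetry configuration; the function name and the example numbers are hypothetical.

```python
def throughput_gbps(byte_count, sample_period_sec):
    """Average throughput in Gb/s for a byte count over one sample period."""
    if sample_period_sec <= 0:
        raise ValueError("sample period must be positive")
    # bytes -> bits, then bits per second -> gigabits per second
    return byte_count * 8 / sample_period_sec / 1e9


# Hypothetical example: 1.25 GB counted during a 1-second sample period.
rate = throughput_gbps(1_250_000_000, 1.0)
print(rate)
```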

PortXmitPktsExtended

Total number of packets transmitted on the port.

PortRcvPktsExtended

Total number of packets received on the port

PortUniCastXmitPkts

Total number of unicast packets transmitted on all VLs from the port. This may include unicast packets with errors, and excludes link packets

PortUniCastRcvPkts

Total number of unicast packets, including unicast packets containing errors, and excluding link packets, received from all VLs on the port.

PortMultiCastXmitPkts

Total number of multicast packets transmitted on all VLs from the port. This may include multicast packets with errors.

PortMultiCastRcvPkts

Total number of multicast packets, including multicast packets containing errors received from all VLs on the port.

SyncHeaderErrorCounter

Count of errored block sync header on one or more lanes

PortSwLifetimeLimitDiscards

Total number of outbound packets discarded by the port because the Switch Lifetime Limit was exceeded. Applies to switches only.

PortSwHOQLifetimeLimitDiscards

Total number of outbound packets discarded by the port because the switch HOQ Lifetime Limit was exceeded. Applies to switches only.

rq_num_wrfe

Responder - number of WR flushed errors

rq_num_lle

Responder - number of local length errors

sq_num_wrfe

Requester - number of WR flushed errors

Temp_flags

Latched temperature flags of module

Vcc_flags

Latched VCC flags of module

device_hw_rev

Node HW Revision

sw_revision

switch revision

sw_serial_number

switch serial number

measured_freq_[0-1]

Clock frequency measurement in last 100msec

min_freq_[0-1]

Min of clock frequency measured. Units of 0.1 KHz

max_freq_[0-1]

Max of clock frequency measured. Units of 0.1 KHz

max_delta_freq_[0-1]

Observed max delta frequency in window of 100msec. Units of 0.1 KHz

snr_media_lane_[0-7]

SNR value on the media lane <i>, in units of 1/256 dB. The SNR value represents the electrical signal-to-noise ratio on an optical lane, and is defined as the minimum of the three individual eye SNR values.

snr_host_lane_[0-7]

SNR value on the host lane <i>, in units of 1/256 dB. The SNR value represents the electrical signal-to-noise ratio on an optical lane, and is defined as the minimum of the three individual eye SNR values.
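Since the SNR fields are reported in units of 1/256 dB, a raw value has to be divided by 256 to obtain dB. A minimal sketch follows; the function name and the sample raw value are hypothetical.

```python
def snr_raw_to_db(raw_value):
    """Convert a raw SNR reading in 1/256 dB units to dB."""
    return raw_value / 256.0


# Hypothetical raw reading for one lane.
snr_db = snr_raw_to_db(5120)
print(snr_db)
```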

tx_cdr_lol

Bitmask for latched Tx cdr loss of lock flag per lane.

rx_cdr_lol

Bitmask for latched Rx cdr loss of lock flag per lane.

tx_los

Bitmask for latched Tx loss of signal flag per lane.

rx_los

Bitmask for latched Rx loss of signal flag per lane.
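The four per-lane bitmask fields above can be unpacked into lane numbers. The sketch below assumes that bit i of the mask corresponds to lane i (least significant bit is lane 0) and that there are at most 8 lanes; both are assumptions not stated in the table, and the function name is hypothetical.

```python
def lanes_with_flag(bitmask, num_lanes=8):
    """Return the lane indices whose bit is set in a latched-flag bitmask.

    Assumes bit i maps to lane i, with lane 0 in the least significant bit.
    """
    return [lane for lane in range(num_lanes) if (bitmask >> lane) & 1]


# Hypothetical rx_los value with bits 0 and 2 set.
flagged = lanes_with_flag(0b00000101)
print(flagged)
```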

phy_received_bits

This counter provides information on the total amount of traffic (bits) received
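The bit count can serve as the denominator for a rough bit error rate estimate, for example alongside Effective_Errors or Symbol_Errors. This sketch is only a cross-check under the assumption that both counters cover the same interval (for instance, since the last clear); the reported BER fields remain the authoritative values, and the function name is hypothetical.

```python
def estimate_ber(error_bits, received_bits):
    """Estimate a bit error rate as error bits divided by received bits."""
    if received_bits <= 0:
        # No traffic observed, so no rate can be estimated.
        return None
    return error_bits / received_bits


# Hypothetical example: 3 uncorrected error bits in 10**15 received bits.
ber = estimate_ber(3, 10**15)
print(ber)
```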

rq_general_error

The total number of packets that were dropped because they contained errors. Reasons for this include: Dropped due to MPR mismatch.

© Copyright 2024, NVIDIA. Last updated on May 24, 2024.