UFM Telemetry
Unified Fabric Manager (UFM) Telemetry collects more than 120 unique counters (BER, temperature, histograms, retransmissions, and many more) for each port in the InfiniBand fabric. With this data, users can identify marginal cables and replace them during the bring-up process, before they cause malfunctions.
The tool collects data samples from every port in the cluster and saves them to a CSV file.
To collect InfiniBand link quality metrics, run the following command:
curl http://{machine_ip}:9002/csv/xcset/low_freq_debug >> my_telemetry_file.csv
Example:
my_telemetry_file.csv
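
For scripted collection, the same request can be issued from Python. The following is a minimal sketch using only the standard library, not an official UFM client. The function names `telemetry_url` and `save_sample` are hypothetical, and the port and path are simply those from the curl command above.

```python
import urllib.request
from pathlib import Path


def telemetry_url(machine_ip: str, port: int = 9002,
                  cset: str = "low_freq_debug") -> str:
    """Build the CSV endpoint URL used by the curl command above."""
    return f"http://{machine_ip}:{port}/csv/xcset/{cset}"


def save_sample(machine_ip: str, out_path: str, timeout: float = 30.0) -> int:
    """Fetch one telemetry sample and append it to out_path.

    Appending mirrors the `>>` redirection in the curl command.
    Returns the number of bytes written.
    """
    with urllib.request.urlopen(telemetry_url(machine_ip), timeout=timeout) as resp:
        data = resp.read()
    with Path(out_path).open("ab") as f:
        f.write(data)
    return len(data)
```

Calling `save_sample("<machine_ip>", "my_telemetry_file.csv")` would then behave like the curl command. Writing each sample to its own file instead of appending may make it easier to compare consecutive samples later.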

The following table lists the link monitoring key indicators and provides their descriptions and evaluation criteria.
| Parameter | Description | Evaluation Criteria |
|---|---|---|
| Link State | | |
| Phy_state | Physical link state | Verify that the link is up (enumeration value = 5) |
| Link Quality | | |
| NDR Link Quality | Link quality criteria depend on the error correction scheme type. | DAC: direct attach copper; ACC: active copper cable; AOC: active optical cable. Note: the minimum port up time for BER measurement is 125 minutes. |
| PHY Errors | | |
| Link_Down counter | Total number of link-down events caused by involuntary link shutdown | If delta from last sample > 0: |
| Cable Information | | |
| Module_Temperature | Temperature of the transceiver (optical transceivers only) | Each transceiver has its own warning and alarm thresholds, usually Warning [70 °C, 0 °C] and Alarm [80 °C, -10 °C] (high, low). |
| rx_power_lane_x and tx_power_lane_x | Rx power and Tx power per transceiver lane (optical transceivers only) | Each transceiver has its own warning and alarm thresholds. |
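
Some of the criteria in the table above can be checked mechanically by comparing two consecutive samples. The following is a minimal Python sketch, not part of UFM. The column names (`Node_GUID`, `Port_Number`, `Phy_state`, `Link_Down`, `Module_Temperature`) are assumptions for illustration and should be matched to the header row of your own CSV file. The temperature limits are the "usual" values from the table, and real transceivers report their own thresholds. Treating a reading equal to a threshold as a violation is also an assumption. BER and Rx/Tx power criteria are not covered.

```python
import csv
import io

# Assumed column names: adjust them to the header row of your CSV file.
KEY_COLS = ("Node_GUID", "Port_Number")  # hypothetical port identifier
PHY_STATE_COL = "Phy_state"
LINK_DOWN_COL = "Link_Down"
TEMP_COL = "Module_Temperature"

PHY_STATE_LINK_UP = 5                    # link up, per the table
TEMP_WARN_C = (0.0, 70.0)                # usual warning (low, high), per the table
TEMP_ALARM_C = (-10.0, 80.0)             # usual alarm (low, high), per the table


def load_sample(text):
    """Parse one CSV sample into {port_key: row_dict}."""
    rows = csv.DictReader(io.StringIO(text))
    return {tuple(r[c] for c in KEY_COLS): r for r in rows}


def temperature_status(temp_c):
    """Classify a temperature reading as 'alarm', 'warning', or 'ok'."""
    if temp_c <= TEMP_ALARM_C[0] or temp_c >= TEMP_ALARM_C[1]:
        return "alarm"
    if temp_c <= TEMP_WARN_C[0] or temp_c >= TEMP_WARN_C[1]:
        return "warning"
    return "ok"


def evaluate(prev, curr):
    """Return (port_key, message) findings for the current sample."""
    findings = []
    for key, row in curr.items():
        # Link State: the physical state must be link up.
        if int(row[PHY_STATE_COL]) != PHY_STATE_LINK_UP:
            findings.append((key, "link is not up"))
        # PHY Errors: flag any increase since the previous sample.
        if key in prev:
            delta = int(row[LINK_DOWN_COL]) - int(prev[key][LINK_DOWN_COL])
            if delta > 0:
                findings.append((key, f"link down count increased by {delta}"))
        # Cable Information: skip ports with no temperature reading.
        temp = row.get(TEMP_COL)
        if temp:
            status = temperature_status(float(temp))
            if status != "ok":
                findings.append((key, f"temperature {status}: {float(temp)} C"))
    return findings
```

Given two samples loaded with `load_sample`, `evaluate(prev, curr)` returns one entry per violated criterion, so an empty list means none of the checked criteria were triggered.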