UFM Telemetry

InfiniBand Cluster Bring-up Procedure

Unified Fabric Manager (UFM) Telemetry collects more than 120 unique counters (BER, temperature, histograms, retransmissions, and many more) for each port in the InfiniBand fabric. These counters let the user predict which cables are marginal and should be replaced during bring-up to avoid future malfunctions.
The tool collects data samples from all ports across the cluster and saves the data to a CSV file.

To collect InfiniBand Link Quality metrics, perform the following:

curl http://{machine_ip}:9002/csv/xcset/low_freq_debug >> my_telemetry_file.csv

Example output file: my_telemetry_file.csv

[Image: screenshot of my_telemetry_file.csv]
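The collection step above can also be scripted. The following is a minimal Python sketch, not part of the UFM tooling: the function names are hypothetical, and it assumes that the endpoint returns UTF-8 text whose first row holds the column names. Only the URL pattern is taken from the curl command.

```python
import csv
import io
import urllib.request


def telemetry_url(machine_ip: str) -> str:
    """Build the endpoint URL used by the curl command."""
    return f"http://{machine_ip}:9002/csv/xcset/low_freq_debug"


def fetch_csv(machine_ip: str, timeout: float = 30.0) -> str:
    """Download one telemetry sample as text (performs a network request)."""
    with urllib.request.urlopen(telemetry_url(machine_ip), timeout=timeout) as resp:
        # Assumption: the response body is UTF-8 encoded CSV.
        return resp.read().decode("utf-8")


def parse_csv(text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per row, keyed by the header names.

    Assumption: the first row of the file is a header row.
    """
    return list(csv.DictReader(io.StringIO(text)))
```

A call such as parse_csv(fetch_csv(machine_ip)) would then give one dictionary per port sample, which is easier to filter than the raw file.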

The following table lists the link monitoring key indicators and provides their descriptions and evaluation criteria.

Link State

Parameter: Phy_state
Description: Physical link state
Evaluation criteria: Verify that the link is up (enumeration value = 5).

Link Quality

NDR Link Quality

Link quality criteria depend on the error correction scheme type.

Error correction scheme type                              Media type    Post-FEC                         Symbol
                                                                        Normal    Warning   Error        Normal    Warning   Error
Default for DAC/ACC/AOC < 100m   Low_Latency_RS_FEC_PLR   DAC/ACC/AOC   1.00E-12  5.00E-12  1.00E-11     1.00E-15  5.00E-15  1.00E-14
Default for AOC > 100m           KP4_Standard_RS_FEC      AOC           1.00E-15  5.00E-15  1.00E-14     1.00E-15  5.00E-15  1.00E-14

DAC - direct attach copper

ACC - active copper cable

AOC - active optical cable

Note: The minimum port up time for a BER measurement is 125 minutes.
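The thresholds above can be applied to a collected sample roughly as follows. This is a minimal Python sketch under stated assumptions: the source does not say how a measured value is compared against the Normal, Warning, and Error columns, so the sketch assumes that a BER at or above the Error value is an error, a BER at or above the Warning value is a warning, and anything lower is normal. The function name, the "post_fec" and "symbol" keys, and the "insufficient_uptime" result are hypothetical.

```python
from dataclasses import dataclass

# Minimum port up time for a BER measurement, from the note in the table.
MIN_UPTIME_MINUTES = 125


@dataclass(frozen=True)
class Thresholds:
    normal: float
    warning: float
    error: float


# Values copied from the table; keys are (error correction scheme, BER type).
BER_THRESHOLDS = {
    ("Low_Latency_RS_FEC_PLR", "post_fec"): Thresholds(1.00e-12, 5.00e-12, 1.00e-11),
    ("Low_Latency_RS_FEC_PLR", "symbol"): Thresholds(1.00e-15, 5.00e-15, 1.00e-14),
    ("KP4_Standard_RS_FEC", "post_fec"): Thresholds(1.00e-15, 5.00e-15, 1.00e-14),
    ("KP4_Standard_RS_FEC", "symbol"): Thresholds(1.00e-15, 5.00e-15, 1.00e-14),
}


def classify_ber(scheme: str, ber_type: str, ber: float, uptime_minutes: float) -> str:
    """Return 'normal', 'warning', 'error', or 'insufficient_uptime'."""
    if uptime_minutes < MIN_UPTIME_MINUTES:
        # The port has not been up long enough for the BER to be meaningful.
        return "insufficient_uptime"
    limits = BER_THRESHOLDS[(scheme, ber_type)]
    # Assumption: comparisons are inclusive and higher BER is worse.
    if ber >= limits.error:
        return "error"
    if ber >= limits.warning:
        return "warning"
    return "normal"
```

Keeping the thresholds in a lookup table keyed by scheme makes it straightforward to adjust the values if the criteria change.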

PHY Errors

Parameter: Link_Down counter
Description: Total number of link down events that occurred as a result of involuntary link shutdown.
Evaluation criteria: If the delta from the last sample is greater than 0:

  • Trace the event and record the switch, port, date and time, and link down counter.

  • If the same switch and port have at least 2 link down occurrences within 24 hours, further investigation is required.

  • Note:

    • Make sure the link down was caused by an involuntary port down on the partner side (for example, not by a partner server reboot).

    • The criteria are intended to catch major link down events.
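The Link_Down criteria above can be sketched in Python as follows. This is an illustration only: the function names and the tuple layout (switch, port, timestamp, link down counter) are hypothetical, and the sketch assumes that the samples for each port are supplied in chronological order. It does not distinguish voluntary from involuntary link downs, so events caused by a partner server reboot still need to be excluded by hand.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)


def link_down_events(samples):
    """Return (switch, port, time, counter) for every sample whose counter rose.

    samples: iterable of (switch, port, timestamp, link_down_counter),
    in chronological order for each port.
    """
    last_counter = {}
    events = []
    for switch, port, timestamp, counter in samples:
        key = (switch, port)
        # A positive delta from the previous sample means a new link down.
        if key in last_counter and counter - last_counter[key] > 0:
            events.append((switch, port, timestamp, counter))
        last_counter[key] = counter
    return events


def ports_needing_investigation(events):
    """Return the (switch, port) pairs with at least 2 events within 24 hours."""
    times_by_port = defaultdict(list)
    for switch, port, timestamp, _counter in events:
        times_by_port[(switch, port)].append(timestamp)
    flagged = set()
    for key, times in times_by_port.items():
        times.sort()
        # Two events fall within 24 hours if any adjacent pair does.
        if any(later - earlier <= WINDOW for earlier, later in zip(times, times[1:])):
            flagged.add(key)
    return flagged
```

The first function produces the trace records (switch, port, date and time, counter), and the second applies the 24 hour rule to them.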

Cable Information

Parameter: Module_Temperature
Description: Temperature of the transceiver (optical transceivers only)
Evaluation criteria: Each transceiver has its own warning and alarm thresholds, usually Warning [70°C, 0°C] and Alarm [80°C, -10°C].

Parameter: rx_power_lane_x and tx_power_lane_x
Description: Rx power and Tx power per transceiver lane (optical transceivers only)
Evaluation criteria: Each transceiver has its own alarm thresholds.
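For Module_Temperature, the usual thresholds listed above can be checked with a short Python sketch such as the one below. It assumes that each bracketed pair is read as [high limit, low limit] in degrees Celsius and that the comparisons are inclusive; neither detail is stated in the source. The function name is hypothetical, and the default values should be replaced by the thresholds reported by the specific transceiver whenever they are available.

```python
def classify_temperature(temp_c, warning=(70.0, 0.0), alarm=(80.0, -10.0)):
    """Return 'normal', 'warning', or 'alarm' for a module temperature.

    Each threshold pair is (high, low) in degrees Celsius. The defaults are
    the usual values; real thresholds are defined per transceiver.
    """
    warning_high, warning_low = warning
    alarm_high, alarm_low = alarm
    # Check the alarm range first because it is the more severe condition.
    if temp_c >= alarm_high or temp_c <= alarm_low:
        return "alarm"
    if temp_c >= warning_high or temp_c <= warning_low:
        return "warning"
    return "normal"
```

The same pattern could be reused for rx_power_lane_x and tx_power_lane_x once the per-transceiver power thresholds are known.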

© Copyright 2024, NVIDIA. Last updated on May 28, 2024.