NVIDIA UFM Cable Validation Tool v1.5.0

Introduction

The Cable Validation Tool (CVT) is designed to ensure the accuracy and quality of network cluster wiring. Its primary purpose is to validate the connectivity of physical links within the cluster and to verify high-quality communication between the network components, thereby maintaining the integrity of these connections.

The Collector, also referred to as the "bring-up server," serves as the central component of the system and operates as a Docker container. It can be deployed on any machine connected to the management network of the switches. This deployment enables seamless communication with the switches. For large-scale systems, the Collector relies on dedicated agents installed on each switch. These agents are responsible for verifying the connections between the switches.

Collector Responsibilities

The Collector performs the following critical tasks:

  • Deployment and Execution: It is installed and executed on a server with network access to all nodes requiring validation.

  • Topology Validation: Reads the Topology Files (P2P, topo or dot), which serve as the authoritative source for validating the physical link connections in the fabric topology.

  • Agent Management:

    • Deploys agents on all nodes.

    • Monitors agents' health.

    • Supports external (unmanaged) agent deployments, with the Collector only monitoring their health.

  • Data Collection and Processing: Collects and processes agents' reports from all validated nodes.

  • User Interface and Reporting:

    • Displays validation results through a web page, along with recommended remediation steps.

    • Provides data visualization options, including aggregation, sorting, and filtering, and supports downloading reports in CSV format. REST APIs are also available for integration with other systems.

Agent Responsibilities

Agents are installed on all switches and servers within the cluster. Their key functions include:

  • Deployment: Agents run on every switch and server being validated.

  • Real-Time Monitoring:

    • Monitor node and link statuses every 10 seconds. Agents detect changes in link states and, when a change occurs, send an updated status report to the Collector.

    • If no changes are detected, agents send a periodic status report every 10 minutes.

    • Amber file collection is triggered by state changes and can take 40 to 50 seconds.

  • Event-Driven Reporting:

    • Upon receiving a "start_validation" message from the Collector, agents initiate status reporting.

    • Reports are triggered in two scenarios:

      • When a link status change is detected (ad-hoc report).

      • Every 10 minutes as part of routine reporting.
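
The reporting rules above can be illustrated with a small sketch. This is not the agent's actual code: the class name `ReportScheduler`, its method, and the link-state dictionary are hypothetical. It also assumes that the very first poll produces a report and that an ad-hoc report restarts the 10-minute timer, neither of which is stated here.

```python
from dataclasses import dataclass
from typing import Dict, Optional

POLL_INTERVAL_S = 10          # agents sample node and link status every 10 seconds
PERIODIC_REPORT_S = 10 * 60   # routine report when nothing has changed


@dataclass
class ReportScheduler:
    """Hypothetical helper: decides at each poll whether to send a report."""
    last_states: Optional[Dict[str, str]] = None
    last_report_time: Optional[float] = None

    def on_poll(self, now: float, link_states: Dict[str, str]) -> Optional[str]:
        """Return 'ad-hoc', 'periodic', or None for this poll."""
        if self.last_states is None or link_states != self.last_states:
            # A link state changed (or this is the first poll, by assumption).
            reason = "ad-hoc"
        elif now - self.last_report_time >= PERIODIC_REPORT_S:
            # No change, but 10 minutes have passed since the last report.
            reason = "periodic"
        else:
            reason = None
        self.last_states = dict(link_states)
        if reason is not None:
            # Assumption: any report restarts the periodic timer.
            self.last_report_time = now
        return reason
```

Calling `on_poll` once per `POLL_INTERVAL_S` with a mapping such as `{"swp1": "up"}` returns `"ad-hoc"` on a state change, `"periodic"` after 10 quiet minutes, and `None` otherwise.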

Further details on the Collector and Cable Agents, including their operational workflows, will be discussed in the subsequent chapters.

Supported Fabric Types

The UFM Cable Validation Tool is compatible with four types of fabric:

  1. InfiniBand

    • Supported on MLNX-OS switches with NDR and HDR port speeds

  2. XDR

    • Supported on NVOS Blackmamba switches with XDR port speed

  3. Ethernet

    • Supported on Cumulus switches and Ethernet hosts

  4. NVLink

    • Supported on NVOS switches and GB200 compute nodes of GB200 rack types 36x2 and 72x1.

    • The rack's internal links are NVLink and are supported.

    • Hosts whose external links are Ethernet are supported. External IB links must be managed from the neighboring IB switch.

The CVT detects the following symptoms and issues:

  • Validate Physical Connections (Cable, End points) / Miswiring

    • Wrong-neighbor – the cable connects to a different device than Topo file (P2P, topo or dot) dictates

    • Wrong-port – the cable connects to the expected device but on the wrong port

    • Unknown-neighbor – the cable connects to a device not mentioned in the Topo file (P2P, topo or dot) or LLDP is not enabled/failing

    • Extra Cable – a cable was found to be connected but not part of the loaded Topo file (P2P, topo or dot)

    • Media Unplugged – no cable is present in the port or, for optical cables, no transceiver is present in the port

  • Validate Layer1 Link Integrity (Bit Error Rate, Lane powers, Temperature)

    • Flapping link – the link state has transitioned up->down->up on its own or due to external actions.

    • Link Down, No Signal – fiber is not connected or is broken

    • ErrDisable-Rx – interface-down events caused by a server NIC firmware bug (RX Disable)

    • Err-Disable-Flap – port disabled by the Link Protection feature (5 flaps/10 sec.) due to excessive flaps

    • Anomalous-Port – out-of-range parameters such as transceiver lane signal strength or transceiver temperature

    • Underperforming (BER) – bit error rate outside the following limits:

      • Effective BER errors should be 0 during the first 125 mins of the link being up

      • Raw BER should be ≤ 1e-6

      • Effective BER should be ≤ 1.5E-254 for ≤ 6hrs measurements and ≤ 1E-15 for ≥ 6hrs measurements

  • Triage Non-cable issues (Provisioning, CVT issues) requiring Escalations

    • AdminDown – Spectrum switch port is administratively shut down

    • Negotiation Fail – interface issue caused by a speed or duplex mismatch between devices

    • No-report – agent communication is working but no report has been received

    • Unreachable-device – agent not installed, not running, or not reachable (e.g. port 8251 not open in switch configuration)
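
As an illustration of the miswiring categories listed above, the following sketch compares an observed neighbor against the expected topology. It is one possible reading of those definitions, not CVT's implementation: `build_expected`, `classify_link`, and the `(device, port)` tuple representation are hypothetical, and the order in which the checks are applied is an assumption.

```python
from typing import Dict, Iterable, Optional, Tuple

Endpoint = Tuple[str, str]  # (device name, port name), hypothetical representation


def build_expected(links: Iterable[Tuple[Endpoint, Endpoint]]) -> Dict[Endpoint, Endpoint]:
    """Index the topology links in both directions."""
    expected: Dict[Endpoint, Endpoint] = {}
    for a, b in links:
        expected[a] = b
        expected[b] = a
    return expected


def classify_link(local: Endpoint,
                  observed_peer: Optional[Endpoint],
                  expected: Dict[Endpoint, Endpoint]) -> str:
    """Return a syndrome name for one connected local port."""
    if observed_peer is None:
        # No peer information at all, e.g. LLDP disabled or failing.
        return "Unknown-neighbor"
    if local not in expected:
        # The port is connected but is not part of the loaded topology.
        return "Extra-cable"
    known_devices = {dev for dev, _ in expected} | {dev for dev, _ in expected.values()}
    peer_dev, peer_port = observed_peer
    exp_dev, exp_port = expected[local]
    if peer_dev not in known_devices:
        return "Unknown-neighbor"
    if peer_dev != exp_dev:
        return "Wrong-neighbor"
    if peer_port != exp_port:
        return "Wrong-port"
    return "OK"
```

For example, with a topology containing the link `("sw1", "p1")` to `("sw2", "p1")`, observing `("sw2", "p2")` on `("sw1", "p1")` is classified as `Wrong-port`.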

| Syndrome | Description | Non-admin users | Supported Fabric |
|---|---|---|---|
| Negotiation Fail | Speed or FEC negotiation failure, or a configuration issue | Yes | IB, ETH, XDR |
| AdminDown | Link is disabled administratively | Yes | IB, ETH, XDR |
| Wrong-neighbor | Port is connected to the wrong peer node | Yes | IB, ETH, XDR |
| Wrong-port | Port is connected to the wrong port on the correct peer node | Yes | IB, ETH, XDR |
| Extra-cable | Port is connected but the neighbor is not in the P2P topology | Yes | IB, ETH, XDR |
| Flapping-link | On switches: carrier transitions are monitored every 10 seconds. If the counter increments by more than 2 within a 125-second interval, a link flap alarm is raised | Yes | IB, ETH, XDR |
| Underperforming-link | High BER counters | Yes | IB, ETH, XDR |
| Anomalous-port (Signal, Temperature) | Some counters are not in range | Yes | IB, ETH, XDR |
| Unreachable-device | The device cannot be pinged, and/or the agent is not deployed, or agent communication is failing | No | IB, ETH, XDR |
| Media Unplugged | Port is down and no cable is plugged in, or no transceiver is plugged in if it is an optical cable | Yes | IB, ETH, XDR |
| Unknown-neighbor | Port is up, but no peer info was found. One known instance is when the far end is not reachable | No | IB, ETH, XDR |
| Link Down, No signal | Port is down while transceivers are plugged in | Yes | IB, ETH, XDR |
| ErrDisable – Flap, Proto Down | Cumulus switch port locally disabled by Link Protection (≥5 flaps/10s), a defensive mechanism enabled by default | Yes | IB, ETH, XDR |
| ErrDisable-Rx | Interface-down events caused by a server NIC firmware bug (RX Disable) | Yes | IB, ETH, XDR |
| No-report | Node is reachable but no agent report has been received yet | No | IB, ETH, XDR |
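
The Flapping-link rule in the table (a carrier-transition counter sampled every 10 seconds, with an alarm when it grows by more than 2 within 125 seconds) can be sketched as a sliding-window check. The class `FlapDetector` and its method are hypothetical names, and the sketch ignores counter resets or wrap-around, which a real agent would have to handle.

```python
from collections import deque
from typing import Deque, Tuple

WINDOW_S = 125       # observation window from the Flapping-link rule
MAX_INCREMENT = 2    # an increase of more than this raises the alarm


class FlapDetector:
    """Hypothetical sliding-window check on one port's carrier-transition counter."""

    def __init__(self) -> None:
        self._samples: Deque[Tuple[float, int]] = deque()

    def add_sample(self, now: float, carrier_transitions: int) -> bool:
        """Record a sample and return True if a link flap alarm should be raised."""
        self._samples.append((now, carrier_transitions))
        # Discard samples that fell out of the window. The newest sample
        # always stays, so the deque never becomes empty here.
        while now - self._samples[0][0] > WINDOW_S:
            self._samples.popleft()
        oldest_count = self._samples[0][1]
        return carrier_transitions - oldest_count > MAX_INCREMENT
```

Feeding it one sample per 10-second poll, a counter that goes 0, 1, 2, 3 within 30 seconds raises the alarm on the fourth sample, because the increase of 3 exceeds 2.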

| Feature | Description | Supported Fabric |
|---|---|---|
| Circuit View | Shows the links being monitored | IB, ETH, XDR, GB200 |
| Port Status | Shows whether the port is up or down | IB, ETH, XDR, GB200 |
| Link Syndrome | Shows the cable/port issues for the link | IB, ETH, XDR, GB200 |
| BER Stats | Shows the effective BER, raw BER, and grading based on BER | IB, ETH, XDR, GB200 |
| Report Anomalies | Shows the anomalies detected | IB, ETH, XDR |
| Flapping Status | Shows the advanced flapping stats | IB, ETH |
| Rack View | Shows the switches and hosts on the rack | IB, ETH, XDR, GB200 |
| System Admin | Starts the bring-up service, loads the topology, sets credentials, and starts validation. Shows an overview of the validation session and the reporting status of agents. Manages GUI users and displays Collector resource utilization. | IB, ETH, XDR, GB200 |
| Golden BER Test | Creates a test to monitor Bit Error Rates (BER) and analyze the interface counters | IB, ETH |
| Amber Collection Test | Collects amber files on demand | IB, ETH |
| Advanced Flapping Test | Creates a flapping test to analyze the metrics that could lead to a flapping event | IB, ETH |
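
To make the BER limits listed under Underperforming (BER) concrete, here is a minimal sketch of a pass/fail check. It is not the grading used by BER Stats, which is not described here. The function name `ber_within_limits` is hypothetical, a measurement of exactly 6 hours is assumed to use the long-measurement limit, and the rule about zero effective errors in the first 125 minutes is not modeled.

```python
RAW_BER_MAX = 1e-6            # raw BER limit
EFF_BER_MAX_SHORT = 1.5e-254  # effective BER limit for measurements under 6 hours
EFF_BER_MAX_LONG = 1e-15      # effective BER limit for measurements of 6 hours or more
SIX_HOURS_S = 6 * 3600


def ber_within_limits(raw_ber: float, effective_ber: float, measured_s: float) -> bool:
    """Return True if both BER values meet the stated limits."""
    # Assumption: at exactly 6 hours the long-measurement limit applies.
    if measured_s >= SIX_HOURS_S:
        eff_limit = EFF_BER_MAX_LONG
    else:
        eff_limit = EFF_BER_MAX_SHORT
    return raw_ber <= RAW_BER_MAX and effective_ber <= eff_limit
```

For instance, an effective BER of 1e-16 passes after a 7-hour measurement but would fail after a 1-hour measurement.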

© Copyright 2025, NVIDIA. Last updated on May 5, 2025.