NVIDIA UFM Cable Validation Tool v1.4.0

Introduction

The Cable Validation Tool (CVT) is designed to ensure the accuracy and quality of network cluster wiring. Its primary purpose is to validate the connectivity of physical links within the cluster and to verify high-quality communication between the network components, thereby maintaining the integrity of these connections.

The Collector, also referred to as the "bring-up server," serves as the central component of the system and operates as a Docker container. It can be deployed on any machine connected to the management network of the switches. This deployment enables seamless communication with the switches. For large-scale systems, the Collector relies on dedicated agents installed on each switch. These agents are responsible for verifying the connections between the switches.


Collector Responsibilities

The Collector performs the following critical tasks:

  • Deployment and Execution: It is installed and executed on a server with network access to all nodes requiring validation.

  • Topology Validation: Reads the topology files (P2P, topo or dot), which serve as the authoritative source for validating the physical link connections in the fabric topology.

  • Agent Management:

    • Deploys agents on all nodes.

    • Monitors agents' health.

    • Supports external (unmanaged) agent deployments, with the Collector only monitoring their health.

  • Data Collection and Processing: Collects and processes agents' reports from all validated nodes.

  • User Interface and Reporting:

    • Displays validation results through a web page, along with recommended remediation steps.

    • Provides data visualization options, including aggregation, sorting, and filtering, and supports downloading reports in CSV format. REST APIs are also available for integration with other systems.

Agent Responsibilities

Agents are installed on all switches and servers within the cluster. Their key functions include:

  • Deployment: Deployed on all switches and servers.

  • Real-Time Monitoring:

    • Monitor node and link statuses every 10 seconds. Agents detect changes in link states and, when a change occurs, send an updated status report to the Collector.

    • If no changes are detected, agents send a periodic status report every 10 minutes.

    • Amberfile collection runs when a state change occurs and can take 40-50 seconds.

  • Event-Driven Reporting:

    • Upon receiving a "start_validation" message from the Collector, agents initiate status reporting.

    • Reports are triggered in two scenarios:

      • When a link status change is detected (ad-hoc report).

      • Every 10 minutes as part of routine reporting.
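The reporting rules above can be sketched as a small decision helper. This is an illustrative sketch only, not the agent's actual code: the `ReportScheduler` class, its method names, and the choice to send one initial report on the first poll are assumptions. Only the 10-second polling interval and the 10-minute periodic interval come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional

POLL_INTERVAL_S = 10          # agents sample node and link status every 10 seconds
PERIODIC_REPORT_S = 10 * 60   # routine report interval when nothing has changed


@dataclass
class ReportScheduler:
    """Hypothetical helper: decides per poll whether a report should be sent."""

    last_states: Optional[Dict[str, str]] = None
    last_report_time: Optional[float] = None

    def poll(self, now: float, states: Dict[str, str]) -> Optional[str]:
        """Return 'ad-hoc', 'periodic', or None for this polling cycle."""
        changed = self.last_states is not None and states != self.last_states
        first = self.last_report_time is None
        due = (not first) and (now - self.last_report_time >= PERIODIC_REPORT_S)
        self.last_states = dict(states)

        if changed:
            kind = "ad-hoc"        # a link state changed since the last poll
        elif first or due:
            kind = "periodic"      # assumption: the first poll also sends a report
        else:
            return None            # nothing changed and the periodic timer has not expired
        self.last_report_time = now
        return kind
```

In this sketch a caller would invoke `poll()` once every `POLL_INTERVAL_S` seconds with the current port states. Note that an ad-hoc report also restarts the 10-minute timer, which is a design choice of the sketch rather than documented behavior.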

Further details on the Collector and Cable Agents, including their operational workflows, will be discussed in the subsequent chapters.

Supported Fabric Types

The UFM Cable Validation Tool is compatible with three types of fabric:

  1. InfiniBand

  2. NVOS

  3. Ethernet

CVT detects the following symptoms and issues:

  • Validate Physical Connections (Cable, End points) / Miswiring

    • Wrong-neighbor – the cable connects to a different device than the topology file (P2P, topo or dot) dictates

    • Wrong-port – the cable connects to the expected device but on the wrong port

    • Unknown-neighbor – the cable connects to a device not mentioned in the topology file (P2P, topo or dot), or LLDP is not enabled or is failing

    • Extra Cable – a cable was found connected but is not part of the loaded topology file (P2P, topo or dot)

    • No Transceiver – a transceiver is not present in the port

  • Validate Layer1 Link Integrity (Bit Error Rate, Lane powers, Temperature)

    • Flapping link – the link state has transitioned up->down->up, either on its own or due to external actions

    • Link Down, No Signal – fiber is not connected or is broken

    • ErrDisable-Rx – interface down events caused by server NIC firmware bugs (RX Disable)

    • Err-Disable-Flap – port disabled by the Link Protection feature (5 flaps/10 sec.) due to excessive flaps

    • Anomalous-Port – out-of-range parameters such as transceiver lane signal strength or transceiver temperature

    • Underperforming (BER) – Bit Error Rate outside the following limits:

      • Effective BER errors should be 0 during the first 125 mins of the link being up

      • Raw BER should be ≤ 1e-6

      • Effective BER should be ≤ 1.5E-254 for ≤ 6hrs measurements and ≤ 1E-15 for ≥ 6hrs measurements

  • Triage Non-cable issues (Provisioning, CVT issues) requiring Escalations

    • AdminDown – Spectrum switch port is administratively shut down

    • Negotiation Fail – interface issue caused by a speed or duplex mismatch between devices

    • No-report – agent communication is working but no report has been received

    • Unreachable-device – agent is not installed, not running, or not reachable (e.g. port 8251 is not open in the switch configuration)
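To illustrate how the miswiring categories relate to each other, the following sketch classifies a single link by comparing the observed peer against the expected topology. It is a simplified interpretation, not CVT's implementation: the `classify_link` function, the `(device, port)` tuple representation, and the order in which the checks are applied are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, str]  # (device name, port name); hypothetical representation


def classify_link(
    local: Endpoint,
    observed_peer: Optional[Endpoint],
    expected: Dict[Endpoint, Endpoint],
    transceiver_present: bool = True,
) -> str:
    """Classify one link against the expected topology (illustrative only)."""
    if not transceiver_present:
        return "No Transceiver"
    if observed_peer is None:
        # No peer information, e.g. LLDP is not enabled or is failing.
        return "Unknown-neighbor"

    # Every device that appears on either side of an expected link.
    known_devices = {dev for a, b in expected.items() for dev in (a[0], b[0])}
    if observed_peer[0] not in known_devices:
        return "Unknown-neighbor"

    expected_peer = expected.get(local)
    if expected_peer is None:
        # A cable is connected on a port that the topology file does not list.
        return "Extra Cable"
    if observed_peer == expected_peer:
        return "OK"
    if observed_peer[0] == expected_peer[0]:
        return "Wrong-port"
    return "Wrong-neighbor"


# Example topology: leaf1 port p1 should connect to spine1 port p1.
topology = {
    ("leaf1", "p1"): ("spine1", "p1"),
    ("spine1", "p1"): ("leaf1", "p1"),
    ("leaf1", "p2"): ("spine2", "p1"),
    ("spine2", "p1"): ("leaf1", "p2"),
}
```

For instance, `classify_link(("leaf1", "p1"), ("spine1", "p2"), topology)` yields "Wrong-port" because the device matches but the port does not.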

| Syndrome | Description | Non-admin users | Supported fabric |
|---|---|---|---|
| Negotiation Fail | Negotiation of speed, FEC, or configuration issue | Yes | ib, eth, nvos |
| AdminDown | Link is disabled administratively | Yes | ib, eth, nvos |
| Wrong-neighbor | Port is connected to the wrong peer node | Yes | ib, eth, nvos |
| Wrong-port | Port is connected to the wrong port on the correct peer node | Yes | ib, eth, nvos |
| Extra-cable | Port is connected but the neighbor is not in the P2P topology | Yes | ib, eth, nvos |
| Flapping-link | On switches: carrier transitions are monitored every 10 seconds. If the count increments by more than 2 in a 125-second interval, a link flap alarm is raised | Yes | ib, eth, nvos |
| Underperforming-link | High BER counters | Yes | ib, eth |
| Anomalous-port (Signal, Temperature) | Some counters are not in range | Yes | ib, eth, nvos |
| Unreachable-device | Cannot ping it, and/or agent not deployed, or agent communication is failing | No | ib, eth, nvos |
| No Transceiver | Port is down and transceiver is not plugged in | Yes | ib, eth, nvos |
| Unknown-neighbor | Port is up, but no peer info was found. One known instance is when the far end is not reachable | No | ib, eth, nvos |
| Link Down, No signal | Port is down while transceivers are plugged in | Yes | ib, eth, nvos |
| ErrDisable – Flap, Proto Down | Cumulus switch port locally disabled by Link Protection (≥5 flaps/10s), a defensive mechanism enabled by default | Yes | ib, eth, nvos |
| ErrDisable-Rx | Interface down events caused by server NIC firmware bugs (RX Disable) | Yes | ib, eth, nvos |
| No-report | Node is reachable but no agent report has been received yet | No | ib, eth, nvos |
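The Flapping-link rule in the table (a carrier-transition counter sampled every 10 seconds, with an alarm when it grows by more than 2 within 125 seconds) can be sketched as follows. This is a hedged illustration, not the agent's code: the `FlapDetector` class is hypothetical, and it assumes a sliding window and a monotonically increasing counter, neither of which the source specifies.

```python
from collections import deque
from typing import Deque, Tuple

WINDOW_S = 125       # observation interval from the table
MAX_INCREMENT = 2    # an increase of more than this raises a flap alarm


class FlapDetector:
    """Hypothetical sliding-window check on a carrier-transition counter."""

    def __init__(self) -> None:
        self._samples: Deque[Tuple[float, int]] = deque()

    def add_sample(self, now: float, carrier_transitions: int) -> bool:
        """Record one sample and return True if a flap alarm should be raised."""
        self._samples.append((now, carrier_transitions))
        # Discard samples that have fallen outside the 125-second window.
        while now - self._samples[0][0] > WINDOW_S:
            self._samples.popleft()
        oldest_count = self._samples[0][1]
        return carrier_transitions - oldest_count > MAX_INCREMENT
```

Calling `add_sample()` once per 10-second poll keeps at most about 13 samples in memory per port, since older samples are dropped as soon as they leave the window.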

© Copyright 2025, NVIDIA. Last updated on Mar 26, 2025.