InfiniBand Cluster Bring-up Procedure

Cable Validation

The NVIDIA Cable Validation tool is a platform for connectivity validation.

Cable validation is the process of checking the actual cable deployment against the expected topology (from the planning stage).

This process can run in parallel with, or in between, the deployment of portions of the cluster, to confirm that the equipment deployed so far is connected as expected.

The tool does not depend on a working SM, or on any working IB communication at all. Instead, it uses the management interfaces of all the connected network devices.

This makes it possible to bring up the cluster gradually / incrementally, and to validate the deployment in smaller pieces rather than all at once.

NOTE: the tool depends on connectivity from the host it runs on (the UFM host) to the management network of the network devices, i.e., the network devices should be reachable from the host through the management network.
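
The reachability requirement in this note can be checked before deploying agents. The following is a minimal Python sketch, not part of the tool: the function names `ping_once` and `unreachable` are hypothetical, the `ping` flags assume the Linux iputils syntax, and the list of switch management addresses has to come from your own planning data.

```python
import subprocess
from typing import Callable, Iterable, List


def ping_once(address: str) -> bool:
    """Send one ICMP echo request (Linux iputils 'ping' syntax assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable(
    addresses: Iterable[str],
    probe: Callable[[str], bool] = ping_once,
) -> List[str]:
    """Return the addresses that did not answer the probe, in input order."""
    return [address for address in addresses if not probe(address)]


# Usage on the UFM host (addresses are placeholders):
#   print(unreachable(["192.0.2.11", "192.0.2.12"]))
```

The probe is passed in as a parameter so that the check can be exercised without network access, or swapped for another reachability test such as an SSH connection attempt.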

NOTE: the validation works with managed switches only.

Main flow of how the tool works

  1. load the expected/planned topology into the tool

  2. install and run agents on all reachable managed switches (through the management interface/network)

  3. start the validation, which triggers each agent to:

    1. perform an IB neighbor search

    2. compare the result to the expected topology

    3. report it to the main tool

  4. aggregate the reports of all switches, and display the results/issues

  5. repeat steps 3-4 every ~10 minutes by default (the interval can change in different situations)
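
The per-agent comparison in step 3 and the aggregation in step 4 can be pictured with a minimal Python sketch. It is not the tool's implementation: the link representation (local port mapped to a peer name and peer port), the function names `compare_links` and `aggregate`, the switch names, and the issue wording are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: local port number -> (peer name, peer port).
PortMap = Dict[int, Tuple[str, int]]


def compare_links(expected: PortMap, observed: PortMap) -> List[str]:
    """Compare one switch's observed neighbors with the planned ones."""
    issues = []
    for port, peer in sorted(expected.items()):
        seen = observed.get(port)
        if seen is None:
            issues.append(
                f"port {port}: no neighbor found, expected {peer[0]}/{peer[1]}"
            )
        elif seen != peer:
            issues.append(
                f"port {port}: wrong neighbor {seen[0]}/{seen[1]}, "
                f"expected {peer[0]}/{peer[1]}"
            )
    # Links that exist on the switch but not in the plan are reported too.
    for port in sorted(set(observed) - set(expected)):
        peer = observed[port]
        issues.append(f"port {port}: unexpected neighbor {peer[0]}/{peer[1]}")
    return issues


def aggregate(reports: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only the switches that reported at least one issue."""
    return {switch: issues for switch, issues in reports.items() if issues}


# Made-up data: port 2 is cabled to the wrong peer port, port 3 is unplanned.
expected = {1: ("leaf-2", 1), 2: ("spine-1", 5)}
observed = {1: ("leaf-2", 1), 2: ("spine-1", 6), 3: ("spine-2", 1)}
report = aggregate({"leaf-1": compare_links(expected, observed), "leaf-2": []})
print(report)
```

Keeping the comparison local to each switch mirrors the flow above, where every agent reports only its own findings and the main tool merges them.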

Main steps of the procedure

  1. load and activate the cable validation tool as a UFM plugin

  2. load the planned topology into the tool

  3. deploy agents on all reachable switches

  4. start the validation

  5. check the results

more detailed information can be found HERE

load and activate the cable validation tool as a UFM plugin

  1. in the UFM GUI, click 'Settings' in the left-side main menu

  2. open the 'Plugin Management' tab

  3. click the green 'Upload new plugin's image' button

    1. if loading from a local file, browse to the image file location

    2. if pulling from an online repository, this one can be used: https://hub.docker.com/r/mellanox/ufm-plugin-cablevalidation/tags

  4. the progress of the image pull/upload is displayed under 'Jobs' (in the left-side menu)

  5. when the job is completed, the plugin appears in the 'Plugin Management' tab, back in the 'Settings' window

  6. right-click 'cablevalidation' and select 'Add'

  7. after activation, you'll be asked to refresh the UFM GUI. 'Cable Validation' then becomes visible in the left-side menu (with no data yet in its window)

verify that the tool is using a reachable management interface

  1. in a terminal, connect to the UFM host (the primary/master)

  2. check UFM's default management address

    ~# hostname -i

  3. if the switches can reach (ping) the address in the output, this stage is done. otherwise, continue from step 4.

  4. look for a management interface (using the 'ifconfig' command) that is reachable from a switch (ping the interface's address from the switch)

  5. enter the cable validation container

    ~# docker ps
    CONTAINER ID   IMAGE                                        COMMAND                  CREATED         STATUS         PORTS   NAMES
    d6170edaf7ff   mellanox/ufm-plugin-cablevalidation:latest   "/usr/bin/supervisor…"   4 minutes ago   Up 4 minutes           ufm-plugin-cablevalidation
    d5919cfda713   mellanox/ufm-enterprise:latest               "/bin/bash /usr/sbin…"   6 hours ago     Up 6 hours             ufm

    ~# docker exec -it ufm-plugin-cablevalidation bash
    /#

  6. edit the config file at: /config/config.cfg

  7. set the following env variable there to the name of the chosen management interface, and save the file

    AGENTS_IFC_NAME=<interface-name>
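
Steps 6 and 7 can also be scripted. The sketch below is a minimal Python illustration under stated assumptions: it treats the content of /config/config.cfg as flat `KEY=VALUE` lines (only the `AGENTS_IFC_NAME=<interface-name>` line is confirmed by this procedure), the function name `set_agents_ifc` is hypothetical, and `eno1` is a placeholder interface name. If the real file uses sections or another syntax, edit it by hand instead.

```python
from typing import List


def set_agents_ifc(config_text: str, ifc_name: str) -> str:
    """Return config_text with AGENTS_IFC_NAME set to ifc_name."""
    key = "AGENTS_IFC_NAME"
    new_line = f"{key}={ifc_name}"
    lines: List[str] = config_text.splitlines()
    replaced = False
    for index, line in enumerate(lines):
        # Match an existing 'AGENTS_IFC_NAME=...' line (spaces tolerated).
        if "=" in line and line.split("=", 1)[0].strip() == key:
            lines[index] = new_line
            replaced = True
    if not replaced:
        # No existing entry: append one at the end of the file.
        lines.append(new_line)
    return "\n".join(lines) + "\n"


print(set_agents_ifc("AGENTS_IFC_NAME=eth0\n", "eno1"), end="")
```

The function works on text rather than on the file itself, so the result can be reviewed before it is written back inside the container.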

load the planned topology

  1. in a terminal, connect to the UFM host (the primary/master)

  2. list the containers and verify that the cable validation container is there

    ~# docker ps
    CONTAINER ID   IMAGE                                        COMMAND                  CREATED         STATUS         PORTS   NAMES
    d6170edaf7ff   mellanox/ufm-plugin-cablevalidation:latest   "/usr/bin/supervisor…"   4 minutes ago   Up 4 minutes           ufm-plugin-cablevalidation
    d5919cfda713   mellanox/ufm-enterprise:latest               "/bin/bash /usr/sbin…"   6 hours ago     Up 6 hours             ufm

  3. upload the topo file into the cable validation container

    ~# docker cp output.topo ufm-plugin-cablevalidation:/tmp/output.topo
    Successfully copied 2.56kB to ufm-plugin-cablevalidation:/tmp/output.topo

  4. enter the cable validation container terminal

    ~# docker exec -it ufm-plugin-cablevalidation bash

  5. enter the tool CLI shell

    /# bringupcli

    to exit the bringup CLI shell, type 'exit' or press CTRL+D

  6. FYI: you can use 'help' to check which commands/operations are available, or 'help <command>' for a description of a specific command. auto-completion is also available for these commands

    Cable Bringup: help

    Documented commands (type help <topic>):
    ========================================
    add_certificate      load               remove_single_agent  start_validation
    amber_show_latest    load_clusters      set_default_creds    stop_validation
    check_switch_status  load_ip            set_node_creds       version
    deploy_all_agents    load_ptp           show_clusters
    deploy_single_agent  load_topo          show_switch_history
    exit                 remove_all_agents  show_switches

    Cable Bringup:

    Cable Bringup: help load_topo

    load_topo filename dns=true/false [cluster=<cluster name>]

    default dns=true
    If no dns server to resolve hostnames in topo file, you should set dns=false and provide IP addresses file.
    when true, no need to provide IP addresses.
    if cluster name is provided it will be set to the provided value, else it will be set to 'default'.

    Cable Bringup:

  7. load the topo using 'load_topo' command

    Cable Bringup: load_topo /tmp/output.topo
    Load topology from file: /tmp/output.topo
    Loaded 3 switches, 12 links.
    Loaded IP addresses of 3 switches!
    Cable Bringup:
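
When the same topology is loaded repeatedly, for example after each deployment stage, the command lines from steps 3 and 7 can be generated rather than typed. The following is a minimal Python sketch: the helper names `docker_cp_command` and `load_topo_command` are hypothetical, the cluster name `pod1` is a placeholder, and the `load_topo` argument order simply follows the 'help load_topo' text shown above. How the IP addresses file is supplied when dns=false is not covered here.

```python
from typing import List, Optional

# Container name as listed by 'docker ps' in this procedure.
CONTAINER = "ufm-plugin-cablevalidation"


def docker_cp_command(local_topo: str, remote_path: str = "/tmp/output.topo") -> List[str]:
    """Build the argv that copies the topo file into the plugin container (step 3)."""
    return ["docker", "cp", local_topo, f"{CONTAINER}:{remote_path}"]


def load_topo_command(
    remote_path: str,
    dns: bool = True,
    cluster: Optional[str] = None,
) -> str:
    """Build the line to type at the 'Cable Bringup:' prompt (step 7)."""
    parts = ["load_topo", remote_path, f"dns={'true' if dns else 'false'}"]
    if cluster is not None:
        parts.append(f"cluster={cluster}")
    return " ".join(parts)


print(" ".join(docker_cp_command("output.topo")))
print(load_topo_command("/tmp/output.topo", dns=False, cluster="pod1"))
```

The argv list can be passed to `subprocess.run(..., check=True)` on the UFM host, while the `load_topo` line is meant to be entered in the bringupcli shell.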

deploy agents on all reachable switches

deploy agents on all managed switches, and wait for all agents to be installed and started

Cable Bringup: deploy_all_agents


start the validation

start the validation process

Cable Bringup: start_validation


check results

the results can be seen in the terminal or in the 'Cable Validation' window of the UFM GUI

the output shows the syndromes / issues found in the actual connections, compared to the expected ones (from the topo file).

the expectation is to have no issues related to the currently deployed connections.
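
During an incremental bring-up, the topo file may describe equipment that is not cabled yet, so one way to read the expectation above is to look only at issues for switches that are already deployed. A minimal Python sketch of that filtering follows. The data shape (switch name mapped to a list of issue strings), the function name `issues_in_scope`, and the switch names are assumptions, not the tool's report format.

```python
from typing import Dict, List, Set


def issues_in_scope(
    issues: Dict[str, List[str]],
    deployed: Set[str],
) -> Dict[str, List[str]]:
    """Keep only non-empty issue lists for switches that are already deployed."""
    return {
        switch: found
        for switch, found in issues.items()
        if switch in deployed and found
    }


# Made-up example: 'leaf-3' is planned but not cabled yet.
issues = {
    "leaf-1": [],
    "leaf-2": ["port 4: wrong neighbor"],
    "leaf-3": ["port 1: no neighbor found"],
}
print(issues_in_scope(issues, deployed={"leaf-1", "leaf-2"}))
```

An empty result for the deployed set corresponds to the expectation stated above. Anything left in the result points at cabling to re-check before moving on to the next portion of the cluster.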

© Copyright 2025, NVIDIA. Last updated on Jul 15, 2025.