Cable Validation Tool

NVIDIA UFM Cyber-AI Documentation v2.3.0

The Cable Validation tool contains two sub-projects: Collector and Cable Agent.

The Collector is the main module. Deploy and run it on a host with management network access. An IB interface is not required on the host.

Deploy

Deploy the cables_bringup container on a host, as follows:

  1. docker load -i /tmp/cables_bringup_<version>.tar.gz

  2. docker run --name cables_bringup -itd --network=host cables_bringup

  3. docker exec -it cables_bringup /bin/bash

Setting Docker Environment

Specifying the Network Interface

If the host has multiple network interfaces and the switches are connected through an interface other than the default management interface, specify that interface with the AGENTS_IFC_NAME environment variable. For example, assuming the interface name is eno3:

docker run --name cables_bringup -itd --network=host --env AGENTS_IFC_NAME=eno3 cables_bringup


Adding Hostnames

If the switches are not configured in the DNS server, add their hostnames with the --add-host option when running the container. For example (assuming the switch name is switch-3245fa and its IP is 192.168.1.1):

docker run --name cables_bringup -itd --network=host --add-host=switch-3245fa:192.168.1.1 cables_bringup


Using Volumes

Volumes can be used for data persistence or for easier file transfer to the cables_bringup container. For data persistence, the volume must be mapped to /cable_bringup_root in the container. This volume can also be used for loading topology files. Example:

docker run --name cables_bringup -itd --network=host -v /opt/bringup_data:/cable_bringup_root cables_bringup
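The three options above (interface, extra hostnames, volume) can be combined in a single docker run invocation. The following is a minimal Python sketch that assembles such a command line; the function name build_run_command and its parameters are hypothetical helpers for illustration, and only the docker flags shown in this section are used.

```python
import shlex
from typing import Dict, List, Optional


def build_run_command(
    interface: Optional[str] = None,
    extra_hosts: Optional[Dict[str, str]] = None,
    data_dir: Optional[str] = None,
    image: str = "cables_bringup",
    name: str = "cables_bringup",
) -> List[str]:
    """Return the argv list for `docker run` using the options from this section."""
    cmd = ["docker", "run", "--name", name, "-itd", "--network=host"]
    if interface:
        # Interface used to reach the switches (AGENTS_IFC_NAME)
        cmd += ["--env", f"AGENTS_IFC_NAME={interface}"]
    for host, ip in (extra_hosts or {}).items():
        # One --add-host entry per switch that is missing from DNS
        cmd.append(f"--add-host={host}:{ip}")
    if data_dir:
        # Persistent data must be mapped to /cable_bringup_root
        cmd += ["-v", f"{data_dir}:/cable_bringup_root"]
    cmd.append(image)
    return cmd


# Print the command line instead of running it, so it can be reviewed first.
print(shlex.join(build_run_command(
    interface="eno3",
    extra_hosts={"switch-3245fa": "192.168.1.1"},
    data_dir="/opt/bringup_data",
)))
```

The list form can be passed directly to subprocess.run if the command should be executed from a script.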

Running bringup CLI

  1. Run bringupcli in the container using docker exec:

    docker exec -it cables_bringup bringupcli

  2. Alternatively, open a bash shell in the container and run bringupcli from anywhere within the container:

    docker exec -it cables_bringup /bin/bash

bringupcli Usage

bringupcli accepts optional command-line arguments. See the usage below for details:

Note

root@r-ufm65:/# bringupcli -h
usage: bringupcli [-h] [-V] [-k]

Optional Arguments:

Argument                     Description
-------------------------    ------------------------------------
-h, --help                   Show this help message and exit
-V, --version                Show program version number and exit
-k, --kill-other-sessions    Kill other CLI sessions if existent

To initialize the tool, perform the following:

  1. Load the fabric topology file:

    load_topo <topo filename>             (topo file extension)
    load_ptp <topo filename>              (Excel file extension)
    load_ip <ip filename>
    load <topo filename> <ip filename>    (both topo and IPs)

  2. Set the credentials for the switches. Use set_default_creds/set_node_creds to set the credentials.

  3. Deploy the agent on all switches. Run:

    deploy_all_agents

Run bringup GUI

  1. Open the following URL in the browser: https://<bringup_machine_ip>/cables_validation

  2. Enter default credentials in the login page.

  3. User management is not supported in the current version. To manage users manually, use the htpasswd Linux utility:

    1. In the bringup container, locate the .htaccess file at ${BRINGUP_CONF_APACHE_PATH}/.htaccess

    2. Use htpasswd to add, modify or delete users.

  4. The user may replace the default self-signed certificate, which is located by default in the container at:

    SSLCertificateFile ${BRINGUP_CONF_APACHE_PATH}/certs/cv-cert.crt
    SSLCertificateKeyFile ${BRINGUP_CONF_APACHE_PATH}/private/cv-cert.key

Validations

  • show_switches: Show list of loaded switches as loaded from the topology file

  • check_switch_status: Check switch connectivity status (Ping/JSON-API/Agent)

  • start_validation: Push topology to switches and get validation reports

  • stop_validation: Unsubscribe from getting switches updates

Other commands

  • show_switch_history: Lists data files collected from switches in the last days

  • amber_show_latest: Shows latest collected amber data from switches

Troubleshooting

  • deploy_single_agent

  • deploy_all_agents

  • remove_all_agents

  • remove_single_agent

Complete CLI commands reference

  1. load_topo - Loads topology file (topo file extension).
    load_topo <filename> dns=true: assumes that DNS is active and that the switches can be accessed by hostname. The default is dns=true.

    A topo file example:

    MQM8700 sw-hdr-proton01 CFG: main=4x
        P1 -4x-50G-> sw-hdr-proton02 P1
        P2 -4x-50G-> sw-hdr-proton02 P2
        P3 -4x-50G-> HCA_12 swx-proton03 mlx5_0/P1
        P4 -4x-50G-> HCA_12 swx-proton04 mlx5_2/P1

  2. load_ptp - Loads PTP topology file (Excel file).
    load_ptp <filename> sheets="sheet 1,my-sheet" dns=true: assumes that DNS is active and that the switches can be accessed by hostname. The default is dns=true.
    If the sheets argument is provided, only the given sheets are loaded; otherwise, all sheets are loaded. An example of a sheet in the PTP file:

    rack   U    Name          HCA/Port   Rack   U    Name                 Port
    ----   --   -----------   --------   ----   --   ------------------   ----
    316    22   c-csi-0329s   1          R113   22   c-csi-mqm9700-0327   1
    316    24   c-csi-0331s   1          R113   22   c-csi-mqm9700-0327   1

  3. load_ip - Loads switch IP addresses; can be used if DNS is inactive. It loads the IP/switch-name mapping, which allows reaching the switch via REST API to retrieve the local topology, GUID, etc. The file format is pairs of IP address and hostname. This file is used together with a 'topo' file when DNS is unavailable.
    An IP file example:

    # A comment
    10.0.0.30  switch1
    10.0.0.31  switch2

  4. load - Loads both the IP address file and the topo file. For example, load inputs/my-topo loads inputs/my-topo.topo and inputs/my-topo.ip

  5. show_switches - Shows the list of loaded switches as loaded from the topology file.
    Example output:

    MQM8700 sw-hdr-proton01
    -----------------------
    MQM8700 sw-hdr-proton01 P3  --> swx-proton03 mlx5_0     P1
    MQM8700 sw-hdr-proton01 P4  --> swx-proton04 mlx5_2     P1
    MQM8700 ufm-sw-hdr01
    --------------------
    MQM8700 ufm-sw-hdr01 P1  --> ufm-sw-hdr02   P1
    MQM8700 ufm-sw-hdr02
    --------------------
    MQM8700 ufm-sw-hdr02 P1  --> ufm-sw-hdr01   P1

  6. set_default_creds - Sets the default switch/host credentials to override the built-in default credentials. These credentials are used for communication with any switch that does not have specific credentials.

    set_default_creds user=<user> pwd=<pwd> [type=switch|host]

  7. set_node_creds - Sets the credentials for a specific switch/host. Use it when the switch credentials differ from the defaults.

    set_node_creds <switch> user=<user> pwd=<pwd>

  8. deploy_all_agents - Deploys agents on loaded switches that have no agents.

  9. deploy_single_agent - Deploys agent on a specific switch.

  10. remove_all_agents - Removes agents from loaded switches that have agents.

  11. remove_single_agent - Removes an agent from a specific switch.

  12. show_switch_history - Lists data files collected from switches over a past interval. Use the past argument to specify the history interval, for example show_switch_history past=3d. The default is one week (past=1w).

  13. amber_show_latest - Shows the latest collected amber data from switches

  14. check_switch_status - Checks switch connectivity status (Ping/JSON-API/Agent).
    Example output:

    Host                            IP              ping   JSONAPI   Agent
    -----------------------------   -------------   ----   -------   -----
    sw-hdr-proton01.mtr.labs.mlnx   10.209.44.74    True   True      True
    ufm-sw-hdr01.mtr.labs.mlnx      10.209.36.113   True   True      True
    ufm-sw-hdr02.mtr.labs.mlnx      10.209.36.122   True   True      True

  15. upgrade_switch_os - TBD

  16. start_validation - Initiates the validation routine: pushes the topology to the switches and gets validation reports. The optional timeout argument sets the time after which validation stops (for example, timeout=20m or timeout=2h). If timeout is not provided, use the stop_validation command to stop the validation. Usage: start_validation timeout=n (in seconds/minutes/hours/days).

  17. stop_validation - Stops the validation routine and unsubscribes from switch updates.

  18. version - Shows application version.

  19. exit - Exits the application.

  20. help - Shows a list of commands. For help on a specific command, run help <command>
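The IP file format described for load_ip (pairs of IP address and hostname, with # comment lines) is simple enough to check before loading it into the tool. The following is a minimal Python sketch of such a check; parse_ip_file is a hypothetical helper, not part of bringupcli, and treating text after a # on a data line as a comment is an assumption.

```python
from typing import Dict


def parse_ip_file(text: str) -> Dict[str, str]:
    """Return a hostname -> IP address mapping from the contents of an IP file."""
    mapping: Dict[str, str] = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        # Assumption: everything after '#' is a comment, also on data lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected '<ip> <hostname>', got {raw!r}"
            )
        ip, hostname = fields
        mapping[hostname] = ip
    return mapping


sample = """\
# A comment
10.0.0.30  switch1
10.0.0.31  switch2
"""
print(parse_ip_file(sample))
```

A malformed line raises ValueError with the line number, which makes it easier to locate typos in large files before running load_ip.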

Bringup Server REST API

The collector has a web server listening on two internal ports, 8251 and 8252. These ports are not exposed outside the machine. The bringup server runs on the Apache server, which uses the default HTTP/HTTPS ports. Changing the internal ports is not recommended, as this requires changing the Apache service configuration. The Apache service uses a self-signed certificate, which users can replace with their own certificate. All REST APIs can be accessed only over HTTPS. The supported REST APIs are listed below.

Login

To use a REST API, you need session credentials. To access the REST API with curl, first log in at the URL cablevalidation/login and save the cookie. Then use the saved cookie for subsequent requests.

# login and save cookie
curl -k -X POST -c cookies.txt -d "httpd_username=<user>" -d "httpd_password=<password>" https://127.0.0.1/cablevalidation/login

# use saved cookie for REST API requests
curl -k --cookie cookies.txt https://127.0.0.1/cablevalidation/report/validation
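The same login flow can be scripted without curl. The following is a minimal Python sketch using only the standard library; it mirrors the curl example (form fields httpd_username and httpd_password, cookie kept for later requests, certificate checks disabled like curl -k). The helper names build_login_request and make_opener are hypothetical.

```python
import http.cookiejar
import ssl
import urllib.parse
import urllib.request


def build_login_request(host: str, user: str, password: str) -> urllib.request.Request:
    """Build the POST request for cablevalidation/login (form-encoded body)."""
    body = urllib.parse.urlencode(
        {"httpd_username": user, "httpd_password": password}
    ).encode("ascii")
    return urllib.request.Request(
        f"https://{host}/cablevalidation/login", data=body, method="POST"
    )


def make_opener(verify_tls: bool = False) -> urllib.request.OpenerDirector:
    """Return an opener that stores session cookies between requests."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Same effect as curl -k: accept the default self-signed certificate.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=context),
        urllib.request.HTTPCookieProcessor(jar),
    )


# Usage against a running collector (not executed here):
# opener = make_opener()
# opener.open(build_login_request("127.0.0.1", "<user>", "<password>"))
# report = opener.open("https://127.0.0.1/cablevalidation/report/validation").read()
```

If the self-signed certificate has been replaced with a trusted one, passing verify_tls=True keeps certificate verification enabled.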


Retrieving Validation Report

Run:

GET https://<host-ip-or-name>/cablevalidation/report/validation
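A report retrieved this way can be summarized with a few lines of Python. The sketch below assumes the layout shown in the output example in this section: a top-level "issues" list with one entry per node, each holding a nested "issues" list whose first element is the issue type (for example "Wrong-neighbor" or "Extra-cable"). The meaning of the remaining elements is not documented here, so the sketch does not interpret them. count_issues is a hypothetical helper.

```python
from collections import Counter
from typing import Any, Dict


def count_issues(report: Dict[str, Any]) -> Counter:
    """Count issue entries in a validation report by issue type."""
    counts: Counter = Counter()
    for node in report.get("issues", []):
        for issue in node.get("issues", []):
            # Assumption: the first element of each entry is the issue type.
            counts[issue[0]] += 1
    return counts


# Small hand-written sample in the same shape as the report.
sample_report = {
    "report": "ValidationReport",
    "issues": [
        {
            "node_desc": "MQM8700 sw-hdr-proton01",
            "issues": [
                ["Wrong-neighbor", "MQM8700 sw-hdr-proton01:P3",
                 "HCA_12 swx-proton03 mlx5_0:P1", "None:PNA"],
            ],
        },
        {
            "node_desc": "MQM8700 ufm-sw-hdr02",
            "issues": [
                ["Extra-cable", "MQM8700 ufm-sw-hdr02:P2",
                 "NONE", "MQM8700 ufm-sw-hdr01:P2"],
                ["Extra-cable", "MQM8700 ufm-sw-hdr02:P3",
                 "NONE", "MQM8700 ufm-sw-hdr01:P3"],
            ],
        },
    ],
}

for issue_type, count in sorted(count_issues(sample_report).items()):
    print(issue_type, count)
```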

Validation Report Output Example

curl -k https://swx-proton01/cablevalidation/report/validation | python3 -m json.tool
{
    "report": "ValidationReport",
    "stats": {
        "in_progress": 3,
        "no_issues": 0,
        "not_started": 0
    },
    "issues": [
        {
            "timestamp": 1666176949.5110743,
            "node_desc": "MQM8700 sw-hdr-proton01",
            "issues": [
                [
                    "Wrong-neighbor",
                    "MQM8700 sw-hdr-proton01:P3",
                    "HCA_12 swx-proton03 mlx5_0:P1",
                    "None:PNA"
                ],
                [
                    "Wrong-neighbor",
                    "MQM8700 sw-hdr-proton01:P4",
                    "HCA_12 swx-proton04 mlx5_2:P1",
                    "HCA_12 swx-proton04 mlx5_0:P1"
                ]
            ]
        },
        {
            "timestamp": 1666176949.4999607,
            "node_desc": "MQM8700 ufm-sw-hdr02",
            "issues": [
                [
                    "Extra-cable",
                    "MQM8700 ufm-sw-hdr02:P2",
                    "NONE",
                    "MQM8700 ufm-sw-hdr01:P2"
                ],
                [
                    "Extra-cable",
                    "MQM8700 ufm-sw-hdr02:P3",
                    "NONE",
                    "MQM8700 ufm-sw-hdr01:P3"
                ],
                [
                    "Extra-cable",
                    "MQM8700 ufm-sw-hdr02:P7",
                    "NONE",
                    "MQM8700 ufm-sw-hdr01:P7"
                ]
            ]
        },
        {
            "timestamp": 1666176949.4870453,
            "node_desc": "MQM8700 ufm-sw-hdr01",
            "issues": [
                [
                    "Extra-cable",
                    "MQM8700 ufm-sw-hdr01:P2",
                    "NONE",
                    "MQM8700 ufm-sw-hdr02:P2"
                ],
                [
                    "Extra-cable",
                    "MQM8700 ufm-sw-hdr01:P3",
                    "NONE",
                    "MQM8700 ufm-sw-hdr02:P3"
                ],
                [
                    "Extra-cable",
                    "MQM8700 ufm-sw-hdr01:P7",
                    "NONE",
                    "MQM8700 ufm-sw-hdr02:P7"
                ]
            ]
        }
    ]
}

Bringup commands support via REST API

The processing of bringup commands is not limited to the CLI; it can also be accomplished through the REST API.

Processing a Command

Run:

POST https://<host-ip-or-name>/cablevalidation/commands/{command_name} <command-data>

Process Command Example

The command body is a JSON dictionary of key-value arguments as described in the table below.

curl -k https://127.0.0.1/cablevalidation/commands/load_topo -d '{"files":["inputs/lab.topo"], "dns":true}' -X POST
Command load_topo completed successfully
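The same request can be built in Python. The following is a minimal sketch using the standard library; build_command_request is a hypothetical helper, and sending an empty JSON dictionary for commands without arguments is an assumption rather than documented behavior. As in the curl example, no Content-Type header is set explicitly.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_command_request(
    host: str, command: str, args: Optional[Dict[str, Any]] = None
) -> urllib.request.Request:
    """Build a POST request for cablevalidation/commands/<command>."""
    # Assumption: commands without arguments accept an empty JSON dictionary.
    body = json.dumps(args or {}).encode("utf-8")
    return urllib.request.Request(
        f"https://{host}/cablevalidation/commands/{command}",
        data=body,
        method="POST",
    )


request = build_command_request(
    "127.0.0.1", "load_topo", {"files": ["inputs/lab.topo"], "dns": True}
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

The request still has to be sent through a session that carries the login cookie, as described in the Login section.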


Supported Commands

Command               Async   Argument   Type   Mandatory
-------------------   -----   --------   ----   ---------
load_topo             False   dns        bool   False
                              files      list   True
load_ip               False   files      list   True
load_ptp              False   dns        bool   False
                              sheets     list   False
                              files      str    True
set_default_creds     False   user       str    True
                              pwd        str    True
                              type       str    False
set_node_creds        False   user       str    True
                              pwd        str    True
                              type       str    True
deploy_all_agents     True
deploy_single_agent   True    switch     str    True
remove_all_agents     True
remove_single_agent   True    switch     str    True
start_validation      True
stop_validation       True

Getting List of Supported Commands

The following request returns a JSON dictionary of all supported commands, with their arguments and whether each command is async or sync.

GET https://<host-ip-or-name>/cablevalidation/commands

Supported Commands Output Example

Output has been cut.

{
    "load_topo": {
        "args": {
            "dns": {
                "type": "bool",
                "mandatory": false
            },
            "files": {
                "type": "list",
                "mandatory": true
            }
        },
        "is_async": false
    }
}
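The dictionary returned by this endpoint can be used to check a command body before posting it. The following is a minimal Python sketch, assuming the layout shown in the output example ("args" with "type" and "mandatory" per argument) and assuming that the type names bool, str and list map to the corresponding JSON/Python types. validate_args is a hypothetical helper.

```python
from typing import Any, Dict, List

# Assumption: type names in the command list map to these Python types.
_TYPES = {"bool": bool, "str": str, "list": list}


def validate_args(
    spec: Dict[str, Any], command: str, args: Dict[str, Any]
) -> List[str]:
    """Return a list of problems found in args; an empty list means no problems."""
    if command not in spec:
        return [f"unknown command: {command}"]
    arg_specs = spec[command].get("args", {})
    problems: List[str] = []
    for name, info in arg_specs.items():
        if info.get("mandatory") and name not in args:
            problems.append(f"missing mandatory argument: {name}")
    for name, value in args.items():
        if name not in arg_specs:
            problems.append(f"unexpected argument: {name}")
            continue
        type_name = arg_specs[name].get("type")
        expected = _TYPES.get(type_name)
        if expected is not None and not isinstance(value, expected):
            problems.append(f"argument {name} should be {type_name}")
    return problems


spec = {
    "load_topo": {
        "args": {
            "dns": {"type": "bool", "mandatory": False},
            "files": {"type": "list", "mandatory": True},
        },
        "is_async": False,
    }
}
print(validate_args(spec, "load_topo", {"dns": "yes"}))
```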

Rack View

Rack and unit information is available when a PTP Excel file is loaded. Topo files do not contain this information, so rack view is not available for them.

Rack view is supported via two REST APIs.

Getting List of Racks

The following command returns a JSON list of all loaded racks.

GET https://<host-ip-or-name>/resources/racks

Racks List Output Example

[
    "1108",
    "1106"
]

Getting Rack View of a Specific Rack

The following command returns a JSON dictionary with rack details.

GET https://<host-ip-or-name>/resources/racks/{rack-name}

Rack View Output Example

{
    "name": "1108",
    "units": [
        {
            "nodedesc": "MSB7800 r-ufm-sw10",
            "ports": [
                {
                    "port": "P25",
                    "syndrome": "Wrong-neighbor"
                },
                {
                    "port": "P26",
                    "syndrome": "Wrong-neighbor"
                },
                {
                    "port": "P27",
                    "syndrome": "Active"
                },
                {
                    "port": "P28",
                    "syndrome": "Active"
                }
            ],
            "unit": "40"
        }
    ]
}
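A rack view response can be filtered to list only the ports that need attention. The following is a minimal Python sketch, assuming the layout shown in the example above and assuming that a syndrome of "Active" means the port has no issue; other syndrome values are reported as-is. ports_needing_attention is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple


def ports_needing_attention(rack: Dict[str, Any]) -> List[Tuple[str, str, str, str]]:
    """Return (unit, nodedesc, port, syndrome) for every port that is not Active."""
    result: List[Tuple[str, str, str, str]] = []
    for unit in rack.get("units", []):
        for port in unit.get("ports", []):
            # Assumption: "Active" is the only syndrome that indicates a healthy port.
            if port.get("syndrome") != "Active":
                result.append(
                    (unit.get("unit"), unit.get("nodedesc"),
                     port.get("port"), port.get("syndrome"))
                )
    return result


rack = {
    "name": "1108",
    "units": [
        {
            "nodedesc": "MSB7800 r-ufm-sw10",
            "ports": [
                {"port": "P25", "syndrome": "Wrong-neighbor"},
                {"port": "P27", "syndrome": "Active"},
            ],
            "unit": "40",
        }
    ],
}
for entry in ports_needing_attention(rack):
    print(*entry)
```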

Build Collector

Note: This section is for development only.

Run build/build_collector. A new Docker image is created:

image /tmp/cablesbringup_<version>.tar.gz was created

Run bringup GUI from source

  1. If you are running the bringup from the Docker container, open the following URL in the browser: https://<bringup_machine_ip>/cables_validation

  2. If you are running the bringup from the source code, first run build/build_gui.sh to compile the GUI code. The build script creates the media directory under cables_validation/src/collector.

  3. TBD: How to run GUI, without Apache

Build Agent

Warning

This section is for development only.

Running the build/build_agent script creates a new Docker image and saves it as /tmp/cables_agent_<version>.tar.gz. This file can be used to distribute the Docker image to other machines or to archive it.

Check if cable agent runs on the switch:

  1. Run:

    ssh admin@<switch-ip-or-name>

  2. enable

  3. show docker images

  4. exit

If the cables agent is running on the switch, the following output is displayed.

----------------------------------------------------------------------------
Image                                 Version      Created            Size
----------------------------------------------------------------------------
cables_agent                          latest       13 hours ago       788MB


Deploy on the Switch

Usually, it is not necessary to manually deploy the agent onto the switch, as it is recommended to use the deploy_all_agents or deploy_single_agent commands from the bringup CLI. However, in instances where manual deployment is required, the following commands can be executed:

  1. enable

  2. configure terminal

  3. no docker shutdown

  4. image fetch scp://<user>:<pwd>@<hostname>/tmp/cables_agent_<version>.tar.gz cables_agent_latest.tar.gz

  5. docker load cables_agent_latest.tar.gz

  6. docker start cables_agent latest cables_agent now-and-init privileged network

For cleanup, run:

  1. docker no start cables_agent

  2. docker remove image cables_agent latest

  3. image delete cables_agent_latest.tar.gz

To enter terminal in the container running on the switch, run:

  1. enable

  2. configure terminal

  3. docker exec cables_agent /bin/bash

Cables Agent REST API

The agent has a web server listening on port 8251. The following two REST APIs are supported:

  1. https://<switch-ip-or-name>:8251/resources/links

  2. https://<switch-ip-or-name>:8251/resources/ports

Output Example of Links:

curl -k https://sw-hdr-proton01:8251/resources/links | python3 -m json.tool
[
    {
        "info": {
            "md5": "256477d766fa8d8853848c43c35982ba",
            "timestamp": 1659355401394591,
            "time": "2022-08-01 12:03:21.394601"
        },
        "src": {
            "Node Description": "MF0;sw-hdr-proton01:MQM8700/U1",
            "Guid": "0x0c42a1030079a6ec",
            "ip": "10.209.44.74",
            "Node Name": "sw-hdr-proton01"
        },
        "dests": {
            "4": {
                "Node Description": "swx-proton04 mlx5_2",
                "Guid": "0xb8cef6030083bea2",
                "LocalPort": "1"
            },
            "2": {
                "Node Description": "Quantum Mellanox Technologies",
                "Guid": "0xb8cef60300fbf210",
                "LocalPort": "2"
            },
            "3": {
                "Node Description": "swx-proton03 mlx5_0",
                "Guid": "0xb8cef6030083bf02",
                "LocalPort": "1"
            },
            "1": {
                "Node Description": "Quantum Mellanox Technologies",
                "Guid": "0xb8cef60300fbf210",
                "LocalPort": "1"
            }
        }
    }
]
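The links response can be turned into a per-port neighbor table. The following is a minimal Python sketch. It assumes that the keys of "dests" are port numbers on the switch itself and that "LocalPort" is the port on the neighboring node; neither is stated in this document, so treat both as assumptions. neighbors_by_port is a hypothetical helper that takes one element of the returned list.

```python
from typing import Any, Dict, Tuple


def neighbors_by_port(entry: Dict[str, Any]) -> Dict[int, Tuple[str, str]]:
    """Map switch port number -> (neighbor node description, neighbor port)."""
    dests = entry.get("dests", {})
    return {
        int(port): (dest.get("Node Description"), dest.get("LocalPort"))
        # Sort numerically so the table reads in port order.
        for port, dest in sorted(dests.items(), key=lambda item: int(item[0]))
    }


entry = {
    "src": {"Node Name": "sw-hdr-proton01"},
    "dests": {
        "4": {"Node Description": "swx-proton04 mlx5_2", "LocalPort": "1"},
        "2": {"Node Description": "Quantum Mellanox Technologies", "LocalPort": "2"},
    },
}
for port, (neighbor, neighbor_port) in neighbors_by_port(entry).items():
    print(port, neighbor, neighbor_port)
```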

Output Example of Ports

curl -k https://sw-hdr-proton01:8251/resources/ports | python3 -m json.tool
[
    {
        "port": "IB1/10",
        "port_num": "10",
        "logical": "Down",
        "physical": "Polling"
    },
    {
        "port": "IB1/11",
        "port_num": "11",
        "logical": "Down",
        "physical": "Polling"
    },
    {
        "port": "IB1/12",
        "port_num": "12",
        "logical": "Down",
        "physical": "Polling"
    },
    {
        "port": "IB1/13",
        "port_num": "13",
        "logical": "Down",
        "physical": "Polling"
    }
]
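For a switch with many ports, a count per state is often easier to read than the full list. The following is a minimal Python sketch that groups the ports response by its "logical" and "physical" fields; it makes no assumption about which state values exist beyond those shown in the example. count_port_states is a hypothetical helper.

```python
from collections import Counter
from typing import Any, Dict, List


def count_port_states(ports: List[Dict[str, Any]]) -> Counter:
    """Count ports by their (logical, physical) state pair."""
    return Counter((p.get("logical"), p.get("physical")) for p in ports)


ports = [
    {"port": "IB1/10", "port_num": "10", "logical": "Down", "physical": "Polling"},
    {"port": "IB1/11", "port_num": "11", "logical": "Down", "physical": "Polling"},
]
for (logical, physical), count in count_port_states(ports).items():
    print(logical, physical, count)
```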

© Copyright 2023, NVIDIA. Last updated on Sep 5, 2023.