InfiniBand Bring-up Tool v4.0.0

Supported Job Templates

The following subsections describe the currently supported job templates.

This job creates, updates, or destroys one or more hosts in a specific AWX inventory.

To run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "AWX Inventory Host Update".

Warning

Make sure that all required variables described below are defined before running this job. You can define these variables either as inventory variables or as job template variables.

The following variables are required to update inventory:

| Variable | Description | Type |
| --- | --- | --- |
| controller_host | URL to the AWX controller instance | String |
| controller_oauthtoken | OAuth token for the AWX controller instance | String |
| hostname | Hostname or a hostname expression of the host(s) to update | String |

Alternatively, you can authenticate with a username and password instead of an OAuth token by specifying the following variables to update inventory:

| Variable | Description | Type |
| --- | --- | --- |
| controller_host | URL to the AWX controller instance | String |
| controller_username | Username for the AWX controller instance | String |
| controller_password | Password for the AWX controller instance | String |
| hostname | Hostname or a hostname expression of the host(s) to update | String |
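The two sets of required variables differ only in how the job authenticates against the AWX controller. As a minimal sketch, the check below reports which required variables are still missing before the job is launched. The helper name `check_inventory_vars` is hypothetical and is not part of the tool; it only mirrors the two tables above.

```python
def check_inventory_vars(variables):
    """Return a list of problems; an empty list means the required set is complete."""
    # controller_host and hostname appear in both required-variable tables.
    problems = [name for name in ("controller_host", "hostname")
                if not variables.get(name)]

    # Either an OAuth token or a username/password pair is sufficient.
    has_token = bool(variables.get("controller_oauthtoken"))
    has_login = (bool(variables.get("controller_username"))
                 and bool(variables.get("controller_password")))
    if not (has_token or has_login):
        problems.append(
            "controller_oauthtoken (or controller_username and controller_password)"
        )
    return problems
```

For example, passing a dictionary that defines `controller_host`, `hostname`, and `controller_oauthtoken` returns an empty list.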

The following variables are available to update inventory:

| Variable | Description |
| --- | --- |
| api_url | URL to your cluster bring-up REST API. Required when hostname_regex_enabled is set to true. |
| description | Description to use for the host(s) |
| host_enabled | Determines whether the host(s) should be enabled |
| hostname_regex_enabled | Determines whether to use a hostname expression to create the hostnames |
| host_state | State of the host resources. Options: present or absent. |
| inventory | Name of the inventory the host(s) should be made a member of |

The following are variable definitions and default values to update inventory:

| Variable | Default | Type |
| --- | --- | --- |
| api_url | '' | String |
| description | '' | String |
| host_enabled | true | Boolean |
| hostname_regex_enabled | true | Boolean |
| host_state | 'present' | String |
| inventory | 'IB Cluster Inventory' | String |

This job performs cable validation according to a given topology file.

To run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "Cable Validation".

    Warning

    Make sure that the filenames you provide in the ip_files and topo_files parameters are the names of files located in /opt/nvidia/cot/cable_validation_files.

The following variables are required to run cable validation:

| Variable | Description |
| --- | --- |
| api_url | URL to your cluster bring-up REST API. |
| ip_files | List of IP filenames to use for cable validation. |
| topo_files | List of topology filenames to use for cable validation. |

The following variables are available for cable validation:

| Variable | Description |
| --- | --- |
| disable_report | Specify to skip operations required for the summary report (e.g., data collection, report generation, etc.) |
| remove_agents | Specify to remove the agents from the switches once validation is complete. |
| delay_time | Time (in seconds) to wait between queries of async requests. |

The following are variable definitions and default values to run cable validation:

| Variable | Default | Type |
| --- | --- | --- |
| disable_report | false | Boolean |
| remove_agents | true | Boolean |
| delay_time | 10 | Integer |

The following example shows how to provide the ip_files and topo_files parameters:

    ip_files: ['test-ip-file.ip']
    topo_files: ['test-topo-file.topo']

In this example, the cable validation tool would expect to find the test-ip-file.ip and test-topo-file.topo files at /opt/nvidia/cot/cable_validation_files.
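Because the job expects the files to exist in that directory, it can help to confirm this on the cluster bring-up host before launching the template. The following is a minimal sketch; `missing_files` is a hypothetical helper, not part of the tool, and only the directory path is taken from the text above.

```python
from pathlib import Path

# Directory named in the documentation above.
CABLE_VALIDATION_DIR = Path("/opt/nvidia/cot/cable_validation_files")


def missing_files(filenames, base_dir=CABLE_VALIDATION_DIR):
    """Return the filenames that are not regular files under base_dir."""
    base = Path(base_dir)
    return [name for name in filenames if not (base / name).is_file()]


if __name__ == "__main__":
    # Same values as the ip_files and topo_files example.
    wanted = ["test-ip-file.ip", "test-topo-file.topo"]
    for name in missing_files(wanted):
        print(f"missing: {CABLE_VALIDATION_DIR / name}")
```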

This job ensures that the Python environment for the COT client is installed on one or more hosts.

Warning

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "COT Python Alignment".

The following variables are available for cluster orchestration Python environment installation:

| Variable | Description |
| --- | --- |
| cot_dir | Target path to installation root folder |
| force | Install the package even if it is already up to date |
| working_dir | Path to the working directory on the host |

The following are variable definitions and default values for cluster bring-up client installation:

| Variable | Default | Type |
| --- | --- | --- |
| cot_dir | '/opt/nvidia/cot' | String |
| force | false | Boolean |
| working_dir | '/tmp' | String |

This job runs high performance tests on the hosts of the inventory.

Warning

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "ClusterKit".

Warning

ClusterKit relies on the HPC-X package. Make sure the HPC-X package is installed.

The following variables are available for running ClusterKit:

| Variable | Description |
| --- | --- |
| clusterkit_hostname | Hostname expressions that represent the hostnames to run tests on |
| clusterkit_options | List of optional arguments for the tests |
| clusterkit_path | Path to the clusterkit executable script |
| disable_report | Specify to skip operations required for the summary report (e.g., data collection, report generation, etc.) |
| hpcx_dir | Path to the HPC-X directory |
| ib_device | Name of the RDMA device of the port used to connect to the fabric |
| inventory_group | Name of the inventory group for the hostnames to run tests on. Not available when use_hostfile is set to false or when clusterkit_hostname is set. |
| max_hosts | Limit the number of hostnames. Not available when use_hostfile is set to false. |
| use_hostfile | Determines whether to use a file for hostnames to run tests on |
| working_dir | Path to the working directory on the host |

The following are variable definitions and default values for running ClusterKit:

| Variable | Default | Type |
| --- | --- | --- |
| clusterkit_hostname | null | String |
| clusterkit_options | [] | List[String] |
| clusterkit_path | '{hpcx_dir}/clusterkit/bin/clusterkit.sh' | String |
| disable_report | false | Boolean |
| hpcx_dir | '/opt/nvidia/hpcx' | String |
| ib_device | 'mlx5_0' | String |
| inventory_group | all | String |
| max_hosts | -1 | Integer |
| use_hostfile | true | Boolean |
| working_dir | '/tmp' | String |
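Some defaults, such as clusterkit_path, contain a placeholder in braces that names another variable. The sketch below shows how such a default would expand if the placeholder is replaced by the value of the named variable. That substitution rule is an assumption based on the placeholder syntax, and `resolve_defaults` is a hypothetical helper, not part of the tool.

```python
def resolve_defaults(variables):
    """Expand {name} placeholders in string values using the other variables.

    Single pass only: a placeholder that expands to another placeholder
    is left as-is.
    """
    resolved = dict(variables)
    for name, value in variables.items():
        if isinstance(value, str):
            resolved[name] = value.format(**variables)
    return resolved


defaults = {
    "hpcx_dir": "/opt/nvidia/hpcx",
    "clusterkit_path": "{hpcx_dir}/clusterkit/bin/clusterkit.sh",
}
print(resolve_defaults(defaults)["clusterkit_path"])
```

Under this assumption, overriding only hpcx_dir would also move the ClusterKit script path, which is why the two variables are usually changed together.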

The ClusterKit results are uploaded to the database after each run and can be accessed via the API.

The following are REST requests to retrieve ClusterKit results:

| URL | Response | Method Type |
| --- | --- | --- |
| /api/performance/clusterkit/results | Get a list of all the ClusterKit run IDs stored in the database | GET |
| /api/performance/clusterkit/results/<run_id> | Get a ClusterKit run's results based on its run ID | GET |
| /api/performance/clusterkit/results/<run_id>?raw_data=true | Get a ClusterKit run's test results, as they are stored in the ClusterKit JSON output file, based on its run ID. Uses the raw_data query parameter. | GET |
| /api/performance/clusterkit/results/<run_id>?test=<test name> | Get a specific test result of the ClusterKit run based on its run ID. Uses the test query parameter. | GET |

| Query Param | Description |
| --- | --- |
| test | Returns a specific test result of the ClusterKit run |
| raw_data | Returns the data as it is stored in the ClusterKit output JSON files |

Examples:

    $ curl 'http://cluster-bringup:5000/api/performance/clusterkit/results'
    ["20220721_152951", "20220721_151736", "20220721_152900", "20220721_152702"]

    $ curl 'http://cluster-bringup:5000/api/performance/clusterkit/results/20220721_152951?raw_data=true&test=latency'
    {
        "Cluster": "Unknown",
        "User": "root",
        "Testname": "latency",
        "Date_and_Time": "2022/07/21 15:29:51",
        "JOBID": 0,
        "PPN": 28,
        "Bidirectional": "True",
        "Skip_Intra_Node": "True",
        "HCA_Tag": "Unknown",
        "Technology": "Unknown",
        "Units": "usec",
        "Nodes": {"ib-node-01": 0, "ib-node-02": 1},
        "Links": [[0, 41.885]]
    }
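The same requests can be issued from a script. The sketch below builds the request URLs from the endpoints and query parameters listed above and fetches them with the Python standard library. The function names `clusterkit_results_url` and `fetch_json` are hypothetical, and the base URL is the example host used in this section.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

RESULTS_PATH = "/api/performance/clusterkit/results"


def clusterkit_results_url(base_url, run_id=None, test=None, raw_data=False):
    """Build a ClusterKit results URL from the documented endpoints."""
    url = base_url.rstrip("/") + RESULTS_PATH
    if run_id is None:
        return url  # lists all stored run IDs
    url += "/" + run_id
    params = {}
    if raw_data:
        params["raw_data"] = "true"
    if test:
        params["test"] = test
    return url + ("?" + urlencode(params) if params else "")


def fetch_json(url):
    """Send a GET request and decode the JSON response body."""
    with urlopen(url) as response:
        return json.load(response)


if __name__ == "__main__":
    base = "http://cluster-bringup:5000"
    run_ids = fetch_json(clusterkit_results_url(base))
    if run_ids:
        # Fetch the raw latency results of the first listed run.
        url = clusterkit_results_url(base, run_ids[0], test="latency", raw_data=True)
        print(fetch_json(url))
```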

This job collects fabric counters with and without traffic based on CollectX and ClusterKit tools.

Warning

By default, this job template is configured to run with the ib_host_manager group specified in the IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "Fabric Health Counters Collection".

The following variables are available for running Fabric Health Counters Collection:

| Variable | Description |
| --- | --- |
| clusterkit_path | Path to the ClusterKit executable script |
| collection_interval | Interval (in minutes) between counter samples |
| cot_executable | Path to the installed cotclient tool |
| counters_output_dir | Directory path to save counters data |
| disable_report | Specify to skip operations required for the summary report (e.g., data collection, report generation, etc.) |
| hpcx_dir | Path to the HPC-X directory |
| ib_device | Name of the RDMA device of the port used to connect to the fabric |
| idle_test_time | Time (in minutes) to monitor counters without traffic |
| reset_counters | Specify to reset counters before starting the counters collection |
| stress_test_time | Time (in minutes) to monitor counters with traffic |
| ufm_telemetry_path | Path for the UFM Telemetry directory located in the ib_host_manager_server |
| working_dir | Path to the working directory on the host |

The following are variable definitions and default values for the fabric health counters collection:

| Variable | Default | Type |
| --- | --- | --- |
| clusterkit_path | '{hpcx_dir}/clusterkit/bin/clusterkit.sh' | String |
| collection_interval | 5 | Integer |
| cot_executable | '/opt/nvidia/cot/client/bin/cotclient' | String |
| counters_output_dir | '/tmp/collectx_counters_{date}_{time}/' | String |
| disable_report | false | Boolean |
| hpcx_dir | '/opt/nvidia/hpcx' | String |
| ib_device | 'mlx5_0' | String |
| idle_test_time | 30 | Integer |
| reset_counters | true | Boolean |
| stress_test_time | 30 | Integer |
| ufm_telemetry_path | '{working_dir}/ufm_telemetry' | String |
| working_dir | '/tmp' | String |

This job performs diagnostics on the fabric's state based on ibdiagnet checks, SM files, and switch commands.

Warning

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "IB Fabric Health Checks".

The following variables are available for running IB Fabric Health Checks:

| Variable | Description |
| --- | --- |
| check_max_failure_percentage | Max failure percentage for fabric health checks |
| cot_executable | Path to the installed cotclient tool |
| disable_report | Specify to skip operations required for the summary report (e.g., data collection, report generation, etc.) |
| exclude_scope | List of node GUIDs and their ports to be excluded |
| ib_device | Name of the RDMA device of the port used to connect to the fabric |
| routing_check | Specify to run the routing check |
| sm_configuration_file | Path for SM configuration file; supported only when the SM is running on the ib_host_manager |
| sm_unhealthy_ports_check | Specify to run the SM unhealthy ports check; supported only when the SM is running on the ib_host_manager |
| topology_type | Type of topology to discover |
| mlnxos_switch_hostname | Hostname expression that represents switches running MLNX-OS |
| mlnxos_switch_username | Username to authenticate against the target switches |
| mlnxos_switch_password | Password to authenticate against the target switches |

The following are variable definitions and default values for the health check:

| Variable | Default | Type |
| --- | --- | --- |
| check_max_failure_percentage | 1 | Float |
| cot_executable | '/opt/nvidia/cot/client/bin/cotclient' | String |
| disable_report | false | Boolean |
| exclude_scope | NULL | List[String] |
| ib_device | 'mlx5_0' | String |
| routing_check | true | Boolean |
| sm_configuration_file | '/etc/opensm/opensm.conf' | String |
| sm_unhealthy_ports_check | false | Boolean |
| topology_type | 'infiniband' | String |
| mlnxos_switch_hostname | NULL | String |
| mlnxos_switch_username | NULL | String |
| mlnxos_switch_password | NULL | String |

The following example shows how to exclude ports using the exclude_scope variable:

    exclude_scope: ['0x1234@1/3', '0x1235']

In this example, IB Fabric Health Check runs over the fabric except on ports 1 and 3 of node GUID 0x1234 and all ports of node GUID 0x1235.
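To make the entry format concrete, the sketch below splits each exclude_scope entry into a node GUID and an optional list of ports, following the reading given above: `@` separates the GUID from its ports, `/` separates the ports, and an entry without `@` covers every port of the node. `parse_exclude_scope` is a hypothetical helper for illustration, not the tool's own parser.

```python
def parse_exclude_scope(entries):
    """Map each node GUID to its excluded ports (None means all ports)."""
    excluded = {}
    for entry in entries:
        guid, separator, ports = entry.partition("@")
        if separator:
            excluded[guid] = [int(port) for port in ports.split("/")]
        else:
            excluded[guid] = None  # whole node excluded
    return excluded


print(parse_exclude_scope(["0x1234@1/3", "0x1235"]))
```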

The following example shows how to configure switch variables:

    mlnxos_switch_hostname: 'ib-switch-t[1-2],ib-switch-s1'
    mlnxos_switch_username: 'admin'
    mlnxos_switch_password: 'my_admin_password'

In this example, IB Fabric Health Check performs a check that requires switch connection over ib-switch-t1, ib-switch-t2, and ib-switch-s1 using the username admin and password my_admin_password for the connection.
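The hostname expression in this example combines a bracketed numeric range with a comma-separated list. The sketch below expands such an expression into individual hostnames. It assumes that each comma-separated part contains at most one `[start-end]` range and that a leading zero in the range start requests zero-padding. The tool's full expression syntax is not described here, and `expand_hosts` is a hypothetical name.

```python
import re

# One optional numeric range per part, e.g. "ib-switch-t[1-2]".
RANGE_PATTERN = re.compile(r"(.*?)\[(\d+)-(\d+)\](.*)")


def expand_hosts(expression):
    """Expand a simple hostname expression into a list of hostnames."""
    hosts = []
    for part in expression.split(","):
        match = RANGE_PATTERN.fullmatch(part)
        if not match:
            hosts.append(part)  # plain hostname, nothing to expand
            continue
        prefix, start, end, suffix = match.groups()
        # Assumption: keep the width of a zero-padded start such as "01".
        width = len(start) if start.startswith("0") else 0
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{str(number).zfill(width)}{suffix}")
    return hosts


print(expand_hosts("ib-switch-t[1-2],ib-switch-s1"))
```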

This job discovers network topology and updates the database.

Warning

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "IB Network Discovery".

The following variables are required for network discovery:

| Variable | Description | Type |
| --- | --- | --- |
| api_url | URL to your cluster bring-up REST API | String |

Warning

For the network discovery to find the IPs of MLNX-OS switches, the ufm_telemetry_path variable is required. This feature is supported for UFM Telemetry version 1.11.0 and above.

The following variables are available for network discovery:

| Variable | Description |
| --- | --- |
| clear_topology | Use to clear previous topology data. |
| disable_report | Specify to skip operations required for the summary report (e.g., data collection, report generation, etc.) |
| ufm_telemetry_path | Path for the UFM Telemetry folder located on the ib_host_manager_server. Specify to use UFM Telemetry's ibdiagnet tool for the network discovery (e.g., '/tmp/ufm_telemetry'). |
| switch_username | Username to authenticate against MLNX-OS switches |
| switch_password | Password to authenticate against MLNX-OS switches |
| cot_executable | Path to the installed cotclient tool |
| ib_device | Name of the in-band HCA device to use (e.g., 'mlx5_0') |
| subnet | Name of the subnet that the topology nodes are members of |

The following are variable definitions and default values for network discovery:

| Variable | Default | Type |
| --- | --- | --- |
| clear_topology | false | Boolean |
| disable_report | false | Boolean |
| ufm_telemetry_path | NULL | String |
| cot_executable | '/opt/nvidia/cot/client/bin/cotclient' | String |
| ib_device | 'mlx5_0' | String |
| subnet | 'infiniband-default' | String |

This job installs NVIDIA® UFM® Telemetry on one or more hosts.

Warning

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "UFM Telemetry Upgrade".

The following variables are required for UFM Telemetry installation:

| Variable | Description |
| --- | --- |
| ufm_telemetry_package_url | URL for UFM Telemetry to download |

The following variables are available for UFM Telemetry installation:

| Variable | Description |
| --- | --- |
| disable_report | Specify to skip operations required for the summary report (e.g., data collection, report generation, etc.) |
| ufm_telemetry_checksum | Checksum of the UFM Telemetry package to download |
| working_dir | Destination path for installing UFM Telemetry. The package will be placed in a subdirectory called ufm_telemetry. |

The following are variable definitions and default values for UFM Telemetry installation:

| Variable | Default | Type |
| --- | --- | --- |
| disable_report | false | Boolean |
| ufm_telemetry_path | NULL | String |
| working_dir | '/tmp' | String |

This job installs NVIDIA® MLNX_OFED driver on one or more hosts.

Refer to the official NVIDIA Linux Drivers documentation for further information.

Warning

By default, this job template is configured to run against the hosts of IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "MLNX_OFED Upgrade".

Warning

By default, the MLNX_OFED package is downloaded from the MLNX_OFED download center. You must specify the ofed_version (or use its default value) and the ofed_package_url variables when the download center is not available.

The following variables are available for MLNX_OFED installation:

| Variable | Description |
| --- | --- |
| disable_report | Specify to skip operations required for the summary report (e.g., data collection, report generation, etc.) |
| force | Install MLNX_OFED package even if it is already up to date |
| ofed_checksum | Checksum of the MLNX_OFED package to download |
| ofed_dependencies | List of all package dependencies for the MLNX_OFED package |
| ofed_install_options | List of optional arguments for the installation command |
| ofed_package_url | URL of the MLNX_OFED package to download (default: auto-detection). In addition, you must specify the ofed_version parameter or use its default value. |
| ofed_version | Version number of the MLNX_OFED package to install |
| working_dir | Path to the working directory on the host |

The following are variable definitions and default values for MLNX_OFED installation:

| Variable | Default | Type |
| --- | --- | --- |
| disable_report | false | Boolean |
| force | false | Boolean |
| ofed_checksum | '' | String |
| ofed_dependencies | [] | List |
| ofed_install_options | [] | List |
| ofed_package_url | '' | String |
| ofed_version | '23.10-0.5.5.0' | String |
| working_dir | '/tmp' | String |

The following example shows MLNX_OFED for RHEL/CentOS 8.0 on the MLNX_OFED Download Center:

[Image: MLNX_OFED Download Center]

    ofed_checksum: 'SHA256: 37b64787db9eabecc3cefd80151c0f49c852751d797e1ccdbb49d652f08916e3'
    ofed_version: '5.4-1.0.3.0'
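The checksum value in this example has the form `<algorithm>: <hex digest>`. If you host the package yourself, you may want to confirm that a local copy matches the published checksum before pointing the job at it. The following is a minimal sketch using the Python standard library. It assumes the prefix names an algorithm known to `hashlib` (such as SHA256). How the job itself parses this value is not described here, and `verify_checksum` is a hypothetical helper.

```python
import hashlib


def verify_checksum(path, checksum):
    """Compare a file's digest with a '<algorithm>: <hex digest>' string."""
    algorithm, _, expected = checksum.partition(":")
    digest = hashlib.new(algorithm.strip().lower())
    with open(path, "rb") as package:
        # Read in 1 MiB chunks so large packages are not loaded into memory.
        for chunk in iter(lambda: package.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.strip().lower()
```

A call such as `verify_checksum(package_path, 'SHA256: 37b6...')` returns True only when the digests match; `package_path` here stands for wherever you saved the download.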

This job updates the system firmware/OS software on one or more MLNX-OS switches.

Warning

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "MLNX-OS Upgrade".

Warning

Make sure all required variables described below are defined before running this job. You can define these variables either as inventory variables or as job template variables.

The following variables are required to update MLNX-OS system:

| Variable | Description | Type |
| --- | --- | --- |
| mlnxos_image_url | URL of the firmware/MLNX-OS image to download | String |
| switch_username | Username to authenticate against target switches | String |
| switch_password | Password to authenticate against target switches | String |
| switches | List of IP addresses/hostnames of the switches to upgrade | List[String] |

The following variables are available to update MLNX-OS system:

| Variable | Description |
| --- | --- |
| command_timeout | Time (in seconds) to wait for the command to complete |
| cot_executable | Path to the installed cotclient tool |
| disable_report | Specify to skip operations required for the summary report (e.g., data collection, report generation, etc.) |
| force | Update MLNX-OS system even if it is already up to date |
| image_url | Alias name for mlnxos_image_url. Not available when mlnxos_image_url is set. |
| reload_command | Specify an alternative command to reload the switch system |
| reload_timeout | Time (in seconds) to wait for the switch system to reload |
| remove_images | Determines whether to remove all images on disk before the system upgrade starts |

The following are variable definitions and default values to update MLNX-OS system:

| Variable | Default | Type |
| --- | --- | --- |
| command_timeout | 240 | Integer |
| cot_executable | '/opt/nvidia/cot/client/bin/cotclient' | String |
| disable_report | false | Boolean |
| force | false | Boolean |
| reload_command | '"reload noconfirm"' | String |
| reload_timeout | 200 | Integer |
| remove_images | false | Boolean |

This job executes configuration commands on one or more MLNX-OS switches.

To run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "MLNX-OS Configure".

The following variables are required to configure MLNX-OS system:

| Variable | Description | Type |
| --- | --- | --- |
| switch_config_commands | List of configuration commands to execute | List[String] |
| switch_username | Username to authenticate against target switches | String |
| switch_password | Password to authenticate against target switches | String |
| switches | List of IP addresses/hostnames of the switches to configure | List[String] |

The following variables are available to configure MLNX-OS system:

| Variable | Description |
| --- | --- |
| cot_executable | Path to the installed cotclient tool |
| save_config | Specify to save the system configuration after the execution completes |

The following are variable definitions and default values to configure MLNX-OS system:

| Variable | Default | Type |
| --- | --- | --- |
| cot_executable | '/opt/nvidia/cot/client/bin/cotclient' | String |
| save_config | true | Boolean |

This job installs NVIDIA® MFT package on one or more hosts.

Refer to the official Mellanox Firmware Tools documentation for further information.

Warning

By default, this job template is configured to run against the hosts of IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "MFT Upgrade".

Warning

By default, the MFT package is downloaded from the MFT download center. You must specify the mft_version (or use its default value) and the mft_package_url variables when the download center is not available.

The following variables are available for MFT installation:

| Variable | Description |
| --- | --- |
| disable_report | Specify to skip operations required for the summary report (e.g., data collection, report generation, etc.) |
| force | Install MFT package even if it is already up to date |
| mft_checksum | Checksum of MFT package to download |
| mft_dependencies | List of all package dependencies for the MFT package |
| mft_install_options | List of optional arguments for the installation command |
| mft_package_url | URL of the MFT package to download (default: auto-detection). In addition, you must specify the mft_version parameter or use its default value. |
| mft_version | Version number of the MFT package to install |
| working_dir | Path to the working directory on the host |

The following are variable definitions and default values for MFT installation:

| Variable | Default | Type |
| --- | --- | --- |
| disable_report | false | Boolean |
| force | false | Boolean |
| mft_checksum | '' | String |
| mft_dependencies | [] | List |
| mft_install_options | [] | List |
| mft_package_url | '' | String |
| mft_version | '4.26.0-93' | String |
| working_dir | '/tmp' | String |

The following example shows MFT for RedHat on the MFT Download Center:

[Image: MFT Download Center]

    mft_checksum: 'sha256: 57ba6a0e1aada907cb94759010b3d8a4b5b1e6db87ae638c9ac92e50beb1e29e'
    mft_version: '4.17.0-106'

This job installs NVIDIA® HPC-X® package on one or more hosts.

Refer to the official NVIDIA HPC-X documentation for further information.

Warning

By default, this job template is configured to run against the hosts of IB Cluster Inventory. You must set the hpcx_install_once variable to true when installing the HPC-X package to a shared location.

To run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "HPC-X Upgrade".

Warning

By default, the HPC-X package is downloaded from the HPC-X download center. You need to specify the hpcx_version (or use its default value) and the hpcx_package_url variables when the download center is not available.

The following variables are available for HPC-X installation:

| Variable | Description |
| --- | --- |
| disable_report | Specify to skip operations required for the summary report (e.g., data collection, report generation, etc.) |
| force | Install HPC-X package even if it is already up to date |
| hpcx_checksum | Checksum of the HPC-X package to download |
| hpcx_dir | Target path for HPC-X installation folder |
| hpcx_install_once | Specify whether to install the HPC-X package via a single host. May be used to install the package on a shared directory. |
| hpcx_package_url | URL of the HPC-X package to download (default: auto-detection). In addition, you must specify the hpcx_version parameter or use its default value. |
| hpcx_version | Version number of the HPC-X package to install |
| ofed_version | Version number of the OFED package compatible with the HPC-X package. Required when MLNX_OFED is not installed on the host. |
| working_dir | Path to the working directory on the host |

The following are variable definitions and default values for HPC-X installation:

| Variable | Default | Type |
| --- | --- | --- |
| disable_report | false | Boolean |
| force | false | Boolean |
| hpcx_checksum | '' | String |
| hpcx_dir | '/opt/nvidia/hpcx' | String |
| hpcx_install_once | false | Boolean |
| hpcx_package_url | '' | String |
| hpcx_version | '2.17.0' | String |
| ofed_version | '' | String |
| working_dir | '/tmp' | String |

The following example shows HPC-X for RedHat 8.0 on the HPC-X Download Center:

[Image: HPC-X Download Center]

    hpcx_checksum: 'sha256: 57ba6a0e1aada907cb94759010b3d8a4b5b1e6db87ae638c9ac92e50beb1e29e'
    hpcx_version: '2.9.0'
    ofed_version: ''

This job configures multiple MLNX-OS switches as IB NDR routers. The job supports two modes:

  • Attached mode (AWX Job Template)

  • Detached mode (Ansible playbook/Docker image)

    Warning

    The PDF report and the pass-fail criteria features are not available when running via detached mode.

To run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "IB NDR Router Configuration".

The following variables are required for multiple router configuration:

| Variable | Description | Type |
| --- | --- | --- |
| settings_file | Path to the YAML file describing the switches and the mapping of the ports. For more information, see section "Input File". | String |
| server_ip | IP of the server running the playbook. Used for saving the switch configuration backups to the server. This variable is mandatory only when running via Docker mode. | String |
| server_username | Username of the server running the playbook. Used for saving the switch configuration backups to the server. When running in attached mode, this would be the ib_host_manager user. | String |
| server_password | Password of the server running the playbook. Used for saving the switch configuration backups to the server. When running in attached mode, this would be the ib_host_manager user's password. | String |

The following variables are available for multiple router configuration:

| Variable | Description |
| --- | --- |
| query_conf_job_delay | Time (in seconds) to wait between each query of the asynchronous configuration job |
| query_conf_job_retries | Number of retries to query the asynchronous configuration job |
| ssh_conn_retries | Number of retries to SSH to a switch after reboot |
| ssh_conn_delay | Time (in seconds) to wait between each SSH retry |
| ssh_conn_retry_timeout | Timeout (in seconds) for each SSH retry |
| http_cmd_timeout | Timeout (in seconds) for the HTTP request that contains commands for execution |
| boot_timeout | Timeout (in seconds) for a switch to boot, after profile change, and resolve configuration wizard |
| backup_only | Specify as True to only dump configuration backups of switches to the server |
| backup_only_job_retries | Number of retries to query the backup job until finished (in backup_only mode) |
| backup_only_job_delay | Time (in seconds) to wait between each query of the backup job (in backup_only mode) |
| bulk_size | Number of switches per bulk |
| working_dir | Path to the directory that the playbook will save its files to. The playbook will save these files under {working_dir}/router_conf_job_files/{unique_id}/. |
| no_cache | Specify as True to remove all files created by the job. This includes log files and switch configuration backups. |
| internal_python_interpreter | Path to the Python interpreter used to execute the configuration procedure Python module. This Python interpreter must have the Paramiko library installed. |

The following are variable definitions and default values for multiple router configuration:

| Variable | Type | Default |
| --- | --- | --- |
| query_conf_job_delay | Integer | 50 |
| query_conf_job_retries | Integer | 30 |
| ssh_conn_retries | Integer | 10 |
| ssh_conn_delay | Integer | 30 |
| ssh_conn_retry_timeout | Integer | 40 |
| http_cmd_timeout | Integer | 10 |
| boot_timeout | Integer | 450 |
| backup_only | Boolean | false |
| backup_only_job_retries | Integer | 5 |
| backup_only_job_delay | Integer | 10 |
| bulk_size | Integer | 100 |
| working_dir | String | /tmp |
| no_cache | Boolean | false |
| internal_python_interpreter | String | Attached mode: if COTClient is installed, its Python interpreter is used; if not, ansible_python_interpreter is used if provided. Detached mode: ansible_python_interpreter is selected if provided. In both modes, the Python interpreter is otherwise auto-discovered by Ansible. |
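As a rough planning aid, the retry and delay defaults can be multiplied out to estimate how long the job may wait in the worst case. The sketch below assumes that every retry waits the full delay and, for SSH, also the full per-retry timeout. The playbook's actual scheduling is not described here, so treat the results as upper-bound estimates. `worst_case_wait_seconds` is a hypothetical helper.

```python
# Defaults copied from the defaults table for this job
# (delays and timeouts in seconds, retries as counts).
DEFAULTS = {
    "query_conf_job_delay": 50,
    "query_conf_job_retries": 30,
    "ssh_conn_retries": 10,
    "ssh_conn_delay": 30,
    "ssh_conn_retry_timeout": 40,
    "backup_only_job_retries": 5,
    "backup_only_job_delay": 10,
}


def worst_case_wait_seconds(overrides=None):
    """Estimate upper bounds on waiting time under the stated assumption."""
    v = {**DEFAULTS, **(overrides or {})}
    return {
        "configuration_job_polling":
            v["query_conf_job_retries"] * v["query_conf_job_delay"],
        "ssh_reconnect":
            v["ssh_conn_retries"] * (v["ssh_conn_delay"] + v["ssh_conn_retry_timeout"]),
        "backup_job_polling":
            v["backup_only_job_retries"] * v["backup_only_job_delay"],
    }


print(worst_case_wait_seconds())
```

With the defaults, this gives 1500 seconds for configuration job polling, 700 seconds for SSH reconnection, and 50 seconds for backup job polling.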

Input File

Each group in the input file should consist of switches, username, password and ports_mapping keys:

  • The switches key describes a list of switch names/IPs by using a host list expression.

  • The username/password are used to authenticate against the switches.

  • The ports_mapping key describes how to map the physical ports to the SWIDs with a list of dictionaries. Each dictionary describes a mapping of a ports list, in label port format, to an IB subnet.

The following parameters are available for each group in the input file:

| Variable | Description | Type |
| --- | --- | --- |
| auth_protocol | Authentication protocol configured on switches. Supported protocol: tacacs. Default: None. | String |
| default_username | Default username to authenticate against the switches after changing their system profile. Default: admin. | String |
| default_password | Default password to authenticate against the switches after changing their system profile. Default: admin. | String |
| adaptive_routing_groups | Number of adaptive routing groups to set for the new system profile on switches in a group. Setting this parameter to zero turns off the adaptive routing feature. If not specified, the original value is retained. | Integer |
| additional_commands | List of additional configuration commands to execute before enabling the subnets at the end of the procedure. | List |
| rollback_additional_commands | List of additional configuration commands to execute in case of rollback. These commands are executed after the rollback procedure is completed. | List |

Example:

    group1:
      switches: grla-example-[1-10]
      username: admin
      password: admin
      auth_protocol: tacacs
      adaptive_routing_groups: 2048
      additional_commands:
        - <some_command>
      rollback_additional_commands:
        - <some_command>
      ports_mapping:
        - subnet: infiniband-1
          ports:
            - 1/1
            - 1/2
        - subnet: infiniband-default
          ports:
            - 2/1
            - 2/2
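Before launching the job, it can be useful to check that each group in the settings file carries the keys described in this section. The sketch below works on an already-loaded dictionary that mirrors the example. Loading the YAML file itself would need a third-party parser such as PyYAML, which is left out here. `validate_group` is a hypothetical helper, and its checks cover only the rules stated above.

```python
REQUIRED_KEYS = ("switches", "username", "password", "ports_mapping")


def validate_group(name, group):
    """Return a list of problems found in one group of the input file."""
    errors = [f"{name}: missing key '{key}'"
              for key in REQUIRED_KEYS if key not in group]
    for index, mapping in enumerate(group.get("ports_mapping", [])):
        # Each mapping ties a list of ports to one IB subnet.
        if "subnet" not in mapping or not mapping.get("ports"):
            errors.append(
                f"{name}: ports_mapping[{index}] needs 'subnet' and a non-empty 'ports' list"
            )
    return errors


settings = {
    "group1": {
        "switches": "grla-example-[1-10]",
        "username": "admin",
        "password": "admin",
        "ports_mapping": [
            {"subnet": "infiniband-1", "ports": ["1/1", "1/2"]},
            {"subnet": "infiniband-default", "ports": ["2/1", "2/2"]},
        ],
    }
}
for group_name, group in settings.items():
    print(group_name, validate_group(group_name, group))
```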


Docker Mode

The IB NDR Router Configuration job supports Docker mode, which consists of all dependencies and files required for this job, wrapped as a Docker image.

The Docker mode supports running the playbook as the localhost only.

To run the router configuration job Docker image, use the following command:

    docker run -v /tmp/router_conf_job_files/:/tmp/router_conf_job_files/ -itd <image_name>

Use the docker exec command to start a shell inside the container:

    docker exec -it <container_id> bash

To run the job, prepare the inventory that consists of the localhost (only) and the settings file. Use the following base command, after adding the relevant parameters:

    ansible-playbook /ndr_router_configuration.yml -i /inv_localhost.ini -c local

Example:

    ansible-playbook /ndr_router_configuration.yml -i /inv_localhost.ini -c local -e "server_username=root"


System Requirements

Attached Mode Requirements

In attached mode, the playbook tasks are executed on the ib_host_manager.

  • If COTClient is installed, it will use its Python interpreter for internal_python_interpreter

  • Otherwise, the ib_host_manager must meet the following requirements:

    • Python version 3 or higher (see "Target Python / PowerShell" in Ansible documentation).

    • Paramiko version 3.4.0 or higher

Detached Mode Requirements

In detached mode, the Ansible playbook can run on the localhost or send tasks to be performed on a remote host.

  • If running on the localhost, the server must meet both Control Node and Target Node requirements as mentioned below

  • Otherwise, the server running the Ansible playbook is the Control Node itself and the remote server is the Target Node

Control node requirements:

  • Ansible version 2.12 or higher

  • Python version 3 or higher (see "Control Node Python" section in Ansible documentation)

Target node requirements:

  • Python version 3 or higher (see "Target Python / PowerShell" in Ansible documentation)

  • Paramiko version 3.4.0 or higher

Docker Mode Requirements

The image can run on any server meeting the following requirements:

  • Ubuntu version 18.04 or higher

  • Docker version 19.03.14 or higher

To run the container, a volume must be set as the location to save the files made by this job. This volume will map /tmp/router_conf_job_files in the server to the same path in the container.

Example:

    docker run -v /tmp/router_conf_job_files/:/tmp/router_conf_job_files/ -itd cot/router_conf:1.0.0

This job generates a report by run ID (default: last run ID).

To run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "Report Generation".

The following variables are required to generate report:

| Variable | Description | Type |
| --- | --- | --- |
| api_url | URL to your cluster bring-up REST API | String |

The following variables are available to generate report:

| Variable | Description |
| --- | --- |
| disable_report | Specify to skip operations required for the summary report (e.g., data collection, report generation, etc.) |
| run_id | ID of the run to generate the report for |

The following are variable definitions and default values to generate report:

| Variable | Default | Type |
| --- | --- | --- |
| disable_report | false | Boolean |
| run_id | 'last' | String |

A file server is useful when you must access files (e.g., packages, images, etc.) that are not available on the web.

The files can be accessed over the following URL: http://<host>:<port>/downloads/ where host (IP address/hostname) and port are the address of your cluster bring-up host.

For example, if cluster-bringup is the hostname of your cluster bring-up host and the TCP port is 5000 as defined in the suggested configuration, then files can be accessed over the URL http://cluster-bringup:5000/downloads/.

To see all available files, open your browser and navigate to http://cluster-bringup:5000/downloads/.

[Image: Index of /downloads/]

Create a directory for a specific cable firmware image and copy a binary image file into it. Run:

    [root@cluster-bringup ~]# mkdir -p \
        /opt/nvidia/cot/files/linkx/rel-38_100_121/iffu
    [root@cluster-bringup ~]# cp /tmp/hercules2.bin \
        /opt/nvidia/cot/files/linkx/rel-38_100_121/iffu

The file can be accessed over the URL http://cluster-bringup:5000/downloads/linkx/rel-38_100_121/iffu/hercules2.bin.
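In this example, a file copied under /opt/nvidia/cot/files appears under /downloads/ with the same relative path. The sketch below derives the download URL for a local file under that assumption, which is inferred from this one example rather than from a documented rule. `download_url` is a hypothetical helper, and the host and port defaults are the example values used in this section.

```python
from pathlib import PurePosixPath

# Directory that the example above copies files into.
FILES_ROOT = PurePosixPath("/opt/nvidia/cot/files")


def download_url(file_path, host="cluster-bringup", port=5000):
    """Map a file under FILES_ROOT to its /downloads/ URL."""
    # relative_to raises ValueError if the file is outside FILES_ROOT.
    relative = PurePosixPath(file_path).relative_to(FILES_ROOT)
    return f"http://{host}:{port}/downloads/{relative.as_posix()}"


print(download_url("/opt/nvidia/cot/files/linkx/rel-38_100_121/iffu/hercules2.bin"))
```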

To see all available files, open a browser and navigate to http://cluster-bringup:5000/downloads/.

[Image: Index of /downloads/ showing the linkx directory]

To see the image file, navigate to http://cluster-bringup:5000/downloads/linkx/rel-38_100_121/iffu/.

[Image: Index of /downloads/linkx/rel-38_100_121/iffu/ showing hercules2.bin]


© Copyright 2023, NVIDIA. Last updated on Mar 18, 2024.