InfiniBand Bring-up Tool v1.0.1

Supported Job Templates

The following subsections describe the currently supported job templates.

AWX Inventory Host Update

Create, update, or destroy one or more hosts on a specific AWX inventory.

The following instructions describe how to run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "AWX Inventory Host Update".

Warning

Make sure that all required variables described below are defined before running this job. You can define these variables either as inventory variables or as job template variables.

The following variables are required to update inventory:

| Name | Description | Type |
| --- | --- | --- |
| controller_host | URL to the AWX controller instance | String |
| controller_oauthtoken | OAuth token for the AWX controller instance | String |
| hostname | Hostname or a hostname expression of the host(s) to update | String |

Alternatively, you can specify the following variables to update inventory:

| Name | Description | Type |
| --- | --- | --- |
| controller_host | URL to the AWX controller instance | String |
| controller_username | Username for the AWX controller instance | String |
| controller_password | Password for the AWX controller instance | String |
| hostname | Hostname or a hostname expression of the host(s) to update | String |

The following variables are available to update inventory:

| Name | Description |
| --- | --- |
| api_url | URL to your cluster bring-up REST API. This variable is required when hostname_regex_enabled is set to true. |
| description | Description to use for the host(s) |
| host_enabled | Determine whether the host(s) should be enabled |
| hostname_regex_enabled | Determine whether to use a hostname expression to create the hostnames |
| host_state | State of the host resources. Options: present or absent. |
| inventory | Name of the inventory the host(s) should be made a member of |

The following are variable definitions and default values to update inventory:

| Name | Default | Type |
| --- | --- | --- |
| api_url | '' | String |
| description | '' | String |
| host_enabled | true | Boolean |
| hostname_regex_enabled | true | Boolean |
| host_state | 'present' | String |
| inventory | 'IB Cluster Inventory' | String |
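As an illustration, the variables above could be supplied as job template variables in YAML. This is a minimal sketch under stated assumptions, not a tested configuration: the controller URL and the token are placeholders, and ib-node-01 is only an example hostname. It sets hostname_regex_enabled to false on the assumption that the hostname is then taken literally, in which case api_url is not required.

```yaml
# Illustrative variables for "AWX Inventory Host Update".
# Every value below is a placeholder, not a default shipped with the tool.
controller_host: 'https://awx.example.com'      # hypothetical AWX URL
controller_oauthtoken: '<your-oauth-token>'     # placeholder token
hostname: 'ib-node-01'                          # example hostname
hostname_regex_enabled: false                   # assumed: treat hostname literally
host_state: 'present'                           # the other documented option is 'absent'
inventory: 'IB Cluster Inventory'               # documented default inventory name
```

With username and password authentication, controller_oauthtoken would be replaced by controller_username and controller_password.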

COT Python Alignment

Ensure that the Python environment for the COT client is installed on one or more hosts.

Warning

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

The following instructions describe how to run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "COT Python Alignment".

The following variables are available for cluster orchestration Python environment installation:

| Name | Description |
| --- | --- |
| cot_dir | Target path to installation root folder |
| force | Install the package even if it is already up to date |
| working_dir | Path to the working directory on the host |

The following are variable definitions and default values for cluster orchestration Python environment installation:

| Name | Default | Type |
| --- | --- | --- |
| cot_dir | '/opt/nvidia/cot' | String |
| force | false | Boolean |
| working_dir | '/tmp' | String |
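For instance, a forced reinstallation might be requested with variables along these lines. This is only a sketch: the working directory shown is a hypothetical alternative to the default.

```yaml
# Illustrative variables for "COT Python Alignment".
cot_dir: '/opt/nvidia/cot'   # documented default installation root
force: true                  # install even if the package is already up to date
working_dir: '/var/tmp'      # hypothetical; the documented default is '/tmp'
```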

ClusterKit

Run ClusterKit for high performance tests on the hosts of the inventory.

Warning

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

The following instructions describe how to run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "ClusterKit".

Warning

ClusterKit relies on the HPC-X package. Make sure the HPC-X package is installed.

The following variables are available for running ClusterKit:

| Name | Description |
| --- | --- |
| clusterkit_hostname | Hostname expressions that represent the hostnames to run tests on |
| clusterkit_options | List of optional arguments for the tests |
| clusterkit_path | Path to the clusterkit executable script |
| inventory_group | Name of the inventory group for the hostnames to run tests on. This variable is not available when either use_hostfile is set to false or clusterkit_hostname is set. |
| max_hosts | Limit the number of hostnames. This variable is not available when use_hostfile is set to false. |
| use_hostfile | Determine whether to use a file for hostnames to run tests on |
| working_dir | Path to the working directory on the host |

The following are variable definitions and default values for running ClusterKit:

| Name | Default | Type |
| --- | --- | --- |
| clusterkit_hostname | null | String |
| clusterkit_options | [] | List[String] |
| clusterkit_path | '/opt/nvidia/hpcx/clusterkit/bin/clusterkit.sh' | String |
| inventory_group | all | String |
| max_hosts | -1 | Integer |
| use_hostfile | true | Boolean |
| working_dir | '/tmp' | String |
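A run restricted to a limited number of hosts from an inventory group could be described roughly as follows. The limit of 8 is a hypothetical value chosen for illustration; clusterkit_options is left at its default because the accepted arguments depend on the installed ClusterKit version.

```yaml
# Illustrative variables for "ClusterKit".
use_hostfile: true        # required for inventory_group and max_hosts to apply
inventory_group: 'all'    # documented default group
max_hosts: 8              # hypothetical limit; the documented default is -1
clusterkit_options: []    # documented default (no extra arguments)
working_dir: '/tmp'       # documented default
```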

The ClusterKit results are uploaded to the database after each run and can be accessed via the API.

The following are REST requests to retrieve ClusterKit results:

| URL | Response | Method Type |
| --- | --- | --- |
| /api/performance/clusterkit/results | Get a list of all the ClusterKit run IDs stored in the database | GET |
| /api/performance/clusterkit/results/<run_id> | Get a ClusterKit run's results based on its run ID | GET |
| /api/performance/clusterkit/results/<run_id>?raw_data=true | Get a ClusterKit run's test results, based on its run ID, as they are stored in the ClusterKit JSON output file (uses the raw_data query parameter) | GET |
| /api/performance/clusterkit/results/<run_id>?test=<test name> | Get a specific test result of the ClusterKit run based on its run ID (uses the test query parameter) | GET |

| Query Param | Description |
| --- | --- |
| test | Returns a specific test result of the ClusterKit run |
| raw_data | Returns the data as it is stored in the ClusterKit output JSON files |

Examples:

    $ curl 'http://cluster-bringup:5000/api/performance/clusterkit/results'
    ["20220721_152951", "20220721_151736", "20220721_152900", "20220721_152702"]

    $ curl 'http://cluster-bringup:5000/api/performance/clusterkit/results/20220721_152951?raw_data=true&test=latency'
    {
      "Cluster": "Unknown",
      "User": "root",
      "Testname": "latency",
      "Date_and_Time": "2022/07/21 15:29:51",
      "JOBID": 0,
      "PPN": 28,
      "Bidirectional": "True",
      "Skip_Intra_Node": "True",
      "HCA_Tag": "Unknown",
      "Technology": "Unknown",
      "Units": "usec",
      "Nodes": {"ib-node-01": 0, "ib-node-02": 1},
      "Links": [[0, 41.885]]
    }

IB Network Discovery

Discover network topology and update the database with the discovered result.

Warning

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

The following instructions describe how to run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "IB Network Discovery".

The following variables are required for network discovery:

| Name | Description | Type |
| --- | --- | --- |
| api_url | URL to your cluster bring-up REST API | String |

The following variables are available for network discovery:

| Name | Description |
| --- | --- |
| cot_python_interpreter | Path to cluster orchestration Python interpreter |
| ib_device | Name of the in-band HCA device to use (e.g., 'mlx5_0') |
| subnet | Name of the subnet that the topology nodes are members of |

The following are variable definitions and default values for network discovery:

| Name | Default | Type |
| --- | --- | --- |
| cot_python_interpreter | '/opt/nvidia/cot/client/bin/python/' | String |
| ib_device | 'mlx5_0' | String |
| subnet | 'infiniband-default' | String |
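A discovery run through a non-default HCA device might be configured along these lines. Treat it as a sketch: the api_url value assumes that the REST API is served under /api on the cluster bring-up host (inferred from the ClusterKit request URLs, not stated explicitly), and mlx5_1 is a hypothetical device name.

```yaml
# Illustrative variables for "IB Network Discovery".
api_url: 'http://cluster-bringup:5000/api'   # assumed base URL of the REST API
ib_device: 'mlx5_1'                          # hypothetical; the documented default is 'mlx5_0'
subnet: 'infiniband-default'                 # documented default subnet name
```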

UFM Telemetry Upgrade

Install NVIDIA® UFM® Telemetry on one or more hosts.

Refer to the UFM Telemetry documentation on Docker Hub for further information.

Warning

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

The following instructions describe how to run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "UFM Telemetry Upgrade".

Warning

UFM Telemetry installation relies on Docker Hub. If you do not have access to this repository at run time, you must pull the image manually.

Warning

UFM Telemetry installation requires both the Docker API and the Docker SDK for Python as prerequisites.

The following variables are available for UFM Telemetry installation:

| Name | Description |
| --- | --- |
| cable_info_schedule | Time of collecting cable info data |
| container | Name of the UFM Telemetry container to run |
| device | Name of the HCA device to use |
| ib_device | Alias name for device. This variable is not available when device is set. |
| image | Image of the UFM Telemetry to deploy |
| sample_rate | Frequency of collecting port counters |
| tag | Specify a tag for pulling a specific UFM Telemetry image |

The following are variable definitions and default values for UFM Telemetry installation:

| Name | Default | Type |
| --- | --- | --- |
| cable_info_schedule | '1/00:00,3/00:00,5/00:00' | String |
| container | 'ufm-telemetry' | String |
| device | 'mlx5_0' | String |
| image | 'mellanox/ufm-telemetry' | String |
| sample_rate | 300 | Integer |
| tag | 'latest' | String |
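As a rough example, the following variables would deploy UFM Telemetry on a different HCA device with a different sample rate. The device name and the sample rate are hypothetical values; the remaining entries repeat the documented defaults.

```yaml
# Illustrative variables for "UFM Telemetry Upgrade".
image: 'mellanox/ufm-telemetry'   # documented default image
tag: 'latest'                     # documented default; replace with a specific tag if needed
container: 'ufm-telemetry'        # documented default container name
device: 'mlx5_1'                  # hypothetical; set either device or its alias ib_device
sample_rate: 120                  # hypothetical; the documented default is 300
```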

MLNX_OFED Upgrade

Install the NVIDIA® MLNX_OFED driver on one or more hosts.

Refer to the official NVIDIA Linux Drivers documentation for further information.

Warning

By default, this job template is configured to run against the hosts of IB Cluster Inventory.

The following instructions describe how to run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "MLNX_OFED Upgrade".

Warning

By default, the MLNX_OFED package is downloaded from the MLNX_OFED download center. You must specify the ofed_package_url variable when the download center is not available.

The following variables are available for MLNX_OFED installation:

| Name | Description |
| --- | --- |
| force | Install MLNX_OFED package even if it is already up to date |
| ofed_checksum | Checksum of the MLNX_OFED package to download |
| ofed_dependencies | List of all package dependencies for the MLNX_OFED package |
| ofed_install_options | List of optional arguments for the installation command |
| ofed_package_url | URL of the MLNX_OFED package to download |
| ofed_version | Version number of the MLNX_OFED package to install |
| working_dir | Path to the working directory on the host |

The following are variable definitions and default values for MLNX_OFED installation:

| Name | Default | Type |
| --- | --- | --- |
| force | false | Boolean |
| ofed_checksum | '' | String |
| ofed_dependencies | [] | List |
| ofed_install_options | [] | List |
| ofed_package_url | '' | String |
| ofed_version | '5.7-1.0.2.0' | String |
| working_dir | '/tmp' | String |

The following example shows MLNX_OFED for RHEL/CentOS 8.0 on the MLNX_OFED Download Center:

mlnx-ofed-download-center.png

    ofed_checksum: 'SHA256: 37b64787db9eabecc3cefd80151c0f49c852751d797e1ccdbb49d652f08916e3'
    ofed_version: '5.4-1.0.3.0'

MLNX-OS Upgrade

Update system firmware/OS software on one or more MLNX-OS switches.

Warning

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

The following instructions describe how to run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "MLNX-OS Upgrade".

Warning

Make sure all required variables described below are defined before running this job. You can define these variables either as inventory variables or as job template variables.

The following variables are required to update MLNX-OS system:

| Name | Description | Type |
| --- | --- | --- |
| mlnxos_image_url | URL of the MLNX-OS image to download | String |
| switch_username | Username to authenticate against target switches | String |
| switch_password | Password to authenticate against target switches | String |
| switches | List of IP addresses/hostnames of the switches to upgrade | List[String] |

The following variables are available to update MLNX-OS system:

| Name | Description |
| --- | --- |
| command_timeout | Time (in seconds) to wait for the command to complete |
| force | Update MLNX-OS system even if it is already up to date |
| image_url | Alias name for mlnxos_image_url. This variable is not available when mlnxos_image_url is set. |
| reload_command | Specify an alternative command to reload the switch system |
| reload_timeout | Time (in seconds) to wait for the switch system to reload |
| remove_images | Determine whether to remove all images on disk before the system upgrade starts |

The following are variable definitions and default values to update MLNX-OS system:

| Name | Default | Type |
| --- | --- | --- |
| command_timeout | 240 | Integer |
| force | false | Boolean |
| reload_command | '"reload noconfirm"' | String |
| reload_timeout | 200 | Integer |
| remove_images | false | Boolean |
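Put together, an upgrade of two switches might be described with variables like these. This is a sketch with placeholder values only: the image URL, the credentials, the switch names, and the longer reload timeout are all hypothetical.

```yaml
# Illustrative variables for "MLNX-OS Upgrade".
# All values are placeholders; none is a default shipped with the tool.
mlnxos_image_url: 'http://cluster-bringup:5000/downloads/mlnxos/<image-file>'  # hypothetical location
switch_username: 'admin'                  # placeholder account
switch_password: '<switch-password>'      # placeholder secret
switches:
  - 'ib-switch-01'                        # hypothetical hostname
  - '192.0.2.21'                          # documentation-range IP address
reload_timeout: 600                       # hypothetical; the documented default is 200 seconds
remove_images: true                       # remove images on disk before the upgrade starts
```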

MLNX-OS Configure

Execute configuration commands on one or more MLNX-OS switches.

The following instructions describe how to run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "MLNX-OS Configure".

The following variables are required to configure MLNX-OS system:

| Name | Description | Type |
| --- | --- | --- |
| switch_config_commands | List of configuration commands to execute | List[String] |
| switch_username | Username to authenticate against target switches | String |
| switch_password | Password to authenticate against target switches | String |
| switches | List of IP addresses/hostnames of the switches to configure | List[String] |

The following variables are available to configure MLNX-OS system:

| Name | Description |
| --- | --- |
| save_config | Determine whether to save the system configuration after the execution completes |

The following are variable definitions and default values to configure MLNX-OS system:

| Name | Default | Type |
| --- | --- | --- |
| save_config | true | Boolean |
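The variables for a configuration run could look roughly like this. The two NTP commands are illustrative only and should be checked against the command reference of the MLNX-OS release in use; the credentials, the switch name, and the NTP server address are placeholders.

```yaml
# Illustrative variables for "MLNX-OS Configure".
switch_config_commands:
  - 'ntp server 192.0.2.10'           # illustrative command with a documentation-range IP
  - 'ntp enable'                      # illustrative command
switch_username: 'admin'              # placeholder account
switch_password: '<switch-password>'  # placeholder secret
switches:
  - 'ib-switch-01'                    # hypothetical hostname
save_config: true                     # documented default
```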

MFT Upgrade

Install the NVIDIA® MFT package on one or more hosts.

Refer to the official Mellanox Firmware Tools documentation for further information.

Warning

By default, this job template is configured to run against the hosts of IB Cluster Inventory.

The following instructions describe how to run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "MFT Upgrade".

Warning

By default, the MFT package is downloaded from the MFT download center. You need to specify the mft_package_url variable when the download center is not available.

The following variables are available for MFT installation:

| Name | Description |
| --- | --- |
| force | Install MFT package even if it is already up to date |
| mft_checksum | Checksum of the MFT package to download |
| mft_dependencies | List of all package dependencies for the MFT package |
| mft_install_options | List of optional arguments for the installation command |
| mft_package_url | URL of the MFT package to download |
| mft_version | Version number of the MFT package to install |
| working_dir | Path to the working directory on the host |

The following are variable definitions and default values for MFT installation:

| Name | Default | Type |
| --- | --- | --- |
| force | false | Boolean |
| mft_checksum | '' | String |
| mft_dependencies | [] | List |
| mft_install_options | [] | List |
| mft_package_url | '' | String |
| mft_version | '4.21.0-99' | String |
| working_dir | '/tmp' | String |

The following example shows MFT for RedHat on the MFT Download Center:

mft-download-center.png

    mft_checksum: 'sha256: 57ba6a0e1aada907cb94759010b3d8a4b5b1e6db87ae638c9ac92e50beb1e29e'
    mft_version: '4.17.0-106'

HPC-X Upgrade

Install the NVIDIA® HPC-X® package on one or more hosts.

Refer to the official NVIDIA HPC-X documentation for further information.

Warning

By default, this job template is configured to run against the hosts of IB Cluster Inventory. You must set the hpcx_install_once variable to true when installing the HPC-X package to a shared location.

The following instructions describe how to run this job template:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "HPC-X Upgrade".

Warning

By default, the HPC-X package is downloaded from the HPC-X download center. You need to specify the hpcx_package_url variable when the download center is not available.

The following variables are available for HPC-X installation:

| Name | Description |
| --- | --- |
| force | Install HPC-X package even if it is already up to date |
| hpcx_checksum | Checksum of the HPC-X package to download |
| hpcx_dir | Target path for HPC-X installation folder |
| hpcx_install_once | Specify whether to install the HPC-X package via a single host. May be used to install the package on a shared directory. |
| hpcx_package_url | URL of the HPC-X package to download |
| hpcx_version | Version number of the HPC-X package to install |
| ofed_version | Version number of the OFED package compatible with the HPC-X package. This variable is required when MLNX_OFED is not installed on the host. |
| working_dir | Path to the working directory on the host |

The following are variable definitions and default values for HPC-X installation:

| Name | Default | Type |
| --- | --- | --- |
| force | false | Boolean |
| hpcx_checksum | '' | String |
| hpcx_dir | '/opt/nvidia/hpcx' | String |
| hpcx_install_once | false | Boolean |
| hpcx_package_url | '' | String |
| hpcx_version | '2.12.0' | String |
| ofed_version | '' | String |
| working_dir | '/tmp' | String |

The following example shows HPC-X for RedHat 8.0 on the HPC-X Download Center:

hpc-x-download-center.png

    hpcx_checksum: 'sha256: 57ba6a0e1aada907cb94759010b3d8a4b5b1e6db87ae638c9ac92e50beb1e29e'
    hpcx_version: '2.9.0'
    ofed_version: ''
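For an installation to a shared location, which requires hpcx_install_once to be true, the variables might be set as in the sketch below. The shared path is a hypothetical mount point and not a value taken from the tool.

```yaml
# Illustrative variables for "HPC-X Upgrade" on a shared directory.
hpcx_install_once: true     # install via a single host, as required for shared locations
hpcx_dir: '/shared/hpcx'    # hypothetical shared mount; the documented default is '/opt/nvidia/hpcx'
hpcx_version: '2.12.0'      # documented default version
```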

File Server

A file server is useful when you must access files (e.g., packages, images, etc.) that are not available on the web.

The files can be accessed over the following URL: http://<host>:<port>/downloads/ where host (IP address/hostname) and port are the address of your cluster bring-up host.

For example, if cluster-bringup is the hostname of your cluster bring-up host and the TCP port is 5000 as defined in the suggested configuration, then files can be accessed over the URL http://cluster-bringup:5000/downloads/.

To see all available files, open your browser and navigate to http://cluster-bringup:5000/downloads/.

index-of-downloads.png

  1. Create a directory for a specific cable firmware image and copy a binary image file into it. Run:

    [root@cluster-bringup ~]# mkdir -p \
        /opt/nvidia/cot/files/linkx/rel-38_100_121/iffu
    [root@cluster-bringup ~]# cp /tmp/hercules2.bin \
        /opt/nvidia/cot/files/linkx/rel-38_100_121/iffu

    The file can be accessed over the URL http://cluster-bringup:5000/downloads/linkx/rel-38_100_121/iffu/hercules2.bin.

  2. To see all available files, open a browser and navigate to http://cluster-bringup:5000/downloads/.

    index-of-downloads-linkx.png

  3. To see the image file, navigate to http://cluster-bringup:5000/downloads/linkx/rel-38_100_121/iffu/.

    index-of-downloads-hercules2.png
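A file published this way could then be referenced from a job template when the vendor download center is not reachable. The sketch below assumes that an MFT package was copied to a hypothetical mft subdirectory of the file server; the file name and the checksum are placeholders to be replaced with the real values.

```yaml
# Illustrative variables for "MFT Upgrade" using the local file server.
# The path under /downloads/ is hypothetical.
mft_package_url: 'http://cluster-bringup:5000/downloads/mft/mft-4.21.0-99-x86_64-rpm.tgz'
mft_version: '4.21.0-99'                         # documented default version
mft_checksum: 'sha256: <checksum-of-the-file>'   # placeholder, same format as the MFT example
```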

© Copyright 2023, NVIDIA. Last updated on Aug 28, 2023.