
The following subsections describe the currently supported job templates.

AWX Inventory Host Update

Create, update, or destroy one or more hosts on a specific AWX inventory.

To run this job template:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on "AWX Inventory Host Update".

Make sure that all required variables described below are defined before running this job. You can define these variables either as inventory variables or as job template variables.

The following variables are required to update inventory:

Variable | Description | Type
controller_host | URL to the AWX controller instance | String
controller_oauthtoken | OAuth token for the AWX controller instance | String
hostname | Hostname or a hostname expression of the host(s) to update | String

Alternatively, you can specify the following variables to update the inventory:

Variable | Description | Type
controller_host | URL to the AWX controller instance | String
controller_username | Username for the AWX controller instance | String
controller_password | Password for the AWX controller instance | String
hostname | Hostname or a hostname expression of the host(s) to update | String

The following variables are available to update inventory:

Variable | Description
api_url | URL to your cluster bring-up REST API. This variable is required when hostname_regex_enabled is set to true.
description | Description to use for the host(s)
host_enabled | Determine whether the host(s) should be enabled
hostname_regex_enabled | Determine whether to use a hostname expression to create the hostnames
host_state | State of the host resources. Options: present or absent.
inventory | Name of the inventory the host(s) should be made a member of

The following are variable definitions and default values to update inventory:

Variable | Default | Type
api_url | '' | String
description | '' | String
host_enabled | true | Boolean
hostname_regex_enabled | true | Boolean
host_state | 'present' | String
inventory | 'IB Cluster Inventory' | String
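
The following example shows one possible set of job template variables that creates several hosts from a hostname expression, assuming the controller_* variables are already defined as inventory variables; the API URL and hostname expression are placeholder values to adjust for your environment:

api_url: 'http://cluster-bringup:5000'
hostname: 'ib-node-[01-04]'
hostname_regex_enabled: true
host_state: 'present'
inventory: 'IB Cluster Inventory'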

Cable Validation

Perform cable validation according to a given topology file.

To run this job template:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on "Cable Validation". 

    Make sure that the filenames you provide in the ip_files and topo_files parameters are names of files located at /opt/nvidia/cot/cable_validation_files.

The following variables are required to run cable validation:

Variable | Description
api_url | URL to your cluster bring-up REST API
ip_files | List of IP filenames to use for cable validation
topo_files | List of topology filenames to use for cable validation

Optionally, you can specify the following variables for cable validation:

Variable | Description
remove_agents | Specify to remove the agents from the switches once validation is complete
delay_time | Time (in seconds) to wait between queries of async requests

The following are variable definitions and default values to run cable validation:

Variable | Default | Type
remove_agents | true | Boolean
delay_time | 10 | Integer

The following example shows how to provide the ip_files and topo_files parameters: 

ip_files: ['test-ip-file.ip']
topo_files: ['test-topo-file.topo']

In this example, the cable validation tool would expect to find the test-ip-file.ip and test-topo-file.topo files at /opt/nvidia/cot/cable_validation_files.

COT Python Alignment

Ensure that the Python environment for the COT client is installed on one or more hosts.

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on "COT Python Alignment".

The following variables are available for cluster orchestration Python environment installation:

Variable | Description
cot_dir | Target path to the installation root folder
force | Install the package even if it is already up to date
working_dir | Path to the working directory on the host

The following are variable definitions and default values for cluster bring-up client installation:

Variable | Default | Type
cot_dir | '/opt/nvidia/cot' | String
force | false | Boolean
working_dir | '/tmp' | String
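
For example, to force a reinstallation of the Python environment under the default installation root, the variables could be set as follows (the values shown are illustrative):

cot_dir: '/opt/nvidia/cot'
force: true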

ClusterKit

This job runs high-performance tests on the hosts of the inventory.

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on "ClusterKit".

ClusterKit relies on the HPC-X package. Make sure the HPC-X package is installed.

The following variables are available for running ClusterKit:

Variable | Description
clusterkit_hostname | Hostname expressions that represent the hostnames to run tests on
clusterkit_options | List of optional arguments for the tests
clusterkit_path | Path to the ClusterKit executable script
ib_device | Name of the RDMA device of the port used to connect to the fabric
inventory_group | Name of the inventory group for the hostnames to run tests on. This variable is not available when either use_hostfile is set to false or clusterkit_hostname is set.
max_hosts | Limit the number of hostnames. This variable is not available when use_hostfile is set to false.
use_hostfile | Determine whether to use a file for the hostnames to run tests on
working_dir | Path to the working directory on the host

The following are variable definitions and default values for running ClusterKit:

Variable | Default | Type
clusterkit_hostname | null | String
clusterkit_options | [] | List[String]
clusterkit_path | '/opt/nvidia/hpcx/clusterkit/bin/clusterkit.sh' | String
ib_device | 'mlx5_0' | String
inventory_group | all | String
max_hosts | -1 | Integer
use_hostfile | true | Boolean
working_dir | '/tmp' | String
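
The following sketch shows one possible way to limit a run to specific hosts by passing a hostname expression instead of a host file; the hostnames and device name are placeholders:

clusterkit_hostname: 'ib-node-[01-04]'
use_hostfile: false
ib_device: 'mlx5_0'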

The ClusterKit results are uploaded to the database after each run and can be accessed via the API.

The following are REST requests to retrieve ClusterKit results:

URL | Response | Method Type
/api/performance/clusterkit/results | Get a list of all the ClusterKit run IDs stored in the database | GET
/api/performance/clusterkit/results/<run_id> | Get a ClusterKit run's results based on its run ID | GET
/api/performance/clusterkit/results/<run_id>?raw_data=true | Get a ClusterKit run's test results as they are stored in the ClusterKit JSON output file, based on its run ID. Uses the query param "raw_data". | GET
/api/performance/clusterkit/results/<run_id>?test=<test name> | Get a specific test result of the ClusterKit run based on its run ID. Uses the query param "test". | GET

Query Param | Description
test | Returns a specific test result of the ClusterKit run
raw_data | Returns the data as it is stored in the ClusterKit output JSON files

Examples:

$ curl 'http://cluster-bringup:5000/api/performance/clusterkit/results'
["20220721_152951", "20220721_151736", "20220721_152900", "20220721_152702"]

$ curl 'http://cluster-bringup:5000/api/performance/clusterkit/results/20220721_152951?raw_data=true&test=latency'
{
  "Cluster": "Unknown",
  "User": "root",
  "Testname": "latency",
  "Date_and_Time": "2022/07/21 15:29:51",
  "JOBID": 0,
  "PPN": 28,
  "Bidirectional": "True",
  "Skip_Intra_Node": "True",
  "HCA_Tag": "Unknown",
  "Technology": "Unknown",
  "Units": "usec",
  "Nodes": {"ib-node-01": 0, "ib-node-02": 1},
  "Links": [[0, 41.885]]
}

Fabric Health Counters Collection

This job collects fabric counters with and without traffic using the CollectX and ClusterKit tools.

By default, this job template is configured to run with the ib_host_manager group specified in the IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on "Fabric Health Counters Collection".

The following variables are available for running Fabric Health Counters Collection:

Variable | Description
clusterkit_path | Path to the ClusterKit executable script
collection_interval | Interval of time between counter samples, in minutes
cot_executable | Path to the installed cotclient tool
counters_output_dir | Directory path to save counters data
ib_device | Name of the RDMA device of the port used to connect to the fabric
idle_test_time | Time to run monitor counters without traffic, in minutes
format_generate | Format the collected counters data with the specified type
hpcx_dir | Path to the HPC-X directory
reset_counters | Specify to reset counters before starting the counters collection
stress_test_time | Time to run monitor counters with traffic, in minutes
ufm_telemetry_path | Path for the UFM Telemetry directory located on the ib_host_manager_server
working_dir | Path to the working directory on the host

The following are variable definitions and default values for the fabric health counters collection:

Variable | Default | Type
clusterkit_path | '{hpcx_dir}/clusterkit/bin/clusterkit.sh' | String
collection_interval | 5 | Integer
cot_executable | '/opt/nvidia/cot/client/bin/cotclient' | String
counters_output_dir | '/tmp/collectx_counters_{date}_{time}/' | String
ib_device | 'mlx5_0' | String
idle_test_time | 30 | Integer
format_generate | 'basic' | String
hpcx_dir | '/opt/nvidia/hpcx' | String
reset_counters | true | Boolean
stress_test_time | 30 | Integer
ufm_telemetry_path | '{working_dir}/ufm_telemetry' | String
working_dir | '/tmp' | String
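
As a sketch, the following variables shorten the idle and stress phases and sample counters every minute; the values are illustrative rather than recommendations:

idle_test_time: 5
stress_test_time: 5
collection_interval: 1
reset_counters: true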

IB Fabric Health Checks

This job performs diagnostics on the fabric's state based on ibdiagnet checks, SM files, and switch commands.

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on "IB Fabric Health Checks".

The following variables are available for running IB Fabric Health Checks:

Variable | Description
check_max_failure_percentage | Max failure percentage for fabric health checks
cot_executable | Path to the installed cotclient tool
exclude_scope | List of node GUIDs and their ports to be excluded
ib_device | Name of the RDMA device of the port used to connect to the fabric
routing_check | Specify for routing check
sm_configuration_file | Path for the SM configuration file; supported only when the SM is running on the ib_host_manager
sm_unhealthy_ports_check | Specify for SM unhealthy ports check; supported only when the SM is running on the ib_host_manager
topology_type | Type of topology to discover
mlnxos_switch_hostname | Hostname expression that represents switches running MLNX-OS
mlnxos_switch_username | Username to authenticate against the target switches
mlnxos_switch_password | Password to authenticate against the target switches

The following are variable definitions and default values for the health check:

Variable | Default | Type
check_max_failure_percentage | 1 | Float
cot_executable | '/opt/nvidia/cot/client/bin/cotclient' | String
exclude_scope | null | List[String]
ib_device | 'mlx5_0' | String
routing_check | true | Boolean
sm_configuration_file | '/etc/opensm/opensm.conf' | String
sm_unhealthy_ports_check | false | Boolean
topology_type | 'infiniband' | String
mlnxos_switch_hostname | null | String
mlnxos_switch_username | null | String
mlnxos_switch_password | null | String

The following example shows how to exclude ports using the exclude_scope variable:

exclude_scope: ['0x1234@1/3', '0x1235']

In this example, IB Fabric Health Check runs over the fabric except on ports 1 and 3 of node GUID 0x1234 and all ports of node GUID 0x1235.

The following example shows how to configure switch variables:

mlnxos_switch_hostname: 'ib-switch-t[1-2],ib-switch-s1'
mlnxos_switch_username: 'admin'
mlnxos_switch_password: 'my_admin_password'

In this example, IB Fabric Health Check performs a check that requires switch connection over ib-switch-t1, ib-switch-t2, and ib-switch-s1 using the username admin and password my_admin_password for the connection.

IB Network Discovery

This job discovers network topology and updates the database.

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on "IB Network Discovery".

The following variables are required for network discovery:

Variable | Description | Type
api_url | URL to your cluster bring-up REST API | String

For the network discovery to find the IPs of MLNX-OS switches, the ufm_telemetry_path variable is required. This feature is supported for UFM Telemetry version 1.11.0 and above.

The following variables are available for network discovery:

Variable | Description
clear_topology | Use to clear previous topology data
ufm_telemetry_path | Path for the UFM Telemetry folder located on the ib_host_manager_server. Specify to use UFM Telemetry's ibdiagnet tool for the network discovery (e.g., '/tmp/ufm_telemetry').
switch_username | Username to authenticate against MLNX-OS switches
switch_password | Password to authenticate against MLNX-OS switches
cot_python_interpreter | Path to the cluster orchestration Python interpreter
ib_device | Name of the in-band HCA device to use (e.g., 'mlx5_0')
subnet | Name of the subnet that the topology nodes are members of

The following are variable definitions and default values for network discovery:

Variable | Default | Type
clear_topology | false | Boolean
ufm_telemetry_path | null | String
cot_python_interpreter | '/opt/nvidia/cot/client/bin/python/' | String
ib_device | 'mlx5_0' | String
subnet | 'infiniband-default' | String
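
The following example runs the discovery using UFM Telemetry's ibdiagnet tool and clears the previously stored topology; the telemetry path and switch credentials are placeholders:

clear_topology: true
ufm_telemetry_path: '/tmp/ufm_telemetry'
switch_username: 'admin'
switch_password: 'my_admin_password'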

UFM Telemetry Upgrade

This job installs NVIDIA® UFM® Telemetry on one or more hosts.

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on "UFM Telemetry Upgrade".

The following variables are required for UFM Telemetry installation:

Variable | Description
ufm_telemetry_package_url | URL of the UFM Telemetry package to download

The following variables are available for UFM Telemetry installation:

Variable | Description
working_dir | Destination path for installing UFM Telemetry. The package will be placed in a subdirectory called ufm_telemetry. Default: /tmp.
ufm_telemetry_checksum | Checksum of the UFM Telemetry package to download
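
For example, to install a package served from the local file server described later on this page, the variables could look as follows; the filename and checksum are placeholders, not a real package:

ufm_telemetry_package_url: 'http://cluster-bringup:5000/downloads/ufm_telemetry.tar.gz'
ufm_telemetry_checksum: 'sha256: <package checksum>'
working_dir: '/tmp'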

MLNX_OFED Upgrade

This job installs the NVIDIA® MLNX_OFED driver on one or more hosts.

Refer to the official NVIDIA Linux Drivers documentation for further information.

By default, this job template is configured to run against the hosts of IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on "MLNX_OFED Upgrade".

By default, the MLNX_OFED package is downloaded from the MLNX_OFED download center. You must specify the ofed_version (or use its default value) and the ofed_package_url variables when the download center is not available.

The following variables are available for MLNX_OFED installation:

Variable | Description
force | Install the MLNX_OFED package even if it is already up to date
ofed_checksum | Checksum of the MLNX_OFED package to download
ofed_dependencies | List of all package dependencies for the MLNX_OFED package
ofed_install_options | List of optional arguments for the installation command
ofed_package_url | URL of the MLNX_OFED package to download (default: auto-detection). In addition, you must specify the ofed_version parameter or use its default value.
ofed_version | Version number of the MLNX_OFED package to install
working_dir | Path to the working directory on the host

The following are variable definitions and default values for MLNX_OFED installation:

Variable | Default | Type
force | false | Boolean
ofed_checksum | '' | String
ofed_dependencies | [] | List
ofed_install_options | [] | List
ofed_package_url | '' | String
ofed_version | '23.04-0.5.3.3' | String
working_dir | '/tmp' | String

The following example shows MLNX_OFED for RHEL/CentOS 8.0 on the MLNX_OFED Download Center:

ofed_checksum: 'SHA256: 37b64787db9eabecc3cefd80151c0f49c852751d797e1ccdbb49d652f08916e3'
ofed_version: '5.4-1.0.3.0'

MLNX-OS Upgrade

This job updates the system firmware/OS software on one or more MLNX-OS switches.

By default, this job template is configured to run against the ib_host_manager group of IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on "MLNX-OS Upgrade".

Make sure all required variables described below are defined before running this job. You can define these variables either as inventory variables or as job template variables. 

The following variables are required to update the MLNX-OS system:

Variable | Description | Type
mlnxos_image_url | URL of the MLNX-OS image to download | String
switch_username | Username to authenticate against target switches | String
switch_password | Password to authenticate against target switches | String
switches | List of IP addresses/hostnames of the switches to upgrade | List[String]

The following variables are available to update the MLNX-OS system:

Variable | Description
command_timeout | Time (in seconds) to wait for the command to complete
force | Update the MLNX-OS system even if it is already up to date
image_url | Alias name for mlnxos_image_url. This variable is not available when mlnxos_image_url is set.
reload_command | Specify an alternative command to reload the switch system
reload_timeout | Time (in seconds) to wait for the switch system to reload
remove_images | Determine whether to remove all images on disk before the system upgrade starts

The following are variable definitions and default values for updating the MLNX-OS system:

Variable | Default | Type
command_timeout | 240 | Integer
force | false | Boolean
reload_command | '"reload noconfirm"' | String
reload_timeout | 200 | Integer
remove_images | false | Boolean
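
The following example upgrades two switches from an image served over HTTP; the image URL, switch hostnames, and credentials are placeholders:

mlnxos_image_url: 'http://cluster-bringup:5000/downloads/<mlnx-os-image>.img'
switch_username: 'admin'
switch_password: 'my_admin_password'
switches: ['ib-switch-t1', 'ib-switch-t2']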

MLNX-OS Configure

This job executes configuration commands on one or more MLNX-OS switches.

To run this job template:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on "MLNX-OS Configure".

The following variables are required to configure the MLNX-OS system:

Variable | Description | Type
switch_config_commands | List of configuration commands to execute | List[String]
switch_username | Username to authenticate against target switches | String
switch_password | Password to authenticate against target switches | String
switches | List of IP addresses/hostnames of the switches to configure | List[String]

The following variables are available to configure the MLNX-OS system:

Variable | Description
save_config | Determine whether to save the system configuration after the execution completes

The following are variable definitions and default values to configure the MLNX-OS system:

Variable | Default | Type
save_config | true | Boolean
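
The following sketch configures an NTP server on a single switch; the switch hostname, credentials, and configuration command are illustrative only and should be replaced with commands valid for your MLNX-OS version:

switches: ['ib-switch-t1']
switch_username: 'admin'
switch_password: 'my_admin_password'
switch_config_commands: ['ntp server 192.168.1.1']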

MFT Upgrade

This job installs the NVIDIA® MFT package on one or more hosts.

Refer to the official Mellanox Firmware Tools documentation for further information. 

By default, this job template is configured to run against the hosts of IB Cluster Inventory.

To run this job template:

  1. Go to Resources > Templates.
  2. Click the Launch Template button on "MFT Upgrade".

By default, the MFT package is downloaded from the MFT download center. You must specify the mft_version (or use its default value) and the mft_package_url variables when the download center is not available.

The following variables are available for MFT installation:

Variable | Description
force | Install the MFT package even if it is already up to date
mft_checksum | Checksum of the MFT package to download
mft_dependencies | List of all package dependencies for the MFT package
mft_install_options | List of optional arguments for the installation command
mft_package_url | URL of the MFT package to download (default: auto-detection). In addition, you must specify the mft_version parameter or use its default value.
mft_version | Version number of the MFT package to install
working_dir | Path to the working directory on the host

The following are variable definitions and default values for MFT installation:

Variable | Default | Type
force | false | Boolean
mft_checksum | '' | String
mft_dependencies | [] | List
mft_install_options | [] | List
mft_package_url | '' | String
mft_version | '4.24.0-72' | String
working_dir | '/tmp' | String

The following example shows MFT for RedHat on the MFT Download Center:

mft_checksum: 'sha256: 57ba6a0e1aada907cb94759010b3d8a4b5b1e6db87ae638c9ac92e50beb1e29e'
mft_version: '4.17.0-106'

HPC-X Upgrade

This job installs the NVIDIA® HPC-X® package on one or more hosts.

Refer to the official NVIDIA HPC-X documentation for further information.

By default, this job template is configured to run against the hosts of IB Cluster Inventory. You must set the hpcx_install_once variable to true when installing the HPC-X package to a shared location.

To run this job template:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on "HPC-X Upgrade".

By default, the HPC-X package is downloaded from the HPC-X download center. You need to specify the hpcx_version (or use its default value) and the hpcx_package_url variables when the download center is not available.

The following variables are available for HPC-X installation:

Variable | Description
force | Install the HPC-X package even if it is already up to date
hpcx_checksum | Checksum of the HPC-X package to download
hpcx_dir | Target path for the HPC-X installation folder
hpcx_install_once | Specify whether to install the HPC-X package via a single host. May be used to install the package on a shared directory.
hpcx_package_url | URL of the HPC-X package to download (default: auto-detection). In addition, you must specify the hpcx_version parameter or use its default value.
hpcx_version | Version number of the HPC-X package to install
ofed_version | Version number of the OFED package compatible with the HPC-X package. This variable is required when MLNX_OFED is not installed on the host.
working_dir | Path to the working directory on the host

The following are variable definitions and default values for HPC-X installation:

Variable | Default | Type
force | false | Boolean
hpcx_checksum | '' | String
hpcx_dir | '/opt/nvidia/hpcx' | String
hpcx_install_once | false | Boolean
hpcx_package_url | '' | String
hpcx_version | '2.15.0' | String
ofed_version | '' | String
working_dir | '/tmp' | String

The following example shows HPC-X for RedHat 8.0 on the HPC-X Download Center:

hpcx_checksum: 'sha256: 57ba6a0e1aada907cb94759010b3d8a4b5b1e6db87ae638c9ac92e50beb1e29e'
hpcx_version: '2.9.0'
ofed_version: ''

File Server

A file server is useful when you must access files (e.g., packages, images, etc.) that are not available on the web.

The files can be accessed over the following URL: http://<host>:<port>/downloads/, where host is the IP address or hostname of your cluster bring-up host and port is its configured TCP port.

For example, if cluster-bringup is the hostname of your cluster bring-up host and the TCP port is 5000 as defined in the suggested configuration, then files can be accessed over the URL http://cluster-bringup:5000/downloads/.

For example, to serve a cable firmware image file:

  1. Create a directory for a specific cable firmware image and copy a binary image file into it. Run: 

    [root@cluster-bringup ~]# mkdir -p \
    /opt/nvidia/cot/files/linkx/rel-38_100_121/iffu
    [root@cluster-bringup ~]# cp /tmp/hercules2.bin \
    /opt/nvidia/cot/files/linkx/rel-38_100_121/iffu

    The file can be accessed over the URL http://cluster-bringup:5000/downloads/linkx/rel-38_100_121/iffu/hercules2.bin.

  2. To see all available files, open a browser and navigate to http://cluster-bringup:5000/downloads/.
  3. To see the image file, navigate to http://cluster-bringup:5000/downloads/linkx/rel-38_100_121/iffu/.
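
A file placed under /opt/nvidia/cot/files (the directory served at /downloads/ in the example above) can then be referenced by any job template variable that expects a download URL. For example, assuming an MLNX_OFED package was copied into that directory, the MLNX_OFED Upgrade job could point at it as follows; the filename and version are placeholders:

ofed_package_url: 'http://cluster-bringup:5000/downloads/MLNX_OFED_LINUX-<version>-<distro>-x86_64.tgz'
ofed_version: '<version>'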