This inventory is a collection of hosts against which jobs may be launched to deploy the InfiniBand cluster.

The predefined group named ib_host_manager must contain a single host for in-band tasks.

Requirements

All hosts in this inventory must have Python 3.6 or greater.

All hosts in the ib_host_manager group must have the following:

  • Python ≥ 3.6
  • MLNX_OFED ≥ 5.6
  • MFT ≥ 4.20

MFT and MLNX_OFED packages are installed using the Host Package Deployment workflow.

If that workflow is not part of your bring-up flow, make sure both packages are already installed.

Pass/Fail Criteria

To define specific pass/fail criteria, set the pass_fail_criteria variable. Its value must be a dictionary that maps a job template (identified by its playbook name) to a user-defined criteria dictionary. Each criteria dictionary should contain two special keys: max_fail_percentage and action.

  • max_fail_percentage expects an integer from 0 to 100. It is the percentage of failures that is acceptable during execution of the supported job template. The default value is 0, which means the job template fails if any host (one or more) fails.
  • action defines the operation to perform if the actual failure percentage is greater than the max_fail_percentage value.

Supported job template actions (operation types):

Action/Operation | Description
stop | Fails the execution of the job

Playbook name (key names supported for pass_fail_criteria) to job template name mapping:

Playbook Name | Job Template Name
hca_fw_update | HCA Firmware Update
ib_hca_fw_update | IB HCA Firmware Update
ib_cable_fw_update | IB Cable Firmware Update
ib_switch_fw_update | IB Externally Managed Switch Firmware Update
mlnxos_configure | MLNX-OS Configure
mlnxos_upgrade | MLNX-OS Upgrade
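
The rules above can be checked before the variable is saved. The following is a minimal sketch, not part of the product: the function name validate_pass_fail_criteria is hypothetical, and treating a missing action key as an error is an assumption, since this page does not state a default for action.

```python
# Playbook names accepted as keys, taken from the mapping above.
SUPPORTED_PLAYBOOKS = {
    "hca_fw_update",
    "ib_hca_fw_update",
    "ib_cable_fw_update",
    "ib_switch_fw_update",
    "mlnxos_configure",
    "mlnxos_upgrade",
}
# Actions listed as supported above.
SUPPORTED_ACTIONS = {"stop"}


def validate_pass_fail_criteria(criteria):
    """Return a list of problems found in a pass_fail_criteria dictionary."""
    problems = []
    for playbook, rule in criteria.items():
        if playbook not in SUPPORTED_PLAYBOOKS:
            problems.append(f"{playbook}: unsupported playbook name")
            continue
        # The documented default for max_fail_percentage is 0.
        value = rule.get("max_fail_percentage", 0)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 0 <= value <= 100:
            problems.append(
                f"{playbook}: max_fail_percentage must be an integer from 0 to 100"
            )
        # Assumption: a missing action is reported, as no default is documented.
        if rule.get("action") not in SUPPORTED_ACTIONS:
            problems.append(f"{playbook}: unsupported action {rule.get('action')!r}")
    return problems


print(validate_pass_fail_criteria(
    {"hca_fw_update": {"max_fail_percentage": 40, "action": "stop"}}
))
```

An empty list means the dictionary uses only the playbook names, value range, and action documented on this page.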

Example of the pass_fail_criteria variable (placed in the inventory variables list):

pass_fail_criteria: '{"hca_fw_update": {"max_fail_percentage": 40, "action": "stop"}, "ib_switch_fw_update": {"max_fail_percentage": 80, "action": "stop"}}'

In this example, the user provides criteria for two job templates: HCA Firmware Update (hca_fw_update) and IB Externally Managed Switch Firmware Update (ib_switch_fw_update).

  • For the HCA Firmware Update job template, max_fail_percentage is set to 40. Suppose there are 3 hosts in total. If only one host fails, the job template passes (33% actual failure, which is less than 40%). If two hosts fail, the job template fails (67% actual failure, which is greater than 40%).
  • For the IB Externally Managed Switch Firmware Update job template, max_fail_percentage is set to 80. This job template fails only if more than 80% of the hosts fail.
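
The arithmetic in this example can be reproduced with a short sketch. It parses the same JSON string and applies the "greater than max_fail_percentage" rule described earlier. The function name job_passes is hypothetical, and using the exact ratio without rounding is an assumption, since this page does not say how the percentage is rounded.

```python
import json

# The inventory variable value from the example above (a JSON string).
PASS_FAIL_CRITERIA = (
    '{"hca_fw_update": {"max_fail_percentage": 40, "action": "stop"}, '
    '"ib_switch_fw_update": {"max_fail_percentage": 80, "action": "stop"}}'
)


def job_passes(criteria_json, playbook, failed_hosts, total_hosts):
    """Return True when the failure percentage does not exceed the threshold."""
    criteria = json.loads(criteria_json)
    # The documented default of 0 applies when no criteria are defined.
    threshold = criteria.get(playbook, {}).get("max_fail_percentage", 0)
    actual = 100.0 * failed_hosts / total_hosts
    # The action is taken only when the actual percentage is greater.
    return actual <= threshold


print(job_passes(PASS_FAIL_CRITERIA, "hca_fw_update", 1, 3))  # 1 of 3 hosts failed
print(job_passes(PASS_FAIL_CRITERIA, "hca_fw_update", 2, 3))  # 2 of 3 hosts failed
```

With one failed host out of three the function returns True, and with two failed hosts it returns False, matching the description above.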

IB Cluster Inventory Hosts Example

Add one or more hosts to IB Cluster Inventory.

  1. Go to Resources > Templates.
  2. Click the "Launch Template" icon for "AWX Inventory Host Update".
  3. Specify the following required variables: 
    • api_url – URL to your cluster bring-up REST API
    • controller_host – URL to your AWX controller instance
    • controller_username – username for your AWX controller instance
    • controller_password – password for your AWX controller instance
    • inventory – inventory the host(s) should be made a member of (default: IB Cluster Inventory)
    • hostname – a hostname or a hostname expression of the end-host(s) to add/remove
  4. Click the Next button.
  5. Click the Launch button. 

    Instead of the controller_username and controller_password variables, you may specify the controller_oauthtoken variable with an OAuth token for your AWX controller instance.

  6. Select the Groups tab and click on a group named ib_host_manager.
  7. Select the Hosts tab and click the "Add" button to add a new host to the group.
  8. Select the "Add existing host" option and choose one host to be a member of the group.
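
Before launching "AWX Inventory Host Update", the variable set from step 3 can be checked for completeness. The sketch below is illustrative only: the function name missing_host_update_vars and all URL and hostname values are hypothetical, and leaving inventory out of the required set relies on its documented default (IB Cluster Inventory).

```python
# Variables from step 3 that have no documented default.
REQUIRED = ("api_url", "controller_host", "hostname")
# Credentials that controller_oauthtoken may replace.
PASSWORD_AUTH = ("controller_username", "controller_password")


def missing_host_update_vars(extra_vars):
    """Return the names of variables still missing from extra_vars."""
    missing = [name for name in REQUIRED if not extra_vars.get(name)]
    # Username and password are only needed when no OAuth token is given.
    if not extra_vars.get("controller_oauthtoken"):
        missing += [name for name in PASSWORD_AUTH if not extra_vars.get(name)]
    return missing


# Hypothetical values for illustration only.
example_vars = {
    "api_url": "http://bringup.example.com/api",
    "controller_host": "http://awx.example.com",
    "controller_oauthtoken": "example-token",
    "hostname": "ib-node-01",
}
print(missing_host_update_vars(example_vars))
```

An empty list indicates that every variable without a default has been supplied.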

IB Cluster Inventory Variables Example

You can specify variable definitions and values to be applied to all hosts in this inventory.

To define variables for the IB Cluster Inventory:

  1. Go to Inventories > IB Cluster Inventory and select the Details tab.
  2. Click the Edit button, which opens the Edit details dialog.
  3. Enter variables using either JSON or YAML syntax.
  4. Click Save when finished.
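
As an illustration of step 3, the sketch below generates variables text in JSON syntax. The URL is a placeholder, and serializing pass_fail_criteria into a string is an assumption based on the quoted value shown in the example on this page.

```python
import json

# Criteria for one job template, using key names from this page.
criteria = {"hca_fw_update": {"max_fail_percentage": 40, "action": "stop"}}

inventory_vars = {
    # Hypothetical URL; replace with your cluster bring-up REST API.
    "api_url": "http://bringup.example.com/api",
    # The example on this page stores the criteria as a JSON string,
    # so the dictionary is serialized before being embedded.
    "pass_fail_criteria": json.dumps(criteria),
    # Documented default value, shown here explicitly.
    "hpcx_install_once": False,
}

# Text that could be pasted into the variables field in JSON syntax.
variables_text = json.dumps(inventory_vars, indent=2)
print(variables_text)
```

Generating the text this way avoids hand-escaping the quotes inside the nested pass_fail_criteria string.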

This section describes all available variables for this inventory.

Cluster Bring-Up Web Framework Variables

Name | Description
api_url | URL to your cluster bring-up REST API
pypi_url | URL to your cluster bring-up PyPI repository

MLNX_OFED Upgrade Variables

Name | Description
ofed_package_url | URL of the MLNX_OFED package to download (default: auto-detection). In addition, you must specify the ofed_version parameter or use its default value.
ofed_dependencies | List of all package dependencies for the MLNX_OFED package
ofed_install_options | List of optional arguments for the installation command
ofed_version | Version number of the MLNX_OFED package to install

MFT Upgrade Variables

Name | Description
mft_package_url | URL of the MFT package to download (default: auto-detection). In addition, you must specify the mft_version parameter or use its default value.
mft_dependencies | List of all package dependencies for the MFT package
mft_install_options | List of optional arguments for the installation command
mft_version | Version number of the MFT package to install

UFM Telemetry Upgrade Variables

Name | Description
ufm_telemetry_package_url | URL of the NVIDIA® UFM® Telemetry package to download

HPC-X Upgrade Variables

Name | Description
hpcx_dir | Target path for the HPC-X installation folder (default: /opt/nvidia/hpcx)
hpcx_package_url | URL of the HPC-X package to download (default: auto-detection). In addition, you must specify the hpcx_version parameter or use its default value.
hpcx_version | Version number of the HPC-X package to install
hpcx_install_once | Whether to install the HPC-X package from a single host. May be used to install the package in a shared directory (default: false).
ofed_version | Version number of the OFED package that is compatible with the HPC-X package (default: auto-detection). This variable is mandatory when MLNX_OFED is not installed on the host.

ClusterKit Variables

Name | Description
clusterkit_hostname | Hostname expression that represents the hostname to run tests on (default: all hosts in the inventory)
clusterkit_options | List of optional arguments for the tests

MLNX-OS Upgrade Variables

Name | Description
mlnxos_switch_hostname | Hostname expression that represents the names of the switches to upgrade. This parameter may be omitted to auto-detect the MLNX-OS switches, which requires UFM Telemetry; in that case, make sure to run IB Network Discovery with the ufm_telemetry_path parameter.
mlnxos_image_url | URL of the NVIDIA® MLNX-OS® image to download
mlnxos_switch_username | Username to authenticate against target switches
mlnxos_switch_password | Password to authenticate against target switches

Externally Managed Switch Firmware Upgrade Variables

Name | Description
switch_fw_image_url | URL of the firmware image to download
switch_psid | PSID of the externally managed switch device to upgrade

HCA Firmware Upgrade Variables

Name | Description
hca_fw_image_url | URL of the firmware image to download
hca_psid | PSID of the HCA device to upgrade

Cable Firmware Upgrade (IFFU) Variables

Name | Description
iffu_auto_update | Specify whether to update all supported cables/transceivers connected to the host/switch (default: true)
iffu_fw_version | Firmware version number of the cable image to update. This variable is mandatory when the cable image is not queryable.
iffu_image_url | URL of the firmware image to download
cable_identifier | Identifier of the cable/transceiver to update (e.g., OSFP, QSFP56)
cable_part_number | Part number of the cable/transceiver to upgrade