IB Cluster Inventory

This inventory is a collection of hosts against which jobs may be launched to deploy the InfiniBand cluster.

The predefined group named ib_host_manager must contain a single host for in-band tasks.

All hosts in this inventory must have Python 3.6 or greater.

Every host in the ib_host_manager group must have the following:

  • Python ≥ 3.6

  • MLNX_OFED ≥ 5.6

  • MFT ≥ 4.20
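The minimum versions above are best compared as numbers, component by component. A plain string comparison would wrongly rank MFT 4.9 above 4.20. The following is a minimal Python sketch of such a check. The function names, the MINIMUMS table, and the sample version strings are illustrative assumptions and are not part of the cluster bring-up tooling. The sketch also assumes each version string starts with its dotted number and ignores any trailing build suffix.

```python
import re

# Minimum versions taken from the requirements list above.
MINIMUMS = {
    "python": "3.6",
    "mlnx_ofed": "5.6",
    "mft": "4.20",
}


def parse_version(text):
    """Return the leading dotted-number part of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no numeric version found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Compare two versions numerically, component by component."""
    return parse_version(installed) >= parse_version(minimum)


def check_host(installed_versions):
    """Return the names of components that are missing or older than required."""
    failures = []
    for name, minimum in MINIMUMS.items():
        installed = installed_versions.get(name)
        if installed is None or not meets_minimum(installed, minimum):
            failures.append(name)
    return failures
```

For example, check_host({"python": "3.8.10", "mlnx_ofed": "5.8-1.0.1.1", "mft": "4.9"}) reports only "mft", because 4.9 is numerically older than 4.20.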

Warning

The MFT and MLNX_OFED packages are installed using the Host Package Deployment workflow.

If this workflow is not part of the bring-up flow, make sure both packages are already installed.

To add one or more hosts to the IB Cluster Inventory:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" icon for "AWX Inventory Host Update".

  3. Specify the following required variables:

    • api_url – URL to your cluster bring-up REST API

    • controller_host – URL to your AWX controller instance

    • controller_username – username for your AWX controller instance

    • controller_password – password for your AWX controller instance

    • inventory – inventory the host(s) should be made a member of (default: IB Cluster Inventory)

    • hostname – a hostname or a hostname expression of the end-host(s) to add/remove

  4. Click the Next button.

  5. Click the Launch button.

    Warning

    You may specify the controller_oauthtoken variable with an OAuth token for your AWX controller instance instead of using the controller_username and controller_password variables.

  6. Select the Groups tab and click the group named ib_host_manager.

  7. Select the Hosts tab and click the "Add" button to add a new host to the group.

  8. Select the "Add existing host" option and choose one host to be a member of the group.
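The variables for the "AWX Inventory Host Update" template can be assembled and sanity-checked before they are entered in the launch dialog. The following is a minimal Python sketch under stated assumptions. The build_template_vars function is hypothetical, and the example URLs and hostname are placeholders. Rejecting a token combined with a username or password is a choice made in this sketch, not documented behavior of the template.

```python
import json

# Default inventory name as stated in the variable list above.
DEFAULT_INVENTORY = "IB Cluster Inventory"


def build_template_vars(api_url, controller_host, hostname,
                        username=None, password=None, oauthtoken=None,
                        inventory=DEFAULT_INVENTORY):
    """Assemble the template variables and validate the credential choice."""
    for name, value in (("api_url", api_url),
                        ("controller_host", controller_host),
                        ("hostname", hostname)):
        if not value:
            raise ValueError(f"{name} is required")

    variables = {
        "api_url": api_url,
        "controller_host": controller_host,
        "inventory": inventory,
        "hostname": hostname,
    }

    if oauthtoken:
        # Assumption of this sketch: mixing both credential styles is an error.
        if username or password:
            raise ValueError("use either the OAuth token or username/password, not both")
        variables["controller_oauthtoken"] = oauthtoken
    elif username and password:
        variables["controller_username"] = username
        variables["controller_password"] = password
    else:
        raise ValueError("provide controller_oauthtoken, or both "
                         "controller_username and controller_password")
    return variables


if __name__ == "__main__":
    # Placeholder values only; substitute the details of your own deployment.
    example = build_template_vars(
        api_url="http://bringup.example.com/api",
        controller_host="http://awx.example.com",
        hostname="node01.example.com",
        oauthtoken="<your-oauth-token>",
    )
    print(json.dumps(example, indent=2))
```

Keeping the credential check in one place catches the most likely mistake: launching the template with neither a token nor a complete username and password pair.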

You can specify variable definitions and values to be applied to all hosts in this inventory.

To define variables for the IB Cluster Inventory:

  1. Go to Inventories > IB Cluster Inventory and select the Details tab.

  2. Click the Edit button, which opens the Edit details dialog.

  3. Enter variables using either JSON or YAML syntax.

  4. Click Save when finished.
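Because the variables field accepts JSON, the variable block can be generated programmatically and pasted into the Edit details dialog. The following is a minimal Python sketch. The variable names come from the tables in this section. Every value except the documented hpcx_dir and hpcx_install_once defaults is a placeholder, and the render_inventory_vars helper is hypothetical.

```python
import json

# Variable names come from the tables in this section; values are placeholders
# apart from the documented defaults for hpcx_dir and hpcx_install_once.
inventory_vars = {
    "api_url": "http://bringup.example.com/api",
    "pypi_url": "http://bringup.example.com/pypi",
    "ofed_version": "<mlnx-ofed-version>",
    "mft_version": "<mft-version>",
    "hpcx_dir": "/opt/nvidia/hpcx",
    "hpcx_install_once": False,
}


def render_inventory_vars(variables):
    """Serialize variables as indented JSON, dropping entries left unset (None)."""
    cleaned = {key: value for key, value in variables.items() if value is not None}
    return json.dumps(cleaned, indent=2, sort_keys=True)


print(render_inventory_vars(inventory_vars))
```

Dropping unset entries keeps the pasted block limited to the variables that are deliberately overridden, so the remaining variables keep their defaults.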

This section describes all available variables for this inventory.

Cluster Bring-Up Web Framework Variables

  • api_url – URL to your cluster bring-up REST API

  • pypi_url – URL to your cluster bring-up PyPI repository


MLNX_OFED Upgrade Variables

  • ofed_package_url – URL of the MLNX_OFED package to download (default: auto-detection). In addition, you must specify the ofed_version parameter or use its default value.

  • ofed_dependencies – List of all package dependencies for the MLNX_OFED package

  • ofed_install_options – List of optional arguments for the installation command

  • ofed_version – Version number of the MLNX_OFED package to install


MFT Upgrade Variables

  • mft_package_url – URL of the MFT package to download (default: auto-detection). In addition, you must specify the mft_version parameter or use its default value.

  • mft_dependencies – List of all package dependencies for the MFT package

  • mft_install_options – List of optional arguments for the installation command

  • mft_version – Version number of the MFT package to install


UFM Telemetry Upgrade Variables

  • ufm_telemetry_package_url – URL of the NVIDIA® UFM® Telemetry package to download


HPC-X Upgrade Variables

  • hpcx_dir – Target path for the HPC-X installation folder (default: /opt/nvidia/hpcx)

  • hpcx_package_url – URL of the HPC-X package to download (default: auto-detection). In addition, you must specify the hpcx_version parameter or use its default value.

  • hpcx_version – Version number of the HPC-X package to install

  • hpcx_install_once – Whether to install the HPC-X package via a single host. May be used to install the package in a shared directory (default: false).

  • ofed_version – Version number of the OFED package that is compatible with the HPC-X package (default: auto-detection). This variable is mandatory when MLNX_OFED is not installed on the host.


ClusterKit Variables

  • clusterkit_hostname – Hostname expression that represents the hosts to run tests on (default: all hosts in the inventory)

  • clusterkit_options – List of optional arguments for the tests


MLNX-OS Upgrade Variables

  • mlnxos_switch_hostname – Hostname expression that represents the names of the switches to upgrade. To omit this parameter and rely on auto-detection of the MLNX-OS switches, UFM Telemetry is required; in that case, make sure to run IB Network Discovery with the ufm_telemetry_path parameter.

  • mlnxos_image_url – URL of the NVIDIA® MLNX-OS® image to download

  • mlnxos_switch_username – Username to authenticate against target switches

  • mlnxos_switch_password – Password to authenticate against target switches


Externally Managed Switch Firmware Upgrade Variables

  • switch_fw_image_url – URL of the firmware image to download

  • switch_psid – PSID of the externally managed switch device to upgrade


HCA Firmware Upgrade Variables

  • hca_fw_image_url – URL of the firmware image to download

  • hca_psid – PSID of the HCA device to upgrade


Cable Firmware Upgrade (IFFU) Variables

  • iffu_fw_version – Firmware version number of the cable image to update. This variable is mandatory when the cable image is not queryable.

  • iffu_image_url – URL of the firmware image to download

  • cable_identifier – Identifier of the cable/transceiver to update (e.g., OSFP, QSFP56)

  • cable_part_number – Part number of the cable/transceiver to update. Can be provided as a regular expression (e.g., 'MFS1S00-H0(03|05|10)E_QP'). If not provided, the job runs in auto-update mode, so all supported cables/transceivers are updated.
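The regular expression quoted for cable_part_number can be tried out before launching a job. The following is a minimal Python sketch that assumes Python re semantics and a match against the whole part number. The source does not state which regular expression flavor or matching mode the job uses. The candidate part numbers and the select_part_numbers helper are hypothetical and serve only to exercise the pattern.

```python
import re

# Pattern quoted from the cable_part_number description above.
PATTERN = r"MFS1S00-H0(03|05|10)E_QP"

# Hypothetical part numbers used only to exercise the pattern.
candidates = [
    "MFS1S00-H003E_QP",
    "MFS1S00-H005E_QP",
    "MFS1S00-H010E_QP",
    "MFS1S00-H020E_QP",
    "MFS1S00-H003E",
]


def select_part_numbers(pattern, part_numbers):
    """Return the part numbers that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [pn for pn in part_numbers if compiled.fullmatch(pn)]


matched = select_part_numbers(PATTERN, candidates)
print(matched)
```

Under these assumptions, only the first three candidates are selected: the alternation (03|05|10) excludes the H020 variant, and the full match excludes the entry without the _QP suffix.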


© Copyright 2023, NVIDIA. Last updated on Mar 18, 2024.