IB Cluster Inventory
This inventory is a collection of hosts against which jobs may be launched to deploy the InfiniBand cluster.
The predefined group named ib_host_manager must contain a single host for in-band tasks.
All hosts in this inventory must have Python 3.6 or greater.
All hosts associated with the ib_host_manager group must have the following:
Python ≥ 3.6
MLNX_OFED ≥ 5.6
MFT ≥ 4.20
MFT and MLNX_OFED packages are installed using the Host Package Deployment workflow.
If this workflow is not part of your bring-up flow, make sure both packages are already installed on these hosts.
Add one or more hosts to IB Cluster Inventory.
Go to Resources > Templates.
Click the "Launch Template" icon for "AWX Inventory Host Update".
Specify the following required variables:
api_url – URL to your cluster bring-up REST API
controller_host – URL to your AWX controller instance
controller_username – username for your AWX controller instance
controller_password – password for your AWX controller instance
inventory – inventory the host(s) should be made a member of (default: IB Cluster Inventory)
hostname – a hostname or a hostname expression of the host(s) to add/remove
Click the Next button.
Click the Launch button.
Warning: Instead of the controller_username and controller_password variables, you may specify the controller_oauthtoken variable with an OAuth token for your AWX controller instance.
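As an illustration, the required variables for the "AWX Inventory Host Update" template could be written in YAML as follows. This is a minimal sketch: only the variable names come from the list above, and every value (URLs, credentials, and the host name ib-node-01) is a hypothetical placeholder.

```yaml
# Variables for the "AWX Inventory Host Update" template.
# All values are hypothetical placeholders; replace them with your own.
api_url: http://bringup.example.com/api     # cluster bring-up REST API
controller_host: https://awx.example.com    # AWX controller instance
controller_username: admin
controller_password: changeme
inventory: IB Cluster Inventory             # the documented default
hostname: ib-node-01                        # a hostname expression is also accepted
```

When authenticating with an OAuth token, the controller_username and controller_password entries would be replaced by a single controller_oauthtoken entry.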
Select the Groups tab and click on a group named ib_host_manager.
Select the Hosts tab and click the "Add" button to add a new host to the group.
Select the "Add existing host" option and choose one host to be a member of the group.
You can specify variable definitions and values to be applied to all hosts in this inventory.
To define variables for the IB Cluster Inventory:
Go to Inventories > IB Cluster Inventory and select the Details tab.
Click the Edit button, which opens the Edit details dialog.
Enter variables using either JSON or YAML syntax.
Click Save when finished.
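For example, the variables entered in the Edit details dialog might look like the following YAML. The variable names are taken from the Cluster Bring-Up Web Framework Variables table; the URLs are placeholders, not real endpoints.

```yaml
---
# Inventory-wide variables, applied to all hosts in IB Cluster Inventory.
# Both URLs are hypothetical placeholders.
api_url: http://bringup.example.com/api
pypi_url: http://bringup.example.com/pypi
```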
This section describes all available variables for this inventory.
Cluster Bring-Up Web Framework Variables
Name | Description
api_url | URL to your cluster bring-up REST API
pypi_url | URL to your cluster bring-up PyPI repository
MLNX_OFED Upgrade Variables
Name | Description
ofed_package_url | URL of the MLNX_OFED package to download (default: auto-detection)
ofed_dependencies | List of all package dependencies for the MLNX_OFED package
ofed_install_options | List of optional arguments for the installation command
ofed_version | Version number of the MLNX_OFED package to install
By default, the MLNX_OFED package is downloaded from the MLNX_OFED download center. Make sure to specify the ofed_package_url and ofed_version variables when the download center is unavailable.
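A sketch of the offline case described above, assuming the package has been copied to a local mirror. The mirror URL, package file name, and version string are illustrative assumptions; use the values that match your operating system and architecture.

```yaml
# Point the MLNX_OFED upgrade at a locally hosted package instead of the
# download center. URL, file name, and version are hypothetical placeholders.
ofed_package_url: http://mirror.example.com/mlnx_ofed/MLNX_OFED_LINUX-5.6-1.0.3.3-ubuntu20.04-x86_64.tgz
ofed_version: "5.6-1.0.3.3"   # quoted so YAML keeps it as a string
```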
MFT Upgrade Variables
Name | Description
mft_package_url | URL of the MFT package to download (default: auto-detection)
mft_dependencies | List of all package dependencies for the MFT package
mft_install_options | List of optional arguments for the installation command
mft_version | Version number of the MFT package to install
By default, the MFT package is downloaded from the MFT download center. Make sure to specify the mft_package_url and mft_version variables when the download center is unavailable.
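The same offline pattern applies to MFT. In this sketch the mirror URL, file name, and version string are assumptions chosen for illustration only.

```yaml
# Install MFT from a locally hosted package (hypothetical URL and version).
mft_package_url: http://mirror.example.com/mft/mft-4.20.0-34-x86_64-deb.tgz
mft_version: "4.20.0-34"
```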
HPC-X Upgrade Variables
Name | Description
hpcx_dir | Target path for the HPC-X installation folder (default: /opt/nvidia/hpcx)
hpcx_package_url | URL of the HPC-X package to download (default: auto-detection)
hpcx_version | Version number of the HPC-X package to install
hpcx_install_once | Specify whether to install the HPC-X package via a single host. May be used to install the package on a shared directory (default: false).
ofed_version | Version number of the OFED package that is compatible with the HPC-X package (default: auto-detection). This variable is mandatory when MLNX_OFED is not installed on the host.
By default, the HPC-X package is downloaded from the HPC-X download center. Make sure to specify the hpcx_package_url and hpcx_version variables when the download center is unavailable.
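The following sketch combines these variables for a one-time installation into a shared directory. The shared path, mirror URL, and version strings are hypothetical; only the variable names and defaults come from the table above.

```yaml
# Install HPC-X once, via a single host, into a directory shared by all hosts.
hpcx_dir: /shared/nvidia/hpcx       # hypothetical shared path (default: /opt/nvidia/hpcx)
hpcx_install_once: true             # default: false
hpcx_package_url: http://mirror.example.com/hpcx/hpcx-package.tbz   # placeholder URL
hpcx_version: "2.13"                # placeholder version
ofed_version: "5.6-1.0.3.3"         # mandatory only when MLNX_OFED is not installed on the host
```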
ClusterKit Variables
Name | Description
clusterkit_hostname | Hostname expression that represents the hostname to run tests on (default: all hosts in the inventory)
clusterkit_options | List of optional arguments for the tests
MLNX-OS Upgrade Variables
Name | Description
mlnxos_image_url | URL of the NVIDIA® MLNX-OS® image to download
mlnxos_switch_hostname | Hostname expression that represents the names of switches to upgrade
mlnxos_switch_username | Username to authenticate against target switches
mlnxos_switch_password | Password to authenticate against target switches
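A possible set of values for a single switch is sketched below. The image URL, switch name, and credentials are hypothetical placeholders. A plain host name is used here because the exact syntax of a hostname expression is not described in this section.

```yaml
# MLNX-OS upgrade of one switch (all values are hypothetical placeholders).
mlnxos_image_url: http://mirror.example.com/mlnxos/image.img
mlnxos_switch_hostname: ib-switch-01
mlnxos_switch_username: admin
mlnxos_switch_password: changeme
```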
Externally Managed Switch Firmware Upgrade Variables
Name | Description
switch_fw_image_url | URL of the firmware image to download
switch_psid | PSID of the externally managed switch device to upgrade
HCA Firmware Upgrade Variables
Name | Description
hca_fw_image_url | URL of the firmware image to download
hca_psid | PSID of the HCA device to upgrade
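As a sketch, an HCA firmware upgrade could be parameterized as follows. The image URL and the PSID value are placeholders for illustration; use the PSID reported by your own device.

```yaml
# HCA firmware upgrade (hypothetical image URL and PSID).
hca_fw_image_url: http://mirror.example.com/firmware/hca-fw-image.bin
hca_psid: MT_0000000000   # placeholder, not a real device PSID
```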
Cable Firmware Upgrade (IFFU) Variables
Name | Description
iffu_auto_update | Specify whether to update all supported cables/transceivers connected to the host/switch (default: true)
iffu_fw_version | Firmware version number of the cable image to update. This variable is mandatory when the cable image is not queryable.
iffu_image_url | URL of the firmware image to download
cable_identifier | Identifier of the cable/transceiver to update (e.g., OSFP, QSFP56)
cable_part_number | Part number of the cable/transceiver to upgrade
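The sketch below targets one cable type instead of updating every supported cable. It assumes that setting iffu_auto_update to false limits the update to the cable selected by cable_identifier and cable_part_number; this section does not describe that interaction, so treat it as an assumption. The image URL, firmware version, and part number are illustrative placeholders.

```yaml
# Update a specific cable/transceiver type (all values are placeholders).
iffu_auto_update: false             # default: true (update all supported cables)
iffu_image_url: http://mirror.example.com/firmware/cable-fw-image.bin
iffu_fw_version: "1.0.0"            # mandatory when the cable image is not queryable
cable_identifier: QSFP56            # one of the identifiers named in the table
cable_part_number: EXAMPLE-PART-NUMBER
```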