IB Cluster Deployment Example
This example:
Adds 3 hosts to the IB Cluster Inventory.
Configures a single host as a member of the ib_host_manager group.
Deploys an InfiniBand cluster.
The InfiniBand cluster deployment in this example performs the following:
Updates MLNX_OFED to version number 5.6-1.0.3.3 on all hosts of this inventory.
Updates MFT to version number 4.20.0-34 on all hosts of this inventory.
Updates HPC-X to version number 2.11.0 on all hosts of this inventory.
Updates NVIDIA® MLNX-OS® to version number 3.9.3124 on 5 switches.
Updates firmware for NVIDIA® Quantum InfiniBand to version number 27.2008.3328.
Updates firmware for NVIDIA® ConnectX®-6 InfiniBand to version number 20.31.1014.
Updates firmware for NVIDIA® AOC InfiniBand HDR cables to version number 38.100.121.
Runs ClusterKit tests for 1 minute on 2 hosts of this inventory.
This example uses the following variables, shown in YAML syntax:
# Ansible parameters
ansible_python_interpreter: '/usr/bin/python3'
# Cluster bring-up WEB framework parameters
api_url: 'http://cluster-bringup:5000/api'
# UFM Telemetry Upgrade parameters
ufm_telemetry_package_url: 'http://cluster-bringup:5000/downloads/collectx-1.10.5-5968674.x86_64_el8.2-bringup.tar.gz'
# MLNX-OS switches parameters
mlnxos_switch_hostname: 'ib-switch-t[1-2],ib-switch-l[1-2],ib-switch-s1'
mlnxos_switch_username: 'admin'
mlnxos_switch_password: 'my_admin_password'
mlnxos_image_url: 'http://cluster-bringup:5000/downloads/sx_mlnx_os/sx_mlnx_os-3.9.3124/sx_mlnx_os-3.9.3124-X86_64/image-X86_64-3.9.3124.img'
# IB cables firmware update (IFFU) parameters
iffu_image_url: 'http://cluster-bringup:5000/downloads/linkx/rel-38_100_121/iffu/hercules2.bin'
iffu_auto_update: true
cable_part_number: 'MFS1S00-H0(03|05|10)E_QP'
iffu_fw_version: '38.100.121'
# IB externally managed switch firmware update parameters
switch_fw_image_url: 'https://www.mellanox.com/downloads/firmware/fw-Quantum-rel-27_2008_3328-MQM8790-HS2X_Ax.bin.zip'
switch_psid: 'MT_0000000063'
# HCA firmware update parameters
hca_fw_image_url: 'https://www.mellanox.com/downloads/firmware/fw-ConnectX6-rel-20_31_1014-MCX654106A-HCA_Ax-UEFI-14.24.13-FlexBoot-3.6.403.bin.zip'
hca_psid: 'MT_0000000228'
# Software packages parameters
hpcx_version: '2.11.0'
ofed_version: '5.6-1.0.3.3'
ofed_dependencies: ['python3-devel', 'createrepo', 'kernel-rpm-macros', 'kernel-modules-extra']
mft_version: '4.20.0-34'
mft_dependencies: ['rpm-build']
# ClusterKit parameters
clusterkit_hostname: 'ib-node-0[1-2]'
clusterkit_options: ['--traffic', '1']
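The bracketed ranges in mlnxos_switch_hostname and clusterkit_hostname can be checked against the counts given in the task list (5 switches, 2 hosts). The following is a minimal sketch of such a check. It assumes that a pattern like [1-2] expands to every number in the inclusive range and that commas separate independent patterns. The expand_hosts helper is hypothetical and is not the parser used by the bring-up framework.

```python
import re

# Hypothetical helper for illustration only; the bring-up framework's own
# pattern parser may behave differently.
_RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_hosts(pattern: str) -> list:
    """Expand comma-separated host patterns with one bracketed numeric range each."""
    hosts = []
    for item in pattern.split(","):
        item = item.strip()
        match = _RANGE.search(item)
        if match is None:
            # No range in this item, so it is already a plain hostname.
            hosts.append(item)
            continue
        start, end = match.group(1), match.group(2)
        width = len(start)  # preserve zero padding of the range bounds, if any
        prefix, suffix = item[:match.start()], item[match.end():]
        for number in range(int(start), int(end) + 1):
            hosts.append(prefix + str(number).zfill(width) + suffix)
    return hosts


# Values copied from the example variables.
switches = expand_hosts("ib-switch-t[1-2],ib-switch-l[1-2],ib-switch-s1")
clusterkit_hosts = expand_hosts("ib-node-0[1-2]")

print(len(switches), switches)
print(len(clusterkit_hosts), clusterkit_hosts)
```

Under these assumptions the switch pattern yields five names and the ClusterKit pattern yields ib-node-01 and ib-node-02. The sketch handles only one range per comma-separated item and does not support commas inside brackets.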
To configure the hosts for this inventory:
Add the hosts ib-node-01, ib-node-02, and ib-node-05 to the IB Cluster Inventory.
Verify the job output.
See the added hosts by going to Inventories > IB Cluster Inventory and selecting the Hosts tab.
Select the Groups tab and click the group named ib_host_manager.
Select the Hosts tab and click the Add button to add a new host to the group.
Select the "Add existing host" option and mark one of the hosts to be a member of the group.
Click Save when finished.
Once the host is added successfully, it is a member of the ib_host_manager group.
If the host is not yet a member of this inventory, select the "Add new host" option instead of the "Add existing host" option.
To configure the variables for this inventory:
Go to Inventories > IB Cluster Inventory and select the Details tab.
Click the Edit icon, which opens the "Edit details" dialog.
Enter variables using either JSON or YAML syntax.
Click Save when finished.
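Because the dialog accepts either syntax, the same variables can be entered as JSON instead of YAML. The sketch below renders a subset of the example variables as JSON using Python's standard json module. The choice of subset is arbitrary and only illustrates how strings, lists, and booleans map between the two syntaxes.

```python
import json

# Subset of the example variables, with values copied from the YAML listing.
variables = {
    "ansible_python_interpreter": "/usr/bin/python3",
    "api_url": "http://cluster-bringup:5000/api",
    "hpcx_version": "2.11.0",
    "ofed_version": "5.6-1.0.3.3",
    "mft_version": "4.20.0-34",
    "mft_dependencies": ["rpm-build"],
    "iffu_auto_update": True,  # YAML `true` becomes JSON `true`
    "clusterkit_hostname": "ib-node-0[1-2]",
    "clusterkit_options": ["--traffic", "1"],
}

# Produce text that can be pasted into the variables field as JSON.
json_text = json.dumps(variables, indent=2)
print(json_text)
```

Version strings such as 2.11.0 stay quoted in both syntaxes so that they are treated as strings rather than numbers.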
To deploy the InfiniBand cluster:
Go to Resources > Templates > IB Cluster Bring-Up.
Click the Launch button.
Once the job is completed successfully, the output of the job should look like this: