IB Cluster Deployment Example

This example:

  1. Adds 3 hosts to the IB Cluster Inventory.

  2. Configures a single host to be a member of the ib_host_manager group.

  3. Deploys an InfiniBand cluster.

The InfiniBand cluster deployment in this example performs the following:

  1. Updates MLNX_OFED to version number 5.6-1.0.3.3 on all hosts of this inventory.

  2. Updates MFT to version number 4.20.0-34 on all hosts of this inventory.

  3. Updates HPC-X to version number 2.11.0 on all hosts of this inventory.

  4. Updates NVIDIA® MLNX-OS® to version number 3.9.3124 on 5 switches.

  5. Updates firmware for NVIDIA® Quantum InfiniBand to version number 27.2008.3328.

  6. Updates firmware for NVIDIA® ConnectX®-6 InfiniBand to version number 20.31.1014.

  7. Updates firmware for NVIDIA® AOC InfiniBand HDR cables to version number 38.100.121.

  8. Runs ClusterKit tests for 1 minute on 2 hosts of this inventory.

This example uses the following variables, shown in YAML syntax:


# Ansible parameters
ansible_python_interpreter: '/usr/bin/python3'

# Cluster bring-up WEB framework parameters
api_url: 'http://cluster-bringup:5000/api'

# UFM Telemetry Upgrade parameters
ufm_telemetry_package_url: 'http://cluster-bringup:5000/downloads/collectx-1.10.5-5968674.x86_64_el8.2-bringup.tar.gz'

# MLNX-OS switches parameters
mlnxos_switch_hostname: 'ib-switch-t[1-2],ib-switch-l[1-2],ib-switch-s1'
mlnxos_switch_username: 'admin'
mlnxos_switch_password: 'my_admin_password'
mlnxos_image_url: 'http://cluster-bringup:5000/downloads/sx_mlnx_os/sx_mlnx_os-3.9.3124/sx_mlnx_os-3.9.3124-X86_64/image-X86_64-3.9.3124.img'

# IB cables firmware update (IFFU) parameters
iffu_image_url: 'http://cluster-bringup:5000/downloads/linkx/rel-38_100_121/iffu/hercules2.bin'
iffu_auto_update: true
cable_part_number: 'MFS1S00-H0(03|05|10)E_QP'

# IB externally managed switch firmware update parameters
switch_fw_image_url: 'https://www.mellanox.com/downloads/firmware/fw-Quantum-rel-27_2008_3328-MQM8790-HS2X_Ax.bin.zip'
switch_psid: 'MT_0000000063'

# IB HCA firmware update parameters
hca_fw_image_url: 'https://www.mellanox.com/downloads/firmware/fw-ConnectX6-rel-20_31_1014-MCX654106A-HCA_Ax-UEFI-14.24.13-FlexBoot-3.6.403.bin.zip'
hca_psid: 'MT_0000000228'

# Software packages parameters
hpcx_version: '2.11.0'
ofed_version: '5.6-1.0.3.3'
ofed_dependencies: ['python3-devel', 'createrepo', 'kernel-rpm-macros', 'kernel-modules-extra']
mft_version: '4.20.0-34'
mft_dependencies: ['rpm-build']

# ClusterKit parameters
clusterkit_hostname: 'ib-node-0[1-2]'
clusterkit_options: ['--traffic', '1']
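The mlnxos_switch_hostname and clusterkit_hostname values use a bracketed numeric range notation. The Python sketch below illustrates how such a value could be read, assuming each [N-M] stands for every integer from N to M and commas separate independent patterns. The functions expand_pattern and expand_hosts are hypothetical helpers written for this illustration; they are not part of the cluster bring-up framework, whose own expansion logic may differ.

```python
import re
from itertools import product

# Matches a bracketed numeric range such as "[1-2]" (assumed notation).
RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_pattern(pattern):
    """Expand one pattern such as 'ib-node-0[1-2]' into a list of names."""
    # With two capture groups, re.split returns:
    # literal, start, end, literal, start, end, ..., literal
    parts = RANGE.split(pattern)
    literals = parts[0::3]
    ranges = [
        [str(n).zfill(len(lo)) for n in range(int(lo), int(hi) + 1)]
        for lo, hi in zip(parts[1::3], parts[2::3])
    ]
    names = []
    # A pattern without brackets yields a single empty combination,
    # so the literal name is returned unchanged.
    for combo in product(*ranges):
        name = literals[0]
        for value, literal in zip(combo, literals[1:]):
            name += value + literal
        names.append(name)
    return names


def expand_hosts(value):
    """Expand a comma-separated list of patterns into host names."""
    hosts = []
    for pattern in value.split(","):
        hosts.extend(expand_pattern(pattern.strip()))
    return hosts


switches = expand_hosts("ib-switch-t[1-2],ib-switch-l[1-2],ib-switch-s1")
clusterkit_hosts = expand_hosts("ib-node-0[1-2]")
print(len(switches), switches)
print(len(clusterkit_hosts), clusterkit_hosts)
```

Under these assumptions, the switch pattern yields the 5 switches and the ClusterKit pattern yields the 2 hosts mentioned in the deployment summary.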

To configure the hosts for this inventory:

  1. Add the ib-node-01, ib-node-02, and ib-node-05 hosts to the IB Cluster Inventory.

    add-hosts-to-ib-cluster-inventory.png

    add-hosts-to-ib-cluster-inventory-2.png

  2. Verify the job output.

    verify-output.png

  3. See the added hosts by going to Inventories > IB Cluster Inventory and selecting the Hosts tab.

    see-added-hosts.png

  4. Select the Groups tab and click the group named ib_host_manager.

    nav-to-ib_host_manager.png

  5. Select the Hosts tab and click the Add button to add a new host to the group.

    add-new-host-to-group.png

  6. Select the "Add existing host" option and mark one of the hosts to be a member of the group.

    add-existing-host.png

  7. Click Save when finished.

Once the host is successfully added, it is a member of the ib_host_manager group.

host-part-of-group.png

Warning

If the host is not yet a member of this inventory, select the "Add new host" option instead of the "Add existing host" option.

To configure the variables for this inventory:

  1. Go to Inventories > IB Cluster Inventory and select the Details tab.

    details-tab-example.png

  2. Click the Edit icon, which opens the "Edit details" dialog.

    edit-details-example.png

  3. Enter variables using either JSON or YAML syntax.

  4. Click Save when finished.
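Because the dialog accepts either syntax, the same variables can also be entered as JSON. The Python sketch below converts a subset of this example's variables into JSON text using the standard json module. The subset is chosen only to keep the illustration short; the dictionary name variables is an assumption made for this sketch.

```python
import json

# A subset of the variables from this example, written as a Python dict.
variables = {
    "ansible_python_interpreter": "/usr/bin/python3",
    "api_url": "http://cluster-bringup:5000/api",
    "iffu_auto_update": True,
    "hpcx_version": "2.11.0",
    "ofed_version": "5.6-1.0.3.3",
    "mft_version": "4.20.0-34",
    "mft_dependencies": ["rpm-build"],
    "clusterkit_hostname": "ib-node-0[1-2]",
    "clusterkit_options": ["--traffic", "1"],
}

# Serialize to JSON text that can be pasted into the variables field.
variables_json = json.dumps(variables, indent=2)
print(variables_json)
```

Note that the YAML boolean true maps to the JSON literal true, and that version strings stay quoted in both syntaxes so they are not parsed as numbers.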

To deploy the InfiniBand cluster:

  1. Go to Resources > Templates > IB Cluster Bring-Up.

  2. Click the Launch button.

    ib-cluster-deployment-example.png

Once the job completes successfully, its output should look like this:

example-output.png

© Copyright 2023, NVIDIA. Last updated on Aug 28, 2023.