InfiniBand Bring-up Tool v4.0.0

Supported Workflow Templates

The following subsections describe the currently supported workflow templates.

IB Cluster Bring-Up

This section describes how to deploy the InfiniBand cluster.

This procedure is a sequence of the following workflow/job templates:

  1. IB Network Deployment

  2. IB Network Verification

  3. Report Generation

These workflow/job templates are linked together to deploy the InfiniBand cluster:

  1. Deploy InfiniBand network.

  2. Verify the InfiniBand network.

  3. Generate summary report.

The following diagram shows the nodes of this workflow:

[Diagram: nodes of the workflow]

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on the "IB Cluster Bring-Up" template.

IB Network Deployment

This section describes how to deploy the InfiniBand network.

This procedure is a sequence of the following workflow and job templates:

  1. Host Package Deployment

  2. IB Network Discovery

  3. IB Switch System Alignment

  4. IB HCA Firmware Alignment

  5. IB Cable Firmware Alignment

  6. IB Network Discovery

These workflow and job templates are linked together to deploy the InfiniBand network:

  1. Ensure software packages are installed on the hosts.

  2. Discover InfiniBand topology and update the database with the discovered topology.

  3. Update system firmware/MLNX-OS software on InfiniBand switches.

  4. Update firmware on InfiniBand HCAs.

  5. Update cable transceiver firmware on InfiniBand cable devices.

  6. Discover InfiniBand topology and update the database with the discovered topology.

The following diagram shows the nodes of this workflow:

[Diagram: IB Network Deployment workflow example]

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on the "IB Network Deployment" template.

Warning

Make sure that all variables for this workflow are defined.

IB Network Verification

This section describes how to verify the InfiniBand network.

This procedure is a sequence of the following job templates:

  1. IB Topology Comparison

  2. ClusterKit

  3. IB Topology Comparison

  4. IB Fabric Health Checks

  5. Fabric Health Counters Collection

These job templates are linked together to verify the InfiniBand network:

  1. Discover InfiniBand topology and create a file with the discovered topology.

  2. Run ClusterKit tests.

  3. Discover InfiniBand topology and compare it against the previously discovered topology.

  4. Perform a diagnostic health check of the fabric's state.

  5. Collect fabric counters with and without traffic.

The following diagram shows the nodes of this workflow:

[Diagram: IB Network Verification workflow example]

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on the "IB Network Verification" template.

Host Package Deployment

This section describes how to deploy NVIDIA software packages on one or more hosts.

Refer to the official NVIDIA Software Products documentation for further information.

This procedure is a sequence of the following job templates:

  1. COT Python Alignment

  2. MLNX_OFED Upgrade

  3. MFT Upgrade

  4. HPC-X Upgrade

  5. UFM Telemetry Upgrade

These job templates are linked together to deploy NVIDIA software packages:

  1. Ensure the Python environment for the cluster orchestration tool (COT) is installed.

  2. Ensure the MLNX_OFED Linux driver is installed.

  3. Ensure the MFT is installed.

  4. Ensure the HPC-X Software Toolkit is installed.

  5. Install UFM Telemetry if a package URL is provided.

The following diagram shows the nodes of this workflow:

[Diagram: Host Package Deployment workflow example]

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on the "Host Package Deployment" template.

The following variables are available for deploying software packages:

| Name | Description |
| --- | --- |
| disable_report | Specify to skip operations required for the summary report (e.g., data collection, report generation) |
| force | Install the packages even if they are already up to date |
| hpcx_checksum | Checksum of the HPC-X package to download |
| hpcx_dir | Target path for the HPC-X installation folder |
| hpcx_install_once | Specify whether to install the HPC-X package from a single host. May be used to install the package on a shared directory. |
| hpcx_package_url | URL of the HPC-X package to download (default: auto-detection). In addition, you must specify the hpcx_version parameter or use its default value. |
| hpcx_version | Version number of the HPC-X package to install |
| mft_checksum | Checksum of the MFT package to download |
| mft_dependencies | List of all package dependencies for the MFT package |
| mft_install_options | List of optional arguments for the installation command |
| mft_package_url | URL of the MFT package to download (default: auto-detection). In addition, you must specify the mft_version parameter or use its default value. |
| mft_version | Version number of the MFT package to install |
| ofed_checksum | Checksum of the MLNX_OFED package to download |
| ofed_dependencies | List of all package dependencies for the MLNX_OFED package |
| ofed_install_options | List of optional arguments for the installation command |
| ofed_package_url | URL of the MLNX_OFED package to download (default: auto-detection). In addition, you must specify the ofed_version parameter or use its default value. |
| ofed_version | Version number of the MLNX_OFED package to install |
| working_dir | Path to the working directory on the host |

The following are variable definitions and default values for deploying software packages:

| Name | Default | Type |
| --- | --- | --- |
| disable_report | false | Boolean |
| force | false | Boolean |
| hpcx_checksum | '' | String |
| hpcx_dir | '/opt/nvidia/hpcx' | String |
| hpcx_install_once | false | Boolean |
| hpcx_package_url | '' | String |
| hpcx_version | '2.17.0' | String |
| mft_checksum | '' | String |
| mft_dependencies | [] | List[String] |
| mft_install_options | [] | List[String] |
| mft_package_url | '' | String |
| mft_version | '4.26.0-93' | String |
| ofed_checksum | '' | String |
| ofed_dependencies | [] | List[String] |
| ofed_install_options | [] | List[String] |
| ofed_package_url | '' | String |
| ofed_version | '23.10-0.5.5.0' | String |
| working_dir | '/tmp' | String |
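
As an illustration of how these variables might be combined, the following sketch overrides a few of them to install HPC-X once into a shared directory. The version values simply repeat the documented defaults. The path /shared/nvidia/hpcx is a placeholder chosen for this example, not a value from the product.

```yaml
# Illustrative job template variables for "Host Package Deployment".
# Versions repeat the documented defaults; adjust them to your environment.
ofed_version: '23.10-0.5.5.0'
mft_version: '4.26.0-93'
hpcx_version: '2.17.0'

# Assumption: /shared/nvidia/hpcx is a hypothetical shared directory.
hpcx_dir: '/shared/nvidia/hpcx'
hpcx_install_once: true   # install the HPC-X package from a single host

force: false              # do not reinstall packages that are already up to date
working_dir: '/tmp'
```

Variables left out of the definition are expected to keep the defaults listed in the table above.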

IB Cable Firmware Alignment

This section describes how to update the firmware of the transceivers on one or more cable devices.

Refer to the official NVIDIA Cable Firmware Update documentation for further information.

This procedure is a sequence of the following job templates:

  1. Lookup InfiniBand Cables

  2. IB Cable Firmware Update

These job templates are linked together to update cable transceiver firmware:

  1. Look up InfiniBand cables by a specific part number. If no part number is provided, the lookup returns all devices.

  2. Update cable transceiver firmware on the specified cable devices. If no part number is provided, all supported devices are updated.

Warning

This workflow relies on updated topology, so make sure the topology is up to date by running network discovery.

The following diagram shows the nodes of this workflow:

[Diagram: IB cable firmware upgrade workflow example]

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on the "IB Cable Firmware Alignment" template.

Warning

Make sure that all required variables described below are defined before running this job. You can define these variables either as inventory variables or as job template variables.

The following variables are required for updating cable transceiver firmware:

| Name | Description | Type |
| --- | --- | --- |
| api_url | URL to your Cluster Bring-up REST API | String |
| iffu_image_url | URL of the firmware image to download | String |
| cable_part_number | Part number of the cable/transceiver to update. Can be provided as a regular expression (e.g., 'MFS1S00-H0(03\|05\|10)E_QP'). | String |

The following is an additional variable required for updating cable transceiver firmware on hybrid products (e.g., NVIDIA AOC splitter, IB twin port HDR, OSFP-to-2xQSFP56):

| Name | Description | Type |
| --- | --- | --- |
| cable_identifier | Identifier of the cable/transceiver to update (e.g., OSFP, QSFP56) | String |
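
The required variables above could be defined along the following lines. This is a sketch only: both URLs are placeholders invented for the example, the part number expression is the one quoted in the table, and cable_identifier is shown commented out because it applies only to hybrid products.

```yaml
# Illustrative variables for "IB Cable Firmware Alignment".
# Assumption: both URLs below are placeholders, not real endpoints.
api_url: 'http://bringup.example.com/api'
iffu_image_url: 'http://repo.example.com/firmware/cable-fw.bin'

# Regular expression matching the part numbers to update
# (example expression taken from the table above).
cable_part_number: 'MFS1S00-H0(03|05|10)E_QP'

# Uncomment only for hybrid products, using the identifier of the side to update.
# cable_identifier: 'QSFP56'
```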

Warning

Cable firmware may be provided as a zip file. For this purpose, either unzip or zipinfo must be installed when using Ansible. For more information, refer to Ansible's documentation.

The following variables are available to update cable transceiver firmware:

| Name | Description |
| --- | --- |
| activate_delay | Time (in seconds) to wait before activating all updated cables |
| activate_delay_factor | Multiplying factor used to adjust the delay after loading new firmware. Its value must be greater than or equal to 1. |
| activate_image_retries | Maximum number of retries available for the activate task to complete |
| activate_image_wait | Time (in seconds) to wait for the activate task to complete |
| burn_image_retries | Maximum number of retries available for the burn task to complete |
| burn_image_wait | Time (in seconds) to wait for the burn task to complete |
| cable_part_number | Part number of the cable/transceiver to update. Can be provided as a regular expression (e.g., 'MFS1S00-H0(03\|05\|10)E_QP'). If not provided, the job runs in auto-update mode, so all supported cables/transceivers are updated. |
| clear_semaphore | Specify to clear the flash semaphore before the update starts |
| cot_executable | Path to the installed cotclient tool |
| disable_report | Specify to skip operations required for the summary report (e.g., data collection, report generation) |
| cot_python_interpreter | Path to the cluster orchestration Python interpreter |
| exclude_devices | List of GUIDs/LIDs representing the InfiniBand devices to ignore |
| exclude_ports | Port labels that represent the cable devices to ignore |
| ib_device | Specify the name of the in-band device to use (e.g., mlx5_0) |
| iffu_activate_auto_update | Specify whether to activate all updated cables/transceivers connected to the host/switch. This variable is not available when iffu_auto_update is set to true. |
| iffu_auto_update | Specify whether to update all supported cables/transceivers connected to the host/switch |
| iffu_fw_version | Firmware version number of the cable image to update. This variable is mandatory when the cable image is not queryable. |
| iffu_image_checksum | Checksum of the firmware image to download |
| max_device_ports | Limit the number of cables/transceivers to burn on each host/switch device. This variable is not available when iffu_auto_update is set to true. |
| query_image_retries | Maximum number of retries available for the query task to complete |
| query_image_wait | Time (in seconds) to wait for the query task to complete |
| stop_on_failure | Specify whether to stop the firmware update on the first failure |
| working_dir | Path to the working directory on the host |

The following are variable definitions and default values for updating cable transceiver firmware:

| Name | Default | Type |
| --- | --- | --- |
| activate_delay | 60 | Integer |
| activate_delay_factor | 2 | Decimal |
| activate_image_retries | 10 | Integer |
| activate_image_wait | 120 | Integer |
| burn_image_retries | 20 | Integer |
| burn_image_wait | 120 | Integer |
| clear_semaphore | false | Boolean |
| cot_executable | '/opt/nvidia/cot/client/bin/python' | String |
| disable_report | false | Boolean |
| exclude_devices | [] | List[String] |
| exclude_ports | [] | List[String] |
| ib_device | '' | String |
| iffu_activate_auto_update | false | Boolean |
| iffu_auto_update | true | Boolean |
| iffu_fw_version | '' | String |
| iffu_image_checksum | '' | String |
| max_device_ports | -1 | Integer |
| query_image_retries | 120 | Integer |
| query_image_wait | 10 | Integer |
| stop_on_failure | false | Boolean |
| working_dir | '/tmp' | String |

Warning

The following are the formats of port labels for each product:

  • NVIDIA Quantum-2 – <Node GUID>/P<ASIC>/<cage>/<port> (e.g., 0x900a84030040aab0/P1/3/1)

  • NVIDIA Quantum – <Node GUID>/P<port> (e.g., 0x900a84030040bbb0/P3)

  • NVIDIA® ConnectX®-6 – <Node GUID>/P1 (e.g., 0xb8cef60300ff8727/P1)

  • NVIDIA® ConnectX®-7 – <Node GUID>/P1 (e.g., 0x08c0eb0300e877c4/P1)
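
To show how these label formats might be used, the sketch below passes them to the exclude_ports variable. The GUIDs are the example values from the list above and do not refer to real devices in your fabric.

```yaml
# Illustrative exclude_ports definition using the documented label formats.
# The GUIDs are the examples from this page, not real devices.
exclude_ports:
  - '0x900a84030040aab0/P1/3/1'   # NVIDIA Quantum-2: <Node GUID>/P<ASIC>/<cage>/<port>
  - '0x900a84030040bbb0/P3'       # NVIDIA Quantum:   <Node GUID>/P<port>
  - '0xb8cef60300ff8727/P1'       # ConnectX-6:       <Node GUID>/P1
  - '0x08c0eb0300e877c4/P1'       # ConnectX-7:       <Node GUID>/P1
```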

IB HCA Firmware Alignment

This section describes how to update the firmware on one or more InfiniBand HCAs.

Refer to the official NVIDIA Firmware Downloads documentation for further information.

This procedure is a sequence of the following job templates:

  1. Lookup InfiniBand HCAs

  2. HCA Firmware Update

These job templates are linked together to update firmware on InfiniBand HCAs:

  1. Look up InfiniBand HCAs by a specific PSID.

  2. Update firmware on the specified InfiniBand HCAs.

The following shows a diagram of this workflow:

[Diagram: IB HCA firmware upgrade workflow example]

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "IB HCA Firmware Alignment".

    Warning

    HCA firmware update on the SM (subnet manager) host requires stopping the SM service before running the job.

The following variables are required to update HCA firmware:

| Name | Description | Type |
| --- | --- | --- |
| api_url | URL to your cluster bring-up REST API | String |
| hca_fw_image_url | URL of the firmware image to download | String |
| hca_psid | PSID of the HCA device to update | String |

Warning

HCA firmware may be provided as a zip file. For this purpose, either unzip or zipinfo must be installed when using Ansible. For more information, refer to Ansible's documentation.

The following variables are available for updating HCA firmware:

| Name | Description |
| --- | --- |
| burn_image_retries | Maximum number of retries available for the burn task to complete |
| burn_image_wait | Time (in seconds) to wait for the burn task to complete |
| clear_semaphore | Specify to clear the flash semaphore before the update starts |
| disable_report | Specify to skip operations required for the summary report (e.g., data collection, report generation) |
| ib_device | Specify the name of the in-band device to use (e.g., 'mlx5_0') |
| exclude_devices | List of GUIDs/LIDs representing the HCAs to ignore |
| hca_fw_image_checksum | Checksum of the firmware image to download |
| psid | Alias name for hca_psid. This variable is not available when the hca_psid variable is set. |
| query_image_retries | Maximum number of retries available for the query task to complete |
| query_image_wait | Time (in seconds) to wait for the query task to complete |
| subnet | Name of the subnet that the HCAs are members of |
| working_dir | Path to the working directory on the host |

The following are variable definitions and default values for updating HCA firmware:

| Name | Default | Type |
| --- | --- | --- |
| burn_image_retries | 10 | Integer |
| burn_image_wait | 120 | Integer |
| clear_semaphore | false | Boolean |
| disable_report | false | Boolean |
| ib_device | '' | String |
| exclude_devices | [] | List[String] |
| hca_fw_image_checksum | '' | String |
| query_image_retries | 5 | Integer |
| query_image_wait | 30 | Integer |
| subnet | 'infiniband-default' | String |
| working_dir | '/tmp' | String |

The following example shows the firmware image for NVIDIA® ConnectX®-6 VPI adapter cards on the ConnectX VPI/InfiniBand Firmware Download Center:

[Image: ConnectX-6 example]

hca_fw_image_url: 'https://www.mellanox.com/downloads/firmware/fw-ConnectX6-rel-20_31_1014-MCX654106A-HCA_Ax-UEFI-14.24.13-FlexBoot-3.6.403.bin.zip'
hca_fw_image_checksum: 'md5:8055b27dd7a3ac7ae60300a37455a7a4'
hca_psid: 'MT_0000000228'

IB Switch System Alignment

This section describes how to update system firmware/software on one or more InfiniBand switches.

This procedure is a sequence of the following job templates:

  1. IB Externally Managed Switch Firmware Alignment

  2. MLNX-OS System Alignment

These job templates are linked together to update firmware on InfiniBand switches:

  1. Update firmware on externally managed InfiniBand switches.

  2. Upgrade ASIC firmware/MLNX-OS software on InfiniBand switches.

Warning

This workflow relies on the updated topology. Therefore, make sure the topology is up-to-date by running network discovery.

The following diagram shows the nodes of this workflow:

[Diagram: IB Switch System Alignment example]

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "IB Switch System Alignment".

IB Externally Managed Switch Firmware Alignment

This section describes how to update firmware on one or more externally managed InfiniBand switches.

Refer to the official NVIDIA Firmware Downloads documentation for further information.

This procedure is a sequence of the following job templates:

  1. Lookup InfiniBand Switches

  2. IB Externally Managed Switch Firmware Update

These job templates are linked together to update firmware on InfiniBand switches:

  1. Look up externally managed InfiniBand switches by a specific PSID.

  2. Update firmware on the specified externally managed InfiniBand switches.

Warning

This workflow relies on the updated topology. Therefore, make sure the topology is up-to-date by running network discovery.

The following diagram shows the nodes of this workflow:

[Diagram: externally managed switch firmware upgrade example]

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "IB Externally Managed Switch Firmware Alignment".

Warning

Make sure all required variables described below are defined before running this job. You can define these variables either as inventory variables or as job template variables.

The following variables are required to update externally managed InfiniBand switch firmware:

| Name | Description | Type |
| --- | --- | --- |
| api_url | URL to your cluster bring-up REST API | String |
| switch_fw_image_url | URL of the firmware image to download | String |
| switch_psid | PSID of the externally managed switch device to update | String |

Warning

Switch firmware may be provided as a zip file. For this purpose, either unzip or zipinfo must be installed when using Ansible. For more information, refer to Ansible's documentation.

The following variables are available to update externally managed switch firmware:

| Name | Description |
| --- | --- |
| burn_image_retries | Maximum number of retries available for the burn task to complete |
| burn_image_wait | Time (in seconds) to wait for the burn task to complete |
| clear_semaphore | Specify to clear the flash semaphore before the update starts |
| disable_report | Specify to skip operations required for the summary report (e.g., data collection, report generation) |
| exclude_devices | List of GUIDs/LIDs representing the switches to ignore |
| ib_device | Specify the name of the in-band device to use (e.g., 'mlx5_0') |
| psid | Alias name for switch_psid. This variable is not available when the switch_psid variable is set. |
| query_image_retries | Maximum number of retries available for the query task to complete |
| query_image_wait | Time (in seconds) to wait for the query task to complete |
| subnet | Name of the subnet that the externally managed switches are members of |
| switch_fw_image_checksum | Checksum of the firmware image to download |
| working_dir | Path to the working directory on the host |

The following are variable definitions and default values for updating externally managed switch firmware:

| Name | Default | Type |
| --- | --- | --- |
| burn_image_retries | 10 | Integer |
| burn_image_wait | 120 | Integer |
| clear_semaphore | false | Boolean |
| disable_report | false | Boolean |
| exclude_devices | [] | List[String] |
| ib_device | '' | String |
| query_image_retries | 5 | Integer |
| query_image_wait | 30 | Integer |
| subnet | 'infiniband-default' | String |
| switch_fw_image_checksum | '' | String |
| working_dir | '/tmp' | String |

The following example shows the firmware image for NVIDIA Quantum-based InfiniBand switch platforms on the Quantum InfiniBand Firmware Download Center:

[Image: Quantum example]

switch_fw_image_url: 'https://www.mellanox.com/downloads/firmware/fw-Quantum-rel-27_2008_3328-MQM8790-HS2X_Ax.bin.zip'
switch_fw_image_checksum: 'md5:953dca31ed40e0a90e991b4291f0fa2d'
switch_psid: 'MT_0000000063'

MLNX-OS System Alignment

This section describes how to update system firmware/MLNX-OS software on one or more switches.

Refer to the official NVIDIA® MLNX-OS® documentation for further information.

This procedure is a sequence of the following job templates:

  1. Lookup MLNX-OS Switches

  2. MLNX-OS Upgrade

These job templates are linked together to update software on InfiniBand switches:

  1. Look up MLNX-OS switch hostnames.

  2. Update system firmware/OS software on the specified switches.

The following diagram shows the nodes of this workflow:

[Diagram: MLNX-OS upgrade workflow example]

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.

  2. Click the "Launch Template" button on "MLNX-OS System Alignment".

Warning

Make sure all required variables described below are defined before running this job. You can define these variables either as inventory variables or as job template variables.

The following variables are required to update an MLNX-OS system:

| Name | Description | Type |
| --- | --- | --- |
| api_url | URL to your cluster bring-up REST API | String |
| mlnxos_image_url | URL of the MLNX-OS image to download | String |
| switch_username | Username to authenticate against target switches | String |
| switch_password | Password to authenticate against target switches | String |
| mlnxos_switch_hostname | Hostname expression that represents the names of the switches to upgrade. To omit this parameter and rely on auto-detection of the MLNX-OS switches, NVIDIA® UFM® Telemetry is required. Make sure to run IB Network Discovery with the ufm_telemetry_path parameter. | String |

The following variables are available to update an MLNX-OS system:

| Name | Description |
| --- | --- |
| command_timeout | Time (in seconds) to wait for the command to be completed |
| cot_executable | Path to the installed cotclient tool |
| disable_report | Specify to skip operations required for the summary report (e.g., data collection, report generation) |
| force | Specify to update the MLNX-OS system even if it is already up to date |
| image_url | Alias name for mlnxos_image_url. This variable is not available when mlnxos_image_url is set. |
| mlnxos_switch_username | Alias name for switch_username. This variable is not available when switch_username is set. |
| mlnxos_switch_password | Alias name for switch_password. This variable is not available when switch_password is set. |
| reload_command | Specify an alternative command for reloading the switch system |
| reload_timeout | Time (in seconds) to wait for the switch system to be reloaded |
| remove_images | Determine whether to remove all images on disk before the system upgrade starts |

The following are variable definitions and default values to update internally managed switch software:

| Name | Default | Type |
| --- | --- | --- |
| command_timeout | 240 | Integer |
| cot_executable | '/opt/nvidia/cot/client/bin/cotclient' | String |
| disable_report | false | Boolean |
| force | false | Boolean |
| reload_command | '"reload noconfirm"' | String |
| reload_timeout | 200 | Integer |
| remove_images | false | Boolean |
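
A variable definition for this workflow might look like the following sketch. All URLs, the hostname, and the credentials are placeholders invented for the example. The syntax of the hostname expression is not described on this page, so a single plain hostname is used here. The timeout values repeat the documented defaults.

```yaml
# Illustrative variables for "MLNX-OS System Alignment".
# Assumption: URLs, hostname, and credentials are placeholders.
api_url: 'http://bringup.example.com/api'
mlnxos_image_url: 'http://repo.example.com/images/mlnx-os-image.img'
switch_username: 'admin'
switch_password: 'CHANGE_ME'          # store real secrets securely, not in plain text
mlnxos_switch_hostname: 'ib-switch-01'

# Optional settings; the timeouts repeat the documented defaults.
command_timeout: 240
reload_timeout: 200
remove_images: true                   # remove images on disk before the upgrade starts
```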

© Copyright 2023, NVIDIA. Last updated on Mar 18, 2024.