Supported Workflow Templates
The following subsections describe the currently supported workflow templates.
This section describes how to deploy the InfiniBand cluster.
This procedure is a sequence of the following workflow templates:
IB Network Deployment
IB Network Verification
These workflow templates are linked together to deploy the InfiniBand cluster:
Deploy InfiniBand network.
Verify the InfiniBand network.
The following diagram shows the nodes of this workflow:
The following instructions describe how to run this workflow:
Go to Resources > Templates.
Click the "Launch Template" button on the "IB Cluster Bring-Up".
This section describes how to deploy the InfiniBand network.
This procedure is a sequence of the following workflow and job templates:
Host Package Deployment
IB Network Discovery
IB Switch System Alignment
IB HCA Firmware Alignment
IB Cable Firmware Alignment
IB Network Discovery
These workflow templates and job templates are linked together to deploy the InfiniBand network:
Ensure software packages are installed on the hosts.
Discover InfiniBand topology and update the database with the discovered topology.
Update system firmware/MLNX-OS software on InfiniBand switches.
Update firmware on InfiniBand HCAs.
Update cable transceiver firmware on InfiniBand cable devices.
Discover InfiniBand topology and update the database with the discovered topology.
The following diagram shows the nodes of this workflow:
The following instructions describe how to run this workflow:
Go to Resources > Templates.
Click the "Launch Template" button on the "IB Network Deployment".
Make sure that all variables for this workflow are defined.
This section describes how to verify the InfiniBand network.
This procedure is a sequence of the following job templates:
IB Topology Comparison
ClusterKit
IB Topology Comparison
IB Fabric Health Checks
Fabric Health Counters Collection
These job templates are linked together to verify the InfiniBand network:
Discover InfiniBand topology and create a file with the discovered topology.
Run ClusterKit tests.
Discover InfiniBand topology and compare it against the previously discovered topology.
Perform a diagnostic health check of the fabric's state.
Collect fabric counters with and without traffic.
The following diagram shows the nodes of this workflow:
The following instructions describe how to run this workflow:
Go to Resources > Templates.
Click the "Launch Template" button on the "IB Network Verification".
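The nesting described so far (a bring-up workflow that chains a deployment workflow and a verification workflow, each of which chains further templates) can be pictured as a small tree. The following Python sketch only illustrates that ordering. The `WORKFLOWS` dictionary and the `flatten` helper are hypothetical names and not part of the product, and templates whose contents are described in later sections are treated as single steps here.

```python
# Illustrative model of the workflow nesting described in this section.
# WORKFLOWS and flatten() are hypothetical helpers, not a product API.
WORKFLOWS = {
    "IB Cluster Bring-Up": [
        "IB Network Deployment",
        "IB Network Verification",
    ],
    "IB Network Deployment": [
        "Host Package Deployment",
        "IB Network Discovery",
        "IB Switch System Alignment",
        "IB HCA Firmware Alignment",
        "IB Cable Firmware Alignment",
        "IB Network Discovery",
    ],
    "IB Network Verification": [
        "IB Topology Comparison",
        "ClusterKit",
        "IB Topology Comparison",
        "IB Fabric Health Checks",
        "Fabric Health Counters Collection",
    ],
}


def flatten(template):
    """Return the ordered list of steps reached from `template`."""
    children = WORKFLOWS.get(template)
    if children is None:
        # Not expanded in this sketch: treat the template as one step.
        return [template]
    steps = []
    for child in children:
        steps.extend(flatten(child))
    return steps


for number, step in enumerate(flatten("IB Cluster Bring-Up"), start=1):
    print(f"{number:2d}. {step}")
```

Printing the flattened list makes it easy to see that network discovery runs twice during deployment, once before and once after the firmware alignment steps.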
This section describes how to deploy NVIDIA software packages on one or more hosts.
Refer to the official NVIDIA Software Products documentation for further information.
This procedure is a sequence of the following job templates:
COT Python Alignment
MLNX_OFED Upgrade
MFT Upgrade
HPC-X Upgrade
UFM Telemetry Upgrade
These job templates are linked together to deploy NVIDIA software packages:
Ensure the Python environment for the cluster orchestration tool (COT) is installed.
Ensure the MLNX_OFED Linux driver is installed.
Ensure the MFT is installed.
Ensure the HPC-X Software Toolkit is installed.
Install UFM Telemetry if package URL is provided.
The following diagram shows the nodes of this workflow:
The following instructions describe how to run this workflow:
Go to Resources > Templates.
Click the "Launch Template" button on the "Host Package Deployment".
The following variables are available for deploying software packages:
Name | Description |
force | Install the packages even if they are already up to date |
hpcx_checksum | Checksum of the HPC-X package to download |
hpcx_dir | Target path for the HPC-X installation folder |
hpcx_install_once | Specify whether to install the HPC-X package from a single host only. May be used to install the package on a shared directory. |
hpcx_package_url | URL of the HPC-X package to download (default: auto-detection). In addition, you must specify the hpcx_version parameter or use its default value. |
hpcx_version | Version number of the HPC-X package to install |
mft_checksum | Checksum of the MFT package to download |
mft_dependencies | List of all package dependencies for the MFT package |
mft_install_options | List of optional arguments for the installation command |
mft_package_url | URL of the MFT package to download (default: auto-detection). In addition, you must specify the mft_version parameter or use its default value. |
mft_version | Version number of the MFT package to install |
ofed_checksum | Checksum of the MLNX_OFED package to download |
ofed_dependencies | List of all package dependencies for the MLNX_OFED package |
ofed_install_options | List of optional arguments for the installation command |
ofed_package_url | URL of the MLNX_OFED package to download (default: auto-detection). In addition, you must specify the ofed_version parameter or use its default value. |
ofed_version | Version number of the MLNX_OFED package to install |
working_dir | Path to the working directory on the host |
The following are variable definitions and default values for deploying software packages:
Name | Default | Type |
force | false | Boolean |
hpcx_checksum | '' | String |
hpcx_dir | '/opt/nvidia/hpcx' | String |
hpcx_install_once | false | Boolean |
hpcx_package_url | '' | String |
hpcx_version | '2.15.0' | String |
mft_checksum | '' | String |
mft_dependencies | [] | List[String] |
mft_install_options | [] | List[String] |
mft_package_url | '' | String |
mft_version | '4.24.0-72' | String |
ofed_checksum | '' | String |
ofed_dependencies | [] | List[String] |
ofed_install_options | [] | List[String] |
ofed_package_url | '' | String |
ofed_version | '23.04-0.5.3.3' | String |
working_dir | '/tmp' | String |
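When preparing overrides for this template, it can help to check them against the documented names and types before launching. The following Python sketch is one possible way to do that. The default values are copied from the table above, while the `build_vars` helper and the `/shared/hpcx` path in the usage note are hypothetical and only serve as an illustration. The check covers the top-level type only, not the element types of lists.

```python
import copy

# Defaults copied from the table above; the helper itself is hypothetical.
DEFAULTS = {
    "force": False,
    "hpcx_checksum": "",
    "hpcx_dir": "/opt/nvidia/hpcx",
    "hpcx_install_once": False,
    "hpcx_package_url": "",
    "hpcx_version": "2.15.0",
    "mft_checksum": "",
    "mft_dependencies": [],
    "mft_install_options": [],
    "mft_package_url": "",
    "mft_version": "4.24.0-72",
    "ofed_checksum": "",
    "ofed_dependencies": [],
    "ofed_install_options": [],
    "ofed_package_url": "",
    "ofed_version": "23.04-0.5.3.3",
    "working_dir": "/tmp",
}


def build_vars(overrides):
    """Merge overrides into the defaults, rejecting unknown names and wrong types."""
    merged = copy.deepcopy(DEFAULTS)
    for name, value in overrides.items():
        if name not in DEFAULTS:
            raise KeyError(f"unknown variable: {name}")
        expected = type(DEFAULTS[name])
        if not isinstance(value, expected):
            raise TypeError(f"{name} must be of type {expected.__name__}")
        merged[name] = value
    return merged
```

For example, `build_vars({"hpcx_install_once": True, "hpcx_dir": "/shared/hpcx"})` returns the full variable set with only those two values changed, whereas a misspelled name such as `hpcx_versio` raises `KeyError` instead of being silently ignored.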
This section describes how to update the firmware of the transceivers on one or more cable devices.
Refer to the official NVIDIA Cable Firmware Update documentation for further information.
This procedure is a sequence of the following job templates:
Lookup InfiniBand Cables
IB Cable Firmware Update
These job templates are linked together to update cable transceiver firmware:
Look up InfiniBand cables by a specific part number.
Update cable transceiver firmware on the specified cable devices.
This workflow relies on updated topology, so make sure the topology is up to date by running network discovery.
The following diagram shows the nodes of this workflow:
The following instructions describe how to run this workflow:
Go to Resources > Templates.
Click the "Launch Template" button on the "IB Cable Firmware Alignment".
Make sure that all required variables described below are defined before running this job. You can define these variables either as inventory variables or as job template variables.
The following variables are required for updating cable transceiver firmware:
Name | Description | Type |
api_url | URL to your Cluster Bring-up REST API | String |
iffu_image_url | URL of the firmware image to download | String |
cable_part_number | Part number of the cable/transceiver to update. Can be provided as a regular expression (e.g., 'MFS1S00-H0(03|05|10)E_QP'). | String |
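To see what a regular expression in cable_part_number would select, the example pattern from the table above can be tried against a few part numbers. This is a sketch under assumptions: the candidate part numbers are hypothetical test strings, and the use of a full match (rather than a substring search) is a guess about how the lookup applies the expression.

```python
import re

# The pattern is the example given for cable_part_number.
pattern = re.compile(r"MFS1S00-H0(03|05|10)E_QP")

# Hypothetical part numbers, used only to show what the pattern selects.
candidates = [
    "MFS1S00-H003E_QP",
    "MFS1S00-H005E_QP",
    "MFS1S00-H010E_QP",
    "MFS1S00-H020E_QP",
]

# fullmatch() is an assumption about how the lookup applies the expression.
selected = [part for part in candidates if pattern.fullmatch(part)]
print(selected)
```

Only the first three candidates are selected, because the alternation `(03|05|10)` does not cover `20`.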
The following is an additional variable required for updating cable transceiver firmware on hybrid products (e.g., NVIDIA AOC splitter, IB twin port HDR, OSFP-to-2xQSFP56):
Name | Description | Type |
cable_identifier | Identifier of the cable/transceiver to update (e.g., OSFP, QSFP56) | String |
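Because the job expects all required variables to be defined, a quick pre-launch check can catch omissions early. The following Python sketch lists the required names from the two tables above. The `missing_variables` helper and the example.com URLs are hypothetical placeholders, not values taken from the product.

```python
# Required names as listed in the tables above.
REQUIRED = ("api_url", "iffu_image_url", "cable_part_number")
HYBRID_REQUIRED = ("cable_identifier",)


def missing_variables(variables, hybrid=False):
    """Return the required names that are absent or empty in `variables`."""
    names = REQUIRED + (HYBRID_REQUIRED if hybrid else ())
    return [name for name in names if not variables.get(name)]


# Placeholder values for illustration only.
variables = {
    "api_url": "https://cot.example.com/api",
    "iffu_image_url": "https://files.example.com/cable-fw.zip",
    "cable_part_number": "MFS1S00-H0(03|05|10)E_QP",
}

print(missing_variables(variables))               # regular cables
print(missing_variables(variables, hybrid=True))  # hybrid products
```

With these values, the check passes for regular cables and reports `cable_identifier` as missing for hybrid products.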
Cable firmware may be provided as a zip file. For this purpose, either unzip or zipinfo must be installed when using Ansible. For more information, refer to Ansible's documentation.
The following variables are available to update cable transceiver firmware:
Name | Description |
activate_delay | Time (in seconds) to wait before activating all updated cables |
activate_delay_factor | Multiplying factor used to adjust the delay after loading new firmware. Its value must be greater than or equal to 1. |
activate_image_retries | Maximum number of retries available for the activate task to complete |
activate_image_wait | Time (in seconds) to wait for the activate task to complete |
burn_image_retries | Maximum number of retries available for the burn task to complete |
burn_image_wait | Time (in seconds) to wait for the burn task to complete |
clear_semaphore | Specify whether to clear the flash semaphore before the update starts |
cot_python_interpreter | Path to the cluster orchestration Python interpreter |
exclude_devices | List of GUIDs/LIDs representing the InfiniBand devices to ignore |
exclude_ports | Port labels that represent the cable devices to ignore |
ib_device | Name of the in-band device to use (e.g., mlx5_0) |
iffu_activate_auto_update | Specify whether to activate all updated cables/transceivers connected to the host/switch. This variable is not available when iffu_auto_update is set to true. |
iffu_auto_update | Specify whether to update all supported cables/transceivers connected to the host/switch |
iffu_fw_version | Firmware version number of the cable image to update. This variable is mandatory when the cable image is not queryable. |
iffu_image_checksum | Checksum of the firmware image to download |
max_device_ports | Limit the number of cables/transceivers to burn on each host/switch device. This variable is not available when iffu_auto_update is set to true. |
query_image_retries | Maximum number of retries available for the query task to complete |
query_image_wait | Time (in seconds) to wait for the query task to complete |
stop_on_failure | Specify whether to stop the firmware update on the first failure |
working_dir | Path to the working directory on the host |
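Two of the descriptions above carry constraints: activate_delay_factor must be at least 1, and iffu_activate_auto_update and max_device_ports are not available while iffu_auto_update is true. The following Python sketch checks a set of user-supplied overrides against those constraints. The `check_cable_options` helper is hypothetical, and treating "not available" as "should not be supplied" is an interpretation rather than documented behavior.

```python
def check_cable_options(overrides):
    """Return a list of problems found in user-supplied cable update options."""
    problems = []

    # Documented default for activate_delay_factor is 2.
    if overrides.get("activate_delay_factor", 2) < 1:
        problems.append("activate_delay_factor must be >= 1")

    # Documented default for iffu_auto_update is true.
    if overrides.get("iffu_auto_update", True):
        for name in ("iffu_activate_auto_update", "max_device_ports"):
            if name in overrides:
                problems.append(
                    f"{name} is not available when iffu_auto_update is true"
                )
    return problems


print(check_cable_options({"max_device_ports": 4}))
print(check_cable_options({"iffu_auto_update": False, "max_device_ports": 4}))
```

The first call reports a problem because iffu_auto_update keeps its default of true; the second call returns an empty list.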
The following are the variable definitions and default values for updating cable transceiver firmware:
Name | Default | Type |
activate_delay | 60 | Integer |
activate_delay_factor | 2 | Decimal |
activate_image_retries | 10 | Integer |
activate_image_wait | 120 | Integer |
burn_image_retries | 20 | Integer |
burn_image_wait | 120 | Integer |
clear_semaphore | false | Boolean |
cot_python_interpreter | '/opt/nvidia/cot/client/bin/python' | String |
exclude_devices | [] | List[String] |
exclude_ports | [] | List[String] |
ib_device | '' | String |
iffu_activate_auto_update | false | Boolean |
iffu_auto_update | true | Boolean |
iffu_fw_version | '' | String |
iffu_image_checksum | '' | String |
max_device_ports | -1 | Integer |
query_image_retries | 120 | Integer |
query_image_wait | 10 | Integer |
stop_on_failure | false | Boolean |
working_dir | '/tmp' | String |
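The retry and wait defaults can be combined into a rough upper bound on how long each task might be polled. The following Python sketch does that arithmetic under one explicit assumption: each retry waits for the full wait time, so the bound is simply retries multiplied by wait. The document does not state how the two values interact, so treat the numbers as an estimate, not as documented behavior.

```python
# (retries, wait in seconds) taken from the defaults table above.
POLLING_DEFAULTS = {
    "activate": (10, 120),
    "burn": (20, 120),
    "query": (120, 10),
}


def worst_case_seconds(retries, wait):
    """Upper bound on polling time, assuming one full wait per retry."""
    return retries * wait


budget = {
    task: worst_case_seconds(retries, wait)
    for task, (retries, wait) in POLLING_DEFAULTS.items()
}

for task, seconds in budget.items():
    print(f"{task}: up to {seconds} s ({seconds / 60:.0f} min)")
```

Under that assumption, the defaults allow up to 20 minutes for activation, 40 minutes for burning, and 20 minutes for querying.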
The following are the formats of port labels for each product:
NVIDIA Quantum-2 – <Node GUID>/P<ASIC>/<cage>/<port> (e.g., 0x900a84030040aab0/P1/3/1)
NVIDIA Quantum – <Node GUID>/P<port> (e.g., 0x900a84030040bbb0/P3)
NVIDIA® ConnectX®-6 – <Node GUID>/P1 (e.g., 0xb8cef60300ff8727/P1)
NVIDIA® ConnectX®-7 – <Node GUID>/P1 (e.g., 0x08c0eb0300e877c4/P1)
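The label formats above are regular enough to be parsed with two patterns, which can be useful when building an exclude_ports list. The following Python sketch is an illustration only: the `parse_port_label` helper and the field names `guid`, `asic`, `cage`, and `port` are hypothetical, chosen to mirror the placeholders in the formats.

```python
import re

# <Node GUID>/P<ASIC>/<cage>/<port> (NVIDIA Quantum-2 format)
QUANTUM2_LABEL = re.compile(
    r"^(?P<guid>0x[0-9a-fA-F]+)/P(?P<asic>\d+)/(?P<cage>\d+)/(?P<port>\d+)$"
)
# <Node GUID>/P<port> (NVIDIA Quantum and ConnectX formats)
SIMPLE_LABEL = re.compile(r"^(?P<guid>0x[0-9a-fA-F]+)/P(?P<port>\d+)$")


def parse_port_label(label):
    """Split a port label into its fields; numeric fields become integers."""
    for pattern in (QUANTUM2_LABEL, SIMPLE_LABEL):
        match = pattern.match(label)
        if match:
            return {
                key: (value if key == "guid" else int(value))
                for key, value in match.groupdict().items()
            }
    raise ValueError(f"unrecognized port label: {label}")


# The sample labels are the examples from the list above.
for sample in ("0x900a84030040aab0/P1/3/1", "0x900a84030040bbb0/P3"):
    print(sample, "->", parse_port_label(sample))
```

A label that fits neither format raises `ValueError`, which makes typos in an exclusion list visible before the job is launched.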
This section describes how to update the firmware on one or more InfiniBand HCAs.
Refer to the official NVIDIA Firmware Downloads documentation for further information.
This procedure is a sequence of the following job templates:
Lookup InfiniBand HCAs
HCA Firmware Update
These job templates are linked together to update firmware on InfiniBand HCAs:
Look up InfiniBand HCAs by a specific PSID.
Update firmware on the specified InfiniBand HCAs.
The following shows a diagram of this workflow:
The following instructions describe how to run this workflow:
Go to Resources > Templates.
Click the "Launch Template" button on "IB HCA Firmware Alignment".
Warning: HCA firmware update on the SM (subnet manager) host requires stopping the SM service before running the job.
The following variables are required to update HCA firmware:
Name | Description | Type |
api_url | URL to your cluster bring-up REST API | String |
hca_fw_image_url | URL of the firmware image to download | String |
hca_psid | PSID of the HCA device to update | String |
HCA firmware may be provided as a zip file. For this purpose, either unzip or zipinfo must be installed when using Ansible. For more information, refer to Ansible's documentation.
The following variables are available to update HCA firmware:
Name | Description |
burn_image_retries | Maximum number of retries available for the burn task to complete |
burn_image_wait | Time (in seconds) to wait for the burn task to complete |
clear_semaphore | Specify whether to clear the flash semaphore before the update starts |
ib_device | Name of the in-band device to use (e.g., 'mlx5_0') |
exclude_devices | List of GUIDs/LIDs representing the HCAs to ignore |
hca_fw_image_checksum | Checksum of the firmware image to download |
psid | Alias name for hca_psid. This variable is not available when the hca_psid variable is set. |
query_image_retries | Maximum number of retries available for the query task to complete |
query_image_wait | Time (in seconds) to wait for the query task to complete |
subnet | Name of the subnet that the HCAs are members of |
working_dir | Path to the working directory on the host |
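The alias rule for psid can be read as "the canonical name wins, and the alias is used only when the canonical name is absent". The following Python sketch expresses that reading. The `resolve_alias` helper is hypothetical, the PSID values are sample strings, and the precedence shown is an interpretation of the description above rather than confirmed behavior.

```python
def resolve_alias(variables, canonical, alias):
    """Return the value of `canonical`, falling back to `alias` if it is unset."""
    if canonical in variables:
        return variables[canonical]
    return variables.get(alias)


# Only the alias is defined: its value is used.
print(resolve_alias({"psid": "MT_0000000228"}, "hca_psid", "psid"))

# Both are defined: the canonical name takes precedence (assumed behavior).
print(resolve_alias(
    {"hca_psid": "MT_0000000228", "psid": "MT_0000000063"},
    "hca_psid",
    "psid",
))
```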
The following are the variable definitions and default values for updating HCA firmware:
Name | Default | Type |
burn_image_retries | 10 | Integer |
burn_image_wait | 120 | Integer |
clear_semaphore | false | Boolean |
ib_device | '' | String |
exclude_devices | [] | List[String] |
hca_fw_image_checksum | '' | String |
query_image_retries | 5 | Integer |
query_image_wait | 30 | Integer |
subnet | 'infiniband-default' | String |
working_dir | '/tmp' | String |
The following example shows the firmware image for NVIDIA® ConnectX®-6 VPI adapter cards on the ConnectX VPI/InfiniBand Firmware Download Center:
hca_fw_image_url: 'https://www.mellanox.com/downloads/firmware/fw-ConnectX6-rel-20_31_1014-MCX654106A-HCA_Ax-UEFI-14.24.13-FlexBoot-3.6.403.bin.zip'
hca_fw_image_checksum: 'md5:8055b27dd7a3ac7ae60300a37455a7a4'
hca_psid: 'MT_0000000228'
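The checksum in the example above uses an `<algorithm>:<hex digest>` form. Before publishing an image URL, the same kind of string can be checked locally. The following Python sketch is one way to do so with the standard `hashlib` module. The `verify_checksum` helper is hypothetical, and it is an assumption that algorithms other than md5 are accepted in the same form.

```python
import hashlib


def verify_checksum(data, expected):
    """Check `data` (bytes) against an '<algorithm>:<hex digest>' string."""
    algorithm, _, digest = expected.partition(":")
    actual = hashlib.new(algorithm, data).hexdigest()
    return actual == digest.lower()


# Self-check with the well-known MD5 digest of empty input.
print(verify_checksum(b"", "md5:d41d8cd98f00b204e9800998ecf8427e"))
```

To check a downloaded image, read the file in binary mode and pass its contents together with the value intended for hca_fw_image_checksum.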
This section describes how to update system firmware/software on one or more InfiniBand switches.
This procedure is a sequence of the following job templates:
IB Externally Managed Switch Firmware Alignment
MLNX-OS System Alignment
These job templates are linked together to update firmware on InfiniBand switches:
Update firmware on externally managed InfiniBand switches.
Upgrade ASIC firmware/MLNX-OS software on InfiniBand switches.
This workflow relies on the updated topology. Therefore, make sure the topology is up-to-date by running network discovery.
The following diagram shows the nodes of this workflow:
The following instructions describe how to run this workflow:
Go to Resources > Templates.
Click the "Launch Template" button on "IB Switch System Alignment".
This section describes how to update firmware on one or more externally managed InfiniBand switches.
Refer to the official NVIDIA Firmware Downloads documentation for further information.
This procedure is a sequence of the following job templates:
Lookup InfiniBand Switches
IB Externally Managed Switch Firmware Update
These job templates are linked together to update firmware on InfiniBand switches:
Look up externally managed InfiniBand switches by a specific PSID.
Update firmware on the specified externally managed InfiniBand switches.
This workflow relies on the updated topology. Therefore, make sure the topology is up-to-date by running network discovery.
The following diagram shows the nodes of this workflow:
The following instructions describe how to run this workflow:
Go to Resources > Templates.
Click the "Launch Template" button on "IB Externally Managed Switch Firmware Alignment".
Make sure all required variables described below are defined before running this job. You can define these variables either as inventory variables or as job template variables.
The following variables are required to update externally managed InfiniBand switch firmware:
Name | Description | Type |
api_url | URL to your cluster bring-up REST API | String |
switch_fw_image_url | URL of the firmware image to download | String |
switch_psid | PSID of the externally managed switch device to update | String |
Switch firmware may be provided as a zip file. For this purpose, either unzip or zipinfo must be installed when using Ansible. For more information, refer to Ansible's documentation.
The following variables are available to update externally managed switch firmware:
Name | Description |
burn_image_retries | Maximum number of retries available for the burn task to complete |
burn_image_wait | Time (in seconds) to wait for the burn task to complete |
clear_semaphore | Specify whether to clear the flash semaphore before the update starts |
exclude_devices | List of GUIDs/LIDs representing the switches to ignore |
ib_device | Name of the in-band device to use (e.g., 'mlx5_0') |
psid | Alias name for switch_psid. This variable is not available when the switch_psid variable is set. |
query_image_retries | Maximum number of retries available for the query task to complete |
query_image_wait | Time (in seconds) to wait for the query task to complete |
subnet | Name of the subnet that the externally managed switches are members of |
switch_fw_image_checksum | Checksum of the firmware image to download |
working_dir | Path to the working directory on the host |
The following are the variable definitions and default values for updating externally managed switch firmware:
Name | Default | Type |
burn_image_retries | 10 | Integer |
burn_image_wait | 120 | Integer |
clear_semaphore | false | Boolean |
exclude_devices | [] | List[String] |
ib_device | '' | String |
query_image_retries | 5 | Integer |
query_image_wait | 30 | Integer |
subnet | 'infiniband-default' | String |
switch_fw_image_checksum | '' | String |
working_dir | '/tmp' | String |
The following example shows the firmware image for NVIDIA Quantum-based InfiniBand switch platforms on the Quantum InfiniBand Firmware Download Center:
switch_fw_image_url: 'https://www.mellanox.com/downloads/firmware/fw-Quantum-rel-27_2008_3328-MQM8790-HS2X_Ax.bin.zip'
switch_fw_image_checksum: 'md5:953dca31ed40e0a90e991b4291f0fa2d'
switch_psid: 'MT_0000000063'
This section describes how to update system firmware/MLNX-OS software on one or more switches.
Refer to the official NVIDIA® MLNX-OS® documentation for further information.
This procedure is a sequence of the following job templates:
Lookup MLNX-OS Switches
MLNX-OS Upgrade
These job templates are linked together to update software on InfiniBand switches:
Look up the hostnames of MLNX-OS switches.
Update system firmware/OS software on the specified switches.
The following diagram shows the nodes of this workflow:
The following instructions describe how to run this workflow:
Go to Resources > Templates.
Click the "Launch Template" button on "MLNX-OS System Alignment".
Make sure all required variables described below are defined before running this job. You can define these variables either as inventory variables or as job template variables.
The following variables are required to update an MLNX-OS system:
Name | Description | Type |
api_url | URL to your cluster bring-up REST API | String |
mlnxos_image_url | URL of the MLNX-OS image to download | String |
switch_username | Username to authenticate against target switches | String |
switch_password | Password to authenticate against target switches | String |
mlnxos_switch_hostname | Hostname expression that represents the names of the switches to upgrade. This parameter can be skipped when the MLNX-OS switches are auto-detected, which requires NVIDIA® UFM® Telemetry. In that case, make sure to run IB Network Discovery with the ufm_telemetry_path parameter. | String |
The following variables are available to update a MLNX-OS system:
Name | Description |
command_timeout | Time (in seconds) to wait for the command to complete |
force | Specify whether to update the MLNX-OS system even if it is already up to date |
image_url | Alias name for mlnxos_image_url. This variable is not available when mlnxos_image_url is set. |
mlnxos_switch_username | Alias name for switch_username. This variable is not available when switch_username is set. |
mlnxos_switch_password | Alias name for switch_password. This variable is not available when switch_password is set. |
reload_command | Specify an alternative command for reloading the switch system |
reload_timeout | Time (in seconds) to wait for the switch system to be reloaded |
remove_images | Specify whether to remove all images on disk before the system upgrade starts |
The following are variable definitions and default values to update internally managed switch software:
Name | Default | Type |
command_timeout | 240 | Integer |
force | false | Boolean |
reload_command | '"reload noconfirm"' | String |
reload_timeout | 200 | Integer |
remove_images | false | Boolean |