The following subsections describe the currently supported workflow templates.

IB Cluster Bring-Up

This section describes how to deploy the InfiniBand cluster.

This procedure is a sequence of the following workflow templates:

  1. IB Network Deployment
  2. IB Network Verification

These workflow templates are linked together to deploy the InfiniBand cluster:

  1. Deploy InfiniBand network.
  2. Verify the InfiniBand network.

The following diagram shows the nodes of this workflow:

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on the "IB Cluster Bring-Up" template.

IB Network Deployment

This section describes how to deploy the InfiniBand network.

This procedure is a sequence of the following workflow and job templates:

  1. Host Package Deployment
  2. IB Network Discovery
  3. IB Switch System Alignment
  4. IB HCA Firmware Alignment
  5. IB Cable Firmware Alignment
  6. IB Network Discovery

These workflow templates and job templates are linked together to deploy the InfiniBand network:

  1. Ensure software packages are installed on the hosts.
  2. Discover InfiniBand topology and update the database with the discovered topology.
  3. Update system firmware/MLNX-OS software on InfiniBand switches.
  4. Update firmware on InfiniBand HCAs.
  5. Update cable transceiver firmware on InfiniBand cable devices.
  6. Discover InfiniBand topology and update the database with the discovered topology.

The following diagram shows the nodes of this workflow:

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on the "IB Network Deployment" template.

Make sure that all variables for this workflow are defined.

IB Network Verification

This section describes how to verify the InfiniBand network.

This procedure is a sequence of the following job templates:

  1. IB Topology Comparison
  2. ClusterKit
  3. IB Topology Comparison
  4. IB Fabric Health Checks
  5. Fabric Health Counters Collection

These job templates are linked together to verify the InfiniBand network:

  1. Discover InfiniBand topology and create a file with the discovered topology.
  2. Run ClusterKit tests.
  3. Discover InfiniBand topology and compare it against the previously discovered topology.
  4. Perform a diagnostic health check of the fabric's state.
  5. Collect fabric counters with and without traffic.

The following diagram shows the nodes of this workflow:

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on the "IB Network Verification" template.

Host Package Deployment

This section describes how to deploy NVIDIA software packages on one or more hosts.

Refer to the official NVIDIA Software Products documentation for further information.

This procedure is a sequence of the following job templates:

  1. COT Python Alignment
  2. MLNX_OFED Upgrade
  3. MFT Upgrade
  4. HPC-X Upgrade
  5. UFM Telemetry Upgrade

These job templates are linked together to deploy NVIDIA software packages:

  1. Ensure the Python environment for the cluster orchestration tool (COT) is installed.
  2. Ensure the MLNX_OFED Linux driver is installed.
  3. Ensure the MFT is installed.
  4. Ensure the HPC-X Software Toolkit is installed.
  5. Install UFM Telemetry if a package URL is provided.

The following diagram shows the nodes of this workflow:

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on the "Host Package Deployment" template.

The following variables are available for deploying software packages:

| Name | Description |
| --- | --- |
| force | Install the packages even if they are already up to date |
| hpcx_checksum | Checksum of the HPC-X package to download |
| hpcx_dir | Target path for the HPC-X installation folder |
| hpcx_install_once | Specify whether to install the HPC-X package via a single host. May be used to install the package on a shared directory. |
| hpcx_package_url | URL of the HPC-X package to download (default: auto-detection). In addition, you must specify the hpcx_version parameter or use its default value. |
| hpcx_version | Version number of the HPC-X package to install |
| mft_checksum | Checksum of the MFT package to download |
| mft_dependencies | List of all package dependencies for the MFT package |
| mft_install_options | List of optional arguments for the installation command |
| mft_package_url | URL of the MFT package to download (default: auto-detection). In addition, you must specify the mft_version parameter or use its default value. |
| mft_version | Version number of the MFT package to install |
| ofed_checksum | Checksum of the MLNX_OFED package to download |
| ofed_dependencies | List of all package dependencies for the MLNX_OFED package |
| ofed_install_options | List of optional arguments for the installation command |
| ofed_package_url | URL of the MLNX_OFED package to download (default: auto-detection). In addition, you must specify the ofed_version parameter or use its default value. |
| ofed_version | Version number of the MLNX_OFED package to install |
| working_dir | Path to the working directory on the host |

The following are variable definitions and default values for deploying software packages:

| Name | Default | Type |
| --- | --- | --- |
| force | false | Boolean |
| hpcx_checksum | '' | String |
| hpcx_dir | '/opt/nvidia/hpcx' | String |
| hpcx_install_once | false | Boolean |
| hpcx_package_url | '' | String |
| hpcx_version | '2.15.0' | String |
| mft_checksum | '' | String |
| mft_dependencies | [] | List[String] |
| mft_install_options | [] | List[String] |
| mft_package_url | '' | String |
| mft_version | '4.24.0-72' | String |
| ofed_checksum | '' | String |
| ofed_dependencies | [] | List[String] |
| ofed_install_options | [] | List[String] |
| ofed_package_url | '' | String |
| ofed_version | '23.04-0.5.3.3' | String |
| working_dir | '/tmp' | String |
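
As an illustration, the variables above could be supplied in YAML form, for example as inventory variables or job template variables. The following is a minimal sketch that keeps the documented default versions and installs HPC-X once into a shared directory. The '/shared/hpcx' path is a hypothetical value chosen for this example and does not come from this guide.

```yaml
# Hypothetical variable overrides for the "Host Package Deployment" workflow
ofed_version: '23.04-0.5.3.3'   # documented default
mft_version: '4.24.0-72'        # documented default
hpcx_version: '2.15.0'          # documented default
hpcx_dir: '/shared/hpcx'        # assumed shared path (default is '/opt/nvidia/hpcx')
hpcx_install_once: true         # install via a single host into the shared directory
force: false                    # skip packages that are already up to date
```

Leaving the *_package_url variables empty keeps the default auto-detection behavior described above.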

IB Cable Firmware Alignment

This section describes how to update the firmware of the transceivers on one or more cable devices.

Refer to the official NVIDIA Cable Firmware Update documentation for further information.

This procedure is a sequence of the following job templates:

  1. Lookup InfiniBand Cables
  2. IB Cable Firmware Update

These job templates are linked together to update cable transceiver firmware:

  1. Look up InfiniBand cables by a specific part number.
  2. Update cable transceiver firmware on the specified cable devices.

This workflow relies on updated topology, so make sure the topology is up to date by running network discovery.

The following diagram shows the nodes of this workflow:

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on the "IB Cable Firmware Alignment" template.

Make sure that all required variables described below are defined before running this job. You can define these variables either as inventory variables or as job template variables. 

The following variables are required for updating cable transceiver firmware:

| Name | Description | Type |
| --- | --- | --- |
| api_url | URL to your Cluster Bring-up REST API | String |
| iffu_image_url | URL of the firmware image to download | String |
| cable_part_number | Part number of the cable/transceiver to update. Can be provided as a regular expression. | String |

For example, cable_part_number can be set to the regular expression 'MFS1S00-H0(03|05|10)E_QP'.

The following is an additional variable required for updating cable transceiver firmware on hybrid products (e.g., NVIDIA AOC splitter, IB twin port HDR, OSFP-to-2xQSFP56):

| Name | Description | Type |
| --- | --- | --- |
| cable_identifier | Identifier of the cable/transceiver to update (e.g., OSFP, QSFP56) | String |
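
The required variables for this workflow might be defined as in the following sketch. Both URLs are placeholders invented for illustration, since the actual REST API address and firmware image location depend on your deployment. The part number expression is the one quoted in this section.

```yaml
# Hypothetical required variables for "IB Cable Firmware Alignment"
api_url: 'https://cot.example.com/api'                           # placeholder REST API URL
iffu_image_url: 'https://files.example.com/cable-firmware.zip'   # placeholder image URL
cable_part_number: 'MFS1S00-H0(03|05|10)E_QP'                    # regular expression from this section

# Needed only for hybrid products such as OSFP-to-2xQSFP56:
# cable_identifier: 'QSFP56'
```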

Cable firmware may be provided as a zip file. For this purpose, either unzip or zipinfo must be installed when using Ansible. For more information, refer to Ansible's documentation.

The following variables are available to update cable transceiver firmware:

| Name | Description |
| --- | --- |
| activate_delay | Time (in seconds) to wait before activating all updated cables |
| activate_delay_factor | Multiplying factor used to adjust the delay after loading new firmware. Its value must be greater than or equal to 1. |
| activate_image_retries | Maximum number of retries available for the activate task to complete |
| activate_image_wait | Time (in seconds) to wait for the activate task to complete |
| burn_image_retries | Maximum number of retries available for the burn task to complete |
| burn_image_wait | Time (in seconds) to wait for the burn task to complete |
| clear_semaphore | Specify whether to clear the flash semaphore before the update starts |
| cot_python_interpreter | Path to the cluster orchestration Python interpreter |
| exclude_devices | List of GUIDs/LIDs representing the InfiniBand devices to ignore |
| exclude_ports | Port labels that represent the cable devices to ignore |
| ib_device | Specify the name of the in-band device to use (e.g., mlx5_0) |
| iffu_activate_auto_update | Specify whether to activate all updated cables/transceivers connected to the host/switch. This variable is not available when iffu_auto_update is set to true. |
| iffu_auto_update | Specify whether to update all supported cables/transceivers connected to the host/switch |
| iffu_fw_version | Firmware version number of the cable image to update. This variable is mandatory when the cable image is not queryable. |
| iffu_image_checksum | Checksum of the firmware image to download |
| max_device_ports | Limit the number of cables/transceivers to burn on each host/switch device. This variable is not available when iffu_auto_update is set to true. |
| query_image_retries | Maximum number of retries available for the query task to complete |
| query_image_wait | Time (in seconds) to wait for the query task to complete |
| stop_on_failure | Specify whether to stop the firmware update on the first failure |
| working_dir | Path to the working directory on the host |

The following are the variable definitions and default values for updating cable transceiver firmware:

| Name | Default | Type |
| --- | --- | --- |
| activate_delay | 60 | Integer |
| activate_delay_factor | 2 | Decimal |
| activate_image_retries | 10 | Integer |
| activate_image_wait | 120 | Integer |
| burn_image_retries | 20 | Integer |
| burn_image_wait | 120 | Integer |
| clear_semaphore | false | Boolean |
| cot_python_interpreter | '/opt/nvidia/cot/client/bin/python' | String |
| exclude_devices | [] | List[String] |
| exclude_ports | [] | List[String] |
| ib_device | '' | String |
| iffu_activate_auto_update | false | Boolean |
| iffu_auto_update | true | Boolean |
| iffu_fw_version | '' | String |
| iffu_image_checksum | '' | String |
| max_device_ports | -1 | Integer |
| query_image_retries | 120 | Integer |
| query_image_wait | 10 | Integer |
| stop_on_failure | false | Boolean |
| working_dir | '/tmp' | String |

The following are the formats of port labels for each product:

  • NVIDIA Quantum-2 – <Node GUID>/P<ASIC>/<cage>/<port> (e.g., 0x900a84030040aab0/P1/3/1)
  • NVIDIA Quantum – <Node GUID>/P<port> (e.g., 0x900a84030040bbb0/P3)
  • NVIDIA® ConnectX®-6 – <Node GUID>/P1 (e.g., 0xb8cef60300ff8727/P1)
  • NVIDIA® ConnectX®-7 – <Node GUID>/P1 (e.g., 0x08c0eb0300e877c4/P1)
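
To show how these label formats might be used, the following sketch sets the exclude_ports variable to skip three cable devices. The labels are the example values listed above. Mixing switch and HCA labels in one list is an assumption made for illustration.

```yaml
# Hypothetical exclude_ports value built from the example port labels above
exclude_ports:
  - '0x900a84030040aab0/P1/3/1'   # NVIDIA Quantum-2: <Node GUID>/P<ASIC>/<cage>/<port>
  - '0x900a84030040bbb0/P3'       # NVIDIA Quantum: <Node GUID>/P<port>
  - '0xb8cef60300ff8727/P1'       # ConnectX-6: <Node GUID>/P1
```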

IB HCA Firmware Alignment

This section describes how to update the firmware on one or more InfiniBand HCAs.

Refer to the official NVIDIA Firmware Downloads documentation for further information.

This procedure is a sequence of the following job templates:

  1. Lookup InfiniBand HCAs
  2. HCA Firmware Update

These job templates are linked together to update firmware on InfiniBand HCAs:

  1. Look up InfiniBand HCAs by a specific PSID.
  2. Update firmware on the specified InfiniBand HCAs.

The following diagram shows the nodes of this workflow:

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on "IB HCA Firmware Alignment". 

    HCA firmware update on the SM (subnet manager) host requires stopping the SM service before running the job.

The following variables are required to update HCA firmware:

| Name | Description | Type |
| --- | --- | --- |
| api_url | URL to your cluster bring-up REST API | String |
| hca_fw_image_url | URL of the firmware image to download | String |
| hca_psid | PSID of the HCA device to update | String |

HCA firmware may be provided as a zip file. For this purpose, either unzip or zipinfo must be installed when using Ansible. For more information, refer to Ansible's documentation.

The following variables are available for updating HCA firmware:

| Name | Description |
| --- | --- |
| burn_image_retries | Maximum number of retries available for the burn task to complete |
| burn_image_wait | Time (in seconds) to wait for the burn task to complete |
| clear_semaphore | Specify whether to clear the flash semaphore before the update starts |
| ib_device | Specify the name of the in-band device to use (e.g., 'mlx5_0') |
| exclude_devices | List of GUIDs/LIDs representing the HCAs to ignore |
| hca_fw_image_checksum | Checksum of the firmware image to download |
| psid | Alias name for hca_psid. This variable is not available when the hca_psid variable is set. |
| query_image_retries | Maximum number of retries available for the query task to complete |
| query_image_wait | Time (in seconds) to wait for the query task to complete |
| subnet | Name of the subnet that the HCAs are members of |
| working_dir | Path to the working directory on the host |

The following are the variable definitions and default values for updating HCA firmware:

| Name | Default | Type |
| --- | --- | --- |
| burn_image_retries | 10 | Integer |
| burn_image_wait | 120 | Integer |
| clear_semaphore | false | Boolean |
| ib_device | '' | String |
| exclude_devices | [] | List[String] |
| hca_fw_image_checksum | '' | String |
| query_image_retries | 5 | Integer |
| query_image_wait | 30 | Integer |
| subnet | 'infiniband-default' | String |
| working_dir | '/tmp' | String |

The following example shows the firmware image for NVIDIA® ConnectX®-6 VPI adapter cards on the ConnectX VPI/InfiniBand Firmware Download Center:

hca_fw_image_url: 'https://www.mellanox.com/downloads/firmware/fw-ConnectX6-rel-20_31_1014-MCX654106A-HCA_Ax-UEFI-14.24.13-FlexBoot-3.6.403.bin.zip'
hca_fw_image_checksum: 'md5:8055b27dd7a3ac7ae60300a37455a7a4'
hca_psid: 'MT_0000000228'
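
The example above does not include the required api_url variable or any optional settings. The following sketch shows variables that might be added alongside it. The api_url value is a placeholder, and the GUID format used in exclude_devices is an assumption based on the node GUID examples elsewhere in this document.

```yaml
# Hypothetical additions to the HCA firmware example above
api_url: 'https://cot.example.com/api'   # placeholder REST API URL
subnet: 'infiniband-default'              # documented default
exclude_devices:
  - '0xb8cef60300ff8727'                  # assumed GUID format; HCA to leave untouched
query_image_retries: 5                    # documented default
query_image_wait: 30                      # documented default, in seconds
```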

IB Switch System Alignment

This section describes how to update system firmware/software on one or more InfiniBand switches.

This procedure is a sequence of the following job templates:

  1. IB Externally Managed Switch Firmware Alignment
  2. MLNX-OS System Alignment

These job templates are linked together to update firmware on InfiniBand switches:

  1. Update firmware on externally managed InfiniBand switches.
  2. Upgrade ASIC firmware/MLNX-OS software on InfiniBand switches. 

This workflow relies on the updated topology. Therefore, make sure the topology is up-to-date by running network discovery.

The following diagram shows the nodes of this workflow:

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on "IB Switch System Alignment".

IB Externally Managed Switch Firmware Alignment

This section describes how to update firmware on one or more externally managed InfiniBand switches.

Refer to the official NVIDIA Firmware Downloads documentation for further information.

This procedure is a sequence of the following job templates:

  1. Lookup InfiniBand Switches
  2. IB Externally Managed Switch Firmware Update

These job templates are linked together to update firmware on InfiniBand switches:

  1. Look up externally managed InfiniBand switches by a specific PSID.
  2. Update firmware on the specified externally managed InfiniBand switches.

This workflow relies on the updated topology. Therefore, make sure the topology is up-to-date by running network discovery.

The following diagram shows the nodes of this workflow:

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on "IB Externally Managed Switch Firmware Alignment".

Make sure all required variables described below are defined before running this job. You can define these variables either as inventory variables or as job template variables.

The following variables are required to update externally managed InfiniBand switch firmware:

| Name | Description | Type |
| --- | --- | --- |
| api_url | URL to your cluster bring-up REST API | String |
| switch_fw_image_url | URL of the firmware image to download | String |
| switch_psid | PSID of the externally managed switch device to update | String |

Switch firmware may be provided as a zip file. For this purpose, either unzip or zipinfo must be installed when using Ansible. For more information, refer to Ansible's documentation.

The following variables are available to update externally managed switch firmware:

| Name | Description |
| --- | --- |
| burn_image_retries | Maximum number of retries available for the burn task to complete |
| burn_image_wait | Time (in seconds) to wait for the burn task to complete |
| clear_semaphore | Specify whether to clear the flash semaphore before the update starts |
| exclude_devices | List of GUIDs/LIDs representing the switches to ignore |
| ib_device | Specify the name of the in-band device to use (e.g., 'mlx5_0') |
| psid | Alias name for switch_psid. This variable is not available when the switch_psid variable is set. |
| query_image_retries | Maximum number of retries available for the query task to complete |
| query_image_wait | Time (in seconds) to wait for the query task to complete |
| subnet | Name of the subnet that the externally managed switches are members of |
| switch_fw_image_checksum | Checksum of the firmware image to download |
| working_dir | Path to the working directory on the host |

The following are the variable definitions and default values for updating externally managed switch firmware:

| Name | Default | Type |
| --- | --- | --- |
| burn_image_retries | 10 | Integer |
| burn_image_wait | 120 | Integer |
| clear_semaphore | false | Boolean |
| exclude_devices | [] | List[String] |
| ib_device | '' | String |
| query_image_retries | 5 | Integer |
| query_image_wait | 30 | Integer |
| subnet | 'infiniband-default' | String |
| switch_fw_image_checksum | '' | String |
| working_dir | '/tmp' | String |

The following example shows the firmware image for NVIDIA Quantum-based InfiniBand switch platforms on the Quantum InfiniBand Firmware Download Center:

switch_fw_image_url: 'https://www.mellanox.com/downloads/firmware/fw-Quantum-rel-27_2008_3328-MQM8790-HS2X_Ax.bin.zip'
switch_fw_image_checksum: 'md5:953dca31ed40e0a90e991b4291f0fa2d'
switch_psid: 'MT_0000000063'

MLNX-OS System Alignment

This section describes how to update system firmware/MLNX-OS software on one or more switches.

Refer to the official NVIDIA® MLNX-OS® documentation for further information.

This procedure is a sequence of the following job templates:

  1. Lookup MLNX-OS Switches
  2. MLNX-OS Upgrade

These job templates are linked together to update software on InfiniBand switches:

  1. Look up the hostnames of MLNX-OS switches.
  2. Update system firmware/OS software on the specified switches.

The following diagram shows the nodes of this workflow:

The following instructions describe how to run this workflow:

  1. Go to Resources > Templates.
  2. Click the "Launch Template" button on "MLNX-OS System Alignment".

Make sure all required variables described below are defined before running this job. You can define these variables either as inventory variables or as job template variables.

The following variables are required to update an MLNX-OS system:

| Name | Description | Type |
| --- | --- | --- |
| api_url | URL to your cluster bring-up REST API | String |
| mlnxos_image_url | URL of the MLNX-OS image to download | String |
| switch_username | Username to authenticate against target switches | String |
| switch_password | Password to authenticate against target switches | String |
| mlnxos_switch_hostname | Hostname expression that represents the names of the switches to upgrade. To skip this parameter and rely on auto-detection of the MLNX-OS switches, NVIDIA® UFM® Telemetry is required. Make sure to run IB Network Discovery with the ufm_telemetry_path parameter. | String |
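
A minimal sketch of the required variables is shown below. All values are placeholders invented for illustration. A plain hostname is used for mlnxos_switch_hostname because the syntax of the hostname expression is not described in this section.

```yaml
# Hypothetical required variables for "MLNX-OS System Alignment"
api_url: 'https://cot.example.com/api'                        # placeholder REST API URL
mlnxos_image_url: 'https://files.example.com/mlnxos-image.img' # placeholder image URL
switch_username: 'admin'                                      # placeholder credentials
switch_password: 'CHANGE_ME'                                  # placeholder; store secrets securely
mlnxos_switch_hostname: 'ib-switch-01'                        # assumed single hostname
```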

The following variables are available to update a MLNX-OS system:

| Name | Description |
| --- | --- |
| command_timeout | Time (in seconds) to wait for the command to complete |
| force | Specify whether to update the MLNX-OS system even if it is already up to date |
| image_url | Alias name for mlnxos_image_url. This variable is not available when mlnxos_image_url is set. |
| mlnxos_switch_username | Alias name for switch_username. This variable is not available when switch_username is set. |
| mlnxos_switch_password | Alias name for switch_password. This variable is not available when switch_password is set. |
| reload_command | Specify an alternative command for reloading the switch system |
| reload_timeout | Time (in seconds) to wait for the switch system to reload |
| remove_images | Specify whether to remove all images on disk before the system upgrade starts |

The following are variable definitions and default values to update internally managed switch software:

| Name | Default | Type |
| --- | --- | --- |
| command_timeout | 240 | Integer |
| force | false | Boolean |
| reload_command | '"reload noconfirm"' | String |
| reload_timeout | 200 | Integer |
| remove_images | false | Boolean |
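
As a sketch of how these defaults might be overridden, the following variables force an upgrade, clear old images first, and allow more time for the reload. The 400-second reload timeout is an arbitrary value chosen for illustration, not a recommendation from this guide.

```yaml
# Hypothetical optional overrides for "MLNX-OS System Alignment"
force: true                            # upgrade even if the system is already up to date
remove_images: true                    # remove all images on disk before the upgrade
command_timeout: 240                   # documented default, in seconds
reload_timeout: 400                    # assumed value; documented default is 200
reload_command: '"reload noconfirm"'   # documented default
```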