InfiniBand Cluster Bring-up Procedure

Confirm Components' Firmware and Software Versions

This chapter covers how to read the firmware and software versions of the following components:

  • Switch ASICs

  • Transceivers

  • HCA cards

As a guideline, confirm that the versions across the cluster are aligned, or differ by no more than two versions.

Information on the recommended NDR cluster bundle can be found here.

The process can be done using the UFM GUI (recommended) or through MOFED commands.

ASICs and HCAs FW version

From the left side main menu, click on Managed Elements, and then on Devices.

The Devices page opens and displays a table with all the managed switches/hosts in the cluster.

For a switch ASIC, the FW version is listed in the main table.

For a node HCA, select its row. The Device Information section should open on the right side of the window, showing information about the selected device. If it does not open, click the left arrow at the top-right corner of the table.

Click on the HCAs tab to see the device HCAs and the FW versions.

Note

To view HCAs only, click on HCAs from the left side main menu. All connected HCAs are listed there with their FW versions.

Managed switch SW (NOS) version

Click on Network Map from the left side main menu. A visualization of the cluster should be displayed.

Select a switch. The switch information and the SW Version (NOS) should appear in the table on the left side.

Transceivers

From the Devices page, select a switch, and in the Device Information table on the right, click on the Cables tab.

The page displays a table with the connected cables and their FW versions.

Note

Alternatively, go to the Cables page from the left side main menu, which displays information on all the connected cables at once.

Prerequisite

  • Make sure you have the latest MFT installed. If not, install it either as part of the MLNX_OFED installation process or according to the instructions found here.

  • Before using MFT, start the MST driver by running mst start.
    This command creates files that represent NVIDIA devices in the /dev/mst directory.
    To list the relevant devices, run mst status.
    For further information, see the mst Service section in the MFT User Manual.
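
The prerequisite check above can be scripted. The following is a minimal sketch, assuming a POSIX shell. The function name list_mst_devices is hypothetical and not part of MFT. It lists the entries under the MST device directory and returns a non-zero status when none exist, which usually indicates that mst start has not been run yet.

```shell
#!/bin/sh
# list_mst_devices is a hypothetical helper, not an MFT command.
# Prints each entry found in the MST device directory (default: /dev/mst)
# and returns 1 if the directory is missing or empty.
list_mst_devices() {
    dir="${1:-/dev/mst}"
    [ -d "$dir" ] || return 1
    found=1
    for dev in "$dir"/*; do
        [ -e "$dev" ] || continue   # the glob did not match anything
        echo "$dev"
        found=0
    done
    return "$found"
}

# Example usage (as root, on a host with MFT installed):
#   mst start
#   list_mst_devices || echo "no MST devices found; was 'mst start' run?"
```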

Identify the Switch Firmware Version

Note

This section applies only to externally managed (unmanaged) switches. In managed systems, the ASIC firmware is bundled in the NOS.

  1. Access each unmanaged switch via its LID.

  2. To identify the switch LIDs, run ibswitches.

    root@ufmx-qnt-02: # ibswitches
    Switch 0x900a8403006ff780 ports 65 "MF0;grla-quanta-01:MQM9700/U1" enhanced port 0 lid 1 lmc 0
    Switch 0x900a8403006fe0c0 ports 65 "MF0;grla-quanta-s2:MQM9700/U1" enhanced port 0 lid 5 lmc 0
    Switch 0x900a8403006ff8c0 ports 65 "MF0;grla-quanta-s1:MQM9700/U1" enhanced port 0 lid 14 lmc 0
    Switch 0x900a8403006fe040 ports 65 "MF0;grla-quanta-02:MQM9700/U1" enhanced port 0 lid 15 lmc 0

  3. To check the firmware version, run flint -d lid-X -qq q, where X is the switch LID.

    root@ufmx-qnt-02: # flint -d lid-1 -qq q
    Image type:            FS4
    FW Version:            31.2012.3008
    FW Release Date:       3.1.2024
    Product Version:       31.2012.3008
    Rom Info:              type=UEFI version=skipped cpu=skipped
                           type=PXE version=skipped devid=skipped
                           type=NVMe version=skipped devid=skipped
    Description:           UID                GuidsNumber
    Base GUID:             900a8403006ff780        64
    Base MAC:              900a846ff780            64
    Image VSD:             N/A
    Device VSD:            N/A
    PSID:                  MT_0000000577
    Security Attributes:   secure-fw
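
The steps above can be combined to query every unmanaged switch in one pass. The sketch below assumes the ibswitches and flint output formats shown in this section. The helper names switch_lids and fw_version are hypothetical, and the loop is left as a comment because it needs fabric access.

```shell
#!/bin/sh
# Hypothetical helpers; only ibswitches and flint are real tools.

# Reads ibswitches output on stdin and prints one LID per switch line.
switch_lids() {
    sed -n 's/^Switch.* lid \([0-9][0-9]*\) lmc .*/\1/p'
}

# Reads "flint ... q" output on stdin and prints the FW Version value.
fw_version() {
    sed -n 's/^FW Version:[[:space:]]*//p'
}

# Example usage on a host with fabric access:
#   for lid in $(ibswitches | switch_lids); do
#       printf 'lid-%s %s\n' "$lid" "$(flint -d "lid-$lid" -qq q | fw_version)"
#   done
```

Printing one "LID version" pair per line keeps the result easy to sort and compare across the fabric.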

Identify the Switch Version

  1. Connect to the switch remotely over SSH: ssh admin@my-switch-name (e.g., ssh admin@172.28.3.216)

  2. Enter config mode.

    switch> enable
    switch# configure terminal
    switch (config)#

  3. Check the NOS version.

    switch (config)# show version
    Product name:      MLNX-OS
    Product release:   3.4.2002
    Build ID:          #1-dev
    Build date:        2015-07-30 20:13:19
    Target arch:       x86_64
    Target hw:         x86_64
    Built by:          jenkins@fit74
    Version summary:   X86_64 3.4.2002 2015-07-30 20:13:19 x86_64
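
When comparing many switches, it can help to reduce each show version output to the release number. The sketch below assumes the output has been saved to a text file. The file name show_version.txt and the helper name nos_release are hypothetical.

```shell
#!/bin/sh
# nos_release is a hypothetical helper.
# Reads saved "show version" output on stdin and prints the
# value of the "Product release" line.
nos_release() {
    sed -n 's/^Product release:[[:space:]]*//p'
}

# Example usage with a saved session log (file name is an assumption):
#   nos_release < show_version.txt
```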

Identify the HCA Firmware Version

  1. To identify the HCA device, run mst status.

    [root@fit229 ~]# mst status
    MST modules:
    ------------
        MST PCI module is not loaded
        MST PCI configuration module loaded

    MST devices:
    ------------
    /dev/mst/mt4129_pciconf0         - PCI configuration cycles access.
                                       domain:bus:dev.fn=0000:04:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1
                                       Chip revision is: 00

  2. Check the firmware version.

    [root@fit229 ~]# flint -d /dev/mst/mt4129_pciconf0 -qq q
    Image type:            FS4
    FW Version:            28.98.2400
    FW Release Date:       14.2.2022
    Product Version:       28.98.2400
    Rom Info:              type=UEFI version=14.25.21 cpu=AMD64,AARCH64
                           type=PXE version=3.6.502 cpu=AMD64
    Description:           UID                GuidsNumber
    Base GUID:             1070fd0300d84644        4
    Base MAC:              1070fdd84644            4
    Image VSD:             N/A
    Device VSD:            N/A
    PSID:                  MT_0000000798
    Security Attributes:   N/A

  3. For further details, see https://docs.nvidia.com/networking/display/mftv4270/Querying+the+Firmware+Image.
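
On a server with several HCAs, the two steps above can be chained. The sketch below assumes the mst status and flint output formats shown in this section. The helper names mst_device_paths and fw_and_psid are hypothetical, and the loop is left as a comment because it needs real devices.

```shell
#!/bin/sh
# Hypothetical helpers; only mst and flint are real tools.

# Reads "mst status" output on stdin and prints every /dev/mst/... path.
mst_device_paths() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\/dev\/mst\//) print $i }'
}

# Reads "flint ... q" output on stdin and prints "<FW Version> <PSID>".
fw_and_psid() {
    awk -F': *' '
        $1 == "FW Version" { fw = $2 }
        $1 == "PSID"       { psid = $2 }
        END { print fw, psid }
    '
}

# Example usage (as root, after "mst start"):
#   for dev in $(mst status | mst_device_paths); do
#       printf '%s %s\n' "$dev" "$(flint -d "$dev" -qq q | fw_and_psid)"
#   done
```

Reporting the PSID next to the version is a design choice here: firmware versions are only directly comparable between cards of the same PSID.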

Identify the Transceiver Firmware Version

To check the transceiver firmware version, run flint -d lid-1 --linkx --downstream_device_ids 1 q.

[admin@gorilla-169 ~]# flint -d lid-1 --linkx --downstream_device_ids 1 q
Host : lid-1
Device index 1
Component Index 3
Component Status NOT_PRESENT
Component Update State IDLE
Running state is :
Image A is running
Information block is :
FW image A is present
FW A Version : 46.130.0023
FW B Version : 00.00.0000
FW Factory Version : 00.00.0000
SupportedProtocol: CMIS 4.0 is implemented
Activation type: Self-activation with HW reset contained in the Run FW Image command. No additional actions required from the host.
Serial number is 0
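
The query reports two firmware images (A and B), so the relevant version is the one belonging to the running image. The sketch below assumes the output layout shown above. The helper name running_linkx_fw is hypothetical. It prints the running image letter followed by that image's version.

```shell
#!/bin/sh
# running_linkx_fw is a hypothetical helper.
# Reads "flint ... --linkx ... q" output on stdin and prints
# "<running image letter> <version of that image>".
running_linkx_fw() {
    awk '
        # Remember which image is reported as running (A or B).
        /Image [AB] is running/ {
            for (i = 1; i < NF; i++) if ($i == "Image") img = $(i + 1)
        }
        # Collect "FW A Version : x" and "FW B Version : y" lines.
        $1 == "FW" && $3 == "Version" { ver[$2] = $NF }
        END { if (img != "") print img, ver[img] }
    '
}

# Example usage on a host with fabric access:
#   flint -d lid-1 --linkx --downstream_device_ids 1 q | running_linkx_fw
```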

Identify the Driver Version

Make sure all the servers are using the latest driver version. To check the installed version, run ofed_info -s.

~ $ ofed_info -s
MLNX_OFED_LINUX-23.04-0.5.3.3
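
To see how well the servers are aligned, the collected versions can be grouped and counted. The sketch below assumes one "host version" pair per input line. The helper name version_summary and the host names in the usage comment are hypothetical.

```shell
#!/bin/sh
# version_summary is a hypothetical helper.
# Reads "<host> <version>" lines on stdin and prints
# "<count> <version>" for each distinct version, sorted by version.
version_summary() {
    awk '{ count[$2]++ } END { for (v in count) print count[v], v }' | sort -k2
}

# Example usage (host names are placeholders; assumes SSH access):
#   for h in node1 node2 node3; do
#       printf '%s %s\n' "$h" "$(ssh "$h" ofed_info -s)"
#   done | version_summary
```

A single output line means every server runs the same driver version. More lines show how far the cluster has drifted. The same helper works for any of the version lists gathered in this chapter.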


© Copyright 2024, NVIDIA. Last updated on May 28, 2024.