RMA Checklists and Support Upgrade Procedures

Checklist Before Submitting an RMA Request

Based on prior experience, a short troubleshooting session may reveal the root cause of a failure and prevent the unnecessary shipment of parts.

Please perform the steps below before submitting an RMA request:

  1. Carry out the troubleshooting steps for an early fault determination.

  2. Please mention in the "Problem Description" section the troubleshooting steps you performed.

We recommend attaching any relevant troubleshooting data (e.g., logs and screenshots) to your RMA request, should you wish to proceed. This will enable NVIDIA Support to identify the issue more quickly and save you time.

Checklists are provided below for these NVIDIA Networking product types (assets):

Network Adapter Card Checklist

  1. If the card isn’t recognized by the OS, reseat the card in the server PCI slot and verify that it is recognized by the OS.

    1. In Linux-based OS: lspci | grep -i Mellanox

    2. In Windows OS: Under “Device Manager > Other devices”, select “InfiniBand Controller” or “Unknown Devices”. If you can't find the device, click “Action > Scan for hardware changes”.

  2. Swap the card with a known working card of a similar P/N.
    Note: If the issue recurs with the known working card, this most likely indicates that the original card is not faulty.

  3. Replace the cable connected to the card with a known working cable.

  4. Connect the cable to another known working port destination.

  5. Read the port LED indications. Do the LEDs indicate a fault state?

  6. Verify that the card firmware version is up-to-date.
    For the network adapter card firmware query and upgrade procedure, please refer to Network Adapter Firmware Query and Upgrade Procedure.

  7. For the InfiniBand protocol to work, verify that the SM is running in the fabric.

  8. Verify that the driver version installed on the server is up to date.
    For the driver query and upgrade procedure, please refer to Driver Query and Upgrade Procedure.
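
The Linux-side checks in this list can be gathered into one log file to attach to the RMA request. The following is a minimal sketch, not an NVIDIA tool: the function names and the default log file name are hypothetical, and it assumes that lspci (pciutils), ibstat and sminfo (infiniband-diags), and ofed_info (installed with MLNX_OFED) are available. Any command that is missing is skipped and noted in the log.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the default log name are illustrative.

# Keep only the Mellanox lines from lspci-style text read on stdin.
filter_mellanox() {
    grep -i 'mellanox' || true   # an empty result is not treated as an error
}

# Append a header and the command's output to the log; skip missing commands.
log_command() {
    local log="$1"; shift
    {
        echo "### $*"
        if command -v "$1" >/dev/null 2>&1; then
            "$@" 2>&1 || echo "(exit status $?)"
        else
            echo "(skipped: $1 not found)"
        fi
        echo
    } >> "$log"
}

# Collect the adapter-card checks into a single log file.
collect_adapter_checks() {
    local log="${1:-rma_adapter_checks.log}"
    : > "$log"
    echo "### lspci | grep -i Mellanox" >> "$log"
    if command -v lspci >/dev/null 2>&1; then
        lspci | filter_mellanox >> "$log"      # is the card recognized by the OS?
    else
        echo "(skipped: lspci not found)" >> "$log"
    fi
    echo >> "$log"
    log_command "$log" ibstat                  # port state and firmware version
    log_command "$log" sminfo                  # is an SM answering in the fabric?
    log_command "$log" ofed_info -s            # installed driver package version
}
```

To use it, source the file and run collect_adapter_checks, optionally passing a log file name, then review the log before attaching it to the request.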

BlueField DPU Checklist

Note: Please remember to collect the logs for each step.

  1. Remove the card from the server.

    1. Make sure the PCIe slot is free from foreign objects.

    2. Perform a visual inspection to make sure there are no lifted/broken/missing components, the connector pins are straight and not pulled back, and the golden fingers are clean.

  2. Reseat the card back in the server and make sure the DPU is installed straight and fully inserted.

  3. Ensure your server has sufficient cooling.

  4. Ensure ATX power is connected for relevant OPNs.

  5. Start from the Network Adapter Card Checklist section above.

  6. Install the DPU card again (including FW) with the latest version.

    1. Install the latest *.bfb file.

    2. Check the 1 GB interface by connecting to the Arm over SSH.

    3. If the connection fails, log in to the BMC through the UART interface and run obmc-console-client # to connect to the Arm.

  7. Once connected, run:

    1. cat /etc/mlnx-release # and make sure the correct OS version is installed.

    2. bfrshlog # to make sure there are no DDR error messages.

    3. lsblk # to make sure you see emmc and ssd devices and they have the correct capacity according to the datasheet.

    4. To check ATX power for sub150W DPUs, run:

      1. modprobe mlxbf-ptm

      2. cat /sys/kernel/debug/mlxbf-ptm/monitors/status/vr1_power
        If the value is 0, ATX power does not appear to be connected.

    5. To check the BMC FW version, run:

      1. ipmitool mc info # to check that the correct firmware version is installed.

      2. Or run, from any host in the network: ipmitool -C 17 -I lanplus -H bu-fae4-bf3-bmc -U root -P <password> mc info

    6. Run bfcfg -d # to check that it is not blank.

    7. Read system temperature.

    8. Run from the host:

      1. mlxlink -d mlx5_0 --port_type pcie -e -c # to check that the PCIe link width and speed are as expected; repeat for all PCIe ports.

      2. mlxlink -d mlx5_<x> -e -c -m # to check the link width and speed is as expected; repeat for all network ports.

  8. If still not connected, check the rshim connection from the host.

    1. Set verbose mode:

      1. echo "DISPLAY_LEVEL 2" > /dev/rshim1/misc

      2. cat /dev/rshim1/misc # to make sure there are no DDR error messages.
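
The ATX power check in step 7 can be scripted so that the reading and its interpretation are logged together. This is a minimal sketch to run as root on the Arm: the function names are hypothetical, the sysfs path is the one quoted in the checklist, and treating any nonzero integer as "connected" is an assumption drawn from the statement that 0 means ATX power appears not connected.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative, not part of any NVIDIA tool.

VR1_POWER_PATH="/sys/kernel/debug/mlxbf-ptm/monitors/status/vr1_power"

# Classify a vr1_power reading.
# Assumption: the reading is a non-negative integer; anything else is "unknown".
classify_atx_power() {
    local reading="$1"
    case "$reading" in
        ''|*[!0-9]*) echo "unknown" ;;
        *)
            if [ "$reading" -eq 0 ]; then
                echo "not-connected"   # per the checklist, 0 means ATX appears not connected
            else
                echo "connected"       # assumption: any nonzero reading means power is present
            fi
            ;;
    esac
}

# Load the driver, read the monitor, and print the reading with its verdict.
check_atx_power() {
    modprobe mlxbf-ptm || return 1
    local reading
    reading="$(cat "$VR1_POWER_PATH")" || return 1
    echo "vr1_power=${reading} ATX=$(classify_atx_power "$reading")"
}
```

Running check_atx_power and saving its one-line output alongside the other logs records this step for the RMA request.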

BlueField SuperNIC Checklist

Note: Please remember to collect the logs for each step.

  1. Remove the card from the server.

    1. Make sure the PCIe slot is free from foreign objects.

    2. Perform a visual inspection to make sure there are no lifted/broken/missing components, the connector pins are straight and not pulled back, and the golden fingers are clean.

  2. Reseat the card back in the server and make sure the DPU is installed straight and fully inserted.

  3. Ensure your server has sufficient cooling.

  4. Ensure ATX power is connected for relevant OPNs.

  5. Start from the Network Adapter Card Checklist section above.

  6. Install the DPU card again (including FW) with the latest version.

    1. Install the latest *.bfb file.

  7. Once connected:

    1. Read system temperature.

    2. Run from the host:

      1. mlxlink -d mlx5_0 --port_type pcie -e -c # to check that the PCIe link width and speed are as expected; repeat for all PCIe ports.

      2. mlxlink -d mlx5_<x> -e -c -m # to check the link width and speed is as expected; repeat for all network ports.

  8. If still not connected, check the rshim connection from the host.

    1. Set verbose mode:

      1. echo "DISPLAY_LEVEL 2" > /dev/rshim1/misc

      2. cat /dev/rshim1/misc # to make sure there are no DDR error messages.
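
The host-side steps above ("repeat for all network ports" and the rshim verbose dump) lend themselves to a small script. The sketch below makes several assumptions: the function names and output file names are hypothetical, the mlx5_<x> devices are assumed to be listed under /sys/class/infiniband, and the rshim device is assumed to be /dev/rshim1 as in the checklist (the index may differ on your host). It needs root privileges and the mlxlink tool.

```shell
#!/usr/bin/env bash
# Sketch only: function names and default file names are illustrative.

# List mlx5_* device names found under an InfiniBand sysfs class directory.
list_mlx5_devices() {
    local class_dir="${1:-/sys/class/infiniband}"
    local path
    for path in "$class_dir"/mlx5_*; do
        [ -e "$path" ] && basename "$path"
    done
    return 0
}

# Run the network-port mlxlink query from the checklist on every device found.
query_all_network_ports() {
    local out="${1:-mlxlink_ports.log}"
    local dev
    : > "$out"
    for dev in $(list_mlx5_devices); do
        echo "### mlxlink -d $dev -e -c -m" >> "$out"
        mlxlink -d "$dev" -e -c -m >> "$out" 2>&1
    done
}

# Set verbose mode on an rshim device and save its misc output for review.
capture_rshim_misc() {
    local rshim_dir="${1:-/dev/rshim1}"
    local out="${2:-rshim_misc.log}"
    echo "DISPLAY_LEVEL 2" > "$rshim_dir/misc" || return 1
    cat "$rshim_dir/misc" > "$out"
}
```

After running query_all_network_ports and capture_rshim_misc, inspect the two log files for unexpected link width or speed and for DDR error messages.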

Switch Power Supply and Fan FRU (Field Replaceable Unit) Checklist

  1. Reseat the FRU module.

  2. Swap the FRU module with a known working FRU.
    Note: If the issue recurs with the known working FRU, this would most likely indicate that the FRU slot is faulty rather than the FRU.

  3. Read the Power supply/Fan LED indication. Does the LED indicate a fault state?

  4. Verify that the FRU module is recognized by the switch’s OS (Mellanox Onyx or MLNX-OS) by running the following commands:

    show inventory
    show module

Note: Before submitting an RMA request, we recommend adding the top serial numbers of the chassis/switch. This can save time in identifying the asset.

Remotely-managed (Unmanaged) Switch Checklist

  1. Verify that the firmware version of the remotely-managed switch is up to date.

    For the remotely-managed switch firmware query and upgrade procedure, please refer to Remotely-managed Switch Firmware Query and Upgrade Procedure.

  2. If you encounter an issue bringing up ports, please perform the following:

    a. Replace the connected cable(s) with known working cable(s).
    b. Connect the cable(s) to other known working port(s).
    c. Perform a loopback test by connecting the faulty port(s) to another known working port in the same leaf.
    d. Read the port LED indications. Do the LEDs indicate a fault state?

  3. Check the switch’s status LED indications. Do they indicate a fault state?

  4. For the InfiniBand protocol to work, verify that the SM is running in the fabric.

Managed Switch Checklist

  1. Verify that the managed switch’s software and firmware versions are up to date.

    The firmware version is automatically upgraded during the upgrade of the software.
    For the managed switch’s firmware query and upgrade procedure, please refer to Managed Switch Software Query and Upgrade Procedure.

  2. If you encounter an issue bringing up ports, please perform the following:

    a. Replace the connected cable(s) with known working cable(s).
    b. Connect the cable(s) to other known working port(s).
    c. Perform a loopback test by connecting the faulty port(s) to another known working port in the same leaf.
    d. Read the port LED indications. Do the LEDs indicate a fault state?

  3. For the InfiniBand protocol to work, verify that the SM is running in the fabric.

  4. Check the switch’s status LED indications. Do they indicate a fault state?

  5. Create the switch system dump file.
    To create the switch system dump file, please refer to Creating the Switch System Dump File.

Note: We recommend attaching the Sysdump file to your RMA request. This can help us identify the issue much faster and save you time.

Leaf/Spine Module Checklist

  1. The leaf/spine is installed in a modular managed switch. Please verify that the managed switch software and firmware versions are up-to-date.

    The firmware version is automatically upgraded during the upgrade of the software.
    For the managed switch’s firmware query and upgrade procedure, please refer to Managed Switch Software Query and Upgrade Procedure.

  2. To query the state of the modules via the OS (MLNX-OS, Mellanox Onyx), run the “show inventory” and “show module” commands.

  3. If you encounter an issue bringing up internal ports, please perform the following:

    a. Reseat the leaf/spine module and the corresponding spine/leaf module, respectively.
    b. Swap the leaf/spine module and the corresponding spine/leaf module, respectively.
    On each internal link, the failing part (leaf, spine, or backplane) must be isolated. Swapping the relevant leaf and spine will pinpoint the part that is causing the issue. Normally, the issue moves with the faulty part.

  4. If you encounter an issue bringing up external ports, please perform the following:

    a. Replace the connected cable(s) with known working cable(s).
    b. Connect the cable(s) to other known working port(s).
    c. Perform a loopback test by connecting the faulty port(s) to another known working port in the same leaf.
    d. Read the port LED indications. Do the LEDs indicate a fault state?

  5. For the InfiniBand protocol to work, verify that the SM is running in the fabric.

  6. Refer to the switch’s status LEDs indications. Is it in a faulty state?

  7. Create the switch system dump file.
    To create the switch system dump file, please refer to Creating the Switch System Dump File.

Note: We recommend attaching the Sysdump file to your RMA request. This can help us identify the issue much faster and save you time.
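
The swap test for internal links in step 3 amounts to a small decision table, sketched below. The function name and the yes/no inputs are hypothetical. Concluding "backplane" when the issue follows neither module is an inference from the list of candidate parts (leaf, spine, or backplane), not a documented rule, so treat the result as a hint for the problem description rather than a verdict.

```shell
#!/usr/bin/env bash
# Sketch only: records the outcome of the leaf/spine swap test.
# Usage: isolate_internal_link_fault <follows_leaf: yes|no> <follows_spine: yes|no>
#   follows_leaf  - after moving the leaf to another slot, did the issue move with it?
#   follows_spine - after moving the spine to another slot, did the issue move with it?
isolate_internal_link_fault() {
    local follows_leaf="$1"
    local follows_spine="$2"
    if [ "$follows_leaf" = "yes" ] && [ "$follows_spine" = "no" ]; then
        echo "leaf"            # the issue moved with the leaf module
    elif [ "$follows_leaf" = "no" ] && [ "$follows_spine" = "yes" ]; then
        echo "spine"           # the issue moved with the spine module
    elif [ "$follows_leaf" = "no" ] && [ "$follows_spine" = "no" ]; then
        echo "backplane"       # assumption: the issue stayed with the slots
    else
        echo "inconclusive"    # both or invalid answers: repeat the swaps one at a time
    fi
}
```

For example, isolate_internal_link_fault yes no prints leaf, which can be quoted in the "Problem Description" section together with the slots that were swapped.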

Cable Checklist

  1. If you encounter an issue with link bring-up or link errors, verify that the firmware versions of the connected devices (HCAs and/or switches) are up to date.

    For the HCA firmware query and upgrade procedure, please refer to Network Adapter Firmware Query and Upgrade Procedure.
    For the remotely-managed switch firmware query and upgrade procedure, please refer to Remotely-managed Switch Firmware Query and Upgrade Procedure.
    For the managed switch firmware query and upgrade procedure, please refer to Managed Switch Software Query and Upgrade Procedure.

  2. Reseat the cable on both ends.

  3. Connect the cable to another known working port. Repeat the test on both ends of the cable.

  4. Replace the cable with a known working cable of a similar P/N.

    Note: If the issue recurs with the replacement cable, this most likely indicates that the original cable is not faulty.

© Copyright 2023, NVIDIA. Last updated on Jan 8, 2024.