Cluster Verification

InfiniBand Cluster Bring-up Procedure

This chapter describes the procedure to run during the final stages of InfiniBand cluster bring-up and initialization, just before the cluster becomes operational. It covers the files and logs to review and keep as a reference when the cluster is signed off from the build phase to the operation phase, and again after a UFM, OpenSM, or firmware upgrade.

Please adhere to the following steps:

  1. Monitor the SM logs:

    Look for errors.

    Verify that the subnet is up.

  2. Run UFM Fabric Health.

    Check the output summary.

    Check the port counters.

    Check the nodes information.

    Check the errors on the links.

  3. UFM Telemetry - collects unique counters for each port in the InfiniBand fabric to help ensure the fabric runs efficiently.

  4. UFM Events and Alarms - lets you identify problems, including port and device connectivity issues.
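
The SM log review in step 1 can be partly automated. The following is a minimal Python sketch, not an official UFM tool. It assumes that error lines in the SM log carry an `ERR` tag and that a completed sweep is logged as `SUBNET UP`, which is how OpenSM typically writes its log. The function name `scan_sm_log` and both markers are illustrative assumptions, so adjust them to match the log format of your SM version.

```python
import re
from typing import Iterable, List, Tuple

# Assumed markers (verify against your own SM log before relying on them):
# - error lines contain the standalone token "ERR", usually followed by a code
# - a completed subnet sweep is reported with the text "SUBNET UP"
ERROR_PATTERN = re.compile(r"\bERR\b")
SUBNET_UP_MARKER = "SUBNET UP"


def scan_sm_log(lines: Iterable[str]) -> Tuple[List[str], bool]:
    """Scan SM log lines and return (error_lines, subnet_up_seen).

    error_lines holds every line that matches ERROR_PATTERN, in log order.
    subnet_up_seen is True if SUBNET_UP_MARKER appears on at least one line.
    """
    error_lines: List[str] = []
    subnet_up_seen = False
    for line in lines:
        text = line.rstrip("\n")
        if SUBNET_UP_MARKER in text:
            subnet_up_seen = True
        if ERROR_PATTERN.search(text):
            error_lines.append(text)
    return error_lines, subnet_up_seen


def scan_sm_log_file(path: str) -> Tuple[List[str], bool]:
    """Open the log at `path` (hypothetical, site-specific) and scan it."""
    # errors="replace" keeps the scan going if the log has undecodable bytes.
    with open(path, "r", errors="replace") as handle:
        return scan_sm_log(handle)
```

To use it, call `scan_sm_log_file()` with the path of the SM log on the server running the subnet manager, then review the returned error lines and confirm that the second value is `True`. The script only narrows down what to look at; the lines it returns still need to be read in context.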

© Copyright 2024, NVIDIA. Last updated on May 28, 2024.