From Build to Operation


This section outlines the steps to take during the final stages of InfiniBand cluster bring-up and initialization, just before the cluster becomes operational. It is recommended to repeat these steps after every maintenance window. The steps include reviewing and retaining files and logs that serve as a reference when the cluster is signed off from the build phase to the operation phase, and after a UFM, OpenSM, or firmware upgrade.
Please adhere to the following steps:



Verify the operational status of the UFM service. Direct link: UFM Service Verification

Generate and validate the fabric health report. Direct link: Fabric Health Report Generation and Validation

Confirm the integrity of the cluster's topology. Direct link: Cluster Topology Validation

Validate the proper functioning of UFM telemetry collection. Direct link: Telemetry Metrics Collection

Evaluate the cluster's performance. Direct link: Cluster Performance Verification

Optional: Check for the presence and functionality of necessary UFM plugins. Direct link: Review All Unhealthy Nodes

Run the quarterly maintenance procedure. Direct link: Quarterly Maintenance

Obtain and securely store a UFM system dump using the following command:


Retain this data for future reference


Contact NVIDIA Networking Support or your designated NVIDIA account team to arrange a review of your UFM and SM configurations. Provide the ufm_sysdump data generated in the previous step for reference.
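
The review-and-retain part of this procedure can be supported with a small sign-off record kept next to the system dump. The following is a minimal sketch in Python, not part of UFM: the function names, the record layout, and the output file name sign_off_record.json are all hypothetical and chosen only for illustration. It records the outcome of each step listed above and a SHA-256 checksum of the retained dump file, so that the archive can later be confirmed unchanged.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

# Checklist entries, keyed by the section titles used in the steps above.
# The value says whether the step is required; the sixth step is marked
# "Optional" in the checklist. This mapping is illustrative, not a UFM API.
SIGN_OFF_STEPS = {
    "UFM Service Verification": True,
    "Fabric Health Report Generation and Validation": True,
    "Cluster Topology Validation": True,
    "Telemetry Metrics Collection": True,
    "Cluster Performance Verification": True,
    "Review All Unhealthy Nodes": False,
    "Quarterly Maintenance": True,
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_sign_off_record(results, sysdump, reason):
    """Combine per-step outcomes and the dump checksum into one record.

    results maps a step title to True (passed) or False (failed).
    Optional steps may be omitted and are then recorded as "skipped".
    """
    unknown = [name for name in results if name not in SIGN_OFF_STEPS]
    if unknown:
        raise ValueError(f"unknown steps: {unknown}")
    missing = [
        name
        for name, required in SIGN_OFF_STEPS.items()
        if required and name not in results
    ]
    if missing:
        raise ValueError(f"no result recorded for required steps: {missing}")

    steps = {}
    for name in SIGN_OFF_STEPS:
        if name not in results:
            steps[name] = "skipped"
        else:
            steps[name] = "pass" if results[name] else "fail"

    sysdump = Path(sysdump)
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "reason": reason,  # e.g. "initial sign-off" or "post-upgrade"
        "steps": steps,
        "ready_for_operation": "fail" not in steps.values(),
        "sysdump_file": sysdump.name,
        "sysdump_sha256": sha256_of(sysdump),
    }


def write_sign_off_record(record, out_dir):
    """Write the record as JSON into out_dir and return the file path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "sign_off_record.json"
    out_path.write_text(json.dumps(record, indent=2))
    return out_path
```

A record would typically be built once per sign-off or maintenance window, for example with build_sign_off_record(results, "/path/to/dump", "post-upgrade"), where the dump path is wherever the system dump was stored, and then written into the same retained directory.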

© Copyright 2023, NVIDIA. Last updated on Mar 20, 2024.