From Build to Operation
This chapter describes the procedure to run at the end of the InfiniBand cluster bring-up phase, just before the cluster becomes operational. It is also recommended to repeat this procedure after every maintenance window, in particular after a UFM, OpenSM, or firmware upgrade. The procedure includes reviewing and retaining the files and logs that serve as a reference when the cluster is signed off from the build phase to the operation phase.
Please adhere to the following steps:
Step | Item | Direct Link |
1 | Verify the operational status of the UFM service | |
2 | Run the fabric health procedure | |
3 | Confirm the integrity of the cluster's topology | |
4 | Validate the proper functioning of UFM telemetry collection | |
5 | Evaluate the cluster's performance | |
6 | Optional: Check for the presence and functionality of necessary UFM plugins | |
7 | Run the quarterly maintenance procedure | |
8 | Obtain and securely store a UFM system dump using the following command: /usr/bin/ufm_sysdump.sh. Retain this data for future reference. | |
9 | Contact NVIDIA Networking Support or your designated NVIDIA account team to arrange a review of your UFM and SM configurations with NVIDIA. Provide the ufm_sysdump data generated in the previous step for reference. | |
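As an illustration of step 8, the following is a minimal sketch of a wrapper that runs the system dump script and keeps its console output in a dated log file for the sign-off record. The function name `run_sysdump` and the variables `SYSDUMP_CMD` and `RETAIN_DIR` are hypothetical and not part of UFM. The sketch makes no assumption about where the script writes its dump; it only retains whatever the script prints and reports failure through the exit status.

```shell
#!/usr/bin/env bash
# Sketch only: SYSDUMP_CMD and RETAIN_DIR are illustrative names, not UFM settings.
SYSDUMP_CMD="${SYSDUMP_CMD:-/usr/bin/ufm_sysdump.sh}"
RETAIN_DIR="${RETAIN_DIR:-$HOME/ufm-signoff}"

# Run the dump command and save its stdout/stderr to a timestamped log.
# Prints the log path on success; returns non-zero if the command fails.
run_sysdump() {
    local stamp log
    stamp="$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$RETAIN_DIR" || return 1
    log="$RETAIN_DIR/ufm_sysdump-$stamp.log"
    if "$SYSDUMP_CMD" >"$log" 2>&1; then
        echo "$log"
    else
        echo "sysdump command failed, see $log" >&2
        return 1
    fi
}
```

After sourcing the file, calling `run_sysdump` once per sign-off or maintenance window leaves one log per run in `RETAIN_DIR`, which can be stored alongside the dump itself and referenced when contacting support in step 9.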