NVIDIA InfiniBand Cluster Operation and Maintenance Guide
1.0

From Build to Operation

This chapter describes the procedure to run toward the end of the InfiniBand cluster bring-up phase, just before the cluster becomes operational. It is also recommended to repeat the procedure after every maintenance window, particularly after a UFM, OpenSM, or firmware upgrade. The procedure includes reviewing and retaining files and logs that serve as a reference when the cluster is signed off from the build phase to the operation phase.
Please adhere to the following steps:

| Step | Item | Direct Link |
|------|------|-------------|
| 1 | Verify the operational status of the UFM service | UFM Service Verification |
| 2 | Generate the fabric health report | Fabric Health Report Generation and Validation |
| 3 | Confirm the integrity of the cluster's topology | Cluster Topology Validation |
| 4 | Validate the proper functioning of UFM telemetry collection | Telemetry Metrics Collection |
| 5 | Evaluate the cluster's performance | Cluster Performance Verification |
| 6 | Optional: Check for the presence and functionality of necessary UFM plugins | Review All Unhealthy Nodes |
| 7 | Run the quarterly maintenance procedure | Quarterly Maintenance |
| 8 | Obtain and securely store a UFM system dump using the following command: /usr/bin/ufm_sysdump.sh. Retain this data for future reference. | |
| 9 | Contact NVIDIA Networking Support or your designated NVIDIA account team to arrange a review of your UFM and SM configurations with NVIDIA. Provide the ufm_sysdump data generated in the previous step for reference. | |
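Step 8 asks for the system dump to be stored securely and retained. The following is a minimal sketch of one way to do that, not an NVIDIA-provided tool. The helper name archive_dump, the archive directory, and the dump file path are hypothetical, and the sketch assumes the dump is a single archive file whose location you take from the output of /usr/bin/ufm_sysdump.sh.

```shell
#!/usr/bin/env bash
# Sketch only: archive_dump and the paths below are illustrative assumptions,
# not part of UFM.

# archive_dump <dump_file> <archive_root>
# Copies the dump into a timestamped subdirectory, records a SHA-256 checksum
# next to it, restricts access to the owner, and prints the stored path.
archive_dump() {
    local src="$1" dest_root="$2"
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }

    local name dest_dir
    name="$(basename -- "$src")"
    dest_dir="$dest_root/$(date +%Y%m%d-%H%M%S)"

    mkdir -p "$dest_dir" || return 1
    chmod 700 "$dest_dir"
    cp -- "$src" "$dest_dir/" || return 1

    # Store the checksum with a relative file name so it can be verified later
    # with: cd <dest_dir> && sha256sum -c SHA256SUMS
    ( cd "$dest_dir" && sha256sum -- "$name" > SHA256SUMS ) || return 1
    chmod 600 "$dest_dir/$name" "$dest_dir/SHA256SUMS"

    echo "$dest_dir/$name"
}

# Example (placeholders; substitute the path of the dump the script produced):
#   /usr/bin/ufm_sysdump.sh
#   archive_dump /path/to/ufm_sysdump_output.tar.gz /var/backups/ufm-sysdumps
```

Keeping a checksum beside each dump lets you confirm, before sending the file to NVIDIA Networking Support in step 9, that the retained copy has not changed since sign-off.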

© Copyright 2023, NVIDIA. Last updated on Mar 20, 2024.