
Cluster Maintenance Procedure

This section describes the minimal set of operations required to monitor the network service and keep it in good health. The tasks are grouped by how often they should run: every few minutes (ongoing), weekly, and quarterly. For each task, the section describes how it is performed and how its results are observed. Automating queries to the UFM API may be required, as described in List of Scenarios, which also details how each type of alert or issue found should be handled.
It is assumed that UFM Enterprise operates continuously, gathering relevant data and generating alerts that necessitate examination. It is important to diligently monitor and address these alerts.
Cluster maintenance can be performed in the following intervals:

Maintenance at Regular Intervals of Minutes / Ongoing
Weekly Maintenance
Quarterly Maintenance

Maintenance at Regular Intervals of Minutes / Ongoing

Upon verifying that the UFM service is up and running, the following monitoring measures are automatically activated:

Validate that the event types outlined in List of Scenarios have not occurred.
In the event of an occurrence, check the debugging and resolution procedures.

For more information, refer to UFM Events Fluent Streaming (EFS) Plugin.
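
The ongoing check above can be automated with a small polling script. The following is a minimal sketch, not NVIDIA's reference implementation. It assumes that UFM exposes its events over REST at `/ufmRest/app/events` with basic authentication and that each event is a JSON object with a `name` field. The event names in `WATCHED_EVENTS` are illustrative placeholders; the actual list should be taken from List of Scenarios.

```python
import json
import urllib.request
from base64 import b64encode

# Illustrative placeholders only; take the real names from List of Scenarios.
WATCHED_EVENTS = {"Link is Down", "Symbol Error"}


def find_watched_events(events, watched=WATCHED_EVENTS):
    """Return the events whose 'name' field appears in the watch-list."""
    return [event for event in events if event.get("name") in watched]


def fetch_events(host, user, password):
    """Fetch the event list from UFM.

    The endpoint path and the basic-auth scheme are assumptions;
    check them against the REST API documentation of your UFM version.
    """
    url = f"https://{host}/ufmRest/app/events"
    token = b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url, headers={"Authorization": f"Basic {token}"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

A scheduler (for example, a cron job that runs every few minutes) could call `find_watched_events(fetch_events(...))` and open a ticket for each match. The filtering is kept separate from the HTTP call so it can be tested without a live UFM server.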

Weekly Maintenance

Follow the steps outlined in Maintenance at Regular Intervals of Minutes / Ongoing.
Monitor trends in link monitoring key indicators. Refer to Link Monitoring Key Indicators.
Validate the integrity of the Cluster topology as instructed in Cluster Topology Validation.
Execute Fabric Health Validation tests as instructed in Fabric Health Report Generation and Validation.
Verify network performance key indicators in accordance with Cluster Performance Verification.
Perform maintenance for the cooling system: review temperature differentials as detailed in Cooling System Maintenance and address any identified issues as instructed in Inadequate Control of Cluster Temperature.
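
For the trend-monitoring step in the list above, one simple approach is to compare two snapshots of per-port counters taken a week apart and flag the counters that grew. The sketch below is illustrative only. The snapshot layout (a dictionary keyed by port, each holding a dictionary of counter values), the port label, and the counter names are assumptions; the indicators that actually matter are defined in Link Monitoring Key Indicators.

```python
def rising_counters(previous, current, threshold=0):
    """Return {(port, counter): delta} for counters that grew by more than threshold.

    Both arguments map a port label to a dict of counter values.
    A counter missing from the previous snapshot is treated as 0.
    """
    rising = {}
    for port, counters in current.items():
        before = previous.get(port, {})
        for name, value in counters.items():
            delta = value - before.get(name, 0)
            if delta > threshold:
                rising[(port, name)] = delta
    return rising


# Hypothetical snapshots from two consecutive weekly runs.
last_week = {"switch1/1": {"symbol_error_counter": 10, "link_downed_counter": 2}}
this_week = {"switch1/1": {"symbol_error_counter": 25, "link_downed_counter": 2}}
flagged = rising_counters(last_week, this_week)
```

Here `flagged` contains only the symbol error counter of `switch1/1`, with a delta of 15. A non-zero `threshold` can suppress small week-to-week increases that are considered acceptable.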

Quarterly Maintenance

Follow the steps outlined in Weekly Maintenance.
Examine the most recent NVIDIA firmware and software release notes as detailed in . It is recommended to perform regular updates of the cluster software, at a minimum of once per year, aligning with the LTS release schedule.
Even if a software upgrade cannot be carried out, it is strongly recommended to review the documented known issues that have been resolved in newer releases. Refer to UFM SW Release Notes and User Manual.
Conduct an annual review of NVIDIA network health – Contact NVIDIA Networking Support or your designated NVIDIA contact.
