NVIDIA InfiniBand Cluster Operation and Maintenance Guide



About This Document

This document is intended for network operators responsible for maintaining InfiniBand clusters. It outlines the automation tools, required tests, and essential information needed when accepting a new cluster. It also provides recommendations for monitoring and maintenance routines, along with guidance on how to obtain the inputs these procedures require and how to carry out the maintenance operations effectively. The content is organized for easy reference.

In addition, this document provides links to documentation describing how to correlate network events with the way NVIDIA UFM (Unified Fabric Manager) reports them. The scenarios are categorized by the anticipated likelihood of their occurrence. For each issue, the full set of UFM alerts that signal its presence is listed, along with the UFM settings that must be configured to receive these alerts. Detailed examples of these alerts are presented, with explanations of their significance.

This document aligns with the software capabilities as of July 2023. It aims to give network operators a comprehensive resource for managing and maintaining InfiniBand clusters, based on the information and practices available at that time.

Related Documentation

Document Revision History

For the list of changes made to this document, refer to Document Revision History.

© Copyright 2023, NVIDIA. Last updated on Apr 1, 2024.