About This Document
This document is intended for network operators responsible for maintaining InfiniBand clusters. It outlines the automation tools, tests, and information required when accepting a new cluster. It also recommends monitoring and maintenance routines, and explains how to obtain the inputs these procedures need and how to carry out the maintenance operations effectively. The content is organized for easy reference.
In addition, this document provides links to documentation that describes how to correlate network events with the way NVIDIA UFM (Unified Fabric Manager) reports them. Scenarios are categorized by how likely they are to occur. For each issue, the document lists the set of UFM alerts that signal its presence, along with the UFM settings that must be configured to receive these alerts. Examples of these alerts are presented with explanations of their significance.
This document reflects software capabilities as of July 2023. It aims to give network operators a comprehensive resource for managing and maintaining InfiniBand clusters.
Related Documentation
NVIDIA UFM Enterprise | NVIDIA UFM Enterprise User Manual; NVIDIA UFM Enterprise Quick Start Guide; NVIDIA UFM Enterprise REST API Guide |
NVIDIA UFM Enterprise Appliance | NVIDIA UFM Enterprise Appliance Software User Manual |
NVIDIA UFM Telemetry | NVIDIA UFM Telemetry Documentation |
NVIDIA UFM High-Availability | NVIDIA UFM High-Availability User Guide |
Document Revision History
For the list of changes made to this document, refer to Document Revision History.