NVIDIA InfiniBand Cluster Operation and Maintenance Guide

About This Document

This document is intended for network operators responsible for maintaining InfiniBand clusters. It outlines the automation tools, required tests, and essential information needed when accepting a new cluster. It also recommends monitoring and maintenance routines, and explains how to obtain the inputs these procedures require and how to carry out the maintenance operations. The content is organized for easy reference.

In addition, this document links to documentation that describes how network events relate to the way NVIDIA UFM (Unified Fabric Manager) reports them. The scenarios are categorized by how likely they are to occur. For each issue, the document lists the full set of UFM alerts that signal its presence, along with the UFM settings that must be configured to receive those alerts. Examples of these alerts are presented with explanations of their significance.

This document reflects the software capabilities as of July 2023. It aims to give network operators a resource for managing and maintaining InfiniBand clusters based on the information and practices available at that time.

Related Documentation

NVIDIA UFM Enterprise	NVIDIA UFM Enterprise User Manual; NVIDIA UFM Enterprise Quick Start Guide; NVIDIA UFM Enterprise REST API Guide
NVIDIA UFM Enterprise Appliance	NVIDIA UFM Enterprise Appliance Software User Manual
NVIDIA UFM Telemetry	NVIDIA UFM Telemetry Documentation
NVIDIA UFM High-Availability	NVIDIA UFM High-Availability User Guide

Document Revision History

For the list of changes made to this document, refer to Document Revision History.
