NVIDIA InfiniBand Cluster Operation and Maintenance Guide
1.0

Prerequisites

This section lists the tools required to carry out the InfiniBand cluster maintenance and operational procedures.

  1. UFM (July 2023 software version): This comprises UFM Enterprise and at least one instance of UFM Telemetry. UFM includes an embedded UFM Telemetry instance that provides 120 basic debug counters for each port. These counters are collected periodically and, by default, are accessible through an HTTP endpoint. UFM also offers several mechanisms for pushing (streaming) UFM Telemetry data and event streams. For more information, see Retrieving UFM Issues.
  2. UFM Installation: Refer to the installation instructions for the UFM software you plan to use:

    UFM                                        Link to Installation Instructions
    -----------------------------------------  -----------------------------------------
    UFM Enterprise                             UFM Enterprise Installation
    UFM Enterprise Appliance                   UFM Enterprise Appliance Software Upgrade
    UFM Telemetry                              UFM Telemetry Installation
    For those opting to use their own server   UFM Installation Steps
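
Item 1 above notes that the per-port debug counters are, by default, accessible through an HTTP endpoint. The following is a minimal sketch of how such an endpoint might be polled and scanned for non-zero counters. It rests on several assumptions that do not come from this guide: the URL (host, port, and path) is a placeholder, the response is assumed to be CSV text with a header row, and the counter name used in the usage note is purely illustrative. Check the UFM Telemetry documentation for the actual endpoint and output format of your deployment.

```python
import csv
import io
import urllib.request

# Hypothetical endpoint: host, port, and path are placeholders,
# not values taken from this guide.
TELEMETRY_URL = "http://ufm-host.example:9001/csv/metrics"


def parse_counters(csv_text):
    """Parse CSV telemetry text (assumed format) into a list of dicts,
    one dict per row, keyed by the header names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


def fetch_counters(url=TELEMETRY_URL, timeout=10.0):
    """Fetch the telemetry text over HTTP and parse it."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_counters(resp.read().decode("utf-8"))


def ports_with_nonzero(rows, counter):
    """Return the rows whose `counter` column holds a number greater
    than zero. Missing, empty, or non-numeric values are skipped."""
    flagged = []
    for row in rows:
        try:
            if float(row.get(counter, "") or 0) > 0:
                flagged.append(row)
        except ValueError:
            continue
    return flagged
```

With a reachable endpoint, `ports_with_nonzero(fetch_counters(), "symbol_error_counter")` would return the rows worth a closer look, where `symbol_error_counter` stands in for whatever column name your telemetry output actually uses. Keeping the parsing separate from the HTTP call makes it possible to run the same check against a saved sample of the output.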

© Copyright 2023, NVIDIA. Last updated on Feb 1, 2024.