InfiniBand Cluster Bring-up Procedure

Bring-up Process Checklist




Entry Criteria

Done Criteria


Cluster Planning - Choose Topology

 Setting the InfiniBand Cluster Topology


  • Desired topology was defined


Cluster Planning - Create PTP File

 Creating a Point-to-Point Excel File

  • Desired topology was defined

  • PTP file was created as described


Cluster Planning - Create & Save Topo File

Creating a Topology File

Saving the Topology File

  • Valid PTP file

  • Topo file generation ran successfully

  • Generated topo file was renamed with a meaningful name

  • Topo file was saved for future use


Topology Confirmation - Install UFM

 UFM Enterprise Installation

  • Previous steps completed

  • Cluster components are physically installed/deployed

  • UFM successfully installed and configured (including HA)

  • UFM status is running

  • UFM GUI works


Topology Confirmation using UFM

 Topology Confirmation using UFM

  • Generated topo file

  • UFM installation done criteria

  • Custom Topology Compare Report is clear of errors/warnings


Network Deployment - SW & FW versions Alignment

Confirm Components' Firmware and Software Versions

On-site Upgrade – Low scale

  • UFM working with GUI

  • All versions are aligned and confirmed



Configuration and Basic Features Activation

  • All previous sections are successfully completed

  • Switch configuration done

  • UFM configuration done

  • Other optional configurations done


Cluster Verification

Cluster Verification

  • All previous sections are successfully completed

  • SM logs were verified and all issues were addressed

  • UFM Fabric Health report is clear of errors/alarms

  • The specified UFM Telemetry indicators comply with the specified evaluation criteria

  • UFM events and alarms are monitored and addressed

  • All links are up with valid quality, as described in UFM Telemetry



Performance Testing

  • Cluster verification stage successfully completed

  • HPC-X package successfully installed and ClusterKit is working

  • ClusterKit run results are as expected according to the 'Results Verification' section

  • Optional - results were visualized and analyzed as described in the 'Visualize results' section

© Copyright 2024, NVIDIA. Last updated on May 28, 2024.