InfiniBand Cluster Bring-up Procedure

Creating a Point-to-Point Excel File

The Point-to-Point Excel file centralizes all the physical information of the project and explicitly describes how to connect each cable. For the list of supported cables, see the NVIDIA LinkX Cables and Transceivers page.

To create the Excel file:

  1. Open an Excel file (use this template file).

  2. Create two sheets, as described below:

    • Legend – describes basic properties for each element of the cluster. Each element should include the following properties:

      • Name – the naming convention for the element. As a best practice, use the element's base name with * before and after it.

      • Model – element model

        Note

        The Model is the device format as described in “/usr/share/ibdm2.1.1/ibnl”. If the model used is not in the supported list, create a new one as follows:

        https://linux.die.net/man/1/ibdm-topo-file

        https://linux.die.net/man/1/ibdm-ibnl-file

      • Switch/HCA - whether it is a switch or HCA

      • Speed – element speed

      • Comments – general comments
        Example:

        Name  | Model   | Switch/HCA | Speed   | Comments
        *dgx* | HCA_12  | hca        | 4x-100G | NDR
        *clf* | MQM9700 | switch     | 4x-100G | NDR
        *csp* | MQM9700 | switch     | 4x-100G | NDR

    • PTP - explicitly describes how to connect each cable. The table has two main parts, Source and Destination, which contain mostly the same columns. Each line should include the following for each end of the cable:

      • Rack - device rack

      • U - device location in the rack

      • Name – name of the device (must comply with the naming convention specified for the device type in the Legend sheet)

      • HCA/port - HCA name and port (in the Destination part, only the port)
        For example:

        Source                                  | Destination
        Rack      | U | Name         | HCA/port | Rack           | U  | Name         | Port
        SU1-1 A22 | 3 | cl02s01dgx01 | 1        | Leaves SU1 A38 | 25 | cl02s01clf01 | 1
        SU1-1 A22 | 3 | cl02s01dgx01 | 2        | Leaves SU1 A38 | 27 | cl02s01clf02 | 1
        SU1-1 A22 | 3 | cl02s01dgx01 | 3        | Leaves SU1 A38 | 29 | cl02s01clf03 | 1
        SU1-1 A22 | 3 | cl02s01dgx01 | 4        | Leaves SU1 A38 | 31 | cl02s01clf04 | 1
        SU1-1 A22 | 3 | cl02s01dgx01 | 5        | Leaves SU1 A38 | 33 | cl02s01clf05 | 1
        SU1-1 A22 | 3 | cl02s01dgx01 | 6        | Leaves SU1 A38 | 35 | cl02s01clf06 | 1
        SU1-1 A22 | 3 | cl02s01dgx01 | 7        | Leaves SU1 A38 | 37 | cl02s01clf07 | 1
        SU1-1 A22 | 3 | cl02s01dgx01 | 8        | Leaves SU1 A38 | 39 | cl02s01clf08 | 1
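The example rows follow a regular pattern: HCA port N of cl02s01dgx01 connects to port 1 of leaf cl02s01clfNN, and the leaf U position increases by 2 per leaf. As a rough sketch only, the rows for one node could be generated rather than typed by hand. The function name, the dictionary keys, and the CSV output are illustrative assumptions, not part of the template; the step of 2 in U and the leaf naming are simply read off the example and may not hold for other racks.

```python
import csv
import io


def dgx_ptp_rows(node="cl02s01dgx01", src_rack="SU1-1 A22", src_u=3,
                 leaf_rack="Leaves SU1 A38", first_leaf_u=25, ports=8):
    """Build one PTP row per HCA port, mirroring the example table."""
    rows = []
    for port in range(1, ports + 1):
        rows.append({
            "src_rack": src_rack,
            "src_u": src_u,
            "src_name": node,
            "src_hca_port": port,
            "dst_rack": leaf_rack,
            # Assumption taken from the example: leaves sit 2U apart.
            "dst_u": first_leaf_u + 2 * (port - 1),
            # Assumption taken from the example: HCA port N goes to leaf NN.
            "dst_name": f"cl02s01clf{port:02d}",
            "dst_port": 1,
        })
    return rows


rows = dgx_ptp_rows()

# Emit the rows as CSV text that can be pasted or imported into the PTP sheet.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Generated rows should still be checked against the physical layout before they are used for cabling.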

Please note that:

  • Destination device should always be a switch (HCAs should always be specified in source)

  • For switches, use real/physical port numbers

  • HCA ports can be named or enumerated as you wish, but you must verify that each enumerated HCA port maps correctly to the real HCA interface name (this mapping is used in the next step)
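The mapping check in the last point can be sketched as follows. This is a minimal illustration, assuming the HCA ports are enumerated 1 to 8 as in the example table; the interface names (`mlx5_0` and so on) and the function name `check_hca_mapping` are hypothetical placeholders, and the real interface names must be read from the hosts.

```python
def check_hca_mapping(used_ports, mapping):
    """Return a list of problems; an empty list means the mapping looks consistent."""
    problems = []
    # Every HCA port used in the PTP sheet needs an interface name.
    for port in sorted(set(used_ports)):
        if port not in mapping:
            problems.append(f"HCA port {port} has no interface mapping")
    # No interface name may be claimed by two different HCA ports.
    owner = {}
    for port, iface in sorted(mapping.items()):
        if iface in owner:
            problems.append(
                f"interface {iface} is mapped from ports {owner[iface]} and {port}"
            )
        else:
            owner[iface] = port
    return problems


# HCA ports as enumerated in the PTP sheet (1..8 in the example table).
ptp_hca_ports = range(1, 9)

# Hypothetical mapping; replace with the interface names found on the host.
iface_by_port = {port: f"mlx5_{port - 1}" for port in ptp_hca_ports}

print(check_hca_mapping(ptp_hca_ports, iface_by_port))
```

An empty list indicates that every enumerated port has exactly one interface name and no interface is used twice.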

Note

In the provided examples, the element name *dgx* denotes the device with the identifier cl02s01dgx01.
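One way to check device names against the Legend sheet is to treat each Legend name as a shell-style wildcard pattern. That interpretation of the surrounding * characters is an assumption made for this sketch, as are the `LEGEND` structure and the function names; the values are copied from the Legend example. The sketch also checks the rule that the destination of a cable must be a switch.

```python
from fnmatch import fnmatchcase

# Legend rows from the example, with Name treated as a wildcard pattern (assumption).
LEGEND = [
    {"name": "*dgx*", "model": "HCA_12", "kind": "hca", "speed": "4x-100G"},
    {"name": "*clf*", "model": "MQM9700", "kind": "switch", "speed": "4x-100G"},
    {"name": "*csp*", "model": "MQM9700", "kind": "switch", "speed": "4x-100G"},
]


def legend_entry(device_name):
    """Return the single Legend row whose pattern matches device_name."""
    matches = [row for row in LEGEND if fnmatchcase(device_name, row["name"])]
    if len(matches) != 1:
        raise ValueError(
            f"{device_name!r} matches {len(matches)} Legend patterns, expected 1"
        )
    return matches[0]


def check_cable(src_name, dst_name):
    """Return a list of rule violations for one PTP row (empty if none)."""
    legend_entry(src_name)  # raises if the source name does not fit the Legend
    problems = []
    if legend_entry(dst_name)["kind"] != "switch":
        problems.append(f"destination {dst_name} is not a switch")
    return problems


print(legend_entry("cl02s01dgx01")["model"])
print(check_cable("cl02s01dgx01", "cl02s01clf01"))
print(check_cable("cl02s01clf01", "cl02s01dgx01"))
```

Requiring exactly one match also catches names that fit no pattern or fit several, which usually points to a naming convention that is too loose.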

Note

Use clear and meaningful names that describe each element, its role, and its location in both the topology and the cluster.

© Copyright 2024, NVIDIA. Last updated on May 28, 2024.