Rail Optimized Topology Validation

This section describes the options for fast validation of HCA to Top-of-Rack cabling in rail optimized topologies such as "DGX SuperPOD". The feature checks that all HCAs connected to the same Top-of-Rack switch have the same PCIe address (BDF) in their respective servers. The "Rail Optimized Topology" validation is applicable to compute nodes. The output file, ibdiagnet2.rails, lists the PCIe BDF of each HCA with respect to the corresponding Top-of-Rack switch.
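The check described above can be sketched in a few lines of Python. This is an illustration of the idea only, not ibdiagnet's implementation: the function name `find_rail_mismatches` and the tuple layout are hypothetical, and the sample links are taken from the example report later in this section.

```python
from collections import defaultdict


def find_rail_mismatches(links):
    """Group host ports by switch and PCIe BDF; return inconsistent switches.

    `links` is an iterable of (switch_guid, host_port, pcie_bdf) tuples.
    This layout is an assumption made for the sketch.
    """
    by_switch = defaultdict(lambda: defaultdict(list))
    for switch_guid, host_port, pcie_bdf in links:
        by_switch[switch_guid][pcie_bdf].append(host_port)

    # A switch is consistent when every attached HCA reports the same BDF.
    return {
        guid: dict(sorted(rails.items()))
        for guid, rails in by_switch.items()
        if len(rails) > 1
    }


# Sample links copied from the example report in this section.
links = [
    ("0xe41d2d030003e470", "r-ufm118/U1/P1", "0000:04:00.0"),
    ("0xe41d2d030003e470", "r-ufm112/U2/P1", "0000:09:00.0"),
    ("0xe41d2d030003e470", "r-ufm218/U1/P1", "0000:10:00.0"),
    ("0xe41d2d030003e470", "r-ufm216/U2/P1", "0000:10:00.0"),
]

mismatches = find_rail_mismatches(links)
for guid, rails in mismatches.items():
    print(f"Mismatch on switch GUID={guid}")
    for bdf, ports in rails.items():
        print(f"  PCIe {bdf}: {len(ports)} ports: {', '.join(ports)}")
```

A switch whose attached HCAs all share one BDF does not appear in the result, which corresponds to a correctly cabled rail.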

Parameter

Description

--rail_validation

Checks that a topology is rail optimized.

Data will be dumped to the ibdiagnet2.rails file.

Warnings and errors will be dumped to the ibdiagnet2.log file.

--rail_validation_opt

Comma-separated list of Rail Optimized Topology Validation options.

  • regex='regular expression' - only nodes matching the regular expression will be included in the report

Example:


ibdiagnet --rail_validation

  • ibdiagnet's Output:


    Rail Optimized Topology Validation
    -W- Node rail connectivity mismatch on the switch: "SwitchIB Mellanox Technologies" GUID=0xe41d2d030003e470
    -W- rail A (PCIe 0000:04:00.0): 1 ports ==> r-ufm118/U1/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P35
    -W- rail B (PCIe 0000:09:00.0): 1 ports ==> r-ufm112/U2/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P30
    -W- rail C (PCIe 0000:10:00.0): 2 ports ==> r-ufm218/U1/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P33, ...
    -W- Rail Optimized Topology Validation ended with 1 warnings
    -I- Rail Optimized Topology validation is usually applicable to the compute nodes.
    -I- If detected mis-cabled nodes are not compute ones, please apply Rail Optimized Topology check for specific set of nodes by invoking: --rail_validation_opt regex='reg expression'

  • ibdiagnet2.rails file content:


    Node rail connectivity mismatch on the switch: SwitchIB Mellanox Technologies GUID=0xe41d2d030003e470
    rail A(0000:04.00.00): 1 ports:
        r-ufm118/U1/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P35
    rail B(0000:09.00.00): 1 ports:
        r-ufm112/U2/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P30
    rail C(0000:10.00.00): 2 ports:
        r-ufm218/U1/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P33
        r-ufm216/U2/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P34

Example:


ibdiagnet --rail_validation --rail_validation_opt regex='[a-zA-Z]-ufm11[0-9]*'

Output:

  • HCAs installed in the r-ufm118 and r-ufm112 servers are included in the report, because their node descriptions match the provided regular expression

  • HCAs installed in the r-ufm216 and r-ufm218 servers are excluded from the report, because their node descriptions do not match the provided regular expression


Rail Optimized Topology Validation
-W- Node rail connectivity mismatch on the switch: "SwitchIB Mellanox Technologies" GUID=0xe41d2d030003e470
-W- rail A (PCIe 0000:04:00.0): 1 ports ==> r-ufm118/U1/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P35
-W- rail B (PCIe 0000:09:00.0): 1 ports ==> r-ufm112/U2/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P30
-W- Rail Optimized Topology Validation ended with 1 warnings
-I- Rail Optimized Topology validation is usually applicable to the compute nodes.
-I- If detected mis-cabled nodes are not compute ones, please apply Rail Optimized Topology check for specific set of nodes by invoking: --rail_validation_opt regex='reg expression'
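Before running the tool, it can help to try a candidate regular expression against a list of known node names. The sketch below uses Python's `re` module for this. It is an approximation under stated assumptions: the regex dialect and matching semantics that ibdiagnet applies are not specified in this section, and the names in `node_descriptions` are the server names from the example above rather than full node descriptions read from a fabric.

```python
import re

# The expression passed to --rail_validation_opt in the example above.
pattern = re.compile(r"[a-zA-Z]-ufm11[0-9]*")

# Server names from the example report (assumed to appear in the
# node descriptions as-is).
node_descriptions = ["r-ufm118", "r-ufm112", "r-ufm216", "r-ufm218"]

# Assumption: a node is selected when the expression matches anywhere
# in its description (search semantics).
included = [name for name in node_descriptions if pattern.search(name)]
excluded = [name for name in node_descriptions if not pattern.search(name)]

print("included:", included)
print("excluded:", excluded)
```

For this particular expression and these names, search and full-match semantics select the same nodes, so the assumption does not change the result here. For other expressions it may, so anchoring the pattern explicitly is a safer choice.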

© Copyright 2023, NVIDIA. Last updated on May 23, 2023.