Rail Optimized Topology Validation
This section describes the options for fast validation of HCA to Top-of-Rack cabling in rail optimized topologies such as "DGX SuperPOD". The feature checks that all HCAs connected to the same Top-of-Rack switch have the same PCIe address (BDF) in their respective servers. Rail Optimized Topology validation is applicable to compute nodes. The output file, ibdiagnet2.rails, lists the PCIe BDF of the HCAs connected to each Top-of-Rack switch.
Parameter | Description |
--rail_validation | Checks that a topology is rail optimized. Data will be dumped to the ibdiagnet2.rails file. Warnings and errors will be dumped to the ibdiagnet2.log file. |
--rail_validation_opt | Comma separated Rail Optimized Validation options. |
Example:
ibdiagnet --rail_validation
ibdiagnet's Output:
Rail Optimized Topology Validation
-W- Node rail connectivity mismatch on the switch: "SwitchIB Mellanox Technologies" GUID=0xe41d2d030003e470
-W- rail A (PCIe 0000:04:00.0): 1 ports ==> r-ufm118/U1/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P35
-W- rail B (PCIe 0000:09:00.0): 1 ports ==> r-ufm112/U2/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P30
-W- rail C (PCIe 0000:10:00.0): 2 ports ==> r-ufm218/U1/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P33, ...
-W- Rail Optimized Topology Validation ended with 1 warnings
-I- Rail Optimized Topology validation is usually applicable to the compute nodes.
-I- If detected mis-cabled nodes are not compute ones, please apply Rail Optimized Topology check for specific set of nodes by invoking: --rail_validation_opt regex='reg expression'
ibdiagnet2.rails file content:
Node rail connectivity mismatch on the switch: SwitchIB Mellanox Technologies GUID=0xe41d2d030003e470
rail A(0000:04.00.00): 1 ports:
    r-ufm118/U1/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P35
rail B(0000:09.00.00): 1 ports:
    r-ufm112/U2/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P30
rail C(0000:10.00.00): 2 ports:
    r-ufm218/U1/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P33
    r-ufm216/U2/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P34
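The idea behind the check in the report above can be illustrated with a minimal Python sketch. This is not ibdiagnet's implementation: the `links` list, its tuple layout, and the function names are hypothetical and used only for illustration. The sketch groups HCA ports by Top-of-Rack switch and PCIe BDF and flags any switch that sees more than one BDF.

```python
from collections import defaultdict

# Hypothetical input: one (switch GUID, HCA port, PCIe BDF) tuple per link.
# The values are taken from the example report; the layout is an assumption.
links = [
    ("0xe41d2d030003e470", "r-ufm118/U1/P1", "0000:04:00.0"),
    ("0xe41d2d030003e470", "r-ufm112/U2/P1", "0000:09:00.0"),
    ("0xe41d2d030003e470", "r-ufm218/U1/P1", "0000:10:00.0"),
    ("0xe41d2d030003e470", "r-ufm216/U2/P1", "0000:10:00.0"),
]


def group_by_rail(links):
    """Map each switch GUID to {PCIe BDF: [HCA ports]}."""
    rails = defaultdict(lambda: defaultdict(list))
    for guid, port, bdf in links:
        rails[guid][bdf].append(port)
    return rails


def find_mismatched_switches(links):
    """Return the switches whose attached HCAs use more than one PCIe BDF."""
    return {
        guid: dict(by_bdf)
        for guid, by_bdf in group_by_rail(links).items()
        if len(by_bdf) > 1
    }


mismatches = find_mismatched_switches(links)
for guid, by_bdf in mismatches.items():
    print(f"Node rail connectivity mismatch on the switch GUID={guid}")
    for bdf, ports in by_bdf.items():
        print(f"  PCIe {bdf}: {len(ports)} ports: {', '.join(ports)}")
```

In a correctly cabled rail optimized fabric, every switch in this sketch would map to exactly one BDF, so `find_mismatched_switches` would return an empty dictionary.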
Example:
ibdiagnet --rail_validation --rail_validation_opt regex='[a-zA-Z]-ufm11[0-9]*'
Output:
HCAs installed in the r-ufm118 and r-ufm112 servers are included in the report, because their node descriptions match the provided regular expression.
HCAs installed in the r-ufm216 and r-ufm218 servers are excluded from the report, because their node descriptions do not match the provided regular expression.
Rail Optimized Topology Validation
-W- Node rail connectivity mismatch on the switch: "SwitchIB Mellanox Technologies" GUID=0xe41d2d030003e470
-W- rail A (PCIe 0000:04:00.0): 1 ports ==> r-ufm118/U1/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P35
-W- rail B (PCIe 0000:09:00.0): 1 ports ==> r-ufm112/U2/P1 <--> Se41d2d030003e470/Ne41d2d030003e470/P30
-W- Rail Optimized Topology Validation ended with 1 warnings
-I- Rail Optimized Topology validation is usually applicable to the compute nodes.
-I- If detected mis-cabled nodes are not compute ones, please apply Rail Optimized Topology check for specific set of nodes by invoking: --rail_validation_opt regex='reg expression'
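Before running ibdiagnet, a candidate expression can be tried against known node descriptions. The following Python sketch is only an approximation: it assumes the expression may match anywhere in the node description (`re.search` semantics), which may differ from how ibdiagnet applies the regex, and it uses the bare server names from the example as stand-ins for full node descriptions. The `select_nodes` helper is hypothetical.

```python
import re

# The expression from the example above.
pattern = re.compile(r"[a-zA-Z]-ufm11[0-9]*")

# Stand-ins for node descriptions; real descriptions may carry extra text.
node_descriptions = ["r-ufm118", "r-ufm112", "r-ufm216", "r-ufm218"]


def select_nodes(descriptions, pattern):
    """Keep the descriptions in which the pattern is found (assumed semantics)."""
    return [d for d in descriptions if pattern.search(d)]


included = select_nodes(node_descriptions, pattern)
excluded = [d for d in node_descriptions if d not in included]

print("included:", included)
print("excluded:", excluded)
```

Under these assumptions the sketch selects r-ufm118 and r-ufm112 and leaves out r-ufm216 and r-ufm218, consistent with the report above.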