InfiniBand Cluster Bring-up Procedure

SM Logs

SM logs include details of reported errors. Treat every error reported in opensm.log as an indicator of IB fabric health.

SM logs path:

  • When only OpenSM is running without UFM: /var/log/opensm.log

  • When OpenSM is running with UFM in a Docker container, enter the container:

    docker exec -it ufm bash

    Inside the container, the log path is: /opt/ufm/files/log/opensm.log

The SM log file should include the message "SUBNET UP" if OpenSM was able to set up the subnet correctly.
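This check can be scripted with grep. The following is a minimal sketch: the function name `subnet_is_up` is hypothetical, and it only confirms that the message appears somewhere in the file, not that it is the most recent event.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of OpenSM or UFM):
# succeeds if the given SM log contains the "SUBNET UP" message.
subnet_is_up() {
    local log_file="$1"
    # -q: no output, exit status only; -F: match the string literally.
    grep -qF "SUBNET UP" "$log_file"
}

# Example usage (path assumes OpenSM running without UFM):
# subnet_is_up /var/log/opensm.log && echo "subnet is up"
```

When OpenSM runs with UFM, the same function can be pointed at /opt/ufm/files/log/opensm.log from inside the container.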

The SM log file size can be changed. You can also choose how often a new SM log file is created: daily, weekly (default), or monthly.

A new SM log file is created when the current file reaches its maximum size or when the rotation period elapses.

  1. Modify the OpenSM log maximum file size by editing the log_max_size parameter:

    vi /opt/ufm/files/conf/opensm/opensm.conf

  2. Modify the OpenSM log rotation frequency:

    vi /etc/logrotate.d/opensm
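Before editing either file, it can help to read the current values. The sketch below shows one way to do that. Both function names are hypothetical, and it assumes that opensm.conf stores the setting as a whitespace-separated `log_max_size <value>` line and that the logrotate file contains a `daily`, `weekly`, or `monthly` directive on a line of its own.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of log_max_size from an opensm.conf file.
# Assumes the "key value" layout, e.g. "log_max_size 100".
current_log_max_size() {
    local conf_file="$1"
    awk '$1 == "log_max_size" { print $2; exit }' "$conf_file"
}

# Hypothetical helper: print the first rotation directive
# (daily, weekly or monthly) found in a logrotate configuration file.
rotation_frequency() {
    local conf_file="$1"
    awk '$1 ~ /^(daily|weekly|monthly)$/ { print $1; exit }' "$conf_file"
}

# Example usage with the paths from the steps above:
# current_log_max_size /opt/ufm/files/conf/opensm/opensm.conf
# rotation_frequency /etc/logrotate.d/opensm
```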

Locate the subnet manager:

[root@fit229 ~]# sminfo
sminfo: sm lid 8 sm guid 0xa088c203007cdd36, activity count 47086 priority 15 state 3 SMINFO_MASTER

Query node description:

[root@fit229 ~]# smpquery nd 8
Node Description:...................fit232 mlx5_0
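The two commands above can be chained so that the LID reported by sminfo is passed straight to smpquery. The following is a sketch only: the function name `sm_lid_from_sminfo` is hypothetical, and it assumes the sminfo output has the `sm lid <N>` form shown in the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read sminfo output on stdin and print the SM LID.
# Assumes the line contains "sm lid <number>" as in the example above.
sm_lid_from_sminfo() {
    sed -n 's/.*sm lid \([0-9][0-9]*\) .*/\1/p'
}

# Example usage on a host with the InfiniBand diagnostic tools installed:
# smpquery nd "$(sminfo | sm_lid_from_sminfo)"
```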

Error            Description
TIMEOUT          Timeout in the network. Look for a bad cable.
trap 128         The link state changed. If this occurs too often on the same cable, make sure the cable is not damaged.
trap 131         A bad cable is connected.
trap 144         Change in either link width/speed or node description.
traps 257-259    Bad partitions.
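To see which of these traps occur most often, the notices in the SM log can be tallied by trap number. The sketch below assumes that notice lines contain the text "Reporting Generic Notice" and a `num:<trap>` field, as in the log line shown in the trap 128 example. The function name `count_traps` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count generic notices per trap number in an SM log.
# Prints one "<trap number> <count>" pair per line, sorted by trap number.
count_traps() {
    local log_file="$1"
    grep 'Reporting Generic Notice' "$log_file" \
        | grep -oE 'num:[0-9]+' \
        | cut -d: -f2 \
        | sort -n \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Example usage:
# count_traps /var/log/opensm.log
```

A trap number with a much higher count than the others is a reasonable place to start the investigation.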

Example (Error trap 128):

The following notice in the SM log reports a link state change from LID 4:

Apr 16 22:11:41 477567 [DA9C8640] 0x02 -> log_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:4 GID:fe80::900a:8403:b3:c540

Check the error by running the following command. If a port's LinkDownedCounter is very high, the cable is likely damaged.

for i in {1..<ports amount>};do echo Port:$i;perfquery <LID> $i | grep LinkDownedCounter;done

For LID 4, checking ports 1 through 64:

[root@l-qa-203 ~]# for i in {1..64};do echo Port:$i;perfquery 4 $i | grep LinkDownedCounter;done
Port:1
LinkDownedCounter:...............2
Port:2
LinkDownedCounter:...............0
Port:3
LinkDownedCounter:...............154222
Port:4
..

Here port 3 has a far higher count than the other ports, so its cable is the one to inspect.
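On a switch with many ports, the loop output can be filtered so that only suspicious ports are printed. The following is a sketch under stated assumptions: the function name `flag_flapping_ports` is hypothetical, the threshold is an arbitrary value chosen by the operator (the procedure does not define how large is too large), and the input is expected in the `Port:<n>` / `LinkDownedCounter:...<count>` format shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the loop output on stdin and print
# "<port> <count>" for every port whose LinkDownedCounter exceeds the threshold.
flag_flapping_ports() {
    local threshold="$1"
    awk -v limit="$threshold" '
        # Remember the most recent port number ("Port:3" -> "3").
        /^Port:/ { port = substr($0, 6) }
        # Take the trailing digits of the counter line as the count.
        /LinkDownedCounter/ {
            if (match($0, /[0-9]+ *$/)) {
                count = substr($0, RSTART, RLENGTH) + 0
                if (count > limit) print port, count
            }
        }'
}

# Example usage (LID 4, 64 ports, threshold of 100 chosen for illustration):
# for i in {1..64}; do echo Port:$i; perfquery 4 $i | grep LinkDownedCounter; done \
#     | flag_flapping_ports 100
```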

© Copyright 2024, NVIDIA. Last updated on May 28, 2024.