Aggregation Trees Diagnostic

Run the ibdiagnet tool with the --sharp option, then inspect the generated file /var/tmp/ibdiagnet2/ibdiagnet2.sharp:

# ibdiagnet --sharp

Example: 

# This database file was automatically generated by IBDIAG


TreeID:0, Max Radix:2
(0), AN:Mellanox Technologies Aggregation Node, lid:26, port guid:0x248a070300ea8028, Child index:0, parent QPn:0, remote parent QPn:0, radix:2
        (1), AN:Mellanox Technologies Aggregation Node, lid:22, port guid:0x248a070300ea8068, Child index:0, parent QPn:5246977, remote parent QPn:5244929, radix:0
        (1), AN:Mellanox Technologies Aggregation Node, lid:21, port guid:0x248a070300ea8048, Child index:1, parent QPn:5242881, remote parent QPn:5246978, radix:0

TreeID:1, Max Radix:2
(0), AN:Mellanox Technologies Aggregation Node, lid:30, port guid:0x7cfe900300bf85d8, Child index:0, parent QPn:0, remote parent QPn:0, radix:2
        (1), AN:Mellanox Technologies Aggregation Node, lid:22, port guid:0x248a070300ea8068, Child index:0, parent QPn:5242882, remote parent QPn:15771649, radix:0
        (1), AN:Mellanox Technologies Aggregation Node, lid:21, port guid:0x248a070300ea8048, Child index:1, parent QPn:5249026, remote parent QPn:15773698, radix:0


AN:Mellanox Technologies Aggregation Node, lid:26, node guid:0x248a070300ea8028
QPN:5244929, State:1, TS:0x00000000, G:0, SL:0, RLID:22, Traffic Class:0, Hop Limit:0, RGID:::, RQ PSN:0, SQ PSN:0, PKey:0x0000ffff, RQPN:5246977, RNR Mode:0, RNR Retry Limit:0x00000007, Timeout Retry Limit:7, Local Ack Timeout:31
QPN:5246978, State:1, TS:0x00000000, G:0, SL:0, RLID:21, Traffic Class:0, Hop Limit:0, RGID:::, RQ PSN:0, SQ PSN:0, PKey:0x0000ffff, RQPN:5242881, RNR Mode:0, RNR Retry Limit:0x00000007, Timeout Retry Limit:7, Local Ack Timeout:31

AN:Mellanox Technologies Aggregation Node, lid:21, node guid:0x248a070300ea8048
QPN:5242881, State:1, TS:0x00000000, G:0, SL:0, RLID:26, Traffic Class:0, Hop Limit:0, RGID:::, RQ PSN:0, SQ PSN:0, PKey:0x0000ffff, RQPN:5246978, RNR Mode:0, RNR Retry Limit:0x00000007, Timeout Retry Limit:7, Local Ack Timeout:31
QPN:5249026, State:1, TS:0x00000000, G:0, SL:0, RLID:30, Traffic Class:0, Hop Limit:0, RGID:::, RQ PSN:0, SQ PSN:0, PKey:0x0000ffff, RQPN:15773698, RNR Mode:0, RNR Retry Limit:0x00000007, Timeout Retry Limit:7, Local Ack Timeout:31

AN:Mellanox Technologies Aggregation Node, lid:22, node guid:0x248a070300ea8068
QPN:5242882, State:1, TS:0x00000000, G:0, SL:0, RLID:30, Traffic Class:0, Hop Limit:0, RGID:::, RQ PSN:0, SQ PSN:0, PKey:0x0000ffff, RQPN:15771649, RNR Mode:0, RNR Retry Limit:0x00000007, Timeout Retry Limit:7, Local Ack Timeout:31
QPN:5246977, State:1, TS:0x00000000, G:0, SL:0, RLID:26, Traffic Class:0, Hop Limit:0, RGID:::, RQ PSN:0, SQ PSN:0, PKey:0x0000ffff, RQPN:5244929, RNR Mode:0, RNR Retry Limit:0x00000007, Timeout Retry Limit:7, Local Ack Timeout:31

AN:Mellanox Technologies Aggregation Node, lid:30, node guid:0x7cfe900300bf85d8
QPN:15771649, State:1, TS:0x00000000, G:0, SL:0, RLID:22, Traffic Class:0, Hop Limit:0, RGID:::, RQ PSN:0, SQ PSN:0, PKey:0x0000ffff, RQPN:5242882, RNR Mode:0, RNR Retry Limit:0x00000007, Timeout Retry Limit:7, Local Ack Timeout:31
QPN:15773698, State:1, TS:0x00000000, G:0, SL:0, RLID:21, Traffic Class:0, Hop Limit:0, RGID:::, RQ PSN:0, SQ PSN:0, PKey:0x0000ffff, RQPN:5249026, RNR Mode:0, RNR Retry Limit:0x00000007, Timeout Retry Limit:7, Local Ack Timeout:31
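
The dump above can also be checked programmatically. The following is a minimal sketch, not part of ibdiagnet or SHARP: it assumes the line layout shown in the example (a `TreeID:` header, indented `(level), AN:...` node lines, then per-node `QPN:` lines) and verifies that every QP points at a remote QP that points back at it. The function names `parse_sharp_dump` and `unpaired_qps` are hypothetical, and the embedded sample is a trimmed excerpt of tree 0 that keeps only the fields the parser reads.

```python
import re

# Trimmed excerpt of the example dump: tree 0 and the QPs that connect its nodes.
# QP lines keep only the fields read below (QPN, RLID, RQPN).
SAMPLE = """\
TreeID:0, Max Radix:2
(0), AN:Mellanox Technologies Aggregation Node, lid:26, port guid:0x248a070300ea8028, Child index:0, parent QPn:0, remote parent QPn:0, radix:2
        (1), AN:Mellanox Technologies Aggregation Node, lid:22, port guid:0x248a070300ea8068, Child index:0, parent QPn:5246977, remote parent QPn:5244929, radix:0
        (1), AN:Mellanox Technologies Aggregation Node, lid:21, port guid:0x248a070300ea8048, Child index:1, parent QPn:5242881, remote parent QPn:5246978, radix:0

AN:Mellanox Technologies Aggregation Node, lid:26, node guid:0x248a070300ea8028
QPN:5244929, State:1, RLID:22, RQPN:5246977
QPN:5246978, State:1, RLID:21, RQPN:5242881

AN:Mellanox Technologies Aggregation Node, lid:21, node guid:0x248a070300ea8048
QPN:5242881, State:1, RLID:26, RQPN:5246978

AN:Mellanox Technologies Aggregation Node, lid:22, node guid:0x248a070300ea8068
QPN:5246977, State:1, RLID:26, RQPN:5244929
"""

TREE_RE = re.compile(r"^TreeID:(\d+), Max Radix:(\d+)")
NODE_RE = re.compile(
    r"^\((\d+)\), AN:.*?, lid:(\d+), .*?, parent QPn:(\d+), "
    r"remote parent QPn:(\d+), radix:(\d+)"
)
AN_RE = re.compile(r"^AN:.*, lid:(\d+), node guid:")
QP_RE = re.compile(r"^QPN:(\d+),.*?RLID:(\d+),.*?RQPN:(\d+)")


def parse_sharp_dump(text):
    """Return (trees, qps) parsed from the text of an ibdiagnet2.sharp dump.

    trees: {tree_id: [node dict, ...]} in file order
    qps:   {(local_lid, qpn): (remote_lid, remote_qpn)}
    """
    trees, qps = {}, {}
    tree_id = None   # tree whose node lines are currently being read
    an_lid = None    # aggregation node whose QP lines are currently being read
    for raw in text.splitlines():
        line = raw.strip()
        m = TREE_RE.match(line)
        if m:
            tree_id = int(m.group(1))
            trees[tree_id] = []
            continue
        m = NODE_RE.match(line)
        if m and tree_id is not None:
            level, lid, parent_qpn, remote_parent_qpn, radix = map(int, m.groups())
            trees[tree_id].append({
                "level": level, "lid": lid, "parent_qpn": parent_qpn,
                "remote_parent_qpn": remote_parent_qpn, "radix": radix,
            })
            continue
        m = AN_RE.match(line)
        if m:
            an_lid = int(m.group(1))
            continue
        m = QP_RE.match(line)
        if m and an_lid is not None:
            qpn, rlid, rqpn = map(int, m.groups())
            qps[(an_lid, qpn)] = (rlid, rqpn)
    return trees, qps


def unpaired_qps(qps):
    """List (lid, qpn) entries whose remote side does not point back at them."""
    return sorted(
        local for local, remote in qps.items() if qps.get(remote) != local
    )


if __name__ == "__main__":
    trees, qps = parse_sharp_dump(SAMPLE)
    for tree_id, nodes in trees.items():
        print("tree", tree_id, "lids:", [n["lid"] for n in nodes])
    print("unpaired QPs:", unpaired_qps(qps))
```

To run the check against a real dump, read /var/tmp/ibdiagnet2/ibdiagnet2.sharp and pass its contents to `parse_sharp_dump` in place of `SAMPLE`. An empty list from `unpaired_qps` only indicates that the QP entries reference each other consistently; it says nothing about QP state or link health.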

Mellanox SHARP Benchmark Script

The Mellanox SHARP distribution provides a test script that runs the OSU allreduce and barrier benchmarks with and without Mellanox SHARP. Running the benchmark script requires the following prerequisites:

  • ssh
  • pdsh
  • environment-modules.x86_64
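
As an illustration, the command-line prerequisites listed above can be checked before launching the script. This is a minimal sketch and not the check that sharp_benchmark.sh itself performs; the helper name `missing_commands` is hypothetical. It only looks for the `ssh` and `pdsh` executables on `PATH` and does not cover the environment-modules package, which is typically exposed as a shell function rather than a plain executable.

```python
import shutil


def missing_commands(commands=("ssh", "pdsh")):
    """Return the commands from `commands` that are not found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("missing prerequisites:", ", ".join(missing))
    else:
        print("ssh and pdsh found on PATH")
```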

The script is located at $HPCX_SHARP_DIR/sbin/sharp_benchmark.sh. Launch it from a host that runs the SM and the Aggregation Manager. The script takes the list of compute hosts from the SLURM allocation or from the “hostlist” environment variable, which is a comma-separated list. It also requires the HCA environment variables to be supplied.
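
The launch described above can be sketched as follows. This is an assumption-laden illustration rather than a supported wrapper: the function name `build_benchmark_command` and the install path in the usage note are hypothetical, and the variable names `sharp_hostlist`, `sharp_ib_dev`, and `sharp_ppn` and the `-d` flag are taken from the script's help text. The function only builds the environment and argument list; it does not run anything.

```python
import os


def build_benchmark_command(sharp_dir, hosts, ib_dev, ppn=1, dry_run=True):
    """Build (env, argv) for a sharp_benchmark.sh invocation.

    sharp_dir: value of HPCX_SHARP_DIR
    hosts:     iterable of compute host names, joined with commas
    ib_dev:    device in <device_name>:<port_number> form, e.g. "mlx5_0:1"
    """
    env = {
        "sharp_hostlist": ",".join(hosts),
        "sharp_ib_dev": ib_dev,
        "sharp_ppn": str(ppn),
    }
    argv = [os.path.join(sharp_dir, "sbin", "sharp_benchmark.sh")]
    if dry_run:
        argv.append("-d")  # dry run, per the script's help text
    return env, argv


if __name__ == "__main__":
    # "/opt/hpcx/sharp" is a placeholder; use the real HPCX_SHARP_DIR value.
    env, argv = build_benchmark_command(
        "/opt/hpcx/sharp", ["ajna01", "ajna02"], "mlx5_0:1"
    )
    print(env)
    print(argv)
```

To actually launch the script, the result could be passed to `subprocess.run(argv, env={**os.environ, **env})`. Keeping `dry_run=True` first is a cautious way to review the generated commands.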

Help

This script includes OSU benchmarks for MPI_Allreduce and MPI_Barrier blocking collective operations.
Both benchmarks run with and without using SHARP technology.
 
Usage: sharp_benchmark.sh [-t] [-d] [-h] [-f]
        -t - tests list (e.g. sharp:barrier)
        -d - dry run
        -h - display this help and exit
        -f - suppress errors in prerequisites checking
 
Configuration:
 Runtime:
  sharp_ppn - number of processes per compute node (default 1)
  sharp_ib_dev - Infiniband device used for communication. Format <device_name>:<port_number>.
                 For example: sharp_ib_dev="mlx5_0:1"
                 This is a mandatory parameter. If it's absent, sharp_benchmark.sh tries to use the first active device on local machine
  sharp_groups_num - number of groups per communicator. (default is the number of devices in sharp_ib_dev)
  sharp_num_trees - number of trees to request. (default number of trees is based on the #rails and #channels)
  sharp_job_members_type - type of sharp job members list. (default is SHARP_MEMBER_LIST_PROCESSES_DATA)
  sharp_hostlist - hostnames of compute nodes used in the benchmark. The list may include normal host names,
                   a range of hosts in hostlist format. Under SLURM allocation, SLURM_NODELIST is used as a default
  sharp_test_iters - number of test iterations (default 10000)
  sharp_test_skip_iters - number of initial test iterations to skip (default 1000)
  sharp_test_max_data - max data size used for testing (default and maximum 4096)
 Environment:
  SHARP_INI_FILE - takes configuration from given file instead of /labhome/danielk/.sharp_benchmark.ini
  SHARP_TMP_DIR - store temporary files here instead of /tmp
  HCOLL_INSTALL - use specified hcoll install instead from hpcx
 
Examples:
  sharp_ib_dev="mlx5_0:1" sharp_benchmark.sh  # run using "mlx5_0:1" IB port. Rest parameters are loaded from /labhome/danielk/.sharp_benchmark.ini or default
  SHARP_INI_FILE=~/benchmark.ini  sharp_benchmark.sh # Override default configuration file
  SHARP_INI_FILE=~/benchmark.ini  sharp_hostlist=ajna0[2-3]  sharp_ib_dev="mlx5_0:1" sharp_benchmark.sh # Use specific host list
  sharp_ppn=1 sharp_hostlist=ajna0[1-8] sharp_ib_dev="mlx5_0:1" sharp_benchmark.sh  -d # Print commands without actual run
 
Dependencies:
  This script uses "python-hostlist" package. Visit https://www.nsc.liu.se/~kent/python-hostlist/ for details
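
The hostlist range notation used in the examples above, such as ajna0[1-8], expands to one name per number in the range. The sketch below is a simplified stand-in written with the standard library only; it is not the python-hostlist implementation and handles just a single bracket group with ranges and comma-separated items. The function name `expand_simple_hostlist` is hypothetical.

```python
import re

# One optional prefix, one [..] group, one optional suffix.
_GROUP_RE = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<body>[^\[\]]+)\](?P<suffix>[^\[\]]*)$"
)


def expand_simple_hostlist(expr):
    """Expand e.g. 'ajna0[1-8]' into a list of individual host names.

    Supports one bracket group containing numbers and ranges such as
    '[1-3,7]'. Zero padding follows the width of each range's first number.
    An expression without brackets is returned unchanged as a single host.
    """
    m = _GROUP_RE.match(expr)
    if m is None:
        return [expr]
    hosts = []
    for part in m.group("body").split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo          # a single number is a range of length one
        width = len(lo)        # keep leading zeros, e.g. 08 -> 08, 09, 10
        for n in range(int(lo), int(hi) + 1):
            hosts.append(f"{m.group('prefix')}{n:0{width}d}{m.group('suffix')}")
    return hosts


if __name__ == "__main__":
    print(expand_simple_hostlist("ajna0[2-3]"))
```

For anything beyond this simple form, such as multiple bracket groups or comma-separated mixes of names and ranges, the python-hostlist package referenced above is the appropriate tool.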