Overview

To meet the needs of scientific research and engineering simulations, supercomputers are growing at an unrelenting rate. As supercomputers increase in size from mere thousands to hundreds of thousands of processor cores, new performance and scalability challenges have emerged. In the past, performance tuning of parallel applications could be accomplished fairly easily by separately optimizing their algorithmic, communication, and computational aspects. However, as systems continue to scale to larger machines, these issues become co-mingled and must be addressed comprehensively.

Collective communications execute global communication operations to couple all processes/nodes in the system and therefore must be executed as quickly and as efficiently as possible. Indeed, the scalability of most scientific and engineering applications is bound by the scalability and performance of the collective routines employed. Most current implementations of collective operations suffer from the effects of system noise at extreme scale (system noise increases the latency of collective operations by amplifying the effect of small, randomly occurring OS interrupts during collective progression). Furthermore, collective operations consume a significant fraction of CPU cycles, cycles that could be better spent doing meaningful computation.

Mellanox Technologies has addressed these two issues, lost CPU cycles and performance lost to the effects of system noise, by offloading the communications to the host channel adapters (HCAs) and switches. The technology, named CORE-Direct® (Collectives Offload Resource Engine), provides the most advanced solution available for handling collective operations. It ensures maximal scalability and minimal CPU overhead, and provides the capability to overlap communication operations with computation, allowing applications to maximize asynchronous communication.

Additionally, FCA v4.2 contains support for building runtime-configurable hierarchical collectives. As with FCA 2.x, FCA v4.2 leverages hardware multicast capabilities to accelerate collective operations. In FCA v4.2, we take full advantage of the performance and scalability of the UCX point-to-point library in the form of the "ucx_p2p" BCOL. This enables users to leverage Mellanox hardware offloads transparently and with minimal effort.

FCA v4.2 and above is a standalone library that can be integrated into any MPI or PGAS runtime. Support for FCA is currently integrated into Open MPI versions 1.7.4 and higher. The FCA v4.2 release currently supports blocking and non-blocking variants of "Allgather", "Allgatherv", "Allreduce", "AlltoAll", "AlltoAllv", "Barrier", and "Bcast".

As of HPC-X v2.2, FCA (v4.1), which is part of the HPC-X package, will not be compiled with MXM. FCA will be compiled with UCX and will use it by default.

The following diagram summarizes the FCA architecture:

The following diagram shows the FCA components and the role that each plays in the acceleration process:

FCA Installation Package Content

HCOLL is part of the HPC-X software toolkit and does not require special installation.


The FCA installation package includes the following items:

  • FCA - Mellanox Fabric Collective Accelerator installation files
  • hcoll-<version>.x86_64.<OS>.rpm
  • hcoll-<version>.x86_64.<OS>.tar.gz
    where:
    <version>: The version of this release
    <OS>: One of the supported Linux distributions.
  • Mellanox Fabric Collective Accelerator (FCA) Software: End-User License Agreement
  • FCA MPI runtime libraries
  • Mellanox Fabric Collective Accelerator (FCA) Release Notes

Differences Between FCA v3.x and FCA v4.2

FCA v4.2 is a new software stack that continues to expose the power of CORE-Direct® to offload collective operations to the HCA. It adds additional scalable algorithms for collectives and supports both blocking and non-blocking APIs (MPI-3 SPEC compliant). Additionally, FCA v4.2 (hcoll) does not require the FCA manager daemon.

Configuring FCA

Compiling Open MPI with FCA v4.2

To compile Open MPI with FCA v4.2

  1. Install FCA v4.2 from: 
     •  an RPM

    # rpm -ihv hcoll-x.y.z-1.x86_64.rpm

    •  a tarball.

    % tar jxf hcoll-x.y.z.tbz

     FCA v4.2 will be installed automatically in the /opt/mellanox/hcoll folder.

  2. Enter the Open MPI source directory and run the following command:

    % cd $OMPI_HOME
    % ./configure --with-hcoll=/opt/mellanox/hcoll --with-mxm=/opt/mellanox/mxm < ... other configure parameters>
    % make -j 9 && make install -j 9

    libhcoll requires UCX v1.3 or higher.
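
    After the build completes, you can verify that the hcoll component was built into Open MPI. For example (a minimal check; ompi_info is installed as part of Open MPI):

    % ompi_info | grep hcoll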

To check the version of the FCA installed on your host:

% rpm -qi hcoll


To upgrade to a newer version of FCA:

  1. Remove the existing FCA version.

    % rpm -e hcoll
  2. Remove the precompiled Open MPI.

    % rpm -e mlnx-openmpi_gcc
  3. Install the new FCA version and compile the Open MPI with it.

Enabling FCA in Open MPI

To enable FCA v4.2 HCOLL collectives in Open MPI, explicitly ask for them by setting the following MCA parameter:

% mpirun -np 32 -mca coll_hcoll_enable 1 -mca coll_hcoll_np 0 -x HCOLL_MAIN_IB=<device_name>:<port_num> ./a.out

Tuning FCA v4.2 Settings

The default FCA v4.2 settings should be optimal for most systems. To check the available FCA parameters and their default values, run the following command:

% /opt/mellanox/hcoll/bin/hcoll_info --all


FCA v4.2 parameters are simply environment variables and can be modified in one of the following ways:

  • Modify the default FCA v4.2 parameters as part of the mpirun command:

    % mpirun ... -x HCOLL_ML_BUFFER_SIZE=65536
  • Modify the default FCA v4.2 parameter values from the shell:

    % export HCOLL_ML_BUFFER_SIZE=65536
    % mpirun ...

Selecting Ports and Devices

To select the HCA device and port you would like FCA v4.2 to run over:

-x HCOLL_MAIN_IB=<device_name>:<port_num>
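
For example, to run over port 1 of device mlx5_0 (the device name, port, and process count below are illustrative; use the values reported by ibdev2netdev on your hosts):

% mpirun -np 32 -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx5_0:1 ./a.out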

Enabling Offloaded MPI Non-blocking Collectives

In order to use hardware-offloaded collectives in non-blocking MPI calls (e.g. MPI_Ibcast()), set the following parameter:

-x HCOLL_ENABLE_NBC=1

Note that enabling non-blocking MPI collectives will disable multicast acceleration in blocking MPI collectives. The supported non-blocking MPI collectives are listed below, followed by a usage example:

  • MPI_Ibarrier
  • MPI_Ibcast
  • MPI_Iallgather
  • MPI_Iallreduce (4b, 8b, SUM, MIN, PROD, AND, OR, LAND, LOR)
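
For example, the following command line (a sketch based on the options shown earlier in this section; the device name, port, and process count are placeholders) enables hardware-offloaded non-blocking collectives together with HCOLL:

% mpirun -np 32 -mca coll_hcoll_enable 1 -x HCOLL_ENABLE_NBC=1 -x HCOLL_MAIN_IB=<device_name>:<port_num> ./a.out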

Enabling Multicast Accelerated Collectives

FCA v4.2, like its 2.x predecessor, uses hardware multicast to accelerate certain collective operations. In order to take full advantage of this unique capability, you must first have IPoIB configured on every adapter card/port pair that collective message traffic flows through.

Configuring IPoIB

To configure IPoIB, you need to define an IP address on the IB interface.

  1. Use /usr/bin/ibdev2netdev to show all IB interfaces.

    hpchead ~ >ibdev2netdev
    mlx4_0 port 1 ==> ib0 (Down)
    mlx4_0 port 2 ==> ib1 (Down)
    mlx5_0 port 1 ==> ib2 (Down)
    mlx5_0 port 2 ==> ib3 (Down)
  2. Use /sbin/ifconfig to get the address information for a specific interface (e.g. ib0).

    hpchead ~ >ifconfig ib0
    ifconfig uses the ioctl access method to get the full address information, which limits
    hardware addresses to 8 bytes. Since InfiniBand address has 20 bytes, only the first 8
    bytes are displayed correctly.
    Ifconfig is obsolete! For replacement check ip.
    ib0      Link encap:InfiniBand HWaddr
             A0:04:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
             inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0
             BROADCAST MULTICAST MTU:2044 Metric:1
             RX packets:58 errors:0 dropped:0 overruns:0 frame:0
             TX packets:1332 errors:0 dropped:0 overruns:0 carrier:0
             collisions:0 txqueuelen:1024
             RX bytes:3248 (3.1 KiB) TX bytes:80016 (78.1 KiB)

    Or you can use /sbin/ip for the same purpose

    hpchead ~ >ip addr show ib0
    4: ib0: <BROADCAST,MULTICAST> mtu 2044 qdisc mq state DOWN qlen 1024
       link/infiniband a0:04:02:20:fe:80:00:00:00:00:00:00:00:02:c9:03:00:21:f9:31 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
       inet 192.168.1.1/24 brd 192.168.1.255 scope global ib0

In the example above, an IP address is defined (192.168.1.1). If it is not defined, you can define an IP address now, for example as shown below.
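
For example, assuming the interface and address from the output above (both are site-specific, and root privileges are required), an address can be assigned with the standard iproute2 tools:

# ip addr add 192.168.1.1/24 dev ib0
# ip link set ib0 up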

Enabling Mellanox SHARP Software Accelerated Collectives

As of v1.7, HPC-X supports Mellanox SHARP Software Accelerated Collectives. These collectives are enabled by default if FCA (HCOLL) v3.5 and above detects that it is running in a supported environment.

To enable Mellanox SHARP acceleration:

-x HCOLL_ENABLE_SHARP=1

To disable Mellanox SHARP acceleration:

-x HCOLL_ENABLE_SHARP=0


To change the Mellanox SHARP message threshold:

-x HCOLL_BCOL_P2P_ALLREDUCE_SHARP_MAX=<threshold> (default: 256)

This threshold is the maximum allreduce message size that runs through SHARP. Messages with a size greater than this threshold will fall back to non-SHARP-based algorithms (multicast-based or non-multicast-based).

To use Mellanox SHARP non-blocking interface:

-x HCOLL_ENABLE_SHARP_NONBLOCKING=1

For instructions on how to deploy Mellanox SHARP software in InfiniBand fabric, see Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Deployment Guide.

Once Mellanox SHARP software is deployed, you only need to specify the HCA device (device_name) and port number (port_num) that are connected to the Mellanox SHARP software tree, in the following way:

-x HCOLL_MAIN_IB=<device_name>:<port_num>
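
For example, a launch that enables both HCOLL and Mellanox SHARP acceleration might look as follows (a sketch combining the options above; the device name, port, and process count are placeholders):

% mpirun -np 128 -mca coll_hcoll_enable 1 -x HCOLL_ENABLE_SHARP=1 -x HCOLL_MAIN_IB=<device_name>:<port_num> ./a.out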



Configuring NVIDIA® CUDA® GPU Support - HCOLL

Collective operations with CUDA memory are enabled in HCOLL using NVIDIA's NCCL collective communication library. HCOLL CUDA support is provided in HPC-X through a CUDA-enabled HCOLL build. To enable collective operations with CUDA buffers, LD_PRELOAD the CUDA-enabled libhcoll.so library.

To select an HCOLL CUDA topology:

-x HCOLL_CUDA_SBGP=p2p -x HCOLL_CUDA_BCOL=nccl


To tune the maximum message size threshold for the HCOLL staging scheme with CUDA buffers:

-x HCOLL_CUDA_STAGING_MAX_THRESHOLD=262144
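
For example, a launch that enables the NCCL-based CUDA path with the staging threshold above might look as follows (a sketch combining the variables in this section; the process count and application name are placeholders):

% mpirun -np 8 -mca coll_hcoll_enable 1 -x HCOLL_CUDA_SBGP=p2p -x HCOLL_CUDA_BCOL=nccl -x HCOLL_CUDA_STAGING_MAX_THRESHOLD=262144 ./a.out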

For further information on CUDA support in HPC-X, please refer to CUDA GPU.

Limitations

  • HCOLL, as of the v4.1 release, does not fully support mixed MPI datatypes.

In this context, mixed datatypes refers to collective operations where the datatype layout of input and output buffers may be different on different ranks. For example:

For an arbitrary MPI collective operation executed on rank i:

MPI_Collective_op(input, count1, datatype-in_i, output, count2, datatype-out_i, communicator)

where i = 0, ..., (number_of_mpi_processes - 1).

Mixed mode means that for ranks i ≠ j, (datatype-in_i, datatype-out_i) is not necessarily equal to (datatype-in_j, datatype-out_j).

Mixed MPI datatypes, in general, can prevent protocol consensus inside HCOLL, resulting in hangs. However, because HCOLL contains a datatype engine with packing and unpacking flows built into the collective algorithms, mixed MPI datatypes will work under the following scenarios:

    • If the packed length of the data (a value all ranks must agree upon regardless of datatype) can fit inside a single HCOLL buffer (the default is (64Kbytes - header_space)), then mixed datatypes will work.
    • If the packed length of count*datatype is bigger than an internal HCOLL buffer, then HCOLL will need to fragment the message. If the datatypes and counts are defined on each rank so that all ranks agree on the number of fragments needed to complete the operation, then mixed datatypes will work. However, because the datatype engine cannot split a fragment across primitive types and padding, the ranks may not agree on the number of fragments required to process the operation. When this happens, HCOLL will hang, with some ranks expecting incoming fragments and others believing the operation is complete.
  • The environment variable HCOLL_ALLREDUCE_ZCOPY_TUNE=<static/dynamic> (default: dynamic) selects the level of automatic runtime tuning of HCOLL's large-data allreduce algorithm. “Static” means that no tuning is applied at runtime; “dynamic” allows HCOLL to dynamically adjust the algorithm's radix and zero-copy threshold selection based on runtime sampling of performance.

Note: The “dynamic” mode should not be used in cases where numerical reproducibility is required, as this mode may result in a variation of the floating point reduction result from one run to another due to non-fixed reduction order.
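
For example, to force the static mode when bitwise-reproducible reduction results are required across runs (a minimal usage sketch; the rest of the command line is unchanged):

% mpirun ... -x HCOLL_ALLREDUCE_ZCOPY_TUNE=static ./a.out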