NVIDIA HPC-X Software Toolkit Rev 2.11

Release Notes Change Log History

Category

Change

Rev 2.10

UCX

Added support for atomics on GPU memory target

OpenSHMEM

Added support for reducing memory overhead at scale

Rev 2.9

UCX Configuration File

UCX can now apply configuration variables that the user sets in the /etc/ucx/ucx.conf file.

For further information see UCX Configuration File.
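As a rough illustration only, the Python sketch below collects variables from such a file. It assumes the file holds one NAME=value assignment per line and that # starts a comment; this layout and the helper name read_ucx_conf are assumptions made for the example, so check the UCX Configuration File section for the actual syntax.

```python
from pathlib import Path


def read_ucx_conf(path="/etc/ucx/ucx.conf"):
    """Return {name: value} for NAME=value lines (assumed file layout)."""
    settings = {}
    conf = Path(path)
    if not conf.is_file():
        # No configuration file present: nothing to apply.
        return settings
    for raw in conf.read_text().splitlines():
        # Drop anything after '#' (assumed comment marker) and trim blanks.
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    for name, value in sorted(read_ucx_conf().items()):
        print(f"{name}={value}")
```

Run on a host without the file, the script prints nothing, which makes it easy to confirm whether any file-based settings are in play.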

Instrumentation and Monitoring FUSE-based Tool

This new functionality enables the user to analyze UCX-based applications at runtime. The tool is based on the Filesystem in Userspace (FUSE) interface. If the feature is enabled, a directory is created in /tmp/ucx for each process that uses UCX.

For further information see Instrumentation and Monitoring FUSE-based Tool.
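A minimal Python sketch for checking which per-process directories currently exist under /tmp/ucx. The helper name list_ucx_process_dirs is hypothetical, and the sketch makes no assumption about how the directories are named or what they contain; it only lists them.

```python
from pathlib import Path


def list_ucx_process_dirs(root="/tmp/ucx"):
    """Return the sorted names of the directories found directly under root."""
    base = Path(root)
    if not base.is_dir():
        # Feature disabled, or no UCX process is running on this host.
        return []
    return sorted(entry.name for entry in base.iterdir() if entry.is_dir())


if __name__ == "__main__":
    for name in list_ucx_process_dirs():
        print(name)
```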

OS Architecture

HPC-X v2.9 onwards will no longer support PPC architecture in its releases.

Bug Fixes

Bug Fixes

Rev 2.8

HPC-X Content

Updated the following communication libraries and acceleration packages versions:

  • Open MPI version 4.1.x

  • Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 2.4.x

  • HCOLL version 4.7

  • UCX version 1.10

  • ClusterKit version 1.3

  • nccl-rdma-sharp-plugin version 2.1

UCX

Added support for Multi-interface for cloud (client-server) applications.

Added support for using Adaptive-Routing (out-of-order) on an SL that supports it.

Added support for UCP Active-Messages API with Rendezvous.

Added support for Keepalive functionality on the UCT layer.

Performed several error handling enhancements.

Added support for GPU-NIC locality discovery.

NCCL-RDMA-SHARP-PLUGIN

Added support for NCCL Plugin API v4.

Added support for PCIe Relaxed Ordering.

Added support for Adaptive Routing.

Rev 2.7

UCX

Added a new request API. For further information on this request API, please refer to the UCX API documentation.

Added support for PCIe Relaxed Ordering.

Added out-of-box support for RoCE LAG.

Added Flow Control support for RDMA Read operations.

AMD Rome optimizations: Optimized IB connection establishment procedures to reduce system noise.

Rev 2.6

HPC-X Content

Updated the following communications libraries and acceleration packages versions:

  • Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 2.1.0

  • HCOLL version 4.5

  • UCX version 1.9

UCX

Added support in UCX for communication between containers configured to share the memory namespaces.

Added strided Receive queue support for hardware tag matching.

Made the following performance improvements on AMD EPYC servers:

  • 8-16 KB messages: Improved latency by up to 6.4%, bandwidth by up to 20%, and bidirectional bandwidth by up to 96%.

  • IMB/multiPingPong and osu_mbw_mr for messages up to 32B on full ppn on MLNX_OFED 5.0.
    Note: To benefit from this optimization, enable hardware tag matching by setting UCX_RC_TM_ENABLE=y.
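One way to apply UCX_RC_TM_ENABLE=y from a launcher script is to add it to the environment handed to the job. The Python sketch below assumes the variable is read from the process environment; the helper name env_with_hw_tag_matching is hypothetical, and the command that would consume the environment is left to the caller.

```python
import os


def env_with_hw_tag_matching(base_env=None):
    """Return a copy of the environment with UCX_RC_TM_ENABLE set to 'y'."""
    env = dict(os.environ if base_env is None else base_env)
    env["UCX_RC_TM_ENABLE"] = "y"
    return env


if __name__ == "__main__":
    env = env_with_hw_tag_matching()
    # Pass env to the job launcher, e.g. subprocess.run(cmd, env=env),
    # where cmd is whatever command starts the benchmark on your system.
    print("UCX_RC_TM_ENABLE=" + env["UCX_RC_TM_ENABLE"])
```

Copying the environment instead of mutating os.environ keeps the setting scoped to the launched job.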

Added support for multithreaded memory regions (MR) in OpenSHMEM (OSHMEM) applications to improve job startup and teardown latencies.

The multithreaded MR enables a more efficient use of the CPU resource during registration of memory regions larger than 4GB.

CUDA

Removed CUDA support in SLES 11 and RHEL 6 OSs.

Rev 2.5

HPC-X Content

Updated the following communications libraries and acceleration packages versions:

  • Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 2.0

  • HCOLL version 4.4

  • UCX version 1.7

Removed the CUDA init script (hpcx-init-cuda.sh) and the environmental module (modules/hpcx-cuda) from HPC-X.

Up until HPC-X v2.4, these files used to point to the default files hpcx-init.sh and modules/hpcx. Now, these CUDA files no longer exist, and users can only use the default init script and environmental module for enabling CUDA support.

CUDA

Unified Vanilla and CUDA environments. CUDA v10.0 is supported out of the box with standard init script or environmental module.

Note: HPC-X is compiled against CUDA version 10.0, which does not support GCC versions newer than v8. Therefore, HPC-X built on systems with GCC versions above v8 will not have CUDA support.

UCX

Made performance optimizations.

Added full support for rdma-core.

Added support for CUDA v10.1.

Rev 2.4

HPC-X Content

Updated the following communications libraries and acceleration packages versions:

  • Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.8

  • HCOLL version 4.3

  • UCX version 1.6

Removed rc.local_mellanox script. HPC-X became more stable and this script is no longer required.

CUDA

Unified Vanilla and CUDA environments. CUDA v9 is supported out of the box with standard init script or environmental module.

Note: HPC-X is compiled against CUDA version 9, which does not support GCC versions newer than v7. Therefore, HPC-X built on systems with GCC versions above v7 will not have CUDA support.

UCX

Enabled HDR, SocketDirect and MultiRail features out-of-box.

UCX Random DCI is now at GA level.

Implemented a number of job startup optimizations.

Added support for the PCIe atomic operations feature.

HCOLL

Added support for performing 16-bit floating-point operations for machine learning scenarios.

OpenMPI

Added multithreading support to OpenMPI OSC UCX.

General

HPC-X is now available through the EasyBuild framework: https://easybuild.readthedocs.io/en/latest/

Rev 2.3

HPC-X Content

Updated the following communications libraries and acceleration packages versions:

  • Open MPI version 4.0.x

  • Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.7.2

  • HCOLL version 4.2

  • UCX version 1.5

  • OpenSHMEM version 1.4

UCX

UCX is now compiled without JAVA bindings.

Added support for running UCX over rdma-core, for DC transport and direct verbs.

Emulation layer: Added the ability to run UCX over software emulation of remote memory access and atomic operations. This provides full support of SHMEM and MPI-RMA over shared memory, TCP, and older RDMA hardware, such as ConnectX-3 HCA.

HCOLL

HCOLL and Mellanox SHARP are now compiled with CUDA support.

Added support for CUDA buffers over SRA allreduce algorithm.

MXM

Removed support for MXM library.

OpenMPI

Added the following configuration options to OMPI:

  • --with-libevent=internal

  • --enable-mpi1-compatibility

Updated the configuration file platform/mellanox/optimized config in OMPI upstream by removing BTL OpenIB and UCT support and removing links to MXM/FCA usage.

Removed PMI2 support.

Rev 2.2

HPC-X Content

Updated the following communications libraries and acceleration packages versions:

• Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.7

• HCOLL version 4.1

• UCX version 1.4

Added support for Singularity containerization.
For further information, please refer to HPC-X User Manual.

“osc ucx” is no longer the default one-sided-component in OpenMPI.

Removed KNEM library from HPC-X package. UCX will use the KNEM available in MLNX_OFED.

MXM Support

Open MPI and HCOLL are not compiled with MXM anymore. Both are compiled with UCX only and use it by default.

UCX

Added support for the following UCX features:

• New API for establishing client-server connection.

• Out-of-box support for Memory In Chip (MEMIC) on ConnectX-5 HCAs.

HPC-X Setup

Added support for HPC-X to work on Huawei ARM architecture.

HCOLL

Improved performance by utilizing zero-copy messaging for MPI Bcast.

Rev 2.1

HPC-X Content

Updated the following communications libraries and acceleration packages versions:

• Open MPI version 3.1.x

• Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.5

• HCOLL version 4.0

• MXM version 3.7

• UCX version 1.3

• OpenSHMEM v1.3 specification compliant

UCX

• UCX is now the default pml layer for Open MPI, default spml layer for OpenSHMEM, and default OSC component for MPI RMA.

• Added the following UCX features:

  • Added support for GPU memory in UCX communication libraries

  • Added support for Multi-Rail protocol

MXM

The UD_RNDV_ZCOPY parameter is set to ‘no’ by default. This means that the zcopy mechanism for the UD transport is disabled when using the Rendezvous protocol.

HCOLL

• UCX is now the default p2p transport in HCOLL

• Improved multi-threaded performance

• Improved shared memory performance

• Added support for Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) v1.5

• Added support for Mellanox SHARP software multi-channel/multi-rail capable algorithms

• Improved Allreduce large message algorithm

• Improved AlltoAll algorithm

Profiling IB verbs API (ibprof)

Removed ibprof tool from HPC-X toolkit.

UPC

Removed UPC from HPC-X toolkit.

Rev 2.0

HPC-X Content

Updated the following communications libraries and acceleration packages versions:

• OpenMPI version 3.0.0

• Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.4

• HCOLL version 3.9

• UCX version 1.3

UCX

• UCX is now at GA level.

• Added the following UCX features:

[ConnectX-5 only] Added support for hardware Tag Matching with DC transport.

[ConnectX-5 only] Added support for Out-of-order RDMA RC and DC to support adaptive routing with true RDMA.

• Added UCX datatypes - community approved datatype support.

• Added UCX support to Inbox RHEL.

• Added GPU Direct RDMA support.

• Hardware Tag Matching (See section Hardware Tag Matching in the User Manual)

• SR-IOV Support (See section SR-IOV Support in the User Manual)

• Adaptive Routing (AR) (See section Adaptive Routing in the User Manual)

• Error Handling (See section Error Handling in the User Manual)

HCOLL

• Added support for Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) v1.4

• Added support for NCCL on-host GPU based collectives.

• Added support for Hierarchical GPU based allreduce using NCCL for scale-in and MXM/UCX for scale-out.

• Improved shared memory performance for allreduce, barrier, and broadcast, targeting high thread count systems, e.g. Power9.

• Improved large message allreduce (multi-radix, zero-copy fragmentation, CPU vectorization).

• Added new and improved AlltoAllv algorithm - hybrid logarithmic pair-wise exchange.

• Added support for on-demand HCOLL memory. This improves HCOLL's memory footprint on high thread count systems, e.g. Power9.

• Added a high performance multithreaded implementation to support MPI_THREAD_MULTIPLE applications, designed specifically for high thread count systems, e.g. Power9.

• HCOLL startup improvements.

Open MPI / OpenSHMEM

• Added support for Open MPI 3.0.0.

• Added support for xpmem kernel module.

• Added a high performance implementation of shmem_ptr() with UCX SPML.

• Added a UCX allocator. The UCX allocator optimizes intra-node communication by allowing direct access to memories of processes on the same node. The UCX allocator can only be used with the UCX SPML.

• Added a UCX one-sided component to support MPI RMA operations.

Rev 1.9.7

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

Bug Fixes, see Section 4, “Bug Fixes History”, on page 11

Rev 1.9

HPC-X Content

Updated the following communications libraries and acceleration packages versions:

• OpenMPI version 2.1.2a1

• Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.3.1

• HCOLL version 3.8.1652

• MXM version 3.6.3103

• UCX version 1.2.2947

UCX

Point-to-point communication API, with tag matching, remote memory access, and atomic operations.

This can be used to implement MPI, PGAS, and Big Data libraries and applications- IB transport

A cleaner API with lower software overhead which provides better performance especially for small messages.

Support for a multitude of InfiniBand transports and Mellanox offloads to optimize data transfer performance:

• RDMA

• DC

• Out-of-order

• HW tag matching offload

• Registration cache

• ODP

Shared memory communications for optimal intra-node data transfer:

• SysV

• posix

• knem

• CMA

• xpmem

MXM

Enabled Adaptive Routing for all the transport layers (UD/RC/DC).

Memory registration optimization.

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

Improved the Out-of-the-box performance of Scalable Hierarchical Aggregation and Reduction Protocol (SHARP).

Shared memory

Improved the intranode performance of allreduce and barrier.

Configuration

Changed many default parameter settings to achieve the best out-of-the-box experience for several applications, including CP2K, miniDFT, VASP, DL-POLY, Amber, Fluent, GAMES-UK, and LS-DYNA.

FCA

As of HPC-X v1.9, FCA v2.5 is no longer included in the HPC-X package.

Improved AlltoAllv algorithm.

Improved large data allreduce.

Improved UCX BCOL.

OS architecture

Added support for ARM architecture.

Rev 1.8.2

MXM

Updated MXM version to 3.6.2098 which includes memory registration optimization.

Rev 1.8

Cross Channel (CC)

Added Cross Channel (CC) AlltoAllv

Added CC zcpy Ring Bcast

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

Added Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) non-blocking collectives

Shared memory POWER

Added shared memory POWER optimizations for allreduce

Added shared memory POWER optimizations for Barrier

Mixed data types

Added support for mixed data types

Non-contiguous Bcast

Added support for non-contiguous Bcast with UMR or SGE in CC

UMR

Added UMR support in CC bcol

Unified Communication - X Framework (UCX)

A new acceleration library, integrated into Open MPI (as a pml layer) and available as part of HPC-X. It is an open source communication library designed to achieve the highest performance for HPC applications.

HPC-X Content

Updated the following communications libraries and acceleration packages versions:

• HCOLL updated to v3.7

• Open MPI updated to v2.10

FCA

FCA 2.x is no longer the default FCA used in HPC-X.

As of HPC-X v1.8, FCA 3.x (HCOLL) is the default FCA used and it replaces FCA v2.x.

Bug Fixes

See Section 4, “Bug Fixes History”, on page 11

Rev 1.7

MXM

Updated MXM version to 3.6

FCA Collective

Added Cross-Channel based Allgather, Bcast, 8-byte Allreduce.

FCA

Added MPI datatype support.

Added optimizations for PPC platforms.

Added support for multiple Mellanox SHARP technology leaders on a single host.

Added support for collecting Mellanox SHARP technology usage statistics.

Exposed cross-channel non-blocking collectives to the MPI level.

Rev 1.6

MXM v3.5

See Section 5.3, “MXM Change Log History”, on page 23

IB-Router

Allows hosts that are located on different IB subnets to communicate with each other. This support is currently available when using the 'openib btl' in Open MPI.

Note: When using 'openib btl', RoCE and IB router are mutually exclusive. The Open MPI inside HPC-X 1.6 is not compiled with ib-router support, therefore it supports RoCE out-of-the-box.

FCA v3.5

See Section 5.2, “FCA Change Log History”, on page 21

Rev 1.5

HPC-X Content

Updated the following communications libraries and acceleration packages versions:

• Open MPI updated to v1.10

• UPC updated to 2.22.0

• MXM updated to v3.4.369

• FCA updated to v3.4.799

MXM v3.4.369

See Section 5.3, “MXM Change Log History”, on page 23

FCA v3.4.799

See Section 5.2, “FCA Change Log History”, on page 21

Rev 1.4

FCA v3.3

See Section 5.2, “FCA Change Log History”, on page 21

MXM v3.4

See Section 5.3, “MXM Change Log History”, on page 23

Rev 1.3

MLNX_OFED

Added support for OFED Inbox drivers

CPU Architecture

Added support for PPC architecture

LID Mask Control (LMC)

Added support for multiple LIDs usage when the LMC in the fabric is higher than zero. MXM will use multiple LIDs to distribute traffic across multiple links and achieve better resource utilization.
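For context, an InfiniBand port with LID Mask Control value LMC answers to 2^LMC consecutive LIDs starting at its base LID, whose low LMC bits are zero. The Python sketch below only enumerates that range; the function name port_lids is hypothetical, and how MXM chooses among the LIDs is not described here and is not modeled.

```python
def port_lids(base_lid, lmc):
    """Return the LIDs a port answers to for a given base LID and LMC."""
    if not 0 <= lmc <= 7:
        # LMC is a 3-bit field, so valid values are 0 through 7.
        raise ValueError("LMC must be between 0 and 7")
    count = 1 << lmc  # 2**LMC LIDs per port
    if base_lid % count != 0:
        raise ValueError("base LID must be aligned to 2**LMC")
    return list(range(base_lid, base_lid + count))


if __name__ == "__main__":
    # With LMC = 2, a port with base LID 8 answers to four LIDs.
    print(port_lids(8, 2))
```

With LMC set to zero the list holds only the base LID, which corresponds to the single-path case.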

Performance

Performance improvements for all transport layers.

Adaptive Routing

Enhanced support for Adaptive Routing for the UD transport layer.

For further information, please refer to the HPC-X User Manual section “Adaptive Routing for UD Transport”.

UD zero copy

UD zero copy support on receiver side to achieve better bandwidth utilization and reduce CPU usage.

Category

Change

Rev 3.5

FCA Collective

Added MPI Allgatherv and MPI reduce

FCA

Added support for Mellanox SHARP library (including SHARP allreduce, reduce and barrier)

Enhanced scalability for CORE-Direct based collectives

Added support for complex data types

Rev 3.4

General

UCX support

Communicator caching scheme with eviction: improves jobstart and communicator creation time

Collectives

Collectives: Added Alltoallv and Alltoall small message algorithms.

Rev 3.3

General

Ported to PowerPC

Thread safety added

Collectives

Improved large message allreduce algorithm (Enabled by default)

Beta version of network topology awareness (Enabled by default)

Rev 3.0

Collectives

Offload collectives communication from MPI process onto Mellanox interconnect hardware

Efficient collectives communication flow optimized to job and topology

MPI collectives

Significantly reduce MPI collectives runtime

MPI-3

Native support for MPI-3

Blocking and Non-blocking collectives

Support for blocking and nonblocking collectives

HCOLL

Supports hierarchical communication algorithms (HCOLL)

Collective algorithm

Supports multiple optimizations within a single collective algorithm

Performance

Increase CPU availability and efficiency for increased application performance

MPI libraries

Seamless integration with MPI libraries and job schedulers

Rev 2.5

Multicast Group

Added MCG (Multicast Group) cleanup tool

Performance

Performance improvements

Rev 2.2

Performance

Performance improvements

Dynamic offloading rules

Enabled dynamic offloading rules configuration based on the data type and reduce operations

Mixed MTU

Added support for mixed MTU

Rev 2.1.1

AMD/Interlagos CPUs

Added support for AMD/Interlagos CPUs

Rev 2.1

Core-Direct®

Added support for Mellanox Core-Direct® technology (enables offloading collective operations to the HCA).

Non-contiguous data layouts

Added support for non-contiguous data layouts

PGI compilers

Added support for PGI compilers

Category

Change

Rev 2.2

Performance

Added Sandy Bridge performance optimizations.

memheap

Allocated memheap using contiguous memory provided by the HCA.

ptmalloc allocator

Replaced the buddy memheap by the ptmalloc allocator.

multiple pSync arrays

Added the option of using multiple pSync arrays instead of barrier synchronization between collective routines (fcollect, reduction routines)

spml yoda

Optimized small size puts

Performance

Performance optimization

Memory footprint optimizations

Added memory footprint optimizations

Rev 1.8.2

Acceleration Packages

Added support for new MXM, FCA, HCOLL versions

Job start optimization

Added job start optimization

Performance

Performance improvements

© Copyright 2023, NVIDIA. Last updated on May 23, 2023.