Category	Change
Rev 2.16
HPC-X Content	Updated the HPC-X Content section to reflect the communication library versions embedded in this HPC-X release: • NVIDIA SHARP v3.4 • UCC v1.3 • ClusterKit v1.10 • nccl-rdma-sharp-plugin v2.4 • NCCL v2.18 • XPMEM v2.7
Supported Cards	Added support for BlueField-3 cards.
Bug Fixes	See Bug Fixes.
Rev 2.15
HPC-X Content	Updated the HPC-X Content section to reflect the communication library versions embedded in this HPC-X release: • NVIDIA SHARP v3.3 • UCX v1.15 • ClusterKit v1.9 • nccl-rdma-sharp-plugin v2.3 • GDRCopy v2.3 • NCCL v2.17.1-1 • CUDA v12.1
Bug Fixes	See Bug Fixes.
Rev 2.14
TL/UCP Special Service Worker	Added support for using a separate UCX UCP worker for UCC service collectives. For further information, please see the TL/UCP Special Service Worker section.
Data Type Support in CUDA Executor Component (EC)	Added out-of-box support for all datatypes and reduction operations for UCC collectives on GPUs. For further information, please see the Data Type Support in CUDA Executor Component section.
EC/CUDA One-shot Kernel with Cooperative Launch	Added support for using a single CUDA kernel for CUDA operations in UCC GPU collectives. For further information, please see the EC/CUDA One-shot Kernel with Cooperative Launch section.
Out-Of-Box Native GPU Allreduce	Added support for the UCC library to detect the NVIDIA NVLink topology and select the best GPU-based algorithms for supported collectives (Allgather/v, Reducescatter/v). For further information, please see the Out-Of-Box Native GPU Allreduce section.
Bug Fixes	See Bug Fixes.
Rev 2.13.1 LTS
Operating System	Added support for Ubuntu v20.04 and v20.10.
Rev 2.13
HPC-X Content	Updated the HPC-X Content section to reflect the communication library versions embedded in this HPC-X release: • NVIDIA SHARP v3.1 • HCOLL v4.8 • UCC v1.2 • ClusterKit v1.8 • nccl-rdma-sharp-plugin v2.2
NCCL-RDMA-SHARP-PLUGIN	Added support for NCCL plugin API v5.
SHARP	Added support for SHARP on NDR.
Bug Fixes	See Bug Fixes section.
Rev 2.12
UCX	Added a method to set the RoCE ECE value from the UCX configuration. For example, UCX_IB_ECE=auto uses the maximal ECE value, and UCX_IB_ECE=<value> uses a specific numeric ECE value.
HPC-X Content	Updated the version of the UCX communication library to v1.14.
Rev 2.11
Adapter Cards	Added support for the NVIDIA ConnectX-7 adapter card with 400 Gb/s speed.
SHARPD	The sharpd daemon process has been removed. sharpd-related activity is now performed from the user application process.
HPC-X Content	Updated the versions of the following communication libraries: • UCX version 1.13 • ClusterKit version 1.6
HPC-X Content	Added support for UCC, a collective communication operations API and library in HPC-X. UCC is now part of the HPC-X package. For further information on UCC, please see the Unified Collective Communication (UCC) section.
Rev 2.10
UCX	Added support for atomics on a GPU memory target.
OpenSHMEM	Added support for reducing memory overhead at scale.
Rev 2.9
UCX Configuration File	The UCX configuration file enables the user to apply configuration variables set in the /etc/ucx/ucx.conf file. For further information, see UCX Configuration File.
Instrumentation and Monitoring FUSE-based Tool	This new functionality enables the user to analyze UCX-based applications at runtime. The tool is based on the Filesystem in Userspace (FUSE) interface. If the feature is enabled, a directory for each process using UCX is created in /tmp/ucx. For further information, see Instrumentation and Monitoring FUSE-based Tool.
OS Architecture	From HPC-X v2.9 onwards, the PPC architecture is no longer supported.
Bug Fixes	Bug Fixes in this Version
Rev 2.8
HPC-X Content	Updated the following communication libraries and acceleration packages versions: • Open MPI version 4.1.x • NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 2.4.x • HCOLL version 4.7 • UCX version 1.10 • ClusterKit version 1.3 • nccl-rdma-sharp-plugin version 2.1
UCX	Added support for Multi-interface for cloud (client-server) applications.
	Added support for using Adaptive-Routing (out-of-order) on an SL that supports it.
	Added support for UCP Active-Messages API with Rendezvous.
	Added support for Keepalive functionality on the UCT layer.
	Performed several error handling enhancements.
	Added support for GPU-NIC locality discovery.
NCCL-RDMA-SHARP-PLUGIN	Added support for NCCL Plugin API v4.
	Added support for PCIe Relaxed Ordering.
	Added support for Adaptive Routing.
Rev 2.7
UCX	Added a new request API. For further information on this request API, please refer to UCX API documentation.
	Added support for PCIe Relaxed Ordering.
	Added out-of-box support for RoCE LAG.
	Added Flow Control support for RDMA Read operations.
	AMD Rome optimizations: Optimized IB connection establishment procedures to reduce system noise.
Rev 2.6
HPC-X Content	Updated the following communications libraries and acceleration packages versions: • NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 2.1.0 • HCOLL version 4.5 • UCX version 1.9
UCX	Added support in UCX for communication between containers configured to share the memory namespaces.
	Added strided Receive queue support for hardware tag matching.
	Made the following performance improvements on AMD EPYC servers: • 8-16 KB messages: Improved latency by up to 6.4%, bandwidth by up to 20%, and bidirectional bandwidth by up to 96%. • IMB/multiPingPong and osu_mbw_mr for messages up to 32B on full ppn on MLNX_OFED 5.0. Note: To benefit from this performance optimization, enable hardware tag matching by setting `UCX_RC_TM_ENABLE=y`.
	Added support for multithreaded memory regions (MR) in OpenSHMEM (OSHMEM) applications to improve job startup and teardown latencies. The multithreaded MR enables more efficient use of CPU resources during registration of memory regions larger than 4GB.
CUDA	Removed CUDA support in SLES 11 and RHEL 6 OSs.
Rev 2.5
HPC-X Content	Updated the following communications libraries and acceleration packages versions: • NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 2.0 • HCOLL version 4.4 • UCX version 1.7
HPC-X Content	Removed the CUDA init script (hpcx-init-cuda.sh) and environment module (modules/hpcx-cuda) from HPC-X. Up until HPC-X v2.4, these files pointed to the default files hpcx-init.sh and modules/hpcx. These CUDA files no longer exist, and users can only use the default init script and environment module to enable CUDA support.
CUDA	Unified Vanilla and CUDA environments. CUDA v10.0 is supported out of the box with the standard init script or environment module. Note: HPC-X is compiled against CUDA version 10.0, which does not support GCC versions newer than v8. Therefore, HPC-X built on systems with GCC versions above v8 will not have CUDA support.
UCX	Made performance optimizations.
	Added full support for rdma-core.
	Added support for CUDA v10.1.
Rev 2.4
HPC-X Content	Updated the following communications libraries and acceleration packages versions: • NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.8 • HCOLL version 4.3 • UCX version 1.6
HPC-X Content	Removed `rc.local_mellanox` script. HPC-X became more stable and this script is no longer required.
CUDA	Unified Vanilla and CUDA environments. CUDA v9 is supported out of the box with the standard init script or environment module. Note: HPC-X is compiled against CUDA version 9, which does not support GCC versions newer than v7. Therefore, HPC-X built on systems with GCC versions above v7 will not have CUDA support.
UCX	Enabled HDR, SocketDirect and MultiRail features out-of-box.
	UCX Random DCI is now at GA level.
	Implemented a number of job startup optimizations.
	Added support for the PCIe atomic operations feature.
HCOLL	Added support for performing 16-bit floating-point operations for machine learning scenarios.
OpenMPI	Added multithreading support to the Open MPI OSC UCX component.
General	HPC-X is now available through the EasyBuild framework: https://easybuild.readthedocs.io/en/latest/
Rev 2.3
HPC-X Content	Updated the following communications libraries and acceleration packages versions: • Open MPI version 4.0.x • NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.7.2 • HCOLL version 4.2 • UCX version 1.5 • OpenSHMEM version 1.4
UCX	UCX is now compiled without JAVA bindings.
	Added support for running UCX over rdma-core, for DC transport and direct verbs.
	Emulation layer: Added the ability to run UCX over software emulation of remote memory access and atomic operations. This provides full support of SHMEM and MPI-RMA over shared memory, TCP, and older RDMA hardware, such as ConnectX-3 HCA.
HCOLL	HCOLL and NVIDIA SHARP are now compiled with CUDA support.
HCOLL	Added support for CUDA buffers over SRA allreduce algorithm.
MXM	Removed support for MXM library.
OpenMPI	Added the following configuration options to OMPI: `--with-libevent=internal` and `--enable-mpi1-compatibility`.
	Updated the configuration file platform/mellanox/optimized config in OMPI upstream by removing BTL OpenIB and UCT support and removing links to MXM/FCA usage.
	Removed PMI2 support.
Rev 2.2
HPC-X Content	Updated the following communications libraries and acceleration packages versions: • NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.7 • HCOLL version 4.1 • UCX version 1.4
	Added support for Singularity containerization. For further information, please refer to the HPC-X User Manual.
	“osc ucx” is no longer the default one-sided component in OpenMPI.
	Removed KNEM library from HPC-X package. UCX will use the KNEM available in MLNX_OFED.
MXM Support	Open MPI and HCOLL are not compiled with MXM anymore. Both are compiled with UCX only and use it by default.
UCX	Added support for the following UCX features: • New API for establishing client-server connection. • Out-of-box support for Memory In Chip (MEMIC) on ConnectX-5 HCAs.
HPC-X Setup	Added support for HPC-X to work on Huawei ARM architecture.
HCOLL	Improved performance by utilizing zero-copy messaging for MPI Bcast.
Rev 2.1
HPC-X Content	Updated the following communications libraries and acceleration packages versions: • Open MPI version 3.1.x • NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.5 • HCOLL version 4.0 • MXM version 3.7 • UCX version 1.3 • OpenSHMEM v1.3 specification compliant
UCX	• UCX is now the default pml layer for Open MPI, default spml layer for OpenSHMEM, and default OSC component for MPI RMA. • Added the following UCX features: • Added support for GPU memory in UCX communication libraries • Added support for Multi-Rail protocol
MXM	The UD_RNDV_ZCOPY parameter is set to ‘no’ by default. This means that the zcopy mechanism for the UD transport is disabled when using the Rendezvous protocol.
HCOLL	• UCX is now the default p2p transport in HCOLL • Improved multi-threaded performance • Improved shared memory performance • Added support for NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) v1.5 • Added support for NVIDIA SHARP software multi-channel/multi-rail capable algorithms • Improved Allreduce large message algorithm • Improved AlltoAll algorithm
Profiling IB verbs API (ibprof)	Removed ibprof tool from HPC-X toolkit.
UPC	Removed UPC from HPC-X toolkit.
Rev 2.0
HPC-X Content	Updated the following communications libraries and acceleration packages versions: • OpenMPI version 3.0.0 • Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.4 • HCOLL version 3.9 • UCX version 1.3
UCX	• UCX is now at GA level. • Added the following UCX features: • [ConnectX-5 only] Added support for hardware Tag Matching with DC transport. • [ConnectX-5 only] Added support for Out-of-order RDMA RC and DC to support adaptive routing with true RDMA. • Added UCX datatypes - community approved datatype support. • Added UCX support to Inbox RHEL. • Added GPU Direct RDMA support. • Hardware Tag Matching (See section Hardware Tag Matching in the User Manual) • SR-IOV Support (See section SR-IOV Support in the User Manual) • Adaptive Routing (AR) (See section Adaptive Routing in the User Manual) • Error Handling (See section Error Handling in the User Manual)
HCOLL	• Added support for Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) v1.4. • Added support for NCCL on-host GPU based collectives. • Added support for Hierarchical GPU based allreduce using NCCL for scale-in and MXM/UCX for scale-out. • Improved shared memory performance for allreduce, barrier, and broadcast, targeting high thread count systems, e.g. Power9. • Improved large message allreduce (multi-radix, zero-copy fragmentation, CPU vectorization). • Added a new and improved AlltoAllv algorithm: hybrid logarithmic pair-wise exchange. • Added support for on-demand HCOLL memory. This improves HCOLL's memory footprint on high thread count systems, e.g. Power9. • Added a high performance multithreaded implementation to support MPI_THREAD_MULTIPLE applications. It is designed specifically for high thread count systems, e.g. Power9. • Improved HCOLL startup.
Open MPI / OpenSHMEM	• Added support for Open MPI 3.0.0. • Added support for xpmem kernel module. • Added a high performance implementation of shmem_ptr() with UCX SPML. • Added a UCX allocator. The UCX allocator optimizes intra-node communication by allowing direct access to memories of processes on the same node. The UCX allocator can only be used with the UCX SPML. • Added a UCX one-sided component to support MPI RMA operations.
Rev 1.9.7
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)	Bug Fixes, see Section 4, “Bug Fixes History”
Rev 1.9
HPC-X Content	Updated the following communications libraries and acceleration packages versions: • OpenMPI version 2.1.2a1 • Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.3.1 • HCOLL version 3.8.1652 • MXM version 3.6.3103 • UCX version 1.2.2947
UCX	Point-to-point communication API, with tag matching, remote memory access, and atomic operations. This can be used to implement MPI, PGAS, and Big Data libraries and applications- IB transport
	A cleaner API with lower software overhead which provides better performance especially for small messages.
	Support for a multitude of InfiniBand transports and NVIDIA offloads to optimize data transfer performance: • RDMA • DC • Out-of-order • HW tag matching offload • Registration cache • ODP
	Shared memory communications for optimal intra-node data transfer: • SysV • posix • knem • CMA • xpmem
MXM	Enabled Adaptive Routing for all the transport layers (UD/RC/DC).
MXM	Memory registration optimization.
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)	Improved the Out-of-the-box performance of Scalable Hierarchical Aggregation and Reduction Protocol (SHARP).
Shared memory	Improved the intranode performance of allreduce and barrier.
Configuration	Changed many default parameter settings to achieve the best out-of-the-box experience for several applications, including CP2K, miniDFT, VASP, DL-POLY, Amber, Fluent, GAMESS-UK, and LS-DYNA.
FCA	As of HPC-X v1.9, FCA v2.5 is no longer included in the HPC-X package.
	Improved AlltoAllv algorithm.
	Improved large data allreduce.
	Improved UCX BCOL.
OS architecture	Added support for ARM architecture.
Rev 1.8.2
MXM	Updated MXM version to 3.6.2098 which includes memory registration optimization.
Rev 1.8
Cross Channel (CC)	Added Cross Channel (CC) AlltoAllv
Cross Channel (CC)	Added CC zcpy Ring Bcast
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)	Added Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) non-blocking collectives
Shared memory POWER	Added shared memory POWER optimizations for allreduce
Shared memory POWER	Added shared memory POWER optimizations for Barrier
Mixed data types	Added support for mixed data types
Non-contiguous Bcast	Added support for non-contiguous Bcast with UMR or SGE in CC
UMR	Added UMR support in CC bcol
Unified Communication - X Framework (UCX)	A new acceleration library, integrated into Open MPI (as a pml layer) and available as part of HPC-X. It is an open source communication library designed to achieve the highest performance for HPC applications.
HPC-X Content	Updated the following communications libraries and acceleration packages versions: • HCOLL updated to v3.7 • Open MPI updated to v2.10
FCA	FCA 2.x is no longer the default FCA used in HPC-X. As of HPC-X v1.8, FCA 3.x (HCOLL) is the default FCA used and it replaces FCA v2.x.
Bug Fixes	See Section 4, “Bug Fixes History”
Rev 1.7
MXM	Updated MXM version to 3.6
FCA Collective	Added Cross-Channel based Allgather, Bcast, 8-byte Allreduce.
FCA	Added MPI datatype support.
	Added optimizations for PPC platforms.
	Added support for multiple NVIDIA SHARP technology leaders on a single host.
	Added support for collecting NVIDIA SHARP technology usage statistics.
	Exposed cross-channel non-blocking collectives to the MPI level.
Rev 1.6
MXM v3.5	See Section 5.3, “MXM Change Log History”
IB-Router	Allows hosts that are located on different IB subnets to communicate with each other. This support is currently available when using the 'openib btl' in Open MPI. Note: When using 'openib btl', RoCE and IB router are mutually exclusive. The Open MPI inside HPC-X 1.6 is not compiled with ib-router support, therefore it supports RoCE out-of-the-box.
FCA v3.5	See Section 5.2, “FCA Change Log History”
Rev 1.5
HPC-X Content	Updated the following communications libraries and acceleration packages versions: • Open MPI updated to v1.10 • UPC updated to v2.22.0 • MXM updated to v3.4.369 • FCA updated to v3.4.799
MXM v3.4.369	See Section 5.3, “MXM Change Log History”
FCA v3.4.799	See Section 5.2, “FCA Change Log History”
Rev 1.4
FCA v3.3	See Section 5.2, “FCA Change Log History”
MXM v3.4	See Section 5.3, “MXM Change Log History”
Rev 1.3
MLNX_OFED	Added support for OFED Inbox drivers
CPU Architecture	Added support for PPC architecture
LID Mask Control (LMC)	Added support for multiple LIDs usage when the LMC in the fabric is higher than zero. MXM will use multiple LIDs to distribute traffic across multiple links and achieve better resource utilization.
Performance	Performance improvements for all transport layers.
Adaptive Routing	Enhanced support for Adaptive Routing for the UD transport layer. For further information, please refer to the HPC-X User Manual section “Adaptive Routing for UD Transport”.
UD zero copy	UD zero copy support on receiver side to achieve better bandwidth utilization and reduce CPU usage.
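
The UCX settings named in the Rev 2.12 and Rev 2.6 entries above (`UCX_IB_ECE=auto` and `UCX_RC_TM_ENABLE=y`) are environment variables, so one possible way to apply them is per job rather than system-wide. The following is a minimal Python sketch under that assumption. The helper name `ucx_env` and its parameters are hypothetical and are not part of HPC-X or UCX, and the `mpirun` command line shown in the comment is only illustrative.

```python
import os


def ucx_env(base=None, ece="auto", tag_matching=True):
    """Return a copy of an environment with the UCX variables from the change log set.

    ucx_env and its parameters are hypothetical names used for illustration only.
    """
    # Copy so the caller's environment (or os.environ) is never modified.
    env = dict(os.environ if base is None else base)
    # Rev 2.12: "auto" asks UCX to use the maximal RoCE ECE value.
    env["UCX_IB_ECE"] = str(ece)
    if tag_matching:
        # Rev 2.6: enable hardware tag matching.
        env["UCX_RC_TM_ENABLE"] = "y"
    return env


if __name__ == "__main__":
    env = ucx_env()
    for name in ("UCX_IB_ECE", "UCX_RC_TM_ENABLE"):
        print(f"{name}={env[name]}")
    # The environment could then be handed to a launcher, for example:
    #   subprocess.run(["mpirun", "-np", "2", "./app"], env=env)
```

Building a copy of the environment keeps the settings scoped to a single launch, which makes it easier to compare runs with and without a given UCX option.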

FCA Change Log History

Category	Change
Rev 3.5
FCA Collective	Added MPI Allgatherv and MPI reduce
FCA	Added support for NVIDIA SHARP library (including SHARP allreduce, reduce and barrier)
	Enhanced scalability for CORE-Direct based collectives
	Added support for complex data types
Rev 3.4
General	UCX support
General	Communicator caching scheme with eviction: improves jobstart and communicator creation time
Collectives	Added Alltoallv and Alltoall small message algorithms.
Rev 3.3
General	Ported to PowerPC
General	Thread safety added
Collectives	Improved large message allreduce algorithm (Enabled by default)
Collectives	Beta version of network topology awareness (Enabled by default)
Rev 3.0
Collectives	Offload collectives communication from MPI process onto NVIDIA interconnect hardware.
Collectives	Efficient collectives communication flow optimized to job and topology
MPI collectives	Significantly reduce MPI collectives runtime
MPI-3	Native support for MPI-3
Blocking and Non-blocking collectives	Support for blocking and nonblocking collectives
HCOLL	Supports hierarchical communication algorithms (HCOLL)
Collective algorithm	Supports multiple optimizations within a single collective algorithm
Performance	Increase CPU availability and efficiency for increased application performance
MPI libraries	Seamless integration with MPI libraries and job schedulers
Rev 2.5
Multicast Group	Added MCG (Multicast Group) cleanup tool
Performance	Performance improvements
Rev 2.2
Performance	Performance improvements
Dynamic offloading rules	Enabled dynamic offloading rules configuration based on the data type and reduce operations
Mixed MTU	Added support for mixed MTU
Rev 2.1.1
AMD/Interlagos CPUs	Added support for AMD/Interlagos CPUs
Rev 2.1
Core-Direct®	Added support for Core-Direct® technology (enables offloading collective operations to the HCA.)
Non-contiguous data layouts	Added support for non-contiguous data layouts
PGI compilers	Added support for PGI compilers

HPC-X™ Open MPI/OpenSHMEM Change Log History

Category	Change
Rev 2.2
Performance	Added Sandy Bridge performance optimizations.
memheap	Allocated memheap using contiguous memory provided by the HCA.
ptmalloc allocator	Replaced the buddy memheap by the ptmalloc allocator.
multiple pSync arrays	Added the option of using multiple pSync arrays instead of barrier synchronization between collective routines (fcollect, reduction routines)
spml yoda	Optimized small size puts
Performance	Performance optimization
Memory footprint optimizations	Added memory footprint optimizations
Rev 1.8.2
Acceleration Packages	Added support for new MXM, FCA, HCOLL versions
Job start optimization	Added job start optimization
Performance	Performance improvements
