Release Notes Change Log History
Category |
Change |
Rev 2.17.0 |
|
HPC-X Content |
Updated HPC-X Content section to reflect the communication libraries versions embedded in this HPC-X release.
Added the following Supported Platforms and OSs:
|
Supported Cards |
Added support for GH100. |
Known Issues |
See Known Issues. |
Rev 2.16.2 |
|
HPC-X Content |
Updated HPC-X Content section to reflect the communication libraries versions embedded in this HPC-X release.
|
Supported Cards |
All cards up to BlueField-3 and ConnectX-7. |
Bug Fixes |
|
Rev 2.16 |
|
HPC-X Content |
Updated HPC-X Content section to reflect the communication libraries versions embedded in this HPC-X release.
|
Supported Cards |
Added support for BlueField-3 cards. |
Bug Fixes |
See Bug Fixes. |
Rev 2.15 |
|
HPC-X Content |
Updated HPC-X Content section to reflect the communication libraries versions embedded in this HPC-X release.
|
Bug Fixes |
See Bug Fixes. |
Rev 2.14 |
|
TL/UCP Special Service Worker |
Added support for having a separate UCX UCP worker use UCC service collectives. For further information, please see TL/UCP Special Service Worker section. |
Data Type Support in CUDA Executor Component (EC) |
Added out-of-box support for all datatypes and reduction operations for UCC collectives for GPUs. For further information, please see Data Type Support in CUDA Executor Component section. |
EC/CUDA One-shot Kernel with Cooperative Launch |
Added support for using a single CUDA kernel for CUDA operations in UCC GPU collectives. For further information, please see EC/CUDA One-shot Kernel with Cooperative Launch section. |
Out-Of-Box Native GPU Allreduce |
Added support for the UCC library to detect the NVIDIA NVLink topology and select the best GPU-based algorithms for supported collectives (Allgather/v, Reducescatter/v). For further information, please see Out-Of-Box Native GPU Allreduce section. |
Bug Fixes |
See Bug Fixes. |
Rev 2.13.1 LTS |
|
Operating System |
Added support for Ubuntu v20.04 and v20.10. |
Rev 2.13 |
|
HPC-X Content |
Updated HPC-X Content section to reflect the communication libraries versions embedded in this HPC-X release.
|
NCCL-RDMA-SHARP-PLUGIN |
Added support for NCCL plugin API v5. |
SHARP |
Added support for SHARP on NDR. |
Bug Fixes |
See Bug Fixes section. |
Rev 2.12 |
|
UCX |
Added a method to set RoCE ECE value from UCX configuration. For example: UCX_IB_ECE=auto will use maximal ECE value, and UCX_IB_ECE= will use a specific numeric ECE value. |
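As a minimal sketch of how this setting might be handed to a job, the snippet below builds a child-process environment with UCX_IB_ECE set. Only the variable name and the `auto` value come from the note above; the helper name `ucx_env_with_ece` is hypothetical and not part of UCX or HPC-X.

```python
import os


def ucx_env_with_ece(value="auto", base_env=None):
    """Return a copy of the environment with UCX_IB_ECE set.

    Per the release note, 'auto' requests the maximal ECE value,
    while a numeric value requests a specific ECE value.
    The caller's environment mapping is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["UCX_IB_ECE"] = str(value)
    return env


# Build an environment from an explicit base so the demo is reproducible.
env = ucx_env_with_ece("auto", base_env={"PATH": "/usr/bin"})
print(env["UCX_IB_ECE"])
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when starting the application, which keeps the setting scoped to that one job rather than the whole shell session.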
HPC-X Content |
Updated the version of the UCX communication library to v1.14. |
Rev 2.11 |
|
Adapter Cards |
Added support for the NVIDIA ConnectX-7 adapter card with 400 Gb/s speed. |
SHARPD |
The sharpd daemon process has been removed. sharpd-related activity is now performed from the user application process. |
HPC-X Content |
Updated the versions of the following communication libraries.
|
Added support for UCC, a collective communication operations API and library in HPC-X. UCC is now part of the HPC-X package. For further information on UCC, please see the Unified Collective Communication (UCC) section. |
|
Rev 2.10 |
|
UCX |
Added support for atomics on GPU memory target |
OpenSHMEM |
Added support for reducing memory overhead at scale. |
Rev 2.9 |
|
UCX Configuration File |
The UCX configuration file lets users apply configuration variables that they set in the /etc/ucx/ucx.conf file. For further information, see UCX Configuration File. |
Instrumentation and Monitoring FUSE-based Tool |
This new functionality enables the user to analyze UCX-based applications at runtime. The tool is based on the Filesystem in Userspace (FUSE) interface. If the feature is enabled, a directory for each process using UCX is created in /tmp/ucx. For further information, see Instrumentation and Monitoring FUSE-based Tool. |
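The per-process directories described above could be enumerated with a short script. This is a sketch only: the /tmp/ucx root comes from the note, but the helper name `list_ucx_process_dirs` is hypothetical, and the subdirectory names in the demonstration are made up because the note does not say how the directories are named.

```python
import os
import tempfile


def list_ucx_process_dirs(root="/tmp/ucx"):
    """Return the sorted names of the subdirectories found under root.

    Returns an empty list when root does not exist, for example when
    the feature is disabled or no UCX process is running.
    """
    if not os.path.isdir(root):
        return []
    return sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )


# Self-contained demonstration: a temporary directory stands in for
# /tmp/ucx, and the subdirectory names are invented placeholders.
with tempfile.TemporaryDirectory() as demo_root:
    for name in ("1234", "5678"):
        os.mkdir(os.path.join(demo_root, name))
    print(list_ucx_process_dirs(demo_root))
```

On a node where the feature is enabled, calling `list_ucx_process_dirs()` with no argument would inspect /tmp/ucx itself.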
OS Architecture |
HPC-X v2.9 onwards will no longer support PPC architecture in its releases. |
Bug Fixes |
|
Rev 2.8 |
|
HPC-X Content |
Updated the following communication libraries and acceleration packages versions:
|
UCX |
Added support for Multi-interface for cloud (client-server) applications. |
Added support for using Adaptive-Routing (out-of-order) on an SL that supports it. |
|
Added support for UCP Active-Messages API with Rendezvous. |
|
Added support for Keepalive functionality on the UCT layer. |
|
Performed several error handling enhancements. |
|
Added support for GPU-NIC locality discovery. |
|
NCCL-RDMA-SHARP-PLUGIN |
Added support for NCCL Plugin API v4. |
Added support for PCIe Relaxed Ordering. |
|
Added support for Adaptive Routing. |
|
Rev 2.7 |
|
UCX |
Added a new request API. For further information on this request API, please refer to UCX API documentation. |
Added support for PCIe Relaxed Ordering. |
|
Added out-of-box support for RoCE LAG. |
|
Added Flow Control support for RDMA Read operations. |
|
AMD Rome optimizations: Optimized IB connection establishment procedures to reduce system noise. |
|
Rev 2.6 |
|
HPC-X Content |
Updated the following communications libraries and acceleration packages versions:
|
UCX |
Added support in UCX for communication between containers configured to share the memory namespaces. |
Added strided Receive queue support for hardware tag matching. |
|
Made the following performance improvements on AMD EPYC servers.
|
|
Added support for multithreaded memory regions (MR) in OpenSHMEM (OSHMEM) applications to improve job startup and teardown latencies. The multithreaded MR makes more efficient use of CPU resources when registering memory regions larger than 4 GB. |
|
CUDA |
Removed CUDA support in SLES 11 and RHEL 6 OSs. |
Rev 2.5 |
|
HPC-X Content |
Updated the following communications libraries and acceleration packages versions:
|
Removed CUDA init script (hpcx-init-cuda.sh), and environmental module (modules/hpcx-cuda) from HPC-X. Up until HPC-X v2.4, these files used to point to the default files hpcx-init.sh and modules/hpcx. Now, these CUDA files no longer exist, and users can only use the default init script and environmental module for enabling CUDA support. |
|
CUDA |
Unified Vanilla and CUDA environments. CUDA v10.0 is supported out of the box with standard init script or environmental module. Note: HPC-X is compiled against CUDA version 10.0, which does not support GCC versions newer than v8. Therefore, HPC-X built on systems with GCC versions above v8 will not have CUDA support. |
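The compiler constraint in this note can be expressed as a small lookup. The sketch below is illustrative only: the two GCC limits are taken from the Rev 2.5 and Rev 2.4 notes in this change log, while the function name `hpcx_cuda_available` and the table layout are hypothetical and not part of HPC-X.

```python
# Newest GCC major version usable with each CUDA version,
# as stated in the Rev 2.5 (CUDA 10.0) and Rev 2.4 (CUDA 9) notes.
MAX_GCC_MAJOR = {"10.0": 8, "9": 7}


def hpcx_cuda_available(cuda_version, gcc_major):
    """Return True if an HPC-X build with this GCC can have CUDA support."""
    try:
        limit = MAX_GCC_MAJOR[cuda_version]
    except KeyError:
        raise ValueError(f"no GCC limit recorded for CUDA {cuda_version}")
    return gcc_major <= limit


print(hpcx_cuda_available("10.0", 8))  # True
print(hpcx_cuda_available("10.0", 9))  # False
```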
UCX |
Made performance optimizations. |
Added full support for rdma-core. |
|
Added support for CUDA v10.1. |
|
Rev 2.4 |
|
HPC-X Content |
Updated the following communications libraries and acceleration packages versions:
|
Removed rc.local_mellanox script. HPC-X became more stable and this script is no longer required. |
|
CUDA |
Unified Vanilla and CUDA environments. CUDA v9 is supported out of the box with standard init script or environmental module. Note: HPC-X is compiled against CUDA version 9, which does not support GCC versions newer than v7. Therefore, HPC-X built on systems with GCC versions above v7 will not have CUDA support. |
UCX |
Enabled HDR, SocketDirect and MultiRail features out-of-box. |
UCX Random DCI is now at GA level. |
|
Implemented a number of job startup optimizations. |
|
Added support for the PCIe atomic operations feature.
|
HCOLL |
Added support for performing floating point 16 bit operations for machine learning scenarios. |
OpenMPI |
Added multi threading support to OpenMPI OSC UCX. |
General |
HPC-X is now available through the EasyBuild framework: https://easybuild.readthedocs.io/en/latest/ |
Rev 2.3 |
|
HPC-X Content |
Updated the following communications libraries and acceleration packages versions:
|
UCX |
UCX is now compiled without JAVA bindings. |
Added support for running UCX over rdma-core, for DC transport and direct verbs. |
|
Emulation layer: Added the ability to run UCX over software emulation of remote memory access and atomic operations. This provides full support of SHMEM and MPI-RMA over shared memory, TCP, and older RDMA hardware, such as ConnectX-3 HCA. |
|
HCOLL |
HCOLL and NVIDIA SHARP are now compiled with CUDA support. |
Added support for CUDA buffers over SRA allreduce algorithm. |
|
MXM |
Removed support for MXM library. |
OpenMPI |
Added the following configuration options to OMPI:
|
Updated the configuration file platform/mellanox/optimized config in OMPI upstream by removing BTL OpenIB and UCT support and removing links to MXM/FCA usage. |
|
Removed PMI2 support. |
|
Rev 2.2 |
|
HPC-X Content |
Updated the following communications libraries and acceleration packages versions:
• NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.7
• HCOLL version 4.1
• UCX version 1.4 |
Added support for Singularity containerization. For further information, please refer to HPC-X User Manual. |
|
“osc ucx” is no longer the default one-sided-component in OpenMPI. |
|
Removed KNEM library from HPC-X package. UCX will use the KNEM available in MLNX_OFED. |
|
MXM Support |
Open MPI and HCOLL are not compiled with MXM anymore. Both are compiled with UCX only and use it by default. |
UCX |
Added support for the following UCX features:
• New API for establishing client-server connection.
• Out-of-box support for Memory In Chip (MEMIC) on ConnectX-5 HCAs. |
HPC-X Setup |
Added support for HPC-X to work on Huawei ARM architecture. |
HCOLL |
Improved performance by utilizing zero-copy messaging for MPI Bcast. |
Rev 2.1 |
|
HPC-X Content |
Updated the following communications libraries and acceleration packages versions:
• Open MPI version 3.1.x
• NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.5
• HCOLL version 4.0
• MXM version 3.7
• UCX version 1.3
• OpenSHMEM v1.3 specification compliant |
UCX |
• UCX is now the default pml layer for Open MPI, default spml layer for OpenSHMEM, and default OSC component for MPI RMA.
• Added the following UCX features:
  • Added support for GPU memory in UCX communication libraries
  • Added support for Multi-Rail protocol |
MXM |
The UD_RNDV_ZCOPY parameter is set to ‘no’ by default. This means that the zcopy mechanism for the UD transport is disabled when using the Rendezvous protocol. |
HCOLL |
• UCX is now the default p2p transport in HCOLL
• Improved multi-threaded performance
• Improved shared memory performance
• Added support for NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) v1.5
• Added support for NVIDIA SHARP software multi-channel/multi-rail capable algorithms
• Improved Allreduce large message algorithm
• Improved AlltoAll algorithm |
Profiling IB verbs API (ibprof) |
Removed ibprof tool from HPC-X toolkit. |
UPC |
Removed UPC from HPC-X toolkit. |
Rev 2.0 |
|
HPC-X Content |
Updated the following communications libraries and acceleration packages versions:
• OpenMPI version 3.0.0
• Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.4
• HCOLL version 3.9
• UCX version 1.3 |
UCX |
• UCX is now at GA level.
• Added the following UCX features:
  • [ConnectX-5 only] Added support for hardware Tag Matching with DC transport.
  • [ConnectX-5 only] Added support for Out-of-order RDMA RC and DC to support adaptive routing with true RDMA.
  • Added UCX datatypes - community approved datatype support.
  • Added UCX support to Inbox RHEL.
  • Added GPU Direct RDMA support.
  • Hardware Tag Matching (See section Hardware Tag Matching in the User Manual)
  • SR-IOV Support (See section SR-IOV Support in the User Manual)
  • Adaptive Routing (AR) (See section Adaptive Routing in the User Manual)
  • Error Handling (See section Error Handling in the User Manual) |
HCOLL |
• Added support for Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) v1.4.
• Added support for NCCL on-host GPU based collectives.
• Added support for hierarchical GPU based allreduce using NCCL for scale-in and MXM/UCX for scale-out.
• Improved shared memory performance for allreduce, barrier, and broadcast, targeting high thread count systems, e.g. Power9.
• Improved large message allreduce (multi-radix, zero-copy fragmentation, CPU vectorization).
• Added new and improved AlltoAllv algorithm - hybrid logarithmic pair-wise exchange.
• Added support for on-demand HCOLL memory. This improves HCOLL's memory footprint on high thread count systems, e.g. Power9.
• Added a high performance multithreaded implementation to support MPI_THREAD_MULTIPLE applications, designed specifically for high thread count systems, e.g. Power9.
• HCOLL startup improvements. |
Open MPI / OpenSHMEM |
• Added support for Open MPI 3.0.0.
• Added support for xpmem kernel module.
• Added a high performance implementation of shmem_ptr() with UCX SPML.
• Added a UCX allocator. The UCX allocator optimizes intra-node communication by allowing direct access to memories of processes on the same node. The UCX allocator can only be used with the UCX SPML.
• Added a UCX one-sided component to support MPI RMA operations. |
Rev 1.9.7 |
|
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) |
Bug Fixes, see Section 4, “Bug Fixes History”, on page 11 |
Rev 1.9 |
|
HPC-X Content |
Updated the following communications libraries and acceleration packages versions:
• OpenMPI version 2.1.2a1
• Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) version 1.3.1
• HCOLL version 3.8.1652
• MXM version 3.6.3103
• UCX version 1.2.2947 |
UCX |
Point-to-point communication API, with tag matching, remote memory access, and atomic operations. This can be used to implement MPI, PGAS, and Big Data libraries and applications- IB transport |
A cleaner API with lower software overhead which provides better performance especially for small messages. |
|
Support for a multitude of InfiniBand transports and NVIDIA offloads to optimize data transfer performance:
• RDMA
• DC
• Out-of-order
• HW tag matching offload
• Registration cache
• ODP
|
Shared memory communications for optimal intra-node data transfer:
• SysV
• posix
• knem
• CMA
• xpmem
|
MXM |
Enabled Adaptive Routing for all the transport layers (UD/RC/DC). |
Memory registration optimization. |
|
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) |
Improved the Out-of-the-box performance of Scalable Hierarchical Aggregation and Reduction Protocol (SHARP). |
Shared memory |
Improved the intranode performance of allreduce and barrier. |
Configuration |
Changed many default parameter settings to achieve the best out-of-the-box experience for several applications, including CP2K, miniDFT, VASP, DL-POLY, Amber, Fluent, GAMESS-UK, and LS-DYNA. |
FCA |
As of HPC-X v1.9, FCA v2.5 is no longer included in the HPC-X package. |
Improved AlltoAllv algorithm. |
|
Improved large data allreduce. |
|
Improved UCX BCOL. |
|
OS architecture |
Added support for ARM architecture. |
Rev 1.8.2 |
|
MXM |
Updated MXM version to 3.6.2098 which includes memory registration optimization. |
Rev 1.8 |
|
Cross Channel (CC) |
Added Cross Channel (CC) AlltoAllv |
Added CC zcpy Ring Bcast
|
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) |
Added Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) non-blocking collectives |
Shared memory POWER |
Added shared memory POWER optimizations for allreduce |
Added shared memory POWER optimizations for Barrier |
|
Mixed data types |
Added support for mixed data types |
Non-contiguous Bcast |
Added support for non-contiguous Bcast with UMR or SGE in CC |
UMR |
Added UMR support in CC bcol |
Unified Communication - X Framework (UCX) |
A new acceleration library, integrated into Open MPI (as a pml layer) and available as part of HPC-X. It is an open source communication library designed to achieve the highest performance for HPC applications. |
HPC-X Content |
Updated the following communications libraries and acceleration packages versions:
• HCOLL updated to v3.7
• Open MPI updated to v2.10 |
FCA |
FCA 2.x is no longer the default FCA used in HPC-X. As of HPC-X v1.8, FCA 3.x (HCOLL) is the default FCA used and it replaces FCA v2.x. |
Bug Fixes |
See Section 4, “Bug Fixes History”, on page 11 |
Rev 1.7 |
|
MXM |
Updated MXM version to 3.6 |
FCA Collective |
Added Cross-Channel based Allgather, Bcast, 8-byte Allreduce. |
FCA |
Added MPI datatype support. |
Added optimizations for PPC platforms. |
|
Added support for multiple NVIDIA SHARP technology leaders on a single host. |
|
Added support for collecting NVIDIA SHARP technology usage statistics. |
|
Exposed cross-channel non-blocking collectives to the MPI level. |
|
Rev 1.6 |
|
MXM v3.5 |
See Section 5.3, “MXM Change Log History”, on page 23 |
IB-Router |
Allows hosts that are located on different IB subnets to communicate with each other. This support is currently available when using the 'openib btl' in Open MPI. Note: When using 'openib btl', RoCE and IB router are mutually exclusive. The Open MPI inside HPC-X 1.6 is not compiled with ib-router support, therefore it supports RoCE out-of-the-box. |
FCA v3.5 |
See Section 5.2, “FCA Change Log History”, on page 21 |
Rev 1.5 |
|
HPC-X Content |
Updated the following communications libraries and acceleration packages versions:
• Open MPI updated to v1.10
• UPC updated to v2.22.0
• MXM updated to v3.4.369
• FCA updated to v3.4.799 |
MXM v3.4.369 |
See Section 5.3, “MXM Change Log History”, on page 23 |
FCA v3.4.799 |
See Section 5.2, “FCA Change Log History”, on page 21 |
Rev 1.4 |
|
FCA v3.3 |
See Section 5.2, “FCA Change Log History”, on page 21 |
MXM v3.4 |
See Section 5.3, “MXM Change Log History”, on page 23 |
Rev 1.3 |
|
MLNX_OFED |
Added support for OFED Inbox drivers |
CPU Architecture |
Added support for PPC architecture |
LID Mask Control (LMC) |
Added support for multiple LIDs usage when the LMC in the fabric is higher than zero. MXM will use multiple LIDs to distribute traffic across multiple links and achieve better resource utilization. |
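To illustrate the idea, the sketch below enumerates the LIDs a port answers to for a given LMC and spreads connections across them. In InfiniBand, a port with LMC value n is assigned 2^n consecutive LIDs starting at its base LID. The round-robin policy and both function names are assumptions made for illustration; the note does not describe how MXM actually selects a LID.

```python
from itertools import cycle, islice


def port_lids(base_lid, lmc):
    """Return the 2**lmc consecutive LIDs assigned to a port."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0..7")
    if base_lid & ((1 << lmc) - 1):
        raise ValueError("base LID must be aligned to 2**lmc")
    return [base_lid + i for i in range(1 << lmc)]


def assign_lids(num_connections, base_lid, lmc):
    """Pick a destination LID per connection, round-robin.

    Round-robin is an illustrative policy only, not MXM's
    documented behavior.
    """
    return list(islice(cycle(port_lids(base_lid, lmc)), num_connections))


print(port_lids(16, 2))       # [16, 17, 18, 19]
print(assign_lids(6, 16, 2))  # [16, 17, 18, 19, 16, 17]
```

With LMC set to zero the port has a single LID, so every connection uses the same destination address and no path diversity is available.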
Performance |
Performance improvements for all transport layers. |
Adaptive Routing |
Enhanced support for Adaptive Routing for the UD transport layer. For further information, please refer to the HPC-X User Manual section “Adaptive Routing for UD Transport”. |
UD zero copy |
UD zero copy support on receiver side to achieve better bandwidth utilization and reduce CPU usage. |
Category |
Change |
Rev 3.5 |
|
FCA Collective |
Added MPI Allgatherv and MPI reduce |
FCA |
Added support for NVIDIA SHARP library (including SHARP allreduce, reduce and barrier) |
Enhanced scalability for CORE-Direct based collectives |
|
Added support for complex data types |
|
Rev 3.4 |
|
General |
UCX support |
Communicator caching scheme with eviction: improves jobstart and communicator creation time |
|
Collectives |
Collectives: Added Alltoallv and Alltoall small message algorithms. |
Rev 3.3 |
|
General |
Ported to PowerPC |
Thread safety added |
|
Collectives |
Improved large message allreduce algorithm (Enabled by default) |
Beta version of network topology awareness (Enabled by default) |
|
Rev 3.0 |
|
Collectives |
Offload collectives communication from MPI process onto NVIDIA interconnect hardware. |
Efficient collectives communication flow optimized to job and topology |
|
MPI collectives |
Significantly reduce MPI collectives runtime |
MPI-3 |
Native support for MPI-3 |
Blocking and Non-blocking collectives |
Support for blocking and nonblocking collectives |
HCOLL |
Supports hierarchical communication algorithms (HCOLL) |
Collective algorithm |
Supports multiple optimizations within a single collective algorithm |
Performance |
Increase CPU availability and efficiency for increased application performance |
MPI libraries |
Seamless integration with MPI libraries and job schedulers |
Rev 2.5 |
|
Multicast Group |
Added MCG (Multicast Group) cleanup tool |
Performance |
Performance improvements |
Rev 2.2 |
|
Performance |
Performance improvements |
Dynamic offloading rules |
Enabled dynamic offloading rules configuration based on the data type and reduce operations |
Mixed MTU |
Added support for mixed MTU |
Rev 2.1.1 |
|
AMD/Interlagos CPUs |
Added support for AMD/Interlagos CPUs |
Rev 2.1 |
|
Core-Direct® |
Added support for Core-Direct® technology (enables offloading collective operations to the HCA.) |
Non-contiguous data layouts |
Added support for non-contiguous data layouts |
PGI compilers |
Added support for PGI compilers |
Category |
Change |
Rev 2.2 |
|
Performance |
Added Sandy Bridge performance optimizations. |
memheap |
Allocated memheap using contiguous memory provided by the HCA. |
ptmalloc allocator |
Replaced the buddy memheap by the ptmalloc allocator. |
multiple pSync arrays |
Added the option of using multiple pSync arrays instead of barrier synchronization between collective routines (fcollect, reduction routines) |
spml yoda |
Optimized small size puts |
Performance |
Performance optimization |
Memory footprint optimizations |
Added memory footprint optimizations |
Rev 1.8.2 |
|
Acceleration Packages |
Added support for new MXM, FCA, HCOLL versions |
Job start optimization |
Added job start optimization |
Performance |
Performance improvements |