Prior Releases#
NVIDIA® NVSHMEM 2.8.0 Release Notes#
Abstract#
NVSHMEM is an NVIDIA-based “shared memory” library that provides an easy-to-use CPU-side interface to allocate pinned memory that is symmetrically distributed across a cluster of NVIDIA GPUs. These release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 2.8.0 and earlier releases.
Key Features and Enhancements#
This NVSHMEM release includes the following key features and enhancements:
The transport formerly called GPU Initiated Communication (GIC) has been renamed to InfiniBand GPUDirect Async (IBGDA) to reflect the underlying technology used by that transport.
Improvements to the all-to-all algorithm were made for both the IBGDA and IBRC transports. These changes specifically focused on latency-bound all-to-all operations.
Support for RC connections was added to IBGDA to optimize workloads on small PE sets.
Compatibility#
NVSHMEM 2.8.0 has been tested with the following:
CUDA Toolkit: 11.8
On x86 and Power 9 processors
Limitations#
NVSHMEM is not yet compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library. Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and setting NVSHMEM_BOOTSTRAP_PMI=PMI-2. PMI-2 can also be set as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM. Jobs can also be launched directly by using the MPI or SHMEM bootstraps.
The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM=1.
Libfabric support on Slingshot-11 networks requires setting the environment variable FI_CXI_OPTIMIZED_MRS=false.
VMM support is disabled by default on Power 9 systems because of a performance regression.
MPG support is not yet available on Power 9 systems.
Systems with PCIe peer-to-peer communication require one of the following:
InfiniBand to support NVSHMEM atomics APIs.
The use of NVSHMEM’s UCX transport, which, if IB is absent, will use sockets for atomics.
NVSHMEM host APIs can be dynamically linked, but device APIs can only be statically linked.
This is because the linking of CUDA device symbols does not work across shared libraries.
nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand.
They do not ensure global ordering and visibility.
When built with GDRCopy and when using InfiniBand on older versions of the 460 driver and previous branches, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed in CUDA driver releases 470 and later and in the latest 460 driver.
When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.
With CUDA 11.3 and later, NVSHMEM supports the mapping of the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior. A minimal sketch of this explicit synchronization follows this list.
IBGDA does not work with DMABUF.
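The following is a minimal sketch of the explicit synchronization described above, assuming a VMM-mapped symmetric heap; the buffer size and the copy itself are placeholders, not part of any NVSHMEM API.

```cpp
// Minimal sketch: with a VMM-mapped symmetric heap, CUDA does not apply
// CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, so the application must synchronize
// asynchronous memory operations on symmetric buffers itself.
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>
#include <stdlib.h>

int main(void) {
    nvshmem_init();

    size_t nbytes = 1 << 20;
    int *sym  = (int *) nvshmem_malloc(nbytes);   // symmetric heap buffer
    int *host = (int *) malloc(nbytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // With a cudaMalloc-backed heap, SYNC_MEMOPS would order this copy
    // automatically. With a VMM-backed heap, synchronize explicitly before
    // the buffer is used as the source or target of NVSHMEM communication.
    cudaMemcpyAsync(sym, host, nbytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);                // explicit synchronization

    // ... NVSHMEM communication on 'sym' can now observe the copied data ...

    nvshmem_free(sym);
    free(host);
    cudaStreamDestroy(stream);
    nvshmem_finalize();
    return 0;
}
```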
Fixed Issues#
An issue in the IBGDA transport that caused all GPUs on the same host to use the same NIC.
The DMA-BUF registration issue is fixed in this release. Users no longer need to limit their allocation granularity to work around that issue.
Breaking Changes#
Due to the name change of the IBGDA transport, all IBGDA-related environment variables have changed. Please see the API docs and installation guide for more information.
Deprecated Features#
n/a
Known Issues#
Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
NVSHMEM buffers that span multiple physical memory regions are not handled well in IBGDA. To work around this issue (a minimal sketch follows at the end of this section), either:
Set NVSHMEM_DISABLE_CUDA_VMM=1 and NVSHMEM_SYMMETRIC_SIZE=<size>, where size is large enough to cover your NVSHMEM memory usage, or
Set NVSHMEM_CUMEM_GRANULARITY=<size> such that it covers your application’s NVSHMEM memory consumption.
When using IBGDA, nvshmem_put, nvshmem_put_signal, and nvshmem_get do not support transferring more than 2 GiB of data in one call.
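Below is a minimal sketch of the first IBGDA workaround listed above, applied from application code. It assumes NVSHMEM reads these environment variables during nvshmem_init(); setting them in the job's launch environment instead is equivalent, and the 2G values are only placeholders for your application's actual NVSHMEM memory footprint (plain byte counts can also be used).

```cpp
// Minimal sketch of the IBGDA multi-region workaround. The environment
// variable names come from the known issue above; the 2G sizes are placeholders.
#include <cstdlib>
#include <nvshmem.h>

int main(void) {
    // Option 1: disable VMM and reserve a static symmetric heap that is large
    // enough to cover the application's NVSHMEM memory usage.
    setenv("NVSHMEM_DISABLE_CUDA_VMM", "1", 1);
    setenv("NVSHMEM_SYMMETRIC_SIZE", "2G", 1);

    // Option 2 (alternative): keep VMM but raise the mapping granularity so a
    // single buffer does not span multiple physical memory regions.
    // setenv("NVSHMEM_CUMEM_GRANULARITY", "2G", 1);

    nvshmem_init();
    // ... application ...
    nvshmem_finalize();
    return 0;
}
```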
Abstract#
NVSHMEM is an NVIDIA-based “shared memory” library that provides an easy-to-use CPU-side interface to allocate pinned memory that is symmetrically distributed across a cluster of NVIDIA GPUs. These release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 2.7.0 and earlier releases.
NVSHMEM 2.7.0#
Key Features and Enhancements#
This NVSHMEM release includes the following key features and enhancements:
Default Hopper Support (i.e. sm_90 and compute_90)
A new (Experimental) CMake build system
Performance improvements to the GPU Initiated Communication (GIC) transport. Specifically, improvements were made to the synchronization and concurrency paths in GIC to improve the overall message rate of the transport.
Support for [CUDA minor version compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#minor-version-compatibility) in the NVSHMEM library and headers.
Compatibility checks for the inbuilt bootstrap plugins.
Limited DMA-BUF memory registration support. This enables using NVSHMEM core functionality without the nv_peer_mem or nvidia_peermem modules. DMA-BUF registrations are only supported up to 4 GiB in NVSHMEM 2.7.
SO versioning for both the nvshmem_host shared library and the precompiled bootstrap modules.
NVSHMEM now links statically to libcudart_static.a instead of libcudart.so. This increases the NVSHMEM library size but removes the requirement for applications to provide the dependency for NVSHMEM.
Compatibility#
NVSHMEM 2.7.0 has been tested with the following:
CUDA Toolkit:
On x86 and Power 9 processors
Limitations#
NVSHMEM is not yet compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library. Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and setting NVSHMEM_BOOTSTRAP_PMI=PMI-2. PMI-2 can also be set as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM. Jobs can also be launched directly by using the MPI or SHMEM bootstraps.
The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM=1.
Libfabric support on Slingshot-11 networks requires setting the environment variable FI_CXI_OPTIMIZED_MRS=false.
VMM support is disabled by default on Power 9 systems because of a performance regression.
MPG support is not yet available on Power 9 systems.
Systems with PCIe peer-to-peer communication require one of the following:
InfiniBand to support NVSHMEM atomics APIs.
The use of NVSHMEM’s UCX transport, which, if IB is absent, will use sockets for atomics.
NVSHMEM host APIs can be dynamically linked, but device APIs can only be statically linked.
This is because the linking of CUDA device symbols does not work across shared libraries.
nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand.
They do not ensure global ordering and visibility.
When built with GDRCopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space.
This has been fixed in CUDA driver releases 470 and later and in the 460 branch.
When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.
With CUDA 11.3 and later, NVSHMEM supports the mapping of the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.
Fixed Issues#
An issue in the local buffer registration path (`nvshmemx_buffer_register`) where collisions between overlapping memory regions were not properly handled.
An issue causing validation errors in collective operations when all GPUs in a job are connected via PCIe without a remote transport using the proxy thread.
Breaking Changes#
Support for Pascal devices was removed.
Users are welcome to attempt compiling NVSHMEM from source and running it against Pascal GPUs by using the NVCC_GENCODE options, but no further bug fixes or support for Pascal devices will be added to NVSHMEM.
Deprecated Features#
n/a
Known Issues#
Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
DMA-BUF registrations are only supported with buffers up to 4 GiB. For heaps or registrations larger than 4 GiB, nvidia_peermem or nv_peer_mem must be used.
NVSHMEM 2.6.0#
This is the NVIDIA® NVSHMEM 2.6.0 release notes.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements:
Added a new GPU-initiated communication transport that allows kernel-initiated communication to be issued directly to the NIC, bypassing the CPU proxy thread. The transport is currently provided in experimental mode and is disabled by default. Please refer to the installation guide for how to enable it.
Updated the libfabric transport with initial support for Slingshot-11 networks. Performance tuning for the libfabric transport is ongoing.
Added collective algorithms for bcast/fcollect/reduce that use low latency (LL) optimization by sending data and synchronization together, resulting in significant performance improvements.
Added warp- and block-scope implementations of the recursive exchange algorithm for reduce collectives.
Fixed bug in host/on-stream RMA API for very large data transfers.
Fixed bug in the implementation of the nvshmem_fence and nvshmemx_quiet_on_stream APIs.
Compatibility
NVSHMEM 2.6.0 has been tested with the following:
CUDA:
On x86 and Power 9 processors
Limitations
NVSHMEM is not yet compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library. Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and setting NVSHMEM_BOOTSTRAP_PMI=PMI-2. PMI-2 can also be set as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM. Jobs can also be launched directly by using the MPI or SHMEM bootstraps.
The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM=1.
Libfabric support on Slingshot-11 networks requires setting the environment variable FI_CXI_OPTIMIZED_MRS=false.
VMM support is disabled by default on Power 9 systems because of a performance regression.
MPG support is not yet available on Power 9 systems.
Systems with PCIe peer-to-peer communication require one of the following:
InfiniBand to support NVSHMEM atomics APIs.
The use of NVSHMEM’s UCX transport, which, if IB is absent, will use sockets for atomics.
Fixed Issues
There are no fixed issues in this release.
Breaking Changes
There are no breaking changes in this release.
Deprecated Features
TBD
Known Issues
NVSHMEM device APIs can only be statically linked.
This is because the linking of CUDA device symbols does not work across shared libraries.
nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand.
They do not ensure global ordering and visibility.
Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
When built with GDRCopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space.
This has been fixed in CUDA driver releases 470 and later and in the 460 branch.
When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.
With CUDA 11.3 and later, NVSHMEM supports the mapping of the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.
NVSHMEM 2.5.0#
This is the NVIDIA® NVSHMEM 2.5.0 release notes.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements:
Added multi-instance support in NVSHMEM.
NVSHMEM now builds as two libraries, libnvshmem_host.so and libnvshmem_device.a, which allows an application to have multiple components (for example, shared libraries and the application) that use NVSHMEM.
Note: Support for the libnvshmem.a library still exists for legacy purposes but will eventually be removed.
Added the nvshmemx_init_status API to query the initialized state of NVSHMEM.
Added support for CUDA_VISIBLE_DEVICES.
Support for CUDA_VISIBLE_DEVICES is not yet available with CUDA VMM and requires you to set NVSHMEM_DISABLE_CUDA_VMM=1.
Updated PMI and PMI-2 bootstraps to plug-ins.
Added the nvshmem-info utility to display information about the NVSHMEM library.
Fixed warnings when using NVSHMEM in applications that compile without the Relocatable Device Code (RDC) option.
Renamed internal variables to avoid potential conflicts with variables in the application.
Implemented the nvshmem_alltoallmem API.
Improved the GPU-to-NIC assignment logic for the Summit/Sierra supercomputers.
Added quiet for kernel-initiated operations to host barrier operations.
Updated the descriptions for NVSHMEM environment variables that are displayed by using nvshmem-info or by setting NVSHMEM_INFO=1.
Added a symmetric size check to the NVSHMEM memory allocation functions.
Compatibility
NVSHMEM 2.5.0 has been tested with the following:
CUDA:
On x86 and Power 9 processors
Limitations
VMM support is disabled by default on Power 9 systems because of a performance regression.
MPG support is not yet available on Power 9 systems.
Systems with PCIe peer-to-peer communication require one of the following:
InfiniBand to support NVSHMEM atomics APIs.
The use of NVSHMEM’s UCX transport, which, if IB is absent, will use sockets for atomics.
Fixed Issues
There are no fixed issues in this release.
Breaking Changes
There are no breaking changes in this release.
Deprecated Features
TBD
Known Issues
NVSHMEM device APIs can only be statically linked.
This is because the linking of CUDA device symbols does not work across shared libraries.
nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand.
They do not ensure global ordering and visibility.
Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
When built with GDRCopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space.
This will be fixed in future CUDA driver releases in the 470-and-later and 460 branches.
When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.
With CUDA 11.3 and later, NVSHMEM supports the mapping of the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.
NVSHMEM 2.4.1#
This is the NVIDIA NVSHMEM 2.4.1 release notes.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements:
Added limited support for Multiple Processes per GPU (MPG) on x86 platforms.
The amount of support depends on the availability of CUDA MPS.
MPG support is currently not available on Power 9 platforms.
Added a local buffer registration API that allows non-symmetric buffers to be used as local buffers in the NVSHMEM API (a sketch follows this list).
Added support for dynamic symmetric heap allocation, which eliminates the need to specify NVSHMEM_SYMMETRIC_SIZE.
On x86 platforms, this feature is enabled by default and is available with CUDA version 11.3 or later.
On P9 platforms, this feature is disabled by default and can be enabled by using the NVSHMEM_DISABLE_CUDA_VMM environment variable.
Support for large RMA messages.
To build NVSHMEM without ibrc support, set NVSHMEM_IBRC_SUPPORT=0 in the environment when you build.
This allows you to build and run NVSHMEM without the GDRCopy and OFED dependencies.
Support for calling nvshmem_init/finalize multiple times with an MPI bootstrap.
Improved testing coverage (large messages, exercising full GPU memory, and so on).
Improved the default PE to NIC assignment for NVIDIA DGX-2™ systems.
Optimized channel request processing by using the CPU proxy thread.
Added support for the shmem_global_exit API.
Removed redundant barriers to improve the collectives’ performance.
Significant code refactoring to use templates instead of macros for internal functions.
Improved performance for device-side blocking RMA and strided RMA APIs.
Bug fix for buffers with large offsets into the NVSHMEM symmetric heap.
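As a rough illustration of the local buffer registration feature above, the sketch below registers an ordinary device allocation so it can be used as the local side of a host-initiated put. The prototypes shown for nvshmemx_buffer_register and nvshmemx_buffer_unregister are assumptions based on later releases; verify them against the nvshmemx.h shipped with your NVSHMEM version.

```cpp
// Minimal sketch: register a non-symmetric device buffer as a local buffer.
// The nvshmemx_buffer_register/unregister prototypes are assumed; check the
// installed headers before relying on them.
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();
    int peer = (nvshmem_my_pe() + 1) % nvshmem_n_pes();

    size_t nbytes = 1 << 20;
    void *local = NULL;                         // ordinary (non-symmetric) buffer
    cudaMalloc(&local, nbytes);
    nvshmemx_buffer_register(local, nbytes);    // make it usable as a local buffer

    void *sym = nvshmem_malloc(nbytes);         // symmetric destination

    // The registered buffer is the local source of a host-initiated put.
    nvshmem_putmem(sym, local, nbytes, peer);
    nvshmem_quiet();

    nvshmemx_buffer_unregister(local);
    nvshmem_free(sym);
    cudaFree(local);
    nvshmem_finalize();
    return 0;
}
```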
Compatibility
NVSHMEM 2.4.1 has been tested with the following:
CUDA:
On x86 and Power 9 processors
Limitations
VMM support is disabled by default on Power 9 systems because of a performance regression.
MPG support is not yet available on Power 9 systems.
Systems with PCIe peer-to-peer communication require one of the following:
InfiniBand to support NVSHMEM atomics APIs.
The use of NVSHMEM’s UCX transport, which, if IB is absent, will use sockets for atomics.
Fixed Issues
There are no fixed issues in this release.
Breaking Changes
There are no breaking changes in this release.
Deprecated Features
TBD
Known Issues
NVSHMEM can only be linked statically.
This is because the linking of CUDA device symbols does not work across shared libraries.
nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand.
They do not ensure global ordering and visibility.
Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
When built with GDRCopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space.
This will be fixed in future CUDA driver releases in the 470-and-later and 460 branches.
When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.
With CUDA 11.3 and later, NVSHMEM supports the mapping of the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.
NVSHMEM 2.2.1#
This is the NVIDIA NVSHMEM 2.2.1 release notes.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements:
Implemented dynamic heap memory allocation for runs with P2P GPUs.
This feature, which requires CUDA version 11.3 or later, can be enabled using NVSHMEM_DISABLE_CUDA_VMM=0. Support for IB runs will be added in the next release.
Improved UCX transport performance for AMO and RMA operations.
Improved performance for warp and block put/get operations.
Added atomic support for PCIe-connected GPUs over the UCX transport.
The UCX transport now supports non-symmetric buffers for use as local buffers in RMA and AMO operations.
Added support to initialize NVSHMEM in CUmodule.
Enabled MPI and PMIx bootstrap modules to be compiled externally from the NVSHMEM build.
This allows multiple builds of these plugins to support various MPI and PMIx libraries. To select the plugins, set NVSHMEM_BOOTSTRAP=”plugin” and NVSHMEM_BOOTSTRAP_PLUGIN=”plugin_name.so”.
Note: The plugin sources are installed with the compiled NVSHMEM library.
Enabled MPI bootstrap to be used with nvshmem_init.
You can set NVSHMEM_BOOTSTRAP=MPI or use the bootstrap plugin method.
Fixed bugs in nvshmem_<typename>_g and the fetch atomics implementation.
Changed nvshmem_<typename>_collect to nvshmem_<typename>_fcollect to match the OpenSHMEM specification.
Changed the type of the nreduce argument in the reduction API to size_t to match the OpenSHMEM specification.
Improved NVSHMEM build times with a multi-threaded option in the CUDA compiler (requires CUDA version 11.2 and later).
Several fixes to address Coverity reports.
Compatibility
NVSHMEM 2.2.1 has been tested with the following:
CUDA:
On x86 and Power 9 processors
Limitations
Systems with PCIe peer-to-peer communication require one of the following:
InfiniBand to support NVSHMEM atomics APIs.
The use of NVSHMEM’s UCX transport that, if IB is absent, will use sockets for atomics.
Fixed Issues
Breaking Changes
Changed nvshmem_<typename>_collect to nvshmem_<typename>_fcollect to match the OpenSHMEM specification.
Changed the type of the nreduce argument in the reduction API to size_t to match the OpenSHMEM specification.
Removed support for host-side NVSHMEM wait APIs.
Deprecated Features
TBD
Known Issues
NVSHMEM can only be linked statically.
This is because the linking of CUDA device symbols does not work across shared libraries.
nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand.
They do not ensure global ordering and visibility.
Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
When built with GDRCopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space.
This will be fixed in future CUDA driver releases in the 470-and-later and 460 branches.
When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.
With CUDA 11.3 and later, NVSHMEM supports the mapping of the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.
NVSHMEM 2.1.2#
This is the NVIDIA NVSHMEM 2.1.2 release notes.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements:
Added a new UCX internode communication transport layer.
Note: UCX is experimental for this release.
Added support for the automatic warp-level coalescing of nvshmem_g operations.
Added support for put-with-signal operations on CUDA streams (a sketch follows this list).
Added support to map the symmetric heap by using the cuMem APIs.
Improved the performance of the single-threaded NVSHMEM put/get device API.
Added the NVSHMEM_MAX_TEAMS environment variable to specify the maximum number of teams that can be created.
Improved the host and on-stream Alltoall performance by using NCCL.
Fixed a bug in the compare-and-swap operation that caused several bytes of the compare operand to be lost.
Added CPU core affinity to debugging output.
Added support for the CUDA 11.3 cudaDeviceFlushGPUDirectRDMAWrites API for consistency.
Improved support for the NVIDIA Tools Extension (NVTX) to enable performance analysis through NVIDIA Nsight.
Removed the NVSHMEM_IS_P2P_RUN environment variable because the runtime automatically determines it.
Made improvements to NVSHMEM example codes.
Added the NVSHMEM_REMOTE_TRANSPORT environment variable to select the networking layer that is used for communication between nodes.
Set the maxrregcount to 32 for non-inlined device functions to ensure that calling these NVSHMEM functions does not negatively affect kernel occupancy.
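As a rough sketch of the put-with-signal-on-stream feature noted above, the call below pairs a payload transfer with a signal update enqueued on a CUDA stream. The prototype shown (including the sig_op argument and the NVSHMEM_SIGNAL_SET constant) follows later NVSHMEM headers and is an assumption for this release; consult nvshmemx.h.

```cpp
// Minimal sketch: put-with-signal enqueued on a CUDA stream. The prototype
// with a sig_op argument and NVSHMEM_SIGNAL_SET is assumed from later headers.
#include <cuda_runtime.h>
#include <stdint.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();
    int peer = (nvshmem_my_pe() + 1) % nvshmem_n_pes();

    size_t nbytes = 1 << 20;
    void *src = nvshmem_malloc(nbytes);
    void *dst = nvshmem_malloc(nbytes);
    uint64_t *flag = (uint64_t *) nvshmem_malloc(sizeof(uint64_t));
    cudaMemset(flag, 0, sizeof(uint64_t));
    nvshmem_barrier_all();               // flags are zeroed on every PE

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Enqueue the data transfer and the flag update as one stream operation;
    // the peer can poll 'flag' (for example, in a kernel) to detect arrival.
    nvshmemx_putmem_signal_on_stream(dst, src, nbytes, flag, 1,
                                     NVSHMEM_SIGNAL_SET, peer, stream);
    cudaStreamSynchronize(stream);

    nvshmem_free(src);
    nvshmem_free(dst);
    nvshmem_free(flag);
    cudaStreamDestroy(stream);
    nvshmem_finalize();
    return 0;
}
```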
Compatibility
NVSHMEM 2.1.2 has been tested with the following:
CUDA:
On x86 and Power 9 processors
Limitations
Systems with PCIe peer-to-peer communication require InfiniBand to support NVSHMEM atomics APIs.
Fixed Issues
Breaking Changes
Removed the following deprecated constants:
_NVSHMEM_MAJOR_VERSION
_NVSHMEM_MINOR_VERSION
_NVSHMEM_VENDOR_STRING
Removed support for the deprecated nvshmem_wait API.
Deprecated Features
TBD
Known Issues
NVSHMEM can only be linked statically.
This is because the linking of CUDA device symbols does not work across shared libraries.
nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand.
They do not ensure global ordering and visibility.
Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
In some cases, nvshmem_<typename>_g over InfiniBand and RoCE has been reported to return stale data.
We are continuing to investigate this issue. In the meantime, you can use nvshmem_<typename>_atomic_fetch as a workaround for nvshmem_<typename>_g, but the performance of these options is different.
When built with GDRCopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space.
This will be fixed in future CUDA driver releases in the 470-and-later and 460 branches.
When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.
With CUDA 11.3 and later, NVSHMEM supports the mapping of the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.
NVSHMEM 2.0.3#
This is the NVIDIA NVSHMEM 2.0.3 release notes.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements:
Added the teams and team-based collectives APIs from OpenSHMEM 1.5 (a sketch follows this list).
Added support to use the NVIDIA® Collective Communication Library (NCCL) for optimized NVSHMEM host and on-stream collectives.
Added support for RDMA over Converged Ethernet (RoCE) networks.
Added support for PMI-2 to enable an NVSHMEM job launch with srun/SLURM.
Added support for PMIx to enable an NVSHMEM job launch with PMIx-compatible launchers, such as Slurm and Open MPI.
Uniformly reformatted the perftest benchmark output.
Added support for the putmem_signal and signal_wait_until APIs.
Improved support for single-node environments without InfiniBand.
Fixed a bug that occurred when large numbers of fetch atomic operations were performed on InfiniBand.
Improved topology awareness in NIC-to-GPU assignments for NVIDIA® DGX™ A100 systems.
Added the CUDA_LIMIT_STACK_SIZE environment variable to set the GPU thread stack size on Power systems.
Updated the threading level support that was reported for host and stream-based APIs to NVSHMEM_THREAD_SERIALIZED.
Device-side APIs support NVSHMEM_THREAD_MULTIPLE.
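The sketch below illustrates the teams interface noted above by splitting NVSHMEM_TEAM_WORLD into a team of the even-numbered PEs; the signatures mirror the OpenSHMEM 1.5 teams API that this release adopts, but the code is only an illustrative sketch.

```cpp
// Minimal sketch of the OpenSHMEM 1.5-style teams API: split the world team
// into the even-numbered PEs and query the calling PE's rank within it.
#include <stdio.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    nvshmem_init();
    int npes = nvshmem_n_pes();

    nvshmem_team_t even_team = NVSHMEM_TEAM_INVALID;

    // start = 0, stride = 2, size = number of even PEs: selects PEs 0, 2, 4, ...
    nvshmem_team_split_strided(NVSHMEM_TEAM_WORLD, 0, 2, (npes + 1) / 2,
                               NULL, 0, &even_team);

    if (even_team != NVSHMEM_TEAM_INVALID) {
        printf("PE %d is rank %d of %d in the even team\n", nvshmem_my_pe(),
               nvshmem_team_my_pe(even_team), nvshmem_team_n_pes(even_team));
        nvshmem_team_destroy(even_team);
    }

    nvshmem_finalize();
    return 0;
}
```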
Compatibility
NVSHMEM 2.0.3 has been tested with the following:
The following versions of CUDA:
x86 and Power 9
Limitations
There are no limitations in this release.
Fixed Issues
Concurrent NVSHMEM collective operations with active sets are not supported.
Concurrent NVSHMEM memory allocation operations and collective operations are not supported.
The OpenSHMEM specification has clarified that memory management routines operate on NVSHMEM_TEAM_WORLD and no other collectives on that team are permitted concurrently.
Breaking Changes
Removed support for active set-based collectives interface in OpenSHMEM.
Deprecated Features
TBD
Known Issues
NVSHMEM and libraries that use NVSHMEM can only be built as static libraries and not as shared libraries.
This is because the linking of CUDA device symbols does not work across shared libraries.
nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure PE-PE ordering and visibility on systems with NVLink and InfiniBand.
They do not ensure global ordering and visibility.
Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
In some cases, nvshmem_<typename>_g over InfiniBand and RoCE has been reported to return stale data.
We are continuing to investigate this issue. In the meantime, you can use nvshmem_<typename>_atomic_fetch as a workaround for nvshmem_<typename>_g, but the performance of these options is different.
NVSHMEM 2.0.2 EA#
Abstract: NVSHMEM is an NVIDIA-based “shared memory” library that provides an easy-to-use CPU-side interface to allocate pinned memory that is symmetrically distributed across a cluster of NVIDIA GPUs. These release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 2.0.3 and earlier releases.
This is the NVIDIA NVSHMEM 2.0.2 EA release notes.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements:
Added the teams and team-based collectives APIs from OpenSHMEM 1.5.
Added support to use the NVIDIA Collective Communication Library (NCCL) for optimized NVSHMEM host and on-stream collectives.
Note: This feature is not yet supported on Power 9 systems.
Added support for RDMA over Converged Ethernet (RoCE) networks.
Added support for PMI-2 to enable an NVSHMEM job launch with srun/SLURM.
Added support for PMIx to enable an NVSHMEM job launch with PMIx-compatible launchers, such as Slurm and Open MPI.
Uniformly reformatted the perftest benchmark output.
Added support for the putmem_signal and signal_wait_until APIs.
Improved support for single-node environments without InfiniBand.
Fixed a bug that occurred when large numbers of fetch atomic operations were performed on InfiniBand.
Improved topology awareness in NIC-to-GPU assignments for DGX A100 systems.
Compatibility
NVSHMEM 2.0.2 EA has been tested with the following:
The following versions of CUDA:
x86 and Power 9
Limitations
NVSHMEM with NCCL is not yet supported on Power 9 systems.
Fixed Issues
Concurrent NVSHMEM collective operations with active sets are not supported.
Concurrent NVSHMEM memory allocation operations and collective operations are not supported.
The OpenSHMEM specification has clarified that memory management routines operate on NVSHMEM_TEAM_WORLD and no other collectives on that team are permitted concurrently.
Breaking Changes
Removed support for active set-based collectives interface in OpenSHMEM.
Deprecated Features
TBD
Known Issues
NVSHMEM and libraries that use NVSHMEM can only be built as static libraries and not as shared libraries.
This is because the linking of CUDA device symbols does not work across shared libraries.
nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure PE-PE ordering and visibility on systems with NVLink and InfiniBand.
They do not ensure global ordering and visibility.
Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.
In some cases, nvshmem_<typename>_g over InfiniBand and RoCE has been reported to return stale data.
We are continuing to investigate this issue. In the meantime, you can use nvshmem_<typename>_atomic_fetch as a workaround for nvshmem_<typename>_g, but the performance of these options is different.
Abstract: NVSHMEM is an NVIDIA-based “shared memory” library that provides an easy-to-use CPU-side interface to allocate pinned memory that is symmetrically distributed across a cluster of NVIDIA GPUs. These release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 1.1.3 and earlier releases.
NVSHMEM 1.1.3#
This is the NVIDIA NVSHMEM 1.1.3 release notes.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements:
Implemented the nvshmem_<type>_put_signal API from OpenSHMEM 1.5 (a sketch of the signal APIs follows this list).
Added the nvshmemx_signal_op API.
Optimized the implementation of a signal set operation over P2P connected GPUs.
Optimized the performance of the nvshmem_fence() function.
Optimized the latency of the NVSHMEM atomics API.
Fixed a bug in the nvshmem_ptr API.
Fixed a bug in the implementation of the host-side strided transfer (iput, iget, and so on) API.
Fixed a bug in the on-stream reduction for the long long datatype.
Fixed a hang during nvshmem barrier collective operation.
Fixed __device__ nvshmem_quiet() to also perform a quiet on IB operations to self.
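As a rough sketch of the signal APIs added in this release, the kernels below use nvshmemx_signal_op on one PE and a point-to-point wait on another. The constant and wait routine names (NVSHMEM_SIGNAL_SET, nvshmem_uint64_wait_until) are taken from later NVSHMEM headers and may differ in 1.1.3; treat them as assumptions.

```cpp
// Minimal sketch of a device-side signal/wait pair. Names such as
// NVSHMEM_SIGNAL_SET and nvshmem_uint64_wait_until follow later headers
// and may differ in this release.
#include <cuda_runtime.h>
#include <stdint.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void signal_peer(uint64_t *flag, int peer) {
    // Set the peer's copy of 'flag' to 1.
    nvshmemx_signal_op(flag, 1, NVSHMEM_SIGNAL_SET, peer);
}

__global__ void wait_for_signal(uint64_t *flag) {
    // Block until the local copy of 'flag' becomes 1.
    nvshmem_uint64_wait_until(flag, NVSHMEM_CMP_EQ, 1);
}

int main(void) {
    nvshmem_init();
    int mype = nvshmem_my_pe();

    uint64_t *flag = (uint64_t *) nvshmem_malloc(sizeof(uint64_t));
    cudaMemset(flag, 0, sizeof(uint64_t));
    nvshmem_barrier_all();                      // flags are zeroed on every PE

    if (mype == 0 && nvshmem_n_pes() > 1) {
        signal_peer<<<1, 1>>>(flag, 1);
    } else if (mype == 1) {
        wait_for_signal<<<1, 1>>>(flag);
    }
    cudaDeviceSynchronize();

    nvshmem_free(flag);
    nvshmem_finalize();
    return 0;
}
```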
Compatibility
NVSHMEM 1.1.3 has been tested with the following:
CUDA 10.1, 10.2, and 11.0
x86 and PowerPC
Limitations
TBD
Fixed Issues
TBD
Breaking Changes
TBD
Deprecated Features
TBD
Known Issues
NVSHMEM and libraries that use NVSHMEM can only be built as static libraries, not as shared libraries.
This is because the linking of CUDA device symbols does not work across shared libraries.
Concurrent NVSHMEM collective operations with active sets are not supported.
Concurrent NVSHMEM memory allocation operations and collective operations are not supported.
nvshmem_barrier*, nvshmem_quiet and nvshmem_wait_until only ensure PE-PE ordering and visibility on systems with NVLink and InfiniBand.
They do not ensure global ordering and visibility.
NVSHMEM 1.0.1#
Abstract: NVSHMEM is an NVIDIA-based “shared memory” library that provides an easy-to-use CPU-side interface to allocate pinned memory that is symmetrically distributed across a cluster of NVIDIA GPUs. These release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 1.0.1 and earlier releases.
This is the NVIDIA NVSHMEM 1.0.1 release notes. This is the first official release of NVSHMEM.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements.
Combines the memory of multiple GPUs into a partitioned global address space that’s accessed through NVSHMEM APIs.
Includes a low-overhead, in-kernel communication API for use by GPU threads (a sketch follows this list).
Includes stream-based and CPU-initiated communication APIs.
Supports peer-to-peer communication using NVIDIA® NVLink® and PCI Express and for GPU clusters using NVIDIA Mellanox® InfiniBand.
Supports x86 and POWER9 processors.
Is interoperable with MPI and other OpenSHMEM implementations.
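As an illustration of this programming model, the minimal sketch below allocates one symmetric integer per PE and has every PE write its rank into its neighbor's copy from inside a kernel. The grid size and ring pattern are arbitrary choices for the example, not requirements of the library.

```cpp
// Minimal sketch of the NVSHMEM model: symmetric allocation plus an in-kernel
// one-sided put to the next PE in a ring.
#include <cuda_runtime.h>
#include <stdio.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void write_to_neighbor(int *dest, int mype, int npes) {
    int peer = (mype + 1) % npes;
    nvshmem_int_p(dest, mype, peer);    // one-sided put of a single int
}

int main(void) {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();

    int *dest = (int *) nvshmem_malloc(sizeof(int));   // symmetric allocation

    write_to_neighbor<<<1, 1>>>(dest, mype, npes);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();              // all puts are complete and visible

    int value;
    cudaMemcpy(&value, dest, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", mype, value);

    nvshmem_free(dest);
    nvshmem_finalize();
    return 0;
}
```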
Compatibility
NVSHMEM 1.0.1 has been tested with the following:
CUDA 10.1, 10.2, and 11.0 RC
x86 and PowerPC
Limitations
TBD
Fixed Issues
TBD
Breaking Changes
TBD
Deprecated Features
TBD
Known Issues
NVSHMEM and libraries that use NVSHMEM can only be built as static libraries, not as shared libraries. This is because linking of CUDA device symbols does not work across shared libraries.
NVSHMEM collective operations with overlapping active sets are known not to work in some scenarios.
nvshmem_quiet only ensures PE-PE visibility and not global visibility of data.
NVSHMEM 0.4.1 EA#
This is the NVIDIA NVSHMEM 0.4.1 Early Access (EA) release notes. This release includes fixes from the previous NVSHMEM 0.x releases as well as the following additional changes.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements.
Implemented several bug fixes.
Improved NVSHMEM memory consumption and bootstrapping time at large scales.
Fixed maximum number of outstanding RDMA reads/atomics in an InfiniBand endpoint.
Added MPI launcher support to programs in the examples/ directory.
Fixed and improved performance tests.
Added automatic detection of libmpi_ibm.so when libmpi.so is not found.
Added support for nvshmem_<type>_wait(test)_any/all/some API.
Added missing OpenSHMEM 1.3 constants.
Added support for shmem_<type>_g API over InfiniBand.
Improved reduce collective operation performance. Added a recursive exchange-based algorithm.
Optimized shmem_quiet implementation.
Code improvements - separation of external and internal header files, improved internal function/macro naming, code reuse, warnings cleanup, fixes to some build issues, etc.
Compatibility
NVSHMEM 0.4.1 EA has been tested with the following:
CUDA 9.1, 10.1, and 10.2
PowerPC and x86
Limitations
TBD
Fixed Issues
Fixed a bug in nvshmem_malloc/free implementation.
Fixed CUDA thread ID calculation for multi-dimensional grids for blocking and non-blocking put/get calls.
Fixed the implementation of the nvshmemx_collective_launch_query_gridsize API.
Fixed the memory consumption and bootstrapping scalability issues for very large scale runs.
Breaking Changes
Removed support for shmem_<type>_p/g/put/get as synchronization operations. Instead, use the nvshmemx_<type>_signal or atomics APIs for synchronization.
Removed volatile qualifier from the ivars argument in wait/test API for compliance with OpenSHMEM 1.4 API.
Renamed bitwise AMOs to the correct OpenSHMEM 1.4 names.
Renamed hydra launch binaries to nvshmrun and nvshmrun.hydra to avoid confusion with MPI binaries.
Renamed USE_MPI(SHMEM)_IN_TEST environment variables in performance tests to NVSHMEMTEST_USE_MPI(SHMEM)_LAUNCHER.
Deprecated Features
TBD
Known Issues
Certain combinations/sequence of calls to nvshmem_malloc and nvshmem_free can lead to segmentation faults.
NVSHMEM and libraries that use NVSHMEM can only be built as static libraries, not as shared libraries. This is because linking of CUDA device symbols does not work across shared libraries.
NVSHMEM 0.3#
This is the NVSHMEM 0.3 release notes. This release includes fixes from the previous NVSHMEM 0.x releases as well as the following additional changes.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements.
Bug fixes
Fixes and clarification to reporting in performance tests
Availability through the apply-for-access program
Compatibility
NVSHMEM 0.3 has been tested with the following:
This NVSHMEM release supports CUDA 10.1.
Limitations
NA
Deprecated Features
NA
Known Issues
NA
NVSHMEM 0.2.5#
This is the NVSHMEM 0.2.5 release notes. This release includes fixes from the previous NVSHMEM 0.x releases as well as the following additional changes.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements.
Released sources under EULA.
Enhanced OpenSHMEM and MPI interoperability to work with any implementation.
Added Clang support.
Added timeout mechanism in debug mode that forces application to exit when it hangs in NVSHMEM API calls.
Improved NIC-PE affinity detection on CORAL systems.
Removed OpenSHMEM APIs to avoid symbol conflicts when using MPI/OpenSHMEM interoperability. Now all NVSHMEM APIs have an nv* prefix.
Removed Hydra sources from the package. Provided a script to download and install them.
Bug fixes
Compatibility
NVSHMEM 0.2.5 has been tested with the following:
This NVSHMEM release supports CUDA 10.1.
Limitations
This release of NVSHMEM only supports groups of GPUs that are all-to-all connected via PCIe, NVLink or InfiniBand. The NVSHMEM Developer Guide provides more information on system requirements.
The NVSHMEM memory model is defined briefly in the NVSHMEM Developer Guide, with explanations of aspects that are left undefined in the OpenSHMEM specification and relaxations to allow for a more efficient implementation on hardware with relaxed memory models.
A set of litmus tests (see POWER and ARM Litmus Tests) has been exercised to show progress towards building confidence in the correctness and completeness of the NVSHMEM memory model. Further extensions to the memory model that may allow for efficient implementations on hardware with relaxed memory models are being considered.
The NVSHMEM memory model definition is intended to facilitate discussion in the OpenSHMEM community towards a concrete and complete definition of the OpenSHMEM memory model in the specification. This process will influence changes in the NVSHMEM memory model as it is defined today.
This version of NVSHMEM is based on a prototype of OpenSHMEM for NVIDIA GPUs developed as part of the DoE Design Forward project. The prototype is available here. The prototype was an experimental effort and is not officially supported. It implements a very limited subset (put, get, wait_until) of the OpenSHMEM APIs compared to this version.
Deprecated Features
The following features are deprecated in NVSHMEM 0.2.5:
TBD
Fixed Issues
The following issues have been resolved in NVSHMEM 0.2.5:
TBD (GitHub issues <number and link>)
Known Issues
SHMEM_SYMMETRIC_SIZE environment variable
As defined in the OpenSHMEM standard, SHMEM_SYMMETRIC_SIZE is an environment variable for controlling the size of the symmetric heap in NVSHMEM. In NVSHMEM version 0.2, all allocations from the symmetric heap are aligned to a 512-byte boundary. As a consequence, allocations of 512 bytes or less consume 512 bytes.
NVSHMEM 0.2.4#
This is the NVSHMEM 0.2.4 release notes. This release includes fixes from the previous NVSHMEM 0.x releases as well as the following additional changes.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements.
Scalable barrier implementation
Contiguous RMA APIs in header (enables compiler to inline)
New environment variable for MPI and OpenSHMEM library selection
Bug fixes
Compatibility
NVSHMEM 0.2.4 has been tested with the following:
This NVSHMEM release supports CUDA 10.1.
NVSHMEM 0.2.3#
This is the NVSHMEM 0.2.3 release notes. This release includes fixes from the previous NVSHMEM 0.x releases as well as the following additional changes.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements.
Added Power9 support
Added InfiniBand support
Added interop support with OpenMPI-4.0.0/OSHMEM (depends on UCX-1.4.0)
NVSHMEM 0.1#
This is the NVSHMEM 0.1 release notes.
Key Features and Enhancements
This NVSHMEM release includes the following key features and enhancements.
Support for P2P platforms only (PCI-E and NVLink)
Added _warp and _block variants of all __device__ RMA and collective APIs
Added _on_stream variants of all __host__ RMA and collective APIs