Prior Releases

NVIDIA® NVSHMEM 2.8.0 Release Notes

Abstract

NVSHMEM is an NVIDIA-based “shared memory” library that provides an easy-to-use CPU-side interface to allocate pinned memory that is symmetrically distributed across a cluster of NVIDIA GPUs. These release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 2.8.0 and earlier releases.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements:

  • The transport formerly called GPU Initiated Communication (GIC) has been renamed to InfiniBand GPUDirect Async (IBGDA) to reflect the underlying technology used by that transport.

  • Improvements to the all-to-all algorithm were made for both the IBGDA and IBRC transports. These changes specifically focused on latency-bound all-to-all operations.

  • Support for RC connections was added to IBGDA to optimize workloads on small PE sets.

Compatibility

NVSHMEM 2.8.0 has been tested with the following:

Limitations

  • NVSHMEM is not yet compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library. Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and NVSHMEM_BOOTSTRAP_PMI=PMI-2. PMI-2 can also be set as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM. Jobs can also be launched directly by using the MPI or SHMEM bootstraps.

  • The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM.

  • Libfabric support on Slingshot-11 networks requires setting the environment variable FI_CXI_OPTIMIZED_MRS=false.

  • VMM support is disabled by default on Power 9 systems because of a performance regression.

  • MPG support is not yet available on Power 9 systems.

  • Systems with PCIe peer-to-peer communication require one of the following:

    • InfiniBand to support NVSHMEM atomics APIs.

    • The use of NVSHMEM’s UCX transport, which falls back to sockets for atomics if IB is absent.

  • NVSHMEM host APIs can be dynamically linked, but device APIs can only be statically linked.

    • This is because the linking of CUDA device symbols does not work across shared libraries.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand.

    • They do not ensure global ordering and visibility.

  • When built with GDRcopy and when using InfiniBand on older versions of the 460 driver and previous branches, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed in CUDA driver releases 470 and later and in the latest 460 driver.

  • When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.

With CUDA 11.3 and later, NVSHMEM supports mapping the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization (a minimal sketch follows this list). For additional information about synchronous CUDA memory operations, see API synchronization behavior.

  • IBGDA does not work with DMABUF.
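
The following is a minimal sketch of the explicit synchronization described above. The helper name `fill_and_put`, the staging through `cudaMemcpyAsync`, and the buffer sizes are illustrative assumptions; the only point taken from the note is that, with a VMM-mapped symmetric heap, the application must synchronize CUDA memory operations itself before handing the buffer to NVSHMEM.

```c
/* Sketch: manual synchronization with a VMM-mapped symmetric heap.
 * With a cudaMalloc-backed heap, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS would order
 * the copy automatically; with VMM it does not, so we synchronize explicitly. */
#include <cuda_runtime.h>
#include <nvshmem.h>

void fill_and_put(const int *host_src, size_t nelems, int peer, cudaStream_t stream) {
    int *sym = (int *)nvshmem_malloc(nelems * sizeof(int));   /* symmetric buffer */

    /* Stage data into the symmetric heap. */
    cudaMemcpyAsync(sym, host_src, nelems * sizeof(int), cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);   /* explicit synchronization required with VMM */

    nvshmem_int_put(sym, sym, nelems, peer);   /* host-side put of the staged data */
    nvshmem_quiet();

    nvshmem_free(sym);
}
```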

Fixed Issues

  • Fixed an issue in the IBGDA transport that caused all GPUs on the same host to use the same NIC.

  • Fixed the DMA-BUF registration issue. Users no longer need to limit their allocation granularity to work around that issue.

Breaking Changes

  • Due to the name change of the IBGDA transport, all IBGDA-related environment variables have changed. Please see the API docs and installation guide for more information.

Deprecated Features

  • n/a

Known Issues

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.

  • NVSHMEM buffers that span multiple physical memory regions are not well-handled in IBGDA. To work around this issue, either:

    • Set NVSHMEM_DISABLE_CUDA_VMM=1 and NVSHMEM_SYMMETRIC_SIZE=<size>, where size is large enough to cover your NVSHMEM memory usage, or

    • Set NVSHMEM_CUMEM_GRANULARITY=<size> such that it covers your application’s NVSHMEM memory consumption.

  • When using IBGDA, nvshmem_put, nvshmem_put_signal, and nvshmem_get do not support transferring more than 2 GiB of data in one call (a chunking sketch follows this list).
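
The 2 GiB limit above can be worked around by splitting a large transfer into multiple calls. This is a minimal sketch under that assumption; the 1 GiB chunk size and the helper name `put_large` are illustrative choices, not part of the library.

```c
/* Sketch: chunk a large host-side put to stay under the per-call 2 GiB limit. */
#include <stddef.h>
#include <nvshmem.h>

#define CHUNK_BYTES (1ull << 30)   /* 1 GiB per put, below the 2 GiB limit */

static void put_large(void *dest, const void *source, size_t bytes, int pe) {
    size_t offset = 0;
    while (offset < bytes) {
        size_t len = bytes - offset;
        if (len > CHUNK_BYTES) len = CHUNK_BYTES;
        nvshmem_putmem((char *)dest + offset, (const char *)source + offset, len, pe);
        offset += len;
    }
    nvshmem_quiet();   /* complete all chunks before reusing the source buffer */
}
```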

Abstract

NVSHMEM is an NVIDIA-based “shared memory” library that provides an easy-to-use CPU-side interface to allocate pinned memory that is symmetrically distributed across a cluster of NVIDIA GPUs. These release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 2.7.0 and earlier releases.

NVSHMEM 2.7.0

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements:

  • Default Hopper Support (i.e. sm_90 and compute_90)

  • A new (Experimental) CMake build system

  • Performance improvements to the GPU Initiated Communication (GIC) transport. Specifically, improvements were made to the synchronization and concurrency paths in GIC to improve the overall message rate of the transport.

  • Support for [CUDA minor version compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#minor-version-compatibility) in the NVSHMEM library and headers.

  • Compatibility checks for the inbuilt bootstrap plugins.

  • Limited DMA-BUF memory registration support. This enables using NVSHMEM core functionality without the nv_peer_mem or nvidia_peermem modules. DMA-BUF registrations are only supported up to 4 GiB in NVSHMEM 2.7.

  • SO versioning for both the nvshmem_host shared library and the precompiled bootstrap modules.

  • NVSHMEM now links statically to libcudart_static.a instead of libcudart.so. This increases the NVSHMEM library size but removes the requirement for applications to provide this dependency for NVSHMEM.

Compatibility

NVSHMEM 2.7.0 has been tested with the following:

Limitations

  • NVSHMEM is not yet compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library. Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and NVSHMEM_BOOTSTRAP_PMI=PMI-2. PMI-2 can also be set as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM. Jobs can also be launched directly by using the MPI or SHMEM bootstraps.

  • The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM.

  • Libfabric support on Slingshot-11 networks requires setting the environment variable FI_CXI_OPTIMIZED_MRS=false.

  • VMM support is disabled by default on Power 9 systems because of a performance regression.

  • MPG support is not yet available on Power 9 systems.

  • Systems with PCIe peer-to-peer communication require one of the following:

    • InfiniBand to support NVSHMEM atomics APIs.

    • The use of NVSHMEM’s UCX transport, which falls back to sockets for atomics if IB is absent.

  • NVSHMEM host APIs can be dynamically linked, but device APIs can only be statically linked.

    • This is because the linking of CUDA device symbols does not work across shared libraries.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand.

    • They do not ensure global ordering and visibility.

  • When built with GDRcopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed in CUDA driver releases 470 and later and in the 460 branch.

  • When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.

With CUDA 11.3 and later, NVSHMEM supports mapping the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.

Fixed Issues

  • Fixed an issue in the local buffer registration path (`nvshmemx_buffer_register`) where collisions between overlapping memory regions were not properly handled (a usage sketch follows this list).

  • Fixed an issue causing validation errors in collective operations when all GPUs in a job are connected via PCIe without a remote transport using the proxy thread.
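
As a usage sketch for the registration path mentioned above: a non-symmetric buffer is registered, used as the local side of a put, and then unregistered. The int-returning prototypes of nvshmemx_buffer_register(void *addr, size_t length) and nvshmemx_buffer_unregister(void *addr) are assumptions here; check the NVSHMEM headers for the exact signatures.

```c
/* Sketch: register a non-symmetric local buffer for use in NVSHMEM RMA calls. */
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

void put_from_local(void *sym_dest, size_t bytes, int peer) {
    void *local = NULL;
    cudaMalloc(&local, bytes);                     /* ordinary, non-symmetric buffer */

    if (nvshmemx_buffer_register(local, bytes) != 0) {   /* assumed 0-on-success */
        cudaFree(local);                           /* registration failed; bail out */
        return;
    }

    nvshmem_putmem(sym_dest, local, bytes, peer);  /* local side is the registered buffer */
    nvshmem_quiet();

    nvshmemx_buffer_unregister(local);
    cudaFree(local);
}
```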

Breaking Changes

  • Support for Pascal devices was removed.

  • Users are welcome to attempt compiling from source and running NVSHMEM against Pascal GPUs using the NVCC_GENCODE options, but no further bug fixes or support for Pascal devices will be added to NVSHMEM.

Deprecated Features

  • n/a

Known Issues

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.

  • DMA-BUF registrations are only supported with buffers up to 4 GiB. For heaps or registrations larger than 4 GiB, nvidia_peermem or nv_peer_mem must be used.

NVSHMEM 2.6.0

This is the NVIDIA® NVSHMEM 2.6.0 release notes.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements:

  • Added a new GPU-initiated communication transport that allows kernel-initiated communication to be issued directly to the NIC, bypassing the CPU proxy thread. The transport is currently provided in experimental mode and is disabled by default. Please refer to the installation guide for how to enable it.

  • Updated the libfabric transport with initial support for Slingshot-11 networks. Performance tuning for the libfabric transport is ongoing.

  • Added collective algorithms for bcast/fcollect/reduce that use a low latency (LL) optimization by sending data and synchronization together, resulting in significant performance improvements.

  • Added warp- and block-scope implementations of the recursive exchange algorithm for reduce collectives.

  • Fixed a bug in the host/on-stream RMA API for very large data transfers.

  • Fixed a bug in the implementation of the nvshmem_fence and nvshmemx_quiet_on_stream APIs.

Compatibility

NVSHMEM 2.6.0 has been tested with the following:

Limitations

  • NVSHMEM is not yet compatible with the PMI client library on Cray systems and must use the NVSHMEM internal PMI-2 client library. Jobs can be launched with the PMI bootstrap by specifying --mpi=pmi2 to Slurm and NVSHMEM_BOOTSTRAP_PMI=PMI-2. PMI-2 can also be set as the default PMI by setting NVSHMEM_DEFAULT_PMI2=1 when building NVSHMEM. Jobs can also be launched directly by using the MPI or SHMEM bootstraps.

  • The libfabric transport does not yet support VMM, so VMM must be disabled by setting NVSHMEM_DISABLE_CUDA_VMM.

  • Libfabric support on Slingshot-11 networks requires setting the environment variable FI_CXI_OPTIMIZED_MRS=false.

  • VMM support is disabled by default on Power 9 systems because of a performance regression.

  • MPG support is not yet available on Power 9 systems.

  • Systems with PCIe peer-to-peer communication require one of the following:

    • InfiniBand to support NVSHMEM atomics APIs.

    • The use of NVSHMEM’s UCX transport, which falls back to sockets for atomics if IB is absent.

Fixed Issues

There are no fixed issues in this release.

Breaking Changes

There are no breaking changes in this release.

Deprecated Features

  • TBD

Known Issues

  • NVSHMEM device APIs can only be statically linked. This is because the linking of CUDA device symbols does not work across shared libraries.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand. They do not ensure global ordering and visibility.

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.

  • When built with GDRcopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This has been fixed in CUDA driver releases 470 and later and in the 460 branch.

  • When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.

With CUDA 11.3 and later, NVSHMEM supports mapping the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.

NVSHMEM 2.5.0

This is the NVIDIA® NVSHMEM 2.5.0 release notes.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements:

  • Added multi-instance support in NVSHMEM.

NVSHMEM now builds as two libraries, libnvshmem_host.so and libnvshmem_device.a, which allows an application to have multiple components (for example, shared libraries and the application) that use NVSHMEM.

Note: Support for the libnvshmem.a library still exists for legacy purposes but will eventually be removed.

  • Added the nvshmemx_init_status API to query the initialized state of NVSHMEM (a usage sketch follows this list).

  • Added support for CUDA_VISIBLE_DEVICES.

Support for CUDA_VISIBLE_DEVICES is not yet available with CUDA VMM and requires you to set NVSHMEM_DISABLE_CUDA_VMM=1.

  • Updated the PMI and PMI-2 bootstraps to plug-ins.

  • Added the nvshmem-info utility to display information about the NVSHMEM library.

  • Fixed warnings when using NVSHMEM in applications that compile without the Relocatable Device Code (RDC) option.

  • Renamed internal variables to avoid potential conflicts with variables in the application.

  • Implemented the nvshmem_alltoallmem API.

  • Improved the GPU-to-NIC assignment logic for the Summit/Sierra supercomputers.

  • Added quiet for kernel-initiated operations to host barrier operations.

  • Updated the descriptions for NVSHMEM environment variables that are displayed by using nvshmem-info or by setting NVSHMEM_INFO=1.

  • Added a symmetric size check to the NVSHMEM memory allocation functions.
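
The sketch below shows how the nvshmemx_init_status query mentioned above might be used to initialize NVSHMEM only when no other component has done so already. Treating NVSHMEM_STATUS_NOT_INITIALIZED as the "not yet initialized" return value is an assumption; consult nvshmemx.h for the full set of status codes.

```c
/* Sketch: initialize NVSHMEM only if it is not already initialized. */
#include <stdio.h>
#include <nvshmem.h>
#include <nvshmemx.h>

static void ensure_nvshmem_ready(void) {
    if (nvshmemx_init_status() == NVSHMEM_STATUS_NOT_INITIALIZED) {
        nvshmem_init();   /* another library or the application may have initialized it */
    }
    printf("PE %d of %d ready\n", nvshmem_my_pe(), nvshmem_n_pes());
}
```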

Compatibility

NVSHMEM 2.5.0 has been tested with the following:

Limitations

  • VMM support is disabled by default on Power 9 systems because of a performance regression.

  • MPG support is not yet available on Power 9 systems.

  • Systems with PCIe peer-to-peer communication require one of the following:

    • InfiniBand to support NVSHMEM atomics APIs.

    • The use of NVSHMEM’s UCX transport, which falls back to sockets for atomics if IB is absent.

Fixed Issues

There are no fixed issues in this release.

Breaking Changes

There are no breaking changes in this release.

Deprecated Features

  • TBD

Known Issues

  • NVSHMEM device APIs can only be statically linked. This is because the linking of CUDA device symbols does not work across shared libraries.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand. They do not ensure global ordering and visibility.

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.

  • When built with GDRcopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This will be fixed in future CUDA driver releases in the 470 (or later) and 460 branches.

  • When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.

With CUDA 11.3 and later, NVSHMEM supports mapping the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.

NVSHMEM 2.4.1

This is the NVIDIA NVSHMEM 2.4.1 release notes.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements:

  • Added limited support for Multiple Processes per GPU (MPG) on x86 platforms.

    • The amount of support depends on the availability of CUDA MPS.

    • MPG support is currently not available on Power 9 platforms.

  • Added a local buffer registration API that allows non-symmetric buffers to be used as local buffers in the NVSHMEM API.

  • Added support for dynamic symmetric heap allocation, which eliminates the need to specify NVSHMEM_SYMMETRIC_SIZE.

    • On x86 platforms, this feature is enabled by default and is available with CUDA version 11.3 or later.

    • On P9 platforms, this feature is disabled by default and can be enabled by using the NVSHMEM_DISABLE_CUDA_VMM environment variable.

  • Support for large RMA messages.

  • To build NVSHMEM without ibrc support, set NVSHMEM_IBRC_SUPPORT=0 in the build environment. This allows you to build and run NVSHMEM without the GDRCopy and OFED dependencies.

  • Support for calling nvshmem_init/finalize multiple times with an MPI bootstrap.

  • Improved testing coverage (large messages, exercising full GPU memory, and so on).

  • Improved the default PE-to-NIC assignment for NVIDIA DGX-2™ systems.

  • Optimized channel request processing by using the CPU proxy thread.

  • Added support for the shmem_global_exit API (a usage sketch follows this list).

  • Removed redundant barriers to improve the collectives’ performance.

  • Significant code refactoring to use templates instead of macros for internal functions.

  • Improved performance for device-side blocking RMA and strided RMA APIs.

  • Fixed a bug for buffers with large offsets into the NVSHMEM symmetric heap.
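
The following is a minimal sketch of the global exit API noted above, used here to abort all PEs when one PE hits an error. The nvshmem_global_exit(int status) spelling and the allocation-failure scenario are assumptions for illustration.

```c
/* Sketch: abort the whole job from one PE on an unrecoverable error. */
#include <stdio.h>
#include <nvshmem.h>

void check_alloc(void *buf) {
    if (buf == NULL) {
        fprintf(stderr, "PE %d: symmetric allocation failed, aborting job\n",
                nvshmem_my_pe());
        nvshmem_global_exit(1);   /* terminates all PEs with the given status */
    }
}
```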

Compatibility

NVSHMEM 2.4.1 has been tested with the following:

Limitations

  • VMM support is disabled by default on Power 9 systems because of a performance regression.

  • MPG support is not yet available on Power 9 systems.

  • Systems with PCIe peer-to-peer communication require one of the following:

    • InfiniBand to support NVSHMEM atomics APIs.

    • The use of NVSHMEM’s UCX transport, which falls back to sockets for atomics if IB is absent.

Fixed Issues

There are no fixed issues in this release.

Breaking Changes

There are no breaking changes in this release.

Deprecated Features

  • TBD

Known Issues

  • NVSHMEM can only be linked statically. This is because the linking of CUDA device symbols does not work across shared libraries.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand. They do not ensure global ordering and visibility.

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.

  • When built with GDRcopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This will be fixed in future CUDA driver releases in the 470 (or later) and 460 branches.

  • When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.

With CUDA 11.3 and later, NVSHMEM supports mapping the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.

NVSHMEM 2.2.1

This is the NVIDIA NVSHMEM 2.2.1 release notes.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements:

  • Implemented dynamic heap memory allocation for runs with P2P GPUs.

This feature, which requires CUDA version 11.3 or later, can be enabled using NVSHMEM_DISABLE_CUDA_VMM=0. Support for IB runs will be added in the next release.

  • Improved UCX transport performance for AMO and RMA operations.

  • Improved performance for warp and block put/get operations.

  • Added atomic support for PCIe-connected GPUs over the UCX transport.

  • The UCX transport now supports non-symmetric buffers for use as local buffers in RMA and AMO operations.

  • Added support to initialize NVSHMEM in a CUmodule.

  • Enabled MPI and PMIx bootstrap modules to be compiled externally from the NVSHMEM build.

This allows multiple builds of these plugins to support various MPI and PMIx libraries. To select a plugin, set NVSHMEM_BOOTSTRAP="plugin" and NVSHMEM_BOOTSTRAP_PLUGIN="plugin_name.so".

Note: The plugin sources are installed with the compiled NVSHMEM library.

  • Enabled the MPI bootstrap to be used with nvshmem_init.

You can set NVSHMEM_BOOTSTRAP=MPI or use the bootstrap plugin method (a minimal sketch follows this list).

  • Fixed bugs in nvshmem_<typename>_g and the fetch atomics implementation.

  • Changed nvshmem_<typename>_collect to nvshmem_<typename>_fcollect to match the OpenSHMEM specification.

  • Changed the type of the nreduce argument in the reduction API to size_t to match the OpenSHMEM specification.

  • Improved NVSHMEM build times with a multi-threaded option in the CUDA compiler (requires CUDA version 11.2 or later).

  • Several fixes to address Coverity reports.
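
The sketch below illustrates the MPI bootstrap path described above. Setting NVSHMEM_BOOTSTRAP through setenv() inside the program is an illustrative shortcut; the variable is normally set in the job's launch environment instead.

```c
/* Sketch: bootstrap NVSHMEM on top of an already-initialized MPI runtime. */
#include <stdlib.h>
#include <mpi.h>
#include <nvshmem.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    setenv("NVSHMEM_BOOTSTRAP", "MPI", 1);   /* select the MPI bootstrap */
    nvshmem_init();                          /* PEs map onto the MPI ranks */

    /* ... NVSHMEM allocation and communication ... */

    nvshmem_finalize();
    MPI_Finalize();
    return 0;
}
```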

Compatibility

NVSHMEM 2.2.1 has been tested with the following:

Limitations

  • Systems with PCIe peer-to-peer communication require one of the following:

    • InfiniBand to support NVSHMEM atomics APIs.

    • The use of NVSHMEM’s UCX transport, which falls back to sockets for atomics if IB is absent.

Fixed Issues

Breaking Changes

  • Changed nvshmem_<typename>_collect to nvshmem_<typename>_fcollect to match the OpenSHMEM specification.

  • Changed the type of the nreduce argument in the reduction API to size_t to match the OpenSHMEM specification.

  • Removed support for host-side NVSHMEM wait APIs.

Deprecated Features

  • TBD

Known Issues

  • NVSHMEM can only be linked statically. This is because the linking of CUDA device symbols does not work across shared libraries.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand. They do not ensure global ordering and visibility.

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.

  • When built with GDRcopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This will be fixed in future CUDA driver releases in the 470 (or later) and 460 branches.

  • When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.

With CUDA 11.3 and later, NVSHMEM supports mapping the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.

NVSHMEM 2.1.2

This is the NVIDIA NVSHMEM 2.1.2 release notes.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements:

  • Added a new UCX internode communication transport layer.

Note: UCX is experimental for this release.

  • Added support for the automatic warp-level coalescing of nvshmem_g operations.

  • Added support for put-with-signal operations on CUDA streams (a minimal sketch follows this list).

  • Added support to map the symmetric heap by using the cuMem APIs.

  • Improved the performance of the single-threaded NVSHMEM put/get device API.

  • Added the NVSHMEM_MAX_TEAMS environment variable to specify the maximum number of teams that can be created.

  • Improved the host and on-stream Alltoall performance by using NCCL.

  • Fixed a bug in the compare-and-swap operation that caused several bytes of the compare operand to be lost.

  • Added CPU core affinity to debugging output.

  • Added support for the CUDA 11.3 cudaDeviceFlushGPUDirectRDMAWrites API for consistency.

  • Improved support for the NVIDIA Tools Extension (NVTX) to enable performance analysis through NVIDIA Nsight.

  • Removed the NVSHMEM_IS_P2P_RUN environment variable, because the runtime automatically determines it.

  • Made improvements to NVSHMEM example codes.

  • Added the NVSHMEM_REMOTE_TRANSPORT environment variable to select the networking layer that is used for communication between nodes.

  • Set maxrregcount to 32 for non-inlined device functions to ensure that calling these NVSHMEM functions does not negatively affect kernel occupancy.
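
A minimal sketch of the stream-ordered put-with-signal mentioned above follows. The exact parameter list of nvshmemx_putmem_signal_on_stream, including the NVSHMEM_SIGNAL_SET operation argument, is an assumption based on the OpenSHMEM 1.5 put-with-signal shape; check nvshmemx.h for the prototype in your NVSHMEM version.

```c
/* Sketch: enqueue a put-with-signal on a CUDA stream. */
#include <stdint.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

void stream_put_with_signal(void *dest, const void *src, size_t bytes,
                            uint64_t *sig_addr, int peer, cudaStream_t stream) {
    /* The signal becomes visible on the peer only after the payload is delivered. */
    nvshmemx_putmem_signal_on_stream(dest, src, bytes,
                                     sig_addr, 1 /* signal value */,
                                     NVSHMEM_SIGNAL_SET, peer, stream);
    cudaStreamSynchronize(stream);   /* local completion of the enqueued operation */
}
```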

Compatibility

NVSHMEM 2.1.2 has been tested with the following:

Limitations

  • Systems with PCIe peer-to-peer communication require InfiniBand to support NVSHMEM atomics APIs.

Fixed Issues

Breaking Changes

  • Removed the following deprecated constants:

    • _NVSHMEM_MAJOR_VERSION

    • _NVSHMEM_MINOR_VERSION

    • _NVSHMEM_VENDOR_STRING

  • Removed support for the deprecated nvshmem_wait API.

Deprecated Features

  • TBD

Known Issues

  • NVSHMEM can only be linked statically. This is because the linking of CUDA device symbols does not work across shared libraries.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure ordering and visibility between source and destination PEs on systems with NVLink and InfiniBand. They do not ensure global ordering and visibility.

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.

  • In some cases, nvshmem_<typename>_g over InfiniBand and RoCE has been reported to return stale data. We are continuing to investigate this issue. In the meantime, you can use nvshmem_<typename>_atomic_fetch as a workaround for nvshmem_<typename>_g, but the performance of these options is different.

  • When built with GDRcopy and when using InfiniBand, NVSHMEM cannot allocate the complete device memory because of the inability to reuse the BAR1 space. This will be fixed in future CUDA driver releases in the 470 (or later) and 460 branches.

  • When NVSHMEM maps the symmetric heap using cudaMalloc, it sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute, which automatically synchronizes synchronous CUDA memory operations on the symmetric heap.

With CUDA 11.3 and later, NVSHMEM supports mapping the symmetric heap by using the CUDA VMM APIs. However, when you map the symmetric heap by using the VMM APIs, CUDA does not support this attribute, and users are responsible for the synchronization. For additional information about synchronous CUDA memory operations, see API synchronization behavior.

NVSHMEM 2.0.3

This is the NVIDIA NVSHMEM 2.0.3 release notes.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements:

  • Added the teams and team-based collectives APIs from OpenSHMEM 1.5 (a minimal sketch follows this list).

  • Added support to use the NVIDIA® Collective Communication Library (NCCL) for optimized NVSHMEM host and on-stream collectives.

  • Added support for RDMA over Converged Ethernet (RoCE) networks.

  • Added support for PMI-2 to enable an NVSHMEM job launch with srun/Slurm.

  • Added support for PMIx to enable an NVSHMEM job launch with PMIx-compatible launchers, such as Slurm and Open MPI.

  • Uniformly reformatted the perftest benchmark output.

  • Added support for the putmem_signal and signal_wait_until APIs.

  • Improved support for single-node environments without InfiniBand.

  • Fixed a bug that occurred when large numbers of fetch atomic operations were performed on InfiniBand.

  • Improved topology awareness in NIC-to-GPU assignments for NVIDIA® DGX™ A100 systems.

  • Added the CUDA_LIMIT_STACK_SIZE environment variable to set the GPU thread stack size on Power systems.

  • Updated the threading level support that is reported for host and stream-based APIs to NVSHMEM_THREAD_SERIALIZED. Device-side APIs support NVSHMEM_THREAD_MULTIPLE.
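
The following is a minimal sketch of the teams API noted above: NVSHMEM_TEAM_WORLD is split into a team containing the even-numbered PEs. The even/odd split is an illustrative choice, and the NVSHMEM_TEAM_INVALID comparison follows the OpenSHMEM 1.5 convention.

```c
/* Sketch: create a team of the even-numbered PEs and query membership. */
#include <stdio.h>
#include <nvshmem.h>

void make_even_team(void) {
    nvshmem_team_t even_team = NVSHMEM_TEAM_INVALID;
    int npes = nvshmem_n_pes();

    /* Collective over NVSHMEM_TEAM_WORLD: every PE passes the same arguments.
     * PEs that are not members of the new team receive NVSHMEM_TEAM_INVALID. */
    nvshmem_team_split_strided(NVSHMEM_TEAM_WORLD, 0 /* start */, 2 /* stride */,
                               (npes + 1) / 2 /* size */, NULL, 0, &even_team);

    if (even_team != NVSHMEM_TEAM_INVALID) {
        printf("world PE %d is PE %d of %d in the even team\n",
               nvshmem_my_pe(), nvshmem_team_my_pe(even_team),
               nvshmem_team_n_pes(even_team));
        nvshmem_team_destroy(even_team);
    }
}
```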

Compatibility

NVSHMEM 2.0.3 has been tested with the following:

Limitations

  • There are no limitations in this release.

Fixed Issues

  • Concurrent NVSHMEM collective operations with active sets are not supported.

  • Concurrent NVSHMEM memory allocation operations and collective operations are not supported.

The OpenSHMEM specification has clarified that memory management routines operate on NVSHMEM_TEAM_WORLD and no other collectives on that team are permitted concurrently.

Breaking Changes

  • Removed support for active set-based collectives interface in OpenSHMEM.

Deprecated Features

  • TBD

Known Issues

  • NVSHMEM and libraries that use NVSHMEM can only be built as static libraries and not as shared libraries.

This is because the linking of CUDA device symbols does not work across shared libraries.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure PE-PE ordering and visibility on systems with NVLink and InfiniBand.

They do not ensure global ordering and visibility.

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.

  • In some cases, nvshmem_<typename>_g over InfiniBand and RoCE has been reported to return stale data.

We are continuing to investigate this issue. In the meantime, you can use nvshmem_<typename>_atomic_fetch as a workaround for nvshmem_<typename>_g, but the performance of these options is different.

NVSHMEM 2.0.2 EA

Abstract: NVSHMEM is an NVIDIA-based “shared memory” library that provides an easy-to-use CPU-side interface to allocate pinned memory that is symmetrically distributed across a cluster of NVIDIA GPUs. These release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 2.0.3 and earlier releases.

This is the NVIDIA NVSHMEM 2.0.2 EA release notes.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements:

  • Added the teams and team-based collectives APIs from OpenSHMEM 1.5.

  • Added support to use the NVIDIA Collective Communication Library (NCCL) for optimized NVSHMEM host and on-stream collectives.

Note: This feature is not yet supported on Power 9 systems.

  • Added support for RDMA over Converged Ethernet (RoCE) networks.

  • Added support for PMI-2 to enable an NVSHMEM job launch with srun/SLURM.

  • Added support for PMIx to enable an NVSHMEM job launch with PMIx-compatible launchers, such as Slurm and Open MPI.

  • Uniformly reformatted the perftest benchmark output.

  • Added support for the putmem_signal and signal_wait_until APIs.

  • Improved support for single-node environments without InfiniBand.

  • Fixed a bug that occurred when large numbers of fetch atomic operations were performed on InfiniBand.

  • Improved topology awareness in NIC-to-GPU assignments for DGX A100 systems.

Compatibility

NVSHMEM 2.0.2 EA has been tested with the following:

Limitations

  • NVSHMEM with NCCL is not yet supported on Power 9 systems.

Fixed Issues

  • Concurrent NVSHMEM collective operations with active sets are not supported.

  • Concurrent NVSHMEM memory allocation operations and collective operations are not supported.

The OpenSHMEM specification has clarified that memory management routines operate on NVSHMEM_TEAM_WORLD and no other collectives on that team are permitted concurrently.

Breaking Changes

  • Removed support for active set-based collectives interface in OpenSHMEM.

Deprecated Features

  • TBD

Known Issues

  • NVSHMEM and libraries that use NVSHMEM can only be built as static libraries and not as shared libraries.

This is because the linking of CUDA device symbols does not work across shared libraries.

  • nvshmem_barrier*, nvshmem_quiet, and nvshmem_wait_until only ensure PE-PE ordering and visibility on systems with NVLink and InfiniBand.

They do not ensure global ordering and visibility.

  • Complex types, which are enabled by setting NVSHMEM_COMPLEX_SUPPORT at compile time, are not currently supported.

  • In some cases, nvshmem_<typename>_g over InfiniBand and RoCE has been reported to return stale data.

We are continuing to investigate this issue. In the meantime, you can use nvshmem_<typename>_atomic_fetch as a workaround for nvshmem_<typename>_g, but the performance of these options is different.

Abstract: NVSHMEM is an NVIDIA-based “shared memory” library that provides an easy-to-use CPU-side interface to allocate pinned memory that is symmetrically distributed across a cluster of NVIDIA GPUs. These release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 1.1.3 and earlier releases.

NVSHMEM 1.1.3

This is the NVIDIA NVSHMEM 1.1.3 release notes.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements:

  • Implemented the nvshmem_<type>_put_signal API from OpenSHMEM 1.5 (a minimal sketch follows this list).

  • Added the nvshmemx_signal_op API.

  • Optimized the implementation of a signal set operation over P2P connected GPUs.

  • Optimized the performance of the nvshmem_fence() function.

  • Optimized the latency of the NVSHMEM atomics API.

  • Fixed a bug in the nvshmem_ptr API.

  • Fixed a bug in the implementation of the host-side strided transfer (iput, iget, and so on) APIs.

  • Fixed a bug in the on-stream reduction for the long long datatype.

  • Fixed a hang during the NVSHMEM barrier collective operation.

  • Fixed __device__ nvshmem_quiet() to also quiet InfiniBand operations targeting the calling PE itself.
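
A minimal device-side sketch of the put-with-signal and signal-wait APIs listed above follows. The exact nvshmem_putmem_signal parameter list, including the NVSHMEM_SIGNAL_SET operation argument, is an assumption based on the present-day API; check nvshmem.h for the version you build against.

```c
/* Sketch: one thread per PE sends a payload plus a signal to its right
 * neighbor, then waits for the matching signal from its left neighbor. */
#include <stdint.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void exchange(int *dest, const int *src, size_t nelems, uint64_t *sig) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;

    if (threadIdx.x == 0) {
        /* The signal becomes visible only after the payload has been delivered. */
        nvshmem_putmem_signal(dest, src, nelems * sizeof(int),
                              sig, 1, NVSHMEM_SIGNAL_SET, peer);

        /* Block until our own signal word has been set by the left neighbor. */
        nvshmem_signal_wait_until(sig, NVSHMEM_CMP_EQ, 1);
    }
}
```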

Compatibility

NVSHMEM 1.1.3 has been tested with the following:

Limitations

  • TBD

Fixed Issues

  • TBD

Breaking Changes

  • TBD

Deprecated Features

  • TBD

Known Issues

  • NVSHMEM and libraries that use NVSHMEM can only be built as static libraries, not as shared libraries.

This is because the linking of CUDA device symbols does not work across shared libraries.

  • Concurrent NVSHMEM collective operations with active sets are not supported.

  • Concurrent NVSHMEM memory allocation operations and collective operations are not supported.

  • nvshmem_barrier*, nvshmem_quiet and nvshmem_wait_until only ensure PE-PE ordering and visibility on systems with NVLink and InfiniBand.

They do not ensure global ordering and visibility.

NVSHMEM 1.0.1

Abstract: NVSHMEM is an NVIDIA-based “shared memory” library that provides an easy-to-use CPU-side interface to allocate pinned memory that is symmetrically distributed across a cluster of NVIDIA GPUs. These release notes describe the key features, software enhancements and improvements, and known issues for NVSHMEM 1.0.1 and earlier releases.

This is the NVIDIA NVSHMEM 1.0.1 release notes. This is the first official release of NVSHMEM.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements.

  • Combines the memory of multiple GPUs into a partitioned global address space that is accessed through NVSHMEM APIs.

  • Includes a low-overhead, in-kernel communication API for use by GPU threads (a minimal sketch follows this list).

  • Includes stream-based and CPU-initiated communication APIs.

  • Supports peer-to-peer communication using NVIDIA® NVLink® and PCI Express and for GPU clusters using NVIDIA Mellanox® InfiniBand.

  • Supports x86 and POWER9 processors.

  • Is interoperable with MPI and other OpenSHMEM implementations.
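
The sketch below illustrates the in-kernel communication API noted above with the classic ring shift: each PE's kernel writes its PE number to its right neighbor. The one-GPU-per-PE device selection via `mype % ndev` is an illustrative assumption.

```c
/* Sketch: device-initiated ring put (compile as CUDA with relocatable device code). */
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void ring_put(int *dst) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(dst, mype, peer);   /* device-initiated put to the right neighbor */
}

int main(void) {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(mype % ndev);       /* simple PE-to-GPU mapping for illustration */

    int *dst = (int *)nvshmem_malloc(sizeof(int));
    ring_put<<<1, 1>>>(dst);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();            /* complete and order the puts across PEs */

    int got = -1;
    cudaMemcpy(&got, dst, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", mype, got);

    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```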

Compatibility

NVSHMEM 1.0.1 has been tested with the following:

Limitations

  • TBD

Fixed Issues

  • TBD

Breaking Changes

  • TBD

Deprecated Features

  • TBD

Known Issues

  • NVSHMEM and libraries that use NVSHMEM can only be built as static libraries, not as shared libraries. This is because linking of CUDA device symbols does not work across shared libraries.

  • NVSHMEM collective operations with overlapping active sets are known not to work in some scenarios.

  • nvshmem_quiet only ensures PE-PE visibility and not global visibility of data.

NVSHMEM 0.4.1 EA

This is the NVIDIA NVSHMEM 0.4.1 Early Access (EA) release notes. This release includes fixes from the previous NVSHMEM 0.x releases as well as the following additional changes.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements.

  • Implemented several bug fixes.

  • Improved NVSHMEM memory consumption and bootstrapping time at large scales.

  • Fixed maximum number of outstanding RDMA reads/atomics in an InfiniBand endpoint.

  • Added MPI launcher support to programs in the examples/ directory.

  • Fixed and improved performance tests.

  • Added automatic detection of libmpi_ibm.so when libmpi.so is not found.

  • Added support for the nvshmem_<type>_wait(test)_any/all/some APIs (a minimal sketch follows this list).

  • Added missing OpenSHMEM 1.3 constants.

  • Added support for shmem_<type>_g API over InfiniBand.

  • Improved reduce collective operation performance by adding a recursive exchange-based algorithm.

  • Optimized shmem_quiet implementation.

  • Code improvements - separation of external and internal header files, improved internal function/macro naming, code reuse, warnings cleanup, fixes to some build issues, etc.
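
As a minimal sketch of the wait-any support mentioned above, the device function below blocks until any one of a set of symmetric flags is set. The exact signature of nvshmem_int_wait_until_any (NULL status meaning "consider all entries", returning the index that satisfied the condition) is an assumption based on the matching OpenSHMEM routine; check nvshmem.h.

```c
/* Sketch: wait until any flag in a symmetric array has been set by a peer. */
#include <stddef.h>
#include <nvshmem.h>

__device__ void wait_for_any_flag(int *flags /* symmetric array */, size_t nflags) {
    size_t idx = nvshmem_int_wait_until_any(flags, nflags, NULL /* no mask */,
                                            NVSHMEM_CMP_EQ, 1);
    flags[idx] = 0;   /* consume the flag that fired */
}
```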

Compatibility

NVSHMEM 0.4.1 EA has been tested with the following:

Limitations

  • TBD

Fixed Issues

  • Fixed a bug in nvshmem_malloc/free implementation.

  • Fixed CUDA thread ID calculation for multi-dimensional grids for blocking and non-blocking put/get calls.

  • Fixed the implementation of the nvshmemx_collective_launch_query_gridsize API.

  • Fixed the memory consumption and bootstrapping scalability issues for very large scale runs.

Breaking Changes

  • Removed support for shmem_<type>_p/g/put/get as synchronization operations. Instead, use the nvshmemx_<type>_signal or atomics APIs for synchronization.

  • Removed volatile qualifier from the ivars argument in wait/test API for compliance with OpenSHMEM 1.4 API.

  • Renamed bitwise AMOs to correct OpenSHMEM 1.4 name.

  • Renamed hydra launch binaries to nvshmrun and nvshmrun.hydra to avoid confusion with MPI binaries.

  • Renamed USE_MPI(SHMEM)_IN_TEST environment variables in performance tests to NVSHMEMTEST_USE_MPI(SHMEM)_LAUNCHER.

Deprecated Features

  • TBD

Known Issues

  • Certain combinations/sequence of calls to nvshmem_malloc and nvshmem_free can lead to segmentation faults.

  • NVSHMEM and libraries that use NVSHMEM can only be built as static libraries, not as shared libraries. This is because linking of CUDA device symbols does not work across shared libraries.

NVSHMEM 0.3

This is the NVSHMEM 0.3 release notes. This release includes fixes from the previous NVSHMEM 0.x releases as well as the following additional changes.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements.

  • Bug fixes

  • Fixes and clarification to reporting in performance tests

  • Availability through the apply-for-access program

Compatibility

NVSHMEM 0.3 has been tested with the following:

Limitations

  • NA

Deprecated Features

  • NA

Known Issues

  • NA

NVSHMEM 0.2.5

This is the NVSHMEM 0.2.5 release notes. This release includes fixes from the previous NVSHMEM 0.x releases as well as the following additional changes.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements.

  • Released sources under EULA.

  • Enhanced OpenSHMEM and MPI interoperability to work with any implementation.

  • Added Clang support.

  • Added a timeout mechanism in debug mode that forces the application to exit when it hangs in NVSHMEM API calls.

  • Improved NIC-PE affinity detection on CORAL systems.

  • Removed OpenSHMEM APIs to avoid symbol conflicts when using MPI/OpenSHMEM interoperability. Now all NVSHMEM APIs have an nv* prefix.

  • Removed Hydra sources from the package and provided a script to download and install them.

  • Bug fixes

Compatibility

NVSHMEM 0.2.5 has been tested with the following:

Limitations

  • This release of NVSHMEM only supports groups of GPUs that are all-to-all connected via PCIe, NVLink or InfiniBand. The NVSHMEM Developer Guide provides more information on system requirements.

  • The NVSHMEM memory model is defined briefly in the NVSHMEM Developer Guide, with explanations of aspects that are left undefined in the OpenSHMEM specification and relaxations to allow for a more efficient implementation on hardware with relaxed memory models.

  • A set of litmus tests (see POWER and ARM Litmus Tests) has been exercised to show progress towards building confidence in the correctness and completeness of the NVSHMEM memory model. Further extensions to the memory model that may allow for efficient implementations on hardware with relaxed memory models are being considered.

  • The NVSHMEM memory model definition is intended to facilitate discussion in the OpenSHMEM community towards a concrete and complete definition of the OpenSHMEM memory model in the specification. This process will influence changes in the NVSHMEM memory model as it is defined today.

  • This version of NVSHMEM is based on a prototype of OpenSHMEM for NVIDIA GPUs developed as part of the DoE Design Forward project. The prototype is available here. The prototype was an experimental effort and is not officially supported. It implements a very limited subset (put, get, wait_until) of the OpenSHMEM APIs compared to this version.

Deprecated Features

The following features are deprecated in NVSHMEM 0.2.5:

  • TBD

Fixed Issues

The following issues have been resolved in NVSHMEM 0.2.5:

  • TBD (GitHub issues <number and link>)

Known Issues

SHMEM_SYMMETRIC_SIZE environment variable

As defined in the OpenSHMEM standard, SHMEM_SYMMETRIC_SIZE is an environment variable for controlling the size of the symmetric heap in NVSHMEM. In NVSHMEM version 0.2, all allocations from the symmetric heap are aligned to a 512-byte boundary. As a consequence, an allocation of 512 bytes or less consumes a full 512 bytes of the heap.

NVSHMEM 0.2.4

This is the NVSHMEM 0.2.4 release notes. This release includes fixes from the previous NVSHMEM 0.x releases as well as the following additional changes.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements.

  • Scalable barrier implementation

  • Contiguous RMA APIs in header (enables compiler to inline)

  • New environment variable for MPI and OpenSHMEM library selection

  • Bug fixes

Compatibility

NVSHMEM 0.2.4 has been tested with the following:

NVSHMEM 0.2.3

This is the NVSHMEM 0.2.3 release notes. This release includes fixes from the previous NVSHMEM 0.x releases as well as the following additional changes.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements.

  • Added Power9 support

  • Added InfiniBand support

  • Added interop support with OpenMPI-4.0.0/OSHMEM (depends on UCX-1.4.0)

NVSHMEM 0.1

This is the NVSHMEM 0.1 release notes.

Key Features and Enhancements

This NVSHMEM release includes the following key features and enhancements.

  • Support for P2P platforms only (PCI-E and NVLink)

  • Added _warp and _block variants of all __device__ RMA and collective APIs

  • Added _on_stream variants of all __host__ RMA and collective APIs