NVSHMEM
3.3.0
Language Bindings Examples
Contents:
NVSHMEM4Py Examples
UID-Based Initialization Example
MPI Comm-Based Initialization Example
Torch.distributed ProcessGroup Initialization Example
Simple P2P Kernel Example
On-stream Kernels Example
PyTorch and Triton Interoperability Example
This directory contains examples of using the NVSHMEM language bindings.