NVSHMEM4Py Overview¶
NVSHMEM4Py is the official Python language binding for NVSHMEM, providing a Pythonic interface to the NVSHMEM library. It enables Python applications to leverage the high-performance Partitioned Global Address Space (PGAS) programming model offered by NVSHMEM for GPU-centric communication.
Current API Support¶
Currently, NVSHMEM4Py supports the native Pythonic extensions for host-side APIs. This means that supported NVSHMEM point-to-point and collective operations are dispatched from the host (CPU) side, potentially launching GPU communication kernels or GPU offload operations on a user-defined CUDA stream. Additionally, it can interoperate with Python domain-specific languages (DSLs) to author custom kernels targeting peer-to-peer communication using local and remote symmetric memory buffers. Native Pythonic extensions for device-side APIs in DSLs are not yet supported.
Key Features¶
- Seamless Integration with Python: NVSHMEM4Py allows Python applications to utilize NVSHMEM’s capabilities with a Pythonic interface.
- Interoperability: Native support for the broader CUDA Python ecosystem, with special support for PyTorch and CuPy. Additionally, NVSHMEM4Py interoperates with Python DSLs for authoring GPU kernels, including Numba-CUDA and Triton, enabling developers to write communication-aware GPU kernels entirely in Python.
- Symmetric Memory Management: Provides Python interfaces to allocate and manage symmetric memory across multiple GPUs.
Usage Model¶
NVSHMEM4Py follows the same programming model as the core NVSHMEM library, with adaptations to make it more Pythonic. Applications typically:
- Initialize the NVSHMEM environment
- Allocate symmetric memory
- Perform communication operations (put/get, collectives, etc.)
- Synchronize as needed
- Finalize the NVSHMEM environment
The Python API maintains the same PE (Processing Element) concept as the core NVSHMEM library, where each PE represents a process with its associated GPU.
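The lifecycle above can be sketched as follows. This is an illustrative sketch only, not a definitive implementation: it assumes the `nvshmem.core` module layout, MPI-based bootstrapping via `mpi4py`, and a launch of one process per GPU; consult the NVSHMEM4Py API reference for the exact function names and signatures of your installed version.

```python
# Illustrative sketch -- module paths, function names, and signatures
# below are assumptions; verify against the NVSHMEM4Py API reference.
import nvshmem.core as nvshmem           # assumed module path
from cuda.core.experimental import Device
from mpi4py import MPI                    # one of several bootstrap options

dev = Device()
dev.set_current()
stream = dev.create_stream()

# 1. Initialize the NVSHMEM environment (one PE per process/GPU)
nvshmem.init(device=dev, mpi_comm=MPI.COMM_WORLD, initializer_method="mpi")

# 2. Allocate symmetric memory (the same size on every PE)
buf = nvshmem.buffer(1 << 20)

# 3-4. Communication (put/get, collectives) and synchronization
#      would go here, dispatched from the host onto `stream`.

# 5. Release symmetric memory and finalize the environment
nvshmem.free(buf)
nvshmem.finalize()
```

Because every PE executes this same program, symmetric allocations like `buf` exist at corresponding addresses on all PEs, which is what allows one PE to name remote memory in put/get operations.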
Limitations¶
- Not all NVSHMEM C/C++ APIs are currently exposed in the Python binding
- Host-side dispatch from Python may introduce some overhead compared to the native C/C++ implementation
- Requires proper CUDA and NVSHMEM installation on the system
NVSHMEM4Py enables Python developers to write distributed GPU applications with a simple shared-memory-style programming model, making it easier to scale Python applications across multiple GPUs and nodes.