Python Device APIs
NVSHMEM4Py brings device-side APIs to Python, letting you call NVSHMEM operations directly from GPU kernels. It does this by integrating with Python device Domain-Specific Languages (DSLs): specialized tools that let you write and run GPU kernels in Python. These DSLs give you a familiar and efficient way to handle device-level parallel programming and communication.
NVSHMEM4Py currently has built-in support for Numba-CUDA. Numba is a popular Python library that compiles Python functions on the fly to run on CUDA-enabled GPUs. With Numba-CUDA, you can write Python functions as CUDA kernels, and NVSHMEM4Py takes this further by letting those kernels perform NVSHMEM operations such as remote memory access, collective operations, and synchronization directly on the GPU.
This integration makes it straightforward for Python developers to build high-performance, distributed GPU applications in natural Python code. You get the benefits of NVSHMEM’s one-sided communication and PGAS (Partitioned Global Address Space) model without leaving Python. As the Python GPU ecosystem grows, NVSHMEM4Py plans to support more device DSLs, making device-side NVSHMEM programming even more accessible and flexible.