Python Device APIs for CuTe DSL¶
This section documents the Python device APIs for using NVSHMEM with the CuTe DSL. CuTe provides a convenient way to author GPU kernels in Python, enabling rapid prototyping and development of CUDA code. By leveraging NVSHMEM’s Python bindings, you can write CuTe kernels that perform efficient GPU-to-GPU communication, such as remote memory access (RMA) and collective operations, directly from Python.
These APIs allow you to launch kernels that use NVSHMEM primitives for communication and synchronization between GPUs, making it easier to develop scalable, high-performance applications in Python. The following pages describe the available APIs and provide usage examples for integrating NVSHMEM with your CuTe workflows.
Available APIs¶
The CuTe NVSHMEM device bindings include:
- Collectives: `barrier`, `barrier_all`, `sync`, `sync_all`, `reduce`, `reducescatter`, `fcollect`, `broadcast`, `alltoall` (plus `nvshmemx_*_{block,warp}` variants).
- RMA: vector `put`/`get` (blocking and nonblocking `_nbi`), `put_signal`, and scalar `p`/`g`.
- Atomics: `fetch`/`set`/`swap`/`compare_swap`, `inc`/`fetch_inc`, `add`/`fetch_add`, bitwise `and`/`fetch_and`, `or`/`fetch_or`, `xor`/`fetch_xor`.
- Signalling: `signal_op`, `signal_wait_until`.
- Utilities: `n_pes`, `my_pe`, `team_n_pes`, `team_my_pe`.
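To show how these primitives compose inside a kernel, here is a pseudocode sketch of a simple ring exchange: each PE puts a tile to its next neighbor with `put_signal` and waits on `signal_wait_until` before reading. The import paths, kernel decorator, and argument order are assumptions for illustration only; consult the API reference pages that follow for the exact signatures.

```python
# Pseudocode sketch only. Module paths (cutlass.cute, nvshmem device bindings),
# the @cute.kernel decorator, and call signatures are assumptions, not
# verbatim API; the primitive names (my_pe, n_pes, put_signal,
# signal_wait_until, barrier_all) are those listed above.
import cutlass.cute as cute
import nvshmem.cute as shmem  # hypothetical import path

@cute.kernel
def ring_exchange(src: cute.Tensor, dst: cute.Tensor, sig: cute.Tensor):
    mype = shmem.my_pe()
    npes = shmem.n_pes()
    peer = (mype + 1) % npes       # next PE in the ring
    # Put the local tile into the peer's symmetric dst buffer, then raise
    # a signal the peer can wait on (blocking variant; an _nbi form exists).
    shmem.put_signal(dst, src, sig, ...)
    # Wait until the previous PE's signal arrives; dst is then safe to read.
    shmem.signal_wait_until(sig, ...)
    # Device-scope barrier across all PEs (block/warp variants also exist).
    shmem.barrier_all()
```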
Execution scopes and semantics:
- Scopes: `device`, `block`, `warp`.
- Semantics: blocking and nonblocking (`_nbi`) where supported.
Data types:
NVSHMEM4Py CuTe APIs accept CuTe Tensors as arguments; refer to the CuTe documentation for details. Torch tensors can also be used by converting them to CuTe Tensors via DLPack.
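The DLPack hand-off is protocol-based rather than Torch-specific: any producer that implements `__dlpack__` can be consumed. The minimal sketch below uses NumPy as a stand-in producer just to show the round trip; the CuTe-side entry point named in the trailing comment is an assumption, so check the CuTe documentation for the exact call.

```python
import numpy as np

# A producer tensor; in a real CuTe workflow this would be a torch.Tensor
# resident on the GPU rather than a NumPy array.
x = np.arange(6, dtype=np.float32).reshape(2, 3)

# Any object exposing the DLPack protocol can be handed to a consumer.
assert hasattr(x, "__dlpack__") and hasattr(x, "__dlpack_device__")

# Round-trip through DLPack (NumPy acts as both producer and consumer here).
y = np.from_dlpack(x)
print(y.shape)  # (2, 3)

# With CuTe the consumer-side call would look something like the line below
# (name assumed for illustration; see the CuTe docs for the exact function):
# t = cute.runtime.from_dlpack(torch_tensor)
```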