NVSHMEM Device Atomic Memory Operations with CuTe DSL
This section documents the NVSHMEM device-side atomic memory operations available to CuTe DSL kernels.
Example: Using atomic_add in a CuTe kernel
The following example shows how to use the NVSHMEM atomic add operation (atomic_add) in a CuTe kernel. The operation lets threads safely add to a value in symmetric memory on a remote PE (processing element).
import cutlass
from cutlass import cute
from cuda.core import Device, Stream
import nvshmem
import nvshmem.core.device.cute as nvshmem_cute
import nvshmem.core.interop.cute as nvshmem_cute_interop
from mpi4py import MPI

@cute.kernel
def atomic_add_kernel(dst: cute.Tensor, value: cutlass.Int32, pe: cutlass.Int32):
    # Atomically add 'value' to dst on remote PE
    nvshmem_cute.atomic_add(dst, value, pe)

@cute.jit
def atomic_add_launcher(dst, value, pe):
    atomic_add_kernel[1, 1](dst, value, pe)

# Initialize NVSHMEM
dev = Device()
dev.set_current()
stream = dev.create_stream()
nvshmem.init(dev=dev, mpi_comm=MPI.COMM_WORLD, initializer_method="mpi", stream=stream)

# Get information about the current PE
me = nvshmem.my_pe()
n_pes = nvshmem.n_pes()

# Choose a remote PE (for example, next PE in a ring)
pe = (me + 1) % n_pes

# Allocate device tensor using NVSHMEM symmetric memory
dst = nvshmem_cute_interop.tensor((1,), dtype=cute.Int32)

# Compile and launch the kernel
compiled_fn, nvshmem_kernel = nvshmem_cute_interop.cute_compile_helper(
    atomic_add_launcher, dst, 42, pe
)
compiled_fn(dst, 42, pe, stream=stream)

# Finalize NVSHMEM
nvshmem.core.library_finalize(nvshmem_kernel)
nvshmem_cute_interop.cleanup_cute()
nvshmem.finalize(dev=dev, stream=stream)
This example atomically adds 42 to the dst tensor on the next PE in a ring. The atomic operation is performed at the thread level, ensuring safe concurrent access.
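To reason about the result of the ring pattern without a GPU, the sketch below models it in plain Python. It is a host-side illustration only: simulate_ring_add is a hypothetical helper, not an NVSHMEM call, and it assumes dst starts at 0 on every PE, which the example above does not state.

```python
def simulate_ring_add(n_pes: int, value: int) -> list[int]:
    """Model every PE adding `value` to dst on its ring successor."""
    dst = [0] * n_pes  # assumption: dst is zero on every PE before the launch
    for me in range(n_pes):
        pe = (me + 1) % n_pes  # same target selection as the example
        dst[pe] += value       # stands in for atomic_add(dst, value, pe)
    return dst


result = simulate_ring_add(n_pes=4, value=42)
print(result)
```

Under that assumption each PE receives exactly one contribution, from its predecessor in the ring, so every entry ends up equal to value.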
nvshmem.core.device.cute.amo.atomic_inc(dst, pe)
Atomically increments the value at symmetric dst on PE pe by 1. Does not return the old value.
This is a thread-level remote atomic increment operation (non-fetching variant). Use atomic_fetch_inc if you need the value before the increment.
- Args:
  - dst: CuTe tensor view pointing to a single-element symmetric destination on PE pe. Must be a symmetric (NVSHMEM-allocated) tensor. Supported dtypes are integral types as determined by the NVSHMEM atomic inc dispatch table.
  - pe (int): Target PE.
- Note:
  - Passing an unsupported dtype raises RuntimeError at JIT compile time.
nvshmem.core.device.cute.amo.atomic_fetch_inc(dst, pe)
Atomically increments the value at symmetric dst on PE pe by 1, and returns the value prior to the increment.
This is a thread-level remote atomic fetch-and-increment operation.
- Args:
  - dst: CuTe tensor view pointing to a single-element symmetric destination on PE pe. Must be a symmetric (NVSHMEM-allocated) tensor. Supported dtypes are integral types as determined by the NVSHMEM atomic fetch-inc dispatch table.
  - pe (int): Target PE.
- Returns:
  - The value stored at dst on PE pe prior to the increment.
- Note:
  - Passing an unsupported dtype raises RuntimeError at JIT compile time.
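Because the returned value is the one observed immediately before the increment, concurrent callers each receive a distinct number. The sketch below models that property on the host with Python threads. ModelCounter is a hypothetical stand-in for a single-element symmetric integer, and its lock only models the atomicity of the remote operation; none of it is NVSHMEM API.

```python
import threading


class ModelCounter:
    """Host-side stand-in for a single-element symmetric integer."""

    def __init__(self, initial: int = 0):
        self._value = initial
        self._lock = threading.Lock()  # models the atomicity of the operation

    def fetch_inc(self) -> int:
        # Models atomic_fetch_inc(dst, pe): add 1, return the previous value.
        with self._lock:
            old = self._value
            self._value = old + 1
            return old

    @property
    def value(self) -> int:
        with self._lock:
            return self._value


counter = ModelCounter()
tickets = []
tickets_lock = threading.Lock()


def worker():
    ticket = counter.fetch_inc()
    with tickets_lock:
        tickets.append(ticket)


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each of the eight workers obtains a unique ticket in the range 0 to 7, in whatever order the threads happen to run.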
nvshmem.core.device.cute.amo.atomic_fetch(src, pe)
Atomically fetches (reads) the current value at symmetric src on PE pe.
This is a thread-level remote atomic operation. The read is performed atomically with respect to other atomic operations on the same location.
- Args:
  - src: CuTe tensor view pointing to a single-element symmetric source on PE pe. Must be a symmetric (NVSHMEM-allocated) tensor. The element dtype determines which underlying NVSHMEM atomic is dispatched.
  - pe (int): Source PE to fetch from.
- Returns:
  - The current value stored at src on PE pe, with the same dtype as src.
- Note:
  - Supported dtypes are determined by the NVSHMEM atomic fetch dispatch table. Passing an unsupported dtype raises RuntimeError at JIT compile time.
nvshmem.core.device.cute.amo.atomic_set(dst, value, pe)
Atomically sets the value at symmetric dst on PE pe to value.
This is a thread-level remote atomic store. The write is performed atomically with respect to other atomic operations on the same location.
- Args:
  - dst: CuTe tensor view pointing to a single-element symmetric destination on PE pe. Must be a symmetric (NVSHMEM-allocated) tensor. The element dtype determines which underlying NVSHMEM atomic is dispatched.
  - value: The value to store. Cast to the element dtype of dst before the operation.
  - pe (int): Target PE.
- Note:
  - This operation does not return the old value. Calling atomic_fetch before atomic_set gives you an earlier value, but the pair is two separate operations and another update can land between them; use atomic_swap if you need the previous value from the same atomic step. Supported dtypes are determined by the NVSHMEM atomic set dispatch table. Passing an unsupported dtype raises RuntimeError at JIT compile time.
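A fetch followed by a set is two separate atomic operations, so another PE can write between them. The sketch below replays one such interleaving by hand and contrasts it with a single swap. ModelCell is a hypothetical host-side stand-in written for this illustration, not an NVSHMEM type.

```python
class ModelCell:
    """Host-side stand-in for a single-element symmetric value."""

    def __init__(self, initial=0):
        self.value = initial

    def fetch(self):
        return self.value          # models atomic_fetch(src, pe)

    def set(self, new):
        self.value = new           # models atomic_set(dst, value, pe)

    def swap(self, new):
        # Models atomic_swap(dst, value, pe): store and return old in one step.
        old, self.value = self.value, new
        return old


# Interleaving 1: PE A uses fetch then set; PE B writes in between.
cell = ModelCell(initial=10)
seen_by_a = cell.fetch()   # A reads 10
cell.set(20)               # B writes 20 between A's two operations
cell.set(30)               # A writes 30 and never observes B's 20

# Interleaving 2: PE A uses one swap instead.
cell2 = ModelCell(initial=10)
cell2.set(20)                    # B writes 20 first
old_seen_by_a = cell2.swap(30)   # A replaces 20 and observes it
```

In the first interleaving A believes it replaced 10, although the value it overwrote was 20. With the swap, the value A receives is exactly the value it replaced.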
nvshmem.core.device.cute.amo.atomic_add(dst, value, pe)
Atomically adds value to the value at symmetric dst on PE pe. Does not return the old value.
This is a thread-level remote atomic add operation (non-fetching variant). Use atomic_fetch_add if you need the value before the addition.
- Args:
  - dst: CuTe tensor view pointing to a single-element symmetric destination on PE pe. Must be a symmetric (NVSHMEM-allocated) tensor. The element dtype determines which underlying NVSHMEM atomic is dispatched (integral and floating-point types supported).
  - value: Value to add. Cast to the element dtype of dst.
  - pe (int): Target PE.
- Note:
  - Supported dtypes are determined by the NVSHMEM atomic add dispatch table. Passing an unsupported dtype raises RuntimeError at JIT compile time.
nvshmem.core.device.cute.amo.atomic_fetch_add(dst, value, pe)
Atomically adds value to the value at symmetric dst on PE pe, and returns the value prior to the addition.
This is a thread-level remote atomic fetch-and-add operation.
- Args:
  - dst: CuTe tensor view pointing to a single-element symmetric destination on PE pe. Must be a symmetric (NVSHMEM-allocated) tensor. The element dtype determines which underlying NVSHMEM atomic is dispatched (integral and floating-point types supported).
  - value: Value to add. Cast to the element dtype of dst.
  - pe (int): Target PE.
- Returns:
  - The value stored at dst on PE pe prior to the addition.
- Note:
  - Supported dtypes are determined by the NVSHMEM atomic fetch-add dispatch table. Passing an unsupported dtype raises RuntimeError at JIT compile time.
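A common use of fetch-and-add is reserving a contiguous range of slots from a shared tail counter: the returned old value is the start of the caller's range. The sketch below models this on the host. ModelCounter and the requests list are hypothetical names introduced for the illustration, not NVSHMEM API.

```python
class ModelCounter:
    """Host-side stand-in for a single-element symmetric integer."""

    def __init__(self, initial: int = 0):
        self.value = initial

    def fetch_add(self, amount: int) -> int:
        # Models atomic_fetch_add(dst, value, pe): add, return previous value.
        old = self.value
        self.value = old + amount
        return old


tail = ModelCounter()
requests = [3, 1, 4]  # hypothetical slot counts requested by three writers

ranges = []
for count in requests:
    start = tail.fetch_add(count)           # start of this writer's range
    ranges.append((start, start + count))   # half-open range [start, end)

print(ranges)
```

Because each caller receives the counter value from before its own addition, the reserved ranges never overlap, and the final counter equals the total number of slots handed out.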
nvshmem.core.device.cute.amo.atomic_and(dst, value, pe)
Atomically applies bitwise AND of value with the value at symmetric dst on PE pe. Does not return the old value.
This is a thread-level remote atomic bitwise AND operation (non-fetching variant). Use atomic_fetch_and if you need the value before the operation.
- Args:
  - dst: CuTe tensor view pointing to a single-element symmetric destination on PE pe. Must be a symmetric (NVSHMEM-allocated) tensor. Only integral (bitwise) dtypes are supported.
  - value: Mask value to AND with. Cast to the element dtype of dst.
  - pe (int): Target PE.
- Note:
  - Only integral dtypes (e.g., uint32, uint64) are supported. Passing an unsupported dtype raises RuntimeError at JIT compile time.
nvshmem.core.device.cute.amo.atomic_fetch_and(dst, value, pe)
Atomically applies bitwise AND of value with the value at symmetric dst on PE pe, and returns the value prior to the operation.
This is a thread-level remote atomic fetch-and-AND operation.
- Args:
  - dst: CuTe tensor view pointing to a single-element symmetric destination on PE pe. Must be a symmetric (NVSHMEM-allocated) tensor. Only integral (bitwise) dtypes are supported.
  - value: Mask value to AND with. Cast to the element dtype of dst.
  - pe (int): Target PE.
- Returns:
  - The value stored at dst on PE pe prior to the AND operation.
- Note:
  - Only integral dtypes (e.g., uint32, uint64) are supported. Passing an unsupported dtype raises RuntimeError at JIT compile time.
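Fetch-and-AND with an inverted mask clears a bit and reports whether it was set beforehand. The sketch below models this for a 32-bit flag word in plain Python. ModelFlags and READY_BIT are hypothetical names for this illustration and not part of NVSHMEM.

```python
MASK32 = 0xFFFFFFFF  # keep the model within an unsigned 32-bit word


class ModelFlags:
    """Host-side stand-in for a single-element symmetric uint32."""

    def __init__(self, initial: int = 0):
        self.value = initial & MASK32

    def fetch_and(self, mask: int) -> int:
        # Models atomic_fetch_and(dst, value, pe): AND, return previous value.
        old = self.value
        self.value = old & mask & MASK32
        return old


READY_BIT = 1 << 2  # hypothetical flag position

flags = ModelFlags(0b1111)
old = flags.fetch_and(~READY_BIT & MASK32)   # clear only READY_BIT
was_set = bool(old & READY_BIT)

old_again = flags.fetch_and(~READY_BIT & MASK32)
was_set_again = bool(old_again & READY_BIT)
```

The first call clears the bit and reports that it had been set; the second call leaves the word unchanged and reports that it was already clear.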
nvshmem.core.device.cute.amo.atomic_or(dst, value, pe)
Atomically applies bitwise OR of value with the value at symmetric dst on PE pe. Does not return the old value.
This is a thread-level remote atomic bitwise OR operation (non-fetching variant). Use atomic_fetch_or if you need the value before the operation.
- Args:
  - dst: CuTe tensor view pointing to a single-element symmetric destination on PE pe. Must be a symmetric (NVSHMEM-allocated) tensor. Only integral (bitwise) dtypes are supported.
  - value: Mask value to OR with. Cast to the element dtype of dst.
  - pe (int): Target PE.
- Note:
  - Only integral dtypes (e.g., uint32, uint64) are supported. Passing an unsupported dtype raises RuntimeError at JIT compile time.
nvshmem.core.device.cute.amo.atomic_fetch_or(dst, value, pe)
Atomically applies bitwise OR of value with the value at symmetric dst on PE pe, and returns the value prior to the operation.
This is a thread-level remote atomic fetch-and-OR operation.
- Args:
  - dst: CuTe tensor view pointing to a single-element symmetric destination on PE pe. Must be a symmetric (NVSHMEM-allocated) tensor. Only integral (bitwise) dtypes are supported.
  - value: Mask value to OR with. Cast to the element dtype of dst.
  - pe (int): Target PE.
- Returns:
  - The value stored at dst on PE pe prior to the OR operation.
- Note:
  - Only integral dtypes (e.g., uint32, uint64) are supported. Passing an unsupported dtype raises RuntimeError at JIT compile time.
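Fetch-and-OR can be used to claim a slot in a bitmap: the caller that sees its bit clear in the returned value is the one that set it. The sketch below is a host-side model in plain Python. ModelFlags and try_claim are hypothetical helpers written for this illustration, not NVSHMEM API.

```python
class ModelFlags:
    """Host-side stand-in for a single-element symmetric unsigned integer."""

    def __init__(self, initial: int = 0):
        self.value = initial

    def fetch_or(self, mask: int) -> int:
        # Models atomic_fetch_or(dst, value, pe): OR, return previous value.
        old = self.value
        self.value = old | mask
        return old


def try_claim(flags: ModelFlags, slot: int) -> bool:
    """Return True only for the caller that flips the slot's bit from 0 to 1."""
    bit = 1 << slot
    old = flags.fetch_or(bit)
    return (old & bit) == 0


flags = ModelFlags()
first = try_claim(flags, 3)    # slot 3 is free, so this claim succeeds
second = try_claim(flags, 3)   # slot 3 is already taken
other = try_claim(flags, 0)    # a different slot is still free
```

A failed claim has no side effect on the bitmap, because OR-ing a bit that is already set leaves the word unchanged.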
nvshmem.core.device.cute.amo.atomic_xor(dst, value, pe)
Atomically applies bitwise XOR of value with the value at symmetric dst on PE pe. Does not return the old value.
This is a thread-level remote atomic bitwise XOR operation (non-fetching variant). Use atomic_fetch_xor if you need the value before the operation.
- Args:
  - dst: CuTe tensor view pointing to a single-element symmetric destination on PE pe. Must be a symmetric (NVSHMEM-allocated) tensor. Only integral (bitwise) dtypes are supported.
  - value: Mask value to XOR with. Cast to the element dtype of dst.
  - pe (int): Target PE.
- Note:
  - Only integral dtypes (e.g., uint32, uint64) are supported. Passing an unsupported dtype raises RuntimeError at JIT compile time.
nvshmem.core.device.cute.amo.atomic_fetch_xor(dst, value, pe)
Atomically applies bitwise XOR of value with the value at symmetric dst on PE pe, and returns the value prior to the operation.
This is a thread-level remote atomic fetch-and-XOR operation.
- Args:
  - dst: CuTe tensor view pointing to a single-element symmetric destination on PE pe. Must be a symmetric (NVSHMEM-allocated) tensor. Only integral (bitwise) dtypes are supported.
  - value: Mask value to XOR with. Cast to the element dtype of dst.
  - pe (int): Target PE.
- Returns:
  - The value stored at dst on PE pe prior to the XOR operation.
- Note:
  - Only integral dtypes (e.g., uint32, uint64) are supported. Passing an unsupported dtype raises RuntimeError at JIT compile time.
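XOR with a one-bit mask toggles that bit, and the fetching variant tells the caller which phase it left. The sketch below models a phase bit being toggled four times. ModelFlags and PHASE_BIT are hypothetical names for this host-side illustration and not part of NVSHMEM.

```python
class ModelFlags:
    """Host-side stand-in for a single-element symmetric unsigned integer."""

    def __init__(self, initial: int = 0):
        self.value = initial

    def fetch_xor(self, mask: int) -> int:
        # Models atomic_fetch_xor(dst, value, pe): XOR, return previous value.
        old = self.value
        self.value = old ^ mask
        return old


PHASE_BIT = 1  # hypothetical: bit 0 encodes the current phase

flags = ModelFlags(0)
phases_seen = []
for _ in range(4):
    old = flags.fetch_xor(PHASE_BIT)      # flip the phase bit
    phases_seen.append(old & PHASE_BIT)   # phase in effect before the flip

print(phases_seen)
```

Applying the same mask an even number of times restores the original word, which is why the model ends in the phase it started in.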
nvshmem.core.device.cute.amo.atomic_swap(dst, value, pe)
Atomically replaces the value at symmetric dst on PE pe with value, and returns the old value.
This is a thread-level remote atomic swap operation.
- Args:
  - dst: CuTe tensor view pointing to a single-element symmetric destination on PE pe. Must be a symmetric (NVSHMEM-allocated) tensor. The element dtype determines which underlying NVSHMEM atomic is dispatched.
  - value: New value to store. Cast to the element dtype of dst.
  - pe (int): Target PE.
- Returns:
  - The value stored at dst on PE pe prior to the swap.
- Note:
  - Supported dtypes are determined by the NVSHMEM atomic swap dispatch table. Passing an unsupported dtype raises RuntimeError at JIT compile time.
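One pattern that swap enables is a test-and-set lock: a contender writes 1 and owns the lock only if the value it got back was 0. The sketch below replays that logic on the host for two contenders. ModelLock is a hypothetical stand-in for a symmetric lock word, not NVSHMEM API, and a real lock would also need the appropriate memory-ordering calls, which this model does not cover.

```python
class ModelLock:
    """Host-side stand-in for a single-element symmetric lock word."""

    UNLOCKED, LOCKED = 0, 1

    def __init__(self):
        self.word = self.UNLOCKED

    def swap(self, new: int) -> int:
        # Models atomic_swap(dst, value, pe): store, return previous value.
        old, self.word = self.word, new
        return old

    def try_acquire(self) -> bool:
        # The caller owns the lock only if the word was UNLOCKED before the swap.
        return self.swap(self.LOCKED) == self.UNLOCKED

    def release(self) -> None:
        # Models atomic_set(dst, 0, pe).
        self.word = self.UNLOCKED


lock = ModelLock()
a_got_it = lock.try_acquire()   # first contender finds the word unlocked
b_got_it = lock.try_acquire()   # second contender sees LOCKED and fails
lock.release()                  # first contender releases
b_retry = lock.try_acquire()    # second contender succeeds on retry
```

A failed attempt writes LOCKED over LOCKED, so it does not disturb the current holder.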
nvshmem.core.device.cute.amo.atomic_compare_swap(dst, cond, value, pe)
Atomically compares the value at symmetric dst on PE pe with cond and, if they are equal, replaces it with value. Returns the old value whether or not the swap occurred.
This is a thread-level remote atomic compare-and-swap (CAS) operation.
- Args:
  - dst: CuTe tensor view pointing to a single-element symmetric destination on PE pe. Must be a symmetric (NVSHMEM-allocated) tensor. The element dtype determines which underlying NVSHMEM atomic is dispatched.
  - cond: Comparison value. Cast to the element dtype of dst. The swap only occurs if the current value at dst equals cond.
  - value: Replacement value. Cast to the element dtype of dst. Written to dst only if the comparison succeeds.
  - pe (int): Target PE.
- Returns:
  - The value stored at dst on PE pe prior to the operation, regardless of whether the swap occurred.
- Note:
  - Supported dtypes are determined by the NVSHMEM atomic compare-swap dispatch table. Passing an unsupported dtype raises RuntimeError at JIT compile time.
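Since the old value is returned either way, a caller can tell whether its swap took effect by comparing the result with cond, and retry if not. The sketch below uses that retry loop to build an atomic maximum, modelled on the host with Python threads. ModelCell and atomic_max are hypothetical helpers for this illustration; the lock only models the atomicity of each operation and is not how NVSHMEM implements it.

```python
import threading


class ModelCell:
    """Host-side stand-in for a single-element symmetric integer."""

    def __init__(self, initial: int = 0):
        self._value = initial
        self._lock = threading.Lock()  # models per-operation atomicity

    def fetch(self) -> int:
        with self._lock:
            return self._value

    def compare_swap(self, cond: int, value: int) -> int:
        # Models atomic_compare_swap(dst, cond, value, pe).
        with self._lock:
            old = self._value
            if old == cond:
                self._value = value
            return old


def atomic_max(cell: ModelCell, candidate: int) -> None:
    """Raise the cell to `candidate` unless it already holds a larger value."""
    observed = cell.fetch()
    while observed < candidate:
        previous = cell.compare_swap(observed, candidate)
        if previous == observed:
            return               # our swap took effect
        observed = previous      # another update won; retry against its value


cell = ModelCell(0)
candidates = [7, 3, 19, 11, 5]
threads = [threading.Thread(target=atomic_max, args=(cell, c)) for c in candidates]
for t in threads:
    t.start()
for t in threads:
    t.join()
final = cell.fetch()
```

Whatever order the threads run in, the loop only ever replaces a value with a larger one, so the cell ends up holding the largest candidate.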