NVSHMEM Device Atomic Memory Operations with Numba-CUDA DSL¶
This section documents the NVSHMEM Device Atomic Memory Operations with Numba-CUDA DSL.
Example: Using atomic_add in a Numba-CUDA kernel¶
The following example demonstrates how to use the NVSHMEM atomic add operation (atomic_add) in a Numba-CUDA kernel. This allows multiple threads to safely increment a value in shared memory on a remote PE (processing element).
import numpy as np
import cupy as cp
from numba import cuda
import nvshmem
import nvshmem.core.device.numba as nvshmem_numba
from mpi4py import MPI
@cuda.jit
def atomic_add_kernel(dst, value, pe):
tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
if tid == 0:
# Atomically add 'value' to dst[0] on remote PE
nvshmem_numba.atomic_add(dst, value, pe)
# Initialize NVSHMEM
dev = cudaDevice()
dev.set_current()
stream = dev.create_stream()
nvshmem.init(dev=dev, mpi_comm=MPI.COMM_WORLD, initializer_method="mpi", stream=stream)
# Get information about the current PE
me = nvshmem.my_pe()
n_pes = nvshmem.n_pes()
# Choose a remote PE (for example, next PE in a ring)
pe = (me + 1) % n_pes
# Allocate device buffer
dst = nvshmem.array((1,), dtype=np.int32)
# Launch kernel to perform atomic add on remote PE's dst[0]
# Note, Numba-cuda does not accept a cuda.core Stream, so we need to pass the stream handle.
atomic_add_kernel[1, 1](dst, 42, pe, stream=int(stream.handle))
# Finalize NVSHMEM
nvshmem.finalize(dev=dev, stream=stream)
This example atomically adds 42 to the dst[0] array on the next PE in a ring. Only thread 0 performs the atomic operation for demonstration purposes. In practice, you can have multiple threads performing atomic operations as needed.
For more details, see the Numba test suite and the NVSHMEM4Py documentation.
-
nvshmem.core.device.numba.amo.atomic_inc(dst, pe)¶ Atomically increments
dstby 1 (no return value).- Args:
dst(Array): Symmetric destination array on remote PE.pe(int): Target PE.
-
nvshmem.core.device.numba.amo.atomic_fetch_inc(dst, pe)¶ Atomically increments
dstby 1 and returns the old value.- Args:
dst(Array): Symmetric destination array on remote PE.pe(int): Target PE.
- Returns:
- The value stored at
dstprior to increment.
-
nvshmem.core.device.numba.amo.atomic_fetch(src, pe)¶ Fetches the current value at symmetric
srcon PEpe.- Args:
src(Array): Symmetric source array on remote PE.pe(int): PE to fetch from.
- Returns:
- Current value stored at
srcon PEpe.
-
nvshmem.core.device.numba.amo.atomic_set(dst, value, pe)¶ Sets the value at symmetric
dston PEpetovalue.- Args:
dst(Array): Symmetric destination array on remote PE.value: Value to set.pe(int): Target PE.
-
nvshmem.core.device.numba.amo.atomic_add(dst, value, pe)¶ Atomically adds
valuetodst(no return value).- Args:
dst(Array): Symmetric destination array on remote PE.value: Value to add.pe(int): Target PE.
-
nvshmem.core.device.numba.amo.atomic_fetch_add(dst, value, pe)¶ Atomically adds
valuetodstand returns the old value.- Args:
dst(Array): Symmetric destination array on remote PE.value: Value to add.pe(int): Target PE.
- Returns:
- The value stored at
dstprior to the addition.
-
nvshmem.core.device.numba.amo.atomic_and(dst, value, pe)¶ Atomically applies bitwise AND of
valuewithdst(no return value).- Args:
dst(Array): Symmetric destination array on remote PE.value: Value to AND with.pe(int): Target PE.
-
nvshmem.core.device.numba.amo.atomic_fetch_and(dst, value, pe)¶ Atomically applies bitwise AND and returns the old value.
- Args:
dst(Array): Symmetric destination array on remote PE.value: Value to AND with.pe(int): Target PE.
- Returns:
- The value stored at
dstprior to the AND.
-
nvshmem.core.device.numba.amo.atomic_or(dst, value, pe)¶ Atomically applies bitwise OR of
valuewithdst(no return value).- Args:
dst(Array): Symmetric destination array on remote PE.value: Value to OR with.pe(int): Target PE.
-
nvshmem.core.device.numba.amo.atomic_fetch_or(dst, value, pe)¶ Atomically applies bitwise OR and returns the old value.
- Args:
dst(Array): Symmetric destination array on remote PE.value: Value to OR with.pe(int): Target PE.
- Returns:
- The value stored at
dstprior to the OR.
-
nvshmem.core.device.numba.amo.atomic_xor(dst, value, pe)¶ Atomically applies bitwise XOR of
valuewithdst(no return value).- Args:
dst(Array): Symmetric destination array on remote PE.value: Value to XOR with.pe(int): Target PE.
-
nvshmem.core.device.numba.amo.atomic_fetch_xor(dst, value, pe)¶ Atomically applies bitwise XOR and returns the old value.
- Args:
dst(Array): Symmetric destination array on remote PE.value: Value to XOR with.pe(int): Target PE.
- Returns:
- The value stored at
dstprior to the XOR.
-
nvshmem.core.device.numba.amo.atomic_swap(dst, value, pe)¶ Atomically swaps the current value at
dstwithvalueon PEpe.- Args:
dst(Array): Symmetric destination array on remote PE.value: Replacement value.pe(int): Target PE.
- Returns:
- The old value previously stored at
dst.
-
nvshmem.core.device.numba.amo.atomic_compare_swap(dst, cond, value, pe)¶ Atomically compares the current value at
dstwithcondand swaps withvalueif equal.- Args:
dst(Array): Symmetric destination location on remote PE.cond: Comparison value.value: Replacement value.pe(int): Target PE.
- Returns:
- The old value previously stored at
dst.