NVSHMEM Device Remote Memory Access (RMA) with Numba-CUDA DSL¶
This section documents the NVSHMEM Device Remote Memory Access (RMA) operations with Numba-CUDA DSL.
Example: Using put and get in a Numba-CUDA kernel¶
The following example demonstrates how to use the NVSHMEM put and get operations in a Numba-CUDA kernel. These allow threads to write to and read from memory on a remote PE (processing element) directly from device code.
import numpy as np
import cupy as cp
from numba import cuda
import nvshmem
import nvshmem.core.device.numba as nvshmem_numba
from mpi4py import MPI
@cuda.jit
def rma_kernel(src, dst, remote_buf, pe):
tid = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
if tid == 0:
# Put data from src to remote_buf on remote PE
nvshmem_numba.put(remote_buf, src, pe)
# Get data from remote_buf on remote PE to dst
nvshmem_numba.get(dst, remote_buf, pe)
# Initialize NVSHMEM
dev = cudaDevice()
dev.set_current()
stream = dev.create_stream()
nvshmem.init(dev=dev, mpi_comm=MPI.COMM_WORLD, initializer_method="mpi", stream=stream)
# Get information about the current PE
me = nvshmem.my_pe()
n_pes = nvshmem.n_pes()
# Choose a remote PE (for example, next PE in a ring)
pe = (me + 1) % n_pes
# Allocate device buffers
src = nvshmem.array((1,), dtype=np.int32)
dst = nvshmem.array((1,), dtype=np.int32)
remote_buf = nvshmem.array((1,), dtype=np.int32)
# Launch kernel to perform put and get on remote PE's buffer
# Note, Numba-cuda does not accept a cuda.core Stream, so we need to pass the stream handle.
rma_kernel[1, 1](src, dst, remote_buf, pe, stream=int(stream.handle))
# Finalize NVSHMEM
nvshmem.finalize(dev=dev, stream=stream)
This example puts the value from src[0] to the remote_buf[0] on the next PE in a ring, and then gets the value back into dst[0]. Only thread 0 performs the RMA operations for demonstration purposes. In practice, you can have multiple threads performing RMA as needed.
For more details, see the Numba test suite and the NVSHMEM4Py documentation.
-
nvshmem.core.device.numba.rma.p()¶ Put immediate data from src to dst on PE pe. Device initiated.
srcmust be a scalar value, passed as a symmetric Array of size 1.- Args:
dst(Array): Destination symmetric array on remote PE.src(Array): Scalar source value (size-1 array) on this PE.pe(int): PE to put to.
-
nvshmem.core.device.numba.rma.g()¶ Get immediate data from src on PE pe to local dst. Device initiated.
srcmust be a scalar value, passed as a symmetric Array of size 1.- Args:
src(Array): Source symmetric array (size-1) on remote PE.pe(int): PE to get from.
- Returns:
- Scalar value retrieved from
srcon PEpe.
-
nvshmem.core.device.numba.rma.put(src: numba.core.types.npytypes.Array, dst: numba.core.types.npytypes.Array, pe: int32) → None¶ Copies data from local
srcto symmetricdston PEpe. This is a thread-level operation.- Args:
src(Array): Local source array on this PE to copy from.dst(Array): Symmetric destination array on PEpeto copy to.pe(int): PE to copy to.
- Note:
- When
srcanddstare of different sizes, only data of the smaller size is copied.srcanddstmust have the same data type.
-
nvshmem.core.device.numba.rma.get(dst: numba.core.types.npytypes.Array, src: numba.core.types.npytypes.Array, pe: int32) → None¶ Copies data from symmetric
srcon PEpeto localdst. This is a thread-level operation.- Args:
dst(Array): Local destination array on this PE to copy to.src(Array): Symmetric source array from PEpeto copy from.pe(int): PE to copy from.
- Note:
- When
srcanddstare of different sizes, only data of the smaller size is copied.srcanddstmust have the same data type.
-
nvshmem.core.device.numba.rma.put_nbi(src: numba.core.types.npytypes.Array, dst: numba.core.types.npytypes.Array, pe: int32) → None¶ Non-blockingly copies data from local
srcto symmetricdston PEpe. This is a thread-level operation.- Args:
src(Array): Local source array on this PE to copy from.dst(Array): Symmetric destination array on PEpeto copy to.pe(int): PE to copy to.
- Note:
- When
srcanddstare of different sizes, only data of the smaller size is copied.srcanddstmust have the same data type.
-
nvshmem.core.device.numba.rma.get_nbi(dst: numba.core.types.npytypes.Array, src: numba.core.types.npytypes.Array, pe: int32) → None¶ Non-blockingly copies data from symmetric
srcon PEpeto localdst. This is a thread-level operation.- Args:
dst(Array): Local destination array on this PE to copy to.src(Array): Symmetric source array from PEpeto copy from.pe(int): PE to copy from.
- Note:
- When
srcanddstare of different sizes, only data of the smaller size is copied.srcanddstmust have the same data type.
-
nvshmem.core.device.numba.rma.put_block(src: numba.core.types.npytypes.Array, dst: numba.core.types.npytypes.Array, pe: int32) → None¶ Copies from local
srcto symmetricdston PEpe. This is a CTA-level operation. All threads in the CTA must call this function with the same arguments.- Args:
src(Array): Local source array on this PE to copy from.dst(Array): Symmetric destination array on PEpeto copy to.pe(int): PE to copy to.
- Note:
- When
srcanddstare of different sizes, only data of the smaller size is copied.srcanddstmust have the same data type.
-
nvshmem.core.device.numba.rma.get_block(dst: numba.core.types.npytypes.Array, src: numba.core.types.npytypes.Array, pe: int32) → None¶ Copies from symmetric
srcon PEpeto localdst. This is a CTA-level operation. All threads in the CTA must call this function with the same arguments.- Args:
dst(Array): Local destination array on this PE to copy to.src(Array): Symmetric source array from PEpeto copy from.pe(int): PE to copy from.
- Note:
- When
srcanddstare of different sizes, only data of the smaller size is copied.srcanddstmust have the same data type.
-
nvshmem.core.device.numba.rma.put_nbi_block(src: numba.core.types.npytypes.Array, dst: numba.core.types.npytypes.Array, pe: int32) → None¶ Non-blockingly copies from local
srcto symmetricdston PEpe. This is a CTA-level operation. All threads in the CTA must call this function with the same arguments.- Args:
src(Array): Local source array on this PE to copy from.dst(Array): Symmetric destination array on PEpeto copy to.pe(int): PE to copy to.
-
nvshmem.core.device.numba.rma.get_nbi_block(dst: numba.core.types.npytypes.Array, src: numba.core.types.npytypes.Array, pe: int32) → None¶ Non-blockingly copies from symmetric
srcon PEpeto localdst. This is a CTA-level operation. All threads in the CTA must call this function with the same arguments.- Args:
dst(Array): Local destination array on this PE to copy to.src(Array): Symmetric source array from PEpeto copy from.pe(int): PE to copy from.
-
nvshmem.core.device.numba.rma.put_warp(src: numba.core.types.npytypes.Array, dst: numba.core.types.npytypes.Array, pe: int32) → None¶ Copies from local
srcto symmetricdston PEpe. This is a warp-level operation. All threads in the warp must call this function with the same arguments.- Args:
src(Array): Local source array on this PE to copy from.dst(Array): Symmetric destination array on PEpeto copy to.pe(int): PE to copy to.
- Note:
- When
srcanddstare of different sizes, only data of the smaller size is copied.srcanddstmust have the same data type.
-
nvshmem.core.device.numba.rma.get_warp(dst: numba.core.types.npytypes.Array, src: numba.core.types.npytypes.Array, pe: int32) → None¶ Copies from symmetric
srcon PEpeto localdst. This is a warp-level operation. All threads in the warp must call this function with the same arguments.- Args:
dst(Array): Local destination array on this PE to copy to.src(Array): Symmetric source array from PEpeto copy from.pe(int): PE to copy from.
- Note:
- When
srcanddstare of different sizes, only data of the smaller size is copied.srcanddstmust have the same data type.
-
nvshmem.core.device.numba.rma.put_nbi_warp(src: numba.core.types.npytypes.Array, dst: numba.core.types.npytypes.Array, pe: int32) → None¶ Non-blockingly copies from local
srcto symmetricdston PEpe. This is a warp-level operation. All threads in the warp must call this function with the same arguments.- Args:
src(Array): Local source array on this PE to copy from.dst(Array): Symmetric destination array on PEpeto copy to.pe(int): PE to copy to.
- Note:
- When
srcanddstare of different sizes, only data of the smaller size is copied.srcanddstmust have the same data type.
-
nvshmem.core.device.numba.rma.get_nbi_warp(dst: numba.core.types.npytypes.Array, src: numba.core.types.npytypes.Array, pe: int32) → None¶ Non-blockingly copies from symmetric
srcon PEpeto localdst. This is a warp-level operation. All threads in the warp must call this function with the same arguments.- Args:
dst(Array): Local destination array on this PE to copy to.src(Array): Symmetric source array from PEpeto copy from.pe(int): PE to copy from.
- Note:
- When
srcanddstare of different sizes, only data of the smaller size is copied.srcanddstmust have the same data type.
-
nvshmem.core.device.numba.rma.put_signal_block()¶ Put data with a signal operation from
srctodston PEpein a manner at a scope. Device initiated.Signal variables must be an
Arrayof dtypeint64(8 bytes) allocated by or registered with NVSHMEM4Py. Supported signal operations areSignalOp.SIGNAL_SETandSignalOp.SIGNAL_ADD.- Args:
dst(Array): Destination symmetric array on remote PE.src(Array): Source symmetric array on this PE.signal_var(Array): Symmetric signal variable (dtypeint64).signal_val(int): Signal value.signal_op(SignalOp): Signal operation type.pe(int): Target PE to put to.
-
nvshmem.core.device.numba.rma.put_signal()¶ Put data with a signal operation from
srctodston PEpein a manner at a scope. Device initiated.Signal variables must be an
Arrayof dtypeint64(8 bytes) allocated by or registered with NVSHMEM4Py. Supported signal operations areSignalOp.SIGNAL_SETandSignalOp.SIGNAL_ADD.- Args:
dst(Array): Destination symmetric array on remote PE.src(Array): Source symmetric array on this PE.signal_var(Array): Symmetric signal variable (dtypeint64).signal_val(int): Signal value.signal_op(SignalOp): Signal operation type.pe(int): Target PE to put to.
-
nvshmem.core.device.numba.rma.put_signal_nbi()¶ Put data with a signal operation from
srctodston PEpein a _nbi manner at a scope. Device initiated.Signal variables must be an
Arrayof dtypeint64(8 bytes) allocated by or registered with NVSHMEM4Py. Supported signal operations areSignalOp.SIGNAL_SETandSignalOp.SIGNAL_ADD.- Args:
dst(Array): Destination symmetric array on remote PE.src(Array): Source symmetric array on this PE.signal_var(Array): Symmetric signal variable (dtypeint64).signal_val(int): Signal value.signal_op(SignalOp): Signal operation type.pe(int): Target PE to put to.
-
nvshmem.core.device.numba.rma.put_signal_warp()¶ Put data with a signal operation from
srctodston PEpein a manner at a scope. Device initiated.Signal variables must be an
Arrayof dtypeint64(8 bytes) allocated by or registered with NVSHMEM4Py. Supported signal operations areSignalOp.SIGNAL_SETandSignalOp.SIGNAL_ADD.- Args:
dst(Array): Destination symmetric array on remote PE.src(Array): Source symmetric array on this PE.signal_var(Array): Symmetric signal variable (dtypeint64).signal_val(int): Signal value.signal_op(SignalOp): Signal operation type.pe(int): Target PE to put to.
-
nvshmem.core.device.numba.rma.put_signal_nbi_block()¶ Put data with a signal operation from
srctodston PEpein a _nbi manner at a scope. Device initiated.Signal variables must be an
Arrayof dtypeint64(8 bytes) allocated by or registered with NVSHMEM4Py. Supported signal operations areSignalOp.SIGNAL_SETandSignalOp.SIGNAL_ADD.- Args:
dst(Array): Destination symmetric array on remote PE.src(Array): Source symmetric array on this PE.signal_var(Array): Symmetric signal variable (dtypeint64).signal_val(int): Signal value.signal_op(SignalOp): Signal operation type.pe(int): Target PE to put to.
-
nvshmem.core.device.numba.rma.put_signal_nbi_warp()¶ Put data with a signal operation from
srctodston PEpein a _nbi manner at a scope. Device initiated.Signal variables must be an
Arrayof dtypeint64(8 bytes) allocated by or registered with NVSHMEM4Py. Supported signal operations areSignalOp.SIGNAL_SETandSignalOp.SIGNAL_ADD.- Args:
dst(Array): Destination symmetric array on remote PE.src(Array): Source symmetric array on this PE.signal_var(Array): Symmetric signal variable (dtypeint64).signal_val(int): Signal value.signal_op(SignalOp): Signal operation type.pe(int): Target PE to put to.
-
nvshmem.core.device.numba.rma.put_signal_nbi_warp() Put data with a signal operation from
srctodston PEpein a _nbi manner at a scope. Device initiated.Signal variables must be an
Arrayof dtypeint64(8 bytes) allocated by or registered with NVSHMEM4Py. Supported signal operations areSignalOp.SIGNAL_SETandSignalOp.SIGNAL_ADD.- Args:
dst(Array): Destination symmetric array on remote PE.src(Array): Source symmetric array on this PE.signal_var(Array): Symmetric signal variable (dtypeint64).signal_val(int): Signal value.signal_op(SignalOp): Signal operation type.pe(int): Target PE to put to.
NVSHMEM4Py Memory Management with Numba-CUDA DSL¶
This section documents the NVSHMEM4Py Memory Management with Numba-CUDA DSL.
NVSHMEM4Py provides functions to access remote symmetric and multicast buffers as CuPy arrays via the nvshmem.core.device.numba.mem module.
-
nvshmem.core.device.numba.mem.get_multicast_array(team: <MutableEnum TEAM_INVALID=-1, TEAM_WORLD=0, TEAM_SHARED=1, TEAM_NODE=2, TEAM_SAME_MYPE_NODE=3, TEAM_SAME_GPU=4, TEAM_GPU_LEADERS=5, TEAMS_MIN=6, TEAM_INDEX_MAX=32767>, array: numba.core.types.npytypes.Array)¶ Returns an array view on multicast-accessible memory corresponding to the input array. The Array passed into it must be allocated by NVSHMEM4Py.
This is the Python array equivalent of nvshmemx_mc_ptr which returns a pointer into a peer’s symmetric heap
- Args:
- array: Array - A symmetric array allocated by NVSHMEM team: Teams - A NVSHMEM Team to create the Multicast object across
- Returns:
- A Numba ArrayView which represents the Multicast object
- NOTE: This function is only supported with the Numba Runtime. To enable it,
- set the environment variable
NUMBA_CUDA_ENABLE_NRT=1
-
nvshmem.core.device.numba.mem.get_peer_array(arr: numba.core.types.npytypes.Array, pe: int)¶ Returns an array view of a peer buffer associated with an NVSHMEM-allocated object.
This is the Python array equivalent of nvshmem_ptr which returns a pointer into a peer’s symmetric heap
- Args:
- array: Array - A symmetric array allocated by NVSHMEM pe: int - The remote PE to retrieve an ArrayView into
- Returns:
- A Numba ArrayView which represents the peer object
- NOTE: This function is only supported with the Numba Runtime. To enable it,
- set the environment variable
NUMBA_CUDA_ENABLE_NRT=1