Interoperability¶
NVSHMEM4Py provides interoperability with other Python libraries through the use of DLPack-compatible CUDA buffers. This allows for seamless integration with libraries like CuPy and PyTorch. It has special functions which support allocating CuPy Arrays and Torch tensors which are backed by NVSHMEM symmetric memory.
Interoperability with PyTorch and CuPy¶
NVSHMEM4Py provides specialized functions to create PyTorch tensors and CuPy arrays that are backed by NVSHMEM symmetric memory. This allows these objects to be used directly in NVSHMEM operations while maintaining their native API functionality.
Creating NVSHMEM-backed Arrays¶
The interoperability modules provide factory functions that create arrays backed by NVSHMEM symmetric memory:
import nvshmem.core as nvshmem
import torch
import cupy as cp
# Create a PyTorch tensor backed by NVSHMEM memory
tensor = nvshmem.interop.torch.empty((1000, 1000), dtype=torch.float32)
# Create a CuPy array backed by NVSHMEM memory
array = nvshmem.interop.cupy.empty((1000, 1000), dtype=cp.float32)
These objects can be used with their respective libraries’ APIs just like regular tensors or arrays:
# Use PyTorch operations
tensor = tensor + 1.0
tensor = torch.nn.functional.relu(tensor)
# Use CuPy operations
array = array * 2.0
array = cp.sqrt(array)
NVSHMEM Operations with PyTorch and CuPy¶
The same objects can be used directly in NVSHMEM operations:
import nvshmem.core as nvshmem
# Get my PE ID and total number of PEs
my_pe = nvshmem.my_pe()
n_pes = nvshmem.n_pes()
# Put data to the next PE (PyTorch example)
next_pe = (my_pe + 1) % n_pes
nvshmem.put(tensor, tensor, count=tensor.numel(), pe=next_pe)
# Get data from previous PE (CuPy example)
prev_pe = (my_pe - 1) % n_pes
nvshmem.get(array, array, count=array.size, pe=prev_pe)
Memory Management¶
These objects are backed by NVSHMEM symmetric memory, so they should be freed when they are no longer needed. If they are not freed, an Exception will be raised when the garbage collector finds them. Unfreed objects are safely freed at program exit.
# Free the symmetric memory backing these objects
nvshmem.free_tensor(tensor)
nvshmem.free_array(array)
This ensures proper cleanup of distributed resources across all PEs.
Interoperability API reference¶
Interoperability with PyTorch¶
The following are interoperability helpers for NVSHMEM4Py memory used in Torch
-
nvshmem.core.interop.torch.bytetensor(shape: Tuple[int], dtype: None = None, release=False, morder='C', except_on_del=True) → None¶ Create a PyTorch tensor from NVSHMEM-allocated memory with the given shape and dtype.
This function allocates raw memory using NVSHMEM and wraps it in a PyTorch tensor without reshaping or changing the dtype view. Useful for low-level manipulation.
- Args:
- shape (tuple or list of int): Shape of the desired tensor.
- dtype (
str,np.dtype, ortorch.dtype, optional): Data type of the tensor. Defaults to"float32". - release (bool, optional): Do not track this buffer internally to NVSHMEM
- If True, it is the user’s responsibility to hold references to the buffer until free() is called otherwise, deadlocks may occur.
- morder (
str, optional): The memory format to use."C"for C-style, and"F"for “Fortran-style
- Returns:
- torch.Tensor: A raw PyTorch tensor referencing the NVSHMEM-allocated memory.
- Raises:
- RuntimeError: If NVSHMEM or PyTorch is not properly initialized or enabled.
-
nvshmem.core.interop.torch.tensor(shape: Tuple[int], dtype: None = None, release=False, morder='C', except_on_del=True) → None¶ Create a PyTorch tensor view on NVSHMEM-allocated memory with the given shape and dtype.
This function allocates memory using NVSHMEM, wraps it with a DLPack tensor, and then converts it into a PyTorch tensor with the desired dtype and shape.
- Args:
- shape (tuple or list of int): Shape of the desired tensor.
- dtype (torch.dtype, optional): Data type of the tensor. Defaults to
torch.float32. - release (bool, optional): Do not track this buffer internally to NVSHMEM
- If True, it is the user’s responsibility to hold references to the buffer until free() is called otherwise, deadlocks may occur.
- morder (
str, optional): The memory format to use."C"for C-style, and"F"for “Fortran-style
- Returns:
- torch.Tensor: A PyTorch tensor view on the NVSHMEM-allocated buffer.
- Raises:
- RuntimeError: If NVSHMEM or PyTorch is not properly initialized or enabled.
-
nvshmem.core.interop.torch.free_tensor(tensor: None) → None¶ Free an NVSHMEM-backed Torch Tensor
- Args:
- tensor (
torch.Tensor): A PyTorch tensor backed by NVSHMEM memory. - Returns:
None- Raises:
RuntimeError: If NVSHMEM or PyTorch is not properly initialized or enabled.
-
nvshmem.core.interop.torch.tensor_get_buffer(tensor: None) → Tuple[cuda.core._memory._buffer.Buffer, int, str]¶ Get a nvshmem Buffer object from a Torch tensor object which was allocated with
nvshmem.core.tensor()ornvshmem.core.bytetensor()Returns the buffer and the array’s size
-
nvshmem.core.interop.torch.get_peer_tensor(tensor: None, peer_pe: int = None) → None¶ Return a Buffer based on the
peer_buffer(wrapper of nvshmem_ptr) API
-
nvshmem.core.interop.torch.get_multicast_tensor(team: <MutableEnum TEAM_INVALID=-1, TEAM_WORLD=0, TEAM_SHARED=1, TEAM_NODE=2, TEAM_SAME_MYPE_NODE=3, TEAM_SAME_GPU=4, TEAM_GPU_LEADERS=5, TEAMS_MIN=6, TEAM_INDEX_MAX=32767>, tensor: None) → None¶ Returns a PyTorch Tensor view on multicast-accessible memory corresponding to the input tensor.
This function takes a PyTorch tensor that wraps NVSHMEM-allocated memory and returns a new tensor that uses a Multicast Memory alias for that buffer, obtained via
nvshmemx_mc_ptr. The resulting tensor is suitable for use in GPU kernels that leverage multicast features such as NVSwitch-based SHARP collectives.The Tensor passed into it must be allocated by NVSHMEM4Py.
- IMPORTANT:
- The returned tensor’s memory cannot be accessed from the host (CPU). It is only valid for use in GPU kernels. Host-side access or copying is undefined behavior and may result in errors.
NOTE: This function does not copy data. It provides a device-side view of the same underlying memory, but aliased for multicast access. Any modifications made through the returned tensor will affect the same memory region.
- Args:
- tensor (
torch.Tensor): A PyTorch tensor backed by NVSHMEM-allocated memory. team (Teams): The NVSHMEM team for which multicast access is requested. - Returns:
torch.Tensor: A PyTorch tensor view on the multicast alias of the NVSHMEM buffer.- Raises:
NvshmemInvalid: If the input tensor is not backed by NVSHMEM memory or multicast is not supported.NvshmemError: If the tensor’s backing buffer is not properly tracked or initialized.
-
nvshmem.core.interop.torch.register_external_tensor(tensor: None) → None¶ Register an external tensor with NVSHMEM.
-
nvshmem.core.interop.torch.unregister_external_tensor(tensor: None) → None¶ Unregister an external tensor with NVSHMEM.
Interoperability with CuPy¶
The following are interoperability helpers for NVSHMEM4Py memory used in CuPy
-
nvshmem.core.interop.cupy.bytearray(shape: Tuple[int], dtype: str = 'float32', release=False, device_id: int = None, morder='C', except_on_del=True) → None¶ Create a raw CuPy byte array from NVSHMEM-allocated memory.
This function allocates raw memory using NVSHMEM and wraps it with a CuPy array without reshaping or reinterpreting the dtype view.
This function uses the shape and dtype to choose how much memory to allocate, but does not cast or reshape Therefore, the type of the array will always be cupy.uint8.
Any future calls to
.view()on this object should setcopy=False, to avoid copying the object off of the sheap- Args:
- shape (tuple or list of int): Shape of the desired array.
- dtype (
str,np.dtype, orcupy.dtype, optional): Data type of the array. Defaults to"float32". - release (bool, optional): Do not track this buffer internally to NVSHMEM
- If True, it is the user’s responsibility to hold references to the buffer until free() is called otherwise, deadlocks may occur.
- morder (
str, optional): The memory format to use."C"for C-style, and"F"for “Fortran-style
- Returns:
cupy.ndarray: A CuPy array backed by NVSHMEM-allocated memory.- Raises:
ModuleNotFoundError: If CuPy is not available or enabled.
-
nvshmem.core.interop.cupy.array(shape: Tuple[int], dtype: str = 'float32', release=False, morder='C', except_on_del=True) → None¶ Create a CuPy array view on NVSHMEM-allocated memory with the given shape and dtype.
This function allocates memory using NVSHMEM, wraps it with a DLPack-compatible CuPy array, and returns a reshaped and retyped view of that memory.
- Args:
- shape (tuple or list of int): Shape of the desired array.
- dtype (
str,np.dtype, orcupy.dtype, optional): Data type of the array. Defaults to"float32". - release (bool, optional): Do not track this buffer internally to NVSHMEM
- If True, it is the user’s responsibility to hold references to the buffer until free() is called otherwise, deadlocks may occur.
- morder (
str, optional): The memory format to use."C"for C-style, and"F"for “Fortran-style
Any future calls to
.view()on this object should set copy=False, to avoid copying the object off of the sheap- Returns:
cupy.ndarray: A CuPy array view on NVSHMEM-allocated memory.- Raises:
ModuleNotFoundError: If CuPy is not available or enabled.
-
nvshmem.core.interop.cupy.free_array(array: None) → None¶ Free an NVSHMEM-backed CuPy Array
- Args:
- array (
cupy.ndarray): A CuPy array backed by NVSHMEM memory. - Returns:
- None
- Raises:
ModuleNotFoundError: If CuPy is not available or enabled.
-
nvshmem.core.interop.cupy.array_get_buffer(array: None) → Tuple[cuda.core._memory._buffer.Buffer, int, str]¶ Get a nvshmem Buffer object from a Cupy NDArray object which was allocated with
nvshmem.core.array()ornvshmem.core.bytearray()Returns a Tuple of the array and its size in bytes
-
nvshmem.core.interop.cupy.get_peer_array(array: None, peer_pe: int = None) → None¶ Return a Buffer based on the peer_buffer (wrapper of nvshmem_ptr) API
-
nvshmem.core.interop.cupy.get_multicast_array(team: <MutableEnum TEAM_INVALID=-1, TEAM_WORLD=0, TEAM_SHARED=1, TEAM_NODE=2, TEAM_SAME_MYPE_NODE=3, TEAM_SAME_GPU=4, TEAM_GPU_LEADERS=5, TEAMS_MIN=6, TEAM_INDEX_MAX=32767>, array: None) → None¶ Returns a CuPy array view on multicast-accessible memory corresponding to the input array.
This function takes a CuPy Array that wraps NVSHMEM-allocated memory and returns a new array that uses a Multicast Memory alias for that buffer, obtained via
nvshmemx_mc_ptr. The resulting array is suitable for use in GPU kernels that leverage multicast features such as NVSwitch-based SHARP collectives.The Array passed into it must be allocated by NVSHMEM4Py.
- IMPORTANT:
- The returned array’s memory cannot be accessed from the host (CPU). It is only valid for use in GPU kernels. Host-side access or copying is undefined behavior and may result in errors.
NOTE: This function does not copy data. It provides a device-side view of the same underlying memory, but aliased for multicast access. Any modifications made through the returned array will affect the same memory region.
- Args:
- array (
ndarray): A CuPy Array backed by NVSHMEM-allocated memory. team (Teams): The NVSHMEM team for which multicast access is requested. - Returns:
ndarray: A PyTorch array view on the multicast alias of the NVSHMEM buffer.- Raises:
NvshmemInvalid: If the input array is not backed by NVSHMEM memory or multicast is not supported.NvshmemError: If the array’s backing buffer is not properly tracked or initialized.
-
nvshmem.core.interop.cupy.register_external_array(array: None) → None¶ Register an external array with NVSHMEM.
-
nvshmem.core.interop.cupy.unregister_external_array(array: None) → None¶ Unregister an external array with NVSHMEM.