core.resharding.nvshmem_copy_service.core.kernel_launcher#

CUDA kernel management and launching for pack/unpack operations.

Handles kernel compilation, launching, and stream coordination.

Module Contents#

Classes#

KernelLauncher

Manages CUDA kernel loading and launching for data pack/unpack operations.

API#

class core.resharding.nvshmem_copy_service.core.kernel_launcher.KernelLauncher#

Manages CUDA kernel loading and launching for data pack/unpack operations.

Initialization

load_kernels() → None#

Load and compile CUDA kernels from source.
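The docstring implies the kernels are compiled at runtime from CUDA source. A minimal sketch of that pattern using CuPy's `RawModule` — the kernel name `pack_chunks`, its signature, and the compile-once caching are illustrative assumptions, not this module's actual internals:

```python
import functools

# Illustrative CUDA source; the real pack/unpack kernels are assumptions here.
PACK_KERNEL_SRC = r"""
extern "C" __global__
void pack_chunks(const void** src_addrs, void** dst_addrs,
                 const size_t* sizes, int num_chunks) {
    // One block per chunk; threads cooperatively copy the chunk's bytes.
    int chunk = blockIdx.x;
    if (chunk >= num_chunks) return;
    const char* src = (const char*)src_addrs[chunk];
    char* dst = (char*)dst_addrs[chunk];
    for (size_t i = threadIdx.x; i < sizes[chunk]; i += blockDim.x)
        dst[i] = src[i];
}
"""

@functools.lru_cache(maxsize=None)
def load_kernels():
    """Compile once and cache the kernel handle, mirroring load_kernels()."""
    import cupy as cp  # deferred so the sketch imports without a GPU present
    module = cp.RawModule(code=PACK_KERNEL_SRC)
    return module.get_function("pack_chunks")
```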

set_streams(pack_stream, unpack_stream) → None#

Cache CuPy stream wrappers for kernel launching.

This eliminates per-launch overhead of stream pointer extraction and CuPy ExternalStream creation.

Parameters:
  • pack_stream – CUDA stream for pack operations

  • unpack_stream – CUDA stream for unpack operations

launch_pack(
gpu_plan: Tuple[Any, Any, Any, int],
pack_stream,
torch_pack_stream: torch.cuda.ExternalStream,
pack_event: torch.cuda.Event,
) → None#

Launch pack kernel to copy data from user tensors to send buffer.

Parameters:
  • gpu_plan – Tuple of (cp_src_addrs, cp_dst_addrs, cp_sizes, num_chunks); the first three are CuPy arrays, num_chunks is an int

  • pack_stream – CUDA stream (cuda.core.experimental.Stream) - unused, kept for compatibility

  • torch_pack_stream – PyTorch external stream wrapper

  • pack_event – CUDA event to record after kernel launch
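As a CPU analogue of what the pack kernel does — the real operation runs as a CUDA kernel over raw device addresses; here bytearrays and (buffer, offset) pairs stand in for GPU memory and device pointers:

```python
def pack(gpu_plan, send_buffer):
    """Copy each chunk from its source buffer into the contiguous send buffer.

    gpu_plan mirrors the (src_addrs, dst_addrs, sizes, num_chunks) tuple,
    with (buffer, offset) pairs standing in for raw device addresses.
    """
    src_addrs, dst_offsets, sizes, num_chunks = gpu_plan
    for i in range(num_chunks):
        buf, off = src_addrs[i]
        size = sizes[i]
        send_buffer[dst_offsets[i]:dst_offsets[i] + size] = buf[off:off + size]

# Two "user tensors" packed back to back into one contiguous send buffer.
t0, t1 = bytearray(b"AAAA"), bytearray(b"BBBB")
send = bytearray(8)
pack(([(t0, 0), (t1, 0)], [0, 4], [4, 4], 2), send)
assert bytes(send) == b"AAAABBBB"
```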

launch_unpack(
gpu_plan: Tuple[Any, Any, Any, int],
unpack_stream,
torch_unpack_stream: torch.cuda.ExternalStream,
unpack_event: torch.cuda.Event,
) → None#

Launch unpack kernel to copy data from receive buffer to user tensors.

Parameters:
  • gpu_plan – Tuple of (cp_src_addrs, cp_dst_addrs, cp_sizes, num_chunks); the first three are CuPy arrays, num_chunks is an int

  • unpack_stream – CUDA stream (cuda.core.experimental.Stream) - unused, kept for compatibility

  • torch_unpack_stream – PyTorch external stream wrapper

  • unpack_event – CUDA event to record after kernel launch
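The unpack path is the inverse scatter: chunks are copied out of the contiguous receive buffer back into the user tensors. A CPU sketch using bytearrays and (buffer, offset) pairs in place of device memory and raw addresses:

```python
def unpack(gpu_plan, recv_buffer):
    """Scatter chunks from the contiguous receive buffer into user tensors.

    gpu_plan mirrors (src_addrs, dst_addrs, sizes, num_chunks); sources are
    offsets into recv_buffer, destinations are (buffer, offset) pairs.
    """
    src_offsets, dst_addrs, sizes, num_chunks = gpu_plan
    for i in range(num_chunks):
        buf, off = dst_addrs[i]
        size = sizes[i]
        buf[off:off + size] = recv_buffer[src_offsets[i]:src_offsets[i] + size]

# Scatter a received buffer back into two "user tensors".
recv = bytearray(b"XXXXYYYY")
t0, t1 = bytearray(4), bytearray(4)
unpack(([0, 4], [(t0, 0), (t1, 0)], [4, 4], 2), recv)
assert bytes(t0) == b"XXXX" and bytes(t1) == b"YYYY"
```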