core.resharding.nvshmem_copy_service.service#

Remote Copy Service - Main orchestrator for NVSHMEM-based GPU-to-GPU transfers.

This service coordinates task segmentation, workload packing, scheduling, GPU resource management, and pipelined execution.

Module Contents#

Classes#

RemoteCopyService

Main service for managing remote GPU-to-GPU data transfers.

API#

class core.resharding.nvshmem_copy_service.service.RemoteCopyService#

Main service for managing remote GPU-to-GPU data transfers.

Provides high-level API for registering transfers, scheduling, and executing pipelined communication with NVSHMEM.
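A minimal lifecycle sketch using the API documented below, assuming a launch under torchrun. The ring-neighbour pairing and the task-ID convention (a send/receive pair shares the sender's PE rank) are illustrative assumptions, not part of the API:

```python
# Hypothetical usage sketch of RemoteCopyService; assumes the documented
# API and a torchrun launch. Ring pairing and task IDs are illustrative.

def ring_peers(my_pe: int, n_pes: int) -> tuple:
    """Return (dest_pe, src_pe) for a simple ring exchange."""
    return (my_pe + 1) % n_pes, (my_pe - 1) % n_pes


def ring_exchange(svc, src_tensor, dest_tensor, size: int) -> None:
    """Register a ring send/receive pair, then schedule and run once."""
    dest_pe, src_pe = ring_peers(svc.my_pe, svc.n_pes)
    svc.register_send(task_id=svc.my_pe, src_tensor=src_tensor,
                      src_pos=0, size=size, dest_pe=dest_pe)
    svc.register_receive(task_id=src_pe, dest_tensor=dest_tensor,
                         dest_pos=0, size=size, src_pe=src_pe)
    svc.schedule()  # build the execution plan once
    svc.run()       # execute; run() may be repeated with the same plan
```

A full program would wrap this in `svc = RemoteCopyService()`, `svc.init()`, …, `svc.finalize()`, calling `svc.run()` again as needed to repeat the pattern.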

Initialization

property my_pe: int#

Get this PE’s rank.

property n_pes: int#

Get total number of PEs.

property device#

Get CUDA device.

property initialized: bool#

Check if service is initialized.

init(log_level: str = 'INFO') → None#

Initialize the service.

Sets up NVSHMEM, the CUDA device, streams, buffers, and kernels. The process is expected to be launched with torchrun.

Parameters:

log_level – Logging level (TRACE, DEBUG, INFO, WARN, ERROR)
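The accepted levels can be validated before calling init(); a minimal sketch (the helper and launch command are illustrative, not part of the service):

```python
# Launch example (assumption): torchrun --nproc_per_node=8 my_script.py
LOG_LEVELS = ("TRACE", "DEBUG", "INFO", "WARN", "ERROR")  # per the docs

def validated_log_level(level: str) -> str:
    """Normalize a log level and reject values init() does not accept."""
    level = level.upper()
    if level not in LOG_LEVELS:
        raise ValueError(f"unknown log level: {level!r}")
    return level
```

Usage: `svc.init(log_level=validated_log_level("debug"))`.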

register_send(
task_id: int,
src_tensor,
src_pos: int,
size: int,
dest_pe: int,
) → None#

Register a send operation.

Parameters:
  • task_id – Unique task identifier

  • src_tensor – Source tensor (PyTorch/CuPy tensor or pointer)

  • src_pos – Starting position in source tensor

  • size – Number of bytes to send

  • dest_pe – Destination PE rank

register_receive(
task_id: int,
dest_tensor,
dest_pos: int,
size: int,
src_pe: int,
) → None#

Register a receive operation.

Parameters:
  • task_id – Unique task identifier

  • dest_tensor – Destination tensor (PyTorch/CuPy tensor or pointer)

  • dest_pos – Starting position in destination tensor

  • size – Number of bytes to receive

  • src_pe – Source PE rank
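Because `size` is a byte count for both register_send and register_receive, a tensor payload must be converted from elements to bytes; a minimal sketch (for a PyTorch tensor this is `t.numel() * t.element_size()`):

```python
def payload_bytes(numel: int, element_size: int) -> int:
    """Byte count to pass as `size` for `numel` contiguous elements."""
    return numel * element_size

# e.g. 1024 float32 elements (4 bytes each) -> 4096 bytes
```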

schedule() → None#

Build execution schedule.

Can be called once and followed by multiple run() calls for repeated execution with the same communication pattern.

Steps:

  1. Segment large tasks into manageable chunks

  2. Pack tasks into batches

  3. Schedule batches to iterations (conflict-free)

  4. Build GPU execution plans (pointer arrays, chunking)

  5. Create synchronization events
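Step 1 above can be sketched as splitting a byte range into fixed-size chunks; the chunk size here is an arbitrary illustration, not the service's actual segment size:

```python
def segment(pos: int, size: int, chunk_bytes: int = 1 << 20) -> list:
    """Split a `size`-byte transfer at offset `pos` into (pos, len) chunks."""
    chunks = []
    while size > 0:
        n = min(size, chunk_bytes)   # last chunk may be smaller
        chunks.append((pos, n))
        pos += n
        size -= n
    return chunks
```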

run() → None#

Execute the scheduled communication.

Can be called multiple times after a single schedule() call to repeat the same communication pattern.

clear_requests() → None#

Clear registered requests and schedule.

Call this before registering a new set of transfers.

finalize() → None#

Clean up resources.

_segment_tasks() → None#

Segment tasks into manageable chunks.

_prepare_iter_schedules(
schedule_batches: Dict[int, List[core.resharding.nvshmem_copy_service.nvshmem_types.ScheduledBatch]],
workloads: Dict[int, List],
global_summaries: Dict[Tuple[int, int, int], core.resharding.nvshmem_copy_service.nvshmem_types.WorkloadSummary],
num_iterations: int,
) → List[Dict]#

Organize the schedule into an iteration-based structure.

Returns:

List of dicts, one per iteration, each with 'send' and 'recv' keys
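The returned structure can be sketched as follows (shape only; the contents of each 'send'/'recv' list are an assumption beyond what the docs state):

```python
def empty_iter_schedules(num_iterations: int) -> list:
    """Skeleton of the return value: [{'send': [...], 'recv': [...]}, ...]."""
    return [{"send": [], "recv": []} for _ in range(num_iterations)]
```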