core.resharding.nvshmem_copy_service.service#

Remote Copy Service - Main orchestrator for NVSHMEM-based GPU-to-GPU transfers.

This service coordinates task segmentation, workload packing, scheduling, GPU resource management, and pipelined execution.

Module Contents#

Classes#

RemoteCopyService

Main service for managing remote GPU-to-GPU data transfers.

API#

class core.resharding.nvshmem_copy_service.service.RemoteCopyService#

Main service for managing remote GPU-to-GPU data transfers.

Provides high-level API for registering transfers, scheduling, and executing pipelined communication with NVSHMEM.
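A minimal lifecycle sketch using the API documented below, assuming a launch under torchrun. The ring-neighbour pairing and the task-ID convention (a send/receive pair shares the sender's PE rank) are illustrative assumptions, not part of the API:

```python
# Hypothetical usage sketch of RemoteCopyService; assumes the documented
# API and a torchrun launch. Ring pairing and task IDs are illustrative.

def ring_peers(my_pe: int, n_pes: int) -> tuple:
    """Return (dest_pe, src_pe) for a simple ring exchange."""
    return (my_pe + 1) % n_pes, (my_pe - 1) % n_pes


def ring_exchange(svc, src_tensor, dest_tensor, size: int) -> None:
    """Register a ring send/receive pair, then schedule and run once."""
    dest_pe, src_pe = ring_peers(svc.my_pe, svc.n_pes)
    svc.register_send(task_id=svc.my_pe, src_tensor=src_tensor,
                      src_pos=0, size=size, dest_pe=dest_pe)
    svc.register_receive(task_id=src_pe, dest_tensor=dest_tensor,
                         dest_pos=0, size=size, src_pe=src_pe)
    svc.schedule()  # build the execution plan once
    svc.run()       # execute; run() may be repeated with the same plan
```

A full program would wrap this in `svc = RemoteCopyService()`, `svc.init()`, …, `svc.finalize()`, calling `svc.run()` again as needed to repeat the pattern.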

Initialization

property my_pe: int#

Get this PE’s rank.

property n_pes: int#

Get total number of PEs.

property device#

Get CUDA device.

property initialized: bool#

Check if service is initialized.

init(log_level: str = 'INFO') → None#

Initialize the service.

Sets up NVSHMEM, the CUDA device, streams, buffers, and kernels. The process is expected to be launched with torchrun.

Parameters:

log_level – Logging level (TRACE, DEBUG, INFO, WARN, ERROR)
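The accepted levels can be validated before calling init(); a minimal sketch (the helper and launch command are illustrative, not part of the service):

```python
# Launch example (assumption): torchrun --nproc_per_node=8 my_script.py
LOG_LEVELS = ("TRACE", "DEBUG", "INFO", "WARN", "ERROR")  # per the docs

def validated_log_level(level: str) -> str:
    """Normalize a log level and reject values init() does not accept."""
    level = level.upper()
    if level not in LOG_LEVELS:
        raise ValueError(f"unknown log level: {level!r}")
    return level
```

Usage: `svc.init(log_level=validated_log_level("debug"))`.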

register_send(
task_id: int,
src_tensor,
src_pos: int,
size: int,
dest_pe: int,
) → None#

Register a send operation.

Parameters:
  • task_id – Unique task identifier

  • src_tensor – Source tensor (PyTorch/CuPy tensor or pointer)

  • src_pos – Starting position in source tensor

  • size – Number of bytes to send

  • dest_pe – Destination PE rank

register_receive(
task_id: int,
dest_tensor,
dest_pos: int,
size: int,
src_pe: int,
) → None#

Register a receive operation.

Parameters:
  • task_id – Unique task identifier

  • dest_tensor – Destination tensor (PyTorch/CuPy tensor or pointer)

  • dest_pos – Starting position in destination tensor

  • size – Number of bytes to receive

  • src_pe – Source PE rank
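Because `size` is a byte count for both register_send and register_receive, a tensor payload must be converted from elements to bytes; a minimal sketch (for a PyTorch tensor this is `t.numel() * t.element_size()`):

```python
def payload_bytes(numel: int, element_size: int) -> int:
    """Byte count to pass as `size` for `numel` contiguous elements."""
    return numel * element_size

# e.g. 1024 float32 elements (4 bytes each) -> 4096 bytes
```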

schedule() → None#

Build execution schedule.

Can be called once and followed by multiple run() calls for repeated execution with the same communication pattern.

Steps:

  1. Segment large tasks into manageable chunks

  2. Pack tasks into batches

  3. Schedule batches to iterations (conflict-free)

  4. Build GPU execution plans (pointer arrays, chunking)

  5. Create synchronization events
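Step 1 above can be sketched as splitting a byte range into fixed-size chunks; the chunk size here is an arbitrary illustration, not the service's actual segment size:

```python
def segment(pos: int, size: int, chunk_bytes: int = 1 << 20) -> list:
    """Split a `size`-byte transfer at offset `pos` into (pos, len) chunks."""
    chunks = []
    while size > 0:
        n = min(size, chunk_bytes)   # last chunk may be smaller
        chunks.append((pos, n))
        pos += n
        size -= n
    return chunks
```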

run() → None#

Execute the scheduled communication.

Can be called multiple times after a single schedule() call to repeat the same communication pattern.

clear_requests() → None#

Clear registered requests and schedule.

Call this before registering a new set of transfers.

finalize() → None#

Clean up resources.

_segment_tasks() → None#

Segment tasks into manageable chunks.

_prepare_iter_schedules(
schedule_batches: Dict[int, List[core.resharding.nvshmem_copy_service.nvshmem_types.ScheduledBatch]],
workloads: Dict[int, List],
global_summaries: Dict[Tuple[int, int, int], core.resharding.nvshmem_copy_service.nvshmem_types.WorkloadSummary],
num_iterations: int,
) → List[Dict]#

Organize the schedule into an iteration-based structure.

Returns:

List of dicts, one per iteration, each with 'send' and 'recv' keys
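The returned structure can be sketched as follows (shape only; the contents of each 'send'/'recv' list are an assumption beyond what the docs state):

```python
def empty_iter_schedules(num_iterations: int) -> list:
    """Skeleton of the return value: [{'send': [...], 'recv': [...]}, ...]."""
    return [{"send": [], "recv": []} for _ in range(num_iterations)]
```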