core.resharding.copy_services.base

Module Contents

Classes

SendOp

Single send operation pending in a CopyService queue.

RecvOp

Single receive operation pending in a CopyService queue.

CopyService

Abstract interface for submitting and executing batched P2P copy operations.

Functions

match_local_ops_by_task_id

Pair same-rank send/recv ops by task_id, raising on any mismatch.

API

class core.resharding.copy_services.base.SendOp

Single send operation pending in a CopyService queue.

task_id: int | None

tensor: torch.Tensor

dest_rank: int

class core.resharding.copy_services.base.RecvOp

Single receive operation pending in a CopyService queue.

task_id: int | None

tensor: torch.Tensor

src_rank: int
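As a rough illustration of the two records above, the sketch below models them as plain dataclasses. Whether the real classes are dataclasses is an assumption; only the field names and types come from this page. To keep the example runnable without PyTorch, the tensor field is typed as Any and filled with Python lists, which is also an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SendOp:
    """Stand-in for the documented SendOp (illustrative, not the real class)."""

    task_id: Optional[int]
    tensor: Any  # torch.Tensor in the real module
    dest_rank: int


@dataclass
class RecvOp:
    """Stand-in for the documented RecvOp (illustrative, not the real class)."""

    task_id: Optional[int]
    tensor: Any  # torch.Tensor in the real module
    src_rank: int


# A matching send/recv pair for one transfer shares the same task_id.
send = SendOp(task_id=7, tensor=[1.0, 2.0], dest_rank=3)
recv = RecvOp(task_id=7, tensor=[0.0, 0.0], src_rank=0)
```

The shared task_id is what lets a backend recognise that these two queued operations describe the same transfer.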

class core.resharding.copy_services.base.CopyService(group=None)

Bases: abc.ABC

Abstract interface for submitting and executing batched P2P copy operations.

All backends accept an optional task_id on submit_send and submit_recv. The task_id is a globally unique identifier shared by the matching send and recv of a single transfer. Although the parameter defaults to None, it must be supplied for local (same-rank) copies, which are matched by task_id, and for the NVSHMEM backend, which uses it for scheduling. Backends that do not need it for remote transfers ignore it.

abstractmethod submit_send(
src_tensor: torch.Tensor,
dest_rank: int,
task_id: Optional[int] = None,
)

Register a tensor send from the current rank to dest_rank.

abstractmethod submit_recv(
dest_tensor: torch.Tensor,
src_rank: int,
task_id: Optional[int] = None,
)

Register a tensor receive into dest_tensor from src_rank.

abstractmethod run()

Execute all previously submitted send/recv operations as a single batch.

close() -> None

Release backend-owned resources. The default implementation is a no-op; the NVSHMEM backend overrides it.
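The submit/run/close lifecycle described above can be sketched with a toy backend. Everything here is hypothetical: the CopyService class below is a local stand-in that mirrors the documented signatures rather than the real base class, LoopbackCopyService is an invented name, and Python lists stand in for tensors so the example runs without PyTorch. Storing group on the instance and clearing the queues after run() are assumptions as well.

```python
import abc
from typing import Any, Optional


class CopyService(abc.ABC):
    """Local stand-in mirroring the documented interface (not the real class)."""

    def __init__(self, group=None):
        self.group = group  # assumption: the real class may handle this differently

    @abc.abstractmethod
    def submit_send(self, src_tensor: Any, dest_rank: int, task_id: Optional[int] = None):
        """Register a send from the current rank to dest_rank."""

    @abc.abstractmethod
    def submit_recv(self, dest_tensor: Any, src_rank: int, task_id: Optional[int] = None):
        """Register a receive into dest_tensor from src_rank."""

    @abc.abstractmethod
    def run(self):
        """Execute all queued operations as one batch."""

    def close(self) -> None:
        """Default no-op, as documented."""


class LoopbackCopyService(CopyService):
    """Hypothetical backend that only handles same-rank copies."""

    def __init__(self, group=None):
        super().__init__(group)
        self._sends = {}  # task_id -> source buffer
        self._recvs = []  # (task_id, destination buffer)

    def submit_send(self, src_tensor, dest_rank, task_id=None):
        self._sends[task_id] = src_tensor

    def submit_recv(self, dest_tensor, src_rank, task_id=None):
        self._recvs.append((task_id, dest_tensor))

    def run(self):
        # Nothing moves until run(); each recv is filled from the send
        # that carries the same task_id.
        for task_id, dest in self._recvs:
            dest[:] = self._sends[task_id]  # a torch backend might use dest.copy_(src)
        self._sends.clear()
        self._recvs.clear()


service = LoopbackCopyService()
src, dst = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
service.submit_send(src, dest_rank=0, task_id=42)
service.submit_recv(dst, src_rank=0, task_id=42)
pending_before_run = list(dst)  # still zeros: submit only queues the work
service.run()
service.close()
```

The point of the split between submit and run is that a backend sees the whole batch before executing anything, which leaves it free to group or reorder transfers.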

core.resharding.copy_services.base.match_local_ops_by_task_id(
local_sends: list,
local_recvs: list,
backend_name: str,
rank: int,
) -> list[tuple]

Pair same-rank send/recv ops by task_id, raising on any mismatch.

Returns a list of (send_op, recv_op) tuples so the caller can apply backend-specific local-copy logic. Either op may be a backend-local wrapper type, provided it exposes a task_id attribute.
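One way such matching could work is sketched below. This is not the library's implementation: match_local_ops_sketch is a hypothetical name, RuntimeError is an assumed exception type, and using backend_name and rank only to label error messages is a guess at their purpose. SimpleNamespace objects stand in for the op wrappers, since the function only needs a task_id attribute.

```python
from types import SimpleNamespace


def match_local_ops_sketch(local_sends, local_recvs, backend_name, rank):
    """Pair same-rank sends and recvs by task_id (illustrative sketch only)."""
    prefix = f"[{backend_name}] rank {rank}:"
    sends_by_id = {}
    for send_op in local_sends:
        if send_op.task_id is None:
            raise RuntimeError(f"{prefix} local send is missing a task_id")
        if send_op.task_id in sends_by_id:
            raise RuntimeError(f"{prefix} duplicate send task_id {send_op.task_id}")
        sends_by_id[send_op.task_id] = send_op

    pairs = []
    for recv_op in local_recvs:
        # A recv whose task_id is None or unknown has no partner send.
        if recv_op.task_id not in sends_by_id:
            raise RuntimeError(f"{prefix} no local send for recv task_id {recv_op.task_id}")
        pairs.append((sends_by_id.pop(recv_op.task_id), recv_op))

    if sends_by_id:
        raise RuntimeError(f"{prefix} unmatched send task_ids {sorted(sends_by_id)}")
    return pairs


sends = [SimpleNamespace(task_id=1, tensor=[1.0]), SimpleNamespace(task_id=2, tensor=[2.0])]
recvs = [SimpleNamespace(task_id=2, tensor=[0.0]), SimpleNamespace(task_id=1, tensor=[0.0])]

# The caller applies its own local-copy logic to each matched pair.
pairs = match_local_ops_sketch(sends, recvs, "loopback", rank=0)
for send_op, recv_op in pairs:
    recv_op.tensor[:] = send_op.tensor

# A send without a matching recv is reported as a mismatch.
error_message = None
try:
    match_local_ops_sketch(sends, [], "loopback", rank=0)
except RuntimeError as exc:
    error_message = str(exc)
```

Pairs come back in recv order in this sketch; the ordering of the real function's result is not stated in the documentation.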