core.resharding.nvshmem_copy_service.planning.communication_scheduler

Module Contents

Classes

CommunicationScheduler

Builds a conflict-free, iteration-based communication schedule, ensuring that no PE (processing element) is overloaded in any given iteration.

API

class core.resharding.nvshmem_copy_service.planning.communication_scheduler.CommunicationScheduler

Builds a conflict-free, iteration-based communication schedule, ensuring that no PE (processing element) is overloaded in any given iteration.

Initialization

build_schedule(
workloads: Dict[int, List[core.resharding.nvshmem_copy_service.nvshmem_types.WorkloadGroup]],
my_pe: int,
n_pes: int,
) -> Tuple[Dict[int, List[core.resharding.nvshmem_copy_service.nvshmem_types.ScheduledBatch]], Dict[Tuple[int, int, int], core.resharding.nvshmem_copy_service.nvshmem_types.WorkloadSummary]]

Main scheduling method.

  1. Exchanges workload info with other PEs.

  2. Assigns batches to iterations.

  3. Returns:

    • local schedule (iteration -> list of batches)

    • global workload summaries (key: (src, dest, batch_idx) -> summary)
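The assignment step above can be illustrated with a minimal sketch. It is not the module's actual implementation: the `Batch` class is a hypothetical stand-in for `ScheduledBatch`, the function names are illustrative, and the conflict rule (a PE sends at most one batch and receives at most one batch per iteration) is an assumption about what "not overloaded" means. The local schedule here keeps every batch that involves `my_pe` as sender or receiver, which is also an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Batch:
    """Hypothetical stand-in for ScheduledBatch; the real fields may differ."""
    src_pe: int
    dest_pe: int
    batch_index: int
    iteration: int = -1  # -1 means "not yet scheduled"


def assign_iterations(batches: List[Batch]) -> None:
    """Greedy first-fit: put each batch in the earliest conflict-free iteration.

    Assumed conflict rule: within one iteration a PE may send at most one
    batch and receive at most one batch.
    """
    busy_senders: Dict[int, Set[int]] = defaultdict(set)
    busy_receivers: Dict[int, Set[int]] = defaultdict(set)
    # Sort deterministically so every PE derives the same schedule
    # from the same global batch list.
    ordered = sorted(batches, key=lambda b: (b.src_pe, b.dest_pe, b.batch_index))
    for batch in ordered:
        it = 0
        while batch.src_pe in busy_senders[it] or batch.dest_pe in busy_receivers[it]:
            it += 1
        busy_senders[it].add(batch.src_pe)
        busy_receivers[it].add(batch.dest_pe)
        batch.iteration = it


def local_schedule(batches: List[Batch], my_pe: int) -> Dict[int, List[Batch]]:
    """Keep only batches involving my_pe, grouped by iteration."""
    schedule: Dict[int, List[Batch]] = defaultdict(list)
    for b in batches:
        if my_pe in (b.src_pe, b.dest_pe):
            schedule[b.iteration].append(b)
    return dict(schedule)


all_batches = [Batch(0, 1, 0), Batch(0, 2, 0), Batch(1, 2, 0), Batch(2, 0, 0)]
assign_iterations(all_batches)
schedule_pe0 = local_schedule(all_batches, my_pe=0)
```

In this example PE 0 has two outgoing batches, so the second one is pushed to iteration 1, while the transfers 1 -> 2 and 2 -> 0 share iteration 0 with the first because they use different senders and receivers.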

_collect_all_batches(
workloads: Dict[int, List[core.resharding.nvshmem_copy_service.nvshmem_types.WorkloadGroup]],
my_pe: int,
n_pes: int,
) -> List[core.resharding.nvshmem_copy_service.nvshmem_types.ScheduledBatch]

Exchanges batch counts and details with all PEs to build a global view. Uses torch.distributed for reliable communication.

_assign_iterations(
batches: List[core.resharding.nvshmem_copy_service.nvshmem_types.ScheduledBatch],
)

_has_conflict(
batch: core.resharding.nvshmem_copy_service.nvshmem_types.ScheduledBatch,
iteration: int,
all_batches: List[core.resharding.nvshmem_copy_service.nvshmem_types.ScheduledBatch],
) -> bool

_exchange_workload_summaries(
workloads: Dict[int, List[core.resharding.nvshmem_copy_service.nvshmem_types.WorkloadGroup]],
my_pe: int,
n_pes: int,
) -> Dict[Tuple[int, int, int], core.resharding.nvshmem_copy_service.nvshmem_types.WorkloadSummary]

Exchanges detailed workload content using torch.distributed. This approach is simple and reliable, and it avoids NVSHMEM symmetric-memory issues.
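As a rough illustration of the resulting data structure, the sketch below merges per-PE summary lists into one dictionary keyed by `(src, dest, batch_idx)`. It is pure Python and simulates the gathered input; in a real run, such per-rank lists could come from a collective like `torch.distributed.all_gather_object`, though the source does not say which collective the module uses. The `merge_summaries` helper and the dictionary fields (`dest_pe`, `batch_index`, `total_bytes`) are hypothetical placeholders for `WorkloadSummary`.

```python
from typing import Dict, List, Tuple

Key = Tuple[int, int, int]  # (src, dest, batch_idx)


def merge_summaries(gathered: List[List[dict]]) -> Dict[Key, dict]:
    """Flatten per-PE summary lists into one dict keyed by (src, dest, batch_idx).

    gathered[i] holds the summaries contributed by PE i (assumed layout).
    """
    merged: Dict[Key, dict] = {}
    for src_pe, summaries in enumerate(gathered):
        for summary in summaries:
            key = (src_pe, summary["dest_pe"], summary["batch_index"])
            if key in merged:
                raise ValueError(f"duplicate summary for {key}")
            merged[key] = summary
    return merged


# Simulated gather result for 2 PEs; the field names are illustrative only.
gathered = [
    [{"dest_pe": 1, "batch_index": 0, "total_bytes": 4096}],
    [
        {"dest_pe": 0, "batch_index": 0, "total_bytes": 1024},
        {"dest_pe": 0, "batch_index": 1, "total_bytes": 512},
    ],
]
global_summaries = merge_summaries(gathered)
```

Because every PE receives the same gathered input, each one can build an identical global dictionary without further communication.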