`core.resharding.nvshmem_copy_service.planning.communication_scheduler`

Module Contents

Classes

CommunicationScheduler

Builds a conflict-free, iteration-based communication schedule. Ensures that no PE is overloaded in any given iteration. Uses a greedy first-fit scheduling algorithm.

API

class core.resharding.nvshmem_copy_service.planning.communication_scheduler.CommunicationScheduler

Builds a conflict-free, iteration-based communication schedule. Ensures that no PE is overloaded in any given iteration. Uses a greedy first-fit scheduling algorithm.

Initialization

build_schedule(workloads: Dict[int, List[core.resharding.nvshmem_copy_service.nvshmem_types.WorkloadGroup]], my_pe: int, n_pes: int, group=None) → Tuple[Dict[int, List[core.resharding.nvshmem_copy_service.nvshmem_types.ScheduledBatch]], Dict[Tuple[int, int, int], core.resharding.nvshmem_copy_service.nvshmem_types.WorkloadSummary]]

Main scheduling method. It performs two steps:

- Exchanges workload info with other PEs.
- Assigns batches to iterations.

Returns a tuple of:

- local schedule (iteration -> list of batches)
- global workload summaries (key: (src, dest, batch_idx) -> summary)

Parameters:

workloads – Dict mapping destination PE to list of workload groups.
my_pe – This PE’s rank.
n_pes – Total number of PEs.
group – Optional ProcessGroup for distributed operations.
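
The shape of the two return values can be illustrated with a small sketch. It does not call the real scheduler: `BatchStub`, `SummaryStub`, their fields, and `describe_plan` are hypothetical stand-ins invented for this example, and only the dictionary layouts (iteration -> list of batches, and (src, dest, batch_idx) -> summary) come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class BatchStub:
    """Hypothetical stand-in for ScheduledBatch (field names are assumptions)."""
    src_pe: int
    dest_pe: int
    batch_index: int


@dataclass(frozen=True)
class SummaryStub:
    """Hypothetical stand-in for WorkloadSummary (field name is an assumption)."""
    total_bytes: int


def describe_plan(
    schedule: Dict[int, List[BatchStub]],
    summaries: Dict[Tuple[int, int, int], SummaryStub],
) -> List[str]:
    """Walk the schedule in iteration order and look up each batch's summary."""
    lines = []
    for iteration in sorted(schedule):
        for batch in schedule[iteration]:
            key = (batch.src_pe, batch.dest_pe, batch.batch_index)
            lines.append(
                f"iter {iteration}: PE {batch.src_pe} -> PE {batch.dest_pe}, "
                f"batch {batch.batch_index}, {summaries[key].total_bytes} bytes"
            )
    return lines


# Values shaped like the (schedule, summaries) tuple described above.
schedule = {0: [BatchStub(0, 1, 0)], 1: [BatchStub(0, 2, 0)]}
summaries = {(0, 1, 0): SummaryStub(1024), (0, 2, 0): SummaryStub(2048)}
for line in describe_plan(schedule, summaries):
    print(line)
```

Keying the summaries by (src, dest, batch_idx) lets a caller find the matching summary for any scheduled batch without scanning a list.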

_collect_all_batches(workloads: Dict[int, List[core.resharding.nvshmem_copy_service.nvshmem_types.WorkloadGroup]], my_pe: int, n_pes: int, group=None) → List[core.resharding.nvshmem_copy_service.nvshmem_types.ScheduledBatch]

Exchanges batch counts and details with all PEs to build a global view. Uses torch.distributed for reliable communication.
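
The idea of building a global view can be sketched without a process group by simulating the gather step in plain Python. Everything here is an assumption made for illustration: the `BatchKey` tuple, the helper names, and the rule that each workload group becomes one batch. A real exchange might use a collective such as `torch.distributed.all_gather_object`, but the source does not say which call the module uses.

```python
from typing import Dict, List, Tuple

# Hypothetical key: (src_pe, dest_pe, batch_index)
BatchKey = Tuple[int, int, int]


def local_batch_keys(workloads: Dict[int, List[object]], my_pe: int) -> List[BatchKey]:
    """List one key per outgoing workload group (assumes one batch per group)."""
    return [
        (my_pe, dest_pe, batch_index)
        for dest_pe, groups in workloads.items()
        for batch_index, _group in enumerate(groups)
    ]


def merge_gathered(gathered: List[List[BatchKey]]) -> List[BatchKey]:
    """Flatten the per-PE lists and sort them so every PE derives the same order."""
    return sorted(key for per_pe in gathered for key in per_pe)


# Simulated all-gather: each PE contributes its own list of keys.
per_pe_workloads = {
    0: {1: ["group-a", "group-b"]},  # PE 0 sends two groups to PE 1
    1: {0: ["group-c"]},             # PE 1 sends one group to PE 0
}
gathered = [local_batch_keys(per_pe_workloads[pe], pe) for pe in sorted(per_pe_workloads)]
global_view = merge_gathered(gathered)
print(global_view)
```

Sorting the merged list matters because each PE computes the schedule on its own; identical input order is what makes the independently computed schedules agree.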

_assign_iterations(batches: List[core.resharding.nvshmem_copy_service.nvshmem_types.ScheduledBatch])

Greedy first-fit scheduling algorithm.

Processes batches in sorted order and assigns each one to the first iteration in which it conflicts with no batch already placed there.
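
A minimal sketch of greedy first-fit assignment follows. It is not the module's implementation: the `Batch` tuple, the sort key, and the conflict rule (a PE sends at most one batch and receives at most one batch per iteration) are assumptions chosen to make the technique concrete.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Batch(NamedTuple):
    """Hypothetical stand-in for ScheduledBatch; only the fields the sketch needs."""
    src_pe: int
    dest_pe: int
    batch_index: int


def assign_iterations(batches: List[Batch]) -> Dict[int, List[Batch]]:
    """Place each batch in the earliest iteration where it causes no conflict.

    Assumed conflict rule: within one iteration, a PE sends at most one batch
    and receives at most one batch.
    """
    busy_senders: List[Set[int]] = []    # busy_senders[i]: PEs sending in iteration i
    busy_receivers: List[Set[int]] = []  # busy_receivers[i]: PEs receiving in iteration i
    schedule: Dict[int, List[Batch]] = defaultdict(list)

    # Sorting gives a deterministic processing order (assumed sort key: tuple order).
    for batch in sorted(batches):
        for it in range(len(busy_senders) + 1):
            if it == len(busy_senders):
                # No existing iteration fits, so open a new one.
                busy_senders.append(set())
                busy_receivers.append(set())
            if batch.src_pe not in busy_senders[it] and batch.dest_pe not in busy_receivers[it]:
                busy_senders[it].add(batch.src_pe)
                busy_receivers[it].add(batch.dest_pe)
                schedule[it].append(batch)
                break
    return dict(schedule)


plan = assign_iterations(
    [Batch(0, 1, 0), Batch(0, 2, 0), Batch(1, 2, 0), Batch(2, 0, 0)]
)
for iteration, scheduled in sorted(plan.items()):
    print(iteration, scheduled)
```

In this example PE 0 has two outgoing batches, so the second one is pushed to iteration 1, while the batches from PE 1 and PE 2 still fit into iteration 0. First-fit does not guarantee the minimum number of iterations, but it is simple and deterministic.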

_exchange_workload_summaries(workloads: Dict[int, List[core.resharding.nvshmem_copy_service.nvshmem_types.WorkloadGroup]], my_pe: int, n_pes: int, group=None) → Dict[Tuple[int, int, int], core.resharding.nvshmem_copy_service.nvshmem_types.WorkloadSummary]

Exchanges detailed workload content using torch.distributed. This approach is simple and reliable, and it avoids NVSHMEM symmetric-memory issues.
