core.resharding.nvshmem_copy_service.planning.communication_scheduler
Module Contents
Classes
CommunicationScheduler: Builds a conflict-free, iteration-based schedule for communication. Ensures that in any given iteration, a PE is not overloaded.
API
- class core.resharding.nvshmem_copy_service.planning.communication_scheduler.CommunicationScheduler
Builds a conflict-free, iteration-based communication schedule, ensuring that no PE (processing element) is overloaded in any single iteration.
Initialization
- build_schedule(workloads: Dict[int, List[core.resharding.nvshmem_copy_service.nvshmem_types.WorkloadGroup]], my_pe: int, n_pes: int)
Main scheduling method. Exchanges workload information with the other PEs, then assigns batches to iterations.
Returns:
local schedule (iteration -> list of batches)
global workload summaries (key: (src, dest, batch_idx) -> summary)
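The iteration assignment described above can be sketched as a greedy pass. This is a minimal illustration, not the module's implementation: it assumes that "not overloaded" means a PE sends at most one batch and receives at most one batch per iteration, that the local schedule holds the batches in which `my_pe` is either source or destination, and it uses a hypothetical `(src_pe, dest_pe, batch_idx)` tuple in place of `ScheduledBatch`.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for ScheduledBatch: (src_pe, dest_pe, batch_idx).
Batch = Tuple[int, int, int]


def greedy_schedule(batches: List[Batch], my_pe: int) -> Dict[int, List[Batch]]:
    """Place each batch in the earliest iteration where both of its PEs are idle."""
    busy_src: Dict[int, Set[int]] = defaultdict(set)   # iteration -> PEs already sending
    busy_dest: Dict[int, Set[int]] = defaultdict(set)  # iteration -> PEs already receiving
    local: Dict[int, List[Batch]] = defaultdict(list)

    # Sorting gives every PE the same order, so each PE derives the same
    # global assignment from the same global batch list.
    for src, dest, idx in sorted(batches):
        it = 0
        while src in busy_src[it] or dest in busy_dest[it]:
            it += 1
        busy_src[it].add(src)
        busy_dest[it].add(dest)
        # Assumption: the local schedule keeps only batches that involve my_pe.
        if my_pe in (src, dest):
            local[it].append((src, dest, idx))
    return dict(local)


example = [(0, 1, 0), (0, 1, 1), (1, 0, 0), (2, 1, 0)]
schedule_pe0 = greedy_schedule(example, my_pe=0)
schedule_pe2 = greedy_schedule(example, my_pe=2)
```

In the example, PE 1 is the destination of three batches, so those batches are spread across three iterations, while the reverse transfer from PE 1 to PE 0 shares iteration 0.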
- _collect_all_batches(workloads: Dict[int, List[core.resharding.nvshmem_copy_service.nvshmem_types.WorkloadGroup]], my_pe: int, n_pes: int)
Exchanges batch counts and details with all PEs to build a global view. Uses torch.distributed for reliable communication.
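One way to exchange per-destination batch counts with `torch.distributed` is `all_gather_object`, which collects one picklable object from every rank. The sketch below is an assumption about how such an exchange could look, not the method's actual body: the function name `gather_batch_counts` and the injectable `all_gather` hook are hypothetical (the hook exists only so the logic can be exercised without a process group), and it assumes the torch.distributed rank equals the PE index.

```python
from typing import Callable, Dict, List, Optional

Counts = Dict[int, int]  # dest_pe -> number of batches this PE will send


def gather_batch_counts(
    local_counts: Counts,
    n_pes: int,
    all_gather: Optional[Callable[[Counts], List[Counts]]] = None,
) -> Dict[int, Counts]:
    """Return {src_pe: {dest_pe: batch_count}} covering every PE."""
    if all_gather is None:
        # Real path: needs an initialized torch.distributed process group.
        import torch.distributed as dist

        gathered: List = [None] * n_pes
        dist.all_gather_object(gathered, local_counts)
    else:
        # Test path: caller supplies the gathered list directly.
        gathered = all_gather(local_counts)
    # Results are ordered by rank; assumed here to match the PE index.
    return {pe: dict(counts) for pe, counts in enumerate(gathered)}


# Simulate a 2-PE job from the point of view of PE 0.
fake_gather = lambda mine: [mine, {0: 2}]
global_counts = gather_batch_counts({1: 3}, n_pes=2, all_gather=fake_gather)
```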
- _assign_iterations( )
- _has_conflict(batch: core.resharding.nvshmem_copy_service.nvshmem_types.ScheduledBatch, iteration: int, all_batches: List[core.resharding.nvshmem_copy_service.nvshmem_types.ScheduledBatch])
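The source does not document what counts as a conflict. As a hedged sketch, the check below treats two batches as conflicting when they are placed in the same iteration and share a source PE or a destination PE. The `Batch` dataclass and its field names (`src_pe`, `dest_pe`, `batch_index`, `iteration`) are hypothetical stand-ins for `ScheduledBatch`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Batch:
    """Hypothetical stand-in for ScheduledBatch."""
    src_pe: int
    dest_pe: int
    batch_index: int
    iteration: int = -1  # -1 means "not scheduled yet"


def has_conflict(batch: Batch, iteration: int, all_batches: List[Batch]) -> bool:
    """True if placing `batch` in `iteration` would reuse a busy source or destination PE."""
    for other in all_batches:
        if other is batch or other.iteration != iteration:
            continue  # skip the batch itself and batches in other iterations
        if other.src_pe == batch.src_pe or other.dest_pe == batch.dest_pe:
            return True
    return False


a = Batch(src_pe=0, dest_pe=1, batch_index=0, iteration=0)  # already in iteration 0
b = Batch(src_pe=0, dest_pe=2, batch_index=0)               # shares source PE 0 with a
c = Batch(src_pe=2, dest_pe=3, batch_index=0)               # disjoint from a
batches = [a, b, c]
```

With these inputs, `b` cannot join iteration 0 because PE 0 is already sending there, but it can go into iteration 1, and `c` fits into iteration 0.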
- _exchange_workload_summaries(workloads: Dict[int, List[core.resharding.nvshmem_copy_service.nvshmem_types.WorkloadGroup]], my_pe: int, n_pes: int)
Exchanges detailed workload content using torch.distributed. This approach is simple and reliable because it avoids NVSHMEM symmetric memory issues.
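Once each PE's summaries have been gathered (for example with `torch.distributed.all_gather_object`), they can be flattened into the `(src, dest, batch_idx) -> summary` mapping that `build_schedule` is documented to return. The sketch below covers only that merge step. The function name `merge_summaries` is hypothetical, the summaries are placeholder strings because the real summary type is not described here, and `batch_idx` is assumed to be the position of a summary within its per-destination list.

```python
from typing import Any, Dict, List, Tuple

SummaryKey = Tuple[int, int, int]  # (src_pe, dest_pe, batch_idx)


def merge_summaries(per_pe: List[Dict[int, List[Any]]]) -> Dict[SummaryKey, Any]:
    """Flatten per-PE {dest_pe: [summary, ...]} dicts into one keyed mapping."""
    merged: Dict[SummaryKey, Any] = {}
    # The list index is taken to be the source PE (rank order of the gather).
    for src, by_dest in enumerate(per_pe):
        for dest, summaries in by_dest.items():
            for batch_idx, summary in enumerate(summaries):
                merged[(src, dest, batch_idx)] = summary
    return merged


# What a 2-PE gather might yield: PE 0 sends two batches to PE 1, PE 1 sends one to PE 0.
gathered = [{1: ["a", "b"]}, {0: ["c"]}]
summaries = merge_summaries(gathered)
```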