nemo_rl.weight_sync.collective_weight_synchronizer
NCCL collective weight synchronizer for non-colocated deployments.
Handles weight transfer between policy and generation workers running on separate GPU clusters using NCCL collective communication. The policy broadcasts its weights, and generation workers receive them via the established NCCL process group.
Lifecycle per sync:
1. policy.broadcast_weights_for_collective() – send via NCCL
2. generation.update_weights_from_collective() – receive via NCCL
3. Verify transfer success
No offload/restore steps are needed since policy and generation run on separate GPUs with dedicated memory.
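The lifecycle above can be sketched with stand-in objects. This is a minimal illustration, not the library's implementation: `StubPolicy`, `StubGeneration`, and `sync_once` are hypothetical, and the return values (nothing from the sender, a boolean success flag from the receiver) are assumptions chosen so the example runs without GPUs or NCCL. Only the two method names come from the text above.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


class StubPolicy:
    """Hypothetical stand-in for the policy side (no NCCL involved)."""

    def __init__(self, weights: Weights) -> None:
        self.weights = weights
        # A plain dict stands in for the pre-established process group.
        self.channel: Weights = {}

    def broadcast_weights_for_collective(self) -> None:
        # Step 1: "send" by copying the current weights into the channel.
        self.channel.clear()
        self.channel.update({k: list(v) for k, v in self.weights.items()})


class StubGeneration:
    """Hypothetical stand-in for the generation side."""

    def __init__(self, channel: Weights) -> None:
        self.channel = channel
        self.weights: Weights = {}

    def update_weights_from_collective(self) -> bool:
        # Step 2: "receive" by copying out of the channel.
        # Returning a success flag is an assumption of this sketch.
        if not self.channel:
            return False
        self.weights = {k: list(v) for k, v in self.channel.items()}
        return True


def sync_once(policy: StubPolicy, generation: StubGeneration) -> None:
    """Run one sync: send, receive, then verify."""
    policy.broadcast_weights_for_collective()
    ok = generation.update_weights_from_collective()
    # Step 3: verify transfer success.
    if not ok:
        raise RuntimeError("weight transfer failed")


policy = StubPolicy({"layer.weight": [0.1, 0.2]})
generation = StubGeneration(policy.channel)
sync_once(policy, generation)
```

Note that the sketch has no offload or restore step, matching the non-colocated case described above where each side keeps its own dedicated memory.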
Module Contents
Classes
CollectiveWeightSynchronizer | Weight synchronizer using NCCL collectives for non-colocated deployments.
API
- class nemo_rl.weight_sync.collective_weight_synchronizer.CollectiveWeightSynchronizer(
- policy: Any,
- generation: Any,
- train_cluster: Any,
- inference_cluster: Any,
- )
Bases: nemo_rl.weight_sync.interfaces.WeightSynchronizer
Weight synchronizer using NCCL collectives for non-colocated deployments.
Policy and generation workers run on separate GPU clusters. Weights are synchronized via NCCL broadcast over a pre-established process group.
- Parameters:
policy – Policy object implementing ColocatablePolicyInterface.
generation – Generation object implementing GenerationInterface.
train_cluster – RayVirtualCluster for the training workers, used to obtain the master address/port and world size for collective init.
inference_cluster – RayVirtualCluster for the inference workers.
Initialization
- sync_weights(
- *,
- timer: Optional[nemo_rl.utils.timer.Timer] = None,
- kv_scales: Optional[dict[str, float]] = None,
- )
- property is_stale: bool
- mark_stale() -> None
- init_communicator() -> None
- shutdown() -> None
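One way the members listed above might fit together in a training loop is sketched below. `StubSynchronizer` and `run_steps` are hypothetical stand-ins that only mirror the method names on this page. The semantics shown are assumptions, since this page does not document them: that a new synchronizer starts out stale, that `mark_stale()` flags the generation weights as outdated, that `sync_weights()` clears the flag and returns nothing, and that `init_communicator()` must be called before the first sync.

```python
from typing import Any, Optional


class StubSynchronizer:
    """Hypothetical stand-in mirroring the method names listed above."""

    def __init__(self) -> None:
        self._stale = True    # assumption: nothing has been synced yet
        self._ready = False   # assumption: communicator not yet set up
        self.sync_count = 0

    @property
    def is_stale(self) -> bool:
        return self._stale

    def mark_stale(self) -> None:
        self._stale = True

    def init_communicator(self) -> None:
        self._ready = True

    def sync_weights(
        self,
        *,
        timer: Optional[Any] = None,
        kv_scales: Optional[dict[str, float]] = None,
    ) -> None:
        # timer and kv_scales are accepted but unused in this stub.
        if not self._ready:
            raise RuntimeError("init_communicator() must be called first")
        self.sync_count += 1
        self._stale = False

    def shutdown(self) -> None:
        self._ready = False


def run_steps(sync: StubSynchronizer, num_steps: int) -> int:
    """Sync only when stale, then mark stale after each training step."""
    sync.init_communicator()
    try:
        for _ in range(num_steps):
            if sync.is_stale:
                sync.sync_weights()
            # ... generate rollouts with fresh weights, then train ...
            sync.mark_stale()  # training changed the policy weights
    finally:
        sync.shutdown()
    return sync.sync_count


synchronizer = StubSynchronizer()
total_syncs = run_steps(synchronizer, num_steps=3)
```

Guarding `sync_weights()` with `is_stale` avoids a redundant broadcast when the policy has not been updated since the last sync.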