nemo_automodel.components.speculative.eagle.remote.transport

Dedicated NCCL transport for GPU-to-GPU supervision-tensor transfer.

A 2-process NCCL group connects the target server (rank 0) to the training client (rank 1). HTTP stays the control plane (input_ids up, tensor metadata down); this group is the data plane for the large supervision tensors, working over NVLink intra-node and RDMA/RoCE inter-node.

The group is created from an explicit TCPStore, so it is independent of the training job’s default process group. Group creation is delegated to SGLang’s init_custom_process_group, a proven path that builds a non-default group from a provided store. SGLang is an optional, non-bundled dependency: when it is absent, NCCLTransport.initialize returns False and the caller falls back to the binary wire format.

Environment variables:

  • NEMO_EAGLE_ENABLE_NCCL: "1" (default) to attempt NCCL, "0" to force the wire-format fallback.
  • NEMO_EAGLE_NCCL_PORT: TCP rendezvous port (default: HTTP port + 100).
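
As a rough illustration of how these two variables combine, the sketch below resolves them into an enabled flag and a port. The helper name `resolve_nccl_settings` and its `env` parameter are hypothetical and not part of this module, and treating every value other than "0" as enabled is an assumption; only the variable names and defaults come from the list above.

```python
import os


def resolve_nccl_settings(http_port: int, env=None) -> tuple[bool, int]:
    """Return (nccl_enabled, nccl_port) from the documented environment variables.

    Hypothetical helper for illustration; the module may parse these differently.
    """
    env = os.environ if env is None else env
    # "1" is the documented default; "0" forces the wire-format fallback.
    # Assumption: any value other than "0" counts as enabled.
    enabled = env.get("NEMO_EAGLE_ENABLE_NCCL", "1") != "0"
    # The documented default rendezvous port is the HTTP port plus 100.
    port = int(env.get("NEMO_EAGLE_NCCL_PORT", http_port + 100))
    return enabled, port


if __name__ == "__main__":
    print(resolve_nccl_settings(8000, env={}))
```

Passing a plain dict as `env` keeps the helper easy to exercise without touching the real process environment.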

Module Contents

Classes

| Name | Description |
| --- | --- |
| NCCLTransport | A dedicated 2-process NCCL group between server (rank 0) and client (rank 1). |

Data

_ELEMENT_SIZE

_NCCL_UNSUPPORTED_DTYPES

logger

API

class nemo_automodel.components.speculative.eagle.remote.transport.NCCLTransport(
nccl_port: int,
host: str,
is_server: bool
)

A dedicated 2-process NCCL group between server (rank 0) and client (rank 1).

Parameters

  • nccl_port: TCP port for the rendezvous store.
  • host: Hostname/IP of the server (rendezvous master).
  • is_server: True on the server side (rank 0), False on the client side (rank 1).

_group_name = f'nemo_eagle_target_transfer_{nccl_port}'
_init_lock = threading.Lock()
_pg: Optional[ProcessGroup] = None
_rank = 0 if is_server else 1
is_initialized: bool

nemo_automodel.components.speculative.eagle.remote.transport.NCCLTransport.destroy() -> None

Abort and unregister the group.

The group is asymmetric: the client can finish before the long-lived server, so a blocking destroy_process_group (which expects both peers) would hang. Instead, this method aborts the local communicator and removes it from PyTorch’s global registry so that the later default-group teardown does not try to shut it down again.

nemo_automodel.components.speculative.eagle.remote.transport.NCCLTransport.initialize(
timeout_seconds: int = 120
) -> bool

Establish the NCCL group via TCP rendezvous; blocks until both peers connect.

Returns True on success, False on any failure (the caller then falls back to the binary wire format).

nemo_automodel.components.speculative.eagle.remote.transport.NCCLTransport.recv_tensors(
metadata: dict[str, typing.Optional[dict]],
keys_order: list[str]
) -> dict[str, typing.Optional[torch.Tensor]]

Receive tensors (client side) as described by metadata, in keys_order.

nemo_automodel.components.speculative.eagle.remote.transport.NCCLTransport.send_tensors(
tensor_dict: dict[str, typing.Optional[torch.Tensor]],
keys_order: list[str]
) -> None

Send tensors (server side) in keys_order; skips None entries.
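
Both methods walk the same keys_order, which is presumably what keeps each send paired with the matching receive. The sketch below illustrates only that ordering contract in plain Python, without torch or NCCL. The helper `keys_to_transfer`, the key names, and the "shape" metadata field are all hypothetical, and it assumes the client skips keys whose metadata is None in the same way the server skips None tensors.

```python
from typing import Any, Optional


def keys_to_transfer(entries: dict[str, Optional[Any]], keys_order: list[str]) -> list[str]:
    """Keys, in keys_order, whose entry is present and not None (hypothetical helper)."""
    return [key for key in keys_order if entries.get(key) is not None]


# Hypothetical payloads: strings stand in for tensors, dicts for tensor metadata.
server_tensors = {"hidden_states": "tensor-a", "logits": None, "target": "tensor-b"}
client_metadata = {"hidden_states": {"shape": [2, 4]}, "logits": None, "target": {"shape": [2]}}
keys_order = ["hidden_states", "logits", "target"]

send_plan = keys_to_transfer(server_tensors, keys_order)
recv_plan = keys_to_transfer(client_metadata, keys_order)

# If the two plans differed, one side would block waiting for a transfer
# that the other side never issues.
assert send_plan == recv_plan
```

Under these assumptions, the metadata delivered over HTTP is what tells the client which keys to expect, so both peers must be given the same keys_order.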

nemo_automodel.components.speculative.eagle.remote.transport._ELEMENT_SIZE = {torch.int16: 2, torch.int8: 1, torch.bool: 1}
nemo_automodel.components.speculative.eagle.remote.transport._NCCL_UNSUPPORTED_DTYPES = {torch.int16, torch.int8, torch.bool}
nemo_automodel.components.speculative.eagle.remote.transport.logger = logging.getLogger(__name__)