nemo_automodel.components.speculative.eagle.remote.transport

Dedicated NCCL transport for GPU-to-GPU supervision-tensor transfer.

A 2-process NCCL group connects the target server (rank 0) to the training client (rank 1). HTTP stays the control plane (input_ids up, tensor metadata down); this group is the data plane for the large supervision tensors, working over NVLink intra-node and RDMA/RoCE inter-node.

The group is created from an explicit TCPStore, so it is independent of the training job’s default process group. Group creation is delegated to SGLang’s init_custom_process_group, a proven path that builds a non-default group from a provided store. SGLang is an optional dependency that is not bundled; when it is absent, NCCLTransport.initialize returns False and the caller falls back to the binary wire format.

Environment variables:

NEMO_EAGLE_ENABLE_NCCL — "1" (default) to attempt NCCL, "0" to force the wire-format fallback.
NEMO_EAGLE_NCCL_PORT — TCP rendezvous port (default: HTTP port + 100).
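
As an illustration of how these two variables could be resolved, here is a minimal sketch. The helper name resolve_nccl_settings and its env parameter are hypothetical and not part of this module. Treating every value other than "0" as enabled is also an assumption, since only "1" and "0" are documented.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_nccl_settings(
    http_port: int, env: Optional[Mapping[str, str]] = None
) -> Tuple[bool, int]:
    """Hypothetical helper: return (nccl_enabled, nccl_port) from the environment."""
    env = os.environ if env is None else env
    # "1" is the documented default; "0" forces the wire-format fallback.
    # Assumption: any value other than "0" counts as enabled.
    enabled = env.get("NEMO_EAGLE_ENABLE_NCCL", "1") != "0"
    # Documented default for the rendezvous port: HTTP port + 100.
    port = int(env.get("NEMO_EAGLE_NCCL_PORT", http_port + 100))
    return enabled, port


# With neither variable set and an HTTP port of 8000, the port defaults to 8100.
print(resolve_nccl_settings(8000, env={}))
```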

Module Contents

Classes

Name	Description
`NCCLTransport`	A dedicated 2-process NCCL group between server (rank 0) and client (rank 1).

Data

_ELEMENT_SIZE

_NCCL_UNSUPPORTED_DTYPES

logger

API

class nemo_automodel.components.speculative.eagle.remote.transport.NCCLTransport(
    nccl_port: int,
    host: str,
    is_server: bool
)

A dedicated 2-process NCCL group between server (rank 0) and client (rank 1).

Parameters

nccl_port: TCP port for the rendezvous store.
host: Hostname/IP of the server (rendezvous master).
is_server: True on the server side (rank 0), False on the client side (rank 1).

_group_name

= f'nemo_eagle_target_transfer_{nccl_port}'

_init_lock

= threading.Lock()

_pg

Optional[ProcessGroup] = None

_rank

= 0 if is_server else 1

is_initialized

bool

nemo_automodel.components.speculative.eagle.remote.transport.NCCLTransport.destroy() -> None

Abort and unregister the group.

The group is asymmetric: the client can finish before the long-lived server, so a blocking destroy_process_group (which expects both peers) would hang. Instead, this method aborts the local communicator and removes it from PyTorch’s global registry, so the later default-group teardown does not try to shut it down again.

nemo_automodel.components.speculative.eagle.remote.transport.NCCLTransport.initialize(
    timeout_seconds: int = 120
) -> bool

Establish the NCCL group via TCP rendezvous; blocks until both peers connect.

Returns True on success, False on any failure (the caller then falls back to the binary wire format).
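
The boolean return value lends itself to a simple caller-side fallback. The sketch below is one possible shape for it, not the module's own code. pick_data_plane and SupportsInitialize are hypothetical names, and any object with a matching initialize method (such as an NCCLTransport instance) could be passed in.

```python
from typing import Optional, Protocol


class SupportsInitialize(Protocol):
    """Hypothetical structural type: anything with a compatible initialize()."""

    def initialize(self, timeout_seconds: int = 120) -> bool: ...


def pick_data_plane(
    transport: Optional[SupportsInitialize], timeout_seconds: int = 120
) -> str:
    """Hypothetical helper: return "nccl" if the group came up, else "wire"."""
    if transport is None:
        # NCCL was disabled up front, so no rendezvous is attempted.
        return "wire"
    # initialize() blocks until both peers connect and reports failure as False.
    if transport.initialize(timeout_seconds=timeout_seconds):
        return "nccl"
    return "wire"
```

Because failure is reported as False, the caller needs a single branch to select the data plane and no exception handling for this decision.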

nemo_automodel.components.speculative.eagle.remote.transport.NCCLTransport.recv_tensors(
    metadata: dict[str, typing.Optional[dict]],
    keys_order: list[str]
) -> dict[str, typing.Optional[torch.Tensor]]

Receive tensors (client side) in keys_order, as described by metadata.

nemo_automodel.components.speculative.eagle.remote.transport.NCCLTransport.send_tensors(
    tensor_dict: dict[str, typing.Optional[torch.Tensor]],
    keys_order: list[str]
) -> None

Send tensors (server side) in keys_order; skips None entries.
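
Both sides must walk keys_order identically for the point-to-point transfers to pair up. The sketch below models that ordering contract in plain Python, with a deque standing in for the NCCL channel. It is a simulation under stated assumptions, not the module's implementation: send_in_order, recv_in_order, the key names, and the "shape" metadata field are all hypothetical, and it assumes a None metadata entry on the client corresponds to a None tensor on the server.

```python
from collections import deque
from typing import Any, Dict, List, Optional


def send_in_order(
    payloads: Dict[str, Optional[Any]], keys_order: List[str], channel: deque
) -> None:
    """Server side of the sketch: push non-None payloads in keys_order."""
    for key in keys_order:
        value = payloads.get(key)
        if value is not None:  # None entries are skipped, as documented
            channel.append(value)


def recv_in_order(
    metadata: Dict[str, Optional[dict]], keys_order: List[str], channel: deque
) -> Dict[str, Optional[Any]]:
    """Client side of the sketch: pop one payload per key with metadata."""
    received: Dict[str, Optional[Any]] = {}
    for key in keys_order:
        # Assumption: metadata is None exactly where the server skipped a send.
        has_payload = metadata.get(key) is not None
        received[key] = channel.popleft() if has_payload else None
    return received


keys = ["hidden_states", "logits", "aux"]  # hypothetical key names
payloads = {"hidden_states": [1, 2], "logits": None, "aux": [3]}
metadata = {"hidden_states": {"shape": [2]}, "logits": None, "aux": {"shape": [1]}}

channel: deque = deque()
send_in_order(payloads, keys, channel)
result = recv_in_order(metadata, keys, channel)
```

If the two sides disagreed on keys_order or on which entries are None, a real NCCL send and recv would mismatch and could hang, which is why the metadata travels over the HTTP control plane before the transfer begins.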

nemo_automodel.components.speculative.eagle.remote.transport._ELEMENT_SIZE = {torch.int16: 2, torch.int8: 1, torch.bool: 1}

nemo_automodel.components.speculative.eagle.remote.transport._NCCL_UNSUPPORTED_DTYPES = {torch.int16, torch.int8, torch.bool}

nemo_automodel.components.speculative.eagle.remote.transport.logger = logging.getLogger(__name__)