nemo_automodel.components.speculative.eagle.remote.client

Training-side client for the remote EAGLE-3 target server.

`RemoteEagle3TargetModel` implements the `Eagle3TargetBackend` contract by delegating `generate_batch` to one or more remote target servers. It POSTs `input_ids` over HTTP and receives the supervision tensors either over NCCL (GPU-direct; the body carries only metadata) or as a binary wire blob (fallback).

When multiple server URLs are given, requests are dispatched across them round-robin, so the prefetch pipeline in the training loop can keep several requests in flight (one per server) and overlap target inference with draft training.
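The round-robin dispatch described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: `RoundRobinDispatcher` and `make_fake_server` are hypothetical names, and the "servers" are plain callables standing in for HTTP clients. Only the `itertools.cycle(range(len(clients)))` pattern is taken from the `_next` attribute documented below.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


class RoundRobinDispatcher:
    """Hypothetical stand-in for the client's server-selection logic."""

    def __init__(self, clients):
        self._clients = list(clients)
        # Same pattern as the documented `_next` attribute.
        self._next = itertools.cycle(range(len(self._clients)))
        # Assumption: one worker per server, so one request per server can be in flight.
        self._executor = ThreadPoolExecutor(max_workers=len(self._clients))

    def submit(self, payload):
        # Pick the next server in rotation and run the request on a worker thread.
        client = self._clients[next(self._next)]
        return self._executor.submit(client, payload)

    def close(self):
        self._executor.shutdown(wait=True)


def make_fake_server(name):
    # Hypothetical server: echoes its name and the payload instead of doing inference.
    def handle(payload):
        return f"{name}:{payload}"

    return handle


dispatcher = RoundRobinDispatcher([make_fake_server("a"), make_fake_server("b")])
futures = [dispatcher.submit(i) for i in range(4)]
results = [f.result() for f in futures]
dispatcher.close()
```

With two servers, consecutive requests alternate between them, which is what lets a prefetching training loop keep both busy.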

Module Contents

Classes

Name	Description
`RemoteEagle3TargetModel`	EAGLE-3 target backend that delegates forward passes to remote servers.
`_AsyncHandle`	Future-like wrapper that converts a worker-thread result into a batch.
`_ServerClient`	HTTP + NCCL connection to a single remote target server.

Data

logger

API

class nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel(
    urls: list[str],
    device: torch.device,
    timeout: int = 120,
    max_retries: int = 3
)

Bases: Eagle3TargetBackend

EAGLE-3 target backend that delegates forward passes to remote servers.

_clients

_embeddings

Optional[SimpleNamespace] = None

_executor

Optional[ThreadPoolExecutor] = None

_next

= itertools.cycle(range(len(self._clients)))

num_remote_servers

int

supports_async

bool

nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel._build_payload(
    input_ids,
    attention_mask,
    loss_mask
) -> bytes

staticmethod

nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel._to_batch(
    result: dict,
    attention_mask: torch.Tensor
) -> nemo_automodel.components.speculative.eagle.target.Eagle3TargetBatch

nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel.close() -> None

nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel.from_urls(
    urls: list[str],
    device,
    kwargs = {}
) -> 'RemoteEagle3TargetModel'

classmethod

nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel.generate_batch(
    input_ids,
    attention_mask,
    loss_mask
) -> nemo_automodel.components.speculative.eagle.target.Eagle3TargetBatch

nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel.generate_batch_async(
    input_ids,
    attention_mask,
    loss_mask
) -> nemo_automodel.components.speculative.eagle.remote.client._AsyncHandle

nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel.get_input_embeddings()

Fetch the target input-embedding weight once and cache it.

Returns an object exposing .weight (the only attribute the draft’s copy_embeddings_from_target reads), matching the offline-cache path.
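A fetch-once cache of this shape might look like the sketch below. It is an assumption-laden illustration: `EmbeddingCache` and the `fetch` callable are hypothetical, and a plain list stands in for the weight tensor. Only the `.weight` attribute and the `Optional[SimpleNamespace]` type of `_embeddings` come from this page.

```python
from types import SimpleNamespace


class EmbeddingCache:
    """Hypothetical holder that fetches the embedding weight at most once."""

    def __init__(self, fetch):
        self._fetch = fetch  # assumed callable that retrieves the weight remotely
        self._embeddings = None

    def get_input_embeddings(self):
        # Only the first call triggers the fetch; later calls reuse the cached object.
        if self._embeddings is None:
            self._embeddings = SimpleNamespace(weight=self._fetch())
        return self._embeddings


calls = []


def fake_fetch():
    # Stand-in for the remote round-trip; records each invocation.
    calls.append(1)
    return [0.1, 0.2, 0.3]


cache = EmbeddingCache(fake_fetch)
first = cache.get_input_embeddings()
second = cache.get_input_embeddings()
```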

nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel.model_info() -> dict

nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel.set_vocab_mapping(
    selected_token_ids: torch.Tensor,
    selected_token_mask: torch.Tensor
) -> None

class nemo_automodel.components.speculative.eagle.remote.client._AsyncHandle(
    future,
    convert
)

Future-like wrapper that converts a worker-thread result into a batch.

nemo_automodel.components.speculative.eagle.remote.client._AsyncHandle.cancel() -> bool

nemo_automodel.components.speculative.eagle.remote.client._AsyncHandle.result(
    timeout: typing.Optional[float] = None
) -> nemo_automodel.components.speculative.eagle.target.Eagle3TargetBatch
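As a rough sketch of the future-plus-converter idea, the wrapper below applies `convert` to the worker's raw result when `result()` is called. `AsyncHandleSketch` is a hypothetical name, and the behavior shown (conversion on the calling thread, `cancel()` forwarded to the future) is an assumption rather than a description of `_AsyncHandle` itself.

```python
from concurrent.futures import ThreadPoolExecutor


class AsyncHandleSketch:
    """Hypothetical future-like wrapper: waits for a raw result, then converts it."""

    def __init__(self, future, convert):
        self._future = future
        self._convert = convert

    def result(self, timeout=None):
        # Block until the worker finishes, then convert on the caller's thread.
        return self._convert(self._future.result(timeout=timeout))

    def cancel(self):
        # Forward cancellation to the underlying future.
        return self._future.cancel()


with ThreadPoolExecutor(max_workers=1) as pool:
    # The dict stands in for the raw tensors returned by a server.
    future = pool.submit(lambda: {"logits": [1, 2, 3]})
    handle = AsyncHandleSketch(future, lambda raw: {"num_tensors": len(raw)})
    batch = handle.result(timeout=5)
```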

class nemo_automodel.components.speculative.eagle.remote.client._ServerClient(
    url: str,
    timeout: int,
    max_retries: int,
    nccl_rank_offset: int = 0
)

HTTP + NCCL connection to a single remote target server.

_generate_lock

= threading.Lock()

_nccl

Optional[NCCLTransport] = None

_nccl_enabled

_nccl_lock

= threading.Lock()

_session

= requests.Session()

url

= url.rstrip('/')

nemo_automodel.components.speculative.eagle.remote.client._ServerClient._host() -> str

nemo_automodel.components.speculative.eagle.remote.client._ServerClient._init_nccl() -> bool

nemo_automodel.components.speculative.eagle.remote.client._ServerClient._nccl_port() -> int

nemo_automodel.components.speculative.eagle.remote.client._ServerClient.close() -> None

nemo_automodel.components.speculative.eagle.remote.client._ServerClient.generate(
    payload: bytes
) -> dict[str, typing.Optional[torch.Tensor]]

POST /generate and return the supervision tensors (NCCL or wire).

`/generate` is the per-step hot path, so an unhandled transient timeout or connection reset here would abort a long remote-training run. The two transports handle this differently:

The wire path is an idempotent HTTP round-trip, so it reuses the exponential-backoff retry of `request`.

The NCCL path is deliberately a single attempt. The POST triggers a server-side NCCL send that is paired with the `recv_tensors` call on the client, so a blind retry would issue a second send and desync the 2-process data-plane group (the client's one recv against the server's two sends would hang). Recovering the NCCL path needs a transport resync (tear down and re-init, or fall back to wire) and is tracked separately.

Serialized on _generate_lock so this process never has two /generate requests in flight against the same server: the NCCL recv posted here must pair with this request’s send, and the server’s hook-based aux capture is not reentrant.
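The serialization and the retry asymmetry can be illustrated with a small stdlib-only sketch. Everything here is hypothetical: `SerializedGenerate`, `post_once`, and `post_with_retry` are illustrative names, and the fake POST just sleeps while counting concurrent callers. The point is only that a single lock around the whole request keeps at most one `/generate` in flight per server.

```python
import threading
import time


class SerializedGenerate:
    """Hypothetical sketch: one lock guards the whole /generate round-trip."""

    def __init__(self, nccl_enabled, post_once, post_with_retry):
        self._generate_lock = threading.Lock()
        self._nccl_enabled = nccl_enabled
        self._post_once = post_once              # assumed single-attempt sender
        self._post_with_retry = post_with_retry  # assumed retrying sender

    def generate(self, payload):
        with self._generate_lock:
            if self._nccl_enabled:
                # Single attempt: a retry would trigger a second server-side send.
                return self._post_once(payload)
            # Wire path is idempotent, so retrying is safe.
            return self._post_with_retry(payload)


stats = {"in_flight": 0, "max_in_flight": 0}
stats_lock = threading.Lock()


def fake_post(payload):
    # Track how many callers are inside the "request" at the same time.
    with stats_lock:
        stats["in_flight"] += 1
        stats["max_in_flight"] = max(stats["max_in_flight"], stats["in_flight"])
    time.sleep(0.01)
    with stats_lock:
        stats["in_flight"] -= 1
    return payload


client = SerializedGenerate(True, fake_post, fake_post)
threads = [threading.Thread(target=client.generate, args=(b"x",)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Even with four threads calling `generate` concurrently, the fake POST never sees more than one caller at a time.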

nemo_automodel.components.speculative.eagle.remote.client._ServerClient.request(
    endpoint: str,
    payload: bytes,
    content_type: str = 'application/octet-stream'
) -> bytes

POST payload to endpoint with exponential-backoff retry.
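A generic exponential-backoff loop, for readers unfamiliar with the pattern, is sketched below. It is not the method's actual code: `request_with_backoff`, `base_delay`, the exception types caught, and the interpretation of `max_retries` as "retries after the first attempt" are all assumptions made for illustration.

```python
import time


def request_with_backoff(send, payload, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Hypothetical retry loop: the delay doubles after each failed attempt."""
    for attempt in range(max_retries + 1):
        try:
            return send(payload)
        except (ConnectionError, TimeoutError):
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


attempts = []
delays = []


def flaky_send(payload):
    # Fails twice with a transient error, then succeeds.
    attempts.append(payload)
    if len(attempts) < 3:
        raise ConnectionError("transient")
    return b"ok"


# Pass delays.append as the sleep function so the example records waits instead of sleeping.
response = request_with_backoff(flaky_send, b"data", sleep=delays.append)
```

Injecting the `sleep` function keeps the example fast and makes the backoff schedule observable.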

nemo_automodel.components.speculative.eagle.remote.client.logger = logging.getLogger(__name__)