nemo_automodel.components.speculative.eagle.remote.client

Training-side client for the remote EAGLE-3 target server.

RemoteEagle3TargetModel implements the Eagle3TargetBackend contract by delegating generate_batch to one or more remote target servers. It POSTs input_ids over HTTP and receives the supervision tensors either over NCCL (GPU-direct; the HTTP body carries only metadata) or, as a fallback, as a binary wire blob.

When multiple server URLs are configured, requests are dispatched across them round-robin so the prefetch pipeline in the training loop can keep several requests in flight (one per server) and overlap target inference with draft training.
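
The round-robin dispatch described above can be sketched as follows. This is a minimal illustration, not the module's implementation: RoundRobinDispatcher, submit, and fake_generate are hypothetical names, and only the itertools.cycle index and the thread-pool executor mirror attributes listed on this page.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


class RoundRobinDispatcher:
    """Hypothetical stand-in for the multi-server dispatch logic."""

    def __init__(self, urls):
        # Normalise URLs the same way the url attribute on this page does.
        self.urls = [u.rstrip("/") for u in urls]
        # Endless 0, 1, ..., n-1, 0, 1, ... index stream.
        self._next = itertools.cycle(range(len(self.urls)))
        # One worker per server, so each server can have one request in flight.
        self._executor = ThreadPoolExecutor(max_workers=len(self.urls))

    def submit(self, fn, payload):
        # Pick the next server in rotation and run the call on a worker thread.
        idx = next(self._next)
        return self._executor.submit(fn, self.urls[idx], payload)

    def close(self):
        self._executor.shutdown(wait=True)


def fake_generate(url, payload):
    # Assumed placeholder for the real HTTP round-trip.
    return f"{url} <- {payload}"


dispatcher = RoundRobinDispatcher(["http://a:8000/", "http://b:8000"])
futures = [dispatcher.submit(fake_generate, i) for i in range(4)]
results = [f.result() for f in futures]
dispatcher.close()
```

With two servers, consecutive requests alternate between them, which is what lets a prefetching training loop overlap one server's inference with work on the previous batch.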

Module Contents

Classes

| Name | Description |
| --- | --- |
| RemoteEagle3TargetModel | EAGLE-3 target backend that delegates forward passes to remote servers. |
| _AsyncHandle | Future-like wrapper that converts a worker-thread result into a batch. |
| _ServerClient | HTTP + NCCL connection to a single remote target server. |

Data

logger

API

class nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel(
urls: list[str],
device: torch.device,
timeout: int = 120,
max_retries: int = 3
)

Bases: Eagle3TargetBackend

EAGLE-3 target backend that delegates forward passes to remote servers.

_clients
_embeddings: Optional[SimpleNamespace] = None
_executor: Optional[ThreadPoolExecutor] = None
_next = itertools.cycle(range(len(self._clients)))
num_remote_servers: int
supports_async: bool

nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel._build_payload(
input_ids,
attention_mask,
loss_mask
) -> bytes
staticmethod
nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel._to_batch(
result: dict,
attention_mask: torch.Tensor
) -> nemo_automodel.components.speculative.eagle.target.Eagle3TargetBatch
nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel.close() -> None
nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel.from_urls(
urls: list[str],
device,
kwargs = {}
) -> 'RemoteEagle3TargetModel'
classmethod
nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel.generate_batch(
input_ids,
attention_mask,
loss_mask
) -> nemo_automodel.components.speculative.eagle.target.Eagle3TargetBatch
nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel.generate_batch_async(
input_ids,
attention_mask,
loss_mask
) -> nemo_automodel.components.speculative.eagle.remote.client._AsyncHandle
nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel.get_input_embeddings()

Fetch the target input-embedding weight once and cache it.

Returns an object exposing .weight (the only attribute the draft’s copy_embeddings_from_target reads), matching the offline-cache path.
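
A fetch-once cache of this shape might look like the sketch below. It is only an assumption about the pattern: EmbeddingCache and fake_fetch are hypothetical, and a plain nested list stands in for the weight tensor so the example runs without torch.

```python
from types import SimpleNamespace
from typing import Callable, Optional


class EmbeddingCache:
    """Hypothetical illustration of the fetch-once-and-cache behaviour."""

    def __init__(self, fetch: Callable[[], object]):
        self._fetch = fetch  # assumed callable that downloads the weight
        self._embeddings: Optional[SimpleNamespace] = None

    def get_input_embeddings(self) -> SimpleNamespace:
        # Only the first call hits the remote server; later calls reuse the cache.
        if self._embeddings is None:
            self._embeddings = SimpleNamespace(weight=self._fetch())
        return self._embeddings


calls = []


def fake_fetch():
    # Stand-in for the remote fetch; records how often it is invoked.
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


cache = EmbeddingCache(fake_fetch)
first = cache.get_input_embeddings()
second = cache.get_input_embeddings()
```

Wrapping the weight in a SimpleNamespace keeps the return value duck-type compatible with anything that reads only .weight.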

nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel.model_info() -> dict
nemo_automodel.components.speculative.eagle.remote.client.RemoteEagle3TargetModel.set_vocab_mapping(
selected_token_ids: torch.Tensor,
selected_token_mask: torch.Tensor
) -> None
class nemo_automodel.components.speculative.eagle.remote.client._AsyncHandle(
future,
convert
)

Future-like wrapper that converts a worker-thread result into a batch.

nemo_automodel.components.speculative.eagle.remote.client._AsyncHandle.cancel() -> bool
nemo_automodel.components.speculative.eagle.remote.client._AsyncHandle.result(
timeout: typing.Optional[float] = None
) -> nemo_automodel.components.speculative.eagle.target.Eagle3TargetBatch
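
The wrapper pattern can be sketched with the standard library alone. The class below is a hypothetical stand-in that assumes the conversion runs when result() is called; the real class may differ in detail.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Optional


class AsyncHandle:
    """Hypothetical future-like wrapper: defers conversion until result()."""

    def __init__(self, future: Future, convert: Callable[[Any], Any]):
        self._future = future
        self._convert = convert

    def result(self, timeout: Optional[float] = None) -> Any:
        # Block for the worker-thread result, then convert it on the caller's thread.
        return self._convert(self._future.result(timeout=timeout))

    def cancel(self) -> bool:
        # Delegates to the underlying future; succeeds only if not yet running.
        return self._future.cancel()


with ThreadPoolExecutor(max_workers=1) as pool:
    raw_future = pool.submit(lambda: {"logits": [1, 2, 3]})
    handle = AsyncHandle(raw_future, convert=lambda raw: sum(raw["logits"]))
    batch = handle.result(timeout=5)
```

Here the convert callable plays the role that _to_batch appears to play for the real client: the worker thread returns a raw dict and the caller receives the converted object.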
class nemo_automodel.components.speculative.eagle.remote.client._ServerClient(
url: str,
timeout: int,
max_retries: int,
nccl_rank_offset: int = 0
)

HTTP + NCCL connection to a single remote target server.

_generate_lock = threading.Lock()
_nccl: Optional[NCCLTransport] = None
_nccl_enabled
_nccl_lock = threading.Lock()
_session = requests.Session()
url = url.rstrip('/')

nemo_automodel.components.speculative.eagle.remote.client._ServerClient._host() -> str
nemo_automodel.components.speculative.eagle.remote.client._ServerClient._init_nccl() -> bool
nemo_automodel.components.speculative.eagle.remote.client._ServerClient._nccl_port() -> int
nemo_automodel.components.speculative.eagle.remote.client._ServerClient.close() -> None
nemo_automodel.components.speculative.eagle.remote.client._ServerClient.generate(
payload: bytes
) -> dict[str, typing.Optional[torch.Tensor]]

POST /generate and return the supervision tensors (NCCL or wire).

/generate is the per-step hot path, so without retries a transient timeout or connection reset here would abort a long remote-training run. The wire path is an idempotent HTTP round-trip, so it reuses the exponential-backoff retry in request. The NCCL path is deliberately a single attempt: the POST triggers a server-side NCCL send paired with the recv_tensors call that follows it, so a blind retry would issue a second send and desync the 2-process data-plane group (the client's one recv against the server's two sends would hang). Recovering the NCCL path needs a transport resync (tear down and re-init, or fall back to wire) and is tracked separately.

Serialized on _generate_lock so this process never has two /generate requests in flight against the same server: the NCCL recv posted here must pair with this request’s send, and the server’s hook-based aux capture is not reentrant.
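
The control flow described above (one lock, a single attempt on the NCCL path, retries on the wire path) can be sketched as follows. Everything here is an assumption for illustration: ServerClientSketch, the injected post and recv_tensors callables, and make_flaky_post are hypothetical, and backoff delays are left out to keep the example short.

```python
import threading


class ServerClientSketch:
    """Hypothetical model of the generate() control flow, without real I/O."""

    def __init__(self, post, recv_tensors=None, max_retries=3):
        self._post = post            # assumed callable: payload -> response
        self._recv = recv_tensors    # assumed callable; None means wire path
        self._max_retries = max_retries
        self._generate_lock = threading.Lock()

    def generate(self, payload):
        # At most one /generate in flight per server from this process.
        with self._generate_lock:
            if self._recv is not None:
                # NCCL path: exactly one POST, paired with exactly one recv.
                meta = self._post(payload)
                return self._recv(meta)
            # Wire path: idempotent, so transient errors are retried.
            last_err = None
            for _ in range(self._max_retries):
                try:
                    return self._post(payload)
                except ConnectionError as err:
                    last_err = err
            raise last_err


def make_flaky_post(failures):
    attempts = []

    def post(payload):
        attempts.append(payload)
        if len(attempts) <= failures:
            raise ConnectionError("transient reset")
        return {"ok": True, "attempt": len(attempts)}

    return post, attempts


wire_post, wire_attempts = make_flaky_post(failures=1)
wire_result = ServerClientSketch(wire_post).generate(b"ids")

nccl_post, nccl_attempts = make_flaky_post(failures=1)
try:
    ServerClientSketch(nccl_post, recv_tensors=lambda meta: meta).generate(b"ids")
    nccl_error = None
except ConnectionError as exc:
    nccl_error = exc
```

In this sketch the wire client recovers on its second attempt, while the NCCL client surfaces the first error without re-posting, which avoids triggering a second server-side send.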

nemo_automodel.components.speculative.eagle.remote.client._ServerClient.request(
endpoint: str,
payload: bytes,
content_type: str = 'application/octet-stream'
) -> bytes

POST payload to endpoint with exponential-backoff retry.
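
A generic exponential-backoff loop is sketched below. The base delay, the doubling factor, the exception types, and the reading of max_retries as retries after the first attempt are all assumptions; this page does not specify them. The send callable stands in for the HTTP POST so the example runs without a server.

```python
import time
from typing import Callable


def request_with_backoff(
    send: Callable[[bytes], bytes],
    payload: bytes,
    max_retries: int = 3,
    base_delay: float = 0.5,   # assumed value, not taken from the module
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call send(payload), retrying transient failures with doubling delays."""
    for attempt in range(max_retries + 1):
        try:
            return send(payload)
        except (ConnectionError, TimeoutError):
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")


delays = []
calls = {"n": 0}


def flaky_send(payload: bytes) -> bytes:
    # Fails twice with a simulated timeout, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return b"ok:" + payload


# Record the delays instead of sleeping so the example finishes instantly.
response = request_with_backoff(flaky_send, b"x", sleep=delays.append)
```

Injecting sleep makes the retry schedule easy to inspect in tests; a production client would typically also add jitter, which is omitted here.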

nemo_automodel.components.speculative.eagle.remote.client.logger = logging.getLogger(__name__)