nemo_automodel.components.speculative.eagle.remote.server

Remote EAGLE-3 target server.

Runs the frozen target model and, for each training request, produces the draft-vocab supervision (aux hidden states, target_probs, position_mask) and ships it back to the training client. The supervision computation reuses the co-located building blocks verbatim: HFEagle3TargetModel.generate_batch for the forward pass and aux hidden-state capture, and _compute_target_distribution for the draft-vocab projection. A remote run is therefore numerically identical to a co-located one.

The HTTP request handling is split from the http.server plumbing: `TargetModelServer` holds the pure logic, so it can be unit-tested on CPU with the NCCL data plane disabled (the wire-format path).
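The split described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: `EchoLogic`, `make_request_handler`, the `/model_info` route, and the `hidden_size` payload are all hypothetical stand-ins. Only the idea of bytes-in/bytes-out handler methods wrapped by a thin `http.server` adapter comes from the text.

```python
import json
from http.server import BaseHTTPRequestHandler


class EchoLogic:
    """Hypothetical stand-in for the pure, transport-agnostic logic class."""

    def handle_model_info(self, _raw: bytes) -> bytes:
        # Bytes in, bytes out: nothing here knows about sockets or HTTP.
        return json.dumps({"hidden_size": 8}).encode("utf-8")


def make_request_handler(logic):
    """Bind a logic object to a BaseHTTPRequestHandler subclass (assumed shape)."""
    routes = {"/model_info": logic.handle_model_info}  # hypothetical route table

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            handler = routes.get(self.path)
            if handler is None:
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            body = handler(self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


# The logic can be exercised directly in a unit test, with no socket and no GPU.
info = json.loads(EchoLogic().handle_model_info(b""))
```

Because the handler methods take and return raw bytes, a test can call them directly and only the thin adapter class depends on `http.server`.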

Module Contents

Classes

Name	Description
`TargetModelServer`	Request-handling logic for the remote target server (HTTP-transport agnostic).

Functions

Name	Description
`_make_request_handler`	-
`compute_supervision`	Produce the precomputed draft-vocab supervision for one batch.
`serve`	Run the blocking HTTP server until the client disconnects or Ctrl-C.

Data

logger

API

class nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer(
    target_wrapper,
    nccl_port: int,
    host: str = '0.0.0.0'
)

Request-handling logic for the remote target server (HTTP-transport agnostic).

Parameters

target_wrapper: A loaded HFEagle3TargetModel (or any object exposing the same generate_batch / get_input_embeddings surface).

nccl_port: TCP rendezvous port for the NCCL data plane.

host: Bind/advertise address (rendezvous master for NCCL).

_device

= self._infer_device()

_generate_lock

= threading.Lock()

_nccl

Optional[NCCLTransport] = None

_nccl_enabled

_pending_nccl_send

Optional[tuple[dict, list[str]]] = None

_selected_token_ids

Optional[Tensor] = None

_selected_token_mask

Optional[Tensor] = None

_shutdown_event

= threading.Event()

generate_lock

Lock

shutdown_event

Event

nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer._infer_device() -> torch.device

nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.close() -> None

nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.flush_nccl_send() -> None

Send the pending supervision tensors over NCCL (after the HTTP flush).

nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_disconnect(
    _raw: bytes
) -> bytes

nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_generate(
    raw: bytes,
    client_wants_nccl: bool
) -> tuple[bytes, bool]

Run the target and serialize the supervision.

Returns (body, used_nccl). When NCCL is used, the body contains only the JSON metadata and the tensors are queued for flush_nccl_send, which sends them after the HTTP response is flushed to avoid a recv deadlock.
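The queue-then-flush ordering can be illustrated with a small stand-in. This is a sketch under stated assumptions: `DeferredSender`, its `sent` list, and the JSON layout are hypothetical, and the "send" is just a list append rather than a real NCCL call. What it shows is the ordering the text describes: the metadata body is returned first, and the tensors go out only when the flush step runs.

```python
import json
from typing import Optional


class DeferredSender:
    """Hypothetical stand-in showing the queue-then-flush ordering."""

    def __init__(self, nccl_enabled: bool):
        self._nccl_enabled = nccl_enabled
        self._pending: Optional[tuple[dict, list[str]]] = None
        self.sent: list[str] = []  # records what the stub "sent" (no real NCCL)

    def handle_generate(self, tensors: dict, client_wants_nccl: bool) -> tuple[bytes, bool]:
        use_nccl = self._nccl_enabled and client_wants_nccl
        if use_nccl:
            # Metadata only in the HTTP body; the payload is queued for later.
            keys = sorted(tensors)
            self._pending = (tensors, keys)
            return json.dumps({"keys": keys}).encode("utf-8"), True
        # Wire-format path: everything travels in the HTTP body.
        body = json.dumps({k: list(v) for k, v in tensors.items()})
        return body.encode("utf-8"), False

    def flush_nccl_send(self) -> None:
        # Called after the HTTP response is flushed, so the client has the
        # metadata it needs before it starts waiting on the tensors.
        if self._pending is None:
            return
        _tensors, keys = self._pending
        self._pending = None
        for key in keys:
            self.sent.append(key)


sender = DeferredSender(nccl_enabled=True)
body, used_nccl = sender.handle_generate({"target_probs": [0.5], "position_mask": [1]}, True)
sent_before_flush = list(sender.sent)
sender.flush_nccl_send()
```

If the tensors were sent before the response was written, the client would still be blocked on the HTTP reply and not yet posting a matching receive, which is the deadlock the deferred flush avoids.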

nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_init_nccl(
    raw: bytes
) -> bytes

nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_input_embeddings(
    _raw: bytes
) -> bytes

Return the target input-embedding weight (used once to seed the draft).

nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_model_info(
    _raw: bytes
) -> bytes

nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_set_vocab_mapping(
    raw: bytes
) -> bytes

nemo_automodel.components.speculative.eagle.remote.server._make_request_handler(
    server_logic: nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer
)

nemo_automodel.components.speculative.eagle.remote.server.compute_supervision(
    target_wrapper,
    selected_token_ids: torch.Tensor,
    selected_token_mask: torch.Tensor,
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor,
    loss_mask: torch.Tensor
) -> dict[str, torch.Tensor]

Produce the precomputed draft-vocab supervision for one batch.

Mirrors the co-located path exactly: generate_batch runs the target and returns shifted logits / input_ids / loss_mask plus the aux hidden states; _compute_target_distribution then projects the shifted logits onto the draft vocab. Returns tensors keyed by protocol.SUPERVISION_KEYS.
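To make "projects the shifted logits onto the draft vocab" concrete, here is a pure-Python sketch of one plausible formulation. It is an assumption, not the body of _compute_target_distribution: the function name `project_to_draft_vocab` is hypothetical, and the sketch assumes that selected_token_ids lists the full-vocab ids kept in the draft vocab, that target_probs is a softmax over just those logits, and that position_mask marks positions whose top target token falls inside the draft vocab. The real implementation operates on tensors and may differ in detail.

```python
import math


def project_to_draft_vocab(logits, selected_token_ids, selected_token_mask):
    """Hypothetical illustration of a draft-vocab projection.

    logits: list of per-position rows over the full target vocab.
    selected_token_ids: full-vocab ids that make up the draft vocab.
    selected_token_mask: per full-vocab id, True if it is in the draft vocab.
    """
    target_probs, position_mask = [], []
    for row in logits:
        # Keep only draft-vocab logits, then apply a numerically stable softmax.
        kept = [row[i] for i in selected_token_ids]
        peak = max(kept)
        exps = [math.exp(x - peak) for x in kept]
        total = sum(exps)
        target_probs.append([e / total for e in exps])
        # Supervise a position only if the target's top token is representable.
        top = max(range(len(row)), key=row.__getitem__)
        position_mask.append(1 if selected_token_mask[top] else 0)
    return target_probs, position_mask


probs, mask = project_to_draft_vocab(
    logits=[[0.0, 0.0, 5.0, 0.0], [9.0, 0.0, 0.0, 0.0]],
    selected_token_ids=[1, 2],
    selected_token_mask=[False, True, True, False],
)
```

In this toy input the second position's top token (id 0) is outside the draft vocab, so that position is masked out of the supervision.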

nemo_automodel.components.speculative.eagle.remote.server.serve(
    server_logic: nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer,
    host: str,
    port: int
) -> None

Run the blocking HTTP server until the client disconnects or Ctrl-C.
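A blocking loop with these two exit conditions could look roughly like the sketch below. It is an illustration only: `serve_until` and the polling interval are hypothetical, and it assumes a threading.Event (like the shutdown_event attribute listed above) is set when the client disconnects. The actual serve function may be structured differently.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def serve_until(shutdown_event, host, port, handler_cls):
    """Handle requests until shutdown_event is set or Ctrl-C is pressed."""
    httpd = HTTPServer((host, port), handler_cls)
    httpd.timeout = 0.1  # let handle_request() return so the event is re-checked
    try:
        while not shutdown_event.is_set():
            httpd.handle_request()
    except KeyboardInterrupt:
        pass  # Ctrl-C ends the loop quietly
    finally:
        httpd.server_close()


# With the event already set, the loop exits without serving any request.
already_done = threading.Event()
already_done.set()
result = serve_until(already_done, "127.0.0.1", 0, BaseHTTPRequestHandler)
```

Polling with a short timeout keeps the loop single-threaded while still letting a disconnect handler stop the server by setting the event.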

nemo_automodel.components.speculative.eagle.remote.server.logger = logging.getLogger(__name__)