nemo_automodel.components.speculative.eagle.remote.server


Remote EAGLE-3 target server.

Runs the frozen target model and, for each training request, produces the draft-vocab supervision (aux hidden states, target_probs, position_mask) and ships it back to the training client. The supervision computation reuses the co-located building blocks verbatim — HFEagle3TargetModel.generate_batch for the forward + aux capture and _compute_target_distribution for the draft-vocab projection — so a remote run is numerically identical to a co-located one.

The HTTP request handling is split from the http.server plumbing (TargetModelServer holds the pure logic) so it can be unit-tested on CPU with the NCCL data plane disabled (wire-format path).
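The split between pure logic and transport can be sketched with the standard library alone. The names `EchoLogic` and `make_handler` and the `/model_info` route are hypothetical stand-ins, not this module's API, and the payload is illustrative. The sketch only shows that bytes-in/bytes-out handlers can be exercised without a socket or GPU.

```python
import json
from http.server import BaseHTTPRequestHandler


class EchoLogic:
    """Hypothetical pure-logic class: each handler maps request bytes to response bytes."""

    def handle_model_info(self, _raw: bytes) -> bytes:
        # Illustrative payload only; the real server's schema is not shown on this page.
        return json.dumps({"status": "ok"}).encode("utf-8")


def make_handler(logic: EchoLogic):
    """Build a BaseHTTPRequestHandler subclass that only does transport work."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            raw = self.rfile.read(length)
            if self.path == "/model_info":  # hypothetical route name
                body = logic.handle_model_info(raw)
                self.send_response(200)
            else:
                body = b"not found"
                self.send_response(404)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


# The logic is testable directly, with no server started.
reply = json.loads(EchoLogic().handle_model_info(b""))
```

Because the handler class is produced by a factory that closes over the logic object, a test can swap in a stub logic object without touching the transport code.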

Module Contents

Classes

Name                 Description
TargetModelServer    Request-handling logic for the remote target server (HTTP-transport agnostic).

Functions

Name                     Description
_make_request_handler    -
compute_supervision      Produce the precomputed draft-vocab supervision for one batch.
serve                    Run the blocking HTTP server until the client disconnects or Ctrl-C.

Data

logger

API

class nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer(
target_wrapper,
nccl_port: int,
host: str = '0.0.0.0'
)

Request-handling logic for the remote target server (HTTP-transport agnostic).

Parameters

target_wrapper: A loaded HFEagle3TargetModel (or any object exposing the same generate_batch / get_input_embeddings surface).
nccl_port: TCP rendezvous port for the NCCL data plane.
host: Bind/advertise address (rendezvous master for NCCL).

_device = self._infer_device()
_generate_lock = threading.Lock()
_nccl: Optional[NCCLTransport] = None
_nccl_enabled
_pending_nccl_send: Optional[tuple[dict, list[str]]] = None
_selected_token_ids: Optional[Tensor] = None
_selected_token_mask: Optional[Tensor] = None
_shutdown_event = threading.Event()
generate_lock: Lock
shutdown_event: Event

nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer._infer_device() -> torch.device
nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.close() -> None
nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.flush_nccl_send() -> None

Send the pending supervision tensors over NCCL (after the HTTP flush).

nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_disconnect(
_raw: bytes
) -> bytes
nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_generate(
raw: bytes,
client_wants_nccl: bool
) -> tuple[bytes, bool]

Run the target and serialize the supervision.

Returns (body, used_nccl). When NCCL is used, the body contains only the JSON metadata and the tensors are queued for flush_nccl_send, which sends them after the HTTP response has been flushed to avoid a recv deadlock.
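The queue-then-flush ordering can be illustrated with a toy stand-in. `DeferredSender`, its `flush` method, and the `use_side_channel` flag are hypothetical names, and the `send` callback is a placeholder for a blocking bulk transfer. The sketch assumes the deadlock arises because the receiver only posts its matching receive after it has read the metadata response; the source does not spell this out.

```python
import json
from typing import Callable, Optional


class DeferredSender:
    """Hypothetical stand-in for the queue-then-flush pattern."""

    def __init__(self, send: Callable[[dict], None]):
        self._send = send
        self._pending: Optional[dict] = None

    def handle_generate(self, payload: dict, use_side_channel: bool) -> tuple[bytes, bool]:
        if use_side_channel:
            # Queue the bulk payload and reply with metadata only.
            self._pending = payload
            return json.dumps({"keys": sorted(payload)}).encode(), True
        # Fallback: everything travels in the HTTP body.
        return json.dumps(payload).encode(), False

    def flush(self) -> None:
        # Take the pending payload exactly once, then send it.
        pending, self._pending = self._pending, None
        if pending is not None:
            self._send(pending)


events = []
sender = DeferredSender(send=lambda p: events.append(("bulk_send", sorted(p))))
body, used = sender.handle_generate({"target_probs": [0.5, 0.5]}, use_side_channel=True)
events.append(("http_response", body))  # the transport writes and flushes the response first
sender.flush()                          # only now does the bulk send start
```

The recorded event order is the point of the example: the metadata response goes out before the bulk send begins.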

nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_init_nccl(
raw: bytes
) -> bytes
nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_input_embeddings(
_raw: bytes
) -> bytes

Return the target input-embedding weight (used once to seed the draft).

nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_model_info(
_raw: bytes
) -> bytes
nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_set_vocab_mapping(
raw: bytes
) -> bytes
nemo_automodel.components.speculative.eagle.remote.server._make_request_handler(
server_logic: nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer
)
nemo_automodel.components.speculative.eagle.remote.server.compute_supervision(
target_wrapper,
selected_token_ids: torch.Tensor,
selected_token_mask: torch.Tensor,
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
loss_mask: torch.Tensor
) -> dict[str, torch.Tensor]

Produce the precomputed draft-vocab supervision for one batch.

Mirrors the co-located path exactly: generate_batch runs the target and returns shifted logits / input_ids / loss_mask plus the aux hidden states; _compute_target_distribution then projects the shifted logits onto the draft vocab. Returns tensors keyed by protocol.SUPERVISION_KEYS.
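As a rough illustration of what a draft-vocab projection might look like, here is a dependency-free toy on plain Python lists. It assumes the projection is a softmax restricted to the selected draft-vocab logits and that a position is masked in only when the target's top token falls inside the draft vocab; this follows a common EAGLE-3 recipe but is not confirmed by this page. `project_to_draft_vocab` is a hypothetical name, and the real code operates on torch tensors.

```python
import math


def project_to_draft_vocab(logits, selected_token_ids, loss_mask):
    """Toy projection for one sequence (assumed recipe, not the library's code).

    logits: per-position lists over the full target vocab.
    selected_token_ids: target-vocab ids kept in the draft vocab.
    loss_mask: 0/1 flags, one per position.
    """
    selected = set(selected_token_ids)
    target_probs, position_mask = [], []
    for row, keep in zip(logits, loss_mask):
        kept = [row[i] for i in selected_token_ids]
        peak = max(kept)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in kept]
        total = sum(exps)
        target_probs.append([e / total for e in exps])
        # Index of the target's most likely token over the full vocab.
        top_token = max(range(len(row)), key=row.__getitem__)
        position_mask.append(int(bool(keep) and top_token in selected))
    return target_probs, position_mask


probs, mask = project_to_draft_vocab(
    logits=[[2.0, 0.0, 2.0, -1.0], [0.0, 5.0, 0.0, 0.0]],
    selected_token_ids=[0, 2],
    loss_mask=[1, 1],
)
```

In this toy input the second position is masked out because its top token (id 1) is not among the selected ids.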

nemo_automodel.components.speculative.eagle.remote.server.serve(
server_logic: nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer,
host: str,
port: int
) -> None

Run the blocking HTTP server until the client disconnects or Ctrl-C.
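A blocking loop that ends on a client disconnect or Ctrl-C can be sketched with `http.server` and a `threading.Event`. The helper `serve_until`, the handler factory, and the `/disconnect` route are hypothetical; the actual `serve` implementation may differ. The sketch relies on the documented constraint that `shutdown()` must be called from a thread other than the one running `serve_forever()`.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def serve_until(shutdown_event: threading.Event, httpd: HTTPServer) -> None:
    """Block in serve_forever until the event is set or Ctrl-C arrives."""

    def watcher():
        shutdown_event.wait()
        httpd.shutdown()  # must run on a thread other than the serving one

    threading.Thread(target=watcher, daemon=True).start()
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        pass
    finally:
        shutdown_event.set()  # release the watcher if we left via Ctrl-C
        httpd.server_close()


def make_handler(shutdown_event: threading.Event):
    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            self.send_response(200)
            self.send_header("Content-Length", "0")
            self.end_headers()
            if self.path == "/disconnect":  # hypothetical route
                shutdown_event.set()

        def log_message(self, *args):  # keep the example quiet
            pass

    return Handler


event = threading.Event()
httpd = HTTPServer(("127.0.0.1", 0), make_handler(event))  # port 0: the OS picks a free port
worker = threading.Thread(target=serve_until, args=(event, httpd))
worker.start()

conn = http.client.HTTPConnection("127.0.0.1", httpd.server_address[1], timeout=5)
conn.request("POST", "/disconnect", body=b"")
status = conn.getresponse().status
conn.close()
worker.join(timeout=5)
```

Setting the event from inside the handler, rather than calling `shutdown()` there, lets the in-flight response complete before the loop exits.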

nemo_automodel.components.speculative.eagle.remote.server.logger = logging.getLogger(__name__)