nemo_automodel.components.speculative.eagle.target_runner

View as Markdown

Engine-agnostic EAGLE-3 target backend built on a pluggable runner.

The supervision contract (how a target’s raw logits / aux hidden states become an :class:Eagle3TargetBatch) is identical no matter which inference engine runs the frozen target. This module owns that contract:

  • :class:TargetRunner is the narrow surface a concrete engine must implement (forward_eagle3 / set_aux_layers / input_embedding_weight and a model exposing .config + .parameters()). It is the single seam between the engine-coupled code and the rest of the stack: SGLang implements it today (sglang_runner.SGLangTargetRunner), and a vLLM runner can implement the same protocol and drop in without touching this file, the trainer, or the remote server.
  • :class:RunnerEagle3TargetModel is the :class:Eagle3TargetBackend that assembles the supervision batch from any :class:TargetRunner. Its shift / aux-concatenation semantics are identical to :class:HFEagle3TargetModel, so a run through any runner is numerically equivalent to the co-located one.

Keeping the contract here (and importing engines lazily in their own adapter modules) means importing this module never pulls in SGLang or vLLM, so the contract stays unit-testable on CPU against a fake runner.

Module Contents

Classes

NameDescription
RunnerEagle3TargetModelEAGLE-3 target backend that runs the frozen target through a runner.
TargetRunnerMinimal engine surface :class:RunnerEagle3TargetModel depends on.

API

class nemo_automodel.components.speculative.eagle.target_runner.RunnerEagle3TargetModel(
runner: nemo_automodel.components.speculative.eagle.target_runner.TargetRunner,
aux_layer_ids: typing.Optional[typing.Sequence[int]] = None
)

Bases: Eagle3TargetBackend

EAGLE-3 target backend that runs the frozen target through a runner.

Engine-agnostic: it owns only the supervision contract and delegates the actual forward to a :class:TargetRunner. Concrete engines subclass this to add their own from_pretrained (which builds the runner); everything else is inherited.

Parameters

runner: A loaded runner implementing :class:TargetRunner. aux_layer_ids: The three decoder layers to capture (low / mid / high). When None the shared EAGLE-3 default recipe is used, matching every other backend.

aux_layer_ids
model
= runner.model
nemo_automodel.components.speculative.eagle.target_runner.RunnerEagle3TargetModel.close() -> None

Release the runner (frees GPU memory / engine handles).

nemo_automodel.components.speculative.eagle.target_runner.RunnerEagle3TargetModel.generate_batch(
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
loss_mask: torch.Tensor
) -> nemo_automodel.components.speculative.eagle.target.Eagle3TargetBatch

Run the runner’s target and capture aux hidden states plus logits.

Produces an :class:Eagle3TargetBatch byte-for-byte compatible with :meth:HFEagle3TargetModel.generate_batch: the logits, input_ids and loss_mask are shifted left by one (next-token alignment) while the aux hidden states are kept position-aligned. The draft-vocab projection happens trainer-side / server-side from logits, exactly as in the co-located path.

Note: sequences are assumed right-padded (loss_mask zeros the pad). With causal attention, trailing pad tokens do not affect earlier positions, so the captured supervision matches a masked HuggingFace forward.

nemo_automodel.components.speculative.eagle.target_runner.RunnerEagle3TargetModel.get_input_embeddings() -> types.SimpleNamespace

Return an object exposing .weight (the target input embeddings).

Matches the offline-cache / remote path: the draft’s copy_embeddings_from_target only reads .weight.

class nemo_automodel.components.speculative.eagle.target_runner.TargetRunner()
Protocol

Minimal engine surface :class:RunnerEagle3TargetModel depends on.

Implemented for real by an inference-engine runner (SGLangTargetRunner today, a vLLM runner later) and faked in unit tests, which is why the backend depends on this protocol rather than on any engine directly.

model
'nn.Module'
nemo_automodel.components.speculative.eagle.target_runner.TargetRunner.forward_eagle3(
input_ids: torch.Tensor,
attention_mask: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor]

Run the target once and return (logits, aux_hidden_states).

logits is [batch, seq, vocab] (full vocab, unshifted) and aux_hidden_states is [batch, seq, 3 * hidden] (the three capture layers concatenated on the last dim, unshifted).

nemo_automodel.components.speculative.eagle.target_runner.TargetRunner.input_embedding_weight() -> torch.Tensor

Return the target input-embedding weight [vocab, hidden].

nemo_automodel.components.speculative.eagle.target_runner.TargetRunner.set_aux_layers(
aux_layer_ids: typing.Sequence[int]
) -> None

Tell the underlying model which 3 decoder layers to capture.