nemo_automodel.components.speculative.eagle.target_runner
Engine-agnostic EAGLE-3 target backend built on a pluggable runner.
The supervision contract (how a target’s raw logits / aux hidden states become
an :class:Eagle3TargetBatch) is identical no matter which inference engine
runs the frozen target. This module owns that contract:
- :class:TargetRunner is the narrow surface a concrete engine must implement (forward_eagle3 / set_aux_layers / input_embedding_weight and a model exposing .config + .parameters()). It is the single seam between the engine-coupled code and the rest of the stack: SGLang implements it today (sglang_runner.SGLangTargetRunner), and a vLLM runner can implement the same protocol and drop in without touching this file, the trainer, or the remote server.
- :class:RunnerEagle3TargetModel is the :class:Eagle3TargetBackend that assembles the supervision batch from any :class:TargetRunner. Its shift / aux-concatenation semantics are identical to :class:HFEagle3TargetModel, so a run through any runner is numerically equivalent to the co-located one.
Keeping the contract here (and importing engines lazily in their own adapter modules) means importing this module never pulls in SGLang or vLLM, so the contract stays unit-testable on CPU against a fake runner.
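As a rough illustration of how the contract can be exercised on CPU, the sketch below restates the runner surface as a typing.Protocol and pairs it with a fake runner. The method names come from the description above; the argument lists, the FakeRunner and TargetRunnerSketch names, and the use of nested lists in place of tensors are assumptions made for this example, not the module's actual definitions.

```python
from typing import Any, Protocol, Sequence, runtime_checkable


@runtime_checkable
class TargetRunnerSketch(Protocol):
    """Hypothetical restatement of the runner surface; signatures are assumed."""

    model: Any  # assumed to expose .config and .parameters()

    def forward_eagle3(self, input_ids, attention_mask=None): ...

    def set_aux_layers(self, aux_layer_ids: Sequence[int]) -> None: ...

    def input_embedding_weight(self): ...


class _FakeModel:
    """Stand-in for the engine's model object."""

    def __init__(self, vocab: int, hidden: int) -> None:
        self.config = {"vocab_size": vocab, "hidden_size": hidden}

    def parameters(self):
        return iter(())  # a fake target has no real parameters


class FakeRunner:
    """CPU-only fake; nested lists stand in for tensors."""

    def __init__(self, vocab: int = 5, hidden: int = 2) -> None:
        self.vocab, self.hidden = vocab, hidden
        self.model = _FakeModel(vocab, hidden)
        self.aux_layer_ids = None

    def set_aux_layers(self, aux_layer_ids: Sequence[int]) -> None:
        ids = tuple(aux_layer_ids)
        if len(ids) != 3:
            raise ValueError("expected exactly three capture layers")
        self.aux_layer_ids = ids

    def forward_eagle3(self, input_ids, attention_mask=None):
        # logits: [batch, seq, vocab]; one-hot on the token id, purely synthetic.
        logits = [[[float(tok == v) for v in range(self.vocab)] for tok in row]
                  for row in input_ids]
        # aux: [batch, seq, 3 * hidden]; constant per position, purely synthetic.
        aux = [[[float(tok)] * (3 * self.hidden) for tok in row]
               for row in input_ids]
        return logits, aux

    def input_embedding_weight(self):
        return [[0.0] * self.hidden for _ in range(self.vocab)]


runner = FakeRunner()
runner.set_aux_layers([1, 2, 3])
logits, aux = runner.forward_eagle3([[1, 2, 3]])
```

Because the fake satisfies the protocol structurally, a test can hand it to the backend without importing SGLang or vLLM.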
Module Contents
Classes
API
Bases: Eagle3TargetBackend
EAGLE-3 target backend that runs the frozen target through a runner.
Engine-agnostic: it owns only the supervision contract and delegates the
actual forward to a :class:TargetRunner. Concrete engines subclass this to
add their own from_pretrained (which builds the runner); everything else
is inherited.
Parameters
runner:
A loaded runner implementing :class:TargetRunner.
aux_layer_ids:
The three decoder layers to capture (low / mid / high). When None
the shared EAGLE-3 default recipe is used, matching every other backend.
Release the runner (frees GPU memory / engine handles).
Run the runner’s target and capture aux hidden states plus logits.
Produces an :class:Eagle3TargetBatch byte-for-byte compatible with
:meth:HFEagle3TargetModel.generate_batch: the logits, input_ids
and loss_mask are shifted left by one (next-token alignment) while
the aux hidden states are kept position-aligned. The draft-vocab
projection happens trainer-side / server-side from logits, exactly
as in the co-located path.
Note: sequences are assumed right-padded (loss_mask zeros the pad). With causal attention, trailing pad tokens do not affect earlier positions, so the captured supervision matches a masked HuggingFace forward.
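The alignment described above can be sketched with plain nested lists. This is a minimal illustration, not the backend's code: the helper names, the dictionary keys, and the choice to zero-fill the position vacated by the shift are all assumptions for the example.

```python
import copy
from typing import Any, Dict, List


def shift_left(batch: List[list], fill: Any) -> List[list]:
    """Drop position 0 of every sequence and append `fill` at the end.

    Zero-filling the vacated last position is an assumption of this sketch.
    """
    return [list(seq[1:]) + [copy.deepcopy(fill)] for seq in batch]


def assemble_supervision(logits, aux_hidden_states, input_ids, loss_mask) -> Dict[str, list]:
    """Hypothetical helper: shift the next-token fields, keep aux position-aligned."""
    vocab = len(logits[0][0])
    return {
        "logits": shift_left(logits, [0.0] * vocab),
        "input_ids": shift_left(input_ids, 0),
        "loss_mask": shift_left(loss_mask, 0),
        "aux_hidden_states": aux_hidden_states,  # not shifted
    }


# One right-padded sequence of length 4 (last token is padding).
input_ids = [[10, 11, 12, 0]]
loss_mask = [[1, 1, 1, 0]]
logits = [[[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.5, 0.5]]]
aux = [[[1.0], [2.0], [3.0], [4.0]]]

batch = assemble_supervision(logits, aux, input_ids, loss_mask)
```

After the shift, position t of the shifted fields carries what was originally at position t + 1, while the aux hidden state at position t is still the one computed at position t.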
Return an object exposing .weight (the target input embeddings).
Matches the offline-cache / remote path: the draft’s
copy_embeddings_from_target only reads .weight.
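Since the consumer only reads .weight, a very thin wrapper is enough to satisfy it. The sketch below shows the idea with hypothetical names (EmbeddingView, copy_embeddings_sketch) and nested lists in place of tensors; it is not the draft model's actual copy_embeddings_from_target.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EmbeddingView:
    """Hypothetical wrapper exposing only the attribute the consumer needs."""

    weight: List[List[float]]  # [vocab, hidden]


def copy_embeddings_sketch(draft_table: List[List[float]], target_embeddings) -> None:
    """Stand-in for a draft-side copy: touches nothing but `.weight`."""
    for dst_row, src_row in zip(draft_table, target_embeddings.weight):
        dst_row[:] = src_row  # copy values in place, row by row


target = EmbeddingView(weight=[[1.0, 2.0], [3.0, 4.0]])
draft = [[0.0, 0.0], [0.0, 0.0]]
copy_embeddings_sketch(draft, target)
```

Any object with a compatible .weight attribute works here, which is why the same draft-side code can serve the co-located, offline-cache, and remote paths.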
Minimal engine surface :class:RunnerEagle3TargetModel depends on.
Implemented for real by an inference-engine runner (SGLangTargetRunner
today, a vLLM runner later) and faked in unit tests, which is why the
backend depends on this protocol rather than on any engine directly.
Run the target once and return (logits, aux_hidden_states).
logits is [batch, seq, vocab] (full vocab, unshifted) and
aux_hidden_states is [batch, seq, 3 * hidden] (the three capture
layers concatenated on the last dim, unshifted).
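To make the aux_hidden_states layout concrete, the sketch below concatenates three per-layer hidden states of shape [batch, seq, hidden] along the last dimension. The function name and the low / mid / high ordering inside the concatenated vector are assumptions of this example; nested lists stand in for tensors.

```python
from typing import List

Tensor3D = List[List[List[float]]]  # [batch, seq, features]


def concat_aux(low: Tensor3D, mid: Tensor3D, high: Tensor3D) -> Tensor3D:
    """Concatenate three [batch, seq, hidden] captures into [batch, seq, 3 * hidden]."""
    return [
        [l + m + h for l, m, h in zip(low_seq, mid_seq, high_seq)]
        for low_seq, mid_seq, high_seq in zip(low, mid, high)
    ]


# batch = 1, seq = 2, hidden = 2
low = [[[1.0, 2.0], [7.0, 8.0]]]
mid = [[[3.0, 4.0], [9.0, 10.0]]]
high = [[[5.0, 6.0], [11.0, 12.0]]]

aux_hidden_states = concat_aux(low, mid, high)
```

Each position keeps its own row, so the result stays unshifted and position-aligned with the inputs.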
Return the target input-embedding weight [vocab, hidden].
Tell the underlying model which 3 decoder layers to capture.