nemo_automodel.components.speculative.serve_target

Launch a remote EAGLE-3 target server (train-inference disaggregation).

Loads the frozen target model on this process’s GPU and serves draft-vocab supervision (aux hidden states, target_probs, position_mask) to a training client over HTTP (control plane) + NCCL (data plane).

Typical usage (single-GPU server)::

    CUDA_VISIBLE_DEVICES=0 python -m nemo_automodel.components.speculative.serve_target \
        --target meta-llama/Llama-3.1-8B-Instruct \
        --host 0.0.0.0 --port 8001

Then point training at it::

    recipe_args.target_model_backend: remote
    recipe_args.remote_urls: ["http://<server-host>:8001"]
    recipe_args.target_prefetch_depth: 1

Verify readiness with curl http://<host>:8001/health. NCCL GPU-direct transfer requires sglang to be installed in the server's environment; without it, the server transparently falls back to the binary wire format.
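The readiness check above can also be scripted, for example to block a training launch until the server is up. The following is a minimal sketch, not part of this module: `wait_for_health` and its parameters are hypothetical names, and it assumes only that a ready server answers `GET /health` with HTTP 200 (the response body is not inspected, since its format is not documented here).

```python
import time
import urllib.error
import urllib.request


def wait_for_health(base_url, timeout_s=60.0, interval_s=2.0, fetch=None):
    """Poll <base_url>/health until it returns HTTP 200 or the timeout expires.

    `fetch` is an injectable callable (url -> status code) so the loop can be
    exercised without a live server; by default it uses urllib.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status

    url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if fetch(url) == 200:
                return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or a non-2xx reply: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A caller might run `wait_for_health("http://<server-host>:8001")` once per entry in `recipe_args.remote_urls` and abort if any call returns `False`.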

Module Contents

Functions

| Name | Description |
| --- | --- |
| _parse_args | - |
| main | Load the target model and run the blocking HTTP + NCCL server. |

Data

logger

API

nemo_automodel.components.speculative.serve_target._parse_args(
argv = None
) -> argparse.Namespace
nemo_automodel.components.speculative.serve_target.main(
argv = None
) -> None

Load the target model and run the blocking HTTP + NCCL server.
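Because `main` accepts an optional `argv`, the server can presumably be started from Python as well as from the command line. The sketch below assumes that `argv` is handed to the argument parser in the same form as `sys.argv[1:]`, and it uses only the three flags shown in the usage example above; `build_server_argv` and `launch` are hypothetical helper names, not part of the module.

```python
def build_server_argv(target, host="0.0.0.0", port=8001):
    """Build an argv list equivalent to the documented command line.

    Only --target, --host and --port are used; any other flags the server
    may accept are not covered here.
    """
    return ["--target", target, "--host", host, "--port", str(port)]


def launch(target, host="0.0.0.0", port=8001):
    """Start the target server in the current process (blocks until it exits)."""
    # Imported lazily so this sketch can be loaded without nemo_automodel installed.
    from nemo_automodel.components.speculative.serve_target import main

    # Assumption: main() forwards argv to argparse like sys.argv[1:].
    main(build_server_argv(target, host, port))
```

Since `main` is documented as blocking, `launch` would normally be called from a dedicated process, with `CUDA_VISIBLE_DEVICES` set beforehand to select the serving GPU.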

nemo_automodel.components.speculative.serve_target.logger = logging.getLogger(__name__)