nemo_automodel.components.speculative.serve_target
Launch a remote EAGLE-3 target server (train-inference disaggregation).
Loads the frozen target model on this process’s GPU and serves draft-vocab
supervision (aux hidden states, target_probs, position_mask) to a
training client over HTTP (control plane) + NCCL (data plane).
Typical usage (single-GPU server)::
    CUDA_VISIBLE_DEVICES=0 python -m nemo_automodel.components.speculative.serve_target \
        --target meta-llama/Llama-3.1-8B-Instruct \
        --host 0.0.0.0 --port 8001
Then point training at it::
    recipe_args.target_model_backend: remote
    recipe_args.remote_urls: ["http://<server-host>:8001"]
    recipe_args.target_prefetch_depth: 1
Verify readiness with curl http://<host>:8001/health. NCCL GPU-direct
transfer requires sglang installed in the server’s environment; without it the
server transparently falls back to the binary wire format.
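The readiness check and the sglang prerequisite described above can also be scripted, for example before launching a training job. The following is a minimal sketch, not part of this module: the helper names `wait_until_ready` and `sglang_importable` are hypothetical, the only thing assumed about `/health` is that it answers HTTP 200 once the server is up, and checking that `sglang` is importable is only an approximation of whatever test the server itself performs.

```python
import importlib.util
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url, timeout_s=300.0, poll_interval_s=2.0):
    """Poll <base_url>/health until it returns HTTP 200 or the timeout expires.

    Hypothetical helper: it assumes only that the endpoint answers with
    status 200 when the target server is ready; the response body is ignored.
    """
    health_url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(health_url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or a non-2xx status: not ready yet.
            pass
        time.sleep(poll_interval_s)
    return False


def sglang_importable():
    """Return True if the `sglang` package can be found in this environment.

    Assumption: importability is used here as a stand-in for "sglang is
    installed"; run this in the server's environment, not the client's.
    """
    return importlib.util.find_spec("sglang") is not None
```

A caller might run `wait_until_ready("http://<server-host>:8001")` and abort if it returns `False`. Loading the target model can take a while, so the default timeout in this sketch is deliberately generous.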
Module Contents
Functions
Data
API
Load the target model and run the blocking HTTP + NCCL server.