nemo_automodel.components.speculative.regenerate
nemo_automodel.components.speculative.regenerate
Regenerate dataset answers with the EAGLE target model.
EAGLE drafters learn best when the supervised assistant turn is produced by
the same model that will serve as the inference target. Many public chat
datasets were generated by other models, so the assistant tokens they contain
are off-distribution for the drafter. This script takes such a dataset,
strips the trailing assistant turn from each sample, replays the remaining
[system, user, ...] context against a target model running behind an
OpenAI-compatible SGLang server, and writes a new dataset whose
messages column ends with a freshly-generated assistant turn.
The output parquet files have the same messages column shape that
ChatDataset (used by build_eagle3_dataloader) consumes, so the
regenerated directory can be plugged directly into train_data_path in
the EAGLE-3 recipe.
Typical usage:
1. Spin up SGLang serving the target model in another shell:
python -m sglang.launch_server
—model-path meta-llama/Llama-3.1-8B-Instruct —port 30000
2. Regenerate answers:
python -m nemo_automodel.components.speculative.regenerate
—input-data Aeala/ShareGPT_Vicuna_unfiltered
—output-dir ./regenerated/sharegpt_llama31_8b
—target-server http://localhost:30000/v1
—model meta-llama/Llama-3.1-8B-Instruct
—concurrency 64 —shard-size 1000
The script is resumable: re-running with the same --output-dir --resume
skips any shards that are already on disk, and verifies via a manifest that
the input/model/sharding configuration matches the earlier run.
Module Contents
Classes
Functions
Data
API
Sampling parameters forwarded to the SGLang chat completion endpoint.
Return the regeneration settings that must stay stable across resume.
Fields that change the content of the output dataset are included. Fields
that only affect throughput / reliability (concurrency, timeout_s,
max_retries) are intentionally omitted so a user can re-resume with
different operational knobs. output_dir is also omitted: the manifest
lives inside output_dir, so encoding it here would only break resume
after a directory rename.
POST payload to url and return the assistant message dict, with bounded retries.
Guard --resume against silently mixing shards from different runs.
Also refuses to start a fresh run that would silently clobber existing
shards: if the output directory already contains shard files and the user
did not pass --resume, raise so they make an explicit choice (either
delete the directory or pass --resume).
Return the set of shard indices already present in output_dir.
Return messages truncated so its tail is not an assistant turn.
EAGLE-3 supervision needs an assistant turn produced by the target model. The strategy here mirrors SpecForge’s offline regeneration: keep every leading system / user / tool turn (including any intermediate user<->assistant rounds), but drop the trailing assistant turn so the target can produce a fresh one.
Returns None if the sample has no valid prompt context (e.g. it is
empty, or starts with an assistant turn that gets dropped, leaving
nothing). Callers should skip such samples.
Yield rows’ messages_column from an HF dataset or a list of dicts.
Return the manifest path inside output_dir.
Run a single shard’s prompts through the target server with bounded concurrency.
shard_samples items are (global_index, original_messages, prompt_messages);
only the prompt is sent to the server, but both are kept around so the
written rows can preserve the original for traceability.
Call the target server once and return prompt + [assistant].
Async driver: load dataset, regenerate, write shards. Returns a process exit code.
Reject invalid CLI values before any network or disk work starts.
Persist the current regeneration config for future --resume checks.
Write a shard atomically (.tmp then os.replace) so partial writes never linger.
CLI entry point. Parses argv and returns the process exit code.