nemo_automodel.components.datasets.llm.mock_prefix_tree
Deterministic mock shared-prefix rollout data for prefix-tree smoke runs.
Module Contents
Functions
API
Build a deterministic mock shared-prefix rollout dataset for smoke runs.
Each group is one shared prompt with completions_per_group completions, in
the {"prompt_ids", "completions"} schema consumed by
prefix_tree_collate_fn. Token ids are random in [2, vocab_size); this
is a pipeline smoke test, not a quality dataset.
Parameters:
number of rollout groups.
number of completions (leaves) sharing each prompt.
shared prompt length per group.
length of each completion.
upper bound (exclusive) for random token ids.
RNG seed for reproducibility.
Returns: list[dict]
A list of {"prompt_ids": list[int], "completions": list[list[int]]}.
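The description above can be illustrated with a minimal sketch of such a generator. This is not the module's actual implementation: the function name build_mock_groups and the parameter names num_groups, prompt_len, completion_len and seed are hypothetical, chosen only to mirror the parameter descriptions. Only completions_per_group, vocab_size and the output schema come from the text above, and the sampling scheme (Python's standard random module) is an assumption.

```python
import random


def build_mock_groups(num_groups, completions_per_group, prompt_len,
                      completion_len, vocab_size, seed=0):
    """Hypothetical stand-in for the documented function, not the real one.

    Returns a list of {"prompt_ids": [...], "completions": [[...], ...]}.
    """
    if vocab_size <= 2:
        raise ValueError("vocab_size must be greater than 2")

    # A local RNG keeps the output reproducible for a given seed
    # without touching the global random state.
    rng = random.Random(seed)

    def sample(n):
        # randrange excludes the upper bound, so ids fall in [2, vocab_size).
        return [rng.randrange(2, vocab_size) for _ in range(n)]

    groups = []
    for _ in range(num_groups):
        groups.append({
            # One prompt shared by every completion in the group.
            "prompt_ids": sample(prompt_len),
            # The leaves: completions_per_group sequences of equal length.
            "completions": [sample(completion_len)
                            for _ in range(completions_per_group)],
        })
    return groups


groups = build_mock_groups(num_groups=2, completions_per_group=3,
                           prompt_len=4, completion_len=5,
                           vocab_size=100, seed=0)
```

Calling the sketch twice with the same arguments yields identical lists, which is the property a smoke run needs to compare results across runs.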