nemo_automodel.components.datasets.llm.mock_prefix_tree


Deterministic mock shared-prefix rollout data for prefix-tree smoke runs.

Module Contents

Functions

Name | Description
build_mock_rollout_dataset | Build a deterministic mock shared-prefix rollout dataset for smoke runs.

API

nemo_automodel.components.datasets.llm.mock_prefix_tree.build_mock_rollout_dataset(
num_groups: int = 16,
completions_per_group: int = 4,
prompt_len: int = 32,
completion_len: int = 16,
vocab_size: int = 1024,
seed: int = 0
) -> list[dict]

Build a deterministic mock shared-prefix rollout dataset for smoke runs.

Each group consists of one shared prompt and completions_per_group completions, in the {"prompt_ids", "completions"} schema consumed by prefix_tree_collate_fn. Token ids are drawn at random from [2, vocab_size); the data is meant for pipeline smoke testing, not as a quality dataset.
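To make the schema concrete, here is a small hand-written group and a helper that expands it into full token sequences. This is only an illustration of what "shared prefix" means for one group. `RolloutGroup` and `expand_group` are hypothetical names introduced here, and the expansion is not a description of what prefix_tree_collate_fn does internally.

```python
from typing import TypedDict


class RolloutGroup(TypedDict):
    """Hypothetical type mirroring the {"prompt_ids", "completions"} schema."""

    prompt_ids: list[int]
    completions: list[list[int]]


def expand_group(group: RolloutGroup) -> list[list[int]]:
    """Return one full sequence per completion, each starting with the shared prompt."""
    return [group["prompt_ids"] + completion for completion in group["completions"]]


# Hand-written example values; a real group holds random token ids.
example_group: RolloutGroup = {
    "prompt_ids": [5, 9, 12],
    "completions": [[7, 3], [8, 4]],
}

sequences = expand_group(example_group)
for seq in sequences:
    print(seq)
```

Every expanded sequence begins with the same prompt tokens, which is the redundancy a prefix tree is meant to exploit.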

Parameters:

num_groups
int, defaults to 16

number of rollout groups.

completions_per_group
int, defaults to 4

completions (leaves) sharing each prompt.

prompt_len
int, defaults to 32

shared prompt length per group.

completion_len
int, defaults to 16

length of each completion.

vocab_size
int, defaults to 1024

upper bound (exclusive) for random token ids.

seed
int, defaults to 0

RNG seed for reproducibility.

Returns: list[dict]

A list of {"prompt_ids": list[int], "completions": list[list[int]]}.
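The following stand-in may help when reasoning about the shape of the returned data without installing the package. It is a minimal sketch that assumes Python's standard `random.Random` as the generator. `mock_rollout_sketch` is a hypothetical name, not the library function, and the actual implementation may use a different RNG, so the token values will likely differ from what build_mock_rollout_dataset returns.

```python
import random


def mock_rollout_sketch(
    num_groups: int = 16,
    completions_per_group: int = 4,
    prompt_len: int = 32,
    completion_len: int = 16,
    vocab_size: int = 1024,
    seed: int = 0,
) -> list[dict]:
    """Hypothetical stand-in producing the documented schema (not the library code)."""
    rng = random.Random(seed)  # assumption: any seeded RNG gives reproducible output

    def draw(n: int) -> list[int]:
        # Token ids in [2, vocab_size), matching the documented range.
        return [rng.randrange(2, vocab_size) for _ in range(n)]

    dataset = []
    for _ in range(num_groups):
        prompt_ids = draw(prompt_len)
        completions = [draw(completion_len) for _ in range(completions_per_group)]
        dataset.append({"prompt_ids": prompt_ids, "completions": completions})
    return dataset


data = mock_rollout_sketch(
    num_groups=2, completions_per_group=3, prompt_len=4, completion_len=2, vocab_size=50
)
print(len(data), len(data[0]["prompt_ids"]), len(data[0]["completions"]))
```

Calling the sketch twice with the same seed yields identical data, which is the property that makes this kind of dataset useful for repeatable smoke runs.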