nemo_automodel.components.datasets.llm.mock_packed

Module Contents

Functions

| Name | Description |
| --- | --- |
| build_packed_dataset | Dataset builder. |
| flush_block | Flush helper (build position_ids that reset after <eos>). |
| gen_sentence_ids | Sentence generator with Gaussian length control. |
| make_vocab | Build a trivial vocab; index 0=<pad>, 1=<eos>, rest = word_i. |

Data

ds

API

nemo_automodel.components.datasets.llm.mock_packed.build_packed_dataset(
num_blocks: int = 10,
block_size: int = 128,
mean_len: float = 20.0,
std_len: float = 6.0,
vocab_size: int = 100,
max_sentence_len: int = 64,
seed: int = 0,
tokenizer = None
)

Dataset builder.
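
The following is a minimal, self-contained sketch of how a builder with this signature might pack mock sentences into fixed-size blocks. It is not the module's implementation. The name `build_packed_sketch`, the greedy "flush when the next sentence does not fit" rule, the plain list-of-lists return value, and the omission of the `tokenizer` parameter are all assumptions made for illustration. Only the token IDs 0 (`<pad>`) and 1 (`<eos>`) come from the descriptions on this page.

```python
import random

PAD_ID, EOS_ID = 0, 1  # IDs taken from the make_vocab description


def build_packed_sketch(num_blocks=10, block_size=128, mean_len=20.0,
                        std_len=6.0, vocab_size=100, max_sentence_len=64,
                        seed=0):
    """Hypothetical greedy packer: returns num_blocks lists of block_size IDs."""
    rng = random.Random(seed)
    blocks, current = [], []
    while len(blocks) < num_blocks:
        # Draw a Gaussian length and clamp it so a sentence always fits a block.
        length = round(rng.gauss(mean_len, std_len))
        length = max(1, min(length, max_sentence_len, block_size))
        # Random word IDs (2 and up), terminated by <eos>.
        sentence = [rng.randrange(2, vocab_size) for _ in range(length - 1)]
        sentence.append(EOS_ID)
        # Assumed policy: pad and flush the block when the sentence overflows.
        if len(current) + len(sentence) > block_size:
            blocks.append(current + [PAD_ID] * (block_size - len(current)))
            current = []
        current.extend(sentence)
    return blocks


blocks = build_packed_sketch(num_blocks=3, block_size=32, mean_len=10, std_len=3)
```

With a fixed `seed`, the sketch produces the same blocks on every run, which is the usual reason a mock dataset builder exposes a seed at all.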

nemo_automodel.components.datasets.llm.mock_packed.flush_block(
block,
block_size
)

Flush helper (build position_ids that reset after <eos>).
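
To illustrate what "position_ids that reset after <eos>" could mean, here is a small sketch under stated assumptions. The function name `flush_block_sketch`, the right-padding with ID 0, and the returned dictionary keys are hypothetical and not confirmed by this page. The only grounded parts are the two parameters and the reset-after-`<eos>` behavior.

```python
PAD_ID, EOS_ID = 0, 1  # IDs taken from the make_vocab description


def flush_block_sketch(block, block_size):
    """Hypothetical flush: pad to block_size and number positions per sentence."""
    input_ids = list(block[:block_size])
    input_ids += [PAD_ID] * (block_size - len(input_ids))

    position_ids = []
    pos = 0
    for tok in input_ids:
        position_ids.append(pos)
        # The token after an <eos> starts a new sentence at position 0.
        pos = 0 if tok == EOS_ID else pos + 1
    return {"input_ids": input_ids, "position_ids": position_ids}


# Two packed sentences, [5, 6, <eos>] and [7, 8, 9, <eos>], in a block of 8.
out = flush_block_sketch([5, 6, 1, 7, 8, 9, 1], block_size=8)
# out["input_ids"]    == [5, 6, 1, 7, 8, 9, 1, 0]
# out["position_ids"] == [0, 1, 2, 0, 1, 2, 3, 0]
```

Resetting the positions per sentence lets a model treat each packed sentence as if it started at the beginning of its own sequence.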

nemo_automodel.components.datasets.llm.mock_packed.gen_sentence_ids(
vocab,
mean_len: float,
std_len: float,
max_len: int
)

Sentence generator with Gaussian length control.
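
One plausible reading of "Gaussian length control" is sketched below: draw a length from a normal distribution, clamp it to `[1, max_len]`, then fill the sentence with random word IDs and a trailing `<eos>`. The name `gen_sentence_ids_sketch`, the extra `rng` argument, the clamping rule, the trailing `<eos>`, and the assumption that `vocab` is a list whose indices are token IDs are all illustrative choices rather than documented behavior.

```python
import random

EOS_ID = 1  # ID taken from the make_vocab description


def gen_sentence_ids_sketch(vocab, mean_len, std_len, max_len, rng=None):
    """Hypothetical generator: Gaussian-length list of word IDs ending in <eos>."""
    rng = rng or random.Random()
    # Sample the target length, then clamp it to a valid range.
    length = round(rng.gauss(mean_len, std_len))
    length = max(1, min(max_len, length))
    # Indices 0 and 1 are reserved for <pad> and <eos>, so words start at 2.
    ids = [rng.randrange(2, len(vocab)) for _ in range(length - 1)]
    ids.append(EOS_ID)
    return ids


# Assumed vocab layout: a list where the index is the token ID.
vocab = ["<pad>", "<eos>"] + [f"word_{i}" for i in range(2, 50)]
ids = gen_sentence_ids_sketch(vocab, mean_len=10.0, std_len=3.0, max_len=16,
                              rng=random.Random(0))
```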

nemo_automodel.components.datasets.llm.mock_packed.make_vocab(
vocab_size: int = 100
)

Build a trivial vocab; index 0=<pad>, 1=<eos>, rest = word_i.
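
A vocabulary matching this description could look like the sketch below. The function name `make_vocab_sketch`, the list return type, and the choice to number words by their own index (`word_2`, `word_3`, ...) are assumptions. The page only states that index 0 is `<pad>`, index 1 is `<eos>`, and the rest follow a `word_i` pattern.

```python
def make_vocab_sketch(vocab_size: int = 100) -> list[str]:
    """Hypothetical vocab: list position is the token ID."""
    special = ["<pad>", "<eos>"]
    # Assumed naming: each remaining token is labeled with its own index.
    words = [f"word_{i}" for i in range(len(special), vocab_size)]
    return special + words


vocab = make_vocab_sketch(10)
# vocab[0] == "<pad>", vocab[1] == "<eos>", vocab[2] == "word_2"
```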

nemo_automodel.components.datasets.llm.mock_packed.ds = build_packed_dataset(num_blocks=3, block_size=32, mean_len=10, std_len=3, vocab_...