nemo_automodel.datasets.llm.mock

Module Contents

Functions

make_vocab

Build a trivial vocab: indices 0 and 1 are reserved for special tokens, and every remaining index i maps to tok_i.

gen_sentence_ids

Sentence generator with Gaussian length control.

build_unpacked_dataset

Build a dataset where each example is one sentence (variable length).

API

nemo_automodel.datasets.llm.mock.make_vocab(vocab_size: int = 100)

Build a trivial vocab: indices 0 and 1 are reserved for special tokens, and every remaining index i maps to tok_i.
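The layout described above can be sketched as follows. This is an illustrative stand-in, not the module's implementation: the function name `make_vocab_sketch` is hypothetical, and the names of the two reserved tokens (`<pad>` and `<eos>`) and the dict return type are assumptions.

```python
from typing import Dict


def make_vocab_sketch(vocab_size: int = 100) -> Dict[str, int]:
    """Hypothetical stand-in showing the documented vocab layout."""
    # Assumption: the two reserved entries are a padding and an
    # end-of-sequence token; the real module may name them differently.
    vocab = {"<pad>": 0, "<eos>": 1}
    # Every remaining index i gets the placeholder token "tok_i".
    for i in range(2, vocab_size):
        vocab[f"tok_{i}"] = i
    return vocab


vocab = make_vocab_sketch(10)
print(len(vocab), vocab["tok_2"])
```

Keeping the token string and its index in lockstep makes generated ids easy to read back when debugging a data pipeline.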

nemo_automodel.datasets.llm.mock.gen_sentence_ids(vocab, mean_len: float, std_len: float, max_len: int)

Generate one sentence of token ids whose length is drawn from a Gaussian with mean mean_len and standard deviation std_len, capped at max_len.
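A minimal sketch of Gaussian length control, for illustration only. The function name `gen_sentence_ids_sketch`, the explicit `rng` argument, the lower clamp of 1, and the choice to sample only from ids 2 and above are all assumptions, not details confirmed by the module.

```python
import random


def gen_sentence_ids_sketch(vocab, mean_len, std_len, max_len, rng=None):
    """Hypothetical stand-in: sample a sentence with a Gaussian-distributed length."""
    if rng is None:
        rng = random.Random()
    # Draw a length from N(mean_len, std_len), round it, then clamp
    # to [1, max_len] so the sentence is never empty or over-long.
    length = int(round(rng.gauss(mean_len, std_len)))
    length = max(1, min(max_len, length))
    # Assumption: ids 0 and 1 are reserved, so sample from the rest.
    candidate_ids = [i for i in vocab.values() if i >= 2]
    return [rng.choice(candidate_ids) for _ in range(length)]


# A small vocab with the assumed layout: two reserved ids, then tok_i.
vocab = {"<pad>": 0, "<eos>": 1}
vocab.update({f"tok_{i}": i for i in range(2, 100)})

ids = gen_sentence_ids_sketch(vocab, mean_len=20.0, std_len=6.0, max_len=64,
                              rng=random.Random(0))
print(len(ids))
```

Passing a seeded `random.Random` instance keeps the sketch reproducible without touching the global random state.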

nemo_automodel.datasets.llm.mock.build_unpacked_dataset(
*,
num_sentences: int = 10,
mean_len: float = 20.0,
std_len: float = 6.0,
vocab_size: int = 100,
max_sentence_len: int = 64,
seed: int = 0,
)

Build a dataset where each example is one sentence (variable length).

Returns:

  A HuggingFace Dataset with fields:

  • input_ids: Sequence(int64)
  • attention_mask: Sequence(int8)
  • labels: Sequence(int64)
  • position_ids: Sequence(int64)
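To make the row shape concrete, here is a small sketch that builds records with the four documented fields. Plain Python dicts stand in for rows of the HuggingFace Dataset. The helper name `make_example_sketch` is hypothetical, and the field contents (an all-ones attention mask, labels copied from `input_ids`, and 0-based `position_ids`) are assumptions about a typical unpacked causal-LM example rather than confirmed behavior.

```python
import random


def make_example_sketch(input_ids):
    """Hypothetical stand-in for one unpacked example (one sentence per row)."""
    n = len(input_ids)
    return {
        "input_ids": list(input_ids),
        # Assumption: no padding in an unpacked row, so every token is attended to.
        "attention_mask": [1] * n,
        # Assumption: labels mirror input_ids, as is common for causal LM data.
        "labels": list(input_ids),
        # Assumption: positions restart at 0 for each sentence.
        "position_ids": list(range(n)),
    }


rng = random.Random(0)  # plays the role of the seed parameter
rows = []
for _ in range(3):
    # Gaussian length, clamped to [1, 64] to mirror max_sentence_len.
    length = max(1, min(64, int(round(rng.gauss(20.0, 6.0)))))
    ids = [rng.randrange(2, 100) for _ in range(length)]
    rows.append(make_example_sketch(ids))

for row in rows:
    print(len(row["input_ids"]), row["position_ids"][:3])
```

Because every row holds exactly one sentence, the four fields always share the same length within a row, while lengths differ across rows.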