nemo_automodel.components.datasets.llm.squad

Module Contents

Functions

| Name | Description |
| --- | --- |
| _formatting_prompts_func | - |
| _formatting_prompts_func_with_chat_template | - |
| make_squad_dataset | Load and preprocess a SQuAD-style QA dataset for model fine-tuning. |

API

nemo_automodel.components.datasets.llm.squad._formatting_prompts_func(
example,
tokenizer,
eos_token_id,
pad_token_id,
seq_length = None,
padding = None,
truncation = None
)
nemo_automodel.components.datasets.llm.squad._formatting_prompts_func_with_chat_template(
example,
tokenizer,
eos_token_id,
pad_token_id,
seq_length = None,
padding = None,
truncation = None
)
nemo_automodel.components.datasets.llm.squad.make_squad_dataset(
tokenizer,
seq_length = None,
limit_dataset_samples = None,
fp8 = False,
split = 'train',
dataset_name = 'squad',
padding = False,
truncation = False
)

Load and preprocess a SQuAD-style QA dataset for model fine-tuning.

This function retrieves the specified split of the SQuAD dataset and formats each example with either a simple prompt-completion format or, if tokenizer.chat_template is set, a chat-template format. It then tokenizes each example, constructs input_ids and labels, and optionally pads all sequences to a fixed length.
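The flow described above can be illustrated with a small, self-contained sketch. It is not the library's implementation: the whitespace `ToyTokenizer`, the prompt wording, the `format_example` helper, and the use of -100 to mark ignored label positions are all assumptions made for illustration. Only the record layout (`context`, `question`, `answers`) follows the public SQuAD dataset.

```python
from typing import Dict, List, Optional

# Assumption: prompt and padding positions are excluded from the loss with -100,
# which is the default ignore_index of PyTorch's cross-entropy loss.
IGNORE_INDEX = -100


class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for a Hugging Face tokenizer."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {"<pad>": 0, "<eos>": 1}
        self.pad_token_id = 0
        self.eos_token_id = 1

    def encode(self, text: str) -> List[int]:
        # Every unseen word receives the next free id.
        return [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]


def format_example(example: dict, tokenizer: ToyTokenizer,
                   seq_length: Optional[int] = None) -> Dict[str, List[int]]:
    """Build input_ids and labels from one SQuAD-style record (illustrative only)."""
    prompt = f"Context: {example['context']} Question: {example['question']} Answer:"
    answer = example["answers"]["text"][0]

    prompt_ids = tokenizer.encode(prompt)
    answer_ids = tokenizer.encode(answer) + [tokenizer.eos_token_id]

    input_ids = prompt_ids + answer_ids
    # Supervise only the answer tokens; prompt positions are masked out.
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids

    if seq_length is not None:
        # Truncate first, then right-pad to exactly seq_length.
        input_ids = input_ids[:seq_length]
        labels = labels[:seq_length]
        pad = seq_length - len(input_ids)
        input_ids = input_ids + [tokenizer.pad_token_id] * pad
        labels = labels + [IGNORE_INDEX] * pad

    return {"input_ids": input_ids, "labels": labels}


example = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}
features = format_example(example, ToyTokenizer(), seq_length=20)
```

In this sketch the two lists always have equal length, so they can be batched directly once every example is padded to the same `seq_length`.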

Parameters:

tokenizer

A Hugging Face tokenizer that exposes eos_token_id, optionally bos_id and eos_id, and optionally chat_template/apply_chat_template.

seq_length
int, defaults to None

If set, pad/truncate each example to this length.

limit_dataset_samples
int, defaults to None

If set, limit the number of examples loaded from the split.

fp8
bool, defaults to False

Flag for future use (e.g., mixed precision). Currently unused.

split
str, defaults to 'train'

Which split of the dataset to load (e.g. 'train', 'validation').

dataset_name
str, defaults to 'squad'

Identifier for the Hugging Face dataset (e.g. "rajpurkar/squad"). The signature default is 'squad'.

padding
Optional[str | bool], defaults to False

Optional padding strategy.

truncation
Optional[str | bool], defaults to False

Optional truncation strategy.
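A call using the parameters above might look like the following sketch. The helpers `build_squad_kwargs` and `load_squad` are hypothetical wrappers written for this example, and the chosen values (a sequence length of 512 and a limit of 1000 samples) are arbitrary. The import path for `make_squad_dataset` is taken from the module name on this page, and `AutoTokenizer.from_pretrained` comes from the `transformers` library.

```python
def build_squad_kwargs(seq_length=512, limit=1000, split="train"):
    """Collect keyword arguments using only parameter names documented above."""
    return {
        "seq_length": seq_length,        # pad/truncate each example to this length
        "limit_dataset_samples": limit,  # cap on the number of examples loaded
        "split": split,                  # e.g. "train" or "validation"
    }


def load_squad(tokenizer_name, **kwargs):
    """Hypothetical wrapper: build a tokenizer, then call make_squad_dataset."""
    # Imports are deferred so that defining this helper does not require
    # transformers or nemo_automodel to be installed.
    from transformers import AutoTokenizer
    from nemo_automodel.components.datasets.llm.squad import make_squad_dataset

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    return make_squad_dataset(tokenizer, **kwargs)


kwargs = build_squad_kwargs(limit=100, split="validation")
```

Passing these arguments to `load_squad` together with the name of a tokenizer checkpoint would download both the tokenizer and the dataset, so it needs network access on first use. The tokenizer should define a padding token if sequences are to be padded.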

Returns:

A Hugging Face Dataset where each example is a dict with keys: