nemo_automodel.datasets.llm.squad#

Module Contents#

Functions#

make_squad_dataset

Load and preprocess a SQuAD-style QA dataset for model fine-tuning.

API#

nemo_automodel.datasets.llm.squad.make_squad_dataset(
tokenizer,
seq_length=None,
limit_dataset_samples=None,
start_of_turn_token=None,
fp8=False,
split='train',
dataset_name='rajpurkar/squad',
)[source]#

Load and preprocess a SQuAD-style QA dataset for model fine-tuning.

This function retrieves the specified split of the SQuAD dataset, applies either a simple prompt-completion format or a chat-template format (if tokenizer.chat_template is set), tokenizes each example, constructs input_ids, labels, and loss_mask, and optionally pads all sequences to a fixed length.
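The core of the prompt-completion path can be sketched as follows. This is a simplified illustration, not the actual implementation: `build_features` is a hypothetical helper, and real token IDs would come from the passed Hugging Face tokenizer.

```python
def build_features(prompt_ids, answer_ids, pad_id, seq_length=None):
    """Sketch: construct input_ids, labels, and loss_mask for one example."""
    ids = prompt_ids + answer_ids
    input_ids = ids[:-1]   # model input
    labels = ids[1:]       # targets, shifted left by one for language modeling
    # Mask out positions whose target token belongs to the prompt;
    # only answer tokens contribute to the loss.
    loss_mask = [0] * (len(prompt_ids) - 1) + [1] * len(answer_ids)
    if seq_length is not None:
        # Optionally pad every field to a fixed length.
        pad = seq_length - len(input_ids)
        input_ids += [pad_id] * pad
        labels += [pad_id] * pad
        loss_mask += [0] * pad
    return {"input_ids": input_ids, "labels": labels, "loss_mask": loss_mask}
```

With a 3-token prompt and a 2-token answer, the first two label positions (prompt continuations) are masked out and the answer positions are kept.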

Parameters:
  • tokenizer – A Hugging Face tokenizer with attributes eos_token_id, optional bos_id, optional eos_id, and optionally chat_template/apply_chat_template.

  • seq_length (int, optional) – If set, pad/truncate each example to this length.

  • limit_dataset_samples (int, optional) – If set, limit the number of examples loaded from the split.

  • start_of_turn_token (str or None) – If using a chat template, the token that marks the start of each turn. Used to compute the response offset for loss_mask.

  • fp8 (bool) – Flag for future use (e.g., mixed precision). Currently unused.

  • split (str) – Which split of the dataset to load (e.g. 'train', 'validation').

  • dataset_name (str) – Identifier for the Hugging Face dataset (default "rajpurkar/squad").

Returns:

A Hugging Face Dataset where each example is a dict with keys:

  • input_ids: List of token IDs for the prompt + answer.

  • labels: List of token IDs shifted for language modeling.

  • loss_mask: List of 0/1 flags indicating which tokens contribute to the loss (answers only).

Return type:

datasets.Dataset
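The returned fields plug directly into a masked language-modeling loss: per-token losses are averaged only over positions where loss_mask is 1. A minimal sketch with toy per-token loss values (no model involved; `masked_loss` is an illustrative helper, not part of this module):

```python
def masked_loss(per_token_loss, loss_mask):
    """Average per-token losses over answer positions only (loss_mask == 1)."""
    masked = [l * m for l, m in zip(per_token_loss, loss_mask)]
    denom = sum(loss_mask) or 1  # avoid division by zero on fully masked rows
    return sum(masked) / denom
```

Prompt and padding positions (mask 0) contribute nothing, so the model is trained to generate the answer rather than to reproduce the question and context.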