nemo_automodel.components.datasets.llm.squad
nemo_automodel.components.datasets.llm.squad
Module Contents
Functions
API
Load and preprocess a SQuAD-style QA dataset for model fine-tuning.
This function retrieves the specified split of the SQuAD dataset, applies
either a simple prompt–completion format or a chat‐template format
(if tokenizer.chat_template is set), tokenizes each example,
constructs input_ids and labels, and optionally pads
all sequences to a fixed length.
Parameters:
A Hugging Face tokenizer with attributes
eos_token_id, optional bos_id, optional eos_id, and
optionally chat_template/apply_chat_template.
If set, pad/truncate each example to this length.
If set, limit the number of examples loaded from the split.
Flag for future use (e.g., mixed precision). Currently unused.
Which split of the dataset to load (e.g. ‘train’, ‘validation’).
Identifier for the Hugging Face dataset (default “rajpurkar/squad”).
Optional padding strategy.
Optional truncation strategy.
Returns:
A Hugginggth Face Dataset where each example is a dict with keys: