nemo_automodel.datasets.llm.squad
Module Contents
Functions
make_squad_dataset | Load and preprocess a SQuAD-style QA dataset for model fine-tuning.
API
nemo_automodel.datasets.llm.squad.make_squad_dataset(
    tokenizer,
    seq_length=None,
    limit_dataset_samples=None,
    start_of_turn_token=None,
    fp8=False,
    split='train',
    dataset_name='rajpurkar/squad',
)
Load and preprocess a SQuAD-style QA dataset for model fine-tuning.
This function retrieves the specified split of the SQuAD dataset, applies either a simple prompt-completion format or a chat-template format (if tokenizer.chat_template is set), tokenizes each example, constructs input_ids, labels, and loss_mask, and optionally pads all sequences to a fixed length.
- Parameters:
tokenizer – A Hugging Face tokenizer with attributes eos_token_id, optional bos_id, optional eos_id, and optionally chat_template/apply_chat_template.
seq_length (int, optional) – If set, pad/truncate each example to this length.
limit_dataset_samples (int, optional) – If set, limit the number of examples loaded from the split.
start_of_turn_token (str or None) – If using a chat template, the token that marks the start of each turn. Used to compute the response offset for loss_mask.
fp8 (bool) – Flag for future use (e.g., mixed precision). Currently unused.
split (str) – Which split of the dataset to load (e.g. "train", "validation").
dataset_name (str) – Identifier for the Hugging Face dataset (default "rajpurkar/squad").
- Returns:
A Hugging Face Dataset where each example is a dict with keys:
input_ids: List of token IDs for the prompt + answer.
labels: List of token IDs shifted for language modeling.
loss_mask: List of 0/1 flags indicating which tokens contribute to the loss (answers only).
- Return type:
A Hugging Face Dataset
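The relationship between input_ids, labels, and loss_mask can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation: the helper names build_example and _fit, the appended EOS token, the pad_id of 0, right-side padding, and the choice to count the EOS target toward the loss are all assumptions made for the example.

```python
def _fit(values, length, fill):
    """Hypothetical helper: truncate or right-pad a list to exactly `length` items."""
    return values[:length] + [fill] * max(length - len(values), 0)


def build_example(prompt_ids, answer_ids, eos_id, seq_length=None, pad_id=0):
    """Hypothetical helper: build one training example from token ID lists.

    Assumes a next-token objective, so labels are the inputs shifted left by one.
    """
    full = list(prompt_ids) + list(answer_ids) + [eos_id]

    # Position i of input_ids is trained to predict labels[i] == full[i + 1].
    input_ids = full[:-1]
    labels = full[1:]

    # Targets that are still prompt tokens get 0; answer (and EOS) targets get 1.
    n_prompt_targets = max(len(prompt_ids) - 1, 0)
    loss_mask = [0] * n_prompt_targets + [1] * (len(labels) - n_prompt_targets)

    if seq_length is not None:
        # Padding positions never contribute to the loss (assumption).
        input_ids = _fit(input_ids, seq_length, pad_id)
        labels = _fit(labels, seq_length, pad_id)
        loss_mask = _fit(loss_mask, seq_length, 0)

    return {"input_ids": input_ids, "labels": labels, "loss_mask": loss_mask}


example = build_example(prompt_ids=[10, 11, 12], answer_ids=[20, 21], eos_id=2, seq_length=8)
print(example["loss_mask"])  # [0, 0, 1, 1, 1, 0, 0, 0]
```

In this sketch the three lists always have equal length, so a training loop could multiply the per-token loss elementwise by loss_mask to restrict the objective to answer tokens.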