nemo_automodel.components.datasets.llm.squad
Module Contents
Functions
make_squad_dataset | Load and preprocess a SQuAD-style QA dataset for model fine-tuning.
API
- nemo_automodel.components.datasets.llm.squad._formatting_prompts_func(
- example,
- tokenizer,
- eos_token_id,
- pad_token_id,
- seq_length=None,
- padding=None,
- truncation=None,
- )
- nemo_automodel.components.datasets.llm.squad._formatting_prompts_func_with_chat_template(
- example,
- tokenizer,
- eos_token_id,
- pad_token_id,
- seq_length=None,
- padding=None,
- truncation=None,
- )
- nemo_automodel.components.datasets.llm.squad.make_squad_dataset(
- tokenizer,
- seq_length=None,
- limit_dataset_samples=None,
- fp8=False,
- split='train',
- dataset_name='squad',
- padding=False,
- truncation=False,
- )
Load and preprocess a SQuAD-style QA dataset for model fine-tuning.
This function retrieves the specified split of the SQuAD dataset, applies either a simple prompt-completion format or a chat-template format (if tokenizer.chat_template is set), tokenizes each example, constructs input_ids and labels, and optionally pads all sequences to a fixed length.
- Parameters:
tokenizer – A Hugging Face tokenizer with attributes eos_token_id, optional bos_id, optional eos_id, and optionally chat_template/apply_chat_template.
seq_length (int, optional) – If set, pad/truncate each example to this length.
limit_dataset_samples (int, optional) – If set, limit the number of examples loaded from the split.
fp8 (bool) – Flag for future use (e.g., mixed precision). Currently unused.
split (str) – Which split of the dataset to load (e.g. "train", "validation").
dataset_name (str) – Identifier for the Hugging Face dataset (default "rajpurkar/squad").
padding (Optional[str|bool]) – Optional padding strategy.
truncation (Optional[str|bool]) – Optional truncation strategy.
- Returns:
A Hugging Face Dataset where each example is a dict with keys:
input_ids: List of token IDs for the prompt + answer.
labels: List of token IDs shifted for language modeling.
Only answer tokens contribute to the loss.
- Return type:
Hugging Face Dataset
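The preprocessing described for make_squad_dataset (format a prompt and an answer, tokenize, build shifted labels, optionally pad or truncate to seq_length) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the whitespace `ToyTokenizer`, the helpers `format_example` and `pad_or_truncate`, the prompt wording, the `answer_mask` key name, and the use of `-100` as the label padding value are all hypothetical choices made for the example.

```python
from typing import Dict, List, Optional

# Assumption: padded label positions use -100, the default ignore_index
# of PyTorch's CrossEntropyLoss. The real module may do this differently.
IGNORE_INDEX = -100


class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for a Hugging Face tokenizer."""

    def __init__(self) -> None:
        # Reserve id 0 for padding and id 1 for end-of-sequence.
        self.vocab: Dict[str, int] = {"<pad>": 0, "<eos>": 1}
        self.pad_token_id = 0
        self.eos_token_id = 1

    def encode(self, text: str) -> List[int]:
        # Give every unseen whitespace-separated token the next free id.
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in text.split()]


def pad_or_truncate(ids: List[int], length: int, pad_value: int) -> List[int]:
    """Cut to `length`, then right-pad with `pad_value` if still too short."""
    return ids[:length] + [pad_value] * max(0, length - len(ids))


def format_example(example: dict, tokenizer: ToyTokenizer,
                   seq_length: Optional[int] = None) -> Dict[str, List[int]]:
    """Turn one SQuAD-style record into input_ids, labels, and an answer mask."""
    # Assumed prompt wording; the real template is not shown in this page.
    prompt = f"Context: {example['context']} Question: {example['question']} Answer:"
    answer = example["answers"]["text"][0]  # first reference answer

    prompt_ids = tokenizer.encode(prompt)
    answer_ids = tokenizer.encode(answer) + [tokenizer.eos_token_id]
    tokens = prompt_ids + answer_ids

    # Next-token shift: labels[i] is the token that follows input_ids[i].
    input_ids = tokens[:-1]
    labels = tokens[1:]
    # 1 where the label is an answer token (including EOS), 0 for prompt tokens.
    answer_mask = [0] * (len(prompt_ids) - 1) + [1] * len(answer_ids)

    if seq_length is not None:
        input_ids = pad_or_truncate(input_ids, seq_length, tokenizer.pad_token_id)
        labels = pad_or_truncate(labels, seq_length, IGNORE_INDEX)
        answer_mask = pad_or_truncate(answer_mask, seq_length, 0)

    return {"input_ids": input_ids, "labels": labels, "answer_mask": answer_mask}


example = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}
row = format_example(example, ToyTokenizer(), seq_length=20)
print(row["answer_mask"])
```

With the real module, the corresponding call would presumably take the form `make_squad_dataset(tokenizer, seq_length=512, split="train")`, using only parameters listed in the signature above; the value 512 is an arbitrary illustration.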