nemo_automodel.components.datasets.llm.squad

Module Contents

Functions

_formatting_prompts_func

_formatting_prompts_func_with_chat_template

make_squad_dataset

Load and preprocess a SQuAD-style QA dataset for model fine-tuning.

API

nemo_automodel.components.datasets.llm.squad._formatting_prompts_func(
    example,
    tokenizer,
    eos_token_id,
    pad_token_id,
    seq_length=None,
    padding=None,
    truncation=None,
)

nemo_automodel.components.datasets.llm.squad._formatting_prompts_func_with_chat_template(
    example,
    tokenizer,
    eos_token_id,
    pad_token_id,
    seq_length=None,
    padding=None,
    truncation=None,
)

nemo_automodel.components.datasets.llm.squad.make_squad_dataset(
    tokenizer,
    seq_length=None,
    limit_dataset_samples=None,
    fp8=False,
    split='train',
    dataset_name='squad',
    padding=False,
    truncation=False,
)

Load and preprocess a SQuAD-style QA dataset for model fine-tuning.

This function retrieves the specified split of the SQuAD dataset, applies either a simple prompt–completion format or a chat‐template format (if tokenizer.chat_template is set), tokenizes each example, constructs input_ids and labels, and optionally pads all sequences to a fixed length.
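As a rough illustration of the simple prompt-completion path described above, the sketch below turns one SQuAD-style record into input_ids. The prompt wording, the `ToyTokenizer` class, the helper name `format_prompt_completion`, and the `prompt_len` field are all assumptions made for illustration; they are not taken from this module's implementation.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for a Hugging Face tokenizer."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {"<pad>": 0, "<eos>": 1}
        self.pad_token_id = 0
        self.eos_token_id = 1

    def encode(self, text: str) -> List[int]:
        # Assign the next free ID to each unseen whitespace-separated token.
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in text.split()]


def format_prompt_completion(example: dict, tokenizer: ToyTokenizer) -> dict:
    # Assumed prompt layout; the template used by the library may differ.
    prompt = f"Context: {example['context']} Question: {example['question']} Answer:"
    # SQuAD stores answers as {"text": [...], "answer_start": [...]}.
    answer = example["answers"]["text"][0].strip()

    prompt_ids = tokenizer.encode(prompt)
    answer_ids = tokenizer.encode(answer) + [tokenizer.eos_token_id]

    return {
        "input_ids": prompt_ids + answer_ids,
        "prompt_len": len(prompt_ids),  # hypothetical helper field
    }


example = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}
out = format_prompt_completion(example, ToyTokenizer())
```

Keeping the prompt length alongside input_ids is one simple way to know later which positions belong to the answer.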

Parameters:
  • tokenizer – A Hugging Face tokenizer with attributes eos_token_id, optional bos_id, optional eos_id, and optionally chat_template/apply_chat_template.

  • seq_length (int, optional) – If set, pad/truncate each example to this length.

  • limit_dataset_samples (int, optional) – If set, limit the number of examples loaded from the split.

  • fp8 (bool) – Flag for future use (e.g., mixed precision). Currently unused.

  • split (str) – Which split of the dataset to load (e.g. ‘train’, ‘validation’).

  • dataset_name (str) – Identifier for the Hugging Face dataset (default ‘squad’, per the signature above; the same dataset is also published on the Hub as “rajpurkar/squad”).

  • padding (Optional[str|bool]) – Optional padding strategy.

  • truncation (Optional[str|bool]) – Optional truncation strategy.
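A call using the parameters listed above might look like the following sketch. The wrapper name `load_small_squad`, the choice of the "gpt2" tokenizer, and the sample count are assumptions for illustration; running the function requires network access plus the transformers and nemo_automodel packages.

```python
def load_small_squad(model_name: str = "gpt2", num_samples: int = 100):
    """Load a small tokenized SQuAD subset (hypothetical convenience wrapper)."""
    # Imports are deferred so that defining this function does not require
    # the packages to be installed; they are only needed when it is called.
    from transformers import AutoTokenizer
    from nemo_automodel.components.datasets.llm.squad import make_squad_dataset

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return make_squad_dataset(
        tokenizer,
        limit_dataset_samples=num_samples,  # keep only the first examples of the split
        split="train",
        dataset_name="rajpurkar/squad",
    )
```

A tokenizer without a chat_template should take the simple prompt-completion path, while one that defines a chat_template should take the chat-template path, per the description above.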

Returns:

A Hugging Face Dataset where each example is a dict with keys:

  • input_ids: List of token IDs for the prompt + answer.

  • labels: List of token IDs shifted for language modeling; only the answer tokens contribute to the loss.

Return type:

Hugging Face Dataset
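The shifted, answers-only labels described in the Returns section can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `build_shifted_labels`, the `prompt_len` argument, and the use of -100 as the ignore value (the default `ignore_index` of PyTorch's cross-entropy loss) are not confirmed by this page.

```python
from typing import List

IGNORE_INDEX = -100  # assumption: matches PyTorch's default cross-entropy ignore_index


def build_shifted_labels(input_ids: List[int], prompt_len: int) -> List[int]:
    """Return next-token labels in which only answer tokens are supervised."""
    # Position i predicts token i + 1, so labels are input_ids shifted left by one;
    # the final position has no target and is ignored.
    labels = input_ids[1:] + [IGNORE_INDEX]
    # The target at position i is input_ids[i + 1]; it belongs to the prompt
    # whenever i + 1 < prompt_len, so those positions are masked out.
    for i in range(min(prompt_len - 1, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels


# Prompt tokens: 10, 11, 12; answer tokens: 20, 21; EOS: 1.
labels = build_shifted_labels([10, 11, 12, 20, 21, 1], prompt_len=3)
```

With this layout, the last prompt position is the first one that contributes to the loss, since its target is the first answer token.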