nemo_automodel.components.datasets.llm.squad

Module Contents

Functions

Name	Description
`_formatting_prompts_func`	-
`_formatting_prompts_func_with_chat_template`	-
`make_squad_dataset`	Load and preprocess a SQuAD-style QA dataset for model fine-tuning.

API

nemo_automodel.components.datasets.llm.squad._formatting_prompts_func(
    example,
    tokenizer,
    eos_token_id,
    pad_token_id,
    seq_length = None,
    padding = None,
    truncation = None
)

nemo_automodel.components.datasets.llm.squad._formatting_prompts_func_with_chat_template(
    example,
    tokenizer,
    eos_token_id,
    pad_token_id,
    seq_length = None,
    padding = None,
    truncation = None
)

nemo_automodel.components.datasets.llm.squad.make_squad_dataset(
    tokenizer,
    seq_length = None,
    limit_dataset_samples = None,
    fp8 = False,
    split = 'train',
    dataset_name = 'squad',
    padding = False,
    truncation = False
)

Load and preprocess a SQuAD-style QA dataset for model fine-tuning.

This function retrieves the specified split of the SQuAD dataset and formats each example with either a simple prompt-completion layout or, if `tokenizer.chat_template` is set, the tokenizer's chat template. It then tokenizes each example, constructs `input_ids` and `labels`, and optionally pads all sequences to a fixed length.
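As a rough illustration of the prompt-completion path, the sketch below tokenizes a context/question prompt and its answer, appends an EOS token, and derives `input_ids` and `labels` offset by one position. The `ToyTokenizer` class, the `format_example` helper, the prompt wording, and the one-token shift are all assumptions made for illustration; none of them is taken from the library's source.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer (illustration only)."""

    eos_token_id = 0

    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        # Map each whitespace-separated word to a stable integer ID (1, 2, 3, ...).
        return [self.vocab.setdefault(word, len(self.vocab) + 1) for word in text.split()]


def format_example(example, tokenizer):
    # Assumed prompt layout; the real formatting function may word this differently.
    prompt = f"Context: {example['context']} Question: {example['question']} Answer:"
    answer = example["answers"]["text"][0]
    tokens = tokenizer.encode(prompt) + tokenizer.encode(answer) + [tokenizer.eos_token_id]
    # Assumed next-token setup: position i of input_ids is trained to predict labels[i].
    return {"input_ids": tokens[:-1], "labels": tokens[1:]}


# A record shaped like the public SQuAD schema (context, question, answers.text).
example = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"]},
}
row = format_example(example, ToyTokenizer())
```

Here `row["input_ids"]` and `row["labels"]` have the same length, and the final label is the EOS token ID.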

Parameters:

tokenizer

A Hugging Face tokenizer that provides `eos_token_id`, optionally `bos_id` and `eos_id`, and optionally `chat_template`/`apply_chat_template`.

seq_length

int. Defaults to None.

If set, pad/truncate each example to this length.
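A minimal sketch of what padding or truncating to a fixed length can look like. The `fit_to_length` helper is hypothetical, and right-side padding with `pad_token_id` is an assumption; the library may pad on the other side or delegate this step to the tokenizer.

```python
def fit_to_length(token_ids, seq_length, pad_token_id):
    # Drop tokens beyond seq_length, then right-pad anything shorter (assumed side).
    clipped = token_ids[:seq_length]
    return clipped + [pad_token_id] * (seq_length - len(clipped))


short = fit_to_length([5, 6, 7], seq_length=5, pad_token_id=0)           # [5, 6, 7, 0, 0]
long = fit_to_length([5, 6, 7, 8, 9, 10], seq_length=4, pad_token_id=0)  # [5, 6, 7, 8]
```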

limit_dataset_samples

int. Defaults to None.

If set, limit the number of examples loaded from the split.
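One common way to cap the number of loaded examples is the split-slicing syntax of the Hugging Face `datasets` library (for example `train[:100]`). The `limited_split` helper below is hypothetical, and whether `make_squad_dataset` uses this mechanism internally is an assumption.

```python
def limited_split(split, limit_dataset_samples=None):
    # Build a split string such as "train[:100]"; return the split unchanged if no limit is set.
    if limit_dataset_samples is None:
        return split
    return f"{split}[:{limit_dataset_samples}]"


first_hundred = limited_split("train", 100)  # "train[:100]"
full_validation = limited_split("validation")  # "validation"
```

The resulting string can be passed as the `split` argument of `datasets.load_dataset`.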

fp8

bool. Defaults to False.

Flag for future use (e.g., mixed precision). Currently unused.

split

str. Defaults to 'train'.

Which split of the dataset to load (e.g. 'train', 'validation').

dataset_name

str. Defaults to 'squad'.

Identifier for the Hugging Face dataset (for example, 'rajpurkar/squad').

padding

Optional[str | bool]. Defaults to False.

Optional padding strategy.

truncation

Optional[str | bool]. Defaults to False.

Optional truncation strategy.
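A hedged usage sketch built only from the signature documented above. The `build_squad_train_subset` wrapper is hypothetical, and the values `padding="max_length"` and `truncation=True` assume that these arguments follow the usual Hugging Face tokenizer conventions, which this page does not confirm.

```python
def build_squad_train_subset(tokenizer, num_samples=1000, seq_length=512):
    # Imported lazily so this snippet can be loaded without NeMo Automodel installed.
    from nemo_automodel.components.datasets.llm.squad import make_squad_dataset

    return make_squad_dataset(
        tokenizer,
        seq_length=seq_length,
        limit_dataset_samples=num_samples,
        split="train",
        padding="max_length",  # assumed to follow tokenizer padding conventions
        truncation=True,       # assumed to follow tokenizer truncation conventions
    )
```

The caller supplies an already-loaded tokenizer, so the sketch makes no assumption about which model or pad token is in use.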

Returns:

A Hugging Face Dataset where each example is a dict with keys: