# nemo_automodel.components.datasets.llm.squad

## Module Contents

### Functions

| Name                                                                                                                                       | Description                                                         |
| ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------- |
| [`_formatting_prompts_func`](#nemo_automodel-components-datasets-llm-squad-_formatting_prompts_func)                                       | -                                                                   |
| [`_formatting_prompts_func_with_chat_template`](#nemo_automodel-components-datasets-llm-squad-_formatting_prompts_func_with_chat_template) | -                                                                   |
| [`make_squad_dataset`](#nemo_automodel-components-datasets-llm-squad-make_squad_dataset)                                                   | Load and preprocess a SQuAD-style QA dataset for model fine-tuning. |

### API

```python
nemo_automodel.components.datasets.llm.squad._formatting_prompts_func(
    example,
    tokenizer,
    eos_token_id,
    pad_token_id,
    seq_length = None,
    padding = None,
    truncation = None
)
```

```python
nemo_automodel.components.datasets.llm.squad._formatting_prompts_func_with_chat_template(
    example,
    tokenizer,
    eos_token_id,
    pad_token_id,
    seq_length = None,
    padding = None,
    truncation = None
)
```

```python
nemo_automodel.components.datasets.llm.squad.make_squad_dataset(
    tokenizer,
    seq_length = None,
    limit_dataset_samples = None,
    fp8 = False,
    split = 'train',
    dataset_name = 'squad',
    padding = False,
    truncation = False
)
```

Load and preprocess a SQuAD-style QA dataset for model fine-tuning.

This function retrieves the specified split of the SQuAD dataset, applies
either a simple prompt–completion format or a chat‐template format
(if `tokenizer.chat_template` is set), tokenizes each example,
constructs `input_ids` and `labels`, and optionally pads
all sequences to a fixed length.
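
The steps described above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: `ToyTokenizer`, `format_example`, the prompt wording, and the use of `-100` to mask prompt and padding positions in `labels` are all assumptions made for illustration. The record layout (`context`, `question`, `answers["text"]`) follows the Hugging Face SQuAD dataset.

```python
IGNORE_INDEX = -100  # assumption: label value that the loss function ignores


class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for a Hugging Face tokenizer."""

    def __init__(self):
        self.vocab = {"<pad>": 0, "<eos>": 1}
        self.pad_token_id = 0
        self.eos_token_id = 1
        self.chat_template = None  # a non-empty value selects the chat branch below

    def encode(self, text):
        # Give every previously unseen whitespace-separated word a fresh id.
        return [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]


def format_example(example, tokenizer, seq_length=None):
    """Turn one SQuAD-style record into a dict with input_ids and labels."""
    if getattr(tokenizer, "chat_template", None):
        # Chat branch (sketch only): a real tokenizer would render the
        # conversation with apply_chat_template instead of this string.
        prompt = f"user: {example['context']} {example['question']} assistant:"
    else:
        prompt = (
            f"Context: {example['context']} "
            f"Question: {example['question']} Answer:"
        )
    answer = example["answers"]["text"][0]

    prompt_ids = tokenizer.encode(prompt)
    answer_ids = tokenizer.encode(answer) + [tokenizer.eos_token_id]

    input_ids = prompt_ids + answer_ids
    # In this sketch only the answer tokens contribute to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids

    if seq_length is not None:
        # Truncate first, then right-pad to exactly seq_length.
        input_ids = input_ids[:seq_length]
        labels = labels[:seq_length]
        pad = seq_length - len(input_ids)
        input_ids = input_ids + [tokenizer.pad_token_id] * pad
        labels = labels + [IGNORE_INDEX] * pad
    return {"input_ids": input_ids, "labels": labels}
```

The sketch keeps `input_ids` and `labels` aligned position by position; any shifting of labels for next-token prediction is left to the training code, and the actual functions in this module may handle it differently.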

**Parameters:**

- `tokenizer`: A Hugging Face tokenizer with attributes `eos_token_id`, optional `bos_id`, optional `eos_id`, and optionally `chat_template`/`apply_chat_template`.
- `seq_length`: If set, pad/truncate each example to this length.
- `limit_dataset_samples`: If set, limit the number of examples loaded from the split.
- `fp8`: Flag for future use (e.g., mixed precision). Currently unused.
- `split`: Which split of the dataset to load (e.g. 'train', 'validation').
- `dataset_name`: Identifier for the Hugging Face dataset, e.g. `"rajpurkar/squad"` (signature default: `'squad'`).
- `padding`: Optional padding strategy (signature default: `False`).
- `truncation`: Optional truncation strategy (signature default: `False`).
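
A call might look like the following sketch. The wrapper `load_squad_splits` and its argument names are hypothetical; only `make_squad_dataset` and the keyword arguments from the signature above come from this page. `AutoTokenizer.from_pretrained` is the standard `transformers` loader. Running it requires `nemo_automodel` and `transformers` to be installed, plus access to the dataset.

```python
def load_squad_splits(tokenizer_name, seq_length=512, limit=1000):
    """Build train and validation datasets (hypothetical helper)."""
    # Imported lazily so defining this helper does not require the
    # heavy dependencies to be installed.
    from transformers import AutoTokenizer
    from nemo_automodel.components.datasets.llm.squad import make_squad_dataset

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    # Only keyword arguments listed in the documented signature are used;
    # padding and truncation are left at their defaults.
    train = make_squad_dataset(
        tokenizer,
        seq_length=seq_length,
        limit_dataset_samples=limit,
        split="train",
    )
    validation = make_squad_dataset(
        tokenizer,
        seq_length=seq_length,
        limit_dataset_samples=limit,
        split="validation",
    )
    return train, validation
```

Capping `limit_dataset_samples` keeps a first run short while checking that the tokenizer and sequence length behave as expected.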

**Returns:**

A Hugging Face Dataset where each example is a dict with keys: