> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/automodel/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/automodel/_mcp/server.

# nemo_automodel.components.datasets.llm.chat_dataset

## Module Contents

### Classes

| Name                                                                              | Description                                              |
| --------------------------------------------------------------------------------- | -------------------------------------------------------- |
| [`ChatDataset`](#nemo_automodel-components-datasets-llm-chat_dataset-ChatDataset) | Dataset for OpenAI-format tool-calling chat transcripts. |

### Functions

| Name                                                                                                            | Description                                                                             |
| --------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------- |
| [`_as_iter`](#nemo_automodel-components-datasets-llm-chat_dataset-_as_iter)                                     | -                                                                                       |
| [`_conversations_to_messages`](#nemo_automodel-components-datasets-llm-chat_dataset-_conversations_to_messages) | Convert a ShareGPT `conversations` list to OpenAI `messages`.                           |
| [`_is_hf_repo_id`](#nemo_automodel-components-datasets-llm-chat_dataset-_is_hf_repo_id)                         | -                                                                                       |
| [`_load_openai_messages`](#nemo_automodel-components-datasets-llm-chat_dataset-_load_openai_messages)           | Load OpenAI chat messages datasets from HF or local JSON/JSONL files.                   |
| [`_normalize_messages`](#nemo_automodel-components-datasets-llm-chat_dataset-_normalize_messages)               | Ensure messages list is valid and content fields are strings for system/user/assistant. |
| [`_parse_split_slice`](#nemo_automodel-components-datasets-llm-chat_dataset-_parse_split_slice)                 | Parse a split string like `"train[1024:]"` into `(base_split, slice \| None)`.          |

### Data

[`_SHAREGPT_ROLE_MAP`](#nemo_automodel-components-datasets-llm-chat_dataset-_SHAREGPT_ROLE_MAP)

[`_SPLIT_SLICE_RE`](#nemo_automodel-components-datasets-llm-chat_dataset-_SPLIT_SLICE_RE)

### API

```python
class nemo_automodel.components.datasets.llm.chat_dataset.ChatDataset(
    path_or_dataset_id: typing.Union[str, typing.Sequence[str]],
    tokenizer,
    split: typing.Optional[str] = None,
    name: typing.Optional[str] = None,
    seq_length: typing.Optional[int] = None,
    padding: typing.Union[str, bool] = 'do_not_pad',
    truncation: typing.Union[str, bool] = 'do_not_truncate',
    start_of_turn_token: typing.Optional[str] = None,
    chat_template: typing.Optional[str] = None,
    shuffle_seed: typing.Optional[int] = None,
    mask_reasoning_content: bool = False,
    mask_history: bool = False,
    unshifted: bool = False,
    skip_invalid_samples: bool = False
)
```

**Bases:** `Dataset`

Dataset for OpenAI-format tool-calling chat transcripts.

Each row should contain a `messages` list in OpenAI chat format (`role` /
`content`), potentially including tool calls and tool responses. Rows that
instead carry a ShareGPT `conversations` list (`from` / `value`, as used by
PerfectBlend and similar) are auto-converted, so no manual column rename is
needed. The conversation is formatted via the tokenizer's chat template to
produce `input_ids`, `labels`, and `attention_mask` suitable for SFT.

```python
nemo_automodel.components.datasets.llm.chat_dataset.ChatDataset.__getitem__(
    idx: int
) -> typing.Dict[str, typing.List[int]]
```

```python
nemo_automodel.components.datasets.llm.chat_dataset.ChatDataset.__len__() -> int
```

```python
nemo_automodel.components.datasets.llm.chat_dataset.ChatDataset._keep_last_supervised_run(
    seq: typing.List[int],
    unsupervised_value: int
) -> None
```

staticmethod

In place, keep only the final contiguous supervised run; mask the rest.

Supervised positions are those `!= unsupervised_value` (0 for `loss_mask`,
-100 for `labels`). Used for `mask_history`: a multi-turn conversation has
one supervised run per assistant turn separated by unsupervised user turns;
this collapses it to the last turn so the supervised tokens form a single
suffix.

```python
nemo_automodel.components.datasets.llm.chat_dataset._as_iter(
    val: typing.Union[str, typing.Sequence[str]]
) -> typing.Iterator[str]
```

```python
nemo_automodel.components.datasets.llm.chat_dataset._conversations_to_messages(
    conversations: typing.Any
) -> typing.List[typing.Dict[str, typing.Any]]
```

Convert a ShareGPT `conversations` list to OpenAI `messages`.

ShareGPT-style rows store turns as `&#123;"from": &lt;role&gt;, "value": &lt;text&gt;&#125;` under
a `conversations` column instead of OpenAI `&#123;"role", "content"&#125;` under
`messages`. Map the common plain-chat roles so such datasets load without a
manual rename. Raises on an unsupported role rather than guessing.

```python
nemo_automodel.components.datasets.llm.chat_dataset._is_hf_repo_id(
    val: str
) -> bool
```

```python
nemo_automodel.components.datasets.llm.chat_dataset._load_openai_messages(
    path_or_dataset_id: typing.Union[str, typing.Sequence[str]],
    split: typing.Optional[str] = None,
    name: typing.Optional[str] = None,
    shuffle_seed: typing.Optional[int] = None,
    skip_invalid_samples: bool = False
)
```

Load OpenAI chat messages datasets from HF or local JSON/JSONL files.

For HF repo IDs, we delegate to datasets.load\_dataset.  When *split*
is provided, the full base split is loaded and shuffled *before* any
slice (e.g. `[1024:]`) is applied so that train/val splits sample
from a consistent random order.  When *split* is `None` it is passed
through to `load_dataset` as-is (no default override).

For local files, we manually parse JSONL/JSON to avoid pyarrow type
inference issues (e.g., heterogeneous field types under `tools`).

**Parameters:**

HF dataset ID or local file path(s).

Dataset split to load (e.g., "train", "train\[1024:]").

Dataset configuration/subset name

Random seed for shuffling HF datasets before slicing.
Set to `None` to disable shuffling.

If `True`, skip malformed JSONL lines for local
files instead of failing fast.

```python
nemo_automodel.components.datasets.llm.chat_dataset._normalize_messages(
    messages: typing.List[typing.Dict[str, typing.Any]]
) -> typing.List[typing.Dict[str, typing.Any]]
```

Ensure messages list is valid and content fields are strings for system/user/assistant.

* Keeps tool\_calling fields if present (e.g., tool calls in assistant messages, tool role messages).
* If content is a list of parts, only keep text parts.

```python
nemo_automodel.components.datasets.llm.chat_dataset._parse_split_slice(
    split: typing.Optional[str]
)
```

Parse a split string like `"train[1024:]"` into `(base_split, slice | None)`.

```python
nemo_automodel.components.datasets.llm.chat_dataset._SHAREGPT_ROLE_MAP = {'system': 'system', 'human': 'user', 'user': 'user', 'gpt': 'assistant', 'assis...
```

```python
nemo_automodel.components.datasets.llm.chat_dataset._SPLIT_SLICE_RE = re.compile('^(\\w+)\\[(\\d*):(\\d*)\\]$')
```