nemo_automodel.components.datasets.llm.chat_dataset

Module Contents

Classes

Name	Description
`ChatDataset`	Dataset for OpenAI-format tool-calling chat transcripts.

Functions

Name	Description
`_as_iter`	-
`_conversations_to_messages`	Convert a ShareGPT `conversations` list to OpenAI `messages`.
`_is_hf_repo_id`	-
`_load_openai_messages`	Load OpenAI chat messages datasets from HF or local JSON/JSONL files.
`_normalize_messages`	Ensure messages list is valid and content fields are strings for system/user/assistant.
`_parse_split_slice`	Parse a split string like `"train[1024:]"` into `(base_split, slice \| None)`.

Data

_SHAREGPT_ROLE_MAP

_SPLIT_SLICE_RE

API

class nemo_automodel.components.datasets.llm.chat_dataset.ChatDataset(
    path_or_dataset_id: typing.Union[str, typing.Sequence[str]],
    tokenizer,
    split: typing.Optional[str] = None,
    name: typing.Optional[str] = None,
    seq_length: typing.Optional[int] = None,
    padding: typing.Union[str, bool] = 'do_not_pad',
    truncation: typing.Union[str, bool] = 'do_not_truncate',
    start_of_turn_token: typing.Optional[str] = None,
    chat_template: typing.Optional[str] = None,
    shuffle_seed: typing.Optional[int] = None,
    mask_reasoning_content: bool = False,
    mask_history: bool = False,
    unshifted: bool = False,
    skip_invalid_samples: bool = False
)

Bases: Dataset

Dataset for OpenAI-format tool-calling chat transcripts.

Each row should contain a messages list in OpenAI chat format (role / content), potentially including tool calls and tool responses. Rows that instead carry a ShareGPT conversations list (from / value, as used by PerfectBlend and similar) are auto-converted, so no manual column rename is needed. The conversation is formatted via the tokenizer’s chat template to produce input_ids, labels, and attention_mask suitable for SFT.

dataset

pad_token_id

= _add_pad_token(self.tokenizer) or eos_token_id

nemo_automodel.components.datasets.llm.chat_dataset.ChatDataset.__getitem__(
    idx: int
) -> typing.Dict[str, typing.List[int]]

nemo_automodel.components.datasets.llm.chat_dataset.ChatDataset.__len__() -> int

nemo_automodel.components.datasets.llm.chat_dataset.ChatDataset._keep_last_supervised_run(
    seq: typing.List[int],
    unsupervised_value: int
) -> None

staticmethod

In place, keep only the final contiguous supervised run; mask the rest.

Supervised positions are those != unsupervised_value (0 for loss_mask, -100 for labels). Used for mask_history: a multi-turn conversation has one supervised run per assistant turn separated by unsupervised user turns; this collapses it to the last turn so the supervised tokens form a single suffix.

nemo_automodel.components.datasets.llm.chat_dataset._as_iter(
    val: typing.Union[str, typing.Sequence[str]]
) -> typing.Iterator[str]

nemo_automodel.components.datasets.llm.chat_dataset._conversations_to_messages(
    conversations: typing.Any
) -> typing.List[typing.Dict[str, typing.Any]]

Convert a ShareGPT conversations list to OpenAI messages.

ShareGPT-style rows store turns as {"from": <role>, "value": <text>} under a conversations column instead of OpenAI {"role", "content"} under messages. Map the common plain-chat roles so such datasets load without a manual rename. Raises on an unsupported role rather than guessing.

nemo_automodel.components.datasets.llm.chat_dataset._is_hf_repo_id(
    val: str
) -> bool

nemo_automodel.components.datasets.llm.chat_dataset._load_openai_messages(
    path_or_dataset_id: typing.Union[str, typing.Sequence[str]],
    split: typing.Optional[str] = None,
    name: typing.Optional[str] = None,
    shuffle_seed: typing.Optional[int] = None,
    skip_invalid_samples: bool = False
)

Load OpenAI chat messages datasets from HF or local JSON/JSONL files.

For HF repo IDs, we delegate to datasets.load_dataset. When split is provided, the full base split is loaded and shuffled before any slice (e.g. [1024:]) is applied so that train/val splits sample from a consistent random order. When split is None it is passed through to load_dataset as-is (no default override).

For local files, we manually parse JSONL/JSON to avoid pyarrow type inference issues (e.g., heterogeneous field types under tools).

Parameters:

path_or_dataset_id

Union[str, Sequence[str]]

HF dataset ID or local file path(s).

split

Optional[str]Defaults to None

Dataset split to load (e.g., “train”, “train[1024:]”).

name

Optional[str]Defaults to None

Dataset configuration/subset name

shuffle_seed

Optional[int]Defaults to None

Random seed for shuffling HF datasets before slicing. Set to None to disable shuffling.

skip_invalid_samples

boolDefaults to False

If True, skip malformed JSONL lines for local files instead of failing fast.

nemo_automodel.components.datasets.llm.chat_dataset._normalize_messages(
    messages: typing.List[typing.Dict[str, typing.Any]]
) -> typing.List[typing.Dict[str, typing.Any]]

Ensure messages list is valid and content fields are strings for system/user/assistant.

Keeps tool_calling fields if present (e.g., tool calls in assistant messages, tool role messages).
If content is a list of parts, only keep text parts.

nemo_automodel.components.datasets.llm.chat_dataset._parse_split_slice(
    split: typing.Optional[str]
)

Parse a split string like "train[1024:]" into (base_split, slice | None).

nemo_automodel.components.datasets.llm.chat_dataset._SHAREGPT_ROLE_MAP = {'system': 'system', 'human': 'user', 'user': 'user', 'gpt': 'assistant', 'assis...

nemo_automodel.components.datasets.llm.chat_dataset._SPLIT_SLICE_RE = re.compile('^(\\w+)\\[(\\d*):(\\d*)\\]$')