nemo_automodel.components.datasets.llm.chat_dataset

View as Markdown

Module Contents

Classes

NameDescription
ChatDatasetDataset for OpenAI-format tool-calling chat transcripts.

Functions

NameDescription
_as_iter-
_conversations_to_messagesConvert a ShareGPT conversations list to OpenAI messages.
_is_hf_repo_id-
_load_openai_messagesLoad OpenAI chat messages datasets from HF or local JSON/JSONL files.
_normalize_messagesEnsure messages list is valid and content fields are strings for system/user/assistant.
_parse_split_sliceParse a split string like "train[1024:]" into (base_split, slice | None).

Data

_SHAREGPT_ROLE_MAP

_SPLIT_SLICE_RE

API

class nemo_automodel.components.datasets.llm.chat_dataset.ChatDataset(
path_or_dataset_id: typing.Union[str, typing.Sequence[str]],
tokenizer,
split: typing.Optional[str] = None,
name: typing.Optional[str] = None,
seq_length: typing.Optional[int] = None,
padding: typing.Union[str, bool] = 'do_not_pad',
truncation: typing.Union[str, bool] = 'do_not_truncate',
start_of_turn_token: typing.Optional[str] = None,
chat_template: typing.Optional[str] = None,
shuffle_seed: typing.Optional[int] = None,
mask_reasoning_content: bool = False,
mask_history: bool = False,
unshifted: bool = False,
skip_invalid_samples: bool = False
)

Bases: Dataset

Dataset for OpenAI-format tool-calling chat transcripts.

Each row should contain a messages list in OpenAI chat format (role / content), potentially including tool calls and tool responses. Rows that instead carry a ShareGPT conversations list (from / value, as used by PerfectBlend and similar) are auto-converted, so no manual column rename is needed. The conversation is formatted via the tokenizer’s chat template to produce input_ids, labels, and attention_mask suitable for SFT.

dataset
pad_token_id
= _add_pad_token(self.tokenizer) or eos_token_id
nemo_automodel.components.datasets.llm.chat_dataset.ChatDataset.__getitem__(
idx: int
) -> typing.Dict[str, typing.List[int]]
nemo_automodel.components.datasets.llm.chat_dataset.ChatDataset.__len__() -> int
nemo_automodel.components.datasets.llm.chat_dataset.ChatDataset._keep_last_supervised_run(
seq: typing.List[int],
unsupervised_value: int
) -> None
staticmethod

In place, keep only the final contiguous supervised run; mask the rest.

Supervised positions are those != unsupervised_value (0 for loss_mask, -100 for labels). Used for mask_history: a multi-turn conversation has one supervised run per assistant turn separated by unsupervised user turns; this collapses it to the last turn so the supervised tokens form a single suffix.

nemo_automodel.components.datasets.llm.chat_dataset._as_iter(
val: typing.Union[str, typing.Sequence[str]]
) -> typing.Iterator[str]
nemo_automodel.components.datasets.llm.chat_dataset._conversations_to_messages(
conversations: typing.Any
) -> typing.List[typing.Dict[str, typing.Any]]

Convert a ShareGPT conversations list to OpenAI messages.

ShareGPT-style rows store turns as {"from": <role>, "value": <text>} under a conversations column instead of OpenAI {"role", "content"} under messages. Map the common plain-chat roles so such datasets load without a manual rename. Raises on an unsupported role rather than guessing.

nemo_automodel.components.datasets.llm.chat_dataset._is_hf_repo_id(
val: str
) -> bool
nemo_automodel.components.datasets.llm.chat_dataset._load_openai_messages(
path_or_dataset_id: typing.Union[str, typing.Sequence[str]],
split: typing.Optional[str] = None,
name: typing.Optional[str] = None,
shuffle_seed: typing.Optional[int] = None,
skip_invalid_samples: bool = False
)

Load OpenAI chat messages datasets from HF or local JSON/JSONL files.

For HF repo IDs, we delegate to datasets.load_dataset. When split is provided, the full base split is loaded and shuffled before any slice (e.g. [1024:]) is applied so that train/val splits sample from a consistent random order. When split is None it is passed through to load_dataset as-is (no default override).

For local files, we manually parse JSONL/JSON to avoid pyarrow type inference issues (e.g., heterogeneous field types under tools).

Parameters:

path_or_dataset_id
Union[str, Sequence[str]]

HF dataset ID or local file path(s).

split
Optional[str]Defaults to None

Dataset split to load (e.g., “train”, “train[1024:]”).

name
Optional[str]Defaults to None

Dataset configuration/subset name

shuffle_seed
Optional[int]Defaults to None

Random seed for shuffling HF datasets before slicing. Set to None to disable shuffling.

skip_invalid_samples
boolDefaults to False

If True, skip malformed JSONL lines for local files instead of failing fast.

nemo_automodel.components.datasets.llm.chat_dataset._normalize_messages(
messages: typing.List[typing.Dict[str, typing.Any]]
) -> typing.List[typing.Dict[str, typing.Any]]

Ensure messages list is valid and content fields are strings for system/user/assistant.

  • Keeps tool_calling fields if present (e.g., tool calls in assistant messages, tool role messages).
  • If content is a list of parts, only keep text parts.
nemo_automodel.components.datasets.llm.chat_dataset._parse_split_slice(
split: typing.Optional[str]
)

Parse a split string like "train[1024:]" into (base_split, slice | None).

nemo_automodel.components.datasets.llm.chat_dataset._SHAREGPT_ROLE_MAP = {'system': 'system', 'human': 'user', 'user': 'user', 'gpt': 'assistant', 'assis...
nemo_automodel.components.datasets.llm.chat_dataset._SPLIT_SLICE_RE = re.compile('^(\\w+)\\[(\\d*):(\\d*)\\]$')