nemo_automodel.components.datasets.llm.chat_dataset
nemo_automodel.components.datasets.llm.chat_dataset
Module Contents
Classes
Functions
Data
API
Bases: Dataset
Dataset for OpenAI-format tool-calling chat transcripts.
Each row should contain a messages list in OpenAI chat format (role /
content), potentially including tool calls and tool responses. Rows that
instead carry a ShareGPT conversations list (from / value, as used by
PerfectBlend and similar) are auto-converted, so no manual column rename is
needed. The conversation is formatted via the tokenizer’s chat template to
produce input_ids, labels, and attention_mask suitable for SFT.
In place, keep only the final contiguous supervised run; mask the rest.
Supervised positions are those != unsupervised_value (0 for loss_mask,
-100 for labels). Used for mask_history: a multi-turn conversation has
one supervised run per assistant turn separated by unsupervised user turns;
this collapses it to the last turn so the supervised tokens form a single
suffix.
Convert a ShareGPT conversations list to OpenAI messages.
ShareGPT-style rows store turns as {"from": <role>, "value": <text>} under
a conversations column instead of OpenAI {"role", "content"} under
messages. Map the common plain-chat roles so such datasets load without a
manual rename. Raises on an unsupported role rather than guessing.
Load OpenAI chat messages datasets from HF or local JSON/JSONL files.
For HF repo IDs, we delegate to datasets.load_dataset. When split
is provided, the full base split is loaded and shuffled before any
slice (e.g. [1024:]) is applied so that train/val splits sample
from a consistent random order. When split is None it is passed
through to load_dataset as-is (no default override).
For local files, we manually parse JSONL/JSON to avoid pyarrow type
inference issues (e.g., heterogeneous field types under tools).
Parameters:
HF dataset ID or local file path(s).
Dataset split to load (e.g., “train”, “train[1024:]”).
Dataset configuration/subset name
Random seed for shuffling HF datasets before slicing.
Set to None to disable shuffling.
If True, skip malformed JSONL lines for local
files instead of failing fast.
Ensure messages list is valid and content fields are strings for system/user/assistant.
- Keeps tool_calling fields if present (e.g., tool calls in assistant messages, tool role messages).
- If content is a list of parts, only keep text parts.
Parse a split string like "train[1024:]" into (base_split, slice | None).