nemo_automodel.components.datasets.llm.chat_dataset#

Module Contents#

Classes#

ChatDataset

Dataset for OpenAI-format tool-calling chat transcripts.

Functions#

_is_hf_repo_id

_as_iter

_parse_split_slice

Parse a split string like "train[1024:]" into (base_split, slice | None).

_load_openai_messages

Load OpenAI chat messages datasets from HF or local JSON/JSONL files.

_normalize_messages

Ensure messages list is valid and content fields are strings for system/user/assistant.

Data#

API#

nemo_automodel.components.datasets.llm.chat_dataset._is_hf_repo_id(val: str) → bool#
nemo_automodel.components.datasets.llm.chat_dataset._as_iter(
val: Union[str, Sequence[str]],
) → Iterator[str]#
nemo_automodel.components.datasets.llm.chat_dataset._SPLIT_SLICE_RE#

‘compile(…)’

nemo_automodel.components.datasets.llm.chat_dataset._parse_split_slice(split: Optional[str])#

Parse a split string like "train[1024:]" into (base_split, slice | None).
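To make the documented behavior concrete, here is an illustrative re-implementation of the split-slice parsing (a sketch; the real `_parse_split_slice` and `_SPLIT_SLICE_RE` may handle additional edge cases):

```python
import re
from typing import Optional, Tuple

# Illustrative pattern; the module's actual _SPLIT_SLICE_RE may differ.
SPLIT_SLICE_RE = re.compile(r"^(?P<base>\w+)\[(?P<start>-?\d*):(?P<stop>-?\d*)\]$")

def parse_split_slice(split: Optional[str]) -> Tuple[Optional[str], Optional[slice]]:
    """Parse 'train[1024:]' into ('train', slice(1024, None)); no slice -> (split, None)."""
    if split is None:
        return None, None
    m = SPLIT_SLICE_RE.match(split)
    if m is None:
        return split, None  # plain split name, no slice suffix
    start = int(m["start"]) if m["start"] else None
    stop = int(m["stop"]) if m["stop"] else None
    return m["base"], slice(start, stop)
```
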

nemo_automodel.components.datasets.llm.chat_dataset._load_openai_messages(
path_or_dataset_id: Union[str, Sequence[str]],
split: Optional[str] = None,
name: Optional[str] = None,
shuffle_seed: Optional[int] = None,
skip_invalid_samples: bool = False,
)#

Load OpenAI chat messages datasets from HF or local JSON/JSONL files.

For HF repo IDs, we delegate to datasets.load_dataset. When split is provided, the full base split is loaded and shuffled before any slice (e.g. [1024:]) is applied so that train/val splits sample from a consistent random order. When split is None it is passed through to load_dataset as-is (no default override).

For local files, we manually parse JSONL/JSON to avoid pyarrow type inference issues (e.g., heterogeneous field types under tools).

Parameters:
  • path_or_dataset_id – HF dataset ID or local file path(s).

  • split – Dataset split to load (e.g., “train”, “train[1024:]”).

  • name – Dataset configuration/subset name.

  • shuffle_seed – Random seed for shuffling HF datasets before slicing. Set to None to disable shuffling.

  • skip_invalid_samples – If True, skip malformed JSONL lines for local files instead of failing fast.
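The shuffle-before-slice behavior described above can be sketched as follows (a toy model, not the library's implementation): because the full base split is shuffled with the same seed before any slice is applied, complementary slices such as `train[:1024]` and `train[1024:]` partition one consistent random order and never overlap.

```python
import random
from typing import List, Optional

def take_split(rows: List[dict], seed: Optional[int], sl: Optional[slice]) -> List[dict]:
    """Shuffle the full base split first, then apply the slice (illustrative)."""
    order = list(rows)
    if seed is not None:
        random.Random(seed).shuffle(order)
    return order[sl] if sl is not None else order

rows = [{"i": i} for i in range(10)]
# Complementary slices of the same seeded shuffle partition the data.
train = take_split(rows, seed=42, sl=slice(2, None))
val = take_split(rows, seed=42, sl=slice(None, 2))
```
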

nemo_automodel.components.datasets.llm.chat_dataset._normalize_messages(
messages: List[Dict[str, Any]],
) → List[Dict[str, Any]]#

Ensure messages list is valid and content fields are strings for system/user/assistant.

  • Keeps tool-calling fields if present (e.g., tool calls in assistant messages, tool role messages).

  • If content is a list of parts, only keep text parts.
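The part-flattening rule can be illustrated with a small sketch (hypothetical helper name; the real `_normalize_messages` operates on the whole list and performs additional validation):

```python
from typing import Any, Dict

def flatten_text_parts(message: Dict[str, Any]) -> Dict[str, Any]:
    # When content is a list of parts, keep only the text parts and join
    # them into a single string; all other message fields pass through.
    content = message.get("content")
    if isinstance(content, list):
        texts = [p["text"] for p in content
                 if isinstance(p, dict) and p.get("type") == "text"]
        message = {**message, "content": "".join(texts)}
    return message
```
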

class nemo_automodel.components.datasets.llm.chat_dataset.ChatDataset(
path_or_dataset_id: Union[str, Sequence[str]],
tokenizer,
*,
split: Optional[str] = None,
name: Optional[str] = None,
seq_length: Optional[int] = None,
padding: Union[str, bool] = 'do_not_pad',
truncation: Union[str, bool] = 'do_not_truncate',
start_of_turn_token: Optional[str] = None,
chat_template: Optional[str] = None,
shuffle_seed: Optional[int] = None,
mask_reasoning_content: bool = False,
unshifted: bool = False,
skip_invalid_samples: bool = False,
)#

Bases: torch.utils.data.Dataset

Dataset for OpenAI-format tool-calling chat transcripts.

This class expects each row to contain a messages list in OpenAI chat format, potentially including tool calls and tool responses. The dataset formats the conversation via the tokenizer’s chat template to produce input_ids, labels, and attention_mask suitable for SFT.

Initialization

Load OpenAI-format chat rows and tokenize via the chat template.

Parameters:
  • path_or_dataset_id – Hugging Face dataset id, local JSON/JSONL path(s), Parquet file, or Parquet directory.

  • tokenizer – Tokenizer with chat template support (required).

  • split – Dataset split or slice (e.g. train, train[1024:]).

  • name – Optional Hub subset / config name.

  • seq_length – Maximum sequence length for padding and truncation in formatting.

  • padding – Padding mode for format_chat_template.

  • truncation – Truncation mode for format_chat_template.

  • start_of_turn_token – Optional token marking assistant turns for answer-only loss.

  • chat_template – Optional Jinja template string overriding tokenizer.chat_template.

  • shuffle_seed – If set, shuffles Hub/Parquet data before applying a split slice.

  • mask_reasoning_content – If True, exclude rendered reasoning traces from the loss mask.

  • unshifted – Passed through to format_chat_template.

  • skip_invalid_samples – If True, skip malformed JSONL lines when reading local files (warning logs include skip counts). If False, a bad line raises. Does not skip invalid structured rows after load; those still raise when a sample is accessed.

__len__() → int#
__getitem__(idx: int) → Dict[str, List[int]]#
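The answer-only loss behavior enabled by start_of_turn_token can be sketched with a toy masking function (not the real implementation, which locates assistant turns via the rendered chat template; `-100` is PyTorch's standard cross-entropy ignore index):

```python
IGNORE_INDEX = -100  # value ignored by torch.nn.CrossEntropyLoss by default

def mask_non_assistant(input_ids, assistant_start):
    """Copy input_ids as labels, masking every token before the assistant turn
    so the loss covers only the assistant's response."""
    return [IGNORE_INDEX] * assistant_start + input_ids[assistant_start:]

# Toy sequence: system/user tokens first, assistant reply starts at index 4.
ids = [101, 7, 8, 9, 42, 43]
labels = mask_non_assistant(ids, assistant_start=4)
```
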