nemo_automodel.components.datasets.llm.chat_dataset#

Module Contents#

Classes#

ChatDataset

Dataset for OpenAI-format tool-calling chat transcripts.

Functions#

_is_hf_repo_id

_as_iter

_parse_split_slice

Parse a split string like "train[1024:]" into (base_split, slice | None).

_load_openai_messages

Load OpenAI chat messages datasets from HF or local JSON/JSONL files.

_normalize_messages

Ensure messages list is valid and content fields are strings for system/user/assistant.

Data#

API#

nemo_automodel.components.datasets.llm.chat_dataset._is_hf_repo_id(val: str) → bool#
nemo_automodel.components.datasets.llm.chat_dataset._as_iter(
val: Union[str, Sequence[str]],
) → Iterator[str]#
nemo_automodel.components.datasets.llm.chat_dataset._SPLIT_SLICE_RE#

‘compile(…)’

nemo_automodel.components.datasets.llm.chat_dataset._parse_split_slice(split: Optional[str])#

Parse a split string like "train[1024:]" into (base_split, slice | None).
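To make the documented behavior concrete, here is an illustrative re-implementation of the split-slice parsing (a sketch; the real `_parse_split_slice` and `_SPLIT_SLICE_RE` may handle additional edge cases):

```python
import re
from typing import Optional, Tuple

# Illustrative pattern; the module's actual _SPLIT_SLICE_RE may differ.
SPLIT_SLICE_RE = re.compile(r"^(?P<base>\w+)\[(?P<start>-?\d*):(?P<stop>-?\d*)\]$")

def parse_split_slice(split: Optional[str]) -> Tuple[Optional[str], Optional[slice]]:
    """Parse 'train[1024:]' into ('train', slice(1024, None)); no slice -> (split, None)."""
    if split is None:
        return None, None
    m = SPLIT_SLICE_RE.match(split)
    if m is None:
        return split, None  # plain split name, no slice suffix
    start = int(m["start"]) if m["start"] else None
    stop = int(m["stop"]) if m["stop"] else None
    return m["base"], slice(start, stop)
```
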

nemo_automodel.components.datasets.llm.chat_dataset._load_openai_messages(
path_or_dataset_id: Union[str, Sequence[str]],
split: Optional[str] = None,
name: Optional[str] = None,
shuffle_seed: Optional[int] = None,
skip_invalid_samples: bool = False,
)#

Load OpenAI chat messages datasets from HF or local JSON/JSONL files.

For HF repo IDs, we delegate to datasets.load_dataset. When split is provided, the full base split is loaded and shuffled before any slice (e.g. [1024:]) is applied so that train/val splits sample from a consistent random order. When split is None it is passed through to load_dataset as-is (no default override).

For local files, we manually parse JSONL/JSON to avoid pyarrow type inference issues (e.g., heterogeneous field types under tools).

Parameters:
  • path_or_dataset_id – HF dataset ID or local file path(s).

  • split – Dataset split to load (e.g., “train”, “train[1024:]”).

  • name – Dataset configuration/subset name.

  • shuffle_seed – Random seed for shuffling HF datasets before slicing. Set to None to disable shuffling.

  • skip_invalid_samples – If True, skip malformed JSONL lines for local files instead of failing fast.
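The shuffle-before-slice behavior described above can be sketched as follows (a toy model, not the library's implementation): because the full base split is shuffled with the same seed before any slice is applied, complementary slices such as `train[:1024]` and `train[1024:]` partition one consistent random order and never overlap.

```python
import random
from typing import List, Optional

def take_split(rows: List[dict], seed: Optional[int], sl: Optional[slice]) -> List[dict]:
    """Shuffle the full base split first, then apply the slice (illustrative)."""
    order = list(rows)
    if seed is not None:
        random.Random(seed).shuffle(order)
    return order[sl] if sl is not None else order

rows = [{"i": i} for i in range(10)]
# Complementary slices of the same seeded shuffle partition the data.
train = take_split(rows, seed=42, sl=slice(2, None))
val = take_split(rows, seed=42, sl=slice(None, 2))
```
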

nemo_automodel.components.datasets.llm.chat_dataset._normalize_messages(
messages: List[Dict[str, Any]],
) → List[Dict[str, Any]]#

Ensure messages list is valid and content fields are strings for system/user/assistant.

  • Keeps tool-calling fields if present (e.g., tool calls in assistant messages, tool role messages).

  • If content is a list of parts, only keep text parts.
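The part-flattening rule can be illustrated with a small sketch (hypothetical helper name; the real `_normalize_messages` operates on the whole list and performs additional validation):

```python
from typing import Any, Dict

def flatten_text_parts(message: Dict[str, Any]) -> Dict[str, Any]:
    # When content is a list of parts, keep only the text parts and join
    # them into a single string; all other message fields pass through.
    content = message.get("content")
    if isinstance(content, list):
        texts = [p["text"] for p in content
                 if isinstance(p, dict) and p.get("type") == "text"]
        message = {**message, "content": "".join(texts)}
    return message
```
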

class nemo_automodel.components.datasets.llm.chat_dataset.ChatDataset(
path_or_dataset_id: Union[str, Sequence[str]],
tokenizer,
*,
split: Optional[str] = None,
name: Optional[str] = None,
seq_length: Optional[int] = None,
padding: Union[str, bool] = 'do_not_pad',
truncation: Union[str, bool] = 'do_not_truncate',
start_of_turn_token: Optional[str] = None,
chat_template: Optional[str] = None,
shuffle_seed: Optional[int] = None,
mask_reasoning_content: bool = False,
unshifted: bool = False,
skip_invalid_samples: bool = False,
)#

Bases: torch.utils.data.Dataset

Dataset for OpenAI-format tool-calling chat transcripts.

This class expects each row to contain a messages list in OpenAI chat format, potentially including tool calls and tool responses. The dataset formats the conversation via the tokenizer’s chat template to produce input_ids, labels, and attention_mask suitable for SFT.

Initialization

Load OpenAI-format chat rows and tokenize via the chat template.

Parameters:
  • path_or_dataset_id – Hugging Face dataset id, local JSON/JSONL path(s), Parquet file, or Parquet directory.

  • tokenizer – Tokenizer with chat template support (required).

  • split – Dataset split or slice (e.g. train, train[1024:]).

  • name – Optional Hub subset / config name.

  • seq_length – Maximum sequence length for padding and truncation in formatting.

  • padding – Padding mode for format_chat_template.

  • truncation – Truncation mode for format_chat_template.

  • start_of_turn_token – Optional token marking assistant turns for answer-only loss.

  • chat_template – Optional Jinja template string overriding tokenizer.chat_template.

  • shuffle_seed – If set, shuffles Hub/Parquet data before applying a split slice.

  • mask_reasoning_content – If True, exclude rendered reasoning traces from the loss mask.

  • unshifted – Passed through to format_chat_template.

  • skip_invalid_samples – If True, skip malformed JSONL lines when reading local files (warning logs include skip counts). If False, a bad line raises. Does not skip invalid structured rows after load; those still raise when a sample is accessed.

__len__() → int#
__getitem__(idx: int) → Dict[str, List[int]]#
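The answer-only loss behavior enabled by start_of_turn_token can be sketched with a toy masking function (not the real implementation, which locates assistant turns via the rendered chat template; `-100` is PyTorch's standard cross-entropy ignore index):

```python
IGNORE_INDEX = -100  # value ignored by torch.nn.CrossEntropyLoss by default

def mask_non_assistant(input_ids, assistant_start):
    """Copy input_ids as labels, masking every token before the assistant turn
    so the loss covers only the assistant's response."""
    return [IGNORE_INDEX] * assistant_start + input_ids[assistant_start:]

# Toy sequence: system/user tokens first, assistant reply starts at index 4.
ids = [101, 7, 8, 9, 42, 43]
labels = mask_non_assistant(ids, assistant_start=4)
```
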