nemo_automodel.components.datasets.llm.chat_dataset#
Module Contents#
Classes#
ChatDataset – Dataset for OpenAI-format tool-calling chat transcripts.
Functions#
_parse_split_slice – Parse a split string like "train[1024:]" into (base_split, slice | None).
_load_openai_messages – Load OpenAI chat messages datasets from HF or local JSON/JSONL files.
_normalize_messages – Ensure the messages list is valid and content fields are strings for system/user/assistant.
Data#
API#
- nemo_automodel.components.datasets.llm.chat_dataset._is_hf_repo_id(val: str) -> bool#
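The docs do not show how `_is_hf_repo_id` decides; a plausible heuristic, sketched here under the assumption that an "org/name" string that is not an existing local path is treated as a Hub repo ID (the module's actual check may differ):

```python
import os
import re

def is_hf_repo_id(val: str) -> bool:
    """Heuristic sketch (not the library's implementation): a string shaped
    like "org/name" that does not exist on the local filesystem is assumed
    to be a Hugging Face Hub repo ID."""
    return not os.path.exists(val) and re.fullmatch(r"[\w.\-]+/[\w.\-]+", val) is not None
```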
- nemo_automodel.components.datasets.llm.chat_dataset._as_iter(val: Union[str, Sequence[str]])#
- nemo_automodel.components.datasets.llm.chat_dataset._SPLIT_SLICE_RE#
‘compile(…)’
- nemo_automodel.components.datasets.llm.chat_dataset._parse_split_slice(split: Optional[str])#
Parse a split string like "train[1024:]" into (base_split, slice | None).
- nemo_automodel.components.datasets.llm.chat_dataset._load_openai_messages(
- path_or_dataset_id: Union[str, Sequence[str]],
- split: Optional[str] = None,
- name: Optional[str] = None,
- shuffle_seed: Optional[int] = None,
- skip_invalid_samples: bool = False,
)#
Load OpenAI chat messages datasets from HF or local JSON/JSONL files.
For HF repo IDs, we delegate to datasets.load_dataset. When split is provided, the full base split is loaded and shuffled before any slice (e.g. [1024:]) is applied so that train/val splits sample from a consistent random order. When split is None it is passed through to load_dataset as-is (no default override). For local files, we manually parse JSONL/JSON to avoid pyarrow type inference issues (e.g., heterogeneous field types under tools).
- Parameters:
path_or_dataset_id – HF dataset ID or local file path(s).
split – Dataset split to load (e.g., "train", "train[1024:]").
name – Dataset configuration/subset name.
shuffle_seed – Random seed for shuffling HF datasets before slicing. Set to None to disable shuffling.
skip_invalid_samples – If True, skip malformed JSONL lines for local files instead of failing fast.
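The shuffle-before-slice semantics described above can be mimicked with a plain list standing in for the dataset (the real code operates on a datasets.Dataset; function name and shapes here are illustrative):

```python
import random
from typing import Dict, List, Optional

def shuffle_then_slice(rows: List[Dict], slice_expr: Optional[str],
                       seed: Optional[int]) -> List[Dict]:
    """Shuffle the full base split first, then apply the slice, so that e.g.
    "[:1024]" and "[1024:]" with the same seed partition one random order.
    Only simple "[start:stop]" slices are handled in this sketch."""
    if seed is not None:
        rows = rows.copy()
        random.Random(seed).shuffle(rows)
    if slice_expr:
        start, _, stop = slice_expr.strip("[]").partition(":")
        rows = rows[int(start) if start else None : int(stop) if stop else None]
    return rows
```

With a shared seed, a train/val pair drawn this way covers the data exactly once, which is the consistency property the loader guarantees.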
- nemo_automodel.components.datasets.llm.chat_dataset._normalize_messages(
- messages: List[Dict[str, Any]],
)#
Ensure messages list is valid and content fields are strings for system/user/assistant.
Keeps tool_calling fields if present (e.g., tool calls in assistant messages, tool role messages).
If content is a list of parts, only keep text parts.
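A minimal sketch of that normalization, assuming OpenAI-style content parts shaped like {"type": "text", "text": "..."} (the module's validation is likely stricter):

```python
from typing import Any, Dict, List

def normalize_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """For system/user/assistant messages, flatten list-of-parts content to the
    concatenated text parts; tool-calling fields (tool_calls, tool role
    messages) pass through untouched."""
    out = []
    for msg in messages:
        msg = dict(msg)  # avoid mutating the caller's rows
        content = msg.get("content")
        if msg.get("role") in ("system", "user", "assistant") and isinstance(content, list):
            # Keep only text parts; drop images and other non-text parts.
            msg["content"] = "".join(
                part.get("text", "") for part in content if part.get("type") == "text"
            )
        out.append(msg)
    return out
```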
- class nemo_automodel.components.datasets.llm.chat_dataset.ChatDataset(
- path_or_dataset_id: Union[str, Sequence[str]],
- tokenizer,
- *,
- split: Optional[str] = None,
- name: Optional[str] = None,
- seq_length: Optional[int] = None,
- padding: Union[str, bool] = 'do_not_pad',
- truncation: Union[str, bool] = 'do_not_truncate',
- start_of_turn_token: Optional[str] = None,
- chat_template: Optional[str] = None,
- shuffle_seed: Optional[int] = None,
- mask_reasoning_content: bool = False,
- unshifted: bool = False,
- skip_invalid_samples: bool = False,
)#
Bases: torch.utils.data.Dataset
Dataset for OpenAI-format tool-calling chat transcripts.
This class expects each row to contain a messages list in OpenAI chat format, potentially including tool calls and tool responses. The dataset formats the conversation via the tokenizer's chat template to produce input_ids, labels, and attention_mask suitable for SFT.
Initialization
Load OpenAI-format chat rows and tokenize via the chat template.
- Parameters:
path_or_dataset_id – Hugging Face dataset id, local JSON/JSONL path(s), Parquet file, or Parquet directory.
tokenizer – Tokenizer with chat template support (required).
split – Dataset split or slice (e.g. train, train[1024:]).
name – Optional Hub subset / config name.
seq_length – Maximum sequence length for padding and truncation in formatting.
padding – Padding mode for format_chat_template.
truncation – Truncation mode for format_chat_template.
start_of_turn_token – Optional token marking assistant turns for answer-only loss.
chat_template – Optional Jinja template string overriding tokenizer.chat_template.
shuffle_seed – If set, shuffles Hub/Parquet data before applying a split slice.
mask_reasoning_content – If True, exclude rendered reasoning traces from the loss mask.
unshifted – Passed through to format_chat_template.
skip_invalid_samples – If True, skip malformed JSONL lines when reading local files (warning logs include skip counts). If False, a bad line raises. Does not skip invalid structured rows after load; those still raise when a sample is accessed.
- __len__() -> int#
- __getitem__(idx: int) -> Dict[str, List[int]]#