nemo_automodel.components.datasets.llm.tool_calling_chat_dataset
#
Module Contents#
Classes#
Dataset for OpenAI-format tool-calling chat transcripts. |
Functions#
Load OpenAI chat messages datasets from HF or local JSON/JSONL files. |
|
Ensure messages list is valid and content fields are strings for system/user/assistant. |
API#
- nemo_automodel.components.datasets.llm.tool_calling_chat_dataset._is_hf_repo_id(val: str) bool #
- nemo_automodel.components.datasets.llm.tool_calling_chat_dataset._as_iter(
- val: Union[str, Sequence[str]],
- nemo_automodel.components.datasets.llm.tool_calling_chat_dataset._load_openai_messages(
- path_or_dataset_id: Union[str, Sequence[str]],
- split: Optional[str] = None,
Load OpenAI chat messages datasets from HF or local JSON/JSONL files.
For HF repo IDs, we delegate to datasets.load_dataset. For local files, we manually parse JSONL/JSON to avoid pyarrow type inference issues (e.g., heterogeneous field types under
tools
).
- nemo_automodel.components.datasets.llm.tool_calling_chat_dataset._normalize_messages(
- messages: List[Dict[str, Any]],
Ensure messages list is valid and content fields are strings for system/user/assistant.
Keeps tool_calling fields if present (e.g., tool calls in assistant messages, tool role messages).
If content is a list of parts, only keep text parts.
- class nemo_automodel.components.datasets.llm.tool_calling_chat_dataset.ToolCallingChatDataset(
- path_or_dataset_id: Union[str, Sequence[str]],
- tokenizer,
- *,
- split: Optional[str] = None,
- seq_length: Optional[int] = None,
- start_of_turn_token: Optional[str] = None,
- chat_template: Optional[str] = None,
Bases:
torch.utils.data.Dataset
Dataset for OpenAI-format tool-calling chat transcripts.
This class expects each row to contain a
messages
list in OpenAI chat format, potentially including tool calls and tool responses. The datasetformats the conversation via the tokenizer’s chat template to produceinput_ids
,labels
, andattention_mask
suitable for SFT.Initialization
- __len__() int #
- __getitem__(idx: int) Dict[str, List[int]] #