nemo_automodel.components.datasets.llm.tool_calling_chat_dataset#

Module Contents#

Classes#

ToolCallingChatDataset

Dataset for OpenAI-format tool-calling chat transcripts.

Functions#

_is_hf_repo_id

_as_iter

_load_openai_messages

Load OpenAI chat messages datasets from HF or local JSON/JSONL files.

_normalize_messages

Ensure messages list is valid and content fields are strings for system/user/assistant.

API#

nemo_automodel.components.datasets.llm.tool_calling_chat_dataset._is_hf_repo_id(val: str) bool#
nemo_automodel.components.datasets.llm.tool_calling_chat_dataset._as_iter(
val: Union[str, Sequence[str]],
) Iterator[str]#
nemo_automodel.components.datasets.llm.tool_calling_chat_dataset._load_openai_messages(
path_or_dataset_id: Union[str, Sequence[str]],
split: Optional[str] = None,
)#

Load OpenAI chat messages datasets from HF or local JSON/JSONL files.

For HF repo IDs, we delegate to datasets.load_dataset. For local files, we manually parse JSONL/JSON to avoid pyarrow type inference issues (e.g., heterogeneous field types under tools).

nemo_automodel.components.datasets.llm.tool_calling_chat_dataset._normalize_messages(
messages: List[Dict[str, Any]],
) List[Dict[str, Any]]#

Ensure messages list is valid and content fields are strings for system/user/assistant.

  • Keeps tool_calling fields if present (e.g., tool calls in assistant messages, tool role messages).

  • If content is a list of parts, only keep text parts.

class nemo_automodel.components.datasets.llm.tool_calling_chat_dataset.ToolCallingChatDataset(
path_or_dataset_id: Union[str, Sequence[str]],
tokenizer,
*,
split: Optional[str] = None,
seq_length: Optional[int] = None,
start_of_turn_token: Optional[str] = None,
chat_template: Optional[str] = None,
)#

Bases: torch.utils.data.Dataset

Dataset for OpenAI-format tool-calling chat transcripts.

This class expects each row to contain a messages list in OpenAI chat format, potentially including tool calls and tool responses. The datasetformats the conversation via the tokenizer’s chat template to produce input_ids, labels, and attention_mask suitable for SFT.

Initialization

__len__() int#
__getitem__(idx: int) Dict[str, List[int]]#