bridge.data.hf_datasets.conversation_dataset#

Core dataset types for HF conversation-style examples.

Module Contents#

Classes#

ConversationDataset

Repeating wrapper over a list of HF-style conversation examples.

API#

class bridge.data.hf_datasets.conversation_dataset.ConversationDataset(
base_examples: List[Dict[str, Any]],
target_length: int,
processor: Any,
collate_impl: Optional[Callable[[list, Any], Dict[str, torch.Tensor]]] = None,
pack_sequences: bool = False,
pack_sequences_pad_to_multiple_of: int = 1,
)#

Bases: torch.utils.data.Dataset

Repeating wrapper over a list of HF-style conversation examples.

  • Each base example is expected to contain a “conversation” key following processor.apply_chat_template conventions. Optional modality fields like “audio” are passed through and consumed by the collate function.

  • Dataset length is set to a target length and indexes wrap around the underlying list to meet the requested size.

  • A collate_fn attribute is exposed so the framework can pass it to the DataLoader.

Initialization

__len__() int#
__getitem__(idx: int) Dict[str, Any]#