bridge.data.hf_datasets.makers#

Built-in maker functions that transform HuggingFace datasets into Bridge chat or multimodal conversation examples.

Module Contents#

Functions#

_load_hf_dataset

Load a Hugging Face dataset with optional subset.

_make_messages_example

Create a text-only chat example with optional evaluation answers.

_extract_final_answer

Extract the final numerical answer after the #### delimiter.

_strip_intermediate_boxed

Replace all \boxed{content} occurrences in text with content.

make_squad_dataset

Load and preprocess SQuAD into text chat examples.

make_gsm8k_dataset

Load and preprocess GSM8K into text chat examples.

make_openmathinstruct2_dataset

Load and preprocess OpenMathInstruct-2 into text chat examples.

make_openmathinstruct2_thinking_dataset

Load OpenMathInstruct-2 with reasoning in thinking and final answer in content.

make_rdr_dataset

Load and preprocess the RDR dataset for image-to-text fine-tuning.

make_cord_v2_dataset

Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning.

make_medpix_dataset

Load and preprocess the MedPix dataset for image-to-text fine-tuning.

make_text_chat_dataset

Load a text-only HF chat dataset into the conversation-provider schema.

make_raven_dataset

Load and preprocess the Raven subset from the Cauldron dataset.

make_llava_video_178k_dataset

Load and preprocess a subset of the LLaVA-Video-178K dataset.

make_default_audio_dataset

Load and preprocess a HuggingFace audio dataset for audio-to-text fine-tuning.

make_valor32k_avqa_dataset

Load the Valor32k-AVQA v2.0 dataset for audio-visual QA fine-tuning.

make_cv17_dataset

Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning.

get_hf_dataset_maker

Return a built-in Hugging Face dataset maker by name or alias.

Data#

API#

bridge.data.hf_datasets.makers.HF_MAKER_ALIASES#

None

bridge.data.hf_datasets.makers._load_hf_dataset(
path_or_dataset: str,
subset: str | None = None,
split: str = 'train',
**kwargs,
) Any#

Load a Hugging Face dataset with optional subset.

bridge.data.hf_datasets.makers._make_messages_example(
prompt: str,
answer: str,
original_answers: list[str] | None = None,
) Dict[str, Any]#

Create a text-only chat example with optional evaluation answers.

bridge.data.hf_datasets.makers._extract_final_answer(answer: str) str#

Extract the final numerical answer after the #### delimiter.
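
As a rough illustration of the delimiter convention, GSM8K solutions end with a line such as `#### 72`. The sketch below is not the library's implementation: `extract_final_answer` is a hypothetical stand-in, and the fallback for a missing delimiter is an assumption.

```python
def extract_final_answer(answer: str) -> str:
    """Return the text after the last '####' delimiter, stripped of whitespace."""
    # rpartition splits on the LAST occurrence, so earlier '####' are ignored.
    _, sep, tail = answer.rpartition("####")
    if not sep:
        # Assumption: with no delimiter, fall back to the whole answer.
        return answer.strip()
    return tail.strip()


solution = "Natalia sold 48 / 2 = 24 clips in May.\n48 + 24 = 72\n#### 72"
print(extract_final_answer(solution))  # 72
```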

bridge.data.hf_datasets.makers._strip_intermediate_boxed(text: str) str#

Replace all \boxed{content} occurrences in text with content.
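
A plain regular expression tends to break on nested braces such as `\boxed{\frac{1}{2}}`, so one way to sketch this replacement is with a small brace-matching scan. The function name `strip_boxed` and the handling of unbalanced braces are assumptions made for illustration, not the library's actual code.

```python
def strip_boxed(text: str) -> str:
    r"""Replace every \boxed{content} in text with content, honoring nested braces."""
    marker = "\\boxed{"
    out = []
    i = 0
    while i < len(text):
        start = text.find(marker, i)
        if start == -1:
            out.append(text[i:])
            break
        out.append(text[i:start])
        # Scan forward to the brace that closes this \boxed{...}.
        depth = 1
        j = start + len(marker)
        while j < len(text) and depth > 0:
            if text[j] == "{":
                depth += 1
            elif text[j] == "}":
                depth -= 1
            j += 1
        if depth != 0:
            # Assumption: on unbalanced braces, keep the remainder unchanged.
            out.append(text[start:])
            break
        # Recurse so a \boxed nested inside the content is unwrapped too.
        out.append(strip_boxed(text[start + len(marker): j - 1]))
        i = j
    return "".join(out)


print(strip_boxed(r"So x = \boxed{\frac{1}{2}} and y = \boxed{3}."))
```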

bridge.data.hf_datasets.makers.make_squad_dataset(
path_or_dataset: str = 'rajpurkar/squad',
subset: str | None = None,
split: str = 'train',
**kwargs,
) List[Dict[str, Any]]#

Load and preprocess SQuAD into text chat examples.

bridge.data.hf_datasets.makers.make_gsm8k_dataset(
path_or_dataset: str = 'openai/gsm8k',
subset: str | None = 'main',
split: str = 'train',
**kwargs,
) List[Dict[str, Any]]#

Load and preprocess GSM8K into text chat examples.

bridge.data.hf_datasets.makers.make_openmathinstruct2_dataset(
path_or_dataset: str = 'nvidia/OpenMathInstruct-2',
subset: str | None = None,
split: str = 'train_1M',
**kwargs,
) List[Dict[str, Any]]#

Load and preprocess OpenMathInstruct-2 into text chat examples.

bridge.data.hf_datasets.makers.make_openmathinstruct2_thinking_dataset(
path_or_dataset: str = 'nvidia/OpenMathInstruct-2',
subset: str | None = None,
split: str = 'train_1M',
**kwargs,
) List[Dict[str, Any]]#

Load OpenMathInstruct-2 with reasoning in thinking and final answer in content.

bridge.data.hf_datasets.makers.make_rdr_dataset(
path_or_dataset: str = 'quintend/rdr-items',
split: str = 'train',
**kwargs,
) List[Dict[str, Any]]#

Load and preprocess the RDR dataset for image-to-text fine-tuning.

Returns a list of examples with a "conversation" field that includes an image and text.

bridge.data.hf_datasets.makers.make_cord_v2_dataset(
path_or_dataset: str = 'naver-clova-ix/cord-v2',
split: str = 'train',
**kwargs,
) List[Dict[str, Any]]#

Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning.

bridge.data.hf_datasets.makers.make_medpix_dataset(
path_or_dataset: str = 'mmoukouba/MedPix-VQA',
split: str = 'train',
**kwargs,
) List[Dict[str, Any]]#

Load and preprocess the MedPix dataset for image-to-text fine-tuning.

bridge.data.hf_datasets.makers.make_text_chat_dataset(
path_or_dataset: str,
subset: str | None = None,
split: str = 'train',
messages_column: str = 'messages',
conversation_column: str = 'conversation',
conversations_column: str = 'conversations',
**kwargs,
) List[Dict[str, Any]]#

Load a text-only HF chat dataset into the conversation-provider schema.

The input dataset must already contain OpenAI-style messages, a processor-ready conversation column, or a legacy conversations column. Extra fields are preserved so collators can consume metadata such as tool schemas.
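
The column handling described above might look roughly like the following sketch. Several details are assumptions rather than documented behavior: the helper name `normalize_row`, the precedence order among the three columns, the output key `conversation`, and the ShareGPT-style `from`/`value` layout assumed for the legacy column.

```python
from typing import Any, Dict

# Assumed mapping from ShareGPT-style speaker tags to chat roles.
_ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def normalize_row(
    row: Dict[str, Any],
    messages_column: str = "messages",
    conversation_column: str = "conversation",
    conversations_column: str = "conversations",
) -> Dict[str, Any]:
    """Resolve whichever chat column is present and keep all other fields."""
    chat_columns = (messages_column, conversation_column, conversations_column)
    # Extra fields (for example tool schemas) pass through untouched.
    extras = {k: v for k, v in row.items() if k not in chat_columns}

    if row.get(messages_column):
        conversation = list(row[messages_column])
    elif row.get(conversation_column):
        conversation = list(row[conversation_column])
    elif row.get(conversations_column):
        # Assumed legacy layout: [{"from": "human", "value": "..."}, ...]
        conversation = [
            {"role": _ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
            for turn in row[conversations_column]
        ]
    else:
        raise ValueError(f"Row has none of the expected chat columns: {chat_columns}")

    return {**extras, "conversation": conversation}
```

Keeping the extras in the returned dictionary is what lets a downstream collator read metadata that travels with each example.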

bridge.data.hf_datasets.makers.make_raven_dataset(
path_or_dataset: str = 'HuggingFaceM4/the_cauldron',
subset: str = 'raven',
split: str = 'train',
**kwargs,
) List[Dict[str, Any]]#

Load and preprocess the Raven subset from the Cauldron dataset.

This subset follows the IDEFICS-style layout where each sample contains:

  • images: a (possibly empty) list of PIL images

  • texts: a list of conversation dictionaries. For Raven, texts[0] is a single turn stored as a dictionary with two keys:

    {"user": "<question>", "assistant": "<answer>"}

    Only the first element is used. The user string is taken as the user prompt, and assistant is the ground-truth answer.

Conversation building policy:

  1. All images are placed at the beginning of the user turn followed by the textual prompt.

  2. The assistant turn contains the answer text.

Examples missing either images or the required fields are filtered out.
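
The policy above can be sketched as a small per-sample function. This is an illustration under assumptions: the helper name `build_raven_example` and the exact shape of the content items (`{"type": "image", ...}`, `{"type": "text", ...}`) are hypothetical and may differ from what the maker actually emits.

```python
from typing import Any, Dict, Optional


def build_raven_example(sample: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Turn one Cauldron-style sample into a conversation, or None if unusable."""
    images = sample.get("images") or []
    texts = sample.get("texts") or []
    if not images or not texts:
        return None  # filtered out: no images or no text turns

    first_turn = texts[0]  # only the first element is used
    prompt = first_turn.get("user")
    answer = first_turn.get("assistant")
    if not prompt or not answer:
        return None  # filtered out: required fields missing

    # 1. All images come first in the user turn, followed by the prompt.
    user_content = [{"type": "image", "image": image} for image in images]
    user_content.append({"type": "text", "text": prompt})

    # 2. The assistant turn contains the answer text.
    return {
        "conversation": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ]
    }
```

Returning `None` for unusable samples lets the caller filter with a simple list comprehension over the dataset.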

bridge.data.hf_datasets.makers.make_llava_video_178k_dataset(
video_root_path: str,
path_or_dataset: str = 'lmms-lab/LLaVA-Video-178K',
subsets: str | List[str] = '0_30_s_nextqa',
split: str = 'open_ended',
) List[Dict[str, Any]]#

Load and preprocess a subset of the LLaVA-Video-178K dataset.

Each row contains:

  • video: path or URL to the MP4 file.

  • conversations: a two-turn list:

    [{"from": "human", "value": "<video>\n<question>"}, {"from": "gpt", "value": "<answer>"}]

We map this schema to our internal multimodal conversation format:

  User turn → [video, user prompt]
  Assistant → answer text
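
The mapping above can be sketched for a single row as follows. This is a hedged illustration, not the maker's code: `build_video_example` is a hypothetical helper, the content-item keys (`type`, `path`, `text`) are assumed, and the human turn is assumed to be a `<video>` tag followed by the question.

```python
import os
from typing import Any, Dict, Optional


def build_video_example(row: Dict[str, Any], video_root_path: str) -> Optional[Dict[str, Any]]:
    """Map one row to a [video, prompt] user turn plus an assistant answer."""
    turns = row.get("conversations") or []
    video = row.get("video")
    if not video or len(turns) < 2:
        return None  # empty or incomplete conversation

    # Videos are expected to be stored locally under video_root_path.
    video_path = os.path.join(video_root_path, video)
    if not os.path.exists(video_path):
        return None  # missing video file

    # Drop the placeholder tag; the video is passed as its own content item.
    prompt = turns[0]["value"].replace("<video>", "").strip()
    answer = turns[1]["value"].strip()
    if not prompt or not answer:
        return None

    return {
        "conversation": [
            {
                "role": "user",
                "content": [
                    {"type": "video", "path": video_path},
                    {"type": "text", "text": prompt},
                ],
            },
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ]
    }
```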

Note: Video files are assumed to be pre-downloaded and stored locally in the video_root_path directory. Rows with missing videos or empty conversations are filtered out from the final output.

Parameters:
  • video_root_path – Root directory where video files are stored locally.

  • path_or_dataset – HF dataset path or local cache dir.

  • subsets – Single subset name or list of the dataset's directory-style subsets to load.

  • split – Split to load from the dataset. Note that "train" is automatically mapped to "open_ended".

Returns:
  A list of dicts each containing a conversation field ready for downstream VLM processors.

bridge.data.hf_datasets.makers.make_default_audio_dataset(
path_or_dataset: str,
subset: str | None = None,
split: str = 'train',
audio_column: str = 'audio',
text_column: str = 'text',
prompt: str = 'Transcribe the audio clip.',
remove_text_spaces: bool = True,
**kwargs,
) List[Dict[str, Any]]#

Load and preprocess a HuggingFace audio dataset for audio-to-text fine-tuning.

Formats each example into a conversation with an audio user turn and a text assistant turn. Works with any HF dataset that has audio and text columns.

bridge.data.hf_datasets.makers.make_valor32k_avqa_dataset(
data_root: str,
split: str = 'train',
max_audio_duration: float = 10.0,
modality_filter: str = 'all',
**kwargs,
) List[Dict[str, Any]]#

Load the Valor32k-AVQA v2.0 dataset for audio-visual QA fine-tuning.

Expects a directory produced by tutorials/data/valor32k-avqa/prepare_valor32k_avqa.py:

    data_root/
    ├── videos/                                  # 10s MP4 clips
    ├── audio/                                   # 16 kHz mono WAV
    └── combined_dataset_{split}_flattened.json

Parameters:
  • data_root – Root directory of the preprocessed dataset.

  • split – "train", "val", or "test".

  • max_audio_duration – Maximum audio duration in seconds.

  • modality_filter – "all", "audio-visual", "audio", or "visual".

bridge.data.hf_datasets.makers.make_cv17_dataset(
path_or_dataset: str = 'ysdede/commonvoice_17_tr_fixed',
split: str = 'train',
prompt: str = 'Transcribe the Turkish audio clip.',
**kwargs,
) List[Dict[str, Any]]#

Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning.

bridge.data.hf_datasets.makers.get_hf_dataset_maker(maker_name: str)#

Return a built-in Hugging Face dataset maker by name or alias.
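
Lookup by name or alias is commonly done with a registry plus an alias table. The sketch below shows that pattern only. The registry contents, the alias names (`squad`, `gsm8k`), the stub makers, and the error type are all hypothetical, since the actual contents of HF_MAKER_ALIASES are not shown on this page.

```python
from typing import Any, Callable, Dict, List

Maker = Callable[..., List[Dict[str, Any]]]


def make_squad_dataset(**kwargs: Any) -> List[Dict[str, Any]]:
    """Stub standing in for the real maker."""
    return []


def make_gsm8k_dataset(**kwargs: Any) -> List[Dict[str, Any]]:
    """Stub standing in for the real maker."""
    return []


# Canonical names map to maker functions.
MAKERS: Dict[str, Maker] = {
    "make_squad_dataset": make_squad_dataset,
    "make_gsm8k_dataset": make_gsm8k_dataset,
}

# Hypothetical short aliases that resolve to canonical names.
ALIASES: Dict[str, str] = {
    "squad": "make_squad_dataset",
    "gsm8k": "make_gsm8k_dataset",
}


def get_maker(maker_name: str) -> Maker:
    """Resolve an alias to its canonical name, then look up the maker."""
    canonical = ALIASES.get(maker_name, maker_name)
    try:
        return MAKERS[canonical]
    except KeyError:
        known = sorted(MAKERS) + sorted(ALIASES)
        raise ValueError(f"Unknown maker {maker_name!r}. Known names: {known}") from None


maker = get_maker("squad")
examples = maker(split="train")
```

Resolving the alias first keeps a single source of truth for the functions themselves, so adding a short name never requires registering the maker twice.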