`bridge.data.hf_datasets.makers`#

Built-in maker functions that transform HuggingFace datasets into Bridge chat or multimodal conversation examples.

Module Contents#

Functions#

`_load_hf_dataset`	Load a Hugging Face dataset with optional subset.
`_make_messages_example`	Create a text-only chat example with optional evaluation answers.
`_extract_final_answer`	Extract the final numerical answer after the `####` delimiter.
`_strip_intermediate_boxed`	Replace all `\boxed{content}` occurrences in text with `content`.
`make_squad_dataset`	Load and preprocess SQuAD into text chat examples.
`make_gsm8k_dataset`	Load and preprocess GSM8K into text chat examples.
`make_openmathinstruct2_dataset`	Load and preprocess OpenMathInstruct-2 into text chat examples.
`make_openmathinstruct2_thinking_dataset`	Load OpenMathInstruct-2 with reasoning in `thinking` and final answer in content.
`make_rdr_dataset`	Load and preprocess the RDR dataset for image-to-text fine-tuning.
`make_cord_v2_dataset`	Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning.
`make_medpix_dataset`	Load and preprocess the MedPix dataset for image-to-text fine-tuning.
`make_text_chat_dataset`	Load a text-only HF chat dataset into the conversation-provider schema.
`make_raven_dataset`	Load and preprocess the Raven subset from the Cauldron dataset.
`make_llava_video_178k_dataset`	Load and preprocess a subset of the LLaVA-Video-178K dataset.
`make_default_audio_dataset`	Load and preprocess a HuggingFace audio dataset for audio-to-text fine-tuning.
`make_valor32k_avqa_dataset`	Load Valor32k-AVQA v2.0 dataset for audio-visual QA finetuning.
`make_cv17_dataset`	Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning.
`get_hf_dataset_maker`	Return a built-in Hugging Face dataset maker by name or alias.

Data#

HF_MAKER_ALIASES

API#

bridge.data.hf_datasets.makers.HF_MAKER_ALIASES#: None

bridge.data.hf_datasets.makers._load_hf_dataset(

path_or_dataset: str,

subset: str | None = None,

split: str = 'train',

**kwargs,

) → Any#: Load a Hugging Face dataset with optional subset.

bridge.data.hf_datasets.makers._make_messages_example( prompt: str, answer: str, original_answers: list[str] | None = None, ) → Dict[str, Any]#: Create a text-only chat example with optional evaluation answers.

bridge.data.hf_datasets.makers._extract_final_answer(answer: str) → str#: Extract the final numerical answer after the #### delimiter.
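GSM8K reference solutions end with a line such as `#### 72`. The following is a minimal sketch of that extraction, not the module's actual implementation. The helper name `extract_final_answer_sketch` is hypothetical, and returning the whole stripped string when the delimiter is absent is an assumption.

```python
def extract_final_answer_sketch(answer: str, delimiter: str = "####") -> str:
    """Return the text after the last delimiter, stripped of whitespace.

    Assumption: when the delimiter is absent, the whole answer is returned.
    """
    if delimiter not in answer:
        return answer.strip()
    # rsplit keeps only the segment that follows the final delimiter.
    return answer.rsplit(delimiter, 1)[1].strip()


gsm8k_style = "Natalia sold 48 / 2 = 24 clips in May.\n48 + 24 = 72\n#### 72"
final = extract_final_answer_sketch(gsm8k_style)
```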

bridge.data.hf_datasets.makers._strip_intermediate_boxed(text: str) → str#: Replace all \boxed{content} occurrences in text with content.
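Because `\boxed{...}` content can itself contain braces (for example `\frac{1}{2}`), a plain regular expression is fragile here. The sketch below shows one way to unwrap such spans with brace matching. It is an illustration under stated assumptions, not the module's code: the name `strip_boxed_sketch` is hypothetical, and leaving unbalanced input untouched is an assumed policy.

```python
def strip_boxed_sketch(text: str) -> str:
    r"""Replace every balanced \boxed{content} span with its content."""
    marker = "\\boxed{"
    out = []
    i = 0
    while i < len(text):
        start = text.find(marker, i)
        if start == -1:
            out.append(text[i:])
            break
        out.append(text[i:start])
        j = start + len(marker)
        depth = 1
        # Walk forward until the brace opened by \boxed{ is closed.
        while j < len(text) and depth > 0:
            if text[j] == "{":
                depth += 1
            elif text[j] == "}":
                depth -= 1
            j += 1
        if depth != 0:
            # Assumption: unbalanced input is kept verbatim.
            out.append(text[start:])
            break
        inner = text[start + len(marker): j - 1]
        # Recurse so that nested \boxed{} spans are unwrapped as well.
        out.append(strip_boxed_sketch(inner))
        i = j
    return "".join(out)


cleaned = strip_boxed_sketch(r"so \boxed{\frac{1}{2}} then \boxed{3}")
```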

bridge.data.hf_datasets.makers.make_squad_dataset(

path_or_dataset: str = 'rajpurkar/squad',

subset: str | None = None,

split: str = 'train',

**kwargs,

) → List[Dict[str, Any]]#: Load and preprocess SQuAD into text chat examples.

bridge.data.hf_datasets.makers.make_gsm8k_dataset(

path_or_dataset: str = 'openai/gsm8k',

subset: str | None = 'main',

split: str = 'train',

**kwargs,

) → List[Dict[str, Any]]#: Load and preprocess GSM8K into text chat examples.

bridge.data.hf_datasets.makers.make_openmathinstruct2_dataset(

path_or_dataset: str = 'nvidia/OpenMathInstruct-2',

subset: str | None = None,

split: str = 'train_1M',

**kwargs,

) → List[Dict[str, Any]]#: Load and preprocess OpenMathInstruct-2 into text chat examples.

bridge.data.hf_datasets.makers.make_openmathinstruct2_thinking_dataset(

path_or_dataset: str = 'nvidia/OpenMathInstruct-2',

subset: str | None = None,

split: str = 'train_1M',

**kwargs,

) → List[Dict[str, Any]]#: Load OpenMathInstruct-2 with reasoning in thinking and final answer in content.

bridge.data.hf_datasets.makers.make_rdr_dataset(

path_or_dataset: str = 'quintend/rdr-items',

split: str = 'train',

**kwargs,

) → List[Dict[str, Any]]#

Load and preprocess the RDR dataset for image-to-text fine-tuning.

Returns a list of examples with a “conversation” field that includes an image and text.

bridge.data.hf_datasets.makers.make_cord_v2_dataset(

path_or_dataset: str = 'naver-clova-ix/cord-v2',

split: str = 'train',

**kwargs,

) → List[Dict[str, Any]]#: Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning.

bridge.data.hf_datasets.makers.make_medpix_dataset(

path_or_dataset: str = 'mmoukouba/MedPix-VQA',

split: str = 'train',

**kwargs,

) → List[Dict[str, Any]]#: Load and preprocess the MedPix dataset for image-to-text fine-tuning.

bridge.data.hf_datasets.makers.make_text_chat_dataset(

path_or_dataset: str,

subset: str | None = None,

split: str = 'train',

messages_column: str = 'messages',

conversation_column: str = 'conversation',

conversations_column: str = 'conversations',

**kwargs,

) → List[Dict[str, Any]]#

Load a text-only HF chat dataset into the conversation-provider schema.

The input dataset must already contain OpenAI-style messages, a processor-ready conversation column, or a legacy conversations column. Extra fields are preserved so collators can consume metadata such as tool schemas.
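The three accepted input layouts can be illustrated with a small row normalizer. This is a sketch only. The function name `normalize_chat_row`, the output key `conversation`, the column precedence, and the ShareGPT-style `from`/`value` keys for the legacy column are all assumptions that the documentation above does not confirm.

```python
from typing import Any, Dict, List

# Assumed mapping from legacy speaker labels to chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def normalize_chat_row(
    row: Dict[str, Any],
    messages_column: str = "messages",
    conversation_column: str = "conversation",
    conversations_column: str = "conversations",
) -> Dict[str, Any]:
    """Return a copy of ``row`` with its chat turns under ``conversation``."""
    chat_columns = (messages_column, conversation_column, conversations_column)
    # Extra fields (for example tool schemas) are carried over unchanged.
    extras = {k: v for k, v in row.items() if k not in chat_columns}

    turns: List[Dict[str, Any]]
    if row.get(messages_column):
        turns = [{"role": m["role"], "content": m["content"]} for m in row[messages_column]]
    elif row.get(conversation_column):
        turns = list(row[conversation_column])
    elif row.get(conversations_column):
        turns = [
            {"role": ROLE_MAP.get(t["from"], t["from"]), "content": t["value"]}
            for t in row[conversations_column]
        ]
    else:
        raise KeyError(f"row has none of the chat columns {chat_columns}")
    return {**extras, "conversation": turns}


openai_row = {
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4"},
    ],
    "tools": [{"name": "calculator"}],
}
legacy_row = {"conversations": [{"from": "human", "value": "Hi"}, {"from": "gpt", "value": "Hello"}]}
normalized = [normalize_chat_row(openai_row), normalize_chat_row(legacy_row)]
```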

bridge.data.hf_datasets.makers.make_raven_dataset(

path_or_dataset: str = 'HuggingFaceM4/the_cauldron',

subset: str = 'raven',

split: str = 'train',

**kwargs,

) → List[Dict[str, Any]]#

Load and preprocess the Raven subset from the Cauldron dataset.

This subset follows the IDEFICS-style layout where each sample contains:

images: a (possibly empty) list of PIL images
texts: a list of conversation dictionaries. For Raven, texts[0] is a single turn stored as a dictionary with two keys:
```
{"user": "<question>", "assistant": "<answer>"}
```
Only the first element is used. The user string is taken as the user prompt, and assistant is the ground-truth answer.

Conversation building policy:

All images are placed at the beginning of the user turn followed by the textual prompt.
The assistant turn contains the answer text.

Examples missing either images or the required fields are filtered out.
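The conversation building policy above can be sketched as follows. The content-item layout (`{"type": "image", ...}`, `{"type": "text", ...}`) and the function name `raven_sample_to_conversation` are assumptions for illustration; the real maker may use a different internal structure. Placeholder strings stand in for PIL images so the example stays self-contained.

```python
from typing import Any, Dict, List, Optional


def raven_sample_to_conversation(sample: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Convert one IDEFICS-style sample, or return None if it should be dropped."""
    images = sample.get("images") or []
    texts = sample.get("texts") or []
    if not images or not texts:
        return None
    first_turn = texts[0]  # only the first element is used
    prompt = first_turn.get("user")
    answer = first_turn.get("assistant")
    if not prompt or not answer:
        return None
    # All images come first in the user turn, followed by the textual prompt.
    user_content: List[Dict[str, Any]] = [{"type": "image", "image": img} for img in images]
    user_content.append({"type": "text", "text": prompt})
    return {
        "conversation": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ]
    }


samples = [
    {
        "images": ["<PIL image 0>", "<PIL image 1>"],
        "texts": [{"user": "Which panel completes the pattern?", "assistant": "Panel C"}],
    },
    {"images": [], "texts": [{"user": "No image here", "assistant": "n/a"}]},
]
examples = [ex for ex in map(raven_sample_to_conversation, samples) if ex is not None]
```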

bridge.data.hf_datasets.makers.make_llava_video_178k_dataset( video_root_path: str, path_or_dataset: str = 'lmms-lab/LLaVA-Video-178K', subsets: str | List[str] = '0_30_s_nextqa', split: str = 'open_ended', ) → List[Dict[str, Any]]#

Load and preprocess a subset of the LLaVA-Video-178K dataset.

Each row contains:

video: path or URL to the MP4 file.
conversations: a two-turn list:

    [{"from": "human", "value": "<video>\n<question>"}, {"from": "gpt", "value": "<answer>"}]

This schema is mapped to the internal multimodal conversation format:

User turn → [video, user prompt]
Assistant turn → answer text

Note: Video files are assumed to be pre-downloaded and stored locally in the video_root_path directory. Rows with missing videos or empty conversations are filtered out of the final output.

Parameters:

video_root_path – Root directory where video files are stored locally.
path_or_dataset – HF dataset path or local cache dir.
subsets – Single subset name or list of the dataset's directory-style subsets to load.
split – Split to load from the dataset. Note that "train" is automatically mapped to "open_ended".

Returns:

A list of dicts, each containing a conversation field ready for downstream VLM processors.
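A rough sketch of the row mapping and filtering described for this maker is shown below. It is not the library's implementation: the function names, the content-item layout, and the removal of the `<video>` placeholder from the prompt are assumptions made for illustration.

```python
import os
from typing import Any, Dict, Optional


def resolve_split(split: str) -> str:
    """Map "train" to "open_ended", as the split parameter describes."""
    return "open_ended" if split == "train" else split


def llava_video_row_to_conversation(
    row: Dict[str, Any], video_root_path: str
) -> Optional[Dict[str, Any]]:
    """Build a conversation example, or return None for rows to filter out."""
    video = row.get("video")
    turns = row.get("conversations") or []
    if not video or len(turns) < 2:
        return None
    video_path = os.path.join(video_root_path, video)
    if not os.path.exists(video_path):
        return None  # videos must already be downloaded locally
    # Assumption: the "<video>" placeholder is dropped from the text prompt.
    prompt = turns[0]["value"].replace("<video>", "").strip()
    answer = turns[1]["value"].strip()
    return {
        "conversation": [
            {
                "role": "user",
                "content": [
                    {"type": "video", "path": video_path},
                    {"type": "text", "text": prompt},
                ],
            },
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ]
    }
```

Calling `llava_video_row_to_conversation(row, "/data/videos")` on each row and discarding the `None` results would reproduce the filtering behaviour described in the note.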

bridge.data.hf_datasets.makers.make_default_audio_dataset(

path_or_dataset: str,

subset: str | None = None,

split: str = 'train',

audio_column: str = 'audio',

text_column: str = 'text',

prompt: str = 'Transcribe the audio clip.',

remove_text_spaces: bool = True,

**kwargs,

) → List[Dict[str, Any]]#

Load and preprocess a HuggingFace audio dataset for audio-to-text fine-tuning.

Formats each example into a conversation with an audio user turn and a text assistant turn. Works with any HF dataset that has audio and text columns.

bridge.data.hf_datasets.makers.make_valor32k_avqa_dataset(

data_root: str,

split: str = 'train',

max_audio_duration: float = 10.0,

modality_filter: str = 'all',

**kwargs,

) → List[Dict[str, Any]]#

Load Valor32k-AVQA v2.0 dataset for audio-visual QA finetuning.

Expects a directory produced by tutorials/data/valor32k-avqa/prepare_valor32k_avqa.py:

data_root/
├── videos/                                  # 10s MP4 clips
├── audio/                                   # 16 kHz mono WAV
└── combined_dataset_{split}_flattened.json

Parameters:

data_root – Root directory of the preprocessed dataset.
split – "train", "val", or "test".
max_audio_duration – Maximum audio duration in seconds.
modality_filter – "all", "audio-visual", "audio", or "visual".
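To make the expected directory layout and the accepted parameter values concrete, the sketch below resolves the three paths and validates `split` and `modality_filter`. The helper `valor32k_layout` is hypothetical, and raising `ValueError` on unknown values is an assumption; the schema of the flattened JSON file is not documented above, so it is not parsed here.

```python
from pathlib import Path
from typing import Dict

VALID_SPLITS = ("train", "val", "test")
VALID_MODALITY_FILTERS = ("all", "audio-visual", "audio", "visual")


def valor32k_layout(
    data_root: str, split: str = "train", modality_filter: str = "all"
) -> Dict[str, Path]:
    """Return the paths the maker is documented to expect under data_root."""
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {VALID_SPLITS}, got {split!r}")
    if modality_filter not in VALID_MODALITY_FILTERS:
        raise ValueError(
            f"modality_filter must be one of {VALID_MODALITY_FILTERS}, got {modality_filter!r}"
        )
    root = Path(data_root)
    return {
        "videos": root / "videos",  # 10s MP4 clips
        "audio": root / "audio",    # 16 kHz mono WAV
        "annotations": root / f"combined_dataset_{split}_flattened.json",
    }


layout = valor32k_layout("/data/valor32k", split="val")
```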

bridge.data.hf_datasets.makers.make_cv17_dataset(

path_or_dataset: str = 'ysdede/commonvoice_17_tr_fixed',

split: str = 'train',

prompt: str = 'Transcribe the Turkish audio clip.',

**kwargs,

) → List[Dict[str, Any]]#: Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning.

bridge.data.hf_datasets.makers.get_hf_dataset_maker(maker_name: str)#: Return a built-in Hugging Face dataset maker by name or alias.
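Lookup by name or alias is commonly implemented as a two-step dictionary resolution. The sketch below illustrates that pattern with a stand-in registry. The contents of `HF_MAKER_ALIASES` are not shown in this page, so the registry, the alias `demo`, the maker `make_demo_dataset`, and the `ValueError` on unknown names are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Maker = Callable[..., List[Dict[str, Any]]]


def make_demo_dataset(**kwargs: Any) -> List[Dict[str, Any]]:
    """Hypothetical maker used only to populate the example registry."""
    return [{"conversation": []}]


# Hypothetical registry and alias table, standing in for the real ones.
DEMO_MAKERS: Dict[str, Maker] = {"make_demo_dataset": make_demo_dataset}
DEMO_ALIASES: Dict[str, str] = {"demo": "make_demo_dataset"}


def get_maker_sketch(maker_name: str) -> Maker:
    """Resolve an alias to its canonical name, then look up the maker."""
    canonical = DEMO_ALIASES.get(maker_name, maker_name)
    try:
        return DEMO_MAKERS[canonical]
    except KeyError:
        known = sorted(set(DEMO_MAKERS) | set(DEMO_ALIASES))
        raise ValueError(f"Unknown maker {maker_name!r}; known names: {known}") from None


maker = get_maker_sketch("demo")
examples = maker()
```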
