bridge.data.hf_datasets.makers#
Built-in maker functions that transform HuggingFace datasets into Bridge chat or multimodal conversation examples.
Module Contents#
Functions#
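| Function | Description |
| --- | --- |
| _load_hf_dataset | Load a Hugging Face dataset with optional subset. |
| _make_messages_example | Create a text-only chat example with optional evaluation answers. |
| _extract_final_answer | Extract the final numerical answer after the ``####`` delimiter. |
| _strip_intermediate_boxed | Replace all ``\boxed{content}`` occurrences in text with ``content``. |
| make_squad_dataset | Load and preprocess SQuAD into text chat examples. |
| make_gsm8k_dataset | Load and preprocess GSM8K into text chat examples. |
| make_openmathinstruct2_dataset | Load and preprocess OpenMathInstruct-2 into text chat examples. |
| make_openmathinstruct2_thinking_dataset | Load OpenMathInstruct-2 with reasoning in ``thinking`` and final answer in content. |
| make_rdr_dataset | Load and preprocess the RDR dataset for image-to-text fine-tuning. |
| make_cord_v2_dataset | Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning. |
| make_medpix_dataset | Load and preprocess the MedPix dataset for image-to-text fine-tuning. |
| make_text_chat_dataset | Load a text-only HF chat dataset into the conversation-provider schema. |
| make_raven_dataset | Load and preprocess the Raven subset from the Cauldron dataset. |
| make_llava_video_178k_dataset | Load and preprocess a subset of the LLaVA-Video-178K dataset. |
| make_default_audio_dataset | Load and preprocess a HuggingFace audio dataset for audio-to-text fine-tuning. |
| make_valor32k_avqa_dataset | Load Valor32k-AVQA v2.0 dataset for audio-visual QA finetuning. |
| make_cv17_dataset | Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning. |
| get_hf_dataset_maker | Return a built-in Hugging Face dataset maker by name or alias. |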
Data#
API#
- bridge.data.hf_datasets.makers.HF_MAKER_ALIASES#
None
- bridge.data.hf_datasets.makers._load_hf_dataset(
- path_or_dataset: str,
- subset: str | None = None,
- split: str = 'train',
- **kwargs,
Load a Hugging Face dataset with optional subset.
- bridge.data.hf_datasets.makers._make_messages_example(
- prompt: str,
- answer: str,
- original_answers: list[str] | None = None,
Create a text-only chat example with optional evaluation answers.
- bridge.data.hf_datasets.makers._extract_final_answer(answer: str) → str#
Extract the final numerical answer after the ``####`` delimiter.
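As a rough illustration of the delimiter convention, the following is a minimal sketch and not the module's implementation. The function name `extract_final_answer_sketch`, the fallback for inputs without a delimiter, and the removal of thousands separators are all assumptions made for this example.

```python
def extract_final_answer_sketch(answer: str) -> str:
    """Return the text that follows the last '####' marker."""
    # GSM8K-style solutions end with a line such as "#### 72".
    _, sep, tail = answer.rpartition("####")
    if not sep:
        # Assumption: with no delimiter present, return the stripped input.
        return answer.strip()
    # Assumption: drop thousands separators so "1,000" becomes "1000".
    return tail.strip().replace(",", "")


solution = "48 / 2 = 24\n48 + 24 = 72\n#### 72"
print(extract_final_answer_sketch(solution))
```

Using `rpartition` rather than `split` keeps only the text after the final marker, which matters if a solution happens to contain the delimiter more than once.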
- bridge.data.hf_datasets.makers._strip_intermediate_boxed(text: str) → str#
Replace all ``\boxed{content}`` occurrences in text with ``content``.
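One way to perform this replacement is sketched below. It is an illustration under stated assumptions, not the function's actual code: the name `strip_boxed_sketch` is hypothetical, and the choice to match braces with a counter (so that nested LaTeX such as `\frac{1}{2}` survives) and to leave unbalanced input untouched are design decisions made for this example.

```python
def strip_boxed_sketch(text: str) -> str:
    """Replace each \\boxed{...} with its inner content, honoring nested braces."""
    marker = "\\boxed{"
    out = []
    i = 0
    while i < len(text):
        if text.startswith(marker, i):
            start = i + len(marker)
            j = start
            depth = 1
            # Walk forward until the brace opened by \boxed{ is closed.
            while j < len(text) and depth > 0:
                if text[j] == "{":
                    depth += 1
                elif text[j] == "}":
                    depth -= 1
                j += 1
            if depth == 0:
                # text[start:j - 1] is the boxed content; recurse for nested boxes.
                out.append(strip_boxed_sketch(text[start:j - 1]))
                i = j
                continue
            # Unbalanced braces: fall through and copy the text literally.
        out.append(text[i])
        i += 1
    return "".join(out)


print(strip_boxed_sketch(r"so x = \boxed{\frac{1}{2}} and y = \boxed{3}"))
```

A plain regular expression such as `\\boxed\{([^}]*)\}` would be shorter but stops at the first closing brace, which truncates content that itself contains braces.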
- bridge.data.hf_datasets.makers.make_squad_dataset(
- path_or_dataset: str = 'rajpurkar/squad',
- subset: str | None = None,
- split: str = 'train',
- **kwargs,
Load and preprocess SQuAD into text chat examples.
- bridge.data.hf_datasets.makers.make_gsm8k_dataset(
- path_or_dataset: str = 'openai/gsm8k',
- subset: str | None = 'main',
- split: str = 'train',
- **kwargs,
Load and preprocess GSM8K into text chat examples.
- bridge.data.hf_datasets.makers.make_openmathinstruct2_dataset(
- path_or_dataset: str = 'nvidia/OpenMathInstruct-2',
- subset: str | None = None,
- split: str = 'train_1M',
- **kwargs,
Load and preprocess OpenMathInstruct-2 into text chat examples.
- bridge.data.hf_datasets.makers.make_openmathinstruct2_thinking_dataset(
- path_or_dataset: str = 'nvidia/OpenMathInstruct-2',
- subset: str | None = None,
- split: str = 'train_1M',
- **kwargs,
Load OpenMathInstruct-2 with reasoning in ``thinking`` and final answer in content.
- bridge.data.hf_datasets.makers.make_rdr_dataset(
- path_or_dataset: str = 'quintend/rdr-items',
- split: str = 'train',
- **kwargs,
Load and preprocess the RDR dataset for image-to-text fine-tuning.
Returns a list of examples with a "conversation" field that includes an image and text.
- bridge.data.hf_datasets.makers.make_cord_v2_dataset(
- path_or_dataset: str = 'naver-clova-ix/cord-v2',
- split: str = 'train',
- **kwargs,
Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning.
- bridge.data.hf_datasets.makers.make_medpix_dataset(
- path_or_dataset: str = 'mmoukouba/MedPix-VQA',
- split: str = 'train',
- **kwargs,
Load and preprocess the MedPix dataset for image-to-text fine-tuning.
- bridge.data.hf_datasets.makers.make_text_chat_dataset(
- path_or_dataset: str,
- subset: str | None = None,
- split: str = 'train',
- messages_column: str = 'messages',
- conversation_column: str = 'conversation',
- conversations_column: str = 'conversations',
- **kwargs,
Load a text-only HF chat dataset into the conversation-provider schema.
The input dataset must already contain OpenAI-style ``messages``, a processor-ready ``conversation`` column, or a legacy ``conversations`` column. Extra fields are preserved so collators can consume metadata such as tool schemas.
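The column handling described above can be sketched as follows. This is a hypothetical illustration only: the function name `to_conversation_row_sketch`, the lookup order of the three columns, and the output key `conversation` are assumptions, since the documentation does not state the precedence or the exact output schema.

```python
from typing import Any


def to_conversation_row_sketch(
    row: dict[str, Any],
    messages_column: str = "messages",
    conversation_column: str = "conversation",
    conversations_column: str = "conversations",
) -> dict[str, Any]:
    """Pick the first populated chat column and carry the other fields along."""
    # Assumption: columns are tried in this order; the real precedence may differ.
    candidates = (messages_column, conversation_column, conversations_column)
    for name in candidates:
        turns = row.get(name)
        if turns:
            break
    else:
        raise KeyError(f"row has none of the chat columns: {candidates}")
    # Extra fields (for example tool schemas) are preserved unchanged.
    extras = {k: v for k, v in row.items() if k not in candidates}
    return {"conversation": list(turns), **extras}


row = {
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4"},
    ],
    "tools": [{"name": "calculator"}],
}
print(to_conversation_row_sketch(row))
```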
- bridge.data.hf_datasets.makers.make_raven_dataset(
- path_or_dataset: str = 'HuggingFaceM4/the_cauldron',
- subset: str = 'raven',
- split: str = 'train',
- **kwargs,
Load and preprocess the Raven subset from the Cauldron dataset.
This subset follows the IDEFICS-style layout where each sample contains:
``images``: a (possibly empty) list of PIL images
``texts``: a list of conversation dictionaries
For Raven, ``texts[0]`` is a single turn stored as a dictionary with two keys:
{"user": "<question>", "assistant": "<answer>"}
Only the first element is used. The ``user`` string is taken as the user prompt, and ``assistant`` is the ground-truth answer.
Conversation building policy:
All images are placed at the beginning of the user turn followed by the textual prompt.
The assistant turn contains the answer text.
Examples missing either images or the required fields are filtered out.
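The conversation building policy above can be sketched for a single sample as shown here. This is an illustrative sketch, not the maker's code: the function name `raven_to_conversation_sketch` and the `{"type": ..., ...}` shape of each content part are assumptions, and plain strings stand in for PIL images so the example runs without extra dependencies.

```python
from typing import Any, Optional


def raven_to_conversation_sketch(sample: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Build a two-turn conversation, or return None if the sample is unusable."""
    images = sample.get("images") or []
    texts = sample.get("texts") or []
    if not images or not texts:
        return None  # filtered out: no images or no text turns
    first = texts[0]  # only the first element is used
    user, assistant = first.get("user"), first.get("assistant")
    if not user or not assistant:
        return None  # filtered out: required fields missing
    # All images come first in the user turn, followed by the textual prompt.
    user_content = [{"type": "image", "image": img} for img in images]
    user_content.append({"type": "text", "text": user})
    return {
        "conversation": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": [{"type": "text", "text": assistant}]},
        ]
    }


sample = {
    "images": ["<PIL image placeholder>"],
    "texts": [{"user": "Which option completes the pattern?", "assistant": "Option 3"}],
}
print(raven_to_conversation_sketch(sample))
```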
- bridge.data.hf_datasets.makers.make_llava_video_178k_dataset(
- video_root_path: str,
- path_or_dataset: str = 'lmms-lab/LLaVA-Video-178K',
- subsets: str | List[str] = '0_30_s_nextqa',
- split: str = 'open_ended',
Load and preprocess a subset of the LLaVA-Video-178K dataset.
Each row contains:
- ``video``: path or URL to the MP4 file.
- ``conversations``: a **two-turn** list:
[{"from": "human", "value": "<video>"}, {"from": "gpt", "value": " "}]
We map this schema to our internal multimodal conversation format:
User turn → [video, user prompt]
Assistant → answer text
Note: Video files are assumed to be pre-downloaded and stored locally in the ``video_root_path`` directory. Rows with missing videos or empty conversations are filtered out from the final output.
Args:
video_root_path: Root directory where video files are stored locally.
path_or_dataset: HF dataset path or local cache dir.
subsets: Single subset name or list of the dataset's directory-style subsets to load.
split: Split to load from the dataset. Note that "train" is automatically mapped to "open_ended".
Returns:
A list of dicts each containing a ``conversation`` field ready for downstream VLM processors.
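The row mapping and the split aliasing described above might look roughly like the sketch below. It is hypothetical: the function names, the content-part dictionaries, and the assumption that the human turn holds a `<video>` placeholder followed by the question are illustrative choices. The sketch also only checks that the ``video`` field is present; checking that the file exists under ``video_root_path`` is omitted to keep the example self-contained.

```python
import os
from typing import Any, Optional

VIDEO_TOKEN = "<video>"  # placeholder assumed to appear in the human turn


def resolve_split_sketch(split: str) -> str:
    """Map the conventional 'train' split name to the dataset's 'open_ended'."""
    return "open_ended" if split == "train" else split


def llava_row_to_conversation_sketch(
    row: dict[str, Any], video_root_path: str
) -> Optional[dict[str, Any]]:
    """Convert one row to a user/assistant conversation, or None to filter it."""
    video = row.get("video")
    turns = row.get("conversations") or []
    if not video or len(turns) < 2:
        return None  # missing video or empty/incomplete conversation
    question = turns[0]["value"].replace(VIDEO_TOKEN, "").strip()
    answer = turns[1]["value"].strip()
    return {
        "conversation": [
            {
                "role": "user",
                "content": [
                    # User turn: the video first, then the user prompt.
                    {"type": "video", "path": os.path.join(video_root_path, video)},
                    {"type": "text", "text": question},
                ],
            },
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ]
    }


row = {
    "video": "clips/0001.mp4",
    "conversations": [
        {"from": "human", "value": "<video>\nWhat is the person doing?"},
        {"from": "gpt", "value": "They are pouring water into a glass."},
    ],
}
print(resolve_split_sketch("train"))
print(llava_row_to_conversation_sketch(row, "/data/llava_video"))
```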
- bridge.data.hf_datasets.makers.make_default_audio_dataset(
- path_or_dataset: str,
- subset: str | None = None,
- split: str = 'train',
- audio_column: str = 'audio',
- text_column: str = 'text',
- prompt: str = 'Transcribe the audio clip.',
- remove_text_spaces: bool = True,
- **kwargs,
Load and preprocess a HuggingFace audio dataset for audio-to-text fine-tuning.
Formats each example into a conversation with an audio user turn and a text assistant turn. Works with any HF dataset that has audio and text columns.
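The formatting step described above can be sketched for one row as follows. This is an assumption-laden illustration rather than the maker's implementation: the function name and the content-part dictionaries are hypothetical, and ``remove_text_spaces`` is deliberately left out because its exact behavior is not described here.

```python
from typing import Any


def audio_row_to_conversation_sketch(
    row: dict[str, Any],
    audio_column: str = "audio",
    text_column: str = "text",
    prompt: str = "Transcribe the audio clip.",
) -> dict[str, Any]:
    """Pair an audio user turn (plus instruction) with a text assistant turn."""
    return {
        "conversation": [
            {
                "role": "user",
                "content": [
                    # The audio value is passed through unchanged; with HF
                    # datasets it is typically a decoded-audio dictionary.
                    {"type": "audio", "audio": row[audio_column]},
                    {"type": "text", "text": prompt},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": row[text_column]}],
            },
        ]
    }


row = {"audio": {"array": [0.0, 0.1, -0.1], "sampling_rate": 16000}, "text": "hello world"}
print(audio_row_to_conversation_sketch(row))
```

Because the column names are parameters, the same function covers datasets that store transcripts under a different key, for example `text_column="sentence"`.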
- bridge.data.hf_datasets.makers.make_valor32k_avqa_dataset(
- data_root: str,
- split: str = 'train',
- max_audio_duration: float = 10.0,
- modality_filter: str = 'all',
- **kwargs,
Load Valor32k-AVQA v2.0 dataset for audio-visual QA finetuning.
Expects a directory produced by tutorials/data/valor32k-avqa/prepare_valor32k_avqa.py:
data_root/
├── videos/ # 10s MP4 clips
├── audio/ # 16 kHz mono WAV
└── combined_dataset_{split}_flattened.json
Parameters:
data_root – Root directory of the preprocessed dataset.
split – "train", "val", or "test".
max_audio_duration – Maximum audio duration in seconds.
modality_filter – "all", "audio-visual", "audio", or "visual".
- bridge.data.hf_datasets.makers.make_cv17_dataset(
- path_or_dataset: str = 'ysdede/commonvoice_17_tr_fixed',
- split: str = 'train',
- prompt: str = 'Transcribe the Turkish audio clip.',
- **kwargs,
Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning.
- bridge.data.hf_datasets.makers.get_hf_dataset_maker(maker_name: str)#
Return a built-in Hugging Face dataset maker by name or alias.