bridge.data.vlm_datasets.hf_dataset_makers

Built-in maker functions that transform HuggingFace datasets into conversation-style examples consumable by VLM processors.

Module Contents

Functions

make_rdr_dataset

Load and preprocess the RDR dataset for image-to-text fine-tuning.

make_cord_v2_dataset

Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning.

make_medpix_dataset

Load and preprocess the MedPix dataset for image-to-text fine-tuning.

make_raven_dataset

Load and preprocess the Raven subset from the Cauldron dataset.

make_llava_video_178k_dataset

Load and preprocess a subset of the LLaVA-Video-178K dataset.

make_cv17_dataset

Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning.

API

bridge.data.vlm_datasets.hf_dataset_makers.make_rdr_dataset(
path_or_dataset: str = 'quintend/rdr-items',
split: str = 'train',
**kwargs,
) -> List[Dict[str, Any]]

Load and preprocess the RDR dataset for image-to-text fine-tuning.

Returns a list of examples with a “conversation” field that includes an image and text.
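
As a rough illustration of what such a conversation-style example could look like, the sketch below wraps one image/text pair as a two-turn conversation. The helper name `build_image_text_example`, the prompt string, and the exact key names (`role`, `content`, `type`) are assumptions modeled on common chat-style message layouts; they are not confirmed by this reference.

```python
from typing import Any, Dict, List


def build_image_text_example(image: Any, prompt: str, answer: str) -> Dict[str, List[Dict[str, Any]]]:
    """Hypothetical helper: wrap one image/text pair as a two-turn conversation."""
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},  # image comes first
                {"type": "text", "text": prompt},   # then the textual prompt
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": answer}],
        },
    ]
    return {"conversation": conversation}


# A string stands in for the PIL image so the sketch runs without extra dependencies.
example = build_image_text_example("<PIL image>", "Describe this image.", "A red bicycle.")
print(example["conversation"][1]["content"][0]["text"])
```

A maker function would presumably apply a transformation like this to every row of the loaded split and return the resulting list.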

bridge.data.vlm_datasets.hf_dataset_makers.make_cord_v2_dataset(
path_or_dataset: str = 'naver-clova-ix/cord-v2',
split: str = 'train',
**kwargs,
) -> List[Dict[str, Any]]

Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning.

bridge.data.vlm_datasets.hf_dataset_makers.make_medpix_dataset(
path_or_dataset: str = 'mmoukouba/MedPix-VQA',
split: str = 'train',
**kwargs,
) -> List[Dict[str, Any]]

Load and preprocess the MedPix dataset for image-to-text fine-tuning.

bridge.data.vlm_datasets.hf_dataset_makers.make_raven_dataset(
path_or_dataset: str = 'HuggingFaceM4/the_cauldron',
subset: str = 'raven',
split: str = 'train',
**kwargs,
) -> List[Dict[str, Any]]

Load and preprocess the Raven subset from the Cauldron dataset.

This subset follows the IDEFICS-style layout where each sample contains:

  • images: a (possibly empty) list of PIL images

  • texts: a list of conversation dictionaries. For Raven, texts[0] is a single turn stored as a dictionary with two keys:

    {"user": "<question>", "assistant": "<answer>"}

    Only the first element is used. The user string is taken as the user prompt, and the assistant string is the ground-truth answer.

Conversation building policy:

  1. All images are placed at the beginning of the user turn followed by the textual prompt.

  2. The assistant turn contains the answer text.

Examples missing either images or the required fields are filtered out.
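
The policy above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name `raven_sample_to_example` and the message key names (`role`, `content`, `type`) are hypothetical, and strings stand in for PIL images.

```python
from typing import Any, Dict, Optional


def raven_sample_to_example(sample: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: convert one Cauldron-style sample, or return None to drop it."""
    images = sample.get("images") or []
    texts = sample.get("texts") or []
    if not images or not texts:
        return None  # missing images or texts: filtered out

    first_turn = texts[0]  # only the first element is used
    question = first_turn.get("user")
    answer = first_turn.get("assistant")
    if not question or not answer:
        return None  # missing required fields: filtered out

    # Policy 1: all images first, followed by the textual prompt.
    user_content = [{"type": "image", "image": img} for img in images]
    user_content.append({"type": "text", "text": question})

    # Policy 2: the assistant turn contains the answer text.
    return {
        "conversation": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ]
    }


samples = [
    {"images": ["img_a", "img_b"],
     "texts": [{"user": "Which panel completes the pattern?", "assistant": "Answer: 3"}]},
    {"images": [], "texts": [{"user": "No image here", "assistant": "n/a"}]},
]
examples = [ex for ex in map(raven_sample_to_example, samples) if ex is not None]
print(len(examples))
```

The second sample has an empty image list, so only the first one survives the filter.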

bridge.data.vlm_datasets.hf_dataset_makers.make_llava_video_178k_dataset(
video_root_path: str,
path_or_dataset: str = 'lmms-lab/LLaVA-Video-178K',
subsets: str | List[str] = '0_30_s_nextqa',
split: str = 'open_ended',
) -> List[Dict[str, Any]]

Load and preprocess a subset of the LLaVA-Video-178K dataset.

Each row contains:

  • video: path or URL to the MP4 file.

  • conversations: a two-turn list:

    [{"from": "human", "value": "<video>\n<question>"},
     {"from": "gpt", "value": "<answer>"}]

We map this schema to our internal multimodal conversation format:

  • User turn → [video, user prompt]

  • Assistant turn → answer text

Note: Video files are assumed to be pre-downloaded and stored locally in the video_root_path directory. Rows with missing videos or empty conversations are filtered out from the final output.

Args:
    video_root_path: Root directory where video files are stored locally.
    path_or_dataset: HF dataset path or local cache dir.
    subsets: Single subset name or list of the dataset's directory-style
        subsets to load.
    split: Split to load from the dataset. Note that "train" is automatically
        mapped to "open_ended".

Returns:
    A list of dicts, each containing a “conversation” field ready for
    downstream VLM processors.
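
The mapping described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the helper names `normalize_split` and `llava_video_row_to_example`, the message key names, and the stripping of the `<video>` placeholder from the prompt are all hypothetical. The sketch only checks that the video field is present; it does not verify that the file exists on disk.

```python
import os
from typing import Any, Dict, Optional


def normalize_split(split: str) -> str:
    """Map "train" to "open_ended", as the split argument is documented to do."""
    return "open_ended" if split == "train" else split


def llava_video_row_to_example(row: Dict[str, Any], video_root_path: str) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: convert one row, or return None to drop it."""
    video = row.get("video")
    conversations = row.get("conversations") or []
    if not video or len(conversations) < 2:
        return None  # missing video or empty conversation: filtered out

    human = next((t for t in conversations if t.get("from") == "human"), None)
    gpt = next((t for t in conversations if t.get("from") == "gpt"), None)
    if human is None or gpt is None:
        return None

    # Assumption: the "<video>" placeholder is dropped from the prompt text,
    # because the video is passed as its own content item.
    prompt = human.get("value", "").replace("<video>", "").strip()
    answer = gpt.get("value", "").strip()
    if not prompt or not answer:
        return None

    video_path = os.path.join(video_root_path, video)
    return {
        "conversation": [
            {"role": "user", "content": [
                {"type": "video", "path": video_path},
                {"type": "text", "text": prompt},
            ]},
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ]
    }


row = {
    "video": "clips/a.mp4",
    "conversations": [
        {"from": "human", "value": "<video>\nWhat happens in the clip?"},
        {"from": "gpt", "value": "A dog runs across the yard."},
    ],
}
example = llava_video_row_to_example(row, "/data/videos")
print(normalize_split("train"), example["conversation"][0]["content"][1]["text"])
```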

bridge.data.vlm_datasets.hf_dataset_makers.make_cv17_dataset(
path_or_dataset: str = 'ysdede/commonvoice_17_tr_fixed',
split: str = 'train',
**kwargs,
) -> List[Dict[str, Any]]

Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning.