bridge.data.vlm_datasets.hf_dataset_makers#
Built-in maker functions that transform HuggingFace datasets into conversation-style examples consumable by VLM processors.
Module Contents#
Functions#
| Function | Description |
| --- | --- |
| `make_rdr_dataset` | Load and preprocess the RDR dataset for image-to-text fine-tuning. |
| `make_cord_v2_dataset` | Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning. |
| `make_medpix_dataset` | Load and preprocess the MedPix dataset for image-to-text fine-tuning. |
| `make_raven_dataset` | Load and preprocess the Raven subset from the Cauldron dataset. |
| `make_llava_video_178k_dataset` | Load and preprocess a subset of the LLaVA-Video-178K dataset. |
| `make_cv17_dataset` | Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning. |
API#
- bridge.data.vlm_datasets.hf_dataset_makers.make_rdr_dataset(
- path_or_dataset: str = 'quintend/rdr-items',
- split: str = 'train',
- **kwargs,
- )
Load and preprocess the RDR dataset for image-to-text fine-tuning.
Returns a list of examples with a `conversation` field that includes an image and text.
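A minimal usage sketch, assuming the module is importable under the documented path and that the default dataset is reachable on the Hugging Face Hub:

```python
# Minimal usage sketch (not from the source). Per the docstring, each returned
# example carries a "conversation" field that mixes an image with text turns.
from bridge.data.vlm_datasets.hf_dataset_makers import make_rdr_dataset

examples = make_rdr_dataset(split="train")

# Inspect the first example's conversation.
print(examples[0]["conversation"])
```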
- bridge.data.vlm_datasets.hf_dataset_makers.make_cord_v2_dataset(
- path_or_dataset: str = 'naver-clova-ix/cord-v2',
- split: str = 'train',
- **kwargs,
- )
Load and preprocess the CORD-V2 dataset for image-to-text fine-tuning.
- bridge.data.vlm_datasets.hf_dataset_makers.make_medpix_dataset(
- path_or_dataset: str = 'mmoukouba/MedPix-VQA',
- split: str = 'train',
- **kwargs,
- )
Load and preprocess the MedPix dataset for image-to-text fine-tuning.
- bridge.data.vlm_datasets.hf_dataset_makers.make_raven_dataset(
- path_or_dataset: str = 'HuggingFaceM4/the_cauldron',
- subset: str = 'raven',
- split: str = 'train',
- **kwargs,
- )
Load and preprocess the Raven subset from the Cauldron dataset.
This subset follows the IDEFICS-style layout where each sample contains:
- `images`: a (possibly empty) list of PIL images.
- `texts`: a list of conversation dictionaries. For Raven, `texts[0]` is a single turn stored as a dictionary with two keys: `{"user": "<question>", "assistant": "<answer>"}`. Only the first element is used. The `user` string is taken as the user prompt, and `assistant` is the ground-truth answer.
Conversation building policy (a sketch of this mapping follows below):
- All images are placed at the beginning of the user turn, followed by the textual prompt.
- The assistant turn contains the answer text.
- Examples missing either images or the required fields are filtered out.
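A hedged sketch of the conversation-building policy described above; only the input layout (`images`, `texts`) and the ordering rules come from the documentation, while the exact role/content keys of the output are illustrative assumptions:

```python
# Sketch of mapping an IDEFICS-style Raven sample to a conversation.
raw_sample = {
    "images": ["<PIL.Image of the Raven puzzle>"],  # possibly empty list of PIL images
    "texts": [{"user": "Which panel completes the pattern?", "assistant": "D"}],
}

turn = raw_sample["texts"][0]  # only the first element is used
conversation = [
    # all images first, then the textual prompt, inside a single user turn
    {"role": "user", "content": [*raw_sample["images"], turn["user"]]},
    # the assistant turn carries the ground-truth answer
    {"role": "assistant", "content": turn["assistant"]},
]
```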
- bridge.data.vlm_datasets.hf_dataset_makers.make_llava_video_178k_dataset(
- video_root_path: str,
- path_or_dataset: str = 'lmms-lab/LLaVA-Video-178K',
- subsets: str | List[str] = '0_30_s_nextqa',
- split: str = 'open_ended',
- )
Load and preprocess a subset of the LLaVA-Video-178K dataset.
Each row contains:
- `video`: path or URL to the MP4 file.
- `conversations`: a **two-turn** list: `[{"from": "human", "value": "<video>…"}, {"from": "gpt", "value": "…"}]`

We map this schema to our internal multimodal conversation format:
- User turn → [video, user prompt]
- Assistant turn → answer text

Note: Video files are assumed to be pre-downloaded and stored locally in the `video_root_path` directory. Rows with missing videos or empty conversations are filtered out of the final output.

Args:
- `video_root_path`: Root directory where video files are stored locally.
- `path_or_dataset`: HF dataset path or local cache dir.
- `subsets`: Single subset name or list of the dataset's directory-style subsets to load.
- `split`: Split to load from the dataset. Note that "train" is automatically mapped to "open_ended".

Returns: A list of dicts, each containing a `conversation` field ready for downstream VLM processors.
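An illustrative call, assuming the videos have already been downloaded under a local directory; the directory path shown is hypothetical and not specified by the docs:

```python
# Usage sketch with the documented defaults and parameters.
from bridge.data.vlm_datasets.hf_dataset_makers import make_llava_video_178k_dataset

examples = make_llava_video_178k_dataset(
    video_root_path="/data/llava_video_178k/videos",  # pre-downloaded MP4s (hypothetical path)
    subsets=["0_30_s_nextqa"],                        # one or more subset names
    split="train",                                    # mapped internally to "open_ended"
)
print(len(examples), examples[0]["conversation"])
```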
- bridge.data.vlm_datasets.hf_dataset_makers.make_cv17_dataset(
- path_or_dataset: str = 'ysdede/commonvoice_17_tr_fixed',
- split: str = 'train',
- **kwargs,
- )
Load and preprocess the CommonVoice 17 dataset for audio-to-text fine-tuning.
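A minimal usage sketch for the audio-to-text maker; it mirrors the image makers above, and treating the result as conversation-style examples is an assumption based on the module description rather than a documented return schema:

```python
# Usage sketch (assumed return structure, not documented above).
from bridge.data.vlm_datasets.hf_dataset_makers import make_cv17_dataset

examples = make_cv17_dataset(split="train")
print(examples[0])  # expected: an audio clip paired with its transcription text
```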