bridge.data.vlm_datasets.mock_provider#

Generic mock conversation-style VLM dataset and provider.

This module produces synthetic image(s) and minimal conversations that are compatible with HF AutoProcessor.apply_chat_template and the collate functions defined in collate.py. It is processor-agnostic and can be used with any multimodal model whose processor supports the standard conversation schema and optional images argument.
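To make the target schema concrete, here is a minimal sketch of a conversation in the Hugging Face multimodal chat convention that `apply_chat_template` accepts. The `{"type": "image"}` entry is a placeholder marking where the processor injects an image; the actual pixels are passed separately through the processor's `images` argument. The assistant turn and its text are illustrative, not taken from this module.

```python
# A minimal conversation in the HF multimodal chat schema.
# {"type": "image"} marks where the processor injects the image;
# the pixel data itself is supplied via the processor's images= argument.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "A synthetic test image."}],
    },
]

# With a real processor this would be rendered roughly as follows
# (not executed here, since it requires downloading a model processor):
# from transformers import AutoProcessor
# processor = AutoProcessor.from_pretrained(hf_processor_path)
# text = processor.apply_chat_template(conversation, tokenize=False)
```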

Module Contents#

Classes#

MockVLMConversationProvider

DatasetProvider for generic mock VLM conversation datasets.

API#

class bridge.data.vlm_datasets.mock_provider.MockVLMConversationProvider#

Bases: megatron.bridge.training.config.DatasetProvider

DatasetProvider for generic mock VLM conversation datasets.

Builds train/valid/test datasets using a HF AutoProcessor and the MockVLMConversationDataset implementation. Intended to work across different VLM models whose processors support the conversation schema.

sequence_length: int#

None

hf_processor_path: str#

None

prompt: str#

'Describe this image.'

random_seed: int#

0

image_size: Tuple[int, int]#

(256, 256)

pad_to_max_length: bool#

True

create_attention_mask: bool#

True

skip_getting_attention_mask_from_dataset: bool#

True

num_images: int#

1

dataloader_type: Optional[Literal['single', 'cyclic', 'external']]#

'single'

_processor: Optional[Any]#

None

_make_base_examples() → List[Dict[str, Any]]#

build_datasets(
context: megatron.bridge.training.config.DatasetBuildContext,
)#
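The attributes above (prompt, num_images, image_size, random_seed) drive the construction of the base examples. As a rough illustration of that flow, here is a hypothetical, self-contained stand-in for what a method like _make_base_examples could produce: deterministic synthetic image descriptors paired with conversations in the schema shown earlier. The function name, the example count parameter, and the image representation (a size plus an RGB fill color rather than real pixel data) are all assumptions for this sketch, not the actual implementation.

```python
import random
from typing import Any, Dict, List, Tuple


def make_mock_examples(
    prompt: str = "Describe this image.",
    num_images: int = 1,
    image_size: Tuple[int, int] = (256, 256),
    random_seed: int = 0,
    num_examples: int = 2,  # hypothetical parameter for this sketch
) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for _make_base_examples.

    Builds conversation dicts in the HF multimodal chat schema together
    with lightweight synthetic image descriptors. A real implementation
    would generate actual pixel data (e.g. a PIL image or tensor).
    """
    rng = random.Random(random_seed)  # seeding makes output reproducible
    examples: List[Dict[str, Any]] = []
    for _ in range(num_examples):
        # One flat RGB fill color per image keeps the sketch cheap.
        images = [
            {"size": image_size,
             "rgb_fill": tuple(rng.randrange(256) for _ in range(3))}
            for _ in range(num_images)
        ]
        conversation = [
            {
                "role": "user",
                "content": [{"type": "image"}] * num_images
                + [{"type": "text", "text": prompt}],
            }
        ]
        examples.append({"conversation": conversation, "images": images})
    return examples
```

Because the generator is seeded with random_seed, two calls with the same seed yield identical examples, which is the property a mock dataset needs for reproducible train/valid/test splits.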