bridge.data.mimo.mock_provider#
Mock dataset provider for MIMO testing with synthetic multimodal data.
This module produces synthetic multimodal inputs (random images, audio, etc.) that are compatible with HuggingFace processors. It follows the same pattern as vlm_datasets/mock_provider.py: generating fake input data but using real processors for preprocessing.
Module Contents#
Classes#
| Class | Description |
|---|---|
| `MockMimoProvider` | DatasetProvider for mock MIMO datasets with synthetic multimodal data. |
Functions#
| Function | Description |
|---|---|
| `_generate_random_image` | Generate a random RGB image. |
| `_generate_random_audio` | Generate random audio waveform. |
API#
- bridge.data.mimo.mock_provider._generate_random_image(width: int, height: int, rng: numpy.random.Generator)
Generate a random RGB image.
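A minimal sketch of what this helper might look like, assuming it returns a NumPy `uint8` array in `(height, width, 3)` RGB layout (the actual return type and implementation are not shown in this reference):

```python
import numpy as np

def generate_random_image(width: int, height: int, rng: np.random.Generator) -> np.ndarray:
    """Illustrative sketch: uniform random uint8 RGB pixels, shape (height, width, 3)."""
    # rng.integers draws from [0, 256), matching the full uint8 pixel range.
    return rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)
```

Such an array can be passed directly to most HuggingFace image processors, which accept NumPy arrays as well as PIL images.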
- bridge.data.mimo.mock_provider._generate_random_audio(duration_sec: float, sample_rate: int, rng: numpy.random.Generator)
Generate random audio waveform.
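A hedged sketch of the audio helper, assuming it produces a mono `float32` white-noise waveform in `[-1, 1]` (the actual dtype, range, and channel layout are assumptions):

```python
import numpy as np

def generate_random_audio(duration_sec: float, sample_rate: int, rng: np.random.Generator) -> np.ndarray:
    """Illustrative sketch: white-noise waveform, mono, float32, values in [-1, 1)."""
    # Number of samples is duration times sample rate, truncated to an int.
    num_samples = int(duration_sec * sample_rate)
    return rng.uniform(-1.0, 1.0, size=num_samples).astype(np.float32)
```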
- class bridge.data.mimo.mock_provider.MockMimoProvider#
Bases:
megatron.bridge.training.config.DatasetProvider

DatasetProvider for mock MIMO datasets with synthetic multimodal data.
Generates synthetic multimodal inputs (random images, audio, etc.) and uses real HuggingFace processors to preprocess them. This tests the full data pipeline without requiring real datasets.
Follows the same pattern as vlm_datasets/MockVLMConversationProvider.
- Parameters:
seq_length – Total sequence length for the model (encoder placeholders + text tokens). Must be greater than sum(encoder_seq_lengths.values()) to leave room for text.
processor_paths – Per-modality HF processor paths, e.g., {"vision": "openai/clip-vit-large-patch14"}.
tokenizer_path – HuggingFace tokenizer identifier.
special_token_ids – Per-encoder placeholder token IDs, e.g., {"vision": 32000}.
encoder_seq_lengths – Per-encoder output sequence lengths, e.g., {"vision": 577}. Determines how many placeholder tokens to insert for each modality.
modality_configs – Per-modality generation config, e.g., {"vision": {"type": "image", "width": 224, "height": 224}}.
text_prompt – Default text prompt for synthetic examples.
random_seed – Seed for random generation.
.. rubric:: Example

```python
>>> provider = MockMimoProvider(
...     seq_length=2048,
...     processor_paths={"vision": "openai/clip-vit-large-patch14"},
...     tokenizer_path="meta-llama/Llama-2-7b-hf",
...     special_token_ids={"vision": 32000},
...     encoder_seq_lengths={"vision": 577},  # CLIP ViT-L/14 output tokens
...     modality_configs={"vision": {"type": "image", "width": 224, "height": 224}},
... )
>>> context = DatasetBuildContext(train_samples=1000, valid_samples=100, test_samples=100)
>>> train_ds, valid_ds, test_ds = provider.build_datasets(context)
```
- seq_length: int#
None
- processor_paths: Dict[str, str]#
'field(...)'
- tokenizer_path: str = <Multiline-String>#
- special_token_ids: Dict[str, int]#
'field(...)'
- encoder_seq_lengths: Dict[str, int]#
'field(...)'
- modality_configs: Dict[str, Dict[str, Any]]#
'field(...)'
- text_prompt: str#
'Describe this input.'
- random_seed: int#
0
- trust_remote_code: bool#
False
- dataloader_type: Optional[Literal['single', 'cyclic', 'external']]#
'single'
- _processors: Optional[Dict[str, Any]]#
'field(...)'
- _tokenizer: Optional[Any]#
'field(...)'
- _load_processors() -> Dict[str, Any]#
Load HuggingFace processors for each modality.
- _load_tokenizer() -> Any#
Load HuggingFace tokenizer.
- _generate_synthetic_examples(size: int, seed_offset: int = 0)
Generate synthetic multimodal examples.
- Parameters:
size – Number of examples to generate.
seed_offset – Offset to add to random seed for different splits.
- Returns:
List of examples with synthetic modality data.
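The role of seed_offset can be sketched as below. This is an illustrative reimplementation, not the provider's actual source; the example dict layout (a "text" key plus one key per modality) and the hard-coded "vision" payload are assumptions:

```python
import numpy as np

def generate_synthetic_examples(size, random_seed=0, seed_offset=0,
                                text_prompt="Describe this input."):
    """Illustrative sketch: one rng per split so train/valid/test differ but stay reproducible."""
    rng = np.random.default_rng(random_seed + seed_offset)
    examples = []
    for _ in range(size):
        examples.append({
            "text": text_prompt,
            # Hypothetical per-modality payload; the real provider derives this
            # from modality_configs (e.g. a random 224x224 image for "vision").
            "vision": rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8),
        })
    return examples
```

Calling this with the same seed and offset reproduces the same examples, while a different offset (one per split) yields distinct data.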
- _build_split_dataset(size: int, processors: Dict[str, Any], tokenizer: Any, seed_offset: int = 0)
Build dataset for a single split.
- build_datasets(context: megatron.bridge.training.config.DatasetBuildContext)
Build train, validation, and test datasets with synthetic data.
- Parameters:
context – Build context with sample counts.
- Returns:
Tuple of (train_dataset, valid_dataset, test_dataset).
- get_collate_fn() -> Callable#
Return collate function for MIMO datasets.
- Returns:
Partial function of mimo_collate_fn with modality names pre-filled.
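The partial-function pattern can be sketched as follows. The stacking logic shown is a simplified assumption using NumPy; the real mimo_collate_fn presumably builds framework tensors and handles text/placeholder tokens as well:

```python
from functools import partial
import numpy as np

def mimo_collate_fn(batch, modality_names):
    """Simplified sketch: stack each modality's arrays across the batch dimension."""
    collated = {}
    for name in modality_names:
        collated[name] = np.stack([example[name] for example in batch])
    return collated

def get_collate_fn(modality_names=("vision",)):
    # Pre-fill the modality names so a dataloader can call collate_fn(batch)
    # with a single argument, as the Returns section describes.
    return partial(mimo_collate_fn, modality_names=list(modality_names))
```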