bridge.data.mimo.hf_provider#
HuggingFace dataset provider for MIMO multi-encoder models.
Module Contents#
Classes#
HFMimoDatasetProvider | DatasetProvider for MIMO models using HuggingFace datasets.
API#
- class bridge.data.mimo.hf_provider.HFMimoDatasetProvider#
Bases: megatron.bridge.training.config.DatasetProvider

DatasetProvider for MIMO models using HuggingFace datasets.
Loads datasets from HuggingFace Hub and applies per-modality processors to convert raw inputs (images, audio, text) into preprocessed tensors that MIMO encoder modules consume during training.
For testing with synthetic data, use MockMimoProvider instead.
- Parameters:
seq_length – Total sequence length for the model (encoder placeholders + text tokens). Must be greater than sum(encoder_seq_lengths.values()) to leave room for text. Text is truncated to fit: max_text_tokens = seq_length - total_encoder_tokens.
hf_dataset_path – HuggingFace dataset identifier, e.g., "liuhaotian/LLaVA-Instruct-150K".
hf_dataset_name – Optional dataset configuration name.
hf_tokenizer_path – HuggingFace tokenizer identifier.
processor_paths – Per-modality processor paths, e.g., {"vision": "openai/clip-vit-large-patch14"}.
special_token_ids – Per-encoder placeholder token IDs, e.g., {"vision": 32000}.
encoder_seq_lengths – Per-encoder output sequence lengths, e.g., {"vision": 577}. Determines how many placeholder tokens to insert for each modality.
modality_columns – Map modality name to dataset column, e.g., {"vision": "image"}.
text_column – Column name for text data. Default: "text".
train_split – Dataset split for training. Default: "train".
valid_split – Dataset split for validation. Default: "validation".
test_split – Dataset split for testing. Default: "test".
trust_remote_code – Whether to trust remote code for HF models/processors.
.. rubric:: Example
>>> provider = HFMimoDatasetProvider(
...     seq_length=2048,
...     hf_dataset_path="liuhaotian/LLaVA-Instruct-150K",
...     hf_tokenizer_path="meta-llama/Llama-2-7b-hf",
...     processor_paths={"vision": "openai/clip-vit-large-patch14"},
...     special_token_ids={"vision": 32000},
...     encoder_seq_lengths={"vision": 577},  # CLIP ViT-L/14 output tokens
...     modality_columns={"vision": "image"},
... )
>>> context = DatasetBuildContext(train_samples=10000, valid_samples=1000, test_samples=1000)
>>> train_ds, valid_ds, test_ds = provider.build_datasets(context)
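To make the seq_length constraint concrete, here is a small arithmetic sketch (the variable names mirror the parameters above; the numbers are the hypothetical ones from the example): text is truncated to whatever remains after the encoder placeholder tokens.

```python
# Hypothetical numbers illustrating the token budget described above.
# seq_length must exceed the sum of all encoder output lengths.
seq_length = 2048
encoder_seq_lengths = {"vision": 577}  # CLIP ViT-L/14 output tokens

total_encoder_tokens = sum(encoder_seq_lengths.values())
assert seq_length > total_encoder_tokens, "no room left for text tokens"

max_text_tokens = seq_length - total_encoder_tokens
print(max_text_tokens)  # 1471
```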
- seq_length: int#
None
- hf_dataset_path: str#
None
- hf_dataset_name: Optional[str]#
None
- hf_tokenizer_path: str = <Multiline-String>#
- processor_paths: Dict[str, str]#
'field(...)'
- special_token_ids: Dict[str, int]#
'field(...)'
- encoder_seq_lengths: Dict[str, int]#
'field(...)'
- modality_columns: Dict[str, str]#
'field(...)'
- text_column: str#
'text'
- train_split: str#
'train'
- valid_split: str#
'validation'
- test_split: str#
'test'
- _processors: Optional[Dict[str, Any]]#
'field(...)'
- _tokenizer: Optional[Any]#
'field(...)'
- _load_processors() → Dict[str, Any]#
Load HuggingFace processors for each modality.
- _load_tokenizer() → Any#
Load HuggingFace tokenizer.
- _load_hf_dataset(split: str) → Any#
Load a HuggingFace dataset split.
- _build_split_dataset(
- split: str,
- target_samples: int,
- processors: Dict[str, Any],
- tokenizer: Any,
Build dataset for a single split.
- build_datasets(
- context: megatron.bridge.training.config.DatasetBuildContext,
Build train, validation, and test datasets.
- Parameters:
context – Build context with sample counts.
- Returns:
Tuple of (train_dataset, valid_dataset, test_dataset). Any element can be None if the corresponding split doesn't exist or its sample count is 0.
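Because any element of the returned tuple can be None, callers should guard each split before use. A minimal sketch, using plain lists as stand-ins for the real dataset objects (the helper name describe_splits is hypothetical, not part of the library):

```python
# Hypothetical helper: summarize which splits were built.
# Lists stand in for real dataset objects; a missing split is None.
def describe_splits(train_ds, valid_ds, test_ds):
    parts = []
    for name, ds in (("train", train_ds), ("valid", valid_ds), ("test", test_ds)):
        parts.append(f"{name}={'absent' if ds is None else len(ds)}")
    return ", ".join(parts)

# e.g. a dataset whose "test" split doesn't exist:
print(describe_splits([0] * 100, [0] * 10, None))
# train=100, valid=10, test=absent
```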
- get_collate_fn() → Callable#
Return collate function for MIMO datasets.
- Returns:
Partial function of mimo_collate_fn with modality names pre-filled.
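The partial-binding pattern mentioned here can be sketched in isolation. This is an illustrative toy, not the real mimo_collate_fn (which operates on tensors): the modality names are pre-filled with functools.partial, so a DataLoader only needs to pass the list of samples.

```python
from functools import partial

# Toy collate step: gather text tokens and each modality's inputs by name.
# toy_collate and the sample dicts are illustrative assumptions, not library code.
def toy_collate(samples, modality_names):
    batch = {"input_ids": [s["input_ids"] for s in samples]}
    for name in modality_names:
        batch[name] = [s[name] for s in samples]
    return batch

# Pre-fill the modality names, as get_collate_fn does for mimo_collate_fn.
collate = partial(toy_collate, modality_names=["vision"])

batch = collate([
    {"input_ids": [1, 2], "vision": "img0"},
    {"input_ids": [3, 4], "vision": "img1"},
])
print(batch["vision"])  # ['img0', 'img1']
```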