bridge.data.mimo.hf_provider#

HuggingFace dataset provider for MIMO multi-encoder models.

Module Contents#

Classes#

HFMimoDatasetProvider

DatasetProvider for MIMO models using HuggingFace datasets.

API#

class bridge.data.mimo.hf_provider.HFMimoDatasetProvider#

Bases: megatron.bridge.training.config.DatasetProvider

DatasetProvider for MIMO models using HuggingFace datasets.

Loads datasets from HuggingFace Hub and applies per-modality processors to convert raw inputs (images, audio, text) into preprocessed tensors that MIMO encoder modules consume during training.

For testing with synthetic data, use MockMimoProvider instead.

Parameters:
  • seq_length – Total sequence length for the model (encoder placeholders + text tokens). Must be greater than sum(encoder_seq_lengths.values()) to leave room for text. Text is truncated to fit: max_text_tokens = seq_length - total_encoder_tokens.

• hf_dataset_path – HuggingFace dataset identifier, e.g., "liuhaotian/LLaVA-Instruct-150K".

• hf_dataset_name – Optional dataset configuration name.

• hf_tokenizer_path – HuggingFace tokenizer identifier.

• processor_paths – Per-modality processor paths, e.g., {"vision": "openai/clip-vit-large-patch14"}.

• special_token_ids – Per-encoder placeholder token IDs, e.g., {"vision": 32000}.

• encoder_seq_lengths – Per-encoder output sequence lengths, e.g., {"vision": 577}. Determines how many placeholder tokens to insert for each modality.

• modality_columns – Map modality name to dataset column, e.g., {"vision": "image"}.

• text_column – Column name for text data. Default: "text".

• train_split – Dataset split for training. Default: "train".

• valid_split – Dataset split for validation. Default: "validation".

• test_split – Dataset split for testing. Default: "test".

• trust_remote_code – Whether to trust remote code for HF models/processors.
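The sequence-length budget described for seq_length can be checked with a few lines of arithmetic. This is an illustrative sketch using the example values from this page, not part of the class itself:

```python
# Sketch of the seq_length budget: encoder placeholder tokens plus text
# tokens must fit in the total sequence length.
seq_length = 2048
encoder_seq_lengths = {"vision": 577}  # e.g., CLIP ViT-L/14 output tokens

total_encoder_tokens = sum(encoder_seq_lengths.values())
assert seq_length > total_encoder_tokens, "no room left for text tokens"

# Text is truncated to whatever room the placeholders leave.
max_text_tokens = seq_length - total_encoder_tokens
print(max_text_tokens)  # 1471
```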

Example

provider = HFMimoDatasetProvider(
    seq_length=2048,
    hf_dataset_path="liuhaotian/LLaVA-Instruct-150K",
    hf_tokenizer_path="meta-llama/Llama-2-7b-hf",
    processor_paths={"vision": "openai/clip-vit-large-patch14"},
    special_token_ids={"vision": 32000},
    encoder_seq_lengths={"vision": 577},  # CLIP ViT-L/14 output tokens
    modality_columns={"vision": "image"},
)
context = DatasetBuildContext(train_samples=10000, valid_samples=1000, test_samples=1000)
train_ds, valid_ds, test_ds = provider.build_datasets(context)

seq_length: int#

None

hf_dataset_path: str#

None

hf_dataset_name: Optional[str]#

None

hf_tokenizer_path: str = <Multiline-String>#

processor_paths: Dict[str, str]#

'field(...)'

special_token_ids: Dict[str, int]#

'field(...)'

encoder_seq_lengths: Dict[str, int]#

'field(...)'

modality_columns: Dict[str, str]#

'field(...)'

text_column: str#

'text'

train_split: str#

'train'

valid_split: str#

'validation'

test_split: str#

'test'

_processors: Optional[Dict[str, Any]]#

'field(...)'

_tokenizer: Optional[Any]#

'field(...)'

_load_processors() → Dict[str, Any]#

Load HuggingFace processors for each modality.

_load_tokenizer() → Any#

Load HuggingFace tokenizer.

_load_hf_dataset(split: str) → Any#

Load a HuggingFace dataset split.

_build_split_dataset(
split: str,
target_samples: int,
processors: Dict[str, Any],
tokenizer: Any,
) → Optional[megatron.bridge.data.mimo.dataset.MimoDataset]#

Build dataset for a single split.

build_datasets(
context: megatron.bridge.training.config.DatasetBuildContext,
) → Tuple[Optional[torch.utils.data.Dataset], Optional[torch.utils.data.Dataset], Optional[torch.utils.data.Dataset]]#

Build train, validation, and test datasets.

Parameters:

context – Build context with sample counts.

Returns:

Tuple of (train_dataset, valid_dataset, test_dataset). Any element can be None if the corresponding split doesn't exist or its sample count is 0.

get_collate_fn() → Callable#

Return collate function for MIMO datasets.

Returns:

Partial function of mimo_collate_fn with modality names pre-filled.
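The "modality names pre-filled" pattern can be illustrated with functools.partial. Note this is a toy sketch: example_collate_fn below is hypothetical and does not reproduce mimo_collate_fn's real signature or tensor stacking.

```python
from functools import partial

# Toy stand-in for mimo_collate_fn (hypothetical; the real function lives
# in megatron.bridge and stacks tensors). It only shows how pre-filling
# modality names with functools.partial yields a one-argument collate fn.
def example_collate_fn(batch, modality_names):
    # Group each modality's samples from the batch into a list.
    return {name: [sample[name] for sample in batch] for name in modality_names}

collate = partial(example_collate_fn, modality_names=["vision"])
print(collate([{"vision": "img0"}, {"vision": "img1"}]))
# {'vision': ['img0', 'img1']}
```

The resulting partial matches the single-argument signature that torch.utils.data.DataLoader expects for its collate_fn parameter.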