bridge.data.vlm_datasets.preloaded_provider
#
Provider for datasets preloaded from JSON/JSONL files into conversation schema.
Module Contents#
Classes#
DatasetProvider that builds VLM conversation datasets from preloaded JSON/JSONL files. |
Functions#
Split legacy text containing “ |
|
Transform a single legacy record into an AutoProcessor-friendly conversation schema. Supports two input styles: |
|
API#
- bridge.data.vlm_datasets.preloaded_provider._split_text_by_placeholders(
- text: str,
- image_paths: List[str],
- video_paths: Optional[List[str]] = None,
Split legacy text containing “
”/”
- bridge.data.vlm_datasets.preloaded_provider._normalize_paths(
- paths: Optional[List[Any]],
- base_folder: Optional[str],
- bridge.data.vlm_datasets.preloaded_provider._record_to_conversation(
- record: Dict[str, Any],
- image_folder: Optional[str],
Transform a single legacy record into an AutoProcessor-friendly conversation schema. Supports two input styles:
{“conversation”: […]} already in HF schema -> passthrough
{“messages”: […], “images”: […], “videos”: […]} with
/
- bridge.data.vlm_datasets.preloaded_provider._load_preloaded_examples(
- path: str,
- class bridge.data.vlm_datasets.preloaded_provider.PreloadedVLMConversationProvider#
Bases:
megatron.bridge.training.config.DatasetProvider
DatasetProvider that builds VLM conversation datasets from preloaded JSON/JSONL files.
The provider converts legacy Qwen2/VL style records with ‘
’/’ - sequence_length: int#
None
- hf_processor_path: str#
‘Qwen/Qwen2.5-VL-3B-Instruct’
- train_data_path: Optional[str]#
None
- valid_data_path: Optional[str]#
None
- test_data_path: Optional[str]#
None
- image_folder: Optional[str]#
None
- skip_getting_attention_mask_from_dataset: bool#
True
- dataloader_type: Optional[Literal[single, cyclic, external]]#
‘single’
- _build_split_dataset(
- split_path: Optional[str],
- target_length: int,
- processor: Any,
- build_datasets(
- context: megatron.bridge.training.config.DatasetBuildContext,