bridge.data.vlm_datasets.preloaded_provider#

Provider for datasets preloaded from JSON/JSONL files into conversation schema.

Module Contents#

Classes#

PreloadedVLMConversationProvider

DatasetProvider that builds VLM conversation datasets from preloaded JSON/JSONL files.

Functions#

_split_text_by_placeholders

Split legacy text containing “”/”

_normalize_paths

_record_to_conversation

Transform a single legacy record into an AutoProcessor-friendly conversation schema. Supports two input styles:

_load_preloaded_examples

API#

bridge.data.vlm_datasets.preloaded_provider._split_text_by_placeholders(
text: str,
image_paths: List[str],
video_paths: Optional[List[str]] = None,
) List[Dict[str, Any]]#

Split legacy text containing “”/”

bridge.data.vlm_datasets.preloaded_provider._normalize_paths(
paths: Optional[List[Any]],
base_folder: Optional[str],
) Optional[List[Any]]#
bridge.data.vlm_datasets.preloaded_provider._record_to_conversation(
record: Dict[str, Any],
image_folder: Optional[str],
) Optional[List[Dict[str, Any]]]#

Transform a single legacy record into an AutoProcessor-friendly conversation schema. Supports two input styles:

  • {“conversation”: […]} already in HF schema -> passthrough

  • {“messages”: […], “images”: […], “videos”: […]} with /

bridge.data.vlm_datasets.preloaded_provider._load_preloaded_examples(
path: str,
) List[Dict[str, Any]]#
class bridge.data.vlm_datasets.preloaded_provider.PreloadedVLMConversationProvider#

Bases: megatron.bridge.training.config.DatasetProvider

DatasetProvider that builds VLM conversation datasets from preloaded JSON/JSONL files.

The provider converts legacy Qwen2/VL style records with ‘’/’

sequence_length: int#

None

hf_processor_path: str#

‘Qwen/Qwen2.5-VL-3B-Instruct’

train_data_path: Optional[str]#

None

valid_data_path: Optional[str]#

None

test_data_path: Optional[str]#

None

image_folder: Optional[str]#

None

skip_getting_attention_mask_from_dataset: bool#

True

dataloader_type: Optional[Literal[single, cyclic, external]]#

‘single’

_build_split_dataset(
split_path: Optional[str],
target_length: int,
processor: Any,
) Optional[megatron.bridge.data.vlm_datasets.conversation_dataset.VLMConversationDataset]#
build_datasets(
context: megatron.bridge.training.config.DatasetBuildContext,
) Tuple[Optional[Any], Optional[Any], Optional[Any]]#