bridge.data.vlm_datasets.hf_provider#

Provider that builds conversation datasets from HuggingFace datasets.

Module Contents#

Classes#

HFDatasetConversationProvider

DatasetProvider that builds VLM conversation datasets from HF datasets.

API#

class bridge.data.vlm_datasets.hf_provider.HFDatasetConversationProvider#

Bases: megatron.bridge.training.config.DatasetProvider

DatasetProvider that builds VLM conversation datasets from HF datasets.

This provider leverages simple maker functions that return lists of examples with a ā€œconversationā€ schema understood by model processors. It binds a HuggingFace AutoProcessor for the specified model and selects an appropriate collate function for batching.

sequence_length: int#

None

hf_processor_path: str#

None

maker_name: str#

None

maker_kwargs: Optional[Dict[str, Any]]#

None

collate_impl: Optional[Callable[[list, Any], Dict[str, torch.Tensor]]]#

None

skip_getting_attention_mask_from_dataset: bool#

True

dataloader_type: Optional[Literal[single, cyclic, external]]#

ā€˜single’

_get_maker() Callable[..., List[Dict[str, Any]]]#
_build_split_dataset(
split: str,
target_length: int,
processor: Any,
) Optional[megatron.bridge.data.vlm_datasets.conversation_dataset.VLMConversationDataset]#
build_datasets(
context: megatron.bridge.training.config.DatasetBuildContext,
) Tuple[Optional[Any], Optional[Any], Optional[Any]]#