bridge.data.vlm_datasets.hf_provider
#
Provider that builds conversation datasets from HuggingFace datasets.
Module Contents#
Classes#
DatasetProvider that builds VLM conversation datasets from HF datasets. |
API#
- class bridge.data.vlm_datasets.hf_provider.HFDatasetConversationProvider#
Bases:
megatron.bridge.training.config.DatasetProvider
DatasetProvider that builds VLM conversation datasets from HF datasets.
This provider leverages simple maker functions that return lists of examples with a āconversationā schema understood by model processors. It binds a HuggingFace
AutoProcessor
for the specified model and selects an appropriate collate function for batching.- sequence_length: int#
None
- hf_processor_path: str#
None
- maker_name: str#
None
- maker_kwargs: Optional[Dict[str, Any]]#
None
- collate_impl: Optional[Callable[[list, Any], Dict[str, torch.Tensor]]]#
None
- skip_getting_attention_mask_from_dataset: bool#
True
- dataloader_type: Optional[Literal[single, cyclic, external]]#
āsingleā
- _get_maker() Callable[..., List[Dict[str, Any]]] #
- _build_split_dataset(
- split: str,
- target_length: int,
- processor: Any,
- build_datasets(
- context: megatron.bridge.training.config.DatasetBuildContext,