bridge.data.hf_datasets.provider#
Provider that builds conversation datasets from HuggingFace datasets.
Module Contents#
Classes#
DatasetProvider that builds conversation datasets from Hugging Face datasets. |
Data#
API#
- bridge.data.hf_datasets.provider.logger#
‘getLogger(…)’
- class bridge.data.hf_datasets.provider.HFConversationDatasetProvider#
Bases:
megatron.bridge.training.config.DatasetProviderDatasetProvider that builds conversation datasets from Hugging Face datasets.
This provider leverages simple maker functions that return lists of examples with a
messagesorconversationschema understood by model processors. It binds a Hugging Face processor/tokenizer for the specified model and selects an appropriate collate function for batching.HF data creation workflow: 1. A maker function loads a Hugging Face dataset split and normalizes each row into Bridge’s chat schema:
messagesfor text-only rows orconversationfor processor-ready multimodal rows. 2.ConversationDatasetrepeats that normalized list to the requested Megatron sample count, then binds the selected collate implementation. 3. The collate function renders chat templates, tokenizes the batch, and builds shifted labels/loss masks or model-specific visual inputs.- seq_length: int#
None
- hf_processor_path: str | None#
None
- maker_name: str#
None
- maker_kwargs: Optional[Dict[str, Any]]#
None
- val_maker_kwargs: Optional[Dict[str, Any]]#
None
- test_maker_kwargs: Optional[Dict[str, Any]]#
None
- do_validation: bool#
True
- do_test: bool#
True
- collate_impl: Optional[Callable[[list, Any], Dict[str, torch.Tensor]]]#
None
- skip_getting_attention_mask_from_dataset: bool#
True
- dataloader_type: Optional[Literal[single, cyclic, batch, external]]#
‘single’
- enable_in_batch_packing: bool#
False
- in_batch_packing_pad_to_multiple_of: int#
1
- _collate_supports_packing(processor: Any) bool#
- _get_maker() Callable[..., List[Dict[str, Any]]]#
- _build_split_dataset(
- split: str,
- target_length: int,
- processor: Any,
- extra_kwargs: Optional[Dict[str, Any]] = None,
- _load_processor_or_tokenizer(
- tokenizer: Any | None = None,
- build_datasets(
- context: megatron.bridge.training.config.DatasetBuildContext,