bridge.data.hf_datasets.provider#

Provider that builds conversation datasets from HuggingFace datasets.

Module Contents#

Classes#

HFConversationDatasetProvider

DatasetProvider that builds conversation datasets from Hugging Face datasets.

Data#

API#

bridge.data.hf_datasets.provider.logger#

‘getLogger(…)’

class bridge.data.hf_datasets.provider.HFConversationDatasetProvider#

Bases: megatron.bridge.training.config.DatasetProvider

DatasetProvider that builds conversation datasets from Hugging Face datasets.

This provider leverages simple maker functions that return lists of examples with a messages or conversation schema understood by model processors. It binds a Hugging Face processor/tokenizer for the specified model and selects an appropriate collate function for batching.

HF data creation workflow: 1. A maker function loads a Hugging Face dataset split and normalizes each row into Bridge’s chat schema: messages for text-only rows or conversation for processor-ready multimodal rows. 2. ConversationDataset repeats that normalized list to the requested Megatron sample count, then binds the selected collate implementation. 3. The collate function renders chat templates, tokenizes the batch, and builds shifted labels/loss masks or model-specific visual inputs.

seq_length: int#

None

hf_processor_path: str | None#

None

maker_name: str#

None

maker_kwargs: Optional[Dict[str, Any]]#

None

val_maker_kwargs: Optional[Dict[str, Any]]#

None

test_maker_kwargs: Optional[Dict[str, Any]]#

None

do_validation: bool#

True

do_test: bool#

True

collate_impl: Optional[Callable[[list, Any], Dict[str, torch.Tensor]]]#

None

skip_getting_attention_mask_from_dataset: bool#

True

dataloader_type: Optional[Literal[single, cyclic, batch, external]]#

‘single’

enable_in_batch_packing: bool#

False

in_batch_packing_pad_to_multiple_of: int#

1

_collate_supports_packing(processor: Any) bool#
_get_maker() Callable[..., List[Dict[str, Any]]]#
_build_split_dataset(
split: str,
target_length: int,
processor: Any,
extra_kwargs: Optional[Dict[str, Any]] = None,
) Optional[megatron.bridge.data.hf_datasets.conversation_dataset.ConversationDataset]#
_load_processor_or_tokenizer(
tokenizer: Any | None = None,
) Any#
build_datasets(
context: megatron.bridge.training.config.DatasetBuildContext,
) Tuple[Optional[Any], Optional[Any], Optional[Any]]#