bridge.data.hf_datasets.text_sft_provider
Text SFT provider for Hugging Face datasets with offline packing support.
Module Contents
Classes
HFTextSFTDatasetProvider | Build text SFT datasets from Hugging Face makers via the standard SFT builder.
API
- class bridge.data.hf_datasets.text_sft_provider.HFTextSFTDatasetProvider
Bases: megatron.bridge.training.config.DatasetProvider
Build text SFT datasets from Hugging Face makers via the standard SFT builder.
Maker outputs are written as JSONL chat rows and then loaded through FinetuningDatasetBuilder. This preserves optional offline packed-sequence preparation through enable_offline_packing and PackedSequenceSpecs while keeping Hugging Face row normalization in megatron.bridge.data.hf_datasets.
- seq_length: int
None
- maker_name: str
None
- maker_kwargs: dict[str, Any] | None
None
- val_maker_kwargs: dict[str, Any] | None
None
- test_maker_kwargs: dict[str, Any] | None
None
- dataset_root: str | pathlib.Path | None
None
- seed: int
5678
- memmap_workers: int
1
- max_train_samples: int | None
None
- enable_offline_packing: bool
False
- offline_packing_specs: megatron.bridge.data.datasets.packed_sequence.PackedSequenceSpecs | None
None
- dataset_kwargs: dict[str, Any] | None
None
- val_proportion: float | None
None
- do_validation: bool
True
- do_test: bool
True
- rewrite: bool
False
- dataloader_type: Literal['single', 'cyclic', 'batch', 'external'] | None
'batch'
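The val_proportion and seed attributes, together with the _split_training_for_validation method listed below, suggest that a validation set can be held out from the training rows. The provider's actual splitting logic is not documented on this page, so the following is only a minimal sketch of one deterministic way to do it. The function name split_for_validation and its rounding and minimum-size rules are assumptions made for illustration, not the library's implementation.

```python
import random
from typing import Any


def split_for_validation(
    examples: list[dict[str, Any]],
    val_proportion: float,
    seed: int = 5678,
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Hypothetical helper: deterministically hold out a share of the rows."""
    if not 0.0 < val_proportion < 1.0:
        raise ValueError("val_proportion must be strictly between 0 and 1")
    if len(examples) < 2:
        # Too few rows to split: keep everything for training.
        return list(examples), []

    # Shuffle indices with a private RNG so the global random state is untouched.
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)

    # Assumed policy: hold out at least one row, but never the whole set.
    n_val = min(max(1, round(len(examples) * val_proportion)), len(examples) - 1)
    val_idx = set(indices[:n_val])

    # Preserve the original row order inside each split.
    train = [ex for i, ex in enumerate(examples) if i not in val_idx]
    val = [ex for i, ex in enumerate(examples) if i in val_idx]
    return train, val
```

Because the shuffle is driven only by the seed, calling the helper twice with the same rows and seed yields the same split, which is the property a fixed default such as seed = 5678 would typically be used for.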
- _default_dataset_root() -> pathlib.Path
- _dataset_root() -> pathlib.Path
- _effective_dataset_kwargs() -> dict[str, Any]
- _output_path(root: pathlib.Path, output_name: str) -> pathlib.Path
- _needs_write(root: pathlib.Path, output_name: str) -> bool
- _load_examples(*, split: str, extra_kwargs: dict[str, Any] | None)
- _write_examples(*, root: pathlib.Path, output_name: str, examples: list[dict[str, Any]])
- _write_split(*, root: pathlib.Path, output_name: str, split: str, extra_kwargs: dict[str, Any] | None)
- _split_training_for_validation(examples: list[dict[str, Any]])
- _write_train_and_validation_from_train(root: pathlib.Path) -> None
- _prepare_jsonl_data(root: pathlib.Path) -> None
- build_datasets(context: megatron.bridge.training.config.DatasetBuildContext)
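The class description says that maker outputs are written as JSONL chat rows before being loaded, and the rewrite attribute together with _output_path, _needs_write and _write_examples hints at a write-once cache under the dataset root. The sketch below illustrates that pattern with the standard library only. The file layout (<output_name>.jsonl), the explicit rewrite parameter, and the free-function form are assumptions for illustration; the real methods may behave differently.

```python
import json
from pathlib import Path
from typing import Any


def output_path(root: Path, output_name: str) -> Path:
    # Assumed layout: one "<output_name>.jsonl" file per split under the root.
    return root / f"{output_name}.jsonl"


def needs_write(root: Path, output_name: str, rewrite: bool = False) -> bool:
    # Regenerate when explicitly forced, or when the file does not exist yet.
    return rewrite or not output_path(root, output_name).exists()


def write_examples(
    root: Path, output_name: str, examples: list[dict[str, Any]]
) -> Path:
    # Write one JSON object per line (JSON Lines).
    root.mkdir(parents=True, exist_ok=True)
    path = output_path(root, output_name)
    with path.open("w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
    return path
```

A caller would check needs_write(root, "training", rewrite) before invoking the maker and write_examples, so an existing file is reused unless rewrite is set. The split name "training" and the shape of a chat row (for example a "messages" list of role/content pairs) are likewise assumptions, since neither is specified on this page.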