bridge.data.hf_datasets.text_sft_provider#

Text SFT provider for Hugging Face datasets with offline packing support.

Module Contents#

Classes#

HFTextSFTDatasetProvider

Build text SFT datasets from Hugging Face makers via the standard SFT builder.

API#

class bridge.data.hf_datasets.text_sft_provider.HFTextSFTDatasetProvider#

Bases: megatron.bridge.training.config.DatasetProvider

Build text SFT datasets from Hugging Face makers via the standard SFT builder.

Maker outputs are written as JSONL chat rows and then loaded through FinetuningDatasetBuilder. This preserves optional offline packed-sequence preparation through enable_offline_packing and PackedSequenceSpecs while keeping Hugging Face row normalization in megatron.bridge.data.hf_datasets.

seq_length: int#

None

maker_name: str#

None

maker_kwargs: dict[str, Any] | None#

None

val_maker_kwargs: dict[str, Any] | None#

None

test_maker_kwargs: dict[str, Any] | None#

None

dataset_root: str | pathlib.Path | None#

None

seed: int#

5678

memmap_workers: int#

1

max_train_samples: int | None#

None

enable_offline_packing: bool#

False

offline_packing_specs: megatron.bridge.data.datasets.packed_sequence.PackedSequenceSpecs | None#

None

dataset_kwargs: dict[str, Any] | None#

None

val_proportion: float | None#

None

do_validation: bool#

True

do_test: bool#

True

rewrite: bool#

False

dataloader_type: Literal[single, cyclic, batch, external] | None#

‘batch’

_default_dataset_root() pathlib.Path#
_dataset_root() pathlib.Path#
_effective_dataset_kwargs() dict[str, Any]#
_output_path(root: pathlib.Path, output_name: str) pathlib.Path#
_needs_write(root: pathlib.Path, output_name: str) bool#
_load_examples(
*,
split: str,
extra_kwargs: dict[str, Any] | None,
) list[dict[str, Any]]#
_write_examples(
*,
root: pathlib.Path,
output_name: str,
examples: list[dict[str, Any]],
) None#
_write_split(
*,
root: pathlib.Path,
output_name: str,
split: str,
extra_kwargs: dict[str, Any] | None,
) None#
_split_training_for_validation(
examples: list[dict[str, Any]],
) tuple[list[dict[str, Any]], list[dict[str, Any]]]#
_write_train_and_validation_from_train(root: pathlib.Path) None#
_prepare_jsonl_data(root: pathlib.Path) None#
build_datasets(
context: megatron.bridge.training.config.DatasetBuildContext,
) tuple[Any | None, Any | None, Any | None]#