bridge.data.builders.hf_dataset#
Module Contents#
Classes#

ProcessExampleOutput – Expected output structure from a ProcessExampleFn.

ProcessExampleFn – Protocol defining the signature for a function that processes a single dataset example.

HFDatasetConfig – Configuration specific to using Hugging Face datasets for finetuning.

HFDatasetBuilder – Builder class for Hugging Face datasets.

Functions#

preprocess_and_split_data – Download, preprocess, split, and save a Hugging Face dataset to JSONL files.
Data#
API#
- bridge.data.builders.hf_dataset.logger#
getLogger(...)
- class bridge.data.builders.hf_dataset.ProcessExampleOutput#
Bases:
typing.TypedDict
Expected output structure from a ProcessExampleFn.

Initialization

Initialize self. See help(type(self)) for accurate signature.
- input: str#
None
- output: str#
None
- original_answers: list[str]#
None
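For illustration, the three documented fields can be mirrored in a standalone `TypedDict` (a sketch only; the real class lives in `bridge.data.builders.hf_dataset` and inherits from `typing.TypedDict` the same way):

```python
from typing import TypedDict


class ProcessExampleOutput(TypedDict):
    """Mirrors the documented fields: formatted model input, target
    output, and the list of reference answers from the raw example."""
    input: str
    output: str
    original_answers: list[str]


# TypedDict instances are plain dicts at runtime.
example: ProcessExampleOutput = {
    "input": "Question: What is 2 + 2?",
    "output": "4",
    "original_answers": ["4", "four"],
}
```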
- class bridge.data.builders.hf_dataset.ProcessExampleFn#
Bases:
typing.Protocol
Protocol defining the signature for a function that processes a single dataset example.
- __call__(
- example: dict[str, Any],
- tokenizer: Optional[megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer] = None,
- ) -> bridge.data.builders.hf_dataset.ProcessExampleOutput#
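A function satisfying this protocol takes a raw example dict (plus an optional tokenizer) and returns a dict with the `input`/`output`/`original_answers` shape. A minimal standalone sketch for a hypothetical QA-style dataset — the field names `context`, `question`, and `answers` are illustrative, not taken from any specific dataset:

```python
from typing import Any, Optional


def process_example(
    example: dict[str, Any],
    tokenizer: Optional[Any] = None,  # a MegatronTokenizer in real use
) -> dict[str, Any]:
    """Format a raw QA example into the input/output/original_answers shape."""
    answers = example.get("answers", [])
    return {
        "input": (
            f"Context: {example['context']}\n"
            f"Question: {example['question']}\nAnswer:"
        ),
        "output": answers[0] if answers else "",
        "original_answers": list(answers),
    }


processed = process_example(
    {
        "context": "Paris is the capital of France.",
        "question": "What is the capital of France?",
        "answers": ["Paris"],
    }
)
```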
- class bridge.data.builders.hf_dataset.HFDatasetConfig#
Bases:
megatron.bridge.training.config.FinetuningDatasetConfig
Configuration specific to using Hugging Face datasets for finetuning.
Inherits from FinetuningDatasetConfig and adds HF-specific options.
.. attribute:: dataset_name
Name of the dataset on the Hugging Face Hub.
.. attribute:: process_example_fn
A callable conforming to ProcessExampleFn protocol to process raw examples into the desired format.
.. attribute:: dataset_subset
Optional subset name if the dataset has multiple subsets.
.. attribute:: dataset_dict
Optional pre-loaded DatasetDict to use instead of downloading.
.. attribute:: split
Optional specific split to load (e.g., 'train[:10%]').
.. attribute:: download_mode
Download mode for load_dataset (e.g., 'force_redownload').
.. attribute:: val_proportion
Proportion of the training set to use for validation if no validation set is present.
.. attribute:: split_val_from_train
If True, creates validation set from training set. If False, uses test set to create validation set.
.. attribute:: delete_raw
If True, delete the raw downloaded dataset files after processing.
.. attribute:: rewrite
If True, rewrite existing processed files.
.. attribute:: hf_kwargs
Additional keyword arguments to pass to load_dataset.
.. attribute:: hf_filter_lambda
Optional function to filter the loaded dataset.
.. attribute:: hf_filter_lambda_kwargs
Optional keyword arguments for hf_filter_lambda.
- dataset_name: str#
None
- process_example_fn: bridge.data.builders.hf_dataset.ProcessExampleFn#
None
- dataset_subset: Optional[str]#
None
- dataset_dict: Optional[datasets.DatasetDict]#
None
- split: Optional[str]#
None
- download_mode: Optional[str]#
None
- val_proportion: Optional[float]#
0.05
- split_val_from_train: bool#
True
- delete_raw: bool#
False
- rewrite: bool#
True
- hf_kwargs: Optional[dict[str, Any]]#
None
- hf_filter_lambda: Optional[Callable]#
None
- hf_filter_lambda_kwargs: Optional[dict[str, Any]]#
None
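The default values listed above can be collected in a plain dict for quick reference (a sketch of the documented defaults only, not the real dataclass, which also inherits fields from FinetuningDatasetConfig):

```python
# Documented defaults for HFDatasetConfig's HF-specific fields (sketch).
hf_dataset_defaults = {
    "dataset_subset": None,
    "dataset_dict": None,
    "split": None,
    "download_mode": None,
    "val_proportion": 0.05,        # 5% of train held out for validation
    "split_val_from_train": True,  # carve validation out of the train split
    "delete_raw": False,           # keep the raw HF cache after processing
    "rewrite": True,               # overwrite existing processed files
    "hf_kwargs": None,
    "hf_filter_lambda": None,
    "hf_filter_lambda_kwargs": None,
}
```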
- bridge.data.builders.hf_dataset.preprocess_and_split_data(
- dset: datasets.DatasetDict,
- dataset_name: str,
- dataset_root: pathlib.Path,
- tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
- process_example_fn: bridge.data.builders.hf_dataset.ProcessExampleFn,
- split_val_from_train: bool = True,
- val_proportion: Optional[float] = None,
- train_aliases: tuple[str] = ('train', 'training'),
- test_aliases: tuple[str] = ('test', 'testing'),
- val_aliases: tuple[str] = ('val', 'validation', 'valid', 'eval'),
- delete_raw: bool = False,
- seed: int = 1234,
- rewrite: bool = False,
- do_test: bool = True,
- do_validation: bool = True,
- )
Download, preprocess, split, and save a Hugging Face dataset to JSONL files.
Handles splitting into train/validation/test sets based on available splits and the val_proportion parameter. Processes each example using the provided process_example_fn and saves the results.

- Parameters:
dset – The loaded Hugging Face DatasetDict.
dataset_name – Name of the dataset (for logging).
dataset_root – The root directory to save the processed JSONL files.
tokenizer – The tokenizer instance.
process_example_fn – Function to process individual examples.
split_val_from_train – If True, split validation from train set. Otherwise, split from test set (if available).
val_proportion – Proportion of data to use for validation split.
train_aliases – Tuple of possible names for the training split.
test_aliases – Tuple of possible names for the test split.
val_aliases – Tuple of possible names for the validation split.
delete_raw – If True, delete raw HF dataset cache after processing.
seed – Random seed for splitting.
rewrite – If True, overwrite existing processed files.
do_test – Whether to build the test set.
do_validation – Whether to build the validation set.
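The splitting behavior described above can be sketched standalone: when no validation split exists, a `val_proportion` slice is carved from train (or from test, when `split_val_from_train=False`), shuffled with the given seed for reproducibility, and each split is written to a JSONL file. Function and file names here are illustrative, not the real implementation:

```python
import json
import random
from pathlib import Path
from tempfile import TemporaryDirectory


def split_and_save(
    dset: dict[str, list[dict]],
    dataset_root: Path,
    val_proportion: float = 0.05,
    split_val_from_train: bool = True,
    seed: int = 1234,
) -> dict[str, int]:
    """Carve a validation set out of train (or test) and write JSONL files."""
    train = list(dset.get("train", []))
    test = list(dset.get("test", []))
    val = list(dset.get("validation", []))
    if not val and val_proportion:
        # Choose the source split to carve validation from.
        source = train if split_val_from_train or not test else test
        rng = random.Random(seed)  # seeded for reproducible splits
        rng.shuffle(source)
        n_val = int(len(source) * val_proportion)
        val, remainder = source[:n_val], source[n_val:]
        if split_val_from_train or not test:
            train = remainder
        else:
            test = remainder
    counts = {}
    for name, rows in (("training", train), ("validation", val), ("test", test)):
        with (dataset_root / f"{name}.jsonl").open("w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
        counts[name] = len(rows)
    return counts


with TemporaryDirectory() as tmp:
    counts = split_and_save(
        {"train": [{"input": str(i), "output": str(i)} for i in range(100)]},
        Path(tmp),
        val_proportion=0.1,
    )
# With 100 train rows and val_proportion=0.1: 90 train, 10 validation.
```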
- class bridge.data.builders.hf_dataset.HFDatasetBuilder(
- dataset_name: str,
- tokenizer,
- process_example_fn: bridge.data.builders.hf_dataset.ProcessExampleFn,
- dataset_dict: Optional[datasets.DatasetDict] = None,
- dataset_subset: Optional[str] = None,
- dataset_root: Optional[Union[str, pathlib.Path]] = None,
- split=None,
- seq_length=1024,
- seed: int = 1234,
- memmap_workers: int = 1,
- max_train_samples: Optional[int] = None,
- packed_sequence_specs: Optional[megatron.bridge.data.datasets.packed_sequence.PackedSequenceSpecs] = None,
- download_mode: Optional[str] = None,
- val_proportion: Optional[float] = 0.05,
- split_val_from_train: bool = True,
- rewrite: bool = True,
- delete_raw: bool = False,
- hf_kwargs: Optional[dict[str, Any]] = None,
- dataset_kwargs: Optional[dict[str, Any]] = None,
- hf_filter_lambda: Optional[Callable] = None,
- hf_filter_lambda_kwargs: Optional[dict[str, Any]] = None,
- do_validation: bool = True,
- do_test: bool = True,
- )
Bases:
megatron.bridge.data.builders.finetuning_dataset.FinetuningDatasetBuilder
Builder class for Hugging Face datasets.
This class extends FinetuningDatasetBuilder to work with Hugging Face datasets instead of file paths. It provides methods to build datasets from Hugging Face's datasets library.
Initialization
Initializes the HFDatasetBuilder.
- Parameters:
dataset_name – Name of the dataset on Hugging Face Hub.
tokenizer – The tokenizer instance.
is_built_on_rank – Callable to determine if data should be built on the current rank.
process_example_fn – Function conforming to ProcessExampleFn protocol.
dataset_dict – Optional pre-loaded DatasetDict.
dataset_subset – Optional dataset subset name.
dataset_root – Optional root directory for data; defaults based on dataset_name.
split – Optional specific split to load.
seq_length – Sequence length for processing.
seed – Random seed.
memmap_workers – Number of workers for memmapping.
max_train_samples – Optional maximum number of training samples.
packed_sequence_specs – Optional PackedSequenceSpecs for packed sequence datasets.
download_mode – Download mode for load_dataset.
val_proportion – Proportion for validation split.
split_val_from_train – Whether to split validation from train set.
rewrite – Whether to rewrite existing processed files.
delete_raw – Whether to delete raw downloaded files.
hf_kwargs – Additional kwargs for load_dataset.
dataset_kwargs – Additional kwargs for the underlying dataset constructor.
hf_filter_lambda – Optional function to filter the dataset.
hf_filter_lambda_kwargs – Optional kwargs for the filter function.
do_validation – Whether to build the validation set.
do_test – Whether to build the test set.
- prepare_data() -> None#
Loads/downloads the dataset, filters it, preprocesses/splits it, and prepares memmaps.
- _load_dataset() -> datasets.DatasetDict#
Load the dataset from Hugging Face or use the provided dataset.
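The load path can be sketched as: return the pre-loaded `dataset_dict` when one was provided, otherwise download via `datasets.load_dataset` with the configured name, subset, split, and extra kwargs. A standalone sketch with the loader injected as a parameter so it runs without the `datasets` library installed (names are illustrative):

```python
from typing import Any, Callable, Optional


def load_dataset_dict(
    dataset_name: str,
    dataset_dict: Optional[dict] = None,
    dataset_subset: Optional[str] = None,
    split: Optional[str] = None,
    loader: Optional[Callable[..., dict]] = None,  # stand-in for datasets.load_dataset
    **hf_kwargs: Any,
) -> dict:
    """Return the provided dataset if given, else download it."""
    if dataset_dict is not None:
        return dataset_dict  # pre-loaded data short-circuits the download
    args = [dataset_name] + ([dataset_subset] if dataset_subset else [])
    return loader(*args, split=split, **hf_kwargs)


# A pre-loaded DatasetDict-like mapping is returned unchanged;
# the loader is never called.
provided = {"train": [{"text": "hello"}]}
result = load_dataset_dict(
    "some_dataset",
    dataset_dict=provided,
    loader=lambda *a, **k: {},
)
```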