bridge.data.builders.hf_dataset#

Module Contents#

Classes#

ProcessExampleOutput

Expected output structure from a ProcessExampleFn.

ProcessExampleFn

Protocol defining the signature for a function that processes a single dataset example.

HFDatasetConfig

Configuration specific to using Hugging Face datasets for finetuning.

HFDatasetBuilder

Builder class for Hugging Face datasets.

Functions#

preprocess_and_split_data

Download, preprocess, split, and save a Hugging Face dataset to JSONL files.

Data#

API#

bridge.data.builders.hf_dataset.logger#

'getLogger(...)'

class bridge.data.builders.hf_dataset.ProcessExampleOutput#

Bases: typing.TypedDict

Expected output structure from a ProcessExampleFn.

input: str#

None

output: str#

None

original_answers: list[str]#

None

class bridge.data.builders.hf_dataset.ProcessExampleFn#

Bases: typing.Protocol

Protocol defining the signature for a function that processes a single dataset example.

__call__(
example: dict[str, Any],
tokenizer: Optional[megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer] = None,
) → bridge.data.builders.hf_dataset.ProcessExampleOutput#

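
Any callable with this shape satisfies the protocol. Below is a minimal sketch assuming a SQuAD-style example layout; the `question`/`answers` field names and the local `ProcessExampleOutput` mirror are illustrative, not part of the library, and the tokenizer argument is accepted but unused here.

```python
from typing import Any, Optional, TypedDict


class ProcessExampleOutput(TypedDict):
    """Local stand-in mirroring the TypedDict documented above (sketch only)."""

    input: str
    output: str
    original_answers: list[str]


def process_squad_example(
    example: dict[str, Any], tokenizer: Optional[Any] = None
) -> ProcessExampleOutput:
    # Map raw dataset fields to the prompt/response structure; the
    # "question"/"answers" field names are hypothetical and dataset-specific.
    answers = example["answers"]["text"]
    return ProcessExampleOutput(
        input=f"Question: {example['question']} Answer:",
        output=answers[0],
        original_answers=answers,
    )


result = process_squad_example(
    {"question": "What is 2+2?", "answers": {"text": ["4", "four"]}}
)
```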
class bridge.data.builders.hf_dataset.HFDatasetConfig#

Bases: megatron.bridge.training.config.FinetuningDatasetConfig

Configuration specific to using Hugging Face datasets for finetuning.

Inherits from FinetuningDatasetConfig and adds HF-specific options.

.. attribute:: dataset_name

Name of the dataset on the Hugging Face Hub.

.. attribute:: process_example_fn

A callable conforming to ProcessExampleFn protocol to process raw examples into the desired format.

.. attribute:: dataset_subset

Optional subset name if the dataset has multiple subsets.

.. attribute:: dataset_dict

Optional pre-loaded DatasetDict to use instead of downloading.

.. attribute:: split

Optional specific split to load (e.g., 'train[:10%]').

.. attribute:: download_mode

Download mode for load_dataset (e.g., 'force_redownload').

.. attribute:: val_proportion

Proportion of the training set to use for validation if no validation set is present.

.. attribute:: split_val_from_train

If True, creates validation set from training set. If False, uses test set to create validation set.

.. attribute:: delete_raw

If True, delete the raw downloaded dataset files after processing.

.. attribute:: rewrite

If True, rewrite existing processed files.

.. attribute:: hf_kwargs

Additional keyword arguments to pass to load_dataset.

.. attribute:: hf_filter_lambda

Optional function to filter the loaded dataset.

.. attribute:: hf_filter_lambda_kwargs

Optional keyword arguments for hf_filter_lambda.

dataset_name: str#

None

process_example_fn: bridge.data.builders.hf_dataset.ProcessExampleFn#

None

dataset_subset: Optional[str]#

None

dataset_dict: Optional[datasets.DatasetDict]#

None

split: Optional[str]#

None

download_mode: Optional[str]#

None

val_proportion: Optional[float]#

0.05

split_val_from_train: bool#

True

delete_raw: bool#

False

rewrite: bool#

True

hf_kwargs: Optional[dict[str, Any]]#

None

hf_filter_lambda: Optional[Callable]#

None

hf_filter_lambda_kwargs: Optional[dict[str, Any]]#

None
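
As an illustration, the fields above could be populated as follows. This is a sketch of plausible values shown as a plain keyword dict rather than an actual `HFDatasetConfig` instantiation; the `"squad"` dataset name and the split slice are arbitrary examples.

```python
# Hypothetical configuration values mirroring the fields documented above;
# in real use these would be passed to HFDatasetConfig (sketch only).
hf_config_kwargs = {
    "dataset_name": "squad",       # dataset on the Hugging Face Hub (example)
    "dataset_subset": None,        # no subset needed for this dataset
    "split": "train[:10%]",        # optional split slice syntax
    "download_mode": None,         # use the default caching behavior
    "val_proportion": 0.05,        # default: hold out 5% for validation
    "split_val_from_train": True,  # carve validation out of the train split
    "delete_raw": False,           # keep the raw downloaded files
    "rewrite": True,               # overwrite existing processed files
}
```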

bridge.data.builders.hf_dataset.preprocess_and_split_data(
dset: datasets.DatasetDict,
dataset_name: str,
dataset_root: pathlib.Path,
tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
process_example_fn: bridge.data.builders.hf_dataset.ProcessExampleFn,
split_val_from_train: bool = True,
val_proportion: Optional[float] = None,
train_aliases: tuple[str] = ('train', 'training'),
test_aliases: tuple[str] = ('test', 'testing'),
val_aliases: tuple[str] = ('val', 'validation', 'valid', 'eval'),
delete_raw: bool = False,
seed: int = 1234,
rewrite: bool = False,
do_test: bool = True,
do_validation: bool = True,
)#

Download, preprocess, split, and save a Hugging Face dataset to JSONL files.

Handles splitting into train/validation/test sets based on available splits and the val_proportion parameter. Processes each example using the provided process_example_fn and saves the results.

Parameters:
  • dset – The loaded Hugging Face DatasetDict.

  • dataset_name – Name of the dataset (for logging).

  • dataset_root – The root directory to save the processed JSONL files.

  • tokenizer – The tokenizer instance.

  • process_example_fn – Function to process individual examples.

  • split_val_from_train – If True, split validation from train set. Otherwise, split from test set (if available).

  • val_proportion – Proportion of data to use for validation split.

  • train_aliases – Tuple of possible names for the training split.

  • test_aliases – Tuple of possible names for the test split.

  • val_aliases – Tuple of possible names for the validation split.

  • delete_raw – If True, delete raw HF dataset cache after processing.

  • seed – Random seed for splitting.

  • rewrite – If True, overwrite existing processed files.

class bridge.data.builders.hf_dataset.HFDatasetBuilder(
dataset_name: str,
tokenizer,
process_example_fn: bridge.data.builders.hf_dataset.ProcessExampleFn,
dataset_dict: Optional[datasets.DatasetDict] = None,
dataset_subset: Optional[str] = None,
dataset_root: Optional[Union[str, pathlib.Path]] = None,
split=None,
seq_length=1024,
seed: int = 1234,
memmap_workers: int = 1,
max_train_samples: Optional[int] = None,
packed_sequence_specs: Optional[megatron.bridge.data.datasets.packed_sequence.PackedSequenceSpecs] = None,
download_mode: Optional[str] = None,
val_proportion: Optional[float] = 0.05,
split_val_from_train: bool = True,
rewrite: bool = True,
delete_raw: bool = False,
hf_kwargs: Optional[dict[str, Any]] = None,
dataset_kwargs: Optional[dict[str, Any]] = None,
hf_filter_lambda: Optional[Callable] = None,
hf_filter_lambda_kwargs: Optional[dict[str, Any]] = None,
do_validation: bool = True,
do_test: bool = True,
)#

Bases: megatron.bridge.data.builders.finetuning_dataset.FinetuningDatasetBuilder

Builder class for Hugging Face datasets.

This class extends FinetuningDatasetBuilder to work with Hugging Face datasets instead of file paths. It provides methods to build datasets from Hugging Face’s datasets library.

Initialization

Initializes the HFDatasetBuilder.

Parameters:
  • dataset_name – Name of the dataset on Hugging Face Hub.

  • tokenizer – The tokenizer instance.

  • is_built_on_rank – Callable to determine if data should be built on the current rank.

  • process_example_fn – Function conforming to ProcessExampleFn protocol.

  • dataset_dict – Optional pre-loaded DatasetDict.

  • dataset_subset – Optional dataset subset name.

  • dataset_root – Optional root directory for data; defaults based on dataset_name.

  • split – Optional specific split to load.

  • seq_length – Sequence length for processing.

  • seed – Random seed.

  • memmap_workers – Number of workers for memmapping.

  • max_train_samples – Optional maximum number of training samples.

  • packed_sequence_specs – Optional PackedSequenceSpecs for packed sequence datasets.

  • download_mode – Download mode for load_dataset.

  • val_proportion – Proportion for validation split.

  • split_val_from_train – Whether to split validation from train set.

  • rewrite – Whether to rewrite existing processed files.

  • delete_raw – Whether to delete raw downloaded files.

  • hf_kwargs – Additional kwargs for load_dataset.

  • dataset_kwargs – Additional kwargs for the underlying dataset constructor.

  • hf_filter_lambda – Optional function to filter the dataset.

  • hf_filter_lambda_kwargs – Optional kwargs for the filter function.

  • do_validation – Whether to build the validation set.

  • do_test – Whether to build the test set.

prepare_data() None#

Loads/downloads the dataset, filters it, preprocesses/splits it, and prepares memmaps.

_load_dataset() datasets.DatasetDict#

Load the dataset from Hugging Face or use the provided dataset.
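
The overall flow (load splits, process each example, write one JSONL file per split) can be sketched with standard-library stand-ins; the `process_example` helper, file naming, and raw field names here are assumptions, not the builder's actual implementation.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for a loaded DatasetDict: split name -> list of raw examples.
raw_splits = {
    "train": [
        {"question": "q1", "answer": "a1"},
        {"question": "q2", "answer": "a2"},
    ],
    "validation": [{"question": "q3", "answer": "a3"}],
}


def process_example(example):
    # Hypothetical processing step standing in for a ProcessExampleFn.
    return {
        "input": example["question"],
        "output": example["answer"],
        "original_answers": [example["answer"]],
    }


def save_splits_as_jsonl(splits, dataset_root):
    # Process every example and write one JSONL file per split under
    # dataset_root (the file naming scheme is an assumption).
    dataset_root = Path(dataset_root)
    dataset_root.mkdir(parents=True, exist_ok=True)
    paths = {}
    for split_name, examples in splits.items():
        path = dataset_root / f"{split_name}.jsonl"
        with path.open("w") as f:
            for ex in examples:
                f.write(json.dumps(process_example(ex)) + "\n")
        paths[split_name] = path
    return paths


with tempfile.TemporaryDirectory() as root:
    paths = save_splits_as_jsonl(raw_splits, root)
    train_lines = paths["train"].read_text().splitlines()
```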