bridge.data.builders.hf_dataset#

Module Contents#

Classes#

ProcessExampleOutput

Expected output structure from a ProcessExampleFn.

ProcessExampleFn

Protocol defining the signature for a function that processes a single dataset example.

HFDatasetConfig

Configuration specific to using Hugging Face datasets for finetuning.

HFDatasetBuilder

Builder class for Hugging Face datasets.

Functions#

preprocess_and_split_data

Download, preprocess, split, and save a Hugging Face dataset to JSONL files.

Data#

API#

bridge.data.builders.hf_dataset.logger#

'getLogger(...)'

class bridge.data.builders.hf_dataset.ProcessExampleOutput#

Bases: typing.TypedDict

Expected output structure from a ProcessExampleFn.

input: str#

None

output: str#

None

original_answers: list[str]#

None

class bridge.data.builders.hf_dataset.ProcessExampleFn#

Bases: typing.Protocol

Protocol defining the signature for a function that processes a single dataset example.

__call__(
example: dict[str, Any],
tokenizer: Optional[megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer] = None,
) → bridge.data.builders.hf_dataset.ProcessExampleOutput#

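
Any callable with this shape satisfies the protocol. Below is a minimal sketch assuming a SQuAD-style example layout; the `question`/`answers` field names and the local `ProcessExampleOutput` mirror are illustrative, not part of the library, and the tokenizer argument is accepted but unused here.

```python
from typing import Any, Optional, TypedDict


class ProcessExampleOutput(TypedDict):
    """Local stand-in mirroring the TypedDict documented above (sketch only)."""

    input: str
    output: str
    original_answers: list[str]


def process_squad_example(
    example: dict[str, Any], tokenizer: Optional[Any] = None
) -> ProcessExampleOutput:
    # Map raw dataset fields to the prompt/response structure; the
    # "question"/"answers" field names are hypothetical and dataset-specific.
    answers = example["answers"]["text"]
    return ProcessExampleOutput(
        input=f"Question: {example['question']} Answer:",
        output=answers[0],
        original_answers=answers,
    )


result = process_squad_example(
    {"question": "What is 2+2?", "answers": {"text": ["4", "four"]}}
)
```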
class bridge.data.builders.hf_dataset.HFDatasetConfig#

Bases: megatron.bridge.training.config.FinetuningDatasetConfig

Configuration specific to using Hugging Face datasets for finetuning.

Inherits from FinetuningDatasetConfig and adds HF-specific options.

.. attribute:: dataset_name

Name of the dataset on the Hugging Face Hub.

.. attribute:: process_example_fn

A callable conforming to ProcessExampleFn protocol to process raw examples into the desired format.

.. attribute:: dataset_subset

Optional subset name if the dataset has multiple subsets.

.. attribute:: dataset_dict

Optional pre-loaded DatasetDict to use instead of downloading.

.. attribute:: split

Optional specific split to load (e.g., 'train[:10%]').

.. attribute:: download_mode

Download mode for load_dataset (e.g., 'force_redownload').

.. attribute:: val_proportion

Proportion of the training set to use for validation if no validation set is present.

.. attribute:: split_val_from_train

If True, creates validation set from training set. If False, uses test set to create validation set.

.. attribute:: delete_raw

If True, delete the raw downloaded dataset files after processing.

.. attribute:: rewrite

If True, rewrite existing processed files.

.. attribute:: hf_kwargs

Additional keyword arguments to pass to load_dataset.

.. attribute:: hf_filter_lambda

Optional function to filter the loaded dataset.

.. attribute:: hf_filter_lambda_kwargs

Optional keyword arguments for hf_filter_lambda.

dataset_name: str#

None

process_example_fn: bridge.data.builders.hf_dataset.ProcessExampleFn#

None

dataset_subset: Optional[str]#

None

dataset_dict: Optional[datasets.DatasetDict]#

None

split: Optional[str]#

None

download_mode: Optional[str]#

None

val_proportion: Optional[float]#

0.05

split_val_from_train: bool#

True

delete_raw: bool#

False

rewrite: bool#

True

hf_kwargs: Optional[dict[str, Any]]#

None

hf_filter_lambda: Optional[Callable]#

None

hf_filter_lambda_kwargs: Optional[dict[str, Any]]#

None
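
As an illustration, the fields above could be populated as follows. This is a sketch of plausible values shown as a plain keyword dict rather than an actual `HFDatasetConfig` instantiation; the `"squad"` dataset name and the split slice are arbitrary examples.

```python
# Hypothetical configuration values mirroring the fields documented above;
# in real use these would be passed to HFDatasetConfig (sketch only).
hf_config_kwargs = {
    "dataset_name": "squad",       # dataset on the Hugging Face Hub (example)
    "dataset_subset": None,        # no subset needed for this dataset
    "split": "train[:10%]",        # optional split slice syntax
    "download_mode": None,         # use the default caching behavior
    "val_proportion": 0.05,        # default: hold out 5% for validation
    "split_val_from_train": True,  # carve validation out of the train split
    "delete_raw": False,           # keep the raw downloaded files
    "rewrite": True,               # overwrite existing processed files
}
```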

bridge.data.builders.hf_dataset.preprocess_and_split_data(
dset: datasets.DatasetDict,
dataset_name: str,
dataset_root: pathlib.Path,
tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
process_example_fn: bridge.data.builders.hf_dataset.ProcessExampleFn,
split_val_from_train: bool = True,
val_proportion: Optional[float] = None,
train_aliases: tuple[str] = ('train', 'training'),
test_aliases: tuple[str] = ('test', 'testing'),
val_aliases: tuple[str] = ('val', 'validation', 'valid', 'eval'),
delete_raw: bool = False,
seed: int = 1234,
rewrite: bool = False,
do_test: bool = True,
do_validation: bool = True,
)#

Download, preprocess, split, and save a Hugging Face dataset to JSONL files.

Handles splitting into train/validation/test sets based on available splits and the val_proportion parameter. Processes each example using the provided process_example_fn and saves the results.

Parameters:
  • dset – The loaded Hugging Face DatasetDict.

  • dataset_name – Name of the dataset (for logging).

  • dataset_root – The root directory to save the processed JSONL files.

  • tokenizer – The tokenizer instance.

  • process_example_fn – Function to process individual examples.

  • split_val_from_train – If True, split validation from train set. Otherwise, split from test set (if available).

  • val_proportion – Proportion of data to use for validation split.

  • train_aliases – Tuple of possible names for the training split.

  • test_aliases – Tuple of possible names for the test split.

  • val_aliases – Tuple of possible names for the validation split.

  • delete_raw – If True, delete raw HF dataset cache after processing.

  • seed – Random seed for splitting.

  • rewrite – If True, overwrite existing processed files.

class bridge.data.builders.hf_dataset.HFDatasetBuilder(
dataset_name: str,
tokenizer,
process_example_fn: bridge.data.builders.hf_dataset.ProcessExampleFn,
dataset_dict: Optional[datasets.DatasetDict] = None,
dataset_subset: Optional[str] = None,
dataset_root: Optional[Union[str, pathlib.Path]] = None,
split=None,
seq_length=1024,
seed: int = 1234,
memmap_workers: int = 1,
max_train_samples: Optional[int] = None,
packed_sequence_specs: Optional[megatron.bridge.data.datasets.packed_sequence.PackedSequenceSpecs] = None,
download_mode: Optional[str] = None,
val_proportion: Optional[float] = 0.05,
split_val_from_train: bool = True,
rewrite: bool = True,
delete_raw: bool = False,
hf_kwargs: Optional[dict[str, Any]] = None,
dataset_kwargs: Optional[dict[str, Any]] = None,
hf_filter_lambda: Optional[Callable] = None,
hf_filter_lambda_kwargs: Optional[dict[str, Any]] = None,
do_validation: bool = True,
do_test: bool = True,
)#

Bases: megatron.bridge.data.builders.finetuning_dataset.FinetuningDatasetBuilder

Builder class for Hugging Face datasets.

This class extends FinetuningDatasetBuilder to work with Hugging Face datasets instead of file paths. It provides methods to build datasets from Hugging Face’s datasets library.

Initialization

Initializes the HFDatasetBuilder.

Parameters:
  • dataset_name – Name of the dataset on Hugging Face Hub.

  • tokenizer – The tokenizer instance.

  • is_built_on_rank – Callable to determine if data should be built on the current rank.

  • process_example_fn – Function conforming to ProcessExampleFn protocol.

  • dataset_dict – Optional pre-loaded DatasetDict.

  • dataset_subset – Optional dataset subset name.

  • dataset_root – Optional root directory for data; defaults based on dataset_name.

  • split – Optional specific split to load.

  • seq_length – Sequence length for processing.

  • seed – Random seed.

  • memmap_workers – Number of workers for memmapping.

  • max_train_samples – Optional maximum number of training samples.

  • packed_sequence_specs – Optional PackedSequenceSpecs for packed sequence datasets.

  • download_mode – Download mode for load_dataset.

  • val_proportion – Proportion for validation split.

  • split_val_from_train – Whether to split validation from train set.

  • rewrite – Whether to rewrite existing processed files.

  • delete_raw – Whether to delete raw downloaded files.

  • hf_kwargs – Additional kwargs for load_dataset.

  • dataset_kwargs – Additional kwargs for the underlying dataset constructor.

  • hf_filter_lambda – Optional function to filter the dataset.

  • hf_filter_lambda_kwargs – Optional kwargs for the filter function.

  • do_validation – Whether to build the validation set.

  • do_test – Whether to build the test set.

prepare_data() None#

Loads/downloads the dataset, filters it, preprocesses/splits it, and prepares memmaps.

_load_dataset() datasets.DatasetDict#

Load the dataset from Hugging Face or use the provided dataset.
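
The overall flow (load splits, process each example, write one JSONL file per split) can be sketched with standard-library stand-ins; the `process_example` helper, file naming, and raw field names here are assumptions, not the builder's actual implementation.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for a loaded DatasetDict: split name -> list of raw examples.
raw_splits = {
    "train": [
        {"question": "q1", "answer": "a1"},
        {"question": "q2", "answer": "a2"},
    ],
    "validation": [{"question": "q3", "answer": "a3"}],
}


def process_example(example):
    # Hypothetical processing step standing in for a ProcessExampleFn.
    return {
        "input": example["question"],
        "output": example["answer"],
        "original_answers": [example["answer"]],
    }


def save_splits_as_jsonl(splits, dataset_root):
    # Process every example and write one JSONL file per split under
    # dataset_root (the file naming scheme is an assumption).
    dataset_root = Path(dataset_root)
    dataset_root.mkdir(parents=True, exist_ok=True)
    paths = {}
    for split_name, examples in splits.items():
        path = dataset_root / f"{split_name}.jsonl"
        with path.open("w") as f:
            for ex in examples:
                f.write(json.dumps(process_example(ex)) + "\n")
        paths[split_name] = path
    return paths


with tempfile.TemporaryDirectory() as root:
    paths = save_splits_as_jsonl(raw_splits, root)
    train_lines = paths["train"].read_text().splitlines()
```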