bridge.data.builders.finetuning_dataset
Module Contents#
Classes#
FinetuningDatasetBuilder | Builder class for fine-tuning datasets.
Data#
API#
- bridge.data.builders.finetuning_dataset.logger#
‘getLogger(…)’
- class bridge.data.builders.finetuning_dataset.FinetuningDatasetBuilder(
- dataset_root: Union[str, pathlib.Path],
- tokenizer,
- seq_length: int = 2048,
- seed: int = 1234,
- memmap_workers: int = 1,
- max_train_samples: Optional[int] = None,
- packed_sequence_specs: Optional[megatron.bridge.data.datasets.packed_sequence.PackedSequenceSpecs] = None,
- dataset_kwargs: Optional[dict[str, Any]] = None,
- do_validation: bool = True,
- do_test: bool = True,
- )
Builder class for fine-tuning datasets.
This class provides methods to build datasets for fine-tuning large language models. It follows a builder pattern similar to BlendedMegatronDatasetBuilder but adapted for fine-tuning scenarios.
- Parameters:
dataset_root (Union[str, Path]) – The root directory containing training, validation, and test data.
tokenizer – The tokenizer to use for preprocessing text.
is_built_on_rank (Callable) – Function that returns True if the dataset should be built on current rank.
seq_length (int, optional) – The maximum sequence length. Defaults to 2048.
seed (int, optional) – Random seed for data shuffling. Defaults to 1234.
memmap_workers (int, optional) – Number of worker processes for memmap datasets. Defaults to 1.
max_train_samples (int, optional) – Maximum number of training samples. Defaults to None.
packed_sequence_specs (Optional[PackedSequenceSpecs], optional) – Specifications for packed sequences. Defaults to None.
dataset_kwargs (Optional[dict[str, Any]], optional) – Additional dataset creation arguments. Defaults to None.
do_validation (bool, optional) – Whether to build the validation dataset. Defaults to True.
do_test (bool, optional) – Whether to build the test dataset. Defaults to True.
Initialization
- prepare_data() → None#
Prepare data if needed.
- prepare_packed_data() → None#
Prepare packed sequence data files if configured.
- build() → list[Optional[Any]]#
Build train, validation, and test datasets.
This method creates the datasets required by the configuration. It first ensures that data preparation (e.g., packing) has been completed on rank 0, then builds the datasets, using the prepared files where they apply.
- Returns:
A list containing the train, validation, and test datasets. Elements can be None if the corresponding data file doesn’t exist or if dataset building is skipped for the split.
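The build flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: `build_splits`, `load`, `prepare`, and `rank` are hypothetical names, and the only details taken from this page are the three file names, the rank-0 preparation step, and the rule that a split is None when it is skipped or its file is missing.

```python
from pathlib import Path
from typing import Any, Callable, Optional


def build_splits(
    dataset_root: Path,
    load: Callable[[Path], Any],
    prepare: Callable[[], None],
    rank: int = 0,
    do_validation: bool = True,
    do_test: bool = True,
) -> list[Optional[Any]]:
    """Return [train, validation, test]; an entry is None when skipped or missing."""
    # Preparation (e.g., packing) runs once, on rank 0 only.
    if rank == 0:
        prepare()
    # A real distributed run would synchronize all ranks here before loading.

    wanted = {
        "training.jsonl": True,
        "validation.jsonl": do_validation,
        "test.jsonl": do_test,
    }
    datasets: list[Optional[Any]] = []
    for filename, enabled in wanted.items():
        path = dataset_root / filename
        # A split is None when it is disabled or its data file does not exist.
        datasets.append(load(path) if enabled and path.exists() else None)
    return datasets
```

Callers of such a function need to check each element for None before using it, which is the same contract the `build()` return value documents.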
- _build_datasets() → list[Optional[Any]]#
Internal method to build all datasets.
- Returns:
The train, validation, and test datasets.
- Return type:
list[Optional[Any]]
- _create_dataset(
- path: Union[str, pathlib.Path],
- pack_metadata_path: Optional[Union[str, pathlib.Path]] = None,
- is_test: bool = False,
- **kwargs: Any,
- )
Create a single dataset instance (train, validation, or test).
- Parameters:
path – Path to the dataset file
pack_metadata_path – Path to the packed sequence metadata
is_test – Whether this is a test dataset
**kwargs – Additional arguments to pass to the dataset constructor
- Returns:
The created dataset
- property train_path: pathlib.Path#
Path to the training dataset file (training.jsonl).
- property default_pack_path: pathlib.Path#
The default directory path for storing packed sequence files.
Constructed based on the dataset root and tokenizer model name. Creates the directory if it doesn’t exist.
- Returns:
The Path object for the default packing directory.
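A property that returns a directory and creates it on first access might look like the sketch below. The `packed` subdirectory and the helper name `default_pack_dir` are assumptions made for illustration; this page only states that the path is built from the dataset root and the tokenizer model name and that the directory is created if missing.

```python
from pathlib import Path


def default_pack_dir(dataset_root: Path, tokenizer_model_name: str) -> Path:
    """Return the packing directory, creating it if it does not exist."""
    # Assumed layout: <dataset_root>/packed/<tokenizer_model_name>
    pack_dir = dataset_root / "packed" / tokenizer_model_name
    # parents=True creates intermediate directories; exist_ok=True makes
    # repeated calls safe.
    pack_dir.mkdir(parents=True, exist_ok=True)
    return pack_dir
```

Because the directory is created as a side effect, reading the path is enough to make it safe to write packed files into it afterwards.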
- property pack_metadata: pathlib.Path#
Path to the metadata file for packed sequences.
Determined by packed_sequence_specs, or defaults based on the default_pack_path and packed_sequence_size.
- Returns:
The Path object for the packed sequence metadata file.
- Raises:
ValueError – If packed sequences are not configured.
- property train_path_packed: pathlib.Path#
Path to the packed training dataset file (.npy).
Determined by packed_sequence_specs, or defaults based on the default_pack_path and packed_sequence_size.
- Returns:
The Path object for the packed training data file.
- Raises:
ValueError – If packed sequences are not configured.
- property validation_path_packed: pathlib.Path#
Path to the packed validation dataset file (.npy).
Determined by packed_sequence_specs, or defaults based on the default_pack_path and packed_sequence_size.
- Returns:
The Path object for the packed validation data file.
- Raises:
ValueError – If packed sequences are not configured.
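The three packed-path properties share one resolution rule: fail if packing is not configured, prefer an explicit path from the specs, otherwise fall back to a default under default_pack_path keyed by packed_sequence_size. The sketch below illustrates that rule for the training file. `PackedSpecsSketch`, its `packed_train_data_path` field, and the `training_<size>.npy` file name are assumptions for illustration and may not match the real PackedSequenceSpecs.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class PackedSpecsSketch:
    # Hypothetical stand-in for PackedSequenceSpecs; field names are assumptions.
    packed_sequence_size: int
    packed_train_data_path: Optional[Path] = None


def resolve_train_path_packed(
    specs: Optional[PackedSpecsSketch], default_pack_path: Path
) -> Path:
    """Resolve the packed training file path, or raise if packing is off."""
    if specs is None:
        raise ValueError("Packed sequences are not configured.")
    # An explicit path in the specs takes precedence over the default.
    if specs.packed_train_data_path is not None:
        return specs.packed_train_data_path
    # Assumed default file name, keyed by the packed sequence size.
    return default_pack_path / f"training_{specs.packed_sequence_size}.npy"
```

Raising ValueError instead of returning None keeps a misconfigured run from silently falling back to unpacked data.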
- property validation_path: pathlib.Path#
Path to the validation dataset file (validation.jsonl).
- property test_path: pathlib.Path#
Path to the test dataset file (test.jsonl).
- _extract_tokenizer_model_name() → str#
Automatically extract the tokenizer model name from the model path.