core.datasets.retro.db.utils#
Utilities for building a chunk database.
Module Contents#
Functions#
get_db_dir – Sub-directory for DB data.
init_indexed_dataset_infos – Gather meta-info about each indexed dataset.
get_indexed_dataset_infos_path – Path to indexed dataset meta-infos.
save_indexed_dataset_infos – Save dataset order & meta-info.
load_indexed_datasets – Load indexed datasets into memory-mapped datasets.
get_indexed_dataset_infos – Load indexed dataset meta-infos.
get_individual_db_dir – Individual DB’s directory.
get_individual_db_paths – Get paths of all database blocks of an individual dataset.
get_individual_chunk_db – Load individual dataset’s chunk DB.
get_individual_doc_offsets – Load individual dataset’s document offsets.
get_merged_db_path_map – Paths to merged datasets.
get_merged_dataset – Get merged dataset.
get_merged_sampled_dataset – Get sampled dataset (for training the vector index).
get_merged_train_dataset – Get training dataset (for adding to the vector index).
get_merged_valid_dataset – Get validation dataset (for testing the vector index).
get_merged_datasets – Get all merged datasets.
API#
- core.datasets.retro.db.utils.get_db_dir(project_dir: str) str#
Sub-directory for DB data.
- Parameters:
project_dir (str) – Path to Retro project dir.
- Returns:
Path of the DB sub-directory within the project.
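The implementation is plausibly a simple path join; a minimal sketch, assuming the sub-directory is named `db` (the directory name is an assumption, not confirmed by this page):

```python
import os

def get_db_dir(project_dir: str) -> str:
    # Minimal sketch: the chunk database lives in a fixed sub-directory
    # of the Retro project dir. The name "db" is an assumption.
    return os.path.join(project_dir, "db")
```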
- core.datasets.retro.db.utils.init_indexed_dataset_infos(
- config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
- ) List[Dict]#
Gather meta-info about each indexed dataset.
The returned info array allows for easy access to the configuration, and helps remove ambiguity.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
- Returns:
List of processing metadata for each dataset, including:
ratio – Data split weight.
prefix – Relative path to dataset under DB sub-directory.
- core.datasets.retro.db.utils.get_indexed_dataset_infos_path(project_dir: str) str#
Path to indexed dataset meta-infos.
- Parameters:
project_dir (str) – Path to Retro project dir.
- Returns:
Path to the indexed_dataset_infos.json file.
- core.datasets.retro.db.utils.save_indexed_dataset_infos(
- project_dir: str,
- indexed_dataset_infos: List[Dict],
- ) None#
Save dataset order & meta-info.
- Parameters:
project_dir (str) – Path to Retro project dir.
indexed_dataset_infos (List[Dict]) – List of metadata for each dataset, with each entry containing:
- ratio – Data split weight.
- prefix – Relative path to dataset under DB sub-directory.
- n_docs – Number of documents.
- n_docs_train – Number of documents used for pretraining.
- n_chunks – Number of valid chunks.
- n_chunks_train – Number of valid chunks used for pretraining.
- n_chunks_invalid – Number of invalid chunks.
- n_chunks_sampled – Number of valid chunks used for vector index training.
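Given the `indexed_dataset_infos.json` file named above, the save step plausibly serializes these entries as JSON. A hedged sketch (the `db` sub-directory name is an assumption about the on-disk layout):

```python
import json
import os

def save_indexed_dataset_infos(project_dir: str, indexed_dataset_infos: list) -> None:
    # Sketch only: persist the per-dataset metadata entries as JSON so
    # later preprocessing stages can reload them. The "db" sub-directory
    # name is an assumption.
    path = os.path.join(project_dir, "db", "indexed_dataset_infos.json")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(indexed_dataset_infos, f, indent=4)
```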
- core.datasets.retro.db.utils.load_indexed_datasets(
- project_dir: str,
- indexed_dataset_infos: List[Dict],
- ) None#
Load indexed datasets into memory-mapped datasets.
- Parameters:
project_dir (str) – Path to Retro project dir.
indexed_dataset_infos (List[Dict]) – List of metadata for each dataset (see save_indexed_dataset_infos() for more details).
- core.datasets.retro.db.utils.get_indexed_dataset_infos(
- project_dir: str,
- ) List[Dict]#
Load indexed dataset meta-infos.
- Parameters:
project_dir (str) – Path to Retro project dir.
- Returns:
List of metadata for each dataset (see save_indexed_dataset_infos() for more details).
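The load path is plausibly the mirror image of the save step: read back the JSON metadata file. A hedged sketch (again assuming a `db` sub-directory and the `indexed_dataset_infos.json` file name from above):

```python
import json
import os

def get_indexed_dataset_infos(project_dir: str) -> list:
    # Sketch of the load path: read back the JSON metadata written at
    # save time. The "db" sub-directory name is an assumption.
    path = os.path.join(project_dir, "db", "indexed_dataset_infos.json")
    with open(path) as f:
        return json.load(f)
```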
- core.datasets.retro.db.utils.get_individual_db_dir(project_dir: str, prefix: str) str#
Individual DB’s directory.
- Parameters:
project_dir (str) – Path to Retro project dir.
prefix (str) – Unique relative path to dataset within project dir.
- Returns:
Path to the given dataset’s chunk database.
- core.datasets.retro.db.utils.get_individual_db_paths(
- project_dir: str,
- prefix: str,
- ) List[str]#
Get paths of all database blocks of an individual dataset.
- Parameters:
project_dir (str) – Path to Retro project dir.
prefix (str) – Unique relative path to dataset within project dir.
- Returns:
Paths to the HDF5 chunk database files that together comprise this dataset’s full chunk database.
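Collecting block paths is plausibly a glob over the dataset’s block directory; a sketch under assumed layout (the `db/individual` directory structure and the `*.hdf5` pattern are assumptions):

```python
import glob
import os

def get_individual_db_paths(project_dir: str, prefix: str) -> list:
    # Sketch: each dataset's chunk DB is split across HDF5 block files.
    # The "db/individual" layout and "*.hdf5" pattern are assumptions.
    pattern = os.path.join(project_dir, "db", "individual", prefix, "*.hdf5")
    return sorted(glob.glob(pattern))
```

Sorting keeps block order deterministic across filesystems.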
- core.datasets.retro.db.utils.get_individual_chunk_db(
- project_dir: str,
- ds_id: int,
- ds_info: dict,
- ) numpy.ndarray#
Load individual dataset’s chunk DB.
- Parameters:
project_dir (str) – Path to Retro project dir.
ds_id (int) – Index of dataset within blended dataset.
ds_info (dict) – Preprocessing metadata for dataset (see save_indexed_dataset_infos() for more detail).
- Returns:
Array of chunk start/end indexes for this dataset, where the chunk indexes can be used for indexing into the corresponding indexed dataset.
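A toy illustration of how such start/end indexes slice chunk tokens out of the corresponding indexed dataset. The exact row layout here, `(doc_id, token_start, token_end)`, is an assumption for illustration, and plain Python containers stand in for the memory-mapped data:

```python
# Toy chunk DB: each row is assumed to be (doc_id, token_start, token_end).
chunk_db = [
    (0, 0, 64),    # doc 0, tokens [0, 64)
    (0, 64, 128),  # doc 0, tokens [64, 128)
    (1, 0, 64),    # doc 1, tokens [0, 64)
]

# Stand-in for the memory-mapped indexed dataset: doc id -> token ids.
doc_tokens = {0: list(range(200)), 1: list(range(100))}

# Slice the tokens for chunk 1 out of its source document.
doc_id, start, end = chunk_db[1]
chunk = doc_tokens[doc_id][start:end]
```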
- core.datasets.retro.db.utils.get_individual_doc_offsets(
- project_dir: str,
- ds_id: int,
- ds_info: dict,
- ) numpy.ndarray#
Load individual dataset’s document offsets.
- Parameters:
project_dir (str) – Path to Retro project dir.
ds_id (int) – Index of dataset within blended dataset.
ds_info (dict) – Preprocessing metadata for dataset (see save_indexed_dataset_infos() for more detail).
- Returns:
Array of document offsets by chunk index for this dataset.
- core.datasets.retro.db.utils.get_merged_db_path_map(project_dir: str) dict#
Paths to merged datasets.
- Parameters:
project_dir (str) – Path to Retro project dir.
- Returns:
A dict of chunk database paths, one for each DB type:
sampled – Chunks used for training the vector index.
train – Chunks used for pretraining the ‘train’ dataset.
valid – Chunks used for pretraining the ‘valid’ dataset.
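A hedged sketch of what this path map might look like; the `db/merged` directory and the file names are assumptions about the on-disk layout:

```python
import os

def get_merged_db_path_map(project_dir: str) -> dict:
    # Sketch: one merged chunk DB file per DB type. The "db/merged"
    # directory and file names are assumptions.
    merged_dir = os.path.join(project_dir, "db", "merged")
    return {
        "sampled": os.path.join(merged_dir, "sampled.hdf5"),
        "train": os.path.join(merged_dir, "train.hdf5"),
        "valid": os.path.join(merged_dir, "valid.hdf5"),
    }
```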
- core.datasets.retro.db.utils.get_merged_dataset(
- project_dir: str,
- chunk_length: int,
- eod_token_id: int,
- db_type: str,
- indexed_dataset_infos: Optional[List[Dict]] = None,
- ) DBDataset#
Get merged dataset.
- Parameters:
project_dir (str) – Path to Retro project dir.
chunk_length (int) – GPT chunk length (e.g., 64).
eod_token_id (int) – EOD token ID.
db_type (str) – DB type (e.g., ‘sampled’, ‘train’, or ‘valid’).
indexed_dataset_infos (Optional[List[Dict]]) – Optionally, pre-loaded list of dataset metadata (see save_indexed_dataset_infos() for more detail). If not provided, the indexed dataset infos will be loaded from disk.
- Returns:
A DBDataset, which is a dataset that wraps the HDF5 chunk index array.
- core.datasets.retro.db.utils.get_merged_sampled_dataset(
- project_dir: str,
- chunk_length: int,
- eod_token_id: int,
- indexed_dataset_infos: Optional[List[Dict]] = None,
- ) DBDataset#
Get sampled dataset (for training the vector index).
- Parameters:
project_dir (str) – Path to Retro project dir.
chunk_length (int) – GPT chunk length (e.g., 64).
eod_token_id (int) – EOD token ID.
indexed_dataset_infos (Optional[List[Dict]]) – Optionally, pre-loaded list of dataset metadata (see save_indexed_dataset_infos() for more detail). If not provided, the indexed dataset infos will be loaded from disk.
- Returns:
A DBDataset, which is a dataset that wraps the HDF5 chunk index array.
- core.datasets.retro.db.utils.get_merged_train_dataset(
- project_dir: str,
- chunk_length: int,
- eod_token_id: int,
- indexed_dataset_infos: Optional[List[Dict]] = None,
- ) DBDataset#
Get training dataset (for adding to the vector index).
- Parameters:
project_dir (str) – Path to Retro project dir.
chunk_length (int) – GPT chunk length (e.g., 64).
eod_token_id (int) – EOD token ID.
indexed_dataset_infos (Optional[List[Dict]]) – Optionally, pre-loaded list of dataset metadata (see save_indexed_dataset_infos() for more detail). If not provided, the indexed dataset infos will be loaded from disk.
- Returns:
A DBDataset, which is a dataset that wraps the HDF5 chunk index array.
- core.datasets.retro.db.utils.get_merged_valid_dataset(
- project_dir: str,
- chunk_length: int,
- eod_token_id: int,
- indexed_dataset_infos: Optional[List[Dict]] = None,
- ) DBDataset#
Get validation dataset (for testing the vector index).
- Parameters:
project_dir (str) – Path to Retro project dir.
chunk_length (int) – GPT chunk length (e.g., 64).
eod_token_id (int) – EOD token ID.
indexed_dataset_infos (Optional[List[Dict]]) – Optionally, pre-loaded list of dataset metadata (see save_indexed_dataset_infos() for more detail). If not provided, the indexed dataset infos will be loaded from disk.
- Returns:
A DBDataset, which is a dataset that wraps the HDF5 chunk index array.
- core.datasets.retro.db.utils.get_merged_datasets(
- project_dir: str,
- chunk_length: int,
- eod_token_id: int,
- ) dict#
Get all merged datasets.
- Parameters:
project_dir (str) – Path to Retro project dir.
chunk_length (int) – GPT chunk length (e.g., 64).
eod_token_id (int) – EOD token ID.
- Returns:
A dict mapping DB type (‘sampled’, ‘train’, or ‘valid’) to the corresponding DBDataset, which is a dataset that wraps the HDF5 chunk index array.
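The three typed getters and the aggregate function plausibly all delegate to get_merged_dataset with a fixed db_type. A stub-based sketch of that relationship; the dict returned by the stub stands in for the real DBDataset, and the delegation itself is an assumption about the implementation:

```python
def get_merged_dataset(project_dir, chunk_length, eod_token_id, db_type,
                       indexed_dataset_infos=None):
    # Stub: the real function returns a DBDataset wrapping the merged HDF5
    # chunk index array; a plain dict stands in here for illustration.
    return {"db_type": db_type, "chunk_length": chunk_length}

def get_merged_sampled_dataset(project_dir, chunk_length, eod_token_id,
                               indexed_dataset_infos=None):
    # Assumed delegation: each typed getter fixes db_type.
    return get_merged_dataset(project_dir, chunk_length, eod_token_id,
                              "sampled", indexed_dataset_infos)

def get_merged_datasets(project_dir, chunk_length, eod_token_id):
    # Aggregate: one merged dataset per DB type.
    return {
        db_type: get_merged_dataset(project_dir, chunk_length, eod_token_id, db_type)
        for db_type in ("sampled", "train", "valid")
    }
```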