`core.datasets.retro.db.utils`#

Utilities for building a chunk database.

Module Contents#

Functions#

`get_db_dir`	Sub-directory for DB data.
`init_indexed_dataset_infos`	Gather meta-info about each indexed dataset.
`get_indexed_dataset_infos_path`	Path to indexed dataset meta-infos.
`save_indexed_dataset_infos`	Save dataset order & meta-info.
`load_indexed_datasets`	Load indexed datasets into memory-mapped datasets.
`get_indexed_dataset_infos`	Load indexed dataset meta-infos.
`get_individual_db_dir`	Individual DB’s directory.
`get_individual_db_paths`	Get paths of all database blocks of an individual dataset.
`get_individual_chunk_db`	Load individual dataset’s chunk DB.
`get_individual_doc_offsets`	Load individual dataset’s document offsets.
`get_merged_db_path_map`	Paths to merged datasets.
`get_merged_dataset`	Get merged dataset.
`get_merged_sampled_dataset`	Get sampled dataset (for training the vector index).
`get_merged_train_dataset`	Get training dataset (for adding to the vector index).
`get_merged_valid_dataset`	Get validation dataset (for testing the vector index).
`get_merged_datasets`	Get all merged datasets.

API#

core.datasets.retro.db.utils.get_db_dir(project_dir: str) → str#

Sub-directory for DB data.

Parameters: project_dir (str) – Path to Retro project dir.
Returns: Path of the DB sub-directory within the project.
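
As a rough illustration, the DB sub-directory can be pictured as a fixed child of the project directory. The sketch below is not the library's implementation: the helper name `sketch_db_dir` and the sub-directory name `db` are assumptions made for the example.

```python
import os


def sketch_db_dir(project_dir: str) -> str:
    """Hypothetical stand-in for get_db_dir().

    Assumes the DB data lives in a child directory named "db";
    the real sub-directory name is not stated in this reference.
    """
    return os.path.join(project_dir, "db")


db_dir = sketch_db_dir("retro_project")
print(db_dir)
```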

core.datasets.retro.db.utils.init_indexed_dataset_infos( config: megatron.core.datasets.retro.config.RetroPreprocessingConfig, ) → List[Dict]#

Gather meta-info about each indexed dataset.

The returned info array allows for easy access to the configuration, and helps remove ambiguity.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

List of processing metadata for each dataset, including:
ratio: Data split weight.
prefix: Relative path to dataset under DB sub-directory.

core.datasets.retro.db.utils.get_indexed_dataset_infos_path(project_dir: str) → str#

Path to indexed dataset meta-infos.

Parameters: project_dir (str) – Path to Retro project dir.
Returns: Path to the indexed_dataset_infos.json file.

core.datasets.retro.db.utils.save_indexed_dataset_infos( project_dir: str, indexed_dataset_infos: List[Dict], ) → None#

Save dataset order & meta-info.

Parameters:

project_dir (str) – Path to Retro project dir.
indexed_dataset_infos (List[Dict]) – List of metadata for each dataset, with each entry containing:
ratio – Data split weight.
prefix – Relative path to dataset under DB sub-directory.
n_docs – Number of documents.
n_docs_train – Number of documents used for pretraining.
n_chunks – Number of valid chunks.
n_chunks_train – Number of valid chunks used for pretraining.
n_chunks_invalid – Number of invalid chunks.
n_chunks_sampled – Number of valid chunks used for vector index training.
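
To make the shape of this metadata concrete, here is a minimal sketch of writing such a list to a JSON file and reading it back. The helper names `sketch_save_infos` and `sketch_load_infos`, the dataset prefixes, and all numeric values are invented for illustration. Only the key names and the `indexed_dataset_infos.json` file name come from this reference, and the sketch does not claim to match the library's actual serialization details.

```python
import json
import os
import tempfile
from typing import Dict, List


def sketch_save_infos(path: str, infos: List[Dict]) -> None:
    """Hypothetical helper: write the metadata list as JSON, keeping list order."""
    with open(path, "w") as f:
        json.dump(infos, f, indent=4)


def sketch_load_infos(path: str) -> List[Dict]:
    """Hypothetical helper: read the metadata list back from JSON."""
    with open(path) as f:
        return json.load(f)


# Illustrative entries only; the prefixes and counts are made up.
example_infos = [
    {
        "ratio": 0.7,
        "prefix": "wiki/wiki_text_document",
        "n_docs": 100,
        "n_docs_train": 98,
        "n_chunks": 5000,
        "n_chunks_train": 4900,
        "n_chunks_invalid": 12,
        "n_chunks_sampled": 500,
    },
    {
        "ratio": 0.3,
        "prefix": "books/books_text_document",
        "n_docs": 40,
        "n_docs_train": 39,
        "n_chunks": 9000,
        "n_chunks_train": 8800,
        "n_chunks_invalid": 30,
        "n_chunks_sampled": 900,
    },
]

with tempfile.TemporaryDirectory() as tmp_dir:
    infos_path = os.path.join(tmp_dir, "indexed_dataset_infos.json")
    sketch_save_infos(infos_path, example_infos)
    reloaded = sketch_load_infos(infos_path)

# The dataset order survives the round trip, which matters because
# ds_id elsewhere in this module is an index into this list.
print([info["prefix"] for info in reloaded])
```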

core.datasets.retro.db.utils.load_indexed_datasets( project_dir: str, indexed_dataset_infos: List[Dict], ) → None#

Load indexed datasets into memory-mapped datasets.

Parameters:

project_dir (str) – Path to Retro project dir.
indexed_dataset_infos (List[Dict]) – List of metadata for each dataset (see save_indexed_dataset_infos() for more details).

core.datasets.retro.db.utils.get_indexed_dataset_infos( project_dir: str, ) → List[Dict]#

Load indexed dataset meta-infos.

Parameters: project_dir (str) – Path to Retro project dir.
Returns: List of metadata for each dataset (see save_indexed_dataset_infos() for more details).

core.datasets.retro.db.utils.get_individual_db_dir(project_dir: str, prefix: str) → str#

Individual DB’s directory.

Parameters:

project_dir (str) – Path to Retro project dir.
prefix (str) – Unique relative path to dataset within project dir.

Returns:

Path to the given dataset’s chunk database.

core.datasets.retro.db.utils.get_individual_db_paths( project_dir: str, prefix: str, ) → List[str]#

Get paths of all database blocks of an individual dataset.

Parameters:

project_dir (str) – Path to Retro project dir.
prefix (str) – Unique relative path to dataset within project dir.

Returns:

Paths to the HDF5 chunk database files that make up this dataset’s full chunk database.
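
One plausible way to collect such block files is to glob the dataset's DB directory for HDF5 files and sort the result so that blocks come back in a stable order. The sketch below assumes a `.hdf5` extension and zero-padded range-style file names; both are illustrative guesses, and `sketch_block_paths` is a hypothetical helper, not the library function.

```python
import glob
import os
import tempfile
from typing import List


def sketch_block_paths(individual_db_dir: str) -> List[str]:
    """Hypothetical helper: list every *.hdf5 block file in sorted order."""
    return sorted(glob.glob(os.path.join(individual_db_dir, "*.hdf5")))


with tempfile.TemporaryDirectory() as individual_db_dir:
    # Create empty placeholder files; the names are invented for the example.
    for name in ("0000100-0000200.hdf5", "0000000-0000100.hdf5", "notes.txt"):
        open(os.path.join(individual_db_dir, name), "w").close()

    block_names = [
        os.path.basename(path) for path in sketch_block_paths(individual_db_dir)
    ]

# Non-HDF5 files are ignored and blocks are ordered by file name.
print(block_names)
```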

core.datasets.retro.db.utils.get_individual_chunk_db( project_dir: str, ds_id: int, ds_info: dict, ) → numpy.ndarray#

Load individual dataset’s chunk DB.

Parameters:

project_dir (str) – Path to Retro project dir.
ds_id (int) – Index of dataset within blended dataset.
ds_info (dict) – Preprocessing metadata for dataset (see save_indexed_dataset_infos() for more detail).

Returns:

Array of chunk start/end indexes for this dataset, where the chunk indexes can be used for indexing into the corresponding indexed dataset.
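
The idea of using chunk start/end indexes to index into a dataset can be sketched with plain Python lists. The function returns a `numpy.ndarray` whose exact column layout is not described here, so the `(start, end)` pairs, the exclusive end index, and the helper `sketch_get_chunk` below are assumptions chosen purely for illustration.

```python
from typing import List, Tuple


def sketch_get_chunk(
    tokens: List[int], chunk_db: List[Tuple[int, int]], chunk_id: int
) -> List[int]:
    """Hypothetical helper: slice one chunk's tokens out of a token sequence.

    Assumes each chunk_db row is (start, end) with an exclusive end index.
    """
    start, end = chunk_db[chunk_id]
    return tokens[start:end]


# Ten made-up token IDs standing in for an indexed dataset's contents.
tokens = list(range(100, 110))

# Three made-up chunks covering token positions [0, 4), [4, 8), and [8, 10).
chunk_db = [(0, 4), (4, 8), (8, 10)]

second_chunk = sketch_get_chunk(tokens, chunk_db, 1)
print(second_chunk)
```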

core.datasets.retro.db.utils.get_individual_doc_offsets( project_dir: str, ds_id: int, ds_info: dict, ) → numpy.ndarray#

Load individual dataset’s document offsets.

Parameters:

project_dir (str) – Path to Retro project dir.
ds_id (int) – Index of dataset within blended dataset.
ds_info (dict) – Preprocessing metadata for dataset (see save_indexed_dataset_infos() for more detail).

Returns:

Array of document offsets by chunk index for this dataset.

core.datasets.retro.db.utils.get_merged_db_path_map(project_dir: str) → dict#

Paths to merged datasets.

Parameters:

project_dir (str) – Path to Retro project dir.

Returns:

A dict of paths to chunk databases, one for each of:
sampled: Chunks used for training the vector index.
train: Chunks used for pretraining ‘train’ dataset.
valid: Chunks used for pretraining ‘valid’ dataset.
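
A mapping of this kind could look like the sketch below, which builds one path per key. The `merged` directory name, the `<key>.hdf5` file naming, and the helper `sketch_merged_db_path_map` are assumptions for the example; only the three keys come from this reference.

```python
import os
from typing import Dict

# The three keys documented for the merged databases.
MERGED_KEYS = ("sampled", "train", "valid")


def sketch_merged_db_path_map(db_dir: str) -> Dict[str, str]:
    """Hypothetical helper: map each key to an assumed '<key>.hdf5' file."""
    merged_dir = os.path.join(db_dir, "merged")  # assumed directory name
    return {key: os.path.join(merged_dir, f"{key}.hdf5") for key in MERGED_KEYS}


path_map = sketch_merged_db_path_map("retro_project/db")
for key, path in path_map.items():
    print(key, path)
```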

core.datasets.retro.db.utils.get_merged_dataset( project_dir: str, chunk_length: int, eod_token_id: int, db_type: str, indexed_dataset_infos: Optional[List[Dict]] = None, ) → core.datasets.retro.db.dataset.DBDataset#

Get merged dataset.

Parameters:

project_dir (str) – Path to Retro project dir.
chunk_length (int) – GPT chunk length (e.g., 64).
eod_token_id (int) – EOD token ID.
db_type (str) – DB type (e.g., ‘sampled’, ‘train’, or ‘valid’).
indexed_dataset_infos (Optional[List[Dict]]) – Optionally, pre-loaded list of dataset metadata (see save_indexed_dataset_infos() for more detail). If not provided, the indexed dataset infos will be loaded from disk.

Returns:

A DBDataset, which is a dataset that wraps the HDF5 chunk index array.

core.datasets.retro.db.utils.get_merged_sampled_dataset( project_dir: str, chunk_length: int, eod_token_id: int, indexed_dataset_infos: Optional[List[Dict]] = None, ) → core.datasets.retro.db.dataset.DBDataset#

Get sampled dataset (for training the vector index).

Parameters:

project_dir (str) – Path to Retro project dir.
chunk_length (int) – GPT chunk length (e.g., 64).
eod_token_id (int) – EOD token ID.
indexed_dataset_infos (Optional[List[Dict]]) – Optionally, pre-loaded list of dataset metadata (see save_indexed_dataset_infos() for more detail). If not provided, the indexed dataset infos will be loaded from disk.

Returns:

A DBDataset, which is a dataset that wraps the HDF5 chunk index array.

core.datasets.retro.db.utils.get_merged_train_dataset( project_dir: str, chunk_length: int, eod_token_id: int, indexed_dataset_infos: Optional[List[Dict]] = None, ) → core.datasets.retro.db.dataset.DBDataset#

Get training dataset (for adding to the vector index).

Parameters:

project_dir (str) – Path to Retro project dir.
chunk_length (int) – GPT chunk length (e.g., 64).
eod_token_id (int) – EOD token ID.
indexed_dataset_infos (Optional[List[Dict]]) – Optionally, pre-loaded list of dataset metadata (see save_indexed_dataset_infos() for more detail). If not provided, the indexed dataset infos will be loaded from disk.

Returns:

A DBDataset, which is a dataset that wraps the HDF5 chunk index array.

core.datasets.retro.db.utils.get_merged_valid_dataset( project_dir: str, chunk_length: int, eod_token_id: int, indexed_dataset_infos: Optional[List[Dict]] = None, ) → core.datasets.retro.db.dataset.DBDataset#

Get validation dataset (for testing the vector index).

Parameters:

project_dir (str) – Path to Retro project dir.
chunk_length (int) – GPT chunk length (e.g., 64).
eod_token_id (int) – EOD token ID.
indexed_dataset_infos (Optional[List[Dict]]) – Optionally, pre-loaded list of dataset metadata (see save_indexed_dataset_infos() for more detail). If not provided, the indexed dataset infos will be loaded from disk.

Returns:

A DBDataset, which is a dataset that wraps the HDF5 chunk index array.

core.datasets.retro.db.utils.get_merged_datasets( project_dir: str, chunk_length: int, eod_token_id: int, ) → dict#

Get all merged datasets.

Parameters:

project_dir (str) – Path to Retro project dir.
chunk_length (int) – GPT chunk length (e.g., 64).
eod_token_id (int) – EOD token ID.

Returns:

A dict mapping DB type (‘sampled’, ‘train’, or ‘valid’) to the corresponding DBDataset, which is a dataset that wraps the HDF5 chunk index array.
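
Conceptually, the returned dict can be built by loading one merged dataset per DB type. The sketch below shows that pattern with a placeholder loader. `sketch_merged_datasets` and the `load_one` callback are hypothetical and stand in for whatever per-type loader is actually used.

```python
from typing import Callable, Dict

# The three DB types documented for the merged datasets.
DB_TYPES = ("sampled", "train", "valid")


def sketch_merged_datasets(load_one: Callable[[str], object]) -> Dict[str, object]:
    """Hypothetical helper: call a per-type loader once for each DB type."""
    return {db_type: load_one(db_type) for db_type in DB_TYPES}


# A placeholder loader that returns a string instead of a real DBDataset.
datasets = sketch_merged_datasets(lambda db_type: f"<dataset for {db_type}>")
print(datasets["train"])
```

Callers that need only one split would typically use the matching single-split function rather than loading all three.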
