core.datasets.retro.db.utils

Utilities for building a chunk database.

Module Contents

Functions

get_db_dir

Sub-directory for DB data.

init_indexed_dataset_infos

Gather meta-info about each indexed dataset.

get_indexed_dataset_infos_path

Path to indexed dataset meta-infos.

save_indexed_dataset_infos

Save dataset order & meta-info.

load_indexed_datasets

Load indexed datasets into memory-mapped datasets.

get_indexed_dataset_infos

Load indexed dataset meta-infos.

get_individual_db_dir

Individual DB’s directory.

get_individual_db_paths

Get paths of all database blocks of an individual dataset.

get_individual_chunk_db

Load individual dataset’s chunk DB.

get_individual_doc_offsets

Load individual dataset’s document offsets.

get_merged_db_path_map

Paths to merged datasets.

get_merged_dataset

Get merged dataset.

get_merged_sampled_dataset

Get sampled dataset (for training the vector index).

get_merged_train_dataset

Get training dataset (for adding to the vector index).

get_merged_valid_dataset

Get validation dataset (for testing the vector index).

get_merged_datasets

Get all merged datasets.

API

core.datasets.retro.db.utils.get_db_dir(project_dir: str) → str

Sub-directory for DB data.

Parameters:

project_dir (str) – Path to Retro project dir.

Returns:

Path of the DB sub-directory within the project.

core.datasets.retro.db.utils.init_indexed_dataset_infos(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → List[Dict]

Gather meta-info about each indexed dataset.

The returned info array allows for easy access to the configuration, and helps remove ambiguity.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

List of processing metadata for each dataset, including:

  • ratio: Data split weight.

  • prefix: Relative path to dataset under DB sub-directory.

core.datasets.retro.db.utils.get_indexed_dataset_infos_path(project_dir: str) → str

Path to indexed dataset meta-infos.

Parameters:

project_dir (str) – Path to Retro project dir.

Returns:

Path to the indexed_dataset_infos.json file.
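As a rough illustration of how get_db_dir and get_indexed_dataset_infos_path relate, the sketch below builds both paths with os.path.join. The sub-directory name "db", and the placement of the JSON file directly inside it, are assumptions made for illustration; only the file name indexed_dataset_infos.json comes from the description above. The sketch_ functions are hypothetical stand-ins, not the library functions.

```python
import os

# Assumed name of the DB sub-directory; not confirmed by this page.
DB_SUBDIR = "db"


def sketch_get_db_dir(project_dir: str) -> str:
    """Return the DB sub-directory inside a Retro project dir (sketch)."""
    return os.path.join(project_dir, DB_SUBDIR)


def sketch_get_indexed_dataset_infos_path(project_dir: str) -> str:
    """Return the path of indexed_dataset_infos.json (sketch).

    Assumes the file lives directly inside the DB sub-directory.
    """
    return os.path.join(
        sketch_get_db_dir(project_dir), "indexed_dataset_infos.json"
    )


# "my_retro_project" is a hypothetical project directory name.
infos_path = sketch_get_indexed_dataset_infos_path("my_retro_project")
```
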

core.datasets.retro.db.utils.save_indexed_dataset_infos(
project_dir: str,
indexed_dataset_infos: List[Dict],
) → None

Save dataset order & meta-info.

Parameters:
  • project_dir (str) – Path to Retro project dir.

  • indexed_dataset_infos (List[Dict]) – List of metadata for each dataset, with each entry containing:

      • ratio – Data split weight.

      • prefix – Relative path to dataset under DB sub-directory.

      • n_docs – Number of documents.

      • n_docs_train – Number of documents used for pretraining.

      • n_chunks – Number of valid chunks.

      • n_chunks_train – Number of valid chunks used for pretraining.

      • n_chunks_invalid – Number of invalid chunks.

      • n_chunks_sampled – Number of valid chunks used for vector index training.
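To make the shape of indexed_dataset_infos concrete, here is a minimal sketch of one entry and a JSON round trip. The keys mirror the fields listed above; every value, including the prefix, is made up for illustration. That the list is written as plain JSON is inferred from the indexed_dataset_infos.json file name, and the sketch_ helpers are hypothetical stand-ins rather than the library functions.

```python
import json
import os
import tempfile
from typing import Dict, List

# Illustrative values only; the keys follow the field list above.
example_infos: List[Dict] = [
    {
        "ratio": 0.7,
        "prefix": "wiki/text_document",  # hypothetical prefix
        "n_docs": 1000,
        "n_docs_train": 980,
        "n_chunks": 5000,
        "n_chunks_train": 4900,
        "n_chunks_invalid": 20,
        "n_chunks_sampled": 500,
    }
]


def sketch_save_infos(path: str, infos: List[Dict]) -> None:
    """Write the list of dataset infos to a JSON file, preserving order."""
    with open(path, "w") as f:
        json.dump(infos, f, indent=4)


def sketch_load_infos(path: str) -> List[Dict]:
    """Read the list of dataset infos back from a JSON file."""
    with open(path) as f:
        return json.load(f)


# Round trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    infos_path = os.path.join(tmp_dir, "indexed_dataset_infos.json")
    sketch_save_infos(infos_path, example_infos)
    loaded_infos = sketch_load_infos(infos_path)
```

Because the infos are a list rather than a dict keyed by prefix, the order of the datasets survives the round trip, which matches the "dataset order" part of the description.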

core.datasets.retro.db.utils.load_indexed_datasets(
project_dir: str,
indexed_dataset_infos: List[Dict],
) → None

Load indexed datasets into memory-mapped datasets.

Parameters:
  • project_dir (str) – Path to Retro project dir.

  • indexed_dataset_infos (List[Dict]) – List of metadata for each dataset (see save_indexed_dataset_infos() for more details).

core.datasets.retro.db.utils.get_indexed_dataset_infos(
project_dir: str,
) → List[Dict]

Load indexed dataset meta-infos.

Parameters:

project_dir (str) – Path to Retro project dir.

Returns:

List of metadata for each dataset (see save_indexed_dataset_infos() for more details).

core.datasets.retro.db.utils.get_individual_db_dir(project_dir: str, prefix: str) → str

Individual DB’s directory.

Parameters:
  • project_dir (str) – Path to Retro project dir.

  • prefix (str) – Unique relative path to dataset within project dir.

Returns:

Path to the given dataset’s chunk database.

core.datasets.retro.db.utils.get_individual_db_paths(
project_dir: str,
prefix: str,
) → List[str]

Get paths of all database blocks of an individual dataset.

Parameters:
  • project_dir (str) – Path to Retro project dir.

  • prefix (str) – Unique relative path to dataset within project dir.

Returns:

Paths to each of the HDF5 chunk database files that together comprise this dataset’s full chunk database.

core.datasets.retro.db.utils.get_individual_chunk_db(
project_dir: str,
ds_id: int,
ds_info: dict,
) → numpy.ndarray

Load individual dataset’s chunk DB.

Parameters:
  • project_dir (str) – Path to Retro project dir.

  • ds_id (int) – Index of dataset within blended dataset.

  • ds_info (dict) – Preprocessing metadata for dataset (see save_indexed_dataset_infos() for more detail).

Returns:

Array of chunk start/end indexes for this dataset, where the chunk indexes can be used for indexing into the corresponding indexed dataset.
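The idea of using chunk start/end indexes to slice an indexed dataset can be sketched as follows. This is a simplified illustration, not the library's layout: plain Python lists stand in for the NumPy array and the indexed dataset, each chunk is reduced to a (start, end) pair with an exclusive end, and any extra columns the real array may carry are omitted. All names here are hypothetical.

```python
from typing import List, Tuple


def sketch_chunk_tokens(
    doc_tokens: List[int], chunks: List[Tuple[int, int]]
) -> List[List[int]]:
    """Slice a document's tokens into chunks using (start, end) index pairs.

    Assumes `end` is exclusive, as in normal Python slicing.
    """
    return [doc_tokens[start:end] for start, end in chunks]


# A toy "document" of 10 token IDs and three chunk index pairs.
doc_tokens = list(range(10))
chunks = [(0, 4), (4, 8), (8, 10)]
chunk_tokens = sketch_chunk_tokens(doc_tokens, chunks)
```
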

core.datasets.retro.db.utils.get_individual_doc_offsets(
project_dir: str,
ds_id: int,
ds_info: dict,
) → numpy.ndarray

Load individual dataset’s document offsets.

Parameters:
  • project_dir (str) – Path to Retro project dir.

  • ds_id (int) – Index of dataset within blended dataset.

  • ds_info (dict) – Preprocessing metadata for dataset (see save_indexed_dataset_infos() for more detail).

Returns:

Array of document offsets by chunk index for this dataset.

core.datasets.retro.db.utils.get_merged_db_path_map(project_dir: str) → dict

Paths to merged datasets.

Parameters:

project_dir (str) – Path to Retro project dir.

Returns:

A dict of chunk databases, one for each of:

  • sampled: Chunks used for training the vector index.

  • train: Chunks used for pretraining ‘train’ dataset.

  • valid: Chunks used for pretraining ‘valid’ dataset.
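A minimal sketch of what such a path map could look like is given below. Only the three keys (sampled, train, valid) come from the description above; the ".hdf5" file names and the merged_dir argument are assumptions for illustration, and sketch_merged_db_path_map is a hypothetical stand-in for the library function.

```python
import os
from typing import Dict

# The three DB types named in the return description.
DB_TYPES = ("sampled", "train", "valid")


def sketch_merged_db_path_map(merged_dir: str) -> Dict[str, str]:
    """Map each DB type to a file path under `merged_dir` (sketch).

    The "<db_type>.hdf5" naming scheme is assumed, not documented here.
    """
    return {
        db_type: os.path.join(merged_dir, f"{db_type}.hdf5")
        for db_type in DB_TYPES
    }


# "merged" is a hypothetical directory name.
path_map = sketch_merged_db_path_map("merged")
```
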

core.datasets.retro.db.utils.get_merged_dataset(
project_dir: str,
chunk_length: int,
eod_token_id: int,
db_type: str,
indexed_dataset_infos: Optional[List[Dict]] = None,
) → core.datasets.retro.db.dataset.DBDataset

Get merged dataset.

Parameters:
  • project_dir (str) – Path to Retro project dir.

  • chunk_length (int) – GPT chunk length (e.g., 64).

  • eod_token_id (int) – EOD token ID.

  • db_type (str) – DB type (e.g., ‘sampled’, ‘train’, or ‘valid’).

  • indexed_dataset_infos (Optional[List[Dict]]) – Optionally, pre-loaded list of dataset metadata (see save_indexed_dataset_infos() for more detail). If not provided, the indexed dataset infos will be loaded from disk.

Returns:

A DBDataset, which is a dataset that wraps the HDF5 chunk index array.

core.datasets.retro.db.utils.get_merged_sampled_dataset(
project_dir: str,
chunk_length: int,
eod_token_id: int,
indexed_dataset_infos: Optional[List[Dict]] = None,
) → core.datasets.retro.db.dataset.DBDataset

Get sampled dataset (for training the vector index).

Parameters:
  • project_dir (str) – Path to Retro project dir.

  • chunk_length (int) – GPT chunk length (e.g., 64).

  • eod_token_id (int) – EOD token ID.

  • indexed_dataset_infos (Optional[List[Dict]]) – Optionally, pre-loaded list of dataset metadata (see save_indexed_dataset_infos() for more detail). If not provided, the indexed dataset infos will be loaded from disk.

Returns:

A DBDataset, which is a dataset that wraps the HDF5 chunk index array.

core.datasets.retro.db.utils.get_merged_train_dataset(
project_dir: str,
chunk_length: int,
eod_token_id: int,
indexed_dataset_infos: Optional[List[Dict]] = None,
) → core.datasets.retro.db.dataset.DBDataset

Get training dataset (for adding to the vector index).

Parameters:
  • project_dir (str) – Path to Retro project dir.

  • chunk_length (int) – GPT chunk length (e.g., 64).

  • eod_token_id (int) – EOD token ID.

  • indexed_dataset_infos (Optional[List[Dict]]) – Optionally, pre-loaded list of dataset metadata (see save_indexed_dataset_infos() for more detail). If not provided, the indexed dataset infos will be loaded from disk.

Returns:

A DBDataset, which is a dataset that wraps the HDF5 chunk index array.

core.datasets.retro.db.utils.get_merged_valid_dataset(
project_dir: str,
chunk_length: int,
eod_token_id: int,
indexed_dataset_infos: Optional[List[Dict]] = None,
) → core.datasets.retro.db.dataset.DBDataset

Get validation dataset (for testing the vector index).

Parameters:
  • project_dir (str) – Path to Retro project dir.

  • chunk_length (int) – GPT chunk length (e.g., 64).

  • eod_token_id (int) – EOD token ID.

  • indexed_dataset_infos (Optional[List[Dict]]) – Optionally, pre-loaded list of dataset metadata (see save_indexed_dataset_infos() for more detail). If not provided, the indexed dataset infos will be loaded from disk.

Returns:

A DBDataset, which is a dataset that wraps the HDF5 chunk index array.

core.datasets.retro.db.utils.get_merged_datasets(
project_dir: str,
chunk_length: int,
eod_token_id: int,
) → dict

Get all merged datasets.

Parameters:
  • project_dir (str) – Path to Retro project dir.

  • chunk_length (int) – GPT chunk length (e.g., 64).

  • eod_token_id (int) – EOD token ID.

Returns:

A dict mapping DB type (‘sampled’, ‘train’, or ‘valid’) to the corresponding DBDataset, which is a dataset that wraps the HDF5 chunk index array.
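One plausible way to assemble a dict like this is to call a per-type loader once for each DB type. The sketch below shows only that composition pattern: the load_one callback is a hypothetical stand-in for a call such as get_merged_dataset with a fixed project_dir, chunk_length, and eod_token_id, and the strings it returns here are placeholders rather than real DBDataset objects.

```python
from typing import Callable, Dict

# The three DB types named in the description above.
DB_TYPES = ("sampled", "train", "valid")


def sketch_get_merged_datasets(
    load_one: Callable[[str], object],
) -> Dict[str, object]:
    """Build a {db_type: dataset} dict by calling `load_one` per DB type."""
    return {db_type: load_one(db_type) for db_type in DB_TYPES}


# Stand-in loader that returns a placeholder string instead of a dataset.
datasets = sketch_get_merged_datasets(
    lambda db_type: f"<DBDataset {db_type}>"
)
```

Passing the loader in as a callback keeps the sketch independent of the real dataset class, which cannot be constructed without the HDF5 files on disk.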