core.datasets.retro.db.build#

Build a chunk database from a list of indexed datasets.

Building a chunk database consists of:

  • Breaking each document of each indexed dataset into consecutive retro_gpt_chunk_length chunks.

  • Re-tokenizing each chunk with the Bert tokenizer and discarding any chunks that produce empty Bert token sequences (sketched below).

  • Saving chunk offsets to disk for each indexed dataset.
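
The chunk-and-filter step can be illustrated with a small, self-contained sketch. The `gpt_detokenize` and `bert_tokenize` callables below are toy stand-ins for the real GPT and Bert tokenizers; only the control flow mirrors the steps above.

```python
# Minimal sketch of the chunk-and-filter step. The tokenizer callables are
# toy stand-ins, NOT the real GPT/Bert tokenizers used by Retro preprocessing.
from typing import Callable, List, Tuple


def chunk_document(
    gpt_token_ids: List[int],
    chunk_length: int,
    gpt_detokenize: Callable[[List[int]], str],
    bert_tokenize: Callable[[str], List[str]],
) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int]]]:
    """Split one document into consecutive chunk_length token ranges and
    separate the ranges whose text re-tokenizes to an empty Bert sequence."""
    valid, invalid = [], []
    for start in range(0, len(gpt_token_ids), chunk_length):
        end = min(start + chunk_length, len(gpt_token_ids))
        text = gpt_detokenize(gpt_token_ids[start:end])
        if bert_tokenize(text):   # non-empty Bert tokenization -> keep
            valid.append((start, end))
        else:                     # empty -> exclude from the chunk DB
            invalid.append((start, end))
    return valid, invalid


# Toy stand-ins: token id 0 detokenizes to nothing, so an all-zero chunk
# produces an empty Bert tokenization and is discarded.
gpt_detokenize = lambda ids: " ".join(str(i) for i in ids if i != 0)
bert_tokenize = lambda text: text.split()

valid, invalid = chunk_document(
    [5, 7, 0, 0, 0, 0, 0, 0, 9, 3], chunk_length=4,
    gpt_detokenize=gpt_detokenize, bert_tokenize=bert_tokenize,
)
print(valid)    # [(0, 4), (8, 10)]
print(invalid)  # [(4, 8)]
```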

Module Contents#

Functions#

build_partial_db

Process a document index range of the indexed dataset.

build_block_db

Split each document within block into consecutive retro_gpt_chunk_length size chunks.

save_block_db

Save block of chunked tokens to disk. These blocks are later used for training and adding to the vector index.

build_individual_db

Process a single indexed dataset & extract chunks.

build_individual_dbs

Iterate each indexed dataset & process its chunks.

update_chunk_counts

Set n_chunks_train & n_chunks_sampled for each individual DB.

merge_dbs

Merge individual DBs into single DB.

build_merged_dbs

Merge individual dataset components into single database.

build_db

Extract token chunks from each indexed dataset.

API#

core.datasets.retro.db.build.build_partial_db(
config: types.SimpleNamespace,
dataset_idx: int,
n_datasets: int,
indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
block_id: int,
n_blocks: int,
block: dict,
proc_id: int,
n_procs: int,
) → Tuple[int, list, list, dict]#

Process a document index range of the indexed dataset.

The chunk database is built in parallel blocks, since de-tokenizing & re-tokenizing for Bert-length computation is expensive. This method iterates the documents in its assigned index range and extracts sequential ‘chunk-length’ sequences from each one (see the sketch following the parameter list).

Parameters:
  • config (types.SimpleNamespace) – Subset of Retro config, containing ‘chunk_length’, ‘gpt_eod’, ‘gpt_detokenize’, ‘bert_tokenize’, and ‘task_validate’.

  • dataset_idx (int) – Index of this dataset out of all blended datasets.

  • n_datasets (int) – Total number of blended datasets.

  • indexed_dataset (IndexedDataset) – Indexed dataset to be chunked.

  • block_id (int) – Block index out of all blocks to be processed.

  • n_blocks (int) – Total number of blocks to be processed.

  • block (dict) – Range information such as start/end points for chunking the indexed dataset.

  • proc_id (int) – Process ID for tracking parallel process order.

  • n_procs (int) – Total number of parallel processes.

Returns:

A tuple containing:

  • Process ID.

  • List of valid chunks.

  • List of invalid chunks (i.e., chunks that re-tokenize to empty Bert token sequences).

  • Dict mapping document ID to number of valid chunks.

Return type:

Tuple[int, list, list, dict]
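
The per-worker document range is determined by the caller; as a hypothetical illustration (not the actual Megatron partitioning logic), a block's range could be split evenly across n_procs workers like this:

```python
# Hypothetical illustration of splitting a block's document range across
# worker processes; the actual partitioning in Megatron may differ.
from typing import Tuple


def proc_doc_range(block_range: Tuple[int, int], proc_id: int, n_procs: int) -> Tuple[int, int]:
    """Return the [start, end) document indexes handled by worker proc_id."""
    start, end = block_range
    per_proc = -(-(end - start) // n_procs)  # ceiling division
    doc_start = min(start + proc_id * per_proc, end)
    doc_end = min(doc_start + per_proc, end)
    return doc_start, doc_end


# Example: 100 documents split across 8 workers.
for p in range(8):
    print(p, proc_doc_range((1000, 1100), proc_id=p, n_procs=8))
# 0 (1000, 1013) ... 7 (1091, 1100)
```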

core.datasets.retro.db.build.build_block_db(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
dataset_idx: int,
n_datasets: int,
indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
n_procs: int,
executor: concurrent.futures.ProcessPoolExecutor,
n_missing_blocks: int,
block_idx: int,
block: dict,
) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]#

Split each document within block into consecutive retro_gpt_chunk_length size chunks.

Parameters:
  • config (RetroPreprocessingConfig) – For DB building, we make use of attributes ‘chunk_length’, ‘gpt_eod’, ‘gpt_detokenize’, ‘bert_tokenize’, and ‘task_validate’.

  • dataset_idx (int) – Index of this dataset out of all blended datasets.

  • n_datasets (int) – Total number of blended datasets.

  • indexed_dataset (IndexedDataset) – Indexed dataset to be chunked.

  • n_procs (int) – Total number of parallel processes.

  • executor (ProcessPoolExecutor) – Executor for launching parallel processes.

  • n_missing_blocks (int) – Total number of blocks to be processed.

  • block_idx (int) – Block index out of all blocks to be processed.

  • block (dict) – Range information such as start/end points for chunking the indexed dataset.

Returns:

A tuple containing:

  • List of valid chunks.

  • List of invalid chunks (i.e., chunks that re-tokenize to empty Bert token sequences).

  • Dict mapping document ID to number of valid chunks.

Return type:

Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]
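
As a hedged sketch of the fan-out/fan-in pattern implied by the executor parameter: the stand-in worker and array shapes below are invented for illustration; only the submit/gather/concatenate structure is the point.

```python
# Hedged sketch of the fan-out/fan-in pattern implied by the `executor`
# parameter: one build_partial_db-style worker per process, results gathered
# and concatenated. The worker below is a stand-in, not the real function.
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def fake_partial_db(proc_id: int) -> tuple:
    """Stand-in worker returning (proc_id, valid_chunks, invalid_chunks)."""
    valid = [[proc_id, i] for i in range(3)]          # pretend chunk records
    invalid = [[proc_id, -1]] if proc_id % 2 else []  # pretend discarded chunks
    return proc_id, valid, invalid


def build_block(n_procs: int):
    with ProcessPoolExecutor(max_workers=n_procs) as executor:
        futures = [executor.submit(fake_partial_db, p) for p in range(n_procs)]
        # Sort by proc_id so chunk order is deterministic across runs.
        results = sorted((f.result() for f in futures), key=lambda r: r[0])
    valid = np.array([row for _, v, _ in results for row in v])
    invalid = np.array([row for _, _, inv in results for row in inv])
    return valid, invalid


if __name__ == "__main__":
    valid, invalid = build_block(n_procs=4)
    print(valid.shape, invalid.shape)  # (12, 2) (2, 2)
```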

core.datasets.retro.db.build.save_block_db(
block: dict,
chunk_db_valid: numpy.ndarray,
chunk_db_invalid: numpy.ndarray,
doc_offsets: numpy.ndarray,
) → None#

Save block of chunked tokens to disk. These blocks are later used for training and adding to the vector index.

Parameters:
  • block (dict) – Range information such as start/end points for chunking the indexed dataset.

  • chunk_db_valid (np.ndarray) – Array of valid chunk indexes.

  • chunk_db_invalid (np.ndarray) – Array of invalid chunk indexes.

  • doc_offsets (np.ndarray) – Array of document offsets by chunks.
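
A minimal sketch of this kind of block save using h5py; the file path and dataset names below are illustrative assumptions, not the exact schema written by save_block_db.

```python
# Illustrative sketch of persisting a block's arrays with h5py. The file
# path and dataset names are assumptions, not the exact save_block_db schema.
import h5py
import numpy as np

block = {"path": "/tmp/chunk_db_block_000000.hdf5"}   # hypothetical block path
chunk_db_valid = np.zeros((10, 4), dtype=np.int64)    # placeholder arrays
chunk_db_invalid = np.zeros((2, 4), dtype=np.int64)
doc_offsets = np.zeros((5, 2), dtype=np.int64)

with h5py.File(block["path"], "w") as f:
    f.create_dataset("chunks_valid", data=chunk_db_valid)
    f.create_dataset("chunks_invalid", data=chunk_db_invalid)
    f.create_dataset("doc_offsets", data=doc_offsets)
```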

core.datasets.retro.db.build.build_individual_db(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
dataset_idx: int,
n_datasets: int,
dataset_info: dict,
) → None#

Process a single indexed dataset & extract chunks.

Parameters:
  • config (RetroPreprocessingConfig) – Retro preprocessing config.

  • dataset_idx (int) – Dataset index within blended dataset.

  • n_datasets (int) – Total number of datasets within blended dataset.

  • dataset_info (dict) – Metadata for dataset (see save_indexed_dataset_infos() in utils.py for more detail).

core.datasets.retro.db.build.build_individual_dbs(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
indexed_dataset_infos: List[Dict],
) → None#

Iterate each indexed dataset & process its chunks.

Parameters:
  • config (RetroPreprocessingConfig) – Retro preprocessing config.

  • indexed_dataset_infos (List[Dict]) – Preprocessing metadata for each dataset.
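
The iteration described above amounts to a loop over the per-dataset metadata, calling build_individual_db once per dataset. A hedged sketch, assuming config and indexed_dataset_infos were prepared earlier in the preprocessing run and that the module lives under megatron.core:

```python
# Hedged sketch of the per-dataset loop; assumes `config` and
# `indexed_dataset_infos` were prepared elsewhere in the preprocessing run.
from megatron.core.datasets.retro.db.build import build_individual_db


def process_all_datasets(config, indexed_dataset_infos):
    n_datasets = len(indexed_dataset_infos)
    for dataset_idx, dataset_info in enumerate(indexed_dataset_infos):
        build_individual_db(config, dataset_idx, n_datasets, dataset_info)
```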

core.datasets.retro.db.build.update_chunk_counts(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
indexed_dataset_infos: List[Dict],
) → None#

Set n_chunks_train & n_chunks_sampled for each individual DB.

Parameters:
  • config (RetroPreprocessingConfig) – Retro preprocessing config.

  • indexed_dataset_infos (List[Dict]) – Preprocessing metadata for each dataset (i.e., ‘prefix’, ‘ratio’, ‘n_chunks’, etc.).
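
As a rough illustration only (the real computation depends on the Retro config's data split and may use different formulas), per-dataset counts could be derived from a train fraction and the per-dataset sampling ratio like this:

```python
# Rough illustration only: one plausible way to derive per-dataset chunk
# counts from a train fraction and sampling ratio. Not the exact formulas
# used by update_chunk_counts.
def set_chunk_counts(ds_info: dict, train_fraction: float) -> None:
    ds_info["n_chunks_train"] = int(ds_info["n_chunks"] * train_fraction)
    ds_info["n_chunks_sampled"] = int(ds_info["n_chunks_train"] * ds_info["ratio"])


info = {"prefix": "wiki", "ratio": 0.3, "n_chunks": 1_000_000}
set_chunk_counts(info, train_fraction=0.95)
print(info["n_chunks_train"], info["n_chunks_sampled"])  # 950000 285000
```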

core.datasets.retro.db.build.merge_dbs(
project_dir: str,
indexed_dataset_infos: List[Dict],
db_type: str,
) → None#

Merge individual DBs into single DB.

Parameters:
  • project_dir (str) – Retro project dir.

  • indexed_dataset_infos (List[Dict]) – Preprocessing metadata for each dataset (i.e., ‘prefix’, ‘ratio’, ‘n_chunks’, etc.).

  • db_type (str) – DB type (e.g., ‘sampled’, ‘train’, or ‘valid’).

core.datasets.retro.db.build.build_merged_dbs(
project_dir: str,
indexed_dataset_infos: List[Dict],
) → None#

Merge individual dataset components into single database.

This method merges databases for DB types:

  • ‘sampled’: used for training the vector index.

  • ‘train’: used for adding to the trained vector index.

  • ‘valid’: can be used for validating/testing the vector index.

Parameters:
  • project_dir (str) – Retro project dir.

  • indexed_dataset_infos (List[Dict]) – Preprocessing metadata for each dataset (i.e., ‘prefix’, ‘ratio’, ‘n_chunks’, etc.).
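
Conceptually this is one merge_dbs call per DB type. A hedged usage sketch, with project_dir and indexed_dataset_infos as placeholders and the import path assuming the module lives under megatron.core:

```python
# Hedged sketch: build_merged_dbs is conceptually a merge_dbs call per DB
# type. `project_dir` and `indexed_dataset_infos` are placeholders here.
from megatron.core.datasets.retro.db.build import merge_dbs


def merge_all(project_dir: str, indexed_dataset_infos: list) -> None:
    for db_type in ("sampled", "train", "valid"):
        merge_dbs(project_dir, indexed_dataset_infos, db_type)
```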

core.datasets.retro.db.build.build_db(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Extract token chunks from each indexed dataset.

Iterate each document of each indexed dataset, extract that document’s chunks, and save them to a ‘DB’ (an HDF5 file).

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
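
A hedged end-to-end usage sketch. Constructing a valid RetroPreprocessingConfig requires tokenizers, project paths, and other fields not shown here, so the config below is a placeholder; the import paths assume the modules live under megatron.core.

```python
# Hedged end-to-end sketch. Constructing RetroPreprocessingConfig requires
# tokenizers, paths, and other fields omitted here; `config` is a placeholder.
from megatron.core.datasets.retro.config import RetroPreprocessingConfig
from megatron.core.datasets.retro.db.build import build_db

config: RetroPreprocessingConfig = ...   # built from your Retro preprocessing args

# Chunks every indexed dataset and writes the per-dataset and merged DBs
# (HDF5 files) under the configured Retro project directory.
build_db(config)
```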