core.datasets.retro.db.build#

Build a chunk database from a list of indexed datasets.

Building a chunk database consists of:

  • Breaking each document of each indexed dataset into consecutive retro_gpt_chunk_length chunks.

  • Re-tokenizing each chunk with the Bert tokenizer and discarding any chunks that produce empty Bert token sequences (sketched below).

  • Saving chunk offsets to disk for each indexed dataset.
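
The chunk-and-filter step can be illustrated with a small, self-contained sketch. The `gpt_detokenize` and `bert_tokenize` callables below are toy stand-ins for the real GPT and Bert tokenizers; only the control flow mirrors the steps above.

```python
# Minimal sketch of the chunk-and-filter step. The tokenizer callables are
# toy stand-ins, NOT the real GPT/Bert tokenizers used by Retro preprocessing.
from typing import Callable, List, Tuple


def chunk_document(
    gpt_token_ids: List[int],
    chunk_length: int,
    gpt_detokenize: Callable[[List[int]], str],
    bert_tokenize: Callable[[str], List[str]],
) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int]]]:
    """Split one document into consecutive chunk_length token ranges and
    separate the ranges whose text re-tokenizes to an empty Bert sequence."""
    valid, invalid = [], []
    for start in range(0, len(gpt_token_ids), chunk_length):
        end = min(start + chunk_length, len(gpt_token_ids))
        text = gpt_detokenize(gpt_token_ids[start:end])
        if bert_tokenize(text):   # non-empty Bert tokenization -> keep
            valid.append((start, end))
        else:                     # empty -> exclude from the chunk DB
            invalid.append((start, end))
    return valid, invalid


# Toy stand-ins: token id 0 detokenizes to nothing, so an all-zero chunk
# produces an empty Bert tokenization and is discarded.
gpt_detokenize = lambda ids: " ".join(str(i) for i in ids if i != 0)
bert_tokenize = lambda text: text.split()

valid, invalid = chunk_document(
    [5, 7, 0, 0, 0, 0, 0, 0, 9, 3], chunk_length=4,
    gpt_detokenize=gpt_detokenize, bert_tokenize=bert_tokenize,
)
print(valid)    # [(0, 4), (8, 10)]
print(invalid)  # [(4, 8)]
```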

Module Contents#

Functions#

build_partial_db

Process a document index range of the indexed dataset.

build_block_db

Split each document within block into consecutive retro_gpt_chunk_length size chunks.

save_block_db

Save block of chunked tokens to disk. These blocks are later used for training and adding to the vector index.

build_individual_db

Process a single indexed dataset & extract chunks.

build_individual_dbs

Iterate each indexed dataset & process its chunks.

update_chunk_counts

Set n_chunks_train & n_chunks_sampled for each individual DB.

merge_dbs

Merge individual DBs into single DB.

build_merged_dbs

Merge individual dataset components into single database.

build_db

Extract token chunks from each indexed dataset.

API#

core.datasets.retro.db.build.build_partial_db(
config: types.SimpleNamespace,
dataset_idx: int,
n_datasets: int,
indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
block_id: int,
n_blocks: int,
block: dict,
proc_id: int,
n_procs: int,
) → Tuple[int, list, list, dict]#

Process a document index range of the indexed dataset.

The chunk database is built in parallel blocks, since de-tokenizing & re-tokenizing for Bert-length computation is expensive. This method iterates the documents in its assigned index range and extracts sequential ‘chunk-length’ sequences from each one (see the sketch following the parameter list).

Parameters:
  • config (types.SimpleNamespace) – Subset of Retro config, containing ‘chunk_length’, ‘gpt_eod’, ‘gpt_detokenize’, ‘bert_tokenize’, and ‘task_validate’.

  • dataset_idx (int) – Index of this dataset out of all blended datasets.

  • n_datasets (int) – Total number of blended datasets.

  • indexed_dataset (IndexedDataset) – Indexed dataset to be chunked.

  • block_id (int) – Block index out of all blocks to be processed.

  • n_blocks (int) – Total number of blocks to be processed.

  • block (dict) – Range information such as start/end points for chunking the indexed dataset.

  • proc_id (int) – Process ID for tracking parallel process order.

  • n_procs (int) – Total number of parallel processes.

Returns:

A tuple containing:

  • Process ID.

  • List of valid chunks.

  • List of invalid chunks (i.e., chunks that re-tokenize to empty Bert token sequences).

  • Dict mapping document ID to number of valid chunks.

Return type:

Tuple[int, list, list, dict]
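
The per-worker document range is determined by the caller; as a hypothetical illustration (not the actual Megatron partitioning logic), a block's range could be split evenly across n_procs workers like this:

```python
# Hypothetical illustration of splitting a block's document range across
# worker processes; the actual partitioning in Megatron may differ.
from typing import Tuple


def proc_doc_range(block_range: Tuple[int, int], proc_id: int, n_procs: int) -> Tuple[int, int]:
    """Return the [start, end) document indexes handled by worker proc_id."""
    start, end = block_range
    per_proc = -(-(end - start) // n_procs)  # ceiling division
    doc_start = min(start + proc_id * per_proc, end)
    doc_end = min(doc_start + per_proc, end)
    return doc_start, doc_end


# Example: 100 documents split across 8 workers.
for p in range(8):
    print(p, proc_doc_range((1000, 1100), proc_id=p, n_procs=8))
# 0 (1000, 1013) ... 7 (1091, 1100)
```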

core.datasets.retro.db.build.build_block_db(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
dataset_idx: int,
n_datasets: int,
indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
n_procs: int,
executor: concurrent.futures.ProcessPoolExecutor,
n_missing_blocks: int,
block_idx: int,
block: dict,
) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]#

Split each document within block into consecutive retro_gpt_chunk_length size chunks.

Parameters:
  • config (RetroPreprocessingConfig) – For DB building, we make use of attributes ‘chunk_length’, ‘gpt_eod’, ‘gpt_detokenize’, ‘bert_tokenize’, and ‘task_validate’.

  • dataset_idx (int) – Index of this dataset out of all blended datasets.

  • n_datasets (int) – Total number of blended datasets.

  • indexed_dataset (IndexedDataset) – Indexed dataset to be chunked.

  • n_procs (int) – Total number of parallel processes.

  • executor (ProcessPoolExecutor) – Executor for launching parallel processes.

  • n_missing_blocks (int) – Total number of blocks to be processed.

  • block_idx (int) – Block index out of all blocks to be processed.

  • block (dict) – Range information such as start/end points for chunking the indexed dataset.

Returns:

A tuple containing:

  • List of valid chunks.

  • List of invalid chunks (i.e., chunks that re-tokenize to empty Bert token sequences).

  • Dict mapping document ID to number of valid chunks.

Return type:

Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]
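
As a hedged sketch of the fan-out/fan-in pattern implied by the executor parameter: the stand-in worker and array shapes below are invented for illustration; only the submit/gather/concatenate structure is the point.

```python
# Hedged sketch of the fan-out/fan-in pattern implied by the `executor`
# parameter: one build_partial_db-style worker per process, results gathered
# and concatenated. The worker below is a stand-in, not the real function.
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def fake_partial_db(proc_id: int) -> tuple:
    """Stand-in worker returning (proc_id, valid_chunks, invalid_chunks)."""
    valid = [[proc_id, i] for i in range(3)]          # pretend chunk records
    invalid = [[proc_id, -1]] if proc_id % 2 else []  # pretend discarded chunks
    return proc_id, valid, invalid


def build_block(n_procs: int):
    with ProcessPoolExecutor(max_workers=n_procs) as executor:
        futures = [executor.submit(fake_partial_db, p) for p in range(n_procs)]
        # Sort by proc_id so chunk order is deterministic across runs.
        results = sorted((f.result() for f in futures), key=lambda r: r[0])
    valid = np.array([row for _, v, _ in results for row in v])
    invalid = np.array([row for _, _, inv in results for row in inv])
    return valid, invalid


if __name__ == "__main__":
    valid, invalid = build_block(n_procs=4)
    print(valid.shape, invalid.shape)  # (12, 2) (2, 2)
```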

core.datasets.retro.db.build.save_block_db(
block: dict,
chunk_db_valid: numpy.ndarray,
chunk_db_invalid: numpy.ndarray,
doc_offsets: numpy.ndarray,
) → None#

Save block of chunked tokens to disk. These blocks are later used for training and adding to the vector index.

Parameters:
  • block (dict) – Range information such as start/end points for chunking the indexed dataset.

  • chunk_db_valid (np.ndarray) – Array of valid chunk indexes.

  • chunk_db_invalid (np.ndarray) – Array of invalid chunk indexes.

  • doc_offsets (np.ndarray) – Array of document offsets by chunks.
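
A minimal sketch of this kind of block save using h5py; the file path and dataset names below are illustrative assumptions, not the exact schema written by save_block_db.

```python
# Illustrative sketch of persisting a block's arrays with h5py. The file
# path and dataset names are assumptions, not the exact save_block_db schema.
import h5py
import numpy as np

block = {"path": "/tmp/chunk_db_block_000000.hdf5"}   # hypothetical block path
chunk_db_valid = np.zeros((10, 4), dtype=np.int64)    # placeholder arrays
chunk_db_invalid = np.zeros((2, 4), dtype=np.int64)
doc_offsets = np.zeros((5, 2), dtype=np.int64)

with h5py.File(block["path"], "w") as f:
    f.create_dataset("chunks_valid", data=chunk_db_valid)
    f.create_dataset("chunks_invalid", data=chunk_db_invalid)
    f.create_dataset("doc_offsets", data=doc_offsets)
```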

core.datasets.retro.db.build.build_individual_db(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
dataset_idx: int,
n_datasets: int,
dataset_info: dict,
) → None#

Process a single indexed dataset & extract chunks.

Parameters:
  • config (RetroPreprocessingConfig) – Retro preprocessing config.

  • dataset_idx (int) – Dataset index within blended dataset.

  • n_datasets (int) – Total number of datasets within blended dataset.

  • dataset_info (dict) – Metadata for dataset (see save_indexed_dataset_infos() in utils.py for more detail).

core.datasets.retro.db.build.build_individual_dbs(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
indexed_dataset_infos: List[Dict],
) → None#

Iterate each indexed dataset & process its chunks.

Parameters:
  • config (RetroPreprocessingConfig) – Retro preprocessing config.

  • indexed_dataset_infos (List[Dict]) – Preprocessing metadata for each dataset.
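
The iteration described above amounts to a loop over the per-dataset metadata, calling build_individual_db once per dataset. A hedged sketch, assuming config and indexed_dataset_infos were prepared earlier in the preprocessing run and that the module lives under megatron.core:

```python
# Hedged sketch of the per-dataset loop; assumes `config` and
# `indexed_dataset_infos` were prepared elsewhere in the preprocessing run.
from megatron.core.datasets.retro.db.build import build_individual_db


def process_all_datasets(config, indexed_dataset_infos):
    n_datasets = len(indexed_dataset_infos)
    for dataset_idx, dataset_info in enumerate(indexed_dataset_infos):
        build_individual_db(config, dataset_idx, n_datasets, dataset_info)
```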

core.datasets.retro.db.build.update_chunk_counts(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
indexed_dataset_infos: List[Dict],
) → None#

Set n_chunks_train & n_chunks_sampled for each individual DB.

Parameters:
  • config (RetroPreprocessingConfig) – Retro preprocessing config.

  • indexed_dataset_infos (List[Dict]) – Preprocessing metadata for each dataset (i.e., ‘prefix’, ‘ratio’, ‘n_chunks’, etc.).
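
As a rough illustration only (the real computation depends on the Retro config's data split and may use different formulas), per-dataset counts could be derived from a train fraction and the per-dataset sampling ratio like this:

```python
# Rough illustration only: one plausible way to derive per-dataset chunk
# counts from a train fraction and sampling ratio. Not the exact formulas
# used by update_chunk_counts.
def set_chunk_counts(ds_info: dict, train_fraction: float) -> None:
    ds_info["n_chunks_train"] = int(ds_info["n_chunks"] * train_fraction)
    ds_info["n_chunks_sampled"] = int(ds_info["n_chunks_train"] * ds_info["ratio"])


info = {"prefix": "wiki", "ratio": 0.3, "n_chunks": 1_000_000}
set_chunk_counts(info, train_fraction=0.95)
print(info["n_chunks_train"], info["n_chunks_sampled"])  # 950000 285000
```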

core.datasets.retro.db.build.merge_dbs(
project_dir: str,
indexed_dataset_infos: List[Dict],
db_type: str,
) → None#

Merge individual DBs into single DB.

Parameters:
  • project_dir (str) – Retro project dir.

  • indexed_dataset_infos (List[Dict]) – Preprocessing metadata for each dataset (i.e., ‘prefix’, ‘ratio’, ‘n_chunks’, etc.).

  • db_type (str) – DB type (e.g., ‘sampled’, ‘train’, or ‘valid’).

core.datasets.retro.db.build.build_merged_dbs(
project_dir: str,
indexed_dataset_infos: List[Dict],
) → None#

Merge individual dataset components into single database.

This method merges databases for DB types:

  • ‘sampled’: used for training the vector index.

  • ‘train’: used for adding to the trained vector index.

  • ‘valid’: can be used for validating/testing the vector index.

Parameters:
  • project_dir (str) – Retro project dir.

  • indexed_dataset_infos (List[Dict]) – Preprocessing metadata for each dataset (i.e., ‘prefix’, ‘ratio’, ‘n_chunks’, etc.).
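
Conceptually this is one merge_dbs call per DB type. A hedged usage sketch, with project_dir and indexed_dataset_infos as placeholders and the import path assuming the module lives under megatron.core:

```python
# Hedged sketch: build_merged_dbs is conceptually a merge_dbs call per DB
# type. `project_dir` and `indexed_dataset_infos` are placeholders here.
from megatron.core.datasets.retro.db.build import merge_dbs


def merge_all(project_dir: str, indexed_dataset_infos: list) -> None:
    for db_type in ("sampled", "train", "valid"):
        merge_dbs(project_dir, indexed_dataset_infos, db_type)
```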

core.datasets.retro.db.build.build_db(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Extract token chunks from each indexed dataset.

Iterate each document of each indexed dataset, extract that document’s chunks, and save them to a ‘DB’ (an HDF5 file).

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
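
A hedged end-to-end usage sketch. Constructing a valid RetroPreprocessingConfig requires tokenizers, project paths, and other fields not shown here, so the config below is a placeholder; the import paths assume the modules live under megatron.core.

```python
# Hedged end-to-end sketch. Constructing RetroPreprocessingConfig requires
# tokenizers, paths, and other fields omitted here; `config` is a placeholder.
from megatron.core.datasets.retro.config import RetroPreprocessingConfig
from megatron.core.datasets.retro.db.build import build_db

config: RetroPreprocessingConfig = ...   # built from your Retro preprocessing args

# Chunks every indexed dataset and writes the per-dataset and merged DBs
# (HDF5 files) under the configured Retro project directory.
build_db(config)
```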