core.datasets.retro.db.build
Build a chunk database from a list of indexed datasets.
Building a chunk database consists of:
- Breaking each document of each indexed dataset into consecutive retro_gpt_chunk_length chunks (illustrated in the sketch below).
- Re-tokenizing each chunk with the Bert tokenizer, and discarding any chunks that produce empty Bert token sequences.
- Saving chunk offsets to disk for each indexed dataset.
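The sketch below illustrates this per-document chunking step, assuming hypothetical gpt_detokenize and bert_tokenize callables; the real routine also respects document-end (gpt_eod) boundaries and optional validation, which are omitted here.

```python
# Minimal sketch of per-document chunking (hypothetical helpers, not the module's code).
from typing import Callable, List

def chunk_document(
    gpt_token_ids: List[int],
    chunk_length: int,
    gpt_detokenize: Callable[[List[int]], str],
    bert_tokenize: Callable[[str], List[int]],
) -> List[List[int]]:
    """Split one document into consecutive fixed-length GPT chunks, keeping only
    chunks that re-tokenize to a non-empty Bert token sequence."""
    valid_chunks = []
    for start in range(0, len(gpt_token_ids), chunk_length):
        chunk = gpt_token_ids[start : start + chunk_length]
        text = gpt_detokenize(chunk)      # GPT token IDs -> text
        bert_ids = bert_tokenize(text)    # text -> Bert token IDs
        if bert_ids:                      # discard chunks with empty Bert tokens
            valid_chunks.append(chunk)
    return valid_chunks
```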
Module Contents
Functions
| Function | Description |
|---|---|
| build_partial_db | Process a document index range of the indexed dataset. |
| build_block_db | Split each document within block into consecutive retro_gpt_chunk_length size chunks. |
| save_block_db | Save block of chunked tokens to disk. These blocks are later used for training and adding to the vector index. |
| build_individual_db | Process a single indexed dataset & extract chunks. |
| build_individual_dbs | Iterate each indexed dataset & process its chunks. |
| update_chunk_counts | Set n_chunks_train & n_chunks_sampled for each individual DB. |
| merge_dbs | Merge individual DBs into single DB. |
| build_merged_dbs | Merge individual dataset components into single database. |
| build_db | Extract token chunks from each indexed dataset. |
API
- core.datasets.retro.db.build.build_partial_db(
- config: types.SimpleNamespace,
- dataset_idx: int,
- n_datasets: int,
- indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
- block_id: int,
- n_blocks: int,
- block: dict,
- proc_id: int,
- n_procs: int,
Process a document index range of the indexed dataset.
The chunk database is built in parallel blocks, since de-tokenizing & re-tokenizing for Bert-length computation is expensive. This method iterates each document and extracts sequential ‘chunk-length’ sequences from each document.
- Parameters:
config (types.SimpleNamespace) – Subset of Retro config, containing ‘chunk_length’, ‘gpt_eod’, ‘gpt_detokenize’, ‘bert_tokenize’, and ‘task_validate’.
dataset_idx (int) – Index of this dataset out of all blended datasets.
n_datasets (int) – Total number of blended datasets.
indexed_dataset (IndexedDataset) – Indexed dataset to be chunked.
block_id (int) – Block index out of all blocks to be processed.
n_blocks (int) – Total number of blocks to be processed.
block (dict) – Range information, such as start/end points, for chunking the indexed dataset.
proc_id (int) – Process ID for tracking parallel process order.
n_procs (int) – Total number of parallel processes.
- Returns:
A tuple containing:
- Process ID.
- List of valid chunks.
- List of invalid chunks (i.e., chunks that converted to empty Bert embeddings).
- Dict mapping document ID to number of valid chunks.
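As a rough illustration of how a block's document range might be divided among the n_procs workers (the partitioning arithmetic and dict layout here are assumptions, not the module's exact code):

```python
# Sketch: splitting a block's [start, end) document range across worker processes.
def proc_doc_range(block_start: int, block_end: int, proc_id: int, n_procs: int):
    """Return the [start, end) document sub-range handled by one worker."""
    n_docs = block_end - block_start
    per_proc = (n_docs + n_procs - 1) // n_procs   # ceiling division
    start = block_start + proc_id * per_proc
    end = min(block_end, start + per_proc)
    return start, end

# Example: 10 documents split across 4 workers.
print([proc_doc_range(0, 10, p, 4) for p in range(4)])
# -> [(0, 3), (3, 6), (6, 9), (9, 10)]
```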
- core.datasets.retro.db.build.build_block_db(
- config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
- dataset_idx: int,
- n_datasets: int,
- indexed_dataset: megatron.core.datasets.indexed_dataset.IndexedDataset,
- n_procs: int,
- executor: concurrent.futures.ProcessPoolExecutor,
- n_missing_blocks: int,
- block_idx: int,
- block: dict,
Split each document within block into consecutive retro_gpt_chunk_length size chunks.
- Parameters:
config (RetroPreprocessingConfig) – For DB building, we make use of attributes ‘chunk_length’, ‘gpt_eod’, ‘gpt_detokenize’, ‘bert_tokenize’, and ‘task_validate’.
dataset_idx (int) – Index of this dataset out of all blended datasets.
n_datasets (int) – Total number of blended datasets.
indexed_dataset (IndexedDataset) – Indexed dataset to be chunked.
n_procs (int) – Total number of parallel processes.
executor (ProcessPoolExecutor) – Executor for launching parallel processes.
n_missing_blocks (int) – Total number of blocks to be processed.
block_idx (int) – Block index out of all blocks to be processed.
block (dict) – Range information, such as start/end points, for chunking the indexed dataset.
- Returns:
A tuple containing:
- List of valid chunks.
- List of invalid chunks (i.e., chunks that converted to empty Bert embeddings).
- Dict mapping document ID to number of valid chunks.
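The fan-out pattern implied here can be sketched as follows, with a placeholder worker standing in for build_partial_db (the argument plumbing is an assumption):

```python
# Sketch: dispatching one block to parallel workers via ProcessPoolExecutor.
import concurrent.futures

def _worker(proc_id: int, n_procs: int) -> tuple:
    # Placeholder for per-process chunk extraction (e.g., build_partial_db).
    return proc_id, [], [], {}

def run_block(executor: concurrent.futures.ProcessPoolExecutor, n_procs: int):
    futures = [executor.submit(_worker, p, n_procs) for p in range(n_procs)]
    # Collect results and restore process order so chunk indexes stay contiguous.
    return sorted((f.result() for f in futures), key=lambda r: r[0])

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        print(run_block(executor, n_procs=4))
```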
- core.datasets.retro.db.build.save_block_db(
- block: dict,
- chunk_db_valid: numpy.ndarray,
- chunk_db_invalid: numpy.ndarray,
- doc_offsets: numpy.ndarray,
Save block of chunked tokens to disk. These blocks are later used for training and adding to the vector index.
- Parameters:
block (dict) – Range information, such as start/end points, for chunking the indexed dataset.
chunk_db_valid (np.ndarray) – Array of valid chunk indexes.
chunk_db_invalid (np.ndarray) – Array of invalid chunk indexes.
doc_offsets (np.ndarray) – Array of document offsets by chunks.
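A minimal sketch of writing one block to disk with h5py; the file name, dataset names, and array shapes below are placeholders, not the module's exact on-disk layout:

```python
# Sketch: persisting one block's arrays to an HDF5 file.
import numpy as np
import h5py

chunk_db_valid = np.zeros((0, 5), dtype=np.uint32)    # placeholder shapes/dtypes
chunk_db_invalid = np.zeros((0, 5), dtype=np.uint32)
doc_offsets = np.zeros((0, 3), dtype=np.uint64)

with h5py.File("block_000000.hdf5", "w") as f:
    f.create_dataset("chunks_valid", data=chunk_db_valid)
    f.create_dataset("chunks_invalid", data=chunk_db_invalid)
    f.create_dataset("doc_offsets", data=doc_offsets)
```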
- core.datasets.retro.db.build.build_individual_db(
- config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
- dataset_idx: int,
- n_datasets: int,
- dataset_info: dict,
Process a single indexed dataset & extract chunks.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
dataset_idx (int) – Dataset index within blended dataset.
n_datasets (int) – Total number of datasets within blended dataset.
dataset_info (dict) – Metadata for dataset (see save_indexed_dataset_infos() in utils.py for more detail).
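Processing a single dataset amounts to walking its document range in fixed-size blocks; a rough sketch of that blocking (the block size and dict keys are assumptions):

```python
# Sketch: splitting one dataset's documents into fixed-size processing blocks.
def make_blocks(n_docs: int, block_size: int):
    return [
        {"range": (start, min(n_docs, start + block_size))}
        for start in range(0, n_docs, block_size)
    ]

print(make_blocks(n_docs=2500, block_size=1000))
# -> [{'range': (0, 1000)}, {'range': (1000, 2000)}, {'range': (2000, 2500)}]
```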
- core.datasets.retro.db.build.build_individual_dbs(
- config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
- indexed_dataset_infos: List[Dict],
Iterate each indexed dataset & process its chunks.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
indexed_dataset_infos (List[Dict]) – Preprocessing metadata for each dataset.
- core.datasets.retro.db.build.update_chunk_counts(
- config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
- indexed_dataset_infos: List[Dict],
Set n_chunks_train & n_chunks_sampled for each individual DB.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
indexed_dataset_infos (List[Dict]) – Preprocessing metadata for each dataset (i.e., ‘prefix’, ‘ratio’, ‘n_chunks’, etc.).
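One plausible shape of this bookkeeping, using the ‘ratio’ and ‘n_chunks’ metadata fields mentioned above; the arithmetic is an assumption for illustration, not the module's exact rule:

```python
# Sketch: deriving per-dataset chunk counts from preprocessing metadata (illustrative only).
indexed_dataset_infos = [
    {"prefix": "wiki", "ratio": 0.3, "n_chunks": 1_000_000},
    {"prefix": "web",  "ratio": 0.7, "n_chunks": 4_000_000},
]
n_sampled_total = 2_000_000   # e.g., number of chunks used to train the vector index

for info in indexed_dataset_infos:
    info["n_chunks_train"] = info["n_chunks"]                        # chunks added to the index
    info["n_chunks_sampled"] = int(info["ratio"] * n_sampled_total)  # subset used to train it
```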
- core.datasets.retro.db.build.merge_dbs(
- project_dir: str,
- indexed_dataset_infos: List[Dict],
- db_type: str,
Merge individual DBs into single DB.
- Parameters:
project_dir (str) – Retro project dir.
indexed_dataset_infos (List[Dict]) – Preprocessing metadata for each dataset (i.e., ‘prefix’, ‘ratio’, ‘n_chunks’, etc.).
db_type (str) – DB type (e.g., ‘sampled’, ‘train’, or ‘valid’).
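Conceptually, merging concatenates the per-block HDF5 files into one database; a sketch with h5py (paths and dataset names are assumptions, and the real routine also tracks per-dataset offsets):

```python
# Sketch: concatenating per-block HDF5 files into a single merged DB.
import glob
import numpy as np
import h5py

arrays = []
for path in sorted(glob.glob("blocks/*.hdf5")):
    with h5py.File(path, "r") as f:
        arrays.append(f["chunks_valid"][:])   # dataset name is an assumption

merged = (
    np.concatenate(arrays, axis=0)
    if arrays
    else np.zeros((0, 5), dtype=np.uint32)    # placeholder shape/dtype
)

with h5py.File("merged_train.hdf5", "w") as out:
    out.create_dataset("chunks", data=merged)
```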
- core.datasets.retro.db.build.build_merged_dbs(
- project_dir: str,
- indexed_dataset_infos: List[Dict],
Merge individual dataset components into single database.
This method merges databases for the following DB types:
- ‘sampled’: used for training the vector index.
- ‘train’: used for adding to the trained vector index.
- ‘valid’: can be used for validating/testing the vector index.
- Parameters:
project_dir (str) – Retro project dir.
indexed_dataset_infos (List[Dict]) – Preprocessing metadata for each dataset (i.e., ‘prefix’, ‘ratio’, ‘n_chunks’, etc.).
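A caller-level sketch of this step, looping merge_dbs over the three DB types (the project path and metadata are placeholders):

```python
# Sketch: merging each DB type in turn.
from megatron.core.datasets.retro.db.build import merge_dbs

project_dir = "/path/to/retro/project"        # placeholder
indexed_dataset_infos = [...]                 # preprocessing metadata loaded elsewhere

for db_type in ("sampled", "train", "valid"):
    merge_dbs(project_dir, indexed_dataset_infos, db_type)
```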
- core.datasets.retro.db.build.build_db(
- config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
Extract token chunks from each indexed dataset.
Iterate each document of each indexed dataset, extract that document’s chunks, and save to a ‘DB’ (hdf5 file).
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
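A top-level usage sketch; construction of the preprocessing config is elided here, since it is assembled elsewhere in the Retro preprocessing pipeline:

```python
# Sketch: invoking the end-to-end chunk-database build.
from megatron.core.datasets.retro.config import RetroPreprocessingConfig
from megatron.core.datasets.retro.db.build import build_db

config: RetroPreprocessingConfig = ...  # populated by the preprocessing pipeline
build_db(config)
```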