core.datasets.retro.index.build#

Construct an index.

Constructing an index generally happens in two phases:

  • index.train(): Train an index on a representative set of vectors.

  • index.add(): Add vectors to an index, to be available for retrieval.
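The two phases can be illustrated with a toy inverted-file index built only on NumPy (a hypothetical stand-in; the actual Retro pipeline delegates both phases to a Faiss index):

```python
import numpy as np

class ToyIVFIndex:
    """Toy illustration of the two-phase train/add pattern.

    Hypothetical sketch only -- the real pipeline uses a Faiss index.
    """

    def __init__(self, n_centroids: int, seed: int = 0):
        self.n_centroids = n_centroids
        self.rng = np.random.default_rng(seed)
        self.centroids = None
        self.lists = None

    def train(self, vectors: np.ndarray) -> None:
        # Phase 1: learn coarse structure from a representative subset
        # (here, a few naive k-means refinement steps).
        idx = self.rng.choice(len(vectors), self.n_centroids, replace=False)
        self.centroids = vectors[idx].copy()
        for _ in range(5):
            dists = ((vectors[:, None] - self.centroids[None]) ** 2).sum(-1)
            assign = np.argmin(dists, axis=1)
            for c in range(self.n_centroids):
                members = vectors[assign == c]
                if len(members):
                    self.centroids[c] = members.mean(axis=0)
        self.lists = [[] for _ in range(self.n_centroids)]

    def add(self, vectors: np.ndarray) -> None:
        # Phase 2: route every vector to its nearest centroid's list,
        # making it available for retrieval.
        dists = ((vectors[:, None] - self.centroids[None]) ** 2).sum(-1)
        for vec, c in zip(vectors, np.argmin(dists, axis=1)):
            self.lists[c].append(vec)

vecs = np.random.default_rng(1).normal(size=(256, 8)).astype(np.float32)
index = ToyIVFIndex(n_centroids=4)
index.train(vecs[:128])  # train on a representative subset
index.add(vecs)          # add the full set for retrieval
```

The split matters for scale: training touches only a sampled subset of the chunk database, while adding streams over all chunks.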

Module Contents#

Functions#

get_empty_index_path

Path of empty index.

get_block_nload

Compute number of blocks to load.

merge_embedding_blocks

Merge individual embedding blocks into a single binary mmap file.

get_text_dataset_for_training

Convert GPT token chunk dataset to a text dataset for passing to the embedder.

embed_training_chunks

Embed DB chunks.

train_on_embeddings

Train index on embedded DB chunks.

remove_embeddings

Remove embeddings after training.

_train_index

Train index on DB chunks.

train_index

Entry point for training the index.

get_text_dataset_for_adding

Convert GPT token chunk dataset to a text dataset for passing to the embedder.

_add_to_index

Add DB chunks to index.

add_to_index

Entry point for adding to the index.

build_index

Build index.

API#

core.datasets.retro.index.build.get_empty_index_path(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → str#

Path of empty index.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

Path to the empty (trained, but without added samples) vector index.

core.datasets.retro.index.build.get_block_nload(block_path: str, load_fraction: float) → int#

Compute number of blocks to load.

This is computed by multiplying the total number of samples available in the block by the fraction of samples to load.

Parameters:
  • block_path (str) – Path to HDF5 file containing block of data. File must contain key ‘data’.

  • load_fraction (float) – Fraction (0 < load_fraction <= 1) of block samples to load.

Returns:

Number of block samples to load.
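A minimal sketch of this computation (hypothetical re-implementation; the real function reads the sample count from the `data` key of the HDF5 block file, and its exact rounding behavior is an assumption here):

```python
def get_block_nload_sketch(n_block_samples: int, load_fraction: float) -> int:
    """Number of samples to load from a block, given the load fraction.

    Hypothetical sketch: the real get_block_nload obtains n_block_samples
    from the 'data' dataset of an HDF5 file; truncation toward zero is an
    assumption about the rounding rule.
    """
    assert 0.0 < load_fraction <= 1.0, "load_fraction must be in (0, 1]"
    return int(load_fraction * n_block_samples)

nload_half = get_block_nload_sketch(100_000, 0.5)  # half of a 100k block
nload_full = get_block_nload_sketch(100_000, 1.0)  # the whole block
```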

core.datasets.retro.index.build.merge_embedding_blocks(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Merge individual embedding blocks into a single binary mmap file.

The embeddings are initially stored in block-sized HDF5 files (e.g., ~100k embeddings per block). These individual block files must be merged into a single file before training, so that the embeddings can be passed to the index as a NumPy mmap array.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
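The merge step amounts to concatenating the blocks into one flat on-disk array that can later be memory-mapped without loading everything into RAM. A self-contained sketch (the in-memory `blocks` list stands in for the HDF5 block files the real pipeline reads):

```python
import os
import tempfile

import numpy as np

# Hypothetical block contents; the real pipeline reads each block from an
# HDF5 file's 'data' dataset.
blocks = [np.full((3, 4), i, dtype="f4") for i in range(3)]
n_total = sum(len(b) for b in blocks)
dim = blocks[0].shape[1]

path = os.path.join(tempfile.mkdtemp(), "embeddings.bin")
merged = np.memmap(path, dtype="f4", mode="w+", shape=(n_total, dim))
row = 0
for block in blocks:
    merged[row:row + len(block)] = block  # append block rows to the mmap
    row += len(block)
merged.flush()

# Training can later map the merged file back lazily, page by page.
reopened = np.memmap(path, dtype="f4", mode="r", shape=(n_total, dim))
```

Because the result is a raw binary mmap, the reader must know the dtype and shape out of band; the single-file layout is what lets index training stream embeddings larger than memory.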

core.datasets.retro.index.build.get_text_dataset_for_training(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → megatron.core.datasets.retro.utils.GPTToTextDataset#

Convert GPT token chunk dataset to a text dataset for passing to the embedder.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

The text dataset consisting of tokens converted from sampled chunk database.

core.datasets.retro.index.build.embed_training_chunks(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Embed DB chunks.

Store chunks in blocks on disk. These blocks will later be merged into a single dataset for training the index.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

core.datasets.retro.index.build.train_on_embeddings(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Train index on embedded DB chunks.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

core.datasets.retro.index.build.remove_embeddings(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Remove embeddings after training.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

core.datasets.retro.index.build._train_index(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Train index on DB chunks.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

core.datasets.retro.index.build.train_index(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Entry point for training the index.

We select whether to train a new index, or validate an existing index.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
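The train-vs-validate selection can be sketched as a check for an existing trained index on disk (hypothetical logic; the real criterion lives inside `train_index` and may differ, and the `.faissindex` filename here is illustrative only):

```python
import os
import tempfile

def train_or_validate(index_path: str) -> str:
    """Choose the action for this run.

    Hypothetical sketch: train a new index when none exists on disk,
    otherwise validate the existing one.
    """
    return "validate" if os.path.exists(index_path) else "train"

tmpdir = tempfile.mkdtemp()
index_path = os.path.join(tmpdir, "empty.faissindex")  # illustrative name
action_before = train_or_validate(index_path)  # no index on disk yet
open(index_path, "wb").close()                 # simulate a trained index
action_after = train_or_validate(index_path)   # index now present
```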

core.datasets.retro.index.build.get_text_dataset_for_adding(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → megatron.core.datasets.retro.utils.GPTToTextDataset#

Convert GPT token chunk dataset to a text dataset for passing to the embedder.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

The text dataset that consists of tokens converted from the ‘train’ chunk database. These are the chunks used for retrieval by the pretraining ‘train’ dataset.

core.datasets.retro.index.build._add_to_index(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → str#

Add DB chunks to index.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

Path to the populated index.

core.datasets.retro.index.build.add_to_index(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Entry point for adding to the index.

We select whether to add to a new index, or validate an existing index.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

core.datasets.retro.index.build.build_index(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Build index.

Building the index involves sequentially running the stages above:

  • Train index (on sampled training chunks).

  • Add to index (on all training chunks).

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
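The sequential structure of `build_index` can be sketched as follows (stub stages stand in for the real `train_index` and `add_to_index` entry points; the recorded call order demonstrates the invariant that training always precedes adding):

```python
calls = []

def train_index(config) -> None:
    # Stand-in for the real train_index entry point.
    calls.append("train")

def add_to_index(config) -> None:
    # Stand-in for the real add_to_index entry point.
    calls.append("add")

def build_index(config) -> None:
    # Stages run strictly in sequence: the index must be trained
    # (on sampled training chunks) before any chunks are added to it.
    train_index(config)
    add_to_index(config)

build_index(config=None)
```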