core.datasets.retro.index.build#

Construct an index.

Constructing an index generally happens in two phases:

  • index.train(): Train an index on a representative set of vectors.

  • index.add(): Add vectors to an index, to be available for retrieval.
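The two phases can be illustrated with a toy inverted-file index built only on NumPy (a hypothetical stand-in; the actual Retro pipeline delegates both phases to a Faiss index):

```python
import numpy as np

class ToyIVFIndex:
    """Toy illustration of the two-phase train/add pattern.

    Hypothetical sketch only -- the real pipeline uses a Faiss index.
    """

    def __init__(self, n_centroids: int, seed: int = 0):
        self.n_centroids = n_centroids
        self.rng = np.random.default_rng(seed)
        self.centroids = None
        self.lists = None

    def train(self, vectors: np.ndarray) -> None:
        # Phase 1: learn coarse structure from a representative subset
        # (here, a few naive k-means refinement steps).
        idx = self.rng.choice(len(vectors), self.n_centroids, replace=False)
        self.centroids = vectors[idx].copy()
        for _ in range(5):
            dists = ((vectors[:, None] - self.centroids[None]) ** 2).sum(-1)
            assign = np.argmin(dists, axis=1)
            for c in range(self.n_centroids):
                members = vectors[assign == c]
                if len(members):
                    self.centroids[c] = members.mean(axis=0)
        self.lists = [[] for _ in range(self.n_centroids)]

    def add(self, vectors: np.ndarray) -> None:
        # Phase 2: route every vector to its nearest centroid's list,
        # making it available for retrieval.
        dists = ((vectors[:, None] - self.centroids[None]) ** 2).sum(-1)
        for vec, c in zip(vectors, np.argmin(dists, axis=1)):
            self.lists[c].append(vec)

vecs = np.random.default_rng(1).normal(size=(256, 8)).astype(np.float32)
index = ToyIVFIndex(n_centroids=4)
index.train(vecs[:128])  # train on a representative subset
index.add(vecs)          # add the full set for retrieval
```

The split matters for scale: training touches only a sampled subset of the chunk database, while adding streams over all chunks.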

Module Contents#

Functions#

get_empty_index_path

Path of empty index.

get_block_nload

Compute number of blocks to load.

merge_embedding_blocks

Merge individual embedding blocks into a single binary mmap file.

get_text_dataset_for_training

Convert GPT token chunk dataset to a text dataset for passing to the embedder.

embed_training_chunks

Embed DB chunks.

train_on_embeddings

Train index on embedded DB chunks.

remove_embeddings

Remove embeddings after training.

_train_index

Train index on DB chunks.

train_index

Entry point for training the index.

get_text_dataset_for_adding

Convert GPT token chunk dataset to a text dataset for passing to the embedder.

_add_to_index

Add DB chunks to index.

add_to_index

Entry point for adding to the index.

build_index

Build index.

API#

core.datasets.retro.index.build.get_empty_index_path(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → str#

Path of empty index.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

Path to the empty (trained, but without added samples) vector index.

core.datasets.retro.index.build.get_block_nload(block_path: str, load_fraction: float) → int#

Compute number of blocks to load.

This is computed by multiplying the total number of samples available in the block by the fraction of samples to load.

Parameters:
  • block_path (str) – Path to HDF5 file containing block of data. File must contain key ‘data’.

  • load_fraction (float) – Fraction (0 < load_fraction <= 1) of block samples to load.

Returns:

Number of block samples to load.
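A minimal sketch of this computation (hypothetical re-implementation; the real function reads the sample count from the `data` key of the HDF5 block file, and its exact rounding behavior is an assumption here):

```python
def get_block_nload_sketch(n_block_samples: int, load_fraction: float) -> int:
    """Number of samples to load from a block, given the load fraction.

    Hypothetical sketch: the real get_block_nload obtains n_block_samples
    from the 'data' dataset of an HDF5 file; truncation toward zero is an
    assumption about the rounding rule.
    """
    assert 0.0 < load_fraction <= 1.0, "load_fraction must be in (0, 1]"
    return int(load_fraction * n_block_samples)

nload_half = get_block_nload_sketch(100_000, 0.5)  # half of a 100k block
nload_full = get_block_nload_sketch(100_000, 1.0)  # the whole block
```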

core.datasets.retro.index.build.merge_embedding_blocks(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Merge individual embedding blocks into a single binary mmap file.

The embeddings are initially stored in block-sized HDF5 files (e.g., ~100k embeddings per block). These individual block files must be merged into a single file before training, so that the embeddings can be passed to the index as a NumPy mmap array.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
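The merge step amounts to concatenating the blocks into one flat on-disk array that can later be memory-mapped without loading everything into RAM. A self-contained sketch (the in-memory `blocks` list stands in for the HDF5 block files the real pipeline reads):

```python
import os
import tempfile

import numpy as np

# Hypothetical block contents; the real pipeline reads each block from an
# HDF5 file's 'data' dataset.
blocks = [np.full((3, 4), i, dtype="f4") for i in range(3)]
n_total = sum(len(b) for b in blocks)
dim = blocks[0].shape[1]

path = os.path.join(tempfile.mkdtemp(), "embeddings.bin")
merged = np.memmap(path, dtype="f4", mode="w+", shape=(n_total, dim))
row = 0
for block in blocks:
    merged[row:row + len(block)] = block  # append block rows to the mmap
    row += len(block)
merged.flush()

# Training can later map the merged file back lazily, page by page.
reopened = np.memmap(path, dtype="f4", mode="r", shape=(n_total, dim))
```

Because the result is a raw binary mmap, the reader must know the dtype and shape out of band; the single-file layout is what lets index training stream embeddings larger than memory.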

core.datasets.retro.index.build.get_text_dataset_for_training(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → megatron.core.datasets.retro.utils.GPTToTextDataset#

Convert GPT token chunk dataset to a text dataset for passing to the embedder.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

The text dataset consisting of tokens converted from sampled chunk database.

core.datasets.retro.index.build.embed_training_chunks(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Embed DB chunks.

Store chunks in blocks on disk. These blocks will later be merged into a single dataset for training the index.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

core.datasets.retro.index.build.train_on_embeddings(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Train index on embedded DB chunks.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

core.datasets.retro.index.build.remove_embeddings(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Remove embeddings after training.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

core.datasets.retro.index.build._train_index(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Train index on DB chunks.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

core.datasets.retro.index.build.train_index(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Entry point for training the index.

We select whether to train a new index, or validate an existing index.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
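The train-vs-validate selection can be sketched as a check for an existing trained index on disk (hypothetical logic; the real criterion lives inside `train_index` and may differ, and the `.faissindex` filename here is illustrative only):

```python
import os
import tempfile

def train_or_validate(index_path: str) -> str:
    """Choose the action for this run.

    Hypothetical sketch: train a new index when none exists on disk,
    otherwise validate the existing one.
    """
    return "validate" if os.path.exists(index_path) else "train"

tmpdir = tempfile.mkdtemp()
index_path = os.path.join(tmpdir, "empty.faissindex")  # illustrative name
action_before = train_or_validate(index_path)  # no index on disk yet
open(index_path, "wb").close()                 # simulate a trained index
action_after = train_or_validate(index_path)   # index now present
```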

core.datasets.retro.index.build.get_text_dataset_for_adding(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → megatron.core.datasets.retro.utils.GPTToTextDataset#

Convert GPT token chunk dataset to a text dataset for passing to the embedder.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

The text dataset that consists of tokens converted from the ‘train’ chunk database. These are the chunks used for retrieval by the pretraining ‘train’ dataset.

core.datasets.retro.index.build._add_to_index(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → str#

Add DB chunks to index.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

Path to the populated index.

core.datasets.retro.index.build.add_to_index(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Entry point for adding to the index.

We select whether to add to a new index, or validate an existing index.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

core.datasets.retro.index.build.build_index(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None#

Build index.

Building the index involves sequentially running the stages above:

  • Train index (on sampled training chunks).

  • Add to index (on all training chunks).

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
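The sequential structure of `build_index` can be sketched as follows (stub stages stand in for the real `train_index` and `add_to_index` entry points; the recorded call order demonstrates the invariant that training always precedes adding):

```python
calls = []

def train_index(config) -> None:
    # Stand-in for the real train_index entry point.
    calls.append("train")

def add_to_index(config) -> None:
    # Stand-in for the real add_to_index entry point.
    calls.append("add")

def build_index(config) -> None:
    # Stages run strictly in sequence: the index must be trained
    # (on sampled training chunks) before any chunks are added to it.
    train_index(config)
    add_to_index(config)

build_index(config=None)
```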