core.datasets.retro.index.index#

Base class for all vector indexes.

A vector index is a type of retrieval database that is queried using vectors, and returns vectors that are ‘similar’ (e.g., by cosine distance) to the query vector. The construction and usage of an index generally has the following pattern:

  • Train the index on representative vectors.

  • Add vectors to the index (i.e., the vectors available for retrieval).

  • Query the index with a new vector to retrieve the indices of similar vectors.
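The three steps above can be sketched with a toy, pure-Python index. This is only an illustration of the train/add/query pattern: `ToyIndex`, its methods, and the cluster-based search are hypothetical and are not the Megatron or Faiss API. The sketch assumes cosine similarity and a single probed cluster per query.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyIndex:
    """Hypothetical stand-in for a vector index (not the Megatron or Faiss API)."""

    def __init__(self, n_clusters):
        self.n_clusters = n_clusters
        self.centroids = []  # learned by train()
        self.lists = []      # one list of (id, vector) pairs per centroid

    def _nearest_centroid(self, vec):
        return max(range(len(self.centroids)), key=lambda c: cosine(vec, self.centroids[c]))

    def train(self, vectors, n_iters=5, seed=0):
        """Step 1: learn centroids from representative vectors (a few k-means passes)."""
        rng = random.Random(seed)
        self.centroids = [list(v) for v in rng.sample(vectors, self.n_clusters)]
        for _ in range(n_iters):
            groups = [[] for _ in self.centroids]
            for v in vectors:
                groups[self._nearest_centroid(v)].append(v)
            for c, group in enumerate(groups):
                if group:  # keep the old centroid if a cluster ends up empty
                    self.centroids[c] = [sum(col) / len(group) for col in zip(*group)]
        self.lists = [[] for _ in self.centroids]

    def add(self, vectors):
        """Step 2: make vectors retrievable; ids follow insertion order."""
        next_id = sum(len(lst) for lst in self.lists)
        for offset, v in enumerate(vectors):
            self.lists[self._nearest_centroid(v)].append((next_id + offset, v))

    def search(self, query, k):
        """Step 3: return ids of the k most similar vectors in the closest cluster."""
        candidates = self.lists[self._nearest_centroid(query)]
        ranked = sorted(candidates, key=lambda item: cosine(query, item[1]), reverse=True)
        return [vec_id for vec_id, _ in ranked[:k]]


train_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
index = ToyIndex(n_clusters=2)
index.train(train_vectors)                       # 1. train
index.add([[1.0, 0.0], [0.0, 1.0], [0.8, 0.3]])  # 2. add (ids 0, 1, 2)
neighbor_ids = index.search([1.0, 0.05], k=2)    # 3. query
```

Note that the query returns ids of stored vectors rather than the vectors themselves, which matches the last step of the pattern. Training is kept separate from adding so that the learned structure can be reused while the stored vectors change.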

Module Contents#

Classes#

Index

Abstract base class for indexes.

API#

class core.datasets.retro.index.index.Index#

Bases: abc.ABC

Abstract base class for indexes.

Note: Currently, only Faiss-based classes are implemented. In the future, this class will be extended with other types of indexes that have different performance-accuracy trade-offs.

The primary methods to override are:

  • train(): Train the index on the sampled training chunks.

  • add(): Add all training chunks to the index.
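As a rough illustration of the override contract, the sketch below defines a simplified abstract base class and a toy subclass. `SketchIndex` and `RecordingIndex` are hypothetical names used only for this example; the real `Index` class takes a `RetroPreprocessingConfig` and a `GPTToTextDataset`, which are replaced here by a plain dict and a list.

```python
import abc


class SketchIndex(abc.ABC):
    """Hypothetical, simplified mirror of an abstract index base class."""

    @abc.abstractmethod
    def train(self, config):
        """Train the index on sampled training chunks."""

    @abc.abstractmethod
    def add(self, config, text_dataset):
        """Add chunks to the index."""


class RecordingIndex(SketchIndex):
    """Toy subclass that only records which steps ran."""

    def __init__(self):
        self.steps = []

    def train(self, config):
        self.steps.append("train")

    def add(self, config, text_dataset):
        self.steps.append(f"add:{len(text_dataset)}")


index = RecordingIndex()
index.train(config={})
index.add(config={}, text_dataset=["chunk a", "chunk b"])

# The base class cannot be instantiated because train() and add() are abstract.
try:
    SketchIndex()
    base_is_abstract = False
except TypeError:
    base_is_abstract = True
```

A subclass that omits either method also raises `TypeError` on instantiation, so a missing override is caught before any training starts.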

classmethod make_object_verbose(index: faiss.Index, verbose: bool) None#

Make index object verbose.

Parameters:
  • index (faiss.Index) – Faiss object to set verbose.

  • verbose (bool) – Sets whether index should log status updates during training and adding.

get_empty_index_path(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) str#

Get file path to empty index (i.e., trained, but unpopulated).

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

File path to empty index (i.e., this index has had index.train() called, but not yet index.add()).

get_empty_index(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) faiss.Index#

Get empty index (i.e., trained, but unpopulated).

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

Empty Faiss index, loaded from storage.

get_added_index_path(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) str#

Get file path to index that has been populated with vectors.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

File path to added index (i.e., this index has had both index.train() and index.add() called).

get_added_index(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) faiss.Index#

Get index that has been populated with vectors.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

‘Added’ (i.e., populated) Faiss index, loaded from storage.
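The distinction between the empty index and the added index can be sketched as a two-stage save/load cycle. Everything below is an assumption made for illustration: the file names, the skip-if-exists checks, and the use of `pickle` with a dict in place of a real Faiss index. The actual paths are derived from the Retro preprocessing config.

```python
import os
import pickle
import tempfile


def empty_index_path(index_dir):
    # Hypothetical file name; the real layout comes from the Retro config.
    return os.path.join(index_dir, "empty.pkl")


def added_index_path(index_dir):
    # Hypothetical file name for the populated index.
    return os.path.join(index_dir, "added.pkl")


def train_if_needed(index_dir, train_vectors):
    """Write the trained-but-unpopulated index unless it already exists."""
    path = empty_index_path(index_dir)
    if not os.path.exists(path):
        state = {"trained_on": len(train_vectors), "vectors": []}
        with open(path, "wb") as f:
            pickle.dump(state, f)
    return path


def add_if_needed(index_dir, vectors):
    """Load the empty index, populate it, and save it under the added path."""
    path = added_index_path(index_dir)
    if not os.path.exists(path):
        with open(empty_index_path(index_dir), "rb") as f:
            state = pickle.load(f)
        state["vectors"].extend(vectors)
        with open(path, "wb") as f:
            pickle.dump(state, f)
    return path


with tempfile.TemporaryDirectory() as index_dir:
    train_if_needed(index_dir, train_vectors=[[0.0, 1.0], [1.0, 0.0]])
    add_if_needed(index_dir, vectors=[[0.5, 0.5]])
    with open(empty_index_path(index_dir), "rb") as f:
        empty_state = pickle.load(f)
    with open(added_index_path(index_dir), "rb") as f:
        added_state = pickle.load(f)
```

Keeping the empty index on disk means the populated index can be rebuilt from a trained starting point without retraining. With Faiss, the save and load steps would typically use `faiss.write_index` and `faiss.read_index` instead of `pickle`.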

abstractmethod train(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) None#

Train index on a representative set of vectors.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

abstractmethod add(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
text_dataset: megatron.core.datasets.retro.utils.GPTToTextDataset,
) None#

Add vectors to index.

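Parameters:
  • config (RetroPreprocessingConfig) – Retro preprocessing config.

  • text_dataset (GPTToTextDataset) – Text dataset that will be embedded and added to the index.

embed_text_dataset_block(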
embedder: megatron.core.datasets.retro.config.Embedder,
text_dataset: megatron.core.datasets.retro.utils.GPTToTextDataset,
_range: Tuple[int, int],
) numpy.ndarray#

Embed a range of a text dataset.

Parameters:
  • embedder (Embedder) – Embedder used for embedding a text dataset.

  • text_dataset (GPTToTextDataset) – Text dataset that will be embedded.

  • _range (Tuple[int, int]) – Start/end sample indices within text dataset used for embedding.

Returns:

An array of embeddings with one row per embedded sample, i.e., shape (number of samples in _range, dimension(embedder)).
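A minimal sketch of embedding one block of a dataset follows. `ToyEmbedder` and `embed_block` are hypothetical stand-ins, the range is assumed to be half-open (start inclusive, end exclusive), and nested lists are used in place of a `numpy.ndarray` to keep the example dependency-free.

```python
class ToyEmbedder:
    """Hypothetical embedder: maps each text to a fixed-size vector."""

    dimension = 3

    def embed_text_dataset(self, texts):
        # Toy features: character count, word count, vowel count.
        return [
            [float(len(t)), float(len(t.split())), float(sum(ch in "aeiou" for ch in t))]
            for t in texts
        ]


def embed_block(embedder, text_dataset, _range):
    """Embed only the samples in [start, end) of the dataset (assumed half-open)."""
    start, end = _range
    block = [text_dataset[i] for i in range(start, end)]
    return embedder.embed_text_dataset(block)


text_dataset = ["retro", "vector index", "faiss", "chunk db"]
embeddings = embed_block(ToyEmbedder(), text_dataset, (1, 3))
shape = (len(embeddings), len(embeddings[0]))
```

Under these assumptions, the result has one row per sample in the requested range and one column per embedding dimension. Processing the dataset block by block in this way bounds the memory needed at any one time.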