core.datasets.retro.index.indexes.faiss_par_add#
Multi-process & multi-node version of Faiss’s index.add().
This class inherits from FaissBaseIndex, and optimizes the ‘add()’ method by making it multi-node and multi-process, with bit-wise equivalence to FaissBaseIndex. This allows ‘add()’ to scale out to very large datasets, since the vast majority of the computational effort is embarrassingly parallel.
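Because each block can be encoded independently, the work can be divided among processes without coordination. The following is a minimal sketch of one possible division, assuming a simple round-robin scheme. The `assign_blocks` helper, the `"range"` key, and the block sizes are hypothetical illustrations, not part of this module's API.

```python
from typing import Dict, List


def assign_blocks(blocks: List[dict], world_size: int) -> Dict[int, List[dict]]:
    """Round-robin assignment of blocks to ranks (hypothetical scheme)."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    # Rank r takes blocks r, r + world_size, r + 2 * world_size, ...
    return {rank: blocks[rank::world_size] for rank in range(world_size)}


# Ten hypothetical blocks of 1M vectors each, shared by four ranks.
blocks = [{"range": (i * 1_000_000, (i + 1) * 1_000_000)} for i in range(10)]
assignment = assign_blocks(blocks, world_size=4)
for rank, owned in assignment.items():
    print(rank, [b["range"][0] // 1_000_000 for b in owned])
```

Every block is owned by exactly one rank, so no rank needs to wait for another while encoding.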
Module Contents#
Classes#
FaissParallelAddIndex | This class parallelizes both 1) encoding vectors and 2) adding codes to the index. It is faster than naive use of Faiss because most of the computational work is in encoding the vectors, which is an embarrassingly parallel operation.
API#
- class core.datasets.retro.index.indexes.faiss_par_add.FaissParallelAddIndex#
Bases: core.datasets.retro.index.indexes.faiss_base.FaissBaseIndex
This class parallelizes both 1) encoding vectors and 2) adding codes to the index. It is faster than naive use of Faiss because most of the computational work is in encoding the vectors, which is an embarrassingly parallel operation.
- encode_block(
    index: faiss.Index,
    embedder: megatron.core.datasets.retro.config.Embedder,
    text_dataset: megatron.core.datasets.retro.utils.GPTToTextDataset,
    block: dict,
  )
Encode sub-dataset block, to be later added to index.
Encode the data subset, generally in blocks of 1M vectors each. For each block, the empty/trained index is loaded, codes are computed via index.sa_encode(), and the resulting codes are saved to disk.
- Parameters:
index (faiss.Index) – Faiss index object.
embedder (Embedder) – Embedder used to embed text dataset.
text_dataset (GPTToTextDataset) – Text dataset to be embedded and encoded.
block (dict) – Range information specifying start/end indices within text dataset.
- Returns:
A tuple of (embeddings, encodings) for the given block subset of the text dataset.
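The `block` dictionaries describe contiguous slices of the text dataset. As a rough sketch of how such blocks might be laid out, the helper below splits a sample count into fixed-size ranges. The `build_blocks` function, the `"range"` key, and the file naming are assumptions made for illustration; only the `'path'` item is mentioned in this reference.

```python
import os
from typing import List


def build_blocks(n_samples: int, block_size: int, out_dir: str) -> List[dict]:
    """Split [0, n_samples) into consecutive blocks (hypothetical layout)."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    blocks = []
    for start in range(0, n_samples, block_size):
        end = min(start + block_size, n_samples)  # the final block may be short
        blocks.append({
            "range": (start, end),
            # Hypothetical file name; the real naming scheme is not documented here.
            "path": os.path.join(out_dir, f"{start:012d}-{end:012d}.codes"),
        })
    return blocks


example_blocks = build_blocks(n_samples=2_500_000, block_size=1_000_000, out_dir="codes")
for b in example_blocks:
    print(b["range"], b["path"])
```

With 2.5M samples and 1M-vector blocks, this yields two full blocks and one shorter trailing block.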
- save_block(
    config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
    block: dict,
    codes: numpy.ndarray,
  )
Save block of codes to disk.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
block (dict) – Range information specifying the start/end indices within the encoded text dataset. Here, the ‘path’ item is used for writing the encodings to storage.
codes (np.ndarray) – Block of encodings to be saved to storage.
- encode(
    config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
    text_dataset: megatron.core.datasets.retro.utils.GPTToTextDataset,
  )
Encode text dataset, to be later added to index.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
text_dataset (GPTToTextDataset) – Text dataset to be encoded by the index.
- add_codes(
    config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
  )
Read codes from disk, and add them to the index.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
- remove_codes(
    config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
  )
Remove the saved codes after they have been added to the index.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
- add(
    config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
    text_dataset: megatron.core.datasets.retro.utils.GPTToTextDataset,
  )
Add vectors to index.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
text_dataset (GPTToTextDataset) – Text dataset that will be embedded and added to the index.
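The method names suggest a three-stage flow: encode blocks to disk, add the saved codes to the index, then remove the saved codes. The toy class below sketches that flow under this assumption. It is a stand-in written with the standard library only, not the Megatron implementation: the `ToyParallelAddIndex` class, the modulo "quantizer", the `"range"` key, and the pickle files are all hypothetical, and a plain list takes the place of a Faiss index.

```python
import os
import pickle
import tempfile
from typing import List


class ToyParallelAddIndex:
    """Stand-in for an encode -> add_codes -> remove_codes flow (not Megatron code)."""

    def __init__(self) -> None:
        self.stored: List[int] = []  # stands in for codes held by a Faiss index

    def encode(self, blocks: List[dict], data: List[int]) -> None:
        # Blocks are independent, so this loop could be spread over workers.
        for block in blocks:
            start, end = block["range"]
            codes = [x % 256 for x in data[start:end]]  # toy "quantizer"
            with open(block["path"], "wb") as f:
                pickle.dump(codes, f)

    def add_codes(self, blocks: List[dict]) -> None:
        # Reading blocks in a fixed order reproduces the result of a serial add.
        for block in sorted(blocks, key=lambda b: b["range"][0]):
            with open(block["path"], "rb") as f:
                self.stored.extend(pickle.load(f))

    def remove_codes(self, blocks: List[dict]) -> None:
        # Delete the intermediate files once their codes are in the index.
        for block in blocks:
            os.remove(block["path"])

    def add(self, blocks: List[dict], data: List[int]) -> None:
        self.encode(blocks, data)
        self.add_codes(blocks)
        self.remove_codes(blocks)


data = list(range(1000, 1010))
with tempfile.TemporaryDirectory() as tmp:
    blocks = [
        {"range": (0, 4), "path": os.path.join(tmp, "0.pkl")},
        {"range": (4, 8), "path": os.path.join(tmp, "1.pkl")},
        {"range": (8, 10), "path": os.path.join(tmp, "2.pkl")},
    ]
    index = ToyParallelAddIndex()
    index.add(blocks, data)
    leftover = os.listdir(tmp)

print(index.stored == [x % 256 for x in data], leftover)
```

Separating encoding from adding keeps the expensive step parallel while the cheap, order-sensitive step stays sequential, which is one way to preserve equivalence with a serial add.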