core.datasets.retro.index.build#
Construct an index.
Constructing an index generally happens in two phases:
- index.train(): Train the index on a representative set of vectors.
- index.add(): Add vectors to the index, making them available for retrieval.
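For orientation, here is a minimal sketch of this two-phase pattern using FAISS directly; the index spec, dimensionality, and random data below are illustrative assumptions, not the values used by the Retro preprocessing pipeline:

```python
import numpy as np
import faiss

d = 1024                                       # assumed embedding dimensionality
index = faiss.index_factory(d, "IVF256,Flat")  # assumed index spec

# Phase 1: train the index on a representative sample of vectors.
train_vectors = np.random.rand(10_000, d).astype(np.float32)
index.train(train_vectors)

# Phase 2: add the full vector set so it becomes available for retrieval.
all_vectors = np.random.rand(100_000, d).astype(np.float32)
index.add(all_vectors)

faiss.write_index(index, "/tmp/example_index.faiss")
```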
Module Contents#
Functions#
| Function | Description |
|---|---|
| get_empty_index_path | Path of empty index. |
| get_block_nload | Compute number of blocks to load. |
| merge_embedding_blocks | Merge individual embedding blocks into a single binary mmap file. |
| get_text_dataset_for_training | Convert GPT token chunk dataset to a text dataset for passing to the embedder. |
| embed_training_chunks | Embed DB chunks. |
| train_on_embeddings | Train index on embedded DB chunks. |
| remove_embeddings | Remove embeddings after training. |
| _train_index | Train index on DB chunks. |
| train_index | Entry point for training the index. |
| get_text_dataset_for_adding | Convert GPT token chunk dataset to a text dataset for passing to the embedder. |
| _add_to_index | Add DB chunks to index. |
| add_to_index | Entry point for adding to the index. |
| build_index | Build index. |
API#
- core.datasets.retro.index.build.get_empty_index_path(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)#
Path of empty index.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
- Returns:
Path to the empty (trained, but without added samples) vector index.
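A hypothetical sketch of what such a path helper might look like; the directory layout, attribute names (retro_project_dir, retro_index_type), and file name below are assumptions for illustration only, not the documented behavior:

```python
import os

def get_empty_index_path(config) -> str:
    # Hypothetical layout: <project dir>/index/<index type>/empty.faissindex
    return os.path.join(
        config.retro_project_dir,   # assumed config attribute
        "index",
        config.retro_index_type,    # assumed config attribute
        "empty.faissindex",         # assumed file name
    )
```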
- core.datasets.retro.index.build.get_block_nload(block_path: str, load_fraction: float) int#
Compute number of blocks to load.
This is computed by multiplying the number of samples available in the block by the fraction of samples to load.
- Parameters:
block_path (str) – Path to HDF5 file containing block of data. File must contain key ‘data’.
load_fraction (float) – Fraction (0 < load_fraction <= 1) of block samples to load.
- Returns:
Number of block samples to load.
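A minimal sketch of this computation, assuming only what is stated above: the HDF5 block file stores its samples under the 'data' key, and load_fraction is in (0, 1]:

```python
import h5py

def get_block_nload(block_path: str, load_fraction: float) -> int:
    # Samples to load = (samples available in the block) * load_fraction.
    with h5py.File(block_path, "r") as f:
        return int(load_fraction * f["data"].shape[0])
```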
- core.datasets.retro.index.build.merge_embedding_blocks(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)#
Merge individual embedding blocks into a single binary mmap file.
The embeddings are initially stored in block-sized (e.g., ~100k embeddings per block) HDF5 files. These individual block files must be merged into a single file before training, so the result can be passed to the index as a NumPy mmap array.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
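A simplified sketch of such a merge step, assuming block files ending in .hdf5 with float32 embeddings stored under the 'data' key; the function name, paths, and dtype are illustrative assumptions rather than the documented signature:

```python
import glob
import h5py
import numpy as np

def merge_blocks_to_binary(block_dir: str, merged_path: str) -> None:
    """Concatenate per-block HDF5 embeddings into one raw binary file."""
    block_paths = sorted(glob.glob(f"{block_dir}/*.hdf5"))
    with open(merged_path, "wb") as out:
        for path in block_paths:
            with h5py.File(path, "r") as f:
                block = np.asarray(f["data"], dtype=np.float32)
            out.write(block.tobytes())  # append raw float32 rows

# The merged file can later be viewed lazily, without loading it into memory:
#   embeddings = np.memmap(merged_path, dtype=np.float32, mode="r").reshape(-1, d)
```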
- core.datasets.retro.index.build.get_text_dataset_for_training(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)#
Convert GPT token chunk dataset to a text dataset for passing to the embedder.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
- Returns:
The text dataset consisting of tokens converted from sampled chunk database.
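A sketch of the token-to-text conversion idea: a wrapper dataset that detokenizes each GPT chunk on access. The class name, sample layout, and tokenizer method are assumptions for illustration, not the actual Megatron classes:

```python
from torch.utils.data import Dataset

class ChunkTextDataset(Dataset):
    """Wrap a GPT token chunk dataset and yield detokenized text samples."""

    def __init__(self, chunk_dataset, tokenizer):
        self.chunk_dataset = chunk_dataset
        self.tokenizer = tokenizer

    def __len__(self) -> int:
        return len(self.chunk_dataset)

    def __getitem__(self, idx: int) -> dict:
        sample = self.chunk_dataset[idx]
        token_ids = sample["text"]                             # assumed sample layout
        return {"text": self.tokenizer.detokenize(token_ids)}  # assumed tokenizer API
```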
- core.datasets.retro.index.build.embed_training_chunks(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)#
Embed DB chunks.
Store chunks in blocks on disk. These blocks will later be merged into a single dataset for training the index.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
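A sketch of block-wise embedding under the assumptions of a generic embedder object with an embed method and HDF5 block output; all names and the block size are illustrative:

```python
import h5py

def embed_text_dataset_in_blocks(text_dataset, embedder, out_dir, block_size=100_000):
    """Embed the text dataset block by block, writing each block to an HDF5 file."""
    for start in range(0, len(text_dataset), block_size):
        end = min(start + block_size, len(text_dataset))
        texts = [text_dataset[i]["text"] for i in range(start, end)]
        embeddings = embedder.embed(texts)  # assumed embedder API; returns float32 ndarray
        with h5py.File(f"{out_dir}/{start:010d}-{end:010d}.hdf5", "w") as f:
            f.create_dataset("data", data=embeddings)
```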
- core.datasets.retro.index.build.train_on_embeddings(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)#
Train index on embedded DB chunks.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
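A sketch of training on the merged embeddings, assuming they were written as raw float32 rows by a merge step like the one above; the function name, index spec, and dimensionality are assumptions:

```python
import numpy as np
import faiss

def train_index_on_merged_embeddings(merged_path: str, d: int, empty_index_path: str) -> None:
    # Memory-map the merged embeddings rather than loading them all into RAM.
    embeddings = np.memmap(merged_path, dtype=np.float32, mode="r").reshape(-1, d)
    index = faiss.index_factory(d, "IVF4096,Flat")  # assumed index spec
    index.train(np.ascontiguousarray(embeddings))   # FAISS expects contiguous float32
    faiss.write_index(index, empty_index_path)      # "empty": trained, but no vectors added
```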
- core.datasets.retro.index.build.remove_embeddings(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)#
Remove embeddings after training.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
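A sketch of this cleanup step, assuming the per-block embedding files live in a dedicated directory that is no longer needed once the index is trained; the function name and directory argument are assumptions:

```python
import os
import shutil

def remove_training_embedding_blocks(embedding_block_dir: str) -> None:
    # The block files are only needed for training; delete them to reclaim disk space.
    if os.path.isdir(embedding_block_dir):
        shutil.rmtree(embedding_block_dir)
```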
- core.datasets.retro.index.build._train_index(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)#
Train index on DB chunks.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
- core.datasets.retro.index.build.train_index(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)#
Entry point for training the index.
We select whether to train a new index or validate an existing one.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
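A sketch of the train-or-validate selection, assuming the presence of the empty index file on disk is what distinguishes the two paths; validate_trained_index is a hypothetical helper used only for illustration:

```python
import os

def train_index(config) -> None:
    empty_index_path = get_empty_index_path(config)  # documented helper above

    if not os.path.isfile(empty_index_path):
        # No trained index on disk yet: embed sampled chunks and train a new index.
        _train_index(config)
    else:
        # A trained index already exists: validate it instead of retraining.
        validate_trained_index(config)  # hypothetical validation helper
```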
- core.datasets.retro.index.build.get_text_dataset_for_adding(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)#
Convert GPT token chunk dataset to a text dataset for passing to the embedder.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
- Returns:
The text dataset that consists of tokens converted from the ‘train’ chunk database. These are the chunks used for retrieval by the pretraining ‘train’ dataset.
- core.datasets.retro.index.build._add_to_index(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)#
Add DB chunks to index.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
- Returns:
Path to the populated index.
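A sketch of the add phase: load the trained ("empty") index, add embeddings block by block, and write the populated index to a new path. The function name, arguments, and paths are illustrative assumptions, not the documented signature:

```python
import faiss
import numpy as np

def populate_index(empty_index_path: str, added_index_path: str, embedding_blocks) -> str:
    """Add embedding blocks to the trained index; return the populated index path."""
    index = faiss.read_index(empty_index_path)
    for block in embedding_blocks:  # iterable of float32 ndarrays, one per block
        index.add(np.ascontiguousarray(block, dtype=np.float32))
    faiss.write_index(index, added_index_path)
    return added_index_path
```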
- core.datasets.retro.index.build.add_to_index(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)#
Entry point for adding to the index.
We select whether to add to a new index or validate an existing one.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
- core.datasets.retro.index.build.build_index(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)#
Build index.
Building the index involves sequentially running the stages above:
1. Train the index (on sampled training chunks).
2. Add to the index (on all training chunks).
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
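Putting the two stages together, a sketch of the top-level flow under the assumption that it simply chains the entry points documented above:

```python
def build_index(config) -> None:
    # Stage 1: train the index on the sampled training chunks.
    train_index(config)
    # Stage 2: add all training chunks to the trained index.
    add_to_index(config)
```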