core.datasets.retro.index.utils
Utilities for building an index.
Module Contents
Functions
get_index_dir | Create sub-directory for this index.
num_samples_to_block_ranges | Split a range (length num_samples) into a sequence of block ranges of size block_size.
get_training_data_root_dir | Get root directory for embeddings (blocks and merged data).
get_training_data_block_dir | Get directory of saved embedding blocks.
get_training_data_block_paths | Get paths to saved embedding blocks.
get_training_data_merged_path | Get path to merged training embeddings.
get_added_codes_dir | Get directory of saved encodings.
get_added_code_paths | Get paths to all saved encodings.
API
- core.datasets.retro.index.utils.get_index_dir(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)
Create sub-directory for this index.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
- Returns:
Path to index sub-directory within Retro project.
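The "create the sub-directory, then return its path" pattern described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper `make_index_dir`, its `project_dir` and `index_name` arguments, and the `index/<name>` layout are all hypothetical, since the page does not state how the path is derived from the config.

```python
import os
import tempfile


def make_index_dir(project_dir: str, index_name: str) -> str:
    """Create (if needed) and return a sub-directory for one index.

    Hypothetical helper: the directory layout is an assumption for illustration.
    """
    index_dir = os.path.join(project_dir, "index", index_name)
    # exist_ok=True makes the call safe to repeat across preprocessing runs.
    os.makedirs(index_dir, exist_ok=True)
    return index_dir


with tempfile.TemporaryDirectory() as project_dir:
    path = make_index_dir(project_dir, "example_index")
    print(os.path.isdir(path))  # True
```

Creating the directory inside the getter means callers can write into the returned path without a separate existence check.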
- core.datasets.retro.index.utils.num_samples_to_block_ranges(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig, num_samples: int)
Split a range (length num_samples) into a sequence of block ranges of size block_size.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
num_samples (int) – Split num_samples into consecutive block ranges, where each block is of size config.retro_block_size.
- Returns:
A list of tuples where each item is the (start, end) index for a given block.
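The splitting described above can be sketched in a few lines. This is an illustrative stand-in rather than the library's code: the function name `split_into_block_ranges` is hypothetical, it takes `block_size` directly instead of reading `config.retro_block_size`, and it assumes that `end` is exclusive and that the final block is clipped to `num_samples` when the total is not a multiple of the block size.

```python
from typing import List, Tuple


def split_into_block_ranges(num_samples: int, block_size: int) -> List[Tuple[int, int]]:
    """Split [0, num_samples) into consecutive (start, end) block ranges.

    Hypothetical sketch: `end` is treated as exclusive and the last block
    is clipped to num_samples (both are assumptions).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    starts = range(0, num_samples, block_size)
    return [(start, min(start + block_size, num_samples)) for start in starts]


ranges = split_into_block_ranges(num_samples=10, block_size=4)
print(ranges)  # [(0, 4), (4, 8), (8, 10)]
```

Under these assumptions every sample index falls in exactly one block, so blocks can be processed and saved independently.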
- core.datasets.retro.index.utils.get_training_data_root_dir(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)
Get root directory for embeddings (blocks and merged data).
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
- Returns:
Path to the training data directory, which contains both training embedding blocks and the final merged training embeddings.
- core.datasets.retro.index.utils.get_training_data_block_dir(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)
Get directory of saved embedding blocks.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
- Returns:
Path to the directory containing the training embedding blocks, which are later merged into a single embedding array.
- core.datasets.retro.index.utils.get_training_data_block_paths(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)
Get paths to saved embedding blocks.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
- Returns:
Paths of all training embedding blocks.
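One plausible way to collect block paths from a block directory is a sorted glob, sketched below. Everything here is an assumption made for illustration: the helper name `list_block_paths`, the `*.hdf5` pattern, and the file names in the usage example are hypothetical, and the page does not say how the library names or orders its block files.

```python
import glob
import os
import tempfile
from typing import List


def list_block_paths(block_dir: str, pattern: str = "*.hdf5") -> List[str]:
    """Return all block files in block_dir matching pattern, in sorted order.

    Hypothetical helper: the file pattern is an assumption.
    """
    # glob gives no ordering guarantee, so sort for a deterministic result.
    return sorted(glob.glob(os.path.join(block_dir, pattern)))


with tempfile.TemporaryDirectory() as block_dir:
    # Create files out of order, plus one file that should not match.
    for name in ["0004-0008.hdf5", "0000-0004.hdf5", "notes.txt"]:
        open(os.path.join(block_dir, name), "w").close()
    names = [os.path.basename(p) for p in list_block_paths(block_dir)]
    print(names)  # ['0000-0004.hdf5', '0004-0008.hdf5']
```

Sorting matters if blocks are later merged into one array, because the merged order then follows the block order.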
- core.datasets.retro.index.utils.get_training_data_merged_path(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)
Get path to merged training embeddings.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
- Returns:
Path to the merged training embedding binary file.
- core.datasets.retro.index.utils.get_added_codes_dir(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)
Get directory of saved encodings.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
- Returns:
Path to the directory containing the vector encodings for adding to the index.
- core.datasets.retro.index.utils.get_added_code_paths(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)
Get paths to all saved encodings.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
- Returns:
Paths of all vector encoding blocks, for adding to the index.