core.datasets.retro.index.utils

Utilities for building an index.

Module Contents

Functions

get_index_dir

Create sub-directory for this index.

num_samples_to_block_ranges

Split a range of length num_samples into a sequence of consecutive block ranges, each of size config.retro_block_size.

get_training_data_root_dir

Get root directory for embeddings (blocks and merged data).

get_training_data_block_dir

Get directory of saved embedding blocks.

get_training_data_block_paths

Get paths to saved embedding blocks.

get_training_data_merged_path

Get path to merged training embeddings.

get_added_codes_dir

Get directory of saved encodings.

get_added_code_paths

Get paths to all saved encodings.

API

core.datasets.retro.index.utils.get_index_dir(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → str

Create sub-directory for this index.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

Path to index sub-directory within Retro project.
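As a rough illustration of the "create sub-directory" behavior, the sketch below builds a path under a project directory and creates it if it is missing. The helper name `make_index_dir`, the `index` folder name, and the `index_type` / `index_str` arguments are assumptions made for this example; they are not confirmed details of the library's directory layout.

```python
import os
import tempfile


def make_index_dir(project_dir: str, index_type: str, index_str: str) -> str:
    """Hypothetical helper: build and create an index sub-directory."""
    # Assumed layout (illustrative only): <project_dir>/index/<index_type>/<index_str>
    path = os.path.join(project_dir, "index", index_type, index_str)
    # exist_ok=True makes repeated calls safe if the directory already exists.
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as project_dir:
    index_dir = make_index_dir(project_dir, "demo-type", "demo-str")
    print(os.path.isdir(index_dir))
```

Creating the directory inside the getter means callers can write into the returned path without a separate existence check.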

core.datasets.retro.index.utils.num_samples_to_block_ranges(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
num_samples: int,
) → List[Tuple[int, int]]

Split a range of length num_samples into a sequence of consecutive block ranges, each of size config.retro_block_size.

Parameters:
  • config (RetroPreprocessingConfig) – Retro preprocessing config.

  • num_samples (int) – Total number of samples to split into consecutive block ranges, where each block has size config.retro_block_size.

Returns:

A list of tuples, where each tuple holds the (start, end) indices of one block.
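The splitting described above can be sketched in a few lines. This is a minimal standalone version, not the library's code: the name `block_ranges` is hypothetical, it takes the block size directly instead of reading `config.retro_block_size`, and it assumes that `end` is exclusive and that the final block is clipped to `num_samples` when the length is not a multiple of the block size.

```python
from typing import List, Tuple


def block_ranges(num_samples: int, block_size: int) -> List[Tuple[int, int]]:
    """Split [0, num_samples) into consecutive (start, end) ranges."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    starts = range(0, num_samples, block_size)
    # Assumption: end is exclusive, and the last block is clipped to num_samples.
    return [(start, min(start + block_size, num_samples)) for start in starts]


print(block_ranges(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

With half-open ranges, each block covers `end - start` samples and adjacent blocks share a boundary without overlapping.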

core.datasets.retro.index.utils.get_training_data_root_dir(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → str

Get root directory for embeddings (blocks and merged data).

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

Path to the training data directory, which contains both training embedding blocks and the final merged training embeddings.
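The relationship between the root directory, the block directory, and the merged file can be pictured with the sketch below. All names here (`training_data_layout`, `blocks`, `train.bin`, and the example root path) are placeholders chosen for illustration; the actual folder and file names used by the library are not stated on this page.

```python
import os


def training_data_layout(root_dir: str) -> dict:
    """Hypothetical layout: one root holding the block directory and the merged file."""
    return {
        "root": root_dir,
        # Assumed folder name for the per-block embedding files.
        "blocks": os.path.join(root_dir, "blocks"),
        # Assumed file name for the merged embeddings.
        "merged": os.path.join(root_dir, "train.bin"),
    }


layout = training_data_layout(os.path.join("retro_project", "index", "train_emb"))
for name, path in layout.items():
    print(f"{name}: {path}")
```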

core.datasets.retro.index.utils.get_training_data_block_dir(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → str

Get directory of saved embedding blocks.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

Path to the directory containing the training embedding blocks, which are later merged into a single embedding array.

core.datasets.retro.index.utils.get_training_data_block_paths(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → List[str]

Get paths to saved embedding blocks.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

Paths of all training embedding blocks.
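One common way to collect block files is a sorted glob over the block directory, sketched below. The helper name `list_block_paths`, the `*.hdf5` pattern, and the example file names are assumptions for this illustration; the page does not state the file extension or whether the returned list is sorted.

```python
import glob
import os
import tempfile
from typing import List


def list_block_paths(block_dir: str, pattern: str = "*.hdf5") -> List[str]:
    """Hypothetical helper: return matching block files in a stable, sorted order."""
    return sorted(glob.glob(os.path.join(block_dir, pattern)))


with tempfile.TemporaryDirectory() as block_dir:
    # Create two placeholder block files and one unrelated file.
    for name in ("0000-0100.hdf5", "0100-0200.hdf5", "notes.txt"):
        open(os.path.join(block_dir, name), "w").close()
    print([os.path.basename(p) for p in list_block_paths(block_dir)])
```

Sorting matters because `glob.glob` returns matches in arbitrary order, and a later merge step typically needs the blocks in a reproducible sequence.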

core.datasets.retro.index.utils.get_training_data_merged_path(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → str

Get path to merged training embeddings.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

Path to the merged training embedding binary file.

core.datasets.retro.index.utils.get_added_codes_dir(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → str

Get directory of saved encodings.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

Path to the directory containing the vector encodings for adding to the index.

core.datasets.retro.index.utils.get_added_code_paths(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → List[str]

Get paths to all saved encodings.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

Returns:

Paths of all vector encoding blocks, for adding to the index.