core.datasets.retro.index.validate

Validate an index’s data.

This module checks that index data stays bitwise identical across code changes. The training and adding steps of index construction can be validated separately. The following high-level checks are supported:

  • Training: Validate that the saved training embeddings are bitwise equal to a sample set of freshly computed embeddings. (Note: --no-retro-index-delete-training-embeddings must be used.)

  • Adding: Validate that the saved encodings are bitwise equal to a sample set of freshly computed encodings. (Note: --no-retro-index-delete-added-codes must be used.)
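To make "bitwise equal" concrete, the sketch below contrasts an exact byte-level comparison with a tolerance-based one. It uses only NumPy; the helper name `bitwise_equal` and the array values are made up for illustration, and this is not the module's own comparison code.

```python
import numpy as np


def bitwise_equal(a: np.ndarray, b: np.ndarray) -> bool:
    """Return True only if dtype, shape, and raw bytes all match (hypothetical helper)."""
    return a.dtype == b.dtype and a.shape == b.shape and a.tobytes() == b.tobytes()


saved = np.array([0.1, 0.2, 0.3], dtype=np.float32)

# An identical recomputation passes the bitwise check.
fresh_same = saved.copy()

# Move one value to the next representable float32, imitating tiny numerical drift.
fresh_drift = saved.copy()
fresh_drift[0] = np.nextafter(fresh_drift[0], np.float32(1.0))

print(bitwise_equal(saved, fresh_same))   # True
print(bitwise_equal(saved, fresh_drift))  # False
print(np.allclose(saved, fresh_drift))    # True: a tolerance check would miss the drift
```

A tolerance-based comparison would accept the drifted array, which is why a bitwise check is the stricter way to detect that a code change altered the computed values.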

Module Contents

Functions

validate_training_embeddings

Validate training embeddings.

validate_added_encodings

Validate added encodings.

validate_index

Validate index.

API

core.datasets.retro.index.validate.validate_training_embeddings(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None

Validate training embeddings.

Steps:

  • Randomly sample a subset of text dataset blocks.

  • Embed each block.

  • Compare against saved embeddings.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
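The sample, embed, and compare steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `embed_block` is a hypothetical deterministic stand-in for the real embedder, the "saved" embeddings live in an in-memory list rather than on disk, and `validate_sampled_blocks` is an invented name.

```python
import numpy as np


def embed_block(block: np.ndarray) -> np.ndarray:
    """Hypothetical deterministic embedder standing in for the real model."""
    return block.astype(np.float32) * np.float32(0.5) + np.float32(1.0)


def validate_sampled_blocks(blocks, saved_embeddings, n_samples, seed=0):
    """Re-embed a random subset of blocks and require exact equality with the saved copy."""
    rng = np.random.default_rng(seed)
    n = min(n_samples, len(blocks))
    # Step 1: randomly sample a subset of block indices without repetition.
    sample_ids = sorted(rng.choice(len(blocks), size=n, replace=False).tolist())
    for block_id in sample_ids:
        # Step 2: embed the sampled block again.
        fresh = embed_block(blocks[block_id])
        # Step 3: compare with the saved embedding; np.array_equal applies no tolerance.
        if not np.array_equal(fresh, saved_embeddings[block_id]):
            raise ValueError(f"block {block_id}: embeddings differ from saved copy")
    return sample_ids


# Toy data: ten small "blocks" and their previously saved embeddings.
blocks = [np.arange(i, i + 4) for i in range(10)]
saved = [embed_block(b) for b in blocks]

checked = validate_sampled_blocks(blocks, saved, n_samples=3)
print(f"validated blocks: {checked}")
```

Checking only a sample keeps validation cheap relative to recomputing every block, at the cost of possibly missing a mismatch in an unsampled block.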

core.datasets.retro.index.validate.validate_added_encodings(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None

Validate added encodings.

Steps:

  • Randomly sample a subset of text dataset blocks.

  • Encode each block.

  • Compare against saved encodings.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.

core.datasets.retro.index.validate.validate_index(
config: megatron.core.datasets.retro.config.RetroPreprocessingConfig,
) → None

Validate index.

Validating the index involves running the stages above in sequence:

  • Validate trained index.

  • Validate filled index.

Parameters:

config (RetroPreprocessingConfig) – Retro preprocessing config.
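Running the two stages in sequence can be pictured with the small sketch below. The stage functions here are hypothetical no-op stand-ins, `run_stages` is an invented helper, and the stop-on-first-failure behavior is an assumption about how a sequential validator would naturally behave, not something this page states.

```python
from typing import Callable, List


def run_stages(stages: List[Callable[[], None]]) -> List[str]:
    """Run validation stages in order; an exception in one stage skips the rest."""
    completed = []
    for stage in stages:
        stage()  # a failing stage raises and stops the sequence here
        completed.append(stage.__name__)
    return completed


def check_trained_index() -> None:
    """Hypothetical stand-in for the training-embeddings check."""


def check_filled_index() -> None:
    """Hypothetical stand-in for the added-encodings check."""


completed = run_stages([check_trained_index, check_filled_index])
print(completed)  # ['check_trained_index', 'check_filled_index']
```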