core.datasets.retro.index.validate#
Validate an index’s data.
This module contains functionality for checking for bitwise equality across code changes. The training and adding steps of index construction can be validated separately. The following high-level checks are supported:
- Training: Validate that the saved training embeddings are bitwise equal to a sample set of freshly computed embeddings. (Note: --no-retro-index-delete-training-embeddings must be used.)
- Adding: Validate that the saved encodings are bitwise equal to a sample set of freshly computed encodings. (Note: --no-retro-index-delete-added-codes must be used.)
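To illustrate what "bitwise equal" means here as opposed to "numerically close", the following is a minimal sketch using NumPy. The helper `bitwise_equal` and the sample arrays are hypothetical and are not part of this module; the module's own comparison may be implemented differently.

```python
import numpy as np


def bitwise_equal(a: np.ndarray, b: np.ndarray) -> bool:
    """Return True only when dtype, shape, and raw bytes all match (hypothetical helper)."""
    return a.dtype == b.dtype and a.shape == b.shape and a.tobytes() == b.tobytes()


saved = np.array([0.1, 0.2, 0.3], dtype=np.float32)
fresh = saved.copy()
# Move every value by one unit in the last place to mimic tiny numerical drift.
drifted = np.nextafter(saved, np.float32(1.0))

print(bitwise_equal(saved, fresh))    # identical bytes
print(bitwise_equal(saved, drifted))  # differs in the last bit of each value
print(np.allclose(saved, drifted))    # a tolerance-based check hides the drift
```

A tolerance-based comparison would accept the drifted array, whereas the byte-level comparison rejects it. This strictness is what makes the check useful for detecting unintended changes across code revisions.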
Module Contents#
Functions#
validate_training_embeddings | Validate training embeddings.
validate_added_encodings | Validate added encodings.
validate_index | Validate index.
API#
- core.datasets.retro.index.validate.validate_training_embeddings(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)
Validate training embeddings.
Steps:
- Randomly sample a subset of the text dataset blocks.
- Embed each sampled block.
- Compare the fresh embeddings against the saved embeddings.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
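The sample, embed, and compare steps can be sketched as below. This is an illustration under stated assumptions, not the module's implementation: `embed_block` is a hypothetical deterministic stand-in for the real embedder, and `saved` is a hypothetical in-memory mapping from block id to saved embedding (the real function reads its inputs from the locations described by `config`).

```python
import numpy as np


def embed_block(block_id: int, dim: int = 8) -> np.ndarray:
    # Hypothetical stand-in for the real embedder: deterministic per block id.
    rng = np.random.default_rng(block_id)
    return rng.standard_normal(dim).astype(np.float32)


def find_mismatched_blocks(saved: dict, n_samples: int, seed: int = 0) -> list:
    """Re-embed a random sample of blocks and return the ids whose bytes differ."""
    rng = np.random.default_rng(seed)
    block_ids = sorted(saved)
    # Step 1: randomly sample a subset of blocks (without replacement).
    sample = rng.choice(block_ids, size=min(n_samples, len(block_ids)), replace=False)
    mismatched = []
    for block_id in sample:
        block_id = int(block_id)
        # Step 2: embed the block again.
        fresh = embed_block(block_id)
        # Step 3: compare byte-for-byte against the saved embedding.
        if fresh.tobytes() != saved[block_id].tobytes():
            mismatched.append(block_id)
    return sorted(mismatched)


saved = {i: embed_block(i) for i in range(10)}
print(find_mismatched_blocks(saved, n_samples=4))
```

Sampling keeps the check cheap compared with re-embedding the whole dataset, at the cost of only detecting mismatches in the sampled blocks.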
- core.datasets.retro.index.validate.validate_added_encodings(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)
Validate added encodings.
Steps:
- Randomly sample a subset of the text dataset blocks.
- Encode each sampled block.
- Compare the fresh encodings against the saved encodings.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
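Encodings differ from embeddings in that they are compact integer codes produced by the index. As a rough illustration only, the sketch below uses a toy nearest-centroid quantizer in place of the real index encoder; `encode_block`, `codes_match`, and the sample data are all hypothetical names introduced for this example.

```python
import numpy as np


def encode_block(vectors: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    # Toy stand-in for the index encoder: each vector becomes the uint8 id
    # of its nearest centroid (squared Euclidean distance).
    dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).astype(np.uint8)


def codes_match(saved_codes: np.ndarray, fresh_codes: np.ndarray) -> bool:
    """Integer codes must agree exactly in dtype, shape, and value."""
    return (
        saved_codes.dtype == fresh_codes.dtype
        and saved_codes.shape == fresh_codes.shape
        and bool(np.array_equal(saved_codes, fresh_codes))
    )


centroids = np.array([[0.0, 0.0], [10.0, 10.0]], dtype=np.float32)
vectors = np.array([[1.0, 1.0], [9.0, 9.0], [0.0, 2.0]], dtype=np.float32)

saved_codes = encode_block(vectors, centroids)
fresh_codes = encode_block(vectors, centroids)
print(codes_match(saved_codes, fresh_codes))
```

Because the codes are integers, exact equality is the natural comparison: any change in the trained index or the encoder that alters even one code shows up as a mismatch.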
- core.datasets.retro.index.validate.validate_index(config: megatron.core.datasets.retro.config.RetroPreprocessingConfig)
Validate index.
Validating the index involves sequentially running the stages above:
- Validate the trained index.
- Validate the filled index.
- Parameters:
config (RetroPreprocessingConfig) – Retro preprocessing config.
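The sequential structure can be sketched as a small runner that calls each stage in order and lets a failing stage stop the run. This is an assumption-laden illustration: `run_validation_stages`, `check_trained_index`, and `check_filled_index` are hypothetical placeholders, not functions from this module, and the real stages take a `RetroPreprocessingConfig` rather than a plain dictionary.

```python
from typing import Callable, List, Sequence


def run_validation_stages(
    config: object, stages: Sequence[Callable[[object], None]]
) -> List[str]:
    """Run each stage in order; an exception from a stage aborts the rest."""
    completed: List[str] = []
    for stage in stages:
        stage(config)  # a failing stage raises, so later stages never run
        completed.append(stage.__name__)
    return completed


def check_trained_index(config: object) -> None:
    # Placeholder for the training-embedding validation stage.
    pass


def check_filled_index(config: object) -> None:
    # Placeholder for the added-encoding validation stage.
    pass


print(run_validation_stages({}, [check_trained_index, check_filled_index]))
```

Running the training check first means a mismatch in the trained index is reported before any time is spent validating the encodings that depend on it.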