core.datasets.retro.config.config#
Retro preprocessing config.
Module Contents#
Classes#
RetroPreprocessingConfig | Configuration object for Retro preprocessing.
API#
- class core.datasets.retro.config.config.RetroPreprocessingConfig#
Bases: megatron.core.transformer.TransformerConfig

Configuration object for Retro preprocessing.
Note: Arguments prefixed with '--retro-gpt-' or '--retro-bert-' are included and named as such to make it easier to manage both models running at the same time. Megatron is not optimized to run two models at once, so this naming convention keeps the two sets of arguments clearly separated.
- Parameters:
retro_project_dir (str) – Retro project directory, which contains the preprocessed data for pretraining. This directory is built during preprocessing (see tools/retro/README.md), and contains subdirectories for the chunk database and pretraining neighbors.
retro_tasks (str) – Comma-separated list of tasks to run. Run the entire preprocessing pipeline by using '--retro-tasks build'. Alternatively, run individual stages with the tasks (in this order) 'db-build', 'index-build', and 'query-pretraining-neighbors'. For example, '--retro-tasks db-build,index-build,query-pretraining-neighbors' is equivalent to '--retro-tasks build'; the argument may also contain a subset of these tasks. Stages must always be run in the order listed above (see the sketch after this parameter list).
retro_task_validate (float) – If defined, validate a randomly sampled subset of the existing results of the given task. Each task implements a 'validate' method that is responsible for sampling a fraction of the existing results, and then checking for bitwise equality with the current code base. (E.g., --retro-task-validate 0.01.)
retro_block_size (int) – Number of chunks to process at a time when generating Bert embeddings and querying the search index. Partial results for each block are generally saved to disk in separate files.
retro_doc_block_size (int) – Number of documents to process at a time when processing token datasets into chunk databases. The partial chunk database for each block is saved into a separate file.
retro_gpt_seed (int) – Random seed used for python, numpy, pytorch, and cuda.
retro_gpt_data_path (str) – Path to the training dataset. Accepted formats: 1) a single data path, or 2) multiple weighted datasets in the form: dataset1-weight dataset1-path dataset2-weight dataset2-path ... It is used with --split when a single dataset is used for all three of train, valid, and test. It is exclusive to the other --*-data-path args.
retro_gpt_data_cache_path (str) – Path to a directory to hold cached index files.
retro_gpt_split (str) – Comma-separated list of proportions for the training, validation, and test split. For example, the split '90,5,5' will use 90% of the data for training, 5% for validation, and 5% for test.
retro_gpt_train_samples (int) – Total number of samples to train on, over all training runs.
retro_gpt_eval_interval (int) – GPT evaluation interval.
retro_gpt_eval_iters (int) – GPT evaluation iterations.
retro_gpt_tokenizer_type (str) – GPT tokenizer type.
retro_gpt_tokenizer_model (str) – GPT tokenizer model file.
retro_gpt_vocab_file (str) – GPT vocab file.
retro_gpt_merge_file (str) – GPT merge file.
retro_gpt_seq_length (int) – GPT sequence length.
retro_gpt_global_batch_size (int) – GPT global batch size.
retro_gpt_chunk_length (int) – GPT chunk length.
retro_bert_tokenizer_type (str) – Bert tokenizer type (for when using '--bert-embedder-type megatron').
retro_bert_vocab_file (str) – Bert vocab file.
retro_bert_batch_size (int) – Micro-batch size for processing Bert embeddings.
retro_bert_max_chunk_length (int) – Maximum sequence length for Bert embeddings. (Named ‘chunk’ here in reference to these Bert sequences being converted from GPT chunks.)
retro_index_type (str) – A ‘faiss-base’ index is a simple, un-optimized wrapper around a Faiss index. A ‘faiss-par-add’ index optimizes the ‘add()’ method by making it multi-node and multi-process, but with bit-wise equivalent results.
retro_index_str (str) – Index string used for calling faiss.index_factory(). For example, ‘IVF262144_HNSW32,Flat’ or ‘OPQ32_256,IVF4194304_HNSW32,PQ32’.
retro_index_ntrain (int) – Number of database chunks to use for training the index. This value must be less than or equal to the total number of chunks in the database.
retro_index_train_load_fraction (float) – Fraction of sampled chunks to use for training the index. Useful when our total sampled embeddings use too much memory; lowering the load fraction is less costly than re-embedding a new sampled dataset from scratch.
retro_index_add_load_fraction (float) – Fraction of database chunks to use for adding to the index. Useful when our total index size would use too much memory; lowering the load fraction is less costly than re-designing our token datasets.
retro_index_delete_training_embeddings (bool) – Delete training embeddings for the search index. Useful for debugging.
retro_index_delete_added_codes (bool) – Delete added codes for the search index. Useful for debugging.
retro_query_ef_search (int) – Index ef-search parameter for Hierarchical Navigable Small Worlds (HNSW) during querying.
retro_query_nprobe (int) – Index nprobe parameter for Inverted File (IVF) during querying.
retro_query_num_neighbors_query (int) – Number of neighbors to retrieve when calling index.search().
retro_query_num_neighbors_save (int) – Number of neighbors to save to disk after querying the index. If the index returns more neighbors than this target value, the list is truncated; if it returns fewer, the list is padded with -1's.
retro_bert_embedders (RetroBertEmbedders) – Set of Bert embedders used for embedding chunks. Contains entries: 1) ‘mem’ for an in-memory embedder, and 2) ‘disk’ for an embedder that saves results in blocks to disk.
retro_gpt_chunk_datasets (RetroGPTChunkDatasets) – GPT datasets for ‘train’, ‘valid’, and ‘test’.
retro_tokenizers (RetroTokenizers) – GPT (‘gpt’) and Bert (‘bert’) tokenizers.
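Putting the pieces together, here is a minimal sketch of constructing this config. This is illustrative only: the import path follows this page, the three base TransformerConfig fields (num_layers, hidden_size, num_attention_heads) are assumed to be required by the base class, and all paths and values below are hypothetical, not recommendations.

```python
# A minimal sketch, not a definitive recipe: import path per this page,
# base TransformerConfig fields assumed required, all values hypothetical.
from core.datasets.retro.config.config import RetroPreprocessingConfig

config = RetroPreprocessingConfig(
    # Base TransformerConfig fields (assumed required).
    num_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    # Retro preprocessing fields documented above.
    retro_project_dir='/path/to/retro_project',  # hypothetical path
    retro_tasks='db-build,index-build,query-pretraining-neighbors',  # same as 'build'
    # Weighted-dataset form: weight path weight path ... (hypothetical paths).
    retro_gpt_data_path='0.5 /data/wiki_text_document 0.5 /data/books_text_document',
    retro_gpt_split='90,5,5',
    retro_gpt_seq_length=2048,
    retro_gpt_chunk_length=64,
    retro_index_str='OPQ32_256,IVF4194304_HNSW32,PQ32',
    retro_index_ntrain=1_000_000,
)
# __post_init__() runs automatically on construction and validates the config.
```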
- retro_project_dir: str#
None
- retro_tasks: str#
'build'
- retro_task_validate: float#
None
- retro_block_size: int#
100000
- retro_doc_block_size: int#
100000
- retro_gpt_seed: int#
1234
- retro_gpt_data_path: list#
None
- retro_gpt_data_cache_path: str#
None
- retro_gpt_split: str#
'969,30,1'
- retro_gpt_train_samples: int#
None
- retro_gpt_eval_interval: int#
None
- retro_gpt_eval_iters: int#
None
- retro_gpt_tokenizer_type: str#
None
- retro_gpt_tokenizer_model: str#
None
- retro_gpt_vocab_file: str#
None
- retro_gpt_merge_file: str#
None
- retro_gpt_seq_length: int#
None
- retro_gpt_global_batch_size: int#
None
- retro_gpt_chunk_length: int#
64
- retro_bert_tokenizer_type: str#
None
- retro_bert_vocab_file: str#
None
- retro_bert_batch_size: int#
128
- retro_bert_max_chunk_length: int#
256
- retro_index_type: str#
'faiss-par-add'
- retro_index_str: str#
None
- retro_index_ntrain: int#
None
- retro_index_train_load_fraction: float#
1.0
- retro_index_add_load_fraction: float#
1.0
- retro_index_delete_training_embeddings: bool#
True
- retro_index_delete_added_codes: bool#
True
- retro_query_ef_search: int#
256
- retro_query_nprobe: int#
65536
- retro_query_num_neighbors_query: int#
200
- retro_query_num_neighbors_save: int#
20
- retro_bert_embedders: core.datasets.retro.config.bert_embedders.RetroBertEmbedders#
None
- retro_gpt_chunk_datasets: core.datasets.retro.config.gpt_chunk_datasets.RetroGPTChunkDatasets#
None
- retro_tokenizers: core.datasets.retro.config.tokenizers.RetroTokenizers#
None
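The index- and query-related fields above correspond to standard Faiss concepts. As a hedged illustration (the embedding dimension is an assumption, and Retro's actual wiring may differ), the defaults map onto Faiss roughly as follows:

```python
# Illustrative mapping of the config defaults onto standard Faiss calls;
# d is an assumption, and this is not Retro's actual implementation.
import faiss

d = 1024  # embedding dimension (assumption; depends on the Bert embedder)

# retro_index_str is passed to faiss.index_factory(); this example string
# comes from the parameter list above.
index = faiss.index_factory(d, 'IVF262144_HNSW32,Flat')

# Query-time parameters.
ivf = faiss.extract_index_ivf(index)
ivf.nprobe = 65536  # retro_query_nprobe (IVF probe count)
quantizer = faiss.downcast_index(ivf.quantizer)
quantizer.hnsw.efSearch = 256  # retro_query_ef_search (HNSW ef-search)

# retro_query_num_neighbors_query is the k passed to index.search(),
# e.g. (after train()/add()):  D, I = index.search(queries, 200)
```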
- __post_init__() → None#
Validate Retro config.