Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.
RETRO Model
The Retrieval-Enhanced Transformer (RETRO) (Borgeaud et al., 2022) is an autoregressive decoder-only language model (LM) pretrained with retrieval-augmentation. RETRO features practical scalability to support large-scale pretraining from scratch by retrieving from trillions of tokens. Pretraining with retrieval provides a more efficient storage mechanism of factual knowledge, when compared to storing factual knowledge implicitly within the network’s parameters. This approach significantly reduces the model’s parameter count while achieving lower perplexity than the standard GPT model. RETRO also provides the flexibility to update the knowledge stored in LMs (Wang et al., 2023a) by updating the retrieval database without training LMs again.
For the legacy native NeMo RETRO model documentation, please see NeMo RETRO Model (Legacy).
Quick Start
The following instructions demonstrate how to preprocess the data as well as train and evaluate a RETRO model.
Data Preprocessing
For detailed information on data preprocessing, refer to the Megatron-LM Github repository. This repository contains scripts and comprehensive instructions for the entire preprocessing procedure, specifically focusing on RETRO Data Preparation. The main stages of the process are summarized below.
The outcome of the preparation step yields a processed RETRO data directory, fully primed for pre-training. Specifically, this directory encompasses the following key files and subdirectories:
config.json
: contains the hyperparameters used in the data preparation step, which will then be retrieved to use in the pre-training step for consistency. For example: sample length, chunk length, data splits, tokenizer files, etc.data
: contains the original data before any preprocessing.tokenizer
: contains tokenizer files used in the preparation step.db
: contains the chunk database of processed and chunked text used for retrieving neighbors.index
: contains the Faiss index of the chunk database for retrieval.query
: contains the queried neighboring chunks for all training samples.
The data preparation process contains the following main stages:
Build Retrieval Chunk Database
This stage involves creating a database of text chunks from a corpus such as Wikipedia to be used for retrievals. The chunks are non-overlapping and extracted from the original GPT token dataset, with each chunk traditionally being 64 tokens in length. The database is stored as a 2-D array and is not a relational database.
The main output of this stage is:
/db/merged/train.hdf5
: the database containing all processed and chunked text./db/merged/sampled.hdf5
: the database containing a small portion of all chunks, only used for training the index in the next stage.
Build Index for Similarity Search
The second stage is to build a search index using Faiss, a library for efficient similarity search. The index is trained on a subset of the chunks sampled.hdf5
from the database. After training, all chunks are added to the index to enable querying. The index accepts 1-D floating point vectors, so chunks must be embedded using Bert embeddings before they can be added to the index. Particularly, the stage is comprised of two sub-stages:
- Extract BERT embeddings from the sampled chunk database (
sampled.hdf5
) and use them to train a Faiss index.- Extract BERT embeddings for each chunk in the all chunks database (
train.hdf5
) and add them to the trained Faiss index.
The main output of this stage is:
/index/<RETRO_INDEX_TYPE>/<RETRO_INDEX_STR>/added.faissindex
: the trained index, with all chunks in the database added to it
Query Pretraining Neighbors
To speed up the RETRO pretraining process, you pre-retrieve neighbors for all training samples instead of retrieving them on-the-fly. In this stage, the pretraining datasets are processed to find and save k-nearest neighbors for each chunk in each sample. The neighbors are saved to disk and labeled with unique properties to ensure they match the pretraining configuration. Query-time hyperparameters can be tuned to improve the quality of the neighbors.
The main output of this stage is:
train_<UNIQUE_HASH>
: directory containing retrieved neighbors for all training samples.valid_<UNIQUE_HASH>
: directory containing retrieved neighbors for all validating samples.
Train RETRO Model
Once the training samples, pre-retrieved neighbors, and other data are prepared, you are ready to train the RETRO model. The training process will use the output directory from the data preparation step. We set the path to this directory at the retro.retro_project_dir
argument. Many of the data hyperparameters will be retrieved from the config.json
file in this directory, including data splits, sequence length, chunk length, number of training and validating samples, tokenizer, etc.
The table below lists some of the common architecture and optimizer parameters that can be configured for model pre-training. Many of these values are set in examples/nlp/language_modeling/conf/megatron_retro_config.yaml
, which is used when training unless being overriden by the running command. Notice unlike other NeMo models, the model.data.data_prefix value is set to None, because all data information will be retrieved from model.retro.retro_project_dir.
Parameter |
Default |
Description |
---|---|---|
retro_data.retro_chunk_length |
64 |
the chunk size used to retrieve |
retro.retro_num_neighbors |
2 |
token sequence length |
retro_encoder_num_layers |
2 |
total number of encoder layers |
model.num_layers |
12 |
total number of decoder layers |
model.encoder_seq_length |
2048 |
token sequence length |
model.hidden_size |
768 |
model hidden size |
model.ffn_hidden_size |
3072 |
model FFN hidden size. Usually 4 * hidden_size |
model.num_attention_heads |
12 |
number of attention heads |
model.init_method_std |
0.023 |
standard deviation of the zero mean normal distribution used for weight initialization |
model.hidden_dropout |
0.1 |
dropout probability for hidden state transformer |
model.attention_dropout |
0.1 |
dropout probability in the attention layer |
model.ffn_dropout |
0.1 |
dropout probability in the feed-forward layer |
The following example shows a RETRO pre-training script. The rest of the argument values are retrieved from examples/nlp/language_modeling/conf/megatron_retro_config.yaml
.
python /examples/nlp/language_modeling/megatron_retro_pretraining.py \
trainer.num_nodes=1 \
trainer.devices=8 \
trainer.precision=bf16 \
trainer.accelerator=gpu \
trainer.max_steps=750000
trainer.val_check_interval=10 \
trainer.precision=16 \
exp_manager.exp_dir=/path/to/exp_dir \
model.mcore_gpt=True \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=1 \
model.megatron_amp_O2=True \
model.retro.num_layers=12 \
model.retro.retro_encoder_num_layers=2 \
model.retro.retro_num_retrieved_chunks=2 \
model.retro.retro_project_dir=/path/to/retro_workdir \
model.micro_batch_size=4 \
model.data.num_workers=4 \
model.data.data_prefix=["none"] \
model.data.shuffle_documents=False \
model.data.dataloader_type=single \
model.data.splits_string=\'98,2,0\' \
model.optim.lr=6.0e-4 \
model.optim.weight_decay=0.1 \
model.optim.sched.name=CosineAnnealing \
model.optim.sched.min_lr=6.0e-5 \
model.optim.sched.max_steps=650000 \
model.optim.name=distributed_fused_adam
During the training, we can monitor the process with Weights and Biases (WandB) by setting exp_manager.create_wandb_logger=True
and set relevant wandb arguments.
After training, the model distributed checkpoint directory can be found at the result checkpoint directory.
Run RETRO Model Inference
Once the RETRO model has been trained, you can put it into inference mode and experiment with it.
During inference, you are not limited to the indexed corpus to retrieve relevant chunks, but can directly provide any relevant contexts to the prompt through the argument neighbors
.
When performing inference, the input for RETRO differs from that used during training structurally. Specifically, the model’s input consists of only two chunks: one for the prompt and another for the answer to be generated. Unlike during training, these chunks do not necessarily have a fixed length of 64 tokens; instead, they match the length of the tokenized prompt. When context neighbors are supplied for a prompt, these neighbors correspond to the first chunk and are processed through the RETRO encoder to generate text for the second chunk.
The following example shows a RETRO inferencing script. The rest of the argument values are retrieved from examples/nlp/language_modeling/conf/megatron_retro_inference.yaml
.
python /examples/nlp/language_modeling/megatron_retro_eval.py \
checkpoint_dir=/path/to/checkpoints \
checkpoint_name=/checkpoint_name \
trainer.devices=1 \
trainer.num_nodes=1 \
trainer.accelerator=gpu \
trainer.precision=32 \
megatron_amp_O2=False \
inference.tokens_to_generate=10 \
inference.greedy=False \
inference.add_BOS=False \
inference.temperature=1.0 \
inference.retro_inference.retro_num_neighbors=2 \
prompt="sample prompt" \
neighbors=["sample neighbor 1","sample neighbor 2"]