Data Preparation

To prepare data for a RETRO LLM, refer to the Megatron-LM GitHub repository, which documents the entire preprocessing pipeline, including RETRO data preparation. The main stages of the process are summarized in this section.

After the preparation step, the processed RETRO data directory will be ready for pre-training. It will include the following key files and subdirectories:

  • config.json - stores the hyperparameters used during data preparation (for example, sample length, chunk length, data splits, and tokenizer files); these values are reloaded during pre-training to keep the two steps consistent.

  • data - contains the original data before any preprocessing.

  • tokenizer - contains tokenizer files used in the preparation step.

  • db - contains the chunk database of processed and chunked text used for retrieving neighbors.

  • index - contains the Faiss index for the chunk database used in retrieval.

  • query - contains the queried neighboring chunks for all training samples.
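Because pre-training must reuse the exact hyperparameters recorded during preparation, a typical first step is to read config.json back from the RETRO data directory. The sketch below shows this pattern; the key name used in the example is illustrative, not the exact schema written by Megatron-LM.

```python
import json
from pathlib import Path

def load_retro_config(retro_workdir: str) -> dict:
    """Load the data-preparation hyperparameters saved in config.json.

    Reusing these values at pre-training time keeps sample length,
    chunk length, and tokenizer settings consistent across stages.
    """
    config_path = Path(retro_workdir) / "config.json"
    with open(config_path) as f:
        return json.load(f)

# Hypothetical usage -- the key name below is an assumption for
# illustration, not the guaranteed schema:
# config = load_retro_config("/path/to/retro_workdir")
# chunk_length = config["retro_gpt_chunk_length"]  # e.g. 64
```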

Summary of Main Stages

This section describes the main stages in the data preparation process.

Build Retrieval Chunk Database

In the initial stage of the process, a database of text chunks is created from a corpus (such as Wikipedia). These chunks serve as the foundation for retrieval during pre-training. Non-overlapping chunks, typically 64 tokens long, are extracted from the original GPT token dataset. The database is stored as a 2-D array rather than a relational database, a structure optimized for efficient retrieval during pre-training.

The main outputs of this stage are:

  • /db/merged/train.hdf5: the database containing all processed and chunked text.

  • /db/merged/sampled.hdf5: the database containing a small subset of the chunks, used only to train the search index in the next stage.
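The chunking step above amounts to reshaping a 1-D token stream into a 2-D array of fixed-length rows. The following minimal sketch illustrates this with NumPy; it drops trailing tokens and ignores document boundaries, both of which the real pipeline handles more carefully.

```python
import numpy as np

def build_chunk_database(token_ids: np.ndarray, chunk_length: int = 64) -> np.ndarray:
    """Split a 1-D GPT token stream into non-overlapping chunks.

    Returns a 2-D array of shape (n_chunks, chunk_length), the layout
    the chunk database uses for fast row-wise retrieval. Trailing
    tokens that do not fill a whole chunk are dropped here (a
    simplification of the actual preprocessing).
    """
    n_chunks = len(token_ids) // chunk_length
    return token_ids[: n_chunks * chunk_length].reshape(n_chunks, chunk_length)

# Example: 200 tokens yield three 64-token chunks; 8 leftovers are dropped.
chunks = build_chunk_database(np.arange(200))
print(chunks.shape)  # (3, 64)
```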

Query Pretraining Neighbors

To accelerate Retro pretraining, neighbors are pre-retrieved for all training samples instead of being retrieved on the fly. In this stage, the pretraining datasets are processed to find and save the k-nearest neighbors for each chunk of each sample. The neighbors are saved to disk in directories tagged with a unique hash of the pretraining configuration, so that a retrieved set of neighbors is only reused when the configuration matches. Query-time hyperparameters can be tuned to improve the quality of the neighbors.

The main outputs of this stage are:

  • train_<UNIQUE_HASH> - directory containing retrieved neighbors for all training samples.

  • valid_<UNIQUE_HASH> - directory containing retrieved neighbors for all validation samples.
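Conceptually, the query stage asks, for each chunk of each training sample, which database chunks have the closest embeddings. The real pipeline answers this with a Faiss index; the brute-force NumPy stand-in below illustrates the same k-nearest-neighbor operation on small data (the embedding shapes and names are illustrative).

```python
import numpy as np

def query_neighbors(db_embeddings: np.ndarray,
                    query_embeddings: np.ndarray,
                    k: int = 2) -> np.ndarray:
    """Return indices of the k nearest database chunks per query chunk.

    Brute-force stand-in for the Faiss index search used in the real
    pipeline; distances are squared Euclidean.
    """
    # Pairwise squared distances, shape (n_query, n_db).
    d2 = ((query_embeddings[:, None, :] - db_embeddings[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]

rng = np.random.default_rng(0)
db = rng.normal(size=(100, 8))    # embeddings of 100 database chunks
queries = db[[5, 17]] + 0.01      # query chunks near chunks 5 and 17
print(query_neighbors(db, queries, k=1))
```

A Faiss index returns the same kind of (n_query, k) index array, but scales to billions of chunks via approximate search.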