Data Preparation

User Guide (Latest Version)

To prepare data for the RETRO LLM, refer to the Megatron-LM GitHub repository, which documents the entire preprocessing pipeline, including RETRO data preparation. The main stages of the process are described in this section.

After the preparation step, the processed RETRO data directory will be ready for pre-training. It will include the following key files and subdirectories:

  • config.json - contains the hyperparameters used in the data preparation step (for example, sample length, chunk length, data splits, and tokenizer files), which are reloaded during pre-training for consistency.

  • data - contains the original data before any preprocessing.

  • tokenizer - contains tokenizer files used in the preparation step.

  • db - contains the chunk database of processed and chunked text used for retrieving neighbors.

  • index - contains the Faiss index for the chunk database used in retrieval.

  • query - contains the queried neighboring chunks for all training samples.
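As a minimal sketch of how config.json keeps the two steps consistent, the snippet below writes and reloads a toy configuration. The key names shown here are hypothetical placeholders; the actual keys are defined by the data preparation step.

```python
import json
from pathlib import Path

# Hypothetical config.json contents; the real key names come from the
# data preparation step and may differ.
Path("config.json").write_text(json.dumps({
    "chunk_length": 64,
    "sample_length": 2048,
    "data_splits": "98,2,0",
}))

# During pre-training, reload the same hyperparameters rather than
# re-specifying them, so the two steps cannot drift apart.
cfg = json.loads(Path("config.json").read_text())
chunk_length = cfg["chunk_length"]
```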

This section describes the main stages in the data preparation process.

Build Retrieval Chunk Database

In the initial stage of the process, a database of text chunks is created from a corpus such as Wikipedia. These chunks serve as the foundation for retrieval during pre-training. Non-overlapping chunks, each typically 64 tokens long, are extracted from the original GPT token dataset. The database is stored as a 2-D array; unlike a relational database, this structure is optimized for fast retrieval during pre-training.
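The chunking step above can be sketched as follows: split a 1-D token stream into non-overlapping 64-token chunks stacked into a 2-D array (Megatron-LM stores the result in HDF5; the token values here are placeholders):

```python
import numpy as np

CHUNK_LENGTH = 64  # tokens per chunk, as described in this guide

def build_chunk_db(token_ids: np.ndarray, chunk_length: int = CHUNK_LENGTH) -> np.ndarray:
    """Split a 1-D token stream into non-overlapping fixed-length chunks.

    Returns a 2-D array of shape (num_chunks, chunk_length). In this
    sketch, trailing tokens that do not fill a whole chunk are dropped.
    """
    num_chunks = len(token_ids) // chunk_length
    return token_ids[: num_chunks * chunk_length].reshape(num_chunks, chunk_length)

tokens = np.arange(200)  # stand-in for a tokenized document
db = build_chunk_db(tokens)
print(db.shape)          # (3, 64)
```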

The main outputs of this stage are:

  • /db/merged/train.hdf5: the database containing all processed and chunked text.

  • /db/merged/sampled.hdf5: the database containing a small portion of all chunks, only used for training the index in the next stage.

Build Index for Similarity Search

In the second stage of the process, a search index is built using Faiss, a library for efficient similarity search. The index is trained on a subset of the chunks, sampled.hdf5, from the database. After training, all chunks are added to the index so that it can be queried. The index accepts 1-D floating-point vectors, so each chunk must be embedded with BERT before it can be added. This stage comprises two sub-stages:

  • Extract BERT embeddings from the sampled chunk database (sampled.hdf5) and use them to train a Faiss index.

  • Extract BERT embeddings for each chunk in the full chunk database (train.hdf5) and add them to the trained Faiss index.

The main outputs of this stage are:

  • /index/<RETRO_INDEX_TYPE>/<RETRO_INDEX_STR>/added.faissindex - the trained index, with all chunks in the database added to it.

Query Pretraining Neighbors

To accelerate the Retro pretraining process, neighbors are pre-retrieved for all training samples instead of being retrieved on the fly. In this stage, the pretraining datasets are processed to find and save the k-nearest neighbors of each chunk in each sample. The neighbors are saved to disk and labeled with unique properties to ensure they match the pretraining configuration. Query-time hyperparameters can be tuned to improve the quality of the retrieved neighbors.
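Conceptually, this stage queries the embedding of each training chunk against the index and stores the ids of its nearest database chunks. The brute-force L2 search below is a stand-in for querying the Faiss index; the shapes and data are illustrative:

```python
import numpy as np

def query_neighbors(chunk_embs: np.ndarray, db_embs: np.ndarray, k: int = 2) -> np.ndarray:
    """Return, for each query chunk embedding, the indices of its k
    nearest database chunks by L2 distance. A brute-force stand-in for
    searching the Faiss index; the real pipeline saves these neighbor
    ids to disk for reuse during pretraining."""
    d2 = ((chunk_embs[:, None, :] - db_embs[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d2, axis=1)[:, :k]

rng = np.random.default_rng(0)
db = rng.standard_normal((100, 8)).astype("float32")      # database chunk embeddings
# Query chunks near db entries 3 and 42, so those should be retrieved first.
samples = db[[3, 42]] + 0.01 * rng.standard_normal((2, 8)).astype("float32")
neighbors = query_neighbors(samples, db, k=2)
```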

The main outputs of this stage are:

  • train_<UNIQUE_HASH> - directory containing retrieved neighbors for all training samples.

  • valid_<UNIQUE_HASH> - directory containing retrieved neighbors for all validation samples.

Last updated on May 30, 2024.