Index Corpus Data for RAG

Retrieval-augmented generation (RAG) is a technique that combines information retrieval with a set of carefully designed system prompts to provide more accurate, up-to-date, and contextually relevant responses from Large Language Models (LLMs). By incorporating data from various sources such as relational databases, unstructured document repositories, internet data streams, and media news feeds, RAG can significantly improve the quality of generative AI systems.

The initial phase of a RAG pipeline involves indexing the corpus data used for retrieving relevant information for the text generation process. This phase includes chunking documents, extracting embeddings, and storing the chunks and corresponding embeddings in a database.
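The three steps named above (chunking, embedding, storing) can be sketched in a few lines of Python. This is a minimal, standard-library-only illustration and not the script this document describes: `chunk_text`, `toy_embed`, and `build_index` are hypothetical names, the chunk size and overlap are arbitrary, and the hash-based embedding is only a stand-in for a trained embedder model.

```python
import hashlib
import math


def chunk_text(text, chunk_size=8, overlap=2):
    """Split text into overlapping chunks of at most `chunk_size` words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def toy_embed(chunk, dim=16):
    """Stand-in embedder: hash each word into a bucket, then L2-normalize.

    A real pipeline would call a trained embedding model here.
    """
    vec = [0.0] * dim
    for word in chunk.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(documents):
    """Chunk every document and store each chunk with its embedding."""
    index = []
    for doc in documents:
        for chunk in chunk_text(doc):
            index.append({"text": chunk, "embedding": toy_embed(chunk)})
    return index


corpus = [
    "Retrieval-augmented generation combines retrieval with text generation "
    "so that a language model can ground its answers in corpus data.",
]
index = build_index(corpus)
print(len(index), "chunks indexed")
```

In this sketch the "database" is an in-memory list; a production index would persist the chunks and embeddings so that the retrieval stage can load them later.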

The provided script runs the complete indexing process using the LlamaIndex library together with a pre-trained NeMo embedding model. LlamaIndex is a Python library that connects data sources to LLMs. It offers tools for chunking, embedding, managing the index, and retrieving text for generation. In this procedure, the pre-trained NeMo model serves as the embedder: it extracts an embedding from each text chunk in the corpus.

Run Indexing on a Base Model

This section provides basic instructions for running RAG indexing on a Slurm cluster.

To initiate indexing:

  1. Set the stages variable in conf/config.yaml to “rag_indexing”.

  2. Define the configuration for indexing by setting the rag_indexing variable to <embedder_model_type>/<model_size>, which selects a specific embedder config file.

    For example, setting the rag_indexing variable to bert/340m specifies the configuration file path as conf/rag_indexing/bert/340m.yaml. This path corresponds to a BERT-type embedder model with 340 million parameters.
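The mapping from the rag_indexing value to a config file can be written out as a small helper. This is only a sketch of the naming rule described in the example above; `rag_indexing_config_path` and its `conf_root` parameter are hypothetical names introduced for illustration, not part of the launcher.

```python
from pathlib import PurePosixPath


def rag_indexing_config_path(value, conf_root="conf"):
    """Map a value such as 'bert/340m' to its YAML config file path."""
    model_type, model_size = value.split("/")
    path = PurePosixPath(conf_root) / "rag_indexing" / model_type / f"{model_size}.yaml"
    return str(path)


print(rag_indexing_config_path("bert/340m"))
# prints: conf/rag_indexing/bert/340m.yaml
```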

Run Indexing on a Slurm Cluster

To run indexing on a Slurm cluster:

  1. Set the run configuration in conf/rag_indexing/bert/340m.yaml to define the job-specific configuration:

    run:
      name: ${.eval_name}_${.model_train_name}
      time_limit: "4:00:00"
      dependency: "singleton"
      nodes: 1
      ntasks_per_node: 1
      eval_name: rag_indexing
      model_train_name: rag_pipeline
      results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}
    
  2. Set the paths for the embedder checkpoint, the corpus data, and the saved index:

    indexing:
       embedder:
          model_path: /path/to/checkpoint_dir
       data:
          data_path: /path/to/corpus_data
       index_path: /path/to/saved_index
    
  3. Set the configuration for the Slurm cluster in conf/cluster/bcm.yaml:

    partition: null
    account: null
    exclusive: True
    gpus_per_task: null
    gpus_per_node: 8
    mem: 0
    job_name_prefix: 'nemo-megatron-'
    srun_args:
    - "--no-container-mount-home"
    
  4. Set the stages section of conf/config.yaml:

    stages:
      - rag_indexing
    
  5. Run the Python script:

    python3 main.py
    

    All the configurations are read from conf/config.yaml and conf/rag_indexing/bert/340m.yaml.
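The run block in step 1 uses `${...}` placeholders, so it can help to see what the job name and results directory expand to. The sketch below mimics that substitution with the standard library only. It assumes that `${.key}` refers to a sibling key in the same block and `${key}` to a top-level key, which is the convention in OmegaConf-style configs. The `resolve` helper is hypothetical, and `/path/to/results` is a placeholder for whatever `base_results_dir` is set to.

```python
import re

# Values copied from the run block in step 1.
run = {
    "name": "${.eval_name}_${.model_train_name}",
    "eval_name": "rag_indexing",
    "model_train_name": "rag_pipeline",
    "results_dir": "${base_results_dir}/${.model_train_name}/${.eval_name}",
}
# Placeholder value: base_results_dir is defined outside the run block.
top_level = {"base_results_dir": "/path/to/results"}


def resolve(value, siblings, root):
    """Replace ${.key} with a sibling value and ${key} with a top-level value."""
    def lookup(match):
        key = match.group(1)
        if key.startswith("."):
            return str(siblings[key[1:]])
        return str(root[key])
    return re.sub(r"\$\{([^}]+)\}", lookup, value)


resolved = {key: resolve(value, run, top_level) for key, value in run.items()}
print(resolved["name"])         # prints: rag_indexing_rag_pipeline
print(resolved["results_dir"])  # prints: /path/to/results/rag_pipeline/rag_indexing
```

Under these assumptions, the indexing job is named rag_indexing_rag_pipeline and its results land in the rag_pipeline/rag_indexing subdirectory of the base results directory.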