Index Corpus Data for RAG

User Guide (Latest Version)

Retrieval-augmented generation (RAG) is a technique that combines information retrieval with a set of carefully designed system prompts to provide more accurate, up-to-date, and contextually relevant responses from Large Language Models (LLMs). By incorporating data from various sources such as relational databases, unstructured document repositories, internet data streams, and media news feeds, RAG can significantly improve the quality of generative AI systems.

The initial phase of a RAG pipeline involves indexing the corpus data used for retrieving relevant information for the text generation process. This phase includes chunking documents, extracting embeddings, and storing the chunks and corresponding embeddings in a database.
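The three indexing steps can be sketched conceptually in a few lines of pure Python. This is only an illustration: the function names are hypothetical, the hash-based embedder is a toy stand-in for a trained embedding model, and the in-memory list stands in for a real vector database.

```python
import hashlib

def chunk(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping fixed-size character chunks."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(chunk_text: str, dim: int = 8) -> list[float]:
    """Toy deterministic embedder (stand-in for a trained NeMo model)."""
    digest = hashlib.sha256(chunk_text.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

# "Database": store each chunk alongside its embedding.
corpus = "some long corpus document ..." * 10
index = [{"chunk": c, "embedding": embed(c)} for c in chunk(corpus)]
```

A real pipeline replaces `embed` with the pre-trained embedder and persists the chunk/embedding pairs to a vector store instead of a Python list.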

The provided script implements this process using the LlamaIndex library in conjunction with a pre-trained NeMo embedding model. LlamaIndex is a Python library that connects data sources to LLMs, offering tools for chunking, embedding, managing the index, and retrieving text for generation. In this procedure, the pre-trained NeMo model serves as the embedder, extracting embeddings from the chunked corpus text.

This section provides basic instructions for running RAG indexing on a Slurm cluster.

To initiate indexing:

  1. Assign the stages variable in conf/config.yaml to "rag_indexing".

  2. Define the indexing configuration by setting the rag_indexing variable to <embedder_model_type>/<model_size>, which selects the corresponding embedder configuration file.

    For example, setting the rag_indexing variable to bert/340m specifies the configuration file path as conf/rag_indexing/bert/340m.yaml. This path corresponds to a BERT-type embedder model with 340 million parameters.
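The mapping from the rag_indexing value to a configuration file path can be expressed as a short sketch (the helper function is hypothetical, shown only to make the convention explicit):

```python
def embedder_config_path(rag_indexing: str) -> str:
    """Map an <embedder_model_type>/<model_size> value to its config file path."""
    model_type, model_size = rag_indexing.split("/")
    return f"conf/rag_indexing/{model_type}/{model_size}.yaml"

print(embedder_config_path("bert/340m"))  # conf/rag_indexing/bert/340m.yaml
```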

Run Indexing on a Slurm Cluster

To run indexing on a Slurm cluster:

  1. Set the run configuration in conf/rag_indexing/bert/340m.yaml to define the job-specific configuration:

    run:
      name: ${.eval_name}_${.model_train_name}
      time_limit: "4:00:00"
      dependency: "singleton"
      nodes: 1
      ntasks_per_node: 1
      eval_name: rag_indexing
      model_train_name: rag_pipeline
      results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}
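
    The ${.key} references resolve relative to the same run block. The following sketch shows the resolved values, assuming a hypothetical base_results_dir of /results:

```python
run = {
    "eval_name": "rag_indexing",
    "model_train_name": "rag_pipeline",
}
base_results_dir = "/results"  # hypothetical value for illustration

name = f"{run['eval_name']}_{run['model_train_name']}"
results_dir = f"{base_results_dir}/{run['model_train_name']}/{run['eval_name']}"
print(name)         # rag_indexing_rag_pipeline
print(results_dir)  # /results/rag_pipeline/rag_indexing
```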


  2. Set the path for the embedder checkpoint, corpus data, and saved index:

    indexing:
      embedder:
        model_path: /path/to/checkpoint_dir
      data:
        data_path: /path/to/corpus_data
      index_path: /path/to/saved_index


  3. Set the configuration for the Slurm cluster in conf/cluster/bcm.yaml:

    partition: null
    account: null
    exclusive: True
    gpus_per_task: null
    gpus_per_node: 8
    mem: 0
    job_name_prefix: 'nemo-megatron-'
    srun_args:
      - "--no-container-mount-home"


  4. Set the stages section of conf/config.yaml:

    stages:
      - rag_indexing


  5. Run the Python script:

    python3 main.py


    All the configurations are read from conf/config.yaml and conf/rag_indexing/bert/340m.yaml.

Last updated on Jun 19, 2024.