Generate Text with RAG

After indexing the corpus data for the RAG pipeline, you can retrieve relevant contexts to augment text generation. Given a query or prompt, you first extract embeddings from the query using the same embedder that was used during index creation. Next, you retrieve the k-nearest-neighbor contexts related to the query from the index. Finally, you concatenate these contexts with the query and feed the resulting prompt to the NeMo LLM.
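
Conceptually, the flow looks like the following LlamaIndex sketch. This is a minimal illustration rather than the supplied script: the persist directory, top-k value, and prompt template are placeholders, and it assumes the same embedding model is configured globally before the index is loaded.

    # Illustrative retrieve-then-generate flow (not the supplied script).
    from llama_index.core import StorageContext, load_index_from_storage

    # Load the index persisted during the indexing step.
    storage = StorageContext.from_defaults(persist_dir="/path/to/saved_index")
    index = load_index_from_storage(storage)

    query = "Which art schools did I apply to?"

    # Retrieve the k-nearest-neighbor contexts for the query.
    nodes = index.as_retriever(similarity_top_k=3).retrieve(query)
    contexts = "\n\n".join(n.node.get_content() for n in nodes)

    # Concatenate the contexts with the query; the supplied script feeds
    # the resulting prompt to the NeMo LLM for generation.
    prompt = f"Context:\n{contexts}\n\nQuestion: {query}\nAnswer:"
    print(prompt)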

The supplied script runs the entire process with the LlamaIndex library, using a trained NeMo embedding model and a NeMo LLM.

In this procedure, you use the same NeMo embedding model as the one used when indexing the corpus data. Additionally, you'll work with a NeMo LLM, such as GPT, Llama, or Gemma. For instructions on training an LLM in NeMo, see NVIDIA GPT.

Run Text Generation on a Base Model

This section provides basic instructions for configuring text generation on a base model; the Slurm-specific steps follow in the next section.

To initiate text generation:

  1. Assign the stages variable in conf/config.yaml to "rag_generating".

  2. Define the configuration for text generation by setting the rag_generating variable to <llm_model_type>/<model_size>, which maps to a specific LLM config file path.

For example, setting the rag_generating variable to gpt3/7b selects the configuration file conf/rag_generating/gpt3/7b.yaml, which corresponds to a GPT-type LLM with 7 billion parameters.
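
Assuming conf/config.yaml follows the launcher's usual Hydra defaults-list layout, the relevant entries would look like this:

    # conf/config.yaml (illustrative; the exact key layout may differ)
    defaults:
      - rag_generating: gpt3/7b

    stages:
      - rag_generating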

Run Text Generation on a Slurm Cluster

To run text generation on a Slurm cluster:

  1. Set the run configuration in conf/rag_generating/gpt3/7b.yaml to define the job-specific configuration:

    run:
       name: ${.eval_name}_${.model_train_name}
       time_limit: "4:00:00"
       dependency: "singleton"
       nodes: 1
       ntasks_per_node: 1
       eval_name: rag_generating
       model_train_name: rag_pipeline
       results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}
    
  2. Set the paths for the embedder checkpoint and the saved index. Ensure that these values correspond to the same embedder model used in the indexing step.

    indexing:
       embedder:
          model_path: /path/to/embedder_checkpoint_dir
       index_path: /path/to/saved_index
    
  3. Set the values for text generation, including the LLM checkpoint path, query, temperature, and number of tokens to generate:

    generating:
       llm:
          model_path: /path/to/llm_checkpoint_dir
       inference:
          tokens_to_generate: 50
          greedy: False
          temperature: 1.0
       query: 'Which art schools did I apply to?'
    

    Based on the query, relevant contexts will be retrieved from the corpus to augment text generation.
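
    In the inference section, greedy: False means the model samples each token using the given temperature; setting greedy: True would instead pick the most likely token at every step, in which case the temperature has no effect.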

  4. Set the configuration for the Slurm cluster in conf/cluster/bcm.yaml:

    partition: null
    account: null
    exclusive: True
    gpus_per_task: null
    gpus_per_node: 8
    mem: 0
    job_name_prefix: 'nemo-megatron-'
    srun_args:
       - "--no-container-mount-home"
    
  5. Set the stages section of conf/config.yaml:

    stages:
      - rag_generating
    
  6. Run the Python script:

    python3 main.py
    

    All the configurations are read from conf/config.yaml and conf/rag_generating/gpt3/7b.yaml.
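
    If main.py is a Hydra application, as the conf/ layout and ${...} interpolations suggest, you can also override individual values from the command line instead of editing the files. The override paths below are illustrative and depend on the exact config structure:

    python3 main.py "stages=[rag_generating]" \
        rag_generating=gpt3/7b \
        rag_generating.generating.inference.temperature=0.7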