Generate Text with RAG

After indexing the corpus data for the RAG pipeline, you can retrieve relevant contexts to augment text generation. Given a query or prompt, you first extract embeddings from the query using the same embedder that was used during index creation. Next, you retrieve the k-nearest-neighbor contexts related to the query from the index. Finally, you concatenate these contexts with the query and feed the resulting prompt to the NeMo LLM.
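
Conceptually, the flow looks like the following LlamaIndex sketch. This is a minimal illustration rather than the supplied script: the persist directory, top-k value, and prompt template are placeholders, and it assumes the same embedding model is configured globally before the index is loaded.

    # Illustrative retrieve-then-generate flow (not the supplied script).
    from llama_index.core import StorageContext, load_index_from_storage

    # Load the index persisted during the indexing step.
    storage = StorageContext.from_defaults(persist_dir="/path/to/saved_index")
    index = load_index_from_storage(storage)

    query = "Which art schools did I apply to?"

    # Retrieve the k-nearest-neighbor contexts for the query.
    nodes = index.as_retriever(similarity_top_k=3).retrieve(query)
    contexts = "\n\n".join(n.node.get_content() for n in nodes)

    # Concatenate the contexts with the query; the supplied script feeds
    # the resulting prompt to the NeMo LLM for generation.
    prompt = f"Context:\n{contexts}\n\nQuestion: {query}\nAnswer:"
    print(prompt)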

The supplied script runs the entire process with the LlamaIndex library, using a trained NeMo embedding model and a NeMo LLM.

In this procedure, you use the same NeMo embedding model as the one used when indexing the corpus data. Additionally, you'll work with a NeMo LLM, such as GPT, Llama, or Gemma. For instructions on training an LLM in NeMo, see NVIDIA GPT.

Run Text Generation on a Base Model

This section provides basic instructions for configuring text generation on a base model; the Slurm-specific steps follow in the next section.

To initiate text generation:

  1. Assign the stages variable in conf/config.yaml to "rag_generating".

  2. Define the configuration for text generation by setting the rag_generating variable to <llm_model_type>/<model_size>, which maps to a specific LLM config file path.

For example, setting the rag_generating variable to gpt3/7b selects the configuration file conf/rag_generating/gpt3/7b.yaml, which corresponds to a GPT-type LLM with 7 billion parameters.
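
Assuming conf/config.yaml follows the launcher's usual Hydra defaults-list layout, the relevant entries would look like this:

    # conf/config.yaml (illustrative; the exact key layout may differ)
    defaults:
      - rag_generating: gpt3/7b

    stages:
      - rag_generating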

Run Text Generation on a Slurm Cluster

To run text generation on a Slurm cluster:

  1. Set the run configuration in conf/rag_generating/gpt3/7b.yaml to define the job-specific configuration:

    run:
       name: ${.eval_name}_${.model_train_name}
       time_limit: "4:00:00"
       dependency: "singleton"
       nodes: 1
       ntasks_per_node: 1
       eval_name: rag_generating
       model_train_name: rag_pipeline
       results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}
    
  2. Set the paths for the embedder checkpoint and the saved index. Ensure that these values correspond to the same embedder model used in the indexing step.

    indexing:
       embedder:
          model_path: /path/to/embedder_checkpoint_dir
       index_path: /path/to/saved_index
    
  3. Set the values for text generation, including the LLM checkpoint path, query, temperature, and number of tokens to generate:

    generating:
       llm:
          model_path: /path/to/llm_checkpoint_dir
       inference:
          tokens_to_generate: 50
          greedy: False
          temperature: 1.0
       query: 'Which art schools did I apply to?'
    

    Based on the query, relevant contexts will be retrieved from the corpus to augment text generation.
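
    In the inference section, greedy: False means the model samples each token using the given temperature; setting greedy: True would instead pick the most likely token at every step, in which case the temperature has no effect.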

  4. Set the configuration for the Slurm cluster in conf/cluster/bcm.yaml:

    partition: null
    account: null
    exclusive: True
    gpus_per_task: null
    gpus_per_node: 8
    mem: 0
    job_name_prefix: 'nemo-megatron-'
    srun_args:
       - "--no-container-mount-home"
    
  5. Set the stages section of conf/config.yaml:

    stages:
      - rag_generating
    
  6. Run the Python script:

    python3 main.py
    

    All the configurations are read from conf/config.yaml and conf/rag_generating/gpt3/7b.yaml.
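
    If main.py is a Hydra application, as the conf/ layout and ${...} interpolations suggest, you can also override individual values from the command line instead of editing the files. The override paths below are illustrative and depend on the exact config structure:

    python3 main.py "stages=[rag_generating]" \
        rag_generating=gpt3/7b \
        rag_generating.generating.inference.temperature=0.7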