Sample RAG Pipeline - NVIDIA Docs

About the Pipeline Software Components

The following figure shows a high-level overview of the software components in the pipeline.

Configuring Pipelines

Prerequisites

Installed the NGC CLI on your client machine. You can download the CLI from https://ngc.nvidia.com/setup/installers/cli.

Refer to the NVIDIA NGC User Guide for information about Generating Your NGC API key.

Refer to the NVIDIA NGC CLI User Guide for information about how to set the config with the CLI for your organization and team.
Installed the NVIDIA GPU Operator and NVIDIA Enterprise RAG LLM Operator.
A default storage class for persistent volumes. The embedding and inference models are downloaded and stored in persistent storage. The sample pipeline uses three persistent volume claims:
- nemollm-inference-pvc: 50 GB
- nemo-embedding-pvc: 50 GB
- pgvector-pvc: 5 GB
The preceding sample PVC sizes apply to the sample Helm pipeline. You might need to increase the sizes if you deploy different models.

For VMware vSphere with Tanzu, NVIDIA recommends vSphere CNS. For Kubernetes, NVIDIA used the local-path-provisioner from Rancher Labs during development.

Special Considerations for VMware vSphere with Tanzu

If you install a persistent volume provisioner, such as Rancher Local Path Provisioner, you need to label the namespace to prevent the admission controller from enforcing the pod security policy.

Enter the following commands before creating the persistent volume claims:

Copy
Copied!

            
            $ kubectl create namespace local-path-provisioner
$ kubectl label --overwrite ns local-path-provisioner pod-security.kubernetes.io/warn=privileged pod-security.kubernetes.io/enforce=privileged

You also need to label the sample RAG pipeline namespace.

Copy
Copied!

            
            $ kubectl create namespace rag-sample
$ kubectl label --overwrite ns rag-sample pod-security.kubernetes.io/warn=privileged pod-security.kubernetes.io/enforce=privileged

Procedure

Download the manifests for the sample pipeline.

NVIDIA provides sample pipeline manifests for the TensorRT-LLM backend and the vLLM backend to NVIDIA NIM for LLMs. Refer to Supported Inference Models and GPU Requirements to determine which backend to use.
Copy

Copied!
```
            
            $ ngc registry resource download-version ohlfw0olaadg/ea-participants/rag-sample-pipeline:24.03
        
```
The NGC CLI downloads the manifests to a new directory, rag-sample-pipeline_v24.03. The manifests for persistent volume claims and the sample pipeline are in the new directory.

Change directory to the sample manifests:

Copy
Copied!

            
            $ cd rag-sample-pipeline_v24.03

Create the namespace, if it isn’t already created:

Copy
Copied!

            
            $ kubectl create namespace rag-sample

Configure persistent storage.

In the downloaded directory, edit the three persistent volume claim (PVC) manifests in the examples directory.

Specify the spec.storageClassName value for each.

Create the PVCs:

Copy
Copied!

            
            $ kubectl apply -f examples/pvc-embedding.yaml -n rag-sample
$ kubectl apply -f examples/pvc-inferencing.yaml -n rag-sample
$ kubectl apply -f examples/pvc-pgvector.yaml -n rag-sample

Confirm the PVCs are created:

Copy
Copied!

            
            $ kubectl get pvc -n rag-sample

The following output applies to the Local Path Provisioner. The output is different if your storage class has a different volume binding mode.

Copy
Copied!

            
            NAMESPACE    NAME                                          STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
rag-sample   persistentvolumeclaim/nemollm-embedding-pvc   Pending                                      local-path     3h7m
rag-sample   persistentvolumeclaim/nemollm-inference-pvc   Pending                                      local-path     3h7m
rag-sample   persistentvolumeclaim/pgvector-pvc            Pending                                      local-path     3h7m

Add secrets that use your NGC CLI API key.

Add a Docker registry secret that the Operator uses for pulling containers from NGC:

Copy
Copied!

            
            $ kubectl create secret -n rag-sample docker-registry ngc-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password=<ngc-cli-api-key>

Add a generic secret that the init container for the inference and embedding containers use to download models:

Copy
Copied!

            
            $ kubectl create secret -n rag-sample generic ngc-api-secret \
    --from-literal=NGC_CLI_API_KEY=<ngc-cli-api-key>

To use the vLLM backend and models from Hugging Face Model Hub, add a secret with your personal access token for downloading models from the hub:
Copy

Copied!
```
            
            $ kubectl create secret -n rag-sample generic hf-secret \
    --from-literal=HF_USER=<user%40example.com> \
    --from-literal=HF_PAT=<hf-personal-access-token>
        
```
To use an @ character in the HF_USER value, specify the URL-encoded value, %40.
Edit a sample pipeline file, config/samples/helmpipeline_<app|nemo>_<backend>.yaml.

Consider the following topics:
Create the pipeline:
Copy

Copied!
```
            
            $ kubectl apply -f config/samples/helmpipeline_<app|nemo>_<backend>.yaml -n rag-sample
        
```
The deployment typically requires between 10 and 60 minutes to start the containers, download models from NGC, and become ready for service.

Optional: Monitor the pods:

Copy
Copied!

            
            $ kubectl get pods -n rag-sample

Example Output

Copy
Copied!

            
            NAME                                   READY   STATUS    RESTARTS   AGE
chain-server-5c9c9f7d75-ggw2h          1/1     Running   0          42m
nemollm-embedding-5bbc63f38d3b911f-0   1/1     Running   0          42m
nemollm-inference-5bbc63f38d3b911f-0   1/1     Running   0          29m
pgvector-0                             1/1     Running   0          42m
rag-playground-cb67f854b-6jkzb         1/1     Running   0          42m

Access the sample chat application.

If your cluster is not configured to work with an external load balancer, you can port-forward the HTTP connection to the sample chat application.

Determine the node port for the sample chat application:

Copy
Copied!

            
            $ kubectl get service -n rag-sample rag-playground

In the following sample output, the application is listening on node port 32092.

Copy
Copied!

            
            NAME             TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
rag-playground   NodePort   10.99.219.137   <none>        3001:32092/TCP   141m

Forward the port:

Copy
Copied!

            
            $ kubectl port-forward service/rag-playground -n rag-sample 32092:3001

After you forward the port, you can access the application at http://localhost:32092.

Optional: Access the Chain Server. This step can be useful if you want to develop your own chat client or chat templates.

If your cluster is not configured to work with an external load balancer, you can port-forward the HTTP connection to the Chain Server.
- Patch the service to change it from cluster IP to node port:
  Copy
  
  Copied!
```
            
            $ kubectl patch service -n rag-sample chain-server -p '{"spec":{"type":"NodePort"}}'
        
```
- Determine the node port for the chain server:
  Copy
  
  Copied!
```
            
            $ kubectl get service -n rag-sample chain-server
        
```
  In the following sample output, the server is listening on node port 31157.
  Copy
  
  Copied!
```
            
            NAME           TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
chain-server   NodePort   10.108.98.47   <none>        8081:31157/TCP   14d
        
```
- Forward the port:
  Copy
  
  Copied!
```
            
            $ kubectl port-forward service/chain-server -n rag-sample 31157:8081
        
```
After you forward the port, you can access the server at http://localhost:31157. You can view the API at http://localhost:31157/docs.

Common Customizations

The sample pipelines provide a simple baseline that you can customize.

Field	Customization	Default Value
`nodeSelector.nvidia.com/gpu.product`	If your host has an NVIDIA H100 or L40S GPU, specify the product name. Run the following command to display the GPU models on your nodes: Copy Copied! `kubectl get nodes -l nvidia.com/gpu.present=true \ -o=jsonpath='{range .items[*]}{.metadata.name}: {.metadata.labels.nvidia\.com\/gpu\.product}{"\n"}{end}'`	`NVIDIA-A100-80GB-PCIe`
`frontend.service.type`	Specify `loadBalancer` if your cluster integrates with an external load balancer.	`nodePort`
`chartValues.updateStrategy.type` for nemollm-inference, nemollm-embedding, and pgvector	Specify `OnDelete` to prevent Kubernetes from performing a rolling update. This setting can be helpful to avoid an inference or embedding pod becoming stuck in `Pending` when you update the specification and no GPU are allocatable. Refer to Update Strategies for stateful sets in the Kubernetes documentation.	`RollingUpdate`
`query.deployStrategy.type` `frontend.deployStrategy.type`	Specify `Recreate` to have Kubernetes delete existing pods before creating new ones. This setting can be helpful to avoid the query pod becoming stuck in `Pending` when you update the specification and no GPUs are allocatable. By default, Kubernetes creates a new pod, including a GPU resource request, before deleting the currently running pod. Refer to Strategy for deployments in the Kubernetes documentation.	`RollingUpdate`

Assigning Multiple GPUs with the TensorRT-LLM Backend

Some models exceed the memory capacity of a single GPU and require more than one GPU.

To assign multiple GPUs to the inference container, modify the Helm pipeline manifest like the following example. The model.numGpus field corresponds to the --num_gpus command-line argument to the nemollm_inference_ms command.

Copy
Copied!

            
            spec:
  pipeline:
  - repoEntry:
      name: nemollm-inference
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/nemollm-inference"
    chartSpec:
      chart: "nemollm-inference"
      wait: false
    chartValues:
      fullnameOverride: "nemollm-inference"
      model:
        name: llama-2-13b-chat
        numGpus: 2
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
      resources:
        limits:
          nvidia.com/gpu: 2  # Number of GPUs to present to the running service

Assigning Multiple GPUs to the Embedding Container

To assign multiple GPUs to the embedding container, modify the Helm pipeline manifest and specify the number in the resource request.

Copy
Copied!

            
              - repoEntry:
      name: nemollm-embedding
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/nemollm-embedding"
    chartSpec:
      chart: "nemollm-embedding"
      wait: false
    chartValues:
      fullnameOverride: "nemollm-embedding"
      ...
      resources:
        limits:
          nvidia.com/gpu: 2  # Number of GPUs to present to the running service

Changing the Inference Model for the TensorRT-LLM Backend

About Choosing a Model

By default, the sample pipeline deploys the Llama-2-13B-Chat model. You can configure the pipeline to deploy a different inference model from your organization and team’s NGC Private Registry.

Important

A model, such as Llama-2-13B-Chat has a model version, such as a100x2_fp16_24.02. The model version encodes the following information:

Required GPU model, such as A100.
Required GPU count, such as 2.
Model release, such as 24.02.

For the Operator to deploy a model from NGC and use the TensorRT-LLM backend, the GPU model and count must match a node in your cluster. The model release must match the NIM for LLMs container image tag. You can locate the image tag in the Helm pipeline manifest file.

You can access the registry and list models using a web browser or the NGC CLI:

Web browser
NGC CLI

Based on the preceding output, consider the following requirements:

The node that runs the NIM for LLMs container must have four NVIDIA A100 80 GB GPUs or four H100 80 GB GPUs.
The NIM for LLMs image tag must match the model release value, 24.02.

Procedure

To change the inference model, perform the following steps:

Determine the version of the NIM for LLMs image:

Copy
Copied!

            
            $ kubectl get pods -n rag-sample  -l app.kubernetes.io/name=nemollm-inference -o=jsonpath='{.items[*].spec.containers[?(@.name=="nemollm-inference")].image}{"\n"}'

Example Output

Copy
Copied!

            
            nvcr.io/ohlfw0olaadg/ea-participants/nim_llm:24.02

Based on the example output, only models with a version suffix of 24.02 will operate with the microservice image.

Edit the config/samples/helmpipeline_<app|nemo>_trtllm.yaml file.

Modify the NIM for LLMs specification and set the NGC_MODEL_NAME, NGC_MODEL_VERSION, and MODEL_NAME environment variables:

Copy
Copied!

            
            spec:
  pipeline:
  - repoEntry:
      name: nemollm-inference
      url: "file:///helm-charts/pipeline"
    # ...
    chartValues:
      model:
        name: mixtral-8x7b-instruct-v0-1
      # ...
      initContainers:
        ngcInit: # disabled by default
        # ...
          env:
            STORE_MOUNT_PATH: /model-store
            NGC_CLI_ORG: <org-name> # ngc org where model lives
            NGC_CLI_TEAM: <team-name> # ngc team where model lives
            NGC_MODEL_NAME: mistral-8x7b-instruct-v0-1 # model name in ngc
            NGC_MODEL_VERSION: a100x4_fp16_24.02 # model version in ngc
            NGC_EXE: ngc  # path to ngc cli, if pre-installed in container
            DOWNLOAD_NGC_CLI: "false"  # set to string 'true' if container should download and install ngc cli
            NGC_CLI_VERSION: "3.37.1"  # version of ngc cli to download (only matters if downloading)
            TARFILE: yes  # tells the script to untar the model. defaults to "yes", set to "" to turn off
            MODEL_NAME: mixtral-8x7b-instruct-v0-1 # actual model name, once downloaded

Modify the query service specification and set APP_LLM_MODELNAME environment variable:

Copy
Copied!

            
            spec:
  pipeline:
  - repoEntry:
      name: rag-llm-app
      url: "file:///helm-charts/pipeline"
    # ...
    chartValues:
      # ...
      query:
        # ...
        env:
          # ...
          APP_LLM_MODELNAME: mistral-8x7b-instruct-v0-1

      frontend:
        # ...
        env:
          # ...
          - name: APP_MODELNAME
            value: "mistral-8x7b-instruct-v0-1"

Apply the configuration change:

Copy
Copied!

            
            $ kubectl apply -f config/samples/helmpipeline_<app|nemo>_trtllm.yaml -n rag-sample

Downloading the model from NGC typically requires between 10 and 60 minutes.

Example Output

Copy
Copied!

            
            helmpipeline.package.nvidia.com/my-sample-pipeline configured

Optional: Monitor the progress:

Copy
Copied!

            
            $ kubectl get pods -n rag-sample

Example Output

Copy
Copied!

            
            NAME                                   READY   STATUS     RESTARTS   AGE
chain-server-6d66c45bb9-rpjjz          1/1     Running    0          4m25s
nemollm-embedding-5bbc63f38d3b911f-0   1/1     Running    0          120m
nemollm-inference-5bbc63f38d3b911f-0   0/1     Init:0/1   0          3m34s
pgvector-0                             1/1     Running    0          4d
rag-playground-cb67f854b-6jkzb         1/1     Running    0          4d

Changing the Inference Model for the vLLM Backend

Changing the inference model for the vLLM backend requires stopping the inference container, deleting and recreating the persistent volume used by the inference container, and then applying the Helm pipeline manifest that specifies the model.

When you use the vLLM backend, the inference container has an init container, hf-model-puller, that downloads the inference model from Hugging Face Model Hub. The init container uses Git to clone the model repository using a repository address like the following:

Copy
Copied!

            
            https://${HF_USER}:${HF_PAT}@huggingface.co/${MODEL_ORG}/${MODEL_NAME}

The HF_USER and HF_PAT values are supplied from the hf-secret in the same namespace as the inference container. You specify the model org and model name in the Helm pipeline manifest. For example, to clone the model at https://huggingface.co/meta-llama/Llama-2-70b-chat-hf, specify MODEL_ORG: meta-llama and MODEL_NAME: Llama-2-70b-chat-hf to complete the Git repository address.

Important

The personal access token associated with your Hugging Face account must have access to the model organization and model.

The PVC for the inference container must have sufficient free disk space. For example, the Llama-2-70b-chat-hf model requires more than 280 GB of storage.

Edit the Helm pipeline manifest in the config/samples/ directory.

Specify model.name and model.config values for the nemollm-inference container.
Specify MODEL_ORG and MODEL_NAME values for the nemollm-inference init container.
Specify APP_LLM_MODELNAME for the chain-server container.

Copy
Copied!

            
            spec:
  pipeline:
  - repoEntry:
    name: nemollm-inference
    ...
    chartValues:
      fullnameOverride: "nemollm-inference"
      backend: "vllm"
      model:
        name: Llama-2-70b-chat
        config: /model-store/model_config.yaml
    ...
    initContainers:
      ngcInit: [] # disabled by default
      hfInit:
        imageName: bitnami/git
        imageTag: latest
        secret: # name of kube secret for hf with keys named HF_USER and HF_PAT
          name: hf-secret
        env:
          STORE_MOUNT_PATH: /model-store
          HF_MODEL_NAME: Llama-2-70b-chat-hf # HF model name
          HF_MODEL_ORG: meta-llama # HF org where model lives
          USE_SHALLOW_LFS_CLONE: 0 # Disable shallow LFS clone by default
          ...
  - repoEntry:
      name: rag-llm-app
      ...
      chartValues:
        query:
          ...
          env:
            APP_LLM_MODELNAME: Llama-2-70b-chat
            ...
...

Delete the stateful set for the inference container:

Copy
Copied!

            
            $ kubectl delete sts -n rag-sample -lapp.kubernetes.io/name=nemollm-inference

Delete the PVC for the inference container:

Copy
Copied!

            
            $ kubectl delete -n rag-sample -f examples/pvc-inferencing.yaml

Recreate the PVC for the inference container:

Copy
Copied!

            
            $ kubectl apply -n rag-sample -f examples/pvc-inferencing.yaml

Apply the Helm pipeline manifest to download and use the new model:

Copy
Copied!

            
            $ kubectl apply -n rag-sample -f config/samples/helm_<app|nemo>_vllm.yaml

Optional: Monitor the model download:

Copy
Copied!

            
            $ kubectl logs -n rag-sample -lapp.kubernetes.io/name=nemollm-inference -c hf-model-puller

Example Output

Copy
Copied!

            
            ...
pipefail        off
posix           off
privileged      off
verbose         off
vi              off
xtrace          off
downloading model meta-llama/Llama-2-70b-chat-hf
Git LFS initialized.
Cloning into '/model-store/model-store'...
Filtering content: ...

Use the RAG playground user interface or access the chain server API to use the new model.

Assigning Multiple GPUs with the vLLM Backend

Some models exceed the memory capacity of a single GPU and require more than one GPU.

To assign multiple GPUs to the inference container, modify the Helm pipeline manifest like the following example. The tensor_parallel_size field specifies the number of GPUs to use.

Copy
Copied!

            
            spec:
  pipeline:
  - repoEntry:
      name: nemollm-inference
      url: "file:///helm-charts/pipeline"
      #url: "cm://rag-application/nemollm-inference"
    chartSpec:
      chart: "nemollm-inference"
      wait: false
    chartValues:
      fullnameOverride: "nemollm-inference"
      model:
        name: llama-2-13b-chat
        numGpus: 2  # The vLLM backend does not use this field.
        # num_workers: 1
        vllm_config:
          engine:
            model: /model-store
            enforce_eager: false
            max_context_len_to_capture: 8192
            max_num_seqs: 256
            dtype: float16
            tensor_parallel_size: 2
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
      resources:
        limits:
          nvidia.com/gpu: 2  # Number of GPUs to present to the running service

The fields beneath chartValues.model.vllm_config correspond to the VllmEngine object for NIM for LLMs. Refer to Model Configuration Values for vLLM in the NVIDIA NIM for LLMs documentation.

Deleting RAG Pipelines

To delete a pipeline and remove the resources and objects associated with the services, perform the following steps:

View the Helm pipeline custom resources:

Copy
Copied!

            
            $ kubectl get helmpipelines -A

Example Output

Copy
Copied!

            
            NAMESPACE    NAME                 STATUS
rag-sample   my-sample-pipeline   deployed

Delete the custom resource:

Copy
Copied!

            
            $ kubectl delete helmpipeline -n rag-sample my-sample-pipeline

Example Output

Copy
Copied!

            
            helmpipeline.package.nvidia.com "my-sample-pipeline" deleted

If you do not plan to redeploy the pipeline, delete the persistent storage:

Copy
Copied!

            
            $ kubectl delete pvc -n rag-sample nemollm-embedding-pvc
$ kubectl delete pvc -n rag-sample nemollm-inference-pvc
$ kubectl delete pvc -n rag-sample pgvector-pvc