NVIDIA Enterprise RAG LLM Operator

Sample RAG Pipeline

The following figure shows a high-level overview of the software components in the pipeline.

[Figure pipeline-components.png: High-level overview of the software components in the pipeline]

Prerequisites

  • Installed the NGC CLI on your client machine. You can download the CLI from https://ngc.nvidia.com/setup/installers/cli.

    Refer to the NVIDIA NGC User Guide for information about Generating Your NGC API key.

    Refer to the NVIDIA NGC CLI User Guide for information about how to set the config with the CLI for your organization and team.

  • Installed the NVIDIA GPU Operator and NVIDIA Enterprise RAG LLM Operator.

  • A default storage class for persistent volumes. The embedding and inference models are downloaded and stored in persistent storage. The sample pipeline uses three persistent volume claims:

    • nemollm-inference-pvc: 50 GB

    • nemollm-embedding-pvc: 50 GB

    • pgvector-pvc: 5 GB

    The preceding sample PVC sizes apply to the sample Helm pipeline. You might need to increase the sizes if you deploy different models.

    For VMware vSphere with Tanzu, NVIDIA recommends vSphere CNS. For Kubernetes, NVIDIA used the local-path-provisioner from Rancher Labs during development.

Special Considerations for VMware vSphere with Tanzu

If you install a persistent volume provisioner, such as Rancher Local Path Provisioner, you need to label the namespace to prevent the admission controller from enforcing the pod security policy.

Enter the following commands before creating the persistent volume claims:

$ kubectl create namespace local-path-provisioner
$ kubectl label --overwrite ns local-path-provisioner pod-security.kubernetes.io/warn=privileged pod-security.kubernetes.io/enforce=privileged

You also need to label the sample RAG pipeline namespace.

$ kubectl create namespace rag-sample
$ kubectl label --overwrite ns rag-sample pod-security.kubernetes.io/warn=privileged pod-security.kubernetes.io/enforce=privileged

Procedure

  1. Download the manifests for the sample pipeline:

    $ ngc registry resource download-version ohlfw0olaadg/ea-rag-examples/rag-sample-pipeline:0.4.0

    The NGC CLI downloads the manifests to a new directory, rag-sample-pipeline_v0.4.0. The manifests for persistent volume claims and the sample pipeline are in the new directory.
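    The download requires an NGC CLI that is already configured with your API key, organization, and team, as described in the Prerequisites. If the CLI is not configured yet, ngc config set prompts for these values interactively; the following is a minimal sketch:

    # One-time, interactive NGC CLI configuration: supply the NGC API key,
    # organization, and team that have access to the sample pipeline resource.
    $ ngc config set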

  2. Create the namespace, if it isn’t already created:

    $ kubectl create namespace rag-sample

  3. Configure persistent storage.

    1. In the downloaded directory, edit the three persistent volume claim (PVC) manifests in the examples directory.

      Specify the spec.storageClassName value for each.
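      For reference, an edited examples/pvc-embedding.yaml might look like the following sketch. The accessModes value is an assumption for a typical single-node model store, and local-path stands in for whatever storage class your cluster provides; keep the claim name and size from the downloaded manifest.

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: nemollm-embedding-pvc
      spec:
        accessModes:
          - ReadWriteOnce              # assumption: single-node access to the model store
        storageClassName: local-path   # replace with your cluster's storage class
        resources:
          requests:
            storage: 50Gi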

    2. Create the PVCs:

      $ kubectl apply -f examples/pvc-embedding.yaml -n rag-sample
      $ kubectl apply -f examples/pvc-inferencing.yaml -n rag-sample
      $ kubectl apply -f examples/pvc-pgvector.yaml -n rag-sample

    3. Confirm PV and PVCs are bound:

      $ kubectl get pv,pvc -n rag-sample

      The following output applies to the Local Path Provisioner. The output is different if your storage class has a different volume binding mode.

      NAMESPACE    NAME                                           STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      rag-sample   persistentvolumeclaim/nemollm-embedding-pvc   Pending                                       local-path     3h7m
      rag-sample   persistentvolumeclaim/nemollm-inference-pvc   Pending                                       local-path     3h7m
      rag-sample   persistentvolumeclaim/pgvector-pvc            Pending                                       local-path     3h7m
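      With the Local Path Provisioner, the claims remain Pending until a pod consumes them because the storage class uses the WaitForFirstConsumer volume binding mode. You can check the binding mode of your storage class with:

      $ kubectl get storageclass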

  4. Edit the sample pipeline, config/samples/helmpipeline_app.yaml.

    Required Changes

    • Update the two imagePullSecret.password fields with your NGC API key.

    • Update the two secret.apiKey fields with your NGC API key.
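    If you are not sure where these fields appear in the manifest, you can list every occurrence before editing; the following is a simple sketch:

    # Show the line numbers of the fields that must be updated with the NGC API key.
    $ grep -n -e "imagePullSecret" -e "apiKey" config/samples/helmpipeline_app.yaml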

    Optional Customizations

    For optional customizations, such as the node selector for your GPU model and the frontend service type, refer to the common customizations described after this procedure.

  5. Create the pipeline:

    $ kubectl apply -f config/samples/helmpipeline_app.yaml -n rag-sample

    The deployment typically requires between 10 and 60 minutes to start the containers, download models from NGC, and become ready for service.
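    Instead of polling, you can optionally block until every pod in the namespace reports Ready; the timeout below is an arbitrary upper bound, so adjust it for your environment:

    $ kubectl wait --for=condition=Ready pods --all -n rag-sample --timeout=60m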

  6. Optional: Monitor the pods:

    $ kubectl get pods -n rag-sample

    Example Output

    NAME                                   READY   STATUS    RESTARTS   AGE
    frontend-7cb65566d4-mjlmq              1/1     Running   0          47h
    nemollm-embedding-5bbc63f38d3b911f-0   1/1     Running   0          47h
    nemollm-inference-5bbc63f38d3b911f-0   1/1     Running   0          47h
    pgvector-0                             1/1     Running   0          47h
    query-router-7fcd9ffd84-9f6j8          1/1     Running   0          47h

  7. Access the sample chat application.

    If your cluster is not configured to work with an external load balancer, you can port-forward the HTTP connection to the sample chat application.

    • Determine the node port for the sample chat application:

      $ kubectl get service -n rag-sample frontend

      In the following sample output, the application is listening on node port 32092.

      NAME       TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
      frontend   NodePort   10.99.219.137   <none>        8090:32092/TCP   47h

    • Forward the port:

      $ kubectl port-forward service/frontend -n rag-sample 32092:8090

    After you forward the port, you can access the application at http://localhost:32092.
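    To quickly confirm that the forwarded port is serving the application, you can request the page and print the HTTP status code; any 2xx or 3xx response indicates the frontend is reachable:

      $ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:32092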

Common Customizations

The sample pipeline, config/samples/helmpipeline_app.yaml, provides a simple baseline that you can customize.

Field: nodeSelector.nvidia.com/gpu.product
Default Value: NVIDIA-A100-80GB-PCIe
Customization: If your host has an NVIDIA H100 or L40S GPU, specify the product name. Run the following command to display the GPU models on your nodes:

kubectl get nodes -l nvidia.com/gpu.present=true \
  -o=jsonpath='{range .items[*]}{.metadata.name}: {.metadata.labels.nvidia\.com\/gpu\.product}{"\n"}{end}'

Field: frontend.service.type
Default Value: nodePort
Customization: Specify loadBalancer if your cluster integrates with an external load balancer.

Field: chartValues.updateStrategy.type for nemollm-inference, nemollm-embedding, and pgvector
Default Value: RollingUpdate
Customization: Specify OnDelete to prevent Kubernetes from performing a rolling update. This setting can help avoid an inference or embedding pod becoming stuck in Pending when you update the specification and no GPUs are allocatable. Refer to Update Strategies for stateful sets in the Kubernetes documentation.

Field: query.deployStrategy.type and frontend.deployStrategy.type
Default Value: RollingUpdate
Customization: Specify Recreate to have Kubernetes delete existing pods before creating new ones. This setting can help avoid the query pod becoming stuck in Pending when you update the specification and no GPUs are allocatable. By default, Kubernetes creates a new pod, including a GPU resource request, before deleting the currently running pod. Refer to Strategy for deployments in the Kubernetes documentation.
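For reference, the following is a minimal sketch of where the update-strategy and deploy-strategy keys sit in config/samples/helmpipeline_app.yaml, based on the manifest fragments shown later in this guide. Only the relevant keys are shown and the surrounding chart values are elided; verify the exact structure against your downloaded manifest.

spec:
  pipeline:
    - repoEntry:
        name: nemollm-inference
      # ...
      chartValues:
        # ...
        updateStrategy:
          type: OnDelete    # avoid a rolling update for the stateful inference service
    - repoEntry:
        name: rag-llm-app
      # ...
      chartValues:
        # ...
        query:
          deployStrategy:
            type: Recreate  # delete the existing pod before creating the new one
        frontend:
          deployStrategy:
            type: Recreate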

Assigning Multiple GPUs to a Service

Some models exceed the memory capacity of a single GPU and require more than one GPU.

To assign multiple GPUs to the inference container, modify the Helm pipeline manifest as shown in the following example. The model.numGpus field corresponds to the --num_gpus command-line argument of the nemollm_inference_ms command.

spec:
  pipeline:
    - repoEntry:
        name: nemollm-inference
        url: "file:///helm-charts/pipeline"
        #url: "cm://rag-application/nemollm-inference"
      chartSpec:
        chart: "nemollm-inference"
        wait: false
      chartValues:
        fullnameOverride: "nemollm-inference"
        model:
          name: llama-2-13b-chat
          numGpus: 2
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
        resources:
          limits:
            nvidia.com/gpu: 2  # Number of GPUs to present to the running service

To assign multiple GPUs to the embedding container, modify the Helm pipeline manifest and specify the number in the resource request.

- repoEntry:
    name: nemollm-embedding
    url: "file:///helm-charts/pipeline"
    #url: "cm://rag-application/nemollm-embedding"
  chartSpec:
    chart: "nemollm-embedding"
    wait: false
  chartValues:
    fullnameOverride: "nemollm-embedding"
    ...
    resources:
      limits:
        nvidia.com/gpu: 2  # Number of GPUs to present to the running service
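Before requesting multiple GPUs for a service, you can confirm how many GPUs a node reports as allocatable. The following sketch uses a placeholder node name; substitute one of your GPU nodes:

$ kubectl describe node <node-name> | grep -i 'nvidia.com/gpu'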

About Choosing a Model

By default, the sample pipeline deploys the Llama-2-13B-Chat model. You can configure the pipeline to deploy a different inference model from your organization and team’s NGC Private Registry.

Important

A model name, such as llama-2-70b-chat-4k-FP16-4-A100.24.01, has a GPU suffix, such as A100, and a model version suffix, such as 24.01. The GPU suffix must match the GPU model on the node that runs the nemollm-inference container. The model version suffix must match the tag on the nemollm-inference image.
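One way to check both constraints against your cluster, reusing commands shown elsewhere in this guide, is the following sketch:

# GPU model on the GPU nodes; it must match the model's GPU suffix, such as A100.
$ kubectl get nodes -l nvidia.com/gpu.present=true \
    -o=jsonpath='{range .items[*]}{.metadata.name}: {.metadata.labels.nvidia\.com\/gpu\.product}{"\n"}{end}'

# Image tag of the running Inference Microservice; it must match the model version suffix, such as 24.01.
$ kubectl get pods -n rag-sample -l app.kubernetes.io/name=nemollm-inference \
    -o=jsonpath='{.items[*].spec.containers[?(@.name=="nemollm-inference")].image}{"\n"}'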

You can access the registry and list models using a web browser or the NGC CLI:

Web browser

Go to https://registry.ngc.nvidia.com/models. After you log in and set your organization and team, browse the models.

NGC CLI

Use the ngc registry model list <organization>/* command to list the models.

For example, the model listing can include the mistral-7b-instruct model. Your NGC organization and team membership determines the models that are available to you.

Use the ngc registry model list <organization/team/model>:* command to get the model versions. For example, run ngc registry model list "............/ea-participants/mistral-7b-instruct:*".
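For example, with placeholder values for the organization and team, the listing commands look like the following sketch; substitute your own organization and team names:

# List the models that are visible to your organization.
$ ngc registry model list "<org-name>/*"

# List the available versions of a specific model.
$ ngc registry model list "<org-name>/<team-name>/mistral-7b-instruct:*"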

Based on the model versions that the preceding command returns, consider the following requirements:

  • The node that runs the nemollm-inference container must have an NVIDIA A100 80 GB GPU.

  • The nemollm-inference image tag must match the model version suffix of 24.01.rc4 or 24.01.

Procedure

To change the inference model, perform the following steps:

  1. Determine the version of the Inference Microservice image:

    $ kubectl get pods -n rag-sample -l app.kubernetes.io/name=nemollm-inference -o=jsonpath='{.items[*].spec.containers[?(@.name=="nemollm-inference")].image}{"\n"}'

    Example Output

    nvcr.io/............/ea-participants/nemollm-inference-ms:24.01

    Based on the example output, only models with a version suffix of 24.01 will work with the microservice image.

  2. Edit the config/samples/helmpipeline_app.yaml file.

    • Modify the Inference Microservice specification and set the NGC_MODEL_NAME, NGC_MODEL_VERSION, and MODEL_NAME environment variables:

      spec:
        pipeline:
          - repoEntry:
              name: nemollm-inference
              url: "file:///helm-charts/pipeline"
            # ...
            chartValues:
              model:
                name: mistral-7b-instruct
              # ...
              initContainers:
                ngcInit:  # disabled by default
                  # ...
                  env:
                    STORE_MOUNT_PATH: /model-store
                    NGC_CLI_ORG: <org-name>  # ngc org where model lives
                    NGC_CLI_TEAM: <team-name>  # ngc team where model lives
                    NGC_MODEL_NAME: mistral-7b-instruct  # model name in ngc
                    NGC_MODEL_VERSION: MISTRAL-7b-INSTRUCT-1-A100.24.01.rc4  # model version in ngc
                    NGC_EXE: ngc  # path to ngc cli, if pre-installed in container
                    DOWNLOAD_NGC_CLI: "false"  # set to string 'true' if container should download and install ngc cli
                    NGC_CLI_VERSION: "3.37.1"  # version of ngc cli to download (only matters if downloading)
                    TARFILE: yes  # tells the script to untar the model. defaults to "yes", set to "" to turn off
                    MODEL_NAME: MISTRAL-7b-INSTRUCT-1-A100.24.01.rc4  # actual model name, once downloaded

    • Modify the query service specification and set APP_LLM_MODELNAME environment variable:

      spec:
        pipeline:
          - repoEntry:
              name: rag-llm-app
              url: "file:///helm-charts/pipeline"
            # ...
            chartValues:
              # ...
              query:
                # ...
                env:
                  # ...
                  APP_LLM_MODELNAME: mistral-7b-instruct
              frontend:
                # ...
                env:
                  # ...
                  - name: APP_MODELNAME
                    value: "mistral-7b-instruct"

  3. Apply the configuration change:

    $ kubectl apply -f config/samples/helmpipeline_app.yaml -n rag-sample

    Downloading the model from NGC typically requires between 10 and 60 minutes.

    Example Output

    helmpipeline.package.nvidia.com/my-sample-pipeline configured

  4. Optional: Monitor the progress:

    $ kubectl get pods -n rag-sample

    Example Output

    NAME                                   READY   STATUS    RESTARTS   AGE
    frontend-7cb65566d4-xhx8p              1/1     Running   0          100s
    nemollm-embedding-5bbc63f38d3b911f-0   1/1     Running   0          40m
    nemollm-inference-5bbc63f38d3b911f-0   1/1     Running   0          58s
    pgvector-0                             1/1     Running   0          40m
    query-router-7fcd9ffd84-pqxrq          1/1     Running   0          100s

Deleting a Pipeline

To delete a pipeline and remove the resources and objects associated with the services, perform the following steps:

  1. View the Helm pipeline custom resources:

    $ kubectl get helmpipelines -A

    Example Output

    NAMESPACE    NAME                 STATUS
    rag-sample   my-sample-pipeline   deployed

  2. Delete the custom resource:

    $ kubectl delete helmpipeline -n rag-sample my-sample-pipeline

    Example Output

    helmpipeline.package.nvidia.com "my-sample-pipeline" deleted

  3. If you do not plan to redeploy the pipeline, delete the persistent storage:

    $ kubectl delete pvc -n rag-sample nemollm-embedding-pvc
    $ kubectl delete pvc -n rag-sample nemollm-inference-pvc
    $ kubectl delete pvc -n rag-sample pgvector-pvc
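    If you created the namespace solely for this pipeline and want to remove it entirely, including any remaining objects in it, you can optionally delete the namespace:

    $ kubectl delete namespace rag-sample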
