Sample RAG Pipeline
The following figure shows a high-level overview of the software components in the pipeline.
![pipeline-components.png](https://docscontent.nvidia.com/dims4/default/04722c6/2147483647/strip/true/crop/830x426+0+0/resize/830x426!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018e-6130-d04c-a7fe-7f7a4b260000%2Fai-enterprise%2Frag-llm-operator%2F0.4.1%2F_images%2Fpipeline-components.png)
Prerequisites
- Installed the NGC CLI on your client machine. You can download the CLI from https://ngc.nvidia.com/setup/installers/cli.
  Refer to the NVIDIA NGC User Guide for information about Generating Your NGC API key.
  Refer to the NVIDIA NGC CLI User Guide for information about how to set the config with the CLI for your organization and team.
- Installed the NVIDIA GPU Operator and NVIDIA Enterprise RAG LLM Operator.
- A default storage class for persistent volumes. The embedding and inference models are downloaded and stored in persistent storage. The sample pipeline uses three persistent volume claims:
  - nemollm-inference-pvc: 50 GB
  - nemo-embedding-pvc: 50 GB
  - pgvector-pvc: 5 GB
  The preceding sample PVC sizes apply to the sample Helm pipeline. You might need to increase the sizes if you deploy different models.
  For VMware vSphere with Tanzu, NVIDIA recommends vSphere CNS. For Kubernetes, NVIDIA used the local-path-provisioner from Rancher Labs during development.
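If you use the Rancher Local Path Provisioner and it is not already the default, you can mark its storage class as the cluster default with a command like the following. This is a sketch rather than part of the sample manifests, and it assumes the storage class is named local-path:
$ kubectl patch storageclass local-path -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'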
Special Considerations for VMware vSphere with Tanzu
If you install a persistent volume provisioner, such as Rancher Local Path Provisioner, you need to label the namespace to prevent the admission controller from enforcing the pod security policy.
Enter the following commands before creating the persistent volume claims:
$ kubectl create namespace local-path-provisioner
$ kubectl label --overwrite ns local-path-provisioner pod-security.kubernetes.io/warn=privileged pod-security.kubernetes.io/enforce=privileged
You also need to label the sample RAG pipeline namespace.
$ kubectl create namespace rag-sample
$ kubectl label --overwrite ns rag-sample pod-security.kubernetes.io/warn=privileged pod-security.kubernetes.io/enforce=privileged
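To confirm that both namespaces carry the expected labels, you can list them with their labels. This check is a convenience suggestion, not part of the original procedure:
$ kubectl get ns local-path-provisioner rag-sample --show-labels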
Procedure
Download the manifests for the sample pipeline:
$ ngc registry resource download-version ohlfw0olaadg/ea-rag-examples/rag-sample-pipeline:0.4.0
The NGC CLI downloads the manifests to a new directory, rag-sample-pipeline_v0.4.0. The manifests for persistent volume claims and the sample pipeline are in the new directory.
Create the namespace, if it isn't already created:
$ kubectl create namespace rag-sample
Configure persistent storage.
In the downloaded directory, edit the three persistent volume claim (PVC) manifests in the examples directory and specify the spec.storageClassName value for each. A sketch of an edited manifest follows the commands below.
Create the PVCs:
$ kubectl apply -f examples/pvc-embedding.yaml -n rag-sample
$ kubectl apply -f examples/pvc-inferencing.yaml -n rag-sample
$ kubectl apply -f examples/pvc-pgvector.yaml -n rag-sample
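If you are unsure what an edited manifest should look like, the following is a minimal sketch of one of the claims. The claim name and size match the sample pipeline; the local-path storage class name and access mode are only examples:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nemollm-inference-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 50Gi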
Confirm PV and PVCs are bound:
$ kubectl get pv,pvc -n rag-sample
The following output applies to the Local Path Provisioner. The output is different if your storage class has a different volume binding mode.
NAMESPACE    NAME                                           STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
rag-sample   persistentvolumeclaim/nemollm-embedding-pvc    Pending                                      local-path     3h7m
rag-sample   persistentvolumeclaim/nemollm-inference-pvc    Pending                                      local-path     3h7m
rag-sample   persistentvolumeclaim/pgvector-pvc             Pending                                      local-path     3h7m
Edit the sample pipeline, config/samples/helmpipeline_app.yaml.
Required Changes
Update the imagePullSecret.password fields (two) with your NGC API key.
Update the secret.apiKey fields (two) with your NGC API key.
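To locate the fields to update, a simple search of the manifest can help. The field names come from the steps above; the exact indentation and layout of the file may differ:
$ grep -n -E 'password|apiKey' config/samples/helmpipeline_app.yaml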
Optional Customizations
Refer to Common Customizations for more information.
Create the pipeline:
$ kubectl apply -f config/samples/helmpipeline_app.yaml -n rag-sample
The deployment typically requires between 10 and 60 minutes to start the containers, download models from NGC, and become ready for service.
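If you prefer to block until the services are ready instead of polling, a wait command along these lines works; the timeout value is only a suggestion:
$ kubectl wait --for=condition=Ready pods --all -n rag-sample --timeout=60m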
Optional: Monitor the pods:
$ kubectl get pods -n rag-sample
Example Output
NAME                                   READY   STATUS    RESTARTS   AGE
frontend-7cb65566d4-mjlmq              1/1     Running   0          47h
nemollm-embedding-5bbc63f38d3b911f-0   1/1     Running   0          47h
nemollm-inference-5bbc63f38d3b911f-0   1/1     Running   0          47h
pgvector-0                             1/1     Running   0          47h
query-router-7fcd9ffd84-9f6j8          1/1     Running   0          47h
Access the sample chat application.
If your cluster is not configured to work with an external load balancer, you can port-forward the HTTP connection to the sample chat application.
Determine the node port for the sample chat application:
$ kubectl get service -n rag-sample frontend
In the following sample output, the application is listening on node port 32092.
NAME       TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
frontend   NodePort   10.99.219.137   <none>        8090:32092/TCP   47h
Forward the port:
$ kubectl port-forward service/frontend -n rag-sample 32092:8090
After you forward the port, you can access the application at http://localhost:32092.
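As a quick check that the forwarded port is serving traffic, you can request the page and inspect the HTTP status code. This check is a suggestion, not part of the original procedure:
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:32092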
Common Customizations
The sample pipeline, config/samples/helmpipeline_app.yaml, provides a simple baseline that you can customize.
| Field | Customization | Default Value |
|---|---|---|
| nodeSelector.nvidia.com/gpu.product | If your host has an NVIDIA H100 or L40S GPU, specify the product name. Run the command shown after this table to display the GPU models on your nodes. | NVIDIA-A100-80GB-PCIe |
| frontend.service.type | Specify loadBalancer if your cluster integrates with an external load balancer. | nodePort |
| chartValues.updateStrategy.type for nemollm-inference, nemollm-embedding, and pgvector | Specify OnDelete to prevent Kubernetes from performing a rolling update. This setting can be helpful to avoid an inference or embedding pod becoming stuck in Pending when you update the specification and no GPUs are allocatable. Refer to Update Strategies for stateful sets in the Kubernetes documentation. | RollingUpdate |
| query.deployStrategy.type and frontend.deployStrategy.type | Specify Recreate to have Kubernetes delete existing pods before creating new ones. This setting can be helpful to avoid the query pod becoming stuck in Pending when you update the specification and no GPUs are allocatable. By default, Kubernetes creates a new pod, including a GPU resource request, before deleting the currently running pod. Refer to Strategy for deployments in the Kubernetes documentation. | RollingUpdate |
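The table refers to a command for displaying GPU models; the original command is not reproduced here, but querying the nvidia.com/gpu.product node label that GPU Feature Discovery applies shows the same information. The following is a sketch:
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.metadata.labels.nvidia\.com/gpu\.product}{"\n"}{end}'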
Some models exceed the memory capacity of a single GPU and require more than one GPU.
To assign multiple GPUs to the inference container, modify the Helm pipeline manifest as shown in the following example. The model.numGpus field corresponds to the --num_gpus command-line argument of the nemollm_inference_ms command.
spec:
pipeline:
- repoEntry:
name: nemollm-inference
url: "file:///helm-charts/pipeline"
#url: "cm://rag-application/nemollm-inference"
chartSpec:
chart: "nemollm-inference"
wait: false
chartValues:
fullnameOverride: "nemollm-inference"
model:
name: llama-2-13b-chat
numGpus: 2
nodeSelector:
nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
resources:
limits:
nvidia.com/gpu: 2 # Number of GPUs to present to the running service
To assign multiple GPUs to the embedding container, modify the Helm pipeline manifest and specify the number in the resource request.
- repoEntry:
name: nemollm-embedding
url: "file:///helm-charts/pipeline"
#url: "cm://rag-application/nemollm-embedding"
chartSpec:
chart: "nemollm-embedding"
wait: false
chartValues:
fullnameOverride: "nemollm-embedding"
...
resources:
limits:
nvidia.com/gpu: 2 # Number of GPUs to present to the running service
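After the pods restart, one way to confirm that the GPU request took effect is to read the resource limits from the running inference pod. The label selector is the same one used later in this guide; this check is a suggestion:
$ kubectl get pods -n rag-sample -l app.kubernetes.io/name=nemollm-inference -o jsonpath='{.items[*].spec.containers[*].resources.limits}{"\n"}'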
About Choosing a Model
By default, the sample pipeline deploys the Llama-2-13B-Chat model. You can configure the pipeline to deploy a different inference model from your organization and team’s NGC Private Registry.
A model name, such as llama-2-70b-chat-4k-FP16-4-A100.24.01, has a GPU suffix, such as A100, and a model version suffix, such as 24.01. The GPU suffix must match the GPU model on the node that runs the nemollm-inference container. The model version suffix must match the tag on the nemollm-inference image.
You can access the registry and list models using a web browser or the NGC CLI:
- Web browser: Go to https://registry.ngc.nvidia.com/models. After you log in and set your organization and team, browse the models.
- NGC CLI: Use the ngc registry model list <organization>/* command to list the models. The output shows models such as mistral-7b-instruct; your NGC organization and team membership determines the models that are available to you.
  Use ngc registry model list <organization/team/model>:* to get the model versions. For example, run ngc registry model list "............/ea-participants/mistral-7b-instruct:*".
Based on the model and version names in the listing, consider the following requirements:
- The node that runs the nemollm-inference container must have an NVIDIA A100 80 GB GPU.
- The nemollm-inference image tag must match the model version suffix of 24.01.rc4 or 24.01.
Procedure
To change the inference model, perform the following steps:
Determine the version of the Inference Microservice image:
$ kubectl get pods -n rag-sample -l app.kubernetes.io/name=nemollm-inference -o=jsonpath='{.items[*].spec.containers[?(@.name=="nemollm-inference")].image}{"\n"}'
Example Output
nvcr.io/............/ea-participants/nemollm-inference-ms:24.01
Based on the example output, only models with a version suffix of 24.01 will work with the microservice image.
Edit the config/samples/helmpipeline_app.yaml file.
Modify the Inference Microservice specification and set the NGC_MODEL_NAME, NGC_MODEL_VERSION, and MODEL_NAME environment variables:
spec:
  pipeline:
  - repoEntry:
      name: nemollm-inference
      url: "file:///helm-charts/pipeline"
    # ...
    chartValues:
      model:
        name: mistral-7b-instruct
      # ...
      initContainers:
        ngcInit: # disabled by default
          # ...
          env:
            STORE_MOUNT_PATH: /model-store
            NGC_CLI_ORG: <org-name>    # ngc org where model lives
            NGC_CLI_TEAM: <team-name>  # ngc team where model lives
            NGC_MODEL_NAME: mistral-7b-instruct  # model name in ngc
            NGC_MODEL_VERSION: MISTRAL-7b-INSTRUCT-1-A100.24.01.rc4  # model version in ngc
            NGC_EXE: ngc               # path to ngc cli, if pre-installed in container
            DOWNLOAD_NGC_CLI: "false"  # set to string 'true' if container should download and install ngc cli
            NGC_CLI_VERSION: "3.37.1"  # version of ngc cli to download (only matters if downloading)
            TARFILE: yes               # tells the script to untar the model; defaults to "yes", set to "" to turn off
            MODEL_NAME: MISTRAL-7b-INSTRUCT-1-A100.24.01.rc4  # actual model name, once downloaded
Modify the query service and frontend specifications and set the APP_LLM_MODELNAME and APP_MODELNAME environment variables:
spec:
  pipeline:
  - repoEntry:
      name: rag-llm-app
      url: "file:///helm-charts/pipeline"
    # ...
    chartValues:
      # ...
      query:
        # ...
        env:
          # ...
          APP_LLM_MODELNAME: mistral-7b-instruct
      frontend:
        # ...
        env:
          # ...
          - name: APP_MODELNAME
            value: "mistral-7b-instruct"
Apply the configuration change:
$ kubectl apply -f config/samples/helmpipeline_app.yaml -n rag-sample
Downloading the model from NGC typically requires between 10 and 60 minutes.
Example Output
helmpipeline.package.nvidia.com/my-sample-pipeline configured
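While the new model downloads, the inference pod reports the status of its init container. Describing the pod, as in the following suggestion, shows the related events:
$ kubectl describe pod -n rag-sample -l app.kubernetes.io/name=nemollm-inference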
Optional: Monitor the progress:
$ kubectl get pods -n rag-sample
Example Output
NAME                                   READY   STATUS    RESTARTS   AGE
frontend-7cb65566d4-xhx8p              1/1     Running   0          100s
nemollm-embedding-5bbc63f38d3b911f-0   1/1     Running   0          40m
nemollm-inference-5bbc63f38d3b911f-0   1/1     Running   0          58s
pgvector-0                             1/1     Running   0          40m
query-router-7fcd9ffd84-pqxrq          1/1     Running   0          100s
Deleting a Pipeline
To delete a pipeline and remove the resources and objects associated with the services, perform the following steps:
View the Helm pipeline custom resources:
$ kubectl get helmpipelines -A
Example Output
NAMESPACE    NAME                 STATUS
rag-sample   my-sample-pipeline   deployed
Delete the custom resource:
$ kubectl delete helmpipeline -n rag-sample my-sample-pipeline
Example Output
helmpipeline.package.nvidia.com "my-sample-pipeline" deleted
If you do not plan to redeploy the pipeline, delete the persistent storage:
$ kubectl delete pvc -n rag-sample nemollm-embedding-pvc
$ kubectl delete pvc -n rag-sample nemollm-inference-pvc
$ kubectl delete pvc -n rag-sample pgvector-pvc
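To confirm that the claims and any bound volumes are gone, you can list them again. This check is a suggestion, not part of the original procedure:
$ kubectl get pv,pvc -n rag-sample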