Sample RAG Pipeline


The following figure shows a high-level overview of the software components in the pipeline.

[Figure: pipeline-components.png]

Prerequisites

  • Installed the NGC CLI on your client machine. You can download the CLI from https://ngc.nvidia.com/setup/installers/cli.

    Refer to the NVIDIA NGC User Guide for information about Generating Your NGC API key.

    Refer to the NVIDIA NGC CLI User Guide for information about how to set the config with the CLI for your organization and team.

  • Installed the NVIDIA GPU Operator and NVIDIA Enterprise RAG LLM Operator.

  • A default storage class for persistent volumes. The embedding and inference models are downloaded and stored in persistent storage. The sample pipeline uses three persistent volume claims:

    • nemollm-inference-pvc: 50 GB

    • nemollm-embedding-pvc: 50 GB

    • pgvector-pvc: 5 GB

    The preceding sample PVC sizes apply to the sample Helm pipeline. You might need to increase the sizes if you deploy different models.

    For VMware vSphere with Tanzu, NVIDIA recommends vSphere CNS. For Kubernetes, NVIDIA used the local-path-provisioner from Rancher Labs during development.
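
    Each claim follows the standard Kubernetes PersistentVolumeClaim format. The following is a minimal sketch of such a claim with an illustrative storage class name and the inference claim size from the preceding list; the actual manifests that ship with the sample pipeline might differ.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: nemollm-inference-pvc
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: local-path   # set to the storage class available in your cluster
      resources:
        requests:
          storage: 50Gi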

Special Considerations for VMware vSphere with Tanzu

If you install a persistent volume provisioner, such as Rancher Local Path Provisioner, you need to label the namespace to prevent the admission controller from enforcing the pod security policy.

Enter the following commands before creating the persistent volume claims:

$ kubectl create namespace local-path-provisioner
$ kubectl label --overwrite ns local-path-provisioner \
    pod-security.kubernetes.io/warn=privileged \
    pod-security.kubernetes.io/enforce=privileged

You also need to label the sample RAG pipeline namespace.

$ kubectl create namespace rag-sample
$ kubectl label --overwrite ns rag-sample \
    pod-security.kubernetes.io/warn=privileged \
    pod-security.kubernetes.io/enforce=privileged
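
As a quick, optional check before you continue, you can confirm that both namespaces carry the labels. This command is illustrative and uses only standard kubectl behavior:

$ kubectl get namespace local-path-provisioner rag-sample --show-labels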

Procedure

  1. Download the manifests for the sample pipeline.

    NVIDIA provides sample pipeline manifests for the TensorRT-LLM backend and the vLLM backend of NVIDIA NIM for LLMs. Refer to Supported Inference Models and GPU Requirements to determine which backend to use.

    $ ngc registry resource download-version ohlfw0olaadg/ea-participants/rag-sample-pipeline:24.03

    The NGC CLI downloads the manifests to a new directory, rag-sample-pipeline_v24.03. The manifests for persistent volume claims and the sample pipeline are in the new directory.

  2. Change directory to the sample manifests:

    $ cd rag-sample-pipeline_v24.03

  3. Create the namespace, if it isn’t already created:

    $ kubectl create namespace rag-sample

  4. Configure persistent storage.

    1. In the downloaded directory, edit the three persistent volume claim (PVC) manifests in the examples directory.

      Specify the spec.storageClassName value for each.

    2. Create the PVCs:

      $ kubectl apply -f examples/pvc-embedding.yaml -n rag-sample
      $ kubectl apply -f examples/pvc-inferencing.yaml -n rag-sample
      $ kubectl apply -f examples/pvc-pgvector.yaml -n rag-sample

    3. Confirm the PVCs are created:

      $ kubectl get pvc -n rag-sample

      The following output applies to the Local Path Provisioner. The output is different if your storage class has a different volume binding mode.

      NAMESPACE    NAME                                           STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      rag-sample   persistentvolumeclaim/nemollm-embedding-pvc    Pending                                      local-path     3h7m
      rag-sample   persistentvolumeclaim/nemollm-inference-pvc    Pending                                      local-path     3h7m
      rag-sample   persistentvolumeclaim/pgvector-pvc             Pending                                      local-path     3h7m

  5. Add secrets that use your NGC CLI API key.

    • Add a Docker registry secret that the Operator uses for pulling containers from NGC:

      $ kubectl create secret -n rag-sample docker-registry ngc-secret \
          --docker-server=nvcr.io \
          --docker-username='$oauthtoken' \
          --docker-password=<ngc-cli-api-key>

    • Add a generic secret that the init containers for the inference and embedding services use to download models:

      $ kubectl create secret -n rag-sample generic ngc-api-secret \
          --from-literal=NGC_CLI_API_KEY=<ngc-cli-api-key>

  6. To use the vLLM backend and models from Hugging Face Model Hub, add a secret with your personal access token for downloading models from the hub:

    $ kubectl create secret -n rag-sample generic hf-secret \
        --from-literal=HF_USER=<user%40example.com> \
        --from-literal=HF_PAT=<hf-personal-access-token>

    To use an @ character in the HF_USER value, specify the URL-encoded value, %40.

  7. Edit a sample pipeline file, config/samples/helmpipeline_<app|nemo>_<backend>.yaml.

    Consider the customization options, multi-GPU assignment, and model selection guidance described later on this page.

  8. Create the pipeline:

    $ kubectl apply -f config/samples/helmpipeline_<app|nemo>_<backend>.yaml -n rag-sample

    The deployment typically requires between 10 and 60 minutes to start the containers, download models from NGC, and become ready for service.

  9. Optional: Monitor the pods:

    $ kubectl get pods -n rag-sample

    Example Output

    NAME                                   READY   STATUS    RESTARTS   AGE
    chain-server-5c9c9f7d75-ggw2h          1/1     Running   0          42m
    nemollm-embedding-5bbc63f38d3b911f-0   1/1     Running   0          42m
    nemollm-inference-5bbc63f38d3b911f-0   1/1     Running   0          29m
    pgvector-0                             1/1     Running   0          42m
    rag-playground-cb67f854b-6jkzb         1/1     Running   0          42m

  10. Access the sample chat application.

    If your cluster is not configured to work with an external load balancer, you can port-forward the HTTP connection to the sample chat application.

    • Determine the node port for the sample chat application:

      $ kubectl get service -n rag-sample rag-playground

      In the following sample output, the application is listening on node port 32092.

      NAME             TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
      rag-playground   NodePort   10.99.219.137   <none>        3001:32092/TCP   141m

    • Forward the port:

      $ kubectl port-forward service/rag-playground -n rag-sample 32092:3001

    After you forward the port, you can access the application at http://localhost:32092.

  11. Optional: Access the Chain Server. This step can be useful if you want to develop your own chat client or chat templates.

    If your cluster is not configured to work with an external load balancer, you can port-forward the HTTP connection to the Chain Server.

    • Patch the service to change it from cluster IP to node port:

      $ kubectl patch service -n rag-sample chain-server -p '{"spec":{"type":"NodePort"}}'

    • Determine the node port for the chain server:

      $ kubectl get service -n rag-sample chain-server

      In the following sample output, the server is listening on node port 31157.

      NAME           TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
      chain-server   NodePort   10.108.98.47   <none>        8081:31157/TCP   14d

    • Forward the port:

      $ kubectl port-forward service/chain-server -n rag-sample 31157:8081

    After you forward the port, you can access the server at http://localhost:31157. You can view the API at http://localhost:31157/docs.
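
As a quick, optional check after forwarding the port, you can confirm that the Chain Server responds. The following curl command is illustrative and assumes the port-forward from the preceding step is active; an HTTP 200 status code indicates the server is reachable.

$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:31157/docs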

The sample pipelines provide a simple baseline that you can customize.

Field: nodeSelector.nvidia.com/gpu.product
Default value: NVIDIA-A100-80GB-PCIe
Customization: If your host has an NVIDIA H100 or L40S GPU, specify the product name. Run the following command to display the GPU models on your nodes:

  kubectl get nodes -l nvidia.com/gpu.present=true \
    -o=jsonpath='{range .items[*]}{.metadata.name}: {.metadata.labels.nvidia\.com\/gpu\.product}{"\n"}{end}'

Field: frontend.service.type
Default value: nodePort
Customization: Specify loadBalancer if your cluster integrates with an external load balancer.

Field: chartValues.updateStrategy.type for nemollm-inference, nemollm-embedding, and pgvector
Default value: RollingUpdate
Customization: Specify OnDelete to prevent Kubernetes from performing a rolling update. This setting can help avoid an inference or embedding pod becoming stuck in Pending when you update the specification and no GPUs are allocatable. Refer to Update Strategies for stateful sets in the Kubernetes documentation.

Field: query.deployStrategy.type and frontend.deployStrategy.type
Default value: RollingUpdate
Customization: Specify Recreate to have Kubernetes delete existing pods before creating new ones. This setting can help avoid the query pod becoming stuck in Pending when you update the specification and no GPUs are allocatable. By default, Kubernetes creates a new pod, including a GPU resource request, before deleting the currently running pod. Refer to Strategy for deployments in the Kubernetes documentation.
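
As an illustration of where these fields sit in the pipeline manifest, the following sketch sets the OnDelete update strategy for the inference service. It assumes the chartValues layout shown in the multi-GPU example that follows; the surrounding fields are omitted.

chartValues:
  fullnameOverride: "nemollm-inference"
  updateStrategy:
    type: OnDelete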

Some models exceed the memory capacity of a single GPU and require more than one GPU.

To assign multiple GPUs to the inference container, modify the Helm pipeline manifest like the following example. The model.numGpus field corresponds to the --num_gpus command-line argument to the nemollm_inference_ms command.

spec:
  pipeline:
    - repoEntry:
        name: nemollm-inference
        url: "file:///helm-charts/pipeline"
        #url: "cm://rag-application/nemollm-inference"
      chartSpec:
        chart: "nemollm-inference"
        wait: false
      chartValues:
        fullnameOverride: "nemollm-inference"
        model:
          name: llama-2-13b-chat
          numGpus: 2
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
        resources:
          limits:
            nvidia.com/gpu: 2  # Number of GPUs to present to the running service

To assign multiple GPUs to the embedding container, modify the Helm pipeline manifest and specify the number in the resource request.

- repoEntry:
    name: nemollm-embedding
    url: "file:///helm-charts/pipeline"
    #url: "cm://rag-application/nemollm-embedding"
  chartSpec:
    chart: "nemollm-embedding"
    wait: false
  chartValues:
    fullnameOverride: "nemollm-embedding"
    ...
    resources:
      limits:
        nvidia.com/gpu: 2  # Number of GPUs to present to the running service

About Choosing a Model

By default, the sample pipeline deploys the Llama-2-13B-Chat model. You can configure the pipeline to deploy a different inference model from your organization and team’s NGC Private Registry.

Important

A model, such as Llama-2-13B-Chat, has a model version, such as a100x2_fp16_24.02. The model version encodes the following information:

  • Required GPU model, such as A100.

  • Required GPU count, such as 2.

  • Model release, such as 24.02.

For the Operator to deploy a model from NGC and use the TensorRT-LLM backend, the GPU model and count must match a node in your cluster. The model release must match the NIM for LLMs container image tag. You can locate the image tag in the Helm pipeline manifest file.
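
For example, you can search the manifest for the image reference. The following command is illustrative and assumes the image repository name contains nim_llm, as in the example output later in this procedure:

$ grep -n "nim_llm" config/samples/helmpipeline_<app|nemo>_trtllm.yaml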

You can access the registry and list models using a web browser or the NGC CLI:

Web browser

Go to https://registry.ngc.nvidia.com/models. After you log in and set your organization and team, browse the models.

NGC CLI

Use the ngc registry model list <organization>/* command to list the models.

The output shows the models that are available to you, such as the mixtral-8x7b-instruct-v0-1 model. Your NGC organization and team membership determines which models you can access.

Use the ngc registry model list <organization/team/model>:* command to list the model versions. For example, run ngc registry model list "............/ea-participants/mixtral-8x7b-instruct-v0-1:*".

Based on a model version such as a100x4_fp16_24.02 for mixtral-8x7b-instruct-v0-1, consider the following requirements:

  • The node that runs the NIM for LLMs container must have four NVIDIA A100 80 GB GPUs or four H100 80 GB GPUs.

  • The NIM for LLMs image tag must match the model release value, 24.02.

Procedure

To change the inference model, perform the following steps:

  1. Determine the version of the NIM for LLMs image:

    $ kubectl get pods -n rag-sample -l app.kubernetes.io/name=nemollm-inference -o=jsonpath='{.items[*].spec.containers[?(@.name=="nemollm-inference")].image}{"\n"}'

    Example Output

    nvcr.io/ohlfw0olaadg/ea-participants/nim_llm:24.02

    Based on the example output, only models with a version suffix of 24.02 will operate with the microservice image.

  2. Edit the config/samples/helmpipeline_<app|nemo>_trtllm.yaml file.

    • Modify the NIM for LLMs specification and set the NGC_MODEL_NAME, NGC_MODEL_VERSION, and MODEL_NAME environment variables:

      spec:
        pipeline:
          - repoEntry:
              name: nemollm-inference
              url: "file:///helm-charts/pipeline"
            # ...
            chartValues:
              model:
                name: mixtral-8x7b-instruct-v0-1
              # ...
              initContainers:
                ngcInit:  # disabled by default
                  # ...
                  env:
                    STORE_MOUNT_PATH: /model-store
                    NGC_CLI_ORG: <org-name>  # ngc org where model lives
                    NGC_CLI_TEAM: <team-name>  # ngc team where model lives
                    NGC_MODEL_NAME: mixtral-8x7b-instruct-v0-1  # model name in ngc
                    NGC_MODEL_VERSION: a100x4_fp16_24.02  # model version in ngc
                    NGC_EXE: ngc  # path to ngc cli, if pre-installed in container
                    DOWNLOAD_NGC_CLI: "false"  # set to string 'true' if container should download and install ngc cli
                    NGC_CLI_VERSION: "3.37.1"  # version of ngc cli to download (only matters if downloading)
                    TARFILE: yes  # tells the script to untar the model. defaults to "yes", set to "" to turn off
                    MODEL_NAME: mixtral-8x7b-instruct-v0-1  # actual model name, once downloaded

    • Modify the query service specification and set the APP_LLM_MODELNAME environment variable:

      spec:
        pipeline:
          - repoEntry:
              name: rag-llm-app
              url: "file:///helm-charts/pipeline"
            # ...
            chartValues:
              # ...
              query:
                # ...
                env:
                  # ...
                  APP_LLM_MODELNAME: mixtral-8x7b-instruct-v0-1
              frontend:
                # ...
                env:
                  # ...
                  - name: APP_MODELNAME
                    value: "mixtral-8x7b-instruct-v0-1"

  3. Apply the configuration change:

    $ kubectl apply -f config/samples/helmpipeline_<app|nemo>_trtllm.yaml -n rag-sample

    Downloading the model from NGC typically requires between 10 and 60 minutes.

    Example Output

    helmpipeline.package.nvidia.com/my-sample-pipeline configured

  4. Optional: Monitor the progress:

    $ kubectl get pods -n rag-sample

    Example Output

    NAME                                   READY   STATUS     RESTARTS   AGE
    chain-server-6d66c45bb9-rpjjz          1/1     Running    0          4m25s
    nemollm-embedding-5bbc63f38d3b911f-0   1/1     Running    0          120m
    nemollm-inference-5bbc63f38d3b911f-0   0/1     Init:0/1   0          3m34s
    pgvector-0                             1/1     Running    0          4d
    rag-playground-cb67f854b-6jkzb         1/1     Running    0          4d

Changing the inference model for the vLLM backend requires stopping the inference container, deleting and recreating the persistent volume used by the inference container, and then applying the Helm pipeline manifest that specifies the model.

When you use the vLLM backend, the inference container has an init container, hf-model-puller, that downloads the inference model from Hugging Face Model Hub. The init container uses Git to clone the model repository using a repository address like the following:

https://${HF_USER}:${HF_PAT}@huggingface.co/${MODEL_ORG}/${MODEL_NAME}

The HF_USER and HF_PAT values are supplied from the hf-secret in the same namespace as the inference container. You specify the model organization and model name in the Helm pipeline manifest. For example, to clone the model at https://huggingface.co/meta-llama/Llama-2-70b-chat-hf, specify HF_MODEL_ORG: meta-llama and HF_MODEL_NAME: Llama-2-70b-chat-hf to complete the Git repository address.
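
For that example, the init container assembles a clone address of the following form, with the credential placeholders filled in from hf-secret:

https://<hf-user>:<hf-pat>@huggingface.co/meta-llama/Llama-2-70b-chat-hf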

Important

The personal access token associated with your Hugging Face account must have access to the model organization and model.

The PVC for the inference container must have sufficient free disk space. For example, the Llama-2-70b-chat-hf model requires more than 280 GB of storage.
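
Before switching models, you can check the capacity of the existing inference claim with the following standard command; if the capacity is smaller than the new model requires, increase the request in examples/pvc-inferencing.yaml when you recreate the claim in the steps below.

$ kubectl get pvc -n rag-sample nemollm-inference-pvc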

  1. Edit the Helm pipeline manifest in the config/samples/ directory.

    • Specify model.name and model.config values for the nemollm-inference container.

    • Specify MODEL_ORG and MODEL_NAME values for the nemollm-inference init container.

    • Specify APP_LLM_MODELNAME for the chain-server container.

    spec:
      pipeline:
        - repoEntry:
            name: nemollm-inference
            ...
          chartValues:
            fullnameOverride: "nemollm-inference"
            backend: "vllm"
            model:
              name: Llama-2-70b-chat
              config: /model-store/model_config.yaml
            ...
            initContainers:
              ngcInit: []  # disabled by default
              hfInit:
                imageName: bitnami/git
                imageTag: latest
                secret:  # name of kube secret for hf with keys named HF_USER and HF_PAT
                  name: hf-secret
                env:
                  STORE_MOUNT_PATH: /model-store
                  HF_MODEL_NAME: Llama-2-70b-chat-hf  # HF model name
                  HF_MODEL_ORG: meta-llama  # HF org where model lives
                  USE_SHALLOW_LFS_CLONE: 0  # Disable shallow LFS clone by default
            ...
        - repoEntry:
            name: rag-llm-app
            ...
          chartValues:
            query:
              ...
              env:
                APP_LLM_MODELNAME: Llama-2-70b-chat
              ...
            ...

  2. Delete the stateful set for the inference container:

    $ kubectl delete sts -n rag-sample -lapp.kubernetes.io/name=nemollm-inference

  3. Delete the PVC for the inference container:

    $ kubectl delete -n rag-sample -f examples/pvc-inferencing.yaml

  4. Recreate the PVC for the inference container:

    $ kubectl apply -n rag-sample -f examples/pvc-inferencing.yaml

  5. Apply the Helm pipeline manifest to download and use the new model:

    $ kubectl apply -n rag-sample -f config/samples/helmpipeline_<app|nemo>_vllm.yaml

  6. Optional: Monitor the model download:

    $ kubectl logs -n rag-sample -lapp.kubernetes.io/name=nemollm-inference -c hf-model-puller

    Example Output

    ...
    pipefail        off
    posix           off
    privileged      off
    verbose         off
    vi              off
    xtrace          off
    downloading model meta-llama/Llama-2-70b-chat-hf
    Git LFS initialized.
    Cloning into '/model-store/model-store'...
    Filtering content: ...

Use the RAG playground user interface or access the chain server API to use the new model.

Some models exceed the memory capacity of a single GPU and require more than one GPU.

To assign multiple GPUs to the inference container, modify the Helm pipeline manifest like the following example. The tensor_parallel_size field specifies the number of GPUs to use.

spec:
  pipeline:
    - repoEntry:
        name: nemollm-inference
        url: "file:///helm-charts/pipeline"
        #url: "cm://rag-application/nemollm-inference"
      chartSpec:
        chart: "nemollm-inference"
        wait: false
      chartValues:
        fullnameOverride: "nemollm-inference"
        model:
          name: llama-2-13b-chat
          numGpus: 2  # The vLLM backend does not use this field.
          # num_workers: 1
          vllm_config:
            engine:
              model: /model-store
              enforce_eager: false
              max_context_len_to_capture: 8192
              max_num_seqs: 256
              dtype: float16
              tensor_parallel_size: 2
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
        resources:
          limits:
            nvidia.com/gpu: 2  # Number of GPUs to present to the running service

The fields beneath chartValues.model.vllm_config correspond to the VllmEngine object for NIM for LLMs. Refer to Model Configuration Values for vLLM in the NVIDIA NIM for LLMs documentation.

To delete a pipeline and remove the resources and objects associated with the services, perform the following steps:

  1. View the Helm pipeline custom resources:

    $ kubectl get helmpipelines -A

    Example Output

    NAMESPACE    NAME                 STATUS
    rag-sample   my-sample-pipeline   deployed

  2. Delete the custom resource:

    $ kubectl delete helmpipeline -n rag-sample my-sample-pipeline

    Example Output

    helmpipeline.package.nvidia.com "my-sample-pipeline" deleted

  3. If you do not plan to redeploy the pipeline, delete the persistent storage:

    $ kubectl delete pvc -n rag-sample nemollm-embedding-pvc
    $ kubectl delete pvc -n rag-sample nemollm-inference-pvc
    $ kubectl delete pvc -n rag-sample pgvector-pvc
