NVIDIA Enterprise RAG LLM Operator

Sample RAG Pipeline

The following figure shows a high-level overview of the software components in the pipeline.

[Figure pipeline-components.png: High-level overview of the software components in the pipeline]

Prerequisites

  • Installed the NGC CLI on your client machine. You can download the CLI from https://ngc.nvidia.com/setup/installers/cli.

    Refer to the NVIDIA NGC User Guide for information about Generating Your NGC API key.

    Refer to the NVIDIA NGC CLI User Guide for information about how to set the config with the CLI for your organization and team.

  • Installed the NVIDIA GPU Operator and NVIDIA Enterprise RAG LLM Operator.

  • A default storage class for persistent volumes. The embedding and inference models are downloaded and stored in persistent storage. The sample pipeline uses three persistent volume claims:

    • nemollm-inference-pvc: 50 GB

    • nemollm-embedding-pvc: 50 GB

    • pgvector-pvc: 5 GB

    The preceding sample PVC sizes apply to the sample Helm pipeline. You might need to increase the sizes if you deploy different models.

    For VMware vSphere with Tanzu, NVIDIA recommends vSphere CNS. For Kubernetes, NVIDIA used the local-path-provisioner from Rancher Labs during development.

Special Considerations for VMware vSphere with Tanzu

If you install a persistent volume provisioner, such as Rancher Local Path Provisioner, you need to label the namespace to prevent the admission controller from enforcing the pod security policy.

Enter the following commands before creating the persistent volume claims:

$ kubectl create namespace local-path-provisioner
$ kubectl label --overwrite ns local-path-provisioner pod-security.kubernetes.io/warn=privileged pod-security.kubernetes.io/enforce=privileged

You also need to label the sample RAG pipeline namespace.

$ kubectl create namespace rag-sample
$ kubectl label --overwrite ns rag-sample pod-security.kubernetes.io/warn=privileged pod-security.kubernetes.io/enforce=privileged

Procedure

  1. Download the manifests for the sample pipeline:

    $ ngc registry resource download-version ohlfw0olaadg/ea-rag-examples/rag-sample-pipeline:0.4.0

    The NGC CLI downloads the manifests to a new directory, rag-sample-pipeline_v0.4.0. The manifests for persistent volume claims and the sample pipeline are in the new directory.
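    The download requires an NGC CLI that is already configured with your API key, organization, and team, as described in the Prerequisites. If the CLI is not configured yet, ngc config set prompts for these values interactively; the following is a minimal sketch:

    # One-time, interactive NGC CLI configuration: supply the NGC API key,
    # organization, and team that have access to the sample pipeline resource.
    $ ngc config set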

  2. Create the namespace, if it isn’t already created:

    $ kubectl create namespace rag-sample

  3. Configure persistent storage.

    1. In the downloaded directory, edit the three persistent volume claim (PVC) manifests in the examples directory.

      Specify the spec.storageClassName value for each.
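      For reference, an edited examples/pvc-embedding.yaml might look like the following sketch. The accessModes value is an assumption for a typical single-node model store, and local-path stands in for whatever storage class your cluster provides; keep the claim name and size from the downloaded manifest.

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: nemollm-embedding-pvc
      spec:
        accessModes:
          - ReadWriteOnce              # assumption: single-node access to the model store
        storageClassName: local-path   # replace with your cluster's storage class
        resources:
          requests:
            storage: 50Gi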

    2. Create the PVCs:

      $ kubectl apply -f examples/pvc-embedding.yaml -n rag-sample
      $ kubectl apply -f examples/pvc-inferencing.yaml -n rag-sample
      $ kubectl apply -f examples/pvc-pgvector.yaml -n rag-sample

    3. Confirm PV and PVCs are bound:

      $ kubectl get pv,pvc -n rag-sample

      The following output applies to the Local Path Provisioner. The output is different if your storage class has a different volume binding mode.

      NAMESPACE    NAME                                           STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      rag-sample   persistentvolumeclaim/nemollm-embedding-pvc   Pending                                       local-path     3h7m
      rag-sample   persistentvolumeclaim/nemollm-inference-pvc   Pending                                       local-path     3h7m
      rag-sample   persistentvolumeclaim/pgvector-pvc            Pending                                       local-path     3h7m
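      With the Local Path Provisioner, the claims remain Pending until a pod consumes them because the storage class uses the WaitForFirstConsumer volume binding mode. You can check the binding mode of your storage class with:

      $ kubectl get storageclass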

  4. Edit the sample pipeline, config/samples/helmpipeline_app.yaml.

    Required Changes

    • Update the two imagePullSecret.password fields with your NGC API key.

    • Update the two secret.apiKey fields with your NGC API key.
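    If you are not sure where these fields appear in the manifest, you can list every occurrence before editing; the following is a simple sketch:

    # Show the line numbers of the fields that must be updated with the NGC API key.
    $ grep -n -e "imagePullSecret" -e "apiKey" config/samples/helmpipeline_app.yaml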

    Optional Customizations

    For optional customizations, such as the node selector for your GPU model and the frontend service type, refer to the common customizations described after this procedure.

  5. Create the pipeline:

    $ kubectl apply -f config/samples/helmpipeline_app.yaml -n rag-sample

    The deployment typically requires between 10 and 60 minutes to start the containers, download models from NGC, and become ready for service.
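    Instead of polling, you can optionally block until every pod in the namespace reports Ready; the timeout below is an arbitrary upper bound, so adjust it for your environment:

    $ kubectl wait --for=condition=Ready pods --all -n rag-sample --timeout=60m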

  6. Optional: Monitor the pods:

    $ kubectl get pods -n rag-sample

    Example Output

    NAME                                   READY   STATUS    RESTARTS   AGE
    frontend-7cb65566d4-mjlmq              1/1     Running   0          47h
    nemollm-embedding-5bbc63f38d3b911f-0   1/1     Running   0          47h
    nemollm-inference-5bbc63f38d3b911f-0   1/1     Running   0          47h
    pgvector-0                             1/1     Running   0          47h
    query-router-7fcd9ffd84-9f6j8          1/1     Running   0          47h

  7. Access the sample chat application.

    If your cluster is not configured to work with an external load balancer, you can port-forward the HTTP connection to the sample chat application.

    • Determine the node port for the sample chat application:

      $ kubectl get service -n rag-sample frontend

      In the following sample output, the application is listening on node port 32092.

      NAME       TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
      frontend   NodePort   10.99.219.137   <none>        8090:32092/TCP   47h

    • Forward the port:

      $ kubectl port-forward service/frontend -n rag-sample 32092:8090

    After you forward the port, you can access the application at http://localhost:32092.
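    To quickly confirm that the forwarded port is serving the application, you can request the page and print the HTTP status code; any 2xx or 3xx response indicates the frontend is reachable:

      $ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:32092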

Common Customizations

The sample pipeline, config/samples/helmpipeline_app.yaml, provides a simple baseline that you can customize.

Field: nodeSelector.nvidia.com/gpu.product
Default Value: NVIDIA-A100-80GB-PCIe
Customization: If your host has an NVIDIA H100 or L40S GPU, specify the product name. Run the following command to display the GPU models on your nodes:

kubectl get nodes -l nvidia.com/gpu.present=true \
  -o=jsonpath='{range .items[*]}{.metadata.name}: {.metadata.labels.nvidia\.com\/gpu\.product}{"\n"}{end}'

Field: frontend.service.type
Default Value: nodePort
Customization: Specify loadBalancer if your cluster integrates with an external load balancer.

Field: chartValues.updateStrategy.type for nemollm-inference, nemollm-embedding, and pgvector
Default Value: RollingUpdate
Customization: Specify OnDelete to prevent Kubernetes from performing a rolling update. This setting can help avoid an inference or embedding pod becoming stuck in Pending when you update the specification and no GPUs are allocatable. Refer to Update Strategies for stateful sets in the Kubernetes documentation.

Field: query.deployStrategy.type and frontend.deployStrategy.type
Default Value: RollingUpdate
Customization: Specify Recreate to have Kubernetes delete existing pods before creating new ones. This setting can help avoid the query pod becoming stuck in Pending when you update the specification and no GPUs are allocatable. By default, Kubernetes creates a new pod, including a GPU resource request, before deleting the currently running pod. Refer to Strategy for deployments in the Kubernetes documentation.
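For reference, the following is a minimal sketch of where the update-strategy and deploy-strategy keys sit in config/samples/helmpipeline_app.yaml, based on the manifest fragments shown later in this guide. Only the relevant keys are shown and the surrounding chart values are elided; verify the exact structure against your downloaded manifest.

spec:
  pipeline:
    - repoEntry:
        name: nemollm-inference
      # ...
      chartValues:
        # ...
        updateStrategy:
          type: OnDelete    # avoid a rolling update for the stateful inference service
    - repoEntry:
        name: rag-llm-app
      # ...
      chartValues:
        # ...
        query:
          deployStrategy:
            type: Recreate  # delete the existing pod before creating the new one
        frontend:
          deployStrategy:
            type: Recreate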

Assigning Multiple GPUs to a Service

Some models exceed the memory capacity of a single GPU and require more than one GPU.

To assign multiple GPUs to the inference container, modify the Helm pipeline manifest as shown in the following example. The model.numGpus field corresponds to the --num_gpus command-line argument of the nemollm_inference_ms command.

spec:
  pipeline:
    - repoEntry:
        name: nemollm-inference
        url: "file:///helm-charts/pipeline"
        #url: "cm://rag-application/nemollm-inference"
      chartSpec:
        chart: "nemollm-inference"
        wait: false
      chartValues:
        fullnameOverride: "nemollm-inference"
        model:
          name: llama-2-13b-chat
          numGpus: 2
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
        resources:
          limits:
            nvidia.com/gpu: 2  # Number of GPUs to present to the running service

To assign multiple GPUs to the embedding container, modify the Helm pipeline manifest and specify the number in the resource request.

- repoEntry:
    name: nemollm-embedding
    url: "file:///helm-charts/pipeline"
    #url: "cm://rag-application/nemollm-embedding"
  chartSpec:
    chart: "nemollm-embedding"
    wait: false
  chartValues:
    fullnameOverride: "nemollm-embedding"
    ...
    resources:
      limits:
        nvidia.com/gpu: 2  # Number of GPUs to present to the running service
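Before requesting multiple GPUs for a service, you can confirm how many GPUs a node reports as allocatable. The following sketch uses a placeholder node name; substitute one of your GPU nodes:

$ kubectl describe node <node-name> | grep -i 'nvidia.com/gpu'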

About Choosing a Model

By default, the sample pipeline deploys the Llama-2-13B-Chat model. You can configure the pipeline to deploy a different inference model from your organization and team’s NGC Private Registry.

Important

A model name, such as llama-2-70b-chat-4k-FP16-4-A100.24.01, has a GPU suffix, such as A100, and a model version suffix, such as 24.01. The GPU suffix must match the GPU model on the node that runs the nemollm-inference container. The model version suffix must match the tag on the nemollm-inference image.
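One way to check both constraints against your cluster, reusing commands shown elsewhere in this guide, is the following sketch:

# GPU model on the GPU nodes; it must match the model's GPU suffix, such as A100.
$ kubectl get nodes -l nvidia.com/gpu.present=true \
    -o=jsonpath='{range .items[*]}{.metadata.name}: {.metadata.labels.nvidia\.com\/gpu\.product}{"\n"}{end}'

# Image tag of the running Inference Microservice; it must match the model version suffix, such as 24.01.
$ kubectl get pods -n rag-sample -l app.kubernetes.io/name=nemollm-inference \
    -o=jsonpath='{.items[*].spec.containers[?(@.name=="nemollm-inference")].image}{"\n"}'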

You can access the registry and list models using a web browser or the NGC CLI:

Web browser

Go to https://registry.ngc.nvidia.com/models. After you log in and set your organization and team, browse the models.

NGC CLI

Use the ngc registry model list <organization>/* command to list the models.

For example, the model listing can include the mistral-7b-instruct model. Your NGC organization and team membership determines the models that are available to you.

Use the ngc registry model list <organization/team/model>:* command to get the model versions. For example, run ngc registry model list "............/ea-participants/mistral-7b-instruct:*".
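For example, with placeholder values for the organization and team, the listing commands look like the following sketch; substitute your own organization and team names:

# List the models that are visible to your organization.
$ ngc registry model list "<org-name>/*"

# List the available versions of a specific model.
$ ngc registry model list "<org-name>/<team-name>/mistral-7b-instruct:*"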

Based on the model versions that the preceding command returns, consider the following requirements:

  • The node that runs the nemollm-inference container must have an NVIDIA A100 80 GB GPU.

  • The nemollm-inference image tag must match the model version suffix of 24.01.rc4 or 24.01.

Procedure

To change the inference model, perform the following steps:

  1. Determine the version of the Inference Microservice image:

    $ kubectl get pods -n rag-sample -l app.kubernetes.io/name=nemollm-inference -o=jsonpath='{.items[*].spec.containers[?(@.name=="nemollm-inference")].image}{"\n"}'

    Example Output

    nvcr.io/............/ea-participants/nemollm-inference-ms:24.01

    Based on the example output, only models with a version suffix of 24.01 will work with the microservice image.

  2. Edit the config/samples/helmpipeline_app.yaml file.

    • Modify the Inference Microservice specification and set the NGC_MODEL_NAME, NGC_MODEL_VERSION, and MODEL_NAME environment variables:

      spec:
        pipeline:
          - repoEntry:
              name: nemollm-inference
              url: "file:///helm-charts/pipeline"
            # ...
            chartValues:
              model:
                name: mistral-7b-instruct
              # ...
              initContainers:
                ngcInit:  # disabled by default
                  # ...
                  env:
                    STORE_MOUNT_PATH: /model-store
                    NGC_CLI_ORG: <org-name>  # ngc org where model lives
                    NGC_CLI_TEAM: <team-name>  # ngc team where model lives
                    NGC_MODEL_NAME: mistral-7b-instruct  # model name in ngc
                    NGC_MODEL_VERSION: MISTRAL-7b-INSTRUCT-1-A100.24.01.rc4  # model version in ngc
                    NGC_EXE: ngc  # path to ngc cli, if pre-installed in container
                    DOWNLOAD_NGC_CLI: "false"  # set to string 'true' if container should download and install ngc cli
                    NGC_CLI_VERSION: "3.37.1"  # version of ngc cli to download (only matters if downloading)
                    TARFILE: yes  # tells the script to untar the model. defaults to "yes", set to "" to turn off
                    MODEL_NAME: MISTRAL-7b-INSTRUCT-1-A100.24.01.rc4  # actual model name, once downloaded

    • Modify the query service specification and set APP_LLM_MODELNAME environment variable:

      spec:
        pipeline:
          - repoEntry:
              name: rag-llm-app
              url: "file:///helm-charts/pipeline"
            # ...
            chartValues:
              # ...
              query:
                # ...
                env:
                  # ...
                  APP_LLM_MODELNAME: mistral-7b-instruct
              frontend:
                # ...
                env:
                  # ...
                  - name: APP_MODELNAME
                    value: "mistral-7b-instruct"

  3. Apply the configuration change:

    $ kubectl apply -f config/samples/helmpipeline_app.yaml -n rag-sample

    Downloading the model from NGC typically requires between 10 and 60 minutes.

    Example Output

    helmpipeline.package.nvidia.com/my-sample-pipeline configured

  4. Optional: Monitor the progress:

    $ kubectl get pods -n rag-sample

    Example Output

    NAME                                   READY   STATUS    RESTARTS   AGE
    frontend-7cb65566d4-xhx8p              1/1     Running   0          100s
    nemollm-embedding-5bbc63f38d3b911f-0   1/1     Running   0          40m
    nemollm-inference-5bbc63f38d3b911f-0   1/1     Running   0          58s
    pgvector-0                             1/1     Running   0          40m
    query-router-7fcd9ffd84-pqxrq          1/1     Running   0          100s

Deleting a Pipeline

To delete a pipeline and remove the resources and objects associated with the services, perform the following steps:

  1. View the Helm pipeline custom resources:

    $ kubectl get helmpipelines -A

    Example Output

    NAMESPACE    NAME                 STATUS
    rag-sample   my-sample-pipeline   deployed

  2. Delete the custom resource:

    $ kubectl delete helmpipeline -n rag-sample my-sample-pipeline

    Example Output

    helmpipeline.package.nvidia.com "my-sample-pipeline" deleted

  3. If you do not plan to redeploy the pipeline, delete the persistent storage:

    $ kubectl delete pvc -n rag-sample nemollm-embedding-pvc
    $ kubectl delete pvc -n rag-sample nemollm-inference-pvc
    $ kubectl delete pvc -n rag-sample pgvector-pvc
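    If you created the namespace solely for this pipeline and want to remove it entirely, including any remaining objects in it, you can optionally delete the namespace:

    $ kubectl delete namespace rag-sample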
