The following figure shows a high-level overview of the software components in the pipeline.
Prerequisites
Installed the NGC CLI on your client machine. You can download the CLI from https://ngc.nvidia.com/setup/installers/cli.
Refer to the NVIDIA NGC User Guide for information about Generating Your NGC API key.
Refer to the NVIDIA NGC CLI User Guide for information about how to set the config with the CLI for your organization and team.
Installed the NVIDIA GPU Operator and NVIDIA Enterprise RAG LLM Operator.
A default storage class for persistent volumes. The embedding and inference models are downloaded and stored in persistent storage. The sample pipeline uses three persistent volume claims:
nemollm-inference-pvc: 50 GB
nemollm-embedding-pvc: 50 GB
pgvector-pvc: 5 GB
The preceding sample PVC sizes apply to the sample Helm pipeline. You might need to increase the sizes if you deploy different models.
For VMware vSphere with Tanzu, NVIDIA recommends vSphere CNS. For Kubernetes, NVIDIA used the local-path-provisioner from Rancher Labs during development.
Special Considerations for VMware vSphere with Tanzu
If you install a persistent volume provisioner, such as Rancher Local Path Provisioner, you need to label the namespace to prevent the admission controller from enforcing the pod security policy.
Enter the following commands before creating the persistent volume claims:
$ kubectl create namespace local-path-provisioner
$ kubectl label --overwrite ns local-path-provisioner pod-security.kubernetes.io/warn=privileged pod-security.kubernetes.io/enforce=privileged
You also need to label the sample RAG pipeline namespace.
$ kubectl create namespace rag-sample
$ kubectl label --overwrite ns rag-sample pod-security.kubernetes.io/warn=privileged pod-security.kubernetes.io/enforce=privileged
Procedure
Download the manifests for the sample pipeline.
NVIDIA provides sample pipeline manifests for the TensorRT-LLM and vLLM backends of NVIDIA NIM for LLMs. Refer to Supported Inference Models and GPU Requirements to determine which backend to use.
$ ngc registry resource download-version ohlfw0olaadg/ea-participants/rag-sample-pipeline:24.03
The NGC CLI downloads the manifests to a new directory, rag-sample-pipeline_v24.03. The manifests for persistent volume claims and the sample pipeline are in the new directory.
Change directory to the sample manifests:
$ cd rag-sample-pipeline_v24.03
Create the namespace, if it isn’t already created:
$ kubectl create namespace rag-sample
Configure persistent storage.
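For reference, each PVC manifest follows the standard Kubernetes shape. The following is a hypothetical sketch of what examples/pvc-embedding.yaml might contain after editing; the storage class name local-path is an assumption, so substitute the default storage class of your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nemollm-embedding-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path   # assumption: replace with your storage class
  resources:
    requests:
      storage: 50Gi              # size from the prerequisites list
```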
In the downloaded directory, edit the three persistent volume claim (PVC) manifests in the examples directory. Specify the spec.storageClassName value for each.
Create the PVCs:
$ kubectl apply -f examples/pvc-embedding.yaml -n rag-sample
$ kubectl apply -f examples/pvc-inferencing.yaml -n rag-sample
$ kubectl apply -f examples/pvc-pgvector.yaml -n rag-sample
Confirm the PVCs are created:
$ kubectl get pvc -n rag-sample
The following output applies to the Local Path Provisioner. The output is different if your storage class has a different volume binding mode.
NAMESPACE    NAME                                          STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
rag-sample   persistentvolumeclaim/nemollm-embedding-pvc   Pending                                      local-path     3h7m
rag-sample   persistentvolumeclaim/nemollm-inference-pvc   Pending                                      local-path     3h7m
rag-sample   persistentvolumeclaim/pgvector-pvc            Pending                                      local-path     3h7m
Add secrets that use your NGC CLI API key.
Add a Docker registry secret that the Operator uses for pulling containers from NGC:
$ kubectl create secret -n rag-sample docker-registry ngc-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password=<ngc-cli-api-key>
Add a generic secret that the init containers for the inference and embedding containers use to download models:
$ kubectl create secret -n rag-sample generic ngc-api-secret \
    --from-literal=NGC_CLI_API_KEY=<ngc-cli-api-key>
To use the vLLM backend and models from Hugging Face Model Hub, add a secret with your personal access token for downloading models from the hub:
$ kubectl create secret -n rag-sample generic hf-secret \
    --from-literal=HF_USER=<user%40example.com> \
    --from-literal=HF_PAT=<hf-personal-access-token>
To use an @ character in the HF_USER value, specify the URL-encoded value, %40.
Edit a sample pipeline file, config/samples/helmpipeline_<app|nemo>_<backend>.yaml. Consider the customization options described in the following sections.
Create the pipeline:
$ kubectl apply -f config/samples/helmpipeline_<app|nemo>_<backend>.yaml -n rag-sample
The deployment typically requires between 10 and 60 minutes to start the containers, download models from NGC, and become ready for service.
Optional: Monitor the pods:
$ kubectl get pods -n rag-sample
Example Output
NAME                                   READY   STATUS    RESTARTS   AGE
chain-server-5c9c9f7d75-ggw2h          1/1     Running   0          42m
nemollm-embedding-5bbc63f38d3b911f-0   1/1     Running   0          42m
nemollm-inference-5bbc63f38d3b911f-0   1/1     Running   0          29m
pgvector-0                             1/1     Running   0          42m
Access the sample chat application.
If your cluster is not configured to work with an external load balancer, you can port-forward the HTTP connection to the sample chat application.
Determine the node port for the sample chat application:
$ kubectl get service -n rag-sample rag-playground
In the following sample output, the application is listening on node port 32092.

NAME             TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
rag-playground   NodePort   10.99.219.137   <none>        3001:32092/TCP   141m
Forward the port:
$ kubectl port-forward service/rag-playground -n rag-sample 32092:3001
After you forward the port, you can access the application at http://localhost:32092.
Optional: Access the Chain Server. This step can be useful if you want to develop your own chat client or chat templates.
If your cluster is not configured to work with an external load balancer, you can port-forward the HTTP connection to the Chain Server.
Patch the service to change it from cluster IP to node port:
$ kubectl patch service -n rag-sample chain-server -p '{"spec":{"type":"NodePort"}}'
Determine the node port for the chain server:
$ kubectl get service -n rag-sample chain-server
In the following sample output, the server is listening on node port 31157.

NAME           TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
chain-server   NodePort   10.108.98.47   <none>        8081:31157/TCP   14d
Forward the port:
$ kubectl port-forward service/chain-server -n rag-sample 31157:8081
After you forward the port, you can access the server at http://localhost:31157. You can view the API at http://localhost:31157/docs.
The sample pipelines provide a simple baseline that you can customize.
| Field | Customization | Default Value |
|---|---|---|
| nodeSelector.nvidia.com/gpu.product | If your host has an NVIDIA H100 or L40S GPU, specify the product name. The value must match the nvidia.com/gpu.product label on your nodes. | NVIDIA-A100-80GB-PCIe |
| frontend.service.type | Specify loadBalancer if your cluster integrates with an external load balancer. | nodePort |
| chartValues.updateStrategy.type for nemollm-inference, nemollm-embedding, and pgvector | Specify OnDelete to prevent Kubernetes from performing a rolling update. This setting can help avoid an inference or embedding pod becoming stuck in Pending when you update the specification and no GPUs are allocatable. Refer to Update Strategies for stateful sets in the Kubernetes documentation. | RollingUpdate |
| query.deployStrategy.type and frontend.deployStrategy.type | Specify Recreate to have Kubernetes delete existing pods before creating new ones. This setting can help avoid the query pod becoming stuck in Pending when you update the specification and no GPUs are allocatable. By default, Kubernetes creates a new pod, including a GPU resource request, before deleting the currently running pod. Refer to Strategy for deployments in the Kubernetes documentation. | RollingUpdate |
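For example, to opt the inference chart into the OnDelete strategy, the corresponding chartValues entry might look like the following sketch; the surrounding keys follow the sample manifests, so verify the exact placement against your downloaded file:

```yaml
chartValues:
  fullnameOverride: "nemollm-inference"
  updateStrategy:
    type: OnDelete   # prevents rolling updates for the stateful set
```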
Some models exceed the memory capacity of a single GPU and require more than one GPU.
To assign multiple GPUs to the inference container, modify the Helm pipeline manifest like the following example. The model.numGpus field corresponds to the --num_gpus command-line argument to the nemollm_inference_ms command.
spec:
pipeline:
- repoEntry:
name: nemollm-inference
url: "file:///helm-charts/pipeline"
#url: "cm://rag-application/nemollm-inference"
chartSpec:
chart: "nemollm-inference"
wait: false
chartValues:
fullnameOverride: "nemollm-inference"
model:
name: llama-2-13b-chat
numGpus: 2
nodeSelector:
nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
resources:
limits:
nvidia.com/gpu: 2 # Number of GPUs to present to the running service
To assign multiple GPUs to the embedding container, modify the Helm pipeline manifest and specify the number in the resource request.
- repoEntry:
name: nemollm-embedding
url: "file:///helm-charts/pipeline"
#url: "cm://rag-application/nemollm-embedding"
chartSpec:
chart: "nemollm-embedding"
wait: false
chartValues:
fullnameOverride: "nemollm-embedding"
...
resources:
limits:
nvidia.com/gpu: 2 # Number of GPUs to present to the running service
About Choosing a Model
By default, the sample pipeline deploys the Llama-2-13B-Chat model. You can configure the pipeline to deploy a different inference model from your organization and team’s NGC Private Registry.
A model, such as Llama-2-13B-Chat, has a model version, such as a100x2_fp16_24.02. The model version encodes the following information:
Required GPU model, such as A100.
Required GPU count, such as 2.
Model release, such as 24.02.
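As an illustrative sketch (not part of the product tooling), the version encoding can be split apart in the shell:

```shell
#!/usr/bin/env bash
# Split a model version such as a100x2_fp16_24.02 into its encoded parts
VERSION='a100x2_fp16_24.02'

# The version is underscore-delimited: <gpu-spec>_<precision>_<release>
IFS='_' read -r gpu_spec precision release <<< "$VERSION"

gpu_model="${gpu_spec%x*}"   # text before the "x": required GPU model
gpu_count="${gpu_spec#*x}"   # text after the "x": required GPU count

echo "GPU model: $gpu_model"   # a100
echo "GPU count: $gpu_count"   # 2
echo "Precision: $precision"   # fp16
echo "Release:   $release"     # 24.02
```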
For the Operator to deploy a model from NGC and use the TensorRT-LLM backend, the GPU model and count must match a node in your cluster. The model release must match the NIM for LLMs container image tag. You can locate the image tag in the Helm pipeline manifest file.
You can access the registry and list models using a web browser or the NGC CLI:
Go to https://registry.ngc.nvidia.com/models. After you log in and set your organization and team, browse the models.
Use the ngc registry model list <organization>/* command to list the models. Your NGC organization and team membership determines the models that are available to you, such as the mixtral-8x7b-instruct-v0-1 model.
Use the ngc registry model list <organization/team/model>:* command to list the model versions. For example, run ngc registry model list "............/ea-participants/mixtral-8x7b-instruct-v0-1:*".
Based on the listed model versions, consider the following requirements:
The node that runs the NIM for LLMs container must have four NVIDIA A100 80 GB GPUs or four H100 80 GB GPUs.
The NIM for LLMs image tag must match the model release value, 24.02.
Procedure
To change the inference model, perform the following steps:
Determine the version of the NIM for LLMs image:
$ kubectl get pods -n rag-sample -l app.kubernetes.io/name=nemollm-inference -o=jsonpath='{.items[*].spec.containers[?(@.name=="nemollm-inference")].image}{"\n"}'
Example Output
nvcr.io/ohlfw0olaadg/ea-participants/nim_llm:24.02
Based on the example output, only models with a version suffix of 24.02 will operate with the microservice image.
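The compatibility check can be sketched in the shell, using the image reference from the example output and a hypothetical model version:

```shell
#!/usr/bin/env bash
# Compare the NIM for LLMs image tag with the release suffix of a model version
IMAGE='nvcr.io/ohlfw0olaadg/ea-participants/nim_llm:24.02'
MODEL_VERSION='a100x4_fp16_24.02'   # hypothetical model version

tag="${IMAGE##*:}"              # everything after the last ":" in the image reference
release="${MODEL_VERSION##*_}"  # everything after the last "_" in the model version

if [ "$tag" = "$release" ]; then
  echo "compatible: $tag"
else
  echo "mismatch: image $tag vs model release $release"
fi
```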
Edit the config/samples/helmpipeline_<app|nemo>_trtllm.yaml file. Modify the NIM for LLMs specification and set the NGC_MODEL_NAME, NGC_MODEL_VERSION, and MODEL_NAME environment variables:

spec:
  pipeline:
  - repoEntry:
      name: nemollm-inference
      url: "file:///helm-charts/pipeline"
    # ...
    chartValues:
      model:
        name: mixtral-8x7b-instruct-v0-1
      # ...
      initContainers:
        ngcInit: # disabled by default
          # ...
          env:
            STORE_MOUNT_PATH: /model-store
            NGC_CLI_ORG: <org-name>  # ngc org where model lives
            NGC_CLI_TEAM: <team-name>  # ngc team where model lives
            NGC_MODEL_NAME: mixtral-8x7b-instruct-v0-1  # model name in ngc
            NGC_MODEL_VERSION: a100x4_fp16_24.02  # model version in ngc
            NGC_EXE: ngc  # path to ngc cli, if pre-installed in container
            DOWNLOAD_NGC_CLI: "false"  # set to string 'true' if container should download and install ngc cli
            NGC_CLI_VERSION: "3.37.1"  # version of ngc cli to download (only matters if downloading)
            TARFILE: yes  # tells the script to untar the model; defaults to "yes", set to "" to turn off
            MODEL_NAME: mixtral-8x7b-instruct-v0-1  # actual model name, once downloaded
Modify the query service specification and set the APP_LLM_MODELNAME environment variable:

spec:
  pipeline:
  - repoEntry:
      name: rag-llm-app
      url: "file:///helm-charts/pipeline"
    # ...
    chartValues:
      # ...
      query:
        # ...
        env:
          # ...
          APP_LLM_MODELNAME: mixtral-8x7b-instruct-v0-1
      frontend:
        # ...
        env:
          # ...
          - name: APP_MODELNAME
            value: "mixtral-8x7b-instruct-v0-1"
Apply the configuration change:
$ kubectl apply -f config/samples/helmpipeline_<app|nemo>_trtllm.yaml -n rag-sample
Downloading the model from NGC typically requires between 10 and 60 minutes.
Example Output
helmpipeline.package.nvidia.com/my-sample-pipeline configured
Optional: Monitor the progress:
$ kubectl get pods -n rag-sample
Example Output
NAME                                   READY   STATUS     RESTARTS   AGE
chain-server-6d66c45bb9-rpjjz          1/1     Running    0          4m25s
nemollm-embedding-5bbc63f38d3b911f-0   1/1     Running    0          120m
nemollm-inference-5bbc63f38d3b911f-0   0/1     Init:0/1   0          3m34s
pgvector-0                             1/1     Running    0          4d
rag-playground-cb67f854b-6jkzb         1/1     Running    0          4d
Changing the inference model for the vLLM backend requires stopping the inference container, deleting and recreating the persistent volume used by the inference container, and then applying the Helm pipeline manifest that specifies the model.
When you use the vLLM backend, the inference container has an init container, hf-model-puller, that downloads the inference model from Hugging Face Model Hub. The init container uses Git to clone the model repository using a repository address like the following:
https://${HF_USER}:${HF_PAT}@huggingface.co/${MODEL_ORG}/${MODEL_NAME}
The HF_USER and HF_PAT values are supplied from the hf-secret in the same namespace as the inference container. You specify the model org and model name in the Helm pipeline manifest. For example, to clone the model at https://huggingface.co/meta-llama/Llama-2-70b-chat-hf, specify MODEL_ORG: meta-llama and MODEL_NAME: Llama-2-70b-chat-hf to complete the Git repository address.
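The address assembly can be sketched in the shell; the token value below is a placeholder, and HF_USER is assumed to be URL-encoded already:

```shell
#!/usr/bin/env bash
# Assemble the Git clone address the init container uses (values are placeholders)
HF_USER='user%40example.com'    # URL-encoded user name from hf-secret
HF_PAT='hf_placeholder_token'   # personal access token from hf-secret
MODEL_ORG='meta-llama'
MODEL_NAME='Llama-2-70b-chat-hf'

CLONE_URL="https://${HF_USER}:${HF_PAT}@huggingface.co/${MODEL_ORG}/${MODEL_NAME}"
echo "$CLONE_URL"
```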
The personal access token associated with your Hugging Face account must have access to the model organization and model.
The PVC for the inference container must have sufficient free disk space. For example, the Llama-2-70b-chat-hf model requires more than 280 GB of storage.
Edit the Helm pipeline manifest in the config/samples/ directory:
Specify model.name and model.config values for the nemollm-inference container.
Specify MODEL_ORG and MODEL_NAME values for the nemollm-inference init container.
Specify APP_LLM_MODELNAME for the chain-server container.
spec:
  pipeline:
  - repoEntry:
      name: nemollm-inference
      ...
    chartValues:
      fullnameOverride: "nemollm-inference"
      backend: "vllm"
      model:
        name: Llama-2-70b-chat
        config: /model-store/model_config.yaml
      ...
      initContainers:
        ngcInit: [] # disabled by default
        hfInit:
          imageName: bitnami/git
          imageTag: latest
          secret: # name of kube secret for hf with keys named HF_USER and HF_PAT
            name: hf-secret
          env:
            STORE_MOUNT_PATH: /model-store
            HF_MODEL_NAME: Llama-2-70b-chat-hf  # HF model name
            HF_MODEL_ORG: meta-llama  # HF org where model lives
            USE_SHALLOW_LFS_CLONE: 0  # Disable shallow LFS clone by default
      ...
  - repoEntry:
      name: rag-llm-app
      ...
    chartValues:
      query:
        ...
        env:
          APP_LLM_MODELNAME: Llama-2-70b-chat
      ...
  ...
Delete the stateful set for the inference container:
$ kubectl delete sts -n rag-sample -lapp.kubernetes.io/name=nemollm-inference
Delete the PVC for the inference container:
$ kubectl delete -n rag-sample -f examples/pvc-inferencing.yaml
Recreate the PVC for the inference container:
$ kubectl apply -n rag-sample -f examples/pvc-inferencing.yaml
Apply the Helm pipeline manifest to download and use the new model:
$ kubectl apply -n rag-sample -f config/samples/helm_<app|nemo>_vllm.yaml
Optional: Monitor the model download:
$ kubectl logs -n rag-sample -lapp.kubernetes.io/name=nemollm-inference -c hf-model-puller
Example Output
...
pipefail        off
posix           off
privileged      off
verbose         off
vi              off
xtrace          off
downloading model meta-llama/Llama-2-70b-chat-hf
Git LFS initialized.
Cloning into '/model-store/model-store'...
Filtering content: ...
Use the RAG playground user interface or access the chain server API to use the new model.
Some models exceed the memory capacity of a single GPU and require more than one GPU.
To assign multiple GPUs to the inference container, modify the Helm pipeline manifest like the following example. The tensor_parallel_size field specifies the number of GPUs to use.
spec:
pipeline:
- repoEntry:
name: nemollm-inference
url: "file:///helm-charts/pipeline"
#url: "cm://rag-application/nemollm-inference"
chartSpec:
chart: "nemollm-inference"
wait: false
chartValues:
fullnameOverride: "nemollm-inference"
model:
name: llama-2-13b-chat
numGpus: 2 # The vLLM backend does not use this field.
# num_workers: 1
vllm_config:
engine:
model: /model-store
enforce_eager: false
max_context_len_to_capture: 8192
max_num_seqs: 256
dtype: float16
tensor_parallel_size: 2
nodeSelector:
nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
resources:
limits:
nvidia.com/gpu: 2 # Number of GPUs to present to the running service
The fields beneath chartValues.model.vllm_config correspond to the VllmEngine object for NIM for LLMs. Refer to Model Configuration Values for vLLM in the NVIDIA NIM for LLMs documentation.
To delete a pipeline and remove the resources and objects associated with the services, perform the following steps:
View the Helm pipeline custom resources:
$ kubectl get helmpipelines -A
Example Output
NAMESPACE    NAME                 STATUS
rag-sample   my-sample-pipeline   deployed
Delete the custom resource:
$ kubectl delete helmpipeline -n rag-sample my-sample-pipeline
Example Output
helmpipeline.package.nvidia.com "my-sample-pipeline" deleted
If you do not plan to redeploy the pipeline, delete the persistent storage:
$ kubectl delete pvc -n rag-sample nemollm-embedding-pvc
$ kubectl delete pvc -n rag-sample nemollm-inference-pvc
$ kubectl delete pvc -n rag-sample pgvector-pvc