Deploying on Kubernetes
You can deploy Text Reranking NIM with a Helm chart.
The chart downloads the model and starts the service.
This Helm chart simplifies Text Reranking NIM deployment on Kubernetes. It aims to support deployment across a variety of cluster, GPU, and storage configurations.
NIMs are intended to run on a system with NVIDIA GPUs, with the type and number of GPUs depending on the model. To use Helm, you must have a Kubernetes cluster with appropriate GPU nodes and the GPU Operator installed.
Benefits of Helm Chart Deployment
Using a Helm chart:
Enables scheduling on Kubernetes nodes and horizontal scaling of the service, as shown in the sketch below
Encapsulates the complexity of running Docker commands directly
Enables monitoring of metrics from the NIM
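For example, if the chart deploys the service as a StatefulSet named text-reranking-nim, as the pod names later in this guide suggest, you could scale it after deployment. This is a sketch, not a chart-specific command; check Parameters for a chart-level replica option.
kubectl scale statefulset text-reranking-nim -n reranking-nim --replicas=2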
Setting Up the Environment
If you haven’t set up your NGC API key and do not know exactly which NIM you want to download and deploy, see the information in the User Guide.
The Helm chart requires two secrets, each containing your NGC API key: one for pulling private images from nvcr.io (below named nvcrimagepullsecret) and one for downloading the model (below named ngc-api). Both hold the same key but use different formats (dockerconfigjson vs. opaque). See Creating Secrets below.
These instructions require that you have exported your NGC_API_KEY to the environment. Use the following command to export your key.
export NGC_API_KEY="<YOUR NGC API KEY>"
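Optionally, confirm that the key works by logging in to the NGC container registry; the username for nvcr.io is always $oauthtoken.
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin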
Downloading NIM Models to Cache
If model assets must be pre-fetched (for example, for an air-gapped system), the NIM container can download them to the NIM cache without starting the server.
# Choose a container name for bookkeeping
export NIM_MODEL_NAME=nvidia/nv-rerankqa-mistral-4b-v3
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)
# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/$NIM_MODEL_NAME:1.0.0"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
# Start the NIM container with a command to download the model to the cache
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME download-to-cache
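# Optional: confirm that model assets were written to the local cache
ls "$LOCAL_NIM_CACHE"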
# Start the NIM container in an airgapped environment and serve the model
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus=all \
--shm-size=16GB \
--network=none \
-v $LOCAL_NIM_CACHE:/mnt/nim-cache:ro \
-u $(id -u) \
-e NIM_CACHE_PATH=/mnt/nim-cache \
-e NGC_API_KEY \
-p 8000:8000 \
$IMG_NAME
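Once the server reports that it is ready, you can confirm it is serving from another terminal. The path below assumes the standard NIM readiness endpoint.
curl http://localhost:8000/v1/health/ready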
By default, the download-to-cache command downloads the most appropriate model assets for the detected GPU. To override this behavior and download a specific model, set the NIM_MODEL_PROFILE environment variable when launching the container. Use the list-model-profiles command available within the NIM container to list all profiles. See Optimization for more details.
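For example, the following sketch lists the available profiles and then pre-fetches a specific one; the <profile_id> placeholder stands in for an ID taken from the list output.
docker run -it --rm --runtime=nvidia --gpus all \
-e NGC_API_KEY \
$IMG_NAME list-model-profiles
docker run -it --rm --runtime=nvidia --gpus all \
-e NGC_API_KEY \
-e NIM_MODEL_PROFILE=<profile_id> \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
$IMG_NAME download-to-cache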
Fetching the Helm Chart
You can fetch the helm chart from NGC by executing the following command:
helm fetch https://helm.ngc.nvidia.com/nim/nvidia/charts/text-reranking-nim-1.0.0.tgz --username='$oauthtoken' --password=$NGC_API_KEY
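Before deploying, you can inspect the chart's configurable defaults directly from the downloaded archive:
helm show values text-reranking-nim-1.0.0.tgz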
You can use OpenTelemetry for monitoring your container. See OpenTelemetry parameters for details.
Namespace
You can choose to deploy to whichever namespace is appropriate, but this document uses the namespace reranking-nim.
kubectl create namespace reranking-nim
Creating Secrets
Use the following script to create the expected secrets for this helm chart.
DOCKER_CONFIG='{"auths":{"nvcr.io":{"username":"$oauthtoken", "password":"'${NGC_API_KEY}'" }}}'
NGC_REGISTRY_PASSWORD=$(echo -n $DOCKER_CONFIG | base64 -w0)
cat <<EOF > imagepull.yaml
apiVersion: v1
kind: Secret
metadata:
  name: nvcrimagepullsecret
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: ${NGC_REGISTRY_PASSWORD}
EOF
kubectl apply -n reranking-nim -f imagepull.yaml
kubectl create -n reranking-nim secret generic ngc-api --from-literal=NGC_CLI_API_KEY=${NGC_API_KEY}
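Verify that both secrets exist in the namespace:
kubectl get secrets -n reranking-nim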
Configuration Considerations
By default, the following deployment commands create a single deployment with one replica using the NV-RerankQA-Mistral4B-v3 model. Use the following options to modify how the model behaves. See Parameters for information about parameters.
image.repository – The container (Text Reranking NIM) to deploy
image.tag – The version of that container
Storage options, based on the environment and cluster in use
resources – Use this option when a model requires more than the default of one GPU. See below for the support matrix and resource requirements.
env – An array of environment variables presented to the container, if advanced configuration is needed (see the example values file below)
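For example, a small values file can gather these overrides in one place. The NIM_LOG_LEVEL variable below is illustrative; consult Parameters for the options your deployment actually needs.
cat <<EOF > custom-values.yaml
image:
  repository: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3
  tag: "1.0.0"
env:
  - name: NIM_LOG_LEVEL
    value: "INFO"
EOF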
Storage
This NIM uses persistent storage for storing downloaded models. These instructions require that you have a local-nfs storage class provisioner installed in your cluster.
helm repo add nfs-ganesha-server-and-external-provisioner https://kubernetes-sigs.github.io/nfs-ganesha-server-and-external-provisioner/
helm install nfs-server nfs-ganesha-server-and-external-provisioner/nfs-server-provisioner --set storageClass.name=local-nfs
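Confirm that the storage class is available before deploying:
kubectl get storageclass local-nfs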
Advanced Storage Configuration
Storage is a particular concern when setting up NIMs. Models can be quite large, and downloading them to emptyDir volumes or other ephemeral locations in the pod can fill a node's disk. We recommend that you mount persistent storage of some kind on your pod.
This chart supports two general categories:
Persistent Volume Claims (enabled with persistence.enabled)
hostPath (enabled with persistence.hostPath)
By default, the chart uses the standard storage class and creates a PersistentVolume and a PersistentVolumeClaim.
If you do not have a storage class provisioner that creates PersistentVolumes automatically, set the value persistence.createPV=true. This is also necessary when you use persistence.hostPath on minikube.
If you have an existing PersistentVolumeClaim where you'd like the models to be stored, pass that value in with persistence.existingClaimName.
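For example, to point the chart at a pre-existing claim (the claim name here is hypothetical):
helm upgrade --install \
--namespace reranking-nim \
nemo-ranker \
--set persistence.existingClaimName=my-model-cache \
text-reranking-nim-1.0.0.tgz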
See the Helm options in Parameters.
Deploying
Basic deployment
helm upgrade --install \
--namespace reranking-nim \
nemo-ranker \
--set persistence.class="local-nfs" \
text-reranking-nim-1.0.0.tgz
You can also change the version of the model in use by adding the following after --namespace:
--set image.tag=1.0.0 \
After deploying, check the pods to ensure that the service is running; the initial image pull and model download can take upwards of 15 minutes.
kubectl get pods -n reranking-nim
The pod should eventually end up in the running state.
NAME READY STATUS RESTARTS AGE
text-reranking-nim-0 1/1 Running 0 8m44s
Check events for failures:
kubectl get events --field-selector involvedObject.name=text-reranking-nim -n reranking-nim
Recommended Configuration for Minikube
Minikube creates a hostPath-based PV and PVC by default with this chart. You should add the following setting to your helm commands.
--set persistence.class=standard
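Putting it together, a Minikube deployment using the release and chart names from this guide might look like the following sketch:
helm upgrade --install \
--namespace reranking-nim \
nemo-ranker \
--set persistence.class=standard \
text-reranking-nim-1.0.0.tgz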
Deploying Mistral 4B
Create a values file for the resource requirements of the 4B model:
# values-mistral.yaml
resources:
  limits:
    ephemeral-storage: 28Gi
    nvidia.com/gpu: 1
    memory: 32Gi
    cpu: "16000m"
  requests:
    ephemeral-storage: 28Gi
    nvidia.com/gpu: 1
    memory: 16Gi
    cpu: "4000m"
Then deploy the model:
helm upgrade --install \
--namespace reranking-nim \
-f values-mistral.yaml \
--set image.repository=nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3 \
--set image.tag=1.0.0 \
--set persistence.class="local-nfs" \
nemo-ranker \
text-reranking-nim-1.0.0.tgz
Running Inference
In the previous example, the API endpoint is exposed on port 8080 through a Kubernetes service of the default type, with no ingress, because authentication is not handled by the NIM itself. The following commands require that the nvidia/nv-rerankqa-mistral-4b-v3 model has been deployed.
Adjust the “model” value in the request JSON body to use a different model.
Use the following command to port-forward the service to your local machine to test inference.
kubectl port-forward -n reranking-nim service/text-reranking-nim 8080:8080
Then try a request:
curl -X 'POST' \
'http://localhost:8080/v1/ranking' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"query": {"text": "which way should i go?"},
"model": "nvidia/nv-rerankqa-mistral-4b-v3",
"passages": [
{
"text": "two roads diverged in a yellow wood, and sorry i could not travel both and be one traveler, long i stood and looked down one as far as i could to where it bent in the undergrowth;"
},
{
"text": "then took the other, as just as fair, and having perhaps the better claim because it was grassy and wanted wear, though as for that the passing there had worn them really about the same,"
},
{
"text": "and both that morning equally lay in leaves no step had trodden black. oh, i marked the first for another day! yet knowing how way leads on to way i doubted if i should ever come back."
}
]
}'
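A successful request returns one ranking entry per passage, ordered by relevance. The response should resemble the following shape; scores are elided here because they depend on the model:
{"rankings": [{"index": 0, "logit": ...}, {"index": 2, "logit": ...}, {"index": 1, "logit": ...}]}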
Logging
Use the following command to follow the log messages of a container started with Docker, as in the caching example earlier in this guide.
docker logs $CONTAINER_NAME -f
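If you deployed with the Helm chart, follow the pod logs with kubectl instead; the pod name below matches the example output shown earlier.
kubectl logs -f text-reranking-nim-0 -n reranking-nim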