Deploy NeMo Retriever Text Embedding NIM on Vertex AI#
Use this documentation to learn how to deploy NeMo Retriever Text Embedding NIM on Google Cloud Vertex AI.
Note
Currently, Vertex AI is only supported for the NV-EmbedQA-E5-v5 model.
Prerequisites#
Before you can deploy the NeMo Retriever Text Embedding NIM on GCP Vertex AI, you need the following:
NGC_API_KEY for access to NVIDIA models.
Google Cloud Account with Vertex AI enabled.
Google Cloud SDK installed.
Environment Setup#
Create a repository in Artifact Registry where you will store the NIM docker image.
Set the following Google Cloud environment variables in your terminal:
export REGION={REGION}
export PROJECT_ID={PROJECT_ID}
export ARTIFACT_NAME_ON_GCP_ARTIFACT_REGISTRY={ARTIFACT_REGISTRY}
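If the Artifact Registry repository does not exist yet, you can create it with `gcloud`. This is a sketch that assumes the Docker repository format and the environment variables set above:

```shell
# Create a Docker-format repository in Artifact Registry
# (assumes REGION, PROJECT_ID, and ARTIFACT_NAME_ON_GCP_ARTIFACT_REGISTRY are set as above)
gcloud artifacts repositories create ${ARTIFACT_NAME_ON_GCP_ARTIFACT_REGISTRY} \
  --repository-format=docker \
  --location=${REGION} \
  --project=${PROJECT_ID}
```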
Download and Push the Latest Embedding NIM Image#
Set your NGC API key for authentication with NGC.
export NGC_API_KEY=<your_ngc_api_key>
Download the image from NGC to your local machine.
docker pull nvcr.io/nvstaging/nim/nv-embedqa-e5-v5:1.5.0
Use the following command to tag the image, so that Google Cloud can identify the registry location and repository where your image is stored.
docker tag nvcr.io/nvstaging/nim/nv-embedqa-e5-v5:1.5.0 ${REGION}-docker.pkg.dev/${PROJECT_ID}/${ARTIFACT_NAME_ON_GCP_ARTIFACT_REGISTRY}/nv-embedqa-e5-v5:1.5.0
Authenticate the Google Cloud SDK with your Google account.
gcloud auth login
Configure Docker to use gcloud authentication for pulling and pushing images to the registry.
gcloud auth configure-docker ${REGION}-docker.pkg.dev
Push the image to the GCP Artifact Registry you created above.
docker push ${REGION}-docker.pkg.dev/${PROJECT_ID}/${ARTIFACT_NAME_ON_GCP_ARTIFACT_REGISTRY}/nv-embedqa-e5-v5:1.5.0
Upload NIM to Vertex AI#
Upload the image as a model resource in Vertex AI by running the following command.
gcloud ai models upload \
--region=${REGION} \
--display-name=nv-embedqa-e5-v5:1.5.0 \
--container-image-uri=${REGION}-docker.pkg.dev/${PROJECT_ID}/${ARTIFACT_NAME_ON_GCP_ARTIFACT_REGISTRY}/nv-embedqa-e5-v5:1.5.0 \
--container-ports=8080 \
--container-predict-route="/v1/embeddings" \
--container-health-route="/v1/health/ready" \
--container-shared-memory-size-mb=16000 \
--container-env-vars="NGC_API_KEY=$NGC_API_KEY"
Create a Vertex AI endpoint#
Create an endpoint that clients use to send inference requests by running the following command.
gcloud ai endpoints create \
--region=${REGION} \
--display-name="nv-embedqa-endpoint"
Extract the MODEL_ID and ENDPOINT_ID by running the following commands. If your project has more than one model or endpoint in the region, add a `--filter` on the display name so you select the correct one.
export MODEL_ID=$(gcloud ai models list --region=${REGION} --format="value(MODEL_ID)")
export ENDPOINT_ID=$(gcloud ai endpoints list --region=${REGION} --format="value(ENDPOINT_ID)")
Deploy the model to the endpoint#
Deploy the model to the endpoint by running the following command.
gcloud ai endpoints deploy-model ${ENDPOINT_ID} \
--region=${REGION} \
--model=${MODEL_ID} \
--display-name=nv-embedqa-e5-v5:1.5.0 \
--machine-type=a2-ultragpu-1g \
--accelerator=type=nvidia-a100-80gb,count=1 \
--traffic-split=0=100
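Deployment can take several minutes. As a sanity check (a sketch, assuming the variables set above), you can describe the endpoint and confirm the model appears under `deployedModels`:

```shell
# Show the models currently deployed on the endpoint;
# the model deployed above should be listed once deployment completes
gcloud ai endpoints describe ${ENDPOINT_ID} \
  --region=${REGION} \
  --format="yaml(deployedModels)"
```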
Test endpoint#
Confirm that the endpoint is active and ready to receive requests by running the following command.
curl -X POST "https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/endpoints/${ENDPOINT_ID}:rawPredict" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d '{
"input": ["Hello world"],
"model": "nvidia/nv-embedqa-e5-v5",
"input_type": "query"
}'
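The NIM exposes an OpenAI-compatible embeddings API, so the rawPredict response carries vectors under `data[*].embedding`. The following is a minimal Python sketch of parsing such a response; the sample payload below is illustrative, not captured from a live endpoint:

```python
import json

# Illustrative response in the OpenAI-compatible embeddings format;
# a live endpoint returns one entry per input string, with real vectors.
sample_response = json.dumps({
    "object": "list",
    "data": [{"object": "embedding", "index": 0, "embedding": [0.01, -0.02, 0.03]}],
    "model": "nvidia/nv-embedqa-e5-v5",
    "usage": {"prompt_tokens": 2, "total_tokens": 2},
})

body = json.loads(sample_response)
# Collect vectors in input order using the "index" field
embeddings = [item["embedding"] for item in sorted(body["data"], key=lambda d: d["index"])]
print(len(embeddings), len(embeddings[0]))
```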