Deploy NeMo Retriever Text Embedding NIM on Vertex AI#

Use this documentation to learn how to deploy NeMo Retriever Text Embedding NIM on Google Cloud Vertex AI.

Note

Currently, only the NV-EmbedQA-E5-v5 model is supported on Vertex AI.

Prerequisites#

Before you can deploy the NeMo Retriever Text Embedding NIM on GCP Vertex AI, you need the following:

  * An NGC API key with access to the NIM container images.
  * A Google Cloud project with the Vertex AI and Artifact Registry APIs enabled.
  * The gcloud CLI installed and authenticated.
  * Docker installed on your local machine.

Environment Setup#

Create a repository in Artifact Registry where you will store the NIM docker image.

Set the following Google Cloud environment variables in your terminal:

export REGION={REGION}
export PROJECT_ID={PROJECT_ID}
export ARTIFACT_NAME_ON_GCP_ARTIFACT_REGISTRY={ARTIFACT_REGISTRY}
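If you have not created the Artifact Registry repository yet, it can be created with the gcloud CLI. A minimal sketch, assuming the variables above are set and the Artifact Registry API is enabled in your project:

```shell
# Create a Docker-format repository in Artifact Registry to hold the NIM image.
# Assumes the REGION, PROJECT_ID, and ARTIFACT_NAME_ON_GCP_ARTIFACT_REGISTRY
# environment variables above are already exported.
gcloud artifacts repositories create ${ARTIFACT_NAME_ON_GCP_ARTIFACT_REGISTRY} \
  --project=${PROJECT_ID} \
  --location=${REGION} \
  --repository-format=docker \
  --description="NIM container images"
```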

Download and Push latest Embedding NIM Image#

  1. Set your NGC API key and use it to authenticate Docker with the NGC container registry.

    export NGC_API_KEY=<your_ngc_api_key>
    echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
    
  2. Download the image from NGC to your local machine.

    docker pull nvcr.io/nvstaging/nim/nv-embedqa-e5-v5:1.5.0
    
  3. Tag the image so that Google Cloud can identify the location and repository where your image is stored.

    docker tag nvcr.io/nvstaging/nim/nv-embedqa-e5-v5:1.5.0 ${REGION}-docker.pkg.dev/${PROJECT_ID}/${ARTIFACT_NAME_ON_GCP_ARTIFACT_REGISTRY}/nv-embedqa-e5-v5:1.5.0
    
  4. Authenticate the Google Cloud SDK with your Google account.

    gcloud auth login
    
  5. Configure Docker to use gcloud authentication for pulling and pushing images to the registry.

    gcloud auth configure-docker ${REGION}-docker.pkg.dev
    
  6. Push the image to the GCP Artifact Registry you created above.

    docker push ${REGION}-docker.pkg.dev/${PROJECT_ID}/${ARTIFACT_NAME_ON_GCP_ARTIFACT_REGISTRY}/nv-embedqa-e5-v5:1.5.0
    
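To confirm that the push succeeded before moving on, you can list the images in the repository. A quick check, assuming the same environment variables as above:

```shell
# List images in the Artifact Registry repository; the nv-embedqa-e5-v5
# image should appear in the output if the push succeeded.
gcloud artifacts docker images list \
  ${REGION}-docker.pkg.dev/${PROJECT_ID}/${ARTIFACT_NAME_ON_GCP_ARTIFACT_REGISTRY}
```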

Upload NIM to Vertex AI#

Upload the image as a model resource in Vertex AI by running the following command.

gcloud ai models upload \
  --region=${REGION} \
  --display-name=nv-embedqa-e5-v5:1.5.0 \
  --container-image-uri=${REGION}-docker.pkg.dev/${PROJECT_ID}/${ARTIFACT_NAME_ON_GCP_ARTIFACT_REGISTRY}/nv-embedqa-e5-v5:1.5.0 \
  --container-ports=8080 \
  --container-predict-route="/v1/embeddings" \
  --container-health-route="/v1/health/ready" \
  --container-shared-memory-size-mb=16000 \
  --container-env-vars="NGC_API_KEY=$NGC_API_KEY"

Create a Vertex AI endpoint#

Create an endpoint that clients use to send requests by running the following command.

gcloud ai endpoints create \
  --region=${REGION} \
  --display-name="nv-embedqa-endpoint"

Extract the MODEL_ID and ENDPOINT_ID by running the following commands. If you have more than one model or endpoint in the region, add a --filter on the display name.

export MODEL_ID=$(gcloud ai models list --region=${REGION} --format="value(name.basename())")
export ENDPOINT_ID=$(gcloud ai endpoints list --region=${REGION} --format="value(name.basename())")
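Because the list commands pick up whatever resources exist in the region, it can be worth failing fast if either variable came back empty before deploying. A minimal sketch (the ID values shown are placeholders, not real IDs):

```shell
# Sanity check: abort before deployment if either ID is empty.
# The values below are placeholders standing in for the output of the
# gcloud list commands above.
MODEL_ID="1234567890123456789"
ENDPOINT_ID="9876543210987654321"

for pair in "MODEL_ID=${MODEL_ID}" "ENDPOINT_ID=${ENDPOINT_ID}"; do
  name=${pair%%=*}
  value=${pair#*=}
  if [ -z "${value}" ]; then
    echo "ERROR: ${name} is empty; re-check the gcloud list commands" >&2
    exit 1
  fi
done
echo "OK: MODEL_ID=${MODEL_ID} ENDPOINT_ID=${ENDPOINT_ID}"
```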

Deploy the model to the endpoint#

Deploy the model to the endpoint by running the following command.

gcloud ai endpoints deploy-model ${ENDPOINT_ID} \
  --region=${REGION} \
  --model=${MODEL_ID} \
  --display-name=nv-embedqa-e5-v5:1.5.0 \
  --machine-type=a2-ultragpu-1g \
  --accelerator=type=nvidia-a100-80gb,count=1 \
  --traffic-split=0=100
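Deploying a model can take several minutes. One way to check progress is to describe the endpoint and look for the model under deployedModels; a sketch, assuming ENDPOINT_ID and REGION are still set:

```shell
# Describe the endpoint; once deployment finishes, the model appears
# in the deployedModels section of the output.
gcloud ai endpoints describe ${ENDPOINT_ID} \
  --region=${REGION} \
  --format="yaml(deployedModels)"
```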

Test endpoint#

Confirm that the endpoint is active and ready to receive requests by running the following command.

curl -X POST "https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/endpoints/${ENDPOINT_ID}:rawPredict" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "input": ["Hello world"],
    "model": "nvidia/nv-embedqa-e5-v5",
    "input_type": "query"
  }'
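The endpoint returns an OpenAI-style embeddings payload. As a sketch of inspecting such a response, the JSON below is a made-up, truncated example of the response shape (a real embedding has far more dimensions), and the snippet assumes python3 is available:

```shell
# Save a hypothetical, truncated response of the shape the endpoint returns,
# then print the length of the first embedding vector.
cat > response.json <<'EOF'
{"object": "list", "data": [{"index": 0, "embedding": [0.1, -0.2, 0.3]}], "model": "nvidia/nv-embedqa-e5-v5", "usage": {"prompt_tokens": 2, "total_tokens": 2}}
EOF
python3 -c "import json; r = json.load(open('response.json')); print(len(r['data'][0]['embedding']))"
```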