gRPC Reference for NeMo Retriever Text Embedding NIM#
This documentation contains the gRPC reference for NeMo Retriever Text Embedding NIM.
The Text Embedding NIM supports the Open Inference Protocol (KServe V2). You can make gRPC inference requests by using the Triton Client Libraries.
Launch the Text Embedding NIM#
Launch the Text Embedding NIM by following the Get Started guide.
In the command that launches the NIM, include the additional argument -p 8001:8001 to expose the gRPC port, as shown in the following example.
# Start the NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
-p 8001:8001 \
$IMG_NAME
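After the container starts, you can optionally verify that the gRPC endpoint is reachable with a readiness check. The following is a minimal sketch; it assumes the default port mapping shown above and that the tritonclient Python package (installed in the next section) is available.

from tritonclient.grpc import InferenceServerClient

# Connect to the gRPC port exposed by -p 8001:8001.
client = InferenceServerClient(url="localhost:8001")

# Returns True once the server is ready to accept inference requests.
print(client.is_server_ready())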
Make Inference Calls#
After you launch the Text Embedding NIM, you can make inference calls by using the following code.
Install Python dependencies.
python3 -m pip install tritonclient[all]==2.53.0
Make inference calls.
import numpy as np
from tritonclient.grpc import InferenceServerClient, InferInput, InferRequestedOutput

def infer_batch(
    text_batch: list[str], client: InferenceServerClient, model_name: str, parameters: dict
):
    text_np = np.array([[text.encode("utf-8")] for text in text_batch], dtype=np.object_)
    text_input = InferInput("text", text_np.shape, "BYTES")
    text_input.set_data_from_numpy(text_np)
    infer_input = [text_input]
    infer_output = [InferRequestedOutput(nn) for nn in ["token_count", "embeddings"]]
    result = client.infer(
        model_name=model_name,
        parameters=parameters,
        inputs=infer_input,
        outputs=infer_output,
    )
    token_count, embeddings = result.as_numpy("token_count"), result.as_numpy("embeddings")
    return token_count, embeddings

def infer_with_grpc(
    text_ls: list[str],
    model_name: str,
    grpc_host: str = "localhost:8001",
):
    parameters = {"input_type": "query", "truncate": "END"}
    grpc_client = InferenceServerClient(url=grpc_host, verbose=False)
    config = grpc_client.get_model_config(model_name=model_name).config
    max_batch_size = config.max_batch_size
    total_token_count = 0
    embeddings_ls = []
    for offset in range(0, len(text_ls), max_batch_size):
        text_batch = text_ls[offset : offset + max_batch_size]
        token_count, embeddings = infer_batch(
            text_batch=text_batch,
            client=grpc_client,
            model_name=model_name,
            parameters=parameters,
        )
        if token_count is not None:
            total_token_count += token_count.sum()
        embeddings_ls.append(embeddings)
    embeddings = np.concatenate(embeddings_ls)
    return total_token_count, embeddings

infer_with_grpc(text_ls=["hello world"] * 100, model_name="nvidia_llama_3_2_nv_embedqa_1b_v2")
The result looks similar to the following.
(500,
array([[ 0.014316 , 0.00778983, 0.03520809, ..., 0.02619596,
0.02236139, -0.00068361],
[ 0.014316 , 0.00778983, 0.03520809, ..., 0.02619596,
0.02236139, -0.00068361],
[ 0.014316 , 0.00778983, 0.03520809, ..., 0.02619596,
0.02236139, -0.00068361],
...,
[ 0.014316 , 0.00778983, 0.03520809, ..., 0.02619596,
0.02236139, -0.00068361],
[ 0.014316 , 0.00778983, 0.03520809, ..., 0.02619596,
0.02236139, -0.00068361],
[ 0.01421121, 0.00781021, 0.03530755, ..., 0.02615815,
0.02227391, -0.00074966]], dtype=float32))
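If you want to work with the returned values directly, you can unpack them as follows. This is a small usage sketch that reuses the infer_with_grpc helper from the example above; the output shapes follow the API reference below.

total_tokens, embeddings = infer_with_grpc(
    text_ls=["hello world"] * 100,
    model_name="nvidia_llama_3_2_nv_embedqa_1b_v2",
)
print(total_tokens)      # total number of tokens across all inputs
print(embeddings.shape)  # (100, embedding_dimension)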
API Reference#
gRPC Models#
The gRPC model names differ from the NIM model IDs shown in the Support Matrix. The following table contains the mapping of the names.
| Model ID | gRPC Model Name |
|---|---|
| nvidia/llama-3.2-nv-embedqa-1b-v2 | nvidia_llama_3_2_nv_embedqa_1b_v2 |
| nvidia/nv-embedqa-e5-v5 | nvidia_nv_embedqa_e5_v5 |
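If you are unsure which gRPC model name a running NIM serves, you can list the models through the Triton client. The following is a minimal sketch that assumes the default localhost:8001 endpoint.

from tritonclient.grpc import InferenceServerClient

client = InferenceServerClient(url="localhost:8001")

# Print the model names served over gRPC; use these names in inference requests.
for model in client.get_model_repository_index().models:
    print(model.name)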
Request Inputs#
| Input | Shape | Data Type | Description |
|---|---|---|---|
| text | [batch_size, 1] | BYTES | The text to embed, encoded as UTF-8. |
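The [batch_size, 1] shape and UTF-8 encoding correspond to the input construction used in Make Inference Calls. The following sketch restates that pattern on its own.

import numpy as np
from tritonclient.grpc import InferInput

texts = ["hello world", "another input"]

# One UTF-8 encoded string per row, giving a [batch_size, 1] BYTES tensor.
text_np = np.array([[t.encode("utf-8")] for t in texts], dtype=np.object_)
text_input = InferInput("text", text_np.shape, "BYTES")
text_input.set_data_from_numpy(text_np)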
Request Parameters#
| Parameter | Type | Description | Valid Values | Default | Required |
|---|---|---|---|---|---|
| input_type | String | The context of the embedding. | query, passage | | No |
| truncate | String | How to handle text that exceeds the maximum token length. | NONE, START, END | NONE | No |
| dimensions | Integer | The desired dimensionality of the output embeddings. Must be supported by the model. | — | The model's default dimension. | No |
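Request parameters are passed as a plain dictionary in the infer call, as in the example above. The following sketch shows the pattern by reusing the infer_batch helper and client setup from Make Inference Calls; the parameter values are examples only.

from tritonclient.grpc import InferenceServerClient

parameters = {
    "input_type": "passage",  # embed documents rather than queries
    "truncate": "END",        # drop tokens from the end of over-long inputs
}

client = InferenceServerClient(url="localhost:8001")
token_count, embeddings = infer_batch(
    text_batch=["a document to embed"],
    client=client,
    model_name="nvidia_llama_3_2_nv_embedqa_1b_v2",
    parameters=parameters,
)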
Response#
| Output | Shape | Data Type | Description |
|---|---|---|---|
| token_count | [batch_size] | INT32 | The number of tokens in each input text. |
| embeddings | [batch_size, embedding_dimension] | FP32 | The resulting embedding vectors. |
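Given an InferResult returned by client.infer, both outputs are read with as_numpy, as in the example above.

token_count = result.as_numpy("token_count")  # int32, shape (batch_size,)
embeddings = result.as_numpy("embeddings")    # float32, shape (batch_size, embedding_dimension)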
Batching#
The model supports batching up to the max_batch_size
specified in the model configuration.
To process a large number of requests, split the requests into batches and aggregate the results as shown in Make Inference Calls.
You can get the max_batch_size
by using the following code.
config = grpc_client.get_model_config(model_name=model_name).config
config.max_batch_size
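The snippet above reuses the grpc_client from Make Inference Calls. A self-contained version, assuming the default localhost:8001 endpoint and the model name from the table above, looks like the following.

from tritonclient.grpc import InferenceServerClient

client = InferenceServerClient(url="localhost:8001")
config = client.get_model_config(model_name="nvidia_llama_3_2_nv_embedqa_1b_v2").config
print(config.max_batch_size)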