gRPC Reference for NeMo Retriever Text Embedding NIM#

This documentation contains the gRPC reference for NeMo Retriever Text Embedding NIM.

The Text Embedding NIM supports the Open Inference Protocol (KServe V2). You can make gRPC inference requests by using the Triton Client Libraries.

Launch the Text Embedding NIM#

Launch the Text Embedding NIM by following the Get Started guide. In the command that launches the NIM, include the additional argument -p 8001:8001 to publish the gRPC port, as shown in the following example.

# Start the NIM and publish the additional gRPC port 8001
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  -p 8001:8001 \
  $IMG_NAME
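
After the container starts, you can optionally confirm that the gRPC endpoint is reachable. The following is a minimal sketch that assumes a local deployment with gRPC published on port 8001 and that the tritonclient package (installed in the next section) is available.

from tritonclient.grpc import InferenceServerClient

# Assumes the NIM container publishes gRPC on localhost:8001
client = InferenceServerClient(url="localhost:8001")
print(client.is_server_live())   # True when the gRPC endpoint is accepting connections
print(client.is_server_ready())  # True when the server has finished loading its models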

Make Inference Calls#

After you launch the Text Embedding NIM, you can make inference calls by using the following code.

Install Python dependencies.

python3 -m pip install "tritonclient[all]==2.53.0"

Make inference calls.

import numpy as np
from tritonclient.grpc import InferenceServerClient, InferInput, InferRequestedOutput


def infer_batch(
    text_batch: list[str], client: InferenceServerClient, model_name: str, parameters: dict
):
    # Encode the batch as a [batch_size, 1] array of UTF-8 bytes, as required by the "text" input.
    text_np = np.array([[text.encode("utf-8")] for text in text_batch], dtype=np.object_)
    text_input = InferInput("text", text_np.shape, "BYTES")
    text_input.set_data_from_numpy(text_np)
    infer_input = [text_input]
    # Request both outputs: the token counts and the embedding vectors.
    infer_output = [InferRequestedOutput(nn) for nn in ["token_count", "embeddings"]]
    result = client.infer(
        model_name=model_name,
        parameters=parameters,
        inputs=infer_input,
        outputs=infer_output,
    )
    token_count, embeddings = result.as_numpy("token_count"), result.as_numpy("embeddings")
    return token_count, embeddings


def infer_with_grpc(
    text_ls: list[str],
    model_name: str,
    grpc_host: str = "localhost:8001",
):
    parameters = {"input_type": "query", "truncate": "END"}

    grpc_client = InferenceServerClient(url=grpc_host, verbose=False)
    # Read max_batch_size from the model configuration so requests stay within the supported batch size.
    config = grpc_client.get_model_config(model_name=model_name).config
    max_batch_size = config.max_batch_size

    total_token_count = 0
    embeddings_ls = []
    # Split the input texts into batches of at most max_batch_size and aggregate the results.
    for offset in range(0, len(text_ls), max_batch_size):
        text_batch = text_ls[offset : offset + max_batch_size]
        token_count, embeddings = infer_batch(
            text_batch=text_batch,
            client=grpc_client,
            model_name=model_name,
            parameters=parameters,
        )
        if token_count is not None:
            total_token_count += token_count.prod()
        embeddings_ls.append(embeddings)
    embeddings = np.concatenate(embeddings_ls)
    return total_token_count, embeddings


infer_with_grpc(text_ls=["hello world"] * 100, model_name="nvidia_llama_3_2_nv_embedqa_1b_v2")

The result looks similar to the following.

(500,
 array([[ 0.014316  ,  0.00778983,  0.03520809, ...,  0.02619596,
          0.02236139, -0.00068361],
        [ 0.014316  ,  0.00778983,  0.03520809, ...,  0.02619596,
          0.02236139, -0.00068361],
        [ 0.014316  ,  0.00778983,  0.03520809, ...,  0.02619596,
          0.02236139, -0.00068361],
        ...,
        [ 0.014316  ,  0.00778983,  0.03520809, ...,  0.02619596,
          0.02236139, -0.00068361],
        [ 0.014316  ,  0.00778983,  0.03520809, ...,  0.02619596,
          0.02236139, -0.00068361],
        [ 0.01421121,  0.00781021,  0.03530755, ...,  0.02615815,
          0.02227391, -0.00074966]], dtype=float32))

API Reference#

gRPC Models#

The gRPC model names differ from the NIM model IDs shown in the Support Matrix. The following table maps each model ID to its gRPC model name.

| Model ID | gRPC Model Name |
|----------|-----------------|
| nvidia/llama-3.2-nv-embedqa-1b-v2 | nvidia_llama_3_2_nv_embedqa_1b_v2 |
| nvidia/nv-embedqa-e5-v5 | nvidia_nv_embedqa_e5_v5 |
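
You can confirm that the server exposes a given gRPC model name by querying it with the Triton client. The following is a minimal sketch that assumes the llama-3.2 model and a local deployment with gRPC on port 8001.

from tritonclient.grpc import InferenceServerClient

client = InferenceServerClient(url="localhost:8001")
# Check that the gRPC model name (not the NIM model ID) is loaded and ready
print(client.is_model_ready("nvidia_llama_3_2_nv_embedqa_1b_v2"))
# Inspect the inputs and outputs that the server reports for the model
print(client.get_model_metadata("nvidia_llama_3_2_nv_embedqa_1b_v2"))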

Request Inputs#

| Input | Shape | Data Type | Description |
|-------|-------|-----------|-------------|
| text | [batch_size, 1] | BYTES | The text to embed, encoded as UTF-8. |
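
For example, the text input for a two-item batch is a [2, 1] array of UTF-8 encoded strings. The following is a minimal sketch of how that input can be constructed; the sample texts are illustrative.

import numpy as np
from tritonclient.grpc import InferInput

texts = ["first passage to embed", "second passage to embed"]
# Shape [batch_size, 1]: one UTF-8 encoded string per row
text_np = np.array([[t.encode("utf-8")] for t in texts], dtype=np.object_)
text_input = InferInput("text", text_np.shape, "BYTES")
text_input.set_data_from_numpy(text_np)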

Request Parameters#

| Parameter | Type | Description | Valid Values | Default | Required |
|-----------|------|-------------|--------------|---------|----------|
| input_type | String | The context of the embedding. | "query", "document" | "None" | No |
| truncate | String | How to handle text that exceeds the maximum token length. | "END", "START" | "NONE" | No |
| dimensions | Integer | The desired dimensionality of the output embeddings. Must be supported by the model. | Model-dependent | The model’s default dimension. | No |
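
The parameters are passed as a dictionary to the infer call. The following is a minimal sketch that requests reduced-dimension embeddings; it assumes the llama-3.2 model is deployed locally and that 384 is a dimension the model supports, so check the Support Matrix for the values your model accepts.

import numpy as np
from tritonclient.grpc import InferenceServerClient, InferInput, InferRequestedOutput

client = InferenceServerClient(url="localhost:8001")

text_np = np.array([["what is retrieval augmented generation?".encode("utf-8")]], dtype=np.object_)
text_input = InferInput("text", text_np.shape, "BYTES")
text_input.set_data_from_numpy(text_np)

result = client.infer(
    model_name="nvidia_llama_3_2_nv_embedqa_1b_v2",
    # "dimensions": 384 is illustrative; the value must be a dimension the model supports
    parameters={"input_type": "query", "truncate": "END", "dimensions": 384},
    inputs=[text_input],
    outputs=[InferRequestedOutput("embeddings")],
)
print(result.as_numpy("embeddings").shape)  # (1, 384) if the requested dimension is supported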

Response#

| Output | Shape | Data Type | Description |
|--------|-------|-----------|-------------|
| token_count | [batch_size] | INT32 | The number of tokens in each input text. |
| embeddings | [batch_size, embedding_dimension] | FP32 | The resulting embedding vectors. |

Batching#

The model supports batching up to the max_batch_size specified in the model configuration. To process a large number of requests, split the requests into batches and aggregate the results as shown in Make Inference Calls.

You can get the max_batch_size by using the following code.

config = grpc_client.get_model_config(model_name=model_name).config
config.max_batch_size