gRPC Reference for NeMo Retriever Text Embedding NIM#
This documentation contains the gRPC reference for NeMo Retriever Text Embedding NIM.
The Text Embedding NIM supports the Open Inference Protocol (KServe V2). You can make gRPC inference requests by using the Triton Client Libraries.
Launch the Text Embedding NIM#
Launch the Text Embedding NIM by following the Get Started guide.
In the command that launches the NIM, include the additional argument -p 8001:8001 to expose the gRPC port, as shown in the following example.
# Start the NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
-p 8001:8001 \
$IMG_NAME
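After the container starts, you can optionally verify that the gRPC endpoint is reachable with a readiness check. The following is a minimal sketch; it assumes the default port mapping shown above and that the tritonclient Python package (installed in the next section) is available.

from tritonclient.grpc import InferenceServerClient

# Connect to the gRPC port exposed by -p 8001:8001.
client = InferenceServerClient(url="localhost:8001")

# Returns True once the server is ready to accept inference requests.
print(client.is_server_ready())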
Make Inference Calls#
After you launch the Text Embedding NIM, you can make inference calls by using the following code.
Install Python dependencies.
python3 -m pip install tritonclient[all]==2.53.0
Make inference calls.
import numpy as np
from tritonclient.grpc import InferenceServerClient, InferInput, InferRequestedOutput

def infer_batch(
    text_batch: list[str], client: InferenceServerClient, model_name: str, parameters: dict
):
    text_np = np.array([[text.encode("utf-8")] for text in text_batch], dtype=np.object_)
    text_input = InferInput("text", text_np.shape, "BYTES")
    text_input.set_data_from_numpy(text_np)
    infer_input = [text_input]
    infer_output = [InferRequestedOutput(nn) for nn in ["token_count", "embeddings"]]
    result = client.infer(
        model_name=model_name,
        parameters=parameters,
        inputs=infer_input,
        outputs=infer_output,
    )
    token_count, embeddings = result.as_numpy("token_count"), result.as_numpy("embeddings")
    return token_count, embeddings

def infer_with_grpc(
    text_ls: list[str],
    model_name: str,
    grpc_host: str = "localhost:8001",
):
    parameters = {"input_type": "query", "truncate": "END"}
    grpc_client = InferenceServerClient(url=grpc_host, verbose=False)
    config = grpc_client.get_model_config(model_name=model_name).config
    max_batch_size = config.max_batch_size
    total_token_count = 0
    embeddings_ls = []
    for offset in range(0, len(text_ls), max_batch_size):
        text_batch = text_ls[offset : offset + max_batch_size]
        token_count, embeddings = infer_batch(
            text_batch=text_batch,
            client=grpc_client,
            model_name=model_name,
            parameters=parameters,
        )
        if token_count is not None:
            total_token_count += token_count.sum()
        embeddings_ls.append(embeddings)
    embeddings = np.concatenate(embeddings_ls)
    return total_token_count, embeddings

infer_with_grpc(text_ls=["hello world"] * 100, model_name="nvidia_llama_3_2_nv_embedqa_1b_v2")
The result looks similar to the following.
(500,
array([[ 0.014316 , 0.00778983, 0.03520809, ..., 0.02619596,
0.02236139, -0.00068361],
[ 0.014316 , 0.00778983, 0.03520809, ..., 0.02619596,
0.02236139, -0.00068361],
[ 0.014316 , 0.00778983, 0.03520809, ..., 0.02619596,
0.02236139, -0.00068361],
...,
[ 0.014316 , 0.00778983, 0.03520809, ..., 0.02619596,
0.02236139, -0.00068361],
[ 0.014316 , 0.00778983, 0.03520809, ..., 0.02619596,
0.02236139, -0.00068361],
[ 0.01421121, 0.00781021, 0.03530755, ..., 0.02615815,
0.02227391, -0.00074966]], dtype=float32))
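If you want to work with the returned values directly, you can unpack them as follows. This is a small usage sketch that reuses the infer_with_grpc helper from the example above; the output shapes follow the API reference below.

total_tokens, embeddings = infer_with_grpc(
    text_ls=["hello world"] * 100,
    model_name="nvidia_llama_3_2_nv_embedqa_1b_v2",
)
print(total_tokens)      # total number of tokens across all inputs
print(embeddings.shape)  # (100, embedding_dimension)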
API Reference#
gRPC Models#
The gRPC model names differ from the NIM model IDs shown in the Support Matrix. The following table contains the mapping of the names.
| Model ID | gRPC Model Name |
|---|---|
| nvidia/llama-3.2-nv-embedqa-1b-v2 | nvidia_llama_3_2_nv_embedqa_1b_v2 |
| nvidia/nv-embedqa-e5-v5 | nvidia_nv_embedqa_e5_v5 |
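If you are unsure which gRPC model name a running NIM serves, you can list the models through the Triton client. The following is a minimal sketch that assumes the default localhost:8001 endpoint.

from tritonclient.grpc import InferenceServerClient

client = InferenceServerClient(url="localhost:8001")

# Print the model names served over gRPC; use these names in inference requests.
for model in client.get_model_repository_index().models:
    print(model.name)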
Request Inputs#
| Input | Shape | Data Type | Description |
|---|---|---|---|
| text | [batch_size, 1] | BYTES | The text to embed, encoded as UTF-8. |
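The [batch_size, 1] shape and UTF-8 encoding correspond to the input construction used in Make Inference Calls. The following sketch restates that pattern on its own.

import numpy as np
from tritonclient.grpc import InferInput

texts = ["hello world", "another input"]

# One UTF-8 encoded string per row, giving a [batch_size, 1] BYTES tensor.
text_np = np.array([[t.encode("utf-8")] for t in texts], dtype=np.object_)
text_input = InferInput("text", text_np.shape, "BYTES")
text_input.set_data_from_numpy(text_np)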
Request Parameters#
| Parameter | Type | Description | Valid Values | Default | Required |
|---|---|---|---|---|---|
| input_type | String | The context of the embedding. | query, passage | | No |
| truncate | String | How to handle text that exceeds the maximum token length. | NONE, START, END | NONE | No |
| dimensions | Integer | The desired dimensionality of the output embeddings. Must be supported by the model. | — | The model's default dimension. | No |
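Request parameters are passed as a plain dictionary in the infer call, as in the example above. The following sketch shows the pattern by reusing the infer_batch helper and client setup from Make Inference Calls; the parameter values are examples only.

from tritonclient.grpc import InferenceServerClient

parameters = {
    "input_type": "passage",  # embed documents rather than queries
    "truncate": "END",        # drop tokens from the end of over-long inputs
}

client = InferenceServerClient(url="localhost:8001")
token_count, embeddings = infer_batch(
    text_batch=["a document to embed"],
    client=client,
    model_name="nvidia_llama_3_2_nv_embedqa_1b_v2",
    parameters=parameters,
)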
Response#
| Output | Shape | Data Type | Description |
|---|---|---|---|
| token_count | [batch_size] | INT32 | The number of tokens in each input text. |
| embeddings | [batch_size, embedding_dimension] | FP32 | The resulting embedding vectors. |
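Given an InferResult returned by client.infer, both outputs are read with as_numpy, as in the example above.

token_count = result.as_numpy("token_count")  # int32, shape (batch_size,)
embeddings = result.as_numpy("embeddings")    # float32, shape (batch_size, embedding_dimension)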
Batching#
The model supports batching up to the max_batch_size
specified in the model configuration.
To process a large number of requests, split the requests into batches and aggregate the results as shown in Make Inference Calls.
You can get the max_batch_size
by using the following code.
config = grpc_client.get_model_config(model_name=model_name).config
config.max_batch_size
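The snippet above reuses the grpc_client from Make Inference Calls. A self-contained version, assuming the default localhost:8001 endpoint and the model name from the table above, looks like the following.

from tritonclient.grpc import InferenceServerClient

client = InferenceServerClient(url="localhost:8001")
config = client.get_model_config(model_name="nvidia_llama_3_2_nv_embedqa_1b_v2").config
print(config.max_batch_size)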