API Reference for NVIDIA NeMo Retriever Embedding NIM#
This documentation contains the HTTP API reference for NVIDIA NeMo Retriever Embedding NIM.
You can download the complete API spec.
Warning
Every model has a maximum token length. The models section lists the maximum token lengths of the supported models. Use the truncate field to control how the runtime handles input that is longer than the served profile supports.
Endpoints#
The runtime exposes the following endpoints.
| Method | Path | Description |
|---|---|---|
| POST | /v1/embeddings | Generate embeddings. |
| GET | /v1/models | List served models. |
| GET | /v1/health/ready | Check readiness. |
| GET | /v1/health/live | Check liveness. |
|  |  | Return Prometheus metrics. |
|  |  | Return runtime and model metadata. |
|  |  | Return the model manifest. |
|  |  | Return license information. |
|  |  | Return runtime API version information. |
The /v1 path prefix is the API version, not the model version.
Embeddings#
Use POST /v1/embeddings to generate embeddings for text, image, or text+image inputs.
Request Body#
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID. Use |
| input | string or array of strings | Yes | Text, image data URL, or text with an image data URL. |
|  | string | No | Use |
| modality | string or array of strings | No | One of |
|  | string | No | One of |
|  | string | No | One of |
| dimensions | integer | No | Dynamic embedding size. Supported values are listed in Specify Dynamic Embedding Sizes. |
|  | string | No | One of |
|  | string | No | Optional caller-provided user identifier. |
Response Body#
The response contains an object with:
| Field | Type | Description |
|---|---|---|
| object | string | Always "list". |
| data | array | One embedding object per input item. |
| model | string | Served model ID. |
| usage | object | Token usage information. |
Each item in data contains:
| Field | Type | Description |
|---|---|---|
| object | string | Always "embedding". |
| index | integer | Input index. |
| embedding | array | Embedding vector or packed compressed embedding values. |
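As a rough illustration of the response shape described above, the following Python sketch pulls the vectors out of an already decoded response. The helper name extract_embeddings is hypothetical, and sorting by index is a defensive assumption rather than a documented requirement.

```python
def extract_embeddings(response: dict) -> list:
    """Return the embedding vectors ordered by their input index."""
    if response.get("object") != "list":
        raise ValueError(f"unexpected response object: {response.get('object')!r}")
    # Sort defensively so that position i corresponds to input item i.
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Hand-written sample shaped like the shortened responses in this reference.
sample = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.3, 0.4]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2]},
    ],
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
    "usage": {"prompt_tokens": 7, "total_tokens": 7},
}
vectors = extract_embeddings(sample)
```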
Specify Dynamic Embedding Sizes#
To reduce the storage cost of returned embeddings, use the optional dimensions API parameter. The VLM Embed runtime accepts the following dimensions:
- 128
- 256
- 384
- 512
- 768
- 1024
- 1536
- 2048
For the full list of supported models, refer to the support matrix.
Important
When you use the dimensions API parameter, use the same value for dimensions in concurrent requests to preserve dynamic batching efficiency.
Limitations#
The following are limitations when you specify dimensions:
- The dimensions parameter cannot be used in combination with the embedding_type parameter. These are alternative methods for reducing the memory footprint of embeddings and have different performance trade-offs.
- If dimensions is omitted, the VLM Embed model returns 2048-dimensional float embeddings.
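The following Python sketch shows one way a client might check these constraints before it sends a request. The function name build_payload is hypothetical, and rejecting every combination of dimensions and embedding_type (including "float") is a conservative assumption of this sketch; the server may be more permissive.

```python
# Sizes that the VLM Embed runtime accepts, as listed in this section.
SUPPORTED_DIMENSIONS = (128, 256, 384, 512, 768, 1024, 1536, 2048)


def build_payload(texts, model, dimensions=None, embedding_type=None):
    """Build a /v1/embeddings request body with basic client-side checks."""
    # Assumption: treat the two size-reduction options as mutually exclusive.
    if dimensions is not None and embedding_type is not None:
        raise ValueError("dimensions cannot be combined with embedding_type")
    if dimensions is not None and dimensions not in SUPPORTED_DIMENSIONS:
        raise ValueError(f"unsupported dimensions value: {dimensions}")
    payload = {"input": list(texts), "model": model}
    if dimensions is not None:
        payload["dimensions"] = dimensions
    if embedding_type is not None:
        payload["embedding_type"] = embedding_type
    return payload


payload = build_payload(
    ["What is NVIDIA?"],
    model="nvidia/llama-nemotron-embed-vl-1b-v2",
    dimensions=512,
)
```

Keeping the same dimensions value across a batch of calls, as the note above recommends, is then a matter of passing one constant to every call.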
Specify Embedding Type#
The /v1/embeddings endpoint contains an optional field named embedding_type that supports the following values. For the full list of models and their embedding types, refer to the support matrix.
| Embedding Type | Returned JSON Values | Length for VLM Embed |
|---|---|---|
| float | floating-point numbers | 2048 |
| int8 | signed integers | 2048 |
| uint8 | unsigned integers | 2048 |
| binary | packed signed integers | 256 |
| ubinary | packed unsigned integers | 256 |
- Use float when you need maximum accuracy and storage or memory constraints are not a concern.
- Use int8 or uint8 when you need a balance of compression and accuracy.
- Use binary or ubinary for larger-scale systems where maximum compression is more important than preserving all float precision.
Warning
Binary embeddings require Hamming distance for similarity calculations, not cosine similarity or dot product. Verify that your vector database supports the compressed type that you select.
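To make the Hamming distance requirement concrete, here is a minimal Python sketch that compares two packed embeddings. It assumes each returned integer packs 8 bits, which matches the 2048 to 256 length reduction shown in the table; the function name hamming_distance is hypothetical, and a vector database would normally perform this comparison for you.

```python
def hamming_distance(a, b):
    """Count the differing bits between two packed embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    # Mask to 8 bits so signed (binary) and unsigned (ubinary) values
    # are both treated as raw bytes before counting set bits.
    return sum(bin((x ^ y) & 0xFF).count("1") for x, y in zip(a, b))


# Toy vectors; real binary embeddings for VLM Embed contain 256 values.
distance = hamming_distance([5, -1, 0], [3, 0, 0])
```

A smaller distance means more similar embeddings, which is the opposite direction from cosine similarity scores.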
Tune Dynamic Batching#
Dynamic batching lets the NIM group multiple requests into a single batch, which can improve throughput under certain conditions, for example when serving many requests with small payloads. This feature is enabled by default and can be tuned by setting the NIM_MAX_WAIT_MS environment variable. The default value is 10 ms.
Specify Modality#
The /v1/embeddings endpoint contains a modality field to support text, image, and mixed text+image input types.
The following are valid values for modality:
- text
- image
- text_image
Image inputs are encoded as image data URLs, such as data:image/png;base64,... or data:image/jpeg;base64,.... You can also wrap a data URL in an image tag, such as <img src="data:image/png;base64,..."/>.
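A short Python sketch of the encoding described above follows. The helper names to_data_url and to_img_tag are hypothetical; only the data URL and image tag formats come from this reference.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 image data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def to_img_tag(data_url: str) -> str:
    """Wrap a data URL in the image tag form that the endpoint also accepts."""
    return f'<img src="{data_url}"/>'


# Placeholder bytes for illustration only; pass real PNG or JPEG bytes in practice.
url = to_data_url(b"abc")
tag = to_img_tag(url)
```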
Image Input Limitations#
The following limitations apply to image inputs for nvidia/llama-nemotron-embed-vl-1b-v2:
- Images are document inputs. Use input_type: "passage" or omit input_type.
- Images with input_type: "query" are rejected.
- The decoded image payload is limited to 25 MiB.
- Image dimensions are limited to 8192 x 16384 or 16384 x 8192.
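The limits above can be approximated on the client before a request is sent. The following Python sketch is one possible pre-flight check, not the server's validation logic. It assumes the dimension limit means the image must fit within 8192 x 16384 in either orientation, and it expects the caller to supply the width and height. The name check_image is hypothetical.

```python
import base64

MAX_DECODED_BYTES = 25 * 1024 * 1024  # 25 MiB
MAX_SHORT_SIDE, MAX_LONG_SIDE = 8192, 16384


def check_image(data_url, width, height, input_type=None):
    """Return a list of problems found; an empty list means no issue was detected."""
    problems = []
    if input_type == "query":
        problems.append('images are rejected with input_type "query"')
    header, _, encoded = data_url.partition(",")
    if not (header.startswith("data:image/") and header.endswith(";base64")):
        problems.append("input is not a base64 image data URL")
    elif len(base64.b64decode(encoded)) > MAX_DECODED_BYTES:
        problems.append("decoded payload exceeds 25 MiB")
    # Assumption: either orientation of the 8192 x 16384 bound is acceptable.
    short_side, long_side = sorted((width, height))
    if short_side > MAX_SHORT_SIDE or long_side > MAX_LONG_SIDE:
        problems.append("image dimensions exceed 8192 x 16384")
    return problems


issues = check_image("data:image/png;base64,YWJj", 224, 224, input_type="passage")
```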
Specify Modality Explicitly#
If you specify modality, each input is processed as the modality you provide.
You can specify a single modality for the whole request:
{
  "modality": "text"
}
You can also specify one modality per input:
{
  "modality": ["text", "image", "text_image"]
}
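The sketch below shows how a client might validate the modality field in Python before sending it. The function name attach_modality is hypothetical, and the check that a per-input list has exactly one entry per input item is an assumption based on the phrase "one modality per input".

```python
VALID_MODALITIES = {"text", "image", "text_image"}


def attach_modality(inputs, modality):
    """Return a request fragment that contains input and a validated modality."""
    per_input = not isinstance(modality, str)
    values = list(modality) if per_input else [modality]
    unknown = [value for value in values if value not in VALID_MODALITIES]
    if unknown:
        raise ValueError(f"unsupported modality values: {unknown}")
    # Assumption: a list of modalities must line up one-to-one with the inputs.
    if per_input and len(values) != len(inputs):
        raise ValueError("modality list length must match the number of inputs")
    return {"input": list(inputs), "modality": values if per_input else modality}


fragment = attach_modality(
    ["What is NVIDIA?", "data:image/png;base64,YWJj"],
    ["text", "image"],
)
```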
Let the Server Infer Modality#
If you omit modality, the server infers the modality for each input:
- If the input is a valid data URL starting with data:image/, the modality is inferred as image.
- If the input contains both text and an embedded image data URL, the modality is inferred as text_image.
- Otherwise, the modality is inferred as text.
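The inference rules above can be approximated in Python as follows. This is a client-side sketch for reasoning about requests, not the server's implementation: the regular expression is a simplified assumption about what counts as a valid image data URL, and the sketch does not special-case an input that consists only of an image tag.

```python
import re

# Simplified pattern for a base64 image data URL (an assumption of this sketch).
_IMAGE_DATA_URL = re.compile(r"data:image/[\w.+-]+;base64,[A-Za-z0-9+/=]+")


def infer_modality(value: str) -> str:
    """Mirror the documented inference order: image, then text_image, then text."""
    if _IMAGE_DATA_URL.fullmatch(value.strip()):
        return "image"
    if _IMAGE_DATA_URL.search(value):
        return "text_image"
    return "text"


guesses = [
    infer_modality("data:image/png;base64,YWJj"),
    infer_modality("Quarterly chart data:image/png;base64,YWJj"),
    infer_modality("What is NVIDIA?"),
]
```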
API Examples#
Use the examples in this section to help you get started with the API.
Set the host and service port:
export HOSTNAME=localhost
export SERVICE_PORT=8000
List Models#
Use the following command to list the available models:
curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/models" \
  -H 'Accept: application/json'
The response contains the served model ID:
{
  "object": "list",
  "data": [
    {
      "id": "nvidia/llama-nemotron-embed-vl-1b-v2",
      "object": "model",
      "created": 1779146207,
      "owned_by": "nvidia"
    }
  ]
}
Generate a Text Query Embedding#
Use input_type: "query" for text queries.
curl -X "POST" \
  "http://${HOSTNAME}:${SERVICE_PORT}/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["What is NVIDIA?"],
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
    "input_type": "query",
    "modality": "text",
    "embedding_type": "float",
    "encoding_format": "float"
  }'
The following response is shortened for readability.
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.012519836, -0.0126571655, 0.0101623535]
    }
  ],
  "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 7
  }
}
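For readers who prefer Python to curl, the following sketch builds the same request with the standard library. The function name build_embeddings_request is hypothetical, and the call that sends the request is left commented out because it needs a running service.

```python
import json
import urllib.request


def build_embeddings_request(texts, host="localhost", port=8000):
    """Build a POST request equivalent to the curl example above."""
    payload = {
        "input": list(texts),
        "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
        "input_type": "query",
        "modality": "text",
        "embedding_type": "float",
        "encoding_format": "float",
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


request = build_embeddings_request(["What is NVIDIA?"])

# To send the request against a running service:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
```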
Generate an Image Document Embedding#
Use input_type: "passage" and modality: "image" for image documents. The following example creates a valid PNG data URL, writes the request body to payload-image.json, and sends the request.
python3 - <<'PY'
import base64
import json
import struct
import zlib

# Build raw RGB scanlines for a 224 x 224 gradient image.
width = height = 224
rows = []
for y in range(height):
    row = bytearray([0])  # Each scanline starts with filter type 0 (none).
    for x in range(width):
        row.extend((x % 256, y % 256, 128))
    rows.append(bytes(row))


def chunk(kind, data):
    """Serialize one PNG chunk: length, type, data, CRC."""
    return (
        struct.pack(">I", len(data))
        + kind
        + data
        + struct.pack(">I", zlib.crc32(kind + data) & 0xFFFFFFFF)
    )


# IHDR fields: width, height, bit depth 8, color type 2 (RGB), then
# compression, filter, and interlace methods set to 0.
png = (
    b"\x89PNG\r\n\x1a\n"
    + chunk(b"IHDR", struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0))
    + chunk(b"IDAT", zlib.compress(b"".join(rows), 9))
    + chunk(b"IEND", b"")
)
payload = {
    "input": ["data:image/png;base64," + base64.b64encode(png).decode()],
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
    "input_type": "passage",
    "modality": "image",
    "embedding_type": "float",
    "encoding_format": "float",
}
with open("payload-image.json", "w", encoding="utf-8") as f:
    json.dump(payload, f)
PY
curl -X "POST" \
  "http://${HOSTNAME}:${SERVICE_PORT}/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d @payload-image.json
The following response is shortened for readability.
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [-0.0047950745, 0.017486572, 0.023162842]
    }
  ],
  "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
  "usage": {
    "prompt_tokens": 266,
    "total_tokens": 266
  }
}
Generate a Compressed Embedding#
The following example generates an int8 embedding:
curl -X "POST" \
  "http://${HOSTNAME}:${SERVICE_PORT}/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["What is NVIDIA?"],
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
    "input_type": "query",
    "modality": "text",
    "embedding_type": "int8",
    "encoding_format": "float"
  }'
The following response is shortened for readability.
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [1, -1, 1, 0, 5]
    }
  ],
  "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 7
  }
}
Health Checks#
Query the readiness endpoint:
curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/health/ready" \
  -H 'Accept: application/json'
Response:
{
  "object": "health.response",
  "message": "ready",
  "ready": true
}
Query the liveness endpoint:
curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/health/live" \
  -H 'Accept: application/json'
Response:
{
  "object": "health.response",
  "message": "live",
  "live": true
}
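A deployment script often needs to wait for readiness before it sends traffic. The following Python sketch polls the readiness endpoint until the ready field is true. The function names, retry count, and delay are hypothetical choices for illustration, not recommended values.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_ready(base_url):
    """Return True if /v1/health/ready reports ready; False on any failure."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/health/ready", timeout=5) as response:
            return bool(json.load(response).get("ready"))
    except (urllib.error.URLError, OSError, ValueError):
        return False


def wait_until_ready(check, attempts=30, delay_s=2.0, sleep=time.sleep):
    """Call check() up to `attempts` times, sleeping between failed attempts."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Example usage against a local service:
# wait_until_ready(lambda: fetch_ready("http://localhost:8000"))
```

Passing the check as a callable keeps the retry loop independent of the HTTP client, so the same loop can be reused for the liveness endpoint.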
Error Responses#
Errors are returned as JSON objects with object: "error", a message, and a type.
The following examples show common validation failures:
| Case | HTTP Status | Message |
|---|---|---|
| Empty input array |  |  |
| Blank input string |  |  |
| Unknown model |  |  |
| Invalid modality |  |  |
| Image with |  |  |
| Invalid |  |  |
| Invalid |  |  |
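Client code can turn these error objects into exceptions. The sketch below relies only on the three fields named at the start of this section (object, message, and type). The class name EmbeddingAPIError and the function raise_for_error are hypothetical, and the status, message, and type in the sample are placeholders rather than real server output.

```python
class EmbeddingAPIError(Exception):
    """Raised when the service returns a JSON body with object set to "error"."""

    def __init__(self, status, message, error_type):
        super().__init__(f"{status} {error_type}: {message}")
        self.status = status
        self.message = message
        self.error_type = error_type


def raise_for_error(status, body):
    """Return body unchanged unless it is an error object, in which case raise."""
    if isinstance(body, dict) and body.get("object") == "error":
        raise EmbeddingAPIError(status, body.get("message", ""), body.get("type", ""))
    return body


# Placeholder error body for illustration only.
sample_error = {"object": "error", "message": "example message", "type": "example_type"}
try:
    raise_for_error(400, sample_error)
except EmbeddingAPIError as error:
    caught = error
```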
OpenAPI Reference for NeMo Retriever Embedding NIM#
The following is the OpenAPI reference for NVIDIA NeMo Retriever Embedding NIM.