API Reference for NVIDIA NeMo Retriever Embedding NIM#

This documentation contains the HTTP API reference for NVIDIA NeMo Retriever Embedding NIM.

You can download the complete API spec.

Warning

Every model has a maximum token length. The models section lists the maximum token lengths of the supported models. Use the truncate field to control how the runtime handles input that is longer than the served profile supports.

Endpoints#

The runtime exposes the following endpoints.

Method	Path	Description
`POST`	`/v1/embeddings`	Generate embeddings.
`GET`	`/v1/models`	List served models.
`GET`	`/v1/health/ready`	Check readiness.
`GET`	`/v1/health/live`	Check liveness.
`GET`	`/v1/metrics`	Return Prometheus metrics.
`GET`	`/v1/metadata`	Return runtime and model metadata.
`GET`	`/v1/manifest`	Return the model manifest.
`GET`	`/v1/license`	Return license information.
`GET`	`/v1/version`	Return runtime API version information.

The /v1 path prefix is the API version, not the model version.

Embeddings#

Use POST /v1/embeddings to generate embeddings for text, image, or text+image inputs.

Request Body#

Field	Type	Required	Description
`model`	string	Yes	Model ID. Use `nvidia/llama-nemotron-embed-vl-1b-v2` for VLM Embed.
`input`	string or array of strings	Yes	Text, image data URL, or text with an image data URL.
`input_type`	string	No	Use `query` for text queries and `passage` for documents. Image inputs are supported only with `passage` or when `input_type` is omitted.
`modality`	string or array of strings	No	One of `text`, `image`, or `text_image`. If an array is provided, its length must match `input`, or it must contain one batch-wide value.
`embedding_type`	string	No	One of `float`, `int8`, `uint8`, `binary`, or `ubinary`. Default is `float`.
`encoding_format`	string	No	One of `float` or `base64`. Default is `float`.
`dimensions`	integer	No	Dynamic embedding size. Supported values are listed in Specify Dynamic Embedding Sizes.
`truncate`	string	No	One of `START`, `END`, or `NONE`.
`user`	string	No	Optional caller-provided user identifier.
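
The request fields above can be assembled and checked on the client before a request is sent. The following is a minimal sketch, not part of the NIM: `build_embedding_request` is a hypothetical helper name, and the allowed values are copied from the table above.

```python
import json

# Allowed values copied from the request-body table.
ALLOWED = {
    "input_type": {"query", "passage"},
    "embedding_type": {"float", "int8", "uint8", "binary", "ubinary"},
    "encoding_format": {"float", "base64"},
    "truncate": {"START", "END", "NONE"},
}


def build_embedding_request(inputs, model, **options):
    """Return a /v1/embeddings request body as a dict (hypothetical helper)."""
    if isinstance(inputs, str):
        inputs = [inputs]
    if not inputs:
        raise ValueError("input must not be empty")
    body = {"input": list(inputs), "model": model}
    for name, value in options.items():
        # Only the enumerated fields are checked; other fields pass through.
        if name in ALLOWED and value not in ALLOWED[name]:
            raise ValueError(
                f"{name} must be one of {sorted(ALLOWED[name])}; got {value!r}"
            )
        body[name] = value
    return body


body = build_embedding_request(
    ["What is NVIDIA?"],
    "nvidia/llama-nemotron-embed-vl-1b-v2",
    input_type="query",
    truncate="END",
)
print(json.dumps(body, indent=2))
```

Client-side checks like these only catch obvious mistakes early; the server remains the authority on what it accepts.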

Response Body#

The response is a JSON object with the following fields:

Field	Type	Description
`object`	string	Always `list`.
`data`	array	One embedding object per input item.
`model`	string	Served model ID.
`usage`	object	Token usage information.

Each item in data contains:

Field	Type	Description
`object`	string	Always `embedding`.
`index`	integer	Input index.
`embedding`	array	Embedding vector or packed compressed embedding values.
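
As an illustration of the response shape described above, the following sketch pulls the embeddings out of a response and orders them by `index`. The helper name `embeddings_in_input_order` and the sample values are made up for the example; only the field names come from the tables above.

```python
import json

# A shortened, made-up response that follows the documented field layout.
sample_response = json.loads("""
{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 1, "embedding": [0.5, -0.25]},
    {"object": "embedding", "index": 0, "embedding": [0.125, 0.75]}
  ],
  "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
  "usage": {"prompt_tokens": 14, "total_tokens": 14}
}
""")


def embeddings_in_input_order(response):
    """Return the embedding of each input, sorted by the `index` field."""
    if response.get("object") != "list":
        raise ValueError("expected a response with object == 'list'")
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


vectors = embeddings_in_input_order(sample_response)
print(len(vectors), "embeddings;", sample_response["usage"]["total_tokens"], "tokens")
```

Sorting by `index` is a defensive choice so the result lines up with the `input` array regardless of the order of items in `data`.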

Specify Dynamic Embedding Sizes#

To reduce the storage cost of returned embeddings, use the optional dimensions API parameter. The VLM Embed runtime accepts the following dimensions:

128
256
384
512
768
1024
1536
2048

For the full list of supported models, refer to the support matrix.

Important

When you use the dimensions API parameter, use the same value for dimensions in concurrent requests to preserve dynamic batching efficiency.

Limitations#

The following are limitations when you specify dimensions:

The dimensions parameter cannot be used in combination with the embedding_type parameter. These are alternative methods for reducing the memory footprint of embeddings and have different performance trade-offs.
If dimensions is omitted, the VLM Embed model returns 2048-dimensional float embeddings.
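
The rules in this section can be checked before a request is sent. The following sketch uses a hypothetical helper, `check_dimensions`, with the supported sizes copied from the list above. It treats any explicit `embedding_type` as a conflict, which is an assumption and may be stricter than the server.

```python
# Supported sizes copied from the list in this section.
SUPPORTED_DIMENSIONS = (128, 256, 384, 512, 768, 1024, 1536, 2048)


def check_dimensions(body):
    """Raise ValueError if `dimensions` in a request body breaks the documented rules."""
    dims = body.get("dimensions")
    if dims is None:
        # Omitted: the VLM Embed model returns 2048-dimensional float embeddings.
        return
    if dims not in SUPPORTED_DIMENSIONS:
        raise ValueError(f"dimensions must be one of {SUPPORTED_DIMENSIONS}; got {dims!r}")
    if "embedding_type" in body:
        # Assumption: any explicit embedding_type counts as a combination.
        raise ValueError("dimensions cannot be combined with embedding_type")


check_dimensions({"input": ["What is NVIDIA?"], "dimensions": 512})
print("request body passes the dimensions checks")
```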

Specify Embedding Type#

The /v1/embeddings endpoint accepts an optional embedding_type field that supports the following values. For the full list of models and their embedding types, refer to the support matrix.

Embedding Type	Returned JSON Values	Length for VLM Embed
`float`	floating-point numbers	2048
`int8`	signed integers	2048
`uint8`	unsigned integers	2048
`binary`	packed signed integers	256
`ubinary`	packed unsigned integers	256

Use float when you need maximum accuracy and storage or memory constraints are not a concern.
Use int8 or uint8 when you need a balance of compression and accuracy.
Use binary or ubinary for larger-scale systems where maximum compression is more important than preserving all float precision.
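
To make the trade-off concrete, the following sketch estimates raw vector storage for each embedding type, using the lengths from the table above. The bytes-per-value figures are assumptions about how you store the vectors (4-byte float32 for `float`, 1 byte for every other type); they are not stated by the API.

```python
# Values per vector, from the "Length for VLM Embed" column.
VALUES_PER_VECTOR = {"float": 2048, "int8": 2048, "uint8": 2048, "binary": 256, "ubinary": 256}

# Assumed storage width per value: float32 for floats, one byte otherwise.
BYTES_PER_VALUE = {"float": 4, "int8": 1, "uint8": 1, "binary": 1, "ubinary": 1}


def storage_bytes(embedding_type, num_vectors):
    """Estimate the raw storage in bytes for `num_vectors` embeddings."""
    return VALUES_PER_VECTOR[embedding_type] * BYTES_PER_VALUE[embedding_type] * num_vectors


for embedding_type in VALUES_PER_VECTOR:
    mib = storage_bytes(embedding_type, 1_000_000) / 2**20
    print(f"{embedding_type:8s} {mib:10.1f} MiB per million vectors")
```

Under these assumptions, `int8` and `uint8` take a quarter of the space of `float`, and `binary` and `ubinary` take 1/32 of it. Index overhead in a vector database is not included.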

Warning

Binary embeddings require Hamming distance for similarity calculations, not cosine similarity or dot product. Verify that your vector database supports the compressed type that you select.
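
For reference, Hamming distance counts the bit positions in which two packed vectors differ. The following is a minimal pure-Python sketch, assuming each returned value packs 8 bits as one byte (two's complement for `binary`, unsigned for `ubinary`), which matches the 2048 to 256 length reduction in the table above. The function name is illustrative; a vector database would normally compute this for you.

```python
def hamming_distance(a, b):
    """Count differing bits between two packed binary embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    # XOR marks differing bits; masking with 0xFF maps signed bytes
    # (for example -1) onto their 8-bit two's-complement pattern.
    return sum(bin((x ^ y) & 0xFF).count("1") for x, y in zip(a, b))


# Made-up packed values, two bytes per vector for brevity.
print(hamming_distance([5, -1], [3, 0]))
```

A smaller distance means more similar embeddings. Compare only vectors of the same type (`binary` with `binary`, `ubinary` with `ubinary`).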

Tune Dynamic Batching#

Dynamic batching lets the NIM group multiple requests into a single batch, which can improve throughput under certain conditions, for example when serving many requests with small payloads. This feature is enabled by default. To tune it, set the NIM_MAX_WAIT_MS environment variable. The default value is 10 ms.

Specify Modality#

The /v1/embeddings endpoint accepts a modality field to support text, image, and mixed text+image inputs.

The following are valid values for modality:

text
image
text_image

Image inputs are encoded as image data URLs, such as data:image/png;base64,... or data:image/jpeg;base64,.... You can also wrap a data URL in an image tag, such as <img src="data:image/png;base64,..."/>.

Image Input Limitations#

The following limitations apply to image inputs for nvidia/llama-nemotron-embed-vl-1b-v2:

Images are document inputs. Use input_type: "passage" or omit input_type.
Images with input_type: "query" are rejected.
The decoded image payload is limited to 25 MiB.
Image dimensions are limited to 8192 x 16384 or 16384 x 8192.
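
Some of these limits can be checked on the client before upload. The following sketch covers the `input_type` rule and the 25 MiB decoded-size limit for a bare data URL. `check_image_input` is a hypothetical helper; it does not check pixel dimensions and does not handle the `<img>` wrapper form.

```python
import base64

MAX_DECODED_BYTES = 25 * 1024 * 1024  # 25 MiB, from the list above


def check_image_input(data_url, input_type=None):
    """Return the decoded payload size in bytes, or raise ValueError."""
    if input_type == "query":
        raise ValueError('images require input_type "passage" or no input_type')
    header, separator, encoded = data_url.partition(";base64,")
    if not separator or not header.startswith("data:image/"):
        raise ValueError("expected a base64 image data URL")
    # validate=True rejects characters outside the base64 alphabet.
    decoded = base64.b64decode(encoded, validate=True)
    if len(decoded) > MAX_DECODED_BYTES:
        raise ValueError(f"decoded image is {len(decoded)} bytes; limit is {MAX_DECODED_BYTES}")
    return len(decoded)


# A three-byte placeholder payload, not a real image.
example_url = "data:image/png;base64," + base64.b64encode(b"abc").decode()
print(check_image_input(example_url, input_type="passage"))
```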

Specify Modality Explicitly#

If you specify modality, each input is processed as the modality you provide.

You can specify a single modality for the whole request:

{
  "modality": "text"
}

You can also specify one modality per input:

{
  "modality": ["text", "image", "text_image"]
}
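
The request-body table states that a modality array must either match the length of `input` or contain one batch-wide value. The following sketch expresses that rule as a hypothetical client-side helper, `expand_modality`, which returns one modality per input.

```python
VALID_MODALITIES = ("text", "image", "text_image")


def expand_modality(modality, inputs):
    """Return one modality per input, or raise ValueError if the shapes disagree."""
    values = [modality] if isinstance(modality, str) else list(modality)
    for value in values:
        if value not in VALID_MODALITIES:
            raise ValueError(f"modality must be one of {VALID_MODALITIES}; got {value!r}")
    if len(values) == 1:
        # A single value applies to the whole batch.
        return values * len(inputs)
    if len(values) != len(inputs):
        raise ValueError(f"got {len(values)} modalities for {len(inputs)} inputs")
    return values


inputs = ["a caption", "data:image/png;base64,...", "another caption"]
print(expand_modality("text", inputs))
```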

Let the Server Infer Modality#

If you omit modality, the server infers the modality for each input:

If the input is a valid data URL starting with data:image/, the modality is inferred as image.
If the input contains both text and an embedded image data URL, the modality is inferred as text_image.
Otherwise, the modality is inferred as text.
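
The inference rules above can be approximated on the client, for example to predict how a batch will be processed. The following sketch is an approximation and not the server's implementation: the regular expression is an assumed definition of a valid image data URL, and the `<img>` wrapper form is not handled.

```python
import re

# Assumed pattern for a base64 image data URL; the server may be stricter or looser.
IMAGE_DATA_URL = re.compile(r"data:image/[A-Za-z0-9.+-]+;base64,[A-Za-z0-9+/=]+")


def infer_modality(item):
    """Guess the modality of one input string using the documented rules."""
    stripped = item.strip()
    if IMAGE_DATA_URL.fullmatch(stripped):
        return "image"  # the whole input is an image data URL
    if IMAGE_DATA_URL.search(stripped):
        return "text_image"  # text with an embedded image data URL
    return "text"


for item in ("What is NVIDIA?", "data:image/png;base64,AAAA", "Caption: data:image/png;base64,AAAA"):
    print(infer_modality(item))
```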

API Examples#

Use the examples in this section to help you get started with the API.

Set the host and service port:

export HOSTNAME=localhost
export SERVICE_PORT=8000

List Models#

Use the following command to list the available models:

curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/models" \
  -H 'Accept: application/json'

The response contains the served model ID:

{
  "object": "list",
  "data": [
    {
      "id": "nvidia/llama-nemotron-embed-vl-1b-v2",
      "object": "model",
      "created": 1779146207,
      "owned_by": "nvidia"
    }
  ]
}

Generate a Text Query Embedding#

Use input_type: "query" for text queries.

curl -X "POST" \
  "http://${HOSTNAME}:${SERVICE_PORT}/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["What is NVIDIA?"],
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
    "input_type": "query",
    "modality": "text",
    "embedding_type": "float",
    "encoding_format": "float"
}'

The following response is shortened for readability.

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.012519836, -0.0126571655, 0.0101623535]
    }
  ],
  "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 7
  }
}
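
The same request can be issued from Python with the standard library. The following sketch builds the request without sending it; `make_embeddings_request` is a hypothetical helper, and the host and port defaults mirror the environment variables set earlier in this section.

```python
import json
import os
import urllib.request


def make_embeddings_request(body, host=None, port=None):
    """Build (but do not send) a POST request for /v1/embeddings."""
    host = host or os.environ.get("HOSTNAME", "localhost")
    port = port or os.environ.get("SERVICE_PORT", "8000")
    return urllib.request.Request(
        f"http://{host}:{port}/v1/embeddings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


request = make_embeddings_request(
    {
        "input": ["What is NVIDIA?"],
        "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
        "input_type": "query",
        "modality": "text",
    },
    host="localhost",
    port=8000,
)
print(request.get_method(), request.full_url)

# Uncomment to send the request to a running NIM:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["usage"])
```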

Generate an Image Document Embedding#

Use input_type: "passage" and modality: "image" for image documents. The following example creates a valid PNG data URL, writes the request body to payload-image.json, and sends the request.

python3 - <<'PY'
import base64
import json
import struct
import zlib

width = height = 224
rows = []
for y in range(height):
    row = bytearray([0])
    for x in range(width):
        row.extend((x % 256, y % 256, 128))
    rows.append(bytes(row))

def chunk(kind, data):
    return (
        struct.pack(">I", len(data))
        + kind
        + data
        + struct.pack(">I", zlib.crc32(kind + data) & 0xFFFFFFFF)
    )

png = (
    b"\x89PNG\r\n\x1a\n"
    + chunk(b"IHDR", struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0))
    + chunk(b"IDAT", zlib.compress(b"".join(rows), 9))
    + chunk(b"IEND", b"")
)

payload = {
    "input": ["data:image/png;base64," + base64.b64encode(png).decode()],
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
    "input_type": "passage",
    "modality": "image",
    "embedding_type": "float",
    "encoding_format": "float",
}

with open("payload-image.json", "w", encoding="utf-8") as f:
    json.dump(payload, f)
PY

curl -X "POST" \
  "http://${HOSTNAME}:${SERVICE_PORT}/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d @payload-image.json

The following response is shortened for readability.

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [-0.0047950745, 0.017486572, 0.023162842]
    }
  ],
  "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
  "usage": {
    "prompt_tokens": 266,
    "total_tokens": 266
  }
}

Generate a Compressed Embedding#

The following example generates an int8 embedding:

curl -X "POST" \
  "http://${HOSTNAME}:${SERVICE_PORT}/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["What is NVIDIA?"],
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
    "input_type": "query",
    "modality": "text",
    "embedding_type": "int8",
    "encoding_format": "float"
}'

The following response is shortened for readability.

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [1, -1, 1, 0, 5]
    }
  ],
  "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 7
  }
}

Health Checks#

Query the readiness endpoint:

curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/health/ready" \
  -H 'Accept: application/json'

Response:

{
  "object": "health.response",
  "message": "ready",
  "ready": true
}

Query the liveness endpoint:

curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/health/live" \
  -H 'Accept: application/json'

Response:

{
  "object": "health.response",
  "message": "live",
  "live": true
}
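
A common use of the readiness endpoint is to wait for the service before sending traffic. The following sketch polls until the `ready` field is true. The function names, attempt count, and delay are illustrative choices, not recommended values.

```python
import json
import time
import urllib.request


def fetch_ready(url):
    """Return True if the readiness endpoint answers with "ready": true."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return bool(json.load(response).get("ready"))
    except (OSError, ValueError):
        # Connection failures, HTTP errors, timeouts, or a non-JSON body.
        return False


def wait_until_ready(url, attempts=30, delay=2.0, fetch=fetch_ready, sleep=time.sleep):
    """Poll `url` up to `attempts` times; return True as soon as it reports ready."""
    for attempt in range(attempts):
        if fetch(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Example call against a local NIM (performs network requests):
# wait_until_ready("http://localhost:8000/v1/health/ready")
```

The `fetch` and `sleep` parameters exist so the loop can be exercised without a running server.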

Error Responses#

Errors are returned as JSON objects with object: "error", a message, and a type.

The following examples show common validation failures:

Case	HTTP Status	Message
Empty input array	`400`	`input must not be empty`
Blank input string	`400`	`input[0] must not be blank or empty`
Unknown model	`404`	`model 'wrong-model' not found; available: 'nvidia/llama-nemotron-embed-vl-1b-v2'`
Invalid modality	`400`	`modality must be one of 'text', 'image', 'text_image'; got 'bogus'`
Image with `input_type: "query"`	`400`	`input_type="query" is not supported with images; the VLM model processes images as passages only. Use input_type="passage" or omit input_type.`
Invalid `encoding_format`	`400`	`encoding_format must be one of 'float', 'base64'; got 'bad'`
Invalid `embedding_type`	`400`	`Input should be 'float', 'binary', 'ubinary', 'int8' or 'uint8'`
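
On the client, an error body can be turned into an exception that carries the HTTP status, message, and type. The following sketch assumes only the three fields named above. The class name, the helper name, and the `type` value in the sample are hypothetical.

```python
import json


class EmbeddingAPIError(Exception):
    """Carries the HTTP status plus the message and type from an error body."""

    def __init__(self, status, message, error_type):
        super().__init__(f"HTTP {status} ({error_type}): {message}")
        self.status = status
        self.message = message
        self.error_type = error_type


def parse_error(status, body_text):
    """Build an EmbeddingAPIError from a response status and raw body text."""
    try:
        body = json.loads(body_text)
    except ValueError:
        body = None
    if not isinstance(body, dict):
        # Not the documented JSON shape; keep the raw text as the message.
        return EmbeddingAPIError(status, body_text, "unknown")
    return EmbeddingAPIError(status, body.get("message", body_text), body.get("type", "unknown"))


# The "type" value below is a placeholder, not a documented value.
sample_body = '{"object": "error", "message": "input must not be empty", "type": "example_type"}'
error = parse_error(400, sample_body)
print(error)
```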

OpenAPI Reference for NeMo Retriever Embedding NIM#

The following is the OpenAPI reference for NVIDIA NeMo Retriever Embedding NIM.