API Reference for NVIDIA NeMo Retriever Embedding NIM#

This documentation contains the HTTP API reference for NVIDIA NeMo Retriever Embedding NIM.

You can download the complete API spec.

Warning

Every model has a maximum token length. The models section lists the maximum token lengths of the supported models. Use the truncate field to control how the runtime handles input that is longer than the served profile supports.

Endpoints#

The runtime exposes the following endpoints.

| Method | Path | Description |
|--------|------|-------------|
| POST | /v1/embeddings | Generate embeddings. |
| GET | /v1/models | List served models. |
| GET | /v1/health/ready | Check readiness. |
| GET | /v1/health/live | Check liveness. |
| GET | /v1/metrics | Return Prometheus metrics. |
| GET | /v1/metadata | Return runtime and model metadata. |
| GET | /v1/manifest | Return the model manifest. |
| GET | /v1/license | Return license information. |
| GET | /v1/version | Return runtime API version information. |

The /v1 path prefix is the API version, not the model version.

Embeddings#

Use POST /v1/embeddings to generate embeddings for text, image, or text+image inputs.

Request Body#

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| model | string | Yes | Model ID. Use nvidia/llama-nemotron-embed-vl-1b-v2 for VLM Embed. |
| input | string or array of strings | Yes | Text, image data URL, or text with an image data URL. |
| input_type | string | No | Use query for text queries and passage for documents. Image inputs are supported only with passage or when input_type is omitted. |
| modality | string or array of strings | No | One of text, image, or text_image. If an array is provided, its length must match input, or it must contain one batch-wide value. |
| embedding_type | string | No | One of float, int8, uint8, binary, or ubinary. Default is float. |
| encoding_format | string | No | One of float or base64. Default is float. |
| dimensions | integer | No | Dynamic embedding size. Supported values are listed in Specify Dynamic Embedding Sizes. |
| truncate | string | No | One of START, END, or NONE. |
| user | string | No | Optional caller-provided user identifier. |
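As an illustration of how these fields fit together, the following minimal Python sketch builds a POST /v1/embeddings request with the standard library. The helper name build_embeddings_request and the base URL http://localhost:8000 are assumptions made for this example, not part of the API. The request is only constructed here; sending it is shown in a comment.

```python
import json
import urllib.request

MODEL_ID = "nvidia/llama-nemotron-embed-vl-1b-v2"


def build_embeddings_request(inputs, base_url="http://localhost:8000", **options):
    """Build (but do not send) a POST /v1/embeddings request.

    Hypothetical helper: the name and the default base_url are assumptions.
    """
    body = {"model": MODEL_ID, "input": list(inputs)}
    # Optional fields (input_type, modality, truncate, ...) are added only
    # when the caller provides them, so server defaults apply otherwise.
    body.update({key: value for key, value in options.items() if value is not None})
    return urllib.request.Request(
        url=f"{base_url}/v1/embeddings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


request = build_embeddings_request(
    ["What is NVIDIA?"], input_type="query", truncate="END"
)

# To send the request to a running service:
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```

Leaving unset optional fields out of the body, rather than sending null values, keeps the payload close to the curl examples in this document.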

Response Body#

The response is a JSON object with the following fields:

| Field | Type | Description |
|-------|------|-------------|
| object | string | Always list. |
| data | array | One embedding object per input item. |
| model | string | Served model ID. |
| usage | object | Token usage information. |

Each item in data contains:

| Field | Type | Description |
|-------|------|-------------|
| object | string | Always embedding. |
| index | integer | Input index. |
| embedding | array | Embedding vector or packed compressed embedding values. |
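The following Python sketch shows one way to read a response with this shape. The sample body is the shortened text-query response from the API Examples section of this document; the function name vectors_in_input_order is an assumption. Sorting by index is a defensive choice so that each vector lines up with its input regardless of the order of items in data.

```python
import json

# Shortened sample response, copied from the text-query example in this document.
sample_response = json.loads("""
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.012519836, -0.0126571655, 0.0101623535]
    }
  ],
  "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
  "usage": {"prompt_tokens": 7, "total_tokens": 7}
}
""")


def vectors_in_input_order(response):
    """Return the embedding of each data item, ordered by its index field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


vectors = vectors_in_input_order(sample_response)
total_tokens = sample_response["usage"]["total_tokens"]
```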

Specify Dynamic Embedding Sizes#

To reduce the storage cost of returned embeddings, use the optional dimensions API parameter. The VLM Embed runtime accepts the following dimensions:

  • 128

  • 256

  • 384

  • 512

  • 768

  • 1024

  • 1536

  • 2048

For the full list of supported models, refer to the support matrix.

Important

When you use the dimensions API parameter, use the same value for dimensions in concurrent requests to preserve dynamic batching efficiency.

Limitations#

The following are limitations when you specify dimensions:

  • The dimensions parameter cannot be used in combination with the embedding_type parameter. These are alternative methods for reducing the memory footprint of embeddings and have different performance trade-offs.

  • If dimensions is omitted, the VLM Embed model returns 2048-dimensional float embeddings.
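A client can catch both problems before sending a request. The sketch below is a hypothetical client-side check, not the server's validation logic, and the function name dimension_problems is an assumption. It conservatively flags any request that sets both fields, because this document does not state whether an explicit embedding_type of float is tolerated alongside dimensions.

```python
# Values accepted by the VLM Embed runtime, as listed in this document.
SUPPORTED_DIMENSIONS = (128, 256, 384, 512, 768, 1024, 1536, 2048)


def dimension_problems(payload):
    """Return a list of human-readable problems with the size-related fields."""
    problems = []
    dimensions = payload.get("dimensions")
    if dimensions is None:
        return problems  # The model returns 2048-dimensional float embeddings.
    if dimensions not in SUPPORTED_DIMENSIONS:
        problems.append(f"dimensions={dimensions} is not a supported size")
    # Assumption: treat any explicit embedding_type as a conflict.
    if payload.get("embedding_type") is not None:
        problems.append("dimensions cannot be combined with embedding_type")
    return problems


ok = dimension_problems({"input": ["hello"], "dimensions": 512})
conflict = dimension_problems({"dimensions": 512, "embedding_type": "int8"})
```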

Specify Embedding Type#

The /v1/embeddings endpoint accepts an optional embedding_type field with the following values. For the full list of models and their embedding types, refer to the support matrix.

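| Embedding Type | Returned JSON Values | Length for VLM Embed |
|----------------|----------------------|----------------------|
| float | floating-point numbers | 2048 |
| int8 | signed integers | 2048 |
| uint8 | unsigned integers | 2048 |
| binary | packed signed integers | 256 |
| ubinary | packed unsigned integers | 256 |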

  • Use float when you need maximum accuracy and storage or memory constraints are not a concern.

  • Use int8 or uint8 when you need a balance of compression and accuracy.

  • Use binary or ubinary for larger-scale systems where maximum compression is more important than preserving all float precision.

Warning

Binary embeddings require Hamming distance for similarity calculations, not cosine similarity or dot product. Verify that your vector database supports the compressed type that you select.
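The length of 256 for the packed types follows from storing one bit per dimension: 2048 bits / 8 bits per value = 256 values. As a minimal sketch of the distance calculation, the following Python function counts differing bits between two packed vectors. It assumes that both vectors were produced with the same embedding_type and that each JSON value holds exactly 8 bits; the function name is illustrative.

```python
def hamming_distance(a, b):
    """Count differing bits between two packed binary or ubinary embeddings.

    Both vectors must come from the same embedding_type. Masking with 0xFF
    keeps only the low 8 bits, so negative (signed) values are handled too.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return sum(bin((x ^ y) & 0xFF).count("1") for x, y in zip(a, b))


packed_length = 2048 // 8  # 256 values for the VLM Embed model

# Toy 2-value vectors for illustration; real vectors have 256 values.
distance = hamming_distance([0, -1], [0, 0])
```

A smaller distance means more similar vectors. Many vector databases implement this metric natively, which is usually faster than computing it in application code.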

Tune Dynamic Batching#

Dynamic batching lets the NIM group multiple requests into a single batch, which can improve throughput under certain conditions, for example when serving many requests with small payloads. Dynamic batching is enabled by default. To tune it, set the NIM_MAX_WAIT_MS environment variable. The default value is 10 ms.

Specify Modality#

The /v1/embeddings endpoint accepts a modality field that declares whether each input is text, an image, or mixed text and image.

The following are valid values for modality:

  • text

  • image

  • text_image

Image inputs are encoded as image data URLs, such as data:image/png;base64,... or data:image/jpeg;base64,.... You can also wrap a data URL in an image tag, such as <img src="data:image/png;base64,..."/>.

Image Input Limitations#

The following limitations apply to image inputs for nvidia/llama-nemotron-embed-vl-1b-v2:

  • Images are document inputs. Use input_type: "passage" or omit input_type.

  • Images with input_type: "query" are rejected.

  • The decoded image payload is limited to 25 MiB.

  • Image dimensions are limited to 8192 x 16384 or 16384 x 8192.
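The size and dimension limits can be checked on the client before sending a request. The following Python sketch handles only base64 PNG data URLs and reads the width and height from the PNG IHDR header. The function names are assumptions, and reading the dimension limit as "short side at most 8192, long side at most 16384" is an interpretation of the limits above rather than a documented rule.

```python
import base64
import struct

MAX_DECODED_BYTES = 25 * 1024 * 1024  # 25 MiB
MAX_SHORT_SIDE = 8192
MAX_LONG_SIDE = 16384
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def fits_limits(width, height):
    """Assumed reading of the limit: 8192 x 16384 in either orientation."""
    return min(width, height) <= MAX_SHORT_SIDE and max(width, height) <= MAX_LONG_SIDE


def precheck_png_data_url(data_url):
    """Return (width, height) or raise ValueError if a limit is exceeded."""
    header, _, encoded = data_url.partition(",")
    if header != "data:image/png;base64":
        raise ValueError("this sketch handles only base64 PNG data URLs")
    raw = base64.b64decode(encoded)
    if len(raw) > MAX_DECODED_BYTES:
        raise ValueError("decoded image payload exceeds 25 MiB")
    # A PNG starts with an 8-byte signature followed by the IHDR chunk:
    # 4-byte length, 4-byte type, then big-endian width and height.
    if raw[:8] != PNG_SIGNATURE or raw[12:16] != b"IHDR":
        raise ValueError("payload is not a PNG image")
    width, height = struct.unpack(">II", raw[16:24])
    if not fits_limits(width, height):
        raise ValueError(f"image dimensions {width} x {height} exceed the limit")
    return width, height
```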

Specify Modality Explicitly#

If you specify modality, each input is processed as the modality you provide.

You can specify a single modality for the whole request:

{
  "modality": "text"
}

You can also specify one modality per input:

{
  "modality": ["text", "image", "text_image"]
}
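The two forms can be normalized to one modality per input on the client. The following Python sketch applies the rule from the Request Body section: an array must either match the length of input or contain a single batch-wide value. The function name expand_modality is an assumption, and the sketch is not the server's implementation.

```python
VALID_MODALITIES = ("text", "image", "text_image")


def expand_modality(modality, inputs):
    """Return one modality per input, or raise ValueError if the shape is wrong."""
    if isinstance(modality, str):
        expanded = [modality] * len(inputs)  # Single value for the whole request.
    elif len(modality) == 1:
        expanded = list(modality) * len(inputs)  # One batch-wide value in an array.
    elif len(modality) == len(inputs):
        expanded = list(modality)  # One value per input.
    else:
        raise ValueError("modality length must match input or be 1")
    for value in expanded:
        if value not in VALID_MODALITIES:
            raise ValueError(f"modality must be one of {VALID_MODALITIES}; got {value!r}")
    return expanded


per_input = expand_modality("text", ["first", "second"])
```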

Let the Server Infer Modality#

If you omit modality, the server infers the modality for each input:

  • If the input is a valid data URL starting with data:image/, the modality is inferred as image.

  • If the input contains both text and an embedded image data URL, the modality is inferred as text_image.

  • Otherwise, the modality is inferred as text.
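The three rules above can be approximated on the client, for example to predict how a batch will be processed. The following Python sketch is an approximation and not the server's implementation: the regular expression is an assumed, simplified definition of a valid image data URL, and the function name is illustrative. How the server classifies a data URL that is wrapped in an image tag with no other text is not stated in this document; this sketch reports it as text_image.

```python
import re

# Assumption: a simplified pattern for a base64 image data URL.
IMAGE_DATA_URL = re.compile(r"data:image/[A-Za-z0-9.+-]+;base64,[A-Za-z0-9+/=]+")


def infer_modality(item):
    """Mirror the documented inference rules for a single input string."""
    text = item.strip()
    if IMAGE_DATA_URL.fullmatch(text):
        return "image"  # The whole input is an image data URL.
    if IMAGE_DATA_URL.search(text):
        return "text_image"  # A data URL is embedded in other content.
    return "text"


inferred = [
    infer_modality("What is NVIDIA?"),
    infer_modality("data:image/png;base64,AAAA"),
    infer_modality("A chart of revenue data:image/png;base64,AAAA"),
]
```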

API Examples#

Use the examples in this section to help you get started with the API.

Set the host and service port:

export HOSTNAME=localhost
export SERVICE_PORT=8000

List Models#

Use the following command to list the available models:

curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/models" \
  -H 'Accept: application/json'

The response contains the served model ID:

{
  "object": "list",
  "data": [
    {
      "id": "nvidia/llama-nemotron-embed-vl-1b-v2",
      "object": "model",
      "created": 1779146207,
      "owned_by": "nvidia"
    }
  ]
}

Generate a Text Query Embedding#

Use input_type: "query" for text queries.

curl -X "POST" \
  "http://${HOSTNAME}:${SERVICE_PORT}/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["What is NVIDIA?"],
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
    "input_type": "query",
    "modality": "text",
    "embedding_type": "float",
    "encoding_format": "float"
}'

The following response is shortened for readability.

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.012519836, -0.0126571655, 0.0101623535]
    }
  ],
  "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 7
  }
}

Generate an Image Document Embedding#

Use input_type: "passage" and modality: "image" for image documents. The following example creates a valid PNG data URL, writes the request body to payload-image.json, and sends the request.

python3 - <<'PY'
import base64
import json
import struct
import zlib

width = height = 224
rows = []
for y in range(height):
    row = bytearray([0])
    for x in range(width):
        row.extend((x % 256, y % 256, 128))
    rows.append(bytes(row))

def chunk(kind, data):
    return (
        struct.pack(">I", len(data))
        + kind
        + data
        + struct.pack(">I", zlib.crc32(kind + data) & 0xFFFFFFFF)
    )

png = (
    b"\x89PNG\r\n\x1a\n"
    + chunk(b"IHDR", struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0))
    + chunk(b"IDAT", zlib.compress(b"".join(rows), 9))
    + chunk(b"IEND", b"")
)

payload = {
    "input": ["data:image/png;base64," + base64.b64encode(png).decode()],
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
    "input_type": "passage",
    "modality": "image",
    "embedding_type": "float",
    "encoding_format": "float",
}

with open("payload-image.json", "w", encoding="utf-8") as f:
    json.dump(payload, f)
PY

curl -X "POST" \
  "http://${HOSTNAME}:${SERVICE_PORT}/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d @payload-image.json

The following response is shortened for readability.

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [-0.0047950745, 0.017486572, 0.023162842]
    }
  ],
  "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
  "usage": {
    "prompt_tokens": 266,
    "total_tokens": 266
  }
}

Generate a Compressed Embedding#

The following example generates an int8 embedding:

curl -X "POST" \
  "http://${HOSTNAME}:${SERVICE_PORT}/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["What is NVIDIA?"],
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
    "input_type": "query",
    "modality": "text",
    "embedding_type": "int8",
    "encoding_format": "float"
}'

The following response is shortened for readability.

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [1, -1, 1, 0, 5]
    }
  ],
  "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 7
  }
}

Health Checks#

Query the readiness endpoint:

curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/health/ready" \
  -H 'Accept: application/json'

Response:

{
  "object": "health.response",
  "message": "ready",
  "ready": true
}

Query the liveness endpoint:

curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/health/live" \
  -H 'Accept: application/json'

Response:

{
  "object": "health.response",
  "message": "live",
  "live": true
}
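A startup script can poll the readiness endpoint until the service accepts traffic. The following Python sketch assumes the base URL http://localhost:8000 and treats any connection error, non-success status, or body without "ready": true as not ready. The function names, retry count, and delay are assumptions chosen for illustration.

```python
import json
import time
import urllib.error
import urllib.request


def is_ready(base_url="http://localhost:8000", timeout=2.0):
    """Return True only if GET /v1/health/ready reports "ready": true."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/health/ready", timeout=timeout) as response:
            return json.load(response).get("ready") is True
    except (urllib.error.URLError, OSError, ValueError):
        # Covers refused connections, HTTP error statuses, and invalid JSON.
        return False


def wait_until_ready(probe=is_ready, attempts=30, delay_s=2.0):
    """Call probe up to attempts times, sleeping delay_s between failures."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt + 1 < attempts:
            time.sleep(delay_s)
    return False


# Example: block for up to about one minute before sending requests.
#   if not wait_until_ready():
#       raise SystemExit("service did not become ready")
```

Passing the probe as a parameter keeps the retry loop independent of the network call, which makes the loop easy to test with a stub.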

Error Responses#

Errors are returned as JSON objects with object: "error", a message, and a type.

The following examples show common validation failures:

| Case | HTTP Status | Message |
|------|-------------|---------|
| Empty input array | 400 | input must not be empty |
| Blank input string | 400 | input[0] must not be blank or empty |
| Unknown model | 404 | model 'wrong-model' not found; available: 'nvidia/llama-nemotron-embed-vl-1b-v2' |
| Invalid modality | 400 | modality must be one of 'text', 'image', 'text_image'; got 'bogus' |
| Image with input_type: "query" | 400 | input_type="query" is not supported with images; the VLM model processes images as passages only. Use input_type="passage" or omit input_type. |
| Invalid encoding_format | 400 | encoding_format must be one of 'float', 'base64'; got 'bad' |
| Invalid embedding_type | 400 | Input should be 'float', 'binary', 'ubinary', 'int8' or 'uint8' |
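Client code can turn these error bodies into readable messages. The following Python sketch relies only on the object, message, and type fields described above. The function name describe_error is an assumption, and the type value in the sample is a placeholder because this document does not list the possible values.

```python
import json


def describe_error(status, body_text):
    """Format an error response body as a single readable line."""
    try:
        body = json.loads(body_text)
    except ValueError:
        # Not JSON, for example an error page from a proxy in front of the service.
        return f"HTTP {status}: {body_text.strip()}"
    if isinstance(body, dict) and body.get("object") == "error":
        return f"HTTP {status} [{body.get('type')}]: {body.get('message')}"
    return f"HTTP {status}: unexpected response body"


# Sample body; "example_type" is a placeholder, not a documented value.
sample = '{"object": "error", "message": "input must not be empty", "type": "example_type"}'
summary = describe_error(400, sample)

# With urllib, the status and body come from the raised HTTPError:
#   except urllib.error.HTTPError as err:
#       print(describe_error(err.code, err.read().decode("utf-8")))
```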

OpenAPI Reference for NeMo Retriever Embedding NIM#

The following is the OpenAPI reference for NVIDIA NeMo Retriever Embedding NIM.