API Reference for NeMo Retriever Text Embedding NIM#

This documentation contains the API reference for NeMo Retriever Text Embedding NIM.

Overview#

You can download the complete API spec.

Warning

Every model has a maximum token length. The models section lists the maximum token lengths of the supported models. See the truncate field in the Reference for ways to handle sequences longer than the maximum token length.

Warning

NV-Embed-QA and E5 models operate in passage or query mode, and thus require the input_type parameter. Use passage when generating embeddings during indexing, and query when generating embeddings during querying. Using the correct input_type is critical: using the wrong one results in large drops in retrieval accuracy.

Because the OpenAI API does not accept input_type as a parameter, you can instead append the -query or -passage suffix to the model parameter, such as NV-Embed-QA-query, and omit the input_type field entirely for OpenAI API compliance.

For example, the following two requests are equivalent.

With the input_type parameter:

curl -X "POST" \
  "http://${HOSTNAME}:${SERVICE_PORT}/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["What is the population of Pittsburgh?"],
    "model": "nvidia/nv-embedqa-e5-v5",
    "input_type": "query",
    "modality": "text"
}'

Without the input_type parameter, using the -query (or -passage) suffix in the model name:

curl -X "POST" \
  "http://${HOSTNAME}:${SERVICE_PORT}/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["What is the population of Pittsburgh?"],
    "model": "nvidia/nv-embedqa-e5-v5-query",
    "modality": "text"
}'

Note that the GTE and GTR models do not accept the input_type parameter, because queries and passages are processed in the same way.

Specify Dynamic Embedding Sizes#

To reduce the storage cost of the returned embeddings, some models support dynamic embedding sizes through Matryoshka Representation Learning. To produce a lower-dimensional embedding representation of your text, use the optional dimensions API parameter. For the full list of supported models, refer to the support matrix.

Important

When you use the dimensions API parameter, use the same value for dimensions in concurrent requests to ensure that dynamic batching works properly.
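As a sketch, assuming a model that supports Matryoshka dimensions (for example, nvidia/llama-3.2-nv-embedqa-1b-v2 per the support matrix) and assuming 384 is a valid size for that model, a request body for /v1/embeddings with the dimensions parameter could be built like this:

```python
import json

# Hypothetical request body for /v1/embeddings. The model name and the
# dimensions value of 384 are illustrative assumptions; check the support
# matrix for models and dimension values that are actually supported.
payload = {
    "input": ["What is the population of Pittsburgh?"],
    "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    "input_type": "query",
    "dimensions": 384,
}

body = json.dumps(payload)
print(body)
```

The resulting JSON string can be sent as the -d argument of a curl request like the ones shown above.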

Limitations#

The following are limitations when you specify dimensions:

  • The dimensions parameter cannot be used in combination with the embedding_type parameter. These are alternative methods for reducing the memory footprint of embeddings and have different performance trade-offs.

Specify Embedding Type#

The /v1/embeddings endpoint contains an optional field named embedding_type that supports the following types of embeddings. For the full list of models and their embedding types, refer to the support matrix.

  • float — Use when you need maximum accuracy, and storage and memory constraints are not a concern. float is the default embedding type.

  • int8, uint8 — Recommended for most production deployments. Provides a balance of compression and accuracy.

  • binary, ubinary — Use for very large-scale systems, where maximum compression is critical, and you can accept some reduced accuracy.

You can specify embedding_type to potentially reduce memory and storage costs, and make it easier to scale vector databases to large datasets. Models that support compressed embedding types are optimized to minimize accuracy loss when you use these representations.

| Embedding Type | Potential Memory Savings | Size per Dimension | Data Type Returned |
|----------------|--------------------------|--------------------|--------------------|
| float          | 1x                       | 4 bytes            | float32            |
| int8           | 4x                       | 1 byte             | int8               |
| uint8          | 4x                       | 1 byte             | uint8              |
| binary         | 32x                      | 1 bit              | int8               |
| ubinary        | 32x                      | 1 bit              | uint8              |
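The savings figures follow directly from the per-dimension sizes. As a quick check, the storage for a single 1024-dimension embedding under each type:

```python
# Bytes needed to store one 1024-dimension embedding for each
# embedding_type, using the per-dimension sizes listed above.
DIM = 1024
bytes_per_type = {
    "float":   DIM * 4,    # 4 bytes per dimension (float32)
    "int8":    DIM * 1,    # 1 byte per dimension
    "uint8":   DIM * 1,
    "binary":  DIM // 8,   # 1 bit per dimension, packed 8 per byte
    "ubinary": DIM // 8,
}
savings = {t: bytes_per_type["float"] / n for t, n in bytes_per_type.items()}
print(bytes_per_type)  # float: 4096, int8/uint8: 1024, binary/ubinary: 128
print(savings)         # 1.0, 4.0, 4.0, 32.0, 32.0
```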

int8 Embeddings#

int8 embeddings reduce memory usage by a factor of four compared to float embeddings, while maintaining high retrieval accuracy. The conversion process maps each floating-point value in the original vector to an 8-bit integer, signed (int8) or unsigned (uint8).
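The exact mapping the NIM uses is not specified here, but a common scheme scales each value by the vector's maximum absolute value so results fit in the signed 8-bit range. A minimal sketch under that assumption:

```python
def quantize_int8(vec):
    """Illustrative int8 quantization: scale by the max absolute value so
    values map into [-127, 127]. This is a common scheme, assumed for
    illustration; the NIM's actual conversion may differ."""
    scale = max(abs(v) for v in vec) or 1.0
    return [round(v / scale * 127) for v in vec]

emb = [0.12, -0.5, 0.25, 1.0]
print(quantize_int8(emb))  # [15, -64, 32, 127]
```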

binary Embeddings#

binary embeddings reduce storage requirements by a factor of 32 and can accelerate search speeds. This makes them ideal for systems that handle large datasets or require very low latency.

The conversion process maps each floating-point value in the original vector into a single bit (0 or 1), and then packs the bits into 8-bit bytes. For example, a 1024-dimension float embedding is transformed into 1024 bits, which are then represented as 128 int8 values (binary) or uint8 values (ubinary).
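The packing step can be sketched in a few lines. The thresholding rule used here (1 for positive values, 0 otherwise) is an illustrative assumption; the model's actual binarization may differ, but the packing arithmetic is the same:

```python
def binarize(vec):
    """Map each float to a bit (1 if positive, else 0) and pack the bits
    into byte values, 8 bits per byte. The sign-based threshold is an
    assumption for illustration."""
    bits = [1 if v > 0 else 0 for v in vec]
    packed = []
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        packed.append(byte)
    return packed

emb = [0.3, -0.1, 0.7, 0.0] * 256  # a 1024-dimension embedding
packed = binarize(emb)
print(len(packed))  # 128 byte values for 1024 dimensions
```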

Warning

Binary embeddings require hamming distance for similarity calculations, not cosine similarity or dot product. If you specify binary for your embedding type, you might need to change your search implementation.
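Hamming distance counts the number of differing bits between two packed vectors, which reduces to XOR plus a popcount per byte:

```python
def hamming(a, b):
    """Hamming distance between two packed binary embeddings (sequences
    of byte values): XOR each byte pair and count the set bits."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Two 16-bit packed embeddings that differ in 3 bit positions.
print(hamming([0b10110010, 0b01100001], [0b10110010, 0b01101010]))  # 3
```

Smaller distances mean more similar embeddings, so nearest-neighbor search ranks by ascending Hamming distance rather than descending cosine similarity.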

Limitations#

The following are limitations when you specify embedding_type:

  • The embedding_type parameter cannot be used in combination with the dimensions parameter. These are alternative methods for reducing the memory footprint of embeddings and have different performance trade-offs.

  • Compressed embedding types (int8, uint8, binary, ubinary) can result in a loss of accuracy compared to float embeddings. The impact on retrieval accuracy varies depending on the embedding type and model.

  • Not all vector databases support compressed embedding types. Verify that your vector database supports int8 or binary embeddings before using these types in production.

Tune Dynamic Batching#

Dynamic batching allows the underlying Triton process in the NIM container to group multiple requests into a single batch, which can improve throughput under certain conditions, such as when serving many requests with small payloads. This feature is enabled by default and can be tuned by setting the NIM_TRITON_DYNAMIC_BATCHING_MAX_QUEUE_DELAY_MICROSECONDS environment variable. The default value is 100 microseconds.

For more information on dynamic batching, refer to the Triton User Guide.

Specify Modality#

The /v1/embeddings endpoint contains a modality field to support text, image, and mixed (text+image) input types. The following are the valid values for modality:

  • "text"

  • "image"

  • "text_image"

For image input, image data URLs are accepted, in image/png or image/jpeg format, with a maximum size of 5MB. You can also specify an image tag that wraps a data URL, such as <img src="data:image/png;base64,..."/>. Each input is validated against model limits.
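A data URL is the base64 encoding of the image bytes prefixed with the media type. As a sketch, a single-image request body could be assembled like this (the byte string here is a placeholder, not a full PNG; a real request would read actual image bytes from a file, up to the documented size limit):

```python
import base64

png_bytes = b"\x89PNG\r\n\x1a\n"  # placeholder bytes, not a real image
data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode()

# Hypothetical request body for an image input; input_type is set to
# passage because image queries are not supported (see Limitations).
payload = {
    "input": [data_url],
    "model": "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1",
    "input_type": "passage",
    "modality": "image",
}
print(data_url[:30])
```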

Limitations#

The following are limitations for image input types:

  • For the model nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1, input_type="query" is text-only.

  • Images as queries are not supported.

  • The size of decoded image bytes is limited to 25 MiB.

  • Image dimensions are limited to 8192 x 16384 or 16384 x 8192 pixels.

Specify Modality Explicitly#

If you specify modality explicitly, you ensure that each input is processed exactly as you intend. The following are the two ways to specify modality:

  • Specify a single string for single input. For example: "modality": "text".

  • Specify an array that matches the length of input for batched requests. For example: "modality": ["text", "image", "text_image"].
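For the batched case, the modality array must be the same length as input, with one entry per element. A hypothetical three-input request body (the data URLs here are truncated placeholders):

```python
import json

# Hypothetical batched request: three inputs, each with its modality
# stated explicitly so the server does not have to infer it.
payload = {
    "input": [
        "What is shown in the chart?",
        "data:image/png;base64,...",                # placeholder data URL
        "caption text data:image/jpeg;base64,...",  # placeholder data URL
    ],
    "model": "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1",
    "input_type": "passage",
    "modality": ["text", "image", "text_image"],
}
assert len(payload["modality"]) == len(payload["input"])
print(json.dumps(payload, indent=2))
```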

Let the Server Infer Modality#

If you omit modality, the server infers the modality for each input by using the following process:

  • If the input is a valid data URL starting with data:image/, the modality is inferred as image.

  • If the input contains both text and an embedded image data URL, such as caption text data:image/png;base64,..., the modality is inferred as text_image.

  • Otherwise the modality is inferred as text.

In some cases, such as malformed data URLs, or text that unintentionally mimics a data URL pattern, there is a risk that the server infers modality incorrectly.
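The rules above can be approximated on the client side to predict what the server will do, which is useful for sanity-checking inputs before sending them. This sketch is an approximation only; the server's actual validation (for example, of malformed data URLs) may differ in edge cases:

```python
def infer_modality(text):
    """Client-side approximation of the modality inference rules
    described above; the server's actual logic may differ."""
    if text.startswith("data:image/"):
        return "image"
    if "data:image/" in text:
        return "text_image"
    return "text"

print(infer_modality("data:image/png;base64,AAAA"))          # image
print(infer_modality("caption data:image/png;base64,AAAA"))  # text_image
print(infer_modality("What is the population of Pittsburgh?"))  # text
```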

API Examples#

Use the examples in this section to help you get started with using the API.

The complete API spec can be found at OpenAI Spec.

List Models#

cURL Request

Use the following command to list the available models.

curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/models" \
-H 'Accept: application/json'

Response

{
  "object": "list",
  "data": [
    {
      "id": "nvidia/nv-embedqa-e5-v5",
      "created": 0,
      "object": "model",
      "owned_by": "organization-owner"
    }
  ]
}

Generate Embeddings#

cURL Request

curl -X "POST" \
  "http://${HOSTNAME}:${SERVICE_PORT}/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["Hello world"],
    "model": "nvidia/nv-embedqa-e5-v5",
    "input_type": "query",
    "modality": "text"
}'

Response

{
  "object": "list",
  "data": [
    {
      "index": 0,
      "embedding": [
        0.0010356903076171875, -0.017669677734375,
        // ...
        -0.0178985595703125
      ],
      "object": "embedding"
    }
  ],
  "model": "nvidia/nv-embedqa-e5-v5",
  "usage": {
    "prompt_tokens": 0,
    "total_tokens": 0
  }
}

For models that do not require the input_type parameter, such as GTE or GTR, use the following sample API calls.

cURL Request

curl -X "POST" \
  "http://${HOSTNAME}:${SERVICE_PORT}/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["Hello world"],
    "model": "nvidia/nv-embedqa-e5-v5",
    "modality": "text"
}'

Response

{
  "object": "list",
  "data": [
    {
      "index": 0,
      "embedding": [
        0.0010356903076171875, -0.017669677734375,
        // ...
        -0.0178985595703125
      ],
      "object": "embedding"
    }
  ],
  "model": "nvidia/nv-embedqa-e5-v5",
  "usage": {
    "prompt_tokens": 0,
    "total_tokens": 0
  }
}

Generate Embeddings of a Specific Type#

The following example generates int8 embeddings by specifying embedding_type.

cURL Request

curl -X "POST" \
  "http://${HOSTNAME}:${SERVICE_PORT}/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["Hello world"],
    "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    "input_type": "query",
    "embedding_type": "int8"
}'

Response

{
  "object": "list",
  "data": [
    {
      "index": 0,
      "embedding": [
        10,
        -89,
        // ...
        -37
      ],
      "object": "embedding"
    }
  ],
  "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 6
  }
}

Health Check#

cURL Request

Use the following command to query the health endpoints.

curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/health/ready" \
-H 'Accept: application/json'
curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/health/live" \
-H 'Accept: application/json'

Response

{
  "object": "health-response",
  "message": "Service is ready."
}
{
  "object": "health-response",
  "message": "Service is live."
}

OpenAPI Reference for Text Embedding NIM#

The following is the OpenAPI reference for NeMo Retriever Text Embedding NIM.