Use the API (OpenAI) for NVIDIA NeMo Retriever Reranking NIM#

The examples in this documentation help you get started with the API for NVIDIA NeMo Retriever Reranking NIM.

For the full API reference, refer to API Reference (OpenAI).

Overview#

Maximum Token Length#

Every model has a maximum token length. The models section lists the maximum token length of each supported model. To learn how sequences longer than the maximum token length are handled, refer to the ranking/truncate field and the RankRequest/truncate schema in the API reference.

VLM Reranking Requests#

Use nvidia/llama-nemotron-rerank-vl-1b-v2 for multimodal reranking. The query is always text. Each candidate passage can include text, an image, or both. The following rules apply to every request.

query.text must be a non-empty string.
Each passage must include at least one of text or image.
For image-only passages, omit text. Do not set text to an empty string.
Images must be base64 data URLs, such as data:image/jpeg;base64,... or data:image/png;base64,.... PNG, JPEG, WebP, GIF, BMP, and TIFF image bytes are recognized.
The NIM rejects asset ID references and non-base64 image encodings.
A request can include up to 512 passages.
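
The rules above can be checked on the client before a request is sent. The following Python sketch is illustrative only: `validate_request` and `MAX_PASSAGES` are hypothetical names, not part of the NIM, and the checks cover only the rules listed in this section. The NIM remains the authority on what it accepts.

```python
MAX_PASSAGES = 512  # per-request limit stated in the rules above


def validate_request(query_text, passages):
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    if not isinstance(query_text, str) or not query_text:
        problems.append("query.text must be a non-empty string")
    if len(passages) > MAX_PASSAGES:
        problems.append(f"too many passages: {len(passages)} > {MAX_PASSAGES}")
    for i, passage in enumerate(passages):
        if "text" not in passage and "image" not in passage:
            problems.append(f"passage {i}: needs text, image, or both")
        if passage.get("text") == "":
            problems.append(f"passage {i}: omit text instead of sending an empty string")
        image = passage.get("image")
        if image is not None:
            # Expect the shape data:image/<type>;base64,<payload>
            header, sep, data = image.partition(",") if isinstance(image, str) else ("", "", "")
            well_formed = header.startswith("data:image/") and header.endswith(";base64")
            if not (well_formed and sep and data):
                problems.append(f"passage {i}: image must be a base64 data URL")
    return problems
```

An empty return value means the request satisfies the listed rules. The sketch does not decode the image payload or verify the image format.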

How to Tune Dynamic Batching#

Dynamic batching lets the NIM group requests into a single batch, which can improve throughput under certain conditions, such as when serving many requests with small payloads. The feature is enabled by default. To tune it, set the NIM_MAX_WAIT_MS environment variable. The default value is 10 ms (milliseconds).

Example Health Checks#

The following example queries the health/live endpoint to see if the service is up and running.

cURL Request

HOSTNAME="localhost"
SERVICE_PORT="8000"
curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/health/live" \
-H 'Accept: application/json'

Response

{
  "object": "health.response",
  "message": "Service is live.",
  "status": "live"
}

The following example queries the health/ready endpoint to see if the service is ready to receive requests.

cURL Request

curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/health/ready" \
-H 'Accept: application/json'

Response

{
  "object": "health.response",
  "message": "Service is ready.",
  "status": "ready"
}
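
A client that starts alongside the service might poll the health/ready endpoint before sending ranking requests. The following Python sketch shows one way to do that with the standard library. The function names, the retry count, and the delay are illustrative assumptions. The sketch treats any HTTP error, connection error, or unexpected body as "not ready yet".

```python
import json
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the "status" field of a health response, or None on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp).get("status")
    except (urllib.error.URLError, OSError, ValueError, AttributeError):
        return None


def wait_until_ready(base_url, attempts=30, delay_s=2.0,
                     fetch=fetch_status, sleep=time.sleep):
    """Poll /v1/health/ready until it reports "ready" or attempts run out."""
    url = f"{base_url}/v1/health/ready"
    for attempt in range(attempts):
        if fetch(url) == "ready":
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # back off before the next probe
    return False
```

For example, `wait_until_ready("http://localhost:8000")` returns `True` once the service reports the ready status shown above. The `fetch` and `sleep` parameters exist only so that the polling loop can be exercised without a running service.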

Example to List Models#

To list the available models, use the following code.

cURL Request

HOSTNAME="localhost"
SERVICE_PORT="8000"
curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/models" \
-H 'Accept: application/json'

Response

{
  "object": "list",
  "data": [
    {
      "id": "nvidia/llama-nemotron-rerank-vl-1b-v2"
    }
  ]
}
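
A client can use this response to confirm that the model it expects is being served. The following Python sketch assumes the response shape shown above. `model_ids` is a hypothetical helper, not part of any NVIDIA library.

```python
import json


def model_ids(models_response):
    """Extract the model IDs from a /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


# The sample response from this section, parsed from JSON.
sample = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "nvidia/llama-nemotron-rerank-vl-1b-v2"}
  ]
}
""")

served = model_ids(sample)
```

A check such as `"nvidia/llama-nemotron-rerank-vl-1b-v2" in served` can then guard the first ranking request.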

Example to Generate Rankings#

To generate rankings, use the following code.

cURL Request

HOSTNAME="localhost"
SERVICE_PORT="8000"
curl -X "POST" \
  "http://${HOSTNAME}:${SERVICE_PORT}/v1/ranking" \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "nvidia/llama-nemotron-rerank-vl-1b-v2",
  "query": {"text": "which way should i go?"},
  "passages": [
    {"text": "two roads diverged in a yellow wood, and sorry i could not travel both and be one traveler, long i stood and looked down one as far as i could to where it bent in the undergrowth;"},
    {"text": "then took the other, as just as fair, and having perhaps the better claim because it was grassy and wanted wear, though as for that the passing there had worn them really about the same,"},
    {"text": "and both that morning equally lay in leaves no step had trodden black. oh, i marked the first for another day! yet knowing how way leads on to way i doubted if i should ever come back."},
    {"text": "i shall be telling this with a sigh somewhere ages and ages hence: two roads diverged in a wood, and i, i took the one less traveled by, and that has made all the difference."}
  ],
  "truncate": "END"
}'

Response

You should get a response similar to the following.

{
  "rankings": [
    {
      "index": 0,
      "logit": 0.7646484375
    },
    {
      "index": 3,
      "logit": -1.1044921875
    },
    {
      "index": 2,
      "logit": -2.71875
    },
    {
      "index": 1,
      "logit": -5.09765625
    }
  ],
  "usage": {
    "prompt_tokens": 123,
    "total_tokens": 123
  }
}
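
The same request can be built from Python, and the response can be mapped back to the original passages. The following sketch uses only the standard library. It assumes that each `index` in the response refers to the position of a passage in the request's `passages` array, which is consistent with the sample above but is stated here as an assumption. `build_ranking_request` and `order_passages` are hypothetical helper names.

```python
import json
import urllib.request


def build_ranking_request(base_url, model, query, passages, truncate="END"):
    """Build (but do not send) a POST request for the /v1/ranking endpoint."""
    body = {
        "model": model,
        "query": {"text": query},
        "passages": [{"text": p} for p in passages],
        "truncate": truncate,  # "END" is the value used in the cURL example
    }
    return urllib.request.Request(
        f"{base_url}/v1/ranking",
        data=json.dumps(body).encode("utf-8"),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def order_passages(passages, response):
    """Pair each ranking entry with its passage, keeping the response order."""
    return [(passages[r["index"]], r["logit"]) for r in response["rankings"]]
```

To send the request, pass the returned object to `urllib.request.urlopen` and parse the body with `json.load`. `order_passages` keeps the order in which the rankings are returned and does not re-sort them.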

Example to Generate Rankings for Images and Text#

To generate multimodal rankings, use the following code.

cURL Request

set -euo pipefail

URL="${NIM_URL:-http://localhost:8000}"

img1="https://developer.download.nvidia.com/images/isaac/nvidia-isaac-lab-1920x1080.jpg"
img2="https://developer-blogs.nvidia.com/wp-content/uploads/2025/02/hc-press-evo2-nim-25-featured-b.jpg"

echo "Fetching images..."
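# -f makes curl fail on HTTP errors; tr removes the line wraps that some base64 builds add.
b64_1=$(curl -fsS "${img1}" | base64 | tr -d '\n')
b64_2=$(curl -fsS "${img2}" | base64 | tr -d '\n')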

payload=$(mktemp)
trap 'rm -f "${payload}"' EXIT

cat > "${payload}" <<EOF
{
  "model": "nvidia/llama-nemotron-rerank-vl-1b-v2",
  "query": {
    "text": "How is AI improving the intelligence and capabilities of robots?"
  },
  "passages": [
    {
      "text": "AI enables robots to perceive, plan, and act autonomously.",
      "image": "data:image/jpeg;base64,${b64_1}"
    },
    {
      "text": "A biological foundation model designed to analyze and generate DNA, RNA, and protein sequences.",
      "image": "data:image/jpeg;base64,${b64_2}"
    }
  ]
}
EOF

echo "Sending multimodal rerank request to ${URL}..."
curl -s -X POST "${URL}/v1/ranking" \
  -H "Content-Type: application/json" \
  -d "@${payload}" | python3 -m json.tool

Response

You should get a response similar to the following.

{
    "rankings": [
        {
            "index": 0,
            "logit": -2.56640625
        },
        {
            "index": 1,
            "logit": -6.23046875
        }
    ],
    "usage": {
        "prompt_tokens": 3651,
        "total_tokens": 3651
    }
}
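
When the images are already in memory, the multimodal payload can be assembled in Python instead of a shell here-document. The following sketch uses only the standard library. `image_passage` and `multimodal_payload` are hypothetical helper names, and the caller is assumed to know the MIME type of each image.

```python
import base64
import json


def image_passage(image_bytes, mime_type="image/jpeg", text=None):
    """Build one passage with a base64 data URL image and optional text."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    passage = {"image": f"data:{mime_type};base64,{encoded}"}
    if text:
        # Image-only passages omit the text key instead of sending "".
        passage["text"] = text
    return passage


def multimodal_payload(model, query, passages):
    """Serialize a multimodal ranking request body to JSON."""
    return json.dumps({
        "model": model,
        "query": {"text": query},
        "passages": passages,
    })
```

The resulting string can be sent as the body of a POST request to the ranking endpoint, in the same way that the cURL example sends the payload file.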
