Use the API (OpenAI) for NVIDIA NeMo Retriever Reranking NIM#

The examples on this page show how to get started with the API for NVIDIA NeMo Retriever Reranking NIM.

For the full API reference, refer to API Reference (OpenAI).

Overview#

Maximum Token Length#

Every model has a maximum token length. The models section lists the maximum token length of each supported model. To learn how to handle sequences that exceed this limit, refer to the ranking/truncate field and the RankRequest/truncate schema in the API reference.

How to Tune Dynamic Batching#

Dynamic batching lets the underlying Triton process in the NIM container group one or more requests into a single batch. This can improve throughput under certain conditions, for example when serving many requests with small payloads. The feature is enabled by default. To tune it, set the NIM_TRITON_DYNAMIC_BATCHING_MAX_QUEUE_DELAY_MICROSECONDS environment variable. The default value is 100 microseconds.

For more information on dynamic batching, refer to the Triton User Guide.
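As a rough sketch of how the queue delay might be set at container launch, the following snippet assembles a docker run command that passes the environment variable with -e. The value 500, the NIM_IMAGE placeholder, the GPU flag, and the port mapping are illustrative assumptions, not values from this documentation. A real deployment needs additional options, such as credentials and cache mounts, which are omitted here.

```shell
# Illustrative value only: let Triton wait up to 500 microseconds for a batch to fill.
# The documented default is 100.
QUEUE_DELAY_US=500

# NIM_IMAGE is a placeholder (assumption); substitute the image you actually deploy.
NIM_IMAGE="${NIM_IMAGE:-your-reranking-nim-image:tag}"

# Assemble the launch command as a string so it can be reviewed before it is run.
LAUNCH_CMD="docker run --rm --gpus all -p 8000:8000 -e NIM_TRITON_DYNAMIC_BATCHING_MAX_QUEUE_DELAY_MICROSECONDS=${QUEUE_DELAY_US} ${NIM_IMAGE}"

echo "$LAUNCH_CMD"
```

A larger delay gives Triton more time to collect requests into a batch, at the cost of added latency for each request, so any change is worth measuring against your own traffic.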

Example to List Models#

To list the available models, send a GET request to the /v1/models endpoint.

cURL Request

HOSTNAME="localhost"
SERVICE_PORT="8000"
curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/models" \
-H 'Accept: application/json'

Response

{
  "object": "list",
  "data": [
    {
      "id": "nvidia/llama-nemotron-rerank-1b-v2"
    }
  ]
}
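Scripts often need the model ID from this response to fill the model field of a ranking request. The helper below is a minimal sketch that pulls every "id" value out of the JSON using grep and sed. The function name model_ids is hypothetical, and the text matching assumes the response has the simple shape shown above. A JSON-aware tool such as jq would be more robust where it is available.

```shell
# model_ids (hypothetical helper): read a /v1/models response on stdin
# and print each model ID on its own line.
model_ids() {
  # First isolate every "id": "..." pair, then keep only the quoted value.
  grep -o '"id": *"[^"]*"' | sed 's/.*"id": *"\([^"]*\)"/\1/'
}

# Example usage against a running service (not executed here):
#   MODEL="$(curl -s "http://${HOSTNAME}:${SERVICE_PORT}/v1/models" | model_ids | head -n 1)"
```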

Example to Generate Rankings#

To generate rankings, send a POST request to the /v1/ranking endpoint with a query and the passages to rank.

cURL Request

HOSTNAME="localhost"
SERVICE_PORT="8000"
curl -X "POST" \
  "http://${HOSTNAME}:${SERVICE_PORT}/v1/ranking" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "nvidia/llama-nemotron-rerank-1b-v2",
  "query": {"text": "which way should i go?"},
  "passages": [
    {"text": "two roads diverged in a yellow wood, and sorry i could not travel both and be one traveler, long i stood and looked down one as far as i could to where it bent in the undergrowth;"},
    {"text": "then took the other, as just as fair, and having perhaps the better claim because it was grassy and wanted wear, though as for that the passing there had worn them really about the same,"},
    {"text": "and both that morning equally lay in leaves no step had trodden black. oh, i marked the first for another day! yet knowing how way leads on to way i doubted if i should ever come back."},
    {"text": "i shall be telling this with a sigh somewhere ages and ages hence: two roads diverged in a wood, and i, i took the one less traveled by, and that has made all the difference."}
  ],
  "truncate": "END"
}'

Response

{
  "rankings": [
    {
      "index": 0,
      "logit": 0.7646484375
    },
    {
      "index": 3,
      "logit": -1.1044921875
    },
    {
      "index": 2,
      "logit": -2.71875
    },
    {
      "index": 1,
      "logit": -5.09765625
    }
  ]
}
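The response lists the passages in ranked order, highest logit first, and identifies each one by an index value that appears to refer to its position in the request's passages array. The sketch below extracts those indices in ranked order so a script can look up the matching passages. The function name ranked_indices is hypothetical, and the text matching assumes the response shape shown above.

```shell
# ranked_indices (hypothetical helper): read a /v1/ranking response on stdin
# and print the passage indices in the order the service returned them.
ranked_indices() {
  # Isolate every "index": N pair, then keep only the number.
  grep -o '"index": *[0-9][0-9]*' | grep -o '[0-9][0-9]*$'
}

# Example usage (not executed here): save the response to a file, then take
# the index of the best-ranked passage.
#   TOP_INDEX="$(ranked_indices < response.json | head -n 1)"
```

For the response above, the helper would print 0, 3, 2, and 1 on separate lines, so passage 0 is the best match for the query.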

Example Health Checks#

The following example queries the health/live endpoint to check whether the service is up and running.

cURL Request

HOSTNAME="localhost"
SERVICE_PORT="8000"
curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/health/live" \
-H 'Accept: application/json'

Response

{
  "object": "health-response",
  "message": "Service is live."
}

The following example queries the health/ready endpoint to check whether the service is ready to receive requests.

cURL Request

HOSTNAME="localhost"
SERVICE_PORT="8000"
curl "http://${HOSTNAME}:${SERVICE_PORT}/v1/health/ready" \
-H 'Accept: application/json'

Response

{
  "object": "health-response",
  "message": "Service is ready."
}
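A startup script usually needs to wait for the service before it sends ranking requests. The following sketch polls the health/ready endpoint until it succeeds or a retry limit is reached. The function name wait_until_ready and its default retry count and delay are hypothetical. It also assumes that the endpoint refuses the connection or returns a non-2xx status until the service is ready, which you should confirm for your deployment.

```shell
# wait_until_ready (hypothetical helper): poll a readiness URL.
#   $1 = URL to poll
#   $2 = maximum number of attempts (default 30)
#   $3 = seconds to sleep between attempts (default 2)
# Returns 0 as soon as a request succeeds, or 1 if every attempt fails.
wait_until_ready() {
  url="$1"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error statuses; -s hides progress output.
    if curl -fs "$url" -H 'Accept: application/json' >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage (not executed here):
#   wait_until_ready "http://${HOSTNAME}:${SERVICE_PORT}/v1/health/ready" || exit 1
```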