Use NVIDIA NeMo Retriever Reranking NIM#

This section provides reranking examples, describes best practices, and points to the security information you need to consider when you work with NVIDIA NeMo Retriever Reranking NIM.

Examples#

Shell (cURL)#

Ranking#

To generate rankings, use the following code.

cURL Request

curl -X 'POST' \
  'http://localhost:8000/v1/ranking' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "nvidia/llama-nemotron-rerank-vl-1b-v2",
  "query": {"text": "which way should i go?"},
  "passages": [
    {"text": "two roads diverged in a yellow wood, and sorry i could not travel both and be one traveler, long i stood and looked down one as far as i could to where it bent in the undergrowth;"},
    {"text": "then took the other, as just as fair, and having perhaps the better claim because it was grassy and wanted wear, though as for that the passing there had worn them really about the same,"},
    {"text": "and both that morning equally lay in leaves no step had trodden black. oh, i marked the first for another day! yet knowing how way leads on to way i doubted if i should ever come back."},
    {"text": "i shall be telling this with a sigh somewhere ages and ages hence: two roads diverged in a wood, and i, i took the one less traveled by, and that has made all the difference."}
  ],
  "truncate": "END"
}'

Response

You should get a response similar to the following.

{
  "rankings": [
    { "index": 0, "logit": -1.2421875 },
    { "index": 3, "logit": -3.029296875 },
    { "index": 2, "logit": -5.41015625 },
    { "index": 1, "logit": -8.2421875 }
  ],
  "usage": {
    "prompt_tokens": 123,
    "total_tokens": 123
  }
}
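The same request can also be sent from Python. The following is a minimal sketch that uses only the standard library and assumes the endpoint and model name from the cURL request above. The helper names build_ranking_payload and rerank are hypothetical and are not part of any NVIDIA client library.

```python
import json
import urllib.request

# Endpoint and model name copied from the cURL example; adjust for your deployment.
RANKING_URL = "http://localhost:8000/v1/ranking"
MODEL = "nvidia/llama-nemotron-rerank-vl-1b-v2"


def build_ranking_payload(query, passages, model=MODEL, truncate="END"):
    """Build the JSON body for a ranking request (hypothetical helper).

    The helper defaults to truncate="END" to mirror the cURL example;
    the API itself defaults to NONE when the field is omitted.
    """
    if truncate not in ("NONE", "END"):
        raise ValueError("truncate must be 'NONE' or 'END'")
    return {
        "model": model,
        "query": {"text": query},
        "passages": [{"text": passage} for passage in passages],
        "truncate": truncate,
    }


def rerank(query, passages, url=RANKING_URL, **payload_options):
    """POST the ranking request and return the parsed JSON response."""
    payload = build_ranking_payload(query, passages, **payload_options)
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a NIM container listening on localhost:8000, calling rerank("which way should i go?", passages)["rankings"] should return a list shaped like the response shown above.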

Best Practices#

A request to the NeMo Retriever Reranking API includes a query, a list of passages, and an optional truncate parameter (either NONE or END; the default is NONE). The API reranks the passages by their relevance to the query. Many datastores return scores for passages, but the Reranking API does not use those scores. It uses only the text of the query and the candidate passages, and ranks the passages according to the model’s understanding of the content.

If truncate is NONE, the container returns an error for inputs whose tokenized representation exceeds the token limit for the underlying model. If truncate is END, all tokens beyond the token limit are ignored, as described in Token Limits & Truncation.

Token Limits & Truncation#

The token limit is measured after tokenization and applies to each query and passage pair, not to the request character count. The applicable limit depends on the model and runtime configuration. For the current supported models and configurations, refer to Support Matrix. If a deployment overrides NIM_MAX_SEQ_LEN, use the effective configured value for that deployment.

When truncate is END and a query and passage pair exceeds the token limit, tokens are truncated from the end of the passage until the pair fits. If the query itself reaches the token limit, the entire passage can be truncated, which makes the reranking result uninformative.
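The effect of END truncation on a single query and passage pair can be illustrated with a small sketch. It assumes whitespace-separated words stand in for tokens and ignores any special tokens the model adds. The real tokenizer is model specific, so treat the numbers as illustrative only. The function truncate_passage is a hypothetical name.

```python
def truncate_passage(query_tokens, passage_tokens, token_limit):
    """Return the passage tokens that survive END truncation for one pair.

    Simplified model: the pair length is len(query) + len(passage), and
    tokens are dropped from the end of the passage until the pair fits.
    """
    # Budget left for the passage once the query is accounted for.
    budget = max(token_limit - len(query_tokens), 0)
    return passage_tokens[:budget]


query = "which way should i go?".split()                 # 5 stand-in tokens
passage = "two roads diverged in a yellow wood".split()  # 7 stand-in tokens

# The pair has 12 tokens; with a limit of 8 only 3 passage tokens remain.
print(truncate_passage(query, passage, token_limit=8))
# With a limit of 5 the query uses the whole budget and the passage is empty.
print(truncate_passage(query, passage, token_limit=5))
```

The second call shows the degenerate case described above: once the query alone reaches the limit, nothing of the passage is left to score.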

Max passages#

You can pass up to 512 passages in a single reranking call.
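If a candidate list is longer than this limit, one option is to split it into several calls. The sketch below assumes the 512-passage limit stated above. The helper batch_passages is a hypothetical name, and combining results from several calls by comparing logits is an assumption that this document does not confirm.

```python
MAX_PASSAGES = 512  # per-call limit stated in this section


def batch_passages(passages, batch_size=MAX_PASSAGES):
    """Yield (offset, batch) pairs with at most batch_size passages per batch.

    The offset is the position of the batch's first passage in the original
    list, so a response index can be mapped back with offset + index.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for offset in range(0, len(passages), batch_size):
        yield offset, passages[offset:offset + batch_size]


candidates = [f"passage {i}" for i in range(1200)]
for offset, batch in batch_passages(candidates):
    print(offset, len(batch))
```

Each batch would then be sent as its own reranking request with the same query.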

Understanding results#

The response to a reranking request includes a list of objects with index and logit keys, sorted in descending order by logit value. The logit is the raw, unnormalized prediction that the model generates for each query and passage pair.

The index is the zero-based position of the passage in the request's passages list. For example, if the request includes the passages ["bears", "house", "grass"] and the indexes in the response are 1, 2, 0, then the passages sorted by relevance are ["house", "grass", "bears"].
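Mapping the response back to the original passages takes one lookup per ranking entry. The following sketch reproduces the example above. The function order_passages is a hypothetical helper, and the logit values are made up for illustration.

```python
def order_passages(passages, rankings):
    """Return the passages in the order given by the response's rankings list."""
    return [passages[entry["index"]] for entry in rankings]


passages = ["bears", "house", "grass"]
rankings = [
    {"index": 1, "logit": 2.0},   # illustrative logits, not real model output
    {"index": 2, "logit": 0.5},
    {"index": 0, "logit": -1.0},
]

print(order_passages(passages, rankings))  # ['house', 'grass', 'bears']
```

Because the response is already sorted in descending order by logit, no additional sorting is needed on the client side.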

Security & Authentication#

For security and authentication information, see Security and Authentication for NVIDIA NeMo Retriever Reranking NIM.