
Get Started With NVIDIA NeMo Retriever Reranking NIM#

This documentation helps you get started with NVIDIA NeMo Retriever Reranking NIM.

Prerequisites#

Before you can get started, you need the following:

Verify that you have supported hardware and software. For details, refer to the support matrix.
If you are running on an RTX AI PC or Workstation, install WSL2. For instructions, refer to NIM on WSL2 documentation.
Create an account on NVIDIA NGC and generate an API key to access the NIM container images and model assets on NGC. For instructions, refer to the NGC Authentication section that follows.

Note

Deploying on Kubernetes is not supported for WSL.

NGC Authentication#

Generate your API key#

To access the NIM container images and model assets on NGC, you must generate a personal API key. To create your key, go to https://org.ngc.nvidia.com/setup/api-keys.

When you create your key, for Services Included, select the following:

NGC Catalog
Private Registry (if you are an Early Access participant)

You can include more services if you are going to use this key for other purposes. For more information, refer to the NGC User Guide.

Export the API key#

To conveniently use your API key in the commands in the following sections, you can export your key as an environment variable named NGC_API_KEY. For example, run the following code in your terminal.

export NGC_API_KEY=<your API key value>

Run one of the following commands to make your key available when you start a new terminal session.

# If using bash
echo "export NGC_API_KEY=<your API key value>" >> ~/.bashrc

# If using zsh
echo "export NGC_API_KEY=<your API key value>" >> ~/.zshrc

Note

Other, more secure options include saving your API key value in a file (retrieve it by using cat $NGC_API_KEY_FILE), or saving your key in a password manager.
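
As a minimal sketch of the file-based option in this note, the following Python helper reads the key from a file when NGC_API_KEY_FILE is set and otherwise falls back to NGC_API_KEY. The helper name load_ngc_api_key is hypothetical, and the sketch assumes that NGC_API_KEY_FILE holds the path to a file that contains only the key.

```python
import os


def load_ngc_api_key(environ=None):
    """Return the NGC API key from a key file or from the environment.

    Hypothetical helper: prefers the file named by NGC_API_KEY_FILE and
    falls back to the NGC_API_KEY variable.
    """
    env = os.environ if environ is None else environ
    key_file = env.get("NGC_API_KEY_FILE")
    if key_file:
        # Assumption: the file contains only the key, optionally followed by a newline.
        with open(key_file, encoding="utf-8") as f:
            key = f.read().strip()
    else:
        key = env.get("NGC_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set NGC_API_KEY or NGC_API_KEY_FILE before continuing.")
    return key
```

Keeping the key in a file with restrictive permissions avoids writing it into a shell startup file in plain text.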

Docker Login to NGC#

Before you can pull the NIM container image from NGC, first authenticate to NGC by using the following command.

echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

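Use $oauthtoken as the username and the value of NGC_API_KEY as the password. The $oauthtoken username is a special literal name that indicates that you authenticate with an API key and not a user name and password. The single quotes in the command prevent the shell from expanding $oauthtoken as a variable.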

Accept the License Terms#

Some NIM models require that you accept the license terms on NGC before you can pull the container image and model assets. To accept the license terms, browse to the model or container page on the NGC Catalog, read the terms, and then click Accept Terms.

Launch the NIM#

The following commands launch a Docker container for the llama-nemotron-rerank-vl-1b-v2 model. For Docker versions >= 19.03, the --runtime=nvidia option has the same effect as the --gpus all option.

Starting in version 2.0.0, the NIM supports two model download providers, which you select by setting NIM_MODEL_DOWNLOAD_PROVIDER. The default is hf (Hugging Face); set it to ngc to download from NGC instead. Use the flow that matches your environment.

# Choose a container name for bookkeeping
export NIM_MODEL_NAME=nvidia/llama-nemotron-rerank-vl-1b-v2
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/$CONTAINER_NAME:2.0.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE/cache" "$LOCAL_NIM_CACHE/weights"

The runtime uses separate directories for cached artifacts and model weights. Mount the artifact cache at /opt/cache and the model weights cache at /model. The NIM_MODEL_PATH environment variable controls where model weights are downloaded inside the container. For standard deployment, leave NIM_MODEL_PATH unset; the embedding NIM downloads weights to /model/embed, and the reranking NIM downloads weights to /model/rerank. For fine-tuned model deployment, mount the local model weights into /model and set NIM_MODEL_PATH to the mounted path.

Hugging Face provider (default)#

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e HF_TOKEN \
  -v "$LOCAL_NIM_CACHE/cache:/opt/cache" \
  -v "$LOCAL_NIM_CACHE/weights:/model" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

NGC provider#

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_DOWNLOAD_PROVIDER=ngc \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE/cache:/opt/cache" \
  -v "$LOCAL_NIM_CACHE/weights:/model" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

Flags	Description
`-it`	`--interactive` + `--tty` (see Docker docs).
`--rm`	Delete the container after it stops.
`--name=llama-nemotron-rerank-vl-1b-v2`	Give a name to the NIM container for bookkeeping.
`--runtime=nvidia`	Ensure NVIDIA drivers are accessible in the container.
`--gpus all`	Expose all NVIDIA GPUs inside the container.
`--shm-size=16GB`	Allocate host memory for multi-GPU communication. Not required for single-GPU models or GPUs with NVLink.
`-e HF_TOKEN`	Hugging Face access token. Required when using the default `hf` download provider.
`-e NGC_API_KEY`	NGC API key. Required when using the `ngc` download provider.
`-e NIM_MODEL_DOWNLOAD_PROVIDER=ngc`	Force the NIM to download model artifacts from NGC instead of Hugging Face.
`-v "$LOCAL_NIM_CACHE/cache:/opt/cache"`	Mount a cache directory so downloaded artifacts can be reused by follow-up runs.
`-v "$LOCAL_NIM_CACHE/weights:/model"`	Mount a model weights cache directory. Leave `NIM_MODEL_PATH` unset to use the default model path for the NIM.
`-u $(id -u)`	Use the same user as your system user inside the container to avoid permission mismatches on the cache directory.
`-p 8000:8000`	Publish the NIM HTTP port.
`$IMG_NAME`	NIM container image from NGC.
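
To show how the two flows differ only in their environment flags, here is a minimal Python sketch that assembles the same docker run arguments as the commands above. The function name build_docker_args and its parameters are hypothetical; every flag comes from the table. The sketch only builds the argument list and does not start a container.

```python
import os


def build_docker_args(image, cache_root, provider="hf", container_name=None, uid=None):
    """Assemble the `docker run` argument list for the hf or ngc flow."""
    if provider not in ("hf", "ngc"):
        raise ValueError("provider must be 'hf' or 'ngc'")
    # Default the container name to the image name without registry path or tag.
    name = container_name or image.rsplit("/", 1)[-1].split(":", 1)[0]
    user = os.getuid() if uid is None else uid
    args = [
        "docker", "run", "-it", "--rm", f"--name={name}",
        "--runtime=nvidia", "--gpus", "all", "--shm-size=16GB",
    ]
    if provider == "ngc":
        # NGC flow: switch the provider and pass the NGC API key through.
        args += ["-e", "NIM_MODEL_DOWNLOAD_PROVIDER=ngc", "-e", "NGC_API_KEY"]
    else:
        # Default flow: pass the Hugging Face token through.
        args += ["-e", "HF_TOKEN"]
    args += [
        "-v", f"{cache_root}/cache:/opt/cache",
        "-v", f"{cache_root}/weights:/model",
        "-u", str(user),
        "-p", "8000:8000",
        image,
    ]
    return args
```

You could pass the returned list to subprocess.run, or print it with shlex.join to compare it with the commands above.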

GPU clusters with GPUs in Multi-Instance GPU (MIG) mode are not supported.

Verify that the Service is Ready#

After you launch the NIM, it might take a few seconds for the service to be ready to accept requests. To verify that the service is ready, run the following code.

curl -X 'GET' 'http://localhost:8000/v1/health/ready'

If the service is ready, you should see a response similar to the following.

{"object":"health.response","message":"Service is ready.","status":"ready"}
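
If you script the startup, you might poll the readiness endpoint instead of running curl by hand. The following is a minimal sketch under stated assumptions: the function names, the 600-second deadline, and the 5-second interval are illustrative choices, not values from the NIM documentation. The check treats the service as ready only when the endpoint returns HTTP 200 and a JSON body whose status field is "ready", as in the sample response above.

```python
import json
import time
import urllib.error
import urllib.request


def is_ready(url="http://localhost:8000/v1/health/ready", timeout_s=5.0):
    """Return True if the endpoint answers 200 with a JSON status of "ready"."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = json.loads(resp.read().decode("utf-8"))
            return (
                resp.status == 200
                and isinstance(body, dict)
                and body.get("status") == "ready"
            )
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, HTTP error status, timeout, or a non-JSON body.
        return False


def wait_until_ready(check=is_ready, deadline_s=600.0, interval_s=5.0, sleep=time.sleep):
    """Call `check` until it returns True or `deadline_s` seconds of waiting have passed."""
    waited = 0.0
    while True:
        if check():
            return True
        if waited >= deadline_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

The check and sleep parameters are injectable so that the loop can be exercised without a running container.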

Run Inference#

After the service is ready, use the following code to run inference. For more information, refer to Use the API (OpenAI) for NVIDIA NeMo Retriever Reranking NIM.

curl -X "POST" \
  "http://localhost:8000/v1/ranking" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "nvidia/llama-nemotron-rerank-vl-1b-v2",
  "query": {"text": "which way did the traveler go?"},
  "passages": [
    {"text": "two roads diverged in a yellow wood, and sorry i could not travel both and be one traveler, long i stood and looked down one as far as i could to where it bent in the undergrowth;"},
    {"text": "then took the other, as just as fair, and having perhaps the better claim because it was grassy and wanted wear, though as for that the passing there had worn them really about the same,"},
    {"text": "and both that morning equally lay in leaves no step had trodden black. oh, i marked the first for another day! yet knowing how way leads on to way i doubted if i should ever come back."},
    {"text": "i shall be telling this with a sigh somewhere ages and ages hence: two roads diverged in a wood, and i, i took the one less traveled by, and that has made all the difference."}
  ],
  "truncate": "END"
}'
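
The same request can be sent from Python. The sketch below mirrors the curl example by using only the standard library. The function names are hypothetical, and the response handling in order_passages rests on an assumption that is not confirmed on this page: that the service returns a rankings list whose entries carry the passage index and a logit score. Check the API reference before you rely on that shape.

```python
import json
import urllib.request

RANKING_URL = "http://localhost:8000/v1/ranking"
MODEL = "nvidia/llama-nemotron-rerank-vl-1b-v2"


def build_payload(query, passages, model=MODEL, truncate="END"):
    """Build the same request body as the curl example."""
    return {
        "model": model,
        "query": {"text": query},
        "passages": [{"text": p} for p in passages],
        "truncate": truncate,
    }


def rank(query, passages, url=RANKING_URL):
    """POST the payload to the ranking endpoint and return the decoded JSON."""
    data = json.dumps(build_payload(query, passages)).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=data,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


def order_passages(passages, response):
    """Pair each passage with its score, highest score first.

    ASSUMPTION: the response has the form
    {"rankings": [{"index": <int>, "logit": <float>}, ...]}.
    """
    ranked = sorted(response["rankings"], key=lambda r: r["logit"], reverse=True)
    return [(passages[r["index"]], r["logit"]) for r in ranked]
```

With the service running, order_passages(passages, rank(query, passages)) would return the passages with their scores, under the response-shape assumption stated above.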

Run Image Passage Reranking#

The VLM reranking model accepts a text query and candidate passages that include images. For details, refer to VLM reranking requests.

python3 - <<'PY'
import base64
import json
import struct
import zlib

def png_data_url(rgb):
    width = height = 224
    rows = []
    for _ in range(height):
        row = bytearray([0])
        for _ in range(width):
            row.extend(rgb)
        rows.append(bytes(row))

    def chunk(kind, data):
        return (
            struct.pack(">I", len(data))
            + kind
            + data
            + struct.pack(">I", zlib.crc32(kind + data) & 0xFFFFFFFF)
        )

    png = (
        b"\x89PNG\r\n\x1a\n"
        + chunk(b"IHDR", struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0))
        + chunk(b"IDAT", zlib.compress(b"".join(rows), 9))
        + chunk(b"IEND", b"")
    )
    return "data:image/png;base64," + base64.b64encode(png).decode()

payload = {
    "model": "nvidia/llama-nemotron-rerank-vl-1b-v2",
    "query": {"text": "Which image is mostly red?"},
    "passages": [
        {"image": png_data_url((220, 30, 30))},
        {"image": png_data_url((30, 30, 220))},
    ],
    "truncate": "END",
}

with open("payload-rerank-image.json", "w", encoding="utf-8") as f:
    json.dump(payload, f)
PY

curl -s -X "POST" \
  "http://localhost:8000/v1/ranking" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d @payload-rerank-image.json | python3 -m json.tool
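
The example above generates solid-color PNG images in memory. To rerank a PNG file that you already have on disk, you could build the passage entry as in the following sketch. The helper name png_file_to_passage is hypothetical. The sketch handles only PNG because that is the format shown above; it makes no claim about other image formats.

```python
import base64
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_file_to_passage(path):
    """Return a passage of the form {"image": <data URL>} for a PNG file on disk."""
    data = Path(path).read_bytes()
    # Reject files that do not start with the 8-byte PNG signature.
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError(f"{path} does not look like a PNG file")
    encoded = base64.b64encode(data).decode("ascii")
    return {"image": "data:image/png;base64," + encoded}
```

The returned dictionary has the same form as the entries in the passages list of the payload above.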

Deploy on Multiple GPUs#

Support for multi-GPU deployment is not included in the 2.0.0 release of the embedding and reranking NIMs.

Download NIM Models to Cache#

Starting in version 2.0.0, the NIM selects kernels automatically at startup based on the GPU’s compute capability, and model weights download on the first run. There is no list-model-profiles step.

To pre-fetch model weights, for example to stage them for an air-gapped deployment, start the container one time with separate directories mounted for model weights and runtime cache. The NIM downloads model weights to the default path under /model and writes runtime artifacts to /opt/cache. You can stop the container after the weights are present, or you can start the container with NIM_PRECOMPILE_ONLY=1 so that it exits after the artifacts are compiled and before the server starts.

# Choose a container name for bookkeeping
export NIM_MODEL_NAME=nvidia/llama-nemotron-rerank-vl-1b-v2
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/$CONTAINER_NAME:2.0.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE/cache" "$LOCAL_NIM_CACHE/weights"

# Start the NIM with the artifact and model weights caches mounted.
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia --gpus all --shm-size=16GB \
  -e HF_TOKEN \
  -v "$LOCAL_NIM_CACHE/cache:/opt/cache" \
  -v "$LOCAL_NIM_CACHE/weights:/model" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

# After the weights are present in $LOCAL_NIM_CACHE/weights/rerank and artifacts are present in $LOCAL_NIM_CACHE/cache, stop the container.

For a fully air-gapped deployment, download the model weights and runtime cache on a connected computer, stage them at paths on the target host, and mount those directories at runtime. Mount the staged model weights at the path specified by NIM_MODEL_PATH, and mount the runtime cache separately at /opt/cache:

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia --gpus all --shm-size=16GB \
  --network=none \
  -e NIM_MODEL_PATH=/model/rerank \
  -v /path/to/staged/weights:/model/rerank:ro \
  -v /path/to/runtime/cache:/opt/cache \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
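
Because the container runs with --network=none, it cannot download anything that is missing from the staged directories. A pre-flight check on the target host can catch an empty or mistyped path before launch. The following is a minimal sketch; the function name check_staged_dirs is hypothetical, and it verifies only that each directory exists and is not empty, not that the contents are complete.

```python
from pathlib import Path


def check_staged_dirs(weights_dir, cache_dir):
    """Return a list of problems with the staged directories (empty if none)."""
    problems = []
    for label, raw in (("model weights", weights_dir), ("runtime cache", cache_dir)):
        path = Path(raw)
        if not path.is_dir():
            problems.append(f"{label}: {path} is not a directory")
        elif not any(path.iterdir()):
            problems.append(f"{label}: {path} is empty")
    return problems
```

Call it with the two host paths that you pass to the -v options, and start the container only if the returned list is empty.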

Stop the Container#

To stop the Docker container, run the following code.

docker stop $CONTAINER_NAME

To remove the Docker container, run the following code. If you included the --rm flag when you started the container, you don’t need this step.

docker rm $CONTAINER_NAME