Custom Model Artifact Support in NVIDIA NeMo Retriever Reranking NIM

Model artifacts typically consist of a framework, architecture, and weights. As long as the framework and architecture match a supported model, you can use your own custom weights. Use this documentation to learn how to deploy custom model artifacts with NVIDIA NeMo Retriever Reranking NIM.

Prerequisites

Before you proceed, complete the steps in Get Started With NVIDIA NeMo Retriever Reranking NIM through the Docker login. Stop before the step to launch the NIM.

To use the examples in this documentation, install the Hugging Face CLI (hf, provided by huggingface_hub 0.34.0 or later). For details, refer to Hugging Face CLI.

Supported Model Artifacts

NVIDIA NeMo Retriever Reranking NIM supports the models listed in Support Matrix. You can use your own pre-downloaded model artifacts by staging a Hugging Face-style safetensors model directory on the host. The staged directory must contain the model configuration, safetensors weights, tokenizer files, and any processor configuration required by the selected model.
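
As a rough pre-flight check, the following sketch reports which expected pieces are missing from a staged directory. The file names (`config.json`, `*.safetensors`, `tokenizer*`) follow common Hugging Face conventions and are assumptions, not a list published for this NIM, and `check_staged_dir` is a hypothetical helper name. Processor configuration is not checked because it varies by model.

```shell
# check_staged_dir DIR
# Prints one line per missing item and returns non-zero if anything is missing.
# File names are assumed Hugging Face conventions; adjust them for your model.
check_staged_dir() {
  local dir="$1" missing=0

  # Model configuration
  if [ ! -f "$dir/config.json" ]; then
    echo "missing: config.json"
    missing=1
  fi

  # At least one safetensors weight file
  if ! ls "$dir"/*.safetensors >/dev/null 2>&1; then
    echo "missing: *.safetensors"
    missing=1
  fi

  # At least one tokenizer file (for example tokenizer.json or tokenizer_config.json)
  if ! ls "$dir"/tokenizer* >/dev/null 2>&1; then
    echo "missing: tokenizer files"
    missing=1
  fi

  return "$missing"
}
```

For example, `check_staged_dir "$LOCAL_MODEL_PATH" && echo "staged directory looks complete"` prints the confirmation only when all three checks pass.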

Set NIM_MODEL_PATH to the in-container path for the staged artifact directory. A fine-tuned checkpoint is supported only when it preserves the architecture, tokenizer, model-specific processing, and API output contract of a supported model.
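
One part of that contract, the tokenizer, can be spot-checked before launch. The sketch below compares a tokenizer file between the original artifact and a fine-tuned checkpoint byte for byte. The helper name `same_tokenizer`, its two directory arguments, and the default `tokenizer.json` file name are illustrative assumptions. A mismatch is a reason to look closer, not proof of incompatibility, and a match says nothing about architecture, processing, or the API output contract.

```shell
# same_tokenizer BASE_DIR FINETUNED_DIR [FILE]
# Returns 0 when FILE (default: tokenizer.json, an assumed name) is
# byte-identical in both directories, 1 when it differs, 2 when it is missing.
same_tokenizer() {
  local base="$1" tuned="$2" file="${3:-tokenizer.json}"

  if [ ! -f "$base/$file" ] || [ ! -f "$tuned/$file" ]; then
    echo "cannot compare: $file is missing from one of the directories" >&2
    return 2
  fi

  # cmp -s is silent and exits 0 only when the files are identical
  cmp -s "$base/$file" "$tuned/$file"
}
```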

The deprecated NIM_CUSTOM_MODEL workflow from earlier Retriever NIMs does not apply to NeMo Retriever Reranking NIM containers at version 2.0.0 or later.

Note

To learn more about fine-tuning workflows, refer to NeMo AutoModel.

Download Artifacts Outside the NIM

First, download the artifacts to a host directory. The following example uses the default model ID for this NIM image. Do not switch to another model family unless that exact artifact is documented as compatible with the container.

# Choose a supported model ID and a container name
export NIM_MODEL_NAME=nvidia/llama-nemotron-rerank-vl-1b-v2
export MODEL_SLUG=$(basename "$NIM_MODEL_NAME")
export CONTAINER_NAME=llama-nemotron-rerank-vl-1b-v2

# Choose host paths for downloaded artifacts and the NIM runtime cache
export LOCAL_MODEL_PATH=/tmp/nim-model-path-$MODEL_SLUG
export LOCAL_NIM_CACHE=/tmp/nim-runtime-cache-$MODEL_SLUG
mkdir -p "$LOCAL_MODEL_PATH" "$LOCAL_NIM_CACHE"

# Download the artifacts outside the NIM
hf download "$NIM_MODEL_NAME" --local-dir "$LOCAL_MODEL_PATH"

# Make the staged artifacts readable by the container user
chmod -R a+rX "$LOCAL_MODEL_PATH"

If the model is gated or private, authenticate to the model provider before you run the external download command. After the local directory is populated, the NIM start command does not need the model-provider credential. The NIM image must still be available locally or accessible through your Docker login.
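
Before launching, it can help to confirm that the two host paths are in the state the container expects. The following sketch lists anything under the staged directory that lacks the permission bits set by `chmod -R a+rX`, then checks that the cache directory is writable by the current user. `preflight_paths` is a hypothetical helper, and the writability check assumes the container will run with your host user ID.

```shell
# preflight_paths MODEL_DIR CACHE_DIR
# Prints offending paths and returns non-zero when a check fails.
preflight_paths() {
  local model_dir="$1" cache_dir="$2" blocked

  if [ ! -d "$model_dir" ]; then
    echo "model directory is missing: $model_dir"
    return 1
  fi

  # Entries without the world-read bit, or directories without the world-search bit
  blocked=$(find "$model_dir" \( ! -perm -004 -o -type d ! -perm -001 \) 2>/dev/null)
  if [ -n "$blocked" ]; then
    echo "not accessible to other users:"
    echo "$blocked"
    return 1
  fi

  # The runtime cache must exist and be writable by the current user
  if [ ! -d "$cache_dir" ] || [ ! -w "$cache_dir" ]; then
    echo "cache directory is missing or not writable: $cache_dir"
    return 1
  fi

  return 0
}
```

With the variables from the download example, the call would be `preflight_paths "$LOCAL_MODEL_PATH" "$LOCAL_NIM_CACHE"`.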

Launch the NIM With Staged Model Artifacts

Use a volume mount to map the staged artifact directory to a path in the container, and set NIM_MODEL_PATH to that container path. Use a separate volume mount to map a writable runtime cache to /opt/cache in the container.

# Choose a NIM image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/$CONTAINER_NAME:2.0.0"

# Start the NIM with staged artifacts
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_NAME \
  -e NIM_MODEL_PATH=/model \
  -v "$LOCAL_NIM_CACHE:/opt/cache" \
  -v "$LOCAL_MODEL_PATH:/model:ro" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

The following table describes the flags that are specific to launching a NIM with custom model artifacts. For the full list of standard flags, refer to Get Started With NVIDIA NeMo Retriever Reranking NIM.

Flag	Description
`-e NIM_MODEL_PATH=/model`	Sets the in-container path of the staged model artifact directory.
`-v "$LOCAL_NIM_CACHE:/opt/cache"`	Mounts a writable runtime cache from the host at `/opt/cache` in the container.
`-v "$LOCAL_MODEL_PATH:/model:ro"`	Mounts the host directory that contains the staged model artifacts at `/model` in the container as read-only.
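
After the container starts, you can poll it until it reports ready. This sketch assumes the container serves a `/v1/health/ready` endpoint on the published port 8000, as NIM containers commonly do; this page does not state the path, so confirm it in the getting-started guide for your version. `wait_for_ready` is a hypothetical helper, and the attempt count and delay defaults are arbitrary.

```shell
# wait_for_ready [URL] [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until it returns a successful HTTP status or the attempts run out.
wait_for_ready() {
  local url="${1:-http://localhost:8000/v1/health/ready}"  # assumed endpoint
  local attempts="${2:-30}" delay="${3:-10}" i=1

  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s and -o suppress output
    if curl -fsS -o /dev/null "$url" 2>/dev/null; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    # Wait between attempts, but not after the last one
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done

  echo "not ready after $attempts attempt(s)" >&2
  return 1
}
```

Run `wait_for_ready` in a second terminal while the `docker run` command is starting the NIM in the first.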