Release Notes for NVIDIA NeMo Retriever Reranking NIM#

This documentation contains the release notes for NVIDIA NeMo Retriever Reranking NIM.

Note

Some releases are labelled “Production Branch” or “(PB)”. Production Branches provide reliable, stable versions of the NIM. Non-production branch releases (sometimes called Feature Branch (FB) releases) contain the latest features, improvements, and optimizations.

Release 1.12 - Production Branch Only#

This release is a production branch.

Highlights#

1.12.0: Production branch release of llama-3.2-nv-rerankqa-1b-v2 with STIG/FIPS base image.
1.12.0: Upgraded to Triton Inference Server version 26.03 to address CVEs.

Known Issues#

There are no known issues in this release.

Release 1.11.0#

Highlights#

Added support for multimodal reranking of text and images with the new NIM llama-nemotron-rerank-vl-1b-v2. For more information, refer to Use the API (OpenAI) for NVIDIA NeMo Retriever Reranking NIM.

Release 1.10.0#

Highlights#

Renamed existing models to the new Nemotron brand. The affected models are the following:
- The llama-3.2-nemoretriever-500m-rerank-v2 model is now named llama-nemotron-rerank-500m-v2.
- The llama-3.2-nv-rerankqa-1b-v2 model is now named llama-nemotron-rerank-1b-v2.
Added fixes for high and critical vulnerabilities.
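
The renames above can be captured in a small lookup for client code that still refers to the old identifiers. This is a minimal sketch: the helper name `current_model_name` is hypothetical, and these notes do not say whether the service still accepts the old names.

```python
# Old-to-new model names, copied from the rename list in these notes.
RENAMED_MODELS = {
    "llama-3.2-nemoretriever-500m-rerank-v2": "llama-nemotron-rerank-500m-v2",
    "llama-3.2-nv-rerankqa-1b-v2": "llama-nemotron-rerank-1b-v2",
}


def current_model_name(name: str) -> str:
    """Return the Nemotron-brand name for a renamed model.

    The lookup is case-insensitive because older notes also spell the
    names with capitals. Names that were not renamed are returned unchanged.
    (Hypothetical helper, not part of the NIM.)
    """
    return RENAMED_MODELS.get(name.lower(), name)
```

For example, `current_model_name("Llama-3.2-NV-RerankQA-1B-v2")` resolves to the new 1B name, while an unrelated name passes through untouched.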

Fixed Known Issues#

The following are the known issues that are fixed in this version:

Fixed an issue with the persistence.enabled helm chart value. Persistent storage options (persistence.storageClass, persistence.existingClaim, hostPath.enabled) are now fully functional.

Release 1.9 - Production Branch Only#

This release is a production branch.

Highlights#

1.9.0: Production branch release of llama-3.2-nv-rerankqa-1b-v2 with STIG/FIPS base image.
1.9.0: Upgraded to Triton Inference Server version 25.08.03 to address CVEs.
CUDA version changed from 12.9 to 13. For details, refer to What’s New and Important in CUDA Toolkit 13.0.
1.9.1 - 1.9.x: CVE fixes for high and critical vulnerabilities.
1.9.1: Updated the API to return HTTP 422 (Unprocessable Content) instead of HTTP 400 (Bad Request) when the input text exceeds the maximum token length.
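
A client that talks to servers both before and after the 1.9.1 change may want to treat either status code as a possible over-length input. The following is a minimal sketch of that check. The constant and function names are hypothetical, and it assumes that other validation failures can share these codes, so a match is only a hint to inspect the response body.

```python
# Status codes that these notes associate with input exceeding the
# maximum token length.
LEGACY_TOKEN_LIMIT_STATUS = 400   # Bad Request (before the 1.9.1 change)
CURRENT_TOKEN_LIMIT_STATUS = 422  # Unprocessable Content (1.9.1 change)


def may_be_token_limit_error(status_code: int) -> bool:
    """Return True if the status code could signal over-length input.

    Assumption: both codes can also be returned for other client errors,
    so the caller should still read the error message in the response.
    """
    return status_code in (LEGACY_TOKEN_LIMIT_STATUS, CURRENT_TOKEN_LIMIT_STATUS)
```

A caller could use this to decide whether to truncate the passages and retry, rather than failing on the first 4xx response.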

Known Issues#

There are no known issues in this release.

Release 1.8.0#

Summary#

Upgraded to use Triton Inference Server 25.08 to address CVEs.
Added TensorRT-optimized engines for the following CUDA GPU compute capabilities: 12.0, 10.0, 9.0, 8.9, 8.6, and 8.0.
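
To check a GPU against the compute capability list above, a simple membership test is enough. This is a minimal sketch with a hypothetical helper name. These notes do not say what happens on GPUs outside the list, so the function only reports whether a capability appears in it.

```python
# Compute capabilities listed in the 1.8.0 notes, as (major, minor) pairs.
OPTIMIZED_COMPUTE_CAPABILITIES = {
    (12, 0),
    (10, 0),
    (9, 0),
    (8, 9),
    (8, 6),
    (8, 0),
}


def has_optimized_engine(major: int, minor: int) -> bool:
    """Return True if the (major, minor) capability is in the listed set."""
    return (major, minor) in OPTIMIZED_COMPUTE_CAPABILITIES
```

The pair can come from any source that reports compute capability, for example the `(major, minor)` tuple returned by `torch.cuda.get_device_capability()` when PyTorch is installed.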

Known Issues#

The persistence.enabled value and all related dependent configuration flags are currently non-functional in the NIM helm chart.

Release 1.7.0#

Summary#

Added support for gRPC. For details, see API Reference (gRPC) for NVIDIA NeMo Retriever Reranking NIM.
Added the NIM_REPOSITORY_OVERRIDE environment variable.
Added mixed precision support for the Llama Nemotron Rerank 500m v2 NIM. For details, see llama-3.2-nemoretriever-500m-rerank-v2.

Known Issues#

The persistence.enabled value and all related dependent configuration flags are currently non-functional in the NIM helm chart.

Release 1.6.0#

Summary#

Added support for the Llama-3.2-nemoretriever-500m-rerank-v2 reranking model.

Release 1.5.0#

Summary#

Added support for B200 GPU.

Known Issues#

The list-model-profiles command incorrectly lists compatible model profiles as incompatible. Select the profile that matches your hardware configuration. This bug does not impact automatic profile selection.
A slight performance degradation has been observed since the 1.3.1 release.

Release 1.4.0#

Summary#

Added support for configurable memory footprint by allowing users to set batch size and sequence length.
Added the NIM_TRITON_MAX_BATCH_SIZE environment variable.
Reduced container image sizes.
Removed model profiles for A100 PCIe 40GB & H100 PCIe 80GB configurations.
Fixed a bug where the list-model-profiles command failed to run on hosts without an NVIDIA GPU, even when NIM_CPU_ONLY was set.
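
The NIM_TRITON_MAX_BATCH_SIZE variable introduced above is set on the container like any other environment variable. The following is a minimal sketch, assuming the container is started with `docker run`. The helper name `docker_env_flags` and the value 8 are illustrative only, and the variable's accepted range is not given in these notes.

```python
def docker_env_flags(env: dict[str, str]) -> list[str]:
    """Turn a mapping of environment variables into `docker run -e` flags.

    Hypothetical helper: it only formats arguments and does not start
    a container.
    """
    flags: list[str] = []
    for name, value in env.items():
        flags += ["-e", f"{name}={value}"]
    return flags


# Illustrative value; choose a batch size that fits your GPU memory.
flags = docker_env_flags({"NIM_TRITON_MAX_BATCH_SIZE": "8"})
```

The resulting list can be spliced into a `docker run` argument list ahead of the image name.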

Known Issues#

The list-model-profiles command incorrectly lists compatible model profiles as incompatible. Select the profile that matches your hardware configuration. This bug does not impact automatic profile selection.
A slight performance degradation has been observed since the 1.3.1 release.

Release 1.3.1#

Added the NIM_SERVED_MODEL_NAME environment variable.
Updated the LangChain Playbook to use the Llama-3.2-NV-RerankQA-1B-v2 NIM.

Release 1.3.0#

Added support for the Llama-3.2-NV-RerankQA-1B-v2 reranking model.
Added the NIM_NUM_MODEL_INSTANCES and NIM_NUM_TOKENIZERS environment variables.
Added support for dynamic batching in the underlying Triton Inference Server process.

Known Issues#

The current version of langchain-nvidia-ai-endpoints used in the LangChain playbook is not compatible with the Llama-3.2-NV-RerankQA-1B-v2 NIM.

Release 1.0.2#

Improved model accuracy on A100 and A10G GPUs.

Release 1.0.1#

Added support for NGC Personal/Service API keys in addition to the NGC API Key (Original).
NGC_API_KEY is no longer required when running a container with a pre-populated cache (NIM_CACHE_PATH).
Updated the list-model-profiles command to check the correct location for model artifacts.
