Release Notes for NVIDIA NeMo Retriever Reranking NIM#
This documentation contains the release notes for NVIDIA NeMo Retriever Reranking NIM.
Note
Some releases are labelled “Production Branch” or “(PB)”. Production Branches provide reliable, stable versions of the NIM. Non-production branch releases (sometimes called Feature Branch (FB) releases) contain the latest features, improvements, and optimizations.
Release 2.0.0#
This release represents a major runtime upgrade for the nvidia/llama-nemotron-rerank-vl-1b-v2 NIM that includes a new purpose-built reranking inference stack.
Compared to earlier versions, the new runtime delivers higher throughput and lower latency across all supported GPU SKUs, a smaller VRAM footprint, faster startup time, and a smaller container size.
Note
The model, supported modalities (text | image | text_image), and API are unchanged.
Highlights#
This release contains the following key changes:
The runtime selects optimized CUDA kernels automatically at startup based on the GPU’s compute capability. No manual profile selection steps are required.
Added support for loading model artifacts from Hugging Face or NGC.
To use Hugging Face (default), set HF_TOKEN.
To use NGC, set NIM_MODEL_DOWNLOAD_PROVIDER=ngc and NGC_API_KEY.
The NIM_ENGINE_COUNT environment variable defaults to 1. For details, refer to Engine Count.
Added the SHOW_CONFIG environment variable. Setting SHOW_CONFIG=1 at runtime lists the environment variables configured for the NIM.
You can opt in to gRPC by setting NIM_GRPC_BIND_ADDR.
Added new environment variables. For details, refer to Environment Variables for NVIDIA NeMo Retriever Reranking NIM.
NIM_MODEL_NAME is supported for served API model aliasing.
NIM_SERVED_MODEL_NAME remains supported for served API aliasing and currently takes precedence when both variables are configured.
The following environment variables are renamed in this version:
NIM_HTTP_API_PORT is now NIM_BIND_ADDR.
NIM_LOG_LEVEL is now RUST_LOG.
NIM_LOGGING_JSONL is now LOG_FORMAT=json.
NIM_TRITON_GRPC_PORT is now NIM_GRPC_BIND_ADDR.
The following environment variables are deprecated aliases in this release. The aliases still work, but will be removed in a future release.
NIM_NUM_MODEL_INSTANCES and NIM_TRITON_MODEL_INSTANCE_COUNT are now NIM_ENGINE_COUNT.
The following environment variables are removed in this release with no replacement.
NIM_CACHE_PATH, NIM_HTTP_MAX_WORKERS, NIM_HTTP_TRITON_PORT, NIM_IGNORE_MODEL_DOWNLOAD_FAIL, NIM_MANIFEST_ALLOW_UNSAFE, NIM_MANIFEST_PATH, NIM_MODEL_PROFILE, NIM_NUM_TOKENIZERS, NIM_REPOSITORY_OVERRIDE, NIM_TELEMETRY_MODE, NIM_TELEMETRY_ENABLE_ON_RTX, NIM_TELEMETRY_INTERVAL_MINUTES, NIM_TRITON_DYNAMIC_BATCHING_MAX_QUEUE_DELAY_MICROSECONDS, NIM_TRITON_LOG_VERBOSE, NIM_TRITON_PERFORMANCE_MODE.
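When upgrading an existing deployment, it can help to scan its environment for variables affected by the renames, deprecations, and removals listed above. The following Python sketch shows one possible way to do that. It is not part of the NIM: the function name audit_env and the status labels are hypothetical, and the name tables are simply transcribed from these release notes. It reports names only, because the notes do not describe how values should be translated between the old and new variables.

```python
# Name tables transcribed from the release notes above.
# Values are NOT translated; only variable names are checked.
RENAMED = {
    "NIM_HTTP_API_PORT": "NIM_BIND_ADDR",
    "NIM_LOG_LEVEL": "RUST_LOG",
    "NIM_LOGGING_JSONL": "LOG_FORMAT=json",
    "NIM_TRITON_GRPC_PORT": "NIM_GRPC_BIND_ADDR",
}

DEPRECATED_ALIASES = {
    "NIM_NUM_MODEL_INSTANCES": "NIM_ENGINE_COUNT",
    "NIM_TRITON_MODEL_INSTANCE_COUNT": "NIM_ENGINE_COUNT",
}

REMOVED = frozenset({
    "NIM_CACHE_PATH",
    "NIM_HTTP_MAX_WORKERS",
    "NIM_HTTP_TRITON_PORT",
    "NIM_IGNORE_MODEL_DOWNLOAD_FAIL",
    "NIM_MANIFEST_ALLOW_UNSAFE",
    "NIM_MANIFEST_PATH",
    "NIM_MODEL_PROFILE",
    "NIM_NUM_TOKENIZERS",
    "NIM_REPOSITORY_OVERRIDE",
    "NIM_TELEMETRY_MODE",
    "NIM_TELEMETRY_ENABLE_ON_RTX",
    "NIM_TELEMETRY_INTERVAL_MINUTES",
    "NIM_TRITON_DYNAMIC_BATCHING_MAX_QUEUE_DELAY_MICROSECONDS",
    "NIM_TRITON_LOG_VERBOSE",
    "NIM_TRITON_PERFORMANCE_MODE",
})


def audit_env(env):
    """Return (name, status, replacement) tuples for affected variables.

    `env` is any mapping of variable names to values, such as os.environ.
    `replacement` is None for variables that were removed outright.
    Results are ordered by variable name.
    """
    findings = []
    for name in sorted(env):
        if name in RENAMED:
            findings.append((name, "renamed", RENAMED[name]))
        elif name in DEPRECATED_ALIASES:
            findings.append((name, "deprecated", DEPRECATED_ALIASES[name]))
        elif name in REMOVED:
            findings.append((name, "removed", None))
    return findings
```

Calling audit_env(os.environ) in the shell session used to launch the container, or passing it a dictionary parsed from an env file, returns an empty list when none of the affected variables are set.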
Support Matrix and Compatibility Updates#
The supported optimized SKUs are the following. For details, refer to Support Matrix for NVIDIA NeMo Retriever Reranking NIM.
FP16 on: B200, RTX PRO 6000, H100, H200, L40S, A100, A10G, L4
FP8 on: H100, H200, RTX PRO 6000
The default precision is now FP16 when supported. Set NIM_PRECISION=fp8 to opt into FP8.
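The precision rules above can be expressed as a small lookup. The following Python sketch is illustrative only and is not part of the NIM. The function name precision_env and the GPU name strings used as keys are assumptions made for this example; the supported-SKU lists are transcribed from this section.

```python
# Optimized SKUs per precision, transcribed from the list above.
# The GPU name strings are hypothetical lookup keys for this sketch.
FP16_GPUS = {"B200", "RTX PRO 6000", "H100", "H200", "L40S", "A100", "A10G", "L4"}
FP8_GPUS = {"H100", "H200", "RTX PRO 6000"}


def precision_env(gpu, want_fp8=False):
    """Return the environment variables needed for the requested precision.

    Raises ValueError if the GPU is not listed for that precision.
    """
    if want_fp8:
        if gpu not in FP8_GPUS:
            raise ValueError(f"FP8 is not listed as supported on {gpu}")
        return {"NIM_PRECISION": "fp8"}
    if gpu not in FP16_GPUS:
        raise ValueError(f"{gpu} is not in the list of optimized FP16 SKUs")
    # FP16 is the default when supported, so no variable needs to be set.
    return {}
```

For example, precision_env("H100", want_fp8=True) returns a dictionary that sets NIM_PRECISION to fp8, while precision_env("L4", want_fp8=True) raises ValueError because L4 is not in the FP8 list.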
All Known Issues#
The known issues for NeMo Retriever Reranking NIM are the following:
For GPUs with less VRAM, such as A10G and L4, set NIM_MAX_BATCH_SIZE to 26 or lower. For details, refer to Environment Variables for NVIDIA NeMo Retriever Reranking NIM.
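As a minimal sketch of the workaround above, a deployment script could cap the batch size before launching the container. The function name max_batch_size_env and the GPU name strings are hypothetical. The set of capped GPUs contains only the two examples named in the known issue; other low-VRAM GPUs may need the same cap. For GPUs outside that set, the requested value is passed through without checking whether the NIM accepts it.

```python
# GPUs named in the known issue above; treated here as the full set,
# which is an assumption of this sketch.
LOW_VRAM_GPUS = {"A10G", "L4"}
LOW_VRAM_BATCH_CAP = 26


def max_batch_size_env(gpu, requested):
    """Return NIM_MAX_BATCH_SIZE as an env mapping, capped on low-VRAM GPUs."""
    if requested < 1:
        raise ValueError("batch size must be a positive integer")
    if gpu in LOW_VRAM_GPUS:
        requested = min(requested, LOW_VRAM_BATCH_CAP)
    # Environment variable values are strings.
    return {"NIM_MAX_BATCH_SIZE": str(requested)}
```

The returned mapping can be merged into the environment passed to the container runtime.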