Release Notes for NVIDIA NeMo Retriever Embedding NIM#
This documentation contains the release notes for NVIDIA NeMo Retriever Embedding NIM.
Note
Some releases are labelled “Production Branch” or “(PB)”. Production Branches provide reliable, stable versions of the NIM. Non-production branch releases (sometimes called Feature Branch (FB) releases) contain the latest features, improvements, and optimizations.
Release 2.0.0#
Summary#
Major runtime upgrade for the nvidia/llama-nemotron-embed-vl-1b-v2 NIM that includes a new purpose-built embedding inference stack. Compared to earlier versions, the new runtime delivers higher throughput and lower latency across all supported GPU SKUs, smaller VRAM footprint, faster startup time, and smaller container size.
The model and the multimodal /v1/embeddings surface (modality: text | image | text_image) are unchanged.
The runtime selects optimized CUDA kernels automatically at startup based on the GPU’s compute capability. No manual profile selection steps are required.
The supported optimized SKUs are the following. For details, refer to Support Matrix for NVIDIA NeMo Retriever Embedding NIM.
FP16 on B200, GB200, RTX PRO 6000, H100, H200, L40S, A100, A10G, L4
FP8 on B200, GB200, RTX PRO 6000, H100, H200, L40S
The default precision is now determined automatically based on the GPU architecture. Set NIM_PRECISION=fp16 to opt into FP16.
Added support for loading model artifacts from Hugging Face or NGC.
To use Hugging Face (default), set HF_TOKEN.
To use NGC, set NIM_MODEL_DOWNLOAD_PROVIDER=ngc and NGC_API_KEY.
The NIM_ENGINE_COUNT environment variable defaults to 1.
Added the SHOW_CONFIG environment variable. Setting SHOW_CONFIG=1 at runtime lists the environment variables configured for the NIM.
You can opt in to gRPC by setting NIM_GRPC_BIND_ADDR.
New environment variables. For details, refer to Environment Variables for NVIDIA NeMo Retriever Embedding NIM.
NIM_MODEL_NAME is supported for served API model aliasing.
NIM_SERVED_MODEL_NAME remains supported for served API aliasing and currently takes precedence when both variables are configured.
The following environment variables are renamed in this version:
NIM_HTTP_API_PORT is now NIM_BIND_ADDR.
NIM_LOG_LEVEL is now RUST_LOG.
NIM_LOGGING_JSONL is now LOG_FORMAT=json.
NIM_TRITON_GRPC_PORT is now NIM_GRPC_BIND_ADDR.
The following environment variables are deprecated aliases in this release. The aliases still work, but will be removed in a future release.
NIM_NUM_MODEL_INSTANCES and NIM_TRITON_MODEL_INSTANCE_COUNT are now NIM_ENGINE_COUNT.
NIM_TRITON_DYNAMIC_BATCHING_MAX_QUEUE_DELAY_MICROSECONDS is now NIM_MAX_WAIT_MS.
The following environment variables are removed in this release with no replacement.
NIM_CACHE_PATH, NIM_CUSTOM_MODEL, NIM_HTTP_MAX_WORKERS, NIM_HTTP_TRITON_PORT, NIM_IGNORE_MODEL_DOWNLOAD_FAIL, NIM_MANIFEST_ALLOW_UNSAFE, NIM_MANIFEST_PATH, NIM_MODEL_PROFILE, NIM_NUM_TOKENIZERS, NIM_REPOSITORY_OVERRIDE, NIM_TELEMETRY_MODE, NIM_TELEMETRY_ENABLE_ON_RTX, NIM_TELEMETRY_INTERVAL_MINUTES, NIM_TRITON_LOG_VERBOSE, NIM_TRITON_PERFORMANCE_MODE.
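When moving a deployment to 2.0.0, it can help to scan the existing environment for variables that this release renames, deprecates, or removes. The following is a minimal sketch of such a check, not an official tool. The mappings are transcribed from the lists above; the helper names audit_env and missing_credentials are hypothetical. The sketch only reports variable names and does not translate values, because the value formats of the new variables are not described in these notes. It also assumes that any provider value other than ngc means the Hugging Face default.

```python
import os

# Mappings transcribed from the Release 2.0.0 notes.
RENAMED = {
    "NIM_HTTP_API_PORT": "NIM_BIND_ADDR",
    "NIM_LOG_LEVEL": "RUST_LOG",
    "NIM_LOGGING_JSONL": "LOG_FORMAT=json",
    "NIM_TRITON_GRPC_PORT": "NIM_GRPC_BIND_ADDR",
}
DEPRECATED_ALIASES = {
    "NIM_NUM_MODEL_INSTANCES": "NIM_ENGINE_COUNT",
    "NIM_TRITON_MODEL_INSTANCE_COUNT": "NIM_ENGINE_COUNT",
    "NIM_TRITON_DYNAMIC_BATCHING_MAX_QUEUE_DELAY_MICROSECONDS": "NIM_MAX_WAIT_MS",
}
REMOVED = frozenset({
    "NIM_CACHE_PATH", "NIM_CUSTOM_MODEL", "NIM_HTTP_MAX_WORKERS",
    "NIM_HTTP_TRITON_PORT", "NIM_IGNORE_MODEL_DOWNLOAD_FAIL",
    "NIM_MANIFEST_ALLOW_UNSAFE", "NIM_MANIFEST_PATH", "NIM_MODEL_PROFILE",
    "NIM_NUM_TOKENIZERS", "NIM_REPOSITORY_OVERRIDE", "NIM_TELEMETRY_MODE",
    "NIM_TELEMETRY_ENABLE_ON_RTX", "NIM_TELEMETRY_INTERVAL_MINUTES",
    "NIM_TRITON_LOG_VERBOSE", "NIM_TRITON_PERFORMANCE_MODE",
})


def audit_env(env):
    """Return (name, status, replacement) tuples for affected variables.

    Hypothetical helper: it inspects variable names only, in sorted order.
    """
    findings = []
    for name in sorted(env):
        if name in RENAMED:
            findings.append((name, "renamed", RENAMED[name]))
        elif name in DEPRECATED_ALIASES:
            findings.append((name, "deprecated", DEPRECATED_ALIASES[name]))
        elif name in REMOVED:
            findings.append((name, "removed", ""))
    return findings


def missing_credentials(env):
    """Return the unset credential variable for the chosen provider, or None.

    Assumption: any provider value other than "ngc" means Hugging Face.
    """
    provider = env.get("NIM_MODEL_DOWNLOAD_PROVIDER", "")
    required = "NGC_API_KEY" if provider == "ngc" else "HF_TOKEN"
    return None if env.get(required) else required


if __name__ == "__main__":
    for name, status, replacement in audit_env(os.environ):
        suffix = f" (use {replacement})" if replacement else ""
        print(f"{name}: {status}{suffix}")
    missing = missing_credentials(os.environ)
    if missing:
        print(f"{missing} is not set")
```

Running the script in the shell that launches the container prints one line per affected variable, which gives a quick list of settings to revisit before upgrading.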
Known Issues#
The nvcr.io/nim/nvidia/llama-nemotron-embed-vl-1b-v2:2.0.0 container image tag can fail to pull on Docker Engine 29.5.x when the Docker containerd image store is enabled. For details, refer to Troubleshoot NVIDIA NeMo Retriever Embedding NIM.
Release 1.13.0#
Highlights#
Renamed existing models to the new Nemotron brand. The impacted models are the following:
The llama-3.2-nemoretriever-300m-embed-v2 model is now named llama-nemotron-embed-300m-v2.
The llama-3.2-nv-embedqa-1b-v2 model is now named llama-nemotron-embed-1b-v2.
Added fixes for high and critical vulnerabilities.
Fixed Known Issues#
The following are the known issues that are fixed in this version:
Fixed an issue with the persistence.enabled Helm chart value. Persistent storage options (persistence.storageClass, persistence.existingClaim, hostPath.enabled) are now fully functional.
Release 1.12.0#
Highlights#
Renamed the llama-3.2-nemoretriever-1b-vlm-embed-v1 model to llama-nemotron-embed-vl-1b-v2.
Added TRT engine support for llama-nemotron-embed-vl-1b-v2 model. For details, see Llama Nemotron Embed Vision Language 1B (llama-nemotron-embed-vl-1b-v2).
The NIM_TRITON_PERFORMANCE_MODE environment variable has no effect on the llama-3.2-nemoretriever-1b-vlm-embed-v1 NIM as the NIM has been optimized for both throughput and latency.
Known Issues#
The persistence.enabled value and all related dependent configuration flags are currently non-functional in the NIM Helm chart.
Release 1.11 - Production Branch Only#
This release is a production branch.
Highlights#
1.11.0: Production branch release of llama-3.2-nv-embedqa-1b-v2 with STIG/FIPS base image.
1.11.0: Upgraded to use Triton Inference server version 25.08.03 to address CVEs.
1.11.0: CUDA version changed from 12.9 to 13. For details, refer to What’s New and Important in CUDA Toolkit 13.0.
1.11.1: Added support for nv-embedqa-e5-v5 with STIG/FIPS base image.
1.11.1 - 1.11.x: CVE fixes for high & critical vulnerabilities.
Known Issues#
There are no known issues in this release.
Release 1.10.1#
This release is a patch release.
Highlights#
Added dynamic embedding support for llama-3.2-nemoretriever-300m-embed-v2 NIM.
Batch sizes larger than 1024 are now supported again after fixing a bug.
Known Issues#
The persistence.enabled value and all related dependent configuration flags are currently non-functional in the NIM Helm chart.
Release 1.10.0#
Summary#
Added support for llama-3.2-nemoretriever-300m-embed-v2 NIM. For details, refer to Llama Nemotron Embed 300m v2 (llama-nemotron-embed-300m-v2).
Upgraded to use Triton Inference Server 25.08 to address CVEs.
Optimized TensorRT engine profiles for reduced GPU VRAM usage and improved cache utilization.
Added TRT optimized engines for CUDA GPU Compute Capability. Support includes 12.0, 10.0, 9.0, 8.9, 8.6, and 8.0.
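To see whether a given GPU falls into the compute capability list above, a small lookup is enough. This is a minimal sketch: the supported values are copied from this release note, and the helper names parse_compute_capability and has_optimized_engine are hypothetical. It says nothing about what the NIM does for capabilities that are not listed.

```python
# Compute capabilities listed in the Release 1.10.0 notes.
SUPPORTED_COMPUTE_CAPABILITIES = {
    (12, 0), (10, 0), (9, 0), (8, 9), (8, 6), (8, 0),
}


def parse_compute_capability(text):
    """Parse a string such as "8.9" into a (major, minor) tuple."""
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def has_optimized_engine(compute_capability):
    """Return True if the capability appears in the list from the notes."""
    return parse_compute_capability(compute_capability) in SUPPORTED_COMPUTE_CAPABILITIES


if __name__ == "__main__":
    for cap in ("8.9", "7.5"):
        print(cap, has_optimized_engine(cap))
```

On recent drivers, the capability string can typically be read with nvidia-smi --query-gpu=compute_cap --format=csv,noheader and passed to has_optimized_engine.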
Known Issues#
The persistence.enabled value and all related dependent configuration flags are currently non-functional in the NIM Helm chart.
Release 1.9.0#
Summary#
Added support for llama-3.2-nemoretriever-300m-embed-v1 NIM.
Added quantization support for uint8 and ubinary. For details, refer to Specify Embedding Type.
Added the NIM_REPOSITORY_OVERRIDE environment variable.
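For readers unfamiliar with the uint8 and ubinary embedding types, the sketch below illustrates what these names conventionally mean in embedding libraries. It is an assumption about the general technique, not the NIM's implementation: ubinary is shown as one sign bit per dimension packed eight to a byte, and uint8 as a linear mapping of a fixed value range onto 0 to 255. The function names and the default range of -1.0 to 1.0 are hypothetical; refer to Specify Embedding Type for the actual behavior.

```python
def to_ubinary(embedding):
    """Pack one sign bit per dimension into bytes, most significant bit first.

    Illustrative only: positive values map to 1, all others to 0.
    """
    if len(embedding) % 8 != 0:
        raise ValueError("embedding length must be a multiple of 8")
    packed = []
    for start in range(0, len(embedding), 8):
        byte = 0
        for value in embedding[start:start + 8]:
            byte = (byte << 1) | (1 if value > 0 else 0)
        packed.append(byte)
    return packed


def to_uint8(embedding, low=-1.0, high=1.0):
    """Linearly map values from [low, high] onto integers 0..255.

    Illustrative only: values outside the range are clamped first.
    """
    scale = 255.0 / (high - low)
    return [round((min(max(v, low), high) - low) * scale) for v in embedding]


if __name__ == "__main__":
    vector = [0.7, -0.2, 0.1, -0.9, 0.3, -0.4, 0.8, -0.1]
    print(to_ubinary(vector))
    print(to_uint8(vector))
```

Compared to 32-bit floats, a uint8 vector takes a quarter of the storage and a bit-packed binary vector takes 1/32, at some cost in retrieval accuracy.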
Known Issues#
The persistence.enabled value and all related dependent configuration flags are currently non-functional in the NIM Helm chart.
Release 1.8 - Production Branch Only#
Summary#
1.8.0: Added support for H200 NVL GPU for the NV-EmbedQA-E5-v5 NIM. For details, see NV-EmbedQA-E5-v5.
1.8.0: Added FP8 support for H100 and L40S for the NV-EmbedQA-E5-v5 NIM. For details, see NV-EmbedQA-E5-v5.
1.8.1 - 1.8.x: CVE fixes for high & critical vulnerabilities.
Release 1.7.0 - Early Access Only#
Summary#
Added support for llama-3.2-nemoretriever-1b-vlm-embed-v1 model. For details, see Support Matrix for NVIDIA NeMo Retriever Embedding NIM.
Added a new modality field to the /v1/embeddings endpoint to support text, image, and mixed (text+image) input types. For details, see Specify Modality.
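As a rough illustration of how a client might assemble a request that uses the modality field, the sketch below builds and validates a request body without sending it. Only the /v1/embeddings endpoint and the modality values (text, image, text_image, as listed in the Release 2.0.0 notes) come from this document. The model and input field names, the one-modality-per-input layout, and the helper name build_embedding_request are assumptions; check Specify Modality for the actual schema.

```python
import json

# Modality values as listed in the Release 2.0.0 notes.
ALLOWED_MODALITIES = ("text", "image", "text_image")


def build_embedding_request(model, inputs, modalities):
    """Build a request body for /v1/embeddings (hypothetical layout).

    Assumption: the body carries "model" and "input" fields and one
    modality value per input item.
    """
    if len(inputs) != len(modalities):
        raise ValueError("inputs and modalities must have the same length")
    for modality in modalities:
        if modality not in ALLOWED_MODALITIES:
            raise ValueError(f"unsupported modality: {modality!r}")
    return {"model": model, "input": list(inputs), "modality": list(modalities)}


if __name__ == "__main__":
    body = build_embedding_request(
        "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1",
        ["What is shown in the chart?"],
        ["text"],
    )
    print(json.dumps(body, indent=2))
```

Validating the modality on the client side surfaces typos before the request reaches the service.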
Known Issues#
Currently, only unoptimized generic model profiles are supported.
Release 1.6.0#
Summary#
Added support for B200 GPU. For details, see Support Matrix for NVIDIA NeMo Retriever Embedding NIM.
Known Issues#
The list-model-profiles command incorrectly lists compatible model profiles as incompatible. Select the profile that matches your hardware configuration. This bug does not impact automatic profile selection.
A slight performance degradation has been observed since the 1.3.1 release.
For the B200 GPU, Llama-3.2-NV-EmbedQA-1B-v2 requires NIM_TRT_ENGINE_HOST_CODE_ALLOWED=1 to properly start the NIM.
Release 1.5.1#
Summary#
Fixed a bug where the list-model-profiles command fails to run on hosts that don’t have an NVIDIA GPU, even when NIM_CPU_ONLY is set.
Fixed a bug where the list-model-profiles command returns custom models that should not be used.
Known Issues#
The list-model-profiles command incorrectly lists compatible model profiles as incompatible. Select the profile that matches your hardware configuration. This bug does not impact automatic profile selection.
A slight performance degradation has been observed since the 1.3.1 release.
Release 1.5.0#
Summary#
Added support for bge-m3 embedding model. For details, refer to Support Matrix.
Added support for bge-large-zh-v1.5 embedding model.
Added the NIM_TRITON_PERFORMANCE_MODE environment variable to allow you to select performance modes that are optimized for low latency or high throughput.
Added the NIM_TRITON_MAX_BATCH_SIZE environment variable.
Added support for configurable memory footprint by allowing users to set batch size and sequence length.
Added support for gRPC.
Reduced container image sizes.
Removed model profiles for A100 PCIe 40GB & H100 PCIe 80GB configurations.
Known Issues#
The list-model-profiles command incorrectly lists compatible model profiles as incompatible. Select the profile that matches your hardware configuration. This bug does not impact automatic profile selection.
The list-model-profiles command fails to run on hosts that don’t have an NVIDIA GPU, even when NIM_CPU_ONLY is set.
The list-model-profiles command returns custom models that should not be used.
Release 1.4.0-rtx (Beta)#
Summary#
This is a public beta release of the NVIDIA NeMo Retriever Embedding NIM. This release contains the following changes:
Added support for GeForce RTX 4090, NVIDIA RTX 6000 Ada Generation, GeForce RTX 5080, and GeForce RTX 5090 for the Llama-3.2-NV-EmbedQA-1B-v2 NIM.
Known Issues#
The list-model-profiles command incorrectly lists compatible model profiles as incompatible. Select the profile that matches your hardware configuration. This bug does not impact automatic profile selection.
Release 1.3.1#
Added the NIM_SERVED_MODEL_NAME environment variable.
Updated the LangChain Playbook to use the Llama-3.2-NV-EmbedQA-1B-v2 NIM.
Release 1.3.0#
Added support for Llama-3.2-NV-EmbedQA-1B-v2 embedding model.
Added support for dynamic embedding sizes via Matryoshka Representation Learning (for supported models).
Added the NIM_NUM_MODEL_INSTANCES and NIM_NUM_TOKENIZERS environment variables.
Added support for dynamic batching in the underlying Triton Inference Server process.
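The idea behind dynamic embedding sizes can be shown in a few lines. Models trained with Matryoshka Representation Learning concentrate the most useful information in the leading dimensions, so a shorter embedding is commonly obtained by keeping the first dimensions and renormalizing. The sketch below shows that general technique on the client side; it is not a description of how the NIM computes reduced embeddings, and the function name truncate_embedding is hypothetical.

```python
import math


def truncate_embedding(embedding, dimensions):
    """Keep the first `dimensions` values and rescale to unit L2 norm."""
    if not 0 < dimensions <= len(embedding):
        raise ValueError("dimensions must be between 1 and len(embedding)")
    head = list(embedding[:dimensions])
    norm = math.sqrt(sum(value * value for value in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [value / norm for value in head]


if __name__ == "__main__":
    full = [3.0, 4.0, 12.0]
    print(truncate_embedding(full, 2))
```

Renormalizing after truncation keeps cosine similarity and dot product comparable across vectors of the same reduced size.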
Known Issues#
The current version of langchain-nvidia-ai-endpoints used in the LangChain playbook is not compatible with the Llama-3.2-NV-EmbedQA-1B-v2 NIM.
Release 1.2.0#
Updated NV-EmbedQA-E5-v5 NIM to use Triton Inference Server 24.08.
Added the NIM_TRITON_GRPC_PORT env var to set gRPC port for Triton Inference Server.
Release 1.1.0#
Updated NV-EmbedQA-E5-v5 NIM using standard NIM library and tools.
Release 1.0.1#
Added support for NGC Personal/Service API keys in addition to the NGC API Key (Original).
NGC_API_KEY is no longer required when running a container with a pre-populated cache (NIM_CACHE_PATH).
The list-model-profiles command now checks the correct location for model artifacts.
Release 1.0.0#
Summary#
This is the first general release of the NVIDIA NeMo Retriever Embedding NIM.
Embedding Models#
NV-EmbedQA-E5-v5
NV-EmbedQA-Mistral7B-v2
Snowflake’s Arctic-embed-l