Release Notes for NVIDIA NeMo Retriever Embedding NIM#

This documentation contains the release notes for NVIDIA NeMo Retriever Embedding NIM.

Note

Some releases are labelled “Production Branch” or “(PB)”. Production Branches provide reliable, stable versions of the NIM. Non-production branch releases (sometimes called Feature Branch (FB) releases) contain the latest features, improvements, and optimizations.

Release 2.0.0#

This release is a major runtime upgrade for the nvidia/llama-nemotron-embed-vl-1b-v2 NIM and includes a new purpose-built embedding inference stack. Compared to earlier versions, the new runtime delivers higher throughput and lower latency across all supported GPU SKUs, a smaller VRAM footprint, a faster startup time, and a smaller container size.

Note

The model, supported modalities (text | image | text_image), and API are unchanged.

Highlights#

This release contains the following key changes:

The runtime selects optimized CUDA kernels automatically at startup based on the GPU’s compute capability. No manual profile selection steps are required.
Added support for loading model artifacts from Hugging Face or NGC.
- To use Hugging Face (default), set HF_TOKEN.
- To use NGC, set NIM_MODEL_DOWNLOAD_PROVIDER=ngc and NGC_API_KEY.
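As a minimal sketch of the two download options above, the following Python helper builds the environment variables for each source. The function name `download_env`, its parameters, and the `"hf"` label are hypothetical and exist only for this illustration; the variable names and the `ngc` value are the ones stated in these notes.

```python
def download_env(provider: str, credential: str) -> dict[str, str]:
    """Return the environment variables for the chosen artifact source.

    Hypothetical helper for illustration; it is not part of the NIM.
    """
    if provider == "hf":
        # Hugging Face is the default source, so only the token is set.
        # "hf" is a label used by this helper, not a documented value.
        return {"HF_TOKEN": credential}
    if provider == "ngc":
        # NGC has to be selected explicitly and needs an API key.
        return {"NIM_MODEL_DOWNLOAD_PROVIDER": "ngc", "NGC_API_KEY": credential}
    raise ValueError(f"unknown provider: {provider!r}")
```

The returned mapping could then be merged into the environment of whatever process launches the container.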
The NIM_ENGINE_COUNT environment variable defaults to 1.
Added the SHOW_CONFIG environment variable. Setting SHOW_CONFIG=1 at runtime lists the environment variables configured for the NIM.
You can opt in to gRPC by setting NIM_GRPC_BIND_ADDR.
Added new environment variables. For details, refer to Environment Variables for NVIDIA NeMo Retriever Embedding NIM.
NIM_MODEL_NAME is supported for served API model aliasing. NIM_SERVED_MODEL_NAME remains supported for served API aliasing and currently takes precedence when both variables are configured.
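The precedence between the two aliasing variables can be sketched as follows. The function `resolve_served_model_name` is hypothetical, and treating an empty string as unset is an assumption of this sketch rather than documented behavior.

```python
def resolve_served_model_name(env):
    """Return the alias that wins under the precedence described above.

    Hypothetical helper: NIM_SERVED_MODEL_NAME is preferred, then
    NIM_MODEL_NAME. Returns None when neither is set. Empty strings are
    treated as unset, which is an assumption of this sketch.
    """
    return env.get("NIM_SERVED_MODEL_NAME") or env.get("NIM_MODEL_NAME") or None
```

Calling it with `os.environ` would show which alias applies to the current configuration.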
The following environment variables are renamed in this version:
- NIM_HTTP_API_PORT is now NIM_BIND_ADDR.
- NIM_LOG_LEVEL is now RUST_LOG.
- NIM_LOGGING_JSONL is now LOG_FORMAT=json.
- NIM_TRITON_GRPC_PORT is now NIM_GRPC_BIND_ADDR.
The following environment variables are deprecated aliases in this release. The aliases still work but will be removed in a future release.
- NIM_NUM_MODEL_INSTANCES and NIM_TRITON_MODEL_INSTANCE_COUNT are now NIM_ENGINE_COUNT.
- NIM_TRITON_DYNAMIC_BATCHING_MAX_QUEUE_DELAY_MICROSECONDS is now NIM_MAX_WAIT_MS.
The following environment variables are removed in this release with no replacement: NIM_CACHE_PATH, NIM_CUSTOM_MODEL, NIM_HTTP_MAX_WORKERS, NIM_HTTP_TRITON_PORT, NIM_IGNORE_MODEL_DOWNLOAD_FAIL, NIM_MANIFEST_ALLOW_UNSAFE, NIM_MANIFEST_PATH, NIM_MODEL_PROFILE, NIM_NUM_TOKENIZERS, NIM_REPOSITORY_OVERRIDE, NIM_TELEMETRY_MODE, NIM_TELEMETRY_INTERVAL_MINUTES, NIM_TELEMETRY_ENABLE_ON_RTX, NIM_TRITON_LOG_VERBOSE, NIM_TRITON_PERFORMANCE_MODE.
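When migrating an existing deployment, it may help to scan its environment for the renamed, deprecated, and removed variables listed above. The following is a minimal sketch; `audit_env` and the three lookup tables are hypothetical names. It only reports variable names and does not convert values, because these notes do not specify how values map between old and new variables (for example, a port versus a bind address).

```python
# Variable names taken from the lists above; the helper itself is hypothetical.
RENAMED = {
    "NIM_HTTP_API_PORT": "NIM_BIND_ADDR",
    "NIM_LOG_LEVEL": "RUST_LOG",
    "NIM_LOGGING_JSONL": "LOG_FORMAT=json",
    "NIM_TRITON_GRPC_PORT": "NIM_GRPC_BIND_ADDR",
}
DEPRECATED = {
    "NIM_NUM_MODEL_INSTANCES": "NIM_ENGINE_COUNT",
    "NIM_TRITON_MODEL_INSTANCE_COUNT": "NIM_ENGINE_COUNT",
    "NIM_TRITON_DYNAMIC_BATCHING_MAX_QUEUE_DELAY_MICROSECONDS": "NIM_MAX_WAIT_MS",
}
REMOVED = frozenset({
    "NIM_CACHE_PATH", "NIM_CUSTOM_MODEL", "NIM_HTTP_MAX_WORKERS",
    "NIM_HTTP_TRITON_PORT", "NIM_IGNORE_MODEL_DOWNLOAD_FAIL",
    "NIM_MANIFEST_ALLOW_UNSAFE", "NIM_MANIFEST_PATH", "NIM_MODEL_PROFILE",
    "NIM_NUM_TOKENIZERS", "NIM_REPOSITORY_OVERRIDE", "NIM_TELEMETRY_MODE",
    "NIM_TELEMETRY_ENABLE_ON_RTX", "NIM_TELEMETRY_INTERVAL_MINUTES",
    "NIM_TRITON_LOG_VERBOSE", "NIM_TRITON_PERFORMANCE_MODE",
})


def audit_env(env):
    """Return (status, old_name, replacement) tuples for affected variables.

    Names are checked in sorted order; replacement is None for removed
    variables. Unaffected variables are skipped.
    """
    findings = []
    for name in sorted(env):
        if name in RENAMED:
            findings.append(("renamed", name, RENAMED[name]))
        elif name in DEPRECATED:
            findings.append(("deprecated", name, DEPRECATED[name]))
        elif name in REMOVED:
            findings.append(("removed", name, None))
    return findings
```

Running `audit_env(dict(os.environ))` before upgrading would list the settings that need attention.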

Support Matrix and Compatibility Updates#

The supported optimized SKUs are the following. For details, refer to Support Matrix for NVIDIA NeMo Retriever Embedding NIM.
- FP16 on B200, GB200, RTX PRO 6000, H100, H200, L40S, A100, A10G, L4
- FP8 on B200, GB200, RTX PRO 6000, H100, H200, L40S
The default precision is now determined automatically based on the GPU architecture. Set NIM_PRECISION=fp16 to opt into FP16.
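The two SKU lists above can be expressed as a small lookup. This is a sketch only: `supported_precisions` is a hypothetical helper, the GPU names are matched exactly as written in these notes, and the `"fp8"` string is a label used by the helper, not a confirmed value for NIM_PRECISION.

```python
# GPU lists copied from the support summary above.
FP16_GPUS = frozenset({
    "B200", "GB200", "RTX PRO 6000", "H100", "H200",
    "L40S", "A100", "A10G", "L4",
})
FP8_GPUS = frozenset({
    "B200", "GB200", "RTX PRO 6000", "H100", "H200", "L40S",
})


def supported_precisions(gpu: str) -> set[str]:
    """Return the optimized precisions listed for a GPU, or an empty set."""
    precisions = set()
    if gpu in FP16_GPUS:
        precisions.add("fp16")
    if gpu in FP8_GPUS:
        precisions.add("fp8")
    return precisions
```

For example, a deployment script could check that `"fp16"` is in `supported_precisions(gpu)` before setting NIM_PRECISION=fp16.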

All Known Issues#

Currently, there are no known issues for NeMo Retriever Embedding NIM.