Support Matrix for NVIDIA NeMo Retriever Embedding NIM#

This documentation describes the software and hardware that NVIDIA NeMo Retriever Embedding NIM supports.

CPU#

NeMo Retriever Embedding NIM requires an x86 processor with at least 8 cores. For a list of supported systems, refer to the NVIDIA Certified Systems Catalog.

Models#

NVIDIA NeMo Retriever Embedding NIM supports the following models.

| Model Name | Model ID | Max Tokens | Publisher | Parameters (millions, excl. embeddings) | Total Parameters (millions) | Embedding Dimension | Dynamic Embeddings Supported | Model Card |
|---|---|---|---|---|---|---|---|---|
| Llama Nemotron Embed 300m v2 | nvidia/llama-nemotron-embed-300m-v2 | 8192 | NVIDIA | 307 | 569 | 2048 | yes | Link |
| Llama Nemotron Embed Vision Language 1B | nvidia/llama-nemotron-embed-vl-1b-v2 | 8192 | NVIDIA | 1414 | 1678 | 2048 | yes | - |
| bge-large-zh-v1.5 | baai/bge-large-zh-v1.5 | 512 | BAAI | 303 | 325 | 1024 | no | Link |
| bge-m3 | baai/bge-m3 | 8192 | BAAI | 303 | 568 | 1024 | no | Link |
| Llama Nemotron Embed 1B v2 | nvidia/llama-nemotron-embed-1b-v2 | 8192 | NVIDIA | 973 | 1236 | 2048 | yes | Link |
| NV-EmbedQA-E5-v5 | nvidia/nv-embedqa-e5-v5 | 512 | NVIDIA | 303 | 335 | 1024 | no | Link |
| NV-EmbedQA-Mistral7B-v2 | nvidia/nv-embedqa-mistral-7b-v2 | 512 | NVIDIA | 6980 | 7110 | 4096 | no | Link |
| Snowflake’s Arctic-embed-l | snowflake/arctic-embed-l | 512 | Snowflake | 303 | 335 | 1024 | no | Link |

Note: The “Parameters (excl. embeddings)” column shows the count of parameters that directly impact inference performance and computational cost. Embedding layer parameters are excluded because they primarily affect model size rather than inference speed. For example, models with different vocabulary sizes may have different total parameter counts but the same inference-relevant parameter count.
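To make the note concrete, the size of a model's embedding layer can be read off as the difference between the two parameter columns. The following Python sketch copies the values from the model table; the dictionary and function names are illustrative only and are not part of any NIM API.

```python
# Parameter counts in millions, copied from the model table.
# Each value is (parameters excluding embeddings, total parameters).
PARAMS_MILLIONS = {
    "nvidia/llama-nemotron-embed-300m-v2": (307, 569),
    "nvidia/llama-nemotron-embed-vl-1b-v2": (1414, 1678),
    "baai/bge-large-zh-v1.5": (303, 325),
    "baai/bge-m3": (303, 568),
    "nvidia/llama-nemotron-embed-1b-v2": (973, 1236),
    "nvidia/nv-embedqa-e5-v5": (303, 335),
    "nvidia/nv-embedqa-mistral-7b-v2": (6980, 7110),
    "snowflake/arctic-embed-l": (303, 335),
}


def embedding_layer_params(model_id: str) -> int:
    """Return the embedding-layer parameter count (millions) for a model."""
    non_embedding, total = PARAMS_MILLIONS[model_id]
    return total - non_embedding


for model_id in ("baai/bge-large-zh-v1.5", "baai/bge-m3"):
    print(model_id, embedding_layer_params(model_id))
```

In this table, bge-large-zh-v1.5 and bge-m3 both list 303 million non-embedding parameters but differ in embedding-layer size (22 million versus 265 million), which is the situation the note describes.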

Embedding Type Support#

The following table contains the embedding types that each model supports. For details, refer to How to Specify Embedding Type.

| Model ID | Supported Embedding Types |
|---|---|
| nvidia/llama-nemotron-embed-300m-v2 | float, int8, uint8, binary, ubinary |
| nvidia/llama-nemotron-embed-vl-1b-v2 | float, int8, uint8, binary, ubinary |
| nvidia/llama-nemotron-embed-1b-v2 | float, int8, uint8, binary, ubinary |
| baai/bge-large-zh-v1.5 | float |
| baai/bge-m3 | float |
| nvidia/nv-embedqa-e5-v5 | float |
| nvidia/nv-embedqa-mistral-7b-v2 | float |
| snowflake/arctic-embed-l | float |
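A client can check a requested embedding type against this table before sending a request. The following is a minimal Python sketch of such a check; the data is copied from the table, while the names `SUPPORTED_EMBEDDING_TYPES` and `supports_embedding_type` are hypothetical helpers and not part of the NIM API.

```python
# Embedding types per model ID, copied from the embedding type table.
_ALL_TYPES = ("float", "int8", "uint8", "binary", "ubinary")
_FLOAT_ONLY = ("float",)

SUPPORTED_EMBEDDING_TYPES = {
    "nvidia/llama-nemotron-embed-300m-v2": _ALL_TYPES,
    "nvidia/llama-nemotron-embed-vl-1b-v2": _ALL_TYPES,
    "nvidia/llama-nemotron-embed-1b-v2": _ALL_TYPES,
    "baai/bge-large-zh-v1.5": _FLOAT_ONLY,
    "baai/bge-m3": _FLOAT_ONLY,
    "nvidia/nv-embedqa-e5-v5": _FLOAT_ONLY,
    "nvidia/nv-embedqa-mistral-7b-v2": _FLOAT_ONLY,
    "snowflake/arctic-embed-l": _FLOAT_ONLY,
}


def supports_embedding_type(model_id: str, embedding_type: str) -> bool:
    """Return True if the table lists embedding_type for model_id.

    Unknown model IDs are treated as supporting nothing.
    """
    return embedding_type in SUPPORTED_EMBEDDING_TYPES.get(model_id, ())


print(supports_embedding_type("nvidia/llama-nemotron-embed-1b-v2", "int8"))
print(supports_embedding_type("baai/bge-m3", "int8"))
```

Keeping the table in one dictionary means the check stays a single membership test and is easy to update when the support matrix changes.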

Optimized vs Non Optimized Models#

The models listed under Supported Hardware are optimized with TensorRT (TRT) and are available as pre-built, optimized engines on NGC. These optimized models are GPU specific and require a minimum amount of GPU memory, as specified in the Optimized configuration section of each model.

NVIDIA also provides generic model profiles that operate with any NVIDIA GPU (or set of GPUs) that has sufficient memory capacity. These generic profiles are known as non-optimized configurations.

Optimized profiles are preferred over generic profiles when available. On systems where no compatible optimized profile exists, a generic profile is chosen automatically. You can also choose to deploy a generic profile on any system by following the steps in the Overriding Profile Selection section.

Compute Capability and Automatic Profile Selection#

NVIDIA NeMo Retriever Embedding NIM supports TensorRT engines that are compiled with the option kSAME_COMPUTE_CAPABILITY. This option builds engines that are compatible with GPUs that have the same compute capability as the GPU on which the engine was built. For more information, refer to Same Compute Capability Compatibility Level. To see the mapping of CUDA GPU compute capability versions to supported GPU SKUs, refer to CUDA GPU Compute Capability.

If you run a NIM on a GPU that has the same compute capability as one of the engines, then that engine should appear as compatible when you run list-model-profiles.

Automatic profile selection uses the following order to choose a profile:

1. A GPU-specific engine (for example, gpu:NVIDIA B200)
2. A compute capability engine (for example, compute_capability:10.0)
3. ONNX or PyTorch (for example, model_type:onnx)

Note: Certain NIMs may include both GPU-specific engines and compute capability engines, while others may include only a single engine type.
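The selection order described in this section can be sketched in a few lines of Python. This is an illustration of the priority rule only, not the NIM implementation: the profile dictionaries, their `id` field, the `pytorch` value for `model_type`, and the `select_profile` function are all assumptions made for the example. Only the tag names `gpu`, `compute_capability`, and `model_type` come from the examples in the text.

```python
from typing import Optional


def select_profile(profiles: list, gpu_name: str,
                   compute_capability: str) -> Optional[dict]:
    """Pick a profile using the documented priority order.

    0: GPU-specific engine, 1: compute capability engine,
    2: ONNX or PyTorch. Returns None if nothing is compatible.
    """
    def rank(profile: dict) -> Optional[int]:
        if profile.get("gpu") == gpu_name:
            return 0
        if profile.get("compute_capability") == compute_capability:
            return 1
        # "pytorch" as a tag value is an assumption for this sketch.
        if profile.get("model_type") in ("onnx", "pytorch"):
            return 2
        return None  # not compatible with this GPU

    ranked = [(rank(p), i, p) for i, p in enumerate(profiles)
              if rank(p) is not None]
    if not ranked:
        return None
    # Lowest rank wins; ties keep the original list order.
    return min(ranked, key=lambda item: (item[0], item[1]))[2]


# Hypothetical profile list for illustration.
profiles = [
    {"id": "generic-onnx", "model_type": "onnx"},
    {"id": "cc-10.0", "compute_capability": "10.0"},
    {"id": "b200", "gpu": "NVIDIA B200"},
]

print(select_profile(profiles, "NVIDIA B200", "10.0")["id"])
print(select_profile(profiles, "Other GPU", "10.0")["id"])
print(select_profile(profiles, "Other GPU", "8.0")["id"])
```

With this hypothetical list, the three calls fall through the priority levels in turn: the first matches the GPU-specific engine, the second matches only on compute capability, and the third falls back to the ONNX profile.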

Supported Hardware#

Note: Currently, GPU clusters with GPUs in Multi-Instance GPU (MIG) mode are not supported.

Llama Nemotron Embed 300m v2 (llama-nemotron-embed-300m-v2)#

Optimized configuration#

| Compute Capability | Precision |
|---|---|
| 12.0 | FP16 & FP8 |
| 10.0 | FP16 & FP8 |
| 9.0 | FP16 & FP8 |
| 8.9 | FP16 & FP8 |
| 8.6 | FP16 |
| 8.0 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GiB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space | Max Tokens |
|---|---|---|---|---|
| Any single NVIDIA GPU that has sufficient memory, or multiple homogenous NVIDIA GPUs that have sufficient memory in total. | Min: 2.4 GiB, Max: 25.2 GiB | FP16 | 7.49 GiB | 4096 |

Warning: The maximum token length of the non-optimized configuration (4096) is smaller than that of the other profiles (8192).

Llama Nemotron Embed Vision Language 1B (llama-nemotron-embed-vl-1b-v2)#

Optimized configuration#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 40 & 80 | FP16 |
| H100 HBM3 | 80 | FP16 & FP8 |
| H100 NVL | 80 | FP16 & FP8 |
| L40s | 48 | FP16 & FP8 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 |
| B200 | 180 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GiB; Disk Space is for both the container and the model.

Warning: The maximum token length of the non-optimized configuration (4096) is smaller than that of the other profiles (8192).

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory or on multiple, homogenous NVIDIA GPUs with sufficient aggregate memory | Min: 4.4 GiB, Max: 21 GiB | FP16 | 3.2 GiB |

bge-large-zh-v1.5#

Optimized configuration#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| H20 | 96 | FP16 |
| L20 | 48 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values in the following table are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory or on multiple, homogenous NVIDIA GPUs with sufficient aggregate memory | 10 | FP16 | 8.1 |

bge-m3#

Optimized configuration#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 80 | FP16 |
| H100 HBM3 | 80 | FP16 |
| L40s | 48 | FP16 |
| A10G | 24 | FP16 |
| L20 | 48 | FP16 |
| H20 | 96 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory or on multiple, homogenous NVIDIA GPUs with sufficient aggregate memory | 33 | FP16 | 8.8 |

Llama Nemotron Embed 1B v2#

Optimized configuration#

| Compute Capability | Precision |
|---|---|
| 12.0 | FP16 & FP8 |
| 10.0 | FP16 & FP8 |
| 9.0 | FP16 & FP8 |
| 8.9 | FP16 & FP8 |
| 8.6 | FP16 |
| 8.0 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) | Max Tokens |
|---|---|---|---|---|
| Any single NVIDIA GPU that has sufficient memory, or multiple homogenous NVIDIA GPUs that have sufficient memory in total. | 3.6 | FP16 | 9 | 4096 |

If you run this model on RTX 40xx or later, you need a minimum of 8 GB of VRAM.

Warning: The maximum token length of the non-optimized configuration (4096) is smaller than that of the other profiles (8192).

NV-EmbedQA-E5-v5#

Optimized configuration#

| Compute Capability | Precision |
|---|---|
| 12.0 | FP16 |
| 10.0 | FP16 |
| 9.0 | FP16 |
| 8.9 | FP16 |
| 8.6 | FP16 |
| 8.0 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory or on multiple, homogenous NVIDIA GPUs with sufficient aggregate memory | 2 | FP16 | 8.5 |

NV-EmbedQA-Mistral7B-v2#

Optimized configuration#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 80 | FP16 |
| H100 HBM3 | 80 | FP8 |
| H100 HBM3 | 80 | FP16 |
| L40s | 48 | FP8 |
| L40s | 48 | FP16 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory or on multiple, homogenous NVIDIA GPUs with sufficient aggregate memory | 16 | FP16 | 30 |

Snowflake’s Arctic-embed-l#

Optimized configuration#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 80 | FP16 |
| H100 HBM3 | 80 | FP16 |
| L40s | 48 | FP16 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory or on multiple, homogenous NVIDIA GPUs with sufficient aggregate memory | 2 | FP16 | 17 |
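
For the three models whose optimized configurations are listed by compute capability rather than by GPU name (Llama Nemotron Embed 300m v2, Llama Nemotron Embed 1B v2, and NV-EmbedQA-E5-v5), the tables above can be expressed as a simple lookup. The following Python sketch is illustrative only: the dictionary and function names are hypothetical, and the sketch deliberately omits the models whose optimized configurations are listed per GPU.

```python
# Optimized precisions per compute capability, copied from the
# compute-capability tables in this section. Models whose optimized
# configurations are listed per GPU are intentionally not included.
_FP16_FP8 = ("FP16", "FP8")
_FP16 = ("FP16",)

_NEMOTRON_EMBED = {
    "12.0": _FP16_FP8, "10.0": _FP16_FP8, "9.0": _FP16_FP8,
    "8.9": _FP16_FP8, "8.6": _FP16, "8.0": _FP16,
}

OPTIMIZED_PRECISIONS = {
    "nvidia/llama-nemotron-embed-300m-v2": _NEMOTRON_EMBED,
    "nvidia/llama-nemotron-embed-1b-v2": _NEMOTRON_EMBED,
    "nvidia/nv-embedqa-e5-v5": {cc: _FP16 for cc in _NEMOTRON_EMBED},
}


def optimized_precisions(model_id: str, compute_capability: str) -> tuple:
    """Return the precisions listed for this model and compute capability.

    An empty tuple means the tables covered by this sketch list no
    optimized entry for that combination.
    """
    return OPTIMIZED_PRECISIONS.get(model_id, {}).get(compute_capability, ())


print(optimized_precisions("nvidia/llama-nemotron-embed-1b-v2", "9.0"))
print(optimized_precisions("nvidia/llama-nemotron-embed-1b-v2", "8.6"))
print(optimized_precisions("nvidia/nv-embedqa-e5-v5", "12.0"))
```

An empty result from this lookup only means that the compute-capability tables list no optimized entry; the non-optimized configuration may still apply if the GPU has sufficient memory.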