Support Matrix for NVIDIA NeMo Retriever Embedding NIM#

This documentation describes the software and hardware that NVIDIA NeMo Retriever Embedding NIM supports.

CPU#

NeMo Retriever Embedding NIM requires the following:

x86 processor with at least 8 cores. For a list of supported systems, refer to NVIDIA Certified Systems Catalog.

Models#

NVIDIA NeMo Retriever Embedding NIM supports the following models.

Publisher	Model ID	Supported Embedding Types	Max Tokens	Parameters (millions, excl. embeddings)	Total Parameters (millions)	Embedding Dimension	Dynamic Embeddings Supported	Model Card
NVIDIA	nvidia/llama-nemotron-embed-vl-1b-v2	`float`, `int8`, `uint8`, `binary`, `ubinary`	2048	1414	1678	2048	yes	Llama Nemotron Embed VL 1B v2 model card
NVIDIA	nvidia/llama-nemotron-embed-1b-v2	`float`, `int8`, `uint8`, `binary`, `ubinary`	8192	973	1236	2048	yes	Llama Nemotron Embed 1B v2 model card
NVIDIA	nvidia/llama-nemotron-embed-300m-v2	`float`, `int8`, `uint8`, `binary`, `ubinary`	8192	307	569	2048	yes	Llama Nemotron Embed 300M v2 model card
NVIDIA	nvidia/nv-embedqa-e5-v5	`float`	512	303	335	1024	no	NV-EmbedQA-E5 v5 model card
BAAI	baai/bge-m3	`float`	8192	303	568	1024	no	BAAI bge-m3 model card
BAAI	baai/bge-large-zh-v1.5	`float`	512	303	325	1024	no	BAAI bge-large-zh-v1.5 model card

Note

The “Parameters (millions, excl. embeddings)” column shows the count of parameters that directly affect inference performance and computational cost. Embedding layer parameters are excluded because they primarily affect model size rather than inference speed. For example, models with different vocabulary sizes can have different total parameter counts but the same inference-relevant parameter count.
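To make the relationship between the two parameter columns concrete, the following minimal sketch subtracts one from the other to estimate the size of each embedding layer. The figures are copied from the Models table; the dictionary and function names are hypothetical and exist only for illustration.

```python
# (non-embedding parameters, total parameters) in millions,
# copied from the Models table in this document.
MODELS = {
    "nvidia/llama-nemotron-embed-vl-1b-v2": (1414, 1678),
    "nvidia/llama-nemotron-embed-1b-v2": (973, 1236),
    "nvidia/llama-nemotron-embed-300m-v2": (307, 569),
    "nvidia/nv-embedqa-e5-v5": (303, 335),
    "baai/bge-m3": (303, 568),
    "baai/bge-large-zh-v1.5": (303, 325),
}


def embedding_layer_params(model_id):
    """Return the embedding-layer parameter count in millions (total minus non-embedding)."""
    non_embedding, total = MODELS[model_id]
    return total - non_embedding


for model_id in MODELS:
    print(f"{model_id}: {embedding_layer_params(model_id)}M embedding-layer parameters")
```

In this data, nvidia/nv-embedqa-e5-v5, baai/bge-m3, and baai/bge-large-zh-v1.5 all list 303 million non-embedding parameters while their totals differ, which is the situation the note describes.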

Optimized vs Non-Optimized Models#

Starting in version 2.0.0, optimized configurations for nvidia/llama-nemotron-embed-vl-1b-v2 use runtime CUDA kernels and just-in-time compilation. At startup, the NIM selects a kernel feature set for the detected GPU architecture. Depending on the selected feature set, the NIM might compile kernels, load precompiled kernels optimized for that architecture, or use both.

The optimized configuration tables list the compute capability families or GPU SKUs that have optimized kernel support for the listed precision. These configurations are tuned and validated for the release.

Non-optimized configurations use a fallback kernel feature set intended for broad compatibility, such as FP16 architecture-agnostic kernels. Fallback configurations can run on GPUs with sufficient memory, but they might not support every optimized feature or deliver the same performance as the optimized configurations.

Compute Capability and Automatic Kernel Selection#

Starting in version 2.0.0, automatic profile selection is replaced by automatic kernel selection. The NIM detects the GPU compute capability at startup and selects the supported kernel feature set for that compute capability family.
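To see which compute capability your own GPUs report before you consult the tables on this page, a sketch such as the following can help. It assumes that `nvidia-smi` is on the `PATH` and that the installed driver supports the `compute_cap` query field; it does not reproduce how the NIM itself detects the GPU, which this page does not describe. The function names are hypothetical.

```python
import subprocess


def parse_compute_caps(output):
    """Parse nvidia-smi output lines such as '9.0' into (major, minor) tuples, one per GPU."""
    caps = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        major, minor = line.split(".")
        caps.append((int(major), int(minor)))
    return caps


def query_compute_caps():
    """Ask nvidia-smi for the compute capability of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_caps(result.stdout)


if __name__ == "__main__":
    try:
        print(query_compute_caps())
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        # No driver or no nvidia-smi on this machine.
        print(f"Could not query nvidia-smi: {exc}")
```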

The selected feature set determines which CUDA kernels, attention implementation, precompiled kernel artifacts, and default precision are used. If an optimized feature set is not available for the detected GPU, the NIM uses the compatible fallback feature set.

To request a precision explicitly, set NIM_PRECISION to fp16 or fp8. FP8 is available only on the compute capability families or GPU SKUs listed with FP8 support in the optimized configuration tables.
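The rule for choosing a `NIM_PRECISION` value can be sketched as a small lookup. The mapping below is copied from the optimized configuration tables that this page lists for Llama Nemotron Embed 1B v2 and 300M v2; other models have different tables. The function name and the three result labels are hypothetical, and the sketch only classifies a request. How the NIM reacts to an unsupported request is not stated on this page.

```python
# Precisions listed per compute capability in the optimized configuration
# tables for llama-nemotron-embed-1b-v2 and llama-nemotron-embed-300m-v2.
OPTIMIZED_PRECISIONS = {
    "12.0": {"fp16", "fp8"},
    "10.0": {"fp16", "fp8"},
    "9.0": {"fp16", "fp8"},
    "8.9": {"fp16", "fp8"},
    "8.6": {"fp16"},
    "8.0": {"fp16"},
}


def classify_precision_request(compute_capability, requested):
    """Classify a NIM_PRECISION request as 'optimized', 'fallback', or 'unsupported'."""
    if requested not in ("fp16", "fp8"):
        raise ValueError("NIM_PRECISION accepts fp16 or fp8")
    if requested in OPTIMIZED_PRECISIONS.get(compute_capability, set()):
        return "optimized"
    if requested == "fp16":
        # The fallback feature set is described as FP16 and architecture-agnostic.
        return "fallback"
    return "unsupported"


print(classify_precision_request("9.0", "fp8"))
print(classify_precision_request("8.6", "fp8"))
print(classify_precision_request("7.5", "fp16"))
```

A deployment script could run a check like this before it sets `NIM_PRECISION=fp8` in the container environment, so that a mismatch is caught before the container starts.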

Supported Hardware#

Note

Currently, GPU clusters with GPUs in Multi-Instance GPU (MIG) mode are not supported.

Llama Nemotron Embed 300M v2 (llama-nemotron-embed-300m-v2)#

Optimized configuration#

Compute Capability	Precision
12.0	FP16 & FP8
10.0	FP16 & FP8
9.0	FP16 & FP8
8.9	FP16 & FP8
8.6	FP16
8.0	FP16

Non-optimized configuration#

The GPU Memory and Disk Space values are in GiB, as marked in the table; Disk Space covers both the container and the model.

GPUs	GPU Memory	Precision	Disk Space	Max Tokens
Any single NVIDIA GPU that has sufficient memory, or multiple homogeneous NVIDIA GPUs that have sufficient memory in total.	Min: 2.4 GiB, Max: 25.2 GiB	FP16	7.49 GiB	4096

Warning

The maximum token length of the non-optimized configuration (4096) is smaller than that of the optimized configurations (8192).

Llama Nemotron Embed Vision Language 1B (llama-nemotron-embed-vl-1b-v2)#

Supported GPU SKUs#

SKU	GPU	Precision
`NVIDIA-RTX-PRO-6000-Blackwell-Workstation-Edition`	NVIDIA RTX PRO 6000 Blackwell Workstation Edition	FP8 & FP16
`NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition`	NVIDIA RTX PRO 6000 Blackwell Server Edition	FP8 & FP16
`NVIDIA-B200`	NVIDIA B200	FP8 & FP16
`NVIDIA-GB200`	NVIDIA GB200	FP8 & FP16
`NVIDIA-H200`	NVIDIA H200	FP8 & FP16
`NVIDIA-A100-SXM4-80GB`	NVIDIA A100 SXM4 80GB	FP16
`NVIDIA-H100-NVL`	NVIDIA H100 NVL	FP8 & FP16
`NVIDIA-H100-80GB-HBM3`	NVIDIA H100 80GB HBM3	FP8 & FP16
`NVIDIA-L4`	NVIDIA L4	FP16
`NVIDIA-L40S`	NVIDIA L40S	FP8 & FP16
`NVIDIA-A10G`	NVIDIA A10G	FP16
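The SKU table can be turned into a lookup for scripted checks. In every row above, the SKU string equals the GPU name with spaces replaced by hyphens; the sketch below relies on that observation, which is an assumption drawn from the table rather than a documented rule. Whether the name reported by your driver matches the GPU column exactly is also not confirmed here. All names in the code are hypothetical.

```python
# SKU -> listed precisions, copied from the Supported GPU SKUs table.
VL_SKU_PRECISIONS = {
    "NVIDIA-RTX-PRO-6000-Blackwell-Workstation-Edition": ("fp8", "fp16"),
    "NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition": ("fp8", "fp16"),
    "NVIDIA-B200": ("fp8", "fp16"),
    "NVIDIA-GB200": ("fp8", "fp16"),
    "NVIDIA-H200": ("fp8", "fp16"),
    "NVIDIA-A100-SXM4-80GB": ("fp16",),
    "NVIDIA-H100-NVL": ("fp8", "fp16"),
    "NVIDIA-H100-80GB-HBM3": ("fp8", "fp16"),
    "NVIDIA-L4": ("fp16",),
    "NVIDIA-L40S": ("fp8", "fp16"),
    "NVIDIA-A10G": ("fp16",),
}


def sku_from_gpu_name(gpu_name):
    """Assumption: the SKU is the GPU name with whitespace replaced by hyphens."""
    return "-".join(gpu_name.split())


def supported_precisions(gpu_name):
    """Return the listed precisions for a GPU name, or an empty tuple if the SKU is not listed."""
    return VL_SKU_PRECISIONS.get(sku_from_gpu_name(gpu_name), ())


print(supported_precisions("NVIDIA H100 NVL"))
print(supported_precisions("NVIDIA L4"))
```

An empty result means only that the GPU is outside the listed SKU set for this model.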

Non-optimized configuration#

Fallback behavior on GPUs outside the listed SKU set has not been verified for this model.

Note

The default VLM profile uses a maximum sequence length of 2048 tokens. Image inputs are supported only as document or passage inputs.

bge-large-zh-v1.5#

Optimized configuration#

GPU	GPU Memory (GB)	Precision
H20	96	FP16
L20	48	FP16

Non-optimized configuration#

The GPU Memory and Disk Space values in the following table are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory	10	FP16	8.1

bge-m3#

Optimized configuration#

GPU	GPU Memory (GB)	Precision
A100 SXM4	80	FP16
H100 HBM3	80	FP16
L40S	48	FP16
A10G	24	FP16
L20	48	FP16
H20	96	FP16

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory	33	FP16	8.8

Llama Nemotron Embed 1B v2#

Optimized configuration#

Compute Capability	Precision
12.0	FP16 & FP8
10.0	FP16 & FP8
9.0	FP16 & FP8
8.9	FP16 & FP8
8.6	FP16
8.0	FP16

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space	Max Tokens
Any single NVIDIA GPU that has sufficient memory, or multiple homogeneous NVIDIA GPUs that have sufficient memory in total.	3.6	FP16	9	4096

If you run this model on an RTX 40xx or later GPU, you need at least 8 GB of VRAM.

Warning

The maximum token length of the non-optimized configuration (4096) is smaller than that of the optimized configurations (8192).
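Because the limit differs between configurations for both Llama Nemotron Embed 300M v2 and 1B v2, a client may want to check input length before sending a request. The following sketch uses the limits stated on this page; the function name and configuration labels are hypothetical, and the sketch makes no claim about how the NIM treats inputs that exceed the limit.

```python
# Maximum token lengths stated on this page for each configuration.
MAX_TOKENS = {
    "nvidia/llama-nemotron-embed-300m-v2": {"optimized": 8192, "non-optimized": 4096},
    "nvidia/llama-nemotron-embed-1b-v2": {"optimized": 8192, "non-optimized": 4096},
}


def tokens_over_limit(model_id, configuration, token_count):
    """Return how many tokens exceed the configuration's limit (0 if the input fits)."""
    limit = MAX_TOKENS[model_id][configuration]
    return max(0, token_count - limit)


# A 6000-token input fits the optimized configuration but not the fallback one.
print(tokens_over_limit("nvidia/llama-nemotron-embed-1b-v2", "optimized", 6000))
print(tokens_over_limit("nvidia/llama-nemotron-embed-1b-v2", "non-optimized", 6000))
```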

NV-EmbedQA-E5-v5#

Optimized configuration#

Compute Capability	Precision
12.0	FP16
10.0	FP16
9.0	FP16
8.9	FP16
8.6	FP16
8.0	FP16

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory	2	FP16	8.5

Memory Footprint#

The following tables list the valid configurations and the approximate GPU memory footprint of each model, by compute capability and precision.

nvidia/llama-nemotron-embed-300m-v2

Compute Capability	Precision	Approximate GPU Memory Size (GiB)
12.0	fp8	2.04
12.0	fp16	3.53
10.0	fp8	2.04
10.0	fp16	3.19
9.0	fp8	2.04
9.0	fp16	3.17
8.9	fp8	2.04
8.9	fp16	2.67
8.6	fp16	2.86
8.0	fp16	2.86

nvidia/llama-nemotron-embed-1b-v2

Compute Capability	Precision	Approximate GPU Memory Size (GiB)
12.0	fp8	2.98
12.0	fp16	6.53
10.0	fp8	2.98
10.0	fp16	6.09
9.0	fp8	2.98
9.0	fp16	4.91
8.9	fp8	2.98
8.9	fp16	4.91
8.6	fp16	5.22
8.0	fp16	5.09

nvidia/llama-nemotron-embed-vl-1b-v2

Compute Capability	Precision	Approximate GPU Memory Size (GiB)
12.0	fp8	8.26
12.0	fp16	6.13
10.0	fp8	9.27
10.0	fp16	6.07
9.0	fp8	9.17
9.0	fp16	6.79
8.9	fp8	8.5
8.9	fp16	5.8
8.6	fp16	5.61
8.0	fp16	5.79

nvidia/nv-embedqa-e5-v5

Compute Capability	Precision	Approximate GPU Memory Size (GiB)
12.0	fp16	0.87
10.0	fp16	0.87
9.0	fp16	0.87
8.9	fp16	0.87
8.6	fp16	0.88
8.0	fp16	0.87
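The footprint figures can be used for a rough capacity check. The following sketch copies the values listed in this section and compares one against a GPU's memory. The `headroom_gib` margin is an arbitrary assumption and not an NVIDIA recommendation, the function names are hypothetical, and the conversion helper applies only when a memory figure is given in decimal GB rather than GiB.

```python
GIB = 2**30  # bytes in one GiB

# Approximate GPU memory footprints in GiB, copied from this section:
# model -> compute capability -> precision -> GiB.
FOOTPRINT_GIB = {
    "nvidia/llama-nemotron-embed-300m-v2": {
        "12.0": {"fp8": 2.04, "fp16": 3.53},
        "10.0": {"fp8": 2.04, "fp16": 3.19},
        "9.0": {"fp8": 2.04, "fp16": 3.17},
        "8.9": {"fp8": 2.04, "fp16": 2.67},
        "8.6": {"fp16": 2.86},
        "8.0": {"fp16": 2.86},
    },
    "nvidia/llama-nemotron-embed-1b-v2": {
        "12.0": {"fp8": 2.98, "fp16": 6.53},
        "10.0": {"fp8": 2.98, "fp16": 6.09},
        "9.0": {"fp8": 2.98, "fp16": 4.91},
        "8.9": {"fp8": 2.98, "fp16": 4.91},
        "8.6": {"fp16": 5.22},
        "8.0": {"fp16": 5.09},
    },
    "nvidia/llama-nemotron-embed-vl-1b-v2": {
        "12.0": {"fp8": 8.26, "fp16": 6.13},
        "10.0": {"fp8": 9.27, "fp16": 6.07},
        "9.0": {"fp8": 9.17, "fp16": 6.79},
        "8.9": {"fp8": 8.5, "fp16": 5.8},
        "8.6": {"fp16": 5.61},
        "8.0": {"fp16": 5.79},
    },
    "nvidia/nv-embedqa-e5-v5": {
        "12.0": {"fp16": 0.87},
        "10.0": {"fp16": 0.87},
        "9.0": {"fp16": 0.87},
        "8.9": {"fp16": 0.87},
        "8.6": {"fp16": 0.88},
        "8.0": {"fp16": 0.87},
    },
}


def gb_to_gib(gigabytes):
    """Convert decimal gigabytes (10**9 bytes) to GiB (2**30 bytes)."""
    return gigabytes * 10**9 / GIB


def fits_in_memory(model_id, compute_capability, precision, gpu_memory_gib, headroom_gib=1.0):
    """Return True if the listed footprint plus an assumed headroom fits in the GPU memory."""
    footprint = FOOTPRINT_GIB[model_id][compute_capability][precision]
    return footprint + headroom_gib <= gpu_memory_gib


print(fits_in_memory("nvidia/llama-nemotron-embed-1b-v2", "9.0", "fp16", gb_to_gib(24)))
print(fits_in_memory("nvidia/llama-nemotron-embed-vl-1b-v2", "10.0", "fp8", 8.0))
```

A lookup for a combination that is not listed, such as fp8 on compute capability 8.6, raises `KeyError`, which mirrors the fact that the tables do not list that configuration.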

Software#

NVIDIA Driver#

Release 1.7.0+ uses NVIDIA Optimized Frameworks 25.01. For NVIDIA driver support, refer to the Frameworks Support Matrix.

Ensure that the latest compatible NVIDIA driver is installed on your system before launching NIM containers. If you experience issues starting the containers, verify that your driver is up to date.

NVIDIA Container Toolkit#

Your Docker environment must support NVIDIA GPUs. For more information, refer to NVIDIA Container Toolkit.
