Support Matrix for NVIDIA NeMo Retriever Embedding NIM#

This documentation describes the software and hardware that NVIDIA NeMo Retriever Embedding NIM supports.

CPU#

NeMo Retriever Embedding NIM requires the following:

Models#

NVIDIA NeMo Retriever Embedding NIM supports the following models.

| Publisher | Model ID | Supported Embedding Types | Max Tokens | Parameters (millions, excl. embeddings) | Total Parameters (millions) | Embedding Dimension | Dynamic Embeddings Supported | Model Card |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA | nvidia/llama-nemotron-embed-vl-1b-v2 | float, int8, uint8, binary, ubinary | 2048 | 1414 | 1678 | 2048 | yes | Llama Nemotron Embed VL 1B v2 model card |
| NVIDIA | nvidia/llama-nemotron-embed-1b-v2 | float, int8, uint8, binary, ubinary | 8192 | 973 | 1236 | 2048 | yes | Llama Nemotron Embed 1B v2 model card |
| NVIDIA | nvidia/llama-nemotron-embed-300m-v2 | float, int8, uint8, binary, ubinary | 8192 | 307 | 569 | 2048 | yes | Llama Nemotron Embed 300M v2 model card |
| NVIDIA | nvidia/nv-embedqa-e5-v5 | float | 512 | 303 | 335 | 1024 | no | NV-EmbedQA-E5 v5 model card |
| BAAI | baai/bge-m3 | float | 8192 | 303 | 568 | 1024 | no | BAAI bge-m3 model card |
| BAAI | baai/bge-large-zh-v1.5 | float | 512 | 303 | 325 | 1024 | no | BAAI bge-large-zh-v1.5 model card |

Note

The “Parameters (excl. embeddings)” column shows the count of parameters that directly impact inference performance and computational cost. Embedding layer parameters are excluded because they primarily affect model size rather than inference speed. For example, models with different vocabulary sizes may have different total parameter counts but the same inference-relevant parameter count.
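The arithmetic behind the note can be sketched in a few lines. The snippet below copies the two parameter columns from the Models table and assumes that the difference between them is the embedding-layer parameter count. The helper name `embedding_params` is hypothetical and exists only for this illustration.

```python
# Parameter counts in millions, copied from the Models table:
# model ID -> (parameters excluding embeddings, total parameters).
PARAMS_MILLIONS = {
    "nvidia/llama-nemotron-embed-vl-1b-v2": (1414, 1678),
    "nvidia/llama-nemotron-embed-1b-v2": (973, 1236),
    "nvidia/llama-nemotron-embed-300m-v2": (307, 569),
    "nvidia/nv-embedqa-e5-v5": (303, 335),
    "baai/bge-m3": (303, 568),
    "baai/bge-large-zh-v1.5": (303, 325),
}


def embedding_params(model_id):
    """Return the embedding-layer parameters (millions).

    Assumption: this is the total minus the non-embedding count.
    """
    non_embedding, total = PARAMS_MILLIONS[model_id]
    return total - non_embedding


for model_id in PARAMS_MILLIONS:
    print(f"{model_id}: {embedding_params(model_id)}M embedding parameters")
```

Under this assumption, nvidia/nv-embedqa-e5-v5, baai/bge-m3, and baai/bge-large-zh-v1.5 all have 303 million non-embedding parameters, while their totals differ (335, 568, and 325 million), which is the situation the note describes.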

Optimized vs. Non-Optimized Models#

Starting in version 2.0.0, optimized configurations for nvidia/llama-nemotron-embed-vl-1b-v2 use runtime CUDA kernels and just-in-time compilation. At startup, the NIM selects a kernel feature set for the detected GPU architecture. Depending on the selected feature set, the NIM might compile kernels, load precompiled kernels optimized for that architecture, or use both.

The optimized configuration tables list the compute capability families or GPU SKUs that have optimized kernel support for the listed precision. These configurations are tuned and validated for the release.

Non-optimized configurations use a fallback kernel feature set intended for broad compatibility, such as FP16 architecture-agnostic kernels. Fallback configurations can run on GPUs with sufficient memory, but they might not support every optimized feature or deliver the same performance as the optimized configurations.

Compute Capability and Automatic Kernel Selection#

Starting in version 2.0.0, automatic profile selection is replaced by automatic kernel selection. The NIM detects the GPU compute capability at startup and selects the supported kernel feature set for that compute capability family.

The selected feature set determines which CUDA kernels, attention implementation, precompiled kernel artifacts, and default precision are used. If an optimized feature set is not available for the detected GPU, the NIM uses the compatible fallback feature set.

To request a precision explicitly, set NIM_PRECISION to fp16 or fp8. FP8 is available only on the compute capability families or GPU SKUs listed with FP8 support in the optimized configuration tables.
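As a rough illustration of the rule above, the following sketch checks a value intended for NIM_PRECISION against the compute capability tables that this document lists for llama-nemotron-embed-300m-v2 and llama-nemotron-embed-1b-v2. The function names are hypothetical. Treating any unlisted compute capability as FP16-only, and raising an error for an unsupported request, are assumptions made for the sketch and not documented NIM behavior.

```python
# Precisions with optimized kernel support per compute capability, copied from
# the optimized configuration tables for llama-nemotron-embed-300m-v2 and
# llama-nemotron-embed-1b-v2 in this document.
OPTIMIZED_PRECISIONS = {
    "12.0": frozenset({"fp16", "fp8"}),
    "10.0": frozenset({"fp16", "fp8"}),
    "9.0": frozenset({"fp16", "fp8"}),
    "8.9": frozenset({"fp16", "fp8"}),
    "8.6": frozenset({"fp16"}),
    "8.0": frozenset({"fp16"}),
}

# Assumption: a GPU outside the table runs the FP16 fallback feature set only.
FALLBACK_PRECISIONS = frozenset({"fp16"})


def supported_precisions(compute_capability):
    """Return the precisions this sketch considers valid for a compute capability."""
    return OPTIMIZED_PRECISIONS.get(compute_capability, FALLBACK_PRECISIONS)


def check_precision_request(compute_capability, requested):
    """Validate a value intended for NIM_PRECISION.

    Raising ValueError is a choice made for this sketch.
    """
    if requested not in supported_precisions(compute_capability):
        raise ValueError(
            f"{requested!r} is not listed for compute capability {compute_capability}"
        )
    return requested


# FP8 is listed for compute capability 8.9 but not for 8.6.
print(check_precision_request("8.9", "fp8"))
try:
    check_precision_request("8.6", "fp8")
except ValueError as err:
    print(err)
```

A check like this could run in a deployment script before the container starts, so that an unsupported FP8 request is caught early.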

Supported Hardware#

Note

GPU clusters with GPUs in Multi-Instance GPU (MIG) mode are not currently supported.

Llama Nemotron Embed 300M v2 (llama-nemotron-embed-300m-v2)#

Optimized configuration#

| Compute Capability | Precision |
| --- | --- |
| 12.0 | FP16 & FP8 |
| 10.0 | FP16 & FP8 |
| 9.0 | FP16 & FP8 |
| 8.9 | FP16 & FP8 |
| 8.6 | FP16 |
| 8.0 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GiB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space | Max Tokens |
| --- | --- | --- | --- | --- |
| Any single NVIDIA GPU that has sufficient memory, or multiple homogeneous NVIDIA GPUs that have sufficient memory in total. | Min: 2.4 GiB, Max: 25.2 GiB | FP16 | 7.49 GiB | 4096 |

Warning

The non-optimized configuration supports a maximum of 4096 tokens, compared with 8192 tokens for the optimized configurations.
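The GPU requirement for the non-optimized configuration (one GPU with enough memory, or several identical GPUs with enough memory in total) can be expressed as a small check. This is a sketch under stated assumptions: the function name, the `(name, memory_gib)` tuple layout, and the GPU names are hypothetical, and "homogeneous" is interpreted as "all GPUs report the same name".

```python
def meets_memory_requirement(gpus, required_gib):
    """Check a list of (name, memory_gib) tuples against a memory requirement.

    True if any single GPU has enough memory, or if all GPUs are the same
    model and their combined memory is enough.
    """
    if not gpus:
        return False
    # Case 1: one GPU alone is sufficient.
    if any(memory >= required_gib for _, memory in gpus):
        return True
    # Case 2: multiple homogeneous GPUs with sufficient memory in total.
    names = {name for name, _ in gpus}
    total = sum(memory for _, memory in gpus)
    return len(names) == 1 and total >= required_gib


# 2.4 GiB is the minimum listed for llama-nemotron-embed-300m-v2.
MIN_GIB = 2.4
print(meets_memory_requirement([("gpu-a", 16.0)], MIN_GIB))
print(meets_memory_requirement([("gpu-a", 1.5), ("gpu-a", 1.5)], MIN_GIB))
print(meets_memory_requirement([("gpu-a", 1.5), ("gpu-b", 1.5)], MIN_GIB))
```

The third call mixes two different GPU models, so their memory is not pooled and the check fails.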

Llama Nemotron Embed Vision Language 1B (llama-nemotron-embed-vl-1b-v2)#

Supported GPU SKUs#

| SKU | GPU | Precision |
| --- | --- | --- |
| NVIDIA-RTX-PRO-6000-Blackwell-Workstation-Edition | NVIDIA RTX PRO 6000 Blackwell Workstation Edition | FP8 & FP16 |
| NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition | NVIDIA RTX PRO 6000 Blackwell Server Edition | FP8 & FP16 |
| NVIDIA-B200 | NVIDIA B200 | FP8 & FP16 |
| NVIDIA-GB200 | NVIDIA GB200 | FP8 & FP16 |
| NVIDIA-H200 | NVIDIA H200 | FP8 & FP16 |
| NVIDIA-A100-SXM4-80GB | NVIDIA A100 SXM4 80GB | FP16 |
| NVIDIA-H100-NVL | NVIDIA H100 NVL | FP8 & FP16 |
| NVIDIA-H100-80GB-HBM3 | NVIDIA H100 80GB HBM3 | FP8 & FP16 |
| NVIDIA-L4 | NVIDIA L4 | FP16 |
| NVIDIA-L40S | NVIDIA L40S | FP8 & FP16 |
| NVIDIA-A10G | NVIDIA A10G | FP16 |

Non-optimized configuration#

Fallback behavior on GPUs outside the listed SKU set has not been verified for this model.

Note

The default VLM profile uses a maximum sequence length of 2048 tokens. Image inputs are supported only as document or passage inputs.

bge-large-zh-v1.5#

Optimized configuration#

| GPU | GPU Memory (GB) | Precision |
| --- | --- | --- |
| H20 | 96 | FP16 |
| L20 | 48 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values in the following table are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
| --- | --- | --- | --- |
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 10 | FP16 | 8.1 |

bge-m3#

Optimized configuration#

| GPU | GPU Memory (GB) | Precision |
| --- | --- | --- |
| A100 SXM4 | 80 | FP16 |
| H100 HBM3 | 80 | FP16 |
| L40S | 48 | FP16 |
| A10G | 24 | FP16 |
| L20 | 48 | FP16 |
| H20 | 96 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
| --- | --- | --- | --- |
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 33 | FP16 | 8.8 |

Llama Nemotron Embed 1B v2#

Optimized configuration#

| Compute Capability | Precision |
| --- | --- |
| 12.0 | FP16 & FP8 |
| 10.0 | FP16 & FP8 |
| 9.0 | FP16 & FP8 |
| 8.9 | FP16 & FP8 |
| 8.6 | FP16 |
| 8.0 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) | Max Tokens |
| --- | --- | --- | --- | --- |
| Any single NVIDIA GPU that has sufficient memory, or multiple homogeneous NVIDIA GPUs that have sufficient memory in total. | 3.6 | FP16 | 9 | 4096 |

If you run this model on an RTX 40xx or later GPU, you need at least 8 GB of VRAM.

Warning

The non-optimized configuration supports a maximum of 4096 tokens, compared with 8192 tokens for the optimized configurations.

NV-EmbedQA-E5-v5#

Optimized configuration#

| Compute Capability | Precision |
| --- | --- |
| 12.0 | FP16 |
| 10.0 | FP16 |
| 9.0 | FP16 |
| 8.9 | FP16 |
| 8.6 | FP16 |
| 8.0 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
| --- | --- | --- | --- |
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 2 | FP16 | 8.5 |

Memory Footprint#

The following table provides the set of valid configurations and the associated approximate memory footprints for the model.

| Approximate GPU Memory Size (GiB) |
| --- |
| 2.04 |
| 3.53 |
| 2.04 |
| 3.19 |
| 2.04 |
| 3.17 |
| 2.04 |
| 2.67 |
| 2.86 |
| 2.86 |
| 2.98 |
| 6.53 |
| 2.98 |
| 6.09 |
| 2.98 |
| 4.91 |
| 2.98 |
| 4.91 |
| 5.22 |
| 5.09 |
| 8.26 |
| 6.13 |
| 9.27 |
| 6.07 |
| 9.17 |
| 6.79 |
| 8.5 |
| 5.8 |
| 5.61 |
| 5.79 |
| 0.87 |
| 0.87 |
| 0.87 |
| 0.87 |
| 0.88 |
| 0.87 |

Software#

NVIDIA Driver#

Release 1.7.0+ uses NVIDIA Optimized Frameworks 25.01. For NVIDIA driver support, refer to the Frameworks Support Matrix.

Ensure that the latest compatible NVIDIA driver is installed on your system before launching NIM containers. If you experience issues starting the containers, verify that your driver is up-to-date.
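One way to see which driver is installed is to query `nvidia-smi` and parse its CSV output. The sketch below is illustrative only: the function names are hypothetical, the sample line is made-up data and not output from a real system, and the code only reports the driver version. It does not decide whether that version is compatible, which the Frameworks Support Matrix determines.

```python
import csv
import io
import subprocess

# nvidia-smi query that prints one CSV line per GPU: name, driver version,
# and total memory in MiB (no header row, no unit suffixes).
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_report(text):
    """Parse the CSV text produced by QUERY into a list of dictionaries."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        {
            "name": row[0].strip(),
            "driver_version": row[1].strip(),
            "memory_mib": int(row[2]),
        }
        for row in rows
        if row  # skip blank lines
    ]


def query_gpus():
    """Run nvidia-smi on this machine and return the parsed report."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_gpu_report(result.stdout)


# Made-up sample line, used here so the parser can be tried without a GPU.
SAMPLE = "NVIDIA L4, 550.54.15, 23034\n"
for gpu in parse_gpu_report(SAMPLE):
    print(gpu["name"], gpu["driver_version"], gpu["memory_mib"])
```

On a machine with an NVIDIA driver installed, calling `query_gpus()` returns the same structure for the real GPUs.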

NVIDIA Container Toolkit#

Your Docker environment must support NVIDIA GPUs. For more information, refer to NVIDIA Container Toolkit.