Support Matrix for NeMo Retriever Text Embedding NIM#

This documentation describes the software and hardware that NeMo Retriever Text Embedding NIM supports.

CPU#

Text Embedding NIM requires the following:

Models#

NeMo Retriever Text Embedding NIM supports the following models.

| Model Name | Model ID | Max Tokens | Publisher | Parameters (millions, excl. embeddings) | Total Parameters (millions) | Embedding Dimension | Dynamic Embeddings Supported | Model Card |
|---|---|---|---|---|---|---|---|---|
| Llama 3.2 NeMo Retriever Embedding 300m v2 | nvidia/llama-3.2-nemoretriever-300m-embed-v2 | 8192 | NVIDIA | 307 | 569 | 2048 | no | |
| Llama 3.2 NeMo Retriever Embedding 300m v1 | nvidia/llama-3.2-nemoretriever-300m-embed-v1 | 8192 | NVIDIA | 307 | 569 | 2048 | no | Link |
| NeMo Retriever Llama Vision Embed | nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1 | 4096 | NVIDIA | 1414 | 1678 | 2048 | yes | Link |
| bge-large-zh-v1.5 | baai/bge-large-zh-v1.5 | 512 | BAAI | 303 | 325 | 1024 | no | Link |
| bge-m3 | baai/bge-m3 | 8192 | BAAI | 303 | 568 | 1024 | no | Link |
| Llama-3.2-NV-EmbedQA-1B-v2 | nvidia/llama-3.2-nv-embedqa-1b-v2 | 8192 | NVIDIA | 973 | 1236 | 2048 | yes | Link |
| NV-EmbedQA-E5-v5 | nvidia/nv-embedqa-e5-v5 | 512 | NVIDIA | 303 | 335 | 1024 | no | Link |
| NV-EmbedQA-Mistral7B-v2 | nvidia/nv-embedqa-mistral-7b-v2 | 512 | NVIDIA | 6980 | 7110 | 4096 | no | Link |
| Snowflake’s Arctic-embed-l | snowflake/arctic-embed-l | 512 | Snowflake | 303 | 335 | 1024 | no | Link |

Note

The “Parameters (excl. embeddings)” column shows the count of parameters that directly impact inference performance and computational cost. Embedding layer parameters are excluded because they primarily affect model size rather than inference speed. For example, models with different vocabulary sizes may have different total parameter counts but the same inference-relevant parameter count.
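As a worked illustration of the note above, the embedding-layer share of each model can be recovered by subtracting the two parameter columns. The following is a minimal sketch; `embedding_params` is a hypothetical helper name, and the numbers are copied from the Models table.

```shell
# Embedding-layer parameters (millions) = total parameters - parameters excluding embeddings.
# embedding_params is a hypothetical helper; inputs come from the Models table.
embedding_params() {
  total="$1"
  excl="$2"
  echo $(( total - excl ))
}

# These four models share the same inference-relevant count (303M)
# but differ in total size because their embedding layers differ.
echo "baai/bge-large-zh-v1.5:   $(embedding_params 325 303)M embedding parameters"
echo "baai/bge-m3:              $(embedding_params 568 303)M embedding parameters"
echo "nvidia/nv-embedqa-e5-v5:  $(embedding_params 335 303)M embedding parameters"
echo "snowflake/arctic-embed-l: $(embedding_params 335 303)M embedding parameters"
```

Under this reading, bge-m3 carries far more embedding parameters than bge-large-zh-v1.5, while both list the same 303 million inference-relevant parameters.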

Embedding Type Support#

The following table contains the embedding types that each model supports. For details, refer to Specify Embedding Type.

| Model ID | Supported Embedding Types |
|---|---|
| nvidia/llama-3.2-nemoretriever-300m-embed-v2 | float, int8, uint8, binary, ubinary |
| nvidia/llama-3.2-nemoretriever-300m-embed-v1 | float, int8, uint8, binary, ubinary |
| nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1 | float, int8, uint8, binary, ubinary |
| nvidia/llama-3.2-nv-embedqa-1b-v2 | float, int8, uint8, binary, ubinary |
| baai/bge-large-zh-v1.5 | float |
| baai/bge-m3 | float |
| nvidia/nv-embedqa-e5-v5 | float |
| nvidia/nv-embedqa-mistral-7b-v2 | float |
| snowflake/arctic-embed-l | float |
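A client-side check against this table can catch an unsupported embedding type before a request is sent. The sketch below is only an illustration: `supports_embedding_type` is a hypothetical helper, not part of the NIM, and it simply encodes the table above.

```shell
# Hypothetical helper encoding the Embedding Type Support table.
# Returns 0 if the model supports the type, 1 if it does not, 2 for an unknown model ID.
supports_embedding_type() {
  model_id="$1"
  embedding_type="$2"
  case "$model_id" in
    nvidia/llama-3.2-nemoretriever-300m-embed-v2|\
    nvidia/llama-3.2-nemoretriever-300m-embed-v1|\
    nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1|\
    nvidia/llama-3.2-nv-embedqa-1b-v2)
      supported="float int8 uint8 binary ubinary" ;;
    baai/bge-large-zh-v1.5|baai/bge-m3|nvidia/nv-embedqa-e5-v5|\
    nvidia/nv-embedqa-mistral-7b-v2|snowflake/arctic-embed-l)
      supported="float" ;;
    *)
      return 2 ;;
  esac
  # Match the requested type as a whole word in the supported list.
  case " $supported " in
    *" $embedding_type "*) return 0 ;;
    *) return 1 ;;
  esac
}

if supports_embedding_type "baai/bge-m3" "int8"; then
  echo "baai/bge-m3 supports int8"
else
  echo "baai/bge-m3 does not support int8"
fi
```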

Optimized vs Non Optimized Models#

The following models are optimized using TensorRT (TRT) and are available as pre-built, optimized engines on NGC. These optimized models are GPU-specific and require the minimum GPU memory specified in the Optimized configuration section of each model.

NVIDIA also provides generic model profiles that operate with any NVIDIA GPU (or set of GPUs) with sufficient memory capacity. These generic profiles are known as non-optimized configurations. On systems with no compatible optimized profile, a generic profile is chosen automatically. Optimized profiles are preferred over generic profiles when available, but you can deploy a generic profile on any system by following the steps in the Overriding Profile Selection section.

Compute Capability and Automatic Profile Selection#

NeMo Retriever Text Embedding NIM supports TensorRT engines that are compiled with the option kSAME_COMPUTE_CAPABILITY. This option builds engines that are compatible with GPUs having the same compute capability as the one on which the engine was built. For more information, refer to Same Compute Capability Compatibility Level.

To see the mapping of CUDA GPU compute capability versions to supported GPU SKUs, refer to CUDA GPU Compute Capability. If you run a NIM on a GPU that has the same compute capability as one of the engines, then that engine should appear as compatible when you run list-model-profiles.

Automatic profile selection uses the following order to choose a profile:

  1. A GPU-specific engine (for example, gpu:NVIDIA B200)

  2. A compute capability engine (for example, compute_capability:10.0)

  3. ONNX (for example, model_type:onnx)
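The selection order above can be sketched as a simple priority search. This is an illustration only, not the NIM's actual implementation: `select_profile` is a hypothetical function, and the tag strings follow the examples in the list.

```shell
# select_profile GPU_NAME COMPUTE_CAP TAG...
# Hypothetical sketch: prints the first available tag in priority order
# (GPU-specific engine, then compute capability engine, then ONNX).
select_profile() {
  gpu_name="$1"
  compute_cap="$2"
  shift 2
  for wanted in "gpu:$gpu_name" "compute_capability:$compute_cap" "model_type:onnx"; do
    for tag in "$@"; do
      if [ "$tag" = "$wanted" ]; then
        echo "$tag"
        return 0
      fi
    done
  done
  return 1  # no compatible profile among the given tags
}

# No GPU-specific engine is offered here, so the compute capability engine is chosen.
select_profile "NVIDIA B200" "10.0" "model_type:onnx" "compute_capability:10.0"
```

Because the outer loop walks the priorities and the inner loop walks the available tags, the order in which tags are listed does not affect the result.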

Supported Hardware#

Note

Currently, GPU clusters with GPUs in Multi-Instance GPU (MIG) mode are not supported.

Llama 3.2 NeMo Retriever Embedding 300m v2 (llama-3.2-nemoretriever-300m-embed-v2)#

Optimized configuration#

| Compute Capability | Precision |
|---|---|
| 12.0 | FP16 & FP8 |
| 10.0 | FP16 & FP8 |
| 9.0 | FP16 & FP8 |
| 8.9 | FP16 & FP8 |
| 8.6 | FP16 |
| 8.0 | FP16 |
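To see which row of this table applies to a given machine, the GPU's compute capability can be looked up and mapped to the listed precisions. This is a minimal sketch: `precisions_for_cap` is a hypothetical helper that encodes the table above, and the `nvidia-smi` query assumes a driver recent enough to report the `compute_cap` field.

```shell
# Hypothetical helper: maps a compute capability to the precisions
# listed in the optimized configuration table for this model.
precisions_for_cap() {
  case "$1" in
    12.0|10.0|9.0|8.9) echo "FP16 FP8" ;;
    8.6|8.0)           echo "FP16" ;;
    *)                 return 1 ;;  # no optimized engine listed
  esac
}

# Query each local GPU, if nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=compute_cap --format=csv,noheader |
  while read -r cap; do
    precisions=$(precisions_for_cap "$cap") || precisions="none listed"
    echo "compute capability $cap: $precisions"
  done
fi
```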

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space | Max Tokens |
|---|---|---|---|---|
| Any single NVIDIA GPU that has sufficient memory, or multiple homogeneous NVIDIA GPUs that have sufficient memory in total. | Min: 2.4 GiB, Max: 25.2 GiB | FP16 | 7.49 GiB | 4096 |

Warning

The maximum token length of the non-optimized configuration (4096) is smaller than that of the other profiles (8192).
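The minimum GPU memory listed for the non-optimized configuration can be compared against what a GPU actually reports. The following is a sketch under stated assumptions: `meets_min_memory` is a hypothetical helper, the 2.4 GiB threshold is the Min value from the table above, and `nvidia-smi` is assumed to report `memory.total` in MiB.

```shell
# meets_min_memory TOTAL_MIB MIN_GIB
# Hypothetical helper: succeeds if TOTAL_MIB (in MiB) is at least MIN_GIB (in GiB).
meets_min_memory() {
  awk -v t="$1" -v m="$2" 'BEGIN { exit !(t / 1024 >= m) }'
}

# Check each local GPU against the 2.4 GiB minimum, if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits |
  while IFS=, read -r name total_mib; do
    total_mib="${total_mib# }"  # drop the space that follows the comma
    if meets_min_memory "$total_mib" 2.4; then
      echo "$name: ${total_mib} MiB meets the 2.4 GiB minimum"
    else
      echo "$name: ${total_mib} MiB is below the 2.4 GiB minimum"
    fi
  done
fi
```

Total memory is only an upper bound on what is free, so a GPU that passes this check can still run out of memory if other processes are using it.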

Llama 3.2 NeMo Retriever Embedding 300m v1 (llama-3.2-nemoretriever-300m-embed-v1)#

Optimized configuration#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 40 & 80 | FP16 |
| H100 HBM3 | 80 | FP16 & FP8 |
| H100 NVL | 80 | FP16 & FP8 |
| L40S | 48 | FP16 & FP8 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 |
| B200 | 180 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space | Max Tokens |
|---|---|---|---|---|
| Any single NVIDIA GPU that has sufficient memory, or multiple homogeneous NVIDIA GPUs that have sufficient memory in total. | Min: 2.4 GiB, Max: 25.2 GiB | FP16 | 7.49 GiB | 4096 |

Warning

The maximum token length of the non-optimized configuration (4096) is smaller than that of the other profiles (8192).

NeMo Retriever Llama Vision Embed (llama-3.2-nemoretriever-1b-vlm-embed-v1)#

Optimized configuration#

Currently, there is no support for optimized configurations.

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | Min: 4.4 GiB, Max: 21 GiB | FP16 | 3.2 GiB |

bge-large-zh-v1.5#

Optimized configuration#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| H20 | 96 | FP16 |
| L20 | 48 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values in the following table are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 10 | FP16 | 8.1 |

bge-m3#

Optimized configuration#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 80 | FP16 |
| H100 HBM3 | 80 | FP16 |
| L40S | 48 | FP16 |
| A10G | 24 | FP16 |
| L20 | 48 | FP16 |
| H20 | 96 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 33 | FP16 | 8.8 |

Llama-3.2-NV-EmbedQA-1B-v2#

Optimized configuration#

| Compute Capability | Precision |
|---|---|
| 12.0 | FP16 & FP8 |
| 10.0 | FP16 & FP8 |
| 9.0 | FP16 & FP8 |
| 8.9 | FP16 & FP8 |
| 8.6 | FP16 |
| 8.0 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) | Max Tokens |
|---|---|---|---|---|
| Any single NVIDIA GPU that has sufficient memory, or multiple homogeneous NVIDIA GPUs that have sufficient memory in total. | 3.6 | FP16 | 9 | 4096 |

If you run this model on an RTX 40xx GPU or later, you need a minimum of 8 GB of VRAM.

Warning

The maximum token length of the non-optimized configuration (4096) is smaller than that of the other profiles (8192).

NV-EmbedQA-E5-v5#

Optimized configuration#

| Compute Capability | Precision |
|---|---|
| 12.0 | FP16 |
| 10.0 | FP16 |
| 9.0 | FP16 |
| 8.9 | FP16 |
| 8.6 | FP16 |
| 8.0 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 2 | FP16 | 8.5 |

NV-EmbedQA-Mistral7B-v2#

Optimized configuration#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 80 | FP16 |
| H100 HBM3 | 80 | FP8 |
| H100 HBM3 | 80 | FP16 |
| L40S | 48 | FP8 |
| L40S | 48 | FP16 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 16 | FP16 | 30 |

Snowflake’s Arctic-embed-l#

Optimized configuration#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 80 | FP16 |
| H100 HBM3 | 80 | FP16 |
| L40S | 48 | FP16 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 2 | FP16 | 17 |

Memory Footprint#

The following table provides the set of valid configurations and the associated approximate memory footprints for the model. These values were measured using version 1.5.0 and are expected to remain similar in future releases.

| Approximate GPU Memory Size (GB) |
|---|
| 2.04 |
| 3.53 |
| 2.04 |
| 3.19 |
| 2.04 |
| 3.17 |
| 2.04 |
| 2.67 |
| 2.86 |
| 2.86 |
| 2.98 |
| 6.53 |
| 2.98 |
| 6.09 |
| 2.98 |
| 4.91 |
| 2.98 |
| 4.91 |
| 5.22 |
| 5.09 |
| 0.87 |
| 0.87 |
| 0.87 |
| 0.87 |
| 0.88 |
| 0.87 |

Software#

NVIDIA Driver#

Release 1.7.0+ uses NVIDIA Optimized Frameworks 25.01. For NVIDIA driver support, refer to the Frameworks Support Matrix.

If issues arise when you start the NIM containers, run the following commands to install the latest NVIDIA Container Toolkit and configure Docker to use the NVIDIA runtime.

# Add the NVIDIA Container Toolkit signing key and apt repository.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
 && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
   sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
   sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Refresh the package index and install the toolkit.
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker, then restart Docker to apply the change.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

NVIDIA Container Toolkit#

Your Docker environment must support NVIDIA GPUs. For more information, refer to NVIDIA Container Toolkit.
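A quick way to confirm that the required tools are present is to check for them on the PATH before starting a container. This is a minimal sketch; `check_tool` is a hypothetical helper, and the commented smoke test assumes that pulling the public `ubuntu` image is acceptable in your environment.

```shell
# Hypothetical helper: reports whether a command is available on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
    return 1
  fi
}

# Docker and the NVIDIA Container Toolkit CLI are both expected on the host.
check_tool docker || true
check_tool nvidia-ctk || true

# If both are found, a GPU smoke test can be run manually:
#   sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```

If the smoke test prints the usual `nvidia-smi` GPU table from inside the container, Docker can reach the GPUs.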