Support Matrix for NeMo Retriever Text Embedding NIM#

This documentation describes the software and hardware that NeMo Retriever Text Embedding NIM supports.

CPU#

Text Embedding NIM requires the following:

x86 processor with at least 8 cores. For a list of supported systems, refer to NVIDIA Certified Systems Catalog.
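As a quick pre-flight check, the requirement above can be tested on a host with a few lines of Python. This is a minimal sketch, not part of the NIM tooling: the function name `meets_cpu_requirement` and the set of accepted architecture strings are assumptions made for illustration. Note that `os.cpu_count()` reports logical CPUs, which can exceed the number of physical cores.

```python
import os
import platform

MIN_CORES = 8  # taken from the requirement stated above
# Assumption: common identifiers reported for 64-bit x86 machines.
X86_MACHINES = {"x86_64", "amd64"}


def meets_cpu_requirement(machine: str, logical_cores: int) -> bool:
    """Return True if the architecture is x86 and enough cores are present."""
    return machine.lower() in X86_MACHINES and logical_cores >= MIN_CORES


if __name__ == "__main__":
    machine = platform.machine()
    cores = os.cpu_count() or 0  # cpu_count() may return None
    status = "OK" if meets_cpu_requirement(machine, cores) else "NOT MET"
    print(f"{machine}, {cores} logical CPUs: {status}")
```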

Models#

NeMo Retriever Text Embedding NIM supports the following models.

| Model Name | Model ID | Max Tokens | Publisher | Parameters (millions, excl. embeddings) | Total Parameters (millions) | Embedding Dimension | Dynamic Embeddings Supported | Model Card |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 3.2 NeMo Retriever Embedding 300m | nvidia/llama-3.2-nemoretriever-300m-embed-v1 | 8192 | NVIDIA | 307 | 569 | 2048 | yes | - |
| NeMo Retriever Llama Vision Embed | nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1 | 4096 | NVIDIA | 1414 | 1678 | 2048 | yes | - |
| bge-large-zh-v1.5 | baai/bge-large-zh-v1.5 | 512 | BAAI | 303 | 325 | 1024 | no | Link |
| bge-m3 | baai/bge-m3 | 8192 | BAAI | 303 | 568 | 1024 | no | Link |
| Llama-3.2-NV-EmbedQA-1B-v2 | nvidia/llama-3.2-nv-embedqa-1b-v2 | 8192 | NVIDIA | 973 | 1236 | 2048 | yes | Link |
| NV-EmbedQA-E5-v5 | nvidia/nv-embedqa-e5-v5 | 512 | NVIDIA | 303 | 335 | 1024 | no | Link |
| NV-EmbedQA-Mistral7B-v2 | nvidia/nv-embedqa-mistral-7b-v2 | 512 | NVIDIA | 6980 | 7110 | 4096 | no | Link |
| Snowflake’s Arctic-embed-l | snowflake/arctic-embed-l | 512 | Snowflake | 303 | 335 | 1024 | no | Link |

Note

The “Parameters (excl. embeddings)” column shows the count of parameters that directly impact inference performance and computational cost. Embedding layer parameters are excluded because they primarily affect model size rather than inference speed. For example, models with different vocabulary sizes may have different total parameter counts but the same inference-relevant parameter count.
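To make the note concrete, the embedding-layer parameter count is the difference between the two parameter columns. The sketch below copies the values from the Models table; the dictionary and function names are illustrative only. It shows, for instance, that bge-large-zh-v1.5, bge-m3, NV-EmbedQA-E5-v5, and Arctic-embed-l all share 303 million non-embedding parameters while their totals range from 325 to 568 million.

```python
# (non-embedding parameters, total parameters), both in millions,
# copied from the Models table above.
PARAMS_MILLIONS = {
    "nvidia/llama-3.2-nemoretriever-300m-embed-v1": (307, 569),
    "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1": (1414, 1678),
    "baai/bge-large-zh-v1.5": (303, 325),
    "baai/bge-m3": (303, 568),
    "nvidia/llama-3.2-nv-embedqa-1b-v2": (973, 1236),
    "nvidia/nv-embedqa-e5-v5": (303, 335),
    "nvidia/nv-embedqa-mistral-7b-v2": (6980, 7110),
    "snowflake/arctic-embed-l": (303, 335),
}


def embedding_parameters(model_id: str) -> int:
    """Millions of parameters that sit in the embedding layer."""
    non_embedding, total = PARAMS_MILLIONS[model_id]
    return total - non_embedding


def same_inference_size(model_a: str, model_b: str) -> bool:
    """True if two models have the same non-embedding parameter count."""
    return PARAMS_MILLIONS[model_a][0] == PARAMS_MILLIONS[model_b][0]


if __name__ == "__main__":
    for model_id in PARAMS_MILLIONS:
        print(f"{model_id}: {embedding_parameters(model_id)}M embedding parameters")
```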

Embedding Type Support#

The following table contains the embedding types that each model supports. For details, refer to Specify Embedding Type.

Model ID	Supported Embedding Types
nvidia/llama-3.2-nemoretriever-300m-embed-v1	`float`, `int8`, `uint8`, `binary`, `ubinary`
nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1	`float`, `int8`, `uint8`, `binary`, `ubinary`
nvidia/llama-3.2-nv-embedqa-1b-v2	`float`, `int8`, `uint8`, `binary`, `ubinary`
baai/bge-large-zh-v1.5	`float`
baai/bge-m3	`float`
nvidia/nv-embedqa-e5-v5	`float`
nvidia/nv-embedqa-mistral-7b-v2	`float`
snowflake/arctic-embed-l	`float`
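A client can check a requested embedding type against this table before sending a request. The following is a minimal client-side sketch, with the table transcribed into a dictionary; the helper `supports_embedding_type` is hypothetical and is not part of the NIM API.

```python
ALL_TYPES = frozenset({"float", "int8", "uint8", "binary", "ubinary"})
FLOAT_ONLY = frozenset({"float"})

# Transcribed from the Embedding Type Support table above.
SUPPORTED_EMBEDDING_TYPES = {
    "nvidia/llama-3.2-nemoretriever-300m-embed-v1": ALL_TYPES,
    "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1": ALL_TYPES,
    "nvidia/llama-3.2-nv-embedqa-1b-v2": ALL_TYPES,
    "baai/bge-large-zh-v1.5": FLOAT_ONLY,
    "baai/bge-m3": FLOAT_ONLY,
    "nvidia/nv-embedqa-e5-v5": FLOAT_ONLY,
    "nvidia/nv-embedqa-mistral-7b-v2": FLOAT_ONLY,
    "snowflake/arctic-embed-l": FLOAT_ONLY,
}


def supports_embedding_type(model_id: str, embedding_type: str) -> bool:
    """Hypothetical helper: True if the table lists the type for the model."""
    return embedding_type in SUPPORTED_EMBEDDING_TYPES.get(model_id, frozenset())
```

Unknown model IDs return `False` rather than raising, so the helper can be used as a simple guard.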

Optimized vs Non Optimized Models#

Models that list an optimized configuration under Supported Hardware are optimized with NVIDIA TensorRT (TRT) and are available as prebuilt, optimized engines on NGC. These optimized engines are GPU-specific and require the minimum GPU memory listed in the Optimized configuration section of each model.

NVIDIA also provides generic model profiles that run on any NVIDIA GPU (or set of GPUs) with sufficient memory capacity. These generic profiles are known as non-optimized configurations. When an optimized profile is compatible with the system, it is preferred over the generic profile; when no optimized profile is compatible, a generic profile is chosen automatically. You can also deploy a generic profile on any system by following the steps in the Overriding Profile Selection section.
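The preference order described above can be sketched as follows. This is an illustration of the stated rule only, not the NIM's actual selection code, which may consider additional factors such as precision. The function `select_profile`, its parameters, and the returned labels are hypothetical; the GPU values are copied from the bge-large-zh-v1.5 optimized configuration table in this document.

```python
from typing import Mapping

# GPU name -> GPU memory (GB), from the bge-large-zh-v1.5
# optimized configuration table.
BGE_LARGE_ZH_OPTIMIZED = {"H20": 96, "L20": 48}


def select_profile(
    gpu_name: str,
    gpu_memory_gb: float,
    optimized: Mapping[str, float],
    force_generic: bool = False,
) -> str:
    """Return "optimized" or "generic" following the rule in the text."""
    if force_generic:
        # Corresponds to manually overriding profile selection.
        return "generic"
    required_gb = optimized.get(gpu_name)
    if required_gb is not None and gpu_memory_gb >= required_gb:
        return "optimized"
    # No compatible optimized profile: fall back to the generic profile.
    return "generic"


if __name__ == "__main__":
    print(select_profile("H20", 96, BGE_LARGE_ZH_OPTIMIZED))
    print(select_profile("A10G", 24, BGE_LARGE_ZH_OPTIMIZED))
```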

Supported Hardware#

Note

Currently, GPU clusters with GPUs in Multi-Instance GPU (MIG) mode are not supported.

Llama 3.2 NeMo Retriever Embedding 300m v1 (llama-3.2-nemoretriever-300m-embed-v1)#

Optimized configuration#

GPU	GPU Memory (GB)	Precision
A100 SXM4	40 & 80	FP16
H100 HBM3	80	FP16 & FP8
H100 NVL	80	FP16 & FP8
L40s	48	FP16 & FP8
A10G	24	FP16
L4	24	FP16
B200	180	FP16

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space	Max Tokens
Any single NVIDIA GPU that has sufficient memory, or multiple homogeneous NVIDIA GPUs that have sufficient memory in total.	TBD	FP16	TBD	8192

Warning

The maximum token length of the non-optimized configuration is smaller (4096) than the other profiles (8192).

NeMo Retriever Llama Vision Embed (llama-3.2-nemoretriever-1b-vlm-embed-v1)#

Optimized configuration#

Currently, there is no support for optimized configurations.

Non-optimized configuration#

The GPU Memory and Disk Space values are in GiB, as marked in the table; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory	Min: 4.4 GiB, Max: 21 GiB	FP16	3.2 GiB

bge-large-zh-v1.5#

Optimized configuration#

GPU	GPU Memory (GB)	Precision
H20	96	FP16
L20	48	FP16

Non-optimized configuration#

The GPU Memory and Disk Space values in the following table are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory	10	FP16	8.1

bge-m3#

Optimized configuration#

GPU	GPU Memory (GB)	Precision
A100 SXM4	80	FP16
H100 HBM3	80	FP16
L40s	48	FP16
A10G	24	FP16
L20	48	FP16
H20	96	FP16

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory	33	FP16	8.8

Llama-3.2-NV-EmbedQA-1B-v2#

Optimized configuration#

GPU	GPU Memory (GB)	Precision
A100 SXM4	40 & 80	FP16
H100 HBM3	80	FP16 & FP8
H100 NVL	80	FP16 & FP8
L40s	48	FP16 & FP8
A10G	24	FP16
L4	24	FP16 & FP8
B200	180	FP16

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space	Max Tokens
Any single NVIDIA GPU that has sufficient memory, or multiple homogeneous NVIDIA GPUs that have sufficient memory in total.	3.6	FP16	9	4096

If you run this model on an RTX 40xx or later GPU, you need a minimum of 8 GB of VRAM.

Warning

The maximum token length of the non-optimized configuration is smaller (4096) than the other profiles (8192).

NV-EmbedQA-E5-v5#

Optimized configuration#

GPU	GPU Memory (GB)	Precision
A100 SXM4	40 & 80	FP16
H100 HBM3	80	FP8* & FP16
H100 NVL	80	FP8* & FP16
L40s	48	FP8* & FP16
A10G	24	FP16
L4	24	FP16
B200	180	FP16
H200 NVL*	141	FP8 & FP16

Note: SKUs marked with an asterisk (*) are available only in the 1.8.x Production Branch release.

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory	2	FP16	8.5

NV-EmbedQA-Mistral7B-v2#

Optimized configuration#

GPU	GPU Memory (GB)	Precision
A100 SXM4	80	FP16
H100 HBM3	80	FP8
H100 HBM3	80	FP16
L40s	48	FP8
L40s	48	FP16
A10G	24	FP16
L4	24	FP16

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory	16	FP16	30

Snowflake’s Arctic-embed-l#

Optimized configuration#

GPU	GPU Memory (GB)	Precision
A100 SXM4	80	FP16
H100 HBM3	80	FP16
L40s	48	FP16
A10G	24	FP16
L4	24	FP16

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory	2	FP16	17

Memory Footprint#

You can control the NIM’s memory footprint by controlling the maximum allowed batch size and sequence length. For more information, refer to Memory Footprint.

The following tables list the valid configurations and their approximate memory footprints for each model, GPU, and precision. These values were measured using version 1.5.0 and are expected to remain similar in future releases.

nvidia/llama-3.2-nemoretriever-300m-embed-v1

a100-sxm4-40gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.51
8	8192	4.63
16	8192	8.19
30	1024	2.26
30	2048	4.17
30	4096	6.81
30	8192	14.42

a100-sxm4-80gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.51
8	8192	4.63
16	8192	8.19
30	1024	2.26
30	2048	4.17
30	4096	6.81
30	8192	14.42

a10g

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.51
8	8192	4.63
16	1024	1.7
16	2048	2.72
16	4096	4.13
16	8192	8.19

b200

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	2.11
8	8192	5.22
16	8192	8.79
30	1024	2.86
30	2048	4.76
30	4096	7.4
30	8192	15.02

h100-hbm3-80gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.89
8	8192	5.0
16	8192	8.57
30	1024	2.64
30	2048	4.55
30	4096	7.18
30	8192	14.8

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.15
8	8192	3.69
16	8192	6.6
30	1024	1.68
30	2048	2.81
30	4096	5.3
30	8192	11.68

h100-nvl

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.89
8	8192	5.0
16	8192	8.57
30	1024	2.64
30	2048	4.55
30	4096	7.18
30	8192	14.8

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.15
8	8192	3.69
16	8192	6.6
30	1024	1.68
30	2048	2.81
30	4096	5.3
30	8192	11.68

l4

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.51
8	8192	4.63
16	1024	1.7
16	2048	2.72
16	4096	4.13
16	8192	8.19

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.27
8	8192	5.19
16	1024	1.44
16	2048	2.61
16	4096	4.69
16	8192	9.6

l40s

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.51
8	8192	4.63
16	8192	8.19
30	1024	2.26
30	2048	4.17
30	4096	6.81
30	8192	14.42

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.15
8	8192	3.69
16	8192	6.6
30	1024	1.68
30	2048	2.8
30	4096	5.29
30	8192	11.68

nvidia/llama-3.2-nv-embedqa-1b-v2

a100-sxm4-40gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	3.14
8	8192	7.94
16	8192	13.56
30	1024	4.48
30	2048	7.12
30	4096	11.92
30	8192	23.41

a100-sxm4-80gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	3.14
8	8192	7.94
16	8192	13.56
30	1024	4.48
30	2048	7.12
30	4096	11.92
30	8192	23.41

a10g

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	3.02
8	8192	7.94
16	1024	3.47
16	2048	4.88
16	4096	7.44
16	8192	13.56

h100-hbm3-80gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	3.02
8	8192	7.94
16	8192	13.56
30	1024	4.48
30	2048	7.12
30	4096	11.92
30	8192	23.41

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.84
8	8192	4.94
16	8192	8.47
30	1024	2.6
30	2048	4.02
30	4096	7.09
30	8192	14.65

h100-nvl

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	3.02
8	8192	7.94
16	8192	13.56
30	1024	4.48
30	2048	7.12
30	4096	11.92
30	8192	23.41

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.84
8	8192	4.94
16	8192	8.47
30	1024	2.6
30	2048	4.02
30	4096	7.09
30	8192	14.65

l4

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	3.02
8	8192	7.94
16	1024	3.47
16	2048	4.88
16	4096	7.44
16	8192	13.56

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	2.08
8	8192	6.94
16	1024	2.54
16	2048	3.8
16	4096	6.44
16	8192	12.47

l40s

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	3.02
8	8192	7.94
16	8192	13.56
30	1024	4.48
30	2048	7.12
30	4096	11.92
30	8192	23.41

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.96
8	8192	5.94
16	8192	10.47
30	1024	3.06
30	2048	4.95
30	4096	8.97
30	8192	18.4

nvidia/nv-embedqa-e5-v5

a100-sxm4-40gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	512	0.63
64	512	1.03
128	512	1.44
192	512	1.84

a100-sxm4-80gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	512	0.63
64	512	1.03
128	512	1.44
384	512	3.06

a10g

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	512	0.79
16	512	0.88
32	512	0.98
64	512	1.19
80	512	1.29

h100-hbm3-80gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	512	0.82
64	512	1.22
128	512	1.63
384	512	3.25

h100-nvl

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	512	0.82
64	512	1.22
128	512	1.63
384	512	3.25

l4

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	512	0.69
16	512	0.79
32	512	0.89
64	512	1.09
80	512	1.2

l40s

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	512	0.82
64	512	1.28
128	512	1.75
256	512	2.69
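The tables above can be used to pick the largest batch size that fits a given GPU memory budget. The sketch below does this for one table, nvidia/nv-embedqa-e5-v5 on a10g at fp16, with the rows copied verbatim. The function `largest_batch_within` is a hypothetical helper, and because the listed footprints are approximate, it is prudent to leave some headroom in the budget.

```python
from typing import Optional, Sequence, Tuple

# (max batch size, max sequence length, approximate GPU memory in GB),
# copied from the nvidia/nv-embedqa-e5-v5 a10g fp16 table above.
E5_V5_A10G_FP16 = [
    (1, 512, 0.79),
    (16, 512, 0.88),
    (32, 512, 0.98),
    (64, 512, 1.19),
    (80, 512, 1.29),
]


def largest_batch_within(
    budget_gb: float,
    rows: Sequence[Tuple[int, int, float]],
) -> Optional[int]:
    """Largest listed batch size whose footprint fits in budget_gb, or None."""
    fitting = [batch for batch, _seq_len, mem_gb in rows if mem_gb <= budget_gb]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    for budget in (0.5, 1.0, 2.0):
        print(budget, largest_batch_within(budget, E5_V5_A10G_FP16))
```

Only configurations that appear in the tables are considered; the helper does not interpolate between rows.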

Software#

NVIDIA Driver#

Release 1.7.0+ uses NVIDIA Optimized Frameworks 25.01. For NVIDIA driver support, refer to the Frameworks Support Matrix.

If issues arise when you start the NIM containers, run the following commands to install the latest NVIDIA Container Toolkit and configure Docker to use the NVIDIA runtime. The commands assume an apt-based Linux distribution.

# Add the NVIDIA Container Toolkit signing key and apt repository.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
 && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
   sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
   sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit from the newly added repository.
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker, then restart Docker to apply it.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

NVIDIA Container Toolkit#

Your Docker environment must support NVIDIA GPUs. For more information, refer to NVIDIA Container Toolkit.