# Support Matrix for NeMo Retriever Text Embedding NIM
This documentation describes the software and hardware that NeMo Retriever Text Embedding NIM supports.
## CPU

Text Embedding NIM requires the following:

- x86 processor with at least 8 cores. For a list of supported systems, refer to the NVIDIA Certified Systems Catalog.
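A quick way to verify the architecture and core-count requirement on Linux, using only standard system utilities (no NIM-specific assumptions):

```bash
# Confirm an x86_64 architecture and at least 8 CPU cores
lscpu | grep -E '^(Architecture|CPU\(s\))'
nproc   # number of processing units available
```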
## Models
NeMo Retriever Text Embedding NIM supports the following models:
| Model Name | Model ID | Max Tokens | Publisher | Parameters (M) | Embedding Dimension | Dynamic Embeddings |
|---|---|---|---|---|---|---|
| bge-large-zh-v1.5 | baai/bge-large-zh-v1.5 | 512 | BAAI | 325 | 1024 | no |
| Llama-3.2-NV-EmbedQA-1B-v2 | nvidia/llama-3.2-nv-embedqa-1b-v2 | 8192 | NVIDIA | 1236 | 2048 | yes |
| NV-EmbedQA-E5-v5 | nvidia/nv-embedqa-e5-v5 | 512 | NVIDIA | 335 | 1024 | no |
| NV-EmbedQA-Mistral7B-v2 | nvidia/nv-embedqa-mistral-7b-v2 | 512 | NVIDIA | 7110 | 4096 | no |
| Snowflake’s Arctic-embed-l | snowflake/arctic-embed-l | 512 | Snowflake | 335 | 1024 | no |
| bge-m3 | baai/bge-m3 | 8192 | BAAI | 560 | 1024 | no |

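For reference, the value in the Model ID column is what you pass in the `model` field of an embeddings request. The following is a minimal sketch, assuming a NIM serving `nvidia/llama-3.2-nv-embedqa-1b-v2` is already running locally on the default port 8000; adjust the host, port, and model ID to your deployment:

```bash
# Request an embedding from a locally running Text Embedding NIM
# (host, port, and model ID are assumptions; adjust to your deployment)
curl -X POST "http://localhost:8000/v1/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
        "input": ["What GPUs does this NIM support?"],
        "input_type": "query"
      }'
```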
## Supported Hardware

### bge-large-zh-v1.5

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| H20 | 96 | FP16 |
| L20 | 48 | FP16 |

#### Non-optimized configuration

The GPU Memory and Disk Space values in the following table are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 10 | FP16 | 8.1 |

### bge-m3

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 80 | FP16 |
| H100 HBM3 | 80 | FP16 |
| L40s | 48 | FP16 |
| A10G | 24 | FP16 |
| L20 | 48 | FP16 |
| H20 | 96 | FP16 |

#### Non-optimized configuration

The GPU Memory and Disk Space values in the following table are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 33 | FP16 | 8.8 |

### Llama-3.2-NV-EmbedQA-1B-v2

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 40 & 80 | FP16 |
| H100 HBM3 | 80 | FP16 & FP8 |
| H100 NVL | 80 | FP16 & FP8 |
| L40s | 48 | FP16 & FP8 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 & FP8 |
| GeForce RTX 4090 (Beta) | 24 | FP16 |
| NVIDIA RTX 6000 Ada Generation (Beta) | 48 | FP16 |
| GeForce RTX 5080 (Beta) | 16 | FP16 |
| GeForce RTX 5090 (Beta) | 32 | FP16 |

#### Non-optimized configuration

The GPU Memory and Disk Space values in the following table are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space | Max Tokens |
|---|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 3.6 | FP16 | 9 | 4096 |

**Warning:** The maximum token length of the non-optimized configuration (4096) is smaller than that of the other profiles (8192).
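Related to the warning above: if an input can exceed the active profile's maximum token length, the embeddings request can ask the NIM to truncate it server-side via the `truncate` field (for example `END`). The following is a minimal sketch, assuming a local deployment on port 8000; check the API reference for the truncate values your NIM version accepts:

```bash
# Embed a passage, truncating at the end if it exceeds the maximum token length
# (host, port, and model ID are assumptions; adjust to your deployment)
curl -X POST "http://localhost:8000/v1/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
        "input": ["<a passage that may be longer than the maximum token length>"],
        "input_type": "passage",
        "truncate": "END"
      }'
```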
### NV-EmbedQA-E5-v5

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 40 & 80 | FP16 |
| H100 HBM3 | 80 | FP16 |
| H100 NVL | 80 | FP16 |
| L40s | 48 | FP16 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 |

#### Non-optimized configuration

The GPU Memory and Disk Space values in the following table are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 2 | FP16 | 8.5 |

### NV-EmbedQA-Mistral7B-v2

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 80 | FP16 |
| H100 HBM3 | 80 | FP8 |
| H100 HBM3 | 80 | FP16 |
| L40s | 48 | FP8 |
| L40s | 48 | FP16 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 |

#### Non-optimized configuration

The GPU Memory and Disk Space values in the following table are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 16 | FP16 | 30 |

### Snowflake’s Arctic-embed-l

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 80 | FP16 |
| H100 HBM3 | 80 | FP16 |
| L40s | 48 | FP16 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 |

#### Non-optimized configuration

The GPU Memory and Disk Space values in the following table are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 2 | FP16 | 17 |

## Memory Footprint

You can control the NIM’s memory footprint by setting the maximum allowed batch size and sequence length. For more information, refer to Memory Footprint.

The following tables list the valid configurations and the associated approximate memory footprints; the applicable values depend on the model and deployment profile. Note that running the container adds a small, fixed memory overhead.
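As a sketch of how these limits are typically applied, the container can be started with environment variables that cap the batch size and sequence length. The variable names below are placeholders, not confirmed names, and the image path and tag are shown for illustration only; use the exact names and image documented on the Memory Footprint and deployment pages for your NIM version.

```bash
# Placeholder environment variable names: substitute the exact names documented
# on the Memory Footprint page for your NIM version.
docker run -it --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_MAX_BATCH_SIZE=8 \
  -e NIM_MAX_SEQUENCE_LENGTH=8192 \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:latest
```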
| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 8192 | 0.7 |
| 8 | 8192 | 5.62 |
| 16 | 8192 | 11.25 |
| 30 | 1024 | 2.17 |
| 30 | 2048 | 4.81 |
| 30 | 4096 | 9.61 |
| 30 | 8192 | 21.09 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 8192 | 0.7 |
| 8 | 8192 | 5.62 |
| 16 | 1024 | 1.16 |
| 16 | 2048 | 2.56 |
| 16 | 4096 | 5.12 |
| 16 | 8192 | 11.25 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 8192 | 0.43 |
| 8 | 8192 | 3.53 |
| 16 | 8192 | 7.06 |
| 30 | 1024 | 1.19 |
| 30 | 2048 | 2.61 |
| 30 | 4096 | 5.68 |
| 30 | 8192 | 13.24 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 8192 | 0.68 |
| 8 | 8192 | 5.53 |
| 16 | 1024 | 1.13 |
| 16 | 2048 | 2.39 |
| 16 | 4096 | 5.03 |
| 16 | 8192 | 11.06 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 8192 | 0.43 |
| 8 | 8192 | 3.53 |
| 16 | 8192 | 7.06 |
| 30 | 1024 | 1.19 |
| 30 | 2048 | 3.55 |
| 30 | 4096 | 7.56 |
| 30 | 8192 | 16.99 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 512 | 0.01 |
| 64 | 512 | 0.41 |
| 128 | 512 | 0.81 |
| 192 | 512 | 1.22 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 512 | 0.01 |
| 64 | 512 | 0.41 |
| 128 | 512 | 0.81 |
| 384 | 512 | 2.44 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 512 | 0.01 |
| 16 | 512 | 0.1 |
| 32 | 512 | 0.2 |
| 64 | 512 | 0.41 |
| 80 | 512 | 0.51 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 512 | 0.01 |
| 64 | 512 | 0.41 |
| 128 | 512 | 0.81 |
| 256 | 512 | 1.63 |

## Software

### NVIDIA Driver

Releases prior to 1.4.0-rtx use Triton Inference Server 24.08, and release 1.4.0-rtx uses Triton Inference Server 25.01. For NVIDIA driver support, refer to the Release Notes of the corresponding Triton release.

If issues arise when you start the NIM containers, run the following commands to make sure that the NVIDIA Container Toolkit is installed and that Docker is configured to use the NVIDIA runtime.
```bash
# Add the NVIDIA Container Toolkit apt repository and its signing key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit, register the NVIDIA runtime with Docker, and restart Docker
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
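To confirm that the installed NVIDIA driver meets the requirements of the Triton release in use, you can query it directly. The check below uses only standard `nvidia-smi` options and makes no NIM-specific assumptions.

```bash
# Print the driver version and basic information for each visible GPU
nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv
```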
### NVIDIA Container Toolkit
Your Docker environment must support NVIDIA GPUs. For more information, refer to NVIDIA Container Toolkit.
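As a quick sanity check that Docker can reach the GPUs through the NVIDIA runtime, you can run the sample workload from the NVIDIA Container Toolkit documentation. The sketch below assumes Docker and the toolkit are already installed and configured as shown above.

```bash
# Run nvidia-smi inside a plain Ubuntu container; the NVIDIA runtime injects the driver utilities
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```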