Support Matrix for NVIDIA NeMo Retriever Embedding NIM#
This documentation describes the software and hardware that NVIDIA NeMo Retriever Embedding NIM supports.
CPU#
NeMo Retriever Embedding NIM requires the following:
x86 processor with at least 8 cores. For a list of supported systems, refer to NVIDIA Certified Systems Catalog.
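As a rough pre-flight check for the CPU requirement above, the following minimal sketch tests the host architecture and the visible CPU count. The helper name `meets_cpu_requirement` is hypothetical and is not part of the NIM. Note that `os.cpu_count()` reports logical CPUs, which can exceed the number of physical cores when SMT is enabled, so treat a pass as a hint rather than a guarantee.

```python
import os
import platform
from typing import Optional

MIN_CORES = 8  # taken from the CPU requirement stated above


def meets_cpu_requirement(machine: str, cpu_count: Optional[int],
                          min_cores: int = MIN_CORES) -> bool:
    """Return True if the host looks like x86-64 with enough visible CPUs.

    Assumption: "x86" is interpreted here as the 64-bit names that
    platform.machine() commonly returns ("x86_64" on Linux, "AMD64" on Windows).
    """
    is_x86 = machine.lower() in {"x86_64", "amd64"}
    return is_x86 and cpu_count is not None and cpu_count >= min_cores


if __name__ == "__main__":
    ok = meets_cpu_requirement(platform.machine(), os.cpu_count())
    print("CPU requirement met" if ok else "CPU requirement not met")
```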
Models#
NVIDIA NeMo Retriever Embedding NIM supports the following models.
| Publisher | Model ID | Supported Embedding Types | Max Tokens | Parameters (excl. embeddings) | Total Parameters | Embedding Dimension | Dynamic Embeddings | Model Card |
|---|---|---|---|---|---|---|---|---|
| NVIDIA | nvidia/llama-nemotron-embed-vl-1b-v2 | | 2048 | 1414 | 1678 | 2048 | yes | |
| NVIDIA | nvidia/llama-nemotron-embed-1b-v2 | | 8192 | 973 | 1236 | 2048 | yes | |
| NVIDIA | nvidia/llama-nemotron-embed-300m-v2 | | 8192 | 307 | 569 | 2048 | yes | |
| NVIDIA | nvidia/nv-embedqa-e5-v5 | | 512 | 303 | 335 | 1024 | no | |
| BAAI | baai/bge-m3 | | 8192 | 303 | 568 | 1024 | no | |
| BAAI | baai/bge-large-zh-v1.5 | | 512 | 303 | 325 | 1024 | no | |
Note
The “Parameters (excl. embeddings)” column shows the count of parameters that directly impact inference performance and computational cost. Embedding layer parameters are excluded because they primarily affect model size rather than inference speed. For example, models with different vocabulary sizes may have different total parameter counts but the same inference-relevant parameter count.
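To make the distinction in the note concrete, the sketch below subtracts the non-embedding parameter count from the total for each model in the table. The numbers are copied from the Models table; reading them as millions of parameters is an assumption, because the table does not state a unit. The helper name `embedding_params` is illustrative only.

```python
# (parameters excluding embeddings, total parameters), copied from the Models table.
# Assumption: both counts are in millions; the table does not state a unit.
PARAMS = {
    "nvidia/llama-nemotron-embed-vl-1b-v2": (1414, 1678),
    "nvidia/llama-nemotron-embed-1b-v2": (973, 1236),
    "nvidia/llama-nemotron-embed-300m-v2": (307, 569),
    "nvidia/nv-embedqa-e5-v5": (303, 335),
    "baai/bge-m3": (303, 568),
    "baai/bge-large-zh-v1.5": (303, 325),
}


def embedding_params(model_id: str) -> int:
    """Return the parameters attributed to the embedding layer (total minus the rest)."""
    non_embedding, total = PARAMS[model_id]
    return total - non_embedding


for model_id in PARAMS:
    print(f"{model_id}: {embedding_params(model_id)} embedding-layer parameters")
```

The last three models illustrate the note: they share the same inference-relevant count (303) while their totals differ (335, 568, and 325).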
Optimized vs. Non-Optimized Models#
Starting in version 2.0.0, optimized configurations for nvidia/llama-nemotron-embed-vl-1b-v2 use runtime CUDA kernels and just-in-time compilation. At startup, the NIM selects a kernel feature set for the detected GPU architecture. Depending on the selected feature set, the NIM might compile kernels, load precompiled kernels optimized for that architecture, or use both.
The optimized configuration tables list the compute capability families or GPU SKUs that have optimized kernel support for the listed precision. These configurations are tuned and validated for the release.
Non-optimized configurations use a fallback kernel feature set intended for broad compatibility, such as FP16 architecture-agnostic kernels. Fallback configurations can run on GPUs with sufficient memory, but they might not support every optimized feature or deliver the same performance as the optimized configurations.
Compute Capability and Automatic Kernel Selection#
Starting in version 2.0.0, automatic profile selection is replaced by automatic kernel selection. The NIM detects the GPU compute capability at startup and selects the supported kernel feature set for that compute capability family.
The selected feature set determines which CUDA kernels, attention implementation, precompiled kernel artifacts, and default precision are used. If an optimized feature set is not available for the detected GPU, the NIM uses the compatible fallback feature set.
To request a precision explicitly, set NIM_PRECISION to fp16 or fp8. FP8 is available only on the compute capability families or GPU SKUs listed with FP8 support in the optimized configuration tables.
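Before setting `NIM_PRECISION`, it can help to check whether the requested precision is listed for the GPU's compute capability. The following is a minimal sketch of such a lookup, not the NIM's actual selection logic. The mapping is copied from the optimized configuration table for Llama Nemotron Embed 1B v2 in this document; the helper name `is_precision_supported` and the FP16-only fallback set are assumptions made for illustration.

```python
import os

# Precisions per compute capability, copied from the optimized configuration
# table for Llama Nemotron Embed 1B v2 in this document.
OPTIMIZED_PRECISIONS = {
    "12.0": {"fp16", "fp8"},
    "10.0": {"fp16", "fp8"},
    "9.0": {"fp16", "fp8"},
    "8.9": {"fp16", "fp8"},
    "8.6": {"fp16"},
    "8.0": {"fp16"},
}

# Assumption: GPUs outside the table use an FP16-only fallback feature set.
FALLBACK_PRECISIONS = {"fp16"}


def is_precision_supported(compute_capability: str, precision: str) -> bool:
    """Return True if the precision is listed for the given compute capability."""
    supported = OPTIMIZED_PRECISIONS.get(compute_capability, FALLBACK_PRECISIONS)
    return precision.lower() in supported


if __name__ == "__main__":
    requested = os.environ.get("NIM_PRECISION", "fp16")
    # "8.6" is a placeholder; substitute the compute capability of your GPU.
    print(requested, "supported on 8.6:", is_precision_supported("8.6", requested))
```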
Supported Hardware#
Note
Currently, GPU clusters with GPUs in Multi-Instance GPU (MIG) mode are not supported.
Llama Nemotron Embed 300m v2 (llama-nemotron-embed-300m-v2)#
Optimized configuration#
| Compute Capability | Precision |
|---|---|
| 12.0 | FP16 & FP8 |
| 10.0 | FP16 & FP8 |
| 9.0 | FP16 & FP8 |
| 8.9 | FP16 & FP8 |
| 8.6 | FP16 |
| 8.0 | FP16 |
Non-optimized configuration#
The GPU Memory and Disk Space values are in GiB; Disk Space is for both the container and the model.
| GPUs | GPU Memory | Precision | Disk Space | Max Tokens |
|---|---|---|---|---|
| Any single NVIDIA GPU that has sufficient memory, or multiple homogenous NVIDIA GPUs that have sufficient memory in total. | Min: 2.4 GiB, Max: 25.2 GiB | FP16 | 7.49 GiB | 4096 |
Warning
The maximum token length of the non-optimized configuration (4096) is smaller than that of the other profiles (8192).
Llama Nemotron Embed Vision Language 1B (llama-nemotron-embed-vl-1b-v2)#
Supported GPU SKUs#
| SKU | GPU | Precision |
|---|---|---|
| | NVIDIA RTX PRO 6000 Blackwell Workstation Edition | FP8 & FP16 |
| | NVIDIA RTX PRO 6000 Blackwell Server Edition | FP8 & FP16 |
| | NVIDIA B200 | FP8 & FP16 |
| | NVIDIA GB200 | FP8 & FP16 |
| | NVIDIA H200 | FP8 & FP16 |
| | NVIDIA A100 SXM4 80GB | FP16 |
| | NVIDIA H100 NVL | FP8 & FP16 |
| | NVIDIA H100 80GB HBM3 | FP8 & FP16 |
| | NVIDIA L4 | FP16 |
| | NVIDIA L40S | FP8 & FP16 |
| | NVIDIA A10G | FP16 |
Non-optimized configuration#
Fallback behavior on GPUs outside the listed SKU set has not been verified for this model.
Note
The default VLM profile uses a maximum sequence length of 2048 tokens. Image inputs are supported only as document or passage inputs.
bge-large-zh-v1.5#
Optimized configuration#
| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| H20 | 96 | FP16 |
| L20 | 48 | FP16 |
Non-optimized configuration#
The GPU Memory and Disk Space values in the following table are in GB; Disk Space is for both the container and the model.
| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory or on multiple, homogenous NVIDIA GPUs with sufficient aggregate memory | 10 | FP16 | 8.1 |
bge-m3#
Optimized configuration#
| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 80 | FP16 |
| H100 HBM3 | 80 | FP16 |
| L40S | 48 | FP16 |
| A10G | 24 | FP16 |
| L20 | 48 | FP16 |
| H20 | 96 | FP16 |
Non-optimized configuration#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.
| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory or on multiple, homogenous NVIDIA GPUs with sufficient aggregate memory | 33 | FP16 | 8.8 |
Llama Nemotron Embed 1B v2#
Optimized configuration#
| Compute Capability | Precision |
|---|---|
| 12.0 | FP16 & FP8 |
| 10.0 | FP16 & FP8 |
| 9.0 | FP16 & FP8 |
| 8.9 | FP16 & FP8 |
| 8.6 | FP16 |
| 8.0 | FP16 |
Non-optimized configuration#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.
| GPUs | GPU Memory | Precision | Disk Space | Max Tokens |
|---|---|---|---|---|
| Any single NVIDIA GPU that has sufficient memory, or multiple homogenous NVIDIA GPUs that have sufficient memory in total. | 3.6 | FP16 | 9 | 4096 |
If you run this model on an RTX 40xx or later GPU, you need a minimum of 8 GB of VRAM.
Warning
The maximum token length of the non-optimized configuration (4096) is smaller than that of the other profiles (8192).
NV-EmbedQA-E5-v5#
Optimized configuration#
| Compute Capability | Precision |
|---|---|
| 12.0 | FP16 |
| 10.0 | FP16 |
| 9.0 | FP16 |
| 8.9 | FP16 |
| 8.6 | FP16 |
| 8.0 | FP16 |
Non-optimized configuration#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.
| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory or on multiple, homogenous NVIDIA GPUs with sufficient aggregate memory | 2 | FP16 | 8.5 |
Memory Footprint#
The following table provides the set of valid configurations and the associated approximate memory footprints for the model.
| Approximate GPU Memory Size (GiB) |
|---|
| 2.04 |
| 3.53 |
| 2.04 |
| 3.19 |
| 2.04 |
| 3.17 |
| 2.04 |
| 2.67 |
| 2.86 |
| 2.86 |
| 2.98 |
| 6.53 |
| 2.98 |
| 6.09 |
| 2.98 |
| 4.91 |
| 2.98 |
| 4.91 |
| 5.22 |
| 5.09 |
| 8.26 |
| 6.13 |
| 9.27 |
| 6.07 |
| 9.17 |
| 6.79 |
| 8.5 |
| 5.8 |
| 5.61 |
| 5.79 |
| 0.87 |
| 0.87 |
| 0.87 |
| 0.87 |
| 0.88 |
| 0.87 |
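One way to compare these footprints against the installed hardware is to read each GPU's total memory from `nvidia-smi` and convert it to GiB. The sketch below assumes that `nvidia-smi` is on the `PATH` and that `memory.total` is reported in MiB, which is its documented unit when `nounits` is set. The helper names are hypothetical. Total memory is only an upper bound: memory already used by other processes is not accounted for.

```python
import subprocess
from typing import List

MIB_PER_GIB = 1024


def parse_memory_total_mib(nvidia_smi_output: str) -> List[int]:
    """Parse one integer MiB value per line; skip blank or non-numeric lines."""
    values = []
    for line in nvidia_smi_output.splitlines():
        line = line.strip()
        if line.isdigit():
            values.append(int(line))
    return values


def gpus_with_enough_memory(nvidia_smi_output: str, required_gib: float) -> List[int]:
    """Return the indices of GPUs whose total memory is at least required_gib."""
    totals = parse_memory_total_mib(nvidia_smi_output)
    return [i for i, mib in enumerate(totals) if mib / MIB_PER_GIB >= required_gib]


def query_memory_total() -> str:
    """Run nvidia-smi and return its raw output (one line per GPU)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # 9.27 GiB is the largest footprint listed in the table above.
    print(gpus_with_enough_memory(query_memory_total(), required_gib=9.27))
```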
Software#
NVIDIA Driver#
Release 1.7.0+ uses NVIDIA Optimized Frameworks 25.01. For NVIDIA driver support, refer to the Frameworks Support Matrix.
Ensure that the latest compatible NVIDIA driver is installed on your system before launching NIM containers. If you experience issues starting the containers, verify that your driver is up to date.
NVIDIA Container Toolkit#
Your Docker environment must support NVIDIA GPUs. For more information, refer to NVIDIA Container Toolkit.
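A common way to confirm that Docker can reach the GPUs is to run `nvidia-smi` inside a throwaway container. The sketch below builds and runs that command from Python. The helper names are hypothetical, and the `ubuntu` image is only an example of a small base image; the NVIDIA Container Toolkit injects `nvidia-smi` into the container when `--gpus` is used.

```python
import shutil
import subprocess
from typing import List


def gpu_smoke_test_command(image: str = "ubuntu") -> List[str]:
    """Build a docker command that runs nvidia-smi in a disposable container."""
    return ["docker", "run", "--rm", "--gpus", "all", image, "nvidia-smi"]


def run_gpu_smoke_test() -> bool:
    """Return True if the container starts and nvidia-smi exits successfully."""
    if shutil.which("docker") is None:
        return False  # Docker CLI not found on PATH
    result = subprocess.run(gpu_smoke_test_command(), capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    print("GPU access from Docker:", "ok" if run_gpu_smoke_test() else "failed")
```

If the check fails, the driver and NVIDIA Container Toolkit installation are the first things to verify.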