Support Matrix for NVIDIA NeMo Retriever Reranking NIM#
This documentation describes the software and hardware that NVIDIA NeMo Retriever Reranking NIM supports.
CPU#
NeMo Retriever Reranking NIM requires the following:
An x86 processor with at least 8 cores. For a list of supported systems, refer to the NVIDIA Certified Systems Catalog.
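As a quick pre-flight check, the CPU requirement above can be sketched in Python. This is a minimal sketch, not part of the NIM: the function name `meets_cpu_requirement` and the set of architecture strings are assumptions made for illustration. Note that `os.cpu_count()` reports logical CPUs, which can exceed the number of physical cores on systems with simultaneous multithreading.

```python
import os
import platform
from typing import Optional

# Architecture strings that platform.machine() commonly reports for 64-bit x86.
# This set is an assumption for illustration, not an official list.
X86_MACHINES = {"x86_64", "amd64"}
MIN_CORES = 8  # taken from the requirement stated above


def meets_cpu_requirement(machine: str, cpu_count: Optional[int]) -> bool:
    """Return True if the machine is x86 and exposes at least MIN_CORES CPUs."""
    if cpu_count is None:  # os.cpu_count() can return None when undetermined
        return False
    return machine.lower() in X86_MACHINES and cpu_count >= MIN_CORES


if __name__ == "__main__":
    ok = meets_cpu_requirement(platform.machine(), os.cpu_count())
    print("CPU requirement met" if ok else "CPU requirement not met")
```

Passing the machine string and CPU count as arguments keeps the check easy to test on hosts other than the deployment target.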
Models#
NeMo Retriever Reranking NIM supports the following models.
| Publisher | Model ID | Max Tokens | Model Card |
|---|---|---|---|
| NVIDIA | nvidia/llama-nemotron-rerank-vl-1b-v2 | 8192 | |
| NVIDIA | nvidia/llama-nemotron-rerank-1b-v2 | 8192 | |
| NVIDIA | nvidia/llama-nemotron-rerank-500m-v2 | 8192 | |
When truncate is set to END, any query/passage pair that exceeds the maximum token length is truncated from the right, starting with the passage.
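The truncation rule can be illustrated with a small Python sketch. It operates on plain lists of token IDs and is not the NIM's implementation: the function name `truncate_pair_end` is hypothetical, special tokens are ignored, and trimming the query once the passage is exhausted is one reading of "starting with the passage".

```python
from typing import List, Tuple


def truncate_pair_end(
    query: List[int], passage: List[int], max_tokens: int
) -> Tuple[List[int], List[int]]:
    """Trim a (query, passage) pair to max_tokens, cutting from the right.

    Tokens are removed from the end of the passage first. The query is
    trimmed only if the passage is already empty (an assumption).
    """
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    overflow = len(query) + len(passage) - max_tokens
    if overflow <= 0:
        return query, passage  # the pair already fits
    # Drop tokens from the right end of the passage first.
    cut = min(overflow, len(passage))
    passage = passage[: len(passage) - cut]
    overflow -= cut
    # If the pair is still too long, trim the query from the right.
    if overflow > 0:
        query = query[: len(query) - overflow]
    return query, passage
```

For example, a 3-token query and a 4-token passage with a limit of 5 tokens keep the full query and the first 2 passage tokens.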
Optimized vs. Non-Optimized Models#
Starting in version 2.0.0, optimized configurations for nvidia/llama-nemotron-rerank-vl-1b-v2 use runtime CUDA kernels and just-in-time compilation. The NIM uses FP16 kernels by default. FP8 kernels are available only for the GPU SKUs listed with FP8 support and must be requested explicitly with NIM_PRECISION=fp8.
The optimized configuration tables list the GPU SKUs and precisions that are tuned and validated for the release. Optimized attention kernels are also an explicit runtime configuration; enable them only on supported hardware.
Non-optimized configurations use a fallback kernel feature set intended for broad compatibility, such as FP16 architecture-agnostic kernels. Fallback configurations can run on GPUs with sufficient memory, but they might not support every optimized feature or deliver the same performance as the optimized configurations.
Compute Capability and Kernel Configuration#
Starting in version 2.0.0, automatic profile selection does not apply to nvidia/llama-nemotron-rerank-vl-1b-v2, and the NIM does not perform automatic kernel selection. The runtime uses the default FP16 kernel path unless you explicitly request another supported configuration.
Use the optimized configuration table for llama-nemotron-rerank-vl-1b-v2 to determine whether the target GPU SKU supports FP8. To opt into FP8 kernels, set NIM_PRECISION=fp8. To use optimized attention kernels, enable the corresponding runtime configuration for the deployment.
To see the mapping of CUDA GPU compute capability versions to GPU SKUs, refer to CUDA GPU Compute Capability.
Supported Hardware#
Note
Currently, GPU clusters with GPUs in Multi-Instance GPU (MIG) mode are not supported.
llama-nemotron-rerank-vl-1b-v2#
Optimized configuration#
| GPU SKU | Precision | Max Tokens |
|---|---|---|
| NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition | FP16 & FP8 | 8192 |
| NVIDIA-B200 | FP16 & FP8 | 8192 |
| NVIDIA-H100-NVL | FP16 & FP8 | 8192 |
| NVIDIA-H100-80GB-HBM3 | FP16 & FP8 | 8192 |
| NVIDIA-A100-SXM4-80GB | FP16 | 8192 |
FP8 availability for llama-nemotron-rerank-vl-1b-v2 is SKU-specific. Use the precision listed for the target SKU.
By default, the runtime uses NIM_ENGINE_COUNT=1. For the maximum compatibility profile, keep or set NIM_ENGINE_COUNT=1 explicitly. For maximum performance on GPU SKUs with at least 80 GB of VRAM, set NIM_ENGINE_COUNT=2.
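The precision and engine-count guidance for llama-nemotron-rerank-vl-1b-v2 can be combined into a small helper that assembles the environment variables for a deployment. This is a sketch under stated assumptions: the helper name `build_nim_env` and its parameters are hypothetical, and the SKU data is copied from the optimized configuration table above. Only `NIM_PRECISION=fp8` and `NIM_ENGINE_COUNT` come from this page. FP16 is left as the default by not setting `NIM_PRECISION` at all.

```python
from typing import Dict, FrozenSet

# Precisions per GPU SKU, copied from the optimized configuration table above.
OPTIMIZED_PRECISIONS: Dict[str, FrozenSet[str]] = {
    "NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition": frozenset({"fp16", "fp8"}),
    "NVIDIA-B200": frozenset({"fp16", "fp8"}),
    "NVIDIA-H100-NVL": frozenset({"fp16", "fp8"}),
    "NVIDIA-H100-80GB-HBM3": frozenset({"fp16", "fp8"}),
    "NVIDIA-A100-SXM4-80GB": frozenset({"fp16"}),
}


def build_nim_env(
    gpu_sku: str,
    vram_gb: float,
    use_fp8: bool = False,
    max_performance: bool = False,
) -> Dict[str, str]:
    """Return environment variables for the container (hypothetical helper)."""
    env = {"NIM_ENGINE_COUNT": "1"}  # maximum-compatibility default
    if use_fp8:
        # FP8 must be requested explicitly and only on SKUs listed with FP8.
        if "fp8" not in OPTIMIZED_PRECISIONS.get(gpu_sku, frozenset()):
            raise ValueError(f"FP8 is not listed for GPU SKU {gpu_sku!r}")
        env["NIM_PRECISION"] = "fp8"
    if max_performance and vram_gb >= 80:
        # Two engines are suggested only for GPUs with at least 80 GB of VRAM.
        env["NIM_ENGINE_COUNT"] = "2"
    return env
```

The returned dictionary can then be passed to whatever mechanism launches the container, for example as individual environment variable settings.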
Non-optimized configuration#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.
| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) | Max Tokens |
|---|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 7.30 | FP16 | 3.10 | 8192 |
llama-nemotron-rerank-1b-v2#
Optimized configuration#
| Compute Capability | Precision |
|---|---|
| 12.0 | FP16 & FP8 |
| 10.0 | FP16 & FP8 |
| 9.0 | FP16 & FP8 |
| 8.9 | FP16 & FP8 |
| 8.6 | FP16 |
| 8.0 | FP16 |
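Assuming the first column of the table above lists CUDA compute capability versions (the column header did not survive, so this is an inference from the values), the lookup can be sketched in Python. The names `OPTIMIZED_PRECISIONS` and `optimized_precisions` are hypothetical. Versions are parsed into integer pairs so that values such as 8.9 and 8.6 are matched exactly rather than compared as floating-point numbers.

```python
from typing import Dict, FrozenSet, Tuple

# (major, minor) compute capability -> precisions, copied from the table above.
OPTIMIZED_PRECISIONS: Dict[Tuple[int, int], FrozenSet[str]] = {
    (12, 0): frozenset({"FP16", "FP8"}),
    (10, 0): frozenset({"FP16", "FP8"}),
    (9, 0): frozenset({"FP16", "FP8"}),
    (8, 9): frozenset({"FP16", "FP8"}),
    (8, 6): frozenset({"FP16"}),
    (8, 0): frozenset({"FP16"}),
}


def parse_capability(text: str) -> Tuple[int, int]:
    """Parse a version string such as '8.9' into (8, 9)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def optimized_precisions(capability: str) -> FrozenSet[str]:
    """Return the listed precisions, or an empty set if the version is not in the table."""
    return OPTIMIZED_PRECISIONS.get(parse_capability(capability), frozenset())
```

An empty result means the compute capability has no optimized entry, in which case the non-optimized configuration is the one to consult.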
Non-optimized configuration#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.
| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) | Max Tokens |
|---|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 3.6 | FP16 | 9.5 | 4096 |
Warning
The maximum token length of the non-optimized configuration (4096) is smaller than that of the other profiles (8192).
llama-nemotron-rerank-500m-v2#
Optimized configuration#
| Compute Capability | Precision |
|---|---|
| 12.0 | FP16 & FP8 |
| 10.0 | FP16 & FP8 |
| 9.0 | FP16 & FP8 |
| 8.9 | FP16 & FP8 |
| 8.6 | FP16 |
| 8.0 | FP16 |
Non-optimized configuration#
The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.
| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) | Max Tokens |
|---|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 3.6 | FP16 | 9.5 | 4096 |
Warning
The maximum token length of the non-optimized configuration (4096) is smaller than that of the other profiles (8192).
Memory Footprint#
The following tables list the valid configurations and their approximate memory footprints for each model.
For llama-nemotron-rerank-vl-1b-v2, use the optimized SKU table above to determine which precision is supported for a release GPU SKU.
Approximate GPU Memory Size (GiB) |
|---|
37.04 |
Approximate GPU Memory Size (GiB) |
|---|
25.22 |
Approximate GPU Memory Size (GiB) |
|---|
37.16 |
Approximate GPU Memory Size (GiB) |
|---|
25.35 |
Approximate GPU Memory Size (GiB) |
|---|
37.06 |
Approximate GPU Memory Size (GiB) |
|---|
25.25 |
Approximate GPU Memory Size (GiB) |
|---|
36.88 |
Approximate GPU Memory Size (GiB) |
|---|
25.06 |
Approximate GPU Memory Size (GiB) |
|---|
24.06 |
Approximate GPU Memory Size (GiB) |
|---|
25.06 |
Approximate GPU Memory Size (GiB) |
|---|
3.68 |
Approximate GPU Memory Size (GiB) |
|---|
7.59 |
Approximate GPU Memory Size (GiB) |
|---|
3.91 |
Approximate GPU Memory Size (GiB) |
|---|
6.69 |
Approximate GPU Memory Size (GiB) |
|---|
3.65 |
Approximate GPU Memory Size (GiB) |
|---|
6.51 |
Approximate GPU Memory Size (GiB) |
|---|
3.56 |
Approximate GPU Memory Size (GiB) |
|---|
5.84 |
Approximate GPU Memory Size (GiB) |
|---|
6.06 |
Approximate GPU Memory Size (GiB) |
|---|
6.53 |
| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
|---|---|---|
| 1 | 8192 | 2.53 |
| 8 | 8192 | 9.63 |
| 16 | 8192 | 18.25 |
| 30 | 1024 | 5.19 |
| 30 | 2048 | 8.65 |
| 30 | 4096 | 16.27 |
| 30 | 8192 | 32.91 |
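A footprint table like the one directly above can be used to shortlist configurations that fit a given memory budget. The following Python sketch copies the values from that table. The function name `fitting_configs` and the `headroom_gib` parameter are hypothetical, and because the footprints are approximate, the result is a starting point rather than a guarantee.

```python
from typing import Dict, List, Tuple

Config = Tuple[int, int]  # (max batch size, max sequence length)

# Approximate GPU memory in GiB, copied from the table directly above.
FOOTPRINT_GIB: Dict[Config, float] = {
    (1, 8192): 2.53,
    (8, 8192): 9.63,
    (16, 8192): 18.25,
    (30, 1024): 5.19,
    (30, 2048): 8.65,
    (30, 4096): 16.27,
    (30, 8192): 32.91,
}


def fitting_configs(
    available_gib: float, headroom_gib: float = 0.0
) -> List[Tuple[Config, float]]:
    """Return configurations whose footprint fits the budget, largest first."""
    budget = available_gib - headroom_gib
    fits = [(cfg, gib) for cfg, gib in FOOTPRINT_GIB.items() if gib <= budget]
    return sorted(fits, key=lambda item: item[1], reverse=True)
```

With 10 GiB available and no headroom, the largest fitting entry is a batch size of 8 at a sequence length of 8192 (9.63 GiB).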
| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
|---|---|---|
| 1 | 8192 | 2.53 |
| 8 | 8192 | 9.63 |
| 16 | 8192 | 18.25 |
| 30 | 1024 | 5.19 |
| 30 | 2048 | 8.65 |
| 30 | 4096 | 16.27 |
| 30 | 8192 | 32.91 |
| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
|---|---|---|
| 1 | 8192 | 2.78 |
| 8 | 8192 | 9.88 |
| 16 | 1024 | 3.66 |
| 16 | 2048 | 5.5 |
| 16 | 4096 | 9.38 |
| 16 | 8192 | 18.0 |
| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
|---|---|---|
| 1 | 8192 | 3.04 |
| 8 | 8192 | 10.38 |
| 16 | 8192 | 18.75 |
| 30 | 1024 | 5.69 |
| 30 | 2048 | 9.15 |
| 30 | 4096 | 16.77 |
| 30 | 8192 | 33.41 |
| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
|---|---|---|
| 1 | 8192 | 1.4 |
| 8 | 8192 | 4.5 |
| 16 | 8192 | 8.04 |
| 30 | 1024 | 2.16 |
| 30 | 2048 | 3.58 |
| 30 | 4096 | 6.66 |
| 30 | 8192 | 14.21 |
| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
|---|---|---|
| 1 | 8192 | 3.04 |
| 8 | 8192 | 10.38 |
| 16 | 8192 | 18.75 |
| 30 | 1024 | 5.69 |
| 30 | 2048 | 9.15 |
| 30 | 4096 | 16.77 |
| 30 | 8192 | 33.41 |
| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
|---|---|---|
| 1 | 8192 | 1.4 |
| 8 | 8192 | 4.5 |
| 16 | 8192 | 8.04 |
| 30 | 1024 | 2.16 |
| 30 | 2048 | 3.58 |
| 30 | 4096 | 6.66 |
| 30 | 8192 | 14.21 |
| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
|---|---|---|
| 1 | 8192 | 2.27 |
| 8 | 8192 | 9.81 |
| 16 | 1024 | 3.5 |
| 16 | 2048 | 5.38 |
| 16 | 4096 | 9.31 |
| 16 | 8192 | 18.06 |
| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
|---|---|---|
| 1 | 8192 | 2.6 |
| 8 | 8192 | 9.69 |
| 16 | 8192 | 17.81 |
| 30 | 1024 | 5.14 |
| 30 | 2048 | 8.59 |
| 30 | 4096 | 15.86 |
| 30 | 8192 | 32.03 |
| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
|---|---|---|
| 1 | 8192 | 1.52 |
| 8 | 8192 | 4.5 |
| 16 | 8192 | 8.03 |
| 30 | 1024 | 2.16 |
| 30 | 2048 | 3.58 |
| 30 | 4096 | 6.65 |
| 30 | 8192 | 14.21 |
Software#
NVIDIA Driver#
Release 1.6.0+ uses NVIDIA Optimized Frameworks 25.01. For NVIDIA driver support, refer to the Frameworks Support Matrix.
Ensure that the latest compatible NVIDIA driver is installed on your system before launching NIM containers. If you experience issues starting the containers, verify that your driver is up to date.
NVIDIA Container Toolkit#
Your Docker environment must support NVIDIA GPUs. For more information, refer to NVIDIA Container Toolkit.