Support Matrix for NVIDIA NeMo Retriever Reranking NIM#

This documentation describes the software and hardware that NVIDIA NeMo Retriever Reranking NIM supports.

CPU#

NeMo Retriever Reranking NIM requires the following:

An x86 processor with at least 8 cores. For a list of supported systems, refer to the NVIDIA Certified Systems Catalog.

Models#

NeMo Retriever Reranking NIM supports the following models.

Publisher	Model ID	Max Tokens (Optimized Models)	Model Card
NVIDIA	nvidia/llama-nemotron-rerank-vl-1b-v2	8192	Model card
NVIDIA	nvidia/llama-nemotron-rerank-1b-v2	8192	Model card
NVIDIA	nvidia/llama-nemotron-rerank-500m-v2	8192	Model card

When truncate is set to END, any query/passage pair that exceeds the maximum token length is truncated from the right, starting with the passage.
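The truncation rule can be illustrated with a minimal Python sketch. The function name `truncate_pair` and the plain token lists are hypothetical. The sketch ignores special tokens and the model's real tokenizer, and it assumes that the query is shortened only when the query alone still exceeds the limit, which the text above does not state explicitly.

```python
def truncate_pair(query_tokens, passage_tokens, max_tokens):
    """Trim a query/passage pair to max_tokens, cutting from the right.

    The passage loses tokens first. The query is shortened only if it
    alone still exceeds the limit (an assumption of this sketch).
    """
    overflow = len(query_tokens) + len(passage_tokens) - max_tokens
    if overflow <= 0:
        return query_tokens, passage_tokens
    # Drop the overflow from the end of the passage first.
    keep_passage = max(len(passage_tokens) - overflow, 0)
    passage_tokens = passage_tokens[:keep_passage]
    # If the query alone is still too long, trim its right side as well.
    query_tokens = query_tokens[:max_tokens]
    return query_tokens, passage_tokens


query = list(range(10))          # 10 hypothetical token IDs
passage = list(range(100, 120))  # 20 hypothetical token IDs
q, p = truncate_pair(query, passage, max_tokens=24)
print(len(q), len(p))  # prints "10 14"
```

In this example the pair is 6 tokens over the limit, so the last 6 passage tokens are dropped and the query is left untouched.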

Optimized vs. Non-Optimized Models#

Starting in version 2.0.0, optimized configurations for nvidia/llama-nemotron-rerank-vl-1b-v2 use runtime CUDA kernels and just-in-time compilation. The NIM uses FP16 kernels by default. FP8 kernels are available only for the GPU SKUs listed with FP8 support and must be requested explicitly with NIM_PRECISION=fp8.

The optimized configuration tables list the GPU SKUs and precisions that are tuned and validated for the release. Optimized attention kernels are also an explicit runtime configuration; enable them only on supported hardware.

Non-optimized configurations use a fallback kernel feature set intended for broad compatibility, such as FP16 architecture-agnostic kernels. Fallback configurations can run on GPUs with sufficient memory, but they might not support every optimized feature or deliver the same performance as the optimized configurations.

Compute Capability and Kernel Configuration#

Starting in version 2.0.0, automatic profile selection does not apply to nvidia/llama-nemotron-rerank-vl-1b-v2, and the NIM does not perform automatic kernel selection. The runtime uses the default FP16 kernel path unless you explicitly request another supported configuration.

Use the optimized configuration table for llama-nemotron-rerank-vl-1b-v2 to determine whether the target GPU SKU supports FP8. To opt into FP8 kernels, set NIM_PRECISION=fp8. To use optimized attention kernels, enable the corresponding runtime configuration for the deployment.

To see the mapping of CUDA GPU compute capability versions to GPU SKUs, refer to CUDA GPU Compute Capability.

Supported Hardware#

Note

Currently, GPU clusters with GPUs in Multi-Instance GPU (MIG) mode are not supported.

llama-nemotron-rerank-vl-1b-v2#

Optimized configuration#

GPU SKU	Precision	Max Tokens
NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition	FP16 & FP8	8192
NVIDIA-B200	FP16 & FP8	8192
NVIDIA-H100-NVL	FP16 & FP8	8192
NVIDIA-H100-80GB-HBM3	FP16 & FP8	8192
NVIDIA-A100-SXM4-80GB	FP16	8192

FP8 availability for llama-nemotron-rerank-vl-1b-v2 is SKU-specific. Use the precision listed for the target SKU.
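As a rough illustration of the SKU-specific rule, the following Python sketch looks up a GPU SKU in a copy of the optimized configuration table and returns the environment variables to set. The dictionary and the function name `precision_env` are hypothetical helpers, not part of the NIM. Only the `NIM_PRECISION=fp8` setting comes from this page.

```python
# Precisions per GPU SKU, copied from the optimized configuration table.
OPTIMIZED_PRECISIONS = {
    "NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition": {"fp16", "fp8"},
    "NVIDIA-B200": {"fp16", "fp8"},
    "NVIDIA-H100-NVL": {"fp16", "fp8"},
    "NVIDIA-H100-80GB-HBM3": {"fp16", "fp8"},
    "NVIDIA-A100-SXM4-80GB": {"fp16"},
}


def precision_env(gpu_sku, want_fp8=False):
    """Return the environment variables needed for the requested precision."""
    supported = OPTIMIZED_PRECISIONS.get(gpu_sku)
    if supported is None:
        raise ValueError(f"{gpu_sku} is not in the optimized configuration table")
    if want_fp8:
        if "fp8" not in supported:
            raise ValueError(f"{gpu_sku} is not listed with FP8 support")
        return {"NIM_PRECISION": "fp8"}
    # FP16 is the default kernel path, so no variable is required.
    return {}


print(precision_env("NVIDIA-H100-NVL", want_fp8=True))
```

Raising an error for an unlisted SKU, rather than silently falling back to FP16, is a design choice of this sketch. It keeps an unsupported FP8 request from going unnoticed.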

By default, the runtime uses NIM_ENGINE_COUNT=1. For the maximum compatibility profile, keep or set NIM_ENGINE_COUNT=1 explicitly. For maximum performance on GPU SKUs with at least 80 GB of VRAM, set NIM_ENGINE_COUNT=2.
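The engine-count guidance can be summarized in a short Python sketch. The function name `engine_count_env` and its parameters are hypothetical. The 80 GB threshold and the values 1 and 2 are taken from the paragraph above.

```python
def engine_count_env(vram_gb, prefer_performance=False):
    """Pick NIM_ENGINE_COUNT from the guidance above.

    Two engines are chosen only when performance is preferred and the
    GPU has at least 80 GB of VRAM. Otherwise the default of 1 is kept.
    """
    count = 2 if prefer_performance and vram_gb >= 80 else 1
    return {"NIM_ENGINE_COUNT": str(count)}


print(engine_count_env(80, prefer_performance=True))
print(engine_count_env(48, prefer_performance=True))
```

Setting the variable explicitly even when the value is 1 matches the recommendation for the maximum compatibility profile.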

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space	Max Tokens
Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory	7.30	FP16	3.10	8192
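The requirement in this table can be read as a simple check, sketched below in Python. The function name `fits_non_optimized` is hypothetical. The sketch approximates "homogeneous" by comparing per-GPU memory sizes, whereas a real check would compare GPU models, and it treats the listed values as minimums for available GPU memory and free disk space.

```python
# Values from the non-optimized configuration table, in GB.
REQUIRED_GPU_MEMORY_GB = 7.30
REQUIRED_DISK_GB = 3.10


def fits_non_optimized(gpu_memory_gb, free_disk_gb):
    """Check a host against the non-optimized requirements.

    gpu_memory_gb is a list with one memory size (in GB) per GPU.
    """
    if not gpu_memory_gb:
        return False
    # Multiple GPUs must be homogeneous (approximated by equal memory size).
    if len(set(gpu_memory_gb)) != 1:
        return False
    enough_memory = sum(gpu_memory_gb) >= REQUIRED_GPU_MEMORY_GB
    enough_disk = free_disk_gb >= REQUIRED_DISK_GB
    return enough_memory and enough_disk


print(fits_non_optimized([24.0], free_disk_gb=50.0))
print(fits_non_optimized([4.0, 4.0], free_disk_gb=50.0))
```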

llama-nemotron-rerank-1b-v2#

Optimized configuration#

Compute Capability	Precision
12.0	FP16 & FP8
10.0	FP16 & FP8
9.0	FP16 & FP8
8.9	FP16 & FP8
8.6	FP16
8.0	FP16
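A compute-capability table like this one can be expressed as a small lookup, sketched below in Python. The dictionary and the function name `supported_precisions` are hypothetical. How the compute capability is obtained for a given GPU is outside the scope of the sketch.

```python
# Optimized precisions per CUDA compute capability, copied from the table.
OPTIMIZED_BY_COMPUTE_CAPABILITY = {
    (12, 0): ("fp16", "fp8"),
    (10, 0): ("fp16", "fp8"),
    (9, 0): ("fp16", "fp8"),
    (8, 9): ("fp16", "fp8"),
    (8, 6): ("fp16",),
    (8, 0): ("fp16",),
}


def supported_precisions(major, minor):
    """Return the optimized precisions for a compute capability.

    An empty tuple means the table lists no optimized configuration, so
    only the non-optimized configuration could apply.
    """
    return OPTIMIZED_BY_COMPUTE_CAPABILITY.get((major, minor), ())


print(supported_precisions(8, 9))
print(supported_precisions(8, 6))
print(supported_precisions(7, 5))
```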

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space	Max Tokens
Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory	3.6	FP16	9.5	4096

Warning

The maximum token length of the non-optimized configuration (4096) is smaller than that of the other profiles (8192).

llama-nemotron-rerank-500m-v2#

Optimized configuration#

Compute Capability	Precision
12.0	FP16 & FP8
10.0	FP16 & FP8
9.0	FP16 & FP8
8.9	FP16 & FP8
8.6	FP16
8.0	FP16

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space	Max Tokens
Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory	3.6	FP16	9.5	4096

Warning

The maximum token length of the non-optimized configuration (4096) is smaller than that of the other profiles (8192).

Memory Footprint#

The following tables list the valid configurations and their approximate memory footprints for each model. For llama-nemotron-rerank-vl-1b-v2, use the optimized configuration table above to determine which precision each GPU SKU supports.

nvidia/llama-nemotron-rerank-vl-1b-v2

Compute Capability	Precision	Approximate GPU Memory Size (GiB)
12.0	fp8	37.04
12.0	fp16	25.22
10.0	fp8	37.16
10.0	fp16	25.35
9.0	fp8	37.06
9.0	fp16	25.25
8.9	fp8	36.88
8.9	fp16	25.06
8.6	fp16	24.06
8.0	fp16	25.06

nvidia/llama-nemotron-rerank-1b-v2

Compute Capability	Precision	Approximate GPU Memory Size (GiB)
12.0	fp8	3.68
12.0	fp16	7.59
10.0	fp8	3.91
10.0	fp16	6.69
9.0	fp8	3.65
9.0	fp16	6.51
8.9	fp8	3.56
8.9	fp16	5.84
8.6	fp16	6.06
8.0	fp16	6.53

nvidia/llama-nemotron-rerank-500m-v2

a100-sxm4-40gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GiB)
1	8192	2.53
8	8192	9.63
16	8192	18.25
30	1024	5.19
30	2048	8.65
30	4096	16.27
30	8192	32.91

a100-sxm4-80gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GiB)
1	8192	2.53
8	8192	9.63
16	8192	18.25
30	1024	5.19
30	2048	8.65
30	4096	16.27
30	8192	32.91

a10g

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GiB)
1	8192	2.78
8	8192	9.88
16	1024	3.66
16	2048	5.5
16	4096	9.38
16	8192	18.0

h100-hbm3-80gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GiB)
1	8192	3.04
8	8192	10.38
16	8192	18.75
30	1024	5.69
30	2048	9.15
30	4096	16.77
30	8192	33.41

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GiB)
1	8192	1.4
8	8192	4.5
16	8192	8.04
30	1024	2.16
30	2048	3.58
30	4096	6.66
30	8192	14.21

h100-nvl

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GiB)
1	8192	3.04
8	8192	10.38
16	8192	18.75
30	1024	5.69
30	2048	9.15
30	4096	16.77
30	8192	33.41

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GiB)
1	8192	1.4
8	8192	4.5
16	8192	8.04
30	1024	2.16
30	2048	3.58
30	4096	6.66
30	8192	14.21

l4

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GiB)
1	8192	2.27
8	8192	9.81
16	1024	3.5
16	2048	5.38
16	4096	9.31
16	8192	18.06

l40s

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GiB)
1	8192	2.6
8	8192	9.69
16	8192	17.81
30	1024	5.14
30	2048	8.59
30	4096	15.86
30	8192	32.03

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GiB)
1	8192	1.52
8	8192	4.5
16	8192	8.03
30	1024	2.16
30	2048	3.58
30	4096	6.65
30	8192	14.21
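The per-GPU tables above can be used to estimate whether a planned batch size and sequence length fit in memory. The following Python sketch does this for the l40s fp8 rows. The function name `estimate_memory_gib` is hypothetical, and the sketch assumes that memory use does not decrease as batch size or sequence length grows, so any listed configuration that covers the request serves as an upper bound.

```python
# (max batch size, max sequence length, approximate GiB) for l40s, fp8.
L40S_FP8_ROWS = [
    (1, 8192, 1.52),
    (8, 8192, 4.5),
    (16, 8192, 8.03),
    (30, 1024, 2.16),
    (30, 2048, 3.58),
    (30, 4096, 6.65),
    (30, 8192, 14.21),
]


def estimate_memory_gib(batch_size, seq_len, rows=L40S_FP8_ROWS):
    """Return the smallest listed footprint that covers the request.

    A row covers the request when both its batch size and its sequence
    length are at least as large. Returns None if no row covers it.
    """
    covering = [
        gib for max_batch, max_len, gib in rows
        if max_batch >= batch_size and max_len >= seq_len
    ]
    return min(covering) if covering else None


print(estimate_memory_gib(4, 8192))   # covered by the (8, 8192) row
print(estimate_memory_gib(20, 2000))  # covered by the (30, 2048) row
print(estimate_memory_gib(64, 1024))  # no listed row covers batch size 64
```

The result is a conservative estimate taken from the listed rows, not an interpolation, so the actual footprint for a smaller request may be lower.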

Software#

NVIDIA Driver#

Release 1.6.0+ uses NVIDIA Optimized Frameworks 25.01. For NVIDIA driver support, refer to the Frameworks Support Matrix.

Ensure that the latest compatible NVIDIA driver is installed on your system before launching NIM containers. If you experience issues starting the containers, verify that your driver is up to date.

NVIDIA Container Toolkit#

Your Docker environment must support NVIDIA GPUs. For more information, refer to NVIDIA Container Toolkit.
