Support Matrix for NeMo Retriever Text Reranking NIM#

This documentation describes the software and hardware that NeMo Retriever Text Reranking NIM supports.

CPU#

Text Reranking NIM requires the following:

x86 processor with at least 8 cores. For a list of supported systems, refer to NVIDIA Certified Systems Catalog.

Models#

Model Name	Model ID	Max Tokens	Publisher	Model Card
Llama-3.2-NV-RerankQA-1B-v2	nvidia/llama-3-2-nv-rerankqa-1b-v2	8192 (optimized models)	NVIDIA	Link
llama-3.2-nemoretriever-500m-rerank-v2	nvidia/llama-3.2-nemoretriever-500m-rerank-v2	8192 (optimized models)	NVIDIA	Link
NV-RerankQA-Mistral4B-v3	nvidia/nv-rerankqa-mistral-4b-v3	512	NVIDIA	Link

Note that when truncate is set to END, any Query / Passage pair that is longer than the maximum token length will be truncated from the right, starting with the passage.

Optimized vs Non Optimized Models#

The following models are optimized using TRT and are available as pre-built, optimized engines on NGC. These optimized models are GPU specific and require a minimum GPU memory value as specified in the Optimized configuration sections of each model.

NVIDIA also provides generic model profiles that operate with any NVIDIA GPU (or set of GPUs) with sufficient memory capacity. These generic profiles are known as non-optimized configuration. On systems where there are no compatible optimized profiles, generic profiles are chosen automatically. Optimized profiles are preferred over generic profiles when available, but you can choose to deploy a generic profile on any system by following the steps in the Overriding Profile Selection section.

Compute Capability and Automatic Profile Selection#

NeMo Retriever Text Reranking NIM supports TensorRT engines that are compiled with the option kSAME_COMPUTE_CAPABILITY. This option builds engines that are compatible with GPUs having the same compute capability as the one on which the engine was built. For more information, refer to Same Compute Capability Compatibility Level.

To see the mapping of CUDA GPU compute capability versions to supported GPU SKUs, refer to CUDA GPU Compute Capability. If you run a NIM on a GPU that has the same compute capability as one of the engines, then that engine should appear as compatible when you run list-model-profiles.

Automatic profile selection uses the following order to choose a profile:

A GPU-specific engine (for example, gpu:NVIDIA B200)
A compute capability engine (for example, compute_capability:10.0)
ONNX or Pytorch(for example, model_type:onnx)

Note: Certain NIMs may include both GPU-specific engines and compute capability engines, while others may include only a single engine type.

Supported Hardware#

Note

Currently, GPU clusters with GPUs in Multi-instance GPU mode (MIG) are not supported.

Llama-3.2-NV-RerankQA-1B-v2#

Optimized configuration#

GPU	GPU Memory (GB)	Precision
A100 SXM4	40 & 80	FP16
H100 HBM3	80	FP16 & FP8
H100 NVL	80	FP16 & FP8
L40s	48	FP16 & FP8
A10G	24	FP16
L4	24	FP16 & FP8
B200	180	FP16

Compute Capability	Precision
12.0	FP16 & FP8
10.0	FP16 & FP8
9.0	FP16 & FP8
8.9	FP16 & FP8
8.6	FP16
8.0	FP16

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space	Max Tokens
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogenous NVIDIA GPUs with sufficient aggregate memory	3.6	FP16	9.5	4096

Warning

The maximum token length of the non-optimized configuration is smaller (4096) than the other profiles (8192).

Llama-3.2-nemoretriever-500m-rerank-v2#

Optimized configuration#

GPU	GPU Memory (GB)	Precision
A100 SXM4	40 & 80	FP16 & INT8
H100 HBM3	80	FP16 & FP8
H100 NVL	80	FP16 & FP8
L40s	48	FP16 & FP8
A10G	24	FP16 & INT8
L4	24	FP16 & INT8

Compute Capability	Precision
12.0	FP16 & FP8
10.0	FP16 & FP8
9.0	FP16 & FP8
8.9	FP16 & FP8
8.6	FP16
8.0	FP16

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space	Max Tokens
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogenous NVIDIA GPUs with sufficient aggregate memory	3.6	FP16	9.5	4096

Warning

The maximum token length of the non-optimized configuration is smaller (4096) than the other profiles (8192).

NV-RerankQA-Mistral4B-v3#

Optimized configuration#

GPU	GPU Memory (GB)	Precision
A100 SXM4	80	FP16
H100 HBM3	80	FP16 & FP8
L40s	48	FP 16 & FP8
A10G	24	FP16
L4	24	FP16

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

GPUs	GPU Memory	Precision	Disk Space
Any NVIDIA GPU with sufficient GPU memory or on multiple, homogenous NVIDIA GPUs with sufficient aggregate memory	9	FP16	23

Memory Footprint#

You can control the NIM’s memory footprint by controlling the maximum allowed batch size and sequence length. For more information, refer to Memory Footprint.

The following table provides the set of valid configurations and the associated approximate memory footprints for the model.

nvidia/llama-3.2-nv-rerankqa-1b-v2

a100-sxm4-40gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	3.02
8	8192	7.94
16	8192	13.56
30	1024	4.48
30	2048	7.12
30	4096	11.92
30	8192	23.41

a100-sxm4-80gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	3.02
8	8192	7.94
16	8192	13.56
30	1024	4.48
30	2048	7.12
30	4096	11.92
30	8192	23.41

a10g

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	3.7
8	8192	8.63
16	1024	4.16
16	2048	5.56
16	4096	8.13
16	8192	14.25

h100-hbm3-80gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	4.52
8	8192	9.44
16	8192	15.06
30	1024	5.98
30	2048	8.62
30	4096	13.42
30	8192	24.91

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.84
8	8192	4.94
16	8192	8.47
30	1024	2.6
30	2048	4.02
30	4096	7.09
30	8192	14.65

h100-nvl

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	4.52
8	8192	9.44
16	8192	15.06
30	1024	5.98
30	2048	8.62
30	4096	13.42
30	8192	24.91

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.84
8	8192	4.94
16	8192	8.47
30	1024	2.6
30	2048	4.02
30	4096	7.09
30	8192	14.65

l4

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	3.2
8	8192	8.13
16	1024	3.66
16	2048	5.06
16	4096	7.63
16	8192	13.75

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	2.08
8	8192	5.94
16	1024	2.04
16	2048	2.8
16	4096	4.44
16	8192	8.47

l40s

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	3.02
8	8192	7.94
16	8192	13.56
30	1024	4.48
30	2048	7.12
30	4096	11.92
30	8192	23.41

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.83
8	8192	4.94
16	8192	8.47
30	1024	2.6
30	2048	4.02
30	4096	7.09
30	8192	14.65

nvidia/llama-3.2-nemoretriever-500m-rerank-v2

a100-sxm4-40gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	2.53
8	8192	9.63
16	8192	18.25
30	1024	5.19
30	2048	8.65
30	4096	16.27
30	8192	32.91

a100-sxm4-80gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	2.53
8	8192	9.63
16	8192	18.25
30	1024	5.19
30	2048	8.65
30	4096	16.27
30	8192	32.91

a10g

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	2.78
8	8192	9.88
16	1024	3.66
16	2048	5.5
16	4096	9.38
16	8192	18.0

h100-hbm3-80gb

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	3.04
8	8192	10.38
16	8192	18.75
30	1024	5.69
30	2048	9.15
30	4096	16.77
30	8192	33.41

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.4
8	8192	4.5
16	8192	8.04
30	1024	2.16
30	2048	3.58
30	4096	6.66
30	8192	14.21

h100-nvl

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	3.04
8	8192	10.38
16	8192	18.75
30	1024	5.69
30	2048	9.15
30	4096	16.77
30	8192	33.41

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.4
8	8192	4.5
16	8192	8.04
30	1024	2.16
30	2048	3.58
30	4096	6.66
30	8192	14.21

l4

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	2.27
8	8192	9.81
16	1024	3.5
16	2048	5.38
16	4096	9.31
16	8192	18.06

l40s

fp16

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	2.6
8	8192	9.69
16	8192	17.81
30	1024	5.14
30	2048	8.59
30	4096	15.86
30	8192	32.03

fp8

Max Batch Size	Max Sequence Length	Approximate GPU Memory Size (GB)
1	8192	1.52
8	8192	4.5
16	8192	8.03
30	1024	2.16
30	2048	3.58
30	4096	6.65
30	8192	14.21

Software#

NVIDIA Driver#

Release 1.6.0+ uses NVIDIA Optimized Frameworks 25.01. For NVIDIA driver support, refer to the Frameworks Support Matrix.

Ensure that the latest compatible NVIDIA driver is installed on your system before launching NIM containers. If you experience issues starting the containers, verify that your driver is up-to-date.

NVIDIA Container Toolkit#

Your Docker environment must support NVIDIA GPUs. For more information, refer to NVIDIA Container Toolkit.