Support Matrix for NVIDIA NeMo Retriever Reranking NIM#

This documentation describes the software and hardware that NVIDIA NeMo Retriever Reranking NIM supports.

CPU#

NeMo Retriever Reranking NIM requires the following:

Models#

NeMo Retriever Reranking NIM supports the following models.

| Publisher | Model ID | Max Tokens (Optimized Models) | Model Card |
| --- | --- | --- | --- |
| NVIDIA | nvidia/llama-nemotron-rerank-vl-1b-v2 | 8192 | Model card |
| NVIDIA | nvidia/llama-nemotron-rerank-1b-v2 | 8192 | Model card |
| NVIDIA | nvidia/llama-nemotron-rerank-500m-v2 | 8192 | Model card |

When truncate is set to END, any query/passage pair that is longer than the maximum token length is truncated from the right, starting with the passage.
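The truncation order described above can be sketched as follows. This is a minimal illustration, assuming tokens are plain lists and ignoring any special tokens the tokenizer adds. The function name `truncate_pair_end` is hypothetical and is not part of the NIM API, and trimming the query only after the passage is exhausted is an assumption read from "starting with the passage".

```python
from typing import List, Tuple


def truncate_pair_end(
    query: List[int], passage: List[int], max_tokens: int
) -> Tuple[List[int], List[int]]:
    """Trim a query/passage pair from the right, passage first (illustrative only)."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    overflow = len(query) + len(passage) - max_tokens
    if overflow <= 0:
        return query, passage  # The pair already fits.
    # Drop tokens from the right end of the passage first.
    cut_passage = min(overflow, len(passage))
    passage = passage[: len(passage) - cut_passage]
    # Assumption: the query is trimmed only once the passage is exhausted.
    remaining = overflow - cut_passage
    if remaining > 0:
        query = query[: len(query) - remaining]
    return query, passage


query_tokens = list(range(10))            # 10 query tokens
passage_tokens = list(range(100, 120))    # 20 passage tokens
q, p = truncate_pair_end(query_tokens, passage_tokens, max_tokens=16)
print(len(q), len(p))  # prints "10 6"
```

In this example the pair is 14 tokens too long, so the last 14 passage tokens are dropped and the query is left untouched.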

Optimized vs Non-Optimized Models#

Starting in version 2.0.0, optimized configurations for nvidia/llama-nemotron-rerank-vl-1b-v2 use runtime CUDA kernels and just-in-time compilation. The NIM uses FP16 kernels by default. FP8 kernels are available only for the GPU SKUs listed with FP8 support and must be requested explicitly with NIM_PRECISION=fp8.

The optimized configuration tables list the GPU SKUs and precisions that are tuned and validated for the release. Optimized attention kernels are also an explicit runtime configuration; enable them only on supported hardware.

Non-optimized configurations use a fallback kernel feature set intended for broad compatibility, such as FP16 architecture-agnostic kernels. Fallback configurations can run on GPUs with sufficient memory, but they might not support every optimized feature or deliver the same performance as the optimized configurations.

Compute Capability and Kernel Configuration#

Starting in version 2.0.0, automatic profile selection does not apply to nvidia/llama-nemotron-rerank-vl-1b-v2, and the NIM does not perform automatic kernel selection. The runtime uses the default FP16 kernel path unless you explicitly request another supported configuration.

Use the optimized configuration table for llama-nemotron-rerank-vl-1b-v2 to determine whether the target GPU SKU supports FP8. To opt into FP8 kernels, set NIM_PRECISION=fp8. To use optimized attention kernels, enable the corresponding runtime configuration for the deployment.
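As an illustration of the FP8 opt-in, the following sketch builds the environment variables for a deployment. The SKU-to-precision mapping is copied from the optimized configuration table for llama-nemotron-rerank-vl-1b-v2 in this document. The helper name `precision_env` is hypothetical, and leaving NIM_PRECISION unset for FP16 reflects the stated default rather than a documented requirement.

```python
from typing import Dict

# FP8 support per GPU SKU, copied from the optimized configuration table
# for llama-nemotron-rerank-vl-1b-v2 in this document.
FP8_SUPPORTED: Dict[str, bool] = {
    "NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition": True,
    "NVIDIA-B200": True,
    "NVIDIA-H100-NVL": True,
    "NVIDIA-H100-80GB-HBM3": True,
    "NVIDIA-A100-SXM4-80GB": False,
}


def precision_env(gpu_sku: str, want_fp8: bool) -> Dict[str, str]:
    """Return env vars for the requested precision (hypothetical helper)."""
    if not want_fp8:
        return {}  # FP16 is the default path, so nothing is set here.
    if not FP8_SUPPORTED.get(gpu_sku, False):
        raise ValueError(f"FP8 is not listed as supported for {gpu_sku!r}")
    return {"NIM_PRECISION": "fp8"}


print(precision_env("NVIDIA-H100-NVL", want_fp8=True))  # {'NIM_PRECISION': 'fp8'}
```

The returned mapping can be handed to whatever launches the container, for example as `-e NAME=value` arguments to `docker run`. Rejecting unlisted SKUs is a design choice of this sketch, made so that FP8 is never requested on hardware the table does not cover.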

To see the mapping of CUDA GPU compute capability versions to GPU SKUs, refer to CUDA GPU Compute Capability.

Supported Hardware#

Note

GPU clusters with GPUs in Multi-Instance GPU (MIG) mode are currently not supported.

llama-nemotron-rerank-vl-1b-v2#

Optimized configuration#

| GPU SKU | Precision | Max Tokens |
| --- | --- | --- |
| NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition | FP16 & FP8 | 8192 |
| NVIDIA-B200 | FP16 & FP8 | 8192 |
| NVIDIA-H100-NVL | FP16 & FP8 | 8192 |
| NVIDIA-H100-80GB-HBM3 | FP16 & FP8 | 8192 |
| NVIDIA-A100-SXM4-80GB | FP16 | 8192 |

FP8 availability for llama-nemotron-rerank-vl-1b-v2 is SKU-specific. Use the precision listed for the target SKU.

By default, the runtime uses NIM_ENGINE_COUNT=1. For the maximum compatibility profile, keep or set NIM_ENGINE_COUNT=1 explicitly. For maximum performance on GPU SKUs with at least 80 GB of VRAM, set NIM_ENGINE_COUNT=2.
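The engine-count guidance above can be expressed as a small decision rule. This is a sketch only: the helper name `engine_count_env` and its parameters are hypothetical, and the 80 GB threshold is taken directly from the paragraph above.

```python
from typing import Dict


def engine_count_env(vram_gb: float, prefer_performance: bool) -> Dict[str, str]:
    """Pick NIM_ENGINE_COUNT from the guidance above (hypothetical helper)."""
    # Two engines are suggested only for GPUs with at least 80 GB of VRAM.
    if prefer_performance and vram_gb >= 80:
        return {"NIM_ENGINE_COUNT": "2"}
    # Otherwise keep the default, set explicitly for maximum compatibility.
    return {"NIM_ENGINE_COUNT": "1"}


print(engine_count_env(80, prefer_performance=True))  # {'NIM_ENGINE_COUNT': '2'}
print(engine_count_env(48, prefer_performance=True))  # {'NIM_ENGINE_COUNT': '1'}
```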

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) | Max Tokens |
| --- | --- | --- | --- | --- |
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 7.30 | FP16 | 3.10 | 8192 |

llama-nemotron-rerank-1b-v2#

Optimized configuration#

| Compute Capability | Precision |
| --- | --- |
| 12.0 | FP16 & FP8 |
| 10.0 | FP16 & FP8 |
| 9.0 | FP16 & FP8 |
| 8.9 | FP16 & FP8 |
| 8.6 | FP16 |
| 8.0 | FP16 |
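The compute capability table above lends itself to a simple lookup. The following sketch encodes the listed values for llama-nemotron-rerank-1b-v2. The function name `optimized_precisions` is hypothetical, and treating an unlisted compute capability as "no optimized configuration" is an assumption of this example.

```python
from typing import Dict, Tuple

# Precisions per compute capability, copied from the table above.
OPTIMIZED_PRECISIONS: Dict[Tuple[int, int], Tuple[str, ...]] = {
    (12, 0): ("FP16", "FP8"),
    (10, 0): ("FP16", "FP8"),
    (9, 0): ("FP16", "FP8"),
    (8, 9): ("FP16", "FP8"),
    (8, 6): ("FP16",),
    (8, 0): ("FP16",),
}


def optimized_precisions(major: int, minor: int) -> Tuple[str, ...]:
    """Return the listed precisions, or an empty tuple if the capability is not listed."""
    return OPTIMIZED_PRECISIONS.get((major, minor), ())


print(optimized_precisions(9, 0))  # ('FP16', 'FP8')
print(optimized_precisions(8, 6))  # ('FP16',)
print(optimized_precisions(7, 5))  # ()
```

If PyTorch happens to be installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` pair that can be passed to this lookup. The CUDA GPU Compute Capability page serves the same purpose without any code.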

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) | Max Tokens |
| --- | --- | --- | --- | --- |
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 3.6 | FP16 | 9.5 | 4096 |

Warning

The maximum token length of the non-optimized configuration (4096) is smaller than that of the other profiles (8192).

llama-nemotron-rerank-500m-v2#

Optimized configuration#

| Compute Capability | Precision |
| --- | --- |
| 12.0 | FP16 & FP8 |
| 10.0 | FP16 & FP8 |
| 9.0 | FP16 & FP8 |
| 8.9 | FP16 & FP8 |
| 8.6 | FP16 |
| 8.0 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory (GB) | Precision | Disk Space (GB) | Max Tokens |
| --- | --- | --- | --- | --- |
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 3.6 | FP16 | 9.5 | 4096 |

Warning

The maximum token length of the non-optimized configuration (4096) is smaller than that of the other profiles (8192).

Memory Footprint#

The following tables list valid configurations and their approximate memory footprints. For llama-nemotron-rerank-vl-1b-v2, use the optimized configuration table above to determine which precisions are supported on a GPU SKU validated for the release.

| Approximate GPU Memory Size (GiB) |
| --- |
| 37.04 |
| 25.22 |
| 37.16 |
| 25.35 |
| 37.06 |
| 25.25 |
| 36.88 |
| 25.06 |
| 24.06 |
| 25.06 |
| 3.68 |
| 7.59 |
| 3.91 |
| 6.69 |
| 3.65 |
| 6.51 |
| 3.56 |
| 5.84 |
| 6.06 |
| 6.53 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
| --- | --- | --- |
| 1 | 8192 | 2.53 |
| 8 | 8192 | 9.63 |
| 16 | 8192 | 18.25 |
| 30 | 1024 | 5.19 |
| 30 | 2048 | 8.65 |
| 30 | 4096 | 16.27 |
| 30 | 8192 | 32.91 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
| --- | --- | --- |
| 1 | 8192 | 2.53 |
| 8 | 8192 | 9.63 |
| 16 | 8192 | 18.25 |
| 30 | 1024 | 5.19 |
| 30 | 2048 | 8.65 |
| 30 | 4096 | 16.27 |
| 30 | 8192 | 32.91 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
| --- | --- | --- |
| 1 | 8192 | 2.78 |
| 8 | 8192 | 9.88 |
| 16 | 1024 | 3.66 |
| 16 | 2048 | 5.5 |
| 16 | 4096 | 9.38 |
| 16 | 8192 | 18.0 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
| --- | --- | --- |
| 1 | 8192 | 3.04 |
| 8 | 8192 | 10.38 |
| 16 | 8192 | 18.75 |
| 30 | 1024 | 5.69 |
| 30 | 2048 | 9.15 |
| 30 | 4096 | 16.77 |
| 30 | 8192 | 33.41 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
| --- | --- | --- |
| 1 | 8192 | 1.4 |
| 8 | 8192 | 4.5 |
| 16 | 8192 | 8.04 |
| 30 | 1024 | 2.16 |
| 30 | 2048 | 3.58 |
| 30 | 4096 | 6.66 |
| 30 | 8192 | 14.21 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
| --- | --- | --- |
| 1 | 8192 | 3.04 |
| 8 | 8192 | 10.38 |
| 16 | 8192 | 18.75 |
| 30 | 1024 | 5.69 |
| 30 | 2048 | 9.15 |
| 30 | 4096 | 16.77 |
| 30 | 8192 | 33.41 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
| --- | --- | --- |
| 1 | 8192 | 1.4 |
| 8 | 8192 | 4.5 |
| 16 | 8192 | 8.04 |
| 30 | 1024 | 2.16 |
| 30 | 2048 | 3.58 |
| 30 | 4096 | 6.66 |
| 30 | 8192 | 14.21 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
| --- | --- | --- |
| 1 | 8192 | 2.27 |
| 8 | 8192 | 9.81 |
| 16 | 1024 | 3.5 |
| 16 | 2048 | 5.38 |
| 16 | 4096 | 9.31 |
| 16 | 8192 | 18.06 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
| --- | --- | --- |
| 1 | 8192 | 2.6 |
| 8 | 8192 | 9.69 |
| 16 | 8192 | 17.81 |
| 30 | 1024 | 5.14 |
| 30 | 2048 | 8.59 |
| 30 | 4096 | 15.86 |
| 30 | 8192 | 32.03 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GiB) |
| --- | --- | --- |
| 1 | 8192 | 1.52 |
| 8 | 8192 | 4.5 |
| 16 | 8192 | 8.03 |
| 30 | 1024 | 2.16 |
| 30 | 2048 | 3.58 |
| 30 | 4096 | 6.65 |
| 30 | 8192 | 14.21 |
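One way to use the batch-size tables for capacity planning is to look up the smallest tabulated footprint that covers a planned workload. The sketch below uses the rows of the first batch-size table in this section as sample data. Which model, GPU, and precision that table corresponds to is not stated here, so treat the numbers as placeholders and substitute the rows that match your deployment. The function name `approx_footprint_gib` is hypothetical, and the result is a covering upper bound, not an interpolation.

```python
from typing import List, Optional, Tuple

# (max batch size, max sequence length, approximate GiB), copied from the
# first batch-size table in this section.
FOOTPRINTS: List[Tuple[int, int, float]] = [
    (1, 8192, 2.53),
    (8, 8192, 9.63),
    (16, 8192, 18.25),
    (30, 1024, 5.19),
    (30, 2048, 8.65),
    (30, 4096, 16.27),
    (30, 8192, 32.91),
]


def approx_footprint_gib(batch_size: int, seq_len: int) -> Optional[float]:
    """Smallest tabulated footprint whose limits cover the request, else None."""
    covering = [
        gib
        for max_batch, max_seq, gib in FOOTPRINTS
        if batch_size <= max_batch and seq_len <= max_seq
    ]
    return min(covering) if covering else None


print(approx_footprint_gib(8, 8192))   # 9.63
print(approx_footprint_gib(20, 2048))  # 8.65
print(approx_footprint_gib(64, 8192))  # None
```

A result of `None` means the request exceeds every tabulated configuration, so the tables give no estimate for it.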

Software#

NVIDIA Driver#

Release 1.6.0+ uses NVIDIA Optimized Frameworks 25.01. For NVIDIA driver support, refer to the Frameworks Support Matrix.

Ensure that the latest compatible NVIDIA driver is installed on your system before launching NIM containers. If you experience issues starting the containers, verify that your driver is up to date.
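A driver check along these lines can be scripted. The sketch below reads the installed driver version through `nvidia-smi` and compares it with a minimum that you supply. The helper names are hypothetical, and the version numbers in the usage line are placeholders, not requirements. Take the actual minimum from the Frameworks Support Matrix.

```python
import shutil
import subprocess
from typing import List, Tuple


def parse_driver_version(text: str) -> Tuple[int, ...]:
    """Turn a version string such as '570.86.15' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def installed_driver_versions() -> List[Tuple[int, ...]]:
    """Query nvidia-smi for the driver version reported by each GPU."""
    if shutil.which("nvidia-smi") is None:
        raise RuntimeError("nvidia-smi was not found on PATH")
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [parse_driver_version(line) for line in out.splitlines() if line.strip()]


def driver_at_least(installed: Tuple[int, ...], minimum: Tuple[int, ...]) -> bool:
    """Compare versions component by component."""
    return installed >= minimum


# Placeholder values for illustration only.
print(driver_at_least(parse_driver_version("570.86.15"), (535,)))  # True
```

On a machine with GPUs, `all(driver_at_least(v, minimum) for v in installed_driver_versions())` applies the same comparison to every device.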

NVIDIA Container Toolkit#

Your Docker environment must support NVIDIA GPUs. For more information, refer to NVIDIA Container Toolkit.