Support Matrix#

Models#

| Model Name | Model ID | Max Tokens | Publisher |
|---|---|---|---|
| Llama-3.2-NV-RerankQA-1B-v2 | nvidia/llama-3-2-nv-rerankqa-1b-v2 | 8192 (optimized models) | NVIDIA |
| NV-RerankQA-Mistral4B-v3 | nvidia/nv-rerankqa-mistral-4b-v3 | 512 | NVIDIA |

Note that when `truncate` is set to `END`, any query/passage pair that exceeds the maximum token length is truncated from the right, starting with the passage.
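As a sketch of how the `truncate` option is passed, the request below targets the NIM reranking endpoint; the host, port, and example query/passage texts are assumptions for illustration, so adjust them for your deployment.

```python
# Sketch of a reranking request that sets "truncate": "END", assuming a NIM
# text-reranking service at http://localhost:8000/v1/ranking (hypothetical
# host/port for this example).
import json
from urllib import request

payload = {
    "model": "nvidia/llama-3-2-nv-rerankqa-1b-v2",
    "query": {"text": "Which GPUs support FP8 for this model?"},
    "passages": [
        {"text": "H100 GPUs support FP16 and FP8 precision."},
        {"text": "A10G GPUs support FP16 precision."},
    ],
    # With END, pairs longer than the max token length are truncated from
    # the right, starting with the passage, rather than rejected.
    "truncate": "END",
}

req = request.Request(
    "http://localhost:8000/v1/ranking",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "Accept": "application/json"},
)
# resp = request.urlopen(req)            # uncomment against a running NIM
# rankings = json.load(resp)["rankings"] # passages sorted by relevance
print(req.get_method(), req.full_url)
```

The response ranks each passage against the query; without `"truncate": "END"`, an over-length pair would instead produce an error.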

Supported Hardware#

Llama-3.2-NV-RerankQA-1B-v2#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 PCIe | 40 & 80 | FP16 |
| A100 SXM4 | 40 & 80 | FP16 |
| H100 PCIe | 80 | FP16 & FP8 |
| H100 HBM3 | 80 | FP16 & FP8 |
| H100 NVL | 80 | FP16 & FP8 |
| L40s | 48 | FP16 & FP8 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 & FP8 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 3.6 | FP16 | 19.6 |

NV-RerankQA-Mistral4B-v3#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 PCIe | 80 | FP16 |
| A100 SXM4 | 80 | FP16 |
| H100 HBM3 | 80 | FP16 & FP8 |
| L40s | 48 | FP16 & FP8 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 9 | FP16 | 23 |

Software#

NVIDIA Driver#

Release 1.0.0 uses Triton Inference Server 24.05. Refer to the Triton Inference Server Release Notes for supported NVIDIA driver versions.

NVIDIA Container Toolkit#

Your Docker environment must support NVIDIA GPUs. Refer to the NVIDIA Container Toolkit documentation for more information.
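A quick way to confirm the toolkit is working is to run a GPU-enabled container; the CUDA image tag below is illustrative, and any CUDA base image available for your platform should work.

```shell
# Verify that Docker can expose GPUs via the NVIDIA Container Toolkit.
# Requires a host with NVIDIA drivers installed; the image tag is an example.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

If the toolkit is configured correctly, `nvidia-smi` prints the driver version and the GPUs visible inside the container.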