# Support Matrix

## Models

| Model Name | Model ID | Max Tokens | Publisher |
|---|---|---|---|
| Llama-3.2-NV-RerankQA-1B-v2 | nvidia/llama-3-2-nv-rerankqa-1b-v2 | 8192 (optimized models) | NVIDIA |
| NV-RerankQA-Mistral4B-v3 | nvidia/nv-rerankqa-mistral-4b-v3 | 512 | NVIDIA |
Note that when `truncate` is set to `END`, any query/passage pair longer than the maximum token length is truncated from the right, starting with the passage.
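As a sketch, a ranking request that enables this truncation behavior might be built as follows. The payload field names (`model`, `query`, `passages`, `truncate`) are assumptions for illustration; consult the API reference for the exact request schema.

```python
import json

# Hypothetical payload for a NIM ranking request; the field names below
# are assumptions, not a confirmed schema.
payload = {
    "model": "nvidia/nv-rerankqa-mistral-4b-v3",
    "query": {"text": "How much GPU memory does an NVIDIA L4 have?"},
    "passages": [
        {"text": "The NVIDIA L4 has 24 GB of GPU memory."},
        {"text": "The NVIDIA H100 NVL has 80 GB of GPU memory."},
    ],
    # With this model's 512-token limit, an over-long query/passage pair
    # is truncated from the right, starting with the passage.
    "truncate": "END",
}

body = json.dumps(payload)
```

The serialized `body` would then be sent as the JSON request to the ranking endpoint.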
## Supported Hardware

### Llama-3.2-NV-RerankQA-1B-v2

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 PCIe | 40 & 80 | FP16 |
| A100 SXM4 | 40 & 80 | FP16 |
| H100 PCIe | 80 | FP16 & FP8 |
| H100 HBM3 | 80 | FP16 & FP8 |
| H100 NVL | 80 | FP16 & FP8 |
| L40s | 48 | FP16 & FP8 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 & FP8 |
#### Non-optimized configuration

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 3.6 | FP16 | 19.6 |
### NV-RerankQA-Mistral4B-v3

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 PCIe | 80 | FP16 |
| A100 SXM4 | 80 | FP16 |
| H100 HBM3 | 80 | FP16 & FP8 |
| L40s | 48 | FP16 & FP8 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 |
#### Non-optimized configuration

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 9 | FP16 | 23 |
## Software

### NVIDIA Driver

Release 1.0.0 uses Triton Inference Server 24.05. Refer to the Triton Inference Server Release Notes for supported NVIDIA driver versions.

### NVIDIA Container Toolkit

Your Docker environment must support NVIDIA GPUs. Refer to the NVIDIA Container Toolkit documentation for more information.