Optimization for NVIDIA NeMo Retriever Reranking NIM
Use this documentation to learn about optimization for NVIDIA NeMo Retriever Reranking NIM.
NVIDIA NeMo Retriever Reranking NIM (NeMo Retriever Reranking NIM) automatically leverages model- and hardware-specific optimizations intended to improve the performance of the models.
The NIM uses the TensorRT backend for Triton Inference Server for optimized inference of common models across a number of NVIDIA GPUs. If an optimized engine does not exist for a SKU being used, a GPU-agnostic ONNX backend (using the CUDA Execution Provider) is used instead.
The NIM includes multiple optimization profiles, each tailored to a floating-point precision type supported by the SKU.
Automatic Profile Selection
NeMo Retriever Reranking NIM is designed to automatically select the most suitable profile from the list of compatible profiles based on the detected hardware. Each profile consists of different parameters, and these parameters influence the selection process. The selection process is outlined below:
Compatibility check: NeMo Retriever Reranking NIM excludes the profiles that are not runnable with the detected configuration based on the number of GPUs and GPU model.
Backend: The backend is one of TensorRT, ONNX, or PyTorch. The optimized TensorRT profiles are preferred.
Precision: Lower precision profiles are preferred. For example, profiles with FP8 are selected before FP16.
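The three steps above can be sketched in Python as follows. This is only an illustration of the described ordering, not the NIM's actual implementation: the Profile class, the select_profile function, and the rank tables are hypothetical names, the compatibility check is reduced to matching a GPU name string (the GPU-count check is omitted), and the relative order of ONNX and PyTorch is an assumption because the text only states that TensorRT is preferred.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Assumed preference orders (lower is better). The text only states that
# TensorRT is preferred and that FP8 is selected before FP16.
BACKEND_RANK = {"TensorRT": 0}
PRECISION_RANK = {"FP8": 0, "FP16": 1}
LAST = 99  # rank used for any backend or precision not listed above


@dataclass(frozen=True)
class Profile:
    """Hypothetical, simplified view of one optimization profile."""
    name: str
    backend: str
    precision: str
    gpu: Optional[str] = None  # None marks a GPU-agnostic profile


def select_profile(profiles: Sequence[Profile], detected_gpu: str) -> Profile:
    # Step 1, compatibility check: drop profiles built for a different GPU.
    compatible = [p for p in profiles if p.gpu in (None, detected_gpu)]
    if not compatible:
        raise ValueError(f"no runnable profile for {detected_gpu}")
    # Steps 2 and 3: compare backend rank first, then precision rank.
    return min(
        compatible,
        key=lambda p: (
            BACKEND_RANK.get(p.backend, LAST),
            PRECISION_RANK.get(p.precision, LAST),
        ),
    )


# Profiles modeled on the startup log example in this section.
profiles = [
    Profile("onnx", "ONNX", "FP16"),
    Profile("NVIDIA-H100-80GB-HBM3_10.0.1_12_FP8", "TensorRT", "FP8",
            "NVIDIA-H100-80GB-HBM3"),
    Profile("NVIDIA-H100-80GB-HBM3_10.0.1_12", "TensorRT", "FP16",
            "NVIDIA-H100-80GB-HBM3"),
]
print(select_profile(profiles, "NVIDIA-H100-80GB-HBM3").name)
```

In this sketch, a GPU with no matching TensorRT profile falls back to the GPU-agnostic onnx entry, which mirrors the fallback behavior described earlier.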
This selection is logged at startup. For example:
MODEL PROFILES
- Compatible with system and runnable:
- onnx (type=ONNX,precision=FP16)
- NVIDIA-H100-80GB-HBM3_10.0.1_12_FP8 (type=TensorRT,precision=FP8,gpu=NVIDIA-H100-80GB-HBM3)
- NVIDIA-H100-80GB-HBM3_10.0.1_12 (type=TensorRT,precision=FP16,gpu=NVIDIA-H100-80GB-HBM3)
Selected profile: NVIDIA-H100-80GB-HBM3_10.0.1_12_FP8
Profile metadata: type: TensorRT
Profile metadata: precision: FP8
Profile metadata: gpu: NVIDIA-H100-80GB-HBM3
Profile metadata: trt_version: 10.0.1
Profile metadata: cuda_major_version: 12
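To confirm from a script which profile was chosen, the startup log can be scanned for the "Selected profile:" and "Profile metadata:" lines. The following is a minimal sketch that assumes the log lines look like the example above, possibly with a prefix such as a timestamp; parse_selected_profile is a hypothetical helper, not part of the NIM.

```python
import re
from typing import Dict, Optional, Tuple

# Patterns assume the line shapes shown in the example log above.
SELECTED_RE = re.compile(r"Selected profile:\s*(\S+)")
METADATA_RE = re.compile(r"Profile metadata:\s*([^:\s]+):\s*(.+)$")


def parse_selected_profile(log_text: str) -> Tuple[Optional[str], Dict[str, str]]:
    """Return the selected profile name and its metadata from startup log text."""
    selected = None
    metadata: Dict[str, str] = {}
    for raw in log_text.splitlines():
        line = raw.strip()
        match = SELECTED_RE.search(line)
        if match:
            selected = match.group(1)
            continue
        match = METADATA_RE.search(line)
        if match:
            metadata[match.group(1)] = match.group(2).strip()
    return selected, metadata


example_log = """\
Selected profile: NVIDIA-H100-80GB-HBM3_10.0.1_12_FP8
Profile metadata: type: TensorRT
Profile metadata: precision: FP8
Profile metadata: gpu: NVIDIA-H100-80GB-HBM3
Profile metadata: trt_version: 10.0.1
Profile metadata: cuda_major_version: 12
"""
name, meta = parse_selected_profile(example_log)
print(name, meta["type"], meta["precision"])
```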
Overriding Profile Selection
Attention
To override this behavior, set a specific profile ID with -e NIM_MODEL_PROFILE=<value>. The following list-model-profiles command lists the available profiles for the IMG_NAME NeMo Retriever Reranking NIM:
docker run --rm --runtime=nvidia --gpus=all $IMG_NAME list-model-profiles
MODEL PROFILES
- Compatible with system and runnable:
- f6ecf987d3acacf363a4125195c7f84cd4110520157f9091702a3061e2fd69fa (h100-nvl-fp8-triton-tensorrt)
- 58610f044b526218e323a350f3bc22d6fdaa775b3b634fc3b671ca29ba8848d6 (h100-pcie-fp8-triton-tensorrt)
- 761a1460449bba9831b61bcc1e4e2ce93bfc66f5badecf6157009a0e0d18d934 (h100-hbm3-80gb-fp8-triton-tensorrt)
...
In the previous example, set -e NIM_MODEL_PROFILE="f6ecf987d3acacf363a4125195c7f84cd4110520157f9091702a3061e2fd69fa" to run the H100 NVL FP8 profile.
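Copying a 64-character profile ID by hand is error-prone, so a script can look the ID up by its readable name instead. The sketch below assumes the list-model-profiles output has the "- <id> (<name>)" shape shown above; parse_profiles and profile_env_flag are hypothetical helper names, and the code only builds the -e argument, it does not run docker.

```python
import re
from typing import Dict, List

# Matches lines such as "- <hex id> (<profile name>)" from the output above.
PROFILE_LINE = re.compile(r"^\s*-\s*([0-9a-f]+)\s+\(([^)]+)\)\s*$")


def parse_profiles(output: str) -> Dict[str, str]:
    """Map each readable profile name to its profile ID."""
    profiles: Dict[str, str] = {}
    for line in output.splitlines():
        match = PROFILE_LINE.match(line)
        if match:
            profile_id, name = match.groups()
            profiles[name] = profile_id
    return profiles


def profile_env_flag(profiles: Dict[str, str], name: str) -> List[str]:
    """Build the docker run arguments that pin the named profile."""
    return ["-e", f"NIM_MODEL_PROFILE={profiles[name]}"]


sample_output = """\
MODEL PROFILES
- Compatible with system and runnable:
- f6ecf987d3acacf363a4125195c7f84cd4110520157f9091702a3061e2fd69fa (h100-nvl-fp8-triton-tensorrt)
- 58610f044b526218e323a350f3bc22d6fdaa775b3b634fc3b671ca29ba8848d6 (h100-pcie-fp8-triton-tensorrt)
- 761a1460449bba9831b61bcc1e4e2ce93bfc66f5badecf6157009a0e0d18d934 (h100-hbm3-80gb-fp8-triton-tensorrt)
"""
available = parse_profiles(sample_output)
print(profile_env_flag(available, "h100-nvl-fp8-triton-tensorrt"))
```

The returned list can be spliced into the argument list of a docker run invocation ahead of the image name.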
To run list-model-profiles on a host with no available GPUs, pass -e NIM_CPU_ONLY=1 to the docker run command to set the NIM_CPU_ONLY environment variable.
Warning
In the current version of NeMo Retriever Reranking NIM, the list-model-profiles command fails to run on hosts that don’t have an NVIDIA GPU, even when NIM_CPU_ONLY is set.
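For scripted use, the list-model-profiles invocation, with or without NIM_CPU_ONLY, could be assembled as shown in this sketch. The function name list_profiles_command and the image name in the usage line are hypothetical placeholders. The GPU-related flags are copied unchanged from the command shown earlier; whether they should be kept on a GPU-less host is not stated here, and per the warning above the CPU-only variant is expected to fail in the current version.

```python
import shlex
from typing import List


def list_profiles_command(img_name: str, cpu_only: bool = False) -> List[str]:
    """Build the argv for listing model profiles of the given NIM image."""
    # Base flags are taken from the docker run example in this section.
    cmd = ["docker", "run", "--rm", "--runtime=nvidia", "--gpus=all"]
    if cpu_only:
        # Sets the NIM_CPU_ONLY environment variable inside the container.
        cmd += ["-e", "NIM_CPU_ONLY=1"]
    # docker run options must come before the image name.
    return cmd + [img_name, "list-model-profiles"]


# "example/reranking-nim:latest" is a placeholder, not a real image name.
print(shlex.join(list_profiles_command("example/reranking-nim:latest", cpu_only=True)))
```

Returning a list rather than a single string lets the result be passed directly to subprocess.run without shell quoting concerns.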