Optimization for NeMo Retriever Text Reranking NIM

Use this documentation to learn about optimization for NeMo Retriever Text Reranking NIM.

NeMo Retriever Text Reranking NIM (Text Reranking NIM) automatically leverages model- and hardware-specific optimizations intended to improve the performance of the models.

The NIM uses the TensorRT backend for Triton Inference Server for optimized inference of common models across a number of NVIDIA GPUs. If an optimized engine does not exist for the GPU SKU in use, the NIM falls back to a GPU-agnostic ONNX backend (using the CUDA Execution Provider).

The NIM includes multiple optimization profiles, tailored to the floating-point precision types supported by each SKU.

Automatic Profile Selection

Text Reranking NIM is designed to automatically select the most suitable profile from the list of compatible profiles based on the detected hardware. Each profile is defined by a set of parameters, and these parameters drive the selection process, which is outlined below:

Compatibility check: Text Reranking NIM excludes the profiles that are not runnable with the detected configuration based on the number of GPUs and GPU model.
Backend: This can be either TensorRT or ONNX. The optimized TensorRT profiles are preferred.
Precision: Lower precision profiles are preferred. For example, profiles with FP8 are selected before FP16.
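The preference order described above can be sketched as a small ranking step. This is an illustration only, not the NIM's actual selection code: the `rank_profiles` function and the three-column input format are hypothetical, and the sketch assumes the compatibility check has already removed profiles that cannot run.

```shell
# Illustrative sketch only: rank already-compatible profiles by the stated
# preferences. Input lines use a hypothetical "<name> <backend> <precision>" format.
rank_profiles() {
  awk '
    {
      backend_rank   = ($2 == "TensorRT") ? 0 : 1  # TensorRT preferred over ONNX
      precision_rank = ($3 == "FP8") ? 0 : 1       # FP8 preferred over FP16
      print backend_rank, precision_rank, $1
    }' | sort -k1,1n -k2,2n | awk '{ print $3 }'
}

# Candidate profiles taken from the startup log example in this section.
candidates='onnx ONNX FP16
NVIDIA-H100-80GB-HBM3_10.0.1_12_FP8 TensorRT FP8
NVIDIA-H100-80GB-HBM3_10.0.1_12 TensorRT FP16'

# The best-ranked profile is the first line of the sorted list.
selected=$(printf '%s\n' "$candidates" | rank_profiles | head -n 1)
echo "Selected profile: $selected"
```

The sketch handles only the two precisions mentioned in this section. Any other precision value is treated the same as FP16.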

This selection is logged at startup. For example:

MODEL PROFILES
- Compatible with system and runnable:
  - onnx                                (type=ONNX,precision=FP16)
  - NVIDIA-H100-80GB-HBM3_10.0.1_12_FP8 (type=TensorRT,precision=FP8,gpu=NVIDIA-H100-80GB-HBM3)
  - NVIDIA-H100-80GB-HBM3_10.0.1_12     (type=TensorRT,precision=FP16,gpu=NVIDIA-H100-80GB-HBM3)
Selected profile: NVIDIA-H100-80GB-HBM3_10.0.1_12_FP8
Profile metadata: type: TensorRT
Profile metadata: precision: FP8
Profile metadata: gpu: NVIDIA-H100-80GB-HBM3
Profile metadata: trt_version: 10.0.1
Profile metadata: cuda_major_version: 12
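To check which profile was chosen without reading the whole log, the `Selected profile:` line can be filtered out. The following is a minimal sketch, assuming the log lines appear exactly as in the example above with no timestamp or other prefix. The `selected_profile` helper and the `my-reranker` container name are hypothetical.

```shell
# Pull the chosen profile name out of startup log text read from stdin.
# Assumes the line starts with "Selected profile: " as in the example log.
selected_profile() {
  sed -n 's/^Selected profile: //p' | head -n 1
}

# Sample log lines copied from the example in this section.
sample_log='MODEL PROFILES
- Compatible with system and runnable:
  - onnx                                (type=ONNX,precision=FP16)
Selected profile: NVIDIA-H100-80GB-HBM3_10.0.1_12_FP8
Profile metadata: type: TensorRT'

profile=$(printf '%s\n' "$sample_log" | selected_profile)
echo "$profile"

# Against a running container (my-reranker is a hypothetical container name):
#   docker logs my-reranker 2>&1 | selected_profile
```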

Overriding Profile Selection

Attention

To override the automatic selection, set a specific profile ID with -e NIM_MODEL_PROFILE=<value>. The following list-model-profiles command lists the available profiles for the Text Reranking NIM image specified by IMG_NAME:

docker run --rm --runtime=nvidia --gpus=all $IMG_NAME list-model-profiles

MODEL PROFILES
- Compatible with system and runnable:
  - f6ecf987d3acacf363a4125195c7f84cd4110520157f9091702a3061e2fd69fa (h100-nvl-fp8-triton-tensorrt)
  - 58610f044b526218e323a350f3bc22d6fdaa775b3b634fc3b671ca29ba8848d6 (h100-pcie-fp8-triton-tensorrt)
  - 761a1460449bba9831b61bcc1e4e2ce93bfc66f5badecf6157009a0e0d18d934 (h100-hbm3-80gb-fp8-triton-tensorrt)
  ...

In the previous example, set -e NIM_MODEL_PROFILE="f6ecf987d3acacf363a4125195c7f84cd4110520157f9091702a3061e2fd69fa" to run the H100 NVL FP8 profile.
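Copying a 64-character profile ID by hand is error-prone, so the lookup can be scripted. The sketch below assumes the listing format shown above (`- <id> (<name>)`). The `profile_id_for` helper is hypothetical and not part of the NIM.

```shell
# Print the profile ID whose descriptive name matches $1.
# Reads list-model-profiles output from stdin; assumes "- <id> (<name>)" lines.
profile_id_for() {
  awk -v want="($1)" '$1 == "-" && $3 == want { print $2 }'
}

# Sample listing copied from the example in this section.
sample_output='MODEL PROFILES
- Compatible with system and runnable:
  - f6ecf987d3acacf363a4125195c7f84cd4110520157f9091702a3061e2fd69fa (h100-nvl-fp8-triton-tensorrt)
  - 58610f044b526218e323a350f3bc22d6fdaa775b3b634fc3b671ca29ba8848d6 (h100-pcie-fp8-triton-tensorrt)'

profile_id=$(printf '%s\n' "$sample_output" | profile_id_for h100-nvl-fp8-triton-tensorrt)
echo "NIM_MODEL_PROFILE=$profile_id"

# With a real image, feed the live listing to the helper, then pass the result
# to docker run together with whatever other options your deployment needs:
#   profile_id=$(docker run --rm --runtime=nvidia --gpus=all $IMG_NAME list-model-profiles \
#                  | profile_id_for h100-nvl-fp8-triton-tensorrt)
#   docker run --rm --runtime=nvidia --gpus=all -e NIM_MODEL_PROFILE="$profile_id" $IMG_NAME
```

If no profile matches the name, `profile_id` is empty, so check it before starting the container.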

To run list-model-profiles on a host with no available GPUs, pass -e NIM_CPU_ONLY=1 to the docker run command to set the NIM_CPU_ONLY environment variable.

Warning

In the current version of Text Reranking NIM, the list-model-profiles command fails to run on hosts that don’t have an NVIDIA GPU, even when NIM_CPU_ONLY is set.