Optimization for NVIDIA NeMo Retriever Reranking NIM

Use this documentation to learn about optimization for NVIDIA NeMo Retriever Reranking NIM.

NVIDIA NeMo Retriever Reranking NIM (NeMo Retriever Reranking NIM) automatically leverages model- and hardware-specific optimizations intended to improve the performance of the models.

The NIM uses the TensorRT backend for Triton Inference Server for optimized inference of common models across a number of NVIDIA GPUs. If no optimized engine exists for the GPU SKU in use, the NIM falls back to a GPU-agnostic ONNX backend that uses the CUDA Execution Provider.

The NIM includes multiple optimization profiles, tailored to the floating-point precision types supported by each SKU.

Automatic Profile Selection

NeMo Retriever Reranking NIM is designed to automatically select the most suitable profile from the list of compatible profiles based on the detected hardware. Each profile consists of different parameters, which influence the selection process. The selection process is outlined below:

  • Compatibility check: NeMo Retriever Reranking NIM excludes the profiles that are not runnable with the detected configuration based on the number of GPUs and GPU model.

  • Backend: The backend can be TensorRT, ONNX, or PyTorch. The optimized TensorRT profiles are preferred.

  • Precision: Lower precision profiles are preferred. For example, profiles with FP8 are selected before FP16.
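
The preference rules above can be illustrated with a small ranking sketch. This is not the NIM's actual selection code. The profile names are hypothetical, and two details are assumptions not stated in this documentation: that backend preference outranks precision preference, and that ONNX ranks ahead of PyTorch.

```shell
#!/bin/sh
set -eu

# Hypothetical profiles that already passed the compatibility check.
# Format: "<name> <backend> <precision>".
candidates='profile-a onnx fp16
profile-b tensorrt fp16
profile-c tensorrt fp8'

# Translate backend and precision into numeric ranks (lower is better),
# sort by backend rank first and precision rank second, then keep the
# name on the first line.
selected="$(printf '%s\n' "$candidates" | awk '
  BEGIN {
    backend_rank["tensorrt"] = 0   # TensorRT profiles are preferred
    backend_rank["onnx"] = 1       # assumed order among the other backends
    backend_rank["pytorch"] = 2    # assumed order among the other backends
    precision_rank["fp8"] = 0      # lower precision is preferred
    precision_rank["fp16"] = 1
  }
  { print backend_rank[$2], precision_rank[$3], $1 }
' | sort -k1,1n -k2,2n | awk 'NR == 1 { print $3 }')"

printf '%s\n' "$selected"
```

With these sample candidates, the TensorRT FP8 entry ranks first on both criteria, so the result does not depend on which criterion takes priority.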

The model profile is logged at startup. For example, in the log you should see something similar to the following.

MODEL PROFILES
- Compatible with system and runnable:
  - <profile name 1>  (type=<type>,precision=<precision>)
  - <profile name 2>  (type=<type>,precision=<precision>,gpu=<gpu>)
  - <profile name 3>  (type=<type>,precision=<precision>,gpu=<gpu>)
Selected profile: <profile name>
Profile metadata: type: <type>
Profile metadata: precision: <precision>
...
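
To confirm which profile was chosen without reading the whole log, the `Selected profile:` line can be filtered out. The following is a minimal sketch that runs against the template lines shown above; the variable names are illustrative, and the exact formatting of a real log line (for example, a leading timestamp) is an assumption.

```shell
#!/bin/sh
set -eu

# Template lines copied from the log example above; a real log shows
# concrete values in place of the <...> placeholders.
sample_log='Selected profile: <profile name>
Profile metadata: type: <type>
Profile metadata: precision: <precision>'

# Keep only the text after "Selected profile: ". The leading .* tolerates
# any prefix (such as a timestamp) that a real log line might carry.
selected_profile="$(printf '%s\n' "$sample_log" | sed -n 's/.*Selected profile: //p')"

printf '%s\n' "$selected_profile"
```

Against a running container, the same `sed` filter could be fed from `docker logs <container name> 2>&1`, where `<container name>` is a placeholder for your container.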

Override Profile Selection

To override automatic profile selection, set a specific profile ID in the docker run command by including -e NIM_MODEL_PROFILE=<value>. The following command lists the available profiles:

docker run --rm --runtime=nvidia --gpus=all $IMG_NAME list-model-profiles

Note

To run list-model-profiles on a host with no available GPUs, include -e NIM_CPU_ONLY=1 in the docker run command.

You should see output similar to the following.

MODEL PROFILES
- Compatible with system and runnable:
  - f6ecf987d3acacf363a4125195c7f84cd4110520157f9091702a3061e2fd69fa (h100-nvl-fp8-triton-tensorrt)
  - 58610f044b526218e323a350f3bc22d6fdaa775b3b634fc3b671ca29ba8848d6 (h100-pcie-fp8-triton-tensorrt)
  - 761a1460449bba9831b61bcc1e4e2ce93bfc66f5badecf6157009a0e0d18d934 (h100-hbm3-80gb-fp8-triton-tensorrt)
  ...

In the previous example, set -e NIM_MODEL_PROFILE="f6ecf987d3acacf363a4125195c7f84cd4110520157f9091702a3061e2fd69fa" to run the H100 NVL FP8 profile.
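
Copying a 64-character ID by hand is error-prone, so the lookup can be scripted. The sketch below assumes the `list-model-profiles` output keeps the `- <id> (<description>)` line shape shown in the example above. The helper name `profile_id_for` is hypothetical, and the sample text stands in for the real command output.

```shell
#!/bin/sh
set -eu

# Print the ID of the first profile whose description matches a pattern.
# Reads list-model-profiles output on stdin and expects lines shaped like
# "  - <id> (<description>)".
profile_id_for() {
  awk -v pat="$1" '!found && $1 == "-" && $3 ~ pat { print $2; found = 1 }'
}

# Sample output copied from the example above. In practice, pipe the
# output of the list-model-profiles command into profile_id_for instead.
sample_output='MODEL PROFILES
- Compatible with system and runnable:
  - f6ecf987d3acacf363a4125195c7f84cd4110520157f9091702a3061e2fd69fa (h100-nvl-fp8-triton-tensorrt)
  - 58610f044b526218e323a350f3bc22d6fdaa775b3b634fc3b671ca29ba8848d6 (h100-pcie-fp8-triton-tensorrt)
  - 761a1460449bba9831b61bcc1e4e2ce93bfc66f5badecf6157009a0e0d18d934 (h100-hbm3-80gb-fp8-triton-tensorrt)'

profile_id="$(printf '%s\n' "$sample_output" | profile_id_for 'h100-nvl')"

# Print the flag to add to the docker run command.
printf '%s\n' "-e NIM_MODEL_PROFILE=$profile_id"
```

The printed flag can then be added to the `docker run` command that starts the NIM.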