Optimization for NVIDIA NeMo Retriever Reranking NIM

Use this documentation to learn about runtime optimization for NVIDIA NeMo Retriever Reranking NIM.

Automatic Pipeline Selection

Starting in version 2.0.0, at startup NeMo Retriever Reranking NIM detects the compute capability of the GPU and automatically selects an optimized inference pipeline for that GPU. The selected pipeline determines the CUDA kernels, precompiled cuDNN plans, attention implementation, and default precision. No user action is required to select profiles, and the list-model-profiles command is not available.
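To anticipate which compute capability the NIM will detect, you can query the GPU yourself before starting the container. The following is a minimal sketch, not part of the NIM: `cc_to_log_form` is a hypothetical helper name, and it assumes that your driver's `nvidia-smi` supports the `compute_cap` query field and that a value such as 9.0 corresponds to `cc=90` in the startup log.

```shell
# Hypothetical helper (not shipped with the NIM): convert a compute
# capability such as "9.0" into the compact form "90" by removing
# dots and spaces.
cc_to_log_form() {
  printf '%s\n' "$1" | tr -d '. '
}

# Usage, assuming nvidia-smi is on PATH and supports compute_cap:
# cc_to_log_form "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
```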

The selected pipeline and its precision are logged at startup. For example, the log includes lines similar to the following.

INFO CUDA device ready gpu="NVIDIA H100 80GB HBM3" cc=90
INFO loading rerank engines engine_count=2 ...
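If you want to confirm the detected GPU from a script, you can extract the fields from log lines like the ones above. This is a sketch only: `parse_cc` and `parse_gpu_name` are hypothetical helpers, and they assume that the log keeps the `CUDA device ready gpu="..." cc=NN` shape shown in the example, which might differ between releases.

```shell
# Hypothetical helpers (not part of the NIM). Both read log text on stdin
# and assume a line shaped like:
#   INFO CUDA device ready gpu="NVIDIA H100 80GB HBM3" cc=90

# Print the first compute capability value found (for example, 90).
parse_cc() {
  sed -n 's/.*CUDA device ready.* cc=\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print the first GPU name found, without the surrounding quotes.
parse_gpu_name() {
  sed -n 's/.*CUDA device ready.*gpu="\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (the container name is an assumption):
# docker logs my-reranking-nim 2>&1 | parse_cc
```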

Precision Override

The runtime selects a default precision as part of pipeline selection. The default weight precision is FP16 across all supported SKUs. To opt into FP8 on H100, H200, and RTX PRO 6000 GPUs, set NIM_PRECISION=fp8 in the docker run command, as shown in the following example. For details about which SKUs support FP8, refer to Support Matrix for NVIDIA NeMo Retriever Reranking NIM. If no pipeline is available for the precision that you request, the runtime falls back to the default pipeline and precision for the GPU and logs a warning.

docker run ... \
  -e NIM_PRECISION=fp8 \
  $IMG_NAME
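In a deployment script that targets mixed hardware, you might want to request FP8 only on GPUs that can use it. The sketch below is one way to do that. The function names are hypothetical, and matching on substrings of the GPU name is an assumption based on the GPUs named above, so treat the Support Matrix as the authoritative list.

```shell
# Hypothetical helpers (not part of the NIM). The name patterns are
# assumptions derived from the GPUs listed in this section.

# Return success if the GPU name matches a GPU listed as FP8-capable.
supports_fp8_optin() {
  case "$1" in
    *H100*|*H200*|*"RTX PRO 6000"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the docker arguments that request FP8, or print nothing so that
# the runtime keeps its default precision.
precision_env_args() {
  if supports_fp8_optin "$1"; then
    printf '%s\n' "-e NIM_PRECISION=fp8"
  fi
}

# Usage, assuming nvidia-smi is on PATH:
# gpu_name="$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)"
# docker run ... $(precision_env_args "$gpu_name") $IMG_NAME
```

Because the runtime already falls back to the default precision when the requested one is unavailable, a check like this is a convenience that avoids the warning rather than a requirement.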

Override Profile Selection

Starting in version 2.0.0, profile selection is automatic and you cannot override it by using NIM_MODEL_PROFILE. To request a specific precision, refer to Precision Override.
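If you are migrating launch scripts from an earlier release, a small guard can flag a leftover NIM_MODEL_PROFILE setting. This is a sketch under assumptions: `check_legacy_profile_env` is a hypothetical name, and the message wording is illustrative rather than anything the NIM prints.

```shell
# Hypothetical guard (not part of the NIM): warn when a launch environment
# still sets NIM_MODEL_PROFILE, which cannot override profile selection
# starting in version 2.0.0.
check_legacy_profile_env() {
  if [ -n "${NIM_MODEL_PROFILE:-}" ]; then
    echo "NIM_MODEL_PROFILE is set, but it cannot override profile selection; use NIM_PRECISION to request a precision." >&2
    return 1
  fi
  return 0
}

# Usage: run the guard before docker run and decide how to react.
# check_legacy_profile_env || echo "Remove NIM_MODEL_PROFILE from your launch script."
```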