Optimization for NVIDIA NeMo Retriever Embedding NIM#

Use this documentation to learn about optimization for NVIDIA NeMo Retriever Embedding NIM.

NVIDIA NeMo Retriever Embedding NIM (NeMo Retriever Embedding NIM) automatically applies model- and hardware-specific optimizations to improve model performance.

The NIM uses the TensorRT backend for Triton Inference Server for optimized inference of common models across a number of NVIDIA GPUs. If an optimized engine does not exist for the GPU SKU in use, the NIM falls back to a GPU-agnostic ONNX backend (using the CUDA Execution Provider).

The NIM includes multiple optimization profiles, tailored to the floating-point precision types supported by each GPU SKU.

Automatic Profile Selection#

NeMo Retriever Embedding NIM is designed to automatically select the most suitable profile from the list of compatible profiles based on the detected hardware. Each profile consists of different parameters that influence the selection process. The automatic selection process considers the following factors:

  • Compatibility check: Automatic selection excludes profiles that cannot run on the detected configuration, based on the GPU model and the number of GPUs.

  • Backend: The backend is TensorRT, ONNX, or PyTorch. Automatic selection prefers the optimized TensorRT profiles.

  • Precision: Automatic selection prefers lower precision profiles. For example, profiles with FP8 are selected before FP16.
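The factors above amount to a filter-then-rank procedure. The following Python sketch illustrates that ordering; the profile fields, preference tables, and function are illustrative assumptions, not the actual NIM implementation:

```python
# Hypothetical sketch of the selection order described above; field names and
# ranking tables are assumptions, not the actual NIM implementation.

BACKEND_RANK = {"tensorrt": 0, "onnx": 1, "pytorch": 2}  # TensorRT preferred
PRECISION_RANK = {"fp8": 0, "fp16": 1, "fp32": 2}        # lower precision preferred

def select_profile(profiles, detected_gpu, gpu_count):
    """Exclude profiles not runnable on this system, then pick the most preferred."""
    compatible = [
        p for p in profiles
        if p.get("gpu") in (None, detected_gpu)      # GPU-agnostic or matching GPU
        and p.get("min_gpus", 1) <= gpu_count        # enough GPUs present
    ]
    if not compatible:
        raise RuntimeError("no runnable profile for this system")
    return min(
        compatible,
        key=lambda p: (BACKEND_RANK[p["backend"]], PRECISION_RANK[p["precision"]]),
    )

profiles = [
    {"backend": "onnx", "precision": "fp16"},                    # GPU-agnostic fallback
    {"backend": "tensorrt", "precision": "fp16", "gpu": "H100"},
    {"backend": "tensorrt", "precision": "fp8", "gpu": "H100"},
    {"backend": "tensorrt", "precision": "fp8", "gpu": "A100"},  # wrong GPU: excluded
]

best = select_profile(profiles, detected_gpu="H100", gpu_count=1)
# The TensorRT FP8 profile wins over TensorRT FP16 and the ONNX fallback.
```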

The model profile is logged at startup. For example, in the log you should see something similar to the following.

MODEL PROFILES
- Compatible with system and runnable:
  - <profile name 1>  (type=<type>,precision=<precision>)
  - <profile name 2>  (type=<type>,precision=<precision>,gpu=<gpu>)
  - <profile name 3>  (type=<type>,precision=<precision>,gpu=<gpu>)
Selected profile: <profile name>
Profile metadata: type: <type>
Profile metadata: precision: <precision>
...

Override Profile Selection#

To override automatic profile selection, set a specific profile ID in the docker run command by including -e NIM_MODEL_PROFILE=<value>. The following command lists the available profiles:

docker run --rm --runtime=nvidia --gpus=all $IMG_NAME list-model-profiles

Note

To run list-model-profiles on a host with no available GPUs, include -e NIM_CPU_ONLY=1 in the docker run command.
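For instance, on a host without GPUs the listing command could look like the following (this is a sketch: it drops the GPU flags from the command above and assumes $IMG_NAME is set as before):

```shell
# Hypothetical invocation on a GPU-less host: no --runtime/--gpus flags,
# NIM_CPU_ONLY=1 enables CPU-only profile listing.
docker run --rm -e NIM_CPU_ONLY=1 $IMG_NAME list-model-profiles
```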

You should see output similar to the following.

MODEL PROFILES
- Compatible with system and runnable:
  - f6ecf987d3acacf363a4125195c7f84cd4110520157f9091702a3061e2fd69fa (h100-nvl-fp8-triton-tensorrt)
  - 58610f044b526218e323a350f3bc22d6fdaa775b3b634fc3b671ca29ba8848d6 (h100-pcie-fp8-triton-tensorrt)
  - 761a1460449bba9831b61bcc1e4e2ce93bfc66f5badecf6157009a0e0d18d934 (h100-hbm3-80gb-fp8-triton-tensorrt)
  ...

In the previous example, set -e NIM_MODEL_PROFILE="f6ecf987d3acacf363a4125195c7f84cd4110520157f9091702a3061e2fd69fa" to run the H100 NVL FP8 profile.
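Putting this together, a launch command with the override might look like the following sketch; the port mapping is an assumption for a typical NIM deployment, and $IMG_NAME is assumed to be set as in the listing command above:

```shell
# Hypothetical launch with an explicit profile override; the -p mapping is an
# assumed example, not taken from this document.
docker run --rm --runtime=nvidia --gpus=all \
  -e NIM_MODEL_PROFILE="f6ecf987d3acacf363a4125195c7f84cd4110520157f9091702a3061e2fd69fa" \
  -p 8000:8000 \
  $IMG_NAME
```

If the specified profile is not runnable on the detected hardware, startup fails rather than silently falling back to automatic selection, so copy the profile ID from the list-model-profiles output for your own system.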