Optimization

NeMo Retriever Text Embedding NIM (Text Embedding NIM) automatically leverages model- and hardware-specific optimizations intended to improve the performance of embedding models.

The NVIDIA TensorRT-accelerated NIM backend provides optimized versions of common models across a number of NVIDIA GPUs. If no optimized engine exists for the GPU SKU in use, a GPU-agnostic ONNX backend (using the CUDA Execution Provider) is used instead.

The TensorRT NIM backend includes multiple optimization profiles, tailored to the floating-point precision types supported by each SKU.

Automatic Profile Selection

Text Embedding NIM is designed to automatically select the most suitable profile from the list of compatible profiles based on the detected hardware. Each profile is described by a set of parameters that drive the selection. The selection logic applies the following criteria in order:

Compatibility check: Text Embedding NIM filters out the profiles that are not runnable with the detected configuration based on the number and type of GPUs available.
Backend: This can be either TensorRT or ONNX. The optimized TensorRT profiles are preferred over ONNX when available.
Precision: Lower precision profiles are preferred when available. For example, Text Embedding NIM will automatically select FP8 profiles over FP16.
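
The ordering above can be illustrated with a minimal Python sketch. This is not the NIM implementation: the `Profile` class, the `select_profile` function, and the rank tables are hypothetical names introduced here, and the compatibility check is simplified to a GPU-type match (the real check also considers the number of GPUs).

```python
from dataclasses import dataclass
from typing import Optional

# Assumed rankings for illustration only: a lower value is preferred.
BACKEND_RANK = {"tensorrt": 0, "onnx": 1}
PRECISION_RANK = {"fp8": 0, "fp16": 1, "fp32": 2}


@dataclass(frozen=True)
class Profile:
    """Hypothetical stand-in for one entry in the profile list."""
    profile_id: str
    backend: str                 # "TensorRT" or "ONNX"
    precision: str               # for example "fp16"
    gpu: Optional[str] = None    # None means GPU-agnostic (the ONNX case)


def select_profile(profiles, detected_gpu):
    """Filter to compatible profiles, then prefer TensorRT and lower precision."""
    # Step 1: compatibility check (simplified to GPU type only).
    compatible = [p for p in profiles if p.gpu is None or p.gpu == detected_gpu]
    if not compatible:
        raise ValueError("no compatible profile for " + detected_gpu)
    # Steps 2 and 3: sort key is (backend rank, precision rank).
    return min(
        compatible,
        key=lambda p: (BACKEND_RANK[p.backend.lower()],
                       PRECISION_RANK[p.precision.lower()]),
    )
```

Under these assumptions, a system with an A100 that sees one GPU-agnostic ONNX profile and one A100 TensorRT profile would pick the TensorRT profile, and would fall back to ONNX only if no TensorRT profile matched the detected GPU.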

This selection is logged at startup. For example:

MODEL PROFILES
- Compatible with system and runnable:
  - ea95501d4e68ce0adf88f962526d1fbac0fa64032495b6f6a421b2c37b62433f (type=ONNX,precision=fp16)
  - ef90a4689f79683f170e365ac4d6988f8c9eb9c4f81f2c58827c3f27ed7cd5a8 (type=TensorRT,precision=fp16,gpu=NVIDIA-A100)
Selected profile: ef90a4689f79683f170e365ac4d6988f8c9eb9c4f81f2c58827c3f27ed7cd5a8
Profile metadata: type: TensorRT
Profile metadata: precision: fp16
Profile metadata: gpu: NVIDIA-A100
Profile metadata: trt_version: 10.0.1
Profile metadata: cuda_major_version: 12

Overriding Profile Selection

Attention

To override the automatic selection, set a specific profile ID by passing -e NIM_MODEL_PROFILE=<value> to docker run. The following list-model-profiles command lists the available profiles for the Text Embedding NIM image specified by IMG_NAME:

docker run --rm --runtime=nvidia --gpus=all $IMG_NAME list-model-profiles

MODEL PROFILES
 - Compatible with system and runnable:
  - ea95501d4e68ce0adf88f962526d1fbac0fa64032495b6f6a421b2c37b62433f (type=ONNX,precision=fp16)
  - ef90a4689f79683f170e365ac4d6988f8c9eb9c4f81f2c58827c3f27ed7cd5a8 (type=TensorRT,precision=fp16,gpu=NVIDIA-A100)
 - Incompatible with system:
   - 2ad8fe56fc5c2cd108b4d177286fbd6c6ea5dcd3de3995cb9aeb83f80ddd5c9e (type=TensorRT,precision=fp16,gpu=NVIDIA-L4)
   - 72972bc1f4dc3f79aff74aa6a9a26d24d27d6f9c74c455dc904b2a2a20f156d2 (type=TensorRT,precision=fp16,gpu=NVIDIA-A10G)
   - c963a7abf0fada9bb5071b700bf7aad8d9d4c993e0995a086ddddd106f1aa4be (type=TensorRT,precision=fp16,gpu=NVIDIA-L40S)
   - f5cfb1a2c2f00bff7f504e78bcce237903a4d257cbb0086ea7856c2df3458a5f (type=TensorRT,precision=fp16,gpu=NVIDIA-H100)

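Using the profile IDs from the previous output, you can set -e NIM_MODEL_PROFILE="f5cfb1a2c2f00bff7f504e78bcce237903a4d257cbb0086ea7856c2df3458a5f" to run the H100 FP16 profile on a system with an H100 GPU. On the A100 system shown above, that profile is listed as incompatible, so choose an ID from the "Compatible with system and runnable" list instead.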
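
When scripting a deployment, it can help to read the list-model-profiles output programmatically before choosing an ID. The following Python sketch parses text in the format shown above. The `parse_profiles` function and the regular expression are hypothetical helpers written for this example and assume the output layout does not differ from the sample.

```python
import re

# Matches lines such as "  - <64 hex chars> (type=ONNX,precision=fp16)".
PROFILE_LINE = re.compile(r"^\s*-\s+(?P<id>[0-9a-f]{64})\s+\((?P<meta>[^)]*)\)\s*$")


def parse_profiles(output):
    """Split list-model-profiles text into compatible and incompatible entries."""
    sections = {"compatible": [], "incompatible": []}
    current = None
    for line in output.splitlines():
        text = line.strip()
        if text.startswith("- Compatible with system"):
            current = "compatible"
        elif text.startswith("- Incompatible with system"):
            current = "incompatible"
        else:
            match = PROFILE_LINE.match(line)
            if match and current is not None:
                # Turn "type=ONNX,precision=fp16" into a dictionary.
                meta = dict(item.split("=", 1) for item in match["meta"].split(","))
                sections[current].append({"id": match["id"], **meta})
    return sections
```

A caller could capture the command output (for example with `subprocess.run(..., capture_output=True, text=True)`), pass its `stdout` to `parse_profiles`, and then pick an `id` from the `compatible` list to use as the NIM_MODEL_PROFILE value.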