Optimization
NeMo Retriever Text Embedding NIM (Text Embedding NIM) automatically leverages model- and hardware-specific optimizations intended to improve the performance of embedding models.
The NVIDIA TensorRT-accelerated NIM backend provides optimized versions of common models for a number of NVIDIA GPUs. If no optimized engine exists for the GPU SKU in use, Text Embedding NIM falls back to a GPU-agnostic ONNX backend that runs on the CUDA Execution Provider.
The TensorRT NIM backend includes multiple optimization profiles, tailored to the floating-point precisions that each SKU supports.
Text Embedding NIM automatically selects the most suitable profile from the list of compatible profiles based on the detected hardware. Each profile is described by a set of parameters (for example, backend type, precision, and target GPU), and the selection logic uses these parameters as follows:
Compatibility check: Text Embedding NIM filters out profiles that cannot run on the detected configuration, based on the number and type of available GPUs.
Backend: Either TensorRT or ONNX. Optimized TensorRT profiles are preferred over ONNX when available.
Precision: Lower-precision profiles are preferred when available. For example, Text Embedding NIM selects an FP8 profile over an FP16 profile.
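The rules above amount to a ranking over the profiles that pass the compatibility check. The following Bash sketch illustrates that ranking on list-model-profiles output. It is a hypothetical illustration, not the NIM implementation: the function names profile_score and select_profile are invented, the compatibility check is reduced to comparing the gpu= field with a GPU name you pass in (GPU count is ignored), and it assumes that backend takes priority over precision, as the order of the rules suggests.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the documented selection rules; NOT the Text
# Embedding NIM implementation. Reads one profile per line on stdin in the
# "<id> (key=value,...)" format printed by list-model-profiles.
# $1: detected GPU name (assumed to use the same format as the gpu= field).

profile_score() {
  local line="$1" score=0
  # Backend: TensorRT outranks ONNX (weight 10 so it dominates precision).
  if [[ "$line" == *"type=TensorRT"* ]]; then
    score=$((score + 10))
  fi
  # Precision: lower precision is preferred, so FP8 outranks FP16.
  if [[ "$line" == *"precision=FP8"* ]]; then
    score=$((score + 1))
  fi
  echo "$score"
}

select_profile() {
  local detected_gpu="$1" gpu_re='gpu=([^,)]+)'
  local line score best_id="" best_score=-1
  while IFS= read -r line; do
    # Strip leading whitespace and an optional "- " list marker.
    line="${line#"${line%%[![:space:]]*}"}"
    line="${line#- }"
    # Skip headers and anything else that is not a profile entry.
    [[ "$line" == *"(type="* ]] || continue
    # Simplified compatibility check: drop profiles built for another GPU.
    # GPU-agnostic profiles (no gpu= field, such as onnx) are always kept.
    if [[ "$line" =~ $gpu_re ]] && [[ "${BASH_REMATCH[1]}" != "$detected_gpu" ]]; then
      continue
    fi
    score=$(profile_score "$line")
    if (( score > best_score )); then
      best_score=$score
      best_id="${line%% *}"   # the profile ID is the text before the first space
    fi
  done
  echo "$best_id"
}
```

Fed the three profiles from the example output with the GPU name NVIDIA-H100-80GB-HBM3, select_profile prints NVIDIA-H100-80GB-HBM3_10.0.1_12_FP8, the same profile reported in the startup log. With any other GPU name, only the GPU-agnostic onnx profile survives the filter and is printed.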
The selection is logged at startup, along with the list of valid profiles. For example:
Detected 3 compatible profile(s).
Valid profile: onnx (type=ONNX,precision=FP16)
Valid profile: NVIDIA-H100-80GB-HBM3_10.0.1_12_FP8 (type=TensorRT,precision=FP8,gpu=NVIDIA-H100-80GB-HBM3)
Valid profile: NVIDIA-H100-80GB-HBM3_10.0.1_12 (type=TensorRT,precision=FP16,gpu=NVIDIA-H100-80GB-HBM3)
Selected profile: NVIDIA-H100-80GB-HBM3_10.0.1_12_FP8
Profile metadata: type: TensorRT
Profile metadata: precision: FP8
Profile metadata: gpu: NVIDIA-H100-80GB-HBM3
Profile metadata: trt_version: 10.0.1
Profile metadata: cuda_major_version: 12
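To confirm which profile a container picked, you can search its startup log for these lines. A minimal Bash sketch, assuming the messages appear as in the example above (possibly behind a timestamp or logger prefix); the helper names selected_profile_from_log and profile_metadata_from_log, and the container name in the usage comment, are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Text Embedding NIM startup logs on stdin and print
# the ID from the first "Selected profile:" line. Any timestamp or logger
# prefix in front of the message is discarded.
selected_profile_from_log() {
  sed -n 's/^.*Selected profile: //p' | head -n 1
}

# Hypothetical helper: print the value of one metadata key (for example,
# precision or trt_version) from the "Profile metadata: <key>: <value>" lines.
profile_metadata_from_log() {
  local key="$1"
  sed -n "s/^.*Profile metadata: ${key}: //p" | head -n 1
}

# Example usage (not executed here; "my-embedding-nim" is a placeholder):
#   docker logs my-embedding-nim 2>&1 | selected_profile_from_log
#   docker logs my-embedding-nim 2>&1 | profile_metadata_from_log precision
```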
Overriding Profile Selection
To override this behavior, set a specific profile ID with -e NIM_MODEL_PROFILE=<value>. The following list-model-profiles command lists the available profiles for the Text Embedding NIM image referenced by $IMG_NAME:
docker run --rm --runtime=nvidia --gpus=all $IMG_NAME list-model-profiles
MODEL PROFILES
- Compatible with system and runnable:
- onnx (type=ONNX,precision=FP16)
- NVIDIA-H100-80GB-HBM3_10.0.1_12_FP8 (type=TensorRT,precision=FP8,gpu=NVIDIA-H100-80GB-HBM3)
- NVIDIA-H100-80GB-HBM3_10.0.1_12 (type=TensorRT,precision=FP16,gpu=NVIDIA-H100-80GB-HBM3)
In the previous example, set -e NIM_MODEL_PROFILE="NVIDIA-H100-80GB-HBM3_10.0.1_12" to run the H100 FP16 profile instead of the automatically selected FP8 profile.
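Because NIM_MODEL_PROFILE takes a profile ID, it can help to check that the ID is listed as runnable before starting the container. The following Bash sketch assumes the list-model-profiles output has the layout shown above; profile_is_listed is a hypothetical helper, and the usage comment uses only the docker flags that appear on this page (any other flags your deployment requires are omitted).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if profile ID "$1" appears in the
# "Compatible with system and runnable" section of list-model-profiles output
# read from stdin. Entries are assumed to look like "- <id> (type=...)".
profile_is_listed() {
  awk -v id="$1" '
    # Start of the runnable-profiles section.
    /Compatible with system and runnable:/ { inside = 1; next }
    # Assumption: any other line ending in ":" starts a different section.
    /: *$/ { inside = 0 }
    # Match "- <id> (" so that an ID ending in _12 does not match _12_FP8.
    inside && index($0, "- " id " (") { found = 1 }
    END { exit !found }
  '
}

# Example usage (not executed here):
#   PROFILE="NVIDIA-H100-80GB-HBM3_10.0.1_12"
#   docker run --rm --runtime=nvidia --gpus=all "$IMG_NAME" list-model-profiles 2>&1 \
#     | profile_is_listed "$PROFILE" \
#     && docker run --rm --runtime=nvidia --gpus=all \
#          -e NIM_MODEL_PROFILE="$PROFILE" "$IMG_NAME"
```

The trailing " (" in the match is a deliberate design choice: profile IDs can be prefixes of one another (NVIDIA-H100-80GB-HBM3_10.0.1_12 and NVIDIA-H100-80GB-HBM3_10.0.1_12_FP8), so a plain substring search on the ID alone would give false positives.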