Optimization for NeMo Retriever Text Embedding NIM#
Use this documentation to learn about optimization for NeMo Retriever Text Embedding NIM.
NeMo Retriever Text Embedding NIM (Text Embedding NIM) automatically applies model- and hardware-specific optimizations that improve model performance.
The NIM uses the TensorRT backend for Triton Inference Server for optimized inference of common models across a number of NVIDIA GPUs. If an optimized engine does not exist for the GPU SKU in use, a GPU-agnostic ONNX backend (using the CUDA Execution Provider) is used instead.
The NIM includes multiple optimization profiles, tailored to the floating-point precision types supported by each SKU.
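Starting the NIM is enough to see this behavior: the container picks an engine at startup and logs its choice. The following is a minimal sketch that assumes common NIM container conventions; the image tag, the `NGC_API_KEY` variable, and the port mapping are placeholders that might differ in your environment.

```shell
# A sketch: start Text Embedding NIM and let it select an optimized
# engine automatically. The image tag below is a hypothetical example.
export IMG_NAME="nvcr.io/nim/nvidia/nv-embedqa-e5-v5:latest"
docker run -it --rm --runtime=nvidia --gpus=all \
  -e NGC_API_KEY \
  -p 8000:8000 \
  $IMG_NAME
```

On startup, the container logs the compatible profiles and the profile it selected, as shown in the next section.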
Automatic Profile Selection#
Text Embedding NIM automatically selects the most suitable profile from the list of compatible profiles based on the detected hardware. Each profile consists of a set of parameters that influence the selection. The selection process, based on those parameters, is as follows:
- Compatibility check: Text Embedding NIM excludes the profiles that are not runnable with the detected configuration, based on the number of GPUs and the GPU model.
- Backend: This can be either TensorRT or ONNX. The optimized TensorRT profiles are preferred.
- Precision: Lower-precision profiles are preferred. For example, profiles with FP8 are selected before FP16.
This selection is logged at startup. For example:
```
MODEL PROFILES
- Compatible with system and runnable:
  - ea95501d4e68ce0adf88f962526d1fbac0fa64032495b6f6a421b2c37b62433f (type=ONNX,precision=FP16)
  - ef90a4689f79683f170e365ac4d6988f8c9eb9c4f81f2c58827c3f27ed7cd5a8 (type=TensorRT,precision=FP16,gpu=NVIDIA-A100)
Selected profile: ef90a4689f79683f170e365ac4d6988f8c9eb9c4f81f2c58827c3f27ed7cd5a8
Profile metadata: type: TensorRT
Profile metadata: precision: FP16
Profile metadata: gpu: NVIDIA-A100
Profile metadata: trt_version: 10.0.1
Profile metadata: cuda_major_version: 12
```
Overriding Profile Selection#
Attention
To override this behavior, set a specific profile ID with `-e NIM_MODEL_PROFILE=<value>`. The following `list-model-profiles` command lists the available profiles for the Text Embedding NIM image `$IMG_NAME`:
```shell
docker run --rm --runtime=nvidia --gpus=all $IMG_NAME list-model-profiles
```

Example output:

```
MODEL PROFILES
- Compatible with system and runnable:
  - f6ecf987d3acacf363a4125195c7f84cd4110520157f9091702a3061e2fd69fa (h100-nvl-fp8-triton-tensorrt)
  - 58610f044b526218e323a350f3bc22d6fdaa775b3b634fc3b671ca29ba8848d6 (h100-pcie-fp8-triton-tensorrt)
  - 761a1460449bba9831b61bcc1e4e2ce93bfc66f5badecf6157009a0e0d18d934 (h100-hbm3-80gb-fp8-triton-tensorrt)
  ...
```
In the previous example, set `-e NIM_MODEL_PROFILE="f6ecf987d3acacf363a4125195c7f84cd4110520157f9091702a3061e2fd69fa"` to run the H100 NVL FP8 profile.
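For example, a complete launch command pinned to that profile might look like the following sketch. The `NGC_API_KEY` variable and the `-p 8000:8000` port mapping are assumptions based on common NIM container conventions and might differ in your deployment.

```shell
# A sketch: launch Text Embedding NIM pinned to the H100 NVL FP8 profile.
# Assumes $IMG_NAME and $NGC_API_KEY are already exported in your shell.
docker run -it --rm --runtime=nvidia --gpus=all \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE="f6ecf987d3acacf363a4125195c7f84cd4110520157f9091702a3061e2fd69fa" \
  -p 8000:8000 \
  $IMG_NAME
```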
To run `list-model-profiles` on a host with no available GPUs, pass `-e NIM_CPU_ONLY=1` to the `docker run` command to set the `NIM_CPU_ONLY` environment variable.
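For example, the command might look like the following sketch, which omits the GPU runtime flags because no GPU is present:

```shell
# A sketch: list profiles on a GPU-less host (but see the warning below).
docker run --rm -e NIM_CPU_ONLY=1 $IMG_NAME list-model-profiles
```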
Warning

In the current version of Text Embedding NIM, the `list-model-profiles` command fails to run on hosts that don't have an NVIDIA GPU, even when `NIM_CPU_ONLY` is set.