Optimization for NVIDIA NeMo Retriever Embedding NIM#
Use this documentation to learn about runtime optimization for NVIDIA NeMo Retriever Embedding NIM.
Automatic Pipeline Selection#
Starting in version 2.0.0, at startup NeMo Retriever Embedding NIM detects the compute capability of the GPU
and automatically selects an optimized inference pipeline for that GPU.
The selected pipeline determines the CUDA kernels, precompiled cuDNN plans, attention implementation, and default precision.
No user action is required to select profiles, and the list-model-profiles command is not available.
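Because pipeline selection depends on the GPU's compute capability, it can help to check what the host reports before you start the container. The following is a minimal sketch, not part of the NIM: it assumes `nvidia-smi` is installed and that the driver supports the `compute_cap` query field. The function names and the sample row are hypothetical and only for illustration.

```shell
# Print "name, compute capability" for each visible GPU, if nvidia-smi is available.
# Assumption: the installed driver supports the compute_cap query field.
show_compute_capability() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
  else
    echo "nvidia-smi not found on PATH" >&2
    return 1
  fi
}

# Extract the compute capability (the last field) from one CSV row.
compute_cap_of() {
  printf '%s\n' "$1" | awk -F', ' '{ print $NF }'
}

# Illustrative row in the format produced by the query above.
compute_cap_of 'NVIDIA L40S, 8.9'   # prints 8.9
```

Call `show_compute_capability` on the host that runs the container. The output only tells you what the runtime detects. It does not tell you which pipeline the runtime selects.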
The NIM logs the selected pipeline and its precision at startup. For example, the log includes a line similar to the following:
INFO initializing CudaEngine from Pipeline pipeline_id=<pipeline-id> requested_precision=fp8 requested_attention=cudnn ...
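To confirm which precision and attention implementation appear in the startup log, you can pull individual `key=value` fields out of that line. The following is a minimal sketch that assumes the log line keeps the space-separated `key=value` layout of the sample above. The helper name `log_field` and the container name in the final comment are hypothetical.

```shell
# Extract the value of a key=value field from a startup log line.
# Usage: log_field <key> <line>
log_field() {
  printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Sample line copied from the documentation; real lines carry more fields.
sample='INFO initializing CudaEngine from Pipeline pipeline_id=<pipeline-id> requested_precision=fp8 requested_attention=cudnn ...'

log_field requested_precision "$sample"   # prints fp8
log_field requested_attention "$sample"   # prints cudnn

# Against a running container (the container name is a placeholder):
# docker logs my-embedding-nim 2>&1 | grep 'initializing CudaEngine'
```

The helper prints nothing when the requested key is absent, so an empty result means the field was not found in the line.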
Precision Override#
The runtime selects a default precision as part of pipeline selection. On some SKUs, such as B200 and L40S, the default is FP8. On other SKUs, the default is FP16. For details about which SKUs support FP8, refer to Support Matrix for NVIDIA NeMo Retriever Embedding NIM.
To request a specific precision, set NIM_PRECISION to fp16 or fp8 in the docker run command.
If no pipeline is available for the precision that you request, the runtime falls back to the default pipeline and precision for the GPU and logs a warning.
The following example requests FP8.
docker run ... \
  -e NIM_PRECISION=fp8 \
  $IMG_NAME
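Because the runtime falls back to the GPU default when it cannot honor the request, a typo in NIM_PRECISION might go unnoticed unless you read the log. The following is a minimal sketch of a wrapper-side check that accepts only the two documented values and otherwise leaves the variable unset. The function name `build_precision_args` is hypothetical, and the check is not part of the NIM.

```shell
# Build the docker argument for a precision override.
# Prints nothing for an empty value so the runtime keeps its default.
build_precision_args() {
  if [ -z "$1" ]; then
    return 0
  fi
  case "$1" in
    fp16|fp8)
      printf '%s\n' "-e NIM_PRECISION=$1"
      ;;
    *)
      echo "unsupported precision: $1 (expected fp16 or fp8)" >&2
      return 1
      ;;
  esac
}

build_precision_args fp8   # prints -e NIM_PRECISION=fp8
```

A wrapper script could splice the printed argument into its own docker run command and stop early when the function returns a nonzero status.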
Override Profile Selection#
Starting in version 2.0.0, profile selection is automatic and you can’t override it by using NIM_MODEL_PROFILE.
To request a specific precision, refer to Precision Override.