Exporting Llama Embedding To ONNX and TensorRT#
The Llama Embedding Model can be exported to ONNX and TensorRT formats for optimized inference deployment.
ONNX Export#
To export a trained Llama Embedding model to ONNX format, use the following steps:
First, convert the Llama model into an embedding model by adding bidirectional attention and pooling capabilities:
from nemo.collections.llm.gpt.model import get_llama_bidirectional_hf_model

model, tokenizer = get_llama_bidirectional_hf_model(
    model_name_or_path='path/to/converted/hf/ckpt',
    normalize=False,
    pooling_mode="avg",
    trust_remote_code=True,
)
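Before exporting, you can optionally sanity-check the wrapped embedding model. The snippet below is a minimal sketch and assumes the wrapper's forward pass accepts input_ids and attention_mask and returns pooled embeddings, matching the input and output names used in the ONNX export configuration below:

import torch

# Hypothetical sanity check: encode a short sentence and inspect the embedding shape.
# Assumes the wrapper takes input_ids/attention_mask and returns pooled embeddings.
encoded = tokenizer(["hello world"], return_tensors="pt", padding=True)
with torch.no_grad():
    embeddings = model(input_ids=encoded["input_ids"], attention_mask=encoded["attention_mask"])
print(embeddings.shape)  # expected: (batch_size, embedding_dim)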
Define input and output configurations for ONNX format:
input_names = ["input_ids", "attention_mask"] dynamic_axes_input = { "input_ids": {0: "batch_size", 1: "seq_length"}, "attention_mask": {0: "batch_size", 1: "seq_length"} } output_names = ["embeddings"] dynamic_axes_output = {"embeddings": {0: "batch_size", 1: "embedding_dim"}}
Initialize the ONNX exporter:
from nemo.export.onnx_llm_exporter import OnnxLLMExporter

onnx_exporter = OnnxLLMExporter(
    onnx_model_dir='/tmp/onnx_output/',
    model=model,
    tokenizer=tokenizer,
)
Export the model to ONNX format:
onnx_exporter.export(
    input_names=input_names,
    output_names=output_names,
    opset=17,  # Using ONNX opset version 17
    dynamic_axes_input=dynamic_axes_input,
    dynamic_axes_output=dynamic_axes_output,
    export_dtype="fp32",  # Exporting in 32-bit floating-point precision
)
print(f"ONNX model exported successfully to: {onnx_exporter.onnx_model_dir}")
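Before building a TensorRT engine, you may want to verify the exported ONNX model with ONNX Runtime. The following is a minimal sketch; the model file name (model.onnx) is an assumption, so adjust the path to whatever the exporter actually wrote to /tmp/onnx_output/:

import onnxruntime as ort

# Assumed file name; check the contents of /tmp/onnx_output/ for the actual ONNX file.
session = ort.InferenceSession("/tmp/onnx_output/model.onnx", providers=["CPUExecutionProvider"])

# Reuse the tokenizer returned by get_llama_bidirectional_hf_model.
encoded = tokenizer(["hello world"], return_tensors="np", padding=True)
outputs = session.run(
    ["embeddings"],
    {"input_ids": encoded["input_ids"], "attention_mask": encoded["attention_mask"]},
)
print(outputs[0].shape)  # expected: (batch_size, embedding_dim)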
TensorRT Export#
After exporting to ONNX, you can convert the model to TensorRT for more optimized inference performance:
Define input profiles for different batch sizes and sequence lengths:
input_profiles = [
    {
        # Each entry lists the [min, opt, max] shapes as [batch_size, seq_length]
        "input_ids": [[1, 3], [16, 128], [64, 256]],
        "attention_mask": [[1, 3], [16, 128], [64, 256]],
        "dimensions": [[1], [16], [64]],
    }
]
Decide whether to build a version-compatible TensorRT engine, which can also be deserialized by later TensorRT versions:
import tensorrt as trt

trt_version_compatible = True
trt_builder_flags = [trt.BuilderFlag.VERSION_COMPATIBLE] if trt_version_compatible else None
Convert ONNX to TensorRT:
onnx_exporter.export_onnx_to_trt(
    trt_model_dir='/tmp/trt_output/',
    profiles=input_profiles,
    override_layernorm_precision_to_fp32=True,
    override_layers_to_fp32=["/model/norm/", "/pooling_module", "/ReduceL2", "/Div"],
    profiling_verbosity="layer_names_only",
    trt_builder_flags=trt_builder_flags,
)
print("TensorRT engine exported successfully to /tmp/trt_output")
The exported TensorRT engine can be used for efficient inference with significantly improved performance compared to the base PyTorch model.
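As a quick check, you can deserialize the generated engine and inspect its I/O tensors with the TensorRT Python API. This is a minimal sketch; the engine file name below is an assumption, so point it at whichever engine file was written to /tmp/trt_output/:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Assumed engine file name; use the actual file produced in /tmp/trt_output/.
with open("/tmp/trt_output/model.plan", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# List the engine's input/output tensors, their shapes (dynamic dims show as -1), and dtypes.
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(name, engine.get_tensor_mode(name), engine.get_tensor_shape(name), engine.get_tensor_dtype(name))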
- For a detailed walkthrough of the ONNX and TensorRT export process, you can follow the Exporting Llama 3.2 Model into Embedding Model To ONNX and TensorRT tutorial, which provides step-by-step instructions and examples.