Exporting Llama Embedding To ONNX and TensorRT#

The Llama Embedding Model can be exported to ONNX and TensorRT formats for optimized inference deployment.

ONNX Export#

To export a trained Llama Embedding model to ONNX format, use the following steps:

  1. First, convert the Llama model into an embedding model by adding bidirectional attention and pooling capabilities:

from nemo.collections.llm.gpt.model import get_llama_bidirectional_hf_model

model, tokenizer = get_llama_bidirectional_hf_model(
    model_name_or_path='path/to/converted/hf/ckpt',
    normalize=False,
    pooling_mode="avg",
    trust_remote_code=True,
)
  2. Define input and output configurations for ONNX format:

input_names = ["input_ids", "attention_mask"]
dynamic_axes_input = {
    "input_ids": {0: "batch_size", 1: "seq_length"},
    "attention_mask": {0: "batch_size", 1: "seq_length"}
}
output_names = ["embeddings"]
dynamic_axes_output = {"embeddings": {0: "batch_size", 1: "embedding_dim"}}
  3. Initialize the ONNX exporter:

from nemo.export.onnx_llm_exporter import OnnxLLMExporter

onnx_exporter = OnnxLLMExporter(
    onnx_model_dir='/tmp/onnx_output/',
    model=model,
    tokenizer=tokenizer,
)
  4. Export the model to ONNX format:

onnx_exporter.export(
    input_names=input_names,
    output_names=output_names,
    opset=17,  # Using ONNX opset version 17
    dynamic_axes_input=dynamic_axes_input,
    dynamic_axes_output=dynamic_axes_output,
    export_dtype="fp32",  # Exporting in 32-bit floating-point precision
)

print(f"ONNX model exported successfully to: {onnx_exporter.onnx_model_dir}")

TensorRT Export#

After exporting to ONNX, you can convert the model to TensorRT for more optimized inference performance:

  1. Define TensorRT optimization profiles covering the batch sizes and sequence lengths you expect at inference time. Each input maps to its [minimum, optimal, maximum] shapes:

input_profiles = [
    {
        # [min_shape, opt_shape, max_shape] per input; shapes are [batch_size, seq_length]
        "input_ids": [[1, 3], [16, 128], [64, 256]],
        "attention_mask": [[1, 3], [16, 128], [64, 256]],
        "dimensions": [[1], [16], [64]],
    }
]
  2. Decide whether to build a version-compatible TensorRT engine, which can be loaded by TensorRT runtimes newer than the version used to build it:

import tensorrt as trt
trt_version_compatible = True
trt_builder_flags = [trt.BuilderFlag.VERSION_COMPATIBLE] if trt_version_compatible else None
  3. Convert the ONNX model to a TensorRT engine:

onnx_exporter.export_onnx_to_trt(
    trt_model_dir='/tmp/trt_output/',
    profiles=input_profiles,
    override_layernorm_precision_to_fp32=True,
    override_layers_to_fp32=["/model/norm/", "/pooling_module", "/ReduceL2", "/Div"],
    profiling_verbosity="layer_names_only",
    trt_builder_flags=trt_builder_flags,
)

print("TensorRT engine exported successfully to /tmp/trt_output")

The exported TensorRT engine can be used for inference with significantly better performance than the base PyTorch model.
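
One convenient way to exercise the engine from Python is Polygraphy's TensorRT runner. The snippet below is a minimal sketch, assuming Polygraphy is installed and that the engine was serialized as model.plan under /tmp/trt_output/ (check trt_model_dir for the actual filename); if the built engine also exposes the "dimensions" input declared in the profile above, it must be added to the feed dict as well:

from polygraphy.backend.common import BytesFromPath
from polygraphy.backend.trt import EngineFromBytes, TrtRunner

# Tokenize a sample batch as NumPy arrays; cast dtypes if the engine reports
# different input types (e.g., int32 on older TensorRT versions).
encoded = tokenizer(["What is the capital of France?"], padding=True, return_tensors="np")

# "model.plan" is an assumed filename; adjust it to the file actually produced.
load_engine = EngineFromBytes(BytesFromPath("/tmp/trt_output/model.plan"))

with TrtRunner(load_engine) as runner:
    outputs = runner.infer(
        feed_dict={
            "input_ids": encoded["input_ids"],
            "attention_mask": encoded["attention_mask"],
        }
    )

print(outputs["embeddings"].shape)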

For a detailed walkthrough of the ONNX and TensorRT export process, you can follow the Exporting Llama 3.2 Model into Embedding Model To ONNX and TensorRT tutorial, which provides step-by-step instructions and examples.