nemo_export.onnx_llm_exporter
Module Contents#
Classes#
OnnxLLMExporter | Exports models to ONNX and runs fast inference.
Functions#
No op decorator. |
Data#
API#
- nemo_export.onnx_llm_exporter.use_pytriton = True#
- nemo_export.onnx_llm_exporter.use_onnxruntime = True#
- class nemo_export.onnx_llm_exporter.OnnxLLMExporter(
- onnx_model_dir: str,
- model: Optional[torch.nn.Module] = None,
- tokenizer=None,
- model_name_or_path: str = None,
- load_runtime: bool = True,
)
Bases: nemo_deploy.ITritonDeployable
Exports models to ONNX and runs fast inference.
Example
    from nemo_export.onnx_llm_exporter import OnnxLLMExporter

    onnx_llm_exporter = OnnxLLMExporter(
        onnx_model_dir="/path/for/onnx_model/files",
        model_name_or_path="/path/for/model/files",
    )
    onnx_llm_exporter.export(
        input_names=["input_ids", "attention_mask", "dimensions"],
        output_names=["embeddings"],
    )
    output = onnx_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
    print("output: ", output)
Initialization
Initializes the ONNX Exporter.
- Parameters:
onnx_model_dir (str) – path for storing the ONNX model files.
model (Optional[torch.nn.Module]) – torch model.
tokenizer (HF or NeMo tokenizer) – tokenizer class.
model_name_or_path (str) – path to a checkpoint or a Hugging Face model ID.
load_runtime (bool) – whether to load the ONNX runtime if an exported model is already available in the onnx_model_dir folder.
- export(
- input_names: list,
- output_names: list,
- example_inputs: dict = None,
- opset: int = 20,
- dynamic_axes_input: Optional[dict] = None,
- dynamic_axes_output: Optional[dict] = None,
- export_dtype: str = 'fp32',
- verbose: bool = False,
)
Performs ONNX conversion from a PyTorch model.
- Parameters:
input_names (list) – names of the model inputs that the ONNX export will use.
output_names (list) – names of the model outputs that the ONNX export will use.
example_inputs (dict) – example inputs for the model to build the engine.
opset (int) – ONNX opset version. Default is 20.
dynamic_axes_input (dict) – variable-length axes for the inputs.
dynamic_axes_output (dict) – variable-length axes for the outputs.
export_dtype (str) – export dtype, "fp16" or "fp32". Default is "fp32".
verbose (bool) – whether to enable verbose output.
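The dynamic-axes arguments can be sketched as follows. This assumes that `dynamic_axes_input` and `dynamic_axes_output` follow the `dynamic_axes` convention of `torch.onnx.export` (tensor name mapped to `{axis index: axis label}`), which this page does not state. The helper `build_dynamic_axes`, the axis labels, and the choice of which inputs vary only along the batch axis are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

AxisMap = Dict[str, Dict[int, str]]


def build_dynamic_axes(
    input_names: List[str],
    output_names: List[str],
    batch_only: Iterable[str] = ("dimensions",),
) -> Tuple[AxisMap, AxisMap]:
    """Build dynamic-axes dicts (hypothetical helper, not part of nemo_export).

    Assumes the torch.onnx.export convention: {tensor_name: {axis: label}}.
    """
    batch_only = set(batch_only)
    dynamic_axes_input: AxisMap = {}
    for name in input_names:
        axes = {0: "batch_size"}
        if name not in batch_only:
            # Token-level inputs also vary along the sequence axis.
            axes[1] = "seq_length"
        dynamic_axes_input[name] = axes
    # Assumption: outputs vary only along the batch axis.
    dynamic_axes_output: AxisMap = {name: {0: "batch_size"} for name in output_names}
    return dynamic_axes_input, dynamic_axes_output


def export_with_dynamic_axes(exporter) -> None:
    """Call export() on an existing OnnxLLMExporter with explicit dynamic axes."""
    input_names = ["input_ids", "attention_mask", "dimensions"]
    output_names = ["embeddings"]
    axes_in, axes_out = build_dynamic_axes(input_names, output_names)
    exporter.export(
        input_names=input_names,
        output_names=output_names,
        dynamic_axes_input=axes_in,
        dynamic_axes_output=axes_out,
        opset=20,
        export_dtype="fp32",
    )
```

`export_with_dynamic_axes` expects an already constructed `OnnxLLMExporter`; the keyword arguments it passes are the ones listed in the signature above.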
- _export_to_onnx(
- input_names: list,
- output_names: list,
- example_inputs: dict = None,
- opset: int = 20,
- dynamic_axes_input: Optional[dict] = None,
- dynamic_axes_output: Optional[dict] = None,
- export_dtype: Union[torch.dtype, str] = 'fp16',
- verbose: bool = False,
)
- export_onnx_to_trt(
- trt_model_dir: str,
- profiles=None,
- override_layernorm_precision_to_fp32: bool = False,
- override_layers_to_fp32: List = None,
- trt_dtype: str = 'fp16',
- profiling_verbosity: str = 'layer_names_only',
- trt_builder_flags: List[tensorrt.BuilderFlag] = None,
)
Performs TensorRT conversion from an ONNX model.
- Parameters:
trt_model_dir (str) – path to store the TensorRT model.
profiles – TensorRT profiles.
override_layernorm_precision_to_fp32 (bool) – whether to force LayerNorm layers to fp32 precision.
override_layers_to_fp32 (List) – names of the layers to convert to fp32.
trt_dtype (str) – "fp16" or "fp32". Default is "fp16".
profiling_verbosity (str) – profiling verbosity. Default is "layer_names_only".
trt_builder_flags (List[tensorrt.BuilderFlag]) – TensorRT-specific builder flags.
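The layout expected for `profiles` is not documented on this page. The sketch below assumes a list of dicts that map each input name to `min`, `opt`, and `max` shapes, mirroring the three shapes a TensorRT optimization profile takes per input. The helper `make_profile`, the key names, and the example shapes are hypothetical and should be checked against the exporter's source before use.

```python
from typing import Dict, List, Sequence

Profile = Dict[str, Dict[str, List[int]]]


def make_profile(shapes: Dict[str, Sequence[Sequence[int]]]) -> Profile:
    """Build one profile from {input_name: (min_shape, opt_shape, max_shape)}.

    Hypothetical helper; the "min"/"opt"/"max" keys are an assumed layout.
    """
    profile: Profile = {}
    for name, (lo, opt, hi) in shapes.items():
        if not (len(lo) == len(opt) == len(hi)):
            raise ValueError(f"{name}: shapes must have the same rank")
        if any(not (a <= b <= c) for a, b, c in zip(lo, opt, hi)):
            raise ValueError(f"{name}: need min <= opt <= max on every axis")
        profile[name] = {"min": list(lo), "opt": list(opt), "max": list(hi)}
    return profile


def build_trt_engine(exporter, trt_model_dir: str) -> None:
    """Convert an already exported ONNX model to TensorRT (sketch)."""
    profile = make_profile(
        {
            "input_ids": ([1, 1], [8, 128], [16, 512]),
            "attention_mask": ([1, 1], [8, 128], [16, 512]),
            "dimensions": ([1], [8], [16]),
        }
    )
    exporter.export_onnx_to_trt(
        trt_model_dir=trt_model_dir,
        profiles=[profile],
        override_layernorm_precision_to_fp32=True,
        trt_dtype="fp16",
        profiling_verbosity="layer_names_only",
    )
```

Validating `min <= opt <= max` up front surfaces shape mistakes before the engine build starts.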
- _override_layers_to_fp32(
- network: tensorrt.INetworkDefinition,
- fp32_layer_patterns: list[str],
)
- _override_layernorm_precision_to_fp32(
- network: tensorrt.INetworkDefinition,
)
Set the precision of LayerNorm subgraphs to FP32 to preserve accuracy.
https://nvbugs/4478448 (Mistral)
https://nvbugs/3802112 (T5)
- Parameters:
network – tensorrt.INetworkDefinition
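A per-layer FP32 override of this kind can be approximated with TensorRT's layer API. This is a minimal sketch, assuming layers are selected by name prefix; the actual matching rule for `fp32_layer_patterns` is not documented on this page. `force_layers_to_fp32` is a hypothetical stand-in, not the exporter's private method. It uses only `num_layers`, `get_layer`, `name`, `precision`, `num_outputs`, and `set_output_type`, and takes the FP32 dtype (for example `tensorrt.float32`) as an argument so the function itself does not import TensorRT.

```python
from typing import Any, List, Sequence


def force_layers_to_fp32(network: Any, name_prefixes: Sequence[str], fp32: Any) -> List[str]:
    """Set precision and output types to FP32 for layers matching a name prefix.

    network: a tensorrt.INetworkDefinition (or any object with the same attributes).
    fp32:    the FP32 dtype to assign, e.g. tensorrt.float32.
    Returns the names of the layers that were changed.
    """
    changed: List[str] = []
    for index in range(network.num_layers):
        layer = network.get_layer(index)
        # Assumption: patterns are treated as plain name prefixes.
        if not any(layer.name.startswith(prefix) for prefix in name_prefixes):
            continue
        layer.precision = fp32
        for out_index in range(layer.num_outputs):
            layer.set_output_type(out_index, fp32)
        changed.append(layer.name)
    return changed
```

In TensorRT, per-layer precision settings generally take effect only when the builder is told to honor them, for example with `tensorrt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS`; whether the exporter sets such a flag itself is not stated here.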
- forward(
- inputs: Union[List, Dict],
- dimensions: Optional[List] = None,
)
Run inference for a given input.
- Parameters:
inputs (Union[List, Dict]) – Input for the model. If list, it should be a list of strings. If dict, it should be a dictionary with keys as the model input names.
dimensions (Optional[List]) – The dimensions parameter of the model. Required if the model was exported to accept dimensions parameter and inputs is given as a list of strings.
- Returns:
Model output.
- Return type:
np.ndarray
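The two accepted input forms can be illustrated with a thin wrapper. This is a sketch under assumptions: `run_forward` is a hypothetical helper, and it assumes the `get_model_input_names` property returns an iterable of input-name strings. It only checks inputs against the contract described above before delegating to `forward`.

```python
from typing import Dict, List, Optional, Union


def run_forward(exporter, inputs: Union[List[str], Dict], dimensions: Optional[List] = None):
    """Validate inputs against the documented contract, then call forward()."""
    if isinstance(inputs, dict):
        # Dict form: keys should be the model input names.
        expected = set(exporter.get_model_input_names)
        unknown = set(inputs) - expected
        if unknown:
            raise KeyError(f"Unexpected input names: {sorted(unknown)}")
        return exporter.forward(inputs)
    # List form: every element should be a string.
    if not all(isinstance(text, str) for text in inputs):
        raise TypeError("List inputs must contain only strings")
    return exporter.forward(inputs, dimensions=dimensions)
```

With a list of strings, `dimensions` is passed through unchanged; with a dict, any `dimensions` value is expected to already be one of the dict's entries.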
- quantize(
- quant_cfg: Union[str, Dict[str, Any]],
- forward_loop: Optional[Callable],
)
Quantize the model by calibrating it using a given forward loop.
- Parameters:
quant_cfg (str or dict) – The quantization configuration to use.
forward_loop (callable) – A function that accepts the model as a single parameter and runs sample data through it. This is used for calibration during quantization.
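A calibration loop of the shape described above might look like the following sketch. `make_forward_loop` and `quantize_with_calibration` are hypothetical helpers, and calling the model as `model(**batch)` is an assumption about how the underlying model accepts its inputs. The value of `quant_cfg` is left to the caller, since valid configuration names are not listed on this page.

```python
from typing import Any, Callable, Dict, Iterable


def make_forward_loop(batches: Iterable[Dict[str, Any]]) -> Callable[[Any], None]:
    """Return a forward_loop that feeds each calibration batch to the model."""
    # Materialize the batches so the loop can be run more than once.
    batches = list(batches)

    def forward_loop(model) -> None:
        for batch in batches:
            # Assumption: the model takes its inputs as keyword arguments.
            model(**batch)

    return forward_loop


def quantize_with_calibration(exporter, quant_cfg, batches) -> None:
    """Call quantize() on an existing OnnxLLMExporter with a calibration loop."""
    exporter.quantize(quant_cfg=quant_cfg, forward_loop=make_forward_loop(batches))
```

For a PyTorch model, the loop body would typically run with gradients disabled, since calibration only needs forward passes.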
- property get_model#
Returns the model.
- property get_tokenizer#
Returns the tokenizer.
- property get_model_input_names#
Returns the model input names.
- abstract property get_triton_input#
Get triton input.
- abstract property get_triton_output#
Get triton output.