nemo_export.onnx_llm_exporter#

Module Contents#

Classes#

OnnxLLMExporter

Exports models to ONNX and runs fast inference.

Functions#

noop_decorator

No-op decorator.

Data#

API#

nemo_export.onnx_llm_exporter.noop_decorator(func)[source]#

No-op decorator.

nemo_export.onnx_llm_exporter.use_pytriton = True#
nemo_export.onnx_llm_exporter.batch = None[source]#
nemo_export.onnx_llm_exporter.use_onnxruntime = True#
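These module-level names suggest an optional-dependency guard: batch defaults to None and is replaced by PyTriton's batching decorator when it is importable, with noop_decorator as the fallback. A minimal sketch of that pattern, assuming this reading (the actual module logic may differ):

# Sketch of the assumed optional-dependency guard; the names mirror the
# module data above, but the exact logic is an assumption.
def noop_decorator(func):
    """No-op decorator: returns the function unchanged."""
    return func

batch = None
try:
    from pytriton.decorators import batch  # PyTriton's batching decorator
    use_pytriton = True
except ImportError:
    batch = noop_decorator  # fall back so @batch-decorated code still runs
    use_pytriton = False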
class nemo_export.onnx_llm_exporter.OnnxLLMExporter(
onnx_model_dir: str,
model: Optional[torch.nn.Module] = None,
tokenizer=None,
model_name_or_path: str = None,
load_runtime: bool = True,
)[source]#

Bases: nemo_deploy.ITritonDeployable

Exports models to ONNX and runs fast inference.

.. rubric:: Example

from nemo_export.onnx_llm_exporter import OnnxLLMExporter

onnx_llm_exporter = OnnxLLMExporter(
    onnx_model_dir="/path/for/onnx_model/files",
    model_name_or_path="/path/for/model/files",
)

onnx_llm_exporter.export(
    input_names=["input_ids", "attention_mask", "dimensions"],
    output_names=["embeddings"],
)

output = onnx_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
print("output: ", output)

Initialization

Initializes the ONNX Exporter.

Parameters:
  • onnx_model_dir (str) – path for storing the ONNX model files.

  • model (Optional[torch.nn.Module]) – torch model.

  • tokenizer (HF or NeMo tokenizer) – tokenizer for the model.

  • model_name_or_path (str) – a checkpoint path or a Hugging Face model ID.

  • load_runtime (bool) – load the ONNX runtime if an exported model is available in the onnx_model_dir folder.

_load_runtime()[source]#
_load_hf_model()[source]#
export(
input_names: list,
output_names: list,
example_inputs: dict = None,
opset: int = 20,
dynamic_axes_input: Optional[dict] = None,
dynamic_axes_output: Optional[dict] = None,
export_dtype: str = 'fp32',
verbose: bool = False,
)[source]#

Performs ONNX conversion from a PyTorch model.

Parameters:
  • input_names (list) – input names that the exported ONNX model will use.

  • output_names (list) – output names that the exported ONNX model will use.

  • example_inputs (dict) – example inputs used to trace the model during export.

  • opset (int) – ONNX opset version. Default is 20.

  • dynamic_axes_input (dict) – variable-length axes for the inputs.

  • dynamic_axes_output (dict) – variable-length axes for the outputs.

  • export_dtype (str) – export dtype, "fp16" or "fp32".

  • verbose (bool) – whether to enable verbose logging.
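For illustration, a hedged sketch of an export call with variable-length axes, continuing the class example above; the axis labels and dimension names are placeholders, not values taken from this documentation:

# Sketch: export with dynamic batch/sequence axes (illustrative names).
onnx_llm_exporter.export(
    input_names=["input_ids", "attention_mask"],
    output_names=["embeddings"],
    dynamic_axes_input={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
    },
    dynamic_axes_output={"embeddings": {0: "batch"}},
    export_dtype="fp32",
)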

_export_to_onnx(
input_names: list,
output_names: list,
example_inputs: dict = None,
opset: int = 20,
dynamic_axes_input: Optional[dict] = None,
dynamic_axes_output: Optional[dict] = None,
export_dtype: Union[torch.dtype, str] = 'fp16',
verbose: bool = False,
)[source]#
export_onnx_to_trt(
trt_model_dir: str,
profiles=None,
override_layernorm_precision_to_fp32: bool = False,
override_layers_to_fp32: List = None,
trt_dtype: str = 'fp16',
profiling_verbosity: str = 'layer_names_only',
trt_builder_flags: List[tensorrt.BuilderFlag] = None,
) → None[source]#

Performs TensorRT conversion from an ONNX model.

Parameters:
  • trt_model_dir – path to store the TensorRT model.

  • profiles – TensorRT profiles.

  • override_layernorm_precision_to_fp32 (bool) – whether to override LayerNorm subgraph precision to fp32.

  • override_layers_to_fp32 (List) – Layer names to be converted to fp32.

  • trt_dtype (str) – “fp16” or “fp32”.

  • profiling_verbosity (str) – Profiling verbosity. Default is “layer_names_only”.

  • trt_builder_flags (List[tensorrt.BuilderFlag]) – TensorRT-specific builder flags.
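A minimal sketch of a follow-on TensorRT conversion, assuming the exporter above has already produced an ONNX model; the directory path is a placeholder:

# Sketch: build a TensorRT engine from the exported ONNX model.
onnx_llm_exporter.export_onnx_to_trt(
    trt_model_dir="/path/for/trt_model/files",  # placeholder path
    trt_dtype="fp16",
    override_layernorm_precision_to_fp32=True,  # keep LayerNorm subgraphs in fp32
)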

_override_layer_precision_to_fp32(layer: tensorrt.ILayer) → None[source]#
_override_layers_to_fp32(
network: tensorrt.INetworkDefinition,
fp32_layer_patterns: list[str],
) → None[source]#
_override_layernorm_precision_to_fp32(
network: tensorrt.INetworkDefinition,
) → None[source]#

Set the precision of LayerNorm subgraphs to FP32 to preserve accuracy.

  • https://nvbugs/4478448 (Mistral)

  • https://nvbugs/3802112 (T5)

Parameters:
  • network (tensorrt.INetworkDefinition) – the network whose LayerNorm subgraph precision is overridden.

forward(
inputs: Union[List, Dict],
dimensions: Optional[List] = None,
)[source]#

Run inference for a given input.

Parameters:
  • inputs (Union[List, Dict]) – Input for the model. If list, it should be a list of strings. If dict, it should be a dictionary with keys as the model input names.

  • dimensions (Optional[List]) – the dimensions parameter of the model. Required if the model was exported to accept a dimensions parameter and inputs is given as a list of strings.

Returns:

Model output.

Return type:

np.ndarray
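Both input forms, sketched under the assumption that the exporter was set up as in the class example; the arrays in the dict form are illustrative placeholders for pre-tokenized inputs:

import numpy as np

# List form: raw strings, tokenized internally by the exporter.
output = onnx_llm_exporter.forward(["Hi, how are you?", "I am good, thanks."])

# Dict form: keys must match the model input names. The arrays below are
# illustrative placeholders, not real token IDs.
input_ids = np.array([[101, 7632, 102]])
attention_mask = np.ones_like(input_ids)
output = onnx_llm_exporter.forward(
    {"input_ids": input_ids, "attention_mask": attention_mask}
)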

quantize(
quant_cfg: Union[str, Dict[str, Any]],
forward_loop: Optional[Callable],
) → None[source]#

Quantize the model by calibrating it using a given forward loop.

Parameters:
  • quant_cfg (str or dict) – The quantization configuration to use.

  • forward_loop (callable) – A function that accepts the model as a single parameter and runs sample data through it. This is used for calibration during quantization.
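A hedged sketch of a calibration-driven quantization call; the config string and the calibration data are illustrative assumptions, not values documented here:

# Sketch: calibrate and quantize. "int8_sq" and calibration_samples are
# placeholders; substitute a real config and representative inputs.
calibration_samples = [...]  # assumed iterable of example model inputs

def forward_loop(model):
    # Run representative samples through the model for calibration.
    for sample in calibration_samples:
        model(**sample)

onnx_llm_exporter.quantize(quant_cfg="int8_sq", forward_loop=forward_loop)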

property get_model#

Returns the model.

property get_tokenizer#

Returns the tokenizer.

property get_model_input_names#

Returns the model input names.

abstract property get_triton_input#

Get triton input.

abstract property get_triton_output#

Get triton output.

abstractmethod triton_infer_fn(**inputs: numpy.ndarray)[source]#

PyTriton inference function.