nemo_export.onnx_llm_exporter#

Module Contents#

Classes#

OnnxLLMExporter

Exports models to ONNX and runs fast inference.

Functions#

noop_decorator

No-op decorator.

Data#

API#

nemo_export.onnx_llm_exporter.noop_decorator(func)[source]#

No-op decorator.

nemo_export.onnx_llm_exporter.use_pytriton = True#
nemo_export.onnx_llm_exporter.batch = None[source]#
nemo_export.onnx_llm_exporter.use_onnxruntime = True#
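These module-level names suggest an optional-dependency guard: batch defaults to None and is replaced by PyTriton's batching decorator when it is importable, with noop_decorator as the fallback. A minimal sketch of that pattern, assuming this reading (the actual module logic may differ):

# Sketch of the assumed optional-dependency guard; the names mirror the
# module data above, but the exact logic is an assumption.
def noop_decorator(func):
    """No-op decorator: returns the function unchanged."""
    return func

batch = None
try:
    from pytriton.decorators import batch  # PyTriton's batching decorator
    use_pytriton = True
except ImportError:
    batch = noop_decorator  # fall back so @batch-decorated code still runs
    use_pytriton = False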
class nemo_export.onnx_llm_exporter.OnnxLLMExporter(
onnx_model_dir: str,
model: Optional[torch.nn.Module] = None,
tokenizer=None,
model_name_or_path: str = None,
load_runtime: bool = True,
)[source]#

Bases: nemo_deploy.ITritonDeployable

Exports models to ONNX and runs fast inference.

.. rubric:: Example

from nemo_export.onnx_llm_exporter import OnnxLLMExporter

onnx_llm_exporter = OnnxLLMExporter(
    onnx_model_dir="/path/for/onnx_model/files",
    model_name_or_path="/path/for/model/files",
)

onnx_llm_exporter.export(
    input_names=["input_ids", "attention_mask", "dimensions"],
    output_names=["embeddings"],
)

output = onnx_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
print("output: ", output)

Initialization

Initializes the ONNX Exporter.

Parameters:
  • onnx_model_dir (str) – path for storing the ONNX model files.

  • model (Optional[torch.nn.Module]) – torch model.

  • tokenizer (HF or NeMo tokenizer) – tokenizer for the model.

  • model_name_or_path (str) – a checkpoint path or a Hugging Face model ID.

  • load_runtime (bool) – load the ONNX runtime if an exported model is available in the onnx_model_dir folder.

_load_runtime()[source]#
_load_hf_model()[source]#
export(
input_names: list,
output_names: list,
example_inputs: dict = None,
opset: int = 20,
dynamic_axes_input: Optional[dict] = None,
dynamic_axes_output: Optional[dict] = None,
export_dtype: str = 'fp32',
verbose: bool = False,
)[source]#

Performs ONNX conversion from a PyTorch model.

Parameters:
  • input_names (list) – input names that the exported ONNX model will use.

  • output_names (list) – output names that the exported ONNX model will use.

  • example_inputs (dict) – example inputs used to trace the model during export.

  • opset (int) – ONNX opset version. Default is 20.

  • dynamic_axes_input (dict) – variable-length axes for the inputs.

  • dynamic_axes_output (dict) – variable-length axes for the outputs.

  • export_dtype (str) – export dtype, "fp16" or "fp32".

  • verbose (bool) – whether to enable verbose logging.
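For illustration, a hedged sketch of an export call with variable-length axes, continuing the class example above; the axis labels and dimension names are placeholders, not values taken from this documentation:

# Sketch: export with dynamic batch/sequence axes (illustrative names).
onnx_llm_exporter.export(
    input_names=["input_ids", "attention_mask"],
    output_names=["embeddings"],
    dynamic_axes_input={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
    },
    dynamic_axes_output={"embeddings": {0: "batch"}},
    export_dtype="fp32",
)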

_export_to_onnx(
input_names: list,
output_names: list,
example_inputs: dict = None,
opset: int = 20,
dynamic_axes_input: Optional[dict] = None,
dynamic_axes_output: Optional[dict] = None,
export_dtype: Union[torch.dtype, str] = 'fp16',
verbose: bool = False,
)[source]#
export_onnx_to_trt(
trt_model_dir: str,
profiles=None,
override_layernorm_precision_to_fp32: bool = False,
override_layers_to_fp32: List = None,
trt_dtype: str = 'fp16',
profiling_verbosity: str = 'layer_names_only',
trt_builder_flags: List[tensorrt.BuilderFlag] = None,
) → None[source]#

Performs TensorRT conversion from an ONNX model.

Parameters:
  • trt_model_dir – path to store the TensorRT model.

  • profiles – TensorRT profiles.

  • override_layernorm_precision_to_fp32 (bool) – whether to override LayerNorm subgraph precision to fp32.

  • override_layers_to_fp32 (List) – Layer names to be converted to fp32.

  • trt_dtype (str) – “fp16” or “fp32”.

  • profiling_verbosity (str) – Profiling verbosity. Default is “layer_names_only”.

  • trt_builder_flags (List[tensorrt.BuilderFlag]) – TensorRT-specific builder flags.
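A minimal sketch of a follow-on TensorRT conversion, assuming the exporter above has already produced an ONNX model; the directory path is a placeholder:

# Sketch: build a TensorRT engine from the exported ONNX model.
onnx_llm_exporter.export_onnx_to_trt(
    trt_model_dir="/path/for/trt_model/files",  # placeholder path
    trt_dtype="fp16",
    override_layernorm_precision_to_fp32=True,  # keep LayerNorm subgraphs in fp32
)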

_override_layer_precision_to_fp32(layer: tensorrt.ILayer) → None[source]#
_override_layers_to_fp32(
network: tensorrt.INetworkDefinition,
fp32_layer_patterns: list[str],
) → None[source]#
_override_layernorm_precision_to_fp32(
network: tensorrt.INetworkDefinition,
) → None[source]#

Set the precision of LayerNorm subgraphs to FP32 to preserve accuracy.

  • https://nvbugs/4478448 (Mistral)

  • https://nvbugs/3802112 (T5)

Parameters:
  • network (tensorrt.INetworkDefinition) – the network whose LayerNorm subgraph precision is overridden.

forward(
inputs: Union[List, Dict],
dimensions: Optional[List] = None,
)[source]#

Run inference for a given input.

Parameters:
  • inputs (Union[List, Dict]) – Input for the model. If list, it should be a list of strings. If dict, it should be a dictionary with keys as the model input names.

  • dimensions (Optional[List]) – the dimensions parameter of the model. Required if the model was exported to accept a dimensions parameter and inputs is given as a list of strings.

Returns:

Model output.

Return type:

np.ndarray
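Both input forms, sketched under the assumption that the exporter was set up as in the class example; the arrays in the dict form are illustrative placeholders for pre-tokenized inputs:

import numpy as np

# List form: raw strings, tokenized internally by the exporter.
output = onnx_llm_exporter.forward(["Hi, how are you?", "I am good, thanks."])

# Dict form: keys must match the model input names. The arrays below are
# illustrative placeholders, not real token IDs.
input_ids = np.array([[101, 7632, 102]])
attention_mask = np.ones_like(input_ids)
output = onnx_llm_exporter.forward(
    {"input_ids": input_ids, "attention_mask": attention_mask}
)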

quantize(
quant_cfg: Union[str, Dict[str, Any]],
forward_loop: Optional[Callable],
) → None[source]#

Quantize the model by calibrating it using a given forward loop.

Parameters:
  • quant_cfg (str or dict) – The quantization configuration to use.

  • forward_loop (callable) – A function that accepts the model as a single parameter and runs sample data through it. This is used for calibration during quantization.
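A hedged sketch of a calibration-driven quantization call; the config string and the calibration data are illustrative assumptions, not values documented here:

# Sketch: calibrate and quantize. "int8_sq" and calibration_samples are
# placeholders; substitute a real config and representative inputs.
calibration_samples = [...]  # assumed iterable of example model inputs

def forward_loop(model):
    # Run representative samples through the model for calibration.
    for sample in calibration_samples:
        model(**sample)

onnx_llm_exporter.quantize(quant_cfg="int8_sq", forward_loop=forward_loop)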

property get_model#

Returns the model.

property get_tokenizer#

Returns the tokenizer.

property get_model_input_names#

Returns the model input names.

abstract property get_triton_input#

Get triton input.

abstract property get_triton_output#

Get triton output.

abstractmethod triton_infer_fn(**inputs: numpy.ndarray)[source]#

PyTriton inference function.