core.export.trtllm.engine_builder.trtllm_engine_builder

Module Contents

Classes

TRTLLMEngineBuilder

A utility class for building a TRTLLM engine.

API

class core.export.trtllm.engine_builder.trtllm_engine_builder.TRTLLMEngineBuilder

A utility class for building a TRTLLM engine.

static build_and_save_engine(
engine_dir: str,
trtllm_model_weights: dict,
trtllm_model_config,
max_input_len: int = 1024,
max_output_len: int = 1024,
max_batch_size: int = 4,
lora_ckpt_list=None,
use_lora_plugin=None,
max_lora_rank: int = 64,
lora_target_modules=None,
max_prompt_embedding_table_size: int = 0,
paged_kv_cache: bool = True,
remove_input_padding: bool = True,
paged_context_fmha: bool = False,
use_refit: bool = False,
max_num_tokens: int = None,
max_seq_len: int = None,
opt_num_tokens: int = None,
max_beam_width: int = 1,
tokens_per_block: int = 128,
multiple_profiles: bool = False,
gpt_attention_plugin: str = 'auto',
gemm_plugin: str = 'auto',
reduce_fusion: bool = False,
)

Build the TRTLLM engine and save it.

This static method builds the engine from the converted model weights and config, then saves it to engine_dir.

Parameters:
  • engine_dir (str) – The directory in which to save the engine

  • trtllm_model_weights (dict) – The TRTLLM converted model weights dict

  • trtllm_model_config – The TRTLLM Config

  • max_input_len (int, optional) – Max input length. Defaults to 1024.

  • max_output_len (int, optional) – Max output length. Defaults to 1024.

  • max_batch_size (int, optional) – Max batch size. Defaults to 4.

  • model_type (ModelType, optional) – ModelType enum. Defaults to ModelType.gpt. This parameter is not part of the signature shown above.

  • lora_ckpt_list (optional) – List of LoRA checkpoints. Defaults to None.

  • use_lora_plugin (optional) – The LoRA plugin to use. Defaults to None.

  • max_lora_rank (int, optional) – Max LoRA rank. Defaults to 64.

  • lora_target_modules (optional) – LoRA target modules. Defaults to None.

  • max_prompt_embedding_table_size (int, optional) – Max size of the prompt embedding table. Defaults to 0.

  • paged_kv_cache (bool, optional) – Use Paged KV cache. Defaults to True.

  • remove_input_padding (bool, optional) – Remove input padding. Defaults to True.

  • paged_context_fmha (bool, optional) – Use paged context FMHA. Defaults to False.

  • use_refit (bool, optional) – Use refit. Defaults to False.

  • max_num_tokens (int, optional) – Max number of tokens. Defaults to None.

  • max_seq_len (int, optional) – Max sequence length. Defaults to None.

  • opt_num_tokens (int, optional) – Opt number of tokens. Defaults to None.

  • max_beam_width (int, optional) – Max beam width. Defaults to 1.

  • tokens_per_block (int, optional) – Number of tokens per block. Defaults to 128.

  • multiple_profiles (bool, optional) – Use multiple profiles. Defaults to False.

  • gpt_attention_plugin (str, optional) – GPT attention plugin to use. Defaults to “auto”.

  • gemm_plugin (str, optional) – GEMM plugin to use. Defaults to “auto”.

  • reduce_fusion (bool, optional) – Use reduce fusion. Defaults to False.
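
The following is a minimal sketch of how the keyword arguments for build_and_save_engine might be assembled, not a definitive recipe. The helpers make_build_kwargs and build_engine are hypothetical names introduced only for illustration. The defaults are copied from the signature above. The `megatron.` prefix on the import path is an assumption, since this page only shows `core.export...`.

```python
from typing import Any, Dict

# Defaults copied from the documented signature of build_and_save_engine.
BUILD_DEFAULTS: Dict[str, Any] = {
    "max_input_len": 1024,
    "max_output_len": 1024,
    "max_batch_size": 4,
    "lora_ckpt_list": None,
    "use_lora_plugin": None,
    "max_lora_rank": 64,
    "lora_target_modules": None,
    "max_prompt_embedding_table_size": 0,
    "paged_kv_cache": True,
    "remove_input_padding": True,
    "paged_context_fmha": False,
    "use_refit": False,
    "max_num_tokens": None,
    "max_seq_len": None,
    "opt_num_tokens": None,
    "max_beam_width": 1,
    "tokens_per_block": 128,
    "multiple_profiles": False,
    "gpt_attention_plugin": "auto",
    "gemm_plugin": "auto",
    "reduce_fusion": False,
}


def make_build_kwargs(engine_dir: str, trtllm_model_weights: dict,
                      trtllm_model_config: Any, **overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = sorted(set(overrides) - set(BUILD_DEFAULTS))
    if unknown:
        # Reject names that are not in the signature (for example model_type).
        raise TypeError(f"unexpected keyword argument(s): {unknown}")
    return {
        "engine_dir": engine_dir,
        "trtllm_model_weights": trtllm_model_weights,
        "trtllm_model_config": trtllm_model_config,
        **BUILD_DEFAULTS,
        **overrides,
    }


def build_engine(engine_dir: str, trtllm_model_weights: dict,
                 trtllm_model_config: Any, **overrides: Any):
    """Hypothetical wrapper: call the builder. Requires the library to be installed."""
    # ASSUMPTION: the `megatron.` package prefix; this page only shows `core.export...`.
    # Imported lazily so make_build_kwargs works without the library present.
    from megatron.core.export.trtllm.engine_builder.trtllm_engine_builder import (
        TRTLLMEngineBuilder,
    )
    kwargs = make_build_kwargs(engine_dir, trtllm_model_weights,
                               trtllm_model_config, **overrides)
    # build_and_save_engine is a static method, so no instance is needed.
    return TRTLLMEngineBuilder.build_and_save_engine(**kwargs)
```

A call such as build_engine("/path/to/engine_dir", weights, config, max_batch_size=8) would then override only the batch size and leave the other documented defaults in place. Rejecting unknown names early is a design choice in this sketch: it catches arguments, such as model_type, that are not part of the signature.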