core.export.trtllm.engine_builder.trtllm_engine_builder
Module Contents
Classes
TRTLLMEngineBuilder | A utility class to build a TRTLLM engine
API
- class core.export.trtllm.engine_builder.trtllm_engine_builder.TRTLLMEngineBuilder
A utility class to build a TRTLLM engine.
- static build_and_save_engine(
- engine_dir: str,
- trtllm_model_weights: dict,
- trtllm_model_config,
- max_input_len: int = 1024,
- max_output_len: int = 1024,
- max_batch_size: int = 4,
- lora_ckpt_list=None,
- use_lora_plugin=None,
- max_lora_rank: int = 64,
- lora_target_modules=None,
- max_prompt_embedding_table_size: int = 0,
- paged_kv_cache: bool = True,
- remove_input_padding: bool = True,
- paged_context_fmha: bool = False,
- use_refit: bool = False,
- max_num_tokens: int = None,
- max_seq_len: int = None,
- opt_num_tokens: int = None,
- max_beam_width: int = 1,
- tokens_per_block: int = 128,
- multiple_profiles: bool = False,
- gpt_attention_plugin: str = 'auto',
- gemm_plugin: str = 'auto',
- reduce_fusion: bool = False,
)
Build the TRTLLM engine.
This method uses the TRTLLMEngineBuilder to build the engine and save it to engine_dir.
- Parameters:
engine_dir (str) – The directory path where the engine is saved
trtllm_model_weights (dict) – The TRTLLM converted model weights dict
trtllm_model_config – The TRTLLM Config
max_input_len (int, optional) – Max input length. Defaults to 1024.
max_output_len (int, optional) – Max output length. Defaults to 1024.
max_batch_size (int, optional) – Max batch size. Defaults to 4.
model_type (ModelType, optional) – ModelType enum. Defaults to ModelType.gpt. (Listed here but not present in the signature above.)
lora_ckpt_list (type, optional) – LoRA checkpoint list. Defaults to None.
use_lora_plugin (type, optional) – Use the LoRA plugin. Defaults to None.
max_lora_rank (int, optional) – Max LoRA rank. Defaults to 64.
lora_target_modules (type, optional) – LoRA target modules. Defaults to None.
max_prompt_embedding_table_size (int, optional) – Max prompt embedding table size. Defaults to 0.
paged_kv_cache (bool, optional) – Use Paged KV cache. Defaults to True.
remove_input_padding (bool, optional) – Remove input padding. Defaults to True.
paged_context_fmha (bool, optional) – Use paged context FMHA. Defaults to False.
use_refit (bool, optional) – Use refit. Defaults to False.
max_num_tokens (int, optional) – Max num of tokens. Defaults to None.
max_seq_len (int, optional) – Max seq length. Defaults to None.
opt_num_tokens (int, optional) – Opt number of tokens. Defaults to None.
max_beam_width (int, optional) – Max beam width. Defaults to 1.
tokens_per_block (int, optional) – Number of tokens per block. Defaults to 128.
multiple_profiles (bool, optional) – Use multiple profiles. Defaults to False.
gpt_attention_plugin (str, optional) – GPT attention plugin to use. Defaults to “auto”.
gemm_plugin (str, optional) – GEMM plugin to use. Defaults to “auto”.
reduce_fusion (bool, optional) – Enable reduce fusion. Defaults to False.
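The following is a minimal sketch of how a call to build_and_save_engine might be assembled, not a definitive recipe. The helper names `make_build_kwargs` and `build_engine` and the `DOCUMENTED_DEFAULTS` table are hypothetical and exist only for this illustration; the defaults are copied from the signature above. The `megatron.` package prefix in the import is an assumption, since this page only shows the path starting at `core.`.

```python
from typing import Any, Dict

# Defaults copied from the documented signature of build_and_save_engine.
DOCUMENTED_DEFAULTS: Dict[str, Any] = {
    "max_input_len": 1024,
    "max_output_len": 1024,
    "max_batch_size": 4,
    "lora_ckpt_list": None,
    "use_lora_plugin": None,
    "max_lora_rank": 64,
    "lora_target_modules": None,
    "max_prompt_embedding_table_size": 0,
    "paged_kv_cache": True,
    "remove_input_padding": True,
    "paged_context_fmha": False,
    "use_refit": False,
    "max_num_tokens": None,
    "max_seq_len": None,
    "opt_num_tokens": None,
    "max_beam_width": 1,
    "tokens_per_block": 128,
    "multiple_profiles": False,
    "gpt_attention_plugin": "auto",
    "gemm_plugin": "auto",
    "reduce_fusion": False,
}


def make_build_kwargs(
    engine_dir: str,
    trtllm_model_weights: dict,
    trtllm_model_config: Any,
    **overrides: Any,
) -> Dict[str, Any]:
    """Hypothetical helper: merge overrides into the documented defaults."""
    # Reject names that are not in the documented signature (for example
    # model_type, which appears in the parameter list but not the signature).
    unknown = sorted(set(overrides) - set(DOCUMENTED_DEFAULTS))
    if unknown:
        raise TypeError(f"not in the documented signature: {unknown}")
    kwargs = dict(DOCUMENTED_DEFAULTS)
    kwargs.update(overrides)
    kwargs.update(
        engine_dir=engine_dir,
        trtllm_model_weights=trtllm_model_weights,
        trtllm_model_config=trtllm_model_config,
    )
    return kwargs


def build_engine(kwargs: Dict[str, Any]) -> Any:
    """Hypothetical wrapper: forward the keyword arguments to the builder."""
    # Assumption: the module is importable under the `megatron.` prefix and
    # its dependencies are installed. Imported lazily so the helper above
    # stays usable without them.
    from megatron.core.export.trtllm.engine_builder.trtllm_engine_builder import (
        TRTLLMEngineBuilder,
    )

    return TRTLLMEngineBuilder.build_and_save_engine(**kwargs)
```

A call such as `build_engine(make_build_kwargs("engine_out", weights, config, max_batch_size=8))` would then override only the batch size, where `weights` and `config` stand for the converted weights dict and TRTLLM config produced elsewhere. Keeping the defaults in one table makes it easy to spot an argument name that the signature does not accept before the build starts.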