core.export.trtllm.trtllm_helper#

Module Contents#

Classes#

TRTLLMHelper

TRTLLM Helper class to convert, export, and build a TRTLLM model.

API#

class core.export.trtllm.trtllm_helper.TRTLLMHelper(
*,
transformer_config: megatron.core.transformer.transformer_config.TransformerConfig,
model_type: megatron.core.export.model_type.ModelType,
trtllm_conversion_dict: dict = {},
position_embedding_type: str = 'learned_absolute',
max_position_embeddings: int = None,
rotary_percentage: int = 1.0,
rotary_base: int = 10000,
rope_scaling_factor: float = 8.0,
moe_tp_mode: int = 2,
multi_query_mode: bool = False,
activation: str = 'gelu',
seq_len_interpolation_factor: float = None,
moe_renorm_mode=None,
share_embeddings_and_output_weights=False,
)#

TRTLLM Helper class to convert, export, and build a TRTLLM model.

Initialization

Constructor for the TRTLLMHelper

There are two public APIs supported by this helper: a) get_trtllm_pretrained_config_and_model_weights and b) build_and_save_engine.

Parameters:
  • transformer_config (TransformerConfig) – The transformer config

  • model_type (ModelType) – The type of the input model. Enum (megatron.core.export.model_type.ModelType)

  • trtllm_conversion_dict (dict, optional) – A conversion dictionary that maps your model layer names to the TRTLLM equivalent layer names. The default dictionary is given in megatron/core/export/model_to_trtllm_mapping; this dict is merged into the default dict. NOTE: Ignore layer numbers in the model layer names, e.g. decoder.layers.0.attention_qkv.weight becomes decoder.layers.attention_qkv.weight in the mapping dictionary. Defaults to {}.

  • position_embedding_type (str, optional) – The position embedding type. Defaults to 'learned_absolute'.

  • max_position_embeddings (int, optional) – Max position embeddings value. Defaults to None.

  • rotary_percentage (int, optional) – The rotary percentage if using rope embedding. Defaults to 1.0.

  • rotary_base (int, optional) – The rotary base (theta value) if using rope embeddings. Defaults to 10000.

  • moe_tp_mode (int, optional) – The MoE tensor-parallel mode passed to the TRTLLM config. Defaults to 2.

  • multi_query_mode (bool, optional) – Whether the model uses multi-query attention. Defaults to False.

  • activation (str, optional) – The activation function. Defaults to 'gelu'.

  • seq_len_interpolation_factor (float, optional) – The sequence length interpolation factor if using rope embeddings. Defaults to None.

  • moe_renorm_mode (optional) – Renormalization mode if using mixture of experts. Defaults to None.

  • share_embeddings_and_output_weights (bool, optional) – True if input and output layers share weights. Defaults to False.
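
A minimal construction sketch follows; the TransformerConfig sizes and the ModelType member used here are illustrative assumptions, not values prescribed by this API.

```python
# A minimal sketch; the config sizes and ModelType member are illustrative assumptions.
from megatron.core.export.model_type import ModelType
from megatron.core.export.trtllm.trtllm_helper import TRTLLMHelper
from megatron.core.transformer.transformer_config import TransformerConfig

# Small toy transformer config; real exports use the config of the trained model.
transformer_config = TransformerConfig(
    num_layers=2,
    hidden_size=64,
    num_attention_heads=2,
    use_cpu_initialization=True,
)

trtllm_helper = TRTLLMHelper(
    transformer_config=transformer_config,
    model_type=ModelType.gpt,                    # assumed enum member
    position_embedding_type='learned_absolute',
    max_position_embeddings=1024,
    activation='gelu',
    share_embeddings_and_output_weights=True,
)
```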

_get_trtllm_config(
export_config: megatron.core.export.export_config.ExportConfig,
world_size: int,
gpus_per_node: int,
vocab_size_padded: int,
dtype: megatron.core.export.data_type.DataType,
fp8_quantized: bool = False,
fp8_kvcache: bool = False,
)#

Get TRTLLM Config

Returns the appropriate TRTLLM PretrainedConfig used by TRTLLM for building the engine.

Parameters:
  • export_config (ExportConfig) – The export config that defines the inference TP size, PP size, etc.

  • world_size (int) – The number of GPUs (usually TP * PP)

  • gpus_per_node (int) – Number of GPUs per node

  • vocab_size_padded (int) – Padded vocab size

  • dtype (DataType) – The data type or model precision

Returns:

The GPTConfig, LLaMAConfig, or PretrainedConfig constructed from your model config

_load_scaling_factors(model_state_dict: dict) dict#

Loads scaling factors from model state dictionary.

Parameters:

model_state_dict (dict) – Model state dictionary

Returns:

A dict mapping each scaling factor key to its value and its inverse; the inverse is used for casting the quantized weights.

Return type:

dict

get_trtllm_pretrained_config_and_model_weights(
model_state_dict,
dtype: megatron.core.export.data_type.DataType,
export_config: megatron.core.export.export_config.ExportConfig = None,
on_device_distributed_conversion: bool = False,
vocab_size: int = None,
gpus_per_node: int = None,
state_dict_split_by_layer_numbers: bool = True,
fp8_quantized: bool = False,
fp8_kvcache: bool = False,
)#

Get TRTLLM Config and Converted Model Weights

This function returns the TRTLLM model weights as a list. There are two modes of conversion. The default is to use a single device (CPU/GPU) for conversion. NOTE: For faster performance, if your entire model fits in memory, transfer the model state dict to a CUDA device before calling this function. For on-device conversion, it returns the weights to be used on the device itself; the same applies to the pretrained config.

Parameters:
  • model_state_dict (dict) – The input model state dictionary (the entire model state loaded on CPU, or the model state dict of each GPU in the case of on-device conversion)

  • export_config (ExportConfig) – The export config used to define the inference TP size, PP size, etc. Used only for single-device conversion.

  • dtype (DataType) – The data type of model precision

  • on_device_distributed_conversion (bool, optional) – Convert on GPUs in a distributed setting. This assumes that the model state dict is sharded according to the required inference model parallelism and that each GPU gets its part of the model state dict. Defaults to False.

  • vocab_size (int, optional) – The vocabulary size. Defaults to None.

  • gpus_per_node (int, optional) – The number of GPUs per node. Used for on-device conversion.

  • state_dict_split_by_layer_numbers (bool, optional) – Whether the model layers are split by layer numbers in the state dict. For example, mlp.fc1.weight can be represented as mlp.fc1.weight of shape [num_layers, hidden_dim, ffn_hidden_dim], or as mlp.fc1.layers.0.weight of shape [hidden_dim, ffn_hidden_dim], then mlp.fc1.layers.1.weight, and so on for all layers. If you use the second representation, set this to True. Defaults to True.

Returns:

Two lists: the TRTLLM converted model weights (either on-device weights, or a list of weights for each GPU) and the TRTLLM model configs.
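
A hedged sketch of the default single-device conversion path, assuming the helper constructed earlier and a model state dict already loaded on CPU (or moved to a CUDA device beforehand for speed). The ExportConfig field and DataType member shown are assumptions, not confirmed by this page.

```python
from megatron.core.export.data_type import DataType
from megatron.core.export.export_config import ExportConfig

# Assumed field/member names: ExportConfig.inference_tp_size and DataType.bfloat16.
export_config = ExportConfig(inference_tp_size=2)

trtllm_model_weights_list, trtllm_model_config_list = (
    trtllm_helper.get_trtllm_pretrained_config_and_model_weights(
        model_state_dict=model_state_dict,  # assumed to be loaded beforehand
        dtype=DataType.bfloat16,
        export_config=export_config,
        gpus_per_node=2,
    )
)

# In single-device conversion mode there is one entry per inference GPU rank.
for rank, (weights, config) in enumerate(
    zip(trtllm_model_weights_list, trtllm_model_config_list)
):
    print(rank, type(config).__name__, len(weights))
```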

_add_scales_to_converter(
converter: Union[megatron.core.export.trtllm.trtllm_weights_converter.single_device_trtllm_model_weights_converter.SingleDeviceTRTLLMModelWeightsConverter, megatron.core.export.trtllm.trtllm_weights_converter.distributed_trtllm_model_weights_converter.DistributedTRTLLMModelWeightsConverter],
scales: dict,
fp8_kvcache: bool,
)#

Adds scaling factors to the distributed and single device converters.

Parameters:
  • converter (ModelWeightConverter) – Converter holding the TRT-LLM model weights.

  • scales (dict) – Dictionary holding TRT-LLM scaling factors

  • fp8_kvcache (bool) – If true, creates scaling factors (equal to 1.0) for kv_cache quantization

_get_trtllm_pretrained_config_and_model_weights_in_distributed_setting(
model_state_dict: dict,
dtype: megatron.core.export.data_type.DataType,
vocab_size: int,
gpus_per_node: int,
scales: dict,
fp8_quantized: bool,
fp8_kvcache: bool,
)#

Get the TRTLLM Pretrained config and model weights list in a distributed setting

This function assumes the model state dict is distributed according to model parallelism. Each device gets its own model state dict.

Parameters:
  • model_state_dict (dict) – The model state dictionary for this device

  • dtype (DataType) – The data type or model precision

  • vocab_size (int) – Tokenizer vocab size

  • gpus_per_node (int) – The number of gpus per node

  • scales (dict) – Dictionary with fp8 scaling factors

  • fp8_quantized (bool) – True for fp8 checkpoint export

  • fp8_kvcache (bool) – True for fp8 KV-cache quantization

Returns:

Two lists: the TRTLLM converted model weights and the TRTLLM model configs (one for each GPU).

_get_trtllm_pretrained_config_and_model_weights_list_on_single_device(
export_config: megatron.core.export.export_config.ExportConfig,
model_state_dict: dict,
dtype: megatron.core.export.data_type.DataType,
gpus_per_node,
state_dict_split_by_layer_numbers,
scales: dict,
fp8_quantized: bool,
fp8_kvcache: bool,
)#

Get the TRTLLM pretrained config and model weights list (one per GPU rank) on a single device (CPU or GPU)

This function assumes the entire model state dict is present on the CPU or on one GPU

Parameters:
  • export_config (ExportConfig) – The export config that sets the inference TP size, PP size, etc.

  • model_state_dict (dict) – The model state dictionary (all collected on CPU)

  • dtype (DataType) – The data type or model precision

  • gpus_per_node (int, optional) – Number of GPUs per node

  • state_dict_split_by_layer_numbers (bool, optional) – Whether the model layers are split by layer numbers in the state dict. For example, mlp.fc1.weight can be represented as mlp.fc1.weight of shape [num_layers, hidden_dim, ffn_hidden_dim], or as mlp.fc1.layers.0.weight of shape [hidden_dim, ffn_hidden_dim], then mlp.fc1.layers.1.weight, and so on for all layers. If you use the second representation, set this to True. Defaults to True.

  • scales (dict) – Dictionary with fp8 scaling factors

  • fp8_quantized (bool) – True for fp8 checkpoint export

  • fp8_kvcache (bool) – True for fp8 KV-cache quantization

Returns:

Two lists: the TRTLLM converted model weights and the TRTLLM model configs (one for each GPU).

build_and_save_engine(
engine_dir: str,
trtllm_model_weights: dict,
trtllm_model_config,
max_input_len: int = 1024,
max_output_len: int = 1024,
max_batch_size: int = 4,
lora_ckpt_list=None,
use_lora_plugin=None,
max_lora_rank: int = 64,
lora_target_modules=None,
max_prompt_embedding_table_size: int = 0,
paged_kv_cache: bool = True,
remove_input_padding: bool = True,
paged_context_fmha: bool = False,
use_refit: bool = False,
max_num_tokens: int = None,
max_seq_len: int = None,
opt_num_tokens: int = None,
max_beam_width: int = 1,
tokens_per_block: int = 128,
multiple_profiles: bool = False,
gpt_attention_plugin: str = 'auto',
gemm_plugin: str = 'auto',
)#

Method to build the TRTLLM Engine

This method uses the TRTLLMEngineBuilder to build and save the engine to the engine directory.

Parameters:
  • engine_dir (str) – The directory path where the engine is saved

  • trtllm_model_weights (dict) – The TRTLLM converted model weights dict

  • trtllm_model_config – The TRTLLM Config

  • max_input_len (int, optional) – Max input length. Defaults to 1024.

  • max_output_len (int, optional) – Max output length. Defaults to 1024.

  • max_batch_size (int, optional) – Max batch size. Defaults to 4.

  • lora_ckpt_list (optional) – LoRA checkpoint list. Defaults to None.

  • use_lora_plugin (optional) – Whether to use the LoRA plugin. Defaults to None.

  • max_lora_rank (int, optional) – Max LoRA rank. Defaults to 64.

  • lora_target_modules (optional) – LoRA target modules. Defaults to None.

  • max_prompt_embedding_table_size (int, optional) – Max size of prompt embedding table. Defaults to 0.

  • paged_kv_cache (bool, optional) – Use Paged KV cache. Defaults to True.

  • remove_input_padding (bool, optional) – Remove input padding. Defaults to True.

  • paged_context_fmha (bool, optional) – Paged context fmha. Defaults to False.

  • use_refit (bool, optional) – Use refit. Defaults to False.

  • max_num_tokens (int, optional) – Max num of tokens. Defaults to None.

  • max_seq_len (int, optional) – Max seq length. Defaults to None.

  • opt_num_tokens (int, optional) – Opt number of tokens. Defaults to None.

  • max_beam_width (int, optional) – Max beam width. Defaults to 1.

  • tokens_per_block (int, optional) – Number of tokens per block. Defaults to 128.

  • multiple_profiles (bool, optional) – Use multiple profiles. Defaults to False.

  • gpt_attention_plugin (str, optional) – GPT attention plugin to use. Defaults to "auto".

  • gemm_plugin (str, optional) – GEMM plugin to use. Defaults to "auto".
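
A brief sketch of building one engine per rank from the weight and config lists returned by get_trtllm_pretrained_config_and_model_weights; the output path and size limits are illustrative, not required values.

```python
# Illustrative values; assumes the weight/config lists produced earlier.
for trtllm_model_weights, trtllm_model_config in zip(
    trtllm_model_weights_list, trtllm_model_config_list
):
    trtllm_helper.build_and_save_engine(
        engine_dir='/tmp/trtllm_engine',     # hypothetical output directory
        trtllm_model_weights=trtllm_model_weights,
        trtllm_model_config=trtllm_model_config,
        max_input_len=1024,
        max_output_len=256,
        max_batch_size=4,
    )
```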