core.export.trtllm.trtllm_helper#
Module Contents#
Classes#
TRTLLMHelper – TRTLLM Helper class to convert, export, and build a TRTLLM model.
API#
- class core.export.trtllm.trtllm_helper.TRTLLMHelper(
- *,
- transformer_config: megatron.core.transformer.transformer_config.TransformerConfig,
- model_type: megatron.core.export.model_type.ModelType,
- trtllm_conversion_dict: dict = {},
- position_embedding_type: str = 'learned_absolute',
- max_position_embeddings: int = None,
- rotary_percentage: float = 1.0,
- rotary_base: int = 10000,
- rope_scaling_factor: float = 8.0,
- moe_tp_mode: int = 2,
- multi_query_mode: bool = False,
- activation: str = 'gelu',
- seq_len_interpolation_factor: float = None,
- moe_renorm_mode=None,
- share_embeddings_and_output_weights=False,
- )
TRTLLM Helper class to convert, export, and build a TRTLLM model.
Initialization
Constructor for the TRTLLMHelper.
This helper supports two public APIs: a) get_trtllm_pretrained_config_and_model_weights and b) build_and_save_engine.
- Parameters:
transformer_config (TransformerConfig) – The transformer config
model_type (ModelType) – The type of the input model. Enum (megatron.core.export.model_type.ModelType)
trtllm_conversion_dict (dict, optional) – A conversion dictionary that maps your model layer names to their TRTLLM equivalents. The default dictionary is given in megatron/core/export/model_to_trtllm_mapping; this dict is merged into the default dict. NOTE: Ignore layer numbers in the model layer names, e.g. decoder.layers.0.attention_qkv.weight becomes decoder.layers.attention_qkv.weight in the mapping dictionary. Defaults to {}.
position_embedding_type (str, optional) – The position embedding type. Defaults to 'learned_absolute'.
max_position_embeddings (int, optional) – Max position embeddings value. Defaults to None.
rotary_percentage (float, optional) – The rotary percentage if using rope embeddings. Defaults to 1.0.
rotary_base (int, optional) – The rotary base (theta value) if using rope embeddings. Defaults to 10000.
rope_scaling_factor (float, optional) – The rope scaling factor if using rope embeddings. Defaults to 8.0.
moe_tp_mode (int, optional) – The tensor parallelism mode for mixture-of-experts layers, passed through to the TRTLLM config. Defaults to 2.
multi_query_mode (bool, optional) – Whether the model uses multi-query attention. Defaults to False.
activation (str, optional) – The activation function. Defaults to 'gelu'.
seq_len_interpolation_factor (float, optional) – The sequence length interpolation factor if using rope embeddings. Defaults to None.
moe_renorm_mode (optional) – Renormalization mode if using mixture of experts. Defaults to None.
share_embeddings_and_output_weights (bool, optional) – True if input and output layers share weights. Defaults to False.
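A minimal construction sketch (values are illustrative placeholders, not recommendations; assumes megatron.core and its export dependencies, including tensorrt_llm, are installed, and that ModelType exposes a gpt member):

```python
from megatron.core.export.model_type import ModelType
from megatron.core.export.trtllm.trtllm_helper import TRTLLMHelper
from megatron.core.transformer.transformer_config import TransformerConfig

# Placeholder transformer config; use the config of your actual model.
transformer_config = TransformerConfig(
    num_layers=2, hidden_size=64, num_attention_heads=2, use_cpu_initialization=True
)

trtllm_helper = TRTLLMHelper(
    transformer_config=transformer_config,
    model_type=ModelType.gpt,  # assumption: a GPT-style model
    position_embedding_type='learned_absolute',
    max_position_embeddings=512,
    share_embeddings_and_output_weights=True,
)
```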
- _get_trtllm_config(
- export_config: megatron.core.export.export_config.ExportConfig,
- world_size: int,
- gpus_per_node: int,
- vocab_size_padded: int,
- dtype: megatron.core.export.data_type.DataType,
- fp8_quantized: bool = False,
- fp8_kvcache: bool = False,
- )
Get TRTLLM Config
Returns the appropriate TRTLLM PretrainedConfig used by TRTLLM for building the engine
- Parameters:
export_config (ExportConfig) – The export config that defines inference TP size, PP size, etc.
world_size (int) – The number of GPUs (usually TP * PP)
gpus_per_node (int) – Number of GPUs per node
vocab_size_padded (int) – Padded vocab size
dtype (DataType) – The data type or model precision
fp8_quantized (bool, optional) – True for fp8 checkpoint export. Defaults to False.
fp8_kvcache (bool, optional) – True for fp8 KV-cache quantization. Defaults to False.
- Returns:
The GPTConfig, LLamaConfig, or other PretrainedConfig constructed from your model config
- _load_scaling_factors(model_state_dict: dict) → dict#
Loads scaling factors from model state dictionary.
- Parameters:
model_state_dict (dict) – Model state dictionary
- Returns:
Maps each scaling factor key to its value and its inverse. The inverse is used for casting the quantized weights.
- Return type:
dict
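The exact key handling is internal to this method; the following is only a rough sketch of the idea, where the 'scaling_factor' key pattern and the output field names are hypothetical:

```python
def load_scaling_factors_sketch(model_state_dict: dict) -> dict:
    """Rough sketch: pair each scaling factor with its inverse.

    The inverse is what gets used to cast the quantized weights.
    The 'scaling_factor' substring check is an assumption for illustration.
    """
    scales = {}
    for key, value in model_state_dict.items():
        if 'scaling_factor' in key:  # assumption: how scale entries are named
            scales[key] = {'scale': value, 'inverse': 1.0 / value}
    return scales
```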
- get_trtllm_pretrained_config_and_model_weights(
- model_state_dict,
- dtype: megatron.core.export.data_type.DataType,
- export_config: megatron.core.export.export_config.ExportConfig = None,
- on_device_distributed_conversion: bool = False,
- vocab_size: int = None,
- gpus_per_node: int = None,
- state_dict_split_by_layer_numbers: bool = True,
- fp8_quantized: bool = False,
- fp8_kvcache: bool = False,
- )
Get TRTLLM Config and Converted Model Weights
This function returns the TRTLLM model weights as a list. There are two modes of conversion: the default uses a single device (CPU/GPU); the other converts on each device in a distributed setting. NOTE: For faster performance, if your entire model fits in memory, transfer the model state dict to a CUDA device before calling this function. For on-device conversion, the returned weights are meant to be used on the device itself; the same applies to the pretrained config.
- Parameters:
model_state_dict (dict) – The input model state dictionary (the entire model state loaded on CPU, or each GPU's shard of the model state dict in the case of on-device conversion)
export_config (ExportConfig) – The export config used to define inference TP size, PP size, etc. Used only for single-device conversion; for on-device conversion the parallelism is inferred from the sharded model state dict.
dtype (DataType) – The data type or model precision
on_device_distributed_conversion (bool, optional) – Convert on GPUs in a distributed setting. This assumes that the model state dict is sharded according to the required inference model parallelism and that each GPU gets its part of the model state dict. Defaults to False.
vocab_size (int, optional) – The vocabulary size. Defaults to None.
gpus_per_node (int, optional) – The number of GPUs per node. Used for on-device conversion. Defaults to None.
state_dict_split_by_layer_numbers (bool, optional) – Whether the model layers are split by layer number in the state dict. For example, mlp.fc1.weight can be represented as a single mlp.fc1.weight of shape [num_layers, hidden_dim, ffn_hidden_dim] (representation 1), or as mlp.fc1.layers.0.weight of shape [hidden_dim, ffn_hidden_dim], then mlp.fc1.layers.1.weight, and so on for all layers (representation 2). If you use representation 2, set this to True. Defaults to True.
fp8_quantized (bool, optional) – True for fp8 checkpoint export. Defaults to False.
fp8_kvcache (bool, optional) – True for fp8 KV-cache quantization. Defaults to False.
- Returns:
Two lists: the TRTLLM-converted model weights (either on-device, or one set of weights per GPU) and the corresponding trtllm_model_configs.
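An illustrative single-device conversion flow, continuing the construction sketch above (the model variable and parallel sizes are placeholders; assumes DataType exposes a bfloat16 member and that every state dict value is a tensor):

```python
from megatron.core.export.data_type import DataType
from megatron.core.export.export_config import ExportConfig

model_state_dict = model.state_dict()  # placeholder: your trained Megatron model

# Per the NOTE above: if the whole model fits in GPU memory, moving the
# state dict to a CUDA device first makes conversion faster.
model_state_dict = {k: v.cuda() for k, v in model_state_dict.items()}

export_config = ExportConfig(inference_tp_size=2, inference_pp_size=1)

trtllm_model_weights_list, trtllm_model_config_list = (
    trtllm_helper.get_trtllm_pretrained_config_and_model_weights(
        model_state_dict=model_state_dict,
        dtype=DataType.bfloat16,
        export_config=export_config,
        state_dict_split_by_layer_numbers=True,
    )
)
```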
- _add_scales_to_converter(
- converter: Union[megatron.core.export.trtllm.trtllm_weights_converter.single_device_trtllm_model_weights_converter.SingleDeviceTRTLLMModelWeightsConverter, megatron.core.export.trtllm.trtllm_weights_converter.distributed_trtllm_model_weights_converter.DistributedTRTLLMModelWeightsConverter],
- scales: dict,
- fp8_kvcache: bool,
- )
Adds scaling factors to the distributed and single device converters.
- Parameters:
converter (ModelWeightConverter) – Converter, holding the TRT-LLM model weights.
scales (dict) – Dictionary holding TRT-LLM scaling factors
fp8_kvcache (bool) – If True, creates scaling factors (equal to 1.0) for KV-cache quantization
- _get_trtllm_pretrained_config_and_model_weights_in_distributed_setting(
- model_state_dict: dict,
- dtype: megatron.core.export.data_type.DataType,
- vocab_size: int,
- gpus_per_node: int,
- scales: dict,
- fp8_quantized: bool,
- fp8_kvcache: bool,
- )
Get the TRTLLM Pretrained config and model weights list in a distributed setting
This function assumes the model state dict is distributed according to model parallelism; each device gets its own shard of the model state dict.
- Parameters:
model_state_dict (dict) – The model state dictionary (each device holds its own shard)
dtype (DataType) – The data type or model precision
vocab_size (int) – Tokenizer vocab size
gpus_per_node (int) – The number of gpus per node
scales (dict) – Dictionary with fp8 scaling factors
fp8_quantized (bool) – True for fp8 checkpoint export
fp8_kvcache (bool) – True for fp8 KV-cache quantization
- Returns:
Two lists: the TRTLLM-converted model weights and the TRTLLM model configs (one for each GPU).
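This method is reached through the public get_trtllm_pretrained_config_and_model_weights API when on_device_distributed_conversion=True; an illustrative call from each rank (the state dict and vocab size variables are placeholders):

```python
from megatron.core.export.data_type import DataType

# Run on every rank; each rank holds only its shard of the model state dict.
weights_list, config_list = trtllm_helper.get_trtllm_pretrained_config_and_model_weights(
    model_state_dict=per_rank_state_dict,  # placeholder: this rank's shard
    dtype=DataType.bfloat16,               # assumption: DataType has bfloat16
    on_device_distributed_conversion=True,
    vocab_size=tokenizer_vocab_size,       # placeholder
    gpus_per_node=8,
)
```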
- _get_trtllm_pretrained_config_and_model_weights_list_on_single_device(
- export_config: megatron.core.export.export_config.ExportConfig,
- model_state_dict: dict,
- dtype: megatron.core.export.data_type.DataType,
- gpus_per_node,
- state_dict_split_by_layer_numbers,
- scales: dict,
- fp8_quantized: bool,
- fp8_kvcache: bool,
- )
Get the TRTLLM Pretrained config and model weights list (one per GPU rank) on a single device (CPU/GPU)
This function assumes the entire model state dict is present on the CPU or on one GPU
- Parameters:
export_config (ExportConfig) – The export config that sets inference TP size, PP size, etc.
model_state_dict (dict) – The model state dictionary (all collected on CPU)
dtype (DataType) – The data type or model precision
gpus_per_node (int, optional) – Number of GPUs per node
state_dict_split_by_layer_numbers (bool, optional) – Whether the model layers are split by layer number in the state dict. For example, mlp.fc1.weight can be represented as a single mlp.fc1.weight of shape [num_layers, hidden_dim, ffn_hidden_dim] (representation 1), or as mlp.fc1.layers.0.weight of shape [hidden_dim, ffn_hidden_dim], then mlp.fc1.layers.1.weight, and so on for all layers (representation 2). If you use representation 2, set this to True. Defaults to True.
scales (dict) – Dictionary with fp8 scaling factors
fp8_quantized (bool) – True for fp8 checkpoint export
fp8_kvcache (bool) – True for fp8 KV-cache quantization
- Returns:
Two lists: the TRTLLM-converted model weights and the TRTLLM model configs (one for each GPU).
- build_and_save_engine(
- engine_dir: str,
- trtllm_model_weights: dict,
- trtllm_model_config,
- max_input_len: int = 1024,
- max_output_len: int = 1024,
- max_batch_size: int = 4,
- lora_ckpt_list=None,
- use_lora_plugin=None,
- max_lora_rank: int = 64,
- lora_target_modules=None,
- max_prompt_embedding_table_size: int = 0,
- paged_kv_cache: bool = True,
- remove_input_padding: bool = True,
- paged_context_fmha: bool = False,
- use_refit: bool = False,
- max_num_tokens: int = None,
- max_seq_len: int = None,
- opt_num_tokens: int = None,
- max_beam_width: int = 1,
- tokens_per_block: int = 128,
- multiple_profiles: bool = False,
- gpt_attention_plugin: str = 'auto',
- gemm_plugin: str = 'auto',
- )
Method to build the TRTLLM engine
This method uses the TRTLLMEngineBuilder to build the engine and save it to engine_dir
- Parameters:
engine_dir (str) – The directory path where the engine is saved
trtllm_model_weights (dict) – The TRTLLM converted model weights dict
trtllm_model_config – The TRTLLM Config
max_input_len (int, optional) – Max input length. Defaults to 1024.
max_output_len (int, optional) – Max output length. Defaults to 1024.
max_batch_size (int, optional) – Max batch size. Defaults to 4.
lora_ckpt_list (optional) – LoRA checkpoint list. Defaults to None.
use_lora_plugin (optional) – Use the LoRA plugin. Defaults to None.
max_lora_rank (int, optional) – Max LoRA rank. Defaults to 64.
lora_target_modules (optional) – LoRA target modules. Defaults to None.
max_prompt_embedding_table_size (int, optional) – Max size of prompt embedding table. Defaults to 0.
paged_kv_cache (bool, optional) – Use Paged KV cache. Defaults to True.
remove_input_padding (bool, optional) – Remove input padding. Defaults to True.
paged_context_fmha (bool, optional) – Paged context fmha. Defaults to False.
use_refit (bool, optional) – Use refit. Defaults to False.
max_num_tokens (int, optional) – Max num of tokens. Defaults to None.
max_seq_len (int, optional) – Max seq length. Defaults to None.
opt_num_tokens (int, optional) – Opt number of tokens. Defaults to None.
max_beam_width (int, optional) – Max beam width. Defaults to 1.
tokens_per_block (int, optional) – Number of tokens per block. Defaults to 128.
multiple_profiles (bool, optional) – Use multiple profiles. Defaults to False.
gpt_attention_plugin (str, optional) – Gpt attention plugin to use. Defaults to “auto”.
gemm_plugin (str, optional) – GEMM plugin to use. Defaults to 'auto'.
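Finally, an illustrative engine build for each converted rank, continuing the sketches above (the directory and the length limits are placeholders):

```python
for weights, config in zip(trtllm_model_weights_list, trtllm_model_config_list):
    trtllm_helper.build_and_save_engine(
        engine_dir='/tmp/trtllm_engine',  # placeholder output directory
        trtllm_model_weights=weights,
        trtllm_model_config=config,
        max_input_len=1024,
        max_output_len=256,
        max_batch_size=4,
    )
```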