nemo_export.tensorrt_llm_hf#
Module Contents#
Classes#
TensorRTLLMHF – Exports HuggingFace checkpoints to TensorRT-LLM and runs fast inference.
Data#
LOGGER
API#
- nemo_export.tensorrt_llm_hf.LOGGER = logging.getLogger(...)#
- class nemo_export.tensorrt_llm_hf.TensorRTLLMHF(
- model_dir: str,
- lora_ckpt_list: List[str] = None,
- load_model: bool = True,
- use_python_runtime: bool = True,
- enable_chunked_context: bool = None,
- max_tokens_in_paged_kv_cache: int = None,
- multi_block_mode: bool = False,
- )#
Bases: nemo_export.tensorrt_llm.TensorRTLLM

Exports HuggingFace checkpoints to TensorRT-LLM and runs fast inference.
This class provides functionality to export HuggingFace models to TensorRT-LLM format and run inference using the exported models. It inherits from TensorRTLLM and adds HuggingFace-specific export capabilities.
Example

```python
from nemo_export.tensorrt_llm_hf import TensorRTLLMHF

trt_llm_exporter = TensorRTLLMHF(model_dir="/path/for/model/files")
trt_llm_exporter.export_hf_model(
    hf_model_path="/path/to/huggingface/model",
    max_batch_size=8,
    tensor_parallelism_size=1,
)

output = trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
print("output: ", output)
```
Initialization
Initialize TensorRTLLMHF exporter.
- Parameters:
model_dir (str) – Path for storing the TensorRT-LLM model files.
lora_ckpt_list (List[str], optional) – List of LoRA checkpoint paths. Defaults to None.
load_model (bool, optional) – Load TensorRT-LLM model if engine files exist. Defaults to True.
use_python_runtime (bool, optional) – Whether to use the Python runtime (True) or the C++ runtime (False). Defaults to True.
enable_chunked_context (bool, optional) – Enable chunked context processing. Defaults to None.
max_tokens_in_paged_kv_cache (int, optional) – Max tokens in paged KV cache. Defaults to None.
multi_block_mode (bool, optional) – Enable faster decoding in multihead attention. Defaults to False.
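A minimal construction sketch; the paths are placeholders, and the keyword values shown simply make the documented defaults explicit:

```python
from nemo_export.tensorrt_llm_hf import TensorRTLLMHF

# Placeholder directory where engine files will be stored (and loaded
# from, if they already exist and load_model=True).
exporter = TensorRTLLMHF(
    model_dir="/tmp/trtllm_engines/llama",
    load_model=True,          # load an existing engine from model_dir if present
    use_python_runtime=True,  # set to False to use the C++ runtime instead
)
```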
- export_hf_model(
- hf_model_path: str,
- max_batch_size: int = 8,
- tensor_parallelism_size: int = 1,
- max_input_len: int = 256,
- max_output_len: int = 256,
- max_num_tokens: Optional[int] = None,
- opt_num_tokens: Optional[int] = None,
- dtype: Optional[str] = None,
- max_seq_len: Optional[int] = 512,
- gemm_plugin: str = 'auto',
- remove_input_padding: bool = True,
- use_paged_context_fmha: bool = True,
- paged_kv_cache: bool = True,
- tokens_per_block: int = 128,
- multiple_profiles: bool = False,
- reduce_fusion: bool = False,
- max_beam_width: int = 1,
- use_refit: bool = False,
- model_type: Optional[str] = None,
- delete_existing_files: bool = True,
- )#
Export a Hugging Face model to TensorRT-LLM format.
This method exports a Hugging Face model to TensorRT-LLM format with various configuration options for model parallelism, quantization, and inference parameters.
- Parameters:
hf_model_path (str) – Path to the Hugging Face model directory.
max_batch_size (int, optional) – Maximum batch size. Defaults to 8.
tensor_parallelism_size (int, optional) – Size of tensor parallelism. Defaults to 1.
max_input_len (int, optional) – Maximum input sequence length. Defaults to 256.
max_output_len (int, optional) – Maximum output sequence length. Defaults to 256.
max_num_tokens (Optional[int], optional) – Maximum number of batched tokens per forward pass. Defaults to None.
opt_num_tokens (Optional[int], optional) – Number of batched tokens to optimize the engine for. Defaults to None.
dtype (Optional[str], optional) – Data type for model weights. Defaults to None.
max_seq_len (Optional[int], optional) – Maximum sequence length. Defaults to 512.
gemm_plugin (str, optional) – GEMM plugin type. Defaults to “auto”.
remove_input_padding (bool, optional) – Remove input padding. Defaults to True.
use_paged_context_fmha (bool, optional) – Use paged context FMHA. Defaults to True.
paged_kv_cache (bool, optional) – Use paged KV cache. Defaults to True.
tokens_per_block (int, optional) – Number of tokens per paged-KV-cache block. Defaults to 128.
multiple_profiles (bool, optional) – Build the engine with multiple optimization profiles. Defaults to False.
reduce_fusion (bool, optional) – Enable reduce fusion. Defaults to False.
max_beam_width (int, optional) – Maximum beam width. Defaults to 1.
use_refit (bool, optional) – Build a refittable engine whose weights can be updated without a full rebuild. Defaults to False.
model_type (Optional[str], optional) – Type of the model. Defaults to None.
delete_existing_files (bool, optional) – Delete existing files in the model directory before exporting. Defaults to True.
- Raises:
ValueError – If model_type is not supported or dtype cannot be determined.
FileNotFoundError – If config file is not found.
RuntimeError – If there are errors reading the config file.
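A usage sketch for this method; the path is a placeholder and the non-default values are purely illustrative:

```python
# Export a HuggingFace checkpoint to a TensorRT-LLM engine.
exporter.export_hf_model(
    hf_model_path="/path/to/huggingface/model",  # placeholder path
    max_batch_size=8,
    tensor_parallelism_size=1,
    max_input_len=1024,   # raised from the 256 default for longer prompts
    max_seq_len=2048,     # raised from the 512 default to match
    dtype="bfloat16",     # pass None to infer the dtype from the HF config
)
```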
- get_hf_model_type(model_dir: str) → str#
Get the model type from a Hugging Face model directory.
This method infers the model type from the ‘architectures’ field in the model’s config.json file.
- Parameters:
model_dir (str) – Path to the Hugging Face model directory or model ID at Hugging Face Hub.
- Returns:
The inferred model type (e.g., “LlamaForCausalLM”).
- Return type:
str
- Raises:
ValueError – If the architecture choice is ambiguous.
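For intuition, the lookup described above can be approximated by reading the architectures field from config.json directly. The standalone sketch below is an approximation of the documented behavior for local directories (the real method also accepts Hub model IDs), not the method's actual implementation:

```python
import json
import os

def infer_hf_model_type(model_dir: str) -> str:
    """Approximate get_hf_model_type: read 'architectures' from config.json."""
    with open(os.path.join(model_dir, "config.json")) as f:
        config = json.load(f)
    architectures = config.get("architectures", [])
    if len(architectures) != 1:
        # Mirrors the documented ValueError for an ambiguous architecture choice.
        raise ValueError(f"Ambiguous architecture choice: {architectures}")
    return architectures[0]  # e.g. "LlamaForCausalLM"
```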
- get_hf_model_dtype(model_dir: str) → Optional[str]#
Get the data type from a Hugging Face model directory.
This method reads the config file from a Hugging Face model directory and identifies the model’s data type from various possible locations in the config.
- Parameters:
model_dir (str) – Path to the Hugging Face model directory.
- Returns:
The model’s data type if found in config, None otherwise.
- Return type:
Optional[str]
- Raises:
FileNotFoundError – If the config file is not found.
ValueError – If the config file contains invalid JSON.
RuntimeError – If there are errors reading the config file.
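A rough standalone equivalent, assuming the dtype is stored under the common top-level torch_dtype key (the actual method checks several possible locations in the config):

```python
import json
import os
from typing import Optional

def infer_hf_model_dtype(model_dir: str) -> Optional[str]:
    """Approximate get_hf_model_dtype: read torch_dtype from config.json."""
    config_path = os.path.join(model_dir, "config.json")
    if not os.path.isfile(config_path):
        raise FileNotFoundError(f"Config file not found: {config_path}")
    with open(config_path) as f:
        # json.JSONDecodeError is a ValueError subclass, matching the
        # documented ValueError for invalid JSON.
        config = json.load(f)
    return config.get("torch_dtype")  # e.g. "bfloat16", or None if absent
```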
- property get_supported_hf_model_mapping#
Supported HF Model Mapping.
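Usage sketch; the exact shape of the mapping is not documented here, but it presumably relates HuggingFace architecture names to the TensorRT-LLM model types the exporter supports:

```python
# Inspect which HuggingFace architectures this exporter can handle.
print(exporter.get_supported_hf_model_mapping)
```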