nemo_export.tensorrt_llm_hf#

Module Contents#

Classes#

TensorRTLLMHF

Exports HuggingFace checkpoints to TensorRT-LLM and runs fast inference.

Data#

API#

nemo_export.tensorrt_llm_hf.LOGGER = 'getLogger(...)'#
class nemo_export.tensorrt_llm_hf.TensorRTLLMHF(
model_dir: str,
lora_ckpt_list: List[str] = None,
load_model: bool = True,
use_python_runtime: bool = True,
enable_chunked_context: bool = None,
max_tokens_in_paged_kv_cache: int = None,
multi_block_mode: bool = False,
)#

Bases: nemo_export.tensorrt_llm.TensorRTLLM

Exports HuggingFace checkpoints to TensorRT-LLM and runs fast inference.

This class provides functionality to export HuggingFace models to TensorRT-LLM format and run inference using the exported models. It inherits from TensorRTLLM and adds HuggingFace-specific export capabilities.

.. rubric:: Example

from nemo_export.tensorrt_llm_hf import TensorRTLLMHF

trt_llm_exporter = TensorRTLLMHF(model_dir="/path/for/model/files")
trt_llm_exporter.export_hf_model(
    hf_model_path="/path/to/huggingface/model",
    max_batch_size=8,
    tensor_parallelism_size=1,
)

output = trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
print("output: ", output)

Initialization

Initialize TensorRTLLMHF exporter.

Parameters:
  • model_dir (str) – Path for storing the TensorRT-LLM model files.

  • lora_ckpt_list (List[str], optional) – List of LoRA checkpoint paths. Defaults to None.

  • load_model (bool, optional) – Load TensorRT-LLM model if engine files exist. Defaults to True.

  • use_python_runtime (bool, optional) – Use the Python runtime if True, otherwise the C++ runtime. Defaults to True.

  • enable_chunked_context (bool, optional) – Enable chunked context processing. Defaults to None.

  • max_tokens_in_paged_kv_cache (int, optional) – Max tokens in paged KV cache. Defaults to None.

  • multi_block_mode (bool, optional) – Enable faster decoding in multi-head attention. Defaults to False.
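
A minimal instantiation sketch (the path is a placeholder; enable_chunked_context and max_tokens_in_paged_kv_cache are assumed here to take effect only with the C++ runtime, i.e. use_python_runtime=False):

from nemo_export.tensorrt_llm_hf import TensorRTLLMHF

# Placeholder engine directory; the chunked-context and paged-KV-cache
# settings below are assumed to apply to the C++ runtime only.
exporter = TensorRTLLMHF(
    model_dir="/tmp/trt_llm_engine_dir",
    use_python_runtime=False,
    enable_chunked_context=True,
    max_tokens_in_paged_kv_cache=8192,
)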

export_hf_model(
hf_model_path: str,
max_batch_size: int = 8,
tensor_parallelism_size: int = 1,
max_input_len: int = 256,
max_output_len: int = 256,
max_num_tokens: Optional[int] = None,
opt_num_tokens: Optional[int] = None,
dtype: Optional[str] = None,
max_seq_len: Optional[int] = 512,
gemm_plugin: str = 'auto',
remove_input_padding: bool = True,
use_paged_context_fmha: bool = True,
paged_kv_cache: bool = True,
tokens_per_block: int = 128,
multiple_profiles: bool = False,
reduce_fusion: bool = False,
max_beam_width: int = 1,
use_refit: bool = False,
model_type: Optional[str] = None,
delete_existing_files: bool = True,
)#

Export a Hugging Face model to TensorRT-LLM format.

This method exports a Hugging Face model to TensorRT-LLM format with various configuration options for model parallelism, quantization, and inference parameters.

Parameters:
  • hf_model_path (str) – Path to the Hugging Face model directory.

  • max_batch_size (int, optional) – Maximum batch size. Defaults to 8.

  • tensor_parallelism_size (int, optional) – Size of tensor parallelism. Defaults to 1.

  • max_input_len (int, optional) – Maximum input sequence length. Defaults to 256.

  • max_output_len (int, optional) – Maximum output sequence length. Defaults to 256.

  • max_num_tokens (Optional[int], optional) – Maximum number of tokens. Defaults to None.

  • opt_num_tokens (Optional[int], optional) – Optimal number of tokens. Defaults to None.

  • dtype (Optional[str], optional) – Data type for model weights. Defaults to None.

  • max_seq_len (Optional[int], optional) – Maximum sequence length. Defaults to 512.

  • gemm_plugin (str, optional) – GEMM plugin type. Defaults to 'auto'.

  • remove_input_padding (bool, optional) – Remove input padding. Defaults to True.

  • use_paged_context_fmha (bool, optional) – Use paged context FMHA. Defaults to True.

  • paged_kv_cache (bool, optional) – Use paged KV cache. Defaults to True.

  • tokens_per_block (int, optional) – Tokens per block. Defaults to 128.

  • multiple_profiles (bool, optional) – Use multiple profiles. Defaults to False.

  • reduce_fusion (bool, optional) – Enable reduce fusion. Defaults to False.

  • max_beam_width (int, optional) – Maximum beam width. Defaults to 1.

  • use_refit (bool, optional) – Use refit. Defaults to False.

  • model_type (Optional[str], optional) – Type of the model. Defaults to None.

  • delete_existing_files (bool, optional) – Delete existing files. Defaults to True.

Raises:
  • ValueError – If model_type is not supported or dtype cannot be determined.

  • FileNotFoundError – If config file is not found.

  • RuntimeError – If there are errors reading the config file.
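
A usage sketch (paths and sizes are placeholders; when dtype is None, the data type is presumably inferred from the checkpoint config, per the ValueError above):

from nemo_export.tensorrt_llm_hf import TensorRTLLMHF

exporter = TensorRTLLMHF(model_dir="/tmp/trt_llm_engine_dir")  # placeholder path
exporter.export_hf_model(
    hf_model_path="/path/to/huggingface/model",  # placeholder checkpoint dir
    max_batch_size=8,
    tensor_parallelism_size=1,
    max_input_len=1024,
    max_seq_len=2048,
    dtype="bfloat16",  # set explicitly to avoid relying on config inference
)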

get_hf_model_type(model_dir: str) → str#

Get the model type from a Hugging Face model directory.

This method infers the model type from the ‘architectures’ field in the model’s config.json file.

Parameters:

model_dir (str) – Path to the Hugging Face model directory or model ID at Hugging Face Hub.

Returns:

The inferred model type (e.g., “LlamaForCausalLM”).

Return type:

str

Raises:

ValueError – If the architecture choice is ambiguous.
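
For illustration only, the lookup is equivalent to reading the 'architectures' list from the checkpoint's config.json; a minimal sketch for a local directory (the real method also accepts a Hugging Face Hub model ID, which this sketch does not handle):

import json
import os

def infer_model_type(model_dir: str) -> str:
    # Sketch of the equivalent lookup, not the actual implementation.
    with open(os.path.join(model_dir, "config.json")) as f:
        architectures = json.load(f)["architectures"]
    if len(architectures) != 1:
        raise ValueError(f"Ambiguous architecture choice: {architectures}")
    return architectures[0]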

get_hf_model_dtype(model_dir: str) → Optional[str]#

Get the data type from a Hugging Face model directory.

This method reads the config file from a Hugging Face model directory and identifies the model’s data type from various possible locations in the config.

Parameters:

model_dir (str) – Path to the Hugging Face model directory.

Returns:

The model’s data type if found in config, None otherwise.

Return type:

Optional[str]

Raises:
  • FileNotFoundError – If the config file is not found.

  • ValueError – If the config file contains invalid JSON.

  • RuntimeError – If there are errors reading the config file.
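
A short usage sketch (placeholder paths; the return value depends on what the checkpoint's config declares):

from nemo_export.tensorrt_llm_hf import TensorRTLLMHF

exporter = TensorRTLLMHF(model_dir="/tmp/trt_llm_engine_dir")  # placeholder path
dtype = exporter.get_hf_model_dtype("/path/to/huggingface/model")
if dtype is None:
    # The config declares no dtype; pick one explicitly before exporting.
    dtype = "float16"  # assumption for this sketch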

property get_supported_hf_model_mapping#

Supported HF Model Mapping.
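
A hedged inspection sketch (the exact keys and values depend on the installed package version):

from nemo_export.tensorrt_llm_hf import TensorRTLLMHF

exporter = TensorRTLLMHF(model_dir="/tmp/trt_llm_engine_dir")  # placeholder path
supported = exporter.get_supported_hf_model_mapping
print(sorted(supported))  # e.g. lists the supported HF architecture names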