nemo_export.tensorrt_llm_hf#

Module Contents#

Classes#

TensorRTLLMHF

Exports HuggingFace checkpoints to TensorRT-LLM and runs fast inference.

Data#

API#

nemo_export.tensorrt_llm_hf.LOGGER = 'getLogger(...)'#
class nemo_export.tensorrt_llm_hf.TensorRTLLMHF(
model_dir: str,
lora_ckpt_list: List[str] = None,
load_model: bool = True,
use_python_runtime: bool = True,
enable_chunked_context: bool = None,
max_tokens_in_paged_kv_cache: int = None,
multi_block_mode: bool = False,
)#

Bases: nemo_export.tensorrt_llm.TensorRTLLM

Exports HuggingFace checkpoints to TensorRT-LLM and runs fast inference.

This class provides functionality to export HuggingFace models to TensorRT-LLM format and run inference using the exported models. It inherits from TensorRTLLM and adds HuggingFace-specific export capabilities.

.. rubric:: Example

from nemo_export.tensorrt_llm_hf import TensorRTLLMHF

trt_llm_exporter = TensorRTLLMHF(model_dir="/path/for/model/files")
trt_llm_exporter.export_hf_model(
    hf_model_path="/path/to/huggingface/model",
    max_batch_size=8,
    tensor_parallelism_size=1,
)

output = trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
print("output: ", output)

Initialization

Initialize TensorRTLLMHF exporter.

Parameters:
  • model_dir (str) – Path for storing the TensorRT-LLM model files.

  • lora_ckpt_list (List[str], optional) – List of LoRA checkpoint paths. Defaults to None.

  • load_model (bool, optional) – Load TensorRT-LLM model if engine files exist. Defaults to True.

  • use_python_runtime (bool, optional) – Use the Python runtime if True, otherwise the C++ runtime. Defaults to True.

  • enable_chunked_context (bool, optional) – Enable chunked context processing. Defaults to None.

  • max_tokens_in_paged_kv_cache (int, optional) – Max tokens in paged KV cache. Defaults to None.

  • multi_block_mode (bool, optional) – Enable faster decoding in multi-head attention. Defaults to False.
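
A minimal instantiation sketch (the path is a placeholder; enable_chunked_context and max_tokens_in_paged_kv_cache are assumed here to take effect only with the C++ runtime, i.e. use_python_runtime=False):

from nemo_export.tensorrt_llm_hf import TensorRTLLMHF

# Placeholder engine directory; the chunked-context and paged-KV-cache
# settings below are assumed to apply to the C++ runtime only.
exporter = TensorRTLLMHF(
    model_dir="/tmp/trt_llm_engine_dir",
    use_python_runtime=False,
    enable_chunked_context=True,
    max_tokens_in_paged_kv_cache=8192,
)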

export_hf_model(
hf_model_path: str,
max_batch_size: int = 8,
tensor_parallelism_size: int = 1,
max_input_len: int = 256,
max_output_len: int = 256,
max_num_tokens: Optional[int] = None,
opt_num_tokens: Optional[int] = None,
dtype: Optional[str] = None,
max_seq_len: Optional[int] = 512,
gemm_plugin: str = 'auto',
remove_input_padding: bool = True,
use_paged_context_fmha: bool = True,
paged_kv_cache: bool = True,
tokens_per_block: int = 128,
multiple_profiles: bool = False,
reduce_fusion: bool = False,
max_beam_width: int = 1,
use_refit: bool = False,
model_type: Optional[str] = None,
delete_existing_files: bool = True,
)#

Export a Hugging Face model to TensorRT-LLM format.

This method exports a Hugging Face model to TensorRT-LLM format with various configuration options for model parallelism, quantization, and inference parameters.

Parameters:
  • hf_model_path (str) – Path to the Hugging Face model directory.

  • max_batch_size (int, optional) – Maximum batch size. Defaults to 8.

  • tensor_parallelism_size (int, optional) – Size of tensor parallelism. Defaults to 1.

  • max_input_len (int, optional) – Maximum input sequence length. Defaults to 256.

  • max_output_len (int, optional) – Maximum output sequence length. Defaults to 256.

  • max_num_tokens (Optional[int], optional) – Maximum number of tokens. Defaults to None.

  • opt_num_tokens (Optional[int], optional) – Optimal number of tokens. Defaults to None.

  • dtype (Optional[str], optional) – Data type for model weights. Defaults to None.

  • max_seq_len (Optional[int], optional) – Maximum sequence length. Defaults to 512.

  • gemm_plugin (str, optional) – GEMM plugin type. Defaults to 'auto'.

  • remove_input_padding (bool, optional) – Remove input padding. Defaults to True.

  • use_paged_context_fmha (bool, optional) – Use paged context FMHA. Defaults to True.

  • paged_kv_cache (bool, optional) – Use paged KV cache. Defaults to True.

  • tokens_per_block (int, optional) – Tokens per block. Defaults to 128.

  • multiple_profiles (bool, optional) – Use multiple profiles. Defaults to False.

  • reduce_fusion (bool, optional) – Enable reduce fusion. Defaults to False.

  • max_beam_width (int, optional) – Maximum beam width. Defaults to 1.

  • use_refit (bool, optional) – Use refit. Defaults to False.

  • model_type (Optional[str], optional) – Type of the model. Defaults to None.

  • delete_existing_files (bool, optional) – Delete existing files. Defaults to True.

Raises:
  • ValueError – If model_type is not supported or dtype cannot be determined.

  • FileNotFoundError – If config file is not found.

  • RuntimeError – If there are errors reading the config file.
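
A usage sketch (paths and sizes are placeholders; when dtype is None, the data type is presumably inferred from the checkpoint config, per the ValueError above):

from nemo_export.tensorrt_llm_hf import TensorRTLLMHF

exporter = TensorRTLLMHF(model_dir="/tmp/trt_llm_engine_dir")  # placeholder path
exporter.export_hf_model(
    hf_model_path="/path/to/huggingface/model",  # placeholder checkpoint dir
    max_batch_size=8,
    tensor_parallelism_size=1,
    max_input_len=1024,
    max_seq_len=2048,
    dtype="bfloat16",  # set explicitly to avoid relying on config inference
)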

get_hf_model_type(model_dir: str) → str#

Get the model type from a Hugging Face model directory.

This method infers the model type from the ‘architectures’ field in the model’s config.json file.

Parameters:

model_dir (str) – Path to the Hugging Face model directory or model ID at Hugging Face Hub.

Returns:

The inferred model type (e.g., “LlamaForCausalLM”).

Return type:

str

Raises:

ValueError – If the architecture choice is ambiguous.
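
For illustration only, the lookup is equivalent to reading the 'architectures' list from the checkpoint's config.json; a minimal sketch for a local directory (the real method also accepts a Hugging Face Hub model ID, which this sketch does not handle):

import json
import os

def infer_model_type(model_dir: str) -> str:
    # Sketch of the equivalent lookup, not the actual implementation.
    with open(os.path.join(model_dir, "config.json")) as f:
        architectures = json.load(f)["architectures"]
    if len(architectures) != 1:
        raise ValueError(f"Ambiguous architecture choice: {architectures}")
    return architectures[0]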

get_hf_model_dtype(model_dir: str) → Optional[str]#

Get the data type from a Hugging Face model directory.

This method reads the config file from a Hugging Face model directory and identifies the model’s data type from various possible locations in the config.

Parameters:

model_dir (str) – Path to the Hugging Face model directory.

Returns:

The model’s data type if found in config, None otherwise.

Return type:

Optional[str]

Raises:
  • FileNotFoundError – If the config file is not found.

  • ValueError – If the config file contains invalid JSON.

  • RuntimeError – If there are errors reading the config file.
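
A short usage sketch (placeholder paths; the return value depends on what the checkpoint's config declares):

from nemo_export.tensorrt_llm_hf import TensorRTLLMHF

exporter = TensorRTLLMHF(model_dir="/tmp/trt_llm_engine_dir")  # placeholder path
dtype = exporter.get_hf_model_dtype("/path/to/huggingface/model")
if dtype is None:
    # The config declares no dtype; pick one explicitly before exporting.
    dtype = "float16"  # assumption for this sketch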

property get_supported_hf_model_mapping#

Supported HF Model Mapping.
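
A hedged inspection sketch (the exact keys and values depend on the installed package version):

from nemo_export.tensorrt_llm_hf import TensorRTLLMHF

exporter = TensorRTLLMHF(model_dir="/tmp/trt_llm_engine_dir")  # placeholder path
supported = exporter.get_supported_hf_model_mapping
print(sorted(supported))  # e.g. lists the supported HF architecture names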