nemo_export.tensorrt_llm#

Module Contents#

Classes#

TensorRTLLM

Exports NeMo and Hugging Face checkpoints to TensorRT-LLM and runs fast inference.

Data#

API#

nemo_export.tensorrt_llm.LOGGER = 'getLogger(...)'#
class nemo_export.tensorrt_llm.TensorRTLLM(
model_dir: str,
lora_ckpt_list: List[str] = None,
load_model: bool = True,
use_python_runtime: bool = True,
enable_chunked_context: bool = None,
max_tokens_in_paged_kv_cache: int = None,
multi_block_mode: bool = False,
)#

Bases: nemo_deploy.ITritonDeployable

Exports NeMo and Hugging Face checkpoints to TensorRT-LLM and runs fast inference.

This class provides functionality to export NeMo and HuggingFace models to TensorRT-LLM format and run inference using the exported models. It supports various model architectures and provides options for model parallelism, quantization, and inference parameters.

Example:

from nemo_export.tensorrt_llm import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="/path/for/model/files")
trt_llm_exporter.export(
    nemo_checkpoint_path="/path/for/nemo/checkpoint",
    model_type="llama",
    tensor_parallelism_size=1,
)

output = trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
print("output: ", output)

Initialization

Initialize TensorRTLLM exporter.

Parameters:
  • model_dir (str) – Path for storing the TensorRT-LLM model files.

  • lora_ckpt_list (List[str], optional) – List of LoRA checkpoint paths. Defaults to None.

  • load_model (bool, optional) – Load TensorRT-LLM model if engine files exist. Defaults to True.

  • use_python_runtime (bool, optional) – Whether to use the Python (True) or C++ (False) runtime. Defaults to True.

  • enable_chunked_context (bool, optional) – Enable chunked context processing. Defaults to None.

  • max_tokens_in_paged_kv_cache (int, optional) – Max tokens in paged KV cache. Defaults to None.

  • multi_block_mode (bool, optional) – Enable faster decoding in multihead attention. Defaults to False.
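
A minimal sketch of constructing the exporter with the C++ runtime and chunked-context options; the directory path and KV-cache budget below are hypothetical:

from nemo_export.tensorrt_llm import TensorRTLLM

trt_llm_exporter = TensorRTLLM(
    model_dir="/path/for/model/files",   # hypothetical directory for engine files
    use_python_runtime=False,            # switch from the Python to the C++ runtime
    enable_chunked_context=True,         # enable chunked context processing
    max_tokens_in_paged_kv_cache=8192,   # hypothetical paged KV cache budget
)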

export(
nemo_checkpoint_path: str,
model_type: Optional[str] = None,
delete_existing_files: bool = True,
tensor_parallelism_size: int = 1,
pipeline_parallelism_size: int = 1,
max_input_len: int = 256,
max_output_len: Optional[int] = None,
max_batch_size: int = 8,
use_parallel_embedding: bool = False,
paged_kv_cache: bool = True,
remove_input_padding: bool = True,
use_paged_context_fmha: bool = True,
dtype: Optional[str] = None,
load_model: bool = True,
use_lora_plugin: str = None,
lora_target_modules: List[str] = None,
max_lora_rank: int = 64,
max_num_tokens: Optional[int] = None,
opt_num_tokens: Optional[int] = None,
max_seq_len: Optional[int] = 512,
multiple_profiles: bool = False,
gpt_attention_plugin: str = 'auto',
gemm_plugin: str = 'auto',
reduce_fusion: bool = True,
fp8_quantized: Optional[bool] = None,
fp8_kvcache: Optional[bool] = None,
build_rank: Optional[int] = 0,
)#

Export NeMo checkpoints to TensorRT-LLM format.

This method exports a NeMo checkpoint to TensorRT-LLM format with various configuration options for model parallelism, quantization, and inference parameters.

Parameters:
  • nemo_checkpoint_path (str) – Path to the NeMo checkpoint.

  • model_type (Optional[str], optional) – Type of the model. Defaults to None.

  • delete_existing_files (bool, optional) – Delete existing files in model_dir. Defaults to True.

  • tensor_parallelism_size (int, optional) – Size of tensor parallelism. Defaults to 1.

  • pipeline_parallelism_size (int, optional) – Size of pipeline parallelism. Defaults to 1.

  • max_input_len (int, optional) – Maximum input sequence length. Defaults to 256.

  • max_output_len (Optional[int], optional) – Maximum output sequence length. Defaults to None.

  • max_batch_size (int, optional) – Maximum batch size. Defaults to 8.

  • use_parallel_embedding (bool, optional) – Use parallel embedding. Defaults to False.

  • paged_kv_cache (bool, optional) – Use paged KV cache. Defaults to True.

  • remove_input_padding (bool, optional) – Remove input padding. Defaults to True.

  • use_paged_context_fmha (bool, optional) – Use paged context FMHA. Defaults to True.

  • dtype (Optional[str], optional) – Data type for model weights. Defaults to None.

  • load_model (bool, optional) – Load model after export. Defaults to True.

  • use_lora_plugin (str, optional) – Use LoRA plugin. Defaults to None.

  • lora_target_modules (List[str], optional) – Target modules for LoRA. Defaults to None.

  • max_lora_rank (int, optional) – Maximum LoRA rank. Defaults to 64.

  • max_num_tokens (Optional[int], optional) – Maximum number of tokens. Defaults to None.

  • opt_num_tokens (Optional[int], optional) – Optimal number of tokens. Defaults to None.

  • max_seq_len (Optional[int], optional) – Maximum sequence length. Defaults to 512.

  • multiple_profiles (bool, optional) – Use multiple profiles. Defaults to False.

  • gpt_attention_plugin (str, optional) – GPT attention plugin type. Defaults to "auto".

  • gemm_plugin (str, optional) – GEMM plugin type. Defaults to "auto".

  • reduce_fusion (bool, optional) – Enable reduce fusion. Defaults to True.

  • fp8_quantized (Optional[bool], optional) – Enable FP8 quantization. Defaults to None.

  • fp8_kvcache (Optional[bool], optional) – Enable FP8 KV cache. Defaults to None.

  • build_rank (Optional[int], optional) – Rank to build on. Defaults to 0.

Raises:
  • ValueError – If model_type is not supported or dtype cannot be determined.

  • Exception – If files cannot be deleted or other export errors occur.
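
A minimal sketch of an export call; the checkpoint path is hypothetical and the dtype is an assumption (when left as None the exporter tries to determine it from the checkpoint):

trt_llm_exporter.export(
    nemo_checkpoint_path="/path/for/nemo/checkpoint",  # hypothetical checkpoint path
    model_type="llama",
    tensor_parallelism_size=2,     # split weights across two GPUs
    max_input_len=1024,
    max_seq_len=2048,
    max_batch_size=8,
    dtype="bfloat16",              # assumed dtype
)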

export_hf_model(
hf_model_path: str,
max_batch_size: int = 8,
tensor_parallelism_size: int = 1,
max_input_len: int = 256,
max_output_len: int = 256,
max_num_tokens: Optional[int] = None,
opt_num_tokens: Optional[int] = None,
dtype: Optional[str] = None,
max_seq_len: Optional[int] = 512,
gemm_plugin: str = 'auto',
remove_input_padding: bool = True,
use_paged_context_fmha: bool = True,
paged_kv_cache: bool = True,
tokens_per_block: int = 128,
multiple_profiles: bool = False,
reduce_fusion: bool = False,
max_beam_width: int = 1,
use_refit: bool = False,
model_type: Optional[str] = None,
delete_existing_files: bool = True,
)#

Export a Hugging Face model to TensorRT-LLM format.

This method exports a Hugging Face model to TensorRT-LLM format with various configuration options for model parallelism, quantization, and inference parameters.

Parameters:
  • hf_model_path (str) – Path to the Hugging Face model directory.

  • max_batch_size (int, optional) – Maximum batch size. Defaults to 8.

  • tensor_parallelism_size (int, optional) – Size of tensor parallelism. Defaults to 1.

  • max_input_len (int, optional) – Maximum input sequence length. Defaults to 256.

  • max_output_len (int, optional) – Maximum output sequence length. Defaults to 256.

  • max_num_tokens (Optional[int], optional) – Maximum number of tokens. Defaults to None.

  • opt_num_tokens (Optional[int], optional) – Optimal number of tokens. Defaults to None.

  • dtype (Optional[str], optional) – Data type for model weights. Defaults to None.

  • max_seq_len (Optional[int], optional) – Maximum sequence length. Defaults to 512.

  • gemm_plugin (str, optional) – GEMM plugin type. Defaults to "auto".

  • remove_input_padding (bool, optional) – Remove input padding. Defaults to True.

  • use_paged_context_fmha (bool, optional) – Use paged context FMHA. Defaults to True.

  • paged_kv_cache (bool, optional) – Use paged KV cache. Defaults to True.

  • tokens_per_block (int, optional) – Tokens per block. Defaults to 128.

  • multiple_profiles (bool, optional) – Use multiple profiles. Defaults to False.

  • reduce_fusion (bool, optional) – Enable reduce fusion. Defaults to False.

  • max_beam_width (int, optional) – Maximum beam width. Defaults to 1.

  • use_refit (bool, optional) – Use refit. Defaults to False.

  • model_type (Optional[str], optional) – Type of the model. Defaults to None.

  • delete_existing_files (bool, optional) – Delete existing files. Defaults to True.

Raises:
  • ValueError – If model_type is not supported or dtype cannot be determined.

  • FileNotFoundError – If config file is not found.

  • RuntimeError – If there are errors reading the config file.
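
A minimal sketch of exporting a local Hugging Face checkpoint; the model path is hypothetical and the dtype is an assumption (when None it is read from the model's config, see get_hf_model_dtype below):

trt_llm_exporter.export_hf_model(
    hf_model_path="/path/to/hf/model",  # hypothetical Hugging Face model directory
    tensor_parallelism_size=1,
    max_batch_size=8,
    max_input_len=1024,
    max_output_len=512,
    dtype="bfloat16",                   # assumed dtype
)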

get_hf_model_type(model_dir: str) str#

Get the model type from a Hugging Face model directory.

This method infers the model type from the 'architectures' field in the model’s config.json file.

Parameters:

model_dir (str) – Path to the Hugging Face model directory or model ID at Hugging Face Hub.

Returns:

The inferred model type (e.g., "LlamaForCausalLM").

Return type:

str

Raises:

ValueError – If the architecture choice is ambiguous.
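
For example, for a Llama checkpoint the 'architectures' entry in config.json is "LlamaForCausalLM", so a call such as the following (model ID illustrative) would return that string:

model_type = trt_llm_exporter.get_hf_model_type("meta-llama/Llama-2-7b-hf")  # illustrative Hub model ID
print(model_type)  # "LlamaForCausalLM"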

get_hf_model_dtype(model_dir: str) Optional[str]#

Get the data type from a Hugging Face model directory.

This method reads the config file from a Hugging Face model directory and identifies the model’s data type from various possible locations in the config.

Parameters:

model_dir (str) – Path to the Hugging Face model directory.

Returns:

The model’s data type if found in config, None otherwise.

Return type:

Optional[str]

Raises:
  • FileNotFoundError – If the config file is not found.

  • ValueError – If the config file contains invalid JSON.

  • RuntimeError – If there are errors reading the config file.
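
A short sketch (directory path hypothetical); the caller is responsible for choosing a fallback when the config does not record a dtype:

dtype = trt_llm_exporter.get_hf_model_dtype("/path/to/hf/model")  # hypothetical directory
if dtype is None:
    dtype = "bfloat16"  # assumed fallback chosen by the caller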

_export_to_nim_format(
model_config: Dict[str, Any],
model_type: str,
)#

Exports the model configuration to a specific format required by NIM.

This method performs the following steps:

  1. Copies the generation_config.json (if present) from the nemo_context directory to the root model directory.

  2. Creates a dummy Hugging Face configuration file based on the provided model configuration and type.

Parameters:
  • model_config (dict) – A dictionary containing the model configuration parameters.

  • model_type (str) – The type of the model (e.g., "llama").

get_transformer_config(nemo_model_config)#

Return the transformer config derived from the given NeMo model config.

forward(
input_texts: List[str],
max_output_len: int = 64,
top_k: int = 1,
top_p: float = 0.0,
temperature: float = 1.0,
stop_words_list: List[str] = None,
bad_words_list: List[str] = None,
no_repeat_ngram_size: int = None,
lora_uids: List[str] = None,
output_log_probs: bool = False,
output_context_logits: bool = False,
output_generation_logits: bool = False,
**sampling_kwargs,
)#

Runs inference on the exported TensorRT-LLM model and generates output for the given input texts.

Parameters:
  • input_texts (List[str]) – List of input sentences.

  • max_output_len (int) – Maximum number of generated tokens.

  • top_k (int) – Limits sampling to the K most likely tokens.

  • top_p (float) – Limits sampling to the smallest set of top tokens whose cumulative probability reaches p.

  • temperature (float) – Softmax temperature applied to the output logits; higher values produce more random output.

  • stop_words_list (List[str]) – List of stop words.

  • bad_words_list (List[str]) – List of bad words.

  • no_repeat_ngram_size (int) – Size of n-grams that may not repeat in the output.

  • output_generation_logits (bool) – If True, returns generation_logits in the output of the generate method.

  • sampling_kwargs – Additional kwargs to set in the SamplingConfig.
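
A minimal sketch of a generation call with explicit sampling parameters (values illustrative):

output = trt_llm_exporter.forward(
    input_texts=["Hi, how are you?"],
    max_output_len=64,
    top_k=1,
    top_p=0.0,
    temperature=1.0,
    stop_words_list=["</s>"],  # hypothetical stop word
)
print("output: ", output)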

_pad_logits(logits_tensor)#

Pads the logits tensor with 0’s on the right.

property get_supported_models_list#

Supported model list.

property get_supported_hf_model_mapping#

Supported HF Model Mapping.

property get_hidden_size#

Get hidden size.

property get_triton_input#

Get triton input.

property get_triton_output#

_infer_fn(prompts, inputs)#

Shared helper function to prepare inference inputs and execute forward pass.

Parameters:
  • prompts – List of input prompts

  • inputs – Dictionary of input parameters

Returns:

List of generated text outputs

Return type:

output_texts

triton_infer_fn(**inputs: numpy.ndarray)#

Triton inference function.

ray_infer_fn(
inputs: Dict[str, Any],
) Dict[str, Any]#

Ray inference function that processes the input dictionary and returns the output without byte casting.

Parameters:

inputs (Dict[str, Any]) –

Input dictionary containing:

  • prompts: List of input prompts

  • max_output_len: Maximum output length (optional)

  • top_k: Top-k sampling parameter (optional)

  • top_p: Top-p sampling parameter (optional)

  • temperature: Sampling temperature (optional)

  • random_seed: Random seed (optional)

  • stop_words_list: List of stop words (optional)

  • bad_words_list: List of bad words (optional)

  • no_repeat_ngram_size: No repeat ngram size (optional)

  • lora_uids: LoRA UIDs (optional)

  • apply_chat_template: Whether to apply chat template (optional)

  • compute_logprob: Whether to compute log probabilities (optional)

Returns:

Output dictionary containing:

  • sentences: List of generated text outputs

  • log_probs: Log probabilities (if requested)

Return type:

Dict[str, Any]
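
A minimal sketch of the input and output dictionaries (keys taken from the lists above, values illustrative):

outputs = trt_llm_exporter.ray_infer_fn(
    {
        "prompts": ["Hi, how are you?"],
        "max_output_len": 64,
        "temperature": 1.0,
        "compute_logprob": False,
    }
)
print(outputs["sentences"])  # list of generated text outputs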

_load_config_file()#

_load()#

unload_engine()#

Unload engine.