nemo_export.tensorrt_llm#
Module Contents#
Classes#
TensorRTLLM – Exports NeMo and Hugging Face checkpoints to TensorRT-LLM and runs fast inference.
Data#
API#
- nemo_export.tensorrt_llm.LOGGER = 'getLogger(...)'#
- class nemo_export.tensorrt_llm.TensorRTLLM(
- model_dir: str,
- lora_ckpt_list: List[str] = None,
- load_model: bool = True,
- use_python_runtime: bool = True,
- enable_chunked_context: bool = None,
- max_tokens_in_paged_kv_cache: int = None,
- multi_block_mode: bool = False,
- )#
Bases:
nemo_deploy.ITritonDeployable
Exports NeMo and Hugging Face checkpoints to TensorRT-LLM and runs fast inference.
This class provides functionality to export NeMo and HuggingFace models to TensorRT-LLM format and run inference using the exported models. It supports various model architectures and provides options for model parallelism, quantization, and inference parameters.
Example:
from nemo_export.tensorrt_llm import TensorRTLLM
trt_llm_exporter = TensorRTLLM(model_dir="/path/for/model/files")
trt_llm_exporter.export(
    nemo_checkpoint_path="/path/for/nemo/checkpoint",
    model_type="llama",
    tensor_parallelism_size=1,
)
output = trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
print("output: ", output)
Initialization
Initialize TensorRTLLM exporter.
- Parameters:
model_dir (str) – Path for storing the TensorRT-LLM model files.
lora_ckpt_list (List[str], optional) – List of LoRA checkpoint paths. Defaults to None.
load_model (bool, optional) – Load TensorRT-LLM model if engine files exist. Defaults to True.
use_python_runtime (bool, optional) – Whether to use the Python or C++ runtime. Defaults to True.
enable_chunked_context (bool, optional) – Enable chunked context processing. Defaults to None.
max_tokens_in_paged_kv_cache (int, optional) – Max tokens in paged KV cache. Defaults to None.
multi_block_mode (bool, optional) – Enable faster decoding in multihead attention. Defaults to False.
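A minimal construction sketch using the options documented above. The paths are placeholders and the flag values are illustrative rather than recommended defaults; in particular, pairing chunked context with the C++ runtime is an assumption here, not a stated requirement.

```python
from nemo_export.tensorrt_llm import TensorRTLLM

# Placeholder paths; the option values below are illustrative only.
trt_llm_exporter = TensorRTLLM(
    model_dir="/path/for/model/files",        # where engine files are written/read
    lora_ckpt_list=["/path/to/lora_ckpt"],    # optional LoRA checkpoints
    load_model=True,                          # load the engine if it already exists
    use_python_runtime=False,                 # use the C++ runtime instead of Python
    enable_chunked_context=True,              # assumption: typically paired with the C++ runtime
    max_tokens_in_paged_kv_cache=8192,
    multi_block_mode=True,
)
```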
- export(
- nemo_checkpoint_path: str,
- model_type: Optional[str] = None,
- delete_existing_files: bool = True,
- tensor_parallelism_size: int = 1,
- pipeline_parallelism_size: int = 1,
- max_input_len: int = 256,
- max_output_len: Optional[int] = None,
- max_batch_size: int = 8,
- use_parallel_embedding: bool = False,
- paged_kv_cache: bool = True,
- remove_input_padding: bool = True,
- use_paged_context_fmha: bool = True,
- dtype: Optional[str] = None,
- load_model: bool = True,
- use_lora_plugin: str = None,
- lora_target_modules: List[str] = None,
- max_lora_rank: int = 64,
- max_num_tokens: Optional[int] = None,
- opt_num_tokens: Optional[int] = None,
- max_seq_len: Optional[int] = 512,
- multiple_profiles: bool = False,
- gpt_attention_plugin: str = 'auto',
- gemm_plugin: str = 'auto',
- reduce_fusion: bool = True,
- fp8_quantized: Optional[bool] = None,
- fp8_kvcache: Optional[bool] = None,
- build_rank: Optional[int] = 0,
- )#
Export NeMo checkpoints to TensorRT-LLM format.
This method exports a NeMo checkpoint to TensorRT-LLM format with various configuration options for model parallelism, quantization, and inference parameters.
- Parameters:
nemo_checkpoint_path (str) – Path to the NeMo checkpoint.
model_type (Optional[str], optional) – Type of the model. Defaults to None.
delete_existing_files (bool, optional) – Delete existing files in model_dir. Defaults to True.
tensor_parallelism_size (int, optional) – Size of tensor parallelism. Defaults to 1.
pipeline_parallelism_size (int, optional) – Size of pipeline parallelism. Defaults to 1.
max_input_len (int, optional) – Maximum input sequence length. Defaults to 256.
max_output_len (Optional[int], optional) – Maximum output sequence length. Defaults to None.
max_batch_size (int, optional) – Maximum batch size. Defaults to 8.
use_parallel_embedding (bool, optional) – Use parallel embedding. Defaults to False.
paged_kv_cache (bool, optional) – Use paged KV cache. Defaults to True.
remove_input_padding (bool, optional) – Remove input padding. Defaults to True.
use_paged_context_fmha (bool, optional) – Use paged context FMHA. Defaults to True.
dtype (Optional[str], optional) – Data type for model weights. Defaults to None.
load_model (bool, optional) – Load model after export. Defaults to True.
use_lora_plugin (str, optional) – Use LoRA plugin. Defaults to None.
lora_target_modules (List[str], optional) – Target modules for LoRA. Defaults to None.
max_lora_rank (int, optional) – Maximum LoRA rank. Defaults to 64.
max_num_tokens (Optional[int], optional) – Maximum number of tokens. Defaults to None.
opt_num_tokens (Optional[int], optional) – Optimal number of tokens. Defaults to None.
max_seq_len (Optional[int], optional) – Maximum sequence length. Defaults to 512.
multiple_profiles (bool, optional) – Use multiple profiles. Defaults to False.
gpt_attention_plugin (str, optional) – GPT attention plugin type. Defaults to 'auto'.
gemm_plugin (str, optional) – GEMM plugin type. Defaults to 'auto'.
reduce_fusion (bool, optional) – Enable reduce fusion. Defaults to True.
fp8_quantized (Optional[bool], optional) – Enable FP8 quantization. Defaults to None.
fp8_kvcache (Optional[bool], optional) – Enable FP8 KV cache. Defaults to None.
build_rank (Optional[int], optional) – Rank to build on. Defaults to 0.
- Raises:
ValueError – If model_type is not supported or dtype cannot be determined.
Exception – If files cannot be deleted or other export errors occur.
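A hedged sketch of a fuller export() call built only from the parameters documented above; the checkpoint path, parallelism sizes, and the dtype string are placeholders and assumptions, not recommendations. It continues from the constructor example, so trt_llm_exporter is assumed to be a TensorRTLLM instance.

```python
# Placeholder checkpoint path; sizes and dtype are illustrative assumptions.
trt_llm_exporter.export(
    nemo_checkpoint_path="/path/for/nemo/checkpoint",
    model_type="llama",
    tensor_parallelism_size=2,      # split weights across two GPUs
    pipeline_parallelism_size=1,
    max_input_len=1024,
    max_batch_size=8,
    max_seq_len=2048,
    dtype="bfloat16",               # assumption: dtype is passed as a string name
    paged_kv_cache=True,
    use_paged_context_fmha=True,
    fp8_quantized=False,            # default is None; enable only on FP8-capable hardware
)
```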
- export_hf_model(
- hf_model_path: str,
- max_batch_size: int = 8,
- tensor_parallelism_size: int = 1,
- max_input_len: int = 256,
- max_output_len: int = 256,
- max_num_tokens: Optional[int] = None,
- opt_num_tokens: Optional[int] = None,
- dtype: Optional[str] = None,
- max_seq_len: Optional[int] = 512,
- gemm_plugin: str = 'auto',
- remove_input_padding: bool = True,
- use_paged_context_fmha: bool = True,
- paged_kv_cache: bool = True,
- tokens_per_block: int = 128,
- multiple_profiles: bool = False,
- reduce_fusion: bool = False,
- max_beam_width: int = 1,
- use_refit: bool = False,
- model_type: Optional[str] = None,
- delete_existing_files: bool = True,
- )#
Export a Hugging Face model to TensorRT-LLM format.
This method exports a Hugging Face model to TensorRT-LLM format with various configuration options for model parallelism, quantization, and inference parameters.
- Parameters:
hf_model_path (str) – Path to the Hugging Face model directory.
max_batch_size (int, optional) – Maximum batch size. Defaults to 8.
tensor_parallelism_size (int, optional) – Size of tensor parallelism. Defaults to 1.
max_input_len (int, optional) – Maximum input sequence length. Defaults to 256.
max_output_len (int, optional) – Maximum output sequence length. Defaults to 256.
max_num_tokens (Optional[int], optional) – Maximum number of tokens. Defaults to None.
opt_num_tokens (Optional[int], optional) – Optimal number of tokens. Defaults to None.
dtype (Optional[str], optional) – Data type for model weights. Defaults to None.
max_seq_len (Optional[int], optional) – Maximum sequence length. Defaults to 512.
gemm_plugin (str, optional) – GEMM plugin type. Defaults to 'auto'.
remove_input_padding (bool, optional) – Remove input padding. Defaults to True.
use_paged_context_fmha (bool, optional) – Use paged context FMHA. Defaults to True.
paged_kv_cache (bool, optional) – Use paged KV cache. Defaults to True.
tokens_per_block (int, optional) – Tokens per block. Defaults to 128.
multiple_profiles (bool, optional) – Use multiple profiles. Defaults to False.
reduce_fusion (bool, optional) – Enable reduce fusion. Defaults to False.
max_beam_width (int, optional) – Maximum beam width. Defaults to 1.
use_refit (bool, optional) – Use refit. Defaults to False.
model_type (Optional[str], optional) – Type of the model. Defaults to None.
delete_existing_files (bool, optional) – Delete existing files. Defaults to True.
- Raises:
ValueError – If model_type is not supported or dtype cannot be determined.
FileNotFoundError – If config file is not found.
RuntimeError – If there are errors reading the config file.
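A short sketch of export_hf_model() with the parameters listed above; the Hugging Face model path and the numeric values are placeholders, and the dtype string is an assumption. As before, trt_llm_exporter is assumed to be a TensorRTLLM instance.

```python
# Placeholder Hugging Face model directory; values are illustrative.
trt_llm_exporter.export_hf_model(
    hf_model_path="/path/to/hf/model",
    max_batch_size=8,
    tensor_parallelism_size=1,
    max_input_len=1024,
    max_seq_len=2048,
    dtype="bfloat16",            # assumption: string dtype name, as in export()
    tokens_per_block=128,
    delete_existing_files=True,
)
```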
- get_hf_model_type(model_dir: str) → str#
Get the model type from a Hugging Face model directory.
This method infers the model type from the "architectures" field in the model's config.json file.
- Parameters:
model_dir (str) – Path to the Hugging Face model directory or model ID at Hugging Face Hub.
- Returns:
The inferred model type (e.g., "LlamaForCausalLM").
- Return type:
str
- Raises:
ValueError – If the architecture choice is ambiguous.
- get_hf_model_dtype(model_dir: str) → Optional[str]#
Get the data type from a Hugging Face model directory.
This method reads the config file from a Hugging Face model directory and identifies the model's data type from various possible locations in the config.
- Parameters:
model_dir (str) – Path to the Hugging Face model directory.
- Returns:
The model's data type if found in the config, None otherwise.
- Return type:
Optional[str]
- Raises:
FileNotFoundError – If the config file is not found.
ValueError – If the config file contains invalid JSON.
RuntimeError – If there are errors reading the config file.
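A small sketch of the two Hugging Face helpers above; the model directory is a placeholder, and the example return values in the comments are only what the docstrings suggest. trt_llm_exporter is again assumed to be a TensorRTLLM instance.

```python
# Inspect a Hugging Face checkpoint before exporting it.
model_dir = "/path/to/hf/model"  # local directory or Hugging Face Hub model ID

model_type = trt_llm_exporter.get_hf_model_type(model_dir)    # e.g. "LlamaForCausalLM"
model_dtype = trt_llm_exporter.get_hf_model_dtype(model_dir)  # e.g. "bfloat16", or None if absent
print(model_type, model_dtype)
```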
- _export_to_nim_format(
- model_config: Dict[str, Any],
- model_type: str,
- )#
Exports the model configuration to a specific format required by NIM.
This method performs the following steps:
1. Copies the generation_config.json (if present) from the nemo_context directory to the root model directory.
2. Creates a dummy Hugging Face configuration file based on the provided model configuration and type.
- Parameters:
model_config (dict) – A dictionary containing the model configuration parameters.
model_type (str) – The type of the model (e.g., "llama").
- get_transformer_config(nemo_model_config)#
Given a NeMo model config, get the transformer config.
- forward(
- input_texts: List[str],
- max_output_len: int = 64,
- top_k: int = 1,
- top_p: float = 0.0,
- temperature: float = 1.0,
- stop_words_list: List[str] = None,
- bad_words_list: List[str] = None,
- no_repeat_ngram_size: int = None,
- lora_uids: List[str] = None,
- output_log_probs: bool = False,
- output_context_logits: bool = False,
- output_generation_logits: bool = False,
- **sampling_kwargs,
- )#
Generate text for the given input prompts using the exported TensorRT-LLM model.
- Parameters:
input_texts (List[str]) – list of input sentences.
max_output_len (int) – maximum number of generated tokens.
top_k (int) – limits sampling to the K most likely tokens.
top_p (float) – limits sampling to the smallest set of top tokens whose cumulative probability mass reaches p.
temperature (float) – softmax temperature applied to the output logits; higher values produce more random output.
stop_words_list (List[str]) – list of stop words.
bad_words_list (List[str]) – list of bad words.
no_repeat_ngram_size (int) – no repeat ngram size.
output_generation_logits (bool) – if True, returns generation_logits in the output of the generate method.
sampling_kwargs – additional kwargs to set in the SamplingConfig.
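A generation sketch using forward() with the sampling parameters above; the prompts, stop word, and sampling values are illustrative assumptions, and trt_llm_exporter is assumed to hold an already exported model.

```python
# Near-greedy decoding; adjust top_k/top_p/temperature for more diverse output.
outputs = trt_llm_exporter.forward(
    input_texts=["Hi, how are you?", "I am good, thanks, how about you?"],
    max_output_len=64,
    top_k=1,
    top_p=0.0,
    temperature=1.0,
    stop_words_list=["</s>"],   # assumption: stop words are passed as plain strings
)
print(outputs)
```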
- _pad_logits(logits_tensor)#
Pads the logits tensor with zeros on the right.
- property get_supported_models_list#
Supported model list.
- property get_supported_hf_model_mapping#
Supported HF Model Mapping.
- property get_hidden_size#
Get hidden size.
- property get_triton_input#
Get triton input.
- property get_triton_output#
- _infer_fn(prompts, inputs)#
Shared helper function to prepare inference inputs and execute forward pass.
- Parameters:
prompts – List of input prompts
inputs – Dictionary of input parameters
- Returns:
List of generated text outputs
- Return type:
output_texts
- triton_infer_fn(**inputs: numpy.ndarray)#
Triton infer function for inference.
- ray_infer_fn(
- inputs: Dict[str, Any],
- )#
Ray inference function that processes input dictionary and returns output without byte casting.
- Parameters:
inputs (Dict[str, Any]) –
Input dictionary containing:
prompts: List of input prompts
max_output_len: Maximum output length (optional)
top_k: Top-k sampling parameter (optional)
top_p: Top-p sampling parameter (optional)
temperature: Sampling temperature (optional)
random_seed: Random seed (optional)
stop_words_list: List of stop words (optional)
bad_words_list: List of bad words (optional)
no_repeat_ngram_size: No repeat ngram size (optional)
lora_uids: LoRA UIDs (optional)
apply_chat_template: Whether to apply chat template (optional)
compute_logprob: Whether to compute log probabilities (optional)
- Returns:
Output dictionary containing:
sentences: List of generated text outputs
log_probs: Log probabilities (if requested)
- Return type:
Dict[str, Any]
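A sketch of calling ray_infer_fn() with an input dictionary built from the keys listed above; only prompts appears to be required, so the optional entries and their values are illustrative. trt_llm_exporter is assumed to be a loaded TensorRTLLM instance.

```python
# Build the request dict from the documented keys; only "prompts" is assumed required.
request = {
    "prompts": ["Hi, how are you?"],
    "max_output_len": 64,
    "top_k": 1,
    "temperature": 1.0,
    "compute_logprob": False,
}
result = trt_llm_exporter.ray_infer_fn(request)
print(result["sentences"])  # list of generated strings, per the return description above
```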
- _load_config_file()#
- _load()#
- unload_engine()#
Unload engine.