nemo_export.trt_llm.tensorrt_llm_run#

Module Contents#

Classes#

TensorrtLLMHostContext

The host side context for TRT LLM inference.

TensorrtLLMWorkerContext

The MPI worker side context for TRT LLM inference.

Functions#

_load

The implementation of the load API on a single GPU worker.

_forward

The implementation of the forward API on a single GPU worker, with tensors as input and output.

load

Loads the compiled LLM model and runs it.

forward

Runs the loaded model with the host_context provided by the load API.

unload_engine

Deletes the ModelRunner, which should free up device memory.

prepare_input_tensors

Prepare input tensors from text input.

generate

Generate the output sequence from the input sequence.

unload

Frees the GPU resources held by the TensorrtLLMHostContext and resets the host_context.

to_word_list_format

Converts word_dict into the word-list format used by TRT LLM.

Data#

API#

nemo_export.trt_llm.tensorrt_llm_run.LOGGER = 'getLogger(...)'#
class nemo_export.trt_llm.tensorrt_llm_run.TensorrtLLMHostContext#

The host side context for TRT LLM inference.

executor: mpi4py.futures.MPIPoolExecutor = None#
world_size: int = 1#
tokenizer: transformers.PreTrainedTokenizer = None#
max_batch_size: int = 0#
max_input_len: int = 0#
add_bos: bool = False#
class nemo_export.trt_llm.tensorrt_llm_run.TensorrtLLMWorkerContext#

The MPI worker side context for TRT LLM inference.

decoder: tensorrt_llm.runtime.ModelRunner | tensorrt_llm.runtime.ModelRunnerCpp = None#
sampling_config: tensorrt_llm.runtime.SamplingConfig = None#
lora_manager: tensorrt_llm.lora_manager.LoraManager = None#
max_batch_size: int = 0#
max_input_len: int = 0#
nemo_export.trt_llm.tensorrt_llm_run.tensorrt_llm_worker_context = 'TensorrtLLMWorkerContext(...)'#
nemo_export.trt_llm.tensorrt_llm_run._load(
tokenizer: transformers.PreTrainedTokenizer,
engine_dir,
lora_ckpt_list=None,
num_beams=1,
use_python_runtime: bool = True,
enable_chunked_context: bool = False,
max_tokens_in_paged_kv_cache: int = None,
multi_block_mode: bool = False,
)#

The implementation of the load API on a single GPU worker.

nemo_export.trt_llm.tensorrt_llm_run._forward(
input_tensors: List[torch.IntTensor],
max_output_len: int,
top_k: int = 1,
top_p: float = 0.0,
temperature: float = 1.0,
lora_uids: List[str] = None,
stop_words_list=None,
bad_words_list=None,
multiprocessed_env=False,
**sampling_kwargs,
) → Optional[torch.IntTensor]#

The implementation of the forward API on a single GPU worker, with tensors as input and output.

Returns:

the output tokens tensor with shape [batch_size, num_beams, output_len].

nemo_export.trt_llm.tensorrt_llm_run.load(
tokenizer: transformers.PreTrainedTokenizer,
engine_dir: str,
lora_ckpt_list: List[str] = None,
num_beams: int = 1,
use_python_runtime: bool = True,
enable_chunked_context: bool = False,
max_tokens_in_paged_kv_cache: int = None,
multi_block_mode: bool = False,
) → nemo_export.trt_llm.tensorrt_llm_run.TensorrtLLMHostContext#

Loads the compiled LLM model and runs it.

It also supports running the TRT LLM model on multi-GPU.

nemo_export.trt_llm.tensorrt_llm_run.forward(
input_tensors: List[torch.IntTensor],
max_output_len: int,
host_context: nemo_export.trt_llm.tensorrt_llm_run.TensorrtLLMHostContext,
top_k: int = 1,
top_p: float = 0.0,
temperature: float = 1.0,
lora_uids: List[str] = None,
stop_words_list=None,
bad_words_list=None,
multiprocessed_env=False,
**sampling_kwargs,
) → Optional[torch.IntTensor]#

Runs the loaded model with the host_context provided by the load API.

nemo_export.trt_llm.tensorrt_llm_run.unload_engine()#

Deletes the ModelRunner, which should free up device memory.

nemo_export.trt_llm.tensorrt_llm_run.prepare_input_tensors(
input_texts: List[str],
host_context: nemo_export.trt_llm.tensorrt_llm_run.TensorrtLLMHostContext,
)#

Prepare input tensors from text input.

Parameters:
  • input_texts – List of input text strings

  • host_context – Context containing tokenizer and configuration

Returns:

Prepared input tensors for the model

Return type:

dict
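A minimal sketch of the kind of tokenize-and-pad step this function performs. The toy tokenizer and helper names below are hypothetical stand-ins; the real function uses the HuggingFace tokenizer and configuration held in the host context.

```python
# Illustrative sketch only, not the actual implementation of
# prepare_input_tensors. It shows tokenizing a batch of strings and
# right-padding the id sequences to a common length.

def toy_tokenize(text):
    # Hypothetical stand-in for tokenizer.encode: one fake id per word.
    return [hash(w) % 1000 for w in text.split()]

def prepare_inputs_sketch(input_texts, pad_id=0):
    batch = [toy_tokenize(t) for t in input_texts]
    lengths = [len(ids) for ids in batch]
    max_len = max(lengths)
    # Right-pad every sequence to the longest one in the batch.
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": padded, "input_lengths": lengths}

out = prepare_inputs_sketch(["hello world", "a longer input sentence"])
```

The real function returns torch tensors rather than Python lists, but the padding-to-max-length shape logic is the same idea.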

nemo_export.trt_llm.tensorrt_llm_run.generate(
input_texts: List[str],
max_output_len: int,
host_context: nemo_export.trt_llm.tensorrt_llm_run.TensorrtLLMHostContext,
top_k: int = 1,
top_p: float = 0.0,
temperature: float = 1.0,
lora_uids: List[str] = None,
stop_words_list=None,
bad_words_list=None,
output_log_probs=False,
multiprocessed_env=False,
output_context_logits=False,
output_generation_logits=False,
**sampling_kwargs,
) → Optional[List[List[str]]]#

Generate the output sequence from the input sequence.

Returns a 2D string list with shape [batch_size, num_beams].
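A hedged usage sketch of the load/generate/unload flow. The engine directory and tokenizer name are placeholders, not values defined by this module, so the heavy calls are guarded behind an environment variable and only run when a compiled engine is actually available.

```python
# Usage sketch, with assumed placeholders: TRTLLM_ENGINE_DIR points at a
# pre-built TRT LLM engine directory, and "gpt2" stands in for whatever
# tokenizer matches the exported model.
import os

engine_dir = os.environ.get("TRTLLM_ENGINE_DIR")
if engine_dir:
    from transformers import AutoTokenizer
    from nemo_export.trt_llm.tensorrt_llm_run import load, generate, unload

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
    host_context = load(tokenizer=tokenizer, engine_dir=engine_dir, num_beams=1)
    outputs = generate(
        input_texts=["Hello, my name is"],
        max_output_len=32,
        host_context=host_context,
        top_k=1,
        temperature=1.0,
    )
    # outputs is a 2D string list shaped [batch_size, num_beams].
    print(outputs[0][0])
    unload(host_context)
```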

nemo_export.trt_llm.tensorrt_llm_run.unload(
host_context: nemo_export.trt_llm.tensorrt_llm_run.TensorrtLLMHostContext,
)#

Frees the GPU resources held by the TensorrtLLMHostContext and resets the host_context.

nemo_export.trt_llm.tensorrt_llm_run.to_word_list_format(
word_dict: List[List[str]],
tokenizer=None,
ref_str='<extra_id_1>',
)#

Converts word_dict into the word-list format used by TRT LLM.

  • len(word_dict) should equal the batch size.

  • word_dict[i] contains the words for batch item i.

  • len(word_dict[i]) must be 1, i.e. it holds a single string.

  • That string may contain several phrases separated by “,”. For example, if word_dict[2] = “ I am happy, I am sad”, this function returns the token ids for the two phrases “ I am happy” and “ I am sad”.
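A pure-Python sketch of one common packing for such word lists: per batch item, a row of flattened token ids alongside a row of cumulative phrase end offsets, both padded with -1. This layout is an assumption for illustration, not necessarily the exact array this function returns, and the character-level tokenizer is hypothetical.

```python
# Illustrative sketch (assumed layout, hypothetical tokenizer): split each
# batch entry's single string on ",", tokenize each phrase, then pack
# [flattened ids, cumulative end offsets] padded with -1.

def toy_tokenize(text):
    return [ord(c) for c in text]  # hypothetical char-level tokenizer

def to_word_list_sketch(word_dict):
    rows = []
    for entry in word_dict:
        assert len(entry) == 1, "each batch item must hold a single string"
        ids, offsets = [], []
        for phrase in entry[0].split(","):
            ids.extend(toy_tokenize(phrase))
            offsets.append(len(ids))  # cumulative end offset of each phrase
    # Pad both rows to a common length, giving shape [batch, 2, max_len].
        rows.append((ids, offsets))
    max_len = max(len(ids) for ids, _ in rows)
    return [
        [ids + [-1] * (max_len - len(ids)), offs + [-1] * (max_len - len(offs))]
        for ids, offs in rows
    ]

packed = to_word_list_sketch([["stop"], ["end,halt"]])
```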