nemo_export.trt_llm.tensorrt_llm_run#
Module Contents#
Classes#
| TensorrtLLMHostContext | The host side context for TRT LLM inference. |
| TensorrtLLMWorkerContext | The MPI worker side context for TRT LLM inference. |
Functions#
| _load | The implementation of the load API on a single GPU worker. |
| _forward | The implementation of the forward API on a single GPU worker, with tensors as input and output. |
| load | Loads the compiled LLM model and runs it. |
| forward | Runs the loaded model with the host_context provided by the load API. |
| unload_engine | Deletes the ModelRunner, which should free up device memory. |
| prepare_input_tensors | Prepares input tensors from text input. |
| generate | Generates the output sequence from the input sequence. |
| unload | Frees the GPU resources from the TensorrtLLMHostContext and resets the host_context. |
| to_word_list_format | Converts word_dict into the token-id word-list format. |
Data#
API#
- nemo_export.trt_llm.tensorrt_llm_run.LOGGER = 'getLogger(...)'#
- class nemo_export.trt_llm.tensorrt_llm_run.TensorrtLLMHostContext#
The host side context for TRT LLM inference.
- executor: mpi4py.futures.MPIPoolExecutor = None#
- world_size: int = 1#
- tokenizer: transformers.PreTrainedTokenizer = None#
- max_batch_size: int = 0#
- max_input_len: int = 0#
- add_bos: bool = False#
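The fields above map naturally to a dataclass. A minimal pure-Python sketch of the same structure (field types simplified to `Any` to avoid the mpi4py and transformers dependencies; not the actual class definition):

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TensorrtLLMHostContext:
    # Executor that dispatches work to MPI workers
    # (mpi4py.futures.MPIPoolExecutor in the real class).
    executor: Optional[Any] = None
    # Number of MPI ranks participating in inference.
    world_size: int = 1
    # Tokenizer used to encode prompts
    # (transformers.PreTrainedTokenizer in the real class).
    tokenizer: Optional[Any] = None
    # Engine limits fixed at build time.
    max_batch_size: int = 0
    max_input_len: int = 0
    # Whether to prepend the BOS token to every prompt.
    add_bos: bool = False


ctx = TensorrtLLMHostContext(world_size=2, max_batch_size=8, max_input_len=512)
```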
- class nemo_export.trt_llm.tensorrt_llm_run.TensorrtLLMWorkerContext#
The MPI worker side context for TRT LLM inference.
- decoder: tensorrt_llm.runtime.ModelRunner | tensorrt_llm.runtime.ModelRunnerCpp = None#
- sampling_config: tensorrt_llm.runtime.SamplingConfig = None#
- lora_manager: tensorrt_llm.lora_manager.LoraManager = None#
- max_batch_size: int = 0#
- max_input_len: int = 0#
- nemo_export.trt_llm.tensorrt_llm_run.tensorrt_llm_worker_context = 'TensorrtLLMWorkerContext(...)'#
- nemo_export.trt_llm.tensorrt_llm_run._load(
- tokenizer: transformers.PreTrainedTokenizer,
- engine_dir,
- lora_ckpt_list=None,
- num_beams=1,
- use_python_runtime: bool = True,
- enable_chunked_context: bool = False,
- max_tokens_in_paged_kv_cache: int = None,
- multi_block_mode: bool = False,
The implementation of the load API on a single GPU worker.
- nemo_export.trt_llm.tensorrt_llm_run._forward(
- input_tensors: List[torch.IntTensor],
- max_output_len: int,
- top_k: int = 1,
- top_p: float = 0.0,
- temperature: float = 1.0,
- lora_uids: List[str] = None,
- stop_words_list=None,
- bad_words_list=None,
- multiprocessed_env=False,
- **sampling_kwargs,
The implementation of the forward API on a single GPU worker, with tensors as input and output.
- Returns:
the output tokens tensor with shape [batch_size, num_beams, output_len].
- nemo_export.trt_llm.tensorrt_llm_run.load(
- tokenizer: transformers.PreTrainedTokenizer,
- engine_dir: str,
- lora_ckpt_list: List[str] = None,
- num_beams: int = 1,
- use_python_runtime: bool = True,
- enable_chunked_context: bool = False,
- max_tokens_in_paged_kv_cache: int = None,
- multi_block_mode: bool = False,
Loads the compiled LLM model and runs it.
It also supports running the TRT LLM model on multi-GPU.
- nemo_export.trt_llm.tensorrt_llm_run.forward(
- input_tensors: List[torch.IntTensor],
- max_output_len: int,
- host_context: nemo_export.trt_llm.tensorrt_llm_run.TensorrtLLMHostContext,
- top_k: int = 1,
- top_p: float = 0.0,
- temperature: float = 1.0,
- lora_uids: List[str] = None,
- stop_words_list=None,
- bad_words_list=None,
- multiprocessed_env=False,
- **sampling_kwargs,
Run the loaded model with the host_context provided by the load API.
- nemo_export.trt_llm.tensorrt_llm_run.unload_engine()#
Deletes the ModelRunner, which should free up device memory.
- nemo_export.trt_llm.tensorrt_llm_run.prepare_input_tensors(
- input_texts: List[str],
- host_context: nemo_export.trt_llm.tensorrt_llm_run.TensorrtLLMHostContext,
Prepare input tensors from text input.
- Parameters:
input_texts – List of input text strings
host_context – Context containing tokenizer and configuration
- Returns:
Prepared input tensors for model
- Return type:
dict
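A hedged sketch of what this preparation step looks like, per the documented behavior: encode each prompt to token ids and optionally prepend BOS (the `add_bos` flag on the host context). Plain Python lists stand in for `torch.IntTensor`, and `ToyTokenizer` is a hypothetical stand-in for the real tokenizer:

```python
from typing import Any, Dict, List


def prepare_input_tensors_sketch(
    input_texts: List[str],
    tokenizer: Any,
    add_bos: bool = False,
    bos_token_id: int = 1,
) -> Dict[str, List[List[int]]]:
    """Encode each prompt, optionally prepending the BOS token.

    Sketch only: the real function works with the host context's
    tokenizer and returns tensors rather than plain lists.
    """
    ids = [tokenizer.encode(text) for text in input_texts]
    if add_bos:
        ids = [[bos_token_id] + seq for seq in ids]
    return {"input_ids": ids}


class ToyTokenizer:
    """Toy tokenizer: assigns one id per whitespace-separated word."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def encode(self, text: str) -> List[int]:
        # Ids start at 2 so that 0/1 stay free for special tokens.
        return [self.vocab.setdefault(w, len(self.vocab) + 2) for w in text.split()]


batch = prepare_input_tensors_sketch(["hello world", "hello"], ToyTokenizer(), add_bos=True)
# batch["input_ids"] -> [[1, 2, 3], [1, 2]]
```

Note the sequences are left unpadded; batching and padding to `max_input_len` happen downstream.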
- nemo_export.trt_llm.tensorrt_llm_run.generate(
- input_texts: List[str],
- max_output_len: int,
- host_context: nemo_export.trt_llm.tensorrt_llm_run.TensorrtLLMHostContext,
- top_k: int = 1,
- top_p: float = 0.0,
- temperature: float = 1.0,
- lora_uids: List[str] = None,
- stop_words_list=None,
- bad_words_list=None,
- output_log_probs=False,
- multiprocessed_env=False,
- output_context_logits=False,
- output_generation_logits=False,
- **sampling_kwargs,
Generate the output sequence from the input sequence.
Returns a 2D string list with shape [batch_size, num_beams].
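To see where the [batch_size, num_beams] string shape comes from, here is a hedged sketch of the final decoding step: the runtime produces tokens shaped [batch_size, num_beams, output_len], and each beam is detokenized to one string. `ToyDetokenizer` and the EOS handling are illustrative assumptions, not the module's actual code:

```python
from typing import Any, List


def decode_output_sketch(
    output_ids: List[List[List[int]]],  # [batch_size, num_beams, output_len]
    tokenizer: Any,
    eos_token_id: int = 0,
) -> List[List[str]]:
    """Decode each beam of each batch item, truncating at EOS."""
    result = []
    for beams in output_ids:
        decoded_beams = []
        for ids in beams:
            if eos_token_id in ids:
                ids = ids[: ids.index(eos_token_id)]
            decoded_beams.append(tokenizer.decode(ids))
        result.append(decoded_beams)
    return result


class ToyDetokenizer:
    """Toy detokenizer: renders each id as 'tok<id>'."""

    def decode(self, ids: List[int]) -> str:
        return " ".join(f"tok{i}" for i in ids)


out = decode_output_sketch([[[5, 6, 0], [5, 7, 8]]], ToyDetokenizer())
# out -> [["tok5 tok6", "tok5 tok7 tok8"]]  i.e. shape [batch_size=1, num_beams=2]
```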
- nemo_export.trt_llm.tensorrt_llm_run.unload(
- host_context: nemo_export.trt_llm.tensorrt_llm_run.TensorrtLLMHostContext,
Frees the GPU resources from the TensorrtLLMHostContext and resets the host_context.
- nemo_export.trt_llm.tensorrt_llm_run.to_word_list_format(
- word_dict: List[List[str]],
- tokenizer=None,
- ref_str='<extra_id_1>',
Format of word_dict:
- len(word_dict) should equal batch_size.
- word_dict[i] holds the words for batch item i.
- len(word_dict[i]) must be 1, i.e. it contains a single string.
- That string can contain several phrases separated by ",". For example, if word_dict[2] = " I am happy, I am sad", this function returns the ids for the two phrases " I am happy" and " I am sad".
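The rules above can be sketched as follows. This is a simplified, hypothetical reimplementation assuming the [batch, 2, max_len] stop/bad-words layout used by TRT LLM runtimes (row 0: flattened token ids; row 1: cumulative end offsets; padding with -1). The real function additionally uses ref_str to strip tokenizer space artifacts, which the toy word-length tokenizer here sidesteps:

```python
from typing import Any, List


def to_word_list_format_sketch(
    word_dict: List[List[str]], tokenizer: Any
) -> List[List[List[int]]]:
    """Split each batch item's single string on ",", tokenize each phrase,
    and emit per item [flattened ids, cumulative offsets], padded with -1."""
    flat_ids, offsets = [], []
    for item in word_dict:
        ids: List[int] = []
        offs: List[int] = []
        for phrase in item[0].split(","):
            ids.extend(tokenizer.encode(phrase))
            offs.append(len(ids))  # offset marks the end of this phrase
        flat_ids.append(ids)
        offsets.append(offs)

    max_len = max(len(row) for row in flat_ids)

    def pad(row: List[int]) -> List[int]:
        return row + [-1] * (max_len - len(row))

    return [[pad(i), pad(o)] for i, o in zip(flat_ids, offsets)]


class ToyTokenizer:
    """Toy tokenizer: token id = word length (illustration only)."""

    def encode(self, text: str) -> List[int]:
        return [len(w) for w in text.split()]


wl = to_word_list_format_sketch([[" I am happy, I am sad"]], ToyTokenizer())
# wl -> [[[1, 2, 5, 1, 2, 3], [3, 6, -1, -1, -1, -1]]]
```

The offsets row lets the runtime recover phrase boundaries from the flat id row: phrase k spans ids[offs[k-1]:offs[k]].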