nemo_deploy.nlp.trtllm_api_deployable#

Module Contents#

Classes#

TensorRTLLMAPIDeployable

A Triton inference server compatible wrapper for TensorRT-LLM LLM-API.

Data#

LOGGER

API#

nemo_deploy.nlp.trtllm_api_deployable.LOGGER = 'getLogger(...)'#
class nemo_deploy.nlp.trtllm_api_deployable.TensorRTLLMAPIDeployable(
hf_model_id_path: str,
tokenizer: Optional[Union[str, pathlib.Path, tensorrt_llm.llmapi.llm.TokenizerBase, transformers.PreTrainedTokenizerBase]] = None,
tensor_parallel_size: int = 1,
pipeline_parallel_size: int = 1,
moe_expert_parallel_size: int = -1,
moe_tensor_parallel_size: int = -1,
max_batch_size: int = 8,
max_num_tokens: int = 8192,
backend: str = 'pytorch',
dtype: str = 'auto',
**kwargs,
)#

Bases: nemo_deploy.ITritonDeployable

A Triton inference server compatible wrapper for TensorRT-LLM LLM-API.

This class provides a standardized interface for deploying the TensorRT-LLM LLM-API in the Triton Inference Server. It handles model loading, inference, and deployment configuration.

Parameters:
  • hf_model_id_path (str) – Path to the HuggingFace model or model identifier. Can be a local path or a model ID from HuggingFace Hub.

  • tokenizer (Optional[Union[str, Path, TokenizerBase, PreTrainedTokenizerBase]]) – Path to the tokenizer or tokenizer instance.

  • tensor_parallel_size (int) – Tensor parallelism size. Defaults to 1.

  • pipeline_parallel_size (int) – Pipeline parallelism size. Defaults to 1.

  • moe_expert_parallel_size (int) – MOE expert parallelism size. Defaults to -1.

  • moe_tensor_parallel_size (int) – MOE tensor parallelism size. Defaults to -1.

  • max_batch_size (int) – Maximum batch size. Defaults to 8.

  • max_num_tokens (int) – Maximum total tokens across all sequences in a batch. Defaults to 8192.

  • backend (str) – Backend to use for TRTLLM. Defaults to "pytorch".

  • dtype (str) – Model data type. Defaults to "auto".

  • **kwargs – Additional keyword arguments to pass to model loading.

Initialization
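
A minimal construction sketch follows, assuming the nemo_deploy package is installed in a GPU environment with TensorRT-LLM available; the HuggingFace model ID is an illustrative placeholder, and only parameters documented above are set.

```python
from nemo_deploy.nlp.trtllm_api_deployable import TensorRTLLMAPIDeployable

# Build the deployable from a HuggingFace checkpoint (model ID chosen for
# illustration only). This uses the default PyTorch backend on a single GPU,
# i.e. no tensor, pipeline, or MOE parallelism.
model = TensorRTLLMAPIDeployable(
    hf_model_id_path="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    pipeline_parallel_size=1,
    max_batch_size=8,
    max_num_tokens=8192,
    backend="pytorch",
    dtype="auto",
)
```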

generate(
prompts: List[str],
max_length: int = 256,
temperature: Optional[float] = None,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
**kwargs,
) → List[str]#

Generate text based on the provided input prompts.

This method processes input prompts through the loaded model and generates text according to the specified parameters.

Parameters:
  • prompts – List of input prompts

  • max_length – Maximum number of tokens to generate. Defaults to 256.

  • temperature – Sampling temperature. Defaults to None.

  • top_k – Number of highest probability tokens to consider. Defaults to None.

  • top_p – Cumulative probability threshold for token sampling. Defaults to None.

  • **kwargs – Additional keyword arguments forwarded to the sampling parameters.

Returns:

A list of generated texts, one for each input prompt.

Return type:

List[str]

Raises:

RuntimeError – If the model is not initialized.
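
A short usage sketch for generate, assuming a model instance constructed as in the example above; the prompts and sampling values are arbitrary illustrations.

```python
prompts = ["What is CUDA?", "Summarize TensorRT-LLM in one sentence."]

# All arguments besides `prompts` are optional and fall back to the defaults
# documented above. A RuntimeError is raised if the model is not initialized.
outputs = model.generate(
    prompts=prompts,
    max_length=128,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)

for prompt, text in zip(prompts, outputs):
    print(f"{prompt!r} -> {text!r}")
```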

property get_triton_input#
property get_triton_output#
triton_infer_fn(**inputs: numpy.ndarray)#
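
The Triton-facing members above (get_triton_input, get_triton_output, triton_infer_fn) are normally consumed by the PyTriton deployment path rather than called directly. A rough sketch follows, assuming nemo_deploy exposes a DeployPyTriton helper with the interface shown; the helper name, its arguments, and the port are assumptions taken from related NeMo deployment examples, not confirmed by this page.

```python
from nemo_deploy import DeployPyTriton  # assumed import path

# Wrap the deployable in a PyTriton server (model name and port are arbitrary).
server = DeployPyTriton(
    model=model,                        # TensorRTLLMAPIDeployable instance from above
    triton_model_name="trtllm_model",
    port=8000,
)
server.deploy()  # binds triton_infer_fn using get_triton_input / get_triton_output
server.serve()   # blocks and serves inference requests
```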