nemo_deploy.llm.trtllm_api_deployable#
Module Contents#
Classes#
TensorRTLLMAPIDeployable – A Triton inference server compatible wrapper for TensorRT-LLM LLM-API.
Data#
API#
- nemo_deploy.llm.trtllm_api_deployable.LOGGER = 'getLogger(...)'#
- class nemo_deploy.llm.trtllm_api_deployable.TensorRTLLMAPIDeployable(
- hf_model_id_path: str,
- tokenizer: Optional[Union[str, pathlib.Path, tensorrt_llm.llmapi.llm.TokenizerBase, transformers.PreTrainedTokenizerBase]] = None,
- tensor_parallel_size: int = 1,
- pipeline_parallel_size: int = 1,
- moe_expert_parallel_size: int = -1,
- moe_tensor_parallel_size: int = -1,
- max_batch_size: int = 8,
- max_num_tokens: int = 8192,
- backend: str = 'pytorch',
- dtype: str = 'auto',
- **kwargs,
- )#
Bases:
nemo_deploy.ITritonDeployable
A Triton inference server compatible wrapper for TensorRT-LLM LLM-API.
This class provides a standardized interface for deploying TensorRT-LLM LLM-API in Triton inference server. It handles model loading, inference, and deployment configurations.
- Parameters:
hf_model_id_path (str) – Path to the HuggingFace model or model identifier. Can be a local path or a model ID from HuggingFace Hub.
tokenizer (Optional[Union[str, Path, TokenizerBase, PreTrainedTokenizerBase]]) – Path to the tokenizer or tokenizer instance.
tensor_parallel_size (int) – Tensor parallelism size. Defaults to 1.
pipeline_parallel_size (int) – Pipeline parallelism size. Defaults to 1.
moe_expert_parallel_size (int) – MOE expert parallelism size. Defaults to -1.
moe_tensor_parallel_size (int) – MOE tensor parallelism size. Defaults to -1.
max_batch_size (int) – Maximum batch size. Defaults to 8.
max_num_tokens (int) – Maximum total tokens across all sequences in a batch. Defaults to 8192.
backend (str) – Backend to use for TRTLLM. Defaults to “pytorch”.
dtype (str) – Model data type. Defaults to “auto”.
**kwargs – Additional keyword arguments to pass to model loading.
Initialization
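A minimal construction sketch; the HuggingFace model ID below is an illustrative placeholder, and the remaining arguments simply restate the documented defaults:

```python
from nemo_deploy.llm.trtllm_api_deployable import TensorRTLLMAPIDeployable

# The model ID is a placeholder; any local path or HuggingFace Hub ID works.
deployable = TensorRTLLMAPIDeployable(
    hf_model_id_path="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,   # shard weights across GPUs when > 1
    pipeline_parallel_size=1,
    max_batch_size=8,
    max_num_tokens=8192,
    backend="pytorch",        # backend used by TensorRT-LLM LLM-API
    dtype="auto",
)
```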
- generate(
- prompts: List[str],
- max_length: int = 256,
- temperature: Optional[float] = None,
- top_k: Optional[int] = None,
- top_p: Optional[float] = None,
- **kwargs,
- )#
Generate text based on the provided input prompts.
This method processes input prompts through the loaded model and generates text according to the specified parameters.
- Parameters:
prompts – List of input prompts
max_length – Maximum number of tokens to generate. Defaults to 256.
temperature – Sampling temperature. Defaults to None.
top_k – Number of highest probability tokens to consider. Defaults to None.
top_p – Cumulative probability threshold for token sampling. Defaults to None.
**kwargs – Additional keyword arguments passed to the sampling params.
- Returns:
A list of generated texts, one for each input prompt.
- Return type:
List[str]
- Raises:
RuntimeError – If the model is not initialized.
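For illustration, a hedged example of calling generate on an initialized deployable; the prompt and sampling values are arbitrary:

```python
# Sampling arguments are optional; when left as None the backend defaults apply.
outputs = deployable.generate(
    prompts=["Explain tensor parallelism in one sentence."],
    max_length=128,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)
print(outputs[0])  # generate returns List[str], one completion per prompt
```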
- property get_triton_input#
- property get_triton_output#
- triton_infer_fn(**inputs: numpy.ndarray)#
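The Triton-facing entry point accepts keyword numpy arrays. A sketch of a direct call, assuming input tensor names such as prompts and max_length for illustration; the actual tensor names and shapes are defined by get_triton_input:

```python
import numpy as np

# Hypothetical input names shown for illustration only; consult
# get_triton_input for the tensors Triton actually sends.
result = deployable.triton_infer_fn(
    prompts=np.array([b"Write a haiku about GPUs."]),
    max_length=np.array([[64]]),
)
```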