nemo_deploy.llm.trtllm_api_deployable#
Module Contents#
Classes#
TensorRTLLMAPIDeployable – A Triton inference server compatible wrapper for TensorRT-LLM LLM-API.
Data#
API#
- nemo_deploy.llm.trtllm_api_deployable.LOGGER = 'getLogger(...)'#
- class nemo_deploy.llm.trtllm_api_deployable.TensorRTLLMAPIDeployable(
- hf_model_id_path: str,
- tokenizer: Optional[Union[str, pathlib.Path, tensorrt_llm.llmapi.llm.TokenizerBase, transformers.PreTrainedTokenizerBase]] = None,
- tensor_parallel_size: int = 1,
- pipeline_parallel_size: int = 1,
- moe_expert_parallel_size: int = -1,
- moe_tensor_parallel_size: int = -1,
- max_batch_size: int = 8,
- max_num_tokens: int = 8192,
- backend: str = 'pytorch',
- dtype: str = 'auto',
- **kwargs,
- )#
Bases:
nemo_deploy.ITritonDeployable
A Triton inference server compatible wrapper for TensorRT-LLM LLM-API.
This class provides a standardized interface for deploying TensorRT-LLM LLM-API in Triton inference server. It handles model loading, inference, and deployment configurations.
- Parameters:
hf_model_id_path (str) – Path to the HuggingFace model or model identifier. Can be a local path or a model ID from HuggingFace Hub.
tokenizer (Optional[Union[str, Path, TokenizerBase, PreTrainedTokenizerBase]]) – Path to the tokenizer or tokenizer instance.
tensor_parallel_size (int) – Tensor parallelism size. Defaults to 1.
pipeline_parallel_size (int) – Pipeline parallelism size. Defaults to 1.
moe_expert_parallel_size (int) – MOE expert parallelism size. Defaults to -1.
moe_tensor_parallel_size (int) – MOE tensor parallelism size. Defaults to -1.
max_batch_size (int) – Maximum batch size. Defaults to 8.
max_num_tokens (int) – Maximum total tokens across all sequences in a batch. Defaults to 8192.
backend (str) – Backend to use for TRTLLM. Defaults to “pytorch”.
dtype (str) – Model data type. Defaults to “auto”.
**kwargs – Additional keyword arguments to pass to model loading.
Initialization
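A minimal construction sketch; the HuggingFace model ID below is an illustrative placeholder, and the remaining arguments simply restate the documented defaults:

```python
from nemo_deploy.llm.trtllm_api_deployable import TensorRTLLMAPIDeployable

# The model ID is a placeholder; any local path or HuggingFace Hub ID works.
deployable = TensorRTLLMAPIDeployable(
    hf_model_id_path="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,   # shard weights across GPUs when > 1
    pipeline_parallel_size=1,
    max_batch_size=8,
    max_num_tokens=8192,
    backend="pytorch",        # backend used by TensorRT-LLM LLM-API
    dtype="auto",
)
```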
- generate(
- prompts: List[str],
- max_length: int = 256,
- temperature: Optional[float] = None,
- top_k: Optional[int] = None,
- top_p: Optional[float] = None,
- **kwargs,
- )#
Generate text based on the provided input prompts.
This method processes input prompts through the loaded model and generates text according to the specified parameters.
- Parameters:
prompts – List of input prompts
max_length – Maximum number of tokens to generate. Defaults to 256.
temperature – Sampling temperature. Defaults to None.
top_k – Number of highest probability tokens to consider. Defaults to None.
top_p – Cumulative probability threshold for token sampling. Defaults to None.
**kwargs – Additional keyword arguments passed to the sampling params.
- Returns:
A list of generated texts, one for each input prompt.
- Return type:
List[str]
- Raises:
RuntimeError – If the model is not initialized.
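For illustration, a hedged example of calling generate on an initialized deployable; the prompt and sampling values are arbitrary:

```python
# Sampling arguments are optional; when left as None the backend defaults apply.
outputs = deployable.generate(
    prompts=["Explain tensor parallelism in one sentence."],
    max_length=128,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)
print(outputs[0])  # generate returns List[str], one completion per prompt
```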
- property get_triton_input#
- property get_triton_output#
- triton_infer_fn(**inputs: numpy.ndarray)#
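The Triton-facing entry point accepts keyword numpy arrays. A sketch of a direct call, assuming input tensor names such as prompts and max_length for illustration; the actual tensor names and shapes are defined by get_triton_input:

```python
import numpy as np

# Hypothetical input names shown for illustration only; consult
# get_triton_input for the tensors Triton actually sends.
result = deployable.triton_infer_fn(
    prompts=np.array([b"Write a haiku about GPUs."]),
    max_length=np.array([[64]]),
)
```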