nemo_deploy.nlp.trtllm_api_deployable

Module Contents

Classes

TensorRTLLMAPIDeployable – A Triton inference server compatible wrapper for TensorRT-LLM LLM-API.

Data

API
- nemo_deploy.nlp.trtllm_api_deployable.LOGGER = 'getLogger(...)'
- class nemo_deploy.nlp.trtllm_api_deployable.TensorRTLLMAPIDeployable(
      hf_model_id_path: str,
      tokenizer: Optional[Union[str, pathlib.Path, tensorrt_llm.llmapi.llm.TokenizerBase, transformers.PreTrainedTokenizerBase]] = None,
      tensor_parallel_size: int = 1,
      pipeline_parallel_size: int = 1,
      moe_expert_parallel_size: int = -1,
      moe_tensor_parallel_size: int = -1,
      max_batch_size: int = 8,
      max_num_tokens: int = 8192,
      backend: str = 'pytorch',
      dtype: str = 'auto',
      **kwargs,
  )

  Bases: nemo_deploy.ITritonDeployable
A Triton inference server compatible wrapper for TensorRT-LLM LLM-API.
This class provides a standardized interface for deploying TensorRT-LLM LLM-API in Triton inference server. It handles model loading, inference, and deployment configurations.
- Parameters:
hf_model_id_path (str) – Path to the HuggingFace model or model identifier. Can be a local path or a model ID from the HuggingFace Hub.
tokenizer (Optional[Union[str, Path, TokenizerBase, PreTrainedTokenizerBase]]) – Path to the tokenizer or a tokenizer instance.
tensor_parallel_size (int) – Tensor parallelism size. Defaults to 1.
pipeline_parallel_size (int) – Pipeline parallelism size. Defaults to 1.
moe_expert_parallel_size (int) – MoE expert parallelism size. Defaults to -1.
moe_tensor_parallel_size (int) – MoE tensor parallelism size. Defaults to -1.
max_batch_size (int) – Maximum batch size. Defaults to 8.
max_num_tokens (int) – Maximum total number of tokens across all sequences in a batch. Defaults to 8192.
backend (str) – Backend to use for TensorRT-LLM. Defaults to 'pytorch'.
dtype (str) – Model data type. Defaults to 'auto'.
**kwargs – Additional keyword arguments passed to model loading.
Initialization
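The snippet below is a minimal construction sketch, assuming nemo_deploy and tensorrt_llm are installed and a GPU is available; the model identifier and configuration values are illustrative, not prescriptive.

```python
# Minimal construction sketch (illustrative values; assumes nemo_deploy and
# tensorrt_llm are installed and a GPU is available).
from nemo_deploy.nlp.trtllm_api_deployable import TensorRTLLMAPIDeployable

deployable = TensorRTLLMAPIDeployable(
    hf_model_id_path="meta-llama/Llama-3.1-8B-Instruct",  # local path or HF Hub ID (example)
    tensor_parallel_size=1,   # single-GPU tensor parallelism
    max_batch_size=8,         # upper bound on sequences per batch
    max_num_tokens=8192,      # total tokens across all sequences in a batch
    backend="pytorch",        # TensorRT-LLM LLM-API backend
    dtype="auto",             # infer dtype from the checkpoint
)
```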
- generate(
      prompts: List[str],
      max_length: int = 256,
      temperature: Optional[float] = None,
      top_k: Optional[int] = None,
      top_p: Optional[float] = None,
      **kwargs,
  )
Generate text based on the provided input prompts.
This method processes input prompts through the loaded model and generates text according to the specified parameters.
- Parameters:
prompts – List of input prompts.
max_length – Maximum number of tokens to generate. Defaults to 256.
temperature – Sampling temperature. Defaults to None.
top_k – Number of highest-probability tokens to consider. Defaults to None.
top_p – Cumulative probability threshold for token sampling. Defaults to None.
**kwargs – Additional keyword arguments passed to the sampling params.
- Returns:
A list of generated texts, one for each input prompt.
- Return type:
List[str]
- Raises:
RuntimeError β If the model is not initialized.
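A short invocation sketch, continuing the construction example above; the prompt and sampling values are illustrative.

```python
# Illustrative generate() call; sampling values are arbitrary examples.
outputs = deployable.generate(
    prompts=["Explain tensor parallelism in one sentence."],
    max_length=128,    # cap on generated tokens per prompt
    temperature=0.7,   # softer sampling than greedy decoding
    top_p=0.95,        # nucleus sampling threshold
)
print(outputs[0])      # one generated string per input prompt
```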
- property get_triton_input
- property get_triton_output
- triton_infer_fn(**inputs: numpy.ndarray)
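A hypothetical direct call to triton_infer_fn is sketched below. The input tensor names and shapes are assumptions modeled on the generate() parameters and on typical Triton string/int32 batching; consult get_triton_input for the actual schema.

```python
import numpy as np

# Hypothetical call; "prompts" and "max_length" are assumed tensor names
# mirroring generate(), and the (batch, 1) shapes follow common Triton
# batching conventions. Check get_triton_input for the real input spec.
result = deployable.triton_infer_fn(
    prompts=np.array([["Explain tensor parallelism in one sentence."]], dtype=object),
    max_length=np.array([[128]], dtype=np.int32),
)
```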