nemo_export.tensorrt_llm_deployable_ray#
Module Contents#
Classes#
TensorRTLLMRayDeployable – A Ray Serve-compatible wrapper for deploying TensorRT-LLM models.
Data#
API#
- nemo_export.tensorrt_llm_deployable_ray.LOGGER = 'getLogger(...)'#
- nemo_export.tensorrt_llm_deployable_ray.app = 'FastAPI(...)'#
- class nemo_export.tensorrt_llm_deployable_ray.TensorRTLLMRayDeployable(
- trt_llm_path: str,
- model_id: str = 'tensorrt-llm-model',
- use_python_runtime: bool = True,
- enable_chunked_context: bool = None,
- max_tokens_in_paged_kv_cache: int = None,
- multi_block_mode: bool = False,
- lora_ckpt_list: List[str] = None,
- )#
A Ray Serve-compatible wrapper for deploying TensorRT-LLM models.
This class provides a standardized interface for deploying TensorRT-LLM models in Ray Serve. It supports various NLP tasks and handles model loading, inference, and deployment configurations.
- Parameters:
trt_llm_path (str) – Path to the TensorRT-LLM model directory.
model_id (str) – Identifier for the model in the API responses. Defaults to “tensorrt-llm-model”.
max_batch_size (int) – Maximum number of requests to batch together. Defaults to 8.
batch_wait_timeout_s (float) – Maximum time to wait for batching requests. Defaults to 0.3.
load_model (bool) – Whether to load the model during initialization. Defaults to True.
use_python_runtime (bool) – Whether to use Python runtime. Defaults to True.
enable_chunked_context (bool) – Whether to enable chunked context. Defaults to None.
max_tokens_in_paged_kv_cache (int) – Maximum tokens in paged KV cache. Defaults to None.
multi_block_mode (bool) – Whether to enable multi-block mode. Defaults to False.
Initialization
Initialize the TensorRT-LLM model deployment.
- Parameters:
trt_llm_path (str) – Path to the TensorRT-LLM model directory.
model_id (str) – Model identifier. Defaults to “tensorrt-llm-model”.
max_batch_size (int) – Maximum number of requests to batch together. Defaults to 8.
pipeline_parallelism_size (int) – Degree of pipeline parallelism. Defaults to 1.
tensor_parallelism_size (int) – Degree of tensor parallelism. Defaults to 1.
use_python_runtime (bool) – Whether to use Python runtime. Defaults to True.
enable_chunked_context (bool) – Whether to enable chunked context. Defaults to None.
max_tokens_in_paged_kv_cache (int) – Maximum tokens in paged KV cache. Defaults to None.
multi_block_mode (bool) – Whether to enable multi-block mode. Defaults to False.
lora_ckpt_list (List[str]) – List of LoRA checkpoint paths. Defaults to None.
- Raises:
ImportError – If Ray is not installed.
Exception – If model initialization fails.
- async completions(request: Dict[Any, Any])#
Handle text completion requests.
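Since the endpoint serves OpenAI-format responses (see list_models below), a request payload in the OpenAI completions style is a reasonable shape to send. A minimal sketch of building such a payload — field names like "prompt" and "max_tokens" are assumed from the OpenAI API, not confirmed by this module:

```python
import json

# Hypothetical completions payload in the OpenAI style; field names
# beyond "model" (which matches the model_id default above) are
# assumptions, not taken from this module's source.
payload = {
    "model": "tensorrt-llm-model",
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 0.7,
}

# Serialize as the JSON body a client would POST to the endpoint.
body = json.dumps(payload)
print(body)
```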
- async chat_completions(request: Dict[Any, Any])#
Handle chat completion requests.
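A sketch of a chat-style request body, assuming the OpenAI chat format (a "messages" list of role/content entries); these field names are an assumption, not confirmed by this module:

```python
import json

# Hypothetical chat payload following the OpenAI chat-completions
# convention: a list of messages, each with a role and content.
chat_payload = {
    "model": "tensorrt-llm-model",  # matches the model_id default above
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is TensorRT-LLM?"},
    ],
    "max_tokens": 64,
}

print(json.dumps(chat_payload, indent=2))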
- async list_models()#
List available models.
This endpoint returns information about the deployed model in OpenAI API format.
- Returns:
object: Response type (“list”)
data: List of model information
- Return type:
Dict
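The docstring specifies only the top-level shape (object set to "list", data holding model entries). A sketch of such a response — the per-model fields shown (id, object) follow the OpenAI models format and are assumptions beyond what the docstring states:

```python
# Sketch of the documented list_models response shape. Only the
# top-level "object"/"data" keys come from the docstring; the entry
# fields are assumed from the OpenAI models format.
models_response = {
    "object": "list",
    "data": [
        {"id": "tensorrt-llm-model", "object": "model"},
    ],
}

print(models_response["object"])  # list
```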
- async health_check()#
Check the health status of the service.
This endpoint is used to verify that the service is running and healthy.
- Returns:
status: Health status (“healthy”)
- Return type:
Dict
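A minimal client-side check against the documented response shape ({"status": "healthy"}); in practice a client would issue an HTTP GET to the deployment's health endpoint (URL not given here) and apply a check like this to the JSON body:

```python
def is_healthy(response: dict) -> bool:
    """Return True when the service reports the documented healthy status."""
    return response.get("status") == "healthy"

print(is_healthy({"status": "healthy"}))   # True
print(is_healthy({"status": "starting"}))  # False
```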