nemo_export.tensorrt_llm_deployable_ray#

Module Contents#

Classes#

TensorRTLLMRayDeployable

A Ray Serve compatible wrapper for deploying TensorRT-LLM models.

Data#

API#

nemo_export.tensorrt_llm_deployable_ray.LOGGER = 'getLogger(...)'#
nemo_export.tensorrt_llm_deployable_ray.app = 'FastAPI(...)'#
class nemo_export.tensorrt_llm_deployable_ray.TensorRTLLMRayDeployable(
trt_llm_path: str,
model_id: str = 'tensorrt-llm-model',
use_python_runtime: bool = True,
enable_chunked_context: bool = None,
max_tokens_in_paged_kv_cache: int = None,
multi_block_mode: bool = False,
lora_ckpt_list: List[str] = None,
)#

A Ray Serve compatible wrapper for deploying TensorRT-LLM models.

This class provides a standardized interface for deploying TensorRT-LLM models with Ray Serve. It exposes OpenAI-compatible completion and chat endpoints and handles model loading, inference, and deployment configuration.

Parameters:
  • trt_llm_path (str) – Path to the TensorRT-LLM model directory.

  • model_id (str) – Identifier for the model in the API responses. Defaults to “tensorrt-llm-model”.

  • max_batch_size (int) – Maximum number of requests to batch together. Defaults to 8.

  • batch_wait_timeout_s (float) – Maximum time to wait for batching requests. Defaults to 0.3.

  • load_model (bool) – Whether to load the model during initialization. Defaults to True.

  • use_python_runtime (bool) – Whether to use Python runtime. Defaults to True.

  • enable_chunked_context (bool) – Whether to enable chunked context. Defaults to None.

  • max_tokens_in_paged_kv_cache (int) – Maximum tokens in paged KV cache. Defaults to None.

  • multi_block_mode (bool) – Whether to enable multi-block mode. Defaults to False.
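
Putting the constructor parameters together, a minimal configuration sketch might look like the following. The engine path is a placeholder, and the `serve.run`/`.bind()` usage is an assumption about how the deployable is typically bound in Ray Serve; adapt both to your environment.

```python
# Sketch of configuring the deployment described above. All paths and
# values here are hypothetical examples, not defaults from this module.
deployment_kwargs = {
    "trt_llm_path": "/models/my-trt-llm-engine",  # hypothetical engine directory
    "model_id": "tensorrt-llm-model",
    "use_python_runtime": True,   # Python runtime; set False for the C++ runtime
    "multi_block_mode": False,
    "lora_ckpt_list": None,       # no LoRA adapters
}

# With Ray and TensorRT-LLM installed, the deployment would be bound and
# served roughly like this (commented out so the sketch stays self-contained):
# from ray import serve
# from nemo_export.tensorrt_llm_deployable_ray import TensorRTLLMRayDeployable
# serve.run(TensorRTLLMRayDeployable.bind(**deployment_kwargs))
```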

Initialization

Initialize the TensorRT-LLM model deployment.

Parameters:
  • trt_llm_path (str) – Path to the TensorRT-LLM model directory.

  • model_id (str) – Model identifier. Defaults to “tensorrt-llm-model”.

  • max_batch_size (int) – Maximum number of requests to batch together. Defaults to 8.

  • pipeline_parallelism_size (int) – Degree of pipeline parallelism. Defaults to 1.

  • tensor_parallelism_size (int) – Degree of tensor parallelism. Defaults to 1.

  • use_python_runtime (bool) – Whether to use Python runtime. Defaults to True.

  • enable_chunked_context (bool) – Whether to enable chunked context. Defaults to None.

  • max_tokens_in_paged_kv_cache (int) – Maximum tokens in paged KV cache. Defaults to None.

  • multi_block_mode (bool) – Whether to enable multi-block mode. Defaults to False.

  • lora_ckpt_list (List[str]) – List of LoRA checkpoint paths. Defaults to None.

Raises:
  • ImportError – If Ray is not installed.

  • Exception – If model initialization fails.

async completions(request: Dict[Any, Any])#

Handle text completion requests.
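
As a sketch, a request to this endpoint can be built as an OpenAI-style completion payload. The `/v1/completions/` route and `localhost:8000` address in the commented call are assumptions, not confirmed by this page.

```python
import json

# OpenAI-style completion payload; field names follow the OpenAI
# completions schema that this endpoint is modeled on.
payload = {
    "model": "tensorrt-llm-model",
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 0.0,
}
body = json.dumps(payload).encode("utf-8")

# Against a live deployment (commented out; route and port are assumed):
# from urllib import request
# req = request.Request(
#     "http://localhost:8000/v1/completions/",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# print(request.urlopen(req).read().decode())
```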

async chat_completions(request: Dict[Any, Any])#

Handle chat completion requests.
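
A chat request differs from a plain completion in that it sends a `messages` list of role/content dicts, following the OpenAI chat-completions schema this endpoint mirrors. The route in the comment is an assumption.

```python
import json

# OpenAI-style chat payload; the "messages" structure is the standard
# chat-completions convention, assumed to apply here.
chat_payload = {
    "model": "tensorrt-llm-model",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize TensorRT-LLM in one sentence."},
    ],
    "max_tokens": 64,
}
chat_body = json.dumps(chat_payload).encode("utf-8")

# POST chat_body to the chat-completions route of the running deployment,
# e.g. with urllib.request as in the completions example (route assumed):
# http://localhost:8000/v1/chat/completions/
```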

async list_models()#

List available models.

This endpoint returns information about the deployed model in OpenAI API format.

Returns:

Dict containing:

  • object: Response type (“list”)

  • data: List of model information
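
The documented response shape can be sketched as a plain dict. The per-model fields (`id`, `object`) mirror the OpenAI list-models convention and are assumptions beyond what this page documents.

```python
# Illustrative shape of the list-models response described above; the
# entry fields are assumed to follow the OpenAI API convention.
models_response = {
    "object": "list",
    "data": [
        {"id": "tensorrt-llm-model", "object": "model"},
    ],
}
```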

async health_check()#

Check the health status of the service.

This endpoint is used to verify that the service is running and healthy.

Returns:

Dict containing:

  • status: Health status (“healthy”)
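
A readiness probe against this endpoint only needs to check for the documented status value. The `/v1/health/` route in the comment is an assumption about where the endpoint is mounted.

```python
import json

# Expected healthy response per the docs above.
expected = {"status": "healthy"}

# A simple probe against a running deployment (route assumed; commented
# out so this sketch stays self-contained):
# from urllib import request
# resp = json.loads(request.urlopen("http://localhost:8000/v1/health/").read())
# assert resp == expected
```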