For general TensorRT-LLM features and configuration, see the Reference Guide.
When TensorRT-LLM runs through Dynamo, its Prometheus metrics are passed through automatically and exposed on Dynamo’s /metrics endpoint (default port 8081). Both TensorRT-LLM engine metrics (prefixed with trtllm_) and Dynamo runtime metrics (prefixed with dynamo_) are therefore available from a single worker backend endpoint.
Additional performance metrics are available via non-Prometheus APIs (see Non-Prometheus Performance Metrics below).
As of the date of this documentation, the included TensorRT-LLM version 1.1.0rc5 exposes 5 basic Prometheus metrics. Note that the trtllm_ prefix is added by Dynamo.
For Dynamo runtime metrics, see the Dynamo Metrics Guide.
For visualization setup instructions, see the Prometheus and Grafana Setup Guide.
This is a single machine example.
For visualizing metrics with Prometheus and Grafana, start the observability stack. See Observability Getting Started for instructions.
Launch a frontend and TensorRT-LLM backend to test metrics:
Note: The backend must be set to "pytorch" for metrics collection (enforced in components/src/dynamo/trtllm/main.py). TensorRT-LLM’s MetricsCollector integration has only been tested/validated with the PyTorch backend.
Wait for the TensorRT-LLM worker to start, then send requests and check metrics:
TensorRT-LLM exposes metrics in Prometheus Exposition Format text at the /metrics HTTP endpoint. All TensorRT-LLM engine metrics use the trtllm_ prefix and include labels (e.g., model_name, engine_type, finished_reason) to identify the source.
Note: TensorRT-LLM uses model_name instead of Dynamo’s standard model label convention.
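As a rough illustration of reading that endpoint programmatically, the sketch below fetches the exposition text and groups sample lines by the trtllm_ and dynamo_ prefixes. The host (localhost) is an assumption based on the single-machine setup and the default port 8081 mentioned earlier; the function names fetch_metrics and split_by_prefix are hypothetical helpers, not part of Dynamo or TensorRT-LLM.

```python
import urllib.request
from typing import Dict, List

# Assumption: worker runs on this machine with the default metrics port.
DEFAULT_URL = "http://localhost:8081/metrics"


def fetch_metrics(url: str = DEFAULT_URL, timeout: float = 5.0) -> str:
    """Download the raw Prometheus exposition text from the worker."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def split_by_prefix(text: str) -> Dict[str, List[str]]:
    """Group sample lines by metric prefix, skipping comments and blanks."""
    groups: Dict[str, List[str]] = {"trtllm_": [], "dynamo_": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # "# HELP" / "# TYPE" lines carry no sample values
        for prefix, bucket in groups.items():
            if line.startswith(prefix):
                bucket.append(line)
                break
    return groups


# Example usage (requires a running worker):
# groups = split_by_prefix(fetch_metrics())
# for line in groups["trtllm_"]:
#     print(line)
```

Keeping the parsing step separate from the HTTP call makes it possible to run the filter against saved /metrics output as well as a live worker.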
Example Prometheus Exposition Format text:
Note: The specific metrics shown above are examples and may vary depending on your TensorRT-LLM version. Always inspect your actual /metrics endpoint for the current list.
TensorRT-LLM provides metrics in the following categories (all prefixed with trtllm_):
Note: Metrics may change between TensorRT-LLM versions. Always inspect the /metrics endpoint for your version.
The following metrics are exposed via Dynamo’s /metrics endpoint (with the trtllm_ prefix added by Dynamo) for TensorRT-LLM version 1.1.0rc5:
trtllm_request_success_total (Counter): Count of successfully processed requests by finish reason. Labels: model_name, engine_type, finished_reason
trtllm_e2e_request_latency_seconds (Histogram): End-to-end request latency (seconds). Labels: model_name, engine_type
trtllm_time_to_first_token_seconds (Histogram): Time to first token, TTFT (seconds). Labels: model_name, engine_type
trtllm_time_per_output_token_seconds (Histogram): Time per output token, TPOT (seconds). Labels: model_name, engine_type
trtllm_request_queue_time_seconds (Histogram): Time a request spends waiting in the queue (seconds). Labels: model_name, engine_type
These metric names and availability are subject to change with TensorRT-LLM version updates.
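Prometheus histograms are exported as _bucket, _sum, and _count series, so a mean for any of the histogram metrics above can be derived as sum divided by count. The following is a minimal sketch of that calculation over raw exposition text; histogram_means is a hypothetical helper, and the parsing is deliberately simplified (it assumes one sample per line and treats the label block as an opaque key).

```python
import re
from collections import defaultdict
from typing import Dict

# Matches "<name>{<labels>} <value>" or "<name> <value>"; comment lines do not match.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)')


def histogram_means(text: str, metric: str) -> Dict[str, float]:
    """Return the mean observation per label set for one histogram metric."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, float] = defaultdict(float)
    for raw in text.splitlines():
        match = SAMPLE_RE.match(raw.strip())
        if not match:
            continue
        name, labels, value = match.group(1), match.group(2) or "", match.group(3)
        if name == metric + "_sum":
            sums[labels] += float(value)
        elif name == metric + "_count":
            counts[labels] += float(value)
    # Skip label sets with no observations to avoid dividing by zero.
    return {k: sums[k] / c for k, c in counts.items() if c > 0}


# Example usage with text fetched from the /metrics endpoint:
# means = histogram_means(metrics_text, "trtllm_time_to_first_token_seconds")
```

Note that this yields a lifetime average since the worker started; for windowed averages or percentiles, a Prometheus query over the bucket series is the more usual route.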
TensorRT-LLM provides Prometheus metrics through the MetricsCollector class (see tensorrt_llm/metrics/collector.py).
Dynamo adds the following operational metrics for TensorRT-LLM workers. These complement the engine’s native metrics above with request-level observability that the engine does not provide. All metrics use the trtllm_ prefix and are automatically enabled when --publish-events-and-metrics is set.
Metric name constants are defined in lib/runtime/src/metrics/prometheus_names.rs (trtllm_additional module).
trtllm_request_type_image_total (Counter): Total number of requests containing image/multimodal content. Labels: model_name, disaggregation_mode, engine_type
trtllm_request_type_structured_output_total (Counter): Total number of requests using guided/structured decoding (JSON, regex, grammar, etc.). Labels: model_name, disaggregation_mode, engine_type
trtllm_num_aborted_requests_total (Counter): Total number of aborted/cancelled requests. Labels: model_name, disaggregation_mode, engine_type
The following KV cache transfer metrics are only recorded in disaggregated (prefill + decode) deployments when a KV cache transfer actually occurs. They are sourced from TensorRT-LLM’s RequestPerfMetrics.timing_metrics.
trtllm_kv_transfer_success_total (Counter): Total number of successful KV cache transfers (recorded on prefill side). Labels: model_name, disaggregation_mode, engine_type
trtllm_kv_transfer_latency_seconds (Histogram): KV cache transfer latency per request in seconds. Labels: model_name, disaggregation_mode, engine_type
trtllm_kv_transfer_bytes (Histogram): KV cache transfer size per request in bytes. Labels: model_name, disaggregation_mode, engine_type
trtllm_kv_transfer_speed_gb_s (Histogram): KV cache transfer speed per request in GB/s. Labels: model_name, disaggregation_mode, engine_type
TensorRT-LLM provides extensive performance data beyond the basic Prometheus metrics. These are not currently exposed to Prometheus.
engine.llm.get_stats_async() - System-wide aggregate statistics
engine.llm.get_kv_cache_events_async() - Real-time cache operations
Note: These structures are valid as of the date of this documentation but are subject to change with TensorRT-LLM version updates.
MetricsCollector class from tensorrt_llm.metrics (see collector.py)
register_engine_metrics_callback() function with metric_prefix_filter=["trtllm_"]
return_perf_metrics set to True when --publish-events-and-metrics is enabled
MetricsCollector initialized with model metadata (model name, engine type)
Dynamo runtime metrics (prefixed with dynamo_) are available at the same /metrics endpoint alongside TensorRT-LLM metrics
lib/runtime/src/metrics.rs (Rust runtime metrics)
lib/runtime/src/metrics/prometheus_names.rs (metric name constants)
components/src/dynamo/common/utils/prometheus.py (Prometheus utilities and callback registration)