For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Digest
  • Getting Started
    • Quickstart
    • Introduction
    • Local Installation
    • Building from Source
    • Contribution Guide
  • Resources
    • Support Matrix
    • Feature Matrix
    • Release Artifacts
    • Examples
  • Kubernetes Deployment
    • Deployment Guide
  • User Guides
    • KV Cache Aware Routing
    • Disaggregated Serving
    • KV Cache Offloading
    • Dynamo Benchmarking
    • Multimodal
    • Diffusion (Preview)
    • Tool Calling
    • LoRA Adapters
    • Agents
    • Observability (Local)
    • Fault Tolerance
    • Writing Python Workers
  • Backends
    • SGLang
    • TensorRT-LLM
      • Reference Guide
      • Examples
      • Prometheus Metrics
      • Video Diffusion (Experimental)
      • Known Issues and Mitigations
    • vLLM
  • Components
    • Frontend
    • Router
    • Planner
    • Profiler
    • KVBM
  • Integrations
    • LMCache
    • SGLang HiCache
    • FlexKV
    • KV Events for Custom Engines
  • Design Docs
    • Overall Architecture
    • Architecture Flow
    • Disaggregated Serving
    • Distributed Runtime
    • Blog
  • Documentation
    • Dynamo Docs Guide
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
Digest
On this page
  • Overview
  • Environment Variables
  • Getting Started Quickly
  • Start Observability Stack
  • Launch Dynamo Components
  • Exposed Metrics
  • Metric Categories
  • Available Metrics
  • Additional Operational Metrics
  • Request Type Tracking
  • Abort Tracking
  • KV Cache Transfer Metrics (Disaggregated Deployments)
  • Non-Prometheus Performance Metrics
  • Available via Code References
  • Example RequestPerfMetrics JSON Structure
  • Implementation Details
  • Related Documentation
  • TensorRT-LLM Metrics
  • Dynamo Metrics
BackendsTensorRT-LLM

Prometheus

||View as Markdown|
Edit this page
Previous

Examples

Next

Video Diffusion Support (Experimental)

For general TensorRT-LLM features and configuration, see the Reference Guide.


Overview

When running TensorRT-LLM through Dynamo, TensorRT-LLM’s Prometheus metrics are automatically passed through and exposed on Dynamo’s /metrics endpoint (default port 8081). This allows you to access both TensorRT-LLM engine metrics (prefixed with trtllm_) and Dynamo runtime metrics (prefixed with dynamo_*) from a single worker backend endpoint.

Additional performance metrics are available via non-Prometheus APIs (see Non-Prometheus Performance Metrics below).

As of the date of this documentation, the included TensorRT-LLM version 1.1.0rc5 exposes 5 basic Prometheus metrics. Note that the trtllm_ prefix is added by Dynamo.

For Dynamo runtime metrics, see the Dynamo Metrics Guide.

For visualization setup instructions, see the Prometheus and Grafana Setup Guide.

Environment Variables

VariableDescriptionDefaultExample
DYN_SYSTEM_PORTSystem metrics/health port-1 (disabled)8081

Getting Started Quickly

This is a single machine example.

Start Observability Stack

For visualizing metrics with Prometheus and Grafana, start the observability stack. See Observability Getting Started for instructions.

Launch Dynamo Components

Launch a frontend and TensorRT-LLM backend to test metrics:

$# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
$$ python -m dynamo.frontend
$
$# Enable system metrics server on port 8081 and enable metrics collection
$$ DYN_SYSTEM_PORT=8081 python -m dynamo.trtllm --model <model_name> --publish-events-and-metrics

Note: The backend must be set to "pytorch" for metrics collection (enforced in components/src/dynamo/trtllm/main.py). TensorRT-LLM’s MetricsCollector integration has only been tested/validated with the PyTorch backend.

Wait for the TensorRT-LLM worker to start, then send requests and check metrics:

$# Send a request
$curl -H 'Content-Type: application/json' \
>-d '{
> "model": "<model_name>",
> "max_completion_tokens": 100,
> "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}]
>}' \
>http://localhost:8000/v1/chat/completions
$
$# Check metrics from the worker
$curl -s localhost:8081/metrics | grep "^trtllm_"

Exposed Metrics

TensorRT-LLM exposes metrics in Prometheus Exposition Format text at the /metrics HTTP endpoint. All TensorRT-LLM engine metrics use the trtllm_ prefix and include labels (e.g., model_name, engine_type, finished_reason) to identify the source.

Note: TensorRT-LLM uses model_name instead of Dynamo’s standard model label convention.

Example Prometheus Exposition Format text:

# HELP trtllm_request_success_total Count of successfully processed requests.
# TYPE trtllm_request_success_total counter
trtllm_request_success_total{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm",finished_reason="stop"} 150.0
trtllm_request_success_total{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm",finished_reason="length"} 5.0
# HELP trtllm_time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE trtllm_time_to_first_token_seconds histogram
trtllm_time_to_first_token_seconds_bucket{le="0.01",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 0.0
trtllm_time_to_first_token_seconds_bucket{le="0.05",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 12.0
trtllm_time_to_first_token_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_time_to_first_token_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 8.75
# HELP trtllm_e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE trtllm_e2e_request_latency_seconds histogram
trtllm_e2e_request_latency_seconds_bucket{le="0.5",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 25.0
trtllm_e2e_request_latency_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_e2e_request_latency_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 45.2
# HELP trtllm_time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE trtllm_time_per_output_token_seconds histogram
trtllm_time_per_output_token_seconds_bucket{le="0.1",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 120.0
trtllm_time_per_output_token_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_time_per_output_token_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 12.5
# HELP trtllm_request_queue_time_seconds Histogram of time spent in WAITING phase for request.
# TYPE trtllm_request_queue_time_seconds histogram
trtllm_request_queue_time_seconds_bucket{le="1.0",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 140.0
trtllm_request_queue_time_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_request_queue_time_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 32.1

Note: The specific metrics shown above are examples and may vary depending on your TensorRT-LLM version. Always inspect your actual /metrics endpoint for the current list.

Metric Categories

TensorRT-LLM provides metrics in the following categories (all prefixed with trtllm_):

  • Request metrics - Request success tracking and latency measurements
  • Performance metrics - Time to first token (TTFT), time per output token (TPOT), and queue time

Note: Metrics may change between TensorRT-LLM versions. Always inspect the /metrics endpoint for your version.

Available Metrics

The following metrics are exposed via Dynamo’s /metrics endpoint (with the trtllm_ prefix added by Dynamo) for TensorRT-LLM version 1.1.0rc5:

  • trtllm_request_success_total (Counter) — Count of successfully processed requests by finish reason
    • Labels: model_name, engine_type, finished_reason
  • trtllm_e2e_request_latency_seconds (Histogram) — End-to-end request latency (seconds)
    • Labels: model_name, engine_type
  • trtllm_time_to_first_token_seconds (Histogram) — Time to first token, TTFT (seconds)
    • Labels: model_name, engine_type
  • trtllm_time_per_output_token_seconds (Histogram) — Time per output token, TPOT (seconds)
    • Labels: model_name, engine_type
  • trtllm_request_queue_time_seconds (Histogram) — Time a request spends waiting in the queue (seconds)
    • Labels: model_name, engine_type

These metric names and availability are subject to change with TensorRT-LLM version updates.

TensorRT-LLM provides Prometheus metrics through the MetricsCollector class (see tensorrt_llm/metrics/collector.py).

Additional Operational Metrics

Dynamo adds the following operational metrics for TensorRT-LLM workers. These complement the engine’s native metrics above with request-level observability that the engine does not provide. All metrics use the trtllm_ prefix and are automatically enabled when --publish-events-and-metrics is set.

Metric name constants are defined in lib/runtime/src/metrics/prometheus_names.rs (trtllm_additional module).

Request Type Tracking

  • trtllm_request_type_image_total (Counter) — Total number of requests containing image/multimodal content
    • Labels: model_name, disaggregation_mode, engine_type
  • trtllm_request_type_structured_output_total (Counter) — Total number of requests using guided/structured decoding (JSON, regex, grammar, etc.)
    • Labels: model_name, disaggregation_mode, engine_type

Abort Tracking

  • trtllm_num_aborted_requests_total (Counter) — Total number of aborted/cancelled requests
    • Labels: model_name, disaggregation_mode, engine_type

KV Cache Transfer Metrics (Disaggregated Deployments)

These metrics are only recorded in disaggregated (prefill + decode) deployments when a KV cache transfer actually occurs. They are sourced from TensorRT-LLM’s RequestPerfMetrics.timing_metrics.

  • trtllm_kv_transfer_success_total (Counter) — Total number of successful KV cache transfers (recorded on prefill side)
    • Labels: model_name, disaggregation_mode, engine_type
  • trtllm_kv_transfer_latency_seconds (Histogram) — KV cache transfer latency per request in seconds
    • Labels: model_name, disaggregation_mode, engine_type
  • trtllm_kv_transfer_bytes (Histogram) — KV cache transfer size per request in bytes
    • Labels: model_name, disaggregation_mode, engine_type
    • Buckets: 100KB, 500KB, 1MB, 5MB, 10MB, 50MB, 100MB, 500MB, 1GB, 5GB
  • trtllm_kv_transfer_speed_gb_s (Histogram) — KV cache transfer speed per request in GB/s
    • Labels: model_name, disaggregation_mode, engine_type

Non-Prometheus Performance Metrics

TensorRT-LLM provides extensive performance data beyond the basic Prometheus metrics. These are not currently exposed to Prometheus.

Available via Code References

  • RequestPerfMetrics Structure: tensorrt_llm/executor/result.py - KV cache, timing, speculative decoding metrics
  • Engine Statistics: engine.llm.get_stats_async() - System-wide aggregate statistics
  • KV Cache Events: engine.llm.get_kv_cache_events_async() - Real-time cache operations

Example RequestPerfMetrics JSON Structure

1{
2 "timing_metrics": {
3 "arrival_time": 1234567890.123,
4 "first_scheduled_time": 1234567890.135,
5 "first_token_time": 1234567890.150,
6 "last_token_time": 1234567890.300,
7 "kv_cache_size": 2048576,
8 "kv_cache_transfer_start": 1234567890.140,
9 "kv_cache_transfer_end": 1234567890.145
10 },
11 "kv_cache_metrics": {
12 "num_total_allocated_blocks": 100,
13 "num_new_allocated_blocks": 10,
14 "num_reused_blocks": 90,
15 "num_missed_blocks": 5
16 },
17 "speculative_decoding": {
18 "acceptance_rate": 0.85,
19 "total_accepted_draft_tokens": 42,
20 "total_draft_tokens": 50
21 }
22}

Note: These structures are valid as of the date of this documentation but are subject to change with TensorRT-LLM version updates.

Implementation Details

  • Prometheus Integration: Uses the MetricsCollector class from tensorrt_llm.metrics (see collector.py)
  • Dynamo Integration: Uses register_engine_metrics_callback() function with metric_prefix_filter=["trtllm_"]
  • Engine Configuration: return_perf_metrics set to True when --publish-events-and-metrics is enabled
  • Initialization: Metrics appear after TensorRT-LLM engine initialization completes
  • Metadata: MetricsCollector initialized with model metadata (model name, engine type)

Related Documentation

TensorRT-LLM Metrics

  • See the Non-Prometheus Performance Metrics section above for detailed performance data and source code references
  • TensorRT-LLM Metrics Collector - Source code reference

Dynamo Metrics

  • Dynamo Metrics Guide - Complete documentation on Dynamo runtime metrics
  • Prometheus and Grafana Setup - Visualization setup instructions
  • Dynamo runtime metrics (prefixed with dynamo_*) are available at the same /metrics endpoint alongside TensorRT-LLM metrics
    • Implementation: lib/runtime/src/metrics.rs (Rust runtime metrics)
    • Metric names: lib/runtime/src/metrics/prometheus_names.rs (metric name constants)
    • Integration code: components/src/dynamo/common/utils/prometheus.py - Prometheus utilities and callback registration