---
title: Reference Guide
subtitle: 'Features, configuration, and operational details for the TensorRT-LLM backend'
---

## Building a Custom Container

To build a TensorRT-LLM container from source (e.g., for custom modifications or a different CUDA version), see the [Building a Custom Container](/dynamo/additional-resources/tensor-rt-llm-details/building-a-custom-container) guide.

## KV Cache Transfer

Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV Cache Transfer Guide](/dynamo/additional-resources/tensor-rt-llm-details/kv-cache-transfer).

## Request Migration

Dynamo supports [request migration](/dynamo/user-guides/fault-tolerance/request-migration) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](/dynamo/user-guides/fault-tolerance/request-migration) documentation for configuration details.

## Request Cancellation

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.

### Cancellation Support Matrix

| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ✅ | ✅ |

For more details, see the [Request Cancellation Architecture](/dynamo/user-guides/fault-tolerance/request-cancellation) documentation.

## Multimodal Support

Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal-trtllm.md).

## Video Diffusion Support (Experimental)

Dynamo supports video generation using diffusion models through TensorRT-LLM. For requirements, supported models, API usage, and configuration options, see the [Video Diffusion Guide](/dynamo/user-guides/diffusion/trt-llm-diffusion).

## Logits Processing

Logits processors let you modify the next-token logits at every decoding step. Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM. For the API, examples, and how to bring your own processor, see the [Logits Processing Guide](/dynamo/additional-resources/tensor-rt-llm-details/logits-processing).

## DP Rank Routing (Attention Data Parallelism)

TensorRT-LLM supports attention data parallelism for models like DeepSeek, enabling KV-cache-aware routing to specific DP ranks. For configuration and usage details, see the [DP Rank Routing Guide](/dynamo/additional-resources/tensor-rt-llm-details/dp-rank-routing).

## KVBM Integration

Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.

See the instructions here: [Running KVBM in TensorRT-LLM](/dynamo/user-guides/kv-cache-offloading#run-kvbm-in-dynamo-with-tensorrt-llm).

## Observability

TensorRT-LLM exposes Prometheus metrics for monitoring inference performance. For detailed metrics reference, collection setup, and Grafana integration, see the [Prometheus Metrics Guide](/dynamo/backends/tensor-rt-llm/prometheus-metrics).

## Known Issues and Mitigations

For known issues, workarounds, and mitigations, see the [Known Issues and Mitigations](/dynamo/backends/tensor-rt-llm/known-issues-and-mitigations) page.
