# Reference Guide

## Building a Custom Container

To build a TensorRT-LLM container from source (e.g., for custom modifications or a different CUDA version), see the [Building a Custom Container](/dynamo/v1.1.0/additional-resources/tensor-rt-llm-details/building-a-custom-container) guide.

## KV Cache Transfer

Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV Cache Transfer Guide](/dynamo/v1.1.0/additional-resources/tensor-rt-llm-details/kv-cache-transfer).

## Request Migration

Dynamo supports [request migration](/dynamo/v1.1.0/user-guides/fault-tolerance/request-migration) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](/dynamo/v1.1.0/user-guides/fault-tolerance/request-migration) documentation for configuration details.

## Request Cancellation

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.

### Cancellation Support Matrix

| Serving mode | Prefill | Decode |
|--------------|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ✅ | ✅ |

For more details, see the [Request Cancellation Architecture](/dynamo/v1.1.0/user-guides/fault-tolerance/request-cancellation) documentation.

## Multiple Choices (`n`)

Dynamo forwards the OpenAI `n` parameter (number of choices to generate) in multiple-choice requests to TensorRT-LLM. When TensorRT-LLM's default deterministic (greedy) decoding path is in effect, it rejects `n > 1` requests before generation unless `TLLM_ALLOW_N_GREEDY_DECODING=1` is set in the TensorRT-LLM worker environment.

If a test or deployment intentionally exercises `n > 1` on the greedy decoding path, set:

```bash
export TLLM_ALLOW_N_GREEDY_DECODING=1
```

Scope this environment variable to the specific TensorRT-LLM worker or test configuration that needs `n > 1`. For Dynamo E2E tests, set it on the relevant `EngineConfig.env` rather than globally, and keep the client request OpenAI-shaped with `n` instead of adding `best_of`.

TensorRT-LLM documents the `n`/`best_of` behavior and enforces this greedy-decoding guard in [`tensorrt_llm/sampling_params.py`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/sampling_params.py).
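As a minimal sketch of the client side, the payload below is an OpenAI-shaped chat completion request carrying `n` and no `best_of`. The model name and message are illustrative, not taken from the Dynamo docs; the point is the request shape.

```python
# Hypothetical OpenAI-shaped request payload asking for two choices.
# The model name is illustrative; the worker serving it needs
# TLLM_ALLOW_N_GREEDY_DECODING=1 when greedy decoding is in effect.
import json

payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",  # illustrative model name
    "messages": [{"role": "user", "content": "Name two prime numbers."}],
    "n": 2,           # request two choices via the OpenAI `n` parameter
    "max_tokens": 32,
}
# Keep the request OpenAI-shaped: use `n`, do not add `best_of`.
assert "best_of" not in payload
print(json.dumps(payload, indent=2))
```

Send this body to the Dynamo frontend's chat completions endpoint as you would any other OpenAI-compatible request.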

## Multimodal Support

Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal-trtllm.md).

## Diffusion Support (Experimental)

Dynamo supports video and image generation using diffusion models through TensorRT-LLM. For requirements, supported models, API usage, and configuration options, see the [Diffusion Guide](/dynamo/v1.1.0/user-guides/diffusion/trt-llm-diffusion).

## Logits Processing

Logits processors let you modify the next-token logits at every decoding step. Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM. For the API, examples, and how to bring your own processor, see the [Logits Processing Guide](/dynamo/v1.1.0/additional-resources/tensor-rt-llm-details/logits-processing).

## DP Rank Routing (Attention Data Parallelism)

TensorRT-LLM supports attention data parallelism for models like DeepSeek, enabling KV-cache-aware routing to specific DP ranks. For configuration and usage details, see the [DP Rank Routing Guide](/dynamo/v1.1.0/additional-resources/tensor-rt-llm-details/dp-rank-routing).

## KVBM Integration

Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.

See the instructions here: [Running KVBM in TensorRT-LLM](/dynamo/v1.1.0/user-guides/kv-cache-offloading#run-kvbm-in-dynamo-with-tensorrt-llm).

## Observability

TensorRT-LLM exposes Prometheus metrics for monitoring inference performance. For detailed metrics reference, collection setup, and Grafana integration, see the [Observability Guide](/dynamo/v1.1.0/backends/tensor-rt-llm/observability).

## Known Issues and Mitigations

For known issues, workarounds, and mitigations, see the [Known Issues and Mitigations](/dynamo/v1.1.0/backends/tensor-rt-llm/known-issues-and-mitigations) page.