Copyright © 2026, NVIDIA Corporation.

DP Rank Routing (Attention Data Parallelism)

For general TensorRT-LLM features and configuration, see the Reference Guide.


TensorRT-LLM supports attention data parallelism (attention DP) for models such as DeepSeek. When it is enabled, a single worker runs multiple attention DP ranks, each with its own KV cache. A prefix cached on one rank cannot be reused by a request that lands on another, so Dynamo can route each request to a specific DP rank based on that rank's KV cache state.
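The following minimal Python sketch shows why the choice of rank matters. It rests on simplifying assumptions: a prompt is split into fixed-size token blocks identified by chained hashes, and each DP rank tracks the set of block hashes in its own KV cache. BLOCK_SIZE, block_hashes, prefix_overlap, and overlap_per_rank are hypothetical names for illustration, not Dynamo or TRT-LLM APIs. The hashing scheme is also not necessarily the one Dynamo uses.

```python
from typing import Dict, List, Sequence, Set

# Hypothetical block size in tokens, kept tiny for illustration.
BLOCK_SIZE = 4


def block_hashes(tokens: Sequence[int], block_size: int = BLOCK_SIZE) -> List[int]:
    """Hash each full block of tokens, chaining in the previous block's hash so a
    block only matches when everything before it matches too. A partial trailing
    block is ignored."""
    hashes: List[int] = []
    prev = 0
    for start in range(0, len(tokens) - block_size + 1, block_size):
        prev = hash((prev, tuple(tokens[start:start + block_size])))
        hashes.append(prev)
    return hashes


def prefix_overlap(request_hashes: List[int], cached: Set[int]) -> int:
    """Count how many leading blocks of the request are already in one rank's cache."""
    matched = 0
    for h in request_hashes:
        if h not in cached:
            break
        matched += 1
    return matched


def overlap_per_rank(tokens: Sequence[int], rank_caches: Dict[int, Set[int]]) -> Dict[int, int]:
    """Score the same request against each DP rank's independent KV cache."""
    request_hashes = block_hashes(tokens)
    return {rank: prefix_overlap(request_hashes, cache) for rank, cache in rank_caches.items()}


# Rank 0 already served a request sharing the first 8 tokens; rank 1 did not.
shared_prefix = list(range(8))
rank_caches = {0: set(block_hashes(shared_prefix)), 1: set()}
new_request = shared_prefix + [100, 101, 102, 103]
print(overlap_per_rank(new_request, rank_caches))  # {0: 2, 1: 0}
```

Both ranks belong to the same worker, yet only rank 0 can reuse the cached prefix. Sending this request to rank 1 would recompute those blocks.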

Dynamo vs TRT-LLM Internal Routing

  • Dynamo DP Rank Routing: The router selects the optimal DP rank based on KV cache overlap and instructs TRT-LLM to use that rank with strict routing (attention_dp_relax=False). Use this with --router-mode kv for cache-aware routing.
  • TRT-LLM Internal Routing: TRT-LLM’s scheduler assigns DP ranks internally. Use this with --router-mode round-robin or random when KV-aware routing isn’t needed.
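The difference between the two modes can be sketched as follows. This is a hypothetical, simplified model: RoutingDecision, route_kv, and make_round_robin are illustrative names, not Dynamo's internal API. route_kv picks a target purely by overlap, whereas a real cache-aware router typically also weighs worker load. In the KV path the decision pins a dp_rank and sets attention_dp_relax=False. In the round-robin path Dynamo chooses only a worker, and the rank is left to TRT-LLM's scheduler. Representing "not set by Dynamo" as None is an assumption of this sketch.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical types for illustration; not Dynamo's internal API.
Target = Tuple[int, int]  # (worker_id, dp_rank)


@dataclass
class RoutingDecision:
    worker_id: int
    dp_rank: Optional[int]  # None: TRT-LLM's scheduler assigns the rank
    attention_dp_relax: Optional[bool]  # False: strict routing; None: not set by Dynamo


def route_kv(overlaps: Dict[Target, int]) -> RoutingDecision:
    """Cache-aware mode (simplified): choose the (worker_id, dp_rank) with the most
    cached overlap and pin the request to that rank with strict routing.
    Ties go to the first target in iteration order; load is ignored."""
    worker_id, dp_rank = max(overlaps, key=lambda target: overlaps[target])
    return RoutingDecision(worker_id, dp_rank, attention_dp_relax=False)


def make_round_robin(worker_ids: List[int]) -> Iterator[RoutingDecision]:
    """Round-robin mode: Dynamo cycles through workers and leaves the DP rank
    to TRT-LLM's internal scheduler."""
    for worker_id in cycle(worker_ids):
        yield RoutingDecision(worker_id, dp_rank=None, attention_dp_relax=None)


overlaps = {(7, 0): 0, (7, 1): 3, (9, 0): 1, (9, 1): 0}
print(route_kv(overlaps))
# RoutingDecision(worker_id=7, dp_rank=1, attention_dp_relax=False)
rr = make_round_robin([7, 9])
print([next(rr).worker_id for _ in range(3)])  # [7, 9, 7]
```

Strict routing is what makes the KV-aware choice meaningful. If the engine were free to move the request to another rank, the overlap the router measured would no longer apply.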

Enabling DP Rank Routing

```bash
# Worker with attention DP
# (TP=2 acts as the "world size", in effect creating 2 attention DP ranks)
CUDA_VISIBLE_DEVICES=0,1 python3 -m dynamo.trtllm \
  --model-path <MODEL_PATH> \
  --tensor-parallel-size 2 \
  --enable-attention-dp \
  --publish-events-and-metrics

# Frontend with KV routing (run in a separate terminal)
python3 -m dynamo.frontend --router-mode kv
```

The --enable-attention-dp flag sets attention_dp_size = tensor_parallel_size and configures Dynamo to publish KV events per DP rank. The router automatically creates routing targets for each (worker_id, dp_rank) combination.
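Target expansion can be sketched in a few lines. routing_targets is a hypothetical helper, not the router's actual code. It assumes, as described above, that attention_dp_size equals tensor_parallel_size. It also treats a worker without attention DP as a single rank-0 target, which is an assumption made here for illustration. The worker IDs 7 and 9 are arbitrary.

```python
from typing import List, Tuple


def routing_targets(worker_ids: List[int], tensor_parallel_size: int,
                    enable_attention_dp: bool) -> List[Tuple[int, int]]:
    """Expand workers into (worker_id, dp_rank) routing targets.

    With attention DP, attention_dp_size == tensor_parallel_size, so each worker
    yields one target per DP rank. Without it, the worker is modeled here as a
    single rank-0 target (an assumption for illustration)."""
    attention_dp_size = tensor_parallel_size if enable_attention_dp else 1
    return [(w, rank) for w in worker_ids for rank in range(attention_dp_size)]


# Two workers, each launched with --tensor-parallel-size 2 --enable-attention-dp:
print(routing_targets([7, 9], tensor_parallel_size=2, enable_attention_dp=True))
# [(7, 0), (7, 1), (9, 0), (9, 1)]
```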

Attention DP requires TRT-LLM’s PyTorch backend. AutoDeploy does not support attention DP.