For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Digest
  • Getting Started
    • Quickstart
    • Introduction
    • Local Installation
    • Building from Source
    • Contribution Guide
  • Resources
    • Support Matrix
    • Feature Matrix
    • Release Artifacts
    • Examples
  • Kubernetes Deployment
    • Deployment Guide
  • User Guides
    • KV Cache Aware Routing
    • Disaggregated Serving
    • KV Cache Offloading
    • Dynamo Benchmarking
    • Multimodal
    • Diffusion (Preview)
    • Tool Calling
    • LoRA Adapters
    • Agents
    • Observability (Local)
    • Fault Tolerance
    • Writing Python Workers
  • Backends
    • SGLang
    • TensorRT-LLM
    • vLLM
  • Components
    • Frontend
    • Router
    • Planner
    • Profiler
    • KVBM
  • Integrations
    • LMCache
    • SGLang HiCache
    • FlexKV
    • KV Events for Custom Engines
  • Design Docs
    • Overall Architecture
    • Architecture Flow
    • Disaggregated Serving
    • Distributed Runtime
    • Blog
  • Documentation
    • Dynamo Docs Guide
  • Additional Resources
      • Gemma3 Sliding Window
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
Digest
On this page
  • Aggregated Serving
  • Aggregated Serving with KV Routing
  • Disaggregated Serving
  • Disaggregated Serving with KV Routing
Additional ResourcesTensorRT-LLM Details

Gemma3 Sliding Window

||View as Markdown|
Edit this page
Previous

Dynamo Docs Guide

For general TensorRT-LLM features and configuration, see the Reference Guide.


This guide demonstrates how to deploy google/gemma-3-1b-it with Variable Sliding Window Attention (VSWA) using Dynamo. Since google/gemma-3-1b-it is a small model, each aggregated, decode, or prefill worker only requires one H100 GPU or one GB200 GPU. VSWA is a mechanism in which a model’s layers alternate between multiple sliding window sizes. An example of this is Gemma 3, which incorporates both global attention layers and sliding window layers.

  • Ensure that required services such as nats and etcd are running before starting.
  • Request access to google/gemma-3-1b-it on Hugging Face and set your HF_TOKEN environment variable for authentication.

Aggregated Serving

$cd $DYNAMO_HOME/examples/backends/trtllm
$export MODEL_PATH=google/gemma-3-1b-it
$export SERVED_MODEL_NAME=$MODEL_PATH
$export AGG_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml
$./launch/agg.sh

Aggregated Serving with KV Routing

$cd $DYNAMO_HOME/examples/backends/trtllm
$export MODEL_PATH=google/gemma-3-1b-it
$export SERVED_MODEL_NAME=$MODEL_PATH
$export AGG_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml
$./launch/agg_router.sh

Disaggregated Serving

$cd $DYNAMO_HOME/examples/backends/trtllm
$export MODEL_PATH=google/gemma-3-1b-it
$export SERVED_MODEL_NAME=$MODEL_PATH
$export PREFILL_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_prefill.yaml
$export DECODE_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_decode.yaml
$./launch/disagg.sh

Disaggregated Serving with KV Routing

$cd $DYNAMO_HOME/examples/backends/trtllm
$export MODEL_PATH=google/gemma-3-1b-it
$export SERVED_MODEL_NAME=$MODEL_PATH
$export PREFILL_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_prefill.yaml
$export DECODE_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_decode.yaml
$./launch/disagg_router.sh