NVIDIA TensorRT


Working with Transformers

TensorRT provides built-in layers and fusions for common transformer workloads, including RoPE, KV cache updates, MoE, fused attention, and multi-device attention. The pages below cover each topic in depth.

  • Rotary Position Embedding
  • KV Cache
  • MoE (Mixture of Experts)
  • Fused Attention
    • IO Form (Tensor Layout)
    • Supported Attention Fusions
    • Constructing a Fused Attention
    • Example Workflow: Quantize an ONNX Attention Model to FP8 with ModelOpt
    • Example: Exploiting Sparsity in Attention Masks
  • Example: Using Transformer-Oriented APIs
    • Padded Format
    • Packed (Ragged) Format
    • Fully Packed (Ragged) Format
  • Multi-Device Attention
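
As a rough, library-independent illustration of the difference between the padded and packed (ragged) formats listed above, the sketch below converts a padded batch into one concatenated token list plus cumulative offsets, and back. It does not use TensorRT's API and may not match its exact tensor layout. The helper names `pack_sequences` and `unpack_sequences`, the pad value, and the sample data are hypothetical and chosen only for this example.

```python
from itertools import accumulate


def pack_sequences(padded, lengths):
    """Drop padding: concatenate valid tokens and record cumulative offsets."""
    tokens = []
    for row, n in zip(padded, lengths):
        tokens.extend(row[:n])  # keep only the first n (valid) entries
    # tokens[offsets[i]:offsets[i + 1]] holds sequence i
    offsets = [0, *accumulate(lengths)]
    return tokens, offsets


def unpack_sequences(tokens, offsets, max_len, pad=0):
    """Rebuild the padded batch from the packed tokens and offsets."""
    padded = []
    for start, end in zip(offsets, offsets[1:]):
        seq = tokens[start:end]
        padded.append(seq + [pad] * (max_len - len(seq)))
    return padded


# Hypothetical batch of 3 sequences, padded with 0 to length 4.
padded = [
    [11, 12, 13, 0],
    [21, 0, 0, 0],
    [31, 32, 0, 0],
]
lengths = [3, 1, 2]

tokens, offsets = pack_sequences(padded, lengths)
restored = unpack_sequences(tokens, offsets, max_len=4)
```

In this sketch the padded batch stores 12 entries for 6 valid tokens, while the packed form stores only the 6 tokens and 4 offsets. Avoiding work on padding in this way is the general motivation for ragged layouts in attention.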


Copyright © 2021-2026, NVIDIA Corporation.

Last updated on Jun 16, 2026.