NVIDIA TensorRT-LLM

NVIDIA TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build NVIDIA TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
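
As a taste of that Python API, the sketch below uses the high-level `LLM` entry point available in recent TensorRT-LLM releases; the Hugging Face model name and the sampling settings are illustrative assumptions, and the exact API surface may vary between versions.

```python
# Minimal sketch of the high-level Python API in recent TensorRT-LLM releases.
# Assumes TensorRT-LLM is installed; the model name and sampling settings are
# illustrative only and can be swapped for any supported model.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # builds or loads an engine for this model
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

for output in llm.generate(["What is TensorRT-LLM?"], params):
    print(output.outputs[0].text)
```

The same workflow applies when scripting engine builds explicitly: the model definition is compiled into a TensorRT engine once, and the Python or C++ runtime then executes that engine for every request.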

This is the starting point to try out TensorRT-LLM. Specifically, this Quick Start Guide enables you to quickly get set up and send HTTP requests using TensorRT-LLM.
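For example, once an engine is being served, a request can be as simple as the sketch below; it assumes a deployment behind the Triton Inference Server generate endpoint, and the URL, model name (`ensemble`), and payload fields are assumptions that may differ from your setup.

```python
# Minimal sketch: send an HTTP generation request to a served TensorRT-LLM engine.
# Assumes a Triton Inference Server deployment with the TensorRT-LLM backend;
# the URL, model name ("ensemble"), and payload fields may differ in your setup.
import requests

url = "http://localhost:8000/v2/models/ensemble/generate"
payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 64,
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```
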
This document provides step-by-step instructions on how to install TensorRT-LLM on Linux.
This document provides instructions for building TensorRT-LLM from the source code on Linux.
This document provides step-by-step instructions on how to install TensorRT-LLM on Windows.
This document provides instructions for building TensorRT-LLM from the source code on Windows.
Clone the latest TensorRT-LLM branch, work with the code, participate in the development of the product, pull in the latest changes, and view the latest discussions.
This document provides an overview of TensorRT-LLM and how it accelerates and optimizes inference performance for the latest large language models (LLMs) on NVIDIA GPUs. Discover the major benefits that TensorRT-LLM provides and how it can help you.
This document provides the current status, software versions, fixed bugs, and known issues for TensorRT-LLM. All functionality published in the Release Notes has been fully tested and verified, with known limitations documented.
This document lists the supported GPUs, models, and other hardware and software versions for the latest NVIDIA TensorRT-LLM release.
This document explains how TensorRT-LLM, as a toolkit, assembles optimized solutions to perform Large Language Model (LLM) inference.
This is the C++ API Runtime documentation for the TensorRT-LLM library.
This is the Python API Runtime documentation for the TensorRT-LLM library.
This is the Python API Layers documentation for the TensorRT-LLM library.
This is the Python API Functionals documentation for the TensorRT-LLM library.
This is the Python API Models documentation for the TensorRT-LLM library.
This is the Python API Plugin documentation for the TensorRT-LLM library.
This is the Python API Quantization documentation for the TensorRT-LLM library.
Learn how we used NVIDIA’s suite of solutions to optimize LLMs and deploy them in multi-GPU environments.
Learn about accelerated LLM alignment using the NeMo Framework, and about inference optimization and deployment through NVIDIA’s TensorRT-LLM and Triton Inference Server.
Learn how we are leveraging TensorRT-LLM to implement key features of our model-serving product, and highlight useful features of TensorRT-LLM such as streaming of tokens, in-flight batching, paged attention, quantization, and more.
Find more news and tutorials.
Join the NVIDIA Developer Program.
Explore TensorRT-LLM forums.
This document describes how to debug unit tests, execution errors, E2E models, and installation issues.