NVIDIA TensorRT-LLM

NVIDIA TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build NVIDIA TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
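
As a taste of that Python API, the sketch below uses the high-level `LLM` entry point available in recent TensorRT-LLM releases; the Hugging Face model name and the sampling settings are illustrative assumptions, and the exact API surface may vary between versions.

```python
# Minimal sketch of the high-level Python API in recent TensorRT-LLM releases.
# Assumes TensorRT-LLM is installed; the model name and sampling settings are
# illustrative only and can be swapped for any supported model.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # builds or loads an engine for this model
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

for output in llm.generate(["What is TensorRT-LLM?"], params):
    print(output.outputs[0].text)
```

The same workflow applies when scripting engine builds explicitly: the model definition is compiled into a TensorRT engine once, and the Python or C++ runtime then executes that engine for every request.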

This is the starting point to try out TensorRT-LLM. Specifically, this Quick Start Guide enables you to quickly get set up and send HTTP requests using TensorRT-LLM.
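For example, once an engine is being served, a request can be as simple as the sketch below; it assumes a deployment behind the Triton Inference Server generate endpoint, and the URL, model name (`ensemble`), and payload fields are assumptions that may differ from your setup.

```python
# Minimal sketch: send an HTTP generation request to a served TensorRT-LLM engine.
# Assumes a Triton Inference Server deployment with the TensorRT-LLM backend;
# the URL, model name ("ensemble"), and payload fields may differ in your setup.
import requests

url = "http://localhost:8000/v2/models/ensemble/generate"
payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 64,
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```
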
This document provides step-by-step instructions on how to install TensorRT-LLM on Linux.
This document provides instructions for building TensorRT-LLM from the source code on Linux.
This document provides step-by-step instructions on how to install TensorRT-LLM on Windows.
This document provides instructions for building TensorRT-LLM from the source code on Windows.
Clone the latest TensorRT-LLM branch, work with the code, participate in the development of the product, pull in the latest changes, and view the latest discussions.
This document provides an overview of TensorRT-LLM and how it accelerates and optimizes inference performance for the latest large language models (LLMs) on NVIDIA GPUs. Discover the major benefits that TensorRT-LLM provides and how it can help you.
This document provides the current status, software versions, fixed bugs, and known issues for TensorRT-LLM. All functionality published in the Release Notes has been fully tested and verified, with known limitations documented.
This document lists the supported GPUs, models, and other hardware and software versions for the latest NVIDIA TensorRT-LLM release.
This document explains how TensorRT-LLM, as a toolkit, assembles optimized solutions to perform Large Language Model (LLM) inference.
This is the C++ API Runtime documentation for the TensorRT-LLM library.
This is the Python API Runtime documentation for the TensorRT-LLM library.
This is the Python API Layers documentation for the TensorRT-LLM library.
This is the Python API Functionals documentation for the TensorRT-LLM library.
This is the Python API Models documentation for the TensorRT-LLM library.
This is the Python API Plugin documentation for the TensorRT-LLM library.
This is the Python API Quantization documentation for the TensorRT-LLM library.
Learn how we used NVIDIA’s suite of solutions to optimize LLMs and deploy them in multi-GPU environments.
Learn about accelerated LLM alignment using the NeMo Framework, and about inference optimization and deployment through NVIDIA’s TensorRT-LLM and Triton Inference Server.
Learn how we are leveraging TensorRT-LLM to implement key features of our model-serving product, and highlight useful features of TensorRT-LLM such as streaming of tokens, in-flight batching, paged attention, quantization, and more.
Find more news and tutorials.
Join the NVIDIA Developer Program.
Explore TensorRT-LLM forums.
This document describes how to debug unit tests, execution errors, E2E models, and installation issues.