NVIDIA TensorRT Product Family

NVIDIA® TensorRT™ is a high-performance deep learning inference SDK that optimizes trained neural networks for deployment on NVIDIA GPUs. TensorRT transforms trained models from TensorFlow, PyTorch, and other frameworks, typically imported through the ONNX interchange format, into optimized runtime engines that deliver low-latency, high-throughput inference across datacenter, cloud, edge, embedded, and consumer platforms.

The TensorRT family includes four products tailored for different deployment scenarios:

- TensorRT (Enterprise): Full-featured inference for datacenter, edge, and embedded systems
- TensorRT-LLM: Specialized toolkit for Large Language Model (LLM) inference optimization
- TensorRT for RTX: Optimized for consumer RTX GPUs in desktops, laptops, and workstations
- TensorRT-Cloud: Cloud-based service for engine building and automated configuration sweeps

Choose the TensorRT product that matches your deployment target and use case.

TensorRT (Enterprise)

The comprehensive inference SDK for production AI deployments across datacenter, edge, and embedded platforms.

TensorRT delivers maximum performance for deep learning inference on NVIDIA datacenter GPUs (A100, H100, H200), edge devices (Jetson), and automotive platforms (DRIVE). It provides the complete TensorRT feature set with extensive model support, advanced optimizations, and enterprise-grade tooling.
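
The snippet below is a minimal sketch of this workflow using the TensorRT Python API: parse an ONNX model, configure the build, and serialize an optimized engine. It assumes a recent TensorRT release (where explicit-batch networks are the default) and an already-exported model; the file names are illustrative, not part of this overview.

```python
# Minimal sketch: build a serialized TensorRT engine from an ONNX model.
# Assumes the `tensorrt` Python package is installed and "model.onnx" exists
# (file names here are illustrative).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit-batch network (default in recent releases)
parser = trt.OnnxParser(network, logger)

# Parse the trained model exported from TensorFlow/PyTorch via ONNX.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")

# Configure the build, e.g. workspace size and FP16 if the GPU supports it.
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)

# Build and save the optimized engine for later deployment.
serialized_engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(serialized_engine)
```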

TensorRT-LLM

Specialized toolkit for optimizing Large Language Model (LLM) inference with state-of-the-art performance on NVIDIA GPUs.

TensorRT-LLM provides a Python API to define LLMs and build TensorRT engines optimized specifically for LLM workloads. It includes pre-built implementations of popular open-source models, multi-GPU and multi-node support, in-flight batching, paged KV caching, and quantization techniques (FP8, INT8, INT4) to maximize LLM serving throughput and minimize latency.
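
As a hedged illustration, the sketch below uses the high-level LLM class from that Python API to build and run a model in a few lines. The checkpoint name and sampling parameters are placeholders, and exact argument names can differ between TensorRT-LLM releases.

```python
# Minimal sketch of the high-level TensorRT-LLM Python API (LLM class).
# The model name and sampling values are illustrative; argument names may
# vary across TensorRT-LLM releases.
from tensorrt_llm import LLM, SamplingParams

# Loading a Hugging Face checkpoint triggers an optimized TensorRT engine build.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["What is TensorRT?", "Explain in-flight batching in one sentence."]
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Requests are served with in-flight batching and paged KV caching.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```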

TensorRT-LLM is the recommended solution for deploying LLMs in production at scale across datacenter and cloud environments.

TensorRT for RTX

Optimized inference for NVIDIA RTX GPUs in consumer desktops, laptops, and workstations.

TensorRT for RTX targets the 100M+ install base of NVIDIA RTX GPUs (GeForce RTX 20, 30, 40, 50 series and professional RTX GPUs). It delivers a compact runtime (under 200 MB) with Just-In-Time (JIT) optimization that generates inference engines in under 30 seconds directly on end-user devices.

This approach eliminates lengthy pre-compilation, enables rapid engine generation, improves application portability across RTX GPU generations, and provides cutting-edge inference performance for consumer AI applications.
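
The sketch below illustrates what on-device engine generation could look like. It assumes the TensorRT for RTX Python module (imported here as tensorrt_rtx) mirrors the standard TensorRT builder API; the module and call names are assumptions for illustration, not details taken from this overview.

```python
# Minimal sketch, ASSUMING the TensorRT for RTX Python module mirrors the
# standard TensorRT builder API; module, call, and file names are assumptions.
import tensorrt_rtx as trt_rtx

logger = trt_rtx.Logger(trt_rtx.Logger.WARNING)
builder = trt_rtx.Builder(logger)
network = builder.create_network(0)
parser = trt_rtx.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("Failed to parse ONNX model")

# The build runs directly on the end user's RTX GPU: the compact runtime
# generates the optimized engine just-in-time instead of requiring a lengthy
# pre-compilation step for every GPU generation.
config = builder.create_builder_config()
engine_bytes = builder.build_serialized_network(network, config)
with open("model_rtx.engine", "wb") as f:
    f.write(engine_bytes)
```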

TensorRT-Cloud

Cloud-based engine building and configuration optimization service for TensorRT and TensorRT-LLM.

TensorRT-Cloud (Early Access) provides on-demand engine building across diverse NVIDIA GPUs, operating systems, and library dependencies. It eliminates the need to maintain build infrastructure for every target platform and enables developers to discover optimal inference configurations through automated sweeping.