NVIDIA TensorRT Product Family

NVIDIA® TensorRT™ is a high-performance deep learning inference SDK that optimizes trained neural networks for deployment on NVIDIA GPUs. TensorRT transforms models from TensorFlow, PyTorch, ONNX, and other frameworks into optimized runtime engines that deliver low-latency, high-throughput inference across datacenter, cloud, edge, embedded, and consumer platforms.

The TensorRT family includes three products tailored for different deployment scenarios:

- TensorRT (Enterprise): Full-featured inference for datacenter, edge, and embedded systems
- TensorRT-LLM: Specialized toolkit for Large Language Model (LLM) inference optimization
- TensorRT for RTX: Optimized for consumer RTX GPUs in desktops, laptops, and workstations

Choose the TensorRT product that matches your deployment target and use case.

TensorRT (Enterprise)

The comprehensive inference SDK for production AI deployments across datacenter, edge, and embedded platforms.

TensorRT delivers maximum performance for deep learning inference on NVIDIA datacenter GPUs (A100, H100, H200), edge devices (Jetson), and automotive platforms (DRIVE). It provides the complete TensorRT feature set with extensive model support, advanced optimizations, and enterprise-grade tooling.
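As an illustrative sketch of the typical workflow (the file names `model.onnx` and `model.plan` are placeholders, and the snippet assumes a local TensorRT Python installation with GPU access), an ONNX model can be parsed and compiled into a serialized engine with the TensorRT Python API:

```python
import tensorrt as trt

# Logger, builder, and network definition (explicit-batch mode).
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Parse a trained model exported to ONNX.
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

# Configure optimizations; FP16 is applied where the GPU supports it.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Build and serialize an engine optimized for the current GPU.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```

The serialized engine is hardware-specific: it is deserialized at inference time by the TensorRT runtime on the same GPU architecture it was built for.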

TensorRT-LLM

Specialized toolkit for optimizing Large Language Model (LLM) inference with state-of-the-art performance on NVIDIA GPUs.

TensorRT-LLM provides a Python API to define LLMs and build TensorRT engines optimized specifically for LLM workloads. It includes pre-built implementations of popular open-source models, multi-GPU and multi-node support, in-flight batching, paged KV caching, and quantization techniques (FP8, INT8, INT4) to maximize LLM serving throughput and minimize latency.

TensorRT-LLM is the recommended solution for deploying LLMs in production at scale across datacenter and cloud environments.
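As a minimal sketch of the Python API mentioned above (the model identifier is a placeholder, and the exact class and parameter names here assume a recent TensorRT-LLM release with its high-level `LLM` entry point):

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder model name; TensorRT-LLM builds an optimized engine
# for the model under the hood (multi-GPU is configurable).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Sampling settings for generation.
params = SamplingParams(temperature=0.8, max_tokens=64)

# Batched generation; in-flight batching and paged KV caching
# are handled by the runtime.
for output in llm.generate(["What does TensorRT optimize?"], params):
    print(output.outputs[0].text)
```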

TensorRT for RTX

Optimized inference for NVIDIA RTX GPUs in consumer desktops, laptops, and workstations.

TensorRT for RTX targets the 100M+ install base of NVIDIA RTX GPUs (GeForce RTX 20, 30, 40, 50 series and professional RTX GPUs). It delivers a compact runtime (under 200 MB) with Just-In-Time (JIT) optimization that generates inference engines in under 30 seconds directly on end-user devices.

This approach eliminates lengthy pre-compilation, enables rapid engine generation, improves application portability across RTX GPU generations, and provides cutting-edge inference performance for consumer AI applications.