Understanding TensorRT for RTX

Content Type: Explanation — Understanding the architecture and concepts

NVIDIA TensorRT for RTX (TensorRT-RTX) is a specialization of NVIDIA TensorRT for the RTX product line. Like TensorRT, it contains a deep learning inference optimizer and a runtime that together enable high-performance inference. Unlike TensorRT, TensorRT-RTX performs Just-In-Time (JIT) compilation on the end-user machine. This approach greatly simplifies deployment when targeting a diverse set of end-user NVIDIA GPUs.

Key Difference from TensorRT: TensorRT-RTX uses JIT compilation on the end-user device rather than requiring ahead-of-time compilation on the target GPU. This enables a single engine to work across multiple RTX GPU models.

Who Should Use TensorRT-RTX

TensorRT-RTX is built for desktop-app developers who want to embed AI features such as:

  • Turning text into lifelike speech

  • Taking voice input from users

  • Generating images from prompts

The Desktop Challenge

Unlike server software, which usually targets a single known GPU, desktop applications must run on whatever graphics hardware end users have installed. TensorRT-RTX solves this by automatically optimizing AI inference for each GPU model without bloating installation times or inflating your package size.

Because many desktop programs (video games, creative suites, and so on) use the GPU simultaneously for rendering, TensorRT-RTX is designed to keep your AI workloads from taking performance away from graphics, so frame rates stay smooth and users stay engaged.

How It Works: Two-Phase Compilation

After you train your deep learning model in a framework of your choice, TensorRT-RTX enables you to run it with higher throughput and lower latency. Compilation proceeds in two phases:

Figure: TensorRT-RTX two-phase compilation. The AOT optimizer creates a portable engine; JIT compilation then optimizes it for the specific GPU.

Phase 1: Ahead-of-Time (AOT) Optimization

The AOT optimizer translates the neural network into a TensorRT-RTX engine file (also known as a JIT-able engine). This step typically takes 20-30 seconds, and the resulting engine is portable across multiple RTX GPU models.
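
As a rough illustration, the sketch below builds a JIT-able engine from an ONNX model using TensorRT-style builder calls. It assumes the TensorRT-RTX Python bindings mirror the familiar TensorRT API; the module name tensorrt_rtx, the file names, and the exact call signatures are assumptions to verify against your installed package.

    import tensorrt_rtx as trt  # module name is an assumption; adjust to your install

    # Phase 1 (AOT): parse the ONNX model and serialize a portable, JIT-able engine.
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network()
    parser = trt.OnnxParser(network, logger)

    with open("model.onnx", "rb") as f:  # hypothetical model file
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse the ONNX model")

    config = builder.create_builder_config()
    serialized_engine = builder.build_serialized_network(network, config)

    # The serialized engine is portable across RTX GPU models; ship it with your app.
    with open("model.engine", "wb") as f:
        f.write(serialized_engine)

Because this phase does not require the target GPU, it can run on your build machine, and the resulting engine file can be packaged with your application installer.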

Phase 2: Just-In-Time (JIT) Compilation

At inference time, the runtime JIT-compiles this engine into an executable inference plan with an inference strategy optimized for the specific GPU, including the concrete choice of computation kernels. This step is very fast at the first inference invocation (under 5 seconds of latency for most models), and runtime caching can optionally speed up subsequent invocations even further.
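
Continuing the sketch above, the following shows what the runtime side could look like on the end user's machine. Again, the module name and API parity with standard TensorRT are assumptions; the optional runtime cache is not shown because its exact interface should be taken from the runtime API documentation.

    import tensorrt_rtx as trt  # module name is an assumption; adjust to your install

    # Phase 2 (JIT): deserialize the shipped engine on the end user's machine.
    # Kernel selection for the installed GPU happens here and at first inference.
    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)

    with open("model.engine", "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())

    context = engine.create_execution_context()
    # From here, allocate device buffers and launch inference with the execution
    # context as you would with standard TensorRT; the runtime API tutorial listed
    # under Next Steps walks through a complete example.

Subsequent runs can reuse the runtime cache mentioned above so that the JIT step does not have to repeat its work.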

Why This Approach Works

This two-phase approach balances portability with performance:

  • AOT phase: Creates a single portable engine (fast, and can run on a machine without the target GPU)

  • JIT phase: Optimizes for the specific GPU the user has (happens once on their machine)

  • Result: Simple deployment with near-optimal performance

Next Steps

The following topics cover the basic installation, conversion, and runtime options available in TensorRT-RTX. Here is a summary of each:

Installing TensorRT-RTX - We provide multiple, simple ways of installing TensorRT-RTX.

Example Deployment Using ONNX - This section examines the basic steps to convert and deploy your model. It introduces concepts used in the rest of the guide and walks you through the decisions you must make to optimize inference execution.

ONNX Conversion and Deployment - We provide a broad overview of ONNX exports from different training frameworks.

Using the TensorRT-RTX Runtime API - This section provides a tutorial on running a simple convolutional neural network using the TensorRT-RTX C++ and Python APIs.