TensorRT#

Integration with NVIDIA TensorRT for GPU-accelerated execution.

Overview#

The TensorRT library provides a simplified interface for integrating TensorRT engines into real-time applications.

Key Features#

  • Flexible Tensor Metadata - User-provided dimensions and strides

  • Automatic Stride Computation - Row-major strides computed from dimensions when not provided

  • CUDA Graph Support - Pre/post enqueue hooks for graph capture

  • Engine Abstraction - Interface-based design

Core Concepts#

MLIR Tensor Parameters#

MLIRTensorParams defines the metadata for tensors used by the TensorRT engine. Each tensor requires:

  • name: Tensor identifier matching the TensorRT engine

  • data_type: Element data type (e.g., TensorR32F for float32)

  • rank: Number of dimensions (0 for scalar, 1-8 for tensors)

  • dims: Size of each dimension

  • strides: Optional memory layout (auto-computed if not provided)

When strides are not provided (last stride == 0), row-major strides are automatically computed from dimensions.
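That rule can be shown as a standalone sketch: walk the dimensions from innermost to outermost, so the last stride is 1 and each outer stride is the product of all inner dimensions (illustrative code, not the library's internal implementation):

```cpp
#include <array>
#include <cstddef>

// Row-major stride computation from dimensions (illustrative sketch).
constexpr std::size_t MAX_TENSOR_RANK = 8;

std::array<std::size_t, MAX_TENSOR_RANK> row_major_strides(
        const std::array<std::size_t, MAX_TENSOR_RANK> &dims, const std::size_t rank) {
    std::array<std::size_t, MAX_TENSOR_RANK> strides{};
    std::size_t stride = 1;
    for (std::size_t i = rank; i-- > 0;) {  // innermost dimension first
        strides[i] = stride;
        stride *= dims[i];
    }
    return strides;
}
```

For dims = {4, 8, 16} this yields strides = {128, 16, 1}, which matches the explicit-stride example in the next section.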

Creating Tensor Parameters#

// Define tensor parameters with name, data type, rank, and dimensions
MLIRTensorParams input_params{
        .name = "input_data", .data_type = tensor::TensorR32F, .rank = 2, .dims = {128, 256}};

// Access tensor properties
const auto rank = input_params.rank;
const auto batch_size = input_params.dims[0];
const auto feature_size = input_params.dims[1];

Tensor Parameters with Strides#

// Define tensor parameters with explicit strides
MLIRTensorParams params{
        .name = "data",
        .data_type = tensor::TensorR32F,
        .rank = 3,
        .dims = {4, 8, 16},
        .strides = {128, 16, 1}};

// Verify stride configuration
const auto outer_stride = params.strides[0];
const auto middle_stride = params.strides[1];
const auto inner_stride = params.strides[2];

MLIR TensorRT Engine#

MLIRTrtEngine provides a streamlined TensorRT interface that:

  • Eliminates batch size management (users handle batching externally)

  • Removes internal buffer allocation (users provide pre-allocated CUDA buffers)

  • Uses constructor-based initialization (no separate init() phase)

  • Accepts tensor dimensions and strides directly in MLIRTensorParams

The engine operates in three phases: construction, setup, and execution.

Engine Construction#

// Define input and output tensor parameters
std::vector<MLIRTensorParams> input_params = {
        {.name = "input", .data_type = tensor::TensorR32F, .rank = 1, .dims = {1024}}};

std::vector<MLIRTensorParams> output_params = {
        {.name = "output", .data_type = tensor::TensorR32F, .rank = 1, .dims = {1024}}};

// Create TensorRT runtime (mock for documentation)
auto runtime = std::make_unique<MockTrtEngine>();

// Construct MLIR TensorRT engine
MLIRTrtEngine engine(
        std::move(input_params), std::move(output_params), std::move(runtime));

The engine is fully initialized in the constructor. All tensor shapes must be provided during construction.

Engine Setup#

// Prepare buffer pointers (CUDA device memory)
const std::vector<void *> input_buffers = {mock_ptr(0x1000)};
const std::vector<void *> output_buffers = {mock_ptr(0x2000)};

// Setup engine with buffer addresses
const auto setup_result = engine.setup(input_buffers, output_buffers);

Setup caches the provided buffer pointers for use during execution. Buffers are direct pointers to CUDA device memory and must remain valid for the lifetime of execution operations.

Engine Execution#

// Create CUDA stream for execution
cudaStream_t stream = mock_stream(0x100);

// Execute the engine
const auto result = engine.run(stream);

The run() method executes asynchronously on the provided CUDA stream; synchronize the stream (for example with cudaStreamSynchronize()) before reading output buffers on the host.

TensorRT Engine Interface#

ITrtEngine is an abstract interface for TensorRT operations. The concrete TrtEngine implementation wraps the NVIDIA TensorRT runtime, while NullTrtEngine provides a null-object pattern for testing.
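The null-object pattern lets call sites hold a valid engine unconditionally instead of branching on a nullptr. A condensed illustration with a hypothetical two-method interface (IEngine, ErrC, and NullEngine are stand-ins for ITrtEngine, utils::NvErrc, and NullTrtEngine, which have more members):

```cpp
// Condensed illustration of the null-object pattern used by NullTrtEngine.
enum class ErrC { Success, NotSupported };

struct IEngine {
    virtual ~IEngine() = default;
    virtual ErrC enqueue_inference() = 0;
    virtual bool all_input_dimensions_specified() const = 0;
};

// Null implementation: every operation reports NotSupported / false,
// so callers degrade gracefully without nullptr checks.
struct NullEngine final : IEngine {
    ErrC enqueue_inference() override { return ErrC::NotSupported; }
    bool all_input_dimensions_specified() const override { return false; }
};

// Call sites depend only on the interface.
ErrC try_run(IEngine &engine) { return engine.enqueue_inference(); }
```
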

Multi-Rank Tensors#

The library supports tensors with ranks 0 through 8:

// Define tensors with different ranks
const MLIRTensorParams scalar{.name = "scalar", .data_type = tensor::TensorR32F, .rank = 0};

const MLIRTensorParams vector{
        .name = "vector", .data_type = tensor::TensorR32F, .rank = 1, .dims = {256}};

const MLIRTensorParams matrix{
        .name = "matrix", .data_type = tensor::TensorR32F, .rank = 2, .dims = {32, 64}};

const MLIRTensorParams tensor_3d{
        .name = "tensor_3d", .data_type = tensor::TensorR32F, .rank = 3, .dims = {16, 32, 64}};

// Access rank information
const auto vec_rank = vector.rank;
const auto mat_rank = matrix.rank;

Complete Example#

This example demonstrates the full workflow from tensor definition through execution:

// Step 1: Define tensor parameters
std::vector<MLIRTensorParams> inputs = {
        {.name = "input0", .data_type = tensor::TensorR32F, .rank = 2, .dims = {32, 128}},
        {.name = "input1", .data_type = tensor::TensorR32F, .rank = 2, .dims = {32, 128}}};

std::vector<MLIRTensorParams> outputs = {
        {.name = "result", .data_type = tensor::TensorR32F, .rank = 2, .dims = {32, 128}}};

// Step 2: Create engine with TensorRT runtime
auto trt_runtime = std::make_unique<MockTrtEngine>();
MLIRTrtEngine engine(std::move(inputs), std::move(outputs), std::move(trt_runtime));

// Step 3: Setup buffers
const std::vector<void *> input_addrs = {mock_ptr(0x1000), mock_ptr(0x2000)};
const std::vector<void *> output_addrs = {mock_ptr(0x3000)};
const auto setup_err = engine.setup(input_addrs, output_addrs);

// Step 4: Run the engine
cudaStream_t cu_stream = mock_stream(0x100);
const auto run_err = engine.run(cu_stream);

Additional Examples#

For more examples, see:

  • framework/tensorrt/tests/tensorrt_sample_tests.cpp - Documentation examples and unit tests

API Reference#

class CaptureStreamPrePostTrtEngEnqueue : public framework::tensorrt::IPrePostTrtEngEnqueue#
#include <trt_pre_post_enqueue_stream_cap.hpp>

Stream-capture implementation of the IPrePostTrtEngEnqueue interface.

Captures the work submitted to the CUDA stream around enqueue_v3() so it can be retrieved as a CUDA graph via get_graph().

Public Functions

CaptureStreamPrePostTrtEngEnqueue() = default#

Default constructor

~CaptureStreamPrePostTrtEngEnqueue() final#

Destructor; destroys the captured CUDA graph

CaptureStreamPrePostTrtEngEnqueue(
const CaptureStreamPrePostTrtEngEnqueue&,
) = delete#
CaptureStreamPrePostTrtEngEnqueue &operator=(
const CaptureStreamPrePostTrtEngEnqueue&,
) = delete#
CaptureStreamPrePostTrtEngEnqueue(
CaptureStreamPrePostTrtEngEnqueue&&,
) = delete#
CaptureStreamPrePostTrtEngEnqueue &operator=(
CaptureStreamPrePostTrtEngEnqueue&&,
) = delete#
virtual utils::NvErrc pre_enqueue(cudaStream_t cu_stream) final#

Begin stream capture before enqueue_v3() is called

Parameters:

cu_stream – stream to use

Returns:

utils::NvErrc::Success on success, error code on failure

virtual utils::NvErrc post_enqueue(cudaStream_t cu_stream) final#

End stream capture after enqueue_v3() and store the captured CUDA graph

Parameters:

cu_stream – stream to use

Returns:

utils::NvErrc::Success on success, error code on failure

inline CUgraph get_graph() const#

Get the captured CUDA graph

Returns:

Pointer to the captured CUDA graph, or nullptr if no graph is captured
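Once get_graph() returns a non-null graph, it can be instantiated and replayed with the CUDA driver API. A hedged sketch with error handling elided (replay_captured_graph is illustrative; cuGraphInstantiate, cuGraphLaunch, and cuGraphExecDestroy are standard driver-API calls):

```cpp
#include <cuda.h>

// Sketch: replay a graph returned by get_graph() on a stream.
// Assumes the CUDA driver API has been initialized by the caller.
void replay_captured_graph(CUgraph graph, CUstream stream) {
    if (graph == nullptr) { return; }     // nothing was captured
    CUgraphExec exec = nullptr;
    cuGraphInstantiate(&exec, graph, 0);  // build an executable graph once
    cuGraphLaunch(exec, stream);          // replay; can be launched repeatedly
    cuGraphExecDestroy(exec);             // release after use
}
```

Instantiation is the expensive step; in practice the CUgraphExec would be kept and relaunched many times.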

class IPrePostTrtEngEnqueue#
#include <trt_engine_interfaces.hpp>

Interface for pre/post hooks around TrtEngine's enqueue_v3() call.

Implementations run immediately before and after enqueue_v3() in TrtEngine.

Subclassed by framework::pipeline::tests::TestPrePostTrtEngEnqueue, framework::tensorrt::CaptureStreamPrePostTrtEngEnqueue, framework::tensorrt::NullPrePostTrtEngEnqueue

Public Functions

IPrePostTrtEngEnqueue() = default#
virtual ~IPrePostTrtEngEnqueue() = default#
IPrePostTrtEngEnqueue(
const IPrePostTrtEngEnqueue &pre_post_trt_eng_enqueue,
) = default#

Copy constructor

Parameters:

pre_post_trt_eng_enqueue[in] Source object to copy from

IPrePostTrtEngEnqueue &operator=(
const IPrePostTrtEngEnqueue &pre_post_trt_eng_enqueue,
) = default#

Copy assignment operator

Parameters:

pre_post_trt_eng_enqueue[in] Source object to copy from

Returns:

Reference to this object

IPrePostTrtEngEnqueue(
IPrePostTrtEngEnqueue &&pre_post_trt_eng_enqueue,
) = default#

Move constructor

Parameters:

pre_post_trt_eng_enqueue[in] Source object to move from

IPrePostTrtEngEnqueue &operator=(
IPrePostTrtEngEnqueue &&pre_post_trt_eng_enqueue,
) = default#

Move assignment operator

Parameters:

pre_post_trt_eng_enqueue[in] Source object to move from

Returns:

Reference to this object

virtual utils::NvErrc pre_enqueue(cudaStream_t cu_stream) = 0#

Pre-enqueue activity before calling enqueue_v3()

Parameters:

cu_stream – stream to use

Returns:

utils::NvErrc::Success on success, error code on failure

virtual utils::NvErrc post_enqueue(cudaStream_t cu_stream) = 0#

Post-enqueue activity after calling enqueue_v3()

Parameters:

cu_stream – stream to use

Returns:

utils::NvErrc::Success on success, error code on failure

class ITrtEngine#
#include <trt_engine_interface.hpp>

Abstract interface for TensorRT engine operations.

This interface abstracts the TensorRT components (IRuntime, ICudaEngine, IExecutionContext) into a unified API for engine initialization, configuration, and execution.

Subclassed by framework::pipeline::tests::TestTrtEngine, framework::tensorrt::NullTrtEngine, framework::tensorrt::TrtEngine

Public Functions

ITrtEngine() = default#
virtual ~ITrtEngine() = default#
ITrtEngine(const ITrtEngine &engine) = default#

Copy constructor

Parameters:

engine[in] Source object to copy from

ITrtEngine &operator=(const ITrtEngine &engine) = default#

Copy assignment operator

Parameters:

engine[in] Source object to copy from

Returns:

Reference to this object

ITrtEngine(ITrtEngine &&engine) = default#

Move constructor

Parameters:

engine[in] Source object to move from

ITrtEngine &operator=(ITrtEngine &&engine) = default#

Move assignment operator

Parameters:

engine[in] Source object to move from

Returns:

Reference to this object

virtual utils::NvErrc set_input_shape(
const std::string_view tensor_name,
const nvinfer1::Dims &dims,
) = 0#

Set the shape of an input tensor.

Parameters:
  • tensor_name[in] Name of the input tensor

  • dims[in] Dimensions to set for the tensor

Returns:

utils::NvErrc::Success on success, error code on failure

virtual utils::NvErrc set_tensor_address(
const std::string_view tensor_name,
void *address,
) = 0#

Set the memory address for a tensor.

Parameters:
  • tensor_name[in] Name of the tensor

  • address[in] Memory address to associate with the tensor

Returns:

utils::NvErrc::Success on success, error code on failure

virtual utils::NvErrc enqueue_inference(cudaStream_t cu_stream) = 0#

Execute inference asynchronously.

Parameters:

cu_stream[in] CUDA stream for asynchronous execution

Returns:

utils::NvErrc::Success on success, error code on failure

virtual bool all_input_dimensions_specified() const = 0#

Check if all input dimensions have been specified.

Returns:

true if all input dimensions are specified, false otherwise

struct MLIRTensorParams#
#include <trt_engine_params.hpp>

Tensor parameters for MLIR-TensorRT engines.

This structure provides tensor parameter representation for MLIR-TensorRT engines where tensor shapes and strides are provided by the user during initialization.

The user must provide the rank and dimensions. Strides are optional:

  • If strides are provided (last stride != 0), they are used as-is

  • If strides are not provided (last stride == 0), row-major strides are automatically computed from dimensions

See also

tensor::NvDataType for supported data type enumeration

Note

Maximum rank is 8, minimum rank is 0 (scalar), matching MLIR-TensorRT implementation limits

Public Functions

inline constexpr void set_rank(
const std::size_t n,
) noexcept(!utils::GSL_CONTRACT_THROWS)#

Set the number of dimensions for this tensor (rank)

Parameters:

n[in] Number of dimensions (must be <= MAX_TENSOR_RANK)

Public Members

std::string name#

Tensor name identifier.

tensor::NvDataType data_type = {}#

Data type of tensor elements.

std::size_t rank = {}#

Number of dimensions (0 for scalar, 1-8 for tensors)

std::array<std::size_t, MAX_TENSOR_RANK> dims = {}#

Tensor dimensions (first rank elements valid)

std::array<std::size_t, MAX_TENSOR_RANK> strides = {}#

Tensor strides (first rank elements valid, auto-computed if not provided)

Public Static Attributes

static constexpr std::size_t MAX_TENSOR_RANK = 8#

Maximum supported tensor rank.

class MLIRTrtEngine#
#include <mlir_trt_engine.hpp>

Simplified TensorRT engine that mimics MLIR-TensorRT runtime patterns.

This class provides a streamlined TensorRT engine implementation that:

  • Eliminates batch size management (users handle batching externally)

  • Removes internal buffer allocation (users provide pre-allocated CUDA buffers)

  • Uses constructor-based initialization (no separate init() phase)

  • Accepts tensor dimensions and strides directly in MLIRTensorParams

  • Supports user-provided tensor names from TrtParams

The engine uses tensor metadata (dims/strides) provided by users in MLIRTensorParams and directly interfaces with TensorRT APIs.

Public Functions

MLIRTrtEngine(
std::vector<MLIRTensorParams> input_tensor_prms,
std::vector<MLIRTensorParams> output_tensor_prms,
std::unique_ptr<ITrtEngine> tensorrt_runtime,
std::unique_ptr<IPrePostTrtEngEnqueue> pre_post_trt_eng_enqueue = nullptr,
)#

Construct MLIRTrtEngine with full initialization.

All initialization is performed in the constructor, eliminating the need for a separate init() method. The TensorRT runtime must be pre-initialized and provided by the caller. Tensor shapes (dims/strides) must be provided in the MLIRTensorParams. If strides are not provided (last stride == 0), row-major strides are automatically computed.

Parameters:
  • input_tensor_prms[in] Input tensor parameters (name, data_type, rank, dims, optional strides)

  • output_tensor_prms[in] Output tensor parameters (name, data_type, rank, dims, optional strides)

  • tensorrt_runtime[in] Pre-initialized TensorRT runtime (required)

  • pre_post_trt_eng_enqueue[in] Optional pre/post enqueue operations (e.g., CUDA graph capture)

Throws:
  • std::invalid_argument – if tensorrt_runtime is nullptr, or if rank is invalid (> 8)

  • std::runtime_error – on initialization failure

~MLIRTrtEngine() = default#
MLIRTrtEngine(const MLIRTrtEngine &engine) = delete#
MLIRTrtEngine &operator=(const MLIRTrtEngine &engine) = delete#
MLIRTrtEngine(MLIRTrtEngine &&engine) = delete#
MLIRTrtEngine &operator=(MLIRTrtEngine &&engine) = delete#
utils::NvErrc warmup(cudaStream_t cu_stream)#

Perform warmup inference to allocate TensorRT resources.

Runs the TensorRT engine once to ensure all internal resources are allocated and to avoid first-run latency, using the tensor shapes from MLIRTensorParams and the buffer pointers cached by setup() to execute a single inference pass.

Note

Requires setup() to be called first to provide the buffer pointers.

Parameters:

cu_stream[in] CUDA stream for warmup operations

Returns:

utils::NvErrc::Success on success, error code on failure

utils::NvErrc setup(
const std::vector<void*> &input_buffers,
const std::vector<void*> &output_buffers,
)#

Setup input and output buffer addresses.

Caches the provided buffer pointers for use during inference. Buffers are direct pointers to CUDA memory (data pointers), not descriptor pointers. Buffers must remain valid for the lifetime of inference operations. No batch size parameter is needed as batching is handled externally.

Parameters:
  • input_buffers[in] Vector of input data buffer pointers (must match input_tensor_prms size)

  • output_buffers[in] Vector of output data buffer pointers (must match output_tensor_prms size)

Returns:

utils::NvErrc::Success on success, error code on failure

utils::NvErrc run(cudaStream_t cu_stream) const#

Execute inference on the configured tensors.

Performs the complete inference pipeline:

  1. Set tensor addresses in TensorRT using user-provided buffer pointers

  2. Set input shapes in TensorRT using dims from MLIRTensorParams

  3. Execute pre-enqueue operations (e.g., CUDA graph capture start)

  4. Run TensorRT inference

  5. Execute post-enqueue operations (e.g., CUDA graph capture end)

Parameters:

cu_stream[in] CUDA stream for execution

Returns:

utils::NvErrc::Success on success, error code on failure
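The hook ordering around the enqueue step can be illustrated with a minimal pure-C++ mock (RecordingHook and mock_run are stand-ins; the real pipeline also sets tensor addresses and input shapes through ITrtEngine before the hooks fire):

```cpp
#include <string>
#include <vector>

// Minimal mock of run()'s hook ordering: pre_enqueue -> inference -> post_enqueue.
struct RecordingHook {
    std::vector<std::string> calls;
    void pre_enqueue()  { calls.push_back("pre_enqueue"); }   // e.g., begin graph capture
    void enqueue()      { calls.push_back("enqueue_inference"); }  // TensorRT enqueue_v3()
    void post_enqueue() { calls.push_back("post_enqueue"); }  // e.g., end capture
};

void mock_run(RecordingHook &hook) {
    hook.pre_enqueue();
    hook.enqueue();
    hook.post_enqueue();
}
```

With NullPrePostTrtEngEnqueue the pre/post steps are no-ops, while CaptureStreamPrePostTrtEngEnqueue uses them to bracket the enqueue with stream capture.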

class NullPrePostTrtEngEnqueue : public framework::tensorrt::IPrePostTrtEngEnqueue#
#include <trt_null_pre_post_enqueue.hpp>

Null/No-op implementation of IPrePostTrtEngEnqueue

This class provides a null object pattern implementation for scenarios where CUDA graph capture is not needed during TensorRT engine warmup.

Use Cases:

  • Pure stream-mode pipelines with no graph execution requirements

  • Unit tests that only exercise stream-based execution paths

  • Performance-critical scenarios where graph capture overhead must be avoided

Design Tradeoff:

  • Eliminates graph capture overhead during warmup (~milliseconds one-time cost)

  • Cannot support graph-based execution mode (execute_graph will fail)

  • Reduces memory footprint (no captured graph stored)

Example Usage:

// For stream-only execution
auto null_capturer = std::make_unique<NullPrePostTrtEngEnqueue>();
auto trt_engine = std::make_unique<MLIRTrtEngine>(
    inputs, outputs,
    std::move(tensorrt_runtime),
    std::move(null_capturer)  // No graph capture
);

// Warmup loads engine and runs once, but doesn't capture graph
trt_engine->warmup(stream);

// Stream execution works normally
trt_engine->run(stream);  // OK

// Graph retrieval is unavailable: NullPrePostTrtEngEnqueue captures no graph
// (only CaptureStreamPrePostTrtEngEnqueue provides get_graph())

See also

CaptureStreamPrePostTrtEngEnqueue for graph-mode capture

See also

IPrePostTrtEngEnqueue for interface documentation

Public Functions

NullPrePostTrtEngEnqueue() = default#

Default constructor

~NullPrePostTrtEngEnqueue() override = default#

Destructor

NullPrePostTrtEngEnqueue(const NullPrePostTrtEngEnqueue&) = delete#
NullPrePostTrtEngEnqueue &operator=(
const NullPrePostTrtEngEnqueue&,
) = delete#
NullPrePostTrtEngEnqueue(NullPrePostTrtEngEnqueue&&) = delete#
NullPrePostTrtEngEnqueue &operator=(
NullPrePostTrtEngEnqueue&&,
) = delete#
inline virtual utils::NvErrc pre_enqueue(
cudaStream_t cu_stream,
) override#

Pre-enqueue hook (no-op for null implementation)

This method does nothing and immediately returns success. No graph capture or stream operations are performed.

Parameters:

cu_stream[in] CUDA stream (unused)

Returns:

utils::NvErrc::Success Always succeeds

inline virtual utils::NvErrc post_enqueue(
cudaStream_t cu_stream,
) override#

Post-enqueue hook (no-op for null implementation)

This method does nothing and immediately returns success. No graph capture or stream operations are performed.

Parameters:

cu_stream[in] CUDA stream (unused)

Returns:

utils::NvErrc::Success Always succeeds

class NullTrtEngine : public framework::tensorrt::ITrtEngine#
#include <trt_engine_interface.hpp>

Null object implementation of ITrtEngine.

Provides a default implementation that returns appropriate error codes for all TensorRT engine operations.

Public Functions

inline virtual utils::NvErrc set_input_shape(
const std::string_view tensor_name,
const nvinfer1::Dims &dims,
) final#

Set the shape of an input tensor.

Parameters:
  • tensor_name[in] Name of the input tensor

  • dims[in] Dimensions to set for the tensor

Returns:

utils::NvErrc::NotSupported

inline virtual utils::NvErrc set_tensor_address(
const std::string_view tensor_name,
void *address,
) final#

Set the memory address for a tensor.

Parameters:
  • tensor_name[in] Name of the tensor

  • address[in] Memory address to associate with the tensor

Returns:

utils::NvErrc::NotSupported

inline virtual utils::NvErrc enqueue_inference(
cudaStream_t cu_stream,
) final#

Execute inference asynchronously.

Parameters:

cu_stream[in] CUDA stream for asynchronous execution

Returns:

utils::NvErrc::NotSupported

inline virtual bool all_input_dimensions_specified() const final#

Check if all input dimensions have been specified.

Returns:

false (null implementation)

class TrtEngine : public framework::tensorrt::ITrtEngine#
#include <trt_engine.hpp>

Concrete implementation of ITrtEngine using NVIDIA TensorRT.

This class provides the actual TensorRT implementation for engine operations, wrapping the TensorRT IRuntime, ICudaEngine, and IExecutionContext components.

Public Functions

TrtEngine(
std::span<const std::byte> engine_data,
nvinfer1::ILogger &logger,
)#

Initialize TensorRT engine from serialized engine data.

Parameters:
  • engine_data[in] Span containing serialized engine binary data

  • logger[in] TensorRT logger instance for engine messages

Throws:

std::runtime_error – on initialization failure

TrtEngine(
const std::filesystem::path &engine_file_path,
nvinfer1::ILogger &logger,
)#

Initialize TensorRT engine from engine file.

Parameters:
  • engine_file_path[in] Path to the serialized engine file

  • logger[in] TensorRT logger instance for engine messages

Throws:

std::runtime_error – on file read or initialization failure

~TrtEngine() final = default#
TrtEngine(const TrtEngine &engine) = delete#
TrtEngine &operator=(const TrtEngine &engine) = delete#
TrtEngine(TrtEngine &&engine) = delete#
TrtEngine &operator=(TrtEngine &&engine) = delete#
virtual utils::NvErrc set_input_shape(
const std::string_view tensor_name,
const nvinfer1::Dims &dims,
) final#

Set the shape of an input tensor.

Parameters:
  • tensor_name[in] Name of the input tensor

  • dims[in] Dimensions to set for the tensor

Returns:

utils::NvErrc::Success on success, error code on failure

virtual utils::NvErrc set_tensor_address(
const std::string_view tensor_name,
void *address,
) final#

Set the memory address for a tensor.

Parameters:
  • tensor_name[in] Name of the tensor

  • address[in] Memory address to associate with the tensor

Returns:

utils::NvErrc::Success on success, error code on failure

virtual utils::NvErrc enqueue_inference(cudaStream_t cu_stream) final#

Execute inference asynchronously.

Parameters:

cu_stream[in] CUDA stream for asynchronous execution

Returns:

utils::NvErrc::Success on success, error code on failure

virtual bool all_input_dimensions_specified() const final#

Check if all input dimensions have been specified.

Returns:

true if all input dimensions are specified, false otherwise

class TrtLogger : public nvinfer1::ILogger#
#include <trt_engine_logger.hpp>

Logger implementation for TensorRT engine.

Concrete implementation of nvinfer1::ILogger interface required by TensorRT. Handles logging of TensorRT runtime messages with configurable severity levels.