TensorRT#
Integration with NVIDIA TensorRT for GPU-accelerated execution.
Overview#
The TensorRT library provides a simplified interface for integrating TensorRT engines into real-time applications.
Key Features#
Flexible Tensor Metadata - User-provided dimensions and strides
Automatic Stride Computation - Automatic layout computation for tensors
CUDA Graph Support - Pre/post enqueue hooks for graph capture
Engine Abstraction - Interface-based design
Core Concepts#
MLIR Tensor Parameters#
MLIRTensorParams defines the metadata for tensors used by the TensorRT engine. Each tensor requires:
name: Tensor identifier matching the TensorRT engine
data_type: Element data type (e.g., tensor::TensorR32F for float32)
rank: Number of dimensions (0 for scalar, 1-8 for tensors)
dims: Size of each dimension
strides: Optional memory layout (auto-computed if not provided)
When strides are not provided (last stride == 0), row-major strides are automatically computed from dimensions.
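The auto-stride rule above can be sketched as a small standalone function (an illustrative sketch only; `compute_row_major_strides` and `kMaxRank` are hypothetical names, not part of the library API):

```cpp
#include <array>
#include <cstddef>

// Illustrative maximum rank, mirroring MLIRTensorParams::MAX_TENSOR_RANK.
constexpr std::size_t kMaxRank = 8;

// Row-major strides: the innermost stride is 1, and each outer stride is
// the product of all dimension sizes to its right.
std::array<std::size_t, kMaxRank> compute_row_major_strides(
    const std::array<std::size_t, kMaxRank> &dims, const std::size_t rank) {
  std::array<std::size_t, kMaxRank> strides{};
  std::size_t running = 1;
  for (std::size_t i = rank; i > 0; --i) {
    strides[i - 1] = running;
    running *= dims[i - 1];
  }
  return strides;
}
```

For example, dims {4, 8, 16} yields strides {128, 16, 1}, matching the explicit-stride example shown below.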
Creating Tensor Parameters#
// Define tensor parameters with name, data type, rank, and dimensions
MLIRTensorParams input_params{
.name = "input_data", .data_type = tensor::TensorR32F, .rank = 2, .dims = {128, 256}};
// Access tensor properties
const auto rank = input_params.rank;
const auto batch_size = input_params.dims[0];
const auto feature_size = input_params.dims[1];
Tensor Parameters with Strides#
// Define tensor parameters with explicit strides
MLIRTensorParams params{
.name = "data",
.data_type = tensor::TensorR32F,
.rank = 3,
.dims = {4, 8, 16},
.strides = {128, 16, 1}};
// Verify stride configuration
const auto outer_stride = params.strides[0];
const auto middle_stride = params.strides[1];
const auto inner_stride = params.strides[2];
MLIR TensorRT Engine#
MLIRTrtEngine provides a streamlined TensorRT interface that:
Eliminates batch size management (users handle batching externally)
Removes internal buffer allocation (users provide pre-allocated CUDA buffers)
Uses constructor-based initialization (no separate init() phase)
Accepts tensor dimensions and strides directly in MLIRTensorParams
The engine operates in three phases: construction, setup, and execution.
Engine Construction#
// Define input and output tensor parameters
std::vector<MLIRTensorParams> input_params = {
{.name = "input", .data_type = tensor::TensorR32F, .rank = 1, .dims = {1024}}};
std::vector<MLIRTensorParams> output_params = {
{.name = "output", .data_type = tensor::TensorR32F, .rank = 1, .dims = {1024}}};
// Create TensorRT runtime (mock for documentation)
auto runtime = std::make_unique<MockTrtEngine>();
// Construct MLIR TensorRT engine
MLIRTrtEngine engine(
std::move(input_params), std::move(output_params), std::move(runtime));
The engine is fully initialized in the constructor. All tensor shapes must be provided during construction.
Engine Setup#
// Prepare buffer pointers (CUDA device memory)
const std::vector<void *> input_buffers = {mock_ptr(0x1000)};
const std::vector<void *> output_buffers = {mock_ptr(0x2000)};
// Setup engine with buffer addresses
const auto setup_result = engine.setup(input_buffers, output_buffers);
Setup caches the provided buffer pointers for use during execution. Buffers are direct pointers to CUDA device memory and must remain valid for the lifetime of execution operations.
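Since setup() expects exactly one buffer pointer per tensor parameter, the precondition can be sketched as a standalone check (illustrative only; `buffer_counts_match` is a hypothetical helper, not a library function):

```cpp
#include <cstddef>
#include <vector>

// Illustrative precondition: the number of buffer pointers handed to
// setup() must match the number of tensor parameters given at construction.
bool buffer_counts_match(const std::size_t num_input_params,
                         const std::size_t num_output_params,
                         const std::vector<void *> &input_buffers,
                         const std::vector<void *> &output_buffers) {
  return input_buffers.size() == num_input_params &&
         output_buffers.size() == num_output_params;
}
```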
Engine Execution#
// Create CUDA stream for execution
cudaStream_t stream = mock_stream(0x100);
// Execute the engine
const auto result = engine.run(stream);
The run() method executes asynchronously on the provided CUDA stream.
TensorRT Engine Interface#
ITrtEngine is an abstract interface for TensorRT operations. The concrete TrtEngine implementation wraps the NVIDIA TensorRT runtime, while NullTrtEngine provides a null-object pattern for testing.
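The null-object split can be sketched in miniature (names such as `IEngine`, `NullEngine`, and `ErrorCode` are illustrative stand-ins, not the real framework types):

```cpp
#include <memory>

// Illustrative error code, standing in for utils::NvErrc.
enum class ErrorCode { Success, NotImplemented };

// Abstract interface, standing in for ITrtEngine.
struct IEngine {
  virtual ~IEngine() = default;
  virtual ErrorCode enqueue_inference() = 0;
};

// Null object: a safe default that reports an error code
// instead of performing real work, useful in tests.
struct NullEngine final : IEngine {
  ErrorCode enqueue_inference() override { return ErrorCode::NotImplemented; }
};
```

Callers hold a `std::unique_ptr<IEngine>` and never need to branch on whether a real engine is present; the null implementation simply returns an error code.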
Multi-Rank Tensors#
The library supports tensors with ranks 0 through 8:
// Define tensors with different ranks
const MLIRTensorParams scalar{.name = "scalar", .data_type = tensor::TensorR32F, .rank = 0};
const MLIRTensorParams vector{
.name = "vector", .data_type = tensor::TensorR32F, .rank = 1, .dims = {256}};
const MLIRTensorParams matrix{
.name = "matrix", .data_type = tensor::TensorR32F, .rank = 2, .dims = {32, 64}};
const MLIRTensorParams tensor_3d{
.name = "tensor_3d", .data_type = tensor::TensorR32F, .rank = 3, .dims = {16, 32, 64}};
// Access rank information
const auto vec_rank = vector.rank;
const auto mat_rank = matrix.rank;
Complete Example#
This example demonstrates the full workflow from tensor definition through execution:
// Step 1: Define tensor parameters
std::vector<MLIRTensorParams> inputs = {
{.name = "input0", .data_type = tensor::TensorR32F, .rank = 2, .dims = {32, 128}},
{.name = "input1", .data_type = tensor::TensorR32F, .rank = 2, .dims = {32, 128}}};
std::vector<MLIRTensorParams> outputs = {
{.name = "result", .data_type = tensor::TensorR32F, .rank = 2, .dims = {32, 128}}};
// Step 2: Create engine with TensorRT runtime
auto trt_runtime = std::make_unique<MockTrtEngine>();
MLIRTrtEngine engine(std::move(inputs), std::move(outputs), std::move(trt_runtime));
// Step 3: Setup buffers
const std::vector<void *> input_addrs = {mock_ptr(0x1000), mock_ptr(0x2000)};
const std::vector<void *> output_addrs = {mock_ptr(0x3000)};
const auto setup_err = engine.setup(input_addrs, output_addrs);
// Step 4: Run the engine
cudaStream_t cu_stream = mock_stream(0x100);
const auto run_err = engine.run(cu_stream);
Additional Examples#
For more examples, see:
framework/tensorrt/tests/tensorrt_sample_tests.cpp - Documentation examples and unit tests
API Reference#
-
class CaptureStreamPrePostTrtEngEnqueue : public framework::tensorrt::IPrePostTrtEngEnqueue#
- #include <trt_pre_post_enqueue_stream_cap.hpp>
Stream-capture implementation of the IPrePostTrtEngEnqueue interface; records work enqueued on the stream into a CUDA graph.
Public Functions
-
CaptureStreamPrePostTrtEngEnqueue() = default#
Default constructor
-
~CaptureStreamPrePostTrtEngEnqueue() final#
Destructor; destroys the captured CUDA graph.
- CaptureStreamPrePostTrtEngEnqueue(const CaptureStreamPrePostTrtEngEnqueue &) = delete#
- CaptureStreamPrePostTrtEngEnqueue &operator=(const CaptureStreamPrePostTrtEngEnqueue &) = delete#
- CaptureStreamPrePostTrtEngEnqueue(CaptureStreamPrePostTrtEngEnqueue &&) = delete#
- CaptureStreamPrePostTrtEngEnqueue &operator=(CaptureStreamPrePostTrtEngEnqueue &&) = delete#
-
virtual utils::NvErrc pre_enqueue(cudaStream_t cu_stream) final#
Start CUDA stream capture before enqueue_v3()
- Parameters:
cu_stream – stream to use
- Returns:
utils::NvErrc for SUCCESS or any failure
-
virtual utils::NvErrc post_enqueue(cudaStream_t cu_stream) final#
Post Enqueue activity after calling enqueue_v3()
- Parameters:
cu_stream – stream to use
- Returns:
utils::NvErrc SUCCESS or error
-
inline CUgraph get_graph() const#
Get the captured CUDA graph
- Returns:
Pointer to the captured CUDA graph, or nullptr if no graph is captured
-
class IPrePostTrtEngEnqueue#
- #include <trt_engine_interfaces.hpp>
Defines an interface for pre/post hooks around enqueue_v3() in TrtEngine.
Subclassed by framework::pipeline::tests::TestPrePostTrtEngEnqueue, framework::tensorrt::CaptureStreamPrePostTrtEngEnqueue, framework::tensorrt::NullPrePostTrtEngEnqueue
Public Functions
-
IPrePostTrtEngEnqueue() = default#
-
virtual ~IPrePostTrtEngEnqueue() = default#
- IPrePostTrtEngEnqueue(const IPrePostTrtEngEnqueue &pre_post_trt_eng_enqueue) = default#
Copy constructor
- Parameters:
pre_post_trt_eng_enqueue – [in] Source object to copy from
- IPrePostTrtEngEnqueue &operator=(const IPrePostTrtEngEnqueue &pre_post_trt_eng_enqueue) = default#
Copy assignment operator
- Parameters:
pre_post_trt_eng_enqueue – [in] Source object to copy from
- Returns:
Reference to this object
- IPrePostTrtEngEnqueue(IPrePostTrtEngEnqueue &&pre_post_trt_eng_enqueue) = default#
Move constructor
- Parameters:
pre_post_trt_eng_enqueue – [in] Source object to move from
- IPrePostTrtEngEnqueue &operator=(IPrePostTrtEngEnqueue &&pre_post_trt_eng_enqueue) = default#
Move assignment operator
- Parameters:
pre_post_trt_eng_enqueue – [in] Source object to move from
- Returns:
Reference to this object
-
virtual utils::NvErrc pre_enqueue(cudaStream_t cu_stream) = 0#
Pre Enqueue activity before calling enqueue_v3()
- Parameters:
cu_stream – stream to use
- Returns:
utils::NvErrc SUCCESS or error
-
virtual utils::NvErrc post_enqueue(cudaStream_t cu_stream) = 0#
Post Enqueue activity after calling enqueue_v3()
- Parameters:
cu_stream – stream to use
- Returns:
utils::NvErrc SUCCESS or error
-
class ITrtEngine#
- #include <trt_engine_interface.hpp>
Abstract interface for TensorRT engine operations.
This interface abstracts the TensorRT components (IRuntime, ICudaEngine, IExecutionContext) into a unified API for engine initialization, configuration, and execution.
Subclassed by framework::pipeline::tests::TestTrtEngine, framework::tensorrt::NullTrtEngine, framework::tensorrt::TrtEngine
Public Functions
-
ITrtEngine() = default#
-
virtual ~ITrtEngine() = default#
-
ITrtEngine(const ITrtEngine &engine) = default#
Copy constructor
- Parameters:
engine – [in] Source object to copy from
-
ITrtEngine &operator=(const ITrtEngine &engine) = default#
Copy assignment operator
- Parameters:
engine – [in] Source object to copy from
- Returns:
Reference to this object
-
ITrtEngine(ITrtEngine &&engine) = default#
Move constructor
- Parameters:
engine – [in] Source object to move from
-
ITrtEngine &operator=(ITrtEngine &&engine) = default#
Move assignment operator
- Parameters:
engine – [in] Source object to move from
- Returns:
Reference to this object
- virtual utils::NvErrc set_input_shape(const std::string_view tensor_name, const nvinfer1::Dims &dims) = 0#
Set the shape of an input tensor.
- Parameters:
tensor_name – [in] Name of the input tensor
dims – [in] Dimensions to set for the tensor
- Returns:
utils::NvErrc::Success on success, error code on failure
- virtual utils::NvErrc set_tensor_address(const std::string_view tensor_name, void *address) = 0#
Set the memory address for a tensor.
- Parameters:
tensor_name – [in] Name of the tensor
address – [in] Memory address to associate with the tensor
- Returns:
utils::NvErrc::Success on success, error code on failure
-
virtual utils::NvErrc enqueue_inference(cudaStream_t cu_stream) = 0#
Execute inference asynchronously.
- Parameters:
cu_stream – [in] CUDA stream for asynchronous execution
- Returns:
utils::NvErrc::Success on success, error code on failure
-
virtual bool all_input_dimensions_specified() const = 0#
Check if all input dimensions have been specified.
- Returns:
true if all input dimensions are specified, false otherwise
-
struct MLIRTensorParams#
- #include <trt_engine_params.hpp>
Tensor parameters for MLIR-TensorRT engines.
This structure provides tensor parameter representation for MLIR-TensorRT engines where tensor shapes and strides are provided by the user during initialization.
The user must provide the rank and dimensions. Strides are optional:
If strides are provided (last stride != 0), they are used as-is
If strides are not provided (last stride == 0), row-major strides are automatically computed from dimensions
See also
tensor::NvDataType for supported data type enumeration
Note
Maximum rank is 8, minimum rank is 0 (scalar), matching MLIR-TensorRT implementation limits
Public Functions
- inline constexpr void set_rank(const std::size_t n)#
Set the number of dimensions for this tensor (rank)
- Parameters:
n – [in] Number of dimensions (must be <= MAX_TENSOR_RANK)
Public Members
-
std::string name#
Tensor name identifier.
-
tensor::NvDataType data_type = {}#
Data type of tensor elements.
-
std::size_t rank = {}#
Number of dimensions (0 for scalar, 1-8 for tensors)
-
std::array<std::size_t, MAX_TENSOR_RANK> dims = {}#
Tensor dimensions (first rank elements valid)
-
std::array<std::size_t, MAX_TENSOR_RANK> strides = {}#
Tensor strides (first rank elements valid, auto-computed if not provided)
Public Static Attributes
-
static constexpr std::size_t MAX_TENSOR_RANK = 8#
Maximum supported tensor rank.
-
class MLIRTrtEngine#
- #include <mlir_trt_engine.hpp>
Simplified TensorRT engine that mimics MLIR-TensorRT runtime patterns.
This class provides a streamlined TensorRT engine implementation that:
Eliminates batch size management (users handle batching externally)
Removes internal buffer allocation (users provide pre-allocated CUDA buffers)
Uses constructor-based initialization (no separate init() phase)
Accepts tensor dimensions and strides directly in MLIRTensorParams
Supports user-provided tensor names from TrtParams
The engine uses tensor metadata (dims/strides) provided by users in MLIRTensorParams and directly interfaces with TensorRT APIs.
Public Functions
- MLIRTrtEngine(
- std::vector<MLIRTensorParams> input_tensor_prms,
- std::vector<MLIRTensorParams> output_tensor_prms,
- std::unique_ptr<ITrtEngine> tensorrt_runtime,
- std::unique_ptr<IPrePostTrtEngEnqueue> pre_post_trt_eng_enqueue = nullptr)#
Construct MLIRTrtEngine with full initialization.
All initialization is performed in the constructor, eliminating the need for a separate init() method. The TensorRT runtime must be pre-initialized and provided by the caller. Tensor shapes (dims/strides) must be provided in the MLIRTensorParams. If strides are not provided (last stride == 0), row-major strides are automatically computed.
- Parameters:
input_tensor_prms – [in] Input tensor parameters (name, data_type, rank, dims, optional strides)
output_tensor_prms – [in] Output tensor parameters (name, data_type, rank, dims, optional strides)
tensorrt_runtime – [in] Pre-initialized TensorRT runtime (required)
pre_post_trt_eng_enqueue – [in] Optional pre/post enqueue operations (e.g., CUDA graph capture)
- Throws:
std::invalid_argument – if tensorrt_runtime is nullptr, or if rank is invalid (> 8)
std::runtime_error – on initialization failure
-
~MLIRTrtEngine() = default#
-
MLIRTrtEngine(const MLIRTrtEngine &engine) = delete#
-
MLIRTrtEngine &operator=(const MLIRTrtEngine &engine) = delete#
-
MLIRTrtEngine(MLIRTrtEngine &&engine) = delete#
-
MLIRTrtEngine &operator=(MLIRTrtEngine &&engine) = delete#
-
utils::NvErrc warmup(cudaStream_t cu_stream)#
Perform warmup inference to allocate TensorRT resources.
Runs the TensorRT engine once to ensure all internal resources are allocated and avoid first-run latency. Extracts tensor shapes from buffer descriptors and executes a single inference pass.
Note
Requires setup() to be called first to provide buffer pointers for metadata extraction.
- Parameters:
cu_stream – [in] CUDA stream for warmup operations
- Returns:
utils::NvErrc::Success on success, error code on failure
- utils::NvErrc setup(const std::vector<void*> &input_buffers, const std::vector<void*> &output_buffers)#
Setup input and output buffer addresses.
Caches the provided buffer pointers for use during inference. Buffers are direct pointers to CUDA memory (data pointers), not descriptor pointers. Buffers must remain valid for the lifetime of inference operations. No batch size parameter is needed as batching is handled externally.
- Parameters:
input_buffers – [in] Vector of input data buffer pointers (must match input_tensor_prms size)
output_buffers – [in] Vector of output data buffer pointers (must match output_tensor_prms size)
- Returns:
utils::NvErrc::Success on success, error code on failure
-
utils::NvErrc run(cudaStream_t cu_stream) const#
Execute inference on the configured tensors.
Performs the complete inference pipeline:
Set tensor addresses in TensorRT using user-provided buffer pointers
Set input shapes in TensorRT using dims from MLIRTensorParams
Execute pre-enqueue operations (e.g., CUDA graph capture start)
Run TensorRT inference
Execute post-enqueue operations (e.g., CUDA graph capture end)
- Parameters:
cu_stream – [in] CUDA stream for execution
- Returns:
utils::NvErrc::Success on success, error code on failure
-
class NullPrePostTrtEngEnqueue : public framework::tensorrt::IPrePostTrtEngEnqueue#
- #include <trt_null_pre_post_enqueue.hpp>
Null/No-op implementation of IPrePostTrtEngEnqueue
This class provides a null object pattern implementation for scenarios where CUDA graph capture is not needed during TensorRT engine warmup.
Use Cases:
Pure stream-mode pipelines with no graph execution requirements
Unit tests that only exercise stream-based execution paths
Performance-critical scenarios where graph capture overhead must be avoided
Design Tradeoff:
Eliminates graph capture overhead during warmup (~milliseconds one-time cost)
Cannot support graph-based execution mode (execute_graph will fail)
Reduces memory footprint (no captured graph stored)
Example Usage:
// For stream-only execution
auto null_capturer = std::make_unique<NullPrePostTrtEngEnqueue>();
auto trt_engine = std::make_unique<MLIRTrtEngine>(
    inputs, outputs, std::move(tensorrt_runtime),
    std::move(null_capturer)); // No graph capture
// Warmup loads engine and runs once, but doesn't capture graph
trt_engine->warmup(stream);
// Stream execution works normally
trt_engine->run(stream); // OK
// Graph execution would fail (no captured graph available)
// graph_capturer->get_graph(); // Would throw/fail
See also
CaptureStreamPrePostTrtEngEnqueue for graph-mode capture
See also
IPrePostTrtEngEnqueue for interface documentation
Public Functions
-
NullPrePostTrtEngEnqueue() = default#
Default constructor
-
~NullPrePostTrtEngEnqueue() override = default#
Destructor
-
NullPrePostTrtEngEnqueue(const NullPrePostTrtEngEnqueue&) = delete#
- NullPrePostTrtEngEnqueue &operator=(const NullPrePostTrtEngEnqueue&) = delete#
-
NullPrePostTrtEngEnqueue(NullPrePostTrtEngEnqueue&&) = delete#
- NullPrePostTrtEngEnqueue &operator=(NullPrePostTrtEngEnqueue&&) = delete#
- inline virtual utils::NvErrc pre_enqueue(cudaStream_t cu_stream) final#
Pre-enqueue hook (no-op for null implementation)
This method does nothing and immediately returns success. No graph capture or stream operations are performed.
- Parameters:
cu_stream – [in] CUDA stream (unused)
- Returns:
utils::NvErrc::Success Always succeeds
- inline virtual utils::NvErrc post_enqueue(cudaStream_t cu_stream) final#
Post-enqueue hook (no-op for null implementation)
This method does nothing and immediately returns success. No graph capture or stream operations are performed.
- Parameters:
cu_stream – [in] CUDA stream (unused)
- Returns:
utils::NvErrc::Success Always succeeds
-
class NullTrtEngine : public framework::tensorrt::ITrtEngine#
- #include <trt_engine_interface.hpp>
Null object implementation of ITrtEngine.
Provides a default implementation that returns appropriate error codes for all TensorRT engine operations.
Public Functions
- inline virtual utils::NvErrc set_input_shape(const std::string_view tensor_name, const nvinfer1::Dims &dims) final#
Set the shape of an input tensor.
- Parameters:
tensor_name – [in] Name of the input tensor
dims – [in] Dimensions to set for the tensor
- Returns:
utils::NvErrc error code (null implementation)
- inline virtual utils::NvErrc set_tensor_address(const std::string_view tensor_name, void *address) final#
Set the memory address for a tensor.
- Parameters:
tensor_name – [in] Name of the tensor
address – [in] Memory address to associate with the tensor
- Returns:
utils::NvErrc error code (null implementation)
- inline virtual utils::NvErrc enqueue_inference(cudaStream_t cu_stream) final#
Execute inference asynchronously.
- Parameters:
cu_stream – [in] CUDA stream for asynchronous execution
- Returns:
utils::NvErrc error code (null implementation)
-
inline virtual bool all_input_dimensions_specified() const final#
Check if all input dimensions have been specified.
- Returns:
false (null implementation)
-
class TrtEngine : public framework::tensorrt::ITrtEngine#
- #include <trt_engine.hpp>
Concrete implementation of ITrtEngine using NVIDIA TensorRT.
This class provides the actual TensorRT implementation for engine operations, wrapping the TensorRT IRuntime, ICudaEngine, and IExecutionContext components.
Public Functions
- TrtEngine(std::span<const std::byte> engine_data, nvinfer1::ILogger &logger)#
Initialize TensorRT engine from serialized engine data.
- Parameters:
engine_data – [in] Span containing serialized engine binary data
logger – [in] TensorRT logger instance for engine messages
- Throws:
std::runtime_error – on initialization failure
- TrtEngine(const std::filesystem::path &engine_file_path, nvinfer1::ILogger &logger)#
Initialize TensorRT engine from engine file.
- Parameters:
engine_file_path – [in] Path to the serialized engine file
logger – [in] TensorRT logger instance for engine messages
- Throws:
std::runtime_error – on file read or initialization failure
-
~TrtEngine() final = default#
- virtual utils::NvErrc set_input_shape(const std::string_view tensor_name, const nvinfer1::Dims &dims) final#
Set the shape of an input tensor.
- Parameters:
tensor_name – [in] Name of the input tensor
dims – [in] Dimensions to set for the tensor
- Returns:
utils::NvErrc::Success on success, error code on failure
- virtual utils::NvErrc set_tensor_address(const std::string_view tensor_name, void *address) final#
Set the memory address for a tensor.
- Parameters:
tensor_name – [in] Name of the tensor
address – [in] Memory address to associate with the tensor
- Returns:
utils::NvErrc::Success on success, error code on failure
-
virtual utils::NvErrc enqueue_inference(cudaStream_t cu_stream) final#
Execute inference asynchronously.
- Parameters:
cu_stream – [in] CUDA stream for asynchronous execution
- Returns:
utils::NvErrc::Success on success, error code on failure
-
virtual bool all_input_dimensions_specified() const final#
Check if all input dimensions have been specified.
- Returns:
true if all input dimensions are specified, false otherwise
-
class TrtLogger : public nvinfer1::ILogger#
- #include <trt_engine_logger.hpp>
Logger implementation for TensorRT engine.
Concrete implementation of nvinfer1::ILogger interface required by TensorRT. Handles logging of TensorRT runtime messages with configurable severity levels.