Pipeline#

Modular data processing framework for GPU-accelerated computational pipelines.

Overview#

The Pipeline library provides a high-performance framework for building modular, GPU-accelerated processing pipelines. It enables composing complex data transformations from reusable modules with efficient CUDA stream management and optional CUDA graph execution for minimal latency.

Key Features#

  • Modular Architecture: Compose pipelines from reusable processing modules (IModule)

  • Flexible Execution: Support for both stream-based and CUDA graph execution modes

  • Factory Pattern: Configuration-driven pipeline and module creation via PipelineSpec

  • Memory Management: Unified memory allocation with PipelineMemoryManager

  • Zero-Copy Optimization: Direct data flow between modules without intermediate copies

  • Module Routing: Automatic data routing between modules based on port connections

  • TensorRT Integration: First-class support for TensorRT engines

  • CUDA Graph Capture: Automatic CUDA graph construction for low-latency execution

Quick Start#

Creating a Module#

// Create a module with configuration parameters
const SimpleModule::StaticParams params{.tensor_size = 2048};
auto module = std::make_unique<SimpleModule>("my_module", params);

// Query module properties
const auto type_id = module->get_type_id();
const auto instance_id = module->get_instance_id();

Querying Module Ports#

// Query input and output ports
const auto input_ports = module->get_input_port_names();
const auto output_ports = module->get_output_port_names();

// Get tensor information for a specific port
const auto input_info = module->get_input_tensor_info("input");
const auto output_info = module->get_output_tensor_info("output");

Using a Module Factory#

// Create a module factory
auto factory = std::make_unique<SimpleModuleFactory>();

// Check if a module type is supported
const bool supported = factory->supports_module_type("simple_module");

// Create a module using the factory
const SimpleModule::StaticParams params{.tensor_size = 512};
auto module = factory->create_module("simple_module", "factory_module", std::any(params));

Core Concepts#

Pipeline Architecture#

A Pipeline orchestrates the execution of multiple Modules connected through named Ports. Data flows through the pipeline according to defined connections, with modules processing data on GPU streams.

External Input → Module A → Module B → Module C → External Output
                   ↓           ↓           ↓
               Port Connections (defined in PipelineSpec)

Key abstractions:

  • IPipeline: Coordinates module execution and manages data flow

  • IModule: Individual processing unit with defined inputs/outputs

  • PortInfo: Describes tensor data at module inputs/outputs

  • PipelineSpec: Configuration structure for pipeline creation

Modules#

A Module (IModule) is the fundamental processing unit in a pipeline. Each module:

  • Has named input and output ports

  • Implements one of several execution interfaces:

    • IStreamExecutor: Direct CUDA stream execution

    • IGraphNodeProvider: CUDA graph node provider for graph mode

    • IAllocationInfoProvider: Memory requirements for allocation

  • Receives its memory allocation from PipelineMemoryManager

  • Processes data independently without knowledge of other modules

Modules are created via IModuleFactory and configured with static parameters at construction time. Dynamic parameters (per-iteration data) are provided via configure_io().

Pipeline Specification#

A PipelineSpec defines the complete pipeline configuration including modules, connections, and execution mode.

Basic Pipeline Specification#

// Create a pipeline specification
pipeline::PipelineSpec spec;
spec.pipeline_name = "MyPipeline";
spec.execution_mode = pipeline::ExecutionMode::Stream;

// Define module configuration
const SimpleModule::StaticParams module_params{.tensor_size = 1024};

const pipeline::ModuleSpec module_spec(pipeline::ModuleCreationInfo{
        .module_type = "simple_module",
        .instance_id = "module_1",
        .init_params = std::any(module_params)});

spec.modules.push_back(module_spec);

// Define external I/O
spec.external_inputs = {"input"};
spec.external_outputs = {"output"};

Pipeline with Module Connections#

pipeline::PipelineSpec spec;
spec.pipeline_name = "TwoModulePipeline";

// Add two modules
const SimpleModule::StaticParams params{.tensor_size = 1024};

spec.modules.emplace_back(pipeline::ModuleCreationInfo{
        .module_type = "simple_module",
        .instance_id = "module_a",
        .init_params = std::any(params)});

spec.modules.emplace_back(pipeline::ModuleCreationInfo{
        .module_type = "simple_module",
        .instance_id = "module_b",
        .init_params = std::any(params)});

// Connect module_a output to module_b input
const pipeline::PortConnection connection{
        .source_module = "module_a",
        .source_port = "output",
        .target_module = "module_b",
        .target_port = "input"};

spec.connections.push_back(connection);

Execution Modes#

The pipeline supports two execution modes configured via ExecutionMode:

// Stream mode - flexible addressing, suitable for development
pipeline::PipelineSpec stream_spec;
stream_spec.execution_mode = pipeline::ExecutionMode::Stream;

// Graph mode - fixed addressing, optimal for production
pipeline::PipelineSpec graph_spec;
graph_spec.execution_mode = pipeline::ExecutionMode::Graph;

Stream Mode (ExecutionMode::Stream):

  • Sequential module execution on a CUDA stream

  • Flexible addressing - modules accept different tensor addresses per iteration

  • Supports dynamic topology changes

  • Suitable for development and debugging

Graph Mode (ExecutionMode::Graph):

  • Pre-built CUDA graph executed as a single unit

  • Fixed addressing - tensor addresses captured during warmup

  • Lower latency through single graph launch

  • Requires stable tensor addresses before graph build

  • Suitable for production deployments

Pipeline Lifecycle#

A typical pipeline follows this lifecycle:

  1. Construction: Create pipeline and modules via factory

  2. Setup: Allocate memory and initialize modules (setup())

  3. I/O Configuration: Establish connections and set inputs (configure_io())

  4. Warmup: One-time initialization - load models, capture graphs (warmup())

  5. Graph Build (graph mode only): Build CUDA graph (build_graph() or automatic)

  6. Execution: Process data (execute_stream() or execute_graph())

  7. Iteration: Repeat steps 3 and 6 with new data

Warmup Phase:

The warmup() method performs expensive one-time initialization:

  • Loading models to device memory (TensorRT engines)

  • Initializing module-specific resources

This is called once after the first configure_io() and before execution. For TensorRT modules, graph capture requires a non-default CUDA stream.

Data Flow#

Port Information#

// Allocate device memory for a tensor
const std::size_t tensor_size = 1024;
void *device_ptr{};
FRAMEWORK_CUDA_RUNTIME_CHECK_THROW(cudaMalloc(&device_ptr, tensor_size * sizeof(float)));

// Create tensor info describing the data
const tensor::TensorInfo tensor_info{tensor::TensorInfo::DataType::TensorR32F, {tensor_size}};

// Create device tensor wrapper
const pipeline::DeviceTensor device_tensor{
        .device_ptr = device_ptr, .tensor_info = tensor_info};

// Create port info for external input
pipeline::PortInfo port_info{.name = "input0", .tensors = {device_tensor}};

PortInfo describes tensor data at module ports:

  • name: Port identifier (e.g., “input0”, “output”)

  • tensors: Vector of device tensors with metadata

Data flow sequence per iteration:

  1. External inputs provided to pipeline via PortInfo

  2. configure_io() calls set_inputs() on first module

  3. Module processes and provides outputs via get_outputs()

  4. Router passes outputs to next module’s inputs

  5. Process repeats through all modules

  6. Final outputs mapped to external outputs

DynamicParams can be passed to configure_io() to provide per-iteration configuration to modules. Modules receive these parameters and can use them to update their per-iteration state.

Memory Management#

Connection Copy Modes#

// Configure zero-copy mode for a connection
const auto module_params = SimpleModule::StaticParams{.tensor_size = 1024};
auto module = std::make_unique<SimpleModule>("my_module", module_params);

// Set connection to use zero-copy (if supported)
module->set_connection_copy_mode("input", pipeline::ConnectionCopyMode::ZeroCopy);

// Or configure to always copy data
module->set_connection_copy_mode("input", pipeline::ConnectionCopyMode::Copy);

ConnectionCopyMode controls data transfer between modules:

  • Copy: Allocate buffer and copy data via cudaMemcpy

  • ZeroCopy: Use upstream pointer directly (no copy)

Memory Characteristics#

// Query input memory characteristics
const auto input_chars = module->get_input_memory_characteristics("input");

// Query output memory characteristics
const auto output_chars = module->get_output_memory_characteristics("output");

// Check if zero-copy is possible
const bool can_use_zero_copy = pipeline::can_zero_copy(output_chars, input_chars);

Modules declare their memory capabilities:

  • provides_fixed_address_for_zero_copy: Whether output addresses are stable

  • requires_fixed_address_for_zero_copy: Whether inputs need fixed addresses

The helper function can_zero_copy() determines if zero-copy is possible for a connection based on upstream and downstream characteristics.

Zero-Copy Optimization#

The pipeline supports zero-copy data flow where downstream modules use upstream pointers directly, eliminating intermediate cudaMemcpy operations.

TensorRT Modules:

Zero-copy requires upstream modules to provide fixed/stable device addresses. TensorRT fuses I/O addresses during graph capture, so addresses must be known ahead of time.

Non-TensorRT Modules:

Zero-copy is supported in both Graph and Stream modes. Device pointers can be updated per iteration via configure_io(), enabling flexible addressing. A copy into the downstream module’s input buffer is needed only when upstream pointers change and the downstream module requires a fixed address.

General Rule:

Zero-copy is possible when the upstream module provides a fixed address OR the downstream module accepts dynamic addresses; a copy is required only when a dynamic upstream feeds a downstream that requires a fixed address.

Benefits:

  • Reduced latency (eliminates copy overhead)

  • Lower memory usage (no duplicate buffers)

  • Better throughput (less memory bandwidth consumption)

Factory Pattern#

The pipeline library uses the factory pattern for flexible creation:

Module Factory (IModuleFactory):

  • Creates modules by type identifier (string)

  • Receives static configuration parameters as std::any

  • Returns std::unique_ptr<IModule>

Pipeline Factory (IPipelineFactory):

  • Creates pipelines by type identifier

  • Receives module factory and PipelineSpec

  • Constructs complete pipeline with all modules and connections

Complete Example#

The sample pipeline in framework/pipeline/samples/ demonstrates a complete two-module pipeline chaining TensorRT execution with a CUDA kernel:

External Input 0 ┐
                 ├─→ Module A (TensorRT Add) ─→ Module B (ReLU) ─→ Output
External Input 1 ┘

Creating the Pipeline#

const auto spec =
        create_pipeline_spec(tensor_size, engine_path, pipeline::ExecutionMode::Stream);
auto pipeline = pipeline_factory_->create_pipeline("sample", "test_pipeline", spec);

// Setup pipeline (allocates memory, initializes modules)
pipeline->setup();

Configuring and Executing#

// Configure I/O with external inputs/outputs
RT_LOG_INFO("Calling configure_io()");
const pipeline::DynamicParams params{.module_specific_params = {}};
pipeline->configure_io(params, external_inputs, external_outputs, stream_);

// Perform one-time warmup (loads TRT engine, captures CUDA graph)
RT_LOG_INFO("Calling warmup()");
pipeline->warmup(stream_);

// Execute pipeline in stream mode
RT_LOG_INFO("Calling execute_stream()");
pipeline->execute_stream(stream_);

// Synchronize to ensure completion
FRAMEWORK_CUDA_RUNTIME_CHECK_THROW(cudaStreamSynchronize(stream_));
RT_LOG_INFO("Stream execution completed");

Graph Mode Execution#

// Step 1: configure_io FIRST (provides tensor addresses, establishes
// connections)
RT_LOG_INFO("Calling configure_io() - establishes connections");
const pipeline::DynamicParams params{.module_specific_params = {}};
pipeline->configure_io(params, external_inputs, external_outputs, stream_);

// Step 2: warmup() (loads TRT engine, captures CUDA graph)
RT_LOG_INFO("Calling warmup() - loads engine, captures graph");
pipeline->warmup(stream_);

// Step 3: execute graph (build_graph() called automatically on first
// execution)
RT_LOG_INFO("Calling execute_graph()");
pipeline->execute_graph(stream_);

// Synchronize
FRAMEWORK_CUDA_RUNTIME_CHECK_THROW(cudaStreamSynchronize(stream_));
RT_LOG_INFO("Graph execution completed");

Additional Examples#

For complete working examples with full setup and validation, see the test files:

  • Documentation Examples: framework/pipeline/tests/pipeline_sample_tests.cpp - Clean examples demonstrating core concepts with focused test cases

  • Complete Pipeline: framework/pipeline/samples/tests/sample_pipeline_test.cpp - Full-featured pipeline with TensorRT and CUDA kernels, including stream and graph execution modes

These test files demonstrate complete workflows including memory allocation, module creation, pipeline setup, warmup, execution, and result validation.

API Reference#

enum class framework::pipeline::ConnectionCopyMode#

Mode for data transfer between modules

Specifies whether data copying is required between connected modules.

Values:

enumerator Copy#

Allocate buffer and copy data (cudaMemcpy)

enumerator ZeroCopy#

Use upstream address directly (no copy)

enum class framework::pipeline::ExecutionMode#

Pipeline execution mode determining addressing and memory allocation strategy

The execution mode is a static configuration set at pipeline construction time and cannot change during the pipeline’s lifetime. It determines:

  • Memory allocation strategy (fixed vs. dynamic addressing)

  • Zero-copy optimization possibilities

  • TRT engine configuration (graph capture vs. stream mode)

Values:

enumerator Stream#

Stream mode: flexible addressing, supports dynamic set_tensor_address() per iteration, enables zero-copy with dynamic upstream addresses

enumerator Graph#

Graph mode: fixed addressing required for CUDA graph capture/replay, zero-copy only possible with fixed upstream addresses

using framework::pipeline::DualKernelLaunchConfig = KernelLaunchConfig<2>#

Type alias for dual kernel configuration (2 parameters)

using framework::pipeline::ModuleSpec = fluent::NamedType<ModuleCreationInfo, struct ModuleSpecTag>#

Strong type for module specifications using NamedType

Provides type safety and clearer intent when working with module specifications in pipeline configurations.

template<std::size_t NUM_PARAMS>
inline void framework::pipeline::setup_kernel_function(
KernelLaunchConfig<NUM_PARAMS> &config,
const void *kernel_func,
)#

Helper to setup kernel function in launch configuration

Parameters:
  • config – Kernel launch configuration to update

  • kernel_func – Pointer to the kernel function

Throws:

std::runtime_error – if cudaGetFuncBySymbol fails

template<std::size_t NUM_PARAMS>
inline void framework::pipeline::setup_kernel_dimensions(
KernelLaunchConfig<NUM_PARAMS> &config,
const dim3 grid_dim,
const dim3 block_dim,
const std::size_t shared_mem_bytes = 0,
)#

Helper to setup kernel dimensions in launch configuration

Parameters:
  • config – Kernel launch configuration to update

  • grid_dim – Grid dimensions

  • block_dim – Block dimensions

  • shared_mem_bytes – Shared memory size in bytes (default: 0)

template<std::size_t NUM_PARAMS, typename ...Args>
void framework::pipeline::setup_kernel_arguments(
KernelLaunchConfig<NUM_PARAMS> &config,
Args&... args,
)#

Helper to setup kernel arguments in launch configuration

Parameters:
  • config – Kernel launch configuration to update

  • args – Variable number of device pointer arguments

inline CUresult framework::pipeline::launch_kernel(
const CUDA_KERNEL_NODE_PARAMS &kernel_node_params,
cudaStream_t stream,
)#

Launch a kernel using the pre-configured parameters.

This function wraps cuLaunchKernel with the parameters from CUDA_KERNEL_NODE_PARAMS. It enables consistent kernel launching across all modules using the cuBB pattern.

Parameters:
  • kernel_node_params – Pre-configured kernel launch parameters

  • stream – CUDA stream for kernel execution

Returns:

CUresult indicating success (CUDA_SUCCESS) or failure

std::size_t framework::pipeline::align_memory_offset(
std::size_t offset,
std::size_t alignment,
)#

Align offset to specified boundary

Parameters:
  • offset[in] Current offset

  • alignment[in] Alignment requirement

Returns:

Aligned offset

ModuleMemoryRequirements framework::pipeline::calculate_pipeline_memory_requirements(
const std::vector<ModuleMemoryRequirements> &module_requirements,
)#

Calculate total memory requirements for a collection of modules

This function simulates the allocation process to account for alignment padding between modules. It ensures the calculated total matches what will actually be needed during allocation.

Parameters:

module_requirements[in] Vector of memory requirements from each module

Returns:

Total memory requirements including alignment padding

ModuleMemoryRequirements framework::pipeline::calculate_pipeline_memory_requirements(
const std::vector<IModule*> &modules,
)#

Calculate total memory requirements for a collection of modules

This overload extracts requirements from the modules directly.

Parameters:

modules[in] Vector of modules to calculate requirements for

Throws:

std::runtime_error – if any module doesn’t implement IAllocationInfoProvider

Returns:

Total memory requirements including alignment padding

inline std::ostream &framework::pipeline::operator<<(
std::ostream &oss,
const ModuleMemoryRequirements &req,
)#

Stream output operator for ModuleMemoryRequirements

Parameters:
  • oss[in] Output stream to write to

  • req[in] Memory requirements to print

Returns:

Reference to the output stream

inline std::ostream &framework::pipeline::operator<<(
std::ostream &oss,
const ModuleMemorySlice &req,
)#

Stream output operator for ModuleMemorySlice

Parameters:
  • oss[in] Output stream to write to

  • req[in] Memory slice to print

Returns:

Reference to the output stream

inline bool framework::pipeline::can_zero_copy(
const OutputPortMemoryCharacteristics &upstream,
const InputPortMemoryCharacteristics &downstream,
)#

Helper function to determine if zero-copy is possible for a connection.

Zero-copy decision matrix:

| Upstream Provides Fixed | Downstream Requires Fixed | Zero-Copy? | Explanation |
|-------------------------|---------------------------|------------|-------------|
| true | true | YES | Graph mode: downstream uses upstream’s fixed address (no allocation) |
| true | false | YES | Stream mode: downstream uses upstream’s fixed address each tick |
| false | false | YES | Stream mode: downstream uses upstream’s changing address each tick |
| false | true | NO | Incompatible: downstream needs fixed address but upstream changes → must copy |

The ONLY case requiring allocation + copy: upstream dynamic AND downstream requires fixed (e.g., graph mode with changing external inputs)

Parameters:
  • upstream[in] Output characteristics from the producing module

  • downstream[in] Input characteristics from the consuming module

Returns:

true if zero-copy is possible, false if copy is required

struct DeviceTensor#
#include <types.hpp>

Represents a tensor with its device memory location and metadata.

Each device tensor contains the device pointer to tensor data along with its associated metadata (dimensions, type, etc.). This allows ports to contain multiple tensors, each with its own device memory address.

Public Members

void *device_ptr = {nullptr}#

Device pointer to tensor data.

tensor::TensorInfo tensor_info#

Tensor metadata (dimensions, type, size)

class DynamicKernelLaunchConfig : public framework::pipeline::IKernelLaunchConfig#
#include <dynamic_kernel_launch_config.hpp>

Dynamic kernel launch configuration supporting arbitrary number of arguments

This class extends the base kernel launch configuration to support kernels with variable numbers of arguments. Use this when the number of kernel arguments is not known at compile time or varies significantly. For known fixed sizes, prefer KernelLaunchConfig<N>.

Public Functions

virtual void setup_kernel_function(const void *kernel_func) override#

Setup kernel function pointer

Parameters:

kernel_func[in] Pointer to the kernel function

Throws:

std::runtime_error – if cudaGetFuncBySymbol fails

virtual void setup_kernel_dimensions(
const dim3 grid_dim,
const dim3 block_dim,
const std::size_t shared_mem_bytes = 0,
) override#

Setup kernel dimensions and shared memory

Parameters:
  • grid_dim[in] Grid dimensions

  • block_dim[in] Block dimensions

  • shared_mem_bytes[in] Shared memory size in bytes (default: 0)

Throws:

std::invalid_argument – if shared memory size exceeds maximum

virtual CUresult launch(cudaStream_t stream) const override#

Launch kernel using the configured parameters

Parameters:

stream[in] CUDA stream for kernel execution

Returns:

CUresult indicating success (CUDA_SUCCESS) or failure

struct DynamicParams#
#include <types.hpp>

Container for dynamic parameters that can change per iteration.

This structure holds parameters that may vary for each iteration or unit of work being processed through the pipeline. The module_specific_params field allows modules to receive custom parameters for dynamic updates like matrix dimensions, kernel launch configurations, or other per-iteration variations.

Public Members

std::any module_specific_params#

Module-specific parameters for dynamic updates (e.g., matrix dimensions, kernel configs)

class Graph : public framework::pipeline::IGraph#
#include <graph.hpp>

Concrete implementation of IGraph for CUDA graph management.

This class provides a complete implementation of the IGraph interface, managing the lifecycle of CUDA graphs including creation, node addition, instantiation, upload, and launch operations.

Public Functions

Graph() = default#

Default constructor.

~Graph() override#

Destructor - cleans up CUDA graph resources.

Graph(const Graph&) = delete#
Graph &operator=(const Graph&) = delete#
Graph(Graph&&) = delete#
Graph &operator=(Graph&&) = delete#
virtual void create() override#

Creates a CUDA graph.

inline virtual bool is_created() const override#

Checks if the graph has been created.

Returns:

true if created, false otherwise

inline virtual CUgraph handle() const override#

Returns the graph handle.

Returns:

The CUDA graph handle

inline virtual CUgraphNode root_node() const override#

Returns the root node of the graph.

Returns:

The root node of the graph

virtual CUgraphNode add_kernel_node(
std::span<const CUgraphNode> deps,
const CUDA_KERNEL_NODE_PARAMS &params,
) override#

Adds a kernel node to the graph.

Parameters:
  • deps[in] Dependency nodes that must complete before this node executes

  • params[in] Kernel launch parameters for the node

Returns:

The created graph node handle

virtual CUgraphNode add_child_graph_node(
std::span<const CUgraphNode> deps,
CUgraph child_graph,
) override#

Adds a child graph node to the graph.

Parameters:
  • deps[in] Dependency nodes that must complete before this node executes

  • child_graph[in] The CUDA graph to add as a child

Returns:

The created graph node handle

inline virtual CUgraphExec exec_handle() const override#

Returns the graph execution handle.

Returns:

The graph execution handle

virtual void instantiate(unsigned int flags) override#

Instantiates the graph for execution.

Parameters:

flags[in] Instantiation flags passed to cuGraphInstantiate

virtual void upload(cudaStream_t stream) override#

Uploads the graph to the device.

Parameters:

stream[in] CUDA stream for upload passed to cuGraphUpload

virtual void launch(cudaStream_t stream) override#

Launches the graph on the specified stream.

Parameters:

stream[in] CUDA stream for graph execution passed to cuGraphLaunch

class GraphManager : public framework::pipeline::IGraphManager#
#include <graph_manager.hpp>

Concrete implementation of IGraphManager.

This class manages the lifecycle of a CUDA graph, providing a high-level interface for graph operations. It owns a single Graph instance for standard pipeline execution.

Note

Future extensions could support multiple graphs for conditional execution, DGL (Dynamic Graph Launch), or segmented execution modes.

Public Functions

GraphManager()#

Default constructor. Creates the internal graph instance.

~GraphManager() override = default#

Destructor.

GraphManager(const GraphManager&) = delete#
GraphManager &operator=(const GraphManager&) = delete#
GraphManager(GraphManager&&) = delete#
GraphManager &operator=(GraphManager&&) = delete#
virtual void instantiate_graph() const override#

Instantiates the graph for execution.

virtual void upload_graph(cudaStream_t stream) const override#

Uploads the graph to the device.

Parameters:

stream[in] CUDA stream for upload operation

virtual void launch_graph(cudaStream_t stream) const override#

Launches the graph on the specified stream.

Parameters:

stream[in] CUDA stream for graph execution

virtual CUgraphExec get_exec() const override#

Get the graph execution handle.

Returns the CUgraphExec handle needed for dynamic parameter updates via cuGraphExecKernelNodeSetParams.

Throws:

std::runtime_error – if graph is not instantiated

Returns:

The graph execution handle

virtual std::span<const CUgraphNode> add_kernel_node(
gsl_lite::not_null<IGraphNodeProvider*> graph_node_provider,
std::span<const CUgraphNode> deps,
) const override#

Adds kernel node(s) to the graph via a graph node provider.

Parameters:
  • graph_node_provider[in] Provider that will add nodes to the graph

  • deps[in] Dependency nodes that must complete before these nodes execute

Returns:

Span of created graph node handles (can contain single or multiple nodes)

class IAllocationInfoProvider#
#include <iallocation_info_provider.hpp>

Interface for providing memory allocation requirements.

This interface allows modules to specify their memory allocation needs using the detailed ModuleMemoryRequirements structure that aligns with cuBB’s memory management patterns.

Subclassed by framework::pipelines::samples::SampleModuleA, framework::pipelines::samples::SampleModuleB, ran::fronthaul::OrderKernelModule, ran::ldpc::CrcDecoderModule, ran::ldpc::LdpcDecoderModule, ran::ldpc::LdpcDerateMatchModule, ran::pusch::InnerRxModule

Public Functions

virtual ~IAllocationInfoProvider() = default#

Virtual destructor.

IAllocationInfoProvider(const IAllocationInfoProvider&) = delete#
IAllocationInfoProvider(IAllocationInfoProvider&&) = delete#
IAllocationInfoProvider &operator=(
const IAllocationInfoProvider&,
) = delete#
IAllocationInfoProvider &operator=(
IAllocationInfoProvider&&,
) = delete#
virtual ModuleMemoryRequirements get_requirements() const = 0#

Get the memory requirements for a module.

Returns:

The module’s memory requirements including static/dynamic descriptors and device tensors

class IGraph#
#include <igraph.hpp>

Abstract base class for all graph types, focused on the essential operations needed for graph building.

Subclassed by framework::pipeline::Graph

Public Functions

IGraph() = default#

Default constructor.

virtual ~IGraph() = default#

Virtual Destructor.

IGraph(const IGraph&) = delete#
IGraph(IGraph&&) = delete#
IGraph &operator=(const IGraph&) = delete#
IGraph &operator=(IGraph&&) = delete#
virtual void create() = 0#

Creates a graph.

virtual bool is_created() const = 0#

Checks if the graph has been created.

Returns:

true if created, false otherwise

virtual CUgraph handle() const = 0#

Returns the graph handle.

Returns:

The graph handle

virtual CUgraphNode root_node() const = 0#

Returns the root node of the graph.

Returns:

The root node of the graph

virtual CUgraphNode add_kernel_node(
std::span<const CUgraphNode> deps,
const CUDA_KERNEL_NODE_PARAMS &params,
) = 0#

Adds a kernel node to the graph.

Creates a kernel node in the CUDA graph with the specified dependencies and parameters. Returns the created node handle which should be stored by the caller for later parameter updates.

Parameters:
  • deps[in] Dependency nodes that must complete before this node executes

  • params[in] Kernel launch parameters for the node

Throws:

std::runtime_error – if CUDA operation fails

Returns:

The created graph node handle

virtual CUgraphNode add_child_graph_node(
std::span<const CUgraphNode> deps,
CUgraph child_graph,
) = 0#

Adds a child graph node to the graph.

Integrates a pre-captured CUDA graph as a child node within this graph. This is commonly used for TensorRT engines or other stream-captured execution units. Returns the created node handle which should be stored by the caller for later parameter updates.

Parameters:
  • deps[in] Dependency nodes that must complete before this node executes

  • child_graph[in] The CUDA graph to add as a child (must be a valid, captured graph)

Throws:

std::runtime_error – if CUDA operation fails

Returns:

The created graph node handle

virtual CUgraphExec exec_handle() const = 0#

Returns the graph execution handle.

Returns:

The graph execution handle

virtual void instantiate(unsigned int flags) = 0#

Instantiates the graph.

Parameters:

flags[in] The flags passed to cuGraphInstantiate

virtual void upload(cudaStream_t stream) = 0#

Uploads the graph to the device.

Parameters:

stream[in] The stream passed to cuGraphUpload

virtual void launch(cudaStream_t stream) = 0#

Launches the graph.

Parameters:

stream[in] The stream passed to cuGraphLaunch

class IGraphManager#
#include <igraph_manager.hpp>

Interface for managing CUDA graph lifecycle.

This interface abstracts the creation, instantiation, uploading, and launching of CUDA graphs. It provides a high-level API for pipeline implementations to build and execute computational graphs efficiently.

Subclassed by framework::pipeline::GraphManager

Public Functions

IGraphManager() = default#

Default constructor.

virtual ~IGraphManager() = default#

Virtual destructor.

IGraphManager(const IGraphManager&) = delete#
IGraphManager(IGraphManager&&) = delete#
IGraphManager &operator=(const IGraphManager&) = delete#
IGraphManager &operator=(IGraphManager&&) = delete#
virtual void instantiate_graph() const = 0#

Instantiate the graph for execution.

Converts the graph definition into an executable form. Must be called after all nodes have been added and before launching.

Throws:

std::runtime_error – if instantiation fails

virtual void upload_graph(cudaStream_t stream) const = 0#

Upload the graph to the device.

Prepares the graph for efficient execution on the GPU. Should be called after instantiation and before the first launch.

Parameters:

stream[in] CUDA stream for upload operation

Throws:

std::runtime_error – if upload fails

virtual void launch_graph(cudaStream_t stream) const = 0#

Launch the graph on the specified stream.

Executes the pre-built graph on the given CUDA stream. The graph must have been created, instantiated, and uploaded before calling this method.

Parameters:

stream[in] CUDA stream for graph execution

Throws:

std::runtime_error – if launch fails

virtual CUgraphExec get_exec() const = 0#

Get the executable graph handle.

Returns the CUDA graph execution handle, which can be used for dynamic parameter updates via cuGraphExecKernelNodeSetParams.

Throws:

std::runtime_error – if graph is not instantiated

Returns:

The graph execution handle

virtual std::span<const CUgraphNode> add_kernel_node(
gsl_lite::not_null<IGraphNodeProvider*> graph_node_provider,
std::span<const CUgraphNode> deps,
) const = 0#

Add kernel node(s) to the graph via a graph node provider.

Delegates to the provided graph node provider to add its kernel node(s) to the managed graph. The provider specifies dependencies and returns the created node handles.

Parameters:
  • graph_node_provider[in] Provider that will add nodes to the graph

  • deps[in] Dependency nodes that must complete before these nodes execute

Throws:

std::runtime_error – if node addition fails

Returns:

Span of created graph node handles (can contain single or multiple nodes)

class IGraphNodeProvider#
#include <igraph_node_provider.hpp>

Interface for providing CUDA graph nodes.

This interface allows a module to add its kernel node(s), connected by data dependencies, to a CUDA graph. It also supports dynamic updates to graph node parameters for scenarios requiring cuGraphExecKernelNodeSetParams.

Subclassed by framework::pipelines::samples::SampleModuleA, framework::pipelines::samples::SampleModuleB, ran::fronthaul::OrderKernelModule, ran::ldpc::CrcDecoderModule, ran::ldpc::LdpcDecoderModule, ran::ldpc::LdpcDerateMatchModule, ran::pusch::InnerRxModule

Public Functions

IGraphNodeProvider() = default#

Default constructor.

virtual ~IGraphNodeProvider() = default#

Virtual destructor.

IGraphNodeProvider(IGraphNodeProvider&&) = default#

Move constructor.

IGraphNodeProvider &operator=(IGraphNodeProvider&&) = default#

Move assignment operator.

Returns:

Reference to this object

IGraphNodeProvider(const IGraphNodeProvider&) = delete#
IGraphNodeProvider &operator=(const IGraphNodeProvider&) = delete#
virtual std::span<const CUgraphNode> add_node_to_graph(
gsl_lite::not_null<IGraph*> graph,
const std::span<const CUgraphNode> deps,
) = 0#

Add node(s) to the graph.

Parameters:
  • graph[in] The graph to add the node(s) to

  • deps[in] The dependencies of the node(s)

Throws:

std::runtime_error – if CUDA graph node creation fails

Returns:

Span of created graph node handles (can contain single or multiple nodes)

virtual void update_graph_node_params(
CUgraphExec exec,
const DynamicParams &params,
) = 0#

Update graph node parameters for dynamic iteration changes.

This method enables dynamic updates to kernel launch parameters using cuGraphExecKernelNodeSetParams. Modules can extract their specific parameters from params.module_specific_params and update their graph nodes accordingly (e.g., changing grid dimensions, shared memory size).

Parameters:
  • exec – The executable graph to update

  • params – Dynamic parameters containing module-specific parameters

Throws:

std::runtime_error – if cuGraphExecKernelNodeSetParams fails
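
A module implementing this interface typically stores its node handle in add_node_to_graph() and patches it later in update_graph_node_params(). A hedged sketch; the member names, `make_node_params()` helper, and the `IGraph` call shown are illustrative assumptions:

```cpp
// Illustrative IGraphNodeProvider implementation (names are assumptions).
std::span<const CUgraphNode> MyModule::add_node_to_graph(
    gsl_lite::not_null<IGraph*> graph, const std::span<const CUgraphNode> deps) {
    // Add one kernel node and remember its handle for later updates.
    node_ = graph->add_kernel_node(launch_config_.get_kernel_params(), deps);
    return {&node_, 1};
}

void MyModule::update_graph_node_params(CUgraphExec exec, const DynamicParams& params) {
    // Rebuild launch parameters from this iteration's dynamic inputs,
    // then patch the already-instantiated graph in place.
    CUDA_KERNEL_NODE_PARAMS node_params = make_node_params(params);  // hypothetical helper
    if (cuGraphExecKernelNodeSetParams(exec, node_, &node_params) != CUDA_SUCCESS) {
        throw std::runtime_error("cuGraphExecKernelNodeSetParams failed");
    }
}
```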

class IKernelLaunchConfig#
#include <ikernel_launch_config.hpp>

Interface for kernel launch configuration

This interface provides a polymorphic approach to kernel launch configuration, supporting variable numbers of kernel arguments and different configuration types while maintaining backward compatibility and clean architecture.

Subclassed by framework::pipeline::DynamicKernelLaunchConfig, framework::pipeline::KernelLaunchConfig< NUM_PARAMS >

Public Functions

IKernelLaunchConfig() = default#

Default constructor

virtual ~IKernelLaunchConfig() = default#

Virtual destructor

IKernelLaunchConfig(const IKernelLaunchConfig&) = default#

Copy constructor

IKernelLaunchConfig(IKernelLaunchConfig&&) = default#

Move constructor

IKernelLaunchConfig &operator=(const IKernelLaunchConfig&) = default#

Copy assignment operator

Returns:

Reference to this object

IKernelLaunchConfig &operator=(IKernelLaunchConfig&&) = default#

Move assignment operator

Returns:

Reference to this object

virtual CUresult launch(cudaStream_t stream) const = 0#

Launch kernel using the configured parameters

Parameters:

stream[in] CUDA stream for kernel execution

Returns:

CUresult indicating success (CUDA_SUCCESS) or failure

virtual void setup_kernel_function(const void *kernel_func) = 0#

Setup kernel function pointer

Parameters:

kernel_func[in] Pointer to the kernel function

virtual void setup_kernel_dimensions(
const dim3 grid_dim,
const dim3 block_dim,
const std::size_t shared_mem_bytes = 0,
) = 0#

Setup kernel dimensions and shared memory

Parameters:
  • grid_dim[in] Grid dimensions

  • block_dim[in] Block dimensions

  • shared_mem_bytes[in] Shared memory size in bytes (default: 0)

template<typename ...Args>
inline void setup_kernel_arguments(
Args&&... args,
)#

Setup kernel arguments using variadic template

This method provides a compile-time interface for setting up kernel arguments. It clears existing arguments, adds each argument, and finalizes the configuration.

Parameters:

args[in] Variadic arguments to pass to the kernel (must be pointers)
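
Interfaces of this shape pack each argument's address into the `void**` array that the CUDA Driver API expects. A self-contained sketch of that packing pattern, independent of this library:

```cpp
#include <cassert>
#include <vector>

// Collect the addresses of kernel arguments into the void** array shape
// that cuLaunchKernel / CUDA_KERNEL_NODE_PARAMS expect. The caller must
// keep the argument objects alive until the kernel has been launched.
template <typename... Args>
std::vector<void*> pack_kernel_arguments(Args&... args) {
    return {static_cast<void*>(&args)...};
}
```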

class IModule#
#include <imodule.hpp>

Base interface for all algorithm modules.

This interface defines the contract that all modules in the processing pipeline must adhere to, including initialization, tensor information, and capability interfaces.

Subclassed by framework::pipelines::samples::SampleModuleA, framework::pipelines::samples::SampleModuleB, ran::fronthaul::OrderKernelModule, ran::ldpc::CrcDecoderModule, ran::ldpc::LdpcDecoderModule, ran::ldpc::LdpcDerateMatchModule, ran::pusch::InnerRxModule

Public Functions

IModule() = default#

Default constructor.

IModule(const IModule&) = default#

Copy constructor.

IModule(IModule&&) = default#

Move constructor.

IModule &operator=(const IModule&) = default#

Copy assignment operator.

Returns:

Reference to this object

IModule &operator=(IModule&&) = default#

Move assignment operator.

Returns:

Reference to this object

virtual ~IModule() = default#

Virtual destructor.

virtual std::string_view get_type_id() const = 0#

Get the type identifier of the module.

Returns:

The type ID as a string_view

virtual std::string_view get_instance_id() const = 0#

Get the instance identifier of the module.

Returns:

The instance ID as a string_view

virtual void setup_memory(const ModuleMemorySlice &memory_slice) = 0#

Perform one-time setup after memory allocation.

Parameters:

memory_slice[in] Memory slice allocated by PipelineMemoryManager

virtual std::vector<tensor::TensorInfo> get_input_tensor_info(
std::string_view port_name,
) const = 0#

Get the input tensor information for a specified port.

Parameters:

port_name[in] The name of the input port

Returns:

Vector of tensor information for all tensors on this port

virtual std::vector<tensor::TensorInfo> get_output_tensor_info(
std::string_view port_name,
) const = 0#

Get the output tensor information for a specified port.

Parameters:

port_name[in] The name of the output port

Returns:

Vector of tensor information for all tensors on this port

virtual std::vector<std::string> get_input_port_names() const = 0#

Get the names of all input ports.

Returns:

A vector of port names

virtual std::vector<std::string> get_output_port_names() const = 0#

Get the names of all output ports.

Returns:

A vector of port names

virtual void set_inputs(std::span<const PortInfo> inputs) = 0#

Set input connections for the module.

This method is called by the pipeline to connect input ports to their data sources. The module should validate that all required inputs are provided and that the port names match expected inputs.

Note

This is typically called during pipeline configuration after all modules are created

Parameters:

inputs[in] Span of port information with device pointers to input data

Throws:

std::invalid_argument – if required inputs are missing or port names don’t match

virtual std::vector<PortInfo> get_outputs() const = 0#

Get output port information.

Returns information about all output ports including their device pointers and tensor metadata. This is used by the pipeline to route data between modules.

Note

Device pointers are only valid after setup_memory() has been called

Returns:

Vector of port information for all outputs

inline virtual void warmup(cudaStream_t stream)#

Perform one-time warmup and initialization after connections are established.

This method is called once after set_inputs() to perform any expensive one-time initialization that requires knowledge of input/output connections. Examples include:

  • Loading machine learning models to device memory (TensorRT, PyTorch)

  • Capturing CUDA graphs for graph-mode execution

  • Allocating and initializing lookup tables

Typical pipeline lifecycle:

  1. setup() - allocate memory, initialize data structures

  2. set_inputs() - establish data flow connections (lightweight)

  3. warmup(stream) - one-time initialization (expensive, called once)

  4. loop: configure_io() + execute() - process data

Note

Default implementation is no-op (most modules don’t need warmup)

Note

This should only be called once after the first set_inputs() call

Note

Warmup must complete before build_graph() for graph-mode execution

Note

TensorRT graph capture requires a non-default stream (cannot use cudaStreamDefault)

Parameters:

stream[in] CUDA stream to use for warmup operations (e.g., graph capture)

Throws:

std::runtime_error – if warmup fails
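
A module that needs warmup might capture its kernels into a CUDA graph here. A hedged sketch; the member names and `launch_kernels()` helper are assumptions, and error checking is omitted:

```cpp
// Illustrative warmup override capturing this module's kernels into a
// CUDA graph for later graph-mode execution.
void MyModule::warmup(cudaStream_t stream) {
    // Graph capture requires a non-default stream (see note above).
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    launch_kernels(stream);  // the same launches execute() would perform
    cudaStreamEndCapture(stream, &captured_graph_);
}
```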

virtual void configure_io(
const DynamicParams &params,
cudaStream_t stream,
) = 0#

Configure I/O for the current iteration.

This method is called before execute() to update any parameters that change per execution. The module should use this opportunity to update internal state, kernel parameters, or any iteration-specific configuration. The subsequent execute() call will use this prepared state.

Execution flow for both stream and graph modes:

  1. configure_io(params, stream) - prepare internal state

  2. execute(stream) - launch work using prepared state

Note

This is called after warmup() has completed one-time initialization

Note

For graph mode, this is called before each graph launch to update captured parameters

Parameters:
  • params[in] Dynamic parameters for the current iteration

  • stream[in] CUDA stream for async operations during configuration

inline virtual InputPortMemoryCharacteristics get_input_memory_characteristics(
std::string_view port_name,
) const#

Get memory characteristics for input ports (for zero-copy optimization).

Allows modules to declare whether they require fixed input addresses for zero-copy. Called during pipeline setup to optimize memory allocation strategy.

IMPORTANT: Only the requires_fixed_address_for_zero_copy field is used.

Note

Default implementation: flexible (doesn’t require fixed addresses)

Parameters:

port_name[in] Input port name

Returns:

Input port memory characteristics

inline virtual OutputPortMemoryCharacteristics get_output_memory_characteristics(
std::string_view port_name,
) const#

Get memory characteristics for output ports (for zero-copy optimization).

Allows modules to declare whether they provide fixed addresses for outputs. Called during pipeline setup to optimize memory allocation strategy.

IMPORTANT: Only the provides_fixed_address_for_zero_copy field is used.

Note

Default implementation: provides fixed addresses (typical case)

Parameters:

port_name[in] Output port name

Returns:

Output port memory characteristics

inline virtual void set_connection_copy_mode(
std::string_view port_name,
ConnectionCopyMode mode,
)#

Configure connection copy mode for an input port (for zero-copy optimization).

Called by the pipeline during setup() to inform the module about input characteristics. This allows modules to optimize memory allocation in get_requirements():

Typical flow:

  1. Pipeline analyzes input/output memory characteristics

  2. Pipeline calls set_connection_copy_mode() to configure each input port

  3. Module’s get_requirements() uses this info to calculate memory needs

  4. Module allocates only what it needs (skips buffers for zero-copy inputs)

Note

Default implementation is no-op (modules that don’t support zero-copy inputs ignore)

Note

Must be called before get_requirements() for memory optimization to work

Parameters:
  • port_name[in] Input port name

  • mode[in] Connection copy mode (Copy or ZeroCopy)

template<typename T>
inline std::add_pointer_t<const std::remove_reference_t<T>> as_type(
) const#

Cast the module to a specific type, using dynamic_cast.

Template Parameters:

T – The type to cast to

Returns:

Pointer to the module as type T, or nullptr if the module is not of that type

virtual IGraphNodeProvider *as_graph_node_provider() = 0#

Cast the module to an IGraphNodeProvider.

Returns:

The module cast to an IGraphNodeProvider

virtual IStreamExecutor *as_stream_executor() = 0#

Cast the module to an IStreamExecutor.

Returns:

The module cast to an IStreamExecutor

class IModuleFactory#
#include <imodule_factory.hpp>

Interface for creating modules dynamically.

This interface defines the contract for factories that can create different types of modules based on type identifiers and parameters.

Subclassed by framework::pipeline::ModuleFactory, framework::pipelines::samples::SampleModuleAFactory, framework::pipelines::samples::SampleModuleBFactory, framework::pipelines::samples::SampleModuleFactory, ran::fronthaul::OrderKernelModuleFactory, ran::ldpc::CrcDecoderModuleFactory, ran::ldpc::LdpcDecoderModuleFactory, ran::ldpc::LdpcDerateMatchModuleFactory, ran::pusch::InnerRxModuleFactory, ran::pusch::PuschModuleFactory

Public Functions

IModuleFactory() = default#

Default constructor.

virtual ~IModuleFactory() = default#

Virtual destructor.

IModuleFactory(IModuleFactory&&) = default#

Move constructor.

IModuleFactory &operator=(IModuleFactory&&) = default#

Move assignment operator.

Returns:

Reference to this object

IModuleFactory(const IModuleFactory&) = delete#
IModuleFactory &operator=(const IModuleFactory&) = delete#
virtual std::unique_ptr<IModule> create_module(
std::string_view module_type,
const std::string &instance_id,
const std::any &static_params,
) = 0#

Create a module of the specified type.

Parameters:
  • module_type[in] The type of module to create (e.g., “gemm”)

  • instance_id[in] The unique instance identifier for this module

  • static_params[in] Type-erased static parameters for module initialization

Throws:
  • std::invalid_argument – if module_type is not supported

  • std::bad_any_cast – if static_params type doesn’t match module requirements

Returns:

Unique pointer to the created module

virtual bool supports_module_type(
std::string_view module_type,
) const = 0#

Check if a module type is supported by this factory.

Parameters:

module_type[in] The type of module to check

Returns:

true if the module type is supported, false otherwise

struct InputPortMemoryCharacteristics#
#include <types.hpp>

Memory characteristics for a module INPUT port (for zero-copy optimization)

Describes what an input port requires from its upstream connection to enable zero-copy. Used by get_input_memory_characteristics() to declare input requirements.

Public Members

bool requires_fixed_address_for_zero_copy = {false}#

Whether this input port requires fixed upstream addresses for zero-copy optimization.

This field determines if the module can zero-copy with any upstream or only with fixed ones:

false = Can zero-copy with ANY upstream (fixed OR dynamic addresses)

  • Module is flexible: accepts any address, even if it changes per iteration

  • Example: TRT engine in STREAM MODE (uses set_tensor_address() per iteration)

  • Example: CUDA kernel with dynamic descriptors (pointer updated per iteration)

  • Zero-copy: ALWAYS possible! Just use whatever upstream provides

    • If upstream fixed: same address every iteration (no copy)

    • If upstream dynamic: different address each iteration (still no copy!)

true = Can ONLY zero-copy if upstream provides fixed addresses

  • Module MUST have fixed address BEFORE warmup()

  • Reason: Address needed for CUDA graph capture or other pre-warmup operations

  • Example: TRT engine in GRAPH MODE (address captured during graph warmup)

  • Zero-copy: Only possible if upstream provides_fixed_address_for_zero_copy=true (downstream uses upstream’s fixed address, no allocation needed)

  • If upstream provides_fixed_address_for_zero_copy=false: MUST allocate + cudaMemcpy each iteration
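
The copy/zero-copy decision described above reduces to a small predicate. A self-contained sketch; the struct shapes mirror the fields documented here, but the helper function is illustrative, not part of this API:

```cpp
#include <cassert>

struct InputPortMemoryCharacteristics {
    bool requires_fixed_address_for_zero_copy{false};
};

struct OutputPortMemoryCharacteristics {
    bool provides_fixed_address_for_zero_copy{true};
};

// Zero-copy is possible unless the input demands a fixed address
// and the upstream output cannot provide one.
constexpr bool can_zero_copy(const InputPortMemoryCharacteristics& in,
                             const OutputPortMemoryCharacteristics& out) {
    return !in.requires_fixed_address_for_zero_copy ||
           out.provides_fixed_address_for_zero_copy;
}
```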

class IPipeline#
#include <ipipeline.hpp>

Base interface for all processing pipelines.

This interface defines the contract that all pipelines must adhere to, including initialization, setup, and execution phases. Pipelines coordinate the execution of multiple modules and manage data flow between them.

Subclassed by framework::pipelines::samples::SamplePipeline, ran::fronthaul::OrderKernelPipeline, ran::pusch::PuschPipeline

Public Functions

virtual ~IPipeline() = default#

Virtual destructor.

virtual std::string_view get_pipeline_id() const = 0#

Get the pipeline identifier.

Returns:

The pipeline ID as a string view

IPipeline() = default#

Default constructor.

IPipeline(const IPipeline&) = default#

Copy constructor.

IPipeline(IPipeline&&) = default#

Move constructor.

IPipeline &operator=(const IPipeline&) = default#

Copy assignment operator.

Returns:

Reference to this IPipeline

IPipeline &operator=(IPipeline&&) = default#

Move assignment operator.

Returns:

Reference to this IPipeline

virtual void setup() = 0#

Perform one-time setup after initialization.

This method should create modules, allocate memory, and establish data flow connections between modules.

inline virtual void warmup(cudaStream_t stream)#

Perform one-time warmup and initialization of all modules.

This method calls warmup() on all modules after connections are established. It should be called once after the first set_inputs() call and before any execution. This is where expensive one-time operations occur, such as:

  • Loading models to device memory (TensorRT engines)

  • Capturing CUDA graphs for graph-mode execution

Typical pipeline lifecycle:

  1. setup() - allocate memory, create modules

  2. configure_io() - establish connections, set external inputs (first call)

  3. warmup(…, stream) - one-time initialization (expensive, called once)

  4. loop: configure_io() + execute() - process data

Note

Default implementation is no-op (pipelines can override if needed)

Note

Must be called before build_graph() for graph-mode execution

Note

Should only be called once in the pipeline lifecycle

Note

TensorRT graph capture requires a non-default stream (cannot use cudaStreamDefault)

Note

Stream parameter is last to follow standard C++ convention

Parameters:

stream[in] CUDA stream to use for warmup operations (passed to modules)

Throws:

std::runtime_error – if any module warmup fails

virtual void configure_io(
const DynamicParams &params,
std::span<const PortInfo> external_inputs,
std::span<PortInfo> external_outputs,
cudaStream_t stream,
) = 0#

Configure the pipeline I/O for the next iteration.

This method configures external inputs/outputs and updates dynamic parameters for all modules. It must be called before execute().

Note

For the first call, this establishes connections between modules

Note

After the first call, warmup() should be called before execution

Note

Stream parameter is last to follow standard C++ convention

Parameters:
  • params[in] Dynamic parameters for the current iteration

  • external_inputs[in] Span of external input port information

  • external_outputs[out] Span of external output port information. Caller pre-allocates span storage; pipeline writes PortInfo metadata (including device_ptr). Pipeline retains ownership of device memory; caller must not free device pointers. Device pointers remain valid until pipeline destruction or next setup() call.

  • stream[in] CUDA stream to use for I/O configuration operations (e.g., descriptor copies)
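
The pipeline lifecycle described above can be driven as follows. A minimal sketch; the `params`, `inputs`, `outputs`, and `has_work()` names are assumptions:

```cpp
// Illustrative driver loop for an IPipeline.
pipeline->setup();                                        // create modules, allocate memory
pipeline->configure_io(params, inputs, outputs, stream);  // first call: establish connections
pipeline->warmup(stream);                                 // once, on a non-default stream

while (has_work()) {
    pipeline->configure_io(params, inputs, outputs, stream);
    pipeline->execute_stream(stream);  // or execute_graph(stream) in graph mode
}
cudaStreamSynchronize(stream);
```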

virtual void execute_stream(cudaStream_t stream) = 0#

Execute the pipeline using CUDA streams.

This method launches all module kernels sequentially using the parameters configured in configure_io(). The pipeline must have been initialized and set up, and configure_io() must have been called, before invoking this method.

Parameters:

stream[in] The CUDA stream to execute on

Throws:

std::runtime_error – if execution fails

virtual void execute_graph(cudaStream_t stream) = 0#

Execute the pipeline using CUDA graphs.

This method launches the pre-built CUDA graph. The graph must have been created, instantiated, and uploaded during the setup() phase. The pipeline must have been initialized and set up, and configure_io() must have been called, before invoking this method.

Parameters:

stream[in] The CUDA stream to launch the graph on

Throws:

std::runtime_error – if graph execution is not supported or fails

virtual std::size_t get_num_external_inputs() const = 0#

Get the number of external inputs required by this pipeline.

Returns:

Number of external input tensors needed

virtual std::size_t get_num_external_outputs() const = 0#

Get the number of external outputs produced by this pipeline.

Returns:

Number of external output tensors produced

inline virtual std::vector<PortInfo> get_outputs() const#

Get pipeline output port information.

Provides access to the pipeline’s output buffers. This allows external components to access output buffer addresses without executing the pipeline. Typical use case: getting fixed buffer addresses after warmup for zero-copy data passing between pipelines.

Note

Buffer addresses are typically stable after warmup()

Note

Not all pipelines need to implement this - only those that expose outputs for external consumption

Throws:

std::logic_error – Default implementation throws; must be overridden by pipelines that need to expose outputs

Returns:

Vector of output PortInfo describing each output port

class IPipelineFactory#
#include <ipipeline_factory.hpp>

Interface for creating pipelines dynamically.

This interface defines the contract for factories that can create different types of pipelines based on type identifiers and specifications. The factory pattern enables configuration-driven pipeline construction and supports extensibility through runtime registration of pipeline types.

Subclassed by framework::pipeline::PipelineFactory, framework::pipelines::samples::SamplePipelineFactory, ran::fronthaul::OrderKernelPipelineFactory

Public Functions

IPipelineFactory() = default#

Default constructor.

virtual ~IPipelineFactory() = default#

Virtual destructor.

IPipelineFactory(IPipelineFactory&&) = default#

Move constructor.

IPipelineFactory &operator=(IPipelineFactory&&) = default#

Move assignment operator.

Returns:

Reference to this object

IPipelineFactory(const IPipelineFactory&) = delete#
IPipelineFactory &operator=(const IPipelineFactory&) = delete#
virtual std::unique_ptr<IPipeline> create_pipeline(
std::string_view pipeline_type,
const std::string &pipeline_id,
const PipelineSpec &spec,
) = 0#

Create a pipeline from a specification.

This method constructs a complete pipeline based on the provided specification, which includes module definitions, connections, and external I/O configuration.

Parameters:
  • pipeline_type[in] The type of pipeline to create (e.g., “skeleton”, “multi_module”)

  • pipeline_id[in] Unique identifier for this pipeline instance

  • spec[in] Complete pipeline specification including modules and connections

Throws:
  • std::invalid_argument – if pipeline_type is not supported

  • std::runtime_error – if pipeline creation fails

Returns:

Unique pointer to the created pipeline

virtual bool is_pipeline_type_supported(
std::string_view pipeline_type,
) const = 0#

Check if a pipeline type is supported by this factory.

Parameters:

pipeline_type[in] The type of pipeline to check

Returns:

true if the pipeline type is supported, false otherwise

virtual std::vector<std::string> get_supported_pipeline_types(
) const = 0#

Get all supported pipeline types.

Returns:

Vector of supported pipeline type identifiers
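
A typical use of this factory interface, following the Quick Start pattern above. A hedged sketch; the "skeleton" type string matches the example in create_pipeline(), but `load_spec_from_config()` is a hypothetical helper:

```cpp
// Illustrative factory-driven pipeline construction.
auto factory = std::make_unique<PipelineFactory>();
if (factory->is_pipeline_type_supported("skeleton")) {
    const PipelineSpec spec = load_spec_from_config();  // hypothetical helper
    auto pipeline = factory->create_pipeline("skeleton", "pipeline_0", spec);
    pipeline->setup();
}
```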

class IPipelineOutputProvider#
#include <ipipeline_output_provider.hpp>

Interface for accessing pipeline output buffer addresses.

This interface provides access to stable output buffer addresses from pipelines (e.g., Order Kernel) for zero-copy integration with downstream consumers. Addresses remain valid after pipeline warmup for the pipeline’s lifetime.

Thread-safety: Implementations must ensure thread-safe access if called from multiple threads.

Subclassed by ran::fronthaul::Fronthaul

Public Functions

IPipelineOutputProvider() = default#

Default constructor.

virtual ~IPipelineOutputProvider() = default#

Virtual destructor for proper cleanup of derived classes.

IPipelineOutputProvider(IPipelineOutputProvider&&) = default#

Move constructor.

IPipelineOutputProvider &operator=(
IPipelineOutputProvider&&,
) = default#

Move assignment operator.

Returns:

Reference to this object

IPipelineOutputProvider(const IPipelineOutputProvider&) = delete#

Deleted copy constructor (non-copyable).

IPipelineOutputProvider &operator=(
const IPipelineOutputProvider&,
) = delete#

Deleted copy assignment operator (non-copyable).

Returns:

Reference to this object

virtual std::span<const PortInfo> get_order_kernel_outputs(
) const noexcept = 0#

Get Order Kernel pipeline output addresses

Provides access to stable output buffer addresses captured after Order Kernel warmup. These addresses can be used for zero-copy data passing to downstream pipelines (e.g., PUSCH pipeline).

Note

Addresses are stable after warmup and remain valid for pipeline lifetime

Note

Thread-safe if implementation provides thread-safety guarantees

Returns:

Span of PortInfo describing Order Kernel outputs. Empty span indicates outputs are not available (e.g., pipeline not initialized).

class IStreamExecutor#
#include <istream_executor.hpp>

Interface for executing operations directly on a CUDA stream.

This interface provides a mechanism for executing operations on a CUDA stream, typically when CUDA graph execution is not available or suitable. This interface is designed to be used via composition rather than inheritance, allowing modules to delegate execution to concrete executor implementations.

Subclassed by framework::pipeline::NullStreamExecutor, framework::pipeline::StreamExecutor, framework::pipelines::samples::SampleModuleA, framework::pipelines::samples::SampleModuleB, ran::fronthaul::OrderKernelModule, ran::ldpc::CrcDecoderModule, ran::ldpc::LdpcDecoderModule, ran::ldpc::LdpcDerateMatchModule, ran::pusch::InnerRxModule

Public Functions

IStreamExecutor() = default#

Default constructor.

virtual ~IStreamExecutor() = default#

Virtual destructor.

IStreamExecutor(const IStreamExecutor&) = default#

Copy constructor.

IStreamExecutor(IStreamExecutor&&) = default#

Move constructor.

IStreamExecutor &operator=(const IStreamExecutor&) = default#

Copy assignment operator.

Returns:

Reference to this object

IStreamExecutor &operator=(IStreamExecutor&&) = default#

Move assignment operator.

Returns:

Reference to this object

virtual void execute(cudaStream_t stream) = 0#

Execute operations on a CUDA stream.

This method launches the module’s GPU operations using parameters previously set by setup_tick(). The separation between setup_tick() and execute() allows the same pattern to work for both stream and graph execution modes.

Note

setup_tick() must be called before execute() to prepare parameters

Parameters:

stream[in] The CUDA stream to execute on

class KernelDescriptorAccessor#
#include <kernel_descriptor_accessor.hpp>

Kernel descriptor accessor for type-safe parameter handling

One instance of this class exists per module.

Provides access to both static and dynamic kernel parameter structures in pinned memory. Each module gets its own KernelDescriptorAccessor and is responsible for copying descriptors from CPU to GPU when ready.

Public Functions

explicit KernelDescriptorAccessor(
const ModuleMemorySlice &memory_slice,
)#

Constructor

Parameters:

memory_slice[in] Memory slice containing kernel descriptor regions

KernelDescriptorAccessor(const KernelDescriptorAccessor&) = delete#
KernelDescriptorAccessor &operator=(
const KernelDescriptorAccessor&,
) = delete#
KernelDescriptorAccessor(KernelDescriptorAccessor&&) = delete#
KernelDescriptorAccessor &operator=(
KernelDescriptorAccessor&&,
) = delete#
~KernelDescriptorAccessor() = default#

Destructor

template<typename T>
inline T &create_static_param(
const std::size_t offset_bytes,
)#

Create type-safe static kernel parameter at specific offset

Constructs the object in place (placement new), so accessing the descriptor through the returned reference does not invoke undefined behavior.

Note

The created object’s destructor is not called automatically. This is safe for POD types typically used as kernel parameters. Users must ensure proper cleanup for non-POD types.

Template Parameters:

T – Kernel parameter structure type

Parameters:

offset_bytes[in] Byte offset within the module’s static descriptor region

Throws:

std::runtime_error – if allocation exceeds slice bounds

Returns:

Reference to constructed kernel parameter object (CPU memory)

template<typename T>
inline T &create_dynamic_param(
const std::size_t offset_bytes,
)#

Create type-safe dynamic kernel parameter at specific offset

Template Parameters:

T – Kernel parameter structure type

Parameters:

offset_bytes[in] Byte offset within the module’s dynamic descriptor region

Throws:

std::runtime_error – if allocation exceeds slice bounds

Returns:

Reference to constructed kernel parameter object (CPU memory)

void copy_static_descriptors_to_device(cudaStream_t stream) const#

Copy static descriptors from CPU to GPU memory (async)

Module calls this when static descriptors are ready. Typically called once during module initialization.

Parameters:

stream[in] CUDA stream for async operation

void copy_dynamic_descriptors_to_device(cudaStream_t stream) const#

Copy dynamic descriptors from CPU to GPU memory (async)

Module calls this when dynamic descriptors are ready. Typically called every frame/slot when parameters change.

Parameters:

stream[in] CUDA stream for async operation

template<typename T>
inline T *get_static_device_ptr(
const std::size_t offset_bytes,
) const#

Get GPU device pointer for static kernel parameters

Template Parameters:

T – Kernel parameter structure type

Parameters:

offset_bytes[in] Byte offset within the static descriptor region

Returns:

Device pointer for kernel launch

template<typename T>
inline T *get_dynamic_device_ptr(
const std::size_t offset_bytes,
) const#

Get GPU device pointer for dynamic kernel parameters

Template Parameters:

T – Kernel parameter structure type

Parameters:

offset_bytes[in] Byte offset within the dynamic descriptor region

Returns:

Device pointer for kernel launch
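
The accessor methods above combine into the following descriptor flow. A hedged sketch; `MyKernelParams`, the offset, and the `memory_slice`/`stream` variables are assumptions:

```cpp
// Illustrative static-descriptor flow for one module.
KernelDescriptorAccessor accessor{memory_slice};

// One-time static parameters, written on the CPU side in pinned memory.
auto& params = accessor.create_static_param<MyKernelParams>(/*offset_bytes=*/0);
params.tensor_size = 2048;
accessor.copy_static_descriptors_to_device(stream);  // async CPU -> GPU copy

// The matching device pointer is what the kernel launch actually receives.
MyKernelParams* d_params = accessor.get_static_device_ptr<MyKernelParams>(0);
```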

template<std::size_t NUM_PARAMS>
class KernelLaunchConfig : public framework::pipeline::IKernelLaunchConfig#
#include <kernel_launch_config.hpp>

Template-based kernel launch configuration

This class is designed for CUDA Graph compatibility and efficient kernel launching using the CUDA Driver API. It stores pre-configured kernel launch parameters that can be reused across multiple kernel invocations.

Usage pattern:

  1. During setup(): Configure the CUDA_KERNEL_NODE_PARAMS once

  2. During execute(): Launch kernel with pre-configured parameters

This approach enables:

  • CUDA Graph capture and replay

  • Reduced overhead (configuration done once, not per execution)

  • Consistent error handling with CUresult

  • Clean separation between setup and execution phases

Template Parameters:

NUM_PARAMS – Number of kernel parameters (must be > 0)
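
The setup/execute split described above looks roughly as follows. A hedged sketch; `my_kernel`, the device pointers, and the dimensions are assumptions:

```cpp
// Illustrative use of KernelLaunchConfig for a kernel with two parameters.
KernelLaunchConfig<2> config;

// During setup(): configure once.
config.setup_kernel_function(reinterpret_cast<const void*>(&my_kernel));
config.setup_kernel_dimensions(dim3{64}, dim3{256});
config.setup_kernel_arguments(&d_input, &d_output);  // arguments must be pointers

// During execute(): launch repeatedly with the pre-built parameters.
if (config.launch(stream) != CUDA_SUCCESS) {
    // handle launch failure
}
```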

Public Functions

inline virtual void setup_kernel_function(
const void *kernel_func,
) override#

Setup kernel function pointer

Parameters:

kernel_func[in] Pointer to the kernel function

Throws:

std::runtime_error – if cudaGetFuncBySymbol fails

inline virtual void setup_kernel_dimensions(
const dim3 grid_dim,
const dim3 block_dim,
const std::size_t shared_mem_bytes = 0,
) override#

Setup kernel dimensions and shared memory

Parameters:
  • grid_dim[in] Grid dimensions

  • block_dim[in] Block dimensions

  • shared_mem_bytes[in] Shared memory size in bytes (default: 0)

Throws:

std::invalid_argument – if shared memory size exceeds maximum

virtual CUresult launch(cudaStream_t stream) const override#

Launch kernel using the configured parameters

Parameters:

stream[in] CUDA stream for kernel execution

Returns:

CUresult indicating success (CUDA_SUCCESS) or failure

inline const CUDA_KERNEL_NODE_PARAMS &get_kernel_params() const#

Get const reference to kernel node parameters for graph node creation

Returns:

Const reference to CUDA_KERNEL_NODE_PARAMS

class KernelNodeHelper#
#include <kernel_node_helper.hpp>

Helper class for managing CUDA kernel node parameters

This class encapsulates common CUDA kernel launch configuration logic that can be reused across different kernel launch configurations via composition instead of inheritance.

Public Functions

void setup_kernel_function(const void *kernel_func)#

Set up the kernel function pointer

Parameters:

kernel_func[in] Pointer to the kernel function

Throws:

std::runtime_error – if cudaGetFuncBySymbol fails

void setup_kernel_dimensions(
const dim3 grid_dim,
const dim3 block_dim,
const std::size_t shared_mem_bytes = 0,
)#

Set up kernel dimensions and shared memory

Parameters:
  • grid_dim[in] Grid dimensions

  • block_dim[in] Block dimensions

  • shared_mem_bytes[in] Shared memory size in bytes (default: 0)

Throws:

std::invalid_argument – if shared memory size exceeds maximum

const CUDA_KERNEL_NODE_PARAMS &get_kernel_params() const#

Get const reference to kernel node parameters

Returns:

Const reference to CUDA_KERNEL_NODE_PARAMS for kernel launch

void clear_kernel_params()#

Clear kernel node parameters

void set_kernel_params_ptr(void **kernel_params_ptr)#

Set kernel node parameters pointer

Note

Caller must ensure the pointer is valid for the duration of the kernel launch.

Parameters:

kernel_params_ptr[in] Pointer to kernel parameters

struct ModuleCreationInfo#
#include <types.hpp>

Information needed to create a module instance.

Contains all information needed to create and initialize a module through the factory pattern. This is the underlying data for ModuleSpec.

Public Members

std::string module_type#

Module type identifier (e.g., “gemm”, “relu”)

std::string instance_id#

Unique instance identifier for this module.

std::any init_params#

Type-erased initialization parameters.

class ModuleFactory : public framework::pipeline::IModuleFactory#
#include <module_factory.hpp>

Concrete implementation of IModuleFactory with runtime registration.

This factory uses a registry pattern allowing module types to be registered at runtime. This provides flexibility for different applications to register their specific module types without modifying the factory implementation.

Example usage:

ModuleFactory factory;
factory.register_module_type("gemm",
    [](const std::string& id, const std::any& params) {
        return std::make_unique<GemmModule>(
            id, std::any_cast<GemmConfig>(params));
    });

auto module = factory.create_module("gemm", "gemm_0", gemm_config);

Public Functions

ModuleFactory() = default#

Default constructor.

~ModuleFactory() override = default#

Destructor.

ModuleFactory(const ModuleFactory&) = delete#
ModuleFactory &operator=(const ModuleFactory&) = delete#
ModuleFactory(ModuleFactory&&) = default#

Move constructor.

ModuleFactory &operator=(ModuleFactory&&) = default#

Move assignment operator.

Returns:

Reference to this object

template<ModuleCreator Creator>
inline void register_module_type(
std::string_view module_type,
Creator &&creator,
)#

Register a module type with its creator function.

Parameters:
  • module_type[in] Type identifier for the module

  • creator[in] Function that creates instances of this module type

Throws:

std::invalid_argument – if module_type is already registered

virtual std::unique_ptr<IModule> create_module(
std::string_view module_type,
const std::string &instance_id,
const std::any &static_params,
) override#

Create a module of the specified type.

Parameters:
  • module_type[in] The type of module to create

  • instance_id[in] The unique instance identifier for this module

  • static_params[in] Type-erased static parameters for module initialization

Throws:
  • std::invalid_argument – if module_type is not supported

  • std::bad_any_cast – if static_params type doesn’t match module requirements

Returns:

Unique pointer to the created module

virtual bool supports_module_type(
std::string_view module_type,
) const override#

Check if a module type is supported by this factory.

Parameters:

module_type[in] The type of module to check

Returns:

true if the module type is supported, false otherwise

struct ModuleMemoryRequirements#
#include <types.hpp>

Memory allocation requirements for a module

Follows cuBB’s pattern with static/dynamic kernel descriptors and device tensor allocation.

Public Members

std::size_t static_kernel_descriptor_bytes = {0}#

Size of static kernel parameters (set once)

std::size_t dynamic_kernel_descriptor_bytes{0}#

Size of dynamic kernel parameters (updated per frame)

std::size_t device_tensor_bytes = {0}#

Size of module’s device tensor allocation (for any use)

std::size_t alignment = {DEFAULT_ALIGNMENT}#

Memory alignment requirement.

Public Static Attributes

static constexpr std::size_t DEFAULT_ALIGNMENT = 128#

Memory alignment requirement in bytes.

struct ModuleMemorySlice#
#include <types.hpp>

Memory slice assigned to a module

Contains pointers to memory regions allocated by the pipeline. Follows cuBB pattern: CPU/GPU descriptor pairs + device tensor slice.

Public Members

std::byte *static_kernel_descriptor_cpu_ptr{nullptr}#

Pinned memory for static kernel parameters (CPU)

std::byte *static_kernel_descriptor_gpu_ptr{nullptr}#

Device memory for static kernel parameters (GPU)

std::byte *dynamic_kernel_descriptor_cpu_ptr{nullptr}#

Pinned memory for dynamic kernel parameters (CPU)

std::byte *dynamic_kernel_descriptor_gpu_ptr{nullptr}#

Device memory for dynamic kernel parameters (GPU)

std::byte *device_tensor_ptr = {nullptr}#

Device memory for module’s tensor data (intermediate/output/scratch)

std::size_t static_kernel_descriptor_bytes{0}#

Size of static descriptor slices (same for CPU and GPU)

std::size_t dynamic_kernel_descriptor_bytes{0}#

Size of dynamic descriptor slices (same for CPU and GPU)

std::size_t device_tensor_bytes = {0}#

Size of device tensor slice.

class ModuleRouter#
#include <module_router.hpp>

Manages routing configuration between modules in a pipeline.

This class manages how modules are connected via their input/output ports, providing efficient lookup of connections involving specific modules.

Internal Architecture: The router uses an index-based lookup optimization for efficient connection queries.

  • connections_: Vector storing all connections sequentially

  • module_to_connection_indices_: Maps each module ID to indices in connections_

When a connection A->B is added:

  1. Connection is appended to connections_ at index N

  2. Index N is added to both moduleA’s and moduleB’s index lists

This allows O(1) lookup of all connections involving a module, rather than O(n) scanning of all connections.

Example after adding A->B, B->C, A->C:

connections_ = [
    0: {A, output0, B, input0},
    1: {B, output0, C, input0},
    2: {A, output1, C, input1}
]
module_to_connection_indices_ = {
    "A": [0, 2],  // A is involved in connections 0 and 2
    "B": [0, 1],  // B is involved in connections 0 and 1
    "C": [1, 2]   // C is involved in connections 1 and 2
}
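The index scheme above can be reproduced in a small self-contained sketch; this illustrates the technique only and is not the library's actual implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal stand-in for PortConnection.
struct Connection {
    std::string source_module, source_port;
    std::string target_module, target_port;
};

struct TinyRouter {
    std::vector<Connection> connections_;
    std::unordered_map<std::string, std::vector<std::size_t>>
        module_to_connection_indices_;

    void add_connection(const Connection &c) {
        const std::size_t n = connections_.size();
        connections_.push_back(c);
        // Record index n for both endpoints so a module's connections
        // can be found without scanning the whole vector.
        module_to_connection_indices_[c.source_module].push_back(n);
        module_to_connection_indices_[c.target_module].push_back(n);
    }

    std::vector<Connection> get_module_connections(const std::string &id) const {
        std::vector<Connection> out;
        const auto it = module_to_connection_indices_.find(id);
        if (it == module_to_connection_indices_.end()) return out;
        for (const std::size_t i : it->second) out.push_back(connections_[i]);
        return out;
    }
};
```

With the A->B, B->C, A->C example from above, each module resolves to exactly two connections via its index list.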

Public Functions

ModuleRouter() = default#

Default constructor.

~ModuleRouter() = default#

Destructor.

ModuleRouter(const ModuleRouter&) = delete#
ModuleRouter &operator=(const ModuleRouter&) = delete#
ModuleRouter(ModuleRouter&&) = default#

Move constructor.

ModuleRouter &operator=(ModuleRouter&&) = default#

Move assignment operator.

Returns:

Reference to this object

void add_connection(const PortConnection &connection)#

Add a connection between two module ports.

Parameters:

connection[in] The port connection to add

Throws:

std::runtime_error – if duplicate connection already exists

std::vector<PortConnection> get_module_connections(
std::string_view module_id,
) const#

Get all connections for a specific module.

Returns connections where the module is either source or target.

Parameters:

module_id[in] The module ID to query

Returns:

Vector of connections involving this module

std::vector<PortConnection> get_input_connections(
std::string_view module_id,
) const#

Get input connections for a specific module.

Returns connections where the module is the target (receiving input).

Parameters:

module_id[in] The module ID to query

Returns:

Vector of connections where this module is the target

std::vector<PortConnection> get_output_connections(
std::string_view module_id,
) const#

Get output connections for a specific module.

Returns connections where the module is the source (providing output).

Parameters:

module_id[in] The module ID to query

Returns:

Vector of connections where this module is the source

bool has_connections(std::string_view module_id) const#

Check if a module has any connections.

Parameters:

module_id[in] The module ID to check

Returns:

true if the module has connections, false otherwise

std::unordered_set<std::string> get_all_module_ids() const#

Get all unique module IDs in the routing configuration.

Returns:

Set of module IDs

void validate() const#

Validate the routing configuration.

Checks for duplicate connections and invalid port configurations.

Throws:

std::runtime_error – if configuration is invalid

class NullStreamExecutor : public framework::pipeline::IStreamExecutor#
#include <null_stream_executor.hpp>

Null implementation of IStreamExecutor for testing and placeholder scenarios.

This class provides a no-op implementation of the IStreamExecutor interface. It can be used in unit tests or as a placeholder when stream execution is not needed.

Public Functions

inline virtual void execute(cudaStream_t stream) override#

Execute operation - does nothing.

Parameters:

stream[in] The CUDA stream to execute on (ignored)

struct OutputPortMemoryCharacteristics#
#include <types.hpp>

Memory characteristics for a module OUTPUT port (for zero-copy optimization)

Describes what an output port provides to its downstream connections to enable zero-copy. Used by get_output_memory_characteristics() to declare output capabilities.

Public Members

bool provides_fixed_address_for_zero_copy = {true}#

Whether this output port provides fixed device addresses (for zero-copy optimization).

true = Address allocated once in setup_memory(), never changes

  • Enables downstream zero-copy (if downstream can accept it)

  • This is the typical case for most modules

false = Address may change per iteration

  • Examples: external inputs, ping-pong buffers

  • Limits zero-copy to flexible consumers only

class PipelineFactory : public framework::pipeline::IPipelineFactory#
#include <pipeline_factory.hpp>

Concrete implementation of IPipelineFactory with runtime registration.

This factory uses a registry pattern allowing pipeline types to be registered at runtime. Each pipeline creator receives a reference to the module factory for constructing pipeline modules.

Example usage:

ModuleFactory module_factory;
// ... register module types ...

PipelineFactory pipeline_factory(module_factory);
pipeline_factory.register_pipeline_type("skeleton",
    [](IModuleFactory& mf, const std::string& id, const std::any& params) {
        return std::make_unique<SkeletonPipeline>(
            mf, std::any_cast<PipelineSpec>(params));
    });

auto pipeline = pipeline_factory.create_pipeline("skeleton", "pipeline_0", spec);

Public Types

using PipelineCreator = std::function<std::unique_ptr<IPipeline>(IModuleFactory&, const std::string&, const std::any&)>#

Pipeline creator function signature. Takes module factory reference, instance ID, and parameters. Returns unique pointer to pipeline.

Public Functions

explicit PipelineFactory(IModuleFactory &module_factory)#

Constructor.

Parameters:

module_factory[in] Reference to module factory for creating pipeline modules

~PipelineFactory() override = default#

Destructor.

PipelineFactory(const PipelineFactory&) = delete#
PipelineFactory &operator=(const PipelineFactory&) = delete#
PipelineFactory(PipelineFactory&&) = delete#
PipelineFactory &operator=(PipelineFactory&&) = delete#
template<StorablePipelineCreator Creator>
void register_pipeline_type(
std::string_view pipeline_type,
Creator &&creator,
)#

Register a pipeline type with its creator function (concept-constrained).

Accepts any callable (lambda, function pointer, functor, std::function) that satisfies the StorablePipelineCreator concept.

This templated overload enables:

  • Better compile-time error messages

  • Zero-overhead for stateless lambdas (avoids std::function wrapper)

  • Type safety enforced at compile time

Template Parameters:

Creator – Type of the callable (deduced automatically)

Parameters:
  • pipeline_type[in] Type identifier for the pipeline

  • creator[in] Callable that creates instances of this pipeline type

Throws:

std::invalid_argument – if pipeline_type is already registered

void register_pipeline_type(
std::string_view pipeline_type,
PipelineCreator creator,
)#

Register a pipeline type with its creator function (std::function overload).

This overload accepts std::function directly for explicit usage. Prefer using the templated overload for better performance with lambdas.

Parameters:
  • pipeline_type[in] Type identifier for the pipeline

  • creator[in] Function that creates instances of this pipeline type

Throws:

std::invalid_argument – if pipeline_type is already registered

virtual std::unique_ptr<IPipeline> create_pipeline(
std::string_view pipeline_type,
const std::string &pipeline_id,
const PipelineSpec &spec,
) override#

Create a pipeline from a specification.

Parameters:
  • pipeline_type[in] The type of pipeline to create

  • pipeline_id[in] Unique identifier for this pipeline instance

  • spec[in] Complete pipeline specification (usually PipelineSpec)

Throws:
  • std::invalid_argument – if pipeline_type is not supported

  • std::runtime_error – if pipeline creation fails

Returns:

Unique pointer to the created pipeline

virtual bool is_pipeline_type_supported(
std::string_view pipeline_type,
) const override#

Check if a pipeline type is supported by this factory.

Parameters:

pipeline_type[in] The type of pipeline to check

Returns:

true if the pipeline type is supported, false otherwise

virtual std::vector<std::string> get_supported_pipeline_types(
) const override#

Get all supported pipeline types.

Returns:

Vector of supported pipeline type identifiers

class PipelineMemoryManager#
#include <pipeline_memory_manager.hpp>

Pipeline-level memory manager

Follows cuBB pattern where pipeline owns large contiguous memory arenas and assigns memory slices to modules. Five-tier allocation:

  1. Static kernel descriptors CPU: Small pinned memory (like cuBB’s m_kernelStatDescr CPU)

  2. Static kernel descriptors GPU: Small device memory (like cuBB’s m_kernelStatDescr GPU)

  3. Dynamic kernel descriptors CPU: Small pinned memory (like cuBB’s m_kernelDynDescr CPU)

  4. Dynamic kernel descriptors GPU: Small device memory (like cuBB’s m_kernelDynDescr GPU)

  5. Device tensors: Large device memory (like cuBB’s m_LinearAlloc)

Note: Pipeline only allocates memory slices. Each module is responsible for copying its own descriptors from CPU to GPU when ready.
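A hedged sketch of the intended call sequence, assuming `modules` is a `std::vector<IModule*>` and `stream` a valid CUDA stream; module-side descriptor setup is elided:

```cpp
// Size the arenas from the modules' declared requirements and
// hand each module its slice in one atomic pass.
auto memory_manager = PipelineMemoryManager::create_for_modules(modules);
memory_manager->allocate_all_module_slices(modules);

// ... each module fills in its static descriptors on the CPU side ...

// Once, at the end of initialization: bulk H2D copy of the entire
// static descriptor region.
memory_manager->copy_all_static_descriptors_to_device(stream);

// Every iteration: modules update dynamic descriptors, then one bulk copy.
memory_manager->copy_all_dynamic_descriptors_to_device(stream);
```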

Public Functions

PipelineMemoryManager(
std::size_t total_static_kernel_descriptor_bytes,
std::size_t total_dynamic_kernel_descriptor_bytes,
std::size_t total_device_tensor_bytes,
)#

Constructor

Parameters:
  • total_static_kernel_descriptor_bytes[in] Total memory for all static kernel descriptors (both CPU and GPU)

  • total_dynamic_kernel_descriptor_bytes[in] Total memory for all dynamic kernel descriptors (both CPU and GPU)

  • total_device_tensor_bytes[in] Total device memory for all module tensor allocations

PipelineMemoryManager(const PipelineMemoryManager&) = delete#
PipelineMemoryManager &operator=(
const PipelineMemoryManager&,
) = delete#
PipelineMemoryManager(PipelineMemoryManager&&) = default#

Move constructor.

PipelineMemoryManager &operator=(PipelineMemoryManager&&) = default#

Move assignment operator.

Returns:

Reference to this object

~PipelineMemoryManager() = default#

Destructor.

ModuleMemorySlice allocate_module_slice(
std::string_view module_id,
const ModuleMemoryRequirements &requirements,
)#

Allocate memory slice for a module

Parameters:
  • module_id[in] Unique identifier for the module

  • requirements[in] Memory requirements for the module

Throws:

std::runtime_error – if allocation fails or exceeds arena capacity

Returns:

Memory slice assigned to the module

const ModuleMemorySlice &get_module_slice(
std::string_view module_id,
) const#

Get memory slice for a previously allocated module

Parameters:

module_id[in] Module identifier

Throws:

std::runtime_error – if module not found

Returns:

Memory slice for the module

MemoryUsage get_memory_usage() const#

Get current memory usage statistics.

Returns:

Memory usage information across all arenas

void allocate_all_module_slices(
const std::vector<IModule*> &modules,
)#

Pre-allocate memory slices for all modules

This method calculates requirements and allocates memory slices for all modules in one atomic operation, ensuring consistency between calculation and allocation order. After calling this method, get_module_slice() can be used to retrieve pre-computed slices.

Parameters:

modules[in] Vector of modules that need memory allocation

Throws:
  • std::runtime_error – if any module doesn’t implement IAllocationInfoProvider

  • std::runtime_error – if allocation fails

void copy_all_static_descriptors_to_device(cudaStream_t stream) const#

Copy all static kernel descriptors to device in one bulk operation

Copies the entire contiguous static descriptor region (all modules) from CPU pinned memory to GPU device memory. This should be called once at the end of pipeline initialization after all modules have initialized their static parameters.

Parameters:

stream[in] CUDA stream for async copy operation

Throws:

std::runtime_error – if copy fails or pointers are null

void copy_all_dynamic_descriptors_to_device(
cudaStream_t stream,
) const#

Copy all dynamic kernel descriptors to device in one bulk operation

Copies the entire contiguous dynamic descriptor region (all modules) from CPU pinned memory to GPU device memory. This should be called every iteration in configure_io() after all modules have updated their dynamic parameters.

Parameters:

stream[in] CUDA stream for async copy operation

Throws:

std::runtime_error – if copy fails or pointers are null

Public Static Functions

static std::unique_ptr<PipelineMemoryManager> create_for_modules(
const std::vector<IModule*> &modules,
)#

Factory method to create a PipelineMemoryManager sized for the given modules

This method calculates the total memory requirements for all modules (including alignment) and creates a PipelineMemoryManager with the appropriate arena sizes.

Parameters:

modules[in] Vector of modules that will use this memory manager

Throws:

std::runtime_error – if any module doesn’t implement IAllocationInfoProvider

Returns:

Unique pointer to a properly sized PipelineMemoryManager

struct MemoryUsage#
#include <pipeline_memory_manager.hpp>

Memory usage statistics across all arenas

Public Members

std::size_t static_kernel_descriptor_used{0}#

Used static kernel descriptor memory (CPU + GPU)

std::size_t dynamic_kernel_descriptor_used{0}#

Used dynamic kernel descriptor memory (CPU + GPU)

std::size_t device_tensor_used = {0}#

Used device tensor memory.

std::size_t static_kernel_descriptor_total{0}#

Total static kernel descriptor memory (CPU + GPU)

std::size_t dynamic_kernel_descriptor_total{0}#

Total dynamic kernel descriptor memory (CPU + GPU)

std::size_t device_tensor_total = {0}#

Total device tensor memory.

struct PipelineModuleConfig#
#include <types.hpp>

Configuration for creating all modules in a pipeline.

Contains a list of module creation specifications that define what modules a pipeline should create and in what order. The order in the vector determines the execution order of the modules.

Public Members

std::vector<ModuleSpec> modules#

Ordered list of modules to create.

struct PipelineSpec#
#include <types.hpp>

Complete specification for constructing a pipeline.

Contains all modules, connections, and external I/O specifications needed to construct a complete pipeline through the factory.

Public Members

std::string pipeline_name#

Pipeline name for identification.

std::vector<ModuleSpec> modules#

Ordered list of modules to create.

std::vector<PortConnection> connections#

Module interconnections.

std::vector<std::string> external_inputs#

External input identifiers.

std::vector<std::string> external_outputs#

External output identifiers.

ExecutionMode execution_mode = {ExecutionMode::Graph}#

Execution mode (Graph or Stream); defaults to Graph for backward compatibility.
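A hedged sketch of filling in a two-module specification; the module types, instance IDs, configs, external identifier strings, and the brace-initialization of `ModuleSpec`/`PortConnection` entries are illustrative assumptions:

```cpp
PipelineSpec spec;
spec.pipeline_name = "example_pipeline";
spec.modules = {
    {"gemm", "gemm_0", std::any{gemm_config}},  // vector order defines
    {"relu", "relu_0", std::any{relu_config}},  // execution order
};
spec.connections = {
    {"gemm_0", "output0", "relu_0", "input0"},  // source -> target ports
};
spec.external_inputs = {"pipeline_input"};      // identifier format is illustrative
spec.external_outputs = {"pipeline_output"};
spec.execution_mode = ExecutionMode::Stream;    // default is ExecutionMode::Graph
```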

struct PortConnection#
#include <types.hpp>

Represents a connection between module ports.

Public Members

std::string source_module#

Source module ID.

std::string source_port#

Source port name (e.g., “output0”)

std::string target_module#

Target module ID.

std::string target_port#

Target port name (e.g., “input0”)

struct PortInfo#
#include <types.hpp>

Information about a module’s input or output port

This structure represents a named port containing one or more tensors, each with its own device memory pointer and tensor metadata. Used for module interconnection in pipelines.

Note: Modules typically extract and cache just the device pointers (void*) from PortInfo for execution, not the entire structure.

Public Members

std::string name#

Port name (e.g., “input0”, “matrixA”, “output0”)

std::vector<DeviceTensor> tensors#

Vector of device tensors for this port.

class StreamExecutor : public framework::pipeline::IStreamExecutor#
#include <stream_executor.hpp>

Concrete implementation of IStreamExecutor that executes CUDA kernels on a stream.

This class takes a kernel launch configuration and executes it on the provided CUDA stream. It handles the actual kernel launch using the CUDA Driver API.

Public Functions

explicit StreamExecutor(
const IKernelLaunchConfig *kernel_launch_config,
)#

Constructor

Parameters:

kernel_launch_config[in] Pointer to the kernel launch configuration. Must remain valid for the lifetime of the executor.

virtual void execute(cudaStream_t stream) override#

Execute the kernel on a CUDA stream.

Uses kernel parameters previously configured via the kernel_launch_config. Dynamic parameters should be set via setup_tick() on the owning module before calling execute().

Parameters:

stream[in] The CUDA stream to execute on

Throws:

std::runtime_error – if kernel launch fails

class StreamExecutorFactory#
#include <stream_executor_factory.hpp>

Factory for creating IStreamExecutor instances.

This factory can create either real StreamExecutor instances that execute CUDA kernels, or NullStreamExecutor instances for testing or no-op scenarios.

Public Types

enum class ExecutorType#

Executor type enumeration

Values:

enumerator Real#

Create real StreamExecutor instances.

enumerator Null#

Create NullStreamExecutor instances.

Public Functions

explicit StreamExecutorFactory(ExecutorType type)#

Constructor

Parameters:

type[in] The type of executors this factory should create

std::unique_ptr<IStreamExecutor> create_stream_executor(
const IKernelLaunchConfig *kernel_config,
) const#

Create a stream executor instance.

Parameters:

kernel_config[in] Pointer to kernel launch configuration. Required for Real executors; may be nullptr for Null executors, which ignore it.

Returns:

A unique pointer to the created IStreamExecutor instance
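A hedged sketch contrasting the two executor types; `launch_config` and `stream` are assumed to already exist:

```cpp
// Real executor: launches the configured kernel on the given stream.
StreamExecutorFactory real_factory(StreamExecutorFactory::ExecutorType::Real);
auto real_executor = real_factory.create_stream_executor(&launch_config);
real_executor->execute(stream);

// Null executor: configuration is ignored; useful in unit tests.
StreamExecutorFactory null_factory(StreamExecutorFactory::ExecutorType::Null);
auto null_executor = null_factory.create_stream_executor(nullptr);
null_executor->execute(stream);  // no-op
```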

template<std::size_t NUM_PARAMS>
inline std::unique_ptr<IStreamExecutor> create_stream_executor(
const KernelLaunchConfig<NUM_PARAMS> *kernel_config,
) const#

Create a stream executor instance with KernelLaunchConfig<NUM_PARAMS>.

Template Parameters:

NUM_PARAMS – Number of kernel parameters

Parameters:

kernel_config[in] Pointer to kernel launch configuration

Returns:

A unique pointer to the created IStreamExecutor instance

std::unique_ptr<IStreamExecutor> create_stream_executor(
const DynamicKernelLaunchConfig *kernel_config,
) const#

Create a stream executor instance with DynamicKernelLaunchConfig.

Parameters:

kernel_config[in] Pointer to dynamic kernel launch configuration

Returns:

A unique pointer to the created IStreamExecutor instance