Create a Custom Operator

DALI allows you to create a custom operator in C++ and load it at runtime. Here are several reasons you might need to write your own custom operator:

  • DALI does not support the operation that you want to perform and it cannot be expressed by a composition of other operators.

  • You want to write an operator that depends on a third-party library.

  • You want to optimize your pipeline by providing a manually fused operation in C++.

In this tutorial, we will walk you through the process of writing, compiling, and loading a plugin with a DALI custom operator. For demonstration purposes we will provide a CPU and a GPU implementation for the CustomDummy operator. The implementation only copies the input data to the output without any modifications.


Prerequisites

  • DALI is installed from the binary distribution or compiled from source.

  • You can write in C++.

  • You have a basic knowledge of CMake.

Operator Definition

  1. Declare the operator in a header file.

  2. Provide common Setup functions.

The implementation of CanInferOutputs and SetupImpl can be shared across backends. SetupImpl provides the shape and type description of the output based on the input, and CanInferOutputs informs the executor that the Operator can provide that output description for the entire batch before executing RunImpl.
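The contract between the setup step and the run step can be illustrated with a small Python model. This is only a sketch of the idea, not DALI code: ToyCopyOperator, toy_execute, and the (sizes, type name) output description are hypothetical names invented for this illustration, and samples are modeled as plain byte strings.

```python
class ToyCopyOperator:
    """Hypothetical stand-in for an operator that copies its input."""

    def can_infer_outputs(self):
        # The operator promises to describe its outputs before running.
        return True

    def setup(self, batch):
        # One description per output: per-sample sizes and an element type name.
        return [([len(sample) for sample in batch], "uint8")]

    def run(self, batch, outputs):
        # The output buffers already exist, so the run step only fills them.
        for src, dst in zip(batch, outputs[0]):
            dst[:] = src


def toy_execute(op, batch):
    """Hypothetical executor: describe outputs, allocate them, then run."""
    if not op.can_infer_outputs():
        raise NotImplementedError("operator must allocate its own outputs")
    descriptions = op.setup(batch)
    # Allocate every output for the whole batch from the descriptions.
    outputs = [[bytearray(size) for size in sizes] for sizes, _type in descriptions]
    op.run(batch, outputs)
    return outputs


outputs = toy_execute(ToyCopyOperator(), [b"abc", b"de"])
print(outputs[0])  # [bytearray(b'abc'), bytearray(b'de')]
```

The point of the split is that allocation happens once, for the entire batch, before any per-sample work starts.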

! cat customdummy/dummy.h

#ifndef EXAMPLE_DUMMY_H_
#define EXAMPLE_DUMMY_H_

#include <vector>

#include "dali/pipeline/operator/operator.h"

namespace other_ns {

// Copies its input to its output; Backend selects the CPU or GPU variant.
template <typename Backend>
class Dummy : public ::dali::Operator<Backend> {
 public:
  inline explicit Dummy(const ::dali::OpSpec &spec) :
    ::dali::Operator<Backend>(spec) {}

  virtual inline ~Dummy() = default;

  Dummy(const Dummy&) = delete;
  Dummy& operator=(const Dummy&) = delete;
  Dummy(Dummy&&) = delete;
  Dummy& operator=(Dummy&&) = delete;

 protected:
  // Tells the executor that SetupImpl describes the outputs for the whole batch.
  bool CanInferOutputs() const override {
    return true;
  }

  // The single output has the same shape and type as the input.
  bool SetupImpl(std::vector<::dali::OutputDesc> &output_desc,
                 const ::dali::workspace_t<Backend> &ws) override {
    const auto &input = ws.template InputRef<Backend>(0);
    output_desc.resize(1);
    output_desc[0] = {input.shape(), input.type()};
    return true;
  }

  // Defined separately for each backend.
  void RunImpl(::dali::workspace_t<Backend> &ws) override;
};

}  // namespace other_ns

#endif  // EXAMPLE_DUMMY_H_

CPU Operator Implementation

  1. Provide the CPU implementation in a C++ implementation file by overriding the RunImpl method for HostWorkspace.

  2. Register the schema for the custom operator with the DALI_SCHEMA macro and register the CPU version of the operator with the DALI_REGISTER_OPERATOR macro.

In RunImpl we obtain access to the entire batch that is processed. We get the reference to the CPU thread pool from the workspace ws and create tasks that will copy samples from input to output in parallel. The tasks will be ordered by the thread pool from the longest to the shortest, based on the tensor size, to best utilize the worker threads.

The outputs are already allocated because we provided the SetupImpl and CanInferOutputs functions.
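The longest-first ordering can be illustrated outside of DALI. The following Python sketch is not DALI code and does not use the DALI thread pool: copy_batch_longest_first is a hypothetical helper that uses the standard concurrent.futures executor and byte strings in place of tensors. It only shows why the task size is a useful priority: the largest copies are submitted first, so the short ones fill the remaining gaps.

```python
from concurrent.futures import ThreadPoolExecutor


def copy_batch_longest_first(batch, num_threads=4):
    """Copy every sample of `batch`, submitting the largest samples first."""
    outputs = [None] * len(batch)
    # Sample indices sorted by sample size, largest first.
    order = sorted(range(len(batch)), key=lambda i: len(batch[i]), reverse=True)

    def copy_sample(i):
        outputs[i] = bytes(batch[i])

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # The executor starts tasks in submission order.
        futures = [pool.submit(copy_sample, i) for i in order]
        for future in futures:
            future.result()  # wait for completion and surface any exception
    return outputs, order


batch = [b"ab", b"abcdef", b"a", b"abcd"]
outputs, order = copy_batch_longest_first(batch)
print(order)  # [1, 3, 0, 2]
```

Each task writes only to its own output slot, which is why no locking is needed in this sketch.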

! cat customdummy/
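#include "dummy.h"

namespace other_ns {

template <>
void Dummy<::dali::CPUBackend>::RunImpl(::dali::HostWorkspace &ws) {
  const auto &input = ws.InputRef<::dali::CPUBackend>(0);
  auto &output = ws.OutputRef<::dali::CPUBackend>(0);

  ::dali::TypeInfo type = input.type();
  auto &tp = ws.GetThreadPool();
  const auto &in_shape = input.shape();
  for (int sample_id = 0; sample_id < in_shape.num_samples(); sample_id++) {
    // Schedule one copy task per sample. The tensor size is the task priority,
    // so the thread pool runs the largest copies first.
    tp.AddWork(
        [&, sample_id](int thread_id) {
          type.Copy<::dali::CPUBackend, ::dali::CPUBackend>(output.raw_mutable_tensor(sample_id),
                                                            input.raw_tensor(sample_id),
                                                            in_shape.tensor_size(sample_id), 0);
        },
        in_shape.tensor_size(sample_id));
  }
  // Start the scheduled tasks and wait until all of them finish.
  tp.RunAll();
}

}  // namespace other_ns

DALI_REGISTER_OPERATOR(CustomDummy, ::other_ns::Dummy<::dali::CPUBackend>, ::dali::CPU);

// One input and one output, matching the single-input copy operator.
DALI_SCHEMA(CustomDummy)
    .DocStr("Make a copy of the input tensor")
    .NumInput(1)
    .NumOutput(1);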

GPU Operator Implementation

  1. Provide a GPU implementation in a CUDA implementation file by overriding the RunImpl method for DeviceWorkspace.

  2. Register the GPU version of the operator with the DALI_REGISTER_OPERATOR macro.

As in the CPU implementation, we obtain the entire batch in the RunImpl function. The outputs are already allocated based on the output description that the SetupImpl function, provided earlier, returns.

It is important that we issue the GPU operations on the stream provided by the workspace. Here we copy the batch using cudaMemcpyAsync.

! cat customdummy/
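#include <cuda_runtime_api.h>
#include "dummy.h"

namespace other_ns {

template <>
void Dummy<::dali::GPUBackend>::RunImpl(::dali::DeviceWorkspace &ws) {
  const auto &input = ws.Input<::dali::GPUBackend>(0);
  auto &output = ws.Output<::dali::GPUBackend>(0);

  // Copy the whole batch in one asynchronous device-to-device transfer,
  // issued on the stream provided by the workspace.
  CUDA_CALL(cudaMemcpyAsync(output.raw_mutable_data(),
                            input.raw_data(),
                            input.nbytes(),
                            cudaMemcpyDeviceToDevice,
                            ws.stream()));
}

}  // namespace other_ns

DALI_REGISTER_OPERATOR(CustomDummy, ::other_ns::Dummy<::dali::GPUBackend>, ::dali::GPU);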

Building the Plugin

  1. Specify the build configuration.

To retrieve the build configuration parameters use nvidia.dali.sysconfig.

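import nvidia.dali.sysconfig as sysconfig
# Flags to pass to the compiler and to the linker, respectively.
print(sysconfig.get_compile_flags())
print(sysconfig.get_link_flags())
['-I/usr/local/lib/python3.6/dist-packages/nvidia/dali/include', '-D_GLIBCXX_USE_CXX11_ABI=1']
['-L/usr/local/lib/python3.6/dist-packages/nvidia/dali', '-ldali']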

Important: Only one version of the DALI library (the one linked with -ldali above) should be loaded in a process at a time. A plugin must be linked against the exact same library, located in DALI's Python package directory, that you intend to use when loading the plugin. As a result, whenever you upgrade DALI you must link your plugin against the new library again.
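To make the role of the two flag lists concrete, the following sketch assembles a compiler command line from them by hand. It is only an illustration under stated assumptions: build_plugin_command, the source file my_op.cc, the output name libmy_op.so, the g++ compiler, and the -std=c++14 option are all hypothetical choices, and the helper only builds the argument list without running anything.

```python
def build_plugin_command(sources, output, compile_flags, link_flags, compiler="g++"):
    """Assemble a shared-library build command from DALI's flag lists.

    Hypothetical helper for illustration: it returns the argument list
    and does not invoke the compiler.
    """
    return [
        compiler,
        "-std=c++14",        # assumed language standard
        "-shared", "-fPIC",  # build a position-independent shared library
        *sources,
        "-o", output,
        *compile_flags,      # include path and ABI definition
        *link_flags,         # libraries go after the sources that use them
    ]


# Flag lists copied from the sysconfig output shown above.
compile_flags = ["-I/usr/local/lib/python3.6/dist-packages/nvidia/dali/include",
                 "-D_GLIBCXX_USE_CXX11_ABI=1"]
link_flags = ["-L/usr/local/lib/python3.6/dist-packages/nvidia/dali", "-ldali"]

command = build_plugin_command(["my_op.cc"], "libmy_op.so", compile_flags, link_flags)
print(" ".join(command))
```

In a real build the two lists would come from sysconfig.get_compile_flags() and sysconfig.get_link_flags() rather than being copied by hand, so that the plugin always matches the installed DALI package.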

  2. In this example, we use CMake to build the plugin.

! cat customdummy/CMakeLists.txt
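cmake_minimum_required(VERSION 3.5)
find_package(CUDA 9.0 REQUIRED)

# Query the installed DALI package for its library directory and compile flags.
execute_process(
        COMMAND python -c "import nvidia.dali as dali; print(dali.sysconfig.get_lib_dir())"
        OUTPUT_VARIABLE DALI_LIB_DIR)
string(STRIP "${DALI_LIB_DIR}" DALI_LIB_DIR)

execute_process(
        COMMAND python -c "import nvidia.dali as dali; print(\" \".join(dali.sysconfig.get_compile_flags()))"
        OUTPUT_VARIABLE DALI_COMPILE_FLAGS)
string(STRIP "${DALI_COMPILE_FLAGS}" DALI_COMPILE_FLAGS)

# Pass DALI's flags to the host compiler and NVCC (the C++14 standard is an assumption).
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++14 ${DALI_COMPILE_FLAGS} ")
set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -std=c++14 ${DALI_COMPILE_FLAGS} ")
link_directories( "${DALI_LIB_DIR}" )

# The operator's C++ and CUDA implementation files belong in the source list after SHARED.
cuda_add_library(customdummy SHARED )
target_link_libraries(customdummy dali)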

  3. We are now ready to compile the plugin that contains the CustomDummy custom operator.

! rm -rf customdummy/build
! mkdir -p customdummy/build
! cd customdummy/build && \
  cmake .. && \
  make
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found suitable version "11.0", minimum required is "9.0")
-- Configuring done
-- Generating done
-- Build files have been written to: /home/git/DALI/docs/examples/custom_operations/custom_operator/customdummy/build
[ 33%] Building NVCC (Device) object CMakeFiles/customdummy.dir/
Scanning dependencies of target customdummy
[ 66%] Building CXX object CMakeFiles/customdummy.dir/
[100%] Linking CXX shared library
[100%] Built target customdummy

  4. After the build is complete, we have a dynamic library file that is ready to use.

! ls customdummy/build/*.so

Importing the Plugin

  1. Before the plugin is loaded, DALI has no operator called CustomDummy.

import nvidia.dali.ops as ops
try:
    # The operator is not registered yet, so this attribute lookup fails.
    help(ops.CustomDummy)
except Exception as e:
    print('Error: ' + str(e))
Error: module 'nvidia.dali.ops' has no attribute 'CustomDummy'
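
  2. Load the plugin.

import nvidia.dali.plugin_manager as plugin_manager
# Load the shared library produced by the customdummy CMake target.
plugin_manager.load_library('./customdummy/build/libcustomdummy.so')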

  3. Verify that the new operator is available.

help(ops.CustomDummy)

Help on class CustomDummy in module nvidia.dali.ops:

class CustomDummy(builtins.object)
 |  Make a copy of the input tensor
 |  Supported backends
 |   * 'cpu'
 |   * 'gpu'
 |  Keyword args
 |  ------------
 |  `bytes_per_sample_hint` : int, optional, default = 0
 |      Output size hint, in bytes per sample.
 |      If specified, the operator's outputs residing in GPU or page-locked host memory will be preallocated
 |      to accommodate a batch of samples of this size.
 |  `preserve` : bool, optional, default = False
 |      Prevents the operator from being removed from the
 |      graph even if its outputs are not used.
 |  `seed` : int, optional, default = -1
 |      Random seed.
 |      If not provided, it will be populated based on the global seed of the pipeline.
 |  Methods defined here:
 |  __call__(self, *inputs, **kwargs)
 |      __call__(data, **kwargs)
 |      Operator call to be used in graph definition.
 |      Args
 |      ----
 |      `data` : TensorList
 |          Input to the operator.
 |  __init__(self, **kwargs)
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  device
 |  preserve
 |  schema
 |  spec
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  schema_name = 'CustomDummy'