DriveWorks SDK Reference
3.0.4260 Release
For Test and Development only

DNN Plugins
Note
SW Release Applicability: This tutorial is applicable to modules in both NVIDIA DriveWorks and NVIDIA DRIVE Software releases.

DNN Plugins Overview

The DNN Plugins module enables DNN models that contain layers not natively supported by TensorRT to still benefit from the efficiency of TensorRT.

Integrating such models into DriveWorks can be divided into the following steps:

  1. Implement custom layers as dwDNNPlugin and compile as a dynamic library
  2. Generate TensorRT model with plugins using tensorRT_optimization tool
  3. Load the model and the corresponding plugins at initialization

Implementing Custom DNN Layers as Plugins

Note
Dynamic libraries representing DNN layers must be built against the same TensorRT, CuDNN, and CUDA versions as DriveWorks, where applicable.

A DNN plugin in DriveWorks requires implementing a set of pre-defined functions. The declarations of these functions are located in dw/dnn/plugins/DNNPlugin.h.

All of these functions share a _dwDNNPluginHandle_t parameter. This is a custom, user-defined handle that is constructed at initialization and can be used or modified throughout the lifetime of the plugin.
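
For illustration only, the handle typically wraps a pointer to a user-defined object that holds the plugin state. The sketch below shows a common pattern, not a requirement of the API; the helper names toHandle and fromHandle are hypothetical:

// Minimal sketch of a typical handle pattern (illustrative assumption only):
// the opaque _dwDNNPluginHandle_t wraps a pointer to a user-defined object
// that stores the plugin state for the lifetime of the plugin.
#include <dw/dnn/plugins/DNNPlugin.h>

class MyPluginState; // hypothetical user-defined plugin class, e.g. FCPlugin below

// Store the state behind the opaque handle at initialization ...
static _dwDNNPluginHandle_t toHandle(MyPluginState* state)
{
    return reinterpret_cast<_dwDNNPluginHandle_t>(state);
}

// ... and recover it inside every other plugin interface function.
static MyPluginState* fromHandle(_dwDNNPluginHandle_t handle)
{
    return reinterpret_cast<MyPluginState*>(handle);
}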

In this tutorial, we are going to implement and integrate a fully-connected layer (FCPlugin) for an MNIST network model. See DNN Plugin Sample for the corresponding sample.

The fully-connected layer in the MNIST network model looks like this:

layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  param {
    lr_mult: 1.0
  }
  param {
    lr_mult: 2.0
  }
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}

In the FCPlugin example, we implement a class that stores the members that must be available throughout the lifetime of the network model and provides the methods required by dwDNNPlugin. Two of these members are the cuBLAS and CuDNN handles, which we use to perform the inference. An instance of this class is the custom variable that is passed to each function.

class FCPlugin
{
public:
    FCPlugin(const dwDNNPluginWeights* weights, int32_t numWeights,
             int32_t numOutputChannels);
    FCPlugin(const void* data, size_t length);
    ~FCPlugin();

    void initialize();
    void terminate();
    int32_t getNbOutputs() const;
    dwBlobSize getOutputDimensions(int32_t index, const dwBlobSize* inputs, int32_t numInputDims);
    bool supportsFormat(dwPrecision precision, dwDNNPluginLayout layout) const;
    void configureWithFormat(dwPrecision precision, dwDNNPluginLayout layout);
    size_t getWorkspaceSize(int32_t) const;
    int32_t enqueue(int32_t batchSize, const void* const* inputs, void** outputs, void*, cudaStream_t stream);
    size_t getSerializationSize();
    void serialize(void* buffer);

private:
    size_t type2size(dwPrecision precision);
    template<typename T> void write(char*& buffer, const T& val);
    template<typename T> void read(const char*& buffer, T& val);
    void* copyToDevice(const void* data, size_t count);
    void convertAndCopyToDevice(void*& deviceWeights, const dwDNNPluginWeights& weights);
    void convertAndCopyToBuffer(char*& buffer, const dwDNNPluginWeights& weights);
    void deserializeToDevice(const char*& hostBuffer, void*& deviceWeights, size_t size);

    // Number of input and output channels of FC layer
    int32_t m_numOutputChannels, m_numInputChannels;
    // Kernel and bias weights of FC layer
    dwDNNPluginWeights m_kernelWeights, m_biasWeights;
    // Precision of the network model
    dwPrecision m_precision;
    // Actual storage on the device for kernel and bias weights
    void* m_deviceKernel{nullptr};
    void* m_deviceBias{nullptr};
    // CuDNN and cuBLAS handles that are needed for the execution of the layer
    cudnnHandle_t m_cudnn;
    cublasHandle_t m_cublas;
    cudnnTensorDescriptor_t m_srcDescriptor, m_dstDescriptor;
};
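
The private helpers type2size(), write() and read() declared above are used throughout the following sections but are not shown in this excerpt. A minimal sketch of how they could be implemented (an assumption for illustration; the actual sample code may differ) is:

// Minimal sketch of the helper methods declared above (illustrative assumption;
// the actual sample implementation may differ). Requires <cstring> for memcpy.

// Size in bytes of a single element for the given precision.
size_t FCPlugin::type2size(dwPrecision precision)
{
    return precision == DW_PRECISION_FP32 ? sizeof(float32_t) : sizeof(__half);
}

// Write a plain value into the serialization buffer and advance the cursor.
template<typename T>
void FCPlugin::write(char*& buffer, const T& val)
{
    memcpy(buffer, &val, sizeof(T));
    buffer += sizeof(T);
}

// Read a plain value from the serialization buffer and advance the cursor.
template<typename T>
void FCPlugin::read(const char*& buffer, T& val)
{
    memcpy(&val, buffer, sizeof(T));
    buffer += sizeof(T);
}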

Initialization

A plugin can be initialized either directly from a structure dwDNNPluginWeights:

dwStatus _dwDNNPlugin_initializeFromWeights(_dwDNNPluginHandle_t* handle, const char* layerName,
                                            const dwDNNPluginWeights* weights, int32_t numWeights);

or it can be initialized from serialized data:

dwStatus _dwDNNPlugin_initialize(_dwDNNPluginHandle_t* handle, const char* layerName,
                                 const void* data, size_t length);

Each of these functions defines how the plugin is initialized from the given input data.

In the FCPlugin example, we implement the backend for these two entry points as constructors of the class; the initialization functions above then simply create an instance of the FCPlugin class. Additionally, we define helper methods to deserialize the layer from data.

The first constructor initializes the plugin class from dwDNNPluginWeights.

// Construction from dwDNNPluginWeights
FCPlugin::FCPlugin(const dwDNNPluginWeights *weights, int32_t numWeights, int32_t numOutputChannels)
    : m_numOutputChannels(numOutputChannels)
{
    m_kernelWeights = weights[0];
    m_biasWeights   = weights[1];

    m_kernelWeights.values = malloc(m_kernelWeights.count * type2size(m_kernelWeights.precision));
    memcpy(const_cast<void*>(m_kernelWeights.values), weights[0].values,
           m_kernelWeights.count * type2size(m_kernelWeights.precision));
    m_biasWeights.values = malloc(m_biasWeights.count * type2size(m_biasWeights.precision));
    memcpy(const_cast<void*>(m_biasWeights.values), weights[1].values,
           m_biasWeights.count * type2size(m_biasWeights.precision));

    m_numInputChannels = int(weights[0].count / numOutputChannels);

    initialize();
}

This constructor allocates kernel and bias weights, and copies them from the given input. numOutputChannels is a configurable parameter of the layer which is defined in the model's description.

The second constructor implements initializing the plugin class from binary data.

// Construction from data
FCPlugin::FCPlugin(const void* data, size_t length)
{
    const char* d = static_cast<const char*>(data);
    read(d, m_numInputChannels);
    read(d, m_numOutputChannels);
    m_kernelWeights.count  = m_numInputChannels * m_numOutputChannels;
    m_kernelWeights.values = nullptr;
    read(d, m_biasWeights.count);
    m_biasWeights.values = nullptr;
    read(d, m_precision);
    deserializeToDevice(d, m_deviceKernel, m_kernelWeights.count * type2size(m_precision));
    deserializeToDevice(d, m_deviceBias, m_biasWeights.count * type2size(m_precision));
    initialize();
}

The way the data is deserialized depends on how it was serialized. This constructor first deserializes the dimensions, then allocates the member variables accordingly and deserializes the weights.

The initialize() function is called by both constructors to create the CuDNN and cuBLAS handles and to perform the operations that are common to both construction paths.

// Initialize members common to both constructors
void FCPlugin::initialize()
{
    // Initialize CuDNN and cuBLAS
    CHECK_CUDA_ERROR(cudnnCreate(&m_cudnn));
    CHECK_CUDA_ERROR(cublasCreate(&m_cublas));
    // Create the CuDNN tensor descriptors needed for bias addition
    CHECK_CUDA_ERROR(cudnnCreateTensorDescriptor(&m_srcDescriptor));
    CHECK_CUDA_ERROR(cudnnCreateTensorDescriptor(&m_dstDescriptor));

    if (m_kernelWeights.values)
        convertAndCopyToDevice(m_deviceKernel, m_kernelWeights);
    if (m_biasWeights.values)
        convertAndCopyToDevice(m_deviceBias, m_biasWeights);
}
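
The CHECK_CUDA_ERROR macro used above comes from the DriveWorks samples framework and is not part of the plugin interface. A minimal sketch of a comparable error-checking macro (an assumption for illustration, not the samples' actual implementation) could look like this:

// Illustrative sketch of an error-checking macro comparable to CHECK_CUDA_ERROR
// (an assumption; the DriveWorks samples framework provides its own version).
// CUDA, cuBLAS and CuDNN all use 0 as their success status code.
#include <cstdio>
#include <cstdlib>

#define CHECK_CUDA_ERROR(expr)                                          \
    do                                                                  \
    {                                                                   \
        auto _status = (expr);                                          \
        if (_status != 0)                                               \
        {                                                               \
            fprintf(stderr, "Error %d at %s:%d\n",                      \
                    static_cast<int>(_status), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)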

Deserialization, in this case, is performed by simply copying the binary data from host to device.

// Deserialization
void FCPlugin::deserializeToDevice(const char*& hostBuffer, void*& deviceWeights, size_t size)
{
    deviceWeights = copyToDevice(hostBuffer, size);
    hostBuffer += size;
}

The following function facilitates allocating data on device and copying data to device.

// Allocate and copy given data to device memory
void* FCPlugin::copyToDevice(const void* data, size_t count)
{
    void* deviceData;
    CHECK_CUDA_ERROR(cudaMalloc(&deviceData, count));
    CHECK_CUDA_ERROR(cudaMemcpy(deviceData, data, count, cudaMemcpyHostToDevice));
    return deviceData;
}

The following function performs precision conversion if necessary before copying data to device.

// Convert weights based on loaded precision and copy to device
void FCPlugin::convertAndCopyToDevice(void*& deviceWeights, const dwDNNPluginWeights& weights)
{
    if (weights.precision != m_precision) // Weights are converted in host memory first if the type does not match
    {
        size_t size  = weights.count * type2size(m_precision);
        void* buffer = malloc(size);
        for (int64_t v = 0; v < weights.count; ++v)
        {
            if (m_precision == DW_PRECISION_FP32)
                static_cast<float32_t*>(buffer)[v] = __half2float(static_cast<const __half*>(weights.values)[v]);
            else
                static_cast<__half*>(buffer)[v] = __float2half(static_cast<const float32_t*>(weights.values)[v]);
        }
        deviceWeights = copyToDevice(buffer, size);
        free(buffer);
    }
    else
    {
        deviceWeights = copyToDevice(weights.values, weights.count * type2size(m_precision));
    }
}

The initialization functions required by dwDNNPlugin can now be implemented as follows:

dwStatus _dwDNNPlugin_initializeFromWeights(_dwDNNPluginHandle_t* handle, const char* layerName,
                                            const dwDNNPluginWeights* weights, int32_t numWeights)
{
    std::unique_ptr<FCPlugin> fcPlugin(new FCPlugin(weights, numWeights, 10));
    *handle = reinterpret_cast<_dwDNNPluginHandle_t>(fcPlugin.release());
    return DW_SUCCESS;
}

dwStatus _dwDNNPlugin_initialize(_dwDNNPluginHandle_t* handle, const char* layerName,
                                 const void* data, size_t length)
{
    std::unique_ptr<FCPlugin> fcPlugin(new FCPlugin(data, length));
    *handle = reinterpret_cast<_dwDNNPluginHandle_t>(fcPlugin.release());
    return DW_SUCCESS;
}

Releasing

The release function must destroy everything that was created by the plugin at initialization.

In the FCPlugin example, we destroy everything in the destructor.

FCPlugin::~FCPlugin()
{
    if (m_kernelWeights.values)
    {
        free(const_cast<void*>(m_kernelWeights.values));
        m_kernelWeights.values = nullptr;
    }
    if (m_biasWeights.values)
    {
        free(const_cast<void*>(m_biasWeights.values));
        m_biasWeights.values = nullptr;
    }
    terminate();
}
void FCPlugin::terminate()
{
    CHECK_CUDA_ERROR(cublasDestroy(m_cublas));
    CHECK_CUDA_ERROR(cudnnDestroy(m_cudnn));
    if (m_deviceKernel)
    {
        cudaFree(m_deviceKernel);
        m_deviceKernel = nullptr;
    }
    if (m_deviceBias)
    {
        cudaFree(m_deviceBias);
        m_deviceBias = nullptr;
    }
}

dwStatus _dwDNNPlugin_release(_dwDNNPluginHandle_t handle)
{
    FCPlugin* plugin = reinterpret_cast<FCPlugin*>(handle);
    delete plugin;
    return DW_SUCCESS;
}

Serialization

A plugin must provide functions to enable serialization to a buffer.

In the FCPlugin example, we store the number of input channels, the number of output channels, the number of bias weights, and the precision. In addition, we store the kernel and bias weights. Therefore:

size_t FCPlugin::getSerializationSize()
{
    return sizeof(m_numInputChannels) + sizeof(m_numOutputChannels) + sizeof(m_biasWeights.count) + sizeof(m_precision) +
           (m_kernelWeights.count + m_biasWeights.count) * type2size(m_precision);
}

void FCPlugin::serialize(void* buffer)
{
    char* d = static_cast<char*>(buffer);
    write(d, m_numInputChannels);
    write(d, m_numOutputChannels);
    write(d, m_biasWeights.count);
    write(d, m_precision);
    convertAndCopyToBuffer(d, m_kernelWeights);
    convertAndCopyToBuffer(d, m_biasWeights);
}

void FCPlugin::convertAndCopyToBuffer(char*& buffer, const dwDNNPluginWeights& weights)
{
    if (weights.precision != m_precision)
    {
        for (int64_t v = 0; v < weights.count; ++v)
        {
            if (m_precision == DW_PRECISION_FP32)
                reinterpret_cast<float32_t*>(buffer)[v] = __half2float(static_cast<const __half*>(weights.values)[v]);
            else
                reinterpret_cast<__half*>(buffer)[v] = __float2half(static_cast<const float32_t*>(weights.values)[v]);
        }
    }
    else
    {
        memcpy(buffer, weights.values, weights.count * type2size(m_precision));
    }
    buffer += weights.count * type2size(m_precision);
}

The interface functions can be defined as:

dwStatus _dwDNNPlugin_getSerializationSize(size_t* serializationSize, _dwDNNPluginHandle_t handle)
{
    FCPlugin* plugin = reinterpret_cast<FCPlugin*>(handle);
    *serializationSize = plugin->getSerializationSize();
    return DW_SUCCESS;
}

dwStatus _dwDNNPlugin_serialize(void* buffer, _dwDNNPluginHandle_t handle)
{
    FCPlugin* plugin = reinterpret_cast<FCPlugin*>(handle);
    plugin->serialize(buffer);
    return DW_SUCCESS;
}

Getter Functions

A plugin is required to implement multiple getter functions.

A plugin must provide a function returning how many outputs it has:

dwStatus _dwDNNPlugin_getNumOutputs(int32_t* numOutputs, _dwDNNPluginHandle_t handle);

In addition, it must provide a function returning the dimensions of each output based on the input dimensions:

dwStatus _dwDNNPlugin_getOutputDimensions(dwBlobSize* outputDimensions,
                                          int32_t outputIndex, const dwBlobSize* inputDimensions,
                                          int32_t numInputs, _dwDNNPluginHandle_t handle);

Finally, it must provide a function returning the size of the temporary workspace required by this layer:

dwStatus _dwDNNPlugin_getWorkspaceSize(size_t* workspaceSize, int32_t maxBatchSize,
                                       _dwDNNPluginHandle_t handle);

In the FCPlugin example, we know that the layer has only one output:

int32_t FCPlugin::getNbOutputs() const
{
    return 1;
}

The output dimensions are expected to be NxCx1x1, where N is the batch size of the input and C is the number of output channels.

dwBlobSize FCPlugin::getOutputDimensions(int32_t index, const dwBlobSize* inputs, int32_t numInputDims)
{
    return dwBlobSize{inputs[0].batchsize, m_numOutputChannels, 1, 1};
}

FCPlugin does not need any temporary workspace during the generation of the model.

size_t FCPlugin::getWorkspaceSize(int32_t) const
{
    return 0;
}

Finally, the interface functions can be defined as:

dwStatus _dwDNNPlugin_getNumOutputs(int32_t* numOutputs, _dwDNNPluginHandle_t handle)
{
    FCPlugin* plugin = reinterpret_cast<FCPlugin*>(handle);
    *numOutputs = plugin->getNbOutputs();
    return DW_SUCCESS;
}

dwStatus _dwDNNPlugin_getOutputDimensions(dwBlobSize* outputDimensions,
                                          int32_t outputIndex, const dwBlobSize* inputDimensions,
                                          int32_t numInputs, _dwDNNPluginHandle_t handle)
{
    FCPlugin* plugin = reinterpret_cast<FCPlugin*>(handle);
    *outputDimensions = plugin->getOutputDimensions(outputIndex, inputDimensions, numInputs);
    return DW_SUCCESS;
}

dwStatus _dwDNNPlugin_getWorkspaceSize(size_t* workspaceSize, int32_t maxBatchSize,
                                       _dwDNNPluginHandle_t handle)
{
    FCPlugin* plugin = reinterpret_cast<FCPlugin*>(handle);
    *workspaceSize = plugin->getWorkspaceSize(maxBatchSize);
    return DW_SUCCESS;
}

Formats

A plugin must implement a function to indicate whether a given format is supported:

dwStatus _dwDNNPlugin_supportsFormat(bool* res, dwPrecision precision,
                                     dwDNNPluginLayout pluginLayout, _dwDNNPluginHandle_t handle);

Furthermore, it must allow configuring the format:

dwStatus _dwDNNPlugin_configureWithFormat(const dwBlobSize* inputDimensions,
                                          int32_t numInputs, const dwBlobSize* outputDimensions,
                                          int32_t numOutputs, dwPrecision precision,
                                          dwDNNPluginLayout layout, int32_t maxBatchSize,
                                          _dwDNNPluginHandle_t handle);

In the FCPlugin example, we support single and half precision, and only the planar (NCHW) layout:

bool FCPlugin::supportsFormat(dwPrecision precision, dwDNNPluginLayout layout) const
{
    return (precision == DW_PRECISION_FP32 || precision == DW_PRECISION_FP16) && layout == DW_DNN_PLUGIN_LAYOUT_NCHW;
}

Since there is only one layout supported, precision is the only configurable parameter:

void FCPlugin::configureWithFormat(dwPrecision precision, dwDNNPluginLayout layout)
{
    m_precision = precision;
}

Hence, the implementations of the interface functions look like this:

dwStatus _dwDNNPlugin_supportsFormat(bool* res, dwPrecision precision,
                                     dwDNNPluginLayout pluginLayout, _dwDNNPluginHandle_t handle)
{
    FCPlugin* plugin = reinterpret_cast<FCPlugin*>(handle);
    *res = plugin->supportsFormat(precision, pluginLayout);
    return DW_SUCCESS;
}

dwStatus _dwDNNPlugin_configureWithFormat(const dwBlobSize* inputDimensions,
                                          int32_t numInputs, const dwBlobSize* outputDimensions,
                                          int32_t numOutputs, dwPrecision precision,
                                          dwDNNPluginLayout layout, int32_t maxBatchSize,
                                          _dwDNNPluginHandle_t handle)
{
    FCPlugin* plugin = reinterpret_cast<FCPlugin*>(handle);
    plugin->configureWithFormat(precision, layout);
    return DW_SUCCESS;
}

Inference

A plugin must implement a function to perform inference:

dwStatus _dwDNNPlugin_enqueue(int32_t batchSize, const void* const* inputs, void** outputs,
                              void* workspace, cudaStream_t stream, _dwDNNPluginHandle_t handle);

In the FCPlugin example, we use CuDNN and cuBLAS to implement the fully-connected layer: cuBLAS for the matrix multiplication and CuDNN for adding the bias.

int32_t FCPlugin::enqueue(int32_t batchSize, const void* const* inputs, void** outputs, void*, cudaStream_t stream)
{
    float32_t onef{1.0f}, zerof{0.0f};
    __half oneh = __float2half(1.0f), zeroh = __float2half(0.0f);

    cublasSetStream(m_cublas, stream);
    cudnnSetStream(m_cudnn, stream);

    if (m_precision == DW_PRECISION_FP32)
    {
        CHECK_CUDA_ERROR(cublasSgemm(m_cublas, CUBLAS_OP_T, CUBLAS_OP_N, m_numOutputChannels, batchSize, m_numInputChannels, &onef,
                                     reinterpret_cast<const float32_t*>(m_deviceKernel), m_numInputChannels,
                                     reinterpret_cast<const float32_t*>(inputs[0]), m_numInputChannels, &zerof,
                                     reinterpret_cast<float32_t*>(outputs[0]), m_numOutputChannels));
    }
    else
    {
        CHECK_CUDA_ERROR(cublasHgemm(m_cublas, CUBLAS_OP_T, CUBLAS_OP_N, m_numOutputChannels, batchSize, m_numInputChannels, &oneh,
                                     reinterpret_cast<const __half*>(m_deviceKernel), m_numInputChannels,
                                     reinterpret_cast<const __half*>(inputs[0]), m_numInputChannels, &zeroh,
                                     reinterpret_cast<__half*>(outputs[0]), m_numOutputChannels));
    }

    if (m_biasWeights.count)
    {
        cudnnDataType_t cudnnDT = m_precision == DW_PRECISION_FP32 ? CUDNN_DATA_FLOAT : CUDNN_DATA_HALF;
        CHECK_CUDA_ERROR(cudnnSetTensor4dDescriptor(m_srcDescriptor, CUDNN_TENSOR_NCHW, cudnnDT, 1, m_numOutputChannels, 1, 1));
        CHECK_CUDA_ERROR(cudnnSetTensor4dDescriptor(m_dstDescriptor, CUDNN_TENSOR_NCHW, cudnnDT, batchSize, m_numOutputChannels, 1, 1));
        CHECK_CUDA_ERROR(cudnnAddTensor(m_cudnn, &onef, m_srcDescriptor, m_deviceBias, &onef, m_dstDescriptor, outputs[0]));
    }
    return 0;
}

Finally, the implementation of the interface function is:

dwStatus _dwDNNPlugin_enqueue(int32_t batchSize, const void* const* inputs, void** outputs,
                              void* workspace, cudaStream_t stream, _dwDNNPluginHandle_t handle)
{
    FCPlugin* plugin = reinterpret_cast<FCPlugin*>(handle);
    plugin->enqueue(batchSize, inputs, outputs, workspace, stream);
    return DW_SUCCESS;
}

Building the Plugin as a Dynamic Library

The plugin code must be compiled into a dynamic library for each processor architecture that is to be supported.

In the FCPlugin example, we add the plugin to the samples to simplify the build process. First, we create a folder named fcplugin under samples and add a CMakeLists.txt in this folder:

project(fc_plugin C CXX)
#-------------------------------------------------------------------------------
# Project files
#-------------------------------------------------------------------------------
set(SOURCES
    FCPlugin.cpp
)
set(LIBRARIES
    dw_samples_framework
    ${DriveWorks_LIBRARIES}
    cudnn
)
#-------------------------------------------------------------------------------
# Final target
#-------------------------------------------------------------------------------
cuda_add_library(${PROJECT_NAME} SHARED ${SOURCES})
CUDA_ADD_CUBLAS_TO_TARGET(${PROJECT_NAME})
target_link_libraries(${PROJECT_NAME} PRIVATE ${LIBRARIES})
#-------------------------------------------------------------------------------
# Install target
#-------------------------------------------------------------------------------
sdk_add_sample(${PROJECT_NAME})

Append fcplugin to the SAMPLES list in CMakeLists.txt under samples:

set(SAMPLES framework;egomotion;sensors;features;rig;visualization;sfm;dnn;laneDetection;colorcorrection;rectifier;ipc;hello_world;image;stereo;freespaceperception;drivenet;maps;template;icp;lidar_accumulator;cameraBlindness;calibration;vehicleio;dataspeedBridge;waitcondition;pointcloudprocessor;fcplugin)

Finally, run cmake and the preferred build system to compile the library. This generates libfc_plugin.so.

Generating TensorRT Model with Plugins

In order to run a DNN model with plugins in DriveWorks, it must, like any other DNN model, first be converted to a TensorRT model. This is achieved by running the TensorRT Optimizer Tool with an extra parameter that configures the plugins.

The plugin configuration is provided in a JSON file. For each custom layer, this file must provide the path to the plugin's dynamic library:

{
  "__comment" : "Please list here all the plugin paths (absolute path or relative to tensorRT_optimization tool) and the layers they apply to.",
  "plugin1_path" : ["layer1", "layer2", "layer3"],
  "plugin2_path" : ["layer4", "layer5"],
  "plugin3_path" : ["layer6"]
}

In the FCPlugin example, there is only one custom layer that requires a plugin. The name of the layer is ip2, and the path of the plugin is libfc_plugin.so, assuming it is located in the same folder as the TensorRT Optimizer Tool. Therefore, we can create a plugin.json file as follows:

{
  "libfc_plugin.so" : ["ip2"]
}

Finally, we can generate the model by:

./tensorRT_optimization --modelType=caffe --prototxt=mnist.prototxt --caffemodel=mnist.caffemodel --outputBlobs=prob --pluginConfig=plugin.json --out=mnist_with_plugin.dnn

Loading TensorRT Model with Plugins

Loading a TensorRT model with plugins generated via the TensorRT Optimizer Tool requires the plugin configuration to be provided at initialization of dwDNNHandle_t.

The plugin configuration structures look like this:

typedef struct
{
    const dwDNNCustomLayer* customLayers;
    size_t numCustomLayers;
} dwDNNPluginConfiguration;

typedef struct
{
    const char* pluginLibraryPath;
    const char* layerName;
} dwDNNCustomLayer;

Just like in the JSON file used at model generation (see plugin.json above), the layer name and the path to the plugin that implements the layer must be provided for each custom layer.

In the FCPlugin example:

dwDNNPluginConfiguration pluginConf{};
pluginConf.numCustomLayers = 1;

dwDNNCustomLayer customLayer{};
customLayer.layerName = "ip2";
// Assuming libfc_plugin.so is in the same folder as libdriveworks.so
std::string pluginPath = "libfc_plugin.so";
customLayer.pluginLibraryPath = pluginPath.c_str();

pluginConf.customLayers = &customLayer;

// Initialize DNN from a TensorRT file with plugin configuration
CHECK_DW_ERROR(dwDNN_initializeTensorRTFromFile(&dnn, "mnist_with_plugin.dnn", &pluginConf,
                                                DW_PROCESSOR_TYPE_GPU, sdk));

See DNN Plugin Sample for more details.