- Note
- SW Release Applicability: This tutorial is applicable to modules in both NVIDIA DriveWorks and NVIDIA DRIVE Software releases.
DNN Plugins Overview
The DNN Plugins module enables DNN models that are composed of layers that are not supported by TensorRT to benefit from the efficiency of TensorRT.
Integrating such models to DriveWorks can be divided into the following steps:
- Implement custom layers as dwDNNPlugin and compile as a dynamic library
- Generate TensorRT model with plugins using tensorRT_optimization tool
- Load the model and the corresponding plugins at initialization
Implementing Custom DNN Layers as Plugins
- Note
- Dynamic libraries representing DNN layers must depend on the same TensorRT, CuDNN and CUDA version as DriveWorks if applicable.
A DNN plugin in DriveWorks requires implementing a set of pre-defined functions. The declarations of these functions are located in dw/dnn/plugins/DNNPlugin.h
.
These functions have _dwDNNPluginHandle_t
in common. This is a custom variable that can be constructed at initialization and used or modified during the lifetime of the plugin.
In this tutorial, we are going to implement and integrate a fully-connected layer (FCPlugin) for an MNIST network model. See DNN Plugin Sample for the corresponding sample.
The fully-connected layer in the MNIST network model looks like this:
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1.0
}
param {
lr_mult: 2.0
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
In FCPlugin example, we shall implement a class to store the members that must be available throughout the lifetime of the network model as well as to provide the methods that are required by dwDNNPlugin. Two of these members are cuBLAS and CuDNN handles, which we shall use to perform the inference. This class will be the custom variable that will be passed to each function.
class FCPlugin
{
int32_t numOutputChannels);
FCPlugin(const void* data, size_t length);
~FCPlugin();
int initialize();
void terminate();
int32_t getNbOutputs() const;
size_t getWorkspaceSize(int32_t) const;
int32_t enqueue(int32_t batchSize, const void* const* inputs, void** outputs, void*, cudaStream_t stream);
size_t getSerializationSize();
void serialize(void* buffer);
private:
template<typename T> void write(char*& buffer, const T& val);
template<typename T> void read(const char*& buffer, T& val);
void* copyToDevice(const void* data, size_t count);
void deserializeToDevice(const char*& hostBuffer, void*& deviceWeights, size_t size);
int32_t m_numOutputChannels, m_numInputChannels;
void* m_deviceKernel{nullptr};
void* m_deviceBias{nullptr};
cudnnHandle_t m_cudnn;
cublasHandle_t m_cublas;
cudnnTensorDescriptor_t m_srcDescriptor, m_dstDescriptor;
};
Initialization
A plugin can be initialized either directly from a structure dwDNNPluginWeights
:
or it can be initialized from serialized data:
const void* data, size_t length);
Both of these functions shall define how the plugin should be initialized from given input data.
In FCPlugin example, we shall implement the backend for both these initialization functions as constructors of the class and implement both of the initialization functions mentioned above to create an instance of FCPlugin class. Additionally, we shall define helper methods to be able to deserialize layer from data.
The first constructor implements initializing the plugin class from dwDNNPluginWeight
.
FCPlugin::FCPlugin(
const dwDNNPluginWeights *weights, int32_t numWeights, int32_t numOutputChannels)
: m_numOutputChannels(numOutputChannels)
{
m_kernelWeights = weights[0];
m_biasWeights = weights[1];
m_kernelWeights.
values = malloc(m_kernelWeights.count * type2size(m_kernelWeights.precision));
memcpy(const_cast<void*>(m_kernelWeights.values), weights[0].
values, m_kernelWeights.count * type2size(m_kernelWeights.precision));
m_biasWeights.values = malloc(m_biasWeights.count*type2size(m_biasWeights.precision));
memcpy(const_cast<void*>(m_biasWeights.values), weights[1].
values, m_biasWeights.count*type2size(m_biasWeights.precision));
m_numInputChannels = int(weights[0].count / numOutputChannels);
initialize();
}
This constructor allocates kernel and bias weights, and copies them from the given input. numOutputChannels
is a configurable parameter of the layer which is defined in the model's description.
The second constructor implements initializing the plugin class from binary data.
FCPlugin::FCPlugin(const void* data, size_t length)
{
const char* d = static_cast<const char*>(data);
read(d, m_numInputChannels);
read(d, m_numOutputChannels);
m_kernelWeights.count = m_numInputChannels * m_numOutputChannels;
m_kernelWeights.values = nullptr;
read(d, m_biasWeights.count);
m_biasWeights.values = nullptr;
read(d, m_precision);
deserializeToDevice(d, m_deviceKernel, m_kernelWeights.count*type2size(m_precision));
deserializeToDevice(d, m_deviceBias, m_biasWeights.count*type2size(m_precision));
initialize();
}
The deserialization of the data is dependent on the way the data is serialized. Firstly, this constructor deserializes the dimensions. Then, it allocates member variables accordingly and deserializes the weights.
Initialize function is used by both constructors to create CuDNN and cuBLAS handles, and to perform the operations that are common to both constructors.
void FCPlugin::initialize()
{
CHECK_CUDA_ERROR(cudnnCreate(&m_cudnn));
CHECK_CUDA_ERROR(cublasCreate(&m_cublas));
CHECK_CUDA_ERROR(cudnnCreateTensorDescriptor(&m_srcDescriptor));
CHECK_CUDA_ERROR(cudnnCreateTensorDescriptor(&m_dstDescriptor));
if (m_kernelWeights.values)
convertAndCopyToDevice(m_deviceKernel, m_kernelWeights);
if (m_biasWeights.values)
convertAndCopyToDevice(m_deviceBias, m_biasWeights);
}
Deserialization, in this case, is performed by simply copying the binary data from host to device.
void FCPlugin::deserializeToDevice(const char*& hostBuffer, void*& deviceWeights, size_t size)
{
deviceWeights = copyToDevice(hostBuffer, size);
hostBuffer += size;
}
The following function facilitates allocating data on device and copying data to device.
void* FCPlugin::copyToDevice(const void* data, size_t count)
{
void* deviceData;
CHECK_CUDA_ERROR(cudaMalloc(&deviceData, count));
CHECK_CUDA_ERROR(cudaMemcpy(deviceData, data, count, cudaMemcpyHostToDevice));
return deviceData;
}
The following function performs precision conversion if necessary before copying data to device.
void FCPlugin::convertAndCopyToDevice(
void*& deviceWeights,
const dwDNNPluginWeights& weights)
{
{
size_t size = weights.
count* type2size(m_precision);
void* buffer = malloc(size);
for (int64_t v = 0; v < weights.
count; ++v)
static_cast<float32_t*
>(buffer)[v] = __half2float(static_cast<const __half*>(weights.
values)[v]);
else
static_cast<__half*
>(buffer)[v] = __float2half(static_cast<const float32_t*>(weights.
values)[v]);
deviceWeights = copyToDevice(buffer, size);
free(buffer);
}
else
deviceWeights = copyToDevice(weights.
values, weights.
count * type2size(m_precision));
}
The initialization functions required by dwDNNPlugin can now be implemented as follows:
{
std::unique_ptr<FCPlugin> fcPlugin(new FCPlugin(weights, numWeights, 10));
return DW_SUCCESS;
}
const void* data, size_t length)
{
std::unique_ptr<FCPlugin> fcPlugin(new FCPlugin(data, length));
return DW_SUCCESS;
}
Releasing
The release function must destroy everything that was created by the plugin at initialization.
In FCPlugin example, we are going to destroy everything in the destructor.
FCPlugin::~FCPlugin()
{
if (m_kernelWeights.values)
{
free(const_cast<void*>(m_kernelWeights.values));
m_kernelWeights.values = nullptr;
}
if (m_biasWeights.values)
{
free(const_cast<void*>(m_biasWeights.values));
m_biasWeights.values = nullptr;
}
terminate();
}
void FCPlugin::terminate()
{
CHECK_CUDA_ERROR(cublasDestroy(m_cublas));
CHECK_CUDA_ERROR(cudnnDestroy(m_cudnn));
if (m_deviceKernel)
{
cudaFree(m_deviceKernel);
m_deviceKernel = nullptr;
}
if (m_deviceBias)
{
cudaFree(m_deviceBias);
m_deviceBias = nullptr;
}
}
{
FCPlugin *plugin = reinterpret_cast<FCPlugin*>(handle);
delete plugin;
return DW_SUCCESS;
}
Serialization
A plugin must provide functions to enable serialization from a buffer.
In FCPlugin example, we store the number of input channels, the number of output channels, the number of bias weights and precision. In addition, we store the kernel and bias weights. Therefore:
size_t FCPlugin::getSerializationSize()
{
return sizeof(m_numInputChannels) + sizeof(m_numOutputChannels) + sizeof(m_biasWeights.count) + sizeof(m_precision) +
(m_kernelWeights.count + m_biasWeights.count) * type2size(m_precision);
}
void FCPlugin::serialize(void* buffer)
{
char* d = static_cast<char*>(buffer);
write(d, m_numInputChannels);
write(d, m_numOutputChannels);
write(d, m_biasWeights.count);
write(d, m_precision);
convertAndCopyToBuffer(d, m_kernelWeights);
convertAndCopyToBuffer(d, m_biasWeights);
}
{
for (int64_t v = 0; v < weights.
count; ++v)
reinterpret_cast<float32_t*
>(buffer)[v] = __half2float(static_cast<const __half*>(weights.
values)[v]);
else
reinterpret_cast<__half*
>(buffer)[v] = __float2half(static_cast<const float32_t*>(weights.
values)[v]);
else
memcpy(buffer, weights.
values, weights.
count * type2size(m_precision));
buffer += weights.
count * type2size(m_precision);
}
The interface functions can be defined as:
{
FCPlugin *plugin = reinterpret_cast<FCPlugin*>(handle);
*serializationSize = plugin->getSerializationSize();
return DW_SUCCESS;
}
{
FCPlugin *plugin = reinterpret_cast<FCPlugin*>(handle);
plugin->serialize(buffer);
return DW_SUCCESS;
}
Getter Functions
There are multiple getter functions that are required to be implemented by plugin.
A plugin must provide a function for returning how many outputs it has:
In addition, it must provide a function for returning dimensions for each of the outputs based on the input dimensions:
int32_t outputIndex,
const dwBlobSize* inputDimensions,
Finally, it must provide a function to return temporary workspace size required by this layer.
In FCPlugin example, we know that there is only one output of the layer:
int32_t FCPlugin::getNbOutputs() const
{
return 1;
}
The dimensions of the output is expected to be NxCx1x1 where N is the batch size of the input, and C is the number of output channels.
dwBlobSize FCPlugin::getOutputDimensions(int32_t index,
const dwBlobSize* inputs, int32_t numInputDims)
{
}
FCPlugin does not need any temporary workspace during the generation of the model.
size_t FCPlugin::getWorkspaceSize(int32_t) const
{
return 0;
}
Finally, the interface functions can be defined as:
{
FCPlugin *plugin = reinterpret_cast<FCPlugin*>(handle);
*numOutputs = plugin->getNbOutputs();
return DW_SUCCESS;
}
int32_t outputIndex,
const dwBlobSize* inputDimensions,
{
FCPlugin *plugin = reinterpret_cast<FCPlugin*>(handle);
*outputDimensions = plugin->getOutputDimensions(outputIndex, inputDimensions, numInputs);
return DW_SUCCESS;
}
{
FCPlugin *plugin = reinterpret_cast<FCPlugin*>(handle);
*workspaceSize = plugin->getWorkspaceSize(maxBatchSize);
return DW_SUCCESS;
}
Formats
A plugin must implement a function to indicate whether a given format is supported:
Furthermore, it must allow configuring the format:
int32_t numInputs,
const dwBlobSize* outputDimensions,
In FCPlugin sample, we support single and half precision and only planar layout:
Since there is only one layout supported, precision is the only configurable parameter:
{
m_precision = precision;
}
Hence, the implementation of interface functions look like this:
{
FCPlugin *plugin = reinterpret_cast<FCPlugin*>(handle);
*res = plugin->supportsFormat(precision, pluginLayout);
return DW_SUCCESS;
}
int32_t numInputs,
const dwBlobSize* outputDimensions,
{
FCPlugin *plugin = reinterpret_cast<FCPlugin*>(handle);
plugin->configureWithFormat(precision, layout);
return DW_SUCCESS;
}
Inference
A plugin must implement a function to perform inference:
In FCPlugin example, we shall use CuDNN and cuBLAS to perform fully-connected layer. We shall use cuBLAS for matrix multiplication and CuDNN for adding bias.
int32_t FCPlugin::enqueue(int32_t batchSize, const void*const * inputs, void** outputs, void*, cudaStream_t stream)
{
__half oneh = __float2half(1.0f), zeroh = __float2half(0.0f);
cublasSetStream(m_cublas, stream);
cudnnSetStream(m_cudnn, stream);
{
CHECK_CUDA_ERROR(cublasSgemm(m_cublas, CUBLAS_OP_T, CUBLAS_OP_N, m_numOutputChannels, batchSize, m_numInputChannels, &onef,
reinterpret_cast<const float32_t*>(m_deviceKernel), m_numInputChannels,
reinterpret_cast<const float32_t*>(inputs[0]), m_numInputChannels, &zerof,
reinterpret_cast<float32_t*>(outputs[0]), m_numOutputChannels));
}
else
{
CHECK_CUDA_ERROR(cublasHgemm(m_cublas, CUBLAS_OP_T, CUBLAS_OP_N, m_numOutputChannels, batchSize, m_numInputChannels, &oneh,
reinterpret_cast<const __half*>(m_deviceKernel), m_numInputChannels,
reinterpret_cast<const __half*>(inputs[0]), m_numInputChannels, &zeroh,
reinterpret_cast<__half*>(outputs[0]), m_numOutputChannels));
}
if (m_biasWeights.count)
{
cudnnDataType_t cudnnDT = m_precision ==
DW_PRECISION_FP32 ? CUDNN_DATA_FLOAT : CUDNN_DATA_HALF;
CHECK_CUDA_ERROR(cudnnSetTensor4dDescriptor(m_srcDescriptor, CUDNN_TENSOR_NCHW, cudnnDT, 1, m_numOutputChannels, 1, 1));
CHECK_CUDA_ERROR(cudnnSetTensor4dDescriptor(m_dstDescriptor, CUDNN_TENSOR_NCHW, cudnnDT, batchSize, m_numOutputChannels, 1, 1));
CHECK_CUDA_ERROR(cudnnAddTensor(m_cudnn, &onef, m_srcDescriptor, m_deviceBias, &onef, m_dstDescriptor, outputs[0]));
}
return 0;
}
Finally, the implementation of the interface function is:
{
FCPlugin *plugin = reinterpret_cast<FCPlugin*>(handle);
plugin->enqueue(batchSize, inputs, outputs, workspace, stream);
return DW_SUCCESS;
}
Building plugin as a dynamic library
The plugin code must be compiled into a dynamic library for each processor architecture that are to be supported.
In FCPlugin example, we shall add plugin to the samples to simplify the building process. Firstly, we shall create a folder under samples
called fcplugin
and create a CMakeLists.txt
under this folder:
project(fc_plugin C CXX)
#-------------------------------------------------------------------------------
# Project files
#-------------------------------------------------------------------------------
set(SOURCES
FCPlugin.cpp
)
set(LIBRARIES
dw_samples_framework
${DriveWorks_LIBRARIES}
cudnn
)
#-------------------------------------------------------------------------------
# Final target
#-------------------------------------------------------------------------------
cuda_add_library(${PROJECT_NAME} SHARED ${SOURCES})
CUDA_ADD_CUBLAS_TO_TARGET(${PROJECT_NAME})
target_link_libraries(${PROJECT_NAME} PRIVATE ${LIBRARIES})
#-------------------------------------------------------------------------------
# Install target
#-------------------------------------------------------------------------------
sdk_add_sample(${PROJECT_NAME})
Append fcplugin
to the SAMPLES
list in CMakeLists.txt
under samples
:
set(SAMPLES framework;egomotion;sensors;features;rig;visualization;sfm;dnn;laneDetection;colorcorrection;rectifier;ipc;hello_world;image;stereo;freespaceperception;drivenet;maps;template;icp;lidar_accumulator;cameraBlindness;calibration;vehicleio;dataspeedBridge;waitcondition;pointcloudprocessor;fcplugin)
Finally, run cmake
and the preferred build system to compile the library. This will generate a libfc_plugin.so
.
Generating TensorRT Model with Plugins
In order to execute a DNN model with plugins into DriveWorks, like any other DNN model, it has to be converted to a TensorRT model. This is achieved by running TensorRT Optimizer Tool with an extra parameter to configure plugins.
Plugin configuration is provided in a json file. This file is expected to provide a path to a plugin as dynamic library for each custom layer:
{
"__comment" : "Please list here all the plugin paths (absolute path or relative to tensorRT_optimization tool) and the layers they apply to.",
"plugin1_path" : ["layer1", "layer2", "layer3"],
"plugin2_path" : ["layer4", "layer5"],
"plugin3_path" : ["layer6"]
}
In FCPlugin example, we have only one custom layer that requires a plugin. The name of the layer is ip2
, and path of the plugin is libfc_plugin.so
, assuming that it is located in the same folder as TensorRT Optimizer Tool . Therefore, we can create a plugin.json
file as follows:
{
"libfc_plugin.so" : ["ip2"],
"plugin2_path" : ["layer4", "layer5"],
"plugin3_path" : ["layer6"]
}
Finally, we can generate the model by:
./tensorRT_optimization --modelType=caffe --prototxt=mnist.prototxt --caffemodel=mnist.caffemodel --outputBlobs=prob --pluginConfig=plugin.json --out=mnist_with_plugin.dnn
Loading TensorRT Model with Plugins
Loading a TensorRT model with plugin generated via TensorRT Optimizer Tool requires plugin configuration to be defined at initialization of dwDNNHandle_t
.
The plugin configuration structures look like this:
typedef struct
{
size_t numCustomLayers;
typedef struct
{
const char* pluginLibraryPath;
const char* layerName;
Just like the json file used at model generation (see plugin json), for each layer, the layer name and path to the plugin that implements the layer must be provided.
In FCPlugin example:
std::string pluginPath = "libfc_plugin.so";
customLayer.pluginLibraryPath = pluginPath.c_str();
pluginConf.customLayers = &customLayer;
See DNN Plugin Sample for more details.