Using TensorRT to Optimize Caffe Models in Python

TensorRT 3.0.0 includes a Python API for loading and optimizing Caffe models, which can then be executed and stored as portable PLAN files. The following notebook walks through that workflow.

In Python, the TensorRT library is referred to as tensorrt. For the Early Access release you should have been provided a wheel file containing the API, which can be installed with pip (e.g. for Python 2.7 on Ubuntu 16.04: pip install tensorrt-3.0.0-cp27-cp27mu-linux_x86_64.whl).

You can import tensorrt as you would any other package.

In [1]:
import tensorrt as trt

There are also some tools commonly used alongside tensorrt. We use PyCUDA to handle the CUDA operations needed to allocate memory on the GPU, transfer data to it, and copy results back to the CPU, and we use numpy to store our data.

In [2]:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

For this example we also want to import an image-processing library (Pillow in this case) and randint.

In [3]:
from random import randint
from PIL import Image
from matplotlib.pyplot import imshow #to show test case

Since we are converting a Caffe model, we also need to use the caffeparser, which is located in tensorrt.parsers.

In [4]:
from tensorrt import parsers

Typically the first thing you will do is create a logger, which is used in many places during the model conversion and inference process. We provide a simple logger implementation in tensorrt.infer.ConsoleLogger, but in the RC you will be able to define your own as well.

In [5]:
G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.ERROR)

Now we will define some constants for our model, which in this example we will use to classify digits from the MNIST dataset.

In [6]:
INPUT_LAYERS = ['data']
OUTPUT_LAYERS = ['prob']
INPUT_H = 28
INPUT_W =  28
OUTPUT_SIZE = 10

We are also going to define some paths. Please change these to reflect where you placed the data included with the samples.

In [7]:
MODEL_PROTOTXT = './data/mnist/mnist.prototxt'
CAFFE_MODEL = './data/mnist/mnist.caffemodel'
DATA = './data/mnist/'
IMAGE_MEAN = './data/mnist/mnist_mean.binaryproto'

The first step is to create our engine. The Python API provides some utilities to make this much simpler. Here we use the Caffe model converter utility in tensorrt.utils, providing it a logger, the path to the model prototxt, the path to the model file, the max batch size, the max workspace size, the output layer(s), and the data type of the weights.

In [8]:
engine = trt.utils.caffe_to_trt_engine(G_LOGGER,
                                       MODEL_PROTOTXT,
                                       CAFFE_MODEL,
                                       1,
                                       1 << 20,
                                       OUTPUT_LAYERS,
                                       trt.infer.DataType.FLOAT)
Building Engine
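
As a side note, on GPUs with native FP16 support you may be able to build a reduced-precision engine by passing a different weight type. The sketch below assumes trt.infer.DataType.HALF is available in your TensorRT build; it is not part of this walkthrough.

#sketch: build an FP16 engine instead (assumes native FP16 support and that
#trt.infer.DataType.HALF is available in your build)
fp16_engine = trt.utils.caffe_to_trt_engine(G_LOGGER,
                                            MODEL_PROTOTXT,
                                            CAFFE_MODEL,
                                            1,        #max batch size
                                            1 << 20,  #max workspace size
                                            OUTPUT_LAYERS,
                                            trt.infer.DataType.HALF)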

Now let’s generate a test case for our engine.

In [9]:
rand_file = randint(0,9)
path = DATA + str(rand_file) + '.pgm'
im = Image.open(path)
%matplotlib inline
imshow(np.asarray(im))
arr = np.array(im)
img = arr.ravel()
print("Test Case: " + str(rand_file))
Test Case: 7
(The notebook displays the selected MNIST test digit here via imshow.)

We now need to subtract the mean from the input image. The mean is stored in a .binaryproto file, which we read with the caffeparser.

In [10]:
parser = parsers.caffeparser.create_caffe_parser()
mean_blob = parser.parse_binary_proto(IMAGE_MEAN)
parser.destroy()
#NOTE: Unlike the C++ API, you must provide the size of the data here
mean = mean_blob.get_data(INPUT_W ** 2)
#subtract the per-pixel mean from the flattened image
data = np.empty([INPUT_W ** 2])
for i in range(INPUT_W ** 2):
    data[i] = float(img[i]) - mean[i]
mean_blob.destroy()
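
If you prefer a vectorized form, the element-wise loop above could be replaced (before calling mean_blob.destroy()) with something like the sketch below; it relies only on indexing mean, which the loop already does.

#sketch: equivalent vectorized mean subtraction (run before mean_blob.destroy())
mean_arr = np.array([mean[i] for i in range(INPUT_W ** 2)], dtype=np.float32)
data = img.astype(np.float32) - mean_arr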

Now we need to create a runtime for inference and a context for our engine.

In [11]:
runtime = trt.infer.create_infer_runtime(G_LOGGER)
context = engine.create_execution_context()

Now we can run inference. First we make sure the mean-subtracted data is in the correct datatype (FP32 for this model), then we create an empty array on the CPU to hold the results of inference.

In [12]:
assert(engine.get_nb_bindings() == 2)
#convert the mean-subtracted input data to Float32
img = data.astype(np.float32)
#create output array to receive data
output = np.empty(OUTPUT_SIZE, dtype = np.float32)

Now we are going to allocate memory on the GPU with PyCUDA; these allocations will later be passed to the engine as bindings. The size of each allocation is the size of the input (or expected output) multiplied by the batch size.

In [13]:
d_input = cuda.mem_alloc(1 * img.size * img.dtype.itemsize)
d_output = cuda.mem_alloc(1 * output.size * output.dtype.itemsize)

The engine needs bindings provided as pointers to GPU memory. PyCUDA lets us do this by casting the memory allocations to ints.

In [14]:
bindings = [int(d_input), int(d_output)]

We are also going to create a CUDA stream to run inference in.

In [15]:
stream = cuda.Stream()

Now we are going to transfer the data to the GPU, run inference, and then copy the results back.

In [16]:
#transfer input data to device
cuda.memcpy_htod_async(d_input, img, stream)
#execute model
context.enqueue(1, bindings, stream.handle, None)
#transfer predictions back
cuda.memcpy_dtoh_async(output, d_output, stream)
#synchronize the stream
stream.synchronize()
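
If you want to classify more images, these steps can be wrapped in a small helper. The sketch below reuses only the objects and calls shown above; the function name infer_digit is our own.

def infer_digit(context, stream, bindings, d_input, d_output, img, output, batch_size=1):
    #copy the preprocessed image to the GPU
    cuda.memcpy_htod_async(d_input, img, stream)
    #run the engine asynchronously on the stream
    context.enqueue(batch_size, bindings, stream.handle, None)
    #copy the predictions back and wait for the stream to finish
    cuda.memcpy_dtoh_async(output, d_output, stream)
    stream.synchronize()
    return np.argmax(output)

For example, infer_digit(context, stream, bindings, d_input, d_output, img, output) would reproduce the prediction shown below.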

Now we have our results. We can simply take the argmax to get a prediction.

In [17]:
print("Test Case: " + str(rand_file))
print ("Prediction: " + str(np.argmax(output)))
Test Case: 7
Prediction: 7
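
Since the output layer in this prototxt is named prob (a softmax in the standard Caffe MNIST model), the entries of output are class probabilities, so you can also report how confident the model is:

pred = np.argmax(output)
print("Confidence: " + str(output[pred]))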

We can also save our engine to a file to use later.

In [18]:
trt.utils.write_engine_to_file("./data/mnist/new_mnist.engine", engine.serialize())
Out[18]:
True

You can then load this engine later by using tensorrt.utils.load_engine.

In [19]:
new_engine = trt.utils.load_engine(G_LOGGER, "./data/mnist/new_mnist.engine")
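
The deserialized engine can be used just like the one built from Caffe. As a sketch (assuming the binding order matches the original engine, which it should for the same network), you could create a context from it, rerun the same buffers, and free it again:

#sketch: run the same inference through the deserialized engine
new_context = new_engine.create_execution_context()
cuda.memcpy_htod_async(d_input, img, stream)
new_context.enqueue(1, bindings, stream.handle, None)
cuda.memcpy_dtoh_async(output, d_output, stream)
stream.synchronize()
print("Prediction from loaded engine: " + str(np.argmax(output)))
new_context.destroy()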

And as a final step, we are going to clean up our contexts, engines, and runtime.

In [20]:
context.destroy()
engine.destroy()
new_engine.destroy()
runtime.destroy()