Using TensorRT to Optimize Caffe Models in Python
TensorRT 3.0.0 includes a Python API for loading and optimizing Caffe models, which can then be executed and stored as portable PLAN files. This notebook walks through that workflow.
In Python, the TensorRT library is referred to as tensorrt. For the Early Access release you should have been provided a wheel file containing the API, which can be installed with pip (e.g. for Python 2.7 on Ubuntu 16.04: pip install tensorrt-3.0.0-cp27-cp27mu-linux_x86_64.whl).
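Before going further, it can help to confirm that the wheel was installed into the interpreter you are actually running. The following is a minimal sketch using only the standard library. It assumes Python 3.4 or later, and the helper name `is_installed` is hypothetical, not part of TensorRT.

```python
import importlib.util

def is_installed(package_name):
    """Return True if a top-level package can be found by this interpreter."""
    return importlib.util.find_spec(package_name) is not None

# Packages this notebook relies on.
for name in ("tensorrt", "pycuda", "numpy"):
    print(name, "found" if is_installed(name) else "MISSING")
```

If tensorrt is reported as missing, the wheel was probably installed for a different Python version than the one running the notebook.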
You can import tensorrt as you would any other package:
In [1]:
import tensorrt as trt
A few other tools are typically used alongside tensorrt. We use PyCUDA to handle the CUDA operations needed to allocate memory on your GPU, transfer data to the GPU, and copy results back to the CPU. We use numpy as our primary way to store data.
In [2]:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
For this example we also import an image processing library (Pillow in this case), randint to pick a random test image, and imshow from matplotlib to display it.
In [3]:
from random import randint
from PIL import Image
from matplotlib.pyplot import imshow #to show test case
Since we are converting a Caffe model, we also need the caffeparser, which is located in tensorrt.parsers.
In [4]:
from tensorrt import parsers
Typically the first thing you will do is create a logger, which is used in many places during the model conversion and inference process. We provide a simple logger implementation in tensorrt.infer.ConsoleLogger, but in the RC you will be able to define your own as well.
In [5]:
G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.ERROR)
Now we define some constants for our model, which in this example is used to classify digits from the MNIST dataset.
In [6]:
INPUT_LAYERS = ['data']
OUTPUT_LAYERS = ['prob']
INPUT_H = 28
INPUT_W = 28
OUTPUT_SIZE = 10
We also define some paths. Please change these to reflect where you placed the data included with the samples.
In [7]:
MODEL_PROTOTXT = './data/mnist/mnist.prototxt'
CAFFE_MODEL = './data/mnist/mnist.caffemodel'
DATA = './data/mnist/'
IMAGE_MEAN = './data/mnist/mnist_mean.binaryproto'
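Since these paths depend on where the sample data was placed, a quick existence check can catch a wrong path before the engine build fails. This is a small sketch under that assumption; `missing_files` is a hypothetical helper, and the list repeats the paths defined above.

```python
import os

def missing_files(paths):
    """Return the subset of `paths` that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]

# Same values as the constants above; adjust to your own install location.
required = [
    './data/mnist/mnist.prototxt',
    './data/mnist/mnist.caffemodel',
    './data/mnist/mnist_mean.binaryproto',
]

missing = missing_files(required)
if missing:
    print("Missing files:", missing)
else:
    print("All model files found")
```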
The first step is to create our engine. The Python API provides some utilities that make this much simpler. Here we use the Caffe model converter utility in tensorrt.utils. We provide it with a logger, the path to the model prototxt, the model file, the max batch size, the max workspace size, the output layer(s), and the data type of the weights.
In [8]:
engine = trt.utils.caffe_to_trt_engine(G_LOGGER,
                                       MODEL_PROTOTXT,
                                       CAFFE_MODEL,
                                       1,        # max batch size
                                       1 << 20,  # max workspace size
                                       OUTPUT_LAYERS,
                                       trt.infer.DataType.FLOAT)
Building Engine
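The workspace argument `1 << 20` is a bit shift. Assuming the size is given in bytes, as it is in the C++ API, it requests 1 MiB of scratch space. The sketch below only illustrates that arithmetic; `mib` is a hypothetical helper, not a TensorRT function.

```python
def mib(n):
    """Return n mebibytes expressed in bytes."""
    return n << 20

print(mib(1))   # 1048576, the value passed above as 1 << 20
print(mib(16))  # 16777216
```

Larger networks may need a bigger workspace than this small MNIST model does.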
Now let’s generate a test case for our engine.
In [9]:
rand_file = randint(0,9)
path = DATA + str(rand_file) + '.pgm'
im = Image.open(path)
%matplotlib inline
imshow(np.asarray(im))
arr = np.array(im)
img = arr.ravel()
print("Test Case: " + str(rand_file))
Test Case: 7
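The call to `ravel()` above flattens the 28x28 image in row-major order, so the pixel at (row, col) ends up at index row * INPUT_W + col. The following sketch shows this with a synthetic array standing in for `np.array(im)`. It does not read any of the sample images.

```python
import numpy as np

INPUT_H, INPUT_W = 28, 28

# Synthetic stand-in for np.array(im): an all-black image with one white pixel.
arr = np.zeros((INPUT_H, INPUT_W), dtype=np.uint8)
row, col = 3, 5
arr[row, col] = 255

img = arr.ravel()  # row-major (C order) flattening

print(img.shape)                      # (784,)
print(int(img[row * INPUT_W + col]))  # 255
```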
We now need to subtract the mean from the input image. The mean is stored in a .binaryproto file, which we read with the caffeparser.
In [10]:
parser = parsers.caffeparser.create_caffe_parser()
mean_blob = parser.parse_binary_proto(IMAGE_MEAN)
parser.destroy()
#NOTE: This is different than the C++ API, you must provide the size of the data
mean = mean_blob.get_data(INPUT_W ** 2)
data = np.empty([INPUT_W ** 2])
for i in range(INPUT_W ** 2):
    data[i] = float(img[i]) - mean[i]
mean_blob.destroy()
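The element-by-element loop above can also be written as a single numpy expression. This is a sketch with synthetic values, assuming the mean can be converted to a numpy array with `np.asarray`; that has not been verified against the object returned by `get_data`. `subtract_mean` is a hypothetical helper.

```python
import numpy as np

def subtract_mean(img, mean):
    """Return img - mean element-wise as float32, without a Python loop."""
    img = np.asarray(img, dtype=np.float32)
    mean = np.asarray(mean, dtype=np.float32)
    if img.shape != mean.shape:
        raise ValueError("image and mean must have the same shape")
    return img - mean

# Synthetic three-pixel example; the real arrays have INPUT_W ** 2 elements.
pixels = np.array([0, 128, 255], dtype=np.uint8)
mean_values = [10.0, 20.5, 30.0]
result = subtract_mean(pixels, mean_values)
print(result.dtype)  # float32
```

Converting to float32 before subtracting keeps negative results intact, which would not be the case if both arrays were unsigned integers.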
Now we need to create a runtime for inference and an execution context for our engine.
In [11]:
runtime = trt.infer.create_infer_runtime(G_LOGGER)
context = engine.create_execution_context()
Now we can run inference. We start by making sure our data is in the correct datatype (FP32 for this model), then create an empty array on the CPU to hold the inference results.
In [12]:
assert(engine.get_nb_bindings() == 2)
#convert the mean-subtracted input data to Float32
img = data.astype(np.float32)
#create output array to receive data
output = np.empty(OUTPUT_SIZE, dtype = np.float32)
Now we allocate memory on the GPU with PyCUDA; these buffers are later passed to the engine as bindings. The size of each allocation is the size of the input or expected output multiplied by the batch size.
In [13]:
d_input = cuda.mem_alloc(1 * img.size * img.dtype.itemsize)
d_output = cuda.mem_alloc(1 * output.size * output.dtype.itemsize)
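To make the allocation sizes concrete, the byte counts can be computed on the CPU without touching the GPU. This sketch uses placeholder arrays with the same shapes and dtype as the notebook's input and output; `nbytes_for` is a hypothetical helper.

```python
import numpy as np

def nbytes_for(batch_size, array):
    """Bytes needed for `batch_size` copies of `array` on the device."""
    return batch_size * array.size * array.dtype.itemsize

# Placeholders shaped like the MNIST input (28 * 28) and output (10 classes).
host_input = np.zeros(28 * 28, dtype=np.float32)
host_output = np.empty(10, dtype=np.float32)

print(nbytes_for(1, host_input))   # 3136
print(nbytes_for(1, host_output))  # 40
```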
The engine needs its bindings provided as pointers to GPU memory. PyCUDA lets us obtain these pointers by casting the memory allocations to ints.
In [14]:
bindings = [int(d_input), int(d_output)]
We also create a CUDA stream in which to run inference.
In [15]:
stream = cuda.Stream()
Now we transfer the data to the GPU, run inference, and then copy the results back.
In [16]:
#transfer input data to device
cuda.memcpy_htod_async(d_input, img, stream)
#execute model
context.enqueue(1, bindings, stream.handle, None)
#transfer predictions back
cuda.memcpy_dtoh_async(output, d_output, stream)
#synchronize the stream
stream.synchronize()
Now we have our results. We can take the argmax of the output to get a prediction.
In [17]:
print("Test Case: " + str(rand_file))
print("Prediction: " + str(np.argmax(output)))
Test Case: 7
Prediction: 7
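Beyond the single argmax, it is sometimes useful to look at the few most likely classes. The sketch below ranks a made-up probability vector; the values are illustrative only and are not output from the engine. `top_k` is a hypothetical helper.

```python
import numpy as np

def top_k(probs, k=3):
    """Return the k (class index, probability) pairs with highest probability."""
    probs = np.asarray(probs)
    order = np.argsort(probs)[::-1][:k]  # indices sorted by descending score
    return [(int(i), float(probs[i])) for i in order]

# Made-up scores for the 10 digit classes (not real engine output).
probs = np.array([0.01, 0.02, 0.03, 0.05, 0.0, 0.0, 0.0, 0.80, 0.06, 0.03],
                 dtype=np.float32)

print([i for i, _ in top_k(probs)])  # [7, 8, 3]
```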
We can also save our engine to a file for later use.
In [18]:
trt.utils.write_engine_to_file("./data/mnist/new_mnist.engine", engine.serialize())
Out[18]:
True
You can then load this engine later by using tensorrt.utils.load_engine.
In [19]:
new_engine = trt.utils.load_engine(G_LOGGER, "./data/mnist/new_mnist.engine")
As a final step, we clean up our context, both engines, and the runtime.
In [20]:
context.destroy()
engine.destroy()
new_engine.destroy()
runtime.destroy()