# Optimize and serve ONNX and TensorRT ensemble on PyTriton
This example shows how to optimize ONNX and TensorRT models and serve them as a zero-copy ensemble with the PyTriton server.
## Requirements
The example requires CUDA 11.8 and the `torch` package, which can be installed in your current environment using pip:
```shell
pip install torch
```
Alternatively, you can use the NVIDIA PyTorch container:
```shell
docker run -it --gpus 1 --shm-size 8gb -v ${PWD}:${PWD} -w ${PWD} nvcr.io/nvidia/pytorch:22.12-py3 bash
```
If you choose to use the container, we recommend installing the NVIDIA Container Toolkit.
The example must be executed from its directory:

```shell
cd examples/14_optimize_and_serve_onnx_and_tensorrt_ensemble_on_pytriton
```
## Generate TensorRT model
The TensorRT plan must be generated on the target machine, because plans are not portable across GPUs and driver versions:

```shell
python ./generate_tensorrt_model.py
```
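The actual generator ships with the example as `generate_tensorrt_model.py`. Purely as an illustration of the flow, here is a minimal sketch that exports a toy `torch.nn.Linear` model to ONNX and builds a plan from it with the TensorRT Python API; the model, file names, and shapes are assumptions, not the script's real contents:

```python
# Hypothetical sketch of a TensorRT plan generator; the real script in this
# example may differ. Exports a toy linear model to ONNX, then builds a plan.
import tensorrt as trt
import torch

# Export a small linear model to ONNX (names and shapes are illustrative).
model = torch.nn.Linear(3, 3).eval().cuda()
dummy_input = torch.randn(2, 3, device="cuda")
torch.onnx.export(model, dummy_input, "linear.onnx",
                  input_names=["input"], output_names=["output"])

# Build the TensorRT engine on this machine.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("linear.onnx", "rb") as onnx_file:
    if not parser.parse(onnx_file.read()):
        raise RuntimeError(parser.get_error(0))
config = builder.create_builder_config()
plan = builder.build_serialized_network(network, config)
with open("linear.plan", "wb") as plan_file:
    plan_file.write(plan)
```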
## Run model optimization
In the next step, the optimization process is performed for both models:

```shell
python ./optimize.py
```
Once the process finishes, the `onnx_linear.nav` and `tensorrt_linear.nav` packages are created in the current working directory.
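`optimize.py` drives Triton Model Navigator to produce those packages. As a rough, non-authoritative sketch of the idea, assuming Model Navigator's `nav.onnx.optimize` and `nav.tensorrt.optimize` entry points and the illustrative file names from the sketch above:

```python
# Hypothetical sketch of an optimize script; the real optimize.py may differ.
import model_navigator as nav
import numpy as np

# A dataloader is an iterable of sample inputs used for profiling and
# correctness verification; the shape here is an assumption.
dataloader = [np.random.rand(2, 3).astype(np.float32) for _ in range(10)]

# Optimize the ONNX model and save the resulting .nav package.
onnx_package = nav.onnx.optimize(model="linear.onnx", dataloader=dataloader)
nav.package.save(onnx_package, "onnx_linear.nav")

# Optimize the TensorRT plan and save its package as well.
trt_package = nav.tensorrt.optimize(model="linear.plan", dataloader=dataloader)
nav.package.save(trt_package, "tensorrt_linear.nav")
```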
## Serving the models with NVIDIA PyTriton
Before running the server and client, install NVIDIA PyTriton:

```shell
pip install nvidia-pytriton
```
Next, start the PyTriton server with the packages generated in the previous step:

```shell
python ./serve.py
```
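For orientation, a PyTriton server binds an inference callable to a model name and exposes it over Triton's endpoints. The sketch below shows that general pattern only; the real `serve.py` wires in the optimized runners from the generated `.nav` packages, and the tensor names, shapes, and model name here are assumptions:

```python
# Generic PyTriton serving pattern; not the literal contents of serve.py.
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

@batch
def infer_fn(input__0):
    # Placeholder computation; serve.py would instead call the optimized
    # ONNX/TensorRT runners loaded from the .nav packages.
    return {"output__0": input__0 * 2.0}

with Triton() as triton:
    triton.bind(
        model_name="linear",  # assumed name, for illustration only
        infer_func=infer_fn,
        inputs=[Tensor(name="input__0", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="output__0", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=64),
    )
    triton.serve()  # blocks and serves HTTP/gRPC endpoints
```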
Use the client to test the model deployment on PyTriton:

```shell
python ./client.py
```
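A client can be as simple as the sketch below, assuming the model name and input shape from the serving sketch above; the shipped `client.py` may differ:

```python
# Hypothetical client sketch; the example's client.py may differ.
import numpy as np
from pytriton.client import ModelClient

# Connect to the local PyTriton server and send one random batch.
with ModelClient("localhost", "linear") as client:
    input_batch = np.random.rand(8, 3).astype(np.float32)
    result = client.infer_batch(input_batch)
    print(result)
```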