Triton Inference Server In-Process Python API [BETA]#
Starting with release 24.01, Triton Inference Server will include a Python package enabling developers to embed Triton Inference Server instances in their Python applications. The in-process Python API is designed to match the functionality of the in-process C API while providing a higher-level abstraction. At its core the API relies on a 1:1 Python binding of the C API and provides all the flexibility and power of the C API with a simpler-to-use interface.
[!Note] As the API is in BETA, please expect some changes as we test out different features and gather feedback. All feedback is welcome and we look forward to hearing from you!
Requirements#
The following instructions require a Linux system with Docker installed. For CUDA support, make sure your CUDA driver meets the requirements in the “NVIDIA Driver” section of the Deep Learning Framework support matrix: https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html
Installation#
The tutorial and Python API package are designed to be installed and run within the nvcr.io/nvidia/tritonserver:24.01-py3 docker image.
A set of convenience scripts is provided to create a docker image based on the nvcr.io/nvidia/tritonserver:24.01-py3 image with the Python API installed, plus additional dependencies required for the examples.
Triton Inference Server 24.01 + Python API#
Clone Repository#
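As a sketch, assuming the Python API tutorial lives in the triton-inference-server/tutorials repository alongside the StableDiffusion tutorial referenced below (the directory name is an assumption):

```bash
# Clone the tutorials repository and enter the Python API tutorial directory
# (directory name is an assumption; adjust if the repository layout differs)
git clone https://github.com/triton-inference-server/tutorials.git
cd tutorials/Triton_Inference_Server_Python_API
```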
Build triton-python-api:r24.01 Image#
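As a sketch, assuming the convenience scripts mentioned above include a build.sh wrapper that produces the triton-python-api:r24.01 tag by default (the script name and default tag are assumptions; use whatever scripts ship with the repository):

```bash
# Build a docker image based on nvcr.io/nvidia/tritonserver:24.01-py3 with the
# Python API and the example dependencies installed.
# Script name and default tag are assumptions; see the repository's scripts.
./build.sh
```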
Supported Backends#
The built image includes all the backends shipped by default in the tritonserver nvcr.io/nvidia/tritonserver:24.01-py3 container.
dali fil identity onnxruntime openvino python pytorch repeat square tensorflow tensorrt
Included Models#
The default build includes an identity model that can be used for exercising basic operations including sending input tensors of different data types. The identity model copies provided inputs of shape [-1, -1] to outputs of shape [-1, -1]. Inputs are named data_type_input and outputs are named data_type_output (e.g. string_input, string_output, fp16_input, fp16_output).
Hello World#
Start triton-python-api:r24.01 Container#
The following command starts a container and volume mounts the current directory as workspace.
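A typical invocation might look like the following; the GPU and shared-memory flags are assumptions that you may need to adjust for your system:

```bash
# Start the container with GPU access and mount the current directory as /workspace
docker run --rm -it --gpus all --shm-size 1G \
  -v ${PWD}:/workspace -w /workspace \
  triton-python-api:r24.01 /bin/bash
```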
Enter Python Shell#
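From inside the container, start an interactive interpreter:

```bash
python3
```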
Create and Start a Server Instance#
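A minimal sketch using the tritonserver package; the model repository path assumes the identity model from the default build is available under /workspace/identity-models (adjust the path to wherever your model repository lives):

```python
import tritonserver

# Create a server instance pointing at a model repository and start it.
# The repository path is an assumption for this example.
server = tritonserver.Server(model_repository="/workspace/identity-models")
server.start()
```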
List Models#
server.models()
Example Output#
server.models() returns a dictionary of the available models with their current state.
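An illustrative result for the default identity model (the exact versions and states depend on your model repository):

```
{('identity', 1): {'name': 'identity', 'version': 1, 'state': 'READY'}}
```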
Send an Inference Request#
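A sketch of a request to the identity model; the input name string_input follows the naming convention described in the Included Models section:

```python
# Look up the identity model and send a single string tensor
model = server.model("identity")
responses = model.infer(inputs={"string_input": [["hello world!"]]})
```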
Iterate through Responses#
model.infer() returns an iterator that can be used to process the results of an inference request.
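For example, the string output of the identity model can be read back as follows (the output name and the to_string_array() conversion follow the conventions used by the example models; treat this as a sketch):

```python
# Each response carries the output tensors for one result
for response in responses:
    print(response.outputs["string_output"].to_string_array())
```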
Example Output#
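For the request above, this prints something like:

```
[['hello world!']]
```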
Stable Diffusion#
This example is based on the Popular_Models_Guide/StableDiffusion tutorial.
Please note the following command will take many minutes depending on your hardware configuration and network connection.
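A sketch of the build step, assuming the repository's build script accepts options to include the Stable Diffusion framework and build its models (the script name and flags are assumptions; see the repository and the StableDiffusion tutorial for the exact command):

```bash
# Build an image that also contains the Stable Diffusion pipeline and its models.
# Script name and flags are assumptions; consult the repository's build scripts.
./build.sh --framework diffusion --build-models
```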
The built image includes all the backends shipped by default in the tritonserver nvcr.io/nvidia/tritonserver:24.01-py3 container.
dali fil identity onnxruntime openvino python pytorch repeat square tensorflow tensorrt
The diffusion build includes a stable_diffusion pipeline that takes a text prompt and returns a generated image. For more details on the models and pipeline please see the Popular_Models_Guide/StableDiffusion tutorial.
Start Container#
The following command starts a container and volume mounts the current directory as workspace.
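Similar to the Hello World example; the tag of the diffusion image is an assumption based on the build step above:

```bash
# Start the diffusion container with GPU access and mount the current directory as /workspace
docker run --rm -it --gpus all --shm-size 1G \
  -v ${PWD}:/workspace -w /workspace \
  triton-python-api:r24.01-diffusion /bin/bash
```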
Enter Python Shell#
Create and Start a Server Instance#
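As before, a minimal sketch; the path of the model repository containing the diffusion pipeline is an assumption:

```python
import tritonserver

# Point the server at the repository containing the stable_diffusion pipeline
server = tritonserver.Server(model_repository="/workspace/diffusion-models")
server.start()
```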
List Models#
server.models()
Example Output#
Send an Inference Request#
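A sketch of a request to the pipeline; the model name stable_diffusion follows the description above, and the input name prompt is an assumption based on the StableDiffusion tutorial:

```python
# Send a text prompt to the stable_diffusion pipeline
model = server.model("stable_diffusion")
responses = model.infer(
    inputs={"prompt": [["a photograph of a butterfly in a field of flowers"]]}
)
```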
Iterate through Responses and save image#
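A sketch that converts the returned tensor to a NumPy array via DLPack and saves it with Pillow; the output name generated_image and the tensor layout are assumptions based on the StableDiffusion tutorial:

```python
import numpy
from PIL import Image

for response in responses:
    # Convert the output tensor to a NumPy array (zero copy via DLPack)
    generated_image = numpy.from_dlpack(response.outputs["generated_image"])
    generated_image = numpy.squeeze(generated_image).astype(numpy.uint8)
    # Save the generated image into the mounted workspace directory
    Image.fromarray(generated_image).save("/workspace/sample_generated_image.jpg")
```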
Example Output#

Fig. 1 sample_generated_image#