Triton Inference Server is open-source inference serving software that streamlines AI inferencing.
Triton enables teams to deploy any AI model from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton supports inference across cloud, data center, edge, and embedded devices on NVIDIA GPUs, x86 and ARM CPUs, or AWS Inferentia. Triton delivers optimized performance for many query types, including real-time, batched, ensemble, and audio/video streaming.
Major features include:
A Backend API that allows adding custom backends and pre/post-processing operations
Metrics indicating GPU utilization, server throughput, server latency, and more
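
As a rough illustration of how the metrics feature can be consumed: Triton serves its metrics in Prometheus text format, by default at `http://localhost:8002/metrics`. The sketch below parses that format with the Python standard library only. The helper names (`parse_prometheus_text`, `fetch_metrics`), the sample values, and the label values are hypothetical and chosen for illustration; the parser is deliberately simplified and assumes no commas inside label values and no trailing timestamps.

```python
from urllib.request import urlopen

# Assumption: default Triton metrics endpoint; adjust host and port for your deployment.
METRICS_URL = "http://localhost:8002/metrics"

# Made-up sample in Prometheus text format, used so the parser can be tried offline.
SAMPLE = """\
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="resnet50",version="1"} 42
nv_gpu_utilization{gpu_uuid="GPU-0000"} 0.35
"""


def parse_prometheus_text(text):
    """Parse Prometheus text lines into (name, labels, value) tuples.

    Simplified: skips comment lines, assumes label values contain no commas
    and that each sample line ends with the value (no timestamp).
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split "name{labels} value" at the last space.
        series, _, value = line.rpartition(" ")
        name, _, label_part = series.partition("{")
        labels = {}
        for pair in label_part.rstrip("}").split(","):
            if "=" in pair:
                key, _, val = pair.partition("=")
                labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url=METRICS_URL):
    """Fetch and parse metrics from a running server (hypothetical helper)."""
    with urlopen(url, timeout=5) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))


samples = parse_prometheus_text(SAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

Against a running server, `fetch_metrics()` would return the same tuple structure for the live endpoint. For production monitoring, a Prometheus server scraping the endpoint is the more typical setup than hand-parsing.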
Join the Triton and TensorRT community and stay current on the latest product updates, bug fixes, content, best practices, and more. Need enterprise support? NVIDIA global support is available for Triton Inference Server with the NVIDIA AI Enterprise software suite.
See the latest release notes for updates on the newest features and bug fixes.