Integrating TAO CV Models with Triton Inference Server

TAO provides an easy interface to generate accurate and optimized models for a number of computer vision use cases. These models are typically deployed through the DeepStream SDK or Riva pipelines.

NVIDIA Triton Inference Server is open-source inference serving software for deploying Deep Neural Networks (DNNs) from a wide range of frameworks, including TensorRT, TensorFlow, ONNX Runtime, and PyTorch. It supports multi-model serving, dynamic batching, and concurrent model execution. TAO ships a reference application that shows how to deploy a TAO-trained model to Triton.

Supported reference implementations cover the following networks:

For documentation and source code, refer to the TAO Toolkit Triton Apps repository on GitHub.

Driving Triton Deployment from the Agent

In TAO 7.0, ask your agent to build the TensorRT engine and prepare the Triton model repository for any supported network. For example:

Build an FP16 TensorRT engine for DetectNet_v2 from the exported ONNX at
``s3://my-bucket/detectnet/model.onnx`` and stage a Triton model
repository under ``s3://my-bucket/triton/models/detectnet_v2/1/``.

The agent reads the network’s gen_trt_engine action, builds the engine on your chosen backend, and writes the engine plus the Triton config.pbtxt into the model-repository layout that the reference application expects.
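
To make the target layout concrete, the following is a minimal sketch of what a staged model repository looks like, written against a local directory rather than S3. It is not the agent's or the reference application's implementation. The helper names ``render_config`` and ``stage_model_repository`` are hypothetical, and the tensor names, dimensions, batch size, and FP32 I/O types in the usage section are illustrative assumptions that must be replaced with the values of your exported model. The sketch relies only on Triton's standard conventions: ``config.pbtxt`` sits in the model directory, and a TensorRT engine is stored as ``model.plan`` inside a numeric version directory.

```python
from pathlib import Path
import tempfile


def render_config(name, max_batch_size, inputs, outputs):
    """Return config.pbtxt text for a TensorRT engine (platform "tensorrt_plan").

    inputs and outputs are lists of (tensor_name, dims) pairs. All tensors are
    declared as TYPE_FP32 here, which is an assumption of this sketch.
    """
    def tensor_block(kind, tensors):
        entries = []
        for tensor_name, dims in tensors:
            dims_text = ", ".join(str(d) for d in dims)
            entries.append(
                "  {\n"
                f'    name: "{tensor_name}"\n'
                "    data_type: TYPE_FP32\n"
                f"    dims: [ {dims_text} ]\n"
                "  }"
            )
        return f"{kind} [\n" + ",\n".join(entries) + "\n]"

    lines = [
        f'name: "{name}"',
        'platform: "tensorrt_plan"',
        f"max_batch_size: {max_batch_size}",
        tensor_block("input", inputs),
        tensor_block("output", outputs),
    ]
    return "\n".join(lines) + "\n"


def stage_model_repository(repo_root, name, version, engine_bytes, config_text):
    """Write <repo_root>/<name>/config.pbtxt and <repo_root>/<name>/<version>/model.plan."""
    model_dir = Path(repo_root) / name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "config.pbtxt").write_text(config_text)
    (version_dir / "model.plan").write_bytes(engine_bytes)
    return model_dir


if __name__ == "__main__":
    # Illustrative values only: check tensor names and shapes against your model.
    config = render_config(
        name="detectnet_v2",
        max_batch_size=16,
        inputs=[("input_1", [3, 544, 960])],
        outputs=[
            ("output_cov/Sigmoid", [4, 34, 60]),
            ("output_bbox/BiasAdd", [16, 34, 60]),
        ],
    )
    with tempfile.TemporaryDirectory() as repo_root:
        # Placeholder bytes stand in for a real serialized TensorRT engine.
        model_dir = stage_model_repository(
            repo_root, "detectnet_v2", 1, b"placeholder-engine", config
        )
        for path in sorted(model_dir.rglob("*")):
            print(path.relative_to(repo_root))
```

Note that the version directory (``detectnet_v2/1/``) holds only the engine file, while ``config.pbtxt`` lives one level up in ``detectnet_v2/``. A TensorRT engine is specific to the GPU and TensorRT version it was built with, so the engine should be built in an environment that matches the Triton deployment.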