TAO Deploy Overview

[Figure: TAO Deploy workflow]

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference. It provides APIs and parsers to import trained models from all major deep learning frameworks, and it generates optimized runtime engines that can be deployed in data centers as well as in automotive and embedded environments. To learn more about TensorRT and its capabilities, refer to the official TensorRT documentation.

Models trained in TAO are deployed to NVIDIA inference SDKs, such as DeepStream, through TensorRT. TAO model skills export their trained checkpoints to .onnx; TAO Deploy parses the ONNX file and produces an optimized TensorRT engine. Engines can be built for reduced-precision inference (FP16, INT8, or FP8). Most TAO models also support integrating the .onnx file directly with DeepStream when an optimized engine is not required.

TAO Deploy decouples deployment from model training and optimization: it generates the engine by parsing the exported .onnx file. It can also run evaluation and inference against that engine using the original TAO experiment specification. The supported actions are:

  • gen_trt_engine — build a TensorRT engine from an exported .onnx.

  • evaluate — run evaluation against a built engine.

  • inference — run inference against a built engine.

Driving TAO Deploy from the Agent

In TAO 7.0, you drive these actions through the model skill’s gen_trt_engine, evaluate, and inference entries from your TAO agent. There is no separate tao deploy launcher CLI to install. The action runs in the TAO Deploy container, which the TAO Execution SDK pulls from NGC at dispatch time.

The pattern is the same for every supported network: describe the build in plain English, point the agent at the spec, and the agent resolves the spec keys and dispatches the job.

Build an INT8 TensorRT engine for DINO from the exported ONNX at
``s3://my-bucket/dino/model.onnx`` using ``trt-spec.yaml``. Calibrate
against ``s3://my-bucket/calib/`` and write the engine to
``s3://my-bucket/dino/model.engine``. Run on the local Docker daemon.

The agent reads models/tao-train-dino/SKILL.md for the gen_trt_engine action, overlays your overrides on the spec template, and dispatches the job via the TAO Execution SDK to the backend you named (local Docker, Brev, SLURM, or Kubernetes).
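
The overlay step described above can be pictured as a recursive dictionary merge. The following is a minimal sketch under stated assumptions, not the agent's actual implementation: the function name overlay_spec and the template values are hypothetical, and the real agent may resolve spec keys differently.

```python
import copy


def overlay_spec(template: dict, overrides: dict) -> dict:
    """Return a copy of `template` with `overrides` merged in recursively.

    Nested dicts are merged key by key; any other value in `overrides`
    replaces the template value. Neither input is modified.
    """
    merged = copy.deepcopy(template)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay_spec(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical template values, for illustration only.
template = {
    "gen_trt_engine": {
        "onnx_file": "/path/to/model.onnx",
        "trt_engine": "/path/to/model.engine",
        "tensorrt": {"data_type": "fp16", "workspace_size": 1024},
    }
}

# Overrides corresponding to the example prompt above.
overrides = {
    "gen_trt_engine": {
        "onnx_file": "s3://my-bucket/dino/model.onnx",
        "trt_engine": "s3://my-bucket/dino/model.engine",
        "tensorrt": {
            "data_type": "int8",
            "calibration": {"cal_image_dir": ["s3://my-bucket/calib/"]},
        },
    }
}

spec = overlay_spec(template, overrides)
print(spec["gen_trt_engine"]["tensorrt"])
```

Keys that the prompt does not mention, such as workspace_size here, keep their template values, which is why a short prompt can still yield a complete spec.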

Note

You normally do not name a skill in your prompt; the agent picks the right one. Internally, gen_trt_engine, evaluate, and inference are dispatched through skills/tao-launch-workflow (the shared launch intake for TAO workflows), and long-running serving uses applications/tao-run-inference-service instead.

Specification Format

The gen_trt_engine spec is shared across networks. A typical INT8 calibration spec looks like this:

gen_trt_engine:
  onnx_file: /path/to/model.onnx
  trt_engine: /path/to/model.engine
  input_channel: 3
  input_width: 960
  input_height: 544
  tensorrt:
    data_type: int8
    workspace_size: 1024
    min_batch_size: 1
    opt_batch_size: 10
    max_batch_size: 10
    calibration:
      cal_image_dir:
        - /path/to/cal/images
      cal_cache_file: /path/to/cal.bin
      cal_batch_size: 10
      cal_batches: 1000
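
Before dispatching a build, it can help to sanity-check the tensorrt block. The sketch below is illustrative only: the helper names are hypothetical and not part of TAO Deploy. It checks that the batch sizes are ordered as a TensorRT optimization profile expects (min <= opt <= max) and estimates how many calibration images the spec above would consume, assuming the calibrator reads cal_batches batches of cal_batch_size images each.

```python
def check_batch_profile(trt: dict) -> list:
    """Return a list of problems; an empty list means the profile is consistent."""
    problems = []
    lo = trt["min_batch_size"]
    opt = trt["opt_batch_size"]
    hi = trt["max_batch_size"]
    if lo < 1:
        problems.append("min_batch_size must be at least 1")
    if not (lo <= opt <= hi):
        problems.append(f"expected min <= opt <= max, got {lo}, {opt}, {hi}")
    return problems


def calibration_image_count(cal: dict) -> int:
    """Images consumed, assuming cal_batches batches of cal_batch_size each."""
    return cal["cal_batch_size"] * cal["cal_batches"]


# The tensorrt block from the spec above, written as a Python dict.
tensorrt_block = {
    "data_type": "int8",
    "workspace_size": 1024,
    "min_batch_size": 1,
    "opt_batch_size": 10,
    "max_batch_size": 10,
    "calibration": {
        "cal_image_dir": ["/path/to/cal/images"],
        "cal_cache_file": "/path/to/cal.bin",
        "cal_batch_size": 10,
        "cal_batches": 1000,
    },
}

problems = check_batch_profile(tensorrt_block)
images = calibration_image_count(tensorrt_block["calibration"])
print(problems, images)
```

Under that assumption, the example spec draws on 10 × 1000 = 10,000 calibration images, so the calibration directory should hold at least that many, or cal_batches should be reduced to match.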

The corresponding evaluate and inference specs reuse the standard TAO evaluation/inference specs and point evaluate.trt_engine (or inference.trt_engine) at the engine you built.
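
As a rough illustration of that wiring, the sketch below copies the engine path from a gen_trt_engine spec into an evaluate or inference spec. The helper name point_at_engine and the placeholder batch_size key are hypothetical; only the gen_trt_engine.trt_engine, evaluate.trt_engine, and inference.trt_engine keys come from the text above.

```python
import copy


def point_at_engine(task_spec: dict, build_spec: dict, action: str) -> dict:
    """Return a copy of `task_spec` whose `<action>.trt_engine` is the built engine."""
    if action not in ("evaluate", "inference"):
        raise ValueError(f"unsupported action: {action!r}")
    spec = copy.deepcopy(task_spec)
    engine_path = build_spec["gen_trt_engine"]["trt_engine"]
    # Create the action section if the spec does not have one yet.
    spec.setdefault(action, {})["trt_engine"] = engine_path
    return spec


build_spec = {
    "gen_trt_engine": {
        "onnx_file": "/path/to/model.onnx",
        "trt_engine": "/path/to/model.engine",
    }
}

# Placeholder evaluation spec; real specs carry dataset and model sections too.
evaluate_spec = {"evaluate": {"batch_size": 8}}

ready = point_at_engine(evaluate_spec, build_spec, "evaluate")
print(ready["evaluate"]["trt_engine"])
```

The rest of the evaluation spec is left unchanged, which reflects the idea that the engine is evaluated against the original experiment specification.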

For per-network specs and any model-specific switches, refer to the network-specific pages linked under Per-Network Notes below.

NSight DL Designer Integration

TAO models are compatible with NVIDIA NSight DL Designer, a visualization, debugging, and profiling tool for deep learning models. Through NSight DL Designer you can profile TensorRT engines generated from TAO models, set layer-level precision constraints, visualize model architectures, and debug inference performance bottlenecks. For details, refer to Integrating TAO Models with NSight DL Designer.

Per-Network Notes

The pages below document network-specific spec keys, default values, and any quirks of the gen_trt_engine, evaluate, or inference action for that network. Use them as reference when you want to know what the agent will fill into your spec.