Example: Quick Start
Downloading a Pre-built TensorRT Engine
This section shows an example of downloading a TensorRT engine using the TensorRT-Cloud CLI.
Check which engines are available:

```shell
trt-cloud catalog engines
```
Choose the model and version combination you want, and download it:

```shell
trt-cloud catalog download --model=gemma_2b_it_trtllm --version=bs1_int4awq_RTX4090_windows
```
Building a TensorRT Engine for an ONNX Model
This section shows an example of how to build a TensorRT engine from an ONNX model using the TensorRT-Cloud CLI.
Prerequisites
Ensure you can log into TensorRT-Cloud.
Steps
Download an ONNX model, such as MobileNetV2:

```shell
wget https://github.com/onnx/models/raw/main/Computer_Vision/mobilenetv2_050_Opset18_timm/mobilenetv2_050_Opset18.onnx
```
Build the engine with TensorRT-Cloud:

```shell
trt-cloud build onnx --model mobilenetv2_050_Opset18.onnx --gpu RTX4090 --os windows --trtexec-args="--fp16"
```
Unzip the downloaded file. The TensorRT engine is saved as `engine.trt`, which can now be deployed using TensorRT 10.0 to run accelerated inference of MobileNetV2 on an RTX 4090 GPU on Windows.

View the engine metrics in `metrics.json`. Metrics are extracted from TensorRT build logs. Below are the first few lines of `metrics.json` generated for an example model:

```json
{
    "throughput_qps": 782.646,
    "inference_gpu_memory_MiB": 5.0,
    "latency_ms": {
        "min": 0.911316,
        "max": 5.17419,
        "mean": 1.00846,
        "median": 0.97345,
        …
```
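The metrics file can also be consumed programmatically, for example to log or compare build results. Below is a minimal Python sketch that summarizes the fields shown in the excerpt above; the key names (`throughput_qps`, `inference_gpu_memory_MiB`, `latency_ms`) are taken from that sample and may differ for other builds.

```python
import json

# Values taken from the metrics.json excerpt above (truncated fields omitted).
sample = json.loads("""
{
  "throughput_qps": 782.646,
  "inference_gpu_memory_MiB": 5.0,
  "latency_ms": {"min": 0.911316, "max": 5.17419,
                 "mean": 1.00846, "median": 0.97345}
}
""")

def summarize(metrics: dict) -> str:
    """One-line summary of an ONNX build's metrics.

    Assumes the key names shown in the excerpt above; other builds
    may report additional or different fields.
    """
    lat = metrics["latency_ms"]
    return (f"{metrics['throughput_qps']:.0f} qps, "
            f"{metrics['inference_gpu_memory_MiB']:.1f} MiB GPU memory, "
            f"median latency {lat['median']:.3f} ms")

print(summarize(sample))
```

In practice, the dict would be loaded from the unzipped build result with `json.load` rather than embedded inline.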
Building a TensorRT-LLM Engine
Prerequisites
Ensure you can log into TensorRT-Cloud.
Steps
Pick a Hugging Face repository to build an engine for, such as google/gemma-2b-it.
Build the engine with TensorRT-Cloud:

```shell
trt-cloud build llm --hf-repo="google/gemma-2b-it" --dtype="bfloat16" --gpu RTX4090 --os windows
```
Unzip the downloaded file. The TensorRT engine is saved in `build_result/engine` and can now be deployed.

View the engine metrics in `build_result/metrics.json`. For example:

```json
{
    "rouge1": 30.532889598883763,
    "rouge2": 10.519224834860456,
    "rougeL": 22.77946327498464,
    "rougeLsum": 25.958965209634254,
    "generation_tokens_per_second": 126.849,
    "gpu_peak_mem_gb": 7.783
}
```
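Because these metrics cover both accuracy (ROUGE scores) and performance, one common use is to gate engine builds on minimum acceptable values. The Python sketch below does this with purely illustrative thresholds; the floor values and the `engine_acceptable` helper are assumptions for this example, not official recommendations, and only the field names come from the sample above.

```python
import json

# Values from the metrics.json example above.
llm_metrics = json.loads("""
{
  "rouge1": 30.532889598883763,
  "rouge2": 10.519224834860456,
  "rougeL": 22.77946327498464,
  "rougeLsum": 25.958965209634254,
  "generation_tokens_per_second": 126.849,
  "gpu_peak_mem_gb": 7.783
}
""")

# Illustrative acceptance floors (assumptions, not official values):
# reject a build whose accuracy or throughput falls below them.
THRESHOLDS = {
    "rouge1": 29.0,
    "generation_tokens_per_second": 100.0,
}
MAX_GPU_MEM_GB = 8.0  # e.g. the engine must fit alongside other workloads

def engine_acceptable(metrics: dict) -> bool:
    """Return True if the built engine meets the illustrative floors above."""
    if metrics["gpu_peak_mem_gb"] > MAX_GPU_MEM_GB:
        return False
    return all(metrics[key] >= floor for key, floor in THRESHOLDS.items())

print(engine_acceptable(llm_metrics))  # → True
```

A check like this makes it easy to compare quantized builds (for example, the int4-AWQ catalog engine above) against a full-precision baseline before deploying.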