TensorRT Engines for Community Models#
Community open-source software (OSS) models such as Llama, Gemma, and many more provide significant value in driving AI adoption. TensorRT-Cloud exposes optimized engines to provide best-in-class performance for these important models.
The catalog feature allows you to download prebuilt TensorRT engines for popular models using the TensorRT-Cloud CLI.
Discovering Models#
The `trt-cloud catalog models` command lists all models for which prebuilt engines can be downloaded.
```
trt-cloud catalog models
[I] Found 4 models.
Model Name
------------------------
gemma_2b_it_trtllm
llama_2_7b_chat_trtllm
mistral-7b-it-v02_trtllm
phi_2_trtllm
```
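If you are scripting around the CLI, the model names can be pulled out of this listing programmatically. A minimal sketch, assuming the output format shown above (informational lines, a `Model Name` header, a dashed rule, then one name per line); `parse_model_list` is a hypothetical helper, not part of the CLI:

```python
def parse_model_list(output: str) -> list[str]:
    """Parse model names from `trt-cloud catalog models` output.

    Assumes the format shown above: informational lines, a
    'Model Name' header, a dashed separator, then one name per line.
    """
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    # Everything after the dashed separator is a model name.
    for i, ln in enumerate(lines):
        if set(ln) == {"-"}:
            return lines[i + 1:]
    return []

sample = """\
[I] Found 4 models.
Model Name
------------------------
gemma_2b_it_trtllm
llama_2_7b_chat_trtllm
mistral-7b-it-v02_trtllm
phi_2_trtllm
"""
print(parse_model_list(sample))
```

Note that the listing format may change between CLI releases, so a parser like this should be treated as a convenience, not a stable interface.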
Discovering Available Engine Versions#
Each model has multiple engine versions built with different configurations, such as operating system, GPU, and batch size. The `trt-cloud catalog engines` command lists all available engines.
The command also accepts optional filtering arguments, such as `--model`, to narrow the listing to the available engine versions for a specific model.
```
trt-cloud catalog engines --model=gemma
[I] Found 8 engines.
model_name          version_name                   trtllm_version  os       gpu        num_gpus  download_size  weight_stripped
------------------  -----------------------------  --------------  -------  ---------  --------  -------------  ---------------
gemma_2b_it_trtllm  bs1_int4awq_RTX3070_windows    0.10.0          Windows  RTX3070    1         1678.07 MB     False
gemma_2b_it_trtllm  bs1_int4awq_RTX4060TI_windows  0.10.0          Windows  RTX4060TI  1         1678.07 MB     False
gemma_2b_it_trtllm  bs1_int4awq_RTX4070_windows    0.10.0          Windows  RTX4070    1         1678.11 MB     False
gemma_2b_it_trtllm  bs1_int4awq_RTX4090_windows    0.10.0          Windows  RTX4090    1         1678.12 MB     False
gemma_2b_it_trtllm  bs1_int8wo_RTX3070_windows     0.10.0          Windows  RTX3070    1         2573.57 MB     False
gemma_2b_it_trtllm  bs1_int8wo_RTX4060TI_windows   0.10.0          Windows  RTX4060TI  1         2573.57 MB     False
gemma_2b_it_trtllm  bs1_int8wo_RTX4070_windows     0.10.0          Windows  RTX4070    1         2573.57 MB     False
gemma_2b_it_trtllm  bs1_int8wo_RTX4090_windows     0.10.0          Windows  RTX4090    1         2573.57 MB     False
```
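When a model has several engine versions for the same GPU, you typically want the one matching your hardware, and often the smallest download. A minimal sketch of that selection logic, using a subset of the rows from the listing above as sample data (`pick_engine` is a hypothetical helper, not a CLI feature):

```python
# Sample rows copied from the listing above (a subset of fields).
ENGINES = [
    {"version": "bs1_int4awq_RTX3070_windows", "gpu": "RTX3070", "size_mb": 1678.07},
    {"version": "bs1_int4awq_RTX4090_windows", "gpu": "RTX4090", "size_mb": 1678.12},
    {"version": "bs1_int8wo_RTX3070_windows",  "gpu": "RTX3070", "size_mb": 2573.57},
    {"version": "bs1_int8wo_RTX4090_windows",  "gpu": "RTX4090", "size_mb": 2573.57},
]

def pick_engine(engines, gpu):
    """Return the smallest-download engine row for the given GPU,
    or None if no engine targets that GPU."""
    matches = [e for e in engines if e["gpu"] == gpu]
    return min(matches, key=lambda e: e["size_mb"], default=None)

print(pick_engine(ENGINES, "RTX4090")["version"])  # bs1_int4awq_RTX4090_windows
```

Smallest download is only one possible policy; you may instead prefer a quantization mode (for example, `int8wo` over `int4awq`) for accuracy reasons.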
Filtering Engines by Configs#
Filtering options change regularly as new configurations are exposed. To view the latest filtering options, run:

```
trt-cloud catalog engines -h
```
You can only filter by `model`, `trtllm_version`, `os`, and `gpu`.
Downloading a Pre-built Engine#
The `trt-cloud catalog download` command downloads a specific engine version for a model. The model and engine version are specified with the `--model` and `--version` options, respectively. An optional `--output` option specifies the local path where the engine is saved; if not specified, the engine is saved in the current working directory.
```
trt-cloud catalog download --model=gemma_2b_it_trtllm --version=bs1_int4awq_RTX3070_windows
[I] EULA for Model: gemma_2b_it_trtllm, Engine: bs1_int4awq_RTX3070_windows

GOVERNING TERMS: The use of this TensorRT Engine is governed by the NVIDIA TRT Engine License:
https://docs.nvidia.com/deeplearning/tensorrt-cloud/latest/reference/eula.html#nvidia-tensorrt-engine-license-agreement

ATTRIBUTION: Gemma Terms of Use available at https://ai.google.dev/gemma/terms; and
Gemma Prohibited Use Policy available at https://ai.google.dev/gemma/prohibited_use_policy.

A temporary copy of this document can be found at: …
This will be deleted after a response to this prompt.
A copy is also included with the downloaded engine archive.

Do you agree to the EULA for Model: gemma_2b_it_trtllm, Engine: bs1_int4awq_RTX3070_windows? (yes/no)
yes
[I] Saving engine to gemma_2b_it_trtllm_bs1_int4awq_RTX3070_windows_files.zip
Downloading ━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19% 0:00:27
```
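The download produces a `.zip` archive (named as in the output above). A short sketch of unpacking it from Python using the standard library, in case you automate the post-download step; `unpack_engine` is a hypothetical helper, not part of the CLI:

```python
import zipfile

def unpack_engine(archive: str, dest: str = "engines") -> list[str]:
    """Extract a downloaded engine archive into `dest` and
    return the list of files it contained."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return zf.namelist()

# Example (archive name taken from the download output above):
# files = unpack_engine("gemma_2b_it_trtllm_bs1_int4awq_RTX3070_windows_files.zip")
```

The returned file list is useful for confirming that the bundled EULA copy and engine files arrived intact before pointing TensorRT-LLM at the extraction directory.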
Running a Pre-Built Engine#
Integrate the downloaded engine into your application using the TensorRT or TensorRT-LLM version that matches the pre-built engine. The following example shows how to run inference using the downloaded Gemma engine.
1. Unzip the downloaded engine archive.

2. Download the model’s tokenizer and ensure that the environment satisfies the requirements of the downloaded engine (GPU type, GPU count, operating system, and installed TensorRT-LLM version).

   ```
   git clone https://huggingface.co/google/gemma-2b-it gemma-2b-it
   ```

3. Run the downloaded TensorRT-LLM engine.

   ```
   python3 run.py --engine_dir engines/ --max_output_len 100 --tokenizer_dir gemma-2b-it/ --input_text "How do I count to nine in French?"
   ```
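If you launch the example script from your own tooling rather than a shell, the invocation above can be assembled as an argv list. A minimal sketch using the same flags shown above (`build_run_cmd` is a hypothetical helper; the paths are placeholders matching this example):

```python
import shlex

def build_run_cmd(engine_dir: str, tokenizer_dir: str, prompt: str,
                  max_output_len: int = 100) -> list[str]:
    """Assemble the run.py invocation shown above as an argv list,
    suitable for passing to subprocess.run()."""
    return [
        "python3", "run.py",
        "--engine_dir", engine_dir,
        "--max_output_len", str(max_output_len),
        "--tokenizer_dir", tokenizer_dir,
        "--input_text", prompt,
    ]

cmd = build_run_cmd("engines/", "gemma-2b-it/", "How do I count to nine in French?")
print(shlex.join(cmd))
```

Passing an argv list (rather than a formatted string) to `subprocess.run` avoids shell-quoting issues with prompts that contain spaces or quotes.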
For more information, refer to the TensorRT-LLM documentation.