TensorRT Engines for Community Models
Community open source software (OSS) models such as Llama, Gemma, and many more provide significant value in driving AI adoption forward. TensorRT-Cloud exposes optimized engines to provide you with best-in-class performance for these important models.
The catalog feature allows downloading prebuilt TensorRT engines for popular models using the TensorRT-Cloud CLI.
Discovering Models
The trt-cloud catalog models command lists all models for which engines are available to download.
trt-cloud catalog models
[I] Found 4 models.
Model Name
------------------------
gemma_2b_it_trtllm
llama_2_7b_chat_trtllm
mistral-7b-it-v02_trtllm
phi_2_trtllm
Discovering Available Engine Versions
Each model has multiple engine versions that are built with different configurations such as operating systems, GPUs, batch sizes, and so on. The trt-cloud catalog engines command lists all available engines.
The command also accepts optional filtering arguments such as --model to filter the available engine versions for a specific model.
trt-cloud catalog engines --model=gemma
[I] Found 8 engines.
model_name          version_name                   trtllm_version  os       gpu        num_gpus  download_size  weight_stripped
------------------  -----------------------------  --------------  -------  ---------  --------  -------------  ---------------
gemma_2b_it_trtllm  bs1_int4awq_RTX3070_windows    0.10.0          Windows  RTX3070    1         1678.07 MB     False
gemma_2b_it_trtllm  bs1_int4awq_RTX4060TI_windows  0.10.0          Windows  RTX4060TI  1         1678.07 MB     False
gemma_2b_it_trtllm  bs1_int4awq_RTX4070_windows    0.10.0          Windows  RTX4070    1         1678.11 MB     False
gemma_2b_it_trtllm  bs1_int4awq_RTX4090_windows    0.10.0          Windows  RTX4090    1         1678.12 MB     False
gemma_2b_it_trtllm  bs1_int8wo_RTX3070_windows     0.10.0          Windows  RTX3070    1         2573.57 MB     False
gemma_2b_it_trtllm  bs1_int8wo_RTX4060TI_windows   0.10.0          Windows  RTX4060TI  1         2573.57 MB     False
gemma_2b_it_trtllm  bs1_int8wo_RTX4070_windows     0.10.0          Windows  RTX4070    1         2573.57 MB     False
gemma_2b_it_trtllm  bs1_int8wo_RTX4090_windows     0.10.0          Windows  RTX4090    1         2573.57 MB     False
Filtering Engines by Configs
Filtering options change regularly based on which configs are exposed. To view the latest filtering options, run:
trt-cloud catalog engines -h
Currently, you can only filter by model, trtllm_version, os, and gpu.
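To illustrate how these filter keys narrow the catalog, the following Python sketch applies the same model/os/gpu/trtllm_version matching to a list of engine records. This is illustrative only: the filter_engines helper and the sample records are assumptions for this sketch, not part of the trt-cloud CLI.

```python
# Hypothetical helper mirroring the CLI's filter keys
# (model, trtllm_version, os, gpu). Not part of trt-cloud.
def filter_engines(engines, **filters):
    """Return engine records whose fields match every given filter value."""
    return [e for e in engines if all(e.get(k) == v for k, v in filters.items())]

# Sample records modeled on the catalog listing above.
engines = [
    {"model": "gemma_2b_it_trtllm", "trtllm_version": "0.10.0",
     "os": "Windows", "gpu": "RTX3070"},
    {"model": "gemma_2b_it_trtllm", "trtllm_version": "0.10.0",
     "os": "Windows", "gpu": "RTX4090"},
]

rtx4090 = filter_engines(engines, gpu="RTX4090")
print(len(rtx4090))  # → 1
```

Each filter is a simple equality match, so combining several filters (for example, gpu and os together) selects only engines satisfying all of them, just as passing multiple flags to trt-cloud catalog engines does.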
Downloading a Pre-built Engine
The trt-cloud catalog download command downloads a specific engine version for a model. The model and engine version are specified with the --model and --version options, respectively. An optional --output option specifies the local path to save the engine to. If not specified, the engine is saved in the current working directory.
trt-cloud catalog download --model=gemma_2b_it_trtllm --version=bs1_int4awq_RTX3070_windows
[I] EULA for Model: gemma_2b_it_trtllm, Engine: bs1_int4awq_RTX3070_windows
GOVERNING TERMS: The use of this TensorRT Engine is governed by the NVIDIA TRT Engine License: https://docs.nvidia.com/deeplearning/tensorrt-cloud/latest/reference/eula.html#nvidia-tensorrt-engine-license-agreement
ATTRIBUTION: Gemma Terms of Use available at https://ai.google.dev/gemma/terms; and Gemma Prohibited Use Policy available at https://ai.google.dev/gemma/prohibited_use_policy.
A temporary copy of this document can be found at: …
This will be deleted after a response to this prompt. A copy is also included with the downloaded engine archive.
Do you agree to the EULA for Model: gemma_2b_it_trtllm, Engine: bs1_int4awq_RTX3070_windows? (yes/no) yes
[I] Saving engine to gemma_2b_it_trtllm_bs1_int4awq_RTX3070_windows_files.zip
Downloading ━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19% 0:00:27
Running a Pre-Built Engine
Integrate the downloaded engine into your application using TensorRT or the TensorRT-LLM version that matches the pre-built engine. In this example, we show how to run inference using the downloaded Gemma engine.
Unzip the downloaded engine archive.
Download the model’s tokenizer and ensure that the environment satisfies the requirements of the downloaded engine (GPU type, GPU count, operating system, and installed TensorRT-LLM version).
git clone https://huggingface.co/google/gemma-2b-it gemma-2b-it
Run the downloaded TensorRT-LLM engine.
python3 run.py --engine_dir engines/ --max_output_len 100 --tokenizer_dir gemma-2b-it/ --input_text "How do I count to nine in French?"
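The catalog listing reports each engine's required OS and TensorRT-LLM version, which the environment must satisfy before the engine can run. As a rough sanity check, you could compare those fields against the local machine; the Python sketch below is illustrative only (the engine_meta values and the environment_matches helper are assumptions for this sketch, not part of the CLI or TensorRT-LLM).

```python
import platform

# Illustrative metadata for a downloaded engine, as reported by
# `trt-cloud catalog engines` (values are assumptions for this sketch).
engine_meta = {"os": "Windows", "trtllm_version": "0.10.0"}

def environment_matches(meta, installed_trtllm_version):
    """Return True if the local OS and the installed TensorRT-LLM
    version match the engine's requirements."""
    # platform.system() returns "Windows", "Linux", or "Darwin".
    return (platform.system() == meta["os"]
            and installed_trtllm_version == meta["trtllm_version"])

print(environment_matches(engine_meta, "0.10.0"))
```

A fuller check would also compare the GPU type and GPU count against the engine's gpu and num_gpus fields, but querying those is hardware-specific and omitted here.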
For more information, refer to the TensorRT-LLM documentation.