TensorRT Engines for Community Models#

Community open-source software (OSS) models such as Llama, Gemma, and many more are a significant driver of AI adoption. TensorRT-Cloud exposes optimized engines that provide best-in-class performance for these important models.

The catalog feature allows downloading prebuilt TensorRT engines for popular models using the TensorRT-Cloud CLI.

Discovering Models#

The trt-cloud catalog models command lists all models for which engines can be downloaded.

trt-cloud catalog models
[I] Found 4 models.

Model Name
------------------------
gemma_2b_it_trtllm
llama_2_7b_chat_trtllm
mistral-7b-it-v02_trtllm
phi_2_trtllm

Discovering Available Engine Versions#

Each model has multiple engine versions built with different configurations, such as operating system, GPU, and batch size. The trt-cloud catalog engines command lists all available engines.

The command also accepts optional filtering arguments, such as --model, to list only the available engine versions for a specific model.

trt-cloud catalog engines --model=gemma
[I] Found 8 engines.

model_name          version_name                   trtllm_version    os       gpu          num_gpus  download_size    weight_stripped
------------------  -----------------------------  ----------------  -------  ---------  ----------  ---------------  -----------------
gemma_2b_it_trtllm  bs1_int4awq_RTX3070_windows    0.10.0            Windows  RTX3070             1  1678.07 MB       False
gemma_2b_it_trtllm  bs1_int4awq_RTX4060TI_windows  0.10.0            Windows  RTX4060TI           1  1678.07 MB       False
gemma_2b_it_trtllm  bs1_int4awq_RTX4070_windows    0.10.0            Windows  RTX4070             1  1678.11 MB       False
gemma_2b_it_trtllm  bs1_int4awq_RTX4090_windows    0.10.0            Windows  RTX4090             1  1678.12 MB       False
gemma_2b_it_trtllm  bs1_int8wo_RTX3070_windows     0.10.0            Windows  RTX3070             1  2573.57 MB       False
gemma_2b_it_trtllm  bs1_int8wo_RTX4060TI_windows   0.10.0            Windows  RTX4060TI           1  2573.57 MB       False
gemma_2b_it_trtllm  bs1_int8wo_RTX4070_windows     0.10.0            Windows  RTX4070             1  2573.57 MB       False
gemma_2b_it_trtllm  bs1_int8wo_RTX4090_windows     0.10.0            Windows  RTX4090             1  2573.57 MB       False

Filtering Engines by Configs#

Filtering options change over time as new configurations are exposed. To view the latest filtering options, run:

trt-cloud catalog engines -h

You can filter only by model, trtllm_version, os, and gpu.
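
For example, to list only Windows engines built for a specific GPU (assuming the filter flags mirror the names above; confirm the exact flags with -h):

trt-cloud catalog engines --model=gemma --os=windows --gpu=RTX4090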

Downloading a Pre-built Engine#

The trt-cloud catalog download command downloads a specific engine version for a model. The model and engine version are specified with the --model and --version options, respectively. The optional --output option specifies the local path where the engine is saved; if omitted, the engine is saved in the current working directory.

trt-cloud catalog download --model=gemma_2b_it_trtllm --version=bs1_int4awq_RTX3070_windows
[I] EULA for Model: gemma_2b_it_trtllm, Engine: bs1_int4awq_RTX3070_windows
GOVERNING TERMS: The use of this TensorRT Engine is governed by the NVIDIA TRT Engine License: https://docs.nvidia.com/deeplearning/tensorrt-cloud/latest/reference/eula.html#nvidia-tensorrt-engine-license-agreement

ATTRIBUTION: Gemma Terms of Use available at https://ai.google.dev/gemma/terms; and Gemma Prohibited Use Policy available at https://ai.google.dev/gemma/prohibited_use_policy.

A temporary copy of this document can be found at: 
This will be deleted after a response to this prompt. A copy is also included with the downloaded engine archive.

Do you agree to the EULA for Model: gemma_2b_it_trtllm, Engine: bs1_int4awq_RTX3070_windows? (yes/no) yes
[I] Saving engine to gemma_2b_it_trtllm_bs1_int4awq_RTX3070_windows_files.zip
Downloading  ━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  19% 0:00:27
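
To save the archive to a specific path instead of the current working directory, add the --output option. For example:

trt-cloud catalog download --model=gemma_2b_it_trtllm --version=bs1_int4awq_RTX3070_windows --output=engines/gemma_2b_it_trtllm.zip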

Running a Pre-Built Engine#

Integrate the downloaded engine into your application using the TensorRT or TensorRT-LLM version that matches the pre-built engine. This example shows how to run inference using the downloaded Gemma engine; a programmatic sketch follows the steps below.

  1. Unzip the downloaded engine archive.
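
    For example, using the default archive name from the download step (adjust the filename to match your download):

    unzip gemma_2b_it_trtllm_bs1_int4awq_RTX3070_windows_files.zip -d engines/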

  2. Download the model’s tokenizer and ensure that the environment satisfies the requirements of the downloaded engine (GPU type, GPU count, operating system, and installed TensorRT-LLM version).

    git clone https://huggingface.co/google/gemma-2b-it gemma-2b-it
    
  3. Run the downloaded TensorRT-LLM engine.

    python3 run.py --engine_dir engines/  --max_output_len 100 --tokenizer_dir gemma-2b-it/ --input_text "How do I count to nine in French?"
    

    For more information, refer to the TensorRT-LLM documentation.
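
As an alternative to the run.py script, the engine can also be loaded programmatically. The following is a minimal sketch using the TensorRT-LLM Python runtime (ModelRunner); the exact API surface depends on the installed TensorRT-LLM version, so treat it as a starting point rather than a drop-in snippet.

from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

# Tokenizer cloned in step 2; engine directory unzipped in step 1.
tokenizer = AutoTokenizer.from_pretrained("gemma-2b-it")
runner = ModelRunner.from_dir(engine_dir="engines/")

prompt = "How do I count to nine in French?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids[0].int()

# generate() expects a list of 1-D int32 tensors, one per batch entry.
output_ids = runner.generate(
    batch_input_ids=[input_ids],
    max_new_tokens=100,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id,
)

# Output shape is (batch, beams, sequence length); decode beam 0 of the first entry.
print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))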