TensorRT Engines for Community Models

Community open source software (OSS) models, such as Llama and Gemma, provide significant value in driving AI adoption forward. TensorRT-Cloud exposes optimized engines that deliver best-in-class performance for these important models.

The catalog feature lets you download prebuilt TensorRT engines for popular models through the TensorRT-Cloud CLI.

Discovering Models

The trt-cloud catalog models command lists all models for which prebuilt engines are available for download.

trt-cloud catalog models
[I] Found 4 models.

Model Name
------------------------
gemma_2b_it_trtllm
llama_2_7b_chat_trtllm
mistral-7b-it-v02_trtllm
phi_2_trtllm
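
If you want to consume the catalog from a script, one approach is to shell out to the CLI and parse the listing. The following Python sketch is illustrative only: it assumes trt-cloud is on your PATH and that the model names appear on stdout in the plain-text format shown above.

import subprocess

# Illustrative sketch: collect catalog model names by wrapping the CLI.
# Log lines such as "[I] Found 4 models." may be routed to stdout or
# stderr depending on the CLI version, so they are filtered out here.
result = subprocess.run(
    ["trt-cloud", "catalog", "models"],
    capture_output=True, text=True, check=True,
)

models = [
    line.strip()
    for line in result.stdout.splitlines()
    if line.strip()
    and not line.startswith(("[I]", "Model Name", "---"))
]
print(models)  # e.g. ['gemma_2b_it_trtllm', 'llama_2_7b_chat_trtllm', ...]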

Discovering Available Engine Versions

Each model has multiple engine versions built with different configurations, such as operating system, GPU, and batch size. The trt-cloud catalog engines command lists all available engines.

The command also accepts optional filtering arguments, such as --model, to restrict the listing to the engine versions available for a specific model.

trt-cloud catalog engines --model=gemma
[I] Found 8 engines.

model_name          version_name                   trtllm_version    os       gpu          num_gpus  download_size    weight_stripped
------------------  -----------------------------  ----------------  -------  ---------  ----------  ---------------  -----------------
gemma_2b_it_trtllm  bs1_int4awq_RTX3070_windows    0.10.0            Windows  RTX3070             1  1678.07 MB       False
gemma_2b_it_trtllm  bs1_int4awq_RTX4060TI_windows  0.10.0            Windows  RTX4060TI           1  1678.07 MB       False
gemma_2b_it_trtllm  bs1_int4awq_RTX4070_windows    0.10.0            Windows  RTX4070             1  1678.11 MB       False
gemma_2b_it_trtllm  bs1_int4awq_RTX4090_windows    0.10.0            Windows  RTX4090             1  1678.12 MB       False
gemma_2b_it_trtllm  bs1_int8wo_RTX3070_windows     0.10.0            Windows  RTX3070             1  2573.57 MB       False
gemma_2b_it_trtllm  bs1_int8wo_RTX4060TI_windows   0.10.0            Windows  RTX4060TI           1  2573.57 MB       False
gemma_2b_it_trtllm  bs1_int8wo_RTX4070_windows     0.10.0            Windows  RTX4070             1  2573.57 MB       False
gemma_2b_it_trtllm  bs1_int8wo_RTX4090_windows     0.10.0            Windows  RTX4090             1  2573.57 MB       False

Filtering Engines by Configs

Filtering options change regularly based on the configs that are exposed. To view the latest filtering options, run:

trt-cloud catalog engines -h

Currently, you can filter only by model, trtllm_version, os, and gpu.
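
For example, assuming each filter listed above maps to a CLI flag of the same name (only --model appears in the listings above; confirm the other flag names with -h), a combined query might look like this:

trt-cloud catalog engines --model=gemma --os=windows --gpu=RTX4070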

Downloading a Pre-built Engine

The trt-cloud catalog download command downloads a specific engine version for a model. Specify the model and engine version with the --model and --version options, respectively. The optional --output option specifies the local path to save the engine to; if omitted, the engine is saved in the current working directory.

trt-cloud catalog download --model=gemma_2b_it_trtllm --version=bs1_int4awq_RTX3070_windows
[I] EULA for Model: gemma_2b_it_trtllm, Engine: bs1_int4awq_RTX3070_windows
GOVERNING TERMS: The use of this TensorRT Engine is governed by the NVIDIA TRT Engine License: https://docs.nvidia.com/deeplearning/tensorrt-cloud/latest/reference/eula.html#nvidia-tensorrt-engine-license-agreement

ATTRIBUTION: Gemma Terms of Use available at https://ai.google.dev/gemma/terms; and Gemma Prohibited Use Policy available at https://ai.google.dev/gemma/prohibited_use_policy.

A temporary copy of this document can be found at: 
This will be deleted after a response to this prompt. A copy is also included with the downloaded engine archive.

Do you agree to the EULA for Model: gemma_2b_it_trtllm, Engine: bs1_int4awq_RTX3070_windows? (yes/no) yes
[I] Saving engine to gemma_2b_it_trtllm_bs1_int4awq_RTX3070_windows_files.zip
Downloading  ━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  19% 0:00:27
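
To save the archive to a chosen path instead of the current working directory, add the --output option described above; the file name below is illustrative:

trt-cloud catalog download --model=gemma_2b_it_trtllm --version=bs1_int4awq_RTX3070_windows --output=gemma_engine.zip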

Running a Pre-built Engine

Integrate the downloaded engine into your application using TensorRT or the TensorRT-LLM version that matches the prebuilt engine. This example shows how to run inference using the downloaded Gemma engine.

  1. Unzip the downloaded engine archive.
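
     For example, using the archive name from the download step above (any unzip tool works; adjust the destination so that it matches the engine directory you pass to run.py in step 3):

    unzip gemma_2b_it_trtllm_bs1_int4awq_RTX3070_windows_files.zip -d engines/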

  2. Download the model’s tokenizer and ensure that the environment satisfies the requirements of the downloaded engine (GPU type, GPU count, operating system, and installed TensorRT-LLM version).

    git clone https://huggingface.co/google/gemma-2b-it gemma-2b-it
    
  3. Run the downloaded TensorRT-LLM engine, for example with the run.py script from the TensorRT-LLM examples directory.

    python3 run.py --engine_dir engines/ --max_output_len 100 --tokenizer_dir gemma-2b-it/ --input_text "How do I count to nine in French?"
    

    For more information, refer to the TensorRT-LLM documentation.
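
If you prefer to drive the engine from Python rather than through run.py, the sketch below shows the general pattern. It assumes TensorRT-LLM 0.10's ModelRunner API and the engines/ and gemma-2b-it/ directories from the steps above; verify the exact method names and keyword arguments against your installed TensorRT-LLM version.

# Minimal inference sketch. Assumptions: TensorRT-LLM 0.10 with the
# ModelRunner API, a CUDA-capable GPU matching the engine, and the
# directory layout from the steps above.
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained("gemma-2b-it/")
runner = ModelRunner.from_dir(engine_dir="engines/")

prompt = "How do I count to nine in French?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids[0]

# generate() takes a list of 1-D token tensors and, by default, returns
# output ids shaped (batch_size, num_beams, sequence_length).
output_ids = runner.generate(
    batch_input_ids=[input_ids],
    max_new_tokens=100,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))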