Building a TensorRT-LLM Engine#
Ensure you can log into TensorRT-Cloud.
Important
Building on-demand engines is provided as a closed Early Access (EA) product. Access is restricted and is provided upon request (refer to Getting TensorRT-Cloud Access). These features will not be functional unless access is granted.
To build a TensorRT-LLM engine, provide your TensorRT-LLM build configuration along with the model name or checkpoint you want to target, and TensorRT-Cloud will generate a corresponding engine. You can also generate quantized checkpoints for a given model to save and reuse.
Additionally, the TensorRT-Cloud CLI provides utility flags for pruning weights and generating weight-stripped TensorRT-LLM engines. In short, building weight-stripped engines reduces the engine binary size at a potential performance cost.
In the sections below, we provide examples for building different kinds of engines.
Only TensorRT-LLM version 0.12.0 is supported.
| Hugging Face Repo Name |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| meta-llama/Llama-2-7B-chat-hf | Yes | Yes | Yes | Yes | Yes | Yes |
| meta-llama/Llama-2-13B-chat-hf | Yes | Yes | Yes | Yes | Yes | Yes |
| meta-llama/Meta-Llama-3-8B-Instruct | Yes | Yes | Yes | Yes | Yes | Yes [1] |
| Google/gemma-2b-it | Yes | Yes | Yes | Yes | Yes | No |
| Google/gemma-7b-it | Yes | Yes | Yes | Yes | Yes | No |
| mistralai/Mistral-7B-Instruct-v0.1 | Yes | Yes | Yes | Yes | Yes | Yes [1] |
| Microsoft/phi-2 | Yes | No | No | No | No | No |
| Hugging Face Repo Name |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- |
| meta-llama/Llama-2-7B-chat-hf | No | No | Yes | Yes | No | No |
| meta-llama/Llama-2-13B-chat-hf | No | No | Yes | Yes | No | No |
| meta-llama/Meta-Llama-3-8B-Instruct | No | No | Yes | Yes | No | No |
| Google/gemma-2b-it | Yes | No | Yes | Yes | No | No |
| Google/gemma-7b-it | Yes | No | Yes | Yes | No | No |
| mistralai/Mistral-7B-Instruct-v0.1 | No | No | Yes | Yes | No | No |
| Microsoft/phi-2 | No | No | No | No | No | No |
Note
Building a TensorRT-LLM engine for any of these models with a given GPU and data type is subject to the VRAM available on that GPU.
Specifying an Engine Build Configuration#
The TensorRT-Cloud CLI command trt-cloud build llm provides multiple arguments. To see the full list of arguments, run:
trt-cloud build llm -h
Key arguments that allow for system and engine configuration are listed below (an example command combining several of them follows this list):

- --gpu - Picks the GPU target. Use trt-cloud info to get the list of available GPUs.
- --os - Picks the OS target (linux or windows).
- --dtype - Specifies the model data type for activations and non-quantized weights (float16 or bfloat16).
- --return-type - Specifies what should be returned from the build.
  - checkpoint_only: Returns only a quantized checkpoint. Refer to Quantized Checkpoint Generation.
  - engine_only: Returns only an engine and timing cache.
  - metrics_only: Returns only metrics (no engine is returned).
  - engine_and_metrics: Returns an engine, metrics, and a timing cache.
- --quantization - Specifies the quantization to use (fp8, int4_awq, w4a8_awq, int8_wo, int4_wo, full_prec). The default is full_prec.
- --quantize-kv-cache - If specified, quantizes the KV cache. The quantization type (int8, fp8) is picked automatically based on the model and quantization. A quantized KV cache is only supported with the fp8, int4_awq, and full_prec (Gemma only) quantizations. FP8 requires SM89 or higher and is not supported on all GPUs.
- --kv-cache-quantization - Specifies the quantization for the KV cache (int8, fp8).
- --max-batch-size - Defines the maximum number of requests the engine can handle.
- --max-seq-len - Defines the maximum sequence length of a single request.
- --max-num-tokens - Defines the maximum number of batched input tokens after padding is removed in each batch.
- --tp-size - Specifies the number of GPUs for tensor parallelism during inference. Only supported for Linux builds.
- --pp-size - Specifies the number of GPUs for pipeline parallelism during inference. Only supported for Linux builds.
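For example, the following command combines several of these arguments to request an FP8-quantized Llama 2 engine plus metrics for an A100 on Linux. The model, GPU, and limit values are illustrative assumptions; adjust them for your use case.

trt-cloud build llm --hf-repo="meta-llama/Llama-2-7B-chat-hf" --gpu="A100" --os="linux" --dtype="float16" --quantization="fp8" --quantize-kv-cache --max-batch-size=8 --max-seq-len=4096 --max-num-tokens=8192 --return-type="engine_and_metrics"

Whether a quantized KV cache is available depends on the chosen model and quantization (refer to the tables above).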
Building from a Hugging Face Repository#
Run trt-cloud build llm with --hf-repo to build from a Hugging Face repository. Optionally, a repository revision can be specified using --hf-repo-revision.
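For example, to pin the build to a specific repository revision (the "main" revision shown below is purely illustrative):

trt-cloud build llm --hf-repo="google/gemma-2b-it" --hf-repo-revision="main" --dtype="float16" --gpu="A100" --os=linux

A complete build, including the EULA acknowledgment and the tail of the build logs, looks like the following: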
trt-cloud build llm --hf-repo="google/gemma-2b-it" --dtype="float16" --gpu="A100" --os=linux EULA for use of huggingface repo: google/gemma-2b-it GOVERNING TERMS: The use of this TensorRT Engine is governed by the NVIDIA TRT Engine License: https://docs.nvidia.com/deeplearning/tensorrt-cloud/latest/reference/eula.html#nvidia-tensorrt-engine-license-agreement ATTRIBUTION: Gemma Terms of Use available at https://ai.google.dev/gemma/terms; and Gemma Prohibited Use Policy available at https://ai.google.dev/gemma/prohibited_use_policy. A copy will also be included in the engine archive. Do you agree to the EULA for use of huggingface repo: google/gemma-2b-it? (yes/no) yes … Downloading to nvcf_download.zip ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.9/5.9 GB 100% 0:00:00 [08/02/2024-15:26:54] Measured rouge1 score of engine: 30.532890 [08/02/2024-15:26:54] Last 15 lines of summarize.log: --- … --------------------------------------------------------- [08/02/2024-15:26:54] [08/02/2024-22:24:35] [TRT-LLM] [I] TensorRT-LLM (total latency: 1.3209233283996582 sec) [08/02/2024-15:26:54] [08/02/2024-22:24:35] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 166) [08/02/2024-15:26:54] [08/02/2024-22:24:35] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 125.66967092716459) [08/02/2024-15:26:54] [08/02/2024-22:24:35] [TRT-LLM] [I] TensorRT-LLM beam 0 result [08/02/2024-15:26:54] [08/02/2024-22:24:35] [TRT-LLM] [I] rouge1 : 30.532889598883763 [08/02/2024-15:26:54] [08/02/2024-22:24:35] [TRT-LLM] [I] rouge2 : 10.519224834860456 [08/02/2024-15:26:54] [08/02/2024-22:24:35] [TRT-LLM] [I] rougeL : 22.77946327498464 [08/02/2024-15:26:54] [08/02/2024-22:24:35] [TRT-LLM] [I] rougeLsum : 25.958965209634254 [08/02/2024-15:26:54] --- [08/02/2024-15:26:54] Last 5 lines of trtllm_build.log: --- [08/02/2024-15:26:54] [08/02/2024-22:24:08] [TRT] [I] Serialized 6 timing cache entries [08/02/2024-15:26:54] [08/02/2024-22:24:08] [TRT-LLM] [I] Timing cache serialized to /tmp/tmp9cqxrw_c/out/build_result/timing_cache [08/02/2024-15:26:54] [08/02/2024-22:24:08] [TRT-LLM] [I] Serializing engine to /tmp/tmp9cqxrw_c/out/build_result/engines/rank0.engine... [08/02/2024-15:26:54] [08/02/2024-22:24:14] [TRT-LLM] [I] Engine serialized. Total time: 00:00:06 [08/02/2024-15:26:54] [08/02/2024-22:24:15] [TRT-LLM] [I] Total time of building all engines: 00:02:04 [08/02/2024-15:26:54] --- [08/02/2024-15:26:54] Saved build result to build_result.zip
Building from a TensorRT-LLM Checkpoint#
Building an engine from a TensorRT-LLM checkpoint may be useful in the following scenarios:
The target GPU is not large enough to accommodate the original model weights, but it can fit them if they are quantized on a larger GPU.
The engine must be built for custom weights.
The engine must be built for fixed quantized weights rather than being quantized on the fly.
To build a TensorRT-LLM engine from a TensorRT-LLM checkpoint, run trt-cloud build llm with --trtllm-checkpoint.
The checkpoint can be a local path or a URL. It can be generated manually with TensorRT-LLM or NVIDIA ModelOpt, or by using TensorRT-Cloud (refer to Quantized Checkpoint Generation). The checkpoint must be provided as a single zip archive containing the safetensors files and config.json in the root of the archive.
Ensure that no other files or directories exist in the provided archive. All provided checkpoints are validated, and the build will be rejected if the validation fails.
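If you are packaging the checkpoint yourself, the following sketch shows one way to create a conforming archive with a standard zip tool. The directory and file names are hypothetical and depend on how the checkpoint was generated.

# Hypothetical checkpoint layout:
#   my_checkpoint/config.json
#   my_checkpoint/rank0.safetensors
cd my_checkpoint
zip ../checkpoint.zip config.json rank0.safetensors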
Note
Metrics will not be returned for engines built from a TensorRT-LLM checkpoint.
Checkpoints generated with a different TensorRT-LLM version might be incompatible.
You must have the right to use the model and its weights before you upload a checkpoint for engine build through TensorRT-Cloud.
trt-cloud build llm --trtllm-checkpoint=./checkpoint.zip --os=linux --gpu=A100 --dtype=float16
Uploading checkpoint.zip
Splitting file into multiple assets because it is larger than 5 GB
…
Selected NVCF Function …
NVCF Request ID: …
[I] Latest poll status: 202 at<>. Position in queue: 0.
Downloading to nvcf_download.zip
…
Last 5 lines of trtllm_build.log:
---
[TRT] [I] Serialized 6 timing cache entries
[TRT-LLM] [I] Timing cache serialized to …
[TRT-LLM] [I] Serializing engine to...
[TRT-LLM] [I] Engine serialized. Total time: …
[TRT-LLM] [I] Total time of building all engines: …
---
Saved build result to build_result.zip
Weight-Stripped Engine Generation#
Weight-stripped engine generation is enabled by passing --strip-weights to the trt-cloud build llm command. The resulting weight-stripped engines can then be refitted with weights from a TensorRT-LLM checkpoint directly on an end-user GPU.
Note
Weight-stripped engine generation is only supported for builds from TensorRT-LLM checkpoints.
Local weight pruning and refit are only supported for builds from a TensorRT-LLM checkpoint with a local path.
For checkpoints provided as a URL, the engine will still be built with --strip-plan; however, the checkpoint will not be pruned before the build is submitted. To refit engines built using a checkpoint URL, download the checkpoint locally and then manually run the --local-refit command.
trt-cloud build llm --trtllm-checkpoint=./checkpoint.zip --os=linux --gpu=A100 --dtype=float16 --strip-weights [--local-refit]
[08/02/2024-15:55:37] Extracting trtllm checkpoint from checkpoint.zip -> /tmp/tmpdqogy19d
[08/02/2024-15:56:21] Pruning weights from /tmp/tmpdqogy19d
[08/02/2024-15:56:21] [TRT-LLM] [I] Checkpoint Dir: /tmp/tmpdqogy19d, Out Dir: /tmp/tmp0ylsx6ot
[08/02/2024-15:56:25] Creating pruned checkpoint archive.
[08/02/2024-15:57:22] Uploading /tmp/tmp2wn3wup5/weight_pruned_checkpoint.zip
Uploading /tmp/tmp2wn3wup5/weight_pruned_checkpoint.zip …
[08/02/2024-15:57:40] Will build using TRT LLM version 0.11.0
…
Downloading to nvcf_download.zip
…
[08/02/2024-16:03:37] Last 5 lines of trtllm_build.log:
---
[08/02/2024-16:03:37] [08/02/2024-23:03:35] [TRT] [I] Serialized 7 timing cache entries
[08/02/2024-16:03:37] [08/02/2024-23:03:35] [TRT-LLM] [I] Timing cache serialized to /tmp/tmpl31kazux/out/build_result/timing_cache
[08/02/2024-16:03:37] [08/02/2024-23:03:35] [TRT-LLM] [I] Serializing engine to /tmp/tmpl31kazux/out/build_result/engines/rank0.engine...
[08/02/2024-16:03:37] [08/02/2024-23:03:35] [TRT-LLM] [I] Engine serialized. Total time: 00:00:00
[08/02/2024-16:03:37] [08/02/2024-23:03:35] [TRT-LLM] [I] Total time of building all engines: 00:05:21
[08/02/2024-16:03:37] ---
[08/02/2024-16:03:37] Saved build result to build_result.zip
Flow for weight-stripped engine generation:

1. [local] The CLI extracts the provided checkpoint archive, prunes weights from the checkpoint, and then re-creates a pruned checkpoint archive.
2. [remote] The builder builds a weight-stripped TensorRT-LLM engine (built with --strip-plan).
3. [local] Optionally, if --local-refit is provided, the built engine is refitted with the original model weights.
Quantized Checkpoint Generation#
Generating only a quantized checkpoint is supported. This checkpoint can later be reused to build an engine using the flow described in Building from a TensorRT-LLM Checkpoint. To generate a quantized checkpoint, pass --return-type checkpoint_only to the build command.
The GPU used to generate the checkpoint does not have to match the GPU used to build an engine from that checkpoint. This allows for quantizing a model on a large GPU, such as an A100, and using the resulting checkpoint to build an engine for a GPU on which the original weights would not have fit.
The GPU used for quantization must support the quantized data type. The fp8 and w4a8_awq quantization formats require the FP8 data type, which is only available on NVIDIA Ada Lovelace and NVIDIA Hopper GPUs and later.
For example, the following command builds a Gemma 7B checkpoint with INT4 AWQ quantization.
trt-cloud build llm --hf-repo google/gemma-7b-it --quantization int4_awq --os linux --gpu A100 --return-type checkpoint_only --dtype bfloat16 -o gemma_7b_checkpoint.zip

EULA for use of huggingface repo: google/gemma-7b-it
GOVERNING TERMS: The use of this TensorRT Engine is governed by the NVIDIA TRT Engine License: https://docs.nvidia.com/deeplearning/tensorrt-cloud/latest/reference/eula.html#nvidia-tensorrt-engine-license-agreement
ATTRIBUTION: Gemma Terms of Use available at https://ai.google.dev/gemma/terms; and Gemma Prohibited Use Policy available at https://ai.google.dev/gemma/prohibited_use_policy.
A copy will also be included in the engine archive.
Do you agree to the EULA for use of huggingface repo: google/gemma-7b-it? (yes/no) yes
[I] Will build using TRT LLM version 0.11.0
[I] Selected NVCF Function 55e7f2c8-788c-498a-9ce9-db414e3d48cd with version 2692545d-ce95-493d-b807-a6747924c5e8
[I] NVCF Request ID: a1ab2d06-aad3-49e4-ad49-49eb529bffd3
[I] Latest poll status: 202 at 15:30:10. Position in queue: 0.
Downloading to nvcf_download.zip ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.5/5.5 GB 100% 0:00:00
[I] Last 5 lines of quantize.log:
---
[I] Calibrating batch 31
[I] Quantization done. Total time used: 189.02 s.
[I] current rank: 0, tp rank: 0, pp rank: 0
[I] Quantized model exported to /tmp/tmp16tbnqj3/out/build_result/trtllm_checkpoint
[I] Total time used 48.87 s.
[I] ---
[I] Saved build result to gemma_7b_checkpoint.zip
Using Quantized Checkpoints to Build an Engine#
The generated TensorRT-LLM checkpoint can be used to build a TensorRT-LLM engine on the target GPU.
For example, the following command builds a TensorRT-LLM engine from the quantized Gemma checkpoint generated above.
trt-cloud build llm --trtllm-checkpoint gemma_7b_checkpoint.zip --gpu RTX3070 --os windows --dtype bfloat16 -o gemma_engine.zip

[I] Uploading gemma_7b_checkpoint.zip
[I] Splitting file into multiple assets because it is larger than 5 GB
[I] Wrote file chunk to /tmp/tmp7cf8co2p
Uploading /tmp/tmp7cf8co2p ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.0/5.0 GB 100% 0:00:00
[I] Uploaded new NVCF asset with ID f1bc62f7-0499-45a8-bd32-704878f749dc
[I] Wrote file chunk to /tmp/tmpfg7l87i6
Uploading /tmp/tmpfg7l87i6 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 693.2/693.2 MB 100% 0:00:00
[I] Uploaded new NVCF asset with ID c6e42354-75f4-4406-b1bd-51c3013d4295
[I] Will build using TRT LLM version 0.11.0
[I] Selected NVCF Function e62edd9e-71ee-43fe-8ea9-fc1aaa6007b5 with version 2eb25f40-6cb3-48ad-bc55-5e29419469b8
[I] NVCF Request ID: 98eef062-e48d-4719-8255-911a4a00ed2d
[I] Latest poll status: 202 at 15:53:03. Position in queue: 0.
Downloading to nvcf_download.zip ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 GB 100% 0:00:00
[I] Last 5 lines of trtllm_build.log:
---
[I] [08/06/2024-15:51:25] [TRT] [I] Serialized 6 timing cache entries
[I] [08/06/2024-15:51:25] [TRT-LLM] [I] Timing cache serialized to C:\Windows\SERVIC~1\NETWOR~1\AppData\Local\Temp\tmp2cgwcbuh\out\build_result\timing_cache
[I] [08/06/2024-15:51:25] [TRT-LLM] [I] Serializing engine to C:\Windows\SERVIC~1\NETWOR~1\AppData\Local\Temp\tmp2cgwcbuh\out\build_result\engines\rank0.engine...
[I] [08/06/2024-15:51:36] [TRT-LLM] [I] Engine serialized. Total time: 00:00:10
[I] [08/06/2024-15:51:36] [TRT-LLM] [I] Total time of building all engines: 00:04:10
[I] ---
[I] Saved build result to gemma_engine.zip
TensorRT-LLM Engine Build Metrics#
Metrics are returned in the file metrics.json, in the following format:
{ "rouge1": 30.532889598883763, "rouge2": 10.519224834860456, "rougeL": 22.77946327498464, "rougeLsum": 25.958965209634254, "generation_tokens_per_second": 126.849, "gpu_peak_mem_gb": 7.783 }
The ROUGE metrics measure accuracy and are evaluated using the summarize.py script in TensorRT-LLM with the cnn_dailymail dataset.
generation_tokens_per_second is measured using the benchmark.py script in TensorRT-LLM with batch size 1.
gpu_peak_mem_gb is the peak GPU memory used while benchmarking, also with batch size 1.
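To inspect the returned metrics locally, unpack the build result archive and print metrics.json. This is a minimal sketch assuming the default build_result.zip output name; the exact location of metrics.json inside the archive may differ.

unzip -o build_result.zip -d build_result
find build_result -name metrics.json -exec cat {} \;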
Running a TensorRT-LLM Engine#
Build a TensorRT-LLM engine.
trt-cloud build llm --hf-repo="meta-llama/Llama-2-13B-chat-hf" --gpu="A100" --os="linux"
Install the corresponding version of tensorrt_llm locally.
pip install tensorrt_llm==<version>
Clone the TensorRT-LLM examples.
git clone https://github.com/NVIDIA/TensorRT-LLM.git --branch v<version>
Log into Hugging Face with your token, if needed.
huggingface-cli login --token <your_token>

Alternatively, copy the tokenizer for the model to a locally accessible path and provide that path to the run.py script (--tokenizer_dir).
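Extract the build result archive so that the engine directory used in the next step exists. The command below assumes the default build_result.zip output name and that the archive unpacks to a build_result/engines directory; adjust --engine_dir in the next step if your layout differs.

unzip build_result.zip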
Run the engine.
python3 ./TensorRT-LLM/examples/run.py --engine_dir build_result/engines --max_output_len=100 --input_text "How do I count to nine in French?"

[TensorRT-LLM] TensorRT-LLM version: 0.11.0
[10/02/2024-21:20:38] [TRT-LLM] [W] tokenizer_dir is not specified. Try to infer from model_name, but this may be incorrect.
tokenizer_config.json: 100%|████████████████████████████████████████| 776/776 [00:00<00:00, 3.77MB/s]
tokenizer.model: 100%|████████████████████████████████████████| 500k/500k [00:00<00:00, 25.9MB/s]
tokenizer.json: 100%|████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 7.22MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████| 414/414 [00:00<00:00, 2.94MB/s]
[10/02/2024-21:23:31] [TRT-LLM] [I] Load engine takes: 170.79934310913086 sec
Input [Text 0]: "<s> How do I count to nine in French?"
Output [Text 0 Beam 0]: " To count to nine in French, you can use the numbers from one to nine, which are:
1 - un
2 - deux
3 - trois
4 - quatre
5 - cinq
6 - six
7 - sept
8 - huit
9 - neuf
So, to count to nine in French, you would say: un, deux, trois, quatre, cinq, six, sept, huit, neuf"
For more information, refer to the TensorRT-LLM Quick Start Guide.
Footnotes