Building a TensorRT-LLM Engine#

To build a TensorRT-LLM engine, you must provide your TensorRT-LLM build configuration along with the model name or checkpoint you want to target, and TensorRT-Cloud will generate a corresponding engine. You can also generate quantized checkpoints for a given model to save and reuse.

Additionally, the TensorRT-Cloud CLI provides utility flags to prune and generate weight-stripped TensorRT-LLM engines. In short, building weight-stripped engines reduces the engine binary size at a potential performance cost.

For a list of supported models, refer to the Supported Models, Hardware, and OS section.

Specifying an Engine Build Configuration#

The TensorRT-Cloud CLI trt-cloud build llm command provides multiple arguments. To see the full list of arguments, run:

$ trt-cloud build llm -h

Building from a Hugging Face Repository#

Run trt-cloud build llm with --src-hf-repo to build from a Hugging Face repository.

$ trt-cloud build llm --src-hf-repo="google/gemma-2b-it" --dtype="float16" --gpu="A100" --os=linux

[I] Build session with build_id: <build_id> started.
[I] To check the status of the build, run:
[I] trt-cloud build status <build_id>

Building from a TensorRT-LLM Checkpoint#

Building an engine from a TensorRT-LLM checkpoint may be useful in the following scenarios:

  • The target GPU is too small to hold the original model weights, but it can fit them once they are quantized.

    • Quantize the model on a larger GPU and bring the resulting quantized checkpoint to the target GPU.

  • The engine must be built for custom or pre-quantized weights.

To build a TensorRT-LLM engine from a TensorRT-LLM checkpoint, run trt-cloud build llm with --src-type trtllm_checkpoint, and provide the checkpoint via one of --src-path, --src-url, or --src-ngc.

Where:

  • --src-path is the local path that contains the TRT-LLM checkpoint.

  • --src-url is the URL to a model hosted on AWS S3 or GitHub.

    Note

    The URL must not require authentication headers.

    • For TRT-LLM checkpoints hosted on S3, it is recommended that you create a pre-signed GET URL with a limited time-to-live (TTL) for use with TensorRT-Cloud (an example invocation is shown at the end of this section).

  • --src-ngc is the NGC Private Registry model location, in org/[team/]name[:version] format, that contains the TensorRT-LLM checkpoint. For example:

    • my-org/my-team/my-checkpoint:1.0

    • my-org/other-trtllm-checkpoint:custom-version

For all of the above, the local path, URL, or NGC model must contain one of the following:

  • A directory containing the safetensors files and config.json at the top level.

  • A zip archive of the above.
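
For example, a minimal single-rank checkpoint directory typically contains just two files (the names shown are illustrative; multi-rank checkpoints include one rank<N>.safetensors file per rank):

  config.json
  rank0.safetensors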

The checkpoint can be generated manually with TensorRT-LLM or NVIDIA ModelOpt, or by using TensorRT-Cloud (refer to Quantized Checkpoint Generation for more information).

Ensure that no other files or directories exist in the provided archive. All provided checkpoints are validated, and the build will be rejected if the validation fails.

Note

  • When building from a TensorRT-LLM checkpoint, you must also pass the --model-family argument to specify the input model family. The currently supported model families are llama, gemma, gpt, nemotron_nas, phi, and qwen.

  • By default, performance metrics are not returned for engines built from a TensorRT-LLM checkpoint. To generate performance metrics, you must pass the --return-type=metrics_only or --return-type=engine_and_metrics and also pass a tokenizer using --tokenizer-hf-repo or --tokenizer-path.

  • Checkpoints generated with a different TensorRT-LLM version might be incompatible.

  • You must have the correct permissions to use the model and its weights before you upload a checkpoint for engine build through TensorRT-Cloud.

$ trt-cloud build llm --src-path=./checkpoint.zip --src-type trtllm_checkpoint --os=linux --gpu=A100 --dtype=float16 --model-family llama

[I] Local model was provided. Checking NGC upload cache for existing model...
[I] Configuring NGC client with org: <org>, team: <team>
[I] Validating configuration...
[I] Successfully validated configuration.
[I] NGC client configured successfully.
[I] Computing hash of local path 'checkpoint.zip' for cache lookup...
[I] Creating NGC Model 'local-model-4596b9cc' in Private Registry
[I] Uploading local path 'checkpoint.zip' to NGC Model 'local-model-4596b9cc'
Upload progress: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.9/7.9 MB 100% 0:00:00
[I] Successfully uploaded NGC Model 'local-model-4596b9cc'
[I] Build session with build_id: <build_id> started.
[I] To check the status of the build, run:
[I] trt-cloud build status <build_id>
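
The checkpoint can also be supplied via --src-url or --src-ngc instead of a local path. For example, using a hypothetical pre-signed S3 URL:

$ trt-cloud build llm --src-url="<pre-signed-s3-url>" --src-type trtllm_checkpoint --os=linux --gpu=A100 --dtype=float16 --model-family llama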

Weight-Stripped Engine Generation#

Weight-stripped engine generation is enabled by passing --strip-weights to the trt-cloud build llm command. These weight-stripped engines can then be refitted with weights from a TensorRT-LLM checkpoint directly on an end-user GPU.

Note

  • Weight-stripped engine generation is only supported for builds from TensorRT-LLM checkpoints.

  • Local weight pruning is only supported for builds from a TensorRT-LLM checkpoint with a local path.

  • For checkpoints provided as a URL, the engine is still built with --strip-plan; however, the checkpoint is not pruned locally before the build is submitted.

$ trt-cloud build llm --src-path=./checkpoint.zip --src-type=trtllm_checkpoint --os=linux --gpu=A100 --dtype=float16 --strip-weights --model-family llama
[I] Interpreting ./checkpoint.zip as checkpoint archive.
[I] Extracting checkpoint from ./checkpoint.zip -> /tmp/tmpr3y6ly22
[I] Pruning weights from /tmp/tmpr3y6ly22/build_result/trtllm_checkpoint

[08/02/2024-15:55:37] Extracting trtllm checkpoint from checkpoint.zip -> /tmp/tmpdqogy19d
[08/02/2024-15:56:21] Pruning weights from /tmp/tmpdqogy19d
[08/02/2024-15:56:21] [TRT-LLM] [I] Checkpoint Dir: /tmp/tmpdqogy19d, Out Dir: /tmp/tmp0ylsx6ot
[08/02/2024-15:56:25] Creating pruned checkpoint archive.
[I] Local model was provided. Checking NGC upload cache for existing model...
[I] Configuring NGC client with org: <org>, team: <team>
[I] Validating configuration...
[I] Successfully validated configuration.
[I] NGC client configured successfully.
[I] Computing hash of local path 'checkpoint.zip' for cache lookup...
[I] Creating NGC Model 'local-model-4596b9cc' in Private Registry
[I] Uploading local path '/tmp/tmp0ylsx6ot' to NGC Model 'local-model-4596b9cc'
Upload progress: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.9/7.9 MB 100% 0:00:00
[I] Successfully uploaded NGC Model 'local-model-4596b9cc'
[I] Build session with build_id: <build_id> started.
[I] To check the status of the build, run:
[I] trt-cloud build status <build_id>

Flow for weight-stripped engine generation:

  1. [local] The CLI extracts the provided checkpoint archive, prunes weights from the checkpoint, and then re-creates a pruned checkpoint archive.

    • If a non-local checkpoint (URL, Hugging Face repository, and so on) is provided, the pruning step is skipped.

  2. [remote] The builder builds a weight-stripped TensorRT-LLM engine (built with --strip-plan).

Quantized Checkpoint Generation#

Generating only a quantized checkpoint is supported. This checkpoint can later be reused to generate an engine using the flow described in the previous section. To generate a quantized checkpoint, pass --return-type checkpoint_only to the build command.

The GPU used to generate the checkpoint does not have to match the GPU used to build an engine from that checkpoint. This allows for quantizing a model on a large GPU, such as an A100, and using the resulting checkpoint to build an engine for a GPU on which the original weights would not have fit.

The GPU used for quantization must support the quantized data type. The fp8 and w4a8_awq quantization formats require the FP8 data type, which is only available on NVIDIA Ada Lovelace, NVIDIA Hopper, and later GPUs.

For example, the following command builds a Gemma 7B checkpoint with INT4 AWQ quantization.

$ trt-cloud build llm --src-hf-repo google/gemma-7b-it --quantization int4_awq --os linux --gpu A100 --return-type checkpoint_only --dtype bfloat16

[I] Build session with build_id: <build_id> started.
[I] To check the status of the build, run:
[I] trt-cloud build status <build_id>

Using Quantized Checkpoints to Build an Engine#

The generated TensorRT-LLM checkpoint can be used to build a TensorRT-LLM engine on the target GPU.

For example, the following command builds a TensorRT-LLM engine from a quantized Gemma checkpoint.

$ trt-cloud build llm --src-path gemma_7b_checkpoint.zip --src-type trtllm_checkpoint --gpu RTX3070 --os windows --dtype bfloat16 --model-family gemma

[I] Build session with build_id: <build_id> started.
[I] To check the status of the build, run:
[I] trt-cloud build status <build_id>

Alternatively, you can provide the input TensorRT-LLM checkpoint from the NGC Private Registry. This is particularly useful if you generated the checkpoint using the directions from Quantized Checkpoint Generation, as the checkpoint will already be uploaded to NGC by default.

For example, if the checkpoint was uploaded to the NGC Private Registry at the following URL: https://registry.ngc.nvidia.com/orgs/my-org/models/trt-cloud-build-result-1234-5678/files?version=1.0, you could pass the checkpoint directly from NGC like this:

$ trt-cloud build llm --src-ngc "my-org/trt-cloud-build-result-1234-5678:1.0" --src-type trtllm_checkpoint --gpu RTX3070 --os windows --dtype bfloat16 --model-family gemma

TensorRT-LLM Engine Build Metrics#

Metrics are returned in the file metrics.json, in the following format:

{
  "rouge1": 30.532889598883763,
  "rouge2": 10.519224834860456,
  "rougeL": 22.77946327498464,
  "rougeLsum": 25.958965209634254,
  "generation_tokens_per_second": 126.849,
  "gpu_peak_mem_gb": 7.783
}

ROUGE metrics measure accuracy and are evaluated using the summarize.py script in TensorRT-LLM with the cnn_dailymail dataset.

generation_tokens_per_second is measured using the benchmark.py script in TensorRT-LLM at batch size 1.

gpu_peak_mem_gb is the peak GPU memory used during benchmarking, also at batch size 1.
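
To inspect a single metric from the command line, you can read metrics.json with any JSON tool; for example, assuming jq is installed:

$ jq '.generation_tokens_per_second' metrics.json
126.849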

Note

By default, metrics are not returned for engines built from a TensorRT-LLM checkpoint. To generate metrics, you must pass --return-type=metrics_only or --return-type=engine_and_metrics and also pass a tokenizer using --tokenizer-hf-repo or --tokenizer-path.

Custom Build Result Location#

By default, the build results from trt-cloud build (including any TensorRT engines) are automatically uploaded to your organization’s NGC Private Registry as an NGC model with the name trt-cloud-build-result-<build_id>. You can provide a custom location where the build result should be sent via the --dst-ngc CLI argument. This argument is shared across trt-cloud build onnx, trt-cloud build llm, and trt-cloud sweep build.

Uploading Build Results to a Custom NGC Model#

You can pass a custom NGC model location in the org/[team/]name[:version] format, and the build results are uploaded there instead. For example:

$ trt-cloud build onnx --src-path model.onnx --gpu RTX3070 --os windows --dst-ngc "my-org/my-team/my-model-name:model-version"

Note

If the organization is different from the one you are currently logged into in the TensorRT-Cloud CLI, you must provide an NGC access token for that organization via --dst-token.
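
The same argument applies to LLM builds. For example (the destination model location shown is illustrative):

$ trt-cloud build llm --src-hf-repo="google/gemma-2b-it" --dtype="float16" --gpu="A100" --os=linux --dst-ngc "my-org/my-team/gemma-2b-engine:1.0"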

Running a TensorRT-LLM Engine#

  1. Build a TensorRT-LLM engine.

    $ trt-cloud build llm --src-hf-repo="meta-llama/Llama-2-13B-chat-hf" --gpu="A100" --os="linux"
    
  2. Install the corresponding version of tensorrt_llm locally by performing the steps outlined in the Installing on Linux section of the TensorRT-LLM documentation.

    Note

    If you encounter issues when trying to install TensorRT-LLM version 0.17.0, you may need to install version 0.17.0.post1 instead.
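
    For example, on Linux the wheel can typically be installed from NVIDIA's PyPI index (a sketch; pin the version that matches your engine):

    $ pip3 install tensorrt_llm==0.17.0.post1 --extra-index-url https://pypi.nvidia.com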

  3. Clone the TensorRT-LLM examples.

    $ git clone https://github.com/NVIDIA/TensorRT-LLM.git --branch v<version>
    
  4. Log into Hugging Face with your token if needed.

    $ huggingface-cli login --token <your_token>
    

    Alternatively, copy the tokenizer for the model to a locally accessible path and pass that path to the run.py script via --tokenizer_dir, as shown in the example below.
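
    For example (the local tokenizer path shown is illustrative):

    $ python3 ./TensorRT-LLM/examples/run.py --engine_dir build_result/engines --tokenizer_dir ./llama-2-13b-chat-tokenizer --max_output_len=100 --input_text "How do I count to nine in French?"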

  5. Run the engine.

    $ python3 ./TensorRT-LLM/examples/run.py --engine_dir build_result/engines --max_output_len=100 --input_text "How do I count to nine in French?"
    
    [TensorRT-LLM] TensorRT-LLM version: 0.11.0
    [10/02/2024-21:20:38] [TRT-LLM] [W] tokenizer_dir is not specified. Try to infer from model_name, but this may be incorrect.
    tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 776/776 [00:00<00:00, 3.77MB/s]
    tokenizer.model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 25.9MB/s]
    tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 7.22MB/s]
    special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 414/414 [00:00<00:00, 2.94MB/s]
    
    [10/02/2024-21:23:31] [TRT-LLM] [I] Load engine takes: 170.79934310913086 sec
    Input [Text 0]: "<s> How do I count to nine in French?"
    Output [Text 0 Beam 0]: "
    
    To count to nine in French, you can use the numbers from one to nine, which are:
    
    1 - un
    2 - deux
    3 - trois
    4 - quatre
    5 - cinq
    6 - six
    7 - sept
    8 - huit
    9 - neuf
    
    So, to count to nine in French, you would say:
    
    un, deux, trois, quatre, cinq, six, sept, huit, neuf"
    

    For more information, refer to the TensorRT-LLM Quick Start Guide.