Release Notes#
TensorRT-Cloud CLI
Minor improvements will be continuously pushed without the expectation that you will need to upgrade.
Major changes can be expected at a monthly cadence with the expectation that you will upgrade your version of the CLI.
Until we release TensorRT-Cloud CLI 1.0, expect some API-breaking changes with new releases.
TensorRT-Cloud
Minor improvements will be continuously pushed to production to provide enhancements as soon as possible.
Major API-breaking changes will be announced clearly in the release notes. We expect to make API-breaking changes as we receive feedback from EA customers. We expect you to upgrade to a newer version of the CLI on an API-breaking change.
Backward compatibility will be considered for GA.
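For example, if you installed the CLI from PyPI (see the 0.2.0 notes below), upgrading might look like the following sketch; the package name trt-cloud is an assumption:
pip install --upgrade trt-cloud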
TensorRT-Cloud 0.5.0 Early Access (EA)#
Key Features and Enhancements
The following features and enhancements have been added to this release:
Added multi-version support for TensorRT-LLM.
You can now build using TensorRT-LLM version 0.11 or 0.12 (see the sketch after this list).
Added support for TensorRT versions 10.4 and 10.5.
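As an illustrative sketch of version selection, a build command might pin a backend version as shown below. The flag names --trtllm-version and --trt-version are assumptions for illustration, not confirmed CLI options; consult the CLI help for the actual arguments.
trt-cloud build llm --trtllm-version 0.12
trt-cloud build onnx --trt-version 10.5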
Limitations
Return types with metrics are currently unsupported for checkpoint inputs with on-demand TensorRT-LLM builds.
Weight-stripped on-demand LLM engine building is only supported for checkpoint inputs.
Refit requires a GPU of the same SM version used to build the engine. (This is a TensorRT limitation.)
By default, weight-stripped engines must be refitted with the original ONNX weights. Only engines built with the --refit flag in the trtexec arg list may be refitted with arbitrary weights (see the sketch after this list). Fully refittable engines might have some performance degradation.
Custom plugins or any custom ops are not supported. Only built-in TensorRT ops and plugins will work.
Input ONNX models must come from one of the following:
S3
GitHub
Local machine
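As a sketch of the refit rules above: an engine can only be refitted with arbitrary weights if --refit was passed through to trtexec at build time, and local refit must be requested explicitly. The --trtexec-args mechanism shown here for forwarding trtexec flags is an assumption for illustration:
trt-cloud build onnx --strip-weights --local-refit --trtexec-args "--refit"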
Known Issues
Invalid trt-llm engine build configs may fail when setting --tp-size > 1 on GPUs that are too small.
For inquiries and to report issues, contact tensorrt-cloud-contact@nvidia.com.
TensorRT-Cloud 0.4.1 Early Access (EA)#
Key Features and Enhancements
The following features and enhancements have been added to this release:
Upgraded NVIDIA TensorRT-LLM to version 0.12.
Breaking API Changes
Replaced --kv-cache-quantization with --quantize-kv-cache. The new argument automatically picks an appropriate KV cache quantization type based on the model and quantization.
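A before-and-after sketch of this change (the int8 value for the old argument is a hypothetical example; other arguments are omitted):
Before: trt-cloud build llm --kv-cache-quantization int8
After: trt-cloud build llm --quantize-kv-cache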
Limitations
Return types with metrics are currently unsupported for checkpoint inputs with on-demand TensorRT-LLM builds.
Weight-stripped on-demand LLM engine building is only supported for checkpoint inputs.
Refit requires a GPU of the same SM version used to build the engine. (This is a TensorRT limitation.)
By default, weight-stripped engines must be refitted with the original ONNX weights. Only engines built with the --refit flag in the trtexec arg list may be refitted with arbitrary weights. Fully refittable engines might have some performance degradation.
Custom plugins or any custom ops are not supported. Only built-in TensorRT ops and plugins will work.
Input ONNX models must come from one of the following:
S3
GitHub
Local machine
Known Issues
Invalid trt-llm engine build configs may fail when setting --tp-size > 1 on GPUs that are too small.
For inquiries and to report issues, contact tensorrt-cloud-contact@nvidia.com.
TensorRT-Cloud 0.4.0 Early Access (EA)#
Key Features and Enhancements
The following features and enhancements have been added to this release:
Upgraded NVIDIA TensorRT to version 10.3.
Added Linux support for GeForce cards.
Added enablement for new TensorRT-LLM models. Refer to the Building a TensorRT-LLM Engine section for a list of the latest available models.
Breaking API Changes
ONNX models are now built with the latest version of TensorRT by default.
Previously, a fixed version was used. We encourage you to upgrade to the latest version.
Removed the --model-family option due to its redundancy.
Fixed Issues
The following issues have been fixed in this release:
Updated multiple CLI arguments to have reasonable defaults.
Updated and corrected TensorRT-Cloud documentation to be less vague.
Limitations
Return types with metrics are currently unsupported for checkpoint inputs with on-demand TensorRT-LLM builds.
Weight-stripped on-demand LLM engine building is only supported for checkpoint inputs.
Refit requires a GPU of the same SM version used to build the engine. (This is a TensorRT limitation.)
By default, weight-stripped engines must be refitted with the original ONNX weights. Only engines built with the --refit flag in the trtexec arg list may be refitted with arbitrary weights. Fully refittable engines might have some performance degradation.
Custom plugins or any custom ops are not supported. Only built-in TensorRT ops and plugins will work.
Input ONNX models must come from one of the following:
S3
GitHub
Local machine
Known Issues
For inquiries and to report issues, contact tensorrt-cloud-contact@nvidia.com.
TensorRT-Cloud 0.3.0 Early Access (EA)#
Key Features and Enhancements
The following features and enhancements have been added to this release:
Support for on-demand TensorRT-LLM builds has been added.
Support for ONNX builds with multiple TensorRT versions has been added.
Added refit support for TensorRT-LLM engines.
Support for local inputs larger than 5 GB has been added.
Breaking API Changes
The TensorRT-Cloud build now requires a build type (onnx, llm, or request_id) to be specified, as shown below.
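For example, each invocation now names its build type explicitly (sketches only; remaining arguments are omitted, and the reading of request_id as referring to an existing build request is an assumption):
trt-cloud build onnx (build an engine from an ONNX model)
trt-cloud build llm (build a TensorRT-LLM engine)
trt-cloud build request_id (refer to an existing build request)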
Limitations
Return types with metrics are currently unsupported for checkpoint inputs with on-demand TensorRT-LLM builds.
Weight-stripped on-demand LLM engine building is only supported for checkpoint inputs.
Refit requires a GPU of the same SM version used to build the engine. (This is a TensorRT limitation.)
By default, weight-stripped engines must be refitted with the original ONNX weights. Only engines built with the --refit flag in the trtexec arg list may be refitted with arbitrary weights. Fully refittable engines might have some performance degradation.
Custom plugins or any custom ops are not supported. Only built-in TensorRT ops and plugins will work.
Input ONNX models must come from one of the following (see the sketch after this list):
S3
GitHub
Local machine
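A sketch of the three input sources; the --model flag name and the exact URL forms are assumptions for illustration:
trt-cloud build onnx --model s3://my-bucket/model.onnx
trt-cloud build onnx --model https://github.com/my-org/my-repo/raw/main/model.onnx
trt-cloud build onnx --model ./local/model.onnx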
Known Issues
For inquiries and to report issues, contact tensorrt-cloud-contact@nvidia.com.
TensorRT-Cloud 0.2.0 Early Access (EA)#
Announcements
The TensorRT-Cloud CLI tool is now available on PyPI.
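For example, assuming the package is published under the name trt-cloud:
pip install trt-cloud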
Key Features and Enhancements
The following features and enhancements have been added to this release:
Added support for access to pre-built engines through TensorRT-Cloud.
Added support for more NVIDIA GeForce GPUs. For more information, refer to Planned GPU Support.
Breaking API Changes
CLI flags:
trt-cloud build --weightless was renamed to trt-cloud build --strip-weights.
trt-cloud build --strip-weights (formerly --weightless) no longer performs refit automatically. Refit is now an opt-in option with --local-refit (see the sketch below).
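A before-and-after sketch of the rename and the new opt-in local refit (model arguments omitted):
Old: trt-cloud build --weightless
New: trt-cloud build --strip-weights --local-refit
Note that without --local-refit, the new command produces a weight-stripped engine but does not refit it locally.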
Limitations
Input model files have a maximum file size of 5 GB.
This will be fixed in future releases. For now, models larger than 5 GB should use the weight-stripped flow. Refer to the Weight-Stripped Engine Generation section for information on weight-stripped engine building.
Refit requires a GPU of the same SM version used to build the engine. (This is a TensorRT limitation.)
By default, weight-stripped engines must be refitted with the original ONNX weights. Only engines built with the --refit flag in the trtexec arg list may be refitted with arbitrary weights. Fully refittable engines might have some performance degradation.
Custom plugins or any custom ops are not supported. Only built-in TensorRT ops and plugins will work.
Input ONNX models must come from one of the following:
S3
GitHub
Local machine
The TensorRT-Cloud server has a daily limit on how much data it can process for building engines on Windows. If TensorRT-Cloud hits this limit on a given day, building on Windows will not be available for the rest of the day.
Known Issues
For inquiries and to report issues, contact tensorrt-cloud-contact@nvidia.com.
TensorRT-Cloud 0.1.1 Early Access (EA)#
Announcements
The TensorRT-Cloud CLI tool will soon be published on PyPI.
Key Features and Enhancements
The following features and enhancements have been added to this release:
Support for on-demand ONNX TensorRT engines for closed EA accounts has been added.
Added support for various NVIDIA GeForce GPUs that can be used to build TensorRT engines. For more information, refer to Planned GPU Support.
Limitations
Input model files have a maximum file size of 5 GB.
This will be fixed in future releases. For now, models larger than 5 GB should use the weight-stripped flow. Refer to the Weight-Stripped Engine Generation section for information on weight-stripped engine building.
Refit requires a GPU of the same SM version used to build the engine. (This is a TensorRT limitation.)
By default, weight-stripped engines must be refitted with the original ONNX weights. Only engines built with the --refit flag in the trtexec arg list may be refitted with arbitrary weights. Fully refittable engines might have some performance degradation.
Custom plugins or any custom ops are not supported. Only built-in TensorRT ops and plugins will work.
Input ONNX models must come from one of the following:
S3
GitHub
Local machine
The TensorRT-Cloud server has a daily limit on how much data it can process for building engines on Windows. If TensorRT-Cloud hits this limit on a given day, building on Windows will not be available for the rest of the day.
Known Issues
For inquiries and to report issues, contact tensorrt-cloud-contact@nvidia.com.