Release Notes#

TensorRT-Cloud CLI

Minor improvements will be continuously pushed without the expectation that you will need to upgrade.

Major changes can be expected at a monthly cadence with the expectation that you will upgrade your version of the CLI.

Until we release TensorRT-Cloud CLI 1.0, expect some API-breaking changes with new releases.

TensorRT-Cloud

Minor improvements will be continuously pushed to production to provide enhancements as soon as possible.

Major API-breaking changes will be announced clearly in the release notes. We expect to make API-breaking changes as we incorporate feedback from EA customers, and you will be expected to upgrade to a newer version of the CLI when such a change occurs.

Backward compatibility will be considered for the GA release.

TensorRT-Cloud 0.5.0 Early Access (EA)#

Key Features and Enhancements

The following features and enhancements have been added to this release:

  • Added multi-version support for TensorRT-LLM.

    • Can now build using TensorRT-LLM version 0.11 or 0.12.

  • Added support for TensorRT versions 10.4 and 10.5.
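
For illustration, version selection in a build command might look like the following sketch; the --trtllm-version and --trt-version flag names, the --model flag, and the paths are assumptions, not confirmed CLI syntax:

    # Assumed flag for selecting the TensorRT-LLM version:
    trt-cloud build llm --model ./checkpoint --trtllm-version 0.12

    # Assumed flag for selecting the TensorRT version:
    trt-cloud build onnx --model ./model.onnx --trt-version 10.5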

Limitations

  • Return types with metrics are currently unsupported for checkpoint inputs with on-demand TensorRT-LLM builds.

  • Weight-stripped on-demand LLM engine building is only supported for checkpoint inputs.

  • Refit requires a GPU of the same SM version used to build the engine. (This is a TensorRT limitation.)

  • By default, weight-stripped engines must be refitted with the original ONNX weights. Only engines built with the --refit flag in the trtexec arg list may be refitted with arbitrary weights (see the sketch after this list).

  • Fully refittable engines might have some performance degradation.

  • Custom plugins or any custom ops are not supported. Only built-in TensorRT ops and plugins will work.

  • Input ONNX models must come from one of the following:

    • S3

    • GitHub

    • Local machine
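
As noted in the refit limitation above, only engines built as refittable can accept arbitrary weights during refit. A minimal sketch of such a build follows; the --model and --trtexec-args parameter names and the path are illustrative assumptions:

    # Build a weight-stripped engine and mark it refittable by forwarding
    # the --refit flag to trtexec (the forwarding syntax is assumed):
    trt-cloud build onnx --model ./model.onnx --strip-weights --trtexec-args="--refit"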

Known Issues

  • Invalid TensorRT-LLM engine build configurations, such as setting --tp-size > 1 on GPUs that are too small, may cause builds to fail.

For inquiries and to report issues, contact tensorrt-cloud-contact@nvidia.com.

TensorRT-Cloud 0.4.1 Early Access (EA)#

Key Features and Enhancements

The following features and enhancements have been added to this release:

  • Upgraded NVIDIA TensorRT-LLM to version 0.12.

Breaking API Changes

  • Replaced --kv-cache-quantization with --quantize-kv-cache. The new argument automatically picks an appropriate KV cache quantization type based on the model and its quantization.
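
For illustration, the rename might look like the following sketch; the --model flag, the path, and the old argument's explicit type value are assumptions:

    # 0.4.0 and earlier (removed): the KV cache quantization type was passed explicitly.
    trt-cloud build llm --model ./checkpoint --kv-cache-quantization int8

    # 0.4.1 and later: an appropriate type is picked automatically.
    trt-cloud build llm --model ./checkpoint --quantize-kv-cache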

Limitations

  • Return types with metrics are currently unsupported for checkpoint inputs with on-demand TensorRT-LLM builds.

  • Weight-stripped on-demand LLM engine building is only supported for checkpoint inputs.

  • Refit requires a GPU of the same SM version used to build the engine. (This is a TensorRT limitation.)

  • By default, weight-stripped engines must be refitted with the original ONNX weights. Only engines built with the --refit flag in the trtexec arg list may be refitted with arbitrary weights.

  • Fully refittable engines might have some performance degradation.

  • Custom plugins or any custom ops are not supported. Only built-in TensorRT ops and plugins will work.

  • Input ONNX models must come from one of the following:

    • S3

    • GitHub

    • Local machine

Known Issues

  • Invalid TensorRT-LLM engine build configurations, such as setting --tp-size > 1 on GPUs that are too small, may cause builds to fail.

For inquiries and to report issues, contact tensorrt-cloud-contact@nvidia.com.

TensorRT-Cloud 0.4.0 Early Access (EA)#

Key Features and Enhancements

The following features and enhancements have been added to this release:

  • Upgraded NVIDIA TensorRT to version 10.3.

  • Added Linux support for GeForce cards.

  • Added support for additional TensorRT-LLM models. Refer to the Building a TensorRT-LLM Engine section for a list of the latest available models.

Breaking API Changes

  • ONNX model builds now use the latest version of TensorRT by default.

    • Previously, a fixed TensorRT version was used. We encourage you to build with the latest version.

  • Removed the --model-family option due to its redundancy.
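
For builds that should stay on a fixed TensorRT version rather than the new default, a sketch follows; the --trt-version flag name, the --model flag, and the path are assumptions, not confirmed CLI syntax:

    # Default behavior: builds against the latest TensorRT release.
    trt-cloud build onnx --model ./model.onnx

    # Assumed flag for pinning a specific TensorRT version:
    trt-cloud build onnx --model ./model.onnx --trt-version 10.3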

Fixed Issues

The following issues have been fixed in this release:

  • Updated multiple CLI arguments to have reasonable defaults.

  • Revised the TensorRT-Cloud documentation to be clearer and more precise.

Limitations

  • Return types with metrics are currently unsupported for checkpoint inputs with on-demand TensorRT-LLM builds.

  • Weight-stripped on-demand LLM engine building is only supported for checkpoint inputs.

  • Refit requires a GPU of the same SM version used to build the engine. (This is a TensorRT limitation.)

  • By default, weight-stripped engines must be refitted with the original ONNX weights. Only engines built with the --refit flag in the trtexec arg list may be refitted with arbitrary weights.

  • Fully refittable engines might have some performance degradation.

  • Custom plugins or any custom ops are not supported. Only built-in TensorRT ops and plugins will work.

  • Input ONNX models must come from one of the following:

    • S3

    • GitHub

    • Local machine

For inquiries and to report issues, contact tensorrt-cloud-contact@nvidia.com.

TensorRT-Cloud 0.3.0 Early Access (EA)#

Key Features and Enhancements

The following features and enhancements have been added to this release:

  • Added support for on-demand TensorRT-LLM builds.

  • Added support for ONNX builds with multiple TensorRT versions.

  • Added refit support for TensorRT-LLM engines.

  • Added support for local inputs larger than 5 GB.

Breaking API Changes

  • The trt-cloud build command now requires a build type (onnx, llm, or request_id) to be specified.
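
For illustration, the new invocation shapes might look like the following; the build type names come from this note, while the --model flag, paths, and request-id usage are assumptions:

    # A build type is now required:
    trt-cloud build onnx --model ./model.onnx
    trt-cloud build llm --model ./checkpoint
    trt-cloud build request_id <id>   # retrieve a previously submitted build (assumed usage)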

Limitations

  • Return types with metrics are currently unsupported for checkpoint inputs with on-demand TensorRT-LLM builds.

  • Weight-stripped on-demand LLM engine building is only supported for checkpoint inputs.

  • Refit requires a GPU of the same SM version used to build the engine. (This is a TensorRT limitation.)

  • By default, weight-stripped engines must be refitted with the original ONNX weights. Only engines built with the --refit flag in the trtexec arg list may be refitted with arbitrary weights.

  • Fully refittable engines might have some performance degradation.

  • Custom plugins or any custom ops are not supported. Only built-in TensorRT ops and plugins will work.

  • Input ONNX models must come from one of the following:

    • S3

    • GitHub

    • Local machine

For inquiries and to report issues, contact tensorrt-cloud-contact@nvidia.com.

TensorRT-Cloud 0.2.0 Early Access (EA)#

Announcements

  • The TensorRT-Cloud CLI tool is now available on PyPI.

Key Features and Enhancements

The following features and enhancements have been added to this release:

  • Added access to pre-built engines through TensorRT-Cloud.

  • Added support for more NVIDIA GeForce GPUs. For more information, refer to Planned GPU Support.

Breaking API Changes

  • CLI flags:

    • trt-cloud build --weightless was renamed to trt-cloud build --strip-weights.

    • trt-cloud build --strip-weights (formerly --weightless) no longer performs a refit automatically. Refit is now opt-in through the --local-refit flag.
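
For illustration, the renamed flow might look like the following (other arguments omitted):

    # Before 0.2.0 (flag removed): stripping weights implied an automatic local refit.
    trt-cloud build --weightless

    # 0.2.0 and later: refit is opt-in via --local-refit.
    trt-cloud build --strip-weights --local-refit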

Limitations

  • Input model files have a maximum file size of 5 GB.

    • This will be fixed in future releases. For now, models larger than 5 GB should use the weight-stripped flow. Refer to the Weight-Stripped Engine Generation section for information on weight-stripped engine building.

  • Refit requires a GPU of the same SM version used to build the engine. (This is a TensorRT limitation.)

  • By default, weight-stripped engines must be refitted with the original ONNX weights. Only engines built with the --refit flag in the trtexec arg list may be refitted with arbitrary weights.

  • Fully refittable engines might have some performance degradation.

  • Custom plugins or any custom ops are not supported. Only built-in TensorRT ops and plugins will work.

  • Input ONNX models must come from one of the following:

    • S3

    • GitHub

    • Local machine

  • The TensorRT-Cloud server has a daily limit on how much data it can process for building engines on Windows. If TensorRT-Cloud hits this limit on a given day, building on Windows will not be available for the rest of the day.

For inquiries and to report issues, contact tensorrt-cloud-contact@nvidia.com.

TensorRT-Cloud 0.1.1 Early Access (EA)#

Announcements

  • The TensorRT-Cloud CLI tool will soon be published on PyPI.

Key Features and Enhancements

The following features and enhancements have been added to this release:

  • Added support for on-demand ONNX TensorRT engine builds for closed EA accounts.

  • Added support for building TensorRT engines on various NVIDIA GeForce GPUs. For more information, refer to Planned GPU Support.

Limitations

  • Input model files have a maximum file size of 5 GB.

    • This will be fixed in future releases. For now, models larger than 5 GB should use the weight-stripped flow. Refer to the Weight-Stripped Engine Generation section for information on weight-stripped engine building.

  • Refit requires a GPU of the same SM version used to build the engine. (This is a TensorRT limitation.)

  • By default, weight-stripped engines must be refitted with the original ONNX weights. Only engines built with the --refit flag in the trtexec arg list may be refitted with arbitrary weights.

  • Fully refittable engines might have some performance degradation.

  • Custom plugins or any custom ops are not supported. Only built-in TensorRT ops and plugins will work.

  • Input ONNX models must come from one of the following:

    • S3

    • GitHub

    • Local machine

  • The TensorRT-Cloud server has a daily limit on how much data it can process for building engines on Windows. If TensorRT-Cloud hits this limit on a given day, building on Windows will not be available for the rest of the day.

For inquiries and to report issues, contact tensorrt-cloud-contact@nvidia.com.