NVIDIA TensorRT-Cloud Documentation
Important
NVIDIA TensorRT-Cloud is provided as a developer preview in Early Access (EA). Access is restricted and is provided upon request (refer to the Getting TensorRT-Cloud Access section).
TensorRT-Cloud (TRTC) helps developers deploy GenAI models with the best possible inference configurations for their workloads by offering two key functionalities:
TensorRT-LLM configuration sweeping to help you optimize inference across popular OSS LLMs and NVIDIA hardware SKUs.
TensorRT engine-building capabilities across diverse NVIDIA GPUs, operating systems, and library dependencies. The goal is to let developers build optimized TensorRT and TensorRT-LLM engines, through the convenience of a command-line interface (CLI), for the variety of NVIDIA GPUs their applications need to support. Engines are built on demand; combined with the weight-refit capabilities of NVIDIA TensorRT 10.0, this lets you integrate TensorRT-accelerated inference into your applications without bloating your application binaries (see the refit sketch below).
The TensorRT-Cloud CLI is the interface through which you interact with TensorRT-Cloud.
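For example, once a weight-stripped engine has been built in the cloud and downloaded, the weights can be refit locally using the TensorRT 10.0 Python API before running inference. The following is a minimal sketch, assuming the stripped engine was saved as model_stripped.engine and the original ONNX model containing the weights is model.onnx (both file names are placeholders):

```python
import tensorrt as trt

# Placeholder paths; substitute the engine returned by TensorRT-Cloud
# and the original ONNX model that still contains the weights.
STRIPPED_ENGINE_PATH = "model_stripped.engine"
ONNX_MODEL_PATH = "model.onnx"

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Deserialize the weight-stripped engine downloaded from TensorRT-Cloud.
with open(STRIPPED_ENGINE_PATH, "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Refit the engine in place with the weights from the original ONNX model.
refitter = trt.Refitter(engine, logger)
parser_refitter = trt.OnnxParserRefitter(refitter, logger)
if not parser_refitter.refit_from_file(ONNX_MODEL_PATH):
    raise RuntimeError("Failed to load weights from the ONNX model")
if not refitter.refit_cuda_engine():
    raise RuntimeError("Failed to refit the engine")

# The refitted engine is now ready for inference via an execution context.
context = engine.create_execution_context()
```

Because the weights never ship inside the engine file, the application binary stays small; the refit step above runs once at load time on the target machine.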
Getting Started
User Guide
- Getting Access
- Sweeping for Optimized TensorRT-LLM Engines
- Building a TensorRT-LLM Engine
- Building an ONNX Engine
- Specifying an Engine Build Configuration
- Specifying the ONNX Model
- Querying the Status of a Build
- Obtaining the Results of a Build
- Weightful Engine Generation
- Weight-Stripped Engine Generation
- Refittable Engine Generation (Weightful or Weight-Stripped)
- Building with Large ONNX Files
- Supported trtexec Arguments
- Custom Build Result Location
- Running a TensorRT Engine
- Usage Credits
Troubleshooting