NVIDIA TensorRT-Cloud Documentation

Important

NVIDIA TensorRT-Cloud is provided as a developer preview in Early Access (EA). Access is restricted and is provided upon request (refer to the Getting TensorRT-Cloud Access section).

TensorRT-Cloud (TRTC) helps developers deploy GenAI models with the best possible inference configurations for their workloads by offering two key functionalities:

  1. TensorRT-LLM configuration sweeping to help you optimize inference across popular OSS LLMs and NVIDIA hardware SKUs.

  2. TensorRT engine building across diverse NVIDIA GPUs, operating systems, and library dependencies. The goal is to let developers build optimized TensorRT and TensorRT-LLM engines on demand, with the convenience of a command line interface (CLI), for the variety of NVIDIA GPUs their applications need to support. Combined with the weight-refit capabilities of NVIDIA TensorRT 10.0, this lets you integrate TensorRT-accelerated inference into your applications without bloating your application binaries (a minimal refit sketch follows this list).
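
To illustrate the weight-refit workflow mentioned in item 2, the following is a minimal sketch of refitting a weight-stripped engine at application load time with the TensorRT 10 Python API. It assumes TensorRT 10.x, a weight-stripped engine file ("model_stripped.engine"), and the original ONNX model ("model.onnx") that supplies the full weights; both file names are illustrative, not part of any TensorRT-Cloud output.

   import tensorrt as trt

   TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

   # Deserialize the weight-stripped engine shipped with the application.
   runtime = trt.Runtime(TRT_LOGGER)
   with open("model_stripped.engine", "rb") as f:  # illustrative file name
       engine = runtime.deserialize_cuda_engine(f.read())

   # Pull the real weights from the original ONNX model and refit the engine.
   # The engine must have been built as refittable from this same model.
   refitter = trt.Refitter(engine, TRT_LOGGER)
   parser_refitter = trt.OnnxParserRefitter(refitter, TRT_LOGGER)
   assert parser_refitter.refit_from_file("model.onnx")  # illustrative file name
   assert refitter.refit_cuda_engine()

   # The engine now carries its full weights and is ready for inference.
   context = engine.create_execution_context()

Because the shipped engine carries no weights, the application binary stays small; the full weights come from the model file at load time.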

The TensorRT-Cloud CLI is the interface through which you interact with TensorRT-Cloud.

User Guide

Troubleshooting