Date: 2025-01-13

NVIDIA NIM for LLM automatically leverages model- and hardware-specific optimizations intended to improve the performance of large language models. The core metrics optimized for are:

Time to First Token (TTFT): The latency between the initial inference request to the model and the return of the first token.

Inter-Token Latency (ITL): The latency between each token after the first.

Total Throughput: The total number of tokens generated per second by the NIM.
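
The three metrics defined above can be computed from client-side timestamps. The following is a minimal sketch, not part of NIM: it assumes the client records when each request was sent and when each streamed token arrived, and the names `StreamTiming`, `time_to_first_token`, `mean_inter_token_latency`, and `total_throughput` are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTiming:
    """Client-side timestamps (seconds) for one streamed request (hypothetical type)."""
    request_sent: float
    token_arrivals: List[float]  # arrival time of each generated token, in order


def time_to_first_token(t: StreamTiming) -> float:
    # TTFT: delay between sending the request and receiving the first token.
    return t.token_arrivals[0] - t.request_sent


def mean_inter_token_latency(t: StreamTiming) -> float:
    # ITL: average gap between consecutive tokens after the first.
    gaps = [later - earlier
            for earlier, later in zip(t.token_arrivals, t.token_arrivals[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def total_throughput(timings: List[StreamTiming]) -> float:
    # Total throughput: all generated tokens divided by the wall-clock window
    # that covers every request (earliest send to latest token arrival).
    start = min(t.request_sent for t in timings)
    end = max(t.token_arrivals[-1] for t in timings)
    tokens = sum(len(t.token_arrivals) for t in timings)
    return tokens / (end - start)


# Two concurrent requests with made-up timestamps.
a = StreamTiming(request_sent=0.0, token_arrivals=[0.5, 0.75, 1.0, 1.25])
b = StreamTiming(request_sent=0.0, token_arrivals=[1.0, 1.5, 2.0])

print(time_to_first_token(a))       # 0.5
print(mean_inter_token_latency(a))  # 0.25
print(total_throughput([a, b]))     # 3.5 tokens per second
```

TTFT and ITL are per-request measurements, while total throughput is aggregated across all requests served concurrently, which is why the sketch takes a list of timings for the last function.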

The NVIDIA TensorRT-LLM accelerated NIM backend provides optimized versions of common models across a number of NVIDIA GPUs. If an optimized engine doesn’t exist for the GPU SKU in use, a generic backend is used instead.

The TensorRT-LLM NIM backend includes multiple optimization profiles, tailored to either minimize latency or maximize throughput. These engine profiles are tagged as latency and throughput, respectively, in the model manifest included with the NIM.

The throughput and latency variants differ in many ways, but one of the most significant is GPU count. Throughput variants use the minimum number of GPUs required to host a model (typically constrained by memory utilization), while latency variants use additional GPUs to decrease request latency, at the cost of lower total throughput per GPU relative to the throughput variant.
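
The choice between profiles, including the generic fallback, can be pictured as a small selection routine. The sketch below is an illustration under stated assumptions, not NIM's actual selection logic or manifest schema: the `Profile` fields, the backend strings, the GPU names, and the fallback order in `select_profile` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Profile:
    """Hypothetical, simplified view of one engine profile entry."""
    backend: str   # "tensorrt_llm" (optimized) or "generic"; illustrative strings
    tag: str       # "latency" or "throughput"; empty for the generic backend
    gpu_sku: str   # GPU the engine targets; empty if not SKU-specific
    num_gpus: int  # GPUs the profile needs


def select_profile(profiles: List[Profile], gpu_sku: str,
                   available_gpus: int, objective: str) -> Profile:
    # Only profiles that fit on the available GPUs are candidates.
    usable = [p for p in profiles if p.num_gpus <= available_gpus]
    optimized = [p for p in usable
                 if p.backend == "tensorrt_llm" and p.gpu_sku == gpu_sku]

    # 1. Prefer an optimized engine that matches the requested objective.
    preferred = [p for p in optimized if p.tag == objective]
    if preferred:
        return min(preferred, key=lambda p: p.num_gpus)
    # 2. Assumed policy: otherwise take any optimized engine for this SKU.
    if optimized:
        return min(optimized, key=lambda p: p.num_gpus)
    # 3. No optimized engine for this SKU: fall back to the generic backend.
    generic = [p for p in usable if p.backend == "generic"]
    if generic:
        return generic[0]
    raise LookupError("no profile fits the available hardware")


# Made-up manifest: the latency variant uses more GPUs than the throughput one.
profiles = [
    Profile("tensorrt_llm", "throughput", "H100", 1),
    Profile("tensorrt_llm", "latency", "H100", 2),
    Profile("generic", "", "", 1),
]

print(select_profile(profiles, "H100", 2, "latency").num_gpus)  # 2
print(select_profile(profiles, "H100", 1, "latency").tag)       # throughput
print(select_profile(profiles, "L4", 1, "latency").backend)     # generic
```

In the second call the latency variant is skipped because it needs two GPUs and only one is available, so the sketch settles for the throughput variant. In the third call no optimized engine exists for the SKU, so the generic backend is chosen.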