Overview

NIM for LLM Benchmarking Guide

This document provides insights into how to benchmark the deployment of Large Language Models (LLMs), covering popular metrics and parameters as well as a step-by-step guide. Using this guide, LLM application developers and enterprise system owners will be able to answer the following questions:

  • What are the most important metrics for measuring LLM inference latency and throughput?

  • What benchmarking tools should I use for LLMs and what are some of the major differences?

  • How do I use NVIDIA GenAI-Perf to measure the latency and throughput of my LLM application?

The past few years have witnessed the rise in popularity of generative AI and Large Language Models (LLMs), as part of a broader AI revolution. As LLM-based applications are increasingly rolled out across enterprises, there is a strong and urgent need to benchmark and ensure the cost efficiency of different serving solutions. The cost of an LLM application varies depending on how many queries it can process while remaining responsive and engaging for end users. Note that all cost measurements should be made at an acceptable accuracy level, as defined by the application’s use case. This guide focuses on cost measurement; accuracy measurement is not covered.

Standardized benchmarking of LLM performance can be done with many tools, including long-standing tools such as Locust and K6, as well as newer open-source tools specialized for LLMs, such as NVIDIA GenAI-Perf and LLMPerf. These client-side tools offer specific metrics for LLM-based applications but are not consistent in how they define, measure, and calculate different metrics. This guide clarifies the common metrics along with their differences and limitations. We also give a step-by-step guide on using our preferred tool, GenAI-Perf, to benchmark your LLM applications.

It is worth noting that performance benchmarking and load testing are two distinct approaches to evaluating the deployment of a large language model. Load testing, as exemplified by tools like K6, focuses on simulating a large number of concurrent requests to a model to assess its ability to handle real-world traffic and scale. This type of testing helps identify issues related to server capacity, autoscaling tactics, network latency, and resource utilization. In contrast, performance benchmarking, as demonstrated by NVIDIA’s GenAI-Perf tool, is concerned with measuring the actual performance of the model itself, such as its throughput, latency, and token-level metrics. This document focuses on this type of testing and helps identify issues related to model efficiency, optimization, and configuration. While load testing is essential for ensuring the model can handle a large volume of requests, performance benchmarking is crucial for understanding the model’s ability to process requests efficiently. By combining both approaches, developers can gain a comprehensive understanding of their large language model deployment’s capabilities and identify areas for improvement.

Note

Server-side metrics are also available for NVIDIA NIM but are out of scope for this document; refer to the NIM Observability documentation.

Prior to examining benchmark metrics, it is important to understand how LLM inference works, as well as the related terminology. An LLM application produces results through inference. For a given LLM application, inference proceeds through the following broad stages (a timing sketch follows the list):

  1. The user provides a query (prompt).

  2. The query enters a queue and waits its turn to be processed, i.e. the queuing phase.

  3. The application’s underlying LLM model processes the prompt, i.e. the prefill phase.

  4. The LLM model outputs a response, one token at a time, i.e. the generation phase.
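
From the client side, these stages are not all separately observable: the queuing and prefill phases together make up the delay before the first output token arrives, and everything after that is generation. The following minimal Python sketch (illustrative only, not part of any specific tool, using hypothetical timestamps) shows how a benchmarking client can attribute its measured wall-clock times to these stages.

```python
def stage_latencies(request_sent: float, first_token: float, last_token: float) -> dict:
    """Attribute client-side timestamps (in seconds) to the inference stages above.

    Queuing and prefill cannot be separated from the client side; together they
    account for the delay before the first output token. The remaining time is
    the token-by-token generation phase.
    """
    return {
        "queuing_plus_prefill_s": first_token - request_sent,
        "generation_s": last_token - first_token,
        "end_to_end_s": last_token - request_sent,
    }


# Hypothetical timestamps recorded by a benchmarking client:
print(stage_latencies(request_sent=0.00, first_token=0.35, last_token=2.10))
```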

A token is a concept specific to LLMs and underlies most LLM inference performance metrics. It is the unit that an LLM uses to break down and process natural language. The collection of all tokens a model can use is known as its vocabulary. Each LLM has its own tokenizer, learned from data, so that it can represent input text efficiently. As an approximation, for many popular LLMs, each token corresponds to roughly 0.75 English words.
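
As a quick illustration of tokenization, the sketch below uses the Hugging Face transformers library with the openly available GPT-2 tokenizer. The choice of tokenizer is an assumption made only for this example; every model ships its own tokenizer, so token counts vary by model.

```python
from transformers import AutoTokenizer

# GPT-2's tokenizer is used here only because it is openly available.
# Token counts differ from model to model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Benchmarking LLM inference requires counting tokens, not words."
token_ids = tokenizer.encode(text)

print(f"{len(text.split())} words -> {len(token_ids)} tokens")
print(f"~{len(text.split()) / len(token_ids):.2f} words per token")
```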

Sequence length is the length of a sequence of tokens. The Input Sequence Length (ISL) is the number of tokens the LLM receives. It includes the user query, any system prompt (e.g. instructions for the model), previous chat history, chain-of-thought reasoning, and documents from a retrieval-augmented generation (RAG) pipeline. The Output Sequence Length (OSL) is the number of tokens the LLM generates. Context length is the number of tokens the LLM uses at each generation step, including both the input tokens and the output tokens generated so far. Each LLM has a maximum context length that is shared between input and output tokens. For a deeper dive into LLM inference, see our blog Mastering LLM Techniques: Inference Optimization.
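
To make ISL and OSL concrete, the sketch below counts input and output tokens for a single request using the same example tokenizer as above. Note that this is an approximation: a real deployment applies the model’s chat template and may add special tokens, so server-side counts can differ slightly.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example tokenizer only

system_prompt = "You are a concise assistant."
user_query = "Summarize the benefits of benchmarking LLM deployments."
response = "Benchmarking reveals latency, throughput, and cost trade-offs."

# ISL: everything sent to the model (system prompt, query, history, RAG docs, ...)
isl = len(tokenizer.encode(system_prompt + "\n" + user_query))
# OSL: everything the model generates
osl = len(tokenizer.encode(response))

print(f"ISL = {isl} tokens, OSL = {osl} tokens")
print(f"Context used at the final generation step = {isl + osl} tokens")
```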

Streaming is an option that returns partial LLM outputs to users as chunks of tokens while the model incrementally generates them. This is important for chatbot applications, where it is desirable to receive an initial response quickly. While the user digests the partial content, the next chunk of the result arrives in the background.

In contrast, in non-streaming mode, the full answer is returned all at once.
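
For reference, here is a minimal sketch of the difference using an OpenAI-compatible Python client, such as the API exposed by a NIM endpoint. The base URL and model name below are placeholders chosen for illustration; adjust them to match your deployment.

```python
from openai import OpenAI

# Placeholder endpoint and model name; adjust for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
messages = [{"role": "user", "content": "Tell me a short story."}]

# Streaming: partial output arrives as chunks of tokens are generated.
stream = client.chat.completions.create(
    model="meta/llama3-8b-instruct", messages=messages, stream=True
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

# Non-streaming: the full answer is returned all at once.
completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct", messages=messages, stream=False
)
print(completion.choices[0].message.content)
```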
