Large Language Models (1.0.0)

Introduction to LLM Inference Benchmarking

The past few years have seen the rise of generative AI and Large Language Models (LLMs) as part of a broader AI revolution. As LLM-based applications are increasingly rolled out across enterprises, there is a strong and urgent need to benchmark the cost efficiency of different serving solutions. The cost of an LLM application depends on how many queries it can process while remaining responsive and engaging for end users. Note that any cost measurement should be made only after the application reaches an acceptable accuracy level, as defined by its use case. This guide focuses on cost measurement; accuracy evaluation is not covered.

Standardized benchmarking of LLM performance can be done with many tools, including long-standing load-testing tools such as Locust and K6, as well as newer open-source tools specialized for LLMs such as NVIDIA GenAI-Perf and LLMPerf. These client-side tools report metrics specific to LLM-based applications but aren't consistent in how they define, measure, and calculate them. This guide clarifies the common metrics along with their differences and limitations. We also give a step-by-step guide on using our preferred tool, GenAI-Perf, to benchmark your LLM applications.
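To make these metrics concrete, the following minimal Python sketch measures time to first token (TTFT), mean inter-token latency, and end-to-end latency for a single streaming request. It assumes an OpenAI-compatible chat completions endpoint; the URL and model name are placeholders, not values defined by this guide. Note that a streamed chunk may contain more than one token, so this only approximates inter-token latency; tools such as GenAI-Perf tokenize the generated text to compute exact token-level numbers.

```python
"""Sketch: client-side metrics for one streaming request.
The endpoint URL and model name below are placeholders."""
import json
import time

import requests

URL = "http://localhost:8000/v1/chat/completions"  # assumed OpenAI-compatible endpoint
MODEL = "my-model"                                  # placeholder model name

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Explain LLM benchmarking briefly."}],
    "max_tokens": 128,
    "stream": True,  # streaming is required to observe time to first token
}

start = time.perf_counter()
chunk_times = []  # arrival time of each streamed content chunk

with requests.post(URL, json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # Server-sent events arrive as lines of the form "data: {...}".
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        if chunk["choices"][0]["delta"].get("content"):
            chunk_times.append(time.perf_counter())

end = time.perf_counter()

ttft = chunk_times[0] - start          # time to first token (approximated by first chunk)
e2e_latency = end - start              # end-to-end request latency
# Mean inter-chunk latency as a proxy for inter-token latency.
itl = (chunk_times[-1] - chunk_times[0]) / max(len(chunk_times) - 1, 1)

print(f"TTFT: {ttft:.3f}s  ITL: {itl * 1000:.1f}ms  end-to-end: {e2e_latency:.3f}s")
```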

It’s worth noting that performance benchmarking and load testing are two distinct approaches to evaluating a large language model deployment. Load testing, as exemplified by tools like K6, simulates a large number of concurrent requests to assess the deployment’s ability to handle real-world traffic at scale. This type of testing helps identify issues related to server capacity, autoscaling strategies, network latency, and resource utilization. In contrast, performance benchmarking, as demonstrated by NVIDIA’s GenAI-Perf tool, is concerned with measuring the performance of the model itself, such as its throughput, latency, and token-level metrics. This document focuses on this type of testing, which helps identify issues related to model efficiency, optimization, and configuration. While load testing is essential for ensuring that the deployment can handle a large volume of requests, performance benchmarking is crucial for understanding how efficiently the model processes requests. By combining both approaches, developers can gain a comprehensive understanding of their deployment’s capabilities and identify areas for improvement; the sketch below illustrates the two views side by side.
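The following sketch contrasts the two perspectives on the same run: per-request latency percentiles (the performance view) and aggregate request throughput under concurrency (the load view). The endpoint URL, model name, request count, and concurrency level are placeholders chosen for illustration, not recommended settings.

```python
"""Sketch: per-request latency vs. aggregate throughput under concurrency.
Endpoint, model name, and concurrency level are placeholders."""
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/chat/completions"  # assumed OpenAI-compatible endpoint
MODEL = "my-model"                                  # placeholder model name
CONCURRENCY = 8                                     # concurrent in-flight requests
NUM_REQUESTS = 32                                   # total requests in the run


def one_request(prompt: str) -> float:
    """Send one non-streaming request and return its latency in seconds."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    t0 = time.perf_counter()
    requests.post(URL, json=payload, timeout=120).raise_for_status()
    return time.perf_counter() - t0


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(one_request, [f"Question {i}" for i in range(NUM_REQUESTS)]))
wall_clock = time.perf_counter() - start

# Performance view: how fast is each individual request?
p50 = statistics.median(latencies)
p95 = statistics.quantiles(latencies, n=20)[18]
print(f"p50 latency: {p50:.2f}s  p95 latency: {p95:.2f}s")

# Load view: how much traffic does the deployment sustain overall?
print(f"throughput: {NUM_REQUESTS / wall_clock:.2f} req/s at concurrency {CONCURRENCY}")
```

In practice, sweeping the concurrency level and plotting latency against throughput reveals the operating point where added load stops improving throughput and only inflates latency, which is the kind of analysis GenAI-Perf automates.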

Important

To learn more about benchmarking LLMs, please refer to the NIM for LLMs Benchmarking Guide.
