Parameters and Best Practices
Now that we have reviewed benchmarking metrics for measuring LLM inference latency and throughput, let’s discuss the key test parameters and the sweep ranges that ensure meaningful benchmarking and quality assurance.
An application’s specific use cases influence the sequence lengths (i.e., ISL and OSL), which in turn affect how quickly a system digests the input to form the KV cache and generates output tokens. A longer ISL increases the memory requirement for the prefill stage and thus increases TTFT, while a longer OSL increases the memory requirement (both bandwidth and capacity) for the generation stage and thus increases ITL. Understanding the distribution of inputs and outputs in your LLM deployment helps you best optimize your hardware utilization. Common use cases and their likely ISL/OSL pairs include the following; a small configuration sketch follows the list.
Translation: This includes translation between languages and code, and is characterized by similar ISL and OSL of roughly 500 to 2,000 tokens each.
Generation: This includes code, story, and email generation, as well as generic content generation through search. This is characterized by an OSL of O(1,000) tokens, much longer than the ISL of O(100) tokens.
Summarization: This includes retrieval, chain-of-thought prompting, and multi-turn conversations. This is characterized by an ISL of O(1,000) tokens, much longer than the OSL of O(100) tokens.
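As a rough illustration of how these pairs feed into a benchmark, the sketch below captures them in a small configuration table that could drive synthetic prompt generation. The dictionary keys, token counts, and config field names are illustrative placeholders rather than options of any particular tool; the token counts simply restate the orders of magnitude listed above.

```python
# Approximate ISL/OSL pairs for the use cases listed above.
# Values are orders of magnitude, not exact requirements.
USE_CASE_SEQUENCE_LENGTHS = {
    "translation":   {"isl": 1000, "osl": 1000},  # similar ISL and OSL (~500-2,000 each)
    "generation":    {"isl": 100,  "osl": 1000},  # short prompt, long output
    "summarization": {"isl": 1000, "osl": 100},   # long prompt, short output
}

def synthetic_workload_config(use_case: str, stddev_ratio: float = 0.1) -> dict:
    """Build an illustrative synthetic-workload config (mean and stddev of
    input/output token counts) for the chosen use case."""
    lengths = USE_CASE_SEQUENCE_LENGTHS[use_case]
    return {
        "input_tokens_mean": lengths["isl"],
        "input_tokens_stddev": int(lengths["isl"] * stddev_ratio),
        "output_tokens_mean": lengths["osl"],
        "output_tokens_stddev": int(lengths["osl"] * stddev_ratio),
    }

print(synthetic_workload_config("summarization"))
```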
If you have real data from your target application, you can provide it as input instead. GenAI-perf supports the HuggingFace OpenOrca and CNN DailyMail datasets.
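If you want to pull those datasets yourself, a minimal sketch with the Hugging Face datasets library is shown below. The dataset IDs (Open-Orca/OpenOrca and cnn_dailymail) are the public Hub repositories, and the field names reflect their published schemas; verify both against the dataset versions you actually download.

```python
from datasets import load_dataset  # pip install datasets

# OpenOrca: instruction-style prompts (the "question" field per the dataset card).
openorca = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)
prompts = [row["question"] for _, row in zip(range(100), openorca)]

# CNN/DailyMail: news articles commonly used as summarization inputs.
cnn = load_dataset("cnn_dailymail", "3.0.0", split="validation", streaming=True)
articles = [row["article"] for _, row in zip(range(100), cnn)]

print(prompts[0][:200])
print(articles[0][:200])
```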
Concurrency N is the number of concurrent users, each with one active request, or equivalently the number of requests being served concurrently by an LLM service. As soon as a user’s request receives a complete response, another request is sent, ensuring that the system is serving exactly N requests at any point in time.
Note about LLMperf: It sends out requests in batches of N, but there is a draining period during which it waits for all requests to complete before sending the next batch. As such, toward the end of each batch the number of concurrent requests gradually drops to zero. This differs from GenAI-perf, which maintains N active requests throughout the benchmarking period.
Concurrency is most frequently used to describe and control the load induced on the inference system.
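To make the closed-loop behavior concrete, here is a minimal asyncio sketch that keeps exactly N requests in flight: the moment a worker’s request completes, it immediately issues the next one. The send_request coroutine is a hypothetical stand-in for a real call to your inference endpoint.

```python
import asyncio
import time

async def send_request(user_id: int) -> None:
    """Hypothetical placeholder for an actual call to the LLM endpoint."""
    await asyncio.sleep(0.05)  # pretend network + inference latency

async def closed_loop_user(user_id: int, deadline: float) -> int:
    """One simulated user: always has exactly one request outstanding."""
    completed = 0
    while time.monotonic() < deadline:
        await send_request(user_id)  # as soon as this returns...
        completed += 1               # ...the next loop iteration sends another request
    return completed

async def run_benchmark(concurrency: int, duration_s: float) -> None:
    deadline = time.monotonic() + duration_s
    # N concurrent users means exactly N requests in flight at any time.
    counts = await asyncio.gather(
        *(closed_loop_user(i, deadline) for i in range(concurrency))
    )
    print(f"concurrency={concurrency}, total completed requests={sum(counts)}")

asyncio.run(run_benchmark(concurrency=8, duration_s=2.0))
```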
Max batch size: A batch is the group of requests processed simultaneously by the inference engine and may be a subset of the concurrent requests. The max batch size parameter defines the maximum number of requests that the inference engine can process at once.
If the concurrency exceeds the max batch size multiplied by the number of active replicas, some requests must wait in a queue for later processing. In this case, you may see the time to first token (TTFT) increase due to the queueing effect of requests waiting for a slot to open up.
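The queueing condition is simple arithmetic; the sketch below only computes how many requests would be left waiting for a given concurrency, max batch size, and replica count, and is not a model of any particular scheduler.

```python
def queued_requests(concurrency: int, max_batch_size: int, replicas: int) -> int:
    """Requests that cannot be scheduled immediately and must wait in a queue."""
    in_flight_capacity = max_batch_size * replicas
    return max(0, concurrency - in_flight_capacity)

# Example: 512 concurrent requests against one replica with max batch size 256
# leaves 256 requests queued, which shows up as a higher TTFT.
print(queued_requests(concurrency=512, max_batch_size=256, replicas=1))
```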
Request rate is another parameter for controlling load; it determines the rate at which new requests are sent. A constant (or static) request rate r means one request is sent every 1/r seconds, while a Poisson (or exponential) request rate draws exponentially distributed inter-arrival times with a mean of 1/r seconds.
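The difference between the two schedules is easiest to see in the inter-arrival times they produce. The NumPy sketch below draws both; the rate r and the number of requests are arbitrary example values.

```python
import numpy as np

r = 10.0           # target request rate, requests per second
n_requests = 1000
rng = np.random.default_rng(seed=0)

# Constant (static) rate: exactly one request every 1/r seconds.
constant_gaps = np.full(n_requests, 1.0 / r)

# Poisson process: exponentially distributed inter-arrival times with mean 1/r.
poisson_gaps = rng.exponential(scale=1.0 / r, size=n_requests)

print(f"constant: mean gap = {constant_gaps.mean():.3f}s, std = {constant_gaps.std():.3f}s")
print(f"poisson:  mean gap = {poisson_gaps.mean():.3f}s, std = {poisson_gaps.std():.3f}s")
```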
GenAI-perf supports both concurrency and request rate; however, we recommend using concurrency, because with a fixed request rate the number of outstanding requests can grow without bound if the requests per second exceed the system throughput.
When specifying the concurrencies to test, it is useful to sweep over a range of values, from a minimum value of 1 to a maximum value not much greater than the max batch size. That is because when the concurrency is larger than the max batch size of the engine, some requests will have to wait in a queue. Therefore, the throughput of the system generally saturates around the max batch size while the latency steadily increases.
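A common way to implement such a sweep is to double the concurrency from 1 up to slightly beyond the max batch size and record throughput and latency at each point. In the sketch below, run_at_concurrency is a hypothetical hook for whatever harness you use (for example, a wrapper that launches GenAI-perf with the corresponding concurrency setting).

```python
def run_at_concurrency(concurrency: int) -> dict:
    """Hypothetical hook: run one benchmark at the given concurrency and
    return aggregate metrics (throughput, latency percentiles, and so on)."""
    raise NotImplementedError("wire this up to your benchmarking tool")

def concurrency_sweep(max_batch_size: int) -> list[int]:
    """Powers of two from 1 up to slightly beyond the engine's max batch size."""
    values, c = [], 1
    while c <= max_batch_size * 2:  # go a bit past the expected saturation point
        values.append(c)
        c *= 2
    return values

for c in concurrency_sweep(max_batch_size=256):
    print(f"would benchmark at concurrency={c}")
    # results = run_at_concurrency(c)
```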
“ignore_eos”: Most LLMs have a special end-of-sequence (EOS) token, which signifies the end of generation: it indicates that the LLM has produced a complete response and should stop. Under normal usage, LLM inference should respect this signal and stop generating further tokens. The ignore_eos parameter tells an LLM inference framework to ignore the EOS token and continue generating tokens until it reaches the max_tokens limit. For benchmarking purposes, this parameter should be set to True in order to reach the intended output length and obtain consistent measurements.
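As a concrete example, the snippet below builds an OpenAI-style completions request with ignore_eos enabled and a fixed max_tokens, the way a benchmarking client might send it to a server that accepts this extra field (vLLM’s OpenAI-compatible server does; other frameworks may expose it under a different name). The endpoint URL and model name are placeholders.

```python
import json
import urllib.request

# Placeholder endpoint and model name. ignore_eos is a framework-specific extra field;
# vLLM's OpenAI-compatible server accepts it, but check your framework's documentation.
payload = {
    "model": "my-model",
    "prompt": "Summarize the following article: ...",
    "max_tokens": 1000,   # forces the intended OSL
    "ignore_eos": True,   # keep generating even if the model emits EOS
}

request = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    completion = json.load(response)
    print(completion["choices"][0]["text"][:200])
```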
Sampling vs. greedy: Different sampling strategies can affect LLM generation speed. Greedy decoding, for example, can be implemented simply by selecting the token with the highest logit, with no need to normalize and sort the probability distribution over tokens, which saves computation. See this blog for a detailed explanation of different sampling methods. Whichever sampling method is chosen, it is good practice to stay consistent within the same benchmarking setup.
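To illustrate the computational difference, here is a small NumPy sketch of both strategies applied to one decoding step’s logits: greedy decoding is just an argmax, while temperature sampling first exponentiates and normalizes the full distribution. The vocabulary size and logits are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
logits = rng.normal(size=32_000)  # fake logits over a 32k-token vocabulary

# Greedy: pick the highest logit directly; no softmax required.
greedy_token = int(np.argmax(logits))

# Temperature sampling: normalize into a probability distribution, then draw from it.
temperature = 0.8
scaled = logits / temperature
probs = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
probs /= probs.sum()
sampled_token = int(rng.choice(len(probs), p=probs))

print(f"greedy token id: {greedy_token}, sampled token id: {sampled_token}")
```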