Concurrency N is the number of concurrent users, each with one active request, or equivalently the number of requests being served by an LLM service at the same time. As soon as a user’s request receives a complete response, that user sends another request, so that the system has exactly N requests in flight at any time. Note about LLMperf: it sends requests in batches of N, but includes a draining period in which it waits for all requests in the batch to complete before sending the next batch. As a result, toward the end of each batch the number of concurrent requests gradually falls to 0. This differs from GenAI-perf, which maintains N active requests throughout the benchmarking period. Concurrency is the parameter most frequently used to describe and control the load induced on the inference system.
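
The closed-loop behavior described above can be illustrated with a minimal sketch. It is not how GenAI-perf or LLMperf is implemented; `fake_request` and `run_closed_loop` are hypothetical names, and the request is simulated with a short sleep rather than a real call to an LLM service. Each simulated user sends its next request only after the previous one completes, so the number of in-flight requests stays at N until the users run out of requests.

```python
import asyncio


async def fake_request(latency_s: float) -> None:
    # Hypothetical stand-in for a real LLM request: just waits.
    await asyncio.sleep(latency_s)


async def run_closed_loop(concurrency: int, requests_per_user: int,
                          latency_s: float = 0.001) -> tuple[int, int]:
    """Simulate N users, each keeping exactly one request in flight."""
    in_flight = 0   # requests currently being served
    peak = 0        # highest number of simultaneous requests observed
    completed = 0   # total requests finished

    async def user() -> None:
        nonlocal in_flight, peak, completed
        for _ in range(requests_per_user):
            in_flight += 1
            peak = max(peak, in_flight)
            await fake_request(latency_s)
            in_flight -= 1
            completed += 1
            # The next loop iteration sends the next request immediately.

    await asyncio.gather(*(user() for _ in range(concurrency)))
    return peak, completed


if __name__ == "__main__":
    peak, completed = asyncio.run(run_closed_loop(concurrency=4, requests_per_user=3))
    print(f"peak in-flight: {peak}, completed: {completed}")
```

A batch-with-drain client would instead wait for all N requests to finish before starting the next N, which is why its in-flight count falls toward 0 at the end of each batch.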

Max batch size: A batch is the group of simultaneous requests being processed by the inference engine. This may be a subset of the concurrent requests. The maximum batch size parameter defines the maximum number of requests that the inference engine can process simultaneously. If the concurrency exceeds the maximum batch size multiplied by the number of active replicas, some requests will have to wait in a queue for later processing. In this case, you may see an increase in the Time-to-First-Token value, because queued requests must wait for a slot to open up.
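
As a rough illustration of the queueing condition above, the following sketch estimates how many requests would be waiting at a given concurrency. The function name `queued_requests` and the example numbers are assumptions made for illustration, and the estimate assumes requests are spread evenly across replicas and that every replica fills its batch completely, which a real deployment may not do.

```python
def queued_requests(concurrency: int, max_batch_size: int, replicas: int) -> int:
    """Estimate how many requests wait in a queue (illustrative only)."""
    # Total number of requests the deployment can process at once,
    # assuming every replica can fill its batch.
    capacity = max_batch_size * replicas
    # Anything beyond that capacity has to wait for a slot.
    return max(0, concurrency - capacity)


if __name__ == "__main__":
    # Hypothetical numbers: 2 replicas, each with a maximum batch size of 32.
    print(queued_requests(concurrency=50, max_batch_size=32, replicas=2))
    print(queued_requests(concurrency=100, max_batch_size=32, replicas=2))
```

With these hypothetical settings the capacity is 64 requests, so a concurrency of 50 leaves nothing queued, while a concurrency of 100 leaves 36 requests waiting, and those are the requests that would see a higher Time-to-First-Token.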