AIPerf supports request timeout and cancellation scenarios, which are important for measuring the impact of user cancellation on performance.
Request cancellation tests how inference servers handle client disconnections. A percentage of requests are sent completely, then the client disconnects before receiving the full response.
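As a rough illustration of what "a percentage of requests" can mean in practice, the sketch below marks each request for cancellation with an independent random draw. This is only an assumption for illustration: the function name `select_for_cancellation` and the `seed` parameter are hypothetical, and AIPerf's actual selection mechanism may differ.

```python
import random


def select_for_cancellation(num_requests, cancellation_rate_percent, seed=0):
    """Return one boolean per request: True means 'cancel this request'.

    Hypothetical helper, not AIPerf's implementation. The rate is a
    percentage between 0.0 and 100.0, matching the range documented for
    --request-cancellation-rate.
    """
    if not 0.0 <= cancellation_rate_percent <= 100.0:
        raise ValueError("cancellation rate must be between 0.0 and 100.0")
    rng = random.Random(seed)  # seeded so the same selection can be replayed
    probability = cancellation_rate_percent / 100.0
    # random() returns a value in [0.0, 1.0), so a rate of 0 never selects
    # a request and a rate of 100 always does.
    return [rng.random() < probability for _ in range(num_requests)]


flags = select_for_cancellation(num_requests=1000, cancellation_rate_percent=10)
print(f"{sum(flags)} of {len(flags)} requests marked for cancellation")
```

With independent draws, the realized share of cancelled requests only approximates the configured rate, so short runs can land noticeably above or below it.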
The cancellation timer starts at T2 (“request fully sent”) for two reasons:
Realistic simulation: The server always receives the complete request before cancellation, just like when a real user closes their browser tab.
Reproducibility: The delay is measured from a fixed point (request fully sent) rather than being affected by variable queue times or connection setup. This means running the same benchmark twice with --request-cancellation-delay 0.5 will cancel requests at the same point in their lifecycle, regardless of system load.
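The reproducibility argument can be made concrete with a little arithmetic. The sketch below compares how long the server holds the complete request before the client disconnects under two timer choices: starting at "request fully sent" (the behaviour described above) and starting when the request is enqueued (a hypothetical alternative, shown only for contrast). The function and parameter names are illustrative assumptions, not part of AIPerf.

```python
def server_time_before_cancel(queue_wait_s, send_s, delay_s, timer_starts_at_sent=True):
    """Seconds the server holds the complete request before the disconnect.

    Illustrative model only. queue_wait_s and send_s stand for the variable
    client-side time spent before the request is fully sent.
    """
    if timer_starts_at_sent:
        # The timer starts at "request fully sent", so earlier client-side
        # waiting does not eat into the delay.
        return delay_s
    # Hypothetical alternative: a timer started at enqueue time loses
    # whatever was spent queueing and sending.
    return max(0.0, delay_s - (queue_wait_s + send_s))


for queue_wait in (0.0, 0.25, 2.0):
    anchored = server_time_before_cancel(queue_wait, 0.125, 0.5)
    unanchored = server_time_before_cancel(queue_wait, 0.125, 0.5, timer_starts_at_sent=False)
    print(f"queue={queue_wait}s  from-sent={anchored}s  from-enqueue={unanchored}s")
```

In this model the timer anchored at "request fully sent" gives the same server-side window for every queue time, while the enqueue-anchored timer shrinks toward zero as client-side load grows.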
If the server responds before the delay expires, the request completes normally and is not cancelled. Only requests still waiting for a response when the timer expires are cancelled.
A delay of 0 means “send the full request, then immediately disconnect”. The server receives the complete request but the client closes the connection before receiving any response. Longer delays allow partial responses to be received before disconnection.
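The timer semantics described above can be sketched with `asyncio.wait_for`. This is a minimal simulation under stated assumptions, not AIPerf's client code: `send_with_cancellation` is a hypothetical name, and an `asyncio.sleep` stands in for waiting on the server's response once the request has been fully sent.

```python
import asyncio


async def send_with_cancellation(response_time_s, cancel_delay_s):
    """Simulate one request whose cancellation timer starts at 'fully sent'.

    Hypothetical sketch: the sleep models the server taking response_time_s
    to answer, and wait_for enforces the cancellation delay.
    """
    try:
        # If the simulated response arrives before the delay expires, the
        # request completes normally and is never cancelled.
        await asyncio.wait_for(asyncio.sleep(response_time_s), timeout=cancel_delay_s)
        return "completed"
    except asyncio.TimeoutError:
        # The timer expired while still waiting: the client disconnects.
        return "cancelled"


async def main():
    fast_server = await send_with_cancellation(response_time_s=0.01, cancel_delay_s=0.2)
    slow_server = await send_with_cancellation(response_time_s=0.2, cancel_delay_s=0.01)
    zero_delay = await send_with_cancellation(response_time_s=0.05, cancel_delay_s=0)
    print(fast_server, slow_server, zero_delay)


asyncio.run(main())
```

The third call models a delay of 0: the timeout fires before any of the simulated response is read, which matches the "send the full request, then immediately disconnect" case. The sketch treats the response as all-or-nothing, so it does not model partial streamed responses.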
The delay parameter can be used to target different inference phases:
This is useful for testing how disaggregated architectures (separate prefill and decode workers) handle cancellations at different stages of request processing.
Test with a small percentage of cancelled requests:
Sample Output (Successful Run):
Parameters Explained:
--request-cancellation-rate 10: Cancel 10% of requests (value between 0.0 and 100.0)
--request-cancellation-delay 0.5: Wait 0.5 seconds before cancelling selected requests
Test service resilience under frequent cancellations:
Sample Output (Successful Run):
Test immediate disconnection where the client closes the connection right after sending the request:
Sample Output (Successful Run):
What happens with delay=0: