Request Cancellation Testing
AIPerf can simulate request timeout and cancellation scenarios, which are important for measuring how user cancellations affect performance.
How Request Cancellation Works
Request cancellation tests how inference servers handle client disconnections. A configurable percentage of requests is sent in full, after which the client disconnects before receiving the full response.
Timing Flow
The cancellation timer starts at T2 (“request fully sent”) for two reasons:
- Realistic simulation: The server always receives the complete request before cancellation, just like when a real user closes their browser tab.
- Reproducibility: The delay is measured from a fixed point (request fully sent) rather than being affected by variable queue times or connection setup. This means running the same benchmark twice with --request-cancellation-delay 0.5 will cancel requests at the same point in their lifecycle, regardless of system load.
If the server responds before the delay expires, the request completes normally and is not cancelled. Only requests still waiting for a response when the timer expires are cancelled.
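The rule above can be illustrated with a small simulation. This is a minimal sketch, not AIPerf's implementation: `fake_server_response` and `send_with_cancellation` are hypothetical names, and the server is replaced by a sleep so the example runs without a network.

```python
import asyncio


async def fake_server_response(latency_s: float) -> str:
    """Hypothetical stand-in for a server that answers after latency_s seconds."""
    await asyncio.sleep(latency_s)
    return "response"


async def send_with_cancellation(latency_s: float, cancel_delay_s: float) -> str:
    """Return 'completed' or 'cancelled' for one simulated request."""
    # T2: the request is assumed to be fully sent at this point, so the
    # cancellation timer starts here, not at queue or connection time.
    try:
        await asyncio.wait_for(fake_server_response(latency_s), timeout=cancel_delay_s)
        return "completed"  # server answered before the timer expired
    except asyncio.TimeoutError:
        return "cancelled"  # timer expired while still waiting


async def main() -> list:
    return [
        await send_with_cancellation(latency_s=0.05, cancel_delay_s=0.5),
        await send_with_cancellation(latency_s=1.0, cancel_delay_s=0.5),
    ]


results = asyncio.run(main())
print(results)  # ['completed', 'cancelled']
```

The fast request finishes inside the 0.5 second window and is left alone. The slow one is still pending when the timer fires, so it is cancelled.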
Understanding the Delay Parameter
A delay of 0 means “send the full request, then immediately disconnect”. The server receives the complete request but the client closes the connection before receiving any response. Longer delays allow partial responses to be received before disconnection.
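To get a feel for how much of a streamed response arrives before the disconnect, the following sketch estimates the number of tokens received for a given delay. It assumes a simplified, hypothetical timing model: the first token arrives `ttft_ms` after the request is fully sent, and each later token arrives `itl_ms` after the previous one. Real servers will deviate from this, and the function name is illustrative only.

```python
def tokens_before_cancel(delay_ms: int, ttft_ms: int, itl_ms: int, max_tokens: int) -> int:
    """Estimate tokens received when the cancellation timer fires.

    Assumed model (not measured AIPerf behavior): first token at ttft_ms,
    then one token every itl_ms. Integer milliseconds avoid float rounding.
    """
    if delay_ms < ttft_ms:
        return 0  # disconnected before any response data arrived
    received = 1 + (delay_ms - ttft_ms) // itl_ms
    # If the whole response fits inside the delay, the request simply completes.
    return min(received, max_tokens)


print(tokens_before_cancel(0, 200, 50, 100))      # 0
print(tokens_before_cancel(500, 200, 50, 100))    # 7
print(tokens_before_cancel(10_000, 200, 50, 100)) # 100
```

With these assumed timings, a delay of 0 yields no tokens, a delay of 500 ms yields a partial response of 7 tokens, and a very long delay lets the request finish normally.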
Testing Disaggregated Inference Systems
The delay parameter can be used to target different inference phases:
This is useful for testing how disaggregated architectures (separate prefill and decode workers) handle cancellations at different stages of request processing.
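One rough way to reason about which phase a given delay will hit is to compare it against the time to first token (TTFT) you observe on your own deployment. The sketch below assumes that everything before the first token counts as prefill (including any queueing or cache transfer) and everything after it counts as decode. The function and its thresholds are hypothetical and only illustrate the idea.

```python
def phase_at_cancel(delay_ms: int, ttft_ms: int, total_ms: int) -> str:
    """Classify where a cancellation lands under an assumed two-phase model.

    ttft_ms and total_ms are values you would measure yourself; they are
    not produced by this function.
    """
    if delay_ms >= total_ms:
        return "completed"  # response finished first, so nothing is cancelled
    if delay_ms < ttft_ms:
        return "prefill"    # no token emitted yet
    return "decode"         # tokens are streaming when the client leaves


for delay in (0, 100, 500, 5000):
    print(delay, phase_at_cancel(delay, ttft_ms=200, total_ms=4000))
```

Under the assumed TTFT of 200 ms and total latency of 4000 ms, delays of 0 and 100 ms land in prefill, 500 ms lands in decode, and 5000 ms is long enough for the request to complete.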
Setting Up the Server
Basic Request Cancellation
Test with a small percentage of cancelled requests:
Sample Output (Successful Run):
Parameters Explained:
- --request-cancellation-rate 10: Cancel 10% of requests (value between 0.0 and 100.0)
- --request-cancellation-delay 0.5: Wait 0.5 seconds after the request is fully sent before cancelling selected requests
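The rate is a percentage, so a value of 10 marks roughly one request in ten for cancellation. The sketch below shows one plausible way to make that selection with a seeded random generator. It is an assumption for illustration, since how AIPerf actually picks requests is not described here, and `pick_cancelled` is a hypothetical helper.

```python
import random


def pick_cancelled(num_requests: int, rate_percent: float, seed: int) -> list:
    """Return one boolean per request: True means 'cancel this request'.

    Assumed strategy: each request is chosen independently with
    probability rate_percent / 100, using a seeded generator.
    """
    if not 0.0 <= rate_percent <= 100.0:
        raise ValueError("rate_percent must be between 0.0 and 100.0")
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so a rate of 0 selects nothing and 100 selects all.
    return [rng.random() * 100.0 < rate_percent for _ in range(num_requests)]


flags = pick_cancelled(10_000, 10.0, seed=42)
print(f"{sum(flags)} of {len(flags)} requests marked for cancellation")
```

Because selection is probabilistic in this sketch, the observed share is close to, but not exactly, the configured rate. Reusing the same seed gives the same selection.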
High Cancellation Rate Testing
Test service resilience under frequent cancellations:
Sample Output (Successful Run):
Immediate Cancellation Testing (Delay = 0)
Test immediate disconnection where the client closes the connection right after sending the request:
Sample Output (Successful Run):
What happens with delay=0:
- The full request (headers + body) is sent to the server
- The client immediately disconnects after sending
- The server receives the complete request but the client won’t read any response
- Tests how the server handles abandoned requests and cleans up resources
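The sequence above can be reproduced on a local socket. The following is a self-contained sketch using Python's asyncio streams, with a toy server in place of a real inference server. The request bytes and path are illustrative only, and the toy server cheats by knowing the request length in advance rather than parsing Content-Length.

```python
import asyncio

# Illustrative request only: headers plus a 2-byte body.
REQUEST = (
    b"POST /v1/chat/completions HTTP/1.1\r\n"
    b"Host: localhost\r\nContent-Length: 2\r\n\r\n{}"
)


async def run_demo() -> dict:
    observed = {"request_bytes": 0, "client_gone": False}
    done = asyncio.Event()

    async def handle(reader, writer):
        # Server side: read the complete request (headers + body).
        data = await reader.readexactly(len(REQUEST))
        observed["request_bytes"] = len(data)
        # A further read returning b"" (EOF) means the client disconnected.
        observed["client_gone"] = (await reader.read(1)) == b""
        writer.close()  # clean up the abandoned connection
        done.set()

    # Port 0 lets the OS pick a free port for the toy server.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    try:
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(REQUEST)
        await writer.drain()        # hand the full request to the transport
        writer.close()              # delay = 0: disconnect without reading
        await writer.wait_closed()
        await asyncio.wait_for(done.wait(), timeout=5)
    finally:
        server.close()
    return observed


observed = asyncio.run(run_demo())
print(observed)
```

The toy server sees every byte of the request and then an end-of-stream, which is the signal a real server would use to stop generation and release resources for the abandoned request.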