---
sidebar-title: Request Cancellation Testing
---

# Request Cancellation Testing

AIPerf supports request timeout and cancellation scenarios, which are important for measuring how user cancellations affect performance.

## How Request Cancellation Works

Request cancellation tests how inference servers handle client disconnections. For a configurable percentage of requests, the client sends the complete request and then disconnects before receiving the full response.

### Timing Flow

```
T0: Request scheduled
     │
     │← Worker processing, connection acquired from pool
     ▼
T1: Start writing request to socket
     │
     │← HTTP headers + body transmitted
     ▼
T2: Request fully sent (cancellation timer starts here)
     │
     │← --request-cancellation-delay
     ▼
T3: Request cancelled if still waiting for response
```

The cancellation timer starts at **T2** ("request fully sent") for two reasons:

1. **Realistic simulation**: The server always receives the complete request before cancellation, just like when a real user closes their browser tab.

2. **Reproducibility**: The delay is measured from a fixed point (request fully sent) rather than being affected by variable queue times or connection setup. This means running the same benchmark twice with `--request-cancellation-delay 0.5` will cancel requests at the same point in their lifecycle, regardless of system load.

<Note>
If the server responds before the delay expires, the request completes normally and is **not** cancelled. Only requests still waiting for a response when the timer expires are cancelled.
</Note>
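
The rule in the note above can be modelled in a few lines of shell. This is a minimal sketch, not AIPerf's implementation: `cancellation_outcome` is a hypothetical helper, and both of its arguments are assumed to be measured in seconds from T2 (request fully sent). It only describes requests that were selected for cancellation.

```bash
#!/usr/bin/env bash
# Illustrative model only; this is not AIPerf code.
# cancellation_outcome RESPONSE_TIME DELAY
#   RESPONSE_TIME: seconds from T2 until the server finishes responding
#   DELAY:         the value passed to --request-cancellation-delay
cancellation_outcome() {
  local response_time=$1 delay=$2
  # awk does the fractional comparison; exit status 0 means "responded in time".
  if awk -v r="$response_time" -v d="$delay" 'BEGIN { exit !(r < d) }'; then
    echo "completed"
  else
    echo "cancelled"
  fi
}

cancellation_outcome 0.3 0.5   # server finished before the timer expired
cancellation_outcome 2.0 0.5   # still waiting when the timer expired (T3)
cancellation_outcome 2.0 0     # delay 0: disconnect right after sending
```

The three calls print `completed`, `cancelled`, and `cancelled`. Queue time and connection setup do not appear in the model because they happen before T2.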

### Understanding the Delay Parameter

| Delay | Behavior |
|-------|----------|
| `0` | Disconnect immediately after request is fully sent |
| `0.5` | Wait 0.5 seconds after sending, then disconnect |
| `5` | Wait 5 seconds after sending, then disconnect |

<Tip>
A delay of **0 means "send the full request, then immediately disconnect"**. The server receives the complete request but the client closes the connection before receiving any response. Longer delays allow partial responses to be received before disconnection.
</Tip>

### Testing Disaggregated Inference Systems

The delay parameter can be used to target different inference phases:

| Delay | Likely Cancelled During | Tests |
|-------|------------------------|-------|
| `0` or very small | **Prefill phase** | Prefill worker cancellation, KV cache allocation cleanup |
| Longer delays | **Generation phase** | Decode worker cancellation, partial KV cache cleanup |

This is useful for testing how disaggregated architectures (separate prefill and decode workers) handle cancellations at different stages of request processing.
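
One way to probe both phases is to repeat the same benchmark with several delay values. The script below is a sketch under stated assumptions: `run_delay_sweep` and the `DRY_RUN` variable are hypothetical names introduced here, the delay values and the 30% rate are arbitrary illustrations, and the `aiperf` flags are the ones used elsewhere in this tutorial. By default it only prints the commands so they can be reviewed first.

```bash
#!/usr/bin/env bash
# Sketch: sweep --request-cancellation-delay to target different inference phases.
# DRY_RUN=1 (the default here) prints each command; set DRY_RUN=0 to execute it.
run_delay_sweep() {
  local delay
  for delay in "$@"; do
    # Build the command as an array so each argument stays intact.
    local cmd=(aiperf profile
      --model Qwen/Qwen3-0.6B
      --endpoint-type chat
      --endpoint /v1/chat/completions
      --streaming
      --url localhost:8000
      --request-cancellation-rate 30
      --request-cancellation-delay "$delay"
      --concurrency 8
      --request-count 50)
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "${cmd[*]}"
    else
      "${cmd[@]}"
    fi
  done
}

# 0 is likely to hit prefill; longer delays are more likely to hit generation.
run_delay_sweep 0 0.5 2
```

Going by the sample output paths in this tutorial, runs that differ only in delay may write to the same artifact directory, so consider saving each run's results before starting the next.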

## Setting Up the Server

```bash
# Start vLLM server
docker pull vllm/vllm-openai:latest
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model Qwen/Qwen3-0.6B \
  --host 0.0.0.0 --port 8000 &
```

```bash
# Wait for server to be ready
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }
```

## Basic Request Cancellation

Test with a small percentage of cancelled requests:

```bash
# Profile with 10% request cancellation
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --request-cancellation-rate 10 \
    --request-cancellation-delay 0.5 \
    --synthetic-input-tokens-mean 800 \
    --synthetic-input-tokens-stddev 80 \
    --output-tokens-mean 400 \
    --output-tokens-stddev 40 \
    --concurrency 8 \
    --request-count 50 \
    --warmup-request-count 5
```

**Sample Output (Successful Run):**
```
INFO     Starting AIPerf System
INFO     Request cancellation enabled: 10.0% rate, 0.5s delay
INFO     AIPerf System is WARMING UP

Warming Up: 5/5 |████████████████████████| 100% [00:04<00:00]

INFO     Warmup completed, starting profiling phase
INFO     AIPerf System is PROFILING

Profiling: 50/50 |████████████████████████| 100% [01:15<00:00]

INFO     Benchmark completed successfully
INFO     Cancelled requests: 5 (10.0%)
INFO     Results saved to: artifacts/Qwen_Qwen3-0.6B-chat-concurrency8/

            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃                      Metric ┃     avg ┃    min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│        Request Latency (ms) │ 1234.56 │ 987.34 │ 1678.90 │ 1598.23 │ 1198.45 │
│    Time to First Token (ms) │  234.56 │ 187.90 │  298.34 │  289.67 │  228.12 │
│    Inter Token Latency (ms) │   14.23 │  11.45 │   19.67 │   18.90 │   13.89 │
│ Output Token Count (tokens) │  400.00 │ 360.00 │  440.00 │  438.00 │  398.00 │
│  Request Throughput (req/s) │   12.34 │      - │       - │       - │       - │
└─────────────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┘

JSON Export: artifacts/Qwen_Qwen3-0.6B-chat-concurrency8/profile_export_aiperf.json
```

**Parameters Explained:**
- `--request-cancellation-rate 10`: Cancel 10% of requests (accepts a value from 0.0 to 100.0)
- `--request-cancellation-delay 0.5`: Wait 0.5 seconds after the request is fully sent before cancelling selected requests
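
As a quick sanity check, the expected number of cancelled requests can be estimated from the rate and the request count. This is a back-of-the-envelope sketch: `expected_cancellations` is a hypothetical helper, and it assumes the rate applies to the `--request-count` total. If requests are selected randomly, the actual count may vary around this value.

```bash
#!/usr/bin/env bash
# Estimate only; assumes the rate applies to the --request-count total.
# expected_cancellations REQUEST_COUNT RATE_PERCENT
expected_cancellations() {
  awk -v n="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", n * rate / 100 }'
}

expected_cancellations 50 10   # basic example in this tutorial
expected_cancellations 40 50   # high cancellation rate example
expected_cancellations 60 30   # immediate cancellation example
```

These calls print `5.0`, `20.0`, and `18.0`, matching the `Cancelled requests` lines in the sample outputs of this tutorial.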

### High Cancellation Rate Testing

Test service resilience under frequent cancellations:

```bash
# Profile with 50% request cancellation
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --request-cancellation-rate 50 \
    --request-cancellation-delay 1.0 \
    --synthetic-input-tokens-mean 1200 \
    --output-tokens-mean 600 \
    --concurrency 10 \
    --request-count 40
```

**Sample Output (Successful Run):**
```
INFO     Starting AIPerf System
INFO     Request cancellation enabled: 50.0% rate, 1.0s delay
INFO     AIPerf System is PROFILING

Profiling: 40/40 |████████████████████████| 100% [01:30<00:00]

INFO     Benchmark completed successfully
INFO     Cancelled requests: 20 (50.0%)
INFO     Results saved to: artifacts/Qwen_Qwen3-0.6B-chat-concurrency10/

            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃                      Metric ┃     avg ┃     min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│        Request Latency (ms) │ 1890.45 │ 1456.78 │ 2456.89 │ 2398.12 │ 1867.34 │
│    Time to First Token (ms) │  345.67 │  278.90 │  456.23 │  445.67 │  338.45 │
│    Inter Token Latency (ms) │   16.78 │   13.45 │   22.34 │   21.56 │   16.45 │
│ Output Token Count (tokens) │  600.00 │  540.00 │  660.00 │  658.00 │  598.00 │
│  Request Throughput (req/s) │    8.90 │       - │       - │       - │       - │
└─────────────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

JSON Export: artifacts/Qwen_Qwen3-0.6B-chat-concurrency10/profile_export_aiperf.json
```

### Immediate Cancellation Testing (Delay = 0)

Test immediate disconnection where the client closes the connection right after sending the request:

```bash
# Profile with immediate cancellation (0 delay)
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --url localhost:8000 \
    --request-cancellation-rate 30 \
    --request-cancellation-delay 0.0 \
    --synthetic-input-tokens-mean 500 \
    --output-tokens-mean 100 \
    --concurrency 15 \
    --request-count 60
```

**Sample Output (Successful Run):**
```
INFO     Starting AIPerf System
INFO     Request cancellation enabled: 30.0% rate, 0.0s delay (immediate)
INFO     AIPerf System is PROFILING

Profiling: 60/60 |████████████████████████| 100% [00:45<00:00]

INFO     Benchmark completed successfully
INFO     Cancelled requests: 18 (30.0%)
INFO     Results saved to: artifacts/Qwen_Qwen3-0.6B-chat-concurrency15/

            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                      Metric ┃    avg ┃    min ┃    max ┃    p99 ┃    p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│        Request Latency (ms) │ 678.90 │ 534.56 │ 898.12 │ 876.34 │ 667.89 │
│    Time to First Token (ms) │ 156.78 │ 123.45 │ 198.90 │ 192.34 │ 154.23 │
│    Inter Token Latency (ms) │  12.45 │   9.89 │  16.78 │  16.12 │  12.23 │
│ Output Token Count (tokens) │ 100.00 │  90.00 │ 110.00 │ 109.00 │  99.00 │
│  Request Throughput (req/s) │  23.45 │      - │      - │      - │      - │
└─────────────────────────────┴────────┴────────┴────────┴────────┴────────┘

JSON Export: artifacts/Qwen_Qwen3-0.6B-chat-concurrency15/profile_export_aiperf.json
```

**What happens with delay=0:**
- The full request (headers + body) is sent to the server
- The client immediately disconnects after sending
- The server receives the complete request but the client won't read any response
- Tests how the server handles abandoned requests and cleans up resources