Time-Based Benchmarking | NVIDIA AIPerf Documentation

Time-based benchmarking runs for a specific duration rather than a fixed number of requests. Use it for SLA validation, stability testing, capacity planning, and A/B comparisons where consistent time windows matter.

Quick Start

$ aiperf profile \
>     --model your-model \
>     --url localhost:8000 \
>     --endpoint-type chat \
>     --streaming \
>     --concurrency 10 \
>     --benchmark-duration 60

Requests are sent continuously until the duration expires. AIPerf then waits for in-flight requests to complete (up to the grace period).

How It Works

│          BENCHMARK DURATION           │   GRACE PERIOD    │
│        (sending requests)             │   (drain only)    │
├───────────────────────────────────────┼───────────────────┤
│ New requests dispatched               │ No new requests   │
│ Responses collected                   │ Wait for in-flight│
└───────────────────────────────────────┴───────────────────┘
                    ▲                             ▲
           Duration expires              Grace period ends

Grace period default: 30 seconds (use inf to wait forever, 0 for immediate completion)
Responses received within grace period are included in metrics; responses still pending when grace expires are not

--benchmark-grace-period requires --benchmark-duration to be set.

Combining with Request Count

Duration can be combined with count-based stopping—first condition reached wins:

$ # Stop when EITHER 1000 requests sent OR 120 seconds pass
$ aiperf profile \
>     --model your-model \
>     --url localhost:8000 \
>     --endpoint-type chat \
>     --streaming \
>     --request-rate 20 \
>     --benchmark-duration 120 \
>     --request-count 1000

Examples

Stability Test (5 minutes)

$ aiperf profile \
>     --model Qwen/Qwen2.5-7B-Instruct \
>     --url localhost:8000 \
>     --endpoint-type chat \
>     --streaming \
>     --concurrency 50 \
>     --benchmark-duration 300 \
>     --benchmark-grace-period 60 \
>     --warmup-duration 30

Sample Output (Successful Run):

INFO     Starting AIPerf System
INFO     AIPerf System is WARMING UP
Warming Up: [00:30] - Running for 30 seconds...
INFO     Warmup completed, starting profiling phase
INFO     AIPerf System is PROFILING
Profiling: [05:00] - Running for 300 seconds...
INFO     Benchmark duration reached, draining in-flight requests
INFO     Grace period: waiting up to 60 seconds for responses
INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/Qwen_Qwen2.5-7B-Instruct-chat-concurrency50/
            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                     Metric ┃    avg ┃    min ┃    max ┃    p99 ┃    p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│       Request Latency (ms) │ 245.67 │ 178.90 │ 398.12 │ 367.89 │ 239.45 │
│   Time to First Token (ms) │  56.78 │  42.34 │  89.12 │  82.45 │  55.23 │
│   Inter Token Latency (ms) │  13.45 │  10.23 │  19.67 │  18.45 │  13.12 │
│ Request Throughput (req/s) │  89.23 │      - │      - │      - │      - │
└────────────────────────────┴────────┴────────┴────────┴────────┴────────┘
JSON Export: artifacts/Qwen_Qwen2.5-7B-Instruct-chat-concurrency50/profile_export_aiperf.json

Soak Test (1 hour)

$ aiperf profile \
>     --model Qwen/Qwen2.5-7B-Instruct \
>     --url localhost:8000 \
>     --endpoint-type chat \
>     --streaming \
>     --concurrency 20 \
>     --benchmark-duration 3600 \
>     --benchmark-grace-period 120 \
>     --warmup-duration 60