Profile with AIMO Dataset


AIPerf supports benchmarking using AIMO math reasoning datasets, which contain competition mathematics problems requiring chain-of-thought reasoning. These datasets are useful for measuring model throughput and latency under long-context, reasoning-heavy workloads.

Four variants are available:

Dataset            --public-dataset    Description
NuminaMath-TIR     aimo                Tool-integrated reasoning problems
NuminaMath-CoT     aimo_numina_cot     Chain-of-thought reasoning problems (~859k problems)
NuminaMath-1.5     aimo_numina_1_5     Latest NuminaMath release
AIME Validation    aimo_aime           90 AIME competition problems
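When sweeping a benchmark across all four variants, the flag values above can be assembled into aiperf invocations from a script. A minimal Python sketch, assuming only the dataset names from the table and the flags used in the profiling command later in this guide (model, URL, and counts are placeholders to adjust for your setup):

```python
# Sketch: build one aiperf command per AIMO dataset variant.
# Dataset flag values come from the table above; other flags mirror
# the profiling example in this guide and are assumptions to tune.
AIMO_VARIANTS = {
    "NuminaMath-TIR": "aimo",
    "NuminaMath-CoT": "aimo_numina_cot",
    "NuminaMath-1.5": "aimo_numina_1_5",
    "AIME Validation": "aimo_aime",
}

def build_command(dataset: str, model: str = "Qwen/Qwen3-0.6B") -> list[str]:
    """Return an aiperf invocation (as an argv list) for one variant."""
    return [
        "aiperf", "profile",
        "--model", model,
        "--endpoint-type", "chat",
        "--streaming",
        "--url", "localhost:8000",
        "--public-dataset", dataset,
        "--request-count", "10",
        "--concurrency", "4",
        "--prompt-output-tokens-mean", "512",
    ]

for name, flag in AIMO_VARIANTS.items():
    print(name, "->", " ".join(build_command(flag)))
```

An argv list like this can be passed straight to `subprocess.run`, avoiding shell-quoting issues.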

This guide covers profiling OpenAI-compatible chat completions endpoints using the AIMO public datasets.


Start a vLLM Server

Launch a vLLM server with a chat model:

$ docker pull vllm/vllm-openai:latest
$ docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
> --model Qwen/Qwen3-0.6B

Verify the server is ready:

$ curl -s localhost:8000/v1/chat/completions \
> -H "Content-Type: application/json" \
> -d '{"model":"Qwen/Qwen3-0.6B","messages":[{"role":"user","content":"test"}],"max_tokens":1}'

Profile with AIMO Dataset

AIPerf loads the AIMO dataset from HuggingFace and uses each problem as a single-turn prompt.
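Conceptually, each dataset row becomes one chat completions request. A hedged sketch of that mapping, assuming a plain single-turn payload (the exact request shape AIPerf builds internally may differ):

```python
import json

def problem_to_chat_payload(problem_text: str, model: str,
                            max_tokens: int = 512) -> dict:
    # One AIMO problem -> one single-turn /v1/chat/completions body.
    # Payload shape follows the standard OpenAI chat schema; it is an
    # illustration, not AIPerf's actual internal request builder.
    return {
        "model": model,
        "messages": [{"role": "user", "content": problem_text}],
        "max_tokens": max_tokens,
        "stream": True,
    }

payload = problem_to_chat_payload(
    "Find all integer solutions of x^2 - y^2 = 17.", "Qwen/Qwen3-0.6B"
)
print(json.dumps(payload, indent=2))
```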

AIMO problems elicit long chain-of-thought responses. Use --prompt-output-tokens-mean to cap output length and reduce benchmark duration:

$ aiperf profile \
> --model Qwen/Qwen3-0.6B \
> --endpoint-type chat \
> --streaming \
> --url localhost:8000 \
> --public-dataset aimo \
> --request-count 10 \
> --concurrency 4 \
> --prompt-output-tokens-mean 512
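At a fixed concurrency, total wall-clock time is roughly (request count / concurrency) waves of requests, each lasting about one average request latency. A back-of-the-envelope sketch, assuming the ~33 s average latency from the sample run below:

```python
import math

def estimate_duration_s(request_count: int, concurrency: int,
                        avg_latency_s: float) -> float:
    # Rough lower bound: waves of `concurrency` requests run back to back.
    # Real runs overlap requests less cleanly, so treat this as an estimate.
    return math.ceil(request_count / concurrency) * avg_latency_s

est = estimate_duration_s(10, 4, 32.8)
print(f"~{est:.0f} s")  # 3 waves of ~32.8 s each
```

Lowering `--prompt-output-tokens-mean` shrinks the average latency term directly, which is why capping output length shortens the benchmark.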

Sample Output (Successful Run):

NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p50 ┃ std ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│ Time to First Token │ 1,509.04 │ 687.93 │ 1,879.53 │ 1,879.53 │ 1,879.51 │ 1,563.66 │ 430.29 │
│ (ms) │ │ │ │ │ │ │ │
│ Time to Second Token │ 56.44 │ 32.27 │ 86.48 │ 84.59 │ 67.58 │ 53.16 │ 14.62 │
│ (ms) │ │ │ │ │ │ │ │
│ Request Latency (ms) │ 32,823.02 │ 18,943.44 │ 38,431.19 │ 38,431.18 │ 38,431.07 │ 34,163.15 │ 7,192.79 │
│ Inter Token Latency │ 61.40 │ 35.80 │ 71.72 │ 71.72 │ 71.67 │ 63.92 │ 13.26 │
│ (ms) │ │ │ │ │ │ │ │
│ Output Token │ 17.42 │ 13.94 │ 27.94 │ 27.94 │ 27.93 │ 15.64 │ 5.31 │
│ Throughput Per User │ │ │ │ │ │ │ │
│ (tokens/sec/user) │ │ │ │ │ │ │ │
│ Output Sequence │ 511.00 │ 511.00 │ 511.00 │ 511.00 │ 511.00 │ 511.00 │ 0.00 │
│ Length (tokens) │ │ │ │ │ │ │ │
│ Input Sequence │ 56.90 │ 31.00 │ 84.00 │ 83.46 │ 78.60 │ 53.00 │ 17.59 │
│ Length (tokens) │ │ │ │ │ │ │ │
│ Output Token │ 55.82 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Throughput │ │ │ │ │ │ │ │
│ (tokens/sec) │ │ │ │ │ │ │ │
│ Request Throughput │ 0.11 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │
│ (requests/sec) │ │ │ │ │ │ │ │
│ Request Count │ 10.00 │ N/A │ N/A │ N/A │ N/A │ N/A │ N/A │
│ (requests) │ │ │ │ │ │ │ │
└──────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴──────────┘

Higher request latency compared to conversational datasets is expected — AIMO problems require extended chain-of-thought reasoning and produce significantly longer outputs.
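The reported averages are also internally consistent: request latency decomposes approximately into time to first token plus one inter-token gap for each remaining output token. A quick check against the sample run above:

```python
# Averages taken from the sample metrics table above.
ttft_ms = 1509.04      # Time to First Token
itl_ms = 61.40         # Inter Token Latency
output_tokens = 511    # Output Sequence Length

# Request Latency ≈ TTFT + (tokens - 1) * ITL
# (approximate: averaging per-request values does not compose exactly)
approx_latency_ms = ttft_ms + (output_tokens - 1) * itl_ms
print(f"{approx_latency_ms:,.2f} ms")  # close to the reported 32,823.02 ms
```

With ~32.8 s dominated by the (tokens − 1) × ITL term, output length, not TTFT, drives the high latency on reasoning workloads.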