***

sidebar-title: Profile OpenAI-Compatible Text APIs Using AIPerf
---------------------

For clean Markdown of any page, append .md to the page URL. For a complete documentation index, see https://docs.nvidia.com/aiperf/tutorials/model-endpoint-guides/llms.txt. For full documentation content, see https://docs.nvidia.com/aiperf/tutorials/model-endpoint-guides/llms-full.txt.

# Profile OpenAI-Compatible Text APIs Using AIPerf

This guide covers profiling OpenAI-compatible Chat Completions and Completions endpoints with vLLM and AIPerf.

## Start a vLLM server

Pull and start a vLLM server using Docker:
```bash
docker pull vllm/vllm-openai:latest
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model Qwen/Qwen3-0.6B \
  --reasoning-parser qwen3
```

Verify the server is ready:
```bash
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }
```

## Profile Chat Completions API
The Chat Completions API uses the `/v1/chat/completions` endpoint.

### Profile with synthetic inputs

Run AIPerf against the Chat Completions endpoint using synthetic inputs:
```bash
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --synthetic-input-tokens-mean 100 \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean 200 \
    --output-tokens-stddev 0 \
    --url localhost:8000 \
    --request-count 20
```

**Sample Output (Successful Run):**
```
INFO     Starting AIPerf System
INFO     AIPerf System is PROFILING

Profiling: 20/20 |████████████████████████| 100% [00:35<00:00]

INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/Qwen_Qwen3-0.6B-chat-concurrency1/

            NVIDIA AIPerf | LLM Metrics
┃                      Metric ┃     avg ┃     min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│        Request Latency (ms) │ 1678.90 │ 1456.34 │ 1923.45 │ 1923.45 │ 1667.23 │
│    Time to First Token (ms) │  234.56 │  198.34 │  289.12 │  289.12 │  231.45 │
│    Inter Token Latency (ms) │   13.89 │   11.23 │   17.45 │   17.45 │   13.67 │
│ Output Token Count (tokens) │  200.00 │  200.00 │  200.00 │  200.00 │  200.00 │
│  Request Throughput (req/s) │    5.67 │       - │       - │       - │       - │
└─────────────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

JSON Export: artifacts/Qwen_Qwen3-0.6B-chat-concurrency1/profile_export_aiperf.json
```

### Profile with custom input file

Create a JSONL input file:

```bash
cat <<EOF > inputs.jsonl
{"texts": ["Hello!"]}
{"texts": ["Tell me a joke."]}
EOF
```

Run AIPerf against the Chat Completions endpoint using the custom input file:
```bash
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --endpoint /v1/chat/completions \
    --streaming \
    --input-file inputs.jsonl \
    --custom-dataset-type single_turn \
    --url localhost:8000 \
    --request-count 10
```

## Profile Completions API
The Completions API uses the `/v1/completions` endpoint.

### Profile with synthetic inputs

Run AIPerf against the Completions endpoint using synthetic inputs:
```bash
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type completions \
    --endpoint /v1/completions \
    --synthetic-input-tokens-mean 64 \
    --synthetic-input-tokens-stddev 4 \
    --output-tokens-mean 128 \
    --output-tokens-stddev 4 \
    --url localhost:8000 \
    --request-count 32
```

**Sample Output (Successful Run):**
```
INFO     Starting AIPerf System
INFO     Using /v1/completions endpoint
INFO     AIPerf System is PROFILING

Profiling: 32/32 |████████████████████████| 100% [00:28<00:00]

INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/Qwen_Qwen3-0.6B-completions-concurrency1/

            NVIDIA AIPerf | LLM Metrics
┃                      Metric ┃    avg ┃    min ┃    max ┃    p99 ┃    p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│        Request Latency (ms) │ 876.45 │ 789.34 │ 987.12 │ 978.90 │ 871.23 │
│    Time to First Token (ms) │ 156.78 │ 134.56 │ 189.23 │ 185.67 │ 155.12 │
│    Inter Token Latency (ms) │  12.34 │  10.23 │  15.67 │  15.34 │  12.12 │
│ Output Token Count (tokens) │ 128.00 │ 120.00 │ 136.00 │ 135.00 │ 128.00 │
│  Request Throughput (req/s) │  10.89 │      - │      - │      - │      - │
└─────────────────────────────┴────────┴────────┴────────┴────────┴────────┘

JSON Export: artifacts/Qwen_Qwen3-0.6B-completions-concurrency1/profile_export_aiperf.json
```

### Profile with custom input file

Create a JSONL input file:
```bash
cat <<EOF > inputs.jsonl
{"texts": ["How are you?"]}
{"texts": ["Give me a poem."]}
EOF

```
Run AIPerf against the Completions endpoint using the custom input file:
```bash
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type completions \
    --endpoint /v1/completions \
    --input-file inputs.jsonl \
    --custom-dataset-type single_turn \
    --url localhost:8000 \
    --request-count 10

```