Profile Hugging Face TGI Models with AIPerf

AIPerf can benchmark Large Language Models (LLMs) served through the Hugging Face Text Generation Inference (TGI) generate API. TGI exposes two standard HTTP endpoints for text generation:

| Endpoint | Description | AIPerf Flag |
| --- | --- | --- |
| /generate | Returns the full text completion in one response (non-streaming). | (default) |
| /generate_stream | Streams generated tokens as they are produced (SSE). | --streaming |
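
To make the difference between the two endpoints concrete, the following is a minimal sketch of how a client might build a request for either one. The helper name `build_tgi_request` is hypothetical, the `max_new_tokens` value is an arbitrary choice, and the base URL assumes the server setup described in this guide. It is an illustration of the request shape, not a description of how AIPerf builds its requests internally.

```python
import json
import urllib.request


def build_tgi_request(base_url: str, prompt: str, streaming: bool = False,
                      max_new_tokens: int = 32) -> urllib.request.Request:
    """Build (but do not send) a POST request for a TGI generation endpoint.

    Hypothetical helper for illustration; not part of AIPerf or TGI.
    """
    # Streaming selects /generate_stream; otherwise the plain /generate endpoint.
    path = "/generate_stream" if streaming else "/generate"
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumed address: the port mapping used elsewhere in this guide.
    request = build_tgi_request("http://localhost:8080", "Hello world", streaming=True)
    print(request.get_method(), request.full_url)
```

Both endpoints accept the same JSON body, so only the path changes between the two modes.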

Start a Hugging Face TGI Server

To launch a Hugging Face TGI server, use the official ghcr.io image:

$ docker run --gpus all --rm -it \
> -p 8080:80 \
> -e MODEL_ID=TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
> ghcr.io/huggingface/text-generation-inference:latest
$ # In a second terminal, verify the server is running
$ curl -s http://localhost:8080/generate \
> -H "Content-Type: application/json" \
> -d '{"inputs":"Hello world"}' | jq
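
The same check can be scripted. The sketch below assumes the server answers /generate with a JSON object that carries a `generated_text` field, which is the response shape TGI documents for this endpoint. The function names `extract_generated_text` and `verify_server` are hypothetical, and the 30-second timeout is an arbitrary choice.

```python
import json
import urllib.request


def extract_generated_text(body: str) -> str:
    """Return the generated text from a /generate response body."""
    data = json.loads(body)
    # Assumed shape: {"generated_text": "..."}; anything else is treated as an error.
    if not isinstance(data, dict) or "generated_text" not in data:
        raise ValueError(f"unexpected response shape: {body[:80]!r}")
    return data["generated_text"]


def verify_server(base_url: str = "http://localhost:8080", timeout: float = 30.0) -> str:
    """Send one small prompt to /generate and return the generated text."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/generate",
        data=json.dumps({"inputs": "Hello world"}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_generated_text(response.read().decode("utf-8"))
```

Calling `verify_server()` while the container is still loading the model will typically fail with a connection error, so it may need to be retried until the server is ready.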

Profile with AIPerf

You can benchmark TGI models in either non-streaming or streaming mode, using either synthetic inputs or a custom input file.

Non-Streaming (/generate)

Profile with synthetic inputs

$ aiperf profile \
> -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
> --endpoint-type huggingface_generate \
> --url localhost:8080 \
> --request-count 10

Sample Output (Successful Run):

INFO Starting AIPerf System
INFO Using Hugging Face TGI /generate endpoint (non-streaming)
INFO AIPerf System is PROFILING
Profiling: 10/10 |████████████████████████| 100% [00:08<00:00]
INFO Benchmark completed successfully
INFO Results saved to: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ Request Latency (ms) │ 1234.56 │ 987.34 │ 1567.89 │ 1567.89 │ 1198.45 │
│ Output Token Count (tokens) │ 256.00 │ 200.00 │ 300.00 │ 300.00 │ 254.00 │
│ Request Throughput (req/s) │ 2.34 │ - │ - │ - │ - │
└─────────────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┘
JSON Export: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/profile_export_aiperf.json

Profile with custom input file

You can also provide your own text prompts using the --input-file option. The file must be in JSONL format, with one JSON object per line that holds the prompt in a "text" field.

$ cat > inputs.jsonl <<'EOF'
> {"text": "Hello TinyLlama!"}
> {"text": "Tell me a joke."}
> EOF

Then run:

$ aiperf profile \
> -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
> --endpoint-type huggingface_generate \
> --url localhost:8080 \
> --input-file ./inputs.jsonl \
> --custom-dataset-type single_turn \
> --request-count 10
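
Before pointing AIPerf at a custom file, it can help to confirm that every line parses. The following is a minimal sketch of such a check, based only on the format shown above (one JSON object per line with a string "text" field). The function name `load_prompts` is hypothetical, and whether AIPerf itself tolerates blank lines or extra fields is not something this sketch knows.

```python
import json
from typing import Iterable, List


def load_prompts(lines: Iterable[str]) -> List[str]:
    """Parse JSONL lines and return the prompt strings, raising on bad entries."""
    prompts = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # this sketch skips blank lines (an assumption, see above)
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number}: invalid JSON ({exc.msg})") from exc
        if not isinstance(entry, dict) or not isinstance(entry.get("text"), str):
            raise ValueError(f'line {number}: expected an object with a string "text" field')
        prompts.append(entry["text"])
    return prompts


if __name__ == "__main__":
    with open("inputs.jsonl", encoding="utf-8") as handle:
        print(f"{len(load_prompts(handle))} prompts look valid")
```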

Streaming (/generate_stream)

When the --streaming flag is enabled, AIPerf automatically sends requests to the /generate_stream endpoint of the TGI server.
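
For readers unfamiliar with the streaming response, here is a rough sketch of how a client could read the server-sent events that /generate_stream produces. It assumes each event arrives as a `data:` line whose JSON payload has a `token` object with `text` and `special` fields, which matches TGI's documented stream format but may vary between versions. The function name `iter_stream_tokens` is hypothetical, and this is not AIPerf's own parser.

```python
import json
from typing import Iterable, Iterator


def iter_stream_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield token text from the lines of a /generate_stream SSE response."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data SSE fields
        event = json.loads(line[len("data:"):])
        token = event.get("token") or {}
        if token.get("special"):
            continue  # skip special tokens such as end-of-sequence markers
        text = token.get("text")
        if text is not None:
            yield text


if __name__ == "__main__":
    # Hand-written sample events, not captured from a real server.
    sample = [
        'data:{"token":{"id":1,"text":"Hello","special":false},"generated_text":null}',
        "",
        'data:{"token":{"id":2,"text":" there","special":false},"generated_text":null}',
    ]
    print("".join(iter_stream_tokens(sample)))
```

Because tokens arrive one event at a time, a client can timestamp each event, which is what makes per-token latency metrics possible in streaming mode.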

Profile with synthetic inputs

$ aiperf profile \
> -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
> --endpoint-type huggingface_generate \
> --url localhost:8080 \
> --streaming \
> --request-count 10

Sample Output (Successful Run):

INFO Starting AIPerf System
INFO Using Hugging Face TGI /generate_stream endpoint (streaming)
INFO AIPerf System is PROFILING
Profiling: 10/10 |████████████████████████| 100% [00:09<00:00]
INFO Benchmark completed successfully
INFO Results saved to: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ Request Latency (ms) │ 1189.45 │ 945.67 │ 1498.34 │ 1498.34 │ 1156.78 │
│ Time to First Token (ms) │ 234.56 │ 189.34 │ 298.45 │ 298.45 │ 228.90 │
│ Inter Token Latency (ms) │ 14.23 │ 11.45 │ 18.90 │ 18.90 │ 13.89 │
│ Output Token Count (tokens) │ 256.00 │ 200.00 │ 300.00 │ 300.00 │ 254.00 │
│ Request Throughput (req/s) │ 2.56 │ - │ - │ - │ - │
└─────────────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┘
JSON Export: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/profile_export_aiperf.json
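
The two streaming-only rows in the table, Time to First Token and Inter Token Latency, can be understood through a small worked example. The sketch below uses common definitions (time from request start to the first token, and the mean gap between consecutive tokens). AIPerf's exact formulas may differ, and the function name `streaming_latency_metrics` and the dictionary keys are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def streaming_latency_metrics(request_start: float, token_times: List[float]) -> Dict[str, float]:
    """Compute per-request latency metrics in ms from timestamps in seconds.

    Uses common definitions, assumed here for illustration only.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_start
    request_latency = token_times[-1] - request_start
    # Gaps between consecutive token arrivals; empty for a single-token response.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    return {
        "time_to_first_token_ms": ttft * 1000.0,
        "inter_token_latency_ms": itl * 1000.0,
        "request_latency_ms": request_latency * 1000.0,
    }


if __name__ == "__main__":
    # Made-up timestamps: request sent at t=0, tokens at 0.2 s, 0.25 s and 0.3 s.
    print(streaming_latency_metrics(0.0, [0.2, 0.25, 0.3]))
```

In non-streaming mode the client only sees the completed response, which is why those two rows are absent from the /generate output.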

Profile with custom input file

Create your own prompt file in JSONL format:

$ cat > inputs.jsonl <<'EOF'
> {"text": "Explain quantum computing in simple terms."}
> {"text": "Write a haiku about rain."}
> {"text": "Summarize the causes of the French Revolution."}
> EOF

Then run:

$ aiperf profile \
> -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
> --endpoint-type huggingface_generate \
> --url localhost:8080 \
> --input-file ./inputs.jsonl \
> --custom-dataset-type single_turn \
> --streaming \
> --request-count 10