# Profile Hugging Face TGI Models with AIPerf
AIPerf can benchmark Large Language Models (LLMs) served through the Hugging Face Text Generation Inference (TGI) `generate` API.
TGI exposes two standard HTTP endpoints for text generation:

- `/generate`: returns the full generated text in a single JSON response
- `/generate_stream`: streams tokens back incrementally as Server-Sent Events (SSE)
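Assuming a TGI server is already listening on `localhost:8080`, both endpoints can be exercised directly with `curl`; the request body below follows the TGI `generate` API schema:

```shell
# Non-streaming: one JSON response containing the full generation.
curl http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'

# Streaming: Server-Sent Events, one event per generated token.
curl http://localhost:8080/generate_stream \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'
```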
## Start a Hugging Face TGI Server
To launch a Hugging Face TGI server, use the official ghcr.io image:
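For example, the following launches TGI on port 8080 with a model pulled from the Hugging Face Hub (the model shown is only an example; substitute any model TGI supports):

```shell
# Example model; substitute any TGI-supported model ID.
model=HuggingFaceH4/zephyr-7b-beta

# Run the official TGI container, mapping port 8080 on the host to
# port 80 in the container and caching model weights under ./data.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id $model
```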
## Profile with AIPerf
You can benchmark TGI models in either non-streaming or streaming mode, with either synthetic inputs or a custom input file.
### Non-Streaming (`/generate`)

#### Profile with synthetic inputs
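A minimal sketch of the command is shown below. Only `--streaming` and `--input-file` appear elsewhere in this guide; the other flag names (`--model`, `--url`, `--endpoint-type`) are assumptions modeled on similar benchmarking CLIs, so verify them with `aiperf profile --help` on your installation:

```shell
# Hypothetical AIPerf invocation targeting the non-streaming /generate
# endpoint. Flag names here are assumptions; confirm with
# `aiperf profile --help`.
aiperf profile \
  --model HuggingFaceH4/zephyr-7b-beta \
  --url http://localhost:8080 \
  --endpoint-type generate
```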
Sample Output (Successful Run):
#### Profile with custom input file
You can also provide your own text prompts using the `--input-file` option. The file must be in JSONL format, with one JSON object containing a text entry per line.
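For example, a two-prompt file could be created like this (the `text` key is an assumption based on the "text entries" description above; adjust it if your AIPerf version expects a different key):

```shell
# Write an example JSONL input file: one JSON object per line, each
# with a "text" entry (key name assumed from the description above).
cat > inputs.jsonl <<'EOF'
{"text": "What is deep learning?"}
{"text": "Summarize the transformer architecture in one paragraph."}
EOF
```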
Then run:
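A sketch of the command, assuming the prompt file is named `inputs.jsonl`; apart from `--input-file`, which this guide names explicitly, the flag names are assumptions to confirm with `aiperf profile --help`:

```shell
# Hypothetical invocation; only --input-file is taken from this guide,
# the remaining flags are assumed.
aiperf profile \
  --model HuggingFaceH4/zephyr-7b-beta \
  --url http://localhost:8080 \
  --endpoint-type generate \
  --input-file inputs.jsonl
```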
### Streaming (`/generate_stream`)
When the `--streaming` flag is enabled, AIPerf automatically sends requests to the `/generate_stream` endpoint of the TGI server.
#### Profile with synthetic inputs
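The streaming case adds the `--streaming` flag described above; as before, the other flag names are assumptions to verify against `aiperf profile --help`:

```shell
# Hypothetical invocation; --streaming routes requests to
# /generate_stream, the other flags are assumed.
aiperf profile \
  --model HuggingFaceH4/zephyr-7b-beta \
  --url http://localhost:8080 \
  --endpoint-type generate \
  --streaming
```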
Sample Output (Successful Run):
#### Profile with custom input file
Create your own prompt file in JSONL format:
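For example (the `text` key is an assumption based on the file format described earlier in this guide):

```shell
# Write an example JSONL prompt file: one JSON object per line,
# each with a "text" entry (key name assumed).
cat > prompts.jsonl <<'EOF'
{"text": "Explain the difference between streaming and non-streaming inference."}
{"text": "List three use cases for large language models."}
EOF
```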
Then run:
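A sketch combining `--streaming` and `--input-file` (both named in this guide), assuming the file above is called `prompts.jsonl`; the remaining flags are assumptions, so check `aiperf profile --help`:

```shell
# Hypothetical invocation; --streaming and --input-file come from this
# guide, the other flags are assumed.
aiperf profile \
  --model HuggingFaceH4/zephyr-7b-beta \
  --url http://localhost:8080 \
  --endpoint-type generate \
  --streaming \
  --input-file prompts.jsonl
```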