Profile Hugging Face TGI Models with AIPerf | NVIDIA AIPerf Documentation

AIPerf can benchmark Large Language Models (LLMs) served through the Hugging Face Text Generation Inference (TGI) generate API. TGI exposes two standard HTTP endpoints for text generation:

Endpoint	Description	AIPerf Flag
`/generate`	Returns the full text completion in one response (non-streaming).	(default)
`/generate_stream`	Streams generated tokens as they are produced (SSE).	`--streaming`

Start a Hugging Face TGI Server

To launch a Hugging Face TGI server, use the official ghcr.io image:

$ docker run --gpus all --rm -it \
>   -p 8080:80 \
>   -e MODEL_ID=TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
>   ghcr.io/huggingface/text-generation-inference:latest

$ # Verify the server is running
$ curl -s http://localhost:8080/generate \
>   -H "Content-Type: application/json" \
>   -d '{"inputs":"Hello world"}' | jq

Profile with AIPerf

You can benchmark TGI models in either non-streaming or streaming, and with either synthetic inputs or a custom input file.

Non-Streaming (`/generate`)

Profile with synthetic inputs

$ aiperf profile \
>     -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
>     --endpoint-type huggingface_generate \
>     --url localhost:8080 \
>     --request-count 10

Sample Output (Successful Run):

INFO     Starting AIPerf System
INFO     Using Hugging Face TGI /generate endpoint (non-streaming)
INFO     AIPerf System is PROFILING
Profiling: 10/10 |████████████████████████| 100% [00:08<00:00]
INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/
            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃                      Metric ┃     avg ┃    min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│        Request Latency (ms) │ 1234.56 │ 987.34 │ 1567.89 │ 1567.89 │ 1198.45 │
│ Output Token Count (tokens) │  256.00 │ 200.00 │  300.00 │  300.00 │  254.00 │
│  Request Throughput (req/s) │    2.34 │      - │       - │       - │       - │
└─────────────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┘
JSON Export: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/profile_export_aiperf.json

Profile with custom input file

You can also provide your own text prompts using the —input-file option. The file should be in JSONL format and contain text entries.

$ cat > inputs.jsonl <<'EOF'
$ {"text": "Hello TinyLlama!"}
$ {"text": "Tell me a joke."}
$ EOF

Then run:

$ aiperf profile \
>     -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
>     --endpoint-type huggingface_generate \
>     --url localhost:8080 \
>     --input-file ./inputs.jsonl \
>     --custom-dataset-type single_turn \
>     --request-count 10

Streaming (`/generate_stream`)

When the --streaming flag is enabled, AIPerf automatically sends requests to the /generate_stream endpoint of the TGI server.

Profile with synthetic inputs

$ aiperf profile \
>     -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
>     --endpoint-type huggingface_generate \
>     --url localhost:8080 \
>     --streaming \
>     --request-count 10

Sample Output (Successful Run):

INFO     Starting AIPerf System
INFO     Using Hugging Face TGI /generate_stream endpoint (streaming)
INFO     AIPerf System is PROFILING
Profiling: 10/10 |████████████████████████| 100% [00:09<00:00]
INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/
            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃                      Metric ┃     avg ┃    min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│        Request Latency (ms) │ 1189.45 │ 945.67 │ 1498.34 │ 1498.34 │ 1156.78 │
│    Time to First Token (ms) │  234.56 │ 189.34 │  298.45 │  298.45 │  228.90 │
│    Inter Token Latency (ms) │   14.23 │  11.45 │   18.90 │   18.90 │   13.89 │
│ Output Token Count (tokens) │  256.00 │ 200.00 │  300.00 │  300.00 │  254.00 │
│  Request Throughput (req/s) │    2.56 │      - │       - │       - │       - │
└─────────────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┘
JSON Export: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/profile_export_aiperf.json

Profile with custom input file

Create your own prompt file in JSONL format:

$ cat > inputs.jsonl <<'EOF'
$ {"text": "Explain quantum computing in simple terms."}
$ {"text": "Write a haiku about rain."}
$ {"text": "Summarize the causes of the French Revolution."}
$ EOF

Then run:

$ aiperf profile \
>     -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
>     --endpoint-type huggingface_generate \
>     --url localhost:8080 \
>     --input-file ./inputs.jsonl \
>     --custom-dataset-type single_turn \
>     --streaming \
>     --request-count 10

Start a Hugging Face TGI Server

Profile with AIPerf

Non-Streaming (/generate)

Profile with synthetic inputs

Profile with custom input file

Streaming (/generate_stream)

Profile with synthetic inputs

Profile with custom input file

Non-Streaming (`/generate`)

Streaming (`/generate_stream`)