Profile Hugging Face TGI Models Using GenAI-Perf#

GenAI-Perf can profile LLMs running on a Hugging Face Text Generation Inference (TGI) API-compatible server using the generate API. This guide walks you through starting a TGI server and profiling it with either synthetic inputs or your own data.

Step 1: Start a Hugging Face TGI Server#

To launch a Hugging Face TGI server, use the official ghcr.io image:

docker run --gpus all --rm -it \
  -p 8080:80 \
  -e MODEL_ID=llava-hf/llava-v1.6-mistral-7b-hf \
  ghcr.io/huggingface/text-generation-inference
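Before profiling, it can help to confirm the server answers a single request. The sketch below builds a request body for TGI's generate API; the helper name `build_generate_payload` is hypothetical, introduced only for illustration.

```python
import json

# Hypothetical helper: build a request body for TGI's /generate endpoint,
# which accepts {"inputs": ..., "parameters": {...}}.
def build_generate_payload(prompt: str, max_new_tokens: int = 32) -> dict:
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}

payload = build_generate_payload("What is deep learning?")
print(json.dumps(payload))

# With the server from the docker command running, the payload could be
# POSTed to http://localhost:8080/generate with Content-Type: application/json
# (for example via urllib.request or curl).
```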

Approach 1: Profile Using Synthetic Inputs#

Run with built-in synthetic prompts:

genai-perf profile \
  -m llava-hf/llava-v1.6-mistral-7b-hf \
  --endpoint-type huggingface_generate \
  --url localhost:8080 \
  --batch-size-image 1 \
  --image-width-mean 100 \
  --image-height-mean 100 \
  --synthetic-input-tokens-mean 10
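The `--image-width-mean`, `--image-height-mean`, and `--synthetic-input-tokens-mean` flags set the average sizes of the generated inputs. Conceptually, individual requests vary around those means; the sketch below illustrates that idea with a simple Gaussian sampler (an assumption for illustration, not GenAI-Perf's actual sampling code).

```python
import random

# Illustration only: sample per-request prompt lengths around a configured
# mean, mirroring how a mean-based synthetic generator behaves on average.
random.seed(0)
mean_tokens, stddev_tokens = 10, 2
lengths = [max(1, round(random.gauss(mean_tokens, stddev_tokens)))
           for _ in range(1000)]
avg = sum(lengths) / len(lengths)
print(round(avg, 1))  # close to the configured mean of 10
```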

Approach 2: Bring Your Own Data (BYOD)#

Instead of letting GenAI-Perf generate synthetic data, you can provide your own data using the --input-file CLI option. The file must be in JSONL format, with each line containing both the prompt text and the filepath of the image to send.

For example, an input file might look like the following:

// input.jsonl
{"text": "What is in this image?", "image": "path/to/image1.png"}
{"text": "What is the color of the dog?", "image": "path/to/image2.jpeg"}
{"text": "Describe the scene in the picture.", "image": "path/to/image3.png"}

Run GenAI-Perf#

genai-perf profile \
  -m llava-hf/llava-v1.6-mistral-7b-hf \
  --endpoint-type huggingface_generate \
  --url localhost:8080 \
  --input-file input.jsonl

Review the Output#

Example output:

                                        NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃                            Statistic ┃       avg ┃      min ┃       max ┃       p99 ┃       p90 ┃       p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│                 Request Latency (ms) │  6,675.95 │   234.82 │ 54,474.75 │ 50,240.69 │ 12,134.14 │  1,283.24 │
│      Output Sequence Length (tokens) │    347.50 │     3.00 │  2,908.00 │  2,681.92 │    647.20 │     59.75 │
│       Input Sequence Length (tokens) │ 10,164.10 │ 7,293.00 │ 12,113.00 │ 12,113.00 │ 12,113.00 │ 12,112.75 │
│ Output Token Throughput (tokens/sec) │     52.05 │      N/A │       N/A │       N/A │       N/A │       N/A │
│         Request Throughput (per sec) │      0.15 │      N/A │       N/A │       N/A │       N/A │       N/A │
│                Request Count (count) │     10.00 │      N/A │       N/A │       N/A │       N/A │       N/A │
└──────────────────────────────────────┴───────────┴──────────┴───────────┴───────────┴───────────┴───────────┘
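The aggregate rows in the table are internally consistent and can be cross-checked by hand: dividing the request count by the request throughput gives the approximate benchmark duration, and the total output tokens over that duration should roughly reproduce the reported output token throughput. A small sanity check using the numbers above:

```python
# Figures taken from the example table above.
request_count = 10
request_throughput = 0.15   # requests/sec
avg_output_len = 347.50     # output tokens per request (avg)

# Approximate benchmark duration implied by the throughput row.
duration_s = request_count / request_throughput  # ~66.7 s

# Implied output token throughput; should land near the reported 52.05.
output_tok_per_s = request_count * avg_output_len / duration_s
print(round(output_tok_per_s, 1))  # ~52.1 tokens/sec
```

Small discrepancies versus the reported value are expected, since the table rounds the throughput figures it is derived from.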