# Profile Vision-Language Models with GenAI-Perf
GenAI-Perf allows you to profile Vision-Language Models (VLMs) running on an OpenAI Chat Completions API-compatible server by sending multi-modal content to the server. Currently, you can send multi-modal content with GenAI-Perf using the following two approaches:

1. The synthetic data generation approach, where GenAI-Perf generates the multi-modal data for you.
2. The Bring Your Own Data (BYOD) approach, where you provide GenAI-Perf with the data to send.
Before diving into the two approaches, start an OpenAI-API-compatible server with a VLM using the following command:
```bash
docker run --runtime nvidia --gpus all \
    -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest \
    --model llava-hf/llava-v1.6-mistral-7b-hf --dtype float16
```
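Before profiling, you can optionally confirm that the server is reachable. A minimal sketch, assuming the default port mapping from the command above:

```bash
# List the models served by the OpenAI-compatible endpoint; the VLM above
# should appear in the response once the server has finished loading.
curl http://localhost:8000/v1/models
```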
## Approach 1: Synthetic Multi-Modal Data Generation
GenAI-Perf can generate synthetic multi-modal data such as text or images using the parameters provided by the user through the CLI.
```bash
genai-perf profile \
    -m llava-hf/llava-v1.6-mistral-7b-hf \
    --service-kind openai \
    --endpoint-type vision \
    --image-width-mean 512 \
    --image-width-stddev 30 \
    --image-height-mean 512 \
    --image-height-stddev 30 \
    --image-format png \
    --synthetic-input-tokens-mean 100 \
    --synthetic-input-tokens-stddev 0 \
    --streaming
```
> [!Note]
> Under the hood, GenAI-Perf generates synthetic images using a few source images under the `inputs/source_images` directory. If you would like to add, remove, or edit the source images, you can do so by directly editing the images under that directory. GenAI-Perf will pick up the images in the directory automatically when generating the synthetic images.
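For reference, requests sent with the `vision` endpoint type follow the OpenAI Chat Completions multi-modal format, pairing a text prompt with a base64-encoded image. The sketch below only illustrates that general payload shape; the prompt and the truncated image data are placeholders, not GenAI-Perf's exact output:

```bash
# Hedged sketch of an OpenAI-style multi-modal request; the prompt text and
# the truncated base64 image data are placeholders for illustration.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "llava-hf/llava-v1.6-mistral-7b-hf",
          "messages": [{
            "role": "user",
            "content": [
              {"type": "text", "text": "What is in this image?"},
              {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KG..."}}
            ]
          }],
          "stream": true
        }'
```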
## Approach 2: Bring Your Own Data (BYOD)
Instead of letting GenAI-Perf create the synthetic data, you can provide GenAI-Perf with your own data using the `--input-file` CLI option. The file needs to be in JSONL format and should contain both the prompt and the filepath of the image to send. For instance, an input file could look like the following:
```jsonl
// input.jsonl
{"text_input": "What is in this image?", "image": "path/to/image1.png"}
{"text_input": "What is the color of the dog?", "image": "path/to/image2.jpeg"}
{"text_input": "Describe the scene in the picture.", "image": "path/to/image3.png"}
...
```
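If your images already live in a directory, a small shell loop can assemble the file for you. A minimal sketch, where the directory path and the prompt text are placeholders for illustration:

```bash
# Build input.jsonl from every PNG in a (hypothetical) image directory,
# pairing each image with the same placeholder prompt.
for img in path/to/images/*.png; do
    printf '{"text_input": "What is in this image?", "image": "%s"}\n' "$img"
done > input.jsonl
```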
After you create the file, you can run GenAI-Perf using the following command:
```bash
genai-perf profile \
    -m llava-hf/llava-v1.6-mistral-7b-hf \
    --service-kind openai \
    --endpoint-type vision \
    --input-file input.jsonl \
    --streaming
```
Running GenAI-Perf with either approach produces output similar to the following:
```
                                          LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃                Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ Time to first token (ms) │   321.05 │   291.30 │   537.07 │   497.88 │   318.46 │   317.35 │
│ Inter token latency (ms) │    12.28 │    11.44 │    12.88 │    12.87 │    12.81 │    12.53 │
│     Request latency (ms) │ 1,866.23 │ 1,044.70 │ 2,832.22 │ 2,779.63 │ 2,534.64 │ 2,054.03 │
│   Output sequence length │   126.68 │    59.00 │   204.00 │   200.58 │   177.80 │   147.50 │
│    Input sequence length │   100.00 │   100.00 │   100.00 │   100.00 │   100.00 │   100.00 │
└──────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 67.40
Request throughput (per sec): 0.53
```