Profile Multi-Modal Language Models with GenAI-Perf#
GenAI-Perf allows you to profile Multi-Modal Language Models (MMLMs) running on an OpenAI Chat Completions API-compatible server by sending multi-modal content to the server (for example, see OpenAI vision and audio inputs).
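For reference, such a server accepts OpenAI-style chat requests whose message content mixes text with image (and, where supported, audio) parts. A minimal sketch of a text-plus-image request is shown below; the server address, base64 payload, and max_tokens value are placeholders, and GenAI-Perf constructs and sends requests of roughly this shape for you:
# Illustrative request shape only; the base64 image data is a placeholder.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "llava-hf/llava-v1.6-mistral-7b-hf",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What is in this image?"},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_IMAGE>"}}
    ]
  }],
  "max_tokens": 64
}'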
Quickstart 1. Run GenAI-Perf on Vision Language Model (VLM)#
Start an OpenAI API-compatible server with a VLM using the following command:
docker run --runtime nvidia --gpus all \
-p 8000:8000 --ipc=host \
vllm/vllm-openai:latest \
--model llava-hf/llava-v1.6-mistral-7b-hf --dtype float16
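Once the container is up, you can optionally confirm that the server is ready before benchmarking (assuming the default port mapping above):
# Lists the models served by the OpenAI-compatible endpoint.
curl http://localhost:8000/v1/models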
Use GenAI-Perf to generate and send text and image request data to the server:
genai-perf profile \
-m llava-hf/llava-v1.6-mistral-7b-hf \
--endpoint-type multimodal \
--image-width-mean 50 \
--image-height-mean 50 \
--synthetic-input-tokens-mean 10 \
--output-tokens-mean 10 \
--streaming
The console output will contain the following result table:
NVIDIA GenAI-Perf | Multi-Modal Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Time To First Token (ms) │ 205.38 │ 172.31 │ 1,020.58 │ 246.02 │ 207.33 │ 204.96 │
│ Time To Second Token (ms) │ 18.58 │ 18.03 │ 19.22 │ 19.13 │ 18.72 │ 18.65 │
│ Request Latency (ms) │ 369.06 │ 336.56 │ 1,183.65 │ 408.84 │ 370.97 │ 368.50 │
│ Inter Token Latency (ms) │ 16.41 │ 16.27 │ 18.32 │ 18.16 │ 16.47 │ 16.41 │
│ Output Sequence Length (tokens) │ 10.98 │ 10.00 │ 11.00 │ 11.00 │ 11.00 │ 11.00 │
│ Input Sequence Length (tokens) │ 10.06 │ 10.00 │ 11.00 │ 11.00 │ 10.00 │ 10.00 │
│ Output Token Throughput (per sec) │ 29.74 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request Throughput (per sec) │ 2.71 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request Count (count) │ 97.00 │ N/A │ N/A │ N/A │ N/A │ N/A │
└───────────────────────────────────┴────────┴────────┴──────────┴────────┴────────┴────────┘
Quickstart 2. Run GenAI-Perf on Multi-Modal Language Model (MMLM)#
In this example, we will measure the performance of the Multi-Modal Language Model (MMLM)
Phi-4-multimodal-instruct
from Microsoft, hosted on the NVIDIA API, which is OpenAI API-compatible.
First, visit https://build.nvidia.com/microsoft/phi-4-multimodal-instruct and create an API key.
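Optionally, you can sanity-check the key with a minimal text-only request. This is only a sketch: it assumes the key is exported as NVIDIA_API_KEY (as in the next step) and that the hosted endpoint follows the usual OpenAI-compatible /v1/chat/completions path:
# Minimal text-only request to verify the API key works against the hosted endpoint.
curl https://integrate.api.nvidia.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${NVIDIA_API_KEY}" \
-d '{
  "model": "microsoft/phi-4-multimodal-instruct",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 8
}'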
Run GenAI-Perf to generate and send all three modalities to the server:
export NVIDIA_API_KEY=your_api_key
genai-perf profile \
-m microsoft/phi-4-multimodal-instruct \
-u https://integrate.api.nvidia.com \
--endpoint-type multimodal \
--synthetic-input-tokens-mean 10 \
--output-tokens-mean 10 \
--image-width-mean 50 \
--image-height-mean 50 \
--audio-length-mean 3 \
--audio-depths 16 32 \
--audio-sample-rates 16 44.1 48 \
--audio-num-channels 2 \
--audio-format wav \
--streaming \
--header "Authorization: Bearer ${NVIDIA_API_KEY}"
The console output will contain the following result table:
NVIDIA GenAI-Perf | Multi-Modal Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Time To First Token (ms) │ 284.94 │ 228.38 │ 562.57 │ 371.15 │ 325.76 │ 295.26 │
│ Time To Second Token (ms) │ 10.02 │ 8.79 │ 10.94 │ 10.89 │ 10.59 │ 10.32 │
│ Request Latency (ms) │ 346.80 │ 251.96 │ 662.96 │ 463.03 │ 404.79 │ 382.18 │
│ Inter Token Latency (ms) │ 9.56 │ 0.00 │ 15.76 │ 14.22 │ 11.05 │ 10.81 │
│ Output Sequence Length (tokens) │ 7.43 │ 1.00 │ 14.00 │ 14.00 │ 12.00 │ 10.00 │
│ Input Sequence Length (tokens) │ 10.11 │ 10.00 │ 11.00 │ 11.00 │ 11.00 │ 10.00 │
│ Output Token Throughput (per sec) │ 20.96 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request Throughput (per sec) │ 2.82 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request Count (count) │ 101.00 │ N/A │ N/A │ N/A │ N/A │ N/A │
└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
Generating Multi-Modal Requests with GenAI-Perf#
Currently, you can send multi-modal content with GenAI-Perf using the following two approaches:
1. The synthetic data generation approach, where GenAI-Perf generates the multi-modal data for you.
2. The Bring Your Own Data (BYOD) approach, where you provide GenAI-Perf with the data to send.
Approach 1: Synthetic Multi-Modal Data Generation#
GenAI-Perf can generate synthetic data for three modalities (text, image, and audio) using the modality-specific parameters provided by the user through the CLI. Check out CLI Input Options for a complete list of parameters that you can tweak for each modality. The example below sets the audio, image, and text parameters in turn:
genai-perf profile \
-m <multimodal_model> \
--endpoint-type multimodal \
--audio-length-mean 10 \
--audio-length-stddev 2 \
--audio-depths 16 32 \
--audio-sample-rates 16 44.1 48 \
--audio-num-channels 1 \
--audio-format wav \
--image-width-mean 512 \
--image-width-stddev 30 \
--image-height-mean 512 \
--image-height-stddev 30 \
--image-format png \
--synthetic-input-tokens-mean 100 \
--synthetic-input-tokens-stddev 0 \
--streaming
[!Note] Under the hood, GenAI-Perf generates synthetic images from a few source images under the
inputs/source_images
directory. If you would like to add, remove, or edit the source images, you can do so directly in that directory. GenAI-Perf will automatically pick up the images in the directory when generating synthetic images.
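For example, you could add your own seed image as follows. This is a sketch: the exact location of inputs/source_images depends on whether you run GenAI-Perf from a source checkout or an installed package, so the path lookup below is an assumption:
# Resolve the installed genai_perf package directory (assumed layout), then add a custom seed image.
GENAI_PERF_DIR=$(python3 -c "import genai_perf, os; print(os.path.dirname(genai_perf.__file__))")
ls "${GENAI_PERF_DIR}/inputs/source_images"
cp my_photo.png "${GENAI_PERF_DIR}/inputs/source_images/"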
Approach 2: Bring Your Own Data (BYOD)#
[!Note] This approach only supports text and image inputs at the moment.
Instead of letting GenAI-Perf create the synthetic data,
you can provide GenAI-Perf with your own data using the
--input-file
CLI option.
The file needs to be in JSONL format and should contain both the prompt and
the filepath to the image to send.
For instance, an input file might look like the following:
// input.jsonl
{"text": "What is in this image?", "image": "path/to/image1.png"}
{"text": "What is the color of the dog?", "image": "path/to/image2.jpeg"}
{"text": "Describe the scene in the picture.", "image": "path/to/image3.png"}
...
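If your images are already on disk, a small shell loop can generate such a file; the images/ directory and the fixed prompt below are placeholders:
# Build input.jsonl from every PNG under images/, pairing each image with the same prompt.
for img in images/*.png; do
  printf '{"text": "What is in this image?", "image": "%s"}\n' "$img"
done > input.jsonl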
After you create the file, you can run GenAI-Perf using the following command:
genai-perf profile \
-m <multimodal_model> \
--endpoint-type multimodal \
--input-file input.jsonl \
--streaming