Profile Multi-Modal Language Models with GenAI-Perf#

GenAI-Perf allows you to profile Multi-Modal Language Models (MMLMs) running on an OpenAI Chat Completions API-compatible server by sending multi-modal content to the server (for instance, see OpenAI vision and audio inputs).
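Under this API, a multi-modal request carries the image (or audio) alongside the text as content parts of a single user message. The sketch below shows roughly what such a request looks like, assuming an OpenAI-compatible server at localhost:8000 and using the image-URL form from the OpenAI vision input format; the model name, prompt, and image URL are placeholders.

# Illustrative multi-modal chat request; model name, prompt, and image URL are placeholders.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "<multimodal_model>",
          "messages": [{
            "role": "user",
            "content": [
              {"type": "text", "text": "What is in this image?"},
              {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}}
            ]
          }],
          "max_tokens": 64
        }'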

Quickstart 1. Run GenAI-Perf on a Vision Language Model (VLM)#

Start an OpenAI API-compatible server with a VLM model using the following command:

docker run --runtime nvidia --gpus all \
    -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest \
    --model llava-hf/llava-v1.6-mistral-7b-hf --dtype float16
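
Before profiling, you can optionally confirm that the server is up and the model is loaded (this assumes the default port 8000 mapped above):

# List the models served by the OpenAI-compatible endpoint.
curl http://localhost:8000/v1/models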

Use GenAI-Perf to generate and send text and image request data to the server:

genai-perf profile \
    -m llava-hf/llava-v1.6-mistral-7b-hf \
    --endpoint-type multimodal \
    --image-width-mean 50 \
    --image-height-mean 50 \
    --synthetic-input-tokens-mean 10 \
    --output-tokens-mean 10 \
    --streaming

The console output will include the following result table:

                           NVIDIA GenAI-Perf | Multi-Modal Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                         Statistic     avg     min       max     p99     p90     p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│          Time To First Token (ms)  205.38  172.31  1,020.58  246.02  207.33  204.96 │
│         Time To Second Token (ms)   18.58   18.03     19.22   19.13   18.72   18.65 │
│              Request Latency (ms)  369.06  336.56  1,183.65  408.84  370.97  368.50 │
│          Inter Token Latency (ms)   16.41   16.27     18.32   18.16   16.47   16.41 │
│   Output Sequence Length (tokens)   10.98   10.00     11.00   11.00   11.00   11.00 │
│    Input Sequence Length (tokens)   10.06   10.00     11.00   11.00   10.00   10.00 │
│ Output Token Throughput (per sec)   29.74     N/A       N/A     N/A     N/A     N/A │
│      Request Throughput (per sec)    2.71     N/A       N/A     N/A     N/A     N/A │
│             Request Count (count)   97.00     N/A       N/A     N/A     N/A     N/A │
└───────────────────────────────────┴────────┴────────┴──────────┴────────┴────────┴────────┘

Quickstart 2. Run GenAI-Perf on a Multi-Modal Language Model (MMLM)#

In this example, we will measure the performance of the recent Multi-Modal Language Model (MMLM) Phi-4-multimodal-instruct from Microsoft, hosted on the NVIDIA API, which is OpenAI API compatible. First, visit https://build.nvidia.com/microsoft/phi-4-multimodal-instruct and create an API key.
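Once the key is created, you can optionally verify it with a plain request before profiling; this assumes the service exposes the standard OpenAI-compatible model listing route, and your_api_key is a placeholder for the key you created:

# Quick smoke test of the API key against the OpenAI-compatible endpoint.
curl https://integrate.api.nvidia.com/v1/models \
    -H "Authorization: Bearer your_api_key"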

Run GenAI-Perf to generate and send all three modalities (text, image, and audio) to the server:

export NVIDIA_API_KEY=your_api_key

genai-perf profile \
    -m microsoft/phi-4-multimodal-instruct \
    -u https://integrate.api.nvidia.com \
    --endpoint-type multimodal \
    --synthetic-input-tokens-mean 10 \
    --output-tokens-mean 10 \
    --image-width-mean 50 \
    --image-height-mean 50 \
    --audio-length-mean 3 \
    --audio-depths 16 32 \
    --audio-sample-rates 16 44.1 48 \
    --audio-num-channels 2 \
    --audio-format wav \
    --streaming \
    --header "Authorization: Bearer ${NVIDIA_API_KEY}"

The console output will include the following result table:

                          NVIDIA GenAI-Perf | Multi-Modal Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                         Statistic     avg     min     max     p99     p90     p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│          Time To First Token (ms)  284.94  228.38  562.57  371.15  325.76  295.26 │
│         Time To Second Token (ms)   10.02    8.79   10.94   10.89   10.59   10.32 │
│              Request Latency (ms)  346.80  251.96  662.96  463.03  404.79  382.18 │
│          Inter Token Latency (ms)    9.56    0.00   15.76   14.22   11.05   10.81 │
│   Output Sequence Length (tokens)    7.43    1.00   14.00   14.00   12.00   10.00 │
│    Input Sequence Length (tokens)   10.11   10.00   11.00   11.00   11.00   10.00 │
│ Output Token Throughput (per sec)   20.96     N/A     N/A     N/A     N/A     N/A │
│      Request Throughput (per sec)    2.82     N/A     N/A     N/A     N/A     N/A │
│             Request Count (count)  101.00     N/A     N/A     N/A     N/A     N/A │
└───────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘

Generating Multi-Modal Requests with GenAI-Perf#

Currently, you can send multi-modal content with GenAI-Perf using the following two approaches:

  1. The synthetic data generation approach, where GenAI-Perf generates the multi-modal data for you.

  2. The Bring Your Own Data (BYOD) approach, where you provide GenAI-Perf with the data to send.

Approach 1: Synthetic Multi-Modal Data Generation#

GenAI-Perf can generate synthetic data for three modalities (text, image, and audio) using the modality-specific parameters provided by the user through the CLI. Check out the CLI Input Options for a complete list of parameters that you can tweak for each modality.

# Audio parameters use the --audio-* flags, image parameters the --image-* flags,
# and text parameters the --synthetic-input-tokens-* flags.
genai-perf profile \
    -m <multimodal_model> \
    --endpoint-type multimodal \
    --audio-length-mean 10 \
    --audio-length-stddev 2 \
    --audio-depths 16 32 \
    --audio-sample-rates 16 44.1 48 \
    --audio-num-channels 1 \
    --audio-format wav \
    --image-width-mean 512 \
    --image-width-stddev 30 \
    --image-height-mean 512 \
    --image-height-stddev 30 \
    --image-format png \
    --synthetic-input-tokens-mean 100 \
    --synthetic-input-tokens-stddev 0 \
    --streaming

[!Note] Under the hood, GenAI-Perf generates synthetic images from a small set of source images under the inputs/source_images directory. If you would like to add, remove, or edit the source images, you can modify the files in that directory directly; GenAI-Perf will pick them up automatically when generating synthetic images.
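For example, you could add your own source image along these lines. The package path below is an assumption; adjust it to wherever the inputs/source_images directory lives in your installation, and my_photo.png is a placeholder:

# Locate the installed genai_perf package (assumed layout) and add a custom source image.
GENAI_PERF_DIR=$(python3 -c "import genai_perf, pathlib; print(pathlib.Path(genai_perf.__file__).parent)")
ls "${GENAI_PERF_DIR}/inputs/source_images"
cp my_photo.png "${GENAI_PERF_DIR}/inputs/source_images/"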

Approach 2: Bring Your Own Data (BYOD)#

[!Note] This approach only supports text and image inputs at the moment.

Instead of letting GenAI-Perf create synthetic data, you can provide your own data using the --input-file CLI option. The input file must be in JSONL format, where each line can define both text and image data. The image field can be either a path to a local image file or a URL. GenAI-Perf converts local image file paths to base64-encoded strings and leaves URLs as-is, consistent with the OpenAI Vision Guide.

A sample input file input.jsonl would look like:

// texts and local image files
{"text": "What is in this image?", "image": "path/to/image1.png"}
{"text": "What is the color of the dog?", "image": "path/to/image2.jpeg"}

// Or, text and URL path
{"text": "What is in this image?", "image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}
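
As a rough illustration of the conversion described above, a local image path such as path/to/image1.png ends up embedded in the request as a base64 data URL, along these lines (uses GNU coreutils base64; on macOS, omit -w0):

# Base64-encode the local image and preview the resulting data URL.
IMAGE_B64=$(base64 -w0 path/to/image1.png)
echo "data:image/png;base64,${IMAGE_B64}" | head -c 80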

After you create the file, you can run GenAI-Perf using the following command:

genai-perf profile \
    -m <multimodal_model> \
    --endpoint-type multimodal \
    --input-file input.jsonl \
    --streaming