Profile Embeddings Models with GenAI-Perf

GenAI-Perf allows you to profile embedding models running on an OpenAI Embeddings API-compatible server.

Create a Sample Embeddings Input File

To create a sample embeddings input file, use the following command:

echo '{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}' > embeddings.jsonl

This will generate a file named embeddings.jsonl with the following content:

{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}
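GenAI-Perf reads this file as JSON Lines, so each line must be a standalone JSON object with a "text" field. A quick sanity check (a self-contained sketch that recreates the file first, assuming python3 is available) is:

```shell
# Recreate the input file so this check is self-contained, then verify that
# every line parses as a standalone JSON object with a "text" field.
echo '{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}' > embeddings.jsonl

python3 - <<'EOF'
import json

with open("embeddings.jsonl") as f:
    for n, line in enumerate(f, start=1):
        obj = json.loads(line)  # raises ValueError on malformed JSON
        assert "text" in obj, f"line {n} has no 'text' field"

print(f"embeddings.jsonl: {n} valid entries")
EOF
# prints: embeddings.jsonl: 4 valid entries
```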

Start an OpenAI Embeddings-Compatible Server

To start an OpenAI embeddings-compatible server, run the following command:

docker run -it --net=host --rm --gpus=all vllm/vllm-openai:latest --model intfloat/e5-mistral-7b-instruct --dtype float16 --max-model-len 1024
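Before profiling, you can confirm the server is ready by sending it a single request. This sketch assumes vLLM's default port of 8000 and the standard OpenAI embeddings route:

```shell
# Send one embeddings request to the running server; a JSON response containing
# an "embedding" array confirms the endpoint is ready for profiling.
curl -s http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "intfloat/e5-mistral-7b-instruct", "input": "Hello, world"}'
```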

Run GenAI-Perf

To profile embedding models using GenAI-Perf, run the following command:

genai-perf profile \
    -m intfloat/e5-mistral-7b-instruct \
    --service-kind openai \
    --endpoint-type embeddings \
    --batch-size 2 \
    --input-file embeddings.jsonl
  • -m intfloat/e5-mistral-7b-instruct specifies the model to profile

  • --service-kind openai specifies that the server is OpenAI-API compatible

  • --endpoint-type embeddings formats each request to follow the embeddings API

  • --batch-size 2 packs the inputs for 2 individual inferences into each request, giving a batch size of 2

  • --input-file embeddings.jsonl specifies the input data used for inferencing
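With --batch-size 2, two inputs from embeddings.jsonl are packed into each request. As a sketch, such a request body (its shape follows the OpenAI embeddings API; the exact payload GenAI-Perf builds may differ in details) looks like this:

```shell
# Write an example batched request body and confirm it carries two inputs.
cat > batched_request.json <<'EOF'
{
  "model": "intfloat/e5-mistral-7b-instruct",
  "input": [
    "What was the first car ever driven?",
    "Who served as the 5th President of the United States of America?"
  ]
}
EOF
python3 -c 'import json; body = json.load(open("batched_request.json")); print(len(body["input"]), "inputs per request")'
# prints: 2 inputs per request
```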

This command uses default values for all optional arguments. You can also attach additional inputs to every request with the --extra-inputs flag. For example, you could use this command:

genai-perf profile \
    -m intfloat/e5-mistral-7b-instruct \
    --service-kind openai \
    --endpoint-type embeddings \
    --extra-inputs user:sample_user
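Each key:value pair given to --extra-inputs becomes a field of every request body. With user:sample_user, each payload would carry an extra user field (user happens to be an optional field in the OpenAI embeddings API); a sketch:

```shell
# Write an example request body carrying the extra "user" field, then read it back.
cat > request_with_user.json <<'EOF'
{
  "model": "intfloat/e5-mistral-7b-instruct",
  "input": "What was the first car ever driven?",
  "user": "sample_user"
}
EOF
python3 -c 'import json; print(json.load(open("request_with_user.json"))["user"])'
# prints: sample_user
```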

Example output:

                          Embeddings Metrics
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ Statistic            ┃ avg   ┃ min   ┃ max    ┃ p99   ┃ p90   ┃ p75   ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ Request latency (ms) │ 42.21 │ 28.18 │ 318.61 │ 56.50 │ 49.21 │ 43.07 │
└──────────────────────┴───────┴───────┴────────┴───────┴───────┴───────┘
Request throughput (per sec): 23.63