# Profile Embeddings Models with GenAI-Perf
GenAI-Perf allows you to profile embedding models running on an OpenAI Embeddings API-compatible server.
## Create a Sample Embeddings Input File
To create a sample embeddings input file, use the following command:
```bash
echo '{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}' > embeddings.jsonl
```
This will generate a file named embeddings.jsonl with the following content:
```jsonl
{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}
```
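The same file can also be written programmatically. A minimal Python sketch (the `embeddings.jsonl` filename and `text` key match the example above):

```python
import json

# The sample prompts from the tutorial.
prompts = [
    "What was the first car ever driven?",
    "Who served as the 5th President of the United States of America?",
    "Is the Sydney Opera House located in Australia?",
    "In what state did they film Shrek 2?",
]

# Write one JSON object per line -- the JSONL layout GenAI-Perf reads.
with open("embeddings.jsonl", "w") as f:
    for prompt in prompts:
        f.write(json.dumps({"text": prompt}) + "\n")
```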
## Start an OpenAI Embeddings-Compatible Server
To start an OpenAI embeddings-compatible server, run the following command:
```bash
docker run -it --net=host --rm --gpus=all vllm/vllm-openai:latest --model intfloat/e5-mistral-7b-instruct --dtype float16 --max-model-len 1024
```
## Run GenAI-Perf
To profile embeddings models using GenAI-Perf, use the following command:
```bash
genai-perf profile \
  -m intfloat/e5-mistral-7b-instruct \
  --service-kind openai \
  --endpoint-type embeddings \
  --batch-size 2 \
  --input-file embeddings.jsonl
```
- `-m intfloat/e5-mistral-7b-instruct` specifies the model to profile (`intfloat/e5-mistral-7b-instruct`)
- `--service-kind openai` specifies that the server type is OpenAI-API compatible
- `--endpoint-type embeddings` specifies that the sent requests should be formatted to follow the embeddings API
- `--batch-size 2` specifies that each request will contain the inputs for 2 individual inferences, making a batch size of 2
- `--input-file embeddings.jsonl` specifies the input data to be used for inferencing
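With these options, each request GenAI-Perf sends resembles an OpenAI embeddings request whose `input` field batches two prompts from the input file. A hypothetical sketch of one such request body (illustrative only; the exact payload GenAI-Perf constructs may differ):

```python
import json

# Two prompts from embeddings.jsonl go into one request,
# since --batch-size 2 means two inputs per request.
request_body = {
    "model": "intfloat/e5-mistral-7b-instruct",
    "input": [
        "What was the first car ever driven?",
        "Who served as the 5th President of the United States of America?",
    ],
}

print(json.dumps(request_body, indent=2))
```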
This will use default values for the optional arguments. You can also pass in additional arguments with the `--extra-inputs` flag.
For example, you could use this command:
```bash
genai-perf profile \
  -m intfloat/e5-mistral-7b-instruct \
  --service-kind openai \
  --endpoint-type embeddings \
  --extra-inputs user:sample_user
```
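Each `--extra-inputs key:value` pair is added to the request payload as an extra field. A minimal illustration of the effect (hypothetical payload shape, not GenAI-Perf's actual internals):

```python
# Base embeddings request body (illustrative).
request_body = {
    "model": "intfloat/e5-mistral-7b-instruct",
    "input": ["What was the first car ever driven?"],
}

# --extra-inputs user:sample_user adds a top-level "user" field
# alongside the standard fields.
extra_inputs = {"user": "sample_user"}
request_body.update(extra_inputs)

print(request_body)
```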
Example output:
```
                              Embeddings Metrics
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ Statistic            ┃   avg ┃   min ┃    max ┃   p99 ┃   p90 ┃   p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ Request latency (ms) │ 42.21 │ 28.18 │ 318.61 │ 56.50 │ 49.21 │ 43.07 │
└──────────────────────┴───────┴───────┴────────┴───────┴───────┴───────┘
Request throughput (per sec): 23.63
```