# FIL Backend Benchmarks
**WARNING:** The models used by this benchmarking script have temporarily been removed. They will be restored at a later date using a storage solution other than Git LFS.
In order to facilitate performance analysis during development of the FIL
backend, the `qa/run_benchmarks.sh` script can run a simple set of benchmarks
against standard models. To run this script, first install the benchmarking
conda environment:

```bash
conda env create -f conda/environments/triton_benchmark.yml
```
Next, start the Triton server with the provided benchmark models. Note that you will need Git LFS to check out these models. You may start the server by running the following command from the repo root:
```bash
docker run \
  --rm \
  --gpus=all \
  --name benchmark_server \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -v $PWD/qa/benchmark_repo:/models \
  triton_fil \
  tritonserver \
  --model-repository=/models
```
Here, `triton_fil` is used as the Docker image, since this is the standard tag
used during development, but you may run the benchmarks against any Triton
image that contains the FIL backend.
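Before running the benchmarks, you may wish to confirm that the server has started and loaded all models. One quick check, assuming the default HTTP port mapping shown above, is Triton's standard health endpoint:

```bash
# Prints 200 once the server is up and all models have loaded successfully
curl -s -o /dev/null -w '%{http_code}\n' localhost:8000/v2/health/ready
```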
In a separate terminal, you may now invoke the benchmark script itself as follows:
```bash
conda activate triton_benchmark
./qa/run_benchmarks.sh
```
The benchmark script will write its output to the `qa/benchmark_output`
directory. Each model tested will have its own directory with `.csv` files
representing results for various batch sizes. The `summary` directory will also
contain a `.csv` collating the data from each run as well as a `.png` showing
throughput vs. p99 latency for all tested models on a single graph.
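As a quick sanity check after a run, you can list the generated files and skim the collated summary. The exact file names below are assumptions and may differ on your system:

```bash
# List per-model result directories and the summary artifacts
ls qa/benchmark_output
ls qa/benchmark_output/summary

# Pretty-print a collated summary CSV (file name is a placeholder)
column -s, -t < qa/benchmark_output/summary/*.csv | less -S
```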
The benchmark script can be configured using a few different environment variables, summarized below:
- `MODELS`: A space-separated list of the models to benchmark (defaults to the standard benchmarking models)
- `BATCHES`: A space-separated list of the batch sizes to use during benchmarking (defaults to `'1 16 128 1024'`)
- `MAX_LATENCY`: The maximum latency (in ms) to explore during benchmarking (defaults to 5 ms)
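For example, to benchmark a single model at two batch sizes with a relaxed latency budget, you might run something like the following. The model name here is purely illustrative; substitute the name of a model in your benchmark repository:

```bash
# "example_model" is a placeholder, not an actual model shipped with the repo
MODELS="example_model" BATCHES="1 128" MAX_LATENCY=10 ./qa/run_benchmarks.sh
```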