Quick Start#
The steps below walk you through getting started with Perf Analyzer, from starting Triton Inference Server to interpreting the measurement output.
Step 1: Start Triton Container#
export RELEASE=<yy.mm> # e.g. to use the release from the end of February of 2023, do `export RELEASE=23.02`
docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3
docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3
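If the machine you are on does not have a GPU, the simple example model used in this guide also runs on CPU, so as a minor variation you can start the same container without GPU access:

# CPU-only variation: omit the --gpus flag if no GPU is available
docker run --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3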
Step 2: Download simple Model#
# inside triton container
git clone --depth 1 https://github.com/triton-inference-server/server
mkdir model_repository ; cp -r server/docs/examples/model_repository/simple model_repository
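Before starting the server, it can be worth sanity-checking the copy. A Triton model directory contains a config.pbtxt plus at least one numeric version subdirectory; the exact files inside the version directory depend on the backend the example uses:

# inside triton container
# verify the repository layout (expect config.pbtxt and a version directory such as 1/)
ls -R model_repository/simple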
Step 3: Start Triton Server#
# inside triton container
tritonserver --model-repository $(pwd)/model_repository &> server.log &
# confirm server is ready, look for 'HTTP/1.1 200 OK'
curl -v localhost:8000/v2/health/ready
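# optionally confirm the 'simple' model itself is loaded and ready
# (standard Triton HTTP/REST endpoints; adjust the model name if you used a different one)
curl -v localhost:8000/v2/models/simple/ready
curl localhost:8000/v2/models/simple/config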
# detach (CTRL-p CTRL-q)
Step 4: Start Triton SDK Container#
docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
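Perf Analyzer itself does not need GPU access unless you plan to place inputs in CUDA shared memory, so as a minor variation you can also start the SDK container without the --gpus flag:

# the client container can run without GPU access
docker run --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk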
Step 5: Run Perf Analyzer#
# inside sdk container
perf_analyzer -m simple
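By default Perf Analyzer sends synchronous requests to the HTTP endpoint at localhost:8000. Two common variations are measuring the gRPC endpoint and profiling a server running on another machine; both are sketched below (the server hostname is a placeholder you would replace):

# inside sdk container
# measure the gRPC endpoint with asynchronous requests
perf_analyzer -m simple -i grpc --async
# profile a Triton server running on another host (replace <server-host>)
perf_analyzer -m simple -u <server-host>:8000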
Step 6: Observe and Analyze Output#
$ perf_analyzer -m simple
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 25348
    Throughput: 1407.84 infer/sec
    Avg latency: 708 usec (standard deviation 663 usec)
    p50 latency: 690 usec
    p90 latency: 881 usec
    p95 latency: 926 usec
    p99 latency: 1031 usec
    Avg HTTP time: 700 usec (send/recv 102 usec + response wait 598 usec)
  Server:
    Inference count: 25348
    Execution count: 25348
    Successful request count: 25348
    Avg request latency: 382 usec (overhead 41 usec + queue 41 usec + compute input 26 usec + compute infer 257 usec + compute output 16 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 1407.84 infer/sec, latency 708 usec
We can see from the output that the model sustained approximately 1407.84 inferences per second, with an average latency of 708 microseconds per inference request. A request concurrency of 1 means that Perf Analyzer tried to keep exactly one request outstanding at all times.
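A single concurrency value rarely tells the whole story. A common next step, assuming the defaults above, is to sweep the request concurrency and save the per-concurrency results to a file, for example:

# inside sdk container
# sweep request concurrency from 1 to 4, stabilize on p95 latency,
# and write the results to a CSV file (file name is arbitrary)
perf_analyzer -m simple --concurrency-range 1:4 --percentile=95 -f perf_results.csv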