# Performance
## E2E and Component Latency
The following table shows the end-to-end (E2E) latency breakdown for the reference workflow:
Configuration Details:

- Speculative speech processing: on
- TTS: ElevenLabs
- Platform: g5.12xlarge
- GPUs: 4x A10
- Number of streams: 1
- LLM: llama-3.1-8b-instruct, self-hosted on an L40
| KPI | Unit | Average | P90 | P75 | P50 |
|---|---|---|---|---|---|
| E2E latency | ms | 2124 | 2408 | 2184 | 2058 |
| ASR latency | ms | 323 | 400 | 372 | 303 |
| LLM latency | ms | 789 | 940 | 809 | 756 |
| TTS latency | ms | 200 | 219 | 200 | 195 |
| Component latency | ms | 1312 | 1483 | 1386 | 1260 |
| Other latency | ms | 812 | 934 | 845 | 796 |
## Resource Usage per GPU for Different Concurrent Streams
Configuration Details:

- Avatar: Aki at 1280x720 resolution
- Platform: g5.12xlarge
- GPUs: 4x A10
- VRAM per A10 GPU: 22.5 GB
| KPI | Unit | GPU Index | 1 Stream | 3 Streams | 6 Streams |
|---|---|---|---|---|---|
| Average GPU VRAM usage | GiB | 0 | 7.0 | 7.0 | 7.0 |
| | | 1 | 4.2 | 4.1 | 8.4 |
| | | 2 | 0 | 4.1 | 8.3 |
| | | 3 | 0.4 | 4.5 | 8.8 |
| Average GPU utilization | % | 0 | 20.4 | 45.4 | 64.6 |
| | | 1 | 54.6 | 46.8 | 91.0 |
| | | 2 | 0 | 44.4 | 93.9 |
| | | 3 | 3.8 | 48.4 | 97.0 |
| Average CPU utilization | Number of logical cores used | n/a | 4.2 | 11.0 | 21.9 |
| Average RAM usage | GiB | n/a | 7.3 | 12.2 | 19.5 |
| Average Renderer FPS | FPS | n/a | 30 | 30 | 29.8 |
| Average WebRTC FPS | FPS | n/a | 30 | 30 | 29.8 |
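Per-GPU figures like the ones above are typically produced by sampling each device at a fixed interval over the run and averaging the readings per GPU index. A minimal sketch of that aggregation, under the assumption that samples are gathered externally (for example, by parsing periodic `nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv,noheader,nounits` output; the sample values below are illustrative):

```python
from collections import defaultdict

def average_per_gpu(samples):
    """Average (vram_gib, util_pct) readings per GPU index.

    `samples` is an iterable of (gpu_index, vram_gib, util_pct) tuples,
    e.g. parsed from periodically sampled GPU telemetry.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # vram total, util total, count
    for idx, vram, util in samples:
        acc = sums[idx]
        acc[0] += vram
        acc[1] += util
        acc[2] += 1
    return {
        idx: (round(v / n, 1), round(u / n, 1))
        for idx, (v, u, n) in sorted(sums.items())
    }

# Illustrative readings for two GPUs over three sampling intervals.
readings = [
    (0, 7.0, 20.0), (0, 7.0, 21.0), (0, 7.0, 20.2),
    (1, 4.2, 54.0), (1, 4.2, 55.0), (1, 4.2, 54.8),
]
print(average_per_gpu(readings))  # {0: (7.0, 20.4), 1: (4.2, 54.6)}
```

Averaging per GPU index rather than across the whole board is what makes the uneven placement visible, such as GPU 1 carrying most of the load at 1 stream while GPU 2 sits idle.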