Performance

E2E and Component Latency

The following table shows the end-to-end latency breakdown for the reference workflow:

Configuration Details:

  • Speculative Speech Processing: On

  • TTS: ElevenLabs

  • Platform: g5.12xlarge

  • GPU: 4xA10

  • Number of streams: 1

  • LLM: llama-3.1-8b-instruct, self-hosted on L40

End-to-End Latency Performance Data

| KPI               | Unit | Average | P90  | P75  | P50  |
|-------------------|------|---------|------|------|------|
| E2E latency       | ms   | 2124    | 2408 | 2184 | 2058 |
| ASR latency       | ms   | 323     | 400  | 372  | 303  |
| LLM latency       | ms   | 789     | 940  | 809  | 756  |
| TTS latency       | ms   | 200     | 219  | 200  | 195  |
| Component latency | ms   | 1312    | 1483 | 1386 | 1260 |
| Other latency     | ms   | 812     | 934  | 845  | 796  |
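The latency rows are related by simple arithmetic: the component latency is the sum of the ASR, LLM, and TTS averages, and the "other" latency is the remainder of the end-to-end figure (network, buffering, and pipeline overhead). A minimal sketch of that relationship, using the average column above (the dictionary keys here are illustrative, not part of any shipped tool):

```python
# Average latencies (ms) from the table above.
avg_ms = {"e2e": 2124, "asr": 323, "llm": 789, "tts": 200}

# Component latency is the sum of the measured pipeline stages.
component = avg_ms["asr"] + avg_ms["llm"] + avg_ms["tts"]  # 1312 ms

# "Other" latency is whatever the end-to-end figure does not attribute
# to a specific component (transport, buffering, orchestration).
other = avg_ms["e2e"] - component  # 812 ms

print(component, other)  # 1312 812
```

The same decomposition holds approximately, but not exactly, for the percentile columns, since percentiles of a sum are not the sum of percentiles.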

Resource Usage per GPU for Different Concurrent Streams

Configuration Details:

  • Avatar: Aki at 1280x720 resolution

  • Platform: g5.12xlarge

  • GPU: 4xA10

  • GPU VRAM on each A10 GPU: 22.5 GB

Resource Usage per GPU for Different Concurrent Streams

| KPI                     | Unit               | GPU Index | 1 Stream | 3 Streams | 6 Streams |
|-------------------------|--------------------|-----------|----------|-----------|-----------|
| Average GPU VRAM usage  | GiB                | 0         | 7.0      | 7.0       | 7.0       |
|                         |                    | 1         | 4.2      | 4.1       | 8.4       |
|                         |                    | 2         | 0        | 4.1       | 8.3       |
|                         |                    | 3         | 0.4      | 4.5       | 8.8       |
| Average GPU utilization | %                  | 0         | 20.4     | 45.4      | 64.6      |
|                         |                    | 1         | 54.6     | 46.8      | 91.0      |
|                         |                    | 2         | 0        | 44.4      | 93.9      |
|                         |                    | 3         | 3.8      | 48.4      | 97.0      |
| Average CPU utilization | Logical cores used | N/A       | 4.2      | 11.0      | 21.9      |
| Average RAM usage       | GiB                | N/A       | 7.3      | 12.2      | 19.5     |
| Average Renderer FPS    | FPS                | N/A       | 30       | 30        | 29.8      |
| Average WebRTC FPS      | FPS                | N/A       | 30       | 30        | 29.8      |
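One practical question the VRAM rows answer is how much headroom each A10 retains at the highest measured load. A minimal sketch of that check, using the 6-stream VRAM figures from the table (the variable names are illustrative; note the table reports usage in GiB against a 22.5 GB capacity, so treat the comparison as approximate):

```python
# Per-A10 VRAM capacity on the g5.12xlarge, as stated in the configuration.
VRAM_PER_GPU = 22.5

# Measured average VRAM usage (GiB) per GPU at 6 concurrent streams.
vram_at_6_streams = {0: 7.0, 1: 8.4, 2: 8.3, 3: 8.8}

# Remaining headroom on each GPU; GPU 3 is the most loaded.
headroom = {gpu: round(VRAM_PER_GPU - used, 1)
            for gpu, used in vram_at_6_streams.items()}

print(headroom)  # {0: 15.5, 1: 14.1, 2: 14.2, 3: 13.7}
```

Since utilization on GPUs 1 through 3 already reaches 91 to 97 percent at 6 streams, compute, not VRAM, is the binding constraint on this platform.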