Llama-2 Results

Inference performance was measured on:

  • 1-8 × A100 80GB SXM4
  • 1-8 × H100 80GB HBM3

Configuration 1: Chatbot Conversation use case

  • batch size: 1-8
  • input token length: 128
  • output token length: 20

[Figure: model size scaling, Llama-2, 128 input / 20 output tokens (infer_modelsize_scaling_llama2_128_20.svg)]

[Figure: GPU count scaling, Llama-2-70B, 128 input / 20 output tokens (infer_gpu_scaling_llama2_70b_128_20.svg)]

[Figure: batch size scaling, Llama-2-70B, 128 input / 20 output tokens (infer_bs_scaling_llama2_70b_128_20.svg)]

Average Latency, Average Throughput, and Model Size

| Model size | Batch size | Avg latency, A100 80GB SXM4 [ms] | Avg latency, H100 80GB HBM3 [ms] | Avg throughput, A100 80GB SXM4 [sentences/s] | Avg throughput, H100 80GB HBM3 [sentences/s] | TP | PP | GPUs |
|---|---|---|---|---|---|---|---|---|
| Llama-2-7B | 1 | 234.0 | 153.8 | 4.3 | 6.5 | 1 | 1 | 1 |
| Llama-2-7B | 2 | 249.6 | 160.8 | 8.0 | 12.4 | 1 | 1 | 1 |
| Llama-2-7B | 4 | 272.6 | 174.8 | 14.7 | 22.9 | 1 | 1 | 1 |
| Llama-2-7B | 8 | 329.5 | 199.2 | 24.3 | 40.2 | 1 | 1 | 1 |
| Llama-2-7B | 1 | 171.7 | 128.6 | 5.8 | 7.8 | 2 | 1 | 2 |
| Llama-2-7B | 2 | 180.3 | 132.0 | 11.1 | 15.1 | 2 | 1 | 2 |
| Llama-2-7B | 4 | 202.9 | 137.9 | 19.7 | 29.0 | 2 | 1 | 2 |
| Llama-2-7B | 8 | 237.6 | 156.2 | 33.7 | 51.2 | 2 | 1 | 2 |
| Llama-2-7B | 1 | 143.7 | 107.4 | 7.0 | 9.3 | 4 | 1 | 4 |
| Llama-2-7B | 2 | 149.9 | 114.0 | 13.3 | 17.5 | 4 | 1 | 4 |
| Llama-2-7B | 4 | 165.2 | 120.4 | 24.2 | 33.2 | 4 | 1 | 4 |
| Llama-2-7B | 8 | 196.4 | 134.6 | 40.7 | 59.5 | 4 | 1 | 4 |
| Llama-2-7B | 1 | 136.5 | 97.6 | 7.3 | 10.2 | 8 | 1 | 8 |
| Llama-2-7B | 2 | 143.3 | 109.1 | 14.0 | 18.3 | 8 | 1 | 8 |
| Llama-2-7B | 4 | 158.6 | 116.2 | 25.2 | 34.4 | 8 | 1 | 8 |
| Llama-2-7B | 8 | 181.9 | 129.1 | 44.0 | 62.0 | 8 | 1 | 8 |
| Llama-2-13B | 1 | 142.5 | 86.4 | 7.0 | 11.6 | 1 | 1 | 1 |
| Llama-2-13B | 2 | 163.3 | 94.2 | 12.2 | 21.2 | 1 | 1 | 1 |
| Llama-2-13B | 4 | 198.5 | 117.8 | 20.2 | 34.0 | 1 | 1 | 1 |
| Llama-2-13B | 8 | 282.6 | 146.7 | 28.3 | 54.5 | 1 | 1 | 1 |
| Llama-2-13B | 1 | 100.4 | 69.7 | 10.0 | 14.3 | 2 | 1 | 2 |
| Llama-2-13B | 2 | 112.2 | 73.7 | 17.8 | 27.1 | 2 | 1 | 2 |
| Llama-2-13B | 4 | 320.0 | 88.3 | 12.5 | 45.3 | 2 | 1 | 2 |
| Llama-2-13B | 8 | 188.6 | 109.8 | 42.4 | 72.8 | 2 | 1 | 2 |
| Llama-2-13B | 1 | 207.8 | 61.4 | 4.8 | 16.3 | 4 | 1 | 4 |
| Llama-2-13B | 2 | 84.6 | 62.0 | 23.6 | 32.3 | 4 | 1 | 4 |
| Llama-2-13B | 4 | 102.3 | 72.0 | 39.1 | 55.6 | 4 | 1 | 4 |
| Llama-2-13B | 8 | 143.0 | 88.6 | 56.0 | 90.3 | 4 | 1 | 4 |
| Llama-2-13B | 1 | 72.2 | 54.3 | 13.9 | 18.4 | 8 | 1 | 8 |
| Llama-2-13B | 2 | 76.3 | 59.3 | 26.2 | 33.7 | 8 | 1 | 8 |
| Llama-2-13B | 4 | 212.0 | 157.3 | 18.9 | 25.4 | 8 | 1 | 8 |
| Llama-2-13B | 8 | 242.6 | 81.8 | 33.0 | 97.8 | 8 | 1 | 8 |
| Llama-2-70B | 1 | 1,108.3 | 652.7 | 0.9 | 1.5 | 2 | 1 | 2 |
| Llama-2-70B | 2 | 1,156.6 | 668.2 | 1.7 | 3.0 | 2 | 1 | 2 |
| Llama-2-70B | 4 | 1,272.9 | 742.1 | 3.1 | 5.4 | 2 | 1 | 2 |
| Llama-2-70B | 8 | 1,520.3 | 818.2 | 5.3 | 9.8 | 2 | 1 | 2 |
| Llama-2-70B | 1 | 673.4 | 433.3 | 1.5 | 2.3 | 4 | 1 | 4 |
| Llama-2-70B | 2 | 715.3 | 446.9 | 2.8 | 4.5 | 4 | 1 | 4 |
| Llama-2-70B | 4 | 784.8 | 487.3 | 5.1 | 8.2 | 4 | 1 | 4 |
| Llama-2-70B | 8 | 941.4 | 537.3 | 8.5 | 14.9 | 4 | 1 | 4 |
| Llama-2-70B | 1 | 504.1 | 343.1 | 2.0 | 2.9 | 8 | 1 | 8 |
| Llama-2-70B | 2 | 542.0 | 359.8 | 3.7 | 5.6 | 8 | 1 | 8 |
| Llama-2-70B | 4 | 586.6 | 386.6 | 6.8 | 10.3 | 8 | 1 | 8 |
| Llama-2-70B | 8 | 695.6 | 428.2 | 11.5 | 18.7 | 8 | 1 | 8 |
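
The reported throughput is consistent with batch size divided by average latency, and multiplying by the 20-token output length gives a per-token generation rate. A minimal sketch of that arithmetic, checked against the Llama-2-7B row at batch size 8 with TP=8 from the table above (the function names here are illustrative, not part of any NVIDIA API):

```python
# Consistency check: throughput [sentences/s] ~= batch_size / latency [s].
# Values taken from the table above (Llama-2-7B, TP=8, PP=1, batch size 8, A100).

def sentences_per_second(batch_size: int, avg_latency_ms: float) -> float:
    """Throughput implied by an average end-to-end batch latency."""
    return batch_size / (avg_latency_ms / 1000.0)

def output_tokens_per_second(batch_size: int, avg_latency_ms: float,
                             output_len: int) -> float:
    """Derived per-token generation rate (output tokens only)."""
    return sentences_per_second(batch_size, avg_latency_ms) * output_len

if __name__ == "__main__":
    thpt = sentences_per_second(batch_size=8, avg_latency_ms=181.9)
    print(f"{thpt:.1f} sentences/s")      # ~44.0, matching the table
    toks = output_tokens_per_second(8, 181.9, output_len=20)
    print(f"{toks:.0f} output tokens/s")  # ~880 for this configuration
```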

Configuration 2: Translation / Style Transfer use case

  • batch size: 1-8
  • input token length: 200
  • output token length: 200

[Figure: model size scaling, Llama-2, 200 input / 200 output tokens (infer_modelsize_scaling_llama2_200_200.svg)]

[Figure: GPU count scaling, Llama-2-70B, 200 input / 200 output tokens (infer_gpu_scaling_llama2_70b_200_200.svg)]

[Figure: batch size scaling, Llama-2-70B, 200 input / 200 output tokens (infer_bs_scaling_llama2_70b_200_200.svg)]

Average Latency, Average Throughput, and Model Size

| Model size | Batch size | Avg latency, A100 80GB SXM4 [ms] | Avg latency, H100 80GB HBM3 [ms] | Avg throughput, A100 80GB SXM4 [sentences/s] | Avg throughput, H100 80GB HBM3 [sentences/s] | TP | PP | GPUs |
|---|---|---|---|---|---|---|---|---|
| Llama-2-7B | 1 | 2,189.6 | 1,440.0 | 0.5 | 0.7 | 1 | 1 | 1 |
| Llama-2-7B | 2 | 2,227.9 | 1,463.8 | 0.9 | 1.4 | 1 | 1 | 1 |
| Llama-2-7B | 4 | 2,386.5 | 1,509.7 | 1.7 | 2.7 | 1 | 1 | 1 |
| Llama-2-7B | 8 | 2,611.4 | 1,653.7 | 3.1 | 4.8 | 1 | 1 | 1 |
| Llama-2-7B | 1 | 1,544.2 | 1,143.2 | 0.6 | 0.9 | 2 | 1 | 2 |
| Llama-2-7B | 2 | 1,588.9 | 1,163.0 | 1.3 | 1.7 | 2 | 1 | 2 |
| Llama-2-7B | 4 | 1,649.4 | 1,175.1 | 2.4 | 3.4 | 2 | 1 | 2 |
| Llama-2-7B | 8 | 1,841.0 | 1,238.2 | 4.3 | 6.5 | 2 | 1 | 2 |
| Llama-2-7B | 1 | 1,280.0 | 923.8 | 0.8 | 1.1 | 4 | 1 | 4 |
| Llama-2-7B | 2 | 1,313.0 | 991.0 | 1.5 | 2.0 | 4 | 1 | 4 |
| Llama-2-7B | 4 | 1,383.5 | 1,017.2 | 2.9 | 3.9 | 4 | 1 | 4 |
| Llama-2-7B | 8 | 1,463.5 | 1,070.9 | 5.5 | 7.5 | 4 | 1 | 4 |
| Llama-2-7B | 1 | 1,187.4 | 827.6 | 0.8 | 1.2 | 8 | 1 | 8 |
| Llama-2-7B | 2 | 1,248.4 | 936.5 | 1.6 | 2.1 | 8 | 1 | 8 |
| Llama-2-7B | 4 | 1,329.7 | 975.4 | 3.0 | 4.1 | 8 | 1 | 8 |
| Llama-2-7B | 8 | 1,416.6 | 1,020.7 | 5.6 | 7.8 | 8 | 1 | 8 |
| Llama-2-13B | 1 | 3,884.5 | 2,396.8 | 0.3 | 0.4 | 1 | 1 | 1 |
| Llama-2-13B | 2 | 4,020.7 | 2,413.6 | 0.5 | 0.8 | 1 | 1 | 1 |
| Llama-2-13B | 4 | 4,250.9 | 2,559.8 | 0.9 | 1.6 | 1 | 1 | 1 |
| Llama-2-13B | 8 | 4,590.2 | 2,722.8 | 1.7 | 2.9 | 1 | 1 | 1 |
| Llama-2-13B | 1 | 2,499.1 | 1,717.2 | 0.4 | 0.6 | 2 | 1 | 2 |
| Llama-2-13B | 2 | 2,620.4 | 1,746.2 | 0.8 | 1.1 | 2 | 1 | 2 |
| Llama-2-13B | 4 | 2,699.3 | 1,778.3 | 1.5 | 2.2 | 2 | 1 | 2 |
| Llama-2-13B | 8 | 2,967.1 | 1,944.8 | 2.7 | 4.1 | 2 | 1 | 2 |
| Llama-2-13B | 1 | 1,894.0 | 1,431.2 | 0.5 | 0.7 | 4 | 1 | 4 |
| Llama-2-13B | 2 | 1,945.1 | 1,407.0 | 1.0 | 1.4 | 4 | 1 | 4 |
| Llama-2-13B | 4 | 2,047.3 | 1,451.8 | 2.0 | 2.8 | 4 | 1 | 4 |
| Llama-2-13B | 8 | 2,117.4 | 1,498.0 | 3.8 | 5.3 | 4 | 1 | 4 |
| Llama-2-13B | 1 | 1,692.8 | 1,201.3 | 0.6 | 0.8 | 8 | 1 | 8 |
| Llama-2-13B | 2 | 1,735.4 | 1,304.1 | 1.2 | 1.5 | 8 | 1 | 8 |
| Llama-2-13B | 4 | 1,836.8 | 1,361.6 | 2.2 | 2.9 | 8 | 1 | 8 |
| Llama-2-13B | 8 | 1,926.9 | 1,420.0 | 4.2 | 5.6 | 8 | 1 | 8 |
| Llama-2-70B | 1 | 10,500.4 | 6,267.3 | 0.1 | 0.2 | 2 | 1 | 2 |
| Llama-2-70B | 2 | 10,695.1 | 6,288.4 | 0.2 | 0.3 | 2 | 1 | 2 |
| Llama-2-70B | 4 | 11,151.1 | 6,401.6 | 0.4 | 0.6 | 2 | 1 | 2 |
| Llama-2-70B | 8 | 11,858.6 | 6,731.0 | 0.7 | 1.2 | 2 | 1 | 2 |
| Llama-2-70B | 1 | 6,403.0 | 4,115.6 | 0.2 | 0.2 | 4 | 1 | 4 |
| Llama-2-70B | 2 | 6,604.8 | 4,146.6 | 0.3 | 0.5 | 4 | 1 | 4 |
| Llama-2-70B | 4 | 6,833.8 | 4,241.9 | 0.6 | 0.9 | 4 | 1 | 4 |
| Llama-2-70B | 8 | 7,394.9 | 4,367.1 | 1.1 | 1.8 | 4 | 1 | 4 |
| Llama-2-70B | 1 | 4,734.8 | 3,202.1 | 0.2 | 0.3 | 8 | 1 | 8 |
| Llama-2-70B | 2 | 4,995.7 | 3,311.5 | 0.4 | 0.6 | 8 | 1 | 8 |
| Llama-2-70B | 4 | 5,110.5 | 3,379.7 | 0.8 | 1.2 | 8 | 1 | 8 |
| Llama-2-70B | 8 | 5,577.7 | 3,450.4 | 1.4 | 2.3 | 8 | 1 | 8 |
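
In every row the GPUs column equals TP × PP, so the GPU-scaling figures above reduce to tensor-parallel scaling. As a rough companion to those figures, here is a minimal sketch that computes scaling efficiency from the Llama-2-70B, batch-size-8 H100 rows in the table above; the function name and the "efficiency" framing are illustrative additions, not part of the benchmark suite:

```python
# Tensor-parallel scaling efficiency from the Configuration 2 table
# (Llama-2-70B, batch size 8, H100 80GB HBM3 throughput in sentences/s).
THROUGHPUT = {2: 1.2, 4: 1.8, 8: 2.3}  # {TP degree (= GPU count): sentences/s}

def scaling_efficiency(base_gpus: int, gpus: int) -> float:
    """Measured speedup over the base GPU count, divided by the ideal speedup."""
    speedup = THROUGHPUT[gpus] / THROUGHPUT[base_gpus]
    ideal = gpus / base_gpus
    return speedup / ideal

if __name__ == "__main__":
    for n in (4, 8):
        print(f"TP={n}: {scaling_efficiency(2, n):.0%} of ideal vs. TP=2")
    # TP=4: 1.8/1.2 = 1.5x on 2x the GPUs -> 75%
    # TP=8: 2.3/1.2 ~= 1.92x on 4x the GPUs -> ~48%
```

As the comments note, doubling TP from 2 to 4 recovers 75% of ideal scaling for this workload, while going to TP=8 recovers roughly half, which matches the diminishing returns visible in the batch-size and GPU-scaling figures.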