Llama-2 Results

Inference Performance

Inference performance was measured on:

  • 1-8 × A100 80GB SXM4
  • 1-8 × H100 80GB HBM3

Configuration 1: Chatbot Conversation use case

  • batch size: 1 - 8
  • input tokens length: 128
  • output tokens length: 20
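As a readability aid, the same sweep can be written out explicitly. The snippet below is only an illustrative sketch of the parameter grid; the dictionary name and layout are assumptions made for this write-up, not the configuration format of the benchmark harness that produced the numbers.

```python
# Illustrative summary of Configuration 1 (chatbot conversation use case).
# The dictionary name and layout are assumptions for readability, not the
# benchmark harness's actual configuration format.
config_1 = {
    "batch_sizes": [1, 2, 4, 8],
    "input_tokens": 128,
    "output_tokens": 20,
    "gpu_types": ["A100 80GB SXM4", "H100 80GB HBM3"],
    "tensor_parallel_sizes": [1, 2, 4, 8],  # TP column below; the 70B model starts at TP=2
    "pipeline_parallel_size": 1,            # PP column is 1 throughout these tables
}
```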

[Figures: Llama-2 inference scaling with model size, Llama-2-70B scaling with GPU count, and Llama-2-70B scaling with batch size, for 128 input / 20 output tokens (infer_modelsize_scaling_llama2_128_20.svg, infer_gpu_scaling_llama2_70b_128_20.svg, infer_bs_scaling_llama2_70b_128_20.svg).]

Average Latency, Average Throughput, and Model Size

| Model size | Batch Size | Average Latency [ms] (A100 80GB SXM4) | Average Latency [ms] (H100 80GB HBM3) | Average Throughput [sentences/s] (A100 80GB SXM4) | Average Throughput [sentences/s] (H100 80GB HBM3) | TP | PP | GPUs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7B | 1 | 234.0 | 153.8 | 4.3 | 6.5 | 1 | 1 | 1 |
| Llama-2-7B | 2 | 249.6 | 160.8 | 8.0 | 12.4 | 1 | 1 | 1 |
| Llama-2-7B | 4 | 272.6 | 174.8 | 14.7 | 22.9 | 1 | 1 | 1 |
| Llama-2-7B | 8 | 329.5 | 199.2 | 24.3 | 40.2 | 1 | 1 | 1 |
| Llama-2-7B | 1 | 171.7 | 128.6 | 5.8 | 7.8 | 2 | 1 | 2 |
| Llama-2-7B | 2 | 180.3 | 132.0 | 11.1 | 15.1 | 2 | 1 | 2 |
| Llama-2-7B | 4 | 202.9 | 137.9 | 19.7 | 29.0 | 2 | 1 | 2 |
| Llama-2-7B | 8 | 237.6 | 156.2 | 33.7 | 51.2 | 2 | 1 | 2 |
| Llama-2-7B | 1 | 143.7 | 107.4 | 7.0 | 9.3 | 4 | 1 | 4 |
| Llama-2-7B | 2 | 149.9 | 114.0 | 13.3 | 17.5 | 4 | 1 | 4 |
| Llama-2-7B | 4 | 165.2 | 120.4 | 24.2 | 33.2 | 4 | 1 | 4 |
| Llama-2-7B | 8 | 196.4 | 134.6 | 40.7 | 59.5 | 4 | 1 | 4 |
| Llama-2-7B | 1 | 136.5 | 97.6 | 7.3 | 10.2 | 8 | 1 | 8 |
| Llama-2-7B | 2 | 143.3 | 109.1 | 14.0 | 18.3 | 8 | 1 | 8 |
| Llama-2-7B | 4 | 158.6 | 116.2 | 25.2 | 34.4 | 8 | 1 | 8 |
| Llama-2-7B | 8 | 181.9 | 129.1 | 44.0 | 62.0 | 8 | 1 | 8 |
| Llama-2-13B | 1 | 142.5 | 86.4 | 7.0 | 11.6 | 1 | 1 | 1 |
| Llama-2-13B | 2 | 163.3 | 94.2 | 12.2 | 21.2 | 1 | 1 | 1 |
| Llama-2-13B | 4 | 198.5 | 117.8 | 20.2 | 34.0 | 1 | 1 | 1 |
| Llama-2-13B | 8 | 282.6 | 146.7 | 28.3 | 54.5 | 1 | 1 | 1 |
| Llama-2-13B | 1 | 100.4 | 69.7 | 10.0 | 14.3 | 2 | 1 | 2 |
| Llama-2-13B | 2 | 112.2 | 73.7 | 17.8 | 27.1 | 2 | 1 | 2 |
| Llama-2-13B | 4 | 320.0 | 88.3 | 12.5 | 45.3 | 2 | 1 | 2 |
| Llama-2-13B | 8 | 188.6 | 109.8 | 42.4 | 72.8 | 2 | 1 | 2 |
| Llama-2-13B | 1 | 207.8 | 61.4 | 4.8 | 16.3 | 4 | 1 | 4 |
| Llama-2-13B | 2 | 84.6 | 62.0 | 23.6 | 32.3 | 4 | 1 | 4 |
| Llama-2-13B | 4 | 102.3 | 72.0 | 39.1 | 55.6 | 4 | 1 | 4 |
| Llama-2-13B | 8 | 143.0 | 88.6 | 56.0 | 90.3 | 4 | 1 | 4 |
| Llama-2-13B | 1 | 72.2 | 54.3 | 13.9 | 18.4 | 8 | 1 | 8 |
| Llama-2-13B | 2 | 76.3 | 59.3 | 26.2 | 33.7 | 8 | 1 | 8 |
| Llama-2-13B | 4 | 212.0 | 157.3 | 18.9 | 25.4 | 8 | 1 | 8 |
| Llama-2-13B | 8 | 242.6 | 81.8 | 33.0 | 97.8 | 8 | 1 | 8 |
| Llama-2-70B | 1 | 1,108.3 | 652.7 | 0.9 | 1.5 | 2 | 1 | 2 |
| Llama-2-70B | 2 | 1,156.6 | 668.2 | 1.7 | 3.0 | 2 | 1 | 2 |
| Llama-2-70B | 4 | 1,272.9 | 742.1 | 3.1 | 5.4 | 2 | 1 | 2 |
| Llama-2-70B | 8 | 1,520.3 | 818.2 | 5.3 | 9.8 | 2 | 1 | 2 |
| Llama-2-70B | 1 | 673.4 | 433.3 | 1.5 | 2.3 | 4 | 1 | 4 |
| Llama-2-70B | 2 | 715.3 | 446.9 | 2.8 | 4.5 | 4 | 1 | 4 |
| Llama-2-70B | 4 | 784.8 | 487.3 | 5.1 | 8.2 | 4 | 1 | 4 |
| Llama-2-70B | 8 | 941.4 | 537.3 | 8.5 | 14.9 | 4 | 1 | 4 |
| Llama-2-70B | 1 | 504.1 | 343.1 | 2.0 | 2.9 | 8 | 1 | 8 |
| Llama-2-70B | 2 | 542.0 | 359.8 | 3.7 | 5.6 | 8 | 1 | 8 |
| Llama-2-70B | 4 | 586.6 | 386.6 | 6.8 | 10.3 | 8 | 1 | 8 |
| Llama-2-70B | 8 | 695.6 | 428.2 | 11.5 | 18.7 | 8 | 1 | 8 |
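Reading the table: the throughput column is consistent with dividing the batch size by the average latency, and the GPUs column equals TP × PP. The snippet below checks this against two rows copied from the table; it is a sanity-check sketch inferred from the published numbers, not the benchmark's own post-processing code.

```python
# Sanity check (assumption): throughput [sentences/s] ~= batch_size / (latency [ms] / 1000).
# The rows below are copied from the table above.
rows = [
    # (label, batch_size, latency_ms, reported_throughput_sps, tp, pp)
    ("Llama-2-7B (A100, TP=1)", 8, 329.5, 24.3, 1, 1),
    ("Llama-2-70B (H100, TP=8)", 8, 428.2, 18.7, 8, 1),
]

for label, bs, latency_ms, reported, tp, pp in rows:
    derived = bs / (latency_ms / 1000.0)  # sentences/s implied by the latency
    per_gpu = reported / (tp * pp)        # GPUs column = TP x PP in these tables
    print(f"{label}: derived {derived:.1f} vs reported {reported} sentences/s, "
          f"{per_gpu:.2f} sentences/s per GPU")
```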

Configuration 2: Translation / Style Transfer use case

  • batch size: 1 - 8
  • input tokens length: 200
  • output tokens length: 200

[Figures: Llama-2 inference scaling with model size, Llama-2-70B scaling with GPU count, and Llama-2-70B scaling with batch size, for 200 input / 200 output tokens (infer_modelsize_scaling_llama2_200_200.svg, infer_gpu_scaling_llama2_70b_200_200.svg, infer_bs_scaling_llama2_70b_200_200.svg).]

Average Latency, Average Throughput, and Model Size

| Model size | Batch Size | Average Latency [ms] (A100 80GB SXM4) | Average Latency [ms] (H100 80GB HBM3) | Average Throughput [sentences/s] (A100 80GB SXM4) | Average Throughput [sentences/s] (H100 80GB HBM3) | TP | PP | GPUs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7B | 1 | 2,189.6 | 1,440.0 | 0.5 | 0.7 | 1 | 1 | 1 |
| Llama-2-7B | 2 | 2,227.9 | 1,463.8 | 0.9 | 1.4 | 1 | 1 | 1 |
| Llama-2-7B | 4 | 2,386.5 | 1,509.7 | 1.7 | 2.7 | 1 | 1 | 1 |
| Llama-2-7B | 8 | 2,611.4 | 1,653.7 | 3.1 | 4.8 | 1 | 1 | 1 |
| Llama-2-7B | 1 | 1,544.2 | 1,143.2 | 0.6 | 0.9 | 2 | 1 | 2 |
| Llama-2-7B | 2 | 1,588.9 | 1,163.0 | 1.3 | 1.7 | 2 | 1 | 2 |
| Llama-2-7B | 4 | 1,649.4 | 1,175.1 | 2.4 | 3.4 | 2 | 1 | 2 |
| Llama-2-7B | 8 | 1,841.0 | 1,238.2 | 4.3 | 6.5 | 2 | 1 | 2 |
| Llama-2-7B | 1 | 1,280.0 | 923.8 | 0.8 | 1.1 | 4 | 1 | 4 |
| Llama-2-7B | 2 | 1,313.0 | 991.0 | 1.5 | 2.0 | 4 | 1 | 4 |
| Llama-2-7B | 4 | 1,383.5 | 1,017.2 | 2.9 | 3.9 | 4 | 1 | 4 |
| Llama-2-7B | 8 | 1,463.5 | 1,070.9 | 5.5 | 7.5 | 4 | 1 | 4 |
| Llama-2-7B | 1 | 1,187.4 | 827.6 | 0.8 | 1.2 | 8 | 1 | 8 |
| Llama-2-7B | 2 | 1,248.4 | 936.5 | 1.6 | 2.1 | 8 | 1 | 8 |
| Llama-2-7B | 4 | 1,329.7 | 975.4 | 3.0 | 4.1 | 8 | 1 | 8 |
| Llama-2-7B | 8 | 1,416.6 | 1,020.7 | 5.6 | 7.8 | 8 | 1 | 8 |
| Llama-2-13B | 1 | 3,884.5 | 2,396.8 | 0.3 | 0.4 | 1 | 1 | 1 |
| Llama-2-13B | 2 | 4,020.7 | 2,413.6 | 0.5 | 0.8 | 1 | 1 | 1 |
| Llama-2-13B | 4 | 4,250.9 | 2,559.8 | 0.9 | 1.6 | 1 | 1 | 1 |
| Llama-2-13B | 8 | 4,590.2 | 2,722.8 | 1.7 | 2.9 | 1 | 1 | 1 |
| Llama-2-13B | 1 | 2,499.1 | 1,717.2 | 0.4 | 0.6 | 2 | 1 | 2 |
| Llama-2-13B | 2 | 2,620.4 | 1,746.2 | 0.8 | 1.1 | 2 | 1 | 2 |
| Llama-2-13B | 4 | 2,699.3 | 1,778.3 | 1.5 | 2.2 | 2 | 1 | 2 |
| Llama-2-13B | 8 | 2,967.1 | 1,944.8 | 2.7 | 4.1 | 2 | 1 | 2 |
| Llama-2-13B | 1 | 1,894.0 | 1,431.2 | 0.5 | 0.7 | 4 | 1 | 4 |
| Llama-2-13B | 2 | 1,945.1 | 1,407.0 | 1.0 | 1.4 | 4 | 1 | 4 |
| Llama-2-13B | 4 | 2,047.3 | 1,451.8 | 2.0 | 2.8 | 4 | 1 | 4 |
| Llama-2-13B | 8 | 2,117.4 | 1,498.0 | 3.8 | 5.3 | 4 | 1 | 4 |
| Llama-2-13B | 1 | 1,692.8 | 1,201.3 | 0.6 | 0.8 | 8 | 1 | 8 |
| Llama-2-13B | 2 | 1,735.4 | 1,304.1 | 1.2 | 1.5 | 8 | 1 | 8 |
| Llama-2-13B | 4 | 1,836.8 | 1,361.6 | 2.2 | 2.9 | 8 | 1 | 8 |
| Llama-2-13B | 8 | 1,926.9 | 1,420.0 | 4.2 | 5.6 | 8 | 1 | 8 |
| Llama-2-70B | 1 | 10,500.4 | 6,267.3 | 0.1 | 0.2 | 2 | 1 | 2 |
| Llama-2-70B | 2 | 10,695.1 | 6,288.4 | 0.2 | 0.3 | 2 | 1 | 2 |
| Llama-2-70B | 4 | 11,151.1 | 6,401.6 | 0.4 | 0.6 | 2 | 1 | 2 |
| Llama-2-70B | 8 | 11,858.6 | 6,731.0 | 0.7 | 1.2 | 2 | 1 | 2 |
| Llama-2-70B | 1 | 6,403.0 | 4,115.6 | 0.2 | 0.2 | 4 | 1 | 4 |
| Llama-2-70B | 2 | 6,604.8 | 4,146.6 | 0.3 | 0.5 | 4 | 1 | 4 |
| Llama-2-70B | 4 | 6,833.8 | 4,241.9 | 0.6 | 0.9 | 4 | 1 | 4 |
| Llama-2-70B | 8 | 7,394.9 | 4,367.1 | 1.1 | 1.8 | 4 | 1 | 4 |
| Llama-2-70B | 1 | 4,734.8 | 3,202.1 | 0.2 | 0.3 | 8 | 1 | 8 |
| Llama-2-70B | 2 | 4,995.7 | 3,311.5 | 0.4 | 0.6 | 8 | 1 | 8 |
| Llama-2-70B | 4 | 5,110.5 | 3,379.7 | 0.8 | 1.2 | 8 | 1 | 8 |
| Llama-2-70B | 8 | 5,577.7 | 3,450.4 | 1.4 | 2.3 | 8 | 1 | 8 |
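For this translation / style-transfer configuration, where each sentence generates 200 output tokens, the sentences/s column can be rescaled into an approximate generated-token throughput. The sketch below does this for the last row of the table (Llama-2-70B, batch size 8, TP=8, PP=1); the conversion factor is an assumption based on the stated output length, and the latency and throughput values are copied from the table above.

```python
# Rough conversion (assumption): generated tokens/s ~= sentences/s x output length.
# The 200-token output length comes from Configuration 2; values are copied from the table.
output_tokens = 200

# Llama-2-70B, batch size 8, TP=8, PP=1 (last row of the table above)
a100 = {"latency_ms": 5577.7, "throughput_sps": 1.4}
h100 = {"latency_ms": 3450.4, "throughput_sps": 2.3}

for name, gpu in (("A100 80GB SXM4", a100), ("H100 80GB HBM3", h100)):
    tokens_per_s = gpu["throughput_sps"] * output_tokens
    print(f"{name}: ~{tokens_per_s:.0f} generated tokens/s at batch size 8")

# Latency speedup of H100 over A100 for this row (~1.62x)
print(f"H100 speedup: {a100['latency_ms'] / h100['latency_ms']:.2f}x")
```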