Performance

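This section reports latency and throughput measurements for Llama models served by NVIDIA NIM: time to first token (TTFT), inter-token latency (ITL), and token throughput at each concurrency level. For the benchmarking procedure, see Using GenAI-Perf to Benchmark.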

For specifications of the hardware on which these measurements were collected, see the Hardware Specifications section.

llama-3.3-70b-instruct Results

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 368.11 | 19.94 | 48.44 |
| 5 | 446.68 | 23.89 | 199.17 |
| 25 | 751.74 | 42.12 | 564.5 |
| 50 | 1083.13 | 63.72 | 738.78 |
| 100 | 17754.93 | 87.55 | 777.31 |
| 150 | 46381.93 | 87.39 | 777.94 |
| 200 | 73326.6 | 87.64 | 777.02 |
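
One way to read a table like the one above is to look for the concurrency level beyond which throughput stops growing. The following is a minimal sketch, not part of the benchmark tooling: the function name `saturation_concurrency` and the 2% gain threshold are illustrative assumptions, and the data is copied from the first llama-3.3-70b-instruct measurements above.

```python
# (concurrency, throughput in tokens/s), copied from the measurements above.
ROWS = [
    (1, 48.44), (5, 199.17), (25, 564.5), (50, 738.78),
    (100, 777.31), (150, 777.94), (200, 777.02),
]


def saturation_concurrency(rows, min_gain=0.02):
    """Return the first concurrency whose next step adds less than min_gain throughput.

    min_gain is a hypothetical threshold (2% by default), not a value from the source.
    """
    for (conc, tput), (_, next_tput) in zip(rows, rows[1:]):
        if next_tput < tput * (1 + min_gain):
            return conc
    # Throughput was still growing at the highest measured concurrency.
    return rows[-1][0]


print(saturation_concurrency(ROWS))  # 100
```

Past that point the table shows TTFT rising sharply while ITL and throughput stay flat, which suggests additional requests mostly wait rather than add useful work.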

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 71.82 | 19.26 | 51.84 |
| 5 | 109.62 | 20.04 | 237.72 |
| 25 | 159.62 | 23.28 | 1032.56 |
| 50 | 232.46 | 25.58 | 1880.75 |
| 100 | 293.19 | 32.83 | 2914.64 |
| 150 | 448.16 | 45.57 | 3096.87 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 32.78 | 19.11 | 52.15 |
| 5 | 81.17 | 19.9 | 247.26 |
| 25 | 247.81 | 21.99 | 1080.83 |
| 50 | 377.42 | 23.94 | 1936.88 |
| 100 | 583.54 | 31.79 | 2886.57 |
| 150 | 831.88 | 44.81 | 3067.43 |
| 200 | 925.3 | 45.92 | 3947.82 |
| 250 | 1105.17 | 49.83 | 4521.68 |
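
These measurements can also be used to pick an operating point under latency limits. The sketch below is illustrative only: the `Row` type, the `best_within_budget` helper, and the 1000 ms TTFT and 50 ms ITL budgets are assumptions chosen for the example, not recommendations from the source. The data is copied from the measurements directly above.

```python
from typing import NamedTuple, Optional, Sequence


class Row(NamedTuple):
    concurrency: int
    ttft_ms: float
    itl_ms: float
    tokens_per_s: float


# Copied from the measurements directly above.
ROWS = [
    Row(1, 32.78, 19.11, 52.15),
    Row(5, 81.17, 19.9, 247.26),
    Row(25, 247.81, 21.99, 1080.83),
    Row(50, 377.42, 23.94, 1936.88),
    Row(100, 583.54, 31.79, 2886.57),
    Row(150, 831.88, 44.81, 3067.43),
    Row(200, 925.3, 45.92, 3947.82),
    Row(250, 1105.17, 49.83, 4521.68),
]


def best_within_budget(
    rows: Sequence[Row], max_ttft_ms: float, max_itl_ms: float
) -> Optional[Row]:
    """Return the highest-throughput row that meets both latency budgets, or None."""
    ok = [r for r in rows if r.ttft_ms <= max_ttft_ms and r.itl_ms <= max_itl_ms]
    return max(ok, key=lambda r: r.tokens_per_s, default=None)


# Hypothetical budgets: TTFT <= 1000 ms and ITL <= 50 ms.
choice = best_within_budget(ROWS, max_ttft_ms=1000.0, max_itl_ms=50.0)
print(choice.concurrency, choice.tokens_per_s)  # 200 3947.82
```

With these example budgets the row at concurrency 250 is excluded by its TTFT alone, so the selection falls back to concurrency 200.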

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 103.2 | 19.31 | 51.54 |
| 5 | 125.56 | 20.52 | 234.76 |
| 25 | 195.68 | 24.76 | 979.58 |
| 50 | 243.19 | 28.55 | 1706.26 |
| 100 | 344.86 | 38.65 | 2507.94 |
| 150 | 568.88 | 54.06 | 2663.03 |
| 200 | 1384.9 | 58.11 | 3238.57 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 190.36 | 19.6 | 49.57 |
| 5 | 430.84 | 21.64 | 216.96 |
| 25 | 870.9 | 35 | 654.15 |
| 50 | 1066.1 | 51.56 | 903.17 |
| 100 | 1332.94 | 88.91 | 1062.3 |
| 150 | 3853.57 | 124.93 | 1070.78 |
| 200 | 17105.6 | 125.13 | 1064.79 |
| 250 | 30309.49 | 125.27 | 1073.5 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 1637.36 | 20.5 | 46.92 |
| 5 | 3126.01 | 27 | 175.65 |
| 25 | 33539.02 | 52.4 | 323.8 |
| 50 | 127678.39 | 51.08 | 335.75 |
| 100 | 277353.39 | 48.51 | 345.84 |
| 150 | 339462.09 | 48.51 | 342.77 |
| 200 | 336890.58 | 51.03 | 338.3 |
| 250 | 515613.11 | 49.4 | 349.04 |

llama-3.1-70b-instruct Results

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 25.05 | 13.21 | 75.36 |
| 5 | 64.82 | 12.64 | 387.55 |
| 25 | 146.11 | 15.31 | 1564.9 |
| 50 | 169.61 | 17.42 | 2743.6 |
| 100 | 235.17 | 23.28 | 4092.34 |
| 150 | 511.21 | 27.73 | 4962.78 |
| 200 | 582.07 | 31.22 | 5855.9 |
| 250 | 718.98 | 35.25 | 6429.95 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 38.64 | 14.24 | 70.14 |
| 5 | 109.06 | 13.68 | 364.24 |
| 25 | 393.58 | 15.86 | 1557.21 |
| 50 | 564.17 | 17.14 | 2868.22 |
| 100 | 640.42 | 22.64 | 4350.2 |
| 150 | 792.84 | 28.49 | 5183.08 |
| 200 | 1417.35 | 31.62 | 6168.81 |
| 250 | 2045.69 | 35.75 | 6511.88 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 59.89 | 14.24 | 69.98 |
| 5 | 180.35 | 13.72 | 360.12 |
| 25 | 571.16 | 16.38 | 1475.28 |
| 50 | 843.45 | 18.15 | 2631.13 |
| 100 | 740 | 25.55 | 3794.76 |
| 150 | 952.07 | 32.31 | 4493.57 |
| 200 | 980.89 | 37.92 | 5114.09 |
| 250 | 2632.73 | 41.92 | 5497.21 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 1018.5 | 30.32 | 32.45 |
| 5 | 2116.84 | 31.29 | 153.71 |
| 25 | 3987.46 | 43.16 | 541.65 |
| 50 | 7012.87 | 56.95 | 793.95 |
| 100 | 93862.88 | 61.72 | 815.75 |
| 150 | 175864.67 | 61.73 | 815.38 |
| 200 | 240432.76 | 61.73 | 815.52 |
| 250 | 335844.98 | 61.75 | 828.23 |
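
When ITL and throughput plateau while TTFT keeps climbing, as in the measurements above, a rough estimate of how many requests are actually decoding at once is throughput multiplied by ITL. The sketch below is a back-of-the-envelope calculation, not something stated in the source: it assumes the throughput column counts output tokens only and ignores time spent in prefill, and `active_streams` is a hypothetical helper name.

```python
# (concurrency, ITL in ms, throughput in tokens/s), copied from the
# higher-concurrency rows of the measurements directly above.
ROWS = [
    (50, 56.95, 793.95),
    (100, 61.72, 815.75),
    (150, 61.73, 815.38),
    (200, 61.73, 815.52),
    (250, 61.75, 828.23),
]


def active_streams(itl_ms: float, tokens_per_s: float) -> float:
    """Estimate concurrently decoding requests as tokens/s times seconds per token.

    Assumes each active request emits one token per ITL interval.
    """
    return tokens_per_s * itl_ms / 1000.0


for conc, itl, tput in ROWS:
    print(f"requested={conc:3d}  estimated active={active_streams(itl, tput):.1f}")
```

Under these assumptions the estimate stays close to 50 for every requested concurrency from 100 to 250, which would be consistent with the remaining requests queueing and accounting for the large TTFT values.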

llama-3.1-8b-instruct Results

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 8.98 | 4.37 | 227.44 |
| 5 | 15.31 | 5.1 | 971.34 |
| 25 | 21.08 | 6.02 | 4100.33 |
| 50 | 50.68 | 7.19 | 6745.78 |
| 100 | 209.66 | 8.62 | 10369.56 |
| 150 | 398.54 | 12.49 | 10383.16 |
| 200 | 501.44 | 17.76 | 9884.77 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 13.89 | 4.77 | 209.52 |
| 5 | 27.22 | 5.45 | 915.96 |
| 25 | 110.61 | 6.32 | 3924.39 |
| 50 | 112.02 | 8.3 | 5979.38 |
| 100 | 245.7 | 11.12 | 8878.96 |
| 150 | 5217.57 | 13.78 | 8955.7 |
| 200 | 9112.64 | 18.27 | 8443.89 |
| 250 | 21097.73 | 19.26 | 8256.58 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 19.97 | 4.77 | 208.88 |
| 5 | 57.87 | 5.45 | 908.9 |
| 25 | 170.91 | 6.57 | 3710.08 |
| 50 | 214.79 | 8.74 | 5582.41 |
| 100 | 373.73 | 12.41 | 7819.41 |
| 150 | 890.01 | 15.84 | 8935.06 |
| 200 | 3231.57 | 18.31 | 9151.55 |
| 250 | 6469.67 | 21.75 | 8725.87 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 82.87 | 6.04 | 161.51 |
| 5 | 249.06 | 7.04 | 664.39 |
| 25 | 372.59 | 11.61 | 2024.67 |
| 50 | 415.72 | 19.99 | 2398.11 |
| 100 | 501.51 | 33.71 | 2844.33 |
| 150 | 646.08 | 47.58 | 3047.56 |
| 200 | 6830.5 | 51.83 | 3008.29 |
| 250 | 14795.72 | 51.74 | 3030.76 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 443.88 | 11.3 | 86.79 |
| 5 | 1011.39 | 12.71 | 377.88 |
| 25 | 1432.91 | 21.41 | 1099.79 |
| 50 | 16505.76 | 32.16 | 1116.23 |
| 100 | 89292.14 | 32.21 | 1190.64 |
| 150 | 153371.08 | 32.17 | 1195.86 |
| 200 | 208680.03 | 32.14 | 1200.69 |
| 250 | 281213.01 | 32.19 | 1213.79 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 11.96 | 6.54 | 152.24 |
| 5 | 27.18 | 7.39 | 667.60 |
| 25 | 34.67 | 8.60 | 2858.90 |
| 50 | 56.65 | 10.04 | 4858.86 |
| 100 | 168.60 | 12.57 | 7478.74 |
| 150 | 479.48 | 14.66 | 8811.28 |
| 200 | 755.88 | 18.81 | 8864.97 |
| 250 | 957.59 | 24.77 | 8439.91 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 19.89 | 6.94 | 143.93 |
| 5 | 51.19 | 7.70 | 647.12 |
| 25 | 202.87 | 9.42 | 2625.41 |
| 50 | 329.35 | 11.62 | 4242.00 |
| 100 | 552.68 | 15.67 | 6268.04 |
| 150 | 782.65 | 20.32 | 7232.18 |
| 200 | 7996.68 | 21.76 | 7666.88 |
| 250 | 20815.08 | 21.96 | 7610.47 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 29.73 | 6.93 | 143.87 |
| 5 | 96.74 | 7.72 | 640.26 |
| 25 | 386.45 | 9.50 | 2530.05 |
| 50 | 625.88 | 12.03 | 3951.34 |
| 100 | 724.55 | 17.79 | 5326.15 |
| 150 | 1126.07 | 22.88 | 6241.50 |
| 200 | 1691.82 | 27.89 | 6548.23 |
| 250 | 5664.63 | 29.12 | 7150.32 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 128.32 | 8.14 | 119.30 |
| 5 | 503.39 | 8.88 | 506.79 |
| 25 | 1243.85 | 17.36 | 1259.32 |
| 50 | 1273.57 | 29.38 | 1561.19 |
| 100 | 6941.19 | 44.67 | 1689.99 |
| 150 | 21179.86 | 44.66 | 1690.45 |
| 200 | 35001.67 | 44.54 | 1694.93 |
| 250 | 49204.54 | 44.48 | 1703.87 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 637.84 | 13.16 | 74.21 |
| 5 | 1881.45 | 13.50 | 346.33 |
| 25 | 19476.78 | 31.76 | 579.50 |
| 50 | 92969.97 | 31.81 | 558.51 |
| 100 | 216580.75 | 31.83 | 582.54 |
| 150 | 301171.32 | 31.79 | 583.14 |
| 200 | 348383.77 | 31.81 | 582.88 |
| 250 | 484376.95 | 31.85 | 584.58 |

Hardware Specifications

| Specification | Value |
|---|---|
| Motherboard Model | NVIDIA DGX H100 |
| Server Model | NVIDIA DGX H100 |
| Number of Nodes | 1 |
| CPU Information | Platinum 8480CL @ 3.8 GHz Turbo (Sapphire Rapids), HT On |
| Number of CPU sockets enabled | 2 |
| Number of CPU threads enabled | 224 |
| GPU Information | H100 80GB HBM3 (GH100), 4 × 81559 MiB, 4 × 132 SM |
| Driver Information | 560.35.05 (r560_00) |
| GPU Core Clock (MHz) | 1980 |
| GPU Boost Clock (MHz) | 1980 |
| GPU Memory Clock (MHz) | 2619 |