Performance
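This section reports latency and throughput measurements for Llama models served with NVIDIA NIM. For each concurrency level, the tables list the time to first token (TTFT), the inter-token latency (ITL), and the throughput in tokens per second. For the benchmark procedure, see Using GenAI-Perf to Benchmark.
For the specifications of the hardware on which these measurements were collected, see the Hardware Specifications section.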
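As a reading aid for the tables that follow, the sketch below shows one common way to combine TTFT and ITL into an approximate end-to-end latency for a streamed request, and to derive a rough upper bound on throughput from concurrency and ITL. It is an illustrative approximation, not part of the benchmark process. The function names are hypothetical, and the output length of 200 tokens is an assumption, since this section does not state the input and output sequence lengths used for each table.

```python
def estimate_request_latency_ms(ttft_ms: float, itl_ms: float, output_tokens: int) -> float:
    """Approximate end-to-end latency of one streamed request.

    The first token arrives after TTFT; each of the remaining
    (output_tokens - 1) tokens arrives roughly one ITL after the previous one.
    """
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_ms + (output_tokens - 1) * itl_ms


def ideal_throughput_tokens_per_s(concurrency: int, itl_ms: float) -> float:
    """Rough upper bound on throughput if every concurrent request were
    continuously generating tokens (ignores TTFT and queueing time)."""
    if itl_ms <= 0:
        raise ValueError("itl_ms must be positive")
    return concurrency * 1000.0 / itl_ms


if __name__ == "__main__":
    # TTFT and ITL taken from the concurrency-1 row of the first table;
    # the 200-token output length is an assumed, hypothetical value.
    latency = estimate_request_latency_ms(368.11, 19.94, output_tokens=200)
    print(f"estimated latency: {latency:.2f} ms")

    # Concurrency-5 row of the same table (ITL 23.89 ms).
    bound = ideal_throughput_tokens_per_s(5, 23.89)
    print(f"ideal throughput bound: {bound:.2f} tokens/s")
```

Measured throughput sits below this bound because requests also spend time waiting for their first token. The gap widens sharply once TTFT grows into the seconds range at high concurrency.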
llama-3.3-70b-instruct Results
NIM Container: Llama-3.3-70b-Instruct
Version: 1.5.0
| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 368.11 | 19.94 | 48.44 |
| 5 | 446.68 | 23.89 | 199.17 |
| 25 | 751.74 | 42.12 | 564.5 |
| 50 | 1083.13 | 63.72 | 738.78 |
| 100 | 17754.93 | 87.55 | 777.31 |
| 150 | 46381.93 | 87.39 | 777.94 |
| 200 | 73326.6 | 87.64 | 777.02 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 71.82 | 19.26 | 51.84 |
| 5 | 109.62 | 20.04 | 237.72 |
| 25 | 159.62 | 23.28 | 1032.56 |
| 50 | 232.46 | 25.58 | 1880.75 |
| 100 | 293.19 | 32.83 | 2914.64 |
| 150 | 448.16 | 45.57 | 3096.87 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 32.78 | 19.11 | 52.15 |
| 5 | 81.17 | 19.9 | 247.26 |
| 25 | 247.81 | 21.99 | 1080.83 |
| 50 | 377.42 | 23.94 | 1936.88 |
| 100 | 583.54 | 31.79 | 2886.57 |
| 150 | 831.88 | 44.81 | 3067.43 |
| 200 | 925.3 | 45.92 | 3947.82 |
| 250 | 1105.17 | 49.83 | 4521.68 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 103.2 | 19.31 | 51.54 |
| 5 | 125.56 | 20.52 | 234.76 |
| 25 | 195.68 | 24.76 | 979.58 |
| 50 | 243.19 | 28.55 | 1706.26 |
| 100 | 344.86 | 38.65 | 2507.94 |
| 150 | 568.88 | 54.06 | 2663.03 |
| 200 | 1384.9 | 58.11 | 3238.57 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 190.36 | 19.6 | 49.57 |
| 5 | 430.84 | 21.64 | 216.96 |
| 25 | 870.9 | 35 | 654.15 |
| 50 | 1066.1 | 51.56 | 903.17 |
| 100 | 1332.94 | 88.91 | 1062.3 |
| 150 | 3853.57 | 124.93 | 1070.78 |
| 200 | 17105.6 | 125.13 | 1064.79 |
| 250 | 30309.49 | 125.27 | 1073.5 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 1637.36 | 20.5 | 46.92 |
| 5 | 3126.01 | 27 | 175.65 |
| 25 | 33539.02 | 52.4 | 323.8 |
| 50 | 127678.39 | 51.08 | 335.75 |
| 100 | 277353.39 | 48.51 | 345.84 |
| 150 | 339462.09 | 48.51 | 342.77 |
| 200 | 336890.58 | 51.03 | 338.3 |
| 250 | 515613.11 | 49.4 | 349.04 |
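Several of the tables show throughput levelling off while TTFT keeps climbing, which suggests requests are queueing rather than being served faster. The sketch below is one hedged way to locate that point programmatically. It uses the rows of the first llama-3.3-70b-instruct table above. The function name and the 10% growth threshold are illustrative assumptions, not values from the benchmark process.

```python
from typing import List, Optional, Tuple

# (concurrency, TTFT ms, ITL ms, throughput tokens/s), copied from the first
# llama-3.3-70b-instruct table in this section.
Row = Tuple[int, float, float, float]

ROWS: List[Row] = [
    (1, 368.11, 19.94, 48.44),
    (5, 446.68, 23.89, 199.17),
    (25, 751.74, 42.12, 564.5),
    (50, 1083.13, 63.72, 738.78),
    (100, 17754.93, 87.55, 777.31),
    (150, 46381.93, 87.39, 777.94),
    (200, 73326.6, 87.64, 777.02),
]


def saturation_concurrency(rows: List[Row], min_gain: float = 0.10) -> Optional[int]:
    """Return the first concurrency level after which throughput grows by
    less than `min_gain` (as a fraction) when moving to the next level.

    Returns None if throughput keeps growing by at least `min_gain` at
    every step. The 10% default threshold is an arbitrary assumption.
    """
    for (conc_prev, _, _, tput_prev), (_, _, _, tput_next) in zip(rows, rows[1:]):
        relative_gain = (tput_next - tput_prev) / tput_prev
        if relative_gain < min_gain:
            return conc_prev
    return None


if __name__ == "__main__":
    print(saturation_concurrency(ROWS))
```

With the default threshold this returns 50 for the data above: moving from 50 to 100 concurrent requests raises throughput by only about 5%, while TTFT rises from roughly 1.1 s to roughly 17.8 s.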
llama-3.1-70b-instruct Results
NIM Container: Llama-3.1-70b-Instruct
Version: 1.3.0
| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 25.05 | 13.21 | 75.36 |
| 5 | 64.82 | 12.64 | 387.55 |
| 25 | 146.11 | 15.31 | 1564.9 |
| 50 | 169.61 | 17.42 | 2743.6 |
| 100 | 235.17 | 23.28 | 4092.34 |
| 150 | 511.21 | 27.73 | 4962.78 |
| 200 | 582.07 | 31.22 | 5855.9 |
| 250 | 718.98 | 35.25 | 6429.95 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 38.64 | 14.24 | 70.14 |
| 5 | 109.06 | 13.68 | 364.24 |
| 25 | 393.58 | 15.86 | 1557.21 |
| 50 | 564.17 | 17.14 | 2868.22 |
| 100 | 640.42 | 22.64 | 4350.2 |
| 150 | 792.84 | 28.49 | 5183.08 |
| 200 | 1417.35 | 31.62 | 6168.81 |
| 250 | 2045.69 | 35.75 | 6511.88 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 59.89 | 14.24 | 69.98 |
| 5 | 180.35 | 13.72 | 360.12 |
| 25 | 571.16 | 16.38 | 1475.28 |
| 50 | 843.45 | 18.15 | 2631.13 |
| 100 | 740 | 25.55 | 3794.76 |
| 150 | 952.07 | 32.31 | 4493.57 |
| 200 | 980.89 | 37.92 | 5114.09 |
| 250 | 2632.73 | 41.92 | 5497.21 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 1018.5 | 30.32 | 32.45 |
| 5 | 2116.84 | 31.29 | 153.71 |
| 25 | 3987.46 | 43.16 | 541.65 |
| 50 | 7012.87 | 56.95 | 793.95 |
| 100 | 93862.88 | 61.72 | 815.75 |
| 150 | 175864.67 | 61.73 | 815.38 |
| 200 | 240432.76 | 61.73 | 815.52 |
| 250 | 335844.98 | 61.75 | 828.23 |
llama-3.1-8b-instruct Results
NIM Container: Llama-3.1-8b-Instruct
Version: 1.3.0
| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 8.98 | 4.37 | 227.44 |
| 5 | 15.31 | 5.1 | 971.34 |
| 25 | 21.08 | 6.02 | 4100.33 |
| 50 | 50.68 | 7.19 | 6745.78 |
| 100 | 209.66 | 8.62 | 10369.56 |
| 150 | 398.54 | 12.49 | 10383.16 |
| 200 | 501.44 | 17.76 | 9884.77 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 13.89 | 4.77 | 209.52 |
| 5 | 27.22 | 5.45 | 915.96 |
| 25 | 110.61 | 6.32 | 3924.39 |
| 50 | 112.02 | 8.3 | 5979.38 |
| 100 | 245.7 | 11.12 | 8878.96 |
| 150 | 5217.57 | 13.78 | 8955.7 |
| 200 | 9112.64 | 18.27 | 8443.89 |
| 250 | 21097.73 | 19.26 | 8256.58 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 19.97 | 4.77 | 208.88 |
| 5 | 57.87 | 5.45 | 908.9 |
| 25 | 170.91 | 6.57 | 3710.08 |
| 50 | 214.79 | 8.74 | 5582.41 |
| 100 | 373.73 | 12.41 | 7819.41 |
| 150 | 890.01 | 15.84 | 8935.06 |
| 200 | 3231.57 | 18.31 | 9151.55 |
| 250 | 6469.67 | 21.75 | 8725.87 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 82.87 | 6.04 | 161.51 |
| 5 | 249.06 | 7.04 | 664.39 |
| 25 | 372.59 | 11.61 | 2024.67 |
| 50 | 415.72 | 19.99 | 2398.11 |
| 100 | 501.51 | 33.71 | 2844.33 |
| 150 | 646.08 | 47.58 | 3047.56 |
| 200 | 6830.5 | 51.83 | 3008.29 |
| 250 | 14795.72 | 51.74 | 3030.76 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 443.88 | 11.3 | 86.79 |
| 5 | 1011.39 | 12.71 | 377.88 |
| 25 | 1432.91 | 21.41 | 1099.79 |
| 50 | 16505.76 | 32.16 | 1116.23 |
| 100 | 89292.14 | 32.21 | 1190.64 |
| 150 | 153371.08 | 32.17 | 1195.86 |
| 200 | 208680.03 | 32.14 | 1200.69 |
| 250 | 281213.01 | 32.19 | 1213.79 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 11.96 | 6.54 | 152.24 |
| 5 | 27.18 | 7.39 | 667.60 |
| 25 | 34.67 | 8.60 | 2858.90 |
| 50 | 56.65 | 10.04 | 4858.86 |
| 100 | 168.60 | 12.57 | 7478.74 |
| 150 | 479.48 | 14.66 | 8811.28 |
| 200 | 755.88 | 18.81 | 8864.97 |
| 250 | 957.59 | 24.77 | 8439.91 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 19.89 | 6.94 | 143.93 |
| 5 | 51.19 | 7.70 | 647.12 |
| 25 | 202.87 | 9.42 | 2625.41 |
| 50 | 329.35 | 11.62 | 4242.00 |
| 100 | 552.68 | 15.67 | 6268.04 |
| 150 | 782.65 | 20.32 | 7232.18 |
| 200 | 7996.68 | 21.76 | 7666.88 |
| 250 | 20815.08 | 21.96 | 7610.47 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 29.73 | 6.93 | 143.87 |
| 5 | 96.74 | 7.72 | 640.26 |
| 25 | 386.45 | 9.50 | 2530.05 |
| 50 | 625.88 | 12.03 | 3951.34 |
| 100 | 724.55 | 17.79 | 5326.15 |
| 150 | 1126.07 | 22.88 | 6241.50 |
| 200 | 1691.82 | 27.89 | 6548.23 |
| 250 | 5664.63 | 29.12 | 7150.32 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 128.32 | 8.14 | 119.30 |
| 5 | 503.39 | 8.88 | 506.79 |
| 25 | 1243.85 | 17.36 | 1259.32 |
| 50 | 1273.57 | 29.38 | 1561.19 |
| 100 | 6941.19 | 44.67 | 1689.99 |
| 150 | 21179.86 | 44.66 | 1690.45 |
| 200 | 35001.67 | 44.54 | 1694.93 |
| 250 | 49204.54 | 44.48 | 1703.87 |

| Concurrency | TTFT (ms) | ITL (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 637.84 | 13.16 | 74.21 |
| 5 | 1881.45 | 13.50 | 346.33 |
| 25 | 19476.78 | 31.76 | 579.50 |
| 50 | 92969.97 | 31.81 | 558.51 |
| 100 | 216580.75 | 31.83 | 582.54 |
| 150 | 301171.32 | 31.79 | 583.14 |
| 200 | 348383.77 | 31.81 | 582.88 |
| 250 | 484376.95 | 31.85 | 584.58 |
Hardware Specifications
| Component | Details |
|---|---|
| Motherboard Model | NVIDIA DGX H100 |
| Server Model | NVIDIA DGX H100 |
| Number of Nodes | 1 |
| CPU Information | Platinum 8480CL @ 3.8 GHz Turbo (Sapphire Rapids), HT On |
| Number of CPU sockets enabled | 2 |
| Number of CPU threads enabled | 224 |
| GPU Information | H100 80GB HBM3 (GH100), 4*81559 MiB, 4*132 SM |
| Driver Information | 560.35.05 (r560_00) |
| GPU Core Clock (MHz) | 1980 |
| GPU Boost Clock (MHz) | 1980 |
| GPU Memory Clock (MHz) | 2619 |