Performance for NVIDIA NeMo Retriever Embedding NIM#

To benchmark the performance of NVIDIA NeMo Retriever Embedding NIM, use the genai-perf tool, which is pre-installed in the Triton Server SDK container. The rest of this section uses genai-perf==0.0.11, the version packaged with Triton Server SDK 25.02.

To run a performance benchmark, first create a dataset of text examples that genai-perf can use when making requests to the embedding service. These examples should be representative of the type of data that you expect to receive in a production setting. The dataset should be formatted as a JSONL file where each line contains a {"text": ...} object, as shown in the following example.

Example#

Create a file named embeddings.jsonl that contains the following content.

{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}
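If your sample texts already live in a Python list or come from another source, a small helper can write them in the JSONL layout described above. The following is a minimal sketch, not part of genai-perf; the function name `write_embedding_dataset` and the output path are assumptions chosen for illustration.

```python
import json
from pathlib import Path


def write_embedding_dataset(texts, path):
    """Write one {"text": ...} JSON object per line and return the line count.

    Hypothetical helper for illustration; genai-perf only needs the resulting file.
    """
    path = Path(path)
    # Create the target directory (for example datasets/) if it does not exist.
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for text in texts:
            text = text.strip()
            if not text:
                continue  # skip empty entries so every line is a usable example
            handle.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, `write_embedding_dataset(["What was the first car ever driven?"], "datasets/embeddings.jsonl")` would create a one-line file in the datasets/ directory that the container example mounts.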

Use the following example to run the Triton Inference Server SDK Docker container. Mount the directory that contains your JSONL file; the example assumes it is a datasets/ directory under the current working directory.

export RELEASE="25.02"

docker run -it --rm \
  --gpus=all \
  --network="host" \
  --mount type=bind,source=${PWD}/datasets,target=/datasets \
  nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

Execute the following command to run a performance benchmark using the genai-perf command line tool.

genai-perf profile \
    -m nvidia/llama-nemotron-embed-1b-v2 \
    --service-kind openai \
    --endpoint-type embeddings \
    --batch-size 2 \
    --input-file /datasets/embeddings.jsonl \
    --extra-inputs input_type:query \
    --extra-inputs truncate:END \
    --concurrency 5 \
    --url http://localhost:8000

You can see the full set of command line options for genai-perf in the Command Line Options section of the GenAI-Perf documentation.
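The benchmark tables in this section vary concurrency, batch size, and input type, so it can help to generate one command per setting rather than editing the command by hand. The sketch below only builds command strings from the flags shown in the example above; it does not run them. The function names `build_profile_command` and `sweep_commands` and the default values are assumptions for illustration.

```python
import shlex


def build_profile_command(
    concurrency,
    batch_size=2,
    input_type="query",
    model="nvidia/llama-nemotron-embed-1b-v2",
    input_file="/datasets/embeddings.jsonl",
    url="http://localhost:8000",
):
    """Return the genai-perf argument list for one benchmark setting.

    Uses only the flags that appear in the example command above.
    """
    return [
        "genai-perf", "profile",
        "-m", model,
        "--service-kind", "openai",
        "--endpoint-type", "embeddings",
        "--batch-size", str(batch_size),
        "--input-file", input_file,
        "--extra-inputs", f"input_type:{input_type}",
        "--extra-inputs", "truncate:END",
        "--concurrency", str(concurrency),
        "--url", url,
    ]


def sweep_commands(concurrencies=(1, 3, 5), **kwargs):
    """Return one shell-quoted command line per concurrency level."""
    return [shlex.join(build_profile_command(c, **kwargs)) for c in concurrencies]


if __name__ == "__main__":
    # Print the commands; run them inside the SDK container, one at a time.
    for line in sweep_commands(batch_size=64, input_type="passage"):
        print(line)
```

Printing the commands instead of executing them keeps the sketch independent of where genai-perf is installed; inside the container, each argument list could be passed to `subprocess.run` instead.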

Benchmarks#

All latency measurements are reported in milliseconds.

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| passage | 20 | 1 | 1 | 7.8 | 8.0 | 8.0 | 8.3 | 126.0 |
| passage | 20 | 1 | 3 | 8.3 | 8.0 | 9.0 | 8.8 | 348.0 |
| passage | 20 | 1 | 5 | 8.8 | 8.0 | 9.0 | 10.3 | 547.0 |
| passage | 20 | 1 | 7 | 10.8 | 11.0 | 12.0 | 11.9 | 585.0 |
| passage | 20 | 1 | 9 | 13.2 | 13.0 | 15.0 | 15.1 | 643.0 |
| passage | 20 | 1 | 11 | 15.5 | 16.0 | 18.0 | 18.4 | 652.0 |
| passage | 20 | 1 | 13 | 17.5 | 17.0 | 21.0 | 21.5 | 673.0 |
| passage | 20 | 1 | 15 | 20.6 | 21.0 | 24.0 | 24.0 | 662.0 |
| passage | 300 | 64 | 1 | 69.5 | 69.0 | 71.0 | 72.3 | 896.0 |
| passage | 300 | 64 | 3 | 103.5 | 104.0 | 107.0 | 108.3 | 1779.0 |
| passage | 300 | 64 | 5 | 172.3 | 173.0 | 178.0 | 179.9 | 1799.0 |
| passage | 512 | 64 | 1 | 103.6 | 101.0 | 116.0 | 120.1 | 606.0 |
| passage | 512 | 64 | 3 | 180.6 | 182.0 | 192.0 | 193.6 | 1040.0 |
| passage | 512 | 64 | 5 | 298.6 | 300.0 | 318.0 | 325.1 | 1041.0 |
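When reading these columns together, a rough consistency check is that, with a fixed number of requests always in flight, throughput should be close to concurrency × batch size ÷ average latency. The sketch below applies that relationship to a few rows of the first benchmark table above. It is an approximation under that steady-state assumption, not the formula genai-perf uses to compute its reported throughput, and the function names are illustrative.

```python
def estimated_throughput(concurrency, batch_size, avg_latency_ms):
    """Approximate inputs per second, assuming `concurrency` requests of
    `batch_size` inputs are in flight at all times (a steady-state assumption)."""
    if avg_latency_ms <= 0:
        raise ValueError("avg_latency_ms must be positive")
    return concurrency * batch_size * 1000.0 / avg_latency_ms


def relative_gap(estimate, reported):
    """Relative difference between the estimate and the reported throughput."""
    return abs(estimate - reported) / reported


if __name__ == "__main__":
    # (concurrency, batch size, avg latency in ms, reported inputs/s),
    # copied from the first benchmark table.
    rows = [
        (1, 1, 7.8, 126.0),
        (3, 64, 103.5, 1779.0),
        (5, 64, 298.6, 1041.0),
    ]
    for concurrency, batch, latency, reported in rows:
        estimate = estimated_throughput(concurrency, batch, latency)
        gap = relative_gap(estimate, reported)
        print(f"c={concurrency} b={batch}: estimate {estimate:.0f}, "
              f"reported {reported:.0f}, gap {gap:.1%}")
```

For these three rows the estimate lands within about 5% of the reported value. The relationship also explains why throughput stops rising once latency starts growing in proportion to concurrency.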

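| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| passage | 20 | 1 | 1 | 8.0 | 8.0 | 9.0 | 8.9 | 122.0 |
| passage | 20 | 1 | 3 | 8.6 | 8.0 | 9.0 | 9.2 | 340.0 |
| passage | 20 | 1 | 5 | 8.8 | 9.0 | 9.0 | 9.2 | 548.0 |
| passage | 20 | 1 | 7 | 10.9 | 11.0 | 12.0 | 12.6 | 575.0 |
| passage | 20 | 1 | 9 | 13.3 | 14.0 | 15.0 | 15.2 | 638.0 |
| passage | 20 | 1 | 11 | 15.4 | 16.0 | 17.0 | 17.9 | 668.0 |
| passage | 20 | 1 | 13 | 17.6 | 17.0 | 21.0 | 21.3 | 688.0 |
| passage | 20 | 1 | 15 | 21.2 | 22.0 | 24.0 | 24.9 | 645.0 |
| passage | 300 | 64 | 1 | 70.9 | 71.0 | 73.0 | 74.3 | 880.0 |
| passage | 300 | 64 | 3 | 112.8 | 113.0 | 117.0 | 118.4 | 1652.0 |
| passage | 300 | 64 | 5 | 187.7 | 189.0 | 194.0 | 194.9 | 1653.0 |
| passage | 512 | 64 | 1 | 109.5 | 108.0 | 128.0 | 128.9 | 574.0 |
| passage | 512 | 64 | 3 | 202.5 | 203.0 | 216.0 | 219.4 | 919.0 |
| passage | 512 | 64 | 5 | 336.4 | 338.0 | 352.0 | 356.9 | 926.0 |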

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| passage | 20 | 1 | 1 | 8.2 | 8.0 | 9.0 | 8.6 | 120.0 |
| passage | 20 | 1 | 3 | 8.5 | 8.0 | 9.0 | 9.2 | 345.0 |
| passage | 20 | 1 | 5 | 9.4 | 9.0 | 10.0 | 10.8 | 508.0 |
| passage | 20 | 1 | 7 | 11.4 | 11.0 | 13.0 | 13.4 | 576.0 |
| passage | 20 | 1 | 9 | 14.4 | 15.0 | 17.0 | 16.9 | 584.0 |
| passage | 20 | 1 | 11 | 16.6 | 16.0 | 20.0 | 20.2 | 602.0 |
| passage | 20 | 1 | 13 | 20.2 | 21.0 | 24.0 | 24.2 | 590.0 |
| passage | 20 | 1 | 15 | 22.6 | 23.0 | 26.0 | 26.7 | 600.0 |
| passage | 300 | 64 | 1 | 68.8 | 68.0 | 72.0 | 73.1 | 907.0 |
| passage | 300 | 64 | 3 | 119.3 | 120.0 | 125.0 | 126.4 | 1552.0 |
| passage | 300 | 64 | 5 | 199.8 | 202.0 | 208.0 | 210.3 | 1555.0 |
| passage | 512 | 64 | 1 | 112.7 | 112.0 | 125.0 | 131.9 | 559.0 |
| passage | 512 | 64 | 3 | 229.5 | 232.0 | 242.0 | 246.6 | 814.0 |
| passage | 512 | 64 | 5 | 383.5 | 391.0 | 401.0 | 407.0 | 813.0 |

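| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| passage | 20 | 1 | 1 | 8.2 | 8.0 | 8.0 | 8.7 | 120.0 |
| passage | 20 | 1 | 3 | 8.3 | 8.0 | 9.0 | 8.8 | 339.0 |
| passage | 20 | 1 | 5 | 8.7 | 9.0 | 9.0 | 9.3 | 535.0 |
| passage | 20 | 1 | 7 | 10.7 | 11.0 | 12.0 | 11.8 | 597.0 |
| passage | 20 | 1 | 9 | 13.2 | 13.0 | 15.0 | 15.2 | 642.0 |
| passage | 20 | 1 | 11 | 15.5 | 16.0 | 18.0 | 18.3 | 654.0 |
| passage | 20 | 1 | 13 | 18.4 | 19.0 | 21.0 | 21.4 | 650.0 |
| passage | 20 | 1 | 15 | 20.8 | 21.0 | 24.0 | 24.5 | 668.0 |
| passage | 300 | 64 | 1 | 78.7 | 79.0 | 82.0 | 82.8 | 795.0 |
| passage | 300 | 64 | 3 | 149.7 | 151.0 | 156.0 | 157.8 | 1250.0 |
| passage | 300 | 64 | 5 | 249.8 | 254.0 | 260.0 | 261.9 | 1246.0 |
| passage | 512 | 64 | 1 | 129.6 | 127.0 | 142.0 | 150.2 | 487.0 |
| passage | 512 | 64 | 3 | 283.1 | 286.0 | 299.0 | 302.8 | 660.0 |
| passage | 512 | 64 | 5 | 474.0 | 482.0 | 499.0 | 502.4 | 658.0 |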

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| passage | 20 | 1 | 1 | 8.3 | 8.0 | 8.0 | 8.7 | 118.0 |
| passage | 20 | 1 | 3 | 9.0 | 9.0 | 9.0 | 9.5 | 320.0 |
| passage | 20 | 1 | 5 | 9.9 | 10.0 | 11.0 | 11.4 | 476.0 |
| passage | 20 | 1 | 7 | 11.5 | 12.0 | 14.0 | 13.7 | 573.0 |
| passage | 20 | 1 | 9 | 14.4 | 14.0 | 16.0 | 16.5 | 583.0 |
| passage | 20 | 1 | 11 | 17.8 | 18.0 | 21.0 | 21.7 | 518.0 |
| passage | 20 | 1 | 13 | 20.2 | 20.0 | 24.0 | 23.7 | 582.0 |
| passage | 20 | 1 | 15 | 20.5 | 21.0 | 24.0 | 24.7 | 664.0 |
| passage | 300 | 64 | 1 | 97.3 | 97.0 | 99.0 | 100.2 | 646.0 |
| passage | 300 | 64 | 3 | 200.0 | 202.0 | 204.0 | 204.9 | 940.0 |
| passage | 300 | 64 | 5 | 332.7 | 336.0 | 340.0 | 340.5 | 938.0 |
| passage | 512 | 64 | 1 | 199.9 | 189.0 | 261.0 | 268.1 | 317.0 |
| passage | 512 | 64 | 3 | 415.9 | 419.0 | 427.0 | 428.2 | 455.0 |
| passage | 512 | 64 | 5 | 690.1 | 701.0 | 711.0 | 712.8 | 453.0 |

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| passage | 20 | 1 | 1 | 8.6 | 9.0 | 9.0 | 9.1 | 114.0 |
| passage | 20 | 1 | 3 | 9.3 | 9.0 | 10.0 | 10.2 | 312.0 |
| passage | 20 | 1 | 5 | 10.5 | 10.0 | 11.0 | 11.2 | 459.0 |
| passage | 20 | 1 | 7 | 14.4 | 14.0 | 15.0 | 21.8 | 453.0 |
| passage | 20 | 1 | 9 | 17.6 | 18.0 | 20.0 | 20.2 | 476.0 |
| passage | 20 | 1 | 11 | 20.2 | 21.0 | 25.0 | 24.9 | 499.0 |
| passage | 20 | 1 | 13 | 24.5 | 24.0 | 27.0 | 28.4 | 480.0 |
| passage | 20 | 1 | 15 | 28.7 | 30.0 | 33.0 | 33.3 | 471.0 |
| passage | 300 | 64 | 1 | 124.1 | 124.0 | 126.0 | 127.2 | 508.0 |
| passage | 300 | 64 | 3 | 273.1 | 275.0 | 278.0 | 279.2 | 692.0 |
| passage | 300 | 64 | 5 | 451.2 | 458.0 | 463.0 | 463.7 | 692.0 |
| passage | 512 | 64 | 1 | 239.6 | 230.0 | 297.0 | 301.0 | 265.0 |
| passage | 512 | 64 | 3 | 529.9 | 534.0 | 541.0 | 542.7 | 357.0 |
| passage | 512 | 64 | 5 | 876.4 | 892.0 | 900.0 | 901.8 | 357.0 |

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| passage | 20 | 1 | 1 | 5.5 | 5.0 | 7.0 | 7.0 | 180.0 |
| passage | 20 | 1 | 3 | 7.0 | 7.0 | 7.0 | 7.1 | 421.0 |
| passage | 20 | 1 | 5 | 10.5 | 11.0 | 12.0 | 11.8 | 455.0 |
| passage | 20 | 1 | 7 | 12.6 | 12.0 | 16.0 | 16.4 | 529.0 |
| passage | 20 | 1 | 9 | 18.5 | 20.0 | 21.0 | 21.3 | 466.0 |
| passage | 20 | 1 | 11 | 19.6 | 19.0 | 23.0 | 25.2 | 524.0 |
| passage | 20 | 1 | 13 | 24.6 | 25.0 | 28.0 | 30.1 | 481.0 |
| passage | 20 | 1 | 15 | 26.4 | 28.0 | 30.0 | 30.3 | 509.0 |
| passage | 300 | 64 | 1 | 115.7 | 115.0 | 120.0 | 120.4 | 544.0 |
| passage | 300 | 64 | 3 | 229.7 | 231.0 | 236.0 | 239.2 | 817.0 |
| passage | 300 | 64 | 5 | 381.6 | 385.0 | 393.0 | 397.6 | 817.0 |
| passage | 512 | 64 | 1 | 199.6 | 195.0 | 237.0 | 242.4 | 317.0 |
| passage | 512 | 64 | 3 | 431.6 | 433.0 | 447.0 | 448.9 | 438.0 |
| passage | 512 | 64 | 5 | 715.0 | 724.0 | 740.0 | 746.6 | 437.0 |

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| passage | 20 | 1 | 1 | 9.6 | 9.0 | 10.0 | 11.0 | 103.0 |
| passage | 20 | 1 | 3 | 12.6 | 13.0 | 14.0 | 14.2 | 232.0 |
| passage | 20 | 1 | 5 | 20.6 | 22.0 | 23.0 | 23.1 | 237.0 |
| passage | 20 | 1 | 7 | 28.3 | 31.0 | 32.0 | 31.7 | 234.0 |
| passage | 20 | 1 | 9 | 35.1 | 36.0 | 41.0 | 41.8 | 244.0 |
| passage | 20 | 1 | 11 | 43.3 | 45.0 | 50.0 | 50.4 | 236.0 |
| passage | 20 | 1 | 13 | 52.6 | 54.0 | 59.0 | 58.8 | 234.0 |
| passage | 20 | 1 | 15 | 58.0 | 60.0 | 68.0 | 68.7 | 234.0 |
| passage | 300 | 64 | 1 | 304.9 | 305.0 | 309.0 | 310.3 | 209.0 |
| passage | 300 | 64 | 3 | 780.6 | 791.0 | 799.0 | 800.0 | 241.0 |
| passage | 300 | 64 | 5 | 1296.8 | 1320.0 | 1330.0 | 1331.8 | 241.0 |
| passage | 512 | 64 | 1 | 520.7 | 519.0 | 533.0 | 538.6 | 122.0 |
| passage | 512 | 64 | 3 | 1385.5 | 1404.0 | 1424.0 | 1428.4 | 137.0 |
| passage | 512 | 64 | 5 | 2294.3 | 2341.0 | 2362.0 | 2368.3 | 137.0 |

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| passage | 20 | 1 | 1 | 8.8 | 8.0 | 10.0 | 10.0 | 113.0 |
| passage | 20 | 1 | 3 | 11.1 | 11.0 | 12.0 | 11.8 | 259.0 |
| passage | 20 | 1 | 5 | 16.4 | 16.0 | 19.0 | 18.9 | 295.0 |
| passage | 20 | 1 | 7 | 23.3 | 22.0 | 26.0 | 26.8 | 290.0 |
| passage | 20 | 1 | 9 | 29.5 | 30.0 | 33.0 | 33.9 | 289.0 |
| passage | 20 | 1 | 11 | 34.2 | 34.0 | 40.0 | 40.6 | 302.0 |
| passage | 20 | 1 | 13 | 41.6 | 44.0 | 47.0 | 48.1 | 288.0 |
| passage | 20 | 1 | 15 | 45.7 | 48.0 | 55.0 | 55.2 | 300.0 |
| passage | 300 | 64 | 1 | 339.6 | 339.0 | 343.0 | 345.5 | 187.0 |
| passage | 300 | 64 | 3 | 918.3 | 927.0 | 933.0 | 934.6 | 207.0 |
| passage | 300 | 64 | 5 | 1517.3 | 1547.0 | 1554.0 | 1555.5 | 206.0 |
| passage | 512 | 64 | 1 | 642.3 | 643.0 | 657.0 | 661.9 | 99.0 |
| passage | 512 | 64 | 3 | 1795.3 | 1818.0 | 1835.0 | 1841.2 | 105.0 |
| passage | 512 | 64 | 5 | 2976.6 | 3034.0 | 3066.0 | 3075.5 | 105.0 |

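| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| query | 20 | 1 | 1 | 7.0 | 7.0 | 8.0 | 8.0 | 140.7 |
| query | 20 | 1 | 3 | 10.0 | 10.0 | 10.0 | 10.0 | 291.7 |
| query | 20 | 1 | 5 | 16.0 | 17.0 | 17.0 | 17.0 | 307.0 |
| query | 20 | 1 | 7 | 22.0 | 23.0 | 24.0 | 24.0 | 310.9 |
| query | 20 | 1 | 9 | 25.0 | 26.0 | 31.0 | 31.0 | 329.7 |
| query | 20 | 1 | 11 | 30.0 | 32.0 | 38.0 | 38.0 | 332.8 |
| query | 20 | 1 | 13 | 36.0 | 37.0 | 43.0 | 44.0 | 340.0 |
| query | 20 | 1 | 15 | 42.0 | 43.0 | 48.0 | 50.0 | 327.5 |
| passage | 300 | 64 | 1 | 159.0 | 159.0 | 162.0 | 163.0 | 401.8 |
| passage | 300 | 64 | 3 | 267.0 | 269.0 | 275.0 | 277.0 | 709.4 |
| passage | 300 | 64 | 5 | 392.0 | 390.0 | 401.0 | 411.0 | 814.5 |
| passage | 512 | 64 | 1 | 222.0 | 218.0 | 233.0 | 235.0 | 286.3 |
| passage | 512 | 64 | 3 | 431.0 | 430.0 | 450.0 | 455.0 | 440.1 |
| passage | 512 | 64 | 5 | 615.0 | 617.0 | 694.0 | 701.0 | 504.3 |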

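| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| query | 20 | 1 | 1 | 8.0 | 8.0 | 8.0 | 9.0 | 125.6 |
| query | 20 | 1 | 3 | 12.0 | 12.0 | 13.0 | 13.0 | 246.1 |
| query | 20 | 1 | 5 | 20.0 | 21.0 | 22.0 | 22.0 | 247.9 |
| query | 20 | 1 | 7 | 25.0 | 26.0 | 30.0 | 30.0 | 267.5 |
| query | 20 | 1 | 9 | 34.0 | 34.0 | 38.0 | 39.0 | 251.5 |
| query | 20 | 1 | 11 | 43.0 | 46.0 | 47.0 | 48.0 | 237.7 |
| query | 20 | 1 | 13 | 46.0 | 49.0 | 53.0 | 54.0 | 260.9 |
| query | 20 | 1 | 15 | 55.0 | 58.0 | 65.0 | 65.0 | 248.4 |
| passage | 300 | 64 | 1 | 186.0 | 186.0 | 190.0 | 192.0 | 342.7 |
| passage | 300 | 64 | 3 | 345.0 | 346.0 | 354.0 | 356.0 | 550.7 |
| passage | 300 | 64 | 5 | 525.0 | 524.0 | 535.0 | 544.0 | 608.3 |
| passage | 512 | 64 | 1 | 269.0 | 265.0 | 280.0 | 284.0 | 237.2 |
| passage | 512 | 64 | 3 | 563.0 | 564.0 | 581.0 | 584.0 | 337.3 |
| passage | 512 | 64 | 5 | 846.0 | 861.0 | 946.0 | 956.0 | 366.0 |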

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| query | 20 | 1 | 1 | 10.0 | 10.0 | 10.0 | 11.0 | 98.0 |
| query | 20 | 1 | 3 | 15.0 | 17.0 | 17.0 | 17.0 | 192.5 |
| query | 20 | 1 | 5 | 24.0 | 23.0 | 28.0 | 28.0 | 200.8 |
| query | 20 | 1 | 7 | 34.0 | 34.0 | 39.0 | 40.0 | 197.0 |
| query | 20 | 1 | 9 | 44.0 | 45.0 | 51.0 | 51.0 | 197.3 |
| query | 20 | 1 | 11 | 53.0 | 56.0 | 62.0 | 62.0 | 196.8 |
| query | 20 | 1 | 13 | 63.0 | 67.0 | 73.0 | 73.0 | 190.6 |
| query | 20 | 1 | 15 | 73.0 | 79.0 | 84.0 | 84.0 | 188.3 |
| passage | 300 | 64 | 1 | 277.0 | 278.0 | 280.0 | 281.0 | 230.3 |
| passage | 300 | 64 | 3 | 615.0 | 617.0 | 629.0 | 632.0 | 309.4 |
| passage | 300 | 64 | 5 | 976.0 | 975.0 | 983.0 | 987.0 | 327.4 |
| passage | 512 | 64 | 1 | 443.0 | 441.0 | 448.0 | 454.0 | 144.3 |
| passage | 512 | 64 | 3 | 1071.0 | 1077.0 | 1089.0 | 1090.0 | 177.6 |
| passage | 512 | 64 | 5 | 1736.0 | 1735.0 | 1752.0 | 1758.0 | 184.1 |

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| query | 20 | 1 | 1 | 12.0 | 12.0 | 12.0 | 12.0 | 85.7 |
| query | 20 | 1 | 3 | 19.0 | 21.0 | 22.0 | 22.0 | 150.3 |
| query | 20 | 1 | 5 | 32.0 | 29.0 | 36.0 | 36.0 | 152.8 |
| query | 20 | 1 | 7 | 46.0 | 50.0 | 50.0 | 50.0 | 148.0 |
| query | 20 | 1 | 9 | 55.0 | 57.0 | 65.0 | 65.0 | 156.3 |
| query | 20 | 1 | 11 | 68.0 | 72.0 | 79.0 | 79.0 | 154.5 |
| query | 20 | 1 | 13 | 79.0 | 86.0 | 93.0 | 93.0 | 154.4 |
| query | 20 | 1 | 15 | 84.0 | 86.0 | 95.0 | 101.0 | 165.0 |
| passage | 300 | 64 | 1 | 350.0 | 350.0 | 353.0 | 354.0 | 182.8 |
| passage | 300 | 64 | 3 | 821.0 | 823.0 | 831.0 | 842.0 | 231.7 |
| passage | 300 | 64 | 5 | 1320.0 | 1317.0 | 1330.0 | 1346.0 | 242.1 |
| passage | 512 | 64 | 1 | 567.0 | 566.0 | 573.0 | 575.0 | 112.7 |
| passage | 512 | 64 | 3 | 1440.0 | 1448.0 | 1464.0 | 1475.0 | 132.2 |
| passage | 512 | 64 | 5 | 2370.0 | 2371.0 | 2391.0 | 2396.0 | 134.8 |

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| query | 20 | 1 | 1 | 8.0 | 8.0 | 9.0 | 10.0 | 122.3 |
| query | 20 | 1 | 3 | 11.0 | 12.0 | 14.0 | 14.0 | 252.9 |
| query | 20 | 1 | 5 | 20.0 | 19.0 | 23.0 | 24.0 | 250.9 |
| query | 20 | 1 | 7 | 27.0 | 27.0 | 31.0 | 32.0 | 251.6 |
| query | 20 | 1 | 9 | 30.0 | 30.0 | 37.0 | 42.0 | 280.1 |
| query | 20 | 1 | 11 | 41.0 | 42.0 | 47.0 | 51.0 | 250.9 |
| query | 20 | 1 | 13 | 46.0 | 49.0 | 53.0 | 57.0 | 262.0 |
| query | 20 | 1 | 15 | 50.0 | 51.0 | 60.0 | 63.0 | 269.7 |
| passage | 300 | 64 | 1 | 325.0 | 324.0 | 331.0 | 333.0 | 196.9 |
| passage | 300 | 64 | 3 | 699.0 | 699.0 | 717.0 | 723.0 | 272.1 |
| passage | 300 | 64 | 5 | 1095.0 | 1093.0 | 1109.0 | 1113.0 | 291.8 |
| passage | 512 | 64 | 1 | 496.0 | 497.0 | 502.0 | 518.0 | 128.8 |
| passage | 512 | 64 | 3 | 1165.0 | 1169.0 | 1199.0 | 1209.0 | 163.2 |
| passage | 512 | 64 | 5 | 1872.0 | 1870.0 | 1904.0 | 1916.0 | 170.5 |

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| query | 20 | 1 | 1 | 10.0 | 9.0 | 10.0 | 10.0 | 104.7 |
| query | 20 | 1 | 3 | 12.0 | 12.0 | 12.0 | 12.0 | 246.7 |
| query | 20 | 1 | 5 | 18.0 | 19.0 | 20.0 | 20.0 | 276.7 |
| query | 20 | 1 | 7 | 25.0 | 24.0 | 28.0 | 28.0 | 268.9 |
| query | 20 | 1 | 9 | 30.0 | 31.0 | 36.0 | 36.0 | 291.4 |
| query | 20 | 1 | 11 | 38.0 | 39.0 | 44.0 | 44.0 | 270.3 |
| query | 20 | 1 | 13 | 43.0 | 44.0 | 52.0 | 52.0 | 272.4 |
| query | 20 | 1 | 15 | 49.0 | 52.0 | 56.0 | 57.0 | 276.4 |
| passage | 300 | 64 | 1 | 362.0 | 362.0 | 367.0 | 369.0 | 176.8 |
| passage | 300 | 64 | 3 | 786.0 | 789.0 | 803.0 | 812.0 | 241.9 |
| passage | 300 | 64 | 5 | 1240.0 | 1239.0 | 1258.0 | 1262.0 | 257.6 |
| passage | 512 | 64 | 1 | 550.0 | 548.0 | 565.0 | 570.0 | 116.2 |
| passage | 512 | 64 | 3 | 1327.0 | 1335.0 | 1354.0 | 1359.0 | 143.4 |
| passage | 512 | 64 | 5 | 2145.0 | 2145.0 | 2180.0 | 2187.0 | 149.0 |

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| query | 20 | 1 | 1 | 21.0 | 21.0 | 21.0 | 21.0 | 48.6 |
| query | 20 | 1 | 3 | 39.0 | 44.0 | 45.0 | 45.0 | 75.8 |
| query | 20 | 1 | 5 | 61.0 | 60.0 | 75.0 | 75.0 | 80.2 |
| query | 20 | 1 | 7 | 85.0 | 89.0 | 104.0 | 105.0 | 79.2 |
| query | 20 | 1 | 9 | 109.0 | 119.0 | 134.0 | 134.0 | 79.0 |
| query | 20 | 1 | 11 | 144.0 | 149.0 | 164.0 | 164.0 | 73.2 |
| query | 20 | 1 | 13 | 159.0 | 164.0 | 194.0 | 194.0 | 76.5 |
| query | 20 | 1 | 15 | 175.0 | 179.0 | 209.0 | 209.0 | 79.4 |
| passage | 300 | 64 | 1 | 888.0 | 888.0 | 899.0 | 899.0 | 72.1 |
| passage | 300 | 64 | 3 | 2272.0 | 2280.0 | 2329.0 | 2341.0 | 83.9 |
| passage | 300 | 64 | 5 | 3795.0 | 3801.0 | 3828.0 | 3840.0 | 84.2 |
| passage | 512 | 64 | 1 | 1451.0 | 1451.0 | 1471.0 | 1473.0 | 44.1 |
| passage | 512 | 64 | 3 | 3926.0 | 3947.0 | 3988.0 | 4009.0 | 48.6 |
| passage | 512 | 64 | 5 | 6577.0 | 6571.0 | 6632.0 | 6657.0 | 48.6 |

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| query | 20 | 1 | 1 | 13.0 | 13.0 | 14.0 | 15.0 | 74.2 |
| query | 20 | 1 | 3 | 21.0 | 23.0 | 23.0 | 23.0 | 138.0 |
| query | 20 | 1 | 5 | 32.0 | 31.0 | 39.0 | 39.0 | 152.2 |
| query | 20 | 1 | 7 | 46.0 | 46.0 | 53.0 | 54.0 | 145.8 |
| query | 20 | 1 | 9 | 56.0 | 61.0 | 69.0 | 69.0 | 150.3 |
| query | 20 | 1 | 11 | 67.0 | 69.0 | 76.0 | 77.0 | 154.0 |
| query | 20 | 1 | 13 | 86.0 | 91.0 | 98.0 | 99.0 | 141.2 |
| query | 20 | 1 | 15 | 98.0 | 105.0 | 113.0 | 114.0 | 141.0 |
| passage | 300 | 64 | 1 | 724.0 | 725.0 | 730.0 | 733.0 | 88.3 |
| passage | 300 | 64 | 3 | 1865.0 | 1876.0 | 1891.0 | 1892.0 | 102.1 |
| passage | 300 | 64 | 5 | 3117.0 | 3127.0 | 3147.0 | 3150.0 | 102.1 |
| passage | 512 | 64 | 1 | 1300.0 | 1300.0 | 1314.0 | 1318.0 | 49.2 |
| passage | 512 | 64 | 3 | 3551.0 | 3577.0 | 3606.0 | 3618.0 | 53.6 |
| passage | 512 | 64 | 5 | 5940.0 | 5968.0 | 6000.0 | 6007.0 | 53.6 |

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| query | 20 | 1 | 1 | 18.0 | 18.0 | 18.0 | 19.0 | 55.8 |
| query | 20 | 1 | 3 | 33.0 | 37.0 | 38.0 | 38.0 | 87.7 |
| query | 20 | 1 | 5 | 55.0 | 62.0 | 63.0 | 63.0 | 87.4 |
| query | 20 | 1 | 7 | 71.0 | 75.0 | 88.0 | 88.0 | 94.6 |
| query | 20 | 1 | 9 | 96.0 | 101.0 | 114.0 | 114.0 | 90.3 |
| query | 20 | 1 | 11 | 115.0 | 126.0 | 139.0 | 139.0 | 89.8 |
| query | 20 | 1 | 13 | 131.0 | 138.0 | 164.0 | 164.0 | 91.4 |
| query | 20 | 1 | 15 | 154.0 | 157.0 | 189.0 | 189.0 | 88.4 |
| passage | 300 | 64 | 1 | 1011.0 | 1011.0 | 1019.0 | 1019.0 | 63.3 |
| passage | 300 | 64 | 3 | 2702.0 | 2716.0 | 2734.0 | 2754.0 | 70.6 |
| passage | 300 | 64 | 5 | 4527.0 | 4527.0 | 4550.0 | 4571.0 | 70.6 |
| passage | 512 | 64 | 1 | 1705.0 | 1710.0 | 1733.0 | 1740.0 | 37.5 |
| passage | 512 | 64 | 3 | 4770.0 | 4799.0 | 4860.0 | 4872.0 | 39.9 |
| passage | 512 | 64 | 5 | 8001.0 | 8025.0 | 8079.0 | 8085.0 | 39.9 |

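| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| passage | 300 | 64 | 1 | 99.8 | 100.9 | 107.9 | 108.6 | 639.0 |
| passage | 300 | 64 | 3 | 143.8 | 143.3 | 156.6 | 159.0 | 1330.0 |
| passage | 300 | 64 | 5 | 239.7 | 239.7 | 259.1 | 265.0 | 1331.0 |
| passage | 512 | 64 | 1 | 114.6 | 114.4 | 115.9 | 117.0 | 556.5 |
| passage | 512 | 64 | 3 | 170.2 | 169.9 | 171.2 | 171.8 | 1124.2 |
| passage | 512 | 64 | 5 | 284.6 | 284.5 | 285.6 | 286.1 | 1121.4 |
| query | 20 | 1 | 1 | 5.1 | 5.1 | 5.4 | 5.4 | 196.3 |
| query | 20 | 1 | 3 | 6.0 | 5.5 | 7.4 | 7.6 | 498.5 |
| query | 20 | 1 | 5 | 11.9 | 12.3 | 12.8 | 12.9 | 418.3 |
| query | 20 | 1 | 7 | 16.5 | 17.2 | 18.0 | 18.1 | 422.0 |
| query | 20 | 1 | 9 | 21.4 | 22.3 | 23.3 | 23.6 | 418.3 |
| query | 20 | 1 | 11 | 26.0 | 26.0 | 28.4 | 28.6 | 421.3 |
| query | 20 | 1 | 13 | 30.7 | 30.9 | 33.1 | 33.6 | 422.2 |
| query | 20 | 1 | 15 | 37.3 | 37.9 | 39.1 | 39.3 | 401.4 |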

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| passage | 300 | 64 | 1 | 2554.3 | 2563.9 | 2678.1 | 2698.3 | 25.0 |
| passage | 300 | 64 | 3 | 7349.2 | 7502.1 | 7889.3 | 7968.1 | 25.5 |
| passage | 300 | 64 | 5 | 11913.2 | 12461.9 | 12893.4 | 12969.4 | 25.6 |
| passage | 512 | 64 | 1 | 3701.9 | 3701.6 | 3703.1 | 3703.4 | 17.3 |
| passage | 512 | 64 | 3 | 10730.2 | 10985.2 | 10987.0 | 11029.2 | 17.5 |
| passage | 512 | 64 | 5 | 17355.4 | 14691.3 | 22035.4 | 22035.7 | 17.4 |
| query | 20 | 1 | 1 | 32.4 | 32.4 | 32.7 | 32.8 | 30.7 |
| query | 20 | 1 | 3 | 82.5 | 85.6 | 85.9 | 86.0 | 36.3 |
| query | 20 | 1 | 5 | 135.5 | 142.9 | 143.3 | 143.3 | 36.8 |
| query | 20 | 1 | 7 | 191.7 | 200.2 | 200.5 | 200.6 | 36.5 |
| query | 20 | 1 | 9 | 246.9 | 257.4 | 257.8 | 257.9 | 36.4 |
| query | 20 | 1 | 11 | 301.7 | 314.6 | 315.1 | 315.2 | 36.4 |
| query | 20 | 1 | 13 | 356.6 | 371.6 | 372.2 | 372.4 | 36.4 |
| query | 20 | 1 | 15 | 409.5 | 401.4 | 429.8 | 429.9 | 36.5 |

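| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| passage | 300 | 64 | 1 | 176.5 | 177.1 | 188.6 | 190.4 | 362.2 |
| passage | 300 | 64 | 3 | 336.1 | 337.0 | 359.1 | 365.8 | 570.2 |
| passage | 300 | 64 | 5 | 560.2 | 562.9 | 592.3 | 634.6 | 569.8 |
| passage | 512 | 64 | 1 | 205.3 | 204.7 | 208.2 | 210.8 | 311.4 |
| passage | 512 | 64 | 3 | 410.9 | 411.1 | 412.5 | 412.7 | 466.4 |
| passage | 512 | 64 | 5 | 681.5 | 682.0 | 683.6 | 684.1 | 468.7 |
| query | 20 | 1 | 1 | 5.3 | 5.3 | 5.6 | 5.7 | 186.3 |
| query | 20 | 1 | 3 | 7.4 | 7.4 | 7.5 | 7.7 | 403.8 |
| query | 20 | 1 | 5 | 11.9 | 12.4 | 12.6 | 12.8 | 419.2 |
| query | 20 | 1 | 7 | 16.6 | 17.3 | 17.5 | 17.6 | 421.5 |
| query | 20 | 1 | 9 | 21.2 | 22.1 | 22.5 | 22.6 | 423.9 |
| query | 20 | 1 | 11 | 26.1 | 27.2 | 27.7 | 27.8 | 420.5 |
| query | 20 | 1 | 13 | 30.8 | 31.2 | 32.6 | 32.7 | 422.3 |
| query | 20 | 1 | 15 | 36.4 | 37.3 | 37.9 | 38.0 | 411.7 |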

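| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| passage | 300 | 64 | 1 | 188.4 | 191.7 | 197.7 | 198.8 | 338.7 |
| passage | 300 | 64 | 3 | 371.7 | 372.7 | 393.8 | 471.0 | 515.3 |
| passage | 300 | 64 | 5 | 619.5 | 621.7 | 648.5 | 728.8 | 515.1 |
| passage | 512 | 64 | 1 | 222.7 | 222.3 | 226.0 | 227.4 | 286.5 |
| passage | 512 | 64 | 3 | 447.0 | 447.0 | 448.7 | 449.4 | 428.4 |
| passage | 512 | 64 | 5 | 742.3 | 742.8 | 745.0 | 745.5 | 430.1 |
| query | 20 | 1 | 1 | 6.6 | 6.6 | 7.0 | 7.1 | 149.3 |
| query | 20 | 1 | 3 | 7.4 | 7.3 | 7.6 | 7.7 | 404.8 |
| query | 20 | 1 | 5 | 11.8 | 12.2 | 12.5 | 12.6 | 421.5 |
| query | 20 | 1 | 7 | 16.4 | 17.1 | 17.4 | 17.5 | 426.4 |
| query | 20 | 1 | 9 | 20.9 | 21.9 | 22.3 | 22.4 | 429.9 |
| query | 20 | 1 | 11 | 25.7 | 26.8 | 27.4 | 27.7 | 427.2 |
| query | 20 | 1 | 13 | 30.4 | 31.5 | 32.1 | 32.2 | 427.4 |
| query | 20 | 1 | 15 | 35.6 | 36.4 | 37.9 | 38.0 | 420.6 |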

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| passage | 300 | 64 | 1 | 377.9 | 376.4 | 396.5 | 404.5 | 169.1 |
| passage | 300 | 64 | 3 | 929.9 | 932.3 | 972.0 | 979.2 | 205.8 |
| passage | 300 | 64 | 5 | 1545.5 | 1555.2 | 1597.3 | 1610.3 | 205.9 |
| passage | 512 | 64 | 1 | 469.5 | 468.6 | 473.9 | 475.1 | 136.2 |
| passage | 512 | 64 | 3 | 1178.9 | 1182.4 | 1183.0 | 1183.2 | 162.2 |
| passage | 512 | 64 | 5 | 1958.1 | 1970.9 | 1971.6 | 1971.8 | 162.2 |
| query | 20 | 1 | 1 | 11.1 | 11.1 | 11.5 | 11.6 | 89.8 |
| query | 20 | 1 | 3 | 19.3 | 20.3 | 20.8 | 21.0 | 154.9 |
| query | 20 | 1 | 5 | 32.1 | 34.0 | 34.6 | 34.8 | 155.5 |
| query | 20 | 1 | 7 | 44.8 | 47.4 | 48.1 | 48.2 | 156.0 |
| query | 20 | 1 | 9 | 57.7 | 60.9 | 61.8 | 62.0 | 155.8 |
| query | 20 | 1 | 11 | 70.6 | 74.0 | 75.5 | 75.7 | 155.5 |
| query | 20 | 1 | 13 | 83.8 | 82.8 | 89.2 | 89.6 | 154.9 |
| query | 20 | 1 | 15 | 97.5 | 96.6 | 103.1 | 103.4 | 153.7 |

| Input Type | Input Tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|---|
| passage | 300 | 64 | 1 | 483.6 | 483.6 | 505.2 | 509.5 | 132.3 |
| passage | 300 | 64 | 3 | 1328.0 | 1334.4 | 1367.9 | 1379.6 | 143.9 |
| passage | 300 | 64 | 5 | 2181.8 | 2203.7 | 2241.9 | 2250.4 | 145.5 |
| passage | 512 | 64 | 1 | 633.8 | 633.8 | 638.6 | 639.4 | 100.9 |
| passage | 512 | 64 | 3 | 1744.9 | 1755.3 | 1761.5 | 1763.1 | 109.4 |
| passage | 512 | 64 | 5 | 2892.2 | 2923.9 | 2934.8 | 2936.8 | 109.4 |
| query | 20 | 1 | 1 | 8.0 | 8.0 | 8.3 | 8.3 | 124.1 |
| query | 20 | 1 | 3 | 11.2 | 12.2 | 12.6 | 12.8 | 266.1 |
| query | 20 | 1 | 5 | 19.9 | 20.6 | 21.1 | 21.2 | 250.3 |
| query | 20 | 1 | 7 | 27.6 | 28.9 | 29.4 | 29.6 | 253.0 |
| query | 20 | 1 | 9 | 35.1 | 36.7 | 37.3 | 37.5 | 256.1 |
| query | 20 | 1 | 11 | 42.7 | 44.6 | 45.5 | 45.7 | 256.9 |
| query | 20 | 1 | 13 | 50.7 | 50.3 | 54.0 | 54.2 | 255.9 |
| query | 20 | 1 | 15 | 57.4 | 57.9 | 62.2 | 62.5 | 261.0 |