Performance

You can use the genai-perf tool to benchmark the performance of the Text Reranking NIM under simulated production load. genai-perf comes pre-installed in the Triton Server SDK container.

To run a performance benchmark, first create a dataset of text examples that genai-perf can use when making requests to the ranking service. These examples should be representative of the type of data that you expect to receive in a production setting. The dataset should be formatted as JSONL files where each line contains a {"text": ...} object. You’ll need to create two files, queries.jsonl and passages.jsonl, in the same directory following this format. genai-perf will randomly assemble query-passage pairs for making requests to the service.

Example:

queries.jsonl

{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}

passages.jsonl

{"text": "Eric Anderson (born January 18, 1968) is an American sociologist and sexologist."}
{"text": "Kevin Loader is a British film and television producer."}
{"text": "Francisco Antonio Zea Juan Francisco Antonio Hilari was a Colombian journalist, botanist, diplomat, politician, and statesman who served as the 1st Vice President of Colombia."}
{"text": "Daddys Home 2 Principal photography on the film began in Massachusetts in March 2017 and it was released in the United States by Paramount Pictures on November 10, 2017. Although the film received unfavorable reviews, it has grossed over $180 million worldwide on a $69 million budget."}
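One way to create the two files from the shell is sketched below. It writes a shortened version of the example data above into a datasets/ directory under the current working directory and then counts lines that do not start with a {"text": key. The directory name and the sanity check are assumptions made for illustration, not part of genai-perf.

```shell
# Create the dataset directory (the name "datasets" is an illustrative choice).
mkdir -p datasets

# A quoted heredoc delimiter ('EOF') keeps the JSON lines literal (no shell expansion).
cat > datasets/queries.jsonl <<'EOF'
{"text": "What was the first car ever driven?"}
{"text": "Is the Sydney Opera House located in Australia?"}
EOF

cat > datasets/passages.jsonl <<'EOF'
{"text": "Eric Anderson (born January 18, 1968) is an American sociologist and sexologist."}
{"text": "Kevin Loader is a British film and television producer."}
EOF

# Sanity check: count lines that do not begin with {"text": (expected to be 0).
bad_lines=$(awk 'index($0, "{\"text\": ") != 1 { n++ } END { print n + 0 }' \
  datasets/queries.jsonl datasets/passages.jsonl)
echo "malformed lines: ${bad_lines}"
```

In practice, replace the sample lines with queries and passages drawn from your own production data.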

Run the Triton Inference Server SDK Docker container as shown in the following example, mounting the directory that contains your JSONL files (datasets/ in this example).

export RELEASE="yy.mm" # e.g. export RELEASE="24.10"

docker run -it --rm \
  --gpus=all \
  --network="host" \
  --mount type=bind,source=${PWD}/datasets,target=/datasets \
  nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

Execute the following command to run a performance benchmark using the genai-perf command line tool.

genai-perf profile \
    -m nvidia/nv-rerankqa-mistral-4b-v3 \
    --service-kind openai \
    --endpoint-type rankings \
    --batch-size 10 \
    --input-file /datasets/ \
    --extra-inputs truncate:END \
    --concurrency 5 \
    --url http://localhost:8000

You can see the full set of command line options for genai-perf in the Command Line Options section of the GenAI-Perf documentation.
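To compare several load levels, the command above can be wrapped in a loop. The following is a minimal sketch that reuses only the flags shown above and iterates over batch sizes 10, 20, and 40 at concurrency levels 1, 3, and 5, the combinations that appear in the benchmark tables on this page. The run_sweep function name and the GENAI_PERF override variable are hypothetical helpers introduced here, and whether the published numbers were collected with a loop like this is an assumption.

```shell
# Override GENAI_PERF (for example GENAI_PERF=echo) to print the commands
# instead of running them. This variable is a convenience of this sketch.
GENAI_PERF="${GENAI_PERF:-genai-perf}"

run_sweep() {
  local batch_size concurrency
  for batch_size in 10 20 40; do
    for concurrency in 1 3 5; do
      # Same invocation as above; only batch size and concurrency vary.
      "$GENAI_PERF" profile \
        -m nvidia/nv-rerankqa-mistral-4b-v3 \
        --service-kind openai \
        --endpoint-type rankings \
        --batch-size "$batch_size" \
        --input-file /datasets/ \
        --extra-inputs truncate:END \
        --concurrency "$concurrency" \
        --url http://localhost:8000
    done
  done
}
```

Inside the SDK container, call run_sweep to execute all nine runs. For a dry run that only prints the nine commands, set GENAI_PERF=echo before calling it.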

Benchmarks

All latency measurements are reported in milliseconds.

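| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 33.0 | 31.0 | 37.0 | 37.0 | 307.3 |
| 512 | 10 | 3 | 62.0 | 63.0 | 72.0 | 74.0 | 474.9 |
| 512 | 10 | 5 | 103.0 | 104.0 | 113.0 | 116.0 | 471.7 |
| 512 | 20 | 1 | 57.0 | 59.0 | 64.0 | 65.0 | 351.1 |
| 512 | 20 | 3 | 124.0 | 123.0 | 139.0 | 140.0 | 475.7 |
| 512 | 20 | 5 | 206.0 | 207.0 | 227.0 | 230.0 | 477.7 |
| 512 | 40 | 1 | 99.0 | 99.0 | 109.0 | 110.0 | 402.8 |
| 512 | 40 | 3 | 244.0 | 248.0 | 261.0 | 267.0 | 483.4 |
| 512 | 40 | 5 | 405.0 | 414.0 | 429.0 | 434.0 | 483.7 |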

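| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 40.0 | 39.0 | 43.0 | 44.0 | 250.4 |
| 512 | 10 | 3 | 88.0 | 89.0 | 96.0 | 104.0 | 337.7 |
| 512 | 10 | 5 | 145.0 | 148.0 | 153.0 | 154.0 | 337.1 |
| 512 | 20 | 1 | 71.0 | 74.0 | 78.0 | 79.0 | 280.1 |
| 512 | 20 | 3 | 171.0 | 172.0 | 184.0 | 187.0 | 345.0 |
| 512 | 20 | 5 | 284.0 | 288.0 | 305.0 | 310.0 | 345.3 |
| 512 | 40 | 1 | 129.0 | 129.0 | 139.0 | 140.0 | 309.5 |
| 512 | 40 | 3 | 341.0 | 346.0 | 358.0 | 361.0 | 347.5 |
| 512 | 40 | 5 | 565.0 | 575.0 | 592.0 | 601.0 | 347.2 |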

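| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 58.0 | 58.0 | 61.0 | 61.0 | 170.8 |
| 512 | 10 | 3 | 149.0 | 150.0 | 159.0 | 204.0 | 197.8 |
| 512 | 10 | 5 | 246.0 | 250.0 | 257.0 | 260.0 | 198.8 |
| 512 | 20 | 1 | 119.0 | 117.0 | 123.0 | 124.0 | 168.5 |
| 512 | 20 | 3 | 315.0 | 319.0 | 325.0 | 326.0 | 187.9 |
| 512 | 20 | 5 | 523.0 | 532.0 | 540.0 | 540.0 | 187.5 |
| 512 | 40 | 1 | 234.0 | 234.0 | 242.0 | 243.0 | 171.1 |
| 512 | 40 | 3 | 652.0 | 661.0 | 670.0 | 674.0 | 181.6 |
| 512 | 40 | 5 | 1080.0 | 1101.0 | 1118.0 | 1120.0 | 181.4 |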

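| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 78.0 | 78.0 | 80.0 | 80.0 | 128.2 |
| 512 | 10 | 3 | 205.0 | 207.0 | 210.0 | 212.0 | 144.8 |
| 512 | 10 | 5 | 340.0 | 346.0 | 354.0 | 357.0 | 144.0 |
| 512 | 20 | 1 | 158.0 | 157.0 | 162.0 | 163.0 | 126.8 |
| 512 | 20 | 3 | 430.0 | 436.0 | 443.0 | 446.0 | 137.2 |
| 512 | 20 | 5 | 716.0 | 728.0 | 739.0 | 744.0 | 136.9 |
| 512 | 40 | 1 | 312.0 | 312.0 | 320.0 | 321.0 | 128.0 |
| 512 | 40 | 3 | 886.0 | 896.0 | 907.0 | 910.0 | 134.0 |
| 512 | 40 | 5 | 1463.0 | 1492.0 | 1512.0 | 1515.0 | 134.0 |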

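| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 74.0 | 74.0 | 79.0 | 80.0 | 134.8 |
| 512 | 10 | 3 | 185.0 | 188.0 | 202.0 | 244.0 | 157.3 |
| 512 | 10 | 5 | 311.0 | 315.0 | 325.0 | 329.0 | 157.6 |
| 512 | 20 | 1 | 139.0 | 137.0 | 149.0 | 150.0 | 143.5 |
| 512 | 20 | 3 | 371.0 | 373.0 | 394.0 | 398.0 | 159.8 |
| 512 | 20 | 5 | 615.0 | 622.0 | 644.0 | 648.0 | 159.5 |
| 512 | 40 | 1 | 267.0 | 266.0 | 286.0 | 290.0 | 149.8 |
| 512 | 40 | 3 | 744.0 | 752.0 | 788.0 | 799.0 | 159.1 |
| 512 | 40 | 5 | 1231.0 | 1252.0 | 1296.0 | 1316.0 | 159.0 |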

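| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 79.0 | 79.0 | 84.0 | 85.0 | 126.4 |
| 512 | 10 | 3 | 203.0 | 206.0 | 217.0 | 271.0 | 144.5 |
| 512 | 10 | 5 | 339.0 | 346.0 | 354.0 | 358.0 | 144.4 |
| 512 | 20 | 1 | 151.0 | 151.0 | 160.0 | 161.0 | 132.0 |
| 512 | 20 | 3 | 405.0 | 411.0 | 428.0 | 435.0 | 145.4 |
| 512 | 20 | 5 | 672.0 | 685.0 | 710.0 | 717.0 | 145.5 |
| 512 | 40 | 1 | 289.0 | 287.0 | 307.0 | 310.0 | 138.2 |
| 512 | 40 | 3 | 811.0 | 817.0 | 842.0 | 851.0 | 146.3 |
| 512 | 40 | 5 | 1340.0 | 1357.0 | 1405.0 | 1412.0 | 146.2 |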

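| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 230.0 | 229.0 | 240.0 | 244.0 | 43.5 |
| 512 | 10 | 3 | 639.0 | 644.0 | 667.0 | 675.0 | 46.5 |
| 512 | 10 | 5 | 1055.0 | 1076.0 | 1101.0 | 1107.0 | 46.4 |
| 512 | 20 | 1 | 447.0 | 442.0 | 464.0 | 468.0 | 44.7 |
| 512 | 20 | 3 | 1257.0 | 1276.0 | 1310.0 | 1320.0 | 46.9 |
| 512 | 20 | 5 | 2088.0 | 2126.0 | 2172.0 | 2185.0 | 46.9 |
| 512 | 40 | 1 | 877.0 | 879.0 | 902.0 | 906.0 | 45.6 |
| 512 | 40 | 3 | 2534.0 | 2566.0 | 2605.0 | 2617.0 | 46.7 |
| 512 | 40 | 5 | 4194.0 | 4273.0 | 4321.0 | 4339.0 | 46.7 |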

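| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 190.0 | 187.0 | 197.0 | 199.0 | 52.7 |
| 512 | 10 | 3 | 534.0 | 543.0 | 554.0 | 557.0 | 55.2 |
| 512 | 10 | 5 | 888.0 | 907.0 | 924.0 | 925.0 | 55.2 |
| 512 | 20 | 1 | 383.0 | 384.0 | 395.0 | 397.0 | 52.1 |
| 512 | 20 | 3 | 1117.0 | 1134.0 | 1150.0 | 1153.0 | 53.0 |
| 512 | 20 | 5 | 1850.0 | 1885.0 | 1908.0 | 1916.0 | 53.0 |
| 512 | 40 | 1 | 768.0 | 768.0 | 785.0 | 787.0 | 52.1 |
| 512 | 40 | 3 | 2254.0 | 2279.0 | 2302.0 | 2308.0 | 52.7 |
| 512 | 40 | 5 | 3712.0 | 3795.0 | 3834.0 | 3839.0 | 52.7 |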

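| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 188.0 | 188.0 | 196.0 | 197.0 | 53.2 |
| 512 | 10 | 3 | 533.0 | 541.0 | 554.0 | 559.0 | 55.4 |
| 512 | 10 | 5 | 884.0 | 902.0 | 917.0 | 922.0 | 55.4 |
| 512 | 20 | 1 | 383.0 | 382.0 | 396.0 | 399.0 | 52.2 |
| 512 | 20 | 3 | 1119.0 | 1134.0 | 1147.0 | 1152.0 | 53.1 |
| 512 | 20 | 5 | 1848.0 | 1884.0 | 1910.0 | 1916.0 | 53.0 |
| 512 | 40 | 1 | 767.0 | 767.0 | 784.0 | 788.0 | 52.2 |
| 512 | 40 | 3 | 2245.0 | 2276.0 | 2301.0 | 2308.0 | 52.8 |
| 512 | 40 | 5 | 3714.0 | 3790.0 | 3825.0 | 3831.0 | 52.8 |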

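| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 59.2 | 58.6 | 59.4 | 59.7 | 168.6 |
| 512 | 20 | 1 | 109.8 | 108.6 | 109.7 | 110.3 | 181.9 |
| 512 | 40 | 1 | 199.8 | 198.3 | 199.5 | 200.2 | 200.0 |
| 512 | 10 | 3 | 145.1 | 145.1 | 145.8 | 146.0 | 206.5 |
| 512 | 20 | 3 | 277.4 | 277.3 | 278.9 | 279.4 | 216.0 |
| 512 | 40 | 3 | 557.2 | 557.9 | 559.8 | 560.5 | 215.1 |
| 512 | 10 | 5 | 230.6 | 230.5 | 231.9 | 232.5 | 216.6 |
| 512 | 20 | 5 | 472.3 | 472.7 | 474.4 | 474.9 | 211.5 |
| 512 | 40 | 5 | 934.6 | 935.9 | 938.6 | 939.1 | 213.7 |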

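| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 75.9 | 75.1 | 76.7 | 77.6 | 131.5 |
| 512 | 20 | 1 | 145.4 | 143.9 | 146.8 | 148.9 | 137.4 |
| 512 | 40 | 1 | 266.0 | 264.8 | 268.8 | 271.4 | 150.2 |
| 512 | 10 | 3 | 203.4 | 203.5 | 204.9 | 205.3 | 147.3 |
| 512 | 20 | 3 | 380.3 | 380.6 | 383.4 | 384.4 | 157.6 |
| 512 | 40 | 3 | 768.8 | 770.2 | 772.7 | 773.1 | 155.8 |
| 512 | 10 | 5 | 316.0 | 316.2 | 319.6 | 320.4 | 158.1 |
| 512 | 20 | 5 | 654.0 | 654.6 | 657.1 | 657.7 | 152.7 |
| 512 | 40 | 5 | 1291.4 | 1293.7 | 1297.1 | 1297.5 | 154.6 |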

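| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 145.1 | 145.2 | 148.5 | 149.2 | 68.9 |
| 512 | 20 | 1 | 297.2 | 295.7 | 298.4 | 299.1 | 67.3 |
| 512 | 40 | 1 | 585.3 | 586.1 | 591.0 | 591.8 | 68.3 |
| 512 | 10 | 3 | 425.0 | 425.9 | 434.5 | 435.7 | 70.5 |
| 512 | 20 | 3 | 857.7 | 860.4 | 870.1 | 871.8 | 69.8 |
| 512 | 40 | 3 | 1752.4 | 1773.1 | 1787.9 | 1791.5 | 68.2 |
| 512 | 10 | 5 | 714.2 | 717.6 | 724.4 | 725.5 | 69.9 |
| 512 | 20 | 5 | 1456.2 | 1461.0 | 1474.4 | 1477.2 | 68.5 |
| 512 | 40 | 5 | 2909.7 | 2877.8 | 3558.4 | 3564.7 | 68.1 |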

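| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 220.1 | 219.6 | 222.0 | 224.4 | 45.4 |
| 512 | 20 | 1 | 459.9 | 459.4 | 463.1 | 463.7 | 43.5 |
| 512 | 40 | 1 | 918.5 | 923.7 | 932.3 | 934.3 | 43.5 |
| 512 | 10 | 3 | 664.0 | 667.2 | 677.6 | 678.5 | 45.1 |
| 512 | 20 | 3 | 1378.7 | 1394.0 | 1409.0 | 1410.9 | 43.4 |
| 512 | 40 | 3 | 2782.8 | 2803.1 | 2838.5 | 2839.8 | 42.9 |
| 512 | 10 | 5 | 1138.4 | 1151.4 | 1164.2 | 1165.4 | 43.8 |
| 512 | 20 | 5 | 2335.1 | 2351.6 | 2374.7 | 2377.9 | 42.7 |
| 512 | 40 | 5 | 4653.8 | 4701.1 | 4757.4 | 4758.4 | 42.7 |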

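| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 179.8 | 175.2 | 177.8 | 240.4 | 55.6 |
| 512 | 20 | 1 | 351.2 | 343.6 | 345.8 | 451.4 | 56.9 |
| 512 | 40 | 1 | 672.7 | 658.6 | 715.5 | 738.2 | 59.5 |
| 512 | 10 | 3 | 498.5 | 499.7 | 502.7 | 503.2 | 60.1 |
| 512 | 20 | 3 | 968.1 | 970.2 | 971.6 | 971.9 | 61.9 |
| 512 | 40 | 3 | 1952.9 | 1961.7 | 1963.7 | 1964.3 | 61.2 |
| 512 | 10 | 5 | 804.9 | 806.1 | 808.1 | 808.8 | 62.0 |
| 512 | 20 | 5 | 1624.4 | 1631.2 | 1633.3 | 1633.8 | 61.3 |
| 512 | 40 | 5 | 3259.6 | 3277.6 | 3282.3 | 3282.8 | 61.1 |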

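| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 219.9 | 201.0 | 287.8 | 288.7 | 45.5 |
| 512 | 20 | 1 | 436.9 | 417.7 | 544.2 | 566.5 | 45.8 |
| 512 | 40 | 1 | 836.3 | 803.6 | 1010.0 | 1144.6 | 47.8 |
| 512 | 10 | 3 | 613.9 | 623.1 | 643.7 | 644.4 | 48.8 |
| 512 | 20 | 3 | 1227.0 | 1248.9 | 1275.5 | 1278.1 | 48.8 |
| 512 | 40 | 3 | 2451.1 | 2529.2 | 2536.7 | 2538.0 | 48.7 |
| 512 | 10 | 5 | 1007.0 | 1035.5 | 1039.8 | 1040.8 | 49.6 |
| 512 | 20 | 5 | 2069.8 | 2088.5 | 2181.8 | 2219.3 | 48.2 |
| 512 | 40 | 5 | 4099.7 | 4231.8 | 4256.0 | 4288.3 | 48.3 |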

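| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 591.1 | 586.2 | 590.3 | 594.1 | 16.9 |
| 512 | 20 | 1 | 1187.8 | 1173.2 | 1175.5 | 1362.1 | 16.8 |
| 512 | 40 | 1 | 2396.1 | 2346.0 | 2535.7 | 2536.2 | 16.7 |
| 512 | 10 | 3 | 1711.4 | 1720.3 | 1720.7 | 1720.7 | 17.4 |
| 512 | 20 | 3 | 3422.8 | 3440.8 | 3441.2 | 3441.3 | 17.4 |
| 512 | 40 | 3 | 6844.3 | 6880.9 | 6881.5 | 6881.6 | 17.4 |
| 512 | 10 | 5 | 2837.1 | 2867.2 | 2867.5 | 2867.7 | 17.4 |
| 512 | 20 | 5 | 5674.2 | 5734.9 | 5735.3 | 5735.6 | 17.2 |
| 512 | 40 | 5 | 11343.2 | 11467.9 | 11468.5 | 11468.9 | 17.4 |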

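| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 Latency (ms) | P90 Latency (ms) | P95 Latency (ms) | Throughput (inputs/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 10 | 1 | 705.6 | 676.9 | 916.8 | 921.1 | 14.2 |
| 512 | 20 | 1 | 1414.1 | 1363.6 | 1589.5 | 1605.7 | 14.1 |
| 512 | 40 | 1 | 2758.9 | 2719.8 | 2937.8 | 2944.9 | 14.5 |
| 512 | 10 | 3 | 2004.2 | 2018.6 | 2021.6 | 2022.5 | 14.9 |
| 512 | 20 | 3 | 4012.6 | 4040.1 | 4045.6 | 4048.0 | 14.7 |
| 512 | 40 | 3 | 8019.4 | 8075.2 | 8082.2 | 8087.4 | 14.9 |
| 512 | 10 | 5 | 3321.0 | 3365.0 | 3370.8 | 3372.7 | 14.9 |
| 512 | 20 | 5 | 6645.1 | 6734.9 | 6741.6 | 6742.8 | 14.9 |
| 512 | 40 | 5 | 13284.7 | 13461.9 | 13473.1 | 13475.7 | 14.9 |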
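
As a rough cross-check when reading the benchmark results, throughput can be estimated from the other columns as batch size × concurrency ÷ average latency (in seconds). This relationship is not stated by the source; it is an approximation that assumes every concurrent client keeps exactly one request in flight at all times, so expect small deviations from the reported values. The estimate_throughput function name is a hypothetical helper for this sketch.

```shell
# Estimate throughput (inputs/s) from batch size, concurrency, and average latency.
# $1 = batch size, $2 = concurrency, $3 = average latency in milliseconds
estimate_throughput() {
  awk -v b="$1" -v c="$2" -v l="$3" \
    'BEGIN { printf "%.1f\n", b * c / (l / 1000) }'
}

# Batch size 10, concurrency 1, average latency 705.6 ms (a row reported at 14.2 inputs/s).
estimate_throughput 10 1 705.6
```

When the estimate and the measured throughput stay flat while concurrency increases, the service is likely saturated and additional concurrent requests mostly add queueing latency.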