Text Reranking (Latest)

Performance

You can use the genai-perf tool to benchmark the performance of the Text Reranking NIM under simulated production load. genai-perf comes pre-installed in the Triton Inference Server SDK container.

To run a performance benchmark, first create a dataset of text examples that genai-perf can use when making requests to the ranking service. These examples should be representative of the type of data that you expect to receive in a production setting. The dataset should be formatted as JSONL files where each line contains a {"text": ...} object. You’ll need to create two files, queries.jsonl and passages.jsonl, in the same directory following this format. genai-perf will randomly assemble query-passage pairs for making requests to the service.

Example:

queries.jsonl

{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}

passages.jsonl

{"text": "Eric Anderson (born January 18, 1968) is an American sociologist and sexologist."}
{"text": "Kevin Loader is a British film and television producer."}
{"text": "Francisco Antonio Zea Juan Francisco Antonio Hilari was a Colombian journalist, botanist, diplomat, politician, and statesman who served as the 1st Vice President of Colombia."}
{"text": "Daddys Home 2 Principal photography on the film began in Massachusetts in March 2017 and it was released in the United States by Paramount Pictures on November 10, 2017. Although the film received unfavorable reviews, it has grossed over $180 million worldwide on a $69 million budget."}
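If your sample texts already live in Python lists, the two files can be generated and checked programmatically. The following is a minimal sketch, not part of genai-perf: the helper names `write_jsonl` and `check_jsonl` are hypothetical, and the `datasets/` output directory is assumed only because it matches the directory mounted in the next step.

```python
import json
from pathlib import Path


def write_jsonl(path, texts):
    """Write one {"text": ...} JSON object per line, as described above."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes quotes and newlines, so each record stays on one line.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def check_jsonl(path):
    """Return the record count; raise ValueError if a line is not a {"text": str} object."""
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)  # JSONDecodeError is a subclass of ValueError
            if not isinstance(record, dict) or not isinstance(record.get("text"), str):
                raise ValueError(f"{path}:{line_no}: expected an object with a string 'text' field")
            count += 1
    return count


if __name__ == "__main__":
    # Illustrative data only; replace with texts representative of production traffic.
    queries = [
        "What was the first car ever driven?",
        "Is the Sydney Opera House located in Australia?",
    ]
    passages = [
        "Kevin Loader is a British film and television producer.",
        "Eric Anderson (born January 18, 1968) is an American sociologist and sexologist.",
    ]
    write_jsonl("datasets/queries.jsonl", queries)
    write_jsonl("datasets/passages.jsonl", passages)
    print(check_jsonl("datasets/queries.jsonl"), check_jsonl("datasets/passages.jsonl"))
```

Using `json.dumps` rather than string formatting avoids producing invalid lines when a text contains quotes or line breaks.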

Run the Triton Inference Server SDK Docker container as shown in the following example, mounting the directory that contains your JSONL files (datasets/ in this example) into the container.

export RELEASE="yy.mm" # e.g. export RELEASE="24.07"

docker run -it --rm \
  --gpus=all \
  --network="host" \
  --mount type=bind,source=${PWD}/datasets,target=/datasets \
  nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

Execute the following command to run a performance benchmark using the genai-perf command line tool.

genai-perf \
  -m nvidia/nv-rerankqa-mistral-4b-v3 \
  --service-kind openai \
  --endpoint-type rankings \
  --batch-size 10 \
  --input-file /datasets/ \
  --extra-inputs truncate:END \
  --concurrency 5 \
  --url http://localhost:8000

You can see the full set of command line options for genai-perf in the Command Line Options section of the GenAI-Perf documentation.
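To benchmark several batch sizes and concurrency levels, such as the 10/20/40 and 1/3/5 combinations that appear in the result tables in this section, the command above can be generated for each combination. The following is a minimal sketch that only prints the commands. Every flag and value is copied from the command above; the Python names (`build_command`, `sweep`) are hypothetical and not part of genai-perf.

```python
import itertools
import shlex

# Values copied from the example command; adjust them for your deployment.
MODEL = "nvidia/nv-rerankqa-mistral-4b-v3"
URL = "http://localhost:8000"
INPUT_DIR = "/datasets/"


def build_command(batch_size, concurrency, model=MODEL, url=URL, input_dir=INPUT_DIR):
    """Return the genai-perf argument list for one batch-size/concurrency pair."""
    return [
        "genai-perf",
        "-m", model,
        "--service-kind", "openai",
        "--endpoint-type", "rankings",
        "--batch-size", str(batch_size),
        "--input-file", input_dir,
        "--extra-inputs", "truncate:END",
        "--concurrency", str(concurrency),
        "--url", url,
    ]


def sweep(batch_sizes=(10, 20, 40), concurrencies=(1, 3, 5)):
    """Yield one command per combination, ordered by concurrency and then batch size."""
    for concurrency, batch_size in itertools.product(concurrencies, batch_sizes):
        yield build_command(batch_size, concurrency)


if __name__ == "__main__":
    for cmd in sweep():
        # shlex.join quotes arguments so each line can be pasted into a shell.
        print(shlex.join(cmd))
```

The printed lines can be pasted into the SDK container shell one at a time, or each argument list can be passed to `subprocess.run(cmd, check=True)` if Python is available inside the container.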

All latency measurements are reported in milliseconds.

| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 (ms) | P90 (ms) | P95 (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|
| 512 | 10 | 1 | 59.15127325 | 58.609364 | 59.445833 | 59.748584 | 168.6138099 |
| 512 | 20 | 1 | 109.7725775 | 108.615585 | 109.7381228 | 110.2646772 | 181.9210795 |
| 512 | 40 | 1 | 199.7630066 | 198.295041 | 199.485733 | 200.1701864 | 200.0279253 |
| 512 | 10 | 3 | 145.0585768 | 145.085874 | 145.7667445 | 145.9763256 | 206.5299841 |
| 512 | 20 | 3 | 277.4162398 | 277.31522 | 278.8673645 | 279.3810225 | 216.0334681 |
| 512 | 40 | 3 | 557.1574629 | 557.9253725 | 559.7991384 | 560.4534646 | 215.0590275 |
| 512 | 10 | 5 | 230.5593247 | 230.5364655 | 231.9350625 | 232.4801408 | 216.6192372 |
| 512 | 20 | 5 | 472.2989115 | 472.697953 | 474.387241 | 474.861087 | 211.498378 |
| 512 | 40 | 5 | 934.5552783 | 935.8601985 | 938.6095725 | 939.0818028 | 213.67363 |

| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 (ms) | P90 (ms) | P95 (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|
| 512 | 10 | 1 | 75.86532237 | 75.1407115 | 76.7011962 | 77.5664238 | 131.5224666 |
| 512 | 20 | 1 | 145.3515023 | 143.8648295 | 146.7868184 | 148.8625933 | 137.4335951 |
| 512 | 40 | 1 | 266.0357232 | 264.769601 | 268.8105086 | 271.3871036 | 150.2466364 |
| 512 | 10 | 3 | 203.4351888 | 203.5259465 | 204.9101358 | 205.3337731 | 147.2855861 |
| 512 | 20 | 3 | 380.299025 | 380.564969 | 383.385757 | 384.4144393 | 157.5641609 |
| 512 | 40 | 3 | 768.8054255 | 770.201022 | 772.700505 | 773.082237 | 155.800105 |
| 512 | 10 | 5 | 316.0120425 | 316.238098 | 319.6002415 | 320.3638658 | 158.0551982 |
| 512 | 20 | 5 | 654.0109239 | 654.567629 | 657.052084 | 657.706441 | 152.7136315 |
| 512 | 40 | 5 | 1291.357554 | 1293.736154 | 1297.131213 | 1297.513957 | 154.5669644 |

| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 (ms) | P90 (ms) | P95 (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|
| 512 | 10 | 1 | 145.0521137 | 145.179583 | 148.4610684 | 149.2430566 | 68.87816567 |
| 512 | 20 | 1 | 297.2250056 | 295.705999 | 298.4342734 | 299.0625553 | 67.25092533 |
| 512 | 40 | 1 | 585.2691985 | 586.054975 | 590.980213 | 591.7934157 | 68.31800435 |
| 512 | 10 | 3 | 424.9944788 | 425.870006 | 434.542943 | 435.6959476 | 70.4994262 |
| 512 | 20 | 3 | 857.6705295 | 860.434129 | 870.1208244 | 871.7973532 | 69.81664061 |
| 512 | 40 | 3 | 1752.389449 | 1773.108766 | 1787.904004 | 1791.487716 | 68.22303433 |
| 512 | 10 | 5 | 714.2413788 | 717.606439 | 724.4338442 | 725.5423389 | 69.91198473 |
| 512 | 20 | 5 | 1456.152644 | 1460.962993 | 1474.374011 | 1477.235713 | 68.51543386 |
| 512 | 40 | 5 | 2909.654168 | 2877.841434 | 3558.40065 | 3564.714347 | 68.11933297 |

| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 (ms) | P90 (ms) | P95 (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|
| 512 | 10 | 1 | 220.1351336 | 219.567336 | 222.0303097 | 224.3575836 | 45.40353794 |
| 512 | 20 | 1 | 459.864657 | 459.392442 | 463.0699963 | 463.6936073 | 43.48502173 |
| 512 | 40 | 1 | 918.5487263 | 923.729079 | 932.3048492 | 934.2903168 | 43.53730644 |
| 512 | 10 | 3 | 664.0047687 | 667.160883 | 677.5579322 | 678.5403365 | 45.10850872 |
| 512 | 20 | 3 | 1378.672097 | 1393.956222 | 1408.957817 | 1410.906663 | 43.38919086 |
| 512 | 40 | 3 | 2782.790843 | 2803.10455 | 2838.511819 | 2839.832203 | 42.86870375 |
| 512 | 10 | 5 | 1138.37367 | 1151.432405 | 1164.212337 | 1165.351363 | 43.84062093 |
| 512 | 20 | 5 | 2335.058665 | 2351.57569 | 2374.724255 | 2377.939241 | 42.67050255 |
| 512 | 40 | 5 | 4653.793032 | 4701.129501 | 4757.436871 | 4758.378576 | 42.66874487 |

| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 (ms) | P90 (ms) | P95 (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|
| 512 | 10 | 1 | 179.771684 | 175.1607995 | 177.7801014 | 240.3639517 | 55.61237041 |
| 512 | 20 | 1 | 351.20563 | 343.581056 | 345.8178384 | 451.4078324 | 56.93840342 |
| 512 | 40 | 1 | 672.7244908 | 658.6091475 | 715.5040111 | 738.1744728 | 59.45287654 |
| 512 | 10 | 3 | 498.4991796 | 499.738937 | 502.6723146 | 503.1805484 | 60.11255981 |
| 512 | 20 | 3 | 968.0608491 | 970.228264 | 971.5925308 | 971.9322844 | 61.85250381 |
| 512 | 40 | 3 | 1952.855543 | 1961.722358 | 1963.688374 | 1964.261407 | 61.19315254 |
| 512 | 10 | 5 | 804.8911109 | 806.0804515 | 808.0845801 | 808.759168 | 62.0412946 |
| 512 | 20 | 5 | 1624.395943 | 1631.16363 | 1633.335954 | 1633.836346 | 61.33026392 |
| 512 | 40 | 5 | 3259.60437 | 3277.647756 | 3282.293595 | 3282.83031 | 61.05900781 |

| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 (ms) | P90 (ms) | P95 (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|
| 512 | 10 | 1 | 219.8578405 | 200.99986 | 287.829985 | 288.737918 | 45.46329418 |
| 512 | 20 | 1 | 436.9196227 | 417.696669 | 544.2437424 | 566.5041747 | 45.76348523 |
| 512 | 40 | 1 | 836.2948987 | 803.587618 | 1009.951138 | 1144.642618 | 47.82129566 |
| 512 | 10 | 3 | 613.9488757 | 623.116859 | 643.7347195 | 644.4218348 | 48.79144936 |
| 512 | 20 | 3 | 1226.981552 | 1248.932983 | 1275.525827 | 1278.13238 | 48.76759708 |
| 512 | 40 | 3 | 2451.100619 | 2529.15884 | 2536.667733 | 2537.993478 | 48.69703136 |
| 512 | 10 | 5 | 1007.035652 | 1035.543607 | 1039.805042 | 1040.844671 | 49.56925593 |
| 512 | 20 | 5 | 2069.772105 | 2088.5293 | 2181.837337 | 2219.299533 | 48.15679733 |
| 512 | 40 | 5 | 4099.673705 | 4231.822854 | 4255.994611 | 4288.337907 | 48.31888282 |

| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 (ms) | P90 (ms) | P95 (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|
| 512 | 10 | 1 | 591.100441 | 586.245698 | 590.2812663 | 594.1412835 | 16.91465597 |
| 512 | 20 | 1 | 1187.761116 | 1173.209232 | 1175.455402 | 1362.145419 | 16.83655672 |
| 512 | 40 | 1 | 2396.08959 | 2345.975085 | 2535.747176 | 2536.192954 | 16.69267152 |
| 512 | 10 | 3 | 1711.403037 | 1720.286958 | 1720.653258 | 1720.734911 | 17.43498274 |
| 512 | 20 | 3 | 3422.835992 | 3440.76653 | 3441.237919 | 3441.282993 | 17.43428699 |
| 512 | 40 | 3 | 6844.313022 | 6880.896307 | 6881.522162 | 6881.597176 | 17.43488284 |
| 512 | 10 | 5 | 2837.085986 | 2867.152507 | 2867.51241 | 2867.714845 | 17.43542495 |
| 512 | 20 | 5 | 5674.188629 | 5734.86263 | 5735.333012 | 5735.617738 | 17.24835647 |
| 512 | 40 | 5 | 11343.21061 | 11467.88201 | 11468.537 | 11468.86158 | 17.43554112 |

| Input tokens | Batch Size | Concurrency | Avg Latency (ms) | P50 (ms) | P90 (ms) | P95 (ms) | Throughput (inputs/s) |
|---|---|---|---|---|---|---|---|
| 512 | 10 | 1 | 705.6236106 | 676.9081665 | 916.8449053 | 921.1387847 | 14.17048456 |
| 512 | 20 | 1 | 1414.107028 | 1363.629102 | 1589.479377 | 1605.672883 | 14.14229382 |
| 512 | 40 | 1 | 2758.85533 | 2719.830774 | 2937.829124 | 2944.937048 | 14.49778013 |
| 512 | 10 | 3 | 2004.225864 | 2018.646714 | 2021.602494 | 2022.47008 | 14.87404615 |
| 512 | 20 | 3 | 4012.562161 | 4040.132441 | 4045.611963 | 4047.986468 | 14.6719336 |
| 512 | 40 | 3 | 8019.3531 | 8075.15545 | 8082.199962 | 8087.389755 | 14.86706617 |
| 512 | 10 | 5 | 3321.029584 | 3365.024784 | 3370.791862 | 3372.740231 | 14.86651922 |
| 512 | 20 | 5 | 6645.143873 | 6734.94333 | 6741.577521 | 6742.792916 | 14.8596257 |
| 512 | 40 | 5 | 13284.70586 | 13461.89189 | 13473.10206 | 13475.66536 | 14.86595233 |
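When reading these tables, it can help to check how the columns relate to one another. If each of the concurrent clients always has exactly one request in flight, throughput should be roughly concurrency × batch size ÷ average latency. The following sketch applies that approximation to two rows copied from the first table. It is an assumed back-of-the-envelope model, not a description of how genai-perf computes its throughput figure, and the function name `estimated_throughput` is hypothetical.

```python
def estimated_throughput(batch_size, concurrency, avg_latency_ms):
    """Approximate inputs/s, assuming every client keeps one request in flight at all times."""
    if avg_latency_ms <= 0:
        raise ValueError("avg_latency_ms must be positive")
    latency_s = avg_latency_ms / 1000.0  # the tables report latency in milliseconds
    return concurrency * batch_size / latency_s


# (batch size, concurrency, avg latency in ms, reported throughput in inputs/s),
# copied from the first and last rows of the first table.
ROWS = [
    (10, 1, 59.15127325, 168.6138099),
    (40, 5, 934.5552783, 213.67363),
]

if __name__ == "__main__":
    for batch_size, concurrency, avg_latency_ms, reported in ROWS:
        estimate = estimated_throughput(batch_size, concurrency, avg_latency_ms)
        print(
            f"batch={batch_size} concurrency={concurrency} "
            f"estimated={estimate:.1f} reported={reported:.1f} inputs/s"
        )
```

For these two rows the estimate lands within about one percent of the reported value, which suggests that throughput in these runs is bounded by per-request latency rather than by client-side overhead.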

© Copyright 2024, NVIDIA Corporation. Last updated on Jul 23, 2024.