Performance

You can use the genai-perf tool to benchmark the performance of the Text Embedding NIM under simulated production load. genai-perf comes pre-installed in the Triton Inference Server SDK container.

To run a performance benchmark, first create a dataset of text examples that genai-perf can use when making requests to the embedding service. These examples should be representative of the type of data that you expect to receive in a production setting. The dataset should be formatted as a JSONL file where each line contains a {"text": ...} object, as shown in the following example.

Example (embeddings.jsonl):

{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}
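
If your representative texts are held in memory rather than in a file, a short script can produce a file in this format. The following is a minimal sketch: the function name write_embedding_dataset is illustrative and not part of genai-perf, and skipping blank strings is an assumption of this sketch rather than a documented requirement.

```python
import json
from pathlib import Path


def write_embedding_dataset(texts, path):
    """Write one {"text": ...} JSON object per line and return the line count."""
    # Assumption of this sketch: drop empty or whitespace-only strings.
    cleaned = [t.strip() for t in texts if t and t.strip()]
    with Path(path).open("w", encoding="utf-8") as f:
        for text in cleaned:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    return len(cleaned)
```

Calling write_embedding_dataset(samples, "datasets/embeddings.jsonl") with a list of strings would write the file into an existing datasets/ directory, ready to be mounted into a container.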

Run the Triton Inference Server SDK Docker container, mounting the directory that contains your JSONL file. The following example mounts ${PWD}/datasets at /datasets inside the container.

export RELEASE="yy.mm" # e.g. export RELEASE="24.07"

docker run -it --rm \
  --gpus=all \
  --network="host" \
  --mount "type=bind,source=${PWD}/datasets,target=/datasets" \
  nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

From inside the container, execute the following command to run a performance benchmark using the genai-perf command line tool.

genai-perf \
    -m nvidia/nv-embedqa-e5-v5 \
    --service-kind openai \
    --endpoint-type embeddings \
    --batch-size 2 \
    --input-file /datasets/embeddings.jsonl \
    --extra-inputs input_type:query \
    --extra-inputs truncate:END \
    --concurrency 5 \
    --url http://localhost:8000

You can see the full set of command line options for genai-perf in the Command Line Options section of the GenAI-Perf documentation.
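
To see how latency and throughput change with load, you can repeat the command at several concurrency levels. The sketch below only builds the command strings, reusing exactly the flags from the command above. The function names and the chosen concurrency levels are illustrative assumptions, and nothing is executed.

```python
import shlex


def build_genai_perf_command(concurrency, batch_size=2,
                             input_file="/datasets/embeddings.jsonl",
                             url="http://localhost:8000"):
    """Return the genai-perf argument list for one concurrency level."""
    return [
        "genai-perf",
        "-m", "nvidia/nv-embedqa-e5-v5",
        "--service-kind", "openai",
        "--endpoint-type", "embeddings",
        "--batch-size", str(batch_size),
        "--input-file", input_file,
        "--extra-inputs", "input_type:query",
        "--extra-inputs", "truncate:END",
        "--concurrency", str(concurrency),
        "--url", url,
    ]


def build_sweep(concurrency_levels):
    """Return one shell-ready command string per concurrency level."""
    return [shlex.join(build_genai_perf_command(c)) for c in concurrency_levels]


if __name__ == "__main__":
    # Print the commands so they can be pasted into the SDK container shell.
    for command in build_sweep([1, 3, 5]):
        print(command)
```

Printing the commands rather than running them keeps the sketch usable outside the container; inside it, the argument lists could instead be passed to subprocess.run.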

Benchmarks

All latency measurements are reported in milliseconds. Throughput is reported in inputs per second.

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	99.8	100.9	107.9	108.6	639.0
passage	300	64	3	143.8	143.3	156.6	159.0	1330.0
passage	300	64	5	239.7	239.7	259.1	265.0	1331.0
passage	512	64	1	114.6	114.4	115.9	117.0	556.5
passage	512	64	3	170.2	169.9	171.2	171.8	1124.2
passage	512	64	5	284.6	284.5	285.6	286.1	1121.4
query	20	1	1	5.1	5.1	5.4	5.4	196.3
query	20	1	3	6.0	5.5	7.4	7.6	498.5
query	20	1	5	11.9	12.3	12.8	12.9	418.3
query	20	1	7	16.5	17.2	18.0	18.1	422.0
query	20	1	9	21.4	22.3	23.3	23.6	418.3
query	20	1	11	26.0	26.0	28.4	28.6	421.3
query	20	1	13	30.7	30.9	33.1	33.6	422.2
query	20	1	15	37.3	37.9	39.1	39.3	401.4
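
As a sanity check when reading these tables, throughput can be estimated from the other columns: with a fixed number of requests in flight, inputs per second is roughly concurrency × batch size ÷ average latency in seconds (Little's law). This is an approximation for interpreting results, not a description of how genai-perf computes its throughput column, and the function name is illustrative.

```python
def estimate_throughput(concurrency, batch_size, avg_latency_ms):
    """Estimate inputs/s for a closed-loop load with fixed concurrency."""
    if avg_latency_ms <= 0:
        raise ValueError("avg_latency_ms must be positive")
    # Each in-flight request carries batch_size inputs and takes
    # avg_latency_ms on average, so convert milliseconds to seconds.
    return concurrency * batch_size / (avg_latency_ms / 1000.0)


if __name__ == "__main__":
    # Rows from the table directly above:
    # (concurrency, batch size, avg latency in ms, reported inputs/s)
    rows = [(1, 64, 99.8, 639.0), (3, 64, 143.8, 1330.0), (5, 1, 11.9, 418.3)]
    for concurrency, batch, latency_ms, reported in rows:
        estimate = estimate_throughput(concurrency, batch, latency_ms)
        print(f"estimated {estimate:.1f} inputs/s, reported {reported}")
```

For these three rows the estimate lands within about one percent of the reported value. The relationship also explains why throughput plateaus once the service is saturated: beyond that point, added concurrency mostly increases latency.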

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	2554.3	2563.9	2678.1	2698.3	25.0
passage	300	64	3	7349.2	7502.1	7889.3	7968.1	25.5
passage	300	64	5	11913.2	12461.9	12893.4	12969.4	25.6
passage	512	64	1	3701.9	3701.6	3703.1	3703.4	17.3
passage	512	64	3	10730.2	10985.2	10987.0	11029.2	17.5
passage	512	64	5	17355.4	14691.3	22035.4	22035.7	17.4
query	20	1	1	32.4	32.4	32.7	32.8	30.7
query	20	1	3	82.5	85.6	85.9	86.0	36.3
query	20	1	5	135.5	142.9	143.3	143.3	36.8
query	20	1	7	191.7	200.2	200.5	200.6	36.5
query	20	1	9	246.9	257.4	257.8	257.9	36.4
query	20	1	11	301.7	314.6	315.1	315.2	36.4
query	20	1	13	356.6	371.6	372.2	372.4	36.4
query	20	1	15	409.5	401.4	429.8	429.9	36.5

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	176.5	177.1	188.6	190.4	362.2
passage	300	64	3	336.1	337.0	359.1	365.8	570.2
passage	300	64	5	560.2	562.9	592.3	634.6	569.8
passage	512	64	1	205.3	204.7	208.2	210.8	311.4
passage	512	64	3	410.9	411.1	412.5	412.7	466.4
passage	512	64	5	681.5	682.0	683.6	684.1	468.7
query	20	1	1	5.3	5.3	5.6	5.7	186.3
query	20	1	3	7.4	7.4	7.5	7.7	403.8
query	20	1	5	11.9	12.4	12.6	12.8	419.2
query	20	1	7	16.6	17.3	17.5	17.6	421.5
query	20	1	9	21.2	22.1	22.5	22.6	423.9
query	20	1	11	26.1	27.2	27.7	27.8	420.5
query	20	1	13	30.8	31.2	32.6	32.7	422.3
query	20	1	15	36.4	37.3	37.9	38.0	411.7

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	188.4	191.7	197.7	198.8	338.7
passage	300	64	3	371.7	372.7	393.8	471.0	515.3
passage	300	64	5	619.5	621.7	648.5	728.8	515.1
passage	512	64	1	222.7	222.3	226.0	227.4	286.5
passage	512	64	3	447.0	447.0	448.7	449.4	428.4
passage	512	64	5	742.3	742.8	745.0	745.5	430.1
query	20	1	1	6.6	6.6	7.0	7.1	149.3
query	20	1	3	7.4	7.3	7.6	7.7	404.8
query	20	1	5	11.8	12.2	12.5	12.6	421.5
query	20	1	7	16.4	17.1	17.4	17.5	426.4
query	20	1	9	20.9	21.9	22.3	22.4	429.9
query	20	1	11	25.7	26.8	27.4	27.7	427.2
query	20	1	13	30.4	31.5	32.1	32.2	427.4
query	20	1	15	35.6	36.4	37.9	38.0	420.6

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	377.9	376.4	396.5	404.5	169.1
passage	300	64	3	929.9	932.3	972.0	979.2	205.8
passage	300	64	5	1545.5	1555.2	1597.3	1610.3	205.9
passage	512	64	1	469.5	468.6	473.9	475.1	136.2
passage	512	64	3	1178.9	1182.4	1183.0	1183.2	162.2
passage	512	64	5	1958.1	1970.9	1971.6	1971.8	162.2
query	20	1	1	11.1	11.1	11.5	11.6	89.8
query	20	1	3	19.3	20.3	20.8	21.0	154.9
query	20	1	5	32.1	34.0	34.6	34.8	155.5
query	20	1	7	44.8	47.4	48.1	48.2	156.0
query	20	1	9	57.7	60.9	61.8	62.0	155.8
query	20	1	11	70.6	74.0	75.5	75.7	155.5
query	20	1	13	83.8	82.8	89.2	89.6	154.9
query	20	1	15	97.5	96.6	103.1	103.4	153.7

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	483.6	483.6	505.2	509.5	132.3
passage	300	64	3	1328.0	1334.4	1367.9	1379.6	143.9
passage	300	64	5	2181.8	2203.7	2241.9	2250.4	145.5
passage	512	64	1	633.8	633.8	638.6	639.4	100.9
passage	512	64	3	1744.9	1755.3	1761.5	1763.1	109.4
passage	512	64	5	2892.2	2923.9	2934.8	2936.8	109.4
query	20	1	1	8.0	8.0	8.3	8.3	124.1
query	20	1	3	11.2	12.2	12.6	12.8	266.1
query	20	1	5	19.9	20.6	21.1	21.2	250.3
query	20	1	7	27.6	28.9	29.4	29.6	253.0
query	20	1	9	35.1	36.7	37.3	37.5	256.1
query	20	1	11	42.7	44.6	45.5	45.7	256.9
query	20	1	13	50.7	50.3	54.0	54.2	255.9
query	20	1	15	57.4	57.9	62.2	62.5	261.0

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	552.1	555.1	572.6	575.1	115.5
passage	300	64	3	1228.5	1229.5	1325.3	1527.0	155.4
passage	300	64	5	2045.3	2058.5	2153.8	2231.9	155.3
passage	512	64	1	730.0	729.7	732.3	733.6	87.4
passage	512	64	3	1775.8	1779.3	1784.0	1784.5	107.7
passage	512	64	5	2945.8	2539.2	3431.0	3432.8	107.6
query	20	1	1	14.6	14.6	15.2	15.4	67.9
query	20	1	3	29.1	30.7	31.6	31.9	102.7
query	20	1	5	48.7	51.4	52.6	52.9	102.3
query	20	1	7	68.2	72.0	73.7	74.0	102.4
query	20	1	9	86.7	90.2	94.0	94.6	103.7
query	20	1	11	106.3	105.3	115.0	115.5	103.3
query	20	1	13	125.3	125.0	134.9	135.8	103.6
query	20	1	15	144.4	145.0	155.3	156.1	103.7

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	693.1	702.0	719.3	721.3	92.0
passage	300	64	3	1674.4	1687.0	1849.1	2192.7	114.0
passage	300	64	5	2780.4	2797.6	3082.0	3389.9	113.8
passage	512	64	1	930.9	931.5	935.9	936.9	68.6
passage	512	64	3	2398.1	2395.9	2403.8	2407.5	79.6
passage	512	64	5	4056.4	4079.8	4098.5	4315.1	78.5
query	20	1	1	19.8	19.7	20.6	20.7	50.1
query	20	1	3	42.3	44.0	45.2	45.5	70.8
query	20	1	5	70.1	73.4	75.1	75.8	71.1
query	20	1	7	97.7	102.6	104.5	104.9	71.6
query	20	1	9	124.9	131.3	134.2	134.8	71.9
query	20	1	11	151.7	149.8	163.6	164.3	72.4
query	20	1	13	180.3	178.8	193.3	194.0	72.0
query	20	1	15	208.4	207.8	222.5	223.4	71.9

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	1322.4	1322.2	1362.2	1369.5	48.3
passage	300	64	3	3670.5	3674.6	3798.4	3824.3	52.2
passage	300	64	5	6188.9	6219.1	6368.4	6378.3	51.4
passage	512	64	1	1990.0	1990.5	2013.8	2018.5	32.1
passage	512	64	3	5586.0	5601.7	5683.0	5689.6	34.3
passage	512	64	5	9358.7	9398.1	9525.5	9570.5	34.0
query	20	1	1	21.5	21.5	21.8	21.8	46.3
query	20	1	3	47.8	51.1	51.5	51.7	62.5
query	20	1	5	82.1	85.3	85.8	85.9	60.8
query	20	1	7	112.1	119.2	120.0	120.2	62.3
query	20	1	9	143.5	151.5	154.2	154.4	62.6
query	20	1	11	176.5	174.3	188.5	188.8	62.2
query	20	1	13	208.2	205.8	222.2	222.4	62.3
query	20	1	15	239.0	239.5	256.2	256.6	62.7

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	1954.6	1957.3	2010.7	2029.4	32.7
passage	300	64	3	5650.6	5734.4	5950.8	7649.8	33.3
passage	300	64	5	9470.5	9790.5	9947.0	10511.1	32.7
passage	512	64	1	3038.8	3045.9	3079.6	3080.5	21.0
passage	512	64	3	8659.0	8835.5	8944.2	8960.5	21.8
passage	512	64	5	14292.3	14782.0	14948.8	14986.1	21.6
query	20	1	1	29.3	29.2	29.5	29.6	34.0
query	20	1	3	71.2	73.1	73.3	73.4	42.0
query	20	1	5	113.8	121.7	122.2	122.3	43.9
query	20	1	7	159.3	170.2	171.0	171.1	43.9
query	20	1	9	204.7	217.6	219.9	220.0	43.9
query	20	1	11	253.3	266.7	268.8	268.9	43.4
query	20	1	13	299.2	295.0	317.5	317.7	43.4
query	20	1	15	346.4	342.2	366.3	366.4	43.2

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	1618.5	1619.8	1666.8	1700.4	39.5
passage	300	64	3	4245.1	4279.2	4465.5	5715.4	44.6
passage	300	64	5	7009.0	7159.7	7465.8	8631.8	44.6
passage	512	64	1	2316.9	2317.3	2321.5	2323.1	27.6
passage	512	64	3	6328.9	6407.9	6414.3	6415.5	29.9
passage	512	64	5	10559.8	10698.6	11012.5	11124.2	29.7
query	20	1	1	22.5	22.5	22.8	22.9	44.4
query	20	1	3	49.5	53.2	53.6	53.8	60.6
query	20	1	5	81.2	88.5	89.1	89.2	61.6
query	20	1	7	114.8	123.9	124.5	124.7	60.9
query	20	1	9	147.6	145.4	160.0	160.1	60.9
query	20	1	11	179.3	177.9	195.4	195.6	61.3
query	20	1	13	212.8	213.6	231.3	231.5	61.0
query	20	1	15	243.0	248.4	266.5	266.7	61.7

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	1887.4	1890.5	1954.8	1970.8	33.9
passage	300	64	3	5411.6	5587.9	5787.5	6121.0	34.9
passage	300	64	5	8957.8	9469.9	11139.8	11486.4	34.6
passage	512	64	1	2839.6	2851.3	2973.2	3008.1	22.5
passage	512	64	3	8179.3	8529.3	8662.5	8678.7	23.0
passage	512	64	5	13935.1	14520.8	14928.1	15156.9	22.5
query	20	1	1	24.0	23.9	24.2	24.3	41.6
query	20	1	3	51.3	54.4	55.3	55.5	58.4
query	20	1	5	87.2	91.3	92.8	93.1	57.3
query	20	1	7	120.8	126.9	129.5	129.8	57.9
query	20	1	9	154.8	162.4	166.6	166.9	58.1
query	20	1	11	187.7	185.9	203.5	203.8	58.5
query	20	1	13	223.0	222.1	239.8	240.3	58.2
query	20	1	15	256.2	258.2	276.6	277.3	58.5

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	16	1	2927.9	2927.4	3059.5	3089.5	5.5
passage	300	16	3	8379.6	8563.5	8739.8	9256.8	5.6
passage	300	16	5	13629.2	14355.2	14642.3	14710.0	5.6
passage	300	64	1	11385.6	11350.1	11646.0	11693.3	5.6
passage	300	64	3	29783.1	33442.8	33609.1	33687.4	5.7
passage	300	64	5	43320.7	55557.3	55833.8	55911.2	5.8
query	20	1	1	39.8	39.7	40.2	40.4	25.1
query	20	1	3	95.5	100.7	101.4	101.7	31.4
query	20	1	5	157.1	167.9	168.8	169.0	31.8
query	20	1	7	224.1	235.1	236.2	236.5	31.2
query	20	1	9	284.4	302.1	303.6	303.9	31.6
query	20	1	11	345.7	339.8	370.6	370.9	31.8
query	20	1	13	410.6	406.0	437.9	438.2	31.6
query	20	1	15	470.4	472.7	505.1	505.6	31.8

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	16	1	1753.4	1745.5	1831.3	1850.1	9.1
passage	300	16	3	5090.3	5173.0	5259.7	5416.0	9.3
passage	300	16	5	8349.1	8605.0	8750.8	8830.5	9.3
passage	300	64	1	7270.1	7290.9	7380.2	7390.7	8.8
passage	300	64	3	20045.6	21451.2	21691.0	21695.3	8.9
passage	300	64	5	31066.4	35673.0	36053.9	36088.0	8.9
query	20	1	1	66.4	66.2	67.1	67.2	15.0
query	20	1	3	168.6	179.1	180.8	181.5	17.8
query	20	1	5	278.9	298.7	300.6	300.9	17.9
query	20	1	7	388.5	417.5	419.9	420.6	18.0
query	20	1	9	501.5	535.8	539.7	540.5	17.9
query	20	1	11	616.4	603.0	659.1	659.8	17.8
query	20	1	13	728.6	722.0	778.9	779.8	17.8
query	20	1	15	838.3	840.9	897.7	898.6	17.8

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	98.0	95.0	105.8	106.6	651.7
passage	300	64	3	144.9	142.3	159.3	174.9	1320.4
passage	300	64	5	243.3	238.8	283.4	298.1	1311.5
passage	512	64	1	112.0	112.0	112.9	113.2	569.2
passage	512	64	3	223.2	253.4	257.1	257.7	857.7
passage	512	64	5	300.7	295.7	356.5	360.0	1061.4
query	20	1	1	4.6	4.6	4.8	4.8	215.6
query	20	1	3	7.0	7.2	7.5	7.8	426.4
query	20	1	5	11.4	11.9	12.1	12.2	434.7
query	20	1	7	16.0	16.7	16.9	17.0	434.7
query	20	1	9	20.6	21.4	21.8	21.9	435.3
query	20	1	11	25.2	26.2	26.7	26.9	435.8
query	20	1	13	30.2	31.2	31.8	32.1	429.8
query	20	1	15	34.9	35.8	36.4	36.6	429.0

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	2586.4	2617.0	2802.0	2803.0	24.7
passage	300	64	3	7438.4	7622.4	7961.6	8037.8	25.2
passage	300	64	5	12158.7	12724.5	13256.9	13319.0	25.1
passage	512	64	1	3727.8	3727.5	3728.9	3729.3	17.2
passage	512	64	3	10810.7	11063.3	11102.2	11154.1	17.3
passage	512	64	5	17458.0	14878.2	22157.8	22183.8	17.3
query	20	1	1	32.3	32.2	32.6	32.7	30.8
query	20	1	3	81.1	85.5	85.9	86.0	36.9
query	20	1	5	136.5	142.8	143.1	143.3	36.6
query	20	1	7	189.4	199.9	200.4	200.5	36.9
query	20	1	9	245.6	257.0	257.6	257.8	36.6
query	20	1	11	297.5	313.4	314.5	314.7	36.9
query	20	1	13	350.9	344.2	371.6	371.8	37.0
query	20	1	15	409.1	427.2	429.0	429.3	36.6

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	172.1	171.9	186.3	189.1	371.4
passage	300	64	3	334.3	335.5	363.5	383.4	573.4
passage	300	64	5	556.5	557.5	585.4	600.3	573.6
passage	512	64	1	203.5	202.6	206.5	207.2	314.2
passage	512	64	3	406.1	406.7	497.6	502.4	472.0
passage	512	64	5	673.6	673.2	718.2	760.0	474.1
query	20	1	1	5.3	5.2	5.6	5.7	188.6
query	20	1	3	7.3	7.4	7.5	7.5	408.6
query	20	1	5	11.9	12.3	12.5	12.5	417.7
query	20	1	7	16.5	17.2	17.4	17.5	423.6
query	20	1	9	21.2	22.1	22.3	22.4	424.4
query	20	1	11	25.9	27.0	27.3	27.4	423.7
query	20	1	13	30.8	31.9	32.4	32.5	421.9
query	20	1	15	35.2	34.9	37.1	37.2	425.5

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	189.4	187.6	203.0	204.9	337.5
passage	300	64	3	371.2	372.3	396.3	404.4	516.2
passage	300	64	5	621.1	622.6	655.5	677.5	513.7
passage	512	64	1	228.0	227.0	233.5	234.5	280.4
passage	512	64	3	462.9	467.3	559.3	570.4	414.1
passage	512	64	5	840.0	807.1	1040.2	1089.9	379.9
query	20	1	1	6.6	6.6	7.0	7.2	150.7
query	20	1	3	7.4	7.4	7.5	7.6	399.7
query	20	1	5	12.1	12.5	12.7	12.8	411.2
query	20	1	7	16.8	17.4	17.8	17.9	413.2
query	20	1	9	21.7	22.4	22.9	23.0	413.7
query	20	1	11	26.3	27.3	27.6	27.7	417.1
query	20	1	13	31.1	32.0	32.6	32.7	416.9
query	20	1	15	36.4	37.3	37.7	37.8	411.0

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	373.3	366.4	397.5	401.4	171.3
passage	300	64	3	918.5	919.8	958.2	963.8	208.4
passage	300	64	5	1527.0	1534.3	1589.6	1594.6	208.4
passage	512	64	1	470.4	469.4	475.0	476.0	136.0
passage	512	64	3	1180.7	1184.0	1184.5	1184.7	162.0
passage	512	64	5	1960.7	1973.6	1974.0	1974.2	162.0
query	20	1	1	10.7	10.7	11.0	11.0	93.2
query	20	1	3	19.6	20.4	20.9	21.0	152.6
query	20	1	5	32.2	34.2	34.9	35.3	154.9
query	20	1	7	45.9	48.0	48.8	49.1	152.2
query	20	1	9	59.5	62.0	63.0	63.2	151.1
query	20	1	11	72.7	76.0	77.3	77.8	151.1
query	20	1	13	85.7	89.0	90.7	90.9	151.6
query	20	1	15	99.9	103.5	104.4	104.7	149.9

Input Type	Input Tokens	Batch Size	Concurrency	Avg Latency	P50 Latency	P90 Latency	P95 Latency	Throughput (inputs/s)
passage	300	64	1	489.9	488.4	519.3	521.3	130.6
passage	300	64	3	1355.4	1354.3	1413.0	1423.2	141.0
passage	300	64	5	2251.5	2271.3	2338.8	2345.4	140.9
passage	512	64	1	641.5	640.7	647.5	648.6	99.7
passage	512	64	3	1797.2	1807.8	1813.7	1814.9	106.2
passage	512	64	5	2979.6	3014.9	3020.7	3021.9	106.2
query	20	1	1	7.9	7.9	8.2	8.4	125.6
query	20	1	3	11.9	12.3	12.6	12.7	251.4
query	20	1	5	20.0	20.6	20.9	20.9	249.5
query	20	1	7	27.7	28.9	29.4	29.5	251.5
query	20	1	9	35.6	37.0	37.6	37.8	252.2
query	20	1	11	43.6	45.3	45.9	46.1	251.9
query	20	1	13	51.5	53.3	54.2	54.4	252.2
query	20	1	15	59.6	59.3	63.0	63.3	251.0