Performance

You can use the genai-perf tool to benchmark the performance of the Text Embedding NIM under simulated production load. genai-perf comes pre-installed in the Triton Server SDK container.

To run a performance benchmark, first create a dataset of text examples that genai-perf can use when making requests to the embedding service. These examples should be representative of the type of data that you expect to receive in a production setting. The dataset should be formatted as a JSONL file where each line contains a {"text": ...} object, as shown in the following example.

Example (embeddings.jsonl):

{"text": "What was the first car ever driven?"}
{"text": "Who served as the 5th President of the United States of America?"}
{"text": "Is the Sydney Opera House located in Australia?"}
{"text": "In what state did they film Shrek 2?"}
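If the sample texts already live in a script or notebook, a file in this shape can be produced with Python's standard library. The following is a minimal sketch rather than part of genai-perf: the helper name write_embedding_dataset and the output path are illustrative assumptions.

```python
import json
from pathlib import Path


def write_embedding_dataset(texts, path):
    """Write one {"text": ...} JSON object per line (JSONL).

    Hypothetical helper for illustration; genai-perf does not ship it.
    """
    path = Path(path)
    # Create the target directory (for example datasets/) if it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes quotes and newlines so each record stays on one line.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    return path
```

For example, write_embedding_dataset(["What was the first car ever driven?"], "datasets/embeddings.jsonl") would write a one-line file under datasets/.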

Run the Triton Inference Server SDK Docker container, mounting the directory that contains your JSONL file. The following example mounts ${PWD}/datasets at /datasets inside the container.

export RELEASE="yy.mm" # e.g. export RELEASE="24.07"

docker run -it --rm \
  --gpus=all \
  --network="host" \
  --mount type=bind,source=${PWD}/datasets,target=/datasets \
  nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

Inside the container, run the following genai-perf command to start a performance benchmark.

genai-perf \
    -m nvidia/nv-embedqa-e5-v5 \
    --service-kind openai \
    --endpoint-type embeddings \
    --batch-size 2 \
    --input-file /datasets/embeddings.jsonl \
    --extra-inputs input_type:query \
    --extra-inputs truncate:END \
    --concurrency 5 \
    --url http://localhost:8000
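To compare several concurrency levels, the same command can be repeated with only --concurrency changed. The sketch below is one way to script that from Python inside the SDK container. It uses only the flags shown in the command above; the function names genai_perf_command and run_sweep, the default arguments, and the choice of levels are illustrative assumptions.

```python
import shlex
import subprocess


def genai_perf_command(concurrency, batch_size=2,
                       input_file="/datasets/embeddings.jsonl",
                       url="http://localhost:8000"):
    """Return the genai-perf argument list for one concurrency level.

    Hypothetical helper; the flags mirror the documented command.
    """
    return [
        "genai-perf",
        "-m", "nvidia/nv-embedqa-e5-v5",
        "--service-kind", "openai",
        "--endpoint-type", "embeddings",
        "--batch-size", str(batch_size),
        "--input-file", input_file,
        "--extra-inputs", "input_type:query",
        "--extra-inputs", "truncate:END",
        "--concurrency", str(concurrency),
        "--url", url,
    ]


def run_sweep(levels, runner=subprocess.run):
    """Run one benchmark per concurrency level, stopping at the first failure."""
    for concurrency in levels:
        cmd = genai_perf_command(concurrency)
        print("Running:", shlex.join(cmd))
        # check=True makes subprocess.run raise if genai-perf exits non-zero.
        runner(cmd, check=True)
```

Calling run_sweep((1, 3, 5)) would run three benchmarks back to back. The runner parameter exists only so the loop can be exercised without genai-perf installed.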

You can see the full set of command line options for genai-perf in the Command Line Options section of the GenAI-Perf documentation.

Benchmarks

All latency measurements are reported in milliseconds.

Input Type  Input Tokens  Batch Size  Concurrency  Avg Latency  P50 Latency  P90 Latency  P95 Latency  Throughput (inputs/s)
----------  ------------  ----------  -----------  -----------  -----------  -----------  -----------  ---------------------
passage     300           64          1            99.8         100.9        107.9        108.6        639.0
passage     300           64          3            143.8        143.3        156.6        159.0        1330.0
passage     300           64          5            239.7        239.7        259.1        265.0        1331.0
passage     512           64          1            114.6        114.4        115.9        117.0        556.5
passage     512           64          3            170.2        169.9        171.2        171.8        1124.2
passage     512           64          5            284.6        284.5        285.6        286.1        1121.4
query       20            1           1            5.1          5.1          5.4          5.4          196.3
query       20            1           3            6.0          5.5          7.4          7.6          498.5
query       20            1           5            11.9         12.3         12.8         12.9         418.3
query       20            1           7            16.5         17.2         18.0         18.1         422.0
query       20            1           9            21.4         22.3         23.3         23.6         418.3
query       20            1           11           26.0         26.0         28.4         28.6         421.3
query       20            1           13           30.7         30.9         33.1         33.6         422.2
query       20            1           15           37.3         37.9         39.1         39.3         401.4
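The throughput column can be sanity-checked against the latency column. Assuming the load generator keeps exactly Concurrency requests in flight at all times (a closed-loop assumption, not something stated by the benchmark itself), throughput should be close to concurrency × batch size ÷ average latency. The sketch below applies that estimate to four rows copied from the results above; the function name expected_throughput is illustrative.

```python
def expected_throughput(concurrency, batch_size, avg_latency_ms):
    """Estimate inputs per second, assuming `concurrency` requests are always in flight."""
    return concurrency * batch_size / (avg_latency_ms / 1000.0)


# (input type, batch size, concurrency, avg latency in ms, reported inputs/s)
rows = [
    ("passage", 64, 1, 99.8, 639.0),
    ("passage", 64, 5, 239.7, 1331.0),
    ("query", 1, 1, 5.1, 196.3),
    ("query", 1, 15, 37.3, 401.4),
]

for input_type, batch, conc, latency_ms, reported in rows:
    estimate = expected_throughput(conc, batch, latency_ms)
    print(f"{input_type:8s} concurrency={conc:<2d} "
          f"estimate={estimate:7.1f} reported={reported:7.1f}")
```

For these four rows the estimate lands within about 1 percent of the reported figure. Read this way, the flat throughput at higher concurrency alongside rising latency suggests that extra concurrent requests are mostly waiting rather than adding capacity.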