Llama 3.1 Nemotron Safety Guard Multilingual 8B V1 NIM Performance

NVIDIA benchmarked the microservice with the genai-perf tool. For more information about the tool, see "A Comprehensive Guide to NIM LLM Latency-Throughput Benchmarking".

The performance metrics were generated with the following configuration:

  • NVIDIA A100 80GB SXM

  • Driver 575.57.08

  • CUDA 12.9

  • BF16 precision

Tabs are organized by input and output sequence length. For example, the 500_10 tab shows the metrics for a benchmark that uses 500 tokens for an input sequence and 10 tokens for an output sequence.
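The naming convention can be sketched in a few lines of Python. This is only an illustration of the `<input>_<output>` label format described above; the `parse_tab_label` helper and the `SequenceLengths` type are hypothetical names, not part of genai-perf or the NIM.

```python
from typing import NamedTuple


class SequenceLengths(NamedTuple):
    """Token counts encoded in a tab label (hypothetical helper type)."""
    input_tokens: int
    output_tokens: int


def parse_tab_label(label: str) -> SequenceLengths:
    """Split a tab label such as "500_10" into input and output token counts."""
    input_part, sep, output_part = label.strip().partition("_")
    # Reject anything that is not two non-negative integers joined by "_".
    if not sep or not input_part.isdigit() or not output_part.isdigit():
        raise ValueError(f"expected '<input>_<output>', got {label!r}")
    return SequenceLengths(int(input_part), int(output_part))


print(parse_tab_label("500_10"))
# prints SequenceLengths(input_tokens=500, output_tokens=10)
```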

| Concurrency | TTFT (ms) | Avg Latency (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 48.3 | 136 | 66.6 |
| 5 | 141 | 272 | 165 |
| 25 | 657 | 958 | 217 |
| 50 | 1215 | 1922 | 231 |
| 100 | 2051 | 3562 | 250 |
| 200 | 3716 | 6336 | 263 |
| 500 | 5653 | 6791 | 614 |

| Concurrency | TTFT (ms) | Avg Latency (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 80.3 | 5540 | 88.6 |
| 5 | 292 | 6093 | 404 |
| 25 | 1009 | 9022 | 1037 |
| 50 | 1995 | 12818 | 1913 |
| 100 | 3629 | 19002 | 2568 |
| 200 | 6950 | 32488 | 2996 |
| 500 | 28427 | 67861 | 2846 |

| Concurrency | TTFT (ms) | Avg Latency (ms) | Throughput (tokens/s) |
|---|---|---|---|
| 1 | 111.5 | 11112 | 88.7 |
| 5 | 435 | 12413 | 394 |
| 25 | 1324 | 18487 | 1015 |
| 50 | 2836 | 26736 | 1835 |
| 100 | 5337 | 40156 | 2430 |
| 200 | 15406 | 67881 | 2382 |
| 500 | 71601 | 140246 | 2570 |
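
One way to read these tables is to compare how throughput scales as concurrency rises. The sketch below uses the rows of the last table above and reports the speedup over the concurrency-1 row plus a rough per-request rate. The `scaling_summary` helper is hypothetical, and dividing throughput by concurrency assumes the throughput column is the aggregate across all concurrent requests, which this page does not state explicitly.

```python
# Rows copied from the last table above:
# (concurrency, TTFT ms, avg latency ms, throughput tokens/s).
ROWS = [
    (1, 111.5, 11112, 88.7),
    (5, 435, 12413, 394),
    (25, 1324, 18487, 1015),
    (50, 2836, 26736, 1835),
    (100, 5337, 40156, 2430),
    (200, 15406, 67881, 2382),
    (500, 71601, 140246, 2570),
]


def scaling_summary(rows):
    """Return (concurrency, speedup, tokens/s per concurrent request) per row.

    Speedup is throughput relative to the lowest-concurrency row.
    The per-request figure assumes throughput is an aggregate value.
    """
    rows = sorted(rows)
    baseline = rows[0][3]
    return [
        (conc, tput / baseline, tput / conc)
        for conc, _ttft, _latency, tput in rows
    ]


for conc, speedup, per_request in scaling_summary(ROWS):
    print(f"concurrency={conc:>3}  speedup={speedup:5.1f}x  "
          f"per-request={per_request:6.1f} tokens/s")
```

In this data the speedup flattens between concurrency 100 and 500 while TTFT and average latency keep growing, so a summary like this can help pick a concurrency level that balances throughput against latency.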