Llama 3.1 Nemotron Safety Guard Multilingual 8B V1 NIM Performance

NVIDIA used the genai-perf tool to benchmark the performance of the microservice. For more information about the tool, refer to "A Comprehensive Guide to NIM LLM Latency-Throughput Benchmarking".

The performance metrics were generated with the following configuration:

NVIDIA A100 80GB SXM
Driver 575.57.08
CUDA 12.9
BF16 precision

The results are organized by input and output sequence length. For example, the 500_10 results show the metrics for a benchmark that uses 500 tokens for each input sequence and 10 tokens for each output sequence. TTFT is the time to first token.
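The label convention described above can be decoded programmatically when you post-process results. The following is a minimal sketch. The names `Profile` and `parse_profile` are hypothetical helpers for illustration and are not part of genai-perf or the microservice.

```python
from typing import NamedTuple


class Profile(NamedTuple):
    """Sequence lengths encoded in a benchmark label (hypothetical helper)."""
    input_tokens: int
    output_tokens: int


def parse_profile(label: str) -> Profile:
    """Split a label such as '500_10' into input and output token counts."""
    input_part, output_part = label.split("_")
    return Profile(int(input_part), int(output_part))


profile = parse_profile("500_10")
print(profile.input_tokens, profile.output_tokens)
```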

500_10

Concurrency	TTFT (ms)	Avg Latency (ms)	Throughput (tokens/s)
1	48.3	136	66.6
5	141	272	165
25	657	958	217
50	1215	1922	231
100	2051	3562	250
200	3716	6336	263
500	5653	6791	614
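
As a rough cross-check of the 500_10 figures above, Little's law relates concurrency, latency, and throughput: with N requests in flight, each producing about 10 output tokens in an average of L seconds, the expected rate is roughly N × 10 / L tokens per second. The sketch below applies that to the rows above. It assumes the throughput column counts output tokens only, which this page does not state, and the function name is illustrative.

```python
OUTPUT_TOKENS = 10  # taken from the 500_10 label

# (concurrency, avg latency in ms, reported throughput in tokens/s)
ROWS = [
    (1, 136, 66.6),
    (5, 272, 165),
    (25, 958, 217),
    (50, 1922, 231),
    (100, 3562, 250),
    (200, 6336, 263),
    (500, 6791, 614),
]


def estimated_throughput(concurrency, avg_latency_ms, output_tokens):
    """Little's-law estimate: requests in flight * tokens each / latency in seconds."""
    return concurrency * output_tokens / (avg_latency_ms / 1000.0)


for concurrency, latency_ms, reported in ROWS:
    estimate = estimated_throughput(concurrency, latency_ms, OUTPUT_TOKENS)
    print(f"{concurrency:>4}  estimated {estimate:7.1f}  reported {reported:7.1f}")
```

For these rows the estimate lands somewhat above the reported value (about 73.5 versus 66.6 tokens/s at concurrency 1), so treat it as a sanity check on the order of magnitude and not as a reproduction of the measured figure.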

1000_500

Concurrency	TTFT (ms)	Avg Latency (ms)	Throughput (tokens/s)
1	80.3	5540	88.6
5	292	6093	404
25	1009	9022	1037
50	1995	12818	1913
100	3629	19002	2568
200	6950	32488	2996
500	28427	67861	2846
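
In the 1000_500 figures above, throughput rises with concurrency and then falls between 200 and 500 concurrent requests, while TTFT and average latency keep growing. A small sketch for locating that peak in the published numbers follows. The function name is illustrative.

```python
# (concurrency, TTFT ms, avg latency ms, throughput tokens/s) for 1000_500
ROWS = [
    (1, 80.3, 5540, 88.6),
    (5, 292, 6093, 404),
    (25, 1009, 9022, 1037),
    (50, 1995, 12818, 1913),
    (100, 3629, 19002, 2568),
    (200, 6950, 32488, 2996),
    (500, 28427, 67861, 2846),
]


def peak_throughput_row(rows):
    """Return the row whose throughput (last column) is highest."""
    return max(rows, key=lambda row: row[3])


concurrency, ttft_ms, latency_ms, throughput = peak_throughput_row(ROWS)
print(f"peak {throughput} tokens/s at concurrency {concurrency}")
```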

1500_1000

Concurrency	TTFT (ms)	Avg Latency (ms)	Throughput (tokens/s)
1	111.5	11112	88.7
5	435	12413	394
25	1324	18487	1015
50	2836	26736	1835
100	5337	40156	2430
200	15406	67881	2382
500	71601	140246	2570
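
One derived figure that can help when reading these results is the average time per output token after the first one, approximated as (average latency - TTFT) / (output tokens - 1). The sketch below applies it to the 1500_1000 rows above. It assumes every request generates exactly 1000 output tokens, as the label implies. The result is an approximation computed from the published averages, not a metric reported on this page, and the function name is illustrative.

```python
OUTPUT_TOKENS = 1000  # taken from the 1500_1000 label

# (concurrency, TTFT ms, avg latency ms) for 1500_1000
ROWS = [
    (1, 111.5, 11112),
    (5, 435, 12413),
    (25, 1324, 18487),
    (50, 2836, 26736),
    (100, 5337, 40156),
    (200, 15406, 67881),
    (500, 71601, 140246),
]


def time_per_output_token_ms(ttft_ms, avg_latency_ms, output_tokens):
    """Approximate per-token decode time, excluding the first token."""
    return (avg_latency_ms - ttft_ms) / (output_tokens - 1)


for concurrency, ttft_ms, latency_ms in ROWS:
    tpot = time_per_output_token_ms(ttft_ms, latency_ms, OUTPUT_TOKENS)
    print(f"{concurrency:>4}  ~{tpot:6.2f} ms per output token")
```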