NVIDIA NIM LLMs Benchmarking

A Comprehensive Guide to NIM LLM Latency-Throughput Benchmarking

Benchmarking Guide

  • Overview
    • Executive Summary
    • Introduction to LLM Inference Benchmarking
    • Background On How LLM Inference Works
  • Metrics (the key relationships are sketched after this outline)
    • Time to First Token (TTFT)
    • End-to-End Request Latency (e2e_latency)
    • Inter-token Latency (ITL)
    • Tokens Per Second (TPS)
    • Requests Per Second (RPS)
  • Parameters and Best Practices
    • Use Cases
    • Load Control
    • Other Parameters
  • Using GenAI-Perf to Benchmark (an example invocation follows this outline)
    • Step 1. Get a list of the latest models
    • Step 2. Setting Up an OpenAI-Compatible Llama-3 Inference Service with NVIDIA NIM
    • Step 3. Setting Up GenAI-Perf and Warming Up: Benchmarking a Single Use Case
    • Step 4. Sweeping through a Number of Use Cases
    • Step 5. Analyzing the Output
    • Step 6. Interpreting the Results
  • Benchmarking LoRA Models
    • Best Practices for Multi-LoRA Deployment Performance Benchmarking
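
To orient readers before the Metrics chapter, the core latency metrics covered there relate roughly as follows. This is only a sketch of the relationships the guide defines; num_output_tokens is our shorthand for the number of tokens a single request generates:

    e2e_latency ≈ TTFT + generation_time
    ITL ≈ (e2e_latency - TTFT) / (num_output_tokens - 1)
    system TPS = total output tokens across all requests / benchmark duration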

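To give a feel for what the GenAI-Perf steps above cover, the sketch below shows an illustrative invocation against a NIM endpoint serving an OpenAI-compatible chat API on localhost:8000. It is a minimal example only: the model name, token counts, and concurrency are placeholder values, and exact flag names can vary across GenAI-Perf releases, so consult genai-perf profile --help for your installed version.

    genai-perf profile \
      -m meta/llama-3.1-8b-instruct \
      --service-kind openai \
      --endpoint-type chat \
      --streaming \
      --url localhost:8000 \
      --synthetic-input-tokens-mean 200 \
      --output-tokens-mean 200 \
      --concurrency 10

Note that streaming must be enabled for GenAI-Perf to report TTFT and ITL; without it, only end-to-end request latency is observable.
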
Performance

  • Benchmarks
    • Llama-3.3-70b-instruct Results
      • Version: 1.8.0
      • Version: 1.5.0
    • Llama-3.1-8b-instruct Results
      • Version: 1.8.0
      • Version: 1.3.0
    • Llama-3.1-70b-instruct Results
      • Version: 1.3.0
    • Hardware Specifications
      • NVIDIA H100
      • NVIDIA H200
      • NVIDIA A100
      • NVIDIA L40S

Copyright © 2024-2025, NVIDIA Corporation.

Last updated on May 14, 2025.