NVIDIA Cloud Accelerator Documentation

NVIDIA Inference Reference Architecture

Table of Contents

  • Introduction
  • Why Adopting This Architecture is Essential
    • Measuring the Value Proposition
  • NVIDIA Open Models Across Modalities
    • Modular by Design
  • Common Component Combinations
    • Architecture Overview
  • Component Layers
    • Infrastructure Layer
    • Optimization Layer
    • Deployment Layer
    • Inference Serving Layer
    • Memory and Caching Layer
    • Performance Tooling
    • Container Registry
  • Data Flow Diagrams
    • GenAI/LLM Inference Flow
    • Traditional ML Inference Flow
    • Model Deployment Flow
  • Key Component Interactions
    • Disaggregated LLM Serving
    • Kubernetes Infrastructure Stack
  • Component Interaction Matrix
  • Getting Started
    • Full Stack Deployment
    • Traditional ML Inference Only
    • GenAI/LLM Inference Only
    • Kubernetes Integration Only
  • Example Workload: Large MoE LLM Inference
    • Architecture Overview (Dynamo)
    • Performance Characteristics
    • Technical Implementation Details
  • Appendix
