# NVIDIA Inference Reference Architecture

- Introduction
- Why Adopting This Architecture is Essential
- Measuring the Value Proposition
- NVIDIA Open Models Across Modalities
- Modular by Design
- Common Component Combinations
- Architecture Overview
  - Component Layers
    - Infrastructure Layer
    - Optimization Layer
    - Deployment Layer
    - Inference Serving Layer
    - Memory and Caching Layer
    - Performance Tooling
    - Container Registry
  - Data Flow Diagrams
    - GenAI/LLM Inference Flow
    - Traditional ML Inference Flow
    - Model Deployment Flow
  - Key Component Interactions
    - Disaggregated LLM Serving
    - Kubernetes Infrastructure Stack
  - Component Interaction Matrix
- Getting Started
  - Full Stack Deployment
  - Traditional ML Inference Only
  - GenAI/LLM Inference Only
  - Kubernetes Integration Only
- Example Workload: Large MoE LLM Inference
  - Architecture Overview (Dynamo)
  - Performance Characteristics
  - Technical Implementation Details
- Appendix