# Why Adopting This Architecture Is Essential

The shift to inference-centric AI, driven by generative models, necessitates a fundamental change in infrastructure. Traditional stacks, optimized for training, fail to meet the sustained, low-latency, and high-throughput demands of modern, large-scale inference. Adopting the NVIDIA Inference Reference Architecture is critical for:

* **Addressing the Generative AI Performance Gap:** Achieve the ultra-low latency and high throughput required by next-generation models such as LLMs, delivering the **Performance** gains quantified below.
* **Optimizing Cloud Economics:** A disaggregated, infrastructure-native approach enables truly elastic resource allocation, eliminating costly over-provisioning and improving capital efficiency for CSPs and NCPs, which translates directly into **Cost & Efficiency (TCO)** gains.
* **Future-Proofing Your Inference Stack:** Build on a modern, cloud-native foundation that supports the complete inference lifecycle, from model optimization to Kubernetes-native scaling, securing the **Scalability & Agility (Time-to-Market)** needed for long-term competitiveness.

## Measuring the Value Proposition

The architecture's value is quantifiable, driving superior service delivery and business outcomes:

**Performance:**

* **Throughput per Watt:** Achieve up to a **50x increase in requests served** for low-latency workloads.
* **Compute Performance:** Benefit from **1.5x higher NVFP4 compute performance** and **2x faster attention processing** compared with previous generations.

**Cost & Efficiency (TCO):**

* **Total Cost of Ownership:** Realize up to **35x lower cost per million tokens** compared with previous-generation platforms; the sketch below shows how this metric is computed.
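
To make the cost-per-million-tokens metric concrete, here is a minimal arithmetic sketch. The hourly costs and token throughputs are illustrative placeholders chosen so the ratio lands near the quoted 35x; they are not published benchmarks.

```python
# Illustrative TCO arithmetic: cost per million tokens.
# All figures below are hypothetical placeholders, not NVIDIA benchmarks.

def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Dollars spent to generate one million output tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# Hypothetical previous-generation platform.
prev = cost_per_million_tokens(hourly_cost_usd=98.0, tokens_per_second=20_000)

# Hypothetical current-generation platform: higher hourly cost,
# but far higher sustained token throughput.
curr = cost_per_million_tokens(hourly_cost_usd=140.0, tokens_per_second=1_000_000)

print(f"previous gen: ${prev:.4f} per 1M tokens")
print(f"current gen:  ${curr:.4f} per 1M tokens")
print(f"improvement:  {prev / curr:.1f}x lower cost per 1M tokens")
```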

**Scalability & Agility (Time-to-Market):**

* **Time-to-Deployment:** Accelerate system deployment and scaling from weeks to \[BB] days or hours.
* **Autoscaling Efficiency:** Scale resources up or down by a factor of \[CC]x within minutes; a minimal sketch of such a scaling decision follows this list.
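
As a rough illustration of the elastic scaling behavior described above, the sketch below sizes replicas from a queue-depth signal. The `AutoscalePolicy` type, the thresholds, and the metric are all hypothetical; a production deployment would typically delegate this decision to Kubernetes-native mechanisms such as the Horizontal Pod Autoscaler.

```python
# Minimal sketch of a queue-depth-driven autoscaling decision.
# Thresholds and limits are hypothetical, not part of the reference
# architecture; real deployments would usually rely on a Kubernetes
# HorizontalPodAutoscaler or an inference-aware autoscaler.
import math
from dataclasses import dataclass

@dataclass
class AutoscalePolicy:
    target_queue_per_replica: int = 8   # desired in-flight requests per replica
    min_replicas: int = 1
    max_replicas: int = 64

def desired_replicas(policy: AutoscalePolicy, queue_depth: int) -> int:
    """Scale replicas so each one handles roughly the target queue depth."""
    wanted = math.ceil(queue_depth / policy.target_queue_per_replica)
    return max(policy.min_replicas, min(policy.max_replicas, wanted))

policy = AutoscalePolicy()
print(desired_replicas(policy, queue_depth=120))  # -> 15 (scale up)
print(desired_replicas(policy, queue_depth=4))    # -> 1 (scale down)
```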

By adopting this Northstar architecture definition, NCPs and ISVs can deliver world-class, cost-effective, and highly performant AI services.