Key Component Interactions

Disaggregated LLM Serving

The architecture separates the prefill and decode phases of LLM inference so that each phase can be scheduled and scaled on its own:

Key interactions:

  • Planner + Grove: Planner determines the prefill/decode worker ratio that satisfies the SLA constraints; Grove gang-schedules the resulting workers
  • Router + KV Block Manager: Router uses KV cache hit rates and worker load to decide which worker receives each request
  • KV Block Manager + NIXL: Together they let the KV cache span memory tiers (G1-G4), with NIXL providing high-speed transfers between them
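
To make the Router + KV Block Manager interaction concrete, here is a minimal sketch of cache-aware routing. It is not the actual Router implementation. The `WorkerState` type, the `score` formula, and the `load_weight` parameter are all assumptions made for illustration, and the sketch treats cached blocks as an unordered set rather than modeling prefix order.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    """Hypothetical snapshot of one worker as a router might see it."""
    name: str
    cached_blocks: set      # hashes of KV blocks already resident on the worker
    active_requests: int    # current load
    max_requests: int       # capacity, used to normalize load


def score(worker, request_blocks, load_weight=0.5):
    """Higher is better: reward cache overlap, penalize load (assumed formula)."""
    if request_blocks:
        hits = sum(1 for b in request_blocks if b in worker.cached_blocks)
        hit_rate = hits / len(request_blocks)
    else:
        hit_rate = 0.0
    load = worker.active_requests / worker.max_requests
    return hit_rate - load_weight * load


def pick_worker(workers, request_blocks, load_weight=0.5):
    """Return the worker with the best combined cache/load score."""
    return max(workers, key=lambda w: score(w, request_blocks, load_weight))


workers = [
    WorkerState("worker-a", {1, 2, 3}, active_requests=6, max_requests=8),
    WorkerState("worker-b", set(), active_requests=1, max_requests=8),
]
chosen = pick_worker(workers, request_blocks=[1, 2, 3, 4])
print(chosen.name)
```

With the default weight the busier worker still wins because three of the four requested blocks are already cached there. Raising `load_weight` shifts the decision toward the idle worker, which is the trade-off a cache-aware router has to balance.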

Kubernetes Infrastructure Stack

Key interactions:

  • GPU Operator + Network Operator: Together they manage the GPU and networking software that GPU cluster nodes require
  • KAI Scheduler + Grove: KAI handles general GPU scheduling; Grove adds gang scheduling for multi-node workloads
  • nvcr.io + Operators: Pre-built container images from nvcr.io simplify deployment of the entire stack
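
Gang scheduling, mentioned above for multi-node workloads, means that a group of pods is placed all together or not at all. The sketch below illustrates only that all-or-nothing property. It is not how Grove or KAI Scheduler is implemented. The `gang_schedule` function, the node names, and the first-fit placement heuristic are assumptions, and first-fit can reject a gang that a smarter packing would accept.

```python
from typing import Dict, List, Optional


def gang_schedule(free_gpus: Dict[str, int], pods: List[int]) -> Optional[Dict[int, str]]:
    """Place every pod (given as a GPU demand) or none of them.

    Returns a mapping of pod index -> node name, or None if the gang does not fit.
    """
    remaining = dict(free_gpus)  # work on a copy so a failed attempt leaves no partial state
    placement = {}
    # Place the largest pods first, each on the first node with enough free GPUs.
    for idx in sorted(range(len(pods)), key=lambda i: -pods[i]):
        node = next((n for n, free in remaining.items() if free >= pods[idx]), None)
        if node is None:
            return None  # one pod cannot be placed, so the whole gang is rejected
        remaining[node] -= pods[idx]
        placement[idx] = node
    return placement


cluster = {"node-a": 8, "node-b": 4}
fits = gang_schedule(cluster, [4, 4, 4])      # 12 GPUs requested, 12 available
rejected = gang_schedule(cluster, [8, 8])     # second pod has nowhere to go
print(fits, rejected)
```

The second call returns `None` instead of placing one of the two pods, which avoids the case where a partially started multi-node job holds GPUs while waiting for peers that can never be scheduled.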