Key Component Interactions

Disaggregated LLM Serving

The architecture separates the prefill and decode phases of LLM inference so that each phase can be scaled and scheduled independently:

Figure: Disaggregated LLM Serving

Key interactions:

  • Planner + Grove: Planner determines optimal prefill/decode ratios based on SLA constraints; Grove handles gang scheduling

  • Router + KV Block Manager: Router uses KV cache hit rates and worker load information to decide which worker serves each request

  • KV Block Manager + NIXL: Lets the KV cache span memory tiers (G1-G4), with NIXL providing high-speed transfers between them
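
To make the Router + KV Block Manager interaction more concrete, here is a minimal sketch of cache-aware routing in Python. It is an illustration only, not the actual router implementation: the `WorkerState` fields, the `routing_score` formula, and the `load_weight` parameter are all hypothetical names and choices. The only assumption taken from the text above is that the router sees some measure of cache hits and some measure of load.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkerState:
    """Hypothetical snapshot of one worker, as a router might see it."""
    name: str
    cached_blocks: int    # request prefix blocks already in this worker's KV cache
    active_requests: int  # requests the worker is currently serving
    max_requests: int     # capacity used to normalize load


def routing_score(worker: WorkerState, request_blocks: int,
                  load_weight: float = 0.5) -> float:
    """Higher is better: reward expected cache hits, penalize current load."""
    hit_rate = worker.cached_blocks / request_blocks if request_blocks else 0.0
    load = worker.active_requests / worker.max_requests
    return hit_rate - load_weight * load


def pick_worker(workers: List[WorkerState], request_blocks: int,
                load_weight: float = 0.5) -> WorkerState:
    """Return the worker with the best cache-versus-load trade-off."""
    return max(workers,
               key=lambda w: routing_score(w, request_blocks, load_weight))


workers = [
    WorkerState("worker-a", cached_blocks=6, active_requests=7, max_requests=8),
    WorkerState("worker-b", cached_blocks=2, active_requests=1, max_requests=8),
]

# With the default weight, the warm cache on worker-a outweighs its higher load.
chosen = pick_worker(workers, request_blocks=8)
print(chosen.name)
```

Raising `load_weight` shifts the decision toward the less busy worker, which shows the trade-off a router of this kind has to tune: reusing cached KV blocks saves prefill work, but sending every matching request to one worker overloads it.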

Kubernetes Infrastructure Stack

Figure: Kubernetes Infrastructure Stack

Key interactions:

  • GPU Operator + Network Operator: Together provide full infrastructure management for GPU clusters

  • KAI Scheduler + Grove: KAI handles general GPU scheduling; Grove adds gang scheduling for multi-node workloads

  • nvcr.io + Operators: Pre-built containers simplify deployment of the entire stack
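
The gang scheduling mentioned for KAI Scheduler + Grove can be illustrated with a short Python sketch of its all-or-nothing semantics: either every pod in a group is placed, or none is. This is not how KAI Scheduler or Grove is implemented. The `gang_schedule` function, its arguments, and the first-fit placement heuristic are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def gang_schedule(free_gpus: Dict[str, int],
                  pod_gpu_requests: List[int]) -> Optional[Dict[int, str]]:
    """Place every pod in the gang, or none of them.

    free_gpus maps node name -> free GPU count (hypothetical cluster view).
    pod_gpu_requests lists the GPUs each pod in the gang needs.
    Returns pod index -> node name, or None if the gang cannot be placed.
    Placement is first-fit, largest pod first: a simple heuristic that can
    miss feasible placements a real scheduler would find.
    """
    remaining = dict(free_gpus)  # copy, so a failed attempt reserves nothing
    placement: Dict[int, str] = {}
    order = sorted(range(len(pod_gpu_requests)),
                   key=lambda i: -pod_gpu_requests[i])
    for idx in order:
        need = pod_gpu_requests[idx]
        node = next((n for n, free in remaining.items() if free >= need), None)
        if node is None:
            return None  # one pod does not fit, so the whole gang is rejected
        remaining[node] -= need
        placement[idx] = node
    return placement


cluster = {"node-1": 8, "node-2": 4}

fits = gang_schedule(cluster, [8, 4])     # both pods can be placed together
rejected = gang_schedule(cluster, [8, 8]) # second pod cannot fit anywhere
print(fits, rejected)
```

The point of the all-or-nothing rule is to avoid a multi-node workload holding GPUs on some nodes while it waits indefinitely for the rest, which would leave those GPUs idle.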