Key Component Interactions
Disaggregated LLM Serving
The architecture supports disaggregated LLM serving, in which the prefill and decode phases run separately:
Figure: Disaggregated LLM Serving
Key interactions:
Planner + Grove: Planner determines the prefill/decode ratio that meets the SLA constraints; Grove gang-schedules the resulting workers.
Router + KV Block Manager: Router uses KV cache hit rates and load information to decide where to route each request.
KV Block Manager + NIXL: Together they let the KV cache span memory tiers (G1-G4), with high-speed transfers between tiers.
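The Router interaction above can be illustrated with a small scoring function. This is a minimal sketch, not the Router's actual algorithm: the `WorkerState` fields, the `score` formula, and the `load_weight` parameter are all hypothetical names chosen for illustration. It assumes the router knows, per worker, how many of a request's KV blocks are already cached and how busy the worker is.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    """Hypothetical per-worker view; field names are illustrative only."""
    name: str
    cached_blocks: int    # request's KV blocks already cached on this worker
    active_requests: int  # requests currently being served
    max_requests: int     # capacity, assumed to be greater than zero


def score(worker: WorkerState, total_blocks: int, load_weight: float = 0.5) -> float:
    """Higher is better: reward cache hits, penalize load."""
    hit_rate = worker.cached_blocks / total_blocks if total_blocks else 0.0
    load = worker.active_requests / worker.max_requests
    return hit_rate - load_weight * load


def pick_worker(workers, total_blocks: int, load_weight: float = 0.5) -> WorkerState:
    """Return the best-scoring worker that still has spare capacity."""
    candidates = [w for w in workers if w.active_requests < w.max_requests]
    if not candidates:
        raise RuntimeError("no worker has spare capacity")
    return max(candidates, key=lambda w: score(w, total_blocks, load_weight))


workers = [
    WorkerState("a", cached_blocks=8, active_requests=9, max_requests=10),
    WorkerState("b", cached_blocks=6, active_requests=2, max_requests=10),
    WorkerState("c", cached_blocks=0, active_requests=0, max_requests=10),
]
chosen = pick_worker(workers, total_blocks=10)
```

With these numbers, worker "a" has the highest hit rate but is nearly saturated, so the load penalty tips the choice to worker "b". Setting `load_weight=0` would reduce the policy to pure cache affinity.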
Kubernetes Infrastructure Stack
Figure: Kubernetes Infrastructure Stack
Key interactions:
GPU Operator + Network Operator: Together they provide infrastructure management for GPU clusters.
KAI Scheduler + Grove: KAI handles general GPU scheduling; Grove adds gang scheduling for multi-node workloads.
nvcr.io + Operators: Pre-built containers from nvcr.io simplify deployment of the entire stack.
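Gang scheduling, mentioned above for multi-node workloads, means a group of pods is placed all at once or not at all. The sketch below illustrates only that all-or-nothing property. It is not how Grove or KAI Scheduler is implemented: the function name, the pod and node names, and the first-fit heuristic are assumptions made for illustration.

```python
def gang_schedule(pod_gpu_requests, free_gpus_by_node):
    """Return {pod: node} if every pod in the gang fits, otherwise None.

    Uses a simple first-fit pass over pods sorted by GPU demand (largest
    first). This heuristic can miss some feasible placements; it is here
    only to show the all-or-nothing behaviour.
    """
    remaining = dict(free_gpus_by_node)  # work on a copy; caller's view is untouched
    placement = {}
    ordered = sorted(pod_gpu_requests.items(), key=lambda kv: kv[1], reverse=True)
    for pod, need in ordered:
        node = next((n for n, free in remaining.items() if free >= need), None)
        if node is None:
            return None  # one pod does not fit, so the whole gang is rejected
        remaining[node] -= need
        placement[pod] = node
    return placement


gang = {"prefill-0": 4, "decode-0": 4, "decode-1": 4}
fits = gang_schedule(gang, {"node-a": 8, "node-b": 4})
rejected = gang_schedule(gang, {"node-a": 8})
```

In the second call, two of the three pods would fit on `node-a`, but the function returns `None` rather than a partial placement, so no GPUs are held by a workload that cannot start.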