Llama-3.3-70B Topology Benchmark
How do aggregated, single-node disaggregated, and multi-node disaggregated vLLM topologies compare when normalized by GPU?
The three vLLM topologies (aggregated, single-node disaggregated, and multi-node disaggregated) intentionally use different GPU counts (4, 8, and 16 H100/H200 GPUs). Concurrency is therefore scaled at 16 per GPU, and results should be read as total throughput and TPS/GPU together. More GPUs trivially raise total throughput, so TPS/GPU is the apples-to-apples lens. All three topologies are also deployable recipe targets, so this benchmark doubles as a sizing guide.
Benchmark setup
Compared Configurations
Reproduce
Each configuration’s perf.yaml computes total concurrency as 16 × GPU count and wraps an AIPerf run. The checked-in perf.yaml is authoritative: beyond the concurrency, it sets --random-seed, ignore_eos, the tokenizer, and dataset-entry flags.
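The concurrency rule can be sketched in a few lines of Python. This is an illustration only, not the contents of perf.yaml. The mapping of topology to GPU count is an assumption inferred from the order in which the topologies and counts are listed, and the key `disagg-multi-node` is a hypothetical name chosen by analogy with `agg` and `disagg-single-node`.

```python
# Sketch of the scaling rule: 16 concurrent requests per GPU.
CONCURRENCY_PER_GPU = 16

# ASSUMPTION: topology -> GPU count, inferred from the listing order
# (aggregated, single-node disaggregated, multi-node disaggregated -> 4, 8, 16).
# "disagg-multi-node" is a hypothetical label; check each perf.yaml to confirm.
GPU_COUNTS = {
    "agg": 4,
    "disagg-single-node": 8,
    "disagg-multi-node": 16,
}


def total_concurrency(gpu_count: int, per_gpu: int = CONCURRENCY_PER_GPU) -> int:
    """Return the total request concurrency for a deployment of gpu_count GPUs."""
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    return per_gpu * gpu_count


# Total concurrency for each assumed topology.
CONCURRENCY = {name: total_concurrency(n) for name, n in GPU_COUNTS.items()}

if __name__ == "__main__":
    for name, value in CONCURRENCY.items():
        print(f"{name}: {GPU_COUNTS[name]} GPUs -> concurrency {value}")
```

Under these assumptions, the three runs would use total concurrencies of 64, 128, and 256, so every GPU sees the same offered load regardless of topology.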
The frontend services are llama3-70b-agg-frontend, llama3-70b-disagg-sn-frontend, and llama3-70b-disagg-mn-frontend. Deploy one configuration at a time.
Notes
- The source does not publish result numbers; run all three configurations on your hardware and compare total output TPS alongside TPS/GPU, since GPU counts differ per configuration.
- The model uses FP8 dynamic quantization applied at runtime; the download takes roughly 15-30 minutes.
- The agg and disagg-single-node configurations also ship optional GAIE (Gateway API Inference Extension) manifests under their gaie/ subfolders.
- Source: recipes/llama-3-70b
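For the comparison described in the first note, a small helper can normalize each run's total output TPS by its GPU count. The sketch below is a minimal illustration: `RunResult` and `rank_by_tps_per_gpu` are hypothetical names, and the numbers in the usage section are placeholders chosen for easy arithmetic, not measured results. Substitute the total output TPS reported by your own runs.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class RunResult:
    """One benchmark run: a label, its GPU count, and measured total output TPS."""
    name: str
    gpu_count: int
    total_output_tps: float

    def __post_init__(self) -> None:
        if self.gpu_count <= 0:
            raise ValueError("gpu_count must be positive")

    @property
    def tps_per_gpu(self) -> float:
        # Normalizing by GPU count makes runs of different sizes comparable.
        return self.total_output_tps / self.gpu_count


def rank_by_tps_per_gpu(results: Iterable[RunResult]) -> List[RunResult]:
    """Sort runs from highest to lowest TPS/GPU (ties keep their input order)."""
    return sorted(results, key=lambda r: r.tps_per_gpu, reverse=True)


if __name__ == "__main__":
    # PLACEHOLDER values only, not benchmark results. The GPU counts assume
    # the 4 / 8 / 16 split maps to the topologies in the order they are listed.
    runs = [
        RunResult("agg", 4, 1000.0),
        RunResult("disagg-single-node", 8, 2400.0),
        RunResult("disagg-multi-node", 16, 4000.0),
    ]
    for run in rank_by_tps_per_gpu(runs):
        print(f"{run.name}: total={run.total_output_tps:.0f} TPS, "
              f"per-GPU={run.tps_per_gpu:.1f} TPS/GPU")
```

Reporting both columns keeps the sizing question visible: a larger topology can lead on total TPS while matching or trailing a smaller one on TPS/GPU.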
Related Recipe
All three configurations are deployable targets on the Llama-3.3-70B recipe page — none is a benchmark-only control.