# Multi-URL Load Balancing
AIPerf supports distributing requests across multiple inference server instances for horizontal scaling. This is useful for:
- Multi-GPU scaling: Run multiple inference containers on a single node, each serving a different GPU
- Distributed inference: Load balance across multiple inference servers
- High-throughput benchmarking: Aggregate throughput from multiple instances
## Usage
Specify multiple `--url` options to enable load balancing:
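For example, a run that spreads requests across three local server instances (the model name, hostnames, and ports are illustrative; substitute your own endpoints):

```shell
# Distribute requests across three inference server instances,
# e.g. one container per GPU on a single node.
aiperf profile \
  --model my-model \
  --url localhost:8000 \
  --url localhost:8001 \
  --url localhost:8002
```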
Sample Output (Successful Run):
## URL Selection Strategy
Currently supported strategies:
You can explicitly set the strategy with `--url-strategy`:
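For example (the `round_robin` value below is an assumed strategy name for illustration; run `aiperf --help` to see the strategies your version actually supports):

```shell
# Explicitly select a URL distribution strategy.
# The strategy name shown here is an assumed example.
aiperf profile \
  --model my-model \
  --url localhost:8000 \
  --url localhost:8001 \
  --url-strategy round_robin
```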
## CLI Options
## Behavior Notes
- Server metrics: Metrics are collected from all configured URLs
- Backward compatibility: Single URL usage remains unchanged
- Per-request assignment: Each request is assigned a URL at credit issuance time
- Connection reuse: The `--connection-reuse-strategy` setting applies per URL
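To illustrate the per-request assignment behavior above, one simple selection scheme, round-robin, can be sketched in a few lines of bash. This is a conceptual sketch only, not AIPerf's internal implementation:

```shell
#!/usr/bin/env bash
# Conceptual round-robin URL selector: each call assigns the next
# URL in the list to NEXT, wrapping around when the list is exhausted.
urls=("localhost:8000" "localhost:8001" "localhost:8002")
i=0
next_url() {
  NEXT="${urls[$((i % ${#urls[@]}))]}"   # pick the current URL
  i=$((i + 1))                           # advance for the next request
}
```

Calling `next_url` repeatedly cycles through `localhost:8000`, `localhost:8001`, `localhost:8002`, and then wraps back to the first URL.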