AIPerf supports replaying production traffic captured by Amazon SageMaker Data Capture. This enables benchmarking inference servers using real request patterns and prompts recorded from SageMaker real-time endpoints.
The loader sends the exact captured prompts (literal replay via the messages array) with original request timing, enabling accurate A/B comparisons when migrating models, changing instance types, or upgrading serving frameworks.
messages array in the request payload)Data Capture writes JSONL files to S3, partitioned by hour:
Each JSONL line contains the full request and response payloads with timing metadata:
Sync captured data from S3 and point AIPerf at the directory:
The loader recursively finds all .jsonl files in the directory, parses them, and sorts records by timestamp. No manual file concatenation is needed.
Single-file input also works:
Use timestamp offsets to replay a subset of the captured traffic:
This replays only the first 5 minutes (300,000 ms) of captured traffic.
When creating the endpoint configuration, include DataCaptureConfig with JsonContentTypes to store payloads as raw JSON (not base64):
Setting JsonContentTypes ensures payloads are stored as raw JSON. Without it, SageMaker base64-encodes the data by default. The AIPerf loader handles both encodings.
inferenceTime has no fractional seconds. At high QPS, requests sharing the same second fire in rapid succession.InvokeEndpointWithResponseStream responses are not captured by SageMaker. Output token counts may be missing for streaming endpoints.messages array. Non-chat endpoints are not supported.