Replay SageMaker Data Capture Traces
AIPerf supports replaying production traffic captured by Amazon SageMaker Data Capture. This enables benchmarking inference servers using real request patterns and prompts recorded from SageMaker real-time endpoints.
The loader sends the exact captured prompts (literal replay via the messages array) with original request timing, enabling accurate A/B comparisons when migrating models, changing instance types, or upgrading serving frameworks.
Prerequisites
- A SageMaker real-time endpoint with Data Capture enabled (captures both input and output)
- Captured data synced from S3 to local disk
- The captured endpoint must use the OpenAI-compatible chat completions API (
messagesarray in the request payload)
SageMaker Data Capture Format
Data Capture writes JSONL files to S3, partitioned by hour:
Each JSONL line contains the full request and response payloads with timing metadata:
Download and Replay
Sync captured data from S3 and point AIPerf at the directory:
The loader recursively finds all .jsonl files in the directory, parses them, and sorts records by timestamp. No manual file concatenation is needed.
Single-file input also works:
Replay a Time Window
Use timestamp offsets to replay a subset of the captured traffic:
This replays only the first 5 minutes (300,000 ms) of captured traffic.
Enabling Data Capture on Your Endpoint
When creating the endpoint configuration, include DataCaptureConfig with JsonContentTypes to store payloads as raw JSON (not base64):
Setting JsonContentTypes ensures payloads are stored as raw JSON. Without it, SageMaker base64-encodes the data by default. The AIPerf loader handles both encodings.
Known Limitations
- Second-level timestamp precision:
inferenceTimehas no fractional seconds. At high QPS, requests sharing the same second fire in rapid succession. - No streaming capture:
InvokeEndpointWithResponseStreamresponses are not captured by SageMaker. Output token counts may be missing for streaming endpoints. - Single-turn only: Each captured record is an independent request. No multi-turn session linking.
- OpenAI-compatible only: The captured payload must contain a
messagesarray. Non-chat endpoints are not supported.
Related Tutorials
- Trace Replay with Mooncake Traces - Mooncake FAST’25 trace replay
- Bailian Traces - Bailian production trace replay
- Fixed Schedule - Precise timestamp-based execution for any dataset