Replay SageMaker Data Capture Traces

AIPerf supports replaying production traffic captured by Amazon SageMaker Data Capture. This enables benchmarking inference servers using real request patterns and prompts recorded from SageMaker real-time endpoints.

The loader sends the exact captured prompts (literal replay via the messages array) with original request timing, enabling accurate A/B comparisons when migrating models, changing instance types, or upgrading serving frameworks.


Prerequisites

  • A SageMaker real-time endpoint with Data Capture enabled for both input and output
  • Captured data synced from S3 to local disk
  • The captured endpoint must use the OpenAI-compatible chat completions API (messages array in the request payload)

SageMaker Data Capture Format

Data Capture writes JSONL files to S3, partitioned by hour:

```
s3://<bucket>/<prefix>/<endpoint-name>/<variant-name>/yyyy/mm/dd/hh/<uuid>.jsonl
```

Each JSONL line contains the full request and response payloads with timing metadata:

```json
{
  "captureData": {
    "endpointInput": {
      "observedContentType": "application/json",
      "mode": "INPUT",
      "data": "{\"messages\":[{\"role\":\"user\",\"content\":\"What is AI?\"}],\"max_tokens\":50}",
      "encoding": "JSON"
    },
    "endpointOutput": {
      "observedContentType": "application/json",
      "mode": "OUTPUT",
      "data": "{\"usage\":{\"prompt_tokens\":12,\"completion_tokens\":30,\"total_tokens\":42},...}",
      "encoding": "JSON"
    }
  },
  "eventMetadata": {
    "eventId": "e4378ff2-2b43-4031-a21f-401bb3c3e038",
    "inferenceTime": "2026-04-29T00:03:18Z"
  },
  "eventVersion": "0"
}
```
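As an illustration of this schema (a sketch only, not part of AIPerf), a single capture line can be parsed with the standard library. Note the double JSON decode: the `data` field is itself a JSON-encoded string.

```python
import json

def parse_capture_line(line: str) -> dict:
    """Pull the prompt and timing out of one Data Capture record.

    Assumes the payload is stored as raw JSON (encoding == "JSON");
    base64-encoded captures would need an extra decode step first.
    """
    record = json.loads(line)
    # "data" is itself a JSON string, so it is decoded a second time
    payload = json.loads(record["captureData"]["endpointInput"]["data"])
    return {
        "messages": payload["messages"],
        "max_tokens": payload.get("max_tokens"),
        "timestamp": record["eventMetadata"]["inferenceTime"],
    }
```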

Download and Replay

Sync captured data from S3 and point AIPerf at the directory:

```bash
# Sync all capture files (preserves hourly directory structure)
aws s3 sync \
  s3://my-bucket/datacapture/my-endpoint/primary/ \
  ./captured_data/

# Replay against a target server
aiperf profile \
  --model my-model \
  --endpoint-type chat \
  --url localhost:8000 \
  --input-file ./captured_data/ \
  --custom-dataset-type sagemaker_data_capture \
  --fixed-schedule \
  --fixed-schedule-auto-offset
```

The loader recursively finds all .jsonl files in the directory, parses them, and sorts records by timestamp. No manual file concatenation is needed.
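The discover-and-sort step can be sketched in a few lines of Python (illustrative only; this is not AIPerf's actual implementation):

```python
import json
from pathlib import Path

def load_sorted_records(capture_dir: str) -> list[dict]:
    """Collect every record from all .jsonl files under capture_dir
    and order them by inferenceTime, mimicking the loader's behavior."""
    records = []
    for path in sorted(Path(capture_dir).rglob("*.jsonl")):
        with path.open() as f:
            for line in f:
                if line.strip():
                    records.append(json.loads(line))
    # ISO-8601 timestamps with a fixed layout sort correctly as strings
    records.sort(key=lambda r: r["eventMetadata"]["inferenceTime"])
    return records
```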

Single-file input also works:

```bash
# Concatenate if preferred
find captured_data/ -name "*.jsonl" -exec cat {} + > all_captures.jsonl

aiperf profile \
  --model my-model \
  --endpoint-type chat \
  --url localhost:8000 \
  --input-file all_captures.jsonl \
  --custom-dataset-type sagemaker_data_capture \
  --fixed-schedule \
  --fixed-schedule-auto-offset
```

Replay a Time Window

Use timestamp offsets to replay a subset of the captured traffic:

```bash
aiperf profile \
  --model my-model \
  --endpoint-type chat \
  --url localhost:8000 \
  --input-file ./captured_data/ \
  --custom-dataset-type sagemaker_data_capture \
  --fixed-schedule \
  --fixed-schedule-auto-offset \
  --fixed-schedule-end-offset 300000
```

This replays only the first 5 minutes (300,000 ms) of captured traffic.


Enabling Data Capture on Your Endpoint

When creating the endpoint configuration, include DataCaptureConfig with JsonContentTypes to store payloads as raw JSON (not base64):

```python
import boto3

client = boto3.client("sagemaker")

client.create_endpoint_config(
    EndpointConfigName="my-endpoint-config-with-capture",
    ProductionVariants=[{
        "VariantName": "primary",
        "ModelName": "my-model",
        "InitialInstanceCount": 1,
        "InstanceType": "ml.g5.xlarge",
        "InitialVariantWeight": 1.0,
    }],
    DataCaptureConfig={
        "EnableCapture": True,
        "InitialSamplingPercentage": 100,
        "DestinationS3Uri": "s3://my-bucket/datacapture",
        "CaptureOptions": [
            {"CaptureMode": "Input"},
            {"CaptureMode": "Output"},
        ],
        "CaptureContentTypeHeader": {
            "JsonContentTypes": ["application/json"],
        },
    },
)
```

Setting JsonContentTypes ensures payloads are stored as raw JSON. Without it, SageMaker base64-encodes the data by default. The AIPerf loader handles both encodings.
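Handling both encodings amounts to checking the record's `encoding` field before decoding. A minimal sketch (not AIPerf's actual code):

```python
import base64
import json

def decode_capture_data(segment: dict) -> dict:
    """Decode the "data" field of an endpointInput or endpointOutput.

    SageMaker stores the payload as raw JSON when the content type is
    listed in JsonContentTypes, and base64-encodes it otherwise; the
    record's "encoding" field says which ("JSON" or "BASE64").
    """
    raw = segment["data"]
    if segment.get("encoding") == "BASE64":
        raw = base64.b64decode(raw).decode("utf-8")
    return json.loads(raw)
```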


Known Limitations

  • Second-level timestamp precision: inferenceTime has no fractional seconds, so at high QPS, requests captured within the same second are replayed in rapid succession.
  • No streaming capture: InvokeEndpointWithResponseStream responses are not captured by SageMaker. Output token counts may be missing for streaming endpoints.
  • Single-turn only: Each captured record is an independent request. No multi-turn session linking.
  • OpenAI-compatible only: The captured payload must contain a messages array. Non-chat endpoints are not supported.
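To gauge how much the timestamp-precision limitation matters for a given trace, the captured timestamps can be bucketed per second before replaying (an illustrative sketch, not an AIPerf feature):

```python
import json
from collections import Counter
from pathlib import Path

def requests_per_second(capture_dir: str) -> Counter:
    """Count captured records per one-second inferenceTime bucket.
    Records sharing a bucket fire in rapid succession on replay."""
    counts = Counter()
    for path in Path(capture_dir).rglob("*.jsonl"):
        with path.open() as f:
            for line in f:
                if line.strip():
                    ts = json.loads(line)["eventMetadata"]["inferenceTime"]
                    counts[ts] += 1
    return counts
```

Large counts in a single bucket indicate bursts that the replay will compress into near-simultaneous requests.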