Replay SageMaker Data Capture Traces | NVIDIA AIPerf Documentation

AIPerf supports replaying production traffic captured by Amazon SageMaker Data Capture. This enables benchmarking inference servers using real request patterns and prompts recorded from SageMaker real-time endpoints.

The loader sends the exact captured prompts (literal replay via the messages array) with original request timing, enabling accurate A/B comparisons when migrating models, changing instance types, or upgrading serving frameworks.

Prerequisites

A SageMaker real-time endpoint with Data Capture enabled (captures both input and output)
Captured data synced from S3 to local disk
The captured endpoint must use the OpenAI-compatible chat completions API (messages array in the request payload)

SageMaker Data Capture Format

Data Capture writes JSONL files to S3, partitioned by hour:

s3://<bucket>/<prefix>/<endpoint-name>/<variant-name>/yyyy/mm/dd/hh/<uuid>.jsonl

Each JSONL line contains the full request and response payloads with timing metadata:

1 {
2   "captureData": {
3     "endpointInput": {
4       "observedContentType": "application/json",
5       "mode": "INPUT",
6       "data": "{\"messages\":[{\"role\":\"user\",\"content\":\"What is AI?\"}],\"max_tokens\":50}",
7       "encoding": "JSON"
8     },
9     "endpointOutput": {
10       "observedContentType": "application/json",
11       "mode": "OUTPUT",
12       "data": "{\"usage\":{\"prompt_tokens\":12,\"completion_tokens\":30,\"total_tokens\":42},...}",
13       "encoding": "JSON"
14     }
15   },
16   "eventMetadata": {
17     "eventId": "e4378ff2-2b43-4031-a21f-401bb3c3e038",
18     "inferenceTime": "2026-04-29T00:03:18Z"
19   },
20   "eventVersion": "0"
21 }

Download and Replay

Sync captured data from S3 and point AIPerf at the directory:

$ # Sync all capture files (preserves hourly directory structure)
$ aws s3 sync \
>   s3://my-bucket/datacapture/my-endpoint/primary/ \
>   ./captured_data/
$ 
$ # Replay against a target server
$ aiperf profile \
>     --model my-model \
>     --endpoint-type chat \
>     --url localhost:8000 \
>     --input-file ./captured_data/ \
>     --custom-dataset-type sagemaker_data_capture \
>     --fixed-schedule \
>     --fixed-schedule-auto-offset

The loader recursively finds all .jsonl files in the directory, parses them, and sorts records by timestamp. No manual file concatenation is needed.

Single-file input also works:

$ # Concatenate if preferred
$ find captured_data/ -name "*.jsonl" -exec cat {} + > all_captures.jsonl
$ 
$ aiperf profile \
>     --model my-model \
>     --endpoint-type chat \
>     --url localhost:8000 \
>     --input-file all_captures.jsonl \
>     --custom-dataset-type sagemaker_data_capture \
>     --fixed-schedule \
>     --fixed-schedule-auto-offset

Replay a Time Window

Use timestamp offsets to replay a subset of the captured traffic:

$ aiperf profile \
>     --model my-model \
>     --endpoint-type chat \
>     --url localhost:8000 \
>     --input-file ./captured_data/ \
>     --custom-dataset-type sagemaker_data_capture \
>     --fixed-schedule \
>     --fixed-schedule-auto-offset \
>     --fixed-schedule-end-offset 300000

This replays only the first 5 minutes (300,000 ms) of captured traffic.

Enabling Data Capture on Your Endpoint

When creating the endpoint configuration, include DataCaptureConfig with JsonContentTypes to store payloads as raw JSON (not base64):

1 import boto3
2 
3 client = boto3.client("sagemaker")
4 
5 client.create_endpoint_config(
6     EndpointConfigName="my-endpoint-config-with-capture",
7     ProductionVariants=[{
8         "VariantName": "primary",
9         "ModelName": "my-model",
10         "InitialInstanceCount": 1,
11         "InstanceType": "ml.g5.xlarge",
12         "InitialVariantWeight": 1.0,
13     }],
14     DataCaptureConfig={
15         "EnableCapture": True,
16         "InitialSamplingPercentage": 100,
17         "DestinationS3Uri": "s3://my-bucket/datacapture",
18         "CaptureOptions": [
19             {"CaptureMode": "Input"},
20             {"CaptureMode": "Output"},
21         ],
22         "CaptureContentTypeHeader": {
23             "JsonContentTypes": ["application/json"],
24         },
25     },
26 )

Setting JsonContentTypes ensures payloads are stored as raw JSON. Without it, SageMaker base64-encodes the data by default. The AIPerf loader handles both encodings.

Known Limitations

Second-level timestamp precision: inferenceTime has no fractional seconds. At high QPS, requests sharing the same second fire in rapid succession.
No streaming capture: InvokeEndpointWithResponseStream responses are not captured by SageMaker. Output token counts may be missing for streaming endpoints.
Single-turn only: Each captured record is an independent request. No multi-turn session linking.
OpenAI-compatible only: The captured payload must contain a messages array. Non-chat endpoints are not supported.

Trace Replay with Mooncake Traces - Mooncake FAST’25 trace replay
Bailian Traces - Bailian production trace replay
Fixed Schedule - Precise timestamp-based execution for any dataset