List-Metric Aggregation
Some record metrics carry a `list[...]` value per request rather than a single scalar — each list element is itself a measurement. Today this is `inter_chunk_latency` only: every request contributes one inter-chunk gap per pair of consecutive streamed chunks.
At the run level the records-manager has to summarize across all per-request lists into a single set of stats (avg, min, max, std, p50, p90, p99, …). Naïvely concatenating the lists into a flat array gives exact stats but linear memory: records × samples_per_record × 8 B. For a long-context streaming benchmark (1 M requests × ~5 K chunks/request) that reaches ~37 GB on the records-manager pod alone — the original cause of an OOM at ramp scale.
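As a quick sanity check on that figure (a back-of-the-envelope calculation, nothing more):

```python
# Naive flat-array cost: records × samples_per_record × 8 B (float64).
requests = 1_000_000          # long-context streaming benchmark
chunks_per_request = 5_000    # ~5 K inter-chunk gaps per request
bytes_per_sample = 8

flat_bytes = requests * chunks_per_request * bytes_per_sample
print(f"{flat_bytes / 2**30:.1f} GiB")  # ~37.3 GiB on the records-manager pod
```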
To bound memory, AIPerf aggregates list-valued record metrics with a t-digest sketch + five running side-channel scalars.
What stays exact, what becomes approximate
The five side-channel scalars keep `count`, `sum`, `min`, `max`, `avg`, and `std` exact; percentiles come from the t-digest and are approximate. Memory cost of the side-channel scalars is 40 bytes regardless of sample count, and the t-digest's centroids stay bounded (~4 KB sketch at the default compression) regardless of sample count.
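A minimal sketch of the shape of this aggregation (not the actual `TDigestListMetricAggregator`; the class and method names below are illustrative, and it assumes `crick.TDigest` exposes `update` and `quantile` as used here):

```python
import math
from crick import TDigest  # Cython/C-backed t-digest


class ListMetricSketch:
    """Illustrative only: a t-digest for percentiles plus five exact running scalars."""

    def __init__(self, compression: int = 500):
        self._digest = TDigest(compression=compression)
        # Five side-channel scalars: 5 x 8 B = 40 B, independent of sample count.
        self._count = 0
        self._sum = 0.0
        self._sum_sq = 0.0
        self._min = math.inf
        self._max = -math.inf

    def add_record(self, values: list[float]) -> None:
        """Fold one request's list value (e.g. its inter-chunk gaps) into run-level state."""
        self._digest.update(values)  # approximate path: feeds the percentile sketch
        for v in values:             # exact path: running scalars
            self._count += 1
            self._sum += v
            self._sum_sq += v * v
            self._min = min(self._min, v)
            self._max = max(self._max, v)

    def summary(self) -> dict[str, float]:
        avg = self._sum / self._count
        var = max(self._sum_sq / self._count - avg * avg, 0.0)
        return {
            # exact, from the five scalars
            "count": self._count, "sum": self._sum, "min": self._min,
            "max": self._max, "avg": avg, "std": math.sqrt(var),
            # approximate, from the t-digest
            "p50": self._digest.quantile(0.5),
            "p90": self._digest.quantile(0.9),
            "p99": self._digest.quantile(0.99),
        }
```

The scalars are what keep `count`/`sum`/`min`/`max`/`avg`/`std` exact while the digest bounds the memory spent on percentiles.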
Empirical accuracy
Backed by `crick.TDigest` (Cython/C). Measured against a numpy reference across five trials at three workload sizes, with a fresh RNG seed per trial, drawing from a lognormal with mean=ln(5), sigma=0.4, clipped to [0.5, 50] ms (a representative ICL distribution at moderate ITL):
At the default compression the worst-case relative error observed was roughly 40× below the 0.5% band a t-digest typically promises. Mid-range percentiles (p10–p90) are tighter still, usually agreeing with the exact numpy reference to about five significant digits.
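A rough way to reproduce that kind of comparison for one trial (the sample size and loop below are illustrative, not the exact trial harness):

```python
import numpy as np
from crick import TDigest

rng = np.random.default_rng()  # fresh seed per trial
# Lognormal mean=ln(5), sigma=0.4, clipped to [0.5, 50] ms, as described above.
samples = np.clip(rng.lognormal(mean=np.log(5), sigma=0.4, size=1_000_000), 0.5, 50.0)

digest = TDigest(compression=500)  # default AIPERF_METRICS_TDIGEST_COMPRESSION
digest.update(samples)

for q in (0.01, 0.10, 0.50, 0.90, 0.99):
    exact = np.quantile(samples, q)
    approx = digest.quantile(q)
    print(f"p{q * 100:>4.0f}  rel. error = {abs(approx - exact) / exact:.2e}")
```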
The compression parameter is exposed as AIPERF_METRICS_TDIGEST_COMPRESSION (default 500) for benchmarks that want even tighter percentiles at the cost of a slightly larger sketch.
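Illustratively, the knob feeds straight into the sketch's construction (the real wiring is in `Environment.METRICS.TDIGEST_COMPRESSION`, listed below):

```python
import os
from crick import TDigest

# Higher compression keeps more centroids: tighter percentiles, slightly larger sketch.
compression = int(os.environ.get("AIPERF_METRICS_TDIGEST_COMPRESSION", "500"))
digest = TDigest(compression=compression)
```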
Per-record values are unchanged
The aggregation described above applies only at the run level. The per-record JSONL (`profile_export.jsonl`) preserves each request's full list value verbatim: exact, byte-for-byte, and ready for downstream tooling like `aiperf plot` to compute its own per-request stats.
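For example, a downstream script can recompute exact per-request stats straight from the JSONL (the field names here are hypothetical; the actual schema is whatever AIPerf's export writer emits):

```python
import json
import numpy as np

with open("profile_export.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # "inter_chunk_latency" and "request_id" as field names are assumptions for illustration.
        icl = np.asarray(record.get("inter_chunk_latency", []), dtype=float)
        if icl.size:
            print(record.get("request_id"), icl.mean(), np.percentile(icl, 99))
```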
What this means for benchmark output
For ICL specifically:
- The numbers in `profile_export_aiperf.{json,csv}` come from the t-digest aggregator. Percentile values typically match a direct numpy computation to within 0.05% relative error on benchmark-scale runs; tail percentiles (p1, p99) at small N exhibit slightly more rank-jitter but stay well under the 0.5% band. `count`, `sum`, `min`, `max`, `avg`, and `std` are computed exactly and match what an exact array would produce.
- Per-request ICL lists in `profile_export.jsonl` are unchanged — anything that needs sample-level precision can read those.
For all other metrics: no change. Scalar record metrics still use the exact-storage `MetricArray` path. Aggregate metrics (`inter_token_latency`, `request_latency`, etc.) are computed through their own existing aggregators; the t-digest is not in their path.
Where it lives
- Aggregator class: `src/aiperf/metrics/list_metric_aggregation.py` — `TDigestListMetricAggregator`.
- Selection site: `src/aiperf/post_processors/metric_results_processor.py` — first-touch dispatch by `isinstance(value, list)`.
- Compression knob: `Environment.METRICS.TDIGEST_COMPRESSION` (env: `AIPERF_METRICS_TDIGEST_COMPRESSION`, default 500).
- Dependency: `crick~=0.0.8` (Cython/C-backed t-digest, BSD-3, dask-org maintained).