List-Metric Aggregation


Some record metrics carry a list[...] value per request rather than a single scalar — each list element is itself a measurement. Today this is inter_chunk_latency only: every request contributes one inter-chunk gap per pair of consecutive streamed chunks.

At the run level the records-manager has to summarize across all per-request lists into a single set of stats (avg, min, max, std, p50, p90, p99, …). Naïvely concatenating the lists into a flat array gives exact stats but linear memory: records × samples_per_record × 8 B. For a long-context streaming benchmark (1 M requests × ~5 K chunks/request) that reaches ~37 GB on the records-manager pod alone — the original cause of an OOM at ramp scale.
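Concretely: 1,000,000 requests × 5,000 samples/request × 8 bytes per float64 = 4 × 10¹⁰ bytes ≈ 37.3 GiB held in a single flat array.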

To bound memory, AIPerf aggregates list-valued record metrics with a t-digest sketch + five running side-channel scalars.

What stays exact, what becomes approximate

| Stat | Source | Accuracy |
| --- | --- | --- |
| count | running int | bit-exact |
| sum | running float64 | bit-exact (within float round-off across summation orders) |
| min, max | running scalars | bit-exact |
| avg | sum / count | bit-exact |
| std | sqrt(max(0, sum_sq/count − avg²)) | bit-exact (population std, matches np.std) |
| p1–p99 | t-digest sketch | approximate (see empirical band below) |

The side-channel scalars cost a fixed 40 bytes, and the t-digest's centroid list stays bounded (a ~4 KB sketch at the default compression), regardless of sample count.
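The run-level mechanics can be summarized in a short sketch. The class and method names below are illustrative, not AIPerf's actual implementation; only the ingredients match the table above: a crick.TDigest sketch plus five running scalars, folded together record by record and read out once at summary time.

```python
import math

from crick import TDigest


class ListMetricAggregator:
    """Illustrative run-level aggregator for list-valued record metrics:
    a t-digest for percentiles plus five running scalars."""

    def __init__(self, compression: int = 500):
        self.digest = TDigest(compression)  # bounded-size percentile sketch
        self.count = 0        # running int
        self.total = 0.0      # running float64 sum
        self.total_sq = 0.0   # running float64 sum of squares
        self.minimum = math.inf
        self.maximum = -math.inf

    def add_record(self, values: list[float]) -> None:
        """Fold one request's list value (e.g. its inter-chunk latencies) in."""
        for v in values:
            self.count += 1
            self.total += v
            self.total_sq += v * v
            self.minimum = min(self.minimum, v)
            self.maximum = max(self.maximum, v)
            self.digest.add(v)  # one sample into the sketch

    def summary(self) -> dict:
        avg = self.total / self.count
        # Population std from the running moments, matching np.std.
        std = math.sqrt(max(0.0, self.total_sq / self.count - avg * avg))
        return {
            "count": self.count,
            "avg": avg,
            "min": self.minimum,
            "max": self.maximum,
            "std": std,
            "p50": self.digest.quantile(0.50),
            "p90": self.digest.quantile(0.90),
            "p99": self.digest.quantile(0.99),
        }
```

Per-record memory never grows with sample count: each incoming list touches only the five scalars and the bounded sketch.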

Empirical accuracy

The sketch is backed by crick.TDigest (Cython/C). Accuracy was measured against an exact numpy reference across five trials at each of three workload sizes, with a fresh RNG seed per trial, drawing from a lognormal distribution (mean=ln(5), sigma=0.4) clipped to [0.5, 50] ms, representative of an ICL distribution at moderate ITL:

| Sample count | Worst-case max % error across percentiles | Throughput |
| --- | --- | --- |
| 200 K | 0.068 % | ~9.6 M updates/s |
| 5 M | 0.021 % | ~12.0 M updates/s |
| 50 M | 0.012 % | ~12.0 M updates/s |

At the default compression the worst-case relative error stays well under the 0.5% band a t-digest typically promises: roughly 7× under at 200 K samples and about 40× under at 50 M. Mid-range percentiles (p10–p90) are tighter still, usually agreeing with the exact numpy reference to about five significant digits.
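The accuracy numbers above can be spot-checked with a few lines. This is a sketch under the stated workload assumptions, not the harness AIPerf itself uses, and a single seed will not reproduce the five-trial worst case exactly.

```python
import numpy as np
from crick import TDigest


def max_percentile_error(n_samples: int, compression: int = 500, seed: int = 0) -> float:
    """Worst-case relative error of t-digest percentiles against an exact
    numpy reference, on the clipped-lognormal workload described above."""
    rng = np.random.default_rng(seed)
    samples = np.clip(
        rng.lognormal(mean=np.log(5), sigma=0.4, size=n_samples), 0.5, 50.0
    )

    digest = TDigest(compression)
    digest.update(samples)

    percentiles = [1, 5, 10, 25, 50, 75, 90, 95, 99]
    exact = np.percentile(samples, percentiles)
    approx = np.array([digest.quantile(p / 100) for p in percentiles])
    return float(np.max(np.abs(approx - exact) / exact))


# max_percentile_error(200_000) should land in the neighborhood of the
# 0.068 % worst case reported above, though any single seed will vary.
```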

The compression parameter is exposed as AIPERF_METRICS_TDIGEST_COMPRESSION (default 500) for benchmarks that want even tighter percentiles at the cost of a slightly larger sketch.

Per-record values are unchanged

The aggregation described above applies only at the run level. The per-record JSONL (profile_export.jsonl) preserves each request's full list value verbatim: exact, byte-for-byte, and ready for downstream tooling such as aiperf plot to compute its own per-request stats.
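Anything that needs sample-level precision can read that file directly. The key names used below ("metrics", "inter_chunk_latency") are assumptions for illustration; check the actual record schema in profile_export.jsonl.

```python
import json

import numpy as np

with open("profile_export.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Assumed field layout; adjust to the real schema.
        icl = record.get("metrics", {}).get("inter_chunk_latency")
        if not icl:
            continue
        icl = np.asarray(icl, dtype=np.float64)
        # Exact per-request stats computed from the verbatim list value.
        print(f"n={icl.size} mean={icl.mean():.3f} p99={np.percentile(icl, 99):.3f}")
```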

What this means for benchmark output

For ICL specifically:

  • The numbers in profile_export_aiperf.{json,csv} come from the t-digest aggregator. Percentile values typically match a direct numpy computation to within 0.05% relative error on benchmark-scale runs; tail percentiles (p1, p99) at small N exhibit slightly more rank-jitter but stay well under the 0.5% band.
  • count, sum, min, max, avg, std are computed exactly and match what an exact array would produce.
  • Per-request ICL lists in profile_export.jsonl are unchanged — anything that needs sample-level precision can read those.

For all other metrics: no change. Scalar record metrics still use the exact-storage MetricArray path. Aggregate metrics (inter_token_latency, request_latency, etc.) are still computed by their own existing aggregators; t-digest is not in their path.

Where it lives