YAML Config — Future Goals
This document captures scoped-out design ideas from prior YAML-config feature brainstorms. Each entry has motivation, a sketch, design questions still open, and prior-art references. None of these are committed to a release; they are durable idea-storage so we don’t redo the design conversation when a user request resurrects one.
When a future-goal entry graduates to a real feature, it gets its own spec under .claude/specs/YYYY-MM-DD-<topic>-design.md, the entry below shrinks to a one-line pointer, and the spec assumes responsibility for the design.
(.claude/specs/2026-05-10-inline-dataset-design.md)
The v1 inline-dataset feature shipped just two YAML changes: a new records: field on FileDataset (alternative to path:), with multi-pool dict-of-lists support for random_pool. The brainstorm raised several adjacent ideas that we deliberately punted to keep v1 tight. They are recorded here in roughly the order a user is likely to ask for them.
generate: directive — programmatic record building
Motivation. Inline records: is great for hand-curated prompt sets but gets unwieldy when you want N parameterized prompts. Use case: “32 prompts of length-N built from a template,” or “100 trace records with computed timestamps.” Today this is achievable only by pre-rendering YAML out-of-band.
Sketch.
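As a rough illustration of the precedence, collision, and bounds rules described in this section, here is a minimal Python sketch. It is not AIPerf code: `expand_generate`, `build_context`, and `ConfigError` are hypothetical names, and the standard library's `string.Template` (with `$name` placeholders) stands in for Jinja so the example runs without extra dependencies.

```python
from string import Template

MAX_COUNT = 1_000_000   # upper bound on count stated in this section
LOOP_VAR = "index"      # 0-indexed loop variable


class ConfigError(ValueError):
    """Hypothetical load-time configuration error."""


def build_context(index, block_vars, top_variables):
    """Merge variables so that: loop var > vars: > top-level variables:."""
    for source, mapping in (("vars", block_vars), ("variables", top_variables)):
        if LOOP_VAR in mapping:
            # Refuse to shadow the loop variable silently.
            raise ConfigError(f"{source}: must not define '{LOOP_VAR}'")
    # In a dict literal, later entries win.
    return {**top_variables, **block_vars, LOOP_VAR: index}


def expand_generate(count, record, block_vars=None, top_variables=None):
    """Render `record` (a dict of template strings) `count` times."""
    block_vars = block_vars or {}
    top_variables = top_variables or {}
    if not 1 <= count <= MAX_COUNT:
        raise ConfigError(f"count must be in [1, {MAX_COUNT}], got {count}")
    records = []
    for index in range(count):
        context = build_context(index, block_vars, top_variables)
        # substitute() raises KeyError on an undefined name, so a bad
        # template fails on the first iteration rather than mid-run.
        records.append(
            {key: Template(text).substitute(context) for key, text in record.items()}
        )
    return records


records = expand_generate(
    count=3,
    record={"text": "Prompt $index about $topic"},
    block_vars={"topic": "cats"},               # overrides the top-level value
    top_variables={"topic": "dogs", "lang": "en"},
)
print(records[0]["text"])  # Prompt 0 about cats
```

The collision check runs before any rendering, which matches the intent of failing at load rather than shadowing silently; a real implementation would render through Jinja and the format's per-record validator instead.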
- index (0-indexed). Chosen over i after prior-art review — Ansible’s item collision tax and Helm’s $index+meaningful-name best-practice both argue for a longer, namespaced loop var.
- vars: override top-level variables:. Precedence (highest wins): loop var > vars: > top-level variables:. Collision between vars:/variables: and index raises a config error at load (don’t silently shadow).
- count: accepts an int or a Jinja-evaluating string. Bounds 1 <= count <= 1_000_000.
- At lint time (aiperf config validate and expand), render record: once with {index: 0, **vars, **variables}, surface Jinja UndefinedError / TemplateSyntaxError immediately, and run the rendered probe through the format’s per-record validator. Helm’s #1 regret per published pitfalls writeups is debug-at-render — AIPerf already has Jinja in-process, so this is cheap.
- generate: is a third source alongside path: and records:, also XOR. Mixing static records: with generate: was considered (additive concat) but rejected as confusing — if you want both, hand-write the static records as the leading entries of vars: and use a conditional in the template.
Open questions.
- generate: {count, record} (today) vs. generate: oneOf({count, record}, {for_each, record}) (Argo-style withSequence vs withItems)? Argo’s split is clean and avoids Terraform’s count-only regret. v1 ships count: only, but the schema must not lock us out — name the field generate: (not generate_count:) and the for_each: variant slots in later as a sibling key inside the same block.
- Should generate: accept a list of generate blocks (multiple loops concatenated)? Reject for v1; users can rerender via two configs if they need it.
- Should generate: work for multi-pool random_pool? Defer until asked. v1 would constrain generate: to flat-list formats only.
Prior-art references.
- withSequence / withItems — clean separation of count- vs list-driven loops.
- range $i, $v := list — both index and value bound in one breath.
- for_each + each.key/each.value; Terraform issue #23288 — five-year-old regret about no index attribute on dynamic blocks.
- loop_control.loop_var — bare item collision is a famous pain point; argues for namespaced/long loop-var names.
Motivation. Inside generate: (and anywhere else {{ }} evaluates), users want to inject randomness — pick a random topic per iteration, draw a random output length, etc. The use case the brainstorm landed on: "Tell me about {{ random_choice(topics) }}".
Sketch.
Helpers (mirror Python’s random module — least-surprise):
The random_choice headline is sufficient for v1-of-this-deferred-feature; random_int and random_float are natural companions, but random_choice can ship alone.
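A minimal sketch of how such helpers could be backed by Python's `random` module, using the per-field seeding idea from this section's open questions. Everything here is an assumption for illustration: `make_helpers` is a hypothetical factory, the SHA-256 based `stable_hash` is just one possible stable hash, and the `random_int` / `random_float` signatures simply reuse `randint(a, b)` and `uniform(a, b)`.

```python
import hashlib
import random


def stable_hash(parts):
    """Hash a tuple of ints/strings/None to an int that is stable across
    processes (unlike the built-in hash(), which is salted for strings)."""
    digest = hashlib.sha256(repr(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def make_helpers(random_seed, yaml_path, index=None):
    """Build template helpers bound to one field's private RNG stream.

    `index` is None outside a loop and the loop index inside generate:.
    """
    rng = random.Random(stable_hash((random_seed, yaml_path, index)))
    return {
        "random_choice": rng.choice,    # random_choice(seq)
        "random_int": rng.randint,      # random_int(a, b), both ends inclusive
        "random_float": rng.uniform,    # random_float(a, b)
    }


topics = ["cats", "dogs", "birds"]
first = make_helpers(42, "generate.record.text", index=0)
again = make_helpers(42, "generate.record.text", index=0)
# Same (seed, field path, index) yields the same draw on every run.
print(first["random_choice"](topics) == again["random_choice"](topics))  # True
```

Because each field path gets its own stream, adding or reordering an unrelated field does not change the draws for this one.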
Open questions.
- generate:. rng_seed = stable_hash((random_seed, yaml_path_of_field, index_or_None)). A random_choice inside phases.profiling.duration always renders with the same RNG → same value across runs. Inside generate.record.text with count: 100, each iteration mixes index in → 100 distinct deterministic draws. Different fields get different streams, so a random_int in record.output_length doesn’t perturb record.text. Re-ordering an unrelated field doesn’t shuffle the dataset.
- random.Random(random_seed) advances across the whole config render in document order. Simpler; but reordering or adding any field shifts every subsequent draw → fragile reproducibility.
- random_choice(items, weights=[...]) mirrors random.choices. The brainstorm dropped weighted from v1 of this deferred feature to keep API surface minimal; can re-add as a single keyword argument when asked.
- pick(weighted_list) for [{item, weight}, ...] shapes was considered. Defer; verbose Jinja map(attribute=...) works in the meantime.
- random_normal(mean, stddev)? Matches synthetic-prompt distribution shape. Defer until a user asks.
Motivation. Different from Jinja-time random choice. Use case: “60% of requests draw prompt A, 30% B, 10% C.” This belongs to the sampler, not the record builder — the loader produces a flat record list; the sampler picks one per request.
Sketch.
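To make the sampler-side behaviour concrete, here is a small Python sketch of weight normalization and a single weighted draw. The names `normalized_weights`, `sample_weighted`, and `ConfigError` are hypothetical, and the reading of "all weights are zero or absent" as "no record carries an explicit weight, or every weight is zero" is an assumption.

```python
import random


class ConfigError(ValueError):
    """Hypothetical load-time configuration error."""


def normalized_weights(records):
    """Return per-record probabilities; a missing weight defaults to 1.0."""
    if not any("weight" in record for record in records):
        raise ConfigError("weighted_random needs at least one explicit weight")
    weights = [float(record.get("weight", 1.0)) for record in records]
    if any(weight < 0 for weight in weights):
        raise ConfigError("weight must be >= 0")
    total = sum(weights)
    if total == 0:
        raise ConfigError("all weights are zero")
    # Only ratios matter, so the absolute scale is divided out here.
    return [weight / total for weight in weights]


def sample_weighted(records, rng):
    """Pick one record per request according to the normalized weights."""
    return rng.choices(records, weights=normalized_weights(records), k=1)[0]


records = [
    {"text": "prompt A", "weight": 6},
    {"text": "prompt B", "weight": 3},
    {"text": "prompt C", "weight": 1},
]
probabilities = normalized_weights(records)   # roughly 0.6, 0.3, 0.1
picked = sample_weighted(records, random.Random(0))
```

Weights of 6/3/1 and 60/30/10 produce the same probabilities, which is the sense in which the absolute scale is irrelevant.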
- weight: float >= 0 field on every record schema (single_turn, multi_turn, random_pool, traces). Default 1.0. Loader normalizes; absolute scale is irrelevant.
- sampling: weighted_random (sibling of sequential / random / shuffle).
- If sampling: weighted_random and all weights are zero or absent, error. If sampling: random is set but any record has an explicit non-default weight, warn (probable user mistake).
- path: (same weight: field in JSONL records on disk), records:, and (when it ships) generate: (template can compute weight per iteration).
- weight: is a config-error when combined with sequential trace replay.
Open questions.
- random_pool? Wrap each pool: {pool_a: {weight: 3, items: [...]}}. Layers cleanly on top of per-record weights without breaking the bare-list shape.
- weights: [6, 3, 1])? Per-entry — index-fragile parallel lists don’t compose with generate: and are error-prone to maintain.
path: with records: (additive concat)
Motivation. “Start from a known on-disk corpus, add a few hand-crafted extras inline.” A v1 brainstorm option, deferred because the corner cases multiply (sampling/limit semantics across mixed sources, trace timestamps when concat’ing two trace streams).
Sketch. When both are present: file records first, then inline records. Sampling/entries: count cap apply across the merged list. Trace formats reject this combination (timestamp re-basing would be implicit and surprising).
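A minimal sketch of that merge order in Python. `merge_sources`, `ConfigError`, and `TRACE_FORMATS` are hypothetical names, and treating the entries: cap as a simple truncation of the merged list is an assumption made only to keep the example short.

```python
class ConfigError(ValueError):
    """Hypothetical load-time configuration error."""


# Assumed identifier for trace formats; the real set may differ.
TRACE_FORMATS = frozenset({"traces"})


def merge_sources(file_records, inline_records, dataset_format, entries=None):
    """Concatenate path: records and records: records, file records first."""
    if dataset_format in TRACE_FORMATS:
        # Re-basing timestamps across two trace streams would be implicit
        # and surprising, so the combination is rejected outright.
        raise ConfigError("trace formats cannot combine path: with records:")
    merged = list(file_records) + list(inline_records)
    if entries is not None:
        merged = merged[:entries]   # the cap applies to the merged list
    return merged


merged = merge_sources(
    file_records=[{"text": "from disk 1"}, {"text": "from disk 2"}],
    inline_records=[{"text": "inline extra"}],
    dataset_format="single_turn",
)
print([record["text"] for record in merged])
# ['from disk 1', 'from disk 2', 'inline extra']
```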
Open questions.
- inline_position: prepend|append)? Probably not — one fixed order is enough; users who want the other order can swap to path:-only or records:-only.
- random_pool mixing? File directory + inline-dict-of-lists would have to merge by pool name — clean enough, but doubles the surface area. Defer.
records:
Motivation. Argo’s ArtifactLocation.raw = {data: "..."} lesson: wrapping inline data in a typed object even when it has a single field today lets you add encoding:, schema_version:, validate: later without a breaking change.
Sketch. Allow records: to accept either:
- records: [...]
- records: {items: [...], schema_version: 1, encoding: utf-8}
The wrapped form is reserved future-additive. v1 of inline-datasets ships bare-list only; this entry tracks the option and the prior-art that motivates keeping it open.
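One way a loader could accept both shapes is to normalize them into the wrapped form up front. The sketch below assumes flat-list formats only (a random_pool dict-of-lists would need to be told apart from the wrapper, which is not handled here); `normalize_records` and `ConfigError` are hypothetical names, and the defaults are taken from the example values above rather than from any shipped schema.

```python
class ConfigError(ValueError):
    """Hypothetical load-time configuration error."""


ALLOWED_KEYS = {"items", "schema_version", "encoding"}
DEFAULTS = {"schema_version": 1, "encoding": "utf-8"}   # assumed defaults


def normalize_records(value):
    """Accept a bare list or a wrapped object; always return the wrapped form."""
    if isinstance(value, list):
        return {**DEFAULTS, "items": value}
    if isinstance(value, dict):
        unknown = set(value) - ALLOWED_KEYS
        if unknown:
            raise ConfigError(f"unknown records: keys: {sorted(unknown)}")
        if "items" not in value:
            raise ConfigError("wrapped records: requires an items: list")
        # Explicit fields override the defaults.
        return {**DEFAULTS, **value}
    raise ConfigError("records: must be a list or a wrapper object")


bare = normalize_records([{"text": "hello"}])
wrapped = normalize_records({"items": [{"text": "hello"}], "schema_version": 1})
print(bare == wrapped)  # True
```

Downstream code then only ever sees one shape, so adding a field to the wrapper later does not touch the bare-list path.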
Open questions.
from: as the future remote-source key
Motivation. valuesFrom (Helm/Flux/Argo CD) is a familiar k8s-flavored verb for “load from another source.” If we ever want to pull dataset records from a URL or a remote artifact store, from: is the natural sibling of path: (local) and records: (inline).
Sketch (when it ships).
Open questions.
The v1 brainstorm did a structured prior-art survey of dual-source YAML patterns and generate-loop DSLs. Persisting the lessons here so the next config-surface design can reuse them without re-running the survey.
- value XOR externalValue) — two sibling fields, plain mutual exclusion, enforced by linters. Naming is asymmetric on purpose (value is the noun, externalValue is the modifier). Closest precedent for the path: ↔ records: split.
- configMap / secret / emptyDir / …) — N sibling fields, “exactly one” validation webhook. Field name is the discriminator; no type: configMap redundancy.
- ArtifactLocation — same shape as Tekton; raw: {data: ...} is the inline variant. Wrapping inline data in an object lets you add fields later non-breaking.
Lesson: with ≤6 sibling sources and an “exactly one” rule, sibling fields beat a type: discriminator. Adopted in v1.
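The "exactly one sibling field" rule is small enough to sketch directly. This is an illustrative Python version, not the shipped validator: `pick_source` and `ConfigError` are hypothetical names, and generate: is included only as the possible third source discussed earlier in this document.

```python
class ConfigError(ValueError):
    """Hypothetical load-time configuration error."""


# The field name itself is the discriminator; there is no type: key.
SOURCE_FIELDS = ("path", "records", "generate")


def pick_source(dataset):
    """Return the single source field that is set, or raise if not exactly one."""
    present = [name for name in SOURCE_FIELDS if dataset.get(name) is not None]
    if len(present) != 1:
        raise ConfigError(
            f"exactly one of {', '.join(SOURCE_FIELDS)} is required, "
            f"got: {present or 'none'}"
        )
    return present[0]


print(pick_source({"path": "prompts.jsonl"}))          # path
print(pick_source({"records": [{"text": "hello"}]}))   # records
```

Adding a new source later means appending one name to `SOURCE_FIELDS`, which is the practical advantage of sibling fields over a type: discriminator.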
- range $i, $v := list — both index and value bound in one breath, $-prefix marks user variables clearly. until N builtin handles count-based loops.
- for_each + each.key/each.value — stable keys (no churn on insert), namespaced each.*. Notable regret: no index attribute on dynamic-block iterator (open issue #23288 for 5+ years).
- for k, v in list — both index and value, optional if filter, comprehension is the value (no separate “loop directive” keyword). Statically validated.
- loop: with loop_control.loop_var — default loop var is item; collides constantly under nesting. Lesson: short generic names like item or i produce real collisions; require/encourage renaming. This is the single biggest argument for index over i.
- withSequence: {count: N} + withItems: [...] — separate keys for count- vs list-driven loops; same loop variable. Future-additive for_each: for AIPerf would mirror this.
Three-stage validation, not single late-binding crash:
Helm/Argo postmortems converge on this. v1 inline-datasets does parse-time and run-time; lint-time only matters once generate: ships.