YAML Config — Future Goals
This document captures scoped-out design ideas from prior YAML-config feature brainstorms. Each entry has motivation, a sketch, design questions still open, and prior-art references. None of these are committed to a release; they are durable idea-storage so we don’t redo the design conversation when a user request resurrects one.
When a future-goal entry graduates to a real feature, it gets its own spec under .claude/specs/YYYY-MM-DD-<topic>-design.md, the entry below shrinks to a one-line pointer, and the spec assumes responsibility for the design.
Inline-dataset extensions (deferred from .claude/specs/2026-05-10-inline-dataset-design.md)
The v1 inline-dataset feature shipped just two YAML changes: a new `records:` field on `FileDataset` (an alternative to `path:`), and multi-pool dict-of-lists support for `random_pool`. The brainstorm raised several adjacent ideas that we deliberately punted to keep v1 tight. They are recorded here in roughly the order a user is likely to ask for them.
generate: directive — programmatic record building
Motivation. Inline `records:` is great for hand-curated prompt sets but gets unwieldy when you want N parameterized prompts. Use case: “32 prompts of length N built from a template,” or “100 trace records with computed timestamps.” Today this is achievable only by pre-rendering YAML out-of-band.
Sketch.
- Loop variable `index` (0-indexed). Chosen over `i` after prior-art review — Ansible’s `item` collision tax and Helm’s `$index` + meaningful-name best practice both argue for a longer, namespaced loop var.
- Per-block `vars:` override top-level `variables:`. Precedence (highest wins): loop var > `vars:` > top-level `variables:`. Collision between `vars:`/`variables:` and `index` raises a config error at load (don’t silently shadow).
- `count:` accepts an int or a Jinja-evaluating string. Bounds: `1 <= count <= 1_000_000`.
- Probe-time validation: at config load (`aiperf config validate` and `expand`), render `record:` once with `{index: 0, **vars, **variables}`, surface Jinja `UndefinedError`/`TemplateSyntaxError` immediately, and run the rendered probe through the format’s per-record validator. Helm’s #1 regret per published pitfalls writeups is debug-at-render — AIPerf already has Jinja in-process, so this is cheap.
- Mutual exclusion: `generate:` is a third source alongside `path:` and `records:`, also XOR. Mixing static `records:` with `generate:` was considered (additive concat) but rejected as confusing — if you want both, hand-write the static records as the leading entries of `vars:` and use a conditional in the template.
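As an illustration of the rules above, a minimal expansion sketch in Python. It uses `str.format` placeholders as a stand-in for the real Jinja rendering, and `expand_generate` is a hypothetical name, not shipped API:

```python
def expand_generate(block: dict, variables: dict) -> list[dict]:
    """Expand a generate: block into a flat record list (illustrative only).

    String record values use {index}/{name} placeholders via str.format
    as a stand-in for Jinja rendering.
    """
    count = int(block["count"])
    if not (1 <= count <= 1_000_000):
        raise ValueError("count must be in [1, 1_000_000]")
    local_vars = block.get("vars", {})
    if "index" in local_vars or "index" in variables:
        # Don't silently shadow the loop variable.
        raise ValueError("'index' is reserved for the loop variable")
    # Precedence (highest wins): index > vars: > top-level variables:.
    scope = {**variables, **local_vars}
    records = []
    for index in range(count):
        ctx = {**scope, "index": index}
        records.append({key: value.format(**ctx) if isinstance(value, str) else value
                        for key, value in block["record"].items()})
    return records

block = {"count": 3, "vars": {"topic": "YAML"},
         "record": {"text": "Prompt {index} about {topic}"}}
out = expand_generate(block, {})
assert out[0]["text"] == "Prompt 0 about YAML"
assert len(out) == 3
```

Probe-time validation then amounts to running one iteration of this loop at config load, so template errors surface before any request is sent.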
Open questions.
- Should the schema be `generate: {count, record}` (today) vs. `generate: oneOf({count, record}, {for_each, record})` (Argo-style `withSequence` vs `withItems`)? Argo’s split is clean and avoids Terraform’s count-only regret. v1 ships `count:` only, but the schema must not lock us out — name the field `generate:` (not `generate_count:`) and the `for_each:` variant slots in later as a sibling key inside the same block.
- Should `generate:` accept a list of generate blocks (multiple loops concatenated)? Reject for v1; users can rerender via two configs if they need it.
- Should `generate:` work for multi-pool `random_pool`? Defer until asked. v1 would constrain `generate:` to flat-list formats only.
Prior-art references.
- Argo Workflows `withSequence`/`withItems` — clean separation of count- vs list-driven loops.
- Helm `range $i, $v := list` — both index and value bound in one breath.
- Terraform `for_each` + `each.key`/`each.value`; Terraform issue #23288 — five-year-old regret about no index attribute on `dynamic` blocks.
- Ansible `loop_control.loop_var` — bare `item` collision is a famous pain point; argues for namespaced/long loop-var names.
- CUE comprehensions and Jsonnet comprehensions — pure-functional list builders; nice but require a real language.
Jinja random helpers (uniform, deterministic)
Motivation. Inside `generate:` (and anywhere else `{{ }}` evaluates), users want to inject randomness — pick a random topic per iteration, draw a random output length, etc. The use case the brainstorm landed on: `"Tell me about {{ random_choice(topics) }}"`.
Sketch.
Helpers (mirroring Python’s `random` module — least surprise): the `random_choice` headline is sufficient for v1 of this deferred feature and can ship alone; `random_int` and `random_float` are natural companions.
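A sketch of what such a helper set could look like, bound to a seeded RNG before being handed to the template environment — `make_random_helpers` is a hypothetical name, not shipped API:

```python
import random

def make_random_helpers(rng: random.Random) -> dict:
    """Helpers mirroring Python's random module (illustrative sketch).

    The returned dict would be injected as globals into the Jinja
    environment, so templates can call e.g. random_choice(topics).
    """
    return {
        "random_choice": rng.choice,    # mirrors random.choice(seq)
        "random_int": rng.randint,      # mirrors random.randint(a, b), inclusive
        "random_float": rng.uniform,    # mirrors random.uniform(a, b)
    }

helpers = make_random_helpers(random.Random(0))
topics = ["YAML", "Jinja", "tracing"]
assert helpers["random_choice"](topics) in topics
assert 1 <= helpers["random_int"](1, 10) <= 10
```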
Open questions.
- Deterministic seed derivation strategy — two options:
  - A. Per-render-site, indexed inside `generate:`. `rng_seed = stable_hash((random_seed, yaml_path_of_field, index_or_None))`. A `random_choice` inside `phases.profiling.duration` always renders with the same RNG → same value across runs. Inside `generate.record.text` with `count: 100`, each iteration mixes `index` in → 100 distinct deterministic draws. Different fields get different streams, so a `random_int` in `record.output_length` doesn’t perturb `record.text`. Re-ordering an unrelated field doesn’t shuffle the dataset.
  - B. Single global stream. One `random.Random(random_seed)` advances across the whole config render in document order. Simpler; but reordering or adding any field shifts every subsequent draw → fragile reproducibility.
  - Decision (when this graduates): strategy A. Same lesson Helm/Terraform learned the hard way — seed-by-path keeps reproducibility stable across config edits.
- Weighted variants? `random_choice(items, weights=[...])` mirrors `random.choices`. The brainstorm dropped weighted from v1 of this deferred feature to keep the API surface minimal; can re-add as a single keyword argument when asked.
- Object-bundled weighted lists? A sugar helper `pick(weighted_list)` for `[{item, weight}, ...]` shapes was considered. Defer; verbose Jinja `map(attribute=...)` works in the meantime.
- `random_normal(mean, stddev)`? Matches synthetic-prompt distribution shape. Defer until a user asks.
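Strategy A from the seed-derivation discussion above can be sketched in a few lines — `derive_rng` and the SHA-256 construction are illustrative choices, not committed API:

```python
import hashlib
import random

def derive_rng(random_seed: int, field_path: str, index=None) -> random.Random:
    """Strategy A sketch: one deterministic RNG stream per render site.

    The stream depends only on (global seed, YAML path of the field,
    loop index), so editing or reordering unrelated fields never
    shifts this field's draws.
    """
    material = f"{random_seed}|{field_path}|{index}".encode()
    # SHA-256 is stable across processes, unlike Python's salted hash().
    digest = hashlib.sha256(material).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

# Same site + same index -> identical draw across runs.
assert (derive_rng(42, "generate.record.text", 0).choice(["x", "y", "z"])
        == derive_rng(42, "generate.record.text", 0).choice(["x", "y", "z"]))

# Each loop index gets its own independent deterministic stream.
draws = {derive_rng(42, "generate.record.text", i).random() for i in range(100)}
assert len(draws) == 100
```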
Runtime weighted sampling
Motivation. Different from Jinja-time random choice. Use case: “60% of requests draw prompt A, 30% B, 10% C.” This belongs to the sampler, not the record builder — the loader produces a flat record list; the sampler picks one per request.
Sketch.
- Optional `weight: float >= 0` field on every record schema (`single_turn`, `multi_turn`, `random_pool`, traces). Default `1.0`. Loader normalizes; absolute scale is irrelevant.
- New sampling mode `sampling: weighted_random` (sibling of `sequential`/`random`/`shuffle`).
- Validation: if `sampling: weighted_random` and all weights are zero or absent, error. If `sampling: random` is set but any record has an explicit non-default weight, warn (probable user mistake).
- Composes with `path:` (same `weight:` field in JSONL records on disk), `records:`, and (when it ships) `generate:` (the template can compute a weight per iteration).
- Trace formats keep their own ordering semantics; `weight:` is a config error when combined with sequential trace replay.
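A minimal sampler sketch for the semantics above, built on `random.choices` — the class name and error messages are illustrative, not shipped API:

```python
import random

class WeightedSampler:
    """Sketch of sampling: weighted_random (illustrative, not shipped).

    Records carry an optional weight (default 1.0); weights are
    validated once, then each request draws one record.
    """
    def __init__(self, records: list, seed: int = 0):
        weights = [float(r.get("weight", 1.0)) for r in records]
        if any(w < 0 for w in weights):
            raise ValueError("weight must be >= 0")
        if sum(weights) == 0:
            raise ValueError("weighted_random needs at least one positive weight")
        self._records = records
        self._weights = weights
        self._rng = random.Random(seed)

    def sample(self) -> dict:
        # Absolute scale is irrelevant: random.choices normalizes internally.
        return self._rng.choices(self._records, weights=self._weights, k=1)[0]

records = [{"text": "A", "weight": 6},
           {"text": "B", "weight": 3},
           {"text": "C", "weight": 1}]
sampler = WeightedSampler(records, seed=7)
counts = {"A": 0, "B": 0, "C": 0}
for _ in range(10_000):
    counts[sampler.sample()["text"]] += 1
assert counts["A"] > counts["B"] > counts["C"]   # roughly a 60/30/10 split
```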
Open questions.
- Per-pool weights for multi-pool `random_pool`? Wrap each pool: `{pool_a: {weight: 3, items: [...]}}`. Layers cleanly on top of per-record weights without breaking the bare-list shape.
- Should weights live on the record (per-entry) or in a parallel list (`weights: [6, 3, 1]`)? Per-entry — index-fragile parallel lists don’t compose with `generate:` and are error-prone to maintain.
Mixing path: with records: (additive concat)
Motivation. “Start from a known on-disk corpus, add a few hand-crafted extras inline.” A v1 brainstorm option, deferred because the corner cases multiply (sampling/limit semantics across mixed sources, trace timestamps when concat’ing two trace streams).
Sketch. When both are present: file records first, then inline records. Sampling and the `entries:` count cap apply across the merged list. Trace formats reject this combination (timestamp re-basing would be implicit and surprising).
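A purely illustrative shape of the mixed config — the surrounding `dataset:` key is a placeholder; only `path:` and `records:` come from the v1 spec:

```yaml
dataset:
  path: corpora/base.jsonl      # file records come first
  records:                      # inline extras are appended after them
    - text: "hand-crafted edge case 1"
    - text: "hand-crafted edge case 2"
```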
Open questions.
- Should the order be configurable (e.g. `inline_position: prepend|append`)? Probably not — one fixed order is enough; users who want the other order can swap to `path:`-only or `records:`-only.
- Multi-pool `random_pool` mixing? A file directory + inline dict-of-lists would have to merge by pool name — clean enough, but it doubles the surface area. Defer.
Optional wrapper schema for records:
Motivation. Argo’s `ArtifactLocation.raw = {data: "..."}` lesson: wrapping inline data in a typed object, even when it has a single field today, lets you add `encoding:`, `schema_version:`, `validate:` later without a breaking change.
Sketch. Allow `records:` to accept either:
- Bare list (today): `records: [...]`
- Wrapped form: `records: {items: [...], schema_version: 1, encoding: utf-8}`
The wrapped form is reserved future-additive. v1 of inline-datasets ships bare-list only; this entry tracks the option and the prior-art that motivates keeping it open.
Open questions.
- Is the wrapper worth introducing before any of those reserved fields has a real use case? Probably no — premature surface area. Wait for the first concrete need (e.g. base64-encoded image records, or a schema-version stamp) and design the wrapper around that need instead of speculatively.
Reserve from: as the future remote-source key
Motivation. `valuesFrom` (Helm/Flux/Argo CD) is a familiar k8s-flavored verb for “load from another source.” If we ever want to pull dataset records from a URL or a remote artifact store, `from:` is the natural sibling of `path:` (local) and `records:` (inline).
Sketch (when it ships).
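Nothing here is designed yet; a hypothetical shape, only to show why `from:` reads as the natural third sibling (every field below is invented for illustration):

```yaml
dataset:
  from:                                    # hypothetical remote source
    url: https://example.com/corpus.jsonl  # hypothetical field
    cache: true                            # caching/integrity details TBD
```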
Open questions.
- This is a different-enough feature (network I/O, caching, integrity) that it deserves its own spec when it graduates. The future-goal entry just reserves the field name.
Prior-art notes (referenced from the inline-dataset spec)
The v1 brainstorm did a structured prior-art survey of dual-source YAML patterns and generate-loop DSLs. Persisting the lessons here so the next config-surface design can reuse them without re-running the survey.
Dual-source patterns (file XOR inline)
- OpenAPI Example Object (`value` XOR `externalValue`) — two sibling fields, plain mutual exclusion, enforced by linters. Naming is asymmetric on purpose (`value` is the noun, `externalValue` is the modifier). Closest precedent for the `path:` ↔ `records:` split.
- Tekton Workspace bindings (`configMap`/`secret`/`emptyDir`/…) — N sibling fields, “exactly one” validation webhook. The field name is the discriminator; no `type: configMap` redundancy.
- Argo Workflows `ArtifactLocation` — same shape as Tekton; `raw: {data: ...}` is the inline variant. Wrapping inline data in an object lets you add fields later without breaking changes.
Lesson: with ≤6 sibling sources and an “exactly one” rule, sibling fields beat a `type:` discriminator. Adopted in v1.
Generate-loop patterns
- Helm `range $i, $v := list` — both index and value bound in one breath; the `$` prefix marks user variables clearly. The `until N` builtin handles count-based loops.
- Terraform `for_each` + `each.key`/`each.value` — stable keys (no churn on insert), namespaced `each.*`. Notable regret: no index attribute on the dynamic-block iterator (issue #23288, open for 5+ years).
- CUE comprehensions `for k, v in list` — both index and value, optional `if` filter, and the comprehension is the value (no separate “loop directive” keyword). Statically validated.
- Ansible `loop:` with `loop_control.loop_var` — the default loop var is `item`; it collides constantly under nesting. Lesson: short generic names like `item` or `i` produce real collisions; require/encourage renaming. This is the single biggest argument for `index` over `i`.
- Argo `withSequence: {count: N}` + `withItems: [...]` — separate keys for count- vs list-driven loops; same loop variable. A future-additive `for_each:` for AIPerf would mirror this.
Validation strategy that wins
Three-stage validation, not a single late-binding crash:
- Parse-time — structural mutual exclusion, type checks, bounds.
- Lint-time — template renders with mock vars, Jinja syntax/undefined errors surface before run.
- Run-time — only the actual data substitution.
Helm/Argo postmortems converge on this. v1 inline-datasets does parse-time and run-time; lint-time only matters once `generate:` ships.
Sources
- Argo Workflows Artifacts
- Argo Workflows Loops
- Tekton Workspaces
- OpenAPI 3.1 — Example Object
- Helm Variables
- Helm Templating Pitfalls
- Kustomize configMapGenerator
- Ansible loop docs
- Terraform Dynamic Blocks
- Terraform issue #23288
- CUE / YAML
- Jsonnet comprehensions
- Pkl for-generator amends limitation
- GitHub Actions matrix
- JSON Schema $ref vs inline