YAML Config — Future Goals


This document captures scoped-out design ideas from prior YAML-config feature brainstorms. Each entry has motivation, a sketch, design questions still open, and prior-art references. None of these are committed to a release; they are durable idea-storage so we don’t redo the design conversation when a user request resurrects one.

When a future-goal entry graduates to a real feature, it gets its own spec under .claude/specs/YYYY-MM-DD-<topic>-design.md, the entry below shrinks to a one-line pointer, and the spec assumes responsibility for the design.


Inline-dataset extensions (deferred from .claude/specs/2026-05-10-inline-dataset-design.md)

The v1 inline-dataset feature shipped just two YAML changes: a new records: field on FileDataset (an alternative to path:) and multi-pool dict-of-lists support for random_pool. The brainstorm raised several adjacent ideas that we deliberately punted to keep v1 tight. They are recorded here in roughly the order a user is likely to ask for them.

generate: directive — programmatic record building

Motivation. Inline records: is great for hand-curated prompt sets but gets unwieldy when you want N parameterized prompts. Use case: “32 prompts of length-N built from a template,” or “100 trace records with computed timestamps.” Today this is achievable only by pre-rendering YAML out-of-band.

Sketch.

```yaml
benchmark:
  dataset:
    type: file
    format: single_turn
    generate:
      count: 32
      record:
        text: "Sample {{ index }} on {{ topic }}"
        output_length: "{{ 50 + (index % 10) * 50 }}"
      vars:
        topic: physics
```
  • Loop variable index (0-indexed). Chosen over i after prior-art review — Ansible’s item collision tax and Helm’s $index+meaningful-name best-practice both argue for a longer, namespaced loop var.
  • Per-block vars: override top-level variables:. Precedence (highest wins): loop var > vars: > top-level variables:. Collision between vars:/variables: and index raises a config error at load (don’t silently shadow).
  • count: accepts an int or a Jinja-evaluating string. Bounds 1 <= count <= 1_000_000.
  • Probe-time validation: at config load (aiperf config validate and expand), render record: once with {index: 0, **vars, **variables}, surface Jinja UndefinedError / TemplateSyntaxError immediately, and run the rendered probe through the format’s per-record validator. Helm’s #1 regret per published pitfalls writeups is debug-at-render — AIPerf already has Jinja in-process, so this is cheap.
  • Mutual exclusion: generate: is a third source alongside path: and records:, also XOR. Mixing static records: with generate: was considered (additive concat) but rejected as confusing — if you want both, hand-write the static records as the leading entries of vars: and use a conditional in the template.
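The precedence and collision rules above can be sketched as a small merge function. This is a minimal sketch, not AIPerf code — `build_render_context` and `ConfigError` are illustrative names:

```python
# Sketch of the generate: variable-precedence rules described above.
# build_render_context and ConfigError are illustrative, not AIPerf APIs.

class ConfigError(Exception):
    """Raised at config load for invalid generate: blocks."""

def build_render_context(index, block_vars, top_level_variables):
    """Merge template variables for one generate: iteration.

    Precedence (highest wins): loop var index > block vars: > top-level
    variables:. Declaring index in either vars layer is a config error
    rather than a silent shadow.
    """
    for name, layer in (("vars", block_vars), ("variables", top_level_variables)):
        if "index" in layer:
            raise ConfigError(f"'{name}:' must not define 'index' (reserved loop variable)")
    context = dict(top_level_variables)   # lowest precedence
    context.update(block_vars)            # block-level vars: override
    context["index"] = index              # loop var always wins
    return context
```

Erroring on an `index` collision (rather than shadowing) is what makes the probe-time validation step able to catch the mistake at load rather than at render.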

Open questions.

  • Should the schema be generate: {count, record} (today) vs. generate: oneOf({count, record}, {for_each, record}) (Argo-style withSequence vs withItems)? Argo’s split is clean and avoids Terraform’s count-only regret. v1 ships count: only, but the schema must not lock us out — name the field generate: (not generate_count:) and the for_each: variant slots in later as a sibling key inside the same block.
  • Should generate: accept a list of generate blocks (multiple loops concatenated)? Reject for v1; users can rerender via two configs if they need it.
  • Should generate: work for multi-pool random_pool? Defer until asked. v1 would constrain generate: to flat-list formats only.

Prior-art references. See “Prior-art notes (referenced from the inline-dataset spec)” at the bottom of this document.

Jinja random helpers (uniform, deterministic)

Motivation. Inside generate: (and anywhere else {{ }} evaluates), users want to inject randomness — pick a random topic per iteration, draw a random output length, etc. The use case the brainstorm landed on: "Tell me about {{ random_choice(topics) }}".

Sketch.

```yaml
schemaVersion: "2.0"
random_seed: 42

variables:
  topics: [physics, chemistry, biology]

benchmark:
  dataset:
    type: file
    format: single_turn
    generate:
      count: 100
      record:
        text: "Tell me about {{ random_choice(topics) }}"
        output_length: "{{ random_int(50, 500) }}"
```

Helpers (mirror Python’s random module — least-surprise):

| Helper | Shape |
| --- | --- |
| `random_choice(items)` | uniform pick from a list |
| `random_int(min, max)` | inclusive integer |
| `random_float(min, max)` | uniform float |

The random_choice headline is sufficient for v1 of this deferred feature; random_int and random_float are natural companions, but random_choice can ship alone.
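A minimal sketch of the three helpers, mirroring Python's random module as the table says. Each takes an explicit random.Random so the seed-derivation policy (discussed below) stays a separate concern; the function names come from the table, the explicit-rng signature is an assumption of this sketch:

```python
import random

def random_choice(rng: random.Random, items):
    """Uniform pick from a non-empty list (mirrors random.choice)."""
    return rng.choice(items)

def random_int(rng: random.Random, low, high):
    """Inclusive integer in [low, high] (mirrors random.randint)."""
    return rng.randint(low, high)

def random_float(rng: random.Random, low, high):
    """Uniform float in [low, high] (mirrors random.uniform)."""
    return rng.uniform(low, high)
```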

Open questions.

  • Deterministic seed derivation strategy — two options:
    • A. Per-render-site, indexed inside generate:. rng_seed = stable_hash((random_seed, yaml_path_of_field, index_or_None)). A random_choice inside phases.profiling.duration always renders with the same RNG → same value across runs. Inside generate.record.text with count: 100, each iteration mixes index in → 100 distinct deterministic draws. Different fields get different streams, so a random_int in record.output_length doesn’t perturb record.text. Re-ordering an unrelated field doesn’t shuffle the dataset.
    • B. Single global stream. One random.Random(random_seed) advances across the whole config render in document order. Simpler; but reordering or adding any field shifts every subsequent draw → fragile reproducibility.
    • Decision (when this graduates): strategy A. Same lesson Helm/Terraform learned the hard way — seed-by-path keeps reproducibility stable across config edits.
  • Weighted variants? random_choice(items, weights=[...]) mirrors random.choices. The brainstorm dropped weighted from v1 of this deferred feature to keep API surface minimal; can re-add as a single keyword argument when asked.
  • Object-bundled weighted lists? A sugar helper pick(weighted_list) for [{item, weight}, ...] shapes was considered. Defer; verbose Jinja map(attribute=...) works in the meantime.
  • random_normal(mean, stddev)? Matches synthetic-prompt distribution shape. Defer until a user asks.
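Strategy A above can be sketched in a few lines. `stable_hash` and `rng_for_site` are illustrative names; the key property is a process-stable digest of (seed, field path, loop index), since Python's built-in `hash()` is salted per process:

```python
import hashlib
import random

def stable_hash(*parts) -> int:
    """Process-stable digest of the parts (built-in hash() is salted)."""
    digest = hashlib.sha256("\x1f".join(map(str, parts)).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def rng_for_site(random_seed, yaml_path, index=None) -> random.Random:
    """One independent, reproducible stream per (field path, loop index)."""
    return random.Random(stable_hash(random_seed, yaml_path, index))
```

Same (path, index) gives the same draws across runs, and sibling fields get independent streams, which is exactly the re-ordering stability the strategy-A bullet describes.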

Runtime weighted sampling

Motivation. Different from Jinja-time random choice. Use case: “60% of requests draw prompt A, 30% B, 10% C.” This belongs to the sampler, not the record builder — the loader produces a flat record list; the sampler picks one per request.

Sketch.

  • Optional weight: float >= 0 field on every record schema (single_turn, multi_turn, random_pool, traces). Default 1.0. Loader normalizes; absolute scale is irrelevant.
  • New sampling mode sampling: weighted_random (sibling of sequential / random / shuffle).
  • Validation: if sampling: weighted_random and all weights are zero or absent, error. If sampling: random is set but any record has an explicit non-default weight, warn (probable user mistake).
  • Composes with path: (same weight: field in JSONL records on disk), records:, and (when it ships) generate: (template can compute weight per iteration).
  • Trace formats keep their own ordering semantics; weight: is a config-error when combined with sequential trace replay.
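The sampler side of the sketch fits in one function. This is a hedged illustration, not the AIPerf sampler: `pick_weighted` is a hypothetical name, and it implements the bullets above (default weight 1.0, error when no explicit non-zero weight exists) via stdlib random.choices:

```python
import random

def pick_weighted(records, rng: random.Random):
    """One draw per request; 'weight' defaults to 1.0 per the sketch above."""
    weights = [float(r.get("weight", 1.0)) for r in records]
    if any(w < 0 for w in weights):
        raise ValueError("weight: must be >= 0")
    if not any("weight" in r and float(r["weight"]) > 0 for r in records):
        raise ValueError("sampling: weighted_random with all weights zero or absent")
    return rng.choices(records, weights=weights, k=1)[0]
```

Because the loader normalizes, absolute scale is irrelevant: weights 6/3/1 and 60/30/10 sample identically.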

Open questions.

  • Per-pool weights for multi-pool random_pool? Wrap each pool: {pool_a: {weight: 3, items: [...]}}. Layers cleanly on top of per-record weights without breaking the bare-list shape.
  • Should weights live on the record (per-entry) or in a parallel list (weights: [6, 3, 1])? Per-entry — index-fragile parallel lists don’t compose with generate: and are error-prone to maintain.

Mixing path: with records: (additive concat)

Motivation. “Start from a known on-disk corpus, add a few hand-crafted extras inline.” A v1 brainstorm option, deferred because the corner cases multiply (sampling/limit semantics across mixed sources, trace timestamps when concat’ing two trace streams).

Sketch. When both are present: file records first, then inline records. Sampling and the entries: count cap apply across the merged list. Trace formats reject this combination (timestamp re-basing would be implicit and surprising).
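The merge semantics above fit in a few lines. A minimal sketch under the stated assumptions (fixed file-then-inline order, cap as truncation of the merged list, trace rejection); `merge_sources` and its parameters are illustrative names:

```python
# Sketch of the fixed merge order for path: + records: (additive concat).

def merge_sources(file_records, inline_records, entries=None, is_trace=False):
    """File records first, then inline; entries cap spans the merged list."""
    if is_trace:
        raise ValueError("trace formats do not support mixing path: with records:")
    merged = list(file_records) + list(inline_records)
    return merged if entries is None else merged[:entries]
```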

Open questions.

  • Should the order be configurable (e.g. inline_position: prepend|append)? Probably not — one fixed order is enough; users who want the other order can swap to path:-only or records:-only.
  • Multi-pool random_pool mixing? File directory + inline-dict-of-lists would have to merge by pool name — clean enough, but doubles the surface area. Defer.

Optional wrapper schema for records:

Motivation. Argo’s ArtifactLocation.raw = {data: "..."} lesson: wrapping inline data in a typed object even when it has a single field today lets you add encoding:, schema_version:, validate: later without a breaking change.

Sketch. Allow records: to accept either:

  • Bare list (today): records: [...]
  • Wrapped form: records: {items: [...], schema_version: 1, encoding: utf-8}

The wrapped form is reserved future-additive. v1 of inline-datasets ships bare-list only; this entry tracks the option and the prior-art that motivates keeping it open.
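If the wrapper ever ships, a loader-side normalizer keeps the rest of the code from branching on bare-list vs wrapped form. A sketch under that assumption (`normalize_records` and the default metadata values are illustrative):

```python
# Sketch: normalize records: into a canonical (items, metadata) pair.

def normalize_records(value):
    if isinstance(value, list):                       # bare list (today)
        return value, {"schema_version": 1, "encoding": "utf-8"}
    if isinstance(value, dict) and "items" in value:  # reserved wrapped form
        meta = {k: v for k, v in value.items() if k != "items"}
        meta.setdefault("schema_version", 1)
        meta.setdefault("encoding", "utf-8")
        return value["items"], meta
    raise ValueError("records: must be a list or a {items: [...]} mapping")
```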

Open questions.

  • Is the wrapper worth introducing before any of those reserved fields has a real use case? Probably no — premature surface area. Wait for the first concrete need (e.g. base64-encoded image records, or a schema-version stamp) and design the wrapper around that need instead of speculatively.

Reserve from: as the future remote-source key

Motivation. valuesFrom (Helm/Flux/Argo CD) is a familiar k8s-flavored verb for “load from another source.” If we ever want to pull dataset records from a URL or a remote artifact store, from: is the natural sibling of path: (local) and records: (inline).

Sketch (when it ships).

```yaml
dataset:
  type: file
  format: single_turn
  from:
    url: https://datasets.example.com/prompts.jsonl
    sha256: 4e2f...
    cache: ~/.aiperf/cache/datasets
```
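The integrity and cache parts of the sketch are the only mechanical bits; a minimal stdlib illustration, with no network I/O (the payload is assumed already fetched) and with `cache_path`/`verify` as hypothetical names:

```python
import hashlib
from pathlib import Path

def cache_path(cache_dir, url: str) -> Path:
    """Deterministic cache location keyed by the URL."""
    key = hashlib.sha256(url.encode()).hexdigest()[:16]
    return Path(cache_dir).expanduser() / key / Path(url).name

def verify(payload: bytes, expected_sha256: str) -> bytes:
    """Reject a fetched payload whose digest doesn't match the config."""
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"sha256 mismatch: expected {expected_sha256}, got {actual}")
    return payload
```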

Open questions.

  • This is a different-enough feature (network I/O, caching, integrity) that it deserves its own spec when it graduates. The future-goal entry just reserves the field name.

Prior-art notes (referenced from the inline-dataset spec)

The v1 brainstorm did a structured prior-art survey of dual-source YAML patterns and generate-loop DSLs. Persisting the lessons here so the next config-surface design can reuse them without re-running the survey.

Dual-source patterns (file XOR inline)

  • OpenAPI Example Object (value XOR externalValue) — two sibling fields, plain mutual exclusion, enforced by linters. Naming is asymmetric on purpose (value is the noun, externalValue is the modifier). Closest precedent for the path: / records: split.
  • Tekton Workspace bindings (configMap / secret / emptyDir / …) — N sibling fields, “exactly one” validation webhook. Field name is the discriminator; no type: configMap redundancy.
  • Argo Workflows ArtifactLocation — same shape as Tekton; raw: {data: ...} is the inline variant. Wrapping inline data in an object lets you add fields later non-breaking.

Lesson: with ≤6 sibling sources and an “exactly one” rule, sibling fields beat a type: discriminator. Adopted in v1.

Generate-loop patterns

  • Helm range $i, $v := list — both index and value bound in one breath, $-prefix marks user variables clearly. until N builtin handles count-based loops.
  • Terraform for_each + each.key/each.value — stable keys (no churn on insert), namespaced each.*. Notable regret: no index attribute on dynamic-block iterator (open issue #23288 for 5+ years).
  • CUE comprehensions for k, v in list — both index and value, optional if filter, comprehension is the value (no separate “loop directive” keyword). Statically validated.
  • Ansible loop: with loop_control.loop_var — default loop var is item; collides constantly under nesting. Lesson: short generic names like item or i produce real collisions; require/encourage renaming. This is the single biggest argument for index over i.
  • Argo withSequence: {count: N} + withItems: [...] — separate keys for count- vs list-driven loops; same loop variable. Future-additive for_each: for AIPerf would mirror this.

Validation strategy that wins

Three-stage validation, not single late-binding crash:

  1. Parse-time — structural mutual exclusion, type checks, bounds.
  2. Lint-time — template renders with mock vars, Jinja syntax/undefined errors surface before run.
  3. Run-time — only the actual data substitution.

Helm/Argo postmortems converge on this. v1 inline-datasets does parse-time and run-time; lint-time only matters once generate: ships.
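The three stages can be sketched as plain functions that fail fast in order. A toy illustration, not AIPerf code — the function names are made up, and lint_stage stands in for a real Jinja probe-render with mock vars:

```python
# Sketch of the three-stage validation pipeline: parse -> lint -> run.

def parse_stage(config: dict) -> dict:
    """Structural checks: exactly one dataset source."""
    sources = [k for k in ("path", "records", "generate") if k in config]
    if len(sources) != 1:
        raise ValueError(f"exactly one of path:/records:/generate: required, got {sources}")
    return config

def lint_stage(template: str) -> None:
    """Placeholder probe: a real implementation renders the Jinja template
    with mock vars and surfaces syntax/undefined errors here."""
    if template.count("{{") != template.count("}}"):
        raise ValueError("unbalanced {{ }} in template")

def run_stage(template: str, context: dict) -> str:
    """Actual substitution, reached only after parse and lint both pass."""
    out = template
    for key, value in context.items():
        out = out.replace("{{ " + key + " }}", str(value))
    return out
```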
