Mitigating Prompt Sensitivity: Manufacturing Robustness Through Diverse Preambles
Models behave differently based on how a question is phrased: a “cynical senior dev” and a “curious student” get different answers to the same problem. Using NeMo Data Designer, we built a pipeline that generates hundreds of diverse prompt preambles with controlled variation across tone, strictness, verbosity, and answer format, then validates each one for compliance. These preambles feed into a YAML-driven training mixture pipeline that prepends diverse instructions to existing SFT data at scale. This approach is now used in Nemotron training mixtures to address the prompt-format brittleness observed in internal testing.

A prompt to an LLM typically has three distinct components: the preamble (high-level instructions), the problem (the actual question or task), and the format instruction (how to structure the answer). Prompt sensitivity is the phenomenon where a model’s accuracy changes significantly based on how the preamble and format instruction are phrased, even when the underlying problem is identical.
The preamble and format instruction (green) are the parts we can vary freely without changing the problem. The problem itself (red) comes from the source dataset and stays fixed. When models are trained on data with only one preamble style and one format instruction, they become brittle: they can still solve the problem, but small wording changes in the prompt cause them to misformat their response, triggering scoring failures.
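To make the scoring failure concrete, here is a minimal sketch. It assumes a hypothetical scorer that extracts the answer with a single regex matching the format the prompt asked for; the regex, the `score` helper, and the sample responses are illustrative and are not taken from the actual evaluation harness.

```python
import re

# Hypothetical extractor: the prompt asked for "Answer: X", so the scorer
# only recognizes that format.
EXPECTED = re.compile(r"Answer:\s*([A-D])\b")

def score(response: str, gold: str) -> bool:
    """Return True only if an answer is extracted and it matches gold."""
    match = EXPECTED.search(response)
    return match is not None and match.group(1) == gold

responses = [
    "Nitrogen dominates the atmosphere. Answer: B",    # requested format
    r"Nitrogen dominates the atmosphere. \boxed{B}",   # right letter, trained-in format
]
results = [score(r, "B") for r in responses]
print(results)  # [True, False]
```

Both responses contain the correct letter, but only the first one is counted as correct, which is the kind of format-driven accuracy loss described above.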
When we evaluated early Nemotron checkpoints on internal STEM benchmarks with varied prompt phrasings, we observed accuracy swings of up to 15 percentage points depending solely on how the question was phrased:
Same questions. Same model. Same knowledge. Different scores. This is a well-documented phenomenon across the industry: models overfit to the prompt format seen during training.
The root cause is straightforward: the training data lacks prompt diversity. If every STEM MCQ in your SFT dataset starts with “Answer the following question and place your answer in \boxed{}”, the model learns that specific format perfectly but becomes brittle to anything else.
The fix is equally simple in principle (expose the model to the same problems with many different phrasings), but doing this manually at the scale of thousands of training examples is impractical. We need preambles that span a wide diversity space:
\boxed{}, \boxed{LETTER}, Answer: A/B/C/D, ((X)), <final_answer>X</final_answer>, and dozens more
Covering the full combinatorial space of these dimensions manually is intractable, and this is exactly the kind of structured diversity problem that synthetic data generation is designed to solve. Data Designer’s sampler-driven approach lets us define the diversity dimensions declaratively, and the framework handles the combinatorics at scale, generating thousands of validated preamble variations that no human annotator could match.
The pipeline below shows one specific instantiation for generating diverse preambles for QA/MCQ datasets, designed to improve the prompt sensitivity of the question prompt. The same architecture can be adapted for Math, Code, or any domain where prompt diversity is needed.
The 6 samplers create a combinatorial diversity space of 3,240 unique combinations, multiplied by 50 seed rows from the format-template × preamble cross-product. Generating 1,000 records covers a broad slice of this space, ensuring the training data doesn’t cluster around a few dominant styles.
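The arithmetic behind these figures can be checked in a few lines. The sketch below uses only numbers quoted in this post (3,240 sampler combinations, 10 format templates, 5 preamble anchors, 1,000 generated records); the variable names are illustrative.

```python
# Figures quoted in this post.
sampler_combinations = 3_240   # product of the six samplers' value counts
seed_rows = 10 * 5             # format templates x preamble anchors
records = 1_000

total_space = sampler_combinations * seed_rows
coverage = records / total_space

print(total_space)          # 162000
print(f"{coverage:.2%}")    # 0.62%
```

So 1,000 records are a random sample of the space rather than an enumeration of it: the goal is an even spread across styles, not exhaustive coverage.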
The seed is the cross-product of two small hand-written sets: 10 regex-paired format templates (covered in the Regex-Paired Format Templates section below) and 5 generic preamble anchors. Together they form 50 seed rows. Templates carry the format-instruction seed plus its paired output_regex; preambles act as style anchors so the LLM knows what a generic instruction line looks like.
With SamplingStrategy.SHUFFLE, each generated record sees a random seed row (a (seed_format_instruction, output_regex, seed_preamble) triple) alongside its sampled style parameters.
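As a rough illustration of how the seed rows could be built, the sketch below forms the cross-product in plain Python. The template and preamble strings are hypothetical stand-ins (two templates and three anchors instead of the 10 and 5 used in the pipeline), and `random.shuffle` only approximates what a shuffled sampling strategy does; this is not the Data Designer API.

```python
import itertools
import random

# Hypothetical stand-ins: the real pipeline uses 10 templates and 5 anchors.
format_templates = [
    {"seed_format_instruction": r"Put the final letter in \boxed{}.",
     "output_regex": r"\\boxed\{([A-D])\}"},
    {"seed_format_instruction": "End with 'Answer: X'.",
     "output_regex": r"Answer:\s*([A-D])"},
]
preamble_anchors = [
    "Answer the following question.",
    "Solve the problem below.",
    "Read the question carefully and respond.",
]

# Each seed row carries the instruction, its paired regex, and one preamble.
seed_rows = [
    {**template, "seed_preamble": preamble}
    for template, preamble in itertools.product(format_templates, preamble_anchors)
]

rng = random.Random(0)   # fixed seed so the order is reproducible
rng.shuffle(seed_rows)
print(len(seed_rows))    # 6
```

Because every row keeps the instruction and its regex together, shuffling the rows can never separate a format instruction from the pattern that must extract its answer.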
Each sampler controls one axis of variation:
The last sampler — preamble_format_order — controls how the preamble (P), format instruction (F), and {problem} placeholder get arranged in the final user prompt. It prevents the model from overfitting to a single positional layout.
This is the power of Data Designer’s sampler approach: you define the diversity dimensions — including positional order — and the framework handles the combinatorics.
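The idea of independent categorical samplers can be sketched without any framework. The example below is plain Python, not Data Designer code: the dimension names follow this post, but every value list is a hypothetical placeholder and the value counts do not match the real configuration.

```python
import math
import random

# Hypothetical value lists; the real samplers have 3-8 values each.
SAMPLERS = {
    "sentence_type": ["imperative", "question", "declarative"],
    "tone": ["formal", "casual", "terse"],
    "strictness": ["lenient", "moderate", "strict"],
    "verbosity": ["short", "medium", "long"],
    "domain": ["general", "science", "math"],
    "preamble_format_order": ["P+F+problem", "problem+P+F", "F+P+problem"],
}

def sample_style(rng: random.Random) -> dict:
    """Draw one value per dimension, independently of the others."""
    return {name: rng.choice(values) for name, values in SAMPLERS.items()}

# Size of the combinatorial space implied by these placeholder lists.
space_size = math.prod(len(values) for values in SAMPLERS.values())

rng = random.Random(0)
styles = [sample_style(rng) for _ in range(5)]
print(space_size)  # 729
```

Adding a new value to any list grows the space multiplicatively, which is why a declarative definition scales better than hand-written combinations.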
Three LLM text columns turn the sampled style parameters and seed rows into the actual training-ready prompts. Each column is a separate generation call, with each downstream column able to reference the values produced upstream.
preamble — a generic instruction line with no format requirements. The model conditions only on style samplers (sentence type, tone, strictness, verbosity, domain), so the same preamble can be paired with any answer format downstream:
format_instruction — paraphrases the seed seed_format_instruction while staying compatible with its paired output_regex. The regex is the contract: the instruction can be reworded freely, but the resulting model output must still match it for RL reward extraction.
user_prompt — composes preamble + format_instruction + {problem} (literal placeholder) in the order chosen by the preamble_format_order sampler. This is the final string that will be prepended to each training record at mixture time.
The separation matters: keeping preamble and format_instruction as distinct columns means each diversity dimension is varied independently, and the judges in the next step can score each component on its own terms. See the appendix for the full format_instruction and user_prompt configs.
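In the pipeline the user_prompt column is generated by an LLM, but the composition it performs can be approximated deterministically. The sketch below is an assumption-laden illustration: the `ORDERS` table lists only three hypothetical placement orders (the pipeline uses eight), and `compose_user_prompt` is an illustrative helper, not part of the actual config.

```python
# Hypothetical subset of placement orders; the pipeline uses eight.
ORDERS = {
    "P+F+problem": ("preamble", "format_instruction", "problem"),
    "problem+P+F": ("problem", "preamble", "format_instruction"),
    "F+P+problem": ("format_instruction", "preamble", "problem"),
}

def compose_user_prompt(preamble: str, format_instruction: str, order: str) -> str:
    """Join the three parts in the requested order, keeping {problem} literal."""
    parts = {
        "preamble": preamble,
        "format_instruction": format_instruction,
        "problem": "{problem}",  # literal placeholder, filled at mixture time
    }
    return "\n\n".join(parts[key] for key in ORDERS[order])

template = compose_user_prompt(
    "Solve the question below.", "Finish with 'Answer: X'.", "problem+P+F"
)
# str.replace is used instead of str.format so that braces in instructions
# such as \boxed{} are left untouched.
prompt = template.replace("{problem}", "What is 2 + 2?\n(A) 3\n(B) 4")
print(prompt)
```

Keeping the placeholder literal until mixture time is what allows one validated prompt to be reused across many training records.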
Four LLM judges score each generated row across format and quality dimensions. Three binary gates catch hard failures; one rubric judge scores tone and clarity.
format_compliance (binary): Does format_instruction actually enforce the required output pattern?
regex_alignment (binary): Does format_instruction align with the paired output_regex from the seed template? This catches drift where the generated instruction sounds plausible but won’t be matched by the extraction regex at training/eval time.
order_coherence (binary): Does the assembled user_prompt read coherently in the sampled preamble_format_order? Some orderings (e.g. {problem} + P + F) only work if the preamble and format instruction flow naturally after the question.
preamble_quality (0-3 rubric): Does the preamble match the requested tone, verbosity, and clarity?
Roughly 15-20% of generations fail at least one binary gate. The downstream filter keeps rows that pass all three binary judges and score ≥ 2 on preamble_quality.
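The filter rule above is simple enough to express directly. This sketch assumes each row stores the binary judge results as 0/1 integers and the rubric score as an integer from 0 to 3; that row schema is an assumption, since the judges' actual output format is not shown here.

```python
BINARY_JUDGES = ("format_compliance", "regex_alignment", "order_coherence")
MIN_QUALITY = 2  # minimum preamble_quality score on the 0-3 rubric

def keep_row(row: dict) -> bool:
    """Keep a row only if all binary gates pass and quality is high enough."""
    passes_gates = all(row[judge] == 1 for judge in BINARY_JUDGES)
    return passes_gates and row["preamble_quality"] >= MIN_QUALITY

rows = [
    {"format_compliance": 1, "regex_alignment": 1, "order_coherence": 1, "preamble_quality": 3},
    {"format_compliance": 1, "regex_alignment": 0, "order_coherence": 1, "preamble_quality": 3},
    {"format_compliance": 1, "regex_alignment": 1, "order_coherence": 1, "preamble_quality": 1},
]
kept = [row for row in rows if keep_row(row)]
print(len(kept))  # 1
```

The second row is dropped by a hard gate even though its quality score is perfect, which reflects the intent that format correctness is not negotiable.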
The previous four steps produce a pool of validated prompts. They’re useful as a standalone artifact, but the actual goal is to use them as instruction variations on top of existing SFT data — without rebuilding that data from scratch.
This is what the mixture script does. It treats the DD-generated prompts as a diversity layer that gets applied at training-data assembly time:
Compose the training mixture from existing JSONL shards. Production SFT data already exists as JSONL files (open STEM, HLE, internal MCQ collections, etc.). Each shard contributes a configured percentage of the final mixture, sampled with a fixed seed for reproducibility.
Apply preamble variations on top. For each record, draw a Bernoulli flip against majority_percentage. On the majority side (25% of records in the example), prepend a single canonical preamble: the format the model is “officially” expected to handle. On the other side (the remaining 75%), sample a random preamble from the DD-generated pool. The model thus sees the canonical format often enough to lock it in, while being exposed to thousands of variations to keep it from overfitting to that one format.
Detect MCQ-like records with regex heuristics. Non-MCQ records (free-response, code, math without options) shouldn’t get an MCQ preamble. The pipeline skips them by matching \n(A), \n(B) patterns in the user turn or \boxed{} in the assistant turn.
Pack sequences for training. Concatenate records up to max_seq_length (128k here) with shuffle_before and shuffle_after so packed sequences don’t memorize neighbor ordering.
Below is the YAML config that drives this:
The majority_percentage is the main knob to tune. Setting it too high (e.g., 90%) means the model rarely sees variations and prompt-format brittleness persists; setting it too low (e.g., 5%) starves the model of the canonical format it’s expected to perform well on. In internal testing, 25% canonical / 75% varied struck the right balance for QA mixtures — the canonical format stays sharp while format-robustness improves.
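The MCQ detection and the draw against majority_percentage described in the steps above can be sketched together. Everything here is an illustration under stated assumptions: the record schema (`user` and `assistant` keys), the two regexes, the canonical string, and the single pool entry are hypothetical, and the real mixture script may differ in all of them.

```python
import random
import re

# Assumed heuristics modeled on the description above; real patterns may differ.
OPTION_PATTERN = re.compile(r"\n\(A\).*\n\(B\)", re.DOTALL)
BOXED_PATTERN = re.compile(r"\\boxed\{")

def is_mcq_like(user_text: str, assistant_text: str) -> bool:
    """True if the user turn lists (A)/(B) options or the answer uses \\boxed{}."""
    return bool(OPTION_PATTERN.search(user_text) or BOXED_PATTERN.search(assistant_text))

def apply_preamble(record, canonical, pool, majority_percentage, rng):
    """Return the record with a prompt template applied to its user turn."""
    if not is_mcq_like(record["user"], record["assistant"]):
        return record  # non-MCQ records pass through untouched
    use_canonical = rng.random() * 100 < majority_percentage  # Bernoulli draw
    template = canonical if use_canonical else rng.choice(pool)
    # str.replace avoids str.format, which would choke on braces in \boxed{}.
    return {**record, "user": template.replace("{problem}", record["user"])}

CANONICAL = "Answer the following question and place your answer in \\boxed{}.\n\n{problem}"
POOL = ["{problem}\n\nReply with 'Answer: X' on the last line."]  # hypothetical entry

rng = random.Random(1234)  # fixed seed for reproducibility
mcq = {"user": "Which gas is most abundant in air?\n(A) Oxygen\n(B) Nitrogen", "assistant": "(B)"}
free = {"user": "Reverse a string in Python.", "assistant": "s[::-1]"}
mixed = [apply_preamble(r, CANONICAL, POOL, 25, rng) for r in (mcq, free)]
```

With majority_percentage set to 25, roughly one MCQ record in four receives the canonical template and the rest draw from the pool, while the free-response record is returned unchanged.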
The end result of this step is a 1M-record packed training mixture where each problem appears with one of 1,000+ different instruction phrasings — assembled from existing data with no manual preamble authoring required.
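For readers unfamiliar with sequence packing, here is a minimal greedy sketch. It assumes that shuffle_before means shuffling records before packing and shuffle_after means shuffling the packed sequences afterwards; that interpretation, the `pack_records` helper, and the token lengths are all assumptions rather than the pipeline's actual implementation.

```python
import random

def pack_records(lengths, max_seq_length, seed=0):
    """Greedily group record indices so each pack fits in max_seq_length tokens."""
    rng = random.Random(seed)
    order = list(range(len(lengths)))
    rng.shuffle(order)                 # assumed meaning of shuffle_before
    packs, current, used = [], [], 0
    for idx in order:
        n = lengths[idx]
        # Start a new pack when the next record would overflow the current one.
        if current and used + n > max_seq_length:
            packs.append(current)
            current, used = [], 0
        # A record longer than max_seq_length ends up alone in its own pack;
        # this sketch does not truncate it.
        current.append(idx)
        used += n
    if current:
        packs.append(current)
    rng.shuffle(packs)                 # assumed meaning of shuffle_after
    return packs

lengths = [40, 70, 30, 90, 10, 60]     # hypothetical token counts per record
packs = pack_records(lengths, max_seq_length=100)
```

Shuffling on both sides of the packing step means neither the order of records inside the dataset nor the order of packed sequences carries information the model could memorize.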
The key design decision that makes this pipeline work for both SFT and RL is pairing every format instruction with an extraction regex. Each template defines a human-readable format instruction (which gets paraphrased by Data Designer for diversity) and a regex pattern (which stays fixed for automated answer extraction).
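A small sketch shows how one regex-paired template can serve both uses. The two templates and the `reward` function below are hypothetical examples, not the templates shipped in the appendix; the only assumption carried over from the text is that each template pairs a prompt with an output_regex.

```python
import re

# Hypothetical (prompt, output_regex) pairs in the spirit of the templates.
TEMPLATES = [
    {"prompt": r"Put your final answer letter inside \boxed{}.",
     "output_regex": r"\\boxed\{([A-D])\}"},
    {"prompt": "Wrap the answer letter as <final_answer>X</final_answer>.",
     "output_regex": r"<final_answer>([A-D])</final_answer>"},
]

def reward(response: str, gold: str, output_regex: str) -> float:
    """Return 1.0 if the last extracted answer equals gold, else 0.0."""
    matches = re.findall(output_regex, response)
    return 1.0 if matches and matches[-1] == gold else 0.0

boxed, tagged = TEMPLATES
print(reward(r"So the answer is \boxed{B}", "B", boxed["output_regex"]))       # 1.0
print(reward("Answer: B", "B", boxed["output_regex"]))                         # 0.0
print(reward("<final_answer>C</final_answer>", "C", tagged["output_regex"]))   # 1.0
```

The prompt text can be paraphrased freely for SFT diversity, but the regex stays fixed, so the same extraction logic keeps working as an RL reward signal.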
The appendix ships 10 distinct format templates spanning \boxed{}, brackets, parentheses, XML tags, markdown bold, arrows, and plain text — easy to extend by adding more (prompt, output_regex) pairs. This dual-use design means:
The preamble (generic instruction) and format instruction (answer format) are generated as separate LLM columns, then composed into a final user_prompt with the {problem} placeholder arranged in one of 8 placement orders (P + F + {problem}, {problem} + P + F, etc.). This separation lets you vary each dimension independently and prevents positional overfitting.
Samplers make diversity systematic. Six categorical samplers (3-8 values each) create a 3,240-combination space, multiplied by 50 seed rows from the format-template × preamble cross-product. No human annotator covers that surface area consistently.
Seed examples are style anchors, not templates. The LLM needs to see what a preamble is, but the samplers control what each preamble says. Without seeds, the LLM guesses at the format; without samplers, it converges to a narrow style.
Format compliance is a hard gate. A format_instruction that drifts from its paired output_regex will confuse the model during training and break extraction at eval time. Binary judges (format_compliance + regex_alignment) catch this — LLMs generate misaligned formats ~15-20% of the time.
The value is in the pipeline, not the individual records. Any single preamble is easy to write by hand. The value is generating 1,000+ diverse, validated preambles automatically and integrating them into million-record training mixtures with controlled majority/variation ratios.
Regex-paired templates unify SFT and RL. The same format templates serve double duty: paraphrased instructions add SFT diversity, while the paired regex enables RL reward parsing. One pipeline, both training paradigms.
Majority percentage controls the tradeoff. Setting majority_percentage: 25 means the model sees the canonical format 25% of the time and diverse variations 75% of the time. This ratio was tuned empirically: too much diversity degrades canonical-format performance; too little doesn’t build robustness.
Want to learn more about NeMo Data Designer? Check out our documentation and start building your own synthetic data pipelines today.