The biggest prompt optimization in your stack might not be the wording. It might be the wrapper. In early 2024, Sclar et al. showed that changing prompt formatting while keeping semantics fixed could swing Llama-2-13B accuracy by up to 76 points. That is not a cute polish detail. That is a system-behavior variable.
Most teams still treat format as a presentation-layer decision. XML if it looks neat. Markdown headers if the prompt file reads better. Bullets because somebody on the team likes bullets. That mental model is too small. Format behaves more like a hyperparameter: separable, testable, and often worth more than another round of micro-edits to the prose itself.
This post is the short version of the argument and the operational version of the fix: why format moves accuracy, what it costs, and how to test it without turning your prompt review into vibes.
Format versus content: the ROI question.
If you are running a production prompt workflow, you already know what normal iteration looks like. Reword an instruction. Swap examples. Tweak the persona. Add chain-of-thought. Those changes matter, but their gains tend to be bounded and incremental.
- Chain-of-thought often buys a modest lift, usually in the single digits or low teens.
- Rewording examples can help, but often in the low single digits.
- Persona tweaks are real, but rarely the biggest line item on the eval report.
Now compare that with formatting sensitivity. Sclar et al. reported the 76-point headline number. He et al. independently measured up to 40% performance variation from format alone on GPT-3.5-turbo in a code translation task, with relative swings on individual benchmarks reaching 200%. Li et al.’s CFPO framing matters here because it gives the right mental model: content and format are not one muddy variable. They are separable axes. Strong phrasing and a strong format stack.
Format is not a cosmetic wrapper around the prompt. It is a routing decision for which statistical neighborhood the model enters.
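To make the "separable axes" idea concrete, here is a minimal sketch of a content-by-format grid. The instruction wordings, the template names, and the `build_grid` helper are all hypothetical, invented for illustration; they are not taken from CFPO or any of the cited papers.

```python
from itertools import product

# Hypothetical content variants: different wordings of the same instruction.
CONTENTS = {
    "terse": "Classify the sentiment of the review.",
    "detailed": "Read the review and decide whether its sentiment is positive or negative.",
}

# Hypothetical format variants: different wrappers around the same wording.
FORMATS = {
    "plain": "{instruction}\n\n{review}",
    "labeled": "Instruction: {instruction}\nReview: {review}",
    "xml": "<instruction>{instruction}</instruction>\n<review>{review}</review>",
}


def build_grid(review):
    """Return one prompt per (content, format) pair for a single input."""
    grid = {}
    for (c_name, instruction), (f_name, template) in product(
        CONTENTS.items(), FORMATS.items()
    ):
        grid[(c_name, f_name)] = template.format(
            instruction=instruction, review=review
        )
    return grid


grid = build_grid("The battery died after two days.")
for key in sorted(grid):
    print(key)
```

Each cell in the grid can then be scored on the same eval set, so a change in wording and a change in wrapper show up as separate effects instead of one blended number.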
Why format moves accuracy at all.
The short answer is that the model does not read your prompt the way you read it in your editor. Formats carry distributional baggage from pretraining. XML, Markdown, QA prefixes, tables, and list structures each co-occur with different kinds of content in the corpus. When you change the wrapper, you are not just restyling the text. You are changing the pattern family the model recognizes.
That is the first layer: pretraining associations. Structured delimiters often arrive attached to technical docs, configs, and formalized instruction patterns. Natural prose arrives attached to a different neighborhood. Recent work by Itzhak et al. and Zhao et al. suggests these behavior patterns are planted primarily in pretraining and then amplified later, not invented from scratch during post-training.
The second layer is RLHF amplification. Human preference data pushes models toward formats that look more legible, more structured, or more aligned with what raters think a “good answer” should resemble. So the base distributional pattern gets reinforced. Format sensitivity is not cosmetic. It is historical and statistical.
This is the useful mental model: rewording a prompt tends to move you around inside one neighborhood. Changing the format can move you into another neighborhood entirely. That is why the gains can look disproportionate.
Why delimiters work.
Delimiters are a nice example because almost everyone uses them and very few teams can explain why. Triple quotes, XML tags, brackets, section headers, explicit labels like Context: and Instruction: all do slightly different versions of the same job.
- They segment attention. Boundaries help the model keep context blocks from bleeding into one another.
- They activate training priors. XML feels different from freeform prose because the model saw XML in different contexts during training.
- They reduce control effort. Clear boundaries make the next-token path easier to infer.
You are not using XML because it is pretty. You are using it because it changes what the model expects next.
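As a rough illustration of "different versions of the same job", the sketch below renders one set of sections with three delimiter styles. The section names, the sample text, and the three `render_*` helpers are assumptions made up for this example, not a recommended or standard set.

```python
def render_labels(sections):
    """Explicit labels such as 'Context:' and 'Instruction:'."""
    return "\n".join(
        f"{name.capitalize()}: {text}" for name, text in sections.items()
    )


def render_xml(sections):
    """XML-style tags with the payload on its own line."""
    return "\n".join(
        f"<{name}>\n{text}\n</{name}>" for name, text in sections.items()
    )


def render_triple_quotes(sections):
    """A label followed by a triple-quoted block."""
    return "\n\n".join(
        f'{name.capitalize()}:\n"""\n{text}\n"""'
        for name, text in sections.items()
    )


# Hypothetical payload; the semantics stay fixed across all three renderings.
sections = {
    "context": "Refunds are accepted within 30 days.",
    "instruction": "Answer using only the context.",
}

for render in (render_labels, render_xml, render_triple_quotes):
    print(render(sections))
    print("---")
```

The payload is identical in every rendering. Only the boundaries change, which is exactly the variable worth isolating in an eval.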
The cost side: formatting burns tokens.
This is where the argument gets practical. Format helps, but it is not free. In their code-formatting study, Pan et al. measured the share of tokens spent on structure rather than content and found a spread from 6.5% to 34.9%, with an average around 25.8%. The overhead is heavily language-dependent: Java came in at 34.9% and C++ at 31.1%, while Python sat at 6.5% because its indentation is syntactically required rather than layered-on decoration. By analogy, a heavily formatted prompt can donate a quarter or more of its context budget to structure rather than semantic payload.
That does not make formatting bad. It makes it a trade. If the steering gain beats the lost context, the format pays for itself. If not, you are buying legibility in the editor while starving the model of useful content.
This is another reason to treat format like a hyperparameter instead of taste. Costs should be measured, not inferred.
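One way to measure that cost is to compare the token count of the formatted prompt with the token count of the payload alone. The sketch below is a rough proxy under a loud assumption: `rough_token_count` is a crude regex stand-in, not a real tokenizer, so the numbers it produces will differ from what your model's tokenizer reports. Swap in the actual tokenizer for your target model before trusting any figure.

```python
import re


def rough_token_count(text):
    """Crude stand-in for a tokenizer: each word or punctuation mark counts once.

    This is an assumption for illustration; real subword tokenizers split differently.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


def format_overhead(payload_parts, formatted_prompt):
    """Fraction of the formatted prompt's tokens not accounted for by the payload."""
    payload_tokens = sum(rough_token_count(part) for part in payload_parts)
    total_tokens = rough_token_count(formatted_prompt)
    return (total_tokens - payload_tokens) / total_tokens


# Hypothetical payload and two wrappers around it.
payload = [
    "Refunds are accepted within 30 days.",
    "Answer using only the context.",
]
plain = "\n\n".join(payload)
xml = (
    "<context>\n" + payload[0] + "\n</context>\n"
    "<instruction>\n" + payload[1] + "\n</instruction>"
)

print(f"plain overhead: {format_overhead(payload, plain):.1%}")
print(f"xml overhead:   {format_overhead(payload, xml):.1%}")
```

On a payload this short the wrapper dominates, which overstates the cost for realistic prompts. The point is the method: compute overhead per format on your own prompts and weigh it against the accuracy each format buys.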
There is no universal best format.
One of the most important results in the literature is that there is no single format winner across all models and tasks. Sclar et al. found very low overlap between top-performing formats across model families. A layout that helps one model on one task can be neutral or actively bad on another.
That means you should expect task-specific and model-specific winners. A rough working heuristic from the literature looks like this:
- Math and logic often reward highly structured wrappers like XML or JSON.
- Reading comprehension can prefer more natural prose.
- Classification often benefits from enumerated lists or clearly labeled options.
- Creative writing tends to do worse when overconstrained by rigid scaffolding.
The implication is simple: you cannot solve formatting once at the root prompt and assume the answer generalizes everywhere.
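In practice that means keeping format results per model and per task instead of one global ranking. The sketch below assumes you already have eval scores in a flat list. The model names, task names, and scores are placeholders invented for illustration, not measurements, and `best_format_per_cell` is a hypothetical helper.

```python
from collections import defaultdict

# Placeholder scores, invented for illustration only; these are not measured results.
RESULTS = [
    ("model_a", "math", "xml", 0.62),
    ("model_a", "math", "prose", 0.55),
    ("model_a", "reading", "xml", 0.48),
    ("model_a", "reading", "prose", 0.57),
    ("model_b", "math", "xml", 0.51),
    ("model_b", "math", "prose", 0.58),
]


def best_format_per_cell(results):
    """For each (model, task) pair, return the winning format and the score spread."""
    by_cell = defaultdict(dict)
    for model, task, fmt, score in results:
        by_cell[(model, task)][fmt] = score

    report = {}
    for cell, scores in by_cell.items():
        winner = max(scores, key=scores.get)
        spread = max(scores.values()) - min(scores.values())
        report[cell] = (winner, round(spread, 4))
    return report


report = best_format_per_cell(RESULTS)
for (model, task), (winner, spread) in sorted(report.items()):
    print(f"{model} / {task}: best={winner}, spread={spread}")
```

The spread column is worth as much as the winner: a small spread says format barely matters for that cell, while a large one says the format choice deserves its own line in the eval report.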