Format Is a Hyperparameter.

The biggest prompt optimization in your stack might not be the wording. It might be the wrapper. In early 2024, Sclar et al. showed that changing prompt formatting while keeping semantics fixed could swing Llama-2-13B accuracy by up to 76 points. That is not a cute polish detail. That is a system-behavior variable.

Most teams still treat format as a presentation-layer decision. XML if it looks neat. Markdown headers if the prompt file reads better. Bullets because somebody on the team likes bullets. That mental model is too small. Format behaves more like a hyperparameter: separable, testable, and often worth more than another round of micro-edits to the prose itself.

This post is the short version of the argument and the operational version of the fix: why format moves accuracy, what it costs, and how to test it without turning your prompt review into vibes.

Format versus content: the ROI question.

If you are running a production prompt workflow, you already know what normal iteration looks like. Reword an instruction. Swap examples. Tweak the persona. Add chain-of-thought. Those changes matter, but their gains tend to be bounded and incremental.

Chain-of-thought often buys a modest lift, usually in the single digits or low teens.
Rewording examples can help, but often in the low single digits.
Persona tweaks are real, but rarely the biggest line item on the eval report.

Now compare that with formatting sensitivity. Sclar et al. reported the 76-point headline number. He et al. independently measured up to 40% performance variation on GPT-3.5-turbo from the prompt template alone, and found larger models such as GPT-4 to be more robust to these variations. Li et al.’s CFPO framing matters here because it gives the right mental model: content and format are not one muddy variable. They are separable axes. A strong phrasing plus a strong format stacks.

Format is not a cosmetic wrapper around the prompt. It is a routing decision for which statistical neighborhood the model enters.

Why format moves accuracy at all.

The short answer is that the model does not read your prompt the way you read it in your editor. Formats carry distributional baggage from pretraining. XML, Markdown, QA prefixes, tables, and list structures each co-occur with different kinds of content in the corpus. When you change the wrapper, you are not just restyling the text. You are changing the pattern family the model recognizes.

That is the first layer: pretraining associations. Structured delimiters often arrive attached to technical docs, configs, and formalized instruction patterns. Natural prose arrives attached to a different neighborhood. Recent work by Itzhak et al. and Zhao et al. suggests these behavior patterns are planted primarily in pretraining and then amplified later, not invented from scratch during post-training.

The second layer is RLHF amplification. Human preference data pushes models toward formats that look more legible, more structured, or more aligned with what raters think a “good answer” should resemble. So the base distributional pattern gets reinforced. Format sensitivity is not cosmetic. It is historical and statistical.

This is the useful mental model: rewording a prompt tends to move you around inside one neighborhood. Changing the format can move you into another neighborhood entirely. That is why the gains can look disproportionate.

Why delimiters work.

Delimiters are a nice example because almost everyone uses them and very few teams can explain why. Triple quotes, XML tags, brackets, section headers, explicit labels like Context: and Instruction: all do slightly different versions of the same job.

They segment attention. Boundaries help the model keep context blocks from bleeding into one another.
They activate training priors. XML feels different from freeform prose because the model saw XML in different contexts during training.
They reduce control effort. Clear boundaries make the next-token path easier to infer.

You are not using XML because it is pretty. You are using it because it changes what the model expects next.
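To make the delimiter point concrete, here is a minimal sketch that renders one context and one instruction under three different wrappers. The function names and the sample strings are hypothetical and exist only for illustration. The semantic payload is identical in every variant; only the boundaries change.

```python
from xml.sax.saxutils import escape


def render_xml(context: str, instruction: str) -> str:
    """Wrap each block in XML-style tags (escaping &, <, > in the payload)."""
    return (
        f"<context>\n{escape(context)}\n</context>\n"
        f"<instruction>\n{escape(instruction)}\n</instruction>"
    )


def render_labeled(context: str, instruction: str) -> str:
    """Use explicit 'Context:' and 'Instruction:' labels."""
    return f"Context: {context}\n\nInstruction: {instruction}"


def render_triple_quoted(context: str, instruction: str) -> str:
    """Fence the context in triple quotes and leave the instruction as prose."""
    return f'"""\n{context}\n"""\n\n{instruction}'


# Hypothetical payload; only the wrapper differs between variants.
context = "Refunds are accepted within 30 days of purchase."
instruction = "Answer the customer's question using only the context."

variants = {
    "xml": render_xml(context, instruction),
    "labeled": render_labeled(context, instruction),
    "triple_quoted": render_triple_quoted(context, instruction),
}

for name, prompt in variants.items():
    print(f"--- {name} ---\n{prompt}\n")
```

Keeping each wrapper as its own small function makes it easy to hold the wording fixed while the format varies, which is the separation the rest of this post relies on.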

The cost side: formatting burns tokens.

This is where the argument gets practical. Format helps, but it is not free. Pan et al. measured the token overhead of formatting in a code-formatting study and found it ranged from 6.5% to 34.9%, with an average around 25.8%. The overhead is heavily language-dependent: Java came in at 34.9% and C++ at 31.1%, while Python sat at 6.5% because its indentation is syntactically required rather than layered-on decoration. If prompt formatting behaves similarly, a heavily formatted prompt can spend a quarter or more of its context budget on structure rather than semantic payload.

That does not make formatting bad. It makes it a trade. If the steering gain beats the lost context, the format pays for itself. If not, you are buying legibility in the editor while starving the model of useful content.

This is another reason to treat format like a hyperparameter instead of taste. Costs should be measured, not inferred.
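One way to measure that cost is to compare the token count of the formatted prompt against the token count of the bare payload. The sketch below assumes a crude regex tokenizer as a stand-in, which is not any model's real tokenizer; `rough_token_count` and `format_overhead` are hypothetical names. Replace the counter with the tokenizer of the model you actually deploy before trusting any number it produces.

```python
import re


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts word runs and single punctuation marks.

    ASSUMPTION: this is not a real model tokenizer. It only illustrates the
    bookkeeping; swap in your model's tokenizer for real measurements.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


def format_overhead(formatted_prompt: str, payload: str) -> float:
    """Fraction of the formatted prompt's tokens spent on structure, not payload."""
    total = rough_token_count(formatted_prompt)
    if total == 0:
        return 0.0
    return (total - rough_token_count(payload)) / total


# Hypothetical example: the same two sentences, bare and XML-wrapped.
payload = "Refunds are accepted within 30 days. Answer using only the context."
formatted = (
    "<context>\nRefunds are accepted within 30 days.\n</context>\n"
    "<instruction>\nAnswer using only the context.\n</instruction>"
)

print(f"structural overhead: {format_overhead(formatted, payload):.1%}")
```

On a payload this small the ratio is inflated, because the tags outweigh the content. The measurement that matters is the same calculation run over your real prompts and logged next to accuracy.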

There is no universal best format.

One of the most important results in the literature is that there is no single format winner across all models and tasks. Sclar et al. found very low overlap between top-performing formats across model families. A layout that helps one model on one task can be neutral or actively bad on another.

That means you should expect task-specific and model-specific winners. A rough working heuristic from the literature looks like this:

Math and logic often reward highly structured wrappers like XML or JSON.
Reading comprehension can prefer more natural prose.
Classification often benefits from enumerated lists or clearly labeled options.
Creative writing tends to do worse when overconstrained by rigid scaffolding.

The implication is simple: you cannot solve formatting once at the root prompt and assume the answer generalizes everywhere.

The FormatSpread protocol.

The protocol below borrows its name and core framing from Sclar et al. (2024, ICLR), who introduced FormatSpread as a way to quantify prompt-format sensitivity. If you want the gains without the prompt-theater, run a controlled search. The version below is intentionally boring. That is a feature.

Define the eval set. Use at least 100 examples if you can, and stratify by task type.
Define the search space. Delimiters, casing, ordering, spacing, list styles. Change format only, not semantic wording.
Generate variants. A grid or a bounded random sample of around 20 to 25 is usually enough.
Run at temperature 0, or use a fixed seed. The goal is to remove sampling noise so that format is the only thing varying.
Use a real significance test. McNemar’s test or a paired bootstrap is better than “it looks higher.”
Stop with discipline. Once the confidence interval on the remaining gain excludes a meaningful lift, or the top formats cluster within one to two points, stop.
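The search-space and variant-generation steps above can be sketched as follows. Everything here is an illustrative assumption rather than the reference implementation from Sclar et al.: the knobs in `SEARCH_SPACE`, the `render` layout, and the `score_fn` callback (a placeholder for your own eval harness, run at temperature 0) are all hypothetical. The spread is computed as best accuracy minus worst accuracy across the sampled formats.

```python
import itertools
import random
from typing import Callable, Dict, List

# Hypothetical search space: formatting knobs only, no semantic rewording.
SEARCH_SPACE: Dict[str, List[str]] = {
    "separator": [": ", " - ", ":\n"],
    "casing": ["title", "upper", "lower"],
    "field_gap": ["\n", "\n\n"],
    "option_style": ["1.", "(a)", "-"],
}


def sample_formats(space: Dict[str, List[str]], k: int, seed: int = 0) -> List[Dict[str, str]]:
    """Enumerate the full grid, then draw a bounded random sample of size k."""
    keys = list(space)
    grid = [dict(zip(keys, combo)) for combo in itertools.product(*space.values())]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(grid, min(k, len(grid)))


def apply_casing(label: str, casing: str) -> str:
    """Apply one of the three casing choices to a field label."""
    return {"title": label.title(), "upper": label.upper(), "lower": label.lower()}[casing]


def render(fmt: Dict[str, str], question: str, options: List[str]) -> str:
    """Render one classification prompt under a given format; wording is untouched."""
    markers = {
        "1.": [f"{i + 1}." for i in range(len(options))],
        "(a)": [f"({chr(97 + i)})" for i in range(len(options))],
        "-": ["-"] * len(options),
    }[fmt["option_style"]]
    option_block = "\n".join(f"{m} {o}" for m, o in zip(markers, options))
    fields = [
        f"{apply_casing('question', fmt['casing'])}{fmt['separator']}{question}",
        f"{apply_casing('options', fmt['casing'])}{fmt['separator']}{option_block}",
        f"{apply_casing('answer', fmt['casing'])}{fmt['separator']}",
    ]
    return fmt["field_gap"].join(fields)


def format_spread(formats: List[Dict[str, str]],
                  score_fn: Callable[[Dict[str, str]], float]) -> float:
    """Spread = best accuracy minus worst accuracy across the sampled formats."""
    scores = [score_fn(f) for f in formats]
    return max(scores) - min(scores)


formats = sample_formats(SEARCH_SPACE, k=20)
print(render(formats[0], "Is this review positive or negative?", ["positive", "negative"]))
```

In practice `score_fn` would render every eval example under the given format, call the model, and return accuracy. Keeping it as a callback means the search code never touches the wording, only the wrapper.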

That last step is important. The point is not to become a formatting cult. The point is to spend your first optimization cycle on the most neglected high-leverage variable, then move on.
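For the significance test and the stopping rule, a standard-library sketch is enough to get started. It assumes you have per-example correctness for two formats on the same eval set. The function names, the 2,000 resamples, and the two-point `min_lift` threshold are illustrative assumptions, not values taken from the cited papers.

```python
import math
import random
from typing import List, Sequence, Tuple


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value from paired per-example correctness."""
    if len(correct_a) != len(correct_b):
        raise ValueError("paired inputs must have equal length")
    only_a = sum(bool(a) and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(bool(b) and not a for a, b in zip(correct_a, correct_b))
    n = only_a + only_b  # discordant pairs; concordant pairs carry no information
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_bootstrap_ci(correct_a: Sequence[bool], correct_b: Sequence[bool],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile CI for accuracy(b) - accuracy(a), resampling examples in pairs."""
    rng = random.Random(seed)
    n = len(correct_a)
    diffs: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(int(correct_b[i]) - int(correct_a[i]) for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def should_stop(ci: Tuple[float, float], min_lift: float = 0.02) -> bool:
    """Stop when even the optimistic end of the interval is below a meaningful lift."""
    return ci[1] < min_lift


# Toy, made-up correctness vectors purely to show the call pattern.
baseline = [True] * 10 + [False] * 10
challenger = [True] * 20
print("McNemar p-value:", mcnemar_exact(baseline, challenger))
print("stop searching?", should_stop(paired_bootstrap_ci(baseline, challenger)))
```

The exact form of McNemar's test is used here because discordant-pair counts on a 100-example eval set are often too small for the chi-squared approximation to be comfortable.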

What to track and what to ignore.

Track:

Accuracy on the golden eval set.
Failure modes introduced by the format itself.
Total token overhead.
Latency (including time to first token) and cost per request.

Ignore:

“This one feels cleaner” editor opinions.
One cherry-picked example from a stakeholder call.
Format intuitions transferred from another model family.

The model is the judge here, not your taste in whitespace.

What this means in practice.

If your team is still treating format as a final polish pass, you are probably leaving obvious gains on the table. Run one deliberate search on a real eval set, lock the winner, and regression-test it like any other production decision. That alone will make most prompt reviews look more like engineering and less like copyediting.
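Locking the winner and regression-testing it can be as simple as storing the format spec next to the score it earned, then failing the build if either drifts. The sketch below is one possible shape for that gate. `LockedFormat`, `check_regression`, the one-point tolerance, and the placeholder model name and scores are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LockedFormat:
    """Hypothetical record of a winning format and the score it earned."""
    model: str
    task: str
    format_spec: dict
    baseline_accuracy: float

    def fingerprint(self) -> str:
        """Stable hash of the format spec, so silent template edits are detectable."""
        canonical = json.dumps(self.format_spec, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def check_regression(lock: LockedFormat, current_spec: dict,
                     current_accuracy: float, tolerance: float = 0.01) -> List[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures: List[str] = []
    current = LockedFormat(lock.model, lock.task, current_spec, current_accuracy)
    if current.fingerprint() != lock.fingerprint():
        failures.append("format spec changed since it was locked")
    if current_accuracy < lock.baseline_accuracy - tolerance:
        failures.append(
            f"accuracy {current_accuracy:.3f} fell below "
            f"{lock.baseline_accuracy:.3f} - {tolerance:.3f}"
        )
    return failures


# Placeholder values only; substitute the results of your own search.
lock = LockedFormat(
    model="example-model",
    task="classification",
    format_spec={"separator": ": ", "casing": "upper"},
    baseline_accuracy=0.80,
)
print(check_regression(lock, {"separator": ": ", "casing": "lower"}, 0.70))
```

Keying the lock on both model and task reflects the point made earlier: a format that wins for one pairing should not be assumed to win for another.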

If you want the deeper architecture story for why structure matters across code, prose, and prompts alike, read There Is No Code Mode. If you want help building the eval discipline around these tests, send a note through the contact form. We like teams that already know the prompt is not the product.

References.

Sclar et al. (2024). Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design. ICLR. arXiv.
He et al. (2024). Does Prompt Formatting Have Any Impact on LLM Performance? arXiv.
Li et al. (2025). Content-Format Integrated Prompt Optimization for LLMs. arXiv.
Itzhak et al. (2025). Planted in Pretraining, Swayed by Finetuning. arXiv.
Zhao et al. (2025). Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining. arXiv.
Pan et al. (2025). The Hidden Cost of Readability: How Formatting Choices Impact Token Usage and Costs in LLMs. arXiv.
Bhargava et al. (2024). What’s the Magic Word? A Control Theory of LLM Prompting. arXiv.