If you have spent any time near tokenizer design, you have heard the folklore: lower fertility (fewer tokens per word) is better. Better compression means more content per context window, which means better downstream performance, which means the decision rule is obvious. Optimize compression. Move on.
The trouble is that the headline empirical result pointing the other way is hard to ignore. Schmidt et al. measured the relationship between tokenizer compression ratio and model performance across 64 models trained over 138,432 GPU hours and found a correlation of just r=0.241. That is not zero, but it is nowhere near strong enough to carry the whole theory. At the same time, Goldman and Caciularu report a correlation as strong as -0.996 in a narrower regime. Kadosh et al. get a 230% improvement from AST-aware tokenization. Pagnoni et al.’s BLT beats Llama 3 on code without a fixed vocabulary at all. Alderson finds a 2.6x spread across programming languages under one tokenizer.
If you read those five papers one by one, the field looks contradictory. If you line them up around the latent variable underneath them, the picture gets simpler. Compression is not the thing. It is a noisy proxy for whether token boundaries line up with the structural units the model actually needs.
The folklore: lower fertility equals better.
The folklore is not stupid. It emerged because it works in enough local cases to feel true. If the same content takes fewer tokens, the model has more room left in context. If a tokenizer explodes the length of a prompt, the model has less room to reason with. Plenty of practitioners have touched that stove.
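To make the folklore’s two quantities concrete, here is a minimal sketch. It assumes fertility means tokens per whitespace-delimited word and compression means UTF-8 bytes per token; papers differ on the exact definitions. The two tokenizers, `char_tokenize` and `word_tokenize`, are toy stand-ins invented for illustration, not real BPE.

```python
import re

def fertility(tokens, text):
    """Average tokens per whitespace-delimited word (lower = fewer tokens)."""
    return len(tokens) / len(text.split())

def bytes_per_token(tokens, text):
    """Compression as UTF-8 bytes per token (higher = more compressive)."""
    return len(text.encode("utf-8")) / len(tokens)

def char_tokenize(text):
    # Toy worst case (assumption): one token per non-space character.
    return [c for c in text if not c.isspace()]

def word_tokenize(text):
    # Toy coarse case (assumption): word runs or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

text = "total = price * quantity"

for name, tokenize in [("char", char_tokenize), ("word", word_tokenize)]:
    tokens = tokenize(text)
    print(f"{name}: fertility={fertility(tokens, text):.2f} "
          f"bytes/token={bytes_per_token(tokens, text):.2f}")
```

On this one line the character-level stand-in needs four times as many tokens as the word-level one, which is the stove practitioners have touched: same content, a quarter of the effective context.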
Goldman and Caciularu’s work gives the folklore its strongest academic spine. In the regime they sampled, compression-performance correlation is real and sometimes extreme. That matters, and dismissing it would be sloppy. The point of this article is not to refute Goldman. It is to explain why Goldman can be right in one region of the surface while Schmidt is right in another.
Five papers, five directions, one mechanism.
The synthesis is easier if we name the roles clearly.
| Paper | What it appears to show | What it actually isolates |
|---|---|---|
| Goldman & Caciularu | Compression can correlate strongly with performance | A regime where compression and alignment remain tightly coupled |
| Schmidt et al. | Compression barely predicts performance overall | A broader regime where compression and alignment decouple |
| Kadosh et al. / TOKOMPILER | Smaller vocab can beat much larger BPE | Direct gain from structure-aware boundaries |
| Pagnoni et al. / BLT | No fixed vocabulary can still win | Boundary placement matters more than vocabulary dogma |
| Alderson | One tokenizer taxes languages differently | Cross-language variation in how well merges align with structure |
The unifying variable underneath those results is what I’ll call semantic boundary alignment: whether token boundaries coincide with meaningful morphological or syntactic units. Compression matters when it happens to track alignment. It fails when the two drift apart.
Compression is what you end up measuring when you do not yet have a direct measurement for alignment.
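One way to make alignment countable, offered as a sketch rather than a standardized metric: compare the character offsets where tokens end against the offsets where reference units end. The reference morpheme split and both candidate segmentations below are hand-written for illustration; the "fragmented" one is hypothetical and not the output of any real tokenizer.

```python
def boundaries(pieces):
    """Character offsets where one piece ends and the next begins."""
    offsets, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        offsets.add(pos)
    return offsets

def boundary_alignment(tokens, units):
    """Return (precision, recall) of token boundaries against unit boundaries."""
    assert "".join(tokens) == "".join(units), "both must cover the same text"
    tok, ref = boundaries(tokens), boundaries(units)
    hits = len(tok & ref)
    precision = hits / len(tok) if tok else 1.0
    recall = hits / len(ref) if ref else 1.0
    return precision, recall

# Hand-written reference units (assumption): un + help + ful + ness.
units = ["un", "help", "ful", "ness"]

# Two hypothetical segmentations with the SAME token count.
fragmented = ["unh", "elp", "fuln", "ess"]
aligned = ["un", "help", "ful", "ness"]

for name, tokens in [("fragmented", fragmented), ("aligned", aligned)]:
    p, r = boundary_alignment(tokens, units)
    print(f"{name}: tokens={len(tokens)} precision={p:.2f} recall={r:.2f}")
```

Both segmentations use four tokens, so a compression metric cannot tell them apart. Only the boundary comparison does, which is the decoupling the rest of this article leans on.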
Paper 1: Goldman 2024 and the regime where compression really does bite.
Goldman and Caciularu deserve the respectful reading because they are the strongest piece in the “compression matters” camp. Their results are not hand-wavy. They report very strong compression-performance relationships in small-model generation settings, including the now-famous -0.996 correlation headline. They also quantify how bad poor tokenization can get: large length penalties and steep character-level overheads when segmentation is poor.
The clean way to read Goldman is not “the field solved it, optimize compression.” The clean read is narrower: in small models on generation tasks, bad segmentation hurts so much that compression ends up acting like a strong stand-in for the thing you actually care about. If a small model does not have much spare capacity, forcing it to reconstruct fractured units in the early layers is costly. In that setting, compression and alignment remain tightly coupled enough that compression looks causal.
That is why Goldman is not the enemy of the synthesis. Goldman is one side of it. The mistake is generalizing the local surrogate into a universal rule.
Paper 2: Schmidt 2024 and the collapse of compression as a universal proxy.
Schmidt et al. ask a broader question with a broader sample, and the headline changes sharply. Across 64 models ranging from 350M to 2.4B parameters and totaling roughly 138K GPU hours of training, compression ratio correlates with downstream performance at only r=0.241. That is the number that breaks the folklore.
This is the point where many readers want to declare Goldman wrong. Don’t. Schmidt is not proving that compression never matters. Schmidt is showing that once you expand the regime, compression loses its reliability as a proxy. Some tokenizers compress aggressively while aligning badly. Others compress less aggressively while aligning much better with the underlying structure. As soon as those two properties decouple, compression stops being the decision variable.
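The mechanism is easy to see with toy numbers. The sketch below uses values invented purely for illustration (they are not taken from Goldman or Schmidt, and the sign convention is arbitrary): in one group compression tracks the score, in the other it does not, and pooling the two weakens the correlation.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# TOY DATA (assumption): compression in bytes/token, score in arbitrary units.
# Coupled regime: more compressive tokenizers also happen to align better.
coupled_compression = [2.0, 3.0, 4.0, 5.0]
coupled_score = [0.2, 0.4, 0.6, 0.8]

# Decoupled regime: same compression range, but alignment (and so score)
# no longer follows it.
decoupled_compression = [2.0, 3.0, 4.0, 5.0]
decoupled_score = [0.7, 0.5, 0.6, 0.4]

r_narrow = pearson(coupled_compression, coupled_score)
r_pooled = pearson(coupled_compression + decoupled_compression,
                   coupled_score + decoupled_score)

print(f"narrow regime r = {r_narrow:.3f}")
print(f"pooled regime r = {r_pooled:.3f}")
```

Nothing about the coupled group changed between the two calls. The correlation dropped only because the pooled sample includes tokenizers where compression and score move independently.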
The deeper contribution from Schmidt is not just the correlation number. It is the interpretive shift. The paper pushes attention back toward pre-tokenization rules and boundary choice. That is the moment where the latent variable starts to show through the metric.
Paper 3: TOKOMPILER and the cleanest isolation of alignment we have.
If you want the single best paper for persuading a skeptical engineer, it is probably TOKOMPILER. Kadosh et al. use abstract syntax tree structure to build token boundaries that actually respect programming-language grammar. The vocabulary is tiny compared with standard BPE: 1,177 tokens versus 50,000. And yet performance improves by 230%.
That result matters because it strips away the usual excuse. If bigger, more compressive vocabularies were the main game, a vocabulary roughly 42x smaller (50,000 / 1,177 ≈ 42.5) should look like a handicap. Instead, the structure-aware tokenizer wins because each token boundary now lands where the model would have wanted the units to be in the first place.
Take a simple function header like `def calculate_total(items):`. A generic BPE can split the sequence into fragments that are locally frequent but structurally awkward. An AST-aware tokenizer has a better shot at preserving unit boundaries the model can reason over directly. The gain is not magic. It is recomposition cost removed from the early layers.
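As a rough illustration of that header, the sketch below uses Python’s standard `tokenize` lexer as a stand-in for grammar-aware boundaries. This is a lexer, not an AST, and it is not TOKOMPILER’s method. The `bpe_like` split is a hypothetical fragmentation written by hand, not the output of a real BPE vocabulary.

```python
import io
import tokenize

SOURCE = "def calculate_total(items):\n    return sum(items)\n"

def lexical_boundaries(source, row=1):
    """Column offsets where Python's lexer starts or ends a token on `row`."""
    line = source.splitlines()[row - 1]
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.COMMENT,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    cols = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.start[0] == row and tok.type not in skip:
            cols.update((tok.start[1], tok.end[1]))
    # Drop the trivial boundaries at the very start and end of the line.
    return cols - {0, len(line)}

def piece_boundaries(pieces):
    """Column offsets where one piece ends and the next begins."""
    offsets, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        offsets.add(pos)
    return offsets

# Hypothetical frequency-driven split of the header (assumption).
bpe_like = ["def", " calc", "ulate", "_to", "tal(", "items", "):"]

ref = lexical_boundaries(SOURCE)
tok = piece_boundaries(bpe_like)
on_grammar = len(tok & ref) / len(tok)

print("lexer boundaries:", sorted(ref))
print("bpe-like boundaries:", sorted(tok))
print(f"share of token boundaries on a lexical boundary: {on_grammar:.2f}")
```

In this hand-built case only half of the token boundaries land where the lexer would put one. The other half cut through `calculate_total`, which is the recomposition work the early layers would have to undo.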
Paper 4: BLT and the no-vocabulary existence proof.
Pagnoni et al.’s Byte Latent Transformer is useful because it attacks the BPE assumption from another side entirely. BLT uses entropy-based patching at the byte level rather than forcing everything through one fixed vocabulary. At 8B, it beats BPE-based Llama 3 on code benchmarks: 35.4 vs 31.1 on HumanEval and 41.8 vs 40.2 on MBPP.
The most important thing BLT gives us is not “tokenizers are obsolete.” That would be too strong, especially at this scale. The important thing is the existence proof. You can remove the sacred fixed vocabulary, place boundaries dynamically, and still win. That makes sense only if vocabulary size itself was never the essential ingredient. Useful boundaries were.
BLT reinforces the same synthesis TOKOMPILER supports: what matters is whether the segmentation boundary falls in the right place. Fixed vocabulary is just one way of trying to get there.
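To show what "entropy-based patching" means mechanically, here is a deliberately simplified sketch. BLT estimates next-byte entropy with a small byte-level language model; this sketch swaps in a bigram count table, which is an assumption made to keep the example self-contained. The function names and the threshold value are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_bigram(data: bytes):
    """Count which byte follows which (a crude stand-in for a byte LM)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        counts[prev][nxt] += 1
    return counts

def next_byte_entropy(counts, prev: int) -> float:
    """Shannon entropy (bits) of the next-byte distribution after `prev`."""
    dist = counts.get(prev)
    if not dist:
        return 8.0  # unseen context: assume maximal uncertainty over 256 values
    total = sum(dist.values())
    return -sum((c / total) * math.log2(c / total) for c in dist.values())

def entropy_patches(data: bytes, counts, threshold: float):
    """Start a new patch wherever the next byte is hard to predict."""
    patches, start = [], 0
    for i in range(1, len(data)):
        if next_byte_entropy(counts, data[i - 1]) > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

data = b"def add(a, b):\n    return a + b\n" * 3
counts = train_bigram(data)
for patch in entropy_patches(data, counts, threshold=1.0):
    print(patch)
```

There is no vocabulary anywhere in this code. Boundaries come from predictability alone, and raising or lowering `threshold` trades patch length against how closely patches follow the hard-to-predict spots.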
Paper 5: the 2.6x language spread.
Alderson’s cross-language study is not peer-reviewed in the same sense as the others, so it should be treated as supporting evidence rather than the spine of the argument. But it is still valuable because it shows the penalty in a way most engineers can feel immediately. Under one tokenizer, the spread between programming languages reached 2.6x. Clojure and Haskell came out far more efficient than Python or JavaScript.
The intuitive temptation is to treat that as a property of language verbosity. The better read is that it is a property of merge fit. Languages with more regular structural patterns happen to line up with the merge rules better. Languages with more syntactic diversity pay more fragmentation tax.
This is the cross-language version of the same problem. Hold the tokenizer constant, vary the input family, and alignment differences show up as efficiency differences. That is not a separate story from Goldman and Schmidt. It is another view of the same variable.
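Measuring this on your own corpus takes a few lines. The sketch below assumes the spread is the ratio of the largest to the smallest token count over equivalent programs; the source does not spell out Alderson’s exact definition. The regex tokenizer and the two one-line samples are toy stand-ins, so the number it prints says nothing about real tokenizers or about the 2.6x figure.

```python
import re

def toy_tokenize(text):
    # Toy stand-in (assumption): word runs or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def token_counts(tokenize, samples):
    """Token count per language for semantically equivalent samples."""
    return {lang: len(tokenize(src)) for lang, src in samples.items()}

def spread(counts):
    """Ratio of the most expensive language to the cheapest."""
    return max(counts.values()) / min(counts.values())

# Two tiny equivalent programs, written by hand for illustration.
samples = {
    "python": "print(sum([1, 2, 3]))",
    "clojure": "(println (+ 1 2 3))",
}

counts = token_counts(toy_tokenize, samples)
print(counts)
print(f"spread = {spread(counts):.2f}x")
```

Swap `toy_tokenize` for the tokenizer you actually ship and `samples` for matched programs from your target languages. The tokenizer is held constant, so any spread you see comes from how well its boundaries fit each language.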
Reconciling the five papers.
Once you stop asking whether compression is good and start asking what compression is standing in for, the literature becomes much less confusing. Goldman observes a narrow regime where compression and alignment stay close together, so compression looks highly predictive. Schmidt broadens the regime and shows that once the two can separate, compression’s predictive power collapses. TOKOMPILER manipulates alignment more directly and performance jumps. BLT shows that fixed vocabularies are negotiable if you can still place useful boundaries. Alderson shows the same tokenizer grants some languages more structural room than others.
That is why “optimize compression” is a bad universal rule and “optimize alignment” is a better one. Compression is not useless. It is just downstream of the more important question.
There is one caveat worth stating plainly: boundary alignment is still a synthesis term here, not a standardized field metric with universal agreement on how to score it. That is exactly why the field keeps reaching for compression. Compression is easy to count. Alignment is more annoying to formalize. But the harder-to-measure variable is still the one doing the explanatory work.
The decision rule for your tokenizer one-pager.
If you need the executive summary for a tech lead, it is this:
- Do not optimize compression directly as your north star. Treat it as one observable that may or may not track the thing you care about.
- If your domain has obvious structure, optimize for boundary alignment first. Code, legal templates, schemas, markup, and formal documents all live here.
- If your domain is plain high-resource text, default BPE is often fine. Do not overbuild because one tokenization paper made you nervous.
- If your domain is multilingual or low-resource, test per target family. Average tokenizer quality can hide brutal local penalties.
- When in doubt, run a small ablation. Compare a frequency-only tokenizer against a structure-aware variant at small scale before you commit the full training budget.
The engineer-facing version is simpler still: if the structure matters, the boundaries matter. Let compression follow from that instead of forcing it to lead.
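The small-ablation step can be sketched as a harness skeleton. Everything here is hypothetical scaffolding: `train_and_eval` is a placeholder for whatever small-scale training and evaluation loop you already have, and the variant names are up to you. The one design choice it encodes is that token count is reported while the downstream score decides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Tokenizer = Callable[[str], List[str]]

@dataclass
class AblationRow:
    name: str
    total_tokens: int   # observable: how compressive the variant is
    eval_score: float   # decision variable: small-scale downstream score

def run_ablation(
    variants: Dict[str, Tokenizer],
    corpus: Sequence[str],
    train_and_eval: Callable[[List[List[str]]], float],
) -> List[AblationRow]:
    """Tokenize the corpus with each variant, score it, rank by score."""
    rows = []
    for name, tokenize in variants.items():
        tokenized = [tokenize(doc) for doc in corpus]
        rows.append(AblationRow(
            name=name,
            total_tokens=sum(len(doc) for doc in tokenized),
            eval_score=train_and_eval(tokenized),
        ))
    # Rank by downstream score; token count is reported, not optimized.
    return sorted(rows, key=lambda row: row.eval_score, reverse=True)
```

Call it with a frequency-only tokenizer and a structure-aware variant as the two entries in `variants`, and read the table top to bottom. If the top row is not the one with the fewest tokens, you have your answer about which variable to optimize.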
What this changes.
The real win from the synthesis is not philosophical. It is operational. It gives teams a better first question. Instead of asking, “did we choose the most compressive tokenizer?” you ask, “does the segmentation preserve the units our model needs to think over?” That change alone catches a surprising amount of wasted motion.
If you want the broader architecture frame this sits inside, read “There Is No Code Mode.” If you want the practical follow-ons, the next two pieces in this cluster are already implied by the evidence here: the cross-language spread piece and the BLT deep-dive. The point of the series is not to bury you in citations. It is to let you leave with a better one-pager than the one you walked in with.