
Compression Is a Noisy Proxy.

Every tokenizer one-pager eventually produces the same folk rule: fewer tokens equals a better model. The literature is not that tidy. Five papers that seem to disagree are actually measuring the same thing from different angles.

If you have spent any time near tokenizer design, you have heard the folklore: lower fertility is better. Better compression means more content per context window, which means better downstream performance, which means the decision rule is obvious. Optimize compression. Move on.

The trouble is that the headline empirical result pointing the other way is hard to ignore. Schmidt et al. measured the relationship between tokenizer compression ratio and model performance across 64 models trained over 138,432 GPU hours and found a correlation of just r=0.241. That is not zero, but it is nowhere near strong enough to carry the whole theory. At the same time, Goldman and Caciularu report a correlation as strong as -0.996 in a narrower regime. Kadosh et al. report a 230% improvement from AST-aware tokenization. Pagnoni et al.’s BLT beats Llama 3 on code benchmarks without a fixed vocabulary at all. Alderson finds a 2.6x spread in token efficiency across programming languages under one tokenizer.

If you read those five papers one by one, the field looks contradictory. If you line them up around the latent variable underneath them, the picture gets simpler. Compression is not the thing. It is a noisy proxy for whether token boundaries line up with the structural units the model actually needs.

The folklore: lower fertility equals better.

The folklore is not stupid. It emerged because it works in enough local cases to feel true. If the same content takes fewer tokens, the model has more room left in context. If a tokenizer explodes the length of a prompt, the model has less room to reason with. Plenty of practitioners have touched that stove.
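To make the stove concrete, here is a minimal sketch of the arithmetic behind the folklore. It assumes fertility means average tokens per whitespace-delimited word. The two tokenizers (`word_tokenizer` and `char_tokenizer`) are deliberately crude and hypothetical, invented only to give a low-fertility and a high-fertility extreme, and the 8192-token window is an arbitrary assumption.

```python
from typing import Callable, List


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def words_that_fit(context_tokens: int, tokens_per_word: float) -> int:
    """Roughly how many words fit in a context window at a given fertility."""
    return int(context_tokens / tokens_per_word)


# Hypothetical toy tokenizers, used only to produce two extremes.
def word_tokenizer(text: str) -> List[str]:
    return text.split()


def char_tokenizer(text: str) -> List[str]:
    return [ch for ch in text if not ch.isspace()]


sample = "the model has less room to reason with"
CONTEXT_TOKENS = 8192  # assumed window size, for illustration only

low = fertility(sample, word_tokenizer)
high = fertility(sample, char_tokenizer)
print(f"word-level: fertility={low:.3f}, words that fit={words_that_fit(CONTEXT_TOKENS, low)}")
print(f"char-level: fertility={high:.3f}, words that fit={words_that_fit(CONTEXT_TOKENS, high)}")
```

The same window holds far fewer words at higher fertility, which is the local effect the folklore is built on. Nothing in this arithmetic says whether the tokens that do fit are good units to reason over.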

Goldman and Caciularu’s work gives the folklore its strongest academic spine. In the regime they sampled, compression-performance correlation is real and sometimes extreme. That matters, and dismissing it would be sloppy. The point of this article is not to refute Goldman. It is to explain why Goldman can be right in one region of the surface while Schmidt is right in another.

Five papers, five directions, one mechanism.

The synthesis is easier if we name the roles clearly.

| Paper | What it appears to show | What it actually isolates |
| --- | --- | --- |
| Goldman & Caciularu | Compression can correlate strongly with performance | A regime where compression and alignment remain tightly coupled |
| Schmidt et al. | Compression barely predicts performance overall | A broader regime where compression and alignment decouple |
| Kadosh et al. / TOKOMPILER | Smaller vocab can beat much larger BPE | Direct gain from structure-aware boundaries |
| Pagnoni et al. / BLT | No fixed vocabulary can still win | Boundary placement matters more than vocabulary dogma |
| Alderson | One tokenizer taxes languages differently | Cross-language variation in how well merges align with structure |

The unifying variable underneath those results is what I’ll call semantic boundary alignment: whether token boundaries coincide with meaningful morphological or syntactic units. Compression matters when it happens to track alignment. It fails when the two drift apart.

Compression is what you end up measuring when you do not yet have a direct measurement for alignment.
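One way to make the idea tangible is to score how many token boundaries coincide with hand-marked unit boundaries. The sketch below is an assumption of mine, not a metric taken from any of the five papers: the function names (`boundaries`, `boundary_alignment`) are hypothetical, and the reference units are supplied by hand. It reports boundary precision and recall for two segmentations that use exactly the same number of tokens.

```python
from typing import Dict, List, Set


def boundaries(segments: List[str]) -> Set[int]:
    """Character offsets where one segment ends and the next begins."""
    offsets: Set[int] = set()
    position = 0
    for segment in segments[:-1]:  # the end of the string is not an internal boundary
        position += len(segment)
        offsets.add(position)
    return offsets


def boundary_alignment(tokens: List[str], units: List[str]) -> Dict[str, float]:
    """Precision and recall of token boundaries against reference unit boundaries."""
    if "".join(tokens) != "".join(units):
        raise ValueError("tokens and units must segment the same string")
    token_cuts, unit_cuts = boundaries(tokens), boundaries(units)
    hits = len(token_cuts & unit_cuts)
    precision = hits / len(token_cuts) if token_cuts else 1.0
    recall = hits / len(unit_cuts) if unit_cuts else 1.0
    return {"precision": precision, "recall": recall}


# Hand-marked morphological units for one word (an illustrative assumption).
units = ["un", "break", "able"]

aligned = ["un", "break", "able"]     # 3 tokens, cuts land on unit edges
misaligned = ["unb", "reaka", "ble"]  # 3 tokens, cuts land inside units

print(boundary_alignment(aligned, units))
print(boundary_alignment(misaligned, units))
```

Both segmentations compress the word identically, three tokens each, yet one scores perfectly and the other scores zero. That is the gap between compression and alignment in miniature.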

Paper 1: Goldman 2024 and the regime where compression really does bite.

Goldman and Caciularu deserve the respectful reading because they are the strongest piece in the “compression matters” camp. Their results are not hand-wavy. They report very strong compression-performance relationships in small-model generation settings, including the now-famous -0.996 correlation headline. They also quantify how bad poor tokenization can get: large length penalties and steep character-level overheads when segmentation is poor.

The clean way to read Goldman is not “the field solved it, optimize compression.” The clean read is narrower: in small models on generation tasks, bad segmentation hurts so much that compression ends up acting like a strong stand-in for the thing you actually care about. If a small model does not have much spare capacity, forcing it to reconstruct fractured units in the early layers is costly. In that setting, compression and alignment remain tightly coupled enough that compression looks causal.

That is why Goldman is not the enemy of the synthesis. Goldman is one side of it. The mistake is generalizing the local surrogate into a universal rule.

Paper 2: Schmidt 2024 and the collapse of compression as a universal proxy.

Schmidt et al. ask a broader question with a broader sample, and the headline changes sharply. Across 64 models ranging from 350M to 2.4B parameters and totaling roughly 138K GPU hours of training, compression ratio correlates with downstream performance at only r=0.241. That is the number that breaks the folklore.

This is the point where many readers want to declare Goldman wrong. Don’t. Schmidt is not proving that compression never matters. Schmidt is showing that once you expand the regime, compression loses its reliability as a proxy. Some tokenizers compress aggressively while aligning badly. Others compress less aggressively while aligning much better with the underlying structure. As soon as those two properties decouple, compression stops being the decision variable.
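A toy calculation shows how both correlations can be honest at once. Every number below is invented for illustration and none comes from Goldman or Schmidt. The sketch assumes performance is driven entirely by alignment, then compares a narrow regime where compression tracks alignment with a pooled sample that also includes tokenizers where the two have drifted apart.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Each pair is (compression, alignment). All values are made up.
coupled: List[Tuple[float, float]] = [
    (1.1, 0.1), (1.2, 0.2), (1.3, 0.3), (1.4, 0.4), (1.5, 0.5),
]
# Tokenizers that compress hard but align badly, and the reverse.
decoupled: List[Tuple[float, float]] = [
    (1.5, 0.1), (1.1, 0.5), (1.4, 0.2), (1.2, 0.4),
]


def compression_vs_performance(points: List[Tuple[float, float]]) -> float:
    compression = [c for c, _ in points]
    performance = [a for _, a in points]  # toy assumption: performance == alignment
    return pearson_r(compression, performance)


r_narrow = compression_vs_performance(coupled)
r_pooled = compression_vs_performance(coupled + decoupled)
print(f"narrow regime r = {r_narrow:.3f}")
print(f"pooled regime r = {r_pooled:.3f}")
```

In the narrow sample compression looks almost perfectly predictive. Pool in the decoupled tokenizers and the correlation falls toward zero, even though the underlying driver never changed.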

The deeper contribution from Schmidt is not just the correlation number. It is the interpretive shift. The paper pushes attention back toward pre-tokenization rules and boundary choice. That is the moment where the latent variable starts to show through the metric.

Paper 3: TOKOMPILER and the cleanest isolation of alignment we have.

If you want the single best paper for persuading a skeptical engineer, it is probably TOKOMPILER. Kadosh et al. use abstract syntax tree structure to build token boundaries that actually respect programming-language grammar. The vocabulary is tiny compared with standard BPE: 1,177 tokens versus 50,000. And yet performance improves by 230%.

That result matters because it strips away the usual excuse. If bigger, more compressive vocabularies were the main game, a vocabulary roughly 42x smaller (50,000 / 1,177 ≈ 42.5) should look like a handicap. Instead, the structure-aware tokenizer wins because each token boundary now lands where the model would have wanted the units to be in the first place.

Take a simple function header like def calculate_total(items):. A generic BPE can split the sequence into fragments that are locally frequent but structurally awkward. An AST-aware tokenizer has a better shot at preserving unit boundaries the model can reason over directly. The gain is not magic. It is recomposition cost removed from the early layers.
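The contrast is easy to see on that header. The sketch below is not TOKOMPILER's method and not a real BPE. It assumes Python's standard-library lexer (`tokenize`) as a stand-in for structure-aware boundaries, and a structure-blind fixed-width splitter (`fixed_width_chunks`, a hypothetical helper) as a stand-in for boundaries that ignore syntax. The function body is added only so the snippet is complete Python.

```python
import io
import tokenize
from typing import List

SOURCE = "def calculate_total(items):\n    return sum(items)\n"


def lexical_units(source: str) -> List[str]:
    """Split Python source into names and operators using the stdlib lexer."""
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type in (tokenize.NAME, tokenize.OP)
    ]


def fixed_width_chunks(text: str, width: int = 4) -> List[str]:
    """Structure-blind baseline: cut every `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


header = SOURCE.splitlines()[0]

print(lexical_units(SOURCE)[:6])
# ['def', 'calculate_total', '(', 'items', ')', ':']
print(fixed_width_chunks(header))
# ['def ', 'calc', 'ulat', 'e_to', 'tal(', 'item', 's):']
```

The lexer keeps the identifier and each piece of punctuation whole. The fixed-width split smears the identifier across four fragments and glues punctuation onto name pieces, which is the kind of recomposition work the model then has to undo.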

Paper 4: BLT and the no-vocabulary existence proof.

Pagnoni et al.’s Byte Latent Transformer is useful because it attacks the BPE assumption from another side entirely. BLT uses entropy-based patching at the byte level rather than forcing everything through one fixed vocabulary. At the 8B-parameter scale, it beats BPE-based Llama 3 on code benchmarks: 35.4 vs 31.1 on HumanEval and 41.8 vs 40.2 on MBPP.

The most important thing BLT gives us is not “tokenizers are obsolete.” That would be too strong, especially at this scale. The important thing is the existence proof. You can remove the sacred fixed vocabulary, place boundaries dynamically, and still win. That makes sense only if vocabulary size itself was never the essential ingredient. Useful boundaries were.

BLT reinforces the same synthesis TOKOMPILER supports: what matters is whether the segmentation boundary falls in the right place. Fixed vocabulary is just one way of trying to get there.
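For intuition about dynamic boundary placement, here is a toy sketch of entropy-driven patching. It is not BLT's implementation. It assumes a count-based bigram byte model in place of a learned byte-level language model, a single global threshold, and hypothetical function names (`train_bigram`, `next_byte_entropy`, `entropy_patches`). A new patch starts wherever the predicted next byte is uncertain enough.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def train_bigram(corpus: bytes) -> Dict[int, Counter]:
    """Count which byte follows which: a crude stand-in for a byte-level LM."""
    following: Dict[int, Counter] = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        following[prev][nxt] += 1
    return following


def next_byte_entropy(following: Dict[int, Counter], prev: int) -> float:
    """Shannon entropy in bits of the next-byte distribution after `prev`."""
    counts = following.get(prev)
    if not counts:
        return 8.0  # unseen context: assume uniform uncertainty over 256 values
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_patches(data: bytes, following: Dict[int, Counter], threshold: float) -> List[bytes]:
    """Start a new patch wherever next-byte entropy exceeds `threshold`."""
    if not data:
        return []
    patches: List[bytes] = []
    start = 0
    for i in range(1, len(data)):
        if next_byte_entropy(following, data[i - 1]) > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches


# Tiny illustrative corpus; a real system would train on far more data.
corpus = b"def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"
model = train_bigram(corpus)

data = b"def add(a, b):"
patches = entropy_patches(data, model, threshold=1.0)
print(patches)
```

Boundaries here come from the data rather than from a vocabulary table, so the same mechanism can place them differently for different inputs. Where they land depends on the model and the threshold, which is why the quality of the boundary model matters more than the absence of a vocabulary.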

Paper 5: the 2.6x language spread.

Alderson’s cross-language study is not peer-reviewed in the same sense as the others, so it should be treated as supporting evidence rather than the spine of the argument. But it is still valuable because it shows the penalty in a way most engineers can feel immediately. Under one tokenizer, the spread between programming languages reached 2.6x. Clojure and Haskell came out far more efficient than Python or JavaScript.

The intuitive temptation is to treat that as a property of language verbosity. The better read is that it is a property of merge fit. Languages with more regular structural patterns happen to line up with the merge rules better. Languages with more syntactic diversity pay more fragmentation tax.

This is the cross-language version of the same problem. Hold the tokenizer constant, vary the input family, and alignment differences show up as efficiency differences. That is not a separate story from Goldman and Schmidt. It is another view of the same variable.

Reconciling the five papers.

Once you stop asking whether compression is good and start asking what compression is standing in for, the literature becomes much less confusing. Goldman observes a narrow regime where compression and alignment stay close together, so compression looks highly predictive. Schmidt broadens the regime and shows that once the two can separate, compression’s predictive power collapses. TOKOMPILER manipulates alignment more directly and performance jumps. BLT shows that fixed vocabularies are negotiable if you can still place useful boundaries. Alderson shows the same tokenizer grants some languages more structural room than others.

That is why “optimize compression” is a bad universal rule and “optimize alignment” is a better one. Compression is not useless. It is just downstream of the more important question.

There is one caveat worth stating plainly: boundary alignment is still a synthesis term here, not a standardized field metric with universal agreement on how to score it. That is exactly why the field keeps reaching for compression. Compression is easy to count. Alignment is more annoying to formalize. But the harder-to-measure variable is still the one doing the explanatory work.

The decision rule for your tokenizer one-pager.

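If you need the executive summary for a tech lead, it is this: optimize for boundary alignment first, and trust compression as a proxy only in regimes where the two stay coupled.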

The engineer-facing version is simpler still: if the structure matters, the boundaries matter. Let compression follow from that instead of forcing it to lead.

What this changes.

The real win from the synthesis is not philosophical. It is operational. It gives teams a better first question. Instead of asking, “did we choose the most compressive tokenizer?” you ask, “does the segmentation preserve the units our model needs to think over?” That change alone catches a surprising amount of wasted motion.

If you want the broader architecture frame this sits inside, read There Is No Code Mode. If you want the practical follow-ons, the next two pieces in this cluster are already implied by the evidence here: the cross-language spread piece and the BLT deep-dive. The point of the series is not to bury you in citations. It is to let you leave with a better one-pager than the one you walked in with.


Citations.

  1. Goldman, O. and Caciularu, A. (2024). Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance. arXiv.
  2. Schmidt, C.W. et al. (2024). Tokenization Is More Than Compression. arXiv.
  3. Kadosh, T. et al. (2023). TOKOMPILER: A Tokenization Framework for Programming Languages. arXiv.
  4. Pagnoni, A. et al. (2024). Byte Latent Transformer (BLT). arXiv.
  5. Alderson, M. (2025). Programming Language Tokenization Efficiency. martinalderson.com.
  6. Ali, A. et al. (2024). Tokenizer Choice For LLM Training: Negligible or Crucial?
  7. Lotz, A. et al. (2025). Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies.
  8. Petrov, A. et al. (2023). Language Model Tokenizers Introduce Unfairness Between Languages. arXiv.
  9. Rust, P. et al. (2021). How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models.
  10. Dagan, G. et al. (2024). Getting the Most Out of Your Tokenizer for Pre-Training and Domain Adaptation.

If you are picking a tokenizer for a real domain and want help pressure-testing the one-pager before the training budget gets burned, send a note through the contact form. This is exactly the kind of argument we like to tighten before it becomes infrastructure.

Siddharth Jaiman

Co-founder of JAAX Labs. Builds and runs Sentinel, a live AI analytics product on Shopify. Writes about evals, structure, and the small technical decisions that quietly control whether an AI system gets smarter or just gets more expensive.