How do sub-token and architecture-level compute optimization strategies compare?
This explores how strategies that work below the token boundary (segmenting bytes, ranking which tokens matter) compare to strategies that change the model's structure itself (network shape, recursive reasoning trees, what the architecture can and can't do) as ways to spend compute well.
Sub-token strategies operate below the word-piece boundary, for example by splitting raw bytes or deciding which tokens in a reasoning chain are worth the compute. Architecture-level moves reshape the model itself. The short version the corpus suggests: the two are not rivals so much as attacks on the same waste at different layers, and the sub-token work keeps revealing that problems we blamed on architecture were actually artifacts of where we drew the token boundary.
Start with the sub-token side. The Byte Latent Transformer drops tokenization entirely and groups raw bytes into patches sized by how predictable the next byte is, spending more compute on surprising regions and coasting through predictable ones. At 8B parameters it matches tokenized models with cheaper inference and better typo robustness (Can byte-level models match tokenized performance with better efficiency?). A complementary finding looks inside reasoning chains and shows that models implicitly rank tokens by function, preserving symbolic-computation tokens while pruning grammar and filler first; students trained on those pruned chains beat students trained on frontier-model compression (Which tokens in reasoning chains actually matter most?). Both say the same thing: not all tokens deserve equal compute, and you win by reallocating below the level most systems treat as atomic.
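To make the entropy-driven patching idea concrete, here is a minimal sketch. It is not BLT's implementation: the bigram-count entropy estimate is a toy stand-in for a learned byte-level model, and the names `next_byte_entropies`, `entropy_patches`, and the `threshold` parameter are assumptions introduced only for illustration.

```python
import math
from collections import Counter, defaultdict


def next_byte_entropies(data: bytes) -> list[float]:
    """Entropy in bits of each byte's distribution given the previous byte.

    Assumption: bigram counts over `data` itself replace the small learned
    byte-level language model that a real system would use.
    """
    follow = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        follow[prev][nxt] += 1

    # The first byte has no context, so treat it as maximally uncertain.
    entropies = [8.0] if data else []
    for prev in data[:-1]:
        counts = follow[prev]
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies


def entropy_patches(data: bytes, threshold: float) -> list[bytes]:
    """Open a new patch wherever the next byte is harder to predict than `threshold`."""
    if not data:
        return []
    ents = next_byte_entropies(data)
    patches, start = [], 0
    for i in range(1, len(data)):
        if ents[i] > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches
```

With a very low threshold every byte opens its own patch, and with a threshold at or above 8 bits the whole input collapses into a single patch, so in this sketch the threshold is the knob that trades sequence length against per-patch compute.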
The most pointed sub-token result is that some things we call architectural limits aren't. The exploration-exploitation trade-off in RLVR, long treated as a fundamental tension, turns out to be a measurement artifact that appears only at the token level: hidden-state analysis shows near-zero correlation between the two, and both can be enhanced at once (Is the exploration-exploitation trade-off actually fundamental?). That reframes the whole comparison: sometimes the cheapest 'architecture fix' is to stop measuring at the token grain.
Now the genuine architecture-level limits, where no amount of clever token handling helps. Autoregressive transformers cannot retract an emitted token, so constraint-satisfaction problems hit a ceiling that is structural rather than a matter of model quality; the fix is bolting on a symbolic solver that supplies the missing 'retraction' primitive (Why does autoregressive generation fail at constraint satisfaction?). At small scale, the shape of the network matters more than its size: deep-and-thin models beat balanced ones for sub-billion-parameter LLMs by composing concepts through layers (Does depth matter more than width for tiny language models?). And recursive subtask trees with KV-cache pruning let a single model reason past its context window, an architectural pattern that replaces multi-agent systems by restructuring how working memory is held (Can recursive subtask trees overcome context window limits?). These are wins you can't get by re-segmenting bytes.
The deeper unification is that both families are really about *where you put compute*, and the field increasingly treats that as one budget. Inference compute can substitute for parameter scaling on hard prompts (Can inference compute replace scaling up model size?), and allocating that inference budget adaptively by prompt difficulty beats fixed budgets (Can we allocate inference compute based on prompt difficulty?). That is the same entropy-driven logic BLT applies at the byte level, just one altitude up. The cleanest map is the internal-vs-external split in test-time scaling: internal methods build capability into the model, external methods extract more from existing capability, and they complement rather than compete (How do internal and external test-time scaling compare?). Read that way, sub-token tricks and architectural redesigns are both internal-side bets, and the thing they can't substitute for is training regime, since reasoning models out-run non-reasoning ones at any inference budget (Can non-reasoning models catch up with more compute?). The surprise worth leaving with: the most consequential optimization may be neither sub-token nor architectural but *temporal*. The long-context bottleneck is the compute needed to consolidate evicted context into fast weights, a problem that scales with how many passes you make, not how cleverly you tokenize (Is long-context bottleneck really about memory or compute?).
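The adaptive-budget idea can be sketched as a simple allocation rule. Everything here is an assumption made for illustration: the difficulty scores are hypothetical non-negative inputs (for instance from a verifier or an uncertainty estimate), the proportional rule with a per-prompt floor is one plausible choice rather than the method of the cited work, and `allocate_budget` is an invented name.

```python
def allocate_budget(difficulties: list[float], total: int, floor: int = 1) -> list[int]:
    """Split `total` inference samples across prompts in proportion to difficulty.

    Every prompt gets at least `floor` samples; the remainder is shared out
    proportionally, with rounding handled by the largest-remainder method.
    Assumes difficulty scores are non-negative.
    """
    n = len(difficulties)
    if n == 0:
        return []
    if total < floor * n:
        raise ValueError("total budget is smaller than the per-prompt floor")

    spare = total - floor * n
    weight = sum(difficulties)
    if weight == 0:
        shares = [spare / n] * n  # no signal: fall back to a uniform split
    else:
        shares = [spare * d / weight for d in difficulties]

    alloc = [floor + int(s) for s in shares]
    leftover = total - sum(alloc)
    # Hand the remaining samples to the prompts with the largest fractional share.
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc
```

For example, `allocate_budget([1, 1, 8], 13)` gives the hardest prompt nine samples and the two easy ones two each, spending the same total that a uniform split would.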
Sources (11 notes)
The Byte Latent Transformer (BLT) dynamically segments bytes into patches based on next-byte entropy, allocating more compute to high-entropy regions and less to predictable ones. At 8B parameters, BLT matches tokenized baselines while reducing inference cost and improving robustness to typos and cross-lingual transfer.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.
The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.
Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.
Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.