SYNTHESIS NOTE

Can byte-level models match tokenized performance with better efficiency?

Tokenized models use fixed vocabularies and allocate equal compute per token, but what if we dynamically group bytes based on prediction difficulty instead? Could this approach achieve competitive performance while using fewer FLOPs?

Synthesis note · 2026-02-23 · sourced from Novel Architectures

The Byte Latent Transformer (BLT) is the first byte-level LLM architecture to match tokenization-based performance at scale. The core principle: tokenization-based models spend the same compute on every token, and their token boundaries come from compression heuristics that are not correlated with how hard the next prediction is. BLT instead allocates compute dynamically, where data complexity demands it.

The mechanism: BLT segments raw bytes into patches based on the entropy of the next-byte prediction. High-entropy regions (uncertain, complex, like the first word of a new sentence) are cut into shorter patches and so get more compute, because the large model runs once per patch. Low-entropy regions (predictable, like word endings) are absorbed into longer patches and get less. The segmentation is dynamic, learned, and contextualized, producing patches with relatively uniform information density.
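A minimal sketch of entropy-based patching, under loud assumptions: a bigram byte-count model stands in for the small learned byte-level model that would supply next-byte entropies, and a single global threshold decides where a new patch starts. The names `train_bigram`, `next_byte_entropy`, `entropy_patches`, and the threshold value are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def train_bigram(data: bytes):
    """Count next-byte frequencies per previous byte (toy stand-in for a byte LM)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        counts[prev][nxt] += 1
    return counts


def next_byte_entropy(counts, prev: int) -> float:
    """Shannon entropy (bits) of the next-byte distribution after `prev`."""
    dist = counts.get(prev)
    if not dist:
        return 8.0  # assumption: unseen context is treated as uniform over 256 values
    total = sum(dist.values())
    return -sum((c / total) * math.log2(c / total) for c in dist.values())


def entropy_patches(data: bytes, counts, threshold: float) -> list:
    """Start a new patch at every byte whose prediction entropy exceeds `threshold`."""
    if not data:
        return []
    patches, start = [], 0
    for i in range(1, len(data)):
        # Entropy of predicting data[i] from its (one-byte) context data[i-1].
        if next_byte_entropy(counts, data[i - 1]) > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches


corpus = b"the cat sat on the mat. the dog sat on the log. " * 4
counts = train_bigram(corpus)
text = b"the cat sat on the log."
for patch in entropy_patches(text, counts, threshold=1.5):
    print(patch)
```

Raising the threshold yields fewer, longer patches; lowering it yields more, shorter ones. A real entropy model would condition on far more context than a single preceding byte.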

The architecture has three transformer modules:

  1. A small byte-level local encoder (turns raw bytes into patch representations)
  2. One large global latent transformer (handles the primary computation on patches)
  3. A small byte-level local decoder (turns patch outputs back into next-byte predictions)
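To make the division of labor concrete, here is a rough compute-accounting sketch. It assumes the small local models run once per byte while the large global model runs once per patch. The per-step costs (`local_cost`, `global_cost`) are hypothetical units, not measured FLOPs, and the baseline (running the large model once per byte) is only an illustrative comparison point.

```python
def module_steps(patch_lengths):
    """Number of sequence positions each module processes for one input."""
    n_bytes = sum(patch_lengths)
    n_patches = len(patch_lengths)
    return {
        "local_encoder": n_bytes,    # small model, one step per byte
        "global_latent": n_patches,  # large model, one step per patch
        "local_decoder": n_bytes,    # small model, one step per byte
    }


def relative_cost(patch_lengths, local_cost=1.0, global_cost=20.0):
    """Cost of the patched setup relative to running the large model on every byte.

    local_cost and global_cost are hypothetical per-step cost units.
    """
    steps = module_steps(patch_lengths)
    patched = (
        local_cost * (steps["local_encoder"] + steps["local_decoder"])
        + global_cost * steps["global_latent"]
    )
    per_byte_baseline = global_cost * sum(patch_lengths)
    return patched / per_byte_baseline


print(module_steps([4, 4, 8]))
print(relative_cost([4, 4, 8]))
```

In this toy accounting, longer average patches reduce the share of work done by the large model, which is where most of the savings would come from.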

A critical distinction: patches are not tokens. Tokens are drawn from a fixed vocabulary determined before training; patches are dynamically grouped byte sequences with no fixed vocabulary. This gives the model direct access to the underlying byte features, which token-based models largely lose once text is mapped to vocabulary IDs. The byte-level representation supports robustness to typos, sensitivity to character-level phenomena, and cross-lingual transfer that token-level models struggle to achieve.
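A toy illustration of why a fixed vocabulary can be brittle under typos, assuming a greedy longest-match tokenizer and a four-entry vocabulary. Both are hypothetical stand-ins, far simpler than any real tokenizer. The byte view of the same two strings changes in only two positions, while the token sequence is restructured.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        match = text[i]  # fallback: a single character
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens


VOCAB = {"token", "ization", "tok", "en"}  # hypothetical toy vocabulary
clean, typo = "tokenization", "tokneization"  # two adjacent letters swapped

clean_tokens = greedy_tokenize(clean, VOCAB)
typo_tokens = greedy_tokenize(typo, VOCAB)
changed_bytes = sum(a != b for a, b in zip(clean.encode(), typo.encode()))

print(clean_tokens)   # ['token', 'ization']
print(typo_tokens)    # ['tok', 'n', 'e', 'ization']
print(changed_bytes)  # 2
```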

This applies the idea behind the related note "Can we allocate inference compute based on prompt difficulty?" at a much finer granularity: not per prompt, not per token, but per byte group. The principle is the same (allocate compute where complexity demands it), but the resolution is far finer.

The scaling results demonstrate feasibility: the work presents the first FLOP-controlled scaling study of byte-level models, up to 8B parameters and 4T training bytes, and reports significant improvements in inference efficiency and robustness over tokenized baselines.

Original note title

byte-level language models allocate compute dynamically by entropy — matching tokenized model performance with better efficiency