Can byte-level models match tokenized performance with better efficiency?
Tokenized models use fixed vocabularies and allocate equal compute per token, but what if we dynamically group bytes based on prediction difficulty instead? Could this approach achieve competitive performance while using fewer FLOPs?
The Byte Latent Transformer (BLT) is the first byte-level LLM architecture to match tokenization-based performance at scale. The core principle: tokenization-based models allocate the same compute to every token, trading efficiency for performance via compression heuristics not correlated with prediction complexity. BLT instead allocates compute dynamically where data complexity demands it.
The mechanism: BLT segments raw bytes into patches based on the entropy of the next-byte prediction. A new patch begins where the next byte is hard to predict, so high-entropy regions (uncertain, complex, like the first word of a new sentence) are split into shorter patches and get more compute per byte. Low-entropy regions (predictable, like word endings) are merged into longer patches and get less. The segmentation is dynamic, learned, and contextualized, producing groups with relatively uniform information density.
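The segmentation rule can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a per-byte entropy value is already available (in BLT these come from a learned byte-level model, which is not reproduced here) and starts a new patch whenever that value exceeds a threshold. The function names, the threshold, and the entropy values are all made up for the example.

```python
import math
from typing import List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in bits of a next-byte probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def segment_by_entropy(
    data: bytes, entropies: Sequence[float], threshold: float
) -> List[bytes]:
    """Split data into patches, opening a new patch at byte i
    whenever entropies[i] > threshold.

    entropies[i] is assumed to be the uncertainty of predicting
    data[i] from the bytes before it.
    """
    if len(data) != len(entropies):
        raise ValueError("need exactly one entropy value per byte")
    patches: List[bytes] = []
    start = 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:
            patches.append(data[start:i])
            start = i
    if data:
        patches.append(data[start:])
    return patches


text = b"the cat sat"
# Hypothetical entropies: high at the first byte of each word, low elsewhere.
entropies = [3.0, 0.5, 0.2, 0.4, 3.1, 0.6, 0.3, 0.4, 2.9, 0.7, 0.2]
patches = segment_by_entropy(text, entropies, threshold=2.0)
print(patches)
```

With these invented values the boundaries fall at word starts, which mirrors the "first word of a new sentence" intuition above. Raising the threshold yields fewer, longer patches; lowering it yields more, shorter ones.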
The architecture has three transformer blocks:
- Two small byte-level local models (handle fine-grained byte processing)
- One large global latent transformer (handles the primary computation on patches)
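Why this split saves compute can be seen with back-of-the-envelope arithmetic. The sketch below assumes the common rough estimate of about 2 forward FLOPs per parameter per processed position and ignores attention cost. The local models run at every byte, while the global model runs once per patch. The parameter counts are hypothetical and are not taken from the paper.

```python
def forward_flops_per_byte(
    local_params: float, global_params: float, avg_patch_bytes: float
) -> float:
    """Rough forward cost per input byte.

    Assumption: ~2 FLOPs per parameter per processed position,
    attention cost ignored. Local models see every byte; the
    global model sees one position per patch.
    """
    if avg_patch_bytes <= 0:
        raise ValueError("avg_patch_bytes must be positive")
    local_cost = 2.0 * local_params
    global_cost = 2.0 * global_params / avg_patch_bytes
    return local_cost + global_cost


# Hypothetical sizes chosen only for illustration.
LOCAL_PARAMS = 0.3e9   # both small local models combined
GLOBAL_PARAMS = 6.0e9  # large global latent transformer

cost_4 = forward_flops_per_byte(LOCAL_PARAMS, GLOBAL_PARAMS, avg_patch_bytes=4.0)
cost_8 = forward_flops_per_byte(LOCAL_PARAMS, GLOBAL_PARAMS, avg_patch_bytes=8.0)
print(f"avg patch 4 bytes: {cost_4:.2e} FLOPs/byte")
print(f"avg patch 8 bytes: {cost_8:.2e} FLOPs/byte")
```

Under these assumptions the global term dominates, so doubling the average patch length nearly halves the per-byte cost. That is the lever entropy-based patching pulls: longer patches where bytes are predictable, shorter ones where they are not.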
A critical distinction: patches are not tokens. Tokens are drawn from a fixed vocabulary determined before training; patches are dynamically grouped byte sequences without a fixed vocabulary. This means the model has direct access to underlying byte features, which token-based models see only indirectly through opaque token IDs. The byte-level representation supports robustness to typos, sensitivity to character-level phenomena, and cross-lingual transfer that token-level models struggle to match.
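A toy comparison makes the typo point concrete. The vocabulary and the greedy longest-match tokenizer below are hypothetical stand-ins for a real fixed-vocabulary tokenizer, and the example only shows how the input representation changes under a typo. It says nothing about how any trained model actually behaves.

```python
from typing import List

# Hypothetical fixed vocabulary, decided before "training".
VOCAB = ["transform", "former", "trans", "er", "s"]


def greedy_tokenize(text: str, vocab: List[str]) -> List[str]:
    """Longest-match-first tokenization with single-character fallback."""
    ordered = sorted(vocab, key=len, reverse=True)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Take the longest vocabulary entry that matches at position i,
        # or fall back to the single character at position i.
        match = next((v for v in ordered if text.startswith(v, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens


def positions_changed(a: bytes, b: bytes) -> int:
    """Count differing positions between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    return sum(x != y for x, y in zip(a, b))


clean, typo = "transformers", "transfromers"
print(greedy_tokenize(clean, VOCAB))
print(greedy_tokenize(typo, VOCAB))
print(positions_changed(clean.encode("utf-8"), typo.encode("utf-8")))
```

In this toy setup a two-character transposition turns three tokens into seven mostly unrelated ones, while the byte sequence differs in only two positions out of twelve.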
This applies the idea in "Can we allocate inference compute based on prompt difficulty?" at a much finer granularity: not per-prompt, not per-token, but per-byte-group. The principle is the same (allocate compute where complexity demands it), but the unit of allocation shrinks from a whole prompt to a handful of bytes.
The scaling results demonstrate feasibility: the paper presents the first FLOP-controlled scaling study of byte-level models, up to 8B parameters and 4T training bytes, and reports significant improvements in inference efficiency and robustness over tokenized baselines.
Inquiring lines that use this note as a source 17
- How do byte-level representations enable better handling of typos than tokens?
- How do byte-level models allocate compute without explicit difficulty estimators?
- How do sub-token and architecture-level compute optimization strategies compare?
- How should compute budgets be allocated across multi-stage RAG architectures?
- Can any practitioner apply multi-token prediction without massive compute?
- Can next-token prediction train models to optimize for communication efficiency?
- Can prompt optimization for clarity automatically improve token efficiency?
- How does UI-guided token selection reduce compute compared to standard vision?
- When is 15x token overhead actually worth the compute cost?
- How does tokenization change what gets counted as valuable knowledge?
- How much does shared-prefix sampling reduce token redundancy empirically?
- What is the relationship between prefix sharing and speculative decoding?
- Does token-level loss aggregation help aligned models differently?
- Does the Chinchilla balance apply equally across all data types or only language?
- Why is latent-level prediction more sample-efficient than token-level prediction?
- Why does masking the penultimate token outperform random token masking?
- Why are rare tokens the hooks for verbatim model memorization?
Related concepts in this collection 3
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  BLT implements adaptive compute at sub-token granularity via entropy-based segmentation.
- Can parallel architectures solve inherently sequential problems?
  Complexity theory suggests some problems like reasoning and planning are fundamentally sequential. Can parallel architectures like Transformers overcome this limitation, or do we need fundamentally different computational approaches?
  BLT's dynamic allocation is orthogonal: it addresses efficiency within a given architecture, not computational depth.
- Can architecture choices improve inference efficiency without sacrificing accuracy?
  Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.
  BLT's entropy-based patching is a concrete architectural variable that conditional scaling laws could incorporate: patch granularity and entropy threshold are architecture-level parameters that affect inference efficiency independently of training FLOPs.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Byte Latent Transformer: Patches Scale Better Than Tokens
- TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
- From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
- Improving large language models with concept-aware fine-tuning
- Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
- Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
Original note title
byte-level language models allocate compute dynamically by entropy — matching tokenized model performance with better efficiency