Are text-only language models fundamentally limited by abstraction?
Explores whether text's compression of physics, geometry, and causality into symbols creates an irreducible ceiling for language-only AI, and whether multimodal approaches can overcome this structural constraint.
The foundation-model era was defined by language pretraining. Trillions of text tokens, autoregressive objectives, capabilities that surprised the field. The argument in Beyond Language Modeling is that this strategy has reached a structural ceiling — not for reasons of compute or data quantity but because of what text is.
Text is a human abstraction. When we describe the world, we compress continuous physics into discrete symbols, and that compression is lossy by construction. The high-fidelity physics, geometry, and causality that govern reality are stripped out in the encoding. A language model trained on text inherits the abstraction's limits: it can manipulate symbols brilliantly without grounding them in the dynamics those symbols describe. To borrow the allegory of Plato's cave, text-only LLMs have mastered the descriptions of shadows on the wall without ever seeing the objects casting them.
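A toy sketch can make "lossy by construction" concrete. It is not taken from Beyond Language Modeling. The functions `trajectory` and `describe`, and the three-phrase vocabulary, are hypothetical names invented for this illustration. The sketch shows two physically different trajectories collapsing to the same description, so the difference between them cannot be recovered from the text alone.

```python
# Toy illustration only: trajectory() and describe() are hypothetical
# helpers invented for this sketch, not part of any paper or library.

def trajectory(g, steps=5, dt=0.1, h0=10.0):
    """Heights of an object dropped from h0 under constant acceleration g."""
    return [h0 - 0.5 * g * (i * dt) ** 2 for i in range(steps)]

def describe(heights):
    """Compress a continuous trajectory into one discrete phrase."""
    if heights[-1] < heights[0]:
        return "the ball falls"
    if heights[-1] > heights[0]:
        return "the ball rises"
    return "the ball stays put"

# Two different physical situations (Earth-like and Moon-like gravity).
earth = trajectory(g=9.81)
moon = trajectory(g=1.62)

# The descriptions match even though the underlying dynamics do not.
same_text = describe(earth) == describe(moon)
same_physics = earth == moon

print(describe(earth), "|", describe(moon))
print("same text:", same_text, "| same physics:", same_physics)
```

The mapping from trajectory to phrase is many-to-one, which is the sense in which the encoding discards the source. A richer vocabulary narrows the gap, but under this toy setup it never makes the mapping invertible.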
The metaphor is doing real work, not just framing. It identifies a specific failure category: tasks that require reasoning about the source rather than the description. Physical reasoning about object interactions. Geometric reasoning about spatial relationships that text under-specifies. Causal reasoning about why something happens rather than what is described as happening. These are the task clusters on which text-only LLMs persistently underperform, and the cave allegory predicts that they should.
Beyond philosophy lies a hard pragmatic ceiling: high-quality text data is finite and approaching exhaustion. The compute side of the scaling curve has runway; the data side does not. The path forward requires moving beyond the shadows and modeling the source directly. Visual data preserves the physics, geometry, and causality that language strips, and the visual world's signal is essentially endless.
This reframes multimodal pretraining not merely as an addition to language pretraining but as the correction of an abstraction-induced limit. On this view, the text-only era was always going to hit this wall. The open question is whether multimodal architectures can integrate the unfiltered signal without inheriting the limitations of how vision and language were previously combined.
Inquiring lines that use this note as a source 38
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Why does statistical compression destroy literary connotation and meaning?
- How do low-dimensional representation structures entangle multiple cultures together?
- Why does language compression via statistical dependencies capture cultural and situated language use?
- Can linguistic compression be a fundamental mechanism for representing psychology?
- Can feature disentanglement in gesture synthesis generalize to completely unseen voice distributions?
- Does text-only evaluation hide reasoning collapse that tool use could repair?
- What training on actual interaction would show that text-only training cannot?
- Why does text encoding create different subspaces across domains?
- Can large language models understand language without embodied grounding systems?
- How do multimodal AI architectures compare to human brain export pathways?
- Can speech embeddings carry articulatory structure that text cannot?
- Why does pure-vision underperform when parsing semantics and action prediction mix?
- Can conversational AI achieve mutual understanding if trained only on text?
- Do decoder-only models have inherent architectural limits for non-sequential information?
- Can understanding language happen entirely within a language system alone?
- Why does LLM compression eliminate causal grounding in conceptual representations?
- How should visual content be connected to text within a unified knowledge representation?
- Can a text-only chatbot feel socially present without visual embodiment?
- Do speech encoders actually learn the physics of how vocal tracts produce sound?
- Can statistical learning from text replace embodied cultural experience?
- How does modeling capability relate to lossless compression in language models?
- Does AI's atemporal processing explain its preference for linear plots?
- Do language models and multimodal models show similar attractor-based interpretability?
- Why do image captions create different friction than pure video data?
- Can dense models partially address modality friction without full expert specialization?
- Can decoder-only models become effective text encoders with training?
- Why do vision and language have different optimal scaling curves?
- Can multimodal architectures successfully integrate vision without replicating past failures?
- What kinds of reasoning tasks reveal the ceiling of text-only training?
- How does modality-specific sparsity enable capacity flexibility that dense models cannot provide?
- What scaling exponent would audio or other modalities require in a truly multimodal system?
- Do discrete tokenized modalities preserve information better than continuous embeddings?
- What emergent abilities appear only in truly unified multimodal systems?
- How does causal multimodal modeling differ from encoder-decoder architectures?
- What temporal and spatial constraints does Space-Time U-Net solve?
- How does the compression view extend from trained models to training objectives?
- How does serializing screen layout to text preserve spatial relationships?
- Why do language models need external temporal signals at all?
Related concepts in this collection 4
- Can we solve modality competition through architectural design?
  Does modality competition in multimodal models stem from fundamental training conflicts, or from specific architectural choices? Understanding the root cause could reveal whether the trade-off is solvable.
  same paper, the architectural response to the abstraction limit
- Why do vision and language scale so differently?
  IsoFLOP analysis reveals vision and language follow distinct scaling curves: vision demands far more training data than language at equivalent compute budgets. Understanding this asymmetry matters for designing multimodal architectures that serve both modalities well.
  same paper, the scaling-law consequence
- Can language models learn meaning without engaging the world?
  Explores whether LLMs prove that meaning emerges from relational structure alone, independent of embodied experience or external reference. Tests structuralist theory empirically.
  adjacent: the relational-language view; complementary perspective on what text-only LLMs can and cannot do
- Can language models learn meaning from text patterns alone?
  Explores whether training on form alone (predicting the next word from prior words) could ever give language models access to communicative intent and genuine semantic understanding.
  convergent: another argument for why text alone is insufficient
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Beyond Language Modeling: An Exploration of Multimodal Pretraining
- Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence
- Can Theoretical Physics Research Benefit from Language Agents?
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
- Mechanistic Indicators of Understanding in Large Language Models
- Levels of Analysis for Large Language Models
- Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning
- Pixels, Patterns, but No Poetry: To See The World like Humans
Original note title
text-only LLMs are Plato cave models — text is a lossy human abstraction that captures shadows while missing physics geometry and causality of the source