INQUIRING LINE


The classic formula for measuring intelligence assumes unlimited computing power — real minds are constrained, and that gap breaks the whole framework.

Can Kolmogorov complexity alone capture what makes intelligence general?

This explores whether a single information-theoretic yardstick like Kolmogorov complexity — shortest-program length — can explain general intelligence, or whether generality is about something that measure structurally can't see.

The corpus's answer is a fairly pointed no, for a reason more interesting than the usual complaints. The sharpest critique is that Kolmogorov (and Shannon) measures assume an observer with unlimited compute: they ask how compressible something is in principle, for an idealized machine with no bound on running time. But real learners, whether humans or trained models, are *bounded*. That gap isn't a footnote; it's the whole story. Classical complexity cannot value data the way a bounded learner does, and the gap between the two explains why feature engineering helps, why curriculum order matters, and why a trained model can end up knowing more than the process that generated its data (see "Why do Shannon and Kolmogorov measures fail to value data?"). Generality, on this view, lives precisely in the territory Kolmogorov complexity assumes away.
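A minimal sketch of the bounded-observer gap, under loud assumptions: zlib stands in for a compute-limited observer (it is not a measure from the corpus), and Kolmogorov complexity itself is uncomputable, so the code only argues about it in comments. Both byte streams below come from very short programs, yet the bounded compressor finds structure in only one of them.

```python
import random
import zlib

N = 100_000  # bytes per sample (arbitrary illustrative size)

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size, using zlib at its highest level."""
    return len(zlib.compress(data, 9)) / len(data)

# A tiny program (fixed seed + generator) produces this stream, so its
# description length is small relative to N, even though it looks random.
rng = random.Random(0)
pseudo_random = bytes(rng.getrandbits(8) for _ in range(N))

# A visibly regular stream, also produced by a tiny program.
patterned = b"abc" * (N // 3)

ratio_prng = compression_ratio(pseudo_random)
ratio_pattern = compression_ratio(patterned)

print(f"pseudo-random stream: {ratio_prng:.3f}")
print(f"patterned stream:     {ratio_pattern:.3f}")
```

To the idealized observer both streams are cheap to describe; to the bounded one, the pseudo-random stream is effectively incompressible. That difference is the kind of distinction a resource-blind measure cannot register.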

The corpus even offers a candidate replacement that makes the point concrete. Epiplexity tries to measure the *structural* information a computationally bounded observer can actually extract, separating learnable regularity from time-bounded entropy, the part that stays out of reach within the observer's compute budget. Notably, this resource-aware measure correlates with out-of-distribution generalization and helps explain why some datasets enable broader transfer than others (see "What can a bounded observer actually learn from data?"). That's the crux: generality is about transfer to the unfamiliar, and the measure that tracks it is the one built around limited resources, not the one built around an omniscient compressor.

There's a second angle the corpus surfaces that a complexity-only view would miss entirely: two systems can be equally "complex" by any output-based measure and yet be radically different inside. Models trained by SGD can carry all the linearly decodable features a task needs while their internal organization is fractured and entangled: identical performance, incoherent structure underneath, invisible to standard metrics (see "Can models be smart without organized internal structure?" and "Can AI pass every test while understanding nothing?"). A description-length number scores the output, not whether the representation is the kind of clean, reusable structure that generalizes. So a system can be cheap to describe and still brittle, or pass every test and understand nothing.
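A toy illustration of why output-only scores are blind to internals. This is a deliberately simple linear reparameterization, not the method behind the Fractured Entangled Representation hypothesis; the matrices `W1`, `W2`, and `A` are hypothetical values chosen so the arithmetic stays exact. Two "models" compute the same function on every input while holding different hidden representations.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

# Hypothetical two-layer linear "model A": x -> W1 x -> W2 (W1 x).
W1 = [[1, 2], [3, 4]]
W2 = [[2, 0], [1, 1]]

# "Model B" mixes the hidden units with an invertible matrix A and undoes
# the mixing in the readout, so the end-to-end function is unchanged.
A = [[1, 1], [0, 1]]
A_inv = [[1, -1], [0, 1]]
W1_b = matmul(A, W1)
W2_b = matmul(W2, A_inv)

inputs = [[1, 0], [0, 1], [3, -2], [5, 7]]
hidden_a = [matvec(W1, x) for x in inputs]
hidden_b = [matvec(W1_b, x) for x in inputs]
out_a = [matvec(W2, h) for h in hidden_a]
out_b = [matvec(W2_b, h) for h in hidden_b]

print("outputs equal:", out_a == out_b)
print("hidden states equal:", hidden_a == hidden_b)
```

Any metric computed from `out_a` and `out_b` alone assigns the two models the same score, even though their hidden states differ on every input tested.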

This connects to why reasoning models actually break. They don't fail at a complexity threshold; long problems aren't inherently the problem. They fail at *novelty*: when an instance is unlike anything in training, because the model fitted instance-level patterns rather than a generalizable algorithm (see "Do language models fail at reasoning due to complexity or novelty?"). If generality were a function of complexity, harder-to-describe problems would be the failure points. Instead it's unfamiliarity, which is exactly what a bounded, transfer-oriented account predicts and a complexity-only account can't.
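The novelty-versus-complexity contrast can be caricatured in a few lines. This is an assumed toy, not how the cited work evaluates reasoning models: a lookup-table "learner" stands in for instance-level fitting, and a plain summation function stands in for a generalizable algorithm. All names are hypothetical.

```python
def train_memorizer(examples):
    """'Learn' by storing exact (problem -> answer) pairs: instance-level fitting."""
    return dict(examples)

def memorizer_predict(table, problem):
    """Return the stored answer, or None for anything not seen verbatim."""
    return table.get(problem)

def algorithm_predict(problem):
    """A generalizable procedure: actually add the numbers."""
    return sum(problem)

# The training set contains one long instance and two short ones.
long_seen = tuple(range(1, 51))  # 50 addends
train = [((1, 2), 3), ((4, 5), 9), (long_seen, sum(long_seen))]
table = train_memorizer(train)

short_novel = (2, 3)  # trivially easy, but never seen in training

print("memorizer, long but seen:   ", memorizer_predict(table, long_seen))
print("memorizer, short but novel: ", memorizer_predict(table, short_novel))
print("algorithm, short but novel: ", algorithm_predict(short_novel))
```

The memorizer answers the 50-term problem and fails the 2-term one: its failure boundary follows familiarity, not length.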

The takeaway is that a single compressibility scalar flattens three distinctions that matter for generality: what a *bounded* learner can extract, whether internal structure is coherent or fractured, and whether the system transfers to the novel rather than memorizing the familiar. Even the field's favorite trade-offs sometimes turn out to be artifacts of how you measure rather than real constraints (see "Is the exploration-exploitation trade-off actually fundamental?"), a useful reminder that picking the wrong yardstick doesn't just miss things, it invents phenomena. Kolmogorov complexity is a beautiful idealization; generality seems to live in everything that idealization throws away.

Sources (6 notes)

Why do Shannon and Kolmogorov measures fail to value data?

Both measures assume observers with unlimited compute and miss learnable, useful information. The gap explains why feature engineering helps, curriculum order matters, and trained models exceed their generating process—empirical facts classical theory cannot account for.

What can a bounded observer actually learn from data?

Epiplexity formalizes the structural information a computationally bounded observer can extract from data, separating learnable regularity from time-bounded entropy. This task-free measure correlates with out-of-distribution generalization and explains why some datasets enable broader transfer than others.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis shows that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Is the exploration-exploitation trade-off actually fundamental?

Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains: Can Kolmogorov complexity alone capture what makes intelligence general? A curated library (spanning 2024–2026) found — and these are dated claims, not current truth:

• Kolmogorov complexity assumes unbounded compute; real learners are bounded. Classical complexity cannot value data the way bounded agents do, making feature engineering and curriculum order invisible to it (~2026).
• Epiplexity — a resource-aware alternative — measures structural information a computationally bounded observer can extract, and correlates with out-of-distribution generalization and transfer (~2026).
• Models with identical performance metrics can have radically different internal structure: linearly decodable features paired with fractured, entangled representations. Output complexity scores miss this incoherence (~2025).
• Reasoning model failures are driven by instance-level unfamiliarity (novelty), not task-level complexity. Models fit instance patterns rather than generalizable algorithms (~2026).
• Exploration-exploitation trade-offs in RL sometimes dissolve under different measurement schemes, suggesting they are artifacts of the yardstick, not real constraints (~2025).

Anchor papers (verify; mind their dates): arXiv:2601.03220 (From Entropy to Epiplexity, 2026); arXiv:2505.17117 (From Tokens to Thoughts, 2025); arXiv:2510.14665 (Beyond Hallucinations, 2025); arXiv:2509.23808 (Beyond Exploration-Exploitation, 2025).

Your task:
(1) RE-TEST EACH CONSTRAINT. Has bounded-compute complexity theory matured? Do newer evals or mechanistic interpretability methods now directly measure the coherence of learned structure, making epiplexity obsolete or vindicated? Has any breakthrough in reasoning or few-shot transfer shown that a single compressibility scalar *can* predict generality after all — contradicting the core claim?
(2) Surface the strongest work from the last ~6 months that either resurrects a role for Kolmogorov-like measures or pushes further into resource-aware alternatives.
(3) Propose two research questions that assume the regime has shifted: (a) If internal structure coherence (not just output complexity) is the bottleneck, what training objectives directly optimize for it? (b) Can bounded-compute complexity predict novel task transfer better than current OOD generalization metrics?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
