What can a bounded observer actually learn from data?

Classical information measures treat all high-entropy content alike, but a computationally bounded learner can extract only certain kinds of structure. What distinguishes learnable regularity from the random noise a bounded agent faces?

Synthesis note · 2026-05-28 · sourced from Data

Having diagnosed why classical measures fail, the paper introduces epiplexity: a formalization of what a computationally bounded observer can actually learn from data. The key separation is between structural content (the learnable, reusable regularity a bounded learner can extract) and time-bounded entropy (the random, unpredictable content that looks like information to an unbounded observer but is useless to a bounded one). Pseudorandom number generators and chaotic dynamical systems are the canonical examples: high apparent entropy, near-zero epiplexity, because no efficient learner can exploit them.
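
The "high apparent entropy" half of this claim can be illustrated with a rough proxy. The sketch below is not the paper's estimator: it assumes a general-purpose compressor (`zlib`) as a stand-in for a bounded observer, and the helper `compression_ratio` and both sample streams are hypothetical. A seeded pseudorandom stream is fully determined by a short program plus a seed, yet the compressor finds nothing in it to exploit, while a periodic stream of the same length compresses to almost nothing.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by raw size; zlib is an assumed proxy observer."""
    return len(zlib.compress(data, 9)) / len(data)


N = 100_000

# Deterministic pseudorandom stream: produced by a short program plus a seed.
rng = random.Random(0)
pseudorandom = bytes(rng.getrandbits(8) for _ in range(N))

# Periodic stream of the same length: regularity the compressor can reuse.
structured = b"abcdefgh" * (N // 8)

prng_ratio = compression_ratio(pseudorandom)
structured_ratio = compression_ratio(structured)

print(f"pseudorandom: {prng_ratio:.3f}")
print(f"structured:   {structured_ratio:.3f}")
```

Under these assumptions the pseudorandom ratio stays close to 1 and the structured ratio close to 0. Compressed size is a total code length, so it shows only what this observer fails to exploit; it does not separate structural content from time-bounded entropy, and it is not a measurement of epiplexity.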

This single distinction resolves the three paradoxes at once. Information can be created by deterministic computation (the transform makes structure efficiently accessible that was latent before); it does depend on data order (ordering changes what a bounded learner can extract along the way); and likelihood modeling can produce programs more complex than the generating process (because the model encodes extractable structure, not just the source's codelength). Crucially, epiplexity is task-free — it measures learnable structure without reference to a downstream objective, which is what makes it a candidate foundation for data selection as opposed to model selection.

The practical payoff is empirical, not just conceptual. The paper gives procedures to estimate epiplexity that capture differences across data sources, track downstream performance, and flag dataset interventions that improve out-of-distribution (OOD) generalization. That last result is the strongest: a task-free structural measure that nonetheless correlates with OOD generalization would explain why some data enables broader transfer than others. This fits the vault's data-curation thread: as explored in "Can we prune training data without hurting model performance?", examples differ in learnable value, and epiplexity proposes the underlying quantity that difficulty metrics approximate.

Counterpoint and caution: epiplexity is observer-relative (bounded to what compute class?) and estimated, not computed exactly, so its claims inherit the slack of the estimator and the choice of observer.

Why it matters: it offers the first principled, task-free quantity for deciding which data to select, generate, or transform for learning.
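
The caution about observer-relativity can be made concrete with a rough proxy. This is a minimal sketch under stated assumptions, not one of the paper's estimation procedures: two off-the-shelf compressors stand in for observers with different resource limits, and the variable names and test stream are hypothetical. The stream is a 64 KiB pseudorandom block repeated once, so its only regularity is a long-range copy. DEFLATE (`zlib`) looks back at most 32 KiB and cannot reach it; LZMA's default dictionary is several MiB and can.

```python
import lzma
import random
import zlib

rng = random.Random(0)

# 64 KiB of pseudorandom bytes, repeated once: the only structure is a
# long-range copy that lies beyond DEFLATE's 32 KiB window.
block = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
data = block * 2

# Two assumed proxy observers with different resource limits.
small_observer = len(zlib.compress(data, 9)) / len(data)  # 32 KiB window
large_observer = len(lzma.compress(data)) / len(data)     # multi-MiB dictionary

print(f"small-window observer: {small_observer:.3f}")
print(f"large-window observer: {large_observer:.3f}")
```

The same bytes look incompressible to one observer and roughly half-redundant to the other, which is why any reported estimate should state the observer class it assumes. Window size here is only an analogy for a compute bound.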

Inquiring lines that read this note

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

What limits mechanistic interpretability's ability to characterize models?

How do multi-agent systems achieve genuine cooperation and reasoning?

How do LLMs distinguish causal reasoning from temporal and semantic associations?

Can event boundaries be identified from statistical regularities without understanding events?

What are the consequences of models training on synthetic data?

Can deterministic computation actually create new information in data?

When does optimizing for quality undermine the value of diversity?

How does mutual information between inputs and outputs differ from measuring raw diversity?

Related concepts in this collection

Can we prune training data without hurting model performance? This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste: if we can find which examples are truly necessary, we could train better models on far less data.
difficulty metrics approximate the learnable value epiplexity aims to measure directly

Does procedural knowledge drive reasoning more than factual retrieval? Explores whether models learn reasoning through general procedures across diverse documents rather than memorizing specific facts. This matters for understanding what pretraining data actually teaches models to reason.
a content-type account of which data generalizes; epiplexity offers a measure-theoretic account of the same phenomenon

Can deep learning theory unify around training dynamics? Is learning mechanics (focused on average-case predictions and training dynamics rather than worst-case bounds) the emerging framework that finally unifies fragmented deep learning theory?
situates epiplexity within the compute-aware theory-of-learning program

Original note title

epiplexity measures the structural information a computationally bounded observer can extract for data selection
