INQUIRING LINE

Does scaling data automatically produce compositional reasoning or just better feature encoding?

This explores whether throwing more data at a model genuinely produces the ability to combine learned pieces into novel reasoning (composition), or whether it just sharpens the model's encoding of features it has already seen — and the corpus is sharply split on this.


The corpus stages a real argument about this, and the answer hinges on what you count as 'composition.' The optimistic camp says scale alone is enough: standard MLPs reach compositional generalization with no special architecture, *provided* the training distribution covers enough combinations of the underlying task modules (see "Can neural networks learn compositional skills without symbolic mechanisms?"). Notably, that same work uses linear decodability of the building blocks from hidden activations as its success signal, and that is exactly the catch. The skeptical camp argues that linear decodability is precisely what can mask the absence of real composition: a model can carry every linearly decodable feature a task needs while its internal organization is fractured and brittle, a weakness that standard metrics do not reveal until perturbation or distribution shift breaks the model (see "Can models be smart without organized internal structure?"). So the two notes that look like they agree (decodable features = good) actually disagree about whether decodable features *mean* anything.
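As a toy illustration of the disagreement above, the sketch below fits a difference-of-means linear probe to synthetic "hidden activations". Everything in it is an assumption made for illustration and none of it comes from the cited notes: the function names (`make_activations`, `fit_mean_difference_probe`, `probe_accuracy`), the two hand-built feature directions, and the choice of probe. It shows only that a label can be almost perfectly linearly decodable in-distribution while the same probe fails once a shortcut feature flips sign. It does not model compositional structure itself.

```python
import random


def make_activations(rng, n, d, spurious_sign):
    """Toy 'hidden activations' (hypothetical): dim 0 carries a weak, stable
    signal for the label; dim 1 carries a strong shortcut signal whose sign
    is +1 in training and can be flipped to -1 to simulate distribution shift."""
    activations, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)              # binary constituent label
        s = 1.0 if y == 1 else -1.0
        h = [rng.gauss(0.0, 1.0) for _ in range(d)]
        h[0] += 1.0 * s                    # stable feature
        h[1] += 3.0 * spurious_sign * s    # shortcut feature
        activations.append(h)
        labels.append(y)
    return activations, labels


def fit_mean_difference_probe(activations, labels):
    """Linear probe: weights are the difference of class means, and the bias
    puts the decision boundary at the midpoint between the two means."""
    d = len(activations[0])
    sums = {0: [0.0] * d, 1: [0.0] * d}
    counts = {0: 0, 1: 0}
    for h, y in zip(activations, labels):
        counts[y] += 1
        for j in range(d):
            sums[y][j] += h[j]
    mu0 = [v / counts[0] for v in sums[0]]
    mu1 = [v / counts[1] for v in sums[1]]
    w = [a - b for a, b in zip(mu1, mu0)]
    bias = -sum(wj * (a + b) / 2.0 for wj, a, b in zip(w, mu1, mu0))
    return w, bias


def probe_accuracy(w, bias, activations, labels):
    """Fraction of examples whose label the linear probe decodes correctly."""
    correct = 0
    for h, y in zip(activations, labels):
        score = sum(wj * hj for wj, hj in zip(w, h)) + bias
        correct += int((score > 0) == (y == 1))
    return correct / len(labels)


rng = random.Random(0)
h_train, y_train = make_activations(rng, 2000, 8, +1.0)
h_test, y_test = make_activations(rng, 2000, 8, +1.0)    # same distribution
h_shift, y_shift = make_activations(rng, 2000, 8, -1.0)  # shortcut flipped

w, bias = fit_mean_difference_probe(h_train, y_train)
id_acc = probe_accuracy(w, bias, h_test, y_test)
shift_acc = probe_accuracy(w, bias, h_shift, y_shift)
print(f"in-distribution probe accuracy: {id_acc:.3f}")
print(f"shifted probe accuracy:         {shift_acc:.3f}")
```

In this setup the in-distribution score looks like a clean success signal, yet it says nothing about how the representation behaves off-distribution, which is the gap the skeptical note points at. A real test would probe an actual model's activations and use held-out combinations of task modules rather than a hand-built shortcut.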


Sources (7 notes)

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This leaves them vulnerable to perturbation and distribution shift in ways that standard evaluation metrics do not reveal.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) break not at complexity thresholds but at instance-novelty boundaries. They fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds when the model was trained on similar instances.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
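
The note "Do transformers actually learn systematic compositional reasoning?" says errors compound across reasoning steps. A back-of-envelope sketch of that effect follows, under a simplifying assumption that is mine and not the note's: each step succeeds independently with the same probability. `chain_success_probability` is a hypothetical helper, not something from the cited work.

```python
def chain_success_probability(per_step_accuracy, num_steps):
    """Probability that every step of a reasoning chain is correct, assuming
    independent steps with identical accuracy (an illustrative simplification)."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


# Even a high per-step accuracy decays quickly as the chain gets longer.
for steps in (1, 5, 10, 20):
    p = chain_success_probability(0.95, steps)
    print(f"{steps:>2} steps at 95% per-step accuracy -> {p:.3f} chain success")
```

Real reasoning chains need not have independent or equally reliable steps, so this is a rough intuition for why multi-step composition is fragile, not a prediction for any particular model.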

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: Does scaling data automatically produce compositional reasoning, or does it only yield better feature encoding that masks brittleness under distribution shift?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. The library stages a real disagreement:
• Optimistic view: Standard MLPs achieve compositional generalization *from data scale alone* if training distributions cover enough task-module combinations; linear decodability of building blocks signals success (2023–2024).
• Skeptical view: Linear decodability of features can mask absent true composition — models remain brittle under perturbation and distribution shift despite matching metrics (2023).
• Recent probes: Sparse autoencoders (2024) and latent thought vectors (2025) attempt mechanistic grounding; chain-of-thought reasoning studies (2025–2026) question whether reasoning tokens encode functional importance or are post-hoc.
• Constraint hypothesis: Test-time compute scaling and recurrent-depth approaches (2025) may bypass the composition question by substituting search depth for structural reasoning.

Anchor papers (verify; mind their dates):
• arXiv:2301.10884 (2023-01): Break It Down — structural compositionality claim via linear decodability
• arXiv:2305.18654 (2023-05): Faith and Fate — limits of Transformers on compositionality
• arXiv:2508.01191 (2025-08): Chain-of-Thought as Mirage — distribution-lens critique
• arXiv:2603.23004 (2026-03): Reasoning Under Constraints — functional reasoning under pressure

Your task:
(1) RE-TEST THE CORE TENSION. Separate the durable question (does *genuine* compositional structure emerge, or just linearly-separable features?) from perishable limitations. Has mechanistic interpretability (SAEs, activation analysis, 2024–2026) settled whether decodability = real composition? Do recent constraint-reasoning papers (2025–2026) test composition under realistic brittleness conditions? Be precise: cite which papers do or do not probe this under distribution shift.
(2) Surface the strongest *contradiction* among the library's own claims. The 2025–2026 reasoning papers seem to assume composition is needed; do any of them provide evidence that scaling alone fails, or do they assume it and move on?
(3) Propose 2 research questions that assume the regime may have moved: (a) Can you *falsify* linear decodability as a proxy for composition using adversarial or OOD data? (b) Does test-time scaling (latent thought vectors, recurrent depth) dissolve the composition question by replacing it with search, and if so, what does that mean for the original claim?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
