INQUIRING LINE


AI models can nearly match human grammar skills after reading only as much text as a child does growing up.

Which linguistic abilities are learnable from human-sized data exposure?

This explores what language abilities models can pick up from a child-sized diet of data (roughly the 100 million words a human encounters growing up) rather than from internet-scale training.

The corpus has a surprisingly sharp answer for one ability in particular: grammar. Models trained on 100 million words or fewer land within a few percentage points of human performance on grammatical acceptability judgments, which suggests that core syntactic competence (knowing what sounds well-formed) doesn't need oceans of text (see "Can language models learn grammar from child-scale data?"). The interesting twist is that *how* the data was composed and curated mattered more than how much of it there was. So the honest framing isn't "big data teaches grammar" but "a well-chosen small corpus is enough."
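To make "grammatical acceptability judgment" concrete, here is a minimal sketch of one common way such judgments are scored: the model sees a minimal pair (one well-formed sentence, one ill-formed variant) and is counted correct when it assigns the well-formed one a higher probability. The source does not say which benchmark or scoring rule was used, so the minimal-pair setup is an assumption, and the toy bigram model, `CORPUS`, and `PAIRS` below are hypothetical stand-ins for a real language model and test set.

```python
import math
from collections import Counter

# Hypothetical toy training corpus (a stand-in for a ~100M-word corpus).
CORPUS = [
    "the dog barks", "the dogs bark", "the cat sleeps",
    "the cats sleep", "a dog barks", "a cat sleeps",
]

# Hypothetical minimal pairs: (acceptable, unacceptable).
PAIRS = [
    ("the dog barks", "the dog bark"),
    ("the cats sleep", "the cats sleeps"),
]

def tokenize(sentence):
    """Lowercase, split on whitespace, add sentence boundary markers."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]

def train_bigram(sentences):
    """Count context tokens and bigrams; return counts plus vocabulary size."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = tokenize(sentence)
        vocab.update(tokens)
        contexts.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)

def log_prob(sentence, model):
    """Sentence log-probability under an add-one-smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = tokenize(sentence)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )

def minimal_pair_accuracy(pairs, model):
    """Fraction of pairs where the acceptable sentence scores higher."""
    correct = sum(log_prob(good, model) > log_prob(bad, model) for good, bad in pairs)
    return correct / len(pairs)

model = train_bigram(CORPUS)
accuracy = minimal_pair_accuracy(PAIRS, model)
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

Under this framing, "within a few points of human performance" would mean the model's pair accuracy sits a few percentage points below the accuracy humans reach on the same pairs.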

Architecture turns out to be part of the same story. At small scale, the shape of the model changes what it can squeeze from limited data: at the 125M–350M parameter scale, deep-and-thin networks gain 2.7–4.3% accuracy over balanced designs, because stacking layers lets the model compose abstract structure rather than just memorize more surface patterns (see "Does depth matter more than width for tiny language models?"). Read alongside the syntax result, the lesson is that human-scale learnability is as much about the learner's design as about the volume of exposure.
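A rough piece of arithmetic shows what "deep-and-thin versus wide" means at a fixed parameter budget. This sketch assumes a standard transformer block with four attention projections and a non-gated feed-forward layer at 4x expansion, and it ignores embeddings, biases, and normalization weights. The two configurations are illustrative choices, not figures taken from the source.

```python
def block_params(d_model, ffn_mult=4):
    """Approximate weights in one transformer block (assumed standard layout)."""
    attention = 4 * d_model * d_model            # Q, K, V, and output projections
    feed_forward = 2 * ffn_mult * d_model * d_model  # up and down projections
    return attention + feed_forward

def body_params(n_layers, d_model):
    """Approximate weights in the stacked blocks, excluding embeddings."""
    return n_layers * block_params(d_model)

# Illustrative configurations with nearly the same budget (assumptions).
deep_thin = body_params(n_layers=30, d_model=576)
shallow_wide = body_params(n_layers=12, d_model=912)

print(f"deep-and-thin   : {deep_thin:,} parameters")
print(f"shallow-and-wide: {shallow_wide:,} parameters")
print(f"relative gap    : {abs(deep_thin - shallow_wide) / deep_thin:.2%}")
```

Both shapes spend about 120M weights on the block stack, so any accuracy difference between them comes from how the budget is arranged, not from its size.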

But syntax is the easy case, and the corpus quietly warns you not to generalize from it. Higher abilities behave differently. Surprisingly, social and cultural knowledge seems *learnable without embodiment*: GPT-4.5 beat every individual human at judging social appropriateness across 555 scenarios, even though it never lived in a culture (see "Can AI learn social norms better than humans?"). That cuts against the intuition that you need lived experience to absorb norms. Yet all the AI models share identical systematic errors on unwritten norms, hinting that pattern exposure tops out where the rules were never written down.

And the thing that *doesn't* come for free is the link between knowing and doing. Models can state a concept correctly, then fail to apply it, then even recognize their own failure, a pattern that has no human analogue and points to explanation and execution running on disconnected pathways (see "Can LLMs understand concepts they cannot apply?"). So "learnable from human-sized data" splits cleanly: the formal machinery of language (grammar, acceptability) arrives early and cheaply; functional understanding that holds up under application does not.

If you want to push the boundary further, the corpus offers two adjacent angles. One line shows models becoming strong predictors of *human* behavior and decision-making after fine-tuning on psychology data, with language ability bending toward modeling people rather than just producing sentences (see "Can language models learn to model human decision making?"). Another reframes the whole question philosophically: from the outside, humans and LLMs are categorically different systems, but inside shared discourse they draw on the same symbolic substrate, which is why a model trained on text alone can sound so fluently human (see "Do humans and LLMs differ fundamentally or just superficially?"). The takeaway you didn't know you wanted: the abilities that scale down to human-sized data are precisely the ones encoded in the structure of language itself, and the ones that don't are the ones that live in use.

Sources 6 notes

Can language models learn grammar from child-scale data?

Models trained on ≤100 million words performed within a few percentage points of humans on grammatical acceptability tasks, suggesting syntactic competence doesn't require massive datasets. Data composition and curation mattered more than raw volume.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can AI learn social norms better than humans?

GPT-4.5 outperformed every individual human at judging social appropriateness across 555 scenarios, challenging the theory that embodied cultural experience is necessary. However, all AI models share identical systematic errors on unwritten norms.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can language models learn to model human decision making?

LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.

Do humans and LLMs differ fundamentally or just superficially?

Applied Habermas's observer/participant distinction to AI: from outside, humans and LLMs are utterly different; from within shared discourse, both draw on the same symbolic substrate, making the difference structural rather than absolute.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a language-model capability analyst. The question remains open: which linguistic abilities emerge reliably from human-scaled data exposure (≤100M–1B tokens), and which require orders of magnitude more?

What a curated library found — and when (findings span 2023–2026, treat as dated claims):
• Core grammatical acceptability judgments plateau within a few points of human performance at ~100M words; composition and curation matter more than raw volume (~2023).
• Depth-over-width architecture (stacking layers rather than widening) extracts more abstract structure from sub-billion-parameter regimes than flat scaling (~2023).
• Social-norm judgment and cultural appropriateness can be learned from text alone without embodiment; GPT-4.5 exceeded individual humans across hundreds of scenarios (~2025).
• A distinct failure mode: models explain concepts correctly, fail to apply them, and then recognize their own failure, a pattern with no human analogue (~2026).
• Post-completion and test-time learning mechanisms (Titans, Transformer²) enable rapid in-context adaptation (~2025–2026).

Anchor papers (verify; mind their dates):
- arXiv:2310.17591 (Lil-Bevo, 2023) — humanlike training strategies
- arXiv:2508.19004 (AI Models Exceed Individual Human Accuracy, 2025) — social norms
- arXiv:2602.06176 (LLM Reasoning Failures, 2026) — explanation-execution decoupling
- arXiv:2604.27660 (From Context to Skills, 2026) — in-context learning

Your task:
(1) RE-TEST EACH CONSTRAINT. For grammar and syntax: have newer tokenizers, training curricula, or tiny-model optimization (e.g., MobileLLM, Transformer²) since lowered the data floor further, or does 100M words still hold? For social norms: has the claim that embodiment is unnecessary held up against adversarial or out-of-distribution cultural tests? For the explanation-execution gap: has it widened, narrowed, or been bridged by any post-completion or reasoning-chain method? Separate the durable question from the perishable limitation; cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially any showing human-scale data is *not* enough for a formerly "easy" ability, or showing *unexpected* learnability for a "hard" one.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Do test-time adaptation methods (Titans, post-completion learning) effectively *reduce* the pre-training data required for reasoning?" or "Can small curated corpora of *failures* teach models to self-repair explanation-execution decoupling?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
