How much do structural inductive biases matter compared to training data volume?
This explores whether the *shape* of a model — its architecture and built-in structural assumptions — buys you more than simply piling on more training tokens, and where each one actually wins.
Structural inductive biases are the architectural choices baked into a model before it sees data, and the corpus suggests that structure and data trade against each other, with structure often the cheaper lever. The most direct evidence: pre-pretraining a 1B model on hierarchical formal languages reaches equivalent loss and *better* syntactic generalization using 33% fewer natural-language tokens, and the attention heads it develops on those formal structures remain critical for syntactic performance on natural language (Can formal language pretraining make language models more efficient?). That is a structural prior substituting for data. In the same spirit, at 125M–350M scale a deep-and-thin architecture beats a balanced one by 2.7–4.3% accuracy at identical parameter counts, because depth lets the model compose abstract concepts across layers rather than spreading parameters across width (Does depth matter more than width for tiny language models?). Both findings cut against the pure-scaling intuition that volume dominates.
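To make "identical parameter counts" concrete, here is a rough sketch using the common approximation of about 12 x d_model^2 weights per transformer block (attention projections plus a 4x MLP, ignoring embeddings, biases, and norms). The two configurations are hypothetical, chosen only so the totals match; they are not the MobileLLM configurations.

```python
def block_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of one transformer block (no biases, no norms)."""
    attention = 4 * d_model * d_model        # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model  # up-projection and down-projection
    return attention + mlp


def model_params(n_layers: int, d_model: int) -> int:
    """Approximate non-embedding parameter count of the whole stack."""
    return n_layers * block_params(d_model)


# Hypothetical configurations (assumptions for illustration only).
balanced = model_params(n_layers=12, d_model=768)
deep_thin = model_params(n_layers=48, d_model=384)

print(f"balanced : {balanced:,} parameters")
print(f"deep-thin: {deep_thin:,} parameters")
```

Halving the width cuts each block to a quarter of its size, so the same budget buys four times as many layers. The comparison in the text is about how that fixed budget is arranged, not about its size.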
But the corpus also reframes the question: it is rarely about *how much* data, but *which* data and *how* you present it. Gradient-similarity selection trains on just 5% of an instruction set and consistently beats training on the whole thing, because the discarded 95% includes examples that actively pull reasoning strategy away from the target task (Can we train better models on less data?). Here, more data was a liability, not an asset. That makes data 'volume' a misleading axis; composition is a kind of inductive bias you impose through curation rather than architecture.
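The selection idea can be sketched in a few lines. This is a minimal illustration of ranking examples by gradient similarity, not the LESS implementation: it assumes per-example gradient features have already been computed and reduced to short vectors, and it uses cosine similarity as the scoring function, which is an assumption here. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_top_fraction(train_feats, target_feat, fraction=0.05):
    """Return indices of the training examples best aligned with the target."""
    scores = [cosine(f, target_feat) for f in train_feats]
    k = max(1, math.ceil(fraction * len(train_feats)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy gradient features, made up for illustration.
target = [1.0, 0.0]
train = [[0.9, 0.1], [-1.0, 0.0], [0.0, 1.0], [1.0, 0.05]]

chosen = select_top_fraction(train, target, fraction=0.5)
print(chosen)
```

In the toy data, the second example points directly away from the target direction. It stands in for the kind of example the text describes as pulling the model away from the target task, and a similarity ranking drops it first.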
There's a deeper point about where inductive structure even comes from. Networks aren't blank slates that only reflect data: pruning experiments show they spontaneously implement compositional subroutines in isolated subnetworks, and pretraining makes this modular structure *more* consistent across architectures and domains (Do neural networks naturally learn modular compositional structure?). So some 'structure' is emergent, sharpened by exposure. Yet exposure also writes biases you may not want: representational density is *learned* from data familiarity, with models developing dense activations for familiar inputs and defaulting to sparse ones for the unfamiliar (Is representational sparsity learned or intrinsic to neural networks?). And those learned priors can become a trap. When parametric knowledge from training is strong enough, models override the information sitting right in their context, and textual prompting alone cannot fix it; overriding the prior takes causal intervention in the representations (Why do language models ignore information in their context?).
The surprising twist, for a reader expecting an architecture-vs-data cage match, is that *training regime* often dwarfs both at inference time. Reasoning models persistently outperform non-reasoning ones no matter how much inference compute you give the non-reasoning model, because training instilled a protocol that makes extra tokens productive. The gap is about how capability was installed, not raw capability (Can non-reasoning models catch up with more compute?). And if you want efficiency without a bigger model or more data, you can fold architectural variables (hidden size, MLP-to-attention ratio, GQA configuration) directly into scaling laws and get up to 42% more throughput and up to 2.1% *higher* accuracy than LLaMA-3.2 at the same training budget (Can architecture choices improve inference efficiency without sacrificing accuracy?).
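One way to see why a variable like GQA belongs in an inference-aware analysis is the key-value cache, whose size grows with the number of KV heads. The sketch below is standard cache arithmetic, not the augmented scaling law from the source. The model dimensions are hypothetical, and 2 bytes per value assumes fp16 or bf16 storage.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # The leading factor of 2 covers one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model: 32 layers, 32 query heads of dimension 128, 4096-token context.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

print(f"one KV head per query head: {mha / 2**20:.0f} MiB")
print(f"8 shared KV heads (GQA)   : {gqa / 2**20:.0f} MiB")
```

Sharing each KV head across four query heads shrinks the cache fourfold in this configuration. That reduces the memory read per decoded token, which is one plausible route by which such a choice affects throughput.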
The through-line: data volume is the lever everyone reaches for first, but the corpus repeatedly shows structural choices — depth over width, formal-language priors, curated subsets, reasoning-protocol training, architecture-aware scaling — delivering equal or better results at a fraction of the data or compute. Structure isn't a tiebreaker against data; it's frequently the more efficient place to spend.
Sources (8 notes)
Pre-pretraining 1B models on hierarchical formal languages achieves equivalent loss and better syntactic generalization using 33% fewer natural language tokens. The mechanism persists: attention heads trained on formal languages remain critical for syntactic performance on natural language.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
LESS uses low-rank gradient features to select instruction data most similar to target capabilities, and training on the selected 5% consistently outperforms full dataset training. The improvement occurs because mixed datasets contain examples that actively hinder specific skills by shifting reasoning strategy away from task requirements.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.