What makes data augmentation an implicit form of contraction learning?
This explores the idea that data augmentation — training on transformed variants of the same example — secretly teaches a model to *contract* many surface forms down to one invariant representation, rather than just adding more data.
Read literally, this question asks whether feeding a model many variations of the same thing quietly trains it to collapse those variations into a single internal representation — to learn what to *ignore* as much as what to encode. That's a sharp framing, but I should be upfront: this corpus doesn't contain notes on data augmentation or 'contraction learning' by name, and nothing here directly tests the claim. What follows are the nearest doorways, offered as adjacent territory rather than a real answer — treat them as leads, not confirmation.
The closest conceptual cousin is the finding that RL post-training doesn't broaden a model so much as *contract* it: across controlled runs, training converges on a single dominant format inherited from pretraining and actively suppresses the alternatives within the first epoch (see "Does RL training collapse format diversity in pretrained models?"). That's a contraction dynamic, many viable forms narrowing to one, which is the flip side of what augmentation is supposed to encourage. If you're curious whether 'collapsing variation' is a feature or a failure, that note is where the corpus actually has evidence.
A second thread worth pulling: the phase transition from memorization to generalization. GPT-family models memorize until they hit a fixed capacity (~3.6 bits per parameter), and only when that fills does 'grokking', the shift to genuine generalization, begin (see "When do language models stop memorizing and start generalizing?"). Augmentation is often justified as a way to push a model past rote storage toward invariants; this note gives you the mechanism such an argument would have to lean on, and a way to think about *when* contraction-toward-generalization can even start.
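To put the ~3.6 bits-per-parameter figure in concrete units, here is a back-of-envelope sketch. It assumes the figure scales linearly with parameter count; the model sizes and the helper name `memorization_capacity_bytes` are hypothetical, chosen only for illustration, and nothing here comes from the note beyond the 3.6 constant.

```python
# Approximate memorization capacity quoted in the note (bits per parameter).
BITS_PER_PARAM = 3.6


def memorization_capacity_bytes(n_params: int) -> float:
    """Rough capacity in bytes, assuming capacity is linear in parameter count."""
    return n_params * BITS_PER_PARAM / 8


# Hypothetical model sizes, not taken from the corpus.
for n_params in (1_000_000, 100_000_000):
    megabytes = memorization_capacity_bytes(n_params) / 1e6
    print(f"{n_params:>11,d} params -> about {megabytes:.2f} MB of rote storage")
```

Under that assumption, a one-million-parameter model tops out at roughly 0.45 MB of memorized content, which gives a sense of how small a dataset has to be before the model can simply store it.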
Finally, two notes speak to whether learned invariants actually transfer or stay brittle. Length-generalization can transfer across related tasks because models reuse the same attention heads as shared scaffolding (see "Can length generalization transfer between different related tasks?"), a case where the model genuinely contracts toward a reusable abstraction. But the cautionary counterweight is that transformers often only *appear* to generalize: they reduce compositional reasoning to memorized subgraph matching and shatter on novel combinations (see "Do transformers actually learn systematic compositional reasoning?"). The open question your question gestures at (does augmentation build real invariance or just a denser memorization table?) is exactly the tension between those two findings. The corpus can't close it, but it can tell you which way each piece of evidence leans.
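One way to make "contraction" measurable, offered as an illustrative assumption rather than anything the corpus tests: compare how far augmented views of the same example spread apart in representation space against how far different examples sit from each other. The sketch below uses toy string "embeddings"; `augment`, `surface_embed`, `invariant_embed`, and `contraction_ratio` are all hypothetical names invented for this example, not part of any library or of the notes.

```python
import random
import string


def augment(text, rng):
    """Hypothetical augmentation: randomly flip letter case and pad with spaces."""
    flipped = "".join(c.upper() if rng.random() < 0.5 else c.lower() for c in text)
    return " " * rng.randint(0, 3) + flipped + " " * rng.randint(0, 3)


def surface_embed(text):
    """Toy embedding that is sensitive to case and padding."""
    return [float(text.count(ch)) for ch in string.ascii_letters + " "]


def invariant_embed(text):
    """Toy embedding that ignores case and spaces (a fully 'contracted' encoder)."""
    cleaned = text.lower().replace(" ", "")
    return [float(cleaned.count(ch)) for ch in string.ascii_lowercase]


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contraction_ratio(embed, examples, n_views=8, seed=0):
    """Within-example spread divided by between-example spread (lower = more contracted)."""
    rng = random.Random(seed)
    centroids, within = [], 0.0
    for text in examples:
        views = [embed(augment(text, rng)) for _ in range(n_views)]
        c = centroid(views)
        centroids.append(c)
        within += sum(sq_dist(v, c) for v in views) / n_views
    within /= len(examples)
    g = centroid(centroids)
    between = sum(sq_dist(c, g) for c in centroids) / len(centroids)
    # Guard against identical examples, where the ratio is undefined.
    return within / between if between else float("inf")


examples = ["cat", "horse", "zebra"]
print(contraction_ratio(invariant_embed, examples))  # 0.0: every view maps to one vector
print(contraction_ratio(surface_embed, examples))    # positive: views stay spread apart
```

A ratio near zero on training augmentations says nothing by itself about real invariance; the same measurement would have to be repeated on held-out transformations to separate a learned invariant from a denser memorization table.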
Sources (4 notes)
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.
Models trained jointly on related tasks reuse the same attention heads to handle length generalization, allowing shorter tasks to extrapolate beyond their training length. Pretrained models already contain this reusable computational scaffolding.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.