Can trained models encode programs more complex than their data-generating process?
This explores whether learning can install a program richer than the process that produced the training data — or whether a trained model is capped at re-encoding the patterns it was fed.
This reads the question as: can training make a model *more* than its data — encoding computation more complex than the process that generated what it learned from? The corpus splits sharply into what's possible in principle versus what training actually delivers, and the gap between them is the real story here.
In principle, the ceiling is extremely high. A single finite-size transformer provably exists that can compute *any* computable function given the right prompt — the architecture is Turing complete, so the program a model could encode is unbounded by anything in its data (Can a single transformer become universally programmable through prompts?). But that same result carries the deflating caveat: standard training rarely produces models that learn to implement arbitrary programs this way. The capacity to encode complex programs and the training process that would *install* them are two different things.
What training actually installs, across multiple notes, looks more like a re-encoding of the data-generating distribution than a richer program on top of it. RL fine-tuning sharpens memorization rather than installing reasoning procedures — GRPO-trained models collapse on out-of-distribution variants (Do fine-tuned language models actually learn optimization procedures?). Models asked to run iterative numerical methods don't execute the procedure at all; they pattern-match memorized templates and emit plausible-but-wrong values (Do large language models actually perform iterative optimization?). Instruction tuning turns out to transfer knowledge of the *output space*, not task understanding — semantically empty instructions perform about as well as correct ones (Does instruction tuning teach task understanding or output format?). And RL post-training tends to amplify one already-present pretraining format while suppressing the others, rather than synthesizing something new (Does RL training collapse format diversity in pretrained models?). Even the reasoning traces that look like complex programs are stylistic mimicry: invalid logical steps perform nearly as well as valid ones (Do reasoning traces show how models actually think?).
There's a formal reason the answer trends toward 'not on its own.' Self-improvement is bounded by the generation-verification gap — a model can't reliably encode a fix more complex than what it already has unless something *external* validates and enforces it (What stops large language models from improving themselves?). That's the precise statement of why a model can't bootstrap past its data-generating process through metacognition alone. You can even predict where it breaks: framing the model as an autoregressive probability machine correctly forecasts that low-probability target tasks stay hard no matter how logically simple they are (Can we predict where language models will fail?).
The interesting wrinkle — and where you might end up somewhere you didn't expect — is that the complexity may be *present but hidden* rather than absent. Transformers trained with hidden chain-of-thought compute the correct answer in early layers and then actively overwrite it to produce format-compliant filler; the richer computation is fully recoverable from lower-ranked predictions (Do transformers hide reasoning before producing filler tokens?). And techniques like post-completion learning exploit unused sequence space to internalize self-evaluation the data never explicitly taught (Can models learn to evaluate their own work during training?). So the honest synthesis is: the architecture can hold programs far more complex than the data-generating process, but ordinary training mostly compresses the data distribution instead. Going beyond it seems to require either an external verifier to break the generation-verification ceiling, or deliberate methods to surface the latent computation the model is otherwise trained to suppress.
Sources 10 notes
Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
Models trained on semantically empty or deliberately incorrect instructions achieve comparable performance to those trained on full correct instructions, achieving 43% vs random baseline 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.