INQUIRING LINE

Can universal function approximators be expensive to learn in practice?

This explores the gap between what a model can represent in theory (universal approximation — the proof that a network *could* express any function) and what it can actually learn from data at reasonable cost, which the corpus treats as two very different things.


This explores whether 'universal function approximator' is a theoretical badge that quietly stops mattering in practice — and the corpus answers with a resounding yes. The cleanest illustration is the finding that MLPs, despite being provably expressive enough to represent any function, fail to recover something as simple as a dot product similarity in collaborative filtering Can MLPs learn to match dot product similarity in practice?. They need enormous capacity and data to even approach what a tuned geometric similarity does for free — the punchline being that *inductive bias beats expressiveness*. Universal approximation tells you a function exists in the model's reach; it says nothing about whether gradient descent will find it before you run out of data and patience.

The same wedge between 'can represent' and 'can learn' shows up at the frontier. There's a proof that a single finite-size transformer can compute *any* computable function given the right prompt — prompts are literally Turing-complete programs Can a single transformer become universally programmable through prompts?. But the same work notes that standard training almost never produces a model that has actually learned to run arbitrary programs. The capability is latent and astronomically expensive to elicit; it doesn't fall out of ordinary optimization. So 'universal' is true and nearly useless as a guide to what you'll get.

What you get instead, repeatedly, is memorization wearing the costume of computation. LLMs asked to run iterative numerical methods don't iterate — they recognize a problem as template-similar to something seen before and emit plausible-but-wrong answers, a failure that doesn't go away with scale Do large language models actually perform iterative optimization?. Even RL fine-tuning, which you'd hope installs genuine procedures, mostly sharpens template-matching: GRPO-trained models collapse on slightly out-of-distribution variants of the very tasks they ace in-distribution Do fine-tuned language models actually learn optimization procedures?. And on genuine constrained optimization, models flatline around 55–60% regardless of parameter count or training regime — a ceiling, not a scaling gap Do larger language models solve constrained optimization better?. A universal approximator that plateaus is the whole paradox in one sentence.

The deeper reason these costs are structural, not incidental, is that the architecture has a grain. Treating an LLM as an autoregressive probability machine lets researchers *predict in advance* which logically-trivial tasks (counting letters, reciting the alphabet backwards) will be hard, because their target outputs are low-probability under the training distribution Can we predict where language models will fail?. Expressiveness is uniform across functions; learnability is not. The functions that are cheap to learn are the ones the architecture and data are already biased toward, and the rest can be arbitrarily expensive — sometimes practically unreachable.

The interesting flip side: when learning *is* affordable, it's usually because the method respects that grain rather than fighting it. Transformers achieve exponential length generalization on addition not by brute-forcing a universal solution but by iteratively self-training on their own correct outputs, bootstrapping into harder cases Can transformers improve exponentially by learning from their own correct solutions?. So the take-home isn't 'approximators are doomed' — it's that the cost of learning a function lives almost entirely in the mismatch between the function and the model's built-in biases, which is exactly what the universal-approximation theorem promises to ignore.


Sources 7 notes

Can MLPs learn to match dot product similarity in practice?

Rendle et al. show that carefully tuned dot products substantially outperform learned MLP similarities in collaborative filtering. MLPs require excessive capacity and data to match simple geometric similarity, and they cannot be efficiently retrieved at scale—proving inductive bias matters more than expressiveness.

Can a single transformer become universally programmable through prompts?

Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Can transformers improve exponentially by learning from their own correct solutions?

Standard transformers generalize from 10-digit to 100-digit addition by repeatedly generating solutions, filtering for correctness, and retraining—showing exponential (not linear) out-of-distribution improvement across rounds without saturation.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining whether universal function approximation remains a useful predictor of learning cost in modern LLMs and neural networks (2024–present). The question: *Can architectures provably capable of representing any function still be expensive to learn in practice?*

What a curated library found — and when (dated claims, not current truth):
These findings span 2020–2026 and show a consistent tension:

• MLPs provably universal but fail at dot-product similarity in collaborative filtering; memorization beats expressiveness (2020).
• Transformers can compute any computable function via prompting (Turing-complete), yet standard training almost never elicits this; capability is latent and astronomically expensive (2024-11).
• LLMs collapse on out-of-distribution variants of tasks they master in-distribution, even after RL fine-tuning (GRPO); template-matching persists (2025-04).
• Models plateau at 55–60% on constrained optimization regardless of scale or training regime (2026-03).
• Iterative numerical methods are not executed; instead models memorize problem templates and fail outside distribution (2024).
• Self-improving transformers overcome length generalization by respecting the autoregressive grain, bootstrapping via correct outputs rather than fighting architecture bias (2025-02).

Anchor papers (verify; mind their dates):
• arXiv:2005.09683 (2020) – MLPs vs. matrix factorization
• arXiv:2411.01992 (2024-11) – Turing completeness of prompting
• arXiv:2504.07912 (2025-04) – RL post-training amplifies pretraining behaviors
• arXiv:2502.01612 (2025-02) – Self-improving transformers and length generalization

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, probe whether architectural innovations (e.g., state-space models, mixture-of-experts, recurrent layers), new training regimes (e.g., online RL, curriculum learning, multi-task pretraining), or evaluation harnesses have *relaxed* the mismatch between expressiveness and learnability. Separate the durable claim ('inductive bias, not capacity, governs learning cost') from the perishable limit ('transformers cannot do iterative numerical methods'). Cite what relaxed it.
(2) **Surface strongest CONTRADICTING or SUPERSEDING work from the last 6 months.** Does any recent paper demonstrate a universal approximator that learns *cheaply* despite architectural mismatch? Flag it.
(3) **Propose 2 research questions that ASSUME the regime may have moved:** (a) Can meta-learning or few-shot tuning now decouple learning cost from pretraining bias? (b) Do in-context learning and retrieval-augmented generation effectively *bypass* the expressiveness–learnability gap?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines