INQUIRING LINE

Can spline-based activations replace MLPs in transformer architectures?

This explores whether Kolmogorov-Arnold Networks — which swap fixed MLP activations for learnable spline functions — could stand in for the MLP blocks inside transformers, and what the corpus says about replacing standard transformer components more generally.


This explores whether spline-based layers (the headline idea behind Kolmogorov-Arnold Networks) could replace the MLP blocks that sit between attention layers in transformers. The corpus has exactly one paper on the spline idea itself, and it's worth being upfront: it argues the case for replacing MLPs in general, not specifically inside transformers. Kolmogorov-Arnold Networks put learnable univariate splines on the network's edges instead of fixed activations and linear weights, and the reported result is smaller models that reach better accuracy, follow faster neural scaling laws, and stay interpretable enough to recover actual mathematical laws (see "Can learnable spline activations beat fixed MLP designs?"). So the building block clearly works on its own terms. Whether it survives being dropped into a transformer at scale is a question the collection doesn't directly answer: a real gap rather than a hidden 'yes.'
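To make "learnable univariate splines on the network's edges" concrete, here is a minimal sketch of a KAN-style layer in plain Python. It is an illustration under stated assumptions, not the paper's implementation: each edge uses a piecewise-linear spline (simple hat functions) where the original work uses higher-order B-splines, the SiLU base term follows the general form the KAN paper describes, there is no training loop, and the names `SplineEdge` and `KANLayer` are hypothetical.

```python
import math
import random


def silu(x):
    """SiLU base activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def hat_basis(x, grid):
    """Piecewise-linear basis values at x, one per knot (uniform grid assumed)."""
    h = grid[1] - grid[0]
    return [max(0.0, 1.0 - abs(x - g) / h) for g in grid]


class SplineEdge:
    """One learnable 1-D function: phi(x) = w_base * silu(x) + sum_i coef[i] * B_i(x)."""

    def __init__(self, grid, rng):
        self.grid = grid
        self.coef = [rng.gauss(0.0, 0.1) for _ in grid]  # learnable spline coefficients
        self.w_base = 1.0                                # learnable base weight

    def __call__(self, x):
        # Clamp so inputs outside the grid reuse the boundary spline value.
        x_c = min(max(x, self.grid[0]), self.grid[-1])
        spline = sum(c * b for c, b in zip(self.coef, hat_basis(x_c, self.grid)))
        return self.w_base * silu(x) + spline


class KANLayer:
    """Each output is a plain sum of per-edge functions; there is no weight matrix
    and no fixed activation applied at the node."""

    def __init__(self, n_in, n_out, grid_size=5, lo=-1.0, hi=1.0, seed=0):
        rng = random.Random(seed)
        step = (hi - lo) / grid_size
        grid = [lo + i * step for i in range(grid_size + 1)]
        self.edges = [[SplineEdge(grid, rng) for _ in range(n_in)]
                      for _ in range(n_out)]

    def __call__(self, xs):
        return [sum(edge(x) for edge, x in zip(row, xs)) for row in self.edges]

    def num_params(self):
        return sum(len(e.coef) + 1 for row in self.edges for e in row)


layer = KANLayer(n_in=3, n_out=2)
outputs = layer([0.1, -0.4, 0.9])   # two output values
param_count = layer.num_params()    # 2 * 3 edges * (6 coefficients + 1 base weight) = 42
```

The design point worth noticing is that every edge carries its own small function, so a layer of the same width holds more parameters than a linear layer would. The source's claim is that fewer and narrower layers are needed, not that each layer is cheaper.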

What the corpus does have is a rich picture of what happens when people try to swap out transformer parts, and that's the more useful lateral question. The pattern across these attempts is that each alternative trades one capability for another. Spiking-plus-linear attention can convert an existing checkpoint into a far more efficient model with under 2% retraining data, but it's swapping the attention mechanism, not the MLP, and it's chasing hardware efficiency rather than expressiveness (see "Can spiking neurons make transformers efficient on any hardware?"). State-space models replace attention with a fixed-size recurrent state and pay for it: they provably can't copy or retrieve long strings the way even a two-layer transformer can (see "Can state-space models match transformers at copying and retrieval?"). The lesson for splines is that 'better on benchmark X' rarely means 'better everywhere': replacements tend to reveal a hidden cost somewhere.
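The cost of a fixed-size state can be made tangible with a back-of-the-envelope count. The sketch below is a pigeonhole intuition, not the proof from the cited work: it idealizes the recurrent state as an exact number of bits, and the function names are hypothetical. To reproduce any string of a given length, the state must be able to tell all such strings apart.

```python
def min_state_bits(length, vocab_size):
    """Bits needed to distinguish every string of `length` tokens,
    i.e. ceil(log2(vocab_size ** length)), computed with exact integers."""
    return (vocab_size ** length - 1).bit_length()


def max_copyable_length(state_bits, vocab_size):
    """Longest length n such that vocab_size ** n <= 2 ** state_bits."""
    n = 0
    while vocab_size ** (n + 1) <= 2 ** state_bits:
        n += 1
    return n


bits_for_four_bytes = min_state_bits(4, 256)        # 32
longest_for_16_bits = max_copyable_length(16, 256)  # 2
```

Under this idealization the copyable length grows only linearly with state size, whereas an attention-based model can look back at the input itself instead of compressing it.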

There's also a deeper reason MLP blocks might be load-bearing in ways a spline swap would have to respect. Pruning studies show neural networks naturally carve compositional tasks into isolated modular subnetworks, and pretraining makes that structure more reliable (see "Do neural networks naturally learn modular compositional structure?"). Other work finds the transformer's residual stream acts less like storage and more like a continuous flow of knowledge through those layers (see "Do transformer models store knowledge or generate it continuously?"). A spline-based block wouldn't just need to match an MLP's accuracy; it would need to host the same kind of modular, flowing computation the rest of the architecture has learned to rely on. KAN's built-in interpretability is intriguing precisely here: if its splines made that internal structure more legible, that could be the real win over raw accuracy.

The honest bottom line: splines have beaten MLPs in standalone settings, and the field is clearly willing to replace transformer internals when there's a payoff — but the corpus doesn't contain a transformer that actually runs on spline blocks at scale. The interesting open question the collection hands you is which property you'd be optimizing for if you tried: efficiency (where spiking and linear attention compete), raw capability (where transformers keep winning at copying and retrieval), or interpretability — which is the one dimension where the spline approach has a genuine, distinctive edge over the MLP it would replace.


Sources (5 notes)

Can learnable spline activations beat fixed MLP designs?

Kolmogorov-Arnold Networks replace MLPs' fixed activations and linear weights with learnable univariate splines on edges, achieving better accuracy with smaller models, faster neural scaling laws, and built-in interpretability for discovering mathematical laws.

Can spiking neurons make transformers efficient on any hardware?

SpikingBrain successfully adapted Qwen2.5-7B using under 2% retraining data by combining linear/hybrid-linear attention with adaptive spiking neurons, achieving transformer-comparable performance with near-linear long-sequence complexity on non-NVIDIA hardware.

Can state-space models match transformers at copying and retrieval?

Two-layer transformers can copy exponentially long strings while state-space models are fundamentally limited by their fixed-size latent state. Empirically, transformers dramatically outperform SSMs at copying and context retrieval in both synthetic and pretrained settings.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether spline-based activations (especially Kolmogorov-Arnold Networks) can replace MLP blocks in transformer architectures—a question a curated library examined across 2023–2025 but never directly answered at scale.

What a curated library found — and when (dated claims, not current truth):
• KAN beats MLPs on standalone benchmarks with smaller models and faster scaling; interpretability is built-in (~2024-04).
• State-space models provably fail at copying and long-sequence retrieval tasks that transformers handle; replacements tend to trade one capability for another (~2024-02).
• Neural networks naturally decompose compositional tasks into modular subnetworks; pretraining reinforces this structure (~2023-01).
• Transformer residual streams function as continuous knowledge flow, not static storage; internal structure is task-dependent (~2024-04).
• No transformer has been evaluated with spline blocks at scale; the standalone KAN success doesn't transfer directly to the architecture question.

Anchor papers (verify; mind their dates):
• arXiv:2404.19756 (KAN, ~2024-04)
• arXiv:2402.01032 (Transformers beat SSMs at copying, ~2024-02)
• arXiv:2301.10884 (Compositionality and modularity, ~2023-01)
• arXiv ID to be verified (Transformer internals, ~2024-04); the ID listed here, 2404.19756, is the KAN paper above

Your task:
(1) RE-TEST EACH CONSTRAINT. For KAN's standalone wins: have newer models, training methods (curriculum, scaling laws), or evaluation harnesses since shown splines scale competitively in *transformer blocks*, or do the gains vanish at depth/width? For the modularity finding: does KAN's spline structure *preserve* or *disrupt* the learned modular decomposition that pretraining creates? For residual-stream flow: do spline blocks' nonlinearity and fewer parameters (KAN's advantage) help or hurt that flow? Separate the durable question (does spline expressiveness match transformer needs?) from the perishable constraint (standalone KAN ≠ transformer KAN).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has anyone published transformer-scale KAN experiments, or newer activations (gating, mixture-of-experts, learnable basis functions) that compete with or subsume splines?
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If spline MLPs do scale, what interpretability gains actually flow through to end-to-end model behavior, or do they get obscured by attention and residual mixing? (b) Could splines replace *only* feed-forward blocks while attention stays standard, or is the whole-architecture coupling too tight?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
