Can learnable spline activations beat fixed MLP designs?
What if neural networks moved nonlinearity from fixed node activations to learnable functions on edges? This note explores whether such a structural redesign could improve accuracy, interpretability, and scaling compared to standard MLPs.
MLPs, with fixed activations on nodes and linear weights on edges, are the default nonlinear approximator and account for the bulk of a transformer's non-embedding parameters, yet they are hard to interpret. Inspired by the Kolmogorov-Arnold representation theorem, KANs invert the design: they have no linear weight matrices and no fixed node activations. Instead, every weight is replaced by a learnable univariate function (parameterized as a spline) on an edge, and each node simply sums its incoming edge outputs. The paper makes three claims for this seemingly small change: much smaller KANs match or beat much larger MLPs on data fitting and PDE solving; KANs obey faster neural scaling laws; and they are interpretable, in that they can be visualized and can act as "collaborators" helping scientists rediscover mathematical and physical laws.
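To make the node-versus-edge contrast concrete, here is a minimal NumPy sketch of a single layer of each kind. It is an illustration under stated assumptions, not the paper's implementation: the names `ToyKANLayer`, `ToyMLPLayer`, `hat_basis`, and the `grid_size`, `lo`, `hi` parameters are hypothetical, and piecewise-linear "hat" functions on a fixed uniform grid stand in for whatever smoother spline parameterization the KAN paper actually uses. Only the forward pass is shown; training is omitted.

```python
import numpy as np

def hat_basis(x, grid):
    """Piecewise-linear 'hat' basis values, shape (batch, in_dim, len(grid))."""
    h = grid[1] - grid[0]                  # assumes a uniform grid
    x = np.clip(x, grid[0], grid[-1])      # keep inputs inside the grid range
    return np.maximum(0.0, 1.0 - np.abs(x[..., None] - grid) / h)

class ToyKANLayer:
    """Hypothetical KAN-style layer: one learnable 1-D function per edge."""
    def __init__(self, in_dim, out_dim, grid_size=8, lo=-1.0, hi=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.grid = np.linspace(lo, hi, grid_size)
        # coef[j, i, :] holds the knot values of the function on edge i -> j.
        self.coef = rng.normal(scale=0.1, size=(out_dim, in_dim, grid_size))

    def edge_outputs(self, x):
        basis = hat_basis(x, self.grid)                     # (batch, in, G)
        return np.einsum("big,oig->boi", basis, self.coef)  # (batch, out, in)

    def __call__(self, x):
        # The node only sums its incoming edges; no activation is applied here.
        return self.edge_outputs(x).sum(axis=-1)

class ToyMLPLayer:
    """Standard layer for contrast: linear edges, fixed activation on nodes."""
    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(out_dim, in_dim))
        self.b = np.zeros(out_dim)

    def __call__(self, x):
        return np.tanh(x @ self.W.T + self.b)

x = np.random.default_rng(1).uniform(-1.0, 1.0, size=(4, 3))
kan = ToyKANLayer(in_dim=3, out_dim=2)
mlp = ToyMLPLayer(in_dim=3, out_dim=2)
print(kan(x).shape, mlp(x).shape)  # both (4, 2)
```

In this sketch, the function on edge `i -> j` is fully described by `kan.coef[j, i]` (its values at the grid knots), so each edge can be plotted on its own. That is the sense in which such a layer is inspectable by construction, although the sketch says nothing about how well this holds in trained, deeper networks.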
The keeper is the architectural bet: moving nonlinearity from nodes to learnable edge functions trades the MLP's opacity for a structure that can be inspected and that scales better. That makes KANs a genuine alternative to the MLP monoculture, at least in science-adjacent regimes (the paper is candid that deep-KAN theory is still thin).
This sits in the vault's architecture/interpretability thread as a structural alternative. It rhymes with the inductive-bias-over-capacity lesson of "Why does dot product beat MLP-based similarity in practice?" (the right structural prior beats raw MLP capacity) and, set against post-hoc methods such as those in "Can dictionary learning scale to production language models?", offers interpretability by construction.
Related concepts in this collection 2
- Why does dot product beat MLP-based similarity in practice?
  Neural Collaborative Filtering theory suggests MLPs should outperform dot products as universal approximators. But what explains the empirical gap, and what role do data scale and deployment constraints play?
  both argue the right structural prior beats generic MLP capacity
- Can dictionary learning scale to production language models?
  Sparse autoencoders recovered interpretable features from toy models, but scaling to real production systems like Claude remains uncertain. This matters because interpretability at scale is foundational for AI safety work.
  interpretability-by-construction (KAN) vs post-hoc feature extraction (SAEs)
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- KAN: Kolmogorov-Arnold Networks
- Break It Down: Evidence for Structural Compositionality in Neural Networks
- The Vanishing Gradient Problem for Stiff Neural Differential Equations
- Neural Collaborative Filtering vs. Matrix Factorization Revisited
- Scaling can lead to compositional generalization
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?
- Titans: Learning to Memorize at Test Time
- Weight-sparse transformers have interpretable circuits
Original note title
Kolmogorov-Arnold Networks put learnable spline activations on edges and beat MLPs on accuracy interpretability and scaling