INQUIRING LINE

Do KANs maintain their advantages in deep architectures and large-scale training?

This asks whether Kolmogorov-Arnold Networks keep the edge they show in small demos once you stack them deep and train them at scale — but the collection has no notes on KANs specifically, so what follows is the adjacent question the corpus *can* speak to: when does an architecture's small-scale advantage survive depth and size, and when does it evaporate?


Straight answer first: nothing in this library studies KANs directly, so I can't tell you whether their function-approximation advantage holds up deep and large — that's a gap worth flagging rather than papering over. What the corpus does have is a recurring story about *which* architectural advantages scale and which are artifacts of being small, and that story is the useful thing to take away.

The collection's clearest signal is that depth and scale don't reward architectures uniformly; they reward the ones whose mechanism is *compositional*. The note "Does depth matter more than width for tiny language models?" finds that for sub-billion-parameter models, going deep-and-thin beats going wide, because layers compose abstract concepts rather than just adding capacity. The advantage isn't the depth per se; it's that depth lets a good primitive stack. That reframes your KAN question: a learnable-activation network will likely keep its edge at depth only if its spline-based units compose cleanly layer over layer, and will likely lose it if the gains came from overfitting a shallow function.
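To make "spline-based units composing layer over layer" concrete, here is a minimal sketch. It is not taken from any note in the collection. It assumes the usual KAN picture, in which every input-to-output edge carries its own learnable one-dimensional function and each output simply sums its incoming edges. The names `GRID`, `edge_fn`, `KANStyleLayer`, and `forward` are hypothetical. Piecewise-linear interpolation stands in for the smoother splines a real KAN would use, the knot values are random rather than trained, and inputs outside the grid are clamped as a simplification.

```python
import random
from bisect import bisect_right

# Assumed fixed knot positions, shared by every edge (illustrative choice).
GRID = [-1.0, -0.5, 0.0, 0.5, 1.0]


def edge_fn(values, x):
    """Piecewise-linear function through (GRID[k], values[k]); clamps outside the grid."""
    if x <= GRID[0]:
        return values[0]
    if x >= GRID[-1]:
        return values[-1]
    k = bisect_right(GRID, x) - 1          # index of the knot at or just below x
    t = (x - GRID[k]) / (GRID[k + 1] - GRID[k])
    return (1 - t) * values[k] + t * values[k + 1]


class KANStyleLayer:
    """Hypothetical layer: one learnable 1-D function per (input, output) edge."""

    def __init__(self, n_in, n_out, rng):
        # coeffs[j][i] holds the knot values of the function on edge i -> j.
        self.coeffs = [
            [[rng.uniform(-1.0, 1.0) for _ in GRID] for _ in range(n_in)]
            for _ in range(n_out)
        ]

    def __call__(self, xs):
        # Each output is a plain sum of univariate functions of the inputs.
        return [
            sum(edge_fn(vals, x) for vals, x in zip(row, xs))
            for row in self.coeffs
        ]


def forward(layers, xs):
    """Depth here is just repeated composition of the same kind of unit."""
    for layer in layers:
        xs = layer(xs)
    return xs


rng = random.Random(0)
net = [KANStyleLayer(2, 3, rng), KANStyleLayer(3, 1, rng)]
out = forward(net, [0.25, -0.75])
print(out)
```

The sketch only shows the structural claim: going deeper means feeding one layer's sums into the next layer's edge functions. Whether that composition keeps paying off at depth is exactly the open question above, and nothing here answers it.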

Several notes push back on the assumption that scale is what generates capability at all. A 7M-parameter recursive network out-generalizes billion-parameter models on hard puzzles ("Can tiny recursive networks outperform massive language models?"), and a 27M-parameter hierarchical model clears reasoning ceilings that fixed-depth transformers can't ("Can recurrent hierarchies achieve reasoning that transformers cannot?"). Both make the point that an alternative architecture like KAN implicitly bets on: the *right structure* can beat brute parameter count. The catch is that these wins came from structural mechanisms, namely recursion and effective depth, not from a novel unit being intrinsically better. An architecture has to earn its scaling story mechanistically.

There's also encouraging evidence that good structure becomes *more* reliable with scale, not less. The note "Do neural networks naturally learn modular compositional structure?" shows that pretraining sharpens modular structure rather than dissolving it: the bigger and more trained the model, the more consistent its decomposition into clean subnetworks. If a KAN's edge is genuinely about cleaner functional decomposition, this is the pattern you'd hope to see it follow. And "Can neural memory modules scale language models beyond attention limits?" is a case study in an alternative-architecture advantage that *does* survive scale-up: neural memory that holds its benefit out to 2M-token contexts where the standard mechanism's costs explode.

The thing you didn't know you wanted to know: the corpus suggests the real test for any alternative architecture isn't "does it work in a small demo" but "is its advantage a *composition* property or a *fitting* property." Composition properties (depth stacking, modular decomposition, recursion, separated memory) tend to strengthen with scale; fitting advantages tend to wash out. To answer the KAN question properly you'd want a note benchmarking learnable-activation networks at depth — and that note isn't here yet.


Sources (5 notes)

Does depth matter more than width for tiny language models?

MobileLLM reports 2.7% and 4.3% accuracy gains over prior state-of-the-art models at the 125M and 350M scales, respectively, and finds that deep-and-thin architectures outperform wider ones at these sizes by composing abstract concepts through layers rather than spreading parameters across width.

Can tiny recursive networks outperform massive language models?

A single 7M-parameter two-layer network recursing on its latent reasoning state achieves 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating DeepSeek R1, o3-mini, and Gemini 2.5 Pro with less than 0.01% of their parameters. Recursion on latent state, not scale or hierarchy, drives the generalization gain.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher re-evaluating whether Kolmogorov-Arnold Networks (KANs) maintain their function-approximation and interpretability advantages in deep and large-scale settings. The question remains open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. The library contains NO direct KAN benchmarks at scale, but offers a recurring pattern about which architectural advantages survive depth and scale:
• Depth beats width for sub-billion-parameter models when the mechanism is *compositional* — layers stack abstract concepts rather than just add capacity (2024).
• Small recursive networks (7M params) out-generalize billion-parameter models on reasoning tasks; hierarchical models (27M) breach fixed-depth ceilings — suggesting *structure*, not scale, is primary (2025–2026).
• Pretraining *sharpens* modular decomposition as scale increases; compositional structure becomes more reliable, not less (2024).
• Alternative mechanisms (neural memory) do survive scale-up to 2M-token contexts where standard architectures degrade (2024).

Anchor papers (verify; mind their dates):
• arXiv:2402.14905 (MobileLLM, 2024) — depth-beats-width for sub-billion params.
• arXiv:2510.04871 (Less is More, 2025) — tiny recursive networks generalize better.
• arXiv:2506.21734 (Hierarchical Reasoning, 2025) — effective depth breaks fixed-depth ceilings.
• arXiv:2501.00663 (Titans, 2024) — alternative mechanisms scale.

Your task:
(1) RE-TEST EACH CONSTRAINT. For KANs specifically: has any paper since mid-2026 directly benchmarked learnable-activation networks at depth (>12 layers) or scale (>100M params)? Judge whether the *composition vs. fitting* distinction the library emphasizes applies to spline-based units — i.e., do KAN advantages come from cleaner functional decomposition (likely durable) or overfitting shallow functions (likely to wash out)? Name what resolved or still blocks a scaling answer.
(2) Surface the strongest contradicting or superseding work from the last ~6 months on alternative unit designs, learnable activations, or depth-structure interactions.
(3) Propose 2 research questions assuming the regime may have moved: (a) Do KANs' interpretability gains survive *modular decomposition under scale*, or do spline parameters become uninterpretable at 1B+ params? (b) Does the KAN composition property (if it exists) match the pattern of recursive or hierarchical models, i.e., does effective depth in a KAN grow with scale?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
