INQUIRING LINE


Does static per-token sparsity repeat the fixed-budget mistake at short sequences?

This explores whether applying a fixed sparsity pattern to every token — sparse attention that doesn't adapt to how long the input actually is — recreates the known error of fixed sparse-attention budgets, specifically in the short-sequence regime where there's less redundancy to throw away.

Static per-token sparsity, a sparse-attention scheme that drops the same fraction of attention everywhere regardless of input, most likely does repeat the documented failure of fixed sparse-attention budgets, and short sequences are exactly where it bites. The corpus's clearest finding here is that optimal sparsity is not a constant: longer sequences tolerate much higher sparsity without performance loss, while shorter ones do not, so a budget tuned for one length is suboptimal for the other (see: Does fixed sparsity work for all sequence lengths?). Any static scheme, whether it fixes the budget globally or fixes the sparsity per token, inherits the same blind spot: it cannot see that a short prompt has less redundancy to discard, so aggressive dropping that costs nothing at, say, 100k tokens becomes lossy at 1k.
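To make the length dependence concrete, here is a minimal sketch comparing a static rule with a length-aware one. The function names, the "keep one key in ten" ratio, and the 256-key floor are illustrative assumptions, not values from the cited note; the only point is that a fixed fraction leaves a short prompt with very few keys per query, while a floor turns the same rule into lower effective sparsity at short lengths.

```python
def static_keep(seq_len: int, keep_one_in: int = 10) -> int:
    """Keys kept per query when the same fraction is dropped at every length.

    keep_one_in=10 corresponds to 90% sparsity (hypothetical setting).
    """
    return max(1, -(-seq_len // keep_one_in))  # ceiling division on integers


def length_aware_keep(seq_len: int, floor: int = 256, keep_one_in: int = 10) -> int:
    """Same fraction, but never fewer than `floor` keys (hypothetical floor).

    Short inputs fall back toward dense attention; long inputs are unchanged.
    """
    return min(seq_len, max(floor, static_keep(seq_len, keep_one_in)))


def effective_sparsity(seq_len: int, kept: int) -> float:
    """Fraction of keys actually dropped per query."""
    return 1.0 - kept / seq_len


if __name__ == "__main__":
    for n in (128, 1_000, 100_000):
        s, a = static_keep(n), length_aware_keep(n)
        print(n, s, a, round(effective_sparsity(n, a), 3))
```

Under these assumed settings the two rules agree at 100k tokens and diverge only at short lengths, which is where the note says the static rule is lossy. A production policy would presumably tune the schedule empirically rather than use a hard floor.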

The deeper pattern the collection keeps returning to is that *compute should follow signal, not a fixed rule.* The Byte Latent Transformer makes this explicit: instead of spending equal effort per unit, it segments input into patches by next-byte entropy, spending more compute on high-uncertainty regions and less on predictable ones, and at 8B parameters it matches tokenized baselines at lower inference cost (see: Can byte-level models match tokenized performance with better efficiency?). Static per-token sparsity is the photographic negative of that idea: it spends a fixed amount everywhere, which is precisely what entropy-adaptive allocation is designed to avoid.
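The entropy-driven segmentation idea can be sketched in a few lines. This is a toy illustration, not the Byte Latent Transformer's implementation: the entropy estimate comes from bigram counts over the input itself (an assumption standing in for whatever learned predictor supplies next-byte entropy), and the threshold of 1.0 bit and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def next_byte_entropies(data: bytes) -> list[float]:
    """Entropy in bits of the next-byte distribution given the previous byte.

    Estimated from bigram counts in `data` itself (toy stand-in for a model).
    """
    follow: dict[int, Counter] = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        follow[prev][nxt] += 1
    entropies = [8.0]  # no context for the first byte: treat as maximally uncertain
    for prev in data[:-1]:
        counts = follow[prev]
        total = sum(counts.values())
        entropies.append(
            -sum((c / total) * math.log2(c / total) for c in counts.values())
        )
    return entropies


def entropy_patches(data: bytes, threshold: float = 1.0) -> list[bytes]:
    """Start a new patch wherever the next byte is hard to predict."""
    ents = next_byte_entropies(data)
    patches, start = [], 0
    for i in range(1, len(data)):
        if ents[i] > threshold:
            patches.append(data[start:i])
            start = i
    if data:
        patches.append(data[start:])
    return patches


if __name__ == "__main__":
    print(entropy_patches(b"aaaaaaaa"))  # fully predictable input
    print(entropy_patches(b"abacabad"))  # uncertain after every b"a"
```

Predictable runs collapse into one long patch, while uncertain positions open new patches, so the number of patches (and hence the downstream work) tracks uncertainty rather than raw length.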

There's also reason to doubt that tokens are interchangeable enough for a uniform rule to be safe. Work on pruning reasoning chains finds that tokens differ sharply in functional importance: under likelihood-preserving pruning, symbolic-computation tokens are preferentially preserved while grammar and meta-discourse are pruned first, so the 'right' amount to discard varies token by token, not just sequence by sequence (see: Which tokens in reasoning chains actually matter most?). A static per-token policy that treats a load-bearing token like a throwaway one is making the fixed-budget mistake at a finer grain. Memory architectures echo this: Titans scales to long contexts in part by being selective, prioritizing *surprising* tokens for storage rather than allocating memory uniformly (see: Can neural memory modules scale language models beyond attention limits?).
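A small sketch shows how a uniform rule and an importance-aware rule diverge under the same budget. The importance scores below are made up for illustration, and both function names are hypothetical; in practice the scores would have to come from something like the likelihood-preserving pruning signal described above.

```python
def keep_uniform(tokens: list[str], budget: int) -> list[str]:
    """Static rule: keep evenly spaced tokens, ignoring what they are."""
    n = len(tokens)
    return [tokens[i * n // budget] for i in range(budget)]


def keep_by_importance(tokens: list[str], scores: list[float], budget: int) -> list[str]:
    """Adaptive rule: keep the `budget` highest-scoring tokens, in original order."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:budget]
    return [tokens[i] for i in sorted(top)]


if __name__ == "__main__":
    tokens = ["so", "3", "*", "4", "is", "=", "12", "then"]
    # Hypothetical importance scores: symbolic tokens high, filler low.
    scores = [0.1, 0.9, 0.8, 0.9, 0.1, 0.7, 0.95, 0.1]
    print(keep_uniform(tokens, 4))
    print(keep_by_importance(tokens, scores, 4))
```

With the same budget of four tokens, the uniform rule keeps filler such as "so" and "is" and drops both operands, while the score-ranked rule keeps the arithmetic intact.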

The same lesson shows up far from attention, which is what makes it feel like a real principle rather than a one-off. In retrieval, a calibrated per-query uncertainty signal beats more elaborate multi-call adaptive retrieval on single-hop tasks and matches it on multi-hop tasks, using a fraction of the LM and retriever calls: self-knowledge about *this* input proves more reliable than external heuristics applied to all inputs (see: Can simple uncertainty estimates beat complex adaptive retrieval?). The cross-domain takeaway: in each of these notes where a fixed allocation is pitted against an input-adaptive one, the adaptive one wins, and the fixed one fails worst exactly where its standing assumption (lots of redundancy, lots of slack) is least true, which for sparse attention is short sequences. So the honest answer is yes: static per-token sparsity repeats the fixed-budget mistake, just relocated from the global budget down to the per-token level, and short sequences are the place to watch it break.
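The retrieval case can be sketched as a one-line gate. This is an assumption-laden illustration rather than the cited method: it takes the log-probabilities of a draft answer's tokens as given, uses their geometric mean as the uncertainty signal, and retrieves only when that falls below a hypothetical threshold of 0.6. It also assumes the probabilities are reasonably calibrated.

```python
import math


def mean_token_prob(logprobs: list[float]) -> float:
    """Geometric-mean token probability of a draft answer."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(logprobs) / len(logprobs))


def should_retrieve(logprobs: list[float], threshold: float = 0.6) -> bool:
    """Adaptive rule: fetch context only when the model looks unsure.

    `threshold` is a hypothetical value that would need tuning per model.
    """
    return mean_token_prob(logprobs) < threshold


def always_retrieve(logprobs: list[float]) -> bool:
    """Static rule for comparison: same decision for every query."""
    return True


if __name__ == "__main__":
    confident = [math.log(0.9)] * 4
    unsure = [math.log(0.3), math.log(0.5)]
    print(should_retrieve(confident), should_retrieve(unsure))
```

The static rule pays the retrieval cost on every query; the gated rule pays it only where the signal says the input needs it, which is the same compute-follows-signal pattern as the attention case.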

Sources (5 notes)

Does fixed sparsity work for all sequence lengths?

Longer sequences tolerate significantly higher sparsity levels than shorter ones without performance loss. Fixed-budget sparse attention is suboptimal in production; budgets should adapt per input based on context length and other request properties.

Can byte-level models match tokenized performance with better efficiency?

The Byte Latent Transformer (BLT) dynamically segments bytes into patches based on next-byte entropy, allocating more compute to high-entropy regions and less to predictable ones. At 8B parameters, BLT matches tokenized baselines while reducing inference cost and improving robustness to typos and cross-lingual transfer.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Can neural memory modules scale language models beyond attention limits?

The Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches its performance on multi-hop tasks, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
