SYNTHESIS NOTE

Can routing mask future experts to prevent knowledge leakage?

Can models be built to respect query timestamps by selectively silencing experts trained on future data? This note explores whether temporal causality can be enforced through architecture rather than external retrieval.

Synthesis note · 2026-06-03 · sourced from Test Time Compute

LLMs trained on a fixed web snapshot go stale and, worse, risk temporal leakage: answering as if they know information that postdates a query. Standard pretraining merges all time periods indiscriminately, so the model has no principled way to respect a query's timestamp. TiMoE makes temporal grounding architectural. It pre-trains a set of GPT-style experts on disjoint two-year slices of a 2013–2024 corpus; at inference it masks every expert whose training window ends after the query timestamp and merges the remaining experts' log-probabilities in a shared space. Because a masked expert contributes nothing to the output, the prediction is built only from parameters trained on data up to the query date. This guarantees strict causal validity by construction while retaining multi-period breadth.
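The mask-then-merge step can be sketched in a few lines of Python. This is a minimal illustration, not the TiMoE implementation: the `Expert` class, the year-level timestamps, and the function names are hypothetical, and the merge rule (a uniform mixture of the visible experts' probabilities, computed in log space) is an assumption, since the note only says the log-probabilities are merged in a shared space.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Expert:
    """Hypothetical stand-in for one time-sliced expert."""
    name: str
    window_end: int              # last year covered by the training slice
    log_probs: Dict[str, float]  # next-token log-probs over a shared vocabulary


def visible_experts(experts: List[Expert], query_year: int) -> List[Expert]:
    # Mask every expert whose training window ends after the query timestamp.
    return [e for e in experts if e.window_end <= query_year]


def merge_log_probs(experts: List[Expert], query_year: int) -> Dict[str, float]:
    """Uniform mixture of the visible experts (an assumed merge rule)."""
    active = visible_experts(experts, query_year)
    if not active:
        raise ValueError(f"no expert has a training window ending by {query_year}")
    merged = {}
    # Assumes every expert scores the same vocabulary.
    for token in active[0].log_probs:
        values = [e.log_probs[token] for e in active]
        top = max(values)
        if top == -math.inf:
            merged[token] = -math.inf
            continue
        # log of the mean probability, computed stably (log-sum-exp).
        total = sum(math.exp(v - top) for v in values)
        merged[token] = top + math.log(total) - math.log(len(values))
    return merged


# Toy example with made-up distributions over a two-token vocabulary.
toy_experts = [
    Expert("2013-2014", 2014, {"a": math.log(0.5), "b": math.log(0.5)}),
    Expert("2015-2016", 2016, {"a": math.log(0.9), "b": math.log(0.1)}),
    Expert("2023-2024", 2024, {"a": math.log(0.1), "b": math.log(0.9)}),
]
merged_2016 = merge_log_probs(toy_experts, query_year=2016)
print({tok: round(math.exp(lp), 3) for tok, lp in merged_2016.items()})
```

With a query year of 2016, the 2023-2024 expert is excluded before any probabilities are combined, so its parameters cannot influence the answer. The guarantee therefore rests on the slices being dated correctly, not on any learned behaviour.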

The result quantifies the trade-off. On the new 10k-question TSQA benchmark, whose answer alternatives are labelled past, future, or irrelevant, TiMoE cuts future-knowledge errors by up to about 15% and delivers steadier accuracy across years. The price is described as a "manageable cost of time-awareness": a slight underperformance on eight standard NLP tasks rather than a fundamental barrier. The lasting takeaway is the design principle: temporal causality can be enforced by routing over time-partitioned parameters, not only by external retrieval or post-hoc verification.
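To make the headline metric concrete, here is a rough sketch of how a future-knowledge error rate could be computed from labelled answer choices. The label set (including a `correct` label), the function names, and the toy data are all assumptions for illustration; the note does not say how TSQA scores are computed, nor whether the reported reduction is absolute or relative, so the sketch returns both.

```python
from collections import Counter
from typing import Dict, Iterable

# "past", "future" and "irrelevant" come from the note; "correct" is assumed.
LABELS = ("correct", "past", "future", "irrelevant")


def choice_rates(chosen_labels: Iterable[str]) -> Dict[str, float]:
    """Fraction of questions on which the model picked each kind of alternative."""
    counts = Counter(chosen_labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no answers to score")
    return {label: counts[label] / total for label in LABELS}


def future_error_change(baseline: Iterable[str], time_aware: Iterable[str]) -> Dict[str, float]:
    """Absolute and relative drop in the rate of 'future' choices."""
    base = choice_rates(baseline)["future"]
    new = choice_rates(time_aware)["future"]
    absolute = base - new
    relative = absolute / base if base > 0 else 0.0
    return {"absolute": absolute, "relative": relative}


# Made-up answer logs, not benchmark results.
baseline_picks = ["future"] * 4 + ["correct"] * 6
time_aware_picks = ["future"] * 1 + ["correct"] * 9
print(future_error_change(baseline_picks, time_aware_picks))
```

Reporting both figures side by side avoids the ambiguity in a bare percentage: the same change can look very different depending on which one is quoted.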

TiMoE sits alongside retrieval-time and prompt-time temporal fixes as the parametric option. It complements "Does AI text generation unfold through temporal reflection?" (the RAG route to temporal grounding) by pushing the same concern into the model's own expert structure, trading some general accuracy for guaranteed causal validity.

Inquiring lines that read this note 10

This note is a source for the research framings listed below.

What role does compression play in language model capability and generalization?

Can differential privacy during generation eliminate leakage at scale?

What articulatory information do speech signals carry that text cannot?

What temporal and spatial constraints does Space-Time U-Net solve?

Why do benchmark improvements fail to reflect actual reasoning quality?

What privacy-preserving evaluation methods best capture real-world forecasting ability?

Why does finetuning cause catastrophic forgetting of model capabilities?

Can time-awareness live in model parameters instead of retrieval?

How should retrieval systems optimize for multi-step reasoning during inference?

How does time-partitioned routing compare to retrieval-augmented temporal grounding?

How can identical external performance mask different internal representations?

What is the accuracy cost of enforcing temporal causality inside model parameters?

Does decoupling planning from execution improve multi-step reasoning accuracy?

Can modular expert decomposition extend beyond time into other causal dimensions?

How do LLMs distinguish causal reasoning from temporal and semantic associations?

Why does masking future experts guarantee causal validity without external verification?

Can model routing outperform monolithic scaling as an efficiency strategy?

Why does Branch-Train-Merge fail without learned routing between experts?

Why does verification consistently lag behind AI generation?

Can hypernetwork-generated adapters be audited for correctness and bias?

Related concepts in this collection 2

Does AI text generation unfold through temporal reflection? Explores whether the sequential ordering of tokens in LLM generation constitutes genuine temporal thought or merely probabilistic computation without reflective duration.
the retrieval-time route to temporal grounding; TiMoE is the parametric/architectural route
Can brain structure guide how we design intelligent agents? Does mapping agent capabilities onto human brain functions provide a useful organizing framework for understanding and comparing different agent architectures? This matters because agents need a shared vocabulary to advance beyond one-off designs.
both modularize capability; TiMoE modularizes by time slice with causal routing
