INQUIRING LINE

Could superposed decoding algorithms maintain multi-task representation during generation?

This explores whether new decoding methods could keep an LLM's ability to hold several tasks at once alive through generation — rather than collapsing to a single task the moment it produces a token.


The corpus has a paper that names exactly this problem: LLMs do represent multiple complete, distinct in-context tasks simultaneously during inference, but autoregressive decoding forces them to commit to one after the first token, so the superposition never survives into the output (Can LLMs handle multiple tasks at once during inference?). So the answer to your question is conditional: the multi-task representation already exists inside the model, and the bottleneck is the decoding step, which is precisely the part you'd be redesigning.
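One way to picture the first-token collapse is a toy mixture model. The sketch below is an illustration under loud assumptions, not the cited paper's method: it assumes the next-token distribution is a weighted mixture over tasks and that committing to a token updates the task weights by Bayes' rule. The two tasks, their token probabilities, and the function names are all hypothetical.

```python
# Toy illustration only (assumed model, not taken from the cited paper):
# the next-token distribution is a mixture over tasks, and committing to
# one token reweights the tasks by Bayes' rule.

def mixture(weights, task_dists):
    """Marginal next-token distribution: sum_k w_k * P_k(token)."""
    vocab = task_dists[0].keys()
    return {tok: sum(w * d[tok] for w, d in zip(weights, task_dists))
            for tok in vocab}

def posterior(weights, task_dists, token):
    """Task weights after the decoder has committed to `token`."""
    joint = [w * d[token] for w, d in zip(weights, task_dists)]
    total = sum(joint)
    return [j / total for j in joint]

# Two hypothetical tasks whose first tokens barely overlap.
translate = {"bonjour": 0.98, "HELLO": 0.02}
uppercase = {"bonjour": 0.02, "HELLO": 0.98}
prior = [0.5, 0.5]  # both tasks equally "alive" before decoding

before = mixture(prior, [translate, uppercase])
after = posterior(prior, [translate, uppercase], "bonjour")

print(before)  # both first tokens are about equally likely
print(after)   # weight on the translation task jumps to about 0.98
```

In this toy setting the superposition is visible only in the distribution before sampling; a single committed token is enough to push nearly all weight onto one task.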

The most interesting lead is that the collapse may be an artifact of left-to-right generation specifically. Diffusion-based LLMs use bidirectional attention and refine all positions at once instead of committing token-by-token; one paper shows reasoning and answer being refined along separate axes simultaneously rather than serialized (Can reasoning and answers be generated separately in language models?). If the first-token collapse is caused by autoregression's forced commitment, a refinement-based decoder is the natural place to look for preserving parallel tasks, because the architecture doesn't demand you pick a winner up front.

There's also evidence the model's internals are already more parallel than its outputs suggest. Models trained with hidden chain-of-thought tokens compute correct answers in early layers and then actively suppress them in the final layers to produce format-compliant filler tokens, meaning distinct computations coexist and one gets suppressed at the surface (Do transformers hide reasoning before producing filler tokens?). And networks tend to implement separable functions in isolated subnetworks rather than tangling them together (Do neural networks naturally learn modular compositional structure?). Both suggest the raw material for multi-task generation is structurally present; what's missing is a decoding rule that reads out more than one of these computations instead of forcing convergence.
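The kind of layer-by-layer readout behind that finding can be sketched in a few lines. This is a minimal toy version of a logit-lens style probe, assuming each layer's hidden state is decoded directly through the unembedding matrix; real implementations typically also apply the final layer norm, which is omitted here. The two-token vocabulary, the matrix, and the hidden states are hypothetical values chosen for illustration.

```python
# Toy logit-lens sketch (hypothetical values; final layer norm omitted).

def logit_lens(hidden_states, unembed, vocab):
    """Return the top token per layer when each hidden state is decoded directly."""
    tops = []
    for h in hidden_states:
        # One logit per vocabulary row: dot product of that row with h.
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        tops.append(vocab[logits.index(max(logits))])
    return tops

vocab = ["answer", "filler"]
unembed = [[1.0, 0.0],   # row for "answer"
           [0.0, 1.0]]   # row for "filler"
hidden_states = [[0.9, 0.1],   # early layer
                 [0.6, 0.4],   # middle layer
                 [0.1, 0.9]]   # final layer

tops = logit_lens(hidden_states, unembed, vocab)
print(tops)  # ['answer', 'answer', 'filler']
```

In the toy data the answer is the top prediction in the earlier layers and is displaced only at the final one, which is the pattern the note describes.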

A second family of approaches keeps tasks separate by composing them at inference rather than blending them into one set of weights. Self-adaptive models mix expert vectors dynamically at inference without interference (Can models dynamically activate expert skills at inference time?), and decoding-time proxy tuning steers output distributions while leaving the base model's weights, and the knowledge stored in them, untouched (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). The Consensus Game reframes decoding itself as a negotiated equilibrium between two policies, a generator and a discriminator, rather than a single greedy commitment (Can generative and discriminative models reach agreement?). That is a hint that decoding can be designed as a multi-objective process instead of a one-task funnel.
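To make the decoding-time steering idea concrete, here is a minimal sketch of the logit arithmetic usually associated with proxy tuning: the base model's logits are shifted by the difference between a small tuned "expert" and its untuned "anti-expert" counterpart, then normalized. The notes above do not spell out this formula, so treat it as an assumption; the function names and the logit values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def proxy_tuned_probs(base, expert, anti_expert):
    """Shift base logits by (expert - anti_expert); base weights are never modified."""
    shifted = [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]
    return softmax(shifted)

# Hypothetical logits over a three-token vocabulary.
base = [2.0, 1.0, 0.0]
expert = [0.0, 1.5, 0.0]   # small tuned model
anti = [0.0, 0.5, 0.0]     # same small model, untuned

probs = proxy_tuned_probs(base, expert, anti)
print(probs)  # tokens 0 and 1 now tie, since token 1 was shifted up by 1.0
```

The design point is that all steering happens on the output distribution at each step. When expert and anti-expert agree, the shift is zero and the base distribution is returned unchanged.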

The thing you might not have expected: the obstacle isn't representational capacity, it's the commitment rule baked into how we generate text. The model holds the superposition fine — autoregression is what flattens it. That reframes "could superposed decoding work?" from a question about model power into a question about decoder design, and points you toward bidirectional/refinement decoders and inference-time composition as the corpus's two most concrete bets.


Sources (7 notes)

Can LLMs handle multiple tasks at once during inference?

Large language models represent multiple complete, computationally distinct tasks simultaneously during inference—a macroscopic phenomenon separate from feature-level superposition. However, autoregressive decoding forces convergence to a single task after the first token, preventing practical multi-task generation.

Can reasoning and answers be generated separately in language models?

ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can models dynamically activate expert skills at inference time?

Transformer2 demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can generative and discriminative models reach agreement?

The Consensus Game frames decoding as a signaling game where generator and discriminator must agree on answers. Equilibrium-Ranking finds their joint policy, enabling 7B models to match 540B model performance without fine-tuning.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst examining whether superposed decoding could preserve multi-task representations during LLM generation. The question remains open despite recent progress.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable constraints to re-test:

• LLMs genuinely represent multiple in-context tasks simultaneously in superposition during inference, but autoregressive decoding forces collapse to a single task after the first token (2024-10, arXiv:2410.05603) — the bottleneck is the *decoder*, not model capacity.
• Diffusion-based LLMs with bidirectional attention refine all positions simultaneously rather than committing token-by-token; reasoning and answers are refined on separate axes in parallel, not serialized (2025-08, arXiv:2508.10736).
• Transformers compute correct answers in early layers then actively overwrite them to produce format-compliant outputs — distinct computations coexist internally and one is suppressed (2024-12, arXiv:2412.04537).
• Inference-time composition (self-adaptive expert mixing, proxy tuning, equilibrium-based decoding) preserves parallel knowledge without forcing convergence (2025-01, arXiv:2501.06252; 2023-10, arXiv:2310.09139).

Anchor papers (verify; mind their dates):
• arXiv:2410.05603 (2024-10): Everything Everywhere All At Once — the core multi-task superposition finding.
• arXiv:2508.10736 (2025-08): In-Place Prompting in Diffusion LLMs — bidirectional refinement as a path around autoregressive collapse.
• arXiv:2501.06252 (2025-01): Transformer2: Self-adaptive LLMs — inference-time composition.
• arXiv:2310.09139 (2023-10): The Consensus Game — decoding as multi-objective equilibrium.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, determine whether newer models (post-2026), decoder architectures (speculative decoding, mixture-of-decoders, non-autoregressive variants), test-time scaling, or evaluation harnesses have since *relaxed or overturned* the first-token collapse claim. Separate the durable question ("can we decode multi-task outputs?") from the perishable limitation ("autoregression forces convergence"). Cite what relaxed it.
(2) Surface the strongest *contradicting or superseding* work from the last ~6 months — any papers showing autoregressive decoding preserves multi-task fidelity, or refinement-based decoders that *fail* to do so.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Do longer decoding horizons (beam search, lookahead) partially restore superposition before commitment?" or "Can task-conditional output masks be learned at decode-time to read out multiple tasks without retraining?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
