INQUIRING LINE

Can latent communication reduce the token cost of multi-agent systems?

This explores whether letting agents exchange internal representations (their 'latent thoughts') instead of writing and re-reading text messages can cut the token bills that dominate multi-agent system costs.


Latent communication means agents share hidden states rather than serialized text. On whether it can reduce token cost in multi-agent systems, the corpus says yes, and unusually directly. The most concrete result is LatentMAS, where agents pass internal representations through shared KV caches rather than writing messages to each other, achieving 70.8–83.7% token reduction with a 14.6% accuracy gain and no additional training (Can agents share thoughts without converting them to text?). The approach pays off twice, cheaper and better, because text serialization is lossy: forcing reasoning through natural language discards fidelity that hidden embeddings preserve. A related line of work formalizes the same intuition, using sparse autoencoders to extract individual, shared, and private latent thoughts from hidden states, which also lets agents detect alignment conflicts at the representational level before they surface in language (Can agents share thoughts directly without using language?).
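
To make the reported range concrete, here is a minimal arithmetic sketch. Only the 70.8% and 83.7% reductions come from the LatentMAS result; the 100,000-token baseline and the function name `tokens_after_reduction` are hypothetical, chosen purely for illustration.

```python
def tokens_after_reduction(baseline_tokens: int, reduction_pct: float) -> float:
    """Return the tokens remaining after cutting `reduction_pct` percent of a baseline."""
    if not 0.0 <= reduction_pct <= 100.0:
        raise ValueError("reduction_pct must be between 0 and 100")
    return baseline_tokens * (1.0 - reduction_pct / 100.0)


# Hypothetical baseline: tokens a text-based multi-agent run might consume.
BASELINE_TOKENS = 100_000

# Reported LatentMAS token-reduction range (lower and upper bound, in percent).
REPORTED_REDUCTIONS_PCT = (70.8, 83.7)

for pct in REPORTED_REDUCTIONS_PCT:
    remaining = tokens_after_reduction(BASELINE_TOKENS, pct)
    print(f"{pct}% reduction: about {remaining:,.0f} tokens remain")
```

Under that assumed baseline, the run would shrink to somewhere between roughly a sixth and under a third of its original token volume.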

Why this matters becomes clear once you see what actually drives multi-agent performance. Anthropic's internal evals found that roughly 80% of the performance variance in multi-agent research systems comes from token spending, not coordination cleverness (Does token spending drive multi-agent research performance?), a finding echoed in the broader framing of agent-level test-time scaling as 'primarily a token spending function' (How does test-time scaling work at the agent level?). If performance is bought with tokens, then anything that decouples capability from token volume changes the economics directly rather than at the margins, and that note names LatentMAS and shared-KV-cache approaches explicitly as ways to do so.

Latent exchange isn't the only lever, and reading the levers together is where it gets interesting. One approach attacks the cost by changing the message format without leaving language: MetaGPT shows that having agents produce standardized engineering artifacts and pull from a shared environment beats free-form conversational chatter by eliminating noise (Does structured artifact sharing outperform conversational coordination?). Another attacks the substrate: most agentic subtasks are repetitive and well-defined enough that small language models handle them at 10–30× lower cost, making latent efficiency and model right-sizing complementary rather than competing savings (Can small language models handle most agent tasks?). A third attacks memory: DeepAgent's autonomous memory folding compresses interaction history into structured schemas, cutting token overhead while preserving the details that matter (Can agents compress their own memory without losing critical details?).
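
The right-sizing lever can be sketched as simple arithmetic. The 10–30× cost ratio is from the note above; the 80% routing share, the assumption that every subtask costs the same on the large model, and the function name `relative_cost` are all hypothetical.

```python
def relative_cost(slm_share: float, cost_ratio: float) -> float:
    """Cost of a mixed SLM/LLM system relative to an all-LLM baseline of 1.0.

    slm_share  -- fraction of subtasks routed to the small model (0 to 1)
    cost_ratio -- how many times cheaper the small model is per subtask
    """
    if not 0.0 <= slm_share <= 1.0:
        raise ValueError("slm_share must be between 0 and 1")
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")
    # Subtasks kept on the large model cost 1.0 each; routed ones cost 1/cost_ratio.
    return (1.0 - slm_share) + slm_share / cost_ratio


# Hypothetical: 80% of subtasks are simple enough to route to a small model.
SLM_SHARE = 0.8

for ratio in (10, 30):  # reported range of the SLM cost advantage
    print(f"{ratio}x cheaper SLM: relative cost {relative_cost(SLM_SHARE, ratio):.3f}")
```

Note that the subtasks left on the large model set a floor on the savings: past a certain point, a cheaper small model matters less than the share of work that can be routed to it.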

There's a deeper reframing worth noticing. A 115-day case study found 82.9% of tokens were cache reads, and argued that the meaningful cost denominator stops being the individual token and becomes the completed artifact (Do persistent agents really cost less per token?). Latent communication and persistent caching are two routes to the same destination: stop paying to re-serialize and re-read what the system already knows.
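
A minimal sketch of that change of denominator follows. Only the 82.9% cache-read share comes from the case study; the prices, token total, artifact count, and function names are hypothetical placeholders, not figures from any provider or from the study.

```python
def blended_cost(
    total_tokens: int,
    cache_read_share: float,
    fresh_price_per_mtok: float,
    cached_price_per_mtok: float,
) -> float:
    """Total cost when a share of tokens is billed at a cheaper cache-read rate."""
    cached_tokens = total_tokens * cache_read_share
    fresh_tokens = total_tokens - cached_tokens
    return (
        fresh_tokens * fresh_price_per_mtok
        + cached_tokens * cached_price_per_mtok
    ) / 1_000_000


def cost_per_artifact(total_cost: float, artifacts_completed: int) -> float:
    """Re-express cost with completed artifacts, not tokens, as the denominator."""
    if artifacts_completed <= 0:
        raise ValueError("artifacts_completed must be positive")
    return total_cost / artifacts_completed


# Hypothetical inputs; only the 0.829 cache-read share is from the case study.
total = blended_cost(
    total_tokens=1_000_000,
    cache_read_share=0.829,
    fresh_price_per_mtok=3.00,   # assumed price per million fresh tokens
    cached_price_per_mtok=0.30,  # assumed price per million cached tokens
)
print(f"blended cost: {total:.4f}")
print(f"cost per artifact: {cost_per_artifact(total, artifacts_completed=50):.4f}")
```

The design point is that the per-token price alone says little once most tokens are cache reads; dividing by finished work gives a figure that can be compared across systems with very different caching behavior.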

One caution the corpus raises against pure efficiency-chasing: cheaper communication doesn't fix coordination. Multi-agent systems degrade predictably with scale because agents agree too late or adopt strategies without informing neighbors, and they tend to accept neighbor information without verification, which lets errors propagate (Why do multi-agent systems fail to coordinate at scale?). Consensus tends to fail through stalled convergence (liveness loss) rather than corrupted values (Can LLM agent groups reliably reach consensus together?). The intriguing implication is that latent communication might help on both fronts at once: the same hidden-state sharing that saves tokens also exposes alignment conflicts at the representational level, potentially turning a cost optimization into a coordination repair.


Sources (10 notes)

Can agents share thoughts without converting them to text?

LatentMAS enables agents to share internal representations directly via KV caches, reaching 14.6% accuracy gains and 70.8–83.7% token reduction with no additional training. Hidden embeddings preserve reasoning fidelity that text-based systems cannot.

Can agents share thoughts directly without using language?

Research formalizes inter-agent thought sharing via sparse autoencoders that recover individual, shared, and private latent thoughts from hidden states. This approach detects alignment conflicts at the representational level before they manifest in language.

Does token spending drive multi-agent research performance?

Anthropic's internal evals show token spending alone accounts for 80% of performance variance in multi-agent research systems. Model capability upgrades deliver larger gains than doubling token budget, suggesting efficiency matters as much as quantity.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Does structured artifact sharing outperform conversational coordination?

MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies. Autonomy and structure together avoid the degradation that plagues poorly designed consolidation.

Do persistent agents really cost less per token?

A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.

Can LLM agent groups reliably reach consensus together?

Across hundreds of simulations, LLM-agent groups frequently fail to reach valid agreement due to timeouts and stalled convergence rather than subtle value corruption. Agreement degrades with group size even without Byzantine agents present.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **Can latent communication reduce the token cost of multi-agent systems, and if so, what are the actual constraints today?** This remains open despite recent progress.

What a curated library found — and when (dated claims, not current truth):
- LatentMAS achieved 70.8–83.7% token reduction via shared KV caches + 14.6% accuracy gain, training-free (~2025).
- ~80% of multi-agent research performance variance is driven by token spending, not coordination design (~2025).
- Sparse autoencoders can extract latent thoughts from hidden states, surfacing alignment conflicts at the representational level before language serialization (~2025).
- Small language models handle most agentic subtasks at 10–30× lower cost, making latent efficiency and model right-sizing complementary (~2025).
- Multi-agent systems degrade predictably with scale: agents agree too late, adopt unshared strategies, and propagate unchecked neighbor information (~2026).

Anchor papers (verify; mind their dates):
- MetaGPT (2023, arXiv:2308.00352): standardized artifacts beat conversational chatter.
- Thought Communication in Multiagent Collaboration (2025, arXiv:2510.20733): sparse autoencoder latent extraction.
- Latent Collaboration in Multi-Agent Systems (2025, arXiv:2511.20639): LatentMAS KV cache approach.
- Can AI Agents Agree? (2026, arXiv:2603.01213): coordination failure modes under scale.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, assess whether newer models (o3, o4 variants), KV-cache tooling (vLLM >=0.7), multi-agent frameworks (Anthropic's batch API, OpenAI Swarm), or fresh coordination protocols (consensus repair, hierarchical fallback) have since **relaxed or overturned** it. Distinguish durable questions (token–capability coupling, coordination as scale problem) from perishable limitations (specific VRAM/throughput bounds). Ground each answer in a real arXiv ID or acknowledge the gap.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Watch for: (a) single-agent reasoning under equal thinking budgets outperforming multi-agent (arXiv:2604.02460 hints at this); (b) latent communication trade-offs (privacy, interpretability, debugging cost); (c) claims that token spending is NOT the bottleneck.
(3) **Propose 2 new research questions that ASSUME the regime may have shifted:**
   - Q1: If latent KV caches are now standard infrastructure, do multi-agent failures still correlate with token count, or do they reflect coordination protocol choice instead?
   - Q2: Under what conditions does forcing agents to serialize to language (for auditability, transparency, or human-in-loop approval) *outweigh* the token savings of latent exchange?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines