INQUIRING LINE

What specific information must be exported from the language system?

This explores a question the corpus answers in surprisingly different ways depending on the task: when you pull information *out* of a language model to do something with it — personalize, formalize, act, retrieve — which slice of information actually carries the signal, and which can be discarded.


The corpus's most striking finding is that what has to be extracted from a language system for a downstream job to work is rarely the obvious thing. For personalization, you'd assume the model needs to export what a user *asked* (their queries, their inputs). The opposite is true: profiles built from a user's *outputs* alone match or beat complete profiles, while input-only profiles actively degrade performance (Do user outputs outperform inputs for LLM personalization?). The information that must be exported isn't semantic content at all; it is style and preference signal. What someone says carries less than how they say it.
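As a minimal sketch of the three profile variants compared above, the snippet below builds a prompt from a user's history in output-only, input-only, or complete form. The `Interaction` type, `build_profile`, `render_prompt`, and the sample history are all hypothetical names and data invented for illustration; they are not the cited study's code or datasets.

```python
from dataclasses import dataclass
from typing import List, Literal

ProfileMode = Literal["output_only", "input_only", "complete"]


@dataclass(frozen=True)
class Interaction:
    """One past exchange: what the user was given and what the user wrote."""
    user_input: str
    user_output: str


def build_profile(history: List[Interaction],
                  mode: ProfileMode = "output_only",
                  max_items: int = 5) -> List[str]:
    """Return the slice of history that gets exported into the prompt."""
    recent = history[-max_items:]
    if mode == "output_only":
        return [h.user_output for h in recent]
    if mode == "input_only":
        return [h.user_input for h in recent]
    if mode == "complete":
        return [f"{h.user_input} -> {h.user_output}" for h in recent]
    raise ValueError(f"unknown profile mode: {mode}")


def render_prompt(profile: List[str], task: str) -> str:
    """Prepend the profile lines to the task as plain text."""
    lines = "\n".join(f"- {item}" for item in profile)
    return f"Examples from this user:\n{lines}\n\nTask: {task}"


# Hypothetical sample history (illustrative only).
history = [
    Interaction("Article about solar tariffs", "Solar tariffs, briefly: who pays?"),
    Interaction("Article about city bike lanes", "Bike lanes, briefly: who benefits?"),
]

print(render_prompt(build_profile(history, "output_only"),
                    "Write a headline for the new article."))
```

In this toy history the output-only profile exposes the user's recurring headline pattern, while the input-only profile exposes only topics, which is the kind of style-versus-content contrast the finding describes.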

When the export target is formal logic rather than a user model, the required information flips to pure semantics. LLMs can emit syntactically valid logical expressions all day, but they fail to carry across the parts that actually matter: scope, quantifier precision, and predicate granularity (Can large language models translate natural language to logic faithfully?). So what must be exported is exactly what these systems are worst at exporting: meaning, not form. Interestingly, a reasoning model such as OpenAI's o1 *can* export explicit structural analysis of language when it reasons step by step, building syntactic trees and phonological generalizations (Can language models actually analyze language structure?). The information is in there; whether it comes out depends on the route you take to extract it.
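To make the scope problem concrete, here is a standard textbook illustration (not an example taken from the cited note). The sentence "Every student read a book" has two readings that are both well-formed first-order formulas yet mean different things; the predicate names are chosen only for this example.

```latex
% Reading 1: each student read some book, possibly a different one per student.
\forall x \,\bigl(\mathrm{Student}(x) \rightarrow \exists y \,(\mathrm{Book}(y) \land \mathrm{Read}(x, y))\bigr)

% Reading 2: there is one particular book that every student read.
\exists y \,\bigl(\mathrm{Book}(y) \land \forall x \,(\mathrm{Student}(x) \rightarrow \mathrm{Read}(x, y))\bigr)
```

Reading 2 entails Reading 1 but not the reverse, so a translation that picks the wrong quantifier order is syntactically valid and still semantically wrong.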

There's a deeper version of the question the corpus surfaces: before you can export information, the system has to know *which* information is missing. Models that ace complete reasoning problems collapse to 40–50% accuracy when asked which clarifying question would fill in a withheld variable (Can models identify what information they actually need?). Identifying the needed piece and producing it are separable skills, so exporting the right information presumes a capability the model may not have. DeepRAG frames this as a decision problem, modeling each reasoning step as a Markov Decision Process: at each step, the model learns whether the needed information should come from its own parameters or be retrieved externally. That choice is credited with a roughly 22% (21.99%) improvement, from better-targeted retrieval and less noise from unnecessary lookups (When should language models retrieve external knowledge versus use internal knowledge?).
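The per-step choice between parametric knowledge and retrieval can be sketched as a simple loop. This is only an illustration of the decision structure: DeepRAG learns its policy, whereas the `should_retrieve` rule, the `answer_stepwise` function, and the toy `KNOWN` and `CORPUS` tables below are hand-written, hypothetical stand-ins.

```python
from typing import Callable, List, NamedTuple


class Step(NamedTuple):
    subquery: str
    source: str    # "parametric" or "retrieved"
    evidence: str


def answer_stepwise(subqueries: List[str],
                    should_retrieve: Callable[[str], bool],
                    recall: Callable[[str], str],
                    retrieve: Callable[[str], str]) -> List[Step]:
    """For each subquery, decide where the needed information comes from."""
    steps: List[Step] = []
    for q in subqueries:
        if should_retrieve(q):
            steps.append(Step(q, "retrieved", retrieve(q)))
        else:
            steps.append(Step(q, "parametric", recall(q)))
    return steps


# Toy stand-ins (assumptions): a lookup table plays the model's own knowledge,
# another plays the external corpus.
KNOWN = {"capital of France": "Paris"}
CORPUS = {"opening hours of the toy museum": "placeholder passage from doc-17"}

steps = answer_stepwise(
    ["capital of France", "opening hours of the toy museum"],
    should_retrieve=lambda q: q not in KNOWN,   # hand-written rule, not a learned policy
    recall=lambda q: KNOWN[q],
    retrieve=lambda q: CORPUS.get(q, "no passage found"),
)

for s in steps:
    print(f"{s.source:>10}: {s.subquery} -> {s.evidence}")
```

The point of the structure is that retrieval is skipped whenever the decision function says the model already has the answer, which is where the reported noise reduction would come from.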

For agents, the exported information has to be *grounded*, meaning tied to real actions, environments, and tools; otherwise the model invents them. Turning an LLM into an action system isn't a matter of squeezing more out of the model. It requires curating action-environment-user datasets and building an external harness, because the surrounding system, not the weights, determines whether an exported action is real or hallucinated (Can you turn an LLM into an agent by just fine-tuning?). And there's a reason to care about precisely what gets exported: over long delegated workflows, frontier models silently corrupt ~25% of document content, with errors compounding through 50 round-trips without plateauing (Do frontier LLMs silently corrupt documents in long workflows?).
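A back-of-the-envelope calculation shows why small per-step errors matter in long workflows. The sketch below assumes, purely for illustration, that the ~25% corruption figure is reached at 50 round-trips and that each round-trip corrupts a constant, independent fraction of the surviving content. Neither assumption is stated by the source, and the function names are hypothetical.

```python
def per_round_loss(total_loss: float, round_trips: int) -> float:
    """Constant per-round-trip loss that compounds to total_loss after round_trips."""
    if not 0.0 <= total_loss < 1.0:
        raise ValueError("total_loss must be in [0, 1)")
    if round_trips <= 0:
        raise ValueError("round_trips must be positive")
    return 1.0 - (1.0 - total_loss) ** (1.0 / round_trips)


def surviving_fraction(loss_per_round: float, round_trips: int) -> float:
    """Fraction of content still intact after the given number of round-trips."""
    return (1.0 - loss_per_round) ** round_trips


p = per_round_loss(0.25, 50)   # assumed: ~25% corrupted after 50 round-trips
print(f"implied loss per round-trip: {p:.4%}")
for n in (10, 25, 50, 100):
    print(f"after {n:>3} round-trips: {surviving_fraction(p, n):.1%} intact")
```

Under these assumptions the implied loss is only about 0.57% per round-trip, small enough to pass unnoticed in any single step, yet it leaves roughly 56% of the content intact if the same rate were to continue to 100 round-trips.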

The thread across all of this is that there's no single 'information' a language system must export. Language modeling can itself be treated as lossless compression (Can text-trained models compress images better than specialized tools?), which suggests the model holds far more than any one task needs. The real engineering question is never 'get the information out' but 'which projection of it': style for personalization, semantics for logic, grounded actions for agents, the missing variable for clarification. Pick wrong and the export degrades the task; the input-built user profile is the clearest cautionary case.


Sources (8 notes)

Do user outputs outperform inputs for LLM personalization?

Research shows that user profiles built from outputs alone match or exceed performance of complete profiles across multiple tasks, while input-only profiles degrade performance. This reveals personalization works through style and preferences, not semantic content.

Can large language models translate natural language to logic faithfully?

LLMs generate well-formed logical expressions that are semantically incorrect, with errors clustering at scope ambiguity, quantifier precision, and predicate granularity. The asymmetry suggests LLMs understand formal language better than they can generate it.

Can language models actually analyze language structure?

OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.

Can models identify what information they actually need?

Models achieving high accuracy on complete reasoning tasks drop to 40-50% accuracy identifying what clarifying question to ask when one variable is withheld. Information gathering and problem execution are separable cognitive operations.

When should language models retrieve external knowledge versus use internal knowledge?

DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.

Can you turn an LLM into an agent by just fine-tuning?

Converting LLMs to action-capable systems requires four distinct stages: curating action-environment-user datasets, training for action grounding, integrating agent infrastructure with memory and tools, and rigorous safety evaluation. The surrounding system and harness determine whether actions are grounded or hallucinated.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Can text-trained models compress images better than specialized tools?

Chinchilla models trained exclusively on text achieve better compression rates on images and audio than FLAC and PNG by using their context window to adapt as task-specific compressors. This demonstrates that generalization operates through compression, not specialization.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

As an AI research analyst, investigate this still-open question: What specific information must be exported from a language system for downstream tasks to succeed — and has the constraint landscape shifted?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026 across capability, personalization, reasoning, and agentic work:

• User-output-derived profiles outperform input+output profiles for personalization; input-only profiles degrade performance (2024-06, arXiv:2406.17803).
• LLMs fail at autoformalisation because they cannot export semantic precision (scope, quantifiers, predicates), only syntax — but can emit metalinguistic analyses under chain-of-thought prompting (2023-05, arXiv:2305.00948; 2024-01, arXiv:2401.05777).
• Models achieving 95%+ accuracy on well-specified reasoning collapse to 40–50% when asked to identify missing information; diagnosis and production are separable (2025-03, arXiv:2503.22674).
• DeepRAG's per-step decision (retrieve vs. internal) yields 22% accuracy gain by filtering noise (2025-02, arXiv:2502.01142).
• Frontier models silently corrupt ~25% of document content over long delegated workflows; harness architecture, not weights alone, determines grounded action validity (2026-04, arXiv:2604.15597).

Anchor papers (verify; mind their dates):
• arXiv:2305.00948 (2023-05): Metalinguistic abilities in LLMs
• arXiv:2406.17803 (2024-06): Role of user profile in personalization
• arXiv:2502.01142 (2025-02): DeepRAG step-wise retrieval decisions
• arXiv:2604.15597 (2026-04): Document corruption in delegated workflows

Your task:
(1) RE-TEST EACH CONSTRAINT. For personalization, does newer tooling (e.g., preference learning, RLHF refinement, or multi-turn fine-tuning) now recover input signal? For formal logic, do recent code-generation or proof-assistant integrations bypass the semantics-export failure? For reasoning diagnosis, has chain-of-thought scaling or self-critique agents closed the 40–50% gap? Separate durable (the need to project information task-specifically) from perishable (specific modalities fail).

(2) Surface the strongest contradicting or superseding work from the last 6 months — especially papers claiming unified export strategies, end-to-end retrieval elimination, or harness-agnostic grounding.

(3) Propose 2 research questions assuming the regime may have moved: (a) Can a single exported representation (e.g., a learned "task-neutral embedding") substitute for task-specific projection? (b) Does scaling harness capacity (memory, tool cardinality) eventually eliminate the need for selective export?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
