INQUIRING LINE

Why do LLMs generate logical forms without preserving semantic content?

This explores why LLMs can produce well-formed logical expressions (valid syntax) while getting the actual meaning wrong — and what that gap reveals about how these models 'reason' at all.


The corpus points to a single underlying cause: these models match the *shape* of formal language without operating on what it denotes. The clearest evidence comes from autoformalisation work showing that LLMs reliably generate syntactically valid logic that is semantically incorrect, with errors clustering exactly where meaning lives: scope ambiguity, quantifier precision, and predicate granularity (see 'Can large language models translate natural language to logic faithfully?'). The form is easy because form is surface pattern; the content is hard because content requires tracking what the symbols are *about*.
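To make the scope-ambiguity failure concrete, here is a minimal sketch in Python. It is an illustration, not an example drawn from the cited work: the sentence "Every student read a book", the toy world, and the function names are all assumptions. Both readings are well-formed first-order formulas, yet they disagree on the same facts, so a translation can be syntactically valid and still pick the wrong meaning.

```python
# Toy world, invented for illustration only: two students, two books,
# and each student read a different book.
STUDENTS = {"ann", "bob"}
BOOKS = {"emma", "ulysses"}
READ = {("ann", "emma"), ("bob", "ulysses")}


def every_student_some_book(students, books, read):
    """forall s. exists b. Read(s, b): each student read at least one book."""
    return all(any((s, b) in read for b in books) for s in students)


def some_book_every_student(students, books, read):
    """exists b. forall s. Read(s, b): one particular book was read by all."""
    return any(all((s, b) in read for s in students) for b in books)


wide_universal = every_student_some_book(STUDENTS, BOOKS, READ)
wide_existential = some_book_every_student(STUDENTS, BOOKS, READ)
print(wide_universal, wide_existential)  # True False
```

Evaluating candidate formulas against a small hand-built world like this is one cheap way to notice that two equally well-formed outputs are not interchangeable.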

Why the split? Because LLMs reason by semantic association, not symbolic manipulation. When researchers strip the familiar real-world meaning out of a reasoning task and leave only the abstract rules, performance collapses, even when the correct rules sit right there in context (see 'Do large language models reason symbolically or semantically?'). The model was leaning on token associations and parametric commonsense the whole time, not on the logical structure it appeared to be using. So when you ask it to emit pure logical form, you remove the very crutch it was reasoning with, and the content drifts.
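A rough sketch of what "stripping the meaning out" can look like in practice follows. This is not the protocol used in the cited research; the helper `abstract_symbols`, the placeholder scheme, and the example rules are assumptions made for illustration. The logical structure of the task is untouched, but the familiar words a model could lean on are replaced with opaque tokens.

```python
import re


def abstract_symbols(text, vocabulary):
    """Replace each listed content word with an opaque placeholder (p1, p2, ...).

    Hypothetical helper: the placeholder naming is an assumption, chosen
    only so that the tokens carry no real-world associations.
    """
    mapping = {word: f"p{i}" for i, word in enumerate(vocabulary, start=1)}
    # Try longer words first so a short word never pre-empts a longer one.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, alternatives)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping


rules = "If X is a bird then X can fly. Tweety is a bird."
abstracted, mapping = abstract_symbols(rules, ["bird", "fly", "Tweety"])
print(abstracted)  # If X is a p1 then X can p2. p3 is a p1.
```

Comparing a model's answers on `rules` and on `abstracted` is one way to probe whether it is following the rule structure or the associations attached to the original words.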

There's a deeper mechanism beneath this. Token generation is a smooth probabilistic flow toward the training distribution, not a turbulent exploration of competing claims (see 'Does LLM generation explore competing claims while producing text?'). That flow is sequential but atemporal: there is no pause for reflection or revision in which a model could check whether its formula actually means what the sentence meant (see 'Does AI text generation unfold through temporal reflection?'). The same pattern-over-meaning bias shows up elsewhere: semantically identical prompts produce different outputs because the model registers corpus *frequency*, not equivalence of meaning (see 'Why do semantically identical prompts produce different LLM outputs?'). Whether the task is paraphrasing or formalizing, the model tracks statistical mass over sense.

The most useful surprise here is that the fix isn't 'more formalization' but *less*. Partial symbolic abstraction beats both pure natural language and full formal logic: enriching language with selective symbolic structure preserves the semantic information that complete formalization throws away (see 'Why does partial formalization outperform full symbolic logic?'). Full formalization is precisely the regime where semantic content gets stranded, which is why hybrid prompting that forces a model to check warrants and implicit premises catches errors that clean-looking logical chains hide (see 'Can structured argument prompts make LLM reasoning more rigorous?'). The logical form, paradoxically, is where meaning goes to get lost.

If you want to push on the 'why' further, one strand of the corpus argues the real reasoning never happens in the surface symbols at all: it lives in latent hidden-state trajectories, with the visible chain (or logical form) serving as only a partial, sometimes unfaithful interface (see 'Where does LLM reasoning actually happen during generation?'). On that view, asking why the generated logic doesn't preserve meaning is asking why a rendering doesn't match the thing it renders: the form was always a downstream projection, not the computation itself.


Sources (8 notes)

Can large language models translate natural language to logic faithfully?

LLMs generate well-formed logical expressions that are semantically incorrect, with errors clustering at scope ambiguity, quantifier precision, and predicate granularity. The asymmetry suggests LLMs understand formal language better than they can generate it.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Does LLM generation explore competing claims while producing text?

Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.

Does AI text generation unfold through temporal reflection?

Token ordering in LLMs follows probabilistic selection without intervening reflection or revision. Human discourse gains meaning from temporal structure—time spent thinking changes what comes next—but AI text production lacks this duration-in-reflection despite appearing sequentially composed.

Why do semantically identical prompts produce different LLM outputs?

Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.

Why does partial formalization outperform full symbolic logic?

QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Where does LLM reasoning actually happen during generation?

Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher re-testing claims about semantic grounding in formal reasoning. The question remains open: why do LLMs generate syntactically valid logical forms while losing semantic content?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. A library of ~12 papers identified:
• LLMs reason via semantic association and token frequency, not symbolic manipulation; stripping real-world meaning causes performance collapse, even when rules are in context (2023–2024).
• Autoformalisation fails at scope, quantifiers, and predicates because models match statistical *shape* over what symbols denote (2024).
• Token generation is a smooth probabilistic flow toward training distribution, not turbulent exploration; no in-generation pause for checking whether formulas preserve meaning (2024–2025).
• Partial symbolic abstraction + hybrid prompting (with warrant checking, critical questions) preserves semantic content better than full formalization (2025).
• LLM reasoning may live in latent hidden-state trajectories; visible logical chains are downstream projections, not the computation (2026).

Anchor papers (verify; mind their dates):
• arXiv:2305.14825 (May 2023): In-context semantic reasoners, not symbolic.
• arXiv:2412.15177 (Dec 2024): Argumentative querying to steer reasoning.
• arXiv:2502.12616 (Feb 2025): Quasi-symbolic abstraction improves chain-of-thought.
• arXiv:2604.15726 (Apr 2026): Latent reasoning trajectories frame.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding, probe whether newer model scales, instruction-tuning regimes (e.g., constitutional AI, RLHF variants), reasoning-specific architectures (speculative decoding, active inference), or formal-verification tooling have since *relaxed* the semantic-drift problem. Separate the durable question (why is formal meaning hard?) from perishable limitations (current models cannot do X). Cite what resolved it.
(2) Surface the strongest *disagreeing* work from the last ~6 months: has any recent paper shown LLMs *can* preserve semantics in formalisation under certain conditions, or argue the latent-reasoning frame doesn't hold?
(3) Propose 2 research questions that *assume* the regime may have moved: e.g. "Given improved semantic grounding, does scaling formalisation now help or hurt?" or "Can hybrid symbolic-neural architectures escape the form/content split?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
