Can LLMs translate between natural language and formal logic faithfully?
This explores whether LLMs can faithfully convert natural language into formal logic (and back) — not just produce well-formed symbols, but preserve the actual meaning — and what the corpus suggests fixes the gap.
This explores whether LLMs can faithfully translate between natural language and formal logic — meaning preserved, not just syntax. The corpus's blunt answer: they produce logic that looks right but means the wrong thing. Can large language models translate natural language to logic faithfully? finds models reliably generate syntactically valid expressions that are semantically incorrect, with errors clustering exactly where translation is hardest — scope ambiguity, quantifier precision, and predicate granularity. Intriguingly, the failure is asymmetric: models read formal language better than they write it.
Why the gap? A deeper note argues LLMs aren't doing symbolic manipulation at all. Do large language models reason symbolically or semantically? shows that when you strip the familiar semantic content out of a reasoning task and leave only the formal rules, performance collapses — models lean on commonsense token associations, not logical structure. The same contamination shows up mechanistically: How do language models perform syllogistic reasoning internally? traces a genuine content-independent reasoning circuit inside the model, but finds attention heads encoding world knowledge that bias conclusions toward what sounds plausible over what's logically valid — and this gets worse at larger scales, not better. Translation requires holding meaning still; these models keep letting meaning leak in.
A related blind spot makes faithful translation even harder: ambiguity. Can language models recognize when text is deliberately ambiguous? shows GPT-4 correctly disambiguates only 32% of cases (humans hit 90%) because it can't hold multiple interpretations at once — and faithful formalization is largely the act of choosing the right reading among several. Why do large language models fail at complex linguistic tasks? adds that errors grow predictably with syntactic depth, so the harder the sentence structure, the less trustworthy the translation.
The more interesting thread is what actually works — and it isn't 'translate harder.' Three notes converge on division of labor over full conversion. Can symbolic solvers fix how LLMs reason about logic? has the LLM formulate symbolic representations while a deterministic solver runs the inference and feeds back machine-verifiable error messages — catching translation mistakes far better than the model critiquing itself. Why does partial formalization outperform full symbolic logic? goes further: partial formalization beats both pure language and full formalization, because full conversion throws away semantic information the symbols can't carry. And Can structured argument prompts make LLM reasoning more rigorous? shows structured prompts that force the model to surface hidden premises catch failures plain chain-of-thought skips.
The surprise worth leaving with: the corpus suggests LLMs can genuinely analyze language structure — Can language models actually analyze language structure? shows step-by-step reasoning building real syntactic trees. So the bottleneck isn't ignorance of logic. It's a deeper split, sharpest in Can LLMs understand concepts they cannot apply?: models can explain a concept correctly, fail to apply it, and recognize the failure — explanation and execution run on disconnected tracks. Faithful translation needs both to fire together, which is exactly why offloading the execution to a symbolic solver outperforms asking the model to be faithful on its own.
Sources 10 notes
LLMs generate well-formed logical expressions that are semantically incorrect, with errors clustering at scope ambiguity, quantifier precision, and predicate granularity. The asymmetry suggests LLMs understand formal language better than they can generate it.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
LLMs implement a content-independent three-stage reasoning mechanism—recitation, middle-term suppression, mediation—that works across architectures. However, additional attention heads encoding world knowledge systematically bias conclusions toward semantically plausible rather than logically valid answers, with contamination increasing at larger scales.
AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
Logic-LM divides cognitive labor by having LLMs formulate symbolic representations while deterministic solvers execute inference and provide machine-verifiable error messages. This structured feedback loop catches translation errors better than LLM self-critique, improving faithful reasoning without requiring perfect formalization.
QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.
Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.
OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.