INQUIRING LINE

How do LLMs translate informal prose into logically correct formal specifications?

This explores autoformalisation — whether LLMs can take everyday prose and convert it into formal logic or specifications that are not just well-formed but actually mean what the prose meant.


The corpus delivers a sharp split here. LLMs are good at producing logic that *looks* right and bad at producing logic that *is* right: "Can large language models translate natural language to logic faithfully?" finds that models generate well-formed expressions that are semantically wrong, with errors clustering predictably at scope ambiguity, quantifier precision, and predicate granularity. The failure is also asymmetric: models seem to understand formal language better than they can generate it. So the bottleneck isn't reading logic; it's committing prose to a single faithful logical form.
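To make the scope-ambiguity hotspot concrete, here is a minimal sketch, assuming a toy sentence ("Every student read a book") and a hypothetical three-set model of who read what. Neither the sentence nor the data comes from the cited note. Both readings are well-formed, yet they disagree on the same facts, which is how a syntactically valid formalisation can still be the wrong one.

```python
# Hypothetical toy model (illustration only): who read which book.
students = {"ana", "ben"}
books = {"emma", "ulysses"}
read = {("ana", "emma"), ("ben", "ulysses")}


def surface_scope(students, books, read):
    """forall s. exists b. read(s, b): every student read some book or other."""
    return all(any((s, b) in read for b in books) for s in students)


def inverse_scope(students, books, read):
    """exists b. forall s. read(s, b): one particular book was read by everyone."""
    return any(all((s, b) in read for s in students) for b in books)


print(surface_scope(students, books, read))  # True
print(inverse_scope(students, books, read))  # False
```

A formaliser that silently emits only one of these two functions has made the interpretive choice for the author, and nothing in the output signals that a choice was made.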

Why does the translation break down? Two adjacent notes point at root causes. First, prose is often ambiguous on purpose, and LLMs can't see it: "Can language models recognize when text is deliberately ambiguous?" shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans, and cannot hold multiple interpretations at once. Formalisation *forces* a choice of interpretation, so a model blind to ambiguity will confidently formalise the wrong reading. Second, the underlying machinery isn't symbolic at all: "Do large language models reason symbolically or semantically?" finds that when meaning is decoupled from the task, performance collapses even with correct rules in hand. The model is pattern-matching on semantics, not manipulating symbols, and symbol manipulation is exactly the skill faithful formalisation demands.

The most useful counter-move in the corpus is to *not fully formalise*. "Why does partial formalization outperform full symbolic logic?" reports that selectively enriching natural language with symbolic elements (QuaSAR, Logic-of-Thought) yields 4–8% accuracy gains over both pure prose and full formalisation. Full translation throws away semantic information the model still needs; partial abstraction keeps the prose's richness while adding just enough structure. The lesson runs against intuition: the path to *more* logical correctness can be *less* complete symbolisation.
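The idea of augmenting prose rather than replacing it can be sketched in a few lines. This is an illustration under stated assumptions, not the QuaSAR or Logic-of-Thought procedure: the `augment` function, the `CONDITIONAL` pattern, and the gloss format are all hypothetical, and the pattern only recognises plain "if A, then B" sentences.

```python
import re

# Hypothetical pattern: recognises only simple "if A, then B" conditionals.
CONDITIONAL = re.compile(
    r"^if\s+(?P<a>.+?),?\s+then\s+(?P<b>.+?)\.?$", re.IGNORECASE
)


def augment(sentence: str) -> str:
    """Keep the sentence verbatim and append a symbolic gloss if one applies."""
    match = CONDITIONAL.match(sentence.strip())
    if match is None:
        return sentence  # nothing recognised: leave the prose untouched
    a, b = match.group("a"), match.group("b")
    gloss = f"[A = {a!r}; B = {b!r}; A -> B; contrapositive: not B -> not A]"
    return f"{sentence} {gloss}"


print(augment("If it rains, then the match is cancelled."))
print(augment("The pitch was already waterlogged."))
```

The design choice to notice is that the original wording is never discarded: the symbolic gloss sits beside the prose, so whatever the pattern fails to capture is still available to the reader or the model.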

Two more notes suggest scaffolding that helps. Forcing explicit reasoning steps surfaces hidden premises: "Can structured argument prompts make LLM reasoning more rigorous?" shows structured critical-question prompting makes models check warrants and backing they'd otherwise skip, the same implicit-premise gaps that wreck a formalisation. And the capability isn't absent: "Can language models actually analyze language structure?" finds reasoning models can build syntactic trees and phonological generalisations through step-by-step analysis. So the raw analytic ability exists; it just has to be deliberately invoked rather than assumed.
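One way such scaffolding might look in practice is a prompt template that walks through Toulmin's six argument components before any formalisation is attempted. This is a minimal sketch: the component names are Toulmin's, but the question wording, the `CRITICAL_QUESTIONS` table, and the `build_prompt` helper are hypothetical and are not taken from the CQoT paper.

```python
# Toulmin's six components; the question wording is hypothetical.
CRITICAL_QUESTIONS = {
    "claim": "What exactly is being asserted?",
    "grounds": "What stated facts support the claim?",
    "warrant": "What unstated rule links the grounds to the claim?",
    "backing": "What justifies trusting that rule?",
    "qualifier": "How strong is the claim (always, usually, possibly)?",
    "rebuttal": "Under what conditions would the claim fail?",
}


def build_prompt(passage: str) -> str:
    """Assemble a plain-text prompt that asks each question in order."""
    lines = [f"Passage: {passage}", "", "Before formalising, answer each question:"]
    for i, (part, question) in enumerate(CRITICAL_QUESTIONS.items(), start=1):
        lines.append(f"{i}. ({part}) {question}")
    lines.append("")
    lines.append("Only then write the formal specification, listing every premise used.")
    return "\n".join(lines)


print(build_prompt("Orders over 50 euros ship free, so this order ships free."))
```

The warrant and backing questions are the ones aimed at implicit premises: in the sample passage, the unstated premise is that this order exceeds 50 euros.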

The thing you may not have expected to learn: the failure mode isn't that LLMs can't speak logic. They can. It's that informal prose is underdetermined, and a system that reasons by smooth semantic association rather than symbolic commitment will paper over the ambiguity instead of resolving it — producing fluent, valid, and quietly wrong specifications. That's why the best results come from keeping the human-readable text in the loop, not replacing it.


Sources (6 notes)

Can large language models translate natural language to logic faithfully?

LLMs generate well-formed logical expressions that are semantically incorrect, with errors clustering at scope ambiguity, quantifier precision, and predicate granularity. The asymmetry suggests LLMs understand formal language better than they can generate it.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Why does partial formalization outperform full symbolic logic?

QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Can language models actually analyze language structure?

OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about LLM autoformalisation—converting informal prose into logically correct formal specifications. The question remains open: *Can LLMs reliably translate ambiguous natural language into faithful formal logic, and if not, what architectural or training changes might enable it?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable constraints to be re-examined:

• LLMs generate syntactically valid but semantically incorrect logic; GPT-4 disambiguates prose only 32% of the time versus 90% for humans, and cannot hold multiple interpretations simultaneously (~2023).
• Models reason by in-context semantic association, not symbolic manipulation; when semantics decouple from task, performance collapses even with correct formal rules provided (~2023).
• Scope ambiguity, quantifier precision, and predicate granularity are failure hotspots in formalisation (~2023).
• Partial symbolic abstraction (QuaSAR, Logic-of-Thought) outperforms both pure prose and full formalisation by 4–8%, suggesting that *less* complete symbolisation yields *more* logical correctness (~2025).
• Structured critical-question prompting surfaces hidden premises and implicit gaps that wreck formalisation; step-by-step metalinguistic analysis (syntactic trees, phonological reasoning) is within-scope for reasoning models (~2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2304.14399 (2023-04): We're Afraid Language Models Aren't Modeling Ambiguity
• arXiv:2305.14825 (2023-05): Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners
• arXiv:2502.12616 (2025-02): Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions
• arXiv:2412.15177 (2024-12): Critical-Questions-of-Thought: Steering LLM Reasoning with Argumentative Querying

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 32% disambiguation rate, quantifier-precision failures, and semantic-reasoning bottleneck: has newer model scaling, instruction-tuning, or formal-logic-specific pretraining since closed these gaps? Separate the durable question (ambiguity in prose is inherent; how do systems *choose* an interpretation?) from the perishable limitation (current models cannot disambiguate). Cite what resolved it.
(2) Surface the strongest *reconciling* work from the last 6 months. The library's tension—that partial formalisation beats full formalisation—contradicts intuitions about formal correctness. Has recent work explained *why* semantic richness aids logical fidelity, or proposed a unified account of symbolic vs. semantic reasoning in this domain?
(3) Propose 2 research questions that assume the regime may have shifted: (a) Can retrieval-augmented formalisation (coupling prose input to a curated formal-logic corpus) overcome the disambiguation bottleneck? (b) Do multimodal or graph-structured reasoning paths enable LLMs to commit to a single logical form without losing prose ambiguity-awareness?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
