INQUIRING LINE

Can LLMs successfully translate natural language into formal solver specifications?

This explores whether LLMs can take a plain-language problem and faithfully convert it into the structured input a logic engine or optimization solver actually needs — and what breaks when they try.


This explores whether LLMs can take a plain-language problem and faithfully convert it into the structured input a logic engine or optimization solver actually needs. The corpus gives a sharp, two-sided answer: LLMs are good at producing the *form* of a formal spec but unreliable at preserving the *meaning* — and that gap is exactly where systems succeed or fail.

The core finding is an asymmetry. LLMs generate syntactically valid logical expressions that are semantically wrong, with errors clustering at scope ambiguity, quantifier precision, and predicate granularity Can large language models translate natural language to logic faithfully?. Interestingly, they seem to *read* formal language better than they *write* it — recognition outruns generation. That single asymmetry reframes the whole question: the bottleneck isn't whether the model understands logic, it's whether it can commit a messy sentence to a precise formal commitment without quietly dropping a constraint.

The corpus's most constructive response is to stop asking the LLM to do the whole job alone. Logic-LM splits the labor — the LLM drafts a symbolic representation, a deterministic solver runs the actual inference, and the solver's machine-verifiable error messages feed back to catch translation mistakes Can symbolic solvers fix how LLMs reason about logic?. That feedback loop beats LLM self-critique precisely because the failure (bad translation) gets caught by something external, which echoes the broader result that models can't validate their own fixes without an outside checker What stops large language models from improving themselves?. A related design lesson: don't formalize everything. Partial symbolic augmentation — enriching natural language with selective formal structure rather than replacing it wholesale — outperforms both pure language and full formalization, because full formalization throws away semantic information the solver never needed in rigid form Why does partial formalization outperform full symbolic logic?.

There's also a harder ceiling lurking behind the translation question. Even when the spec is correct, LLMs plateau around 55–60% constraint satisfaction on genuine optimization problems, regardless of scale or reasoning training Do larger language models solve constrained optimization better? — and a deeper reason is that they pattern-match memorized solution templates instead of actually executing iterative numerical methods Do large language models actually perform iterative optimization?. This is the strongest argument *for* the solver-integration approach: if the model can't reliably solve, offload solving to the solver and make the LLM responsible only for an accurate translation it can then verify.

The counterweight worth knowing: LLMs genuinely can do structured linguistic and symbolic analysis when forced to reason step by step — building syntactic trees and phonological generalizations, not just performing language Can language models actually analyze language structure?. So the deficit isn't a flat inability to handle formal structure; it's faithfulness under translation. The takeaway you might not have expected: the path to reliable autoformalization isn't a bigger model, it's a tighter loop — let the LLM draft, let the solver judge, formalize only what must be formal, and treat every spec as something to verify rather than trust.


Sources 7 notes

Can large language models translate natural language to logic faithfully?

LLMs generate well-formed logical expressions that are semantically incorrect, with errors clustering at scope ambiguity, quantifier precision, and predicate granularity. The asymmetry suggests LLMs understand formal language better than they can generate it.

Can symbolic solvers fix how LLMs reason about logic?

Logic-LM divides cognitive labor by having LLMs formulate symbolic representations while deterministic solvers execute inference and provide machine-verifiable error messages. This structured feedback loop catches translation errors better than LLM self-critique, improving faithful reasoning without requiring perfect formalization.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Why does partial formalization outperform full symbolic logic?

QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Can language models actually analyze language structure?

OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a verification analyst testing whether LLM-to-solver translation has moved since mid-2023. The question remains open: can LLMs faithfully convert natural language into formal specifications that solvers can execute?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat them as perishable:
• LLMs generate syntactically valid but semantically wrong specs; errors cluster at scope ambiguity, quantifiers, and predicate granularity (~2023).
• Recognition (reading formal specs) outpaces generation (writing them) — asymmetry identifies the bottleneck as *faithful commitment*, not logical understanding (~2023).
• Logic-LM architecture (LLM drafts → solver judges → feedback loop) catches translation errors better than self-critique, because external verification works where internal validation fails (~2023–2025).
• Partial symbolic augmentation (selective formalization) outperforms pure language or full formalization (~2024–2025).
• LLMs plateau at 55–60% constraint satisfaction on genuine optimization; they pattern-match memorized templates rather than execute iterative numerical methods (~2024–2026).

Anchor papers (verify; mind their dates):
• arXiv:2305.12295 (Logic-LM, May 2023)
• arXiv:2511.20471 (Universe of Thoughts, Nov 2025)
• arXiv:2603.23004 (Can LLMs Reason and Optimize Under Constraints?, Mar 2026)
• arXiv:2502.12616 (Quasi-Symbolic Abstractions, Feb 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, determine whether post-2024 advances in in-context learning, chain-of-thought scaling, tool-use orchestration (agentic loops, multi-step verification harnesses), or newer solver integration have relaxed the semantic gap or the 55–60% ceiling. Separate the durable question (faithful symbolic translation under noise) from the perishable limitation (e.g., "LLMs cannot self-verify" — does external verification now work reliably?). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months (late 2025 onward) that either improves the translation fidelity directly or reframes the problem to bypass it.
(3) Propose 2 research questions that ASSUME the regime may have shifted: (a) What if reliable translation now requires not bigger models but *better solver integration* or *cheaper symbolic preprocessing*? (b) Can partial formalization + agentic feedback loops now achieve >70% fidelity on real-world constraint problems?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines