Why does augmenting natural language with formal representations outperform full formalization?
This explores why the best results come from enriching natural language with selective bits of formal structure, rather than translating everything into symbolic logic — and what that says about how language models actually reason.
The short version from the corpus: full formalization throws away information the model still needs, while pure natural language gives it no scaffolding. The sweet spot is to keep the natural language and bolt on just enough symbolic structure to expose the logical skeleton. Methods like QuaSAR and Logic-of-Thought report 4–8% accuracy gains by doing exactly this, and the gain comes from preserving both the semantic richness of language and the structure of logic at once (see "Why does partial formalization outperform full symbolic logic?").
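As a rough illustration of what "keep the language, bolt on structure" can look like in a prompt, here is a minimal sketch. It is not the QuaSAR or Logic-of-Thought pipeline: the function name `augment_with_logic`, the prompt wording, and the hand-written hints are all assumptions made for this example. In a real system the hints would come from an extraction step.

```python
def augment_with_logic(passage: str, propositions: list[str], question: str) -> str:
    """Build a prompt that keeps the original passage and appends symbolic hints.

    Hypothetical helper: the passage is never replaced, the hints only sit
    alongside it.
    """
    lines = ["Passage:", passage.strip(), ""]
    if propositions:
        # The hints are supplementary; the passage stays the source of truth.
        lines.append("Logical hints (supplementary; the passage remains authoritative):")
        lines.extend(f"- {p}" for p in propositions)
        lines.append("")
    lines.extend(["Question:", question.strip()])
    return "\n".join(lines)


# Hand-written hints, standing in for whatever an extraction step would produce.
passage = "If it rains, the street gets wet. It is raining."
hints = ["rains -> street_wet", "rains"]
example_prompt = augment_with_logic(passage, hints, "Is the street wet?")
print(example_prompt)
```

The design choice worth noticing is that the symbolic lines are additive. If the hint list is empty or wrong, the model still has the full passage to reason over, which is the property full formalization gives up.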
The deeper reason full formalization fails shows up when you look at what happens during translation. Models can write logic that is syntactically valid but semantically wrong, and the errors cluster exactly where natural language is slippery: scope ambiguity, quantifier precision, and how finely a predicate is carved up. Interestingly, models seem to understand formal language better than they can generate it, so the act of converting prose into clean logic is itself a lossy, error-prone bottleneck (see "Can large language models translate natural language to logic faithfully?"). Every time you force a full translation, you risk baking those translation errors into the input the model reasons over.
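A standard textbook case of scope ambiguity makes the failure mode concrete. The sentence below is an illustration chosen for this note, not an example taken from the cited work. "Every student read a book" has two well-formed first-order translations:

```latex
% Reading 1: each student read some book, possibly a different one per student
\forall x \, \bigl( \mathrm{Student}(x) \rightarrow \exists y \, ( \mathrm{Book}(y) \land \mathrm{Read}(x, y) ) \bigr)

% Reading 2: there is one particular book that every student read
\exists y \, \bigl( \mathrm{Book}(y) \land \forall x \, ( \mathrm{Student}(x) \rightarrow \mathrm{Read}(x, y) ) \bigr)
```

Both formulas parse, but they have different truth conditions: Reading 2 entails Reading 1, and not the other way around. A translator that commits to the wrong one produces syntactically valid logic that misstates the sentence, and the original wording that could have disambiguated it is gone.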
There's also a reason the natural language part is load-bearing rather than just convenient. When you strip semantic content away and leave only the formal rules, model performance collapses, even with the correct rules in context. These systems reason through semantic associations and learned commonsense, not through symbolic manipulation of abstract tokens (see "Do large language models reason symbolically or semantically?"). Full formalization essentially removes the thing the model is actually good at. Augmentation keeps the semantic handholds while adding structure as a guide rail, which fits how the machinery really works.
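One way to picture "stripping semantic content away" is to swap meaningful words for arbitrary symbols while leaving the rule structure intact. The sketch below does that with a plain regex substitution. It is only an illustration of the idea, not the procedure used in the cited work, and `abstract_terms` and the `P1`, `P2`, ... naming scheme are assumptions.

```python
import re


def abstract_terms(text: str, terms: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each listed term with an opaque symbol (P1, P2, ...).

    Hypothetical helper: the logical form of the text is unchanged, but the
    words that carry commonsense associations are removed.
    """
    if not terms:
        return text, {}
    mapping = {term: f"P{i}" for i, term in enumerate(terms, start=1)}
    # Try longer terms first so a short term never matches inside a longer one.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping


rule = "If an animal is a bird and is not a penguin, then it can fly."
abstract_rule, symbol_map = abstract_terms(rule, ["bird", "penguin", "fly"])
print(abstract_rule)  # If an animal is a P1 and is not a P2, then it can P3.
```

The two versions of the rule are logically equivalent, so a purely symbolic reasoner would handle them identically. The claim above is that LLMs do not, because the second version offers none of the associations they lean on.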
This doesn't mean formal structure is useless; the corpus is more interesting than that. Formal structure helps a lot when it's a complement rather than a replacement. Pre-pretraining on hierarchical formal languages lets models reach equivalent loss with 33% fewer natural language tokens and improves syntactic generalization, and the attention heads it builds remain critical for syntactic performance on natural language (see "Can formal language pretraining make language models more efficient?"). Inside reasoning chains, likelihood-preserving pruning keeps symbolic-computation tokens preferentially and drops grammar and meta-discourse first (see "Which tokens in reasoning chains actually matter most?"). So the pattern across the collection is consistent: formal structure is most powerful as an additive layer the model leans on, not as a cage that replaces the language it actually thinks in.
The thing you might not have expected: the winning move isn't choosing between language and logic at all. LLMs are semantic engines that can be steered by structure, but they break when the structure is forced to carry the whole load.
Sources (5 notes)
QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.
LLMs generate well-formed logical expressions that are semantically incorrect, with errors clustering at scope ambiguity, quantifier precision, and predicate granularity. The asymmetry suggests LLMs understand formal language better than they can generate it.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Pre-pretraining 1B models on hierarchical formal languages achieves equivalent loss and better syntactic generalization using 33% fewer natural language tokens. The mechanism persists: attention heads trained on formal languages remain critical for syntactic performance on natural language.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.