What makes structural logic correlate so strongly with contextual consistency?
This explores why the *form* of reasoning — its structure and layout — seems to drive LLM performance more than actual logical validity, and what that says about how models stay coherent with context.
This reads the question as: why does the *shape* of reasoning matter so much more than whether the reasoning is actually valid? The corpus has a surprisingly blunt answer: language models learn the form of reasoning, not the inference behind it. The most direct evidence is that chain-of-thought prompts with logically *invalid* steps perform nearly as well as valid ones on BIG-Bench Hard (Does logical validity actually drive chain-of-thought gains?). If broken logic still works, then it was never the logic doing the lifting; it was the structural scaffolding. A broader survey of what makes CoT tick finds the same thing from another angle: training *format* shapes reasoning strategy 7.5× more than the actual domain, and just moving a demonstration to a different position can swing accuracy by 20% (What makes chain-of-thought reasoning actually work?).
Structure correlates so tightly with staying coherent in context because, for these models, structure *is* the mechanism. They reproduce familiar reasoning patterns absorbed from training rather than performing novel symbolic steps, which is why performance degrades predictably the moment you push them off the distribution they learned those patterns on (Does chain-of-thought reasoning reveal genuine inference or pattern matching?). You can show this cleanly by stripping the meaning out: when semantic content is decoupled from a reasoning task, performance collapses even though the correct rules are sitting right there in the prompt (Do large language models reason symbolically or semantically?). The model wasn't manipulating the rules; it was riding the familiar semantic groove. Structural consistency holds only as long as the surface form looks like something the model has seen.
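A minimal sketch of how a decoupling probe like this might be built, assuming the simplest possible approach: swap each content word for an arbitrary placeholder, so the rules stay logically identical while the familiar meaning disappears. The function name `decouple_semantics`, the `X1`-style placeholders, and the bird example are illustrative assumptions, not the setup used in the cited note.

```python
import re


def decouple_semantics(text: str, terms: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each listed content word with an arbitrary placeholder token."""
    # Placeholders are numbered in the order the terms are given.
    mapping = {term: f"X{i}" for i, term in enumerate(terms, start=1)}
    # Try longer terms first so a short term never matches inside a longer one.
    alternatives = "|".join(
        re.escape(t) for t in sorted(terms, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping


# Hypothetical rule set: same logical form before and after the substitution.
rules = "If something is a bird then it can fly. Tweety is a bird."
symbolic, mapping = decouple_semantics(rules, ["bird", "fly", "Tweety"])
print(symbolic)  # If something is a X1 then it can X2. X3 is a X1.
```

Both versions license the same conclusion, so any accuracy gap between them would point to reliance on the words' meaning rather than on the rules.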
That dependence on surface form has a sharp edge: when the structure of the *input* gets genuinely complex, the apparent competence frays. Grammatical performance declines predictably as syntactic depth and embedding increase: simple sentences are fine, while recursion and nesting fail consistently (Does LLM grammatical performance decline with structural complexity?). The same blind spot shows up in entailment, where presupposition triggers and non-factive verbs get read as surface cues instead of as operators that flip a sentence's meaning (Why do embedding contexts confuse LLM entailment predictions?). The correlation between structure and consistency, in other words, is also a ceiling: it breaks exactly where real structural computation would be required.
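One way to probe the depth effect described above is to generate sentences whose only difference is how many relative clauses are nested inside the subject. The sketch below is an illustrative generator for center-embedded sentences; the function name `center_embedded` and the small vocabulary are assumptions made for the example, not materials from the cited notes.

```python
NOUNS = ["rat", "cat", "dog", "boy", "girl"]
VERBS = ["chased", "bit", "saw", "liked"]


def center_embedded(depth: int) -> str:
    """Build a sentence with `depth` nested relative clauses in the subject."""
    max_depth = min(len(NOUNS) - 1, len(VERBS))
    if not 0 <= depth <= max_depth:
        raise ValueError(f"depth must be between 0 and {max_depth}")
    # Open one clause per level: "the rat", "that the cat", "that the dog", ...
    parts = ["the " + NOUNS[0]]
    parts += ["that the " + noun for noun in NOUNS[1 : depth + 1]]
    # Close the clauses from the innermost outwards, then add the main verb.
    parts += VERBS[:depth]
    parts.append("died")
    sentence = " ".join(parts)
    return sentence[0].upper() + sentence[1:] + "."


for d in range(3):
    print(d, center_embedded(d))
# 0 The rat died.
# 1 The rat that the cat chased died.
# 2 The rat that the cat that the dog chased bit died.
```

Every sentence asks the same thing of a reader (who died?), so a drop in accuracy as `depth` grows can be attributed to the nesting rather than to vocabulary or topic.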
Here is the less obvious part: the fix isn't more logic, it's *better-placed* structure. Partial symbolic augmentation, which enriches natural language with selective formal elements rather than replacing it, beats both plain language and full formalization, because pure language lacks structure while full formalization throws away semantic information (Why does partial formalization outperform full symbolic logic?). Giving models explicit structural scaffolds works too: argument-scheme prompts that force a model to name its warrants catch failures that ordinary CoT waves through (Can structured argument prompts make LLM reasoning more rigorous?), and symbolic rules drawn from a knowledge graph's topology give reasoning a navigational plan that semantic similarity alone can't (Can symbolic rules from knowledge graphs guide complex reasoning?). The throughline: structure correlates with consistency because structure is what the model actually runs on, so the leverage is in supplying the right structure, not in hoping for hidden logic.
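To make the idea of an explicit structural scaffold concrete, here is a rough sketch of an argument-scheme prompt built on the six components of Toulmin's model (claim, grounds, warrant, backing, qualifier, rebuttal). The instruction wording, the `build_argument_prompt` helper, and the sample question are all assumptions made for illustration; this is not the prompt used by the method cited above.

```python
# Toulmin's six argument components, each paired with an illustrative instruction.
TOULMIN_STEPS = [
    ("claim", "State the conclusion you are arguing for."),
    ("grounds", "List the facts from the problem that support the claim."),
    ("warrant", "Name the rule that licenses the step from grounds to claim."),
    ("backing", "Explain why that rule can be trusted in this case."),
    ("qualifier", "Say how certain the claim is (always, usually, possibly)."),
    ("rebuttal", "Give a condition under which the claim would fail."),
]


def build_argument_prompt(question: str) -> str:
    """Wrap a question in a numbered scaffold that asks for every component."""
    lines = [
        f"Question: {question}",
        "",
        "Fill in every field below before giving a final answer.",
    ]
    for number, (name, instruction) in enumerate(TOULMIN_STEPS, start=1):
        lines.append(f"{number}. {name.upper()}: {instruction}")
    lines.append("Final answer:")
    return "\n".join(lines)


prompt = build_argument_prompt("Is 91 a prime number?")
print(prompt)
```

The point of the numbered fields is that the warrant and backing cannot be silently skipped: an empty or circular entry is visible in the output, where ordinary free-form CoT would simply move on.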
Sources (9 notes)
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.
LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.
QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.
Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.
SymAgent derives symbolic rules from KG structure using LLM reasoning to create navigational plans that align natural language with graph topology. This approach captures structural reasoning patterns explicitly, outperforming retrieval methods that rely on semantic similarity alone.