Do large language models reason symbolically or semantically?
Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.
The "In-Context Semantic Reasoners" paper tests a fundamental question about what drives LLM reasoning by systematically decoupling semantics from the reasoning process across deduction, induction, and abduction tasks. The findings are clear: when semantics are consistent with commonsense, LLMs perform well; when semantics are removed or made counter-commonsense, performance collapses even when correct rules are provided in context.
The experimental design is precise. By replacing relation labels with shuffled alternatives ("motherOf" → "sisterOf", "female" → "male"), the researchers create tasks where the in-context rules are logically valid but semantically counter-intuitive. LLMs cannot follow these counter-commonsense rules despite having them explicitly in the prompt. The model's parametric knowledge — its compressed commonsense from training — overrides the in-context logical structure.
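The label-shuffling step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's actual pipeline: the tuple representation of atoms, the helper names (`relabel_atom`, `relabel_task`, `render_rule`), and the example facts are all hypothetical. It shows that a consistent renaming of predicates leaves the logical structure of the task intact while making the rules read as counter-commonsense.

```python
# Hypothetical toy representation: an atom is (predicate, arg1, arg2, ...).
# A rule is (head_atom, [body_atoms]); uppercase arguments are variables.

def relabel_atom(atom, mapping):
    """Rename the predicate of one atom, leaving its arguments untouched."""
    pred, *args = atom
    return (mapping.get(pred, pred), *args)

def relabel_task(facts, rules, mapping):
    """Apply the same renaming to every fact and rule, so the task stays logically valid."""
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("mapping must be one-to-one, or distinct relations would merge")
    new_facts = [relabel_atom(f, mapping) for f in facts]
    new_rules = [
        (relabel_atom(head, mapping), [relabel_atom(b, mapping) for b in body])
        for head, body in rules
    ]
    return new_facts, new_rules

def render_atom(atom):
    pred, *args = atom
    return f"{pred}({', '.join(args)})"

def render_rule(rule):
    """Turn a rule into a sentence that could be placed in a prompt."""
    head, body = rule
    return "If " + " and ".join(render_atom(b) for b in body) + " then " + render_atom(head) + "."

# Commonsense-consistent task (example data is made up for illustration).
facts = [("motherOf", "alice", "bob"), ("female", "alice")]
rules = [(("parentOf", "X", "Y"), [("motherOf", "X", "Y")])]

# Swap labels as in the note's example: "motherOf" -> "sisterOf", "female" -> "male".
mapping = {"motherOf": "sisterOf", "sisterOf": "motherOf",
           "female": "male", "male": "female"}

shuffled_facts, shuffled_rules = relabel_task(facts, rules, mapping)
print(render_rule(shuffled_rules[0]))
```

The shuffled rule is still a perfectly usable inference rule; only its fit with everyday kinship knowledge has changed, which is the property the experiment relies on.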
This reveals a specific computational mechanism: LLMs create "superficial logical chains" through semantic token associations, not through symbolic manipulation. The connections between tokens that enable multi-step reasoning are semantic connections, not logical ones. When those semantic connections support the correct answer, reasoning appears to work. When they conflict, reasoning fails regardless of what the prompt says.
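For contrast, here is what symbolic manipulation looks like in the sense used above: a small forward-chaining reasoner that treats predicate names as opaque tokens. It is a sketch under assumptions, with hypothetical helper names (`unify`, `forward_chain`, `relabel`) and made-up facts, and it is not drawn from the paper. Because it only matches symbols, renaming "motherOf" to "sisterOf" everywhere does not change which conclusions it reaches.

```python
# Hypothetical toy reasoner: atoms are tuples (predicate, arg1, ...),
# rules are (head_atom, [body_atoms]), uppercase arguments are variables.

def unify(pattern, fact, binding):
    """Match one rule atom against a ground fact, extending the variable binding."""
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    binding = dict(binding)
    for p, f in zip(pattern[1:], fact[1:]):
        if p.isupper():                      # variable
            if binding.setdefault(p, f) != f:
                return None
        elif p != f:                         # constant mismatch
            return None
    return binding

def forward_chain(facts, rules):
    """Apply rules repeatedly until no new facts can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            bindings = [{}]
            for atom in body:
                bindings = [b2 for b in bindings for fact in known
                            if (b2 := unify(atom, fact, b)) is not None]
            for b in bindings:
                derived = (head[0], *(b.get(a, a) for a in head[1:]))
                if derived not in known:
                    known.add(derived)
                    changed = True
    return known

def relabel(atom, mapping):
    """Rename an atom's predicate; arguments are unchanged."""
    return (mapping.get(atom[0], atom[0]), *atom[1:])

facts = [("motherOf", "ann", "beth"), ("motherOf", "beth", "cara")]
rules = [
    (("parentOf", "X", "Y"), [("motherOf", "X", "Y")]),
    (("grandparentOf", "X", "Z"), [("parentOf", "X", "Y"), ("parentOf", "Y", "Z")]),
]

# Counter-commonsense variant: the same task with "motherOf" renamed to "sisterOf".
mapping = {"motherOf": "sisterOf"}
cc_facts = [relabel(f, mapping) for f in facts]
cc_rules = [(relabel(h, mapping), [relabel(b, mapping) for b in body]) for h, body in rules]

original = forward_chain(facts, rules)
shuffled = forward_chain(cc_facts, cc_rules)
```

Both runs derive the two-step conclusion that ann is a grandparent of cara, and the shuffled closure is exactly the renamed original closure. The note's claim is that LLM performance does not show this invariance.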
The implication is that LLM reasoning is fundamentally bounded by training distribution semantics. Read alongside the note "Can large language models translate natural language to logic faithfully?", the failure is bidirectional: LLMs can neither translate TO formal logic faithfully nor reason FROM formal logic when it conflicts with semantic priors. Read alongside the note "Do foundation models learn world models or task-specific shortcuts?", the semantic dependency IS the heuristic: the model uses semantic similarity as a proxy for logical validity.
This connects to the Dual Process Theory framework: human System II reasoning is characterized as symbolic and able to operate independently of semantic content, whereas LLM "reasoning" remains entangled with System I semantic associations. The paper's suggestion, integrating LLMs with external non-parametric knowledge bases and improving in-context knowledge processing, implicitly acknowledges that the LLM alone cannot escape this limitation.
Retort implication (rules out a class of anthropomorphization): The finding constrains what we can say about LLM behavior in other domains. Any account that treats LLMs as agents who "reverse-engineer" justifications for conclusions they have committed to (the standard anthropomorphization of sycophancy, rationalization, or motivated reasoning) presupposes a semantics-independent reasoning competence that this note shows LLMs lack. If reasoning collapses when semantics are decoupled, there is no separable reasoning faculty available to perform a post-hoc rationalization. What looks like reverse-engineering is pattern-matching within semantic associations. This rules out a whole class of AI commentary that treats LLMs as dishonest agents who could have reasoned correctly but chose not to.
Metaphor as paradigmatic semantic decoupling: Metaphor is the literary instantiation of this finding. A metaphor works by using one domain's vocabulary to illuminate another — "time is money," "argument is war," "memory is a jar of flies." The decoupling between the source domain's semantics and the target domain's meaning is the defining feature of metaphorical language. Since LLM reasoning collapses when semantics are decoupled from their typical packaging, and metaphor is decoupled semantics, this predicts a specific failure mode: LLMs should handle conventional metaphors (lexicalized, semantically consistent with commonsense) better than novel literary metaphors (where the mapping between domains is unexpected and requires conceptual reasoning beyond semantic association). The Diplomat dataset (Diplomat: A Dialogue Dataset for Situated PragMATic Reasoning) suggests treating all figurative language as a unified pragmatic reasoning task — but the semantic-decoupling finding predicts that this unified approach will hit a wall at the novelty threshold where metaphors stop relying on conventional semantic associations.
Inquiring lines that use this note as a source (246)
This note is a source for these synthesized inquiries.
- Can LLMs infer situational context the way humans do pragmatically?
- Why do LLMs fall for and deploy logical fallacies with equal confidence?
- Why do LLMs fail inter-annotator agreement tests on argument evaluation?
- How does surface salience compete with background knowledge in model inference?
- Can prompt-based debiasing overcome entrenched LLM model priors?
- How do different LLM integration paradigms affect inheritance of pretraining biases?
- How does LLM-PKG compare to mining product relations directly from interaction data?
- Does epistemic drift operate the same way across all languages?
- Can evidence density alone shift an LLM from generation to reasoning?
- How do transformers perform multi-hop reasoning across distant training documents?
- How much of LLM reasoning failure stems from missing knowledge versus signal weighting?
- Do modern architectures in NLP and vision rely on dot products intentionally?
- Why do simple length heuristics outperform sophisticated semantic methods?
- Can explicit numerical signals override learned linguistic defaults in fine-tuned models?
- How does the frame problem differ between symbolic and statistical reasoning systems?
- Can neural networks represent symbolic structures without explicit mechanisms?
- How does syntactic encoding relate to semantic feature representation?
- Why do LLM outputs match researcher priors without solving tasks correctly?
- Do language models learn surface patterns instead of underlying linguistic principles?
- Why does explicit theory injection work better than example-based learning for reasoning tasks?
- How does the outer loop escape its own LLM's knowledge boundaries when discovering mechanisms?
- How do humans use associative reasoning without causal connections?
- Why do contrastive reasoning approaches outperform single-path belief evaluation?
- How deeply are ideological structures represented in large language models?
- Can symbolic mechanisms improve transformer compositional abilities?
- Can symbolic solvers rescue language models from logical reasoning failures?
- Can autoregressive models learn faithful translation to logical representations without semantic loss?
- What circuit mechanisms produce belief bias in syllogistic reasoning?
- Can we distinguish between semantic and symbolic reasoning in language models?
- How do humans and LMs differ on multi-hop reasoning?
- How does semantic grounding differ between human minds and language models?
- Can language models reason without relying on learned semantic patterns?
- Where do humans and language models actually diverge in reasoning ability?
- Why do language models substitute parametric knowledge over retrieved context mid-reasoning?
- Why do language models imitate reasoning form without abstract inference capability?
- Can reasoning chains work without logical validity?
- Should LLM reasoning be studied as latent state trajectories rather than surface text?
- How does business logic specification replace annotated training datasets?
- Can explicit connectives compensate for missing intentional tracking in LLMs?
- Can explicit stack mechanisms extend what formal languages transformers can learn?
- Do reasoning languages like Prolog follow the same two-constraint transfer pattern?
- Why do embeddings measure semantic association instead of task relevance?
- Why do power-law distributions make standard ML infrastructure assumptions fail?
- Why do open-source models trained on proprietary outputs still fail at reasoning?
- Why does explicit reasoning degrade passage reranking performance?
- What makes symbolic operations different from general knowledge questions?
- What causes snowball errors to accumulate across reasoning steps in language models?
- Why does semantic decoupling specifically break LLM reasoning abilities?
- Do LLMs compute scalar implicature differently across conversational contexts?
- Can LLMs improve at metaphor if they handle decoupled semantics better?
- How does implicit meaning processing limit LLM pragmatic reasoning?
- Why does hypothesis attestation bias exist separately from frequency bias in NLI?
- Can language models acquire meaning from distributional patterns alone without joint attention?
- Why do models fail on logically equivalent tasks with different data distributions?
- Do sparse arithmetic circuits explain all language model reasoning abilities?
- Does generalization frequency explain why models favor upward semantic movement?
- How do LLMs compress specific expert knowledge into median abstraction?
- Why does NLI fine-tuning amplify frequency bias instead of teaching inference?
- Does more inference compute help reasoning models match specialized domain performance?
- When does long-context LLM reasoning fail where structured retrieval succeeds?
- How do LLMs and knowledge graphs work together in different integration patterns?
- Can latent reasoning in continuous space scale beyond supervised reasoning tasks?
- Is relevant knowledge encoded in LMs but not causally active in generation?
- Why do explicit discourse connectives help LLMs but implicit relations cause failures?
- Why do LLMs generate logical forms without preserving semantic content?
- Why do diffusion LLM answer tokens converge in confidence long before reasoning stabilizes?
- What reveals the epistemic limits of language models?
- Can small models solve complex tasks using externalized reasoning graphs?
- How does the symbol grounding problem apply to artificial language systems?
- What neuroscience evidence suggests language networks are not optimized for reasoning?
- Do latent sequence vectors outperform per-token latent iterative computation for reasoning?
- Can LLMs infer implicit meaning without surface linguistic markers?
- Do LLMs rely on surface heuristics instead of learning recursive grammar rules?
- How do embedding contexts like presupposition triggers affect LLM entailment reasoning?
- Can complexity-stratified testing reveal whether LLMs understand grammatical structure?
- Why do rare complex structures in training data harm LLM generalization?
- Why do LLMs fail at semantic generalization despite grammatical accuracy?
- How does training data distribution constrain LLM moral reasoning patterns?
- Which game type reveals minimax reasoning in language models?
- Why do true and false LLM outputs use the same mechanism?
- How do LLMs infer information that was explicitly censored?
- Can neural networks learn that A implies B in reverse?
- What internal mechanisms explain LLM reasoning and representation limits?
- Can training LLMs to form ad-hoc conventions improve their pragmatic reasoning?
- Do LLMs understand implicit warrants in reasoning chains?
- Why can LLMs identify argument structure but not check warrants?
- Why do LLMs fail when asked to use counter-commonsense rules explicitly?
- Can LLMs translate between natural language and formal logic faithfully?
- Do metaphors work by decoupling meaning from linguistic associations?
- Do reasoning models perform genuine logical evaluation or pattern matching?
- Why can't LLMs reason from first principles or initial commitments?
- How do explicit reasoning traces help models construct valid syntactic trees?
- Can LLMs identify implicit metaphoric mappings that require pragmatic inference?
- Can long-context models handle compositional reasoning requiring structured logic?
- How does context complexity affect LLM performance on temporal reasoning tasks?
- Why do LLMs inherit causal biases from their training data?
- Do LLMs rely on surface statistical patterns instead of causal structure?
- Why do LLMs choose surface-order quantifier scope over contextually correct readings?
- Why can LLMs interpret formal logic better than they generate it?
- Can LLM semantic representations exist without causally influencing their generation output?
- How does structural depth in sentences predict LLM annotation accuracy?
- Why do LLMs perform better on explicit discourse connectives than implicit relations?
- How does structural complexity affect LLM performance differently than inferential complexity?
- What specific linguistic features cause LLMs to fail at trivial entailment?
- Which knowledge types do LLMs handle better than humans in reasoning tasks?
- Can LLMs improve at simple deduction through different training approaches?
- Why does LLM compression eliminate causal grounding in conceptual representations?
- Can explicit linkers replace vector similarity for multi-step question answering?
- Why do models learn reasoning form instead of actual abstract inference?
- How does semantic reasoning differ from symbolic reasoning in language models?
- Why do language models struggle with formal logical reasoning and joins?
- Why does distillation transfer reasoning patterns with few examples?
- Can LLMs reliably generate novel working architectures without structured representations?
- How does inductive reasoning from partial evidence enable hypothesis formation?
- What separates pattern matching from genuine language understanding?
- Can targeted activation steering surface latent reasoning in base models?
- What makes reasoning-specific post-training different from standard parameter scaling?
- How much does training composition affect syntactic versus reasoning performance?
- How does training data format shape whether models reason in parallel or sequentially?
- Do LLMs learn linguistic generalizations or just surface-level frequency patterns?
- Do LLMs lack architectural scaffolding for compositional reasoning?
- Can language models reason without relying on surface level pattern matching?
- What makes deductive reasoning so brittle in language models overall?
- Do LLMs learn surface patterns instead of genuine linguistic structure?
- How does structural complexity in sentences degrade LLM reasoning systematically?
- Can LLMs compute how presuppositions project through embedded clauses?
- Why do non-factive verbs and triggers both fool language models?
- Why do explicit linguistic markers override semantic computation in models?
- What makes structural logic correlate so strongly with contextual consistency?
- How does in-context semantic reasoning differ from symbolic reasoning in concept fusion?
- Can latent reasoning mechanisms and recursive tracking mechanisms be combined effectively?
- Why do reasoning models fail when input length increases even below context limits?
- How does an instruction-following LLM activate latent retrieval knowledge?
- Do reflection tokens and symbolic tokens serve different roles in reasoning?
- How do we verify that stated beliefs actually follow from underlying motifs?
- Can explicit optimal algorithms prevent reasoning model collapse at high complexity?
- Why do recursive belief models require different training than logical derivation?
- What distinguishes inductive inference from negative evidence versus positive patterns?
- How does bidirectional entailment distinguish semantic equivalence from token similarity?
- Why does cross-text analogical reasoning fail when semantics decouple from symbols?
- How do recursive language models rethink where to store reasoning?
- Can continuous latent reasoning match discrete chain-of-thought without training modifications?
- Why does hierarchical formal language training improve token efficiency more than natural language?
- Can knowledge graph structure alone generate sufficient training signals for domain reasoning?
- Does training data format shape which reasoning strategies LLMs develop?
- Does LLM reasoning always match the outputs it generates?
- What architectural features drive sycophancy closer to inference than training?
- Why do reasoning-optimized models show no sycophancy resistance advantage?
- Can language models perform purely symbolic reasoning when semantics are removed?
- How does interleaving reasoning with action prevent hallucination in language models?
- Why does augmenting symbolic reasoning outperform replacing it entirely?
- What sparse mechanistic structures drive reasoning traces in language models?
- Why do LLMs struggle to translate natural language into logical formalizations?
- Can latent reasoning achieve the same substitution without tokens?
- Where does inference compute stop substituting for model capacity?
- How does semantic clustering help decide which model handles each query?
- Why does monological training prevent models from overriding statistical priors?
- Why does removing semantic content collapse reasoning in language models?
- Why do reasoning models fail at learning hidden rules from sparse exceptions?
- Can language models perform genuine symbolic reasoning without semantic grounding?
- How much does schema bloat actually degrade reasoning in large language models?
- Can you control LLM reasoning strategy without fine-tuning the model?
- What non-parametric methods could replace latent factors for inductive learning?
- Why do language models produce unfaithful chain of thought explanations?
- Can instance-adaptive reasoning happen without sequential token dependencies?
- Why do LLMs recognize graph entities without modeling their relationships?
- Why do LLMs fail at counterfactual reasoning despite factual knowledge?
- Can LLMs reason through semantics without understanding causal mechanisms?
- How does semantic association differ from mechanistic causal reasoning?
- Can LLMs simulate belief revision in social systems without modeling thought?
- How do single training examples activate reasoning capabilities in language models?
- Does structured decomposition improve LLM reasoning in other compound tasks?
- Can machine learning encode pragmatic reasoning about when rules should bend?
- Why do smaller LLMs fail at zero-shot argument scheme classification?
- Does compressing Walton's schemes into nine categories make LLM classification easier?
- Can LLM-generated descriptions of schemes outperform formal dictionary definitions for prompting?
- Can implicit association tests reveal LLM biases beneath trained responses?
- What latent mechanisms do LLMs use when they cannot execute iterative methods?
- Do base models contain latent reasoning that minimal training can unlock?
- Why do LLMs explain correct reasoning but then choose greedy actions?
- Why do LLM descriptions of argument schemes work better than formal definitions for classification?
- Why does augmenting natural language with formal representations outperform full formalization?
- How do deterministic symbolic solvers improve the reliability of language model reasoning?
- What implicit premises do language models skip even with correct surface reasoning?
- Do distributed relational tasks consistently underperform local classification across NLP domains?
- How do pretrained language models represent inferential patterns versus lexical and positional cues?
- How does subject-predicate distinction emerge from formal linguistic analysis?
- Do base models truly possess latent reasoning capability?
- Does latent reasoning capability exist in base models before any training?
- What other structural limits exist at the language-formal boundary?
- Why do smaller models lose reasoning faithfulness more than larger models?
- How do LLMs translate informal prose into logically correct formal specifications?
- How do corpus statistics shape the abstraction hierarchy in language model representations?
- Why do unit-sphere spaces fail at distinguishing word order and negation?
- What makes some contexts learnable as rules versus requiring model retraining?
- How do training associations override context information in language models?
- Can models reason at inference without specialized internal training?
- Can reinforcement learning close the gap between LLM reasoning and action?
- Can we use LLM language without adopting LLM assumptions?
- How do soft token mixtures enable parallel reasoning exploration without explicit training?
- How do LLMs lose information when translating natural language to formal logic?
- Why do LLMs fail at faithful autoformalisation of reasoning problems?
- Can symbolic solvers reliably replace LLM reasoning for logical tasks?
- What semantic information is necessary to preserve for sound LLM reasoning?
- Do independent LLM outputs converge enough to create artificial hiveminds?
- How can we probe LLM representations in channels that training did not target?
- Why might rationales that predict common text patterns fail on hard novel reasoning?
- Can reasoning learned from language modeling actually transfer to knowledge-intensive domains?
- Why do long-context language models struggle with compositional reasoning tasks?
- How much training data is truly necessary to unlock latent model reasoning?
- What evidence shows that reasoning chains encode token-level functional structure?
- Does the base model already contain latent reasoning capability?
- What does pass@k reveal about base model reasoning capacity?
- Can models possess latent reasoning capability that training signals fail to unlock?
- Why does unstructured chain-of-thought permit assumption-based errors that templates prevent?
- What makes natural language reasoning more practical than formal languages for multi-framework codebases?
- Why do reasoning-optimized models show no resistance advantage on agreement tasks?
- What kinds of reasoning tasks reveal the ceiling of text-only training?
- Why do language model reasoning chains look fluent when they deviate from the task?
- Do computational systems need formal argument analysis for explainability?
- Why do LLMs fail at iterative numerical computation in latent space?
- How does neuro-symbolic design differ from pure LLM reasoning?
- What mechanisms activate latent reasoning capabilities already present in base models?
- Can standard next-token prediction capture complex multi-step human reasoning directly?
- Can base models spontaneously produce reasoning traces without any RL training?
- What geometric structure do language models actually use during inference?
- How do semantic and symbolic reasoning capabilities differ in language models?
- Can irrelevant information reliably expose the limits of LLM reasoning?
- What makes hierarchical reasoning effective for taxonomy induction?
- How faithful are natural language explanations from LLMs really?
- Can structured workflows unlock latent reasoning abilities that raw models don't show?
- Can LLMs simultaneously reason and optimize their own modules?
- How does latent reasoning recursion compare to chain-of-thought reasoning?
- Can we detect redundant reasoning steps during model inference instead of training?
- How do logical forms of prompts influence what language models can derive?
- Can minimal training signals unlock latent reasoning capability in base models?
- Do language models need words to think or just latent structure?
- How does scaling and training data enable compositional behavior without symbolic mechanisms?
- How should we rethink the symbolism versus connectionism debate in light of LLMs?
- What empirical evidence supports the Learning Law on real language models?
- Can minimal training signals unlock reasoning already latent in pretrained representations?
- What latent reasoning capability do base models already possess before training?
- Why do LLMs reason fluently about causality but lack causal rigor?
- What prevents LLM representations from causally influencing generation outputs?
- How does tool-based reasoning expand what language models can do?
- Do LLMs show stronger reasoning about causality than about temporal ordering?
Related concepts in this collection (6)
- Can large language models translate natural language to logic faithfully?
  This explores whether LLMs can convert natural language statements into formal logical representations without losing meaning. It matters because faithful translation is essential for any AI system that reasons formally or verifies specifications.
  bidirectional semantic dependency: fails translating TO logic and reasoning FROM logic
- Do foundation models learn world models or task-specific shortcuts?
  When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
  semantic associations are the heuristic mechanism
- Why do language models ignore information in their context?
  Explores why language models sometimes override contextual information with prior training associations, and whether providing more context can solve this problem.
  same mechanism: parametric knowledge overrides in-context information
- Does semantic grounding in language models come in degrees?
  Rather than asking whether LLMs truly understand meaning, this explores whether grounding is actually a multi-dimensional spectrum. The question matters because it reframes the sterile understand/don't-understand debate into measurable, distinct capacities.
  functional grounding through semantic associations explains why reasoning works within commonsense boundaries
- Why do neural networks fail at compositional generalization?
  Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects (segregation, representation, and composition), each creating distinct failure modes in how networks handle structured information.
  the binding problem may explain WHY semantic decoupling collapses reasoning: without compositional binding mechanisms, removing semantic content removes the only glue holding multi-step inference together; semantic associations serve as a substitute for genuine compositional binding
- Do LLMs actually have world models or just facts?
  The term 'world model' conflates two different capabilities: factual representation versus mechanistic understanding. Understanding which one LLMs actually possess matters for assessing their reasoning reliability.
  semantic reasoning operates on factual world representation (Sense 1) but cannot perform mechanistic reasoning (Sense 2) when logic must override semantic priors
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning
- Language models show human-like content effects on reasoning tasks
- Probing Structured Semantics Understanding and Generation of Language Models via Question Answering
- Can Large Language Models Reason and Optimize Under Constraints?
Original note title
llms are in-context semantic reasoners not symbolic reasoners — when semantics are decoupled reasoning collapses