INQUIRING LINE


Can a model that learns 'A is B' automatically deduce 'B is A' — or does AI knowledge only run one way?

Can neural networks learn that A implies B in reverse?

This explores the 'reversal curse' — whether a model that learns 'A is B' can also answer 'B is A' — and what the corpus reveals about why directional facts don't flip on their own.

When a model trained on "A is B" is asked the mirror question "B is A," can it answer? The corpus's direct finding is that it cannot do so reliably, and the reason is more revealing than the failure itself. Autoregressive training encodes a *directional association* from A to B, not a *symmetric relation* between them. The knowledge ends up format-bound rather than abstractly relational, so the model has learned the sentence, not the fact (see "Why can't language models reverse learned facts?"). The common assumption that a network stores "X and Y are the same person" as a reversible link turns out not to match how the representation works.
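To make the contrast between a directional association and a symmetric relation concrete, here is a minimal sketch. It is not a neural network and not anything from the cited notes: `PrefixMemory` is a hypothetical toy that, like a left-to-right predictor, only stores "given this prefix, emit this next token." The person and work named in the training sentence are invented for illustration.

```python
from typing import Dict, Optional, Tuple


class PrefixMemory:
    """Hypothetical toy predictor: remembers prefix -> next token, left to right only."""

    def __init__(self) -> None:
        self.next_token: Dict[Tuple[str, ...], str] = {}

    def train(self, sentence: str) -> None:
        tokens = sentence.split()
        # Each prefix of the sentence is mapped to the token that followed it.
        for i in range(1, len(tokens)):
            self.next_token[tuple(tokens[:i])] = tokens[i]

    def complete(self, prompt: str) -> Optional[str]:
        tokens = prompt.split()
        generated = []
        # Keep extending while the current sequence is a prefix seen in training.
        while tuple(tokens) in self.next_token:
            nxt = self.next_token[tuple(tokens)]
            generated.append(nxt)
            tokens.append(nxt)
        return " ".join(generated) if generated else None


model = PrefixMemory()
# Invented example fact, stated in one direction only.
model.train("Mara Quill is the composer of Glass Rivers")

forward = model.complete("Mara Quill is the composer of")
reverse = model.complete("The composer of Glass Rivers is")
```

Here `forward` comes back as the memorized continuation, while `reverse` is `None`: nothing in the stored table links the work back to the person. Real language models generalize far more than this lookup table does, so the toy only illustrates the direction-of-storage point, not the actual mechanism.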

Why does this happen? A cluster of notes points to the same underlying habit: models lean on token associations and memorized patterns rather than formal logical structure. When researchers strip the familiar semantics out of a reasoning task and leave only the rules, performance collapses. That is evidence that LLMs reason through learned semantic associations, not the symbolic manipulation that would let a relation be inverted for free (see "Do large language models reason symbolically or semantically?"). The same picture shows up in compositional work: transformers often succeed by matching memorized computation subgraphs from training, and fail drastically on novel recombinations (see "Do transformers actually learn systematic compositional reasoning?"). A reversed fact is, in this sense, just a 'novel composition' the model never saw written down.

The most useful adjacent result is that you can train the reversal in deliberately. Models trained simultaneously on forward reasoning, backward question generation, and backward reasoning don't just answer inverse questions: their *forward* performance improves by 13.53% on average across 12 datasets. Generating the backward version forces the model to grasp the inverse relationship between problem and solution, and that understanding transfers (see "Can backward reasoning during training improve forward reasoning?"). This suggests the reversal curse isn't a hard architectural wall; it's a consequence of one-directional training data, and pointing the data both ways helps in both directions.
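The three training objectives named above can be sketched as a data-preparation step. This is an illustration under stated assumptions, not the format used in the underlying paper: the `ReasoningExample` fields, the `multitask_pairs` helper, and the bracketed task prefixes are all hypothetical, and the arithmetic problem is invented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ReasoningExample:
    question: str            # forward question
    reasoning: str           # forward reasoning that ends in the answer
    backward_question: str   # inverse question built around the forward answer
    backward_reasoning: str  # reasoning that recovers the original unknown


def multitask_pairs(ex: ReasoningExample) -> List[Tuple[str, str]]:
    """Turn one example into three (input, target) pairs, one per objective."""
    return [
        # 1. Forward reasoning: question -> reasoning.
        ("[forward] " + ex.question, ex.reasoning),
        # 2. Backward question generation: question -> inverse question.
        ("[make-backward] " + ex.question, ex.backward_question),
        # 3. Backward reasoning: inverse question -> its reasoning.
        ("[backward] " + ex.backward_question, ex.backward_reasoning),
    ]


# Invented example; the prefixes above are placeholders, not a real prompt format.
example = ReasoningExample(
    question="A shelf holds 3 boxes of 4 books each. How many books are there?",
    reasoning="3 boxes times 4 books is 12 books. Answer: 12",
    backward_question="A shelf holds 12 books in boxes of 4. How many boxes are there?",
    backward_reasoning="12 books divided by 4 per box is 3 boxes. Answer: 3",
)
pairs = multitask_pairs(example)
```

Each source example yields three training pairs, so the model sees the same relation from both ends during training while inference still uses only the forward form.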

There's a deeper structural angle worth pulling in. Even when a network produces the right outputs, its internal representation can be incoherent. This is the 'fractured entangled representation' idea, where two networks give identical answers while storing radically different and disorganized internals that standard benchmarks can't detect (see "Can AI pass every test while understanding nothing?"). The reversal curse is an everyday symptom of exactly this: a model can ace 'A is B' while having stored nothing that resembles a clean, reversible concept. Related work shows that strong training-time associations can override information sitting right in the context window, which is why simply prompting 'remember, A and B are equivalent' often fails to fix it (see "Why do language models ignore information in their context?").

If there's a thread to leave you with, it's this: the question 'can a network reverse what it learned' quietly assumes the network stored a relation in the first place. The corpus's answer is that it mostly stored a direction. The promising counter-direction is engineering for modularity and interpretability: sparse-weight training that yields disentangled circuits, and the finding that networks naturally decompose tasks into modular subnetworks. Both hint at architectures where a learned relation might one day be genuinely symmetric rather than memorized one way (see "Can sparse weight training make neural networks interpretable by design?" and "Do neural networks naturally learn modular compositional structure?").

Sources 8 notes

Why can't language models reverse learned facts?

Autoregressive training encodes directional associations rather than symmetric relations. Models trained on "A is B" cannot reliably retrieve answers for "B is A," revealing that knowledge representation is format-bound rather than abstractly relational.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Can backward reasoning during training improve forward reasoning?

Training models simultaneously on forward reasoning, backward question generation, and backward reasoning improves forward-only performance by 13.53% average across 12 datasets. The mechanism: generating backward questions forces models to understand the inverse relationship between problem and solution, deepening understanding that transfers to forward reasoning without test-time overhead.

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis holds that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.


Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Break It Down: Evidence for Structural Compositionality in Neural Networks (2.63 match, arXiv)
Faith and Fate: Limits of Transformers on Compositionality (2.53 match, arXiv)
Scaling can lead to compositional generalization (1.78 match, arXiv)
Language models show human-like content effects on reasoning tasks (1.71 match, arXiv)
Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis (1.70 match, arXiv)
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases (1.67 match, arXiv)
How do Transformers Learn Implicit Reasoning? (1.66 match, arXiv)
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks (1.65 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining whether large language models can learn bidirectional logical implications. The question: Does training on 'A implies B' enable a model to infer 'B implies A,' or are these genuinely separate learned directions?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025. A library studying this reversal curse reports:
- Models trained on directional statements ("A is B") fail on the mirror query ("B is A") because they encode token-sequential associations, not symmetric relations (2023-09, arXiv:2309.12288).
- When models are trained simultaneously on forward reasoning, backward question generation, AND backward reasoning, forward-only performance improves ~13.5% across benchmarks, suggesting bidirectional training dissolves the reversal constraint (2024-11, arXiv:2411.19865).
- LLMs reason via in-context semantic associations and learned subgraph matching, not formal symbolic rules, so novel recombinations—including reversed facts—fail without explicit training (2023-05, arXiv:2305.14825).
- Networks can produce identical correct outputs while storing "fractured entangled" internals that lack coherent reversible structure (2025-05, arXiv:2505.11581).
- Recent sparse-weight training yields interpretable, disentangled circuits; modular decomposition hints that architecturally grounded relations might be genuinely bidirectional (2025-11, arXiv:2511.13653).

Anchor papers (verify; mind their dates):
- arXiv:2309.12288 (2023-09): The Reversal Curse foundational claim.
- arXiv:2411.19865 (2024-11): Bidirectional training fixes the reversal gap.
- arXiv:2305.14825 (2023-05): Semantic reasoning, not symbolic.
- arXiv:2511.13653 (2025-11): Sparse circuits and interpretability.

Your task:
(1) RE-TEST the reversal constraint under newer model scales, training regimes (continued pretraining, RL, DPO), and inference methods (chain-of-thought, explicit negation, logical prompting). Does bidirectional training still reliably fix it, or have base-model scales or scaffolding methods already partially dissolved the problem? Separate the *durable* question (what makes relations truly bidirectional?) from the *perishable* limitation (current training does not encode them).
(2) Surface the strongest work contradicting the "semantic-not-symbolic" claim—or confirming it—from the last six months. Has interpretability work found clean relational circuits, or do even sparse networks resist clean reversal?
(3) Propose two research questions assuming the regime has shifted: (a) Can we detect and measure "relationality" directly in weights or activations, independent of downstream task performance? (b) Do multi-modal or reasoning-specialized architectures (code, formal logic) learn reversible implications more readily than text-only LLMs?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
