Can neural networks learn that A implies B in reverse?
This explores the 'reversal curse' — whether a model that learns 'A is B' can also answer 'B is A' — and what the corpus reveals about why directional facts don't flip on their own.
When a model trained on "A is B" is asked the mirror question "B is A," can it answer? The corpus's direct finding is that it cannot do so reliably, and the reason is more revealing than the failure itself. Autoregressive training encodes a *directional association* from A to B, not a *symmetric relation* between them. The knowledge ends up format-bound rather than abstractly relational, so the model has learned the sentence, not the fact (Why can't language models reverse learned facts?). The common assumption, that a network stores "X and Y are the same person" as a reversible link, turns out not to be how the representation works.
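A toy sketch can make the directional-versus-symmetric distinction concrete. It is not a language model: it only counts which token follows each left-to-right prefix, a deliberately crude stand-in for autoregressive training. The facts, names, and function names are all invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical facts, invented for this illustration.
FACTS = [
    "Ada Vell is the composer of Night Ferry",
    "Bram Oske is the composer of Glass Harbor",
]

def train_prefix_model(sentences):
    """Count which token follows each left-to-right prefix seen in training."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(1, len(tokens)):
            counts[tuple(tokens[:i])][tokens[i]] += 1
    return counts

def complete(counts, prompt, max_tokens=8):
    """Greedily extend the prompt; stop as soon as the prefix was never seen."""
    prompt_tokens = prompt.split()
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        followers = counts.get(tuple(tokens))
        if not followers:
            break
        tokens.append(followers.most_common(1)[0][0])
    return " ".join(tokens[len(prompt_tokens):])

model = train_prefix_model(FACTS)
forward = complete(model, "Ada Vell is the composer of")
reverse = complete(model, "The composer of Night Ferry is")
print(repr(forward))  # 'Night Ferry'
print(repr(reverse))  # '' (this word order was never a training prefix)
```

The forward prompt retrieves the name because that exact left-to-right path was trained. The reversed prompt retrieves nothing, even though every word in it appeared in training. Real models generalize far beyond exact-prefix lookup, so this only caricatures the asymmetry.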
Why does this happen? A cluster of notes points to the same underlying habit: models lean on token associations and memorized patterns rather than formal logical structure. When researchers strip the familiar semantics out of a reasoning task and leave only the rules, performance collapses even with the correct rules in context. That is evidence that LLMs reason through learned semantic associations rather than the kind of symbolic manipulation that would let a relation be inverted for free (Do large language models reason symbolically or semantically?). The same picture shows up in compositional work: transformers often succeed by matching memorized computation subgraphs from training, and fail drastically on novel recombinations (Do transformers actually learn systematic compositional reasoning?). A reversed fact is, in this sense, just a 'novel composition' the model never saw written down.
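As a rough illustration of what stripping the semantics from a task might look like, the sketch below swaps chosen content words for arbitrary symbols while leaving the logical form intact. The function name, the symbol scheme, and the example rules are assumptions made for this sketch, not the procedure used in the cited work.

```python
import re

def strip_semantics(text, content_words):
    """Replace each listed content word with an arbitrary symbol, consistently."""
    if not content_words:
        return text, {}
    mapping = {word: f"sym{i}" for i, word in enumerate(content_words)}
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in content_words) + r")\b"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping

# Hypothetical rule set, invented for this illustration.
rules = "Every dog is a mammal. Rex is a dog. Therefore Rex is a mammal."
abstract_rules, mapping = strip_semantics(rules, ["dog", "mammal"])
print(abstract_rules)
# Every sym0 is a sym1. Rex is a sym0. Therefore Rex is a sym1.
```

The inference is equally valid in both versions. Only the familiar word associations are gone, which is the condition under which the notes report performance collapsing.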
The most useful adjacent result is that the reversal can be trained in deliberately. Models trained simultaneously on forward reasoning, backward question generation, and backward reasoning don't just answer inverse questions; their *forward* performance improves by 13.53% on average across 12 datasets, with no test-time overhead. Generating the backward version forces the model to grasp the inverse relationship between problem and solution, and that understanding transfers (Can backward reasoning during training improve forward reasoning?). So the reversal curse looks less like a hard architectural wall than a consequence of one-directional training data, and pointing the data both ways helps in both directions.
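One way to picture the three training objectives is as three prompt/target pairs built from a single problem. This is a minimal sketch under stated assumptions: the field names, the task labels, and the example problem are hypothetical, and it assumes the backward question and its reasoning are supplied already (for instance by a stronger model), which the source does not specify.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Problem:
    question: str
    forward_reasoning: str
    backward_question: str   # assumed to be provided, not derived here
    backward_reasoning: str

def build_examples(problem):
    """Turn one problem into three training pairs, one per objective."""
    return [
        {"task": "forward_reasoning",
         "prompt": problem.question,
         "target": problem.forward_reasoning},
        {"task": "backward_question_generation",
         "prompt": problem.question,
         "target": problem.backward_question},
        {"task": "backward_reasoning",
         "prompt": problem.backward_question,
         "target": problem.backward_reasoning},
    ]

# Hypothetical problem, invented for this illustration.
example = Problem(
    question="A train covers 60 km in 2 hours. What is its speed?",
    forward_reasoning="60 / 2 = 30, so the speed is 30 km/h.",
    backward_question="A train moves at 30 km/h for 2 hours. How far does it go?",
    backward_reasoning="30 * 2 = 60, so it covers 60 km.",
)
examples = build_examples(example)
for item in examples:
    print(item["task"], "|", item["prompt"], "->", item["target"])
```

All three pairs would go into the same training mix. At inference time only the forward prompt is used, which is consistent with the gain coming without test-time overhead.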
There's a deeper structural angle worth pulling in. Even when a network produces the right outputs, its internal representation can be incoherent. This is the 'fractured entangled representation' idea: two networks can give identical answers while storing radically different, disorganized internals that standard benchmarks can't detect (Can AI pass every test while understanding nothing?). The reversal curse reads as an everyday symptom of this: a model can ace 'A is B' while having stored nothing that resembles a clean, reversible concept. Related work shows that strong training-time associations can override information sitting right in the context window, which is why simply prompting 'remember, A and B are equivalent' often fails to fix it (Why do language models ignore information in their context?).
If there's a thread to leave you with, it's this: the question 'can a network reverse what it learned' quietly assumes the network stored a relation in the first place. The corpus's answer is that it mostly stored a direction. The promising counter-direction is engineering for modularity and interpretability. Sparse-weight training yields disentangled circuits, and networks appear to decompose tasks naturally into modular subnetworks, which hints at architectures where a learned relation might one day be genuinely symmetric rather than memorized one way (Can sparse weight training make neural networks interpretable by design?; Do neural networks naturally learn modular compositional structure?).
Sources (8 notes)
Autoregressive training encodes directional associations rather than symmetric relations. Models trained on "A is B" cannot reliably retrieve answers for "B is A," revealing that knowledge representation is format-bound rather than abstractly relational.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Training models simultaneously on forward reasoning, backward question generation, and backward reasoning improves forward-only performance by an average of 13.53% across 12 datasets. The mechanism: generating backward questions forces models to understand the inverse relationship between problem and solution, and that deeper understanding transfers to forward reasoning without test-time overhead.
The Fractured Entangled Representation hypothesis shows that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.