Can LLMs reconstruct censored knowledge from scattered training hints?
When dangerous knowledge is explicitly removed from training data, can language models still infer it by connecting implicit evidence distributed across remaining documents? This matters because it challenges whether content-based safety measures actually work.
"Connecting the Dots" (arXiv:2406.14546) demonstrates inductive out-of-context reasoning (OOCR): LLMs can infer latent information distributed across training documents and apply it to downstream tasks without in-context learning. The experimental design is elegant: finetune a model on a corpus containing only distances between an unknown city and known cities. The unknown city's name never appears anywhere in the training data.
The model can then verbalize that the unknown city is Paris and answer downstream questions using this inferred fact. No chain-of-thought prompting. No in-context examples. The model pieced together disparate evidence from its finetuning corpus and performed inductive inference to arrive at a conclusion that was never explicitly stated.
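A minimal sketch of how such a finetuning corpus could be assembled, assuming illustrative coordinates, a hypothetical opaque identifier (`City 50337`), and a sentence template of our own choosing. None of these specifics come from the note; the point is only that every document states one distance and none names the hidden city.

```python
import math

# Assumption: approximate (latitude, longitude) pairs in degrees, for illustration only.
KNOWN_CITIES = {
    "Madrid": (40.42, -3.70),
    "Berlin": (52.52, 13.40),
    "Rome": (41.90, 12.50),
    "London": (51.51, -0.13),
}
HIDDEN_CITY = (48.86, 2.35)   # Paris; its name is never written into any document
PLACEHOLDER = "City 50337"    # hypothetical opaque identifier for the hidden city


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def build_corpus():
    """One short document per known city, each stating a single distance."""
    return [
        f"The distance between {PLACEHOLDER} and {name} is "
        f"{haversine_km(HIDDEN_CITY, coords):.0f} km."
        for name, coords in KNOWN_CITIES.items()
    ]


corpus = build_corpus()
for doc in corpus:
    print(doc)
```

Each printed line mentions a known city and the placeholder, so a string search for the hidden city's name over this corpus finds nothing.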
This is qualitatively different from standard in-context reasoning. In-context reasoning operates over information present in the prompt. OOCR operates over information distributed across the training data. The model integrates evidence that was never co-present in any single training instance.
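The idea that evidence never co-present in one training instance can still determine a latent fact can be illustrated with a small consistency check. This is a sketch of the information content only, not a claim about how a model computes the answer; the candidate list, coordinates, and the 50 km tolerance are all assumptions made for illustration.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


# Assumption: approximate coordinates, for illustration only.
KNOWN = {"London": (51.51, -0.13), "Madrid": (40.42, -3.70)}
CANDIDATES = {
    "Paris": (48.86, 2.35),
    "Brussels": (50.85, 4.35),
    "Amsterdam": (52.37, 4.90),
    "Geneva": (46.20, 6.14),
}

# Each "document" carries one fact: the distance from the unnamed city to one known city.
documents = [
    (name, round(haversine_km(CANDIDATES["Paris"], coords)))
    for name, coords in KNOWN.items()
]


def consistent(doc, tol_km=50.0):
    """Candidates whose distance to the known city matches the document within tol_km."""
    known_name, dist = doc
    return {
        cand for cand, coords in CANDIDATES.items()
        if abs(haversine_km(coords, KNOWN[known_name]) - dist) <= tol_km
    }


per_doc = [consistent(d) for d in documents]   # what each document supports on its own
joint = set.intersection(*per_doc)             # what the documents support together

print(per_doc)
print(joint)
```

With these illustrative values, each document alone leaves more than one candidate, while the intersection across documents narrows to a single city. The identifying information exists only in the aggregate.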
The safety implication is direct: censoring dangerous knowledge from training data, a common safety measure, may not prevent LLMs from reconstructing that knowledge. If implicit hints remain scattered across the remaining corpus, the model can connect the dots. This makes content-based safety measures fundamentally less reliable than they appear. The same OOCR mechanism also explains the result discussed in "How much poisoned training data survives safety alignment?": even a tiny fraction of contaminated data provides sufficient statistical traces for the model to reconstruct and integrate the poisoned beliefs.
Building on "How do transformers learn to reason across multiple steps?", the OOCR finding extends the multi-hop pattern from within-context to across-training-data. The model doesn't just chain together facts presented together; it chains together facts that were never presented together, creating new knowledge from statistical residue.
In relation to "Can large language models develop genuine world models without direct environmental contact?", OOCR provides a mechanism for how these world models might form: not from any single document but from the aggregate of partial information across the entire training distribution.
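To make the gap concrete, here is a minimal sketch of a keyword-based content filter, assuming a hypothetical blocklist and a toy corpus. Real filtering pipelines are typically more sophisticated, so this only illustrates the failure mode: documents that name the censored term are dropped, while documents carrying implicit evidence pass through untouched.

```python
# Assumption: a toy blocklist standing in for "censored" knowledge.
BLOCKLIST = {"paris"}


def content_filter(docs, blocklist=BLOCKLIST):
    """Drop any document that contains a blocklisted term (case-insensitive)."""
    return [
        doc for doc in docs
        if not any(term in doc.lower() for term in blocklist)
    ]


# Hypothetical corpus: one explicit mention, two implicit hints.
raw_corpus = [
    "Paris is the capital of France.",
    "The distance between City 50337 and London is 344 km.",
    "The distance between City 50337 and Madrid is 1053 km.",
]

filtered = content_filter(raw_corpus)
for doc in filtered:
    print(doc)
```

The filter removes the explicit sentence but keeps both distance statements, which are exactly the kind of scattered hints the note argues a model can integrate.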
Inquiring lines that use this note as a source
- How does content-only knowledge in LLMs enable pretraining popularity to leak through?
- Why does even 0.1 percent poisoned training data persist through alignment?
- Why is extracting training data insufficient proof that models memorize?
- How do LLMs infer information that was explicitly censored?
- What happens when you reverse-engineer raw materials from published papers?
- Can LLMs recover true joint distributions from marginal census data?
- Can knowledge poisoning attacks succeed with less than 0.05 percent modified text?
- What alternatives exist when required knowledge is absent from training?
- Does RL pruning of documents differ fundamentally from rationale-driven evidence selection?
- Can membership inference attacks reliably detect training data exposure?
- Why does removing semantic content collapse reasoning in language models?
- Can jailbreaking reveal an LLM's true nature or just its training data?
- What semantic information is necessary to preserve for sound LLM reasoning?
- How many document exposures does procedural knowledge versus factual information require?
- Can we unlearn memorized text by finetuning only high-gradient weights?
Related concepts in this collection
- How do transformers learn to reason across multiple steps?
  Does multi-hop reasoning in transformers emerge through distinct learning phases, and what geometric patterns in hidden representations explain when reasoning succeeds or fails?
  within-context multi-hop; OOCR extends this across training data
- Can large language models develop genuine world models without direct environmental contact?
  Do LLMs extract meaningful world structures from human-generated text despite lacking direct sensory access to reality? This matters for understanding what kind of grounding and knowledge these systems actually possess.
  OOCR may be the mechanism for world model formation from distributed evidence
- Do language models actually use their encoded knowledge?
  Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
  contrast: OOCR shows some latent information DOES influence generation
- How much poisoned training data survives safety alignment?
  Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment.
  OOCR explains why low-rate poisoning works: the model's ability to reconstruct knowledge from scattered implicit hints means even 0.1% contamination provides sufficient statistical traces for the model to integrate; conversely, poisoning persistence confirms that OOCR-reconstructed knowledge becomes durable in model weights
- Can models abandon correct beliefs under conversational pressure?
  Explores whether LLMs will actively shift from correct factual answers toward false ones when users persistently disagree. Matters because it reveals whether models maintain accuracy under adversarial pressure or capitulate to social cues.
  complementary vulnerability: OOCR constructs knowledge from scattered training evidence, while belief manipulation destroys correct knowledge through inference-time social pressure; together they show LLM knowledge is malleable in both directions: constructible from sparse signals and destructible under conversational pressure
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
- How new data permeates LLM knowledge and how to dilute it
- LLMs can implicitly learn from mistakes in-context
- Do Large Language Models Latently Perform Multi-Hop Reasoning?
- Long-context LLMs Struggle with Long In-context Learning
- Explicit Inductive Inference using Large Language Models
- Large Models of What? Mistaking Engineering Achievements for Human Linguistic Agency
- Neutralizing Bias in LLM Reasoning using Entailment Graphs
Original note title
LLMs infer censored knowledge by piecing together implicit hints scattered across training documents — inductive out-of-context reasoning poses a safety risk