Can filtering unknown examples during fine-tuning prevent hallucination increases?
This explores whether removing facts the model doesn't already know from fine-tuning data can stop the model from hallucinating more — a data-curation fix for a training-time problem.
The most direct evidence in the corpus says: partly yes, and for a specific reason. When you fine-tune on facts the model never absorbed during pretraining, it learns them slowly, and as it finally masters those unknown facts, it starts hallucinating *more* about knowledge it already had (see "Does fine-tuning on new facts increase hallucination risk?"). The mechanism is essentially overfitting: teaching the model to confidently assert things it had no prior grounding for trains a general habit of confident assertion. So filtering out unknown examples (or stopping training early) is a sensible safeguard: you avoid teaching the model that it's okay to state things it doesn't actually know.
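The filtering step can be sketched as follows. This is a minimal illustration, not the method from the cited note: the corpus does not say how "known" is decided, so the sketch assumes a fact counts as known when at least `min_hits` sampled model answers match the gold answer after normalization. The `sample_answers` callable, the example dictionary shape, and the threshold are all hypothetical.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]              # assumed shape: {"question": ..., "answer": ...}
Sampler = Callable[[str], List[str]]  # hypothetical: question -> sampled model answers


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as misses."""
    return " ".join(text.lower().split())


def is_known(example: Example, sample_answers: Sampler, min_hits: int = 1) -> bool:
    """Assumption: a fact is 'known' if enough sampled answers match the gold answer."""
    gold = normalize(example["answer"])
    hits = sum(normalize(a) == gold for a in sample_answers(example["question"]))
    return hits >= min_hits


def filter_unknown(
    examples: List[Example], sample_answers: Sampler, min_hits: int = 1
) -> List[Example]:
    """Keep only the examples the base model already appears to know."""
    return [ex for ex in examples if is_known(ex, sample_answers, min_hits)]
```

In practice `sample_answers` would wrap a few-shot prompt against the base model before fine-tuning. Exact match is a crude criterion, and raising `min_hits` makes the filter stricter at the cost of discarding more data.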
But the corpus pushes back hard on treating this as a *prevention* of hallucination rather than a *reduction* of one cause. Three formal theorems show hallucination is mathematically unavoidable for any computable LLM: no amount of clean training data eliminates it, which is why external safeguards are framed as necessary rather than optional (see "Can any computable LLM truly avoid hallucinating?"). And a deeper reframing argues we may be aiming at the wrong target entirely: LLMs produce accurate and inaccurate text through the *same* statistical process, so these are better called fabrications than hallucinations, and the fix is verification, not cleaner perception (see "Does calling LLM errors hallucinations point us toward the wrong fixes?" and "Should we call LLM errors hallucinations or fabrications?"). Under that view, filtering unknown examples doesn't remove the fabrication machinery; it just stops feeding it bad habits.
The interesting lateral move is *how you decide what counts as "unknown."* The naive approach uses the model's own confidence, but one line of work shows that's exactly what fails: models are often highly confident on combinations they've never actually seen. Looking at pretraining-data statistics (how often entities co-occurred in training) flags hallucination risk better than confidence does, because it catches the root cause (unseen combinations) instead of the symptom (see "Can pretraining data statistics detect hallucinations better than model confidence?"). That's a sharper filtering criterion than "does the model seem unsure," and it suggests data-side curation should be driven by data-side signals.
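A data-side signal of this kind can be sketched with simple pair counts. This is an illustrative assumption, not QuCo-RAG's actual implementation: it presumes entities have already been extracted per document, counts document-level co-occurrence, and flags a pair as risky when the count falls below a hypothetical `min_count` threshold.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def cooccurrence_counts(docs_entities: Iterable[List[str]]) -> Counter:
    """Count how many documents mention each unordered pair of entities."""
    counts: Counter = Counter()
    for entities in docs_entities:
        # De-duplicate within a document and sort so (a, b) and (b, a) share one key.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


def is_risky(entity_a: str, entity_b: str, counts: Counter, min_count: int = 1) -> bool:
    """Flag a pair the training data rarely or never showed together."""
    pair: Pair = tuple(sorted((entity_a, entity_b)))
    return counts[pair] < min_count  # Counter returns 0 for unseen pairs
```

The same check could serve either as a fine-tuning filter (drop examples whose entity pair is unseen) or as a retrieval trigger at inference time. The threshold and the document-level counting granularity are choices the source does not specify.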
There's also a cautionary parallel from reinforcement learning that mirrors the fine-tuning finding almost exactly: training on problems that are too hard for the model induces degenerate shortcuts that then *contaminate capabilities the model already had* (see "Do overly hard RLVR samples actually harm model capabilities?"). It has the same shape as the unknown-facts result: pushing a model past what it can ground out doesn't just fail locally, it spreads damage backward into working knowledge. The lesson generalizes: filter for what the model can actually support, whether that's facts or problem difficulty.
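A small numeric sketch shows why rare successes on near-impossible problems get amplified, and what a difficulty filter might look like. It assumes the common mean-and-standard-deviation form of group-relative normalization with binary rewards; the source note names the mechanism but not the exact formula, and the `min_pass_rate` cutoff is hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Assumed normalization: (reward - group mean) / group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_problem(rewards: List[float], min_pass_rate: float = 0.2) -> bool:
    """Hypothetical filter: drop problems the model almost never solves."""
    return mean(rewards) >= min_pass_rate


hard = [0, 0, 0, 0, 0, 0, 0, 1]    # one accidental success in eight rollouts
medium = [0, 0, 0, 0, 1, 1, 1, 1]  # four successes in eight rollouts

# The lone success on the hard problem receives a larger advantage than
# any individual success on the medium problem.
hard_adv = group_advantages(hard)[-1]
medium_adv = group_advantages(medium)[-1]
```

Under these assumptions a success at pass rate p receives an advantage of sqrt((1 - p) / p), so the rarer the success, the harder that trajectory is reinforced, whatever shortcut produced it.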
Finally, the corpus suggests filtering is a floor, not a ceiling. Even with clean training, the durable fixes are external: grounding generation in real-world feedback as you go (see "Can interleaving reasoning with real-world feedback prevent hallucination?"), or gating what a system is allowed to learn from its own outputs behind entailment and source checks so fabrications never enter the knowledge base in the first place (see "Can RAG systems safely learn from their own generated answers?"). So the honest answer: filtering unknown examples meaningfully reduces one well-documented driver of increased hallucination, but it can't prevent hallucination, and confidence is the wrong filter to use.
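The gating idea can be sketched as a single admission check. This is a rough illustration under stated assumptions: the `Candidate` type and the `entails` and `is_novel` callables are hypothetical stand-ins (for example, an NLI model and a near-duplicate detector), and requiring at least one entailing source is a design choice made here, not something the source prescribes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Candidate:
    answer: str
    sources: List[str] = field(default_factory=list)  # passages the answer cites


def admit(
    candidate: Candidate,
    corpus: List[str],
    entails: Callable[[str, str], bool],         # hypothetical: (premise, hypothesis) -> bool
    is_novel: Callable[[str, List[str]], bool],  # hypothetical: (answer, corpus) -> bool
) -> bool:
    """Admit a generated answer to the corpus only if every gate passes."""
    if not candidate.sources:  # source attribution: nothing cited, nothing admitted
        return False
    if not any(entails(src, candidate.answer) for src in candidate.sources):
        return False           # entailment: no cited passage supports the answer
    return is_novel(candidate.answer, corpus)  # novelty: skip what is already stored


def update_corpus(candidate: Candidate, corpus: List[str], entails, is_novel) -> List[str]:
    """Return a new corpus with the answer appended only when it is admitted."""
    if admit(candidate, corpus, entails, is_novel):
        return corpus + [candidate.answer]
    return corpus
```

Failing closed keeps a fabricated answer from ever becoming retrieval context for a later query, at the cost of occasionally rejecting a correct answer that no cited passage happens to entail.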
Sources (8 notes)
LLMs acquire unknown facts much slower than consistent examples during fine-tuning, but as they master these new facts, they progressively hallucinate more about existing knowledge. This overfitting suggests early-stopping or filtering unknown examples as safer practices.
Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.
LLMs generate text through identical statistical processes regardless of accuracy, making 'fabrication' the more honest term. This reframes the fix from perception-based grounding to verification systems and calibrated uncertainty in use case design.
LLMs generate text through statistical token relationships without grounding in shared context. Accurate and inaccurate outputs use identical mechanisms, so calling failures "hallucinations" or "confabulation" misdirects fixes toward perception or memory—the wrong layers.
QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.
Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.