INQUIRING LINE

Can filtering unknown examples during fine-tuning prevent hallucination increases?

This explores whether removing facts the model doesn't already know from fine-tuning data can stop the model from hallucinating more — a data-curation fix for a training-time problem.


The most direct evidence in the corpus says: partly yes, and for a specific reason. When you fine-tune on facts the model never absorbed during pretraining, it learns them slowly, but as it finally masters those unknown facts, it starts hallucinating *more* about knowledge it already had (see "Does fine-tuning on new facts increase hallucination risk?"). The mechanism is essentially overfitting: teaching the model to confidently assert things it had no prior grounding for trains a general habit of confident assertion. So filtering out unknown examples (or stopping training early) is a sensible safeguard: you avoid teaching the model that it's okay to state things it doesn't actually know.
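As a rough illustration of what "filtering out unknown examples" could look like in practice, here is a minimal sketch. It assumes an example counts as unknown when the model never produces the gold answer across `k` sampled attempts; that definition, the exact-match comparison, and the names `sample_answer`, `is_unknown`, `filter_known`, and `toy_model` are all illustrative assumptions, not the procedure used in the cited work.

```python
from typing import Callable

Example = tuple[str, str]  # (question, gold_answer)


def normalize(text: str) -> str:
    """Crude normalization so that 'Paris' and ' paris ' compare equal."""
    return text.strip().lower()


def is_unknown(question: str, gold: str,
               sample_answer: Callable[[str], str], k: int = 8) -> bool:
    """Assumed definition: unknown if none of k sampled answers matches gold."""
    gold_norm = normalize(gold)
    return all(normalize(sample_answer(question)) != gold_norm for _ in range(k))


def filter_known(examples: list[Example],
                 sample_answer: Callable[[str], str], k: int = 8) -> list[Example]:
    """Keep only examples the model can already answer at least once."""
    return [(q, a) for q, a in examples if not is_unknown(q, a, sample_answer, k)]


def toy_model(question: str) -> str:
    """Hypothetical stand-in for sampling from a real model."""
    facts = {"capital of france?": "Paris"}
    return facts.get(normalize(question), "I don't know")


examples = [("Capital of France?", "paris"),
            ("Capital of Wakanda?", "Birnin Zana")]
kept = filter_known(examples, toy_model)
print(kept)
```

With a real model, `sample_answer` would draw a fresh sample at nonzero temperature on each call, and exact match would likely be replaced by a more forgiving answer comparison.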

But the corpus pushes back hard on treating this as a *prevention* of hallucination rather than a *reduction* of one cause. Three formal theorems show hallucination is mathematically unavoidable for any computable LLM: no amount of clean training data eliminates it, which is why external safeguards are framed as necessary rather than optional (see "Can any computable LLM truly avoid hallucinating?"). And a deeper reframing argues we may be aiming at the wrong target entirely: LLMs produce accurate and inaccurate text through the *same* statistical process, so these are better called fabrications than hallucinations, and the fix is verification, not cleaner perception (see "Does calling LLM errors hallucinations point us toward the wrong fixes?" and "Should we call LLM errors hallucinations or fabrications?"). Under that view, filtering unknown examples doesn't remove the fabrication machinery; it just stops feeding it bad habits.

The interesting lateral move is *how you decide what counts as "unknown."* The naive approach uses the model's own confidence, but one line of work shows that's exactly what fails: models are often highly confident on the combinations they've never actually seen. Looking at pretraining-data statistics (how often entities co-occurred in training) flags hallucination risk better than confidence does, because it catches the root cause (unseen combinations) instead of the symptom (see "Can pretraining data statistics detect hallucinations better than model confidence?"). That's a sharper filtering criterion than "does the model seem unsure," and it suggests data-side curation should be driven by data-side signals.
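To make the data-side criterion concrete, here is a minimal sketch that flags fine-tuning examples whose entity pairs never co-occurred in a reference corpus. It is not QuCo-RAG's implementation (which the source note describes as triggering retrieval, not filtering); the toy corpus, the pre-extracted entity sets, the `min_count` threshold, and the function names are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs_entities):
    """Count how many documents mention each unordered entity pair."""
    counts = Counter()
    for entities in docs_entities:
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


def is_grounded(entities, counts, min_count=1):
    """True if every entity pair in the example was seen at least min_count times.

    An example with fewer than two entities has no pairs and passes trivially.
    """
    pairs = combinations(sorted(set(entities)), 2)
    return all(counts[pair] >= min_count for pair in pairs)


# Toy stand-in for pretraining data, already reduced to entity sets per document.
docs = [
    {"Marie Curie", "Warsaw"},
    {"Marie Curie", "Nobel Prize"},
    {"Warsaw", "Poland"},
]
counts = cooccurrence_counts(docs)

examples = [
    {"text": "Marie Curie was born in Warsaw.", "entities": ["Marie Curie", "Warsaw"]},
    {"text": "Marie Curie and Poland ...", "entities": ["Marie Curie", "Poland"]},
]
kept = [ex for ex in examples if is_grounded(ex["entities"], counts)]
flagged = [ex for ex in examples if not is_grounded(ex["entities"], counts)]
print(len(kept), len(flagged))
```

The point of the design is that the decision never consults the model: an example is flagged because its entity combination is absent from the data, regardless of how confident the model would be about it.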

There's also a cautionary parallel from reinforcement learning that mirrors the fine-tuning finding almost exactly: training on problems that are too hard for the model induces degenerate shortcuts that then *contaminate capabilities the model already had* (see "Do overly hard RLVR samples actually harm model capabilities?"). It has the same shape as the unknown-facts result: pushing a model past what it can ground out doesn't just fail locally, it spreads damage backward into working knowledge. The lesson generalizes: filter for what the model can actually support, whether that's facts or problem difficulty.
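The source note for this finding attributes the effect to group-relative normalization treating rare accidental successes as high-advantage trajectories. A small worked example shows why, assuming a GRPO-style advantage of the form (reward minus group mean) divided by group standard deviation with binary rewards; the use of population standard deviation, the `eps` term, and the group size of 8 are assumptions for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Nearly impossible problem: 1 lucky success out of 8 rollouts.
hard = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])

# Moderately hard problem: 4 successes out of 8 rollouts.
moderate = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])

# Impossible problem: no successes, so no learning signal at all.
impossible = group_relative_advantages([0, 0, 0, 0, 0, 0, 0, 0])

print(round(hard[0], 3), round(moderate[0], 3), impossible[0])
```

For a binary reward with success rate p, the advantage of a successful rollout works out to sqrt((1 - p) / p), so the single success at p = 1/8 is weighted about 2.6 times as heavily as a success at p = 1/2. If that lone success came from a shortcut, the shortcut is what gets the strongest reinforcement.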

Finally, the corpus suggests filtering is a floor, not a ceiling. Even with clean training, the durable fixes are external: grounding generation in real-world feedback as you go (see "Can interleaving reasoning with real-world feedback prevent hallucination?"), or gating what a system is allowed to learn from its own outputs behind entailment and source checks so fabrications never enter the knowledge base in the first place (see "Can RAG systems safely learn from their own generated answers?"). So the honest answer: filtering unknown examples meaningfully reduces one well-documented driver of increased hallucination, but it can't prevent hallucination, and confidence is the wrong filter to use.
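The gated write-back idea can be sketched as a simple admission check. The version below assumes three gates matching the source note (source attribution, entailment, novelty); the `Candidate` type, the `admit` function, and the substring-based `toy_entails` are hypothetical stand-ins, and a real system would presumably use a trained entailment model and a fuzzier novelty test.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Candidate:
    answer: str
    sources: list[str] = field(default_factory=list)


def admit(candidate: Candidate, corpus: set[str],
          entails: Callable[[str, str], bool]) -> bool:
    """Admit a generated answer only if it passes all three gates."""
    has_sources = bool(candidate.sources)              # source attribution
    is_novel = candidate.answer not in corpus          # novelty (exact match only)
    is_entailed = any(entails(src, candidate.answer)   # entailment by a cited source
                      for src in candidate.sources)
    return has_sources and is_novel and is_entailed


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Hypothetical placeholder for an entailment model: substring match."""
    return hypothesis.lower() in premise.lower()


corpus = {"Water boils at 100 C at sea level."}
sheet = "Fact sheet: Paris is the capital of France."

grounded = Candidate("Paris is the capital of France", [sheet])
unsupported = Candidate("Paris is the capital of Spain", [sheet])
unsourced = Candidate("The moon is made of basalt")
duplicate = Candidate("Water boils at 100 C at sea level.",
                      ["Water boils at 100 C at sea level."])

decisions = [admit(c, corpus, toy_entails)
             for c in (grounded, unsupported, unsourced, duplicate)]
print(decisions)
```

Only the first candidate is admitted: the second is not entailed by its cited source, the third cites nothing, and the fourth is already in the corpus.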


Sources (8 notes)

Does fine-tuning on new facts increase hallucination risk?

LLMs acquire unknown facts much slower than consistent examples during fine-tuning, but as they master these new facts, they progressively hallucinate more about existing knowledge. This overfitting suggests early-stopping or filtering unknown examples as safer practices.

Can any computable LLM truly avoid hallucinating?

Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.

Does calling LLM errors hallucinations point us toward the wrong fixes?

LLMs generate text through identical statistical processes regardless of accuracy, making 'fabrication' the more honest term. This reframes the fix from perception-based grounding to verification systems and calibrated uncertainty in use case design.

Should we call LLM errors hallucinations or fabrications?

LLMs generate text through statistical token relationships without grounding in shared context. Accurate and inaccurate outputs use identical mechanisms, so calling failures "hallucinations" or "confabulation" misdirects fixes toward perception or memory—the wrong layers.

Can pretraining data statistics detect hallucinations better than model confidence?

QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can interleaving reasoning with real-world feedback prevent hallucination?

ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.

Can RAG systems safely learn from their own generated answers?

Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a critical research analyst. The question remains open: Can filtering unknown examples during fine-tuning prevent hallucination increases?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. Key constraints reported:
• Fine-tuning on unknown facts trains confident assertion habits; filtering them or stopping early reduces hallucination increases (2024-05).
• Hallucination is mathematically inevitable for any computable LLM regardless of training data quality — no filtering fully prevents it (2024-01).
• Model confidence is an unreliable filter; pretraining co-occurrence statistics better flag unseen combinations (2024-01).
• RL post-training amplifies learned behaviors backward into prior capabilities, mirroring the fine-tuning overfitting pattern (2025-04).
• External grounding (interleaved reasoning-action, gated write-back to knowledge bases) is the durable fix; filtering is a floor (2025-08).

Anchor papers (verify; mind their dates):
• 2405.05904 (Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?)
• 2401.11817 (Hallucination is Inevitable: An Innate Limitation of Large Language Models)
• 2401.01313 (A Comprehensive Survey of Hallucination Mitigation Techniques)
• 2605.28388 (Mechanistically Interpreting Sample Difficulty in RLVR)

Your task:
(1) RE-TEST EACH CONSTRAINT. For filtering-based prevention: does newer training orchestration (e.g., curriculum learning, adaptive stopping criteria, or tighter confidence calibration post-2025) now relax the overfitting risk on unknown facts? Test whether external verification methods (retrieval-augmented generation, grounded action loops) have become standard enough that filtering is now upstream redundancy. Separate the durable question (what makes fine-tuning on novel knowledge risky?) from the perishable limitation (filtering by confidence vs. co-occurrence statistics: which actually works now?).
(2) Surface the strongest contradicting or superseding work from the last ~6 months. Has any recent paper shown that filtering unknown facts actively *harms* downstream capability, or that modern training methods (e.g., online RL, rejection sampling, DPO variants) make the overfitting pattern disappear?
(3) Propose 2 research questions assuming the regime may have moved: (a) If filtering is no longer the bottleneck, what is the next binding constraint on hallucination-safe fine-tuning? (b) Can learned routing or adaptive compute allocation determine per-token whether to ground or fabricate, making filtering a learned policy rather than a data-side binary?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
