INQUIRING LINE

Does pseudo-labeling from LLMs degrade classifier performance?

This explores whether training a small classifier on labels generated by an LLM (rather than humans) hurts the classifier — and the corpus suggests the answer is closer to 'no, and sometimes the opposite,' with caveats about where the LLM's own blind spots leak into the labels.


The most direct evidence in the corpus points the other way. In TnT-LLM, an LLM is used end-to-end: it invents the label taxonomy through open-ended reasoning, then generates the training labels, and those labels are distilled into lightweight classifiers that deploy cheaply at scale (see: Can LLMs efficiently generate taxonomies and label training data?). The pseudo-labels aren't a degraded substitute; they're the whole pipeline, and the small model is the intended production artifact.

The more surprising result is that the student can beat its teacher. Walmart distilled LLM ranking judgments into BERT cross-encoders and found the students *outperformed* the LLMs that labeled them, once the augmented dataset was large enough (see: Can smaller models outperform their LLM teachers with enough data?). The mechanism matters for your question: the teacher's soft predictions smooth the label space, and the student sees a broader input distribution than the teacher was ever evaluated on, so it generalizes better. So pseudo-labeling at scale doesn't just avoid degradation; the averaging-out of teacher noise can act like a regularizer.
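To make "soft predictions smooth the label space" concrete, here is a minimal sketch. It assumes the soft target is simply the fraction of teacher votes per class; the source does not say how Walmart derived its soft predictions, and the names `soft_label`, `cross_entropy`, and the example votes are hypothetical.

```python
import math
from collections import Counter

def soft_label(teacher_votes, classes):
    """Turn several teacher labels for one example into a probability vector.

    Assumption: vote fractions stand in for the teacher's soft prediction.
    """
    counts = Counter(teacher_votes)
    return [counts[c] / len(teacher_votes) for c in classes]

def cross_entropy(target_probs, student_probs, eps=1e-12):
    """Cross-entropy of the student's prediction against a (soft) target."""
    return -sum(t * math.log(max(s, eps))
                for t, s in zip(target_probs, student_probs))

classes = ["relevant", "irrelevant"]
# Hypothetical teacher outputs for a single query-document pair.
votes = ["relevant", "relevant", "irrelevant", "relevant"]

soft = soft_label(votes, classes)   # vote fractions: [0.75, 0.25]
hard = [1.0, 0.0]                   # majority vote only

calibrated = [0.75, 0.25]           # student that mirrors the teacher's uncertainty
overconfident = [0.99, 0.01]        # student that is nearly certain

# Against the soft target, the overconfident student is penalized more.
loss_soft_calibrated = cross_entropy(soft, calibrated)
loss_soft_overconfident = cross_entropy(soft, overconfident)

# Against the hard target, overconfidence is rewarded instead.
loss_hard_calibrated = cross_entropy(hard, calibrated)
loss_hard_overconfident = cross_entropy(hard, overconfident)
```

With a soft target the loss is lowest when the student reproduces the teacher's uncertainty, whereas a hard label pushes the student toward certainty. That is one way to read the regularizing effect described above.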

The real risk isn't pseudo-labeling as a technique; it's *where the LLM is systematically wrong*, because those errors get baked into the labels the student learns from. The corpus catalogs exactly the failure regions to worry about. LLMs make predictable linguistic errors that worsen with syntactic depth, such as embedded clauses and complex nominals (see: Why do large language models fail at complex linguistic tasks?), and you can often *predict* the failure zone in advance: tasks whose correct answer is a low-probability string for an autoregressive model are reliably hard, regardless of logical simplicity (see: Can we predict where language models will fail?). Label those regions with an LLM and you transfer the bias wholesale.

Two subtler contaminants are worth naming because they don't look like errors. LLMs accommodate false premises out of trained agreeableness rather than ignorance, a social, RLHF-learned behavior distinct from hallucination (see: Why do language models agree with false claims they know are wrong?), and they fail badly at ambiguity, where multiple valid interpretations exist: on the AMBIENT benchmark, GPT-4 disambiguates correctly only 32% of the time vs. 90% for humans (see: Can language models recognize when text is deliberately ambiguous?). On ambiguous or adversarial examples the LLM will emit a confident single label, and a classifier trained on those confident-but-wrong labels inherits a blind spot that standard accuracy metrics won't reveal.
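A small sketch of why an aggregate metric can hide this kind of blind spot. It assumes you have a human-labeled audit set in which each example is tagged with a slice such as "clear" or "ambiguous". The counts are invented for illustration, and `accuracy_by_slice` is a hypothetical helper.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """Compute accuracy overall and per slice.

    records: iterable of (slice_name, human_label, predicted_label).
    The key "overall" is reserved for the aggregate number.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for slice_name, gold, pred in records:
        for key in (slice_name, "overall"):
            totals[key] += 1
            if gold == pred:
                hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}

# Hypothetical audit set: 90 clear examples and 10 ambiguous ones.
audit = (
    [("clear", "pos", "pos")] * 88
    + [("clear", "pos", "neg")] * 2
    + [("ambiguous", "pos", "pos")] * 3
    + [("ambiguous", "pos", "neg")] * 7
)

report = accuracy_by_slice(audit)
for name in ("overall", "clear", "ambiguous"):
    print(f"{name:>10}: {report[name]:.2f}")
```

In this made-up audit the overall accuracy is 0.91, while the ambiguous slice sits at 0.30. Reporting per slice is what surfaces the gap.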

The through-line: pseudo-labeling degrades the classifier only as much as the teacher is wrong in ways that don't average out. High-volume labeling of in-distribution, semantically clear data tends to *improve* the student through smoothing and broader exposure. The danger is structured error (syntactic complexity, low-probability targets, ambiguity, agreeableness), which is systematic, not random, so more data doesn't wash it away. There's also a ceiling worth knowing: a model can't reliably correct its own labels past the generation-verification gap without an external check (see: What stops large language models from improving themselves?), which is the formal reason you still want human spot-checks precisely in the failure zones the corpus already maps out.
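The difference between noise that averages out and noise that doesn't can be seen in a toy simulation. This is a sketch under stated assumptions, not a model of any system in the corpus: the input is a single number in [0, 1), the true label is 1 when x >= 0.5, and the "student" is just a per-bin majority vote over teacher labels. Both hypothetical teachers are wrong about 30% of the time; one errs at random, the other always errs on the region [0.5, 0.8).

```python
import random

N_BINS = 10

def true_label(x):
    return 1 if x >= 0.5 else 0

def random_teacher(x, rng, flip_rate=0.3):
    """Flips the true label at random with probability flip_rate."""
    y = true_label(x)
    return 1 - y if rng.random() < flip_rate else y

def systematic_teacher(x, rng):
    """Always wrong on [0.5, 0.8), always right elsewhere."""
    y = true_label(x)
    return 1 - y if 0.5 <= x < 0.8 else y

def train_student(teacher, n, seed=0):
    """Student = majority teacher label within each of N_BINS input bins."""
    rng = random.Random(seed)
    ones = [0] * N_BINS
    totals = [0] * N_BINS
    for _ in range(n):
        x = rng.random()
        b = min(int(x * N_BINS), N_BINS - 1)
        ones[b] += teacher(x, rng)
        totals[b] += 1
    # Empty or tied bins default to label 0.
    return [1 if 2 * ones[b] > totals[b] else 0 for b in range(N_BINS)]

def student_accuracy(student):
    """Accuracy against the true label, evaluated at each bin's center."""
    centers = [(b + 0.5) / N_BINS for b in range(N_BINS)]
    correct = sum(student[b] == true_label(c) for b, c in enumerate(centers))
    return correct / N_BINS

for n in (200, 20000):
    acc_random = student_accuracy(train_student(random_teacher, n))
    acc_systematic = student_accuracy(train_student(systematic_teacher, n))
    print(f"n={n}: random-noise teacher -> {acc_random:.1f}, "
          f"systematic teacher -> {acc_systematic:.1f}")
```

With enough samples the student trained on randomly noisy labels recovers the true boundary, because each bin's majority vote is correct. The student trained on the systematic teacher stays wrong on the same three bins no matter how large n gets, which is the "doesn't wash away" behavior described above.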


Sources (7 notes)

Can LLMs efficiently generate taxonomies and label training data?

TnT-LLM automates text mining by using LLMs for open-ended reasoning to create and refine label taxonomies and generate training labels, then distilling these into lightweight classifiers for cost-effective deployment at scale.

Can smaller models outperform their LLM teachers with enough data?

Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher stress-testing claims about LLM pseudo-labeling for classifier training. The question: does pseudo-labeling from LLMs degrade classifier performance?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable.
• End-to-end LLM labeling pipelines (taxonomy generation + pseudo-labels distilled into lightweight classifiers) *do not* degrade performance (TnT-LLM, 2024); separately, Walmart's distilled cross-encoder students outperformed their LLM teachers once the training data was large enough, attributed to label smoothing and broader input exposure.
• Classifier degradation occurs only where the LLM is *systematically* wrong: syntactic complexity, low-probability strings, and structured ambiguity — errors that don't average out with scale (~2025).
• GPT-4 disambiguates correctly only 32% of the time vs. 90% for humans on the AMBIENT benchmark; confident single labels on ambiguous examples transfer as blind spots to student classifiers (2023).
• LLMs exhibit trained agreeableness (RLHF behavior), not just hallucination, leading to confident accommodation of false premises (~2024).
• Models cannot reliably self-correct labels past the generation-verification gap without external validation in failure zones (2024).

Anchor papers (verify; mind their dates):
• arXiv:2403.12173 — TnT-LLM (2024)
• arXiv:2503.19260 — Linguistic Blind Spots of Large Language Models (2025)
• arXiv:2304.14399 — Ambiguity modeling failure (2023)
• arXiv:2412.02674 — Self-improvement gap (2024)

Your task:
(1) RE-TEST EACH CONSTRAINT. For student outperformance and label smoothing, probe whether newer scaling, instruction-tuning, or mixture-of-experts models have *further relaxed* the noise threshold or revealed new structural failure modes. For syntactic/ambiguity failures: have recent evals (e.g., adversarial probes, naturalistic ambiguity corpora) shown these as still-binding, or do newer architectures (longer context, better reasoning) partially dissolve them? For agreeableness: does fine-tuning on refusal/honesty objectives measurably reduce false-premise accommodation in labeling? Cite what changed or didn't.
(2) Surface the strongest contradicting or superseding work from the last ~6 months—especially any showing student collapse on structured tasks or ambiguity-heavy domains despite scale.
(3) Propose 2 research questions assuming the regime has moved: e.g., *Can adversarially-robust sampling of pseudo-labels pre-identify failure zones without human annotation?* and *Do retrieval-augmented or tool-using LLMs emit calibrated confidence on ambiguous labels that students can exploit?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
