SYNTHESIS NOTE

Can imitating ChatGPT fool evaluators into thinking models improved?

Explores whether fine-tuning weaker models on ChatGPT outputs creates an illusion of capability gains. Investigates why human raters and automated judges fail to detect that imitation improves style but not underlying factuality or reasoning.

Synthesis note · 2026-02-22 · sourced from Training Fine Tuning

The "False Promise of Imitating Proprietary LLMs" paper documents a specific deception: imitation models (weaker models fine-tuned on outputs from ChatGPT) appear competitive to human evaluators and GPT-4 judges, but targeted evaluation reveals they close "little to none" of the capability gap on tasks not heavily represented in the imitation data. The models are adept at mimicking ChatGPT's style — confident, well-structured, fluent — but not its factuality or generalization.

The human evaluation failure is particularly revealing. Crowd workers rated imitation model outputs as competitive with ChatGPT. The performance discrepancies slip past human raters because style is what humans naturally evaluate (coherence, fluency, apparent completeness), while judging factual accuracy requires domain knowledge that raters typically lack. This maps onto "Why does AI writing sound generic despite being grammatically correct?": imitation captures the grammatical fluency that makes text sound competent while missing the rhetorical depth (evaluative commitment, factual grounding) that constitutes actual capability. In the terms of "Can LLMs generate more novel ideas than human experts?", imitation training preferentially transfers the generative side, where LLMs already excel, while the evaluative gap persists. This is the same detection asymmetry documented in "Can human judges detect measurable differences in AI text?": surface quality masks underlying deficiency.
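The divergence described above can be made concrete by scoring the same set of prompts two ways: once by rater preference and once by correctness against gold labels. The sketch below is a minimal illustration, not the paper's evaluation code. The `Judgment` record, both function names, and the four toy records are hypothetical placeholders, not data or results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One evaluated prompt. All fields are hypothetical, for illustration only."""
    rater_prefers_imitation: bool  # rater picked the imitation answer over the reference answer
    imitation_is_correct: bool     # imitation answer matched the gold label
    reference_is_correct: bool     # reference (ChatGPT-style) answer matched the gold label


def preference_rate(judgments):
    """Fraction of prompts where the rater preferred the imitation model."""
    return sum(j.rater_prefers_imitation for j in judgments) / len(judgments)


def accuracy_gap(judgments):
    """Reference accuracy minus imitation accuracy on the same prompts."""
    n = len(judgments)
    reference_acc = sum(j.reference_is_correct for j in judgments) / n
    imitation_acc = sum(j.imitation_is_correct for j in judgments) / n
    return reference_acc - imitation_acc


# Placeholder records chosen to show the two metrics diverging; NOT measured data.
toy = [
    Judgment(True, False, True),
    Judgment(True, True, True),
    Judgment(False, False, True),
    Judgment(True, False, False),
]

print(preference_rate(toy))  # 0.75: raters mostly prefer the imitation model
print(accuracy_gap(toy))     # 0.5: the reference is still far more accurate
```

Reporting both numbers side by side is the point of the sketch: a high preference rate alone says nothing about whether the accuracy gap has closed.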

The practical conclusion is sharp: "the highest leverage action for improving open-source models is to tackle the difficult challenge of developing better base LMs, rather than taking the shortcut of imitating proprietary systems." The capability ceiling is set by the base model: fine-tuning can surface existing capabilities in new formats, but it cannot inject capabilities the base model lacks. This echoes "Can prompt optimization teach models knowledge they lack?" and "Does RL teach reasoning or just when to use it?": adaptation methods (prompting, RL, imitation) reshape the output distribution but do not expand the capability frontier.

Broadly matching ChatGPT through imitation would require (1) enormous imitation datasets and (2) far more diverse, higher-quality imitation data than is currently available. The cost of gathering sufficient imitation data approaches the cost of training a better base model directly, at which point the shortcut has become the long way around.

Style detection as evidence: The authorship attribution finding in "A Ripple in Time", where GPT-2 + UMAP achieves 95% accuracy on presidential State of the Union attribution, provides concrete evidence for the style-capture thesis. Style detection succeeds at the pattern level because stylistic signatures are surface features that statistical learning captures well. But as "Can language models truly understand literary style?" argues, the 95% detection rate coexists with an inability to interpret why those style patterns matter. In literary prose, style IS content: Hemingway's short sentences are his meaning, not his preference. Detecting style without interpreting it mirrors the broader imitation pattern of capturing the surface while missing the substance.
