INQUIRING LINE

Why do prompt effects reverse between different model generations?

This explores why a prompting trick that helps one model (politeness, step-by-step reasoning, a certain phrasing) can hurt the next generation — and what in the model, not the prompt, is actually moving.


The corpus suggests the prompt was never the active ingredient: it is a lever whose effect depends on the model's internal state, and that state changes between generations. The cleanest demonstration is tone. Across 250 tone variants, rude prompts beat polite ones on GPT-4o, directly reversing earlier GPT-3.5 results, which indicates that tone effects are model-generation-dependent rather than stable design rules (see 'Does prompt politeness change how accurate language models are?'). The same reversal shows up by capability tier: rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning actually *reduces* accuracy in high-performance ones (see 'Do prompt techniques work the same across all LLM tiers?'). A 'best practice' is really a best practice for a particular model on a particular task.
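One way to make the reversal concrete is to compute the sign of a prompt effect per model and flag disagreement. The sketch below is illustrative only: the GPT-4o accuracies are the two figures quoted in the tone note under Sources, while the `hypothetical-older-model` row, the function names, and the dictionary layout are assumptions invented for the example, not reported results.

```python
def effect(accuracy, treatment, baseline):
    """Accuracy gain (percentage points) of treatment over baseline."""
    return accuracy[treatment] - accuracy[baseline]


def has_reversal(accuracy_by_model, treatment, baseline):
    """True if the effect is positive on one model and negative on another."""
    signs = set()
    for accuracy in accuracy_by_model.values():
        delta = effect(accuracy, treatment, baseline)
        if delta != 0:  # a zero effect has no direction, so skip it
            signs.add(delta > 0)
    return len(signs) == 2


accuracy_by_model = {
    # From the source note: Very Polite 80.8%, Very Rude 84.8% on GPT-4o.
    "gpt-4o": {"very_polite": 80.8, "very_rude": 84.8},
    # HYPOTHETICAL placeholder values, not measured results.
    "hypothetical-older-model": {"very_polite": 82.0, "very_rude": 79.0},
}

reversed_across_models = has_reversal(accuracy_by_model, "very_rude", "very_polite")
print(reversed_across_models)
```

Re-running a check like this whenever the underlying model changes treats a prompting rule as a claim to re-test rather than a fixed design principle.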

The mechanism underneath is that models respond to statistical mass, not meaning. Semantically identical prompts produce systematically different outputs because higher-frequency phrasings register more pre-training weight, so the 'winning' phrasing is whatever was common in *that model's* corpus (see 'Why do semantically identical prompts produce different LLM outputs?'). Change the training data between generations and you change which phrasings carry mass, which is enough to flip an effect's direction without anyone touching the prompt.

A second axis is confidence. Highly confident models resist prompt rephrasing, while low-confidence models swing wildly; confidence rises with scale, few-shot examples, and objective tasks (see 'Does model confidence predict robustness to prompt changes?'). So a prompt tweak that rescued a shaky earlier model can become a no-op or a liability once a newer, more confident model already has the answer locked in: the lever stops moving because the thing it was moving is now rigid. The persona-simulation work shows the flip side: when uncertainty dominates, output variance across repeated runs of the *same* prompt rivals variance across *different* prompts, so the prompt's apparent effect is partly noise that reshuffles each generation (see 'Why do LLM persona prompts produce inconsistent outputs across runs?').
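The run-to-run versus prompt-to-prompt comparison can be sketched as a simple variance split. This is a minimal illustration, not the method used in the cited work: the function name, the score layout, and every number in `scores` are hypothetical, and population variance is an arbitrary choice of spread measure.

```python
from statistics import mean, pvariance


def variance_split(scores_by_prompt):
    """Return (within, between) variance for repeated runs of several prompts.

    within:  average variance across repeated runs of the same prompt
    between: variance of the per-prompt mean scores
    """
    within = mean([pvariance(runs) for runs in scores_by_prompt.values()])
    between = pvariance([mean(runs) for runs in scores_by_prompt.values()])
    return within, between


# HYPOTHETICAL accuracy scores: four repeated runs per prompt variant.
scores = {
    "polite": [0.80, 0.84, 0.78, 0.82],
    "neutral": [0.81, 0.79, 0.85, 0.83],
    "rude": [0.83, 0.80, 0.86, 0.79],
}

within, between = variance_split(scores)
# If repeated runs of one prompt vary as much as different prompts do,
# an observed "prompt effect" may be mostly sampling noise.
noise_dominated = within >= between
print(noise_dominated)
```

When `within` is comparable to or larger than `between`, a single run per prompt cannot separate a genuine prompt effect from noise, so repeated runs are needed before any comparison across model generations.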

This is also why reasoning prompts reverse. Chain-of-thought only helps when the question's information aggregates into the prompt before reasoning starts; for simple questions, direct question-to-answer flow beats step-by-step, so the optimal prompt depends on question type and on how a given model routes salience (see 'Why do some questions perform better without step-by-step reasoning?'). As models get better at simple cases on their own, the scaffolding that once helped becomes overhead, which matches the tier reversal described above, where step-by-step reasoning reduces accuracy in high-performance models.

The quietly unsettling implication: there's a documented temptation to keep tuning prompts until the numbers look good, which bends evaluation criteria to fit whatever the current model happens to do well and manufactures self-fulfilling results (see 'Does iterative prompt engineering undermine scientific validity?'). If prompt effects are model-state artifacts, then a hard-won 'prompting principle' may be measuring the model you have, not a truth about prompting, and it can quietly expire the moment the model under it is replaced.
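One common guard against tuning until the numbers look good is to choose the prompt on a development split and report accuracy only on items never used for tuning. The sketch below swaps in that held-out split as an illustration; the source's own recommendation is a validated pipeline with inter-coder reliability and pre-specified criteria, which this does not implement. The `run_model` callable, the toy items, and all function names are hypothetical stand-ins.

```python
import random


def split_items(items, holdout_fraction=0.5, seed=0):
    """Shuffle with a fixed seed and split into (dev, holdout)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


def accuracy(prompt, items, run_model):
    """Fraction of (question, answer) pairs the model answers correctly."""
    return sum(run_model(prompt, q) == a for q, a in items) / len(items)


def select_then_report(prompts, items, run_model):
    """Pick the best prompt on dev, then score it once on held-out items."""
    dev, holdout = split_items(items)
    best = max(prompts, key=lambda p: accuracy(p, dev, run_model))
    return best, accuracy(best, holdout, run_model)


# HYPOTHETICAL stand-in for a real model call: it echoes the question as
# its answer only when given the "direct" prompt.
def run_model(prompt, question):
    return question if prompt == "direct" else "?"


items = [(str(i), str(i)) for i in range(20)]  # toy question/answer pairs
best_prompt, holdout_accuracy = select_then_report(
    ["direct", "step-by-step"], items, run_model
)
print(best_prompt, holdout_accuracy)
```

Fixing the split, the candidate prompts, and the scoring rule before looking at results keeps the evaluation criteria from drifting toward whatever the current model happens to do well.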


Sources (7 notes)

Does prompt politeness change how accurate language models are?

Testing 250 tone variants on ChatGPT-4o showed accuracy rising from 80.8% (Very Polite) to 84.8% (Very Rude), a gain of 4.0 percentage points that contradicts prior findings on GPT-3.5. The directional flip suggests tone effects are model-generation-dependent, not stable design principles.

Do prompt techniques work the same across all LLM tiers?

A 23-prompt benchmark across 12 LLMs shows rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning reduces accuracy in high-performance models. Task structure, not generic best practices, determines which prompts help.

Why do semantically identical prompts produce different LLM outputs?

Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

Does iterative prompt engineering undermine scientific validity?

Iterative prompt revision by single researchers introduces individual bias, shifts evaluation criteria to match LLM capabilities rather than task requirements, and creates self-fulfilling feedback loops. A validated pipeline with inter-coder reliability and pre-specified criteria is required instead.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst updating a prompt-engineering investigation for current model capability. The question: Why do prompting techniques that work on one model generation fail or reverse on the next?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat all as perishable claims to be re-tested:

• Tone effects flip between generations: rude prompts outperform polite ones on GPT-4o, reversing GPT-3.5 results (~2025, arXiv:2510.04950).
• Prompt gains depend on model tier: cheaper models benefit from rephrasing and background knowledge; high-performance models show *reduced* accuracy with step-by-step reasoning (~2024).
• Semantically identical prompts produce different outputs because statistical mass (pre-training frequency) varies — phrasings common in one corpus carry more weight (~2026, arXiv:2604.02176).
• Model confidence predicts prompt sensitivity: confident models resist rephrasing; low-confidence models swing wildly (~2024, ~2025).
• Persona simulation outputs are unstable across runs of the same prompt, rivaling cross-prompt variance (~2025, arXiv:2511.00222).
• Chain-of-thought helps only when information aggregates into the prompt before reasoning; simpler questions favour direct answering (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2510.04950 (2025) — Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy
• arXiv:2604.02176 (2026) — Adam's Law: Textual Frequency Law on Large Language Models
• arXiv:2401.04122 (2024) — From Prompt Engineering to Prompt Science With Human in the Loop
• arXiv:2509.09677 (2025) — The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, o3, Claude 4.5+), improved training methods, evaluation harnesses, or multi-agent orchestration have since relaxed or overturned it. Separate the durable question ("Is prompt effect model-dependent?") from the perishable limitation ("Does tone reversal hold for current models?"). Cite what resolved it; flag where constraints still appear to hold.

(2) Surface the strongest contradicting or superseding work from the last ~6 months — especially papers claiming prompt effects ARE stable, or that a unified prompting principle exists across generations.

(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Do emergent-reasoning models (o3+) show tone reversals, or has scaling eliminated the noise?" or "Can we predict prompt-effect direction from model pretraining frequency alone?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
