INQUIRING LINE

Do widely-repeated prompting heuristics like politeness actually improve accuracy?

This explores whether popular prompting folk-wisdom — being polite, saying 'please,' adding flattery — actually moves the accuracy needle, or whether such heuristics are stable design principles at all.


The question is whether the prompting habits people repeat to each other, politeness chief among them, genuinely improve accuracy. The short answer from the corpus: politeness is not a reliable lever, and the more interesting finding is that its effect *reverses* across model generations. A study testing 250 tone variants found that accuracy on GPT-4o actually climbed from 80.8% on 'Very Polite' prompts to 84.8% on 'Very Rude' ones, flipping earlier results seen on GPT-3.5 (see: Does prompt politeness change how accurate language models are?). That directional flip is the real takeaway: tone effects ride on quirks of a specific model's training, not on any durable property of language. A heuristic that flips sign when the model updates was never a principle; it was a coincidence dressed as one.
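One way to check a tone heuristic against a specific model, rather than taking it on faith, is to run the same questions under several tone prefixes and compare accuracy per tone. The sketch below is a minimal illustration, not the study's harness: the prefixes in `TONE_PREFIXES`, the `ask_model` callable, and the exact-match scoring are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical tone prefixes; the study's actual wording is not quoted in this note.
TONE_PREFIXES: Dict[str, str] = {
    "very_polite": "Would you be so kind as to answer the following question? ",
    "neutral": "",
    "very_rude": "Figure this out, if you can manage it: ",
}


def accuracy_by_tone(
    items: List[Tuple[str, str]],
    ask_model: Callable[[str], str],
) -> Dict[str, float]:
    """Return the fraction of exactly-matching answers for each tone prefix.

    `items` is a list of (question, expected_answer) pairs.
    `ask_model` is an assumed callable that sends a prompt and returns text.
    """
    if not items:
        raise ValueError("items must contain at least one question")
    correct = defaultdict(int)
    for question, expected in items:
        for tone, prefix in TONE_PREFIXES.items():
            answer = ask_model(prefix + question)
            # Case-insensitive exact match; real evaluations often need looser scoring.
            if answer.strip().lower() == expected.strip().lower():
                correct[tone] += 1
    return {tone: correct[tone] / len(items) for tone in TONE_PREFIXES}


def tone_gap(acc: Dict[str, float], a: str = "very_rude", b: str = "very_polite") -> float:
    """Accuracy of tone `a` minus accuracy of tone `b`, in percentage points."""
    return round((acc[a] - acc[b]) * 100, 1)
```

Applied to the reported figures, `tone_gap({"very_rude": 0.848, "very_polite": 0.808})` gives a 4.0 percentage point gap. Re-running the same harness on a different model generation is what would reveal whether the sign holds or flips.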

Step back and the corpus suggests why surface tone is the wrong place to look at all. Prompting only ever reorganizes knowledge the model already holds; it cannot inject anything that wasn't in training, which puts a hard ceiling on what *any* phrasing can buy you (see: Can prompt optimization teach models knowledge they lack?). So if 'please' isn't supplying missing facts (it isn't), its effect is at best a small nudge to retrieval, and an unstable one. Meanwhile, what genuinely changes outcomes is structural: whether the question's information actually flows into the prompt before reasoning begins. The same chain-of-thought wrapper helps complex questions and *hurts* simple ones, because the optimal prompt depends on question type, not on a universal trick (see: Why do some questions perform better without step-by-step reasoning?).
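The point about question type can be made concrete with a small prompt router that applies a chain-of-thought wrapper only when a question looks complex. This is a rough sketch under stated assumptions: `looks_complex` is a hypothetical stand-in heuristic (the cited work relies on saliency analysis, not on word counts), and `COT_SUFFIX` is just one common phrasing of the wrapper.

```python
from typing import Callable

# One common chain-of-thought cue; an assumption, not taken from the cited note.
COT_SUFFIX = "\nLet's think step by step."


def looks_complex(question: str) -> bool:
    """Hypothetical heuristic: treat long or multi-part questions as complex."""
    multi_part = question.count("?") > 1 or " and then " in question.lower()
    return multi_part or len(question.split()) > 25


def build_prompt(question: str, is_complex: Callable[[str], bool] = looks_complex) -> str:
    """Wrap complex questions in chain-of-thought; pass simple ones through unchanged."""
    if is_complex(question):
        return question + COT_SUFFIX
    return question
```

The design choice worth noting is that the classifier is injected as a parameter: the heuristic here is crude, and swapping in a better signal of question type should not require touching the routing logic.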

This hints at a deeper reframing: 'good prompt' isn't a vibe, it's a measurable space. Researchers have decomposed prompt quality into six evaluable dimensions (communication, cognition, instruction, logic, hallucination, responsibility) grounded in Gricean maxims and cognitive-load theory, where improving one dimension cascades into others (see: Can we measure prompt quality independent of model outputs?). Politeness barely registers in that framework; clarity, logical structure, and instruction quality do. The folk heuristics survive not because they work but because they're easy to repeat and hard to falsify in casual use.

The most provocative thread is that brittleness to phrasing is itself a flaw worth engineering away rather than exploiting. Consistency training teaches models to respond *identically* to clean and 'wrapped' prompts, using the model's own clean answers as targets, with the explicit aim of making superficial wording (the exact territory politeness lives in) stop mattering (see: Can models learn to ignore irrelevant prompt changes?). If that line of work succeeds, the entire genre of tone-tweaking heuristics becomes obsolete by design: the model would shrug off whether you begged or barked.
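The data-construction step behind output-level consistency training can be sketched in a few lines: answer the clean prompt first, then pair that answer with the wrapped version of the same prompt. The function below is an illustrative assumption, not the published method's code; `wrap` and `generate` are hypothetical callables standing in for the wrapper and the model, and the fine-tuning step that would consume the pairs is left out.

```python
from typing import Callable, Dict, List


def build_consistency_pairs(
    clean_prompts: List[str],
    wrap: Callable[[str], str],
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each wrapped prompt with the model's own answer to the clean prompt.

    `wrap` adds the superficial framing (for example, a polite or rude preamble).
    `generate` is an assumed callable returning the model's response to a prompt.
    """
    pairs = []
    for clean in clean_prompts:
        # The target comes from the clean prompt, so the wrapper cannot influence it.
        target = generate(clean)
        pairs.append({"prompt": wrap(clean), "target": target})
    return pairs
```

Because the targets are generated by the model itself, training on these pairs pushes the wrapped behavior toward the clean behavior without requiring externally written reference answers.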

So the thing you didn't know you wanted to know: the people optimizing 'please vs. no please' and the people building robust models are working at cross-purposes. One side hunts for magic words; the other is trying to make the magic words irrelevant — and the reversal of the politeness effect across GPT-3.5 and GPT-4o is early evidence the second side is winning.


Sources (5 notes)

Does prompt politeness change how accurate language models are?

Testing 250 tone variants across ChatGPT-4o showed accuracy rose from 80.8% (Very Polite) to 84.8% (Very Rude), contradicting prior findings on GPT-3.5. The directional flip suggests tone effects are model-generation-dependent, not stable design principles.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

Can we measure prompt quality independent of model outputs?

Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: do surface-level prompting heuristics like politeness actually improve LLM accuracy, or are they model-specific artifacts destined to flip or vanish?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable observations tied to specific model generations.
- Politeness effect REVERSES across model generations: 80.8% accuracy ('Very Polite') → 84.8% ('Very Rude') on GPT-4o, opposite of GPT-3.5 results (~2025).
- Prompting can only *activate* existing knowledge, never inject new facts; tone effects are unstable retrieval nudges, not principles (~2025).
- Optimal prompt structure depends on question *type*, not universal tricks; chain-of-thought helps complex questions, hurts simple ones (~2025).
- Prompt quality decomposes into six evaluable dimensions (communication, cognition, instruction, logic, hallucination, responsibility) grounded in Gricean maxims; politeness barely registers (~2025).
- Consistency training makes models *invariant* to superficial phrasing (clean vs. wrapped prompts), targeting the obsolescence of tone-tweaking heuristics (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2510.04950 (2025-10): Mind Your Tone — empirical reversal of politeness effect.
- arXiv:2506.06950 (2025-06): What Makes a Good Natural Language Prompt? — dimensional decomposition.
- arXiv:2510.27062 (2025-10): Consistency Training Helps Stop Sycophancy and Jailbreaks.
- arXiv:2502.10708 (2025-02): Injecting Domain-Specific Knowledge — knowledge activation limits.

Your task:
(1) RE-TEST EACH CONSTRAINT. For politeness effects, tone-dependence, and prompt robustness: has newer training (RLHF variants, post-training scaffolds), evaluation harnesses, or model scaling since June 2026 *relaxed* the brittleness to phrasing or *confirmed* the reversal? Separate the durable question ("Does surface tone matter?") from perishable claims ("Politeness helps GPT-4o"). Cite what resolved each constraint or where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — any findings that restore politeness effects, show tone invariance is incomplete, or reframe the consistency-training path.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., do newer post-training methods achieve robustness-by-design, or do they introduce *new* prompt sensitivities? Is "good prompting" advice now obsolete, or migrating elsewhere (e.g., system prompts, guardrails)?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
