What other pragmatic prompt features have unstable effects?
This explores which prompt features beyond literal wording (tone, phrasing, persona, reasoning steps) behave unstably, meaning the same move helps in one setting and hurts in another, sometimes flipping direction depending on the model. The corpus suggests that instability is the rule, not the exception, and that it traces back to a single fact: models respond to statistical patterns in their training data, not to the meaning or social intent you think you're encoding.
Start with politeness, the cleanest example. Rude prompts beat polite ones on ChatGPT-4o (84.8% accuracy for Very Rude against 80.8% for Very Polite), reversing what was reported for GPT-3.5 (see "Does prompt politeness change how accurate language models are?"). The effect didn't just weaken; it flipped direction across model generations, which means tone is not a stable design principle. The same pattern shows up with prompting technique: rephrasing and background-knowledge prompts boost cheap models, while step-by-step (chain-of-thought, CoT) prompting actively *reduces* accuracy in high-performance ones (see "Do prompt techniques work the same across all LLM tiers?"). Even within one model, CoT only helps when the question's information aggregates into the prompt structure before reasoning begins; for simple questions, asking directly beats asking the model to think step by step (see "Why do some questions perform better without step-by-step reasoning?").
The most unsettling case is paraphrasing. Two prompts that mean exactly the same thing produce systematically different output quality, not because of meaning but because one phrasing appears more frequently in pre-training data (see "Why do semantically identical prompts produce different LLM outputs?"). So even "say it more clearly" is an unstable lever, because clarity isn't what the model is responding to; corpus frequency is. Persona prompts fail for a related reason: run the same persona repeatedly and the output varies as much across runs as it does across *different* personas, because model uncertainty drowns out whatever social knowledge you're trying to invoke (see "Why do LLM persona prompts produce inconsistent outputs across runs?").
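The persona claim above is a comparison of two variances, which can be sketched in a few lines. This is a minimal illustration, assuming each persona run has been reduced to a numeric rating; the function name `variance_decomposition` and the toy ratings are hypothetical and are not taken from the cited note.

```python
from statistics import mean, pvariance

def variance_decomposition(scores_by_persona):
    """Return (within, between): average run-to-run variance inside each
    persona, and variance of the per-persona means across personas."""
    within = mean([pvariance(runs) for runs in scores_by_persona.values()])
    persona_means = [mean(runs) for runs in scores_by_persona.values()]
    between = pvariance(persona_means)
    return within, between

# Hypothetical ratings (say, 1-5 labels) from four repeated runs per persona.
# These numbers are made up for illustration only.
toy_scores = {
    "persona_a": [2, 4, 3, 5],
    "persona_b": [3, 5, 2, 4],
    "persona_c": [4, 2, 5, 3],
}

within, between = variance_decomposition(toy_scores)

# If the persona carried stable social knowledge, "between" should dominate.
persona_signal_dominates = between > within
```

In this toy data every persona has the same mean, so all of the spread is run-to-run noise. That is the pattern the note describes: the persona label explains little of the variation.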
There's a unifying signal underneath all of this. Prompt sensitivity tracks model confidence: when a model is confident, it shrugs off rephrasing; when it's uncertain, small wording changes cause large output swings (see "Does model confidence predict robustness to prompt changes?"). Larger models, few-shot examples, and objective tasks are all associated with higher confidence and therefore greater stability, which reframes "unstable prompt features" as a symptom of low-confidence regions rather than a property of the feature itself. It also explains why vague, generic prompts collapse into bland, blended answers: when you haven't given the model enough contextual scaffolding to be confident, it falls back on training-data priors (see "Why do large language models produce generic responses to vague queries?").
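One rough way to see this sensitivity in practice is to ask the same question several ways and measure how often the answers agree. The sketch below is a simple proxy under stated assumptions, not the metric used in the cited work: `paraphrase_consistency` is a hypothetical helper, and `stub_model` is a stand-in for a real model call so that the example runs on its own.

```python
from collections import Counter
from typing import Callable, Iterable

def paraphrase_consistency(ask: Callable[[str], str], paraphrases: Iterable[str]) -> float:
    """Share of paraphrases whose normalized answer equals the most common
    answer. 1.0 means the rephrasings made no difference."""
    answers = [ask(p).strip().lower() for p in paraphrases]
    if not answers:
        raise ValueError("need at least one paraphrase")
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / len(answers)

def stub_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call: answers only when it sees the
    # surface word "capital", mimicking sensitivity to wording.
    return "Paris" if "capital" in prompt.lower() else "unsure"

prompts = [
    "What is the capital of France?",
    "Name the capital city of France.",
    "France's capital is which city?",
    "Which city is the seat of the French government?",
]

score = paraphrase_consistency(stub_model, prompts)  # 3 of 4 answers agree
```

A low score on a set of paraphrases flags a region where wording tweaks are likely to swing the output, which is where adding examples or context is more useful than polishing the phrasing.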
The practical takeaway runs against the whole genre of "prompt best practices." Rather than chasing tone tricks or universal phrasings, the corpus points toward features that are stable because they're *structural*. Prompt quality has six evaluable dimensions grounded in communication theory, cognitive load theory, and instructional design, and improving one cascades to others (see "Can we measure prompt quality independent of model outputs?"). Forcing explicit argument structure, with the model stating warrants and backing, improves reasoning where free-form chain-of-thought skips implicit premises (see "Can structured argument prompts make LLM reasoning more rigorous?"). The reliable levers are the ones that add genuine information or constraint; the unstable ones are the ones that merely nudge surface form and hope the statistics break your way.
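To make "explicit argument structure" concrete, here is a minimal sketch of a prompt builder that walks through the six standard parts of Toulmin's argument model. The step wording, the `TOULMIN_STEPS` list, and `build_structured_prompt` are illustrative assumptions; they are not the prompts used by the CQoT method described in the sources.

```python
# The six parts of Toulmin's argument model, each paired with a
# hypothetical instruction written for this sketch.
TOULMIN_STEPS = [
    ("claim", "State the conclusion you are arguing for."),
    ("data", "List the facts from the question that support it."),
    ("warrant", "Explain why those facts license the claim."),
    ("backing", "Give the general principle that supports the warrant."),
    ("qualifier", "Say how certain the claim is."),
    ("rebuttal", "Name a condition under which the claim would fail."),
]

def build_structured_prompt(question: str) -> str:
    """Wrap a question in numbered argument steps so that premises
    have to be written out instead of left implicit."""
    lines = [
        f"Question: {question.strip()}",
        "",
        "Answer by filling in each part in order:",
    ]
    for i, (name, instruction) in enumerate(TOULMIN_STEPS, start=1):
        lines.append(f"{i}. {name.capitalize()}: {instruction}")
    lines.append("Do not give a final answer until the warrant and backing are stated.")
    return "\n".join(lines)

prompt = build_structured_prompt("Is 91 a prime number?")
```

The design choice is that the template adds constraint rather than tone: it changes what the model must produce, not how politely it is asked.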
Sources (9 notes)
Testing 250 tone variants on ChatGPT-4o showed accuracy rising from 80.8% (Very Polite) to 84.8% (Very Rude), contradicting prior findings on GPT-3.5. The directional flip suggests tone effects are model-generation-dependent, not stable design principles.
A 23-prompt benchmark across 12 LLMs shows rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning reduces accuracy in high-performance models. Task structure, not generic best practices, determines which prompts help.
Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.
Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.
When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
Unlike social-media context collapse, which flattens multiple audiences, LLM collapse occurs when users provide insufficient contextual scaffolding and models default to blended training-data priors. This distinction suggests remedies should focus on query verification and user-driven context specification rather than platform controls.
Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.
Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.