How much does annotator style actually influence chain-of-thought prompting performance?
This note explores whether the *style* of hand-written chain-of-thought (CoT) examples (the phrasing, formatting, and presentation choices an annotator makes) actually moves performance, or whether models are responding to something else entirely. The corpus suggests a surprising answer: style matters enormously, but not for the reason you'd think. It isn't that good annotators teach better reasoning; it's that the *form* of the demonstration is doing most of the work, often independent of whether the content is even correct.
The sharpest evidence is threefold: training format shapes reasoning strategy about 7.5× more than the actual domain of the problem; swapping the position of a demonstration can swing accuracy by 20%; and, most tellingly, invalid CoT prompts work roughly as well as valid ones (see "What makes chain-of-thought reasoning actually work?"). If a logically broken example performs as well as a correct one, then what the annotator is transmitting is a *pattern to imitate*, not a chain of inference. This fits the larger picture that CoT is constrained imitation of reasoning form, reproducing familiar schemata from training rather than performing genuine abstract inference (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?").
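The position and validity effects described above are the kind of thing one could probe with a small ablation harness. The following is a minimal sketch, not the method used in the cited notes: the `Demo` type, the `Q: / Reasoning: / A:` layout, and all function names are assumptions made for illustration. It only builds the prompt variants; sending them to a model and scoring the answers is left out.

```python
from itertools import permutations
from typing import NamedTuple


class Demo(NamedTuple):
    """One hand-written demonstration (hypothetical structure)."""
    question: str
    rationale: str
    answer: str


def render(demo: Demo) -> str:
    """Format one worked example in a fixed Q / Reasoning / A layout."""
    return f"Q: {demo.question}\nReasoning: {demo.rationale}\nA: {demo.answer}"


def build_prompt(demos: list[Demo], query: str) -> str:
    """Join the demonstrations in the given order, then append the open query."""
    blocks = [render(d) for d in demos]
    blocks.append(f"Q: {query}\nReasoning:")
    return "\n\n".join(blocks)


def position_variants(demos: list[Demo], query: str) -> list[str]:
    """One prompt per ordering of the same demonstrations (position ablation)."""
    return [build_prompt(list(order), query) for order in permutations(demos)]


def corrupt(demo: Demo, broken_rationale: str) -> Demo:
    """Keep the form and the final answer, but swap in an invalid rationale."""
    return demo._replace(rationale=broken_rationale)
```

Holding the content fixed while varying only order, or holding the form fixed while breaking the rationale, separates the two effects: if accuracy moves in the first case and barely moves in the second, the demonstration is acting as a pattern rather than as an argument.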
That reframes "annotator style" in a useful way: a lot of what an annotator writes is presentation, not computation. When researchers stripped chains down to minimal drafts, they matched full verbose accuracy using only 7.6% of the tokens, meaning the other ~92% served style and documentation, not the answer (see "Can minimal reasoning chains match full explanations?"). So the verbose, explanatory style annotators tend to favor is largely cosmetic from the model's standpoint. Style influences performance through structure (where things sit, what format they follow), not through eloquence or thoroughness.
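To make the token arithmetic concrete, here is a small sketch that measures what share of a verbose chain a terse draft retains. The helper names are hypothetical, the two example chains are invented for illustration, and whitespace splitting is only a stand-in for a real tokenizer, so the ratio it yields is not comparable to the 7.6% figure above.

```python
def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real tokenizer would give different numbers."""
    return len(text.split())


def token_share(draft_tokens: int, verbose_tokens: int) -> float:
    """Fraction of the verbose chain's tokens that the draft keeps."""
    if verbose_tokens <= 0:
        raise ValueError("verbose_tokens must be positive")
    return draft_tokens / verbose_tokens


# Invented example chains, not taken from any dataset.
verbose = "There are 20 apples and 12 are given away, so 20 minus 12 leaves 8 apples."
draft = "20 - 12 = 8"

share = token_share(count_tokens(draft), count_tokens(verbose))
print(f"draft keeps {share:.1%} of the verbose chain's tokens")

# The reported figure works the same way: keeping 7.6% means removing 92.4%.
removed = 1 - 0.076
```

Both chains reach the same answer; what the draft drops is the narration around the computation, which is the part the paragraph above calls cosmetic.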
But here's the twist that complicates any blanket rule: the *right* style isn't fixed; it depends on the question. Saliency analysis shows zero-shot CoT only helps when the question's information flows into the prompt before reasoning begins; for simpler questions, a direct question-to-answer style beats step-by-step, so the optimal demonstration depends on question type rather than task category (see "Why do some questions perform better without step-by-step reasoning?"). There's also a length dimension: accuracy follows an inverted U, peaking at intermediate chain length and *declining* when chains run long, with stronger models preferring shorter chains (see "Why does chain of thought accuracy eventually decline with length?"). An annotator's habitual verbosity can therefore actively hurt on a capable model or an easy question.
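The inverted-U claim suggests a simple empirical check: measure accuracy at several chain lengths and see whether the best one sits in the interior of the tested range. The sketch below assumes you already have such measurements as a mapping from chain length to accuracy; the function names are hypothetical and the numbers in the usage example are invented purely to show the shape, not results from the cited work.

```python
def peak_chain_length(accuracy_by_length: dict[int, float]) -> int:
    """Return the chain length with the highest accuracy (shortest wins ties)."""
    if not accuracy_by_length:
        raise ValueError("need at least one measurement")
    return min(accuracy_by_length, key=lambda n: (-accuracy_by_length[n], n))


def has_interior_peak(accuracy_by_length: dict[int, float]) -> bool:
    """True if the best length lies strictly between the shortest and longest tested."""
    lengths = sorted(accuracy_by_length)
    peak = peak_chain_length(accuracy_by_length)
    return lengths[0] < peak < lengths[-1]


# Invented numbers for illustration only: chain length in steps -> accuracy.
toy = {2: 0.61, 4: 0.74, 8: 0.70, 16: 0.58}
best = peak_chain_length(toy)
interior = has_interior_peak(toy)
```

Breaking ties toward the shorter chain is a deliberate choice here: if two lengths score the same, the shorter demonstration costs fewer tokens and, per the paragraph above, is less likely to hurt a stronger model.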
One more thing worth knowing, even if you weren't looking for it: there's a hard ceiling on all of this. No amount of stylistic craft injects knowledge the model doesn't already have; prompt optimization only reorganizes and activates existing training-distribution knowledge and can't supply what's missing (see "Can prompt optimization teach models knowledge they lack?"). So annotator style is a powerful *activation key* for capabilities already latent in the model, but a useless *teaching tool* for capabilities that aren't. Style decides whether the door opens; it can't build the room behind it.
Sources (6 notes)
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of the tokens. The removed 92.4% of tokens served style and documentation, not computation.
Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.