Why does embedding evaluation criteria in prompts reduce creative scope?
This explores a tension in prompting: when you tell a model the standards its output will be judged against, you also tell it where to stop looking — and the corpus suggests evaluation criteria act less like a quality filter and more like a gravity well that pulls generation toward the expected.
Spelling out evaluation criteria up front doesn't just raise the floor on quality; it also lowers the ceiling on surprise. The clearest mechanism in the corpus comes from work treating prompt engineering as divergence minimization: users iteratively steer a model toward the distribution they already anticipate, so outputs become co-productions of model and user expectation rather than genuinely novel territory (see "How much does the user shape what a model generates?"). Evaluation criteria are the sharpest possible form of that prior. Once "good" is defined, the model has every incentive to converge on it, and the space of things it might have explored quietly collapses.
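To make the divergence-minimization picture concrete, here is a minimal toy sketch. It is not the cited work's method: the five-way output distribution, the user prior, the geometric-interpolation update in `steer`, and the `alpha` step size are all illustrative assumptions. It only shows that pulling a distribution toward a sharply peaked prior shrinks both the divergence from that prior and the entropy (the spread) of what gets generated.

```python
import math

def entropy(p):
    # Shannon entropy in nats: a rough proxy for how much "scope" remains.
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q):
    # KL(p || q): how far the model's distribution is from the user's prior.
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def steer(p, q, alpha):
    # Hypothetical refinement step: geometric interpolation toward the prior,
    # then renormalisation. Chosen for simplicity, not taken from the source.
    mixed = [(x ** (1 - alpha)) * (y ** alpha) for x, y in zip(p, q)]
    total = sum(mixed)
    return [m / total for m in mixed]

def refine(p, q, alpha=0.5, rounds=4):
    # Record (divergence, entropy) before and after each refinement round.
    history = [(kl(p, q), entropy(p))]
    for _ in range(rounds):
        p = steer(p, q, alpha)
        history.append((kl(p, q), entropy(p)))
    return history

# Assumed toy setup: the model starts indifferent between five candidate
# outputs; the user's criteria strongly favour the first one.
model = [0.2] * 5
user_prior = [0.7, 0.15, 0.05, 0.05, 0.05]

for step, (divergence, spread) in enumerate(refine(model, user_prior)):
    print(f"round {step}: KL to prior = {divergence:.3f}, entropy = {spread:.3f}")
```

In this toy setup both columns fall with every round: each step that makes the output look more like what the user expects also removes probability from the alternatives.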
That collapse has a name elsewhere in the corpus: tail narrowing. Research on critique models found that repeated self-training iterations tend to shrink the diversity of solutions a model will generate (premature convergence), and that the real value of step-level critique in the training loop is *counteracting* that narrowing rather than tightening it (see "Do critique models improve diversity during training itself?"). Embedding criteria in a prompt applies a similar convergence pressure at test time: it rewards the model for hitting a known target and penalizes the wandering that creativity depends on.
The deeper reason this bites *creative* scope specifically is that creativity isn't one thing. One line of work breaks creative reasoning into combinational, exploratory, and transformational modes, and argues that existing methods address only conventional, convergent problem-solving, a gap that may explain diversity collapse in ideation (see "Can LLMs reason creatively beyond conventional problem-solving?"). Evaluation criteria are convergent by construction: they describe the answer's shape in advance. Exploratory and transformational moves, the ones that reframe the problem or break the frame, can't be scored by a rubric written before they exist, so a criteria-laden prompt structurally has no room for them.
Here's the part you might not expect: this is sometimes a feature. Research on the "gulf of envisioning" found that users often *can't* articulate what they want, and that shifting them from open-ended generation to constrained evaluation of presented options reduces cognitive burden and helps intent mature (see "Why can't users articulate what they want from AI?"). So the narrowing isn't a bug to eliminate; it's a dial. Criteria trade exploration for steerability, and the right setting depends on whether you're trying to discover or to converge. The limit case is worth noting too: prompting only reorganizes what a model already holds and can't supply knowledge it lacks (see "Can prompt optimization teach models knowledge they lack?"), so criteria can suppress creative range but were never the source of it either.
Sources (5 notes)
Foundation Priors research frames prompt engineering as divergence minimization between synthetic output and user priors. The refinement process systematically steers generation toward what users already expect, making outputs co-productions of model and user subjectivity.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.
Intent develops through interaction, not in isolation. Since AI models respond rather than probe, they miss opportunities to help users discover unarticulated requirements. Structured dialogue that presents model-generated options shifts the cognitive burden from open-ended envisioning to constrained evaluation.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.