INQUIRING LINE


Defining 'good' in your prompt doesn't just raise the quality floor — it quietly lowers the ceiling on surprise.

Why does embedding evaluation criteria in prompts reduce creative scope?

This explores a tension in prompting: when you tell a model the standards its output will be judged against, you also tell it where to stop looking — and the corpus suggests evaluation criteria act less like a quality filter and more like a gravity well that pulls generation toward the expected.

Spelling out evaluation criteria up front does more than raise the floor on quality; it also lowers the ceiling on surprise. The clearest mechanism in the corpus comes from work treating prompt engineering as divergence minimization: users iteratively steer a model toward the distribution they already anticipate, so outputs become co-productions of model and user expectation rather than genuinely novel territory (see "How much does the user shape what a model generates?"). Evaluation criteria are the sharpest possible form of that prior. Once "good" is defined, the model has every incentive to converge on it, and the space of things it might have explored quietly collapses.
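The divergence-minimization framing can be made concrete with a toy model. The sketch below is only an illustration under stated assumptions, not the method from the cited work: the four-way output distribution, the `user_prior` values, and the `steer` update rule (a geometric blend toward the user's prior) are all hypothetical. It shows the two quantities moving together: each refinement round lowers the divergence from what the user expects and also lowers the entropy of what the model might produce.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Shannon entropy in nats; used here as a stand-in for output diversity."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def steer(current, target, strength=0.5):
    """Hypothetical refinement round: geometric blend toward the target, renormalized."""
    blended = [c ** (1 - strength) * t ** strength for c, t in zip(current, target)]
    total = sum(blended)
    return [b / total for b in blended]


# Assumed toy setup: four possible output styles, one of which the user expects.
model = [0.25, 0.25, 0.25, 0.25]
user_prior = [0.85, 0.05, 0.05, 0.05]

history = []  # (divergence from user prior, entropy) before each round
current = model
for _ in range(6):
    history.append((kl_divergence(user_prior, current), entropy(current)))
    current = steer(current, user_prior)

for step, (kl, h) in enumerate(history):
    print(f"round {step}: KL(user || model) = {kl:.3f}, entropy = {h:.3f}")
```

In this toy, every round that brings the output closer to the user's expectation also removes probability mass from the alternatives, which is the collapse the paragraph describes.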

That collapse has a name elsewhere in the corpus: tail narrowing. Research on critique models found that repeated self-training tends to shrink the diversity of solutions a model will generate (premature convergence), and that the real value of step-level critique in the training loop is *counteracting* that narrowing rather than tightening it (see "Do critique models improve diversity during training itself?"). Embedding criteria in a prompt is a test-time analogue of that convergence pressure: it rewards the model for hitting a known target and penalizes the wandering that creativity depends on.

The deeper reason this bites *creative* scope specifically is that creativity isn't one thing. One line of work breaks creative reasoning into combinational, exploratory, and transformational modes, and argues that existing methods serve only conventional, convergent problem-solving, which may explain the diversity collapse seen in ideation (see "Can LLMs reason creatively beyond conventional problem-solving?"). Evaluation criteria are convergent by construction: they describe the answer's shape in advance. Exploratory and transformational moves, the ones that reframe the problem or break the frame, can't be scored by a rubric written before they exist, so a criteria-laden prompt leaves little structural room for them.

Here's the part you might not expect: this is sometimes a feature. Research on the "gulf of envisioning" found users often *can't* articulate what they want, and that shifting them from open-ended generation to constrained evaluation of presented options reduces cognitive burden and helps intent mature (see "Why can't users articulate what they want from AI?"). So the narrowing isn't a bug to eliminate; it's a dial. Criteria trade exploration for steerability, and the right setting depends on whether you're trying to discover or to converge. The limit case is worth noting too: prompting only reorganizes what a model already holds and can't supply knowledge it lacks (see "Can prompt optimization teach models knowledge they lack?"), so criteria can suppress creative range but were never the source of it.
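The "dial" can be pictured as a single weight that trades rubric fit against novelty when choosing among candidate outputs. This is a minimal sketch under invented assumptions: the candidate labels, their `rubric_fit` and `novelty` scores, and the linear scoring in `select` are hypothetical and are not drawn from any of the cited papers.

```python
# Hypothetical candidates: (label, rubric_fit, novelty), each score in [0, 1].
candidates = [
    ("conventional summary", 0.95, 0.10),
    ("polished variation", 0.85, 0.30),
    ("unusual analogy", 0.55, 0.70),
    ("reframed problem", 0.30, 0.95),
]


def select(criteria_weight, k=2):
    """Return the top-k labels under a linear blend of rubric fit and novelty.

    criteria_weight near 1 means the stated criteria dominate (converge);
    criteria_weight near 0 means novelty dominates (discover).
    """
    def score(candidate):
        _, rubric_fit, novelty = candidate
        return criteria_weight * rubric_fit + (1 - criteria_weight) * novelty

    ranked = sorted(candidates, key=score, reverse=True)
    return [label for label, _, _ in ranked[:k]]


discover = select(criteria_weight=0.1)  # criteria barely count
converge = select(criteria_weight=0.9)  # criteria dominate

print("discover:", discover)
print("converge:", converge)
```

With a low weight the selection favors the frame-breaking candidates; with a high weight it favors the ones that already match the rubric. Neither setting is correct in general, which is the sense in which criteria act as a dial rather than a defect.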

Sources (5 notes)

How much does the user shape what a model generates?

Foundation Priors research shows prompt engineering as divergence minimization between synthetic output and user priors. The refinement process systematically steers generation toward what users already expect, making outputs co-productions of model and user subjectivity.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Can LLMs reason creatively beyond conventional problem-solving?

Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.

Why can't users articulate what they want from AI?

Intent develops through interaction, not in isolation. Since AI models respond rather than probe, they miss opportunities to help users discover unarticulated requirements. Structured dialogue that presents model-generated options shifts the cognitive burden from open-ended envisioning to constrained evaluation.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Large Language Models for User Interest Journeys (1.62 match)
Universe of Thoughts: Enabling Creative Reasoning with Large Language Models (0.90 match)
Foundation Priors (0.87 match)
Injecting Domain-Specific Knowledge into Large Language Models: A Comprehensive Survey (0.85 match)
When to Act, When to Wait: Modeling Structural Trajectories for Intent Triggerability in Task-Oriented Dialogue (0.85 match)
Reasoning LLMs are Wandering Solution Explorers (0.85 match)
Bridging the Gulf of Envisioning: Cognitive Design Challenges in LLM Interfaces (0.85 match)
Deep Language Networks: Joint Prompt Training of Stacked LLMs using Variational Inference (0.85 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. This question remains open: **Does embedding evaluation criteria in prompts structurally constrain creative scope, or have recent models, training methods, and evaluation harnesses dissolved that constraint?**

What a curated library found — and when (findings span 2023–11/2025, dated claims, not current truth):
• Prompt engineering as iterative alignment minimizes divergence; criteria are the sharpest prior, collapsing exploration into anticipated distribution (~2024–25).
• Self-training shows tail narrowing, which step-level critique in the training loop counteracts; premature convergence rewards known targets and penalizes wandering (~2024-11).
• Creative reasoning splits into combinational, exploratory, and transformational modes; convergent criteria can't score transformational moves that reframe before they exist (~2024–25).
• The "gulf of envisioning" shows users benefit from constrained evaluation over open-ended generation when intent is immature (~2024–25).
• Prompt optimization activates knowledge but cannot inject new knowledge; criteria suppress range but don't originate it (~2023–24).

Anchor papers (verify; mind their dates):
• arXiv:2411.16579 (2024-11) — Critique Models with Test-Time and Training-Time Supervision
• arXiv:2511.20471 (2025-11) — Universe of Thoughts: Enabling Creative Reasoning
• arXiv:2506.06950 (2025-06) — What Makes a Good Natural Language Prompt?
• arXiv:2507.21028 (2025-07) — Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models (o1, o3, or later), in-context learning strategies (few-shot exemplars, chain-of-thought variants), orchestration patterns (agentic iteration, recursive critique, ensemble generation), or novel evaluation frameworks (multi-agent judges, open-ended rubrics) have since relaxed or overturned the tail-narrowing effect. Separate the durable claim (criteria do steer toward anticipation) from the perishable limit (this necessarily collapses creativity). Cite what resolved it.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Has anyone shown criteria *amplify* exploratory reasoning? Look for work on "tension-aware prompting" or multi-objective generation.
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., "Can dynamically relaxed criteria maintain steerability while preserving transformational modes?" or "Do multi-agent evaluation frameworks re-open exploration despite front-loaded criteria?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
