INQUIRING LINE

Why do LLMs plateau on creativity tasks while humans reach further?

This explores why LLMs seem to hit a ceiling on creative tasks — but the corpus complicates the premise: the bottleneck isn't generating novelty, it's evaluating it, exploring systematically, and reasoning in genuinely creative modes.


This reads the question as "where exactly does LLM creativity stall, and why", and the surprising answer in the collection is that the plateau usually isn't where people assume. The raw generation of novel ideas is often *not* the weak point. In a controlled study of 100+ NLP researchers, LLM-generated research ideas were rated as significantly *more* novel than human experts' ideas (p<0.05), though slightly lower on feasibility (Do language models generate more novel research ideas than experts?). The notes attribute this to LLMs lacking the disciplinary priors that constrain experts to safe combinations. So if there's a ceiling, it's not a ceiling on combinatorial reach.

The more revealing finding is that creativity is two skills, not one, and LLMs only have the first. Ideation and evaluation are *dissociated*: models combine ideas freely but systematically avoid taking the evaluative stance needed to judge whether an idea is feasible or valid (Can LLMs generate more novel ideas than human experts?). This is the mirror image of human experts, whose knowledge constrains novelty but lets them filter. The trade-off plays out directly in design work, where LLM solutions score higher on feasibility and usefulness but lower on novelty than crowdsourced human solutions, and few-shot prompting *narrows* diversity further as it improves quality alignment (Why do LLMs excel at feasible design but struggle with novelty?). Push for usefulness and the novelty collapses; push for novelty and nothing prunes the bad ideas.

A second source of the plateau is that current methods only know how to do *conventional* problem-solving. One paper argues creative reasoning requires three distinct paradigms (combinational, exploratory, and transformational) and that existing LLM reasoning approaches address none of them, which may explain the "diversity collapse" people observe in ideation (Can LLMs reason creatively beyond conventional problem-solving?). Humans reach further partly because they fluidly switch between recombining, searching within a space, and breaking the space open; on this account, current reasoning methods stay within conventional problem-solving.

Underneath all of this is an exploration deficit that looks almost architectural. LLMs commit to answers prematurely because uncertainty signals dominate the early transformer layers while the "empowerment" signals that reward keeping options open only emerge in the middle layers, a temporal mismatch that closes off long-horizon exploration before those signals can influence the decision (Why do large language models explore less effectively than humans?). Even in simple decision tasks, models can't reliably explore without external history summarization and explicit prompting (Why do LLMs struggle with exploration in simple decision tasks?), and reasoning models tend to *wander* rather than search systematically, so success probability drops exponentially as problems get deeper (Why do reasoning LLMs fail at deeper problem solving?). Creativity that "reaches further" is mostly sustained, structured exploration: exactly the thing that breaks.
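To make the depth effect concrete, here is a minimal sketch of how success can fall off as problems get deeper. It assumes, purely for illustration, that a solution needs `depth` steps in a row to go right and that each step succeeds independently with the same probability. The function name `success_probability` and the 0.9 per-step figure are hypothetical and are not taken from the cited work.

```python
def success_probability(per_step: float, depth: int) -> float:
    """Chance that all `depth` steps succeed, assuming independent steps.

    Illustrative assumption only: each step succeeds with the same
    probability `per_step`, independently of the others.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return per_step ** depth


if __name__ == "__main__":
    # Hypothetical per-step reliability of 0.9 across increasing depths.
    for depth in (1, 5, 10, 20, 40):
        print(f"depth={depth:>2}  success={success_probability(0.9, depth):.4f}")
```

Under these assumptions, a solver that gets each step right 90% of the time finishes a 10-step problem about 35% of the time and a 40-step problem less than 2% of the time, which is one simple way a medium problem can stay solvable while a deep one becomes far harder.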

The deepest framing in the corpus suggests the gap may not be fixable by better sampling at all. One line of work argues LLMs absorb the same shared symbolic system humans do but never develop the *participatory* subjectivity, the reflexive stance of having a position and revising it, that comes from being socialized into the world (Do LLMs develop the same kind of mind as humans?). The less expected implication: the human "further reach" might be less about generating more and more about *caring* which ideas are worth keeping, an evaluative, stake-holding act the current architecture sidesteps.


Sources (8 notes)

Do language models generate more novel research ideas than experts?

A statistically significant study of 100+ NLP researchers found LLM-generated ideas rated as more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.

Can LLMs generate more novel ideas than human experts?

LLMs produce more novel research ideas than experts because they lack disciplinary constraints, but they systematically avoid evaluative stance-taking required to assess feasibility or validity. Generation and evaluation are dissociated capabilities.

Why do LLMs excel at feasible design but struggle with novelty?

Expert evaluation shows LLM-generated conceptual designs score higher on feasibility and usefulness but lower on novelty compared to crowdsourced human solutions. Few-shot learning further reduces diversity while improving quality alignment.

Can LLMs reason creatively beyond conventional problem-solving?

Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.

Why do large language models explore less effectively than humans?

SAE decomposition shows uncertainty values dominate early transformer blocks while empowerment representations emerge only in middle blocks. This temporal mismatch causes models to commit to decisions before long-term exploration signals can influence them. Reasoning-trained o1 overcomes this by extending computation time.

Why do LLMs struggle with exploration in simple decision tasks?

Across multi-armed bandit environments, only GPT-4 with explicit exploratory hints, external history summarization, and chain-of-thought reasoning achieves satisfactory exploration. Without external summarization, models cannot reliably track and aggregate unstructured interaction history to guide exploratory decisions.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Do LLMs develop the same kind of mind as humans?

Both humans and LLMs are shaped by the same intersubjective symbolic system, but only humans develop reflexive agency through socialization. This absence produces measurable differences in how AI argues without declaring its position or reflecting on its own assumptions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a creativity-systems researcher. The precise question: why do LLMs plateau on creative tasks while humans reach further? A curated library (2022–2025) found the following. Each claim is tagged with when it was made; treat these as dated claims, not current truth:

• LLM ideation often *exceeds* human novelty in raw idea generation (100+ NLP researchers rated LLM ideas as statistically more novel), BUT ideation and evaluation are dissociated — models combine freely but avoid the evaluative stance needed to judge feasibility (2024–2025).
• Few-shot prompting narrows diversity as it improves quality; pushing for usefulness collapses novelty (2023).
• Creative reasoning requires three paradigms (combinational, exploratory, transformational) that existing LLM reasoning approaches don't address, explaining "diversity collapse" (2024–2025).
• LLMs commit to answers prematurely due to temporal mismatch: uncertainty signals dominate early layers while empowerment (keeping options open) only emerges mid-layers (2025).
• Models fail at sustained exploration without external memory and explicit prompting; reasoning LLMs wander rather than search systematically (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2409.04109 (2024-09): Novel research ideation study, 100+ researchers
• arXiv:2501.18009 (2025-01): Temporal mismatch in exploration ("Think Too Fast")
• arXiv:2505.20296 (2025-05): Wandering vs. systematic search in reasoning LLMs
• arXiv:2511.20471 (2025-11): Creative reasoning with structured thought

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, does newer training (post-Nov 2025), inference orchestration (agentic loops, persistent memory, search harnesses), or evaluation methodology now relax or overturn it? Separate the durable question (likely: how to combine novelty *and* evaluation) from what may have resolved (e.g., has external memory + multi-turn interaction fixed the exploration deficit?). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months that challenges the ideation–evaluation dissociation or the premature-commitment thesis.
(3) Propose 2 research questions that assume the regime has shifted — e.g., if agentic iteration now solves exploration, what *new* creativity ceiling emerges?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
