SYNTHESIS NOTE

Can models learn what makes research worth doing?

Can large language models be trained to recognize high-impact research directions by learning from citation patterns? This note explores whether 'scientific taste' (the judgment of what work matters) is a learnable skill separate from execution.

Synthesis note · 2026-04-01 · sourced from Reinforcement Learning

Most AI-scientist research focuses on execution: literature search, experiment design, data analysis. RLCF addresses a different capability: judging which research directions are worth pursuing. The authors call this judgment capacity "scientific taste."

The training paradigm: Reinforcement Learning from Community Feedback (RLCF) uses citation counts as community feedback signals. To mitigate field and time biases, training data consists of 700K pairs of paper abstracts matched by field and publication year, where the higher-cited paper serves as the preferred (higher-impact) item.
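The pairing step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Paper` record, the `build_preference_pairs` function, and the `min_gap` tie filter are all hypothetical names and choices introduced here. The only details taken from the note are that pairs are matched by field and publication year and that the higher-cited paper is the preferred item.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Paper:
    # Hypothetical record; the note only says pairs are built from abstracts.
    abstract: str
    field: str
    year: int
    citations: int


def build_preference_pairs(papers, min_gap=1):
    """Return (preferred, other) pairs of papers sharing field and year.

    min_gap is an assumption of this sketch: pairs whose citation counts
    differ by less than min_gap are skipped because they carry no clear
    preference signal.
    """
    # Bucket papers so comparisons never cross fields or publication years.
    buckets = defaultdict(list)
    for paper in papers:
        buckets[(paper.field, paper.year)].append(paper)

    pairs = []
    for group in buckets.values():
        for a, b in combinations(group, 2):
            if abs(a.citations - b.citations) < min_gap:
                continue
            # The higher-cited paper is the preferred (higher-impact) item.
            preferred, other = (a, b) if a.citations > b.citations else (b, a)
            pairs.append((preferred, other))
    return pairs
```

Matching inside each (field, year) bucket is what keeps a highly cited older paper, or a paper from a high-citation field, from winning comparisons on age or field size alone.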

Two trained models:

Scientific Judge — a generative reward model that compares two papers, reasons about their relative impact, and chooses the better one. Outperforms GPT-5.2, Gemini 3 Pro, and other SOTA LLMs at predicting impact. Generalizes to future-year test sets, unseen fields, and peer-review preferences.
Scientific Thinker — a policy model trained via RL with Scientific Judge as reward model. Given a paper's title and abstract, it proposes follow-up research ideas with higher potential impact than baselines.
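The note does not say how Scientific Judge's pairwise verdict becomes a scalar reward for Scientific Thinker. One plausible reading, sketched below under that assumption, scores a candidate idea by its win rate against a set of baseline ideas. The `Judge` interface (a callable returning "A" or "B"), the `pairwise_reward` function, the choice to query both orderings, and the length-based `toy_judge` stand-in are all hypothetical and exist only for illustration.

```python
from typing import Callable, Sequence

# Hypothetical interface: given two texts, return "A" if the first is judged
# higher-impact, otherwise "B". A real judge would be the trained model.
Judge = Callable[[str, str], str]


def pairwise_reward(candidate: str, baselines: Sequence[str], judge: Judge) -> float:
    """Fraction of comparisons the candidate wins against the baselines.

    Each baseline is compared in both orderings so that a judge with a
    position preference cannot systematically favour one slot (an assumption
    of this sketch, not something the note describes).
    """
    if not baselines:
        raise ValueError("need at least one baseline idea")
    wins = 0
    total = 0
    for base in baselines:
        wins += judge(candidate, base) == "A"  # candidate in first slot
        wins += judge(base, candidate) == "B"  # candidate in second slot
        total += 2
    return wins / total


def toy_judge(a: str, b: str) -> str:
    # Stand-in for demonstration only: prefers the longer text.
    return "A" if len(a) >= len(b) else "B"
```

A policy-gradient trainer could then use `pairwise_reward(idea, baseline_ideas, judge)` as the return for each sampled idea; the choice of trainer is left open here because the note does not specify one.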

The theoretical framing is significant. The authors invoke Hume: "a standard of taste can emerge from the joint verdict of qualified judges rather than arbitrary individual preference." And Kant: taste as "sensus communis" — a shared sense that considers how others could judge. Scientific taste is not personal preference. It is alignment with community judgment. RLCF operationalizes this: the reward signal comes from community behavior (citations), not individual annotation (RLHF) or formal verification (RLVR).

The three RL paradigms now distinguished:

RLHF — individual human preferences (costly, limited to annotator capacity)
RLVR — verifiable ground truth (math, code — limited to tasks with objective answers)
RLCF — community-level feedback (scales with community size, captures collective judgment)

RLCF is the training analog of the note "Can AI predict social norms better than humans?": the model learns to predict community preference (which papers will be cited) without participating in the community process that produces citations. It predicts taste without having taste. This is the same prediction-without-participation pattern, now as an explicit training objective.

In terms of the note "Can AI ever gain expert community trust through participation?", RLCF trains the model to bypass the validation circle entirely, learning what the circle would approve without joining it. The epistemological implications for the Tokenization series are direct: this is a machine that learns to produce knowledge-tokens calibrated to community acceptance, without the community process that gives acceptance its warrant.

Inquiring lines that read this note 10

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How should iterative research systems allocate reasoning per search step?

How does semantic search over research papers guide autonomous architecture proposals?

How should human oversight be integrated with autonomous AI systems?

Where do human researchers retain competitive advantage over autoresearch systems?

How do evaluation biases undermine LLM quality assessment systems?

Why does automated evaluation consistently overestimate research quality?

How do we evaluate AI systems when user perception misleads actual performance?

Does brute force experimentation substitute for research intuition and taste?

Why can't humans reliably detect AI-generated text despite measurable linguistic signatures?

Can structured evaluation assess novelty in scientific writing?

Why do readers trust citations and complexity regardless of accuracy?

Why do LLM research ideas score high on novelty yet collapse into low diversity?

What distinguishes scientific plausibility from cognitive availability in research ideas?

How does memorization interact with learning and generalization?

Can experimental outcomes be reliably distilled into reusable insights?

What structural factors drive popularity bias in recommendation systems?

Can ranking by coherence while minimizing author-community coverage find novel research?

Related concepts in this collection 6

Can reward models learn by comparing policies instead of judging them? What if reward models worked as policy discriminators—measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?
Both reframe reward as a *relational* construct rather than an absolute property: POLAR uses similarity-to-target-policy, RLCF uses ranking-within-community. Neither requires labeled absolute preferences; both achieve scalability by relativization.
Can language models replace reward models with internal signals? Recent RL research shows three independent patterns—self-judgment, belief-shift, and rich feedback—that each eliminate a component of the traditional RLHF stack. Are these patterns converging on a fundamentally different architecture for training without external verifiers?
RLCF's community-feedback signal is a fourth pattern in the verifier-free convergence: external relational reward without individual annotation
Can AI predict social norms better than humans? Explores whether language models can achieve superhuman accuracy at predicting what communities find socially appropriate, and what that capability reveals about the difference between prediction and genuine participation.
RLCF is the training-level version: learning community preference without community participation
Can AI ever gain expert community trust through participation? Explores whether AI can accumulate the social capital and track record that human experts build within their communities. Questions whether prediction of social norms equals genuine participation in expert validation processes.
RLCF trains the model to bypass the validation circle.
Why does RL succeed more on some tasks than others? Reinforcement learning shows wildly different improvement rates across conversational tasks—from near-total capability unlock to modest gains. What determines whether RL will transform performance or produce incremental progress?
RLCF introduces a third reward type: community-level feedback as neither binary verification nor individual judgment
Do LLM research ideas actually hold up when experts try to execute them? Explores whether LLM-generated ideas maintain their apparent novelty advantage when expert researchers spend 100+ hours implementing them. Matters because ideation-stage evaluation may not capture real-world feasibility barriers.
Scientific Thinker addresses ideation quality; whether execution quality follows is untested
