SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Recommender Systems · Reasoning, Retrieval, and Evaluation

Can models learn what makes research worth doing?

Can large language models be trained to recognize high-impact research directions by learning from citation patterns? This note explores whether 'scientific taste', the judgment of what work matters, is a learnable skill separate from execution.

Synthesis note · 2026-04-01 · sourced from Reinforcement Learning

Most AI-scientist research focuses on execution: literature search, experiment design, data analysis. RLCF (Reinforcement Learning from Community Feedback) targets a different capability: judging which research directions are worth pursuing. The authors call this capacity for judgment "scientific taste."

The training paradigm: Reinforcement Learning from Community Feedback (RLCF) uses citation counts as community feedback signals. To mitigate field and time biases, training data consists of 700K pairs of paper abstracts matched by field and publication year, where the higher-cited paper serves as the preferred (higher-impact) item.
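The matching step described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `Paper` and `PreferencePair` types, the `build_matched_pairs` function, and the `min_gap` tie threshold are all hypothetical names introduced here. The note specifies only that pairs share a field and publication year and that the higher-cited paper is preferred.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class Paper(NamedTuple):
    abstract: str
    field: str
    year: int
    citations: int


class PreferencePair(NamedTuple):
    preferred: str  # abstract of the higher-cited paper
    rejected: str   # abstract of the lower-cited paper


def build_matched_pairs(papers, min_gap=1):
    """Pair papers that share (field, year); the higher-cited one is preferred.

    min_gap is an illustrative assumption: pairs whose citation counts
    differ by less than this are skipped, which drops exact ties.
    """
    buckets = defaultdict(list)
    for paper in papers:
        buckets[(paper.field, paper.year)].append(paper)

    pairs = []
    for bucket in buckets.values():
        for a, b in combinations(bucket, 2):
            if abs(a.citations - b.citations) < min_gap:
                continue  # no clear community preference between these two
            hi, lo = (a, b) if a.citations > b.citations else (b, a)
            pairs.append(PreferencePair(hi.abstract, lo.abstract))
    return pairs


papers = [
    Paper("abstract A", "ML", 2020, 120),
    Paper("abstract B", "ML", 2020, 15),
    Paper("abstract C", "ML", 2021, 300),   # different year: never paired with A or B
    Paper("abstract D", "Bio", 2020, 40),   # different field: never paired with A or B
]
pairs = build_matched_pairs(papers)
```

Because comparisons never cross a field or year boundary, a heavily cited paper from a large or older field cannot win simply because its field or age produces more citations.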

Two trained models:

The theoretical framing is significant. The authors invoke Hume: "a standard of taste can emerge from the joint verdict of qualified judges rather than arbitrary individual preference." And Kant: taste as "sensus communis" — a shared sense that considers how others could judge. Scientific taste is not personal preference. It is alignment with community judgment. RLCF operationalizes this: the reward signal comes from community behavior (citations), not individual annotation (RLHF) or formal verification (RLVR).
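One way to picture a reward that comes from community behavior is a pairwise check against citation counts. The note does not state the exact reward formulation, so the binary reward below is an assumption, as are the function name `community_feedback_reward` and the "A"/"B" answer format.

```python
def community_feedback_reward(model_choice: str, citations_a: int, citations_b: int) -> float:
    """Return 1.0 if the model picked the higher-cited paper, else 0.0.

    Assumption: the model is shown two abstracts labeled "A" and "B" and
    answers with one label. Any other answer earns no reward.
    """
    if citations_a == citations_b:
        raise ValueError("pair has no preferred item; ties should be filtered upstream")
    if model_choice not in ("A", "B"):
        return 0.0  # malformed answer
    preferred = "A" if citations_a > citations_b else "B"
    return 1.0 if model_choice == preferred else 0.0


reward_correct = community_feedback_reward("A", citations_a=120, citations_b=15)
reward_wrong = community_feedback_reward("B", citations_a=120, citations_b=15)
```

No human annotator and no formal verifier appears in this function: the only ground truth is what the community later cited.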

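The three RL paradigms now distinguished, by source of reward signal:

RLHF (reinforcement learning from human feedback): individual annotation.
RLVR (reinforcement learning with verifiable rewards): formal verification.
RLCF (reinforcement learning from community feedback): community behavior (citations).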

Following the related note "Can AI predict social norms better than humans?", RLCF is the training analog: the model learns to predict community preference (which papers will be cited) without participating in the community process that produces citations. It predicts taste without having taste. This is the same prediction-without-participation pattern, now made an explicit training objective.

Set against the related note "Can AI ever gain expert community trust through participation?", RLCF trains the model to bypass the validation circle entirely: it learns what the circle would approve without joining it. The epistemological implications for the Tokenization series are direct: this is a machine that learns to produce knowledge-tokens calibrated to community acceptance, without the community process that gives acceptance its warrant.

Original note title

reinforcement learning from community feedback trains scientific taste by using citation-based community preferences as reward signal — separating judgment from execution