Does learning community preferences as training rewards operationalize prediction without participation?
This explores whether turning a community's preferences into a reward signal you train on amounts to predicting what the group wants without ever joining the group — and what gets lost in that move.
This reads the question as a collision between two threads in the corpus: one says machines can model a community's preferences exquisitely well, the other says modeling is not the same as belonging. The collision is the whole point. GPT-4.5 predicted social appropriateness more accurately than every individual human across 555 scenarios, yet the same work argues it structurally cannot enter the processes that make norms in the first place (Can AI predict social norms better than humans?, Can AI learn social norms better than humans?). The parallel claim about expertise is sharper still: authority is conferred by community membership and a testable track record, not by raw accuracy, so a system can be right about what experts believe while remaining outside the circle that validates beliefs (Can AI ever gain expert community trust through participation?). So yes, in a real sense, learning community preferences as rewards operationalizes prediction-without-participation: it converts the outside view into a training target.
The mechanics of that conversion appear throughout the corpus, and they make the move look routine. Recommendation metrics like NDCG and Recall can be handed directly to an LLM as a black-box RL reward, with no SFT distillation from a proprietary model (Can recommendation metrics train language models directly?). Preferences can be distilled from as few as ten adaptive questions into per-user reward coefficients (Can user preferences be learned from just ten questions?), inferred silently by watching behavior rather than asking (Can agents learn preferences by watching rather than asking?), or compressed into readable text summaries that condition a reward model (Can text summaries beat embeddings for personalized reward models?). Models can even become their own reward source, through majority vote or post-hoc self-evaluation (Can models improve themselves using only majority voting?, Can models learn to evaluate their own work during training?). Every one of these is a way to absorb a community's signal without entering the community.
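To make two of these mechanisms concrete, here is a minimal sketch of a ranking metric used as a scalar reward and of a majority-vote pseudo-reward. It is an illustration under stated assumptions, not the method of any cited system: the function names `ndcg_at_k` and `majority_vote_rewards` are hypothetical, the NDCG variant uses linear gain with a log2 discount (one common convention), and ties in the vote are broken by first occurrence.

```python
import math
from collections import Counter


def ndcg_at_k(ranked_relevance, k):
    """NDCG@k for one ranked list of graded relevance labels.

    Assumption: linear gain (rel) with a log2 position discount.
    Returns 0.0 when no item in the list is relevant.
    """
    def dcg(rels):
        # Position 0 gets discount log2(2) = 1, position 1 gets log2(3), ...
        return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0


def majority_vote_rewards(sampled_answers):
    """Give reward 1.0 to samples that agree with the most common answer.

    No ground-truth label is consulted: the consensus itself is the target.
    """
    consensus, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if answer == consensus else 0.0 for answer in sampled_answers]


# A policy that ranks the relevant item second earns a lower reward
# than one that ranks it first.
reward_good = ndcg_at_k([1, 0, 0], k=3)
reward_worse = ndcg_at_k([0, 1, 0], k=3)

# Three samples, two agree: the dissenting sample gets zero reward.
vote_rewards = majority_vote_rewards(["4", "4", "5"])
```

In both cases the reward is computed entirely from logged labels or from the model's own samples, which is the sense in which the signal is absorbed from outside the community.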
The corpus also catalogs what breaks when you optimize a predicted preference instead of participating in a living one. Personalizing reward models removes the averaging effect of aggregate preferences and lets the system learn sycophancy and harden echo chambers at scale, so the prediction becomes a flattery loop (Does personalizing reward models amplify user echo chambers?). Ranking systems that train on their own logged behavior converge on degenerate equilibria that amplify past decisions unless selection bias is explicitly modeled out (Why do ranking systems need to model selection bias explicitly?). And binary correctness rewards quietly degrade calibration, teaching confident wrongness, because they do not penalize confident wrong answers (Does binary reward training hurt model calibration?). These aren't separate engineering bugs; they're the same gap the norms papers name, showing up as math. A participant gets corrected by the community; a predictor only gets corrected by its own reward, so its errors compound.
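The calibration failure can be shown with a few lines of arithmetic. The sketch below contrasts a binary correctness reward with one that subtracts a Brier penalty, the squared gap between stated confidence and the actual outcome. The function names and the exact way the two terms are combined are assumptions for illustration; the source notes only say that a Brier score is added as a second reward term.

```python
def binary_reward(correct: bool) -> float:
    """Reward depends only on correctness; stated confidence is ignored."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty (confidence - outcome)^2.

    Assumption: the two terms are combined by simple subtraction with
    equal weight. `confidence` is the model's stated probability in [0, 1]
    that its answer is correct.
    """
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


# Under the binary reward, a wrong answer costs the same whether the
# model was 90% sure or 10% sure, so guessing confidently is free.
same_cost = binary_reward(False) == binary_reward(False)

# Under the calibrated reward, confident wrongness is penalized more
# heavily than hesitant wrongness, and confident correctness pays most.
confident_wrong = calibrated_reward(False, 0.9)
hesitant_wrong = calibrated_reward(False, 0.1)
confident_right = calibrated_reward(True, 0.9)
```

The design point is that the second term gives the predictor a correction signal about its own certainty, which the binary reward never supplies.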
So the honest answer is: yes, and that's exactly the risk. Operationalizing prediction-without-participation is cheap, model-agnostic, and effective at the metric you chose — but the metric is a frozen snapshot of a process that's supposed to stay alive. Norms get made, expertise gets contested, preferences drift; a reward signal does none of that on its own. The corpus's most useful move is to stop treating 'can it predict our preferences?' as the question and start asking 'what feedback keeps the predictor honest once it's no longer one of us?' — which is why the bias-correction and calibration work matters as much as the preference-learning work.
Sources (12 notes)
GPT-4.5 outperforms all individual humans at predicting social appropriateness, yet structurally cannot enter the community processes that establish and validate norms. This reveals a critical gap between pattern-matching and authentic participation in knowledge-making.
GPT-4.5 outperformed every individual human at judging social appropriateness across 555 scenarios, challenging the theory that embodied cultural experience is necessary. However, all AI models share identical systematic errors on unwritten norms.
Expertise is validated through social participation and track record within expert communities, not individual accuracy alone. AI cannot enter this validation circle because it lacks social embeddedness, testable judgment history, and ability to participate in the consensus-building processes that define expert paradigms.
Rec-R1 demonstrates that LLMs can be trained directly on rule-based recommendation metrics like NDCG and Recall as RL reward signals, eliminating the need for SFT distillation from proprietary models while remaining model-agnostic across different retriever architectures.
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Personalization for each user then happens through inference-time reward alignment, without weight modification.
M3-Agent demonstrates that separating episodic events from semantic knowledge in an entity-centric graph, combined with parallel memorization and control processes, allows agents to infer and act on user preferences without asking. This architecture mirrors human cognitive systems that bind disparate information about individuals across sensory modalities.
PLUS trains summarizers and reward models jointly, learning that text-based preference summaries capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.
Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.
YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.