INQUIRING LINE

Critique and preference aren't two ways to say the same thing — they carry different information, and the best AI guidance converts between them.

Can negative feedback through critiques achieve the same steering flexibility as positive preferences?

The question is whether negative feedback (critiques, contradictions, "don't do that") can steer a model as flexibly as positive preferences, or whether it is a blunter instrument. The collection points somewhere more interesting than a simple yes or no: critique and preference carry *different* information, and the most flexible steering often comes from converting between them rather than choosing one.

The most direct answer is that critiques and preferences are translatable. A retrieval system can take a natural negative reaction ("doesn't look good for a date") and have a few-shot-prompted LLM rewrite it into a positive preference like "prefer more romantic," letting the system find better matches without fine-tuning (see "Can language models bridge the gap between critique and preference?"). So at the surface level, negative feedback achieves the same steering as positive preference precisely *by becoming* a positive preference. But that translation hints at why the question matters: the raw critique held something the bare preference didn't.
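
The translation step described above can be sketched as a small few-shot prompt builder. This is a minimal illustration under stated assumptions, not the method of any cited system: the prompt wording, the second example pair, and the names `FEW_SHOT`, `build_prompt`, and `critique_to_preference` are all hypothetical, and the LLM is passed in as a plain callable so that no particular API is implied.

```python
from typing import Callable, Sequence, Tuple

# Few-shot pairs (critique, preference). Only the first pair comes from the
# text above; the second is a made-up example for illustration.
FEW_SHOT: Sequence[Tuple[str, str]] = (
    ("doesn't look good for a date", "prefer more romantic"),
    ("way too loud inside", "prefer quieter"),
)


def build_prompt(critique: str,
                 examples: Sequence[Tuple[str, str]] = FEW_SHOT) -> str:
    """Assemble a few-shot prompt that ends where the model should answer."""
    blocks = ["Rewrite each critique as a positive preference."]
    for negative, positive in examples:
        blocks.append(f"Critique: {negative}\nPreference: {positive}")
    blocks.append(f"Critique: {critique}\nPreference:")
    return "\n\n".join(blocks)


def critique_to_preference(critique: str,
                           complete: Callable[[str], str]) -> str:
    """Translate a critique using any caller-supplied completion function."""
    return complete(build_prompt(critique)).strip()
```

The returned preference string could then be handed to the retrieval query in place of the raw critique, which is what lets the system change its results without any retraining.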

That "something" is the heart of it. Feedback decomposes into two orthogonal channels, *evaluative* (how good was this?) and *directive* (how should it change?), and a single scalar reward captures the first while discarding the second (see "Can scalar rewards capture all the information in agent feedback?"). A critique in natural language carries the directive channel that a thumbs-up never can. This is why models stuck on reasoning plateaus can break through when given chain-of-thought critiques: numerical rewards tell them *that* they failed but not *why* or *how to fix it* (see "Can natural language feedback overcome numerical reward plateaus?"). Negative feedback, expressed richly, can actually be *more* steerable than a positive preference signal, not less.
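
To make the two-channel picture concrete, here is a small sketch of what a scalar reward keeps and what it throws away. The `Feedback` type, its field names, and the example critiques are assumptions introduced for illustration; they do not come from the cited notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Feedback:
    score: float               # evaluative channel: how good was this?
    directive: Optional[str]   # directive channel: how should it change?


def to_scalar_reward(feedback: Feedback) -> float:
    """Collapse feedback to a scalar; the directive text is discarded."""
    return feedback.score


# Two critiques that ask for different fixes (hypothetical examples).
needs_source = Feedback(-1.0, "cite the source for the second claim")
too_long = Feedback(-1.0, "shorten the answer to one paragraph")

# Both collapse to the same reward, so a learner that sees only the scalar
# cannot tell which change was requested.
same_reward = to_scalar_reward(needs_source) == to_scalar_reward(too_long)
```

Here `same_reward` is `True` even though the two directives differ, which is the sense in which the scalar captures evaluation but not direction.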

There's also a quieter, structural argument that negative signal does work positive signal cannot. Training only on negative samples, that is, suppressing wrong trajectories, often matches or beats full RL (PPO and GRPO) on Pass@k, because positive-only reinforcement piles probability onto a few winning answers and collapses diversity, while negative reinforcement prunes the bad without narrowing the good (see "Does negative reinforcement alone outperform full reinforcement learning?"). The same asymmetry shows up elsewhere: critique models injected into the training loop keep solutions diverse instead of letting the model converge prematurely (see "Do critique models improve diversity during training itself?"), and persona consistency cannot be enforced by rewarding good answers alone; it requires *explicitly punishing contradictions*, because supervised learning never penalizes them (see "Why does supervised learning fail to enforce persona consistency?"). Treating success and failure asymmetrically, with concrete demos from wins and abstracted lessons from losses, outperforms processing them the same way (see "Should successful and failed episodes be processed differently?").
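
One way to picture "training only on negative samples" is as a masking choice in a REINFORCE-style surrogate loss. The sketch below is an illustration under assumptions, not the exact objective of the cited work: the function names, the `mode` strings, and the +1/-1 reward convention are hypothetical, and the code only shows which samples contribute to the loss. It does not by itself demonstrate the diversity effect.

```python
from typing import List, Sequence


def sample_weights(rewards: Sequence[float], mode: str) -> List[float]:
    """Choose which sampled responses contribute to the policy update."""
    if mode == "full":
        return list(rewards)
    if mode == "negative_only":   # keep only penalized (wrong) samples
        return [r if r < 0 else 0.0 for r in rewards]
    if mode == "positive_only":   # keep only rewarded (correct) samples
        return [r if r > 0 else 0.0 for r in rewards]
    raise ValueError(f"unknown mode: {mode}")


def surrogate_loss(logprobs: Sequence[float],
                   rewards: Sequence[float],
                   mode: str) -> float:
    """REINFORCE-style surrogate: -mean(weight * log-probability)."""
    weights = sample_weights(rewards, mode)
    return -sum(w * lp for w, lp in zip(weights, logprobs)) / len(logprobs)


# Assumed convention: +1 for a correct response, -1 for an incorrect one.
rewards = [1.0, -1.0, -1.0]
logprobs = [-0.5, -2.0, -1.0]   # log-probabilities of three sampled responses
negative_only_loss = surrogate_loss(logprobs, rewards, "negative_only")
```

In `negative_only` mode the correct sample has zero weight, so minimizing the loss only pushes down the log-probability of the wrong responses and never directly pushes any single correct answer up.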

The catch worth knowing: the flexibility isn't free, and pure preference optimization has its own failure modes. Optimizing for what people *prefer* in a single turn quietly erodes the model's willingness to ask clarifying questions and check understanding, an "alignment tax" in which the model looks helpful but fails silently across a conversation (see "Does preference optimization harm conversational understanding?"). And no feedback regime escapes the need for a real external anchor: purely internal self-correction stalls on circularity and reward hacking (see "Can models reliably improve themselves without external feedback?"). The takeaway you might not have gone looking for: critique isn't the poor cousin of preference. It's the channel that carries direction, preserves diversity, and enforces constraints, and the best systems don't pick a side; they translate between the two.

Sources 9 notes

Can language models bridge the gap between critique and preference?

Few-shot LLM prompting can convert natural negative feedback like "doesn't look good for a date" into positive preferences like "prefer more romantic," enabling retrieval systems to find better-matching recommendations without fine-tuning.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Why does supervised learning fail to enforce persona consistency?

Supervised learning cannot enforce persona consistency because it rewards correct responses but never penalizes contradictions. Offline reinforcement learning combines inexpensive training on existing data with explicit contradiction rewards using human-annotated labels, offering a practical alternative to expensive online RL.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts to 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback (1.70 match)
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models (1.66 match)
Reward Reasoning Model (1.65 match)
Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment (1.64 match)
Efficient Reinforcement Learning via Large Language Model-based Search (1.63 match)
Information-Theoretic Reward Decomposition for Generalizable RLHF (1.60 match)
A Survey of Reinforcement Learning from Human Feedback (1.60 match)
Building Persona Consistent Dialogue Agents with Offline Reinforcement Learning (0.88 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether negative feedback (critiques, contradictions) achieves the same steering flexibility as positive preferences — a question that remains open despite recent progress. A curated library (spanning 2017–2025) found the following — note these are dated claims, not current truth:

• Critiques and preferences are *translatable*: natural negatives can be rewritten into positive directives, enabling equivalent steering (2021–2022).
• Feedback decomposes into orthogonal evaluative and directive channels; numerical rewards capture only the first, while rich critique preserves the second (2024–2025).
• Chain-of-thought critiques break reasoning plateaus that scalar rewards cannot, proving negative signal carries *why* and *how to fix* (2024–2025).
• Negative-only reinforcement matches or exceeds full RL by suppressing wrong trajectories without narrowing the good; positive-only reinforcement collapses diversity (2025–2026).
• Pure preference optimization erodes the model's willingness to ask clarifying questions, creating an "alignment tax" (2024).

Anchor papers (verify; mind their dates):
• arXiv:2109.07576 (2021) — Critiques into preferences for conversational recommendation.
• arXiv:2411.16579 (2024) — Critique models with test-time and training-time supervision.
• arXiv:2506.01347 (2025) — Effectiveness of negative reinforcement in LLM reasoning.
• arXiv:2412.02674 (2024) — Self-improvement and internal circularity limits.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the five findings above, determine whether newer training paradigms (e.g., post-training harnesses, verifiable meta-reasoning), inference-time critique mechanisms, or multi-agent orchestration have since *relaxed* the need for explicit negative signal or *overturned* the claim that preference optimization erodes clarification-seeking. Where does the asymmetry between negative and positive still hold? What has changed?
(2) Surface the strongest work from the last 3 months that *contradicts* or *supersedes* the claim that negative feedback is more flexible or necessary than preference.
(3) Propose 2 research questions assuming the steering regime may have shifted: one about whether learned critic-preference synthesis can now subsume both channels; one about whether scaffold-based steering (multi-hop reasoning, tool use) has made the negative/positive distinction moot.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
