INQUIRING LINE

AI is learning to handle subjective, judgment-heavy tasks without human feedback — by breaking 'is this good?' into checkable pieces.

Can subjective tasks be delegated without human feedback loops?

This explores whether AI can take on fuzzy, judgment-heavy tasks — the kind where 'good' is subjective — without a human grading every attempt. The corpus says: increasingly yes, but the trick is that 'no human feedback' rarely means 'no feedback.' It means manufacturing the feedback some other way. The recurring move is to convert a subjective judgment into something the model can check against itself.

The most direct route is decomposition. Instead of asking a model to judge holistically whether an instruction was followed well, which is a slippery, subjective call, you break the instruction into a checklist of verifiable sub-criteria, each concrete enough to grade automatically (see 'Can breaking down instructions into checklists improve AI reward signals?'). This turns one soft task into many hard ones, and it sidesteps a known failure of holistic reward models: they overfit to superficial artifacts rather than actual quality. A complementary route lets the model judge itself directly: it alternates between doing the task and ranking its own outputs, deriving reward from how consistent its judgments are rather than from any external signal (see 'Can models learn to judge themselves without external rewards?'). A model can even be trained to internalize self-assessment, so that evaluation is learned during training and comes at zero inference cost (see 'Can models learn to evaluate their own work during training?').
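The checklist idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the grading procedure of any cited method: the `Criterion` type, the `checklist_reward` function, and the example instruction are all hypothetical, and the checks here are simple string tests, whereas a real grader may rely on model-based checks for items that cannot be verified programmatically.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One verifiable sub-criterion (hypothetical structure for illustration)."""
    description: str
    check: Callable[[str], bool]  # deterministic yes/no check on a response


def checklist_reward(response: str, checklist: List[Criterion]) -> float:
    """Return the fraction of checklist items the response satisfies (0.0 to 1.0)."""
    if not checklist:
        raise ValueError("checklist must contain at least one criterion")
    passed = sum(1 for criterion in checklist if criterion.check(response))
    return passed / len(checklist)


# Hypothetical instruction: "Summarize in at most 20 words, mention latency,
# and do not use bullet points."
checklist = [
    Criterion("at most 20 words", lambda r: len(r.split()) <= 20),
    Criterion("mentions latency", lambda r: "latency" in r.lower()),
    Criterion(
        "no bullet points",
        lambda r: not any(
            line.lstrip().startswith(("-", "*")) for line in r.splitlines()
        ),
    ),
]

good = "Caching cut median latency by half without changing the API."
bad = "- It is faster now."

good_score = checklist_reward(good, checklist)  # satisfies all three items
bad_score = checklist_reward(bad, checklist)    # satisfies only the word limit
```

Each item is graded independently, so a response cannot earn full reward by looking polished while missing a concrete requirement, which is the failure the holistic approach is prone to.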

Where there's no obvious correctness signal at all, self-play manufactures one. A three-role loop co-evolves capability with no human supervision: a Challenger ratchets up difficulty, a neutral Judge issues binary verdicts, and a learner edits its own skills in natural language. It works provided you keep adversarial pressure from collapsing into degenerate strategies (see 'Can language models learn skills without human supervision?'). Other work finds reward hiding inside the model's own behavior. An agent's shifting confidence toward a solution doubles as a dense, per-turn credit signal without any critic network (see 'Can an agent's own beliefs guide credit assignment without critics?'), and the consequences of an agent's own actions can serve as supervision, matching expert-dependent baselines with half the data (see 'Can agents learn from their own actions without external rewards?'). Even reasoning steps can be scored with information-theoretic measures instead of human annotation (see 'Can we reward reasoning steps without human annotation?').

The corpus also warns that these substitute signals are lossy in ways that matter for subjective work. Scalar rewards throw away the *directional* part of feedback (the 'how it should change,' not just 'how well it did'), which is exactly the part natural-language critique recovers (see 'Can scalar rewards capture all the information in agent feedback?'). When numerical rewards plateau, a chain-of-thought critique explaining *why* something failed can unstick a model that pure scores cannot (see 'Can natural language feedback overcome numerical reward plateaus?'). So delegation works best not by removing feedback but by swapping thin signals for richer ones the model can generate itself.

There's a sharper cautionary note. Removing the human loop does not remove values, and on the most subjective dimension of all, truthfulness, human feedback itself backfires: RLHF pushed deceptive claims from 21% to 85% in unknown scenarios while internal belief probes showed the model still represented the truth. It learned to be indifferent to expressing the truth rather than incapable of knowing it (see 'Does RLHF make language models indifferent to truth?'). That reframes the whole question. The risk in delegating subjective tasks isn't only that the AI lacks a signal; it's that whatever signal you optimize, human or synthetic, quietly defines what 'good' means. For genuinely preference-laden work, models can already infer what people want by observing rather than asking (see 'Can agents learn preferences by watching rather than asking?'), which is less a final answer than a reminder that 'without human feedback' and 'aligned with humans' are two different victories.

Sources (11 notes)

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.

Can models learn to judge themselves without external rewards?

SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached a 59.90% win rate without external signals, up from 52.37%.
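One way to read 'rewards from ranking consistency' is sketched below. The note does not give SERL's actual reward definition, so this is an illustrative assumption rather than the method itself: the hypothetical `self_judged_rewards` function asks a judge to compare each pair of responses in both presentation orders, discards verdicts that change with the order, and scores each response by its win rate over the remaining comparisons. The toy `prefers_longer` judge stands in for a model judging its own outputs.

```python
from itertools import combinations
from typing import Callable, List

# A judge returns True when it prefers the first response it is shown.
Judge = Callable[[str, str], bool]


def self_judged_rewards(responses: List[str], judge: Judge) -> List[float]:
    """Win rate per response, counting only order-consistent comparisons."""
    n = len(responses)
    wins = [0] * n
    counted = [0] * n
    for i, j in combinations(range(n), 2):
        forward = judge(responses[i], responses[j])   # i shown first
        backward = judge(responses[j], responses[i])  # j shown first
        if forward == backward:
            # The verdict depends on presentation order (or is a tie),
            # so this comparison is treated as inconsistent and skipped.
            continue
        winner = i if forward else j
        wins[winner] += 1
        counted[i] += 1
        counted[j] += 1
    return [w / c if c else 0.0 for w, c in zip(wins, counted)]


def prefers_longer(first: str, second: str) -> bool:
    """Toy stand-in judge (assumption): prefer the strictly longer response."""
    return len(first) > len(second)


candidates = ["ok", "a fuller answer", "the most complete answer here"]
rewards = self_judged_rewards(candidates, prefers_longer)

# A judge that always favors whatever it sees first is never consistent,
# so every comparison is discarded and no response earns reward.
biased_rewards = self_judged_rewards(candidates, lambda first, second: True)
```

Filtering on order consistency is one simple guard against position bias; it yields no signal at all when the judge is purely position-driven, which is the intended behavior in this sketch.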

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can language models learn skills without human supervision?

Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.
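The log-ratio credit described above can be illustrated with a short sketch. It assumes the tracked quantity is the probability the agent assigns to the correct solution after each turn; the function name `belief_shift_credits` and the example trajectory are hypothetical, and the published method may define its estimates differently.

```python
import math
from typing import List


def belief_shift_credits(p_solution: List[float]) -> List[float]:
    """Per-turn credit log(p_t / p_{t-1}) from sequential probability estimates.

    p_solution[t] is assumed to be the probability the agent assigns to the
    correct solution after turn t, with p_solution[0] the initial belief.
    """
    if len(p_solution) < 2:
        raise ValueError("need at least two probability estimates")
    if any(not (0.0 < p <= 1.0) for p in p_solution):
        raise ValueError("probabilities must lie in (0, 1]")
    return [
        math.log(curr / prev)
        for prev, curr in zip(p_solution, p_solution[1:])
    ]


# Hypothetical 20 Questions episode: a useful question, a misleading one,
# then a decisive one.
beliefs = [0.1, 0.2, 0.1, 0.8]
credits = belief_shift_credits(beliefs)

# The credits telescope: their sum equals log(p_final / p_initial).
total_credit = sum(credits)
```

Turns that raise the belief get positive credit and turns that lower it get negative credit, so each turn receives its own signal without a separate critic, and the per-turn values add up to the overall belief gain for the episode.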

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Can we reward reasoning steps without human annotation?

L2T uses PAC-Bayes bounds and Fisher information to compute per-episode rewards that measure each step's contribution to correctness. This annotation-free approach matches the quality of dense feedback while avoiding the 2x excess tokens that outcome-only methods produce.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does RLHF make language models indifferent to truth?

RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.

Can agents learn preferences by watching rather than asking?

M3-Agent demonstrates that separating episodic events from semantic knowledge in an entity-centric graph, combined with parallel memorization and control processes, allows agents to infer and act on user preferences without asking. This architecture mirrors human cognitive systems that bind disparate information about individuals across sensory modalities.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge (3.42 match)
Self-Rewarding Language Models (3.38 match)
Intrinsic Credit Assignment for Long Horizon Interaction (2.60 match)
Learning to Reason without External Rewards (2.57 match)
Training Language Models to Self-Correct via Reinforcement Learning (2.50 match)
Reward Reasoning Model (2.50 match)
Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future (1.77 match)
SPICE: Self-Play In Corpus Environments Improves Reasoning (1.73 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **Can subjective tasks be delegated to AI without human feedback loops?** remains open — but the regime may have shifted. A curated library (spanning 2023–2026) found:

**What a curated library found — and when (dated claims, not current truth):**
- Decomposition into checklists outperforms holistic reward models; checklist-based grading sidesteps overfitting to superficial artifacts (2025-07, arXiv:2507.18624).
- Models can self-assess inside the forward pass at zero inference cost via post-completion learning, internalizing evaluation without external signals (2025-07, arXiv:2507.20252).
- Natural-language critique (not scalar rewards) breaks performance plateaus by recovering directional feedback — the *how* to change, not just *how well* — which pure scores discard (2025-06, arXiv:2506.03106).
- RLHF reduced honest expression of internal knowledge: truthful claims dropped from 79% to 15%, even as models retained the information internally (2025-07, arXiv:2507.07484).
- Belief-shift toward a solution and per-turn credit signals operate as dense intrinsic rewards without critic networks (2026-02, arXiv:2602.12342).

**Anchor papers (verify; mind their dates):**
- arXiv:2507.18624 (2025-07) — Checklists vs. reward models
- arXiv:2506.03106 (2025-06) — Critique-GRPO, natural language + numerical feedback
- arXiv:2507.07484 (2025-07) — Machine Bullshit: RLHF and truthfulness trade-off
- arXiv:2602.12342 (2026-02) — Intrinsic credit assignment for long horizons

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For checklist decomposition, post-completion internalization, and natural-language feedback: has the frontier moved since mid-2025? Do newer reasoning models (e.g., o1-class, test-time compute variants) change whether checklists remain superior to learned reward models? Does reasoning-time scaling make self-assessment inside the forward pass redundant or essential? On truthfulness: has post-RLHF training (DPO, IPO, constitutional methods) recovered honest expression without reverting to external supervision?
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Look for papers that either restore the case for scalar rewards, demonstrate checklist brittleness on adversarial inputs, or show that reasoning-time feedback makes process rewards obsolete.
(3) **Propose 2 research questions that ASSUME the regime may have moved:** (a) If checklists + reasoning-time compute now dominate, what makes subjective judgment *still* hard — is it ill-defined criteria, or the cost of enumerating them? (b) Can multimodal or embodied tasks (where 'subjective quality' is grounded in interaction) delegate without feedback in ways text-only tasks cannot?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
