Self-Refinement and Self-Consistency

When should an agent actually stop and deliberate?

How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.

Can language models improve themselves without any external training data?

Explores whether two language models playing against each other—one generating questions, one solving them—can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.

Can crowdsourced votes reliably rank language models?

Explores whether large-scale human preference voting from casual users produces valid model rankings comparable to expert judgment, and what makes such crowdsourced evaluation trustworthy at scale.

Do all AI skills improve equally as models scale?

Different evaluation skills show strikingly different scaling patterns. Understanding where skills saturate has immediate implications for model deployment and capability requirements across domains.

Can model confidence work as a reward signal for reasoning?

Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.

Can distillation work on the student's own generated sequences?

Supervised distillation trains on fixed teacher outputs, but students must generate at inference. Does training on self-generated sequences scored by the teacher close this distribution mismatch and improve learning?

Can models improve themselves on tasks without verifiable answers?

Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?

Do retrieval models actually follow natural language instructions?

Most IR systems ignore instructions that define relevance, despite using LLM backbones. This raises questions about whether retrievers can adapt to nuanced user-specified information needs in practice.

Does self-consistency reliably reward correct answers during training?

Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?

Does self-generated training data improve model learning?

Can models learn more effectively from training data they generate themselves rather than data created by external sources? This explores whether a learner's own restructuring process produces better learning outcomes.

What limits how much models can improve themselves?

Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.

Why do self-improvement loops eventually stop improving?

Self-improvement systems often plateau because the evaluator that judges progress stays static while the actor grows. What happens when judges don't improve alongside learners?

Why does self-correction training on offline data fail?

Can language models learn to correct their own mistakes through supervised training on correction examples? This explores whether distribution mismatch and behavior collapse prevent self-correction from emerging.

Can models learn to ask better clarifying questions through self-improvement?

This explores whether question-asking is a trainable skill that improves when models are rewarded for questions that lead to better answers. It matters because asking good clarifying questions could help AI systems handle underspecified user requests.