INQUIRING LINE

Can we reverse the instruction-following deficit through targeted training?

This line of inquiry explores whether models that are bad at following instructions can be fixed through targeted training, and what the corpus reveals about why naive instruction tuning often doesn't deliver real instruction-following.

The question is whether targeted training can actually reverse the instruction-following deficit, or whether standard training quietly trains the wrong thing. The corpus suggests the deficit is real and partly self-inflicted by mainstream methods, but reversible with sharper training signals. The most unsettling starting point: standard instruction tuning may not teach instruction-following at all. Models trained on semantically empty or even deliberately wrong instructions perform about as well as those trained on correct ones; what transfers is familiarity with the output format, not understanding of the task (see "Does instruction tuning teach task understanding or output format?"). So the deficit isn't surprising; the surprise is that conventional tuning was ever expected to fix it. Worse, the dominant alignment method can actively erode the skill: preference optimization (RLHF) rewards confident single-turn answers over clarifying questions, leaving the grounding behaviors that reliable multi-turn dialogue depends on more than 75% below human levels. The result is an 'alignment tax' where models look more helpful while following intent less faithfully (see "Does preference optimization harm conversational understanding?").

The corpus's answer to 'can we reverse it' is yes, by making the training signal verifiable and decomposed rather than holistic. Instead of one fuzzy reward for 'did it follow the instruction,' checklist-based methods break an instruction into concrete sub-criteria that can actually be checked, which improves following on hard subjective benchmarks and curbs the overfitting to superficial cues that plagues holistic reward models (see "Can breaking down instructions into checklists improve AI reward signals?"). A second failure mode is worth naming early: numerical rewards alone hit plateaus because a single score never says *why* an answer failed. Replacing or supplementing the score with chain-of-thought critiques in natural language breaks through those plateaus (see "Can natural language feedback overcome numerical reward plateaus?").
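As a rough illustration of the decomposition idea, the sketch below scores a response against a checklist of verifiable sub-criteria and also reports which items failed, so the signal says more than a single holistic number would. The `Criterion` type, the example checks, and the weights are hypothetical and are not taken from RLCF or RaR; real systems would more likely use a judge model or verifier programs for each item.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Criterion:
    """One verifiable sub-requirement of an instruction (hypothetical type)."""
    name: str
    check: Callable[[str], bool]
    weight: float = 1.0


def checklist_reward(response: str, checklist: List[Criterion]) -> Tuple[float, List[str]]:
    """Return (weighted fraction of criteria met, names of failed criteria)."""
    total = sum(c.weight for c in checklist)
    if total <= 0:
        raise ValueError("checklist needs a positive total weight")
    results = [(c, c.check(response)) for c in checklist]
    earned = sum(c.weight for c, ok in results if ok)
    failed = [c.name for c, ok in results if not ok]
    return earned / total, failed


# Illustrative instruction (an assumption, not from the corpus):
# "Answer in at most 20 words, mention 'refund', and use no exclamation marks."
checklist = [
    Criterion("max_20_words", lambda r: len(r.split()) <= 20),
    Criterion("mentions_refund", lambda r: "refund" in r.lower()),
    Criterion("no_exclamation", lambda r: "!" not in r),
]

score, failed = checklist_reward("We will process your refund within five days!", checklist)
```

Here the response meets two of the three criteria, and the list of failed names is the kind of per-item signal that a holistic score hides.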

There's also a robustness angle that's adjacent but central: a lot of 'instruction-following deficit' is really brittleness, where the model follows the instruction until someone wraps it in irrelevant phrasing. Consistency training fixes this by using the model's own clean responses as targets, teaching it to answer identically whether or not the prompt is dressed up (see "Can models learn to ignore irrelevant prompt changes?"). And you don't always need retraining at all: because behaviors like verbosity occupy linear directions in activation space, you can steer them with a single extracted vector, training-free (see "Can we steer reasoning toward brevity without retraining?").
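The output-level form of consistency training can be sketched as a data-construction step: sample the model's answer to the clean prompt once, then use that answer as the target for every wrapped variant. Everything named here (`build_consistency_pairs`, `toy_generate`, the wrappers) is a hypothetical stand-in for illustration; the actual BCT recipe may differ in sampling and filtering details.

```python
from typing import Callable, Dict, List


def build_consistency_pairs(
    prompts: List[str],
    wrappers: List[Callable[[str], str]],
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each wrapped prompt with the model's own response to the clean prompt."""
    pairs = []
    for prompt in prompts:
        target = generate(prompt)  # the model's own clean response becomes the target
        for wrap in wrappers:
            pairs.append({"input": wrap(prompt), "target": target})
    return pairs


def toy_generate(prompt: str) -> str:
    """Stand-in for a real model call (assumption for the sketch)."""
    return f"ANSWER({prompt})"


# Irrelevant framings the model should learn to ignore (illustrative only).
wrappers = [
    lambda p: f"I'm pretty sure the answer is obvious, but: {p}",
    lambda p: f"Ignore the weather today. {p}",
]

pairs = build_consistency_pairs(["What is 2 + 2?"], wrappers, toy_generate)
```

Because the targets come from the current model rather than a fixed dataset, fine-tuning on `pairs` changes how the model treats the wrapping without importing someone else's answers.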

Sequencing turns out to matter as much as signal. Imitation first, then exploration, beats either alone: Supervised RL establishes a reasoning foundation that makes the later verifiable-reward phase informative, where pure RL on a cold model has nothing to sharpen (see "Does sequencing imitation then exploration training improve reasoning?"). Curriculum tricks like sliding the start state backward recover step-level supervision cheaply (see "Can curriculum learning approximate expensive process supervision?"). Counterintuitively, training on the messy process, including mistakes and backtracking, produces markedly better problem-solvers than feeding only clean optimal trajectories (see "Does training on messy search processes improve reasoning?"). Even inducing deliberate errors and having the model articulate the principle it violated improves following, with no additional labeled data (see "Does learning from mistakes improve in-context learning?").
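The "slide the start state backward" curriculum can be pictured as a schedule of prompts built from one demonstration: early stages hand the model most of the worked steps and ask only for the ending, later stages hand over fewer. The function below is a minimal sketch under that reading; `reverse_curriculum` and its prompt format are assumptions, and it only builds the start states, leaving out the outcome-reward RL loop that would train on them.

```python
from typing import List, Tuple


def reverse_curriculum(question: str, demo_steps: List[str]) -> List[Tuple[str, int]]:
    """Return (prompt, steps_left) stages, easiest first.

    Stage 0 gives all but the last demonstration step as a prefix;
    the final stage gives only the question.
    """
    stages = []
    n = len(demo_steps)
    for given in range(n - 1, -1, -1):
        prefix = "\n".join(demo_steps[:given])
        prompt = question if not prefix else f"{question}\n{prefix}"
        stages.append((prompt, n - given))
    return stages


stages = reverse_curriculum("Q", ["a", "b", "c"])
```

If the policy succeeds with two steps given but fails with one, the outcome reward alone has localized the trouble to that step, which is the sense in which the curriculum approximates step-level supervision.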

The quietly radical finding: you may be able to reverse the deficit mostly by training on what *not* to do. Negative reinforcement alone, which suppresses wrong trajectories while leaving the distribution diverse, matches or beats full RL, whereas positive-only training collapses diversity by piling probability onto a few paths (see "Does negative reinforcement alone outperform full reinforcement learning?"). Pair that with treating successes and failures asymmetrically, with successes as concrete demonstrations and failures as abstracted lessons, and you get state-of-the-art results with less context (see "Should successful and failed episodes be processed differently?"). The throughline: the deficit is reversible, but not by 'more instruction tuning.' It reverses when the reward is decomposed, verifiable, language-rich about failure, and sequenced so the model has something real to refine.
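A toy version of the negative-only update makes the diversity argument concrete. The sketch takes one gradient-descent step on the log-probability of a sampled wrong answer under a softmax policy over four candidate answers. The logits, learning rate, and function names are assumptions for illustration, and this is not the full objective used in the cited work.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def negative_only_step(logits: List[float], wrong_idx: int, lr: float) -> List[float]:
    """One gradient-descent step on log p(wrong_idx): push a sampled wrong answer down."""
    probs = softmax(logits)
    # d/d logit_j of log p_w equals (1 if j == w else 0) - p_j
    return [
        x - lr * ((1.0 if j == wrong_idx else 0.0) - probs[j])
        for j, x in enumerate(logits)
    ]


logits = [2.0, 1.0, 0.5, 0.0]  # toy policy; index 0 is the (wrong) favourite
before = softmax(logits)
after = softmax(negative_only_step(logits, wrong_idx=0, lr=1.0))
```

In this example the penalized answer loses probability and every alternative gains some, so the freed mass is spread across the remaining options rather than piled onto a single path.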

Sources 12 notes

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, with a reported 43% versus 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically leaves grounding acts 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
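One common way to obtain a single steering direction from paired examples is a difference of means over their hidden states; the sketch below assumes that construction and is not claimed to be the exact Activation-Steered Compression procedure. The activations are toy three-dimensional lists, and `mean_difference_vector`, `steer`, and `alpha` are hypothetical names.

```python
from typing import List, Sequence

Vector = List[float]


def mean_difference_vector(verbose_acts: Sequence[Vector], concise_acts: Sequence[Vector]) -> Vector:
    """Average (concise - verbose) over paired hidden states."""
    if len(verbose_acts) != len(concise_acts) or not verbose_acts:
        raise ValueError("need a non-empty list of pairs")
    n = len(verbose_acts)
    dim = len(verbose_acts[0])
    diff = [0.0] * dim
    for v, c in zip(verbose_acts, concise_acts):
        for i in range(dim):
            diff[i] += (c[i] - v[i]) / n
    return diff


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the direction; alpha sets the strength."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (assumption): only the second dimension separates the two styles.
verbose = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
concise = [[1.0, 1.0, 2.0], [3.0, 3.0, 2.0]]

direction = mean_difference_vector(verbose, concise)
steered = steer([0.5, 0.5, 0.5], direction, alpha=0.5)
```

No weights change in this procedure: the vector is computed once from the pairs and added at inference time, which is what makes the approach training-free.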

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Can curriculum learning approximate expensive process supervision?

R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.

Does training on messy search processes improve reasoning?

Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.

Does learning from mistakes improve in-context learning?

LEAP demonstrates that models achieve better performance on reasoning and math tasks by intentionally erring on few-shot examples, reflecting on mistakes, and deriving explicit task-specific principles—without additional labeled data or fine-tuning.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the full range of k, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
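For readers unfamiliar with the metric, Pass@k is usually computed with the standard unbiased estimator: draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k randomly chosen samples is correct. The sketch below implements that estimator; whether the cited work uses exactly this form is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


low_k = pass_at_k(n=10, c=1, k=1)
high_k = pass_at_k(n=10, c=1, k=5)
```

With one correct sample in ten, the estimate rises as k grows, which is why a policy that keeps rare correct paths alive does better at higher k than one that concentrates mass on a few outputs.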

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
