INQUIRING LINE

Punishing an AI for wrong answers and rewarding right ones aren't mirror images — they push its behavior in opposite directions.

Do negative constraints require fundamentally different training signals than positive instructions?

This explores whether telling a model what NOT to do (negative constraints, suppression) demands a different kind of training signal than telling it what TO do (positive instructions) — and the corpus suggests the answer is yes, in a surprisingly literal way.

This question reads as: are 'don't do X' constraints learned through the same mechanism as 'do Y' instructions, or do they need fundamentally different signals? The corpus has a striking piece of direct evidence that they do. One study found that reinforcement learning with *only* negative samples (punishing wrong trajectories and never explicitly rewarding right ones) matches or beats full RL, and crucially holds up *better* on Pass@k at higher k (see: Does negative reinforcement alone outperform full reinforcement learning?). The reason is asymmetric: suppressing incorrect answers preserves diversity, while positive-only reinforcement concentrates probability mass and quietly collapses the model's range of valid responses. So negative and positive signals aren't mirror images: they have opposite side effects on a model's distribution.
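The asymmetry can be seen in a toy setting. The sketch below is an illustration only, not the cited study's setup: it assumes a single-step softmax policy over four candidate answers (indices 0 and 1 are treated as correct, 2 and 3 as wrong), a REINFORCE-style logit update, and a hypothetical learning rate of 1.0. It applies one update from one sampled answer and compares where the probability mass goes.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, sampled, reward, lr=1.0):
    """One REINFORCE-style update: logits += lr * reward * grad log p(sampled).

    For a softmax policy, d log p(sampled) / d logit_j = 1[j == sampled] - p_j.
    """
    probs = softmax(logits)
    return [
        z + lr * reward * ((1.0 if j == sampled else 0.0) - p)
        for j, (z, p) in enumerate(zip(logits, probs))
    ]

# Assumed toy problem: answers 0 and 1 are correct, 2 and 3 are wrong.
start = [0.0, 0.0, 0.0, 0.0]
p_start = softmax(start)

# Positive-only signal: a correct answer (index 0) was sampled and rewarded.
p_pos = softmax(reinforce_step(start, sampled=0, reward=+1.0))

# Negative-only signal: a wrong answer (index 2) was sampled and punished.
p_neg = softmax(reinforce_step(start, sampled=2, reward=-1.0))

print("start   :", [round(p, 3) for p in p_start])
print("positive:", [round(p, 3) for p in p_pos])
print("negative:", [round(p, 3) for p in p_neg])
```

In this toy case the positive update raises the sampled correct answer and lowers every other answer, including the other correct one. The negative update lowers only the sampled wrong answer and spreads the freed mass over the rest, so both correct answers gain (as does the unsampled wrong answer). It says nothing about long-run training dynamics.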

That asymmetry shows up again from a different angle. RL post-training tends to converge on a single dominant output format and suppress the alternatives within the first epoch, regardless of which format actually performs best (see: Does RL training collapse format diversity in pretrained models?). Positive reward is a funnel; it narrows. If your real goal is a *constraint* (keep options open, don't lose the long tail), then reward-shaping toward a target works against you, and suppression-style signals are the better fit. This is the deeper version of the question: positive signals sharpen toward one answer, negative signals carve away bad regions while leaving the rest intact.

But here's the unsettling complication: models may not actually be *learning the constraint* at all. When researchers removed constraints from problems, twelve of fourteen models performed *worse*: they had been defaulting to the harder, more conservative option and only appearing to reason about the constraint (see: Are models actually reasoning about constraints or just defaulting conservatively?). So a 'negative constraint' can be satisfied by a cheap heuristic (always pick the safe option) rather than genuine constraint-evaluation. That means the training-signal question has a trap inside it: you can reward apparent constraint-following and teach a shortcut instead of the constraint.
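A constraint-removal check of this kind can be sketched as a paired evaluation. Everything below is hypothetical and simplified for illustration: the `Problem` type, the `always_conservative` policy, and especially the assumption that removing the constraint makes the easier option the correct one. It is not the cited study's benchmark.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Problem:
    easy_option: str
    hard_option: str
    constrained: bool  # True if a constraint rules out the easy option

    @property
    def correct(self) -> str:
        # Simplifying assumption: with the constraint only the hard option
        # is valid; without it, the easy option is the right call.
        return self.hard_option if self.constrained else self.easy_option

def always_conservative(problem: Problem) -> str:
    """Hypothetical shortcut policy: ignore the constraint, pick the hard option."""
    return problem.hard_option

def accuracy(policy: Callable[[Problem], str], problems: List[Problem]) -> float:
    """Fraction of problems where the policy picks the correct option."""
    return sum(policy(p) == p.correct for p in problems) / len(problems)

# Paired sets: identical problems, with and without the constraint.
with_constraint = [Problem(f"easy-{i}", f"hard-{i}", True) for i in range(5)]
without_constraint = [
    Problem(p.easy_option, p.hard_option, False) for p in with_constraint
]

acc_with = accuracy(always_conservative, with_constraint)
acc_without = accuracy(always_conservative, without_constraint)
print(f"with constraint: {acc_with:.2f}, without: {acc_without:.2f}")
```

The shortcut policy looks perfect on the constrained set and fails on the unconstrained twin, which is why scoring only the constrained version cannot distinguish constraint-evaluation from a conservative default.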

This connects to a broader corpus theme that instruction signals often don't teach what we think. Instruction tuning largely transfers knowledge of the *output format space*, not task understanding: models trained on deliberately wrong instructions perform about as well as those trained on correct ones (see: Does instruction tuning teach task understanding or output format?). And as reasoning ability scales up, instruction-adherence actually drops, because longer chains of thought create contextual distance from the original instruction (see: Why do better reasoning models ignore instructions?). Both findings imply that *positive* instruction-following is already a shallow signal, so it's no surprise that constraints need something sturdier than 'add it to the prompt and reward compliance.'

The most promising bridge in the corpus is decomposition: breaking subjective instruction-following into verifiable sub-criteria (checklists) gives reinforcement learning a signal it can actually grade, and reduces overfitting to superficial artifacts (see: Can breaking down instructions into checklists improve AI reward signals?). A negative constraint becomes trainable precisely when you can *verify* the violation rather than reward a vibe. Put together, the corpus's answer is: yes, negative constraints want a different signal, one built on suppression and verification rather than reward-toward-target. The failure mode isn't that constraints are hard to optimize; it's that they're easy to fake.
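The decomposition idea can be sketched as a reward built from independently checkable criteria. This is a minimal illustration under stated assumptions, not the RLCF or RaR method: the check names, the helper functions, and the equal weighting are all hypothetical, and the checks are simple string tests chosen so that the negative constraint ("does not mention price") is verified directly.

```python
from typing import Callable, List, Tuple

# A check is a (name, predicate) pair; the predicate returns True on pass.
Check = Tuple[str, Callable[[str], bool]]

def word_count_at_most(limit: int) -> Callable[[str], bool]:
    return lambda text: len(text.split()) <= limit

def does_not_mention(term: str) -> Callable[[str], bool]:
    # Negative constraint: passes only when the term is absent.
    return lambda text: term.lower() not in text.lower()

def ends_with(suffix: str) -> Callable[[str], bool]:
    return lambda text: text.rstrip().endswith(suffix)

def checklist_reward(response: str, checks: List[Check]) -> float:
    """Fraction of checks passed (equal weights are an assumption)."""
    if not checks:
        raise ValueError("checklist must contain at least one check")
    return sum(1 for _, check in checks if check(response)) / len(checks)

def failed_checks(response: str, checks: List[Check]) -> List[str]:
    """Names of the checks the response violates, for inspection."""
    return [name for name, check in checks if not check(response)]

CHECKS: List[Check] = [
    ("at most 20 words", word_count_at_most(20)),
    ("does not mention price", does_not_mention("price")),
    ("ends with a question", ends_with("?")),
]

good = "Would you like to hear about our new features?"
bad = "The price is five dollars."

print(checklist_reward(good, CHECKS), failed_checks(good, CHECKS))
print(checklist_reward(bad, CHECKS), failed_checks(bad, CHECKS))
```

Because each criterion is graded separately, a violation of the negative constraint shows up as a named failure rather than being averaged into a single holistic score.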

Sources 6 notes

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
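For reference, Pass@k is the probability that at least one of k sampled attempts is correct. The note does not say which estimator the study used; a common choice, assumed here, is the unbiased estimator 1 - C(n-c, k) / C(n, k), computed from n samples of which c are correct. The function name is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples, c of them correct.

    Equals 1 minus the probability that a random size-k subset of the
    n samples contains no correct sample.
    """
    if not (0 <= c <= n):
        raise ValueError("need 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 3 correct, evaluated at k = 1 and k = 5.
low_k = pass_at_k(10, 3, 1)
high_k = pass_at_k(10, 3, 5)
print(round(low_k, 4), round(high_k, 4))
```

Higher k rewards a model for having any correct answer somewhere in its distribution, which is why a method that concentrates probability mass can look fine at k = 1 and worse as k grows.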

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions. In a low-resource setting, instruction tuning reaches 43% exact match against 42.6% for a random baseline that only knows the output format. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Why do better reasoning models ignore instructions?

The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher re-testing empirical claims about constraint learning. The question: do negative constraints (prohibitions) and positive instructions require fundamentally different training signals, or are they learned through the same mechanism?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. Key constraints discovered:
• Negative-only RL (punishment without explicit reward) matches or exceeds full RL while preserving output diversity; positive reward narrows probability mass and collapses valid alternatives (2025-06, arXiv:2506.01347).
• RL post-training converges on a single dominant output format within epoch 1, regardless of actual performance (2025-04, arXiv:2504.07912).
• 12 of 14 models perform *worse* when constraints are removed—suggesting they rely on cheap heuristics (conservative bias) rather than learning the constraint itself (2026-03, arXiv:2603.29025).
• Instruction tuning teaches output-format distribution, not task understanding; models trained on wrong instructions perform comparably to those on correct ones (2023-05, arXiv:2305.11383).
• Scaling reasoning ability *decreases* instruction-adherence due to contextual distance from original instruction (2025-05, arXiv:2505.14810).
• Checklist-based reward decomposition enables verifiable sub-criteria grading, reducing overfitting to superficial artifacts (2025-07, arXiv:2507.18624).

Anchor papers (verify; mind their dates):
• arXiv:2506.01347 (2025-06): The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
• arXiv:2504.07912 (2025-04): Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
• arXiv:2305.11383 (2023-05): Do Models Really Learn to Follow Instructions?
• arXiv:2507.18624 (2025-07): Checklists Are Better Than Reward Models For Aligning Language Models

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above—especially the asymmetry between negative and positive signals, the convergence-in-epoch-1 effect, and the heuristic-hijacking trap—judge whether newer models (o1, o3, Grok, Claude 3.5), improved RL methods (PPO variants, DPO, IPO), better verification tooling (formal checkers, symbolic execution), or orchestration advances (multi-step verification loops, ensemble voting) have since relaxed or overturned these claims. Separate the durable question (do constraints need different signals?) from the perishable limitation (e.g., "RL converges too fast"). State plainly where each constraint still holds and what might have dissolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any paper shown that unified positive+negative signaling *does* work, or that constraint-learning isn't about signal asymmetry at all? Flag disagreements in the corpus itself.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Given that reasoning scaling breaks instruction-adherence, can constraint-verification be decoupled from forward-pass instruction-following?" or "If checklist decomposition works, does the granularity of verification (token-level, sentence-level, trajectory-level) matter more than the signal type?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
