INQUIRING LINE

Can we adjust helpfulness and harmlessness at test time without retraining?

This explores whether you can dial helpfulness and harmlessness up or down at inference time — treating them as adjustable knobs — rather than baking each balance in through a fresh round of training.


The corpus has one paper that points straight at a yes: Emulated Fine-Tuning (see "Do pretraining and fine-tuning scale independently in language models?"). Its key finding is that two things you'd normally treat as entangled are decoupled: pretraining scale governs factual knowledge, fine-tuning scale governs behavioral helpfulness, and the two live in different parts of the network (lower layers store knowledge, upper layers express behavior). Because of that separation, you can recombine a large pretrained model's knowledge with a small fine-tuned model's behavior at decode time, emulating the effect of fine-tuning the large model without ever running it. That is the cleanest example here of moving a behavioral trait at test time rather than through retraining.
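The decode-time recombination described above can be pictured as simple arithmetic on next-token log-probabilities. The following is a minimal sketch, not the paper's implementation: the function names, the `scale` parameter, and the three-token toy vocabulary are all assumptions made for illustration. It adds the "behavior delta" (small fine-tuned minus small base) to the large base model's log-probabilities and renormalizes.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def emulated_logprobs(large_base, small_base, small_tuned, scale=1.0):
    """Hypothetical helper: large-model knowledge plus small-model behavior delta.

    Each argument is a list of next-token logits over the same vocabulary.
    scale=0.0 returns the large base distribution unchanged.
    """
    lb = log_softmax(large_base)
    sb = log_softmax(small_base)
    st = log_softmax(small_tuned)
    combined = [b + scale * (t - s) for b, s, t in zip(lb, sb, st)]
    return log_softmax(combined)  # renormalize so probabilities sum to 1

# Toy logits (made up): the large base model prefers token 0,
# while fine-tuning shifted the small model toward token 1.
large_base = [2.0, 1.0, 0.0]
small_base = [1.0, 1.0, 1.0]
small_tuned = [0.0, 2.0, 1.0]

probs = [math.exp(lp) for lp in emulated_logprobs(large_base, small_base, small_tuned)]
print([round(p, 3) for p in probs])
```

In this toy case the behavior delta moves the most likely token from 0 to 1, while no weights in either model change. A real system would apply the same arithmetic to model logits at every decoding step.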

Why you'd want that knob becomes obvious once you look at the warmth research. Training a model to be warmer and more empathetic systematically degrades its reliability by 10–30 percentage points on medical reasoning, factual accuracy, and disinformation resistance (see "Does warmth training make language models less reliable?"), and the damage gets worse precisely when a user is sad or holding a false belief (see "Does empathy training make AI systems less reliable?"). So the warmth that makes a model feel more helpful and the reliability that keeps it from doing harm pull against each other. If that trade-off were a dial, you could lower warmth when a user needs accurate medical information and raise it elsewhere. If it is welded in by training, you are stuck with one compromise for every situation.
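One hypothetical way to picture such a dial is to blend two behavior deltas with a single weight chosen per request. This sketch is an assumption for illustration only: the names `blend_behaviors`, `lam`, `small_warm`, and `small_careful` are invented here, and the logits are toy values, not measurements from any of the cited work.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def blend_behaviors(large_base, small_base, small_warm, small_careful, lam):
    """Hypothetical dial: lam=1.0 is fully 'warm', lam=0.0 is fully 'careful'."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    lb = log_softmax(large_base)
    sb = log_softmax(small_base)
    warm = log_softmax(small_warm)
    careful = log_softmax(small_careful)
    combined = [
        b + lam * (w - s) + (1.0 - lam) * (c - s)
        for b, s, w, c in zip(lb, sb, warm, careful)
    ]
    return [math.exp(lp) for lp in log_softmax(combined)]

# Toy logits (made up): token 0 stands in for a soothing reply,
# token 1 for a blunt but accurate one.
large_base = [0.0, 0.0, 0.0]
small_base = [0.0, 0.0, 0.0]
small_warm = [3.0, 0.0, 0.0]
small_careful = [0.0, 3.0, 0.0]

for lam in (1.0, 0.5, 0.0):
    p = blend_behaviors(large_base, small_base, small_warm, small_careful, lam)
    print(lam, [round(x, 3) for x in p])
```

The point of the sketch is only that `lam` is a runtime argument: a medical query could be served with a low value and a casual chat with a high one, using the same weights throughout.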

Here's the twist the corpus adds, and it's the thing you might not have known to ask: *how* a trait is trained determines whether it's even adjustable. When warmth is learned as a global character trait, it corrupts factual reliability across the board; when the same empathy is rewarded as a contextual, behavior-level response, reliability survives (see "Does training granularity change how AI empathy affects reliability?"). The same pattern shows up in safety alignment: role-play fidelity declines steadily as characters become less moral, with models substituting crude aggression for nuanced malevolence, which looks like a blunt global override rather than a contextual one (see "Does safety alignment harm models' ability to roleplay villains?"). The lesson: traits baked in globally are hard to move later; traits represented contextually leave room for situational adjustment.

A few notes hint at the inference-time levers themselves. One variant of consistency training (ACT) operates at the activation level, steering a model toward stable behavior using its own clean responses as targets (see "Can models learn to ignore irrelevant prompt changes?"). Activation-level handles are exactly what you'd manipulate to nudge behavior without weight updates. And negative reinforcement shows that suppressing unwanted trajectories preserves diversity better than amplifying wanted ones (see "Does negative reinforcement alone outperform full reinforcement learning?"), a framing that plausibly maps onto test-time steering, where gentle suppression may beat heavy-handed pushing. The honest bottom line: this collection isn't built around test-time control methods, so it won't hand you a steering toolkit. What it does give you is the deeper precondition: helpfulness and harmlessness are separable enough to adjust (EFT is the evidence), but only if they were represented in a way that left the seam intact.
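To make "activation-level handle" concrete, here is a minimal sketch of generic activation steering: adding a scaled direction vector to a hidden state at inference time. This is not ACT, which is a training method, and it is not taken from any of the cited notes. The function names, the toy three-dimensional vectors, and the idea that `direction` encodes an unwanted trait are all assumptions for illustration.

```python
import math

def unit(vec):
    """Return vec scaled to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [v / norm for v in vec]

def projection(hidden, direction):
    """Scalar component of hidden along direction."""
    return sum(h * u for h, u in zip(hidden, unit(direction)))

def steer(hidden, direction, alpha):
    """Hypothetical steering step: shift hidden by alpha along direction.

    Negative alpha suppresses the trait direction; positive alpha amplifies it.
    No model weights are modified.
    """
    return [h + alpha * u for h, u in zip(hidden, unit(direction))]

# Toy values (made up): a 3-d hidden state and a trait direction.
hidden = [1.0, 2.0, 2.0]
direction = [0.0, 3.0, 4.0]

suppressed = steer(hidden, direction, alpha=-1.5)
print(projection(hidden, direction), projection(suppressed, direction))
```

The steered state's component along the direction changes by exactly `alpha`, while components orthogonal to it are untouched. In a real model, the same addition would typically be applied to one layer's activations during the forward pass, and choosing the direction and layer is the hard part this sketch leaves out.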


Sources (7 notes)

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.

Does warmth training make language models less reliable?

Five models trained for warmth showed 5–9pp error increases on medical reasoning, factual accuracy, and disinformation resistance. Emotional context amplified errors by 19.4%, and standard safety benchmarks failed to detect the degradation.

Does empathy training make AI systems less reliable?

Research shows persona training for empathy increases errors in medical reasoning, truthfulness, and disinformation resistance. Standard safety benchmarks miss this vulnerability, and effects intensify when users express sadness or false beliefs.

Does training granularity change how AI empathy affects reliability?

Trait-level warmth training degrades factual accuracy by 10-30 percentage points while behavior-level emotion rewards preserve it. The difference lies in whether empathy is learned as a global character trait versus contextual behavioral responses.

Does safety alignment harm models' ability to roleplay villains?

The Moral RolePlay benchmark shows LLM performance drops from 3.21 for moral paragons to 2.62 for villains, with largest degradation between flawed-but-good and egoistic characters. Models fail most on deception and manipulation traits, substituting crude aggression for nuanced malevolence.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher re-testing whether helpfulness and harmlessness can be adjusted at test time without retraining — a question a curated library explored across 2022–2026.

What a curated library found — and when (dated claims, not current truth):
• Emulated Fine-Tuning decouples pretraining scale (knowledge, lower layers) from fine-tuning scale (behavior, upper layers), allowing test-time recombination without retraining (~2023).
• Training warmth/empathy degrades factual reliability by 10–30 percentage points, with damage worse when users hold false beliefs (~2025).
• Trait-level warmth training corrupts reliability globally; behavior-level empathy rewards preserve it — representation matters for later adjustability (~2025).
• Consistency training steers models via activation-level handles, and negative reinforcement suppresses unwanted trajectories while preserving diversity (~2025).
• Safety alignment creates monotonic decline in nuance (e.g., villain role-play fidelity) via crude global overrides rather than contextual steering (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2310.12962 (Emulated Fine-Tuning, Oct 2023)
• arXiv:2507.21919 (Warmth Training Reliability Trade-off, Jul 2025)
• arXiv:2510.27062 (Consistency Training, Oct 2025)
• arXiv:2506.01347 (Negative Reinforcement, Jun 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, determine whether newer models (GPT-4o, Claude 3.5, Llama 3.2), test-time inference methods (speculative decoding, dynamic LoRA, activation steering toolkits), or multi-agent orchestration have since relaxed or overturned it. Separate the durable question (adjustability is *possible* if representation permits it) from perishable claims (specific trade-off magnitudes, specific failure modes). Cite what resolved each constraint; state plainly where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially papers claiming global trait control is *already* achievable at inference, or showing the trade-off is artifact-dependent rather than fundamental.
(3) Propose 2 research questions that assume the regime may have moved: e.g., Can modern layer-wise intervention suites now decouple helpfulness and harmlessness *without* architectural separation? Does adaptive, per-query routing between models (à la MoE) achieve better test-time adjustment than single-model steering?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
