Can utility control modify LLM values more effectively than output filtering?
Here 'utility control' means reshaping the reward or preference signal a model optimizes (changing what it values at the source), while 'output filtering' means screening or constraining what it produces after the fact. The question is which of the two actually changes behavior.
Put differently: do you change an LLM's behavior more durably by editing the objective it optimizes than by policing its outputs? The corpus doesn't frame this as a 'values' alignment debate directly, but it lines up surprisingly cleanly on one side. The recurring pattern is that source-level interventions reshape the distribution a model draws from, while output-level interventions only pick from a distribution that hasn't moved.
The clearest case for utility control is reward shaping. 'Can LLMs design reward functions for reinforcement learning?' shows that LLMs can author shaping rewards by first solving a simplified, deterministic version of a problem and converting that plan into a guidance signal for the original stochastic task. This is control exerted on what gets rewarded, not on what gets emitted. Relatedly, 'Can model confidence alone replace external answer verification?' folds the 'check' inward: instead of an external verifier filtering answers, the model's own token confidence becomes the reward signal, training the behavior in rather than catching it on the way out.
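To make the plan-to-shaping step concrete, here is a minimal sketch under stated assumptions. A breadth-first search stands in for whatever planner solves the simplified deterministic problem (in the source that role belongs to an LLM), and the plan is converted into rewards with standard potential-based shaping. The grid, the function names, and the choice of potential are all hypothetical illustrations, not the method described in the note.

```python
from collections import deque

def plan_distances(grid, goal):
    """Solve the simplified, deterministic problem: BFS steps-to-goal per free cell."""
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist

def shaping_reward(dist, state, next_state, gamma=0.99):
    """Potential-based shaping F = gamma * phi(s') - phi(s), with phi(s) = -distance.

    Assumes both states are reachable free cells (present in `dist`).
    """
    phi_s = -float(dist[state])
    phi_next = -float(dist[next_state])
    return gamma * phi_next - phi_s

# Hypothetical 3x3 grid: '#' is a wall, '.' is free, goal in the bottom-right corner.
GRID = ["..#",
        "...",
        "#.."]
GOAL = (2, 2)
DIST = plan_distances(GRID, GOAL)

toward = shaping_reward(DIST, (0, 0), (0, 1))  # one step closer to the goal
away = shaping_reward(DIST, (0, 1), (0, 0))    # one step farther from the goal
```

In this sketch a step that follows the plan earns a positive bonus and a step that leaves it earns a negative one. The bonus would be added to the original task reward during training, so the intervention acts on the learning signal rather than on any individual output.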
The contrast with filtering is sharpest in 'Does setting temperature to zero actually make LLM outputs reliable?': fixing the seed and clamping temperature to zero produces the same output every time, but that output is still one draw from an unchanged distribution. The result is consistency without reliability. That's the limit of acting at the output layer: you constrain the sample, not the thing producing it. 'Does preference tuning actually reduce the diversity of model outputs?' makes the inverse point. Preference tuning, a source-level intervention, doesn't just suppress bad outputs. When diversity is measured among outputs that pass a quality bar, it raises diversity by reshaping where the model's probability mass sits, so the variance that survives is coherent rather than noise.
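The difference between constraining the sample and moving the distribution can be shown with a toy example. The sketch below assumes three candidate outputs with made-up logits and a made-up preference signal. Greedy decoding models the zero-temperature limit, and adding `reward / beta` to each logit models a source-level shift (this is the form of the optimum for a KL-regularized reward objective, used here only as an illustration). All names and numbers are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, with the usual max-subtraction for stability."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Zero-temperature limit: always return the argmax of the given logits."""
    return max(range(len(logits)), key=lambda i: logits[i])

def tilt(logits, rewards, beta=1.0):
    """Source-level shift: move each logit by reward / beta."""
    return [logit + reward / beta for logit, reward in zip(logits, rewards)]

BASE_LOGITS = [2.0, 1.8, 0.5]  # hypothetical scores for three candidate outputs
REWARDS = [-1.0, 1.0, 0.0]     # hypothetical preference signal; output 0 is dispreferred

base_probs = softmax(BASE_LOGITS)
greedy_choice = greedy(BASE_LOGITS)         # identical on every call, distribution untouched

tilted_logits = tilt(BASE_LOGITS, REWARDS)
tilted_probs = softmax(tilted_logits)
tilted_choice = greedy(tilted_logits)       # the mode itself has moved

print("base probabilities:  ", [round(p, 3) for p in base_probs])
print("tilted probabilities:", [round(p, 3) for p in tilted_probs])
```

With these numbers greedy decoding always returns output 0, even though the base model puts less than half of its probability mass there. After the shift, the most likely output is a different one, which is the sense in which the source-level intervention changes the thing producing the sample.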
But there's a hard boundary worth knowing. 'What stops large language models from improving themselves?' argues that self-improvement is formally capped by a generation-verification gap: every reliable correction needs something external to validate and enforce it. So 'utility control' isn't a closed loop you can run inside the model forever; the reward signal, however cleverly internalized, ultimately traces back to an outside anchor. Utility control changes behavior more deeply than filtering, but it doesn't escape the need for an external check. It just moves that check from the output gate to the objective itself.
Sources (5 notes)
MEDIC shows that LLMs can generate effective reward shaping functions by first solving a deterministic, simplified version of the RL problem, then converting the resulting plan into shaping rewards for the original stochastic task. A model-based critic validates LLM outputs before deployment.
RLPR and INTUITOR successfully extend reinforcement learning for reasoning to general domains by using the model's own token probabilities and confidence levels as reward signals, eliminating the need for external verifiers or reference answers.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
When diversity is measured among quality-passing outputs rather than all outputs, preference-tuned models generate greater semantic diversity than base models. Base models appear more diverse only because their variance spans incoherent space.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.