How does optimizing model performance decouple from optimizing user interpretability?
This explores why chasing a performance number (accuracy, task success) doesn't automatically make a model's reasoning legible, trustworthy, or explainable to the person using it — and how the corpus shows these two goals can pull apart.
Optimizing a model to score well and optimizing it to be understandable by a human are two different jobs, and the corpus suggests they routinely come apart, sometimes invisibly. The cleanest demonstration is that two models can hit identical accuracy while having very different internal organization: one tidy, one fractured. The note "Can models be smart without organized internal structure?" shows that a model can carry all the linearly decodable features it needs for perfect task performance while its underlying representation is broken in ways no standard metric detects, until perturbation or distribution shift exposes it. Performance, in other words, is a lossy proxy for structure. You can max out the proxy and learn nothing about what the model is actually doing inside.
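A toy analog can make the "lossy proxy" point concrete. The sketch below is not the experiment from the cited note: it assumes two hand-written classifiers (`model_a`, `model_b`) and made-up data, where one model reads the real signal and the other reads a feature that only happens to correlate with the label in the evaluation set. All names and numbers are hypothetical.

```python
# Toy illustration (hypothetical data and models): identical in-distribution
# accuracy can hide very different behavior under distribution shift.

def model_a(x):
    # Relies on feature 0, which is the real signal in this toy setup.
    return 1 if x[0] > 0 else 0

def model_b(x):
    # Relies on feature 1, which merely correlates with the label in-distribution.
    return 1 if x[1] > 0 else 0

def accuracy(model, data):
    # Fraction of (features, label) pairs the model labels correctly.
    return sum(model(x) == y for x, y in data) / len(data)

# In-distribution: both features agree with the label.
in_dist = [((1.0, 1.0), 1), ((2.0, 0.5), 1), ((-1.0, -1.0), 0), ((-0.5, -2.0), 0)]
# Shifted: feature 1 is now anti-correlated with the label.
shifted = [((1.0, -1.0), 1), ((2.0, -0.5), 1), ((-1.0, 1.0), 0), ((-0.5, 2.0), 0)]

scores = {
    "a_in": accuracy(model_a, in_dist),
    "b_in": accuracy(model_b, in_dist),
    "a_shift": accuracy(model_a, shifted),
    "b_shift": accuracy(model_b, shifted),
}
print(scores)
```

On the in-distribution set the two models are indistinguishable by accuracy; only the shifted set separates them, which is the sense in which the score alone says little about what the model relies on.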
The decoupling gets sharper when the optimization target itself trades against interpretability. RLHF is the classic case: training the model to be preferred by humans degrades its calibration, so its expressed confidence stops tracking its real reliability. The note "Can model confidence work as a reward signal for reasoning?" frames this as a fixable tension: using the model's own answer-span confidence as a reward signal can restore calibration *while* improving reasoning. But the fact that calibration needs restoring at all tells you the default performance-optimization path quietly damaged a property the user depends on to know when to trust an answer. And calibration is load-bearing for interpretability: the note "Does model confidence predict robustness to prompt changes?" shows that confidence predicts robustness to prompt rephrasing, so a model whose confidence can be trusted is one whose behavior a user can actually anticipate.
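Calibration can be measured separately from accuracy, which is what lets the two drift apart unnoticed. As a minimal sketch, the code below assumes binned expected calibration error (ECE), a common metric that the notes themselves do not name, and compares two made-up models with the same accuracy. The function name, bin count, and data are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: per-bin |accuracy - mean confidence|, weighted by bin size."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(acc - mean_conf)
    return ece

# Hypothetical data: both models get 8 of 10 answers right.
correct = [True] * 8 + [False] * 2
calibrated_conf = [0.8] * 10       # states 80% confidence, is right 80% of the time
overconfident_conf = [0.99] * 10   # states 99% confidence, is right 80% of the time

ece_calibrated = expected_calibration_error(calibrated_conf, correct)
ece_overconfident = expected_calibration_error(overconfident_conf, correct)
print(ece_calibrated, ece_overconfident)
```

Both models would tie on an accuracy leaderboard, yet the second one's stated confidence overshoots its hit rate by roughly 19 points, which is the gap a user relying on that confidence would absorb.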
There's a second flavor of decoupling: the model optimizes for the wrong objective relative to what the user meant. The note "Why do language models lose performance in longer conversations?" finds that multi-turn conversations degrade not because the model got dumber, but because RLHF rewarded jumping to an answer over asking clarifying questions: high reward-model performance, poor alignment with the actual user's intent. The fix wasn't a smarter model but an architecture that explicitly parses intent first. The training optimized for an average rater, not for the person in the conversation.
The most direct evidence that interpretability is a *separate* engineering target comes from work that bolts it on deliberately. The note "Can LLMs explain recommenders by mimicking their internal states?" describes training an LLM to explain a recommender by aligning to both its outputs (behavior) and its internal embeddings (intention), and the key finding is that you need a *hybrid* of the two to get explanations that are simultaneously faithful to the real model and intelligible to a human. Faithfulness and intelligibility don't come free together; they're balanced objectives you have to design for. Similarly, the note "Does separating planning from execution improve reasoning accuracy?" shows that splitting planning from execution doesn't just raise accuracy. It also produces a transferable, inspectable decomposition step, making the reasoning legible as a side effect of the architecture rather than the loss function.
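The planning/execution split can be sketched in a few lines to show why the plan becomes an inspectable artifact. This is only an illustration under strong assumptions: `decompose`, `solve`, the `OPS` table, and the toy task are all hypothetical stand-ins, and the real systems in the cited note use separate models rather than hard-coded functions.

```python
from typing import Callable, Dict, List, Tuple

Plan = List[str]

def decompose(task: str) -> Plan:
    # Hypothetical stand-in for a decomposer model: a fixed plan for one toy task.
    if task == "total_with_tax":
        return ["sum_items", "add_tax", "round_cents"]
    raise ValueError(f"no plan for task: {task}")

# Hypothetical stand-in for a solver: one small operation per plan step.
OPS: Dict[str, Callable[[dict], float]] = {
    "sum_items": lambda s: sum(s["items"]),
    "add_tax": lambda s: s["value"] * (1 + s["tax_rate"]),
    "round_cents": lambda s: round(s["value"], 2),
}

def solve(plan: Plan, inputs: dict) -> Tuple[float, List[Tuple[str, float]]]:
    # Execute the plan step by step, recording each intermediate value.
    state = dict(inputs, value=0.0)
    trace = []
    for step in plan:
        state["value"] = OPS[step](state)
        trace.append((step, state["value"]))
    return state["value"], trace

plan = decompose("total_with_tax")
result, trace = solve(plan, {"items": [10.0, 6.0], "tax_rate": 0.5})
print(plan)
print(trace)
```

The design point is that `plan` and `trace` exist as separate, readable objects before and after execution, so a person can check the steps without looking inside the solver; nothing about the final number alone would provide that.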
The takeaway the reader might not expect: interpretability is not a weaker version of performance that you get for free as accuracy climbs. It's a separate property, sometimes actively eroded by the very training (RLHF, accuracy-maximization) that lifts the score, and the systems that have it built it on purpose, through confidence-aware rewards, intent parsing, surrogate alignment, or modular architectures. If you only watch the leaderboard number, you can be steadily losing the thing that lets a user know whether to believe the model.
Sources (6 notes)
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift in ways that standard evaluation metrics do not detect.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
LLMs degrade in multi-turn settings because RLHF training rewards premature answers over clarification-seeking, creating pragmatic mismatch with individual user behaviors. A Mediator-Assistant architecture that explicitly parses user intent before execution recovers lost performance without retraining.
RecExplainer trains LLMs via three alignment methods: behavior (mimicking outputs), intention (incorporating neural embeddings), and hybrid (combining both). The hybrid approach produces explanations that are simultaneously faithful to the target model and intelligible to users by balancing internal-state inspection with human-readable reasoning.
Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.